Context Vectors are Reflections of Word Vectors in Half the Dimensions

This paper takes a step towards theoretical analysis of the relationship between word embeddings and context embeddings in models such as word2vec. We start from basic probabilistic assumptions on the nature of word vectors, context vectors, and text generation. These assumptions are well supported either empirically or theoretically by the existing literature. Next, we show that under these assumptions the widely-used word-word PMI matrix is approximately a random symmetric Gaussian ensemble. This, in turn, implies that context vectors are reflections of word vectors in approximately half the dimensions. As a direct application of our result, we suggest a theoretically grounded way of tying weights in the SGNS model.


Introduction and Main Result
Today word embeddings play an important role in many natural language processing tasks, from predictive language models and machine translation to image annotation and question answering, where they are usually 'plugged in' to a larger model. An understanding of their properties is of interest as it may allow the development of better-performing embeddings and improved interpretability of models using them. This paper takes a step in this direction.

Notation: We let $\mathbb{R}$ denote the real numbers. Bold-faced lowercase letters ($\mathbf{x}$) denote vectors in Euclidean space, bold-faced uppercase letters ($\mathbf{X}$) denote matrices, plain-faced lowercase letters ($x$) denote scalars, and plain-faced uppercase letters ($X$) denote scalar random variables. $\|\cdot\|$ denotes the Euclidean norm: $\|\mathbf{x}\| := \sqrt{\mathbf{x}^\top\mathbf{x}}$. 'i.i.d.' stands for 'independent and identically distributed'. We use the sign $\sim$ to abbreviate the phrase 'distributed as', and the sign $\propto$ to abbreviate 'proportional to'. $\operatorname{Tr}(\mathbf{A})$ denotes the trace of a matrix $\mathbf{A}$; $M_{\mathbf{x}}(\mathbf{t}) := \mathbb{E}[e^{\mathbf{x}^\top\mathbf{t}}]$ is the moment-generating function of a random vector $\mathbf{x}$ at $\mathbf{t}$; $\odot$ is the Hadamard product (element-wise multiplication).

Assuming that words have already been converted into indices, let $\{1, \ldots, n\}$ be a finite vocabulary of words. Following the setup of the widely used word2vec model (Mikolov et al., 2013), we will use two vectors per word $i$:
• $\mathbf{w}_i$ when $i$ is a center word,
• $\mathbf{c}_i$ when $i$ is a context word.
We make the following key assumptions in our work.
Assumption 1. A priori, word vectors $\mathbf{w}_1, \ldots, \mathbf{w}_n \in \mathbb{R}^d$ are i.i.d. draws from an isotropic multivariate Gaussian distribution:
$$\mathbf{w}_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}\!\left(\mathbf{0}, \tfrac{1}{d}\mathbf{I}\right), \qquad (1)$$
where $\mathbf{I}$ is the $d \times d$ identity matrix.
This is motivated by the work of Arora et al. (2016), where the ensemble of word vectors consists of i.i.d. draws generated by $\mathbf{v} = s \cdot \hat{\mathbf{v}}$, with $\hat{\mathbf{v}}$ drawn from the spherical Gaussian distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$ and $s$ being a scalar random variable with bounded expectation and range. In their work, the norm $\|\mathbf{v}_i\|$ of the word vector for a word $i$ is related to its unigram probability $p(i)$, and to allow a sufficient dynamic range for these probabilities they needed the multiplier $s$. In our work, unigram probabilities are not mapped to vector lengths, which is why we do not need such a multiplier. A direct relationship between word probabilities and word vector norms is also implied by the model of Hashimoto et al. (2016).
Assumption 2. Context vectors $\mathbf{c}_1, \ldots, \mathbf{c}_n$ are related to word vectors according to
$$\mathbf{c}_i = \mathbf{Q}\mathbf{w}_i, \qquad (2)$$
where $\mathbf{Q}$ is an orthogonal $d \times d$ matrix.

This is mainly guided by the work of Press and Wolf (2017), who showed that context vectors in the SGNS model of Mikolov et al. (2013) are distributed similarly to word vectors, in the sense that pairwise cosine distances between word (input) embeddings strongly correlate with the corresponding pairwise cosine distances between context (output) embeddings (see their Table 4). This is why we choose the transform from word vectors to context vectors to be orthogonal: it preserves inner products and consequently Euclidean norms. Notice that $\mathbf{c}_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}\!\left(\mathbf{0}, \tfrac{1}{d}\mathbf{I}\right)$.

Assumption 3. Given a word $j$, the probability of any word $i$ being in its context is given by
$$p(i \mid j) \propto p_i\, e^{\mathbf{w}_j^\top \mathbf{c}_i}, \qquad (3)$$
where $p_i = p(i)$ is the unigram probability of the word $i$, which is inversely proportional to its smoothed frequency rank $r_i$, i.e.
$$p_i = \frac{r_i^{-(1-\alpha)}}{H_{n,\alpha}}, \qquad H_{n,\alpha} := \sum_{r=1}^{n} r^{-(1-\alpha)}. \qquad (4)$$
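To make Assumptions 1–3 concrete, the following minimal NumPy sketch samples word vectors, builds context vectors through an orthogonal map $\mathbf{Q}$, and evaluates the conditional probabilities of model (3). The vocabulary size, dimension, and smoothing exponent are illustrative choices, not values fixed by the paper.

```python
import numpy as np

# Minimal sketch of the generative setup in Assumptions 1-3 (illustrative n, d, alpha).
n, d, alpha = 10_000, 100, 0.25
rng = np.random.default_rng(0)

# Assumption 1: word vectors are i.i.d. draws from N(0, (1/d) I).
W = rng.normal(scale=1.0 / np.sqrt(d), size=(n, d))

# Assumption 2: context vectors are an orthogonal transform of word vectors.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # a random orthogonal matrix
C = W @ Q.T                                    # row i is c_i = Q w_i

# Assumption 3: smoothed unigram probabilities p_i proportional to r_i^{-(1-alpha)}.
ranks = np.arange(1, n + 1)
p = ranks ** (alpha - 1.0)
p /= p.sum()                                   # H_{n,alpha} is this normalizer

def context_prob(i, j):
    """p(i | j) = p_i * exp(w_j^T c_i) / Z_j, normalized over the whole vocabulary."""
    logits = np.log(p) + W[j] @ C.T
    probs = np.exp(logits - logits.max())
    return (probs / probs.sum())[i]

print(context_prob(5, 42))
```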
Assumption 3 is similar to the log-linear model of Arora et al. (2016), but differs in the following aspects: $\mathbf{c}_i$ is not assumed to do a random walk over the unit sphere with bounded displacement, and we use the factor $p_i$ to directly capture word frequencies rather than modeling them via vector norms. Equation (3) can be interpreted as follows: the probability that the word $i$ occurs in the context of the word $j$ is the probability that the word $i$ occurs anywhere in a large corpus, corrected for the relationship between the words $i$ and $j$. This approach was already considered by Melamud et al. (2017), but in their work $i$ is the entire left context of the word $j$, and $\mathbf{c}_i$ is a vector representation of this entire context. Also, like Arora et al. (2016) but unlike Melamud et al. (2017), we use the model (3) for a theoretical analysis rather than for fitting to data. Smoothing of the unigram probabilities (i.e., raising them to the power $1-\alpha$) is motivated by the works of Mikolov et al. (2013), Levy et al. (2015), and Pennington et al. (2014), where $\alpha = 0.25$ is a typical choice. We notice here that $\alpha \to 0^+$ gives us Zipf's law (Zipf, 1935), whereas $\alpha = 1$ gives us a uniform distribution of word frequencies, which is not valid empirically but, on the other hand, can be used to explain additivity of word vectors (Gittens, Achlioptas, & Mahoney, 2017). The specific value of $\alpha$ is important for …

The relationship between word (input) and context (output) vectors was addressed in several previous works. E.g., in recurrent neural network language modeling (RNNLM), tying input and output embeddings is a useful regularization technique introduced earlier (Bengio et al., 2001) and studied in more detail recently (Press & Wolf, 2017; Inan et al., 2017). This technique improves language modeling quality (measured as perplexity of a held-out text) while decreasing the total number of trainable parameters almost two-fold, since most of the parameters in RNNLM are due to embedding matrices. The direct application of this regularization technique to SGNS worsens the quality of word vectors, as was shown empirically by Press and Wolf (2017) and by Gulordava et al. (2018). This worsening was predicted earlier by Goldberg and Levy (2014) using a simple linguistic observation that words usually do not appear in the contexts of themselves. Direct tying of this kind basically means that $\mathbf{Q} = \mathbf{I}$ in (2). At the same time, there is empirical evidence that the relationship between input and output embeddings is linear (Mimno & Thompson, 2017; Gulordava et al., 2018). In this paper, we provide a theoretical justification for this and reveal the exact form of the transform $\mathbf{Q}$.

Our main contribution is the following theorem.

Theorem 1. Under Assumptions 1, 2, and 3 above, the context vector $\mathbf{c}_i$ for a word $i$ is the reflection of the word vector $\mathbf{w}_i$ in approximately half of the dimensions.
Figure 1 illustrates this idea for the case $d = 2$. In general, our word and context vectors live in a $d$-dimensional vector space over the real numbers ($\mathbb{R}^d$). By Theorem 1, we can settle them in a $d/2$-dimensional vector space over the complex numbers ($\mathbb{C}^{d/2}$) in such a way that the context vector $\mathbf{c}_i \in \mathbb{C}^{d/2}$ for a word $i$ is the complex conjugate of the word vector $\mathbf{w}_i \in \mathbb{C}^{d/2}$. This is in line with Theorem 2 of Allen et al. (2018); however, they use a completely different set of basic assumptions, and their primary goal is to encode statistical properties of words directly into word vectors.
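A quick numerical check of this complex-number view (a sketch under the theorem's conclusion, with an arbitrary even $d$): pack the two halves of $\mathbf{w}_i$ into the real and imaginary parts of a complex vector; the reflection $\mathbf{c}_i$ then corresponds to the complex conjugate, and the model's inner product $\mathbf{w}_j^\top\mathbf{c}_i$ is recovered as the real part of the unconjugated complex dot product.

```python
import numpy as np

d = 10                                   # illustrative even dimension
rng = np.random.default_rng(1)
w_i = rng.normal(scale=1/np.sqrt(d), size=d)
w_j = rng.normal(scale=1/np.sqrt(d), size=d)

# Reflection in the last d/2 coordinates (the form of Q given by Theorem 1).
S = np.concatenate([np.ones(d // 2), -np.ones(d // 2)])
c_i = S * w_i

# Complex packing: z = x + i*y, so that c_i corresponds to conj(z_i).
z_i = w_i[: d // 2] + 1j * w_i[d // 2:]
z_j = w_j[: d // 2] + 1j * w_j[d // 2:]

# w_j^T c_i = x_i^T x_j - y_i^T y_j = Re(z_i . z_j), dot product without conjugation.
assert np.isclose(w_j @ c_i, np.real(z_i @ z_j))
```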

Proof of Theorem 1
The proof is divided into three steps: first, we show that the partition function in (3) concentrates around 1, so that $\propto$ can be replaced by $\approx$; using this fact, we then show that $\mathbf{Q}$ is (approximately) an involutory matrix, i.e. similar to $\operatorname{diag}(\pm 1, \ldots, \pm 1)$; and finally we show that the word-word pointwise mutual information (PMI) matrix is approximately a symmetric Gaussian random matrix with weakly dependent entries. The latter fact immediately implies the statement of Theorem 1.

Concentration of the partition function
We first need the following auxiliary result.

Lemma 1. Let $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$. Then for all $t > 0$ and any orthogonal $\mathbf{Q}$,
$$\Pr\!\left(\left|\mathbf{w}^\top\mathbf{Q}\mathbf{w} - \sigma^2\operatorname{Tr}(\mathbf{Q})\right| \ge t\right) \le \frac{2d\sigma^4}{t^2}. \qquad (5)$$

Proof. Consider the random variable $X = \mathbf{w}^\top\mathbf{Q}\mathbf{w}$, and write $\mathbf{Q} = (q_{ij})$. Then
$$\mathbb{E}[X] = \sum_i q_{ii}\,\mathbb{E}[w_i^2] = \sigma^2\operatorname{Tr}(\mathbf{Q}). \qquad (6)$$
Further,
$$\mathbb{E}[X^2] = 3\sigma^4\sum_i q_{ii}^2 + \sigma^4\sum_{i\ne j} q_{ii}q_{jj} + \sigma^4\sum_{i\ne j}\left(q_{ij}^2 + q_{ij}q_{ji}\right), \qquad (7)$$
where we dropped the terms containing odd powers of $w_i$, as their expectations are zero. Hence,
$$\operatorname{Var}[X] = \mathbb{E}[X^2] - \left(\mathbb{E}[X]\right)^2. \qquad (8)$$
From (7) and (8) we have
$$\operatorname{Var}[X] = \sigma^4\left(\sum_i q_{ii}^2 + \sum_{i\ne j} q_{ij}^2\right) + \sigma^4\left(\sum_i q_{ii}^2 + \sum_{i\ne j} q_{ij}q_{ji}\right).$$
It is easy to see that $\sum_i q_{ii}^2 + \sum_{i\ne j} q_{ij}^2 = d$ (the sum of squared elements of an orthogonal matrix), while the second bracket equals $\operatorname{Tr}(\mathbf{Q}^2)$. In this way,
$$\operatorname{Var}[X] = \sigma^4\left(d + \operatorname{Tr}(\mathbf{Q}^2)\right).$$
Applying Chebyshev's inequality to $X$, and taking into account (6) and (8), we have
$$\Pr\!\left(\left|X - \sigma^2\operatorname{Tr}(\mathbf{Q})\right| \ge t\right) \le \frac{\sigma^4\left(d + \operatorname{Tr}(\mathbf{Q}^2)\right)}{t^2}.$$
Since $\mathbf{Q}^2$ is orthogonal, its trace does not exceed $d$, and we obtain (5).
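A small Monte Carlo check of Lemma 1 (a sketch with illustrative sizes and a random orthogonal $\mathbf{Q}$): the quadratic form $\mathbf{w}^\top\mathbf{Q}\mathbf{w}$ with $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$ should have mean $\sigma^2\operatorname{Tr}(\mathbf{Q})$ and variance $\sigma^4(d + \operatorname{Tr}(\mathbf{Q}^2)) \le 2d\sigma^4$.

```python
import numpy as np

d, trials = 200, 10_000                       # illustrative sizes
sigma2 = 1.0 / d                              # matches the paper's variance 1/d
rng = np.random.default_rng(2)

Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal matrix
W = rng.normal(scale=np.sqrt(sigma2), size=(trials, d))
X = ((W @ Q) * W).sum(axis=1)                 # X_t = w_t^T Q w_t for each trial

print("empirical mean:", X.mean(), " theory:", sigma2 * np.trace(Q))
print("empirical var :", X.var(),
      " theory:", sigma2**2 * (d + np.trace(Q @ Q)),
      " bound:", 2 * d * sigma2**2)
```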
Now we are ready to show that the partition function in (3) concentrates around $1 + \frac{1}{2d}$.
Lemma 2. Let $Z_j$ be the partition function in (3), i.e. $Z_j = \sum_{i=1}^n p_i\, e^{\mathbf{w}_j^\top \mathbf{c}_i}$. Then
$$Z_j \approx 1 + \frac{1}{2d}. \qquad (12)$$

Proof. We will first show that the conditional expectation $\mathbb{E}[Z_j \mid \mathbf{w}_j]$ depends on $\mathbf{w}_j$ mainly through its norm $\|\mathbf{w}_j\|$:
$$\mathbb{E}[Z_j \mid \mathbf{w}_j] = \sum_{i=1}^n p_i\, \mathbb{E}\!\left[e^{\mathbf{w}_j^\top \mathbf{c}_i} \mid \mathbf{w}_j\right] = p_j\, e^{\mathbf{w}_j^\top \mathbf{Q}\mathbf{w}_j} + \sum_{i\ne j} p_i\, M_{\mathbf{c}_i}(\mathbf{w}_j) = p_j\, e^{\mathbf{w}_j^\top \mathbf{Q}\mathbf{w}_j} + e^{\frac{\|\mathbf{w}_j\|^2}{2d}} \sum_{i\ne j} p_i, \qquad (13)$$
where $M_{\mathbf{c}_i}(\mathbf{w}_j) = e^{\frac{\|\mathbf{w}_j\|^2}{2d}}$ is the moment-generating function of $\mathbf{c}_i$ at $\mathbf{w}_j$. In Lemma 1 and Corollary 1.1 it is shown that $\mathbf{w}_j^\top\mathbf{Q}\mathbf{w}_j$ and $\|\mathbf{w}_j\|^2$ concentrate well around their means $\frac{1}{d}\operatorname{Tr}(\mathbf{Q})$ and $1$ respectively, and thus we can approximate
$$e^{\mathbf{w}_j^\top\mathbf{Q}\mathbf{w}_j} \approx e^{\frac{\operatorname{Tr}(\mathbf{Q})}{d}}, \qquad (14)$$
$$e^{\frac{\|\mathbf{w}_j\|^2}{2d}} \approx e^{\frac{1}{2d}}. \qquad (15)$$
The quantity $\frac{1}{2d}$ is small for $d \ge 50$, which is typical for the dimensionality of word vectors (Mikolov et al., 2013). Thus, using (14), (15), and the Maclaurin expansion of $x \mapsto e^x$ in the last term of (13), we obtain
$$\mathbb{E}[Z_j \mid \mathbf{w}_j] \approx p_j\, e^{\frac{\operatorname{Tr}(\mathbf{Q})}{d}} + (1 - p_j)\left(1 + \frac{1}{2d}\right). \qquad (16)$$
This approximation is very helpful, as the right-hand side does not contain $\mathbf{w}_j$, and thus it is an approximation for $\mathbb{E}[Z_j]$ as well. Let $H_{n,\alpha}$ be the normalizer in (4). Since $H_{n,\alpha} = \sum_{r=1}^n r^{-(1-\alpha)} \sim \frac{n^{\alpha}}{\alpha}$ as $n \to \infty$ (here we abuse notation and use '$\sim$' to denote asymptotic equivalence, i.e. $f \sim g$ means $f/g \to 1$), we have $p_j \le \frac{1}{H_{n,\alpha}} \to 0$; moreover, $|\operatorname{Tr}(\mathbf{Q})| \le d$, so the first term in (16) is at most $e \cdot p_j$ and vanishes. Combining this with (16) and (15), we get
$$\mathbb{E}[Z_j] \approx 1 + \frac{1}{2d}.$$

Now let us show that $\operatorname{Var}[Z_j]$ is small relative to the mean $\mathbb{E}[Z_j]$. First, we have
$$\operatorname{Var}[Z_j \mid \mathbf{w}_j] = \sum_{i} p_i^2 \operatorname{Var}\!\left[e^{\mathbf{w}_j^\top \mathbf{c}_i} \mid \mathbf{w}_j\right] + \sum_{i \ne k} p_i p_k \operatorname{Cov}\!\left[e^{\mathbf{w}_j^\top \mathbf{c}_i}, e^{\mathbf{w}_j^\top \mathbf{c}_k} \mid \mathbf{w}_j\right].$$
For the variance terms we have $\operatorname{Var}\!\left[e^{\mathbf{w}_j^\top \mathbf{c}_j} \mid \mathbf{w}_j\right] = \operatorname{Var}\!\left[e^{\mathbf{w}_j^\top \mathbf{Q}\mathbf{w}_j} \mid \mathbf{w}_j\right] = 0$, and, for $i \ne j$, $\operatorname{Var}\!\left[e^{\mathbf{w}_j^\top \mathbf{c}_i} \mid \mathbf{w}_j\right] = \mathbb{E}\!\left[e^{2\mathbf{w}_j^\top \mathbf{c}_i} \mid \mathbf{w}_j\right] - \mathbb{E}\!\left[e^{\mathbf{w}_j^\top \mathbf{c}_i} \mid \mathbf{w}_j\right]^2$.
Conditioned on $\mathbf{w}_j$, the random variables $\{e^{\mathbf{w}_j^\top \mathbf{c}_i}\}_{i \ne j}$ are independent, while $e^{\mathbf{w}_j^\top \mathbf{c}_j}$ is constant, and thus $\operatorname{Cov}\!\left[e^{\mathbf{w}_j^\top \mathbf{c}_i}, e^{\mathbf{w}_j^\top \mathbf{c}_k} \mid \mathbf{w}_j\right] = 0$ for $i \ne k$.
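The concentration claimed in Lemma 2 is easy to probe numerically. The sketch below (illustrative $n$, $d$, and $\alpha$; not the paper's experimental setup) draws several independent embedding ensembles and reports the spread of $Z_j = \sum_i p_i e^{\mathbf{w}_j^\top\mathbf{c}_i}$ against $1 + \frac{1}{2d}$.

```python
import numpy as np

n, d, alpha, runs = 20_000, 100, 0.25, 20   # illustrative values
rng = np.random.default_rng(3)

ranks = np.arange(1, n + 1)
p = ranks ** (alpha - 1.0)
p /= p.sum()

Z_samples = []
for _ in range(runs):
    W = rng.normal(scale=1/np.sqrt(d), size=(n, d))
    S = np.concatenate([np.ones(d // 2), -np.ones(d // 2)])  # one valid orthogonal Q
    C = W * S                                                # c_i = Q w_i
    j = 0                                                    # the most frequent word
    Z_samples.append(np.sum(p * np.exp(W[j] @ C.T)))

print("mean Z_j:", np.mean(Z_samples), " 1 + 1/(2d):", 1 + 1/(2*d))
print("std  Z_j:", np.std(Z_samples))
```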
Remark 2.1. Lemma 2 basically says that, under Assumptions 1 and 2, the model (3) self-normalizes, i.e. the normalization term is almost constant and, moreover, it is almost 1. This result is similar to the result of Andreas and Klein (2015), but differs in that our model (3) is not log-linear, as its condition ($j$) and prediction ($i$) are both parameterized. The result of Goldberger and Melamud (2018) on self-normalization of NCE language models is closer to ours, but the setup differs in that $p_i$ does not appear as a factor in their model. We finally notice that Lemma 2 is an analogue of Lemma 2.1 from Arora et al. (2016), but adapted to our setting.

Q is an involutory matrix
Lemma 3. Let $\mathbf{Q}$ be the matrix mapping word vectors to context vectors as in (2). Then, under Assumptions 1, 2, and 3, $\mathbf{Q}$ is approximately an involutory matrix.
Proof. The dimensionality $d$ of word vectors is usually in the range $[50, 1000]$ (Mikolov et al., 2013), and thus we can neglect the term $\frac{1}{2d}$ in (12) and approximate $Z_j \approx 1$. This means that the model (3) simplifies to
$$\log\frac{p(i, j)}{p_i\, p_j} \approx \mathbf{w}_j^\top \mathbf{c}_i, \qquad (26)$$
where $p(i, j)$ is the probability that the words $i$ and $j$ co-occur in the same context window. Notice that the left-hand side of (26) is the pointwise mutual information (PMI) between the words $i$ and $j$. From (2) and (26) we have
$$\mathbf{PMI} \approx \mathbf{W}\mathbf{Q}^\top\mathbf{W}^\top,$$
where $\mathbf{PMI}$ stands for the PMI matrix, and $\mathbf{W}$ is an $n \times d$ matrix whose $i$-th row is $\mathbf{w}_i$. Since $p(i, j) = p(j, i)$, we should have $\mathbf{PMI} = \mathbf{PMI}^\top$, which implies
$$\mathbf{Q}^\top \approx \mathbf{Q}, \qquad (27)$$
where we used the fact that $\mathbf{W}^\top\mathbf{W}$ is approximately proportional to the identity matrix ($\mathbf{W}^\top\mathbf{W} \approx \frac{n}{d}\mathbf{I}$). Since $\mathbf{Q}$ is assumed to be orthogonal, from (27) we get $\mathbf{Q}^2 \approx \mathbf{I}$. Thus, $\mathbf{Q}$ is approximately an involutory matrix, and we can choose it to be a signature matrix, i.e. a diagonal matrix with $\pm 1$ on the diagonal:³
$$\mathbf{Q} \approx \operatorname{diag}(\pm 1, \ldots, \pm 1). \qquad (28)$$
In this way, context vectors are word vectors with some of the coordinates multiplied by $-1$. The natural question is: how many of the coordinates should be "flipped"?

³ One can show that any involutory matrix can be represented as $\mathbf{P}\operatorname{diag}(\pm 1, \ldots, \pm 1)\mathbf{P}^\top$, where $\mathbf{P}$ is orthogonal, and thus by the reparametrization $\tilde{\mathbf{w}}_i = \mathbf{P}^\top\mathbf{w}_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\mathbf{0}, \tfrac{1}{d}\mathbf{I})$ we can still have (28).
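Since a signature matrix is both orthogonal and its own inverse, the reparametrization in footnote 3 is easy to verify numerically. The sketch below (with a hypothetical split of the dimensions) builds an arbitrary symmetric orthogonal $\mathbf{Q} = \mathbf{P}\operatorname{diag}(\pm 1)\mathbf{P}^\top$ and checks that, in the basis rotated by $\mathbf{P}^\top$, the word-to-context map reduces to coordinate flips.

```python
import numpy as np

d = 6
rng = np.random.default_rng(4)

# An arbitrary symmetric orthogonal (hence involutory) Q = P diag(signs) P^T.
P, _ = np.linalg.qr(rng.normal(size=(d, d)))
signs = np.array([1, 1, 1, -1, -1, -1])        # hypothetical split of the dimensions
Q = P @ np.diag(signs) @ P.T

assert np.allclose(Q @ Q, np.eye(d))           # involutory
assert np.allclose(Q, Q.T)                     # symmetric

# Reparametrize: in the basis rotated by P^T, the word->context map is a sign flip.
w = rng.normal(scale=1/np.sqrt(d), size=d)
c = Q @ w
w_tilde, c_tilde = P.T @ w, P.T @ c
assert np.allclose(c_tilde, signs * w_tilde)
```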

PMI as a random matrix
Let $\mathbf{x}_i \in \mathbb{R}^l$ be the vector consisting of the first $l$ coordinates of $\mathbf{w}_i$, i.e.
$$\mathbf{x}_i = \mathbf{w}_{i,1:l} = (w_{i1}, \ldots, w_{il})^\top, \qquad (29)$$
and let $\mathbf{y}_i \in \mathbb{R}^{d-l}$ be the vector consisting of the last $d - l$ coordinates of $\mathbf{w}_i$, i.e.
$$\mathbf{y}_i = \mathbf{w}_{i,l+1:d} = (w_{i,l+1}, \ldots, w_{id})^\top. \qquad (30)$$
Due to Assumption 1, the $\mathbf{x}_i$'s are i.i.d. draws from $\mathcal{N}(\mathbf{0}, \tfrac{1}{d}\mathbf{I}_{l \times l})$, the $\mathbf{y}_i$'s are i.i.d. draws from $\mathcal{N}(\mathbf{0}, \tfrac{1}{d}\mathbf{I}_{(d-l)\times(d-l)})$, and $\{\mathbf{x}_i\}$, $\{\mathbf{y}_i\}$ are jointly independent. Without restricting generality, assume that the first $l$ diagonal elements in (28) are equal to $+1$, and the remaining $d - l$ elements are equal to $-1$. Thus
$$\mathbf{PMI}_{ij} \approx \mathbf{w}_j^\top\mathbf{Q}\mathbf{w}_i = \mathbf{x}_i^\top\mathbf{x}_j - \mathbf{y}_i^\top\mathbf{y}_j.$$
For $i \ne j$, $\mathbf{x}_i^\top\mathbf{x}_j$ is a sum of $l$ i.i.d. random variables with mean $0$ and variance $\frac{1}{d^2}$, and by the Central Limit Theorem, $\mathbf{x}_i^\top\mathbf{x}_j \approx \mathcal{N}\!\left(0, \tfrac{l}{d^2}\right)$. Similarly, $\mathbf{y}_i^\top\mathbf{y}_j \approx \mathcal{N}\!\left(0, \tfrac{d-l}{d^2}\right)$, and thus
$$\mathbf{PMI}_{ij} \approx \mathcal{N}\!\left(0, \tfrac{1}{d}\right), \qquad i \ne j. \qquad (31)$$
For $i = j$, we have
$$\mathbf{PMI}_{ii} \approx \|\mathbf{x}_i\|^2 - \|\mathbf{y}_i\|^2 = \tfrac{1}{d}\left(\chi^2_l - \chi^2_{d-l}\right), \qquad (32)$$
where $\chi^2_l$ denotes a chi-square random variable with $l$ degrees of freedom. By a combinatorial argument (similar to that of Lemma 1) one can show that the covariance between any two distinct and non-symmetric entries of $\mathbf{W}\mathbf{Q}\mathbf{W}^\top$ is zero. Moreover, we can show that $\mathbf{PMI}_{ij}$ and $\mathbf{PMI}_{pq}$ tend to be independent when $d$ is large enough. For the case $i \ne p$, $j \ne q$ this follows directly from Assumption 1. Now consider the case $i = p$, $j \ne q$ (two distinct elements from the same row); the case $i \ne p$, $j = q$ (two distinct elements from the same column) can be analyzed similarly. Let $\mathbf{t} = (t_1, t_2)^\top$. Then the moment-generating function (m.g.f.) of $(\mathbf{PMI}_{ij}, \mathbf{PMI}_{iq})^\top$ at $\mathbf{t}$ can be computed directly, and for large $d$ it is approximately $e^{\frac{t_1^2 + t_2^2}{2d}}$, which is the m.g.f. of a two-dimensional Gaussian vector with distribution $\mathcal{N}\!\left(\mathbf{0}, \tfrac{1}{d}\mathbf{I}\right)$; this implies approximate independence between $\mathbf{PMI}_{ij}$ and $\mathbf{PMI}_{iq}$. Hence, from (31) and (32) we conclude that for the $\mathbf{PMI}$ matrix
• the above-diagonal entries have (approximately) the distribution $\mathcal{N}\!\left(0, \tfrac{1}{d}\right)$,
• the diagonal entries have (approximately) the distribution $\tfrac{1}{d}\left(\chi^2_l - \chi^2_{d-l}\right)$,
• all entries on and above the diagonal tend to pairwise independence.
This means that the $\mathbf{PMI}$ matrix is approximately a symmetric Gaussian random matrix with weakly dependent entries, and it is known that the empirical distribution of the eigenvalues of such a matrix approaches a distribution symmetric around $0$ as its size $n$ increases (de Monvel et al., 1999). Since the trace of $\mathbf{PMI}$ equals the sum of its eigenvalues, a symmetric eigenvalue distribution means the trace is approximately zero; thus, we should have
$$\operatorname{Tr}(\mathbf{PMI}) = \sum_{i=1}^n \mathbf{PMI}_{ii} \approx \frac{1}{d}\sum_{i=1}^n \left(\chi^2_{l,i} - \chi^2_{d-l,i}\right) \approx 0. \qquad (36)$$
Hence, taking the expectation on both sides of (36), we have
$$\frac{n\left(l - (d - l)\right)}{d} \approx 0 \quad\Longrightarrow\quad l \approx \frac{d}{2}, \qquad (37)$$
which concludes the proof of Theorem 1.
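A Monte Carlo sketch of this step (illustrative $n$ and $d$, not the paper's setup): form $\mathbf{M} = \mathbf{W}\mathbf{Q}\mathbf{W}^\top$ with $\mathbf{Q}$ a signature matrix flipping $d - l$ coordinates, and check that the off-diagonal entries have variance close to $1/d$, that the mean of the diagonal is close to $(2l - d)/d$, and hence that the spectrum is centered around $0$ only when $l \approx d/2$.

```python
import numpy as np

n, d = 2000, 100                               # illustrative sizes
rng = np.random.default_rng(5)
W = rng.normal(scale=1/np.sqrt(d), size=(n, d))

def signed_gram(l):
    """M = W Q W^T with Q = diag(+1 x l, -1 x (d-l))."""
    signs = np.concatenate([np.ones(l), -np.ones(d - l)])
    return (W * signs) @ W.T

for l in (d // 2, d):                          # balanced flip vs. no flip (Q = I)
    M = signed_gram(l)
    off = M[~np.eye(n, dtype=bool)]            # off-diagonal entries
    eig = np.linalg.eigvalsh(M)
    print(f"l={l:3d}  off-diag var={off.var():.4f} (1/d={1/d:.4f})  "
          f"mean diag={np.diag(M).mean():+.3f} ((2l-d)/d={(2*l-d)/d:+.1f})  "
          f"mean eig={eig.mean():+.3f}")
```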
Remark 3.1. In terms of the introduced notation (29) and (30), each word vector $\mathbf{w}_i$ splits into two subvectors $\mathbf{x}_i$ and $\mathbf{y}_i$, and due to Theorem 1, our model (3) for generating a word $i$ in the context of a word $j$ can be rewritten as
$$p(i \mid j) \propto p_i\, e^{\mathbf{x}_i^\top\mathbf{x}_j - \mathbf{y}_i^\top\mathbf{y}_j}.$$
Interestingly, the embeddings of the first type ($\mathbf{x}_i$ and $\mathbf{x}_j$) are responsible for pulling the word $i$ into the context of the word $j$, while the embeddings of the second type ($\mathbf{y}_i$ and $\mathbf{y}_j$) are responsible for pushing the word $i$ away from the context of the word $j$. We hypothesize that the $\mathbf{x}$-embeddings are more related to semantics, whereas the $\mathbf{y}$-embeddings are more related to syntax. We defer testing of this hypothesis to future work.

To verify that real-world PMI matrices indeed have a symmetric (around $0$) distribution of eigenvalues, we consider two widely used datasets, text8 and enwik9 (available at http://mattmahoney.net/dc/textdata.html; the enwik9 data was processed with the Perl script wikifil.pl provided on the same webpage, which filters Wikipedia XML dumps to "clean" text consisting only of lowercase letters and spaces, never consecutive), from which we extract PMI matrices using the hyperwords tool of Levy et al. (2015). We use the default settings for all hyperparameters, except the word frequency threshold and the context window size. We ignored words that appeared fewer than 100 and 150 times in text8 and enwik9 respectively, resulting in vocabularies of 11,815 and 29,145 words respectively. We additionally experiment with context window 5, which by default is set to 2 and which we believe could affect the results. The eigenvalues of the PMI matrices are then calculated using the TensorFlow library (Abadi et al., 2016); the above-mentioned threshold of 150 for enwik9 was chosen to fit the resulting PMI matrix into GPU memory (12 GB, NVIDIA Titan X Maxwell). The histograms of eigenvalues are provided in Figure 2. As we can see, the distributions are not perfectly symmetric, showing a slight right skew, but in general they appear symmetric. Notice that this is in stark contrast with equation (2.5) of Arora et al. (2016), which claims that the PMI matrix should be approximately positive semi-definite, i.e. that it should have mostly positive eigenvalues. Also notice that the shapes of the distributions are far from resembling the Wigner semicircle law $x \mapsto \frac{1}{2\pi}\sqrt{4 - x^2}$, which is the limiting distribution for the eigenvalues of many random symmetric matrices with i.i.d. entries (Wigner, 1955, 1958). This means that the entries of a typical PMI matrix are dependent; otherwise we would observe approximately semicircle distributions for its eigenvalues. Interestingly, there is a striking similarity between the shapes of the distributions in Figure 2 and the spectral densities of scale-free random graphs (Farkas et al., 2001) and random graphs with expected degrees (Preciado & Rahimian, 2017), which arise in physics and network science. Notice that the connection between human language structure and scale-free random graphs was observed previously by Cancho and Solé (2001), and it would be interesting to dig deeper in this direction.
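The eigenvalue histograms above were produced with the hyperwords tool; purely as an illustration, a self-contained approximation of the same computation (simple count-based PMI over a whitespace-tokenized corpus, a hypothetical file path, no subsampling or context-distribution smoothing, and much slower than hyperwords on full corpora) could look as follows.

```python
import numpy as np
from collections import Counter

def pmi_eigenvalues(tokens, window=5, min_count=100):
    """Build a dense word-word PMI matrix from a token list and return its eigenvalues.
    Non-co-occurring pairs get PMI = 0 for simplicity; intended for small corpora only."""
    freq = Counter(tokens)
    vocab = {w: i for i, (w, c) in enumerate(freq.items()) if c >= min_count}
    ids = [vocab[t] for t in tokens if t in vocab]
    n = len(vocab)

    counts = np.zeros((n, n))
    for pos, wi in enumerate(ids):
        for wj in ids[max(0, pos - window): pos]:   # look back over the window, count once
            counts[wi, wj] += 1
            counts[wj, wi] += 1

    total = counts.sum()
    p_w = counts.sum(axis=1) / total
    with np.errstate(divide='ignore', invalid='ignore'):
        pmi = np.log(counts / total) - np.log(np.outer(p_w, p_w))
    pmi[~np.isfinite(pmi)] = 0.0
    return np.linalg.eigvalsh(pmi)

# Hypothetical usage on a whitespace-tokenized corpus file:
# tokens = open("text8").read().split()
# eig = pmi_eigenvalues(tokens)
# A histogram of `eig` can then be inspected for symmetry around 0.
```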

Weight tying in skip-gram model
We would like to apply our results to tie embeddings in the skip-gram model of Mikolov et al. (2013) in a theoretically grounded way. One may argue that our key Assumption 3 differs from the softmax prediction of the skip-gram model. Although this is true, in fact the softmax normalization is never used in practice when training skip-gram. Instead it is common to replace the softmax cross-entropy with the negative sampling objective (equation (4) in Mikolov et al. (2013)), whose optimization is almost equivalent to finding a low-rank approximation of the shifted word-word PMI matrix in the form
$$\mathbf{w}_i^\top\mathbf{c}_j \approx \mathbf{PMI}_{ij} - \log k,$$
where $k$ is the number of negative samples.
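Concretely, Theorem 1 suggests tying the output (context) embeddings to the input (word) embeddings through a fixed signature matrix that flips half of the coordinates, rather than setting $\mathbf{c}_i = \mathbf{w}_i$. A minimal sketch of one negative-sampling update with this tying (hypothetical function and variable names, plain NumPy SGD; not the paper's implementation) is shown below.

```python
import numpy as np

def sgns_step_tied(W, center, context, negatives, lr=0.025):
    """One SGNS update with weight tying c_j = S w_j, where S flips the last d/2 coordinates.
    W: (n, d) trainable word-embedding matrix; center/context: word ids; negatives: list of ids."""
    n, d = W.shape
    S = np.concatenate([np.ones(d // 2), -np.ones(d // 2)])   # fixed signature matrix, not trained
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    w = W[center].copy()
    grad_w = np.zeros(d)
    for j, label in [(context, 1.0)] + [(k, 0.0) for k in negatives]:
        c = S * W[j]                          # tied context vector c_j = S w_j
        g = sigmoid(w @ c) - label            # d(loss)/d(score) for the logistic loss
        grad_w += g * c                       # accumulate gradient w.r.t. the center word
        W[j] -= lr * (S * (g * w))            # chain rule through c_j = S w_j
    W[center] -= lr * grad_w
    return W

# Hypothetical usage:
# rng = np.random.default_rng(0)
# W = rng.normal(scale=0.1, size=(10_000, 100))
# W = sgns_step_tied(W, center=5, context=17, negatives=[99, 1234, 777])
```

The only trainable parameters are the rows of W; the context embeddings are obtained on the fly by flipping signs, which halves the number of embedding parameters compared to standard SGNS.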
• For answering analogy questions ($a$ is to $b$ as $c$ is to ?) we use the 3CosMul method of Levy and Goldberg (2014a) (sketched below), and the evaluation metric for the analogy questions is the percentage of correct answers.
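For reference, a compact sketch of the 3CosMul rule of Levy and Goldberg (2014a); variable names and the vocabulary mapping are ours, and shifting cosines to $[0, 1]$ is one common way to keep the similarities non-negative, as the method requires.

```python
import numpy as np

def three_cos_mul(W, vocab, a, b, c, eps=1e-3):
    """Answer 'a is to b as c is to ?' with 3CosMul over row-normalized embeddings W.
    vocab maps word -> row index of W."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    ia, ib, ic = vocab[a], vocab[b], vocab[c]
    # Shift cosine similarities from [-1, 1] to [0, 1] so the multiplicative ratio is well defined.
    sim = lambda i: (Wn @ Wn[i] + 1.0) / 2.0
    scores = sim(ib) * sim(ic) / (sim(ia) + eps)
    scores[[ia, ib, ic]] = -np.inf          # exclude the question words themselves
    inv = {i: w for w, i in vocab.items()}
    return inv[int(np.argmax(scores))]
```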
The results of the evaluation are provided in Table 2. As we can see, SGNS + WT produces embeddings comparable in quality to those produced by the baseline SGNS model, despite having 50% fewer parameters. This also empirically validates the statement of our Theorem 1. We notice that similar results can be obtained by letting the linear transform $\mathbf{Q}$ be a trainable matrix, as shown by Gulordava et al. (2018). The main difference of our approach is that we know the form of $\mathbf{Q}$ exactly, and thus we do not need to learn it.

Conclusion
There is a remarkable relationship between human language and other branches of science, and we can obtain interesting and practical results by studying such relationships more deeply. For example, the modern theory of random matrices is replete with theoretical results that can be immediately applied to models of natural language once such models are cast into an appropriate probabilistic setting, as is done in this paper.

Figure 1: The context vector is a reflection of the word vector in half the coordinates.


Figure 2: Empirical distribution of eigenvalues of PMI matrices.

Table 1: Corpus statistics. T = total length in tokens; |W| = number of unique words.