Bilingual Distributed Word Representations from Document-Aligned Comparable Data

We propose a new model for learning bilingual word representations from non-parallel document-aligned data. Following the recent advances in word representation learning, our model learns dense real-valued word vectors, that is, bilingual word embeddings (BWEs). Unlike prior work on inducing BWEs, which relied heavily on parallel sentence-aligned corpora and/or readily available translation resources such as dictionaries, this article shows that BWEs may be learned solely on the basis of document-aligned comparable data, without any additional lexical resources or syntactic information. We compare our approach with previous state-of-the-art models for learning bilingual word representations from comparable data that rely on the framework of multilingual probabilistic topic modeling (MuPTM), as well as with distributional local context-counting models. We demonstrate the utility of the induced BWEs in two semantic tasks: (1) bilingual lexicon extraction, and (2) suggesting word translations in context for polysemous words. Our simple yet effective BWE-based models significantly outperform the MuPTM-based and context-counting representation models from comparable data as well as prior BWE-based models, and achieve the best reported results on both tasks for all three tested language pairs.


Introduction
A huge body of work in distributional semantics and word representation learning almost exclusively revolves around the distributional hypothesis (Harris, 1954), an idea which states that similar words occur in similar contexts. All current corpus-based approaches to semantics rely on the contextual evidence in one way or another. Roughly speaking, word representations are typically learned using these two families of distributional context-based models: (1) global matrix factorization models such as latent semantic analysis (LSA) (Landauer & Dumais, 1997) or generative probabilistic models such as latent Dirichlet allocation (LDA) (Blei, Ng, & Jordan, 2003), which model the word co-occurrence at the document or paragraph level; or (2) local context window models that represent words as sparse high-dimensional context vectors, and model the word co-occurrence at the level of selected neighboring words (Turney & Pantel, 2010), or generative probabilistic models that learn the probability distribution of a vocabulary word in the context window as a latent variable (Deschacht & Moens, 2009; Deschacht, De Belder, & Moens, 2012).
A natural extension of interest from monolingual to multilingual word embeddings has occurred recently (e.g., Klementiev, Titov, & Bhattarai, 2012; Hermann & Blunsom, 2014b). When operating in multilingual settings, it is highly desirable to learn embeddings such that words denoting similar concepts lie very close together in the shared bilingual embedding space (e.g., the representations for the English word school and the Spanish word escuela should be very similar). These BWEs may then be used in a myriad of multilingual natural language processing tasks and beyond, including fundamental tasks that lean on such bilingual meaning representations, e.g., computing cross-lingual and multilingual semantic word similarity and extracting bilingual word lexicons using the induced bilingual embedding space (see Figure 1). However, all these models critically require (at least) sentence-aligned parallel data and/or readily available translation dictionaries to induce bilingual word embeddings (BWEs) that are consistent and closely aligned over different languages.

Contributions. To the best of our knowledge, this article presents the first work to showcase that bilingual word embeddings may be induced directly on the basis of comparable data without any additional bilingual resources such as sentence-aligned parallel data or translation dictionaries. The focus is on document-aligned comparable corpora (e.g., Wikipedia articles aligned through inter-wiki links, news texts discussing the same theme).
Our new bilingual distributed representation learning model makes use of pseudo-bilingual documents constructed by merging the content of two coupled documents from a document pair. We propose and evaluate two different strategies for constructing such pseudo-bilingual documents: (1) the merge and randomly shuffle strategy, which randomly permutes words from both languages in each pseudo-bilingual document, and (2) the length-ratio shuffle strategy, a deterministic method that retains monolingual word order while intermingling the words cross-lingually. These additional pre-training shuffling strategies ensure that both source language words and target language words occur in the contexts of each source and target language word. A monolingual model such as skip-gram with negative sampling (SGNS) from the word2vec package (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013c) is then trained on these "shuffled" pseudo-bilingual documents. By this procedure, we steer semantically similar words from different languages towards similar representations in the shared bilingual embedding space, and effectively use available bilingual contexts instead of monolingual ones. The model treats documents as bags-of-words (i.e., it does not include any syntactic information) and does not even rely on any sentence boundary information.
In summary, the main contributions of this article are:

Related Work
In this section we further motivate why we opt for building a model for inducing bilingual word embeddings from comparable document-aligned data. For a clearer overview, we have split related work into three broad clusters: (1) monolingual word embeddings, (2) bilingual word embeddings, and (3) bilingual word representations from document-aligned data.

Monolingual Word Embeddings
The idea of representing words as continuous real-valued vectors dates back to the mid-1980s (Rumelhart, Hinton, & Williams, 1986; Elman, 1990). The idea met its resurgence a decade ago (Bengio et al., 2003), where a neural language model learns word embeddings as part of a neural network architecture for statistical language modeling. This work inspired other approaches that learn word embeddings within the neural-network language modeling framework (Collobert & Weston, 2008; Collobert, Weston, Bottou, Karlen, Kavukcuoglu, & Kuksa, 2011). Word embeddings are tailored to capture semantics and encode a continuous notion of semantic similarity (as opposed to semantically poorer discrete representations), necessary to share information between words and other text units.

Figure 1: A toy 3D shared bilingual embedding space from Gouws et al. (2015), shown in two panels (monolingual vs. bilingual). While in monolingual spaces words with similar meanings should have similar representations, in bilingual spaces words in two different languages with similar meanings should have similar representations (both mono- and cross-lingually).
Recently, the skip-gram and continuous bag-of-words (CBOW) models from Mikolov et al. (2013a, 2013c) revealed that the full neural-network structure is not needed at all to learn high-quality word embeddings (with extremely decreased training times compared to the full-fledged neural network models, see Mikolov et al.'s (2013a) work for the full analysis of complexity of the models). These models are in fact simple single-layered architectures, where the objective is to predict a word's context given the word itself (skip-gram) or predict a word given its context (CBOW). Similar models called vector log-bilinear models were recently proposed (Mnih & Kavukcuoglu, 2013). Other models inspired by skip-gram and CBOW are GloVe (Global Vectors for Word Representation) (Pennington et al., 2014), which combines local and global contexts of a word into a unified model, and a model which relies on dependency-based contexts instead of simpler word-based contexts (Levy & Goldberg, 2014a), and new models are steadily emerging (e.g., Lebret & Collobert, 2014; Lu, Wang, Bansal, Gimpel, & Livescu, 2015; Stratos, Collins, & Hsu, 2015; Trask, Gilmore, & Russell, 2015; Liu, Jiang, Wei, Ling, & Hu, 2015).
An interesting finding has been discussed recently (Levy & Goldberg, 2014b): the popular skip-gram model with negative sampling (SGNS) (Goldberg & Levy, 2014) is simply a model which implicitly factorizes a word-context matrix, with its cells containing pointwise mutual information (PMI) scores of the respective word and context pairs, shifted by a global constant. In other words, SGNS does essentially the same thing as traditional distributional models (i.e., context counting plus context weighting and/or dimensionality reduction), with a slight improvement in performance with SGNS (Baroni et al., 2014; Levy et al., 2015).
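For reference, the factorization result of Levy and Goldberg (2014b) can be stated compactly. With $k$ negative samples per observed pair, and assuming the embedding dimensionality is large enough to allow exact reconstruction, the SGNS objective is optimized when each pivot-context dot product equals the PMI score shifted by $\log k$. The count notation $\#(\cdot)$ and the symbol $|D|$ for the number of observed pairs are introduced here only for illustration:

```latex
\[
\mathbf{w} \cdot \mathbf{c} \;=\; \mathrm{PMI}(w, c) - \log k,
\qquad
\mathrm{PMI}(w, c) \;=\; \log \frac{\#(w, c) \cdot |D|}{\#(w) \cdot \#(c)}
\]
```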
All these low-dimensional vectors, besides improving computational efficiency, lead to better generalizations, even allowing generalization over the vocabularies observed in labelled data, and hence partially alleviating the ubiquitous problem of data sparsity. Their utility has been validated in various semantic tasks such as semantic word similarity, synonymy detection or word analogy solving (Mikolov et al., 2013d; Baroni et al., 2014; Pennington et al., 2014). Moreover, word embeddings have been shown to serve as useful unsupervised features for plenty of downstream NLP tasks such as named entity recognition, chunking, semantic role labeling, part-of-speech tagging, parsing, and selectional preferences (Turian, Ratinov, & Bengio, 2010; Collobert et al., 2011; Chen & Manning, 2014).
Due to its simplicity, as well as its efficacy and consequent popularity in various tasks (Mikolov et al., 2013c; Levy & Goldberg, 2014b), with a clear advantage on similarity tasks when compared to traditional models from distributional semantics (Levy et al., 2015), in this article we focus on the adaptation of SGNS (Mikolov et al., 2013c). In Section 3, we provide a very brief overview of the model, and then follow up with our new bilingual model which is based on SGNS.
The current research on inducing BWEs critically relies on sentence-aligned parallel data or readily available bilingual lexicons to achieve the coherence of representations across languages (e.g., to build similar representations for similar concepts in different languages such as January-januari, dog-hund or sky-hemel). We may cluster the current work into three different groups: (1) the models that rely on hard word alignments obtained from parallel data to constrain the learning of BWEs (Klementiev et al., 2012; Zou et al., 2013; Wu et al., 2014); (2) the models that use the alignment of parallel data at the sentence level (Kočiský, Hermann, & Blunsom, 2014; Hermann & Blunsom, 2014a, 2014b; Chandar et al., 2014; Shi, Liu, Liu, & Sun, 2015; Gouws et al., 2015); (3) the models that critically require readily available bilingual lexicons (Mikolov et al., 2013b; Faruqui & Dyer, 2014; Xiao & Guo, 2014). The main disadvantage of all these models is the limited availability of parallel data and bilingual lexicons, resources which are scarce and/or domain-restricted for plenty of language pairs. In this work, we significantly alleviate the requirements: unlike prior work, we show that BWEs may be induced solely on the basis of document-aligned comparable data without any additional need for parallel data or bilingual lexicons. Note that (in theory) the work from Hermann and Blunsom (2014b) and Chandar et al. (2014) may also be extended to the same setting with document-aligned data, as these two models originally rely on sentence embeddings computed as aggregations over their single word embeddings plus sentence alignments. In this work, by testing and comparing to the BiCVM model from Hermann and Blunsom (2014b), we show that these models do not work well in practice after replacing the very strong bilingual signal coded in parallel sentences with the noisy bilingual signal given by document alignments and non-parallel data.
Words in this setting are represented as real-valued vectors with conditional topic probability scores $P(z_k|w_i)$, regardless of their actual language. Topics $z_k$ are in fact latent inter-lingual concepts discovered directly from multilingual comparable data using a multilingual topic model such as bilingual LDA. We discuss the MuPTM-based representations in more detail in Section 4.1.
MuPTM-based bilingual word representations induced from comparable data have demonstrated their utility in tasks such as cross-lingual semantic similarity computation and bilingual lexicon extraction (Vulić, De Smet, & Moens, 2011; Liu, Duh, & Matsumoto, 2013) and suggesting word translations in context (Vulić & Moens, 2014). In this work, we compare the state-of-the-art MuPTM-based word representations induced from the same type of comparable corpora with BWEs learned by our new model in these two semantic tasks.
Another recent model (Søgaard, Agić, Martínez Alonso, Plank, Bohnet, & Johannsen, 2015) is also able to learn from document-aligned data. It is a count-based model which builds binary word vectors denoting the occurrence of each word in each document pair. Dimensionality reduction is then applied post-hoc on the induced sparse vectors. Since the links between documents are known, the model is able to learn cross-lingual correspondences between words and, consequently, bilingual word representations. Exactly the same idea was already introduced as a baseline model by Vulić et al. (2011), where TF-IDF weights were used instead of binary indices, and no dimensionality reduction was applied post-hoc. The model from Vulić et al. (2011) was surpassed by baseline models from document-aligned data briefly discussed in Section 4.1, while the model from Søgaard et al. (2015) obtains results that are very similar to the BWE baselines compared against in this work (described in Section 4.2).

BWESG: Model Architecture
Our new bilingual model is an extension of SGNS to bilingual settings with document-aligned comparable training data. This section describes the underlying SGNS and two variants of our SGNS-based BWE induction model.

Skip-Gram with Negative Sampling (SGNS)
Our departure point is the log-linear SGNS from Mikolov et al. (2013c) as implemented in the word2vec package.¹ The SGNS model learns word embeddings (WEs) in a similar way to neural language models (Bengio et al., 2003; Collobert & Weston, 2008), but without a non-linear hidden layer.
In the monolingual setting, we assume one language $L$ with vocabulary $V$, and a corpus of words $w \in V$, along with their contexts $c \in V_c$, where $V_c$ is the context vocabulary. Contexts for each word $w_n$ are typically neighboring words in a context window of size $cs$ (i.e., $w_{n-cs}, \ldots, w_{n-1}, w_{n+1}, \ldots, w_{n+cs}$), so effectively it holds $V_c \equiv V$.² Each word type $w \in V$ is associated with a vector $\mathbf{w} \in \mathbb{R}^d$ (its pivot word representation or pivot word embedding, see Figure 2), and a vector $\mathbf{w}_c \in \mathbb{R}^d$ (its context embedding). $d$ is the dimensionality of the WE vectors, which, as a model input parameter, has to be set before the training procedure commences. The entries in these vectors are latent, and treated as parameters $\theta$ to be learned by the model. In short, the idea of the skip-gram model is to scan through the corpus (which is typically unannotated, Mikolov et al., 2013a) word by word in turn (i.e., these are the pivot words), and learn from the pairs (word, context word). The learning goal is to maximize the ability of predicting context words for each pivot word in the corpus. Let $ob = 1$ denote that the pair of words $(w, v)$ is observed in the corpus and thus belongs to the training set $D$. The probability of $(w, v) \in D$ is defined by the sigmoid function:

\[ P(ob = 1 \mid w, v, \theta) = \frac{1}{1 + \exp(-\mathbf{w} \cdot \mathbf{v}_c)} \tag{1} \]

Each word token $w$ in the corpus is treated in turn as the pivot and all pairs of word tokens $(w, w_{\pm 1}), \ldots, (w, w_{\pm t(cs)})$ are appended to $D$, where $t(cs)$ is an integer sampled from a uniform distribution on $\{1, \ldots, cs\}$.³ The global training objective $J$ is then to maximize the probabilities that all pairs from $D$ are indeed observed in the corpus:

\[ J = \arg\max_{\theta} \sum_{(w,v) \in D} \log \frac{1}{1 + \exp(-\mathbf{w} \cdot \mathbf{v}_c)} \tag{2} \]

where $\theta$ are the parameters of the model, that is, pivot and context word embeddings which have to be learned. One may see that this objective function has a trivial solution by setting $\mathbf{w} = \mathbf{v}_c$, and $\mathbf{w} \cdot \mathbf{v}_c = Val$, where $Val$ is a large enough number (Goldberg & Levy, 2014). In order to prevent this trivial training scenario, the negative sampling procedure comes into the picture (Collobert & Weston, 2008; Mikolov et al., 2013c).
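The construction of the training set $D$ with a dynamically sampled window can be sketched as follows. This is a minimal illustration rather than the word2vec implementation: the function name `extract_pairs` and the seeded random generator are assumptions made here, and details such as subsampling of frequent words are omitted.

```python
import random

def extract_pairs(tokens, cs, rng=None):
    """Collect (pivot, context) pairs using a window size resampled per pivot."""
    rng = rng or random.Random(0)  # fixed seed only to make the sketch repeatable
    pairs = []
    for n, pivot in enumerate(tokens):
        t = rng.randint(1, cs)  # t(cs): uniform on {1, ..., cs}
        lo, hi = max(0, n - t), min(len(tokens), n + t + 1)
        for m in range(lo, hi):
            if m != n:  # skip the pivot itself
                pairs.append((pivot, tokens[m]))
    return pairs

pairs = extract_pairs(["the", "ring", "of", "power"], cs=2)
```

Because the window is resampled for every pivot, nearer neighbors are included more often than distant ones, which implicitly weights close contexts more heavily.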
In short, the idea behind negative sampling is to present the model with a set $D'$ of artificially created or sampled "negative pivot-context" word pairs $(w, v')$, which by assumption serve as negative examples, that is, they do not occur as observed/positive (word, context) pairs in the training corpus. The model then has to adjust the parameters $\theta$ in such a way as to also maximize the probability that these negative pairs will not occur in the corpus. While the interested reader may find further details about the negative sampling procedure, and the new exact objective function along with its derivation elsewhere (Levy & Goldberg, 2014b), for illustrative purposes and simplicity, here we present the approximative objective function with negative sampling by Goldberg and Levy (2014):

\[ J = \arg\max_{\theta} \sum_{(w,v) \in D} \log \frac{1}{1 + \exp(-\mathbf{w} \cdot \mathbf{v}_c)} + \sum_{(w,v') \in D'} \log \frac{1}{1 + \exp(\mathbf{w} \cdot \mathbf{v}'_c)} \tag{3} \]

The free parameters $\theta$ are updated using stochastic gradient descent and backpropagation, with the learning rate typically controlled by Adagrad (Duchi, Hazan, & Singer, 2011) or with a global linearly decreasing learning rate. By optimizing the objective from eq. (3), the model incrementally pushes observed pivot WEs towards context WEs of their collocates in the corpus. In the words of the distributional hypothesis: after training, words that occur in similar contexts should end up having similar word embeddings. In other words, to link the terminology of the distributional hypothesis and the modeling assumptions of SGNS: words that predict similar contexts end up having similar word embeddings.
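To make the negative-sampling objective concrete, the sketch below evaluates it on a toy example and applies plain stochastic gradient ascent with a fixed learning rate. It is an illustration under stated assumptions, not the word2vec code: the names `sgns_objective` and `sgd_step`, the toy vectors, the hand-picked negative pairs, and the fixed learning rate are all hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sgns_objective(pivot, ctx, positives, negatives):
    """Log-likelihood of observed pairs D plus that of sampled negative pairs D'."""
    total = 0.0
    for w, v in positives:
        total += math.log(sigmoid(dot(pivot[w], ctx[v])))
    for w, v in negatives:
        total += math.log(sigmoid(-dot(pivot[w], ctx[v])))
    return total

def sgd_step(pivot, ctx, w, v, label, lr=0.1):
    """One gradient-ascent update; label is 1 for an observed pair, 0 for a negative one."""
    wv, cv = pivot[w], ctx[v]
    g = lr * (label - sigmoid(dot(wv, cv)))
    pivot[w] = [a + g * b for a, b in zip(wv, cv)]
    ctx[v] = [b + g * a for a, b in zip(wv, cv)]

# Toy 2-dimensional pivot and context embeddings (hypothetical values).
pivot = {"school": [0.1, -0.2], "escuela": [0.0, 0.3]}
ctx = {"school": [0.2, 0.1], "escuela": [-0.1, 0.2], "orcs": [0.3, 0.3]}
D = [("school", "escuela"), ("escuela", "school")]
D_neg = [("school", "orcs"), ("escuela", "orcs")]

before = sgns_objective(pivot, ctx, D, D_neg)
for _ in range(50):
    for w, v in D:
        sgd_step(pivot, ctx, w, v, 1)
    for w, v in D_neg:
        sgd_step(pivot, ctx, w, v, 0)
after = sgns_objective(pivot, ctx, D, D_neg)
```

The single update rule covers both cases because the gradient of the log-sigmoid term is `(label - sigmoid(w . v_c))` times the other vector, which pulls observed pairs together and pushes negative pairs apart.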

Final Model -BWESG: BWE Skip-Gram
In the next step, we propose a novel method that extends SGNS to work with bilingual document-aligned comparable data. Let us assume that we possess a document-aligned comparable corpus, defined as $C = \{d_1, d_2, \ldots, d_N\} = \{(d^S_1, d^T_1), (d^S_2, d^T_2), \ldots, (d^S_N, d^T_N)\}$, where $d_j = (d^S_j, d^T_j)$ denotes a pair of aligned documents in the source language $L^S$ and the target language $L^T$ respectively, and $N$ is the number of pairs in the corpus. $V^S$ and $V^T$ are vocabularies associated with languages $L^S$ and $L^T$. The goal is to learn a shared bilingual embedding space given the data (Figure 1) and document alignments as the only bilingual signal during training. We present two strategies that, coupled with SGNS, lead to such shared bilingual spaces. An overview of the architecture for learning BWEs from document-aligned comparable data with the two strategies is given in Figures 2(a) and 2(b).
Figure 2: The architecture for learning BWEs from document-aligned comparable data with the two strategies: (1) non-deterministic merge and shuffle, (2) deterministic length-ratio shuffle. Source language words and documents are drawn as gray boxes, while target language words and documents are drawn as blue boxes. The right side of the figures (separated by vertical dashed lines) illustrates how a pseudo-bilingual document is constructed from a pair of two aligned documents.

(1) Merge and Shuffle. In the first step, we merge two documents $d^S_j$ and $d^T_j$ from the aligned document pair $d_j$ into a single "pseudo-bilingual" document $d'_j$. Following that, we randomly shuffle the newly constructed pseudo-bilingual document. A shuffle is a (random) permutation of the word tokens given in two different languages forming the pseudo-bilingual document. The pre-training shuffling step (see Figure 2(a)) ensures that each word $w$, regardless of its actual language, obtains word collocates from both vocabularies. The idea of obtaining bilingual contexts for each pivot word in each pseudo-bilingual document will steer the final model towards constructing a shared bilingual space. Since the model depends on the alignment at the document level, in order to ensure the bilingual contexts instead of monolingual contexts, it is intuitive to assume that larger window sizes will lead to better bilingual embeddings. We test this hypothesis and the effect of window size in Section 7.3. In another interpretation, since the model relies only on (pseudo-bilingual) document level co-occurrence, the window size parameter then just controls the amount of random data dropout, that is, the number of positive document-level training examples. The locality feature of SGNS is not preserved due to the shuffling procedure.
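The merge and shuffle strategy described above can be sketched in a few lines. The function name `merge_and_shuffle` and the optional `seed` argument are assumptions introduced for illustration; documents are taken to be plain lists of word tokens.

```python
import random

def merge_and_shuffle(src_doc, tgt_doc, seed=None):
    """Build a pseudo-bilingual document as a random permutation of all tokens."""
    merged = list(src_doc) + list(tgt_doc)  # copy so the inputs stay untouched
    random.Random(seed).shuffle(merged)     # in-place random permutation
    return merged

pseudo = merge_and_shuffle(
    ["Frodo", "Sam", "orcs", "goblins", "Mordor", "ring"],
    ["anillo", "orcos", "mago"],
    seed=42,
)
```

Passing a seed makes a particular shuffle reproducible, which is convenient when comparing runs; leaving it as `None` gives a different permutation each time.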
(2) Length-Ratio Shuffle. The non-deterministic and uncontrollable nature of the merge and shuffle procedure opens up a possibility of accidentally obtaining "bad shuffles" that will result in sub-optimal word representations. Therefore, we also propose a deterministic strategy for building pseudo-bilingual documents suitable for bilingual training. Source and target language words are inserted into an (initially empty) pseudo-bilingual document in turn based on the ratio of document lengths, with word order preserved. Document lengths are measured in terms of word tokens, and let us denote them as $m_S$ and $m_T$ for an aligned document pair $(d^S_j, d^T_j)$. Let us assume, without loss of generality, that $m_S \geq m_T$. The procedure then proceeds as follows (if $m_T > m_S$ the procedure proceeds in an analogous manner with the roles of $d^S_j$ and $d^T_j$ reversed):

1. Start with an empty pseudo-bilingual document $d'_j$.
2. Compute the ratio: $R = \lfloor m_S / m_T \rfloor$.
3. Scan through aligned documents $d^S_j$ and $d^T_j$ simultaneously and (3.1) append $R$ word tokens from $d^S_j$ into $d'_j$; then (3.2) append 1 word token from $d^T_j$. Repeat steps 3.1 and 3.2 until all word tokens from $d^T_j$ have been inserted into $d'_j$.
4. Insert the remaining $m_S \bmod m_T$ word tokens from $d^S_j$ into $d'_j$.
Using a simple example, assume that we have an English (EN) document {Frodo, Sam, orcs, goblins, Mordor, ring} and a Spanish (ES) document {anillo, orcos, mago}: the pseudo-bilingual document would be formed by inserting 1 Spanish word after 2 English words (as the length ratio is 6:3 = 2:1). The final pseudo-bilingual document is: {Frodo, Sam, anillo, orcs, goblins, orcos, Mordor, ring, mago}. In another interpretation, the length-ratio shuffle strategy constructs a single permutation/shuffle of the pseudo-bilingual document controlled by the word order in two aligned documents as well as their length ratio. As before, the model relies on pseudo-bilingual document level co-occurrence, and the window size parameter controls the amount of (now non-random) data dropout. A difference lies in the fact that this procedure now keeps word order intact monolingually while constructing a pseudo-bilingual document.
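The length-ratio shuffle can be sketched as a short deterministic function. The name `length_ratio_shuffle` is an assumption made here, documents are plain token lists, and the ratio is taken as the integer quotient of the two lengths so that the leftover tokens of the longer document are appended at the end.

```python
def length_ratio_shuffle(src_doc, tgt_doc):
    """Interleave two aligned documents by length ratio, keeping monolingual order."""
    if len(src_doc) >= len(tgt_doc):
        longer, shorter = src_doc, tgt_doc
    else:
        longer, shorter = tgt_doc, src_doc  # roles reversed when the target is longer
    if not shorter:
        return list(longer)
    ratio = len(longer) // len(shorter)  # R: tokens of the longer doc per shorter-doc token
    pseudo = []
    for i, tok in enumerate(shorter):
        pseudo.extend(longer[i * ratio:(i + 1) * ratio])  # append R tokens
        pseudo.append(tok)                                # then 1 token
    pseudo.extend(longer[len(shorter) * ratio:])          # leftover tokens
    return pseudo

en = ["Frodo", "Sam", "orcs", "goblins", "Mordor", "ring"]
es = ["anillo", "orcos", "mago"]
pseudo = length_ratio_shuffle(en, es)
# pseudo == ["Frodo", "Sam", "anillo", "orcs", "goblins", "orcos", "Mordor", "ring", "mago"]
```

Unlike a random shuffle, calling the function twice on the same document pair always yields the same pseudo-bilingual document.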
The final BWE Skip-gram (BWESG) model then relies on the monolingual variant of SGNS (or any other monolingual WE induction model) trained on these shuffled/permuted pseudo-bilingual documents (using strategies (1) or (2)).⁴ The model learns word embeddings for source and target language words aligned over the $d$ shared embedding dimensions. The BWESG-based representation of word $w$, regardless of its actual language, is then a $d$-dimensional vector $\mathbf{w} = [f_1, \ldots, f_k, \ldots, f_d]$, where $f_k \in \mathbb{R}$ denotes the score for the $k$-th shared inter-lingual feature within the $d$-dimensional shared bilingual embedding space. Since all words share the embedding space, semantic similarity between words may be computed both monolingually and across languages. We will extensively use this property in our evaluation tasks.
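Because source and target words live in one space, a cross-lingual similarity query reduces to a nearest-neighbor search. The sketch below assumes cosine similarity as the similarity function and uses hand-picked 3-dimensional vectors; the function names and the toy embeddings are hypothetical and are not taken from trained BWESG output.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def nearest_translation(word, embeddings, target_vocab):
    """Return the target-language word closest to `word` in the shared space."""
    return max(target_vocab, key=lambda t: cosine(embeddings[word], embeddings[t]))

# Hypothetical shared-space embeddings, chosen by hand for illustration only.
toy = {
    "school": [0.9, 0.1, 0.0],
    "escuela": [0.8, 0.2, 0.1],
    "anillo": [0.0, 0.1, 0.9],
}
best = nearest_translation("school", toy, ["escuela", "anillo"])
```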

Baseline Representation Models
We quickly navigate through other approaches to bilingual word representation learning from document-aligned comparable data. The set of models in comparison may be roughly clustered into two main groups: (Group I) "pre-BWE" baseline representation models from document-aligned data, (Group II) benchmarking BWE induction models that were not originally developed for learning from document-aligned comparable data. While it is essential to compare the BWESG model with other frameworks for learning representations from document-aligned data (Group I), it is also crucial to detect main strengths of the BWESG model when compared to other approaches in the BWE learning framework which can also be adjusted to learn from document-aligned data (Group II).
Group I: Baseline Representation Models from Document-Aligned Data
Basic-MuPTM The early approaches (e.g., Dumais, Landauer, & Littman, 1996; Carbonell, Yang, Frederking, Brown, Geng, & Lee, 1997) tried to mine topical structure from document-aligned comparable texts using a monolingual topic model (e.g., LSA or LDA) trained on pseudo-bilingual documents with the target document simply appended to its source language counterpart, and then used the discovered latent topical structure as a shared semantic space in which both words and documents from two languages may be represented in a uniform way.
More recent work on multilingual probabilistic topic modeling (MuPTM) (Mimno et al., 2009;De Smet & Moens, 2009;Vulić et al., 2011) showed that word representations of higher quality may be built if a multilingual topic model such as bilingual LDA (BiLDA) is trained jointly on document-aligned comparable corpora by retaining the structure of the corpus intact (i.e., there is no need to construct pseudo-bilingual documents).
MuPTM discovers the latent structure of the observed data in the form of $K$ latent cross-lingual topics $z_1, \ldots, z_K$ which optimally describe the generation of observed data. Extracting latent cross-lingual topics actually implies learning per-document topic distributions for each document in the corpus (probability scores $P(z_k|d_j)$), and discovering language-specific representations of these topics given by per-topic word distributions in each language (probability scores $P(w^S_i|z_k)$ and $P(w^T_i|z_k)$). Latent cross-lingual topics are in fact distributions over vocabulary words, and have their language-specific representation in each language. Per-document topic distributions and per-topic word distributions are obtained after training the topic model on multilingual data. The representation of some word $w \in V^S$ (or in an analogous manner $w \in V^T$) is then a $K$-dimensional vector:

\[ \mathbf{w} = [P(z_1|w), \ldots, P(z_k|w), \ldots, P(z_K|w)] \]

We call this representation model (RM) Basic-MuPTM (BMu). Since the number of topics, that is, the number of vector dimensions $K$ is typically high (Dinu & Lapata, 2010; Vulić et al., 2011), additional feature pruning (Reisinger & Mooney, 2010) may be employed in order to retain only the most descriptive dimensions in the MuPTM-based representation, which was shown to improve the performance on several semantic tasks (e.g., BLE or SWTC) (Vulić & Moens, 2013a; Vulić et al., 2015).
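As an illustration of how such a topic-based word vector could be derived from a trained topic model, the sketch below turns per-topic word distributions into conditional topic scores by Bayes' rule. It assumes a uniform prior over topics, which is a simplification made here and not necessarily the estimator used in the cited work; the function name and the toy probabilities are hypothetical.

```python
def basic_muptm_vector(word, topic_word):
    """Return [P(z_1|w), ..., P(z_K|w)] from topic_word[k][w] = P(w | z_k).

    Assumes a uniform topic prior, so P(z_k|w) is P(w|z_k) normalized over topics.
    """
    scores = [dist.get(word, 0.0) for dist in topic_word]
    norm = sum(scores)
    return [s / norm for s in scores] if norm > 0 else scores

# Toy per-topic word distributions for K = 2 topics (values are made up).
topic_word = [
    {"school": 0.20, "escuela": 0.10},
    {"school": 0.05, "escuela": 0.30},
]
vec = basic_muptm_vector("school", topic_word)  # approximately [0.8, 0.2]
```

Since words from both vocabularies are scored against the same $K$ topics, the resulting vectors are directly comparable across languages.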
A multilingual topic model is typically trained by Gibbs sampling (Geman & Geman, 1984; Steyvers & Griffiths, 2007; Vulić et al., 2015). Similar to the SGNS/BWESG training procedure, Gibbs sampling for MuPTM/BiLDA also scans the training corpus word by word, and then cyclically updates topic assignments for each word token. However, unlike BWESG, which uses only a subset of document-level training examples, Gibbs sampling for MuPTM uses all words from the source language document as well as all words from its coupled target language document to influence the topic assignment for the pivot word. The BWESG design relying on data dropout leads to decreased training times and computation costs to obtain final representations compared to Basic-MuPTM.
Association-MuPTM Another representation is also based on the MuPTM framework: it contains association scores $P(w_a|w)$ for each $w, w_a \in V^S \cup V^T$ (Vulić & Moens, 2013a) as dimensions of real-valued word vectors. These association scores are computed as

\[ P(w_a|w) = \sum_{k=1}^{K} P(w_a|z_k) P(z_k|w) \]

As with Basic-MuPTM, the original word representation may also be pruned post-hoc. We call this representation model Association-MuPTM (AMu). Since this approach relies on the MuPTM training plus additional $|V^S| \cdot |V^T|$ computations to estimate association scores, the cost of obtaining Association-MuPTM representations is even higher than for Basic-MuPTM, but it leads to more robust word representations for the BLE task (Vulić & Moens, 2013a). While both Basic-MuPTM and Association-MuPTM produce high-dimensional real-valued vectors with plenty of near-zero dimensions (the number of dimensions is typically measured in thousands) which have to be pruned afterwards with the pruning parameter often set ad-hoc, BWESG produces lower-dimensional dense real-valued vectors, and no additional post-hoc feature pruning is required for BWESG.
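A word-to-word association score of this kind can be sketched by marginalizing over the latent topics, that is, summing $P(w_a|z_k) P(z_k|w)$ over all $K$ topics. The decomposition is stated here as an assumption about how the score is estimated; the function name and the toy probabilities are hypothetical.

```python
def association_score(w_a, w, topic_word, topic_given_word):
    """P(w_a | w) as a sum over topics of P(w_a | z_k) * P(z_k | w)."""
    return sum(
        topic_word[k].get(w_a, 0.0) * p_topic
        for k, p_topic in enumerate(topic_given_word[w])
    )

# Toy model with K = 2 topics (values are made up).
topic_word = [
    {"school": 0.20, "escuela": 0.10},  # P(. | z_1)
    {"school": 0.05, "escuela": 0.30},  # P(. | z_2)
]
topic_given_word = {"school": [0.8, 0.2]}  # P(z_k | school)

score = association_score("escuela", "school", topic_word, topic_given_word)
# 0.10 * 0.8 + 0.30 * 0.2 = 0.14
```

Computing this score for every word pair is what makes the representation costly: each vector has one dimension per vocabulary word rather than one per topic.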
Traditional-PPMI A traditional approach to building bilingual word representations in (cross-lingual) distributional semantics is to compute weighted co-occurrence scores (e.g., using PMI, TF-IDF) between pivot words and their context words in a window of predefined size, plus an external bilingual lexicon to align context words/dimensions across languages (Gaussier et al., 2004; Laroche & Langlais, 2010). A weighting function (WeF), which is a standard choice in distributional semantics and yields optimal or near-optimal results over a group of semantic tasks (Bullinaria & Levy, 2007), is the smoothed positive pointwise mutual information statistic (Pantel & Lin, 2002; Turney & Pantel, 2010). Furthermore, in order to induce context words without the need for a readily available lexicon, we employ the bootstrapping procedure from Peirsman and Padó (2011), Vulić and Moens (2013b). This representation model is called Traditional-PPMI (TPPMI). The word representation is an $R$-dimensional vector:

\[ \mathbf{w} = [sc_1(w, c_1), \ldots, sc_k(w, c_k), \ldots, sc_R(w, c_R)] \]

The dimensions of the vector space are $R$ one-to-one word translation pairs $c_k = (c^S_k, c^T_k)$, and $sc_k(w, c_k)$ is the weighted co-occurrence score of the pivot word $w$ and the $k$-th context feature, where one computes the co-occurrence score using $c^S_k$ if $w \in V^S$, and $c^T_k$ if $w \in V^T$. Vector dimensions $c_k = (c^S_k, c^T_k)$ in the Traditional-PPMI representation and similar models with other WeFs are typically the most frequent and reliable translation pairs in the corpus. As opposed to BWESG, the obtained word vectors are again high-dimensional (typically thousands of dimensions) sparse real-valued vectors. In addition, Traditional-PPMI is a purely local distributional model deriving distributional context knowledge from narrow context windows (typically 3-10 surrounding words, e.g., Laroche & Langlais, 2010). A bootstrapping approach (Vulić & Moens, 2013b) which we use to induce the Traditional-PPMI representation starts from an automatically learned seed lexicon of one-to-one translation pairs obtained using some other model (e.g., Basic-MuPTM or Association-MuPTM), and then gradually detects new dimensions of the shared bilingual semantic space. We refer the interested reader to the relevant literature (Vulić & Moens, 2013b) for more details.
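The following sketch shows how a PPMI-weighted vector over translation-pair dimensions could be assembled. It uses plain (unsmoothed) PPMI, so the smoothing mentioned above is deliberately left out, and it assumes the translation pairs are already given. The function names, the `"S"`/`"T"` language flags and the toy data are hypothetical.

```python
import math
from collections import Counter

def ppmi_scores(pairs):
    """Map each observed (pivot, context) pair to its positive PMI score."""
    pairs = list(pairs)
    pair_counts = Counter(pairs)
    w_counts = Counter(w for w, _ in pairs)
    c_counts = Counter(c for _, c in pairs)
    total = len(pairs)
    scores = {}
    for (w, c), n in pair_counts.items():
        pmi = math.log(n * total / (w_counts[w] * c_counts[c]))
        scores[(w, c)] = max(pmi, 0.0)  # clip negative PMI to zero
    return scores

def tppmi_vector(word, lang, translation_pairs, scores):
    """One dimension per translation pair; use the side matching the word's language."""
    side = 0 if lang == "S" else 1
    return [scores.get((word, pair[side]), 0.0) for pair in translation_pairs]

# Toy monolingual co-occurrence pairs and one seed translation pair.
scores = ppmi_scores([("a", "x"), ("a", "x"), ("b", "y"), ("b", "x")])
vec = tppmi_vector("a", "S", [("x", "equis"), ("y", "ye")], scores)
```

In a full pipeline the scores would be computed separately on source and target language text, with the translation pairs tying the two sets of dimensions together.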

Group II: BWE Induction Models Adjusted to Document-Aligned Data
BiCVM. Hermann and Blunsom (2014b) introduced a model called BiCVM (Bilingual Compositional Vector Model) that learns bilingual word embeddings from a sentence-aligned parallel corpus $C = \{s_1, s_2, \ldots, s_N\} = \{(s^S_1, s^T_1), (s^S_2, s^T_2), \ldots, (s^S_N, s^T_N)\}$.$^{5}$ Here, $s_j = (s^S_j, s^T_j)$ denotes a pair of aligned sentences. The model assumes that the aligned sentences have the same meaning, which implies that their sentence representations should be similar. Assume two functions $f$ and $g$ which map sentences given in the source and target language respectively to their semantic representations in $\mathbb{R}^d$, where $d$ is again the representation dimensionality. The energy of the model given two sentences $(s^S_j, s^T_j) \in C$ is then defined as:

$$E(s^S_j, s^T_j) = \| f(s^S_j) - g(s^T_j) \|^2$$

The goal is to minimize $E$ for all semantically equivalent sentences (i.e., aligned sentences) in the corpus. In order to prevent the model from degenerating, they use a noise-contrastive large-margin update which ensures that the representations of non-aligned sentences observe a certain margin from each other. For every pair of parallel sentences $(s^S_j, s^T_j)$, they sample a number of additional negative sentence pairs $(s^S_j, n^T_{neg})$ from the corpus (i.e., the sampled pairs are not observed as positive pairs in $C$). These noise samples are used in formulating the hinge loss as follows:

$$E_{hl}(s^S_j, s^T_j, n^T_{neg}) = \max(mrg + \Delta E(s^S_j, s^T_j, n^T_{neg}), 0)$$

where $mrg$ is the margin, and

$$\Delta E(s^S_j, s^T_j, n^T_{neg}) = E(s^S_j, s^T_j) - E(s^S_j, n^T_{neg})$$

The loss is minimized for every pair of parallel sentences in the corpus with L2-regularization on the model parameters. The number of noise samples per positive pair is a hyper-parameter of the model. A semantic signal is propagated from aligned sentences back to the individual words to obtain bilingual word embeddings. While the BiCVM model was originally built for sentence-aligned parallel data, exactly the same idea may be applied to document-aligned non-parallel data. In this paper, we test its ability to learn from noisier comparable data. The BWESG model is compared against BiCVM when inducing BWEs from both data types: comparable and parallel.
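To make the noise-contrastive objective concrete, the following is a minimal sketch of the energy and hinge loss for a single positive pair. It assumes additive sentence composition (the sum of word vectors) for both $f$ and $g$; the function names `energy`, `hinge_loss` and `compose` are illustrative only and do not come from the released BiCVM tool.

```python
import numpy as np

def compose(word_vectors):
    """Assumed additive composition: a sentence vector is the sum of its word vectors."""
    return np.sum(np.asarray(word_vectors, dtype=float), axis=0)

def energy(f_src, g_tgt):
    """Squared Euclidean distance between two sentence representations."""
    diff = np.asarray(f_src, dtype=float) - np.asarray(g_tgt, dtype=float)
    return float(diff @ diff)

def hinge_loss(f_src, g_tgt, g_noise, margin):
    """Sum of max(margin + E(pos) - E(neg), 0) over the sampled noise sentences."""
    e_pos = energy(f_src, g_tgt)
    return sum(max(margin + e_pos - energy(f_src, n), 0.0) for n in g_noise)

# Toy 2-dimensional example (hypothetical vectors).
src = compose([[1.0, 0.0], [0.0, 1.0]])
tgt = compose([[0.5, 0.5], [0.5, 0.5]])
noise = [compose([[3.0, 0.0]]), compose([[1.0, 2.0]])]
loss = hinge_loss(src, tgt, noise, margin=2.0)
```

In this toy case the first noise sentence already lies outside the margin and contributes nothing, while the second one violates the margin and contributes to the loss.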
Mikolov. Another collection of BWE induction models (Mikolov et al., 2013b; Faruqui & Dyer, 2014; Dinu, Lazaridou, & Baroni, 2015; Lazaridou, Dinu, & Baroni, 2015) assumes the following setup: first, two monolingual embedding spaces, $\mathbb{R}^{dim_S}$ and $\mathbb{R}^{dim_T}$, are induced separately in each of the two languages using a standard monolingual WE model such as SGNS (Mikolov et al., 2013a, 2013c). $dim_S$ and $dim_T$ denote the dimensionality of monolingual embedding spaces in the source and target language respectively. The bilingual signal is provided in the form of word translation pairs $(x_i, y_i)$, where $x_i \in V^S$, $y_i \in V^T$, and [...]. Training is cast as a multivariate regression problem: it implies learning a function that maps the source language vectors from the training data to their corresponding target language vectors. A standard approach (Mikolov et al., 2013b; Dinu et al., 2015) is to assume a linear map $W \in \mathbb{R}^{dim_S \times dim_T}$, which is learned with an L2-regularized least-squares error objective (i.e., ridge regression) by solving the following optimization problem (typically by stochastic gradient descent):

$$\min_{W \in \mathbb{R}^{dim_S \times dim_T}} \|XW - Y\|^2_F + \lambda \|W\|^2_F$$

$X$ and $Y$ are matrices obtained through the respective concatenation of source language and target language vectors from training pairs. Once the linear map $W$ is estimated, any previously unseen source language word vector $x_u$ may be straightforwardly mapped into the target language embedding space $\mathbb{R}^{dim_T}$ as $W x_u$. After mapping all vectors $x$, $x \in V^S$, the target embedding space $\mathbb{R}^{dim_T}$ in fact serves as a bilingual embedding space (Figure 1).

Although the main strength of the model is its ability to learn embeddings on larger monolingual training sets, the model may also be adjusted to the setting where the only training data are document-aligned comparable data as follows: (1) automatically learn a seed lexicon of reliable one-to-one translation pairs from document-aligned data using a bootstrapping approach from Peirsman and Padó (2010) and Vulić and Moens (2013b); (2) train two separate monolingual embedding spaces on two separated halves of the document-aligned data set (i.e., using only source language documents and only target language documents); (3) learn the mapping between the two spaces using the pairs from Step 1.

5. A very similar (but more expensive) model which also learns from parallel sentence-aligned data was also introduced by Chandar et al. (2014).
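The mapping step can be sketched as follows. Note that this sketch swaps stochastic gradient descent for the closed-form ridge solution $W = (X^\top X + \lambda I)^{-1} X^\top Y$, which minimizes the same objective; rows of `X` and `Y` are assumed to hold the source and target vectors of the seed translation pairs, and a source vector is mapped as a row vector (`x @ W`). The function names and the synthetic data are hypothetical.

```python
import numpy as np

def learn_linear_map(X, Y, lam=1.0):
    """Closed-form ridge regression: W = (X^T X + lam * I)^{-1} X^T Y."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    dim_s = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(dim_s), X.T @ Y)

def map_to_target(x, W):
    """Map one source-language vector (row convention) into the target space."""
    return np.asarray(x, dtype=float) @ W

# Synthetic check: recover a known map from noiseless training pairs.
rng = np.random.default_rng(0)
W_true = rng.normal(size=(4, 3))   # dim_S = 4, dim_T = 3
X = rng.normal(size=(50, 4))       # 50 hypothetical seed pairs
Y = X @ W_true
W = learn_linear_map(X, Y, lam=1e-8)
```

With real embeddings the relation between the two spaces is only approximately linear, so the regularization weight would need to be tuned rather than set close to zero as in this synthetic check.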
The monolingual objectives $Mono_S$ and $Mono_T$ ensure that similar words in each language are assigned similar embeddings and aim to capture the semantic structure of each language, whereas the cross-lingual objective $Bi$ ensures that similar words across languages are assigned similar embeddings, and ties the two monolingual spaces together into a bilingual space. Parameters γ and δ govern the influence of the monolingual and bilingual components.$^{6}$ The bilingual signal for these models, now acting as the cross-lingual regularizer during the joint training, is provided in sentence-aligned parallel data. Although they use the same data sources, the models differ in the choice of monolingual and cross-lingual objectives. In this work, we opt for the BilBOWA model of Gouws et al. (2015) as the representative model to be included in the comparisons, due to its previous solid performance and robustness in the BLE task, its reduced complexity reflected in fast computations on massive datasets, as well as its public availability: https://github.com/gouwsmeister/bilbowa. In short, the BilBOWA model combines SGNS for the monolingual objectives together with the cross-lingual objective that minimizes the L2-loss between the bag-of-words vectors of parallel sentences. For more details about the exact training procedure, we refer the interested reader to the work of Gouws et al. (2015).
Again, although the main strength of the model is its ability to learn embeddings on larger monolingual training sets, the model may also be adjusted to the setting with document- or sentence-aligned data by: (1) using two halves of the aligned corpus for separate monolingual training, (2) using the alignment signal for bilingual training.

From Word Representations to Semantic Word Similarity
Assume now that we have induced bilingual word representations, regardless of the chosen RM. Given two words $w_i$ and $w_j$, irrespective of their actual language, we may compute the degree of their semantic similarity by applying a similarity function (SF) on their vector representations $\vec{w_i}$ and $\vec{w_j}$: $sim(w_i, w_j) = SF(\vec{w_i}, \vec{w_j})$. Different choices (or rather families) of SFs are cosine, the Kullback-Leibler or the Jensen-Shannon divergence, the Hellinger distance, the Jaccard index, etc. (Lee, 1999; Cha, 2007), and different RMs typically require different SFs to produce optimal or near-optimal results over various semantic tasks. When working with word embeddings, a standard choice for SF is cosine similarity (cos) (Mikolov et al., 2013c), which is also a typical choice in traditional distributional models (Bullinaria & Levy, 2007). The similarity is then computed as follows:

$$sim(w_i, w_j) = \cos(\vec{w_i}, \vec{w_j}) = \frac{\vec{w_i} \cdot \vec{w_j}}{\|\vec{w_i}\| \cdot \|\vec{w_j}\|}$$

On the other hand, a good choice for SF when working with probabilistic RMs such as Basic-MuPTM and Association-MuPTM is the Hellinger distance (Pollard, 2001; Cha, 2007; Kazama, Saeger, Kuroda, Murata, & Torisawa, 2010), which displays excellent results in the BLE task (Vulić & Moens, 2013a). The similarity between words $w_i$ and $w_j$ using the Hellinger distance is computed as follows:

$$sim(w_i, w_j) = HD(w_i, w_j) = \sqrt{\frac{1}{2} \sum_{k} \left( \sqrt{P(f_k|w_i)} - \sqrt{P(f_k|w_j)} \right)^2}$$

Note that the Hellinger distance is applicable only if word representations are probability distributions, which is the case for Basic-MuPTM and Association-MuPTM. $P(f_k|w_i)$ denotes the probability score for the $k$-th dimension ($f_k$) in the vector representation with Basic-MuPTM or Association-MuPTM.$^{7}$

For each word $w_i$, we can build a ranked list $RL(w_i)$ which consists of all other words $w_j$ ranked according to their respective semantic similarity scores $sim(w_i, w_j)$. Additionally, we label the ranked list $RL(w_i)$ that is pruned at position $M$ as $RL_M(w_i)$. Since we may retain language labels for words when training in multilingual settings (e.g., language labels are marked by different colors in Figure 2), we may compute: (1) monolingual similarity, e.g., given $w_i \in V^S$, we retain only $w_j \in V^S$ in the ranked list (analogous for $w_i \in V^T$), (2) cross-lingual similarity (CLSS), e.g., given $w_i \in V^S$, we retain only $w_j \in V^T$, and (3) multilingual similarity, where we retain all words $w_j \in V^S \cup V^T$. When computing CLSS for $w_i$, the most similar word cross-lingually is called the cross-lingual nearest neighbor.

6. Setting γ = 0 reduces the model to the setting similar to BiCVM (Hermann & Blunsom, 2014b). γ = 1 results in the models of Klementiev et al. (2012), Gouws et al. (2015), and Soyer et al. (2015).
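The two similarity functions and the construction of ranked lists can be sketched as follows. This is an illustrative sketch only: the toy vectors, the language labels and the helper name `ranked_list` are assumptions, and the ranking shown uses cosine (for the Hellinger distance one would sort in ascending order instead).

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def ranked_list(word, vectors, lang_of, keep_lang=None, top_m=None):
    """Rank all other words by cosine; keep_lang restricts the list to one language."""
    cands = [w for w in vectors
             if w != word and (keep_lang is None or lang_of[w] == keep_lang)]
    ranked = sorted(cands, key=lambda w: cosine(vectors[word], vectors[w]), reverse=True)
    return ranked if top_m is None else ranked[:top_m]

# Hypothetical 2-dimensional shared embedding space.
vectors = {"golfo": [1.0, 0.1], "bahia": [0.8, 0.3], "gulf": [0.9, 0.2], "golf": [0.1, 1.0]}
lang_of = {"golfo": "es", "bahia": "es", "gulf": "en", "golf": "en"}

monolingual = ranked_list("golfo", vectors, lang_of, keep_lang="es")
cross_lingual = ranked_list("golfo", vectors, lang_of, keep_lang="en")
multilingual = ranked_list("golfo", vectors, lang_of)
```

The first element of `cross_lingual` plays the role of the cross-lingual nearest neighbor, and `top_m` corresponds to pruning the list at position $M$.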
We will employ the models of context-insensitive CLSS at the word type level to extract bilingual lexicons from document-aligned or sentence-aligned data, and to compare all representation models in the BLE task in Section 7.

Context-Sensitive Models of (Cross-Lingual) Semantic Similarity
The context-insensitive models of semantic similarity provide ranked lists of semantically similar words invariably or in isolation, and they operate at the level of word types. They do not explicitly encode different word senses. In practice, it means that, given a sentence "The coach of his team was not satisfied with the game yesterday.", these context-insensitive CLSS models are not able to detect that the Spanish word entrenador is more similar to the polysemous English word coach in the context of this sentence than the Spanish word autocar, although autocar is listed as the most semantically similar word to coach globally/invariably without any observed context. In another example, while the Spanish words partido, encuentro, cerilla or correspondencia are all highly similar to another ambiguous English word match when observed in isolation, given the sentence "She was unable to find a match in her pocket to light up a cigarette.", it is clear that the strength of cross-lingual semantic similarity should change in context as only cerilla exhibits a strong cross-lingual semantic similarity to match within this particular sentential context.

7. Prior work has shown that the results for Basic-MuPTM and Association-MuPTM are slightly higher when cosine is replaced with the Hellinger distance. Therefore, in this particular case we have opted for the Hellinger distance to report a more competitive baseline.
The goal now is to build BWE-based models of cross-lingual semantic similarity in context, similar to context-aware CLSS models proposed by Vulić and Moens (2014).Two key questions are: (i) How to provide BWE-based representations beyond word level to represent the context of a word token?; (ii) How to use the contextual knowledge in a context-sensitive model of semantic similarity?
Following Vulić and Moens (2014), given a word token $w$ in context (e.g., a window of words, a sentence, a paragraph, or a document), we build its context set or rather context bag $Con(w) = \{cw_1, \ldots, cw_r\}$ by harvesting $r$ neighboring words in the chosen context scope (e.g., the context bag may comprise all content-bearing words in the same sentence as the pivot word token, the so-called sentential context). In order to present the context $Con(w)$ in the $d$-dimensional embedding space, we need to apply a model of semantic composition to learn its $d$-dimensional vector representation $\overrightarrow{Con(w)}$. Formally, given word $w$, we may specify the vector representation of the context bag $Con(w)$ as the $d$-dimensional vector/embedding:

$$\overrightarrow{Con(w)} = \vec{cw_1} \star \vec{cw_2} \star \ldots \star \vec{cw_r}$$

where $\vec{cw_1}, \ldots, \vec{cw_r}$ are $d$-dimensional WEs learned from the data, and $\star$ is a compositional vector operator such as addition, point-wise multiplication, tensor product, etc.
A plethora of models for semantic composition have been proposed in the relevant literature, differing in their choice of vector operators, input structures and required knowledge (Mitchell & Lapata, 2008; Baroni & Zamparelli, 2010; Rudolph & Giesbrecht, 2010; Socher, Huval, Manning, & Ng, 2012; Blacoe & Lapata, 2012; Clarke, 2012; Hermann & Blunsom, 2014b; Milajevs, Kartsaklis, Sadrzadeh, & Purver, 2014), to name only a few. In this work, driven by the observed linear linguistic regularities in the embedding spaces (Mikolov et al., 2013d), we opt for simple addition (denoted by +) from Mitchell and Lapata (2008) as the compositional operator, due to its simplicity, the ease of applicability on bag-of-words contexts, and its relatively solid performance in various compositional tasks (Mitchell & Lapata, 2008; Milajevs et al., 2014). The $d$-dimensional embedding $\overrightarrow{Con(w)}$ is then:

$$\overrightarrow{Con(w)} = \vec{cw_1} + \vec{cw_2} + \ldots + \vec{cw_r} \qquad (7)$$

If we use any BWE-based RM, we may compute the context-sensitive semantic similarity score $sim(w_i, t_j, Con(w_i))$ between $t_j$ and $w_i$ given its context $Con(w_i)$ in the shared bilingual embedding space as follows:

$$sim(w_i, t_j, Con(w_i)) = SF(\vec{w_i}', \vec{t_j})$$

$t_j \in V^T$ is any target language word, and $\vec{t_j}$ its word representation, while $\vec{w_i}'$ is the new "contextualized" vector representation for $w_i$ modulated by its context $Con(w_i)$, that is, its context-aware representation. Vulić and Moens (2014) introduced a linear interpolation of two $d$-dimensional vectors as a plausible solution for the modulation/contextualization. The modulation of representation for $w_i$ is computed as follows:

$$\vec{w_i}' = (1 - \lambda) \cdot \vec{w_i} + \lambda \cdot \overrightarrow{Con(w_i)}$$

where $\vec{w_i}$ is the word embedding for $w_i$ computed at the word type level, $\overrightarrow{Con(w_i)}$ is the embedding for the context bag computed using eq. (7), and $\lambda$ is an interpolation parameter.

Another set of similar models that can yield context-sensitive similarity computations has been proposed very recently, and has displayed very competitive results despite its simplicity (Melamud, Levy, & Dagan, 2015). Here, we present the two best scoring context-sensitive models, which we adapt to the bilingual setting: [...]. Note that for the Mult model one has to avoid negative values, so a simple shift to an all-positive interval is required, e.g., the shifted cosine score becomes $\cos'(x, y) = \frac{\cos(x, y) + 1}{2}$. Unlike the models from Vulić and Moens (2014), these two models do not aggregate single word representations into one vector that represents the context, but compute similarity scores separately with each word from the context. For more details regarding the models, we refer the interested reader to the original paper (Melamud et al., 2015).
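The additive composition and the linear interpolation of Vulić and Moens (2014) can be sketched as follows. This is a minimal illustration under stated assumptions: the 2-dimensional vectors are made up, cosine is used as the SF, and the helper names (`context_vector`, `contextualize`, `rank_candidates`) are hypothetical.

```python
import numpy as np

def cos(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def context_vector(context_words, emb):
    """Additive composition: sum the embeddings of all words in the context bag."""
    return np.sum([emb[c] for c in context_words], axis=0)

def contextualize(word, context_words, emb, lam):
    """Linear interpolation of the word-type vector and the context-bag vector."""
    return (1.0 - lam) * emb[word] + lam * context_vector(context_words, emb)

def rank_candidates(word, context_words, candidates, emb, lam):
    """Rank translation candidates by cosine to the contextualized source vector."""
    w_ctx = contextualize(word, context_words, emb, lam)
    return sorted(candidates, key=lambda t: cos(w_ctx, emb[t]), reverse=True)

# Hypothetical shared bilingual space; values chosen only for illustration.
emb = {
    "match": np.array([1.0, 0.8]),
    "partido": np.array([1.0, 0.2]),
    "cerilla": np.array([0.2, 1.0]),
    "cigarette": np.array([0.0, 1.0]),
    "pocket": np.array([0.1, 0.8]),
}
context = ["cigarette", "pocket"]
no_context = rank_candidates("match", context, ["partido", "cerilla"], emb, lam=0.0)
with_context = rank_candidates("match", context, ["partido", "cerilla"], emb, lam=0.5)
```

Setting `lam=0.0` recovers the context-insensitive ranking, while larger values let the context bag pull the representation of the pivot word toward the sense that fits the sentence.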
We will employ the models of context-sensitive CLSS at the word token level to compare all representation models in the task of suggesting word translations in context in Section 8.

Training Setup
Training Data. To induce bilingual word embeddings as well as to be directly comparable with baseline representations from prior work, we use a dataset comprising a subset of comparable Wikipedia data available in three language pairs (Vulić & Moens, 2013b, 2014)$^{8}$: (i) a collection of 13,696 Spanish-English Wikipedia article pairs (ES-EN), (ii) a collection of 18,898 Italian-English Wikipedia article pairs (IT-EN), and (iii) a collection of 7,612 Dutch-English Wikipedia article pairs (NL-EN). All corpora are theme-aligned comparable corpora, that is, the aligned document pairs discuss similar themes, but are in general not direct translations of each other. To be directly comparable to prior work in the two evaluation tasks (Vulić & Moens, 2013b, 2014), we retain only nouns that occur at least 5 times in the corpus. Lemmatized word forms are recorded when available, and original forms otherwise. TreeTagger (Schmid, 1994) is used for POS tagging and lemmatization. After the preprocessing steps, vocabularies comprise between 7,000 and 13,000 noun types for each language in each language pair, and the training corpora are quite small: ranging [...]

We also demonstrate that it is simple and straightforward to train BWESG on parallel sentence-aligned data using the same modeling principles. For that purpose, we use Europarl.v7 (Koehn, 2005) for all three language pairs obtained from the OPUS website (Tiedemann, 2012).$^{9}$ As the only preprocessing step, we retain only words occurring at least 5 times in the corpus. Each corpus contains approximately 2M parallel sentences, vocabularies are an order of magnitude larger than those from the smaller Wikipedia data (i.e., varying from 45K EN word types to 75K NL word types), and the corpora sizes are approximately 120M tokens. Data statistics of the two data sources, Wikipedia vs Europarl, are provided in Table 1. The statistics reveal the different nature of the two corpora, with significantly more variance and noise reported for the Wikipedia data.
Trained BWESG Models. To test the effect of random shuffling in the merge and shuffle BWESG strategy, we have trained the BWESG model with 10 random corpora shuffles for all three training corpora. We also train BWESG with the length-ratio shuffle strategy. All parameters are set to default suggested parameters for SGNS from the word2vec package: stochastic gradient descent (SGD) with a linearly decreasing global learning rate of 0.025, 25 negative samples, subsampling rate 1e-4, and 15 epochs.
We have varied the number of dimensions d = 100, 200, 300. We have also trained BWESG with d = 40 to be directly comparable to readily available sets of BWEs from prior work (Chandar et al., 2014). Moreover, to test the effect of window size on the final results, i.e., the number of positives used for training, we have varied the maximum window size cs from 4 to 60 in steps of 4.$^{10}$ We will make our pre-training and training code for BWESG publicly available, along with all BWESG-based bilingual word embeddings for the three language pairs at: http://liir.cs.kuleuven.be/software.php.
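The resulting search grid can be written down compactly. The sketch below merely enumerates the configurations described above; the dictionary keys are illustrative names, not actual word2vec command-line flags.

```python
from itertools import product

# SGNS settings reported in the text (key names are hypothetical).
SGNS_SETTINGS = {
    "learning_rate": 0.025,   # linearly decreasing global learning rate
    "negative_samples": 25,
    "subsampling": 1e-4,
    "epochs": 15,
}
DIMENSIONS = [100, 200, 300]            # d; d = 40 is trained separately
WINDOW_SIZES = list(range(4, 61, 4))    # cs = 4, 8, ..., 60

def bwesg_configs():
    """Yield one training configuration per (d, cs) combination."""
    for d, cs in product(DIMENSIONS, WINDOW_SIZES):
        yield {"d": d, "cs": cs, **SGNS_SETTINGS}

configs = list(bwesg_configs())
```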
Baseline Representations: Group I. All parameters of the baseline representation models (i.e., topic models and their settings, the number of dimensions K, the values for feature pruning, window size, weighting and similarity functions) were optimized in prior work. Therefore, the settings are adopted directly from previous work (Griffiths et al., 2007; Bullinaria & Levy, 2007; Dinu & Lapata, 2010; Vulić & Moens, 2013a, 2013b; Kiela & Clark, 2014), and we encourage the interested reader to check the details and exact parameter setup in the relevant literature. We provide only a short overview here.
For Basic-MuPTM and Association-MuPTM, as in Vulić and Moens (2013a), a bilingual latent Dirichlet allocation (BiLDA) model was trained with K = 2000 topics and the standard values for hyper-parameters: α = 50/K, β = 0.01 (Steyvers & Griffiths, 2007). Post-hoc semantic space pruning was employed with the pruning parameter set to 200 for Basic-MuPTM and to 2000 for Association-MuPTM. We refer the reader to the relevant paper for more details.
For Traditional-PPMI, as in Vulić and Moens (2013b), a seed lexicon was automatically obtained by bootstrapping from the initial seed lexicon of reliable pairs stemming from the Association-MuPTM model (with the same parameters for Association-MuPTM as listed above). The window size was fixed to 6 in both directions. We again refer the reader to the paper for more details.
Baseline Representations: Group II. All baseline BWE models were trained with the same number of dimensions as BWESG: d = 100, 200, 300. Other model-specific parameters were taken as suggested in prior work.
For BiCVM, we use the tool released by the authors.$^{11}$ We train an additive model, with hinge loss margin mrg = d as in the original paper, batch size of 50, and noise parameter of 10. All models were trained with 200 iterations.
For Mikolov, we train two monolingual SGNS models using the original word2vec package, SGD with a global learning rate of 0.025, 25 negative samples, subsampling rate 1e-4, and 15 epochs. The seed lexicon required to learn the mapping between two monolingual spaces is exactly the same as for Traditional-PPMI.
For BilBOWA, we use SGD with a global learning rate of 0.15 for training$^{12}$, 25 negative samples, subsampling rate 1e-4, and 15 epochs. For BilBOWA and Mikolov, we vary the window size the same way as in BWESG.
Similarity Functions. Unless stated otherwise, the similarity function used in all similarity computations with all RMs is cosine (cos). The only exceptions are Basic-MuPTM and Association-MuPTM where the Hellinger distance (HD) was used since it consistently outperformed cosine for these two RM types in prior work (see Footnote 7).
A Roadmap to Experiments. In the first experiment, we briefly inspect the obtained lists of semantically similar words using the BWESG bilingual representation model. Following that, we compare BWESG-based models for bilingual lexicon extraction (BLE) and suggesting word translations in context (SWTC) against both groups of baseline models discussed in Section 4. The experiments and results for the BLE task are presented in Section 7, while the experiments and results for SWTC are presented in Section 8.

Task Description
One may employ the context-insensitive CLSS models from Section 5 to extract bilingual lexicons automatically from data. By harvesting cross-lingual nearest neighbors, one is able to build a bilingual lexicon of one-to-one translation pairs $(w^S_i, w^T_j)$. We test the validity of our BWEs and baseline representations in the BLE task.

Experimental Setup
Test Data. For each language pair, we evaluate on standard 1,000 ground truth one-to-one translation pairs built for the three language pairs (ES/IT/NL-EN) by Vulić and Moens (2013a, 2013b). Translation direction is ES/IT/NL → EN. The data is available online.$^{13}$

Evaluation Metrics. Since we can build a one-to-one bilingual lexicon by harvesting one-to-one translation pairs, the lexicon quality is best reflected in the $Acc_1$ score, that is, the number of source language (ES/IT/NL) words $w^S_i$ from ground truth translation pairs for which the top ranked word cross-lingually is the correct translation in the other language (EN) according to the ground truth over the total number of ground truth translation pairs (=1000) (Gaussier et al., 2004; Tamura et al., 2012; Vulić & Moens, 2013b). Similar trends are observed within a more lenient setting with $Acc_5$ and $Acc_{10}$ scores, but we omit these results for clarity and the fact that the actual BLE performance is best reflected in $Acc_1$.
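The $Acc_1$ computation described above can be sketched as follows. The function name `acc_at_1` and the toy lexicon are hypothetical; source words for which no ranked list is available are counted as errors, so the denominator is always the full number of ground truth pairs.

```python
def acc_at_1(ranked_lists, gold):
    """Share of gold source words whose cross-lingual nearest neighbor is the gold translation."""
    hits = 0
    for src, tgt in gold.items():
        ranking = ranked_lists.get(src) or [None]   # missing or empty list counts as a miss
        if ranking[0] == tgt:
            hits += 1
    return hits / len(gold)

# Hypothetical ground truth and model output.
gold = {"golfo": "gulf", "cebolla": "onion", "perro": "dog", "casa": "house"}
ranked_lists = {
    "golfo": ["gulf", "bay"],
    "cebolla": ["garlic", "onion"],   # correct translation only at rank 2
    "perro": ["dog"],
}
score = acc_at_1(ranked_lists, gold)
```

The more lenient $Acc_5$ and $Acc_{10}$ variants would replace the comparison with a membership test over the first 5 or 10 entries of each ranked list.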
13. http://people.cs.kuleuven.be/ivan.vulic/software/

Table 3: Example lists of top 10 semantically similar words for ES-EN and IT-EN, obtained using BWESG (length-ratio, d = 200, cs = 48), and the three representation models from Group I. The correct translation is marked by (+).

Experiment 0: Qualitative Analysis and Comparison
Table 2 displays top 10 semantically similar words monolingually, across languages and combined/multilingually for one ES, IT and NL word. BWESG is able to find semantically coherent lists of words for all three directions of similarity (i.e., monolingual, cross-lingual, multilingual). In the combined (multilingual) ranked lists, words from both languages are represented as top similar words. This initial qualitative analysis already demonstrates the ability of BWESG to induce a shared bilingual embedding space using only document alignments as bilingual signals.$^{14}$

In another brief analysis, we qualitatively compare the cross-lingual ranked lists acquired by BWESG with the other three baseline CLSS/BLE models from Group I. The lists for one ES word and one IT word are presented in Table 3. For the two example words, BWESG is the only model which is able to rank the actual correct translations as nearest cross-lingual neighbors. It is already symptomatic that the word gulf, which is the correct translation for golfo, does not occur in the ranked list $RL_{10}(golfo)$ at all in case of the three baseline models. We will soon quantitatively confirm this initial suspicion, and demonstrate that BWESG is superior to the three baseline models in the BLE task.
As an aside, Table 3 also clearly reveals the difficulty of judging the quality of models for computing semantic similarity/relatedness solely based on the observed output of the models. The lists $RL_{10}(cebolla)$ and $RL_{10}(golfo)$ appear significantly different across all four models, yet all these lists contain words which appear semantically related to the source word. Therefore, we require a more systematic quantitative task-oriented comparison of induced word representations.

14. We also conducted a small experiment on solving word analogies using monolingual English embedding spaces, and then we repeated the experiment with the same vocabulary and bilingual English-Spanish/Italian/Dutch embedding spaces. The results follow the findings of Faruqui and Dyer (2014), where only slight (and often insignificant) fluctuations for SGNS vectors were reported (e.g., the fluctuations are < 1% on average in our experiments) when moving from monolingual to bilingual embedding spaces. We may conclude that the linguistic regularities (Mikolov et al., 2013d) established for monolingual embedding spaces induced by SGNS also hold in bilingual embedding spaces induced by BWESG.

Table 4: BLE performance in terms of $Acc_1$ scores for all tested BLE models for Spanish-English, Italian-English and Dutch-English with all bilingual word representations learned from document-aligned Wikipedia data. For BWESG with merge and shuffle we report maximum (MAX), minimum (MIN) and average (AVG) scores over 10 random corpora shuffles. Highest scores per column are in bold.

Experiment I: BWESG vs Group I
Table 4 shows the first set of results on the BLE task: we report scores with two different BWESG strategies as well as with a BWESG model which does not shuffle pseudo-bilingual documents. The previous best reported $Acc_1$ scores with baseline representations for the same training+test combination are also reported in the table. Examining the table from several angles, we summarize the most important findings.

BWESG vs Baseline Representations
The results clearly reveal the superior performance of the BWESG model for BLE which relies on our new framework for inducing bilingual word embeddings from document-aligned comparable data over other BLE models relying on previously used bilingual word representations from the same type of training data. The increase in $Acc_1$ scores over the best scoring baseline models is 22.2% for ES-EN, 7% for IT-EN and 67.5% for NL-EN.
BWESG Shuffling Strategy. Although both BWESG strategies display results that are above established baselines, there is a clear advantage to the length-ratio shuffle strategy, which displays a solid and robust performance across a variety of parameters and all three language pairs. Another advantage of that strategy is the fact that it has a deterministic outcome and does not suffer from "sub-optimal" random shuffles. In summary, we suggest using the length-ratio shuffle strategy in future work, and along the same line we opt for that strategy in all further experiments. The results also reveal that shuffling is universally useful, as BWESG without shuffling relies largely on monolingual contexts and cannot reach the performance of BWESG with shuffling. A partial remedy for the problem is to train BWESG with more document-level training pairs (i.e., by increasing the window size), but that leads to prohibitively expensive models, and nonetheless BWESG without shuffling with larger cs-s still falls short of BWESG with both shuffling strategies (see also Figure 3). [...] This finding reveals that even a coarse tuning of these parameters might lead to optimal or near-optimal scores for BLE with BWESG.

Differences across Language Pairs. A lower increase in $Acc_1$ scores for IT-EN is attributed to the fact that the test set for IT-EN comprises IT words with occurrence frequencies above 200 in the training data (Vulić & Moens, 2013a), while the other two test sets comprise randomly sampled words covering all frequency spectra. As expected, all models in comparison are able to effectively utilize distributional signals for higher-frequency words, but BWESG still displays the best performance, and these improvements in $Acc_1$ scores are statistically significant (using McNemar's statistical significance test, p < 0.05).$^{15}$ Further, the lowest overall scores for all models in comparison are observed for NL-EN. We attribute it to using less training data for NL-EN when compared to ES-EN and IT-EN (i.e., training corpora for ES-EN and IT-EN are [...] document-aligned data, is unable to compete with BWESG when learning BWEs from the noisier setting with non-parallel data.

We also present a preliminary study where we compare BWESG and Group II models in the setup with parallel sentence-aligned data. Results are summarized in Figures 4(a)-4(c).$^{16}$ The preliminary results clearly demonstrate that BWESG is able to learn BWEs from parallel data without the slightest change in its modeling principles. While the BilBOWA model displays better results for lower values of the cs parameter, to our own surprise, the BWESG model is comparable to or even better than the baseline models with larger window sizes. The BiCVM model, which implicitly utilizes the entire sentence span in training, also outperforms BWESG with smaller windows, but BWESG again performs significantly better with larger windows. The BWESG performance flattens out quicker than with the Wikipedia data (compare the results with cs = 16 and cs = 48), which is easily explained by the decreased length of aligned items as provided in Table 1 (i.e., sentences vs documents).

16. Note that the absolute scores are not directly comparable to the BLE scores when the model is trained on Wikipedia data (Tables 4 and 5) due to different training data, different preprocessing steps and vocabularies. Different vocabularies also result in different BLE search spaces and coverages of the test sets (e.g., some very common Spanish nouns from the test set such as nadador (swimmer) or colmillo (tusk) are not observed in Europarl due to the domain shift).
For English-Spanish, we can also compare BWESG to pre-trained 40-dimensional embeddings from Chandar et al. (2014)

Evaluation Task II: Suggesting Word Translations in Context
In another task, we test the ability of BWEs to produce context-sensitive semantic similarity modeling (see Section 5.1), which in turn may be used to solve the task of suggesting word translations in context (SWTC) proposed recently (Vulić & Moens, 2014). The goal now is to build BWESG-based models for SWTC given the sentential context, similarly to prior work. We show that our new BWESG-based SWTC models outperform the best SWTC models (Vulić & Moens, 2014), as well as other SWTC models which rely on the baseline word representations discussed in Section 4.

Task Description
Given an occurrence of a polysemous word $w_i \in V^S$ and the context of that occurrence, the SWTC task is to choose the correct translation in the target language $L_T$ of that particular occurrence of $w_i$ from the given set $TC(w_i) = \{t_1, \ldots, t_{tq}\}$, $TC(w_i) \subseteq V^T$, of its $tq$ possible translations/meanings. We may refer to $TC(w_i)$ as an inventory of translation candidates for $w_i$. The task of suggesting word translations in context (SWTC) may be interpreted as ranking the $tq$ translation candidates with respect to the observed local context $Con(w_i)$ of the occurrence of the word $w_i$. The best scoring translation candidate according to the scores $sim(w_i, t_j, Con(w_i))$ (see Section 5.1) in the ranked list is then the correct translation for that particular occurrence of $w_i$ observing its local context $Con(w_i)$.

Experimental Setup
Test Data. We use the SWTC test set introduced recently (Vulić & Moens, 2014). The test set comprises 15 polysemous nouns in three languages (ES, IT and NL) along with sets of their translation candidates (i.e., sets $TC$). For each polysemous noun, the test sets provide 24 sentences extracted from Wikipedia which illustrate different senses and translations of the pivot polysemous noun, accompanied by the annotated correct translation for each sentence. It yields 360 test sentences for each language pair (and 1080 test sentences in total). An additional set of 100 IT sentences (5 other polysemous IT nouns plus 20 sentences for each noun) is used as a development set to tune the parameter λ (see Section 5.1) for all language pairs and all models in comparison. In summary, the final aim may be formulated as follows: for each polysemous word $w_i$ in ES/IT/NL, the goal is to suggest its correct translation in English given its sentential context.
Evaluation Metrics. Since the task is to present a list of possible translations to an SWTC model, and then let the model decide a single most likely translation given the word and its sentential context, we measure the performance again as Top 1 accuracy ($Acc_1$).

Experiment I: BWESG vs Group I
Note that the Group I models held previously best reported SWTC scores for the training+test combination.
Again, all parameters of the baseline representation models are adopted directly from prior work where they were optimized on development sets comprising additional 100 sentences (Vulić & Moens, 2014). In addition, BMu+HD+S and BMu+Cue+S also rely on the procedure of context sorting and pruning (Vulić & Moens, 2014), where the idea is to retain only context words which are most semantically similar to the given pivot polysemous word, and then use them in computations. The procedure, however, produces significant gains only for probabilistic models (BMu+HD+S and BMu+Cue+S), and therefore, we employ it only for these models. BMu+HD+S and BMu+Cue+S with context sorting and pruning were the best scoring models in the introductory SWTC paper (Vulić & Moens, 2014) and currently produce state-of-the-art SWTC results on these test sets.$^{19}$

Table 6 summarizes the results and comparison with Group I models on the SWTC task. NO-CONTEXT refers to the context-insensitive majority baseline (i.e., always choosing the most semantically similar translation candidate obtained by BWESG at the word type level, without taking into account any context information).

BWESG vs Baseline Representations
The results reveal that BWESG outperforms baseline bilingual word representations from Group I also in the SWTC task. The improvements are prominent for all reported values of parameters d and cs, and are often statistically significant even when compared to the strongest baseline (which is the fine-tuned BMu+Cue+S model with context sorting and pruning for all three language pairs from Vulić and Moens (2014)). The increase in Acc 1 scores over the strongest baseline is 12.9% for ES-EN, 11.9% for IT-EN, and 12.4% for NL-EN. The obtained results surpass previous state-of-the-art scores and are currently the best reported results on the SWTC datasets when using non-parallel data to learn semantic representations.
BWESG Shuffling Strategy. Although BWESG without shuffling already displays encouraging results (due to the reduced complexity of the SWTC task compared to BLE), there is again a clear advantage to the length-ratio shuffle strategy, which performs very well for all three language pairs. Simply put, shuffling is again useful.
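Significance in this comparison is assessed with McNemar's test (see the caption of Table 6), which compares two models on the same test items and only looks at the items on which they disagree. The following is a minimal sketch of the exact (binomial) two-sided variant; the text does not state which variant of the test was used, so the choice of the exact version, as well as the function name and the toy data, are assumptions.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for paired per-item correctness flags."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same test items")
    # Discordant pairs: items where exactly one of the two models is correct.
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the models never disagree
    k = min(only_a, only_b)
    # Under the null hypothesis each discordant pair favors either model
    # with probability 0.5, so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical toy data: model B fixes five errors of model A, A fixes none of B's.
baseline_correct = [False] * 5 + [True] * 3
new_model_correct = [True] * 5 + [True] * 3
p_value = mcnemar_exact(baseline_correct, new_model_correct)
```

With five discordant items that all favor the second model, the p-value is 2 · (1/32) = 0.0625, which would not pass the p < 0.05 threshold; real test sets with 360 sentences provide far more discordant items.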

Dimensionality and Number of Training Pairs
Unlike in the BLE task, the highest Acc 1 scores on average are obtained by using lower-dimensional word embeddings (i.e., d = 100). The phenomenon may be attributed to the effect of semantic composition and the reduced complexity of the SWTC task compared to the BLE task. First, although enlarging the dimensionality of embeddings leads to an increased semantic expressiveness within the shared bilingual embedding space, it may be harmful when working with composition models, since the simple additive model of semantic composition may produce more erroneous dimensions when constructing higher-dimensional context embeddings out of single word embeddings. Second, due to its design, the SWTC task requires coarser-grained representations than BLE. While in the BLE task the goal is to detect a translation of a word from a vocabulary which typically spans (tens of) thousands of words, in the SWTC task the goal is to detect the most likely translation of a word given its sentential context, but from a small closed vocabulary of 2-4 possible translations from the translation inventory. Therefore, it is highly likely that even low-dimensional embeddings are sufficient to produce plausible rankings for the SWTC task, while at the same time they are not expressive enough to find correct translations in BLE. More training pairs (i.e., larger windows) still yield better results on average in the SWTC task. In summary, the choice of representation granularity depends on the actual task, which leads to the conclusion that optimal values for d and cs are largely task-specific (compare also the results in Table 4 and Table 6).
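To make the role of additive composition concrete, the sketch below builds a context embedding by summing the embeddings of the context words, mixes it with the embedding of the pivot word, and ranks the translation candidates by cosine similarity. The exact formulation of Section 5.1 is not reproduced in this section, so the linear interpolation weighted by λ (`lam`), the function names and the two-dimensional toy vectors are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def add_vectors(vectors: Sequence[Vector]) -> Vector:
    """Additive composition: component-wise sum of word embeddings."""
    return [sum(components) for components in zip(*vectors)]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_translations(
    pivot: Vector,
    context: Sequence[Vector],
    candidates: Dict[str, Vector],
    lam: float,
) -> List[str]:
    """Rank candidates by similarity to a context-aware pivot representation."""
    if context:
        context_vec = add_vectors(context)
        # Assumed interpolation: lam = 0 ignores the context entirely
        # (the NO-CONTEXT setting), lam = 1 uses the context only.
        query = [(1.0 - lam) * p + lam * c for p, c in zip(pivot, context_vec)]
    else:
        query = list(pivot)
    return sorted(candidates, key=lambda w: cosine(query, candidates[w]), reverse=True)


# Hypothetical toy embeddings in a shared bilingual space.
pivot_vec = [1.0, 1.0]
context_vecs = [[0.0, 1.0], [0.0, 0.5]]
candidate_vecs = {"cand_a": [1.0, 0.0], "cand_b": [0.0, 1.0]}
ranking = rank_translations(pivot_vec, context_vecs, candidate_vecs, lam=0.5)
```

Here the context pulls the query vector towards the second dimension, so `cand_b` is ranked first even though the pivot alone is equally similar to both candidates.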
Testing Polysemy. In order to test whether the gain in performance for BWESG+add is derived mostly from the effective handling of the easiest set of words, that is, bisemous words (polysemous words with only 2 translation candidates), we have performed an additional experiment, where we have measured Acc 1 scores separately for words with 2, 3, and 4 different senses. Results indicate that the performance gain comes mostly from gains on trisemous and tetrasemous words, while the scores on bisemous words are comparable. Table 7 shows Acc 1 over different clusters of words for ES-EN, and similar scoring patterns are observed for IT-EN and NL-EN.
Differences across Language Pairs. Due to the reduced complexity of SWTC, we may also observe relatively higher results for NL-EN when compared to ES-EN and IT-EN, as opposed to their relative performance in the BLE task, where the scores for NL-EN are much lower than the scores for ES-EN and IT-EN. Since SWTC is a less difficult task which requires coarse-grained representations, even limited amounts of training data may be sufficient to learn word embeddings which are useful for the specific task. This finding is in line with the recent work of Gouws and Søgaard (2015).
8.3.2 Experiment II: BWESG vs. Other BWE Induction Models (Group II)
We again test other BWE induction models in the SWTC task, using the same training setup and sets of embeddings as introduced in Section 7.3.3 for the BLE task. The representations were now plugged into the context-sensitive CLSS modeling framework from Section 5.1, and the optimization of parameters for SWTC has been conducted in the same manner as for BWESG. The results with the Mikolov model and BiCVM are summarized in Table 8. The results with BilBOWA are very similar to those with BiCVM, so we do not report them for brevity.
BWESG outperforms other BWE induction models in the SWTC task and further confirms its utility in cross-lingual semantic modeling.The model from Mikolov et al. (2013b) constitutes a stronger baseline: Good results in the SWTC task with this model are an interesting finding per se.While the model is not competitive with BWESG and other baseline representations models from document-aligned data in a more difficult BLE task when using noisy one-to-one translation pairs, its performance on the less complex SWTC phylum, that is, the Indo-European language family.Future extensions also include porting the framework to other more distant language pairs that do not share the same roots nor the same alphabet (e.g., English-Chinese/Hindi/Arabic), and for which benchmarking test sets are still scarce for a variety of semantic tasks (e.g., SWTC) (Camacho-Collados, Pilehvar, & Navigli, 2015).We believe that larger window sizes may solve difficulties with different word orderings (e.g., for Chinese-English).

Conclusions and Future Work
We have proposed and described Bilingual Word Embeddings Skip-Gram (BWESG), a simple yet effective bilingual word representation learning model which is able to induce bilingual word embeddings solely on the basis of document-aligned comparable data. BWESG is based on the omnipresent skip-gram with negative sampling (SGNS). We have presented two ways to build pseudo-bilingual documents on which a monolingual SGNS (or any monolingual WE induction model) may be trained to produce shared bilingual embedding spaces. The BWESG model neither makes any language-pair dependent assumptions nor requires language-pair specific external resources such as bilingual lexicons, predefined category/ontology knowledge or parallel data. We have shown that the model may be trained on non-parallel and parallel data without any changes in modeling principles, which, complemented with its simplicity and lightweight design, makes it potentially very useful as a tool for researchers in machine translation and information retrieval.
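The two ways of building pseudo-bilingual documents named in the caption of Figure 2 (non-deterministic merge and shuffle, deterministic length-ratio shuffle) can be sketched as follows. The details of the length-ratio strategy are not restated in this section, so the version below is one plausible reading and should be treated as an assumption: with R the integer ratio of the longer to the shorter document length, one word of the shorter document is inserted after every R words of the longer one, monolingual word order is preserved, and leftover words are appended at the end. All function names and the toy documents are hypothetical.

```python
import random
from typing import List, Optional


def merge_and_shuffle(src: List[str], tgt: List[str], seed: Optional[int] = None) -> List[str]:
    """Non-deterministic strategy: concatenate both documents and permute randomly."""
    merged = list(src) + list(tgt)
    random.Random(seed).shuffle(merged)
    return merged


def length_ratio_shuffle(src: List[str], tgt: List[str]) -> List[str]:
    """Deterministic strategy (assumed reading): interleave by the length ratio."""
    longer, shorter = (src, tgt) if len(src) >= len(tgt) else (tgt, src)
    if not shorter:
        return list(longer)
    ratio = len(longer) // len(shorter)
    merged: List[str] = []
    pos = 0
    for word in shorter:
        # Take the next `ratio` words of the longer document, then one
        # word of the shorter document, keeping both original orders.
        merged.extend(longer[pos:pos + ratio])
        merged.append(word)
        pos += ratio
    merged.extend(longer[pos:])  # leftover words of the longer document
    return merged


# Hypothetical aligned document pair (word tokens only).
source_doc = ["s1", "s2", "s3", "s4", "s5"]
target_doc = ["t1", "t2"]
pseudo_doc = length_ratio_shuffle(source_doc, target_doc)
random_doc = merge_and_shuffle(source_doc, target_doc, seed=0)
```

For this pair the ratio is 2, so `pseudo_doc` is `s1 s2 t1 s3 s4 t2 s5`. Either output could then be passed as a single training document to any monolingual skip-gram implementation.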
We have employed induced BWEs in two semantic tasks: (1) bilingual lexicon extraction (BLE), and (2) suggesting word translations in context (SWTC). Our new BWESG-based BLE and SWTC models outperform previous state-of-the-art models for BLE and SWTC from document-aligned comparable data and related BWE induction models (Mikolov et al., 2013b; Chandar et al., 2014; Gouws et al., 2015). The findings in this article are in line with the recently published surveys of Baroni et al. (2014) and Levy et al. (2015) regarding a solid and robust performance of neural word representations/word embeddings in semantic tasks: our new BWESG-based models for BLE and SWTC significantly outscore previous state-of-the-art distributional approaches on both tasks across different parameter settings. Even more encouraging is the fact that these new state-of-the-art results are attained using default parameter settings for the BWESG model as suggested in the word2vec package, without any development set. Further (finer) tuning of model parameters in future work may lead to higher-quality bilingual embedding spaces.
Several straightforward lines of future research have already been tackled in Sections 7 and 8. For instance, the current length-ratio shuffling strategy may be replaced by a more advanced shuffling method in future work. Moreover, BWEs induced by BWESG may be used in other semantic tasks besides the ones discussed in this work, and it would be interesting to experiment with other types of context aggregation and selection beyond the bag-of-words assumption, such as dependency-based contexts (Levy & Goldberg, 2014a), or with other objective functions during training in the same vein as proposed by Levy and Goldberg (2014b). Similar to the evolution in multilingual probabilistic topic modeling, another path of future work may lead to investigating bilingual models for learning BWEs which will be able to jointly learn from separate documents in aligned document pairs, without the need to construct pseudo-bilingual documents.
A natural step in text representation learning research is to extend the focus from single word representations to composite phrase, sentence and document representations (Hermann & Blunsom, 2013; Kalchbrenner, Grefenstette, & Blunsom, 2014; Le & Mikolov, 2014; Soyer et al., 2015). In this article, we have relied on a simple composition model based on vector addition, and have shown that this model performs very well in the SWTC task. However, in the long run this model is by no means sufficient to effectively capture all complex compositional phenomena in the data. Several models which aim to learn sentence and document embeddings have been proposed recently, but they critically rely on sentence-aligned parallel data. It is yet to be seen how to build structured multilingual phrase, sentence and document embeddings solely on the basis of comparable data. Such low-cost multilingual embeddings beyond the word level extracted from comparable data may find their application in a variety of tasks such as statistical machine translation (Mikolov et al., 2013b; Zou et al., 2013; Zhang et al., 2014; Wu et al., 2014), semantic tasks such as multilingual semantic textual similarity (Agirre et al., 2014), cross-lingual information retrieval (Vulić et al., 2013; Vulić & Moens, 2015) or cross-lingual document classification (Klementiev et al., 2012; Hermann & Blunsom, 2014b; Chandar et al., 2014).
In another future research path, we may use the knowledge of BWEs obtained by BWESG from document-aligned data to learn bilingual correspondences (e.g., word translation pairs or lists of semantically similar words across languages), which may in turn be used for learning from large unaligned multilingual datasets (Mikolov et al., 2013b; Al-Rfou, Perozzi, & Skiena, 2013). In the long run, this idea may lead to large-scale learning models from huge amounts of multilingual data without any requirement for parallel data or manually built bilingual lexicons.

Figure 2: The architecture of our BWE Skip-Gram (BWESG) model for learning bilingual word embeddings from document-aligned comparable data with two different pretraining strategies: (1) non-deterministic merge and shuffle, (2) deterministic length-ratio shuffle. Source language words and documents are drawn as gray boxes, while target language words and documents are drawn as blue boxes. The right side of the figures (separated by vertical dashed lines) illustrates how a pseudo-bilingual document is constructed from a pair of two aligned documents.

Figure 3: Acc 1 scores in the BLE task with BWESG length-ratio shuffle for all 3 language pairs, and varying values for parameters cs and d. Solid (red) horizontal lines denote the highest baseline Acc 1 scores for each language pair. Thicker dotted lines refer to BWESG without shuffling.
Window Size: Number of Training Pairs. The results confirm the intuition that larger window sizes, i.e., more training examples, lead to better results in the BLE task. For all embedding dimensions d, BWESG performs better with cs = 48 than with cs = 16, and the performance with cs = 48 and cs = 60 seems relatively stable: intuitively, more training pairs lead to a slightly better BLE performance, but the curve slowly flattens out (Figures 3(a)-3(c)).

Figure 4: Comparison of BWESG (solid curves) with two other models that rely on parallel training data: (1) BilBOWA (dotted curves), (2) BiCVM: the BWE induction model initially developed for parallel sentence-aligned data (dashed horizontal lines). All models were trained on the same sentence-aligned training Europarl data with exactly the same vocabularies. BLE is performed over the same search space for all models. The x axes are in log scale.

Table 1: Training data statistics: non-parallel document-aligned Wikipedia vs. parallel sentence-aligned Europarl for all three language pairs. OTHER = ES, IT or NL. Lengths are measured in word tokens. Averages are rounded to the closest integer.
from approximately 1.5M tokens for NL-EN to 4M for ES-EN. Exactly the same training data and vocabularies are used to train all representation models in comparison (both from Group I and Group II, see Section 4).

Table 6: A comparison of SWTC models for Spanish-English, Italian-English and Dutch-English with all bilingual word representations learned from document-aligned Wikipedia data. The asterisk (*) denotes statistically significant improvements of BWESG+add over the strongest baseline according to McNemar's statistical significance test (p < 0.05). Highest scores per column are in bold.

Table 7: A comparison of the best scoring baseline model BMu+Cue+S and the best scoring BWESG+add model over different clusters of words (2-sense, 3-sense and 4-sense words) for Spanish-English.