Neural Machine Translation: A Review

The field of machine translation (MT), the automatic translation of written text from one natural language into another, has experienced a major paradigm shift in recent years. Statistical MT, which mainly relies on various count-based models and which used to dominate MT research for decades, has largely been superseded by neural machine translation (NMT), which tackles translation with a single neural network. In this work we will trace back the origins of modern NMT architectures to word and sentence embeddings and earlier examples of the encoder-decoder network family. We will conclude with a survey of recent trends in the field.

will give a comprehensive overview of current research in the field. For even more insight into the field of neural machine translation, we refer the reader to other overview papers such as [18][19][20][21].

Nomenclature
We will denote the source sentence of length $I$ as $x$. We use the subscript $i$ to index tokens in the source sentence. We refer to the source language vocabulary as $\Sigma_{src}$:
$$x = x_1^I = (x_1, \dots, x_I) \in \Sigma_{src}^I \tag{1}$$
The translation of source sentence $x$ into the target language is denoted as $y$. We use an analogous nomenclature on the target side:
$$y = y_1^J = (y_1, \dots, y_J) \in \Sigma_{trg}^J \tag{2}$$
In case we deal with only one language we drop the subscript src/trg. For convenience we represent tokens as indices in a list of subwords or word surface forms. Therefore, $\Sigma_{src}$ and $\Sigma_{trg}$ are the first $n$ natural numbers (i.e. $\Sigma = \{n' \in \mathbb{N} \mid n' \leq n\}$ where $n = |\Sigma|$ is the vocabulary size). Additionally, we use the projection function $\pi_k$ which maps a tuple or vector to its $k$-th entry:
$$\pi_k(z_1, \dots, z_k, \dots, z_n) = z_k \tag{3}$$
For a matrix $A \in \mathbb{R}^{m \times n}$ we denote the element in the $p$-th row and the $q$-th column as $A_{p,q}$, the $p$-th row vector as $A_{p,:} \in \mathbb{R}^n$ and the $q$-th column vector as $A_{:,q} \in \mathbb{R}^m$. For a series of $m$ $n$-dimensional vectors $a_p \in \mathbb{R}^n$ ($p \in [1, m]$) we denote the $m \times n$ matrix which results from stacking the vectors as rows as $(a_p)_{p=1:m}$, as illustrated with the following tautology:
$$A = (A_{p,:})_{p=1:m} = ((A_{:,q})_{q=1:n})^T \tag{4}$$
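The stacking notation can be checked on a small example. The following is a minimal sketch, assuming NumPy and 0-based array indices (the text uses 1-based indices); the names `from_rows` and `from_cols` are ours and purely illustrative.

```python
import numpy as np

# A small 2 x 3 example matrix (m = 2 rows, n = 3 columns).
A = np.arange(6).reshape(2, 3)
m, n = A.shape

# (A_{p,:})_{p=1:m}: stack the m row vectors, one per row.
from_rows = np.stack([A[p, :] for p in range(m)])

# ((A_{:,q})_{q=1:n})^T: stack the n column vectors as rows, then transpose.
from_cols = np.stack([A[:, q] for q in range(n)]).T

print(np.array_equal(A, from_rows), np.array_equal(A, from_cols))  # True True
```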

Word Embeddings
Representing words or phrases as continuous vectors is arguably one of the keys in connectionist models for NLP. To the best of our knowledge, continuous space word representations were first successfully used for language modelling [2,37]. The key idea is to represent a word $x \in \Sigma$ as a $d$-dimensional vector of real numbers. The size $d$ of the embedding layer is normally chosen to be much smaller than the vocabulary size ($d \ll |\Sigma|$) in order to obtain interesting representations. The mapping from the word to its distributed representation can be represented by an embedding matrix $E \in \mathbb{R}^{d \times |\Sigma|}$ [38]. The $x$-th column of $E$ (denoted as $E_x$) holds the $d$-dimensional representation for the word $x$.
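To make the role of the embedding matrix concrete, the following minimal sketch (assuming NumPy, a toy vocabulary of size 10, and random rather than trained weights) shows that selecting the $x$-th column of $E$ is equivalent to multiplying $E$ with a one-hot vector. The helper names `embed` and `one_hot` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 10, 4                 # |Sigma| = 10, embedding size d = 4 (toy values)
E = rng.normal(size=(d, vocab_size))  # embedding matrix E in R^{d x |Sigma|}

def embed(x: int) -> np.ndarray:
    """Return column E_x for a token index x in {1, ..., |Sigma|} (1-based, as in the text)."""
    return E[:, x - 1]

def one_hot(x: int) -> np.ndarray:
    """One-hot vector of length |Sigma| with a 1 at (1-based) position x."""
    v = np.zeros(vocab_size)
    v[x - 1] = 1.0
    return v

# Column lookup and multiplication with a one-hot vector give the same result.
assert np.allclose(embed(3), E @ one_hot(3))

# Embedding a whole sentence x = (3, 1, 7) yields an I x d matrix.
sentence = [3, 1, 7]
embedded = np.stack([embed(t) for t in sentence])
```

In practice the lookup form is preferred since it avoids materializing one-hot vectors.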
Learned continuous word representations have the potential of capturing morphological, syntactic and semantic similarity across words [38]. In neural machine translation, embedding matrices are usually trained jointly with the rest of the network using backpropagation [39] and a gradient based optimizer such as stochastic gradient descent. In other areas of NLP, pre-trained word embeddings trained on unlabelled text have become ubiquitous [40]. Methods for training word embeddings on raw text often take the context into account in which the word occurs frequently [41,42], or use cross-lingual information to improve embeddings [43,44].
A newly emerging type of contextualized word embeddings [45,46] is gaining popularity in various fields of NLP. Contextualized representations do not only depend on the word itself but on the entire input sentence. Thus, they cannot be described by a single embedding matrix but are usually generated by neural sequence models which have been trained under a language model objective. Most approaches either use LSTM [45,47] or Transformer architectures [48,49] but differ in the way these architectures are used to compute the word representations. Contextualized word embeddings have advanced the state-of-the-art in several NLP benchmarks [47,49,50]. Goldberg [51] showed that contextualized embeddings are remarkably sensitive to syntax. Choi et al. [52] reported gains from contextualizing word embeddings in NMT using a bag of words.

Phrase Embeddings
For various NLP tasks such as sentiment analysis or MT it is desirable to embed whole phrases or sentences instead of single words. For example, a distributed representation of the source sentence $x$ could be used as conditional for the distribution over the target sentences $P(y|x)$. Early approaches to phrase embedding were based on recursive autoencoders [53,54]. To represent a phrase $x \in \Sigma^I$ as a $d$-dimensional vector, Socher et al. [54] first trained a word embedding matrix $E \in \mathbb{R}^{d \times |\Sigma|}$. Then, they recursively applied an autoencoder network which finds $d$-dimensional representations for $2d$-dimensional inputs, where the input is the concatenation of two parent representations. The parent representations are either word embeddings or representations calculated by the same autoencoder from two different parents. The order in which representations are merged is determined by a binary tree over $x$ which can be constructed greedily [54] or derived from an Inversion Transduction Grammar (ITG) [55] [56]. Fig. 2a shows an example of a recursive autoencoder embedding a phrase with five words into a four-dimensional space. One of the disadvantages of recursive autoencoders is that the word and sentence embeddings need to have the same dimensionality. This restriction is not very critical in sentiment analysis because the sentence representation is only used to extract the sentiment of the writer [54]. In MT, however, the sentence representations need to convey enough information to condition the target sentence distribution on them, and thus should be higher dimensional than the word embeddings.

Sentence Embeddings
Kalchbrenner and Blunsom [9] used convolution to find vector representations of phrases or sentences and thus avoided the dimensionality issue of recursive autoencoders. As shown in Fig. 2b, their model yields n-gram representations at each convolution level, with n increasing with depth. The top level can be used as representation for the whole sentence. Other notable examples of using convolution for sentence representations include [57][58][59][60][61]. However, the convolution operations in these models lose information about the exact word order, and are thus more suitable for sentiment analysis than for tasks like machine translation. A recent line of work uses self-attention rather than convolution to find sentence representations [62][63][64]. Another interesting idea explored by Yu et al. [65] is to resort to (recursive) relation networks [66,67] which repeatedly aggregate pairwise relations between words in the sentence. Recurrent architectures are also commonly used for sentence representation. It has been noted that even random RNNs without any training can work surprisingly well for several NLP tasks [68][69][70].

Figure 3: (a) Source sentence is used to initialize the decoder state. (b) Source sentence is fed to the decoder at each time step.

Encoder-Decoder Networks with Fixed Length Sentence Encodings
Kalchbrenner and Blunsom [9] were the first to condition the target sentence distribution on a distributed fixed-length representation of the source sentence. Their recurrent continuous translation models (RCTM) I and II gave rise to a new family of so-called encoder-decoder networks, which is currently the prevailing architecture for NMT. Encoder-decoder networks are subdivided into an encoder network which computes a representation of the source sentence, and a decoder network which generates the target sentence from that representation. As introduced in Sec. 1 we denote the source sentence as $x = x_1^I$ and the target sentence as $y = y_1^J$. All existing NMT models define a probability distribution over the target sentences $P(y|x)$ by factorizing it into conditionals:
$$P(y|x) = \prod_{j=1}^{J} P(y_j \mid y_1^{j-1}, x) \tag{5}$$
Different encoder-decoder architectures differ vastly in how they model the distribution $P(y_j \mid y_1^{j-1}, x)$. We will first discuss encoder-decoder networks in which the encoder represents the source sentence as a fixed-length vector $c(x)$ like the methods in Sec. 4. The conditionals $P(y_j \mid y_1^{j-1}, x)$ are modelled as:
$$P(y_j \mid y_1^{j-1}, x) = g(y_j \mid s_j, y_{j-1}, c(x)) \tag{6}$$
where $s_j$ is the hidden state of a recurrent neural (decoder) network (RNN). We will formally introduce $s_j$ in Sec. 6.3. Gated activation functions such as the long short-term memory (LSTM) [71] or the gated recurrent unit (GRU) [10] are commonly used to alleviate the vanishing gradient problem [72] which makes it difficult to train RNNs to capture long-range dependencies. Deep architectures with stacked LSTM cells were used by Sutskever et al. [12]. The encoder can be a convolutional network as in the RCTM I [9], an LSTM network [12], or a GRU network [10]. $g(\cdot)$ is a feedforward network with a softmax layer at the end which takes as input the decoder state $s_j$ and an embedding of the previous target token $y_{j-1}$. In addition, $g(\cdot)$ may also take the source sentence encoding $c(x)$ as input to condition on the source sentence [9,10]. Alternatively, $c(x)$ is just used to initialize the decoder state $s_1$ [12,13]. Fig. 3 contrasts both methods.

Intuitively, once the source sentence has been encoded, the decoder starts generating the first target sentence symbol $y_1$ which is then fed back to the decoder network for producing the second symbol $y_2$. The algorithm terminates when the network produces the end-of-sentence symbol </s>. Sec. 7 explains more formally what we mean by the network "generating" a symbol $y_j$ and sheds more light on the aspect of decoding in NMT. Fig. 4 shows the complete architecture of Sutskever et al. [12] who presented one of the first working standalone NMT systems that did not rely on any SMT baseline. One of the reasons why this paper was groundbreaking is the simplicity of the architecture, which stands in stark contrast to traditional SMT systems that used a very large number of highly engineered features.

Different ways of providing the source sentence to the encoder network have been explored in the past. Cho et al. [10] fed the tokens to the encoder in the natural order in which they appear in the source sentence (cf. Fig. 5a). Sutskever et al. [12] reported gains from simply feeding the sequence in reversed order (cf. Fig. 5b). They argue that these improvements might be "caused by the introduction of many short term dependencies to the dataset" [12]. Bidirectional RNNs (BiRNN) [73] are able to capture both directions (cf. Fig. 5c) and are often used in attentional NMT [13].

Figure 4: The encoder-decoder architecture of Sutskever et al. [12]. The color coding indicates weight sharing.
Figure 5: (c) Bidirectional encoder used by Bahdanau et al. [13].

Attention
One problem of early NMT models which is still not fully solved (see Sec. 10.1) is that they often produced poor translations for long sentences [74]. Cho et al. [11] suggested that this weakness is due to the fixed-length source sentence encoding. Sentences with varying length convey different amounts of information. Therefore, despite being appropriate for short sentences, a fixed-length vector "does not have enough capacity to encode a long sentence with complicated structure and meaning" [11]. Pouget-Abadie et al. [75] tried to mitigate this problem by chopping the source sentence into short clauses. They composed the target sentence by concatenating the separately translated clauses. However, this approach does not cope well with long-distance reorderings as word reorderings are only possible within a clause. Bahdanau et al. [13] introduced the concept of attention to avoid having a fixed-length source sentence representation. Their model no longer uses a constant context vector $c(x)$ which encodes the whole source sentence. By contrast, the attentional decoder can place its attention only on parts of the source sentence which are useful for producing the next token. The constant context vector $c(x)$ is thus replaced by a series of context vectors $c_j(x)$, one for each time step $j$. We will first introduce attention as a general concept before describing the architecture of Bahdanau et al. [13] in detail in Sec. 6.3. We follow the terminology of Vaswani et al. [76] and describe attention as mapping $n$ query vectors to $n$ output vectors via a mapping table (or a memory) of $m$ key-value pairs. This view is related to memory-augmented neural networks which we will discuss in greater detail in Sec. 13.3. We make the simplifying assumption that all vectors have the same dimension $d$ so that we can stack the vectors into matrices $Q \in \mathbb{R}^{n \times d}$, $K \in \mathbb{R}^{m \times d}$, and $V \in \mathbb{R}^{m \times d}$.
Intuitively, for each query vector we compute an output vector as a weighted sum of the value vectors. The weights are determined by a similarity score between the query vector and the keys (cf. [76, Eq. 1]):
$$\text{Attention}(K, V, Q) = \text{softmax}(\text{score}(Q, K))\, V$$
The output of $\text{score}(Q, K)$ is an $n \times m$ matrix of similarity scores. The softmax function normalizes each row of that matrix (i.e. over the column index) so that the weights for each query vector sum up to one. A straightforward choice for $\text{score}(\cdot)$ proposed by Luong et al. [77] is the dot product (i.e. $\text{score}(Q, K) = QK^T$). The most common scoring functions are summarized in Tab. 2.
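The general attention operation described above can be sketched in a few lines. This is a minimal illustration, assuming NumPy and random inputs; the function names (`dot_score`, `scaled_dot_score`, `attention`) are ours and do not refer to any particular toolkit.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dot_score(Q, K):
    """Dot-product score: an n x m matrix of query-key similarities."""
    return Q @ K.T

def scaled_dot_score(Q, K):
    """Dot-product score scaled by d^{-0.5}."""
    return Q @ K.T / np.sqrt(Q.shape[-1])

def attention(K, V, Q, score=dot_score):
    """Map n queries to n outputs via m key-value pairs."""
    weights = softmax(score(Q, K), axis=-1)  # each row sums to one
    return weights @ V, weights

rng = np.random.default_rng(0)
n, m, d = 2, 5, 4                       # toy sizes
Q = rng.normal(size=(n, d))
K = rng.normal(size=(m, d))
V = rng.normal(size=(m, d))
outputs, weights = attention(K, V, Q, score=scaled_dot_score)
```

Here `outputs` has shape `(n, d)` and `weights` has shape `(n, m)`, matching the dimensions used in the text.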

Table 2: Common attention scoring functions.

Name                 Scoring function                                              Citation
Additive             $\text{score}(Q, K)_{p,q} = v^T \tanh(W Q_{p,:} + U K_{q,:})$   Bahdanau et al. [13]
Dot-product          $\text{score}(Q, K) = QK^T$                                    Luong et al. [77]
Scaled dot-product   $\text{score}(Q, K) = QK^T d^{-0.5}$                           Vaswani et al. [76]

A common way to use attention in NMT is at the interface between encoder and decoder. Bahdanau et al. [13] and Luong et al. [77] used the hidden decoder states $s_j$ as query vectors. Both the key and value vectors are derived from the hidden states $h_i$ of a recurrent encoder. Formally, this means that $Q = (s_j)_{j=1:J}$ are the query vectors, $n = J$ is the target sentence length, $K = V = (h_i)_{i=1:I}$ are the key and value vectors, and $m = I$ is the source sentence length. The outputs of the attention layer are used as time-dependent context vectors $c_j(x)$. In other words, rather than using a fixed-length sentence encoding $c(x)$ as in Sec. 5, at each time step $j$ we query a memory in which entries store (context-sensitive) representations of the source words. In this setup it is possible to derive an attention matrix $A \in \mathbb{R}^{J \times I}$ to visualize the learned relations between words in the source sentence and words in the target sentence:
$$A = \text{softmax}(\text{score}(Q, K))$$
Fig. 6 shows an example of $A$ from an English-German NMT system with additive attention. The attention matrix captures cross-lingual word relationships such as "is" → "ist" or "great" → "großer". The system has learned that the English source word "is" is relevant for generating the German target word "ist" and thus emits a high attention weight for this pair. Consequently, the context vector $c_j(x)$ at time step $j = 3$ mainly represents the source word "is" ($c_3(x) \approx h_2$). This is particularly significant as the system was not explicitly trained to align words but to optimize translation performance. However, as we will argue in Sec. 12.4, it would be wrong to think of $A$ as a soft version of a traditional SMT word alignment. An important generalization of attention is multi-head attention proposed by Vaswani et al. [76].
The idea is to perform $H$ attention operations instead of a single one where $H$ is the number of attention heads (usually $H = 8$). The query, key, and value vectors for the attention heads are linear transforms of $Q$, $K$, and $V$. The output of multi-head attention is the concatenation of the outputs of each attention head. The dimensionality of the attention heads is usually divided by $H$ to avoid increasing the number of parameters. Formally, it can be described as follows [76]:
$$\text{MultiHeadAttention}(K, V, Q) = \text{Concat}(\text{head}_1, \dots, \text{head}_H)\, W^O$$
$$\text{head}_h = \text{Attention}(K W_h^K, V W_h^V, Q W_h^Q)$$
with weight matrices $W^O \in \mathbb{R}^{d \times d}$ and $W_h^K, W_h^V, W_h^Q \in \mathbb{R}^{d \times \frac{d}{H}}$ for $h \in [1, H]$. Fig. 7 shows a multi-head attention module with three heads. Note that with multi-head attention it is no longer obvious how to derive a single attention weight matrix $A$ as shown in Fig. 6. Therefore, models using multi-head attention tend to be more difficult to interpret.

Figure 6: Attention weight matrix A for the translation from the English sentence "history is a great teacher ." to the German sentence "die Geschichte ist ein großer Lehrer .". Dark shades of blue indicate high attention weights.
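The following minimal sketch illustrates multi-head attention as described above, assuming NumPy, scaled dot-product scoring within each head, and randomly initialized projection matrices. The names `W_q`, `W_k`, `W_v`, and `W_o` are illustrative; each head works with dimensionality $d/H$.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(K, V, Q):
    """Single-head attention with scaled dot-product scoring."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

def multi_head_attention(K, V, Q, W_q, W_k, W_v, W_o):
    """Run H heads on linearly transformed inputs and concatenate the results."""
    heads = [
        attention(K @ wk, V @ wv, Q @ wq)   # each head: (n, d/H)
        for wq, wk, wv in zip(W_q, W_k, W_v)
    ]
    return np.concatenate(heads, axis=-1) @ W_o  # (n, d)

rng = np.random.default_rng(0)
n, m, d, H = 3, 5, 8, 4                 # toy sizes; d must be divisible by H
d_head = d // H
Q = rng.normal(size=(n, d))
K = rng.normal(size=(m, d))
V = rng.normal(size=(m, d))
W_q = rng.normal(size=(H, d, d_head))   # one projection matrix per head
W_k = rng.normal(size=(H, d, d_head))
W_v = rng.normal(size=(H, d, d_head))
W_o = rng.normal(size=(d, d))
output = multi_head_attention(K, V, Q, W_q, W_k, W_v, W_o)
```

Because each head has its own weight matrix over the $m$ memory entries, there is no single $n \times m$ weight matrix to inspect, which reflects the interpretability issue mentioned above.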

Attention Masks and Padding
NMT usually groups sentences into batches to make more efficient use of the available hardware and to reduce noise in gradient estimation (cf. Sec. 11.1). However, the central data structure of many machine learning frameworks [101,102] is the tensor, a multi-dimensional array with fixed dimensionality. Re-arranging source sentences as a tensor often results in some unused space as the sentences may vary in length. In practice, shorter sentences are filled up with a special padding symbol <pad> to match the length of the longest sentence in the batch (Fig. 8). Most implementations work with masks to avoid taking padded positions into account when computing the training loss. Attention layers also have to be restricted to non-padding symbols, which is usually realized by multiplying the attention weights by a mask that sets the attention weights for padding symbols to zero.

the first cold shower <pad> <pad>
even the monkey seems to want
a little coat of straw <pad>

Figure 8: A tensor containing a batch of three source sentences of different lengths ("the first cold shower", "even the monkey seems to want", "a little coat of straw", a haiku by Basho [100]). Short sentences are padded with <pad>. The training loss and attention masks are visualized with green (enabled) and red (disabled) background.
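Padding and masking can be sketched as follows, assuming NumPy and random scores and losses as stand-ins for real model outputs. Renormalizing the weights after applying the mask is our assumption; another common realization is to set the scores of padded positions to a large negative value before the softmax.

```python
import numpy as np

PAD = "<pad>"
batch = [
    "the first cold shower".split(),
    "even the monkey seems to want".split(),
    "a little coat of straw".split(),
]

# Pad every sentence to the length of the longest one in the batch.
max_len = max(len(s) for s in batch)
padded = [s + [PAD] * (max_len - len(s)) for s in batch]

# mask[b, i] is 1.0 for real tokens and 0.0 for padding.
mask = np.array([[tok != PAD for tok in s] for s in padded], dtype=float)

def masked_attention_weights(scores, key_mask):
    """Softmax over keys, zero out padded keys, then renormalize (assumption)."""
    z = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(z)
    w = w / w.sum(axis=-1, keepdims=True)
    w = w * key_mask
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(len(batch), 2, max_len))   # 2 hypothetical queries per sentence
weights = masked_attention_weights(scores, mask[:, None, :])

# Masked training loss: padded positions do not contribute.
token_loss = rng.random(size=mask.shape)             # stand-in per-token losses
loss = (token_loss * mask).sum() / mask.sum()
```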

Recurrent Neural Machine Translation
This section contains a complete formal description of the RNNsearch architecture of Bahdanau et al. [13] which was the first NMT model using attention. Recall that NMT uses the chain rule to decompose the probability $P(y|x)$ of a target sentence $y = y_1^J$ given a source sentence $x = x_1^I$ into left-to-right conditionals (Eq. 5). RNNsearch models the conditionals as follows [13, Eq. 2,4]:
$$P(y_j \mid y_1^{j-1}, x) = g(y_j \mid y_{j-1}, s_j, c_j(x)) \tag{11}$$
Similarly to Eq. 6, the function $g(\cdot)$ encapsulates the decoder network which computes the distribution for the next target token $y_j$ given the last produced token $y_{j-1}$, the RNN decoder state $s_j \in \mathbb{R}^n$, and the context vector $c_j(x) \in \mathbb{R}^m$. The sizes of the encoder and decoder hidden layers are denoted with $m$ and $n$, respectively. The context vector $c_j(x)$ is a distributed representation of the relevant parts of the source sentence. In NMT without attention [10,12] (Sec. 5), the context vector is constant and thus needs to encode the whole source sentence. Adding an attention mechanism results in different context vectors for each target sentence position $j$. This effectively addresses issues in NMT due to the limited capacity of a fixed context vector as illustrated in Fig. 9.
As outlined in Sec. 6.1, the context vectors $c_j(x)$ are weighted sums of source sentence annotations $h = (h_1, \dots, h_I)$. The annotations are produced by the encoder network. In other words, the encoder converts the input sequence $x$ to a sequence of annotations $h$ of the same length. Each annotation $h_i \in \mathbb{R}^m$ encodes information about the entire source sentence $x$ "with a strong focus on the parts surrounding the i-th word of the input sequence" [13]. In RNNsearch the encoder is a bidirectional RNN [13], so that each annotation is the concatenation of a forward and a backward hidden state:
$$h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i], \qquad \overrightarrow{h}_i = \overrightarrow{f}(x_i, \overrightarrow{h}_{i-1}), \qquad \overleftarrow{h}_i = \overleftarrow{f}(x_i, \overleftarrow{h}_{i+1})$$
The RNNs $\overrightarrow{f}(\cdot)$ and $\overleftarrow{f}(\cdot)$ are usually LSTM [71] or GRU [10] cells. The context vectors $c_j(x) \in \mathbb{R}^m$ are computed from the annotations as a weighted sum with weights $\alpha_j \in [0, 1]^I$ [13, Eq. 5]:
$$c_j(x) = \sum_{i=1}^{I} \alpha_{j,i} h_i$$
The weights are determined by the alignment model $a(\cdot)$:
$$\alpha_{j,i} = \frac{\exp(a(s_{j-1}, h_i))}{\sum_{i'=1}^{I} \exp(a(s_{j-1}, h_{i'}))}$$
where $a(s_{j-1}, h_i)$ is a feedforward neural network which estimates the importance of annotation $h_i$ for producing the $j$-th target token given the current decoder state $s_{j-1} \in \mathbb{R}^n$. In the terminology of Sec. 6.1, $h_i$ represent the keys and values, $s_j$ are the queries, and $a(\cdot)$ is the attention scoring function. The function $g(\cdot)$ in Eq. 11 does not only take the previous target token $y_{j-1}$ and the context vector $c_j$ but also the decoder hidden state $s_j$:
$$s_j = f(s_{j-1}, y_{j-1}, c_j)$$
where $f(\cdot)$ is modelled by a GRU or LSTM cell. The function $g(\cdot)$ is defined as follows:
$$g(y_j \mid y_{j-1}, s_j, c_j) \propto \exp(W_o \max(t_j, u_j)) \tag{18}$$
with
$$t_j = T_s s_j + T_y E_{y_{j-1}} + T_c c_j \tag{19}$$
$$u_j = U_s s_j + U_y E_{y_{j-1}} + U_c c_j \tag{20}$$
where $\max(\cdot)$ is the element-wise maximum, and $W_o \in \mathbb{R}^{|\Sigma_{trg}| \times l}$, $T_s, U_s \in \mathbb{R}^{l \times n}$, $T_y, U_y \in \mathbb{R}^{l \times k}$, $E \in \mathbb{R}^{k \times |\Sigma_{trg}|}$, $T_c, U_c \in \mathbb{R}^{l \times m}$ are weight matrices. The definition of $g(\cdot)$ can be seen as connecting the output of the recurrent layer, a $k$-dimensional embedding of the previous target token, and the context vector with a single maxout layer [103] of size $l$ and using a softmax over the target language vocabulary [13]. Fig. 10 illustrates the complete RNNsearch model.

Figure 9: The RNNsearch model following Bahdanau et al. [13]. The color coding indicates weight sharing. Gray arrows represent attention.
Figure 10: Illustration of the attention mechanism in RNNsearch [13].
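One decoder step of an RNNsearch-style model can be sketched as below. This is a simplified illustration under several assumptions that are not part of the original description: NumPy with random untrained weights, an additive alignment model with hidden size `a_dim`, and a plain tanh recurrent cell standing in for the GRU or LSTM cell $f(\cdot)$. Token indices are 0-based here.

```python
import numpy as np

rng = np.random.default_rng(0)
I, m, n, k, l, vocab = 5, 6, 4, 3, 7, 10   # toy source length and layer sizes
a_dim = 8                                   # alignment model hidden size (assumption)

def rand(*shape):
    return rng.normal(scale=0.1, size=shape)

h = rand(I, m)                               # encoder annotations h_1..h_I
E = rand(k, vocab)                           # target embedding matrix
W_a, U_a, v_a = rand(a_dim, n), rand(a_dim, m), rand(a_dim)  # alignment model a(.)
R_s, R_y, R_c = rand(n, n), rand(n, k), rand(n, m)           # stand-in for f(.)
T_s, T_y, T_c = rand(l, n), rand(l, k), rand(l, m)
U_s, U_y, U_c = rand(l, n), rand(l, k), rand(l, m)
W_o = rand(vocab, l)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def one_step(s_prev, y_prev, h):
    """Return the new decoder state, the next-token distribution, and the attention weights."""
    # Attention weights from the additive alignment model a(s_{j-1}, h_i).
    alpha = softmax(np.tanh(W_a @ s_prev + h @ U_a.T) @ v_a)   # shape (I,)
    c = alpha @ h                                              # context vector c_j
    e = E[:, y_prev]                                           # embedding of y_{j-1}
    # Simplified recurrent update (a GRU or LSTM cell in the original model).
    s = np.tanh(R_s @ s_prev + R_y @ e + R_c @ c)
    # Maxout output layer followed by a softmax over the target vocabulary.
    t = T_s @ s + T_y @ e + T_c @ c
    u = U_s @ s + U_y @ e + U_c @ c
    p = softmax(W_o @ np.maximum(t, u))
    return s, p, alpha

s, p, alpha = one_step(np.zeros(n), 0, h)    # token index 0 plays the role of <s>
```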

Convolutional Neural Machine Translation
Although convolutional neural networks (CNNs) were first proposed by Waibel et al. [104] for phoneme recognition, their traditional use case is computer vision [105][106][107]. CNNs are especially useful for processing images for two reasons. First, they use a high degree of weight tying and thus reduce the number of parameters dramatically compared to fully connected networks. This is crucial for high dimensional input like visual imagery. Second, they automatically learn space invariant features. Spatial invariance is desirable in vision since we often aim to recognize objects or features regardless of their exact position in the image. In NLP, convolutions are usually one-dimensional since we are dealing with sequences rather than two-dimensional images as in computer vision. We will therefore limit our discussions to the one-dimensional case. We will also exclude concepts like pooling or strides as they are uncommon for sequence models in NLP.
The input to a 1D convolutional layer is a sequence of $M$-dimensional vectors $u_1, \dots, u_I$. The literature about CNNs usually refers to the $M$ dimensions in each $u_i \in \mathbb{R}^M$ ($i \in [1, I]$) as channels, and to the $i$-axis as spatial dimension. The convolution transforms the input sequence $u_1, \dots, u_I$ to an output sequence of $N$-dimensional vectors $v_1, \dots, v_I$ of the same length by moving a kernel of width $K$ over the input sequence. The kernel is a linear transform which maps the $K$-gram $u_i, \dots, u_{i+K-1}$ to the output $v_i$ for $i \in [1, I]$ (we append $K - 1$ padding symbols to the input). Standard convolution parameterizes this linear transform with a full weight matrix $W^{std} \in \mathbb{R}^{KM \times N}$:
$$\text{StdConv:} \quad v_{i,n} = \sum_{m=1}^{M} \sum_{k=0}^{K-1} W^{std}_{kM+m,n}\, u_{i+k,m} \tag{21}$$
with $i \in [1, I]$ and $n \in [1, N]$. Standard convolution represents two kinds of dependencies: spatial dependency (inner sum in Eq. 21) and cross-channel dependency (outer sum in Eq. 21). Pointwise and depthwise convolution factor out these dependencies into two separate operations:
$$\text{PointwiseConv:} \quad v_{i,n} = \sum_{m=1}^{M} W^{pw}_{m,n}\, u_{i,m}$$
$$\text{DepthwiseConv:} \quad v_{i,n} = \sum_{k=0}^{K-1} W^{dw}_{k+1,n}\, u_{i+k,n}$$
where $W^{pw} \in \mathbb{R}^{M \times N}$ and $W^{dw} \in \mathbb{R}^{K \times N}$ are weight matrices. Fig. 11 illustrates the differences between these types of convolution. The idea behind depthwise separable convolution is to replace standard convolution with depthwise convolution followed by pointwise convolution. As shown in Tab. 3, the decomposition into two simpler steps reduces the number of parameters and has been shown to make more efficient use of the parameters than regular convolution in vision [108,109].

Using convolution rather than recurrence in NMT models has several potential advantages. First, convolutions reduce sequential computation and are therefore easier to parallelize on GPU hardware. Second, their hierarchical structure connects distant words via a shorter path than sequential topologies [110] which eases learning [72]. Both regular [110][111][112] and depthwise separable [113,114] convolution have been used for NMT in the past. Fig. 12a shows the general architecture for a fully convolutional NMT model such as ConvS2S [110] or SliceNet [113] in which both encoder and decoder are convolutional. Stacking multiple convolutional layers increases the effective context size. In the decoder, we need to mask the receptive field of the convolution operations to make sure that the network has no access to future information [115]. Encoder and decoder are connected via attention. Gehring et al. [110] used attention into the encoder representations after each convolutional layer in the decoder.

Figure 12: (a) NMT with a convolutional encoder and a convolutional decoder like in the ConvS2S architecture [110]. (b) Purely attention-based NMT as proposed by Vaswani et al. [76] with two layers.
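The relation between standard and depthwise separable convolution can be illustrated with a minimal NumPy sketch. It assumes zero vectors as padding symbols and applies the depthwise step to the $M$ input channels (so the depthwise kernel has shape $K \times M$ here), followed by a pointwise step from $M$ to $N$ channels. The last lines show that this composition is a standard convolution whose weight matrix is constrained to a factorized form.

```python
import numpy as np

def pad_right(u, K):
    """Append K-1 zero vectors so that the output has the same length as the input."""
    return np.concatenate([u, np.zeros((K - 1, u.shape[1]))], axis=0)

def std_conv(u, W_std, K):
    """Standard convolution. u: (I, M), W_std: (K*M, N) -> (I, N)."""
    I, M = u.shape
    up = pad_right(u, K)
    # Flattening the K-gram row by row gives index k*M + m (0-based).
    return np.stack([up[i:i + K].reshape(K * M) @ W_std for i in range(I)])

def depthwise_conv(u, W_dw):
    """Depthwise convolution. u: (I, C), W_dw: (K, C): each channel is filtered separately."""
    K, I = W_dw.shape[0], u.shape[0]
    up = pad_right(u, K)
    return np.stack([(up[i:i + K] * W_dw).sum(axis=0) for i in range(I)])

def pointwise_conv(u, W_pw):
    """Pointwise convolution. u: (I, M), W_pw: (M, N): mixes channels at each position."""
    return u @ W_pw

rng = np.random.default_rng(0)
I, M, N, K = 6, 4, 5, 3                      # toy sizes
u = rng.normal(size=(I, M))
W_dw = rng.normal(size=(K, M))
W_pw = rng.normal(size=(M, N))

separable = pointwise_conv(depthwise_conv(u, W_dw), W_pw)

# Equivalent standard convolution with a factorized weight matrix.
W_std = (W_dw[:, :, None] * W_pw[None, :, :]).reshape(K * M, N)
standard = std_conv(u, W_std, K)

params_standard = K * M * N                  # full weight matrix
params_separable = K * M + M * N             # depthwise + pointwise
```

With these toy sizes the standard layer has 60 parameters and the separable variant 32.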

Self-attention-based Neural Machine Translation
Recall that Eq. 5 states that NMT factorizes $P(y|x)$ into conditionals $P(y_j \mid y_1^{j-1}, x)$. We have reviewed two ways to model the dependency on the source sentence $x$ in NMT: via a fixed-length sentence encoding $c(x)$ (Sec. 5) or via time-dependent context vectors $c_j(x)$ which are computed using attention (Sec. 6.1). We have also presented two ways to implement the dependency on the target sentence prefix $y_1^{j-1}$: via a recurrent connection which passes through the decoder state to the next time step (Sec. 6.3) or via convolution (Sec. 6.4). A third option to model target side dependency is using self-attention. Using the terminology introduced in Sec. 6.1, decoder self-attention derives all three components (queries, keys, and values) from the decoder state. The decoder conditions on the translation prefix $y_1^{j-1}$ by attending to its own states from previous time steps. Besides machine translation, self-attention has been applied to various NLP tasks such as sentiment analysis [116], natural language inference [62,96,117,118], text summarization [119], headline generation [120], sentence embedding [63,64,121], and reading comprehension [122]. Similarly to convolution, self-attention introduces short paths between distant words and reduces the amount of sequential computation. Studies indicate that these short paths are especially useful for learning strong semantic feature extractors, but (perhaps somewhat counter-intuitively) less so for modelling long-range subject-verb agreement [123]. As in convolutional models we also need to mask future decoder states to prevent conditioning on future tokens (cf. Sec. 6.2). The general layout for self-attention-based NMT models is shown in Fig. 12b. The first example of this new class of NMT models was the Transformer [76].
The Transformer uses attention for three purposes: 1) within the encoder to enable context-sensitive word representations which depend on the whole source sentence, 2) between the encoder and the decoder as in previous models, and 3) within the decoder to condition on the current translation history. The Transformer uses multi-head attention (Sec. 6.1) rather than regular attention. Using multi-head attention has been shown to be essential for the Transformer architecture [123,124].
A challenge in self-attention-based models (and to some extent in convolutional models) is that vanilla attention as introduced in Sec. 6.1 by itself has no notion of order.
The key-value pairs in the memory are accessed purely based on the correspondence between key and query (content-based addressing) and not based on a location of the key in the memory (location-based; see footnote 7). This is less of a problem in recurrent NMT (Sec. 6.3) as queries, keys, and values are derived from RNN states and already carry a strong sequential signal due to the RNN topology. In the Transformer architecture, however, recurrent connections are removed in favor of attention. Vaswani et al. [76] tackled this problem using positional encodings. Positional encodings are (potentially partial) functions $\text{PE}: \mathbb{N} \rightarrow \mathbb{R}^D$ where $D$ is the word embedding size, i.e. they are $D$-dimensional representations of natural numbers. They are added to the (input and output) word embeddings to make them (and consequently the queries, keys, and values) position-sensitive. Vaswani et al. [76] stacked sine and cosine functions of different frequencies to implement $\text{PE}(\cdot)$. Written with 1-based dimension indices, their formulation reads:
$$\text{PE}_{sin}(n)_d = \begin{cases} \sin\left(n / 10000^{(d-1)/D}\right) & \text{if } d \text{ is odd} \\ \cos\left(n / 10000^{(d-2)/D}\right) & \text{if } d \text{ is even} \end{cases}$$
for $n \in \mathbb{N}$ and $d \in [1, D]$. Alternatively, positional encodings can be learned in an embedding matrix [110]:
$$\text{PE}_{learned}(n) = W_{:,n}$$
with weight matrix $W \in \mathbb{R}^{D \times N}$ for some sufficiently large $N$. The input to $\text{PE}(\cdot)$ is usually the absolute position of the word in the sentence [76,110], but relative positioning is also possible [125]. We will give an overview of extensions to the Transformer architecture in Sec. 13.1.
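Sinusoidal positional encodings in the style of Vaswani et al. [76] can be sketched as follows. The sketch assumes NumPy, 0-based array indices (so even array columns hold the sine terms), positions starting at 0, and an even embedding size $D$; the function name `sinusoidal_pe` and the random word embeddings are illustrative.

```python
import numpy as np

def sinusoidal_pe(num_positions: int, D: int) -> np.ndarray:
    """Return a (num_positions, D) matrix whose n-th row encodes position n."""
    assert D % 2 == 0, "this sketch assumes an even embedding size"
    pos = np.arange(num_positions)[:, None]          # (num_positions, 1)
    i = np.arange(D // 2)[None, :]                   # (1, D/2), one frequency per sin/cos pair
    angles = pos / (10000.0 ** (2 * i / D))
    pe = np.zeros((num_positions, D))
    pe[:, 0::2] = np.sin(angles)                     # even columns: sine
    pe[:, 1::2] = np.cos(angles)                     # odd columns: cosine
    return pe

pe = sinusoidal_pe(num_positions=50, D=16)

# Positional encodings are added to the word embeddings of a sentence of length I.
rng = np.random.default_rng(0)
I = 7
word_embeddings = rng.normal(size=(I, 16))           # stand-in embeddings
position_sensitive = word_embeddings + pe[:I]
```

Since the encodings are a fixed function of the position, they need no training and can be computed for positions not seen during training.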

Comparison of the Fundamental Architectures
As outlined in the previous sections, NMT can come in one of three flavors: recurrent, convolutional, or self-attention-based. In this section, we will discuss three concrete architectures in greater detail, one of each flavor. For an empirical comparison see [126]. Fig. 13 visualizes the data streams in Google's Neural Machine Translation system (GNMT) [14] as example of a recurrent network, the convolutional ConvS2S model [110], and the self-attention-based Transformer model [76] in plate notation. We excluded components like dropout [127], batch normalization [128], and layer normalization [129] to simplify the diagrams. All models fall in the general category of encoder-decoder networks, with the encoder in the left column and the decoder in the right column. Output probabilities are generated by a linear projection layer followed by a softmax activation at the end. They all use attention at each decoder layer to connect the encoder with the decoder, although the specifics differ. GNMT (Fig. 13a) uses regular attention, ConvS2S (Fig. 13b) adds the source word encodings to the values, and the Transformer (Fig. 13c) uses multi-head attention (Sec. 6.1). Residual connections [130] are used in all three architectures to encourage gradient flow in multi-layer networks. Positional encodings are used in ConvS2S and the Transformer, but not in GNMT. An interesting fusion is the RNMT+ model [124] shown in Fig. 13d which reintroduces ideas from the Transformer like multi-head attention into recurrent NMT. Other notable mixed architectures include Gehring et al. [112] who used a convolutional encoder with a recurrent decoder, Miculicich et al. [131], Wang et al. [132], Werlen et al. [133] who added self-attention connections to a recurrent decoder, Hao et al. [134] who used a Transformer encoder and a recurrent encoder in parallel, and Lin et al. [135] who equipped a recurrent decoder with a convolutional decoder to provide global target-side context.

Footnote 7: We will discuss cases in which both content and location are taken into account in Secs. 13.2 and 13.3.
Figure 13: (d) RNMT+ [124].

The Search Problem in NMT
So far we have described how NMT defines the translation probability $P(y|x)$. However, in order to apply these definitions directly, both the source sentence $x$ and the target sentence $y$ have to be given. They do not directly provide a method for generating a target sentence $y$ from a given source sentence $x$ which is the ultimate goal in machine translation. The task of finding the most likely translation $\hat{y}$ for a given source sentence $x$ is known as the decoding or inference problem:
$$\hat{y} = \arg\max_{y \in \Sigma_{trg}^*} P(y|x)$$
NMT decoding is non-trivial for mainly two reasons. First, the search space is vast as it grows exponentially with the sequence length. For example, if we assume a common vocabulary size of $|\Sigma_{trg}| = 32{,}000$, there are already more possible translations with 20 words or less than atoms in the observable universe ($32{,}000^{20} \gg 10^{82}$). Thus, complete enumeration of the search space is impossible. Second, as we will see in Sec. 10, certain types of model errors are very common in NMT. The mismatch between the most likely and the "best" translation has deep implications on search as more exhaustive search often leads to worse translations [136]. We will discuss possible solutions to both problems in the remainder of Sec. 7.
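The size of the search space can be verified with a short calculation. The sketch below uses exact integer arithmetic in Python and takes the figure of $10^{82}$ atoms from the text above.

```python
import math

vocab_size = 32_000
max_len = 20

# Number of token sequences with between 1 and 20 tokens.
num_translations = sum(vocab_size ** j for j in range(1, max_len + 1))
atoms_in_universe = 10 ** 82            # estimate used in the text

# Order of magnitude of the search space (roughly 10^90).
magnitude = math.log10(num_translations)
print(num_translations > atoms_in_universe)  # True
```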

Greedy and Beam Search
The most popular decoding algorithms for NMT are greedy search and beam search. Both search procedures are based on the left-to-right factorization of NMT in Eq. 5. Translations are built up from left to right while partial translation prefixes are scored using the conditionals P (y j |y j−1 1 , x). This means that both algorithms work in a timesynchronous manner: in each iteration j, partial hypotheses of (up to) length j are compared to each other, and a subset of them is selected for expansion in the next time step. The algorithms terminate if either all or the best of the selected hypotheses end with the end-of-sentence symbol </s> or if some maximum number of iterations is reached. Fig. 14 illustrates the difference between greedy search and beam search. Greedy search (highlighted in green) selects the single best expansion at each time step: 'c' at j = 1, 'a' at j = 2, and 'b' at j = 3. However, greedy search is vulnerable to the so-called garden-path problem [20]. The algorithm selects 'c' in the first time step which turns out to be a mistake later on as subsequent distributions are very smooth and scores are comparably low. However, greedy decoding cannot correct this mistake later as it is already committed to this path. Beam search (highlighted in orange in Fig. 14) tries to mitigate the risk of the garden-path problem by passing not one but n possible translation prefixes to the next time step (n = 2 in Fig. 14). The n hypotheses which survive a time step are called active hypotheses. At each time step, the accumulated path scores for all possible continuations of active hypotheses are compared, and the n best ones are selected. Thus, beam search does not only expand 'c' but also 'b' in time step 1, and thereby finds the high scoring translation prefix 'ba'. Note that although beam search seems to be the more accurate search procedure, it is not guaranteed to always find a translation with higher or equal score as greedy decoding. 
It is therefore still prone to the garden-path problem, although less so than greedy search. Stahlberg and Byrne [136] demonstrated that even beam search suffers from a high number of search errors.

Formal Description of Decoding for the RNNsearch Model
In this section, we will formally define decoding for the RNNsearch model [13]. We will resort to the mathematical symbols used in Sec. 6.3 to describe the algorithms. First, the source annotations h are computed and stored as this does not require any search. Then, we compute the distribution for the first target token $y_1$ using OneStepRNNsearch($s_{init}$, <s>, h) (Alg. 1). The initial decoder state $s_{init}$ is often a linear transform of the last encoder hidden state $h_I$:
$$s_{init} = W h_I$$
where $W$ is a weight matrix. Greedy decoding selects the most likely target token according to the returned distribution and iteratively calls OneStepRNNsearch(·) until the end-of-sentence symbol </s> is emitted (Alg. 2). In each iteration, Alg. 2 picks the next token as $y \leftarrow \arg\max_{w \in \Sigma_{trg}} \pi_w(p)$ and appends it to the translation prefix $\mathbf{y}$. We use the projection function $\pi_w(p)$ (Eq. 3) which maps the posterior vector $p \in \mathbb{R}^{|\Sigma_{trg}|}$ to the w-th component.
The beam search strategy (Alg. 3) keeps not only the single best partial hypothesis but a set of n promising hypotheses, where n is the size of the beam. A partial hypothesis is represented by a 3-tuple $(\mathbf{y}, p_{acc}, s)$ with the translation prefix $\mathbf{y} \in \Sigma_{trg}^*$, the accumulated score $p_{acc} \in \mathbb{R}$, and the last decoder state $s \in \mathbb{R}^n$. After the active hypotheses have been expanded into the candidate set $H_{next}$, Alg. 3 selects the n best candidates,
$$H_{cur} \leftarrow \{(\mathbf{y}, p_{acc}, s) \in H_{next} : |\{(\mathbf{y}', p'_{acc}, s') \in H_{next} : p'_{acc} > p_{acc}\}| < n\},$$
and repeats until the best hypothesis $(\hat{\mathbf{y}}, \hat{p}_{acc}, \hat{s}) = \arg\max_{(\mathbf{y}, p_{acc}, s) \in H_{cur}} p_{acc}$ ends with the end-of-sentence symbol, i.e. $\hat{y}_{|\hat{\mathbf{y}}|} = $ </s>, at which point $\hat{\mathbf{y}}$ is returned.
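The two search procedures can be sketched in a few lines of Python. This is a minimal illustration rather than the RNNsearch decoder itself: the hypothetical `toy_step` function stands in for OneStepRNNsearch(·), maps a translation prefix directly to a next-token distribution, and carries no decoder state. The toy probabilities are invented so that greedy search runs into a garden path similar in spirit to the one discussed for Fig. 14.

```python
import math

EOS = "</s>"

def toy_step(prefix):
    """Hypothetical stand-in for OneStepRNNsearch: prefix -> next-token distribution."""
    table = {
        (): {"c": 0.5, "b": 0.4, "a": 0.1},
        ("c",): {"a": 0.4, "b": 0.3, EOS: 0.3},
        ("b",): {"a": 0.9, EOS: 0.1},
    }
    return table.get(prefix, {EOS: 1.0})

def greedy_search(step, max_len=50):
    """Commit to the single most likely token at each time step."""
    prefix = []
    for _ in range(max_len):
        dist = step(tuple(prefix))
        token = max(dist, key=dist.get)
        prefix.append(token)
        if token == EOS:
            break
    return prefix

def beam_search(step, n=2, max_len=50):
    """Keep the n best partial hypotheses (prefix, accumulated log-probability)."""
    beam = [((), 0.0)]
    for _ in range(max_len):
        candidates = []
        for prefix, score in beam:
            if prefix and prefix[-1] == EOS:
                candidates.append((prefix, score))  # finished: carry over unchanged
                continue
            for token, prob in step(prefix).items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        # Select the n best candidates as the new active hypotheses.
        beam = sorted(candidates, key=lambda h: h[1], reverse=True)[:n]
        if beam[0][0][-1] == EOS:  # stop once the best hypothesis is complete
            break
    return list(beam[0][0]), beam[0][1]

print(greedy_search(toy_step))     # greedy commits to 'c' in the first step
print(beam_search(toy_step, n=2))  # beam search also keeps 'b' alive
```

With n = 1 the beam search above reduces to greedy search, which is a convenient sanity check for an implementation.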

Ensembling
Ensembling [137,138] is a simple yet very effective technique to improve the accuracy of NMT. The basic idea is illustrated in Fig. 15. The decoder makes use of K NMT networks rather than only one, which are either trained independently [12,14,139] or share some amount of training iterations [140][141][142]. The ensemble decoder computes predictions for each of the individual models which are then combined using the arithmetic [12] or geometric [141] average:
$$S_{arith}(y_j \mid y_1^{j-1}, x) = \frac{1}{K} \sum_{k=1}^{K} P_k(y_j \mid y_1^{j-1}, x)$$
$$\log S_{geo}(y_j \mid y_1^{j-1}, x) = \frac{1}{K} \sum_{k=1}^{K} \log P_k(y_j \mid y_1^{j-1}, x)$$
where $P_k(\cdot)$ denotes the distribution of the k-th model. Both $S_{arith}(\cdot)$ and $S_{geo}(\cdot)$ can be used as drop-in replacement for the conditionals $P(y_j \mid y_1^{j-1}, x)$ in Eq. 5. The arithmetic average is more sound as $S_{arith}(\cdot)$ still forms a valid probability distribution which sums up to one. However, the geometric average $S_{geo}(\cdot)$ is numerically more stable as log-probabilities can be directly combined without converting them to probabilities. Note that the core idea of ensembling is similar to language model interpolation used in statistical machine translation or speech recognition.
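The difference between the two averaging schemes can be made concrete with a small sketch. It assumes that the arithmetic variant is the mean of the model probabilities and that the geometric variant is the mean of the model log-probabilities; the function names and the two toy distributions are illustrative and not taken from any particular toolkit.

```python
import math

def ensemble_arith(dists):
    """Arithmetic average of K next-token distributions (dicts token -> probability)."""
    k = len(dists)
    return {w: sum(d[w] for d in dists) / k for w in dists[0]}

def ensemble_geo_log(log_dists):
    """Geometric average in the log domain: mean of K log-probability dicts."""
    k = len(log_dists)
    return {w: sum(d[w] for d in log_dists) / k for w in log_dists[0]}

p1 = {"a": 0.7, "b": 0.2, "c": 0.1}
p2 = {"a": 0.1, "b": 0.6, "c": 0.3}

arith = ensemble_arith([p1, p2])
geo = ensemble_geo_log([{w: math.log(p) for w, p in d.items()} for d in (p1, p2)])

print(sum(arith.values()))                     # the arithmetic average still sums to one
print(sum(math.exp(v) for v in geo.values()))  # the geometric average generally does not
```

The geometric scores can be used directly as path scores in beam search, but they would need to be renormalized if a proper distribution is required.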
Ensembling consistently outperforms single NMT by a large margin. All top systems in recent machine translation evaluation campaigns ensemble a number of NMT systems [126,[139][140][141][142][143][144][145][146][147][148][149][150], perhaps most famously taken to the extreme by the WMT18 submission of Tencent that ensembled up to 72 translation models [150]. However, the decoding speed is significantly worse since the decoder needs to apply K NMT models rather than only one. This means that the decoder has to perform K times as many forward passes through the networks, and has to apply the expensive softmax function K times as often in each time step. Ensembling also often increases the number of CPU/GPU switches and the communication overhead between CPU and GPU when averaging is implemented on the CPU. Ensembling is also often more difficult to implement than single system NMT. Knowledge distillation, which we will discuss in Sec. 16, is one method to deal with the shortcomings of ensembling. Stahlberg and Byrne [151] proposed to unfold the ensemble into a single network and shrink the unfolded network afterwards for efficient ensembling.
In NMT, all models in an ensemble usually have the same size and topology and are trained on the same data. They differ only due to the random weight initialization and the randomized order of the training samples. Notable exceptions include Freitag and Al-Onaizan [152] who use ensembling to prevent overfitting in domain adaptation, He et al. [153] who combined models that selected their training data based on marginal likelihood, and the UCAM submission to WMT18 [126] that ensembled different NMT architectures with each other. When all models are equally powerful and are trained with the same data, it is surprising that ensembling is so effective. One common narrative is that different models make different mistakes, but the mistake of one model can be outvoted by the others in the ensemble [156]. This explanation is plausible for NMT since translation quality can vary widely between training runs [157]. The variance in translation performance may also indicate that the NMT error surface is highly non-convex such that the optimizer often ends up in local optima. Ensembling might mitigate this problem. Ensembling may also have a regularization effect on the final translation scores [158].
Checkpoint averaging [28,159] is a technique which is often discussed in conjunction with ensembling [160]. Checkpoint averaging keeps track of the few most recent checkpoints during training, and averages their weight matrices to create the final model. This results in a single model and thus does not increase the decoding time. Therefore, it has become a very common technique in NMT [76,126,161]. Checkpoint averaging addresses a quite different problem than ensembling as it mainly smooths out minor fluctuations in the training curve which are due to the optimizer's update rule or noise in the gradient estimation due to mini-batch training. In contrast, the weights of independently trained models are very different from each other, and there is no obvious direct correspondence between neuron activities across the models. Therefore, checkpoint averaging cannot be applied to independently trained models.
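Checkpoint averaging itself amounts to an element-wise mean over parameter tensors. The sketch below assumes, purely for illustration, that each checkpoint is a dictionary mapping a parameter name to a flat list of floats; real toolkits store tensors, but the arithmetic is the same.

```python
def average_checkpoints(checkpoints):
    """Element-wise mean of parameter dicts of the form {name: [float, ...]}."""
    k = len(checkpoints)
    return {
        name: [sum(vals) / k for vals in zip(*(c[name] for c in checkpoints))]
        for name in checkpoints[0]
    }

# Three hypothetical checkpoints taken late in the same training run.
last_checkpoints = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
    {"w": [5.0, 6.0], "b": [2.0]},
]
print(average_checkpoints(last_checkpoints))
```

The result is a single set of weights, so decoding cost is the same as for any one of the averaged checkpoints.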

Decoding Direction
Standard NMT factorizes the probability P(y|x) from left to right (L2R) according to Eq. 5. Mathematically, the left-to-right order is rather arbitrary, and other arrangements such as a right-to-left (R2L) factorization are equally correct:
$$P(y \mid x) = \prod_{j=1}^{J} P(y_j \mid y_{j+1}^{J}, x)$$
where J is the length of the target sentence. NMT models which produce the target sentence in reverse order have led to some gains in evaluation systems when combined with left-to-right models [126,140,148,150]. A common combination scheme is based on rescoring: A strong L2R ensemble first creates an n-best list which is then rescored with an R2L model [140,162]. Stahlberg et al. [126] used R2L models via a minimum Bayes risk framework. The L2R and R2L systems are normally trained independently, although some recent work proposes joint training schemes in which each direction is used as a regularizer for the other direction [163,164]. Other orderings besides L2R and R2L have also been proposed such as middle-out [165], top-down in a binary tree [166], insertion-based [167][168][169][170], or in source sentence order [171].
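The rescoring scheme can be sketched as follows. The linear interpolation with a fixed `weight` is an assumption made for this illustration, since the cited systems differ in how they combine scores, and `toy_r2l_score` is a hypothetical lookup that stands in for a trained R2L model.

```python
def rescore_nbest(nbest, r2l_score, weight=0.5):
    """Rerank (tokens, l2r_log_prob) pairs with an R2L scorer.

    r2l_score receives the reversed token sequence and returns a log-probability.
    """
    rescored = []
    for tokens, l2r in nbest:
        combined = (1.0 - weight) * l2r + weight * r2l_score(tokens[::-1])
        rescored.append((tokens, combined))
    return sorted(rescored, key=lambda h: h[1], reverse=True)

# Hypothetical R2L scores for two reversed hypotheses.
R2L_TABLE = {
    ("green", "house", "the"): -3.0,
    ("house", "green", "the"): -0.5,
}

def toy_r2l_score(reversed_tokens):
    return R2L_TABLE[tuple(reversed_tokens)]

nbest = [(["the", "house", "green"], -1.0), (["the", "green", "house"], -1.2)]
print(rescore_nbest(nbest, toy_r2l_score)[0][0])
```

In this toy example the L2R model slightly prefers the wrong word order, and the R2L scores overturn that preference.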
Another way to give the decoder access to the full target-side context is the two-stage approach of Li et al. [172] who first drafted a translation, and then employed a multisource NMT system to generate the final translation from both the source sentence and the draft. Zhang et al. [173] proposed a similar scheme but generated the draft translations in reverse order. A similar two-pass approach was used by ElMaghraby and Rafea [174] to make Arabic MT more robust against domain shifts. Geng et al. [175] used reinforcement learning to choose the best number of decoding passes.
Besides explicit combination with an R2L model and multi-pass strategies, we are aware of the following efforts to make the decoder more sensitive to the right-side target context: He et al. [176] used reinforcement learning to estimate the long-term value of a candidate. Lin et al. [135] provided global target sentence information to a recurrent decoder via a convolutional model. Hoang et al. [177] proposed a very appealing theoretical framework to relax the discrete NMT optimization problem into a continuous optimization problem which allows both decoding directions to be included.

Efficiency
NMT decoding is very fast on GPU hardware and can reach up to 5000 words per second. However, GPUs are very expensive, and speeding up CPU decoding to the level of SMT remains more challenging. Therefore, how to improve the efficiency of neural sequence decoding algorithms is still an active research question. One bottleneck is the sequential left-to-right order of beam search which makes parallelization difficult. Stern et al. [178] suggested computing multiple time steps in parallel and validating translation prefixes afterwards. Kaiser et al. [179] reduced the amount of sequential computation by learning a sequence of latent discrete variables which is shorter than the actual target sentence, and generating the final sentence from this latent representation in parallel. Di Gangi and Federico [180] sped up recurrent NMT by using a simplified architecture for recurrent units. Another line of research tries to reintroduce the idea of hypothesis recombination to neural models. This technique is used extensively in traditional SMT [181]. The idea is to keep only the better of two partial hypotheses if it is guaranteed that both will be scored equally in the future. For example, this is the case for n-gram language models if both hypotheses end with the same n-gram. The problem in neural sequence models is that they condition on the full translation history. Therefore, hypothesis recombination for neural sequence models does not insist on exact equivalence but clusters hypotheses based on the similarity between RNN states or the n-gram history [182,183]. A similar idea was used by Lecorvé and Motlicek [184] to approximate RNNs with WFSTs which also requires mapping histories into equivalence classes.
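A minimal sketch of recombination based on the n-gram history is given below. It assumes that two partial hypotheses are treated as equivalent whenever their last `order` tokens match, which is only an approximation for neural models; the function name and the toy scores are illustrative.

```python
def recombine(hypotheses, order=2):
    """Keep only the best-scoring hypothesis per n-gram history.

    hypotheses: list of (tokens, accumulated_log_prob) pairs.
    order: number of trailing tokens that define an equivalence class.
    """
    best = {}
    for tokens, score in hypotheses:
        key = tuple(tokens[-order:])
        if key not in best or score > best[key][1]:
            best[key] = (tokens, score)
    return sorted(best.values(), key=lambda h: h[1], reverse=True)

hyps = [
    (["a", "b", "c"], -1.0),
    (["x", "b", "c"], -1.5),  # same last two tokens as the first hypothesis
    (["a", "b", "d"], -2.0),
]
print(recombine(hyps, order=2))
```

The second hypothesis is dropped because it shares its bigram history with a better-scoring one, which frees a slot in the beam for a genuinely different continuation.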
It is also possible to speed up beam search by reducing the beam size. Wu et al. [14] and Freitag and Al-Onaizan [185] suggested using a variable beam size, with various heuristics to decide the beam size at each time step. Alternatively, the NMT model training can be tailored towards the decoding algorithm [186][187][188][189]. Wiseman and Rush [187] proposed a loss function for NMT training which penalizes cases in which the reference falls off the beam during training. Kim and Rush [190] reported that knowledge distillation (discussed in Sec. 16) reduces the gap between greedy decoding and beam decoding significantly. Greedy decoding can also be improved by using a small actor network which modifies the hidden states in an already trained model [189,191].

Generating Diverse Translations
An issue with using beam search is that the hypotheses found by the decoder are very similar to each other and often differ only by one or two words [192][193][194]. The lack of diversity is problematic for several reasons. First, natural language in general and translation in particular often come with a high level of ambiguity that is not represented well by non-diverse n-best lists. Second, it impedes user interaction as NMT is not able to provide the user with alternative translations if needed. Third, collecting statistics about the search space such as estimating the probabilities of n-grams for minimum Bayes-risk decoding [126,[195][196][197][198][199] or risk-based training (Sec. 11.5) is much less effective.
Cho [200] added noise to the activations in the hidden layer of the decoder network to produce alternative high scoring hypotheses. This is justified by the observation that small variations of a hidden configuration encode semantically similar context [201]. Li and Jurafsky [192] and Li et al. [193] proposed a diversity promoting modification of the beam search objective function. They added an explicit penalization term to the NMT score based on a maximum mutual information criterion which penalizes hypotheses from the same parent node. Note that both extensions can be used together [200]. Vijayakumar et al. [202] suggested partitioning the active hypotheses into groups and using a dissimilarity term to ensure diversity between groups. Park et al. [203] found alternative translations by k-nearest neighbor search from the greedy translation in a translation memory.

Simultaneous Translation
Most of the research in MT assumes an offline scenario: a complete source sentence is to be translated to a complete target sentence. However, this basic assumption does not hold up for many real-life applications. For example, useful machine translation for parliamentary speeches and lectures [204,205] or voice call services such as Skype [206] not only has to produce good translations but also has to do so with very low latency [207]. To reduce the latency in such real-time speech-to-speech translation scenarios it is desirable to start translating before the full source sentence has been vocalized by the speaker. Most approaches frame simultaneous machine translation as a source sentence segmentation problem. The source sentence is revealed one word at a time. After a certain number of words, the segmentation policy decides to translate the current partial source sentence prefix and commit to a translation prefix which may not be a complete translation of the partial source. This process is repeated until the full source sentence is available. The segmentation policy can be heuristic [208] or learned with reinforcement learning [209,210]. The translation itself is usually carried out by a standard MT system which was trained on full sentences. This is sub-optimal for two reasons. First, using a system which was trained on full sentences to translate partial sentences is brittle due to the significant mismatch between training and testing time. Ma et al. [211] addressed this problem by training NMT to generate the target sentence with a fixed maximum latency to the source sentence. Second, human simultaneous interpreters use sophisticated strategies to reduce the latency by changing the grammatical structure [212][213][214]. These strategies are neglected by a vanilla translation system. Unfortunately, training data from human simultaneous translators is rare [213] which makes it difficult to adapt MT to it.

Using Large Output Vocabularies
As discussed in Sec. 2, NMT and other neural NLP models use embedding matrices to represent words as real-valued vectors. Embedding matrices need to have a fixed shape to make joint training with the translation model possible, and thus can only be used with a fixed and pre-defined vocabulary. This has several major implications for NMT.
First, the size of the embedding matrices grows with the vocabulary size. As shown in Tab. 4, the embedding matrices make up most of the model parameters of a standard RNNsearch model. Increasing the vocabulary size inflates the model drastically. Large models require a small batch size because they take more space in the (GPU) memory, but reducing the batch size often leads to noisier gradients, slower training, and eventually worse model performance [161]. Furthermore, a large softmax output layer is computationally very expensive. In contrast, traditional (symbolic) MT systems can easily use very large vocabularies [181,[215][216][217]. Besides these practical issues, training embedding matrices for large vocabularies is also complicated by the long-tail distribution of words in a language. Zipf's law [218] states that the frequency of any word and its rank in the frequency table are inversely proportional to each other. Fig. 16 shows that 843K of the 875K distinct words (96.5%) occur fewer than 100 times in an English text with 140M running words, that is, each of them accounts for less than 0.00007% of the entire text. It is difficult to train robust word embeddings for such rare words. Word-based NMT models address this issue by restricting the vocabulary to the n most frequent words, and replacing all other words by a special token UNK. A problem with that approach is that the UNK token may appear in the generated translation. In fact, limiting the vocabulary to the 30K most frequent words results in an out-of-vocabulary (OOV) rate of 2.9% on the training set (Fig. 16). That means an UNK token can be expected to occur roughly every 35 words. In practice, the number of UNKs is usually even higher. One simple reason is that the test set OOV rate is often higher than on the training set because the distribution of words and phrases naturally varies across genre, corpora, and time.
Another observation is that word-based NMT often prefers emitting UNK even if a more appropriate word is in the NMT vocabulary. This is possibly due to the imbalance between the UNK token and other words: replacing all rare words with the same UNK token leads to an over-representation of UNK in the training set, and therefore a strong bias towards UNK during decoding.
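The shortlist construction and the resulting OOV rate can be illustrated with a few lines of Python. The helper names and the nine-word toy corpus are made up for this sketch; with a real corpus, the expected distance between two UNK tokens is simply the reciprocal of the OOV rate (1/0.029 ≈ 34.5 words for the 2.9% quoted above).

```python
from collections import Counter

def build_vocab(tokens, n):
    """Shortlist vocabulary: the n most frequent word types."""
    return {w for w, _ in Counter(tokens).most_common(n)}

def apply_unk(tokens, vocab, unk="UNK"):
    """Replace every word outside the shortlist with the UNK token."""
    return [t if t in vocab else unk for t in tokens]

def oov_rate(tokens, vocab):
    """Fraction of running words that are not covered by the vocabulary."""
    return sum(t not in vocab for t in tokens) / len(tokens)

train = "the cat sat on the mat the cat ran".split()
vocab = build_vocab(train, 2)
print(apply_unk(train, vocab))
print(oov_rate(train, vocab))
```

Even in this tiny example the UNK token becomes the most frequent training token once the shortlist is applied, which mirrors the over-representation of UNK described above.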

Translation-specific Approaches
Jean et al. [219] distinguished between translation-specific and model-specific approaches. Translation-specific approaches keep the shortlist vocabulary in the original form, but correct UNK tokens afterwards. For example, the UNK replace technique [220,221] keeps track of the positions of source sentence words which correspond to the UNK tokens. In a post-processing step, the UNK tokens are replaced with the most likely translation of the aligned source word according to a bilingual word-level dictionary which was extracted from a word-aligned training corpus. Gulcehre et al. [222] followed a similar idea but used a special pointer network for referring to source sentence words. These approaches are rather ad-hoc because simple dictionary lookup without context is not a very strong model of translation. Li et al. [223] replaced each OOV word with a similar in-vocabulary word based on the cosine similarity between their distributed representations in a pre-processing step. However, this technique cannot tackle all OOVs as it is based on vector representations of words which are normally only available for a closed vocabulary. Moreover, the replacements might differ from the original meaning significantly. Further UNK replacement strategies were presented by Li et al. [224,225] and Miao et al. [226], but all share the inevitable limitation of all translation-specific approaches, namely that the translation model itself cannot discriminate between a large number of OOVs.
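The post-processing step of UNK replacement can be sketched as follows. The data layout is an assumption made for this example: `alignment` maps a target position to the aligned source position, and `dictionary` is a plain word-to-word lookup. Copying the source word when the dictionary has no entry is likewise an assumption of this sketch.

```python
def unk_replace(target, alignment, source, dictionary, unk="UNK"):
    """Replace UNK tokens using the aligned source word and a bilingual dictionary."""
    output = []
    for j, token in enumerate(target):
        if token == unk and j in alignment:
            src_word = source[alignment[j]]
            # Fall back to copying the source word if there is no dictionary entry.
            output.append(dictionary.get(src_word, src_word))
        else:
            output.append(token)
    return output

source = ["der", "Eiffelturm", "ist", "hoch"]
target = ["the", "UNK", "is", "tall"]
alignment = {1: 1}  # target position 1 is aligned to source position 1
dictionary = {"Eiffelturm": "Eiffel Tower"}
print(unk_replace(target, alignment, source, dictionary))
```

The lookup ignores context entirely, which is exactly the weakness of translation-specific approaches noted above.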

Model-specific Approaches
Model-specific approaches change the NMT model to make training with large vocabularies feasible. For example, Nguyen and Chiang [227] improved the translation of rare words in NMT by adding a lexical translation model which directly connects corresponding source and target words. Another very popular idea is to train networks to output probability distributions without using the full softmax [228]. Noise-contrastive estimation (NCE) [229,230] trains a logistic regression model which discriminates between real training examples and noise. For example, to train an embedding for a word w, Mnih and Kavukcuoglu [231] treat w as a positive example, and sample from the global unigram word distribution in the training data to generate negative examples. The logistic regression model is a binary classifier and thus does not need to sum over the full vocabulary. NCE has been used to train large vocabulary neural sequence models such as language models [232]. The technique falls into the category of self-normalizing training [228] because the model is trained to emit normalized distributions without explicitly summing over the output vocabulary. Self-normalization can also be achieved by adding the value of the partition function to the training loss [8], encouraging the network to learn parameters which generate normalized output.
Another approach (sometimes referred to as vocabulary selection) is to approximate the partition function of the full softmax by using only a subset of the vocabulary. This subset can be selected in different ways. For example, Jean et al. [219] applied importance sampling to select a small set of words for approximating the partition function. Both softmax sampling and UNK replace have been used in one of the winning systems at the WMT'15 evaluation on English-German [233]. Various methods have been proposed to select the vocabulary to normalize over during decoding, such as fetching all possible translations in a conventional phrase table [234], using the vocabulary of the translation lattices from a traditional MT system (local softmax) [235], and attention-based [236] and embedding-based [237] methods.
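The common core of these methods is a softmax whose partition function runs over a candidate subset rather than the full vocabulary. The sketch below is a generic illustration and not the procedure of any of the cited papers; how the subset is chosen is deliberately left open, and the logits are toy values.

```python
import math

def softmax_over_subset(logits, subset):
    """Normalize only over the words in `subset` instead of the full vocabulary."""
    m = max(logits[w] for w in subset)  # subtract the maximum for numerical stability
    exps = {w: math.exp(logits[w] - m) for w in subset}
    z = sum(exps.values())  # partition function restricted to the subset
    return {w: e / z for w, e in exps.items()}

logits = {"a": 2.0, "b": 1.0, "c": 0.0, "d": -1.0}
print(softmax_over_subset(logits, ["a", "b"]))
```

The cost of the normalization now scales with the size of the subset, at the price of assigning zero probability to every word outside it.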

Character-based NMT
Arguably, both translation-specific and model-specific approaches to word-based NMT are fundamentally flawed. Translation-specific techniques like UNK replace cannot discriminate between translations that differ only by OOV words. A translation model which assigns exactly the same score to a large number of hypotheses is of limited use on its own. Model-specific approaches suffer from the difficulty of training embeddings for rare words (Sec. 8.1). Compound or morpheme splitting [238,239] can mitigate this issue only to a certain extent. More importantly, a fully-trained NMT system even with a very large vocabulary cannot be extended with new words. However, customizing systems to new domains (and thus new vocabularies) is a crucial requirement for commercial MT. Moreover, many OOV words are proper names which can be passed through untranslated. Hiero [217] and other symbolic systems can easily be extended with new words and phrases.
More recent attempts try to alleviate the vocabulary issue in NMT by departing from words as modelling units. These approaches decompose the word sequences into finer-grained units and model the translation between those instead of words. To the best of our knowledge, Ling et al. [240] were the first to propose an NMT architecture which translates between sequences of characters. The core of their NMT network is still on the word-level, but the input and output embedding layers are replaced with subnetworks that compute word representations from the characters of the word. Such a subnetwork can be recurrent [240,241] or convolutional [242,243]. This idea was extended to a hybrid model by Luong and Manning [244] who used the standard lookup table embeddings for in-vocabulary words and the LSTM-based embeddings only for OOVs.
Having a word-level model at the core of a character-based system does circumvent the closed vocabulary restriction of purely word-based models, but it is still segmentation-dependent: The input text has to be preprocessed with a tokenizer that separates words by blank symbols in languages without word boundary markers, optionally applies compound or morpheme splitting in morphologically rich languages, and isolates punctuation symbols. Since tokenization is by itself error-prone and can degrade the translation performance [245], it is desirable to design character-level systems that do not require any prior segmentation. Chung et al. [246] used a bi-scale recurrent neural network that is similar to dynamically segmenting the input using jointly learned gates between a slow and a fast recurrent layer. Lee et al. [247] and Yang et al. [248] used convolution to achieve segmentation-free character-level NMT. Costa-jussà et al. [249] took character-level NMT one step further and used bytes rather than characters to help multilingual systems. Gulcehre et al. [250] added a planning mechanism to improve the attention weights between character-based encoders and decoders.

Subword-unit-based NMT
As a compromise between characters and full words, compression methods like Huffman codes [251], word piece models [14,252], or byte pair encoding (BPE) [157,253] can be used to transform the words to sequences of subword units. Subwords have been used rarely for traditional SMT [254][255][256], but are currently the most common translation units for NMT. BPE initializes the set of available subword units with the character set of the language. This set is extended iteratively in subsequent merge operations. Each merge combines the two units with the highest number of co-occurrences in the text. This process terminates when the desired vocabulary size is reached. This vocabulary size is often set empirically, but can also be tuned on data [258].
Given a fixed BPE vocabulary, there are often multiple ways to segment an unseen text. The ambiguity stems from the fact that symbols are still part of the vocabulary even after they are merged. Most BPE implementations select a segmentation greedily by preferring longer subword units. Interestingly, the ambiguity can also be used as a source of noise for regularization. Kudo [259] reported surprisingly large gains by augmenting the training data with alternative subword segmentations and by decoding from multiple segmentations of the same source sentence. Segmentation approaches differ in the level of constraints they impose on the subwords. A common constraint is that subwords cannot span over multiple words [157]. However, enforcing this constraint again requires a tokenizer which is a potential source of errors (see Sec. 8.2). The SentencePiece model [260] is a tokenization-free subword model that is estimated on raw text. On the other side of the spectrum, it has been observed that automatically learned subwords generally do not correspond to linguistic entities such as morphemes, suffixes, affixes etc. However, linguistically-motivated subword units [261][262][263][264] that also take morpheme boundaries into account do not always improve over completely data-driven ones.
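The BPE training loop described above (start from characters, repeatedly merge the most frequent adjacent pair) can be sketched as follows. The sketch assumes a pre-tokenized word frequency list, so subwords never span word boundaries, and it omits the end-of-word marker used by common implementations. Ties are broken in favour of the pair seen first, which is an arbitrary choice of this example, and the merge budget `num_merges` plays the role of the target vocabulary size.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a dict {word: frequency}."""
    # Each word starts out as a tuple of single characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count co-occurrences of adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge to every word.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab

merges, segmented = learn_bpe({"low": 5, "lowest": 2, "newer": 6, "wider": 3}, 1)
print(merges)     # the most frequent adjacent pair in this toy corpus
print(segmented)  # words after applying the first merge
```

Note that the single characters remain valid units after a merge, which is the source of the segmentation ambiguity mentioned above.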

Words, Subwords, or Characters?
There is no conclusive agreement in the literature whether characters or subwords are the better translation units for NMT. Tab. 5 summarizes some of the arguments. The tendency seems to be that character-based systems have the potential of outperforming subword-based NMT, but they are technically difficult to deploy. Therefore, most systems in the WMT18 evaluation are based on subwords [145].
Tab. 5:
Character-based NMT
+ Better at transliteration [265].
+ Dynamic segmentation favors characters [266].
+ Character-level decoders better than subword-based ones in some studies [246,269].
− Long-range dependencies have to be modelled over longer time-spans [247].
Subword-based NMT
+ Tends to outperform character-based models in recent MT evaluations [143][144][145].
On a more profound level, we view the shift towards small modelling units with some concern. Chung et al. [246] noted that "we often have a priori belief that a word, or its segmented-out lexeme, is a basic unit of meaning, making it natural to approach translation as mapping from a sequence of source-language words to a sequence of target-language words." Translation is the task of transferring meaning from one language to another, and it makes intuitive sense to model this process with meaningful units. The decades of research in traditional SMT were characterized by a constant movement towards larger translation units, starting from the word-based IBM models [270] to phrase-based MT [181] and hierarchical SMT [217] that models syntactic structures. Expressions consisting of multiple words are even more appropriate units than words for translation since there is rarely a 1:1 correspondence between source and target words. In contrast, the starting point for character- and subword-based models is the language's writing system. Most writing systems are not logographic but alphabetic or syllabic and thus use symbols without any relation to meaning. The introduction of symbolic word-level and phrase-level information to NMT is one of the main motivations for NMT-SMT hybrid systems (Sec. 18).

Using Monolingual Training Data
In practice, parallel training data for MT is hard to acquire and expensive, whereas untranslated monolingual data is usually abundant. This is one of the reasons why language models (LMs) are central to traditional SMT. For example, in Hiero [217], the translation grammar spans a vast space of possible translations but is weak in assigning scores to them. The LM is mainly responsible for selecting a coherent and fluent translation from that space. However, the vanilla NMT formalism does not allow the integration of an LM or monolingual data in general.
There are several lines of research which investigate the use of monolingual training data in NMT. Gulcehre et al. [271,272] suggested to integrate a separately trained RNN-LM into the NMT decoder. Similarly to traditional SMT [181] they started out with combining RNN-LM and NMT scores via a log-linear model ('shallow fusion'). They reported even better performance with 'deep fusion' which uses a controller network that dynamically adjusts the weights between RNN-LM and NMT. Both deep fusion and n-best reranking with count-based language models have led to some gains in WMT evaluation systems [148,233]. The 'simple fusion' technique [273] trains the translation model to predict the residual probability of the training data added to the prediction of a pre-trained and fixed LM.
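Shallow fusion can be illustrated as a log-linear combination of the two next-token scores. The single interpolation weight `lm_weight` and the toy distributions are assumptions of this sketch; in practice the weight would be tuned on held-out data.

```python
import math

def shallow_fusion(nmt_log_probs, lm_log_probs, lm_weight=0.3):
    """Log-linear combination of NMT and LM scores for the next token."""
    return {w: nmt_log_probs[w] + lm_weight * lm_log_probs[w] for w in nmt_log_probs}

nmt = {"house": math.log(0.45), "home": math.log(0.55)}
lm = {"house": math.log(0.9), "home": math.log(0.1)}

fused = shallow_fusion(nmt, lm)
print(max(nmt, key=nmt.get))      # choice of the NMT model alone
print(max(fused, key=fused.get))  # choice after adding the LM score
```

The fused scores are not normalized, which is unproblematic when they are only used to rank candidates inside beam search.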
The second line of research makes use of monolingual text via data augmentation. The idea is to add monolingual data in the target language to the natural parallel training corpus. Different strategies for filling in the source side for these sentences have been proposed such as using a single dummy token [274] or copying the target sentence over to the source side [275]. The most successful strategy is called back-translation [274,276] which employs a separate translation system in the reverse direction to generate the source sentences for the monolingual target language sentences. The back-translating system is usually smaller and computationally cheaper than the final system for practical reasons, although with enough computational resources improving the quality of the reverse system can affect the final translation performance significantly [277]. Iterative approaches that back-translate with systems that were themselves trained with back-translation can yield improvements [278][279][280] although they are not widely used due to their computational costs. Back-translation has become a very common technique and has been used in nearly all neural submissions to recent evaluation campaigns [140,144,145].
A major limitation of back-translation is that the amount of synthetic data has to be balanced with the amount of real parallel data [140,274,281]. Therefore, the back-translation technique can only make use of a small fraction of the available monolingual data. An imbalance between synthetic and real data can be partially corrected by over-sampling, i.e. duplicating real training samples a number of times to match the synthetic data size. However, very high over-sampling rates often do not work well in practice. Recently, Edunov et al. [282] proposed to add noise to the back-translated sentences to provide a stronger training signal from the synthetic sentence pairs. They showed that adding noise not only improves the translation quality but also makes the training more robust against a high ratio of synthetic to real sentences. The effectiveness of using noise for data augmentation in NMT has also been confirmed by Wang et al. [283]. These methods increase the variety of the training data and thus make it harder for the model to fit, which ultimately leads to stronger training signals. The variety of synthetic sentences in back-translation can also be increased by sampling multiple sentences from the reverse translation model [284].
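The construction of a back-translated training set with over-sampling of the real data can be sketched as follows. `back_translate` stands for a hypothetical reverse-direction system (target to source); the lambda used here is a placeholder that merely tags the sentence, and the `oversample` factor is an illustrative parameter.

```python
def augment_with_back_translation(parallel, mono_target, back_translate, oversample=1):
    """Combine over-sampled real pairs with synthetic (source, target) pairs.

    parallel: list of real (source, target) sentence pairs.
    mono_target: list of monolingual target-language sentences.
    back_translate: function mapping a target sentence to a synthetic source sentence.
    """
    synthetic = [(back_translate(t), t) for t in mono_target]
    return list(parallel) * oversample + synthetic

parallel = [("good morning", "guten Morgen")]
mono_target = ["guten Tag", "gute Nacht"]
fake_reverse_system = lambda t: "<bt> " + t  # placeholder, not a real translation

data = augment_with_back_translation(parallel, mono_target, fake_reverse_system, oversample=2)
print(data)
```

Only the source side of a synthetic pair is machine-generated, so the target side that the model learns to produce is always genuine text.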
A third class of approaches changes the NMT training loss function to incorporate monolingual data. For example, Cheng et al. [285], Tu et al. [286], and Escolano et al. [287] proposed to add autoencoder terms to the training objective which capture how well a sentence can be reconstructed from its translated representation. Using the reconstruction error is also central to (unsupervised) dual learning approaches [288][289][290]. However, training with respect to the new loss is often computationally intensive and requires approximations. Alternatively, multi-task learning has been used to incorporate source-side [291] and target-side [292] monolingual data. Another way of utilizing monolingual data in both source and target language is to warm start Seq2Seq training from pre-trained encoder and decoder networks [293,294]. An extreme form of leveraging monolingual training data is unsupervised NMT which removes the need for parallel training data entirely. We will discuss unsupervised NMT in Sec. 14.4.
Figure 17: Performance of a Transformer model on English-German (WMT15) under varying beam sizes. The BLEU score peaks at beam size 10, but then suffers from a length ratio (hypothesis length / reference length) below 1. The log-probabilities are shown as a ratio with respect to greedy decoding.

NMT Model Errors
NMT is highly effective in assigning scores (or probabilities) to translations because, in stark contrast to SMT, it does not make any conditional independence assumptions in Eq. 5 to model sentence-level translation. 13 A potential drawback of such a powerful model is that it prohibits the use of sophisticated search procedures. Compared to hierarchical SMT systems like Hiero [217] that explore very large search spaces, NMT beam search appears to be overly simplistic. This observation suggests that translation errors in NMT are more likely due to search errors (the decoder does not find the highest scoring translation) than model errors (the model assigns a higher probability to a worse translation). Interestingly, this is not necessarily the case. Search errors in NMT have been studied by Stahlberg et al. [34], Stahlberg and Byrne [136], Niehues et al. [295]. In particular, Stahlberg and Byrne [136] demonstrated the high number of search errors in NMT decoding. However, as we will show in this section, NMT also suffers from various kinds of model errors in practice despite its theoretical advantage.

Sentence Length
Increasing the beam size exposes one of the most noticeable model errors in NMT. The red curve in Fig. 17 plots the BLEU score [296] of a recent Transformer NMT model against the beam size. A beam size of 10 is optimal on this test set. Wider beams lead to a steady drop in translation performance because the generated translations are becoming too short (green curve). However, as expected, the log-probabilities of the found translations (blue curve) are decreasing as we increase the beam size. NMT seems to assign too much probability mass to short hypotheses which are only found with more exhaustive search. Sountsov and Sarawagi [74] argue that this model error is due to the locally normalized maximum likelihood training objective in NMT that underestimates the margin between the correct translation and shorter ones if trained with regularization and finite data. A similar argument was made by Murray and Chiang [297] who pointed out the difficulty for a locally normalized model to estimate the "budget" for all remaining (longer) translations in each time step. Kumar and Sarawagi [298] demonstrated that NMT models are often poorly calibrated, and that calibration issues can cause the length deficiency in NMT. A similar case is illustrated in Fig. 18. The NMT model underestimates the combined probability mass of translations continuing after "Stadtrat" in time step 7 and overestimates the probability of the period symbol. Greedy decoding does not follow the green translation since "der" is more likely in time step 7. However, beam search with a large beam keeps the green path and thus finds the shorter (incomplete) translation with better score. In fact, Stahlberg and Byrne [136] linked the bias of large beam sizes towards short translations with the reduction of search errors.
Figure 18: The length deficiency in NMT translating the English source sentence "Her husband is a former Tory councillor." into German following Murray and Chiang [297]. The NMT model assigns a better score to the short translation "Ihr Mann ist ein ehemaliger Stadtrat." than to the greedy translation "Ihr Mann ist ein ehemaliger Stadtrat der Tory." even though it misses the former affiliation of the husband with the Tory Party.
At first glance this seems to be good news: fast beam search with a small beam size is already able to find good translations. However, fixing the model error of short translations by introducing search errors with a narrow beam seems like fighting fire with fire. In practice, this means that the beam size is yet another hyper-parameter which needs to be tuned for each new NMT training technique (e.g. label smoothing [299] usually requires a larger beam), NMT architecture (the Transformer model is usually decoded with a smaller beam than typical recurrent models), and language pair [300]. More importantly, it is not clear whether there are gains to be had from reducing the number of search errors with wider beams which are simply obliterated by the NMT length deficiency.

Model-agnostic Length Models
The first class of approaches to alleviate the length problem is model-agnostic. Methods in this class treat the NMT model as a black box but add a correction term to the NMT score to bias beam search towards longer translations. A simple method is called length normalization, which divides the NMT log-probability by the sentence length [233,301]:
\[ S_{LN}(y|x) = \frac{\log P(y|x)}{J} \]
where $J$ is the length of the translation $y$. Wu et al. [14] proposed an extension of this idea by introducing a tunable parameter α:
\[ S_{LN\text{-}GNMT}(y|x) = \frac{(1+5)^{\alpha}}{(J+5)^{\alpha}} \log P(y|x) \]
Alternatively, like in SMT we can use a word penalty γ(j, x) which rewards each word in the sentence:
\[ S_{WP}(y|x) = \log P(y|x) + \sum_{j=1}^{J} \gamma(j, x) \]
A constant reward which is independent of x and j can be found with the standard minimum-error-rate training (MERT) [302] algorithm [303] or with a gradient-based learning scheme [297]. Alternative policies which reward words with respect to some estimated sentence length were suggested by Huang et al. [304], Yang et al. [305].
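The three corrections described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, the constant 5 follows the length penalty published by Wu et al. [14], and the two example hypotheses and the values of `alpha` and `gamma` are made up.

```python
def length_normalized(log_prob, length):
    """Divide the sentence log-probability by the target length J."""
    return log_prob / length


def gnmt_normalized(log_prob, length, alpha):
    """Length normalization with a tunable exponent alpha (constant 5 as in Wu et al.)."""
    return log_prob * (1.0 + 5.0) ** alpha / (length + 5.0) ** alpha


def word_reward(log_prob, length, gamma):
    """Add a constant reward gamma for each of the J target words."""
    return log_prob + gamma * length


# Two made-up hypotheses given as (log-probability, length in tokens).
short, long_ = (-4.0, 7), (-4.2, 9)
```

With these numbers the raw model score prefers the short hypothesis, whereas all three corrected scores prefer the longer one. Note that `alpha = 0` and `gamma = 0` recover the uncorrected log-probability.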

Source-side Coverage Models
Tu et al. [306] connected the sentence length issue in NMT with the lack of an explicit mechanism to check the source-side coverage of a translation. Traditional SMT keeps track of a coverage vector $C_{SMT} \in \{0, 1\}^I$ which contains 1 for source words which are already translated and 0 otherwise. $C_{SMT}$ is used to guard against under-translation (missing translations of some words) and over-translation (some words are unnecessarily translated multiple times). Since vanilla NMT does not use an explicit coverage vector it can be prone to both under- and over-translation [306,307] and tends to prefer fluency over adequacy [308]. There are two popular ways to model coverage in NMT; both make use of the encoder-decoder attention weight matrix A introduced in Sec. 6.1. The simpler methods combine the scores of an already trained NMT system with a coverage penalty cp(x, y) without retraining. This penalty represents how much of the source sentence is already translated. Wu et al. [14] proposed the following term:
\[ cp(x, y) = \beta \sum_{i=1}^{I} \log\left(\min\left(\sum_{j=1}^{J} A_{i,j},\, 1.0\right)\right) \]
where $A_{i,j}$ denotes the attention weight that target position j places on source position i. A very similar penalty was suggested by Li et al. [309]:
\[ cp(x, y) = \alpha \sum_{i=1}^{I} \log\left(\max\left(\sum_{j=1}^{J} A_{i,j},\, \beta\right)\right) \]
where α and β are hyper-parameters that are tuned on the development set. An even tighter integration can be achieved by changing the NMT architecture itself and jointly training it with a coverage model [306,310]. Tu et al. [306] reintroduced an explicit coverage matrix $C \in [0, 1]^{I \times J}$ to NMT. Intuitively, the j-th column $C_{:,j}$ stores to what extent each source word has been translated in time step j. C can be filled with an RNN-based controller network (the "neural network based" coverage model of Tu et al. [306]). Alternatively, we can directly use A to compute the coverage (the "linguistic" coverage model of Tu et al. [306]):
\[ C_{i,j} = \frac{1}{\Phi_i} \sum_{k=1}^{j} A_{i,k} \]
where $\Phi_i$ is the estimated number of target words the i-th source word generates, which is similar to fertility in SMT. $\Phi_i$ is predicted by a feedforward network that conditions on the i-th encoder state.
In both the neural network based and the linguistic coverage model, the decoder is modified to additionally condition on C. The idea of using fertilities to prevent over- and under-translation has also been explored by Malaviya et al. [311]. A coverage model for character-based NMT was suggested by Kazimi and Costa-Jussá [312]. All approaches discussed in this section operate on the attention weight matrix A and are thus only readily applicable to models with single encoder-decoder attention like GNMT, but not to models with multiple encoder-decoder attention modules such as ConvS2S or the Transformer (see Sec. 6.6 for detailed descriptions of GNMT, ConvS2S, and the Transformer).
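As an illustration of the retraining-free variant, the sketch below computes two coverage penalties from an attention matrix, following the published formulations of Wu et al. [14] (sum of log-clipped attention mass per source word) and Li et al. [309] (the same with a lower floor instead of an upper cap). The matrix layout `attention[i][j]` (source position i, target step j), the function names, the hyper-parameter values, and the small numerical guard `eps` are assumptions of this sketch.

```python
import math


def coverage_penalty_wu(attention, beta=0.2, eps=1e-9):
    """beta * sum_i log(min(sum_j A[i][j], 1)); eps guards against log(0)."""
    return beta * sum(math.log(max(min(sum(row), 1.0), eps)) for row in attention)


def coverage_penalty_li(attention, alpha=1.0, beta=0.1):
    """alpha * sum_i log(max(sum_j A[i][j], beta))."""
    return alpha * sum(math.log(max(sum(row), beta)) for row in attention)


# Made-up attention matrices for 3 source words and 3 target steps.
well_covered = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]]
third_word_ignored = [[0.9, 0.5, 0.5], [0.1, 0.5, 0.4], [0.0, 0.0, 0.1]]
```

In the second matrix the third source word receives almost no attention, so both penalties become clearly negative and push beam search away from this hypothesis.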

Controlling Mechanisms for Output Length
In some sequence prediction tasks such as headline generation or text summarization, the approximate desired output length is known in advance. In such cases, it is possible to control the length of the output sequence by explicitly feeding in the desired length to the neural model. The length information can be provided as additional input to the decoder network [313,314], at each time step as the number of remaining tokens [315], or by modifying Transformer positional embeddings [316]. However, these approaches are not directly applicable to machine translation as the translation length is difficult to predict with sufficient accuracy.

NMT Training
NMT models are normally trained using backpropagation [39] and a gradient-based optimizer like Adadelta [317] with cross-entropy loss (Sec. 11.1). Modern NMT architectures like the Transformer, ConvS2S, or recurrent networks with LSTM [71] or GRU [10] cells help to address known training problems like vanishing gradients [72]. However, there is evidence that the optimizer still fails to exploit the full potential of NMT models and often gets stuck in suboptima:
1. NMT models vary greatly in performance, even if they use exactly the same architecture, training data, and are trained for the same number of iterations. Sennrich et al. [157] observed up to 1 BLEU difference between different models.
2. NMT ensembling (Sec. 15) combines the scores of multiple separately trained NMT models of the same kind. NMT ensembles consistently outperform single NMT by a large margin. The achieved gains through ensembling might indicate difficulties in training of the single models. 14
Training is therefore still a very active and diverse research topic. We will outline the different efforts in the literature on NMT training in this section.

Cross-entropy Training
The most common objective function for NMT training is cross-entropy loss. The optimization problem over model parameters Θ for a single sentence pair (x, y) under this loss is defined as follows:
\[ \arg\min_{\Theta} L_{CE}(x, y, \Theta), \qquad L_{CE}(x, y, \Theta) = -\sum_{j=1}^{J} \log P_{\Theta}(y_j | y_1^{j-1}, x) \qquad (36) \]
In practice, NMT training groups several instances from the training corpus into batches, and optimizes Θ by following the gradient of the average $L_{CE}(x, y, \Theta)$ in the batch. There are various ways to interpret this loss function.
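Cross-entropy loss maximizes the log-likelihood of the training data. A direct interpretation of Eq. 36 is that it yields a maximum likelihood estimate of Θ as it directly maximizes the probability $P_{\Theta}(y|x)$:
\[ \arg\min_{\Theta} L_{CE}(x, y, \Theta) = \arg\max_{\Theta} \sum_{j=1}^{J} \log P_{\Theta}(y_j | y_1^{j-1}, x) = \arg\max_{\Theta} \prod_{j=1}^{J} P_{\Theta}(y_j | y_1^{j-1}, x) = \arg\max_{\Theta} P_{\Theta}(y|x) \]
Cross-entropy loss optimizes a Monte Carlo approximation of the cross-entropy to the real sequence-level distribution. Another intuition behind the cross-entropy loss is that we want to find model parameters Θ that make the model distribution $P_{\Theta}(\cdot|x)$ similar to the real distribution $P(\cdot|x)$ over translations for a source sentence x. The similarity is measured with the cross-entropy $H_x(P, P_{\Theta})$. In practice, the real distribution $P(\cdot|x)$ is not known, but we have access to a training corpus of pairs (x, y). For each such pair we consider the target sentence y as a sample from the real distribution $P(\cdot|x)$. We now approximate the cross-entropy $H_x(P, P_{\Theta})$ using Monte Carlo estimation with only one sample (N = 1):
\[ H_x(P, P_{\Theta}) = -\sum_{y'} P(y'|x) \log P_{\Theta}(y'|x) \approx -\frac{1}{N} \sum_{n=1}^{N} \log P_{\Theta}(y^{(n)}|x) = -\log P_{\Theta}(y|x) = L_{CE}(x, y, \Theta) \]
14 I thank Adrià de Gispert for making that point in our discussions.
Cross-entropy loss optimizes a Monte Carlo approximation of the cross-entropy to the real token-level distribution. We arrive at the same result if we consider the cross-entropy between the conditionals $P(\cdot|y_1^{j-1}, x)$ and $P_{\Theta}(\cdot|y_1^{j-1}, x)$ for given x and prefix $y_1^{j-1}$, approximate each of them with the single sample $y_j$, and sum over all time steps:
\[ -\sum_{j=1}^{J} \sum_{w \in \Sigma_{trg}} P(w | y_1^{j-1}, x) \log P_{\Theta}(w | y_1^{j-1}, x) \approx -\sum_{j=1}^{J} \log P_{\Theta}(y_j | y_1^{j-1}, x) = L_{CE}(x, y, \Theta) \]
Cross-entropy loss optimizes the cross-entropy to the Dirac distribution. Alternatively, we can define a (Dirac) distribution which assigns the probability of one to y and zero to all other target sentences:
\[ P_y(y') = \begin{cases} 1 & \text{if } y' = y \\ 0 & \text{otherwise} \end{cases} \]
The cross-entropy between the Dirac distribution (in this context taking the role of the empirical distribution) and our model distribution $P_{\Theta}(\cdot|x)$ is:
\[ H_x(P_y, P_{\Theta}) = -\sum_{y'} P_y(y') \log P_{\Theta}(y'|x) = -\log P_{\Theta}(y|x) = L_{CE}(x, y, \Theta) \]
To recap, we have found that the following are equivalent:
• Training under cross-entropy loss (Eq. 36).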
• Maximizing the likelihood of the training data.
• Minimizing an estimate of the cross-entropy to the real sequence-level distribution.
• Minimizing an estimate of the cross-entropy to the real token-level distribution.
• Minimizing the cross-entropy to the Dirac distribution.
In particular, we emphasize the equivalence between the sequence-level and the token-level estimation, since cross-entropy loss is often characterized as a token-level objective in the literature, whereas the term sequence-level training, somewhat misleadingly, usually refers to risk-based training under BLEU [318,319], which is discussed in Sec. 11.5.
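The equivalence between the token-level and the sequence-level view can be checked numerically on a toy model. The sketch below uses a made-up locally normalized model over a three-symbol vocabulary; `token_probs` and all other names are hypothetical and only serve to show that the sum of token-level losses equals the cross-entropy between the Dirac distribution on y and the sequence-level model distribution.

```python
import itertools
import math

VOCAB = [0, 1, 2]


def token_probs(prefix):
    """Made-up locally normalized model: a distribution over VOCAB given a prefix."""
    scores = [1.0 + ((len(prefix) + sum(prefix) + w) % 3) for w in VOCAB]
    total = sum(scores)
    return [s / total for s in scores]


def sequence_prob(y):
    """P(y|x) as the product of the token-level conditionals."""
    p = 1.0
    for j, w in enumerate(y):
        p *= token_probs(y[:j])[w]
    return p


def token_level_loss(y):
    """Cross-entropy loss as a sum of token-level negative log-probabilities."""
    return -sum(math.log(token_probs(y[:j])[w]) for j, w in enumerate(y))


def dirac_cross_entropy(y):
    """Cross-entropy between the Dirac distribution on y and the model, by enumeration."""
    total = 0.0
    for candidate in itertools.product(VOCAB, repeat=len(y)):
        dirac = 1.0 if candidate == tuple(y) else 0.0
        total -= dirac * math.log(sequence_prob(candidate))
    return total


y = (2, 0, 1)
losses = (token_level_loss(y), dirac_cross_entropy(y), -math.log(sequence_prob(y)))
```

All three entries of `losses` agree up to floating point error, which mirrors the list of equivalent formulations above.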

Training Deep Architectures
Deep encoders and decoders consisting of multiple layers have now superseded earlier shallow architectures. However, since the gradients have to be back-propagated through more layers, deep architectures, especially recurrent ones, are prone to vanishing gradients [320] and are thus harder to train. A number of tricks have been proposed recently that make it possible to train deep NMT models reliably. Residual connections [130] are direct connections that bypass more complex sub-networks in the layer stack. For example, all the architectures presented in Sec. 6.6 (GNMT, ConvS2S, Transformer, RNMT+) add residual connections around attentional, recurrent, or convolutional cells to ease learning (Fig. 13). Another technique to counter vanishing gradients is called batch normalization [128], which normalizes the hidden activations in each layer in a mini-batch to a mean of zero and a variance of 1. An extension of batch normalization which is independent of the batch size and is especially suitable for recurrent networks is called layer normalization [129]. Layer normalization is popular for training deep NLP models like the Transformer [76].
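The difference between the two normalization schemes is only the axis over which the statistics are computed. The following pure-Python sketch illustrates this under simplifying assumptions: the learned gain and bias parameters and the running statistics used at test time are omitted, and the function names are hypothetical.

```python
import math


def normalize(values, eps=1e-5):
    """Shift and scale a list of activations to zero mean and (almost) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def layer_norm(batch):
    """Normalize each example separately across its hidden units."""
    return [normalize(example) for example in batch]


def batch_norm(batch):
    """Normalize each hidden unit separately across the examples in the mini-batch."""
    columns = [normalize(list(column)) for column in zip(*batch)]
    return [list(row) for row in zip(*columns)]


# A made-up mini-batch with two examples and three hidden units.
batch = [[1.0, 2.0, 3.0], [2.0, 4.0, 9.0]]
```

Because `layer_norm` never looks at other examples, its output for a sentence does not change with the batch size, which is one reason it is convenient for sequence models.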

Regularization
Modern NMT architectures are vastly over-parameterized [151] to help training [321]. For example, a subword-unit-level Transformer in a standard "big" configuration can easily have 200-300 million parameters [126]. The large number of parameters potentially makes the model prone to over-fitting: the model fits the training data perfectly, but the performance on held-out data suffers as the large number of parameters allows the optimizer to marginally improve training loss at the cost of generalization as training proceeds. Techniques that aim to prevent over-fitting in over-parameterized neural networks are called regularizers. Perhaps the two simplest regularization techniques are L1 and L2 regularization. The idea is to add terms to the loss function that penalize the magnitude of weights in the network. Intuitively, such penalties draw many parameters towards zero and limit their significance. Thus, L1 and L2 effectively serve as soft constraints on the model capacity.
The three most popular regularization techniques for NMT are early stopping, dropout, and label smoothing. Early stopping can be seen as regularization in time as it stops training as soon as the performance on the development set does not improve anymore. Dropout [127] is arguably one of the key techniques that have made deep learning practical. Dropout randomly sets the activities of hidden and visible units to zero during training. Thus, it can be seen as a strong regularizer for simultaneously training a large collection of networks with extensive weight sharing.
Label smoothing has been derived for expectation-maximization training by Byrne [322], and has been applied to large-scale computer vision by Szegedy et al. [299]. Label smoothing changes the training objective such that the model produces smoother distributions. We have already established in Sec. 11.1 that standard cross-entropy training measures the distance of the output distribution to the Dirac distribution around the training sample. Label smoothing discounts the likelihood of the training sample and distributes some of the free probability mass among other hypotheses. In NMT, label smoothing is applied as cross-entropy loss to a smoothed distribution Q(·) on the token level:
\[ L_{LS}(x, y, \Theta) = -\sum_{j=1}^{J} \sum_{w \in \Sigma_{trg}} Q(w | y_1^{j-1}, x) \log P_{\Theta}(w | y_1^{j-1}, x) \]
The distribution Q(·) can take language modelling scores into account [323], but usually it is just a smoothed version of the Dirac distribution for the reference label:
\[ Q(w | y_1^{j-1}, x) = \alpha\, \delta(w, y_j) + \frac{1 - \alpha}{|\Sigma_{trg}|} \]
for some smoothing factor α ∈ (0, 1], where δ(w, y_j) is 1 if w = y_j and 0 otherwise. Setting α = 1 recovers the normal cross-entropy loss from Sec. 11.1. While label smoothing makes intuitive sense for computer vision, applying it to neural sequence prediction in this way has objectionable side effects on the sequence level. Considering the probabilities Q(·) assigns to full sequences, we first note that Q(·) does not uniformly distribute the remaining probability mass among all other sequences. In fact, distributing it uniformly would result in infinitely small probabilities as there are infinitely many possible sequences. Interestingly, Q(·) also does not assign a fixed probability of α to the correct sequence y:
\[ Q(y|x) = \prod_{j=1}^{J} Q(y_j | y_1^{j-1}, x) = \left( \alpha + \frac{1 - \alpha}{|\Sigma_{trg}|} \right)^{J} \]
Since the per-token factor is less than one for α < 1, Q(·) is sharper if the correct sequence y is short, and smoother if it is long. Alternative loss functions that encourage smooth output distributions include explicit entropy penalization [324] and knowledge distillation (Sec. 16). A regularization effect can also be achieved by making the training data harder to fit by adding noise, e.g. via subword regularization [259], SwitchOut [283], or noisy back-translation [282] (see Secs. 8.3 and 9).
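The length dependence of label smoothing on the sequence level can be illustrated with a short sketch. It assumes the common mixture form of the smoothed target, i.e. a weight of α on the reference token plus a uniform share of the remaining 1 − α over the whole vocabulary; the function names and the numbers are hypothetical.

```python
import math


def smoothed_target(reference, vocab_size, alpha):
    """Token-level target: alpha on the reference plus (1 - alpha) spread uniformly."""
    uniform = (1.0 - alpha) / vocab_size
    return [uniform + (alpha if w == reference else 0.0) for w in range(vocab_size)]


def label_smoothing_loss(model_probs, reference, alpha):
    """Cross-entropy between the smoothed target and the model distribution at one step."""
    target = smoothed_target(reference, len(model_probs), alpha)
    return -sum(q * math.log(p) for q, p in zip(target, model_probs))


def smoothed_sequence_prob(length, vocab_size, alpha):
    """Probability that the smoothed target assigns to the full reference sequence."""
    return (alpha + (1.0 - alpha) / vocab_size) ** length


model_probs = [0.7, 0.2, 0.1]  # made-up model distribution at one time step
plain_loss = label_smoothing_loss(model_probs, reference=0, alpha=1.0)
smoothed_loss = label_smoothing_loss(model_probs, reference=0, alpha=0.9)
```

Calling `smoothed_sequence_prob` with a fixed α and growing `length` shows the side effect discussed above: the mass assigned to the reference sequence shrinks geometrically with its length.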

Large Batch Training
Another practical trick which is becoming increasingly feasible with the availability of multi-GPU training and large GPU memories is to use very large batch sizes. Large batch training can yield almost linear speed-ups [325] as the computation can be distributed across multiple GPUs. Even more importantly, gradients estimated on large batches are naturally less noisy than gradients from small batches, and can yield better overall convergence [76,126,161]. For example, distributing Transformer training across 16 (effective) GPUs can improve over single GPU training by two full BLEU points [126]. Smith et al. [326] argued that increasing the batch size during training can have a similar effect as learning rate decay. For a thorough and insightful discussion of large batch training we refer the reader to [325].
Previous studies [327,328] on batch size were limited by the hardware since, in vanilla SGD, the training batch has to fit into GPU memory. Saunders et al. [329] presented a technique called delayed SGD which sidesteps these limitations by decoupling the batch size limit from the available hardware.
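One way to decouple the effective batch size from the hardware limit is to accumulate gradients over several small batches and to apply a single update afterwards. The sketch below illustrates this on a one-parameter regression problem; it assumes that delayed updates amount to plain gradient accumulation, and the model, the data, and all names are made up for illustration.

```python
def grad(theta, batch):
    """Gradient of the mean of 0.5 * (theta * x - y)^2 over the batch."""
    return sum((theta * x - y) * x for x, y in batch) / len(batch)


def sgd_step(theta, batch, lr):
    """Vanilla SGD: the whole batch has to be processed at once."""
    return theta - lr * grad(theta, batch)


def delayed_sgd_step(theta, small_batches, lr):
    """Accumulate gradients over several small batches, then update once."""
    accumulated, seen = 0.0, 0
    for batch in small_batches:
        accumulated += grad(theta, batch) * len(batch)
        seen += len(batch)
    return theta - lr * accumulated / seen


data = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1), (4.0, 8.2),
        (5.0, 9.8), (6.0, 12.1), (7.0, 13.9), (8.0, 16.0)]
small_batches = [data[i:i + 2] for i in range(0, len(data), 2)]
large = sgd_step(0.5, data, lr=0.01)
delayed = delayed_sgd_step(0.5, small_batches, lr=0.01)
```

Both updates coincide up to floating point error, although the delayed variant never holds more than two examples at a time.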

Reinforcement Learning
Ranzato et al. [318] pointed out two weaknesses of standard MLE training in neural sequence models. First, there is a discrepancy between NMT training and decoding. During training, the correct target label $y_{j-1}$ is used in the j-th time step. Obviously, during decoding, the correct labels are not available, so the previous (potentially wrong) output is fed back to the model. This is called 'exposure bias' [318] as the model is never exposed to its own mistakes during training. The exposure bias can be tackled by feeding back the ground-truth labels only at early training stages, but gradually switching to feeding back the previously produced target tokens instead as training progresses [330].
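The gradual switch from ground-truth labels to model predictions can be sketched as follows. The linear decay schedule, the `decay_steps` value, and the `predict_next` callback standing in for the NMT decoder are assumptions made for this illustration.

```python
import random


def teacher_forcing_prob(step, decay_steps=1000):
    """Probability of feeding back the ground-truth token, decaying linearly to zero."""
    return max(0.0, 1.0 - step / decay_steps)


def decoder_inputs(reference, predict_next, step, rng, decay_steps=1000):
    """Build the decoder input sequence for one training sentence."""
    p_truth = teacher_forcing_prob(step, decay_steps)
    inputs = ["<s>"]
    for y_ref in reference[:-1]:
        y_model = predict_next(inputs)  # hypothetical model prediction for this step
        inputs.append(y_ref if rng.random() < p_truth else y_model)
    return inputs


reference = ["a", "b", "c"]
dummy_model = lambda history: "x"  # stand-in that always predicts the same token
early = decoder_inputs(reference, dummy_model, step=0, rng=random.Random(0))
late = decoder_inputs(reference, dummy_model, step=1000, rng=random.Random(0))
```

At the start of training the inputs coincide with the shifted reference; once the schedule has decayed to zero, the model only sees its own predictions, as it does during decoding.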
The second issue in NMT training pointed out by Ranzato et al. [318] is the mismatch between training loss function and evaluation metric. Training uses cross-entropy loss on the word level, whereas the final evaluation metric is usually BLEU [296], which is defined on the sentence or document level. Both of these problems can be tackled with reinforcement learning [318,331]. In the standard terminology of reinforcement learning, an agent interacts with an environment via actions. A policy determines the action to pick depending on the environment. The goal is to learn a policy which maximizes the expected reward. In NMT, the agent is the NMT model that interacts with the environment consisting of the source sentence x and the translation history $y_1^{j-1}$ by picking actions (words) according to the policy $P(y_j | y_1^{j-1}, x)$. The advantage of casting NMT as a reinforcement learning problem is that the reward does not need to be differentiable, and thus can be any quality measure such as BLEU or GLEU [14]. However, training is computationally very expensive as it requires sampling or decoding during training [332]. Therefore, reinforcement learning is usually used to refine a model trained with cross-entropy [14]. However, even though reinforcement learning has yielded some gains in the past in isolated experiments, it is difficult to improve over stronger baselines with recent NMT architectures and back-translation [333]. Wu et al. [14] reported that their gains in BLEU from reinforcement learning were not reflected in the human evaluation. Other possible applications for reinforcement learning in neural sequence prediction include architecture search [334], adequacy-oriented learning [308], and simultaneous translation (Sec. 7.8). An alternative way to incorporate the BLEU metric into NMT training is via a minimum risk formulation [319,335].
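Both the expected reward and the risk are expectations of a sentence-level quality measure under the model distribution, which in practice have to be estimated on a small set of sampled or decoded hypotheses. The sketch below computes such an estimate; the renormalization over the hypothesis set, the `sharpness` parameter, and the cost values (e.g. one minus a sentence-level BLEU score) are assumptions of this illustration.

```python
import math


def expected_risk(log_probs, costs, sharpness=1.0):
    """Expected cost under the model distribution renormalized over a hypothesis set."""
    scaled = [sharpness * lp for lp in log_probs]
    offset = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - offset) for s in scaled]
    normalizer = sum(weights)
    return sum(w / normalizer * c for w, c in zip(weights, costs))


costs = [0.2, 0.8]  # made-up costs of two hypotheses, lower is better
risk_before = expected_risk([math.log(0.5), math.log(0.5)], costs)
risk_after = expected_risk([math.log(0.9), math.log(0.1)], costs)
```

Moving probability mass towards the hypothesis with the lower cost reduces the risk, and the cost itself never has to be differentiated.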

Dual Supervised Learning
Recall that NMT networks are trained to model the distribution P(y|x) over translations y for a given source sentence x. This training objective takes only one translation direction into account: from the source language to the target language. However, the chain rule gives us the following relation:
\[ P(y|x) P(x) = P(x, y) = P(x|y) P(y) \qquad (43) \]
Eq. 43 is often not satisfied when the two translation models P(y|x) and P(x|y) are trained independently. The dual supervised learning loss $L_{DSL}$ aims to correlate both translation directions as follows [289,336]:
\[ L_{DSL} = \left( \log P(x) + \log P(y|x) - \log P(y) - \log P(x|y) \right)^2 \]
An alternative way to incorporate both translation directions is the agreement-based approach of Cheng et al. [337].
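A simple way to couple the two directions is to penalize the squared gap between the two factorizations of the joint probability in log space. The sketch below assumes this squared-gap form; the function name and the probabilities are made up, and in practice the marginals P(x) and P(y) would have to be estimated, for example with language models.

```python
import math


def dsl_penalty(log_p_x, log_p_y_given_x, log_p_y, log_p_x_given_y):
    """Squared gap between log P(x) + log P(y|x) and log P(y) + log P(x|y)."""
    gap = (log_p_x + log_p_y_given_x) - (log_p_y + log_p_x_given_y)
    return gap ** 2


# Consistent case: both factorizations give a joint probability of 0.2.
consistent = dsl_penalty(math.log(0.5), math.log(0.4), math.log(0.25), math.log(0.8))
# Inconsistent case: the reverse model assigns too little probability to x given y.
inconsistent = dsl_penalty(math.log(0.5), math.log(0.4), math.log(0.25), math.log(0.2))
```

The penalty vanishes when the chain rule relation holds and grows with the disagreement between the two independently trained models.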

Adversarial Training
Generative adversarial networks (GANs) [338] have recently become extremely popular in computer vision. GANs were originally proposed as a framework for training generative models. For example, in computer vision, a generative model G would generate images that are similar to the ones in the training corpus. The input to a classic GAN is noise which is sampled from a noise prior. The key idea of adversarial training is that G is trained to fool a discriminative model D. The discriminator D takes an image as input and outputs the probability of the image coming from the real training corpus as opposed to being generated by G. G and D are jointly trained with opposing objectives: G tries to drive up the probability of D making a mistake whereas D aims to discriminate between real and fake images generated by G. GANs are particularly useful when they condition on some input (conditional GANs). For example, a GAN which conditions on a textual description of an image is able to synthesize an image for an unseen description at test time.
In computer vision, it is possible to back-propagate gradients through the synthetic image and thus train G and D jointly without approximations. The main challenge for applying GANs to text is that this is no longer possible since text consists of a variable number of discrete symbols. Therefore, most work on adversarial training in NLP relies on reinforcement learning to generate synthetic text samples [339][340][341][342][343] or directly operates on the hidden activations in G [344]. Besides some exploratory efforts [339][340][341], adversarial training for NLP and particularly NMT is still in its infancy and rather brittle [340][345][346][347].

Post-hoc Interpretability
Explaining the predictions of deep neural models is hard because they consist of tens of thousands of neurons and millions of parameters. Therefore, explainable and interpretable deep learning is still an open research question [348][349][350][351][352]. Post-hoc interpretability refers to the idea of sidestepping the model complexity by treating it as a black box and not trying to understand the inner workings of the model. Montavon et al. [351] define post-hoc interpretability as follows: "A trained model is given and our goal is to understand what the model predicts (e.g. categories) in terms what is readily interpretable (e.g. the input variables)". In NMT, this means that we try to understand the target tokens ("what the model predicts") in terms of the source tokens ("the input variables"). Post-hoc interpretability methods such as layer-wise relevance propagation [353] are often visualized with heat maps representing the importance of input variables: pixels in computer vision or source words in machine translation.
Applying post-hoc interpretability methods to sequence-to-sequence prediction has received some attention in the literature [354]. Alvarez-Melis and Jaakkola [355] proposed a causal model which finds related source-target pairs by feeding in perturbed versions of the source sentence. Ma et al. [356] derived relevance scores for NMT by comparing the predictive probability distributions before and after zeroing out a particular source word. See [357] for some general limitations of such post-hoc analyses in NLP.

Model-intrinsic Interpretability
Unlike the black-box methods for post-hoc interpretability, another line of research tries to understand the functions of individual hidden neurons or layers in the NMT network. Different methods have been proposed to visualize the activities or gradients in hidden layers [358][359][360][361]. Belinkov et al. [362] shed some light on NMT's ability to handle morphology by investigating how well a classifier can predict part-of-speech or morphological tags from the last encoder hidden layer. Bau et al. [363], Dalvi et al. [364,365] found individual neurons that capture certain linguistic properties with different forms of regression analysis. Bau et al. [363] were even able to alter the translation (e.g. change the gender) by manipulating the activities in these neurons. Other researchers have focused on the attention layer. Tang et al. [366] suggested that attention at different layers of the Transformer serves different purposes. They also showed that NMT does not use attention as the means for word sense disambiguation. Ghader and Monz [367] provide a detailed analysis of how NMT uses attention to condition on the source sentence.
Figure 19: Word alignment from the English sentence "What's this used for" to the Spanish sentence "para que se usa esto".

Confidence Estimation in Translation
Obtaining word-level or sentence-level confidence scores for translations is not only very useful for practical MT, it also improves the explainability and trustworthiness of the MT system. An obvious candidate for confidence scores from an NMT system are the probabilities the model assigns to tokens or sentences. However, there is some disagreement in the literature on how well NMT models are calibrated [298,368]. Poorly calibrated models do not assign probabilities according to the true data distribution. Such models might still assign high scores to high quality translations, but their output distributions are not a reliable source for deriving word-level confidence scores. While confidence estimation has been explored for traditional SMT [369][370][371], it has received almost no attention since the advent of neural machine translation. The only work on confidence in NMT we are aware of is from Rikters and Fishel [372], Rikters [373] who aim to use attention to estimate word-level confidences.
In contrast, the related field of Quality Estimation for MT enjoys great popularity, with well-attended annual WMT evaluation campaigns that are by now in their seventh edition [374]. Quality estimation aims to find meaningful quality metrics which are more accepted by users and customers than abstract metrics like BLEU [296], and are more correlated to the usefulness of MT in a real-world scenario. Possible applications for quality estimation include estimating post-editing efficiency [375] or selecting sentences in the MT output which need human revision [370].

Word Alignment in Neural Machine Translation
Word alignment is one of the fundamental problems in traditional phrase-based SMT. SMT constructs the target sentence by matching phrases in the source sentence, and combining their translations to form a fluent sentence [181,217]. This approach not only yields a translation, it also produces a word alignment along with it since each target phrase is generated from a unique source phrase. Thus, a word alignment can be seen as an explanation for the produced translation: each target phrase is explained with a link into the source sentence (Fig. 19). Unfortunately, vanilla NMT does not have the notion of a hard word alignment. It is tempting to interpret encoder-decoder attention matrices in neural models (Sec. 6.1) as (soft) alignments, but previous work has found that the attention weights in NMT are often erratic and differ significantly from traditional word alignments:
• "The attention model for NMT does not always fulfill the role of a word alignment model, but may in fact dramatically diverge." [300]
• "We perform extensive experiments across a variety of NLP tasks that aim to assess the degree to which attention weights provide meaningful 'explanations' for predictions. We find that they largely do not." [376]
• "Attention weights are only noisy predictors of even intermediate components' importance, and should not be treated as justification for a decision." [377]
• "Although attention is very useful for understanding the connection between source and target words, only using attention is not sufficient for deep interpretation of target word generation." [360]
• "Attention agrees with traditional alignments to a high degree in the case of nouns. However, it captures other information rather than only the translational equivalent in the case of verbs." [367]
• "Attention visualizations are misleading and should be treated with care when explaining the underlying deep learning system." [378]
Despite considerable consensus about the importance of word alignments for practical machine translation [300], e.g. to enforce constraints on the output [379] or to preserve text formatting, introducing explicit alignment information to NMT is still an open research problem. Word alignments have been used as supervision signal for the NMT attention model [380][381][382][383]. Cohn et al. [384] showed how to reintroduce concepts known from traditional statistical alignment models [270] like fertility and agreement over translation direction to NMT.
Hard attention [82] is a discrete version of the usual soft attention and is thus closer to the concept of a hard alignment. Similar ideas have been explored for speech recognition [385], morphological inflection [386], text summarization [387,388], and image caption generation [82]. Some approaches to simultaneous translation presented in Sec. 7.8 explicitly control for reading source tokens and writing target tokens and thereby generate monotonic hard alignments on the segment level [210,389]. Hybrids between soft and hard attention have been proposed by Choi et al. [390], Shen et al. [391]. However, the usefulness of hard attention for generic offline machine translation is often limited since it usually can only represent monotonic alignments.
Alkhouli et al. [392] used separate alignment and lexical models and thus were able to hypothesize explicit alignment links during decoding. Alignment-based NMT has been extended to multi-head attention by using an additional alignment head [393]. A similar idea was pursued by Zenkel et al. [394] who added an additional alignment layer to the Transformer and, unlike Alkhouli et al. [393], trained it in an unsupervised way. The neural operation sequence model of Stahlberg et al. [171] is another way of generating an alignment along with the translation in NMT.

Extensions to the Transformer Architecture
The Transformer model architecture [76] introduced in Sec. 6.5 has become the de facto standard architecture for neural machine translation because of its superior translation quality on a variety of language pairs [145,146]. 15 The Transformer comes with a number of techniques which set it apart from previous architectures such as multi-head attention, self-attention, large batch training, etc. Some ablation studies in the literature aim to factor out or explain the contributions of these different techniques [123,124,366,396]. Several attempts have been made to improve different aspects of the vanilla model for machine translation, but none has been widely adopted. Most notably, Shaw et al. [125] proposed to embed relative positions rather than absolute ones. A disadvantage of the relative Transformer is the increased computational complexity. The memory keys and values with absolute positions are the same in each decoding step. With relative positioning, however, both have to be recomputed in each time step since the relative positions change over time. The model of Song et al. [397] works with attention masks (Sec. 6.2) to narrow down context. Ahmed et al. [398] proposed to weight the output of attention heads inside multi-head attention. The Star-Transformer [399] thins out inter-layer connections of the standard model to reduce computational complexity. With a similar motivation, Medina and Kalita [400] reported speed-ups by replacing the single deep encoder with multiple shallow encoders.
Some recent research has focused on large-scale language modelling with the Transformer [48,132,401,402,403]. The Transformer is also the starting point for neural architectures for contextualized word embeddings (see Sec. 2) such as BERT [49].

Advanced Attention Models
As shown in Sec. 6.1, the vast majority of current NMT architectures are based on one of three attention types: additive, (scaled) dot-product, or multi-head attention [13,76,77]. In this section, we will outline attempts to improve upon these standard models.
Sec. 10.1 discussed the problem of over- and under-translation, and how coverage models can mitigate this problem by controlling the attention weights with fertilities. Alternatively, researchers have tried to equip the attention layer itself with additional components like a memory [404] or a recurrent network [405,406] to enable it to keep track of the attention history. Choi et al. [407] proposed an attention model that is able to learn different attention weights for each dimension in the values, not only one weight for each value vector.
One potential weakness of the standard models is that they are token-based: the attention output is a weighted average of the values, and the attention weights tend to focus on a single key-value pair. Therefore, there is no explicit mechanism to attend to full phrases rather than subwords or characters. 16 Phrase-based NMT which equips the model with the ability to attend to full phrases or multi-word expressions has been studied by Rikters and Bojar [408], Ishiwatari et al. [409], Feng et al. [410], Huang et al. [411], Li et al. [412], Eriguchi et al. [413].
On the other side of the spectrum, it has been noted that regular attention sometimes spreads out over too many elements, especially when applied over long sequences. The attention output in this case is an average of many values which is naturally more noisy than with sharp attention, and which impedes the propagation of information through the network. Hard attention (Sec. 12.4) removes this sort of noise, but is often restricted to monotonic alignment. Lin et al. [414] proposed to explicitly learn to set the temperature of attention weights to control the softness of attention. Another potential solution has been suggested by Zhang et al. [415] who used GRU gates rather than weighted linear combinations to compute the attention output from the values.
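The effect of an attention temperature can be illustrated with a small sketch. It is not the model of Lin et al. [414], who learn the temperature; here `temperature` is simply a fixed, hypothetical parameter of an illustrative function `attention_weights`, and the scores are made-up numbers.

```python
import math

def attention_weights(scores, temperature=1.0):
    """Softmax over raw attention scores, scaled by a temperature.

    Temperatures below 1 sharpen the distribution, temperatures
    above 1 flatten it. The temperature is a fixed argument here,
    which is an assumption made for illustration only.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5, 0.1]  # illustrative attention energies
soft = attention_weights(scores, temperature=2.0)    # spread out
sharp = attention_weights(scores, temperature=0.25)  # concentrated
```

With the low temperature, most of the probability mass falls on the highest-scoring key, so the attention output is close to a single value vector rather than an average of many.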

Memory-augmented Neural Networks
RNNs are theoretically Turing-complete [416] and thus potentially very powerful models of computation. However, since training is still a challenge (see Sec. 11), even advanced RNN architectures like LSTMs [71] fail to solve certain basic sequence-to-sequence tasks like (repeated) copying or reversal in practice [417,418]. This observation motivated researchers to add external memory structures like a memory tape [419] or a stack [420,421] to the neural network. The basic idea is illustrated in Fig. 20. Besides producing the output sequence, the neural network learns to operate an external data structure. The external memory is not part of the neural network but the network learns to communicate with it through conceptually discrete operations like PUSH and POP. However, in order to train the whole system with a gradient-based optimizer, these discrete operations are often approximated with continuous versions [417,418,422]. Various data structures have been used in combination with neural networks such as (inter alia) stacks [422], (double-ended) queues [418], addressable memory cells [417,423,424], and hierarchical memory structures [425]. Grefenstette et al. [418] suggested that even simple data structures like deques help to solve linguistically motivated tasks like bigram flipping or Inversion Transduction Grammar (ITG) [55] tasks. Research on these kinds of neural network operated data structures still mainly focuses on synthetic tasks like relatively simple algorithmic problems. Initial efforts to apply this line of research to real world problems are limited to neural machine translation [426][427][428][429], sentence simplification [430], and text normalization [431].
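To make the idea of continuous PUSH and POP operations more concrete, the following is a minimal sketch of one possible "soft" stack in which every entry carries a strength between 0 and 1. It is an illustration under simplifying assumptions, not the exact formulation of any of the cited papers: in a real model the push and pop strengths would be produced by the network and the whole update would be differentiable, whereas here they are plain floats. The function names `stack_step` and `stack_read` are hypothetical.

```python
def stack_step(values, strengths, new_value, push, pop):
    """One soft stack update: first pop, then push.

    `pop` units of strength are removed from the top of the stack
    downwards, then `new_value` is appended with strength `push`.
    """
    remaining = pop
    new_strengths = list(strengths)
    for i in range(len(new_strengths) - 1, -1, -1):
        removed = min(new_strengths[i], remaining)
        new_strengths[i] -= removed
        remaining -= removed
    return values + [new_value], new_strengths + [push]

def stack_read(values, strengths):
    """Weighted mix of the topmost entries with total strength at most 1."""
    dim = len(values[0]) if values else 0
    out = [0.0] * dim
    budget = 1.0
    for v, s in zip(reversed(values), reversed(strengths)):
        w = min(s, budget)  # take as much of this entry as the budget allows
        out = [o + w * x for o, x in zip(out, v)]
        budget -= w
    return out

a, b = [1.0, 0.0], [0.0, 1.0]
vals, strs = stack_step([], [], a, push=1.0, pop=0.0)      # PUSH a
vals, strs = stack_step(vals, strs, b, push=1.0, pop=0.0)  # PUSH b
top_after_push = stack_read(vals, strs)
vals, strs = stack_step(vals, strs, [0.0, 0.0], push=0.0, pop=1.0)  # POP
top_after_pop = stack_read(vals, strs)
```

With strengths of exactly 0 or 1 the structure behaves like an ordinary stack; fractional strengths yield a blend of the topmost entries, which is what makes the operations amenable to gradient-based training.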

Beyond Encoder-decoder Networks
All NMT architectures which we have discussed in the previous sections fall in the category of encoder-decoder networks: An encoder network computes a fixed or variable length continuous hidden representation of the source sentence, and a separate decoder network defines a probability distribution over target sentences given that representation. There are some initial efforts in the literature to depart from this overall structure. For example, variational methods that define a distribution over (a part of) the hidden representations have been explored by Zhang et al. [432], Su et al. [433], Bastings et al. [434], Shah and Barber [435]. Non-autoregressive NMT which aims to reduce or remove the sequential dependency on the translation prefix inside the decoder for enhanced parallelizability has been studied by Wang et al. [436], Gu et al. [437], Guo et al. [438], Wang et al. [439], Libovický and Helcl [440], Lee et al. [441], Akoury et al. [442]. Bahar et al. [443], Kaiser and Bengio [444] recomputed the encoder state after each time step and thus effectively expanded the hidden representation into a 2D structure. The architecture proposed by He et al. [445] does not only use the last encoder layer as hidden representation, but instead connects encoder and decoder layers at the same depth via attention.

Data Sparsity
Deep learning methods are notoriously data hungry. For example, traditional statistical machine translation still often outperforms neural machine translation when training data is scarce [169,300]. In this section we will look at the problem of training data sparsity from different angles such as reducing noise in training data (Sec. 14.1), using data from a different domain, or making use of less or no parallel data.

Corpus Filtering
Unfortunately, MT training data is usually inherently noisy as it is often extracted (semi-)automatically by crawling the web [446,447] and therefore commonly contains sentence fragments, wrong languages, misaligned sentence pairs [448], or MT output rather than genuine parallel text [449,450]. In the previous sections we discussed several instances of the use of synthetic noise in NMT. For example, adding noise to the synthetic sentences in back-translation can be beneficial (Sec. 9). Noise can also be used to generate diverse translations (Sec. 7.7) or as a regularizer (Sec. 11.3). However, when discussing the role of noise in NMT it is imperative to carefully differentiate between the various kinds of noise and the ways they impact NMT. Studies have shown that NMT is not robust against naturally occurring noise at training [448] and test [268,451,452,453] time. Robustness at test time can be improved by training on synthetic noise [454,455]. Corpus filtering to reduce the amount of noise in the training data has been widely studied for traditional SMT [456,457], often in the context of domain adaptation [458,459]. More recent research on data filtering focuses on NMT since van der Wees et al. [460] had shown that filtering techniques developed for SMT are less useful for NMT. One of the first approaches to NMT corpus filtering was the method of Carpuat et al. [461] based on semantic analysis. The most effective approaches in the WMT18 shared task on corpus filtering for NMT [462] used a combination of likelihood scores from neural translation models and neural language models which have been trained on clean data [149,463,464]. These criteria prefer sentence pairs which are likely translations of one another according to the translation model [465]. Zhang et al. [466] proposed the exact opposite, arguing that NMT training should concentrate on "difficult" training samples, i.e. samples with low translation probability.
An alternative to hard data filtering called curriculum learning [467] that controls the order of training samples has been applied to NMT by van der Wees et al. [460], Wang et al. [468], Kumar et al. [469], Platanios et al. [470].

Domain Adaptation
There is a robust body of research on domain adaptation for machine translation [471,472]. Popular domain adaptation techniques for both SMT and NMT aim to select [458,459,473,474,475] or weight [474,476,477] samples in a large out-of-domain corpus. Back-translation (Sec. 9) can also be used for domain adaptation by back-translating sentences from an in-domain monolingual corpus. Another simple yet very effective method is to jointly train on in-domain and out-of-domain sentences, possibly with domain-tags to help learning [478][479][480]. Sajjad et al. [481] showed that a simple concatenation of in-domain and out-of-domain corpora can already increase the robustness and generalization of NMT significantly. Khayrallah et al. [482] studied domain adaptation by constraining an NMT system to SMT lattices. Freitag and Al-Onaizan [152] ensembled separately trained general-domain and in-domain models.
Another widely used technique is to train the model on a general domain corpus, and then fine-tune it by continuing training on the in-domain corpus [274,483]. Fine-tuning bears the risk of two negative effects: catastrophic forgetting [484,485] and over-fitting. Catastrophic forgetting occurs when the performance on the specific domain is improved after fine-tuning, but the performance of the model on the general domain has decreased drastically. The risk of over-fitting is connected to the fact that the in-domain corpus is usually very small. Both effects can be mitigated by artificially limiting the learning capabilities of the fine-tuning stage, e.g. by freezing sub-networks [486] or by only learning additional scaling factors for hidden units rather than full weights [487,488]. A very elegant way to prevent over-fitting and catastrophic forgetting is to apply regularizers (Sec. 11.3) to keep the adapted model weights close to their original values. Khayrallah et al. [489], Dakwale and Monz [490] regularized the output distributions using techniques inspired by knowledge distillation (Sec. 16). Miceli Barone et al. [491] applied standard L2 regularization and a variant of dropout to domain adaptation. Elastic weight consolidation (EWC) [492] can be seen as a generalization of L2 regularization that takes the importance of weights (in terms of Fisher information) into account, and has been applied to NMT domain adaptation by Thompson et al. [493], Saunders et al. [494]. In particular, Saunders et al. [494] showed that EWC not only reduces catastrophic forgetting but even yields gains on the general domain when used for fine-tuning on a related domain.
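The relation between L2 regularization towards the original weights and elastic weight consolidation can be sketched in a few lines. This is a simplified illustration on flat lists of weights: the factor 0.5, the argument names, and the assumption that a diagonal Fisher information estimate `fisher` is already available are all choices made for this example, not details taken from the cited papers.

```python
def l2_penalty(weights, original, strength):
    """Penalty that keeps adapted weights close to their original values."""
    return 0.5 * strength * sum(
        (w - o) ** 2 for w, o in zip(weights, original)
    )

def ewc_penalty(weights, original, fisher, strength):
    """Like l2_penalty, but each weight is scaled by its importance.

    `fisher` holds one (assumed, precomputed) importance value per
    weight; with all values equal to 1 this reduces to l2_penalty.
    """
    return 0.5 * strength * sum(
        f * (w - o) ** 2 for w, o, f in zip(weights, original, fisher)
    )

adapted = [1.0, 2.0]    # weights after some fine-tuning steps
general = [0.0, 0.0]    # weights of the general-domain model
importance = [1.0, 0.0] # second weight is deemed unimportant

l2 = l2_penalty(adapted, general, strength=2.0)
ewc = ewc_penalty(adapted, general, importance, strength=2.0)
```

During fine-tuning, such a penalty would be added to the in-domain training loss. Weights with low importance can move freely, while important ones are held near their general-domain values.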

Low-resource NMT
One of the areas in which traditional SMT still often outperforms NMT is low-resource translation [169,300]. However, several techniques have been proposed to improve the performance of NMT under low-resource conditions. In general, the methods discussed in Sec. 9 to leverage monolingual data such as back-translation are particularly effective for low-resource MT. Ren et al. [495] proposed a scheme that could make use of translations from/into the source/target language into/from a third resource-rich language. The transfer-learning approach of Zoph et al. [496] first trains a parent model on a resource-rich language pair (e.g. French-English), and then continues training on the low-resource pair of interest (e.g. Uzbek-English). The effectiveness of transfer-learning depends on the relatedness of the languages [496][497][498][499]. The rapid adaptation of multilingual NMT systems to new low-resource language pairs has been studied by Neubig and Hu [500]. Approaches that do not rely on resources from a third language include Östling and Tiedemann [169] who supervised the generation order of an insertion-based low-resource translation model with word alignments.

Unsupervised NMT
Unsupervised NMT is an extreme case of the low-resource scenario in which not even small amounts of cross-lingual data are available, and the translation system learns entirely from (unrelated) monolingual data. Unsupervised NMT often starts off from an unsupervised cross-lingual word embedding model [502][503][504] that maps word embeddings from the source and the target language into a joint embedding space [505,506]. The translation model is then further refined by iterative back-translation [507,508]. The extract-edit scheme of Wu et al. [509] is an alternative to back-translation for unsupervised NMT that edits a sentence in the monolingual corpus rather than synthesizing it from scratch. Unsupervised NMT has been targeted in recent WMT evaluation campaigns [145,146].

Multilingual NMT
NMT is usually trained to translate a single fixed source language into another fixed target language. Multilingual NMT aims to cover translation directions between multiple languages with a single model. This not only has the potential to exploit similarities across language pairs, but also reduces the number of systems required for all-way translation between a set of languages from quadratic to linear or even one. Multilingual NMT systems can be largely categorized by the components they share between language directions. On one side of the spectrum, the entire neural architecture (both encoder and decoder) can be shared, and source and target languages can be specified by annotating sentences [510] or words [511,512] with language ID tags or embeddings. On the other side of the spectrum, Luong et al. [513] used a separate encoder for each source language and a separate decoder for each target language. Firat et al. [514,515] extended the work of Luong et al. [513] to attentional NMT by sharing the attention mechanism across language directions. Dong et al. [516] studied one-to-many translation with a single encoder but separate decoders for each target language. A potential benefit of multilingual systems is zero-shot translation, i.e. the translation between two languages for which no direct training data is available. 17 Johnson et al. [510] reported reasonable Portuguese→Spanish translation performance of their multilingual system that has been trained on Portuguese↔English and Spanish↔English, although pivoting through English (translate Portuguese to English, and then English to Spanish) worked better. Pivot-based zero-shot translation can be further improved by fine-tuning on a pseudo parallel corpus [154] or by jointly training some components of the source-pivot and pivot-target systems like word embedding matrices [517]. Lu et al. [518] reported gains in zero-shot settings by adding a boldly named "neural interlingual" component between the encoder and the decoder which is shared across language directions. For an assessment of the current capabilities of multilingual and zero-shot translation systems see [519][520][521]. Another form of multilingual NMT is multi-source NMT [155,522], in which the system tries to generate a single translation given sentences in two source languages simultaneously. A problem with this approach is data sparsity as missing source sentences have to be synthesized [523,524] if the training corpus does not provide sentences in all source languages. In a wider context, multi-source architectures can be used for multimodal NMT (Sec. 17.1), morphological inflection [525], zero-shot translation [154], low-resource MT [523], syntax-based NMT [526], document-level MT [527], or bidirectional decoding [172]. Dabre et al. [528] provide an overview of recent trends in multilingual NMT.
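Specifying the target language by annotating sentences with a language ID tag, as mentioned above for fully shared architectures, amounts to a simple preprocessing step. The sketch below assumes a tag of the form `<2xx>` prepended to the source sentence; the exact tag format and the function name `add_target_tag` are assumptions made for illustration.

```python
def add_target_tag(source_sentence, target_lang):
    """Prepend a target-language tag so one model can serve many directions."""
    return f"<2{target_lang}> {source_sentence}"

# The same shared model sees both directions during training.
training_sources = [
    add_target_tag("Bom dia", "en"),       # Portuguese -> English
    add_target_tag("Good morning", "es"),  # English -> Spanish
]

# At test time, a direction never seen in training can be requested
# simply by choosing a different tag (zero-shot translation).
zero_shot_input = add_target_tag("Bom dia", "es")
```

Because the tag is just another token in the source vocabulary, no change to the network architecture is needed.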

NMT Model Size
NMT models usually have hundreds of millions of parameters (Tab. 4). Such large models cause a number of practical issues. GPUs are usually required to run such big models efficiently, but GPUs are expensive and their memory is limited. Smaller models would not only reduce the computational complexity but could also make better use of GPU parallelism by increasing batch sizes. Furthermore, model files require large amounts of disk space which is a problem on mobile platforms. One way to increase the space efficiency of neural models is neural architecture search [132,334]. For example, So et al. [529] found computationally efficient Transformer hyper-parameters by systematic neural architecture search. Rather than optimizing the dimensionality of layers, it is also possible to significantly speed up translation by departing from the usual 32-bit floating point arithmetic and reducing the precision to 8 or 16 bits [530][531][532][533] or by using vector quantization [14,534]. The idea of pruning neural networks to improve the compactness of the models dates back almost 30 years [535]. The literature is therefore vast [536]. One line of research aims to remove unimportant network connections. The connections can be selected for deletion based on the second derivative of the training error with respect to the weight [535,537], or by a threshold criterion on its magnitude [538]. See et al. [539] confirmed a high degree of weight redundancy in NMT networks. Zhu and Gupta [540] demonstrated that large sparse models outperform smaller dense networks with the same memory footprint. Srinivas and Babu [541] proposed to remove neurons which are very similar to another neuron and have small outgoing weights. Stahlberg and Byrne [151] generalized their method to linear combinations of neurons. Babaeizadeh et al. [542] combined pairs of neurons with similar activities during training.
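The threshold criterion on weight magnitude mentioned above is the simplest of these pruning strategies and can be sketched as follows. The weight matrix, the threshold value, and the function names are illustrative assumptions; practical systems typically prune gradually during training and store the result in a sparse format.

```python
def prune_by_magnitude(weights, threshold):
    """Set every connection whose absolute weight is below the threshold to zero."""
    return [
        [w if abs(w) >= threshold else 0.0 for w in row]
        for row in weights
    ]

def sparsity(weights):
    """Fraction of connections that have been removed (are exactly zero)."""
    total = sum(len(row) for row in weights)
    zeros = sum(1 for row in weights for w in row if w == 0.0)
    return zeros / total

layer = [[0.5, -0.01],
         [0.02, -0.8]]  # toy 2x2 weight matrix
pruned = prune_by_magnitude(layer, threshold=0.1)
removed_fraction = sparsity(pruned)
```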
Using low-rank matrices for neural network compression, particularly approximations via Singular Value Decomposition (SVD), has been studied widely in the literature [543][544][545][546][547]. Another approach, known as knowledge distillation, uses a large model (the teacher) to generate soft training labels for a smaller student network [548,549]. The student network is trained by minimizing the cross-entropy to the teacher. This idea has been applied to sequence modelling tasks such as machine translation and speech recognition [190,550,551,552,553,554].
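The training signal in knowledge distillation can be written down for a single prediction step as the cross-entropy between the teacher's output distribution and the student's. The following sketch uses made-up numbers and hypothetical function names; it covers only the token-level loss and leaves out everything else a real training setup needs.

```python
import math

def softmax(logits):
    """Turn unnormalized scores into a probability distribution."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_probs):
    """Cross-entropy of the student's distribution against the teacher's soft labels."""
    student_probs = softmax(student_logits)
    return -sum(
        t * math.log(s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )

teacher = [0.7, 0.2, 0.1]  # soft labels over a toy 3-word vocabulary
good_student = [math.log(p) for p in teacher]  # reproduces the teacher exactly
poor_student = [0.0, 0.0, 5.0]                 # prefers the wrong word

loss_good = distillation_loss(good_student, teacher)
loss_poor = distillation_loss(poor_student, teacher)
```

The loss is minimal when the student reproduces the teacher's distribution, in which case it equals the entropy of the teacher's soft labels.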

Multimodal NMT
Machine translation is usually framed as the isolated transformation of the textual representation of a single sentence in one language into another. Since language is inherently ambiguous, researchers have searched for ways to provide the translation system with more context. For example, if the source sentence describes an image, the image itself potentially carries valuable clues to help the translation process. Multimodal machine translation [555,556] aims to generate an image caption in the target language given both the source language caption and the image itself. The core of most multimodal MT models is a normal text-to-text system which integrates visual information by using global image features extracted with a separate computer vision model [555,556] or via visual attention [557]. Multimodality in translation was the subject of a series of WMT shared tasks [558][559][560]. Calixto and Liu [561] demonstrated the usefulness of visual clues in translation.

Tree-based NMT
The prevalent choice for modeling units in NMT are characters or subword units (Sec. 8.3). This design decision is not linguistically motivated but rather stems from the difficulty of extending NMT to an open vocabulary. From the linguistic perspective, however, translation is better viewed as the transformation of larger elements in the sentence such as words, phrases, or even syntactic structures.
Various attempts have been made to introduce structures such as syntactic constituency trees or dependency trees both on the source and the target side of NMT. A popular approach is to retain the sequence-to-sequence architecture and linearize the tree structures, for example using bracket expressions [526,562,563,564], sequences of rules [329], or CCG supertags [565]. Ma et al. [566], Zaremoodi and Haffari [567] developed a linearization of packed forests that represent multiple source sentence parses. Saunders et al. [329] reported gains by ensembling different linearization strategies of target-side syntax trees. Recurrent neural network grammars [568] that represent syntactic parse trees as sequences of actions were applied to machine translation by Eriguchi et al. [569], Bradbury and Socher [570]. Using actions to build target-side tree structures is also central to the tree-based decoders of Wang et al. [571], Wu et al. [572]. Akoury et al. [442] used syntax to speed up decoding by first predicting a parse tree, and then predicting all target tokens in parallel. Tree-LSTMs [573] make it possible to represent a tree structure directly with the neural network architecture. They are a generalization of recurrent LSTM cells (Sec. 6.3) that replaces the single input of a standard LSTM cell (usually from the previous time step) with multiple input connections, one from each child node. Thus, each Tree-LSTM cell represents a node in the tree, and the root node contains a fixed-length vector encoding of the whole tree structure. Tree-LSTMs have been applied to syntax-based NMT [574][575][576]. An alternative to Tree-LSTMs was proposed by Shen et al. [577] who rearranged neurons in an LSTM network to resemble a block representation of the tree. Bastings et al. [578], Chen et al. [579] used convolutional encoders to represent a dependency graph in the source sentence. Chen et al. [580] biased encoder-decoder attention weights with syntactic clues.
Unsupervised tree-based methods have been studied by Kim et al. [581], Maillard et al. [582], Williams et al. [583].
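The linearization of trees into bracket expressions mentioned above can be illustrated with a short recursive function. The tree encoding as nested tuples, the label set, and the function name `linearize` are assumptions made for this sketch; the cited works differ in the details of their bracket notation.

```python
def linearize(tree):
    """Convert a constituency tree into a flat bracket expression.

    A tree is either a word (str) or a tuple whose first element is
    the node label and whose remaining elements are the children.
    """
    if isinstance(tree, str):
        return tree
    label, *children = tree
    return "(" + label + " " + " ".join(linearize(c) for c in children) + ")"

tree = ("S",
        ("NP", "She"),
        ("VP", ("V", "reads"), ("NP", "books")))

tokens = linearize(tree).split()
# tokens can now be fed to an unmodified sequence-to-sequence model
```

For the example tree, `linearize(tree)` returns `(S (NP She) (VP (V reads) (NP books)))`, so the syntactic structure is exposed to the model purely through additional bracket and label tokens.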

NMT with Graph Structured Input
As a generalization of the tree-based approaches discussed in the previous section, lattice-based NMT allows more general graph structures on the input side to provide a richer description of the source sentence. Lattices can represent uncertainty of upstream components such as speech recognizers [584] or tokenizers [585,586]. Lattices have also been used to augment the input with external knowledge sources such as knowledge graphs [587,588] or semantic predicate-argument structures [589]. Factors are another way of providing more information to the translation system. Factors describe a word by a tuple consisting of its lemma and various linguistic information (prefix, suffix, part-of-speech etc.) rather than its surface form. This technique is popular for traditional statistical machine translation [181,590], and has been applied to neural machine translation both on the input [591] and the output [592,593] side.

Document-level Translation
MT systems usually translate sentences in isolation. However, there is evidence that humans also take context into account, and rate translations from humans with access to the full document higher than the output of a state-of-the-art sentence-level machine translation system [594]. Common examples of ambiguity which can be resolved with cross-sentence context are pronoun prediction or coherency in lexical choice.
Various techniques have been proposed to provide the translation system with inter-sentential context, for example by initializing encoder or decoder states [595], using multi-source encoders [527,596], as additional decoder input [595], with memory-augmented neural networks [597][598][599], a document-level LM [600], hierarchical attention [601,602], deliberation networks [603], or by simply concatenating multiple source and/or target sentences [527,604]. Context-aware extensions to Transformer encoders have been proposed by Voita et al. [605], Zhang et al. [606]. Techniques also differ in whether they use source context only [595,596,600,605,606], target context only [597,599], or both [527,598,601,602,604]. Several studies on document-level NMT indicate that automatic and human sentence-level evaluation metrics often do not correlate well with improvements in discourse-level phenomena [527,594,607].
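The simplest of these techniques, concatenating neighbouring sentences, requires only a change to the input pipeline. The sketch below prepends a fixed number of preceding source sentences; the separator token `<SEP>`, the window size, and the function name `with_context` are assumptions made for illustration.

```python
def with_context(sentences, index, window=1, sep="<SEP>"):
    """Prefix sentence `index` with up to `window` preceding sentences."""
    start = max(0, index - window)
    context = sentences[start:index]
    return f" {sep} ".join(context + [sentences[index]])

document = [
    "He bought a bat.",
    "It was made of wood.",
]

# The second sentence is ambiguous in isolation ("It"); the first
# sentence supplies the context needed to resolve it.
model_input = with_context(document, 1)
```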

NMT-SMT Hybrid Systems
Neural models were increasingly used as features in traditional SMT until NMT evolved as a new paradigm. Without question, NMT has become the prevalent approach to machine translation in recent years. There is a large body of research comparing NMT and SMT (Tab. 6). Most studies have found superior overall translation quality of NMT models in most settings, but complementary strengths of both paradigms. Therefore, the literature about hybrid NMT-SMT systems is also vast. We distinguish between two categories of approaches for blending SMT and NMT.
+ Better handles a variety of linguistic phenomena than SMT [609,610,615].
− Neural models do not perform as well as specialized symbolic models on several monotone seq2seq tasks [616].
+ Translation quality degrades less on very long sentences than NMT [608,609].
+ Fewer errors in the translation of proper nouns [610].
• NMT and SMT require comparable amounts of (document-level) post-editing [611,621].
Approaches in the first category do not employ a full SMT system but borrow only key ideas or components from SMT to address specific issues in NMT. It is straightforward to combine NMT scores with other features normally used in SMT (like language models) in a log-linear model [271,303]. 18 Conventional symbolic SMT-style lexical translation tables can be incorporated into the NMT decoder by using the soft alignment weights of the standard NMT attention model [139,303,622,623,624]. Cohn et al. [384] proposed to enhance the attention model in NMT by implementing basic concepts from the original word alignment models [270,625] like fertility and relative distortion.
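A log-linear combination of an NMT score with other SMT-style features reduces to a weighted sum of log-domain feature values per hypothesis. The feature names, weights, and scores in the following sketch are invented for illustration; in practice the weights would be tuned on a development set.

```python
def log_linear_score(features, weights):
    """Weighted sum of log-domain feature values for one hypothesis."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical log-probabilities from an NMT model and a language model.
hypotheses = {
    "the house": {"nmt": -1.2, "lm": -3.0},
    "house the": {"nmt": -1.0, "lm": -6.0},
}
weights = {"nmt": 1.0, "lm": 0.3}  # assumed feature weights

best = max(hypotheses, key=lambda h: log_linear_score(hypotheses[h], weights))
```

In this toy example the NMT model alone slightly prefers "house the", but the language model feature tips the combined score towards "the house".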
The second category of hybrid systems is related to system combination. The idea is to combine a fully trained SMT system with an independently trained NMT system. Popular examples in this category are rescoring and reranking methods [235,482,626,627,628,629,630], although these models may be too constraining if the neural system is much stronger. Stahlberg et al. [631] proposed a finite state transducer based loose combination scheme that combines NMT and SMT translations via an edit distance based loss. The minimum Bayes risk (MBR) based approach of Stahlberg et al. [199] biases an unconstrained NMT decoder towards n-grams which are likely according to the SMT system, and therefore also does not constrain the system to the SMT search space. MBR-based combination of NMT and SMT has been used in WMT evaluation systems [126,600] and in the industry [198]. NMT and SMT can also be combined in a cascade, with SMT providing the input to a post-processing NMT system [632,633] or vice versa [634]. Wang et al. [635,636] interpolated NMT posteriors with word recommendations from SMT and jointly trained NMT together with a gating function which assigns the weight between SMT and NMT scores dynamically. The AMU-UEDIN submission to WMT16 let SMT take the lead and used NMT as a feature in phrase-based MT [159]. In contrast, Long et al. [637] translated most of the sentence with an NMT system, and just used SMT to translate technical terms in a post-processing step. Dahlmann et al. [638] proposed a hybrid search algorithm in which the neural decoder expands hypotheses with phrases from an SMT system. SMT can also be used as a regularizer in unsupervised NMT [508].

Conclusion
Neural machine translation (NMT) has become the de facto standard for large-scale machine translation in a very short period of time. This article traced back the origin of NMT to word and sentence embeddings and neural language models. We reviewed the most commonly used building blocks of NMT architectures (recurrence, convolution, and attention) and discussed popular concrete architectures such as RNNsearch, GNMT, ConvS2S, and the Transformer. We discussed the advantages and disadvantages of several important design choices that have to be made to design a good NMT system with respect to decoding, training, and segmentation. We then explored advanced topics in NMT research such as explainability and data sparsity.