arXiv:1508.04025v5 [cs.CL] 20 Sep 2015 used to improve neural - PDF document

Effective Approaches to Attention-based Neural Machine Translation Minh-Thang Luong Hieu Pham Christopher D. Manning Computer Science Department, Stanford University, Stanford, CA 94305 { lmthang,hyhieu,manning } @stanford.edu Abstract X Y Z <eos> An attentional mechanism has lately been arXiv:1508.04025v5 [cs.CL] 20 Sep 2015 used to improve neural machine translation (NMT) by selectively focusing on parts of the source sentence during translation. However, there has been little work exploring useful architectures for attention-based NMT. This paper exam- A B C D <eos> X Y Z ines two simple and effective classes of at- Figure 1: Neural machine translation – a stack- tentional mechanism: a global approach ing recurrent architecture for translating a source which always attends to all source words sequence A B C D into a target sequence X Y and a local one that only looks at a subset Z . Here, < eos > marks the end of a sentence. of source words at a time. We demonstrate the effectiveness of both approaches on the WMT translation tasks between English emitting one target word at a time, as illustrated in and German in both directions. With local Figure 1. NMT is often a large neural network that attention, we achieve a significant gain of is trained in an end-to-end fashion and has the abil- 5.0 BLEU points over non-attentional sys- ity to generalize well to very long word sequences. tems that already incorporate known tech- This means the model does not have to explicitly niques such as dropout. Our ensemble store gigantic phrase tables and language models model using different attention architec- as in the case of standard MT; hence, NMT has tures yields a new state-of-the-art result in a small memory footprint. Lastly, implementing the WMT’15 English to German transla- NMT decoders is easy unlike the highly intricate tion task with 25.9 BLEU points, an im- decoders in standard MT (Koehn et al., 2003). provement of 1.0 BLEU points over the In parallel, the concept of “attention” has existing best system backed by NMT and gained popularity recently in training neural net- an n -gram reranker. 1 works, allowing models to learn alignments be- 1 Introduction tween different modalities, e.g., between image objects and agent actions in the dynamic con- Neural Machine Translation (NMT) achieved trol problem (Mnih et al., 2014), between speech state-of-the-art performances in large-scale trans- frames and text in the speech recognition task ( ? ), or between visual features of a picture and lation tasks such as from English to French (Luong et al., 2015) and English to German its text description in the image caption gener- (Jean et al., 2015). NMT is appealing since it re- ation task (Xu et al., 2015). In the context of quires minimal domain knowledge and is concep- NMT, Bahdanau et al. (2015) has successfully ap- tually simple. The model by Luong et al. (2015) plied such attentional mechanism to jointly trans- reads through all the source words until the end-of- late and align words. To the best of our knowl- sentence symbol < eos > is reached. It then starts edge, there has not been any other work exploring the use of attention-based architectures for NMT. 1 All our code and models are publicly available at http://nlp.stanford.edu/projects/nmt . In this work, we design, with simplicity and ef-

fectiveness in mind, two novel types of attention- recurrent neural network (RNN) architec- based models: a global approach in which all ture, which most of the recent NMT work source words are attended and a local one whereby such as (Kalchbrenner and Blunsom, 2013; only a subset of source words are considered at a Sutskever et al., 2014; Cho et al., 2014; time. The former approach resembles the model Bahdanau et al., 2015; Luong et al., 2015; of (Bahdanau et al., 2015) but is simpler architec- Jean et al., 2015) have in common. They, how- turally. The latter can be viewed as an interesting ever, differ in terms of which RNN architectures blend between the hard and soft attention models are used for the decoder and how the encoder proposed in (Xu et al., 2015): it is computation- computes the source sentence representation s . ally less expensive than the global model or the Kalchbrenner and Blunsom (2013) used an soft attention; at the same time, unlike the hard at- RNN with the standard hidden unit for the tention, the local attention is differentiable almost decoder and a convolutional neural network for everywhere, making it easier to implement and encoding the source sentence representation. On train. 2 Besides, we also examine various align- the other hand, both Sutskever et al. (2014) and ment functions for our attention-based models. Luong et al. (2015) stacked multiple layers of an Experimentally, we demonstrate that both of RNN with a Long Short-Term Memory (LSTM) our approaches are effective in the WMT trans- hidden unit for both the encoder and the decoder. Cho et al. (2014), Bahdanau et al. (2015), and lation tasks between English and German in both directions. Our attentional models yield a boost Jean et al. (2015) all adopted a different version of the RNN with an LSTM-inspired hidden unit, the of up to 5.0 BLEU over non-attentional systems gated recurrent unit (GRU), for both components. 4 which already incorporate known techniques such In more detail, one can parameterize the proba- as dropout. For English to German translation, we achieve new state-of-the-art (SOTA) results bility of decoding each word y j as: for both WMT’14 and WMT’15, outperforming p ( y j | y <j , s ) = softmax ( g ( h j )) (2) previous SOTA systems, backed by NMT models and n -gram LM rerankers, by more than 1.0 with g being the transformation function that out- BLEU. We conduct extensive analysis to evaluate puts a vocabulary-sized vector. 5 Here, h j is the our models in terms of learning, the ability to han- RNN hidden unit, abstractly computed as: dle long sentences, choices of attentional architectures, alignment quality, and translation outputs. h j = f ( h j − 1 , s ) , (3) 2 Neural Machine Translation where f computes the current hidden state given the previous hidden state and can be A neural machine translation system is a neural either a vanilla RNN unit, a GRU, or an LSTM network that directly models the conditional prob- unit. In (Kalchbrenner and Blunsom, 2013; ability p ( y | x ) of translating a source sentence, x 1 , . . . , x n , to a target sentence, y 1 , . . . , y m . 3 A Sutskever et al., 2014; Cho et al., 2014; Luong et al., 2015), the source representa- basic form of NMT consists of two components: tion s is only used once to initialize the (a) an encoder which computes a representation s decoder hidden state. On the other hand, in for each source sentence and (b) a decoder which (Bahdanau et al., 2015; Jean et al., 2015) and generates one target word at a time and hence de- this work, s , in fact, implies a set of source composes the conditional probability as: hidden states which are consulted throughout the � m entire course of the translation process. Such an log p ( y | x ) = j =1 log p ( y j | y <j , s ) (1) approach is referred to as an attention mechanism, which we will discuss next. A natural choice to model such a de- In this work, following (Sutskever et al., 2014; composition in the decoder is to use a Luong et al., 2015), we use the stacking LSTM 2 There is a recent work by Gregor et al. (2015), which is architecture for our NMT systems, as illustrated very similar to our local attention and applied to the image 4 They all used a single RNN layer except for the latter two generation task. However, as we detail later, our model is much simpler and can achieve good performance for NMT. works which utilized a bidirectional RNN for the encoder. 3 All sentences are assumed to terminate with a special 5 One can provide g with other inputs such as the currently “end-of-sentence” token < eos > . predicted word y j as in (Bahdanau et al., 2015).

arXiv:1508.04025v5 [cs.CL] 20 Sep 2015 used to improve neural - PDF document

Effective Approaches to Attention-based Neural Machine Translation Minh-Thang Luong Hieu Pham Christopher D. Manning Computer Science Department, Stanford University, Stanford, CA 94305 { lmthang,hyhieu,manning } @stanford.edu Abstract X Y Z

Introductiontothelarge chargeexpansion Domenico Orlando Introduction Whos who S. Reffert

Michael Duff Imperial College London based on [arXiv:1301.4176 arXiv:1309.0546 arXiv:1312.6523

Introductiontothelarge chargeexpansion Domenico Orlando Introduction Whos who S. Reffert

Interim report Jan-Sep 2007 Income statement Jan-Sep 2007 (Mkr) Jan-Sep Jan-Sep 2007 2006

arXiv:1508.03133 Jonas Lippuner Luke Roberts MICRA 2015 Stockholm, August 17 21, 2105 Solar

NG is NG? Koji Hashimoto (Osaka U) w/ Minoru Eto (Yamagata U) ArXiv:1508.00433 Is internal space

Alargecharge torulestrongcoupling Domenico Orlando Introduction Whos who S. Reffert (AEC

DM models with two mediators. How to save the WIMP Michael Duerr MU Programmtag 2016 Mainz, 12

mesino oscillations AKSHAY GHALSASI, DAVE MCKEEN, ANN NELSON arxiv:1508.05392 The one minute

The Entropy of a Hole in Space-Time Based on: arXiv:1305.0856, arXiv:1310.4204, arXiv:1406.nnnn

STUDY INTERSECTION NOT TO SCALE DATA COLLECTION ATR Count Period: Sep 3 Sep 6, 2019 TMC

Alpha-bits, Teleportation and Black Holes ArXiv:1706.09434, ArXiv:1807.06041 Geoffrey Penington,

Conformal blocks from AdS Per Kraus (UCLA) Based on: Hijano, PK, Snively 1501.02260 Hijano, PK,

Logic Programming Alan Smaill Sep 21 2015 Alan Smaill Logic Programming Sep 21 2015 1/31

Orbital dynamics from double copy and EFT Talk at Radcor Conference, Avignon, France, 11 Sep

Using direct stop searches at ATLAS to constrain the parameter space of supersymmetric models

Learning outcomes Project: Supporting and Developing the Structures for the Quality Assurance

Finance Capita tal and Greece Walden Bello, TNI/Focus on th the Global South th University ty

Frederik De Decker Head International Relations Office THE IMPACT OF QUALIFICATIONS FRAMEWORKS

NEXT GENERATION TRADE AND INVESTMENT AGREEMENTS: UPCOMING CHALLENGES FOR PUBLIC SERVICES 19

Exploring critical incidents in assessment Jen Harvey (DIT) and Marion Palmer (IADT) 5 May 2011

Maritime Security and Geo-Politics: The Dialogue b/t Vital Interests & Core

TED talks have changed presentations (or changed them back remember 35mm slides?) ? slides at

Geneva Evaluation Network (GEN) 8 June 2016 Scott Chaplowe, Senior M&E Advisor

arXiv:1508.04025v5 [cs.CL] 20 Sep 2015 used to improve neural - PDF document

Effective Approaches to Attention-based Neural Machine Translation Minh-Thang Luong Hieu Pham Christopher D. Manning Computer Science Department, Stanford University, Stanford, CA 94305 { lmthang,hyhieu,manning } @stanford.edu Abstract X Y Z

Introductiontothelarge chargeexpansion Domenico Orlando Introduction Whos who S. Reffert

Michael Duff Imperial College London based on [arXiv:1301.4176 arXiv:1309.0546 arXiv:1312.6523

Introductiontothelarge chargeexpansion Domenico Orlando Introduction Whos who S. Reffert

Interim report Jan-Sep 2007 Income statement Jan-Sep 2007 (Mkr) Jan-Sep Jan-Sep 2007 2006

arXiv:1508.03133 Jonas Lippuner Luke Roberts MICRA 2015 Stockholm, August 17 21, 2105 Solar

NG is NG? Koji Hashimoto (Osaka U) w/ Minoru Eto (Yamagata U) ArXiv:1508.00433 Is internal space

Alargecharge torulestrongcoupling Domenico Orlando Introduction Whos who S. Reffert (AEC

DM models with two mediators. How to save the WIMP Michael Duerr MU Programmtag 2016 Mainz, 12

mesino oscillations AKSHAY GHALSASI, DAVE MCKEEN, ANN NELSON arxiv:1508.05392 The one minute

The Entropy of a Hole in Space-Time Based on: arXiv:1305.0856, arXiv:1310.4204, arXiv:1406.nnnn

STUDY INTERSECTION NOT TO SCALE DATA COLLECTION ATR Count Period: Sep 3 Sep 6, 2019 TMC

Alpha-bits, Teleportation and Black Holes ArXiv:1706.09434, ArXiv:1807.06041 Geoffrey Penington,

Conformal blocks from AdS Per Kraus (UCLA) Based on: Hijano, PK, Snively 1501.02260 Hijano, PK,

Logic Programming Alan Smaill Sep 21 2015 Alan Smaill Logic Programming Sep 21 2015 1/31

Orbital dynamics from double copy and EFT Talk at Radcor Conference, Avignon, France, 11 Sep

Using direct stop searches at ATLAS to constrain the parameter space of supersymmetric models

Learning outcomes Project: Supporting and Developing the Structures for the Quality Assurance

Finance Capita tal and Greece Walden Bello, TNI/Focus on th the Global South th University ty

Frederik De Decker Head International Relations Office THE IMPACT OF QUALIFICATIONS FRAMEWORKS

NEXT GENERATION TRADE AND INVESTMENT AGREEMENTS: UPCOMING CHALLENGES FOR PUBLIC SERVICES 19

Exploring critical incidents in assessment Jen Harvey (DIT) and Marion Palmer (IADT) 5 May 2011

Maritime Security and Geo-Politics: The Dialogue b/t Vital Interests &amp; Core

TED talks have changed presentations (or changed them back remember 35mm slides?) ? slides at

Geneva Evaluation Network (GEN) 8 June 2016 Scott Chaplowe, Senior M&amp;E Advisor

Maritime Security and Geo-Politics: The Dialogue b/t Vital Interests & Core

Geneva Evaluation Network (GEN) 8 June 2016 Scott Chaplowe, Senior M&E Advisor