

SLIDE 1

Neural Machine Translation

Gongbo Tang

8 October 2018

SLIDE 2

Outline

1. Neural Machine Translation
2. Advances and Challenges

SLIDE 3

Neural Machine Translation

Figure – Recurrent neural network based NMT model

From Thang Luong's thesis on Neural Machine Translation

SLIDE 4

Neural Machine Translation

Figure – NMT model with attention mechanism

Example from Rico Sennrich's EACL 2017 NMT talk

SLIDE 5

Modelling Translation

Suppose that we have:

a source sentence S of length m: (x_1, \dots, x_m)
a target sentence T of length n: (y_1, \dots, y_n)

We can express translation as a probabilistic model:

T^* = \arg\max_T p(T|S)

Expanding using the chain rule gives:

p(T|S) = p(y_1, \dots, y_n \mid x_1, \dots, x_m) = \prod_{i=1}^{n} p(y_i \mid y_1, \dots, y_{i-1}, x_1, \dots, x_m)

SLIDE 6

Modelling Translation

Target-side language model:

p(T) = \prod_{i=1}^{n} p(y_i \mid y_1, \dots, y_{i-1})

Translation model:

p(T|S) = \prod_{i=1}^{n} p(y_i \mid y_1, \dots, y_{i-1}, x_1, \dots, x_m)

We could just treat the sentence pair as one long sequence, but:

we do not care about p(S)
we may want a different vocabulary and network architecture for the source text
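As a quick illustration of the factorisation, here is a minimal Python sketch with assumed toy probabilities; real decoders sum log-probabilities for numerical stability rather than multiplying raw ones:

```python
import math

# Hypothetical per-token probabilities p(y_i | y_1..y_{i-1}, x_1..x_m)
# produced by a decoder softmax for a 4-token target sentence.
token_probs = [0.62, 0.48, 0.91, 0.99]

# Chain rule: p(T|S) is the product of the factors; summing logs is stabler.
log_p = sum(math.log(p) for p in token_probs)
print(f"p(T|S) = {math.exp(log_p):.4f}")  # 0.62 * 0.48 * 0.91 * 0.99
```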

SLIDE 7

Attentional Encoder-Decoder: Maths

Simplifications of the model by [Bahdanau et al., 2015] (for illustration):

plain RNN instead of GRU
simpler output layer
we do not show bias terms
decoder follows the Look, Update, Generate strategy [Sennrich et al., 2017]
details in https://github.com/amunmt/amunmt/blob/master/contrib/notebooks/dl4mt.ipynb

Notation: W, U, E, C, V are weight matrices (of different dimensionality):

E: one-hot to embedding (e.g. 50000 × 512)
W: embedding to hidden (e.g. 512 × 1024)
U: hidden to hidden (e.g. 1024 × 1024)
C: context (2× hidden) to hidden (e.g. 2048 × 1024)
V_o: hidden to one-hot (e.g. 1024 × 50000)
separate weight matrices for encoder and decoder (e.g. E_x and E_y)
input X of length T_x; output Y of length T_y

Example from Rico Sennrich's EACL 2017 NMT talk

SLIDE 8

Encoder

Figure – NMT model with attention mechanism

Example from Rico Sennrich's EACL 2017 NMT talk

SLIDE 9

Encoder

Encoder with bidirectional Recurrent Neural Networks:

input word embeddings
left-to-right recurrent NN
right-to-left recurrent NN

SLIDE 10

Encoder

Encoder with bidirectional Recurrent Neural Networks:

input word embeddings
left-to-right recurrent NN
right-to-left recurrent NN

\overrightarrow{h}_j = \begin{cases} 0 & \text{if } j = 0 \\ \tanh(\overrightarrow{W}_x E_x x_j + \overrightarrow{U}_x \overrightarrow{h}_{j-1}) & \text{if } j > 0 \end{cases}

\overleftarrow{h}_j = \begin{cases} 0 & \text{if } j = T_x + 1 \\ \tanh(\overleftarrow{W}_x E_x x_j + \overleftarrow{U}_x \overleftarrow{h}_{j+1}) & \text{if } j \le T_x \end{cases}

h_j = (\overrightarrow{h}_j, \overleftarrow{h}_j)
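A minimal NumPy sketch of these encoder equations, assuming toy dimensions and random (untrained) weights:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, emb, hid, Tx = 100, 8, 16, 5               # illustrative sizes

E_x = rng.normal(0, 0.1, (emb, vocab))            # one-hot -> embedding
W_f, W_b = (rng.normal(0, 0.1, (hid, emb)) for _ in range(2))
U_f, U_b = (rng.normal(0, 0.1, (hid, hid)) for _ in range(2))

x = rng.integers(0, vocab, Tx)                    # source token ids

h_f = np.zeros((Tx + 1, hid))                     # h_f[0] covers the j = 0 case
for j in range(1, Tx + 1):
    h_f[j] = np.tanh(W_f @ E_x[:, x[j - 1]] + U_f @ h_f[j - 1])

h_b = np.zeros((Tx + 2, hid))                     # h_b[Tx+1] covers j = Tx + 1
for j in range(Tx, 0, -1):
    h_b[j] = np.tanh(W_b @ E_x[:, x[j - 1]] + U_b @ h_b[j + 1])

# h_j concatenates both directions, as in the last equation above.
h = np.concatenate([h_f[1:], h_b[1:Tx + 1]], axis=1)   # shape (Tx, 2*hid)
print(h.shape)
```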

SLIDE 11

Decoder

Figure – NMT model with attention mechanism

Example from Rico Sennrich's EACL 2017 NMT talk

SLIDE 12

Decoder

Figure – Decoder network: context vectors c_{i-1}, c_i; states s_{i-1}, s_i; intermediate states t_{i-1}, t_i for word prediction; selected words y_{i-1}, y_i and their embeddings E y_{i-1}, E y_i

SLIDE 13

Decoder

Figure – Decoder network (as on the previous slide)

s_i = \begin{cases} \tanh(W_s \overleftarrow{h}_1) & \text{if } i = 0 \\ \tanh(W_y E_y y_{i-1} + U_y s_{i-1} + C c_i) & \text{if } i > 0 \end{cases}

t_i = \tanh(U_o s_i + W_o E_y y_{i-1} + C_o c_i)

y_i = \mathrm{softmax}(V_o t_i)
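The same equations as a NumPy sketch (toy shapes, random untrained weights; `c_i` would come from the attention mechanism introduced below):

```python
import numpy as np

rng = np.random.default_rng(1)
emb, hid, ctx, vocab = 8, 16, 32, 100     # illustrative sizes
W_y, U_y, C = (rng.normal(0, 0.1, s) for s in [(hid, emb), (hid, hid), (hid, ctx)])
U_o, W_o, C_o = (rng.normal(0, 0.1, s) for s in [(hid, hid), (hid, emb), (hid, ctx)])
V_o = rng.normal(0, 0.1, (vocab, hid))
E_y = rng.normal(0, 0.1, (emb, vocab))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decoder_step(y_prev, s_prev, c_i):
    """One Look, Update, Generate step: next state s_i and p(y_i | ...)."""
    s_i = np.tanh(W_y @ E_y[:, y_prev] + U_y @ s_prev + C @ c_i)
    t_i = np.tanh(U_o @ s_i + W_o @ E_y[:, y_prev] + C_o @ c_i)
    return s_i, softmax(V_o @ t_i)

s, p = decoder_step(y_prev=3, s_prev=np.zeros(hid), c_i=rng.normal(0, 1, ctx))
print(p.argmax(), p.max())   # most probable next word under random weights
```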


SLIDE 15

Decoder

Training: y_i is known; the training objective is to assign higher probability to the correct output word. One common cost function is the negative log of the probability given to the correct word translation:

cost = -\log t_i[y_i]

Inference: y_i is unknown; we compute the probability distribution over the whole vocabulary.

Greedy search: select the word with the highest probability.
Beam search: keep the top k most likely word choices.
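A compact sketch of beam search over a hypothetical `step` function that maps a token prefix to next-token log-probabilities; greedy search is the special case k = 1:

```python
import heapq
import numpy as np

def beam_search(step, bos, eos, k=3, max_len=20):
    """step(prefix) -> log-probability vector over the vocabulary."""
    beams = [(0.0, [bos])]                        # (cumulative log-prob, prefix)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            log_probs = step(prefix)
            for y in np.argsort(log_probs)[-k:]:  # top-k extensions per beam
                candidates.append((score + float(log_probs[y]), prefix + [int(y)]))
        beams = heapq.nlargest(k, candidates)     # prune to the k best prefixes
        finished += [b for b in beams if b[1][-1] == eos]
        beams = [b for b in beams if b[1][-1] != eos]
        if not beams:
            break
    return max(finished or beams)                 # best-scoring hypothesis
```

Longer hypotheses accumulate more negative log-probability, so practical systems usually length-normalise the scores before comparing finished hypotheses.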

SLIDE 16

Decoding

Figure – Greedy search over the target sentence "hello world ! <eos>": each step selects the single most probable word (0.946, 0.957, 0.928, 0.999), with the running negative log-probability of the hypothesis (0.056, 0.100, 0.175, 0.175)

SLIDE 17

Decoding

Figure – Beam search with K = 3: at each step the top 3 partial hypotheses are kept (e.g. hello / HI / Hey, then world / World / ...), each scored by cumulative negative log-probability; the best complete hypothesis is "hello world ! <eos>"

SLIDE 18

Attention

Figure – NMT model with attention mechanism

Example from Rico Sennrich's EACL 2017 NMT talk

SLIDE 19

Attention

Figure – Attention network: the encoder hidden states h and the previous decoder state s_{i-1} feed the attention network, whose softmax produces weights α_i and a context vector c_i for the decoder's predictions

SLIDE 20

Attention

Figure – Attention network (as on the previous slide)

e_{ij} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)

\alpha_{ij} = \mathrm{softmax}(e_{ij})

c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j
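These three equations in NumPy, with illustrative shapes and random weights:

```python
import numpy as np

rng = np.random.default_rng(2)
Tx, hid, ctx = 5, 16, 32                 # toy dimensions
W_a = rng.normal(0, 0.1, (hid, hid))
U_a = rng.normal(0, 0.1, (hid, ctx))
v_a = rng.normal(0, 0.1, hid)

s_prev = rng.normal(0, 1, hid)           # decoder state s_{i-1}
h = rng.normal(0, 1, (Tx, ctx))          # encoder annotations h_1..h_Tx

e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in h])
alpha = np.exp(e - e.max()); alpha /= alpha.sum()   # softmax over positions
c_i = alpha @ h                                     # weighted sum of annotations
print(alpha.round(3), c_i.shape)         # weights sum to 1; c_i has shape (ctx,)
```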

SLIDE 21

Overview of NMT

Pros:
more fluent translation
fewer lexical errors
fewer word order errors
fewer morphology errors

Cons:
expensive computation
over-translation and under-translation (adequacy)
bad at translating long sentences
needs more data
black box


SLIDE 23

Advances and Challenges

attention mechanism
model architectures
monolingual data
models at different levels
linguistic features
modelling coverage
domain adaptation
transfer learning
what is NMT not good at?


SLIDE 32

Attention Mechanism and Alignment

From "What does Attention in Neural Machine Translation Pay Attention to?"

SLIDE 33

Attention Mechanism and Alignment

POS tag   % attention to alignment points   % attention to other words
NUM       73                                27
NOUN      68                                32
ADJ       66                                34
PUNC      55                                45
ADV       50                                50
CONJ      50                                50
VERB      49                                51
ADP       47                                53
DET       45                                55
PRON      45                                55
PRT       36                                64
Overall   54                                46

Table 8 – Distribution of attention probability mass (in %) over alignment points and the rest of the words for each POS tag.

From "What does Attention in Neural Machine Translation Pay Attention to?"

SLIDE 34

Attention Mechanism and Alignment

Figure – Attention weight matrices: (a) Desired Alignment, for "relations between Obama and Netanyahu have been strained for years." ↔ "die Beziehungen zwischen Obama und Netanjahu sind seit Jahren angespannt."; (b) Mismatched Alignment, for "the relationship between Obama and Netanyahu has been stretched for years." ↔ "das Verhältnis zwischen Obama und Netanyahu ist seit Jahren gespannt."

SLIDE 35

Attention and Unknown Words

1. Find the corresponding source word
2. Look up its translation in a dictionary

Example from Zhaopeng Tu's tutorial; method from "On Using Very Large Target Vocabulary for Neural Machine Translation"

SLIDE 37

Attention Mechanisms

Figure – (a) Vanilla attention mechanism: encoder hidden states h and decoder state s_{t-1} feed a single attention network producing weights α_t and context c_t. (b) Advanced attention mechanism: multi-head attention inside a stack of n decoder blocks, each attending over the encoder hidden states

SLIDE 38

Transformer Attention and Vanilla Attention

Figure – Attention weight matrices for the sentence pair "wer garantiert den Leuten eine Stelle ?" ↔ "who guarantees people a job ? </s>": (a)–(h) Transformer attention in layers 1–8; (i) vanilla RNN attention

SLIDE 39

Transformer Attention and Vanilla Attention

Figure – Average attention weights in different Transformer attention layers (L1–L8), for ambiguous nouns vs. all tokens

SLIDE 40

Fine-grained Attention Mechanism

Figure 1 – (a) The conventional attention mechanism and (b) the proposed fine-grained attention mechanism. Note that \sum_t \alpha_{t',t} = 1 in the conventional method, and \sum_t \alpha^d_{t',t} = 1 for every dimension d in the proposed method.

From "Fine-Grained Attention Mechanism for Neural Machine Translation"
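A small sketch of the per-dimension normalisation described in the caption (an assumed reading, for illustration): the softmax over source positions is applied separately for every dimension of the context vector:

```python
import numpy as np

rng = np.random.default_rng(3)
Tx, dim = 5, 4                                  # toy sizes
e = rng.normal(0, 1, (dim, Tx))                 # a score per (dimension, position)

alpha = np.exp(e - e.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)       # softmax along positions, per dim

h = rng.normal(0, 1, (Tx, dim))                 # encoder annotations
c = (alpha * h.T).sum(axis=1)                   # dimension-wise weighted context
print(alpha.sum(axis=1))                        # sum_t alpha[d, t] = 1 for each d
```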

SLIDE 41

Model Architectures

Neural networks:
vanilla RNN
LSTM/GRU
CNN
self-attention

Figure – Connectivity patterns over an input x_1 ... x_5: (a) RNN; (b) CNN (with padding); (c) self-attention

SLIDE 42

Model Architectures

Encoder-decoders:
with residual feed-forward layers
cascaded encoder
multi-column encoder

SLIDE 43

Model Architectures

Encoder-decoders:
with residual feed-forward layers
cascaded encoder
multi-column encoder

Figure – (a) Cascaded Encoder; (b) Multi-Column Encoder

SLIDE 44

Character-level Models

MT is an open-vocabulary problem:

compounding and other productive morphological processes: "they charge a carry-on bag fee." → "sie erheben eine Hand|gepäck|gebühr."
names: Obama (English, German), Обама (Russian), オバマ (o-ba-ma) (Japanese)
technical terms, numbers, etc.

But a limited vocabulary causes out-of-vocabulary words.


SLIDE 46

Character-level Models

Character-level models:
small vocabulary
do not require segmentation/tokenization
can model different, rare morphological variants of a word
but sentences become longer (harder for training)
and training time is longer

SLIDE 47

Character-level Models

Figure – C2W compositional model: a BLSTM over the characters W-h-e-r-e produces the word vector for "Where"

From "Character-based Neural Machine Translation"

SLIDE 48

Character-level Models

Figure – V2C generation model: a forward LSTM generates the characters of the target word (e.g. e-s-t-a) between SOW and EOW markers

From "Character-based Neural Machine Translation"

SLIDE 49

Hybrid-level Models

From "Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models"

SLIDE 50

Subword-level Models

Byte pair encoding algorithm: frequent character n-grams (or whole words) are merged into a single symbol.
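The merge-learning loop, essentially following the pseudocode published with the BPE paper (Sennrich et al., 2016), on that paper's toy vocabulary:

```python
import re
from collections import Counter

def get_stats(vocab):
    """Count how often each adjacent symbol pair occurs in the vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge every occurrence of the pair into a single symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy vocabulary; </w> marks the end of a word.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(10):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(best)   # first merges: ('e','s'), ('es','t'), ('est','</w>'), ...
```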


SLIDE 56

Subword-level Models

Byte pair encoding algorithm:
learned merge symbols can be applied to unknown words
trade-off between text length and vocabulary size



SLIDE 63

Monolingual Data

Use target-side monolingual data to enhance models.

Dummy source:
no source sentence
randomly sample from monolingual data each epoch
freeze encoder/attention layers for monolingual training instances

Synthetic source:
produce synthetic source-side sentences via back-translation
back-translation: use a model trained in the opposite direction to generate source-side sentences
randomly sample from back-translated data
synthetic and real parallel data are not distinguished (see the sketch below)
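A hedged sketch of the back-translation pipeline; `reverse_model` is a hypothetical stand-in for a target-to-source system trained beforehand, and only the mixing logic is the point:

```python
import random

def make_training_data(parallel, target_mono, reverse_model, sample_size):
    """Mix real parallel data with synthetic pairs from back-translation."""
    sampled = random.sample(target_mono, sample_size)
    # Hypothetical API: translate each monolingual target sentence back
    # into the source language to obtain a synthetic source side.
    synthetic = [(reverse_model.translate(t), t) for t in sampled]
    data = parallel + synthetic       # synthetic and real pairs are...
    random.shuffle(data)              # ...not distinguished during training
    return data
```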


SLIDE 65

Linguistic Features

morphological features
part-of-speech tags
syntactic dependency labels

Feature embeddings are concatenated to the word embeddings:

\overrightarrow{h}_j = \tanh\left(\overrightarrow{W} \left( \big\Vert_{k=1}^{|F|} E_k x_{jk} \right) + \overrightarrow{U} \overrightarrow{h}_{j-1}\right)

For example, with a 3-dimensional word embedding and a 1-dimensional POS embedding:

E_1(\text{close}) = (0.4, 0.1, 0.2)^\top, \quad E_2(\text{adj}) = (0.1), \quad E_1(\text{close}) \Vert E_2(\text{adj}) = (0.4, 0.1, 0.2, 0.1)^\top

From "Linguistic Input Features Improve Neural Machine Translation"
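The toy numbers from the slide as a short NumPy snippet, showing the concatenation that forms the encoder input:

```python
import numpy as np

E1_close = np.array([0.4, 0.1, 0.2])       # word embedding for "close"
E2_adj = np.array([0.1])                   # POS-tag embedding for "adj"

x_j = np.concatenate([E1_close, E2_adj])   # encoder input for this position
print(x_j)                                 # [0.4 0.1 0.2 0.1]
```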

SLIDE 66

Modelling Coverage

Problems:
over-translation: words unnecessarily translated multiple times
under-translation: words mistakenly left untranslated

Coverage model:
a coverage vector guides the attention network
pay more attention to the untranslated words (see the sketch below)

Context gate:
content words rely more on the source context
function words rely more on the target context
control the ratio of source and target context
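A minimal sketch of one coverage variant (an assumption for illustration: coverage as the running sum of past attention weights, subtracted from the scores so attention drifts toward untranslated positions):

```python
import numpy as np

def attend_with_coverage(e, coverage, penalty=1.0):
    """Penalise already-covered source positions before the softmax."""
    scores = e - penalty * coverage
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return alpha, coverage + alpha         # coverage accumulates attention

e = np.array([2.0, 1.0, 0.5])              # fixed toy attention scores
coverage = np.zeros(3)
for step in range(3):
    alpha, coverage = attend_with_coverage(e, coverage)
    print(step, alpha.round(2))            # mass shifts to less-covered words
```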


SLIDE 69

Domain Adaptation with Continued Training

SGD is sensitive to the order of training instances; best practice:

first train on all available data
continue training on in-domain data

Large BLEU improvements reported with minutes of training time [Sennrich et al., 2016, Luong and Manning, 2015, Crego et al., 2016]

Fine-tuning in IWSLT (en-de), BLEU:

            tst2013   tst2014   tst2015
Baseline    26.5      23.5      25.5
Fine-tuned  30.4      25.9      28.4

Generic system (≈ 8M sentences), fine-tuned with TED (≈ 200k)

Example from Rico Sennrich's EACL 2017 NMT talk

SLIDE 70

Transfer Learning

Use the hidden representations in NMT as pre-trained embeddings.

Figure 1 – Illustration of the approach, after (Belinkov et al., 2017): (i) an NMT system trained on ...

From "Evaluating Layers of Representation in Neural Machine Translation on Part-of-Speech and Semantic Tagging Tasks"

SLIDE 71

Amount of Training Data

Figure – BLEU scores with varying amounts of training data (corpus size in English words, 10^6 to 10^8), comparing phrase-based with big LM, phrase-based, and neural MT. Neural MT scores very low with little data (1.6 BLEU at the smallest size) but overtakes both phrase-based systems at the largest sizes (31.1 vs. 30.4 and 28.6 BLEU).

SLIDE 72

Long Sentences

Figure – BLEU scores with varying sentence length (source subword count, 10–80), comparing neural and phrase-based MT. Neural MT is competitive on short and medium sentences but falls behind phrase-based on the longest sentences.

SLIDE 73

Further Topics ...

using discourse-level information
external alignments to guide the attention
combining NMT models with SMT models
multilingual translation
training criteria (beyond maximum likelihood estimation)
dual learning (less training data)
unsupervised learning (only monolingual data)