SLIDE 1

Introduction to Neural Machine Translation

Gongbo Tang

16 September 2019

SLIDE 2

Outline

1. Why Neural Machine Translation?

2. Introduction to Neural Networks

3. Neural Language Models

Gongbo Tang Introduction to Neural Machine Translation 2/38

SLIDE 3

A Review of SMT

Figure: an overview of SMT, covering the translation model, language model, reordering model, and training, with extensions for morphology, compounds, syntactic trees, pre-reordering, and factored SMT.

The Problems of SMT
Many different sub-models
Models become more and more complicated
Performance bottleneck
Limited context window size

Gongbo Tang Introduction to Neural Machine Translation 3/38


SLIDE 5

The Background of Neural Networks

Why now?
More data
More powerful machines (GPUs)
Advanced neural networks and algorithms

Using neural networks to improve SMT:
Replace the word alignment model
Replace the translation model using word embeddings
Replace n-gram language models with neural language models
Replace the reordering model

Gongbo Tang Introduction to Neural Machine Translation 4/38


SLIDE 7

Pure Neural Machine Translation Models

Figure: the encoder maps the input text into a vector of real numbers (e.g., −0.2 −0.1 0.1 0.4 −0.3 1.1), and the decoder maps this vector to the translated text.

One single model, trained end-to-end
Considers the entire sentence, rather than a local context
Smaller model size

Figure from Luong et al. ACL 2016 NMT tutorial Gongbo Tang Introduction to Neural Machine Translation 5/38
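To make the end-to-end idea concrete, here is a minimal sketch of an encoder-decoder in plain numpy: the encoder reads the input text into one vector of real numbers, and the decoder turns that vector into output tokens. The dimensions, weight names, and the greedy decoding loop are illustrative assumptions, not the setup of any particular system.

```python
import numpy as np

# Toy sizes; real systems use vocabularies of tens of thousands and large hidden states.
rng = np.random.default_rng(0)
V_src, V_tgt, d = 10, 10, 8
E_src = rng.normal(size=(V_src, d))     # source word embeddings
E_tgt = rng.normal(size=(V_tgt, d))     # target word embeddings
W_enc, U_enc = rng.normal(size=(d, d)), rng.normal(size=(d, d))
W_dec, U_dec = rng.normal(size=(d, d)), rng.normal(size=(d, d))
W_out = rng.normal(size=(d, V_tgt))     # hidden state -> vocabulary scores

def encode(src_ids):
    """Read the whole source sentence into a single vector."""
    h = np.zeros(d)
    for i in src_ids:
        h = np.tanh(E_src[i] @ W_enc + h @ U_enc)
    return h

def greedy_decode(h, bos=0, eos=1, max_len=10):
    """Emit target words one at a time, feeding each prediction back in."""
    out, prev = [], bos
    for _ in range(max_len):
        h = np.tanh(E_tgt[prev] @ W_dec + h @ U_dec)
        prev = int(np.argmax(h @ W_out))
        if prev == eos:
            break
        out.append(prev)
    return out

print(greedy_decode(encode([2, 3, 4])))  # weights are untrained, so output is arbitrary
```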

SLIDE 8

SMT vs. NMT

NMT models have replaced SMT models in many online translation engines (Google, Baidu, Bing, Sogou, ...)

Comparison of machine translation approaches


(Junczys-Dowmunt et al., 2016)

Is Neural Machine Translation Ready for Deployment? A Case Study on 30 Translation Directions Gongbo Tang Introduction to Neural Machine Translation 6/38

SLIDE 9

SMT vs. NMT

Figure from Tie-Yan Liu’s NMT slides Gongbo Tang Introduction to Neural Machine Translation 7/38

SLIDE 10

Neural Networks

What is a neural network?
It is built from simpler units (neurons, nodes, ...)
It maps input vectors (matrices) to output vectors (matrices)
Each neuron has a non-linear activation function
Each activation function can be viewed as a feature detector
Non-linear functions are expressive

Gongbo Tang Introduction to Neural Machine Translation 8/38

SLIDE 11

Neural Networks

Typical activation functions in neural networks

Hyperbolic tangent: tanh(x) = sinh(x) / cosh(x) = (e^x − e^(−x)) / (e^x + e^(−x)), output ranges from −1 to +1
Logistic function: sigmoid(x) = 1 / (1 + e^(−x)), output ranges from 0 to +1
Rectified linear unit: relu(x) = max(0, x), output ranges from 0 to ∞

Figure 13.3: Typical activation functions in neural networks.

Figure from Philipp Koehn’s NMT chapter Gongbo Tang Introduction to Neural Machine Translation 9/38
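The three functions are easy to state directly in code; a small numpy sketch (function names are mine):

```python
import numpy as np

def tanh(x):                      # output in (-1, +1)
    return np.tanh(x)             # equals (e^x - e^-x) / (e^x + e^-x)

def sigmoid(x):                   # output in (0, +1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):                      # output in [0, inf)
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
print(tanh(x), sigmoid(x), relu(x))
```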

SLIDE 12

A Simple Neural Network Classifier

Figure: a single neuron with inputs x1, x2, x3, ..., xn and output y.

y = g(w · x + b); classify as positive if y > 0 and negative if y <= 0
x is a vector input, y is a scalar output
w and b are the parameters (b is a bias term)
g is a non-linear activation function

Example from Rico Sennrich’s EACL 2017 NMT talk Gongbo Tang Introduction to Neural Machine Translation 10/38
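A minimal numpy sketch of this classifier; the input, weight, and bias values are made up for illustration:

```python
import numpy as np

def classify(x, w, b, g=np.tanh):
    """y = g(w . x + b); predict the positive class when y > 0."""
    y = g(np.dot(w, x) + b)
    return y, (y > 0)

# Illustrative values only.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.3, 0.8, -0.2])
y, positive = classify(x, w, b=0.1)
print(y, positive)
```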

SLIDE 13

Neural Networks

Figure 13.2: A neural network with a hidden layer.

Gongbo Tang Introduction to Neural Machine Translation 11/38

SLIDE 14

Backpropagation Algorithm

Training Neural Networks
We use the backpropagation (BP) algorithm to update the neural network weights and thereby minimize the loss.
Step 1: forward pass (computation)
Step 2: calculate the total error
Step 3: backward pass (use the gradients to update the weights)
Repeat steps 1, 2, and 3 until convergence.

Gongbo Tang Introduction to Neural Machine Translation 12/38
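A toy numpy version of this loop, for a one-hidden-layer network with a mean-squared-error loss; the data, sizes, and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(16, 3))                  # 16 toy examples, 3 features
t = (X.sum(axis=1, keepdims=True) > 0) * 1.0  # toy targets in {0, 1}

W1 = rng.normal(size=(3, 4)); b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for step in range(500):
    # Step 1: forward pass
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    # Step 2: total error (mean squared error here)
    loss = np.mean((y - t) ** 2)
    # Step 3: backward pass -- chain rule, then gradient-descent updates
    dy = 2 * (y - t) / y.size * y * (1 - y)
    dW2 = h.T @ dy;  db2 = dy.sum(axis=0)
    dh = dy @ W2.T * h * (1 - h)
    dW1 = X.T @ dh;  db1 = dh.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(f"final loss: {loss:.4f}")
```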

SLIDE 15

Backpropagation Algorithm

Figure from https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/ Gongbo Tang Introduction to Neural Machine Translation 13/38

SLIDE 16

Backpropagation Algorithm

Figure from https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/ Gongbo Tang Introduction to Neural Machine Translation 14/38

SLIDE 17

Neural Networks

Figure: training progress over time, plotting error against training progress for the training and validation sets; the validation error reaches a minimum while the training error keeps falling.

Gongbo Tang Introduction to Neural Machine Translation 15/38

SLIDE 18

Neural Networks

Learning rate

Figure: error(λ) plotted against the learning rate λ for three failure cases: a learning rate that is too high, bad initialization, and getting stuck in a local optimum rather than the global optimum.

More advanced optimization methods (with adaptive learning rates): Adagrad, Adadelta, Adam.

Gongbo Tang Introduction to Neural Machine Translation 16/38
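As a sketch of what an adaptive learning rate looks like, here is the Adagrad update rule in numpy (variable names are mine): each parameter's effective step size shrinks with the squared gradients it has accumulated.

```python
import numpy as np

def adagrad_update(param, grad, accum, lr=0.1, eps=1e-8):
    """Adagrad: divide the step by the root of the accumulated squared gradients."""
    accum += grad ** 2
    param -= lr * grad / (np.sqrt(accum) + eps)
    return param, accum

w = np.array([1.0, -2.0])
acc = np.zeros_like(w)
for g in (np.array([0.5, 0.1]), np.array([0.5, 0.1])):
    w, acc = adagrad_update(w, g, acc)
print(w)  # the dimension with larger gradients takes smaller and smaller steps
```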

SLIDE 19

Neural Networks

Dropout randomly switches off a subset of units during training. It can help the model avoid local optima, and it reduces overfitting and makes the model more robust.

(a) Standard Neural Net (b) After applying dropout.

Figure from Dropout : A Simple Way to Prevent Neural Networks from Overfitting Gongbo Tang Introduction to Neural Machine Translation 17/38
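A sketch of (inverted) dropout in numpy; the drop probability and shapes are illustrative:

```python
import numpy as np

def dropout(h, p_drop=0.3, training=True, rng=np.random.default_rng(0)):
    """Inverted dropout: zero out units at random and rescale the survivors
    so the expected activation is unchanged; do nothing at test time."""
    if not training:
        return h
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)

h = np.ones(8)
print(dropout(h))                   # some units zeroed, the rest scaled up
print(dropout(h, training=False))   # unchanged at test time
```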

SLIDE 20

Neural Networks

Mini-batch training

Online learning: update the model with each training example. Mini-batch training: update the weights in batches (in parallel), which can speed up training.


  • 1. Padding and masking: suitable for GPUs, but wasteful (computation is spent on the padded positions).

(Figure: four sentences of different lengths, each padded with 0's up to the longest one.)

  • 2. Smarter padding and masking: minimize the waste.
  • Ensure that the length differences are minimal.
  • Sort the sentences and sequentially build a minibatch.

(Figure: after sorting, sentences of similar length share a minibatch and much less padding is needed.)

Figure from Luong et al. ACL 2016 NMT tutorial Gongbo Tang Introduction to Neural Machine Translation 18/38
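A minimal numpy sketch of padding with a mask, plus the sort-by-length trick; the sentence ids and pad symbol are made up:

```python
import numpy as np

def make_minibatch(sentences, pad_id=0):
    """Pad a group of sentences to the longest one and build a mask
    so the wasted (padded) positions can be ignored later."""
    max_len = max(len(s) for s in sentences)
    batch = np.full((len(sentences), max_len), pad_id)
    mask = np.zeros((len(sentences), max_len), dtype=bool)
    for i, s in enumerate(sentences):
        batch[i, :len(s)] = s
        mask[i, :len(s)] = True
    return batch, mask

# Smarter padding: sort by length first, then cut sequential minibatches,
# so sentences of similar length end up together and little is wasted.
sents = [[4, 5], [6, 7, 8, 9, 10], [11], [12, 13, 14]]
sents.sort(key=len)
batch, mask = make_minibatch(sents[:2])   # the two shortest sentences
print(batch)
print(mask)
```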

SLIDE 21

Neural Networks

Layer normalization
Large or small values at each layer may cause gradient explosion or gradient vanishing; normalizing the values on a per-layer basis addresses this.
Early stopping
Stop training when we get the best result on the development set.
Ensembling
Combine multiple models together.
Random seed
Used for reproducibility; different seeds lead to different results.

Gongbo Tang Introduction to Neural Machine Translation 19/38
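A sketch of the per-layer normalization in numpy (the gain and bias parameters that trained models learn are reduced to constants here):

```python
import numpy as np

def layer_norm(h, gain=1.0, bias=0.0, eps=1e-6):
    """Normalize the values of one layer to zero mean and unit variance,
    which keeps activations (and hence gradients) in a reasonable range."""
    mean = h.mean()
    std = h.std()
    return gain * (h - mean) / (std + eps) + bias

h = np.array([100.0, 101.0, 99.0, 250.0])   # one unusually large activation
print(layer_norm(h))
```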


SLIDE 25

Neural Networks

Some practical concepts
Tensors: scalars, vectors, and matrices
Epoch: one round of parameter updates over the whole training set
Batch size: the number of sentence pairs in a batch
Step: one parameter update over a batch

Gongbo Tang Introduction to Neural Machine Translation 20/38

SLIDE 26

Computation Graph

Figure 13.2: A neural network with a hidden layer.

The descriptive language of deep learning models
Simple functions are composed to form complex models
A functional description of the required computation

Gongbo Tang Introduction to Neural Machine Translation 21/38

SLIDE 27

Computation Graph

h = sigmoid(W1 x + b1)
y = sigmoid(W2 h + b2)

Figure 13.8: Two-layer feed-forward neural network as a computation graph, consisting of the input value x, weight parameters W1, W2, b1, b2, and computation nodes (product, sum, sigmoid). To the right of each parameter node, its value is shown. To the left of input and computation nodes, we show how the input (1, 0)ᵀ is processed by the graph.

Gongbo Tang Introduction to Neural Machine Translation 22/38

SLIDE 28

Computation Graph

Figure 13.9: Computation graph with gradients computed in the backward pass for the training example (0, 1)ᵀ → 1.0. Gradients are computed with respect to the input of the nodes, so some nodes that have two inputs also have two gradients. See text for details on the computations of the values.

Gongbo Tang Introduction to Neural Machine Translation 23/38

SLIDE 29

Computation Graph

Each node in the computation graph is composed of:
a function to compute its value
links to input nodes (to get argument values)
the computed value (in the forward pass)
a function that computes its gradient
links to child nodes (to get downstream gradient values)
the computed gradient value (in the backward pass)

Gongbo Tang Introduction to Neural Machine Translation 24/38
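A minimal sketch of such a node in Python, with just enough operations (add, multiply, sigmoid) for the two-layer network above; the class layout and names are mine, not those of any particular framework:

```python
import math

class Node:
    """One node of a computation graph: a value from the forward pass,
    a gradient filled in by the backward pass, and a recipe (_backward)
    for pushing its gradient to its input nodes."""
    def __init__(self, value, inputs=()):
        self.value = value
        self.grad = 0.0
        self.inputs = inputs
        self._backward = lambda: None

    def __mul__(self, other):
        out = Node(self.value * other.value, (self, other))
        def _backward():
            self.grad += other.value * out.grad
            other.grad += self.value * out.grad
        out._backward = _backward
        return out

    def __add__(self, other):
        out = Node(self.value + other.value, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def sigmoid(self):
        s = 1.0 / (1.0 + math.exp(-self.value))
        out = Node(s, (self,))
        def _backward():
            self.grad += s * (1.0 - s) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Visit nodes in reverse topological order, seeding dL/dL = 1.
        order, seen = [], set()
        def visit(n):
            if id(n) not in seen:
                seen.add(id(n))
                for i in n.inputs:
                    visit(i)
                order.append(n)
        visit(self)
        self.grad = 1.0
        for n in reversed(order):
            n._backward()

x, w, b = Node(1.0), Node(3.0), Node(-2.0)
y = (w * x + b).sigmoid()
y.backward()
print(y.value, w.grad)   # forward value and dL/dw
```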

SLIDE 30

Neural Language Models

Problems of n-gram models
Data sparsity
Access to only a limited number of preceding words

Gongbo Tang Introduction to Neural Machine Translation 25/38

SLIDE 31

Neural Language Models

Problems of n-gram models
Data sparsity
Access to only a limited number of preceding words

Neural language models are powerful at modeling conditional probability distributions with multiple inputs, p(a|b, c, d).

Figure 13.10: Sketch of a neural language model: we predict a word wi based on its preceding words.

Gongbo Tang Introduction to Neural Machine Translation 25/38

SLIDE 32

Neural Language Models

Representations: word embeddings
Dense vectors. Words that occur in similar contexts should have similar word embeddings. For example:
but the cute dog jumped
but the cute cat jumped
The language model would benefit from the knowledge that dog and cat occur in similar contexts and hence are somewhat interchangeable.
Example:
I have a pet, it is a cat.
I have a pet, it is a dog.
Word embeddings enable generalization between words, which helps deal with unseen data and the data sparsity problem.

Gongbo Tang Introduction to Neural Machine Translation 26/38
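A sketch of how similarity between embeddings is usually measured, with made-up toy vectors standing in for learned embeddings:

```python
import numpy as np

def cosine(u, v):
    """Similarity of two word embeddings; 1.0 means identical direction."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 4-dimensional embeddings; real models learn these from corpora.
emb = {
    "dog": np.array([0.8, 0.1, 0.7, 0.0]),
    "cat": np.array([0.7, 0.2, 0.6, 0.1]),
    "jumped": np.array([0.0, 0.9, 0.1, 0.8]),
}
print(cosine(emb["dog"], emb["cat"]))      # high: similar contexts
print(cosine(emb["dog"], emb["jumped"]))   # low: different contexts
```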


SLIDE 36

Feed-forward Neural Network Language Models

Figure 13.11: Full architecture of a feed-forward neural network language model. Context words (wi−4, wi−3, wi−2, wi−1) are represented as one-hot vectors, then projected into continuous space as word embeddings (using the same weight matrix C for all words). The predicted word is computed as a one-hot vector via a hidden layer.

Gongbo Tang Introduction to Neural Machine Translation 27/38
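A numpy sketch of the forward pass just described: embed the four context words with the shared matrix C, concatenate them, and predict a distribution over the vocabulary. All sizes and weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
V, d, h, n_ctx = 20, 8, 16, 4           # vocabulary, embedding, hidden, context sizes
C = rng.normal(size=(V, d))             # one embedding matrix shared by all positions
W1 = rng.normal(size=(n_ctx * d, h)); b1 = np.zeros(h)
W2 = rng.normal(size=(h, V));        b2 = np.zeros(V)

def nnlm_probs(context_ids):
    """p(w_i | w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1}) over the whole vocabulary."""
    x = np.concatenate([C[i] for i in context_ids])   # embed, then concatenate
    hidden = np.tanh(x @ W1 + b1)
    logits = hidden @ W2 + b2
    e = np.exp(logits - logits.max())                 # softmax over the vocabulary
    return e / e.sum()

p = nnlm_probs([3, 7, 1, 9])
print(p.argmax(), p.sum())   # most probable next word id; probabilities sum to 1
```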

SLIDE 37

Training Neural Language Models

Parameters
word embedding matrix; weight matrices; bias vectors
Training
For each training example (n-gram/sentence), we feed the context words into the network and match the network's output against the following word. The training objective for language models is to increase the likelihood of the training data.

L(x, y; W) = −∑_k y_k log p_k  (13.60)

Evaluation metric
Perplexity, which is related to the probability the model assigns to proper English text. The lower the perplexity, the better.

Gongbo Tang Introduction to Neural Machine Translation 28/38
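A sketch of the loss and the perplexity computation in numpy; the probabilities are made-up numbers, not model outputs:

```python
import numpy as np

def nll(probs, target_id):
    """Per-word loss: -sum_k y_k log p_k with one-hot y picks out -log p[target]."""
    return -np.log(probs[target_id])

def perplexity(word_probs):
    """exp of the average negative log-probability over a text; lower is better."""
    return np.exp(-np.mean(np.log(word_probs)))

print(nll(np.array([0.1, 0.7, 0.2]), 1))
# Probabilities the model assigned to each actual next word (illustrative numbers).
print(perplexity(np.array([0.2, 0.1, 0.5, 0.25])))
```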


SLIDE 39

Recurrent Neural Networks

Figure from https://colah.github.io/posts/2015-08-Understanding-LSTMs/ Gongbo Tang Introduction to Neural Machine Translation 29/38

SLIDE 40

Recurrent Neural Language Models

RNNs may condition on context sequences of any length.

(Diagram: the hidden layer values are copied from each prediction step to the next.)

Figure 13.13: Recurrent neural language models: after predicting Word 2 in the context of the preceding Word 1, we re-use this hidden layer (alongside the correct Word 2) to predict Word 3. Again, the hidden layer of this prediction is re-used for the prediction of Word 4.

Gongbo Tang Introduction to Neural Machine Translation 30/38
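A numpy sketch of this recurrent language model: the hidden state is carried (copied) from step to step, so each prediction can condition on the whole preceding sequence. Sizes and weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
V, d = 20, 8
E = rng.normal(size=(V, d))                    # word embeddings
W = rng.normal(size=(d, d)); U = rng.normal(size=(d, d)); b = np.zeros(d)
W_out = rng.normal(size=(d, V))

def rnn_lm(word_ids):
    """Predict each next word, re-using (copying) the previous hidden layer."""
    h = np.zeros(d)
    predictions = []
    for i in word_ids:
        h = np.tanh(E[i] @ W + h @ U + b)
        predictions.append(int(np.argmax(h @ W_out)))
    return predictions

print(rnn_lm([2, 5, 7]))   # untrained weights, so the predicted ids are arbitrary
```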


SLIDE 42

Recurrent Neural Networks

Problems of simple RNNs
The hidden layers serve both as the memory of the network and as the representation used to predict the next word
There is no control over which preceding context words are retained
Gradient explosion or vanishing

Gongbo Tang Introduction to Neural Machine Translation 31/38

SLIDE 43

LSTM & GRU

LSTM: Long Short-Term Memory; GRU: Gated Recurrent Unit

Figure from https://towardsdatascience.com/ illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21 Gongbo Tang Introduction to Neural Machine Translation 32/38

SLIDE 44

LSTM

Gates
Input gate: how much the new input changes the memory state
Forget gate: how much of the prior memory state is retained
Output gate: how strongly the memory state is passed on

(Diagram: an LSTM cell with input, output, and forget gates and elementwise ⊗ and ⊕ operations, connected to the LSTM layer at time t−1, the preceding layer, and the next layer.)

Gongbo Tang Introduction to Neural Machine Translation 33/38
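A numpy sketch of one LSTM step with the three gates above; the weight layout (one matrix over the concatenation [x; h]) is a common convention I am assuming, not taken from the slides:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h_prev, c_prev, W, b):
    """One LSTM step. W maps [x; h_prev] to four pre-activations at once."""
    d = h_prev.size
    z = np.concatenate([x, h_prev]) @ W + b
    i = sigmoid(z[0*d:1*d])    # input gate: how much new input changes the memory
    f = sigmoid(z[1*d:2*d])    # forget gate: how much prior memory is retained
    o = sigmoid(z[2*d:3*d])    # output gate: how strongly memory is passed on
    g = np.tanh(z[3*d:4*d])    # candidate memory content
    c = f * c_prev + i * g     # new memory state
    h = o * np.tanh(c)         # new hidden state
    return h, c

rng = np.random.default_rng(4)
d_in, d = 3, 5
W = rng.normal(size=(d_in + d, 4 * d))
h, c = lstm_cell(rng.normal(size=d_in), np.zeros(d), np.zeros(d), W, np.zeros(4 * d))
print(h.shape, c.shape)
```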

SLIDE 45

GRU

Gates
The GRU is a simplification of the LSTM: it has only two gates, a reset gate and an update gate.

(Diagram: a GRU cell with update and reset gates, connected to the GRU layer at time t−1, the preceding layer, and the next layer.)

Gongbo Tang Introduction to Neural Machine Translation 34/38
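A matching numpy sketch of one GRU step with its reset and update gates (again with assumed weight shapes):

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h_prev, Wz, Wr, Wh):
    """One GRU step; each weight matrix consumes the concatenation [x; h]."""
    xh = np.concatenate([x, h_prev])
    z = sigmoid(xh @ Wz)                       # update gate: keep old vs. take new
    r = sigmoid(xh @ Wr)                       # reset gate: how much history to use
    h_tilde = np.tanh(np.concatenate([x, r * h_prev]) @ Wh)
    return (1 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(5)
d_in, d = 3, 5
Wz, Wr, Wh = (rng.normal(size=(d_in + d, d)) for _ in range(3))
print(gru_cell(rng.normal(size=d_in), np.zeros(d), Wz, Wr, Wh))
```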

SLIDE 46

Deep RNNs

(Diagram panels: Shallow, Deep Stacked, and Deep Transition, each showing the input, hidden layer(s), and output.)

Figure 13.17: Deep recurrent neural networks. The input is passed through a few hidden layers before an output prediction is made. In deep stacked models, the hidden layers are also connected horizontally, i.e., a layer's values at time step t depend on its values at time step t − 1 as well as on the previous layer at time step t. In deep transition models, the layers at any time step t are sequentially connected, and the first hidden layer is also informed by the last layer at time step t − 1.

Gongbo Tang Introduction to Neural Machine Translation 35/38

SLIDE 47

Advanced Neural LMs

Features
Character-level
Bidirectional RNNs
Multilingual
Transformers
Masked/causal/translation LMs

Pre-trained Neural LMs
ELMo, from AllenNLP
GPT(-2), from OpenAI
BERT, from Google
Transformer-XL, from Google/CMU
XLNet, from Google/CMU
XLM, from Facebook

Gongbo Tang Introduction to Neural Machine Translation 36/38


SLIDE 49

Online demos

Write With Transformer
Link: https://transformer.huggingface.co/
Models: Arxiv-NLP, GPT-2, XLNet, GPT

Gongbo Tang Introduction to Neural Machine Translation 37/38

SLIDE 50

Information

Abel Account Application We will use the Abel cluster for the NMT assignment and (NMT) projects, so please apply for your account as early as possible. Here is the key information for the application. Website:

https://www.metacenter.no/user/application/form/notur/

Organization: Uppsala universitet (ZIP code: 75126)
Project: NN9447K: Nordic Language Processing Laboratory (project manager: Stephan Oepen)
Contact: gongbo.tang@lingfil.uu.se

Gongbo Tang Introduction to Neural Machine Translation 38/38