SLIDE 1

Introduction to Neural Machine Translation

Gongbo Tang

16 September 2019

SLIDE 2

Outline

1. Why Neural Machine Translation?

2. Introduction to Neural Networks

3. Neural Language Models

Gongbo Tang Introduction to Neural Machine Translation 2/38

SLIDE 3

A Review of SMT

Figure: an overview of SMT, covering the translation model, language model, reordering model, and training, with extensions for morphology, compounds, syntactic trees, pre-reordering, and factored SMT.

The Problems of SMT
Many different sub-models
Models become more and more complicated
Performance bottleneck
Limited context window size

Gongbo Tang Introduction to Neural Machine Translation 3/38


SLIDE 5

The Background of Neural Networks

Why now?
More data
More powerful machines (GPUs)
Advanced neural networks and algorithms

Using neural networks to improve SMT:
Replace the word alignment model
Replace the translation model using word embeddings
Replace n-gram language models with neural language models
Replace the reordering model

Gongbo Tang Introduction to Neural Machine Translation 4/38


SLIDE 7

Pure Neural Machine Translation Models

Figure: the encoder maps the input text into a vector of real numbers (e.g., −0.2 −0.1 0.1 0.4 −0.3 1.1), and the decoder maps this vector to the translated text.

One single model, trained end-to-end
Considers the entire sentence, rather than a local context
Smaller model size

Figure from Luong et al. ACL 2016 NMT tutorial Gongbo Tang Introduction to Neural Machine Translation 5/38
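To make the end-to-end idea concrete, here is a minimal sketch of an encoder-decoder in plain numpy: the encoder reads the input text into one vector of real numbers, and the decoder turns that vector into output tokens. The dimensions, weight names, and the greedy decoding loop are illustrative assumptions, not the setup of any particular system.

```python
import numpy as np

# Toy sizes; real systems use vocabularies of tens of thousands and large hidden states.
rng = np.random.default_rng(0)
V_src, V_tgt, d = 10, 10, 8
E_src = rng.normal(size=(V_src, d))     # source word embeddings
E_tgt = rng.normal(size=(V_tgt, d))     # target word embeddings
W_enc, U_enc = rng.normal(size=(d, d)), rng.normal(size=(d, d))
W_dec, U_dec = rng.normal(size=(d, d)), rng.normal(size=(d, d))
W_out = rng.normal(size=(d, V_tgt))     # hidden state -> vocabulary scores

def encode(src_ids):
    """Read the whole source sentence into a single vector."""
    h = np.zeros(d)
    for i in src_ids:
        h = np.tanh(E_src[i] @ W_enc + h @ U_enc)
    return h

def greedy_decode(h, bos=0, eos=1, max_len=10):
    """Emit target words one at a time, feeding each prediction back in."""
    out, prev = [], bos
    for _ in range(max_len):
        h = np.tanh(E_tgt[prev] @ W_dec + h @ U_dec)
        prev = int(np.argmax(h @ W_out))
        if prev == eos:
            break
        out.append(prev)
    return out

print(greedy_decode(encode([2, 3, 4])))  # weights are untrained, so output is arbitrary
```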

SLIDE 8

SMT vs. NMT

NMT models have replaced SMT models in many online translation engines (Google, Baidu, Bing, Sogou, ...)

Comparison of machine translation approaches


(Junczys-Dowmunt et al., 2016)

Is Neural Machine Translation Ready for Deployment? A Case Study on 30 Translation Directions Gongbo Tang Introduction to Neural Machine Translation 6/38

SLIDE 9

SMT vs. NMT

Figure from Tie-Yan Liu’s NMT slides Gongbo Tang Introduction to Neural Machine Translation 7/38

SLIDE 10

Neural Networks

What is a neural network?
It is built from simpler units (neurons, nodes, ...)
It maps input vectors (matrices) to output vectors (matrices)
Each neuron has a non-linear activation function
Each activation function can be viewed as a feature detector
Non-linear functions are expressive

Gongbo Tang Introduction to Neural Machine Translation 8/38

SLIDE 11

Neural Networks

Typical activation functions in neural networks

Hyperbolic tangent: tanh(x) = sinh(x) / cosh(x) = (e^x − e^(−x)) / (e^x + e^(−x)), output ranges from −1 to +1
Logistic function: sigmoid(x) = 1 / (1 + e^(−x)), output ranges from 0 to +1
Rectified linear unit: relu(x) = max(0, x), output ranges from 0 to ∞

Figure 13.3: Typical activation functions in neural networks.

Figure from Philipp Koehn’s NMT chapter Gongbo Tang Introduction to Neural Machine Translation 9/38
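The three functions are easy to state directly in code; a small numpy sketch (function names are mine):

```python
import numpy as np

def tanh(x):                      # output in (-1, +1)
    return np.tanh(x)             # equals (e^x - e^-x) / (e^x + e^-x)

def sigmoid(x):                   # output in (0, +1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):                      # output in [0, inf)
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
print(tanh(x), sigmoid(x), relu(x))
```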

SLIDE 12

A Simple Neural Network Classifier

Figure: a single neuron with inputs x1, x2, x3, ..., xn and output y.

y = g(w · x + b); classify as positive if y > 0 and negative if y <= 0
x is a vector input, y is a scalar output
w and b are the parameters (b is a bias term)
g is a non-linear activation function

Example from Rico Sennrich’s EACL 2017 NMT talk Gongbo Tang Introduction to Neural Machine Translation 10/38
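A minimal numpy sketch of this classifier; the input, weight, and bias values are made up for illustration:

```python
import numpy as np

def classify(x, w, b, g=np.tanh):
    """y = g(w . x + b); predict the positive class when y > 0."""
    y = g(np.dot(w, x) + b)
    return y, (y > 0)

# Illustrative values only.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.3, 0.8, -0.2])
y, positive = classify(x, w, b=0.1)
print(y, positive)
```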

SLIDE 13

Neural Networks

Figure 13.2: A neural network with a hidden layer.

Gongbo Tang Introduction to Neural Machine Translation 11/38

SLIDE 14

Backpropagation Algorithm

Training Neural Networks
We use the backpropagation (BP) algorithm to update the neural network weights and thereby minimize the loss.
Step 1: forward pass (computation)
Step 2: calculate the total error
Step 3: backward pass (use the gradients to update the weights)
Repeat steps 1, 2, and 3 until convergence.

Gongbo Tang Introduction to Neural Machine Translation 12/38
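A toy numpy version of this loop, for a one-hidden-layer network with a mean-squared-error loss; the data, sizes, and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(16, 3))                  # 16 toy examples, 3 features
t = (X.sum(axis=1, keepdims=True) > 0) * 1.0  # toy targets in {0, 1}

W1 = rng.normal(size=(3, 4)); b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for step in range(500):
    # Step 1: forward pass
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    # Step 2: total error (mean squared error here)
    loss = np.mean((y - t) ** 2)
    # Step 3: backward pass -- chain rule, then gradient-descent updates
    dy = 2 * (y - t) / y.size * y * (1 - y)
    dW2 = h.T @ dy;  db2 = dy.sum(axis=0)
    dh = dy @ W2.T * h * (1 - h)
    dW1 = X.T @ dh;  db1 = dh.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(f"final loss: {loss:.4f}")
```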

SLIDE 15

Backpropagation Algorithm

Figure from https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/ Gongbo Tang Introduction to Neural Machine Translation 13/38

SLIDE 16

Backpropagation Algorithm

Figure from https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/ Gongbo Tang Introduction to Neural Machine Translation 14/38

SLIDE 17

Neural Networks

Figure: training progress over time, plotting error against training progress for the training and validation sets; the validation error reaches a minimum while the training error keeps falling.

Gongbo Tang Introduction to Neural Machine Translation 15/38

SLIDE 18

Neural Networks

Learning rate

Figure: error(λ) plotted against the learning rate λ for three failure cases: a learning rate that is too high, bad initialization, and getting stuck in a local optimum rather than the global optimum.

More advanced optimization methods (with adaptive learning rates): Adagrad, Adadelta, Adam.

Gongbo Tang Introduction to Neural Machine Translation 16/38
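As a sketch of what an adaptive learning rate looks like, here is the Adagrad update rule in numpy (variable names are mine): each parameter's effective step size shrinks with the squared gradients it has accumulated.

```python
import numpy as np

def adagrad_update(param, grad, accum, lr=0.1, eps=1e-8):
    """Adagrad: divide the step by the root of the accumulated squared gradients."""
    accum += grad ** 2
    param -= lr * grad / (np.sqrt(accum) + eps)
    return param, accum

w = np.array([1.0, -2.0])
acc = np.zeros_like(w)
for g in (np.array([0.5, 0.1]), np.array([0.5, 0.1])):
    w, acc = adagrad_update(w, g, acc)
print(w)  # the dimension with larger gradients takes smaller and smaller steps
```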

SLIDE 19

Neural Networks

Dropout randomly switches off a subset of units during training. It can help the model avoid local optima, and it reduces overfitting and makes the model more robust.

(a) Standard Neural Net (b) After applying dropout.

Figure from Dropout : A Simple Way to Prevent Neural Networks from Overfitting Gongbo Tang Introduction to Neural Machine Translation 17/38
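A sketch of (inverted) dropout in numpy; the drop probability and shapes are illustrative:

```python
import numpy as np

def dropout(h, p_drop=0.3, training=True, rng=np.random.default_rng(0)):
    """Inverted dropout: zero out units at random and rescale the survivors
    so the expected activation is unchanged; do nothing at test time."""
    if not training:
        return h
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)

h = np.ones(8)
print(dropout(h))                   # some units zeroed, the rest scaled up
print(dropout(h, training=False))   # unchanged at test time
```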

SLIDE 20

Neural Networks

Mini-batch training

Online learning: update the model with each training example. Mini-batch training: update the weights in batches (in parallel), which can speed up training.


  • 1. Padding and masking: suitable for GPUs, but wasteful (computation is spent on the padded positions).

(Figure: four sentences of different lengths, each padded with 0's up to the longest one.)

  • 2. Smarter padding and masking: minimize the waste.
  • Ensure that the length differences are minimal.
  • Sort the sentences and sequentially build a minibatch.

(Figure: after sorting, sentences of similar length share a minibatch and much less padding is needed.)

Figure from Luong et al. ACL 2016 NMT tutorial Gongbo Tang Introduction to Neural Machine Translation 18/38
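A minimal numpy sketch of padding with a mask, plus the sort-by-length trick; the sentence ids and pad symbol are made up:

```python
import numpy as np

def make_minibatch(sentences, pad_id=0):
    """Pad a group of sentences to the longest one and build a mask
    so the wasted (padded) positions can be ignored later."""
    max_len = max(len(s) for s in sentences)
    batch = np.full((len(sentences), max_len), pad_id)
    mask = np.zeros((len(sentences), max_len), dtype=bool)
    for i, s in enumerate(sentences):
        batch[i, :len(s)] = s
        mask[i, :len(s)] = True
    return batch, mask

# Smarter padding: sort by length first, then cut sequential minibatches,
# so sentences of similar length end up together and little is wasted.
sents = [[4, 5], [6, 7, 8, 9, 10], [11], [12, 13, 14]]
sents.sort(key=len)
batch, mask = make_minibatch(sents[:2])   # the two shortest sentences
print(batch)
print(mask)
```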

SLIDE 21

Neural Networks

Layer normalization
Large or small values at each layer may cause gradient explosion or gradient vanishing; normalizing the values on a per-layer basis addresses this.
Early stopping
Stop training when we get the best result on the development set.
Ensembling
Combine multiple models together.
Random seed
Used for reproducibility; different seeds lead to different results.

Gongbo Tang Introduction to Neural Machine Translation 19/38
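A sketch of the per-layer normalization in numpy (the gain and bias parameters that trained models learn are reduced to constants here):

```python
import numpy as np

def layer_norm(h, gain=1.0, bias=0.0, eps=1e-6):
    """Normalize the values of one layer to zero mean and unit variance,
    which keeps activations (and hence gradients) in a reasonable range."""
    mean = h.mean()
    std = h.std()
    return gain * (h - mean) / (std + eps) + bias

h = np.array([100.0, 101.0, 99.0, 250.0])   # one unusually large activation
print(layer_norm(h))
```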


SLIDE 25

Neural Networks

Some practical concepts
Tensors: scalars, vectors, and matrices
Epoch: one round of parameter updates over the whole training set
Batch size: the number of sentence pairs in a batch
Step: one parameter update over a batch

Gongbo Tang Introduction to Neural Machine Translation 20/38

SLIDE 26

Computation Graph

Figure 13.2: A neural network with a hidden layer.

The descriptive language of deep learning models
Simple functions are composed to form complex models
A functional description of the required computation

Gongbo Tang Introduction to Neural Machine Translation 21/38

SLIDE 27

Computation Graph

h = sigmoid(W1 x + b1)
y = sigmoid(W2 h + b2)

Figure 13.8: Two-layer feed-forward neural network as a computation graph, consisting of the input value x, weight parameters W1, W2, b1, b2, and computation nodes (product, sum, sigmoid). To the right of each parameter node, its value is shown. To the left of input and computation nodes, we show how the input (1, 0)ᵀ is processed by the graph.

Gongbo Tang Introduction to Neural Machine Translation 22/38

SLIDE 28

Computation Graph

Figure 13.9: Computation graph with gradients computed in the backward pass for the training example (0, 1)ᵀ → 1.0. Gradients are computed with respect to the input of the nodes, so some nodes that have two inputs also have two gradients. See text for details on the computations of the values.

Gongbo Tang Introduction to Neural Machine Translation 23/38

SLIDE 29

Computation Graph

Each node in the computation graph is composed of:
a function to compute its value
links to input nodes (to get argument values)
the computed value (in the forward pass)
a function that computes its gradient
links to child nodes (to get downstream gradient values)
the computed gradient value (in the backward pass)

Gongbo Tang Introduction to Neural Machine Translation 24/38
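A minimal sketch of such a node in Python, with just enough operations (add, multiply, sigmoid) for the two-layer network above; the class layout and names are mine, not those of any particular framework:

```python
import math

class Node:
    """One node of a computation graph: a value from the forward pass,
    a gradient filled in by the backward pass, and a recipe (_backward)
    for pushing its gradient to its input nodes."""
    def __init__(self, value, inputs=()):
        self.value = value
        self.grad = 0.0
        self.inputs = inputs
        self._backward = lambda: None

    def __mul__(self, other):
        out = Node(self.value * other.value, (self, other))
        def _backward():
            self.grad += other.value * out.grad
            other.grad += self.value * out.grad
        out._backward = _backward
        return out

    def __add__(self, other):
        out = Node(self.value + other.value, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def sigmoid(self):
        s = 1.0 / (1.0 + math.exp(-self.value))
        out = Node(s, (self,))
        def _backward():
            self.grad += s * (1.0 - s) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Visit nodes in reverse topological order, seeding dL/dL = 1.
        order, seen = [], set()
        def visit(n):
            if id(n) not in seen:
                seen.add(id(n))
                for i in n.inputs:
                    visit(i)
                order.append(n)
        visit(self)
        self.grad = 1.0
        for n in reversed(order):
            n._backward()

x, w, b = Node(1.0), Node(3.0), Node(-2.0)
y = (w * x + b).sigmoid()
y.backward()
print(y.value, w.grad)   # forward value and dL/dw
```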

SLIDE 30

Neural Language Models

Problems of n-gram models
Data sparsity
Access to only a limited number of preceding words

Gongbo Tang Introduction to Neural Machine Translation 25/38

SLIDE 31

Neural Language Models

Problems of n-gram models
Data sparsity
Access to only a limited number of preceding words

Neural language models are powerful at modeling conditional probability distributions with multiple inputs, p(a|b, c, d).

Figure 13.10: Sketch of a neural language model: we predict a word wi based on its preceding words.

Gongbo Tang Introduction to Neural Machine Translation 25/38

SLIDE 32

Neural Language Models

Representations: word embeddings
Dense vectors. Words that occur in similar contexts should have similar word embeddings. For example:
but the cute dog jumped
but the cute cat jumped
The language model would benefit from the knowledge that dog and cat occur in similar contexts and hence are somewhat interchangeable.
Example:
I have a pet, it is a cat.
I have a pet, it is a dog.
Word embeddings enable generalization between words, which helps deal with unseen data and the data sparsity problem.

Gongbo Tang Introduction to Neural Machine Translation 26/38
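A sketch of how similarity between embeddings is usually measured, with made-up toy vectors standing in for learned embeddings:

```python
import numpy as np

def cosine(u, v):
    """Similarity of two word embeddings; 1.0 means identical direction."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 4-dimensional embeddings; real models learn these from corpora.
emb = {
    "dog": np.array([0.8, 0.1, 0.7, 0.0]),
    "cat": np.array([0.7, 0.2, 0.6, 0.1]),
    "jumped": np.array([0.0, 0.9, 0.1, 0.8]),
}
print(cosine(emb["dog"], emb["cat"]))      # high: similar contexts
print(cosine(emb["dog"], emb["jumped"]))   # low: different contexts
```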


SLIDE 36

Feed-forward Neural Network Language Models

Figure 13.11: Full architecture of a feed-forward neural network language model. Context words (wi−4, wi−3, wi−2, wi−1) are represented as one-hot vectors, then projected into continuous space as word embeddings (using the same weight matrix C for all words). The predicted word is computed as a one-hot vector via a hidden layer.

Gongbo Tang Introduction to Neural Machine Translation 27/38
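A numpy sketch of the forward pass just described: embed the four context words with the shared matrix C, concatenate them, and predict a distribution over the vocabulary. All sizes and weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
V, d, h, n_ctx = 20, 8, 16, 4           # vocabulary, embedding, hidden, context sizes
C = rng.normal(size=(V, d))             # one embedding matrix shared by all positions
W1 = rng.normal(size=(n_ctx * d, h)); b1 = np.zeros(h)
W2 = rng.normal(size=(h, V));        b2 = np.zeros(V)

def nnlm_probs(context_ids):
    """p(w_i | w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1}) over the whole vocabulary."""
    x = np.concatenate([C[i] for i in context_ids])   # embed, then concatenate
    hidden = np.tanh(x @ W1 + b1)
    logits = hidden @ W2 + b2
    e = np.exp(logits - logits.max())                 # softmax over the vocabulary
    return e / e.sum()

p = nnlm_probs([3, 7, 1, 9])
print(p.argmax(), p.sum())   # most probable next word id; probabilities sum to 1
```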

SLIDE 37

Training Neural Language Models

Parameters
word embedding matrix; weight matrices; bias vectors
Training
For each training example (n-gram/sentence), we feed the context words into the network and match the network's output against the following word. The training objective for language models is to increase the likelihood of the training data.

L(x, y; W) = −∑_k y_k log p_k  (13.60)

Evaluation metric
Perplexity, which is related to the probability the model assigns to proper English text. The lower the perplexity, the better.

Gongbo Tang Introduction to Neural Machine Translation 28/38
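A sketch of the loss and the perplexity computation in numpy; the probabilities are made-up numbers, not model outputs:

```python
import numpy as np

def nll(probs, target_id):
    """Per-word loss: -sum_k y_k log p_k with one-hot y picks out -log p[target]."""
    return -np.log(probs[target_id])

def perplexity(word_probs):
    """exp of the average negative log-probability over a text; lower is better."""
    return np.exp(-np.mean(np.log(word_probs)))

print(nll(np.array([0.1, 0.7, 0.2]), 1))
# Probabilities the model assigned to each actual next word (illustrative numbers).
print(perplexity(np.array([0.2, 0.1, 0.5, 0.25])))
```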


SLIDE 39

Recurrent Neural Networks

Figure from https://colah.github.io/posts/2015-08-Understanding-LSTMs/ Gongbo Tang Introduction to Neural Machine Translation 29/38

SLIDE 40

Recurrent Neural Language Models

RNNs may condition on context sequences of any length.

(Diagram: the hidden layer values are copied from each prediction step to the next.)

Figure 13.13: Recurrent neural language models: after predicting Word 2 in the context of the preceding Word 1, we re-use this hidden layer (alongside the correct Word 2) to predict Word 3. Again, the hidden layer of this prediction is re-used for the prediction of Word 4.

Gongbo Tang Introduction to Neural Machine Translation 30/38
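A numpy sketch of this recurrent language model: the hidden state is carried (copied) from step to step, so each prediction can condition on the whole preceding sequence. Sizes and weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
V, d = 20, 8
E = rng.normal(size=(V, d))                    # word embeddings
W = rng.normal(size=(d, d)); U = rng.normal(size=(d, d)); b = np.zeros(d)
W_out = rng.normal(size=(d, V))

def rnn_lm(word_ids):
    """Predict each next word, re-using (copying) the previous hidden layer."""
    h = np.zeros(d)
    predictions = []
    for i in word_ids:
        h = np.tanh(E[i] @ W + h @ U + b)
        predictions.append(int(np.argmax(h @ W_out)))
    return predictions

print(rnn_lm([2, 5, 7]))   # untrained weights, so the predicted ids are arbitrary
```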


SLIDE 42

Recurrent Neural Networks

Problems of simple RNNs
The hidden layers serve both as the memory of the network and as the representation used to predict the next word
There is no control over which preceding context words are retained
Gradient explosion or vanishing

Gongbo Tang Introduction to Neural Machine Translation 31/38

SLIDE 43

LSTM & GRU

LSTM: Long Short-Term Memory; GRU: Gated Recurrent Unit

Figure from https://towardsdatascience.com/ illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21 Gongbo Tang Introduction to Neural Machine Translation 32/38

SLIDE 44

LSTM

Gates
Input gate: how much the new input changes the memory state
Forget gate: how much of the prior memory state is retained
Output gate: how strongly the memory state is passed on

(Diagram: an LSTM cell with input, output, and forget gates and elementwise ⊗ and ⊕ operations, connected to the LSTM layer at time t−1, the preceding layer, and the next layer.)

Gongbo Tang Introduction to Neural Machine Translation 33/38
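A numpy sketch of one LSTM step with the three gates above; the weight layout (one matrix over the concatenation [x; h]) is a common convention I am assuming, not taken from the slides:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h_prev, c_prev, W, b):
    """One LSTM step. W maps [x; h_prev] to four pre-activations at once."""
    d = h_prev.size
    z = np.concatenate([x, h_prev]) @ W + b
    i = sigmoid(z[0*d:1*d])    # input gate: how much new input changes the memory
    f = sigmoid(z[1*d:2*d])    # forget gate: how much prior memory is retained
    o = sigmoid(z[2*d:3*d])    # output gate: how strongly memory is passed on
    g = np.tanh(z[3*d:4*d])    # candidate memory content
    c = f * c_prev + i * g     # new memory state
    h = o * np.tanh(c)         # new hidden state
    return h, c

rng = np.random.default_rng(4)
d_in, d = 3, 5
W = rng.normal(size=(d_in + d, 4 * d))
h, c = lstm_cell(rng.normal(size=d_in), np.zeros(d), np.zeros(d), W, np.zeros(4 * d))
print(h.shape, c.shape)
```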

SLIDE 45

GRU

Gates
The GRU is a simplification of the LSTM: it has only two gates, a reset gate and an update gate.

(Diagram: a GRU cell with update and reset gates, connected to the GRU layer at time t−1, the preceding layer, and the next layer.)

Gongbo Tang Introduction to Neural Machine Translation 34/38
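A matching numpy sketch of one GRU step with its reset and update gates (again with assumed weight shapes):

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h_prev, Wz, Wr, Wh):
    """One GRU step; each weight matrix consumes the concatenation [x; h]."""
    xh = np.concatenate([x, h_prev])
    z = sigmoid(xh @ Wz)                       # update gate: keep old vs. take new
    r = sigmoid(xh @ Wr)                       # reset gate: how much history to use
    h_tilde = np.tanh(np.concatenate([x, r * h_prev]) @ Wh)
    return (1 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(5)
d_in, d = 3, 5
Wz, Wr, Wh = (rng.normal(size=(d_in + d, d)) for _ in range(3))
print(gru_cell(rng.normal(size=d_in), np.zeros(d), Wz, Wr, Wh))
```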

SLIDE 46

Deep RNNs

(Diagram panels: Shallow, Deep Stacked, and Deep Transition, each showing the input, hidden layer(s), and output.)

Figure 13.17: Deep recurrent neural networks. The input is passed through a few hidden layers before an output prediction is made. In deep stacked models, the hidden layers are also connected horizontally, i.e., a layer's values at time step t depend on its values at time step t − 1 as well as on the previous layer at time step t. In deep transition models, the layers at any time step t are sequentially connected, and the first hidden layer is also informed by the last layer at time step t − 1.

Gongbo Tang Introduction to Neural Machine Translation 35/38

SLIDE 47

Advanced Neural LMs

Features
Character-level
Bidirectional RNNs
Multilingual
Transformers
Masked/causal/translation LMs

Pre-trained Neural LMs
ELMo, from AllenNLP
GPT(-2), from OpenAI
BERT, from Google
Transformer-XL, from Google/CMU
XLNet, from Google/CMU
XLM, from Facebook

Gongbo Tang Introduction to Neural Machine Translation 36/38


SLIDE 49

Online demos

Write With Transformer
Link: https://transformer.huggingface.co/
Models: Arxiv-NLP, GPT-2, XLNet, GPT

Gongbo Tang Introduction to Neural Machine Translation 37/38

SLIDE 50

Information

Abel Account Application We will use the Abel cluster for the NMT assignment and (NMT) projects, so please apply for your account as early as possible. Here is the key information for the application. Website:

https://www.metacenter.no/user/application/form/notur/

Organization: Uppsala universitet (ZIP code: 75126)
Project: NN9447K: Nordic Language Processing Laboratory (project manager: Stephan Oepen)
Contact: gongbo.tang@lingfil.uu.se

Gongbo Tang Introduction to Neural Machine Translation 38/38