Machine Translation 2: Statistical MT: Neural MT and Representations - PowerPoint PPT Presentation

SLIDE 1

Machine Translation 2: Statistical MT: Neural MT and Representations

Ondřej Bojar bojar@ufal.mff.cuni.cz Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague

SLIDE 2

Outline of Lectures on MT

  • 1. Introduction.
  • Why is MT difficult.
  • MT evaluation.
  • Approaches to MT.
  • Document, sentence and esp. word alignment.
  • Classical Statistical Machine Translation.

– Phrase-Based MT.

  • 2. Neural Machine Translation.
  • Neural MT: Sequence-to-sequence, attention, self-attentive.
  • Sentence representations.
  • Role of Linguistic Features in MT.

SLIDE 3

Outline of MT Lecture 2

  • 1. Fundamental problems of PBMT.
  • 2. Neural machine translation (NMT).
  • Brief summary of NNs.
  • Sequence-to-sequence, with attention.
  • Transformer, self-attention.
  • Linguistic features in NMT.

SLIDE 4

Summary of PBMT

Phrase-based MT:

  • is a log-linear model
  • assumes phrases relatively independent of each other
  • decomposes sentence into contiguous phrases
  • search has two parts:

– lookup of all relevant translation options
– stack-based beam search, gradually expanding hypotheses

To train a PBMT system (a sketch of step 2's consistency check follows the list):

  • 1. Align words.
  • 2. Extract (and score) phrases consistent with word alignment.
  • 3. Optimize weights (MERT).
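A minimal sketch of the consistency criterion behind step 2, assuming the word alignment is given as a set of (source position, target position) pairs; the sentence pair and its alignment below are illustrative, not taken from the lecture:

def consistent(alignment, f_start, f_end, e_start, e_end):
    """A phrase pair is consistent iff no alignment link crosses its boundary
    and it contains at least one link."""
    inside = [(f, e) for (f, e) in alignment
              if f_start <= f <= f_end and e_start <= e <= e_end]
    if not inside:
        return False
    for (f, e) in alignment:
        f_in = f_start <= f <= f_end
        e_in = e_start <= e <= e_end
        if f_in != e_in:                 # link leaves the box: inconsistent
            return False
    return True

# "Nemám žádného psa ." / "I have no dog ." aligned 0-0, 0-1, 0-2, 1-2, 2-3, 3-4
alignment = {(0, 0), (0, 1), (0, 2), (1, 2), (2, 3), (3, 4)}
print(consistent(alignment, 0, 1, 0, 2))   # "Nemám žádného" - "I have no": True
print(consistent(alignment, 1, 1, 2, 2))   # "žádného" - "no": False, link 0-2 crosses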

SLIDE 5

1: Align Training Sentences

Nemám žádného psa. = I have no dog.
Viděl kočku. = He saw a cat.

SLIDE 6

2: Align Words

Nemám žádného psa. = I have no dog.
Viděl kočku. = He saw a cat.

SLIDE 7

3: Extract Phrase Pairs (MTUs)

Nemám žádného psa. = I have no dog.
Viděl kočku. = He saw a cat.

SLIDE 8

4: New Input

Nemám žádného psa. = I have no dog.
Viděl kočku. = He saw a cat.
New input: Nemám kočku.

SLIDE 9

4: New Input

Nemám žádného psa. = I have no dog.
Viděl kočku. = He saw a cat.
New input: Nemám kočku.

... I don't have cat.

SLIDE 10

5: Pick Probable Phrase Pairs (TM)

Nemám žádného psa. = I have no dog.
Viděl kočku. = He saw a cat.
New input: Nemám kočku.

... I don't have cat.

New input: Nemám → I have

SLIDE 11

6: So That n-Grams Probable (LM)

Nemám žádného psa. = I have no dog.
Viděl kočku. = He saw a cat.
New input: Nemám kočku.

... I don't have cat.

New input: Nemám kočku. → I have a cat.

SLIDE 12

Meaning Got Reversed!

Nemám žádného psa. = I have no dog.
Viděl kočku. = He saw a cat.
New input: Nemám kočku.

... I don't have cat.

New input: Nemám kočku. → I have a cat. ✘

SLIDE 13

What Went Wrong?

ê_1^Î = argmax_{I, e_1^I} p(f_1^J | e_1^I) · p(e_1^I)
      = argmax_{I, e_1^I} [ ∏_{(f̂, ê) ∈ phrase pairs of (f_1^J, e_1^I)} p(f̂ | ê) ] · p(e_1^I)    (1)

  • Too strong a phrase-independence assumption.

– Phrases do depend on each other. Here “nemám” and “žádného” jointly express one negation.
– Word alignments ignored that dependence. But adding it would increase data sparseness.

  • Language model is a separate unit.

– p(e_1^I) models the target sentence independently of f_1^J.

SLIDE 14

Redefining p(e_1^I | f_1^J)

What if we modelled p(e_1^I | f_1^J) directly, word by word:

p(e_1^I | f_1^J) = p(e_1, e_2, . . . , e_I | f_1^J)
                 = p(e_1 | f_1^J) · p(e_2 | e_1, f_1^J) · p(e_3 | e_2, e_1, f_1^J) . . .
                 = ∏_{i=1}^{I} p(e_i | e_1, . . . , e_{i−1}, f_1^J)    (2)

…this is “just a cleverer language model”: p(e_1^I) = ∏_{i=1}^{I} p(e_i | e_1, . . . , e_{i−1})

Main Benefit: All dependencies are available. But what technical device can learn this?
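A minimal sketch of what eq. (2) asks a model to provide, assuming some hypothetical function p_next(prefix, source) that returns a distribution over the next target word (the function name and interface are made up for illustration):

import math

def sentence_log_prob(target_words, source_words, p_next):
    # chain rule of eq. (2): sum of log p(e_i | e_1..e_{i-1}, f_1^J)
    log_p = 0.0
    prefix = []
    for e_i in target_words + ["</s>"]:
        dist = p_next(prefix, source_words)
        log_p += math.log(dist[e_i])
        prefix.append(e_i)
    return log_p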

SLIDE 15

NNs: Universal Approximators

  • A neural network with a single hidden layer (possibly huge) can

approximate any continuous function to any precision.

  • (Nothing claimed about learnability.)

https://www.quora.com/How-can-a-deep-neural-network-with-ReLU-activations-in-its-hidden-layers-approximate-any-function

SLIDE 16

playground.tensorflow.org

SLIDE 17

Perfect Features

SLIDE 18

Bad Features & Low Depth

SLIDE 19

Too Complex NN Fails to Learn

SLIDE 20

Deep NNs for Image Classification

SLIDE 21

Representation Learning

  • Based on training data

(sample inputs and expected outputs)

  • the neural network learns by itself
  • what is important in the inputs
  • to predict the outputs best.

A “representation” is a new set of axes.

  • Instead of 3 dimensions (x, y, color), we get
  • 2000 dimensions: (elephantity, number of storks, blueness, …)
  • designed automatically to help in best prediction of the output

SLIDE 22

One Layer tanh(Wx + b), 2D→2D

Skew: W
Translate: b
Non-lin.: tanh

Animation by http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
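A tiny numpy sketch of this single layer, with made-up W and b, just to make explicit that it is one linear map, one shift and one point-wise non-linearity:

import numpy as np

W = np.array([[1.0, 0.5],
              [0.0, 1.0]])        # "skew": linear transformation
b = np.array([0.3, -0.2])         # "translate": shift by the bias

def layer(x):
    return np.tanh(W @ x + b)     # point-wise non-linearity

print(layer(np.array([1.0, 2.0])))   # a 2-D point mapped to another 2-D point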

SLIDE 23

Four Layers, Disentangling Spirals

Animation by http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

SLIDE 24

Processing Text with NNs

  • Map each word to a vector of 0s and 1s (“1-hot repr.”):

cat → (0, 0, . . . , 0, 1, 0, . . . , 0)

  • Sentence is then a matrix:

(Illustration: the sentence “the cat is … the mat” becomes a sparse matrix with one column per word and a single 1 in the row of that word; the rows range over the whole vocabulary, a … zebra; vocabulary sizes are about 1.3M wordforms for English and 2.2M for Czech.)

Main drawback: No relations, all words equally close/far.
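A minimal sketch of the 1-hot mapping with a toy vocabulary in place of the 1.3M-word English one:

import numpy as np

vocab = ["a", "about", "cat", "is", "mat", "on", "the", "zebra"]   # toy vocabulary
word2id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[word2id[word]] = 1.0
    return v

# a sentence becomes a matrix with one 1-hot column per word
sentence = "the cat is on the mat".split()
M = np.stack([one_hot(w) for w in sentence], axis=1)
print(M.shape)   # (8, 6): vocabulary size × sentence length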

SLIDE 27

Solution: Word Embeddings

  • Map each word to a dense vector.
  • In practice 300–2000 dimensions are used, not 1–2M.

– The dimensions have no clear interpretation.

  • Embeddings are trained for each particular task.

– NNs: The matrix that maps the 1-hot input to the first layer.

  • The famous word2vec (Mikolov et al., 2013):

– CBOW: Predict the word from its four neighbours.
– Skip-gram: Predict likely neighbours given the word.

(Figure: a network with input layer x_1 … x_V, hidden layer h_1 … h_N, output layer y_1 … y_V, and weight matrices W_{V×N} = {w_{ki}} and W′_{N×V} = {w′_{ij}}.)

Right: CBOW with just a single-word context (http://www-personal.umich.edu/~ronxin/pdf/w2vexp.pdf)
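A rough numpy sketch of the single-word-context CBOW forward pass from the figure, with random weights and toy sizes (the real matrices are learned; nothing here is trained):

import numpy as np

V, N = 10000, 300                   # vocabulary size, embedding dimension
W = np.random.randn(V, N) * 0.01    # input→hidden weights = input word embeddings
W2 = np.random.randn(N, V) * 0.01   # hidden→output weights

def cbow_single_context(context_word_id):
    h = W[context_word_id]          # hidden layer = embedding of the context word
    scores = h @ W2                 # one score per vocabulary word
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()      # softmax: p(centre word | context word)

print(cbow_single_context(42).shape)   # (10000,)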

SLIDE 28

Continuous Space of Words

Word2vec embeddings show interesting properties: v(king) − v(man) + v(woman) ≈ v(queen) (3)

Illustrations from https://www.tensorflow.org/tutorials/word2vec
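A sketch of how property (3) is usually checked, assuming an embedding table is already available (random toy vectors here, so the printed neighbour is meaningless; with trained word2vec vectors the nearest neighbour of the query tends to be “queen”):

import numpy as np

vocab = ["king", "queen", "man", "woman", "cat"]       # toy vocabulary
E = {w: np.random.randn(300) for w in vocab}           # stand-in for trained vectors

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def nearest(vec, exclude):
    return max((w for w in vocab if w not in exclude), key=lambda w: cos(E[w], vec))

query = E["king"] - E["man"] + E["woman"]
print(nearest(query, exclude={"king", "man", "woman"}))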

SLIDE 29

Further Compression: Sub-Words

  • SMT struggled with productive morphology (>1M wordforms).

nejneobhospodařovávatelnějšími, Donaudampfschifffahrtsgesellschaftskapitän

  • NMT can handle only vocabularies of some 30–80k items.

⇒ Resort to sub-word units.

Orig        český politik svezl migranty
Syllables   čes ký ⊔ po li tik ⊔ sve zl ⊔ mig ran ty
Morphemes   česk ý ⊔ politik ⊔ s vez l ⊔ migrant y
Char Pairs  če sk ý ⊔ po li ti k ⊔ sv ez l ⊔ mi gr an ty
Chars       č e s k ý ⊔ p o l i t i k ⊔ s v e z l ⊔ m i g r a n t y
BPE 30k     český politik s@@ vez@@ l mi@@ granty

BPE (Byte-Pair Encoding) uses the n most common substrings (incl. frequent words).
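A toy sketch of how BPE merge operations are learned by greedy pair counting; a real implementation (e.g. the subword-nmt toolkit) works on a frequency-weighted vocabulary rather than a plain word list:

from collections import Counter

def merge_pair(symbols, a, b):
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
            out.append(a + b)        # replace the pair by the merged symbol
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_bpe(words, n_merges):
    words = [list(w) + ["</w>"] for w in words]          # start from characters
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))                  # count adjacent symbol pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]              # most frequent pair
        merges.append((a, b))
        words = [merge_pair(w, a, b) for w in words]
    return merges

print(learn_bpe(["politik", "politika", "politice"], 5))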

SLIDE 30

Variable-Length Inputs

Variable-length input can be handled by recurrent NNs:

  • Reading one input symbol at a time.

– The same (trained) transformation A used every time.

  • Unroll in time (up to a fixed length limit).

Vanilla RNN: h_t = tanh(W[h_{t−1}; x_t] + b)    (4)
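A numpy sketch of eq. (4) unrolled over a short input, with random weights standing in for the trained W and b:

import numpy as np

d_hidden, d_input = 4, 3
W = np.random.randn(d_hidden, d_hidden + d_input) * 0.1   # one shared matrix for all steps
b = np.zeros(d_hidden)

def rnn(inputs):
    h = np.zeros(d_hidden)                                 # initial state
    for x_t in inputs:                                     # same transformation at every step
        h = np.tanh(W @ np.concatenate([h, x_t]) + b)      # eq. (4)
    return h

sentence = [np.random.randn(d_input) for _ in range(5)]    # 5 input symbols
print(rnn(sentence))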

SLIDE 31

Neural Language Model

  • estimate probability of a sentence using the chain rule
  • output distributions can be used for sampling

Thanks to Jindřich Libovický for the slides.

SLIDE 32

Sampling from a LM

  • “Autoregressive decoder” = conditioned on its preceding output.

SLIDE 33

Autoregressive Decoding

# Greedy autoregressive decoding (the slide's pseudocode, lightly annotated;
# `state`, `vocab`, `target_embeddings`, `dec_cell` and `output_projection` are assumed given)
last_w = "<s>"
while last_w != "</s>":
    last_w_embedding = target_embeddings[last_w]            # embed the previous output word
    state, dec_output = dec_cell(state, last_w_embedding)   # one decoder RNN step
    logits = output_projection(dec_output)                  # scores over the target vocabulary
    last_w = vocab[np.argmax(logits)]                       # greedy pick of the next word
    yield last_w

SLIDE 34

RNN Training vs. Runtime

(Figure: the decoder unrolled over time. At runtime the decoded word ŷ_j is fed back as the next input; at training time the ground-truth word y_j is fed in instead and the loss is computed against the ground truth.)

SLIDE 35

NNs as Translation Model in SMT

Cho et al. (2014) proposed:

  • encoder-decoder architecture and
  • GRU unit (name given later by Chung et al. (2014))
  • to score variable-length phrase pairs in PBMT.

(Figure: the encoder RNN reads x_1 … x_T into a summary vector c; the decoder RNN generates y_1 … y_T′ conditioned on c.)

SLIDE 36

⇒ Embeddings of Phrases

SLIDE 37

⇒ Syntactic Similarity (“of the”)

SLIDE 38

⇒ Semantic Similarity (Countries)

SLIDE 39

NMT: Sequence to Sequence

Sutskever et al. (2014) use:

  • LSTM RNN encoder-decoder
  • to consume and produce variable-length sentences.

First the Encoder:

SLIDE 40

Then the Decoder

Remember: p(e_1^I | f_1^J) = p(e_1 | f_1^J) · p(e_2 | e_1, f_1^J) · p(e_3 | e_2, e_1, f_1^J) . . .

  • Again RNN, producing one word at a time.
  • The produced word fed back into the network.

– (Word embeddings in the target language used here.)

SLIDE 41

Encoder-Decoder Architecture

https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/

SLIDE 42

Continuous Space of Sentences

I gave her a card in the garden · In the garden, I gave her a card · She was given a card by me in the garden
She gave me a card in the garden · In the garden, she gave me a card · I was given a card by her in the garden

2-D PCA projection of 8000-D space representing sentences (Sutskever et al., 2014).

SLIDE 43

Architectures in the Decoder

  • RNN – original sequence-to-sequence learning (2015)

– principle known since 2014 (University of Montreal)
– made usable in 2016 (University of Edinburgh)

  • CNN – convolution sequence-to-sequence by Facebook (2017)
  • Self-attention (so-called Transformer) by Google (2017)

SLIDE 44

Attention (1/3)

  • Arbitrary-length sentences fit badly into a fixed vector.
  • Reading input backward works better.

… because early words will be more salient. ⇒ Use a bi-directional RNN and “attend” to all states h_i.

SLIDE 45

Attention (2/3)

  • Add a sub-network predicting the importance of source states at each step.

SLIDE 46

Attention (3/3)

(Figure: attention weights α_0 … α_4 scale the encoder states h_0 … h_4 of the input <s> x_1 x_2 x_3 x_4; their weighted sum enters the decoder states s_{i−1}, s_i, s_{i+1} when producing ỹ_i, ỹ_{i+1}.)

SLIDE 47

Attention Model in Equations (1)

Inputs: decoder state s_i, encoder states h_j = [→h_j; ←h_j]  ∀j = 1 … T_x

Attention energies:     e_ij = v_a^⊤ tanh(W_a s_{i−1} + U_a h_j + b_a)

Attention distribution: α_ij = exp(e_ij) / ∑_{k=1}^{T_x} exp(e_ik)

Context vector:         c_i = ∑_{j=1}^{T_x} α_ij h_j
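A numpy sketch of these three equations for one decoder step, with random stand-ins for the trained parameters W_a, U_a, b_a, v_a and for the encoder/decoder states:

import numpy as np

Tx, d = 6, 8                             # source length, state size
h = np.random.randn(Tx, d)               # encoder states h_1 .. h_Tx
s_prev = np.random.randn(d)              # previous decoder state s_{i-1}
W_a, U_a = np.random.randn(d, d), np.random.randn(d, d)
b_a, v_a = np.zeros(d), np.random.randn(d)

e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h[j] + b_a) for j in range(Tx)])
alpha = np.exp(e) / np.exp(e).sum()      # attention distribution over source positions
c = alpha @ h                            # context vector c_i = sum_j alpha_ij h_j
print(alpha.round(2), c.shape)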

SLIDE 48

Attention Model in Equations (2)

Decoder state:       s_i = tanh(U_d s_{i−1} + V_d E y_{i−1} + C_d c_i + b_d)   …attention is mixed with the hidden state
Output projection:   t_i = tanh(U_o s_i + V_o E y_{i−1} + C_o c_i + b_o)
Output distribution: p(y_i = k | s_i, y_{i−1}, c_i) ∝ exp((W_o t_i)_k + b_k)

SLIDE 49

Attention ≈ Alignment

  • We can collect the attention across time.
  • Each column corresponds to one decoder time step.
  • Source tokens correspond to rows.

SLIDE 50

Transformer Model

See slides from NPFL087: https://ufal.mff.cuni.cz/courses/npfl087 #08 Transformer and Syntax in NMT

  • Transformer model.
  • Self-Attention.
  • Options for linguistics in NMT:

– Constrain network structure. …usually too limited.
– Richer input. …not so much needed.
– Multi-task to predict. …promising, but serious baselines needed.

SLIDE 51

Is NMT That Much Better?

The outputs of this year’s best system: http://matrix.statmt.org/

SRC A 28-year-old chef who had recently moved to San Francisco was found dead in the stairwell of a local mall this week.
MT  Osmadvacetiletý kuchař, který se nedávno přestěhoval do San Francisca, byl tento týden nalezen mrtvý na schodišti místního obchodního centra.
REF Osmadvacetiletý šéfkuchař, který se nedávno přistěhoval do San Franciska, byl tento týden ∅ schodech místního obchodu.

SRC There were creative differences on the set and a disagreement.
REF Došlo ke vzniku kreativních rozdílů na scéně a k neshodám.
MT  Na place byly tvůrčí rozdíly a neshody.

SLIDE 53

Luckily ;-) Bad Errors Happen

SRC ... said Frank initially stayed in hostels...
MT  ... řekl, že Frank původně zůstal v Budějovicích...
SRC Most of the Clintons’ income...
MT  Většinu příjmů Kliniky...
SRC The 63-year-old has now been made a special representative...
MT  63letý mladík se nyní stal zvláštním zástupcem...
SRC He listened to the moving stories of the women.
MT  Naslouchal pohyblivým příběhům žen.

SLIDE 54

Catastrophic Errors

SRC Criminal Minds star Thomas Gibson sacked after hitting producer
REF Thomas Gibson, hvězda seriálu Myšlenky zločince, byl propuštěn po té, co uhodil režiséra
MT  Kriminalisté Minsku hvězdu Thomase Gibsona vyhostili po zásahu producenta
SRC ...add to that its long-standing grudge...
REF ...přidejte k tomu svou dlouholetou nenávist...
MT  ...přidejte k tomu svou dlouholetou záštitu... (grudge → zášť → záštita)

SLIDE 55

German→Czech SMT vs. NMT

  • A smaller dataset, very first (but comparable) results.
  • NMT performs better on average, but occasionally:

SRC Das Spektakel ähnelt dem Eurovision Song Contest.
REF Je to jako pěvecká soutěž Eurovision.
SMT Podívanou připomíná hudební soutěž Eurovize.
NMT Divadlo se podobá Eurovizi Conview.

SRC Erderwärmung oder Zusammenstoß mit Killerasteroid.
REF Globální oteplení nebo kolize se zabijáckým asteroidem.
SMT Globální oteplování, nebo srážka s Killerasteroid.
NMT Globální oteplování, nebo střet s zabijákem.

SRC Zu viele verletzte Gefühle.
REF Příliš mnoho nepřátelských pocitů.
SMT Příliš mnoho zraněných pocity.
NMT Příliš mnoho zraněných ∅.

SLIDE 56

Ultimate Goal of SMT vs. NMT

Goal of “classical” SMT: Find minimum translation units ∼ graph partitions:

  • such that they are frequent across many sentence pairs.
  • without imposing (too hard) constraints on reordering.
  • in an unsupervised fashion.

Goal of neural MT: Avoid minimum translation units. Find NN architecture that

  • Reads input in as original form as possible.
  • Produces output in as final form as possible.
  • Can be optimized end-to-end in practice.

SLIDE 57

Summary

  • What makes MT statistical.

Two crucially different models covered:

  • Phrase-based: contiguous but independent phrases.

– Bayes Law as a special case of Log-Linear Model.
– Hand-crafted features (scoring functions); local vs. non-local.
– Decoding as search, expanding partial hypotheses.

  • Neural: unit-less, continuous space.

– NMT as a fancy Language Model.
– Word embeddings, subwords.
– RNNs for variable-length input and output.
– Attention model and self-attention.

  • Linguistic features in NMT.

SLIDE 58

References

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar, October. Association for Computational Linguistics.

Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.
