

slide-1
SLIDE 1

Machine Translation 2: Statistical MT: Phrase-Based and Neural

Ondřej Bojar bojar@ufal.mff.cuni.cz Institute of Formal and Applied Linguistics Faculty of Mathematics and Physics Charles University, Prague

December 2018 MT2: PBMT, NMT

slide-2
SLIDE 2

Outline of Lectures on MT

  • 1. Introduction.
  • Why is MT difficult.
  • MT evaluation.
  • Approaches to MT.
  • First peek into phrase-based MT
  • Document, sentence and word alignment.
  • 2. Statistical Machine Translation.
  • Phrase-based: Assumptions, beam search, key issues.
  • Neural MT: Sequence-to-sequence, attention, self-attentive.
  • 3. Advanced Topics.
  • Linguistic Features in SMT and NMT.
  • Multilinguality, Multi-Task, Learned Representations.

December 2018 MT2: PBMT, NMT 1

slide-3
SLIDE 3

Outline of MT Lecture 2

  • 1. What makes MT statistical.
  • Brute-force statistical MT.
  • Noisy channel model.
  • Log-linear model.
  • 2. Phrase-based translation model.
  • Phrase extraction.
  • Decoding (gradual construction of hypotheses).
  • Minimum error-rate training (weight optimization).
  • 3. Neural machine translation (NMT).
  • Sequence-to-sequence, with attention.

December 2018 MT2: PBMT, NMT 2

slide-4
SLIDE 4

Quotes

Warren Weaver (1949):

I have a text in front of me which is written in Russian but I am going to pretend that it is really written in English and that it has been coded in some strange symbols. All I need to do is strip off the code in order to retrieve the information contained in the text.

Noam Chomsky (1969):

. . . the notion “probability of a sentence” is an entirely useless one, under any known interpretation of this term.

Frederick Jelinek (1980s; IBM; later JHU and sometimes ÚFAL):

Every time I fire a linguist, the accuracy goes up.

Hermann Ney (RWTH Aachen University):

MT = Linguistic Modelling + Statistical Decision Theory

December 2018 MT2: PBMT, NMT 3

slide-5
SLIDE 5

The Statistical Approach

(Statistical = Information-theoretic.)

  • Specify a probabilistic model.
    = How is the probability mass distributed among possible outputs given observed inputs.
  • Specify the training criterion and procedure.
    = How to learn free parameters from training data.

Notice:

  • Linguistics helpful when designing the models:
    – How to divide the input into smaller units.
    – Which bits of observations are more informative.

December 2018 MT2: PBMT, NMT 4

slide-6
SLIDE 6

Statistical MT

Given a source (foreign) language sentence f_1^J = f_1 ... f_j ... f_J,
produce a target language (English) sentence e_1^I = e_1 ... e_i ... e_I.

Among all possible target language sentences, choose the sentence with the highest probability:

  \hat{e}_1^{\hat{I}} = argmax_{I, e_1^I} p(e_1^I | f_1^J)    (1)

We stick to the e_1^I, f_1^J notation despite translating from English to Czech.

December 2018 MT2: PBMT, NMT 5

slide-7
SLIDE 7

Brute-Force MT (1/2)

Translate only sentences listed in a “translation memory” (TM):

  Good morning. = Dobré ráno.
  How are you? = Jak se máš?
  How are you? = Jak se máte?

  p(e_1^I | f_1^J) = 1 if the pair (f_1^J, e_1^I) is seen in the TM, 0 otherwise    (2)

Any problems with the definition?

  • Not a probability. There may be f_1^J s.t. \sum_{e_1^I} p(e_1^I | f_1^J) > 1.
    ⇒ Have to normalize: use count(e_1^I, f_1^J) / count(f_1^J) instead of 1.
  • Not “smooth”, no generalization:
    Good morning. ⇒ Dobré ráno.

December 2018 MT2: PBMT, NMT 6

slide-8
SLIDE 8

Brute-Force MT (2/2)

Translate only sentences listed in a “translation memory” (TM):

  Good morning. = Dobré ráno.
  How are you? = Jak se máš?
  How are you? = Jak se máte?

  p(e_1^I | f_1^J) = 1 if the pair (f_1^J, e_1^I) is seen in the TM, 0 otherwise    (3)

  • Not a probability. There may be f_1^J s.t. \sum_{e_1^I} p(e_1^I | f_1^J) > 1.
    ⇒ Have to normalize: use count(e_1^I, f_1^J) / count(f_1^J) instead of 1.
  • Not “smooth”, no generalization:
    Good morning. ⇒ Dobré ráno.    Good evening. ⇒ ∅

December 2018 MT2: PBMT, NMT 7

slide-9
SLIDE 9

Bayes’ Law

Bayes’ law for conditional probabilities: p(a|b) = p(b|a) p(a) / p(b)

So in our case:

  \hat{e}_1^{\hat{I}} = argmax_{I, e_1^I} p(e_1^I | f_1^J)
                                                                 (apply Bayes’ law)
                      = argmax_{I, e_1^I} p(f_1^J | e_1^I) p(e_1^I) / p(f_1^J)
                                                                 (p(f_1^J) constant ⇒ irrelevant in maximization)
                      = argmax_{I, e_1^I} p(f_1^J | e_1^I) p(e_1^I)

Also called the “Noisy Channel” model.

December 2018 MT2: PBMT, NMT 8

slide-10
SLIDE 10

Motivation for Noisy Channel

  \hat{e}_1^{\hat{I}} = argmax_{I, e_1^I} p(f_1^J | e_1^I) p(e_1^I)    (4)

Bayes’ law divided the model into components:

  p(f_1^J | e_1^I)   ... the translation model (“reversed”, e_1^I → f_1^J):
                         is it a likely translation?
  p(e_1^I)           ... the language model (LM):
                         is the output a likely sentence of the target language?

  • The components can be trained on different sources.
    There are far more monolingual data ⇒ the language model is more reliable.

December 2018 MT2: PBMT, NMT 9

slide-11
SLIDE 11

Without Equations

[Diagram: Input → Global Search for the sentence with the highest probability → Output; Parallel Texts → Translation Model; Monolingual Texts → Language Model.]

December 2018 MT2: PBMT, NMT 10

slide-12
SLIDE 12

Summary of Language Models

  • p(e_1^I) should report how “good” sentence e_1^I is.
  • We surely want p(The the the.) < p(Hello.)
  • How about p(The cat was black.) < p(Hello.)?
    ... We don’t really care in MT. We hope to compare synonymous sentences.

The LM is usually a 3-gram language model:

  p(<s> The cat was black . </s>) = p(The|<s>) p(cat|<s> The) p(was|The cat)
                                    p(black|cat was) p(.|was black) p(</s>|black .)

Formally, with n = 3:

  p_LM(e_1^I) = \prod_{i=1}^{I} p(e_i | e_{i-n+1}^{i-1})    (5)

December 2018 MT2: PBMT, NMT 11

slide-13
SLIDE 13

Estimating and Smoothing LM

Unigram probabilities:  p(w1) = count(w1) / (total words observed)
Bigram probabilities:   p(w2|w1) = count(w1 w2) / count(w1)
Trigram probabilities:  p(w3|w1, w2) = count(w1 w2 w3) / count(w1 w2)

Unseen n-grams (p(n-gram) = 0) are a big problem; they invalidate the whole sentence:

  p_LM(e_1^I) = ... · 0 · ... = 0

⇒ Back off to shorter n-grams:

  p_LM(e_1^I) = \prod_{i=1}^{I} [ 0.8 · p(e_i | e_{i-1}, e_{i-2})
                                + 0.15 · p(e_i | e_{i-1})
                                + 0.049 · p(e_i)
                                + 0.001 ]  ≠ 0    (6)

December 2018 MT2: PBMT, NMT 12
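The interpolation in Equation (6) is easy to make concrete. Below is a minimal Python sketch under assumed toy data: the 0.8/0.15/0.049/0.001 weights follow the slide, while the tiny corpus and the helper names are purely illustrative, not part of any real LM toolkit.

import math
from collections import Counter

corpus = "<s> the cat was black . </s> <s> the cat sat . </s>".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
total = len(corpus)

def p_interp(w3, w1, w2):
    # 0.8 * trigram + 0.15 * bigram + 0.049 * unigram + 0.001 floor, as in Eq. (6)
    p3 = trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0
    p2 = bigrams[(w2, w3)] / unigrams[w2] if unigrams[w2] else 0.0
    p1 = unigrams[w3] / total
    return 0.8 * p3 + 0.15 * p2 + 0.049 * p1 + 0.001

def sentence_logprob(words):
    # score every word given its two predecessors; the result is never exactly zero
    return sum(math.log(p_interp(words[i], words[i - 2], words[i - 1]))
               for i in range(2, len(words)))

print(sentence_logprob("<s> the cat was black . </s>".split()))

Because of the constant 0.001 term, even a sentence full of unseen trigrams gets a small but non-zero probability, which is exactly the point of the back-off.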

slide-14
SLIDE 14

From Bayes to Log-Linear Model

Och (2002) discusses some problems of the noisy-channel model (Equation 4):

  • Models estimated unreliably ⇒ maybe the LM is more important:

      \hat{e}_1^{\hat{I}} = argmax_{I, e_1^I} p(f_1^J | e_1^I) (p(e_1^I))^2    (7)

  • In practice, the “direct” translation model is equally good:

      \hat{e}_1^{\hat{I}} = argmax_{I, e_1^I} p(e_1^I | f_1^J) p(e_1^I)    (8)

  • Complicated to correctly introduce other dependencies.

⇒ Use a log-linear model instead.

December 2018 MT2: PBMT, NMT 13

slide-15
SLIDE 15

Log-Linear Model (1)

  • p(e_1^I | f_1^J) is modelled as a weighted combination of models,
    called “feature functions” h_1(·,·) ... h_M(·,·):

      p(e_1^I | f_1^J) = exp(\sum_{m=1}^{M} λ_m h_m(e_1^I, f_1^J))
                         / \sum_{e'_1^{I'}} exp(\sum_{m=1}^{M} λ_m h_m(e'_1^{I'}, f_1^J))    (9)

  • Each feature function h_m(e, f) relates source f to target e.
    E.g. the feature for an n-gram language model:

      h_LM(f_1^J, e_1^I) = log \prod_{i=1}^{I} p(e_i | e_{i-n+1}^{i-1})    (10)

  • Model weights λ_1^M specify the relative importance of features.

December 2018 MT2: PBMT, NMT 14

slide-16
SLIDE 16

Log-Linear Model (2)

As before, the constant denominator is not needed in maximization:

  \hat{e}_1^{\hat{I}} = argmax_{I, e_1^I} exp(\sum_{m=1}^{M} λ_m h_m(e_1^I, f_1^J))
                        / \sum_{e'_1^{I'}} exp(\sum_{m=1}^{M} λ_m h_m(e'_1^{I'}, f_1^J))
                      = argmax_{I, e_1^I} exp(\sum_{m=1}^{M} λ_m h_m(e_1^I, f_1^J))    (11)

December 2018 MT2: PBMT, NMT 15

slide-17
SLIDE 17

Relation to Noisy Channel

With equal weights and only two features:

  • h_TM(e_1^I, f_1^J) = log p(f_1^J | e_1^I) for the translation model,
  • h_LM(e_1^I, f_1^J) = log p(e_1^I) for the language model,

the log-linear model reduces to the Noisy Channel:

  \hat{e}_1^{\hat{I}} = argmax_{I, e_1^I} exp(\sum_{m=1}^{M} λ_m h_m(e_1^I, f_1^J))
                      = argmax_{I, e_1^I} exp(h_TM(e_1^I, f_1^J) + h_LM(e_1^I, f_1^J))
                      = argmax_{I, e_1^I} exp(log p(f_1^J | e_1^I) + log p(e_1^I))
                      = argmax_{I, e_1^I} p(f_1^J | e_1^I) p(e_1^I)    (12)

December 2018 MT2: PBMT, NMT 16
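A minimal Python sketch of log-linear rescoring, under invented feature values (the two candidates, their scores and the weight names are illustrative only). With just h_TM and h_LM and unit weights, the ranking it produces is exactly the noisy-channel ranking by p(f|e) · p(e), as in Equation (12).

import math

def loglinear_score(features, weights):
    # sum_m lambda_m * h_m(e, f); the constant denominator of Eq. (9) is ignored
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical feature values (log probabilities) for two candidate translations of one source.
candidates = {
    "candidate 1": {"h_TM": math.log(0.02), "h_LM": math.log(0.001)},
    "candidate 2": {"h_TM": math.log(0.01), "h_LM": math.log(0.010)},
}
weights = {"h_TM": 1.0, "h_LM": 1.0}   # equal weights = Noisy Channel (Eq. 12)

best = max(candidates, key=lambda e: loglinear_score(candidates[e], weights))
print(best)   # candidate 2: 0.01 * 0.010 > 0.02 * 0.001

Changing the weights (e.g. doubling the LM weight, as in Equation 7) changes the ranking without touching the component models; that flexibility is the practical argument for the log-linear formulation.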

slide-18
SLIDE 18

Phrase-Based MT Overview

  This time around , they ’re moving even faster .  =  Nyní zareagovaly dokonce ještě rychleji .

  This time around = Nyní                      This time around , they ’re moving = Nyní zareagovaly
  they ’re moving = zareagovaly                even faster = dokonce ještě rychleji
  even = dokonce ještě                         ... = ...
  ... = ...

Phrase-based MT: choose such a segmentation of the input string and such phrase
“replacements” that make the output sequence “coherent” (3-grams most probable).

December 2018 MT2: PBMT, NMT 17

slide-19
SLIDE 19

Phrase-Based Translation Model

  • Captures the basic assumption of phrase-based MT:
    1. Segment the source sentence f_1^J into K phrases \tilde{f}_1 ... \tilde{f}_K.
    2. Translate each phrase independently: \tilde{f}_k → \tilde{e}_k.
    3. Concatenate the translated phrases (with possible reordering R):
       \tilde{e}_{R(1)} ... \tilde{e}_{R(K)}

  • In theory, the segmentation s_1^K is a hidden variable in the maximization;
    we should be summing over all segmentations
    (note the three arguments in h_m(·,·,·) now):

      \hat{e}_1^{\hat{I}} = argmax_{I, e_1^I} \sum_{s_1^K} exp(\sum_{m=1}^{M} λ_m h_m(e_1^I, f_1^J, s_1^K))    (13)

  • In practice, the sum is approximated with a max (the biggest element only):

      \hat{e}_1^{\hat{I}} = argmax_{I, e_1^I} max_{s_1^K} exp(\sum_{m=1}^{M} λ_m h_m(e_1^I, f_1^J, s_1^K))    (14)

December 2018 MT2: PBMT, NMT 18

slide-20
SLIDE 20

Core Feature: Phrase Trans. Prob.

The most important feature: phrase-to-phrase translation:

  h_Phr(f_1^J, e_1^I, s_1^K) = log \prod_{k=1}^{K} p(\tilde{f}_k | \tilde{e}_k)    (15)

The conditional probability of phrase \tilde{f}_k given phrase \tilde{e}_k is estimated from relative frequencies:

  p(\tilde{f}_k | \tilde{e}_k) = count(\tilde{f}, \tilde{e}) / count(\tilde{e})    (16)

  • count(\tilde{f}, \tilde{e}) is the number of co-occurrences of the phrase pair (\tilde{f}, \tilde{e})
    that are consistent with the word alignment.
  • count(\tilde{e}) is the number of occurrences of the target phrase \tilde{e} in the training corpus.
  • h_Phr is usually used twice, in both directions: p(\tilde{f}_k | \tilde{e}_k) and p(\tilde{e}_k | \tilde{f}_k).

December 2018 MT2: PBMT, NMT 19
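A minimal sketch of the relative-frequency estimate in Equation (16). The extracted phrase pairs below are invented (in a real system they come from the word-aligned parallel corpus, as on the following slides), and the function name is ours, not Moses'.

from collections import Counter

# Hypothetical phrase pairs (f_phrase, e_phrase) extracted from a word-aligned corpus.
extracted_pairs = [
    ("dobré ráno", "good morning"),
    ("dobré ráno", "good morning"),
    ("ráno", "morning"),
    ("dobré ráno", "a good morning"),
]

pair_counts = Counter(extracted_pairs)
e_counts = Counter(e for _, e in extracted_pairs)

def p_f_given_e(f_phrase, e_phrase):
    # count(f, e) / count(e), cf. Equation (16)
    return pair_counts[(f_phrase, e_phrase)] / e_counts[e_phrase]

print(p_f_given_e("dobré ráno", "good morning"))   # 2 / 2 = 1.0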

slide-21
SLIDE 21

Phrase-Based Features in Moses

Given parallel training corpus, phrases are extracted and scored:

in europa ||| in europe ||| 0.829007 0.207955 0.801493 0.492402 europas ||| in europe ||| 0.0251019 0.066211 0.0342506 0.0079563 in der europaeischen union ||| in europe ||| 0.018451 0.00100126 0.0319584 0.

The scores are (φ(·) = log p(·)):

  • phrase translation probabilities: φ_phr(f|e) and φ_phr(e|f)
  • lexical weighting: φ_lex(f|e) and φ_lex(e|f) (Koehn, 2003)

      φ_lex(f|e) = log max_{a ∈ alignments(f, e)} \prod_{i=1}^{|f|} ( 1/|{j | (i,j) ∈ a}| · \sum_{j: (i,j) ∈ a} p(f_i | e_j) )    (17)

December 2018 MT2: PBMT, NMT 20

slide-22
SLIDE 22

Other Features Used in PBMT

  • Word count/penalty: h_wp(e_1^I, ·, ·) = I
    ⇒ Do we prefer longer or shorter output?
  • Phrase count/penalty: h_pp(·, ·, s_1^K) = K
    ⇒ Do we prefer translation in more or fewer less-dependent bits?
  • Reordering model: different basic strategies (Lopez, 2009)
    ⇒ Which source spans can provide a continuation at a given moment?
  • n-gram LM:

      h_LM(·, e_1^I, ·) = log \prod_{i=1}^{I} p(e_i | e_{i-n+1}^{i-1})    (18)

    ⇒ Is the output n-gram-wise coherent?

December 2018 MT2: PBMT, NMT 21

slide-23
SLIDE 23

Decoding in Phrase-Based MT

[Figure: Koehn’s decoding example for the source “Maria no dio una bofetada a la bruja verde”: translation options per source span (Mary, not, did not, give, a slap, to the, green, witch, ...) and a graph of partial hypotheses, each recording the output so far (e:), the covered source words (f:) and the probability (p:), e.g. e: Mary, f: *--------, p: .534.]

  • 1. Collect translation options (all possible translations per span).
  • 2. Gradually expand partial hypotheses until all input covered.
  • 3. Prune less promising hypotheses.
  • 4. When all input covered, trace back the best path.

December 2018 MT2: PBMT, NMT 22
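The four steps above can be sketched as a tiny stack decoder. This is a deliberately simplified illustration under assumptions that go well beyond the slide: monotone order only, no LM or other non-local features, no hypothesis recombination, and a made-up phrase table; it only shows how hypotheses gradually extend their source coverage and how stacks are pruned.

import math

# Hypothetical phrase table: source phrase -> [(target phrase, log probability)]
phrase_table = {
    ("maria",): [("mary", 0.0)],
    ("no",): [("not", math.log(0.6)), ("did not", math.log(0.4))],
    ("dio", "una", "bofetada"): [("slapped", math.log(0.5)), ("gave a slap", math.log(0.5))],
    ("a", "la"): [("to the", 0.0)],
    ("bruja", "verde"): [("green witch", 0.0)],
}

def decode(source, beam_size=3):
    # stacks[k] holds partial hypotheses covering the first k source words
    stacks = [[] for _ in range(len(source) + 1)]
    stacks[0].append((0.0, []))                                  # (score, output phrases)
    for k in range(len(source)):
        stacks[k] = sorted(stacks[k], reverse=True)[:beam_size]  # prune less promising hypotheses
        for score, output in stacks[k]:
            for length in range(1, len(source) - k + 1):
                span = tuple(source[k:k + length])
                for target, logp in phrase_table.get(span, []):
                    stacks[k + length].append((score + logp, output + [target]))
    return " ".join(max(stacks[len(source)])[1])                 # trace back the best path

print(decode("maria no dio una bofetada a la bruja verde".split()))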

slide-24
SLIDE 24

Local and Non-Local Features

[Table: a worked scoring example for the input “Peter left for home .” and the hypothesis “Petr odešel domů .”: per-phrase values of the word penalty, bigram log probability, phrase penalty and phrase log probability, their per-feature totals, the feature weights, and the resulting weighted total score.]
  • Local features decompose along hypothesis construction.

– Phrase- and word-based features.

  • Non-local features span the boundaries (e.g. LM).

December 2018 MT2: PBMT, NMT 23

slide-25
SLIDE 25

Weight Optimization: MERT Loop

  • Minimum Error Rate Training (Och, 2003)

December 2018 MT2: PBMT, NMT 24

slide-26
SLIDE 26

Effects of Weights

  • Higher phrase penalty chops sentence into more segments.
  • Too strong LM weight leads to words dropped.
  • Negative LM weight leads to obscure wordings.

December 2018 MT2: PBMT, NMT 25

slide-27
SLIDE 27

Summary of PBMT

Phrase-based MT:

  • is a log-linear model
  • assumes phrases relatively independent of each other
  • decomposes sentence into contiguous phrases
  • search has two parts:

    – lookup of all relevant translation options
    – stack-based beam search, gradually expanding hypotheses

To train a PBMT system:

  • 1. Align words.
  • 2. Extract (and score) phrases consistent with word alignment.
  • 3. Optimize weights (MERT).

December 2018 MT2: PBMT, NMT 26

slide-28
SLIDE 28

1: Align Training Sentences

Nemám žádného psa. = I have no dog.    Viděl kočku. = He saw a cat.

December 2018 MT2: PBMT, NMT 27

slide-29
SLIDE 29

2: Align Words

Nemám žádného psa. = I have no dog.    Viděl kočku. = He saw a cat.

December 2018 MT2: PBMT, NMT 28

slide-30
SLIDE 30

3: Extract Phrase Pairs (MTUs)

Nemám žádného psa. = I have no dog.    Viděl kočku. = He saw a cat.

December 2018 MT2: PBMT, NMT 29

slide-31
SLIDE 31

4: New Input

Nemám žádného psa. = I have no dog.    Viděl kočku. = He saw a cat.    New input: Nemám kočku.

December 2018 MT2: PBMT, NMT 30

slide-32
SLIDE 32

4: New Input

Nemám žádného psa. = I have no dog.    Viděl kočku. = He saw a cat.    New input: Nemám kočku.

... I don't have a cat.

December 2018 MT2: PBMT, NMT 31

slide-33
SLIDE 33

5: Pick Probable Phrase Pairs (TM)

Nemám žádného psa. = I have no dog.    Viděl kočku. = He saw a cat.    New input: Nemám kočku.

... I don't have a cat.

New input: Nemám → I have

December 2018 MT2: PBMT, NMT 32

slide-34
SLIDE 34

6: So That n-Grams Probable (LM)

Nemám žádného psa. = I have no dog.    Viděl kočku. = He saw a cat.    New input: Nemám kočku.

... I don't have a cat.

New input: Nemám → I have    kočku. → a cat.

December 2018 MT2: PBMT, NMT 33

slide-35
SLIDE 35

Meaning Got Reversed!

Nemám žádného psa. = I have no dog.    Viděl kočku. = He saw a cat.    New input: Nemám kočku.

... I don't have a cat.

New input: Nemám → I have    kočku. → a cat.

December 2018 MT2: PBMT, NMT 34

slide-36
SLIDE 36

What Went Wrong?

  \hat{e}_1^{\hat{I}} = argmax_{I, e_1^I} p(f_1^J | e_1^I) p(e_1^I)
                      = argmax_{I, e_1^I} \prod_{(\tilde{f}, \tilde{e}) ∈ phrase pairs of (f_1^J, e_1^I)} p(\tilde{f} | \tilde{e}) · p(e_1^I)    (19)

  • Too strong a phrase-independence assumption.
    – Phrases do depend on each other. Here “nemám” and “žádného” jointly express one negation.
    – Word alignments ignored that dependence. But adding it would increase data sparseness.
  • The language model is a separate unit.
    – p(e_1^I) models the target sentence independently of f_1^J.

December 2018 MT2: PBMT, NMT 35

slide-37
SLIDE 37

Redefining p(e_1^I | f_1^J)

What if we modelled p(e_1^I | f_1^J) directly, word by word:

  p(e_1^I | f_1^J) = p(e_1, e_2, ..., e_I | f_1^J)
                   = p(e_1 | f_1^J) · p(e_2 | e_1, f_1^J) · p(e_3 | e_2, e_1, f_1^J) ...
                   = \prod_{i=1}^{I} p(e_i | e_1, ..., e_{i-1}, f_1^J)    (20)

... this is “just a cleverer language model”: p(e_1^I) = \prod_{i=1}^{I} p(e_i | e_1, ..., e_{i-1})

Main benefit: all dependencies are available. But what technical device can learn this?

December 2018 MT2: PBMT, NMT 36

slide-38
SLIDE 38

NNs: Universal Approximators

  • A neural network with a single hidden layer (possibly huge) can

approximate any continuous function to any precision.

  • (Nothing claimed about learnability.)

https://www.quora.com/How-can-a-deep-neural-network-with-ReLU-activations-in-its-hidden-layers-approximate-an

December 2018 MT2: PBMT, NMT 37

slide-39
SLIDE 39

playground.tensorflow.org

December 2018 MT2: PBMT, NMT 38

slide-40
SLIDE 40

Perfect Features

December 2018 MT2: PBMT, NMT 39

slide-41
SLIDE 41

Bad Features & Low Depth

December 2018 MT2: PBMT, NMT 40

slide-42
SLIDE 42

Too Complex NN Fails to Learn

December 2018 MT2: PBMT, NMT 41

slide-43
SLIDE 43

Deep NNs for Image Classification

December 2018 MT2: PBMT, NMT 42

slide-44
SLIDE 44

Representation Learning

  • Based on training data (sample inputs and expected outputs),
    the neural network learns by itself
    what is important in the inputs
    to predict the outputs best.

A “representation” is a new set of axes.

  • Instead of 3 dimensions (x, y, color), we get
  • 2000 dimensions: (elephantity, number of storks, blueness, . . . )
  • designed automatically to help in best prediction of the output

December 2018 MT2: PBMT, NMT 43

slide-45
SLIDE 45

One Layer tanh(Wx + b), 2D→2D

  Skew: W     Translate: b     Non-linearity: tanh

Animation by http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

December 2018 MT2: PBMT, NMT 44

slide-46
SLIDE 46

Four Layers, Disentangling Spirals

Animation by http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

December 2018 MT2: PBMT, NMT 45

slide-47
SLIDE 47

Processing Text with NNs

  • Map each word to a vector of 0s and 1s (“1-hot repr.”):

cat → (0, 0, . . . , 0, 1, 0, . . . , 0)

  • Sentence is then a matrix:

    [Figure: a vocabulary × sentence-length matrix for “the cat is on the mat”; each column is the 1-hot vector of one token, with a single 1 in the row of that word. Vocabulary size: 1.3M for English, 2.2M for Czech.]

Main drawback: No relations, all words equally close/far.

December 2018 MT2: PBMT, NMT 46
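A small sketch of the 1-hot representation above; the eight-word vocabulary is of course a toy stand-in for the millions of word forms mentioned on the slide.

import numpy as np

vocab = ["a", "about", "cat", "is", "mat", "on", "the", "zebra"]
word2id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[word2id[word]] = 1.0
    return v

# A sentence becomes a |vocabulary| x length matrix, one 1-hot column per token.
sentence = "the cat is on the mat".split()
matrix = np.stack([one_hot(w) for w in sentence], axis=1)
print(matrix.shape)   # (8, 6)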


slide-50
SLIDE 50

Solution: Word Embeddings

  • Map each word to a dense vector.
  • In practice 300–2000 dimensions are used, not 1–2M.

– The dimensions have no clear interpretation.

  • Embeddings are trained for each particular task.

– NNs: The matrix that maps 1-hot input to the first layer.

  • The famous word2vec (Mikolov et al., 2013):

    – CBOW: Predict the word from its four neighbours.
    – Skip-gram: Predict likely neighbours given the word.

[Figure: CBOW with a single-word context: input layer x_1 ... x_V, hidden layer h_1 ... h_N, output layer y_1 ... y_V, with weight matrices W_{V×N} = {w_ki} and W'_{N×V} = {w'_ij}.]

Right: CBOW with just a single-word context (http://www-personal.umich.edu/~ronxin/pdf/w2vexp.pdf)

December 2018 MT2: PBMT, NMT 49

slide-51
SLIDE 51

Continuous Space of Words

Word2vec embeddings show interesting properties: v(king) − v(man) + v(woman) ≈ v(queen) (21)

Illustrations from https://www.tensorflow.org/tutorials/word2vec

December 2018 MT2: PBMT, NMT 50
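The analogy in Equation (21) amounts to a nearest-neighbour search in the embedding space. The 3-dimensional vectors below are invented purely to make the arithmetic visible; real word2vec embeddings have hundreds of dimensions and are learned from large corpora.

import numpy as np

emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "cat":   np.array([0.1, 0.9, 0.5]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)   # "queen" for these toy vectors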

slide-52
SLIDE 52

Further Compression: Sub-Words

  • SMT struggled with productive morphology (>1M wordforms).

nejneobhospodařovávatelnějšími, Donaudampfschifffahrtsgesellschaftskapitän

  • NMT can handle only 30–80k dictionaries.

⇒ Resort to sub-word units.

  Orig:        český politik svezl migranty
  Syllables:   čes ký ⊔ po li tik ⊔ sve zl ⊔ mig ran ty
  Morphemes:   česk ý ⊔ politik ⊔ s vez l ⊔ migrant y
  Char Pairs:  če sk ý ⊔ po li ti k ⊔ sv ez l ⊔ mi gr an ty
  Chars:       č e s k ý ⊔ p o l i t i k ⊔ s v e z l ⊔ m i g r a n t y
  BPE 30k:     český politik s@@ vez@@ l mi@@ granty

BPE (Byte-Pair Encoding) uses the n most common substrings (incl. frequent words).

December 2018 MT2: PBMT, NMT 51
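A minimal sketch of how BPE merges are learned, in the spirit of the byte-pair encoding procedure referenced above; the four-word toy vocabulary and the number of merges are made up, and real implementations learn tens of thousands of merges from the full training corpus.

import re
from collections import Counter

def pair_stats(vocab):
    # count adjacent symbol pairs over a {space-separated word: frequency} vocabulary
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # replace every occurrence of the symbol pair by its concatenation
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words are sequences of characters plus an end-of-word marker </w>.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for _ in range(8):                       # learn 8 merges
    stats = pair_stats(vocab)
    if not stats:
        break
    best = max(stats, key=stats.get)
    vocab = merge_pair(best, vocab)
    print("merged:", best)

Frequent character sequences (and eventually whole frequent words) become single symbols, so a fixed sub-word vocabulary can cover an open set of word forms.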

slide-53
SLIDE 53

Variable-Length Inputs

Variable-length input can be handled by recurrent NNs:

  • Reading one input symbol at a time.

– The same (trained) transformation A used every time.

  • Unroll in time (up to a fixed length limit).

Vanilla RNN:  h_t = tanh(W [h_{t−1}; x_t] + b)    (22)

December 2018 MT2: PBMT, NMT 52
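A numpy sketch of the vanilla RNN step in Equation (22), unrolled over a short random input sequence; the dimensions and random parameters are arbitrary illustrations, not trained values.

import numpy as np

hidden_dim, emb_dim = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(hidden_dim, hidden_dim + emb_dim))
b = np.zeros(hidden_dim)

def rnn_step(h_prev, x_t):
    # h_t = tanh(W [h_{t-1}; x_t] + b), cf. Equation (22)
    return np.tanh(W @ np.concatenate([h_prev, x_t]) + b)

h = np.zeros(hidden_dim)                      # initial state
for x_t in rng.normal(size=(5, emb_dim)):     # unroll over 5 input symbols
    h = rnn_step(h, x_t)                      # the same transformation applied every time
print(h)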

slide-54
SLIDE 54

Neural Language Model

  • estimate probability of a sentence using the chain rule
  • output distributions can be used for sampling

Thanks to Jindřich Libovický for the slides.

December 2018 MT2: PBMT, NMT 53

slide-55
SLIDE 55

Sampling from a LM

  • “Autoregressive decoder” = conditioned on its preceding output.

December 2018 MT2: PBMT, NMT 54

slide-56
SLIDE 56

Autoregressive Decoding

# Greedy autoregressive decoding (pseudocode); dec_cell, target_embeddings,
# output_projection and the initial decoder state are assumed to be defined elsewhere.
last_w = "<s>"
while last_w != "</s>":
    last_w_embedding = target_embeddings[last_w]
    state, dec_output = dec_cell(state, last_w_embedding)
    logits = output_projection(dec_output)
    last_w = np.argmax(logits)   # index of the most probable next word
    yield last_w

December 2018 MT2: PBMT, NMT 55

slide-57
SLIDE 57

RNN Training vs. Runtime

runtime: ŷ_j (decoded)   ×   training: y_j (ground truth)

[Figure: an unrolled encoder-decoder; during training the decoder is fed the ground-truth previous word y_j and a loss is computed against the reference, while at runtime it is fed its own previous prediction ŷ_j.]

December 2018 MT2: PBMT, NMT 56

slide-58
SLIDE 58

NNs as Translation Model in SMT

Cho et al. (2014) proposed:

  • encoder-decoder architecture and
  • GRU unit (name given later by Chung et al. (2014))
  • to score variable-length phrase pairs in PBMT.

[Figure: an RNN encoder reads x_1 ... x_T into a fixed-size vector c; the decoder generates y_1 ... y_T' conditioned on c.]

December 2018 MT2: PBMT, NMT 57

slide-59
SLIDE 59

⇒ Embeddings of Phrases

December 2018 MT2: PBMT, NMT 58

slide-60
SLIDE 60

⇒ Syntactic Similarity (“of the”)

December 2018 MT2: PBMT, NMT 59

slide-61
SLIDE 61

⇒ Semantic Similarity (Countries)

December 2018 MT2: PBMT, NMT 60

slide-62
SLIDE 62

NMT: Sequence to Sequence

Sutskever et al. (2014) use:

  • LSTM RNN encoder-decoder
  • to consume and produce variable-length sentences.

First the Encoder:

December 2018 MT2: PBMT, NMT 61

slide-63
SLIDE 63

Then the Decoder

Remember:  p(e_1^I | f_1^J) = p(e_1 | f_1^J) · p(e_2 | e_1, f_1^J) · p(e_3 | e_2, e_1, f_1^J) ...

  • Again RNN, producing one word at a time.
  • The produced word fed back into the network.

– (Word embeddings in the target language used here.)

December 2018 MT2: PBMT, NMT 62

slide-64
SLIDE 64

Encoder-Decoder Architecture

https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/

December 2018 MT2: PBMT, NMT 63

slide-65
SLIDE 65

Continuous Space of Sentences

[Figure: 2-D PCA projection with two clusters of sentence vectors: “I gave her a card in the garden” / “In the garden , I gave her a card” / “She was given a card by me in the garden” vs. “She gave me a card in the garden” / “In the garden , she gave me a card” / “I was given a card by her in the garden”.]

2-D PCA projection of 8000-D space representing sentences (Sutskever et al., 2014). December 2018 MT2: PBMT, NMT 64

slide-66
SLIDE 66

Architectures in the Decoder

  • RNN – original sequence-to-sequence learning (2015)

    – principle known since 2014 (University of Montreal)
    – made usable in 2016 (University of Edinburgh)

  • CNN – convolution sequence-to-sequence by Facebook (2017)
  • Self-attention (so called Transformer) by Google (2017)

December 2018 MT2: PBMT, NMT 65

slide-67
SLIDE 67

Attention (1/3)

  • Arbitrary-length sentences fit badly into a fixed vector.
  • Reading input backward works better.

    . . . because early words will be more salient.

⇒ Use a bi-directional RNN and “attend” to all states h_i.

December 2018 MT2: PBMT, NMT 66

slide-68
SLIDE 68

Attention (2/3)

  • Add a sub-network predicting the importance of source states at each step.

December 2018 MT2: PBMT, NMT 67

slide-69
SLIDE 69

Attention (3/3)

[Figure: the decoder state s_{i−1} attends over bidirectional encoder states h_0 ... h_4 (for inputs <s> x_1 ... x_4); the states are weighted by attention weights α_1 ... α_4 and summed into a context vector used to produce ~y_i and the next state s_i.]

December 2018 MT2: PBMT, NMT 68

slide-70
SLIDE 70

Attention Model in Equations (1)

Inputs:
  decoder state s_i, encoder states h_j = [→h_j; ←h_j]   ∀j = 1 ... T_x

Attention energies:      e_ij = v_a^T tanh(W_a s_{i−1} + U_a h_j + b_a)

Attention distribution:  α_ij = exp(e_ij) / \sum_{k=1}^{T_x} exp(e_ik)

Context vector:          c_i = \sum_{j=1}^{T_x} α_ij h_j

December 2018 MT2: PBMT, NMT 69

slide-71
SLIDE 71

Attention Model in Equations (2)

Output projection:    t_i = MLP(U_o s_{i−1} + V_o E y_{i−1} + C_o c_i + b_o)
                      ... the attention is mixed with the hidden state

Output distribution:  p(y_i = k | s_i, y_{i−1}, c_i) ∝ exp((W_o t_i)_k + b_k)

December 2018 MT2: PBMT, NMT 70
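A numpy sketch of the attention equations on the two slides above (energies, softmax distribution and context vector); all dimensions and parameter values are random placeholders rather than trained model weights.

import numpy as np

rng = np.random.default_rng(1)
Tx, enc_dim, dec_dim, att_dim = 5, 6, 4, 3

h = rng.normal(size=(Tx, enc_dim))        # encoder states h_1 .. h_Tx
s_prev = rng.normal(size=dec_dim)         # previous decoder state s_{i-1}

W_a = rng.normal(size=(att_dim, dec_dim))
U_a = rng.normal(size=(att_dim, enc_dim))
b_a = np.zeros(att_dim)
v_a = rng.normal(size=att_dim)

# e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j + b_a)
energies = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j + b_a) for h_j in h])

# alpha_ij = softmax of the energies over the source positions
alphas = np.exp(energies) / np.exp(energies).sum()

# c_i = sum_j alpha_ij h_j
context = alphas @ h
print(alphas.round(3), context.round(3))

Collecting the alphas for every decoder step gives the attention matrix shown on the next slide, which often resembles a word alignment.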

slide-72
SLIDE 72

Attention ≈ Alignment

  • We can collect the attention across time.
  • Each column corresponds to one decoder time step.
  • Source tokens correspond to rows.

December 2018 MT2: PBMT, NMT 71

slide-73
SLIDE 73

Ultimate Goal of SMT vs. NMT

Goal of “classical” SMT: Find minimum translation units ∼ graph partitions:

  • such that they are frequent across many sentence pairs.
  • without imposing (too hard) constraints on reordering.
  • in an unsupervised fashion.

Goal of neural MT: Avoid minimum translation units. Find NN architecture that

  • Reads input in as original form as possible.
  • Produces output in as final form as possible.
  • Can be optimized end-to-end in practice.

December 2018 MT2: PBMT, NMT 72

slide-74
SLIDE 74

Is NMT That Much Better?

The outputs of this year’s best system: http://matrix.statmt.org/

SRC  A 28-year-old chef who had recently moved to San Francisco was found dead in the stairwell of a local mall this week.
MT   Osmadvacetiletý kuchař, který se nedávno přestěhoval do San Francisca, byl tento týden nalezen mrtvý na schodišti místního obchodního centra.
REF  Osmadvacetiletý šéfkuchař, který se nedávno přistěhoval do San Franciska, byl tento týden ∅ schodech místního obchodu.

SRC  There were creative differences on the set and a disagreement.
REF  Došlo ke vzniku kreativních rozdílů na scéně a k neshodám.
MT   Na place byly tvůrčí rozdíly a neshody.

December 2018 MT2: PBMT, NMT 73


slide-76
SLIDE 76

Luckily ;-) Bad Errors Happen

SRC ... said Frank initially stayed in hostels...
MT  ... řekl, že Frank původně zůstal v Budějovicích...
SRC Most of the Clintons’ income...
MT  Většinu příjmů Kliniky...
SRC The 63-year-old has now been made a special representative
MT  63letý mladík se nyní stal zvláštním zástupcem...
SRC He listened to the moving stories of the women.
MT  Naslouchal pohyblivým příběhům žen.

December 2018 MT2: PBMT, NMT 75

slide-77
SLIDE 77

Catastrophic Errors

SRC Criminal Minds star Thomas Gibson sacked after hitting producer
REF Thomas Gibson, hvězda seriálu Myšlenky zločince, byl propuštěn poté, co uhodil režiséra
MT  Kriminalisté Minsku hvězdu Thomase Gibsona vyhostili po zásahu producenta
SRC ...add to that its long-standing grudge...
REF ...přidejte k tomu svou dlouholetou nenávist...
MT  ...přidejte k tomu svou dlouholetou záštitu...   (grudge → zášť → záštita)

December 2018 MT2: PBMT, NMT 76

slide-78
SLIDE 78

German→Czech SMT vs. NMT

  • A smaller dataset, very first (but comparable) results.
  • NMT performs better on average, but occasionally:

SRC Das Spektakel ähnelt dem Eurovision Song Contest.
REF Je to jako pěvecká soutěž Eurovision.
SMT Podívanou připomíná hudební soutěž Eurovize.
NMT Divadlo se podobá Eurovizi Conview.

SRC Erderwärmung oder Zusammenstoß mit Killerasteroid.
REF Globální oteplení nebo kolize se zabijáckým asteroidem.
SMT Globální oteplování, nebo srážka s Killerasteroid.
NMT Globální oteplování, nebo střet s zabijákem.

SRC Zu viele verletzte Gefühle.
REF Příliš mnoho nepřátelských pocitů.
SMT Příliš mnoho zraněných pocity.
NMT Příliš mnoho zraněných ∅.

December 2018 MT2: PBMT, NMT 77

slide-79
SLIDE 79

Summary

  • What makes MT statistical.

Two crucially different models covered:

  • Phrase-based: contiguous but independent phrases.

    – Bayes’ Law as a special case of the Log-Linear Model.
    – Hand-crafted features (scoring functions); local vs. non-local.
    – Decoding as search, expanding partial hypotheses.

  • Neural: unit-less, continuous space.

    – NMT as a fancy Language Model.
    – Word embeddings, subwords.
    – RNNs for variable-length input and output.
    – Attention model.

December 2018 MT2: PBMT, NMT 78

slide-80
SLIDE 80

References

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar, October. Association for Computational Linguistics.

Junyoung Chung, Çağlar Gülçehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555.

Philipp Koehn. 2003. Noun Phrase Translation. Ph.D. thesis, University of Southern California.

Adam Lopez. 2009. Translation as weighted deduction. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 532–540, Athens, Greece, March. Association for Computational Linguistics.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Franz Josef Och. 2002. Statistical Machine Translation: From Single-Word Models to Alignment Templates. Ph.D. thesis, RWTH Aachen University.

Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proc. of the Association for Computational Linguistics, Sapporo, Japan, July 6-7.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

December 2018 MT2: PBMT, NMT 79