Machine Translation 2: Statistical MT: Phrase-Based and Neural
Ondřej Bojar (bojar@ufal.mff.cuni.cz), Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague
December 2017 MT2: PBMT, NMT
Source and target sentences:

f^J_1 = f_1 . . . f_j . . . f_J,
e^I_1 = e_1 . . . e_i . . . e_I.

The search for the best translation:

ê^Î_1 = argmax_{I, e^I_1} p(e^I_1 | f^J_1)

We stick to the e^I_1, f^J_1 notation despite translating from English to Czech.
Estimating p(e^I_1 | f^J_1) directly by relative frequency:

p(e^I_1 | f^J_1) = count(e^I_1, f^J_1) / count(f^J_1)

- Defined only for source sentences f^J_1 seen in the training data (TM).
- A proper probability: there is no e^I_1 s.t. p(e^I_1 | f^J_1) > 1.
- Hopelessly sparse in practice: whole sentences almost never repeat.
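The sentence-level relative-frequency estimate can be sketched directly from counts. The toy parallel corpus below is invented for illustration; it also shows why the estimate fails for any unseen source sentence:

```python
from collections import Counter

# Hypothetical toy parallel corpus: (source, target) sentence pairs.
corpus = [
    ("the cat", "kocka"),
    ("the cat", "ta kocka"),
    ("the dog", "pes"),
]

pair_counts = Counter(corpus)
src_counts = Counter(src for src, _ in corpus)

def p_direct(tgt, src):
    """Relative frequency p(e|f) = count(e, f) / count(f).
    Undefined (here: 0) for source sentences never seen in training."""
    if src_counts[src] == 0:
        return 0.0
    return pair_counts[(src, tgt)] / src_counts[src]

print(p_direct("kocka", "the cat"))   # 0.5: seen once out of two occurrences
print(p_direct("pes", "the dogs"))    # 0.0: unseen source sentence
```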
Applying Bayes' rule:

ê^Î_1 = argmax_{I, e^I_1} p(e^I_1 | f^J_1)
      = argmax_{I, e^I_1} p(f^J_1 | e^I_1) p(e^I_1) / p(f^J_1)
      = argmax_{I, e^I_1} p(f^J_1 | e^I_1) p(e^I_1)        . . . p(f^J_1) constant
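The noisy-channel argmax can be sketched by rescoring a candidate list. The candidates and their model probabilities below are invented for illustration; the point is that a fluent hypothesis can win despite a slightly worse translation-model score:

```python
# Hypothetical candidate translations with toy model probabilities:
# "tm" stands for p(f|e), "lm" for p(e).
candidates = {
    "the cat sat": {"tm": 0.4, "lm": 0.010},
    "cat the sat": {"tm": 0.5, "lm": 0.0001},  # better TM fit, terrible LM
}

def noisy_channel_best(cands):
    # argmax_e p(f|e) * p(e); p(f) is constant in e and can be dropped.
    return max(cands, key=lambda e: cands[e]["tm"] * cands[e]["lm"])

print(noisy_channel_best(candidates))  # "the cat sat"
```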
The noisy-channel model:

ê^Î_1 = argmax_{I, e^I_1} p(f^J_1 | e^I_1) p(e^I_1)

- p(f^J_1 | e^I_1) is the translation model (models e^I_1 → f^J_1),
- p(e^I_1) is the language model.
[Diagram: Input → Global Search for the sentence with the highest probability → Output; Parallel Texts train the Translation Model, Monolingual Texts train the Language Model.]
p(e^I_1) should report how "good" the sentence e^I_1 is.

Trigram example (with sentence-boundary markers):

p(The cat was black .) = p(The | <s> <s>) · p(cat | <s> The) · p(was | The cat) · p(black | cat was) · p(. | was black) · p(</s> | black .)

n-gram approximation:

p(e^I_1) = ∏_{i=1}^I p(e_i | e^{i-1}_{i-n+1})

Maximum-likelihood estimates:

p(w_1) = count(w_1) / total words observed
p(w_2 | w_1) = count(w_1 w_2) / count(w_1)

Problem: a single unseen n-gram zeroes the whole product:

p(e^I_1) = · · · · 0 · · · · = 0
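A minimal bigram language model makes both the MLE estimates and the zero-probability problem concrete. The two training sentences are invented for illustration:

```python
from collections import Counter

# Toy training data with sentence-boundary markers (hypothetical example).
sents = [["<s>", "the", "cat", "was", "black", ".", "</s>"],
         ["<s>", "the", "dog", "was", "black", ".", "</s>"]]

unigrams = Counter(w for s in sents for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in sents for i in range(len(s) - 1))

def p_bigram(w2, w1):
    # MLE: p(w2|w1) = count(w1 w2) / count(w1); zero for unseen bigrams.
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

def p_sentence(sent):
    p = 1.0
    for w1, w2 in zip(sent, sent[1:]):
        p *= p_bigram(w2, w1)   # one unseen bigram zeroes the whole product
    return p

print(p_sentence(["<s>", "the", "cat", "was", "black", ".", "</s>"]))  # 0.5
print(p_sentence(["<s>", "the", "cat", "barked", ".", "</s>"]))        # 0.0
```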
Ad-hoc variants of the search weight the language model differently:

ê^Î_1 = argmax_{I, e^I_1} p(f^J_1 | e^I_1) (p(e^I_1))^2

ê^Î_1 = argmax_{I, e^I_1} p(e^I_1 | f^J_1) p(e^I_1)
p(e^I_1 | f^J_1) is modelled as a weighted combination of models ("features"):

p(e^I_1 | f^J_1) = exp(∑_{m=1}^M λ_m h_m(e^I_1, f^J_1)) / ∑_{e'^{I'}_1} exp(∑_{m=1}^M λ_m h_m(e'^{I'}_1, f^J_1))

Example feature, the n-gram language model:

h_LM(f^J_1, e^I_1) = log ∏_{i=1}^I p(e_i | e^{i-1}_{i-n+1})

The weights λ^M_1 specify the relative importance of the features.
The normalization term does not depend on e^I_1, so it can be dropped from the search:

ê^Î_1 = argmax_{I, e^I_1} exp(∑_{m=1}^M λ_m h_m(e^I_1, f^J_1)) / ∑_{e'^{I'}_1} exp(∑_{m=1}^M λ_m h_m(e'^{I'}_1, f^J_1))
      = argmax_{I, e^I_1} exp(∑_{m=1}^M λ_m h_m(e^I_1, f^J_1))
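The unnormalized log-linear search can be sketched in a few lines. The two candidates and their feature values are invented; the second call shows how changing the weights λ changes the winner:

```python
import math

def best(candidates, lambdas):
    # argmax over the unnormalized log-linear score sum_m lambda_m * h_m;
    # the normalizer is constant in e and can be dropped.
    def score(h):
        return sum(l * hm for l, hm in zip(lambdas, h))
    return max(candidates, key=lambda e: score(candidates[e]))

# Hypothetical candidates with features h = (log p_TM, log p_LM).
candidates = {
    "the cat sat": [math.log(0.2), math.log(0.010)],
    "cat the sat": [math.log(0.5), math.log(0.0001)],
}

print(best(candidates, [1.0, 1.0]))  # both features: "the cat sat"
print(best(candidates, [1.0, 0.0]))  # TM only: "cat the sat"
```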
With just two features,

h_TM(e^I_1, f^J_1) = log p(f^J_1 | e^I_1) for the translation model,
h_LM(e^I_1, f^J_1) = log p(e^I_1) for the language model,

and unit weights, the log-linear model reduces to the noisy channel:

ê^Î_1 = argmax_{I, e^I_1} exp(∑_{m=1}^M λ_m h_m(e^I_1, f^J_1))
      = argmax_{I, e^I_1} exp(h_TM(e^I_1, f^J_1) + h_LM(e^I_1, f^J_1))
      = argmax_{I, e^I_1} exp(log p(f^J_1 | e^I_1) + log p(e^I_1))
      = argmax_{I, e^I_1} p(f^J_1 | e^I_1) p(e^I_1)
This time around = Nyní
they 're moving = zareagovaly
even = dokonce ještě
. . . = . . .
This time around, they 're moving = Nyní zareagovaly
even faster = dokonce ještě rychleji
. . . = . . .
Segment f^J_1 into K phrases f̃^K_1. The segmentation s^K_1 is a hidden variable in the maximization, so we should be summing over all segmentations (note the three arguments in h_m(·, ·, ·) now):

ê^Î_1 = argmax_{I, e^I_1} ∑_{s^K_1} exp(∑_{m=1}^M λ_m h_m(e^I_1, f^J_1, s^K_1))    (13)

In practice, the sum is approximated by the maximum:

ê^Î_1 = argmax_{I, e^I_1} max_{s^K_1} exp(∑_{m=1}^M λ_m h_m(e^I_1, f^J_1, s^K_1))    (14)
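The hidden segmentation s^K_1 ranges over all ways to cut the source into contiguous phrases; a short enumeration sketch (a J-word sentence has 2^(J-1) segmentations):

```python
def segmentations(words):
    # Enumerate all ways to split a word sequence into contiguous phrases.
    if not words:
        yield []
        return
    for k in range(1, len(words) + 1):
        head = tuple(words[:k])
        for rest in segmentations(words[k:]):
            yield [head] + rest

segs = list(segmentations(["this", "time", "around"]))
print(len(segs))  # 4 segmentations of a 3-word sentence (2^(3-1))
```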
The phrase translation feature:

h_TM(f^J_1, e^I_1, s^K_1) = log ∏_{k=1}^K p(f̃_k | ẽ_k)

where p(f̃ | ẽ) = count(f̃, ẽ) / count(ẽ):

- count(f̃, ẽ) is the number of co-occurrences of a phrase pair (f̃, ẽ) that are consistent with the word alignment,
- count(ẽ) is the number of occurrences of the target phrase ẽ in the training corpus.

Both directions, p(f̃_k | ẽ_k) and p(ẽ_k | f̃_k), are used as features.
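Both phrase probabilities follow directly from relative frequencies over extracted phrase pairs. The pairs below are a hypothetical toy extraction, not real counts:

```python
from collections import Counter

# Hypothetical extracted phrase pairs (f~, e~), assumed alignment-consistent.
pairs = [("in europa", "in europe"), ("in europa", "in europe"),
         ("in europa", "across europe"), ("europas", "in europe")]

pair_counts = Counter(pairs)
e_counts = Counter(e for _, e in pairs)
f_counts = Counter(f for f, _ in pairs)

def p_f_given_e(f, e):
    # p(f~|e~) = count(f~, e~) / count(e~)
    return pair_counts[(f, e)] / e_counts[e]

def p_e_given_f(e, f):
    # p(e~|f~) = count(f~, e~) / count(f~)
    return pair_counts[(f, e)] / f_counts[f]

print(p_f_given_e("in europa", "in europe"))  # 2/3
print(p_e_given_f("in europe", "in europa"))  # 2/3
```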
Phrase table excerpt (German→English):

in europa ||| in europe ||| 0.829007 0.207955 0.801493 0.492402
europas ||| in europe ||| 0.0251019 0.066211 0.0342506 0.0079563
in der europaeischen union ||| in europe ||| 0.018451 0.00100126 0.0319584 0.

(The additional scores are lexical weights, computed from the word alignments a ∈ alignments and the phrase lengths |f|.)
Further common features:

- word penalty: h_WP(e^I_1, ·, ·) = I (the target length),
- phrase penalty: h_PP(·, ·, s^K_1) = K (the number of phrases),
- language model: h_LM(e^I_1, ·, ·) = log ∏_{i=1}^I p(e_i | e^{i-1}_{i-n+1}).
PBMT in summary:

ê^Î_1 = argmax_{I, e^I_1} p(f^J_1 | e^I_1) p(e^I_1)
      = argmax_{I, e^I_1} ∏_{(f̃, ẽ) ∈ phrase pairs of f^J_1, e^I_1} p(f̃ | ẽ) · p(e^I_1)

p(e^I_1) models the target sentence independently of f^J_1.
NMT models p(e^I_1 | f^J_1) directly, word by word:

p(e^I_1 | f^J_1) = p(e_1, e_2, . . . , e_I | f^J_1)
                 = p(e_1 | f^J_1) · p(e_2 | e_1, f^J_1) · p(e_3 | e_2, e_1, f^J_1) · . . .
                 = ∏_{i=1}^I p(e_i | e^{i-1}_1, f^J_1)

Compare with the language model: p(e^I_1) = ∏_{i=1}^I p(e_i | e_1, . . . , e_{i-1}).
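The chain-rule factorization means the sentence probability is just the product of the per-step conditionals a decoder emits. The step probabilities below are invented numbers for illustration; summing logs avoids underflow on long sentences:

```python
import math

# Hypothetical per-step conditionals p(e_i | e_1..e_{i-1}, f^J_1)
# as a decoder would emit them; fixed numbers for illustration.
step_probs = [0.9, 0.8, 0.95]   # p(e1|f), p(e2|e1,f), p(e3|e2,e1,f)

# Sentence probability = product of the conditionals (via log-space sum).
log_p = sum(math.log(p) for p in step_probs)
print(round(math.exp(log_p), 4))  # 0.684 = 0.9 * 0.8 * 0.95
```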
https://www.quora.com/How-can-a-deep-neural-network-with-ReLU-activations-in-its-hidden-layers-approximate-an
Animation by http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
One-hot representation of words:

cat → (0, 0, . . . , 0, 1, 0, . . . , 0)

[Figure: the words of "the cat is . . . the mat" encoded as a matrix of one-hot columns over the vocabulary (a, about, . . . , cat, . . . , is, . . . , the, . . . , zebra).]

Vocabulary size: 1.3M English, 2.2M Czech.
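A one-hot vector is trivial to construct; the tiny vocabulary below is hypothetical (real MT vocabularies run to millions of types, which is exactly why dense embeddings are needed):

```python
# Hypothetical small vocabulary for illustration.
vocab = ["a", "about", "cat", "is", "mat", "the", "zebra"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # A vector of zeros with a single 1 at the word's vocabulary index.
    v = [0] * len(vocab)
    v[index[word]] = 1
    return v

print(one_hot("cat"))  # [0, 0, 1, 0, 0, 0, 0]
```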
word2vec training objectives:
- CBOW: predict the word from its four neighbours.
- Skip-gram: predict likely neighbours given the word.

[Figure: a one-hidden-layer network with input layer x_1 . . . x_V, hidden layer h_1 . . . h_N, and output layer y_1 . . . y_V, connected by weight matrices W_{V×N} = {w_{ki}} and W'_{N×V} = {w'_{ij}}.]

Right: CBOW with just a single-word context (http://www-personal.umich.edu/~ronxin/pdf/w2vexp.pdf)
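Skip-gram training data is just (center word, context word) pairs harvested from a window; a minimal sketch (window size 1, toy sentence):

```python
def skipgram_pairs(tokens, window=1):
    # Training pairs for skip-gram: predict neighbours from the center word.
    pairs = []
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                pairs.append((w, tokens[j]))
    return pairs

print(skipgram_pairs(["the", "cat", "sat"]))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```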
Illustrations from https://www.tensorflow.org/tutorials/word2vec
[Figure: sequence-to-sequence model. The Encoder reads x_1, x_2, . . . , x_T into a fixed-size vector c; the Decoder generates y_1, y_2, . . . , y_{T'} from c.]
nejneobhodpodařovávatelnějšími, Donaudampfschifffahrtsgesellschaftskapitän

BPE (Byte-Pair Encoding) uses the n most common substrings (incl. frequent words).
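The core of BPE is repeatedly merging the most frequent adjacent symbol pair. A simplified sketch over three toy word forms (the real algorithm works on frequency-weighted word types with end-of-word markers):

```python
from collections import Counter

def bpe_merges(words, n_merges):
    """Toy byte-pair encoding: repeatedly merge the most frequent
    adjacent symbol pair. Simplified sketch of the real algorithm."""
    seqs = [list(w) for w in words]
    merges = []
    for _ in range(n_merges):
        pairs = Counter((s[i], s[i + 1]) for s in seqs for i in range(len(s) - 1))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append(a + b)
        for s in seqs:
            i = 0
            while i < len(s) - 1:
                if s[i] == a and s[i + 1] == b:
                    s[i:i + 2] = [a + b]   # replace the pair in place
                else:
                    i += 1
    return merges, seqs

merges, seqs = bpe_merges(["lower", "lowest", "low"], 3)
print(merges)  # learned subword merges, e.g. growing "lo" -> "low" -> ...
```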
p(e^I_1 | f^J_1) = p(e_1 | f^J_1) · p(e_2 | e_1, f^J_1) · p(e_3 | e_2, e_1, f^J_1) · . . .
https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/
[Figure: 2-D PCA projection of the 8000-D space representing sentences (Sutskever et al., 2014). Sentences with the same meaning cluster together regardless of word order:]

I gave her a card in the garden
In the garden , I gave her a card
She was given a card by me in the garden
She gave me a card in the garden
In the garden , she gave me a card
I was given a card by her in the garden
SRC Das Spektakel ähnelt dem Eurovision Song Contest. ("The spectacle resembles the Eurovision Song Contest.")
REF Je to jako pěvecká soutěž Eurovision.
SMT Podívanou připomíná hudební soutěž Eurovize.
NMT Divadlo se podobá Eurovizi Conview.

SRC Erderwärmung oder Zusammenstoß mit Killerasteroid. ("Global warming or collision with a killer asteroid.")
REF Globální oteplení nebo kolize se zabijáckým asteroidem.
SMT Globální oteplování, nebo srážka s Killerasteroid.
NMT Globální oteplování, nebo střet s zabijákem.

SRC Zu viele verletzte Gefühle. ("Too many hurt feelings.")
REF Příliš mnoho nepřátelských pocitů.
SMT Příliš mnoho zraněných pocity.
NMT Příliš mnoho zraněných.
References

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of EMNLP, pages 1724-1734, Doha, Qatar, October. Association for Computational Linguistics.
Junyoung Chung, Çağlar Gülçehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780, November.
Philipp Koehn. 2003. Noun Phrase Translation. Ph.D. thesis, University of Southern California.
Adam Lopez. 2009. Translation as weighted deduction. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 532-540, Athens, Greece, March. Association for Computational Linguistics.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.
Franz Josef Och. 2002. Statistical Machine Translation: From Single-Word Models to Alignment Templates. Ph.D. thesis, RWTH Aachen University.
Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proceedings of ACL, Sapporo, Japan, July.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems, pages 3104-3112.