An Introduction to Neural Machine Translation
- Prof. John D. Kelleher
@johndkelleher
ADAPT Centre for Digital Content Technology Dublin Institute of Technology, Ireland
June 25, 2018
1 / 57
Outline
◮ The Neural Machine Translation Revolution
◮ Neural Networks 101
◮ Word Embeddings
◮ Language Models
◮ Neural Language Models
◮ Neural Machine Translation
◮ Beyond NMT: Image Annotation
Image from https://blogs.msdn.microsoft.com/translation/
Image from https://www.blog.google/products/translate
Image from https://techcrunch.com/2017/08/03/facebook-finishes-its-move-to-neural-machine-translation/
Image from https://slator.com/technology/linguees-founder-launches-deepl-attempt-challenge-google-translate/
7 / 57
A function maps a set of inputs (numbers) to an output (number).¹

sum(2, 5, 4) → 11
¹ This introduction to neural networks and machine translation is based on Kelleher (2016).
8 / 57
weightedSum([x1, x2, . . . , xm], [w1, w2, . . . , wm]) = (x1 × w1) + (x2 × w2) + · · · + (xm × wm)

weightedSum([3, 9], [−3, 1]) = (3 × −3) + (9 × 1) = −9 + 9 = 0
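A minimal Python sketch of this function (the code is illustrative, not from the talk's materials):

# weightedSum: multiply each input by its weight and add up the results.
def weighted_sum(inputs, weights):
    return sum(x * w for x, w in zip(inputs, weights))

print(weighted_sum([3, 9], [-3, 1]))  # (3 × −3) + (9 × 1) = 0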
9 / 57
An activation function takes the output of our weightedSum function and applies another mapping to it.
10 / 57
Figure: Plots of common activation functions.

logistic(z) = 1 / (1 + e^−z)
tanh(z) = (e^z − e^−z) / (e^z + e^−z)
rectifier(z) = max(0, z)
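The three activation functions above can be sketched in Python using only the standard math module (a minimal illustration):

import math

def logistic(z):
    # logistic(z) = 1 / (1 + e^−z); squashes z into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    # tanh(z) = (e^z − e^−z) / (e^z + e^−z); squashes z into (−1, 1)
    return math.tanh(z)

def rectifier(z):
    # rectifier(z) = max(0, z); passes positive values, zeroes out negatives
    return max(0.0, z)

print(logistic(0))    # 0.5
print(tanh(0))        # 0.0
print(rectifier(-3))  # 0.0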
11 / 57
activation = logistic(weightedSum([x1, x2, . . . , xm], [w1, w2, . . . , wm]))

logistic(weightedSum([3, 9], [−3, 1])) = logistic((3 × −3) + (9 × 1)) = logistic(−9 + 9) = logistic(0) = 0.5
12 / 57
The simple list of operations that we have just described defines the fundamental building block of a neural network: the neuron.

Neuron = activation(weightedSum([x1, x2, . . . , xm], [w1, w2, . . . , wm]))
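Putting the pieces together, a neuron can be sketched by composing the weighted_sum and logistic functions from the sketches above:

def neuron(inputs, weights, activation):
    # A neuron: an activation function applied to a weighted sum.
    return activation(weighted_sum(inputs, weights))

print(neuron([3, 9], [-3, 1], logistic))  # logistic(0) = 0.5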
13 / 57
14 / 57
Figure: A deep feed-forward network: an input layer, hidden layers 1–3, and an output layer.
15 / 57
◮ We train a neural network by iteratively updating the weights
◮ We start by randomly assigning weights to each edge
◮ We then show the network examples of inputs and expected outputs, and update the weights so that the network outputs match the expected outputs
◮ We keep updating the weights until the network is working the way we want (see the sketch below)
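A toy sketch of this loop, reusing the neuron, weighted_sum, and logistic sketches above; the example data and the simple error-driven learning rule are invented for illustration (real networks are trained with backpropagation):

import random

examples = [([3, 9], 0.9), ([2, -1], 0.1)]  # invented (inputs, expected-output) pairs
weights = [random.uniform(-1, 1), random.uniform(-1, 1)]  # random starting weights
learning_rate = 0.1

for _ in range(1000):  # keep showing the network the examples
    for inputs, target in examples:
        error = target - neuron(inputs, weights, logistic)
        # Nudge each weight in the direction that reduces the error.
        weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]

print([round(neuron(i, weights, logistic), 2) for i, _ in examples])  # ≈ [0.9, 0.1]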
16 / 57
17 / 57
◮ Language is sequential and has lots of words.
18 / 57
"You shall know a word by the company it keeps." — Firth, 1957
19 / 57
20 / 57
Each word is represented by a vector of numbers that positions the word in a multi-dimensional space, e.g.:

king = < 55, −10, 176, 27 >
man = < 10, 79, 150, 83 >
woman = < 15, 74, 159, 106 >
queen = < 60, −15, 185, 50 >
21 / 57
vec(King) − vec(Man) + vec(Woman) ≈ vec(Queen)²
² Linguistic Regularities in Continuous Space Word Representations, Mikolov et al. (2013)
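A quick check of this regularity, using the toy word vectors from the previous slide:

# Toy word vectors from the earlier slide.
king  = [55, -10, 176, 27]
man   = [10,  79, 150, 83]
woman = [15,  74, 159, 106]
queen = [60, -15, 185, 50]

# king − man + woman, computed element-wise.
result = [k - m + w for k, m, w in zip(king, man, woman)]
print(result)                # [60, -15, 185, 50]
print(result == queen)       # True: the analogy holds exactly for these toy vectors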
22 / 57
23 / 57
◮ Language is sequential and has lots of words.
24 / 57
25 / 57
27 / 57
Figure: A probability distribution over the letters a–z.
◮ A language model can compute:
1. the probability of a sequence of words: P(w1, w2, . . . , wn)
2. the probability of the next word given the preceding words: P(wn | w1, . . . , wn−1)³

³ We can go from 1. to 2. using the Chain Rule of Probability: P(w1, w2, w3) = P(w1) × P(w2|w1) × P(w3|w1, w2)
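A worked toy example of the Chain Rule decomposition; the probabilities here are invented for illustration:

# P(w1, w2, w3) = P(w1) × P(w2|w1) × P(w3|w1, w2)
p_w1       = 0.10   # e.g. P("the")
p_w2_given = 0.20   # e.g. P("cat" | "the")
p_w3_given = 0.05   # e.g. P("sat" | "the cat")

p_sentence = p_w1 * p_w2_given * p_w3_given
print(p_sentence)   # 0.001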
29 / 57
◮ Language models are useful for machine translation
⁴ Unless it's Yoda that's speaking
30 / 57
31 / 57
32 / 57
33 / 57
Figure: A feed-forward network: input layer, hidden layer, output layer.
34 / 57
Figure: Adding recurrence step by step: at each time step the hidden layer's activations are copied into a buffer layer, and the buffer is fed back into the hidden layer alongside the next input (built up over a sequence of slides).
41 / 57
Figure: Recurrent Neural Network

ht = φ((Whh · ht−1) + (Wxh · xt))
yt = φ(Why · ht)
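A minimal numpy sketch of one recurrent step matching these equations; the layer sizes, random weights, and tanh choice of φ are placeholders:

import numpy as np

def rnn_step(x_t, h_prev, W_hh, W_xh, W_hy):
    # h_t = φ((W_hh · h_{t−1}) + (W_xh · x_t)), with φ = tanh here
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    # y_t = φ(W_hy · h_t)
    y_t = np.tanh(W_hy @ h_t)
    return h_t, y_t

# Placeholder sizes: 4-dimensional inputs, 3-dimensional hidden state, 2 outputs.
rng = np.random.default_rng(0)
W_hh = rng.normal(size=(3, 3))
W_xh = rng.normal(size=(3, 4))
W_hy = rng.normal(size=(2, 3))

h = np.zeros(3)
for x in rng.normal(size=(5, 4)):   # a sequence of 5 input vectors
    h, y = rnn_step(x, h, W_hh, W_xh, W_hy)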
42 / 57
Figure: RNN Unrolled Through Time
43 / 57
Figure: An RNN unrolled over a word sequence, predicting the next word (∗Word2, ∗Word3, . . . ) at each step.
44 / 57
PANDARUS: Alas, I think he shall be come approached and the day When little srain would be attain’d into being never fed, And who is but a chain and subjects of his death, I should not sleep. Second Senator: They are away this miseries, produced upon my soul, Breaking and strongly should be buried, when I perish The earth and thoughts of many states. DUKE VINCENTIO: Well, your wit is in the care of side and that.
From: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
45 / 57
46 / 57
47 / 57
Figure: Using an RNN to Generate an Encoding of a Word Sequence
48 / 57
Figure: RNN Language Model Unrolled Through Time
49 / 57
Figure: Using an RNN Language Model to Generate (Hallucinate) a Word Sequence
50 / 57
Figure: Sequence to Sequence Translation using an Encoder-Decoder Architecture
51 / 57
Source: "Life is beautiful <eos>" → context vector C → Target: "La vie est belle <eos>"
Figure: Example Translation using an Encoder-Decoder Architecture
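A schematic Python sketch of the encoder-decoder loop in this figure; encoder_step and decoder_step are hypothetical stand-ins for trained RNN cells, and the toy vocabulary below exists only so the sketch runs:

def translate(source_words, encoder_step, decoder_step, h0=None, eos="<eos>"):
    # Encode: fold the source sentence (plus <eos>) into a context vector C.
    h = h0
    for word in source_words + [eos]:
        h = encoder_step(word, h)
    context = h  # "C" in the figure

    # Decode: generate target words one at a time until <eos> (or a length cap).
    target, word, d = [], eos, context
    while len(target) < 50:
        word, d = decoder_step(word, d)
        if word == eos:
            break
        target.append(word)
    return target

# Toy stand-ins; a real system uses trained encoder and decoder networks.
toy_next = {"<eos>": "La", "La": "vie", "vie": "est", "est": "belle", "belle": "<eos>"}
encoder_step = lambda word, h: h                  # placeholder: ignores its input
decoder_step = lambda word, d: (toy_next[word], d)

print(translate(["Life", "is", "beautiful"], encoder_step, decoder_step))
# ['La', 'vie', 'est', 'belle']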
52 / 57
53 / 57
Image from Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, Xu et al. (2015)
54 / 57
john.d.kelleher@dit.ie
@johndkelleher
www.machinelearningbook.com
https://ie.linkedin.com/in/johndkelleher
Acknowledgements: The ADAPT Centre is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.
Books: Kelleher et al. (2015) Kelleher and Tierney (2018)
Kelleher, J. (2016). Fundamentals of machine learning for neural machine translation. In Translating Europe Forum 2016: Focusing on Translation Technologies. European Commission Directorate-General for Translation. doi: 10.21427/D78012.

Kelleher, J. D., Mac Namee, B., and D'Arcy, A. (2015). Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples and Case Studies. MIT Press, Cambridge, MA.

Kelleher, J. D. and Tierney, B. (2018). Data Science. MIT Press.
56 / 57
Mikolov, T., Yih, W.-t., and Zweig, G. (2013). Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751.

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., and Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057.
57 / 57