

slide-1
SLIDE 1

An Introduction to Neural Machine Translation

  • Prof. John D. Kelleher

@johndkelleher

ADAPT Centre for Digital Content Technology, Dublin Institute of Technology, Ireland

June 25, 2018

1 / 57

slide-2
SLIDE 2

Outline

The Neural Machine Translation Revolution
Neural Networks 101
Word Embeddings
Language Models
Neural Language Models
Neural Machine Translation
Beyond NMT: Image Annotation

slide-3
SLIDE 3

Image from https://blogs.msdn.microsoft.com/translation/

slide-4
SLIDE 4

Image from https://www.blog.google/products/translate

slide-5
SLIDE 5

Image from https://techcrunch.com/2017/08/03/facebook-finishes-its-move-to-neural-machine-translation/

slide-6
SLIDE 6

Image from https://slator.com/technology/linguees-founder-launches-deepl-attempt-challenge-google-translate/

slide-7
SLIDE 7

Neural Networks 101

7 / 57

slide-8
SLIDE 8

What is a function?

A function maps a set of inputs (numbers) to an output (number)¹, e.g.:

sum(2, 5, 4) → 11

¹This introduction to neural networks and machine translation is based on Kelleher (2016).

8 / 57

slide-9
SLIDE 9

What is a weightedSum function?

weightedSum([x1, x2, . . . , xm], [w1, w2, . . . , wm]) = (x1 × w1) + (x2 × w2) + · · · + (xm × wm)

where [x1, . . . , xm] are the input numbers and [w1, . . . , wm] are the weights.

weightedSum([3, 9], [−3, 1]) = (3 × −3) + (9 × 1) = −9 + 9 = 0

9 / 57
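A minimal sketch of this function in Python (the snake_case name and the printed example are just illustrative):

```python
def weighted_sum(inputs, weights):
    """Multiply each input by its weight and add up the results."""
    return sum(x * w for x, w in zip(inputs, weights))

# Reproduces the slide's example: (3 * -3) + (9 * 1) = 0
print(weighted_sum([3, 9], [-3, 1]))  # 0
```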

slide-10
SLIDE 10

What is an activation function?

An activation function takes the output of our weightedSum function and applies another mapping to it.

10 / 57

slide-11
SLIDE 11

What is an activation function?

[Figure: plots of common activation functions over z]

logistic(z) = 1 / (1 + e^−z)
tanh(z) = (e^z − e^−z) / (e^z + e^−z)
rectifier(z) = max(0, z)

11 / 57
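The three activation functions shown in the figure can be written directly as small Python functions; a sketch using only the standard math module:

```python
import math

def logistic(z):
    """logistic(z) = 1 / (1 + e^-z): squashes z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """tanh(z) = (e^z - e^-z) / (e^z + e^-z): squashes z into the range (-1, 1)."""
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

def rectifier(z):
    """rectifier(z) = max(0, z), also known as the ReLU."""
    return max(0.0, z)

print(logistic(0), tanh(0), rectifier(-2))  # 0.5 0.0 0.0
```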

slide-12
SLIDE 12

What is an activation function?

activation = logistic(weightedSum([x1, x2, . . . , xm], [w1, w2, . . . , wm]))

where [x1, . . . , xm] are the input numbers and [w1, . . . , wm] are the weights.

logistic(weightedSum([3, 9], [−3, 1])) = logistic((3 × −3) + (9 × 1)) = logistic(−9 + 9) = logistic(0) = 0.5

12 / 57

slide-13
SLIDE 13

What is a Neuron?

The simple list of operations that we have just described defines the fundamental building block of a neural network: the Neuron.

Neuron = activation(weightedSum([x1, x2, . . . , xm], [w1, w2, . . . , wm]))

where [x1, . . . , xm] are the input numbers and [w1, . . . , wm] are the weights.

13 / 57
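Combining the two ideas gives a one-line neuron; a sketch that assumes the weighted_sum and logistic helpers sketched on the earlier slides:

```python
def neuron(inputs, weights):
    """A neuron: an activation function applied to a weighted sum of the inputs."""
    return logistic(weighted_sum(inputs, weights))

# Reproduces the earlier example: logistic(0) = 0.5
print(neuron([3, 9], [-3, 1]))  # 0.5
```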

slide-14
SLIDE 14

What is a Neuron?

[Figure: a single neuron; the inputs x0, x1, x2, x3, . . . , xm are multiplied by the weights w0, w1, w2, w3, . . . , wm, summed (Σ), and passed through the activation ϕ]

14 / 57

slide-15
SLIDE 15

What is a Neural Network?

[Figure: a neural network with an input layer, three hidden layers, and an output layer]

15 / 57

slide-16
SLIDE 16

Training a Neural Network

◮ We train a neural network by iteratively updating the weights.
◮ We start by randomly assigning weights to each edge.
◮ We then show the network examples of inputs and expected outputs, and update the weights using Backpropagation so that the network outputs match the expected outputs.
◮ We keep updating the weights until the network is working the way we want (a toy sketch of this loop follows below).

16 / 57
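The sketch below illustrates the idea on the smallest possible case: a single logistic neuron fitted by gradient descent on a made-up task. Full backpropagation applies the same kind of weight update layer by layer; the data, learning rate, and epoch count here are arbitrary, and weighted_sum and logistic are the helpers sketched earlier:

```python
import random

def train_neuron(examples, num_inputs, learning_rate=0.1, epochs=1000):
    """Fit one logistic neuron by repeatedly nudging its weights towards the expected outputs."""
    weights = [random.uniform(-1, 1) for _ in range(num_inputs)]  # start with random weights
    for _ in range(epochs):
        for inputs, expected in examples:
            output = logistic(weighted_sum(inputs, weights))
            error = expected - output
            # Move each weight in the direction that reduces the squared error
            # (error * output * (1 - output) is the gradient through the logistic).
            for i in range(num_inputs):
                weights[i] += learning_rate * error * output * (1 - output) * inputs[i]
    return weights

# Toy task: output 1 when the second input is larger than the first.
data = [([0.1, 0.9], 1), ([0.8, 0.2], 0), ([0.3, 0.7], 1), ([0.9, 0.4], 0)]
print(train_neuron(data, num_inputs=2))
```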

slide-17
SLIDE 17

Word Embeddings

17 / 57

slide-18
SLIDE 18

Word Embeddings

◮ Language is sequential and has lots of words.

18 / 57

slide-19
SLIDE 19

“a word is characterized by the company it keeps”

— Firth, 1957

19 / 57

slide-20
SLIDE 20

Word Embeddings

1. Train a network to predict the word that is missing from the middle of an n-gram (or predict the n-gram from the word).
2. Use the trained network weights to represent the word in vector space (a small sketch follows below).

20 / 57
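Step 2 boils down to reading rows out of a learned weight matrix; a minimal numpy sketch in which the vocabulary, dimensionality, and matrix values are all placeholders (in practice W would come from training the prediction network in step 1):

```python
import numpy as np

vocab = ["king", "man", "woman", "queen"]
word_to_index = {word: i for i, word in enumerate(vocab)}

embedding_dim = 4
# Placeholder for the input-to-hidden weight matrix learned in step 1.
W = np.random.randn(len(vocab), embedding_dim)

def embed(word):
    """Return the word's vector: the row of W associated with that word."""
    return W[word_to_index[word]]

print(embed("king"))  # a 4-dimensional vector for "king"
```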

slide-21
SLIDE 21

Word Embeddings

Each word is represented by a vector of numbers that positions the word in a multi-dimensional space, e.g.:

king = <55, −10, 176, 27>
man = <10, 79, 150, 83>
woman = <15, 74, 159, 106>
queen = <60, −15, 185, 50>

21 / 57

slide-22
SLIDE 22

Word Embeddings

vec(King) − vec(Man) + vec(Woman) ≈ vec(Queen)²

²Linguistic Regularities in Continuous Space Word Representations, Mikolov et al. (2013)

22 / 57
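With the toy vectors from the previous slide, the analogy can be checked with cosine similarity; a small numpy sketch (these are the illustrative numbers above, not real trained embeddings):

```python
import numpy as np

vectors = {
    "king":  np.array([55.0, -10.0, 176.0, 27.0]),
    "man":   np.array([10.0, 79.0, 150.0, 83.0]),
    "woman": np.array([15.0, 74.0, 159.0, 106.0]),
    "queen": np.array([60.0, -15.0, 185.0, 50.0]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the two vectors point in the same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max(vectors, key=lambda word: cosine(target, vectors[word]))
print(best)  # "queen": with these toy numbers the target equals vec(queen) exactly
```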

slide-23
SLIDE 23

Language Models

23 / 57

slide-24
SLIDE 24

Language Models

◮ Language is sequential and has lots of words.

24 / 57

slide-25
SLIDE 25

1,2,?

25 / 57

slide-26
SLIDE 26

[Figure: bar chart of the probability of each digit 1–9 being the next symbol in the sequence]

slide-27
SLIDE 27

Th?

27 / 57

slide-28
SLIDE 28

[Figure: bar chart of the probability of each letter a–z being the next character after “Th”]

slide-29
SLIDE 29

◮ A language model can compute:
  1. the probability of an upcoming symbol: P(wn | w1, . . . , wn−1)
  2. the probability of a sequence of symbols³: P(w1, . . . , wn)

³We can go from 1. to 2. using the Chain Rule of Probability: P(w1, w2, w3) = P(w1) P(w2|w1) P(w3|w1, w2)

29 / 57
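A worked example of the footnote's chain rule with made-up probabilities:

```python
# Made-up probabilities for the three-word sequence "feel the force".
p_w1 = 0.10              # P(w1 = feel)
p_w2_given_w1 = 0.40     # P(w2 = the | w1 = feel)
p_w3_given_w1_w2 = 0.25  # P(w3 = force | w1 = feel, w2 = the)

# Chain rule: P(w1, w2, w3) = P(w1) * P(w2 | w1) * P(w3 | w1, w2)
p_sequence = p_w1 * p_w2_given_w1 * p_w3_given_w1_w2
print(p_sequence)  # 0.01 (up to floating-point rounding)
```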

slide-30
SLIDE 30

◮ Language models are useful for machine translation because they help with:
  1. word ordering: P(Yes I can help you) > P(Help you I can yes)⁴
  2. word choice: P(Feel the Force) > P(Eat the Force)

⁴Unless it's Yoda that's speaking.

30 / 57

slide-31
SLIDE 31

Neural Language Models

31 / 57

slide-32
SLIDE 32

Recurrent Neural Networks

A particular type of neural network that is useful for processing sequential data (such as language) is a Recurrent Neural Network.

32 / 57

slide-33
SLIDE 33

Recurrent Neural Networks

Using an RNN we process our sequential data one input at a time. In an RNN the outputs of some of the neurons for one input are fed back into the network as part of the next input.

33 / 57

slide-34
SLIDE 34

Simple Feed-Forward Network

[Figure: a simple feed-forward network: each input (Input 1, Input 2, Input 3, . . . ) is processed independently through the input layer, hidden layer, and output layer]

34 / 57

slide-35
SLIDE 35

Recurrent Neural Networks

[Figure: a recurrent network: the hidden layer's outputs for the current input are stored in a buffer and fed back into the hidden layer together with the next input]

35 / 57

slide-36
SLIDE 36

Recurrent Neural Networks

[Figure: RNN animation, next time step]

36 / 57

slide-37
SLIDE 37

Recurrent Neural Networks

[Figure: RNN animation, next time step]

37 / 57

slide-38
SLIDE 38

Recurrent Neural Networks

[Figure: RNN animation, next time step]

38 / 57

slide-39
SLIDE 39

Recurrent Neural Networks

[Figure: RNN animation, next time step]

39 / 57

slide-40
SLIDE 40

Recurrent Neural Networks

[Figure: RNN animation, next time step]

40 / 57

slide-41
SLIDE 41

Recurrent Neural Networks

[Figure: RNN animation, next time step]

41 / 57

slide-42
SLIDE 42

Figure: Recurrent Neural Network (the input xt and the previous hidden state ht−1 feed the hidden state ht, which produces the output yt)

ht = φ((Whh · ht−1) + (Wxh · xt))
yt = φ(Why · ht)

42 / 57
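A numpy sketch of these two equations, using tanh for the activation φ and random placeholder weights and sizes:

```python
import numpy as np

input_size, hidden_size, output_size = 4, 3, 2
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(size=(hidden_size, hidden_size))  # previous hidden -> hidden
W_hy = rng.normal(size=(output_size, hidden_size))  # hidden -> output

def rnn_step(x_t, h_prev):
    """One time step: ht = tanh(Whh.h(t-1) + Wxh.xt), yt = tanh(Why.ht)."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    y_t = np.tanh(W_hy @ h_t)
    return h_t, y_t

# Process a short sequence one input at a time, carrying the hidden state forward.
h = np.zeros(hidden_size)
for x_t in [rng.normal(size=input_size) for _ in range(3)]:
    h, y = rnn_step(x_t, h)
    print(y)
```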

slide-43
SLIDE 43

Recurrent Neural Networks

[Figure: the inputs x1, x2, x3, . . . , xt, xt+1 produce hidden states h1, h2, h3, . . . , ht, ht+1, which in turn produce the outputs y1, y2, y3, . . . , yt, yt+1]

Figure: RNN Unrolled Through Time

43 / 57

slide-44
SLIDE 44

Hallucinating Text

[Figure: starting from Word1, the network's prediction ∗Word2 is fed back in as the next input, producing ∗Word3, ∗Word4, . . . , ∗Wordt+1]

44 / 57
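A sketch of the hallucination loop itself. It assumes a trained language model object with hypothetical initial_state() and step(word, state) methods that return a probability distribution over the vocabulary; those method names are placeholders, not a real library API:

```python
import numpy as np

def hallucinate(model, vocab, first_word, length=20, seed=0):
    """Generate text by sampling a word at each step and feeding it back in as the next input."""
    rng = np.random.default_rng(seed)
    state = model.initial_state()               # hypothetical: the RNN's starting hidden state
    word, output = first_word, [first_word]
    for _ in range(length):
        probs, state = model.step(word, state)  # hypothetical: P(next word) and the new state
        word = rng.choice(vocab, p=probs)       # sample *Word(t+1) from the distribution
        output.append(str(word))
    return " ".join(output)
```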

slide-45
SLIDE 45

Hallucinating Shakespeare

PANDARUS:
Alas, I think he shall be come approached and the day
When little srain would be attain’d into being never fed,
And who is but a chain and subjects of his death,
I should not sleep.

Second Senator:
They are away this miseries, produced upon my soul,
Breaking and strongly should be buried, when I perish
The earth and thoughts of many states.

DUKE VINCENTIO:
Well, your wit is in the care of side and that.

From: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

45 / 57

slide-46
SLIDE 46

Neural Machine Translation

46 / 57

slide-47
SLIDE 47

Neural Machine Translation

1. RNN Encoders
2. RNN Language Models

47 / 57

slide-48
SLIDE 48

Encoders

[Figure: the encoder reads Word1, Word2, . . . , Wordm, <eos> into hidden states h1, h2, . . . , hm; the final hidden state is the encoding C]

Figure: Using an RNN to Generate an Encoding of a Word Sequence

48 / 57

slide-49
SLIDE 49

Language Models

[Figure: at each step the language model takes Wordt and the hidden state ht and predicts ∗Wordt+1]

Figure: RNN Language Model Unrolled Through Time

49 / 57

slide-50
SLIDE 50

Decoder

[Figure: seeded with Word1, the language model feeds each predicted word back in as the next input, generating ∗Word2, ∗Word3, ∗Word4, . . . , ∗Wordt+1]

Figure: Using an RNN Language Model to Generate (Hallucinate) a Word Sequence

50 / 57

slide-51
SLIDE 51

Encoder-Decoder Architecture

[Figure: the encoder reads Source1, Source2, . . . , <eos> into hidden states h1, h2, . . . , ending in the encoding C; the decoder states d1, . . . , dn then generate Target1, Target2, . . . , <eos>]

Figure: Sequence to Sequence Translation using an Encoder-Decoder Architecture

51 / 57
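A high-level sketch of the architecture in the figure, with hypothetical encoder_step and decoder_step functions standing in for the trained RNNs (greedy decoding, i.e. always taking the most probable word):

```python
import numpy as np

def translate(source_words, encoder_step, decoder_step, target_vocab, max_length=50):
    """Encode the source sentence into C, then decode the target sentence word by word."""
    # Encoder: fold the source words (plus <eos>) into a single vector C.
    h = None
    for word in source_words + ["<eos>"]:
        h = encoder_step(word, h)     # hypothetical: returns the next hidden state
    C = h                             # the final hidden state is the encoding

    # Decoder: an RNN language model seeded with C and fed its own predictions.
    translation, word, d = [], "<start>", C
    for _ in range(max_length):
        probs, d = decoder_step(word, d)            # hypothetical: distribution + new state
        word = target_vocab[int(np.argmax(probs))]  # greedy: pick the most probable word
        if word == "<eos>":
            break
        translation.append(word)
    return translation
```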

slide-52
SLIDE 52

Neural Machine Translation

[Figure: the encoder reads "Life is beautiful <eos>" into hidden states h1 . . . h4, producing the encoding C; the decoder (states d1, d2, d3, . . . ) generates "La vie est belle <eos>"]

Figure: Example Translation using an Encoder-Decoder Architecture

52 / 57

slide-53
SLIDE 53

Beyond NMT: Image Annotation

53 / 57

slide-54
SLIDE 54

Image Annotation

Image from Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, Xu et al. (2015)

54 / 57

slide-55
SLIDE 55

Thank you for your attention

john.d.kelleher@dit.ie
@johndkelleher
www.machinelearningbook.com
https://ie.linkedin.com/in/johndkelleher

[Book cover: Data Science, John D. Kelleher and Brendan Tierney, The MIT Press Essential Knowledge Series]

Acknowledgements: The ADAPT Centre is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.

Books: Kelleher et al. (2015) Kelleher and Tierney (2018)

slide-56
SLIDE 56

References I

Kelleher, J. (2016). Fundamentals of machine learning for neural machine translation. In Translating Europe Forum 2016: Focusing on Translation Technologies. The European Commission Directorate-General for Translation, European Commission. doi: 10.21427/D78012.

Kelleher, J. D., Mac Namee, B., and D’Arcy, A. (2015). Fundamentals of Machine Learning for Predictive Analytics: Algorithms, Worked Examples and Case Studies. MIT Press, Cambridge, MA.

Kelleher, J. D. and Tierney, B. (2018). Data Science. MIT Press.

56 / 57

slide-57
SLIDE 57

References II

Mikolov, T., Yih, W.-t., and Zweig, G. (2013). Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751.

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057.

57 / 57