

  1. An Introduction to Neural Machine Translation Prof. John D. Kelleher @johndkelleher ADAPT Centre for Digital Content Technology Dublin Institute of Technology, Ireland June 25, 2018 1 / 57

  2. Outline The Neural Machine Translation Revolution Neural Networks 101 Word Embeddings Language Models Neural Language Models Neural Machine Translation Beyond NMT: Image Annotation

  3. Image from https://blogs.msdn.microsoft.com/translation/

  4. Image from https://www.blog.google/products/translate

  5. Image from https://techcrunch.com/2017/08/03/facebook-finishes-its-move-to-neural-machine-translation/

  6. Image from https://slator.com/technology/linguees-founder-launches-deepl-attempt-challenge-google-translate/

  7. Neural Networks 101 7 / 57

  8. What is a function? A function maps a set of inputs (numbers) to an output (number), e.g. sum(2, 5, 4) → 11. (This introduction to neural networks and machine translation is based on Kelleher (2016).) 8 / 57

  9. What is a weightedSum function? weightedSum([x1, x2, ..., xm], [w1, w2, ..., wm]) = (x1 × w1) + (x2 × w2) + · · · + (xm × wm), where the first list holds the input numbers and the second the weights. Example: weightedSum([3, 9], [−3, 1]) = (3 × −3) + (9 × 1) = −9 + 9 = 0. 9 / 57
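
As a minimal sketch (mine, not from the slides), the weightedSum function can be written directly in Python; the example numbers are the ones used on the slide:

    def weighted_sum(inputs, weights):
        # Multiply each input by its weight and add up the results.
        return sum(x * w for x, w in zip(inputs, weights))

    print(weighted_sum([3, 9], [-3, 1]))  # (3 * -3) + (9 * 1) = 0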

  10. What is an activation function? An activation function takes the output of our weightedSum function and applies another mapping to it. 10 / 57

  11. What is an activation function? Common choices: logistic(z) = 1 / (1 + e^−z), rectifier(z) = max(0, z), and tanh(z) = (e^z − e^−z) / (e^z + e^−z). (The slide plots logistic(z) and tanh(z) against z.) 11 / 57
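
A short sketch of the three activation functions named on the slide, using only the Python standard library (math.tanh also exists; the explicit formula is written out here to mirror the slide):

    import math

    def logistic(z):
        # Squashes z into the range (0, 1); logistic(0) = 0.5.
        return 1.0 / (1.0 + math.exp(-z))

    def rectifier(z):
        # Negative inputs become 0, positive inputs pass through unchanged.
        return max(0.0, z)

    def tanh(z):
        # Squashes z into the range (-1, 1).
        return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

    print(logistic(0), rectifier(-2.0), tanh(0))  # 0.5 0.0 0.0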

  12. What is an activation function? activation = logistic(weightedSum([x1, x2, ..., xm], [w1, w2, ..., wm])). Example: logistic(weightedSum([3, 9], [−3, 1])) = logistic((3 × −3) + (9 × 1)) = logistic(−9 + 9) = logistic(0) = 0.5. 12 / 57

  13. What is a Neuron? The simple list of operations that we have just described defines the fundamental building block of a neural network: the Neuron. Neuron = activation(weightedSum([x1, x2, ..., xm], [w1, w2, ..., wm])), with the input numbers in the first list and the weights in the second. 13 / 57
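
Putting the two previous sketches together gives a one-function version of the neuron; the worked example from slide 12 comes out as 0.5 as expected:

    import math

    def logistic(z):
        return 1.0 / (1.0 + math.exp(-z))

    def weighted_sum(inputs, weights):
        return sum(x * w for x, w in zip(inputs, weights))

    def neuron(inputs, weights, activation=logistic):
        # A neuron is an activation function applied to a weighted sum.
        return activation(weighted_sum(inputs, weights))

    print(neuron([3, 9], [-3, 1]))  # logistic(0) = 0.5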

  14. What is a Neuron? Figure: a single neuron, with inputs x0 ... xm multiplied by weights w0 ... wm, summed (Σ), and passed through an activation function (ϕ). 14 / 57

  15. What is a Neural Network? Figure: a neural network with an input layer, three hidden layers, and an output layer. 15 / 57

  16. Training a Neural Network ◮ We train a neural network by iteratively updating the weights. ◮ We start by randomly assigning weights to each edge. ◮ We then show the network examples of inputs and expected outputs and update the weights using Backpropagation so that the network outputs match the expected outputs. ◮ We keep updating the weights until the network is working the way we want. 16 / 57
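
The slide describes training only in outline. As an illustrative sketch (an assumption on my part, not the exact procedure in the talk), here is gradient descent on a single logistic neuron with a made-up toy dataset; backpropagation generalises this weight-update idea to every layer of a network:

    import math, random

    def logistic(z):
        return 1.0 / (1.0 + math.exp(-z))

    # Toy dataset of inputs and expected outputs (invented for illustration).
    data = [([3, 9], 1.0), ([2, -4], 0.0), ([-1, 5], 1.0)]

    weights = [random.uniform(-1, 1), random.uniform(-1, 1)]  # random start
    rate = 0.1

    for epoch in range(1000):
        for inputs, expected in data:
            output = logistic(sum(x * w for x, w in zip(inputs, weights)))
            error = expected - output
            # Nudge each weight in the direction that reduces the squared error
            # (the derivative of the logistic function is output * (1 - output)).
            for i, x in enumerate(inputs):
                weights[i] += rate * error * output * (1 - output) * x

    print(weights)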

  17. Word Embeddings 17 / 57

  18. Word Embeddings ◮ Language is sequential and has lots of words. 18 / 57

  19. “a word is characterized by the company it keeps” — Firth, 1957 19 / 57

  20. Word Embeddings 1. Train a network to predict the word that is missing from the middle of an n-gram (or predict the n-gram from the word) 2. Use the trained network weights to represent the word in vector space. 20 / 57
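
A hedged sketch of step 2: if each word enters the prediction network as a one-hot vector, multiplying that vector by the network's input weight matrix just selects one row of the matrix, and that row is the word's embedding. The vocabulary and matrix values below are the toy numbers from the next slide, arranged this way purely for illustration:

    vocab = ["king", "man", "woman", "queen"]

    # Input weight matrix of a trained prediction network (toy values):
    # one row per vocabulary word, one column per embedding dimension.
    W = [
        [55, -10, 176, 27],
        [10, 79, 150, 83],
        [15, 74, 159, 106],
        [60, -15, 185, 50],
    ]

    def one_hot(word):
        return [1 if v == word else 0 for v in vocab]

    def embed(word):
        # One-hot input times W picks out the row of W for that word.
        x = one_hot(word)
        return [sum(x[i] * W[i][j] for i in range(len(vocab)))
                for j in range(len(W[0]))]

    print(embed("king"))  # [55, -10, 176, 27]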

  21. Word Embeddings Each word is represented by a vector of numbers that positions the word in a multi-dimensional space, e.g.: king = <55, −10, 176, 27>, man = <10, 79, 150, 83>, woman = <15, 74, 159, 106>, queen = <60, −15, 185, 50>. 21 / 57

  22. Word Embeddings vec(King) − vec(Man) + vec(Woman) ≈ vec(Queen) (Linguistic Regularities in Continuous Space Word Representations, Mikolov et al. (2013)). 22 / 57
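
Using the toy vectors from slide 21, the analogy can be checked directly. Cosine similarity is my assumption here; Mikolov et al. use it, but the slide does not name a distance measure:

    import math

    king  = [55, -10, 176, 27]
    man   = [10, 79, 150, 83]
    woman = [15, 74, 159, 106]
    queen = [60, -15, 185, 50]

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = lambda v: math.sqrt(sum(x * x for x in v))
        return dot / (norm(a) * norm(b))

    # king - man + woman should land close to queen; with these toy numbers
    # it lands on queen exactly, so the similarity is 1.0.
    result = [k - m + w for k, m, w in zip(king, man, woman)]
    print(result, cosine(result, queen))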

  23. Language Models 23 / 57

  24. Language Models ◮ Language is sequential and has lots of words. 24 / 57

  25. 1,2,? 25 / 57

  26. Figure: a bar chart of the probability of each digit 0-9 being the next symbol.

  27. Th? 27 / 57

  28. Figure: a bar chart of the probability of each letter a-z being the next symbol.

  29. ◮ A language model can compute: 1. the probability of an upcoming symbol: P(wn | w1, ..., wn−1), and 2. the probability of a sequence of symbols: P(w1, ..., wn). (We can go from 1. to 2. using the Chain Rule of Probability: P(w1, w2, w3) = P(w1) P(w2 | w1) P(w3 | w1, w2).) 29 / 57
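
A minimal sketch of the chain-rule footnote with invented conditional probabilities for a three-word sequence; only the arithmetic is the point:

    # Invented probabilities, for illustration only.
    p_w1 = 0.10              # P(w1 = "Feel")
    p_w2_given_w1 = 0.40     # P(w2 = "the" | w1)
    p_w3_given_w1_w2 = 0.05  # P(w3 = "Force" | w1, w2)

    # Chain rule: P(w1, w2, w3) = P(w1) * P(w2 | w1) * P(w3 | w1, w2)
    p_sequence = p_w1 * p_w2_given_w1 * p_w3_given_w1_w2
    print(p_sequence)  # ≈ 0.002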

  30. ◮ Language models are useful for machine translation because they help with: 1. word ordering: P(Yes I can help you) > P(Help you I can yes) (unless it's Yoda speaking), and 2. word choice: P(Feel the Force) > P(Eat the Force). 30 / 57

  31. Neural Language Models 31 / 57

  32. Recurrent Neural Networks A particular type of neural network that is useful for processing sequential data (such as language) is a Recurrent Neural Network. 32 / 57

  33. Recurrent Neural Networks Using an RNN we process our sequential data one input at a time. In an RNN the outputs of some of the neurons for one input are fed back into the network as part of the next input. 33 / 57

  34. Simple Feed-Forward Network Figure: a feed-forward network with an input layer (Input 1, Input 2, Input 3, ...), a hidden layer, and an output layer. 34 / 57

  35.-41. Recurrent Neural Networks Figures: the same feed-forward network with a buffer added; across these slides the hidden layer's output at each step is stored in the buffer and fed back into the hidden layer together with the next input. 35-41 / 57

  42. h_t = φ((W_hh · h_{t−1}) + (W_xh · x_t)), y_t = φ(W_hy · h_t), where x_t is the input at time t, h_t the hidden state, and y_t the output. Figure: Recurrent Neural Network. 42 / 57
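
The two equations on this slide translate almost line for line into code. A minimal numpy sketch with made-up sizes and random (untrained) weights:

    import numpy as np

    hidden_size, input_size, output_size = 4, 3, 3

    # Weight matrices; random here, learned by backpropagation in practice.
    W_hh = np.random.randn(hidden_size, hidden_size) * 0.1
    W_xh = np.random.randn(hidden_size, input_size) * 0.1
    W_hy = np.random.randn(output_size, hidden_size) * 0.1

    def rnn_step(x_t, h_prev):
        # h_t = phi((W_hh . h_{t-1}) + (W_xh . x_t)); y_t = phi(W_hy . h_t)
        h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
        y_t = np.tanh(W_hy @ h_t)
        return y_t, h_t

    # Process a sequence one input at a time, carrying the hidden state forward.
    h = np.zeros(hidden_size)
    for x in [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]:
        y, h = rnn_step(x, h)
    print(y)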

  43. Recurrent Neural Networks Figure: RNN Unrolled Through Time, with inputs x1, x2, x3, ..., xt, xt+1, hidden states h1, ..., ht+1, and outputs y1, ..., yt+1. 43 / 57

  44. Hallucinating Text Figure: an RNN generating text; starting from Word 1, each predicted word (*Word 2, *Word 3, *Word 4, ...) is fed back in as the next input. 44 / 57

  45. Hallucinating Shakespeare PANDARUS: Alas, I think he shall be come approached and the day When little srain would be attain’d into being never fed, And who is but a chain and subjects of his death, I should not sleep. Second Senator: They are away this miseries, produced upon my soul, Breaking and strongly should be buried, when I perish The earth and thoughts of many states. DUKE VINCENTIO: Well, your wit is in the care of side and that. From: http://karpathy.github.io/2015/05/21/rnn-effectiveness/ 45 / 57

  46. Neural Machine Translation 46 / 57

  47. Neural Machine Translation 1. RNN Encoders 2. RNN Language Models 47 / 57

  48. Encoders Figure: Using an RNN to Generate an Encoding of a Word Sequence; the inputs Word 1 ... Word m and <eos> produce hidden states h1 ... hm, and the final state is the encoding C. 48 / 57
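
A hedged sketch of the encoder in this figure: run an RNN step over the source words (represented here as one-hot vectors over a toy vocabulary) and keep the hidden state left after <eos> as the encoding C. The vocabulary and weight values are assumptions:

    import numpy as np

    vocab = ["Life", "is", "beautiful", "<eos>"]
    hidden_size = 4

    W_hh = np.random.randn(hidden_size, hidden_size) * 0.1
    W_xh = np.random.randn(hidden_size, len(vocab)) * 0.1

    def one_hot(word):
        v = np.zeros(len(vocab))
        v[vocab.index(word)] = 1.0
        return v

    def encode(words):
        # The hidden state after the final <eos> token is the encoding C.
        h = np.zeros(hidden_size)
        for word in words:
            h = np.tanh(W_hh @ h + W_xh @ one_hot(word))
        return h

    C = encode(["Life", "is", "beautiful", "<eos>"])
    print(C)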

  49. Language Models Figure: RNN Language Model Unrolled Through Time; at each step the input is Word t and the output is the predicted next word *Word t+1. 49 / 57

  50. Decoder Figure: Using an RNN Language Model to Generate (Hallucinate) a Word Sequence; only Word 1 is given, and each predicted word is fed back in as the next input. 50 / 57

  51. Encoder-Decoder Architecture Figure: Sequence to Sequence Translation using an Encoder-Decoder Architecture; the encoder reads Source 1, Source 2, ..., <eos> and produces the encoding C, which the decoder uses to generate Target 1, Target 2, ..., <eos>. 51 / 57

  52. Neural Machine Translation Figure: Example Translation using an Encoder-Decoder Architecture; the encoder reads "Life is beautiful <eos>" into the encoding C, and the decoder generates "La vie est belle <eos>". 52 / 57
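
A structural sketch of the whole figure, with random untrained weights, so the printed words are meaningless; it only shows how the encoding C is handed from encoder to decoder and how each generated word is fed back in. Starting the decoder from an <eos> input is my assumption about the start-of-generation convention:

    import numpy as np

    src_vocab = ["Life", "is", "beautiful", "<eos>"]
    tgt_vocab = ["La", "vie", "est", "belle", "<eos>"]
    H = 8

    # Random weights: the structure is the point, not the output.
    enc_W_hh = np.random.randn(H, H) * 0.1
    enc_W_xh = np.random.randn(H, len(src_vocab)) * 0.1
    dec_W_hh = np.random.randn(H, H) * 0.1
    dec_W_xh = np.random.randn(H, len(tgt_vocab)) * 0.1
    dec_W_hy = np.random.randn(len(tgt_vocab), H) * 0.1

    def one_hot(word, vocab):
        v = np.zeros(len(vocab))
        v[vocab.index(word)] = 1.0
        return v

    # Encoder: the final hidden state is the encoding C of the source sentence.
    C = np.zeros(H)
    for word in ["Life", "is", "beautiful", "<eos>"]:
        C = np.tanh(enc_W_hh @ C + enc_W_xh @ one_hot(word, src_vocab))

    # Decoder: start from C and feed each predicted word back in as the next input.
    h, word, output = C, "<eos>", []
    for _ in range(10):
        h = np.tanh(dec_W_hh @ h + dec_W_xh @ one_hot(word, tgt_vocab))
        word = tgt_vocab[int(np.argmax(dec_W_hy @ h))]
        if word == "<eos>":
            break
        output.append(word)

    print(output)  # with trained weights this would be ["La", "vie", "est", "belle"]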

  53. Beyond NMT: Image Annotation 53 / 57

  54. Image Annotation Images from Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, Xu et al. (2015). 54 / 57

  55. Thank you for your attention. john.d.kelleher@dit.ie @johndkelleher www.machinelearningbook.com https://ie.linkedin.com/in/johndkelleher Acknowledgements: The ADAPT Centre is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund. Books: Kelleher et al. (2015); Kelleher and Tierney (2018), Data Science, The MIT Press Essential Knowledge Series.
