Seq2Seq Models and Attention


Seq2Seq Models and Attention. M. Soleymani, Sharif University of Technology, Spring 2020. Most slides have been adopted from Bhiksha Raj, 11-785, CMU 2019, and some from Fei-Fei Li and colleagues' lectures, CS231n, Stanford 2017.


  1. Running example: translate the source sentence "I ate an apple <eos>" into the target sentence "Ich habe einen apfel gegessen <eos>". We will illustrate with a single hidden layer, but the discussion generalizes to more layers.

  2. The "simple" translation model. [Figure: an ENCODER RNN reads "I ate an apple <eos>"; a DECODER RNN then emits "Ich habe einen apfel gegessen <eos>", with each emitted word fed back as the next decoder input.]

  3. The "simple" translation model, a more detailed look: the one-hot word representations may be compressed via embeddings. The embeddings are learned along with the rest of the net; in the following slides we will not draw the projection matrices.

  4. Training the system. The model must learn to make predictions appropriately: given "I ate an apple <eos>", produce "Ich habe einen apfel gegessen <eos>".

  5. Training: forward pass. Input the source and target sequences sequentially; at each decoder step the output is a probability distribution over the target symbol set (vocabulary).

  6. Training: backward pass. Compute the loss between the output distributions and the target word sequence.

  7. Training: backward pass. Compute the loss between the output distributions and the target word sequence, then backpropagate the derivatives of the loss through the network to learn the net.

  8. Training: backward pass, in practice. If we apply SGD, we may randomly sample words from the output to actually use for the backprop and update. Typical usage: randomly select one word from each training instance (an input-output pair). For each iteration: randomly select a training instance (input, output); run the forward pass; randomly select a single output y(t) and the corresponding desired output d(t) for the backprop. A minimal sketch of one training step follows.
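Below is a minimal, illustrative numpy sketch of one teacher-forced training step for such an encoder-decoder, under toy assumptions (tiny vocabularies, a single-layer tanh RNN, made-up word ids). None of the names or sizes come from the slides, and a real implementation would use an autodiff framework for the backward pass.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all sizes and ids are made up for illustration):
# source ids stand in for "I ate an apple <eos>",
# target ids stand in for "Ich habe einen apfel gegessen <eos>".
V_src, V_tgt, H = 6, 8, 16
src = [1, 2, 3, 4, 5]            # 5 = source <eos>
tgt = [1, 2, 3, 5, 6, 7]         # 7 = target <eos>
SOS = 0                          # start symbol fed to the decoder at t = 1

E_src = rng.normal(0, 0.1, (V_src, H))   # source embeddings
E_tgt = rng.normal(0, 0.1, (V_tgt, H))   # target embeddings
W_enc, U_enc = rng.normal(0, 0.1, (H, H)), rng.normal(0, 0.1, (H, H))
W_dec, U_dec = rng.normal(0, 0.1, (H, H)), rng.normal(0, 0.1, (H, H))
W_out = rng.normal(0, 0.1, (H, V_tgt))   # hidden state -> output logits

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Forward pass, encoder: consume the (reversed) source sequence (slides 9-11).
h = np.zeros(H)
for w in reversed(src):
    h = np.tanh(E_src[w] @ W_enc + h @ U_enc)

# Forward pass, decoder with teacher forcing: feed the *true* previous target
# word at each step and score the predicted distribution against the true word.
s, prev, losses = h, SOS, []
for w in tgt:
    s = np.tanh(E_tgt[prev] @ W_dec + s @ U_dec)
    y = softmax(s @ W_out)            # distribution over the target vocabulary
    losses.append(-np.log(y[w]))      # cross-entropy loss at this step
    prev = w                          # teacher forcing

print("per-step losses:", np.round(losses, 3))
# Backward pass (slides 6-8): the derivatives of these losses are backpropagated
# through both RNNs to update all parameters; in practice an autodiff framework
# (PyTorch, JAX, ...) computes the gradients, optionally using only a random
# subset of the output positions as the slide suggests.
```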

  9. Trick of the trade: reversing the input. Standard trick of the trade: the input sequence is fed in reverse order; things work better this way.

  10. Trick of the trade: reversing the input (continued from the previous slide).

  11. Trick of the trade: reversing the input. The input sequence is fed in reverse order, and this happens both for training and during the actual decode; a small sketch follows.
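The reversal itself is a one-liner; a small sketch (how exactly <eos> is handled varies between implementations, so this is just one plausible convention, not the slide's exact recipe):

```python
src = ["I", "ate", "an", "apple", "<eos>"]
# Reverse the words before feeding the encoder, at training and decode time alike;
# here <eos> is kept as the final token, which is one common convention.
src_rev = list(reversed(src[:-1])) + ["<eos>"]
print(src_rev)   # ['apple', 'an', 'ate', 'I', '<eos>']
```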

  12. Overall training. Given several training instances (Y, Z_target): forward pass, compute the output of the network for (Y, Z_target) with the input in reverse order (note that both Y and Z_target are used in the forward pass); backward pass, compute the loss between the desired target Z_target and the actual output Z, and propagate the derivatives of the loss for the updates. In the following slides the input Y is written $I_1, \dots, I_N$ and the target output $O_1, \dots, O_L$.

  13. What the network actually produces. At each time $t$ the network produces a probability distribution over the output vocabulary: $y_t^w = P(O_t = w \mid O_1, \dots, O_{t-1}, I_1, \dots, I_N)$, the probability of word $w$ given the entire input sequence $I_1, \dots, I_N$ and the partial output sequence $O_1, \dots, O_{t-1}$ up to time $t$. At each time a word is drawn from the output distribution, and the drawn word is provided as input to the next time step.

  14. What the network actually produces (continued). [Figure: the figure builds up the decoding of "Ich habe einen apfel gegessen <eos>" one drawn word at a time, with each drawn word fed back as the next decoder input.]

  15. What the network actually produces (continued).

  16. What the network actually produces (continued).

  17. What the network actually produces (continued).

  18. What the network actually produces (continued).

  19. What the network actually produces (continued).

  20. What the network actually produces (continued).

  21. What the network actually produces (continued): the full output "Ich habe einen apfel gegessen <eos>" has been drawn, one word per step, each fed back as the next input.

  22. Generating an output from the net. At each time the network produces a probability distribution over words, given the entire input and the previous outputs. At each time a word is drawn from the output distribution and provided as input to the next time step; the process continues until an <eos> is generated. A sketch of this decode loop follows.
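A minimal sketch of the decode loop, again with made-up toy parameters (the weights are random, so the generated ids are meaningless; the point is the sample-and-feed-back structure and the <eos> stopping condition):

```python
import numpy as np

rng = np.random.default_rng(1)
V, H, EOS, MAX_LEN = 8, 16, 7, 20          # toy sizes; id 7 plays the role of <eos>
E = rng.normal(0, 0.1, (V, H))             # target-side embeddings
W, U = rng.normal(0, 0.1, (H, H)), rng.normal(0, 0.1, (H, H))
W_out = rng.normal(0, 0.1, (H, V))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def generate(h_enc, start_id=0):
    """Sample one output sequence: at each step draw a word from the predicted
    distribution, feed it back as the next input, and stop at <eos>."""
    s, prev, out = h_enc, start_id, []
    while len(out) < MAX_LEN:
        s = np.tanh(E[prev] @ W + s @ U)        # decoder state update
        y = softmax(s @ W_out)                  # distribution over the vocabulary
        prev = int(rng.choice(V, p=y))          # draw a word from the distribution
        out.append(prev)
        if prev == EOS:                         # the process stops at <eos>
            break
    return out

print(generate(h_enc=np.zeros(H)))   # a short list of word ids, ending in 7 or cut at MAX_LEN
```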

  23. The probability of the output. $P(O_1, \dots, O_L \mid I_1, \dots, I_N) = y_1^{O_1} y_2^{O_2} \cdots y_L^{O_L}$. The objective of drawing is to produce the most likely output (one that ends in an <eos>): $\arg\max_{O_1, \dots, O_L} y_1^{O_1} y_2^{O_2} \cdots y_L^{O_L}$.
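Numerically this product is usually accumulated as a sum of log-probabilities; a short illustration with made-up per-step probabilities:

```python
import numpy as np

step_probs = [0.42, 0.61, 0.35, 0.77, 0.52, 0.90]   # made-up values of y_t^{O_t}
log_prob = float(np.sum(np.log(step_probs)))        # log P(O_1..O_L | I_1..I_N)
print(log_prob, np.exp(log_prob))                   # the raw product underflows for long outputs
```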

  24. The probability of the output. We cannot just pick the most likely symbol at each time: that may cause the distribution to be more "confused" at the next time, whereas choosing a different, less likely word could make the distribution at the next time more peaky, resulting in a more likely output overall.

  25. Greedy is not good. [Figure: two bar plots of $P(O_3 \mid O_1, O_2, I_1, \dots, I_N)$ over the vocabulary $w_1, \dots, w_V$ at t = 3, one for each choice made at t = 2.] Hypothetical example from English speech recognition (the input is speech, the output must be text): "nose" has the highest probability at t = 2 and is selected, but the model is then very confused at t = 3 and assigns low probabilities to many words, so selecting any of them gives a low probability for the entire 3-word sequence. "Knows" has slightly lower probability than "nose" but is still high; "he knows" is a reasonable beginning, the model assigns high probabilities to words such as "something", and selecting one of these yields a higher overall probability for the 3-word sequence.

  26. Greedy is not good. [Figure: the distribution $P(O_2 \mid O_1, I_1, \dots, I_N)$ at t = 2, with "nose" and "knows" highlighted.] What should we have chosen at t = 2? Will selecting "nose" continue to have a bad effect into the distant future? Problem: it is impossible to know a priori which word leads to the more promising future. Should we draw "nose" or "knows"? The effect may not be obvious until several words down the line, and a wrong word chosen early may cumulatively lead to a poorer overall score over time.

  27. Greedy is not good. [Figure: the distribution $P(O_1 \mid I_1, \dots, I_N)$ at t = 1, with "the" and "he" highlighted.] What should we have chosen at t = 1, "the" or "he"? It is impossible to know a priori which word leads to the more promising future: even earlier, choosing the lower-probability "the" instead of "he" at t = 1 may have made "nose" the more reasonable choice at t = 2. In general, making a poor choice at any time commits us to a poor future, but we cannot know at that time that the choice was poor. Solution: don't choose.

  28. Solution: multiple choices. Retain the competing choices and fork the network, with every possible word as input.

  29. Problem: multiple choices. This will blow up very quickly: for an output vocabulary of size $V$, after $T$ output steps we would have forked out $V^T$ branches.

  30. Solution: prune (beam search). At each time, retain only the top-$K$ scoring forks: at the first step, $\mathrm{top}_K\, P(O_1 \mid I_1, \dots, I_N)$.

  31. Solution: prune (beam search), continued.

  32. Solution: prune (beam search). Note that the score is based on the product of probabilities: $\mathrm{top}_K\, P(O_1, O_2 \mid I_1, \dots, I_N) = \mathrm{top}_K\, P(O_2 \mid O_1, I_1, \dots, I_N)\, P(O_1 \mid I_1, \dots, I_N)$. At each time, retain only the top-$K$ scoring forks.

  33. Solution: prune (beam search), continued.

  34. Solution: prune (beam search). The score keeps extending as a product of conditionals, e.g. $\mathrm{top}_K\, P(O_3 \mid O_1, O_2, I_1, \dots, I_N)\, P(O_2 \mid O_1, I_1, \dots, I_N)\, P(O_1 \mid I_1, \dots, I_N)$. At each time, retain only the top-$K$ scoring forks.

  35. Solution: prune (beam search), continued.

  36. Solution: prune (beam search). In general, at each time retain only the top-$K$ forks under $\mathrm{top}_K \prod_{t=1}^{T} P(O_t \mid O_1, \dots, O_{t-1}, I_1, \dots, I_N)$.

  37. Terminate (beam search). Terminate when the current most likely path overall ends in <eos>, or continue producing more outputs (each of which terminates in <eos>) to get the N-best outputs.

  38. Termination: <eos> (beam search; the example uses K = 2). Paths cannot continue once they output an <eos>, so paths may have different lengths. Select the most likely sequence ending in <eos> across all terminating sequences. A beam-search sketch follows.
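A self-contained beam-search sketch in the spirit of slides 28-38. The decoder here is a stand-in `step_fn` returning a fixed toy distribution, and the interface, variable names, and termination rule (stop when the best overall path has emitted <eos>) are illustrative choices, not taken verbatim from the slides.

```python
import numpy as np

def beam_search(step_fn, init_state, sos_id, eos_id, K=2, max_len=20):
    """Keep the K highest-scoring partial hypotheses, scored by the sum of
    log-probabilities; a hypothesis is frozen once it emits <eos>.
    step_fn(state, word_id) -> (new_state, prob_distribution) is an assumed
    toy decoder interface."""
    beams = [(0.0, [sos_id], init_state, False)]        # (log_prob, words, state, done)
    for _ in range(max_len):
        candidates = []
        for logp, words, state, done in beams:
            if done:                                     # finished paths are carried over
                candidates.append((logp, words, state, True))
                continue
            new_state, probs = step_fn(state, words[-1])
            for w in np.argsort(probs)[-K:]:             # expand each beam's K best words
                candidates.append((logp + np.log(probs[w]),   # score = product of probabilities
                                   words + [int(w)], new_state, int(w) == eos_id))
        beams = sorted(candidates, key=lambda b: b[0], reverse=True)[:K]   # prune to top K
        if beams[0][3]:                                  # best overall path ends in <eos>
            break
    return beams[0][1]

# Toy stand-in decoder: ignores its inputs and always returns the same distribution.
def toy_step(state, prev_word):
    return state, np.array([0.05, 0.45, 0.10, 0.05, 0.35])   # id 4 plays the role of <eos>

print(beam_search(toy_step, init_state=None, sos_id=0, eos_id=4))   # -> [0, 4]
```

With K = 1 this reduces to the greedy decoding that slides 25-27 warn against.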

  39. Applications. Machine translation: "My name is Tom" → "Ich heisse Tom" / "Mein Name ist Tom". Dialog: "I have a problem" → "How may I help you". Image to text: picture → caption for the picture.

  40. Machine translation example. The hidden state clusters by meaning! ("Sequence to Sequence Learning with Neural Networks", Sutskever, Vinyals, and Le, 2014.)

  41. Machine translation example ("Sequence to Sequence Learning with Neural Networks", Sutskever, Vinyals, and Le, 2014).

  42. Human-machine conversation example. Trained on human-human conversations; task: human text in, machine response out. ("A Neural Conversational Model", Oriol Vinyals and Quoc Le, 2015.)

  43. A problem with this framework. All the information about the input sequence is embedded into a single vector, the hidden layer at the end of the input sequence. This one node is "overloaded" with information, particularly if the input is long.

  44. A problem with this framework. In reality, all hidden values carry information, some of which may be diluted downstream. Different outputs are related to different inputs; recall that input and output may not be in sequence, and we have no way of knowing a priori which input must connect to what output.

  45. Variants. A better model: the encoded input embedding is fed as input to all output timesteps, not only the first one. [Figure: the same idea applied to image captioning, decoding "A boy on a surfboard <eos>" from "<sos>".] A small sketch of this variant follows.
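A small sketch of one decoder step under this variant, with the usual caveat that the parameter names and exact wiring are illustrative assumptions rather than the slide's precise architecture:

```python
import numpy as np

rng = np.random.default_rng(2)
V, H = 8, 16
E = rng.normal(0, 0.1, (V, H))                       # target embeddings
W_in, W_ctx, U = (rng.normal(0, 0.1, (H, H)) for _ in range(3))
W_out = rng.normal(0, 0.1, (H, V))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decoder_step(prev_word, s, h_enc):
    """One decoder step where the encoded input h_enc is an extra input at
    *every* timestep, not just the first one (the 'variant' of slide 45)."""
    s = np.tanh(E[prev_word] @ W_in + h_enc @ W_ctx + s @ U)
    return s, softmax(s @ W_out)

h_enc = rng.normal(0, 1.0, H)                        # stand-in for the encoded input
s, y = decoder_step(prev_word=0, s=np.zeros(H), h_enc=h_enc)
print(y.round(3))                                    # distribution over the vocabulary
```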

  46. A problem with this framework (repeating slide 43): all the information about the input sequence is embedded into a single vector, which becomes overloaded, particularly for long inputs.

  47. A problem with this framework. In reality, all hidden values carry information, some of which may be diluted downstream. (Slide note: "FIX ENCODER DECODER SEPARATION".)

  48. A problem with this framework. In reality, all hidden values carry information, some of which may be diluted downstream; different outputs are related to different inputs (recall that input and output may not be in sequence).

  49. A problem with this framework (continued): we also have no way of knowing a priori which input must connect to what output.

  50. A problem with this framework. Connecting everything to everything is infeasible: the inputs and outputs are variable-sized, the model would be overparametrized, and a fixed connection pattern ignores the actual, asynchronous dependence of the output on the input.

  51. Solution: attention models. [Figure: the encoder hidden states $h_i$ computed over "I ate an apple <eos>".] From here on the encoder and decoder are drawn separately in the illustrations.

  52. Attention models. At each output step, compute a weighted combination of all the encoder hidden outputs into a single vector, $\sum_i \alpha_{t,i} h_i$; the weights vary by output time $t$.

  53. Solution: attention models. The input to the hidden decoder layer at output time $t$ is $\sum_i \alpha_{t,i} h_i$, where the weights $\alpha_{t,i}$ are scalars and vary with the output time. A sketch of this weighted combination follows.
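A tiny sketch of just the weighted combination (the weights below are made up; how they are computed is the subject of the next slides):

```python
import numpy as np

rng = np.random.default_rng(3)
N, H = 5, 16                                   # source length and hidden width (toy sizes)
h = rng.normal(0, 1.0, (N, H))                 # encoder hidden states h_1 .. h_N

# Attention weights for one output step t: a distribution over input positions.
alpha_t = np.array([0.05, 0.10, 0.70, 0.10, 0.05])   # made up, sums to 1
context_t = alpha_t @ h                        # sum_i alpha_{t,i} h_i, shape (H,)
print(context_t.shape)                         # (16,): one context vector per output step
```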

  54. Attention instead of a simple encoder-decoder. Encoder-decoder models need to compress all the necessary information of a source sentence into a fixed-length vector, and their performance deteriorates rapidly as the length of the input sentence increases. Attention avoids this by allowing the RNN generating the output to focus on the hidden states generated by the first RNN as they become relevant. (Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate", ICLR 2015.)

  55. Soft attention for translation. An RNN can attend over the output of another RNN: at every time step, it focuses on different positions in the other RNN. Example: "I love coffee" → "Me gusta el café". (Bahdanau et al., ICLR 2015.)

  56. Soft attention for translation: a distribution over the input words. "I love coffee" → "Me gusta el café". (Bahdanau et al., ICLR 2015.)

  57. Soft attention for translation (continued).

  58. Soft attention for translation (continued).

  59. Soft attention for translation (continued).

  60. Solution: attention models. We require a time-varying weight that specifies the relationship of each output time to each input time; the weights are functions of the current output state, $\alpha_{t,i} = a(h_i, s_{t-1})$. The input to the hidden decoder layer remains $\sum_i \alpha_{t,i} h_i$.

  61. Attention models. The weights $\alpha_{t,1}, \dots, \alpha_{t,N}$ are a distribution over the input (they sum to 1.0); they must automatically highlight the most important input components for any output.

  62. Attention models. The "raw" weight at any time is a function $g()$ that works on the two hidden states, $e_{t,i} = g(h_i, s_{t-1})$; the actual weight is a softmax over the raw weights, $\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_j \exp(e_{t,j})}$.

  63. Attention models. Typical options for $g()$ (parameters shown in red on the slide are to be learned): the dot product $g(h_i, s_{t-1}) = h_i^\top s_{t-1}$; the bilinear form $g(h_i, s_{t-1}) = h_i^\top W_g s_{t-1}$; the additive form $g(h_i, s_{t-1}) = w_g^\top \tanh(W_g [h_i, s_{t-1}])$; or a small MLP, $g(h_i, s_{t-1}) = \mathrm{MLP}([h_i, s_{t-1}])$. A sketch of these scoring options and the softmax normalization follows.
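A sketch of three of the scoring options together with the softmax normalization; the parameter shapes and the concatenation order in the additive variant are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
N, H = 5, 16
h = rng.normal(0, 1.0, (N, H))                 # encoder states h_i
s_prev = rng.normal(0, 1.0, H)                 # previous decoder state s_{t-1}

W_g = rng.normal(0, 0.1, (H, H))               # learned, bilinear score
W_a = rng.normal(0, 0.1, (2 * H, H))           # learned, additive score
w_a = rng.normal(0, 0.1, H)

def scores_dot(h, s):       return h @ s                       # g = h_i^T s_{t-1}
def scores_bilinear(h, s):  return h @ W_g @ s                 # g = h_i^T W_g s_{t-1}
def scores_additive(h, s):                                     # g = w_g^T tanh(W_g [h_i, s_{t-1}])
    hs = np.concatenate([h, np.tile(s, (len(h), 1))], axis=1)
    return np.tanh(hs @ W_a) @ w_a

def attention_weights(e):
    """Softmax over raw scores: alpha_{t,i} = exp(e_{t,i}) / sum_j exp(e_{t,j})."""
    e = e - e.max()
    return np.exp(e) / np.exp(e).sum()

for fn in (scores_dot, scores_bilinear, scores_additive):
    print(fn.__name__, attention_weights(fn(h, s_prev)).round(3))
```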

  64. Converting an input (forward pass). Pass the input through the encoder to produce the hidden representations $h_1, \dots, h_N$.

  65. Converting an input (forward pass). What is the initial decoder state $s_0$? There are multiple options: the simplest is $s_0 = h_N$; if $s$ and $h$ have different sizes, $s_0 = W_s h_N$, where $W_s$ is a learnable parameter. We then compute the weights for the first output.

  66. Converting an input (forward pass). Compute the raw scores and weights (for every $h_i$) for the first output: $e_{1,i} = g(h_i, s_0)$ (the slide uses the bilinear score $h_i^\top W_g s_0$), and $\alpha_{1,i} = \frac{\exp(e_{1,i})}{\sum_j \exp(e_{1,j})}$.

  67. Converting an input (forward pass). Compute the weighted combination of the hidden values: $c_1 = \sum_i \alpha_{1,i} h_i$.

  68. Converting an input (forward pass). Produce the first output; it will be a distribution over words.

  69. Converting an input (forward pass). Produce the first output, a distribution over words, and draw a word from that distribution.

  70. [Figure: the drawn word "Ich" and the new decoder state $s_1$ are used to compute the raw scores $e_{2,i} = g(h_i, s_1)$ and weights $\alpha_{2,i} = \frac{\exp(e_{2,i})}{\sum_j \exp(e_{2,j})}$ for the second output.]

  71. [Figure: the context vector for the second output, $c_2 = \sum_i \alpha_{2,i} h_i$, is formed.]

  72. Compute the output at t = 2; it will be a probability distribution over words.

  73. Draw a word from the output distribution at t = 2.

  74. Compute the weights over every $h_i$ for time t = 3: $e_{3,i} = g(h_i, s_2)$, $\alpha_{3,i} = \frac{\exp(e_{3,i})}{\sum_j \exp(e_{3,j})}$, $c_3 = \sum_i \alpha_{3,i} h_i$.

  75. Compute the output at t = 3; it will be a probability distribution over words.

  76. Draw a word from the distribution.

  77. Compute the weights over every $h_i$ for time t = 4: $e_{4,i} = g(h_i, s_3)$, and so on.

  78. The process repeats until the full output "Ich habe einen apfel gegessen <eos>" has been drawn. As before, the objective of drawing is to produce the most likely output (one that ends in an <eos>), $\arg\max_{O_1, \dots, O_L} y_1^{O_1} y_2^{O_2} \cdots y_L^{O_L}$; simply selecting the most likely symbol at each time may result in a suboptimal output. A sketch of the complete attention decoder loop follows.
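Putting slides 64-78 together: a sketch of the full attention decoder forward pass with a bilinear score, sampling words until <eos>. Toy sizes, random weights, and illustrative names throughout; beam search from slide 38 would replace the sampling in practice.

```python
import numpy as np

rng = np.random.default_rng(5)
V, H, N, EOS, MAX_LEN = 8, 16, 5, 7, 20        # toy sizes; id 7 plays the role of <eos>
E = rng.normal(0, 0.1, (V, H))                 # target embeddings
W_g = rng.normal(0, 0.1, (H, H))               # bilinear attention score parameter
W_in, W_ctx, U = (rng.normal(0, 0.1, (H, H)) for _ in range(3))
W_out = rng.normal(0, 0.1, (H, V))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_and_decode(h_enc, sos_id=0):
    """At each step: score every encoder state against the previous decoder
    state, softmax the scores into weights, form the context vector, update
    the decoder state, and draw the next word."""
    s = h_enc[-1]                              # s_0 = h_N, the simplest choice (slide 65)
    prev, out = sos_id, []
    while len(out) < MAX_LEN:
        e = h_enc @ W_g @ s                    # raw scores e_{t,i} = g(h_i, s_{t-1})
        alpha = softmax(e)                     # attention weights over input positions
        c = alpha @ h_enc                      # context vector c_t = sum_i alpha_{t,i} h_i
        s = np.tanh(E[prev] @ W_in + c @ W_ctx + s @ U)
        y = softmax(s @ W_out)                 # distribution over the output vocabulary
        prev = int(rng.choice(V, p=y))         # draw a word
        out.append(prev)
        if prev == EOS:
            break
    return out

h_enc = rng.normal(0, 1.0, (N, H))             # stand-in for the encoder states h_1..h_N
print(attend_and_decode(h_enc))                # a list of word ids, ideally ending in 7
```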

  79. What does the attention learn? The key component of this model is the attention weight: it captures the relative importance of each position in the input to the current output.
