[Figure: recurrent encoder-decoder unrolled over "I ate an apple <eos>" → "Ich habe einen apfel gegessen <eos>"]
• We will illustrate with a single hidden layer, but the discussion generalizes to more layers
The "simple" translation model
[Figure: the same network, with the portion reading "I ate an apple <eos>" labeled ENCODER and the portion producing "Ich habe einen apfel gegessen <eos>" labeled DECODER]
The "simple" translation model
[Figure: the same network, now showing the projection matrices between the one-hot word inputs and the recurrent layer]
• A more detailed look: the one-hot word representations may be compressed via embeddings
  – Embeddings will be learned along with the rest of the net
  – In the following slides we will not represent the projection matrices
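To make the embedding step concrete, here is a minimal sketch (my own illustration, not from the slides) showing that an embedding lookup is equivalent to multiplying a one-hot vector by a learned projection matrix; the vocabulary size and embedding dimension are illustrative assumptions.

```python
import numpy as np

V, E = 10000, 256                     # assumed vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W_embed = rng.normal(size=(V, E))     # in practice, learned jointly with the rest of the net

word_id = 42                          # index of some word in the vocabulary
one_hot = np.zeros(V); one_hot[word_id] = 1.0

# Multiplying the one-hot vector by the projection matrix ...
projected = one_hot @ W_embed
# ... is the same as simply looking up row `word_id` (which is what embedding layers do)
looked_up = W_embed[word_id]
assert np.allclose(projected, looked_up)
```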
Training the system
[Figure: encoder-decoder unrolled over the training pair "I ate an apple <eos>" → "Ich habe einen apfel gegessen <eos>"]
• Must learn to make predictions appropriately
  – Given "I ate an apple <eos>", produce "Ich habe einen apfel gegessen <eos>"
Training: Forward pass
[Figure: the unrolled network, showing an output distribution $y_t$ at every decoder step]
• Forward pass: Input the source and target sequences, sequentially
  – The output will be a probability distribution over the target symbol set (vocabulary)
Training: Backward pass
[Figure: a loss is computed between each decoder output distribution $y_t$ and the corresponding target word of "Ich habe einen apfel gegessen <eos>"]
• Backward pass: Compute the loss between the output distributions and the target word sequence
• Backpropagate the derivatives of the loss through the network to learn the net
Training: Backward pass
[Figure: as before, with per-step losses over the decoder outputs]
• In practice, if we apply SGD, we may randomly sample words from the output to actually use for the backprop and update
  – Typical usage: randomly select one word from each training instance (comprising an input-output pair)
  – For each iteration:
    • Randomly select a training instance: (input, output)
    • Forward pass
    • Randomly select a single output y(t) and corresponding desired output d(t) for backprop
    • Backward pass and parameter update
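A minimal sketch of this idea (my own illustration, not code from the course): the loss is computed at one randomly chosen output position only, and only that term is backpropagated. The tensor shapes and helper name are assumptions.

```python
import torch
import torch.nn.functional as F

def sampled_word_loss(logits, targets):
    """logits: (T, vocab_size) decoder outputs for one training pair,
    targets: (T,) desired word indices.
    Returns the cross-entropy loss at a single randomly selected time step."""
    T = targets.shape[0]
    t = torch.randint(T, (1,)).item()          # pick one output position at random
    return F.cross_entropy(logits[t:t+1], targets[t:t+1])

# Usage (hypothetical tensors):
#   loss = sampled_word_loss(decoder_logits, target_ids)
#   loss.backward()    # backprop and update use only this one word's loss
```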
Trick of the trade: Reversing the input
[Figure: the same training setup, but with the source sentence fed to the encoder in reverse word order]
• Standard trick of the trade: the input sequence is fed in reverse order
  – Things work better this way
• This happens both during training and during the actual decode
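As a trivial illustration (my own, with assumed token lists), the reversal is just a list reversal applied consistently at training time and at decode time:

```python
src_words = ["I", "ate", "an", "apple"]

# Reverse the source words before feeding the encoder; the same reversal
# must be applied when translating new sentences at decode time.
# Whether <eos> is also moved is a detail; here it is kept at the end.
src_input = list(reversed(src_words)) + ["<eos>"]
print(src_input)   # ['apple', 'an', 'ate', 'I', '<eos>']
```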
Overall training
• Given several training instances (I, O^target)
• Forward pass: Compute the output of the network for (I, O^target), with the input fed in reverse order
  – Note: both I and O^target are used in the forward pass
• Backward pass: Compute the loss between the desired target O^target and the actual output O
  – Propagate the derivatives of the loss for the parameter updates
• In our earlier notation, I is the training input X and O^target is the desired output D
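To tie the forward and backward passes together, here is a minimal PyTorch-style sketch of one training step for an encoder-decoder of this kind (teacher forcing: the target words themselves are fed to the decoder during training). The class and variable names, the sizes, the use of LSTM cells, and the `<sos>` token are my assumptions, not the course's reference code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=256, hidden=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_in_ids):
        # src_ids is assumed to be already reversed (the "trick of the trade")
        _, state = self.encoder(self.src_emb(src_ids))     # final state summarizes the input
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in_ids), state)
        return self.out(dec_out)                           # (B, T, tgt_vocab) logits

def train_step(model, optimizer, src_ids, tgt_ids, sos_id):
    # Teacher forcing: the decoder sees <sos> + target[:-1] and must predict the target
    sos = torch.full_like(tgt_ids[:, :1], sos_id)
    tgt_in = torch.cat([sos, tgt_ids[:, :-1]], dim=1)
    logits = model(src_ids, tgt_in)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), tgt_ids.reshape(-1))
    optimizer.zero_grad()
    loss.backward()        # backpropagate the derivatives of the loss
    optimizer.step()
    return loss.item()
```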
What the network actually produces
[Figure, built up over several slides: at each decoder step the network outputs a distribution over the output vocabulary (one probability per word, e.g. $y_t^{\text{Ich}}, y_t^{\text{habe}}, \dots, y_t^{<eos>}$); a word is drawn from it and fed back as the next input, producing "Ich habe einen apfel gegessen <eos>" one word at a time]
• At each time $t$ the network actually produces a probability distribution over the output vocabulary
  – $y_t^w = P(O_t = w \mid O_{t-1}, \dots, O_1, W_1, \dots, W_N)$
  – The probability given the entire input sequence $W_1, \dots, W_N$ and the partial output sequence $O_1, \dots, O_{t-1}$ until time $t$
• At each time a word is drawn from the output distribution
• The drawn word is provided as input to the next time step
Generating an output from the net
[Figure: the full decode of "I ate an apple <eos>" into "Ich habe einen apfel gegessen <eos>", with the distribution $y_t$ shown at every decoder step]
• At each time the network produces a probability distribution over words, given the entire input and the previous outputs
• At each time a word is drawn from the output distribution
• The drawn word is provided as input to the next time step
• The process continues until an <eos> is generated
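A minimal sketch of this generation loop (my own illustration, reusing the hypothetical Seq2Seq module sketched after the training slides; it samples from the distribution as described, though greedy selection or beam search can be substituted):

```python
import torch

@torch.no_grad()
def generate(model, src_ids, sos_id, eos_id, max_len=50):
    """Sample an output sequence one word at a time until <eos> is produced."""
    _, state = model.encoder(model.src_emb(src_ids))        # encode the (reversed) source
    word = torch.tensor([[sos_id]])                         # first decoder input
    output = []
    for _ in range(max_len):
        dec_out, state = model.decoder(model.tgt_emb(word), state)
        probs = torch.softmax(model.out(dec_out[:, -1]), dim=-1)
        word = torch.multinomial(probs, num_samples=1)      # draw a word from y_t
        if word.item() == eos_id:
            break                                           # stop once <eos> is generated
        output.append(word.item())
        # the drawn word becomes the decoder input at the next time step
    return output
```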
The probability of the output
[Figure: the same decode, with the per-step distributions $y_1, \dots, y_L$]
$P(O_1, \dots, O_L \mid W_1, \dots, W_N) = y_1^{O_1} y_2^{O_2} \cdots y_L^{O_L}$
• The objective of drawing: Produce the most likely output (that ends in an <eos>)
$\arg\max_{O_1, \dots, O_L} \; y_1^{O_1} y_2^{O_2} \cdots y_L^{O_L}$
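Concretely, the probability of a candidate output is the product of the per-step probabilities assigned to its words (in practice accumulated in log space for numerical stability). A small illustrative sketch with made-up numbers:

```python
import math

# Hypothetical per-step probabilities y_t^{O_t} of the words actually chosen,
# e.g. for "Ich habe einen apfel gegessen <eos>"
step_probs = [0.62, 0.48, 0.71, 0.55, 0.80, 0.93]

sequence_prob = math.prod(step_probs)
sequence_logprob = sum(math.log(p) for p in step_probs)    # preferred in practice

print(sequence_prob, math.exp(sequence_logprob))           # the two agree
```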
The probability of the output
[Figure: the same decode, annotated with the objective $\arg\max_{O_1,\dots,O_L} y_1^{O_1} y_2^{O_2} \cdots y_L^{O_L}$]
• Cannot just pick the most likely symbol at each time
  – That may cause the distribution to be more "confused" at the next time
  – Choosing a different, less likely word could cause the distribution at the next time to be more peaky, resulting in a more likely output overall
Greedy is not good
[Figure: two bar plots of $P(O_3 \mid O_1, O_2, W_1, \dots, W_N)$ over the vocabulary at t=3, one after choosing "nose" at t=2 and one after choosing "knows"]
• Hypothetical example (from English speech recognition: the input is speech, the output must be text)
• "Nose" has the highest probability at t=2 and is selected
  – The model is very confused at t=3 and assigns low probabilities to many words at the next time
  – Selecting any of these will result in a low probability for the entire 3-word sequence
• "Knows" has slightly lower probability than "nose", but is still high, and is selected
  – "he knows" is a reasonable beginning and the model assigns high probabilities to words such as "something"
  – Selecting one of these results in a higher overall probability for the 3-word sequence
Greedy is not good
[Figure: the distribution $P(O_2 \mid O_1, W_1, \dots, W_N)$ at t=2, with "nose" and "knows" highlighted: what should we have chosen at t=2? Will selecting "nose" continue to have a bad effect into the distant future?]
• Problem: It is impossible to know a priori which word leads to the more promising future
  – Should we draw "nose" or "knows"?
  – The effect may not be obvious until several words down the line
  – Or the choice of the wrong word early may cumulatively lead to a poorer overall score over time
Greedy is not good
[Figure: the distribution $P(O_1 \mid W_1, \dots, W_N)$ at t=1, with "the" and "he" highlighted: what should we have chosen at t=1?]
• Problem: It is impossible to know a priori which word leads to the more promising future
  – Even earlier: choosing the lower-probability "the" instead of "he" at t=1 may have made a choice of "nose" more reasonable at t=2
• In general, making a poor choice at any time commits us to a poor future
  – But we cannot know at that time that the choice was poor
• Solution: Don't choose...
Solution: Multiple choices
[Figure: the decoder forked after the first step, with every candidate first word — "I", "He", "We", "The", ... — fed as input to the next step]
• Retain all the choices and fork the network
  – With every possible word as input
Problem: Multiple choices
[Figure: the forked decoder branching again at every step]
• Problem: This will blow up very quickly
  – For an output vocabulary of size $V$, after $T$ output steps we would have forked out $V^T$ branches
Solution: Prune (beam search)
[Figure: the fork after the first step, with only the top-K first words retained, scored by $\arg\max_K P(O_1 \mid W_1, \dots, W_N)$]
• Solution: Prune
  – At each time, retain only the top K scoring forks
Solution: Prune (beam search)
[Figure: the surviving forks (e.g. "He knows", "He nose", "The ...") extended by one word, again keeping only the top K]
• Note: the scoring is based on the product of probabilities:
  $\arg\max_K P(O_1, O_2 \mid W_1, \dots, W_N) = \arg\max_K P(O_2 \mid O_1, W_1, \dots, W_N)\, P(O_1 \mid W_1, \dots, W_N)$
• Solution: Prune
  – At each time, retain only the top K scoring forks
Solution: Prune (beam search)
[Figure: the beam after several steps]
$\arg\max_K \prod_{t=1}^{T} P(O_t \mid O_1, \dots, O_{t-1}, W_1, \dots, W_N)$
• Solution: Prune
  – At each time, retain only the top K scoring forks
Terminate (beam search)
[Figure: the beam, with one path ending in <eos>]
• Terminate
  – When the current most likely path overall ends in <eos>
• Or continue producing more outputs (each of which terminates in <eos>) to get N-best outputs
Termination: <eos> (beam search; the example has K = 2)
[Figure: a beam of width 2 in which different paths end in <eos> at different times]
• Terminate
  – Paths cannot continue once they output an <eos>
    • So paths may have different lengths
• Select the most likely sequence ending in <eos> across all terminating sequences
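Putting the pruning and termination rules together, here is a compact beam-search sketch (my own illustration, not the course's reference implementation). It assumes a `step(prev_word_id, state)` function that returns the log-probability distribution over the next word and the updated decoder state; length normalization and other common refinements are omitted.

```python
def beam_search(step, sos_id, eos_id, beam_width=2, max_len=50):
    """step(prev_word_id, state) -> (log_probs, new_state), where log_probs[w]
    is log P(next word = w | history). state may start as None."""
    beam = [(0.0, [sos_id], None)]            # (cumulative log-prob, words, decoder state)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, words, state in beam:
            log_probs, new_state = step(words[-1], state)
            for w, lp in enumerate(log_probs):
                candidates.append((score + lp, words + [w], new_state))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for cand in candidates:
            if cand[1][-1] == eos_id:
                finished.append(cand)          # a path cannot continue past <eos>
            elif len(beam) < beam_width:
                beam.append(cand)              # prune: keep only the top-K live forks
            if len(beam) == beam_width:
                break
        # Terminate when the best completed path outscores everything still in the beam
        if not beam or (finished and max(f[0] for f in finished) >= beam[0][0]):
            break
    best = max(finished or beam, key=lambda c: c[0])
    return best[1][1:], best[0]                # drop <sos>; return words and log-probability
```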
Applications
• Machine translation
  – My name is Tom → Ich heisse Tom / Mein Name ist Tom
• Dialog
  – "I have a problem" → "How may I help you"
• Image to text
  – Picture → Caption for the picture
Machine Translation Example
[Figure: 2-D projection of encoder hidden states; sentences with similar meanings cluster together]
• The hidden state clusters by meaning!
"Sequence to Sequence Learning with Neural Networks", Sutskever, Vinyals and Le, 2014
Machine Translation Example
[Figure: example translations produced by the model]
"Sequence to Sequence Learning with Neural Networks", Sutskever, Vinyals and Le, 2014
Human-Machine Conversation: Example
[Figure: sample exchanges between a human and the trained model]
• Trained on human-human conversations
• Task: Human text in, machine response out
"A Neural Conversational Model", Oriol Vinyals and Quoc Le, 2015
A problem with this framework
[Figure: the encoder-decoder, highlighting the single hidden state at the end of the input sequence]
• All the information about the input sequence is embedded into a single vector
  – The "hidden" node layer at the end of the input sequence
  – This one node is "overloaded" with information
    • Particularly if the input is long
A problem with this framework
[Figure: the encoder-decoder, highlighting all the encoder hidden states]
• In reality: all hidden values carry information
  – Some of which may be diluted downstream
• Different outputs are related to different inputs
  – Recall that the input and output may not be in sequence (word order can differ)
  – We have no way of knowing a priori which input must connect to which output
Variants
[Figure: two examples — the translation of "I ate an apple <eos>" and the captioning output "A boy on a surfboard <eos>" — in which the encoded input embedding is fed as an input at every output timestep]
• A better model: the encoded input embedding is provided as input to all output timesteps (a minimal sketch follows)
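A minimal sketch of this variant (my own illustration, with assumed names and sizes): the final encoder state is concatenated to the embedding of the previous output word at every decoder timestep, rather than being used only to initialize the decoder.

```python
import torch
import torch.nn as nn

class DecoderWithContext(nn.Module):
    """Decoder that sees the encoded input at every output timestep."""
    def __init__(self, tgt_vocab, emb=256, enc_dim=512, hidden=512):
        super().__init__()
        self.emb = nn.Embedding(tgt_vocab, emb)
        # Input at each step = previous word embedding + encoder summary vector
        self.rnn = nn.LSTM(emb + enc_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, tgt_in_ids, enc_summary):
        # enc_summary: (B, enc_dim), e.g. the encoder's final hidden state
        T = tgt_in_ids.size(1)
        repeated = enc_summary.unsqueeze(1).expand(-1, T, -1)   # same vector at every step
        rnn_in = torch.cat([self.emb(tgt_in_ids), repeated], dim=-1)
        dec_out, _ = self.rnn(rnn_in)
        return self.out(dec_out)                                # (B, T, tgt_vocab) logits
```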
A problem with this framework
[Figure: the encoder and decoder drawn separately, with all the encoder hidden states highlighted]
• All the information about the input sequence is embedded into a single vector, which becomes overloaded when the input is long
• In reality: all hidden values carry information
  – Some of which may be diluted downstream
• Different outputs are related to different inputs
  – Recall that the input and output may not be in sequence
  – We have no way of knowing a priori which input must connect to which output
• Connecting everything to everything is infeasible
  – Variable-sized inputs and outputs
  – Overparametrized
  – The connection pattern ignores the actual asynchronous dependence of the output on the input
Solution: Attention models
[Figure: the encoder (hidden states $h_i$ over "I ate an apple <eos>") and the decoder (states $s_t$) drawn as separate networks]
• Separating the encoder and decoder in the illustration
Solution: Attention models
[Figure: each decoder step $t$ receives, as an additional input, a weighted combination of all the encoder hidden states]
• Compute a weighted combination of all the encoder hidden outputs into a single vector
  – Input to the hidden decoder layer at time $t$: $\sum_i \alpha_{t,i} h_i$
  – The weights $\alpha_{t,i}$ are scalars
  – Note: the weights vary with the output time $t$
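A small numpy sketch of this weighted combination (my own illustration; the sizes and the already-normalized weights are assumptions that the next slides make concrete):

```python
import numpy as np

def attention_context(H, alpha):
    """H: (N, d) encoder hidden states h_1..h_N; alpha: (N,) weights summing to 1.
    Returns the single context vector sum_i alpha[i] * H[i]."""
    return alpha @ H                                      # shape (d,)

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))                               # 5 input positions, hidden size 8
alpha_t = np.array([0.05, 0.10, 0.70, 0.10, 0.05])        # hypothetical weights for output time t
z_t = attention_context(H, alpha_t)
print(z_t.shape)                                          # (8,) — recomputed at every output step
```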
Attention instead of the simple encoder-decoder
• Encoder-decoder models
  – need to compress all the necessary information of a source sentence into a fixed-length vector
  – performance deteriorates rapidly as the length of the input sentence increases
• Attention avoids this by:
  – allowing the RNN generating the output to focus on the hidden states (generated by the first RNN) as they become relevant
Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate", ICLR 2015
Soft Attention for Translation
[Figure, built up over several slides: at each output step the decoder places a distribution over the input words while translating "I love coffee" → "Me gusta el café"]
• An RNN can attend over the output of another RNN. At every time step, it focuses on different positions in the other RNN.
Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate", ICLR 2015
Solution: Attention models
[Figure: the attention-augmented decoder; input to the hidden decoder layer at time $t$: $\sum_i \alpha_{t,i} h_i$, with $\alpha_{t,i} = a(h_i, s_{t-1})$]
• We require a time-varying weight that specifies the relationship of each output time to each input time
  – The weights are functions of the current output state
Attention models
[Figure: the weights over the encoder states for one output step; they sum to 1.0]
• The weights are a distribution over the input
  – They must automatically highlight the most important input components for any output
Attention models
[Figure: the same network, annotated with the weight computation]
  – Raw weight: $e_i(t) = g(h_i, s_{t-1})$
  – Actual weight: $\alpha_{t,i} = \dfrac{\exp(e_i(t))}{\sum_j \exp(e_j(t))}$
• "Raw" weight at any time: a function $g(\cdot)$ that works on the two hidden states
• Actual weight: a softmax over the raw weights
Attention models
[Figure: the same network]
• Typical options for $g(\cdot)$:
  – $g(h_i, s_{t-1}) = h_i^\top s_{t-1}$
  – $g(h_i, s_{t-1}) = h_i^\top W_g\, s_{t-1}$
  – $g(h_i, s_{t-1}) = v_g^\top \tanh\!\left(W_g\, [h_i, s_{t-1}]\right)$
  – $g(h_i, s_{t-1}) = \mathrm{MLP}\!\left([h_i, s_{t-1}]\right)$
  – The variables shown in red on the original slide ($W_g$, $v_g$, the MLP) are to be learned
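A small numpy sketch of these scoring options and the resulting softmax weights (my own illustration; the exact parameterization of the additive form is an assumption rather than the slide's definition):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
H = rng.normal(size=(5, d))             # encoder states h_1..h_5
s_prev = rng.normal(size=d)             # decoder state s_{t-1} (same size for simplicity)
W_g = rng.normal(size=(d, d))           # learned in practice; random here
v_g = rng.normal(size=d)
W_a = rng.normal(size=(2 * d, d))       # matrix used in the additive form (assumed shape)

def softmax(e):
    e = e - e.max()                     # for numerical stability
    return np.exp(e) / np.exp(e).sum()

e_dot      = H @ s_prev                                 # g = h_i^T s_{t-1}
e_bilinear = H @ W_g @ s_prev                           # g = h_i^T W_g s_{t-1}
concat     = np.concatenate([H, np.tile(s_prev, (len(H), 1))], axis=1)   # rows [h_i, s_{t-1}]
e_additive = np.tanh(concat @ W_a) @ v_g                # g = v_g^T tanh(W_a [h_i, s_{t-1}])

for e in (e_dot, e_bilinear, e_additive):
    alpha = softmax(e)                  # attention weights over the 5 input positions
    print(np.round(alpha, 3), alpha.sum())
```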
Converting an input (forward pass)
[Figure: the encoder run over "I ate an apple <eos>", producing the hidden states $h_i$]
• Pass the input through the encoder to produce the hidden representations $h_i$
Converting an input (forward pass)
[Figure: the initial decoder state $s_{-1}$ attached to the final encoder state]
What is $s_{-1}$? Multiple options:
  – Simplest: $s_{-1} = h_N$
  – If $s$ and $h$ are of different sizes: $s_{-1} = W_s h_N$, where $W_s$ is a learnable parameter
• Compute the weights for the first output
Converting an input (forward pass), t = 1
[Figure: the first decoder step]
• Compute the raw weights (for every $h_i$) for the first output, e.g. $g(h_i, s_{-1}) = h_i^\top W_g\, s_{-1}$, and normalize:
  $e_i(1) = g(h_i, s_{-1}), \qquad \alpha_{1,i} = \dfrac{\exp(e_i(1))}{\sum_j \exp(e_j(1))}$
• Compute the weighted combination of the hidden values: $z_1 = \sum_i \alpha_{1,i} h_i$
• Produce the first output
  – It will be a distribution over words
  – Draw a word from the distribution
Converting an input (forward pass), t = 2
[Figure: the drawn word "Ich" fed back as the decoder input at t = 2]
• Compute the weights for time t = 2 using the new decoder state:
  $e_i(2) = g(h_i, s_1), \qquad \alpha_{2,i} = \dfrac{\exp(e_i(2))}{\sum_j \exp(e_j(2))}, \qquad z_2 = \sum_i \alpha_{2,i} h_i$
• Compute the output at t = 2
  – It will be a probability distribution over words
• Draw a word from the output distribution at t = 2
Converting an input (forward pass), t = 3, 4, ...
[Figure: the drawn words ("Ich", "habe", ...) fed back step by step]
• Compute the weights for all the $h_i$ for time t = 3:
  $e_i(3) = g(h_i, s_2), \qquad \alpha_{3,i} = \dfrac{\exp(e_i(3))}{\sum_j \exp(e_j(3))}, \qquad z_3 = \sum_i \alpha_{3,i} h_i$
• Compute the output at t = 3
  – It will be a probability distribution over words
  – Draw a word from the distribution
• Repeat for t = 4, 5, ...: compute the weights and the context, produce the output distribution, and draw the next word
[Figure: the complete attention-based decode producing "Ich habe einen apfel gegessen <eos>"]
• As before, the objective of drawing: produce the most likely output (that ends in an <eos>)
  $\arg\max_{O_1, \dots, O_L} \; y_1^{O_1} y_2^{O_2} \cdots y_L^{O_L}$
• Simply selecting the most likely symbol at each time may result in suboptimal output
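Pulling the whole forward pass together, here is a compact numpy sketch of attention-based decoding (my own illustration, with made-up parameter names and a simple tanh recurrence; a real system would use gated cells, trained weights, and beam search rather than sampled selection):

```python
import numpy as np

rng = np.random.default_rng(0)
d, V = 8, 12                                   # hidden size and toy output vocabulary size
EOS = 0

# Hypothetical trained parameters (random here, for illustration only)
W_g = rng.normal(size=(d, d))                  # attention scoring:  g(h_i, s) = h_i^T W_g s
W_s = rng.normal(size=(2 * d + V, d))          # recurrence: s_t = tanh(W_s [s_{t-1}, z_t, onehot(prev)])
W_o = rng.normal(size=(d, V))                  # output projection to vocabulary logits

def softmax(x):
    x = x - x.max()
    return np.exp(x) / np.exp(x).sum()

def decode(H, max_len=10):
    """H: (N, d) encoder hidden states of the source sentence."""
    s = H[-1]                                  # s_{-1} = h_N (simplest initialization)
    prev = np.zeros(V)                         # previous word as a one-hot (start symbol)
    outputs = []
    for _ in range(max_len):
        e = H @ W_g @ s                        # raw attention weights e_i(t)
        alpha = softmax(e)                     # attention distribution over input positions
        z = alpha @ H                          # context vector z_t
        s = np.tanh(np.concatenate([s, z, prev]) @ W_s)   # new decoder state
        y = softmax(s @ W_o)                   # distribution over output words
        w = int(rng.choice(V, p=y))            # draw a word (beam search in practice)
        if w == EOS:
            break
        outputs.append(w)
        prev = np.zeros(V); prev[w] = 1.0      # drawn word becomes the next input
    return outputs

print(decode(rng.normal(size=(5, d))))
```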
What does the attention learn?
[Figure: the attention weights $\alpha_{t,i}$ computed at each output step]
• The key component of this model is the attention weight
  – It captures the relative importance of each position in the input to the current output