
Lecture 16: Language Model
CS109B Data Science 2
Pavlos Protopapas, Mark Glickman, and Chris Tanner
Outline: Language Modelling; RNNs/LSTMs +ELMo; Seq2Seq +Attention; Transformers +BERT; Conclusions


  1. Language Modelling: neural networks
  IDEA: Let's use a neural network!
  • First, each word is represented by a word embedding (e.g., a vector of length 200); each circle is a specific floating-point scalar.
  • Words that are more semantically similar to one another will have embeddings that are proportionally similar, too.
  • We can use pre-existing word embeddings that have been trained on gigantic corpora.
  (Diagram: example embedding vectors for "man", "woman", and "table".)

  2. Language Modelling: neural networks
  These word embeddings are so rich that you get nice properties:
  king - man + woman ~ queen
  Word2vec: https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
  GloVe: https://www.aclweb.org/anthology/D14-1162.pdf
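  To make the analogy concrete, here is a tiny NumPy sketch with hand-picked 4-dimensional vectors (purely illustrative stand-ins; real Word2vec/GloVe embeddings are learned from huge corpora and are typically 100-300 dimensional):

```python
import numpy as np

# Toy, hand-picked 4-d embeddings purely for illustration; real Word2vec/GloVe
# vectors are learned from large corpora and are typically 100-300 dimensional.
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.7]),
    "man":   np.array([0.1, 0.8, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9, 0.1]),
    "queen": np.array([0.9, 0.1, 0.9, 0.7]),
    "table": np.array([0.2, 0.0, 0.0, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land closest to queen
target = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, emb[w]))
print(best)  # queen
```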

  3. Language Modelling: neural networks
  How can we use these embeddings to build a LM? Remember, we only need a system that can estimate
  P(y_{t+1} | y_t, y_{t-1}, ..., y_1),
  i.e., the probability of the next word given the previous words.
  Example input sentence: "She went to" → "class"
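  A model with this ability also scores whole sequences, since the chain rule factorizes the sentence probability into a product of next-word probabilities. A minimal sketch, where `next_word_prob` is a hypothetical stand-in for any such estimator (an n-gram table, the FFNN below, or an RNN):

```python
import math

def sentence_log_prob(words, next_word_prob):
    """Score a sentence by chaining next-word probabilities.

    `next_word_prob(history, word)` is a hypothetical stand-in for any model
    that returns P(word | history): an n-gram table, the FFNN, or an RNN.
    """
    return sum(math.log(next_word_prob(words[:t], w)) for t, w in enumerate(words))

# Toy usage with a made-up uniform model over a 10-word vocabulary:
uniform = lambda history, word: 0.1
print(sentence_log_prob(["She", "went", "to", "class"], uniform))  # 4 * log(0.1)
```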

  4. Language Modelling: Feed-forward Neural Net
  Neural Approach #1: Feed-forward Neural Net
  General Idea: using windows of words, predict the next word.
  (Diagram: example input "She went to" feeds a hidden layer through weights W; an output layer, through weights X, predicts "class?".)

  5. Language Modelling: Feed-forward Neural Net
  Neural Approach #1: Feed-forward Neural Net
  General Idea: using windows of words, predict the next word.
  ẑ = softmax(Xh + c_2) ∈ ℝ^|V|        (output layer)
  h = g(Wy + c_1)                      (hidden layer)
  y = [y_1, y_2, y_3]                  (concatenated word embeddings)
  Example input window: "She went to" → predicted next word: "class?"
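  A minimal NumPy sketch of these two equations, with made-up sizes (a 3-word window, 200-d embeddings, a 10,000-word vocabulary) and randomly initialized weights standing in for trained ones:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, H, window = 10_000, 200, 128, 3         # vocab size, embedding dim, hidden dim, window

E  = rng.normal(size=(V, d))                  # word-embedding table (one row per word)
W  = rng.normal(size=(H, window * d)) * 0.01  # embeddings -> hidden
c1 = np.zeros(H)
X  = rng.normal(size=(V, H)) * 0.01           # hidden -> output
c2 = np.zeros(V)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def ffnn_lm(word_ids):
    """P(next word | previous `window` words), following the slide's equations."""
    y = np.concatenate([E[i] for i in word_ids])  # y = [y_1, y_2, y_3]
    h = np.tanh(W @ y + c1)                       # h = g(Wy + c_1)
    return softmax(X @ h + c2)                    # ẑ = softmax(Xh + c_2) ∈ ℝ^|V|

z_hat = ffnn_lm([11, 42, 7])                      # e.g., ids for "She went to"
print(z_hat.shape, z_hat.sum())                   # (10000,) 1.0
```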

  6. Language Modelling: Feed-forward Neural Net
  Same model and equations, with the window slid one word to the right.
  Example input window: "went to class" → predicted next word: "after"

  7. Language Modelling: Feed-forward Neural Net
  Example input window: "to class after" → predicted next word: "visiting"

  8. Language Modelling: Feed-forward Neural Net
  Example input window: "class after visiting" → predicted next word: "her"

  9. Language Modelling: Feed-forward Neural Net
  Example input window: "after visiting her" → predicted next word: "grandma"

  10. Language Modelling: Feed-forward Neural Net
  FFNN STRENGTHS? FFNN ISSUES?

  11. Language Modelling: Feed-forward Neural Net
  FFNN STRENGTHS?
  • No sparsity issues (it's okay if we've never seen a segment of words)
  • No storage issues (we never store counts)
  FFNN ISSUES?
  • Fixed window size can never be big enough; we need more context
  • Increasing the window size adds many more weights
  • The weights awkwardly handle word position
  • No concept of time
  • Requires inputting the entire context just to predict one word

  12. Language Modelling
  We especially need a system that:
  • Has an "infinite" concept of the past, not just a fixed window
  • For each new input, outputs the most likely next event (e.g., word)

  13. Outline
  Language Modelling; RNNs/LSTMs +ELMo; Seq2Seq +Attention; Transformers +BERT; Conclusions


  15. Language Modelling
  IDEA: for every individual input, output a prediction. Let's use the previous hidden state, too.
  ẑ = softmax(Xh + c_2) ∈ ℝ^|V|        (output layer)
  h = g(Wy + c_1)                      (hidden layer)
  y = y_1                              (a single word embedding)
  Example input word: "She" → predicted next word: "went"

  16. Language Modelling: RNNs
  Neural Approach #2: Recurrent Neural Network (RNN)
  (Diagram: inputs y_1 ... y_4 feed the hidden layer through weights W; each hidden state feeds the next time step through recurrent weights V; the output layer produces ẑ_1 ... ẑ_4 through weights X.)

  17. Language Modelling: RNNs
  We have seen this abstract view in Lecture 15.
  The recurrent loop V conveys that the current hidden layer is influenced by the hidden layer from the previous time step.
  (Diagram: input y_t → hidden layer (with recurrent weights V) → output ẑ_t via weights X.)
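  A minimal NumPy forward pass matching this picture, reusing the notation above (W: input-to-hidden, V: the recurrent hidden-to-hidden loop, X: hidden-to-output); sizes and weights are again illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, d, H = 10_000, 200, 128

E = rng.normal(size=(vocab, d))              # embedding table
W = rng.normal(size=(H, d)) * 0.01           # input  -> hidden
V = rng.normal(size=(H, H)) * 0.01           # hidden -> hidden (the recurrent loop)
X = rng.normal(size=(vocab, H)) * 0.01       # hidden -> output
c1, c2 = np.zeros(H), np.zeros(vocab)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def rnn_lm(word_ids):
    """Return one next-word distribution per time step."""
    h = np.zeros(H)                          # initial hidden state
    outputs = []
    for i in word_ids:
        h = np.tanh(W @ E[i] + V @ h + c1)   # current hidden state uses the previous one
        outputs.append(softmax(X @ h + c2))
    return outputs

z_hats = rnn_lm([11, 42, 7, 99])             # e.g., ids for "She went to class"
print(len(z_hats), z_hats[0].shape)          # 4 (10000,)
```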

  18. RNN (review): Training Process
  At each time step t, the error is the cross-entropy between the true next word z^t and our prediction ẑ^t:
  CE(z^t, ẑ^t) = − Σ_{w ∈ V} z^t_w log(ẑ^t_w)
  (Diagram: the unrolled RNN on the input "She went to class", with per-step errors CE(z^1, ẑ^1), ..., CE(z^4, ẑ^4).)

  19. RNN (review): Training Process
  During training, regardless of our output predictions, we feed in the correct inputs (this is known as teacher forcing).

  20. RNN (review): Training Process
  Our total loss is simply the average loss across all T time steps.
  (Diagram: per-step predictions "went? after? class? over?" for the input "She went to class".)
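  A small sketch of that loss, assuming we already have one predicted distribution per step (e.g., from the rnn_lm sketch above, run with teacher forcing); against a one-hot true word, the cross-entropy reduces to the negative log-probability assigned to that word:

```python
import numpy as np

def sequence_loss(z_hats, target_ids):
    """Average cross-entropy over all T time steps.

    `z_hats` holds one predicted next-word distribution per step and `target_ids`
    holds the id of the true next word at each step.
    """
    return float(np.mean([-np.log(z_hat[t]) for z_hat, t in zip(z_hats, target_ids)]))

# Toy usage over a 5-word vocabulary and T = 3 steps:
z_hats = [np.full(5, 0.2), np.array([0.7, 0.1, 0.1, 0.05, 0.05]), np.full(5, 0.2)]
print(sequence_loss(z_hats, [3, 0, 1]))      # ≈ 1.19
```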

  21. RNN (review): Training Process
  To update our weights (e.g., V), we calculate the gradient of our loss w.r.t. the repeated weight matrix (e.g., ∂L/∂V).
  Using the chain rule, we trace the derivative all the way back to the beginning, while summing the results.

  22. RNN (review): Training Process
  (Diagram: the unrolled RNN with per-step errors CE(z^1, ẑ^1), ..., CE(z^4, ẑ^4) on the input "She went to class".)

  23. RNN (review): Training Process
  ∂L/∂V: (Diagram: backpropagating, the gradient first reaches the copy of V applied at the latest time step.)

  24. RNN (review): Training Process
  ∂L/∂V: (Diagram: ... then the copy of V applied at the time step before that ...)

  25. RNN (review): Training Process
  ∂L/∂V: (Diagram: ... and so on, back to the first time step; the per-step contributions are summed.)

  26. RNN (review)
  • This backpropagation through time (BPTT) process is expensive
  • Instead of updating after every time step, we tend to do so every T steps (e.g., every sentence or paragraph); see the sketch below
  • This isn't equivalent to using only a window of size T (a la n-grams) because we still have "infinite memory"
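  One common way to approximate this in practice: split the long token stream into length-T chunks, backpropagate only within each chunk, and let the hidden state carry over between chunks. A tf.keras sketch under those assumptions (the sizes, the SimpleRNN choice, and the random token stream are illustrative stand-ins):

```python
import numpy as np
import tensorflow as tf

VOCAB, T, BATCH = 10_000, 32, 1                 # made-up sizes; chunks of T = 32 steps

inputs = tf.keras.Input(batch_shape=(BATCH, T), dtype="int32")
x = tf.keras.layers.Embedding(VOCAB, 128)(inputs)
x = tf.keras.layers.SimpleRNN(256, return_sequences=True, stateful=True)(x)
outputs = tf.keras.layers.Dense(VOCAB, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

tokens = np.random.randint(0, VOCAB, size=5_000).astype("int32")  # stand-in token stream
for start in range(0, len(tokens) - T - 1, T):
    x_chunk = tokens[start     : start + T    ][None, :]   # inputs  y_1 .. y_T
    y_chunk = tokens[start + 1 : start + T + 1][None, :]   # targets y_2 .. y_{T+1}
    # Gradients flow only within this length-T chunk, but because the layer is
    # stateful, the hidden state itself carries over to the next chunk.
    model.train_on_batch(x_chunk, y_chunk)
```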

  27. RNN: Generation
  We can generate the most likely next event (e.g., word) by sampling from ẑ.
  Continue until we generate the <EOS> symbol (see the sketch below).
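  A minimal sampling loop in NumPy, with a tiny randomly initialized RNN standing in for a trained model and made-up <START>/<EOS> ids (a real model would produce sensible text only after training):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, H = 50, 16                                # tiny made-up sizes
E = rng.normal(size=(VOCAB, H))                  # embeddings (dimension H, for simplicity)
W, V, X = (rng.normal(size=s) * 0.1 for s in [(H, H), (H, H), (VOCAB, H)])
START, EOS = 0, 1                                # made-up special-symbol ids

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def rnn_step(h, word_id):
    """One RNN step: new hidden state and a distribution over the next word."""
    h = np.tanh(W @ E[word_id] + V @ h)
    return h, softmax(X @ h)

def generate(max_len=30):
    h, word, out = np.zeros(H), START, []
    while len(out) < max_len:
        h, z_hat = rnn_step(h, word)
        word = rng.choice(VOCAB, p=z_hat)        # sample the next word from ẑ
        if word == EOS:                          # stop once <EOS> is generated
            break
        out.append(int(word))
    return out

print(generate())                                # a list of sampled word ids
```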

  28. RNN: Generation
  We can generate the most likely next event (e.g., word) by sampling from ẑ. Continue until we generate the <EOS> symbol.
  (Diagram: the input y_1 = <START> produces the first sampled word, "Sorry".)

  29. RNN: Generation
  (Diagram: each sampled word is fed back in as the next input: <START> → "Sorry", "Sorry" → Harry, "Harry" → shouted, "shouted," → panicking.)

  30. RNN: Generation
  NOTE: the same input (e.g., "Harry") can easily yield different outputs, depending on the context (unlike FFNNs and n-grams).
  (Diagram: the same generation example as above.)

  31. RNN: Generation
  When trained on Harry Potter text, it generates:
  Source: https://medium.com/deep-writing/harry-potter-written-by-artificial-intelligence-8a9431803da6

  32. RNN: Generation
  When trained on recipes, it generates:
  Source: https://gist.github.com/nylki/1efbaa36635956d35bcc

  33. RNNs: Overview
  RNN STRENGTHS?
  • Can handle infinite-length sequences (not just a fixed window)
  • Has a "memory" of the context (thanks to the hidden layer's recurrent loop)
  • Same weights used for all inputs, so word order isn't wonky (unlike the FFNN)
  RNN ISSUES?
  • Slow to train (BPTT)
  • Due to the "infinite sequence", gradients can easily vanish or explode
  • Has trouble actually making use of long-range context


  35. RNNs: Vanishing and Exploding Gradients (review)
  ∂L_4/∂h_1 = ?
  (Diagram: the unrolled RNN on the input "She went to class", hidden states h_1 ... h_4 linked by the recurrent matrix V, with the loss CE(z^4, ẑ^4) computed at the last time step.)

  36. RNNs: Vanishing and Exploding Gradients (review)
  ∂L_4/∂h_1 = ∂L_4/∂h_4 · ...

  37. RNNs: Vanishing and Exploding Gradients (review)
  ∂L_4/∂h_1 = ∂L_4/∂h_4 · ∂h_4/∂h_3 · ...

  38. RNNs: Vanishing and Exploding Gradients (review)
  ∂L_4/∂h_1 = ∂L_4/∂h_4 · ∂h_4/∂h_3 · ∂h_3/∂h_2 · ∂h_2/∂h_1
  Each factor ∂h_t/∂h_{t-1} involves the recurrent matrix V, so this product can shrink toward zero (vanish) or blow up (explode) as it is traced back over many time steps.
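  A quick NumPy illustration of why that product misbehaves: repeatedly applying a slightly contracting or slightly expanding factor (a simple stand-in for each ∂h_t/∂h_{t-1}) drives the gradient's norm toward zero or toward infinity:

```python
import numpy as np

rng = np.random.default_rng(3)
H = 64
grad = rng.normal(size=H)                     # some upstream gradient

for scale in (0.5, 1.5):                      # slightly contracting vs. slightly expanding
    J = scale * np.eye(H)                     # stand-in for each ∂h_t/∂h_{t-1} factor
    g = grad.copy()
    for _ in range(50):                       # 50 time steps of chain-rule factors
        g = J @ g
    print(scale, np.linalg.norm(g))           # norm collapses toward 0, or blows up
```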

  39. RNNs: Vanishing and Exploding Gradients (review)
  To address RNNs' finicky nature with long-range context, we turned to an RNN variant named the LSTM (long short-term memory).
  But first, let's recap what we've learned so far.

  40. Sequential Modelling (so far)
  n-grams, e.g. P(went | She) = count(She went) / count(She)
  • Basic counts; fast (in theory)
  • Fixed window size
  • Sparsity & storage issues
  • Not robust
  FFNN
  • Robust to rare words
  • Fixed window size
  • Weirdly handles context positions
  • No "memory" of the past
  RNN
  • Handles infinite context
  • Kind of robust... almost
  • Slow
  • Difficulty with long context
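  For reference, the n-gram column's count-based estimate takes only a few lines of Python (a toy corpus stands in for real training text):

```python
from collections import Counter

corpus = "She went to class after visiting her grandma . She went home .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

# Maximum-likelihood bigram estimate: P(went | She) = count(She went) / count(She)
print(bigrams[("She", "went")] / unigrams["She"])   # 1.0 in this tiny toy corpus
```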

  41. Outline
  Language Modelling; RNNs/LSTMs +ELMo; Seq2Seq +Attention; Transformers +BERT; Conclusions


  43. Long short-term memory (LSTM)
  • A type of RNN that is designed to better handle long-range dependencies
  • In "vanilla" RNNs, the hidden state is perpetually being rewritten
  • In addition to a traditional hidden state h, let's have a dedicated memory cell c for long-term events. More power to relay sequence info.

  44. Inside an LSTM Hidden Layer
  (Diagram: the LSTM cell at time t, with memory cell states c_{t-1}, c_t, c_{t+1} and hidden states h_{t-1}, h_t, h_{t+1}.)
  Diagram: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

  45. Inside an LSTM Hidden Layer
  Some old memories are "forgotten".
  Diagram: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

  46. Inside an LSTM Hidden Layer
  Some old memories are "forgotten"; some new memories are made.
  Diagram: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

  47. Inside an LSTM Hidden Layer
  Some old memories are "forgotten"; some new memories are made.
  A nonlinear, weighted version of the long-term memory becomes our short-term memory.
  Diagram: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

  48. Inside an LSTM Hidden Layer
  Some old memories are "forgotten"; some new memories are made.
  A nonlinear, weighted version of the long-term memory becomes our short-term memory.
  Memory is written, erased, and read by three gates, which are influenced by the input y and the hidden state h.
  Diagram: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
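  A minimal NumPy sketch of one LSTM step using the standard gate equations (the weights are random stand-ins; this mirrors the diagram rather than any particular library's implementation):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(y, h_prev, c_prev, p):
    """One LSTM step: the gates decide what the memory cell forgets, writes, and exposes."""
    z = np.concatenate([y, h_prev])            # gates look at the input and the previous hidden state
    f = sigmoid(p["Wf"] @ z + p["bf"])         # forget gate: which old memories to keep
    i = sigmoid(p["Wi"] @ z + p["bi"])         # input gate: which new memories to write
    o = sigmoid(p["Wo"] @ z + p["bo"])         # output gate: what to read out
    g = np.tanh(p["Wg"] @ z + p["bg"])         # candidate new memory
    c = f * c_prev + i * g                     # some old memories forgotten, some new ones made
    h = o * np.tanh(c)                         # a nonlinear, gated view of the long-term memory
    return h, c

rng = np.random.default_rng(4)
d, H = 200, 128                                # made-up embedding and hidden sizes
p = {f"W{k}": rng.normal(size=(H, d + H)) * 0.01 for k in "fiog"}
p.update({f"b{k}": np.zeros(H) for k in "fiog"})
h, c = lstm_step(rng.normal(size=d), np.zeros(H), np.zeros(H), p)
print(h.shape, c.shape)                        # (128,) (128,)
```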

  49. Inside an LSTM Hidden Layer
  It's still possible for LSTMs to suffer from vanishing/exploding gradients, but it's way less likely than with vanilla RNNs:
  • If an RNN wishes to preserve info over long contexts, it must delicately find a recurrent weight matrix V that isn't too large or small
  • However, LSTMs have 3 separate mechanisms that adjust the flow of information (e.g., the forget gate, if turned off, will preserve all info)

  50. Long short-term memory (LSTM)
  LSTM STRENGTHS?
  • Almost always outperforms vanilla RNNs
  • Captures long-range dependencies shockingly well
  LSTM ISSUES?
  • Has more weights to learn than vanilla RNNs; thus,
  • Requires a moderate amount of training data (otherwise, vanilla RNNs are better)
  • Can still suffer from vanishing/exploding gradients

  51. Sequential Modelling
  (Diagram: the approaches so far, side by side on the example "She went to": n-grams, e.g. P(went | She) = count(She went) / count(She); the FFNN; the RNN; and now the LSTM.)

  52. Sequential Modelling
  IMPORTANT: If your goal isn't to predict the next item in a sequence, and you'd rather do some other classification or regression task using the sequence, then you can:
  • Train an aforementioned model (e.g., LSTM) as a language model
  • Use the hidden layers that correspond to each item in your sequence

  53. Sequential Modelling
  1. Train an LM to learn hidden-layer embeddings.
  2. Use those hidden-layer embeddings for other tasks (e.g., predicting a sentiment score).
  (Diagram: left, the LM predicting next words from y_1 ... y_4; right, its per-word hidden states reused as inputs to a sentiment model.)
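  A sketch of this two-step recipe in tf.keras, assuming tokenized inputs of length T; the layer names, sizes, and the commented-out training data are illustrative:

```python
import tensorflow as tf

VOCAB, T = 10_000, 20                                    # made-up sizes

# 1. A (toy) LSTM language model: predict the next word at every time step.
inp = tf.keras.Input(shape=(T,), dtype="int32")
emb = tf.keras.layers.Embedding(VOCAB, 128)(inp)
hid = tf.keras.layers.LSTM(256, return_sequences=True, name="hidden")(emb)
out = tf.keras.layers.Dense(VOCAB, activation="softmax")(hid)
lm = tf.keras.Model(inp, out)
lm.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
# lm.fit(corpus_inputs, corpus_next_words, ...)          # hypothetical unlabeled corpus

# 2. Reuse the trained hidden layer as per-word features for another task.
encoder = tf.keras.Model(inp, lm.get_layer("hidden").output)
# features = encoder.predict(my_sentences)               # shape: (n, T, 256)
```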

  54. Sequential Modelling
  Or jointly learn the hidden embeddings toward a particular task (end-to-end).
  (Diagram: inputs y_1 ... y_4 → hidden layer 1 (recurrent) → hidden layer 2 → output sentiment score.)
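  And the end-to-end alternative, sketched under the same assumptions (token-id inputs, made-up layer sizes):

```python
import tensorflow as tf

VOCAB = 10_000                                           # made-up vocabulary size
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None,), dtype="int32"),        # token ids, any sequence length
    tf.keras.layers.Embedding(VOCAB, 128),
    tf.keras.layers.LSTM(256),                           # hidden layer 1: reads the sequence
    tf.keras.layers.Dense(64, activation="relu"),        # hidden layer 2
    tf.keras.layers.Dense(1, activation="sigmoid"),      # sentiment score
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
# model.fit(padded_token_ids, sentiment_labels, ...)     # hypothetical labeled data
```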

  55. You now have the foundation for modelling sequential data.
  Most state-of-the-art advances are based on those core RNN/LSTM ideas.
  But, with tens of thousands of researchers and hackers exploring deep learning, there are many tweaks that have proven useful.
  (This is where things get crazy.)

  56. Bi-directional (review)
  (Diagram: a bi-directional RNN producing outputs Z_{t-2}, Z_{t-1}, Z_t from inputs Y_{t-2}, Y_{t-1}, Y_t; each output depends on a forward "previous state" and a backward "previous state". A compact symbol for a BRNN is also shown.)

  57. Bi-directional (review)
  RNNs/LSTMs use the left-to-right context and sequentially process data.
  If you have full access to the data at testing time, why not make use of the flow of information from right-to-left, also?

  58. RNN Extensions: Bi-directional LSTMs (review)
  For brevity, let's use the following schematic to represent an RNN.
  (Diagram: inputs y_1 ... y_4 → hidden states h_1 ... h_4.)

  59. RNN Extensions: Bi-directional LSTMs (review)
  (Diagram: the same inputs processed twice: a forward RNN producing hidden states h_1 ... h_4 left-to-right, and a backward RNN producing its own h_1 ... h_4 right-to-left.)

  60. RNN Extensions: Bi-directional LSTMs (review)
  Concatenate the hidden layers.
  (Diagram: at each time step, the forward and backward hidden states are concatenated.)

  61. RNN Extensions: Bi-directional LSTMs (review)
  (Diagram: the concatenated hidden states feed the output layer, producing ẑ_1 ... ẑ_4.)
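  In tf.keras this whole construction is wrapped in the Bidirectional layer; a sketch with made-up sizes:

```python
import tensorflow as tf

VOCAB = 10_000                                           # made-up vocabulary size
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None,), dtype="int32"),
    tf.keras.layers.Embedding(VOCAB, 128),
    # One LSTM runs left-to-right, another right-to-left; their hidden states
    # are concatenated at every time step.
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(256, return_sequences=True), merge_mode="concat"),
    tf.keras.layers.Dense(VOCAB, activation="softmax"),
])
model.summary()
```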

  62. RNN Extensions: Bi-directional LSTMs (review)
  BI-LSTM STRENGTHS?
  • Usually performs at least as well as uni-directional RNNs/LSTMs
  BI-LSTM ISSUES?
  • Slower to train
  • Only possible if access to the full data is allowed

  63. Deep RNN (review)
  LSTM units can be arranged in layers, so that the output of each unit is the input to the units in the next layer. This is called a deep RNN, where the adjective "deep" refers to these multiple layers (see the sketch below).
  • Each layer feeds the LSTM on the next layer
  • The first time step of a feature is fed to the first LSTM, which processes that data and produces an output (and a new state for itself)
  • That output is fed to the next LSTM, which does the same thing, and the next, and so on
  • Then the second time step arrives at the first LSTM, and the process repeats
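  A sketch of a two-layer (deep) LSTM stack in tf.keras; `return_sequences=True` is what passes every time step's output up to the next layer (the sizes are made-up):

```python
import tensorflow as tf

VOCAB = 10_000                                           # made-up vocabulary size
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None,), dtype="int32"),
    tf.keras.layers.Embedding(VOCAB, 128),
    # Each LSTM layer passes its full sequence of hidden states up to the next layer.
    tf.keras.layers.LSTM(256, return_sequences=True),    # layer 1
    tf.keras.layers.LSTM(256, return_sequences=True),    # layer 2
    tf.keras.layers.Dense(VOCAB, activation="softmax"),  # e.g., next-word prediction
])
model.summary()
```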
