

Lecture 16: Language Models. CS109B Data Science 2. Pavlos Protopapas, Mark Glickman, and Chris Tanner.
Outline: Language Modelling; RNNs/LSTMs + ELMo; Seq2Seq + Attention; Transformers + BERT; Conclusions.


  1. Language Modelling: neural networks
IDEA: Let's use a neural network! First, each word is represented by a word embedding (e.g., a vector of length 200).
• Each circle is a specific floating-point scalar
• Words that are more semantically similar to one another will have embeddings that are proportionally similar
• We can use pre-existing word embeddings that have been trained on gigantic corpora
[Diagram: example embedding vectors for "man", "woman", and "table"]

  2. Language Modelling: neural networks
These word embeddings are so rich that you get nice properties: king − man + woman ≈ queen
Word2vec: https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
GloVe: https://www.aclweb.org/anthology/D14-1162.pdf
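To make those "nice properties" concrete, here is a tiny numpy sketch (my own toy example, not from the slides): cosine similarity between made-up 4-dimensional embeddings, and the analogy answered by a nearest-neighbour search. A real system would load pre-trained Word2vec or GloVe vectors instead.

```python
# Toy illustration: cosine similarity and the king - man + woman ~ queen analogy.
# The 4-dimensional vectors below are invented; real embeddings are learned.
import numpy as np

emb = {
    "king":  np.array([0.9, 0.7, 0.1, 0.0]),
    "queen": np.array([0.9, 0.1, 0.7, 0.0]),
    "man":   np.array([0.1, 0.8, 0.1, 0.0]),
    "woman": np.array([0.1, 0.1, 0.8, 0.0]),
    "table": np.array([0.0, 0.0, 0.0, 0.9]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Semantically similar words get a higher cosine similarity ...
print(cosine(emb["man"], emb["woman"]), cosine(emb["man"], emb["table"]))

# ... and the analogy is answered by a nearest-neighbour search over the vocabulary.
target = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(emb[w], target))
print(best)   # "queen" with these toy vectors
```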

  3. Language Modelling: neural networks
How can we use these embeddings to build a LM? Remember, we only need a system that can estimate:
P(x_{t+1} | x_t, x_{t−1}, …, x₁)
(the next word, given the previous words)
Example input sentence: "She went to" → class
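As a reminder of what that conditional buys us, here is a small illustrative sketch (the probabilities are invented, not from the course): summing the logs of P(next word | history) scores an entire sentence.

```python
# A language model is anything that gives P(next word | previous words);
# chaining those conditionals scores a whole sentence. Probabilities are made up.
import math

def p_next(word, history):
    """Hypothetical conditional distribution P(word | history)."""
    table = {
        (): {"She": 0.2},
        ("She",): {"went": 0.4},
        ("She", "went"): {"to": 0.5},
        ("She", "went", "to"): {"class": 0.1},
    }
    return table.get(tuple(history), {}).get(word, 1e-6)

sentence = ["She", "went", "to", "class"]
log_prob = sum(math.log(p_next(w, sentence[:i])) for i, w in enumerate(sentence))
print(f"log P(sentence) = {log_prob:.3f}")   # sum of the log-conditionals
```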

  4. Language Modelling: Feed-forward Neural Net
Neural Approach #1: Feed-forward Neural Net
General idea: using windows of words, predict the next word.
[Diagram: a hidden layer and a softmax output layer over the example input "She went to", predicting "class?"]

  5. Language Modelling: Feed-forward Neural Net
Neural Approach #1: Feed-forward Neural Net
General idea: using windows of words, predict the next word.
Input: x = [x₁, x₂, x₃] (the window's concatenated word embeddings)
Hidden layer: h = f(Wx + b₁)
Output layer: ŷ = softmax(Uh + b₂) ∈ ℝ^|V|
Example input sentence: "She went to" → class?

  6. Language Modelling: Feed-forward Neural Net
Neural Approach #1, continued: slide the window forward. Input window "went to class" (concatenated embeddings) → predicted next word: after

  7. Language Modelling: Feed-forward Neural Net
Input window "to class after" → predicted next word: visiting

  8. Language Modelling: Feed-forward Neural Net
Input window "class after visiting" → predicted next word: her

  9. Language Modelling: Feed-forward Neural Net
Input window "after visiting her" → predicted next word: grandma
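A minimal numpy sketch of the window-based FFNN language model above, following the formulas h = f(Wx + b₁) and ŷ = softmax(Uh + b₂). The vocabulary, sizes, and random weights are placeholders; a real model would learn them from data.

```python
# Window-based feed-forward LM: concatenate the window's word embeddings,
# apply one hidden layer, then a softmax over the vocabulary.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["She", "went", "to", "class", "after", "visiting", "her", "grandma"]
V, d, window, hidden = len(vocab), 200, 3, 128

E = rng.normal(size=(V, d))                 # word-embedding lookup table
W = rng.normal(size=(hidden, window * d))   # input-to-hidden weights
b1 = np.zeros(hidden)
U = rng.normal(size=(V, hidden))            # hidden-to-output weights
b2 = np.zeros(V)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def predict_next(words):
    idx = [vocab.index(w) for w in words]
    x = np.concatenate([E[i] for i in idx])   # x = [x1, x2, x3]
    h = np.tanh(W @ x + b1)                   # h = f(Wx + b1)
    y_hat = softmax(U @ h + b2)               # y_hat = softmax(Uh + b2), in R^|V|
    return vocab[int(np.argmax(y_hat))]

print(predict_next(["She", "went", "to"]))    # untrained, so the guess is arbitrary
```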

  10. Language Modelling: Feed-forward Neural Net
FFNN strengths? FFNN issues?

  11. Language Modelling: Feed-forward Neural Net
FFNN STRENGTHS?
• No sparsity issues (it's okay if we've never seen a segment of words)
• No storage issues (we never store counts)
FFNN ISSUES?
• A fixed window size can never be big enough; we need more context
• Increasing the window size adds many more weights
• The weights awkwardly handle word position
• No concept of time
• Requires inputting the entire context just to predict one word

  12. Language Modelling
We especially need a system that:
• Has an "infinite" concept of the past, not just a fixed window
• For each new input, outputs the most likely next event (e.g., word)

  13. Outline
Language Modelling; RNNs/LSTMs + ELMo; Seq2Seq + Attention; Transformers + BERT; Conclusions


  15. Language Modelling
IDEA: for every individual input, output a prediction, and let's use the previous hidden state, too.
[Diagram: a single word embedding for "She" feeds the hidden layer h = f(Wx + b₁); the output layer ŷ = softmax(Uh + b₂) predicts "went"]

  16. Language Modelling: RNNs
Neural Approach #2: Recurrent Neural Network (RNN)
[Diagram: the network unrolled over four time steps. Inputs x₁…x₄ feed the hidden layer through shared weights W; recurrent weights V connect each hidden state to the next; output weights U produce ŷ₁…ŷ₄]

  17. Language Modelling: RNNs
We have seen this abstract view in Lecture 15.
The recurrent loop V conveys that the current hidden layer is influenced by the hidden layer from the previous time step.
[Diagram: a single RNN cell with input x_t, a hidden layer, output ŷ_t, and a recurrent loop labelled V]

  18. RNN (review): Training Process
Loss at each time step: CE(y_t, ŷ_t) = −Σ_{w ∈ V} y_{t,w} log(ŷ_{t,w}), where y_t is the one-hot true next word and ŷ_t the predicted distribution.
[Diagram: the RNN unrolled over "She went to class", with an error CE(y_t, ŷ_t) computed at every output ŷ₁…ŷ₄]

  19. RNN (review): Training Process
During training, regardless of our output predictions, we feed in the correct inputs.
[Diagram: the same unrolled RNN; each step receives the true word from "She went to class" even if the previous prediction was wrong]

  20. RNN (review): Training Process
Our total loss is simply the average loss across all T time steps.
[Diagram: per-step errors CE(y₁, ŷ₁) … CE(y₄, ŷ₄) over the inputs "She went to class", with the model's step-by-step predictions shown as went? after? class? over?]
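A short numpy sketch of this training objective (my own illustration): the per-step cross-entropy against the true next word, averaged over the T steps. The predicted distributions below are invented.

```python
# Cross-entropy at each time step, averaged over the sequence.
import numpy as np

def cross_entropy(true_idx, y_hat):
    """CE(y_t, y_hat_t) = -log of the probability assigned to the true word."""
    return -np.log(y_hat[true_idx] + 1e-12)

# Hypothetical model outputs for "She went to class": a distribution over a
# 5-word vocabulary at each step, plus the index of the true next word.
vocab = ["She", "went", "to", "class", "after"]
y_hats = np.array([
    [0.1, 0.6, 0.1, 0.1, 0.1],   # after "She"  -> true next: "went"
    [0.1, 0.1, 0.5, 0.2, 0.1],   # after "went" -> true next: "to"
    [0.1, 0.1, 0.1, 0.4, 0.3],   # after "to"   -> true next: "class"
])
true_next = [1, 2, 3]

losses = [cross_entropy(t, y) for t, y in zip(true_next, y_hats)]
print(np.mean(losses))           # total loss = average over the T time steps
```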

  21. RNN (review): Training Process
To update our weights (e.g., V), we calculate the gradient of our loss w.r.t. the repeated weight matrix (e.g., ∂L/∂V).
Using the chain rule, we trace the derivative all the way back to the beginning, while summing the results.
[Diagram: the same unrolled RNN over "She went to class"]

  22. RNN (review): Training Process
[Diagram: the unrolled RNN over "She went to class", with the per-step errors and the model's predictions (went? after? class? over?)]

  23. RNN (review): Training Process
∂L/∂V
[Diagram: the chain rule traced back through the most recent application of the recurrent weight V]

  24. RNN (review): Training Process
∂L/∂V
[Diagram: the chain rule traced back through the last two applications of V]

  25. RNN (review): Training Process
∂L/∂V
[Diagram: the chain rule traced back through every application of V, all the way to the first time step]

  26. RNN (review)
• This backpropagation through time (BPTT) process is expensive
• Instead of updating after every time step, we tend to do so every T steps (e.g., every sentence or paragraph)
• This isn't equivalent to using only a window of size T (a la n-grams) because we still have "infinite memory"
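A hedged PyTorch sketch of the "update every T steps" idea (truncated BPTT). PyTorch is my choice here, not the course's, and the model, data, and sizes are placeholders. The point is that the hidden state is carried across chunks but detached, so gradients flow back only T steps while the "infinite memory" is kept.

```python
# Truncated BPTT: one parameter update per chunk of T steps, carrying the
# hidden state across chunks but cutting the gradient at each boundary.
import torch
import torch.nn as nn

T, d, hidden, vocab_size = 20, 32, 64, 1000
rnn = nn.RNN(d, hidden, batch_first=True)
head = nn.Linear(hidden, vocab_size)
opt = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()))
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(1, 200, d)                     # one long dummy input sequence
targets = torch.randint(vocab_size, (1, 200))  # dummy "next word" indices

h = None
for start in range(0, 200, T):                 # one update per chunk of T steps
    chunk, tgt = x[:, start:start + T], targets[:, start:start + T]
    out, h = rnn(chunk, h)
    loss = loss_fn(head(out).reshape(-1, vocab_size), tgt.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    h = h.detach()                             # keep the memory, cut the gradient
```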

  27. RNN: Generation
We can generate the most likely next event (e.g., word) by sampling from ŷ. Continue until we generate the <EOS> symbol.

  28. RNN: Generation
We can generate the next word by sampling from ŷ; continue until we generate the <EOS> symbol.
[Diagram: the first step: input <START>, sampled output "Sorry"]

  29. RNN: Generation
[Diagram: generation continues by feeding each sampled word back in: <START> → "Sorry" → Harry → shouted, → panicking]
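A toy sketch of this generation loop (the tiny vocabulary and all probabilities are invented): sample the next word from ŷ, feed it back in, and stop at <EOS>.

```python
# Sampling loop: feed the previous word back in and sample until <EOS>.
import numpy as np

rng = np.random.default_rng(1)
vocab = ["<START>", "\"Sorry\"", "Harry", "shouted,", "panicking", "<EOS>"]

def step(prev_word, hidden_state):
    """Stand-in for the RNN cell: returns a next-word distribution and a new state."""
    y_hat = rng.dirichlet(np.ones(len(vocab)))     # pretend softmax output
    return y_hat, hidden_state

word, state, generated = "<START>", None, []
while word != "<EOS>" and len(generated) < 20:     # cap the length for safety
    y_hat, state = step(word, state)
    word = rng.choice(vocab, p=y_hat)              # sample the next word from y_hat
    generated.append(word)

print(" ".join(generated))
```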

  30. RNN: Generation
NOTE: the same input (e.g., "Harry") can easily yield different outputs, depending on the context (unlike FFNNs and n-grams).
[Diagram: the same generation example as on the previous slide]

  31. RNN: Generation
When trained on Harry Potter text, it generates:
Source: https://medium.com/deep-writing/harry-potter-written-by-artificial-intelligence-8a9431803da6

  32. RNN: Generation
When trained on recipes:
Source: https://gist.github.com/nylki/1efbaa36635956d35bcc

  33. RNNs: Overview
RNN STRENGTHS?
• Can handle infinite-length sequences (not just a fixed window)
• Has a "memory" of the context (thanks to the hidden layer's recurrent loop)
• The same weights are used for all inputs, so word position isn't handled awkwardly (unlike the FFNN)
RNN ISSUES?
• Slow to train (BPTT)
• Due to the "infinite sequence", gradients can easily vanish or explode
• Has trouble actually making use of long-range context


  35. RNNs: Vanishing and Exploding Gradients (review)
How does the loss at the last time step depend on the weights applied at the first? ∂L₄/∂V₁ = ?
[Diagram: the unrolled RNN over "She went to class", with the loss CE(y₄, ŷ₄) at the final step; V₁, V₂, V₃ denote the copies of the recurrent weight matrix applied at each step]

  36. RNNs: Vanishing and Exploding Gradients (review)
∂L₄/∂V₁ = ∂L₄/∂V₃ · …
[Diagram: the chain rule starts at the last step and is traced backwards]

  37. RNNs: Vanishing and Exploding Gradients (review)
∂L₄/∂V₁ = ∂L₄/∂V₃ · ∂V₃/∂V₂ · …

  38. RNNs: Vanishing and Exploding Gradients (review)
∂L₄/∂V₁ = ∂L₄/∂V₃ · ∂V₃/∂V₂ · ∂V₂/∂V₁
The gradient is a product with one factor per step traced back, so over long sequences it can shrink toward zero (vanish) or blow up (explode).
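To see why such a product is fragile, here is a small numpy experiment (my own illustration, not from the slides): repeatedly multiplying a gradient vector by a recurrent-style matrix makes its norm shrink or grow exponentially, depending on the matrix's scale. All sizes and values are arbitrary.

```python
# Repeated multiplication by a recurrent-like matrix: the gradient norm
# vanishes or explodes exponentially with the number of steps traced back.
import numpy as np

rng = np.random.default_rng(0)

def gradient_norm_after(T, scale):
    V = scale * rng.normal(size=(64, 64)) / np.sqrt(64)   # recurrent-like matrix
    g = np.ones(64)                                       # gradient at the last step
    for _ in range(T):                                    # chain rule back T steps
        g = V.T @ g
    return np.linalg.norm(g)

for scale in (0.5, 1.0, 1.5):
    print(scale, [round(gradient_norm_after(T, scale), 3) for T in (1, 10, 50)])
# small scale -> norms vanish toward 0; large scale -> they explode
```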

  39. RNNs: Vanishing and Exploding Gradients (review)
To address RNNs' finicky nature with long-range context, we turned to an RNN variant, the LSTM (long short-term memory).
But first, let's recap what we've learned so far.

  40. Sequential Modelling (so far)
n-grams, e.g. P(went | She) = count(She went) / count(She)
• Strengths: basic counts; fast (in theory)
• Issues: fixed window size; sparsity & storage issues; not robust
FFNN
• Strengths: robust to rare words
• Issues: fixed window size; weirdly handles context positions; no "memory" of the past
RNN
• Strengths: handles infinite context; kind of robust… almost
• Issues: slow; difficulty with long context

  41. Outline
Language Modelling; RNNs/LSTMs + ELMo; Seq2Seq + Attention; Transformers + BERT; Conclusions


  43. Long short-term memory (LSTM)
• A type of RNN that is designed to better handle long-range dependencies
• In "vanilla" RNNs, the hidden state is perpetually being rewritten
• In addition to a traditional hidden state h, let's have a dedicated memory cell c for long-term events: more capacity to relay sequence information

  44. Inside an LSTM Hidden Layer
[Diagram: an LSTM cell with cell states c_{t−1}, c_t, c_{t+1} along the top and hidden states h_{t−1}, h_t, h_{t+1} along the bottom]
Diagram: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

  45. Inside an LSTM Hidden Layer
Some old memories are "forgotten".
Diagram: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

  46. Inside an LSTM Hidden Layer
Some old memories are "forgotten"; some new memories are made.
Diagram: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

  47. Inside an LSTM Hidden Layer
Some old memories are "forgotten"; some new memories are made; a nonlinear, weighted version of the long-term memory becomes our short-term memory.
Diagram: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

  48. Inside an LSTM Hidden Layer
Memory is written, erased, and read by three gates, which are influenced by the input x and the hidden state h.
(Some old memories are "forgotten"; some new memories are made; a nonlinear, weighted version of the long-term memory becomes our short-term memory.)
Diagram: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
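A numpy sketch of one LSTM step matching the description above: a forget gate erases old memories, an input gate writes new ones, and an output gate reads a squashed version of the cell state into the short-term state h. The gate weight names (Wf, Wi, Wo, Wc), the sizes, and the omission of biases are my simplifications, not the course's notation.

```python
# One LSTM step: forget, write, and read the cell state via three gates.
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 16                                   # input and hidden sizes (toy)
Wf, Wi, Wo, Wc = (rng.normal(size=(n, d + n)) for _ in range(4))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])          # gates depend on x_t and h_{t-1}
    f = sigmoid(Wf @ z)                        # forget gate: what to erase
    i = sigmoid(Wi @ z)                        # input gate: what to write
    o = sigmoid(Wo @ z)                        # output gate: what to read
    c_tilde = np.tanh(Wc @ z)                  # candidate new memories
    c_t = f * c_prev + i * c_tilde             # long-term cell state
    h_t = o * np.tanh(c_t)                     # short-term hidden state
    return h_t, c_t

h, c = np.zeros(n), np.zeros(n)
for _ in range(5):                             # run a few steps on random inputs
    h, c = lstm_step(rng.normal(size=d), h, c)
print(h.shape, c.shape)
```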

  49. Inside an LSTM Hidden Layer
It's still possible for LSTMs to suffer from vanishing/exploding gradients, but it's far less likely than with vanilla RNNs:
• If an RNN wishes to preserve information over long contexts, it must delicately find a recurrent weight matrix V that isn't too large or too small
• LSTMs, however, have three separate mechanisms that adjust the flow of information (e.g., the forget gate, if turned off, will preserve all information)

  50. Long short-term memory (LSTM)
LSTM STRENGTHS?
• Almost always outperforms vanilla RNNs
• Captures long-range dependencies shockingly well
LSTM ISSUES?
• Has more weights to learn than vanilla RNNs; thus,
• Requires a moderate amount of training data (otherwise, vanilla RNNs are better)
• Can still suffer from vanishing/exploding gradients

  51. Sequential Modelling
[Diagram: the four approaches side by side on "She went to class": n-grams (P(went | She) = count(She went) / count(She)), FFNN, RNN, and LSTM]

  52. Sequential Modelling
IMPORTANT: If your goal isn't to predict the next item in a sequence, and you'd rather do some other classification or regression task using the sequence, then you can:
• Train one of the aforementioned models (e.g., an LSTM) as a language model
• Use the hidden layers that correspond to each item in your sequence

  53. Sequential Modelling
1. Train an LM to learn hidden-layer embeddings. 2. Use those hidden-layer embeddings for other tasks (e.g., a sentiment score).
[Diagram: left, an LSTM language model producing ŷ₁…ŷ₄; right, its per-word hidden states reused as inputs to a model that outputs a sentiment score]
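A hedged PyTorch sketch of that two-stage recipe. Nothing is actually pre-trained here: the LSTM simply stands in for one already trained as a language model, and only the small sentiment head would be trained on the new task. Names and sizes are placeholders.

```python
# Stage 2 only: reuse frozen LM hidden states as features for a sentiment head.
import torch
import torch.nn as nn

vocab_size, d, hidden = 1000, 64, 128
embed = nn.Embedding(vocab_size, d)
lstm = nn.LSTM(d, hidden, batch_first=True)    # pretend this was trained as an LM

sentiment_head = nn.Linear(hidden, 1)          # the only part trained for the new task

tokens = torch.randint(vocab_size, (1, 12))    # one dummy 12-word sentence
with torch.no_grad():                          # freeze the LM: features only
    states, _ = lstm(embed(tokens))            # one hidden state per word
score = torch.sigmoid(sentiment_head(states.mean(dim=1)))   # pool, then classify
print(score.item())
```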

  54. Sequential Modelling
Or jointly learn the hidden embeddings toward a particular task (end-to-end).
[Diagram: the word inputs feed a recurrent hidden layer 1, whose states feed hidden layer 2, which outputs a sentiment score]
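And, by contrast, a minimal end-to-end sketch (same placeholder sizes as above): here the embeddings, the LSTM, and the sentiment head are all updated by the task loss in a single backward pass.

```python
# End-to-end: one loss updates embeddings, LSTM, and the sentiment head jointly.
import torch
import torch.nn as nn

class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size=1000, d=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.lstm = nn.LSTM(d, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, tokens):
        states, _ = self.lstm(self.embed(tokens))
        return torch.sigmoid(self.head(states[:, -1]))   # read off the last hidden state

model = SentimentLSTM()
opt = torch.optim.Adam(model.parameters())
tokens = torch.randint(1000, (4, 12))          # dummy batch: 4 sentences, 12 tokens each
labels = torch.rand(4, 1).round()              # dummy 0/1 sentiment labels
loss = nn.functional.binary_cross_entropy(model(tokens), labels)
opt.zero_grad(); loss.backward(); opt.step()   # one joint update of all components
print(loss.item())
```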

  55. You now have the foundation for modelling sequential data. Most state-of-the-art advances are based on those core RNN/LSTM ideas. But, with tens of thousands of researchers and hackers exploring deep learning, there are many tweaks that have proven useful. (This is where things get crazy.)

  56. Bi-directional (review)
[Diagram: schematic symbol for a bi-directional RNN (BRNN): each output depends on hidden states flowing both left-to-right and right-to-left over the inputs]

  57. Bi-directional (review)
RNNs/LSTMs use the left-to-right context and sequentially process data.
If you have full access to the data at testing time, why not make use of the flow of information from right to left, also?

  58. RNN Extensions: Bi-directional LSTMs (review)
For brevity, let's use the following schematic to represent an RNN.
[Diagram: inputs x₁…x₄ feeding hidden states h₁…h₄, drawn as a single row]

  59. RNN Extensions: Bi-directional LSTMs (review)
[Diagram: two such rows over the same inputs: a forward RNN producing hidden states left-to-right, and a backward RNN producing hidden states right-to-left]

  60. RNN Extensions: Bi-directional LSTMs (review)
Concatenate the hidden layers.
[Diagram: at each position, the forward and backward hidden states are concatenated]

  61. RNN Extensions: Bi-directional LSTMs (review)
[Diagram: the concatenated hidden states feed the output layer, producing ŷ₁…ŷ₄]
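A short PyTorch sketch of the bi-directional idea: one pass left-to-right, one right-to-left, and the two hidden states concatenated at every position. Setting bidirectional=True does this bookkeeping internally; the sizes are arbitrary.

```python
# Bi-directional LSTM: forward and backward hidden states, concatenated per step.
import torch
import torch.nn as nn

d, hidden = 64, 128
bilstm = nn.LSTM(d, hidden, batch_first=True, bidirectional=True)

x = torch.randn(1, 10, d)                # a dummy 10-step input sequence
states, _ = bilstm(x)
print(states.shape)                      # (1, 10, 2 * hidden): forward ++ backward
```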

  62. RNN Extensions: Bi-directional LSTMs (review)
BI-LSTM STRENGTHS?
• Usually performs at least as well as uni-directional RNNs/LSTMs
BI-LSTM ISSUES?
• Slower to train
• Only possible if access to the full data is allowed

  63. Deep RNN (review)
LSTM units can be arranged in layers, so that the output of each unit is the input to the units in the next layer. This is called a deep RNN, where the adjective "deep" refers to these multiple layers.
• Each layer feeds the LSTM on the next layer
• The first time step of a feature is fed to the first LSTM, which processes that data and produces an output (and a new state for itself)
• That output is fed to the next LSTM, which does the same thing, and the next, and so on
• Then the second time step arrives at the first LSTM, and the process repeats
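A minimal sketch of that stacked arrangement (sizes are toy): num_layers=2 asks PyTorch to feed each layer's hidden states into the next layer at every time step, exactly the stacking described above.

```python
# Deep (stacked) LSTM: each layer's hidden states become the next layer's inputs.
import torch
import torch.nn as nn

deep_lstm = nn.LSTM(input_size=64, hidden_size=128, num_layers=2, batch_first=True)

x = torch.randn(1, 10, 64)               # dummy sequence
states, (h_n, c_n) = deep_lstm(x)
print(states.shape)                      # (1, 10, 128): the top layer's outputs
print(h_n.shape)                         # (2, 1, 128): final hidden state per layer
```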
