Language Modelling: neural networks
IDEA: Let's use a neural network!
First, each word is represented by a word embedding (e.g., a vector of length 200).
• Each circle is a specific floating-point scalar
• Words that are more semantically similar to one another will have embeddings that are proportionally similar
• We can use pre-existing word embeddings that have been trained on gigantic corpora
(Figure: example embedding vectors for "man", "woman", and "table")
Language Modelling: neural networks
These word embeddings are so rich that you get nice properties:
king − man + woman ≈ queen
Word2vec: https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
GloVe: https://www.aclweb.org/anthology/D14-1162.pdf
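As a concrete (hedged) illustration: with the gensim library and a set of pre-trained vectors such as GloVe, the analogy can be queried directly. The specific model name below is an assumption about what is available for download, not something from the lecture.

```python
# Sketch: querying the king - man + woman analogy with pre-trained embeddings.
# Assumes gensim is installed and the named GloVe vectors can be downloaded.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # assumed pre-trained GloVe vectors

# most_similar adds the 'positive' vectors, subtracts the 'negative' ones,
# and returns the nearest words by cosine similarity.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' is expected to rank at or near the top.
```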
Language Modelling: neural networks
How can we use these embeddings to build a LM? Remember, we only need a system that can estimate:
P(y_{t+1} | y_t, y_{t-1}, …, y_1)
(the next word, given the previous words)
Example input sentence: "She went to ___" → "class"
Language Modelling: Feed-forward Neural Net
Neural Approach #1: Feed-forward Neural Net
General Idea: using windows of words, predict the next word
(Diagram: input "She went to" → hidden layer h → output layer ŷ → "class?")
Language Modelling: Feed-forward Neural Net
Neural Approach #1: Feed-forward Neural Net
General Idea: using windows of words, predict the next word
Output layer: ŷ = softmax(U h + b₂), ŷ ∈ ℝ^|V|
Hidden layer: h = f(W e + b₁)
Input: e = [e₁; e₂; e₃] (concatenated word embeddings)
Example input: "She went to" → predicted next word: "class?"
Language Modelling: Feed-forward Neural Net
Sliding the window across the sentence, the same network predicts each next word in turn:
• "went to class" → "after"
• "to class after" → "visiting"
• "class after visiting" → "her"
• "after visiting her" → "grandma"
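A minimal PyTorch sketch of this fixed-window model (the layer sizes and the window length of 3 are illustrative assumptions, not the lecture's exact setup): three word embeddings are concatenated, passed through one hidden layer, and a softmax over the vocabulary scores the next word.

```python
import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    """Feed-forward LM: predict the next word from a window of the previous 3 words."""
    def __init__(self, vocab_size, embed_dim=200, hidden_dim=256, window=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(window * embed_dim, hidden_dim)  # h = f(W e + b1)
        self.out = nn.Linear(hidden_dim, vocab_size)             # U h + b2

    def forward(self, window_ids):                # window_ids: (batch, window)
        e = self.embed(window_ids).flatten(1)     # concatenated word embeddings
        h = torch.tanh(self.hidden(e))
        return torch.log_softmax(self.out(h), dim=-1)  # log P(next word)

# Usage sketch: score the next word after a 3-word window (made-up word ids).
model = FixedWindowLM(vocab_size=10_000)
logp = model(torch.tensor([[11, 42, 7]]))
print(logp.shape)                                 # torch.Size([1, 10000])
```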
Language Modelling: Feed-forward Neural Net
FFNN STRENGTHS? FFNN ISSUES?
Language Modelling: Feed-forward Neural Net
FFNN STRENGTHS?
• No sparsity issues (it's okay if we've never seen a segment of words)
• No storage issues (we never store counts)
FFNN ISSUES?
• Fixed window size can never be big enough; we need more context
• Increasing the window size adds many more weights
• The weights awkwardly handle word position
• No concept of time
• Requires inputting the entire context just to predict one word
Language Modelling
We especially need a system that:
• Has an "infinite" concept of the past, not just a fixed window
• For each new input, outputs the most likely next event (e.g., word)
Outline
• Language Modelling
• RNNs/LSTMs + ELMo
• Seq2Seq + Attention
• Transformers + BERT
• Conclusions
Language Modelling
IDEA: for every individual input, output a prediction. Let's use the previous hidden state, too.
Output layer: ŷ = softmax(U h + b₂), ŷ ∈ ℝ^|V|
Hidden layer: h = f(W e + b₁)
Input: e = e₁ (a single word embedding)
Example: input word "She" → predicted next word: "went"
Language Modelling: RNNs
Neural Approach #2: Recurrent Neural Network (RNN)
(Diagram: an unrolled RNN — input layer e₁ … e₄, hidden layer h₁ … h₄ connected through time by a shared recurrent weight matrix, and output layer ŷ₁ … ŷ₄)
Language Modelling: RNNs
We have seen this abstract view in Lecture 15.
(Diagram: a single RNN cell with input y_t, hidden layer, and output ŷ_t)
The recurrent loop V conveys that the current hidden layer is influenced by the hidden layer from the previous time step.
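A compact PyTorch sketch of Approach #2 (layer sizes are assumptions): the same weights are applied at every time step, and the hidden state carries the context forward, so the model emits one next-word distribution per position.

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    """Recurrent LM: one prediction per time step, hidden state carried forward."""
    def __init__(self, vocab_size, embed_dim=200, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)  # shared recurrent weights
        self.out = nn.Linear(hidden_dim, vocab_size)                # shared output weights

    def forward(self, token_ids, h0=None):       # token_ids: (batch, seq_len)
        e = self.embed(token_ids)
        h_all, h_last = self.rnn(e, h0)          # h_all: (batch, seq_len, hidden)
        return self.out(h_all), h_last           # logits for every position + final state

# "She went to class" (made-up ids) -> a next-word distribution at every position.
model = RNNLM(vocab_size=10_000)
logits, h = model(torch.tensor([[11, 42, 7, 93]]))
print(logits.shape)                              # torch.Size([1, 4, 10000])
```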
RNN (review)
Training Process
At each time step we compare the true next word y_t with our prediction ŷ_t using a cross-entropy error:
CE(y_t, ŷ_t) = − Σ_{w ∈ V} y_{t,w} log(ŷ_{t,w})
(Diagram: the unrolled RNN over "She went to class", with an error CE(y_t, ŷ_t) attached to each output ŷ₁ … ŷ₄)
RNN (review)
Training Process
During training, regardless of our output predictions, we feed in the correct inputs at each time step.
(Same training diagram as above.)
RNN (review)
Training Process
Our total loss is simply the average loss across all T time steps.
(Diagram: at one time step, the output ŷ scores candidate next words — "went? after? class? over?")
RNN (review)
Training Process
To update our weights (e.g., W), we calculate the gradient of our loss w.r.t. the repeated weight matrix (e.g., ∂L/∂W).
Using the chain rule, we trace the derivative all the way back to the beginning, while summing the results.
RNN (review)
Training Process
(Diagram, shown step by step: backpropagation through time — the gradient ∂L/∂W flows backward from the error at the last time step through h₃, then h₂, then h₁, and the contributions from every step are summed)
RNN (review)
• This backpropagation through time (BPTT) process is expensive
• Instead of updating after every time step, we tend to do so every T steps (e.g., every sentence or paragraph)
• This isn't equivalent to using only a window of size T (à la n-grams) because we still have "infinite memory"
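A hedged sketch of that training loop with truncation, written against the RNN sketch above (the chunk length, optimizer, and corpus tensor are placeholders I am assuming): the hidden state is carried across chunks so the "memory" is not cut off, but `detach()` stops gradients from flowing past the chunk boundary, which is what keeps BPTT affordable.

```python
import torch
import torch.nn as nn

def train_truncated_bptt(model, corpus_ids, tau=32, epochs=1, lr=1e-3):
    """corpus_ids: LongTensor of shape (1, total_len) holding the training tokens."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        h = None
        for start in range(0, corpus_ids.size(1) - tau - 1, tau):
            x = corpus_ids[:, start : start + tau]            # inputs
            y = corpus_ids[:, start + 1 : start + tau + 1]    # targets = the next words
            logits, h = model(x, h)
            h = h.detach()             # keep the memory, but cut the gradient here
            loss = loss_fn(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
            opt.zero_grad()
            loss.backward()            # BPTT only through the last tau steps
            opt.step()
```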
RNN: Generation
We can generate the most likely next event (e.g., word) by sampling from ŷ.
Continue until we generate the <EOS> symbol.
RNN: Generation
We can generate the most likely next event (e.g., word) by sampling from ŷ_t. Continue until we generate the <EOS> symbol.
(Diagram: input <START> → hidden layer → sampled output "Sorry")
RNN: Generation
(Diagram: each sampled word is fed back in as the next input — <START> → "Sorry", "Sorry" → Harry, "Harry" → shouted, "shouted," → panicking)
RNN: Generation
NOTE: the same input (e.g., "Harry") can easily yield different outputs, depending on the context (unlike FFNNs and n-grams).
(Same generation diagram as above.)
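A sketch of that generation loop, assuming a model with the same `(logits, hidden)` interface as the RNN sketch above (the <START>/<EOS> token ids and the temperature are assumptions): feed back whatever was sampled, and keep the hidden state so the same word can behave differently in different contexts.

```python
import torch

@torch.no_grad()
def generate(model, start_id, eos_id, max_len=50, temperature=1.0):
    tokens, h = [start_id], None
    for _ in range(max_len):
        x = torch.tensor([[tokens[-1]]])              # only the last generated token
        logits, h = model(x, h)                       # hidden state carries the context
        probs = torch.softmax(logits[0, -1] / temperature, dim=-1)
        next_id = torch.multinomial(probs, 1).item()  # sample from P(next word)
        tokens.append(next_id)
        if next_id == eos_id:                         # stop at the <EOS> symbol
            break
    return tokens
```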
RNN: Generation
When trained on Harry Potter text, it generates:
Source: https://medium.com/deep-writing/harry-potter-written-by-artificial-intelligence-8a9431803da6
RNN: Generation
When trained on recipes, it generates:
Source: https://gist.github.com/nylki/1efbaa36635956d35bcc
RNNs: Overview
RNN STRENGTHS?
• Can handle infinite-length sequences (not just a fixed window)
• Has a "memory" of the context (thanks to the hidden layer's recurrent loop)
• Same weights used for all inputs, so word order isn't wonky (unlike the FFNN)
RNN ISSUES?
• Slow to train (BPTT)
• Due to the "infinite sequence", gradients can easily vanish or explode
• Has trouble actually making use of long-range context
RNNs: Vanishing and Exploding Gradients (review)
How does the error at time step 4 reach the weights applied at time step 1 (call that copy of the recurrent weight matrix W₁ in the unrolled view)? Expanding with the chain rule, the gradient is a product with one factor per intervening time step:
∂L₄/∂W₁ = (∂L₄/∂h₄)(∂h₄/∂h₃)(∂h₃/∂h₂)(∂h₂/∂h₁)(∂h₁/∂W₁)
(Diagram, shown step by step: the gradient path from the error CE(y₄, ŷ₄) back through h₃, h₂, h₁ on the example "She went to class")
The longer the sequence, the more factors are multiplied together, so the product can easily shrink toward zero (vanish) or blow up (explode).
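A tiny numpy illustration of why that product misbehaves (the matrices here are random stand-ins for the per-step Jacobians, not trained weights): scaling the recurrent matrix slightly below or above a spectral radius of 1 makes the accumulated factor collapse or blow up after a few dozen steps.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((50, 50))
W /= np.max(np.abs(np.linalg.eigvals(W)))   # normalise the spectral radius to 1

for scale in (0.9, 1.1):                    # slightly "too small" vs "too large"
    J = np.eye(50)
    for _ in range(60):                     # 60 time steps of chain-rule factors
        J = J @ (scale * W)                 # stand-in for dh_t/dh_{t-1}
    print(scale, np.linalg.norm(J))         # tiny norm (vanishes) vs huge norm (explodes)
```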
RNNs: Vanishing and Exploding Gradients (review)
To address RNNs' finicky nature with long-range context, we turned to an RNN variant called the LSTM (long short-term memory).
But first, let's recap what we've learned so far.
Sequential Modelling (so far)
n-grams — e.g., P(went | she) = count(she went) / count(she)
• Basic counts; fast
• Fixed window size
• Sparsity & storage issues
• Not robust
FFNN
• Robust to rare words
• Fixed window size
• Weirdly handles context positions
• No "memory" of the past
RNN
• Handles infinite context (in theory)
• Kind of robust… almost
• Slow
• Difficulty with long context
Outline
• Language Modelling
• RNNs/LSTMs + ELMo
• Seq2Seq + Attention
• Transformers + BERT
• Conclusions
Long short-term memory (LSTM)
• A type of RNN that is designed to better handle long-range dependencies
• In "vanilla" RNNs, the hidden state is perpetually being rewritten
• In addition to a traditional hidden state h, let's have a dedicated memory cell c for long-term events — more power to relay sequence information
Inside an LSTM Hidden Layer
(Diagram, built up step by step: an LSTM cell at time t, with cell states C_{t−1}, C_t, C_{t+1} and hidden states h_{t−1}, h_t, h_{t+1} flowing through)
• Some old memories are "forgotten"
• Some new memories are made
• A nonlinear, weighted version of the long-term memory becomes our short-term memory
• Memory is written, erased, and read by three gates — which are influenced by x and h
Diagram: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
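To make the three gates concrete, here is a from-scratch sketch of one LSTM time step. It follows the standard formulation in the linked diagram; the variable and parameter names are my own, and `params` is an assumed dictionary of weight matrices and biases.

```python
import torch

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step. params maps names like 'W_f', 'b_f' to tensors."""
    z = torch.cat([h_prev, x], dim=-1)                       # [h_{t-1}; x_t]
    f = torch.sigmoid(z @ params["W_f"] + params["b_f"])     # forget gate: erase old memory
    i = torch.sigmoid(z @ params["W_i"] + params["b_i"])     # input gate: write new memory
    c_tilde = torch.tanh(z @ params["W_c"] + params["b_c"])  # candidate new memory
    c = f * c_prev + i * c_tilde                             # updated cell state (long-term)
    o = torch.sigmoid(z @ params["W_o"] + params["b_o"])     # output gate: read memory
    h = o * torch.tanh(c)                                    # new hidden state (short-term)
    return h, c
```

In practice one would use `torch.nn.LSTM`, which implements exactly this recurrence with the parameters bundled inside the module.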
Inside an LSTM Hidden Layer
It's still possible for LSTMs to suffer from vanishing/exploding gradients, but it's far less likely than with vanilla RNNs:
• If an RNN wishes to preserve info over long contexts, it must delicately find a recurrent weight matrix W_h that isn't too large or small
• However, LSTMs have 3 separate mechanisms that adjust the flow of information (e.g., the forget gate, if turned off, will preserve all info)
Long short-term memory (LSTM)
LSTM STRENGTHS?
• Almost always outperforms vanilla RNNs
• Captures long-range dependencies shockingly well
LSTM ISSUES?
• Has more weights to learn than vanilla RNNs; thus,
• Requires a moderate amount of training data (otherwise, vanilla RNNs are better)
• Can still suffer from vanishing/exploding gradients
Sequential Modelling
(Diagram: side-by-side schematics of the four models on the example "She went to …" — n-grams (count-based, e.g., P(went | she) = count(she went) / count(she)), FFNN (fixed window), RNN (recurrent hidden state), and LSTM)
Sequential Modelling
IMPORTANT: If your goal isn't to predict the next item in a sequence, and you'd rather do some other classification or regression task using the sequence, then you can:
• Train an aforementioned model (e.g., LSTM) as a language model
• Use the hidden layers that correspond to each item in your sequence
Sequential Modelling
(Diagram: 1. Train a LM to learn hidden-layer embeddings; 2. Use those hidden-layer embeddings for other tasks — e.g., feed the per-word hidden states into a separate model that outputs a sentiment score)
Sequential Modelling
Or jointly learn hidden embeddings toward a particular task (end-to-end).
(Diagram: inputs → recurrent hidden layer 1 → hidden layer 2 → output layer → sentiment score)
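A hedged end-to-end sketch of that idea (the layer sizes and the single-score output head are assumptions): an LSTM reads the whole sequence, and its final hidden state feeds a small head that predicts, e.g., a sentiment score, with everything trained jointly.

```python
import torch
import torch.nn as nn

class LSTMSentiment(nn.Module):
    """Jointly learn embeddings + LSTM + regression head, end-to-end."""
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)       # e.g., a sentiment score

    def forward(self, token_ids):                  # token_ids: (batch, seq_len)
        _, (h_last, _) = self.lstm(self.embed(token_ids))
        return self.head(h_last[-1])               # use the final hidden state

model = LSTMSentiment(vocab_size=10_000)
score = model(torch.tensor([[5, 17, 220, 3]]))     # made-up review token ids
print(score.shape)                                 # torch.Size([1, 1])
```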
You now have the foundation for modelling sequential data. Most state-of-the-art advances are based on those core RNN/LSTM ideas.
But, with tens of thousands of researchers and hackers exploring deep learning, there are many tweaks that have proven useful.
(This is where things get crazy.)
Bi-directional (review)
(Diagram: a bi-directional RNN — each state a_t is influenced by the previous state in the left-to-right pass and by the previous state in the right-to-left pass; a compact schematic symbol is used to denote a BRNN)
Bi-directional (review)
RNNs/LSTMs use the left-to-right context and sequentially process data.
If you have full access to the data at testing time, why not make use of the flow of information from right-to-left, also?
RNN Extensions: Bi-directional LSTMs (review)
For brevity, let's use the following schematic to represent an RNN:
(Diagram: hidden states h₁ … h₄ over inputs x₁ … x₄)
RNN Extensions: Bi-directional LSTMs (review)
(Diagram, built up step by step: a second RNN processes the same inputs right-to-left, producing backward hidden states alongside the forward states h₁ … h₄; at each position the forward and backward hidden states are concatenated, and the concatenated vectors feed the output layer ŷ₁ … ŷ₄)
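In PyTorch this is essentially a one-flag change (a sketch; the dimensions and token ids are assumptions): with `bidirectional=True` the forward and backward hidden states are concatenated at every position, so downstream layers see twice the hidden size.

```python
import torch
import torch.nn as nn

embed_dim, hidden_dim, vocab_size = 100, 128, 10_000
embed = nn.Embedding(vocab_size, embed_dim)
bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
out = nn.Linear(2 * hidden_dim, vocab_size)   # forward + backward states concatenated

x = torch.tensor([[11, 42, 7, 93]])           # "She went to class" (made-up ids)
h_all, _ = bilstm(embed(x))                   # h_all: (1, 4, 2 * hidden_dim)
print(out(h_all).shape)                       # torch.Size([1, 4, 10000])
```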
RNN Extensions: Bi-directional LSTMs (review)
BI-LSTM STRENGTHS?
• Usually performs at least as well as uni-directional RNNs/LSTMs
BI-LSTM ISSUES?
• Slower to train
• Only possible if access to the full data is allowed
Deep RNN (review)
LSTM units can be arranged in layers, so that the output of each unit is the input to the units in the next layer. This is called a deep RNN, where the adjective "deep" refers to these multiple layers (see the sketch below).
• Each layer feeds the LSTM on the next layer
• The first time step of a feature is fed to the first LSTM, which processes that data and produces an output (and a new state for itself)
• That output is fed to the next LSTM, which does the same thing, and the next, and so on
• Then the second time step arrives at the first LSTM, and the process repeats
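A short sketch of stacking (the layer count and sizes are assumptions): in PyTorch, `num_layers=3` wires each layer's output sequence as the next layer's input, which is exactly the scheme described above.

```python
import torch
import torch.nn as nn

deep_lstm = nn.LSTM(input_size=100, hidden_size=128,
                    num_layers=3,            # three stacked LSTM layers
                    batch_first=True)

x = torch.randn(1, 4, 100)                   # (batch, time steps, features)
h_all, (h_n, c_n) = deep_lstm(x)
print(h_all.shape)                           # top layer's outputs: (1, 4, 128)
print(h_n.shape)                             # final hidden state per layer: (3, 1, 128)
```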