CS109B Data Science 2
Pavlos Protopapas, Mark Glickman, and Chris Tanner
Lecture 16: Language Models
CS109B, PROTOPAPAS, GLICKMAN, TANNER
Language Modelling RNNs/LSTMs +ELMo Seq2Seq +Attention Transformers +BERT Conclusions
2
CS109B, PROTOPAPAS, GLICKMAN, TANNER
We could easily spend an entire semester on this material. The goal for today and Wednesday is to convey the core ideas.
CS109B, PROTOPAPAS, GLICKMAN, TANNER
Language Modelling RNNs/LSTMs +ELMo Seq2Seq +Attention Transformers +BERT Conclusions
4
CS109B, PROTOPAPAS, GLICKMAN, TANNER
5
Regardless of how we model sequential data, keep in mind that we can estimate any time series as follows:
P(y_1, \dots, y_n) = \prod_{t=1}^{n} P(y_t \mid y_{t-1}, \dots, y_1)
The left-hand side is the joint distribution of all measurements; each factor on the right is the conditional probability of an event given all of the events that preceded it. This compounds for all subsequent events, too.
CS109B, PROTOPAPAS, GLICKMAN, TANNER
6
The probability of the following 3-day weather pattern in Seattle:
Day 1 → Day 2 → Day 3
CS109B, PROTOPAPAS, GLICKMAN, TANNER
The probability of the following 3-day weather pattern in Seattle:
Day 1 → Day 2 → Day 3
CS109B, PROTOPAPAS, GLICKMAN, TANNER
8
The probability of the following 3-day weather pattern in Seattle:
Day 1 → Day 2 → Day 3
Day 1 → Day 2 → Day 3
CS109B, PROTOPAPAS, GLICKMAN, TANNER
9
Why is it useful to accurately estimate the joint probability of any given sequence of length n?
CS109B, PROTOPAPAS, GLICKMAN, TANNER
10
Having the ability to estimate the probability of any sequence of length n allows us to determine the most likely next event (i.e., the best sequence of length n + 1).
Day 1 → Day 2 → Day 3 → Day 4
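As a concrete (hypothetical) illustration of this idea: with made-up first-order transition probabilities for Seattle weather, we can score any 3-day sequence via the chain rule and then pick the most likely day-4 weather. None of the numbers below come from the slides.

```python
states = ["sunny", "rainy"]

# Hypothetical probabilities (NOT from the slides), just to make the idea concrete.
p_start = {"sunny": 0.4, "rainy": 0.6}                    # P(day-1 weather)
p_next = {"sunny": {"sunny": 0.6, "rainy": 0.4},          # P(next day | current day)
          "rainy": {"sunny": 0.3, "rainy": 0.7}}

def sequence_prob(seq):
    """Joint probability via the chain rule, with a first-order (Markov) assumption."""
    p = p_start[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= p_next[prev][cur]
    return p

three_days = ["rainy", "rainy", "sunny"]
print(sequence_prob(three_days))                          # P(day 1, day 2, day 3)

# Most likely day-4 weather: compare the two length-4 extensions.
print(max(states, key=lambda s: sequence_prob(three_days + [s])))
```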
CS109B, PROTOPAPAS, GLICKMAN, TANNER
11
For the remainder of this lecture, we will use text (natural language) for our examples. Yet, for any model, you can imagine using any other type of sequential data.
CS109B, PROTOPAPAS, GLICKMAN, TANNER
12
A Language Model represents the language used by a given entity (e.g., a particular person, genre, or other well-defined class of text)
CS109B, PROTOPAPAS, GLICKMAN, TANNER
13
A Language Model estimates the probability of any sequence of words.
Let W = "Eleni was late for class", with words x_1, x_2, x_3, x_4, x_5.
P(W) = P("Eleni was late for class")
CS109B, PROTOPAPAS, GLICKMAN, TANNER
16
Generate Text
CS109B, PROTOPAPAS, GLICKMAN, TANNER
17
Generate Text
CS109B, PROTOPAPAS, GLICKMAN, TANNER
18
Generate Text
CS109B, PROTOPAPAS, GLICKMAN, TANNER
19
"Drug kingpin El Chapo testified that he gave MILLIONS to Pelosi, Schiff & …"
CS109B, PROTOPAPAS, GLICKMAN, TANNER
20
A Language Model is useful for: Generating Text Classifying Text
And much more!
CS109B, PROTOPAPAS, GLICKMAN, TANNER
21
Today, we heavily focus on Language Modelling (LM) because: 1. It's foundational for nearly all NLP tasks. 2. Since we're ultimately modelling a sequence, LM approaches are generalizable to any type of data, not just text.
CS109B, PROTOPAPAS, GLICKMAN, TANNER
22
How can we build a language model? Naive Approach: unigram model Assume each word is independent of all others. Count how often each word occurs (in the training data).
CS109B, PROTOPAPAS, GLICKMAN, TANNER
23
How can we build a language model? Naive Approach: unigram model. Assume each word is independent of all others. Let W = "Eleni was late for class" (words x_1 … x_5).
CS109B, PROTOPAPAS, GLICKMAN, TANNER
24
How can we build a language model? Naive Approach: unigram model. Assume each word is independent of all others. Let W = "Eleni was late for class" (words x_1 … x_5).
P(W) = P(Eleni) P(was) P(late) P(for) P(class)
= 0.00015 × 0.01 × 0.004 × 0.03 × 0.0035 = 6.3 × 10^-13
You calculate each of these probabilities from the training corpus.
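A minimal sketch of a unigram model (the tiny training corpus below is made up for illustration; a real one would be millions of tokens):

```python
from collections import Counter

corpus = "eleni was late for class . the class was long .".split()  # toy corpus
counts = Counter(corpus)
total = sum(counts.values())

def unigram_prob(word):
    # P(word) = count(word) / total number of tokens in the training data
    return counts[word] / total

def sentence_prob(sentence):
    # Independence assumption: multiply the unigram probabilities together.
    p = 1.0
    for w in sentence.split():
        p *= unigram_prob(w)
    return p

print(sentence_prob("eleni was late for class"))
```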
CS109B, PROTOPAPAS, GLICKMAN, TANNER
25
UNIGRAM ISSUES?
CS109B, PROTOPAPAS, GLICKMAN, TANNER
26
UNIGRAM ISSUES? P("Eleni was late for class") = P("class for was late Eleni"). Context doesn't play a role at all.
CS109B, PROTOPAPAS, GLICKMAN, TANNER
27
UNIGRAM ISSUES? P("Eleni was late for class") = P("class for was late Eleni"). Context doesn't play a role at all. Sequence generation: what's the most likely next word? "Eleni was late for class _____"
CS109B, PROTOPAPAS, GLICKMAN, TANNER
28
UNIGRAM ISSUES? P("Eleni was late for class") = P("class for was late Eleni"). Context doesn't play a role at all. Sequence generation: what's the most likely next word? The unigram model always predicts the single most frequent word: "Eleni was late for class the", "Anqi was late for class the the".
CS109B, PROTOPAPAS, GLICKMAN, TANNER
29
How can we build a language model? Alternative Approach: bigram model. Look at pairs of consecutive words. Let W = "Eleni was late for class" (words x_1 … x_5).
CS109B, PROTOPAPAS, GLICKMAN, TANNER
30
How can we build a language model? Alternative Approach: bigram model. Look at pairs of consecutive words. Let W = "Eleni was late for class" (words x_1 … x_5).
P(W) = P(was | Eleni) P(late | was) P(for | late) P(class | for)   (each factor is a conditional probability)
You calculate each of these probabilities by simply counting the occurrences:
P(class | for) = count(for class) / count(for)
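And the bigram version in the same style (again with a made-up toy corpus; note that any bigram never seen in training gets probability 0, which is exactly the sparsity issue raised below):

```python
from collections import Counter

corpus = "eleni was late for class . anqi was late for lunch .".split()  # toy corpus
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(word, prev):
    # P(word | prev) = count(prev word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(sentence):
    words = sentence.split()
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= bigram_prob(cur, prev)
    return p

print(bigram_prob("class", "for"))               # count(for class) / count(for)
print(sentence_prob("eleni was late for class"))
```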
CS109B, PROTOPAPAS, GLICKMAN, TANNER
BIGRAM ISSUES?
CS109B, PROTOPAPAS, GLICKMAN, TANNER
BIGRAM ISSUES?
sparsity is an issue (rarely seen subsequences)
CS109B, PROTOPAPAS, GLICKMAN, TANNER
IDEA: Let's use a neural network! First, each word is represented by a word embedding (e.g., a vector of length 200): man, woman, table, …
CS109B, PROTOPAPAS, GLICKMAN, TANNER
Words with similar meanings will have embeddings that are proportionally similar, too. These embeddings have typically been trained on gigantic corpora.
CS109B, PROTOPAPAS, GLICKMAN, TANNER
These word embeddings are so rich that you get nice properties:
Word2vec: https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf GloVe: https://www.aclweb.org/anthology/D14-1162.pdf
king - man + woman ≈ queen
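A small sketch of that analogy property using cosine similarity. The 4-dimensional vectors below are made up purely for illustration; real word2vec/GloVe vectors (from the links above) have hundreds of dimensions.

```python
import numpy as np

# Hypothetical toy embeddings; real ones come from pre-trained word2vec/GloVe files.
emb = {
    "king":  np.array([0.8, 0.9, 0.1, 0.2]),
    "queen": np.array([0.8, 0.1, 0.9, 0.2]),
    "man":   np.array([0.2, 0.9, 0.1, 0.1]),
    "woman": np.array([0.2, 0.1, 0.9, 0.1]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land closest to queen.
target = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)   # queen (with these toy vectors)
```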
CS109B, PROTOPAPAS, GLICKMAN, TANNER
How can we use these embeddings to build an LM? Remember, we only need a system that can estimate the probability of the next word given the previous words, e.g., P(class | She went to) for the example input sentence "She went to class".
CS109B, PROTOPAPAS, GLICKMAN, TANNER
Neural Approach #1: Feed-forward Neural Net. General idea: using windows of words, predict the next word.
Example input window: "She went to" → "class?"
[Diagram: input word embeddings → hidden layer → output layer.]
CS109B, PROTOPAPAS, GLICKMAN, TANNER
Neural Approach #1: Feed-forward Neural Net. General idea: using windows of words, predict the next word.
Example input window: "She went to" → "class?"
y = [y_1, y_2, y_3]   (concatenated word embeddings)
h = f(W y + b_1)   (hidden layer)
\hat{z} = \mathrm{softmax}(U h + b_2)   (output layer: a distribution over the vocabulary)
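A sketch of this fixed-window model in Keras (the vocabulary size, dimensions, and window size of 3 are illustrative choices, not values from the slides):

```python
import numpy as np
from tensorflow.keras import layers, models

vocab_size, embed_dim, window = 10_000, 200, 3

model = models.Sequential([
    layers.Input(shape=(window,)),                   # 3 word ids: "She went to"
    layers.Embedding(vocab_size, embed_dim),         # look up y_1, y_2, y_3
    layers.Flatten(),                                 # y = [y_1, y_2, y_3]
    layers.Dense(128, activation="tanh"),             # h = f(W y + b_1)
    layers.Dense(vocab_size, activation="softmax"),   # z_hat = softmax(U h + b_2)
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Toy usage: one window of (made-up) word ids -> the id of the next word ("class").
model.fit(np.array([[5, 42, 7]]), np.array([101]), epochs=1, verbose=0)
```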
CS109B, PROTOPAPAS, GLICKMAN, TANNER
We slide the window along the example sentence "She went to class after visiting her grandma", predicting each next word in turn with the same y = [y_1, y_2, y_3], h = f(W y + b_1), \hat{z} = \mathrm{softmax}(U h + b_2) computation:
(She, went, to) → class
(went, to, class) → after
(to, class, after) → visiting
(class, after, visiting) → her
(after, visiting, her) → grandma
CS109B, PROTOPAPAS, GLICKMAN, TANNER
FFNN ISSUES? FFNN STRENGTHS?
CS109B, PROTOPAPAS, GLICKMAN, TANNER
We especially need a system that can remember past context without being limited to a fixed window.
CS109B, PROTOPAPAS, GLICKMAN, TANNER
Language Modelling RNNs/LSTMs +ELMo Seq2Seq +Attention Transformers +BERT Conclusions
50
CS109B, PROTOPAPAS, GLICKMAN, TANNER
51
CS109B, PROTOPAPAS, GLICKMAN, TANNER
IDEA: for every individual input, output a prediction.
Example input word: "She" → predict "went"
y = y_1   (a single word embedding)
h = f(W y + b_1)
\hat{z} = \mathrm{softmax}(U h + b_2)
Let's use the previous hidden state, too.
CS109B, PROTOPAPAS, GLICKMAN, TANNER
Neural Approach #2: Recurrent Neural Network (RNN)
[Diagram: the network unrolled over time. At each step t, the input embedding y_t feeds the hidden layer, the previous hidden state feeds forward through recurrent weights V, and the output layer produces \hat{z}_t.]
CS109B, PROTOPAPAS, GLICKMAN, TANNER
We have seen this abstract view in Lecture 15.
[Diagram: the same RNN drawn compactly, with a loop on the hidden layer.]
The recurrent loop V conveys that the current hidden layer is influenced by the hidden layer from the previous time step.
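A bare-bones numpy sketch of this recurrence for one sentence (all sizes and the random weights are illustrative; a real model would learn W, V, U by backprop):

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim, vocab_size = 50, 64, 1000

W = rng.normal(scale=0.1, size=(hidden_dim, embed_dim))   # input -> hidden
V = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden -> hidden (the recurrent loop)
U = rng.normal(scale=0.1, size=(vocab_size, hidden_dim))  # hidden -> output
b1, b2 = np.zeros(hidden_dim), np.zeros(vocab_size)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def rnn_forward(embeddings):
    """Return a next-word distribution z_hat_t at every time step."""
    h = np.zeros(hidden_dim)
    outputs = []
    for y_t in embeddings:                     # y_t: embedding of the word at step t
        h = np.tanh(W @ y_t + V @ h + b1)      # current hidden state uses the previous one
        outputs.append(softmax(U @ h + b2))    # z_hat_t
    return outputs

sentence = [rng.normal(size=embed_dim) for _ in range(4)]  # "She went to class" (fake embeddings)
z_hats = rnn_forward(sentence)
print(len(z_hats), z_hats[0].shape)   # 4 distributions over the vocabulary
```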
CS109B, PROTOPAPAS, GLICKMAN, TANNER
Training Process
[Diagram: the RNN unrolled over the training sentence "She went to class"; at each time step t the prediction \hat{z}_t is compared against the true next word z_t.]
The error at each step is the cross-entropy loss:
CE(z_t, \hat{z}_t) = -\sum_i z_t^{(i)} \log \hat{z}_t^{(i)}
CS109B, PROTOPAPAS, GLICKMAN, TANNER
During training, regardless of our output predictions, we feed in the correct inputs (this is known as teacher forcing).
CS109B, PROTOPAPAS, GLICKMAN, TANNER
Our total loss is simply the average loss across all time steps.
CS109B, PROTOPAPAS, GLICKMAN, TANNER
Using the chain rule, we trace the derivative all the way back to the beginning, while summing the results (backpropagation through time). To update our weights (e.g., V), we calculate the gradient of the loss w.r.t. the repeated weight matrix (e.g., \partial L / \partial V).
CS109B, PROTOPAPAS, GLICKMAN, TANNER
[Diagram, repeated over several slides: the gradient flows backward through the unrolled network, from the loss at the final time step back through h_3, h_2, and h_1.]
CS109B, PROTOPAPAS, GLICKMAN, TANNER
Backpropagating through very long sequences is expensive, so in practice we often truncate backpropagation every n steps (e.g., every sentence or paragraph). This is not the same as limiting the context (a la n-grams), because the hidden state still carries "infinite memory" forward.
CS109B, PROTOPAPAS, GLICKMAN, TANNER
We can generate the most likely next event (e.g., word) by sampling from \hat{z}. Continue until we generate the <EOS> symbol.
CS109B, PROTOPAPAS, GLICKMAN, TANNER
[Diagram, built up over several slides: feed <START> and sample "Sorry"; feed "Sorry" and sample "Harry"; feed "Harry" and sample "shouted,"; feed "shouted," and sample "panicking"; continue until <EOS>.]
NOTE: the same input (e.g., "Harry") can easily yield different outputs, depending on the context (unlike FFNNs and n-grams).
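A sketch of that generation loop. `next_word_distribution` is a stand-in for the trained RNN's softmax output (here it just returns random distributions), not a real API:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<START>", "<EOS>", "Sorry", "Harry", "shouted,", "panicking"]

def next_word_distribution(history):
    # Stand-in for the trained RNN: should return z_hat, a distribution over `vocab`,
    # given everything generated so far. Here it ignores `history` and returns noise.
    logits = rng.normal(size=len(vocab))
    e = np.exp(logits - logits.max())
    return e / e.sum()

def generate(max_len=20):
    words = ["<START>"]
    while len(words) < max_len:
        z_hat = next_word_distribution(words)
        word = rng.choice(vocab, p=z_hat)   # sample from z_hat (not argmax)
        if word == "<EOS>":
            break
        words.append(word)
    return " ".join(words[1:])

print(generate())
```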
CS109B, PROTOPAPAS, GLICKMAN, TANNER
When trained on Harry Potter text, it generates:
Source: https://medium.com/deep-writing/harry-potter-written-by-artificial-intelligence-8a9431803da6
CS109B, PROTOPAPAS, GLICKMAN, TANNER
When trained on recipes, it generates:
Source: https://gist.github.com/nylki/1efbaa36635956d35bcc
CS109B, PROTOPAPAS, GLICKMAN, TANNER
RNN ISSUES? RNN STRENGTHS?
CS109B, PROTOPAPAS, GLICKMAN, TANNER
[Diagram, built up over several slides: the RNN unrolled over "She went to class", with hidden states h_1, h_2, h_3 and the loss E_4 = CE(z_4, \hat{z}_4) at the final step.]
The gradient of E_4 with respect to the earliest hidden state is a product of many factors:
\frac{\partial E_4}{\partial h_1} = \frac{\partial E_4}{\partial h_4} \, \frac{\partial h_4}{\partial h_3} \, \frac{\partial h_3}{\partial h_2} \, \frac{\partial h_2}{\partial h_1}
When these factors are repeatedly small the gradient vanishes; when they are repeatedly large it explodes. Either way, long-range context is hard to learn.
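A quick numerical illustration of why that long product is fragile. The matrix J below stands in for the repeated ∂h_t/∂h_{t-1} factor (a real Jacobian would also include the tanh derivative); the scales 0.5 and 1.5 are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, steps = 64, 50

def grad_norm_after(scale):
    # Multiply a gradient vector by the SAME Jacobian-like matrix `steps` times,
    # mimicking the shared recurrent weights used at every time step.
    g = rng.normal(size=hidden_dim)
    J = rng.normal(scale=scale / np.sqrt(hidden_dim), size=(hidden_dim, hidden_dim))
    for _ in range(steps):
        g = J.T @ g
    return np.linalg.norm(g)

print(grad_norm_after(0.5))   # shrinks toward 0: vanishing gradient
print(grad_norm_after(1.5))   # blows up: exploding gradient
```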
CS109B, PROTOPAPAS, GLICKMAN, TANNER
To address RNNs' finicky nature with long-range context, we turned to an RNN variant called the LSTM (long short-term memory). But first, let's recap what we've learned so far.
CS109B, PROTOPAPAS, GLICKMAN, TANNER
Recap:
n-grams: P(went | she) = count(she went) / count(she)
  Basic counts; fast
  Fixed window size
  Sparsity & storage issues
  Not robust
FFNN: fixed window of word embeddings ("She went to") → hidden layer → softmax over the next word ("class")
  … almost
  Fixed window size
  Weirdly handles context positions
  No "memory" of past
RNN: reads "She went to" one word at a time, carrying a hidden state forward
  Handles infinite context (in theory)
  Robust to rare words
  Slow
  Difficulty with long context
CS109B, PROTOPAPAS, GLICKMAN, TANNER
Language Modelling RNNs/LSTMs +ELMo Seq2Seq +Attention Transformers +BERT Conclusions
78
CS109B, PROTOPAPAS, GLICKMAN, TANNER
79
CS109B, PROTOPAPAS, GLICKMAN, TANNER
LSTMs are designed to capture long-range dependencies: instead of the hidden state being completely rewritten at each time step, there is a dedicated memory cell c for long-term events, giving the network more power to relay sequence information.
CS109B, PROTOPAPAS, GLICKMAN, TANNER
[Diagram: the LSTM cell, with cell state C_t and hidden state h_t passed between time steps.]
Some old memories are "forgotten", some new memories are made, and a nonlinear weighted version of the long-term memory becomes our short-term memory. Memory is written, erased, and read by three gates, which are influenced by x and h.
Diagram: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
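For reference, the standard LSTM gate equations behind that diagram (following the colah post linked above; \sigma is the sigmoid and \odot is elementwise multiplication):

f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)   (forget gate: which old memories to erase)
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)   (input gate: which new memories to write)
\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)   (candidate new memories)
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t   (long-term memory: keep some old, add some new)
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)   (output gate: what to read out)
h_t = o_t \odot \tanh(C_t)   (short-term memory / hidden state)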
CS109B, PROTOPAPAS, GLICKMAN, TANNER
It's still possible for LSTMs to suffer from vanishing/exploding gradients, but it's far less likely than with vanilla RNNs: a vanilla RNN has to find a recurrent weight matrix W_h that isn't too large or small, whereas the LSTM's gated, additive cell-state updates give gradients a more direct path through time.
CS109B, PROTOPAPAS, GLICKMAN, TANNER
LSTM ISSUES? LSTM STRENGTHS?
(there are few cases where vanilla RNNs are better)
CS109B, PROTOPAPAS, GLICKMAN, TANNER
Recap: n-grams (P(went | she) = count(she went) / count(she)), FFNNs over a fixed window of embeddings, RNNs, and now LSTMs all predict the next word from "She went to".
CS109B, PROTOPAPAS, GLICKMAN, TANNER
If your goal isn't to predict the next item in a sequence, but rather to do some other classification or regression task using the sequence, then you can use the hidden states the network produces as a representation of the sequence (IMPORTANT). There are two common ways to do this, shown below.
CS109B, PROTOPAPAS, GLICKMAN, TANNER
[Diagram: run the RNN over the input sequence y_1 … y_4 and use its hidden-state embeddings as input features for other tasks, e.g., predicting a sentiment score.]
CS109B, PROTOPAPAS, GLICKMAN, TANNER
[Diagram: or jointly learn the hidden embeddings toward a particular task (end-to-end): input layer → hidden layer 1 → hidden layer 2 → sentiment score.]
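A sketch of the end-to-end option in Keras: the LSTM's final hidden state summarizes the sequence and feeds a small output layer (vocabulary size, dimensions, and the padded length of 50 are illustrative):

```python
import numpy as np
from tensorflow.keras import layers, models

vocab_size, embed_dim, seq_len = 10_000, 100, 50

model = models.Sequential([
    layers.Input(shape=(seq_len,)),
    layers.Embedding(vocab_size, embed_dim),
    layers.LSTM(64),                          # final hidden state summarizes the sequence
    layers.Dense(1, activation="sigmoid"),    # sentiment score in [0, 1]
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Toy usage: one padded sequence of word ids with a positive (1) label.
x = np.zeros((1, seq_len), dtype="int32")
model.fit(x, np.array([1.0]), epochs=1, verbose=0)
```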
CS109B, PROTOPAPAS, GLICKMAN, TANNER
You now have the foundation for modelling sequential data. Most state-of-the-art advances are based on those core RNN/LSTM ideas. But, with tens of thousands of researchers and hackers exploring deep learning, there are many tweaks that have proven useful. (This is where things get crazy.)
CS109B, PROTOPAPAS, GLICKMAN, TANNER
93
[Diagram: inputs x_{t-2}, x_{t-1}, x_t, with each hidden state receiving a "previous state" from both directions; this is the symbol for a BRNN (bidirectional RNN).]
CS109B, PROTOPAPAS, GLICKMAN, TANNER
RNNs/LSTMs use the left-to-right context and sequentially process data. If you have full access to the data at testing time, why not make use of the flow of information from right-to-left, also?
CS109B, PROTOPAPAS, GLICKMAN, TANNER
For brevity, let's use the following schematic to represent an RNN.
[Diagram, built up over several slides: a forward RNN runs left-to-right over inputs y_1 … y_4, producing hidden states h_1 … h_4; a second RNN runs right-to-left over the same inputs, producing its own hidden states; at each position we concatenate the forward and backward hidden states; an output layer produces \hat{z}_1 … \hat{z}_4 from the concatenated states.]
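A sketch of this in Keras: `Bidirectional` runs one LSTM left-to-right and another right-to-left and concatenates their hidden states at each position (sizes and the per-position tagging task are illustrative):

```python
import numpy as np
from tensorflow.keras import layers, models

vocab_size, embed_dim, seq_len, n_tags = 10_000, 100, 20, 5

model = models.Sequential([
    layers.Input(shape=(seq_len,)),
    layers.Embedding(vocab_size, embed_dim),
    # return_sequences=True keeps one (forward ++ backward) hidden state per position.
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.Dense(n_tags, activation="softmax"),   # z_hat_t at every position
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

x = np.zeros((1, seq_len), dtype="int32")   # toy word ids
y = np.zeros((1, seq_len), dtype="int32")   # toy per-position labels
model.fit(x, y, epochs=1, verbose=0)
```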
CS109B, PROTOPAPAS, GLICKMAN, TANNER
BI-LSTM ISSUES? BI-LSTM STRENGTHS?
CS109B, PROTOPAPAS, GLICKMAN, TANNER
LSTM units can be arranged in layers, so that the output of each unit is the input to the units in the layer above. This is called a deep RNN, where the adjective "deep" refers to these multiple layers. At each time step, the first layer reads its input and produces an output (and a new state for itself); that output is the input to the next layer, and so on; then the process repeats at the next time step.
100
CS109B, PROTOPAPAS, GLICKMAN, TANNER
101
[Diagram: a deep RNN unrolled over inputs x_{t-2} … x_{t+2}, with two stacked recurrent layers.]
CS109B, PROTOPAPAS, GLICKMAN, TANNER
[Diagram, built up over several slides: input layer y_1 … y_4 → hidden layer #1 → hidden layer #2 → output layer \hat{z}_1 … \hat{z}_4.]
Hidden layers provide an abstraction (they hold "meaning"). Stacking hidden layers provides increased abstraction.
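A minimal Keras sketch of stacking: the first LSTM must return its full sequence of hidden states so that the layer above has an input at every time step (sizes are illustrative):

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(None,)),              # variable-length sequences of word ids
    layers.Embedding(10_000, 100),
    layers.LSTM(64, return_sequences=True),   # hidden layer #1 (one state per step)
    layers.LSTM(64),                          # hidden layer #2 (increased abstraction)
    layers.Dense(1, activation="sigmoid"),
])
model.summary()
```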
CS109B, PROTOPAPAS, GLICKMAN, TANNER
ELMo. General idea: train a deep, stacked bidirectional LSTM language model and use its internal hidden layers, which capture increasing levels of abstraction (stacked), as contextualized word representations.
ELMo Slides: https://www.slideshare.net/shuntaroy/a-review-of-deep-contextualized-word-representations-peters-2018
CS109B, PROTOPAPAS, GLICKMAN, TANNER
Illustration: http://jalammar.github.io/illustrated-bert/
CS109B, PROTOPAPAS, GLICKMAN, TANNER
Illustration: http://jalammar.github.io/illustrated-bert/
CS109B, PROTOPAPAS, GLICKMAN, TANNER
ELMo achieved state-of-the-art results when applied to many NLP tasks. Takeaway: giving the system explicit connections between your vectors is useful (the system can determine how to best use context).
ELMo Slides: https://www.slideshare.net/shuntaroy/a-review-of-deep-contextualized-word-representations-peters-2018
CS109B, PROTOPAPAS, GLICKMAN, TANNER
REFLECTION
So far, for all of our sequential modelling, we have been concerned with emitting 1 output per input datum. Sometimes, though, a sequence is the smallest granularity we care about (e.g., an English sentence).
CS109B, PROTOPAPAS, GLICKMAN, TANNER
Language Modelling RNNs/LSTMs +ELMo Seq2Seq +Attention Transformers +BERT Conclusions
110
CS109B, PROTOPAPAS, GLICKMAN, TANNER
111
CS109B, PROTOPAPAS, GLICKMAN, TANNER
When translating from Language A to Language B, it is clearly sub-optimal to translate word by word (which is what our current models are suited to do). We need a more flexible structure to work with: a sequence of length N may emit a sequence of length M.
CS109B, PROTOPAPAS, GLICKMAN, TANNER
ENCODER RNN: [Diagram: the encoder RNN reads "The brown dog ran", producing hidden states h_1^x … h_4^x.]
The final hidden state of the encoder RNN is the initial state of the decoder RNN.
DECODER RNN: [Diagram: the decoder RNN produces hidden states h_1^y … h_5^y and outputs \hat{z}_1 … \hat{z}_5, generating "Le chien brun a couru".]
CS109B, PROTOPAPAS, GLICKMAN, TANNER
Training occurs as you would typically do: the loss (from the decoder outputs) is calculated, and we update the weights all the way back to the beginning (the encoder).
CS109B, PROTOPAPAS, GLICKMAN, TANNER
Testing generates decoder outputs one word at a time, until we generate an <EOS> token. Each decoder output \hat{z}_t becomes the next input y_{t+1}.
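A compact Keras sketch of this encoder/decoder setup with teacher forcing (vocabulary sizes and the integer ids are made up; at test time you would instead feed each predicted word back in as the next decoder input):

```python
import numpy as np
from tensorflow.keras import layers, models

src_vocab, tgt_vocab, embed_dim, hidden = 8_000, 9_000, 100, 128

# Encoder: read the source sentence, keep only the final hidden/cell states.
enc_in = layers.Input(shape=(None,))
enc_emb = layers.Embedding(src_vocab, embed_dim)(enc_in)
_, enc_h, enc_c = layers.LSTM(hidden, return_state=True)(enc_emb)

# Decoder: initialized with the encoder's final state; during training we feed
# the correct previous target word at each step (teacher forcing).
dec_in = layers.Input(shape=(None,))
dec_emb = layers.Embedding(tgt_vocab, embed_dim)(dec_in)
dec_out, _, _ = layers.LSTM(hidden, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[enc_h, enc_c])
dec_preds = layers.Dense(tgt_vocab, activation="softmax")(dec_out)

model = models.Model([enc_in, dec_in], dec_preds)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Toy batch: "The brown dog ran" -> "Le chien brun a couru" as (made-up) integer ids.
src = np.array([[4, 17, 9, 3]])
tgt_in = np.array([[1, 21, 33, 8, 5, 40]])    # <START> Le chien brun a couru
tgt_out = np.array([[21, 33, 8, 5, 40, 2]])   # Le chien brun a couru <EOS>
model.fit([src, tgt_in], tgt_out, epochs=1, verbose=0)
```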
CS109B, PROTOPAPAS, GLICKMAN, TANNER
See any issues with this traditional seq2seq paradigm?
CS109B, PROTOPAPAS, GLICKMAN, TANNER
[Diagram: the same encoder/decoder picture.]
It's crazy that the entire "meaning" of the first sequence is expected to be packed into this one embedding (the encoder's final hidden state), and that the encoder then never interacts with the decoder again. Hands free.
CS109B, PROTOPAPAS, GLICKMAN, TANNER
Instead, what if the decoder, at each step, pays attention to a distribution over all of the encoder's hidden states?
CS109B, PROTOPAPAS, GLICKMAN, TANNER
[Diagram, built up over several slides: the encoder reads "The brown dog ran". At each decoder step i, attention weights α_1^i … α_4^i are computed over all of the encoder's hidden states and combined into a context vector c_i. The decoder's first input is [c_1, h_4^x]; each later input is [c_i, \hat{z}_{i-1}]. The decoder generates "Le chien brun a couru".]
NOTE: each attention weight α_j^i is based on the decoder's current hidden state, too.
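A small numpy sketch of one decoder step of attention. The slides don't pin down the scoring function, so the dot product used here is just one common choice:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 8

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Encoder hidden states for "The brown dog ran" (fake values), and the decoder's
# current hidden state at this step.
encoder_states = rng.normal(size=(4, hidden))   # h_1^x .. h_4^x
decoder_state = rng.normal(size=hidden)

scores = encoder_states @ decoder_state         # one score per source word
alphas = softmax(scores)                        # attention weights alpha_1..alpha_4
context = alphas @ encoder_states               # c_i: weighted sum of encoder states

print(np.round(alphas, 3))   # how much the decoder attends to each source word
print(context.shape)         # the context vector fed into this decoder step
```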
CS109B, PROTOPAPAS, GLICKMAN, TANNER
Attention also buys us interpretability: the weights show the contribution each source word gave during each step of the decoder.
Image source: Fig 3 in Bahdanau et al., 2015
CS109B, PROTOPAPAS, GLICKMAN, TANNER
Language Modelling RNNs/LSTMs +ELMo Seq2Seq +Attention Transformers +BERT Conclusions
128