Contextualized Word Embeddings
Spring 2020
2020-03-17
CMPT 825: Natural Language Processing
SFU NatLangLab
Adapted from slides from Danqi Chen and Karthik Narasimhan (with some content from slides from Chris Manning and Abigail See)
Remaining lectures (tentative)
Contextualized Word Representations
BERT = Bidirectional Encoder Representations from Transformers
ELMo = Embeddings from Language Models
word = “sweden”
cat = [−0.224, 0.130, −0.290, 0.276]
v(bank)
A word’s senses, syntactic behavior, and connotations depend on its context.
Let’s build a vector for each word conditioned on its context!
Contextualized word embeddings
[Figure: a contextualized vector for each word of “the movie was terribly exciting !”]
f : (w_1, w_2, …, w_n) ⟶ x_1, …, x_n ∈ ℝ^d
(Peters et al, 2018): Deep contextualized word representations
(from ELMo)
Train a language model on a large corpus and use it to compute a vector representation of each word in context.
(figure credit: Jay Alammar http://jalammar.github.io/illustrated-bert/)
[Figure: LSTM language model — input tokens, shared LSTM parameters, softmax over the vocabulary; unrolled over the # tokens in the sentence]
(figure credit: Jay Alammar http://jalammar.github.io/illustrated-bert/)
To get the ELMo embedding of a word (“stick”):
Concatenate the forward and backward embeddings and take a weighted sum of the layers
(figure credit: Jay Alammar http://jalammar.github.io/illustrated-bert/)
The LM weights are frozen; the layer-combination weights are trained on the specific task.
Freeze the LM weights and change the task model’s input representation to include the ELMo vector (it could also be inserted into higher layers):

ELMo_k^task = γ^task ∑_{j=0}^{L} s_j^task h_{k,j}^{LM}

where:
L = number of biLSTM layers
s_j^task = softmax-normalized layer weights
γ^task = task-specific scaling factor
h_{k,0}^{LM} = x_k^{LM} (the token representation)
h_{k,j}^{LM} = [h→_{k,j}^{LM} ; h←_{k,j}^{LM}] (concatenation of the forward and backward hidden states at layer j)

[Figure: the biLM — token representations, per-layer hidden states, and projections to the next layer’s input]
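A minimal NumPy sketch of the layer-mixing equation above (variable and function names are illustrative, not from the original ELMo code):

```python
import numpy as np

def elmo_task_embedding(layer_states, s_task, gamma_task):
    """Combine biLM layers into task-specific ELMo vectors.

    layer_states: (L+1, seq_len, dim) array; layer 0 is the token
        representation x_k, layers 1..L are the concatenated
        forward/backward biLSTM hidden states h_{k,j}.
    s_task: unnormalized layer weights, shape (L+1,)
    gamma_task: task-specific scalar
    """
    s = np.exp(s_task - s_task.max())
    s = s / s.sum()                                  # softmax over layers
    return gamma_task * np.einsum("j,jkd->kd", s, layer_states)

# e.g. L = 2 biLSTM layers, 6 tokens, 1024-dim states
states = np.random.randn(3, 6, 1024)
elmo = elmo_task_embedding(states, s_task=np.zeros(3), gamma_task=1.0)
print(elmo.shape)  # (6, 1024)
```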
First layer > second layer: syntactic information
Second layer > first layer: semantic information
Syntactic information is better represented at lower layers, while semantic information is captured at higher layers.
https://allennlp.org/elmo
Also available in TensorFlow
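A minimal sketch of computing ELMo vectors with AllenNLP; this assumes the older AllenNLP 0.x ElmoEmbedder interface (see the link above for the currently recommended usage):

```python
# pip install "allennlp<1.0"  (ELMo shipped with the 0.x releases)
from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder()  # downloads the pretrained biLM weights on first use
vectors = elmo.embed_sentence(["the", "movie", "was", "terribly", "exciting", "!"])
# vectors has shape (3, 6, 1024): one 1024-d vector per token from each of
# the 3 layers (token representation + 2 biLSTM layers)
```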
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
How is BERT different from ELMo?
#1. Unidirectional context vs. bidirectional context
#2. LSTMs vs. Transformers (will talk about later)
#3. The weights are not frozen; they are fine-tuned
(ELMo used two independent LMs, one from each direction.)
Why are LMs unidirectional?
Solution: masked language modeling — mask out some of the input tokens and predict the masked words.
A little more complex (don’t always replace with [MASK]):
Because [MASK] is never seen when BERT is used…
Example: “my dog is hairy”; we choose to replace the word hairy:
80% of the time: replace it with [MASK] → my dog is [MASK]
10% of the time: replace it with a random word → my dog is apple
10% of the time: keep it unchanged (to bias the representation toward the actual observed word) → my dog is hairy
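A toy sketch of this 80/10/10 masking scheme (illustrative only, not the actual BERT preprocessing code):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Pick ~15% of tokens as prediction targets; of those, 80% become
    [MASK], 10% a random word, 10% stay unchanged. The model is always
    trained to predict the original (observed) word."""
    inputs, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok                      # predict the observed word
            r = random.random()
            if r < 0.8:
                inputs[i] = "[MASK]"              # my dog is [MASK]
            elif r < 0.9:
                inputs[i] = random.choice(vocab)  # my dog is apple
            # else: keep the token unchanged      # my dog is hairy
    return inputs, targets

print(mask_tokens("my dog is hairy".split(), vocab=["apple", "movie", "nice"]))
```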
Next sentence prediction (NSP): always sample two sentences and predict whether the second sentence actually follows the first one.
Recent papers show that NSP is not necessary…
(Joshi*, Chen* et al., 2019): SpanBERT: Improving Pre-training by Representing and Predicting Spans
(Liu et al., 2019): RoBERTa: A Robustly Optimized BERT Pretraining Approach
Pre-training → fine-tuning. Key idea: all the weights are fine-tuned on downstream tasks.
(figure credit: Jay Alammar http://jalammar.github.io/illustrated-bert/)
(figure credit: Jay Alammar http://jalammar.github.io/illustrated-bert/)
(Wang et al, 2018): GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
BiLSTM baseline (GLUE score): 63.9
TensorFlow: https://github.com/google-research/bert
PyTorch: https://github.com/huggingface/transformers
https://github.com/huggingface/transformers
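A minimal sketch of getting contextualized BERT vectors with the Hugging Face library (assumes a reasonably recent transformers version):

```python
# pip install torch transformers
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("the movie was terribly exciting !", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# one 768-d contextualized vector per (sub)word token, incl. [CLS] and [SEP]
print(outputs.last_hidden_state.shape)  # (1, num_tokens, 768)
```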
(The original Transformer is an encoder-decoder framework; BERT uses the encoder only.)
The Transformer trains much faster than recurrent models (computation can be parallelized across positions).
[Figure: the Transformer Encoder and Decoder]
Scaled Dot-Product Attention
Multiple Heads
(slide credit: Abigail See)
(also referred to as Intra-Attention)
Each word computes attention with all the other words in the sentence
= the word vectors themselves select each other
The attention weights are computed from the query and the corresponding keys; in practice, the queries, keys, and values are packed together into matrices.
Attention function: scaled dot-product attention
(figure credit: Jay Alammar http://jalammar.github.io/illustrated-transformer/)
Queries, keys, and values are created by multiplying the input embeddings by learned weight matrices.
Attention scoring variants, for encoder hidden states h_i and decoder hidden state z:
Basic dot-product attention: e_i = z⊺h_i — simplest (no extra parameters) and efficient (matrix multiplication), but requires z and h_i to be the same size, and performs poorly for large d because the softmax has small gradients.
Multiplicative attention: e_i = z⊺W h_i, where W is a weight matrix — more flexible than dot-product (W is trainable).
Additive attention: e_i = v⊺ tanh(W_1 h_i + W_2 z), where W_1, W_2 are weight matrices and v is a weight vector — performs better for larger dimensions.
Scaled dot-product attention: e_i = z⊺h_i / √d, with scaling factor √d where d = dimension of the hidden state — maybe it will perform well for larger dimensions.
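A small NumPy sketch of these scoring functions, where z is the decoder state and h stacks the encoder states row-wise (the weight-matrix shapes are assumptions for illustration); the attention distribution is then softmax over the scores, and the output is the weighted sum of the h_i:

```python
import numpy as np

def dot_product_score(z, h):
    # e_i = z . h_i  -- requires z and h_i to have the same size
    return h @ z

def multiplicative_score(z, h, W):
    # e_i = z^T W h_i, with W of shape (dim_z, dim_h)
    return h @ W.T @ z

def additive_score(z, h, W1, W2, v):
    # e_i = v^T tanh(W1 h_i + W2 z), W1: (dim_a, dim_h), W2: (dim_a, dim_z)
    return np.tanh(h @ W1.T + z @ W2.T) @ v

def scaled_dot_product_score(z, h):
    # e_i = z . h_i / sqrt(d)  -- avoids tiny softmax gradients for large d
    return h @ z / np.sqrt(z.shape[-1])
```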
(figure credit: Jay Alammar http://jalammar.github.io/illustrated-transformer/)
“Thinking” as the query
(figure credit: Jay Alammar http://jalammar.github.io/illustrated-transformer/)
For each word, we create different vectors for its query, key, and value.
An attention function maps a query q and a set of key–value pairs (k_i, v_i) to an output; the word vectors interact with each other through these queries, keys, and values.
A(Q, K, V) = softmax(QK⊺)V
Q ∈ ℝ^(n_q×d), K, V ∈ ℝ^(n×d)
A(q, K, V) = ∑_i ( exp(q⋅k_i) / ∑_j exp(q⋅k_j) ) v_i
where K, V ∈ ℝ^(n×d) and q ∈ ℝ^d
A(Q, K, V) = softmax( QK⊺ / √d ) V
A(XW^Q, XW^K, XW^V) ∈ ℝ^(n×d)
W^Q, W^K, W^V ∈ ℝ^(d_in×d)
A(Q, K, V) = Concat(head_1, …, head_h) W^O
head_i = A(XW^Q_i, XW^K_i, XW^V_i)
In practice, h = 8, d = d_out/h, and W^O ∈ ℝ^(d_out×d_out)
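A minimal NumPy sketch of scaled dot-product and multi-head self-attention following the formulas above (per-head weight matrices are kept as plain lists for clarity):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # A(Q, K, V) = softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def multi_head_self_attention(X, WQ, WK, WV, WO):
    # head_i = A(X WQ_i, X WK_i, X WV_i); concatenate heads, project with WO
    heads = [attention(X @ wq, X @ wk, X @ wv)
             for wq, wk, wv in zip(WQ, WK, WV)]
    return np.concatenate(heads, axis=-1) @ WO

# h = 8 heads, per-head dimension d = d_out / h
n, d_out, h = 6, 512, 8
d = d_out // h
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d_out))
WQ, WK, WV = ([rng.normal(size=(d_out, d)) for _ in range(h)] for _ in range(3))
WO = rng.normal(size=(d_out, d_out))
print(multi_head_self_attention(X, WQ, WK, WV, WO).shape)  # (6, 512)
```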
Each sub-layer is wrapped in a residual connection followed by layer normalization: LayerNorm(x + SubLayer(x))
(Ba et al., 2016): Layer Normalization
(figure credit: Jay Alammar http://jalammar.github.io/illustrated-transformer/)
LayerNorm: normalize the activations to mean 0 and variance 1 per layer.
(Ba et al, 2016): Layer Normalization
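A small NumPy sketch of this residual + layer-normalization pattern (the learned gain/bias parameters of LayerNorm are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each position's activations to mean 0 / variance 1
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    # LayerNorm(x + SubLayer(x)) -- applied after every attention / FFN sub-layer
    return layer_norm(x + sublayer(x))
```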
Positional encodings: t = position, d = embedding dimension, i = embedding index (0 to d−1)
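The slide only lists the variable definitions; below is a sketch of the standard sinusoidal positional encodings from Vaswani et al. (2017) using those variables (the exact indexing convention here is an assumption):

```python
import numpy as np

def positional_encoding(max_len, d):
    # PE[t, 2i]   = sin(t / 10000^(2i/d))
    # PE[t, 2i+1] = cos(t / 10000^(2i/d))
    t = np.arange(max_len)[:, None]          # positions
    i = np.arange(d // 2)[None, :]           # dimension-pair index
    angles = t / np.power(10000.0, 2 * i / d)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe                                # added to the token embeddings

print(positional_encoding(max_len=50, d=512).shape)  # (50, 512)
```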
In encoder-decoder attention, the queries come from the previous decoder layer, and the keys and values come from the output of the encoder.
The decoder's self-attention attends over the previously generated outputs.
(figure credit: Jay Alammar http://jalammar.github.io/illustrated-gpt2/)
Can we remove attention heads from the model during test time?
Are Sixteen Heads Really Better than One? Michel, Levy, and Neubig, NeurIPS 2019
3 types of attention: Enc-Enc, Enc-Dec, Dec-Dec; 6 layers, 16 heads per layer for each type
Could we get away with a model with fewer heads?
[Figure: RNN vs. Transformer — stacked Transformer layers (layer 1 → layer 2 → layer 3) encoding “the movie was terribly exciting !”]
PyTorch modules:
nn.Transformer
nn.TransformerEncoder
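A minimal usage sketch of PyTorch's built-in encoder (the hyperparameters here are just illustrative):

```python
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

# input is (seq_len, batch, d_model) by default; e.g. a 6-token sentence, batch of 1
src = torch.rand(6, 1, 512)
out = encoder(src)
print(out.shape)  # torch.Size([6, 1, 512])
```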
The Annotated Transformer:
http://nlp.seas.harvard.edu/2018/04/03/attention.html
A Jupyter notebook which explains how Transformer works line by line in PyTorch!
(slide credit: Stanford CS224N, Chris Manning)
Have fun using ELMo or BERT in your final project :)