CSEP 517 Natural Language Processing
Contextualized Word Embeddings
Luke Zettlemoyer
(Slides adapted from Danqi Chen, Chris Manning, Abigail See, Andrej Karpathy)
Contextualized Word Representations
BERT = Bidirectional Encoder Representations from Transformers
ELMo = Embeddings from Language Models
[Slide figure: static word vectors, e.g. word = “sweden”, v(cat) = (−0.224, 0.130, −0.290, 0.276), v(bank) — one fixed vector per word type]
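To make the limitation of static embeddings concrete, here is a minimal sketch (the table and its vector values are made up for illustration): a static embedding is just a lookup from word *type* to a single vector, so “bank” gets the same vector whether it means a riverbank or a financial institution.

```python
# Toy static embedding table (hypothetical vectors): one vector per word type.
static_emb = {
    "bank": [-0.224, 0.130, -0.290, 0.276],
    "river": [0.105, -0.411, 0.082, 0.337],
    "loan": [0.298, 0.051, -0.187, -0.092],
}

def embed(sentence):
    """Look up one fixed vector per known token, ignoring context."""
    return [static_emb[w] for w in sentence.split() if w in static_emb]

# "bank" gets the identical vector in both sentences, even though it
# means a riverbank in one and a financial institution in the other.
v1 = embed("the river bank was muddy")[1]
v2 = embed("the bank approved the loan")[0]
assert v1 == v2
```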
A word’s meaning depends on its context: its usage, behavior, and connotations.
Let’s build a vector for each word conditioned on its context!
Example: “the movie was terribly exciting !”
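A toy illustration of the idea (this is *not* ELMo, just a sketch with made-up vectors): conditioning each token’s vector on its neighbors, e.g. by averaging over a small window, already makes the same word come out differently in different sentences.

```python
# Toy contextualization (hypothetical 2-d vectors, not a real model):
# mix each token's vector with its immediate neighbors.
emb = {"the": [0.1, 0.0], "movie": [0.4, 0.2], "was": [0.0, 0.1],
       "terribly": [-0.5, 0.3], "exciting": [0.6, 0.7],
       "boring": [-0.4, -0.6], "!": [0.0, 0.0]}

def contextual(sentence):
    vecs = [emb[t] for t in sentence.split()]
    out = []
    for i in range(len(vecs)):
        window = vecs[max(0, i - 1): i + 2]          # token plus neighbors
        out.append([sum(c) / len(window) for c in zip(*window)])
    return out

# "terribly" now gets a different vector depending on the word after it.
v_exciting = contextual("the movie was terribly exciting !")[3]
v_boring   = contextual("the movie was terribly boring !")[3]
assert v_exciting != v_boring
```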
Contextualized word embeddings
(Peters et al, 2018): Deep contextualized word representations
Train a language model on a large corpus, then use its hidden states to compute a vector representation of each word in context.
ELMo collapses all biLM layers into one vector with task-specific weights:

ELMo_k^{task} = γ^{task} · Σ_{j=0}^{L} s_j^{task} · h_{k,j}^{LM}

where k indexes the words in the sentence, L is the number of layers, s^{task} are softmax-normalized layer weights, and γ^{task} is a learned scalar. The layer representations are

h_{k,0}^{LM} = x_k^{LM} (the token input), h_{k,j}^{LM} = [→h_{k,j}^{LM}; ←h_{k,j}^{LM}] (forward and backward LSTM states concatenated, with a projection to the next layer’s input).

Freeze the biLM weights and change the task model’s input representation to [x_k; ELMo_k^{task}] (could also insert into higher layers).
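The weighted combination above can be sketched in a few lines of numpy (shapes and values are illustrative; in ELMo the weights s and scale γ are learned per task):

```python
import numpy as np

# Sketch of ELMo's task-specific layer combination:
#   ELMo_k = gamma * sum_j softmax(s)_j * h_{k,j}
L, d = 2, 4                        # number of biLM layers, hidden size
rng = np.random.default_rng(0)
h = rng.normal(size=(L + 1, d))    # h[0] = token input x_k, h[1..L] = biLSTM layers

s = np.array([0.1, 0.5, -0.2])     # learned scalar weight per layer (illustrative)
gamma = 1.3                        # learned task-specific scale (illustrative)

w = np.exp(s) / np.exp(s).sum()    # softmax-normalize the layer weights
elmo_k = gamma * (w[:, None] * h).sum(axis=0)
assert elmo_k.shape == (d,)
```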
Some tasks do better with the first biLM layer, others with the second: syntactic information is better represented at lower layers, while semantic information is captured at higher layers.
https://allennlp.org/elmo
Also available in TensorFlow
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
How is BERT different from ELMo?
#1. Unidirectional context vs. bidirectional context
#2. LSTMs vs. Transformers (more on this later)
#3. The weights are not frozen; they are fine-tuned on the downstream task
(ELMo used two independent LMs, one in each direction.)
Why are LMs unidirectional?
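The difference is easy to see as an attention mask (a sketch, not BERT’s actual code): a left-to-right LM uses a causal, lower-triangular mask so position i can only condition on tokens up to i, while a bidirectional encoder lets every position see the whole sentence — which is exactly why a bidirectional model cannot be trained as an ordinary next-word LM.

```python
import numpy as np

T = 4
causal = np.tril(np.ones((T, T)))   # unidirectional LM: no access to future tokens
bidirectional = np.ones((T, T))     # BERT-style encoder: every token visible

# Position 1 sees only tokens 0..1 under the causal mask...
assert causal[1].tolist() == [1.0, 1.0, 0.0, 0.0]
# ...but sees the entire sentence bidirectionally.
assert bidirectional[1].tolist() == [1.0, 1.0, 1.0, 1.0]
```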
BERT’s masked language model: randomly mask some input tokens and train the model to predict the masked words.
A little more complication: because [MASK] is never seen when the fine-tuned BERT is used, the 15% of tokens selected for prediction are replaced with [MASK] only 80% of the time; 10% are replaced with a random token, and 10% are left unchanged.
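The 80/10/10 heuristic can be sketched as follows (function name and the toy vocabulary are my own; BERT operates on WordPiece ids rather than strings):

```python
import random

def corrupt(tokens, vocab, mask_prob=0.15, rng=None):
    """Select ~15% of tokens to predict; corrupt them 80/10/10."""
    rng = rng or random.Random(0)
    out, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok                 # model must predict the original here
            r = rng.random()
            if r < 0.8:
                out[i] = "[MASK]"            # 80%: replace with [MASK]
            elif r < 0.9:
                out[i] = rng.choice(vocab)   # 10%: replace with a random token
            # else 10%: keep the token unchanged
    return out, targets

tokens = "the man went to the store to buy a gallon of milk".split()
corrupted, targets = corrupt(tokens, vocab=["dog", "ran", "blue"])
assert len(corrupted) == len(tokens)
```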
Next sentence prediction (NSP): always sample two sentences and predict whether the second sentence actually follows the first.
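NSP pair construction can be sketched like this (a simplification under my own naming; real BERT packs the pair as [CLS] A [SEP] B [SEP] with segment embeddings):

```python
import random

def sample_pair(doc, corpus, i, rng=None):
    """50%: true next sentence (label 1); 50%: random sentence (label 0)."""
    rng = rng or random.Random(0)
    first = doc[i]
    if rng.random() < 0.5 and i + 1 < len(doc):
        return first, doc[i + 1], 1          # actual next sentence
    return first, rng.choice(corpus), 0      # random distractor sentence

doc = ["the man went to the store .", "he bought a gallon of milk ."]
corpus = doc + ["penguins are flightless birds ."]
first, second, label = sample_pair(doc, corpus, 0)
assert label in (0, 1)
```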
Recent papers show that NSP is not necessary…
(Joshi*, Chen* et al., 2019): SpanBERT: Improving Pre-training by Representing and Predicting Spans
(Liu et al., 2019): RoBERTa: A Robustly Optimized BERT Pretraining Approach
Pre-training → fine-tuning
Key idea: all the weights are fine-tuned on downstream tasks
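A toy numpy contrast of the two regimes (not real BERT; parameter names, shapes, and the fake unit gradients are mine): the feature-based approach, as in ELMo, updates only the task head, while fine-tuning takes a gradient step on every weight, pretrained encoder included.

```python
import numpy as np

rng = np.random.default_rng(0)
params = {"encoder.W": rng.normal(size=(4, 4)),   # pretrained weights
          "head.W": rng.normal(size=(4, 2))}      # new task-specific head
grads = {k: np.ones_like(v) for k, v in params.items()}  # pretend gradients
lr = 0.1

def step(params, grads, trainable):
    """One SGD step, updating only the named trainable parameters."""
    return {k: v - lr * grads[k] if k in trainable else v
            for k, v in params.items()}

feature_based = step(params, grads, trainable={"head.W"})      # ELMo-style
fine_tuned = step(params, grads, trainable=set(params))        # BERT-style

assert np.allclose(feature_based["encoder.W"], params["encoder.W"])   # frozen
assert not np.allclose(fine_tuned["encoder.W"], params["encoder.W"])  # updated
```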
(Wang et al, 2018): GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
BiLSTM baseline GLUE score: 63.9
TensorFlow: https://github.com/google-research/bert
PyTorch: https://github.com/huggingface/transformers