

SLIDE 1

Contextualized Word Embeddings

Spring 2020

2020-03-17

CMPT 825: Natural Language Processing

SFU NatLangLab

Adapted from slides from Danqi Chen and Karthik Narasimhan (with some content from slides from Chris Manning and Abigail See)

SLIDE 2

Course Logistics

  • Online lectures from now on! Everyone stay safe!
  • HW4 due Tuesday 3/24
  • Project Milestone due Tuesday 3/31
SLIDE 3

Course Logistics

Remaining lectures (tentative):

  • Contextual word embeddings and Transformers
  • Parsing:
      • Dependency Parsing
      • Constituency Parsing
      • Semantic Parsing
  • CNNs for NLP
  • Applications: Question Answering, Dialogue, Coreference, Grounding

SLIDE 4

Overview

Contextualized Word Representations:

  • ELMo = Embeddings from Language Models
  • BERT = Bidirectional Encoder Representations from Transformers
SLIDE 5

Overview

  • Transformers
SLIDE 6

Recap: word2vec

(Figure: nearest neighbors of the word “sweden” in word2vec embedding space.)

SLIDE 7

What’s wrong with word2vec?

  • One vector for each word type, e.g.

        cat = (−0.224, 0.130, −0.290, 0.276)

    so a word like bank gets a single vector v(bank) regardless of context.

  • Complex characteristics of word use: semantics, syntactic behavior, and connotations
  • Polysemous words, e.g., bank, mouse
SLIDE 8

Contextualized word embeddings

Let’s build a vector for each word conditioned on its context!

(Figure: the example sentence “the movie was terribly exciting !”, with one contextualized vector per word.)

f : (w1, w2, …, wn) ⟶ x1, …, xn ∈ ℝ^d

SLIDE 9

Contextualized word embeddings

(Peters et al., 2018): Deep contextualized word representations (the ELMo paper)

SLIDE 10

ELMo

  • NAACL’18: Deep contextualized word representations
  • Key idea:
      • Train an LSTM-based language model on some large corpus
      • Use the hidden states of the LSTM for each token to compute a vector representation of each word

SLIDE 11

ELMo

(figure credit: Jay Alammar http://jalammar.github.io/illustrated-bert/)

(Figure: pretraining the LM. A forward LM and a backward LM, with LSTM parameters, read the input tokens of the sentence and predict the next/previous token through a softmax layer. Example input: “Let’s stick to improvisation in this skit”.)

SLIDE 12

(figure credit: Jay Alammar http://jalammar.github.io/illustrated-bert/)

ELMo

After training the LM, to get the ELMo embedding of a word (“stick”): concatenate the forward and backward embeddings and take a weighted sum of the layers.

SLIDE 13

(figure credit: Jay Alammar http://jalammar.github.io/illustrated-bert/)

ELMo

To get the ELMo embedding of a word (“stick”): concatenate the forward and backward embeddings and take a weighted sum of the layers with softmax-normalized weights s_j. The LM weights are frozen; the mixing weights are trained on the specific task.

SLIDE 14

Summary: How to get the ELMo embedding?

    ELMo_k^task = γ^task ∑_{j=0}^{L} s_j^task h_{k,j}^LM

where L is the number of layers, h_{k,0}^LM = x_k^LM is the token representation, and h_{k,j}^LM = [→h_{k,j}^LM ; ←h_{k,j}^LM] concatenates the forward and backward LSTM hidden states.

  • γ^task: allows the task model to scale the entire ELMo vector
  • s_j^task: softmax-normalized weights across layers
  • To use: plug ELMo into any (neural) NLP model: freeze all the LM weights and change the input representation to [x_k; ELMo_k^task] (could also insert into higher layers)
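
As a concrete illustration, here is a minimal PyTorch sketch of this task-specific layer mixing, assuming the frozen biLM activations have already been extracted (the module name ELMoMix is ours, not from the paper):

    import torch
    import torch.nn as nn

    class ELMoMix(nn.Module):
        """Task-specific weighted sum over the L+1 frozen biLM layers."""
        def __init__(self, num_layers: int):
            super().__init__()
            self.s = nn.Parameter(torch.zeros(num_layers))  # s^task, softmax-normalized below
            self.gamma = nn.Parameter(torch.ones(1))        # γ^task, scales the whole vector

        def forward(self, layer_reps):
            # layer_reps: (L+1, seq_len, 2*hidden), frozen biLM activations
            w = torch.softmax(self.s, dim=0)
            return self.gamma * (w.view(-1, 1, 1) * layer_reps).sum(dim=0)

    mix = ELMoMix(num_layers=3)          # ELMo: token layer + 2 LSTM layers
    reps = torch.randn(3, 10, 1024)      # dummy activations for a 10-token sentence
    elmo_embedding = mix(reps)           # (10, 1024)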

SLIDE 15

More details

  • Forward and backward LMs: 2 layers each
  • Use a character CNN to build the initial word representation
      • 2048 char n-gram filters and 2 highway layers, 512-dim projection
  • Use 4096-dim hidden/cell LSTM states with 512-dim projections to the next input
  • A residual connection from the first to the second layer
  • Trained 10 epochs on the 1B Word Benchmark
SLIDE 16

Experimental results

  • SQuAD: question answering
  • SNLI: natural language inference
  • SRL: semantic role labeling
  • Coref: coreference resolution
  • NER: named entity recognition
  • SST-5: sentiment analysis
SLIDE 17

Intrinsic Evaluation

  • First layer > second layer: syntactic information
  • Second layer > first layer: semantic information

Syntactic information is better represented at lower layers, while semantic information is captured at higher layers.

SLIDE 18

Use ELMo in practice

https://allennlp.org/elmo

Also available in TensorFlow
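
A minimal usage sketch, assuming the AllenNLP releases contemporary with this course; the option/weight file paths below are placeholders for the files linked on that page:

    # pip install allennlp
    from allennlp.modules.elmo import Elmo, batch_to_ids

    options_file = "elmo_options.json"   # placeholder; download from allennlp.org/elmo
    weight_file = "elmo_weights.hdf5"    # placeholder

    # num_output_representations = how many independently weighted ELMo mixes to return
    elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0.0)

    sentences = [["the", "movie", "was", "terribly", "exciting", "!"]]
    character_ids = batch_to_ids(sentences)  # char-CNN inputs, not word ids
    embeddings = elmo(character_ids)["elmo_representations"][0]  # (1, 6, 1024)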

SLIDE 19

BERT

  • NAACL’19: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  • First released in Oct 2018

How is BERT different from ELMo?
  • #1. Unidirectional context vs. bidirectional context
  • #2. LSTMs vs. Transformers (more on this later)
  • #3. The weights are not frozen; this is called fine-tuning

SLIDE 20

Bidirectional encoders

  • Language models only use left context or right context (although ELMo used two independent LMs, one from each direction)
  • But language understanding is bidirectional

Why are LMs unidirectional?

SLIDE 21

Bidirectional encoders

  • Language models only use left context or right context (although ELMo used two independent LMs, one from each direction)
  • But language understanding is bidirectional
SLIDE 22

Masked language models (MLMs)

  • Solution: mask out 15% of the input words, and then predict the masked words
  • Too little masking: too expensive to train
  • Too much masking: not enough context
SLIDE 23

Masked language models (MLMs)

A little more complex (don’t always replace with [MASK]), because [MASK] is never seen when BERT is used downstream.

Example: “my dog is hairy”; we select the word hairy:

  • 80% of the time: replace the word with the [MASK] token → my dog is [MASK]
  • 10% of the time: replace the word with a random word → my dog is apple
  • 10% of the time: keep the word unchanged, to bias the representation toward the actual observed word → my dog is hairy
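
A small Python sketch of this 80/10/10 rule, simplified to plain word tokens (real BERT operates on word pieces and skips special tokens):

    import random

    def mask_tokens(tokens, vocab, select_prob=0.15):
        """Select ~15% of positions; of those: 80% -> [MASK],
        10% -> random word, 10% -> keep the original word."""
        inputs, targets = list(tokens), [None] * len(tokens)
        for i, tok in enumerate(tokens):
            if random.random() < select_prob:
                targets[i] = tok                      # only selected positions are predicted
                r = random.random()
                if r < 0.8:
                    inputs[i] = "[MASK]"              # "my dog is [MASK]"
                elif r < 0.9:
                    inputs[i] = random.choice(vocab)  # "my dog is apple"
                # else: keep the word unchanged       # "my dog is hairy"
        return inputs, targets

    vocab = ["my", "dog", "is", "hairy", "apple"]
    print(mask_tokens(["my", "dog", "is", "hairy"], vocab))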

SLIDE 24

Next sentence prediction (NSP)

Always sample two sentences, and predict whether the second sentence follows the first one.

Recent papers show that NSP is not necessary…

(Joshi*, Chen*, et al., 2019): SpanBERT: Improving Pre-training by Representing and Predicting Spans
(Liu et al., 2019): RoBERTa: A Robustly Optimized BERT Pretraining Approach

SLIDE 25

Pre-training and fine-tuning

(Figure: pre-training on the left, fine-tuning on the right.) Key idea: all the weights are fine-tuned on downstream tasks.

(figure credit: Jay Alammar http://jalammar.github.io/illustrated-bert/)

SLIDE 26

Applications

(figure credit: Jay Alammar http://jalammar.github.io/illustrated-bert/)

SLIDE 27

More details

  • Input representations
      • Use word pieces instead of words: playing => play ##ing
  • Trained 40 epochs on Wikipedia (2.5B tokens) + BookCorpus (0.8B tokens)
  • Released two model sizes: BERT_base, BERT_large
SLIDE 28

Experimental results

(Wang et al, 2018): GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

(Figure: GLUE benchmark scores; the BiLSTM baseline scores 63.9.)

SLIDE 29

Use BERT in practice

TensorFlow: https://github.com/google-research/bert
PyTorch: https://github.com/huggingface/transformers
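
For example, extracting contextualized vectors with the HuggingFace library (a sketch against recent versions of the transformers API):

    # pip install transformers
    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")

    inputs = tokenizer("the movie was terribly exciting !", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # one contextualized vector per word piece: (batch, seq_len, 768) for BERT_base
    print(outputs.last_hidden_state.shape)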

SLIDE 30

Contextualized word embeddings in context

  • TagLM (Peters et al., 2017)
  • CoVe (McCann et al., 2017)
  • ULMfit (Howard and Ruder, 2018)
  • ELMo (Peters et al., 2018)
  • OpenAI GPT (Radford et al., 2018)
  • BERT (Devlin et al., 2018)
  • OpenAI GPT-2 (Radford et al., 2019)
  • XLNet (Yang et al., 2019)
  • SpanBERT (Joshi et al., 2019)
  • RoBERTa (Liu et al., 2019)
  • ALBERT (Lan et al., 2019)

https://github.com/huggingface/transformers

SLIDE 31

SLIDE 32

Transformers

SLIDE 33

Transformers

  • NIPS’17: Attention is All You Need
  • Originally proposed for NMT (encoder-decoder framework)
  • Used as the base model of BERT (encoder only)
  • Key idea: multi-head self-attention
  • No recurrent structure anymore, so it trains much faster

(Figure: the encoder-decoder stack.)

SLIDE 34

Multi-head self-attention

Three ingredients: scaled dot-product attention, self-attention, and multiple heads.
SLIDE 35

Recall: attention

(slide credit: Abigail See)

SLIDE 36

Self Attention

(also referred to as Intra-Attention)

  • Self-attention: use each word as a query and compute the attention with all the other words (= the word vectors themselves select each other)

SLIDE 37

General definition of attention

  • Attention is a general concept
  • Given a query q and a set of key-value pairs (K, V):
      • Attention is a way to compute a weighted sum of the values, dependent on the query and the corresponding keys
      • The query determines which values to focus on; the query “attends” to the values
  • Queries, keys, and values are all represented as vectors
  • These vectors are created by multiplying the embedding by trained weight matrices

SLIDE 38
  • Query, key, and value vectors are created by multiplying the embedding by learned weight matrices
  • The scoring can be any kind of attention function
  • For Transformers, this is the scaled dot-product attention

(figure credit: Jay Alammar http://jalammar.github.io/illustrated-transformer/)

SLIDE 39

Recall: types of attention

Assume encoder hidden states h1, h2, …, hn and decoder hidden state z.

  • 1. Dot-product attention (assumes equal dimensions for hi and z):

        g(hi, z) = z⊺hi ∈ ℝ

    Simplest (no extra parameters), but requires z and hi to be the same size.

  • 2. Bilinear / multiplicative attention:

        g(hi, z) = z⊺Whi ∈ ℝ, where W is a weight matrix

    More flexible than dot-product (W is trainable); efficient (matrix multiplication).

  • 3. Additive attention (essentially an MLP):

        g(hi, z) = v⊺ tanh(W1hi + W2z) ∈ ℝ, where W1, W2 are weight matrices and v is a weight vector

    Performs better for larger dimensions.
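
A sketch of the three scoring functions side by side, for a single query z scored against every hi (shapes are our own illustrative choices):

    import torch

    def dot_product(h, z):                     # h: (n, d), z: (d,)
        return h @ z                           # (n,) scores, one per h_i

    def bilinear(h, z, W):                     # W: (d_z, d_h), trainable
        return h @ (W.t() @ z)                 # z^T W h_i for each i

    def additive(h, z, W1, W2, v):             # W1: (k, d_h), W2: (k, d_z), v: (k,)
        return torch.tanh(h @ W1.t() + W2 @ z) @ v  # (n,)

    h, z = torch.randn(5, 64), torch.randn(64)
    scores = dot_product(h, z)                 # then softmax over the 5 scores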

SLIDE 40

Scaled dot-product attention

Assume encoder hidden states h1, h2, …, hn and decoder hidden state z.

  • 1. Dot-product attention (assumes equal dimensions for hi and z):

        g(hi, z) = z⊺hi ∈ ℝ

    Performs poorly for large d: the softmax has small gradients.

  • 2. Scaled dot-product attention:

        g(hi, z) = z⊺hi / √d ∈ ℝ

    Scaling factor √d, where d = dimension of the hidden state; maybe this will perform well for larger dimensions too.
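
A minimal PyTorch sketch of the scaled variant, with queries, keys, and values as rows of matrices (no masking):

    import math
    import torch

    def scaled_dot_product_attention(q, k, v):
        """A(Q, K, V) = softmax(QK^T / sqrt(d)) V."""
        d = q.size(-1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(d)  # (n_q, n)
        weights = torch.softmax(scores, dim=-1)          # one distribution per query
        return weights @ v                               # weighted sum of the values

    q = torch.randn(4, 64)   # 4 queries of dimension d = 64
    k = torch.randn(10, 64)  # 10 key-value pairs
    v = torch.randn(10, 64)
    out = scaled_dot_product_attention(q, k, v)  # (4, 64)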

SLIDE 41
  • The scoring can be any kind of attention function
  • For Transformers, this is the scaled dot-product attention
  • The output is the final vector of attended values for “Thinking” as the query

(figure credit: Jay Alammar http://jalammar.github.io/illustrated-transformer/)

SLIDE 42

Multiple heads

(figure credit: Jay Alammar http://jalammar.github.io/illustrated-transformer/)

  • Multiple (different) representations for each query, key, and value
  • Different weight matrices → different vectors
  • Different ways for the words to interact with each other

SLIDE 43

Summary: Multi-head Self Attention

  • Attention: maps a query q and a set of key-value pairs (ki, vi) to an output
  • Dot-product attention, for K, V ∈ ℝ^{n×d} and q ∈ ℝ^d:

        A(q, K, V) = ∑i (e^{q⋅ki} / ∑j e^{q⋅kj}) vi

  • If we have multiple queries, Q ∈ ℝ^{nQ×d}, K, V ∈ ℝ^{n×d}:

        A(Q, K, V) = softmax(QK⊺)V

  • Self-attention: use each word as a query and compute the attention with all the other words (= the word vectors themselves select each other)

SLIDE 44

Summary: Multi-head Self Attention

  • Scaled dot-product attention:

        A(Q, K, V) = softmax(QK⊺ / √d) V

  • Input X ∈ ℝ^{n×din}; with WQ, WK, WV ∈ ℝ^{din×d}:

        A(XWQ, XWK, XWV) ∈ ℝ^{n×d}

  • Multi-head attention: using more than one head is always useful:

        MultiHead(Q, K, V) = Concat(head1, …, headh)WO, where headi = A(XWQ_i, XWK_i, XWV_i)

    In practice, h = 8, d = dout/h, and WO ∈ ℝ^{dout×dout}.
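
Putting the shapes together, a compact PyTorch sketch of multi-head self-attention (no masking or dropout; the class name is ours):

    import math
    import torch
    import torch.nn as nn

    class MultiHeadSelfAttention(nn.Module):
        def __init__(self, d_in: int, d_out: int, h: int = 8):
            super().__init__()
            assert d_out % h == 0
            self.h, self.d = h, d_out // h                  # d = d_out / h per head
            self.w_q = nn.Linear(d_in, d_out, bias=False)   # stacks W^Q_1 .. W^Q_h
            self.w_k = nn.Linear(d_in, d_out, bias=False)
            self.w_v = nn.Linear(d_in, d_out, bias=False)
            self.w_o = nn.Linear(d_out, d_out, bias=False)  # W^O

        def forward(self, x):                               # x: (n, d_in)
            n = x.size(0)
            # project, then split into h heads: (h, n, d)
            q, k, v = (w(x).view(n, self.h, self.d).transpose(0, 1)
                       for w in (self.w_q, self.w_k, self.w_v))
            scores = q @ k.transpose(-2, -1) / math.sqrt(self.d)   # (h, n, n)
            heads = torch.softmax(scores, dim=-1) @ v              # (h, n, d)
            return self.w_o(heads.transpose(0, 1).reshape(n, -1))  # concat + W^O

    attn = MultiHeadSelfAttention(d_in=512, d_out=512)
    out = attn(torch.randn(6, 512))  # 6 tokens -> (6, 512)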

SLIDE 45

Putting it all together

  • Each Transformer block has two sub-layers:
      • Multi-head attention
      • 2-layer feedforward NN (with ReLU)
  • Each sub-layer has a residual connection and a layer normalization: LayerNorm(x + SubLayer(x))

(Ba et al., 2016): Layer Normalization
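
A sketch of one encoder block following this recipe, using PyTorch's built-in nn.MultiheadAttention; the sizes (d_model = 512, d_ff = 2048, h = 8) are the original paper's base configuration, assumed here:

    import torch
    import torch.nn as nn

    class TransformerBlock(nn.Module):
        def __init__(self, d_model=512, h=8, d_ff=2048):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, h)
            self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                    nn.Linear(d_ff, d_model))  # 2-layer FFN with ReLU
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, x):                 # x: (seq_len, batch, d_model)
            attn_out, _ = self.attn(x, x, x)  # self-attention: Q = K = V = x
            x = self.norm1(x + attn_out)      # LayerNorm(x + SubLayer(x))
            x = self.norm2(x + self.ff(x))    # LayerNorm(x + SubLayer(x))
            return x

    block = TransformerBlock()
    out = block(torch.randn(10, 2, 512))  # 10 tokens, batch of 2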

SLIDE 46

Residual connections and Layer Normalization

(figure credit: Jay Alammar http://jalammar.github.io/illustrated-transformer/)

LayerNorm:

  • changes the input features to have mean 0 and variance 1 per layer
  • adds two trainable parameters (a gain and a bias)

(Ba et al., 2016): Layer Normalization

SLIDE 47

Putting it all together

  • Each Transformer block has two sub-layers:
      • Multi-head attention
      • 2-layer feedforward NN (with ReLU)
  • Each sub-layer has a residual connection and a layer normalization: LayerNorm(x + SubLayer(x))
  • The input layer has a positional encoding

(Ba et al., 2016): Layer Normalization
SLIDE 48

Positional encoding

t = position, d = embedding dimension, i = embedding index (0 to d−1)

(Figure: heatmap of positional encoding values; x-axis: embedding index i, y-axis: position t; values range from −1 to +1.)
SLIDE 49

Putting it all together

  • Each Transformer block has two sub-layers:
      • Multi-head attention
      • 2-layer feedforward NN (with ReLU)
  • Each sub-layer has a residual connection and a layer normalization: LayerNorm(x + SubLayer(x))
  • The input layer has a positional encoding
  • BERT_base: 12 layers, 12 heads, hidden size = 768, 110M parameters
  • BERT_large: 24 layers, 16 heads, hidden size = 1024, 340M parameters
  • Input embedding is byte pair encoding (BPE)

(Ba et al., 2016): Layer Normalization
SLIDE 50

Transformer decoder

  • Encoder-decoder attention: queries come from the previous decoder layer, and keys and values come from the output of the encoder
  • Also 6 layers (in the original paper)
  • Masked decoder self-attention on previously generated outputs
(figure credit: Jay Alammar http://jalammar.github.io/illustrated-gpt2/)

SLIDE 51

Do we need all these heads?

  • Can we prune away some of the heads of a trained model at test time?

Are Sixteen Heads Really Better than One? (Michel, Levy, and Neubig, NeurIPS 2019)

3 types of attention: Enc-Enc, Enc-Dec, Dec-Dec; 6 layers, 16 heads per layer for each type.

SLIDE 52

Do we need all these heads?

  • Can we train a good MT model with fewer heads?

Are Sixteen Heads Really Better than One? (Michel, Levy, and Neubig, NeurIPS 2019)

3 types of attention: Enc-Enc, Enc-Dec, Dec-Dec; 6 layers, 16 heads per layer for each type.

SLIDE 53

RNNs vs Transformers

(Figure: the sentence “the movie was terribly exciting !” processed by an RNN (left) vs. a stack of three Transformer layers (right).)

SLIDE 54

Useful Resources

nn.Transformer

nn.TransformerEncoder

The Annotated Transformer: http://nlp.seas.harvard.edu/2018/04/03/attention.html
A Jupyter notebook which explains how the Transformer works line by line in PyTorch!
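
For example, a 6-layer, 8-head encoder can be assembled directly from these PyTorch modules (sizes follow the original paper):

    import torch
    import torch.nn as nn

    layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048)
    encoder = nn.TransformerEncoder(layer, num_layers=6)

    src = torch.randn(10, 32, 512)  # (seq_len, batch, d_model)
    out = encoder(src)              # (10, 32, 512)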

SLIDE 55

NLP progress so far

SLIDE 56

(slide credit: Stanford CS224N, Chris Manning)

SLIDE 57

(slide credit: Stanford CS224N, Chris Manning)

SLIDE 58

(slide credit: Stanford CS224N, Chris Manning)

SLIDE 59

(slide credit: Stanford CS224N, Chris Manning)

SLIDE 60

Have fun using ELMo or BERT in your final project :)