
SLIDE 1

CSEP 517 Natural Language Processing

Contextualized Word Embeddings

Luke Zettlemoyer

(Slides adapted from Danqi Chen, Chris Manning, Abigail See, Andrej Karpathy)

SLIDE 2

Overview

  • Contextualized Word Representations
  • ELMo = Embeddings from Language Models
  • BERT = Bidirectional Encoder Representations from Transformers
SLIDE 3

Recap: Transformer

  • Transformers
SLIDE 4

Recap: word2vec

word = “sweden”
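The original slide shows the nearest neighbours of "sweden" under a pretrained word2vec model. A minimal sketch of that kind of query with gensim, assuming a word2vec-format vector file (the file path is a placeholder, not from the slides):

```python
# Minimal sketch: nearest neighbours under a pretrained word2vec model (gensim).
# The file path is a placeholder; any word2vec-format vectors work.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

# One static vector per word *type*; similarity is plain cosine similarity.
for neighbour, score in vectors.most_similar("sweden", topn=5):
    print(f"{neighbour}\t{score:.3f}")
```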

SLIDE 5

What’s wrong with word2vec?

  • One vector for each word type, e.g.

    cat = (−0.224, 0.130, −0.290, 0.276, …)

  • Complex characteristics of word use: semantics, syntactic behavior, and connotations
  • Polysemous words, e.g., bank, mouse: there is only one v(bank) for all senses
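A minimal illustration of the problem (toy vectors, not real word2vec numbers): a type-level lookup table returns the same vector for "bank" no matter what sentence it appears in, which is exactly what the contextualized embeddings on the next slides address.

```python
# Toy illustration: a static (type-level) embedding table is context-blind.
# The numbers are made up; real word2vec vectors behave the same way.
static_emb = {
    "bank": [-0.224, 0.130, -0.290, 0.276],
    "river": [0.410, -0.080, 0.133, -0.502],
    "account": [-0.051, 0.337, 0.260, 0.094],
}

sent1 = ["the", "river", "bank"]       # geographic sense
sent2 = ["my", "bank", "account"]      # financial sense

v1 = static_emb["bank"]  # same vector ...
v2 = static_emb["bank"]  # ... regardless of the surrounding words
assert v1 == v2
```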
SLIDE 6

Contextualized word embeddings

Let’s build a vector for each word conditioned on its context!

Example sentence: the movie was terribly exciting !

f : (w1, w2, …, wn) ⟶ x1, …, xn,  each xi ∈ ℝ^d

SLIDE 7

Contextualized word embeddings

(Peters et al., 2018): Deep contextualized word representations

SLIDE 8

ELMo

  • NAACL’18: Deep contextualized word representations
  • Key idea:
    • Train an LSTM-based language model on some large corpus
    • Use the hidden states of the LSTM for each token to compute a vector representation of each word (see the sketch below)
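A minimal PyTorch-style sketch of that idea, assuming a toy vocabulary and single-layer LSTMs (the real ELMo uses a character-CNN input layer and two LSTM layers per direction, as described on a later slide):

```python
# Minimal sketch of an ELMo-style biLM in PyTorch (toy sizes, not the real model).
import torch
import torch.nn as nn

class ToyBiLM(nn.Module):
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.fwd = nn.LSTM(dim, dim, batch_first=True)   # reads left-to-right
        self.bwd = nn.LSTM(dim, dim, batch_first=True)   # reads right-to-left
        self.out = nn.Linear(dim, vocab_size)            # softmax over the vocabulary

    def forward(self, token_ids):
        x = self.emb(token_ids)                          # (batch, seq, dim)
        h_fwd, _ = self.fwd(x)
        h_bwd, _ = self.bwd(torch.flip(x, dims=[1]))
        h_bwd = torch.flip(h_bwd, dims=[1])
        # The per-token hidden states from both directions are the contextualized
        # representations; concatenating them gives one vector per token.
        reps = torch.cat([h_fwd, h_bwd], dim=-1)         # (batch, seq, 2 * dim)
        logits = self.out(h_fwd)                         # forward LM predicts the next word
        # (the backward LM would analogously predict the previous word)
        return reps, logits

tokens = torch.randint(0, 1000, (1, 5))                  # a fake 5-token sentence
reps, _ = ToyBiLM(vocab_size=1000)(tokens)
print(reps.shape)                                        # torch.Size([1, 5, 128])
```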

SLIDE 9

ELMo

(Figure: the LSTM language model; diagram labels: input, softmax, # words in the sentence)

SLIDE 10

How to use ELMo?

  • Plug ELMo into any (neural) NLP model: freeze all the LM weights and change the input representation to the weighted combination below (it can also be inserted into higher layers):

    ELMo_k^task = γ^task · Σ_{j=0}^{L} s_j^task · h_{k,j}^LM        (L = # of layers)

    where h_{k,0}^LM = x_k^LM (the context-independent token representation) and
    h_{k,j}^LM = [→h_{k,j}^LM ; ←h_{k,j}^LM] (the concatenated forward and backward LSTM states)

  • γ^task: allows the task model to scale the entire ELMo vector
  • s_j^task: softmax-normalized weights across layers
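A minimal sketch of that combination step, assuming the biLM's per-layer, per-token hidden states are already stacked into one tensor (the class and variable names here are illustrative, not from AllenNLP):

```python
# Sketch of the ELMo mixing step: a task-specific softmax over layers plus a scalar.
import torch
import torch.nn as nn

class ELMoMixer(nn.Module):
    def __init__(self, num_layers):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(num_layers))   # s^task, softmax-normalized below
        self.gamma = nn.Parameter(torch.ones(()))        # γ^task, scales the whole vector

    def forward(self, layer_states):
        # layer_states: (num_layers, seq_len, dim), i.e. h_{k,j}^LM for j = 0..L
        weights = torch.softmax(self.s, dim=0)           # one weight per layer
        mixed = (weights[:, None, None] * layer_states).sum(dim=0)
        return self.gamma * mixed                        # (seq_len, dim)

# Toy usage: 3 layers (j = 0, 1, 2), a 5-token sentence, 1024-dim states.
states = torch.randn(3, 5, 1024)
elmo_vectors = ELMoMixer(num_layers=3)(states)
print(elmo_vectors.shape)                                # torch.Size([5, 1024])
```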

SLIDE 11

More details

  • Forward and backward LMs: 2 layers each
  • Use a character CNN to build the initial word representation
    • 2048 char n-gram filters and 2 highway layers, 512-dim projection
  • 4096-dim hidden/cell LSTM states with 512-dim projections to the next input
  • A residual connection from the first to the second layer
  • Trained 10 epochs on the 1B Word Benchmark
SLIDE 12

Experimental results

  • SQuAD: question answering
  • SNLI: natural language inference
  • SRL: semantic role labeling
  • Coref: coreference resolution
  • NER: named entity recognition
  • SST-5: sentiment analysis
SLIDE 13

Intrinsic Evaluation

(Table: tasks for which the first layer works better vs. tasks for which the second layer works better)

Syntactic information is better represented at lower layers, while semantic information is captured at higher layers.

SLIDE 14

Use ELMo in practice

https://allennlp.org/elmo

Also available in TensorFlow
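A minimal usage sketch with AllenNLP's Elmo module (module and function names as in allennlp.modules.elmo; the options/weights paths are placeholders for the files distributed on the page above):

```python
# Sketch of computing ELMo vectors with AllenNLP.
from allennlp.modules.elmo import Elmo, batch_to_ids

# Placeholder paths: substitute the option/weight files linked from allennlp.org/elmo.
options_file = "elmo_options.json"
weight_file = "elmo_weights.hdf5"

# num_output_representations=1 gives one mixed (γ- and s-weighted) vector per token.
elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0.0)

sentences = [["The", "movie", "was", "terribly", "exciting", "!"]]
character_ids = batch_to_ids(sentences)            # character-level ids, as ELMo expects
output = elmo(character_ids)
print(output["elmo_representations"][0].shape)     # (batch, seq_len, 1024)
```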

SLIDE 15

BERT

  • NAACL’19: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  • First released in Oct 2018.

How is BERT different from ELMo?
  #1. Unidirectional vs. bidirectional context
  #2. LSTMs vs. Transformers (more on this later)
  #3. The weights are not frozen; they are fine-tuned on the downstream task

SLIDE 16

Bidirectional encoders

  • Language models only use left context or right context (although ELMo used two independent LMs, one from each direction).
  • Language understanding is bidirectional.

Why are LMs unidirectional?

SLIDE 17

Bidirectional encoders

  • Language models only use left context or right context (although ELMo used two independent LMs, one from each direction).
  • Language understanding is bidirectional.
SLIDE 18

Masked language models (MLMs)

  • Solution: mask out 15% of the input words, and then predict the masked words
  • Too little masking: too expensive to train
  • Too much masking: not enough context
SLIDE 19

Masked language models (MLMs)

One more complication: because [MASK] is never seen when BERT is later used on downstream tasks… (the sketch below shows the standard workaround)
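The resolution in the original BERT recipe, sketched here with hypothetical helper names: of the 15% of tokens selected for prediction, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged, so the encoder cannot rely on [MASK] always marking the prediction targets.

```python
# Sketch of BERT-style masking: choose 15% of positions, then apply the 80/10/10 rule.
import random

MASK = "[MASK]"
VOCAB = ["the", "movie", "was", "exciting", "bank", "river"]   # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15):
    inputs, targets = list(tokens), [None] * len(tokens)
    for i in range(len(tokens)):
        if random.random() >= mask_prob:
            continue
        targets[i] = tokens[i]                 # loss is computed only at these positions
        r = random.random()
        if r < 0.8:
            inputs[i] = MASK                   # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = random.choice(VOCAB)   # 10%: replace with a random word
        # else: 10% keep the original token unchanged
    return inputs, targets

print(mask_tokens(["the", "movie", "was", "terribly", "exciting", "!"]))
```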

SLIDE 20

Next sentence prediction (NSP)

Always sample two sentences and predict whether the second sentence actually follows the first one.
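A minimal sketch of how such training pairs are typically built (the corpus and helper names are illustrative): half the time the second segment is the true next sentence (label IsNext), half the time it is a random sentence from the corpus (label NotNext).

```python
# Sketch of building next-sentence-prediction pairs from a list of sentences.
import random

def make_nsp_pair(corpus, i):
    """corpus: sentences in document order; i: index of the first sentence."""
    first = corpus[i]
    if random.random() < 0.5 and i + 1 < len(corpus):
        return first, corpus[i + 1], "IsNext"          # true next sentence
    # a random distractor (a fuller version would avoid picking the true next sentence)
    return first, random.choice(corpus), "NotNext"

corpus = ["The movie was terribly exciting !",
          "I saw it twice .",
          "Banks were closed on Monday ."]
print(make_nsp_pair(corpus, 0))
```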

Recent papers show that NSP is not necessary…

(Joshi*, Chen* et al., 2019): SpanBERT: Improving Pre-training by Representing and Predicting Spans
(Liu et al., 2019): RoBERTa: A Robustly Optimized BERT Pretraining Approach

SLIDE 21

Pre-training and fine-tuning

(Figure: the Pre-training and Fine-tuning stages)

Key idea: all the weights are fine-tuned on downstream tasks
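A minimal fine-tuning sketch with the Hugging Face transformers library, assuming a recent version (the model name and toy task are illustrative): unlike the ELMo recipe, no weights are frozen; the optimizer updates every BERT parameter along with the new classification head.

```python
# Sketch: fine-tune all of BERT's weights on a toy classification example.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Every parameter (not just the new head) is passed to the optimizer.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

batch = tokenizer(["the movie was terribly exciting !"], return_tensors="pt")
labels = torch.tensor([1])

outputs = model(**batch, labels=labels)   # returns the classification loss
outputs.loss.backward()
optimizer.step()
```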

SLIDE 22

Applications

SLIDE 23

More details

  • Input representations
  • Use word pieces instead of words: playing => play ##ing (see the tokenizer sketch after this list)
  • Trained 40 epochs on Wikipedia (2.5B tokens) + BookCorpus (0.8B tokens)
  • Released two model sizes: BERT_base, BERT_large
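A quick way to see WordPiece in action with the Hugging Face tokenizer (the exact splits depend on the released vocabulary; common words may already be a single piece, while rarer words get split into ## continuation pieces):

```python
# Sketch: inspect BERT's WordPiece tokenization (splits depend on the released vocab).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

for word in ["playing", "unaffable", "embeddings"]:
    print(word, "->", tokenizer.tokenize(word))
# Out-of-vocabulary words are split into an initial piece plus "##" continuation pieces.
```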
SLIDE 24

Experimental results

(Wang et al., 2018): GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

(Results table; BiLSTM baseline GLUE score: 63.9)

SLIDE 25

Use BERT in practice

TensorFlow: https://github.com/google-research/bert

PyTorch: https://github.com/huggingface/transformers
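A minimal sketch of pulling contextualized token vectors out of a pretrained BERT with the PyTorch library above, assuming a recent transformers version (the example sentence is just an illustration):

```python
# Sketch: contextualized token embeddings from pretrained BERT via transformers.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The river bank was muddy .", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per word piece, conditioned on the whole sentence.
print(outputs.last_hidden_state.shape)   # (batch, num_word_pieces, 768)
```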

SLIDE 26

Contextualized word embeddings in context

  • TagLM (Peters et al., 2017)
  • CoVe (McCann et al., 2017)
  • ULMFiT (Howard and Ruder, 2018)
  • ELMo (Peters et al., 2018)
  • OpenAI GPT (Radford et al., 2018)
  • BERT (Devlin et al., 2018)
  • OpenAI GPT-2 (Radford et al., 2019)
  • XLNet (Yang et al., 2019)
  • SpanBERT (Joshi et al., 2019)
  • RoBERTa (Liu et al., 2019)
  • many many more ...