
SLIDE 1

CSEP 517 Natural Language Processing

Contextualized Word Embeddings

Luke Zettlemoyer

(Slides adapted from Danqi Chen, Chris Manning, Abigail See, Andrej Karpathy)

SLIDE 2

Overview

  • Contextualized Word Representations
  • ELMo = Embeddings from Language Models
  • BERT = Bidirectional Encoder Representations from Transformers
SLIDE 3

Recap: Transformer

  • Transformers
SLIDE 4

Recap: word2vec

word = “sweden”
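The original slide shows the nearest neighbours of "sweden" under a pretrained word2vec model. A minimal sketch of that kind of query with gensim, assuming a word2vec-format vector file (the file path is a placeholder, not from the slides):

```python
# Minimal sketch: nearest neighbours under a pretrained word2vec model (gensim).
# The file path is a placeholder; any word2vec-format vectors work.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

# One static vector per word *type*; similarity is plain cosine similarity.
for neighbour, score in vectors.most_similar("sweden", topn=5):
    print(f"{neighbour}\t{score:.3f}")
```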

SLIDE 5

What’s wrong with word2vec?

  • One vector for each word type, e.g.

    cat = (−0.224, 0.130, −0.290, 0.276, …)

  • Complex characteristics of word use: semantics, syntactic behavior, and connotations
  • Polysemous words, e.g., bank, mouse: there is only one v(bank) for all senses
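A minimal illustration of the problem (toy vectors, not real word2vec numbers): a type-level lookup table returns the same vector for "bank" no matter what sentence it appears in, which is exactly what the contextualized embeddings on the next slides address.

```python
# Toy illustration: a static (type-level) embedding table is context-blind.
# The numbers are made up; real word2vec vectors behave the same way.
static_emb = {
    "bank": [-0.224, 0.130, -0.290, 0.276],
    "river": [0.410, -0.080, 0.133, -0.502],
    "account": [-0.051, 0.337, 0.260, 0.094],
}

sent1 = ["the", "river", "bank"]       # geographic sense
sent2 = ["my", "bank", "account"]      # financial sense

v1 = static_emb["bank"]  # same vector ...
v2 = static_emb["bank"]  # ... regardless of the surrounding words
assert v1 == v2
```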
SLIDE 6

Contextualized word embeddings

Let’s build a vector for each word conditioned on its context!

Example sentence: the movie was terribly exciting !

f : (w1, w2, …, wn) ⟶ x1, …, xn,  each xi ∈ ℝ^d

SLIDE 7

Contextualized word embeddings

(Peters et al., 2018): Deep contextualized word representations

SLIDE 8

ELMo

  • NAACL’18: Deep contextualized word representations
  • Key idea:
    • Train an LSTM-based language model on some large corpus
    • Use the hidden states of the LSTM for each token to compute a vector representation of each word (see the sketch below)
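A minimal PyTorch-style sketch of that idea, assuming a toy vocabulary and single-layer LSTMs (the real ELMo uses a character-CNN input layer and two LSTM layers per direction, as described on a later slide):

```python
# Minimal sketch of an ELMo-style biLM in PyTorch (toy sizes, not the real model).
import torch
import torch.nn as nn

class ToyBiLM(nn.Module):
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.fwd = nn.LSTM(dim, dim, batch_first=True)   # reads left-to-right
        self.bwd = nn.LSTM(dim, dim, batch_first=True)   # reads right-to-left
        self.out = nn.Linear(dim, vocab_size)            # softmax over the vocabulary

    def forward(self, token_ids):
        x = self.emb(token_ids)                          # (batch, seq, dim)
        h_fwd, _ = self.fwd(x)
        h_bwd, _ = self.bwd(torch.flip(x, dims=[1]))
        h_bwd = torch.flip(h_bwd, dims=[1])
        # The per-token hidden states from both directions are the contextualized
        # representations; concatenating them gives one vector per token.
        reps = torch.cat([h_fwd, h_bwd], dim=-1)         # (batch, seq, 2 * dim)
        logits = self.out(h_fwd)                         # forward LM predicts the next word
        # (the backward LM would analogously predict the previous word)
        return reps, logits

tokens = torch.randint(0, 1000, (1, 5))                  # a fake 5-token sentence
reps, _ = ToyBiLM(vocab_size=1000)(tokens)
print(reps.shape)                                        # torch.Size([1, 5, 128])
```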

SLIDE 9

ELMo

(Figure: the LSTM language model; diagram labels: input, softmax, # words in the sentence)

SLIDE 10

How to use ELMo?

  • Plug ELMo into any (neural) NLP model: freeze all the LM weights and change the input representation to the weighted combination below (it can also be inserted into higher layers):

    ELMo_k^task = γ^task · Σ_{j=0}^{L} s_j^task · h_{k,j}^LM        (L = # of layers)

    where h_{k,0}^LM = x_k^LM (the context-independent token representation) and
    h_{k,j}^LM = [→h_{k,j}^LM ; ←h_{k,j}^LM] (the concatenated forward and backward LSTM states)

  • γ^task: allows the task model to scale the entire ELMo vector
  • s_j^task: softmax-normalized weights across layers
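A minimal sketch of that combination step, assuming the biLM's per-layer, per-token hidden states are already stacked into one tensor (the class and variable names here are illustrative, not from AllenNLP):

```python
# Sketch of the ELMo mixing step: a task-specific softmax over layers plus a scalar.
import torch
import torch.nn as nn

class ELMoMixer(nn.Module):
    def __init__(self, num_layers):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(num_layers))   # s^task, softmax-normalized below
        self.gamma = nn.Parameter(torch.ones(()))        # γ^task, scales the whole vector

    def forward(self, layer_states):
        # layer_states: (num_layers, seq_len, dim), i.e. h_{k,j}^LM for j = 0..L
        weights = torch.softmax(self.s, dim=0)           # one weight per layer
        mixed = (weights[:, None, None] * layer_states).sum(dim=0)
        return self.gamma * mixed                        # (seq_len, dim)

# Toy usage: 3 layers (j = 0, 1, 2), a 5-token sentence, 1024-dim states.
states = torch.randn(3, 5, 1024)
elmo_vectors = ELMoMixer(num_layers=3)(states)
print(elmo_vectors.shape)                                # torch.Size([5, 1024])
```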

SLIDE 11

More details

  • Forward and backward LMs: 2 layers each
  • Use a character CNN to build the initial word representation
    • 2048 char n-gram filters and 2 highway layers, 512-dim projection
  • 4096-dim hidden/cell LSTM states with 512-dim projections to the next input
  • A residual connection from the first to the second layer
  • Trained 10 epochs on the 1B Word Benchmark
SLIDE 12

Experimental results

  • SQuAD: question answering
  • SNLI: natural language inference
  • SRL: semantic role labeling
  • Coref: coreference resolution
  • NER: named entity recognition
  • SST-5: sentiment analysis
SLIDE 13

Intrinsic Evaluation

(Table: tasks for which the first layer works better vs. tasks for which the second layer works better)

Syntactic information is better represented at lower layers, while semantic information is captured at higher layers.

SLIDE 14

Use ELMo in practice

https://allennlp.org/elmo

Also available in TensorFlow
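A minimal usage sketch with AllenNLP's Elmo module (module and function names as in allennlp.modules.elmo; the options/weights paths are placeholders for the files distributed on the page above):

```python
# Sketch of computing ELMo vectors with AllenNLP.
from allennlp.modules.elmo import Elmo, batch_to_ids

# Placeholder paths: substitute the option/weight files linked from allennlp.org/elmo.
options_file = "elmo_options.json"
weight_file = "elmo_weights.hdf5"

# num_output_representations=1 gives one mixed (γ- and s-weighted) vector per token.
elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0.0)

sentences = [["The", "movie", "was", "terribly", "exciting", "!"]]
character_ids = batch_to_ids(sentences)            # character-level ids, as ELMo expects
output = elmo(character_ids)
print(output["elmo_representations"][0].shape)     # (batch, seq_len, 1024)
```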

SLIDE 15

BERT

  • NAACL’19: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  • First released in Oct 2018.

How is BERT different from ELMo?
  #1. Unidirectional vs. bidirectional context
  #2. LSTMs vs. Transformers (more on this later)
  #3. The weights are not frozen; they are fine-tuned on the downstream task

SLIDE 16

Bidirectional encoders

  • Language models only use left context or right context (although ELMo used two independent LMs, one from each direction).
  • Language understanding is bidirectional.

Why are LMs unidirectional?

SLIDE 17

Bidirectional encoders

  • Language models only use left context or right context (although ELMo used two independent LMs, one from each direction).
  • Language understanding is bidirectional.
SLIDE 18

Masked language models (MLMs)

  • Solution: mask out 15% of the input words, and then predict the masked words
  • Too little masking: too expensive to train
  • Too much masking: not enough context
SLIDE 19

Masked language models (MLMs)

One more complication: because [MASK] is never seen when BERT is later used on downstream tasks… (the sketch below shows the standard workaround)
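The resolution in the original BERT recipe, sketched here with hypothetical helper names: of the 15% of tokens selected for prediction, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged, so the encoder cannot rely on [MASK] always marking the prediction targets.

```python
# Sketch of BERT-style masking: choose 15% of positions, then apply the 80/10/10 rule.
import random

MASK = "[MASK]"
VOCAB = ["the", "movie", "was", "exciting", "bank", "river"]   # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15):
    inputs, targets = list(tokens), [None] * len(tokens)
    for i in range(len(tokens)):
        if random.random() >= mask_prob:
            continue
        targets[i] = tokens[i]                 # loss is computed only at these positions
        r = random.random()
        if r < 0.8:
            inputs[i] = MASK                   # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = random.choice(VOCAB)   # 10%: replace with a random word
        # else: 10% keep the original token unchanged
    return inputs, targets

print(mask_tokens(["the", "movie", "was", "terribly", "exciting", "!"]))
```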

SLIDE 20

Next sentence prediction (NSP)

Always sample two sentences and predict whether the second sentence actually follows the first one.
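A minimal sketch of how such training pairs are typically built (the corpus and helper names are illustrative): half the time the second segment is the true next sentence (label IsNext), half the time it is a random sentence from the corpus (label NotNext).

```python
# Sketch of building next-sentence-prediction pairs from a list of sentences.
import random

def make_nsp_pair(corpus, i):
    """corpus: sentences in document order; i: index of the first sentence."""
    first = corpus[i]
    if random.random() < 0.5 and i + 1 < len(corpus):
        return first, corpus[i + 1], "IsNext"          # true next sentence
    # a random distractor (a fuller version would avoid picking the true next sentence)
    return first, random.choice(corpus), "NotNext"

corpus = ["The movie was terribly exciting !",
          "I saw it twice .",
          "Banks were closed on Monday ."]
print(make_nsp_pair(corpus, 0))
```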

Recent papers show that NSP is not necessary…

(Joshi*, Chen* et al., 2019): SpanBERT: Improving Pre-training by Representing and Predicting Spans
(Liu et al., 2019): RoBERTa: A Robustly Optimized BERT Pretraining Approach

SLIDE 21

Pre-training and fine-tuning

(Figure: the Pre-training and Fine-tuning stages)

Key idea: all the weights are fine-tuned on downstream tasks
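A minimal fine-tuning sketch with the Hugging Face transformers library, assuming a recent version (the model name and toy task are illustrative): unlike the ELMo recipe, no weights are frozen; the optimizer updates every BERT parameter along with the new classification head.

```python
# Sketch: fine-tune all of BERT's weights on a toy classification example.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Every parameter (not just the new head) is passed to the optimizer.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

batch = tokenizer(["the movie was terribly exciting !"], return_tensors="pt")
labels = torch.tensor([1])

outputs = model(**batch, labels=labels)   # returns the classification loss
outputs.loss.backward()
optimizer.step()
```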

SLIDE 22

Applications

SLIDE 23

More details

  • Input representations
  • Use word pieces instead of words: playing => play ##ing (see the tokenizer sketch after this list)
  • Trained 40 epochs on Wikipedia (2.5B tokens) + BookCorpus (0.8B tokens)
  • Released two model sizes: BERT_base, BERT_large
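A quick way to see WordPiece in action with the Hugging Face tokenizer (the exact splits depend on the released vocabulary; common words may already be a single piece, while rarer words get split into ## continuation pieces):

```python
# Sketch: inspect BERT's WordPiece tokenization (splits depend on the released vocab).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

for word in ["playing", "unaffable", "embeddings"]:
    print(word, "->", tokenizer.tokenize(word))
# Out-of-vocabulary words are split into an initial piece plus "##" continuation pieces.
```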
SLIDE 24

Experimental results

(Wang et al., 2018): GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

(Results table; BiLSTM baseline GLUE score: 63.9)

SLIDE 25

Use BERT in practice

TensorFlow: https://github.com/google-research/bert

PyTorch: https://github.com/huggingface/transformers
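A minimal sketch of pulling contextualized token vectors out of a pretrained BERT with the PyTorch library above, assuming a recent transformers version (the example sentence is just an illustration):

```python
# Sketch: contextualized token embeddings from pretrained BERT via transformers.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The river bank was muddy .", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per word piece, conditioned on the whole sentence.
print(outputs.last_hidden_state.shape)   # (batch, num_word_pieces, 768)
```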

SLIDE 26

Contextualized word embeddings in context

  • TagLM (Peters et al., 2017)
  • CoVe (McCann et al., 2017)
  • ULMFiT (Howard and Ruder, 2018)
  • ELMo (Peters et al., 2018)
  • OpenAI GPT (Radford et al., 2018)
  • BERT (Devlin et al., 2018)
  • OpenAI GPT-2 (Radford et al., 2019)
  • XLNet (Yang et al., 2019)
  • SpanBERT (Joshi et al., 2019)
  • RoBERTa (Liu et al., 2019)
  • many many more ...