SLIDE 1

Contextual Token Representations

ULMfit, OpenAI GPT, ELMo, BERT, XLM

Noe Casas

SLIDE 2
Background: Language Modeling

  • Data: monolingual corpus.
  • Task: predict the next token given the previous tokens (causal):
    P(T_i | T_1, …, T_{i−1})
  • Usual models: LSTM, Transformer.

[Figure: tokens <s>, T1, T2, …, </s> are embedded (embed1, embed2, …), fed through the model, and each position is projected and softmaxed to predict the next token.]
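Not part of the slides: a minimal PyTorch sketch of this causal-LM objective, with a single-layer LSTM and invented sizes, where each position is trained to predict the next token.

```python
import torch
import torch.nn as nn

# Toy causal language model: embed -> LSTM -> project -> softmax over the vocabulary.
vocab_size, emb_dim, hidden_dim = 1000, 64, 128
embed = nn.Embedding(vocab_size, emb_dim)
lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
project = nn.Linear(hidden_dim, vocab_size)

# Dummy batch of token ids: positions 0..n-2 are inputs, 1..n-1 are targets,
# so position i learns P(T_i | T_1 ... T_{i-1}).
tokens = torch.randint(0, vocab_size, (8, 20))        # (batch, seq_len)
inputs, targets = tokens[:, :-1], tokens[:, 1:]

hidden, _ = lstm(embed(inputs))                        # (batch, seq_len-1, hidden_dim)
logits = project(hidden)                               # (batch, seq_len-1, vocab_size)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
```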

SLIDE 3
Contextual embeddings: intuition

  • The same word can have different meanings depending on the context. Example:
    • Please, type everything in lowercase.
    • What type of flowers do you like most?
  • Classic word embeddings offer the same vector representation regardless of the context.
  • Solution: create word representations that depend on the context.
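As a toy illustration (not from the slides): a static embedding table returns the same vector for "type" in both sentences, while a context-dependent encoder (here an untrained bidirectional LSTM standing in for ELMo/BERT) returns different vectors. The vocabulary and sizes are invented.

```python
import torch
import torch.nn as nn

vocab = {"<pad>": 0, "please": 1, "type": 2, "everything": 3, "in": 4, "lowercase": 5,
         "what": 6, "of": 7, "flowers": 8, "do": 9, "you": 10, "like": 11, "most": 12}
sent_a = torch.tensor([[vocab[w] for w in "please type everything in lowercase".split()]])
sent_b = torch.tensor([[vocab[w] for w in "what type of flowers do you like most".split()]])

embed = nn.Embedding(len(vocab), 16)
encoder = nn.LSTM(16, 16, bidirectional=True, batch_first=True)

# Static embedding: identical vector for "type" regardless of the sentence.
static_a, static_b = embed(sent_a)[0, 1], embed(sent_b)[0, 1]
print(torch.equal(static_a, static_b))             # True

# Contextual representation: the vector for "type" depends on its neighbours.
ctx_a, _ = encoder(embed(sent_a))
ctx_b, _ = encoder(embed(sent_b))
print(torch.allclose(ctx_a[0, 1], ctx_b[0, 1]))    # almost surely False (different contexts)
```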

SLIDE 4

Articles

Model Alias  | Org.     | Article                                                                           | Reference
ULMfit       | fast.ai  | Universal Language Model Fine-tuning for Text Classification                      | Howard and Ruder
ELMo         | AllenNLP | Deep contextualized word representations                                          | Peters et al.
OpenAI GPT   | OpenAI   | Improving Language Understanding by Generative Pre-Training                       | Radford et al.
BERT         | Google   | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding  | Devlin et al.
XLM          | Facebook | Cross-lingual Language Model Pretraining                                          | Lample and Conneau

SLIDE 5

Overview

  • Train a model on one of multiple tasks that lead to word representations.
  • Release the pre-trained models.
  • Use the pre-trained models, with two options (sketched below):
  • A. Fine-tune the model on the final task.
  • B. Directly encode token representations with the (frozen) model.
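A minimal sketch of the two options, assuming a generic pre-trained encoder that maps token ids to per-token hidden states of size hidden_dim; the names `pretrained_encoder` and `TaskHead` are placeholders, not from any of the papers.

```python
import torch.nn as nn

class TaskHead(nn.Module):
    """Small classifier placed on top of the token representations."""
    def __init__(self, hidden_dim, num_classes):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_states):                   # (batch, seq, hidden_dim)
        return self.linear(token_states.mean(dim=1))   # pool over tokens

def option_a_finetune(pretrained_encoder, hidden_dim, num_classes):
    # A. Fine-tune: encoder weights stay trainable, usually with a small learning rate.
    return nn.Sequential(pretrained_encoder, TaskHead(hidden_dim, num_classes))

def option_b_feature_extract(pretrained_encoder, hidden_dim, num_classes):
    # B. Feature extraction: freeze the encoder and train only the task head.
    for p in pretrained_encoder.parameters():
        p.requires_grad = False
    return nn.Sequential(pretrained_encoder, TaskHead(hidden_dim, num_classes))
```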
SLIDE 6

Overview (graphical)

[Figure: two-phase pipeline.
Phase 1, semi-supervised training: a language-modeling architecture plus an LM task head (projection + softmax) is trained on a monolingual corpus.
Phase 2, downstream-task fine-tuning: the same architecture, now with a downstream-task head, is trained on task-specific data; the pre-trained weights are transferred and either fine-tuned with a small learning rate or directly frozen, yielding contextual representations.]
SLIDE 7

Differences

Alias      | Model       | Token   | Tasks                                  | Language
ULMfit     | LSTM        | word    | Causal LM                              | English
ELMo       | LSTM        | word    | Bidirectional LM                       | English
OpenAI GPT | Transformer | subword | Causal LM + Classification             | English
BERT       | Transformer | subword | Masked LM + Next sentence prediction   | Multilingual
XLM        | Transformer | subword | Causal LM + Masked LM + Translation LM | Multilingual

SLIDE 8

ULMFiT

  • Task: causal LM
  • Model: 3-layer LSTM
  • Tokens: words

[Figure: word embeddings for <s>, T1, T2, …, </s> flow through three stacked LSTM layers; each position is projected and softmaxed to predict the next word.]
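A rough sketch of this backbone, a 3-layer word-level LSTM language model; the actual ULMFiT model is fastai's AWD-LSTM with regularisation and fine-tuning tricks (discriminative learning rates, gradual unfreezing) that are not shown here, and the sizes below are only illustrative.

```python
import torch.nn as nn

class WordLSTMLanguageModel(nn.Module):
    """3-layer word-level LSTM LM: embed -> stacked LSTM -> project to the vocabulary."""
    def __init__(self, vocab_size, emb_dim=400, hidden_dim=1150, num_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.project = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):                       # (batch, seq_len) of word ids
        states, _ = self.lstm(self.embed(token_ids))
        return self.project(states)                     # logits over the next word
```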

SLIDE 9

ELMo

  • Task: bidirectional LM
  • Model: 2-layer biLSTM
  • Tokens: words

[Figure: character-CNN embeddings C1, …, CN for each word feed a 2-layer forward LSTM and a 2-layer backward LSTM; each direction predicts the next/previous word via projection + softmax, and the hidden states provide the contextual representations.]
SLIDE 10

OpenAI GPT

  • Task: causal LM
  • Model: self-attention layers
  • Tokens: subwords

[Figure: token embeddings for the sentence “he will be late” plus positional embeddings (1, 2, 3, 4, …) are summed and fed through the self-attention layers; each position is projected and softmaxed to predict the next token, producing the output tokens “he will be late </s>”.]
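A minimal sketch of these ingredients (token + positional embeddings feeding self-attention layers with a causal mask), built from PyTorch's generic Transformer encoder layers rather than the actual GPT code; all sizes are invented.

```python
import torch
import torch.nn as nn

class CausalSelfAttentionLM(nn.Module):
    def __init__(self, vocab_size, d_model=256, nhead=4, num_layers=4, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)     # token (subword) embeddings
        self.pos = nn.Embedding(max_len, d_model)        # learned positional embeddings
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers)
        self.project = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):                        # (batch, seq_len)
        seq_len = token_ids.size(1)
        positions = torch.arange(seq_len, device=token_ids.device)
        x = self.tok(token_ids) + self.pos(positions)    # sum token + position embeddings
        # Causal mask: position i may only attend to positions <= i.
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf"),
                                     device=token_ids.device), diagonal=1)
        h = self.blocks(x, mask=mask)
        return self.project(h)                           # next-token logits
```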

SLIDE 11

BERT

  • Tasks: masked LM + next sentence prediction
  • Model: self-attention layers
  • Tokens: subwords

[Figure: input sequence “[CLS] he [MASK] be late [SEP] you [MASK] leave now [SEP]”; token, positional (1, 2, …, 10) and segment (A for the first sentence, B for the second) embeddings are summed; 15% of the tokens get masked; the self-attention layers feed a projection + softmax at every position to recover the original tokens (“he will be late [SEP] you should leave now [SEP]”), and the output at [CLS] is used for classification tasks such as next sentence prediction.]
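A rough sketch of the masking step only: select 15% of the non-special token positions and replace them with [MASK], keeping the original ids as targets. The actual BERT recipe additionally replaces some selected tokens with random tokens or leaves them unchanged, which is omitted here.

```python
import torch

def mask_tokens(token_ids, mask_token_id, special_ids, mask_prob=0.15):
    """token_ids: (batch, seq_len) LongTensor of subword ids.
    Returns (masked inputs, masked-LM targets)."""
    # Candidate positions: everything except special tokens such as [CLS], [SEP], padding.
    candidates = torch.ones_like(token_ids, dtype=torch.bool)
    for sid in special_ids:
        candidates &= token_ids != sid
    # Mask roughly 15% of the candidate positions.
    selected = (torch.rand(token_ids.shape) < mask_prob) & candidates
    inputs = token_ids.clone()
    inputs[selected] = mask_token_id             # replace selected tokens with [MASK]
    targets = token_ids.clone()
    targets[~selected] = -100                    # cross_entropy's default ignore_index
    return inputs, targets
```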

SLIDE 12

XLM

  • Tasks: causal LM + masked LM + Translation LM
  • Model: self-attention layers
  • Tokens: subwords

[Figure (from “Cross-lingual Language Model Pretraining”): two training streams.
Masked Language Modeling (MLM): a single English sentence with some subwords replaced by [MASK]; token, position (1, 2, 3, …) and language (en) embeddings are summed and the masked tokens are predicted.
Translation Language Modeling (TLM): an English sentence (“the curtains were blue”) is concatenated with its French translation (“les rideaux étaient bleus”); positions restart for the second sentence, each half gets its own language embedding (en / fr), and masked tokens in either language can be predicted using context from both, i.e. masked LM with parallel sentences. Projection and softmax are omitted in the figure.]
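A small sketch of the input construction shown in the figure (not the actual XLM code): token, position and language embeddings are summed, and for TLM a sentence and its translation are concatenated into one stream with positions restarting for the second language. All names and sizes are made up; for MLM a single monolingual sentence is used with one language id throughout.

```python
import torch
import torch.nn as nn

class XLMInputEmbeddings(nn.Module):
    """Sum of token + position + language embeddings, as in the MLM/TLM figure."""
    def __init__(self, vocab_size, num_langs, d_model=256, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        self.lang = nn.Embedding(num_langs, d_model)

    def forward(self, token_ids, lang_ids, position_ids=None):
        if position_ids is None:
            position_ids = torch.arange(token_ids.size(1),
                                        device=token_ids.device).expand_as(token_ids)
        return self.tok(token_ids) + self.pos(position_ids) + self.lang(lang_ids)

def tlm_stream(en_ids, fr_ids, en_lang=0, fr_lang=1):
    """TLM: concatenate a sentence and its translation; positions restart for the
    second language and each half carries its own language embedding id."""
    token_ids = torch.cat([en_ids, fr_ids], dim=1)
    position_ids = torch.cat([torch.arange(en_ids.size(1)).expand_as(en_ids),
                              torch.arange(fr_ids.size(1)).expand_as(fr_ids)], dim=1)
    lang_ids = torch.cat([torch.full_like(en_ids, en_lang),
                          torch.full_like(fr_ids, fr_lang)], dim=1)
    return token_ids, position_ids, lang_ids
```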

SLIDE 13

Downstream Tasks

  • Natural Language Inference (NLI) or Cross-lingual NLI.
  • Text classification (e.g. sentiment analysis).
  • Next sentence prediction.
  • Supervised and Unsupervised Neural Machine Translation (NMT).
  • Question Answering (QA).
  • Named Entity Recognition (NER).
SLIDE 14
Further reading

  • “Looking for ELMo's friends: Sentence-Level Pretraining Beyond Language Modeling”, Bowman et al., 2018.
  • “What do you learn from context? Probing for sentence structure in contextualized word representations”, Tenney et al., 2018.
  • “Assessing BERT’s Syntactic Abilities”, Goldberg, 2018.
  • “Learning and Evaluating General Linguistic Intelligence”, Yogatama et al., 2019.

SLIDE 15

Differences with other representations

Note how contextual token representations differ from:

  • Translation-based representations like CoVe: Learned in Translation: Contextualized Word Vectors, McCann et al., 2017 [Salesforce].
  • Fixed-size sentence representations like Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond, Artetxe and Schwenk, 2018 [Facebook].

SLIDE 16
Other resources

  • https://nlp.stanford.edu/seminar/details/jdevlin.pdf
  • http://jalammar.github.io/illustrated-bert/
  • https://medium.com/dissecting-bert/dissecting-bert-part2-335ff2ed9c73
  • https://github.com/huggingface/pytorch-pretrained-BERT

SLIDE 17

Summary

Alias      | Model       | Token   | Tasks                                  | Language
ULMfit     | LSTM        | word    | Causal LM                              | English
ELMo       | LSTM        | word    | Bidirectional LM                       | English
OpenAI GPT | Transformer | subword | Causal LM + Classification             | English
BERT       | Transformer | subword | Masked LM + Next sentence prediction   | Multilingual
XLM        | Transformer | subword | Causal LM + Masked LM + Translation LM | Multilingual

[Figure: the two-phase pipeline again. Phase 1, semi-supervised training: model + LM task head (projection + softmax) trained on a monolingual corpus. Phase 2, downstream-task fine-tuning: the same model, with its weights transferred, plus a downstream-task head trained on task-specific data.]

SLIDE 18

Bonus slides

SLIDE 19
Are these really token representations?

  • They are a linear projection away from token space.
  • Word-level nearest-neighbour search in a corpus finds the same word with the same usage.

[Figure: “he will be late” fed through the model; each output position, after projection + softmax, maps back to the tokens “he will be late”.]
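A sketch (not from the slides) of the nearest-neighbour check: collect one contextual vector per token occurrence in a small corpus and query by cosine similarity. The untrained toy encoder below only stands in for a real pre-trained model, so the actual neighbours here are meaningless; only the recipe matters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for a pre-trained contextual encoder (ELMo/BERT would go here).
vocab = {w: i for i, w in enumerate(
    "please type everything in lowercase what of flowers do you like most".split())}
embed = nn.Embedding(len(vocab), 16)
encoder = nn.LSTM(16, 16, bidirectional=True, batch_first=True)

def encode(sentence):
    ids = torch.tensor([[vocab[w] for w in sentence.split()]])
    states, _ = encoder(embed(ids))
    return states[0]                                      # one vector per token

corpus = ["please type everything in lowercase",
          "what type of flowers do you like most"]

# One contextual vector per token occurrence in the corpus.
occurrences, vectors = [], []
for sent in corpus:
    states = encode(sent)
    for position, word in enumerate(sent.split()):
        occurrences.append((word, sent))
        vectors.append(states[position])
vectors = torch.stack(vectors)

# Cosine nearest neighbour of "type" in the first sentence (excluding itself).
query_idx = occurrences.index(("type", corpus[0]))
sims = F.cosine_similarity(vectors[query_idx].unsqueeze(0), vectors)
neighbour = sims.topk(2).indices[1].item()                # index 0 is the query itself
print(occurrences[neighbour])
```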