Contextual Token Representations
ULMfit, OpenAI GPT, ELMo, BERT, XLM
Noe Casas
Background: Language Modeling

Data: a monolingual corpus.
Task: predict each token given the previous tokens (causal):

P(Ti | T1 … Ti−1)

[Figure: input tokens <s>, T1, T2, … pass through embedding layers (embed1, embed2, embed3, …) and the model; at each position a projection + softmax predicts the next token T1, T2, …, </s>.]
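The causal factorization P(Ti | T1 … Ti−1) can be sketched with a toy count-based bigram model. This is only an illustration of the task, not the architecture of any of the models discussed; the corpus and tokens are made up.

```python
from collections import Counter, defaultdict

# Toy corpus; <s> and </s> mark sentence boundaries.
corpus = [
    ["<s>", "he", "will", "be", "late", "</s>"],
    ["<s>", "he", "will", "be", "early", "</s>"],
]

# Count bigrams: how often each token follows each previous token.
bigrams = defaultdict(Counter)
for sentence in corpus:
    for prev, cur in zip(sentence, sentence[1:]):
        bigrams[prev][cur] += 1

def next_token_probs(prev):
    """Normalized distribution over the next token given the previous one,
    a bigram approximation of P(Ti | T1 ... Ti-1)."""
    counts = bigrams[prev]
    total = sum(counts.values())
    return {tok: count / total for tok, count in counts.items()}

print(next_token_probs("be"))  # {'late': 0.5, 'early': 0.5}
```

The neural models in this deck replace the count table with an embedding layer, a sequence model, and a projection + softmax, but they optimize the same next-token objective.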
Static word embeddings assign each word the same representation regardless of the context. Contextual token representations instead depend on the surrounding context.
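The static-vs-contextual distinction can be sketched with toy hand-made 2-d vectors and a fake "contextualizer" (mixing a token vector with its neighbors); no real model is involved.

```python
# Made-up static embeddings for three word types.
static_embeddings = {"bank": [1.0, 0.0], "river": [0.0, 1.0], "money": [0.5, 0.5]}

def static_repr(tokens, i):
    # A static embedding ignores the surrounding tokens entirely.
    return static_embeddings[tokens[i]]

def contextual_repr(tokens, i):
    # Toy "contextualization": mix the token vector with its neighbors' vectors.
    vec = list(static_embeddings[tokens[i]])
    for j in (i - 1, i + 1):
        if 0 <= j < len(tokens):
            neighbor = static_embeddings[tokens[j]]
            vec = [v + 0.5 * n for v, n in zip(vec, neighbor)]
    return vec

s1 = ["river", "bank"]
s2 = ["money", "bank"]
# Same token type, same static vector...
assert static_repr(s1, 1) == static_repr(s2, 1)
# ...but different contextual vectors, because the contexts differ.
assert contextual_repr(s1, 1) != contextual_repr(s2, 1)
```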
| Alias      | Org.     | Article                                                                          | Reference          |
| ULMfit     | fast.ai  | Universal Language Model Fine-tuning for Text Classification                     | Howard and Ruder   |
| ELMo       | AllenNLP | Deep contextualized word representations                                         | Peters et al.      |
| OpenAI GPT | OpenAI   | Improving Language Understanding by Generative Pre-Training                      | Radford et al.     |
| BERT       | Google   | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | Devlin et al.      |
| XLM        | Facebook | Cross-lingual Language Model Pretraining                                         | Lample and Conneau |

All of these models produce contextual token representations.
Transfer learning in two phases:

Phase 1 (semi-supervised training): the language modeling architecture plus an LM task head (projection + softmax) is trained on a monolingual corpus.

Phase 2 (downstream task fine-tuning): the LM task head is replaced by a downstream task head, the pretrained weights are transferred, and the model is fine-tuned on task-specific data with a small learning rate, producing contextual representations for the task.
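The two-phase recipe (LM pretraining, then downstream fine-tuning with a small learning rate) can be sketched schematically. The objects and "gradient steps" below are stand-ins, purely to show which parts are trained, which are discarded, and which are transferred.

```python
def pretrain(encoder, monolingual_corpus):
    """Phase 1: train encoder + LM head (projection + softmax) on raw text."""
    lm_head = {"projection": 0.0}        # discarded after pretraining
    for _sentence in monolingual_corpus:
        encoder["layer"] += 0.1          # fake gradient step on the encoder
        lm_head["projection"] += 0.1     # fake gradient step on the LM head
    return encoder                       # only the encoder is transferred

def fine_tune(encoder, task_data, lr=0.01):
    """Phase 2: fresh task head; transferred weights get a small learning rate."""
    task_head = {"classifier": 0.0}      # new, randomly initialized head
    for _example in task_data:
        task_head["classifier"] += 1.0   # task head trained at a normal rate
        encoder["layer"] += lr           # small steps on the pretrained weights
    return encoder, task_head

encoder = pretrain({"layer": 0.0}, ["sentence 1", "sentence 2"])
encoder, head = fine_tune(encoder, ["labeled example"])
```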
| Alias      | Model       | Token   | Tasks                                  | Language     |
| ULMfit     | LSTM        | word    | Causal LM                              | English      |
| ELMo       | LSTM        | word    | Bidirectional LM                       | English      |
| OpenAI GPT | Transformer | subword | Causal LM + Classification             | English      |
| BERT       | Transformer | subword | Masked LM + Next sentence prediction   | Multilingual |
| XLM        | Transformer | subword | Causal LM + Masked LM + Translation LM | Multilingual |
[Figure: a stacked-LSTM causal language model (as in ULMfit): token embeddings E1 … EN feed several LSTM layers, and each position predicts the next token through a projection + softmax.]
[Figure: ELMo: a character CNN (charCNN) encodes each token into C1 … CN; a forward LSTM language model and a backward LSTM language model run over the sequence, each followed by a projection + softmax.]
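ELMo's bidirectionality, pairing a forward and a backward language-model state for each token, can be sketched with fake "states" (running string summaries instead of LSTM hidden vectors; everything here is illustrative).

```python
def forward_states(tokens):
    # Fake forward LM state: first letters of the tokens seen so far.
    state, out = "", []
    for tok in tokens:
        state += tok[0]
        out.append(state)
    return out

def backward_states(tokens):
    # Run the "forward" LM over the reversed sequence, then undo the reversal,
    # so position i summarizes tokens i..end.
    return [s[::-1] for s in forward_states(tokens[::-1])][::-1]

def elmo_repr(tokens):
    # ELMo-style: each token is represented by (forward state, backward state).
    return list(zip(forward_states(tokens), backward_states(tokens)))

print(elmo_repr(["he", "will", "be", "late"]))
# [('h', 'hwbl'), ('hw', 'wbl'), ('hwb', 'bl'), ('hwbl', 'l')]
```

The point of the sketch: the pair at each position depends on the whole sentence, left context through the forward state and right context through the backward state.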
[Figure: OpenAI GPT: token embeddings plus positional embeddings (1, 2, 3, …) enter a stack of self-attention layers; each position predicts the next output token ("he will be late </s>") through a projection + softmax.]
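The Transformer input is the element-wise sum of a token embedding and a positional embedding. A toy sketch with made-up 2-d vectors (real models use hundreds of dimensions and learned tables):

```python
# Made-up embedding tables; dimensions are toy-sized.
token_emb = {"he": [1.0, 0.0], "will": [0.0, 1.0]}
pos_emb = [[0.1, 0.1], [0.2, 0.2]]  # one vector per position

def input_vectors(tokens):
    # Each input vector = token embedding + positional embedding for its slot.
    return [[t + p for t, p in zip(token_emb[tok], pos_emb[i])]
            for i, tok in enumerate(tokens)]

print(input_vectors(["he", "will"]))  # [[1.1, 0.1], [0.2, 1.2]]
```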
[Figure: BERT: the input pair "[CLS] he will be late [SEP] you should leave now [SEP]" is embedded as the sum of token, segment (A/B), and positional embeddings before the self-attention layers. 15% of tokens get masked ([MASK]) and must be predicted at the output; the output at [CLS] is used for classification tasks.]
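The "15% of tokens get masked" step can be sketched as a simple routine that replaces tokens with [MASK] and records the targets the model must recover. This is a simplification: BERT's full recipe also sometimes substitutes a random token or keeps the original, which is omitted here.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, rng=None):
    """Replace roughly mask_rate of the tokens with [MASK].

    Returns the masked sequence and a dict mapping masked positions to the
    original tokens (the prediction targets)."""
    rng = rng or random.Random(0)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]"):   # never mask special tokens
            masked.append(tok)
        elif rng.random() < mask_rate:
            masked.append("[MASK]")
            targets[i] = tok            # the model must recover these
        else:
            masked.append(tok)
    return masked, targets

sentence = ["[CLS]", "he", "will", "be", "late", "[SEP]"]
masked, targets = mask_tokens(sentence, mask_rate=0.5, rng=random.Random(0))
print(masked, targets)
```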
*figure from “Cross-lingual Language Model Pretraining”

[Figure: Masked Language Modeling (MLM): a monolingual sentence with some tokens replaced by [MASK]; each input position is the sum of token, position (1, 2, 3, …), and language ("en") embeddings, fed to a Transformer.

Translation Language Modeling (TLM): masked LM with parallel sentences. The English sentence "the curtains were blue" is concatenated with its French translation "les rideaux étaient bleus", and both are partially masked. Position embeddings restart at the second sentence, and language embeddings distinguish "en" from "fr". Projection and softmax are omitted.]
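Building one TLM training example, parallel sentences concatenated, positions restarting at the target sentence, one language label per token, can be sketched as follows. The function name, the `</s>` delimiter placement, and the tuple layout are illustrative, not XLM's exact preprocessing.

```python
def tlm_input(src_tokens, tgt_tokens, src_lang, tgt_lang):
    """Build one Translation LM example from a parallel sentence pair.

    Each position carries (token, position, language); positions restart at 0
    for the target sentence, and sentences are wrapped in </s> delimiters."""
    example = []
    for lang, tokens in ((src_lang, src_tokens), (tgt_lang, tgt_tokens)):
        for pos, tok in enumerate(["</s>"] + tokens + ["</s>"]):
            example.append((tok, pos, lang))
    return example

example = tlm_input(
    ["the", "curtains", "were", "blue"],
    ["les", "rideaux", "étaient", "bleus"],
    "en", "fr",
)
print(example[:2])  # [('</s>', 0, 'en'), ('the', 1, 'en')]
```

Masking (as in the MLM sketch above) would then be applied on top of these triples; because positions restart and languages differ, the model can align the two sentences and use the translation to fill in masked words.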
- “… Beyond Language Modeling”, Bowman et al., 2018.
- “… structure in contextualized word representations”, Tenney et al., 2018.
- Yogatama et al., 2019.
Note the differences between contextual token representations and:
- “… Translation: Contextualized Word Vectors”, McCann et al.
- “Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond”, Artetxe and Schwenk, 2018 [Facebook].
[Figure: the model maps the input sentence "he will be late" through a projection + softmax at each position, producing outputs in token space.]