SLIDE 1

Contextual Token Representations

ULMfit, OpenAI GPT, ELMo, BERT, XLM

Noe Casas

SLIDE 2
Background: Language Modeling

  • Data: monolingual corpus.
  • Task: predict the next token given the previous tokens (causal):
    P(T_i | T_1, …, T_{i−1})
  • Usual models: LSTM, Transformer.

[Figure: tokens <s>, T1, T2, …, </s> are embedded (embed1, embed2, …), fed through the model, and each position is projected and softmaxed to predict the next token.]
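Not part of the slides: a minimal PyTorch sketch of this causal-LM objective, with a single-layer LSTM and invented sizes, where each position is trained to predict the next token.

```python
import torch
import torch.nn as nn

# Toy causal language model: embed -> LSTM -> project -> softmax over the vocabulary.
vocab_size, emb_dim, hidden_dim = 1000, 64, 128
embed = nn.Embedding(vocab_size, emb_dim)
lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
project = nn.Linear(hidden_dim, vocab_size)

# Dummy batch of token ids: positions 0..n-2 are inputs, 1..n-1 are targets,
# so position i learns P(T_i | T_1 ... T_{i-1}).
tokens = torch.randint(0, vocab_size, (8, 20))        # (batch, seq_len)
inputs, targets = tokens[:, :-1], tokens[:, 1:]

hidden, _ = lstm(embed(inputs))                        # (batch, seq_len-1, hidden_dim)
logits = project(hidden)                               # (batch, seq_len-1, vocab_size)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
```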

SLIDE 3
Contextual embeddings: intuition

  • The same word can have different meanings depending on the context. Example:
    • Please, type everything in lowercase.
    • What type of flowers do you like most?
  • Classic word embeddings offer the same vector representation regardless of the context.
  • Solution: create word representations that depend on the context.
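As a toy illustration (not from the slides): a static embedding table returns the same vector for "type" in both sentences, while a context-dependent encoder (here an untrained bidirectional LSTM standing in for ELMo/BERT) returns different vectors. The vocabulary and sizes are invented.

```python
import torch
import torch.nn as nn

vocab = {"<pad>": 0, "please": 1, "type": 2, "everything": 3, "in": 4, "lowercase": 5,
         "what": 6, "of": 7, "flowers": 8, "do": 9, "you": 10, "like": 11, "most": 12}
sent_a = torch.tensor([[vocab[w] for w in "please type everything in lowercase".split()]])
sent_b = torch.tensor([[vocab[w] for w in "what type of flowers do you like most".split()]])

embed = nn.Embedding(len(vocab), 16)
encoder = nn.LSTM(16, 16, bidirectional=True, batch_first=True)

# Static embedding: identical vector for "type" regardless of the sentence.
static_a, static_b = embed(sent_a)[0, 1], embed(sent_b)[0, 1]
print(torch.equal(static_a, static_b))             # True

# Contextual representation: the vector for "type" depends on its neighbours.
ctx_a, _ = encoder(embed(sent_a))
ctx_b, _ = encoder(embed(sent_b))
print(torch.allclose(ctx_a[0, 1], ctx_b[0, 1]))    # almost surely False (different contexts)
```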

SLIDE 4

Articles

Model Alias  | Org.     | Article                                                                           | Reference
ULMfit       | fast.ai  | Universal Language Model Fine-tuning for Text Classification                      | Howard and Ruder
ELMo         | AllenNLP | Deep contextualized word representations                                          | Peters et al.
OpenAI GPT   | OpenAI   | Improving Language Understanding by Generative Pre-Training                       | Radford et al.
BERT         | Google   | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding  | Devlin et al.
XLM          | Facebook | Cross-lingual Language Model Pretraining                                          | Lample and Conneau

SLIDE 5

Overview

  • Train a model on one of multiple tasks that lead to word representations.
  • Release the pre-trained models.
  • Use the pre-trained models, with two options (sketched below):
  • A. Fine-tune the model on the final task.
  • B. Directly encode token representations with the (frozen) model.
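A minimal sketch of the two options, assuming a generic pre-trained encoder that maps token ids to per-token hidden states of size hidden_dim; the names `pretrained_encoder` and `TaskHead` are placeholders, not from any of the papers.

```python
import torch.nn as nn

class TaskHead(nn.Module):
    """Small classifier placed on top of the token representations."""
    def __init__(self, hidden_dim, num_classes):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_states):                   # (batch, seq, hidden_dim)
        return self.linear(token_states.mean(dim=1))   # pool over tokens

def option_a_finetune(pretrained_encoder, hidden_dim, num_classes):
    # A. Fine-tune: encoder weights stay trainable, usually with a small learning rate.
    return nn.Sequential(pretrained_encoder, TaskHead(hidden_dim, num_classes))

def option_b_feature_extract(pretrained_encoder, hidden_dim, num_classes):
    # B. Feature extraction: freeze the encoder and train only the task head.
    for p in pretrained_encoder.parameters():
        p.requires_grad = False
    return nn.Sequential(pretrained_encoder, TaskHead(hidden_dim, num_classes))
```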
SLIDE 6

Overview (graphical)

[Figure: two-phase pipeline.
Phase 1, semi-supervised training: a language-modeling architecture plus an LM task head (projection + softmax) is trained on a monolingual corpus.
Phase 2, downstream-task fine-tuning: the same architecture, now with a downstream-task head, is trained on task-specific data; the pre-trained weights are transferred and either fine-tuned with a small learning rate or directly frozen, yielding contextual representations.]
SLIDE 7

Differences

Alias      | Model       | Token   | Tasks                                  | Language
ULMfit     | LSTM        | word    | Causal LM                              | English
ELMo       | LSTM        | word    | Bidirectional LM                       | English
OpenAI GPT | Transformer | subword | Causal LM + Classification             | English
BERT       | Transformer | subword | Masked LM + Next sentence prediction   | Multilingual
XLM        | Transformer | subword | Causal LM + Masked LM + Translation LM | Multilingual

SLIDE 8

ULMFiT

  • Task: causal LM
  • Model: 3-layer LSTM
  • Tokens: words

[Figure: word embeddings for <s>, T1, T2, …, </s> flow through three stacked LSTM layers; each position is projected and softmaxed to predict the next word.]
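A rough sketch of this backbone, a 3-layer word-level LSTM language model; the actual ULMFiT model is fastai's AWD-LSTM with regularisation and fine-tuning tricks (discriminative learning rates, gradual unfreezing) that are not shown here, and the sizes below are only illustrative.

```python
import torch.nn as nn

class WordLSTMLanguageModel(nn.Module):
    """3-layer word-level LSTM LM: embed -> stacked LSTM -> project to the vocabulary."""
    def __init__(self, vocab_size, emb_dim=400, hidden_dim=1150, num_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.project = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):                       # (batch, seq_len) of word ids
        states, _ = self.lstm(self.embed(token_ids))
        return self.project(states)                     # logits over the next word
```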

SLIDE 9

ELMo

  • Task: bidirectional LM
  • Model: 2-layer biLSTM
  • Tokens: words

[Figure: character-CNN embeddings C1, …, CN for each word feed a 2-layer forward LSTM and a 2-layer backward LSTM; each direction predicts the next/previous word via projection + softmax, and the hidden states provide the contextual representations.]
SLIDE 10

OpenAI GPT

  • Task: causal LM
  • Model: self-attention layers
  • Tokens: subwords

[Figure: token embeddings for the sentence “he will be late” plus positional embeddings (1, 2, 3, 4, …) are summed and fed through the self-attention layers; each position is projected and softmaxed to predict the next token, producing the output tokens “he will be late </s>”.]
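A minimal sketch of these ingredients (token + positional embeddings feeding self-attention layers with a causal mask), built from PyTorch's generic Transformer encoder layers rather than the actual GPT code; all sizes are invented.

```python
import torch
import torch.nn as nn

class CausalSelfAttentionLM(nn.Module):
    def __init__(self, vocab_size, d_model=256, nhead=4, num_layers=4, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)     # token (subword) embeddings
        self.pos = nn.Embedding(max_len, d_model)        # learned positional embeddings
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers)
        self.project = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):                        # (batch, seq_len)
        seq_len = token_ids.size(1)
        positions = torch.arange(seq_len, device=token_ids.device)
        x = self.tok(token_ids) + self.pos(positions)    # sum token + position embeddings
        # Causal mask: position i may only attend to positions <= i.
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf"),
                                     device=token_ids.device), diagonal=1)
        h = self.blocks(x, mask=mask)
        return self.project(h)                           # next-token logits
```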

SLIDE 11

BERT

  • Tasks: masked LM + next sentence prediction
  • Model: self-attention layers
  • Tokens: subwords

[Figure: input sequence “[CLS] he [MASK] be late [SEP] you [MASK] leave now [SEP]”; token, positional (1, 2, …, 10) and segment (A for the first sentence, B for the second) embeddings are summed; 15% of the tokens get masked; the self-attention layers feed a projection + softmax at every position to recover the original tokens (“he will be late [SEP] you should leave now [SEP]”), and the output at [CLS] is used for classification tasks such as next sentence prediction.]
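A rough sketch of the masking step only: select 15% of the non-special token positions and replace them with [MASK], keeping the original ids as targets. The actual BERT recipe additionally replaces some selected tokens with random tokens or leaves them unchanged, which is omitted here.

```python
import torch

def mask_tokens(token_ids, mask_token_id, special_ids, mask_prob=0.15):
    """token_ids: (batch, seq_len) LongTensor of subword ids.
    Returns (masked inputs, masked-LM targets)."""
    # Candidate positions: everything except special tokens such as [CLS], [SEP], padding.
    candidates = torch.ones_like(token_ids, dtype=torch.bool)
    for sid in special_ids:
        candidates &= token_ids != sid
    # Mask roughly 15% of the candidate positions.
    selected = (torch.rand(token_ids.shape) < mask_prob) & candidates
    inputs = token_ids.clone()
    inputs[selected] = mask_token_id             # replace selected tokens with [MASK]
    targets = token_ids.clone()
    targets[~selected] = -100                    # cross_entropy's default ignore_index
    return inputs, targets
```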

SLIDE 12

XLM

  • Tasks: causal LM + masked LM + Translation LM
  • Model: self-attention layers
  • Tokens: subwords

[Figure (from “Cross-lingual Language Model Pretraining”): two training streams.
Masked Language Modeling (MLM): a single English sentence with some subwords replaced by [MASK]; token, position (1, 2, 3, …) and language (en) embeddings are summed and the masked tokens are predicted.
Translation Language Modeling (TLM): an English sentence (“the curtains were blue”) is concatenated with its French translation (“les rideaux étaient bleus”); positions restart for the second sentence, each half gets its own language embedding (en / fr), and masked tokens in either language can be predicted using context from both, i.e. masked LM with parallel sentences. Projection and softmax are omitted in the figure.]
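A small sketch of the input construction shown in the figure (not the actual XLM code): token, position and language embeddings are summed, and for TLM a sentence and its translation are concatenated into one stream with positions restarting for the second language. All names and sizes are made up; for MLM a single monolingual sentence is used with one language id throughout.

```python
import torch
import torch.nn as nn

class XLMInputEmbeddings(nn.Module):
    """Sum of token + position + language embeddings, as in the MLM/TLM figure."""
    def __init__(self, vocab_size, num_langs, d_model=256, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        self.lang = nn.Embedding(num_langs, d_model)

    def forward(self, token_ids, lang_ids, position_ids=None):
        if position_ids is None:
            position_ids = torch.arange(token_ids.size(1),
                                        device=token_ids.device).expand_as(token_ids)
        return self.tok(token_ids) + self.pos(position_ids) + self.lang(lang_ids)

def tlm_stream(en_ids, fr_ids, en_lang=0, fr_lang=1):
    """TLM: concatenate a sentence and its translation; positions restart for the
    second language and each half carries its own language embedding id."""
    token_ids = torch.cat([en_ids, fr_ids], dim=1)
    position_ids = torch.cat([torch.arange(en_ids.size(1)).expand_as(en_ids),
                              torch.arange(fr_ids.size(1)).expand_as(fr_ids)], dim=1)
    lang_ids = torch.cat([torch.full_like(en_ids, en_lang),
                          torch.full_like(fr_ids, fr_lang)], dim=1)
    return token_ids, position_ids, lang_ids
```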

SLIDE 13

Downstream Tasks

  • Natural Language Inference (NLI) or Cross-lingual NLI.
  • Text classification (e.g. sentiment analysis).
  • Next sentence prediction.
  • Supervised and Unsupervised Neural Machine Translation (NMT).
  • Question Answering (QA).
  • Named Entity Recognition (NER).
SLIDE 14
Further reading

  • “Looking for ELMo's friends: Sentence-Level Pretraining Beyond Language Modeling”, Bowman et al., 2018.
  • “What do you learn from context? Probing for sentence structure in contextualized word representations”, Tenney et al., 2018.
  • “Assessing BERT’s Syntactic Abilities”, Goldberg, 2018.
  • “Learning and Evaluating General Linguistic Intelligence”, Yogatama et al., 2019.

SLIDE 15

Differences with other representations

Note how contextual token representations differ from:

  • Translation-based representations like CoVe: Learned in Translation: Contextualized Word Vectors, McCann et al., 2017 [Salesforce].
  • Fixed-size sentence representations like Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond, Artetxe and Schwenk, 2018 [Facebook].

SLIDE 16
Other resources

  • https://nlp.stanford.edu/seminar/details/jdevlin.pdf
  • http://jalammar.github.io/illustrated-bert/
  • https://medium.com/dissecting-bert/dissecting-bert-part2-335ff2ed9c73
  • https://github.com/huggingface/pytorch-pretrained-BERT

SLIDE 17

Summary

Alias      | Model       | Token   | Tasks                                  | Language
ULMfit     | LSTM        | word    | Causal LM                              | English
ELMo       | LSTM        | word    | Bidirectional LM                       | English
OpenAI GPT | Transformer | subword | Causal LM + Classification             | English
BERT       | Transformer | subword | Masked LM + Next sentence prediction   | Multilingual
XLM        | Transformer | subword | Causal LM + Masked LM + Translation LM | Multilingual

[Figure: the two-phase pipeline again. Phase 1, semi-supervised training: model + LM task head (projection + softmax) trained on a monolingual corpus. Phase 2, downstream-task fine-tuning: the same model, with its weights transferred, plus a downstream-task head trained on task-specific data.]

SLIDE 18

Bonus slides

SLIDE 19
Are these really token representations?

  • They are a linear projection away from token space.
  • Word-level nearest-neighbour search in a corpus finds the same word with the same usage.

[Figure: “he will be late” fed through the model; each output position, after projection + softmax, maps back to the tokens “he will be late”.]
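A sketch (not from the slides) of the nearest-neighbour check: collect one contextual vector per token occurrence in a small corpus and query by cosine similarity. The untrained toy encoder below only stands in for a real pre-trained model, so the actual neighbours here are meaningless; only the recipe matters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for a pre-trained contextual encoder (ELMo/BERT would go here).
vocab = {w: i for i, w in enumerate(
    "please type everything in lowercase what of flowers do you like most".split())}
embed = nn.Embedding(len(vocab), 16)
encoder = nn.LSTM(16, 16, bidirectional=True, batch_first=True)

def encode(sentence):
    ids = torch.tensor([[vocab[w] for w in sentence.split()]])
    states, _ = encoder(embed(ids))
    return states[0]                                      # one vector per token

corpus = ["please type everything in lowercase",
          "what type of flowers do you like most"]

# One contextual vector per token occurrence in the corpus.
occurrences, vectors = [], []
for sent in corpus:
    states = encode(sent)
    for position, word in enumerate(sent.split()):
        occurrences.append((word, sent))
        vectors.append(states[position])
vectors = torch.stack(vectors)

# Cosine nearest neighbour of "type" in the first sentence (excluding itself).
query_idx = occurrences.index(("type", corpus[0]))
sims = F.cosine_similarity(vectors[query_idx].unsqueeze(0), vectors)
neighbour = sims.topk(2).indices[1].item()                # index 0 is the query itself
print(occurrences[neighbour])
```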