  1. Contextual Token Representations: ULMfit, OpenAI GPT, ELMo, BERT, XLM. Noe Casas.

  2. Background: Language Modeling
  • Data: monolingual corpus.
  • Task: predict the next token given the previous tokens (causal), i.e. the model estimates P(T_i | T_1, …, T_{i-1}).
  • Usual models: LSTM, Transformer.
  [Figure: input tokens <s>, T_1, …, T_N are embedded, fed through the model, and projected through a softmax to predict T_1, T_2, …, </s>.]
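The causal factorization P(T_1, …, T_N) = ∏_i P(T_i | T_1, …, T_{i-1}) maps directly onto a small model: embed the prefix, run it through a recurrent (or self-attention) encoder, and project each hidden state onto the vocabulary. A minimal sketch in PyTorch, with made-up sizes and not the exact architecture from any of the papers:

```python
# Minimal causal-LM sketch: embed -> LSTM -> projection -> softmax,
# trained to predict token i from tokens 1..i-1. Sizes are illustrative.
import torch
import torch.nn as nn

class CausalLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256, num_layers=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)
        self.project = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):                   # tokens: (batch, seq_len)
        hidden, _ = self.lstm(self.embed(tokens))
        return self.project(hidden)              # logits: (batch, seq_len, vocab)

vocab_size = 1000
model = CausalLM(vocab_size)
tokens = torch.randint(0, vocab_size, (2, 10))          # toy batch
logits = model(tokens[:, :-1])                          # predict from each prefix
loss = nn.CrossEntropyLoss()(logits.reshape(-1, vocab_size),
                             tokens[:, 1:].reshape(-1)) # next-token targets
```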

  3. Contextual embeddings: intuition
  • The same word can have different meanings depending on the context. Example:
    - Please, type everything in lowercase.
    - What type of flowers do you like most?
  • Classic word embeddings offer the same vector representation regardless of the context.
  • Solution: create word representations that depend on the context.
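One way to see this concretely is to encode the two example sentences with a pretrained contextual model and compare the vectors it assigns to "type". A rough sketch using the pytorch-pretrained-BERT package listed under "Other resources"; the exact calls are written from memory and should be treated as an assumption:

```python
# Sketch: contextual vectors for the same surface word differ by context.
# Assumes the pytorch-pretrained-BERT package; API details may need adjusting.
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

def vector_for(sentence, word):
    tokens = ['[CLS]'] + tokenizer.tokenize(sentence) + ['[SEP]']
    ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
    with torch.no_grad():
        layers, _ = model(ids)                  # per-layer hidden states
    return layers[-1][0, tokens.index(word)]    # top-layer vector for `word`

v1 = vector_for("please type everything in lowercase", "type")
v2 = vector_for("what type of flowers do you like most", "type")
print(torch.nn.functional.cosine_similarity(v1, v2, dim=0))
# A static embedding table would give similarity 1.0; contextual vectors do not.
```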

  4. Articles
  • ULMfit (fast.ai): "Universal Language Model Fine-tuning for Text Classification", Howard and Ruder.
  • ELMo (AllenNLP): "Deep contextualized word representations", Peters et al.
  • OpenAI GPT (OpenAI): "Improving Language Understanding by Generative Pre-Training", Radford et al.
  • BERT (Google): "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", Devlin et al.
  • XLM (Facebook): "Cross-lingual Language Model Pretraining", Lample and Conneau.

  5. Overview
  • Train a model on one of several tasks that lead to word representations.
  • Release the pre-trained models.
  • Use a pre-trained model, with two options (see the sketch below):
    A. Fine-tune the model on the final task.
    B. Directly encode token representations with the model.
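The two options differ only in whether the pretrained encoder's parameters keep being updated on the downstream task. A schematic sketch; the encoder here is an untrained LSTM stand-in and all sizes are placeholders, not any specific released model:

```python
# Schematic: fine-tuning (option A) vs. feature extraction (option B).
# The LSTM is a stand-in for any pretrained encoder; sizes are placeholders.
import torch
import torch.nn as nn

encoder = nn.LSTM(128, 256, batch_first=True)   # pretend this is pretrained
task_head = nn.Linear(256, 5)                   # e.g. 5-class text classification

# Option A: fine-tune everything, typically with a small learning rate.
optimizer_a = torch.optim.Adam(
    list(encoder.parameters()) + list(task_head.parameters()), lr=1e-5)

# Option B: freeze the encoder, train only the task head on its representations.
for p in encoder.parameters():
    p.requires_grad = False
optimizer_b = torch.optim.Adam(task_head.parameters(), lr=1e-3)

tokens_embedded = torch.randn(2, 10, 128)       # toy pre-embedded batch
contextual, _ = encoder(tokens_embedded)        # (batch, seq, 256) token vectors
logits = task_head(contextual[:, 0])            # classify from the first token
```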

  6. Overview (graphical)
  [Figure: two phases. Phase 1, semi-supervised training: a language-modeling architecture with an LM head (projection + softmax) is trained on a *LM task over a monolingual corpus. Phase 2, downstream task fine-tuning: the contextual representations are transferred to a downstream task head and trained on task-specific data, using a small learning rate or directly freezing the pretrained weights.]

  7. Differences
  • ULMfit: LSTM model, word tokens, causal LM, English.
  • ELMo: LSTM model, word tokens, bidirectional LM, English.
  • OpenAI GPT: Transformer model, subword tokens, causal LM + classification, English.
  • BERT: Transformer model, subword tokens, masked LM + next sentence prediction, multilingual.
  • XLM: Transformer model, subword tokens, causal LM + masked LM + translation LM, multilingual.

  8. ULMFiT
  • Task: causal LM.
  • Model: 3-layer LSTM.
  • Tokens: words.
  [Figure: word embeddings E_1 … E_N pass through three stacked LSTM layers; each position is projected and passed through a softmax to predict the next token T_1, T_2, …, </s>.]
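Architecturally this is the causal LM sketched under slide 2, with a word-level vocabulary and three stacked LSTM layers. A configuration-level sketch; the vocabulary and layer sizes are placeholders, not the AWD-LSTM hyperparameters used in the paper:

```python
# ULMFiT-style backbone sketch: word-level causal LM with 3 stacked LSTM
# layers. Sizes are placeholders, not the paper's AWD-LSTM configuration.
import torch.nn as nn

word_vocab = 60000                               # assumed word-level vocabulary size
backbone = nn.ModuleDict({
    'embed':   nn.Embedding(word_vocab, 300),
    'lstm':    nn.LSTM(300, 1024, num_layers=3, batch_first=True),
    'project': nn.Linear(1024, word_vocab),      # LM head, swapped out when fine-tuning
})
```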

  9. ELMo
  • Task: bidirectional LM.
  • Model: 2-layer biLSTM over character-CNN word encodings.
  • Tokens: words.
  [Figure: the characters C_1 … C_N of each word pass through a charCNN; two layers of forward and backward LSTMs run over the sequence; each position is projected and passed through a softmax to predict the tokens T_1 … T_N.]
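A bidirectional LM trains a forward LSTM to predict the next token and a backward LSTM to predict the previous one; the contextual representation at each position concatenates both directions. A bare-bones sketch that ignores the charCNN inputs and ELMo's learned layer weighting:

```python
# Bidirectional-LM sketch: forward and backward LSTMs trained as two causal
# LMs; the token representation concatenates both directions.
# Ignores ELMo's charCNN inputs and layer weighting; sizes are illustrative.
import torch
import torch.nn as nn

emb_dim, hid, vocab = 128, 256, 1000
embed = nn.Embedding(vocab, emb_dim)
fwd_lstm = nn.LSTM(emb_dim, hid, num_layers=2, batch_first=True)
bwd_lstm = nn.LSTM(emb_dim, hid, num_layers=2, batch_first=True)
project = nn.Linear(hid, vocab)                     # shared softmax projection

tokens = torch.randint(0, vocab, (2, 10))
x = embed(tokens)
h_fwd, _ = fwd_lstm(x)                              # left-to-right states
h_bwd, _ = bwd_lstm(torch.flip(x, dims=[1]))        # run over the reversed sequence
h_bwd = torch.flip(h_bwd, dims=[1])                 # re-align to the original order

# The forward LM predicts token i+1 from h_fwd[:, i]; the backward LM predicts
# token i-1 from h_bwd[:, i]. The contextual representation of token i:
representation = torch.cat([h_fwd, h_bwd], dim=-1)  # (batch, seq, 2*hid)
```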

  10. OpenAI GPT
  • Task: causal LM.
  • Model: self-attention layers.
  • Tokens: subwords.
  • Input: token embeddings + positional embeddings (positions 0, 1, 2, …).
  [Figure: the input subwords ("he will be late" preceded by a start token) pass through the self-attention layers; each position is projected and passed through a softmax to predict the next output token: "he will be late </s>".]
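The difference from the LSTM LMs above is the encoder: stacked self-attention layers with a causal (lower-triangular) mask so that position i only attends to positions ≤ i, plus learned positional embeddings. A sketch using PyTorch's Transformer layers, with illustrative sizes:

```python
# GPT-style causal LM sketch: token + positional embeddings feeding
# self-attention layers with a causal mask. Sizes are illustrative.
import torch
import torch.nn as nn

vocab, d_model, max_len = 1000, 256, 512
tok_emb = nn.Embedding(vocab, d_model)
pos_emb = nn.Embedding(max_len, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8), num_layers=4)
project = nn.Linear(d_model, vocab)

tokens = torch.randint(0, vocab, (2, 10))            # (batch, seq)
positions = torch.arange(tokens.size(1))
x = tok_emb(tokens) + pos_emb(positions)             # sum of the two embeddings

# Causal mask: position i may not attend to positions > i.
seq_len = tokens.size(1)
mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)
hidden = encoder(x.transpose(0, 1), mask=mask)       # layers expect (seq, batch, d)
logits = project(hidden.transpose(0, 1))             # (batch, seq, vocab)
```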

  11. BERT
  • Tasks: masked LM + next sentence prediction.
  • Model: self-attention layers.
  • Tokens: subwords.
  • Input: token embeddings + positional embeddings + segment embeddings (segment A for the first sentence, B for the second); 15% of the tokens get masked.
  [Figure: the input "[CLS] he [MASK] be late [SEP] you [MASK] leave now [SEP]" passes through the self-attention layers; the output "he will be late [SEP] you should leave now [SEP]" is predicted through per-position projections and softmaxes, and the [CLS] output is used for classification tasks.]
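The masked-LM input is built by summing three embedding tables and replacing a random 15% of the tokens with a [MASK] id; the model is trained to recover the originals at the masked positions. A sketch of that input construction; ids and sizes are made up, and BERT's 80/10/10 keep-or-randomize details are omitted:

```python
# Sketch of BERT-style masked-LM input construction: token + position +
# segment embeddings summed, with 15% of tokens replaced by [MASK].
# Ids/sizes are made up; the 80/10/10 keep/replace refinement is omitted.
import torch
import torch.nn as nn

vocab, d_model, max_len, MASK_ID = 1000, 256, 512, 4
tok_emb = nn.Embedding(vocab, d_model)
pos_emb = nn.Embedding(max_len, d_model)
seg_emb = nn.Embedding(2, d_model)                  # segment A = 0, B = 1

tokens = torch.randint(5, vocab, (2, 11))           # toy sentence-pair batch
segments = torch.tensor([[0]*6 + [1]*5] * 2)        # A for sentence 1, B for 2

mask = torch.rand(tokens.shape) < 0.15              # choose 15% of positions
inputs = tokens.clone()
inputs[mask] = MASK_ID                              # hide the chosen tokens

x = tok_emb(inputs) + pos_emb(torch.arange(tokens.size(1))) + seg_emb(segments)
# x goes through the self-attention layers; the LM loss is computed only at
# the masked positions, against the original tokens[mask].
```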

  12. XLM
  • Tasks: causal LM + masked LM (MLM) + translation LM (TLM, a masked LM over pairs of parallel sentences).
  • Model: self-attention layers.
  • Tokens: subwords.
  • Input: token embeddings + position embeddings + language embeddings; for TLM, the source and target sentences are concatenated and the position embeddings restart from 0 at the target sentence.
  [Figure (from "Cross-lingual Language Model Pretraining"): MLM panel, an English sentence with masked tokens and language embeddings all "en"; TLM panel, an English sentence concatenated with its French translation, masks in both halves, positions restarting from 0 for the French half, and language embeddings "en"/"fr". Projection and softmax are omitted.]
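TLM reuses the masked-LM objective but concatenates a translation pair, so a masked word in one language can be predicted from context in either language. A sketch of the input construction, building on the BERT-style masking above; ids and sizes are made up:

```python
# Sketch of XLM translation-LM (TLM) input: an English sentence and its
# French translation concatenated, positions reset for the second half,
# plus a language embedding per token. Ids/sizes are made up.
import torch
import torch.nn as nn

vocab, d_model, max_len, n_langs, MASK_ID = 1000, 256, 512, 2, 4
tok_emb = nn.Embedding(vocab, d_model)
pos_emb = nn.Embedding(max_len, d_model)
lang_emb = nn.Embedding(n_langs, d_model)            # en = 0, fr = 1

en = torch.randint(5, vocab, (1, 6))                 # toy English sentence
fr = torch.randint(5, vocab, (1, 6))                 # its toy French translation
tokens = torch.cat([en, fr], dim=1)                  # concatenated pair
positions = torch.cat([torch.arange(6), torch.arange(6)])  # restart at the fr half
languages = torch.tensor([[0]*6 + [1]*6])

mask = torch.rand(tokens.shape) < 0.15               # mask tokens in both halves
inputs = tokens.clone()
inputs[mask] = MASK_ID

x = tok_emb(inputs) + pos_emb(positions) + lang_emb(languages)
# As in MLM, the loss is computed at the masked positions only; the model can
# use context from either language to fill them in.
```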

  13. Downstream Tasks
  • Natural Language Inference (NLI) or cross-lingual NLI.
  • Text classification (e.g. sentiment analysis).
  • Next sentence prediction.
  • Supervised and unsupervised Neural Machine Translation (NMT).
  • Question Answering (QA).
  • Named Entity Recognition (NER).

  14. Further reading
  • "Looking for ELMo's friends: Sentence-Level Pretraining Beyond Language Modeling", Bowman et al., 2018.
  • "What do you learn from context? Probing for sentence structure in contextualized word representations", Tenney et al., 2018.
  • "Assessing BERT's Syntactic Abilities", Goldberg, 2019.
  • "Learning and Evaluating General Linguistic Intelligence", Yogatama et al., 2019.

  15. Differences with other representations
  Note the differences of contextual token representations with:
  • Word vectors learned from supervised translation rather than language modeling, as in CoVe: "Learned in Translation: Contextualized Word Vectors", McCann et al., 2017 [Salesforce].
  • Fixed-size sentence representations, as in "Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond", Artetxe and Schwenk, 2018 [Facebook].

  16. Other resources
  • https://nlp.stanford.edu/seminar/details/jdevlin.pdf
  • http://jalammar.github.io/illustrated-bert/
  • https://medium.com/dissecting-bert/dissecting-bert-part2-335ff2ed9c73
  • https://github.com/huggingface/pytorch-pretrained-BERT

  17. Summary
  • Phase 1, semi-supervised training: a language-modeling architecture with an LM head (projection + softmax) is trained on a *LM task over a monolingual corpus.
  • Phase 2, downstream task fine-tuning: the pretrained model is transferred to a downstream task head and trained on task-specific data.
  • ULMfit: LSTM model, word tokens, causal LM, English.
  • ELMo: LSTM model, word tokens, bidirectional LM, English.
  • OpenAI GPT: Transformer model, subword tokens, causal LM + classification, English.
  • BERT: Transformer model, subword tokens, masked LM + next sentence prediction, multilingual.
  • XLM: Transformer model, subword tokens, causal LM + masked LM + translation LM, multilingual.

  18. Bonus slides

  19. Are these really token representations?
  • They are a linear projection away from token space.
  • Word-level nearest-neighbour search in the corpus finds the same word with the same usage.
  [Figure: the model's hidden states for "he will be late" are projected and passed through a softmax back into token space.]
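The first point can be checked directly: multiplying a contextual hidden state by the LM's output projection lands it back in token space as a distribution over the vocabulary. A self-contained sketch with a random stand-in state and placeholder sizes; the second point corresponds to running a cosine nearest-neighbour search over contextual vectors of corpus occurrences, as in the earlier "type" example:

```python
# Sketch: a contextual state is "a linear projection away from token space".
# The state and sizes are placeholders; in practice the state comes from the model.
import torch
import torch.nn as nn

vocab, hidden_dim = 1000, 256
project = nn.Linear(hidden_dim, vocab)         # the LM head (projection + softmax)

contextual_state = torch.randn(hidden_dim)     # stand-in for a model hidden state
token_probs = torch.softmax(project(contextual_state), dim=-1)
print(token_probs.topk(5))                     # the nearest tokens to this state
```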
