
  1. SFU NatLangLab CMPT 825: Natural Language Processing Contextualized Word Embeddings Spring 2020 2020-03-17 Adapted from slides by Danqi Chen and Karthik Narasimhan (with some content from slides by Chris Manning and Abigail See)

  2. Course Logistics • Online lectures from now on! Everyone stay safe! • HW4 due Tuesday 3/24 • Project Milestone due Tuesday 3/31

  3. Course Logistics Remaining lectures (tentative) • Contextual word embeddings and Transformers • Parsing: • Dependency Parsing • Constituency Parsing • Semantic Parsing • CNNs for NLP • Applications: Question Answering, Dialogue, Coreference, Grounding

  4. Overview Contextualized Word Representations • ELMo = Embeddings from Language Models • BERT = Bidirectional Encoder Representations from Transformers

  5. Overview • Transformers

  6. Recap: word2vec (example query: word = “sweden”)

  7. What’s wrong with word2vec? • One vector for each word type, e.g., v(bank) = (−0.224, 0.130, −0.290, 0.276) • Complex characteristics of word use: semantics, syntactic behavior, and connotations • Polysemous words, e.g., bank, mouse

  8. Contextualized word embeddings Let’s build a vector for each word conditioned on its context! f : (w_1, w_2, …, w_n) ⟶ x_1, …, x_n ∈ ℝ^d (example sentence: “the movie was terribly exciting !”)

  9. Contextualized word embeddings (from ELMo) (Peters et al., 2018): Deep contextualized word representations

  10. ELMo • NAACL’18: Deep contextualized word representations • Key idea: • Train an LSTM-based language model on some large corpus • Use the hidden states of the LSTM for each token to compute a vector representation of each word

  11. ELMo Pretrain the LM parameters: a forward LM and a backward LM, each with a softmax over the vocabulary, run over the input tokens (figure credit: Jay Alammar, http://jalammar.github.io/illustrated-bert/)

  12. ELMo After training the LM, to get the ELMo embedding of a word (“stick”): concatenate the forward and backward embeddings and take a weighted sum of the layers (figure credit: Jay Alammar, http://jalammar.github.io/illustrated-bert/)

  13. ELMo The LM weights are frozen; the layer weights s_j are trained on the specific task. To get the ELMo embedding of a word (“stick”): concatenate the forward and backward embeddings and take a weighted sum of the layers (figure credit: Jay Alammar, http://jalammar.github.io/illustrated-bert/)

  14. Summary: How to get the ELMo embedding? With L layers, the biLM gives a token representation h^LM_{k,0} = x^LM_k and hidden states h^LM_{k,j} = [ h⃗^LM_{k,j} ; h⃖^LM_{k,j} ] for j = 1, …, L (forward and backward LSTM states concatenated). The task-specific ELMo vector is ELMo^task_k = γ^task · Σ_{j=0..L} s^task_j · h^LM_{k,j} • γ^task: allows the task model to scale the entire ELMo vector • s^task_j: softmax-normalized weights across layers j • To use: plug ELMo into any (neural) NLP model: freeze all the LM weights and concatenate ELMo^task_k to the model’s input representation (could also insert it into higher layers)
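
To make the “weighted sum of layers” step concrete, here is a minimal PyTorch sketch of the scalar mix; the class name, the 1024-dim dummy states, and the 3-layer setup are illustrative assumptions, not the actual AllenNLP implementation:

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Sketch of ELMo-style layer mixing: softmax-normalized per-layer
    weights s_j plus a global scale gamma (illustrative only)."""
    def __init__(self, num_layers):
        super().__init__()
        self.scalar_weights = nn.Parameter(torch.zeros(num_layers))  # s^task_j (before softmax)
        self.gamma = nn.Parameter(torch.ones(1))                     # gamma^task

    def forward(self, layer_states):
        # layer_states: list of tensors, each (batch, seq_len, dim), where dim
        # is the concatenated forward+backward LM state for that layer.
        s = torch.softmax(self.scalar_weights, dim=0)
        mixed = sum(w * h for w, h in zip(s, layer_states))
        return self.gamma * mixed

# Usage with dummy LM states (3 layers, batch 2, seq len 5, dim 1024):
mix = ScalarMix(num_layers=3)
states = [torch.randn(2, 5, 1024) for _ in range(3)]
elmo_vec = mix(states)   # shape (2, 5, 1024)
```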

  15. More details • Forward and backward LMs: 2 layers each • Use a character CNN to build the initial word representation • 2048 char n-gram filters and 2 highway layers, 512 dim projection • Use 4096 dim hidden/cell LSTM states with 512 dim projections to the next input • A residual connection from the first to the second layer • Trained 10 epochs on the 1B Word Benchmark

  16. Experimental results • SQuAD: question answering • SNLI: natural language inference • SRL: semantic role labeling • Coref: coreference resolution • NER: named entity recognition • SST-5: sentiment analysis

  17. Intrinsic Evaluation • Syntactic information: First Layer > Second Layer • Semantic information: Second Layer > First Layer Syntactic information is better represented at lower layers, while semantic information is captured at higher layers.

  18. Use ELMo in practice https://allennlp.org/elmo Also available in TensorFlow
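
A minimal usage sketch with the AllenNLP Elmo module (assuming an AllenNLP 0.x-era install; the options/weights paths are placeholders for the pretrained files linked from allennlp.org/elmo):

```python
import torch
from allennlp.modules.elmo import Elmo, batch_to_ids

# Placeholder paths: substitute the released pretrained options/weights files.
options_file = "elmo_options.json"
weight_file = "elmo_weights.hdf5"

# One output representation = one scalar mix over the LM layers.
elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0.0)

sentences = [["the", "movie", "was", "terribly", "exciting", "!"]]
character_ids = batch_to_ids(sentences)          # (batch, seq_len, 50) character ids
output = elmo(character_ids)
embeddings = output["elmo_representations"][0]   # (batch, seq_len, 1024)
```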

  19. BERT • First released in Oct 2018 • NAACL’19: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding How is BERT different from ELMo? #1. Unidirectional context vs bidirectional context #2. LSTMs vs Transformers (more on Transformers later) #3. The pre-trained weights are not frozen; they are fine-tuned on the downstream task

  20. Bidirectional encoders • Language models only use left context or right context (although ELMo used two independent LMs from each direction). • Language understanding is bidirectional Why are LMs unidirectional?

  22. Masked language models (MLMs) • Solution: Mask out 15% of the input words, and then predict the masked words • Too little masking: too expensive to train • Too much masking: not enough context

  23. Masked language models (MLMs) A little more complex (don’t always replace with [MASK]): Example: my dog is hairy, and we choose to mask the word hairy • 80% of the time: replace the word with the [MASK] token → my dog is [MASK] • 10% of the time: replace the word with a random word → my dog is apple • 10% of the time: keep the word unchanged, to bias the representation toward the actually observed word → my dog is hairy Why not always use [MASK]? Because [MASK] is never seen when the trained BERT is actually used…
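
A rough sketch of the 80/10/10 masking rule above (a simplified illustration with a toy vocabulary; real BERT preprocessing operates on word pieces and skips special tokens):

```python
import random

MASK = "[MASK]"
VOCAB = ["apple", "dog", "movie", "bank", "hairy"]  # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15):
    """Return (masked_tokens, labels); labels is None for unmasked positions."""
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:          # select ~15% of positions
            labels[i] = tok                      # the model must predict the original word
            r = random.random()
            if r < 0.8:                          # 80%: replace with [MASK]
                masked[i] = MASK
            elif r < 0.9:                        # 10%: replace with a random word
                masked[i] = random.choice(VOCAB)
            # else 10%: keep the word unchanged
    return masked, labels

print(mask_tokens("my dog is hairy".split()))
```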

  24. Next sentence prediction (NSP) Always sample two sentences, and predict whether the second sentence actually follows the first one. Recent papers show that NSP is not necessary… (Joshi*, Chen* et al., 2019): SpanBERT: Improving Pre-training by Representing and Predicting Spans (Liu et al., 2019): RoBERTa: A Robustly Optimized BERT Pretraining Approach
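
A toy sketch of how NSP training pairs can be constructed (illustrative only; BERT actually samples multi-sentence segments up to a length budget):

```python
import random

def make_nsp_pair(corpus):
    """corpus: list of documents, each a list of sentences (assume >= 2 per doc).
    Returns (sent_a, sent_b, is_next) with a roughly 50/50 positive/negative split."""
    doc = random.choice(corpus)
    i = random.randrange(len(doc) - 1)
    sent_a = doc[i]
    if random.random() < 0.5:
        return sent_a, doc[i + 1], True              # the real next sentence
    sent_b = random.choice(random.choice(corpus))    # a random sentence from anywhere
    return sent_a, sent_b, False                     # toy sketch: collisions not checked

corpus = [["the movie was terribly exciting !", "i would watch it again ."],
          ["my dog is hairy .", "he sheds everywhere ."]]
print(make_nsp_pair(corpus))
```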

  25. Pre-training and fine-tuning Key idea: all the weights are fine-tuned on downstream tasks (figure: pre-training vs. fine-tuning; figure credit: Jay Alammar, http://jalammar.github.io/illustrated-bert/)

  26. Applications (figure credit: Jay Alammar http://jalammar.github.io/illustrated-bert/)

  27. More details • Input representations • Use word pieces instead of words: playing => play ##ing • Trained 40 epochs on Wikipedia (2.5B tokens) + BookCorpus (0.8B tokens) • Released two model sizes: BERT_base, BERT_large
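
A quick way to see word-piece tokenization in practice, using the Hugging Face tokenizer (the commented output is indicative only; the actual splits depend on the model’s 30k word-piece vocabulary):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("the kids were snowboarding"))
# Frequent words stay whole; rarer words are split into pieces,
# e.g. something like ['snow', '##board', '##ing'] depending on the vocabulary.
```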

  28. Experimental results on the GLUE benchmark (BiLSTM baseline: 63.9) (Wang et al., 2018): GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

  29. Use BERT in practice TensorFlow : https://github.com/google-research/bert PyTorch : https://github.com/huggingface/transformers
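
A minimal feature-extraction sketch with the Hugging Face transformers library (API details vary slightly across versions; this follows the common BertTokenizer/BertModel pattern):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer("the movie was terribly exciting !", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# First element of the output is the last-layer hidden states:
# shape (batch, seq_len, 768) for bert-base, including [CLS] and [SEP].
last_hidden_states = outputs[0]
```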

  30. Contextualized word embeddings in context • TagLM (Peters et al., 2017) • CoVe (McCann et al., 2017) • ULMFiT (Howard and Ruder, 2018) • ELMo (Peters et al., 2018) • OpenAI GPT (Radford et al., 2018) • BERT (Devlin et al., 2018) • OpenAI GPT-2 (Radford et al., 2019) • XLNet (Yang et al., 2019) • SpanBERT (Joshi et al., 2019) • RoBERTa (Liu et al., 2019) • ALBERT (Lan et al., 2019) • … Many of these are implemented in https://github.com/huggingface/transformers

  31. Transformers

  32. Transformers • NIPS’17: Attention is All You Need • Originally proposed for NMT (encoder-decoder framework) • Used as the base model of BERT (encoder only) • Key idea: multi-head self-attention • No recurrence, so it trains much faster (figure: encoder and decoder stacks)
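
To make “multi-head self-attention” concrete, here is a compact PyTorch sketch of scaled dot-product attention with several heads (dimensions and names are illustrative; real implementations add attention masking, dropout, and layer norm around this block):

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        b, n, _ = x.shape
        # Project and split into heads: (batch, heads, seq_len, d_k)
        q = self.q_proj(x).view(b, n, self.h, self.d_k).transpose(1, 2)
        k = self.k_proj(x).view(b, n, self.h, self.d_k).transpose(1, 2)
        v = self.v_proj(x).view(b, n, self.h, self.d_k).transpose(1, 2)
        # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        attn = torch.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, self.h * self.d_k)
        return self.out_proj(out)

x = torch.randn(2, 7, 512)                 # batch of 2 sequences, length 7
print(MultiHeadSelfAttention()(x).shape)   # torch.Size([2, 7, 512])
```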
