Linguistic Knowledge and Transferability of Contextual Representations

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, Noah A. Smith
NAACL 2019 (June 3, 2019), UWNLP


  1. Linguistic Knowledge and Transferability of Contextual Representations. Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, Noah A. Smith. NAACL 2019, June 3, 2019. UWNLP

  2. [McCann et al., 2017; Peters et al., 2018a; Devlin et al., 2019, inter alia] Contextual Word Representations Are Extraordinarily Effective • Contextual word representations (from contextualizers like ELMo or BERT) work well on many NLP tasks. • But why do they work so well? Better understanding enables principled enhancement. • This work studies several questions about their generalizability and transferability.

  3. (1) Probing Contextual Representations Question: Is the information necessary for a variety of core NLP tasks linearly recoverable from contextual word representations? Answer: Yes, to a great extent! Tasks with lower performance may require fine-grained linguistic knowledge.

  4. (2) How Does Transferability Vary? Question: How does transferability vary across contextualizer layers? Answer: The first layer is the most transferable in LSTMs; the middle layers are in transformers.

  5. (3) Why Does Transferability Vary? Question: Why does transferability vary across contextualizer layers? Answer: It depends on pretraining task-specificity!

  6. (4) Alternative Pretraining Objectives Question: How does language model pretraining compare to alternatives? Answer: Even with only 1 million pretraining tokens, language model pretraining yields the most transferable representations. But transferring between related tasks does help.

  7. [Shi et al., 2016; Adi et al., 2017] Probing Models (figure, built up over slides 7-11)


  12. [Belinkov, 2018; Blevins et al., 2018; Tenney et al., 2019] Pairwise Probing (figure, built up over slides 12-19)
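The pairwise probing slides are figure-only. Following the cited probing literature, a pairwise probe scores a pair of tokens (e.g., does a dependency or coreference arc hold between them?) from their frozen representations. A minimal sketch, with random vectors standing in for real contextualizer outputs; the feature construction and dimensions here are illustrative, not the paper's exact recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random vectors stand in for the frozen contextual representations of a
# 10-token sentence (dim 8 is illustrative; real models use 512+ dims).
reps = rng.normal(size=(10, 8))

def pair_features(reps, head, dep):
    # One common construction: concatenate the two tokens' representations
    # together with their element-wise product.
    return np.concatenate([reps[head], reps[dep], reps[head] * reps[dep]])

x = pair_features(reps, 2, 5)
w = rng.normal(size=x.shape[0])        # an (untrained) linear probe
arc_prob = 1 / (1 + np.exp(-(w @ x)))  # e.g. the score for an arc 2 -> 5
print(x.shape, round(float(arc_prob), 3))
```

In the actual experiments, the probe weights would be trained on gold arcs while the representations stay frozen.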


  20. Probing Model Setup • Contextualizer weights are always frozen. • Results are from the highest-performing contextualizer layer. • We use a linear probing model.
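The setup above can be sketched end-to-end. This is a hedged illustration rather than the paper's code: random features stand in for a frozen contextualizer layer, labels are generated to be linearly recoverable, and the probe is a multinomial logistic regression trained by plain gradient descent:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the frozen outputs of one contextualizer layer: one 16-dim
# vector per token, with a synthetic 5-way tag per token. Labels are built
# to be linearly recoverable, mirroring the paper's main finding.
n_tokens, dim, n_tags = 500, 16, 5
true_w = rng.normal(size=(dim, n_tags))
feats = rng.normal(size=(n_tokens, dim))
labels = (feats @ true_w).argmax(axis=1)

# The probe: multinomial logistic regression trained by full-batch gradient
# descent. Only the probe's weights are updated; the "contextualizer"
# (here, the fixed feats array) is never touched.
w = np.zeros((dim, n_tags))
onehot = np.eye(n_tags)[labels]
for _ in range(200):
    logits = feats @ w
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    w -= 0.5 * feats.T @ (probs - onehot) / n_tokens

probe_acc = ((feats @ w).argmax(axis=1) == labels).mean()
print(f"linear probe accuracy: {probe_acc:.2f}")
```

Keeping the probe linear is the point of the design: any accuracy it achieves reflects information that is linearly recoverable from the representations, not capacity added by the probe itself.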

  21-24. Contextualizers Analyzed [Peters et al., 2018a,b; Radford et al., 2018; Devlin et al., 2019] (table, built up over slides 21-24):
  • ELMo: 2-layer LSTM (ELMo original), 4-layer LSTM (ELMo 4-layer), and 6-layer transformer (ELMo transformer); bidirectional language model (BiLM) pretraining on the 1B Word Benchmark.
  • OpenAI Transformer: 12-layer transformer; left-to-right language model pretraining on uncased BookCorpus.
  • BERT (cased): 12-layer transformer (BERT base) and 24-layer transformer (BERT large); masked language model pretraining on BookCorpus + Wikipedia.

  25. (1) Probing Contextual Representations Question: Is the information necessary for a variety of core NLP tasks linearly recoverable from contextual word representations? Answer: Yes, to a great extent! Tasks with lower performance may require fine-grained linguistic knowledge.

  26. Examined 17 Diverse Probing Tasks • Part-of-speech tagging • CCG supertagging • Syntactic chunking • Named entity recognition • Semantic tagging • Grammatical error detection • Preposition supersense disambiguation • Event factuality • Syntactic constituency ancestor tagging • Conjunct identification • Syntactic dependency arc prediction • Syntactic dependency arc classification • Semantic dependency arc prediction • Semantic dependency arc classification • Coreference arc prediction

  27. Linear Probing Models Rival Task-Specific Architectures (shown over the same 17 probing tasks as slide 26)

  28-31. CCG Supertagging (accuracy bar chart, built up over slides 28-31): GloVe 71.58; ELMo (original) 93.31; ELMo (transformer) 92.68; OpenAI Transformer 82.69; BERT (large) 94.28; SOTA 94.7.

  32-33. Event Factuality (bar chart, Pearson correlation r × 100, built up over slides 32-33): GloVe 49.70; the contextualizers and SOTA score between 70.88 and 77.10 (bar labels: 77.10, 76.25, 74.03, 73.20, 70.88).

  34. But Linear Probing Models Underperform on Some Tasks • Tasks on which a linear model over contextual word representations performs poorly may require more fine-grained linguistic knowledge. • In these cases, task-specific contextualization leads to especially large gains. See the paper for more details.
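A toy illustration of why task-trained contextualization can help (the paper trains e.g. additional recurrent layers on top of frozen representations; the construction below is invented for illustration): if a token's label depends on its neighbor, no per-token linear probe can recover it, but a probe that is allowed to combine adjacent tokens can.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: each token's label is determined by the NEXT token's frozen
# feature, so no per-token linear probe can recover it. (np.roll wraps the
# last token around to the first, keeping the pairing consistent.)
n, dim = 400, 4
reps = rng.normal(size=(n, dim))
labels = (np.roll(reps, -1, axis=0)[:, 0] > 0).astype(float)

def fit_linear(x, y, steps=300, lr=0.5):
    """Train a logistic-regression probe and return its accuracy on (x, y)."""
    w = np.zeros(x.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-np.clip(x @ w, -30, 30)))
        w -= lr * x.T @ (p - y) / len(y)
    return float((((x @ w) > 0).astype(float) == y).mean())

acc_token = fit_linear(reps, labels)  # per-token probe: near chance
# "Contextualized" probe: let the probe see each token plus its neighbor.
windowed = np.hstack([reps, np.roll(reps, -1, axis=0)])
acc_context = fit_linear(windowed, labels)  # near perfect
print(round(acc_token, 2), round(acc_context, 2))
```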

  35-38. Named Entity Recognition (F1 bar chart, built up over slides 35-38): GloVe 53.22; the four contextualizers score 58.14, 81.21, 82.85, and 84.44; SOTA 91.38.

  39. (2) How Does Transferability Vary? Question: How does transferability vary across contextualizer layers? Answer: The first layer is the most transferable in LSTMs; the middle layers are in transformers.

  40-44. Layerwise Patterns in Transferability (heatmaps of per-task probing performance by layer, built up over slides 40-44): LSTM-based contextualizers (ELMo original, ELMo 4-layer) and transformer-based contextualizers (ELMo transformer, OpenAI Transformer, BERT base cased, BERT large cased).
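The layer comparison behind these heatmaps can be mimicked in miniature: fit the same linear probe to each layer's representations and compare accuracies. Everything below is synthetic; the "layers" are constructed so that an intermediate one carries the most linearly recoverable signal, echoing the middle-layer pattern the slides report for transformers:

```python
import numpy as np

rng = np.random.default_rng(2)

# Three synthetic "layers" of frozen representations for the same tokens.
# Layer 1 is built to carry the cleanest task signal.
n, dim = 300, 8
signal = rng.normal(size=(n, dim))
labels = (signal[:, 0] > 0).astype(float)
layers = [
    rng.normal(size=(n, dim)),               # layer 0: pure noise
    signal,                                  # layer 1: clean task signal
    signal + 3 * rng.normal(size=(n, dim)),  # layer 2: heavily noised signal
]

def probe_accuracy(x, y, steps=300, lr=0.5):
    """Train a logistic-regression probe on x and return its accuracy."""
    w = np.zeros(x.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-np.clip(x @ w, -30, 30)))
        w -= lr * x.T @ (p - y) / len(y)
    return float((((x @ w) > 0).astype(float) == y).mean())

accs = [probe_accuracy(x, labels) for x in layers]
best_layer = int(np.argmax(accs))
print(best_layer, [round(a, 2) for a in accs])
```

Repeating this per task and per layer, then plotting accuracies on a grid, gives exactly the kind of heatmap shown on these slides.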

  45. (3) Why Does Transferability Vary? Question: Why does transferability vary across contextualizer layers? Answer: It depends on pretraining task-specificity!

  46. Layerwise Patterns Dictated by Perplexity: LSTM-based ELMo (original). Outputs of higher LSTM layers are better for language modeling (they have lower perplexity). Perplexity by layer: Layer 0: 7026; Layer 1: 920; Layer 2: 235.

  47. Layerwise Patterns Dictated by Perplexity: LSTM-based ELMo (4-layer). Outputs of higher LSTM layers are better for language modeling (lower perplexity). Perplexity by layer: Layer 0: 4204; Layer 1: 2398; Layer 2: 2363; Layer 3: 1013; Layer 4: 195.

  48. Layerwise Patterns Dictated by Perplexity: transformer-based ELMo (6-layer). Perplexity by layer: Layer 0: 546; Layer 1: 523; Layer 2: 448; Layer 3: 374; Layer 4: 314; Layer 5: 295; Layer 6: 91.
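The metric in these charts can be made concrete. Perplexity is the exponentiated average negative log-likelihood a language model assigns to held-out tokens; lower means better next-token prediction. A minimal check with made-up probabilities:

```python
import math

# Suppose a language model assigns these probabilities to four held-out
# tokens (numbers are made up for illustration).
token_probs = [0.10, 0.02, 0.30, 0.05]

avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_nll)
print(round(perplexity, 1))  # ~13.5: on average as uncertain as a
                             # uniform choice among ~13.5 tokens
```

Probing each layer's output with such a language-modeling head is how the per-layer bars above are obtained: layers whose outputs predict the next token well are the most language-model-specific, and (per slide 45) the least transferable.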
