SLIDE 1

Language is not language processing

Marten van Schijndel May 2020

Department of Linguistics, Cornell University

SLIDE 2

CL/NLP often aim to create models of language comprehension (NLI, parsing, information extraction, etc.).
Often, language models are trained on large amounts of text.
These are then the starting point for more complex models, or they are used for cognitive modeling.

SLIDE 3

Two potential problems

Model biases may not align with human comprehension biases
→ Models may not learn human comprehension during training

All language data comes from production, not comprehension (though annotations provide comprehension cues)
→ The comprehension signal may not be present in the produced data

In this talk, I explore these two possible problems with our current modeling paradigm.

SLIDE 4

Overview

Part 0: Background
Part 1: Magnitude probing
Part 2: World knowledge probing
Part 3: Production / comprehension mismatch

SLIDE 5

Part 0: Background

Neural networks have proven especially successful at finding linguistically accurate language processing solutions.

SLIDE 6

NNs are often trained on a word prediction task

SLIDE 8

Why word prediction?

We can measure how unexpected a word is with surprisal:

Surprisal(w_i) = −log P(w_i | w_1..i−1)   (1)

Shannon, 1948, Bell Systems Technical Journal; Hale, 2001, Proc. North American Assoc. Comp. Ling.; Levy, 2008, Cognition
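To make equation (1) concrete, here is a minimal Python sketch. The toy distribution is a placeholder for whatever supplies P(w_i | w_1..i−1); in the experiments below that role is played by an LSTM language model.

```python
import math

def next_word_probs(prefix):
    """Toy stand-in for a language model's next-word distribution.
    In practice this would come from an LSTM LM conditioned on the prefix."""
    if prefix == ("the", "horse"):
        return {"ran": 0.30, "galloped": 0.10, "raced": 0.02, "fell": 0.01}
    return {}

def surprisal(word, prefix):
    """Surprisal(w_i) = -log2 P(w_i | w_1..i-1), in bits."""
    p = next_word_probs(prefix).get(word, 1e-10)  # floor avoids log(0)
    return -math.log2(p)

# Rare continuations carry high surprisal:
print(surprisal("raced", ("the", "horse")))  # ~5.6 bits
print(surprisal("ran", ("the", "horse")))    # ~1.7 bits
```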

SLIDE 9

Why word prediction?

Surprisal indicates what the model finds unexpected/unnatural, which can then be mapped onto human behavioral and neural measurements:

  • acceptability/grammaticality
  • reading/reaction times
  • neural activation

  • So. Many. People.
SLIDE 10

This is kind of crazy!

We know frequency/predictability affect human language processing.
However, many plausible explanations of human responses involve experience beyond language statistics.
E.g., can language models learn intention from text alone? There may be some weak signal, but ...

SLIDE 11

Part 1: Magnitude probing

van Schijndel & Linzen, 2018, Proc. CogSci; van Schijndel & Linzen, in prep

SLIDE 12

Humans experience a visceral response upon encountering garden path constructions.
NNs model average statistics and therefore average frequency responses. Garden path responses exist in the tail.

SLIDE 13

They exist in the tail because:

1 the statistics are in the tail (predictability)

OR

2 the response is unusual (reanalysis)

SLIDE 14

The horse raced past the barn fell.

Bever, 1970, Cognition and the Development of Language

SLIDE 15

The horse that was raced past the barn fell.

Bever, 1970, Cognition and the Development of Language

SLIDE 16

[Figure: two parse trees for "The horse raced past the barn ...": the main-verb analysis (V raced) vs. the reduced-relative analysis (VBN raced)]

Bever, 1970, Cognition and the Development of Language

SLIDE 17

While accounts of human responses are framed in terms of explicit syntactic frequencies, RNNs can predict garden path responses without explicit syntactic training.

van Schijndel & Linzen, 2018, Proc. CogSci; Futrell et al., 2019, Proc. NAACL; Frank & Hoeks, 2019, Proc. CogSci

SLIDE 18

Do RNNs process garden paths similarly to humans?
Look beyond garden path existence to garden path magnitude.

SLIDE 19

Models

WikiRNN: Gulordava et al. (2018) LSTM
  Data: Wikipedia (80M words)
SoapRNN: 2-layer LSTM (same training parameters as above)
  Data: Corpus of American Soap Operas (80M words; Davies, 2011)

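As a rough sketch of this model class (not the authors' exact code; the sizes are illustrative), a 2-layer LSTM language model in PyTorch, with per-word surprisal read off its softmax:

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """2-layer LSTM language model of the general kind used here."""
    def __init__(self, vocab_size=50_000, emb_dim=650, hidden_dim=650):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids, hidden=None):
        output, hidden = self.lstm(self.embed(token_ids), hidden)
        return self.out(output), hidden  # next-word logits per position

# Per-word surprisal (nats) for a dummy token sequence:
model = LSTMLanguageModel()
ids = torch.randint(0, 50_000, (1, 6))        # stand-in for a real sentence
logits, _ = model(ids[:, :-1])                # predict word i+1 from prefix
log_probs = torch.log_softmax(logits, dim=-1)
surprisals = -log_probs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
```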

SLIDE 20

Three garden paths

NP/S: The woman saw (that) the doctor wore a hat.
NP/Z: When the woman visited(,) her nephew laughed loudly.
MV/RR: The horse (which was) raced past the barn fell.

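A sketch of how such stimuli can be scored (the `surprisal_fn` argument is an assumed wrapper around a trained LM, e.g. one built from the previous sketch): the garden path effect is the extra surprisal the ambiguous variant incurs at the disambiguating word.

```python
# (ambiguous, unambiguous, disambiguating word) for each construction
STIMULI = {
    "NP/S":  ("The woman saw the doctor wore a hat.",
              "The woman saw that the doctor wore a hat.", "wore"),
    "NP/Z":  ("When the woman visited her nephew laughed loudly.",
              "When the woman visited, her nephew laughed loudly.", "laughed"),
    "MV/RR": ("The horse raced past the barn fell.",
              "The horse which was raced past the barn fell.", "fell"),
}

def garden_path_effect(surprisal_fn, ambiguous, unambiguous, critical):
    """Surprisal at the disambiguating word in the ambiguous variant,
    minus its surprisal in the unambiguous variant."""
    return surprisal_fn(ambiguous, critical) - surprisal_fn(unambiguous, critical)

# for name, (amb, unamb, crit) in STIMULI.items():
#     print(name, garden_path_effect(lm_surprisal, amb, unamb, crit))
```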

SLIDE 21

Surprisal-to-ms conversion

RT(w_t) = α · S(w_t)   (2)

Smith & Levy, 2013, Cognition

SLIDE 22

Probability-to-ms Conversion

RT(w_i) = δ_0 · S(w_i) + δ_−1 · S(w_i−1) + δ_−2 · S(w_i−2) + δ_−3 · S(w_i−3)   (3)

Smith & Levy, 2013, Cognition
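Equation (3) transcribed into Python (a sketch; the surprisal values below are invented for illustration, while the δ values plugged in are Smith & Levy's estimates from the "Learned Surp-to-RT mapping" slide):

```python
def predicted_rt(deltas, surprisals, i):
    """RT(w_i) = d0*S(w_i) + d-1*S(w_i-1) + d-2*S(w_i-2) + d-3*S(w_i-3).
    deltas = (d0, d_minus1, d_minus2, d_minus3), in ms per bit of surprisal."""
    return sum(d * surprisals[i - k] for k, d in enumerate(deltas))

# Invented surprisals (bits) for "... the barn fell"; the disambiguating
# word sits at index 3, and the earlier words feed in as spillover.
surps = [2.0, 1.5, 3.0, 12.0]
print(predicted_rt((0.53, 1.53, 0.92, 0.84), surps, 3))  # surprisal-driven RT component, ~14.0
```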

SLIDE 23

Deriving the original mapping

Probabilities

  • Kneser-Ney trigram probabilities
  • Estimated from British National Corpus (100M words)

Reading Time Data (self-paced reading; ignoring eye-tracking)

  • Brown corpus
  • 35 participants
  • 5000 words / participant

Generalized Additive Mixed Model

  • mgcv package
  • Factors: text position, word length × log-frequency, participant

Smith & Levy, 2013, Cognition

SLIDE 24

Deriving the new mapping

Probabilities

  • LSTM LM probabilities
  • Estimated from Wikipedia/Soaps (80M words)

Reading Time Data (self-paced reading)

  • 80 simple sentences (fillers)
  • 224 participants
  • 1000 words / participant

Linear Mixed Model

  • lme4 package
  • Factors: text position, word length × log-frequency, participant, entropy, entropy reduction

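The slides describe an lme4 fit in R; a roughly analogous sketch in Python with statsmodels (the data file and column names are hypothetical) would be:

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per word per participant (hypothetical file and column names):
# rt, surprisal, text_pos, word_len, log_freq, participant
df = pd.read_csv("spr_fillers.csv")

# Lagged surprisal columns implement equation (3)'s spillover terms
for k in (1, 2, 3):
    df[f"surp_{k}"] = df.groupby("participant")["surprisal"].shift(k)

# Random intercepts by participant; word length x log-frequency interaction
model = smf.mixedlm(
    "rt ~ surprisal + surp_1 + surp_2 + surp_3 + text_pos + word_len * log_freq",
    data=df.dropna(), groups="participant")
print(model.fit().summary())
```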

SLIDE 25

Learned Surp-to-RT mapping

Smith & Levy, 2013:                    δ0 = 0.53    δ−1 = 1.53   δ−2 = 0.92   δ−3 = 0.84
WikiRNN (using Prasad & Linzen, 2019): (δ0 = 0.04)  δ−1 = 1.10   δ−2 = 0.37   δ−3 = 0.39
SoapRNN (using Prasad & Linzen, 2019): (δ0 = −0.04) δ−1 = 0.83   δ−2 = 0.91   δ−3 = 0.44

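One coarse way to compare these mappings (a sketch, not an analysis from the talk): sum each δ vector, giving the total ms contributed by one bit of surprisal once spillover has washed through. The RNN-derived mappings sum to noticeably less than Smith & Levy's, anticipating the underestimation discussed below.

```python
MAPPINGS = {
    "Smith & Levy 2013": (0.53, 1.53, 0.92, 0.84),
    "WikiRNN":           (0.04, 1.10, 0.37, 0.39),
    "SoapRNN":           (-0.04, 0.83, 0.91, 0.44),
}

for name, deltas in MAPPINGS.items():
    # total slowdown (ms) per bit of surprisal, summed over the word
    # itself and the three spillover positions
    print(f"{name}: {sum(deltas):.2f} ms per bit")
```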

SLIDE 26

RNN garden path prediction

[Figure: garden path effect as difference in reading times (ms) for (a) NP/S, (b) NP/Z, (c) MV/RR; series: WikiRNN surprisal, SoapRNN surprisal, humans]

SLIDE 27

Instead of region response, examine word-by-word response

SLIDE 28

Word-by-word garden path prediction

[Figure: word-by-word difference in reading times (ms) by region position (1, 2) for (a) NP/S, (b) NP/Z, (c) MV/RR; series: WikiRNN, SoapRNN, humans]

SLIDE 29

Do RNNs garden path in a reasonable way?

SLIDE 30

Parts-of-speech predictions

SLIDE 31

Conclusion

  • Conversion rates are relatively similar, but all underestimate the human effect
  • Suggests human processing involves mechanisms outside occurrence statistics

(We will come back to this in Part 3)

SLIDE 32

But how well can human responses be explained by text statistics?
We know that RNNs track syntactic and semantic statistics. What about event representations?

SLIDE 33

Part 2: World knowledge probing

Davis & van Schijndel, 2020, Proc. CogSci; Davis & van Schijndel, 2020, Proc. CUNY

SLIDE 34

(1) a. Context: Several horses were being raced.
    b. Target: The horse raced past the barn fell.

Knowledge of the situation mitigates the garden path.

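In model terms, the mitigation can be measured as the drop in the target's total surprisal when the supporting context precedes it. A minimal sketch, assuming a `sentence_surprisal` helper that sums a trained LM's per-word surprisals over a string:

```python
def context_mitigation(sentence_surprisal, context, target):
    """How much does the context reduce the target's total surprisal?
    By the chain rule, subtracting the context's own surprisal leaves
    the surprisal of the target words given the context."""
    without_ctx = sentence_surprisal(target)
    with_ctx = (sentence_surprisal(context + " " + target)
                - sentence_surprisal(context))
    return without_ctx - with_ctx

# context_mitigation(lm_total_surprisal,
#                    "Several horses were being raced.",
#                    "The horse raced past the barn fell.")
```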

SLIDE 35

Context: One knight exists

Spivey-Knowlton et al., 1993, Canadian Journal of Experimental Psychology

SLIDE 37

Context: Two knights exist

Spivey-Knowlton et al., 1993, Canadian Journal of Experimental Psychology

SLIDE 39

(2) a. Context
       (i) 1NP: A knight and his squire were attacking a dragon. With its breath of fire, the dragon killed the knight but not the squire.
       (ii) 2NP: Two knights were attacking a dragon. With its breath of fire, the dragon killed one of the knights but not the other.
    b. Target
       (i) Reduced: The knight killed by the dragon fell to the ground with a thud.
       (ii) Unreduced: The knight who was killed by the dragon fell to the ground with a thud.

Spivey-Knowlton et al., 1993, Canadian Journal of Experimental Psychology

SLIDE 40
  • Models: 5 LSTMs trained with shuffled context and 5 similar models trained with intact context, each with different random seeds on 80M tokens of Wikipedia
  • Test data: Spivey-Knowlton et al. (1993); Trueswell & Tanenhaus (1991)
  • Measure: we sum the surprisal of verb+by (see the sketch below)

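A sketch of the verb+by measure, assuming a `word_surprisal(prefix_words, word)` helper like the one sketched earlier:

```python
def region_surprisal(word_surprisal, prefix, region):
    """Sum per-word surprisals over a multi-word critical region,
    here the disambiguating participle plus 'by'."""
    words = prefix.split()
    total = 0.0
    for w in region.split():
        total += word_surprisal(tuple(words), w)
        words.append(w)
    return total

# e.g. region_surprisal(lm_surprisal, one_np_context + " The knight", "killed by")
```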

SLIDE 41

All models predict garden path effect

SLIDE 42

Reference mitigates garden path

SLIDE 43

In humans, temporal context also mitigates garden paths

Trueswell & Tanenhaus, 1991, Language and Cognitive Processes

SLIDE 44

(3) a. Context
       (i) Past: Several students were sitting together taking an exam in a large lecture hall earlier today. A proctor noticed one of the students cheating.
       (ii) Future: Several students will be sitting together taking an exam in a large lecture hall later today. A proctor will notice one of the students cheating.
    b. Target
       (i) Reduced: The student spotted by the proctor received/will receive a warning.
       (ii) Unreduced: The student who was spotted by the proctor received/will receive a warning.

Trueswell & Tanenhaus, 1991, Language and Cognitive Processes

SLIDE 45

Temporal context mitigates garden path

SLIDE 46

RNNs predict larger temporal mitigation

SLIDE 47

Conclusions

  • Models learn tense information robustly
  • Referential context and definiteness are less robust
  • RNNs learn enough about discourse to mitigate garden paths (only when trained with intact discourse)
  • Event knowledge is encoded in text. Understandable since we talk about the world, but still crazy

SLIDE 48

The problem with garden paths: the human response correlates with the occurrence statistics.
Is there a case where the learned occurrence statistics don't reflect the observed response?

SLIDE 49

Part 3: Production / comprehension mismatch

Davis & van Schijndel, 2020, Proc. ACL

SLIDE 50

RNNs have an observed recency bias.
Idea: Maybe that prevents them from learning known human biases. Recency confounds attachment height.

Ravfogel et al., 2019, Proc. NAACL

SLIDE 51

Relative clause attachment

(4) a. Andrew had dinner yesterday with the nephew of the teachers that was divorced.
    b. Andrew had dinner yesterday with the nephews of the teacher that was divorced.

Fernández, 2003, Bilingual Sentence Processing
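One common way to probe this with an LM (a sketch, not necessarily the paper's exact procedure): compare the model's probability for the agreement form that forces HIGH vs. LOW attachment after the ambiguous prefix.

```python
import math

def attachment_preference(next_word_probs, prefix, high_verb, low_verb):
    """Positive: the model prefers the verb form agreeing with the HIGH
    (first) noun. Negative: it prefers LOW (most recent) attachment."""
    probs = next_word_probs(prefix)
    return math.log(probs[high_verb]) - math.log(probs[low_verb])

# prefix = "Andrew had dinner yesterday with the nephew of the teachers that"
# "was" agrees with "nephew" (HIGH); "were" agrees with "teachers" (LOW):
# attachment_preference(lm_probs, prefix, "was", "were")
```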

SLIDE 52

Humans attach LOW/LOCAL/RECENT

Fernández, 2003, Bilingual Sentence Processing

SLIDE 53
  • Models: 5 Gulordava et al. (2018) LSTMs trained with different random seeds
  • Test data: Fernández (2003); Carreiras & Clifton Jr. (1993); POS templates

SLIDE 54

English models attach LOW/LOCAL/RECENT

SLIDE 55

English attachment is unusual

[Table: Local vs. Non-local attachment preferences across 20 languages: Afrikaans, Arabic, B. Portuguese, Croatian, Danish, Dutch, English, French, German, Greek, Italian, Japanese, Norwegian, Persian, Polish, Romanian, Russian, Spanish, Swedish, Thai]

Brysbaert & Mitchell, 1996/2008, Quarterly Journal of Experimental Psychology Section A

SLIDE 56
  • Models: 5 Gulordava et al. (2018) LSTMs trained with different random seeds on 80M tokens of Spanish Wikipedia
  • Test data: Fernández (2003); Carreiras & Clifton Jr. (1993); POS templates

SLIDE 57

Spanish models attach LOW/LOCAL/RECENT

SLIDE 58

Maybe the recency bias prevents them from learning HIGH attachment?
Experiment: Manipulate the attachment preference in a synthetic training corpus.

SLIDE 59

Synthetic training corpus

(5) a. D N (P D N) (Aux) V (D N) (P D N)
    b. D N Aux V D N 'of' D N 'that' 'was/were' V

(6) a. The nephew near the children was seen by the players next to the lawyer.
    b. The gymnast has met the hostage of the women that was eating.

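A sketch of how sentences matching template (5b) can be generated (the word lists are placeholders, not the paper's lexicon); number agreement on 'was/were' forces HIGH or LOW attachment unambiguously:

```python
import random

SG_NOUNS = ["nephew", "gymnast", "hostage", "teacher"]
PL_NOUNS = ["women", "children", "players", "lawyers"]
PARTICIPLES = ["seen", "met", "eating", "divorced"]

def rc_sentence(high_attach: bool) -> str:
    """Template (5b): D N Aux V D N 'of' D N 'that' 'was/were' V.
    A singular first noun and plural second noun make number agreement
    on the auxiliary disambiguate the attachment site."""
    subj, n1 = random.sample(SG_NOUNS, 2)
    n2 = random.choice(PL_NOUNS)
    aux = "was" if high_attach else "were"
    return (f"the {subj} has met the {n1} of the {n2} "
            f"that {aux} {random.choice(PARTICIPLES)}")

print(rc_sentence(high_attach=True))
# e.g. "the gymnast has met the hostage of the women that was eating"
```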

SLIDE 60
  • Models: 5 2-layer unidirectional LSTMs trained with different random seeds
  • Training data: Synthetic corpus
  • Test data: 300 ambiguous synthetic RCs

SLIDE 61

HIGH attachment is easy to learn!

1. Training: All RCs attach HIGH unambiguously; vary the number of RCs
   Result: 20/120k produces a HIGH bias at test
2. Training: 10% of the data has unambiguous RCs; vary the HIGH proportion
   Result: ≥ 50% HIGH produces a HIGH bias at test

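A sketch of how the two manipulations can be realized, reusing `rc_sentence` and the word lists from the previous sketch (the filler template and the function's interface are my assumptions):

```python
import random

def filler_sentence() -> str:
    # Placeholder RC-free filler loosely matching template (5a)
    return f"the {random.choice(SG_NOUNS)} has met the {random.choice(PL_NOUNS)}"

def build_corpus(n_total: int, rc_fraction: float, high_fraction: float):
    """Make rc_fraction of the sentences unambiguous RCs, of which
    high_fraction attach HIGH (the rest attach LOW); pad with fillers."""
    n_rc = int(n_total * rc_fraction)
    n_high = int(n_rc * high_fraction)
    corpus = [rc_sentence(True) for _ in range(n_high)]
    corpus += [rc_sentence(False) for _ in range(n_rc - n_high)]
    corpus += [filler_sentence() for _ in range(n_total - n_rc)]
    random.shuffle(corpus)
    return corpus

# Experiment 2: 10% unambiguous RCs, varying the HIGH share
# train_lm(build_corpus(120_000, rc_fraction=0.10, high_fraction=0.50))
```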

SLIDE 62

If HIGH is easy to learn, why don’t the Spanish models learn it?

SLIDE 63

What proportion of Spanish data is HIGH?

  • Wikipedia: LOW is 69% more common
  • Newswire (AnCora; UD): LOW is 21% more common

Note that it's still possible they contain a HIGH bias, just not in RCs.

Scheepers, 2003, Cognition
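Counts like these can be approximated from a UD treebank; a crude sketch with the `conllu` package (the attachment heuristic is my approximation, not the paper's extraction procedure):

```python
from collections import Counter
from conllu import parse_incr

def rc_attachment_counts(conllu_path):
    """Crude heuristic over a UD treebank: an 'acl' relative clause whose
    head noun is itself an 'nmod' dependent counts as LOW (attached to
    the more recent noun); one whose head noun has an intervening 'nmod'
    dependent counts as HIGH. Other configurations are ignored."""
    counts = Counter()
    with open(conllu_path, encoding="utf-8") as f:
        for sent in parse_incr(f):
            toks = [t for t in sent if isinstance(t["id"], int)]
            by_id = {t["id"]: t for t in toks}
            for tok in toks:
                if not str(tok["deprel"]).startswith("acl"):
                    continue
                head = by_id.get(tok["head"])
                if head is None:
                    continue
                if head["deprel"] == "nmod":
                    counts["LOW"] += 1
                elif any(t["head"] == head["id"] and t["deprel"] == "nmod"
                         and head["id"] < t["id"] < tok["id"] for t in toks):
                    counts["HIGH"] += 1
    return counts

# rc_attachment_counts("es_ancora-ud-train.conllu")  # e.g. the AnCora treebank
```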

SLIDE 64

Conclusion

  • Supports the idea that production and comprehension have different distributions (e.g., Kehler & Rohde, 2015, 2018)
  • RNNs won't learn human comprehension from text alone
  • Provides an explanation for why increasing training data ceases to help
  • Provides an explanation for why training on cognitive signals improves model accuracy (Klerke et al., 2016; Barrett et al., 2018)

SLIDE 65

Thanks!

Presentations at CUNY 2020, CogSci 2020, and ACL 2020!

CUNY 2020: Recurrent neural networks use discourse context in human-like garden path alleviation
CogSci 2020: Interaction with context during recurrent neural network sentence processing
ACL 2020: Recurrent neural network language models always learn English-like relative clause attachment
