Modeling Human Reading with Neural Attention


  1. Modeling Human Reading with Neural Attention. Michael Hahn (Stanford University, mhahn2@stanford.edu) and Frank Keller (University of Edinburgh, keller@inf.ed.ac.uk). EMNLP 2016. 1 / 49

  2. Eye Movements in Human Reading The two young sea-lions took not the slightest interest in our arrival. They were playing on the jetty, rolling over and tumbling into the water together, entirely ignoring the human beings edging awkwardly round (adapted from the Dundee corpus [Kennedy and Pynte, 2005]) 2 / 49

  3. Eye Movements in Human Reading The two young sea-lions took not the slightest interest in our arrival. They were playing on the jetty, rolling over and tumbling into the water together, entirely ignoring the human beings edging awkwardly round (adapted from the Dundee corpus [Kennedy and Pynte, 2005]) ◮ Fixations: the eyes are static ◮ Saccades take 20–40 ms; no information is obtained from the text 3 / 49

  4. Eye Movements in Human Reading The two young sea-lions took not the slightest interest in our arrival. They were playing on the jetty, rolling over and tumbling into the water together, entirely ignoring the human beings edging awkwardly round (adapted from the Dundee corpus [Kennedy and Pynte, 2005]) ◮ Fixations: the eyes are static ◮ Saccades take 20–40 ms; no information is obtained from the text ◮ Fixation times vary from ≈ 100 ms to ≈ 300 ms 4 / 49

  5. Eye Movements in Human Reading The two young sea-lions took not the slightest interest in our arrival. They were playing on the jetty, rolling over and tumbling into the water together, entirely ignoring the human beings edging awkwardly round (adapted from the Dundee corpus [Kennedy and Pynte, 2005]) ◮ Fixations: the eyes are static ◮ Saccades take 20–40 ms; no information is obtained from the text ◮ Fixation times vary from ≈ 100 ms to ≈ 300 ms 5 / 49

  6. Eye Movements in Human Reading The two young sea-lions took not the slightest interest in our arrival. They were playing on the jetty, rolling over and tumbling into the water together, entirely ignoring the human beings edging awkwardly round (adapted from the Dundee corpus [Kennedy and Pynte, 2005]) ◮ Fixations: the eyes are static ◮ Saccades take 20–40 ms; no information is obtained from the text ◮ Fixation times vary from ≈ 100 ms to ≈ 300 ms ◮ ≈ 40% of words are skipped 6 / 49

  7. Computational Models I 1. models of saccade generation in cognitive psychology ◮ EZ-Reader [Reichle et al., 1998, 2003, 2009] ◮ SWIFT [Engbert et al., 2002, 2005] ◮ Bayesian inference [Bicknell and Levy, 2010] 2. machine learning models trained on eye-tracking data [Nilsson and Nivre, 2009, 2010, Hara et al., 2012, Matthies and Søgaard, 2013] 7 / 49

  8. Computational Models I 1. models of saccade generation in cognitive psychology ◮ EZ-Reader [Reichle et al., 1998, 2003, 2009] ◮ SWIFT [Engbert et al., 2002, 2005] ◮ Bayesian inference [Bicknell and Levy, 2010] 2. machine learning models trained on eye-tracking data [Nilsson and Nivre, 2009, 2010, Hara et al., 2012, Matthies and Søgaard, 2013] These models... ◮ involve theoretical assumptions about human eye-movements, or ◮ require selection of relevant eye-movement features, and ◮ estimate parameters from eye-tracking corpora 8 / 49

  9. Computational Models II: Surprisal $\mathrm{Surprisal}(w_i \mid w_{1 \ldots i-1}) = -\log P(w_i \mid w_{1 \ldots i-1})$ (1) ◮ measures predictability of a word in context ◮ computed by a language model 9 / 49

  10. Computational Models II: Surprisal $\mathrm{Surprisal}(w_i \mid w_{1 \ldots i-1}) = -\log P(w_i \mid w_{1 \ldots i-1})$ (1) ◮ measures predictability of a word in context ◮ computed by a language model ◮ correlates with word-by-word reading times [Hale, 2001, McDonald and Shillcock, 2003a,b, Levy, 2008, Demberg and Keller, 2008, Frank and Bod, 2011, Smith and Levy, 2013] ◮ but cannot explain... ◮ reverse saccades ◮ re-fixations ◮ spillover ◮ skipping (≈ 40% of words are skipped) 10 / 49
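
As a concrete illustration of equation (1), the sketch below computes surprisal under a smoothed bigram language model built from a toy corpus. The bigram model, the add-alpha smoothing, and all names are stand-ins chosen for brevity; the models discussed in the talk compute this probability with neural language models.

```python
# A minimal sketch of Surprisal(w_i | w_1..i-1) = -log P(w_i | w_1..i-1).
# The bigram model with add-alpha smoothing is only illustrative; the talk's
# models compute this probability with a recurrent neural language model.
import math
from collections import Counter

corpus = ("the two young sea-lions took not the slightest interest "
          "in our arrival").split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def surprisal(word, prev, alpha=1.0):
    """Surprisal in bits: -log2 P(word | prev) under the smoothed bigram model."""
    vocab = len(unigrams)
    p = (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab)
    return -math.log2(p)

print(surprisal("slightest", "the"))  # attested bigram: lower surprisal
print(surprisal("arrival", "the"))    # unattested bigram: higher surprisal
```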

  11. Tradeoff Hypothesis Goal Build unsupervised models jointly accounting for reading times and skipping 11 / 49

  12. Tradeoff Hypothesis Goal Build unsupervised models jointly accounting for reading times and skipping ◮ reading is a recent innovation in evolutionary terms ◮ humans learn it without access to other people’s eye movements 12 / 49

  13. Tradeoff Hypothesis Goal Build unsupervised models jointly accounting for reading times and skipping ◮ reading is a recent innovation in evolutionary terms ◮ humans learn it without access to other people’s eye movements Hypothesis Human reading optimizes a tradeoff between ◮ Precision of language understanding: Encode the input so that it can be reconstructed accurately ◮ Economy of attention: Fixate as few words as possible 13 / 49

  14. Tradeoff Hypothesis Approach: NEAT (NEural Attention Tradeoff) 1. develop a generic architecture integrating ◮ neural language modeling ◮ an attention mechanism 2. train end-to-end to optimize the tradeoff between precision and economy 3. evaluate on a human eye-tracking corpus 14 / 49

  15. Architecture I: Recurrent Autoencoder [Figure: the Reader (states R0–R3) encodes the input words w1 w2 w3 starting from the symbol '$'; the Decoder (states D0–D3) reconstructs w1 w2 w3] 15 / 49
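
To make the reader/decoder picture concrete, here is a forward-pass sketch of a recurrent autoencoder over a toy vocabulary. Plain tanh RNN cells, the shapes, and the variable names are illustrative assumptions standing in for the LSTM reader and decoder of the actual model.

```python
# Forward pass of a toy recurrent autoencoder: the reader encodes w1 w2 w3 into
# its final state; the decoder, initialized with that state, reconstructs the
# words one by one. tanh-RNN cells stand in for the model's LSTMs.
import numpy as np

rng = np.random.default_rng(0)
EMB, HIDDEN, VOCAB = 8, 16, 20

E   = rng.normal(scale=0.1, size=(VOCAB, EMB))             # word embeddings
W_r = rng.normal(scale=0.1, size=(HIDDEN, HIDDEN + EMB))   # reader recurrence
W_d = rng.normal(scale=0.1, size=(HIDDEN, HIDDEN + EMB))   # decoder recurrence
W_o = rng.normal(scale=0.1, size=(VOCAB, HIDDEN))          # decoder output layer

def step(W, h, x):
    return np.tanh(W @ np.concatenate([h, x]))

words = [3, 7, 12]                     # w1 w2 w3 as vocabulary indices

# Reader: R_0 -> R_1 -> R_2 -> R_3
h = np.zeros(HIDDEN)
for w in words:
    h = step(W_r, h, E[w])

# Decoder: starts from R_3 and re-predicts w1 w2 w3 (teacher forcing)
d, prev = h, np.zeros(EMB)             # the zero vector plays the role of '$'
for w in words:
    d = step(W_d, d, prev)
    logits = W_o @ d
    probs = np.exp(logits - logits.max()); probs /= probs.sum()
    print(f"reconstruction loss for word {w}: {-np.log(probs[w]):.3f}")
    prev = E[w]
```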

  16. Architecture II: Real-Time Predictions [Figure: Reader states R0–R3 process w1 w2 w3, followed by the Decoder] 16 / 49

  17. Architecture II: Real-Time Predictions [Figure: Reader states R0–R3 process w1 w2 w3, followed by the Decoder] ◮ Humans constantly make predictions about the upcoming input 17 / 49

  18. Architecture II: Real-Time Predictions [Figure: Reader states R0–R3 process w1 w2 w3 and output prediction distributions P_R1 P_R2 P_R3, followed by the Decoder] ◮ Humans constantly make predictions about the upcoming input ◮ The Reader outputs a probability distribution P_R over the lexicon at each time step ◮ P_R describes which words are likely to come next 18 / 49
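
The reader's real-time prediction is simply a distribution over the vocabulary computed from its current state. The sketch below shows this with a single softmax layer; the layer, shapes, and names are assumptions, since the slide only states that P_R is output at every step.

```python
# Sketch of the reader's prediction P_R at one time step: a softmax over the
# lexicon computed from the current reader state. Shapes are illustrative; the
# actual reader is a one-layer LSTM with 1,000 memory cells.
import numpy as np

rng = np.random.default_rng(1)
HIDDEN, VOCAB = 16, 20

W_pred = rng.normal(scale=0.1, size=(VOCAB, HIDDEN))

def P_R(reader_state):
    """Distribution over the vocabulary: which words are likely to come next."""
    logits = W_pred @ reader_state
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

dist = P_R(rng.normal(size=HIDDEN))
print(dist.sum(), dist.argmax())   # sums to 1.0; index of the most likely next word
```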

  19. Architecture III: Skipping [Figure: an attention module A gates each word w1 w2 w3 before it reaches the Reader states R0–R3, which output P_R1 P_R2 P_R3, followed by the Decoder] ◮ The attention module shows a word to R or skips it 19 / 49

  20. Architecture III: Skipping [Figure: an attention module A gates each word w1 w2 w3 before it reaches the Reader states R0–R3, which output P_R1 P_R2 P_R3, followed by the Decoder] ◮ The attention module shows a word to R or skips it ◮ A computes a probability and draws a sample ω ∈ { READ , SKIP } ◮ R receives a special ‘SKIPPED’ vector when a word is skipped 20 / 49
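
A minimal sketch of the skipping decision follows: the attention module computes a fixation probability from the current reader state and the upcoming word, samples ω, and passes the reader either the word embedding or a special SKIPPED vector. The logistic form of A and all shapes are assumptions made for illustration.

```python
# Sketch of the attention module A: compute P(READ), sample omega, and feed the
# reader either the word embedding or a special SKIPPED vector. The logistic
# attention network and the shapes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
HIDDEN, EMB = 16, 8

SKIPPED = np.zeros(EMB)                              # special 'SKIPPED' input vector
w_att = rng.normal(scale=0.1, size=HIDDEN + EMB)     # attention network weights

def attend(reader_state, word_emb):
    """Sample omega in {READ, SKIP}; return (input to the reader, omega)."""
    p_read = 1.0 / (1.0 + np.exp(-w_att @ np.concatenate([reader_state, word_emb])))
    omega = rng.random() < p_read
    return (word_emb if omega else SKIPPED), omega

reader_input, fixated = attend(rng.normal(size=HIDDEN), rng.normal(size=EMB))
print("READ" if fixated else "SKIP")
```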

  21. Implementing the Tradeoff Hypothesis Training Objective: Solve prediction and reconstruction with minimal attention: $\arg\min_{\theta} \; \mathbb{E}_{w,\omega}\big[\, L(\omega \mid w, \theta) + \alpha \cdot \|\omega\|_{\ell_1} \,\big]$, where $L(\omega \mid w, \theta)$ is the loss on prediction + reconstruction and $\|\omega\|_{\ell_1}$ is the number of fixated words 21 / 49

  22. Implementing the Tradeoff Hypothesis Training Objective: Solve prediction and reconstruction with minimal attention: $\arg\min_{\theta} \; \mathbb{E}_{w,\omega}\big[\, L(\omega \mid w, \theta) + \alpha \cdot \|\omega\|_{\ell_1} \,\big]$, where $L(\omega \mid w, \theta)$ is the loss on prediction + reconstruction and $\|\omega\|_{\ell_1}$ is the number of fixated words ◮ w is a word sequence drawn from the corpus ◮ ω is sampled from the attention module A ◮ α > 0: encourages NEAT to attend to as few words as possible 22 / 49
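
For a single sampled attention sequence the objective is just the prediction + reconstruction loss plus α times the number of fixated words. A tiny sketch, with a placeholder loss value, is below.

```python
# The per-sequence tradeoff objective: L(omega | w, theta) + alpha * ||omega||_1,
# where ||omega||_1 simply counts fixated words. The loss value is a placeholder.
import numpy as np

def tradeoff_objective(pred_recon_loss, omega, alpha=0.1):
    return pred_recon_loss + alpha * float(np.sum(omega))

omega = np.array([1, 0, 1, 1, 0])            # READ = 1, SKIP = 0 for five words
print(tradeoff_objective(3.2, omega))        # 3.2 + 0.1 * 3 = 3.5
```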

  23. Implementation and Training ◮ Implementation ◮ one-layer LSTM network with 1,000 memory cells ◮ attention network: one-layer feedforward network ◮ optimized by SGD and the REINFORCE policy gradient method [Williams, 1992] ◮ trained on a corpus of news text [Hermann et al., 2015] ◮ 195,462 articles from the Daily Mail ◮ ≈ 200 million tokens ◮ input data split into sequences of 50 tokens 23 / 49
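
Because the READ/SKIP samples are discrete, the attention decisions cannot be trained by plain backpropagation; this is why the slide cites the REINFORCE policy gradient [Williams, 1992]. Below is a one-sample sketch of that estimator for a simplified logistic attention policy; the Bernoulli policy, the feature vectors, and the toy loss function are assumptions, not the paper's implementation.

```python
# One-sample REINFORCE estimate of d/dtheta E[L + alpha*||omega||_1] for a
# logistic (Bernoulli) attention policy. The gradient is loss * score, where the
# score function for independent Bernoullis is sum_i (omega_i - p_i) * x_i.
import numpy as np

rng = np.random.default_rng(3)

def reinforce_step(features, theta, loss_fn, alpha=0.1, lr=0.01):
    p_read = 1.0 / (1.0 + np.exp(-(features @ theta)))      # P(READ) per word
    omega = (rng.random(p_read.shape) < p_read).astype(float)

    total_loss = loss_fn(omega) + alpha * omega.sum()        # tradeoff objective
    score = features.T @ (omega - p_read)                    # d/dtheta log p(omega)
    theta_new = theta - lr * total_loss * score              # SGD step (minimization)
    return theta_new, total_loss

features = rng.normal(size=(5, 4))          # 5 words, 4 features each (illustrative)
theta = np.zeros(4)
# toy loss: reconstruction gets easier the more words are fixated
theta, loss = reinforce_step(features, theta, loss_fn=lambda om: 3.0 - 0.5 * om.sum())
print(loss, theta)
```

In the full model the differentiable reader and decoder parameters would presumably receive ordinary backpropagated gradients, with REINFORCE covering only the discrete attention decisions.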

  24. NEAT as a Model of Reading ◮ Attention module models fixations and skips ◮ NEAT surprisal models reading times of fixated words [Figure: attention module A gates w1 w2 w3 before the Reader states R0–R3, which output P_R1 P_R2 P_R3, followed by the Decoder] 24 / 49

  25. NEAT as a Model of Reading ◮ Attention module models fixations and skips ◮ NEAT surprisal models reading times of fixated words [Figure: attention module A gates w1 w2 w3 before the Reader states R0–R3, which output P_R1 P_R2 P_R3, followed by the Decoder] 25 / 49

  26. NEAT as a Model of Reading ◮ Attention module models fixations and skips ◮ NEAT surprisal models reading times of fixated words [Figure: attention module A gates w1 w2 w3 before the Reader states R0–R3, which output P_R1 P_R2 P_R3, followed by the Decoder] The only ingredients are ◮ the architecture ◮ the objective ◮ an unlabeled corpus. No eye-tracking data, lexicon, grammar, ... needed. 26 / 49

  27. Evaluation Setup ◮ English section of the Dundee corpus [Kennedy and Pynte, 2005] ◮ 20 texts from The Independent ◮ annotated with eye-movement data from ten English native speakers, who were asked to answer questions after each text ◮ split into a development set (texts 1–3) and a test set (texts 4–20) ◮ size: 78,300 tokens (dev); 281,911 tokens (test) ◮ words at the beginning or end of lines, outliers, cases of track loss, and out-of-vocabulary words are excluded from the evaluation ◮ fixation rate: 62.1% (dev), 61.3% (test) 27 / 49

  28. Intrinsic Evaluation: Prediction and Reconstruction
                          Perplexity                       Fix. Rate
                          Prediction    Reconstruction
      NEAT                    180             4.5            60.4%
      ω ∼ Bin(0.62)           333            56              62.1%
      Word Length             230            40              62.1%
      Word Freq.              219            39              62.1%
      Full Surprisal          211            34              62.1%
      Human                   218            39              61.3%
      ω ≡ 1                   107             1.6           100%
      ◮ For Word Length, Word Frequency, and Full Surprisal, we take threshold predictions matching the fixation rate of the development set. 28 / 49

  29. Intrinsic Evaluation: Prediction and Reconstruction
                          Perplexity                       Fix. Rate
                          Prediction    Reconstruction
      NEAT                    180             4.5            60.4%
      ω ∼ Bin(0.62)           333            56              62.1%
      Word Length             230            40              62.1%
      Word Freq.              219            39              62.1%
      Full Surprisal          211            34              62.1%
      Human                   218            39              61.3%
      ω ≡ 1                   107             1.6           100%
      ◮ For Word Length, Word Frequency, and Full Surprisal, we take threshold predictions matching the fixation rate of the development set. 29 / 49

  30. Evaluating Reading Times: Linear Mixed Models $\text{FirstPassDuration} = \beta_0 + \sum_{i \in \text{Predictors}} \beta_i x_i + \sum_{j \in \text{RandomEffects}} \gamma_j y_j + \varepsilon$ 30 / 49
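
As a hedged illustration of fitting such a model, here is a statsmodels sketch on synthetic data. The predictor names, the by-subject random intercept, and the simulated durations are assumptions standing in for the Dundee measures and the talk's actual predictor set.

```python
# Sketch of a linear mixed model for first-pass durations with a by-subject
# random intercept. Data are synthetic; predictor names are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
data = pd.DataFrame({
    "subject": rng.integers(0, 10, size=n).astype(str),   # random effect: reader
    "surprisal": rng.gamma(2.0, 2.0, size=n),
    "word_length": rng.integers(1, 12, size=n),
    "log_freq": rng.normal(size=n),
})
# simulate durations with surprisal and length effects plus noise
data["first_pass"] = (200 + 8 * data["surprisal"] + 5 * data["word_length"]
                      + rng.normal(scale=30, size=n))

model = smf.mixedlm("first_pass ~ surprisal + word_length + log_freq",
                    data, groups=data["subject"])
print(model.fit().summary())
```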
