Using Machine Learning to Study the Neural Representations of - - PowerPoint PPT Presentation



SLIDE 1

Using Machine Learning to Study the Neural Representations of Language Meanings

Tom M. Mitchell

Carnegie Mellon University June 2017

SLIDE 2

How does neural activity encode word meanings?

SLIDE 3

How does neural activity encode word meanings? How does the brain combine word meanings into sentence meanings?

SLIDE 4

Neurosemantics Research Team

Research scientists and recent/current PhD students: Tom Mitchell, Marcel Just, Mark Palatucci, Mariya Toneva, Leila Wehbe, Kai-Min Chang, Alona Fyshe, Gustavo Sudre, Dan Schwartz, Nicole Rafidi, Erika Laing, Dan Howarth

Funding: NSF, NIH, IARPA, Keck

SLIDE 5

Functional MRI

SLIDE 6

Typical stimuli

SLIDE 7

Three panels: fMRI activation for “bottle”; mean activation averaged over 60 different stimuli; “bottle” minus mean activation. (Color scale: high / average / below average.)

SLIDE 8

Classifiers trained to decode the stimulus word

fMRI activity → trained classifier → “hammer” or “bottle” (the classifier acts as a virtual sensor of mental state).

(SVM, logistic regression, deep net, Bayesian classifier, ...)

SLIDE 9

Classification task: is person viewing a “tool” or “building”?

[Bar chart: classification accuracy (0–1) for each participant p1–p12, with the threshold for statistical significance (p < 0.05) marked.]
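A minimal sketch of such a decoding classifier, using synthetic data in place of real fMRI images (voxel counts, trial counts, and the logistic-regression choice are illustrative; the slides also mention SVMs, deep nets, and Bayesian classifiers):

```python
# Sketch: train a classifier as a "virtual sensor" of mental state.
# Synthetic voxel data stands in for real fMRI trials.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_trials, n_voxels = 120, 500

# Simulate two stimulus classes (0 = tool, 1 = building) whose mean
# voxel activation differs slightly, buried in unit-variance noise.
labels = rng.integers(0, 2, n_trials)
signal = np.where(labels[:, None] == 1, 0.5, -0.5)
X = signal * rng.random(n_voxels) + rng.normal(size=(n_trials, n_voxels))

# Cross-validated decoding accuracy; chance level is 0.5.
acc = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=5).mean()
```

A per-participant accuracy bar (as in the chart above) is just this number computed on each participant's trials separately.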

SLIDE 10

Are neural representations similar across people?

Can we train classifiers on one group of people, then decode from a new person?

SLIDE 11

Are representations similar across people?

YES: classifiers trained on other people can identify which of 60 items a new person is viewing (measured by rank accuracy).
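Rank accuracy, the metric named above, can be sketched as follows (the similarity scores here are hypothetical; 1.0 means the true item ranked first among all candidates, and chance is about 0.5):

```python
# Sketch of rank accuracy: rank all candidate items by similarity to the
# observed brain activity, then score how high the true item ranked.
import numpy as np

def rank_accuracy(scores, true_index):
    """scores: similarity of the observed activity to each candidate item."""
    order = np.argsort(-scores)                      # best match first
    rank = int(np.where(order == true_index)[0][0])  # 0-based rank of truth
    return 1.0 - rank / (len(scores) - 1)            # 1.0 = top, 0.0 = last

scores = np.array([0.2, 0.9, 0.1, 0.4])              # hypothetical similarities
print(rank_accuracy(scores, true_index=1))           # true item is best match -> 1.0
```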

SLIDE 12

Lessons from fMRI Word Classification

Neural representations are similar across:

  • people
  • languages
  • word vs. picture presentation

Easier to decode:

  • concrete nouns
  • emotion nouns

Harder to decode:

  • abstract nouns
  • verbs*

* except when placed in context

SLIDE 13

Predictive Model?

Arbitrary noun → predicted fMRI activity

SLIDE 14

Predictive Model [Mitchell et al., Science, 2008]

Input noun “telephone” → retrieve text statistics from a trillion-word text collection → vector representing word meaning → predicted fMRI activity (model trained on other fMRI data).

Predicted activation at voxel v:  ŷ_v = Σ_{i=1}^{25} f_i(w) · c_{v,i}

SLIDE 15

Represent each stimulus noun by its co-occurrences with 25 verbs (in a trillion-word text collection).

Semantic feature values for “celery”: eat 0.8368, taste 0.3461, fill 0.3153, see 0.2430, clean 0.1145, open 0.0600, smell 0.0586, touch 0.0286, … drive 0.0000, wear 0.0000, lift 0.0000, break 0.0000, ride 0.0000

Semantic feature values for “airplane”: ride 0.8673, see 0.2891, say 0.2851, near 0.1689, open 0.1228, hear 0.0883, run 0.0771, lift 0.0749, … smell 0.0049, wear 0.0010, taste 0.0000, rub 0.0000, manipulate 0.0000
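A hedged sketch of this representation: count how often each verb occurs near the noun, then normalize. The toy corpus, the window size, and the five-verb list are illustrative stand-ins for the trillion-word collection and the actual 25 verbs:

```python
# Sketch: represent a noun by normalized co-occurrence counts with verbs.
from collections import Counter

VERBS = ["eat", "taste", "ride", "see", "wear"]   # illustrative subset
corpus = ("you eat celery and taste celery , "
          "you ride an airplane and see an airplane").split()

def verb_features(noun, corpus, window=2):
    """Count verb occurrences within +/-window tokens of each noun token."""
    counts = Counter()
    for i, tok in enumerate(corpus):
        if tok == noun:
            for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
                if corpus[j] in VERBS:
                    counts[corpus[j]] += 1
    total = sum(counts.values()) or 1
    return [counts[v] / total for v in VERBS]     # normalized feature vector

celery_vec = verb_features("celery", corpus)      # dominated by eat/taste
airplane_vec = verb_features("airplane", corpus)  # dominated by ride/see
```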

SLIDE 16

Predicted Activation is Sum of Feature Contributions

Predicted “celery” image = 0.84 × “eat” map + 0.35 × “taste” map + 0.32 × “fill” map + …  (color scale: high / low)

The feature values f_i(celery) come from corpus statistics; the per-voxel coefficients (e.g. c_{14382,eat}) are learned.

Predicted activation at voxel v:  ŷ_v = Σ_{i=1}^{25} f_i(w) · c_{v,i}  (500,000 learned c_{v,i} parameters)
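The forward model above is a per-voxel linear regression. A minimal sketch with synthetic data (the word, feature, and voxel counts are illustrative; the original used 25 features and far more voxels, and plain least squares stands in for whatever regularization the study used):

```python
# Sketch of the linear forward model: y_v = sum_i f_i(w) * c_vi,
# with the coefficient matrix C learned by least squares from
# (feature vector, fMRI image) training pairs. Synthetic data.
import numpy as np

rng = np.random.default_rng(1)
n_words, n_feats, n_voxels = 58, 25, 1000    # e.g. 58 training nouns

F = rng.random((n_words, n_feats))           # f_i(w): corpus-derived features
C_true = rng.normal(size=(n_feats, n_voxels))
Y = F @ C_true + 0.01 * rng.normal(size=(n_words, n_voxels))  # fMRI images

# Fit all voxel regressions jointly via least squares.
C_hat, *_ = np.linalg.lstsq(F, Y, rcond=None)

# Predict the image for a new word from its feature vector alone.
f_new = rng.random(n_feats)
predicted_image = f_new @ C_hat
```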

SLIDE 17


Predicted and observed fMRI images for “celery” and “airplane” after training on other nouns.

[Mitchell et al., Science, 2008]

SLIDE 18

Evaluating the Computational Model

  • Leave two words out during training
  • 1770 test pairs in leave-2-out:
    – random guessing → 0.50 accuracy
    – accuracy above 0.61 is significant (p < 0.05)
  • Test: which predicted image is “celery”, and which is “airplane”?
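The leave-2-out protocol can be sketched as follows, with synthetic data and a toy vocabulary (12 words giving 66 pairs, versus 60 words giving the 1770 pairs above; correlation-based matching is one common choice of similarity, assumed here):

```python
# Sketch of leave-2-out evaluation: hold out two words, train on the rest,
# then check whether the correct pairing of predicted-to-observed images
# beats the swapped pairing. Chance is 0.50.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
n_words, n_feats, n_voxels = 12, 5, 200
F = rng.random((n_words, n_feats))                 # semantic features
C = rng.normal(size=(n_feats, n_voxels))
Y = F @ C + 0.5 * rng.normal(size=(n_words, n_voxels))  # "observed" images

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

pairs = list(combinations(range(n_words), 2))
correct = 0
for i, j in pairs:
    train = [k for k in range(n_words) if k not in (i, j)]
    C_hat, *_ = np.linalg.lstsq(F[train], Y[train], rcond=None)
    pi, pj = F[i] @ C_hat, F[j] @ C_hat
    # correct assignment must beat the swapped assignment
    if corr(pi, Y[i]) + corr(pj, Y[j]) > corr(pi, Y[j]) + corr(pj, Y[i]):
        correct += 1

accuracy = correct / len(pairs)
```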

SLIDE 19

Learned activations associated with meaning components (Participant P1):

  • eat → “gustatory cortex”, pars opercularis (z = 24 mm)
  • push → “somato-sensory”, postcentral gyrus (z = 30 mm)
  • run → “biological motion”, superior temporal sulcus (posterior) (z = 12 mm)

SLIDE 20

Alternative semantic feature sets

Predefined corpus features → mean accuracy:
  • 25 verb co-occurrences: .79
  • 486 verb co-occurrences: .79
  • 50,000 word co-occurrences: .76
  • 300 Latent Semantic Analysis features: .73
  • 50 corpus features from Collobert & Weston, ICML 2008: .78

SLIDE 21

Alternative semantic feature sets

Predefined corpus features → mean accuracy:
  • 25 verb co-occurrences: .79
  • 486 verb co-occurrences: .79
  • 50,000 word co-occurrences: .76
  • 300 Latent Semantic Analysis features: .73
  • 50 corpus features from Collobert & Weston, ICML 2008: .78
  • 218 features collected using Mechanical Turk: .83

Example questions: Is it heavy? Is it flat? Is it curved? Is it colorful? Is it hollow? Is it smooth? Is it fast? Is it bigger than a car? Is it usually outside? Does it have corners? Does it have moving parts? Does it have seeds? Can it break? Can it swim? Can it change shape? Can you sit on it? Can you pick it up? Could you fit inside of it? Does it roll? Does it use electricity? Does it make a sound? Does it have a backbone? Does it have roots? Do you love it? …

Features authored by Dean Pomerleau; feature values range from 1 to 5; each feature collected from at least three people via Amazon’s Mechanical Turk.

SLIDE 22

Alternative semantic feature sets

Predefined corpus features → mean accuracy:
  • 25 verb co-occurrences: .79
  • 486 verb co-occurrences: .79
  • 50,000 word co-occurrences: .76
  • 300 Latent Semantic Analysis features: .73
  • 50 corpus features from Collobert & Weston, ICML 2008: .78
  • 218 features collected using Mechanical Turk*: .83
  • 20 features discovered from the data**: .86

* developed by Dean Pomerleau  ** developed by Indra Rustandi

SLIDE 23

Discovering a shared semantic basis

1. Use CCA to discover latent features across subjects.

For each study/subject (subj 1, word+pict … subj 9, word+pict; subj 10, word only … subj 20, word only), a CCA abstraction maps that subject’s fMRI image x to 20 learned latent features: f_k(w) = Σ_v x_v c_{v,k}. The coefficients c_{v,k} are specific to study/subject.

[Rustandi et al., 2009]

SLIDE 24

[slide courtesy of Indra Rustandi] Each column is one fMRI image.
SLIDE 25

Discovering a shared semantic basis

1. Use CCA to discover latent features: for each study/subject (subj 1, word+pict … subj 20, word only), f_k(w) = Σ_v x_v c_{v,k} gives 20 learned latent features f(w); the c_{v,k} are specific to study/subject.

[Rustandi et al., 2009]

SLIDE 26

Discovering a shared semantic basis

1. Use CCA to discover latent features: f_k(w) = Σ_v x_v c_{v,k}, specific to study/subject (subj 1, word+pict … subj 20, word only), giving 20 learned latent features f(w).
2. Train regression to predict them from the 218 MTurk features b(w) of word w, independent of study/subject: f_i(w) = Σ_k b_k(w) c_{i,k}

[Rustandi et al., 2009]

SLIDE 27

Discovering a shared semantic basis

1. Use CCA to discover latent features.
2. Train regression to predict them from the 218 MTurk features b(w) of word w (independent of study/subject): f_i(w) = Σ_k b_k(w) c_{i,k}
3. Invert the CCA mapping to predict each study/subject’s representation (subj 1, word+pict … subj 20, word only): ŷ_v = Σ_i f_i(w) c_{v,i}

[Rustandi et al., 2009]

SLIDE 28

CCA Components: Top Stimulus Words

Stimuli that most activate each component:
  • component 1: apartment, church, closet, house, barn
  • component 2: screwdriver, pliers, refrigerator, knife, hammer
  • component 3: telephone, butterfly, bicycle, beetle, dog
  • component 4: pants, dress, glass, coat, chair

Possible interpretations: shelter? manipulation? things that touch my body?

SLIDE 29

Timing?

SLIDE 30

MEG: Stimulus “hand” (word plus line drawing)

[Sudre et al., NeuroImage 2012]

SLIDE 31

(Sudre et al., under review)

100 ms (of 800 ms, 50 ms windows): word length; right diagonalness; verticality; word length; word length

[Sudre et al., NeuroImage 2012]

SLIDE 32

100 ms (of 800 ms): word length; right diagonalness; verticality; word length; word length

[Sudre et al., 2012]

SLIDE 33

150 ms (of 800 ms): word length; internal details; aspect ratio

[Sudre et al., 2012]

SLIDE 34

200 ms (of 800 ms): internal details; IS IT HAIRY?; internal details; aspect ratio

[Sudre et al., 2012]

SLIDE 35

250 ms (of 800 ms): IS IT HOLLOW?; IS IT MADE OF WOOD?; white pixel count; horizontalness; IS IT HAIRY?; IS IT AN ANIMAL?

[Sudre et al., 2012]

SLIDE 36

300 ms (of 800 ms): CAN YOU PICK IT UP?; CAN YOU HOLD IT?; IS IT BIGGER THAN A CAR?; IS IT MAN-MADE?; IS IT ALIVE?; CAN IT BITE OR STING?; IS IT ALIVE?; DOES IT GROW?; IS IT ALIVE?; WAS IT EVER ALIVE?; DOES IT GROW?

[Sudre et al., 2012]

SLIDE 37

350 ms (of 800 ms): CAN YOU HOLD IT IN ONE HAND?; COULD YOU FIT INSIDE IT?; DOES IT HAVE FOUR LEGS?; IS IT MAN-MADE?; WAS IT EVER ALIVE?; IS IT ALIVE?; CAN IT BEND?; CAN YOU PICK IT UP?; CAN YOU HOLD IT?

[Sudre et al., 2012]

SLIDE 38

400 ms (of 800 ms): CAN YOU PICK IT UP?; IS IT TALLER THAN A PERSON?; IS IT MAN-MADE?; WAS IT EVER ALIVE?; WAS IT INVENTED?; DOES IT HAVE FEELINGS?; IS IT ALIVE?; IS IT BIGGER THAN A CAR?; IS IT MAN-MADE?; WAS IT EVER ALIVE?; IS IT MANUFACTURED?; DOES IT HAVE CORNERS?; CAN YOU PICK IT UP?

[Sudre et al., 2012]

SLIDE 39

450 ms (of 800 ms): CAN YOU HOLD IT?; IS IT ALIVE?; IS IT AN ANIMAL?; IS IT HOLLOW?; IS IT HOLLOW?; DOES IT GROW?; IS IT MANUFACTURED?; WAS IT INVENTED?; IS IT BIGGER THAN A BED?

[Sudre et al., 2012]

SLIDE 40

500 ms (of 800 ms): IS IT BIGGER THAN A BED?; IS IT TALLER THAN A PERSON?; CAN YOU PICK IT UP?; CAN YOU PICK IT UP?; DOES IT GROW?; CAN YOU HOLD IT IN ONE HAND?

[Sudre et al., 2012]

SLIDE 41

550 ms (of 800 ms): CAN IT BE EASILY MOVED?; IS IT ALIVE?; IS IT MAN-MADE?; WAS IT EVER ALIVE?

[Sudre et al., 2012]

SLIDE 42

Details

SLIDE 43

Color = decodability* of the feature “word length” (peak decodability 100–150 ms)

Brain regions (rows): L,R precuneus; L,R pericalcarine; L,R lingual; L,R superior parietal; L,R cuneus; L,R lateral occipital. Time axis: 0–600 ms.

* % of feature variance predicted by MEG, mean across 9 subjects

SLIDE 44

Color = decodability of “grasping” features (initial peak: 200–300 ms)

Brain regions (rows): L,R precuneus; L,R pericalcarine; L,R lingual; L,R superior parietal; L,R cuneus; L,R lateral occipital; L inferior parietal; L supramarginal; L,R postcentral. Time axis: 0–600 ms.

[Sudre et al., 2012]

SLIDE 45

The 20 most accurately decoded semantic features (out of 218) group around: size, manipulability, animacy, shelter.

[G. Sudre et al., 2012]

SLIDE 46

Story reading

Leila Wehbe

SLIDE 47

Reading Harry Potter, one word at a time (“Harry never thought he would …”), 500 ms per word.

SLIDE 48


SLIDE 49

General Framework

Stimulus sequence (“Harry never thought he would meet a person he …”) → vector summary of the current word plus story context (e.g. [1.3, …], [2.9, …], [15.1, …]) → brain activity (fMRI, MEG) over time.

SLIDE 50

SLIDE 51

SLIDE 52

199 story features:

SLIDE 53

Test the model on new text passages; accuracy: 75%
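The story-reading encoding model can be sketched as a regression from per-word story-feature vectors onto brain activity, tested on held-out passages with a two-alternative matching test. Everything here is synthetic except the 199-feature count from the slide; the ridge penalty and correlation-based matching are assumptions:

```python
# Sketch: regress brain activity onto story features, then test on
# held-out data by matching predicted activity to the right time point.
import numpy as np

rng = np.random.default_rng(4)
n_train, n_test, n_feats, n_voxels = 300, 40, 199, 150
W = rng.normal(size=(n_feats, n_voxels))     # "true" encoding weights

def simulate(n):
    X = rng.normal(size=(n, n_feats))        # story features per word
    return X, X @ W + 1.0 * rng.normal(size=(n, n_voxels))

X_tr, Y_tr = simulate(n_train)
X_te, Y_te = simulate(n_test)

# Ridge regression in closed form: W_hat = (X'X + aI)^-1 X'Y
a = 10.0
W_hat = np.linalg.solve(X_tr.T @ X_tr + a * np.eye(n_feats), X_tr.T @ Y_tr)

# 2-alternative test: does predicted activity match the correct
# held-out time point better than a distractor time point?
pred = X_te @ W_hat
hits = sum(
    np.corrcoef(pred[i], Y_te[i])[0, 1]
    > np.corrcoef(pred[i], Y_te[(i + 1) % n_test])[0, 1]
    for i in range(n_test)
)
accuracy = hits / n_test
```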

SLIDE 54

Previous work (Fedorenko et al., Neuropsychologia 2012): where does reading generate activity?

Our work (Wehbe et al., PLoS One 2014): where is story information encoded?

SLIDE 55
Our work (Wehbe et al., PLoS One 2014): where is story information encoded?

Previous work (Fedorenko et al., Neuropsychologia 2012): where does reading generate activity? → drill down

SLIDE 56

[Fedorenko et al. 2012] [Wehbe et al., 2014]

SLIDE 57

Q: Can we observe neural encoding of story content?

[Wehbe et al., EMNLP 2014]

SLIDE 58

Modeling context: Recurrent Network

1. MEG subjects read a chapter of Harry Potter.
2. Train a recurrent network language model on 67M words of Harry Potter fan fiction.
3. Use the learned representation of context s(t−1), the current word w(t), and the current word probability y(t), c(t), to decode* the current word from 100 ms windows of neural activity.

[Wehbe et al., EMNLP14]

* concatenate 20 random words per example, 2x2
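The quantities named in step 3 can be made concrete with a tiny recurrent language model in plain numpy. This is a structural sketch only: the weights are random and untrained, the vocabulary is the five-word example from earlier slides, and the simple tanh recurrence stands in for whatever architecture the study used:

```python
# Minimal sketch: at each position t, a recurrent LM exposes the context
# vector s(t-1), the current-word embedding w(t), and the model's
# probability y(t) of the current word. Untrained random weights.
import numpy as np

rng = np.random.default_rng(5)
vocab = ["harry", "never", "thought", "he", "would"]
V, d_emb, d_hid = len(vocab), 8, 16

E = rng.normal(size=(V, d_emb))              # word embeddings w(t)
Wx = rng.normal(size=(d_emb, d_hid)) * 0.1   # input-to-hidden weights
Wh = rng.normal(size=(d_hid, d_hid)) * 0.1   # hidden-to-hidden weights
Wo = rng.normal(size=(d_hid, V)) * 0.1       # hidden-to-output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

s = np.zeros(d_hid)                          # context s(t-1), initially empty
records = []
for word in vocab:                           # the example sentence
    i = vocab.index(word)
    y = softmax(s @ Wo)[i]                   # probability of the current word
    records.append((s.copy(), E[i], y))      # (context, word vector, prob)
    s = np.tanh(E[i] @ Wx + s @ Wh)          # fold current word into context
```

The decoding step then asks how well each of these three representations can be predicted from 100 ms windows of MEG activity.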

SLIDE 59

MEG classification accuracy*:

  • 0.80 current word (embedding)
  • 0.93 context (recurrent hidden state)
  • 0.60 predicted probability of current word

* concatenate MEG for 20 random words per example, 2x2

SLIDE 60

Results

[Wehbe et al., EMNLP14]

Panels: current word; context (hidden); word probability.

SLIDE 61

Implications

  • Much activity encodes context: decoding based on context > decoding based on the current word.
  • Context is most salient 200–250 ms post word onset.
  • Current-word probability is most salient in the left hemisphere, at 200–400 ms.

[Wehbe et al., EMNLP14]

SLIDE 62

Lessons

Neuroscience:

  • Neural code for word meanings distributed across the brain
  • Your neural code and mine are very similar
  • Neural code is built up from more primitive semantic features
  • Neural code evolves over 400 msec after word onset
  • During story reading, diverse information encoded brain-wide
SLIDE 63

Lessons

Neuroscience:

  • Neural code for word meanings distributed across the brain
  • Your neural code and mine are very similar
  • Neural code is built up from more primitive semantic features
  • Neural code evolves over 400 msec after word onset
  • During story reading, diverse information encoded brain-wide

Methodology

  • Key role of machine learning: classifiers, regression, latent representation discovery, language modeling, …
  • Big opportunity 1: jointly analyze data from many experiments
  • Big opportunity 2: build a program that understands sentences and, as a result, predicts neural activity

SLIDE 64

thank you!