Investigating neural representations of spoken language. Grzegorz Chrupała. PowerPoint PPT Presentation.

SLIDE 1

Investigating neural representations of spoken language

Grzegorz Chrupała

SLIDE 2

In collaboration with

Afra Alishahi, Marie Barking, Lieke Gelderloos, Mark van der Laan

SLIDE 3

Automatic Speech Recognition

A major success story in Language Technology

SLIDE 4

Large amounts of fine-grained supervision

I can see you

SLIDE 5

Grounded speech perception

SLIDE 6

Modeling spoken language

 Induce representations between:
  the auditory signal, and
  visual semantics
 Understand:
  What representations emerge in models?
  How much do they match linguistic analyses?
  Which parts of the architecture encode what?

SLIDE 7
SLIDE 8

Datasets

 Flickr8K Audio Caption Corpus
  8K images, five audio captions each
 MS COCO Synthetic Spoken Captions
  300K images, five synthetically spoken captions each
 Places Audio Caption 400K Corpus
  400K spoken captions

SLIDE 9

Project speech and image to joint space

a bird walks on a beam
bears play in water
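The joint-space idea on this slide can be sketched as a margin-based ranking loss over a batch of matched (speech, image) embeddings: each caption should land closer to its own image than to any other image in the batch, and vice versa. This is a minimal NumPy sketch under that assumption, not the exact loss from the cited papers; the function names are mine.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale each row to unit length so dot products are cosines."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(speech_emb, image_emb, margin=0.2):
    """Margin-based ranking loss: row i of speech_emb and row i of
    image_emb are a matched pair; all other rows act as negatives."""
    s = l2_normalize(speech_emb)
    i = l2_normalize(image_emb)
    sims = s @ i.T                      # cosine similarities, batch x batch
    pos = np.diag(sims)                 # matched pairs on the diagonal
    # hinge terms: wrong image given a caption, wrong caption given an image
    cost_i = np.maximum(0.0, margin + sims - pos[:, None])
    cost_s = np.maximum(0.0, margin + sims - pos[None, :])
    np.fill_diagonal(cost_i, 0.0)       # don't penalize the matched pair
    np.fill_diagonal(cost_s, 0.0)
    return cost_i.sum() + cost_s.sum()
```

With perfectly aligned pairs the loss is zero; shuffling the pairing makes it positive, which is the signal that drives both encoders toward the joint space.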

SLIDE 10

a bird walks on a beam

SLIDE 11

Image retrieval

Grzegorz Chrupała, Lieke Gelderloos and Afra Alishahi. 2017. Representations of language in a model of visually grounded speech signal. In ACL.

SLIDE 12

Further advances

 Harwath, D., Torralba, A., & Glass, J. (2016). Unsupervised learning of spoken language with visual context. In NeurIPS.
 Harwath, D., & Glass, J. (2017). Learning word-like units from joint audio-visual analysis. In ACL.
 Chrupała, G. (2019). Symbolic inductive bias for visually grounded learning of spoken language. In ACL.
 Merkx, D., Frank, S. L., & Ernestus, M. (2019). Language learning using speech to image retrieval. In Interspeech.
 Ilharco, G., Zhang, Y., & Baldridge, J. (2019). Large-scale representation learning from visually grounded untranscribed speech. In CoNLL.
 Havard, W. N., Chevrot, J. P., & Besacier, L. (2019). Word recognition, competition, and activation in a model of visually grounded speech. In CoNLL.

Flickr8K

SLIDE 13

Levels of representation

 What aspects of sentences are encoded?  Which parts of the architecture encode what?

SLIDE 14

Homonym disambiguation

 Utterances with homonyms: pair/pear, waste/waist, ...
 Decide which meaning was present in an utterance.
 Easier if meaning is represented, harder if only form.

Afra Alishahi, Marie Barking and Grzegorz Chrupała. 2017. Encoding of phonology in a recurrent neural model of grounded speech. In CoNLL.

SLIDE 15

Synthetic COCO

SLIDE 16

Synonym discrimination

 Disentangle phonological form and semantics.  Discriminate between synonyms in identical

context:

A girl looking at a photo. A girl looking at a picture.

 How invariant to phonological form is a

representation?

Afra Alishahi, Marie Barking and Grzegorz Chrupała. 2017. Encoding of phonology in a recurrent neural model of grounded speech. In CoNLL.

SLIDE 17

Synthetic COCO

SLIDE 18

Phoneme discrimination

ABX task (Schatz et al. 2013)
A: /si/   B: /mi/   X: /me/
Decide whether X is closer to A or to B.

Afra Alishahi, Marie Barking and Grzegorz Chrupała. 2017. Encoding of phonology in a recurrent neural model of grounded speech. In CoNLL.

SLIDE 19

ABX

Especially challenging when the target (B) and the distractor (A) belong to the same phoneme class.

Synthetic COCO
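An ABX trial over learned representations reduces to a distance comparison: the representation passes the trial if X (same phoneme category as B) lies closer to B than to the distractor A. A minimal sketch, with hypothetical helper names and Euclidean distance standing in for whatever metric the analysis actually uses:

```python
import numpy as np

def abx_correct(a, b, x, dist=lambda u, v: np.linalg.norm(u - v)):
    """One ABX trial: correct if X is closer to B (same category)
    than to the distractor A."""
    return dist(x, b) < dist(x, a)

def abx_accuracy(trials):
    """Fraction of (A, B, X) triples the representation gets right."""
    return sum(abx_correct(a, b, x) for a, b, x in trials) / len(trials)
```

Chance level is 0.5, so scores near 0.5 mean the layer does not separate the two phoneme categories at all.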

SLIDE 20

Interim summary

 Bottom layers encode form; top layers encode meaning.
 Even the top layers are not completely form-invariant.

SLIDE 21

Caveats

SLIDE 22

Phoneme decoding

Afra Alishahi, Marie Barking and Grzegorz Chrupała. 2017. Encoding of phonology in a recurrent neural model of grounded speech. In CoNLL.

Synthetic COCO
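Phoneme decoding here means training a diagnostic classifier: a simple (typically linear) probe that predicts the phoneme label from a layer's activation vector, where high accuracy suggests the phoneme identity is linearly decodable from that layer. A self-contained sketch of such a probe as softmax regression (the exact probe in the paper may differ; all names are mine):

```python
import numpy as np

def train_diagnostic_classifier(X, y, n_classes, lr=0.5, steps=500):
    """Fit a linear softmax classifier predicting phoneme labels y
    from activation vectors X by gradient descent on cross-entropy."""
    W = np.zeros((X.shape[1], n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                       # one-hot targets
    for _ in range(steps):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        grad = (P - Y) / len(X)                    # cross-entropy gradient
        W -= lr * X.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def accuracy(W, b, X, y):
    """Decoding accuracy of the fitted probe on (X, y)."""
    return float(((X @ W + b).argmax(axis=1) == y).mean())
```

The caveat slides apply directly here: this probe must be compared against a baseline (e.g. the same probe on a randomly initialized network) before its accuracy is read as evidence about learned representations.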

SLIDE 23

Belinkov, Y., Ali, A., & Glass, J. (2019). Analyzing Phonetic and Graphemic Representations in End-to-End Automatic Speech Recognition. In Interspeech.

SLIDE 24

Phoneme decoding from random networks

Flickr8K

SLIDE 25

Representational Similarity Analysis

Kriegeskorte, N., Mur, M., & Bandettini, P. A. (2008). Representational similarity analysis- connecting the branches of systems neuroscience. Frontiers in systems neuroscience, 2, 4.

SLIDE 26

RSA: an example

RSA score: correlation between Sim A and Sim B
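Concretely, an RSA score can be computed by taking all pairwise similarities within space A and within space B for the same set of items, and correlating the two sets. A minimal sketch using cosine similarity and Pearson correlation (other variants, e.g. Spearman, are common; helper names are mine):

```python
import numpy as np

def pairwise_cosine(x):
    """Pairwise cosine similarities between the rows of x."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x @ x.T

def rsa_score(sim_a, sim_b):
    """Pearson correlation between the upper triangles of two
    pairwise-similarity matrices over the same items."""
    iu = np.triu_indices_from(sim_a, k=1)
    return np.corrcoef(sim_a[iu], sim_b[iu])[0, 1]
```

Only the upper triangle is used, since the matrices are symmetric and the diagonal is trivially maximal.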

SLIDE 27

Structured Spaces

 RSA applies given a similarity/distance metric WITHIN spaces A and B.
 No need for a metric BETWEEN A and B.
 A can be a vector space, while B can be a space of strings/trees/graphs.
 For application to syntax, see: Chrupała, G., & Alishahi, A. (2019). Correlating neural and symbolic representations of language. In ACL.

SLIDE 28

Phonemes with RSA

 A: cosine distances between activation vectors
 B: edit distances between phonemic transcriptions

SLIDE 29

Pooling

Parameters W, u optimized with respect to RSA scores.
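One plausible reading of this pooling step: per-time-step scores tanh(H W) u are softmax-normalized into attention weights over time, giving a single utterance vector as a weighted average of frames; W and u are the parameters tuned (in the talk, with respect to the RSA score). A sketch under that assumption, not necessarily the exact parameterization used:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a score vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_pool(H, W, u):
    """Collapse a (time x dim) activation matrix H into one vector:
    score each frame with tanh(H W) u, normalize over time, and take
    the weighted average of the frames."""
    scores = np.tanh(H @ W) @ u        # one scalar per time step
    alpha = softmax(scores)            # attention weights over time
    return alpha @ H                   # weighted average of frames
```

With W and u set to zero the weights are uniform, so this reduces to plain mean pooling; optimizing them lets the pooling emphasize the frames most relevant to the similarity structure.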

SLIDE 30

Phonemes with RSA

Flickr8K

SLIDE 31

Conclusion, again

 Baselines and sanity checks are a must.
 Diagnostic classifiers may lack sensitivity to details of representation.
 Use multiple analytical approaches to cross-check results.

SLIDE 32

BlackboxNLP

 Workshop on Analyzing and Interpreting Neural Networks for NLP
 https://blackboxnlp.github.io
 2018: EMNLP, Brussels
 2019: ACL, Florence
 2020?

SLIDE 33

References

 Grzegorz Chrupała, Lieke Gelderloos and Afra Alishahi. 2017. Representations of language in a model of visually grounded speech signal. In ACL.
 Afra Alishahi, Marie Barking and Grzegorz Chrupała. 2017. Encoding of phonology in a recurrent neural model of grounded speech. In CoNLL.
 Grzegorz Chrupała and Afra Alishahi. 2019. Correlating neural and symbolic representations of language. In ACL.
 Grzegorz Chrupała. 2019. Symbolic inductive bias for visually grounded learning of spoken language. In ACL.

SLIDE 34

Extras

SLIDE 35

Model settings

SLIDE 36

Representational Similarity

 Correlations between sets of pairwise similarities according to:
  Activations
 VS
  Edit ops on text
  Human judgments (SICK dataset)

SLIDE 37

Decoding speaker attributes (Flickr8K): identity, gender

SLIDE 38

Decoding speaker attributes

 Substantial amount of speaker information in the top layers
  Especially gender
 Idea: disentangle semantics from speaker info?

SLIDE 39

RSA + Tree Kernels

 Infersent (Conneau 2017)

 trained on NLI

 BERT (Devlin et al. 2018)

 trained on cloze and next-sentence classifjcation

 Random versions of these

SLIDE 40
SLIDE 41

BERT layers