Investigating neural representations of spoken language
Grzegorz Chrupała
In collaboration with
Afra Alishahi Marie Barking Lieke Gelderloos Mark van der Laan
Automatic Speech Recognition
A major success story in Language Technology
Large amounts of fine-grained supervision
I can see you
Grounded speech perception
Modeling spoken language
Induce representations between auditory signal and visual semantics
Understand:
What representations emerge in models?
How much do they match linguistic analyses?
Which parts of the architecture encode what?
Datasets
Flickr8K Audio Caption Corpus
8K images, five audio captions each
MS COCO Synthetic Spoken Captions
300K images, five synthetically spoken captions each
Places Audio Caption 400K Corpus
400K spoken captions
Project speech and image to joint space
a bird walks on a beam bears play in water
a bird walks on a beam
Image retrieval
Grzegorz Chrupała, Lieke Gelderloos and Afra Alishahi. 2017. Representations of language in a model of visually grounded speech signal. In ACL.
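A minimal sketch of image retrieval in a joint space, assuming utterances and images have already been encoded as vectors of the same dimensionality and are compared by cosine similarity. The toy embeddings and the names cosine and retrieve are made up for illustration; they are not taken from the cited model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def retrieve(utterance_vec, image_vecs):
    """Return image indices ranked from most to least similar to the utterance."""
    scores = [cosine(utterance_vec, img) for img in image_vecs]
    return sorted(range(len(image_vecs)), key=lambda i: scores[i], reverse=True)

# Toy image embeddings (hypothetical), already projected into a shared 3-d space.
images = {
    "bird_on_beam": [0.9, 0.1, 0.0],
    "bears_in_water": [0.1, 0.8, 0.3],
}
# Hypothetical encoding of the spoken caption "a bird walks on a beam".
utterance = [0.8, 0.2, 0.1]

names = list(images)
ranking = [names[i] for i in retrieve(utterance, list(images.values()))]
print(ranking)
```

In this toy setup the spoken caption ranks the matching image first because its vector points in nearly the same direction.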
Further advances
Harwath, D., Torralba, A., & Glass, J. (2016). Unsupervised learning of spoken language with visual context. In NeurIPS.
Harwath, D., & Glass, J. (2017). Learning Word-Like Units from Joint Audio-Visual Analysis. In ACL.
Chrupała, G. (2019). Symbolic inductive bias for visually grounded learning of spoken language. In ACL.
Merkx, D., Frank, S. L., & Ernestus, M. (2019). Language learning using Speech to Image retrieval. In Interspeech.
Ilharco, G., Zhang, Y., & Baldridge, J. (2019). Large-scale representation learning from visually grounded untranscribed speech. In CoNLL.
Havard, W. N., Chevrot, J. P., & Besacier, L. (2019). Word Recognition, Competition, and Activation in a Model of Visually Grounded Speech. In CoNLL.
Flickr8K
Levels of representation
What aspects of sentences are encoded?
Which parts of the architecture encode what?
Homonym disambiguation
Utterances with homonyms
pair/pear, waste/waist ...
Decide which meaning was present in an utterance.
Easier if meaning is represented, harder if only form.
Afra Alishahi, Marie Barking and Grzegorz Chrupała. 2017. Encoding of phonology in a recurrent neural model of grounded speech. In CoNLL.
Synthetic COCO
Synonym discrimination
Disentangle phonological form and semantics.
Discriminate between synonyms in identical context:
A girl looking at a photo.
A girl looking at a picture.
How invariant to phonological form is a representation?
Afra Alishahi, Marie Barking and Grzegorz Chrupała. 2017. Encoding of phonology in a recurrent neural model of grounded speech. In CoNLL.
Synthetic COCO
Phoneme discrimination
ABX task (Schatz et al. 2013)
A: /si/
B: /mi/
X: /me/
Afra Alishahi, Marie Barking and Grzegorz Chrupała. 2017. Encoding of phonology in a recurrent neural model of grounded speech. In CoNLL.
ABX
Especially challenging when the target (B) and distractor (A) belong to the same phoneme class.
Synthetic COCO
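The ABX decision can be sketched as follows, assuming each syllable is represented by a single activation vector and that vectors are compared by cosine distance. The three vectors and the function names are invented for illustration; only the A/B/X roles come from the slide.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def abx_correct(a, b, x):
    """True if X is closer to the target B than to the distractor A."""
    return cosine_distance(b, x) < cosine_distance(a, x)

def abx_accuracy(triples):
    """Fraction of (A, B, X) triples decided correctly."""
    return sum(abx_correct(a, b, x) for a, b, x in triples) / len(triples)

# Made-up activation vectors for the syllables on the slide.
si = [1.0, 0.0, 0.2]  # A: distractor
mi = [0.1, 1.0, 0.2]  # B: target, shares its consonant with X
me = [0.0, 0.9, 0.6]  # X

print(abx_correct(si, mi, me))
```

With real activations the accuracy would be computed over many triples, and per layer, to see where phoneme identity is most separable.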
Interim summary
Bottom layers encode form, top layers meaning
Even top layers are not completely form-invariant
Caveats
Phoneme decoding
Afra Alishahi, Marie Barking and Grzegorz Chrupała. 2017. Encoding of phonology in a recurrent neural model of grounded speech. In CoNLL.
Synthetic COCO
Belinkov, Y., Ali, A., & Glass, J. (2019). Analyzing Phonetic and Graphemic Representations in End-to-End Automatic Speech Recognition. In Interspeech.
Phoneme decoding from random networks
Flickr8K
Representational Similarity Analysis
Kriegeskorte, N., Mur, M., & Bandettini, P. A. (2008). Representational similarity analysis - connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 2, 4.
RSA: an example
RSA score: correlation between Sim A and Sim B
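One way to write the score down, as a hedged formalization: the notation is an assumption, with n stimuli, a_i and b_i the representations of stimulus i in spaces A and B, and corr a correlation coefficient such as Pearson's r (the slide does not say which one is used).

```latex
% Pairwise similarities are computed within each space separately.
\[
  S^A_{ij} = \mathrm{sim}_A(a_i, a_j), \qquad
  S^B_{ij} = \mathrm{sim}_B(b_i, b_j), \qquad 1 \le i < j \le n
\]
% The RSA score correlates the two lists of pairwise similarities.
\[
  \mathrm{RSA}(A, B) =
  \mathrm{corr}\bigl( (S^A_{ij})_{i<j},\ (S^B_{ij})_{i<j} \bigr)
\]
```

Only the n(n-1)/2 pairs with i < j enter the correlation, since the similarity matrices are symmetric and the diagonal is uninformative.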
Structured Spaces
RSA applies given a similarity/distance metric WITHIN spaces A and B.
No need for a metric BETWEEN A and B.
A can be a vector space, while B can be a space of strings/trees/graphs.
For application to syntax, see: Chrupała, G., & Alishahi, A. (2019). Correlating neural and symbolic representations of language. In ACL.
Phonemes with RSA
A – cosine distances between activation vectors
B – edit distances between phonemic transcriptions
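A minimal sketch of this setup, assuming one activation vector per utterance, transcriptions given as lists of phoneme symbols, plain Levenshtein distance, and Pearson correlation between the two sets of pairwise distances. The data and all function names are invented for illustration, and the choice of Pearson is an assumption.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    """Distance in space A: 1 minus cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def edit_distance(s, t):
    """Distance in space B: Levenshtein distance between phoneme sequences."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        curr = [i]
        for j, b in enumerate(t, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (a != b)))  # substitution or match
        prev = curr
    return prev[-1]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rsa_score(activations, transcriptions):
    """Correlate pairwise distances in A with pairwise distances in B."""
    pairs = list(combinations(range(len(activations)), 2))
    dist_a = [cosine_distance(activations[i], activations[j]) for i, j in pairs]
    dist_b = [edit_distance(transcriptions[i], transcriptions[j]) for i, j in pairs]
    return pearson(dist_a, dist_b)

# Made-up activations and their phonemic transcriptions.
activations = [[1.0, 0.0, 0.2], [0.1, 1.0, 0.2], [0.0, 0.9, 0.6]]
transcriptions = [["s", "i"], ["m", "i"], ["m", "e"]]

score = rsa_score(activations, transcriptions)
print(round(score, 3))
```

No mapping between vectors and strings is ever learned here; each space only needs its own distance function.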
Pooling
Parameters W, u optimized with respect to RSA scores.
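The slide names the pooling parameters W and u but not the formula. The sketch below assumes one common form of attention pooling, where each frame h_t gets a score u · tanh(W h_t), scores are normalized with a softmax, and the pooled vector is the weighted sum of frames. This form is an assumption, and the optimization of W and u against RSA scores is not shown.

```python
import math

def attention_pool(frames, W, u):
    """Pool a sequence of frame vectors into one vector.

    Assumed form (not confirmed by the slide):
      score_t  = u . tanh(W h_t)
      weight_t = softmax(score)_t
      pooled   = sum_t weight_t * h_t
    """
    scores = []
    for h in frames:
        hidden = [math.tanh(sum(w * x for w, x in zip(row, h))) for row in W]
        scores.append(sum(ui * zi for ui, zi in zip(u, hidden)))
    # Numerically stable softmax over time steps.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, frames)) for d in range(dim)]
    return pooled, weights

# Toy input: two 2-d frames, identity W, and u = 0.
frames = [[1.0, 2.0], [3.0, 4.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
u = [0.0, 0.0]

pooled, weights = attention_pool(frames, W, u)
print(pooled, weights)
```

With u set to zero all frames receive equal weight, so the function reduces to mean pooling; non-zero parameters let some frames count more than others.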
Phonemes with RSA
Flickr8K
Conclusion, again
Baselines and sanity checks are a must.
Diagnostic classifiers may lack sensitivity to details of representation.
Multiple analytical approaches to cross-check results.
BlackboxNLP
Workshop on Analyzing and Interpreting
Neural Networks for NLP
https://blackboxnlp.github.io
2018: EMNLP in Brussels
2019: ACL, Florence
2020?
References
Grzegorz Chrupała, Lieke Gelderloos and Afra Alishahi. 2017. Representations of language in a model of visually grounded speech signal. In ACL.
Afra Alishahi, Marie Barking and Grzegorz Chrupała. 2017. Encoding of phonology in a recurrent neural model of grounded speech. In CoNLL.
Grzegorz Chrupała and Afra Alishahi. 2019. Correlating neural and symbolic representations of language. In ACL.
Grzegorz Chrupała. 2019. Symbolic inductive bias for visually grounded learning of spoken language. In ACL.
Extras
Model settings
Representational Similarity
Correlations between sets of pairwise similarities according to:
Activations
vs.
Edit ops on text
Human judgments (SICK dataset)
Decoding speaker attributes (Flickr8K): identity, gender
Decoding speaker attributes
Substantial amount of speaker information in top layers
Especially gender
Idea: disentangle semantics from speaker info?
RSA + Tree Kernels
InferSent (Conneau et al. 2017)
trained on NLI
BERT (Devlin et al. 2018)
trained on cloze and next-sentence classification
Random versions of these