Encoding of Phonology in an RNN Model of Grounded Speech
Afra Alishahi, Marie Barking, Grzegorz Chrupała
A Realistic Language Learning Scenario
Two men are washing an elephant.
Grounded Language Learning: Roy & Pentland (2002), Yu & Ballard (2014), Harwath et al. (2016), Gelderloos & Chrupała (2016), Harwath & Glass (2017), Chrupała et al. (2017)
Analysis of Linguistic Knowledge: Elman (1991), Mohamed et al. (2012), Frank et al. (2013), Kadar et al. (2016), Li et al. (2016), Gelderloos & Chrupała (2016), Linzen et al. (2016), Adi et al. (2017)
We are here!
A Model of Grounded Speech Perception
[Figure: an image model and a speech model both map into a joint semantic space. Example captions: "a bird walks on a beam", "bears play in water".]
Image Model
[Figure: image classifier with example class labels BOAT, BIRD, BOAR; the pre-classification layer is marked]
VGG-16: Simonyan & Zisserman (2014)
Speech Model
Layers, from input to output: MFCC -> Convolution -> RHN #1 -> RHN #2 -> RHN #3 -> RHN #4 -> RHN #5 -> Attention -> projection to the joint semantic space
- Attention: weighted sum of last RHN layer units
- RHN: Recurrent Highway Networks (Zilly et al., 2016)
- Convolution: subsampling of the MFCC vectors
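The attention bullet above describes pooling the top recurrent layer into a single utterance vector by a weighted sum. A minimal sketch of that idea follows. The per-time-step `scores` and the softmax normalization are assumptions made for illustration; the slides do not say how the model computes its weights.

```python
import math

def attention_pool(states, scores):
    """Collapse a sequence of state vectors into one vector.

    states: list of per-time-step vectors (e.g. top RHN layer outputs).
    scores: one scalar per time step (hypothetical; the source does not
            specify how the model derives them).
    """
    # Softmax over time steps, shifted by the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the state vectors, one dimension at a time.
    dim = len(states[0])
    pooled = [sum(w * h[k] for w, h in zip(weights, states)) for k in range(dim)]
    return pooled, weights

# Toy input: three time steps, two units, equal scores.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.0, 0.0, 0.0]
pooled, weights = attention_pool(states, scores)
```

With equal scores the weights are uniform, so the pooled vector is simply the mean of the states.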
Chrupała et al., ACL 2017
- Representation of language in a model of visually grounded speech signal
- Using hidden layer activations in a set of auxiliary tasks
- Predicting utterance length and content, measuring representational similarity, and disambiguating homonyms
- Main findings:
  - Encodings of form and meaning emerge and evolve in hidden layers of stacked RNNs processing grounded speech
Current Study
- Questions: how is phonology encoded in
  - MFCC features extracted from the speech signal?
  - activations of the layers of the model?
- Data: Synthetically Spoken COCO dataset
- Experiments:
  - Phoneme decoding and clustering
  - Phoneme discrimination
  - Synonym discrimination
Phoneme Decoding
- Identifying phonemes from speech signal/activation
patterns: supervised classification of aligned phonemes
- Speech signal was aligned with phonemic transcription
using Gentle toolkit (based on Kaldi, Povey et al., 2011)
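Phoneme decoding as described above is a supervised classification problem: each aligned phoneme segment yields one feature vector (MFCC or layer activations) and one phoneme label. The sketch below illustrates the setup with a nearest-centroid classifier. That classifier is a stand-in chosen for brevity, not necessarily the one used in the study, and the toy vectors and labels are invented.

```python
def train_centroids(vectors, labels):
    """Average the training vectors of each phoneme label."""
    sums, counts = {}, {}
    for vec, lab in zip(vectors, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for k, x in enumerate(vec):
            acc[k] += x
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [x / counts[lab] for x in acc] for lab, acc in sums.items()}

def predict(centroids, vec):
    """Return the label whose centroid is nearest (squared Euclidean)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lab: sqdist(centroids[lab], vec))

def error_rate(centroids, vectors, labels):
    """Fraction of held-out segments assigned the wrong phoneme."""
    wrong = sum(predict(centroids, v) != l for v, l in zip(vectors, labels))
    return wrong / len(labels)

# Invented toy data: one 2-dimensional vector per aligned phoneme segment.
train_x = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
train_y = ["b", "b", "m", "m"]
test_x = [[0.1, 0.1], [1.0, 1.0], [0.4, 0.4]]
test_y = ["b", "m", "m"]

centroids = train_centroids(train_x, train_y)
err = error_rate(centroids, test_x, test_y)
```

Running the same procedure once per representation (MFCC, convolutional layer, each recurrent layer) gives one error rate per layer, which is the comparison the decoding experiment reports.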
[Figure: phoneme decoding error rate (y-axis, ticks 0.3 to 0.5) by representation (x-axis: MFCC, Conv, Rec1, Rec2, Rec3, Rec4, Rec5)]
Phoneme Discrimination
- ABX task (Schatz et al., 2013): discriminate minimal pairs; is X closer to A or to B?
- A, B and X are CV syllables
- (A,B) and (B,X) are minimal pairs, but (A,X) are not (34,288 tuples in total)
- Example: A: be /bi/, B: me /mi/, X: my /maɪ/
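The ABX decision can be sketched as a distance comparison between representation vectors. Elsewhere the slides call B the target and A the distractor, so a trial counts as correct when X lies closer to B than to A. Euclidean distance and the toy vectors are assumptions for illustration; the slides do not name the distance used.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def abx_correct(a, b, x, dist=euclidean):
    """True when X is closer to the target B than to the distractor A."""
    return dist(x, b) < dist(x, a)

def abx_accuracy(triples, dist=euclidean):
    """Fraction of (A, B, X) triples decided correctly."""
    hits = sum(abx_correct(a, b, x, dist) for a, b, x in triples)
    return hits / len(triples)

# Invented toy representations of (A, B, X) syllables.
triples = [
    ([0.0, 0.0], [1.0, 0.0], [0.9, 0.2]),  # X near B: correct
    ([0.0, 0.0], [1.0, 0.0], [0.1, 0.3]),  # X near A: incorrect
]
accuracy = abx_accuracy(triples)
```

Because the decision only compares distances, the same function can score raw MFCC vectors and the activations of any layer without training a classifier.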
ABX accuracy by representation:
Representation   Accuracy
MFCC             0.71
Convolutional    0.73
Recurrent 1      0.82
Recurrent 2      0.82
Recurrent 3      0.80
Recurrent 4      0.76
Recurrent 5      0.74
Phoneme Discrimination by Class
- The task is most challenging when the target (B) and distractor (A) belong to the same phoneme class
Phoneme classes:
Vowels: i ɪ ʊ u e ɛ ə ɚ ɔɪ ɔ o aɪ æ ʌ ɑ aʊ
Approximants: j ɹ l w
Nasals: m n ŋ
Plosives: p b t d k g
Fricatives: f v θ ð s z ʃ ʒ h
Affricates: tʃ dʒ
[Figure: phoneme discrimination accuracy (y-axis, ticks 0.5 to 0.9) by representation (x-axis: mfcc, conv, rec1, rec2, rec3, rec4, rec5), broken down by phoneme class: affricate, approximant, fricative, nasal, plosive, vowel]
Organization of Phonemes
- Agglomerative hierarchical clustering of phoneme activation vectors from the first hidden layer
[Figure: resulting phoneme hierarchy]
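Agglomerative clustering starts with one cluster per phoneme and repeatedly merges the two closest clusters until a single tree remains. The sketch below is a minimal version that assumes average linkage over Euclidean distances. The linkage criterion, the distance, and the toy vectors are all assumptions; the slides do not state them.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(c1, c2, vectors):
    """Mean pairwise distance between the members of two clusters."""
    pairs = [(p, q) for p in c1 for q in c2]
    return sum(euclidean(vectors[p], vectors[q]) for p, q in pairs) / len(pairs)

def agglomerate(vectors):
    """Return the merge history as a list of (cluster_a, cluster_b) tuples."""
    clusters = [(name,) for name in vectors]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], vectors)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Invented toy vectors: one mean activation vector per phoneme.
phoneme_vectors = {
    "m": [0.0, 0.0],
    "n": [0.1, 0.0],
    "p": [1.0, 1.0],
    "b": [1.0, 1.2],
}
merges = agglomerate(phoneme_vectors)
```

In this toy input the two nasals merge first and the two plosives second, which is the kind of class-consistent grouping the clustering experiment looks for.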
Synonym Discrimination
- Distinguishing between synonym pairs in the same context:
- A girl looking at a photo
- A girl looking at a picture
- Synonyms were selected using WordNet synsets:
- The pair have the same POS tag and are interchangeable
- The pair clearly differ in form (not donut/doughnut)
- The more frequent token in a pair constitutes less than 95% of
the occurrences.
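The frequency criterion above can be sketched as a simple filter: a pair is kept only if its more frequent member accounts for less than 95% of the pair's combined occurrences. The function name, the placeholder words, and the counts below are hypothetical.

```python
def keep_pair(count_a, count_b, max_share=0.95):
    """Keep a synonym pair only if neither word dominates its occurrences."""
    total = count_a + count_b
    if total == 0:
        return False  # a pair that never occurs cannot be evaluated
    return max(count_a, count_b) / total < max_share

# Hypothetical occurrence counts for two placeholder pairs.
counts = {
    ("w1", "w2"): (120, 300),  # majority share 300/420, about 0.71
    ("w3", "w4"): (990, 10),   # majority share 990/1000 = 0.99
}
selected = [pair for pair, (a, b) in counts.items() if keep_pair(a, b)]
```

The filter guards against pairs where always guessing the more frequent word would already score near-perfectly.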
Synonym Discrimination
[Figure: synonym discrimination error (y-axis, ticks 0.0 to 0.3) by representation (x-axis: mfcc, conv, rec1, rec2, rec3, rec4, rec5, emb), one series per synonym pair]
Pairs: cut/slice, make/prepare, someone/person, photo/picture, picture/image, kid/child, photograph/picture, slice/piece, bicycle/bike, photograph/photo, couch/sofa, tv/television, vegetable/veggie, sidewalk/pavement, rock/stone, store/shop, purse/bag, assortment/variety, spot/place, pier/dock, direction/way, carpet/rug, bun/roll, large/big, small/little
Conclusion
- Phoneme representations are most salient in lower layers
- A large amount of phonological information persists up to the top recurrent layer
- The attention layer significantly attenuates the encoding of phonology and makes utterance embeddings more invariant to synonymy
Code: https://github.com/gchrupala/encoding-of-phonology
Speech Model
- Input: MFCC features extracted from speech signal
[Figure: layer stack from MFCC through Convolution and RHN #1 to RHN #5 up to Attention, annotated "Phonological form"]
Objective Function
- Minimize the distance between all corresponding utterance-image (u,i) pairs, and maximize the distance between non-corresponding ones:
\[
\sum_{u,i} \left( \sum_{u'} \max[0, \alpha + d(u,i) - d(u',i)] + \sum_{i'} \max[0, \alpha + d(u,i) - d(u,i')] \right)
\]
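The objective sums two hinge terms for every matching pair: one contrasts the pair with other utterances for the same image, the other with other images for the same utterance. The sketch below assumes cosine distance for d and a margin of 0.2. Both are illustrative choices rather than values taken from the slides, and the contrastive items are drawn from the same small batch.

```python
import math

def cosine_distance(a, b):
    """1 minus cosine similarity (assumed choice for d)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def margin_loss(utterances, images, alpha=0.2, dist=cosine_distance):
    """utterances[k] and images[k] correspond; other pairings are contrastive.

    alpha is a hypothetical margin value.
    """
    n = len(utterances)
    loss = 0.0
    for k in range(n):
        d_pos = dist(utterances[k], images[k])
        for j in range(n):
            if j == k:
                continue
            # Another utterance u' paired with the same image i.
            loss += max(0.0, alpha + d_pos - dist(utterances[j], images[k]))
            # The same utterance u paired with another image i'.
            loss += max(0.0, alpha + d_pos - dist(utterances[k], images[j]))
    return loss

# Toy embeddings: perfectly aligned pairs versus swapped pairs.
aligned = margin_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = margin_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is zero once every matching pair is closer than every mismatched pair by at least the margin, and it grows when mismatched pairs are closer than matching ones.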
Experimental Setup
- Model specifications:
Layer            Specification
Attention        size 512
Recurrent 5      size 512
Recurrent 4      size 512
Recurrent 3      size 512
Recurrent 2      size 512
Recurrent 1      size 512
Convolutional    size 64, length 6, stride 3
Input MFCC       size 13
Table 2: COCO Speech utterance encoder archi-
Organization of Phonemes
- Adjusted Rand Index for the comparison of the phoneme type hierarchy induced from representations against phoneme classes:
[Figure: Adjusted Rand Index (y-axis, ticks 0.12 to 0.24) by representation (x-axis: mfcc, conv, rec1, rec2, rec3, rec4, rec5)]
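The Adjusted Rand Index compares two partitions of the same items and corrects for agreement expected by chance, so 1.0 means identical partitions and values near 0 mean chance-level agreement. The sketch below uses the standard pair-counting formula. It assumes the induced hierarchy has already been cut into flat clusters, a step the slides do not detail, and the toy labels are invented.

```python
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Pair-counting ARI between two labelings of the same items (n >= 2)."""
    n = len(labels_a)
    contingency, rows, cols = {}, {}, {}
    for a, b in zip(labels_a, labels_b):
        contingency[(a, b)] = contingency.get((a, b), 0) + 1
        rows[a] = rows.get(a, 0) + 1
        cols[b] = cols.get(b, 0) + 1
    # Number of item pairs placed together in both, in A, and in B.
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)

# Invented toy example: four phonemes, their classes, and two clusterings.
classes = ["nasal", "nasal", "plosive", "plosive"]
ari_match = adjusted_rand_index(classes, [0, 0, 1, 1])
ari_mixed = adjusted_rand_index(classes, [0, 1, 0, 1])
```

A clustering that reproduces the phoneme classes scores 1.0, while one that splits every class across clusters can score below zero.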