SLIDE 1

Encoding of Phonology in an RNN model of Grounded Speech

Afra Alishahi, Marie Barking, Grzegorz Chrupała

SLIDE 2

A Realistic Language Learning Scenario

Two men are washing an elephant.

SLIDE 3

Analysis of Linguistic Knowledge: Elman (1991), Mohamed et al. (2012), Frank et al. (2013), Kadar et al. (2016), Li et al. (2016), Gelderloos & Chrupała (2016), Linzen et al. (2016), Adi et al. (2017)

Grounded Language Learning: Roy & Pentland (2002), Yu & Ballard (2014), Harwath et al. (2016), Gelderloos & Chrupała (2016), Harwath & Glass (2017), Chrupała et al. (2017)

SLIDE 4

Analysis of Linguistic Knowledge: Elman (1991), Mohamed et al. (2012), Frank et al. (2013), Kadar et al. (2016), Li et al. (2016), Gelderloos & Chrupała (2016), Linzen et al. (2016), Adi et al. (2017)

Grounded Language Learning: Roy & Pentland (2002), Yu & Ballard (2014), Harwath et al. (2016), Gelderloos & Chrupała (2016), Harwath & Glass (2017), Chrupała et al. (2017)

We are here!

SLIDE 5

A Model of Grounded Speech Perception

Image Model Speech Model

Joint Semantic Space

SLIDE 6

Joint Semantic Space

a bird walks on a beam
bears play in water

SLIDE 7

Image Model

BOAT BIRD BOAR

Pre-classification layer

VGG-16: Simonyan & Zisserman (2014)

SLIDE 8

Speech Model

Project to the joint semantic space
Attention
RHN #5
RHN #4
RHN #3
RHN #2
RHN #1
Convolution
MFCC

SLIDE 9

Speech Model

Project to the joint semantic space
Attention: weighted sum of last RHN layer units
RHN #5
RHN #4
RHN #3
RHN #2
RHN #1 (RHN: Recurrent Highway Networks, Zilly et al., 2016)
Convolution: subsampling MFCC vector
MFCC
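
A minimal sketch of attention as a weighted sum over the time steps of the last RHN layer. The slide does not say how the per-step scores are computed, so the sketch takes them as input; the name `attention_pool` and the softmax normalization are assumptions for illustration, not the authors' code.

```python
import math


def attention_pool(states, scores):
    """Pool a sequence of hidden states into one vector.

    states: list of T vectors (lists of floats), e.g. last-layer activations.
    scores: list of T unnormalized scores, one per time step (assumed given).
    Returns (pooled_vector, weights), where weights sum to 1.
    """
    # Softmax over time steps, shifted by the max for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the hidden states, one dimension at a time.
    dim = len(states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, states)) for d in range(dim)]
    return pooled, weights
```

With equal scores the weights are uniform and the pooled vector is the plain average of the states; larger scores shift the result toward the corresponding time steps.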

SLIDE 10

Chrupała et al., ACL 2017

Representation of language in a model of visually grounded speech signal

Using hidden layer activations in a set of auxiliary tasks
Predicting utterance length and content, measuring representational similarity and disambiguation of homonyms

SLIDE 11

Chrupała et al., ACL 2017

Representation of language in a model of visually grounded speech signal

Using hidden layer activations in a set of auxiliary tasks
Predicting utterance length and content, measuring representational similarity and disambiguation of homonyms

Main findings:
Encodings of form and meaning emerge and evolve in hidden layers of stacked RNNs processing grounded speech

SLIDE 12

Current Study

Questions: how is phonology encoded in
MFCC features extracted from speech signal?
activations of the layers of the model?

SLIDE 13

Current Study

Questions: how is phonology encoded in
MFCC features extracted from speech signal?
activations of the layers of the model?
Data: Synthetically Spoken COCO dataset

SLIDE 14

Current Study

Questions: how is phonology encoded in
MFCC features extracted from speech signal?
activations of the layers of the model?
Data: Synthetically Spoken COCO dataset
Experiments:
Phoneme decoding and clustering
Phoneme discrimination
Synonym discrimination

SLIDE 15

Phoneme Decoding

Identifying phonemes from speech signal/activation patterns: supervised classification of aligned phonemes

SLIDE 16

Phoneme Decoding

Identifying phonemes from speech signal/activation patterns: supervised classification of aligned phonemes

Speech signal was aligned with phonemic transcription using Gentle toolkit (based on Kaldi, Povey et al., 2011)
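
The decoding setup above can be sketched as a supervised classifier over phoneme-aligned vectors, scored by error rate. The slide does not name the classifier, so this sketch swaps in a simple nearest-centroid classifier as a stand-in; the function names and any vectors passed in are hypothetical.

```python
from collections import defaultdict


def fit_centroids(vectors, labels):
    """Average the training vectors of each phoneme label into one centroid."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(centroids, vec):
    """Return the label whose centroid is nearest in squared Euclidean distance."""
    def sq_dist(label):
        return sum((c - x) ** 2 for c, x in zip(centroids[label], vec))
    return min(centroids, key=sq_dist)


def error_rate(centroids, vectors, labels):
    """Fraction of held-out vectors assigned the wrong phoneme label."""
    wrong = sum(predict(centroids, v) != y for v, y in zip(vectors, labels))
    return wrong / len(labels)
```

The same procedure would be repeated once per representation (MFCC, convolutional layer, each recurrent layer), so that error rates can be compared across layers.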

SLIDE 17

Phoneme Decoding

Identifying phonemes from speech signal/activation patterns: supervised classification of aligned phonemes

[Figure: phoneme decoding error rate (y-axis, ticks 0.3 to 0.5) by representation (x-axis: MFCC, Conv, Rec1, Rec2, Rec3, Rec4, Rec5)]

SLIDE 18

Phoneme Discrimination

ABX task (Schatz et al., 2013): discriminate minimal pairs; is X closer to A or to B?

A: be /bi/ B: me /mi/ X: my /maɪ/

SLIDE 19

Phoneme Discrimination

ABX task (Schatz et al., 2013): discriminate minimal pairs; is X closer to A or to B?

A, B and X are CV syllables

A: be /bi/ B: me /mi/ X: my /maɪ/

SLIDE 20

Phoneme Discrimination

ABX task (Schatz et al., 2013): discriminate minimal pairs; is X closer to A or to B?

A, B and X are CV syllables
(A,B) and (B,X) are minimal pairs, but (A,X) is not (34,288 tuples in total)

A: be /bi/ B: me /mi/ X: my /maɪ/
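
A minimal sketch of the ABX decision: a tuple counts as correct when X is closer to the target B than to the distractor A. The Euclidean distance and the idea of one fixed-size vector per syllable are assumptions for illustration; the slides do not state which distance or pooling was used.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def abx_correct(a, b, x, dist=euclidean):
    """True when X is closer to the target B than to the distractor A."""
    return dist(x, b) < dist(x, a)


def abx_accuracy(tuples, dist=euclidean):
    """Share of (A, B, X) tuples on which the representation picks B."""
    hits = sum(abx_correct(a, b, x, dist) for a, b, x in tuples)
    return hits / len(tuples)
```

Running `abx_accuracy` once per representation would give one score per layer, which is the kind of comparison the discrimination results report.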

SLIDE 21

Phoneme Discrimination

Representation   ABX accuracy
MFCC             0.71
Convolutional    0.73
Recurrent 1      0.82
Recurrent 2      0.82
Recurrent 3      0.80
Recurrent 4      0.76
Recurrent 5      0.74

SLIDE 23

Phoneme Discrimination by Class

The task is most challenging when the target (B) and distractor (A) belong to the same phoneme class

A: be /bi/ B: me /mi/ X: my /maɪ/

SLIDE 24

Phoneme Discrimination by Class

The task is most challenging when the target (B) and distractor (A) belong to the same phoneme class

Vowels        i ɪ ʊ u e ɛ ə ɚ ɔɪ ɔ o aɪ æ ʌ ɑ aʊ
Approximants  j ɹ l w
Nasals        m n ŋ
Plosives      p b t d k g
Fricatives    f v θ ð s z ʃ ʒ h
Affricates    tʃ dʒ

SLIDE 25

Phoneme Discrimination by Class

The task is most challenging when the target (B) and distractor (A) belong to the same phoneme class

[Figure: ABX accuracy (y-axis, ticks 0.5 to 0.9) by representation (x-axis: mfcc, conv, rec1, rec2, rec3, rec4, rec5), one series per phoneme class: affricate, approximant, fricative, nasal, plosive, vowel]

SLIDE 26

Organization of Phonemes

Agglomerative hierarchical clustering of phoneme activation vectors from the first hidden layer:
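
A minimal sketch of agglomerative hierarchical clustering over one vector per phoneme. The average-linkage criterion, the Euclidean distance, and the function names are assumptions for illustration; the slide does not specify them.

```python
import math


def euclid(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def agglomerate(vectors):
    """Bottom-up clustering with average linkage (an assumed choice).

    vectors: dict mapping a phoneme label to its activation vector.
    Returns the merge history as a list of (cluster_a, cluster_b, distance),
    where each cluster is a tuple of phoneme labels.
    """
    def avg_link(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        total = sum(euclid(vectors[p], vectors[q]) for p in c1 for q in c2)
        return total / (len(c1) * len(c2))

    clusters = [(label,) for label in vectors]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = avg_link(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges
```

The merge history is the dendrogram in list form: early merges join the most similar phonemes, and the final merge joins the two top-level groups.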

SLIDE 27

Synonym Discrimination

Distinguishing between synonym pairs in the same context:
A girl looking at a photo
A girl looking at a picture

SLIDE 28

Synonym Discrimination

Distinguishing between synonym pairs in the same context:
A girl looking at a photo
A girl looking at a picture
Synonyms were selected using WordNet synsets:
The pair have the same POS tag and are interchangeable
The pair clearly differ in form (not donut/doughnut)
The more frequent token in a pair constitutes less than 95% of the occurrences.

SLIDE 29

Synonym Discrimination

[Figure: synonym discrimination error (y-axis, ticks 0.0 to 0.3) by representation (x-axis: mfcc, conv, rec1, rec2, rec3, rec4, rec5, emb), one series per synonym pair]

Synonym pairs: cut/slice, make/prepare, someone/person, photo/picture, picture/image, kid/child, photograph/picture, slice/piece, bicycle/bike, photograph/photo, couch/sofa, tv/television, vegetable/veggie, sidewalk/pavement, rock/stone, store/shop, purse/bag, assortment/variety, spot/place, pier/dock, direction/way, carpet/rug, bun/roll, large/big, small/little

SLIDE 30

Conclusion

SLIDE 31

Conclusion

Phoneme representations are most salient in lower layers

SLIDE 32

Conclusion

Phoneme representations are most salient in lower layers
Large amount of phonological information persists up to the top recurrent layer

SLIDE 33

Conclusion

Phoneme representations are most salient in lower layers
Large amount of phonological information persists up to the top recurrent layer
The attention layer filters out and significantly attenuates encoding of phonology and makes utterance embeddings more invariant to synonymy

SLIDE 34

Conclusion

Phoneme representations are most salient in lower layers
Large amount of phonological information persists up to the top recurrent layer
The attention layer filters out and significantly attenuates encoding of phonology and makes utterance embeddings more invariant to synonymy

Code: https://github.com/gchrupala/encoding-of-phonology

SLIDE 35

Speech Model

Attention
RHN #5
RHN #4
RHN #3
RHN #2
RHN #1
Convolution
MFCC

Phonological form

SLIDE 36

Speech Model

Input: MFCC features extracted from speech signal

SLIDE 37

Objective Function

Minimize distance between all corresponding utterance-image (u,i) pairs, and maximize the distance between non-corresponding ones:

\[
\sum_{u,i} \left( \sum_{u'} \max[0, \alpha + d(u,i) - d(u',i)] + \sum_{i'} \max[0, \alpha + d(u,i) - d(u,i')] \right)
\]
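
A minimal sketch of the margin-based objective above, written for a small batch of already-encoded utterances and images. Cosine distance for d, the default margin value, and drawing the non-corresponding u' and i' from the other items in the batch are assumptions for illustration.

```python
import math


def cosine_distance(x, y):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def ranking_loss(utts, imgs, alpha=0.2, dist=cosine_distance):
    """Sum of hinge terms over all matching pairs and their contrasts.

    utts[k] and imgs[k] form a corresponding (u, i) pair; every other index
    in the batch supplies a non-corresponding u' or i' (assumed sampling).
    alpha is the margin; 0.2 is a placeholder value, not taken from the slides.
    """
    total = 0.0
    n = len(utts)
    for k in range(n):
        d_pos = dist(utts[k], imgs[k])
        for j in range(n):
            if j == k:
                continue
            # Contrast with another utterance u' for the same image i.
            total += max(0.0, alpha + d_pos - dist(utts[j], imgs[k]))
            # Contrast with another image i' for the same utterance u.
            total += max(0.0, alpha + d_pos - dist(utts[k], imgs[j]))
    return total
```

The loss is zero once every matching pair is closer than every mismatched pair by at least the margin, and grows when mismatched pairs are closer than matching ones.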

SLIDE 38

Experimental Setup

Model specifications:

Layer            Specification
Attention        size 512
Recurrent 5      size 512
Recurrent 4      size 512
Recurrent 3      size 512
Recurrent 2      size 512
Recurrent 1      size 512
Convolutional    size 64, length 6, stride 3
Input MFCC       size 13

Table 2: COCO Speech utterance encoder archi-
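
A small worked example of how a convolution with length 6 and stride 3 subsamples the MFCC sequence in time. The slides do not state whether the input is padded, so the sketch assumes no padding; the function name is hypothetical.

```python
def conv_output_length(num_frames, kernel_length=6, stride=3):
    """Number of time steps after a 1-D convolution without padding (assumed)."""
    if num_frames < kernel_length:
        return 0
    return (num_frames - kernel_length) // stride + 1
```

Under this assumption an utterance of 100 MFCC frames yields 32 time steps for the recurrent layers, roughly a threefold reduction in sequence length.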

SLIDE 39

Organization of Phonemes

Adjusted Rand Index for the comparison of the phoneme type hierarchy induced from representations against phoneme classes:

[Figure: Adjusted Rand Index (y-axis, ticks 0.12 to 0.24) by representation (x-axis: mfcc, conv, rec1, rec2, rec3, rec4, rec5)]
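
A minimal sketch of the Adjusted Rand Index between two flat partitions of the same phonemes, for example clusters cut from an induced hierarchy versus the reference phoneme classes. How the hierarchy is cut into flat clusters is not stated on the slide, so the sketch starts from label lists; the function name is an assumption.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index of two labelings of the same items.

    1.0 means identical partitions (up to renaming of labels); values near 0
    mean agreement no better than chance. Needs at least two items.
    """
    n = len(labels_a)
    # Pairs of items placed together in both partitions.
    together = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:
        # Degenerate case, e.g. both partitions put everything in one cluster.
        return 1.0
    return (together - expected) / (maximum - expected)
```

Because the index is corrected for chance, it can be compared across representations even when the induced clusters differ in number and size.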