

SLIDE 1

Encoding of Phonology in an RNN model of Grounded Speech

Afra Alishahi, Marie Barking, Grzegorz Chrupała

SLIDE 2

A Realistic Language Learning Scenario


Two men are washing an elephant.

SLIDE 3

Analysis of Linguistic Knowledge: Elman (1991); Mohamed et al. (2012); Frank et al. (2013); Kádár et al. (2016); Li et al. (2016); Gelderloos & Chrupała (2016); Linzen et al. (2016); Adi et al. (2017)

Grounded Language Learning: Roy & Pentland (2002); Yu & Ballard (2014); Harwath et al. (2016); Gelderloos & Chrupała (2016); Harwath & Glass (2017); Chrupała et al. (2017)

SLIDE 4

The same two strands of work as on the previous slide, with this study placed at their intersection: We are here!

SLIDE 5

A Model of Grounded Speech Perception


An image model and a speech model both project into a joint semantic space.

SLIDE 6

Joint Semantic Space


Example captions mapped into the space: "a bird walks on a beam", "bears play in water"

SLIDE 7

Image Model


The image model is the pre-classification layer of VGG-16 (Simonyan & Zisserman, 2014). [Figure: VGG-16 processing an image, with candidate labels BOAT, BIRD, BOAR]
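As a hedged sketch (not the authors' code): features of this kind can be taken from torchvision's pretrained VGG-16 by dropping the final classification layer; the exact layer choice and preprocessing here are assumptions.

    import torch
    from torchvision import models

    # Load pretrained VGG-16 (the `weights=` API assumes torchvision >= 0.13).
    vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    # Drop the final classification layer, keeping the 4096-dimensional
    # activations of the last fully connected layer before it.
    vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])
    vgg.eval()

    def image_features(batch):
        # batch: (n, 3, 224, 224) tensor, ImageNet-normalized
        with torch.no_grad():
            return vgg(batch)  # (n, 4096) pre-classification features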

SLIDE 8

Speech Model


MFCC → Convolution → RHN #1 → RHN #2 → RHN #3 → RHN #4 → RHN #5 → Attention → project to the joint semantic space

SLIDE 9

Speech Model


MFCC → Convolution → RHN #1 → RHN #2 → RHN #3 → RHN #4 → RHN #5 → Attention → project to the joint semantic space

  • Attention: weighted sum of the last RHN layer's units
  • RHN: Recurrent Highway Networks (Zilly et al., 2016)
  • Convolution: subsamples the MFCC vectors
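A minimal PyTorch sketch of this encoder, not the authors' implementation: GRU layers stand in for the Recurrent Highway Networks of the actual model, and the sizes follow the setup reported later in the deck (13 MFCCs; convolution of size 64, length 6, stride 3; five recurrent layers of size 512).

    import torch
    import torch.nn as nn

    class SpeechEncoder(nn.Module):
        def __init__(self, n_mfcc=13, conv_size=64, hidden=512, layers=5):
            super().__init__()
            # Convolution subsamples the MFCC sequence (stride 3).
            self.conv = nn.Conv1d(n_mfcc, conv_size, kernel_size=6, stride=3)
            # GRUs used here as a stand-in for the RHN stack.
            self.rnn = nn.GRU(conv_size, hidden, num_layers=layers,
                              batch_first=True)
            # Attention: one scalar score per time step, softmax-normalized.
            self.score = nn.Linear(hidden, 1)

        def forward(self, mfcc):            # mfcc: (batch, time, 13)
            x = self.conv(mfcc.transpose(1, 2)).transpose(1, 2)
            h, _ = self.rnn(x)              # (batch, time', hidden)
            w = torch.softmax(self.score(h), dim=1)
            u = (w * h).sum(dim=1)          # weighted sum of top-layer states
            return nn.functional.normalize(u, dim=1)  # unit-norm embedding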

SLIDE 10

Chrupała et al., ACL 2017

  • Representation of language in a model of visually grounded speech signal
  • Using hidden layer activations in a set of auxiliary tasks
  • Predicting utterance length and content, measuring representational similarity, and disambiguating homonyms


SLIDE 11

Chrupała et al., ACL 2017

  • Representation of language in a model of visually grounded speech signal
  • Using hidden layer activations in a set of auxiliary tasks
  • Predicting utterance length and content, measuring representational similarity, and disambiguating homonyms
  • Main findings:
  • Encodings of form and meaning emerge and evolve in the hidden layers of stacked RNNs processing grounded speech


SLIDE 12

Current Study

  • Questions: how is phonology encoded in
  • MFCC features extracted from the speech signal?
  • activations of the layers of the model?


SLIDE 13

Current Study

  • Questions: how is phonology encoded in
  • MFCC features extracted from the speech signal?
  • activations of the layers of the model?
  • Data: Synthetically Spoken COCO dataset


SLIDE 14

Current Study

  • Questions: how is phonology encoded in
  • MFCC features extracted from the speech signal?
  • activations of the layers of the model?
  • Data: Synthetically Spoken COCO dataset
  • Experiments:
  • Phoneme decoding and clustering
  • Phoneme discrimination
  • Synonym discrimination


SLIDE 15

Phoneme Decoding

  • Identifying phonemes from the speech signal/activation patterns: supervised classification of aligned phonemes


SLIDE 16

Phoneme Decoding

  • Identifying phonemes from the speech signal/activation patterns: supervised classification of aligned phonemes
  • The speech signal was aligned with its phonemic transcription using the Gentle toolkit (based on Kaldi; Povey et al., 2011)
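As a rough illustration of this setup (not the authors' code), phoneme decoding can be run as a supervised classifier over pooled activations; `segments` below is a hypothetical list of (frame-activations, phoneme-label) pairs obtained from the alignment.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    def decode_phonemes(segments):
        # Mean-pool each segment's frames into one vector per phoneme token.
        X = np.stack([seg.mean(axis=0) for seg, _ in segments])
        y = np.array([label for _, label in segments])
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        return 1.0 - clf.score(X_te, y_te)  # error rate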


SLIDE 17

Phoneme Decoding

  • Identifying phonemes from the speech signal/activation patterns: supervised classification of aligned phonemes

[Figure: phoneme decoding error rate by representation (MFCC, Conv, Rec1-Rec5); error-rate axis 0.3-0.5]

SLIDE 18
  • ABX task (Schatz et al., 2013): discriminate minimal pairs; is X closer to A or to B?

Phoneme Discrimination


A: be /bi/   B: me /mi/   X: my /maɪ/

SLIDE 19
  • ABX task (Schatz et al., 2013): discriminate minimal pairs; is X closer to A or to B?

  • A, B and X are CV syllables

Phoneme Discrimination


A: be /bi/   B: me /mi/   X: my /maɪ/

SLIDE 20
  • ABX task (Schatz et al., 2013): discriminate minimal pairs; is X closer to A or to B?
  • A, B and X are CV syllables
  • (A,B) and (B,X) are minimal pairs, but (A,X) are not (34,288 tuples in total)

Phoneme Discrimination


A: be /bi/   B: me /mi/   X: my /maɪ/
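A minimal sketch of the ABX decision rule under cosine distance; each tuple is assumed to be (A, B, X) embeddings with B the member forming a minimal pair with X, i.e. the correct answer.

    import numpy as np

    def cosine(a, b):
        # Cosine distance between two embedding vectors.
        return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    def abx_accuracy(tuples):
        # Count the tuples where X lands closer to B than to A.
        correct = sum(cosine(x, b) < cosine(x, a) for a, b, x in tuples)
        return correct / len(tuples)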

SLIDE 21

Phoneme Discrimination


Representation / ABX accuracy:
  MFCC          0.71
  Convolutional 0.73
  Recurrent 1   0.82
  Recurrent 2   0.82
  Recurrent 3   0.80
  Recurrent 4   0.76
  Recurrent 5   0.74


SLIDE 23

Phoneme Discrimination by Class


  • The task is most challenging when the target (B) and distractor (A) belong to the same phoneme class

A: be /bi/   B: me /mi/   X: my /maɪ/

SLIDE 24

Phoneme Discrimination by Class


  • The task is most challenging when the target (B) and distractor (A) belong to the same phoneme class

Vowels: i ɪ ʊ u e ɛ ə ɝ ɔɪ ɔ o aɪ æ ʌ ɑ aʊ
Approximants: j ɹ l w
Nasals: m n ŋ
Plosives: p b t d k ɡ
Fricatives: f v θ ð s z ʃ ʒ h
Affricates: tʃ dʒ

SLIDE 25

Phoneme Discrimination by Class


[Figure: ABX accuracy by representation (mfcc, conv, rec1-rec5) and phoneme class (affricate, approximant, fricative, nasal, plosive, vowel); accuracy axis 0.5-0.9]

  • The task is most challenging when the target (B) and distractor (A) belong to the same phoneme class

SLIDE 26

Organization of Phonemes

  • Agglomerative hierarchical clustering of phoneme activation vectors from the first hidden layer


SLIDE 27

Synonym Discrimination

  • Distinguishing between synonym pairs in the same context:
  • A girl looking at a photo
  • A girl looking at a picture


SLIDE 28

Synonym Discrimination

  • Distinguishing between synonym pairs in the same context:
  • A girl looking at a photo
  • A girl looking at a picture
  • Synonyms were selected using WordNet synsets:
  • The pair have the same POS tag and are interchangeable
  • The pair clearly differ in form (not donut/doughnut)
  • The more frequent token in a pair constitutes less than 95% of the occurrences
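A rough sketch of such a selection procedure with NLTK's WordNet interface (assuming the wordnet corpus is downloaded); `freq` is a hypothetical corpus frequency table, and the form-difference check is left out.

    from nltk.corpus import wordnet as wn

    def are_synonyms(w1, w2):
        # Interchangeable and same POS: the words share at least one synset.
        return bool(set(wn.synsets(w1)) & set(wn.synsets(w2)))

    def keep_pair(w1, w2, freq):
        if not are_synonyms(w1, w2):
            return False
        # The more frequent token must account for < 95% of occurrences.
        total = freq[w1] + freq[w2]
        return max(freq[w1], freq[w2]) / total < 0.95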


SLIDE 29

Synonym Discrimination


[Figure: synonym discrimination error by representation (mfcc, conv, rec1-rec5, emb) for each synonym pair; error axis 0.0-0.3]

Pairs: cut/slice, make/prepare, someone/person, photo/picture, picture/image, kid/child, photograph/picture, slice/piece, bicycle/bike, photograph/photo, couch/sofa, tv/television, vegetable/veggie, sidewalk/pavement, rock/stone, store/shop, purse/bag, assortment/variety, spot/place, pier/dock, direction/way, carpet/rug, bun/roll, large/big, small/little

SLIDE 30

Conclusion


SLIDE 31

Conclusion

  • Phoneme representations are most salient in lower layers


SLIDE 32

Conclusion

  • Phoneme representations are most salient in lower layers
  • A large amount of phonological information persists up to the top recurrent layer


SLIDE 33

Conclusion

  • Phoneme representations are most salient in lower layers
  • A large amount of phonological information persists up to the top recurrent layer
  • The attention layer filters out and significantly attenuates the encoding of phonology, and makes utterance embeddings more invariant to synonymy


SLIDE 34

Conclusion

  • Phoneme representations are most salient in lower layers
  • A large amount of phonological information persists up to the top recurrent layer
  • The attention layer filters out and significantly attenuates the encoding of phonology, and makes utterance embeddings more invariant to synonymy


Code: https://github.com/gchrupala/encoding-of-phonology

SLIDE 35

Speech Model


MFCC → Convolution → RHN #1 → RHN #2 → RHN #3 → RHN #4 → RHN #5 → Attention

Phonological form

SLIDE 36

Speech Model

Input: MFCC features extracted from the speech signal


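A minimal sketch of this featurization using librosa; beyond the 13 coefficients, the sample rate and other settings are assumptions, not the paper's exact pipeline.

    import librosa

    def mfcc_features(path):
        # Load the waveform and compute 13-dimensional MFCC frames.
        y, sr = librosa.load(path, sr=16000)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T  # (time, 13)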

SLIDE 37

Objective Function

  • Minimize the distance between all corresponding utterance-image (u,i) pairs, and maximize the distance between non-corresponding ones:


$$\sum_{u,i}\left(\sum_{u'} \max[0,\ \alpha + d(u,i) - d(u',i)] + \sum_{i'} \max[0,\ \alpha + d(u,i) - d(u,i')]\right)$$
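A sketch of this loss in PyTorch, assuming L2-normalized embeddings so that d(u,i) = 1 - cosine similarity, with the other pairs in a minibatch serving as the negatives u' and i'.

    import torch

    def contrastive_loss(U, I, alpha=0.2):
        # U, I: (n, dim) unit-norm utterance and image embeddings,
        # row j of U matching row j of I.
        S = U @ I.t()                      # S[j, k] = sim(u_j, i_k)
        pos = S.diag().view(-1, 1)         # sim(u_j, i_j) as a column
        # alpha + d(u,i) - d(u,i') = alpha + S[j,k] - S[j,j]  (impostor images)
        cost_img = torch.clamp(alpha + S - pos, min=0)
        # alpha + d(u,i) - d(u',i) = alpha + S[k,j] - S[j,j]  (impostor utterances)
        cost_utt = torch.clamp(alpha + S - pos.t(), min=0)
        # Exclude the diagonal: a pair is not its own negative.
        mask = 1.0 - torch.eye(S.size(0), device=S.device)
        return ((cost_img + cost_utt) * mask).sum()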

SLIDE 38
Experimental Setup

  • Model specifications (COCO Speech utterance encoder architecture):

  Input MFCC:    size 13
  Convolutional: size 64, length 6, stride 3
  Recurrent 1:   size 512
  Recurrent 2:   size 512
  Recurrent 3:   size 512
  Recurrent 4:   size 512
  Recurrent 5:   size 512
  Attention:     size 512

SLIDE 39

Organization of Phonemes

  • Adjusted Rand Index for the comparison of the phoneme type hierarchy induced from representations against phoneme classes:


[Figure: Adjusted Rand Index by representation (mfcc, conv, rec1-rec5); ARI axis 0.12-0.24]
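A sketch of this clustering analysis, not the authors' exact procedure: agglomerative clustering of per-phoneme activation vectors, cut to as many clusters as there are phoneme classes, scored with the Adjusted Rand Index. `vectors` and `classes` are hypothetical arrays of phoneme embeddings and their gold class labels (vowel, nasal, ...).

    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.metrics import adjusted_rand_score

    def cluster_ari(vectors, classes):
        # Agglomerative hierarchical clustering under cosine distance.
        Z = linkage(vectors, method='average', metric='cosine')
        # Cut the tree into one cluster per phoneme class.
        pred = fcluster(Z, t=len(set(classes)), criterion='maxclust')
        return adjusted_rand_score(classes, pred)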