
Encoding of Phonology in an RNN Model of Grounded Speech



  1. Encoding of Phonology in an RNN Model of Grounded Speech. Afra Alishahi, Marie Barking, Grzegorz Chrupała

  2. A Realistic Language Learning Scenario: "Two men are washing an elephant."

  3. Grounded Language Learning and Analysis of Linguistic Knowledge. Related work placed on the slide: Roy & Pentland (2002); Elman (1991); Yu & Ballard (2014); Mohamed et al. (2012); Harwath et al. (2016); Frank et al. (2013); Gelderloos & Chrupała (2016); Kádár et al. (2016); Li et al. (2016); Harwath & Glass (2017); Chrupała et al. (2017); Linzen et al. (2016); Adi et al. (2017). The current study is marked "We are here!"

  4. A Model of Grounded Speech Perception. [Diagram: an Image Model and a Speech Model both map into a Joint Semantic Space.]

  5. Joint Semantic Space. Example utterances: "a bird walks on a beam"; "bears play in water"

  6. Image Model • VGG-16 (Simonyan & Zisserman, 2014) • Image features are taken from the pre-classification layer (class labels shown in the figure: BOAT, BIRD, BOAR)

  7. Speech Model • Layer stack, from input to output: MFCC, Convolution, RHN #1 to RHN #5, Attention, projection to the joint semantic space • Attention: weighted sum of last RHN layer units • RHN: Recurrent Highway Networks (Zilly et al., 2016) • Convolution: subsampling MFCC vector
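
The attention bullet above describes a weighted sum over the units of the last RHN layer. A minimal sketch of that pooling step follows, assuming the weights come from a softmax over one scalar score per time step. The function name `attention_pool` and the `scores` input are hypothetical; the slides do not say how the scores are produced.

```python
import math

def attention_pool(states, scores):
    """Pool T state vectors into one vector by a softmax-weighted sum.

    states: list of T vectors (lists of floats), e.g. last-layer activations.
    scores: list of T unnormalised scalars (assumption: produced elsewhere
            by some learned scoring function, not shown here).
    """
    # Softmax over time, shifted by the max score for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the state vectors, one dimension at a time.
    dim = len(states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, states)) for d in range(dim)]
    return pooled, weights

# Toy input: three time steps, two units, equal scores.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(states, [0.0, 0.0, 0.0])
```

With equal scores the weights are uniform, so the result is the plain average of the time steps; unequal scores shift the utterance vector toward the higher-scored steps.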

  8. Chrupała et al., ACL 2017 • Representation of language in a model of visually grounded speech signal • Using hidden layer activations in a set of auxiliary tasks • Predicting utterance length and content, measuring representational similarity and disambiguation of homonyms • Main finding: encodings of form and meaning emerge and evolve in hidden layers of stacked RNNs processing grounded speech

  9. Current Study • Questions: how is phonology encoded in (a) MFCC features extracted from the speech signal, and (b) activations of the layers of the model? • Data: Synthetically Spoken COCO dataset • Experiments: phoneme decoding and clustering; phoneme discrimination; synonym discrimination

  10. Phoneme Decoding • Identifying phonemes from speech signal/activation patterns: supervised classification of aligned phonemes • The speech signal was aligned with its phonemic transcription using the Gentle toolkit (based on Kaldi, Povey et al., 2011)
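
The decoding setup above can be sketched as: cut the frame sequence into aligned phoneme segments, turn each segment into one feature vector, train a classifier, and report its error rate. The sketch below makes two assumptions that are not in the slides: each segment is summarised by the mean of its frames, and a nearest-centroid classifier stands in for whatever supervised classifier was actually used. All names and the toy data are hypothetical.

```python
from collections import defaultdict

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def segment_features(frames, alignment):
    """alignment: list of (phoneme, start, end) frame spans, end exclusive."""
    return [(ph, mean_vector(frames[s:e])) for ph, s, e in alignment]

def fit_centroids(examples):
    """Stand-in classifier: one mean vector per phoneme label."""
    by_label = defaultdict(list)
    for label, vec in examples:
        by_label[label].append(vec)
    return {label: mean_vector(vecs) for label, vecs in by_label.items()}

def predict(centroids, vec):
    """Return the label whose centroid is closest in squared distance."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sqdist(centroids[label], vec))

def error_rate(centroids, examples):
    wrong = sum(predict(centroids, vec) != label for label, vec in examples)
    return wrong / len(examples)

# Toy frames (MFCC vectors or layer activations) and a toy alignment.
frames = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0], [0.0, 0.0], [1.0, 1.0]]
train = segment_features(frames, [("b", 0, 2), ("i", 2, 4)])
held_out = segment_features(frames, [("b", 4, 5), ("i", 5, 6)])
centroids = fit_centroids(train)
held_out_error = error_rate(centroids, held_out)
```

Running the same pipeline on MFCC frames and on each layer's activations gives one error rate per representation, which is the kind of comparison the experiment reports.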

  11. Phoneme Decoding (results) [Figure: phoneme-decoding error rate (y-axis, ticks 0.3 to 0.5) by representation (x-axis: MFCC, Conv, Rec1, Rec2, Rec3, Rec4, Rec5).]

  12. Phoneme Discrimination • ABX task (Schatz et al., 2013): discriminate minimal pairs; is X closer to A or to B? A: be /bi/, B: me /mi/, X: my /maɪ/ • A, B and X are CV syllables • (A,B) and (B,X) are minimal pairs, but (A,X) is not (34,288 tuples in total)
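
The ABX decision above reduces to comparing two distances: a tuple counts as correct when X is closer to the target B than to the distractor A. The sketch below assumes each syllable has already been reduced to a single vector and that cosine distance is the metric; both are assumptions, and the function names and toy vectors are hypothetical.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def abx_correct(a, b, x):
    """True when X is closer to the target B than to the distractor A."""
    return cosine_distance(x, b) < cosine_distance(x, a)

def abx_accuracy(tuples):
    """Fraction of (A, B, X) tuples decided in favour of B."""
    return sum(abx_correct(a, b, x) for a, b, x in tuples) / len(tuples)

# Two toy tuples: in the first X lies near B, in the second near A.
toy_tuples = [
    ([1.0, 0.0], [0.0, 1.0], [0.1, 0.9]),
    ([1.0, 0.0], [0.0, 1.0], [0.9, 0.1]),
]
toy_accuracy = abx_accuracy(toy_tuples)
```

Chance level for this measure is 0.5, which is the baseline against which per-representation accuracies should be read.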

  13. Phoneme Discrimination (ABX accuracy by representation) • MFCC: 0.71 • Convolutional: 0.73 • Recurrent 1: 0.82 • Recurrent 2: 0.82 • Recurrent 3: 0.80 • Recurrent 4: 0.76 • Recurrent 5: 0.74

  14. Phoneme Discrimination by Class • The task is most challenging when the target (B) and distractor (A) belong to the same phoneme class. A: be /bi/, B: me /mi/, X: my /maɪ/

  15. Phoneme Discrimination by Class (phoneme classes) • Vowels: i ɪ ʊ u e ɛ ə ɚ ɔɪ ɔ o aɪ æ ʌ ɑ aʊ • Approximants: j ɹ l w • Nasals: m n ŋ • Plosives: p b t d k g • Fricatives: f v θ ð s z ʃ ʒ h • Affricates: ʧ ʤ

  16. Phoneme Discrimination by Class (results) [Figure: ABX accuracy (y-axis, ticks 0.5 to 0.9) by representation (x-axis: mfcc, conv, rec1, rec2, rec3, rec4, rec5), one line per phoneme class: affricate, approximant, fricative, nasal, plosive, vowel.]

  17. Organization of Phonemes • Agglomerative hierarchical clustering of phoneme activation vectors from the first hidden layer.
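
Agglomerative clustering starts with one cluster per phoneme and repeatedly merges the two closest clusters until one remains. The sketch below is a minimal version, assuming Euclidean distance and average linkage; the slides name neither choice, and the phoneme vectors here are invented for illustration.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def agglomerate(vectors):
    """Average-linkage agglomerative clustering.

    vectors: dict mapping a label to its vector.
    Returns the merge history as (cluster_a, cluster_b, distance) triples,
    where each cluster is a tuple of labels.
    """
    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        total = sum(euclidean(vectors[a], vectors[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(label,) for label in sorted(vectors)]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical mean activation vectors for four phonemes.
toy_vectors = {"m": [0.0, 0.0], "n": [0.1, 0.0], "i": [1.0, 1.0], "u": [1.0, 0.8]}
merges = agglomerate(toy_vectors)
```

Reading the merge history from first to last gives the tree from the leaves upward: phonemes that merge early have the most similar activation vectors.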

  18. Synonym Discrimination • Distinguishing between the two members of a synonym pair in the same context: "A girl looking at a photo" vs. "A girl looking at a picture" • Synonyms were selected using WordNet synsets: • The two words have the same POS tag and are interchangeable • The two words clearly differ in form (not donut/doughnut) • The more frequent token in a pair constitutes less than 95% of the occurrences
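
The last selection criterion above is a simple frequency check: a pair is kept only if its more frequent member accounts for less than 95% of the pair's combined occurrences. A minimal sketch, with a hypothetical function name and made-up counts:

```python
def is_balanced(count_a, count_b, max_share=0.95):
    """True when the more frequent word's share of the pair is below max_share."""
    total = count_a + count_b
    if total == 0:
        return False  # no occurrences, nothing to discriminate
    return max(count_a, count_b) / total < max_share

# Made-up counts for two hypothetical pairs.
kept = is_balanced(120, 380)     # larger share is 0.76
rejected = is_balanced(990, 10)  # larger share is 0.99
```

The threshold keeps a classifier from scoring well on a pair just by always guessing the dominant word.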

  19. Synonym Discrimination (results) [Figure: synonym-discrimination error (y-axis, ticks 0.0 to 0.3) by representation (x-axis: mfcc, conv, rec1, rec2, rec3, rec4, rec5, emb), one line per synonym pair: cut/slice, sidewalk/pavement, make/prepare, rock/stone, someone/person, store/shop, photo/picture, purse/bag, picture/image, assortment/variety, kid/child, spot/place, photograph/picture, pier/dock, slice/piece, direction/way, bicycle/bike, carpet/rug, photograph/photo, bun/roll, couch/sofa, large/big, tv/television, small/little, vegetable/veggie.]

  20. Conclusion • Phoneme representations are most salient in the lower layers • A large amount of phonological information persists up to the top recurrent layer • The attention layer significantly attenuates the encoding of phonology and makes utterance embeddings more invariant to synonymy • Code: https://github.com/gchrupala/encoding-of-phonology
