Encoding of Phonology in an RNN model of Grounded Speech

Afra Alishahi, Marie Barking, Grzegorz Chrupała

A Realistic Language Learning Scenario
• "Two men are washing an elephant."

Grounded Language Learning and Analysis of Linguistic Knowledge
• Roy & Pentland (2002); Elman (1991); Yu & Ballard (2014); Mohamed et al. (2012); Harwath et al. (2016); Frank et al. (2013); Gelderloos & Chrupala (2016); Kadar et al. (2016); Li et al. (2016); Harwath & Glass (2017); Chrupala et al. (2017); Linzen et al. (2016); Adi et al. (2017)
• We are here!

A Model of Grounded Speech Perception
• [Diagram: an Image Model and a Speech Model, both feeding a Joint Semantic Space]

Joint Semantic Space
• "a bird walks on a beam"
• "bears play in water"
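As a rough illustration of how a joint semantic space can be used, the sketch below ranks utterance embeddings by their similarity to an image embedding. The toy vectors, the helper names `cosine` and `rank_captions`, and the choice of cosine similarity are all assumptions made for this example; the slides do not state which similarity measure the model uses.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_captions(image_vec, caption_vecs):
    """Return caption keys sorted from most to least similar to the image."""
    return sorted(
        caption_vecs,
        key=lambda k: cosine(image_vec, caption_vecs[k]),
        reverse=True,
    )

# Toy 3-dimensional embeddings, invented for illustration only.
image_of_bird = [0.9, 0.1, 0.0]
captions = {
    "a bird walks on a beam": [0.8, 0.2, 0.1],
    "bears play in water": [0.1, 0.9, 0.3],
}
ranking = rank_captions(image_of_bird, captions)
```

In a space like this, an utterance and an image that describe the same scene should end up close together, so the matching caption is ranked first.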

Image Model
• VGG-16: Simonyan & Zisserman (2014)
• Pre-classification layer (class labels shown in the figure: BOAT, BIRD, BOAR)

Speech Model
• Project to the joint semantic space
• Attention: weighted sum of last RHN layer units
• RHN: Recurrent Highway Networks (Zilly et al., 2016)
• Convolution: subsampling MFCC vector
• [Diagram, bottom to top: MFCC, Convolution, RHN #1, RHN #2, RHN #3, RHN #4, RHN #5, Attention]
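The attention step, described on the slide as a weighted sum of the last RHN layer's units, can be sketched as follows. This is a minimal illustration only: the scoring vector `w`, the dot-product scoring, and the function names are assumptions, since the slide does not say how the attention weights are computed.

```python
import math

def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(states, w):
    """Collapse a sequence of state vectors into a single vector.

    states: list of T vectors (one per time step), each of length D
    w: hypothetical scoring vector of length D (an assumption of this sketch)
    """
    # One scalar score per time step, then normalized weights.
    scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in states]
    alphas = softmax(scores)
    dim = len(states[0])
    # Weighted sum over time steps, computed per dimension.
    pooled = [sum(a * h[d] for a, h in zip(alphas, states)) for d in range(dim)]
    return pooled, alphas

states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, alphas = attention_pool(states, w=[0.5, -0.5])
```

The result is a single fixed-length vector regardless of the number of time steps, which is what allows a variable-length utterance to be projected into the joint space.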

Chrupała et al., ACL 2017
• Representation of language in a model of visually grounded speech signal
• Using hidden layer activations in a set of auxiliary tasks
• Predicting utterance length and content, measuring representational similarity and disambiguation of homonyms
• Main findings:
  • Encodings of form and meaning emerge and evolve in hidden layers of stacked RNNs processing grounded speech

Current Study
• Questions: how is phonology encoded in
  • MFCC features extracted from speech signal?
  • activations of the layers of the model?
• Data: Synthetically Spoken COCO dataset
• Experiments:
  • Phoneme decoding and clustering
  • Phoneme discrimination
  • Synonym discrimination

Phoneme Decoding
• Identifying phonemes from speech signal/activation patterns: supervised classification of aligned phonemes
• Speech signal was aligned with phonemic transcription using the Gentle toolkit (based on Kaldi, Povey et al., 2011)
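The decoding setup can be pictured with a toy example: each aligned phoneme is represented by a feature vector, and a supervised model predicts its label. The nearest-centroid classifier below is a deliberately simple stand-in chosen for this sketch, not the classifier used in the study, and all vectors and labels are invented.

```python
from collections import defaultdict

def fit_centroids(vectors, labels):
    """Compute the mean vector of each phoneme label."""
    sums = {}
    counts = defaultdict(int)
    for vec, lab in zip(vectors, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
        sums[lab] = [s + x for s, x in zip(sums[lab], vec)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in total] for lab, total in sums.items()}

def predict(centroids, vec):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, vec))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab]))

def error_rate(centroids, vectors, labels):
    """Fraction of items whose predicted label differs from the true label."""
    wrong = sum(predict(centroids, v) != y for v, y in zip(vectors, labels))
    return wrong / len(labels)

# Invented 2-d features standing in for MFCC vectors or layer activations.
train_x = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
train_y = ["b", "b", "m", "m"]
test_x = [[0.1, 0.1], [1.0, 1.0], [0.4, 0.4]]
test_y = ["b", "m", "m"]

centroids = fit_centroids(train_x, train_y)
err = error_rate(centroids, test_x, test_y)
```

Running the same procedure on features taken from each representation (MFCC, convolutional layer, each recurrent layer) would give one error rate per representation, which is the kind of comparison the experiment reports.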

Phoneme Decoding
• [Plot: error rate (y-axis) by representation (x-axis): MFCC, Conv, Rec1, Rec2, Rec3, Rec4, Rec5]

Phoneme Discrimination
• ABX task (Schatz et al., 2013): discriminate minimal pairs; is X closer to A or to B?
  • A: be /bi/
  • B: me /mi/
  • X: my /maɪ/
• A, B and X are CV syllables
• (A,B) and (B,X) are minimal pairs, but (A,X) is not (34,288 tuples in total)
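The ABX decision rule can be sketched in a few lines. The example assumes that each syllable has already been reduced to a single fixed-length vector and that Euclidean distance is the comparison measure; neither detail is given on the slide, and the vectors are invented.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def abx_correct(a, b, x, dist=euclidean):
    """True if X is closer to the target B than to the distractor A."""
    return dist(x, b) < dist(x, a)

def abx_accuracy(tuples, dist=euclidean):
    """Fraction of (A, B, X) tuples for which B is chosen."""
    hits = sum(abx_correct(a, b, x, dist) for a, b, x in tuples)
    return hits / len(tuples)

# Invented 2-d vectors standing in for syllable representations.
be = [0.0, 0.0]  # A: be
me = [1.0, 0.0]  # B: me
my = [1.0, 1.0]  # X: my
tuples = [
    (be, me, my),                          # X is nearer to B: counted as correct
    ([0.0, 1.0], [1.0, 0.0], [0.1, 0.9]),  # X is nearer to A: counted as an error
]
acc = abx_accuracy(tuples)
```

Computing this accuracy separately for each representation gives a per-layer measure of how well the representation separates minimal pairs.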

Phoneme Discrimination
Representation    Accuracy
MFCC              0.71
Convolutional     0.73
Recurrent 1       0.82
Recurrent 2       0.82
Recurrent 3       0.80
Recurrent 4       0.76
Recurrent 5       0.74

Phoneme Discrimination by Class
• The task is most challenging when the target (B) and distractor (A) belong to the same phoneme class
  • A: be /bi/
  • B: me /mi/
  • X: my /maɪ/

Phoneme Discrimination by Class
Class           Phonemes
Vowels          i ɪ ʊ u e ɛ ə ɚ ɔɪ ɔ o aɪ æ ʌ ɑ aʊ
Approximants    j ɹ l w
Nasals          m n ŋ
Plosives        p b t d k g
Fricatives      f v θ ð s z ʃ ʒ h
Affricates      tʃ dʒ

Phoneme Discrimination by Class
• [Plot: accuracy (y-axis) by representation (x-axis): mfcc, conv, rec1, rec2, rec3, rec4, rec5; one line per phoneme class: affricate, approximant, fricative, nasal, plosive, vowel]

Organization of Phonemes
• Agglomerative hierarchical clustering of phoneme activation vectors from the first hidden layer
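For readers unfamiliar with the technique, the following is a minimal sketch of agglomerative hierarchical clustering over per-phoneme vectors. Average linkage and Euclidean distance are assumptions of this example (the slide names neither), and the 2-d vectors are invented rather than real activations.

```python
import math
from itertools import combinations

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def average_linkage(c1, c2, vectors):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(vectors):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [(name,) for name in vectors]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], vectors),
        )
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges

# Invented vectors: two nasals, two fricatives, one vowel.
phoneme_vectors = {
    "m": [0.0, 0.0],
    "n": [0.1, 0.0],
    "s": [1.0, 1.0],
    "z": [1.0, 1.2],
    "i": [3.0, 0.0],
}
merges = agglomerate(phoneme_vectors)
```

The merge history is what a dendrogram visualizes: phonemes that merge early are close in the representation space, so seeing phonemes of the same class merge first suggests that the layer groups them together.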

Synonym Discrimination
• Distinguishing between synonym pairs in the same context:
  • A girl looking at a photo
  • A girl looking at a picture
• Synonyms were selected using WordNet synsets:
  • The pair have the same POS tag and are interchangeable
  • The pair clearly differ in form (not donut/doughnut)
  • The more frequent token in a pair constitutes less than 95% of the occurrences.
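The frequency criterion in the last bullet can be expressed as a small filter. This is a sketch under stated assumptions: the function name `is_balanced`, the counts, and the second word pair are hypothetical and do not come from the dataset.

```python
def is_balanced(count_a, count_b, max_share=0.95):
    """True if the more frequent word makes up less than max_share of the pair's occurrences."""
    total = count_a + count_b
    if total == 0:
        return False
    return max(count_a, count_b) / total < max_share

# Invented corpus counts, for illustration only.
counts = {
    ("photo", "picture"): (400, 600),
    ("rare_word", "common_word"): (10, 990),  # hypothetical pair
}
kept = [pair for pair, (a, b) in counts.items() if is_balanced(a, b)]
```

A filter like this keeps a classifier from scoring well by always guessing the more frequent member of a pair.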

Synonym Discrimination
• [Plot: error (y-axis) by representation (x-axis): mfcc, conv, rec1, rec2, rec3, rec4, rec5, emb; one line per synonym pair]
• Pairs: cut.slice, sidewalk.pavement, make.prepare, rock.stone, someone.person, store.shop, photo.picture, purse.bag, picture.image, assortment.variety, kid.child, spot.place, photograph.picture, pier.dock, slice.piece, direction.way, bicycle.bike, carpet.rug, photograph.photo, bun.roll, couch.sofa, large.big, tv.television, small.little, vegetable.veggie

Conclusion
• Phoneme representations are most salient in lower layers
• Large amount of phonological information persists up to the top recurrent layer
• The attention layer filters out and significantly attenuates encoding of phonology and makes utterance embeddings more invariant to synonymy

Code: https://github.com/gchrupala/encoding-of-phonology
