Learning from unlabelled speech, with and without visual cues

Ohio State University, May 2017
Herman Kamper
Toyota Technological Institute at Chicago
http://www.kamperh.com/

Success in speech recognition


  1. Large-vocabulary: Xitsonga
     [Figure: bar chart of F-score (%) for token, type and boundary evaluation. Systems: ZRSBaselineUTD (SI), UTDGraphCC (SI), SyllableSegOsc+ (SD), BayesSegMinDur-MFCC (SD), BayesSegMinDur-cAE (SI).]
     ZRSBaselineUTD: [Versteegh et al., IS’15]. UTDGraphCC: [Lyzinski et al., IS’15]. SyllableSegOsc+: [Räsänen et al., IS’15]. BayesSeg: [Kamper et al., arXiv’16].

  2. Listen to discovered clusters
     • Data for small-vocabulary experiments: [audio]
     • Small-vocabulary cluster 45: [audio]
     • Large-vocabulary English cluster 1214: [audio]
     • Large-vocabulary Xitsonga cluster 629: [audio]

  3. The true (less rosy) picture
     [Figure: word embedding from cluster 33 (→ one) and embeddings close to it (non-word segments), plotted over embedding dimensions.]
     [Levin et al., ASRU’13]; [Kamper et al., ICASSP’16]; [Settle and Livescu, SLT’16]

  4. Using visual cues to learn from untranscribed speech: Visually Grounded Keyword Prediction

  5. Using visual cues to learn from untranscribed speech: Visually Grounded Keyword Prediction
     Shane Settle, Greg Shakhnarovich, Karen Livescu

  6. Arrival

  7. Using images for grounding language

  8. Using images for grounding language
     • Image captioning: Generate written natural language description of a given image [Vinyals et al., CVPR’15]
     • Grounding written language using images [Bernardi et al., JAIR’16]

  9. Using images for grounding language
     • Image captioning: Generate written natural language description of a given image [Vinyals et al., CVPR’15]
     • Grounding written language using images [Bernardi et al., JAIR’16]
     • We consider images paired with unlabelled spoken captions: [audio]

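  10. Map images and speech into common space

  11. Map images and speech into common space
      [Figure: the image passes through VGG to give $y_{\mathrm{vis}}$; the speech input $X$ passes through conv, max-pooling and feedforward layers to give $y_{\mathrm{spch}}$; the two are compared with a distance $d(y_{\mathrm{vis}}, y_{\mathrm{spch}})$.]
      [Harwath et al., NIPS’16]

  12. Retrieval in common (semantic) space
      $y \in \mathbb{R}^D$ in $D$-dimensional space
      [Figure: $y_{\mathrm{vis}}$ and $y_{\mathrm{spch}}$ as points in the common space.]
      [Harwath et al., NIPS’16]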
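
A minimal sketch of retrieval in a common embedding space. The slides only name a distance $d(y_{\mathrm{vis}}, y_{\mathrm{spch}})$; the Euclidean distance used here, the helper names `euclidean` and `retrieve`, and the toy embedding values are all assumptions for illustration, not the method of the cited work.

```python
import math

def euclidean(u, v):
    """Distance between two embeddings of the same dimension D (assumed metric)."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def retrieve(query, candidates, top_k=1):
    """Return the indices of the top_k candidates closest to the query."""
    order = sorted(range(len(candidates)),
                   key=lambda i: euclidean(query, candidates[i]))
    return order[:top_k]

# Toy 3-dimensional embeddings (made-up numbers, not model outputs).
y_spch = [0.9, 0.1, 0.0]
image_embeddings = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.5],
]
ranked = retrieve(y_spch, image_embeddings, top_k=2)
```

The same function works in the other direction (an image embedding as the query over speech embeddings), since both modalities live in the same space.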

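  13. Can we use (supervised) vision model to get labels?
      [Figure: VGG image branch giving $y_{\mathrm{vis}}$ and conv, max-pooling and feedforward speech branch giving $y_{\mathrm{spch}}$, compared with a distance $d(y_{\mathrm{vis}}, y_{\mathrm{spch}})$.]
      Cannot obtain textual labels for the speech using this model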

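  14. Word prediction from images and speech

  15. Word prediction from images and speech
      [Figure: VGG maps the image to $y_{\mathrm{vis}}$, with one output per word ("man", "hat", "shirt").]
      [Kamper et al., arXiv’17]

  16. Word prediction from images and speech
      [Figure: VGG maps the image to $y_{\mathrm{vis}}$, with word outputs "man", "hat", "shirt" and the probabilities 0.85, 0.8 and 0.9 shown next to them.]
      [Kamper et al., arXiv’17]

  17. Word prediction from images and speech
      [Figure: VGG maps the image to $y_{\mathrm{vis}}$ (words "man", "hat", "shirt"); the speech input $X$ passes through conv, max-pooling and feedforward layers to give $f(X)$.]
      [Kamper et al., arXiv’17]

  18. Word prediction from images and speech
      [Figure: VGG maps the image to $y_{\mathrm{vis}}$ (words "man", "hat", "shirt"); the speech input $X$ passes through conv, max-pooling and feedforward layers to give $f(X)$; a loss $L$ compares $f(X)$ with $y_{\mathrm{vis}}$.]
      [Kamper et al., arXiv’17]

  19. Word prediction from images and speech
      [Figure: speech network only; the input $X$ passes through conv, max-pooling and feedforward layers to give $f(X)$.]
      [Kamper et al., arXiv’17]

  20. Word prediction from images and speech
      [Figure: speech network only; the input $X$ passes through conv, max-pooling and feedforward layers to give $f(X)$, with word outputs "man", "hat".]
      [Kamper et al., arXiv’17]

  21. Word prediction from images and speech
      [Figure: speech network only; the input $X$ passes through conv, max-pooling and feedforward layers to give $f(X)$, with word outputs "man", "hat".]
      $f(X) \in \mathbb{R}^W$ is vector of word probabilities
      [Kamper et al., arXiv’17]

  22. Word prediction from images and speech
      [Figure: speech network only; the input $X$ passes through conv, max-pooling and feedforward layers to give $f(X)$, with word outputs "man", "hat".]
      $f(X) \in \mathbb{R}^W$ is vector of word probabilities
      I.e., a spoken bag-of-words (BoW) classifier
      [Kamper et al., arXiv’17]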

  23. Word prediction from images and speech
      Vision system outputs $y_{\mathrm{vis}}$, giving probability of word $w$ for image $I$:
      $y_{\mathrm{vis},w} = P(w \mid I, \gamma)$

  24. Word prediction from images and speech
      Vision system outputs $y_{\mathrm{vis}}$, giving probability of word $w$ for image $I$:
      $y_{\mathrm{vis},w} = P(w \mid I, \gamma)$
      Interpret dimension $w$ of the speech network output $f(X)$ as:
      $f_w(X) = P(w \mid X, \theta)$

  25. Word prediction from images and speech
      Vision system outputs $y_{\mathrm{vis}}$, giving probability of word $w$ for image $I$:
      $y_{\mathrm{vis},w} = P(w \mid I, \gamma)$
      Interpret dimension $w$ of the speech network output $f(X)$ as:
      $f_w(X) = P(w \mid X, \theta)$
      Train using cross-entropy loss (i.e. soft targets):
      $L(f(X), y_{\mathrm{vis}}) = -\sum_{w=1}^{W} \{ y_{\mathrm{vis},w} \log f_w(X) + (1 - y_{\mathrm{vis},w}) \log [1 - f_w(X)] \}$

  26. Word prediction from images and speech
      Vision system outputs $y_{\mathrm{vis}}$, giving probability of word $w$ for image $I$:
      $y_{\mathrm{vis},w} = P(w \mid I, \gamma)$
      Interpret dimension $w$ of the speech network output $f(X)$ as:
      $f_w(X) = P(w \mid X, \theta)$
      Train using cross-entropy loss (i.e. soft targets):
      $L(f(X), y_{\mathrm{vis}}) = -\sum_{w=1}^{W} \{ y_{\mathrm{vis},w} \log f_w(X) + (1 - y_{\mathrm{vis},w}) \log [1 - f_w(X)] \}$
      If $y_{\mathrm{vis},w} \in \{0, 1\}$, this is summed log loss of $W$ binary classifiers.
      [Kamper et al., arXiv’17]
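
The cross-entropy loss with soft targets can be sketched in a few lines of plain Python. The function name `soft_target_cross_entropy`, the `eps` clamp (added only to avoid `log(0)`), and the example probabilities are assumptions for illustration; the slides give only the formula.

```python
import math

def soft_target_cross_entropy(f_x, y_vis, eps=1e-12):
    """Summed binary cross-entropy between speech outputs f_w(X) and
    vision soft targets y_vis,w, one term per word w = 1..W."""
    if len(f_x) != len(y_vis):
        raise ValueError("f_x and y_vis must both have W entries")
    loss = 0.0
    for f_w, y_w in zip(f_x, y_vis):
        # Clamp the prediction away from 0 and 1 so both logs are finite.
        f_w = min(max(f_w, eps), 1.0 - eps)
        loss -= y_w * math.log(f_w) + (1.0 - y_w) * math.log(1.0 - f_w)
    return loss

# Soft targets from a (hypothetical) vision system for W = 3 words.
y_vis = [0.85, 0.8, 0.9]
loss_close = soft_target_cross_entropy([0.8, 0.8, 0.9], y_vis)
loss_far = soft_target_cross_entropy([0.1, 0.1, 0.1], y_vis)

# With hard targets in {0, 1} this is the summed log loss of W binary classifiers.
loss_hard = soft_target_cross_entropy([0.5, 0.5], [1, 0])
```

Speech outputs that agree with the vision targets give a lower loss than outputs that disagree, which is what drives training without any transcriptions.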

  27. Images paired with untranscribed speech
      We are still in this setting:
      • I.e., we do not use any of the speech transcriptions during model training (only for evaluation)
      • But our resulting model can make bag-of-words (BoW) predictions

  28. The vision system
      [Figure: VGG maps the image to $y_{\mathrm{vis}}$, a vector of $W$ word outputs ("man", "hat", "shirt").]
      • VGG-16 input layers (1.3M images) [Simonyan and Zisserman, arXiv’14]
      • Train on Flickr30k (caption BoW labels)
      • Targets: $W = 1000$ most common word types after removing stop words
      • Note: Vision system could be seen as language independent (future work)
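
A rough sketch of how bag-of-words targets over the most common word types might be built from written captions. The stop-word list, the whitespace tokenisation, the helper names and the three toy captions are all hypothetical; the slides state only that the targets are the $W = 1000$ most common word types after removing stop words.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for the example.
STOP_WORDS = {"a", "an", "the", "in", "on", "is", "of"}

def build_vocabulary(captions, size):
    """Return the `size` most common word types, ignoring stop words."""
    counts = Counter(
        word
        for caption in captions
        for word in caption.lower().split()
        if word not in STOP_WORDS
    )
    return [word for word, _ in counts.most_common(size)]

def bow_target(caption, vocabulary):
    """Binary bag-of-words vector: 1 if the vocabulary word occurs in the caption."""
    present = set(caption.lower().split())
    return [1 if word in present else 0 for word in vocabulary]

captions = [
    "a man on a bicycle",
    "a man in a hat",
    "a dog in the grass",
]
vocabulary = build_vocabulary(captions, size=3)  # the slides use W = 1000
target = bow_target("a man in a hat", vocabulary)
```

Each caption then yields one binary vector of length $W$, which is the kind of label a multi-label image classifier can be trained on.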

  29. Experimental details
      • Data: 8000 images with 5 spoken captions, divided into train, development and test sets [Harwath and Glass, ASRU’15]
      • Prediction: Output words $w$ where $f_w(X) > \alpha$
      • Tasks: Spoken bag-of-words prediction; keyword spotting
      • Evaluation: Compare to words in transcriptions of test data
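
The prediction and evaluation steps can be sketched as follows. The threshold rule $f_w(X) > \alpha$ is from the slide; the helper names, the five-word vocabulary, the scores, and the choice to restrict the reference words to the vocabulary are assumptions made for this example.

```python
def predict_bow(f_x, vocabulary, alpha):
    """Output every word w whose score f_w(X) exceeds the threshold alpha."""
    return {word for word, p in zip(vocabulary, f_x) if p > alpha}

def precision_recall(predicted, reference):
    """Precision and recall of predicted words against reference words."""
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall

vocabulary = ["bicycle", "bike", "man", "riding", "dog"]
f_x = [0.9, 0.6, 0.8, 0.5, 0.1]  # made-up speech network outputs

# Reference: transcription words that are in the vocabulary (an assumption).
transcription = "man on bicycle is doing tricks in an old building"
reference = set(transcription.split()) & set(vocabulary)

predicted = predict_bow(f_x, vocabulary, alpha=0.4)
precision, recall = precision_recall(predicted, reference)
```

Raising $\alpha$ makes the model output fewer words, which typically trades recall for precision; in this toy case `alpha=0.7` keeps only "bicycle" and "man".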

  30. Task 1: Spoken bag-of-words prediction
      Input utterance | Predicted BoW labels
      [audio] |

  31. Task 1: Spoken bag-of-words prediction
      Input utterance | Predicted BoW labels
      [audio] | bicycle, bike, man, riding, wearing

  32. Task 1: Spoken bag-of-words prediction
      Input utterance | Predicted BoW labels
      man on bicycle is doing tricks in an old building | bicycle, bike, man, riding, wearing

  33. Task 1: Spoken bag-of-words prediction
      Input utterance | Predicted BoW labels
      man on bicycle is doing tricks in an old building | bicycle, bike, man, riding, wearing
      a little girl is climbing a ladder | child, girl, little, young
      a rock climber standing in a crevasse | climbing, man, rock
      a dog running in the grass around sheep | dog, field, grass, running
      a man in a miami basketball uniform looking to the right | ball, basketball, man, player, uniform, wearing

  35. Task 1: Spoken bag-of-words prediction
      [Figure: bar chart of precision (%) at α = 0.4 and α = 0.7 for the Unigram baseline, VisionSpeechCNN and OracleSpeechCNN.]

  39. Task 1: Spoken bag-of-words prediction
      False alarm keywords and words in corresponding utterances

  40. Task 1: Spoken bag-of-words prediction
      False alarm keywords and words in corresponding utterances:
      [Figure: word groups, in extraction order: running playing ocean white dogs three lake two boy two dog water mouth small biker white dogs rides two ball dirt red brown riding snowboarder jumping wearing person snowy men man hill air air man snow standing women child three little girls girl boy two two young woman running bicycle grassy ramp white small biker dogs blue two bike grass]

  41. Task 2: Keyword spotting
      Keyword | Example of matched utterance | Type
      Example shown: [audio] (one of top 10)
      Keywords listed: beach, behind, bike, boys, large, play, sitting, yellow, young

  42. Task 2: Keyword spotting
      Keyword | Example of matched utterance | Type
      beach | a boy in a yellow shirt is walking on a beach . . . |
      Other keywords listed: behind, bike, boys, large, play, sitting, yellow, young

  43. Task 2: Keyword spotting
      Keyword | Example of matched utterance | Type
      beach | a boy in a yellow shirt is walking on a beach . . . | correct
      Other keywords listed: behind, bike, boys, large, play, sitting, yellow, young

  44. Task 2: Keyword spotting
      Keyword | Example of matched utterance | Type
      beach | a boy in a yellow shirt is walking on a beach . . . | correct
      behind | a surfer does a flip on a wave |
      Other keywords listed: bike, boys, large, play, sitting, yellow, young

  45. Task 2: Keyword spotting
      Keyword | Example of matched utterance | Type
      beach | a boy in a yellow shirt is walking on a beach . . . | correct
      behind | a surfer does a flip on a wave | mistake
      Other keywords listed: bike, boys, large, play, sitting, yellow, young

  46. Task 2: Keyword spotting
      Keyword | Example of matched utterance | Type
      beach | a boy in a yellow shirt is walking on a beach . . . | correct
      behind | a surfer does a flip on a wave | mistake
      bike | a dirt biker flies through the air |
      Other keywords listed: boys, large, play, sitting, yellow, young

  47. Task 2: Keyword spotting
      Keyword | Example of matched utterance | Type
      beach | a boy in a yellow shirt is walking on a beach . . . | correct
      behind | a surfer does a flip on a wave | mistake
      bike | a dirt biker flies through the air | variant
      Other keywords listed: boys, large, play, sitting, yellow, young

  48. Task 2: Keyword spotting
      Keyword | Example of matched utterance | Type
      beach | a boy in a yellow shirt is walking on a beach . . . | correct
      behind | a surfer does a flip on a wave | mistake
      bike | a dirt biker flies through the air | variant
      boys | [audio] |
      Other keywords listed: large, play, sitting, yellow, young

  49. Task 2: Keyword spotting
      Keyword | Example of matched utterance | Type
      beach | a boy in a yellow shirt is walking on a beach . . . | correct
      behind | a surfer does a flip on a wave | mistake
      bike | a dirt biker flies through the air | variant
      boys | two children play soccer in the park |
      Other keywords listed: large, play, sitting, yellow, young
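
One plausible way to turn the word scores into a keyword spotter is to rank utterances by $f_w(X)$ for the keyword $w$ and inspect the top of the list. This is a sketch under that assumption: the slides mention "one of top 10" but do not spell out the ranking procedure, and the function name, utterance ids and scores below are invented for illustration.

```python
def spot_keyword(keyword, vocabulary, scores, top_k=10):
    """Rank utterance ids by the keyword's score f_w(X), highest first."""
    w = vocabulary.index(keyword)  # raises ValueError for an unknown keyword
    ranked = sorted(scores, key=lambda utt_id: scores[utt_id][w], reverse=True)
    return ranked[:top_k]

vocabulary = ["beach", "bike"]

# Made-up network outputs f(X), one score per vocabulary word.
scores = {
    "utt_a": [0.9, 0.1],
    "utt_b": [0.2, 0.7],
    "utt_c": [0.6, 0.3],
}

top_beach = spot_keyword("beach", vocabulary, scores, top_k=2)
top_bike = spot_keyword("bike", vocabulary, scores, top_k=1)
```

Each returned utterance can then be checked against its transcription and labelled in the way the table does: an exact match, a variant of the keyword, or a mistake.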
