Visually grounded cross-lingual keyword spotting in speech

Visually grounded cross-lingual keyword spotting in speech. SLTU, August 2018. Herman Kamper (E&E Engineering, Stellenbosch University, South Africa) and Michael Roth (Saarland University, Germany). http://www.kamperh.com/


  1. Visually grounded cross-lingual keyword spotting in speech
     SLTU, August 2018
     Herman Kamper¹ and Michael Roth²
     ¹E&E Engineering, Stellenbosch University, South Africa; ²Saarland University, Germany
     http://www.kamperh.com/

  5. Advances in speech recognition
     • Addiction to labels: 2000 hours transcribed speech audio; ∼350M/560M words text [Xiong et al., TASLP’17]
     • Very different from the “supervision” infants use to learn language
     • Sometimes not possible, e.g., for unwritten languages

  9. Images as weak labels for speech
     Can we use images as weak labels in low-resource settings?
     • Maybe we cannot use this type of data for full ASR, but maybe it can be used for other tasks?
     • Goal: Use this type of data for cross-lingual keyword spotting

  10. Cross-lingual keyword spotting
      [Figure: a written query, “burning” (English), is used to search a Swahili speech corpus; the matching speech is labelled “kuwaka”.]

  15. Cross-lingual word prediction from images
      [Figure: an image is passed through VGG to give visual tag scores $y_{\mathrm{vis}}$ (example values 0.85, 0.8, 0.9). The paired speech $X$ is passed through a network with conv, max and feedfwd layers to give $f(X)$, and a loss $\ell$ compares $f(X)$ with $y_{\mathrm{vis}}$.]
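
The slide names a loss $\ell$ between the speech network output $f(X)$ and the visual targets $y_{\mathrm{vis}}$ but does not say which loss is used. As a minimal sketch, assuming a summed binary cross-entropy over the $W$ words (an assumption, as is the function name `summed_cross_entropy`), the comparison could look like this:

```python
import math


def summed_cross_entropy(f_x, y_vis, eps=1e-12):
    """Summed binary cross-entropy between speech outputs and soft visual targets.

    ASSUMPTION: the slides only show a loss symbol; this particular loss is
    an illustrative choice, not confirmed by the source.
    """
    if len(f_x) != len(y_vis):
        raise ValueError("f_x and y_vis must both have length W")
    loss = 0.0
    for p, y in zip(f_x, y_vis):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        loss -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return loss


# Soft targets taken from the example scores on the slide.
y_vis = [0.85, 0.8, 0.9]
close = summed_cross_entropy([0.8, 0.8, 0.9], y_vis)  # outputs near the targets
far = summed_cross_entropy([0.1, 0.2, 0.1], y_vis)    # outputs far from the targets
print(close < far)
```

With soft targets, this loss is smallest when the speech network reproduces the visual scores exactly, which is what lets the image branch act as a weak label.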

  20. Cross-lingual word prediction from images
      [Figure: Swahili speech $X$ is passed through the conv, max and feedfwd layers to give $f(X)$; the image branch is no longer shown.]
      • $f(X) \in \mathbb{R}^W$ is a vector of word probabilities
      • I.e., a cross-lingual spoken bag-of-words (BoW) classifier
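
To make the output $f(X) \in \mathbb{R}^W$ concrete, here is a rough sketch of the last stages of such a classifier: max pooling over time followed by one feedforward layer. The convolutional layers are omitted (`frames` stands in for their output), and the sigmoid output, the single layer and all names and sizes are assumptions for illustration rather than the authors' architecture.

```python
import math


def max_over_time(frames):
    """Pool T feature vectors (each of size D) into one vector of size D."""
    return [max(column) for column in zip(*frames)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def word_probabilities(frames, weights, biases):
    """Map a variable-length utterance to W independent word probabilities.

    ASSUMPTION: one feedforward layer with sigmoid outputs; the slides only
    name the layer types, not their number, order or sizes.
    """
    pooled = max_over_time(frames)
    return [
        sigmoid(sum(w * h for w, h in zip(row, pooled)) + b)
        for row, b in zip(weights, biases)
    ]


# Hypothetical utterance: T = 3 frames of D = 2 features, vocabulary of W = 3 words.
frames = [[0.1, 0.4], [0.9, 0.2], [0.3, 0.7]]
weights = [[1.0, -1.0], [0.5, 0.5], [-2.0, 1.0]]
biases = [0.0, -0.5, 0.1]
f_x = word_probabilities(frames, weights, biases)
print(len(f_x))
```

Because the pooling collapses the time axis, utterances of any length map to a vector of the same size $W$, which is what makes the bag-of-words view possible.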

  24. Experimental details
      • Goal: Use visual grounding for cross-lingual keyword spotting
      • Proof-of-concept: Use English speech with German queries
      [Figure: cross-lingual keyword spotter. An image $I$ is passed through VGG-16 to give German (text) tag scores $\hat{y}_{\mathrm{de}}$ (e.g., “Feld”). The paired English speech $X$ is passed through conv, max and feedfwd layers to give $f(X)$, and a loss $\ell$ compares the two.]
      • Data: 8000 images with 5 English spoken captions (∼37 hours)
      • Weak labels: German visual tagger trained on German Multi30k

  25. Predictions on test data
      Given German keyword: ‘Hunde’
      [Figure: an English speech collection (the one we want to search) with outputs $f(X_1)$, $f(X_2)$, $f(X_3)$; the keyword corresponds to dimension $w$.]
      $f_w(X_i) = P_\theta(w \mid X_i)$: score for whether (English) speech $X_i$ contains a translation of the given (German) keyword $w$
      Evaluation: Does the predicted keyword occur in the reference translation?
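
The retrieval step described on this slide can be sketched as sorting utterances by the score in dimension $w$. The utterance ids, the two-word vocabulary and the scores below are made up for illustration, and `rank_utterances` is a hypothetical helper name.

```python
def rank_utterances(scores, keyword_index, top_k=10):
    """Return up to top_k utterance ids, highest f_w(X_i) first.

    scores maps an utterance id to its output vector f(X_i);
    keyword_index is the dimension w of the query keyword.
    """
    ranked = sorted(
        scores, key=lambda utt: scores[utt][keyword_index], reverse=True
    )
    return ranked[:top_k]


# Hypothetical vocabulary and network outputs for three English utterances.
vocabulary = ["Hunde", "Fahrrad"]
scores = {
    "utt1": [0.9, 0.1],
    "utt2": [0.2, 0.7],
    "utt3": [0.6, 0.3],
}
w = vocabulary.index("Hunde")
print(rank_utterances(scores, w))  # ['utt1', 'utt3', 'utt2']
```

Only the German keyword's index is needed at query time; no English transcription or translation of the speech is involved.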

  34. Example predictions (top retrievals)
      Task: Given written German keyword, find utterances in an unseen English speech collection containing that keyword
      Input: Fahrrad
      Output (in top 10):
      • man riding a bicycle on a foggy day
      • a biker does a trick on a ramp
      • a person is doing tricks on a bicycle in a city
      Input: Straße (street)
      Output (in top 10):
      • a woman in black and red listens to an ipod walks down the street
      • people on the city street walk past a puppet theater
      • an asian woman rides a bicycle in front of two cars

  35. Cross-lingual keyword spotting performance
      [Figure: bar chart of P@10 (%) for DETextPrior, DEVisionCNN, XVisionSpeechCNN and XBoWCNN.]
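
The chart reports P@10, the fraction of the ten highest-scoring utterances that are correct for a keyword. A minimal sketch follows, assuming an utterance counts as correct when the keyword occurs in its reference translation (the evaluation criterion on the earlier slide); the ids and the helper name `precision_at_k` are hypothetical.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k retrieved utterances that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for utt in ranked_ids[:k] if utt in relevant_ids)
    return hits / k


# Hypothetical ranking for one keyword, and the set of utterances whose
# reference translation contains that keyword.
ranked = ["utt1", "utt3", "utt2", "utt4"]
relevant = {"utt1", "utt2"}
print(precision_at_k(ranked, relevant, k=4))  # 0.5
```

A single score for a system would then be an average of this value over all keywords; the slides do not say how that averaging is done.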

  41. Example predictions marked as errors
      Input: Feld (field)
      Output:
      • a brown and black dog running through a grassy field ∗
      Input: grün(en) (green)
      Output:
      • a brown dog is chasing a red frisbee across a grassy field ∗
      Input: groß(en) (big)
      Output:
      • a small group of people sitting together outside ∗

  42. Error analysis by annotator
      [Figure: bar chart of P@10 (%) for DETextPrior, DEVisionCNN, XVisionSpeechCNN and XBoWCNN.]

  45. Cross-lingual keyword spotting
      [Figure: written query “burning” (English); Swahili labels “moto” and “kuwaka”.]

  46. Conclusions and future work
      • Visual grounding makes it possible to perform cross-lingual keyword spotting without any parallel speech and text or translations
