
Cross-lingual topic prediction for speech using translations

  1. Cross-lingual topic prediction for speech using translations. Sameer Bansal, Herman Kamper, Adam Lopez, Sharon Goldwater

  2. Automated speech-to-text: translation, information retrieval

  3. Current systems. English audio: ? Downstream task: translation, IR

  4. Current systems. English audio: “Where is the nearest hospital?” Automatic Speech Recognition: English text. Downstream task: translation, IR

  5. ~100 languages supported by Google Translate ...

  6. Unwritten languages. Mboshi audio: ASR --- Mboshi text. Aikuma: Bird et al. 2014; LIG-Aikuma: Blachon et al. 2016; Godard et al. 2018 ● ~3,000 languages with no writing system ● Traditional ASR-based systems will not work!

  7. Unwritten languages. Mboshi audio: ASR, Mboshi text; French text. Aikuma: Bird et al. 2014; LIG-Aikuma: Blachon et al. 2016; Godard et al. 2018. Efforts to collect speech and translations using mobile apps

  8. Unwritten languages. Mboshi audio: ASR, Mboshi text; French text. Aikuma: Bird et al. 2014; LIG-Aikuma: Blachon et al. 2016; Godard et al. 2018. Build cross-lingual speech-to-text systems (ST)

  9. Why speech input? “For many Indians, searching by voice rather than text is their first choice.” https://tnw.to/ieUbS

  10. Radio content analysis in Uganda. 55% of households: radio is the main source of information. Quinn and Hidalgo-Sanchis, 2017. https://bit.ly/2mL4pf6

  11. Radio content analysis in Uganda. Collect data from public radio conversations. Quinn and Hidalgo-Sanchis, 2017. https://bit.ly/2mL4pf6

  12. Radio content analysis in Uganda. “Insights about the spread of infectious diseases, small-scale disasters, etc.” Topics: healthcare, disasters. Quinn and Hidalgo-Sanchis, 2017. https://bit.ly/2mL4pf6

  13. Radio content analysis in Uganda. Luganda audio; topic? Topic prediction task. https://radio.unglobalpulse.net/uganda https://bit.ly/2mL4pf6

  14. Radio content analysis in Uganda. Luganda audio; topic? ASR: “Eddwaliro lyaffe temuli yadde …” (“… they have built health centers”). Speech-to-text system. https://radio.unglobalpulse.net/uganda https://bit.ly/2mL4pf6

  15. Radio content analysis in Uganda. Luganda audio; topic prediction: healthcare. ASR: “Eddwaliro lyaffe temuli yadde …” (“… they have built health centers”). Keywords indicate topic information. https://radio.unglobalpulse.net/uganda https://bit.ly/2mL4pf6

  16. Radio content analysis in Uganda. Luganda audio; topic prediction: healthcare. ASR: “Eddwaliro lyaffe temuli yadde …” (“… they have built health centers”). Availability of ASR! https://radio.unglobalpulse.net/uganda https://bit.ly/2mL4pf6

  17. Radio content analysis in Uganda. Luganda audio; topic prediction: healthcare. ASR: “Eddwaliro lyaffe temuli yadde …” (“… they have built health centers”). Can we predict topics using ST? https://radio.unglobalpulse.net/uganda https://bit.ly/2mL4pf6

  18. Radio content analysis in Uganda. Luganda audio; topic prediction: healthcare. ASR: “Eddwaliro lyaffe temuli yadde …” (“… they have built health centers”). Can we predict topics using ST? https://radio.unglobalpulse.net/uganda https://bit.ly/2mL4pf6

  19. Radio content analysis in Uganda. Luganda audio; topic prediction: healthcare. ASR: “Eddwaliro lyaffe temuli yadde …” (“… they have built health centers”). UN study dataset not available! https://radio.unglobalpulse.net/uganda https://bit.ly/2mL4pf6

  20. Our work: topic prediction for Spanish speech. Spanish audio → ST → English text prediction → topic prediction → topic? ST trained in simulated low-resource settings

  21. ST performance in low-resource settings. Spanish-English BLEU: 160 hours (Weiss et al.): 46. *For comparison, text-to-text = 58. Good performance if trained on 100+ hours

  22. ST performance in low-resource settings. Spanish-English BLEU: 160 hours (Weiss et al.): 46; 20 hours (Bansal et al. 2019): 19. *For comparison, text-to-text = 58. Mediocre performance in low-resource settings

  23. ST performance in low-resource settings. Spanish-English BLEU: 160 hours (Weiss et al.): 46; 20 hours (Bansal et al. 2019): 19. *For comparison, text-to-text = 58. “Good applications for crummy machine translation”, Church & Hovy, 1993

  24. Sample translations. Spanish: soy católica pero no en realidad casi no voy a la iglesia. English: i am catholic but actually i hardly go to church

  25. Sample translations. Spanish: soy católica pero no en realidad casi no voy a la iglesia. English: i am catholic but actually i hardly go to church. 20h: i’m catholics but reality i don’t go to the church. “Crummy” translation

  26. Sample translations. Spanish: soy católica pero no en realidad casi no voy a la iglesia. English: i am catholic but actually i hardly go to church. 20h: i’m catholics but reality i don’t go to the church. Topic: religion. Keywords can be useful for topic prediction

  27. Our work: topic prediction for Spanish speech. Spanish audio → ST → English text prediction → topic prediction → topic? ST trained in simulated low-resource settings

  28. Our work: topic prediction for Spanish speech. Spanish audio → ST → English text prediction → topic prediction → topic? Gold topic labels not available!

  29. Learning topic labels. Spanish audio; gold topic label?

  30. Learning topic labels. Spanish audio; gold topic label? Gold translation: “I like to listen to jazz”

  31. Learning topic labels. Spanish audio; gold topic label? Gold translation: “I like to listen to jazz”. Use gold translations to infer topic labels

  32. Learning topic labels. Spanish audio; silver topic label. Gold translation: “I like to listen to jazz”. Use gold translations to infer topic labels

  33. Learning topic labels. Training set: Spanish audio with gold human translations (“I listen to english music”, “I am catholic”, “hello how are you”), fed to a topic model

  34. Learning topic labels. Training set: Spanish audio with gold human translations (“I listen to english music”, “I am catholic”, “hello how are you”), fed to a topic model. Topic: terms. small-talk: hello, fine, name; music: dance, listen, music; religion: god, bible, believe; ...

  35. Learning topic labels. Spanish audio with gold human translations (“I listen to english music”, “I am catholic”, “hello how are you”), fed to a topic model. Topic: terms. small-talk: hello, fine, name; music: dance, listen, music; religion: god, bible, believe; ... Number of topics set to 10

  36. Learning topic labels. Spanish audio with gold human translations (“I listen to english music”, “I am catholic”, “hello how are you”), fed to a topic model. Topic: terms. small-talk: hello, fine, name; music: dance, listen, music; religion: god, bible, believe; ... small-talk is the most frequent topic
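The silver-labelling step on these slides can be illustrated with a small sketch. The deck does not say which topic model was used, so the code below is only a stand-in: it assumes a trained topic model has already produced per-topic term lists (the three rows shown on the slide; the remaining topics are elided there and are left out here), and it assigns each gold translation the topic whose terms it matches most often. The names `TOPIC_TERMS` and `infer_silver_label` are hypothetical.

```python
from collections import Counter

# Term lists copied from the slide's example table. A trained topic model
# would supply these; the deck sets the number of topics to 10 but only
# shows three, so only three appear here.
TOPIC_TERMS = {
    "small-talk": {"hello", "fine", "name"},
    "music": {"dance", "listen", "music"},
    "religion": {"god", "bible", "believe"},
}


def infer_silver_label(translation: str, default: str = "small-talk") -> str:
    """Return the topic whose terms occur most often in the translation.

    Falls back to `default` when no topic term matches. Ties go to the
    topic listed first in TOPIC_TERMS.
    """
    tokens = translation.lower().split()
    scores = Counter()
    for topic, terms in TOPIC_TERMS.items():
        scores[topic] = sum(token in terms for token in tokens)
    best_topic, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best_topic if best_score > 0 else default


# Gold human translations shown on the training-set slide.
training_translations = [
    "I listen to english music",
    "I am catholic",
    "hello how are you",
]
silver_labels = [infer_silver_label(t) for t in training_translations]
print(silver_labels)
```

With only the truncated term lists from the slide, “I am catholic” matches no term and falls back to the default topic; a real topic model with full term distributions would be expected to handle such cases better.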

  37. Topic prediction and evaluation. Evaluation set: Spanish audio. Topic model

  38. Topic prediction and evaluation. Evaluation set: Spanish audio. Gold translation: “I like to listen to jazz” → topic model → silver label: music

  39. Topic prediction and evaluation. Evaluation set: Spanish audio. Gold translation: “I like to listen to jazz” → topic model → silver label: music. ST translation: “I like jazz” → topic model → predicted label: music. Compare predicted and silver topic label

  40. Topic prediction and evaluation. Evaluation set: Spanish audio. Gold translation: “I like to listen to jazz” → topic model → silver label: music. ST translation: “I like jazz” → topic model → predicted label: music. Good prediction

  41. Topic prediction and evaluation. Evaluation set: Spanish audio. Gold translation: “I like to listen to jazz” → topic model → silver label: music. ST translation: “I like like” → topic model → predicted label: small-talk. Poor prediction

  42. Topic prediction and evaluation. Evaluation set: Spanish audio. Gold translation → topic model → silver label. ST translation → topic model → predicted label. Evaluate over a 100-hour test set
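The comparison described on these slides amounts to an accuracy computation over paired labels. A minimal sketch, assuming the silver labels (from gold translations) and predicted labels (from ST output) are already available as parallel lists; the function name `topic_accuracy` and the example labels are illustrative and not taken from the talk.

```python
def topic_accuracy(predicted, silver):
    """Fraction of utterances whose predicted topic equals the silver topic."""
    if len(predicted) != len(silver):
        raise ValueError("predicted and silver must have the same length")
    if not silver:
        return 0.0
    correct = sum(p == s for p, s in zip(predicted, silver))
    return correct / len(silver)


# Hypothetical labels for four evaluation utterances.
silver = ["music", "music", "religion", "small-talk"]
predicted = ["music", "small-talk", "religion", "small-talk"]

accuracy = topic_accuracy(predicted, silver)
print(f"topic accuracy: {accuracy:.2f}")
```

The second utterance mirrors the “poor prediction” case: the silver label is music, but the label predicted from the ST output is small-talk.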

  43. Topic prediction accuracy ● ST trained on <= 20 hours of Spanish-English ● Pretrained on English ASR

  44. Topic prediction accuracy. The small-talk topic is the majority-class baseline

  45. Topic prediction accuracy. Poor performance for ST models trained on <= 5 hours

  46. Topic prediction accuracy. 10-20h ST models outperform the majority baseline

  47. Topic prediction accuracy. BLEU = 13. 10-20h ST models outperform the majority baseline

  48. Topic prediction accuracy
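The majority-class baseline mentioned on these slides always predicts the most frequent topic (small-talk in the talk). A minimal sketch of how such a baseline could be computed, assuming lists of silver labels for the training and evaluation sets; `majority_baseline` and the label lists are hypothetical.

```python
from collections import Counter


def majority_baseline(train_labels, eval_labels):
    """Pick the most frequent training label and score it on the eval set."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(label == majority_label for label in eval_labels)
    return majority_label, hits / len(eval_labels)


# Hypothetical silver labels; small-talk dominates, as in the talk.
train_labels = ["small-talk", "small-talk", "small-talk", "music", "religion"]
eval_labels = ["small-talk", "music", "small-talk", "religion"]

label, baseline_accuracy = majority_baseline(train_labels, eval_labels)
print(label, baseline_accuracy)
```

An ST-based topic predictor is only useful if its accuracy exceeds this number, which is the comparison the accuracy slides draw for the 10-20h models.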

  49. Takeaways ● Low-resource ST can still be useful for building downstream applications ● Silver evaluation for this preliminary study ○ Future: human evaluation ● Experiments on low-resource/unwritten languages ○ Datasets required ● Keyword spotting. Thanks! ● Check out: “Analyzing ASR pretraining for low-resource speech-to-text translation”, Stoian et al.

  50. Backup

  51. Topic prediction accuracy

  52. Silver labels. Speakers were provided discussion prompts

  53. Topic labels

  54. Spanish dataset discussion prompts

  55. Spanish speech to English text. Spanish audio ● Telephone speech (unscripted) ● Realistic noise conditions ● Multiple speakers and dialects. English text ● Crowdsourced English text translations. Model: Spanish audio → encoder → attention → decoder → English text. Closer to real-world conditions

  56. Neural ST model. Input: “yo vivo en bronx”, 1.5 s, MFCCs (150 x 13). Output: “i live in bronx EOS”. Diagram labels: CNN, LSTM, biLSTM; Attention; Embedding (previous time step); FF-Softmax; 37 x 512 (shown twice). Code available on GitHub
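The shapes on this slide (150 x 13 at the input, 37 x 512 inside the model) can be reproduced with simple arithmetic. The sketch below is an assumption, not a description of the released code: it assumes a 10 ms frame shift for the MFCCs and two stride-2 reductions along the time axis, which happen to give the numbers shown. The helper names are hypothetical.

```python
def num_frames(duration_s: float, frame_shift_ms: float = 10.0) -> int:
    """Number of feature frames, assuming one frame per frame shift."""
    return int(round(duration_s * 1000 / frame_shift_ms))


def downsample(frames: int, stride: int = 2, layers: int = 2) -> int:
    """Apply `layers` strided reductions along time (floor division each)."""
    for _ in range(layers):
        frames //= stride
    return frames


N_MFCC = 13       # feature dimension shown on the slide
HIDDEN_SIZE = 512  # hidden dimension shown on the slide

frames = num_frames(1.5)            # 1.5 s utterance from the slide
encoder_steps = downsample(frames)  # assumed two stride-2 layers

print((frames, N_MFCC))
print((encoder_steps, HIDDEN_SIZE))
```

Shortening the time axis before the recurrent layers is a common design choice in speech encoders, since attention over 37 steps is cheaper than over 150 frames; whether the model on the slide does exactly this should be checked against the authors' code.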
