

  1. Pre-training on high-resource speech recognition improves low-resource speech-to-text translation. Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, Sharon Goldwater

  2. Current systems: Spanish audio → ? → English text

  3. Current systems: Spanish audio → Automatic Speech Recognition → Spanish text ("ola mi nombre es hodor") → ? → English text

  4. Current systems: Spanish audio → Automatic Speech Recognition → Spanish text ("ola mi nombre es hodor") → Machine Translation → English text ("hi my name is hodor"). A sketch of this cascade follows.
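
A minimal sketch of such an ASR-then-MT cascade, using off-the-shelf Hugging Face pipelines; the specific model checkpoints here are illustrative placeholders, not systems from this talk:

```python
# Cascade: ASR transcribes the Spanish audio, then MT translates the
# transcript into English. Model choices are placeholder assumptions.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
mt = pipeline("translation", model="Helsinki-NLP/opus-mt-es-en")

spanish_text = asr("utterance.wav")["text"]             # Spanish transcript
english_text = mt(spanish_text)[0]["translation_text"]  # English translation
print(english_text)                                     # e.g. "hi my name is hodor"
```

Errors compound across the two stages: a misrecognized transcript ("ola" for "hola") is all the MT system ever sees.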

  5. ~100 languages supported by Google Translate ...

  6. Unwritten languages. Mboshi: a Bantu language of the Republic of Congo, ~160K speakers. ~3000 languages have no writing system, so the cascade breaks down: Automatic Speech Recognition would need Mboshi text, which is not available.

  7. Unwritten languages. Mboshi: paired with French translations (Godard et al. 2018). ~3000 languages have no writing system. Efforts to collect speech and translations using mobile apps: Aikuma (Bird et al. 2014), LIG-Aikuma (Blachon et al. 2016)

  8. Haiti Earthquake, 2010. Survivors sent text messages to a helpline: "Moun kwense nan Sakre Kè nan Pòtoprens" ("People trapped in Sacred Heart Church, PauP"). International rescue teams faced a language barrier; no automated tools were available. Volunteers from the global Haitian diaspora helped create parallel text corpora in a short time [Munro 2010]

  9. Are we better prepared in 2019? "Moun kwense nan Sakre Kè nan Pòtoprens" → "People trapped in Sacred Heart Church, PauP" ... this time for voice messages

  10. Can we build a speech-to-text translation (ST) system, given as training data only source audio paired with translations? Tens of hours of speech paired with text translations; no source text available.

  11. Neural models: Weiss et al. (2017) directly translate speech with a sequence-to-sequence model: Spanish audio → Sequence-to-Sequence → English text ("hi my name is hodor")

  12. Spanish speech to English text. The data: telephone speech (unscripted), realistic noise conditions, multiple speakers and dialects, crowdsourced English text translations; closer to real-world conditions. The model: Spanish audio → Encoder → Attention → Decoder → English text (a sketch follows).
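
A minimal sketch of this attentional encoder-decoder in PyTorch; the layer sizes and counts are illustrative assumptions, not the authors' exact configuration:

```python
# Attentional encoder-decoder for speech-to-text, in the spirit of the
# model in the talk. Hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

class SpeechTranslationModel(nn.Module):
    def __init__(self, n_mel=80, hidden=256, vocab=1000):
        super().__init__()
        self.hidden = hidden
        # Encoder: consumes a sequence of filterbank frames
        self.encoder = nn.LSTM(n_mel, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        # Decoder: emits target text tokens, attending over encoder states
        self.embed = nn.Embedding(vocab, hidden)
        self.decoder = nn.LSTMCell(hidden + 2 * hidden, hidden)
        self.attn_q = nn.Linear(hidden, 2 * hidden)   # attention query projection
        self.out = nn.Linear(hidden + 2 * hidden, vocab)

    def forward(self, frames, targets):
        # frames: (B, T, n_mel) audio features; targets: (B, U) token ids
        enc, _ = self.encoder(frames)                        # (B, T, 2H)
        B = frames.size(0)
        h = frames.new_zeros(B, self.hidden)
        c = frames.new_zeros(B, self.hidden)
        logits = []
        for t in range(targets.size(1)):                     # teacher forcing
            q = self.attn_q(h).unsqueeze(2)                  # (B, 2H, 1)
            weights = torch.bmm(enc, q).squeeze(2).softmax(-1)      # (B, T)
            ctx = torch.bmm(weights.unsqueeze(1), enc).squeeze(1)   # (B, 2H)
            h, c = self.decoder(
                torch.cat([self.embed(targets[:, t]), ctx], dim=-1), (h, c))
            logits.append(self.out(torch.cat([h, ctx], dim=-1)))
        return torch.stack(logits, dim=1)                    # (B, U, vocab)
```

The same three components (Encoder, Attention, Decoder) reappear throughout the talk, since ASR and ST share this architecture.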

  13. Spanish speech to English text: Weiss et al. report good performance if trained on 100+ hours (for comparison, text-to-text translation = 58 BLEU)

  14. But ... Weiss et al.'s model performs poorly in low-resource settings (for comparison, text-to-text = 58 BLEU)

  15. Goal: to improve translation performance

  16. Goal: to improve translation performance … without labeling more low-resource speech

  17. Hundreds of hours of monolingual speech paired with text are available (English audio with English text; French audio with French text), typically used to train ASR systems. Key idea: leverage monolingual data from a different high-resource language.

  18. Hundreds of hours of monolingual speech paired with text, typically used to train ASR systems, on one side; on the other, a sequence-to-sequence ST model with only ~20 hours of Spanish-English (Spanish audio paired with English text). Can the former help the latter?

  19. Prior work instead leverages source-language text, i.e. Spanish text paired with the Spanish audio: Weiss et al. 2017, Anastasopoulos and Chiang 2018, Bérard et al. 2018, Sperber et al. 2019. Here the sequence-to-sequence ST model still has only ~20 hours of Spanish-English.

  20. Back to our setting: hundreds of hours of English or French ASR data (speech paired with text), and a sequence-to-sequence ST model with only ~20 hours of Spanish-English. Can the ASR data help?

  21. Why Spanish-English?

  22. Why Spanish-English? To simulate low-resource settings and test our method.

  23. Why Spanish-English? To simulate low-resource settings and test our method. Later: results on a truly low-resource pair, Mboshi to French.

  24. Method: the same model architecture (audio → Encoder → Attention → Decoder → text) for both ASR and ST, with randomly initialized parameters.

  25. Pretrain on high-resource English ASR: 300 hours of English audio and text (English audio → Encoder → Attention → Decoder → English text), trained until convergence. A training-loop sketch follows.
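
A sketch of the pretraining stage under the same assumptions as the model sketch above; `english_asr_loader` is an assumed DataLoader yielding (frames, tokens) batches of the 300-hour English ASR data:

```python
# Pretraining: train the architecture above as an English ASR model
# (English audio -> English text) until convergence.
import torch

model = SpeechTranslationModel()                     # from the earlier sketch
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(50):                              # "until convergence"
    for frames, tokens in english_asr_loader:        # assumed DataLoader
        logits = model(frames, tokens[:, :-1])       # predict the next token
        loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                       tokens[:, 1:].reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()

torch.save(model.state_dict(), "english_asr.pt")     # checkpoint to transfer
```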

  26. Fine-tune on low-resource data, 20 hours of Spanish-English: transfer all parameters from the English ASR model (English audio → English text) to the ST model (Spanish audio → English text).

  27. Fine-tune on the 20 hours of Spanish-English (Spanish audio → Encoder → Attention → Decoder → English text), again trained until convergence; see the sketch below.
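
A sketch of the fine-tuning stage: all parameters are initialized from the converged English ASR checkpoint, which assumes ASR and ST share one output vocabulary (e.g. characters) so every tensor shape matches; `spanish_english_loader` is an assumed DataLoader of the 20-hour ST data:

```python
# Fine-tuning: start from the English ASR parameters, then continue
# training on the low-resource Spanish-English ST data.
import torch

model = SpeechTranslationModel()
model.load_state_dict(torch.load("english_asr.pt"))  # transfer from English ASR

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(50):                              # again until convergence
    for frames, tokens in spanish_english_loader:    # assumed DataLoader
        logits = model(frames, tokens[:, :-1])
        loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                       tokens[:, 1:].reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()
```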

  28. Will this work?

  29. Spanish-English BLEU scores: baseline (for comparison, Weiss et al. = 47.3)

  30. Spanish-English BLEU scores: pretraining vs. baseline (for comparison, Weiss et al. = 47.3)

  31. Spanish-English BLEU scores: pretraining gains +9 BLEU over the baseline (for comparison, Weiss et al. = 47.3)

  32. Spanish-English BLEU scores: pretraining reaches better performance with half the data (for comparison, Weiss et al. = 47.3)

  33. Further analysis: pretraining, 20 hours of Spanish-English (for comparison, Weiss et al. = 47.3)

  34. Faster training time: the pretrained model converges faster than the baseline.

  35. Faster training time (pretraining: ~2 hours vs. baseline: ~20 hours); potentially useful in time-critical scenarios.

  36. Ablation: model parameters. Spanish to English, N = 20 hours:
        baseline                 10.8 BLEU
        +English ASR             19.9 BLEU

  37. Ablation: transfer only the decoder from English ASR (encoder random):
        baseline                 10.8 BLEU
        +English ASR             19.9 BLEU
        +English ASR: decoder    10.5 BLEU

  38. Ablation: transfer only the encoder from English ASR (decoder random):
        baseline                 10.8 BLEU
        +English ASR             19.9 BLEU
        +English ASR: decoder    10.5 BLEU
        +English ASR: encoder    16.6 BLEU

  39. … transferring encoder-only parameters works well! (A sketch of encoder-only transfer follows.)
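
A sketch of that encoder-only transfer: copy just the encoder parameters from the ASR checkpoint and leave the attention and decoder randomly initialized (same assumed names as the sketches above):

```python
# Encoder-only transfer: load only "encoder.*" parameters from the
# pretrained ASR checkpoint; attention and decoder stay random.
import torch

pretrained = torch.load("english_asr.pt")
encoder_only = {k: v for k, v in pretrained.items()
                if k.startswith("encoder.")}
model = SpeechTranslationModel()
model.load_state_dict(encoder_only, strict=False)    # rest stays random
```

Because the decoder is not transferred, nothing about the pretraining language's vocabulary is carried over, which is what lets the pretraining language differ from both sides of the ST pair.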

  40. … so we can pretrain on a language different from both the source and the target of the ST pair.

  41. Pretraining on French (only 20 hours of French ASR), transferring the encoder:
        baseline                 10.8 BLEU
        +English ASR             19.9 BLEU
        +English ASR: encoder    16.6 BLEU
        +French ASR: encoder     ?

  42. Pretraining on French:
        +French ASR: encoder     12.5 BLEU
      French ASR helps Spanish-English ST.

  43. Takeaways:
      - Pretraining on a different language helps
      - Transfer all model parameters for the best gains
      - Encoder parameters account for most of these gains … useful when the target vocabulary is different

  44. … Mboshi-French ST

  45. Mboshi-French ST
      - ST data by Godard et al. 2018: ~4 hours of speech, paired with French translations
      - Mboshi: a Bantu language of the Republic of Congo, unwritten, ~160K speakers

  46. Mboshi-French results (Mboshi to French, N = 4 hours): baseline = ?

  47. Mboshi-French results (Mboshi to French, N = 4 hours): baseline = 3.5 BLEU (outperformed by a naive baseline)

  48. Pretraining on French ASR, transferring all parameters: +French ASR: all = ?

  49. Pretraining on French ASR (Mboshi to French, N = 4 hours):
        baseline            3.5 BLEU
        +French ASR: all    5.9 BLEU
      French ASR helps Mboshi-French ST.


  51. Pretraining on English ASR, transferring only the encoder (trained on a lot more data; decoder random): +English ASR: encoder = ?

  52. Pretraining on English ASR (Mboshi to French, N = 4 hours):
        baseline                  3.5 BLEU
        +French ASR: all          5.9 BLEU
        +English ASR: encoder     5.3 BLEU
      English ASR helps Mboshi-French ST.
