Cross-language mapping for small-vocabulary ASR in under-resourced languages: Investigating the impact of source language choice. Anjana Vakil and Alexis Palmer, Department of Computational Linguistics and Phonetics, University of Saarland.


  1. Cross-language mapping for small-vocabulary ASR in under-resourced languages: Investigating the impact of source language choice. Anjana Vakil and Alexis Palmer, Department of Computational Linguistics and Phonetics, University of Saarland, Saarbrücken, Germany. SLTU'14, St. Petersburg, 15 May 2014

  2. Outline
  ◮ Small-vocabulary recognition: Why & how
  ◮ Cross-language pronunciation mapping: the Salaam method (Qiao et al. 2010)
  ◮ Our contribution: Impact of source language choice
  ◮ Data & method
  ◮ Experimental results
  ◮ Conclusions
  ◮ Ongoing & future work

  3. Small-vocabulary recognition: Why & how
  Goal: Enable non-experts to quickly develop basic speech-driven applications in any under-resourced language (URL)
  ◮ Training or adapting a recognizer takes data and expertise
  ◮ Many applications use ≤ 100 terms (e.g. Bali et al. 2013)
  Strategy: Use an existing high-resource-language (HRL) recognizer for small-vocabulary recognition in URLs (Sherwani 2009; Qiao et al. 2010)

  4-5. Small-vocabulary recognition: Why & how
  Key: a mapped pronunciation lexicon. Terms in the target language (URL) are mapped to pronunciations in the source language (HRL).
  [Slide diagram: the Yoruba term "igba" is mapped to candidate English phone sequences such as [igb@], [ib@], ...; the mapped lexicon plus the HRL recognizer together approximate a small-vocabulary recognizer for the URL.]
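  To make the mapped-lexicon idea concrete, here is a minimal sketch of one way to represent such a lexicon and turn it into grammar entries for an HRL recognizer. This is my illustration, not the authors' code; the Yoruba terms are from the talk, but the English phone strings are made up.

      # Minimal sketch of a cross-language pronunciation lexicon.
      # Keys are target-language (URL) terms; values are confidence-ranked
      # source-language (HRL) phone sequences. Phone strings are illustrative.
      yoruba_to_english = {
          "igba": ["ih g b ah", "ih b ah"],
          "meji": ["m eh jh iy"],
      }

      def grammar_entries(lexicon):
          """Pair each term with its candidate pronunciations, so an HRL
          recognizer constrained to these entries can recognize URL terms."""
          return [(term, pron) for term, prons in lexicon.items()
                  for pron in prons]

      for term, pron in grammar_entries(yoruba_to_english):
          print(term, "->", pron)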

  6. Cross-language pronunciation mapping: the Salaam method (Qiao et al. 2010)
  ◮ Requires ≥ 1 audio sample per term (a few minutes of audio in total)
  ◮ Mimics phone decoding via a "super-wildcard" recognition grammar: term → (∗ | ∗∗ | ∗∗∗) repeated 0-10 times, where ∗ matches any source-language phoneme
  ◮ An iterative training algorithm finds confidence-ranked matches, e.g. igba → ibæ@, ibõ@, ibE@, ...
  ◮ Accuracy: ≈ 80-98% for vocabularies of ≤ 50 terms
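  Very roughly, the discovery loop can be sketched as below. This is my simplification, not the authors' implementation: "decode" stands in for a recognizer call (the experiments use the Microsoft Speech Platform) that takes audio plus a grammar and returns (phone sequence, confidence) hypotheses, and the iterative refinement step of Qiao et al., in which high-confidence phones are fixed and the remainder re-decoded, is omitted here.

      # Rough sketch of Salaam-style pronunciation discovery (after Qiao
      # et al. 2010). decode(audio, grammar) is an assumed recognizer
      # interface, not a real API.
      def super_wildcard_grammar(max_units=10):
          # term -> (* | ** | ***) repeated up to max_units times,
          # where * matches any source-language phoneme
          return {"alternatives": ["*", "**", "***"], "repeat": (0, max_units)}

      def map_term(audio_samples, decode, n_best=5):
          """Aggregate decoding hypotheses over all samples of one term
          and return the n_best confidence-ranked phone sequences."""
          scores = {}
          for audio in audio_samples:
              for phones, confidence in decode(audio, super_wildcard_grammar()):
                  scores[phones] = scores.get(phones, 0.0) + confidence
          return sorted(scores, key=scores.get, reverse=True)[:n_best]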

  7-8. Impact of source language choice
  Hypothesis: More phoneme overlap between source and target languages → easier pronunciation mapping → higher recognition accuracy
  Experiment:
  ◮ Target language: Yoruba
  ◮ Source languages: English (US), French (France)

  9. Impact of source language choice
  [Chart: phonemic segments of Yoruba, grouped by whether they also occur in English and/or French. Both source languages cover the oral vowels and most consonants (i, ɛ, ɔ, e, u, a, o, b, t, d, k, ɡ, f, s, ʃ, m, l, j, w); English alone covers h and ɾ; French alone covers the nasal vowels ɛ̃ and ɔ̃/ã; ĩ, ũ, ɟ, k͡p, ɡ͡b occur in neither.]
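  The hypothesis invites a simple quantitative check; the sketch below (mine, not from the talk) computes the fraction of the target inventory covered by each source inventory, using rough phoneme sets read off the chart above. These sets are approximations for illustration, not authoritative inventories.

      # Sketch: phoneme-inventory coverage of Yoruba by English and French.
      # Sets are rough approximations read off the slide's chart.
      yoruba = {"i", "e", "ɛ", "a", "ɔ", "o", "u", "ɛ̃", "ɔ̃", "ĩ", "ũ",
                "h", "b", "t", "d", "k", "ɡ", "f", "s", "ʃ", "ɾ",
                "m", "l", "j", "w", "ɟ", "k͡p", "ɡ͡b"}
      english = {"i", "e", "ɛ", "a", "ɔ", "o", "u", "h", "b", "t", "d",
                 "k", "ɡ", "f", "s", "ʃ", "ɾ", "m", "l", "j", "w"}
      french = {"i", "e", "ɛ", "a", "ɔ", "o", "u", "ɛ̃", "ɔ̃", "b", "t",
                "d", "k", "ɡ", "f", "s", "ʃ", "m", "l", "j", "w"}

      def coverage(target, source):
          """Fraction of target phonemes present in the source inventory."""
          return len(target & source) / len(target)

      print(f"English covers {coverage(yoruba, english):.0%} of Yoruba")
      print(f"French covers {coverage(yoruba, french):.0%} of Yoruba")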

  10-12. Data & method
  Data:
  ◮ 25 Yoruba terms (a subset of the Qiao et al. 2010 dataset)
  ◮ 5 samples per term from 2 speakers (1 male, 1 female)
  ◮ Telephone quality (8 kHz)
  Method:
  ◮ Generate French and English lexicons with the Salaam method (Qiao et al. 2010), using the Microsoft Speech Platform (msdn.microsoft.com/library/hh361572), with 1, 3, and 5 pronunciations per term
  ◮ Compare mean word recognition accuracy: same-speaker (leave-one-out) and cross-speaker (train on M, test on F, and vice versa), with t-tests for significance (α = 0.05); a sketch of this protocol follows below
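  A minimal sketch of the evaluation protocol, as referenced above. This is my reconstruction: leave-one-out presumably means training the mapping on four of the five samples per term and testing on the held-out fifth; the accuracy figures below are placeholders, not the paper's results; and whether the authors used paired or unpaired t-tests is not stated on the slide.

      # Sketch: mean word accuracy per configuration plus a t-test between
      # the English and French conditions. Requires scipy.
      from statistics import mean
      from scipy.stats import ttest_ind

      def word_accuracy(results):
          """results: list of (predicted_term, true_term) pairs."""
          return sum(p == t for p, t in results) / len(results)

      # Per-fold accuracies from leave-one-out runs (placeholder numbers):
      english_runs = [0.80, 0.84, 0.76, 0.80, 0.80]
      french_runs = [0.76, 0.72, 0.76, 0.80, 0.72]

      t_stat, p_value = ttest_ind(english_runs, french_runs)
      print(f"En {mean(english_runs):.1%} vs Fr {mean(french_runs):.1%}, "
            f"p = {p_value:.2f}")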

  13. Results: same-speaker accuracy
  Word recognition accuracy (%):

  Pronunciations per term | English | French | p
  1                       |  80.0   |  75.2  | 0.20
  3                       |  80.0   |  77.2  | 0.34
  5                       |  81.6   |  80.0  | 0.59

  (No difference is significant at α = 0.05.)

  14. Results: cross-speaker accuracy
  Word recognition accuracy (%), averaged over both directions (train M → test F, and train F → test M):

               | 1 pron. | 3 prons. | 5 prons.
  En. mean     |  63.2   |  71.6    |  73.6
  Fr. mean     |  60.0   |  64.8    |  61.6
  p (* ≤ .05)  |  0.41   |  0.04*   |  0.04*

  15. Results: accuracy by word type
  [Table: the 25 terms ranked from best- to worst-recognized for each source language, with nasal-vowel terms highlighted. Best for English: duro; worst: meji. Best for French: ogba; worst: igba.]

  16-19. Conclusions
  Hypothesis: More phoneme overlap between source and target languages → easier pronunciation mapping → higher recognition accuracy
  Predicted: French accuracy > English accuracy
  Observed: French accuracy ≤ English accuracy
  Possible explanations:
  ◮ The source languages may be too similar with respect to the target language
  ◮ A better metric is needed for evaluating source-target match
  ◮ Baseline recognizer accuracy may play a role

  20. Ongoing & future work
  lex4all: pronunciation LEXicons for Any Low-resource Language (Vakil et al. 2014)
  http://lex4all.github.io/lex4all
  Planned experiments:
  ◮ More source-target language pairs
  ◮ Discriminative training (Chan and Rosenfeld 2012)
  ◮ Algorithm modifications

  21. References
  ◮ K. Bali, S. Sitaram, S. Cuendet, and I. Medhi. "A Hindi speech recognizer for an agricultural video search application". In: ACM DEV. 2013.
  ◮ H. Y. Chan and R. Rosenfeld. "Discriminative pronunciation learning for speech recognition for resource scarce languages". In: ACM DEV. 2012.
  ◮ F. Qiao, J. Sherwani, and R. Rosenfeld. "Small-vocabulary speech recognition for resource-scarce languages". In: ACM DEV. 2010.
  ◮ J. Sherwani. "Speech interfaces for information access by low literate users". PhD thesis. Carnegie Mellon University, 2009.
  ◮ A. Vakil, M. Paulus, A. Palmer, and M. Regneri. "lex4all: A language-independent tool for building and evaluating pronunciation lexicons for small-vocabulary speech recognition". In: ACL 2014: System Demonstrations. 2014.

  Thank you! Thanks also to: Roni Rosenfeld, Mark Qiao, Hao Yee Chan, Dietrich Klakow, Manfred Pinkal
