Cross-language mapping for small-vocabulary ASR in under-resourced languages: Investigating the impact of source language choice

Anjana Vakil and Alexis Palmer
Department of Computational Linguistics and Phonetics, Saarland University, Saarbrücken, Germany


SLIDE 1

Cross-language mapping for small-vocabulary ASR in under-resourced languages: Investigating the impact of source language choice

Anjana Vakil and Alexis Palmer

Department of Computational Linguistics and Phonetics, Saarland University, Saarbrücken, Germany

SLTU’14, St. Petersburg, 15 May 2014

SLIDE 2

Outline

◮ Small-vocabulary recognition: Why & how
◮ Cross-language pronunciation mapping
  • The Salaam method (Qiao et al. 2010)
◮ Our contribution: Impact of source language choice
  • Data & method
  • Experimental results
◮ Conclusions
◮ Ongoing & future work


SLIDE 3

Small-vocabulary recognition: Why & how

Goal: Enable non-experts to quickly develop basic speech-driven applications in any Under-Resourced Language (URL)

◮ Training/adapting a recognizer takes data and expertise
◮ Many applications use ≤100 terms (e.g. Bali et al. 2013)

Strategy: Use an existing high-resource-language (HRL) recognizer for small-vocabulary recognition in URLs (Sherwani 2009; Qiao et al. 2010)


SLIDE 4

Small-vocabulary recognition: Why & how

Key: Mapped pronunciation lexicon

Terms in target lg. (URL) → Pronunciations in source lg. (HRL)

Yoruba → English: igba /iɡ͡ba/ → igb@ | ib@ | ...?

[Diagram: mapped pronunciation lexicon + HRL recognizer ≈ small-vocabulary URL recognizer]
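In code terms, the mapped lexicon is just a lookup table from target-language terms to candidate source-language phone strings; a minimal Python sketch (the phone strings below are invented stand-ins, not mappings from the paper):

```python
# Mapped pronunciation lexicon: each target-language (Yoruba) term is paired
# with one or more source-language (English) phone sequences. The phone
# strings here are hypothetical examples, not the paper's actual mappings.
lexicon = {
    "igba": ["ih g b ah", "ih b ah"],
    "meji": ["m ey jh iy"],
    "duro": ["d uw r ow"],
}

def pronunciations(term):
    """Return the candidate source-language pronunciations for a target term."""
    return lexicon.get(term, [])
```

The HRL recognizer then treats each target term as an in-vocabulary "word" whose pronunciations are these source-language phone sequences.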



SLIDE 6

Cross-language pronunciation mapping

The Salaam Method (Qiao et al. 2010)

◮ Requires ≥1 sample per term (a few minutes of audio)
◮ Mimics phone decoding
◮ “Super-wildcard” recognition grammar:

  term → {∗ | ∗∗ | ∗∗∗}^10   (∗ = any source-language phoneme)

◮ Iterative training algorithm finds confidence-ranked matches:

  igba → ibæ@, ibõ@, ibE@, …

◮ Accuracy: ≈80–98% for ≤50 terms
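The super-wildcard grammar can be sketched programmatically; a simplified illustration (this emits a toy notation, not the Microsoft Speech Platform's actual SRGS format, and flattens the grammar to plain sequences of 1..n wildcard phones):

```python
def super_wildcard_grammar(max_phones=10):
    """Build a toy grammar string: a term may match any sequence of
    1..max_phones wildcard phones, where each wildcard (*) stands for
    any phoneme of the source language."""
    alternatives = ["*" * n for n in range(1, max_phones + 1)]
    return "term -> { " + " | ".join(alternatives) + " }"

print(super_wildcard_grammar(3))  # term -> { * | ** | *** }
```

Decoding audio against such a grammar forces the recognizer to output some source-language phone sequence for each sample, which is what lets the method mimic phone decoding with an off-the-shelf word recognizer.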



SLIDE 8

Impact of source language choice

Hypothesis

More phoneme overlap between source/target languages
→ Easier pronunciation mapping
→ Higher recognition accuracy

Experiment

◮ Target language: Yoruba
◮ Source languages: English (US), French (France)


SLIDE 9

Impact of source language choice

[Figure: Phonemic segments of Yoruba, marked by whether each segment is also found in English and/or French. Segments shown: e, i, a, u, ɛ, ɔ, h, ɾ, b, t, d, k, ɡ, f, s, ʃ, m, l, j, w, ɟ, the nasal vowels ĩ, ũ, ɛ̃, ɔ̃/ã, and the labial-velar stops k͡p, ɡ͡b.]
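The overlap pictured here can be quantified as set intersection over phoneme inventories; a small sketch (the inventories below are abbreviated, illustrative subsets, not the full sets from the figure):

```python
# Abbreviated, illustrative phoneme inventories (not the complete sets).
yoruba  = {"e", "i", "a", "u", "ɛ", "ɔ", "b", "t", "d", "k", "ɡ",
           "s", "ʃ", "m", "l", "j", "w", "k͡p", "ɡ͡b"}
english = {"e", "i", "a", "u", "b", "t", "d", "k", "ɡ",
           "s", "ʃ", "m", "l", "j", "w", "h"}
french  = {"e", "i", "a", "u", "ɛ", "ɔ", "b", "t", "d", "k", "ɡ",
           "s", "ʃ", "m", "l", "j", "w"}

def overlap(target, source):
    """Fraction of the target inventory also present in the source inventory."""
    return len(target & source) / len(target)

print(f"English covers {overlap(yoruba, english):.2f} of Yoruba")
print(f"French  covers {overlap(yoruba, french):.2f} of Yoruba")
```

Even on these toy subsets, French covers more of the Yoruba inventory than English does, which is the premise behind the hypothesis on the previous slide.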



SLIDE 12

Data & method

Data

◮ 25 Yoruba terms (subset of the Qiao et al. 2010 dataset)
◮ 5 samples/term from 2 speakers (1 male, 1 female)
◮ Telephone quality (8 kHz)

Method

◮ Generate Fr./En. lexicons with Salaam (Qiao et al. 2010)

  • Microsoft Speech Platform (msdn.microsoft.com/library/hh361572)
  • 1, 3, and 5 pronunciations per term

◮ Compare mean word recognition accuracy

  • Same-speaker: Leave-one-out
  • Cross-speaker: Train on M, test on F (M > F); train on F, test on M (F > M)
  • t-tests for significance (α = 0.05)
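The accuracy comparison can be sketched as computing per-lexicon means and a two-sample t-statistic; the slides do not specify the t-test variant, so this uses a plain Welch statistic, and the accuracy lists are made-up numbers:

```python
from statistics import mean, variance

def t_statistic(a, b):
    """Welch's two-sample t-statistic for two lists of per-fold accuracies."""
    va, vb = variance(a), variance(b)
    return (mean(a) - mean(b)) / ((va / len(a) + vb / len(b)) ** 0.5)

# Hypothetical per-fold word accuracies (%) for the two lexicons.
english = [80.0, 82.0, 78.0, 81.0]
french  = [75.0, 77.0, 74.0, 76.0]

print(f"En mean {mean(english):.1f}, Fr mean {mean(french):.1f}, "
      f"t = {t_statistic(english, french):.2f}")
```

The resulting t-statistic would then be converted to a p-value and checked against α = 0.05, as in the significance tests reported on the next slides.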


SLIDE 13

Results

Same-speaker word recognition accuracy (%):

              1 pron.   3 prons.   5 prons.
  English       80.0      80.0       81.6
  French        75.2      77.2       80.0
  p-value       0.20      0.34       0.59


SLIDE 14

Results

Cross-speaker word recognition accuracy (%), means over the M > F and F > M directions:

              1 pron.   3 prons.   5 prons.
  English       63.2      71.6       73.6
  French        60.0      64.8       61.6
  p-value       0.41      0.04*      0.04*

(* significant at p ≤ .05)


SLIDE 15

Results

Accuracy by word type

[Table: terms ranked from best to worst recognition accuracy under the English and French lexicons, with nasal terms marked. Best: duro; worst: meji, igba (order differs between the two source languages); other terms listed include gba, iba, shii, mejo, goji, mesan, lehin, beeni, tunse, okan, gorun, sun, meta, bere.]



SLIDE 19

Conclusions

Hypothesis

More phoneme overlap between source/target languages
→ Easier pronunciation mapping
→ Higher recognition accuracy

Predicted: French accuracy > English accuracy
Observed: French accuracy ≤ English accuracy

Possible explanations:

◮ Source languages may be too similar w.r.t. the target language
◮ Better metric needed for evaluating source-target match
◮ Baseline recognizer accuracy may play a role


SLIDE 20

Ongoing & future work

lex4all: Pronunciation Lexicons for Any Low-resource Language (Vakil et al. 2014)
http://lex4all.github.io/lex4all

Planned experiments:

◮ More source-target language pairs
◮ Discriminative training (Chan and Rosenfeld 2012)
◮ Algorithm modifications


SLIDE 21

References

• K. Bali, S. Sitaram, S. Cuendet, and I. Medhi. “A Hindi speech recognizer for an agricultural video search application”. In: ACM DEV. 2013.
• H. Y. Chan and R. Rosenfeld. “Discriminative pronunciation learning for speech recognition for resource scarce languages”. In: ACM DEV. 2012.
• F. Qiao, J. Sherwani, and R. Rosenfeld. “Small-vocabulary speech recognition for resource-scarce languages”. In: ACM DEV. 2010.
• J. Sherwani. “Speech interfaces for information access by low literate users”. PhD thesis. Carnegie Mellon University, 2009.
• A. Vakil, M. Paulus, A. Palmer, and M. Regneri. “lex4all: A language-independent tool for building and evaluating pronunciation lexicons for small-vocabulary speech recognition”. In: ACL 2014: System Demonstrations. 2014.

Thank you! Thanks also to:

Roni Rosenfeld, Mark Qiao, Hao Yee Chan, Dietrich Klakow, Manfred Pinkal
