SLIDE 1

Collecting and evaluating speech recognition corpora for nine Southern Bantu languages

Jaco Badenhorst, Charl van Heerden, Marelie Davel and Etienne Barnard March 31, 2009

SLIDE 2

Introduction ASR corpus design Project Lwazi Computational analysis Conclusion

Outline

- Introduction
- Background:
  - ASR corpus design
  - The Lwazi ASR corpus
- Computational analysis:
  - Approach
  - Analysis of phoneme variability
- Conclusion

SLIDE 3

Introduction

Information flow in developing countries:

- Availability of alternative information sources is low in developing countries
- Telephone networks (cellular) are spreading rapidly

Spoken dialog systems (SDSs):

- Widespread belief that their impact can be significant
- Speech-based access can empower semi-literate people

Applications of SDSs:

- Education (speech-enabled learning)
- Agriculture
- Health care
- Government services

SLIDE 4

Introduction

To implement SDSs, ASR and TTS systems are needed. The main linguistic resources needed for telephone-based ASR systems are:

- Electronic pronunciation dictionaries
- Annotated audio corpora
- Recognition grammars

Challenges:

- ASR is only available for a handful of African languages
- Lack of linguistic resources for African languages
- Lack of audio relevant to the specific application (language used, profile of speakers, speaking style, etc.)

SLIDE 5

ASR audio corpus

Collecting an ASR audio corpus is a resource-intensive process. Factors that add to its complexity:

- Recordings of multiple speakers
- Matching channel and style
- Careful orthographic transcription
- Markers required to indicate important events (e.g. non-speech)

Size of corpora:

- Corpora of resource-scarce languages tend to be very small (1-10 hours of audio)
- This contrasts with the speech corpora used to build commercial systems (hundreds to thousands of hours)

SLIDE 6

Project Lwazi

- Three-year (2006-2009) project commissioned by the South African Department of Arts and Culture
- Development of core speech technology resources and components (ASR, TTS, SDS, etc.)
- National pilot demonstrating the potential impact of speech-based systems in South Africa
- Covers all 11 official languages of South Africa

SLIDE 7

Project Lwazi: Languages

Distribution of home languages for the South African population: 9 Southern Bantu languages and 2 Germanic languages.

SLIDE 8

Project Lwazi

ASR corpus:

- Approximately 200 speakers per language
- Speaker population selected to provide a balanced profile with regard to age, gender, and type of telephone (cellphone/landline)
- Read and elicited speech, recorded over a telephone channel
- 30 utterances per speaker:
  - 16 randomly selected from a phonetically balanced corpus
  - 14 short words and phrases

SLIDE 9

Project Lwazi: Southern Bantu languages

Figure: Amount of data within the Lwazi ASR corpus. Two bar charts: distinct phonemes per language and speech minutes per language, for the nine languages (tsn, ven, ssw, sot, nso, zul, nbl, xho, tso).

SLIDE 10

Computational analysis

Goal:

- Understand the data requirements for developing a minimal system that is practically usable
  - Use it as a seed ASR system to collect additional resources
  - Understand the implications of additional speakers and utterances
- Develop tools that:
  - Provide an indication of data sufficiency
  - Assess the potential for cross-language sharing

SLIDE 11

Computational analysis

Approach:

- Measure acoustic variance in terms of the separability between probability densities, by modelling specific phonemes
- This statistical measure provides an indication of the effect that additional training data will have on recognition accuracy
- Utilise the same measure as an indication of acoustic similarity across languages
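The deck names the separability measure (the Bhattacharyya bound, per the figure axes) but not its closed form. For two Gaussian densities with equal priors, the bound on the Bayes classification error can be computed as below; this is a sketch, and the function name and example inputs are ours:

```python
import numpy as np

def bhattacharyya_bound(mu1, cov1, mu2, cov2):
    """Upper bound on the (equal-prior) Bayes error between two Gaussian
    densities: 0.5 * exp(-D_B), where D_B is the Bhattacharyya distance."""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    cov1, cov2 = np.atleast_2d(cov1), np.atleast_2d(cov2)
    cov = 0.5 * (cov1 + cov2)
    diff = mu2 - mu1
    # Mean term: separation between the two class means.
    term_mean = 0.125 * diff @ np.linalg.solve(cov, diff)
    # Covariance term: mismatch between the two covariances.
    _, logdet = np.linalg.slogdet(cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    term_cov = 0.5 * (logdet - 0.5 * (logdet1 + logdet2))
    return 0.5 * np.exp(-(term_mean + term_cov))

# Identical densities are maximally confusable: the bound is 0.5.
print(bhattacharyya_bound([0, 0], np.eye(2), [0, 0], np.eye(2)))  # 0.5
```

Identical densities give the maximum bound of 0.5; as the densities separate, the bound drops. This is why watching the value stabilise as data is added can serve as a data-sufficiency signal.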

SLIDE 12

Computational analysis

We mainly focus on four languages here:

- isiNdebele (nbl)
- siSwati (ssw)
- isiZulu (zul)
- Tshivenda (ven)

We report only on single-mixture context-independent models (similar trends were observed for more complex models), and on examples from several broad categories of phonemes (SAMPA) that occur most frequently in the target languages:

- /a/ (vowels)
- /m/ (nasals)
- /b/ and /g/ (voiced plosives)
- /s/ (unvoiced fricatives)
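As a toy illustration of this modelling setup: a single-mixture context-independent model reduces, per phoneme, to one Gaussian (here with diagonal covariance) estimated from that phoneme's feature frames. The frames below are synthetic stand-ins for real MFCC vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for 13-dim MFCC frames aligned to each phoneme.
frames = {
    "a": rng.normal(loc=0.0, scale=1.0, size=(500, 13)),
    "m": rng.normal(loc=1.5, scale=0.8, size=(500, 13)),
}

# Single-mixture, context-independent model: one diagonal-covariance
# Gaussian per phoneme, i.e. a per-dimension mean and variance.
models = {
    phone: (x.mean(axis=0), x.var(axis=0))
    for phone, x in frames.items()
}

for phone, (mean, var) in models.items():
    print(phone, mean.shape, var.shape)  # mean and variance are 13-dim
```

With models this simple, the closed-form separability measure above the figure captions can be evaluated directly, without decoding any speech.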

SLIDE 13

Analysis of phoneme variability

Figure: Speaker-and-utterance three-dimensional plot for the siSwati nasal /m/: the Bhattacharyya bound as a function of the number of phone observations per speaker and the number of speakers.

SLIDE 14

Number of phoneme utterances

Figure: Effect of the number of phoneme utterances per speaker on the similarity measure (Bhattacharyya bound) for different phoneme groups, using data from 30 speakers. Panels: vowels (/a/-zul, /a/-nbl, /a/-ven), nasals (/m/-zul, /m/-nbl, /m/-ven), voiced plosives (/b/-ssw, /g/-zul, /g/-nbl), and unvoiced fricatives (/s/-zul, /s/-nbl, /s/-ssw).

SLIDE 15

Number of speakers

Figure: Effect of the number of speakers on the similarity measure (Bhattacharyya bound) for different phoneme groups, using 20 utterances per speaker. Panels: vowels (/a/-zul, /a/-nbl, /a/-ven), nasals (/m/-zul, /m/-nbl, /m/-ven), voiced plosives (/b/-ssw, /g/-zul, /g/-nbl), and unvoiced fricatives (/s/-zul, /s/-nbl, /s/-ssw).

SLIDE 16

Initial ASR Accuracy

Accuracy of phoneme recognisers:

- Developed initial ASR systems for all of the Bantu languages
- Test sets: 30 speakers per language
- Each ASR system is a phoneme recogniser with a flat language model
- A rough benchmark of acceptable phoneme accuracy: N-TIMIT

Figure: Accuracy (%) of the phoneme recognisers per language, with N-TIMIT (TIM) included as a benchmark.
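The slides report phoneme accuracy without defining it; the conventional definition is (N - S - D - I) / N, where S, D, and I are the substitutions, deletions, and insertions in a Levenshtein alignment of the reference and hypothesis phoneme strings. A minimal sketch, with made-up strings whose letters stand in for phonemes:

```python
def phoneme_accuracy(ref, hyp):
    """Phoneme accuracy (N - S - D - I) / N. Each edit operation costs 1,
    so the raw Levenshtein distance equals S + D + I."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return (n - d[n][m]) / n

# One deleted symbol out of eight in the (hypothetical) recogniser output.
print(phoneme_accuracy(list("sawubona"), list("sawbona")))  # 0.875
```

Because insertions are subtracted, accuracy can be negative for very noisy output, which is one reason a flat (uniform) phoneme language model is a deliberately hard benchmark condition.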

SLIDE 17

Impact of data reduction

Reducing the training data by a division factor of 8:

- Leaves approximately 20 training speakers
- Correlates well with the stable phoneme similarity values

Figure: Reducing the number of speakers has (approximately) the same effect as reducing the amount of speech per speaker.

SLIDE 18

Distances between phonemes

Based upon the demonstrated stability of our phoneme models, we measure phoneme similarity between phonemes across languages.

Figure: Effective distances for the isiNdebele phonemes /a/ and /n/ and their closest matches.
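One way the cross-language matching could be realised under the stated approach: compute the Bhattacharyya distance from each phoneme model in one language to every model in another, and keep the nearest. All model parameters below are synthetic, and diagonal covariances are assumed:

```python
import numpy as np

def bhatt_distance(m1, v1, m2, v2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians
    (mean vectors m1, m2 and per-dimension variances v1, v2)."""
    v = 0.5 * (v1 + v2)
    mean_term = 0.125 * np.sum((m1 - m2) ** 2 / v)
    cov_term = 0.5 * np.sum(np.log(v) - 0.5 * (np.log(v1) + np.log(v2)))
    return mean_term + cov_term

rng = np.random.default_rng(1)
# Hypothetical (mean, variance) phoneme models; labels are illustrative.
nbl = {"a": (rng.normal(size=4), np.ones(4))}
other = {
    "a-zul": (rng.normal(size=4) * 0.1, np.ones(4)),
    "s-zul": (rng.normal(size=4) + 5.0, np.ones(4)),
}

for phone, (m, v) in nbl.items():
    match = min(other, key=lambda p: bhatt_distance(m, v, *other[p]))
    print(phone, "closest cross-language match:", match)
```

A small distance to a phoneme of another language suggests that language's data could help bootstrap a recogniser where local data is scarce.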

SLIDE 19

Conclusion

- A new method to determine data sufficiency
- Confirmed that different phoneme classes have different data requirements
- Our results suggest that similar phoneme accuracies may be achievable by using more speech from fewer speakers
- Based on the demonstrated model stability, we successfully measured distances between phonemes of different languages

SLIDE 20

Conclusion

Project Lwazi website: http://www.meraka.org.za/lwazi

- More info
- Download corpora (ASR, TTS)
- Download tools
- Contact details