
Collecting and evaluating speech recognition corpora for nine Southern Bantu languages
Jaco Badenhorst, Charl van Heerden, Marelie Davel and Etienne Barnard
March 31, 2009

Outline
- Introduction
- Background: ASR corpus design
- The Lwazi ASR corpus
- Computational analysis
  - Approach
  - Analysis of phoneme variability
- Conclusion

Introduction
Information flow in developing countries:
- Availability of alternate information sources is low in developing countries
- Telephone networks (cellular) are spreading rapidly
Spoken dialog systems (SDSs):
- Widespread belief that impact can be significant
- Speech-based access can empower semi-literate people
Applications of SDSs:
- Education (speech-enabled learning)
- Agriculture
- Health care
- Government services

Introduction
To implement SDSs, ASR and TTS systems are needed.
Main linguistic resources needed for telephone-based ASR systems:
- Electronic pronunciation dictionaries
- Annotated audio corpora
- Recognition grammars
Challenges:
- ASR is available for only a handful of African languages
- Lack of linguistic resources for African languages
- Lack of relevant audio for the specific application (language used, profile of speakers, speaking style, etc.)

ASR audio corpus
Collecting an ASR audio corpus is a resource-intensive process.
Factors that add to complexity:
- Recordings of multiple speakers
- Matching channel and style
- Careful orthographic transcription
- Markers required to indicate important events (e.g. non-speech)
Size of corpora:
- Corpora of resource-scarce languages tend to be very small (1-10 hours of audio)
- This contrasts with speech corpora used to build commercial systems (hundreds to thousands of hours)

Project Lwazi
- Three-year (2006-2009) project commissioned by the South African Department of Arts and Culture
- Development of core speech technology resources and components (ASR, TTS, SDS, etc.)
- National pilot demonstrating the potential impact of speech-based systems in South Africa
- Covers all 11 official languages of South Africa

Project Lwazi: Languages
Distribution of home languages for the South African population: 9 Southern Bantu languages, 2 Germanic languages

Project Lwazi ASR corpus
- Approximately 200 speakers per language
- Speaker population selected to provide a balanced profile with regard to age, gender and type of telephone (cellphone/landline)
- Read and elicited speech recorded over a telephone channel
- 30 utterances per speaker:
  - 16 randomly selected from a phonetically balanced corpus
  - 14 short words and phrases
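The per-speaker prompt composition described above (16 phonetically balanced sentences plus 14 short words and phrases) can be sketched as a small sampling routine. This is only an illustration: the function name `build_prompt_sheet`, the placeholder prompt lists, and the choice to sample the short items from a list are all assumptions, not details taken from the Lwazi collection tools.

```python
import random

def build_prompt_sheet(balanced_pool, short_prompts, rng,
                       n_balanced=16, n_short=14):
    """Return one speaker's prompt list: n_balanced + n_short items.

    balanced_pool and short_prompts are hypothetical lists of prompt strings.
    """
    if len(balanced_pool) < n_balanced or len(short_prompts) < n_short:
        raise ValueError("not enough prompts to sample from")
    # Sample without replacement so no prompt repeats within a session.
    balanced = rng.sample(balanced_pool, n_balanced)
    short = rng.sample(short_prompts, n_short)
    return balanced + short

# Placeholder prompts standing in for real corpus text.
pool = [f"balanced sentence {i}" for i in range(500)]
short = [f"short phrase {i}" for i in range(40)]
sheet = build_prompt_sheet(pool, short, random.Random(2009))
print(len(sheet))
```

Passing a seeded `random.Random` instance keeps each speaker's sheet reproducible, which helps when transcriptions are later checked against the prompts.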

Project Lwazi: Southern Bantu languages
Figure: Amount of data within the Lwazi ASR corpus. Two bar charts: distinct phonemes per language and speech minutes per language, for tsn, ven, ssw, sot, nso, zul, nbl, xho and tso.

Computational analysis
Goal:
- Understand the data requirements to develop a minimal system that is practically usable
- Use it as a seed ASR system to collect additional resources
- Understand the implications of additional speakers and utterances
Develop tools:
- Provide an indication of data sufficiency
- Assess the potential for cross-language sharing

Computational analysis
Approach:
- Measure acoustic variance in terms of the separability between probability densities by modelling specific phonemes
- The statistical measure provides an indication of the effect that additional training data will have on recognition accuracy
- Utilise the same measure as an indication of acoustic similarity across languages
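The figures in this deck label the separability measure as the Bhattacharyya bound. A minimal sketch of that quantity for two Gaussian densities follows, assuming diagonal covariances and equal class priors (both assumptions; the slides only state that single-mixture models were used). With equal priors the bound on the two-class Bayes error is 0.5 * exp(-D_B), so identical densities give 0.5 and well-separated densities approach 0.

```python
import math
import numpy as np

def bhattacharyya_distance(mean1, var1, mean2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians."""
    mean1, var1 = np.asarray(mean1, float), np.asarray(var1, float)
    mean2, var2 = np.asarray(mean2, float), np.asarray(var2, float)
    avg_var = 0.5 * (var1 + var2)
    # Term driven by the difference between the means.
    mean_term = 0.125 * np.sum((mean1 - mean2) ** 2 / avg_var)
    # Term driven by the difference between the variances.
    cov_term = 0.5 * np.sum(np.log(avg_var / np.sqrt(var1 * var2)))
    return float(mean_term + cov_term)

def bhattacharyya_bound(mean1, var1, mean2, var2):
    """Upper bound on the Bayes error for two equally likely classes."""
    return 0.5 * math.exp(-bhattacharyya_distance(mean1, var1, mean2, var2))

same = bhattacharyya_bound([0.0], [1.0], [0.0], [1.0])
apart = bhattacharyya_bound([0.0], [1.0], [2.0], [1.0])
print(same, apart)
```

The one-dimensional inputs here are illustrative values only; in practice the means and variances would come from acoustic feature vectors of the phoneme being modelled.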

Computational analysis
We mainly focus on four languages here:
- isiNdebele (nbl)
- siSwati (ssw)
- isiZulu (zul)
- Tshivenda (ven)
We report only on single-mixture context-independent models (similar trends were observed for more complex models).
We report on examples from several broad categories of phonemes (SAMPA) that occur most frequently in the target languages:
- /a/ (vowels)
- /m/ (nasals)
- /b/ and /g/ (voiced plosives)
- /s/ (unvoiced fricatives)

Analysis of phoneme variability
Figure: Speaker-and-utterance three-dimensional plot for the siSwati nasal /m/ (axes: number of speakers, phone observations per speaker, Bhattacharyya bound)

Number of phoneme utterances
Figure: Effect of number of phoneme utterances per speaker on similarity measure for different phoneme groups using data from 30 speakers. Four panels plot the Bhattacharyya bound against phone observations per speaker: /m/ (zul, nbl, ven), /s/ (zul, nbl, ssw), /a/ (zul, nbl, ven), and /b/ (ssw) with /g/ (zul, nbl).

Number of speakers
Figure: Effect of number of speakers on similarity measure for different phoneme groups using 20 utterances per speaker. Four panels plot the Bhattacharyya bound against number of speakers: /m/ (zul, nbl, ven), /s/ (zul, nbl, ssw), /a/ (zul, nbl, ven), and /b/ (ssw) with /g/ (zul, nbl).
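One way to read the curves in these figures is sketched below. It assumes the bound is computed between two models of the same phoneme trained on disjoint subsets of the data, which the slides do not spell out, and it uses synthetic Gaussian features rather than corpus data. Under that assumption, too little data gives unstable estimates and a bound well below 0.5, while enough data lets the two models converge so the bound levels off near 0.5. The names `bound_between_samples` and `mean_bound`, and the 13-dimensional feature size, are illustrative choices.

```python
import numpy as np

def bound_between_samples(x1, x2):
    """Fit a diagonal Gaussian to each sample set; return 0.5 * exp(-D_B)."""
    m1, v1 = x1.mean(axis=0), x1.var(axis=0)
    m2, v2 = x2.mean(axis=0), x2.var(axis=0)
    avg = 0.5 * (v1 + v2)
    d_b = (0.125 * np.sum((m1 - m2) ** 2 / avg)
           + 0.5 * np.sum(np.log(avg / np.sqrt(v1 * v2))))
    return 0.5 * np.exp(-d_b)

def mean_bound(n_obs, dim=13, trials=20, seed=0):
    """Average bound between two disjoint synthetic sets of n_obs frames."""
    rng = np.random.default_rng(seed)
    bounds = []
    for _ in range(trials):
        # Both sets come from the same density, standing in for one phoneme.
        a = rng.normal(size=(n_obs, dim))
        b = rng.normal(size=(n_obs, dim))
        bounds.append(bound_between_samples(a, b))
    return float(np.mean(bounds))

for n in (5, 20, 80, 320):
    print(n, round(mean_bound(n), 3))
```

The point at which the printed values stop rising is the synthetic analogue of the plateau in the plots: beyond it, extra observations change the estimated density very little.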

Initial ASR accuracy
- Developed initial ASR systems for all of the Bantu languages
- Test sets: 30 speakers per language
- ASR system is a phoneme recogniser with a flat language model
- A rough benchmark of acceptable phoneme accuracy: N-TIMIT
Figure: Accuracy of phoneme recognisers (accuracy in % per language, with N-TIMIT shown for comparison)

Impact of data reduction
- Division factor of 8: approximately 20 training speakers
- Results correlate well with the stable phoneme similarity values
Figure: Reducing the number of speakers has (approximately) the same effect as reducing the amount of speech per speaker

Distances between phonemes
Based upon the proven stability of our phoneme models, we measure the similarity between phonemes across languages.
Figure: Effective distances for isiNdebele phonemes /a/ and /n/ and their closest matches
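Finding the closest cross-language matches for a phoneme can be sketched as a nearest-neighbour search under the same separability measure. The example below assumes diagonal-covariance Gaussian phoneme models and uses the Bhattacharyya distance directly; the model parameters are toy numbers invented for illustration, not values from the Lwazi corpus, and `closest_matches` is a hypothetical helper.

```python
import numpy as np

def bhattacharyya_distance(mean1, var1, mean2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians."""
    mean1, var1 = np.asarray(mean1, float), np.asarray(var1, float)
    mean2, var2 = np.asarray(mean2, float), np.asarray(var2, float)
    avg = 0.5 * (var1 + var2)
    return float(0.125 * np.sum((mean1 - mean2) ** 2 / avg)
                 + 0.5 * np.sum(np.log(avg / np.sqrt(var1 * var2))))

# Toy (mean, variance) pairs keyed by (language, phoneme); illustrative only.
models = {
    ("nbl", "a"): ([1.0, 0.0], [1.0, 1.0]),
    ("zul", "a"): ([1.1, 0.1], [1.0, 1.2]),
    ("ssw", "a"): ([1.4, -0.2], [0.9, 1.0]),
    ("zul", "n"): ([-2.0, 1.5], [0.8, 0.7]),
    ("ven", "e"): ([0.2, 1.0], [1.1, 0.9]),
}

def closest_matches(query, models, k=3):
    """Return the k nearest phonemes from languages other than the query's."""
    q_mean, q_var = models[query]
    scored = [
        (bhattacharyya_distance(q_mean, q_var, mean, var), key)
        for key, (mean, var) in models.items()
        if key[0] != query[0]  # skip phonemes from the query's own language
    ]
    return sorted(scored)[:k]

for dist, (lang, phone) in closest_matches(("nbl", "a"), models):
    print(f"/{phone}/-{lang}: {dist:.3f}")
```

A small distance corresponds to a bound close to 0.5, so the top-ranked entries are the candidates for sharing data across languages.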

Conclusion
- New method to determine data sufficiency
- Confirmed that different phoneme classes have different data requirements
- Our results suggest that similar phoneme accuracies may be achievable by using more speech from fewer speakers
- Based upon proven model stability, we successfully measured distances between phonemes of different languages

Conclusion
Project Lwazi website: http://www.meraka.org.za/lwazi
- More info
- Download corpora (ASR, TTS)
- Download tools
- Contact details
