NSTDA-TITECH Workshop - November 2006 1
Thai Speech Processing Activities at NECTEC
Chai Wutiwiwatchai, Ph.D.
National Electronics and Computer Technology Center (NECTEC)
Outline
- Brief history
- Current activities
- Speech corpora
- Automatic speech recognition (ASR)
- Text-to-speech synthesis (TTS)
- Other related topics
- Demonstration & problems
- Future plan
Brief History
[Timeline figure: milestones at 1997, 2000 and 2005, showing the SID, TTS, ASR and SST activities]
- SID : Speaker identification
- TTS : Text-to-speech synthesis
- ASR : Automatic speech recognition
- SST : Speech-to-speech translation
ASR Project
- ASR resources
- “iSpeech” toolkit
- Robust ASR
- Thai LVCSR
ASR Resources
NECTEC-ATR (2002, in collaboration with ATR, Japan)
- Purpose: various Thai speech for ASR research
- 5000 frequent words
- Phone-balanced utterances
- Hotel reservation utterances
- Read speech, 54 hrs., 48 speakers
LOTUS (2005, in collaboration with PSU & MU, Thailand)
- Purpose: 5000-word dictation system
- Phone-balanced utterances
- 5000-covered utterances
- Read speech, 70 hrs., 48 speakers
VoiceCom (2005)
- Purpose: isolated commands
- Common isolated commands
- 24 speakers
- Website
http://www.nectec.or.th/rdi/lotus
“iSpeech” Toolkit
- Version 1.0 (2005)
- Isolated word recognition
- Monophone model
- Version 1.5 (2006)
- Model selection for robust ASR
- Automatic endpoint detection
- Version 2.0 (2006)
- Regular grammar model
- Cross-word triphone model
- Website
http://www.nectec.or.th/rdi/ispeech
Robust ASR (1)
- General approaches for robust ASR
- Robust parameterization
- Model selection
- Robust topology
- Combination
Robust ASR (2)
- Wavelet-based denoising
[Diagram: Speech is split into high-band (H) and low-band (L) coefficients; wavelet thresholding is applied to each band; the output is the denoised speech]
[Chart: Accuracy % (axis 20 to 60) under Clean, Waterfall, Fan, Computer and Shaving conditions, Baseline vs. Denoising]
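The band-splitting and thresholding steps in the diagram can be sketched as follows. This is a minimal illustration, not the method used in the experiments: the slide does not name the wavelet, the number of decomposition levels, or the thresholding rule, so a one-level Haar transform, soft thresholding, and the threshold arguments `low_t` and `high_t` are all assumptions. The input is assumed to have an even number of samples.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_split(signal):
    """One-level Haar transform: returns (low_band, high_band) coefficients."""
    pairs = range(0, len(signal) - 1, 2)
    low = [(signal[i] + signal[i + 1]) / SQRT2 for i in pairs]
    high = [(signal[i] - signal[i + 1]) / SQRT2 for i in pairs]
    return low, high

def haar_merge(low, high):
    """Inverse of haar_split: rebuilds the time-domain samples."""
    out = []
    for a, d in zip(low, high):
        out.append((a + d) / SQRT2)
        out.append((a - d) / SQRT2)
    return out

def soft_threshold(coeffs, t):
    """Shrink every coefficient toward zero by t (assumed thresholding rule)."""
    return [math.copysign(max(abs(c) - t, 0.0), c) for c in coeffs]

def denoise(signal, low_t, high_t):
    """Threshold both bands, as in the diagram, then reconstruct."""
    low, high = haar_split(signal)
    return haar_merge(soft_threshold(low, low_t), soft_threshold(high, high_t))
```

With both thresholds at zero the signal is reconstructed unchanged; a large `high_t` removes all high-band detail and leaves pairwise averages.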
Robust ASR (3)
- Model selection
[Diagram: Speech -> Noise classification -> Noise-specific acoustic models -> Speech recognition -> Result]
- Feature: MFCC, LSF, NLS (+ PCA)
- Classifier: SVM, ANN, HMM
[Chart: Accuracy % (axis 40 to 80) for No robustness, Multiconditioned acoustic model, PCA-NLS & ANN, and 100% noise classification]
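The model-selection flow (classify the noise first, then decode with the matching acoustic model) can be sketched as below. The slide lists SVM, ANN and HMM classifiers; a nearest-centroid classifier is swapped in here only to keep the sketch self-contained. The centroid vectors, noise labels and model names are hypothetical.

```python
import math

def classify_noise(features, centroids):
    """Return the noise label whose centroid is closest (Euclidean distance).

    Stand-in for the SVM/ANN/HMM classifiers named on the slide.
    """
    return min(centroids, key=lambda label: math.dist(features, centroids[label]))

def select_model(features, centroids, models):
    """Pick the noise-specific acoustic model for the classified noise type."""
    return models[classify_noise(features, centroids)]

# Hypothetical two-dimensional noise features and model names.
centroids = {"fan": [1.0, 0.0], "waterfall": [0.0, 1.0]}
models = {"fan": "am_fan", "waterfall": "am_waterfall"}
chosen = select_model([0.9, 0.1], centroids, models)
```

The recognizer would then decode the utterance with `chosen`; a misclassified noise type selects a mismatched model, which is presumably why the chart includes a 100% noise classification case as a reference.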
Robust ASR (4)
- Tree-based model selection
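[Diagram: tree of acoustic models, with leaves for each noise type and SNR (Noise 1 SNR 1, Noise 1 SNR 2, ..., Noise N SNR N) and a root covering all noises and all SNRs]
- MLLR transformation matrix per node
- Automatic noise clustering/merging
- GMM-based similarity measure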
Thai LVCSR (1)
- Phoneme inventory optimization
Basic phonemes
- Vowel: i ii e ee x xx v vv q qq a aa u uu o oo @ @@
- Consonant: p t k c ph th kh ch b d m n ng w j r l z h
Syllable-structured phonemes
- Vowel: i ii e ee x xx v vv q qq a aa u uu o oo @ @@ ia iia va vva ua uua
- Initial consonant: p t k c ph th kh ch b d m n ng w j r l z h pr tr kr phr thr khr kl phl khl kw khw
- Final consonant: P T K M N NG W J
Thai LVCSR (2)
- 5K-word dictation system
- Acoustic modeling: 40 hrs. 48 spks.
- Language modeling: 0.07 Mwords
- Perplexity: 140
- Evaluation: 460 utts. 10 spks.
[Chart: Word accuracy % (axis 40 to 80) for No LM, LM by original transcription, and LM by realigned transcription]
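For reference, word accuracy is commonly computed as (N - S - D - I) / N, where N is the number of reference words and S, D and I are the substitutions, deletions and insertions in the best alignment. The slide does not state which scoring tool or definition was used, so the sketch below assumes this standard definition; the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions
    needed to turn the reference word list into the hypothesis."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]

def word_accuracy(ref, hyp):
    """Word accuracy in percent, (N - S - D - I) / N * 100."""
    return 100.0 * (len(ref) - edit_distance(ref, hyp)) / len(ref)
```

Because insertions are penalized, this measure can fall below zero for very poor hypotheses.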
TTS Project
- “Vaja” TTS engine
- TTS resources
- Prosody prediction
- Text processing
- Space reduction
“Vaja” TTS Engine
- Version 2.0 (2000)
- Demisyllable concatenation
- Version 3.0 (2003)
- Corpus-based unit-selection
- Version 4.0 (2006)
- Multithread
- Client/server
- Version 5.0 (2007)
- Naturalness improvement
- Space reduction
- Website
http://www.nectec.or.th/rdi/vaja
TTS Resources
ORCHID (1997)
- Purpose: Thai text corpus for text processing
- 27,000 sentences
- Word segmentation
- POS-tagged
TSynC-1 (2003)
- Purpose: Thai speech corpus for unit-selection speech synthesis
- Triphone, tritone covered
- 13 hrs., a fluent female speaker
- Prosody tagged
Prosody Prediction (1)
- Sentence/Phrase breaking
- Syllable-duration modeling
Prosody Prediction (2)
- Sentence/Phrase breaking
[Diagram: Preprocessed text -> Feature extraction -> Machine learning -> Break/Non-break]
- Features: POS of current and neighboring words; number of syllables/words from the previous break
- Learners: C4.5, RIPPER, CART, neural network, POS n-gram
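The feature-extraction step can be sketched as below for a single word juncture. The tuple layout, the feature names and the English example words and tags are hypothetical; only the feature types (neighboring POS tags, distance from the previous break) come from the slide. The resulting dictionary would be passed to one of the listed learners.

```python
def break_features(words, index, last_break):
    """Features for the juncture after words[index].

    words: list of (word, pos_tag, n_syllables) tuples (assumed layout).
    last_break: index of the first word after the previous break.
    """
    def pos(i):
        # Pad with a dummy tag outside the sentence boundaries.
        return words[i][1] if 0 <= i < len(words) else "NONE"

    span = words[last_break:index + 1]
    return {
        "pos_prev": pos(index - 1),
        "pos_cur": pos(index),
        "pos_next": pos(index + 1),
        "words_since_break": len(span),
        "syllables_since_break": sum(w[2] for w in span),
    }

# Placeholder sentence; real input would be segmented, POS-tagged Thai text.
sentence = [("I", "PRON", 1), ("really", "ADV", 2),
            ("like", "VERB", 1), ("tea", "NOUN", 1)]
features = break_features(sentence, 2, 0)
```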
Prosody Prediction (3)
- Syllable-duration modeling
[Diagram: Duration-tagged speech samples -> Regression analysis -> Regression model]
Factors:
- Phoneme
- Tone
- Position
The regression model predicts syllable duration with fair precision (0.73 correlation to references)
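The reported figure compares predicted durations with reference durations. The slide does not say which correlation coefficient was used; the sketch below assumes the Pearson coefficient, and the function name is illustrative.

```python
import math

def pearson(predicted, reference):
    """Pearson correlation between two equal-length lists of durations."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    return cov / math.sqrt(var_p * var_r)
```

A value of 1.0 means the predictions rise and fall exactly with the references, even if they are offset or scaled, so correlation alone does not measure absolute duration error.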
Text Processing (1)
- Word segmentation
- Part-of-speech tagging
- Grapheme-to-phoneme (G2P) conversion
Text Processing (2)
- G2P difficulties
- Context-dependent segmentation ambiguity (CDSA)
NOWHERE |NOW|HERE| or |NOWHERE|
- Context-independent segmentation ambiguity (CISA)
TOGETHER |TOGETHER| or |TO|GET|HER|
- Homograph ambiguity
LEAD /l i d/ or /l e d/
%Acc        Trigram  Bayesian  Winnow
CDSA        73.0     93.2      95.7
CISA        98.3     99.7      99.7
Homograph   52.5     94.3      96.5
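The segmentation ambiguities above arise because an unspaced string can match the lexicon in more than one way. A minimal sketch that enumerates every lexicon-consistent segmentation, using the slide's English examples and a hypothetical toy lexicon:

```python
def segmentations(text, lexicon):
    """Return every way to split text into a sequence of lexicon words."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            for tail in segmentations(text[end:], lexicon):
                results.append([head] + tail)
    return results

# Toy lexicon, for illustration only.
lexicon = {"NOW", "HERE", "NOWHERE", "TO", "GET", "HER", "TOGETHER"}
nowhere = segmentations("NOWHERE", lexicon)
together = segmentations("TOGETHER", lexicon)
```

Choosing among the candidates is the job of the trigram, Bayesian or Winnow classifiers compared in the table; this sketch only generates the candidate set.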
Space Reduction
[Chart: % space reduction (axis 10 to 100) and mean opinion score (scale 1 to 5) versus maximum frequency of diphone (1, 10, 20, 50, 100, 200, 500, All)]
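One reading of the chart is that the unit inventory is pruned by capping how many instances of each diphone are kept, trading storage for quality. The sketch below follows that reading only as an assumption: it keeps the first instances in corpus order and treats all units as equal in size, neither of which the slide confirms. The diphone labels and unit ids are hypothetical.

```python
from collections import defaultdict

def prune_inventory(units, max_per_diphone):
    """Keep at most max_per_diphone instances of each diphone, in corpus order.

    units: list of (diphone_label, unit_id) pairs.
    """
    kept = []
    counts = defaultdict(int)
    for diphone, unit_id in units:
        if counts[diphone] < max_per_diphone:
            counts[diphone] += 1
            kept.append((diphone, unit_id))
    return kept

def space_reduction_percent(units, kept):
    """Share of units removed, assuming every unit takes the same space."""
    return 100.0 * (len(units) - len(kept)) / len(units)

# Hypothetical four-unit inventory.
units = [("a-b", 0), ("a-b", 1), ("a-b", 2), ("b-c", 3)]
kept = prune_inventory(units, 1)
reduction = space_reduction_percent(units, kept)
```

A practical selector would likely rank instances by prosodic or spectral coverage rather than corpus order.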
SST Project (1)
- 2006 SST prototype
SST Project (2)
- 2006 SST prototype
- English-to-Thai
- Travel domain
- Push-to-talk
- ASR : CMU Sphinx III
- MT : Nectec Parsit, a rule-based MT
- TTS : Nectec Vaja
Conclusion
Thai Speech Technology at NECTEC
ASR
- Toolkit: isolated word, regular grammar
- Robust: robust feature, model selection
- LVCSR: phone inventory, transcript system
- Corpora: NECTEC-ATR, LOTUS
TTS
- Engine: unit selection
- Prosody: phrase break, duration
- Text process: word segment, G2P
- Corpora: TSynC-1
- Space reduction
SST
Future Plan
ASR
- “iSpeech-N” : N-gram based ASR
- Telephone conversational corpus & model
- Modified tree-based model selection
TTS
- Incorporating prosodic models
- TSynC-2
- HMM-based TTS
Future Plan
SST
- Two-way SST
- A travel domain parallel corpus
- Example-based MT & translation memory
- Spoken language MT
Tentative Collaborative Projects
HMM-based TTS
- An available large speech corpus
- Producing highly smoothed speech
- The first system for Thai
ASR for spontaneous telephone speech
- Corpus under development
- Highly spontaneous dialogues
- Telephone channel & environmental noises