SLIDE 1

NSTDA-TITECH Workshop - November 2006 1

Thai Speech Processing Activities at NECTEC

Chai Wutiwiwatchai, Ph.D.

National Electronics and Computer Technology Center (NECTEC)

SLIDE 2

Outline

Brief history
Current activities
  Speech corpora
  Automatic speech recognition (ASR)
  Text-to-speech synthesis (TTS)
  Other related topics
Demonstration & problems
Future plan

SLIDE 3

Brief History

[Timeline figure with year marks 1997, 2000 and 2005 and project labels SID, TTS, ASR and SST]

SID : Speaker identification
TTS : Text-to-speech synthesis
ASR : Automatic speech recognition
SST : Speech-to-speech translation

SLIDE 4

ASR Project

ASR resources
“iSpeech” toolkit
Robust ASR
Thai LVCSR

SLIDE 5

ASR Resources

Name: NECTEC-ATR
  Year: 2002
  Collab.: ATR, Japan
  Purpose: Various Thai speech for ASR research
  Detail: 5000 frequent words; phone-balanced utterances; hotel reservation utterances; read speech, 54 hrs., 48 speakers

Name: LOTUS
  Year: 2005
  Collab.: PSU & MU, Thailand
  Purpose: 5000-word dictation system
  Detail: phone-balanced utterances; 5000-covered utterances; read speech, 70 hrs., 48 speakers

Name: VoiceCom
  Year: 2005
  Purpose: Isolated commands
  Detail: common isolated commands; 24 speakers

http://www.nectec.or.th/rdi/lotus

SLIDE 6

“iSpeech” Toolkit

Version 1.0 (2005)
  Isolated word recognition
  Monophone model
Version 1.5 (2006)
  Model selection for robust ASR
  Automatic endpoint detection
Version 2.0 (2006)
  Regular grammar model
  Cross-word triphone model
Website: http://www.nectec.or.th/rdi/ispeech

SLIDE 7

Robust ASR (1)

General approaches for robust ASR
Robust parameterization
Model selection
Robust topology
Combination

SLIDE 8

Robust ASR (2)

Wavelet-based denoising

Speech -> decomposition into high-band (H) and low-band (L) coefficients -> wavelet thresholding on each band -> denoised speech

[Chart: accuracy (%) of the baseline and denoising systems under clean, waterfall, fan, computer and shaving noise conditions]
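
The denoising flow on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the slide does not name the wavelet family, the decomposition depth or the thresholding rule, so a one-level Haar transform and soft thresholding are used here as stand-ins, and the threshold values are hypothetical parameters.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_dwt(signal):
    """One-level Haar transform of an even-length signal.

    Returns (low-band, high-band) coefficient lists.
    """
    low = [(signal[i] + signal[i + 1]) / SQRT2 for i in range(0, len(signal) - 1, 2)]
    high = [(signal[i] - signal[i + 1]) / SQRT2 for i in range(0, len(signal) - 1, 2)]
    return low, high

def haar_idwt(low, high):
    """Inverse of haar_dwt: interleave the reconstructed sample pairs."""
    out = []
    for a, d in zip(low, high):
        out.append((a + d) / SQRT2)
        out.append((a - d) / SQRT2)
    return out

def soft_threshold(coeffs, t):
    """Shrink every coefficient toward zero by t (assumed thresholding rule)."""
    return [math.copysign(max(abs(c) - t, 0.0), c) for c in coeffs]

def denoise(signal, low_t, high_t):
    """Threshold both bands separately, then reconstruct the signal."""
    low, high = haar_dwt(signal)
    return haar_idwt(soft_threshold(low, low_t), soft_threshold(high, high_t))
```

With both thresholds at zero the signal is reconstructed unchanged; a large high-band threshold replaces each sample pair by its average, which is the smoothing effect the thresholding step relies on.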

SLIDE 9

Robust ASR (3)

Model selection

[Diagram: input speech feeds noise classification and speech recognition; noise-specific acoustic models link the two, and recognition outputs the result]

Feature: MFCC, LSF, NLS (+ PCA)
Classifier: SVM, ANN, HMM

[Chart: accuracy (%) for no robustness, multiconditioned acoustic model, PCA-NLS & ANN, and 100% noise classification]
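
A minimal sketch of the model-selection idea: classify the noise first, then hand the utterance to the matching noise-specific acoustic model. The slide lists SVM, ANN and HMM classifiers; a nearest-centroid classifier is swapped in here purely to keep the example short. The centroid values, the two-dimensional feature space and the model names are all hypothetical.

```python
import math

# Hypothetical noise-class centroids in a made-up 2-D feature space.
NOISE_CENTROIDS = {
    "clean": (0.0, 0.0),
    "fan": (1.0, 0.2),
    "computer": (0.3, 1.0),
}

# Hypothetical identifiers for the noise-specific acoustic models.
ACOUSTIC_MODELS = {name: f"acoustic_model_{name}" for name in NOISE_CENTROIDS}

def classify_noise(features):
    """Return the noise class whose centroid is closest to the feature vector."""
    return min(NOISE_CENTROIDS, key=lambda n: math.dist(features, NOISE_CENTROIDS[n]))

def select_model(features):
    """Pick the acoustic model that matches the classified noise type."""
    return ACOUSTIC_MODELS[classify_noise(features)]
```

For example, `select_model((0.9, 0.1))` picks the fan-noise model, since that centroid is nearest.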

SLIDE 10

Robust ASR (4)

Tree-based model selection

[Diagram: tree of noise conditions, from leaves "Noise 1 SNR 1", "Noise 1 SNR 2", ..., "Noise N SNR N" up to a root covering all noises and all SNRs]
MLLR transformation matrix per node
Automatic noise clustering/merging
GMM-based similarity measure
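
The automatic clustering/merging step could look roughly like the sketch below, which greedily merges the two most similar noise conditions until one root remains. Several things here are assumptions rather than the slide's method: each condition is summarized by a single 1-D Gaussian instead of a GMM, similarity is the symmetric KL divergence, merged nodes are moment-matched with equal weights, and the per-node MLLR transforms are left out.

```python
def sym_kl(g1, g2):
    """Symmetric KL divergence between 1-D Gaussians given as (mean, variance)."""
    (m1, v1), (m2, v2) = g1, g2
    return 0.5 * (v1 / v2 + v2 / v1 - 2.0) + 0.5 * (m1 - m2) ** 2 * (1.0 / v1 + 1.0 / v2)

def merge(g1, g2):
    """Moment-matched Gaussian for an equal-weight mixture of g1 and g2."""
    (m1, v1), (m2, v2) = g1, g2
    mean = (m1 + m2) / 2.0
    var = (v1 + v2) / 2.0 + ((m1 - m2) / 2.0) ** 2
    return (mean, var)

def build_tree(leaves):
    """leaves: dict mapping a condition label to (mean, variance).

    Returns the merge tree as nested tuples of labels.
    """
    nodes = list(leaves.items())
    while len(nodes) > 1:
        # Find the pair of current nodes with the smallest divergence.
        i, j = min(
            ((a, b) for a in range(len(nodes)) for b in range(a + 1, len(nodes))),
            key=lambda p: sym_kl(nodes[p[0]][1], nodes[p[1]][1]),
        )
        (label_a, g_a), (label_b, g_b) = nodes[i], nodes[j]
        merged = ((label_a, label_b), merge(g_a, g_b))
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0][0]
```

With hypothetical leaves such as two fan-noise conditions at different SNRs and one shaver condition, the two fan conditions merge first and the shaver condition joins them at the root.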

SLIDE 11

Thai LVCSR (1)

Phoneme inventory optimization

Syllable-structured phonemes
  Vowel: i ii e ee x xx v vv q qq a aa u uu o oo @ @@ ia iia va vva ua uua
  Initial consonant: p t k c ph th kh ch b d m n ng w j r l z h pr tr kr phr thr khr kl phl khl kw khw
  Final consonant: P T K M N NG W J

Basic phonemes
  Vowel: i ii e ee x xx v vv q qq a aa u uu o oo @ @@
  Consonant: p t k c ph th kh ch b d m n ng w j r l z h

SLIDE 12

Thai LVCSR (2)

5K-word dictation system
Acoustic modeling: 40 hrs. 48 spks.
Language modeling: 0.07 Mwords
Perplexity: 140
Evaluation: 460 utts. 10 spks.

[Chart: word accuracy (%) for no LM, LM by original transcription, and LM by realigned transcription]

SLIDE 13

TTS Project

“Vaja” TTS engine
TTS resources
Prosody prediction
Text processing
Space reduction

SLIDE 14

“Vaja” TTS Engine

Version 2.0 (2000)
  Demisyllable concatenation
Version 3.0 (2003)
  Corpus-based unit-selection
Version 4.0 (2006)
  Multithread
  Client/server
Version 5.0 (2007)
  Naturalness improvement
  Space reduction
Website: http://www.nectec.or.th/rdi/vaja

SLIDE 15

TTS Resources

Name: ORCHID
  Year: 1997
  Purpose: Thai text corpus for text processing
  Detail: 27,000 sentences; word segmentation; POS-tagged

Name: TSynC-1
  Year: 2003
  Purpose: Thai speech corpus for unit-selection speech synthesis
  Detail: triphone and tritone coverage; 13 hrs., one fluent female speaker; prosody tagged

SLIDE 16

Prosody Prediction (1)

Sentence/Phrase breaking
Syllable-duration modeling

SLIDE 17

Prosody Prediction (2)

Sentence/Phrase breaking

Preprocessed text -> feature extraction -> machine learning -> break/non-break

Features:
  POS of current and neighboring words
  No. of syllables/words from previous break
Learning methods: C4.5, RIPPER, CART, neural network, POS n-gram
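
The feature-extraction step named on this slide might be sketched as follows. The function name, the feature keys, the one-word context window and the `<s>`/`</s>` padding tags are illustrative assumptions; the resulting dictionary would then be passed to one of the learners listed above, which is not shown.

```python
def break_features(pos_tags, syllables, last_break, i):
    """Features for the juncture after word i.

    pos_tags:   POS tag of each word in the sentence
    syllables:  syllable count of each word
    last_break: index of the word after which the previous break fell,
                or -1 if there has been no break yet
    """
    return {
        "pos_prev": pos_tags[i - 1] if i > 0 else "<s>",
        "pos_cur": pos_tags[i],
        "pos_next": pos_tags[i + 1] if i + 1 < len(pos_tags) else "</s>",
        # Words and syllables since the previous break, up to and including word i.
        "words_since_break": i - last_break,
        "syllables_since_break": sum(syllables[last_break + 1 : i + 1]),
    }
```

Each word juncture yields one such feature set, labeled break or non-break in the training data.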

SLIDE 18

Prosody Prediction (3)

Syllable-duration modeling

Duration-tagged speech samples -> regression analysis -> regression model

Factors:
  Phoneme
  Tone
  Position

The regression model predicts duration with fair precision (0.73 correlation with the reference durations).
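
The correlation figure quoted here is presumably a Pearson correlation between predicted and reference syllable durations; that reading is an assumption, since the slide does not name the coefficient. A small sketch of how such a score would be computed:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)
```

Calling `pearson(predicted, reference)` on the model's predicted durations and the durations measured from the tagged speech samples gives a value between -1 and 1, with 1 meaning the predictions track the references perfectly.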

SLIDE 19

Text Processing (1)

Word segmentation
Part-of-speech tagging
Grapheme-to-phoneme (G2P) conversion

SLIDE 20

Text Processing (2)

G2P difficulties
  Context-dependent segmentation ambiguity (CDSA)
    NOWHERE: |NOW|HERE| or |NOWHERE|
  Context-independent segmentation ambiguity (CISA)
    TOGETHER: |TOGETHER| or |TO|GET|HER|
  Homograph ambiguity
    LEAD: /l i d/ or /l e d/

%Acc       Trigram  Bayesian  Winnow
CDSA       73.0     93.2      95.7
CISA       98.3     99.7      99.7
Homograph  52.5     94.3      96.5
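
The segmentation ambiguities above come from a string having more than one split into dictionary words. The sketch below enumerates all such splits; the tiny lexicon is hypothetical and holds only the words from the slide's English examples.

```python
def segmentations(text, lexicon):
    """Return every way to split text into a sequence of lexicon words."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Extend each segmentation of the remainder with this head word.
            for tail in segmentations(text[end:], lexicon):
                results.append([head] + tail)
    return results

# Hypothetical lexicon covering only the slide's examples.
LEXICON = {"NOW", "HERE", "NOWHERE", "TO", "GET", "HER", "TOGETHER"}

print(segmentations("NOWHERE", LEXICON))   # [['NOW', 'HERE'], ['NOWHERE']]
print(segmentations("TOGETHER", LEXICON))  # [['TO', 'GET', 'HER'], ['TOGETHER']]
```

Enumeration only exposes the candidates; choosing among them is the job of the trigram, Bayesian or Winnow models compared in the table.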

SLIDE 21

Space Reduction

[Chart: % space reduction (10 to 100) and mean opinion score (1 to 5) versus maximum frequency of diphone (1, 10, 20, 50, 100, 200, 500, All)]
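
One reading of "maximum frequency of diphone" is a cap on how many instances of each diphone are kept in the unit inventory. Under that assumption, the sketch below keeps the first instances up to the cap and reports the resulting reduction. Treating every unit as the same size and keeping instances in corpus order are simplifications; a real system would likely choose which instances to keep by some quality criterion.

```python
from collections import defaultdict

def prune_inventory(units, max_per_diphone):
    """Keep at most max_per_diphone instances of each diphone label.

    units: list of diphone labels, one entry per stored unit instance.
    """
    kept = []
    counts = defaultdict(int)
    for unit in units:
        if counts[unit] < max_per_diphone:
            counts[unit] += 1
            kept.append(unit)
    return kept

def space_reduction_percent(units, kept):
    """Percentage of unit instances removed, assuming equal-sized units."""
    return 100.0 * (len(units) - len(kept)) / len(units)
```

Lowering the cap raises the space reduction, and the chart's second curve tracks what that costs in mean opinion score.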

SLIDE 22

SST Project (1)

2006 SST prototype

SLIDE 23

SST Project (2)

2006 SST prototype
English-to-Thai
Travel domain
Push-to-talk
ASR : CMU Sphinx III
MT : Nectec Parsit, a rule-based MT
TTS : Nectec Vaja

SLIDE 24

Conclusion

[Summary diagram: Thai Speech Technology at NECTEC]
ASR
  Toolkit: isolated word, regular grammar
  Robust: robust feature, model selection
  LVCSR: phone inventory, transcript system
  Corpora: NECTEC-ATR, LOTUS
TTS
  Engine: unit selection
  Prosody: phrase break, duration
  Text process: word segment, G2P
  Corpora: TSynC-1
  Space reduction
SST

SLIDE 25

Future Plan

“iSpeech-N” : N-gram based ASR
Telephone conversational corpus & model
Modified tree-based model selection
Incorporating prosodic models
TSynC-2
HMM-based TTS

ASR TTS

SLIDE 26

Future Plan

Two-way SST
A travel domain parallel corpus
Example-based MT & translation memory

Spoken language MT

SST

SLIDE 27

Tentative Collaborative Projects

HMM-based TTS

An available large speech corpus
Producing highly smoothed speech
The first system for Thai

ASR for spontaneous telephone speech

Corpus under development
Highly spontaneous dialogues
Telephone channel & environmental noises

SLIDE 28

Thank you for your attention