Thai Speech Processing Activities at NECTEC, Chai Wutiwiwatchai



SLIDE 1

NSTDA-TITECH Workshop - November 2006

Thai Speech Processing Activities at NECTEC

Chai Wutiwiwatchai, Ph.D.

National Electronics and Computer Technology Center (NECTEC)

SLIDE 2

Outline

  • Brief history
  • Current activities
      • Speech corpora
      • Automatic speech recognition (ASR)
      • Text-to-speech synthesis (TTS)
      • Other related topics
  • Demonstration & problems
  • Future plan
SLIDE 3

Brief History

[Figure: timeline with milestones at 1997, 2000, and 2005, labeled with the projects SID, TTS, ASR, and SST]

  • SID : Speaker identification
  • TTS : Text-to-speech synthesis
  • ASR : Automatic speech recognition
  • SST : Speech-to-speech translation
SLIDE 4

ASR Project

  • ASR resources
  • “iSpeech” toolkit
  • Robust ASR
  • Thai LVCSR
SLIDE 5

ASR Resources

Name: NECTEC-ATR
Year: 2002
Collab.: ATR, Japan
Purpose: Various Thai speech for ASR research
Detail:
  • 5000 freq. words
  • Phone-balanced utts.
  • Hotel reservation utts.
  • Read, 54 hrs., 48 spks.

Name: LOTUS
Year: 2005
Collab.: PSU & MU, Thailand
Purpose: 5000-word dictation system
Detail:
  • Phone-balanced utts.
  • 5000-covered utts.
  • Read, 70 hrs., 48 spks.

Name: VoiceCom
Year: 2005
Collab.: -
Purpose: Isolated commands
Detail:
  • Common isolated commands
  • 24 spks.

http://www.nectec.or.th/rdi/lotus

SLIDE 6

“iSpeech” Toolkit

  • Version 1.0 (2005)
      • Isolated word recognition
      • Monophone model
  • Version 1.5 (2006)
      • Model selection for robust ASR
      • Automatic endpoint detection
  • Version 2.0 (2006)
      • Regular grammar model
      • Cross-word triphone model
  • Website: http://www.nectec.or.th/rdi/ispeech

SLIDE 7

Robust ASR (1)

  • General approaches for robust ASR
      • Robust parameterization
      • Model selection
      • Robust topology
      • Combination
SLIDE 8

Robust ASR (2)

  • Wavelet-based denoising

[Diagram: speech is split into high-band (H) and low-band (L) coefficients; wavelet thresholding is applied to each band, giving denoised speech]

[Chart: accuracy (%), axis 20 to 60, baseline vs. denoising under clean, waterfall, fan, computer, and shaving conditions]
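The band-splitting and thresholding steps named on this slide can be sketched as follows. The slide does not state the wavelet family, the decomposition depth, or the thresholding rule, so the one-level Haar transform, the soft-threshold rule, and all function names here are assumptions chosen for brevity.

```python
import math

def haar_forward(x):
    """One-level Haar transform of an even-length signal: returns (low, high)."""
    s = math.sqrt(2.0)
    low = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return low, high

def haar_inverse(low, high):
    """Invert haar_forward, interleaving the reconstructed sample pairs."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(low, high):
        out.extend([(a + d) / s, (a - d) / s])
    return out

def soft_threshold(coeffs, t):
    """Shrink every coefficient toward zero by t (assumed rule, not from the slide)."""
    return [math.copysign(max(abs(c) - t, 0.0), c) for c in coeffs]

def denoise(x, t_low, t_high):
    """Threshold the low and high bands separately, then reconstruct."""
    low, high = haar_forward(x)
    return haar_inverse(soft_threshold(low, t_low), soft_threshold(high, t_high))

if __name__ == "__main__":
    noisy = [1.0, 1.1, 0.9, 1.0, 4.0, 4.2, 3.9, 4.1]  # toy samples, not real speech
    print(denoise(noisy, t_low=0.0, t_high=0.2))
```

With both thresholds set to zero the signal is reconstructed unchanged, which is a quick sanity check on the transform pair.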

SLIDE 9

Robust ASR (3)

  • Model selection

[Diagram: blocks labeled speech, noise classification, noise-specific acoustic models, speech recognition, and result]

  • Feature: MFCC, LSF, NLS (+ PCA)
  • Classifier: SVM, ANN, HMM

[Chart: accuracy (%), axis 40 to 80, comparing no robustness, multiconditioned acoustic model, PCA-NLS & ANN, and 100% noise classification]
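The model-selection idea on this slide, classifying the noise first and then recognizing with the matching acoustic model, can be sketched as below. A nearest-centroid classifier stands in for the SVM, ANN, and HMM classifiers the slide lists, and the feature vectors, noise labels, and model names are hypothetical placeholders.

```python
def nearest_noise_class(feature, centroids):
    """Return the noise label whose centroid is closest (squared Euclidean distance)."""
    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(centroids, key=lambda label: dist2(feature, centroids[label]))

def select_acoustic_model(feature, centroids, models):
    """Pick the noise-specific acoustic model for the classified noise type."""
    return models[nearest_noise_class(feature, centroids)]

# Hypothetical 2-D noise features and model identifiers, for illustration only.
CENTROIDS = {"clean": [0.0, 0.0], "fan": [1.0, 0.2], "waterfall": [0.3, 1.5]}
MODELS = {"clean": "am_clean", "fan": "am_fan", "waterfall": "am_waterfall"}

if __name__ == "__main__":
    print(select_acoustic_model([0.9, 0.3], CENTROIDS, MODELS))
```

Separating the classifier from the model lookup keeps the recognizer unchanged when a different noise classifier is swapped in.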

SLIDE 10

Robust ASR (4)

  • Tree-based model selection

[Diagram: tree with leaves for each noise type and SNR (Noise 1 SNR 1, Noise 1 SNR 2, ..., Noise N SNR N) and a root covering all noises and all SNRs]

  • MLLR transformation matrix / node
  • Automatic noise clustering/merging
  • GMM-based similarity measure

SLIDE 11

Thai LVCSR (1)

  • Phoneme inventory optimization

Syllable-structured phonemes
  • Initial consonant: p t k c ph th kh ch b d m n ng w j r l z h pr tr kr phr thr khr kl phl khl kw khw
  • Vowel: i ii e ee x xx v vv q qq a aa u uu o oo @ @@ ia iia va vva ua uua
  • Final consonant: P T K M N NG W J

Basic phonemes
  • Consonant: p t k c ph th kh ch b d m n ng w j r l z h
  • Vowel: i ii e ee x xx v vv q qq a aa u uu o oo @ @@

SLIDE 12

Thai LVCSR (2)

  • 5K-word dictation system
      • Acoustic modeling: 40 hrs., 48 spks.
      • Language modeling: 0.07 Mwords
      • Perplexity: 140
      • Evaluation: 460 utts., 10 spks.

[Chart: word accuracy (%), axis 40 to 80, comparing no LM, LM by original transcription, and LM by realigned transcription]
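As a reminder of what the perplexity figure on this slide measures, here is a minimal sketch of the standard definition. The function name and the toy probabilities are illustrative assumptions, not values from the NECTEC system.

```python
import math

def perplexity(word_probs):
    """Perplexity of a test text, given the model probability of each word in context.

    Computed as 2 ** (-(1/N) * sum(log2 p_i)), the inverse geometric mean of the probabilities.
    """
    log_sum = sum(math.log2(p) for p in word_probs)
    return 2 ** (-log_sum / len(word_probs))

if __name__ == "__main__":
    # A model that gives every word probability 1/140 has perplexity 140.
    print(perplexity([1 / 140] * 10))
```

Read this way, a perplexity of 140 means the language model is, on average, as uncertain as a uniform choice among 140 words at each position.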

SLIDE 13

TTS Project

  • “Vaja” TTS engine
  • TTS resources
  • Prosody prediction
  • Text processing
  • Space reduction
SLIDE 14

“Vaja” TTS Engine

  • Version 2.0 (2000)
      • Demisyllable concatenation
  • Version 3.0 (2003)
      • Corpus-based unit-selection
  • Version 4.0 (2006)
      • Multithread
      • Client/server
  • Version 5.0 (2007)
      • Naturalness improvement
      • Space reduction
  • Website: http://www.nectec.or.th/rdi/vaja

SLIDE 15

TTS Resources

Name: ORCHID
Year: 1997
Purpose: Thai text corpus for text processing
Detail:
  • 27,000 sentences
  • Word segmentation
  • POS-tagged

Name: TSynC-1
Year: 2003
Purpose: Thai speech corpus for unit-selection speech synthesis
Detail:
  • Triphone, tritone covered
  • 13 hrs., a fluent female
  • Prosody tagged

SLIDE 16

Prosody Prediction (1)

  • Sentence/Phrase breaking
  • Syllable-duration modeling
SLIDE 17

Prosody Prediction (2)

  • Sentence/Phrase breaking

[Diagram: preprocessed text, feature extraction, machine learning, break/non-break decision]

  • Feature extraction
      • POS of current and neighboring words
      • No. of syllables/words from previous break
  • Machine learning
      • C4.5, RIPPER, CART, neural network, POS n-gram
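The two feature groups listed on this slide can be illustrated with a small extraction routine. The tuple layout, the one-word context window, the padding symbol, and the POS tag names are assumptions made for the sketch; the slide does not specify them.

```python
def break_features(words, i, last_break):
    """Features for deciding whether a phrase break follows words[i].

    words: list of (token, pos_tag, n_syllables) tuples (assumed layout).
    last_break: index of the word after which the previous break fell, or -1 if none.
    """
    def pos(j):
        return words[j][1] if 0 <= j < len(words) else "<PAD>"

    span = words[last_break + 1 : i + 1]  # words since the previous break, inclusive of i
    return {
        "pos_prev": pos(i - 1),
        "pos_cur": pos(i),
        "pos_next": pos(i + 1),
        "words_since_break": len(span),
        "syllables_since_break": sum(w[2] for w in span),
    }

if __name__ == "__main__":
    # Placeholder tokens and hypothetical tags, not taken from ORCHID.
    sentence = [("w1", "NOUN", 2), ("w2", "VERB", 1), ("w3", "NOUN", 3)]
    print(break_features(sentence, 1, -1))
```

Each resulting dictionary would be one training instance for a break/non-break classifier such as the decision-tree and rule learners named above.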

SLIDE 18

Prosody Prediction (3)

  • Syllable-duration modeling

[Diagram: duration-tagged speech samples, regression analysis, regression model]

Factors:

  • Phoneme
  • Tone
  • Position

The regression model predicts syllable duration with fair precision (0.73 correlation with the reference durations).
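The correlation figure quoted on this slide is a score between predicted and reference durations. A minimal sketch of that evaluation, assuming it is the usual Pearson correlation (the slide does not name the measure), with made-up durations:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

if __name__ == "__main__":
    # Hypothetical syllable durations in milliseconds, for illustration only.
    predicted = [180.0, 220.0, 150.0, 300.0, 240.0]
    reference = [170.0, 250.0, 160.0, 280.0, 210.0]
    print(round(pearson(predicted, reference), 3))
```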

SLIDE 19

Text Processing (1)

  • Word segmentation
  • Part-of-speech tagging
  • Grapheme-to-phoneme (G2P) conversion
SLIDE 20

Text Processing (2)

  • G2P difficulties
      • Context-dependent segmentation ambiguity (CDSA)
        NOWHERE: |NOW|HERE| or |NOWHERE|
      • Context-independent segmentation ambiguity (CISA)
        TOGETHER: |TOGETHER| or |TO|GET|HER|
      • Homograph ambiguity
        LEAD: /l i d/ or /l e d/

%Acc        Trigram   Bayesian   Winnow
CDSA        73.0      93.2       95.7
CISA        98.3      99.7       99.7
Homograph   52.5      94.3       96.5
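The segmentation ambiguity in the slide's English analogies can be reproduced by enumerating every way a string splits into dictionary words. This is only an illustration of where the ambiguity comes from; the toy lexicon is an assumption, and the disambiguation methods compared in the table (trigram, Bayesian, Winnow) are not implemented here.

```python
def segmentations(text, lexicon):
    """Return every way to split text into a sequence of words from lexicon."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            for tail in segmentations(text[end:], lexicon):
                results.append([head] + tail)
    return results

# Toy lexicon built from the slide's examples.
LEXICON = {"NOW", "HERE", "NOWHERE", "TO", "GET", "HER", "TOGETHER"}

if __name__ == "__main__":
    for word in ("NOWHERE", "TOGETHER"):
        print(word, segmentations(word, LEXICON))
```

Both words yield two candidate segmentations. A disambiguator must then pick one, using context for CDSA cases and a fixed preference for CISA cases.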

SLIDE 21

Space Reduction

[Chart: % space reduction (axis 10 to 100) and mean opinion score (axis 1 to 5) plotted against the maximum frequency of diphone (1, 10, 20, 50, 100, 200, 500, All)]
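One reading of the chart's horizontal axis is that the unit inventory is pruned by capping how many instances of each diphone are kept. The sketch below follows that reading. Keeping the first instances encountered is an assumption, since the slide does not say how the surviving instances are chosen, and the unit sizes are invented.

```python
from collections import defaultdict

def prune_inventory(units, max_freq):
    """Keep at most max_freq instances of each diphone.

    units: list of (diphone, size_in_bytes) pairs (assumed layout).
    """
    kept, counts = [], defaultdict(int)
    for diphone, size in units:
        if counts[diphone] < max_freq:
            counts[diphone] += 1
            kept.append((diphone, size))
    return kept

def space_reduction_percent(units, kept):
    """Percentage of storage saved by replacing units with kept."""
    total = sum(size for _, size in units)
    return 100.0 * (total - sum(size for _, size in kept)) / total

if __name__ == "__main__":
    inventory = [("a-b", 10)] * 5 + [("b-c", 10)] * 2  # hypothetical units
    pruned = prune_inventory(inventory, max_freq=2)
    print(len(pruned), round(space_reduction_percent(inventory, pruned), 1))
```

Lowering the cap saves more space but leaves the unit selector fewer candidates, which is the trade-off the chart plots against mean opinion score.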

SLIDE 22

SST Project (1)

  • 2006 SST prototype
SLIDE 23

SST Project (2)

  • 2006 SST prototype
      • English-to-Thai
      • Travel domain
      • Push-to-talk
      • ASR : CMU Sphinx III
      • MT : Nectec Parsit, a rule-based MT
      • TTS : Nectec Vaja
SLIDE 24

Conclusion

[Diagram: Thai Speech Technology at NECTEC]

  • ASR
      • Toolkit: isolated word, regular grammar
      • Robust: robust feature, model selection
      • LVCSR: phone inventory, transcript system
      • Corpora: Nectec-ATR, LOTUS
  • TTS
      • Engine: unit selection
      • Prosody: phrase break, duration
      • Text process: word segment, G2P
      • Corpora: TSynC-1
      • Space reduction
  • SST

SLIDE 25

Future Plan

ASR

  • “iSpeech-N” : N-gram based ASR
  • Telephone conversational corpus & model
  • Modified tree-based model selection

TTS

  • Incorporating prosodic models
  • TSynC-2
  • HMM-based TTS

SLIDE 26

Future Plan

SST

  • Two-way SST
  • A travel domain parallel corpus
  • Example-based MT & translation memory
  • Spoken language MT

SLIDE 27

Tentative Collaborative Projects

HMM-based TTS

  • An available large speech corpus
  • Producing highly smoothed speech
  • The first system for Thai

ASR for spontaneous telephone speech

  • Corpus under development
  • Highly spontaneous dialogues
  • Telephone channel & environmental noises
SLIDE 28

Thank you for your attention