Cyberon Voice Commander

SLIDE 1

Cyberon Voice Commander: Experience in Developing a Multilingual Voice Command System

劉進榮, Assistant Vice President of R&D, Cyberon Corporation (賽微科技)
2007/08/16

SLIDE 2

Cyberon Profile

  • One of the leading embedded speech solution providers worldwide
  • Establishment: January 2000
  • Headquarters: Hsin-Tien City, Taipei, Taiwan

China office: Xiamen

  • Employees: 36 (R&D: 27)
  • More than 15 million units shipped worldwide
  • 2006 revenue: NTD 87 million; EPS: NTD 13.2

SLIDE 3

Powered by Cyberon - Greater China

SLIDE 4

Powered by Cyberon - Worldwide

SLIDE 5

Cyberon’s Solutions

  • Speaker-Dependent Voice Recognition

Speed Dial engine
Cyberon Voice Speed Dial (CVSD)

  • Speaker-Independent Voice Recognition

CListener engine
Cyberon Voice Dialer (CVD)
Cyberon Voice Commander (CVC)

  • Text-To-Speech

CReader engine
Cyberon Talking Tutor (CTT)
Cyberon Talking Dictionary (CTD)

SLIDE 6

Cyberon Voice Commander

  • A Voice Dialing and Command & Control Application

Name Dial / Digit Dial
Phone Book Lookup
Program Launch
Media Player Control
E-mail / SMS / Calendar Reader
Callback, Redial, Time, etc.
Voice Feedback
Bilingual Speech Recognition

  • Technology

Speaker-Independent Command-Based SR
Text-To-Speech
Continuous Digit Recognition
Speaker-Dependent SR (Voice Tag)
Speaker Adaptation (for Digit Model)

SLIDE 7

Supported Languages

Asian Region

Traditional Chinese, Simplified Chinese, Chinese-Accented English, Korean, Thai, Cantonese, Japanese

European Region

UK English, German, French, Italian, Spanish, Portuguese, Russian, Dutch, Polish, Czech, Turkish, Danish, Swedish, Finnish (2007 Q3), Norwegian (2007 Q3), Greek (2007 Q3), Slovak (2007 Q4), Hungarian (2007 Q4), Ukrainian (2007 Q4)

American Region

North American English, Brazilian Portuguese, South American Spanish

SLIDE 8

Speaker-Independent Speech Recognition

SLIDE 9

Architecture

[Block diagram with the components: Voice Signal, Feature Extraction, Feature Vectors, Search Algorithm, Acoustic Database, Lexicon, Vocabulary, Result]

Size

  • Small: tens
  • Middle: hundreds
  • Large: thousands
  • Very large: tens of thousands

Speaker Dependence

  • Speaker-Dependent (SD)
  • Speaker-Independent (SI)
  • Speaker Adaptation (SA)

Approach

  • Neural Network
  • HMM

Search Algorithm

  • Isolated Word
  • Discrete Speech
  • Continuous Speech
  • Keyword Spotting

Grammar

Unit

  • Word based
  • Phoneme based
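
SLIDE 10

Feature & Grammar

  • Feature

Input signal: 8 kHz, 16-bit PCM
8-dim MFCC and 8-dim delta MFCC
100 frames per second
Cepstral mean subtraction

  • Grammar

[Grammar diagram running from start to end, with these apparent paths:]
start → "Call" (打電話) → person name (人名) → home, office, or mobile (住家、公司、手機) → end
start → "Open" (開啟) → application name (應用程式名) → end
start → "Play" (播放) → song title (歌曲名) → end
start → other single-word commands (其他單詞命令) → end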
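
A grammar of this shape can be written down as a handful of slot sequences. The sketch below is only an illustration, not Cyberon's implementation: the English command words stand in for the Chinese ones, the slot names and sample vocabulary are hypothetical, and the home/office/mobile slot is assumed to be optional because the slide does not say either way.

```python
# Hypothetical slot vocabularies; a real system would fill these from the
# phone book, the installed programs, and the media library.
SLOTS = {
    "PERSON": {"alice wang", "bob chen"},
    "LOCATION": {"home", "office", "mobile"},
    "APP": {"calendar", "calculator"},
    "SONG": {"yesterday"},
    "SINGLE": {"redial", "callback"},
}

# Each rule is a sequence of (item, required). Upper-case items are slots,
# lower-case items are literal command words.
RULES = [
    [("call", True), ("PERSON", True), ("LOCATION", False)],  # assumed optional
    [("open", True), ("APP", True)],
    [("play", True), ("SONG", True)],
    [("SINGLE", True)],
]


def match_rule(rule, tokens):
    """Return the filled slots if tokens satisfy the rule, else None."""
    filled = {}
    pos = 0
    for item, required in rule:
        token = tokens[pos] if pos < len(tokens) else None
        is_slot = item in SLOTS
        ok = (token in SLOTS[item]) if is_slot else (token == item)
        if ok:
            filled[item if is_slot else "COMMAND"] = token
            pos += 1
        elif required:
            return None
    # Every token must have been consumed for the utterance to be accepted.
    return filled if pos == len(tokens) else None


def parse(tokens):
    """Try each rule in turn and return the first successful match."""
    for rule in RULES:
        result = match_rule(rule, tokens)
        if result is not None:
            return result
    return None
```

Here `parse(["call", "alice wang", "mobile"])` fills the command, person, and location, while `parse(["open", "yesterday"])` is rejected because a song title cannot follow "open". Restricting the recognizer to such paths is what keeps the active vocabulary small at each point in the utterance.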

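The front-end settings listed under Feature on slide 10 (8 kHz input, 100 frames per second, cepstral mean subtraction, delta MFCC) can be illustrated with a few lines of code. This is a minimal sketch that assumes the MFCC vectors have already been computed; the function names are hypothetical, and the two-frame central difference used for the delta is an assumption, since the slide does not give the delta formula.

```python
SAMPLE_RATE_HZ = 8000
FRAMES_PER_SECOND = 100
# 8000 / 100 = 80 samples between successive frame starts.
HOP_SAMPLES = SAMPLE_RATE_HZ // FRAMES_PER_SECOND


def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient utterance mean from every frame."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in frames]


def add_deltas(frames):
    """Append delta = (c[t+1] - c[t-1]) / 2 to each frame.

    Edge frames reuse themselves as the missing neighbour.
    """
    last = len(frames) - 1
    out = []
    for t, frame in enumerate(frames):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, last)]
        delta = [(b - a) / 2.0 for a, b in zip(prev, nxt)]
        out.append(list(frame) + delta)
    return out
```

With 8-dimensional MFCC input, `add_deltas(cepstral_mean_subtraction(frames))` yields 16-dimensional feature vectors, matching the 8 + 8 layout on the slide.
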
SLIDE 11

Lexicon, Model & Search

  • Lexicon

Word-to-phone conversion
Several approaches for different languages
30 KB ~ 250 KB per language

  • Model

Phoneme-based HMM
3 left-to-right states per phoneme model
Decision-tree triphone model
Forward-backward training
180 KB ~ 220 KB per language

  • Search Algorithm

Viterbi search
Word transitions governed by the grammar
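
To make the search step concrete, here is a minimal sketch of Viterbi decoding over a single left-to-right HMM, such as one 3-state phoneme model. It is an illustration under stated assumptions rather than the CListener engine: the function name and the log-probability inputs are hypothetical, and in a full recognizer the states of successive phonemes and words would be chained, with word-to-word transitions limited to those the grammar allows.

```python
import math

NEG_INF = float("-inf")


def viterbi_left_to_right(log_emit, log_self, log_next):
    """Best state path through a left-to-right HMM.

    log_emit[t][s]: log-likelihood of frame t under state s.
    log_self[s]:    log-probability of staying in state s.
    log_next[s]:    log-probability of moving from state s to s + 1.
    The path must start in state 0 and end in the last state.
    """
    n_frames = len(log_emit)
    n_states = len(log_emit[0])
    score = [[NEG_INF] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = log_emit[0][0]
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s] + log_self[s]
            move = score[t - 1][s - 1] + log_next[s - 1] if s > 0 else NEG_INF
            if move > stay:
                score[t][s], back[t][s] = move, s - 1
            else:
                score[t][s], back[t][s] = stay, s
            score[t][s] += log_emit[t][s]
    # Trace back from the final state at the last frame.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return score[-1][-1], path
```

The returned path assigns each frame to a state, and the score is the log-probability of that best alignment; comparing such scores across the word sequences the grammar permits is what selects the recognition result.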

SLIDE 12

Language Development

  • Procedure

Define Phoneme Set

  • Wikipedia, SAMPA, language-learning web sites, ...

Build Lexicon Module

  • Rules: academic papers, language-learning web sites, ...
  • Pronunciation dictionaries: LDC, ELRA, other research organizations, ...

Design Recording Scripts

  • News web sites

Collect Speech Data

  • Local agents

Train Model & Test

  • 3 ~ 6 months to develop a language

SLIDE 13

Lexicon Module

  • Basic Approaches

Rule

  • Simple letter-to-phone rules
  • Ex: Italian, Spanish, Portuguese, etc.

Hard-coded

  • Ex: Chinese, Korean, etc.

Decision Tree

  • Trained on a pronunciation dictionary
  • Accuracy: 92% ~ 98% on words inside the training dictionary, 60% ~ 75% on words outside it
  • Ex: English, German, French, etc.
  • Hybrid for Most Languages
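
One way to picture a hybrid lexicon module is a pronunciation-dictionary lookup that falls back to letter-to-phone rules for unlisted words. The sketch below shows only that combination and leaves out the decision-tree approach. Everything in it is a toy assumption: the rule table, the exception entry, and the phone symbols are invented for illustration and do not describe any real language or Cyberon's data.

```python
# Toy exception dictionary: words whose pronunciation the rules get wrong.
EXCEPTIONS = {
    "chaos": ["k", "a", "o", "s"],
}

# Toy letter-to-phone rules; two-letter sequences are tried before single letters.
RULES = {
    "ch": ["tS"],
    "a": ["a"], "e": ["e"], "i": ["i"], "o": ["o"], "u": ["u"],
    "c": ["k"], "m": ["m"], "n": ["n"], "s": ["s"], "t": ["t"],
}


def letters_to_phones(word):
    """Look the word up first; otherwise apply the longest matching rule."""
    word = word.lower()
    if word in EXCEPTIONS:
        return list(EXCEPTIONS[word])
    phones = []
    i = 0
    while i < len(word):
        for size in (2, 1):
            chunk = word[i:i + size]
            if len(chunk) == size and chunk in RULES:
                phones.extend(RULES[chunk])
                i += size
                break
        else:
            raise ValueError(f"no rule for {word[i]!r} in {word!r}")
    return phones
```

For example, `letters_to_phones("mucho")` is produced entirely by rules, while `letters_to_phones("chaos")` comes from the exception dictionary and overrides what the "ch" rule would give.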

SLIDE 14

Data Collection

  • Corpus

100 ~ 800 informants per language
Per speaker:

  • 40 ~ 60 short words for bootstrapping the model
  • 200 ~ 300 sentences (25 ~ 30 min) for training
  • Accent Issue

Collect data in big cities
Try to enlarge the coverage of accents

  • Verification & Phoneme Transcription

Done by tools
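
The corpus figures above imply a wide range of training material per language. The short calculation below works out that range from the slide's numbers alone; it counts only the training sentences and ignores the short bootstrapping words, and the function name is hypothetical.

```python
def corpus_hours(speakers, minutes_per_speaker):
    """Total recorded training speech in hours."""
    return speakers * minutes_per_speaker / 60.0


# Smallest case: 100 informants x 25 minutes each.
low_hours = corpus_hours(100, 25)
# Largest case: 800 informants x 30 minutes each.
high_hours = corpus_hours(800, 30)

# Sentence counts at the same two extremes.
low_sentences = 100 * 200
high_sentences = 800 * 300
```

This gives roughly 42 to 400 hours of training speech, or 20,000 to 240,000 sentences, per language.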

SLIDE 15

Engine Simulation Test

Vocabulary: 200 full names
Testers: 4 ~ 6 native speakers
Device: Dopod 900 (HTC Universal)
Noise: AURORA car noise added to the source data at several S/N levels

Accuracy (%) by S/N

Language               Clean   15 dB   10 dB   5 dB    0 dB
Taiwan Mandarin        98.03   97.04   96.37   93.09   75.33
China Mandarin         96.62   96.21   95.21   90.33   71.67
Cantonese              95.36   94.01   93.97   88.01   71.62
US English             98.9    97.9    96.68   92.58   79.4
UK English             93.88   94.85   94.21   91.45   77.79
German                 95.17   95.17   93.65   87.81   75.29
French                 94.83   95.02   94.08   90.25   76.62
Italian                95.77   94.15   93.64   91.56   81.73
Spanish                96.18   95.37   92.83   89.28   78
Brazilian Portuguese   96.2    97.15   95.49   93.35   80.29
Dutch                  94.25   93.12   92.62   88.12   74.75
Japanese               96.55   96.1    92.4    90.4    81.1
Russian                97.15   95.6    93.62   87.07   75.47
Average                96.07   95.51   94.21   90.25   76.85
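
Adding noise at a chosen S/N means scaling the noise so that the ratio of speech power to noise power hits the target. The sketch below shows one common way to do that and is not a description of how the test data was actually prepared: the function name is hypothetical, and measuring power over the whole utterance is an assumption.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus noise scaled to the requested S/N in dB."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    if p_noise == 0:
        raise ValueError("noise is silent")
    # Solve p_speech / (gain^2 * p_noise) = 10^(snr_db / 10) for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

Calling this with `snr_db` set to 15, 10, 5, and 0 would produce the four noisy conditions of the table from one clean recording and one noise recording.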

SLIDE 16

CVC Field Test

Vocabulary: 200 full names, 20 ~ 30 apps with grammar
Testers: 4 ~ 6 native speakers
Device: Several PocketPC phone models
Environment: Office, Roadside, and Highway

Accuracy (%) by environment

Language               Office   Roadside   Highway
Taiwan Mandarin        98.6     92.8       93.5
China Mandarin         96.2     90.4       92.3
Cantonese              94.8     89.7       91.5
US English             93.7     85.2       90.5
UK English             93.2     83.7       88.5
German                 95.7     86.3       93.8
French                 96.5     91.4       92.6
Italian                97.5     92.3       94
Spanish                97.1     89.4       91.2
Brazilian Portuguese   95.3     87.6       88.7
Dutch                  92.4     84         91.3
Japanese               96.2     88.3       91.2
Russian                96.3     88.4       92.8
Average                95.63    88.42      91.68

SLIDE 17

Text-To-Speech

SLIDE 18

TTS in CVC

  • Mainly for voice feedback of the VR result
  • 16 kHz, 16-bit PCM output
  • Compact size: 300 KB ~ 600 KB per language
  • Acceptable quality
  • Lacks rich prosody (sounds robotic)
  • Good for pronouncing single words and short phrases after fine tuning

Cyberon Talking Dictionary

SLIDE 19

Architecture

[Block diagram with the components: Input Text, Text Analysis, Word/Phrase Break, Pronunciation, POS Tag, Pronunciation Lexicon, POS Lexicon, Prosody Model, Synthesizer, Speech Unit Database, Output Speech]

SLIDE 20

Text Analysis

  • Word Boundary

For Chinese and Thai
Longest word first

  • POS Tagging

By POS n-gram and Viterbi search

  • Phrase Boundary

By boundary n-gram and Viterbi search
Simplified approach: by syllable length
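
The "longest word first" rule for word boundaries is commonly implemented as forward maximum matching: at each position, take the longest lexicon entry that matches and move past it. The sketch below follows that reading; the function name, the tiny lexicon in the usage note, and the single-character fallback for unknown text are assumptions.

```python
def segment_longest_first(text, lexicon, max_len=None):
    """Split text into words by always taking the longest lexicon match.

    Characters that start no lexicon entry are emitted one at a time.
    """
    if max_len is None:
        max_len = max((len(w) for w in lexicon), default=1)
    max_len = max(1, max_len)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words
```

With a lexicon of `{"打電話", "電話", "打", "給", "王小明"}`, the text `"打電話給王小明"` is split into `打電話`, `給`, and `王小明`: the three-character entry wins over the shorter `打` and `電話`.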

SLIDE 21

Prosody Model

  • Mandarin & Cantonese (Syllable Unit)

Save the first tone of each syllable in the database
Pre-define the F0 contour of each tone
Adopt a fixed base F0 contour for the phrase
Compute duration from syllable position in the word and in the phrase

  • Other Languages (Diphone Unit)

Predict accent position and type by CART (Classification And Regression Tree)
Generate the F0 contour by linear regression
Predict duration by CART
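
For the Mandarin case, the combination of a pre-defined contour per tone and a fixed phrase-level base contour can be sketched as follows. This is a loose illustration only: adding the tone shape to the base contour is an assumption about how the two are combined, all F0 values are invented, and duration is left out.

```python
# Hypothetical tone shapes for the four Mandarin tones, as offsets in Hz from
# the phrase baseline at the start, middle, and end of a syllable.
TONE_OFFSETS_HZ = {
    1: (20.0, 20.0, 20.0),     # high level
    2: (-10.0, 0.0, 20.0),     # rising
    3: (-10.0, -30.0, -10.0),  # dipping
    4: (30.0, 0.0, -30.0),     # falling
}


def phrase_baseline(position, start_hz=140.0, end_hz=110.0):
    """Fixed, linearly declining base F0; position runs from 0.0 to 1.0."""
    return start_hz + (end_hz - start_hz) * position


def f0_targets(tones):
    """Return (start, mid, end) F0 targets in Hz for each syllable."""
    n = len(tones)
    targets = []
    for i, tone in enumerate(tones):
        points = []
        for k, offset in enumerate(TONE_OFFSETS_HZ[tone]):
            # Syllable i covers [i / n, (i + 1) / n] of the phrase.
            position = (i + k / 2.0) / n
            points.append(phrase_baseline(position) + offset)
        targets.append(tuple(points))
    return targets
```

Because the tone shapes and the baseline are small fixed tables, a model of this kind needs almost no storage, which fits the compact footprint the TTS engine aims for.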

SLIDE 22

Synthesizer

  • LPC (Linear Predictive Coding)-Based Approach

Store the LPC coefficients and the pitch-period residual of each speech unit in the database
Adjust the residual length to follow the F0 contour
Adjust the number of pitch periods to set the duration
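
The three steps of the LPC-based synthesizer can be sketched as an all-pole filter driven by a stored residual. This is a simplified illustration, not the CReader engine: the function names are hypothetical, the filter uses the sign convention A(z) = 1 + a1 z^-1 + ... + ap z^-p, and truncating or zero-padding the residual is only one assumed way of changing its length.

```python
def lpc_synthesize(residual, coeffs):
    """All-pole filter: s[n] = e[n] - sum(a[k] * s[n - k]) for k = 1..p."""
    p = len(coeffs)
    past = [0.0] * p  # most recent output sample first
    out = []
    for e in residual:
        s = e - sum(a * y for a, y in zip(coeffs, past))
        out.append(s)
        past = [s] + past[:p - 1]
    return out


def resize_period(residual_period, new_length):
    """Shorten (raise F0) or zero-pad (lower F0) one pitch period of residual."""
    period = list(residual_period[:new_length])
    return period + [0.0] * (new_length - len(period))


def build_excitation(residual_period, period_lengths):
    """Concatenate one resized residual per pitch period.

    The number of entries in period_lengths sets the duration, and each
    entry sets the local pitch period in samples.
    """
    excitation = []
    for length in period_lengths:
        excitation.extend(resize_period(residual_period, length))
    return excitation
```

A unit would then be rendered as `lpc_synthesize(build_excitation(period, lengths), coeffs)`, where `lengths` comes from the prosody model: at the 16 kHz output rate, a period of 160 samples corresponds to an F0 of 100 Hz.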

SLIDE 23

Conclusion

SLIDE 24

Cyberon Voice Commander

  • A successful commercial voice application on mobile devices
  • Integrates several speech technologies, such as SI VR and TTS, into an embedded system
  • Experience in developing many languages
  • Shows that speech technologies are workable in real daily life

SLIDE 25

Future Work

  • Improve TTS quality
  • Enhance recognition performance in heavily noisy conditions
  • Find an accurate approach to verify and transcribe speech data
  • Create a more effective procedure for developing a language
  • Develop other advanced speech technologies and applications

SLIDE 26

The End and Thanks

Cyberon Corporation

TEL: +886-2-2910-9088
FAX: +886-2-2910-7986
Website: www.cyberon.com.tw