
SLIDE 1

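Cyberon Voice Commander: Experience in Developing a Multilingual Voice Command System

劉進榮, Associate Vice President, R&D Department, Cyberon Corporation (賽微科技), 2007/08/16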

SLIDE 2

Cyberon Profile

One of the leading embedded speech solution providers worldwide

Establishment: Jan 2000
Headquarters: Hsin-Tien City, Taipei, Taiwan
China Office: Xiamen
Employees: 36 (R&D: 27)
More than 15 million units shipped worldwide
2006 Revenue: NTD 87 million, EPS: NTD 13.2

SLIDE 3

Powered By Cyberon - Greater China

SLIDE 4

Powered By Cyberon - Worldwide

SLIDE 5

Cyberon's Solutions

Speaker-Dependent Voice Recognition

Speed Dial engine
Cyberon Voice Speed Dial (CVSD)

Speaker-Independent Voice Recognition

CListener engine
Cyberon Voice Dialer (CVD)
Cyberon Voice Commander (CVC)

Text-To-Speech

CReader engine
Cyberon Talking Tutor (CTT)
Cyberon Talking Dictionary (CTD)

SLIDE 6

Cyberon Voice Commander

A Voice Dialing and Command & Control Application

Name Dial / Digit Dial
Phone Book Lookup
Program Launch
Media Player Control
E-mail / SMS / Calendar Reader
Callback, Redial, Time, etc.
Voice Feedback
Bilingual Speech Recognition

Technology

Speaker-Independent Command-Based SR
Text-To-Speech
Continuous Digit Recognition
Speaker-Dependent SR (Voice Tag)
Speaker Adaptation (for Digit Model)

SLIDE 7

Supported Languages

Asian Region

Traditional Chinese
Simplified Chinese
Chinese Accent English
Korean
Thai
Cantonese
Japanese

European Region

UK English
German
French
Italian
Spanish
Portuguese
Russian
Dutch
Polish
Czech
Turkish
Danish
Swedish
Finnish (2007 Q3)
Norwegian (2007 Q3)
Greek (2007 Q3)
Slovak (2007 Q4)
Hungarian (2007 Q4)
Ukrainian (2007 Q4)

American Region

North American English
Brazilian Portuguese
South American Spanish

SLIDE 8

Speaker-Independent Speech Recognition

SLIDE 9

Architecture

[Figure: recognizer block diagram. Labels: Voice Signal, Feature Extraction, Feature Vectors, Search Algorithm, Result, Acoustic Database, Lexicon, Vocabulary.]

Size

Small: tens
Medium: hundreds
Large: thousands
Very Large: tens of thousands

Speaker Dependence

Speaker-Dependent (SD)
Speaker-Independent (SI)
Speaker Adaptation (SA)

Approach

Neural Network
HMM

Search Algorithm

Isolated Word
Discrete Speech
Continuous Speech
Keyword Spotting

Grammar

Unit

Word based
Phoneme based

SLIDE 10

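Feature & Grammar

Feature

Input Signal: 8 kHz, 16-bit PCM
8-Dim MFCC and 8-Dim Delta MFCC
100 Frames Per Second
Cepstral Mean Subtraction

Grammar

[Figure: two command grammar graphs, each running from "start" to "end". Nodes in the first graph: Call (打電話), Open (開啟), Play (播放), other single-word commands (其他單詞命令), person name (人名), application name (應用程式名), song name (歌曲名), and Home / Office / Mobile (住家、公司、手機). The second graph has the same nodes, with Call (打電話) listed after Home / Office / Mobile.]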
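The front-end settings on this slide can be illustrated with a short sketch. At 8 kHz and 100 frames per second, the frame shift works out to 80 samples. The sketch below assumes the 8-dimensional MFCC vectors are already computed and shows only cepstral mean subtraction and delta features. The slide does not say how the deltas are computed, so the two-frame difference with edge replication used here is an assumption, and all function names are hypothetical.

```python
SAMPLE_RATE = 8000          # Hz, from the slide
FRAMES_PER_SECOND = 100     # from the slide
FRAME_SHIFT = SAMPLE_RATE // FRAMES_PER_SECOND  # samples between frame starts


def cepstral_mean_subtraction(frames):
    """Subtract the per-dimension mean over the utterance from every frame."""
    count = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / count for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]


def delta(frames):
    """Assumed delta: (c[t+1] - c[t-1]) / 2, repeating the edge frames."""
    last = len(frames) - 1
    dim = len(frames[0])
    return [
        [(frames[min(t + 1, last)][d] - frames[max(t - 1, 0)][d]) / 2.0
         for d in range(dim)]
        for t in range(len(frames))
    ]


def make_features(mfcc_frames):
    """Return static (mean-subtracted) plus delta coefficients per frame."""
    static = cepstral_mean_subtraction(mfcc_frames)
    return [s + d for s, d in zip(static, delta(static))]
```

With 8 static coefficients per frame, `make_features` returns 16 values per frame, matching the 8 + 8 layout on the slide.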

SLIDE 11

Lexicon, Model & Search

Lexicon

Word-to-Phone Conversion
Several approaches for different languages
30 KB ~ 250 KB per language

Model

Phoneme-Based HMM
3 Left-to-Right States for a Phoneme Model
Decision-Tree Triphone Model
Forward-Backward Training
180 KB ~ 220 KB per language

Search Algorithm

Viterbi Search
Word transition governed by Grammar
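As an illustration of Viterbi search over a left-to-right phoneme model, here is a minimal sketch for a single model with three states. It is not Cyberon's decoder: it covers one phoneme only, leaves out the grammar-governed word transitions, and assumes that per-frame log emission scores are supplied by the caller. The function name and parameters are hypothetical.

```python
NEG_INF = float("-inf")


def viterbi_left_to_right(log_emit, log_stay, log_next):
    """Best state path through a left-to-right HMM.

    log_emit[t][s] is the log emission score of frame t in state s.
    Each state may loop on itself (log_stay) or advance by one (log_next).
    The path must start in state 0 and end in the last state.
    """
    n_frames = len(log_emit)
    n_states = len(log_emit[0])
    if n_frames < n_states:
        raise ValueError("need at least one frame per state")

    score = [[NEG_INF] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = log_emit[0][0]

    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s] + log_stay
            move = score[t - 1][s - 1] + log_next if s > 0 else NEG_INF
            if move > stay:
                best, prev = move, s - 1
            else:
                best, prev = stay, s
            score[t][s] = best + log_emit[t][s]
            back[t][s] = prev

    # Trace back from the final state at the last frame.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return score[-1][-1], path
```

A full recognizer would chain such models along the phone sequence of each word and let the grammar decide which words may follow one another.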

SLIDE 12

Language Development

Procedure

Define Phoneme Set

Wikipedia, SAMPA, language learning web sites, ...

Build Lexicon Module

Rule: academic papers, language learning web sites, ...
Pronunciation Dictionary: LDC, ELRA, other research organizations, ...

Design Recording Scripts

News web sites

Collect Speech Data

Local agents

Train Model & Test

3 ~ 6 months to develop a language

SLIDE 13

Lexicon Module

Basic Approaches

Rule

Simple Letter-to-Phone Rules
Ex: Italian, Spanish, Portuguese, etc.

Hardcode

Ex: Chinese, Korean, etc.

Decision Tree

Trained by a pronunciation dictionary
Accuracy: 92% ~ 98% on words inside the dictionary, 60% ~ 75% on words outside it
Ex: English, German, French, etc.

Hybrid for Most Languages
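The hybrid idea can be sketched as a dictionary lookup that falls back to letter-to-phone rules. The rule table below is a toy, loosely Spanish-like set invented for illustration. It is not Cyberon's rule set and not a complete description of any language, and the exception entry is likewise hypothetical.

```python
# Toy rules (assumption): most letters map to themselves, with a few
# multi-letter and special cases. Phone symbols are SAMPA-like.
TOY_RULES = {c: [c] for c in "aeioubdfklmnpst"}
TOY_RULES.update({"ch": ["tS"], "qu": ["k"], "c": ["k"], "h": []})

# Hypothetical exception dictionary for words the rules get wrong.
TOY_EXCEPTIONS = {"cyberon": ["s", "aI", "b", "@", "r", "O", "n"]}


def letters_to_phones(word, rules=TOY_RULES):
    """Apply rules left to right, always trying the longest chunk first."""
    word = word.lower()
    max_len = max(len(key) for key in rules)
    phones = []
    i = 0
    while i < len(word):
        for n in range(max_len, 0, -1):
            chunk = word[i:i + n]
            if chunk in rules:
                phones.extend(rules[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError(f"no rule for {word[i]!r} in {word!r}")
    return phones


def pronounce(word, exceptions=TOY_EXCEPTIONS, rules=TOY_RULES):
    """Hybrid lookup: exception dictionary first, rules as the fallback."""
    key = word.lower()
    if key in exceptions:
        return list(exceptions[key])
    return letters_to_phones(key, rules)
```

For a language such as English, the fallback would be a trained decision tree rather than hand-written rules, but the lookup order stays the same.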

SLIDE 14

Data Collection

Corpus

100 ~ 800 Informants Per Language

Per Speaker

40 ~ 60 short words for bootstrapping the model
200 ~ 300 sentences (25 ~ 30 min) for training

Accent Issue

Collect data in big cities
Try to enlarge the coverage of accents

Verification & Phoneme Transcription

Done by tools
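The corpus figures above imply a rough range for training audio per language. The arithmetic below multiplies the number of informants by the training minutes per speaker. It counts only the 25 ~ 30 minutes of training sentences and ignores the short bootstrapping words. The helper name is hypothetical.

```python
def corpus_hours(informants, minutes_per_speaker):
    """Total training audio in hours for one language."""
    return informants * minutes_per_speaker / 60.0


low = corpus_hours(100, 25)    # smallest corpus on the slide
high = corpus_hours(800, 30)   # largest corpus on the slide
print(f"{low:.1f} to {high:.1f} hours of training speech per language")
```

This gives roughly 42 to 400 hours per language, which helps explain why verification and transcription are done by tools rather than by hand.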

SLIDE 15

Engine Simulation Test

Vocabulary: 200 full names
Tester: 4 ~ 6 native speakers
Device: Dopod 900 (HTC Universal)
Noise: AURORA car noise added to the source data at several levels

Accuracy (%) by S/N

Language                Clean   15dB   10dB    5dB    0dB
Taiwan Mandarin         98.03  97.04  96.37  93.09  75.33
China Mandarin          96.62  96.21  95.21  90.33  71.67
Cantonese               95.36  94.01  93.97  88.01  71.62
US English              98.90  97.90  96.68  92.58  79.40
UK English              93.88  94.85  94.21  91.45  77.79
German                  95.17  95.17  93.65  87.81  75.29
French                  94.83  95.02  94.08  90.25  76.62
Italian                 95.77  94.15  93.64  91.56  81.73
Spanish                 96.18  95.37  92.83  89.28  78.00
Brazilian Portuguese    96.20  97.15  95.49  93.35  80.29
Dutch                   94.25  93.12  92.62  88.12  74.75
Japanese                96.55  96.10  92.40  90.40  81.10
Russian                 97.15  95.60  93.62  87.07  75.47
Average                 96.07  95.51  94.21  90.25  76.85
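Adding noise at a chosen S/N can be sketched as follows. The noise is scaled so that the ratio of speech power to noise power matches the target in dB, then added sample by sample. This is a generic mixing sketch, not the exact procedure used in the test. The function names are hypothetical, and real use would read the AURORA car noise from a recording rather than from a list.

```python
import math


def mean_power(samples):
    """Average squared amplitude of a signal."""
    return sum(v * v for v in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus noise scaled to the requested S/N in dB."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    noise_power = mean_power(noise)
    if noise_power == 0:
        raise ValueError("noise is silent")
    # Solve 10 * log10(P_speech / (gain^2 * P_noise)) = snr_db for gain.
    gain = math.sqrt(mean_power(speech) / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

Calling `mix_at_snr` once per target level (15, 10, 5, and 0 dB) would produce the noisy versions of each clean utterance.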

SLIDE 16

CVC Field Test

Vocabulary: 200 full names, 20 ~ 30 apps with grammar
Tester: 4 ~ 6 native speakers
Device: Several PocketPC phone models
Environment: Office, Roadside, and Highway

Accuracy (%) by Environment

Language                  Office  Roadside   Highway
Taiwan Mandarin             98.6      92.8      93.5
China Mandarin              96.2      90.4      92.3
Cantonese                   94.8      89.7      91.5
US English                  93.7      85.2      90.5
UK English                  93.2      83.7      88.5
German                      95.7      86.3      93.8
French                      96.5      91.4      92.6
Italian                     97.5      92.3      94.0
Spanish                     97.1      89.4      91.2
Brazilian Portuguese        95.3      87.6      88.7
Dutch                       92.4      84.0      91.3
Japanese                    96.2      88.3      91.2
Russian                     96.3      88.4      92.8
Average                    95.63     88.42     91.68

SLIDE 17

Text-To-Speech

SLIDE 18

TTS in CVC

Mainly for voice feedback of VR result
16 kHz, 16-bit PCM Output
Compact Size: 300 KB ~ 600 KB per Language
Acceptable quality
Lack of rich prosody (Robotic)
Good for pronunciation of single words and short phrases after fine tuning
Cyberon Talking Dictionary

SLIDE 19

Architecture

[Figure: TTS block diagram. Labels: Input Text, Text Analysis, Synthesizer, Output Speech, Prosody Model, Speech Unit Database, Word/Phrase break, Pronunciation, POS tag, Pronunciation Lexicon, POS Lexicon.]

SLIDE 20

Text Analysis

Word Boundary

For Chinese and Thai
Longest word first

POS Tagging

By POS n-gram and Viterbi search

Phrase Boundary

By boundary n-gram and Viterbi search
Simplified approach: by syllable length
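"Longest word first" is commonly implemented as forward maximum matching: at each position, take the longest lexicon entry that matches, and fall back to a single character when nothing matches. The sketch below assumes that reading of the slide. The tiny lexicon and the function name are made up for illustration.

```python
def segment_longest_first(text, lexicon):
    """Split text into words by always taking the longest lexicon match."""
    max_len = max((len(word) for word in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            # A single character is always accepted as a last resort.
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


# Hypothetical lexicon: "call", "phone", "to", and two forms of a name.
toy_lexicon = {"打電話", "電話", "給", "王小明", "小明"}
print(segment_longest_first("打電話給王小明", toy_lexicon))
```

The greedy choice is simple and fast, which suits an embedded device, but it can pick the wrong split when a long match hides a better pair of shorter words.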

SLIDE 21

Prosody Model

Mandarin & Cantonese (Syllable Unit)

Save first tone of each syllable in database
Pre-define F0 contour of each tone
Adopt fixed base F0 contour of phrase
Compute duration by syllable position in word and in phrase

Other Languages (Diphone Unit)

Predict accent position and type by CART (Classification And Regression Tree)
Generate F0 contour by linear regression
Predict duration by CART
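One way to picture the Mandarin scheme is a pre-defined contour per tone laid on top of a fixed phrase-level base contour. The sketch below uses the conventional five-level tone numerals for the four Mandarin tones (55, 35, 214, 51). Everything else is an assumption made for illustration: the linear declining base line, the Hz values, the additive combination, and the function names. None of it is taken from Cyberon's system.

```python
# Conventional five-level descriptions of the four Mandarin tones.
TONE_LEVELS = {1: (5, 5), 2: (3, 5), 3: (2, 1, 4), 4: (5, 1)}


def tone_contour(tone, n_points):
    """Piecewise-linear tone shape sampled at n_points (n_points >= 2)."""
    levels = TONE_LEVELS[tone]
    shape = []
    for j in range(n_points):
        pos = j / (n_points - 1) * (len(levels) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(levels) - 1)
        frac = pos - lo
        shape.append(levels[lo] * (1 - frac) + levels[hi] * frac)
    return shape


def phrase_f0(tones, points_per_syllable=4,
              base_start_hz=220.0, base_end_hz=180.0, hz_per_level=15.0):
    """F0 in Hz: declining base line plus a tone offset around level 3."""
    total = len(tones) * points_per_syllable
    f0 = []
    for idx, tone in enumerate(tones):
        for j, level in enumerate(tone_contour(tone, points_per_syllable)):
            k = idx * points_per_syllable + j
            progress = k / (total - 1) if total > 1 else 0.0
            base = base_start_hz + (base_end_hz - base_start_hz) * progress
            f0.append(base + (level - 3) * hz_per_level)
    return f0
```

A scheme of this kind stores no per-utterance pitch data, which fits the compact size noted earlier and also fits the remark that the output lacks rich prosody.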

SLIDE 22

Synthesizer

LPC (Linear Predictive Coding)-Based Approach

Save LPC coefficients and the pitch-period residual of each speech unit into the database
Adjust residual length for F0 contour
Adjust the number of pitch periods for duration
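The three steps can be sketched with plain lists. Changing the length of each pitch-period residual changes the pitch, changing the number of periods changes the duration, and an all-pole LPC filter turns the residual back into speech. The truncate-or-pad and nearest-index strategies, the coefficient sign convention y[n] = e[n] + sum of a_k * y[n-k], and the function names are all assumptions for illustration.

```python
def resize_period(residual, new_len):
    """Truncate or zero-pad one pitch period to new_len samples (sets F0)."""
    kept = list(residual[:new_len])
    return kept + [0.0] * (new_len - len(kept))


def retime(periods, target_count):
    """Repeat or drop pitch periods to reach target_count (sets duration)."""
    if not periods or target_count <= 0:
        return []
    n = len(periods)
    return [periods[min(n - 1, i * n // target_count)]
            for i in range(target_count)]


def lpc_synthesize(residual, coeffs):
    """All-pole filter: y[n] = e[n] + sum_k coeffs[k-1] * y[n-k]."""
    out = []
    for n, excitation in enumerate(residual):
        acc = excitation
        for k, a in enumerate(coeffs, start=1):
            if n - k >= 0:
                acc += a * out[n - k]
        out.append(acc)
    return out
```

For one speech unit, a caller would retime the stored periods, resize each one to the period length implied by the target F0, concatenate them, and pass the result through `lpc_synthesize` with that unit's coefficients.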

SLIDE 23

Conclusion

SLIDE 24

Cyberon Voice Commander

A successful commercial voice application on mobile devices
Integrates several speech technologies, such as SI VR and TTS, into an embedded system
Experience in developing many languages
Shows that speech technologies are workable in real daily life

SLIDE 25

Future Work

Improve TTS quality
Enhance recognition performance in heavily noisy conditions
Find an accurate approach to verify and transcribe speech data
Create a more effective procedure for developing a language
Develop other advanced speech technologies and applications

SLIDE 26

The End and Thanks

Cyberon Corporation

TEL: +886-2-2910-9088
FAX: +886-2-2910-7986
Website: www.cyberon.com.tw