1
Cyberon Voice Commander: Lessons from Developing a Multilingual Voice Command System
劉進榮, Associate Vice President of R&D, Cyberon Corporation, 2007/08/16
2
Cyberon Profile
One of the leading embedded speech solution providers worldwide
China Office: Xiamen
3
4
5
Speed Dial engine: Cyberon Voice Speed Dial (CVSD)
CListener engine: Cyberon Voice Dialer (CVD), Cyberon Voice Commander (CVC)
CReader engine: Cyberon Talking Tutor (CTT), Cyberon Talking Dictionary (CTD)
6
Name Dial / Digit Dial
Phone Book Lookup
Program Launch
Media Player Control
E-mail / SMS / Calendar reader
Callback, Redial, Time, etc.
Voice Feedback
Bilingual Speech Recognition

Speaker-Independent Command-Based SR
Text-To-Speech
Continuous Digit Recognition
Speaker-Dependent SR (Voice Tag)
Speaker Adaptation (for Digit Model)
7
Asian Region
Traditional Chinese, Simplified Chinese, Chinese Accent English, Korean, Thai, Cantonese, Japanese
European Region
UK English, German, French, Italian, Spanish, Portuguese, Russian, Dutch, Polish, Czech, Turkish, Danish, Swedish, Finnish (07'Q3), Norwegian (07'Q3), Greek (07'Q3), Slovak (07'Q4), Hungarian (07'Q4), Ukrainian (07'Q4)
American Region
Northern American English, Brazilian Portuguese, Southern American Spanish
8
9
[Figure: speech recognizer block diagram. Labels: Voice Signal, Feature Extraction, Feature Vectors, Search Algorithm, Acoustic Database, Lexicon, Vocabulary, Result]
Size
Speaker Dependence
Approach
Search Algorithm
Grammar
Unit
10
Input Signal: 8 kHz, 16-bit PCM
8-Dim MFCC and 8-Dim Delta MFCC
100 Frames Per Second
Cepstral Mean Subtraction
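The front-end settings above can be sketched in a few lines. At 8 kHz and 100 frames per second the frame shift works out to 80 samples. The sketch below assumes the 8-dimensional MFCC vectors have already been computed and shows only cepstral mean subtraction and a simple delta computation. The slide does not give the delta window, so the +/-1 frame difference is an assumption, and all function names are hypothetical.

```python
SAMPLE_RATE_HZ = 8000      # from the slide: 8 kHz, 16-bit PCM
FRAMES_PER_SECOND = 100    # from the slide
FRAME_SHIFT = SAMPLE_RATE_HZ // FRAMES_PER_SECOND  # samples between frames


def cepstral_mean_subtraction(frames):
    """Subtract the per-dimension utterance mean from every frame."""
    if not frames:
        return []
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]


def delta(frames):
    """Delta features as (next - previous) / 2, repeating the edge frames.

    The +/-1 frame window is an assumption; the slide does not specify one.
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev_f = frames[max(t - 1, 0)]
        next_f = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev_f, next_f)])
    return out


def front_end(mfcc_frames):
    """Return 16-dim vectors: 8 mean-normalised MFCCs followed by 8 deltas."""
    static = cepstral_mean_subtraction(mfcc_frames)
    dynamic = delta(static)
    return [s + d for s, d in zip(static, dynamic)]
```

Subtracting the utterance mean removes a constant channel offset in the cepstral domain, which is one common reason to apply it on handheld devices with varying microphones.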
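[Figure: command grammar network, shown twice in the extraction. Nodes between start and end: Call, Open, Play, other single-word commands; person name, application name, song title; Home, Office, Mobile]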
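A command grammar like the one on this slide can be written down as a small table of patterns. The sketch below is illustrative only: the English command words stand in for the Chinese ones, the slot names and sample vocabulary are hypothetical, and the pairing of each verb with its slot (call with a person name and an optional phone type, open with an application, play with a song) is inferred from the diagram.

```python
from itertools import product

# Hypothetical vocabulary; on a device these lists would come from the
# contacts, the installed programs, and the media library.
SLOTS = {
    "<name>": ["John Smith", "Mary Lee"],
    "<place>": ["home", "office", "mobile"],
    "<app>": ["Calendar"],
    "<song>": ["Yesterday"],
}

# Each rule is a sequence of literal words and slots between start and end.
RULES = [
    ["call", "<name>"],
    ["call", "<name>", "<place>"],
    ["open", "<app>"],
    ["play", "<song>"],
    ["redial"],  # stands in for "other single-word commands"
]


def expand(rule):
    """Yield every sentence that one rule accepts."""
    choices = [SLOTS.get(token, [token]) for token in rule]
    for words in product(*choices):
        yield " ".join(words)


def all_sentences():
    """List every complete path through the grammar."""
    return [sentence for rule in RULES for sentence in expand(rule)]


def is_valid(utterance):
    """True if the utterance is a complete path through the grammar."""
    return utterance in set(all_sentences())
```

Enumerating sentences only works for small grammars like this one; a recognizer would normally keep the network form and let the search follow its arcs.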
11
Word-to-Phone Conversion
Several approaches for different languages
30 KB ~ 250 KB per language

Phoneme-Based HMM
3 Left-to-Right States for a Phoneme Model
Decision-Tree Triphone Model
Forward-Backward Training
180 KB ~ 220 KB per language

Viterbi Search
Word transition governed by Grammar
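For illustration, here is a minimal Viterbi alignment over a single 3-state left-to-right phoneme model, computed in the log domain. It is a sketch under stated assumptions, not the engine's decoder: the function name and the score layout are hypothetical, and a real recognizer would chain many triphone models and apply the grammar at word transitions.

```python
import math

NEG_INF = -math.inf


def viterbi_left_to_right(log_emit, log_self, log_next):
    """Best state path through a left-to-right HMM.

    log_emit[t][s]: log-likelihood of frame t in state s.
    log_self[s]:    log-probability of staying in state s.
    log_next[s]:    log-probability of moving from state s to state s + 1.
    The path must start in state 0 and end in the last state.
    """
    n_frames, n_states = len(log_emit), len(log_emit[0])
    if n_frames < n_states:
        raise ValueError("need at least one frame per state")

    score = [[NEG_INF] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = log_emit[0][0]

    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s] + log_self[s]
            move = score[t - 1][s - 1] + log_next[s - 1] if s > 0 else NEG_INF
            if stay >= move:
                best, back[t][s] = stay, s
            else:
                best, back[t][s] = move, s - 1
            score[t][s] = best + log_emit[t][s]

    # Trace back from the final state at the last frame.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return score[-1][-1], path
```

Because each state can only stay or advance by one, every cell has at most two predecessors, which keeps the search cheap enough for an embedded device.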
12
Define Phoneme Set
Build Lexicon Module
Design Recording Scripts
Collect Speech Data
Train Model & Test
13
Rule
Hardcode
Decision Tree
14
100 ~ 800 Informants Per Language Per Speaker
Collect data in big cities
Try to enlarge the coverage of accents
Done by tools
15
Vocabulary: 200 full names
Testers: 4 ~ 6 native speakers
Device: Dopod 900 (HTC Universal)
Noise: AURORA car noise added to the source data at several SNR levels
Accuracy (%) by SNR:

Language              Clean   15 dB   10 dB   5 dB    0 dB
Taiwan Mandarin       98.03   97.04   96.37   93.09   75.33
China Mandarin        96.62   96.21   95.21   90.33   71.67
Cantonese             95.36   94.01   93.97   88.01   71.62
US English            98.9    97.9    96.68   92.58   79.4
UK English            93.88   94.85   94.21   91.45   77.79
German                95.17   95.17   93.65   87.81   75.29
French                94.83   95.02   94.08   90.25   76.62
Italian               95.77   94.15   93.64   91.56   81.73
Spanish               96.18   95.37   92.83   89.28   78
Brazilian Portuguese  96.2    97.15   95.49   93.35   80.29
Dutch                 94.25   93.12   92.62   88.12   74.75
Japanese              96.55   96.1    92.4    90.4    81.1
Russian               97.15   95.6    93.62   87.07   75.47
Average               96.07   95.51   94.21   90.25   76.85
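Mixing noise into clean recordings at a target signal-to-noise ratio can be sketched as follows. This is a generic illustration, not the procedure used for the test above: the function names are hypothetical, and plain lists of samples stand in for the recorded speech and the AURORA noise files.

```python
import math


def power(samples):
    """Mean squared amplitude of a signal."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` after scaling it to the requested SNR.

    The noise is looped if it is shorter than the speech, then scaled so
    that 10 * log10(speech power / noise power) equals `snr_db`.
    """
    looped = [noise[i % len(noise)] for i in range(len(speech))]
    noise_power = power(looped)
    if noise_power == 0:
        raise ValueError("noise segment is silent")
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, looped)]
```

Calling `mix_at_snr` once per level, for example 15, 10, 5, and 0 dB, turns one clean test set into a family of noisy test sets with the same utterances.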
16
Vocabulary: 200 full names, 20 ~ 30 apps with grammar
Testers: 4 ~ 6 native speakers
Device: Several PocketPC phone models
Environment: Office, Roadside, and Highway
Accuracy (%) by environment:

Language              Office  Roadside  Highway
Taiwan Mandarin       98.6    92.8      93.5
China Mandarin        96.2    90.4      92.3
Cantonese             94.8    89.7      91.5
US English            93.7    85.2      90.5
UK English            93.2    83.7      88.5
German                95.7    86.3      93.8
French                96.5    91.4      92.6
Italian               97.5    92.3      94
Spanish               97.1    89.4      91.2
Brazilian Portuguese  95.3    87.6      88.7
Dutch                 92.4    84        91.3
Japanese              96.2    88.3      91.2
Russian               96.3    88.4      92.8
Average               95.63   88.42     91.68
17
18
after fine tuning
Cyberon Talking Dictionary
19
[Figure: text-to-speech block diagram. Labels: Input Text, Text Analysis, Word/Phrase break, Pronunciation, POS tag, Pronunciation Lexicon, POS Lexicon, Prosody Model, Speech Unit Database, Synthesizer, Output Speech]
20
For Chinese and Thai
Longest word first
By POS n-gram and Viterbi search
By boundary n-gram and Viterbi search
Simplified approach: by syllable length
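The "longest word first" rule for languages written without spaces is commonly implemented as greedy forward maximum matching. A minimal sketch, assuming a plain set of words as the lexicon; the function name and the sample lexicon are hypothetical:

```python
def segment_longest_first(text, lexicon, max_word_len=None):
    """Greedy left-to-right segmentation.

    At each position take the longest lexicon entry that matches; if
    nothing matches, emit a single character and move on.
    """
    if max_word_len is None:
        max_word_len = max((len(word) for word in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        longest = min(max_word_len, len(text) - i)
        for length in range(longest, 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical lexicon for a "call <name>" sentence.
LEXICON = {"打電話", "電話", "給", "王小明"}
print(segment_longest_first("打電話給王小明", LEXICON))
```

The greedy choice is fast and needs no statistics, at the cost of occasional errors where a shorter first word would have allowed a better overall split.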
21
Save first tone of each syllable in database
Pre-define F0 contour of each tone
Adopt fixed base F0 contour of phrase
Compute duration by syllable position in word and in phrase

Predict accent position and type by CART (Classification And Regression Tree)
Generate F0 contour by linear regression
Predict duration by CART
22
Save LPC coefficients and the pitch residual of each speech unit into the database
Adjust residual length for the F0 contour
Adjust the number of pitch periods for duration
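The two adjustments above can be sketched on a residual stored as a list of pitch periods. The slide does not say how the residual length is changed or which periods are repeated, so truncation or zero-padding and evenly spaced selection are assumptions here, as are the function names. The synthesis filter assumes the convention A(z) = 1 + a1 z^-1 + ... + ap z^-p.

```python
def resize_period(residual, new_length):
    """Truncate or zero-pad one pitch period of residual.

    A shorter period raises F0, a longer one lowers it.
    """
    padding = [0.0] * max(0, new_length - len(residual))
    return residual[:new_length] + padding


def retime_periods(periods, target_count):
    """Repeat or drop pitch periods so that `target_count` remain.

    Source periods are picked at evenly spaced positions, which stretches
    or shrinks the duration without touching the period lengths.
    """
    if target_count <= 0 or not periods:
        return []
    n = len(periods)
    return [periods[(k * n) // target_count] for k in range(target_count)]


def lpc_synthesize(residual, lpc):
    """All-pole filter: y[n] = e[n] - sum(a[k] * y[n - k] for k = 1..p)."""
    out = []
    for n, excitation in enumerate(residual):
        acc = excitation
        for k, a in enumerate(lpc, start=1):
            if n - k >= 0:
                acc -= a * out[n - k]
        out.append(acc)
    return out
```

A unit would then be rendered by retiming its periods for the target duration, resizing each period for the target F0, concatenating the results, and passing them through `lpc_synthesize` with that unit's coefficients.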
23
24
device
and TTS, into embedded system
25
26
Cyberon Corporation
TEL: +886-2-2910-9088
FAX: +886-2-2910-7986
Website: www.cyberon.com.tw