26/9/2014
Speech Technology Used in WeChat
Powered by WeChat
FENG RAO
Outline
- Introduction to Speech Recognition Algorithms
– Acoustic Model
– Language Model
– Decoder
- Speech Technology Open Platform
$$
\begin{aligned}
\hat{W} &= \arg\max_{W \in L} P(W \mid O) \\
        &= \arg\max_{W \in L} \frac{P(O \mid W)\, P(W)}{P(O)} \\
        &= \arg\max_{W \in L} P(O \mid W)\, P(W)
\end{aligned}
$$
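As a minimal sketch of the decision rule above, the Python snippet below picks the word sequence that maximizes log P(O|W) + log P(W). P(O) is dropped because it is the same for every candidate. The candidate sentences and their scores are invented for illustration, and the function name `decode` is hypothetical.

```python
import math

def decode(candidates, acoustic_logprob, lm_logprob):
    """Return the candidate W that maximizes log P(O|W) + log P(W)."""
    return max(candidates, key=lambda w: acoustic_logprob[w] + lm_logprob[w])

# Made-up scores for two candidate transcriptions of the same audio O.
acoustic_logprob = {
    "recognize speech": math.log(0.20),    # log P(O|W), acoustic model
    "wreck a nice beach": math.log(0.25),
}
lm_logprob = {
    "recognize speech": math.log(0.010),   # log P(W), language model
    "wreck a nice beach": math.log(0.001),
}

best = decode(list(acoustic_logprob), acoustic_logprob, lm_logprob)
print(best)
```

In this toy case the second candidate has the better acoustic score, but the language model prior outweighs it.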
[Figure: neural network with an input vector and hidden layers. Outputs are compared with the correct answer to get the error signal.]
$$
P(Y = i \mid x, W, b) = \mathrm{softmax}_i(Wx + b) = \frac{e^{W_i x + b_i}}{\sum_j e^{W_j x + b_j}}
$$
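The softmax output layer can be sketched in plain Python as follows. The weights, biases, and input are arbitrary illustrative values, not parameters from the talk. Subtracting the maximum activation before exponentiating is a common numerical-stability step and does not change the result.

```python
import math

def softmax_posteriors(W, b, x):
    """Return P(Y = i | x, W, b) for every class i."""
    # z_i = W_i . x + b_i
    z = [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]
    m = max(z)  # shift by the max so exp() cannot overflow
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative values only: 3 output classes, 2 input features.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, -1.0]
x = [2.0, 1.0]
probs = softmax_posteriors(W, b, x)
print(probs)
```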
$$
p(S) = p(W_1, W_2, \ldots, W_k) = p(W_1)\, p(W_2 \mid W_1) \cdots p(W_k \mid W_1, W_2, \ldots, W_{k-1})
$$
Text training corpus: how likely is one word to follow another? To follow the two previous words?
– Smoothing: Kneser-Ney (KN), Good-Turing (GT), Stupid Backoff
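To make the n-gram idea concrete, here is a minimal bigram sketch with Stupid Backoff. It uses the relative frequency of the bigram when it was seen in training, and otherwise backs off to the unigram frequency scaled by a fixed factor. The factor 0.4 is the commonly cited default, the two-sentence corpus is invented, and the function names are hypothetical. Stupid Backoff returns scores rather than normalized probabilities.

```python
from collections import Counter

def train(sentences):
    """Count unigrams and bigrams, with <s> and </s> sentence markers."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def stupid_backoff(word, prev, unigrams, bigrams, alpha=0.4):
    """Score of `word` following `prev` (a score, not a true probability)."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    # Unseen bigram: back off to the scaled unigram relative frequency.
    return alpha * unigrams[word] / sum(unigrams.values())

unigrams, bigrams = train(["hello world", "hello there"])
seen = stupid_backoff("world", "hello", unigrams, bigrams)    # bigram was seen
unseen = stupid_backoff("world", "there", unigrams, bigrams)  # backs off
print(seen, unseen)
```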
used as bidirectional communication protocol.
[Figure: grammar network over the tokens <s>, hello, world, </s>]
N-Gram
Grammar
public $basicCmd = $digit<1->;
$digit = (0|1|2|3|4|5|6|7|8);
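Assuming the grammar above is read with the usual ABNF repeat operator, where `<1->` means one or more repetitions, `$basicCmd` accepts any non-empty sequence of the listed digits. A minimal Python sketch of that acceptance test follows. The function name is hypothetical, and the digit set is copied from the slide as written (0 through 8).

```python
# Alternatives of $digit, exactly as listed on the slide.
DIGITS = {"0", "1", "2", "3", "4", "5", "6", "7", "8"}

def accepts_basic_cmd(tokens):
    """True if `tokens` matches $basicCmd = $digit<1->, i.e. one or more digits."""
    return len(tokens) >= 1 and all(t in DIGITS for t in tokens)

print(accepts_basic_cmd(["1", "2", "3"]))
print(accepts_basic_cmd([]))
```

A grammar like this restricts the decoder to a small set of word sequences, unlike an n-gram model, which assigns a score to any sequence.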
– A sequence of acoustic feature vectors (O)
– A trained HMM (AM)
– Lexicon (PM)
– Probabilities of word sequences (LM)
– Weighted finite-state transducer
– Build a network composed of the HMM triphones and words in the AM and LM.
– Calculate the most likely state sequence in the HMM given transition and observation probabilities.
– Trace back through the state sequence to get the word sequence.
– Viterbi decoder
– N-best vs. 1-best vs. lattice output
– Lattice minimization and determinization
– Pruning: beam search
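The decoding steps listed above can be sketched as a small Viterbi search in the log domain, with back-pointers for the trace-back and an optional beam that drops hypotheses scoring too far below the current best. This is a toy over a two-state HMM with invented probabilities, not the WFST decoder described in the talk. All names and numbers are assumptions for illustration.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit, beam=math.inf):
    """Return (most likely state sequence, its log score) for `obs`."""
    scores = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    backptrs = []
    for o in obs[1:]:
        # Beam pruning: keep only hypotheses within `beam` of the best score.
        top = max(scores.values())
        active = {s: v for s, v in scores.items() if v >= top - beam}
        new_scores, ptr = {}, {}
        for s in states:
            prev = max(active, key=lambda p: active[p] + log_trans[p][s])
            new_scores[s] = active[prev] + log_trans[prev][s] + log_emit[s][o]
            ptr[s] = prev
        scores = new_scores
        backptrs.append(ptr)
    # Trace back from the best final state.
    last = max(scores, key=scores.get)
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, scores[last]

def logs(table):
    """Convert a nested dict of probabilities to log probabilities."""
    return {k: {j: math.log(p) for j, p in row.items()} for k, row in table.items()}

states = ["s1", "s2"]
log_start = {"s1": math.log(0.6), "s2": math.log(0.4)}
log_trans = logs({"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}})
log_emit = logs({"s1": {"a": 0.5, "b": 0.4, "c": 0.1},
                 "s2": {"a": 0.1, "b": 0.3, "c": 0.6}})

path, score = viterbi(["a", "b", "c"], states, log_start, log_trans, log_emit)
pruned_path, _ = viterbi(["a", "b", "c"], states, log_start, log_trans, log_emit, beam=1.0)
print(path, score)
```

A narrower beam reduces the work per frame, at the risk of discarding the hypothesis that would have won later. In this toy case the pruned search returns the same path.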
[Figure: time-aligned segments, each with a start and end time t, labelled by phoneme / word]
Universal Interface
– General-field LM decoder
– Map-field LM decoder
– Command-field LM decoder
– Others…
Universal Interface
– One-Pass Decoder, Lattice Decoder
– Parallel Decoding Space: N-gram Model, ABNF
– Parallel Decoding Space: GMM, DNN
– Output: One-Best, N-Best
Non-Finite Field
Core performance of speech recognition:
– Accuracy rate: 94% (audio sampled at 16 kHz)
– Usage: 18 million per day
[Chart: weekly accuracy, y-axis from 91.00% to 95.50%]
Week 33: 92.60%
Week 34: 94.10%
Week 35: 94.70%
Week 36: 94.30%
Week 37: 95.10%
Week 38: 94.90%
Sampling Accuracy of Current Availability
* Source: Accuracy Assessment from Third Party in April'14
Multi Verticals
– Unified entry point using parallel decoding space technology
– Parallel recognition supports 11 classifications of verticals
– 30% better in performance than speech input in
– Verticals recognition rate: 96%, more accurate than common
* Source: Accuracy Assessment from Third Party in April'14
[Chart: recognition accuracy by vertical, y-axis from 90.00% to 97.00%. Values: 92.70%, 93.40%, 94.40%, 96.20%, 95.80%, 95.50%, 93.80%]
Input Tool: QQ Input, WeChat Input
Music Searching, QQ Map
Contact Searching, Voice Quality Identify, Voice Awaken To Unlock Mobile Phone