
26/9/2014

Speech Technology Using in WeChat

Powered by WeChat

FENG RAO

Outline

  • Introduction to Speech Recognition Algorithms

– Acoustic Model
– Language Model
– Decoder

  • Speech Technology Open Platform

– Framework of Speech Recognition
– Products of Speech Recognition
– Speech Synthesis
– Speaker Verification


Speech Recognition

Ŵ = argmax_{W ∈ L} P(W | O)

  = argmax_{W ∈ L} P(O | W) P(W) / P(O)

  = argmax_{W ∈ L} P(O | W) P(W)
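The fundamental equation above can be sketched in a few lines: score each candidate word sequence by its acoustic-model likelihood P(O|W) plus its language-model prior P(W) (in log space) and take the argmax. The hypotheses and scores below are made-up illustration values, not real model outputs.

```python
# Toy illustration of W_hat = argmax_W P(O|W) * P(W), in log space.
# All scores below are invented for illustration.

# log P(O|W): acoustic-model score per candidate word sequence
am_log_scores = {
    "i think there are": -120.5,
    "i think they are":  -121.0,
    "i thing there are": -119.8,
}

# log P(W): language-model score per candidate
lm_log_scores = {
    "i think there are": -8.2,
    "i think they are":  -8.5,
    "i thing there are": -14.0,  # unlikely word sequence, heavy LM penalty
}

def decode(hypotheses):
    """Pick the hypothesis maximizing log P(O|W) + log P(W)."""
    return max(hypotheses, key=lambda w: am_log_scores[w] + lm_log_scores[w])

print(decode(am_log_scores.keys()))
```

Note how the LM term vetoes "i thing there are" even though its acoustic score is best, which is exactly why the prior P(W) appears in the argmax.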

Acoustic Model

  • Spoken words: “I think there are”
  • Phonemes: ‘ay-th-in-nk-kd dh-eh-r-aa-r’
  • Each tri-phone corresponds to an HMM.
  • HMM: 5-state representation
  • Each state corresponds to a Gaussian mixture model (GMM)


Acoustic Model

  • GMM emission probability (mixture of M Gaussians per state):

P(O | S) = Σ_{i=1}^{M} ϖ_i N(O | μ_i, Σ_i)

P(O | W) = Π_{i=1}^{M} P(O_i | S_i)

  • Deep Neural Network

– Outputs, hidden layers, input vector
– Compare outputs with the correct answer to get an error signal
– Back-propagate the error signal to get derivatives for learning

P(Y = i | x, W, b) = softmax_i(Wx + b) = e^{W_i x + b_i} / Σ_j e^{W_j x + b_j}
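The softmax output layer of the DNN can be sketched directly from the formula: logits Wx + b, then normalized exponentials. The weights, biases, and input below are arbitrary illustration values.

```python
import math

def softmax(z):
    """Numerically stable softmax: subtract max(z) before exponentiating."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def dnn_output_layer(x, W, b):
    """P(Y = i | x, W, b) = softmax_i(Wx + b) for one input vector x.

    W is a list of weight rows (one per output class), b a list of biases.
    """
    logits = [sum(wi * xi for wi, xi in zip(row, x)) + bi
              for row, bi in zip(W, b)]
    return softmax(logits)

# Example: 3 output classes (e.g. tri-phone states), 2-dim input vector
probs = dnn_output_layer([1.0, 2.0],
                         W=[[0.1, 0.2], [0.0, 0.3], [-0.1, 0.1]],
                         b=[0.0, 0.1, 0.2])
print(probs)  # a valid distribution: all positive, sums to 1
```

In a real acoustic model the outputs would be senone (tied tri-phone state) posteriors, and training would back-propagate the error signal as the slide describes.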


Language Model

p(S) = p(W_1, W_2, ..., W_k)
     = p(W_1) p(W_2 | W_1) ... p(W_k | W_1, W_2, ..., W_{k-1})

  • N-Gram Model

– Build the LM by calculating n-gram probabilities from a text training corpus: how likely is one word to follow another? To follow the two previous words?

  • Smoothing methods

– Kneser-Ney (KN), Good-Turing (GT), Stupid Backoff

  • Grammar

– ABNF describes a formal grammar of a language, used as a bidirectional communication protocol.
– Quick, small
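Estimating those n-gram probabilities is just counting: how often a word follows its history, divided by how often the history occurs. A minimal maximum-likelihood bigram sketch over a made-up two-sentence corpus:

```python
from collections import Counter

# Toy corpus, sentences padded with the <s>/</s> boundary markers
corpus = [
    "<s> hello world </s>",
    "<s> hello there </s>",
]

unigrams = Counter()
bigrams = Counter()
for sent in corpus:
    words = sent.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_bigram(w, prev):
    """Maximum-likelihood P(w | prev) = count(prev, w) / count(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_bigram("world", "hello"))  # "hello" is followed by "world" 1 time out of 2
```

Raw counts assign zero probability to unseen n-grams, which is exactly the problem the smoothing methods above (KN, GT, Stupid Backoff) address.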
  • Example ARPA-format n-gram LM file:

\data\
ngram 1=4
ngram 2=3
ngram 3=2

\1-grams:
-0.60206 hello -0.39794
-0.60206 world -0.39794
-0.60206 </s> -0.39794
-0.60206 <s> -0.39794

\2-grams:
-0.39794 hello world
-0.39794 world </s>
-0.39794 <s> hello

\3-grams:
hello world </s>
<s> hello world

\end\

  • Example ABNF grammar:

public $basicCmd = $digit<1->;
$digit = (0|1|2|3|4|5|6|7|8);
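Reading such a file comes down to table lookups with backoff: if a bigram is listed, use its log10 probability; otherwise fall back to the unigram probability plus the history's backoff weight. A minimal sketch using the toy numbers from the example above:

```python
# ARPA-style backoff lookup (log10 probabilities, as in an ARPA file).
# The tables below mirror the toy LM file above.
unigram_logp = {"hello": -0.60206, "world": -0.60206,
                "<s>": -0.60206, "</s>": -0.60206}
unigram_backoff = {"hello": -0.39794, "world": -0.39794,
                   "<s>": -0.39794, "</s>": -0.39794}
bigram_logp = {("<s>", "hello"): -0.39794,
               ("hello", "world"): -0.39794,
               ("world", "</s>"): -0.39794}

def logp(w, prev):
    """log10 P(w | prev): use the bigram if present, else back off."""
    if (prev, w) in bigram_logp:
        return bigram_logp[(prev, w)]
    return unigram_backoff.get(prev, 0.0) + unigram_logp[w]

def sentence_logp(words):
    """Total log10 probability of a sentence, with boundary markers."""
    words = ["<s>"] + words + ["</s>"]
    return sum(logp(w, p) for p, w in zip(words, words[1:]))

print(sentence_logp(["hello", "world"]))  # 3 listed bigrams: 3 * -0.39794
```

For the ABNF grammar, decoding instead checks whether the hypothesis matches the rule (here: one or more digits), with no probabilities involved; that is what makes grammar-based recognition quick and small.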


Decoder

  • Find the best hypothesis P(O|W) P(W) given

– A sequence of acoustic feature vectors (O)
– A trained HMM (AM)
– Lexicon (PM)
– Probabilities of word sequences (LM)

  • For O

– Weighted finite state transducer
– Build a network composed of HMM tri-phones and words in the AM and LM.
– Calculate the most likely state sequence in the HMM given transition and observation probs.
– Trace back through the state sequence to get the word sequence.
– Viterbi decoder
– N-best vs. 1-best vs. lattice output

  • Limiting search

– Lattice minimization and determinization
– Pruning: beam search
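The two core steps above (most likely state sequence given transition and observation probabilities, then trace-back) are the Viterbi algorithm. A minimal sketch over a two-state HMM with invented toy probabilities, without the beam pruning a production decoder would add:

```python
# Minimal Viterbi sketch: best state path through an HMM, then trace-back.
# All probabilities below are toy values for illustration.
states = ["s0", "s1"]
start_p = {"s0": 0.6, "s1": 0.4}
trans_p = {"s0": {"s0": 0.7, "s1": 0.3},
           "s1": {"s0": 0.4, "s1": 0.6}}
emit_p = {"s0": {"a": 0.5, "b": 0.5},
          "s1": {"a": 0.1, "b": 0.9}}

def viterbi(obs):
    # V[t][s]: probability of the best path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = []  # back-pointers for trace-back
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            prev_best = max(states, key=lambda p: V[-1][p] * trans_p[p][s])
            col[s] = V[-1][prev_best] * trans_p[prev_best][s] * emit_p[s][o]
            ptr[s] = prev_best
        V.append(col)
        back.append(ptr)
    # Trace back from the best final state to recover the path
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["a", "b", "b"]))
```

In a real decoder the states are the tri-phone HMM states of the WFST-composed network, the observations are acoustic feature vectors, everything runs in log space, and beam pruning discards low-scoring paths at each frame.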

Decoder Network

  • Viterbi Decoder Process

[Figure: Viterbi trellis over time t, from start to end, expanding phoneme/word states]


Challenge Under Internet

  • Big Training Data

– Text corpus at the TB level and thousands of hours of speech data as training data
– Speed-optimized methods

  • Large Number of Users

– Real-time response
– More machines, robust service

  • Quick Update

– Content on the Internet is changing every day.
– Update models quickly, especially the language model

Speech Open Platform

  • Speech recognition
  • Speech synthesis
  • Speaker verification
  • …

– Used in WeChat

One Network Multiple Products

  • Universal Interface

– General-field LM decoder
– Map-field LM decoder
– Command-field LM decoder
– Others…

One Network Multiple Technologies

  • Universal Interface

– One-pass decoder / lattice decoder
– Parallel decoding space: n-gram model / ABNF
– Parallel decoding space: GMM / DNN
– One-best / N-best output


Non-Finite Field

The core performance of speech recognition is continuously optimized and still developing.

  • Accuracy rate: 94% (audio sampled at 16 kHz)
  • Usage: 18 million requests per day

[Chart: sampling accuracy of current availability by week]
Week 33: 92.60%, Week 34: 94.10%, Week 35: 94.70%, Week 36: 94.30%, Week 37: 95.10%, Week 38: 94.90%

* Source: Accuracy Assessment from Third Party in April '14

Multi Verticals

  • Unified entrance with parallel-decoding-space technology
  • Parallel recognition supports 11 classifications of verticals
  • 30% better performance than plain speech input
  • Verticals recognition rate: 96%, more accurate than the common model

* Source: Accuracy Assessment from Third Party in April '14

[Chart: recognition rates across vertical fields, ranging from 92.70% to 96.20%]


Speech Technology Product

  • Speech to text

– Input tools: QQ Input, WeChat Input

  • Vertical applications

– Music searching, QQ Map


– Contact searching, voice-quality identification, voice awakening to unlock the mobile phone

Speech Synthesis

  • Features

– High-efficiency synthesis
– SDK available for Android and iOS clients
– Offline and online TTS

  • Applications

– WeChat Official Account
– WeCall


Speaker Verification

  • Application scenes

– User login verification
– Bank transfer and payment verification
– Forgot-password recovery

  • Advantages

– Convenient and fast
– Safe
– Good user experience

How To Get Speech Technology

  • http://pr.weixin.qq.com/voice/intro

Thanks