Speech Technology Using in Wechat FENG RAO Powered by WeChat - PDF document

26/9/2014
Speech Technology Using in Wechat
FENG RAO
Powered by WeChat

Outline
• Introduction to Speech Recognition Algorithms
  – Acoustic Model
  – Language Model
  – Decoder
• Speech Technology Open Platform
  – Framework of Speech Recognition
  – Products of Speech Recognition
  – Speech Synthesis
  – Speaker Verification

Speech Recognition

$$\hat{W} = \arg\max_{W \in L} P(W \mid O) = \arg\max_{W \in L} \frac{P(O \mid W)\,P(W)}{P(O)} = \arg\max_{W \in L} P(O \mid W)\,P(W)$$

Acoustic Model
• Spoken words: “I think there are”
• Phonemes: ‘ay-th-in-nk-kd dh-eh-r-aa-r’
• Each triphone corresponds to an HMM.
• HMM: 5-state representation
• Each state corresponds to a Gaussian mixture model.
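The decision rule above can be sketched in a few lines of Python. This is only an illustration under assumed numbers: the candidate sentences and their log scores are invented, and `best_hypothesis` is a hypothetical helper name. Because P(O) is the same for every W it is dropped, and the product P(O|W) P(W) becomes a sum in the log domain.

```python
# Hypothetical scores for three candidate word sequences W given one utterance O.
# log_am stands in for log P(O|W) (acoustic model),
# log_lm stands in for log P(W) (language model). All numbers are made up.
CANDIDATES = {
    "i think there are": {"log_am": -120.0, "log_lm": -9.2},
    "i think they are": {"log_am": -121.5, "log_lm": -8.9},
    "eye think there are": {"log_am": -119.8, "log_lm": -17.4},
}


def best_hypothesis(candidates):
    """Return argmax_W [log P(O|W) + log P(W)]; P(O) is constant and omitted."""
    return max(
        candidates,
        key=lambda w: candidates[w]["log_am"] + candidates[w]["log_lm"],
    )


print(best_hypothesis(CANDIDATES))
```

The third candidate has the best acoustic score but loses once the language model score is added, which is the point of combining the two models.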

Acoustic Model

$$P(O \mid S) = \sum_{i=1}^{M} \varpi_i \, N(O \mid \mu_i, \Sigma_i)$$

$$P(O \mid W) = \prod_{i=1}^{M} P(O \mid S_i)$$

Deep Neural Network
[Figure: network with input vector, hidden layers, and outputs]
• Compare outputs with the correct answer to get an error signal.
• Back-propagate the error signal to get derivatives for learning.

$$P(Y = i \mid x, W, b) = \mathrm{softmax}_i(Wx + b) = \frac{e^{W_i x + b_i}}{\sum_j e^{W_j x + b_j}}$$
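The two acoustic scoring formulas on this slide, the Gaussian mixture likelihood of a state and the softmax output of a neural network, can be written out directly. The sketch below assumes diagonal covariances and plain Python lists for vectors; the function names are illustrative and not taken from the slides.

```python
import math


def gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_likelihood(x, weights, means, variances):
    """P(O|S) = sum_i w_i * N(O | mu_i, Sigma_i) for one HMM state."""
    return sum(
        w * math.exp(gaussian_logpdf(x, m, v))
        for w, m, v in zip(weights, means, variances)
    )


def softmax(scores):
    """softmax_i(z) = exp(z_i) / sum_j exp(z_j), shifted by max(z) for stability."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def output_posteriors(x, W, b):
    """P(Y=i | x, W, b) = softmax_i(Wx + b) for one output layer."""
    logits = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
    return softmax(logits)


# Two identical unit-variance components centred at 0, evaluated at 0.
print(gmm_likelihood([0.0], [0.5, 0.5], [[0.0], [0.0]], [[1.0], [1.0]]))
# Three output classes for a two-dimensional input.
print(output_posteriors([1.0, 2.0], [[0.1, 0.2], [0.0, 0.0], [-0.1, 0.3]], [0.0, 0.1, 0.0]))
```

Subtracting the maximum logit before exponentiating does not change the result but avoids overflow for large scores.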

Language Model
• N-Gram Model
• Build the LM by calculating n-gram probabilities from a text training corpus: how likely is one word to follow another? To follow the two previous words?

$$p(S) = p(W_1, W_2, \ldots, W_k) = p(W_1)\, p(W_2 \mid W_1) \cdots p(W_k \mid W_1, W_2, W_3, \ldots, W_{k-1})$$

• Smoothing methods
  – KN (Kneser-Ney), GT (Good-Turing), Stupid Backoff
• Grammar
  – ABNF describes a formal system of a language, to be used as a bidirectional communication protocol.

Quick, Small N-Gram

    \data\
    ngram 1=4
    ngram 2=3
    ngram 3=2

    \1-grams:
    -0.60206 hello -0.39794
    -0.60206 world -0.3979
    -0.60206 </s> -0.39794
    -0.60206 <s> -0.39794

    \2-grams:
    0 hello world -0.39794
    0 world </s> -0.39794
    0 <s> hello -0.39794

    \3-grams:
    0 hello world </s>
    0 <s> hello world

    \end\

Grammar

    public $basicCmd = $digit<1->;
    $digit = (0|1|2|3|4|5|6|7|8);
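To make the small n-gram listing on this slide concrete, the sketch below scores a sentence with it. The values are the log10 probabilities and backoff weights as read from the slide; the dictionary layout and the function names are assumptions for illustration, and the lookup follows the usual ARPA convention (use the longest matching n-gram, otherwise add the context's backoff weight and retry with a shorter context).

```python
# log10 probabilities, keyed by n-gram tuple (values as read from the slide).
LOGPROB = {
    ("hello",): -0.60206, ("world",): -0.60206,
    ("</s>",): -0.60206, ("<s>",): -0.60206,
    ("hello", "world"): 0.0, ("world", "</s>"): 0.0, ("<s>", "hello"): 0.0,
    ("hello", "world", "</s>"): 0.0, ("<s>", "hello", "world"): 0.0,
}
# log10 backoff weights for contexts; a missing entry means weight 0.0.
BACKOFF = {
    ("hello",): -0.39794, ("world",): -0.3979,
    ("</s>",): -0.39794, ("<s>",): -0.39794,
    ("hello", "world"): -0.39794, ("world", "</s>"): -0.39794,
    ("<s>", "hello"): -0.39794,
}


def log10_prob(word, context):
    """log10 P(word | context) with backoff to shorter contexts."""
    ngram = context + (word,)
    if ngram in LOGPROB:
        return LOGPROB[ngram]
    if not context:
        raise KeyError(f"out-of-vocabulary word: {word}")
    return BACKOFF.get(context, 0.0) + log10_prob(word, context[1:])


def sentence_log10_prob(words, order=3):
    """Chain-rule score of a sentence, padded with <s> and </s>."""
    history = ["<s>"]
    total = 0.0
    for word in list(words) + ["</s>"]:
        context = tuple(history[-(order - 1):])
        total += log10_prob(word, context)
        history.append(word)
    return total


print(sentence_log10_prob(["hello", "world"]))
print(log10_prob("world", ("<s>",)))
```

In this toy model every step of "hello world" is covered by a listed n-gram, while "world" directly after `<s>` is not listed as a bigram and therefore falls back to the unigram plus the backoff weight of `<s>`.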

Decoder
• Find the best hypothesis P(O|W) P(W) given
  – A sequence of acoustic feature vectors (O)
  – A trained HMM (AM)
  – Lexicon (PM)
  – Probabilities of word sequences (LM)
• For O
  – Weighted finite-state transducer
  – Build a network composed of the HMM triphones and words in the AM and LM.
  – Calculate the most likely state sequence in the HMM given transition and observation probabilities.
  – Trace back through the state sequence to get the word sequence.
  – Viterbi decoder
  – N-best vs. 1-best vs. lattice output
• Limiting search
  – Lattice minimization and determinization
  – Pruning: beam search

Decoder Network
• Viterbi Decoder Process
[Figure: Viterbi search through the decoder network; axes t and phoneme / word, path from start to End]
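The search steps listed above (find the most likely state sequence, trace back, prune with a beam) can be sketched on a toy two-state HMM. This is a minimal illustration, not the WFST decoder the slides describe: the state names, observation symbols, and probabilities are all invented, and `beam` is an assumed parameter giving the allowed log-score gap to the best active state.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit, beam=float("inf")):
    """Return (best state path, its log probability) with optional beam pruning."""
    # scores[s]: best log probability of any path ending in state s at this frame.
    scores = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    backpointers = []
    for o in obs[1:]:
        best = max(scores.values())
        # Beam pruning: only states close enough to the best one are extended.
        active = {s: v for s, v in scores.items() if v >= best - beam}
        new_scores, pointers = {}, {}
        for s in states:
            score, prev = max((v + log_trans[p][s], p) for p, v in active.items())
            new_scores[s] = score + log_emit[s][o]
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state to recover the state sequence.
    last = max(scores, key=scores.get)
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, scores[last]


def to_log(table):
    """Convert a (possibly nested) probability table to log probabilities."""
    return {k: to_log(v) if isinstance(v, dict) else math.log(v) for k, v in table.items()}


# Invented toy model: two phone-like states and three acoustic symbols.
STATES = ["ay", "th"]
LOG_START = to_log({"ay": 0.6, "th": 0.4})
LOG_TRANS = to_log({"ay": {"ay": 0.7, "th": 0.3}, "th": {"ay": 0.4, "th": 0.6}})
LOG_EMIT = to_log({
    "ay": {"a": 0.5, "b": 0.4, "c": 0.1},
    "th": {"a": 0.1, "b": 0.3, "c": 0.6},
})

path, logp = viterbi(["a", "b", "c"], STATES, LOG_START, LOG_TRANS, LOG_EMIT)
print(path, math.exp(logp))
```

With a finite `beam`, states that fall too far behind the current best are not extended, which trades a small risk of search errors for less computation.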


Challenges on the Internet
• Big Training Data
  – Text corpus at the TB level and thousands of hours of speech as training data
  – Speed-optimized methods
• Large Number of Users
  – Real-time response
  – More machines, robust service
• Quick Update
  – Content on the Internet changes every day.
  – Update the models, especially the language model.

Speech Open Platform (used in WeChat)
• Speech recognition
• Speech synthesis
• Speaker verification
• …

One Network Multiple Products
[Figure: a Universal Interface in front of four decoders, paired with general and field-specific LMs (General, Map field, Command, Others…)]

One Network Multiple Technology
[Figure: a Universal Interface over two Parallel Decoding Spaces. Components: Ngram Model, ABNF; One Pass Decoder, Lattice Decoder; GMM, DNN. Outputs: One-Best, N-Best]
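One way to read the "universal interface, several decoders" picture is as a dispatcher that runs domain-specific decoders side by side and keeps the best-scoring result. The sketch below is purely hypothetical: the slides do not say how the decoders are combined, the toy "decoders" score text by vocabulary coverage instead of decoding audio, and all names (`make_decoder`, `DECODERS`, `recognize`) are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def make_decoder(domain, vocabulary):
    """Build a toy 'decoder' that scores input by the share of words it knows."""
    def decode(words):
        known = sum(1 for w in words if w in vocabulary)
        return {
            "domain": domain,
            "text": " ".join(words),
            "score": known / max(len(words), 1),
        }
    return decode


# Hypothetical domains standing in for the general and field-specific LMs.
DECODERS = [
    make_decoder("general", {"play", "go", "to", "the", "a"}),
    make_decoder("map", {"go", "to", "the", "airport", "station"}),
    make_decoder("music", {"play", "a", "song", "album"}),
]


def recognize(words):
    """Universal entry point: run every decoder in parallel, return the best result."""
    with ThreadPoolExecutor(max_workers=len(DECODERS)) as pool:
        results = list(pool.map(lambda decoder: decoder(words), DECODERS))
    return max(results, key=lambda r: r["score"])


print(recognize(["go", "to", "the", "airport"]))
print(recognize(["play", "a", "song"]))
```

The caller sees a single interface, and adding a new vertical only means registering one more decoder.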

Recognition rate

Non-Finite Field
• Sampling Accuracy of Current Availability
• The core performance of Speech Recognition is optimized and still developing.
• Accuracy rate: 94% (audio sampling at 16 kHz)
• Usage amount: 18 million per day
[Figure: accuracy by week, Week 33 to Week 38; data labels 95.10%, 94.90%, 94.70%, 94.30%, 94.10%, 92.60%]
* Source: Accuracy Assessment from Third Party in April'14

Vertical Fields
• Multi Verticals
• Unified entrance with parallel decoding space technology
• Parallel recognition supports 11 classifications of verticals
• 30% better in performance than speech input in Verticals
• Recognition rate: 96%, more accurate than common
[Figure: accuracy chart; data labels 96.20%, 95.80%, 95.50%, 94.40%, 93.80%, 93.40%, 92.70%]
* Source: Accuracy Assessment from Third Party in April'14

Speech Technology Product
• Speech to text Input Tool
  – Wechat Input
  – QQ Input
• Vertical Application
  – Music
  – QQ Map
  – Searching

• Voice Quality Identify
• Contact Searching
• Voice Awaken To Unlock Mobile Phone

Speech Synthesis
• Features
  – 1. Highly efficient synthesis.
  – 2. SDK available for Android and iOS clients.
  – 3. Offline and online TTS.
• Applications
  – 1. WeChat Official Account.
  – 2. WeCall.

Speaker verification
• Application scenarios
  – User login verification
  – Bank transfer, payment verification
  – Forgot password
• Advantages
  – Convenient, fast
  – Safe
  – Good user experience

How To Get Speech Technology
• http://pr.weixin.qq.com/voice/intro

Thanks

