End-to-End AI Speech in DiDi - From Algorithm to Application


  1. End-to-End AI Speech in DiDi - From Algorithm to Application (Chinese: 滴滴端到端语音AI实践-从算法到实现). Contacts: lixiangang@didiglobal.com, pengyiping@didiglobal.com

  2. MORE THAN A JOURNEY

  3. Speech Processing & NLP Layout
  • Understand What You Say: Speech Recognition. Speech can be converted into text; voices of different people can be identified; different voices can be categorized.
  • Think What You Say: Natural Language Processing, including text categorization, syntax parsing, intention recognition and semantic comprehension, etc.
  • Say What You Think: Speech Synthesis. Text can be converted into speech; music notes can be converted into songs.

  4. Driver Care Assistant Intelligent Bot

  5. Driver Care Assistant Intelligent Bot

  6. Intelligent Assistant: ASR + NLU to recognize questions, provide solutions, and generate abstracts

  7. Voice Interaction: ASR + NLU + TTS
  • Japan & Australia: accept orders
  • China: cancel orders

  8. Voice Interaction

  9. MORE THAN A JOURNEY

  10. Contents
  • Algorithm and computation capability are key to bringing speech AI to production.
  • From algorithm to implementation, we will briefly walk through our most important speech AI applications.
    ◦ Speech & NLP applications: smart customer service assistant, responsibility judgement, voice interaction
    ◦ Speech technology: ASR, signal processing, speaker identification, emotion recognition
    ◦ How GPU enables our applications: graph optimization, online serving, DELTA

  11. Attentional ASR
  • Dictionary: the modeling units for Mandarin Chinese ASR
    ◦ Word: 北京 | Character: 北 京 | Syllable: bei jing | Initial-final/phones: b ei j ing
  • Characters are usually selected as the basic modeling units.
  • Language model: how can we benefit from a large text corpus without an N-gram model?
    ◦ We pre-train an RNN-LM and then merge it into the acoustic neural network.
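The slide describes merging the pre-trained RNN-LM into the acoustic network; a simpler way to illustrate how an external LM helps is shallow fusion at decode time. A minimal sketch, with an assumed interpolation weight:

```python
import numpy as np

def fused_next_token_scores(asr_log_probs, lm_log_probs, lm_weight=0.3):
    """Shallow fusion: interpolate acoustic-model and RNN-LM scores in
    log space when ranking beam-search hypotheses. This stands in for
    the in-network fusion described on the slide; lm_weight is an
    illustrative, tunable value.

    asr_log_probs, lm_log_probs: (vocab_size,) next-token log-probs.
    """
    return asr_log_probs + lm_weight * lm_log_probs
```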

  12. End-to-end speech recognition
  • End-to-end is a relative concept:
    ◦ DNN-HMM (modeling unit: phoneme): we need decision-tree based state clustering, a dictionary, and a language model.
    ◦ RNN-CTC (modeling unit: syllable/character): we need a dictionary and a language model; an N-gram based language model would improve performance. (If we use cd-phones as the modeling units, we still need decision-tree based state clustering.)
    ◦ RNN-Attention: we do not need extra models.

  13. Attentional ASR • Sequence-to-sequence model adapted from machine translation

  14. Listen-Attend-Spell
  • Encoder (Listen): maps the input feature sequence to embeddings.
  • Decoder (Spell): maps the embeddings, weighted by attention, to output symbols.
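A minimal sketch of the attention step linking Listen and Spell, using single-head dot-product attention in NumPy (LAS itself uses a learned attention network; this is a simplification):

```python
import numpy as np

def attend(decoder_state, encoder_outputs):
    """One attention step in a LAS-style decoder.

    decoder_state:   (d,)   current decoder hidden state (the query).
    encoder_outputs: (T, d) "Listen" embeddings for T encoder frames.
    Returns the context vector and the alignment weights.
    """
    scores = encoder_outputs @ decoder_state            # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                            # softmax over frames
    context = weights @ encoder_outputs                 # (d,)
    return context, weights
```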

  15. Attention vs. CTC
  • Advantages
    ◦ No conditional independence assumption
    ◦ Joint learning of acoustic and language information
    ◦ The speech recognition system is simpler
  • Disadvantages
    ◦ Not easy to converge; training attention models requires more tricks
    ◦ Cannot be used for "streaming" speech recognition: during inference, the model can produce the first output token only after all input speech frames have been consumed

  16. Listen-Attend-Spell
  • Hard to train; many "tricks":
    ◦ Scheduled sampling
    ◦ Label smoothing (2016)
    ◦ Multi-task learning (2017)
    ◦ Multi-headed attention (2018)
    ◦ SpecAugment (2019): data augmentation for LAS; achieved state-of-the-art results on LibriSpeech and SWBD (a masking sketch follows)
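A minimal sketch of the SpecAugment masking policies (frequency and time masking; the mask-width parameters here are illustrative, not the exact LibriSpeech policy, and time warping is omitted):

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, F=27, num_time_masks=2, T=40):
    """Zero out random frequency bands and time spans of a (time, freq)
    log-mel spectrogram, as SpecAugment does.
    """
    spec = spec.copy()
    n_t, n_f = spec.shape
    for _ in range(num_freq_masks):
        f = np.random.randint(0, F + 1)                 # mask width
        f0 = np.random.randint(0, max(1, n_f - f))      # mask start
        spec[:, f0:f0 + f] = 0.0
    for _ in range(num_time_masks):
        t = np.random.randint(0, T + 1)
        t0 = np.random.randint(0, max(1, n_t - t))
        spec[t0:t0 + t, :] = 0.0
    return spec
```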

  17. Speech-Transformer • Speech Transformer • Transformer applied to ASR • With Conv layers as inputs

  18. Speech-Transformer • Speech Transformer • Transformer applied to ASR • With Conv layers as inputs

  19. Speech-Transformer
  • Speech Transformer: Transformer applied to ASR, with Conv layers as inputs
  • Time-restricted self-attention: left & right contexts restrict the attention mechanism (see the mask sketch below)
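A sketch of how the left/right context restriction can be realized as an additive attention mask (the context sizes are assumed for illustration):

```python
import numpy as np

def time_restricted_mask(T, left=16, right=4):
    """Build a (T, T) additive mask so frame t attends only to frames
    [t - left, t + right]; everything else becomes -inf before the
    softmax. The context sizes are illustrative assumptions.
    """
    mask = np.full((T, T), -np.inf)
    for t in range(T):
        lo, hi = max(0, t - left), min(T, t + right + 1)
        mask[t, lo:hi] = 0.0
    return mask  # add to the attention logits before the softmax
```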

  20. Unsupervised pre-training for speech-transformer
  • Pre-training: like BERT in NLP, e.g. Masked Predictive Coding
  • Fine-tuning: plug in a decoder

  21. Unsupervised pre-training for speech-transformer
  • Masked Predictive Coding: mask 15% of all frames in each sequence at random, and predict only the masked frames rather than reconstructing the entire input
  • Dynamic masking: like RoBERTa, masking strategies are not decided in advance
  • Down-sampling: the local smoothness of speech makes learning too easy without down-sampling; eight-fold down-sampling is used, as in LAS (a masking sketch follows)
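A minimal sketch of the dynamic masking step, assuming the frames have already been eight-fold down-sampled:

```python
import numpy as np

def mpc_mask(frames, mask_prob=0.15):
    """Dynamic masking for Masked Predictive Coding: re-drawn on every
    call (RoBERTa-style) rather than fixed in advance.

    frames: (T, d) down-sampled feature frames.
    Returns the corrupted input and the boolean mask; the loss is
    computed only on frames[mask].
    """
    T = frames.shape[0]
    mask = np.random.rand(T) < mask_prob     # ~15% of frames
    corrupted = frames.copy()
    corrupted[mask] = 0.0
    return corrupted, mask
```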

  22. Unsupervised pre-training for speech-transformer

  23. Related topics: signal processing for noise and far-field
  • AEC, de-reverb, BSS, AGC, NS, beamforming
  [Block diagram: generalized sidelobe canceller. Microphone inputs x_n(k) feed fixed filters 0 .. N-1 whose outputs are summed; a blocking matrix (BM) feeds adaptive filters u_0(k) .. u_M(k), whose combined output is subtracted from the delayed (z^-L) fixed-beamformer path to produce y(k).]

  24. Signal processing modules
  • Acoustic Echo Cancellation
  • Noise suppression
  • Beamforming / blind source separation
  • Auto gain control
  [Audio demo: original speech ("The reasons for this dive seemed foolish now. His captain was thin and haggard and his beautiful boots were worn and shabby.") vs. processed speech ("Production may fall far below expectations.")]
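As a toy illustration of the fixed-beamformer path in the diagram above, a delay-and-sum beamformer (steering delays are assumed given; estimating them from array geometry is omitted):

```python
import numpy as np

def delay_and_sum(mics, delays):
    """Align each microphone signal by its steering delay and average.

    mics:   (N, T) time-domain signals from N microphones.
    delays: (N,) integer sample delays toward the target direction.
    np.roll wraps at the edges, which is acceptable for a sketch.
    """
    out = np.zeros(mics.shape[1])
    for sig, d in zip(mics, delays):
        out += np.roll(sig, -d)
    return out / len(mics)
```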

  25. Multimodality: speech + text
  • Pipeline today: speech → ASR model → text → NLP model → intent/slots ...
  • Can we utilize speech information directly, or even build an end-to-end model?

  26. Multimodal speech emotion recognition
  • Aims to automatically identify the emotional state of a human being from his or her voice.
  • Audio signal features → speech emotion recognition
  • Transcribed text → sentiment analysis
  • Combination of audio and text → multimodal methods
  https://www.cs.cmu.edu/~morency/MMML-Tutorial-ACL2017.pdf

  27. Multimodal emotion recognition
  • Motivation (Xu et al., Interspeech, 2019)
    ◦ Existing methods ignore the temporal relationship between speech and text at a fine-grained level.
    ◦ A multimodal system will benefit from the alignment information, since speech and text inherently co-exist in the temporal dimension.
    ◦ Utilize an attention network to learn the alignment between speech and text.

  28. Learning alignment between speech and text
  • Speech encoder and text encoder: speech + ASR-recognized text, with a BLSTM for each
  • Alignment: utilize attention to learn the alignment weights between speech frames and text words
  • Concatenate the aligned features into a multimodal feature for classification (sketched below)
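A minimal sketch of the alignment step, with dot-product attention standing in for the paper's learned attention network:

```python
import numpy as np

def align_speech_text(speech_feats, text_feats):
    """Compute soft alignment weights between speech frames and text
    words, then concatenate each word's aligned speech context to its
    text feature to form the multimodal feature.

    speech_feats: (T, d) BLSTM outputs over speech frames.
    text_feats:   (W, d) BLSTM outputs over ASR-recognized words.
    Returns: (W, 2d) multimodal features for classification.
    """
    scores = text_feats @ speech_feats.T                      # (W, T)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)             # softmax over frames
    aligned = weights @ speech_feats                          # (W, d)
    return np.concatenate([text_feats, aligned], axis=1)
```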

  29. Speaker Identification Algorithm
  • Classical i-vector: GMM-UBM, a generative model
  • d-vector: sum over all hidden outputs; somewhat text-dependent
  • x-vector based models: statistics pooling; faster to compute; more robust to noise; better on short utterances; scale better to large data
  [Diagram: d-vector and x-vector networks. Both take filterbank features; the d-vector stacks FC layers at the frame level, while the x-vector stacks TDNN layers followed by stats pooling and utterance-level FC layers; both are trained with a cross-entropy (XENT) speaker-ID objective.]
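The statistics-pooling step that turns frame-level TDNN outputs into a single utterance-level vector can be sketched in a few lines:

```python
import numpy as np

def stats_pooling(frame_embeddings):
    """x-vector statistics pooling: concatenate the per-dimension mean
    and standard deviation of frame-level features.

    frame_embeddings: (T, d) frame-level TDNN outputs.
    Returns: (2d,) utterance-level representation.
    """
    mu = frame_embeddings.mean(axis=0)
    sigma = frame_embeddings.std(axis=0)
    return np.concatenate([mu, sigma])
```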

  30. Speaker Identification Algorithm
  State-of-the-art performance:
  • Based on additive angular margin loss
  • Significantly lower error rate
  • Development set: 3000k utterances, 200k speakers
  • Training: 12.5k utterances, 2k speakers
  • Testing: 12.5k utterances, same 2k speakers
  [Chart: EER (%) for i-vector, x-vector and GE2E models as training data grows from 10k to 3000k utterances; EER falls from about 10.8% to about 2.5%.]
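A sketch of the additive angular margin (ArcFace-style) logit computation behind the reported gains; the scale and margin values are illustrative assumptions:

```python
import numpy as np

def aam_logits(embedding, speaker_weights, label, s=30.0, m=0.2):
    """Additive angular margin softmax logits for speaker-ID training.

    embedding:       (d,)   L2-normalized utterance embedding.
    speaker_weights: (C, d) L2-normalized per-speaker weight vectors.
    label:           index of the true speaker.
    s, m:            scale and angular margin (illustrative values).
    """
    cos = speaker_weights @ embedding                 # cosine similarities
    theta = np.arccos(np.clip(cos[label], -1.0, 1.0))
    cos[label] = np.cos(theta + m)                    # penalize the target class
    return s * cos                                    # feed to cross-entropy
```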

  31. MORE THAN A JOURNEY

  32. Online inference acceleration
  • Huge amount of speech data: huge volume per day; requires strong processing power
  • Strategy: algorithm + computation
  [Chart: average inference performance per node and per unit cost, normalized to a dual-socket Xeon 4114 CPU baseline (100%). Tesla P4 float32, P4 int8, and P4 int8 + model compression raise throughput step by step, up to roughly 1144% of the CPU baseline.]

  33. Significant speed-up with GPU deployment
  • X86: AVX-512 brings no significant speed-up (~20%, TDP constrained); some SKUs have only one AVX-512 unit, which limits the speed-up; CNNs take up to 90% of computation time
  • GPU: uses elastic Tesla P4 instances on DiDiYun
  • Model quantization: int8 quantization brings +80% speed-up over float32, with negligible accuracy loss (see the sketch below)
  • Model compression: distillation-based model compression, with acceptable accuracy loss
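A toy sketch of the float32 → int8 step (symmetric per-tensor quantization; the production deployment uses the GPU's int8 path, so this only illustrates the idea):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization of a weight tensor."""
    scale = max(np.abs(w).max(), 1e-8) / 127.0        # avoid divide-by-zero
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float32 tensor; the rounding error is the
    source of the (negligible) accuracy loss noted above."""
    return q.astype(np.float32) * scale
```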

  34. Inference optimization
  • Overall strategy
    ◦ Bypass-based graph optimization: a custom graph-optimization strategy based on TF Grappler
    ◦ More platform-specific ops based on TF custom ops (X86/ARM/GPU): decoupled from the TF source tree; hand-crafted high-performance ops

  35. Graph Optimization
  • Op fusion
    ◦ Complex-subgraph fusion: e.g. the complex activation-function (Gelu) computation subgraph is fused into a single Gelu op
    ◦ Simple-graph fusion: common fusion strategies such as Conv+BatchNorm, Conv+BatchNorm+Relu, Matmul+[Matmul]+Relu, ...
  • Benefits: higher computation density; fewer kernel invocations; less memory access; better register utilization
  A sketch of both fusion patterns follows.
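Two sketches, one per pattern: the unfused Gelu subgraph that gets collapsed into a single op, and BatchNorm folding, one way Conv+BatchNorm becomes a single Conv (the weight layout is assumed out-channel-first):

```python
import numpy as np

def gelu_unfused(x):
    """The tanh-approximate Gelu as a chain of primitive ops (pow, mul,
    add, tanh, ...); graph optimization fuses this into one Gelu kernel."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                    * (x + 0.044715 * x ** 3)))

def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm parameters into the preceding Conv/Matmul so the
    pair runs as one op. w: (out_ch, ...) weights; b: (out_ch,) bias."""
    scale = gamma / np.sqrt(var + eps)
    w_folded = w * scale.reshape(-1, *([1] * (w.ndim - 1)))
    b_folded = (b - mean) * scale + beta
    return w_folded, b_folded
```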
