End-to-End AI Speech in DiDi - From Algorithm to Application - PowerPoint PPT Presentation


SLIDE 1

End-to-End AI Speech in DiDi - From Algorithm to Application

DiDi End-to-End Speech AI Practice - From Algorithm to Implementation
lixiangang@didiglobal.com pengyiping@didiglobal.com

SLIDE 2

MORE THAN A JOURNEY

SLIDE 3

Speech Processing & NLP Layout

  • Speech Recognition: speech can be converted into text; voices of different people can be identified; different voices can be categorized
  • Speech Synthesis: text can be converted into speech; music notes can be converted into songs
  • Language Understanding: natural language processing, including text categorization, syntax parsing, intent recognition and semantic comprehension, etc.

Understand What You Say · Think What You Think · Say What You Say

SLIDE 4

Driver Care Assistant Intelligent Bot

SLIDE 5

Driver Care Assistant Intelligent Bot

SLIDE 6

Intelligent Assistant

Recognize questions · Provide solutions · Generate summaries

ASR + NLU

SLIDE 7

Voice Interaction

ASR + NLU + TTS

  • Japan & Australia: accept orders
  • China: cancel orders

SLIDE 8

Voice Interaction

SLIDE 9

MORE THAN A JOURNEY

SLIDE 10

Contents

  • Algorithm and computation capability are key to bringing speech AI into production.
  • From algorithm to implementation, we will briefly walk through our most important speech AI applications.

  • Speech & NLP applications
    • Smart customer service assistant
    • Responsibility judgement
    • Voice interaction
  • Speech technology
    • ASR
    • Signal processing
    • Speaker identification
    • Emotion recognition
  • How GPU enables our applications
    • Graph optimization
    • Online serving
    • DELTA
SLIDE 11

Attentional ASR

  • Dictionary
    • The modeling units for Mandarin Chinese ASR
    • Characters are usually selected as the basic modeling units
  • Language Model
    • How to benefit from a large text corpus without an N-gram?
    • We pre-train an RNN-LM and then merge it into the acoustic neural network

Modeling units for 北京 ("Beijing"): word 北京 · character 北 京 · syllable bei jing · initial-final/phones b ei j ing
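Merging the RNN-LM into the acoustic network is done inside the model; a common, simpler alternative for using a pre-trained LM is shallow fusion at decode time. The sketch below is illustrative only (toy distributions, hypothetical `lm_weight`), not the deck's exact method:

```python
import numpy as np

def shallow_fusion_score(log_p_am, log_p_lm, lm_weight=0.3):
    """Combine acoustic-model and language-model log-probabilities.

    Shallow fusion interpolates the two scores during beam search:
        score(y) = log P_am(y | x) + lambda * log P_lm(y)
    """
    return log_p_am + lm_weight * log_p_lm

# Toy next-token distributions over a 4-entry vocabulary.
vocab = ["北", "京", "bei", "jing"]
log_p_am = np.log(np.array([0.40, 0.30, 0.20, 0.10]))  # acoustic model
log_p_lm = np.log(np.array([0.10, 0.60, 0.15, 0.15]))  # language model

fused = shallow_fusion_score(log_p_am, log_p_lm, lm_weight=0.5)
best = vocab[int(np.argmax(fused))]  # LM evidence flips the top hypothesis
```

Here the acoustic model alone prefers index 0, but the fused score prefers the token the LM considers likely.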

SLIDE 12
  • End-to-end is a relative concept

End-to-end speech recognition

  • DNN-HMM (phoneme units): we need decision-tree based state clustering, a dictionary, and a language model
  • RNN-CTC (syllable/character units): we need a dictionary and a language model (if we use cd-phones as modeling units, we still need decision-tree based state clustering); N-gram based language models improve performance
  • RNN-Attention: we do not need extra models

SLIDE 13
  • Sequence-to-sequence model borrowed from machine translation

Attentional ASR

SLIDE 14
  • Encoder (Listen): maps the input feature sequence to an embedding
  • Decoder (Spell): maps the embedding, guided by attention information, to the output symbols

Listen-Attend-Spell
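The Listen/Spell interaction can be illustrated with a single dot-product attention step. This is a minimal NumPy sketch (random toy tensors, not the deck's actual model, which would use learned projections):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(encoder_states, decoder_state):
    """One attention step of a Listen-Attend-Spell style decoder.

    encoder_states: (T, d) embeddings produced by the listener (encoder).
    decoder_state:  (d,)  current speller (decoder) state.
    Returns the attention-weighted sum of encoder states (context vector).
    """
    scores = encoder_states @ decoder_state   # dot-product energies, shape (T,)
    weights = softmax(scores)                 # attention distribution over frames
    context = weights @ encoder_states        # context vector, shape (d,)
    return context, weights

rng = np.random.default_rng(0)
enc = rng.normal(size=(6, 4))   # 6 encoder frames, embedding dim 4
dec = rng.normal(size=4)
context, weights = attend(enc, dec)
```

The decoder conditions each output symbol on this context vector, which is how acoustic and language information are learned jointly.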

SLIDE 15
  • Advantages
    • No conditional independence assumptions
    • Joint learning of acoustic and language information
    • The speech recognition system is simpler
  • Disadvantages
    • Not easy to converge; more tricks are needed to train the attention model
    • Cannot be used for “streaming” speech recognition: during inference, the model can produce the first output token only after all input speech frames have been consumed

Attention vs. CTC

SLIDE 16
  • Hard to train – many “tricks”
    • Scheduled sampling
    • Label smoothing (2016)
    • Multi-task learning (2017)
    • Multi-headed attention (2018)
    • SpecAugment (2019)
      • Data augmentation for LAS
      • Achieved state-of-the-art results on LibriSpeech and Switchboard (SWBD)

Listen-Attend-Spell
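SpecAugment's core idea is cheap: mask random frequency bands and time spans of the input spectrogram. A minimal sketch (illustrative parameter values; the original recipe also includes time warping, omitted here):

```python
import numpy as np

def spec_augment(spec, num_freq_masks=1, F=8, num_time_masks=1, T=20, rng=None):
    """Apply SpecAugment-style frequency and time masking.

    spec: (num_frames, num_bins) log-mel features. Masked regions are zeroed.
    F / T bound the width of each frequency / time mask.
    """
    rng = rng or np.random.default_rng()
    out = spec.copy()
    frames, bins = out.shape
    for _ in range(num_freq_masks):
        f = rng.integers(0, F + 1)                    # mask width in bins
        f0 = rng.integers(0, max(bins - f, 0) + 1)    # mask start
        out[:, f0:f0 + f] = 0.0
    for _ in range(num_time_masks):
        t = rng.integers(0, T + 1)                    # mask width in frames
        t0 = rng.integers(0, max(frames - t, 0) + 1)
        out[t0:t0 + t, :] = 0.0
    return out

spec = np.ones((100, 40))                             # dummy 100-frame spectrogram
aug = spec_augment(spec, rng=np.random.default_rng(1))
```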

SLIDE 17
  • Speech Transformer
  • Transformer applied to ASR
  • With Conv layers as inputs

Speech-Transformer


SLIDE 19
  • Speech Transformer
    • Transformer applied to ASR
    • With Conv layers as inputs
  • Time-restricted self-attention
    • Left and right contexts restrict the attention mechanism

Speech-Transformer
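The left/right restriction amounts to a band-shaped attention mask. A small NumPy sketch of building such a mask (illustrative only; a real implementation would apply it to the score matrix before the softmax):

```python
import numpy as np

def time_restricted_mask(num_frames, left, right):
    """Boolean mask for time-restricted self-attention.

    Position i may attend only to positions j with i - left <= j <= i + right,
    bounding the context (and latency) compared with full self-attention.
    """
    idx = np.arange(num_frames)
    rel = idx[None, :] - idx[:, None]      # rel[i, j] = j - i
    return (rel >= -left) & (rel <= right)

mask = time_restricted_mask(6, left=2, right=1)
# Usage: scores[~mask] = -np.inf before the softmax, so disallowed
# positions receive zero attention weight.
```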

SLIDE 20
  • Pre-training: like BERT in NLP, e.g. Masked Predictive Coding
  • Fine-tuning: plug in a decoder

Unsupervised pre-training for Speech-Transformer

SLIDE 21
  • Masked Predictive Coding:
    • Mask 15% of all frames in each sequence at random, and predict only the masked frames rather than reconstructing the entire input
  • Dynamic masking:
    • As in RoBERTa, masking strategies are not decided in advance
  • Down-sampling:
    • The local smoothness of speech makes learning too easy without down-sampling; eight-fold down-sampling is used, as in LAS

Unsupervised pre-training for Speech-Transformer
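The masking step can be sketched in a few lines. This is a simplified illustration (random per-call masking stands in for the dynamic-masking strategy; an L1 loss on masked frames is one plausible objective, not necessarily the exact one used):

```python
import numpy as np

def mpc_mask(features, mask_ratio=0.15, rng=None):
    """Masked Predictive Coding style input masking.

    Randomly selects ~15% of frames, zeroes them in the model input,
    and returns their indices; the training loss is computed only on
    those masked frames, not on the full reconstruction.
    """
    rng = rng or np.random.default_rng()
    num_frames = features.shape[0]
    num_masked = max(1, int(round(num_frames * mask_ratio)))
    masked_idx = rng.choice(num_frames, size=num_masked, replace=False)
    inputs = features.copy()
    inputs[masked_idx] = 0.0
    return inputs, np.sort(masked_idx)

feats = np.random.default_rng(0).normal(size=(100, 40))
inputs, masked_idx = mpc_mask(feats, rng=np.random.default_rng(2))

# Loss restricted to masked frames (here against the untouched originals):
l1 = np.abs(inputs[masked_idx] - feats[masked_idx]).mean()
```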

SLIDE 22

Unsupervised pre-training for speech-transformer

SLIDE 23

Related topics: signal processing for noise and far-field

AEC → De-reverb → BSS → Beamforming → NS → AGC

[Diagram: generalized sidelobe canceller style beamformer - fixed filters on the microphone inputs x(k), a blocking matrix (BM), adaptive filters on u(k), a delay z^-L, and summation producing the output y(k)]
SLIDE 24
  • AEC – Acoustic Echo Cancellation
  • NS – Noise Suppression
  • Beamforming / Blind Source Separation
  • AGC – Auto Gain Control

Original speech → processed speech

[Demo audio transcript: “The reasons for this dive seemed foolish now. His captain was thin and haggard and his beautiful boots were worn and shabby. Production may fall far below expectations.”]

SLIDE 25

Multimodality: speech + text

Speech → ASR model → Text → NLP model → Intent/slot…

Can we utilize speech information, or even build an end-to-end model?

SLIDE 26

Multimodal speech emotion recognition

  • Aims to automatically identify the emotional state of a human being from his or her voice
  • Audio signal features → speech emotion recognition
  • Transcribed text → sentiment analysis
  • Combination of audio and text → multimodal methods

https://www.cs.cmu.edu/~morency/MMML-Tutorial-ACL2017.pdf

SLIDE 27

Multimodal emotion recognition

  • Motivation (Xu et al., Interspeech, 2019)
    • Existing methods ignore the temporal relationship between speech and text at a fine-grained level
    • A multimodal system benefits from alignment information, since speech and text inherently co-exist in the temporal dimension
    • Utilize an attention network to learn the alignment between speech and text

SLIDE 28

Learning alignment between speech and text

  • Speech encoder and text encoder
    • Speech + ASR-recognized text
    • A BLSTM for each modality
  • Alignment
    • Utilize attention to learn the alignment weights between speech frames and text words
    • Concatenate the aligned features into a multimodal feature for classification
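The frame-to-word alignment can be sketched as cross-modal attention. In this toy NumPy version, plain dot-product scores stand in for the trained attention network, and random tensors stand in for BLSTM outputs:

```python
import numpy as np

def align_speech_to_text(frame_states, word_states):
    """Attention-based alignment between speech frames and text words.

    frame_states: (T, d) encoder outputs over speech frames.
    word_states:  (N, d) encoder outputs over ASR-recognized words.
    For each word, attention weights over all frames produce an aligned
    speech feature, which is concatenated with the word feature to form
    the multimodal representation for the classifier.
    """
    scores = word_states @ frame_states.T              # (N, T) energies
    scores -= scores.max(axis=1, keepdims=True)        # numeric stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)      # softmax over frames
    aligned = weights @ frame_states                   # (N, d) per-word speech feature
    return np.concatenate([word_states, aligned], axis=1)   # (N, 2d)

rng = np.random.default_rng(0)
fused = align_speech_to_text(rng.normal(size=(50, 8)),   # 50 frames, dim 8
                             rng.normal(size=(7, 8)))    # 7 words, dim 8
```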

SLIDE 29

Speaker Identification Algorithm

  • Classical i-vector
    • GMM-UBM
    • A generative model
  • d-vector
    • Sum over all hidden outputs
    • Somewhat text-dependent
  • x-vector based models
    • Statistical pooling
    • Faster to compute
    • More robust to noise
    • Better on short utterances
    • Scales better to large data

[Diagram: d-vector vs. x-vector architectures. Both take filterbank features through TDNN layers; the d-vector operates at the frame level through FC layers, while the x-vector applies statistics pooling to produce an utterance-level embedding, followed by FC layers and a cross-entropy (XENT) speaker classifier.]
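The statistics pooling layer that distinguishes x-vectors is simple: concatenate the mean and standard deviation of the frame-level outputs, yielding a fixed-size utterance vector for any utterance length. A minimal sketch (illustrative dimensions):

```python
import numpy as np

def stats_pooling(frame_embeddings, eps=1e-8):
    """Statistics pooling as used in x-vector systems.

    frame_embeddings: (num_frames, dim) frame-level TDNN outputs.
    Returns a (2 * dim,) utterance-level vector: per-dimension mean
    concatenated with per-dimension standard deviation.
    """
    mean = frame_embeddings.mean(axis=0)
    std = np.sqrt(frame_embeddings.var(axis=0) + eps)  # eps guards sqrt(0)
    return np.concatenate([mean, std])

frames = np.random.default_rng(0).normal(size=(300, 512))  # e.g. a 3 s utterance
utt = stats_pooling(frames)
```

Because the output size is independent of the number of frames, the same network handles short and long utterances, which is part of why x-vectors behave well on short test segments.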

SLIDE 30

Speaker Identification Algorithm

State-of-the-art performance

  • Based on the additive angular margin loss
  • Significantly lower error rate
  • Development: 3,000k utterances, 200k speakers
  • Training: 12.5k utterances, 2k speakers
  • Testing: 12.5k utterances, same 2k speakers

[Chart: EER comparison for i-vector, x-vector, and GE2E models as the development data grows from 10k to 3,000k utterances; reported EERs span 10.762 down to 2.542]
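The additive angular margin (ArcFace-style) loss mentioned above normalizes both embeddings and class weights so logits become cosines, then penalizes the target speaker's angle by a margin. A NumPy sketch of the logit computation (illustrative `margin`/`scale` values; the softmax cross-entropy on top is omitted):

```python
import numpy as np

def aam_softmax_logits(embedding, class_weights, target, margin=0.2, scale=30.0):
    """Additive angular margin logits for speaker classification.

    embedding:     (d,) utterance embedding (e.g. an x-vector).
    class_weights: (num_speakers, d) classifier weight vectors.
    The target speaker's angle is increased by `margin` before scaling,
    which forces tighter, better-separated speaker clusters.
    """
    e = embedding / np.linalg.norm(embedding)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = w @ e                                        # cosine similarities
    theta = np.arccos(np.clip(cos, -1.0, 1.0))         # angles
    cos_m = cos.copy()
    cos_m[target] = np.cos(theta[target] + margin)     # margin on target class only
    return scale * cos_m

rng = np.random.default_rng(0)
logits = aam_softmax_logits(rng.normal(size=16), rng.normal(size=(5, 16)), target=2)
```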

SLIDE 31

MORE THAN A JOURNEY

SLIDE 32

Online inference acceleration

Huge amount of speech data

  • Huge amount of data per day
  • Requires strong processing power

Strategy · Algorithm · Computation

Average inference performance per node (relative, per node / per unit cost):

  • CPU (dual-socket Xeon 4114): 100.0% / 100.0%
  • P4 float32: 250.0% / 317.8%
  • P4 int8: 450.0% / 572.0%
  • P4 int8 + model compression: 900.0% / 1144.1%

SLIDE 33

Significant speed up with GPU deployment

X86

  • AVX-512 brings no significant speed-up (~20%, TDP-constrained)
  • Some SKUs have only one AVX unit, which limits the speed-up

GPU

  • CNNs take up to 90% of computation time
  • Uses elastic Tesla P4 instances on DiDiYun

Model Quantization

  • Int8 quantization brings +80% speed up over float32
  • Negligible accuracy loss

Model compression

  • Distillation-based model compression
  • Acceptable accuracy loss
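The int8 speed-up above relies on mapping float32 tensors onto 8-bit integers. A minimal sketch of symmetric per-tensor quantization, one common scheme (the deck does not specify the exact calibration method used):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization.

    A single scale maps the float range [-max|x|, max|x|] onto [-127, 127];
    inference kernels then run integer arithmetic and rescale at the end.
    """
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(64, 64)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = np.abs(w - w_hat).max()   # bounded by half a quantization step
```

The worst-case rounding error is half a quantization step (scale / 2), which is why the accuracy loss is typically negligible for well-scaled weights.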
SLIDE 34

Inference optimization

  • Overall strategy
    • Bypass-based graph optimization
    • Custom graph-optimization strategy based on TF Grappler
    • More platform-specific ops based on TF Custom Op (X86/ARM/GPU)
    • Decoupled from the TF source tree
    • Hand-crafted high-performance ops
SLIDE 35

Graph Optimization

  • Op fusion
    • Simple subgraph fusion: common fusion strategies such as Conv+BatchNorm, Conv+BatchNorm+ReLU, MatMul+[MatMul]+ReLU, ……
    • Complex subgraph fusion: e.g. the multi-op computation graph of a complex activation function (GELU) is fused into a single GELU op
  • Benefits
    • Higher computation density
    • Fewer kernel invocations
    • Less memory access
    • Better register utilization
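Conv+BatchNorm fusion works because an inference-mode BatchNorm is an affine per-channel transform that can be folded into the preceding convolution's weights. A NumPy sketch on a 1x1 conv (kept 1x1 so the conv is a plain matmul; the same algebra applies per output channel for larger kernels):

```python
import numpy as np

def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold an inference-mode BatchNorm into the preceding conv.

    w: (out_ch, in_ch) 1x1 conv weights; b: (out_ch,) bias.
    gamma/beta/mean/var: per-output-channel BN parameters.
    The fused layer computes the identical function in a single op.
    """
    s = gamma / np.sqrt(var + eps)            # per-channel rescale factor
    return w * s[:, None], (b - mean) * s + beta

rng = np.random.default_rng(0)
w, b = rng.normal(size=(8, 4)), rng.normal(size=8)
gamma, beta = rng.normal(size=8), rng.normal(size=8)
mean, var = rng.normal(size=8), rng.uniform(0.5, 2.0, size=8)

x = rng.normal(size=4)                        # one spatial position, 4 channels
y_ref = gamma * ((w @ x + b) - mean) / np.sqrt(var + 1e-5) + beta  # conv then BN
wf, bf = fold_bn_into_conv(w, b, gamma, beta, mean, var)
y_fused = wf @ x + bf                         # single fused op, same result
```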
SLIDE 36

Graph Optimization

  • Redundancy elimination
    • Merge constant terms: constant sub-computations in different graph logic can be merged and evaluated ahead of time, improving runtime performance, e.g.
      Add(c1, Add(x, c2)) → Add(x, c1+c2)
      Conv(c1*x, c2) → Conv(x, c1*c2)
      AddN(c1, x, c2, y) → AddN(c1+c2, x, y)
    • Graph logic simplification: in ResNet-like structures, adjusting the stride reduces the computation of subsequent ops while keeping the graph logically equivalent

SLIDE 37

High performance kernel

  • High-performance kernels: ① hand-crafted ② auto-compiled
    • A module of high-performance ops (red box in the slide)
    • Hand-written ops mainly target computation-intensive ops, optimized for common sizes
    • TVM-style graph-compiled kernel generation mainly targets ops with specific sizes and IO-bound ops

SLIDE 38

High performance kernel

  • ① Hand-crafted
    • Merge small computations into larger kernels
    • Replace multiplications with additions
    • Layout adjustment (e.g. NCHW → NHWC)
    • Better in-cache data utilization
    • Fewer bank conflicts
    • ……
  • ② Auto compilation
    • Replace multiplications with additions
    • Internal fusion of small ops
SLIDE 39

High performance kernel

  • Practical: traditional IR-based optimization (TVM)
    1. Supports flexible JIT and library deployment
    2. LLVM supports many hardware targets
    3. Greatly reduces per-hardware engineering effort

  • Loop transformations: Stmt = LoopXXX(Stmt)
    • Loop interchange (permutation), loop parallelization, loop skewing, loop reversal, loop fusion, loop distribution, loop tiling, ……
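Loop tiling, the last transformation in the list, can be shown on a matrix multiply. This Python/NumPy version only illustrates the loop structure a compiler such as TVM would generate (in production the tiled loops would be emitted as native code, not interpreted Python):

```python
import numpy as np

def matmul_tiled(A, B, tile=32):
    """Loop-tiled matrix multiply C = A @ B.

    Tiling the i, j and reduction loops keeps a tile x tile working set
    hot in cache, which is the point of the transformation; NumPy slicing
    handles the ragged edge tiles automatically.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i0 in range(0, n, tile):            # tile the row loop
        for j0 in range(0, m, tile):        # tile the column loop
            for k0 in range(0, k, tile):    # tile the reduction loop
                C[i0:i0 + tile, j0:j0 + tile] += (
                    A[i0:i0 + tile, k0:k0 + tile] @ B[k0:k0 + tile, j0:j0 + tile]
                )
    return C

rng = np.random.default_rng(0)
A, B = rng.normal(size=(70, 50)), rng.normal(size=(50, 60))
C = matmul_tiled(A, B, tile=16)             # matches A @ B
```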

SLIDE 40

High performance kernel

  • Exploration: polyhedral compilation (PlaidML, TC)
    1. Approaches compiler optimization from another angle (affine transformations)
    2. Easier composition of transformations
    3. No need to repeatedly generate intermediate code
    4. Fast compilation

SLIDE 41

Performance comparison

  • Transformer

Batch size: 1 · Feature dim: 40 · X86: 2.20 GHz (1 thread) · GPU: Tesla P4

Relative inference performance (%), TF v1.14:

  • 100 frames: X86 baseline 100.00% · X86 optimized 190.14% · GPU 293.48%
  • 500 frames: X86 baseline 100.00% · X86 optimized 342.19% · GPU 1314.00%
  • 1000 frames: X86 baseline 100.00% · X86 optimized 387.64% · GPU 2391.53%
  • 2000 frames: X86 baseline 100.00% · X86 optimized 462.22% · GPU 3873.81%
  • 3000 frames: X86 baseline 100.00% · X86 optimized 552.76% · GPU 4741.38%

SLIDE 42

Do better with DiDiYun!

  • ASR & TTS services are now available on didiyun.com

SLIDE 43

Do better with DiDi DELTA!

Easy to use · Easy to deploy · Easy to develop

A DEep learning based Language Technology platform

https://github.com/didi/delta

Open source!

SLIDE 44

MORE THAN A JOURNEY