End-to-End AI Speech in DiDi: From Algorithm to Application
lixiangang@didiglobal.com  pengyiping@didiglobal.com
MORE THAN A JOURNEY
Speech Recognition, Speech Synthesis, Language Understanding

- Speech Recognition: speech can be converted into text; voices of different people can be identified; different voices can be categorized.
- Speech Synthesis: text can be converted into speech; music notes can be converted into songs.
- Language Understanding: natural language processing, including text categorization, syntax parsing, intention recognition, and semantic comprehension.
Understand What You Say. Think What You Think. Say What You Say.
Recognize questions, provide solutions, generate abstracts.
Voice interaction pipeline: ASR → NLU → TTS.
- Japan & Australia: accept orders by voice.
- China: cancel orders by voice.
- Algorithm and computation capability are key to bringing speech AI into production.
- From algorithm to implementation, we will briefly walk through our most important speech AI applications.
Modeling units, illustrated with 北京 (Beijing):
- Word: 北京
- Character: 北 / 京
- Syllable: bei / jing
- Initial-final (phones): b ei / j ing
Modeling units and model families:
- DNN-HMM (phoneme units): requires decision-tree based state clustering, a dictionary, and a language model.
- RNN-CTC (syllable/character units): requires a dictionary and a language model; if cd-phone modeling units are used, decision-tree based state clustering is still needed. N-gram language models improve performance.
- RNN-Attention: requires no extra models.
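One reason RNN-CTC needs no frame-level alignment is its output rule: merge consecutive repeated labels, then drop the blank symbol. A minimal sketch in plain Python; the function name and blank symbol are illustrative, not any specific toolkit's API:

```python
# Illustrative sketch of the CTC collapsing rule: merge repeated labels,
# then drop the blank symbol. Frame-level labels become output tokens
# without an explicit alignment model.
BLANK = "_"

def ctc_collapse(frame_labels):
    """Collapse a per-frame label sequence into an output token sequence."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev:        # merge consecutive repeats
            if label != BLANK:   # drop blanks
                out.append(label)
        prev = label
    return out

# Per-frame labels for 北京 in initial-final units:
print(ctc_collapse(["b", "b", BLANK, "ei", "ei", BLANK, BLANK, "j", BLANK, "ing"]))
# -> ['b', 'ei', 'j', 'ing']
```

Note that a blank between two identical labels (e.g. `a _ a`) keeps both copies, which is how CTC distinguishes repeated tokens from a held frame.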
An attention-based encoder-decoder maps the input sequence to an embedding and feeds the attention information to the decoder; such a model can produce the first output token only after all input speech frames have been consumed.
[Chart: results on the SWBD (Switchboard) benchmark.]
Masked Predictive Coding: frames are masked at random, and the model predicts only the masked frames rather than reconstructing the entire input. Without down-sampling the prediction task would be too easy, so eight-fold down-sampling is used, as in LAS.
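The random frame masking described above can be sketched as follows; the function name, the 15% mask ratio, and the zero-fill choice are illustrative assumptions, not the exact production implementation:

```python
import random

def mask_frames(frames, mask_ratio=0.15, seed=0):
    """Zero out a random subset of frames; return the masked frames and
    the masked positions. The pretraining loss is computed only on these
    positions (predict the masked frames, not reconstruct everything)."""
    rng = random.Random(seed)
    n = len(frames)
    num_masked = max(1, int(n * mask_ratio))
    positions = sorted(rng.sample(range(n), num_masked))
    masked = list(frames)
    for i in positions:
        masked[i] = [0.0] * len(frames[i])   # replace the frame with zeros
    return masked, positions
```

In practice masking is applied to down-sampled frames, so each masked position covers several raw frames and cannot be trivially interpolated from its neighbors.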
Front-end signal processing pipeline: AEC (acoustic echo cancellation) → de-reverberation → BSS (blind source separation) → beamforming → NS (noise suppression) → AGC (automatic gain control).
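As one concrete piece of this pipeline, a delay-and-sum beamformer (one of the simplest fixed beamformers) can be sketched as follows; the function name and the integer-sample-delay simplification are illustrative:

```python
def delay_and_sum(channels, delays):
    """Compensate each channel's known arrival delay (in samples), then
    average across microphones so the target direction adds coherently."""
    n = len(channels[0])
    out = []
    for t in range(n):
        acc = 0.0
        for ch, d in zip(channels, delays):
            idx = t + d                       # undo the arrival delay
            acc += ch[idx] if 0 <= idx < n else 0.0
        out.append(acc / len(channels))
    return out

# Mic 2 hears the same impulse one sample later; compensating with
# delays [0, 1] realigns the channels so the peak adds coherently.
print(delay_and_sum([[0, 1, 0, 0], [0, 0, 1, 0]], [0, 1]))
# -> [0.0, 1.0, 0.0, 0.0]
```

Real beamformers estimate fractional delays from the array geometry and steering direction; this sketch only shows the align-then-average idea.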
[Figure: beamformer structure with fixed filters, a blocking matrix (BM), a delay element z^{-L}, and microphone inputs x_1(k) … x_N(k).]
Processed speech demo sentences: "The reasons for this dive seemed foolish now." "His captain was thin and haggard and his beautiful boots were worn and shabby." "Production may fall far below expectations."
Multimodal modeling: the outputs of an ASR model and an NLP model are combined; attention provides alignment weights between speech frames and text words, and the fused multimodal feature is used for classification.
Reference: https://www.cs.cmu.edu/~morency/MMML-Tutorial-ACL2017.pdf
Speaker embedding architectures:
- d-vector: filterbank features → fully connected (FC) layers, trained with cross-entropy (XENT) over speaker IDs; embeddings are extracted at the frame level.
- x-vector: filterbank features → TDNN layers → statistics pooling → FC layers, trained with XENT over speaker IDs; embeddings are extracted at the utterance level.
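The statistics-pooling step that turns the x-vector's variable-length frame-level features into one fixed utterance-level vector can be sketched as follows (a minimal plain-Python version; the function name is illustrative):

```python
import math

def stats_pooling(frame_embeddings):
    """Mean + standard deviation over the time axis: converts a
    variable-length sequence of frame-level TDNN outputs into one
    fixed-size utterance-level vector (2 * dim)."""
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    mean = [sum(f[d] for f in frame_embeddings) / n for d in range(dim)]
    std = [math.sqrt(sum((f[d] - mean[d]) ** 2 for f in frame_embeddings) / n)
           for d in range(dim)]
    return mean + std   # concatenated mean and std
```

Whatever the utterance length, the pooled vector has fixed size, which is what lets the subsequent FC layers produce a single utterance-level embedding.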
State-of-the-art speaker verification performance.
[Chart: EER (%) comparison for i-vector, x-vector, and GE2E models as training data grows from 10k to 3000k utterances; EER falls steadily with more data, from roughly 10.8% at the smallest scale to roughly 2.5% at the largest.]
Serving these models requires handling a huge amount of speech data and substantial computing power; the overall approach spans strategy, algorithm, and computation.
Average inference performance per node (CPU, dual-socket Xeon 4114 = 100%):

Platform                       Per-node performance   Performance per unit cost
CPU (dual-socket 4114)         100.0%                 100.0%
P4 float32                     250.0%                 317.8%
P4 int8                        450.0%                 572.0%
P4 int8 + model compression    900.0%                 1144.1%
Model Quantization
Model compression
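A minimal sketch of the kind of transformation behind the int8 numbers above: symmetric post-training quantization, where weights are scaled into the int8 range and the scale is kept for dequantization. Names and the per-tensor scaling choice are illustrative assumptions:

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: scale by max |w| / 127."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # guard all-zero tensors
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [qi * scale for qi in q]

q, scale = quantize_int8([0.5, -1.27, 0.0])
print(q)  # -> [50, -127, 0]
```

Int8 storage quarters the weight memory, and int8 matrix kernels are what the P4 int8 rows in the table above exploit; the quantization error is bounded by half a scale step per weight.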
Overall strategy:
- Graph optimization based on TF Grappler.
- High-performance TF custom ops (X86/ARM/GPU).
Op fusion:
- Simple graph fusion: common fusion strategies such as Conv+BatchNorm, Conv+BatchNorm+ReLU, and MatMul+[MatMul]+ReLU.
- Complex subgraph fusion: for example, the multi-op computation graph of the GELU activation (red box in the figure) is fused into a single GELU op.
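The Conv+BatchNorm fusion named above works by folding the BatchNorm scale and shift into the conv's weights and bias, so the pair becomes a single conv at inference time. A per-output-channel sketch with scalar weights for brevity; the function name is illustrative:

```python
import math

def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm parameters into the preceding conv's per-channel
    weight w and bias b:  w' = w * gamma / sqrt(var + eps),
    b' = (b - mean) * gamma / sqrt(var + eps) + beta."""
    out_w, out_b = [], []
    for i in range(len(w)):
        s = gamma[i] / math.sqrt(var[i] + eps)
        out_w.append(w[i] * s)
        out_b.append((b[i] - mean[i]) * s + beta[i])
    return out_w, out_b
```

Because BatchNorm at inference is an affine map per channel, composing it with the conv's affine map yields another conv, which removes one op and one memory round-trip per layer.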
Redundancy elimination:
- Add(c1, Add(x, c2)) → Add(x, c1+c2)
- Conv(c1*x, c2) → Conv(x, c1*c2)
- AddN(c1, x, c2, y) → AddN(c1+c2, x, y)
As in the rewrites above, constant computations can be merged so their results are obtained ahead of time, improving runtime performance. In ResNet-like structures, controlling where the stride is applied reduces the computation of downstream ops while keeping the computation logically equivalent.
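The first rewrite rule above can be sketched as a tiny expression-tree rewrite; the tuple representation and function name are illustrative, not Grappler's actual IR:

```python
def fold_add(expr):
    """Apply Add(c1, Add(x, c2)) -> Add(x, c1 + c2) when two of the
    operands are constants. Expressions are ('add', a, b) tuples;
    leaves are numbers (constants) or strings (variables)."""
    if isinstance(expr, tuple) and expr[0] == "add":
        _, a, b = expr
        a, b = fold_add(a), fold_add(b)
        if isinstance(a, (int, float)) and isinstance(b, tuple) and b[0] == "add":
            _, x, c2 = b
            if isinstance(c2, (int, float)):
                return ("add", x, a + c2)   # merge the two constants
        return ("add", a, b)
    return expr

print(fold_add(("add", 3, ("add", "x", 4))))  # -> ('add', 'x', 7)
```

A production optimizer applies a whole family of such rules to a fixed point, but each rule is exactly this shape: match a pattern, replace it with a cheaper equivalent graph.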
High-performance kernels: ① hand-crafted, ② auto compilation.
Hand-crafted kernels are written by hand and optimized for common sizes; auto compilation mainly targets ops with specific sizes and some IO-bound ops.
① Hand-crafted kernels: e.g., data layout transformation (NCHW → NHWC).
② Auto compilation:
1. Supports flexible deployment as JIT or as a library (LIB).
2. LLVM backends cover many hardware targets.
3. Greatly reduces hardware-specific development effort.

Loop transformations are composed as Stmt = LoopXXX(Stmt): loop interchange (permutation), loop parallelization, loop skewing, loop reversal, loop fusion, loop distribution, loop tiling, etc.
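Loop tiling, the last transformation listed, can be illustrated with a tiled matrix multiply; a plain-Python sketch in which the tile size and names are illustrative (real auto-compilers emit this structure in native code):

```python
def matmul_tiled(A, B, tile=2):
    """Loop-tiled matrix multiply: iterate over tile-sized blocks so each
    block of A and B is reused while it is still hot in cache."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, tile):              # tiled loop nest:
        for k0 in range(0, m, tile):          # outer loops walk blocks,
            for j0 in range(0, p, tile):      # inner loops walk within a block
                for i in range(i0, min(i0 + tile, n)):
                    for k in range(k0, min(k0 + tile, m)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + tile, p)):
                            C[i][j] += a * B[k][j]
    return C
```

The result is identical to the untiled triple loop; only the iteration order changes, which is why tiling is expressible as a pure Stmt → Stmt rewrite.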
Auto compilation, polyhedral approach:
1. Handles compiler optimization from another angle (affine transformations).
2. Makes transformations easier to compose.
3. No need to repeatedly generate intermediate code.
4. Fast compilation.
Transformer inference benchmark. Configuration: batch size 1, feature dim 40; X86: 2.20 GHz (1 thread); GPU: Tesla P4.
Relative inference performance (%) by input length (frames), TF v1.14:

Input length   X86 baseline   X86 optimized   GPU baseline
100            100.00%        190.14%         293.48%
500            100.00%        342.19%         1314.00%
1000           100.00%        387.64%         2391.53%
2000           100.00%        462.22%         3873.81%
3000           100.00%        552.76%         4741.38%
ASR & TTS services are now available on didiyun.com.
DELTA: a DEep learning based Language Technology platform.
Open source: https://github.com/didi/delta