[PPT] - Speech: The Next Generation Bryan Catanzaro along with PowerPoint Presentation

SLIDE 1

Speech: The Next Generation

Bryan Catanzaro along with Baidu SVAIL

SLIDE 2

Speech Recognition: interface of the future

Awni Hannun

SLIDE 3

Speech Recognition: Traditional ASR

Traditional speech systems are hard to build.

– Many specialized stages combined.

Features → Acoustic Model → HMM → Language Model → Transcription

“The quick brown fox jumps over the lazy dog.”

Adam Coates

SLIDE 4

Speech Recognition: Traditional ASR

Getting higher performance is hard
Improve each stage by engineering

[Plot: Accuracy vs. Data + Model Size; curve: Traditional ASR; annotation: Expert engineering.]

Adam Coates

SLIDE 5

Speech recognition: Traditional ASR

Huge investment in features for speech!

– Decades of work to get very small improvements

[Figure panels: Spectrogram, MFCC, Flux]

Adam Coates

SLIDE 6

Speech Recognition 2: Deep Learning!

Since 2011, deep learning for features

Acoustic Model → HMM → Language Model → Transcription

“The quick brown fox jumps over the lazy dog.”

Adam Coates

SLIDE 7

Speech Recognition 2: Deep Learning!

With more data, DL acoustic models perform better than traditional models

[Plot: Accuracy vs. Data + Model Size; curves: Traditional ASR, DL V1 for Speech]

Adam Coates

SLIDE 8

Speech Recognition 3: “Deep Speech”

End-to-end learning

“The quick brown fox jumps over the lazy dog.”

Transcription

Adam Coates

SLIDE 9

Speech Recognition 3: “Deep Speech”

End-to-end DL may work better when we have big models and lots of data

[Plot: Accuracy vs. Data + Model Size; curves: Traditional ASR, DL V1 for Speech, Deep Speech]

Adam Coates

SLIDE 10

End-to-end speech with DL

Deep neural network predicts characters directly from audio

[Figure: network emitting one character per audio frame: T H _ E … D O G]

Adam Coates
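One way to picture how per-frame character predictions become a transcript is a greedy decode: take the most likely symbol in each frame, merge consecutive repeats, and drop a blank symbol. This is a minimal sketch, not the Deep Speech decoder. It assumes that "_" in the figure marks a blank, and the tiny alphabet, the frame probabilities, and the name `greedy_decode` are made up for illustration.

```python
BLANK = "_"  # assumption: "_" is treated as the blank symbol


def greedy_decode(frame_probs, alphabet):
    """Pick the most likely symbol per frame, merge repeats, drop blanks."""
    best = [alphabet[max(range(len(p)), key=p.__getitem__)] for p in frame_probs]
    out = []
    prev = None
    for sym in best:
        # Emit a symbol only when it changes and is not the blank.
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)


# Hypothetical 4-symbol alphabet and 5 frames of made-up probabilities.
alphabet = ["_", "D", "G", "O"]
frames = [
    [0.10, 0.70, 0.10, 0.10],  # D
    [0.10, 0.60, 0.10, 0.20],  # D again (merged with the previous frame)
    [0.80, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.10, 0.70],  # O
    [0.10, 0.10, 0.70, 0.10],  # G
]
print(greedy_decode(frames, alphabet))
```

A blank between two identical symbols keeps them distinct, which is how a doubled letter can survive the merge step.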

SLIDE 11

Bidirectional Recurrent Network

RNNs model temporal dependence
Various flavors used in many applications

– Especially time series data

Sequential dependence complicates parallelism
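The sequential dependence can be seen in a toy bidirectional recurrence: each state needs the previous one, so the loop over time cannot be run in parallel within one direction. This is a minimal sketch with a single tanh unit per direction. The weights and the names `rnn_pass` and `bidirectional` are illustrative assumptions, not the network from the talk.

```python
import math


def rnn_pass(xs, w_in, w_rec, reverse=False):
    """Run a one-unit tanh recurrence over xs; each step needs the previous state."""
    order = reversed(range(len(xs))) if reverse else range(len(xs))
    h = 0.0
    states = [0.0] * len(xs)
    for t in order:
        # h on the right-hand side is the state from the step just before this one.
        h = math.tanh(w_in * xs[t] + w_rec * h)
        states[t] = h
    return states


def bidirectional(xs, fwd=(0.5, 0.9), bwd=(0.5, 0.9)):
    """Pair each time step's forward state with its backward state."""
    f = rnn_pass(xs, *fwd)
    b = rnn_pass(xs, *bwd, reverse=True)
    return list(zip(f, b))


# The forward state at the last step has seen the whole past;
# the backward state at the first step has seen the whole future.
for t, (f, b) in enumerate(bidirectional([1.0, 0.0, 0.0])):
    print(t, round(f, 4), round(b, 4))
```

The two directions are independent of each other, so they can run side by side even though each one is sequential in time.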

SLIDE 12

Connectionist Temporal Classification

How to connect speech data with transcription?

– Transcription not labeled per millisecond

Use CTC, from [Graves 06]
Efficient dynamic programming of all possible alignments to compute error of {audio, transcription}

[Figure: which audio frames align with T H _ E … D O G?]
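The dynamic programming idea can be sketched as the CTC forward recursion: interleave blanks with the labels, then sum over every alignment frame by frame instead of enumerating them. This is a minimal sketch in plain probabilities (a real implementation would work in log space for stability). The function name, the blank index 0, and the example probabilities are assumptions for illustration.

```python
import math


def ctc_probability(frame_probs, labels, blank=0):
    """P(labels | audio), summed over all alignments, via the CTC forward recursion."""
    # Extended label sequence: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    S, T = len(ext), len(frame_probs)

    # alpha[s]: total probability of all partial alignments ending at ext[s].
    alpha = [0.0] * S
    alpha[0] = frame_probs[0][blank]
    if S > 1:
        alpha[1] = frame_probs[0][ext[1]]

    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]                 # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]        # advance by one position
            # Skipping the blank between two labels is allowed only if they differ.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * frame_probs[t][ext[s]]
        alpha = new

    # Valid alignments end on the last label or on the trailing blank.
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)


# Two frames, alphabet {0: blank, 1: "a"}, target transcription "a".
probs = [[0.4, 0.6], [0.3, 0.7]]
p = ctc_probability(probs, [1])
print("loss:", -math.log(p))
```

The cost is proportional to (number of frames) × (length of the extended label sequence), which is what makes summing over all alignments practical.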

SLIDE 13

Speech Recognition 3: “Deep Speech”

To make this work, we need

– bigger datasets
– bigger models

[Plot: Accuracy vs. Data + Model Size; curves: Traditional ASR, DL V1 for Speech, Deep Speech]

SLIDE 14

More labeled speech

Speech transcription is expensive (so use AMTurk!)

[Bar chart: hours of labeled speech (axis 0 to 8000) for WSJ, Switchboard, Fisher, Deep Speech]

Adam Coates

SLIDE 15

More labeled speech

Need lots of data for “noisy” environments.

– Want system to give correct character outputs even when input is noisy!
– Solution: synthesize “noisy” recordings by combining audio clips.

Adam Coates
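Combining audio clips can be sketched as adding a noise clip, sample by sample, onto a clean utterance while keeping the clean transcript as the label. This is a minimal sketch on plain lists of samples in [-1, 1]. The gain, the random offset, the looping of the noise clip, and the clipping step are illustrative assumptions, not the procedure used for Deep Speech.

```python
import random


def mix(speech, noise, noise_gain=0.5, rng=None):
    """Overlay a randomly offset, looped noise clip on a speech clip."""
    rng = rng or random.Random(0)
    start = rng.randrange(len(noise))  # random starting point in the noise clip
    mixed = []
    for i, s in enumerate(speech):
        n = noise[(start + i) % len(noise)]  # loop the noise if it is shorter
        # Clip to the valid sample range after adding the scaled noise.
        mixed.append(max(-1.0, min(1.0, s + noise_gain * n)))
    return mixed


# Made-up samples; the transcript of the clean clip is reused unchanged.
clean = [0.0, 0.5, -0.5, 0.25, 0.0, -0.25]
noise = [0.2, -0.2, 0.1]
example = {"audio": mix(clean, noise), "transcript": "the quick brown fox"}
print(example)
```

Because the label does not change, each clean recording can yield many noisy training examples by varying the noise clip, offset, and gain.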

SLIDE 16

Dataset synthesis

[Bar chart: hours of data (axis 0 to 120000) for WSJ, Switchboard, Fisher, Deep Speech; bar labels: 300, 2000, >100,000; annotation: Synthesized data]

Adam Coates

SLIDE 17

Training Parallelization

2x model parallelism
4x data parallelism (synchronous implementation)
Training model on 8 K40s total
Takes about 4 days to train our model
5 billion connections

[Figure: network split across GPU 0 and GPU 1]
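Synchronous data parallelism can be sketched on a toy problem: each replica computes a gradient on its own shard, the gradients are averaged, and every replica applies the same update. This is a minimal single-process sketch. The one-parameter model, the learning rate, and the names `grad` and `sync_step` are assumptions for illustration, and the averaging stands in for an all-reduce across GPUs. Model parallelism (splitting one network across two GPUs) is not shown.

```python
MODEL_PARALLEL = 2  # each model replica spans 2 GPUs
DATA_PARALLEL = 4   # 4 replicas train on different shards
TOTAL_GPUS = MODEL_PARALLEL * DATA_PARALLEL  # 8


def grad(w, shard):
    """Gradient of mean squared error for the toy model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def sync_step(w, shards, lr=1.0):
    """One synchronous step: per-replica gradients, averaged, one shared update."""
    grads = [grad(w, s) for s in shards]  # computed independently per replica
    avg = sum(grads) / len(grads)         # stands in for an all-reduce
    return w - lr * avg


# Toy data generated from y = 3x, split into 4 equal shards.
data = [(i / 8, 3 * i / 8) for i in range(1, 9)]
shards = [data[i::DATA_PARALLEL] for i in range(DATA_PARALLEL)]

w = 0.0
for _ in range(20):
    w = sync_step(w, shards)
print(TOTAL_GPUS, w)
```

With equal-sized shards, the averaged gradient equals the full-batch gradient, so the synchronous update matches what a single worker would compute on all the data.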

SLIDE 18

Systems Infrastructure

Small clusters + MPI + CUDA

– Strong scaling most important

InfiniBand

– Latency matters

GPUs

– Currently training with Tesla K40 and GTX980

SLIDE 19

Results on Hub5’00

Widely used dataset. Conversational, little noise.

Adam Coates

SLIDE 20

Results, continued

Our goal: improve in noisy environments. How’d we do?

– Construct a new dataset of ~200 recordings in both clean and noisy settings.

[Bar chart: Word Error Rate (%) (axis 0 to 50) on Clean, Noisy, and Combined sets for Apple Dictation, Bing Speech, Google API, wit.ai, Deep Speech]

Adam Coates
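Word error rate is conventionally the word-level edit distance (substitutions, insertions, deletions) between the reference and the recognizer output, divided by the number of reference words. This is a minimal sketch of that standard definition, assuming whitespace tokenization and a non-empty reference. It is not the scoring tool used for the numbers in the chart.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j]: edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion (reference word missing)
                cur[j - 1] + 1,          # insertion (extra hypothesis word)
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return 100.0 * prev[-1] / len(ref)


print(word_error_rate("the quick brown fox", "the quick brown box"))
```

Because insertions count as errors, the rate can exceed 100% when the hypothesis is much longer than the reference.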

SLIDE 21

Conclusion

End-to-end deep learning works for speech recognition

We are pushing boundaries of multi-GPU training for speech networks

– Always looking for great GPU hackers to help make progress in AI! bcatanzaro@baidu.com