GPU-Accelerated Large Vocabulary Continuous Speech Recognition for Scalable Distributed Speech Recognition
Jungsuk Kim, Ian Lane
Electrical and Computer Engineering, Carnegie Mellon University
March 20, 2015 @GTC2015

Overview
• Introduction
• Background
  • Weighted Finite State Transducers in Speech Recognition
• Proposed Approach
  • GPU-Accelerated scalable DSR
• Evaluation
• Conclusion

Introduction
• Voice interfaces are a core technology for user interaction
  • Mobile devices, Smart TVs, In-Vehicle Systems, …
• For a captivating user experience, a voice UI must be:
  • Robust
    • Acoustic robustness → Large Acoustic Models
    • Linguistic robustness → Large Vocabulary Recognition
  • Responsive
    • Low latency → Faster than real-time search
  • Adaptive
    • User and Task adaptation

Introduction
• Large models are critical for accurate speech recognition:
  • Large acoustic models → Tens of millions of parameters
  • Large vocabulary → Millions of words
  • Large language model → Billions of n-gram entries (>= 20GB)
• Examples include:
  • Acoustic modeling for telephony [Mass 2014] or YouTube [Bacchiani 2014]
    • ~200M parameter Deep Neural Networks
  • Language model rescoring for Voice Search [Schalkwyk 2010]
    • 1.2M vocabulary, 5-gram LM, 12.7B n-gram entries

Introduction
  Speech recognition contains many highly parallel tasks
+ Graphics Processing Units (SIMT, ~3000 cores, <24GB), optimized for parallel computing
= ASR engine designed specifically for GPUs
  • Large Models
  • More Accurate

Introduction
• 1 Million Vocabulary (3-gram)
• 30 Million parameter Deep Neural Network

                 Titan X    Tesla K40    Tegra K1    Tegra X1
Architecture     Maxwell    Kepler       Kepler      Maxwell
Cores            3072       2880         192         256
RTF              0.02       0.01         0.17        0.14
xRT              50X        100X         6X          7X
1 hour           72s        36s          612s        504s
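The three result rows in this table are tied together by simple arithmetic. Reading RTF as the real-time factor (processing time divided by audio duration), xRT is its reciprocal and the "1 hour" row is RTF times 3600 seconds. The sketch below only re-derives the slide's numbers; the function names are illustrative and not taken from the presentation.

```python
# RTF values as listed on the slide.
RTF = {"Titan X": 0.02, "Tesla K40": 0.01, "Tegra K1": 0.17, "Tegra X1": 0.14}


def speedup(rtf):
    """Speed relative to real time (the xRT row): 1 / RTF."""
    return 1.0 / rtf


def decode_seconds(rtf, audio_seconds=3600.0):
    """Time needed to decode audio_seconds of audio at the given RTF."""
    return rtf * audio_seconds


for gpu, rtf in RTF.items():
    print(f"{gpu:10s} xRT ~{speedup(rtf):.0f}X   1 hour -> {decode_seconds(rtf):.0f}s")
```

The slide rounds xRT to whole numbers, so 1 / 0.17 appears as 6X and 1 / 0.14 as 7X.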

Background
Weighted Finite State Transducers (WFSTs) in Speech Recognition

WFST in Speech Recognition
[Figure: WFST search network with states 0 to 17 (final state 17/0). Arcs are labeled with phones (r, eh, k, ax, g, n, ay, z, s, p, iy, ch, a, b) or ε; six arcs also carry an output word and weight: k:RECOGNIZE/w[RECOGNIZE], s:SPEECH/w[SPEECH], k:WRECK/w[WRECK], a:A/w[A], n:NICE/w[NICE], b:BEACH/w[BEACH].]
• “Recognize speech” vs. “Wreck a nice beach” ...
• Search is performed in 3 phases.
  • Phase 0: Active Set Preparation.
  • Phase 1: Acoustic Score Computation.
  • Phase 2: WFST Search.

WFST in Speech Recognition
[Figure: WFST search network]
• Phase 0: Active Set Preparation
  • Collect active hypotheses from previous frame.

WFST in Speech Recognition
[Figure: WFST search network]
• Phase 1: Acoustic Score Computation
  • Compute acoustic similarity between the given speech and the phonetic models using a Deep Neural Network.
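A minimal sketch of what Phase 1 computes for one frame, assuming a small feed-forward network with sigmoid hidden layers and a softmax output over phonetic classes. The layer sizes, weights, and the choice to return negative log posteriors as costs are all illustrative assumptions; the slides do not specify the network beyond "Deep Neural Network". A GPU implementation would do this as batched matrix multiplies rather than Python loops.

```python
import math


def affine(weights, bias, x):
    """Compute W x + b for a list-of-rows weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def log_softmax(v):
    """Numerically stable log-softmax."""
    m = max(v)
    log_z = m + math.log(sum(math.exp(a - m) for a in v))
    return [a - log_z for a in v]


def acoustic_costs(layers, features):
    """Forward-propagate one feature vector; return one cost per phonetic class.

    Costs are negative log posteriors, so lower means more likely.
    """
    h = features
    for weights, bias in layers[:-1]:
        h = sigmoid(affine(weights, bias, h))
    weights, bias = layers[-1]
    return [-lp for lp in log_softmax(affine(weights, bias, h))]


# Hypothetical toy network: 2 inputs -> 2 hidden units -> 3 phonetic classes.
TOY_LAYERS = [
    ([[0.5, -0.2], [0.1, 0.8]], [0.0, 0.1]),
    ([[1.0, -1.0], [0.3, 0.3], [-0.5, 0.9]], [0.0, 0.0, 0.0]),
]
costs = acoustic_costs(TOY_LAYERS, [1.0, 2.0])
print(costs)
```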

WFST in Speech Recognition
[Figure: WFST search network]
• Phase 2: WFST Search
  • Perform frame-synchronous Viterbi beam search on the WFST network.
  • If multiple transitions have the same next state s, then the most likely (minimum score) hypothesis is retained (e.g. states 12, 14, 15, …).
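The recombination rule in Phase 2 can be sketched for a single frame as follows. This is a simplified illustration, not the presenters' GPU kernel: it ignores ε arcs and output labels, and the graph layout (a dict from state to outgoing arcs), the `beam` parameter, and all numbers are assumptions made for the example.

```python
def search_frame(active, arcs, frame_costs, beam):
    """Expand all active hypotheses by one frame.

    active:      dict state -> accumulated cost (lower is better)
    arcs:        dict state -> list of (next_state, phone_id, arc_weight)
    frame_costs: acoustic cost per phone_id for the current frame
    beam:        hypotheses worse than best + beam are pruned
    """
    new_active = {}
    for state, cost in active.items():
        for next_state, phone, weight in arcs.get(state, []):
            candidate = cost + weight + frame_costs[phone]
            # Recombination: keep only the minimum-cost hypothesis per state.
            if candidate < new_active.get(next_state, float("inf")):
                new_active[next_state] = candidate
    if not new_active:
        return new_active
    best = min(new_active.values())
    return {s: c for s, c in new_active.items() if c <= best + beam}


# Hypothetical toy data: states 0 and 1 both have an arc into state 2.
arcs = {0: [(2, 0, 0.5), (3, 1, 0.0)], 1: [(2, 1, 0.1)]}
active = {0: 1.0, 1: 1.2}
frame_costs = [0.3, 0.9]

wide = search_frame(active, arcs, frame_costs, beam=1.0)
narrow = search_frame(active, arcs, frame_costs, beam=0.05)
print(wide, narrow)
```

With the wide beam both target states survive, and state 2 keeps the cheaper of its two incoming hypotheses. With the narrow beam only the best state remains.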

WFST in Speech Recognition
[Figure: WFST search network, repeated on each slide as the search advances frame by frame]
• Iterate these 3 phases until input audio ends.
  • Phase 0: Active Set Preparation
  • Phase 1: Acoustic Score Computation
  • Phase 2: WFST Search

WFST in Speech Recognition
[Figure: WFST search network]
• Recognized result is the output symbol sequence over the best path.
• Result: “RECOGNIZE SPEECH”
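One common way to read the result off the best path is to keep a back-pointer on every hypothesis and walk back from the cheapest final one, collecting the output words on the way. The sketch below assumes that representation; the `Token` type and all costs are hypothetical and only mirror the two competing readings from the example network.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    cost: float               # accumulated path cost (lower is better)
    prev: Optional["Token"]   # back-pointer to the previous token
    output: Optional[str]     # word emitted on the arc into this token, if any


def best_path_words(final_tokens):
    """Follow back-pointers from the cheapest final token and collect words."""
    token = min(final_tokens, key=lambda t: t.cost)
    words = []
    while token is not None:
        if token.output is not None:
            words.append(token.output)
        token = token.prev
    return words[::-1]


# Two hypothetical complete paths with made-up costs.
start = Token(0.0, None, None)
path_a = Token(2.0, Token(1.0, start, "RECOGNIZE"), "SPEECH")
path_b = Token(2.5, Token(2.0, Token(1.5, Token(1.0, start, "WRECK"), "A"), "NICE"), "BEACH")

words = best_path_words([path_a, path_b])
print(" ".join(words))
```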

Proposed Approach
GPU-Accelerated Scalable DSR

Distributed Speech Recognition (DSR)
[Figure: processing stages Iteration Control, Feature Extraction, Acoustic Score Computation, Graph Search, Post Processing; other labels: Audio Stream, Update hyp.]
(0) Iteration control, data preparation, result handling.
(1) Extract features from active audio-streams into stacked feature vector.
(2) Stack incoming frames from active audio-streams and compute likelihoods.
(3) Conduct Viterbi beam search over WFST and conduct on-the-fly rescoring.
(4) Send result back over TCP/IP, data collection.
• Iteration control
  • Allocate or deallocate data structures.
  • Terminate decoding task.
• Feature extraction
  • Receive audio and extract feature for current iteration (batch).
  • Speaker dependent adaptation.
• Acoustic score computation
  • Deep Neural Network (Forward Propagation).
• Graph search
  • Conduct frame synchronous WFST search.
  • End-of-utterance detection.
• Post processing
  • Output (Lattice) processing.
  • Sending result back to client.
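Step (2), stacking frames from all active audio streams so that likelihoods can be computed in one batch, might look roughly like the sketch below. The data layout (a dict of per-stream pending frames) and both function names are assumptions made for illustration; the slides do not show the actual data structures.

```python
def stack_frames(streams):
    """Take one pending feature frame from every stream that has one.

    streams: dict stream_id -> list of pending feature vectors (oldest first)
    Returns (ids, batch), where batch[i] belongs to stream ids[i].
    """
    ids, batch = [], []
    for stream_id, pending in streams.items():
        if pending:
            ids.append(stream_id)
            batch.append(pending.pop(0))
    return ids, batch


def scatter_scores(ids, scores):
    """Route one row of batch output back to the stream it came from."""
    return dict(zip(ids, scores))


# Hypothetical state: stream "b" has no audio pending in this iteration.
streams = {"a": [[1.0], [2.0]], "b": [], "c": [[3.0]]}
ids, batch = stack_frames(streams)
routed = scatter_scores(ids, [0.1, 0.2])  # placeholder per-stream scores
print(ids, batch, routed)
```

Batching this way lets a single forward pass serve several streams, and idle streams simply contribute no row.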

Producer/Consumer design pattern
• Master/Slave pattern.
• Decouple processes that produce and consume data at different rates.
• Advantages:
  • Enhanced data sharing.
  • Processes can run at different speeds.
  • Buffered communication between processes.
[Figure: Producer-Consumer multi-threaded model]
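A minimal sketch of the producer/consumer pattern using Python's standard `queue` and `threading` modules. The stage names and string payloads are placeholders, not the presenters' implementation; the bounded queue plays the role of the buffer that lets the two stages run at different speeds.

```python
import queue
import threading

SENTINEL = None  # marks the end of the stream


def producer(out_q, n_batches):
    """Stand-in for an upstream stage, e.g. feature extraction."""
    for i in range(n_batches):
        out_q.put(f"features-{i}")  # blocks while the buffer is full
    out_q.put(SENTINEL)


def consumer(in_q, results):
    """Stand-in for a downstream stage, e.g. acoustic score computation."""
    while True:
        item = in_q.get()  # blocks while the buffer is empty
        if item is SENTINEL:
            break
        results.append(item.replace("features", "scores"))


def run_pipeline(n_batches=5, buffer_size=2):
    q = queue.Queue(maxsize=buffer_size)
    results = []
    threads = [
        threading.Thread(target=producer, args=(q, n_batches)),
        threading.Thread(target=consumer, args=(q, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_pipeline(3))
```

A small `buffer_size` applies back-pressure: a fast producer waits once the queue is full instead of accumulating unbounded work.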
