SLIDE 1

OPPORTUNITIES AND CHALLENGES OF PARALLELIZING SPEECH RECOGNITION

Jike Chong, Gerald Friedland, Adam Janin, Nelson Morgan, Chris Oei

SLIDE 2

OUTLINE

  • Motivation
  • Improving Accuracy
  • Improving Throughput
  • Improving Latency

SLIDE 3

Meeting Diarist Application (“ParLab All” demo)

SLIDE 4

MEETING DIARIST

[Diagram: the Meeting Diarist pipeline. The audio signal feeds all components.]

  • Speaker Diarization: "who spoke when"
  • Speech Recognition: "what was said"
  • Speaker Attribution: "who said what"
  • Relevant Web Scraping: "what's relevant to this"
  • Summarization: "what are the main points"
  • Indexing, Search, Retrieval, Question Answering: higher-level analysis

SLIDE 5

MOTIVATION

  • Speech technology has a long history of using up all available compute resources.
  • Many previous attempts used specialized hardware, with mixed results.

SLIDE 6

1: IMPROVING ACCURACY

  • Speech technology works well when:
  • Large amounts of training data match the application data
  • Small vocabulary; simple grammar
  • Quiet environment
  • Head-worn microphones
  • “Prepared” speech
  • Each departure from these conditions adds roughly 10% to the error rate!

SLIDE 7

FEATURES

  • Most state-of-the-art features are loosely based on perceptual models of the cochlea, with a few dozen features.
  • Combining multiple representations almost always improves accuracy, especially in noise.
  • Typical systems combine 2-4 representations.

What if we used a LOT more?

SLIDE 8

MANYSTREAM

  • Based on cortical models
  • Large number of filters

SLIDE 9

MANYSTREAM

  • Each filter feeds an MLP.
  • The current combination method uses entropy-weighted MLPs, but there are many other possibilities.
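The slide names entropy-weighted combination without giving a formula; below is a minimal sketch of one common variant, inverse-entropy weighting of per-stream MLP posteriors (the function name and the exact weighting are assumptions, not the authors' method). Streams whose posteriors are low-entropy (confident) get proportionally more weight at each frame.

```python
import math

def inverse_entropy_combine(streams, eps=1e-10):
    """Frame-by-frame combination of posterior streams.

    streams: list of per-stream posteriors, each a list of frames,
    each frame a list of class probabilities.  Low-entropy
    (confident) streams receive proportionally more weight.
    """
    n_frames = len(streams[0])
    n_classes = len(streams[0][0])
    combined = []
    for t in range(n_frames):
        frames = [s[t] for s in streams]
        # Inverse entropy as a per-stream confidence weight.
        weights = [1.0 / (-sum(p * math.log(p + eps) for p in f) + eps)
                   for f in frames]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the streams' posterior distributions.
        row = [sum(w * f[c] for w, f in zip(weights, frames))
               for c in range(n_classes)]
        combined.append(row)
    return combined
```

With a confident stream and a uniform (uninformative) stream, the confident stream dominates the combined posterior.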

SLIDE 10

MANYSTREAM

It helps!

  • 47% relative improvement over baseline on noisy “numbers” using a 28-stream system.
  • 13.3% relative improvement over baseline on Mandarin Broadcast News using a preliminary 4-stream system.

SLIDE 11

MANYSTREAM

  • Next steps:
  • Fully parallel implementation
  • Many more streams
  • Other combination methods

SLIDE 12

2: IMPROVING THROUGHPUT

  • Serial state-of-the-art systems can take 100 hours to process one hour of a meeting.
  • Analysis over all available audio is generally more accurate than on-line systems.
  • Batch processing per utterance is “embarrassingly” parallel.
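The "embarrassingly parallel" per-utterance dispatch can be sketched as below; `recognize` is a placeholder standing in for a full ASR decode of one utterance, and a thread pool stands in for the process pool or compute cluster a real batch system would use.

```python
from concurrent.futures import ThreadPoolExecutor

def recognize(utterance):
    # Placeholder for a full ASR pass over one utterance; a real
    # system would run feature extraction and decoding here.
    return utterance.upper()

def transcribe_meeting(utterances, workers=4):
    """Dispatch independent utterances to a worker pool.

    Utterances share no state, so per-utterance batch decoding
    parallelizes trivially; Executor.map preserves input order,
    so results line up with the original utterance list.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(recognize, utterances))
```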

SLIDE 13

SPEECH RECOGNITION PIPELINE

SLIDE 14

[Diagram: a WFST recognition network composed from an HMM acoustic phone model, a pronunciation model, and a bigram language model. Pronunciations such as “hh aa p” → HOP, “aa n” → ON, and “p aa p” → POP are compiled together with the vocabulary (HOP, ON, POP, CAT, HAT, IN, THE, ...) into a single network.]

[Diagram: the Gaussian mixture model for one phone state. Features from one frame are scored by computing the distance to each mixture component, then taking the weighted sum over all components.]
INFERENCE ENGINE

SLIDE 15

INFERENCE ENGINE

  • At each time step, compute the likelihood for each outgoing arc using the acoustic model.
  • For each incoming arc, track all hypotheses.
  • Regularize data structures to allow an efficient implementation.
  • The entire inference step runs on the GPU.
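A much-simplified sketch of one inference time step, assuming a WFST represented as arc tuples and per-label acoustic log-likelihoods (all names and the representation are illustrative). A real engine would also handle epsilon arcs, apply beam pruning, and keep multiple hypotheses per incoming arc for lattice output; here only the single best score per destination state survives.

```python
def propagate_frame(active, arcs, acoustic_ll):
    """One Viterbi-style time step over a WFST.

    active: {state: best log score so far}
    arcs: iterable of (src, dst, label, weight) tuples
    acoustic_ll: {label: acoustic log-likelihood for this frame}
    Returns the new {state: best score}, taking the max over all
    incoming arcs at each destination state.
    """
    new_active = {}
    for src, dst, label, weight in arcs:
        if src not in active:
            continue  # source state holds no live hypothesis
        score = active[src] + weight + acoustic_ll.get(label, float("-inf"))
        if score > new_active.get(dst, float("-inf")):
            new_active[dst] = score
    return new_active
```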


SLIDE 16

INFERENCE ENGINE

  • 11x speed-up over the serial implementation.
  • 18x speed-up for the compute-intensive phase.
  • 4x speed-up for the communication-intensive phase.
  • Flexible architecture:
  • An audio/visual plugin was added by a domain expert.

SLIDE 17

INFERENCE ENGINE

  • Next steps:
  • Generate lattices and/or N-best lists.
  • Explore other parallel architectures.
  • Distribute to clusters.
  • Explore accuracy/speed trade-offs.

SLIDE 18

3: IMPROVING LATENCY

  • For batch, latency = length of audio + time to process.
  • On-line applications require control of latency.
  • Parallelization allows lower latency and potentially better accuracy.

SLIDE 19

SPEAKER DIARIZATION

[Diagram: an audio track split into speaker turns (segmentation), with the turns grouped by speaker (clustering).]

SLIDE 20

OFFLINE SPEAKER DIARIZATION

[Diagram: the offline agglomerative diarization loop. Initialization creates an over-segmented set of clusters; the loop then alternates (re-)training and (re-)alignment of the cluster models, asks “merge two clusters?”, merges if yes, and ends when no merge remains.]
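The loop above can be sketched as a greedy agglomerative merge. The merge test here is a BIC-style criterion, a standard diarization choice that the slide does not actually name; one-dimensional Gaussians stand in for the real per-cluster GMMs, and the (re-)alignment step is omitted.

```python
import math

def gauss_loglik(xs):
    """Log-likelihood of samples under their own ML Gaussian (1-D)."""
    n = len(xs)
    mu = sum(xs) / n
    var = max(sum((x - mu) ** 2 for x in xs) / n, 1e-6)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)

def delta_bic(a, b, penalty=2.0):
    """BIC-style gain from merging clusters a and b (positive => merge)."""
    gain = gauss_loglik(a + b) - gauss_loglik(a) - gauss_loglik(b)
    return gain + 0.5 * penalty * math.log(len(a) + len(b))

def diarize(clusters):
    """Greedy agglomerative loop: repeatedly merge the best-scoring
    pair of clusters while a merge still improves the criterion."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = delta_bic(clusters[i], clusters[j])
                if best is None or d > best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d <= 0:
            break  # "merge two clusters?" answered No -> end
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Two clusters of nearby samples merge; a well-separated cluster stays distinct.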

SLIDE 21

ONLINE SPEAKER DIARIZATION

  • Precompute models for each speaker:
  • Run offline diarization on the start of a meeting.
  • Train models on the first 60 seconds from each resulting speaker.
  • Another approach: stored models per speaker.
  • Every 2.5 seconds, compute scores for each speaker model and output the highest.
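The 2.5-second decision loop can be sketched as follows; the `score` function is a stand-in for a per-speaker GMM log-likelihood, and all names here are illustrative rather than the system's actual API.

```python
def online_diarize(chunks, speaker_models, score):
    """Online decision loop: score each 2.5-second audio chunk
    against every precomputed speaker model and emit the best
    speaker as the answer to "who is speaking now".

    chunks: iterable of audio chunks (already buffered).
    speaker_models: {speaker_name: model}.
    score(model, chunk): higher means a better match.
    """
    labels = []
    for chunk in chunks:
        best = max(speaker_models,
                   key=lambda spk: score(speaker_models[spk], chunk))
        labels.append(best)
    return labels
```

With toy "models" that are just per-speaker means and a distance-based score, the loop labels each chunk with its nearest speaker.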

SLIDE 22

HYBRID ONLINE/OFFLINE DIARIZATION

[Diagram: the hybrid online/offline architecture. The audio signal feeds a 2.5-second buffer in the online subsystem, which emits online decisions (“who is speaking now”), and a history buffer in the offline subsystem, whose diarization engine (segmentation and clustering) plus MAP training maintain the speaker mapping used by the online decisions.]

SLIDE 23

HYBRID ONLINE/OFFLINE DIARIZATION

[Plot: online diarization DER per core. Error % (roughly 32-40%) versus the number of cores dedicated to the offline subsystem (1-8), including a 7-core + GPU configuration.]

SLIDE 24

DIARIZATION

  • Next steps:
  • CPU/GPU hybrid system
  • Implement serial optimizations in parallel version
  • Integrate with manystream approach

SLIDE 25

CONCLUSION

  • Speech technology can use all resources that are available.
  • Parallelism enables improvements in several areas:
  • Accuracy
  • Throughput
  • Latency
  • Programming parallel systems continues to be challenging.
