SLIDE 1

OPPORTUNITIES AND CHALLENGES OF PARALLELIZING SPEECH RECOGNITION

Jike Chong, Gerald Friedland, Adam Janin, Nelson Morgan, Chris Oei

SLIDE 2

OUTLINE

  • Motivation
  • Improving Accuracy
  • Improving Throughput
  • Improving Latency

SLIDE 3

Meeting Diarist Application (“ParLab All” demo)

SLIDE 4

MEETING DIARIST

[Diagram: the Meeting Diarist pipeline. The audio signal feeds all components.]

  • Speaker Diarization: "who spoke when"
  • Speech Recognition: "what was said"
  • Speaker Attribution: "who said what"
  • Relevant Web Scraping: "what's relevant to this"
  • Summarization: "what are the main points"
  • Indexing, Search, Retrieval, Question Answering: higher-level analysis

SLIDE 5

MOTIVATION

  • Speech technology has a long history of using up all available compute resources.
  • Many previous attempts used specialized hardware, with mixed results.

SLIDE 6

1: IMPROVING ACCURACY

  • Speech technology works well when:
  • Large amounts of training data match the application data
  • Small vocabulary; simple grammar
  • Quiet environment
  • Head-worn microphones
  • “Prepared” speech
  • Each departure from these conditions adds roughly 10% to the error rate!

SLIDE 7

FEATURES

  • Most state-of-the-art features are loosely based on perceptual models of the cochlea, with a few dozen features.
  • Combining multiple representations almost always improves accuracy, especially in noise.
  • Typical systems combine 2-4 representations.

What if we used a LOT more?

SLIDE 8

MANYSTREAM

  • Based on cortical models
  • Large number of filters

SLIDE 9

MANYSTREAM

  • Each filter feeds an MLP.
  • The current combination method uses entropy-weighted MLPs, but there are many other possibilities.
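The slide names entropy-weighted combination without giving a formula; below is a minimal sketch of one common variant, inverse-entropy weighting of per-stream MLP posteriors (the function name and the exact weighting are assumptions, not the authors' method). Streams whose posteriors are low-entropy (confident) get proportionally more weight at each frame.

```python
import math

def inverse_entropy_combine(streams, eps=1e-10):
    """Frame-by-frame combination of posterior streams.

    streams: list of per-stream posteriors, each a list of frames,
    each frame a list of class probabilities.  Low-entropy
    (confident) streams receive proportionally more weight.
    """
    n_frames = len(streams[0])
    n_classes = len(streams[0][0])
    combined = []
    for t in range(n_frames):
        frames = [s[t] for s in streams]
        # Inverse entropy as a per-stream confidence weight.
        weights = [1.0 / (-sum(p * math.log(p + eps) for p in f) + eps)
                   for f in frames]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the streams' posterior distributions.
        row = [sum(w * f[c] for w, f in zip(weights, frames))
               for c in range(n_classes)]
        combined.append(row)
    return combined
```

With a confident stream and a uniform (uninformative) stream, the confident stream dominates the combined posterior.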

SLIDE 10

MANYSTREAM

It helps!

  • 47% relative improvement over baseline on noisy “numbers” using a 28-stream system.
  • 13.3% relative improvement over baseline on Mandarin Broadcast News using a preliminary 4-stream system.

SLIDE 11

MANYSTREAM

  • Next steps:
  • Fully parallel implementation
  • Many more streams
  • Other combination methods

SLIDE 12

2: IMPROVING THROUGHPUT

  • Serial state-of-the-art systems can take 100 hours to process one hour of a meeting.
  • Analysis over all available audio is generally more accurate than on-line systems.
  • Batch processing per utterance is “embarrassingly” parallel.
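The "embarrassingly parallel" per-utterance dispatch can be sketched as below; `recognize` is a placeholder standing in for a full ASR decode of one utterance, and a thread pool stands in for the process pool or compute cluster a real batch system would use.

```python
from concurrent.futures import ThreadPoolExecutor

def recognize(utterance):
    # Placeholder for a full ASR pass over one utterance; a real
    # system would run feature extraction and decoding here.
    return utterance.upper()

def transcribe_meeting(utterances, workers=4):
    """Dispatch independent utterances to a worker pool.

    Utterances share no state, so per-utterance batch decoding
    parallelizes trivially; Executor.map preserves input order,
    so results line up with the original utterance list.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(recognize, utterances))
```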

SLIDE 13

SPEECH RECOGNITION PIPELINE

SLIDE 14

[Diagram: a WFST recognition network composed from an HMM acoustic phone model, a pronunciation model, and a bigram language model. Pronunciations such as “hh aa p” → HOP, “aa n” → ON, and “p aa p” → POP are compiled together with the vocabulary (HOP, ON, POP, CAT, HAT, IN, THE, ...) into a single network.]

[Diagram: the Gaussian mixture model for one phone state. Features from one frame are scored by computing the distance to each mixture component, then taking the weighted sum over all components.]
INFERENCE ENGINE

SLIDE 15

INFERENCE ENGINE

  • At each time step, compute the likelihood for each outgoing arc using the acoustic model.
  • For each incoming arc, track all hypotheses.
  • Regularize data structures to allow an efficient implementation.
  • The entire inference step runs on the GPU.
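A much-simplified sketch of one inference time step, assuming a WFST represented as arc tuples and per-label acoustic log-likelihoods (all names and the representation are illustrative). A real engine would also handle epsilon arcs, apply beam pruning, and keep multiple hypotheses per incoming arc for lattice output; here only the single best score per destination state survives.

```python
def propagate_frame(active, arcs, acoustic_ll):
    """One Viterbi-style time step over a WFST.

    active: {state: best log score so far}
    arcs: iterable of (src, dst, label, weight) tuples
    acoustic_ll: {label: acoustic log-likelihood for this frame}
    Returns the new {state: best score}, taking the max over all
    incoming arcs at each destination state.
    """
    new_active = {}
    for src, dst, label, weight in arcs:
        if src not in active:
            continue  # source state holds no live hypothesis
        score = active[src] + weight + acoustic_ll.get(label, float("-inf"))
        if score > new_active.get(dst, float("-inf")):
            new_active[dst] = score
    return new_active
```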


SLIDE 16

INFERENCE ENGINE

  • 11x speed-up over the serial implementation.
  • 18x speed-up for the compute-intensive phase.
  • 4x speed-up for the communication-intensive phase.
  • Flexible architecture:
  • An audio/visual plugin was added by a domain expert.

SLIDE 17

INFERENCE ENGINE

  • Next steps:
  • Generate lattices and/or N-best lists.
  • Explore other parallel architectures.
  • Distribute to clusters.
  • Explore accuracy/speed trade-offs.

SLIDE 18

3: IMPROVING LATENCY

  • For batch, latency = length of audio + time to process.
  • On-line applications require control of latency.
  • Parallelization allows lower latency and potentially better accuracy.

SLIDE 19

SPEAKER DIARIZATION

[Diagram: an audio track split into speaker turns (segmentation), with the turns grouped by speaker (clustering).]

SLIDE 20

OFFLINE SPEAKER DIARIZATION

[Diagram: the offline agglomerative diarization loop. Initialization creates an over-segmented set of clusters; the loop then alternates (re-)training and (re-)alignment of the cluster models, asks “merge two clusters?”, merges if yes, and ends when no merge remains.]
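The loop above can be sketched as a greedy agglomerative merge. The merge test here is a BIC-style criterion, a standard diarization choice that the slide does not actually name; one-dimensional Gaussians stand in for the real per-cluster GMMs, and the (re-)alignment step is omitted.

```python
import math

def gauss_loglik(xs):
    """Log-likelihood of samples under their own ML Gaussian (1-D)."""
    n = len(xs)
    mu = sum(xs) / n
    var = max(sum((x - mu) ** 2 for x in xs) / n, 1e-6)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)

def delta_bic(a, b, penalty=2.0):
    """BIC-style gain from merging clusters a and b (positive => merge)."""
    gain = gauss_loglik(a + b) - gauss_loglik(a) - gauss_loglik(b)
    return gain + 0.5 * penalty * math.log(len(a) + len(b))

def diarize(clusters):
    """Greedy agglomerative loop: repeatedly merge the best-scoring
    pair of clusters while a merge still improves the criterion."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = delta_bic(clusters[i], clusters[j])
                if best is None or d > best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d <= 0:
            break  # "merge two clusters?" answered No -> end
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Two clusters of nearby samples merge; a well-separated cluster stays distinct.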

SLIDE 21

ONLINE SPEAKER DIARIZATION

  • Precompute models for each speaker:
  • Run offline diarization on the start of a meeting.
  • Train models on the first 60 seconds from each resulting speaker.
  • Another approach: stored models per speaker.
  • Every 2.5 seconds, compute scores for each speaker model and output the highest.
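The 2.5-second decision loop can be sketched as follows; the `score` function is a stand-in for a per-speaker GMM log-likelihood, and all names here are illustrative rather than the system's actual API.

```python
def online_diarize(chunks, speaker_models, score):
    """Online decision loop: score each 2.5-second audio chunk
    against every precomputed speaker model and emit the best
    speaker as the answer to "who is speaking now".

    chunks: iterable of audio chunks (already buffered).
    speaker_models: {speaker_name: model}.
    score(model, chunk): higher means a better match.
    """
    labels = []
    for chunk in chunks:
        best = max(speaker_models,
                   key=lambda spk: score(speaker_models[spk], chunk))
        labels.append(best)
    return labels
```

With toy "models" that are just per-speaker means and a distance-based score, the loop labels each chunk with its nearest speaker.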

SLIDE 22

HYBRID ONLINE/OFFLINE DIARIZATION

[Diagram: the hybrid online/offline architecture. The audio signal feeds a 2.5-second buffer in the online subsystem, which emits online decisions (“who is speaking now”), and a history buffer in the offline subsystem, whose diarization engine (segmentation and clustering) plus MAP training maintain the speaker mapping used by the online decisions.]

SLIDE 23

HYBRID ONLINE/OFFLINE DIARIZATION

[Plot: online diarization DER per core. Error % (roughly 32-40%) versus the number of cores dedicated to the offline subsystem (1-8), including a 7-core + GPU configuration.]

SLIDE 24

DIARIZATION

  • Next steps:
  • CPU/GPU hybrid system
  • Implement serial optimizations in parallel version
  • Integrate with manystream approach

SLIDE 25

CONCLUSION

  • Speech technology can use all resources that are available.
  • Parallelism enables improvements in several areas:
  • Accuracy
  • Throughput
  • Latency
  • Programming parallel systems continues to be challenging.
