SLIDE 1

Language, Dialect, and Speaker Recognition Using Gaussian Mixture Models on the Cell Processor

Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner

{nmalyska, smohindra, karen.lauro, reynolds, kepner}@ll.mit.edu

This work is sponsored by the United States Air Force under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the authors and are not necessarily endorsed by the United States Government.

SLIDE 2

Outline

  • Introduction
  • Recognition for speech applications using GMMs
  • Parallel implementation of the GMM
  • Performance model
  • Conclusions and future work

SLIDE 3

Introduction

Automatic Recognition Systems

  • In this presentation, we will discuss technology that can be applied to different kinds of recognition systems
    – Language recognition
    – Dialect recognition
    – Speaker recognition

[Figure: callouts asking "Who is the speaker? What dialect are they using? What language are they speaking?"]

SLIDE 4

Introduction

The Scale Challenge

  • Speech processing problems are often described as one person interacting with a single computer system and receiving a response

SLIDE 5

Introduction

The Scale Challenge

  • Real speech applications, however, often involve data from multiple talkers and use multiple networked multicore machines
    – Interactive voice response systems
    – Voice portals
    – Large corpus evaluations with hundreds of hours of data

[Figure: many talkers feeding a networked system that outputs information about speaker, dialect, or language]

SLIDE 6

Introduction

The Computational Challenge

  • Speech-processing algorithms are computationally expensive
  • Large amounts of data need to be available for these applications
    – Must cache required data efficiently so that it is quickly available
  • Algorithms must be parallelized to maximize throughput
    – Conventional approaches focus on parallel solutions over multiple networked computers
    – Existing packages are not optimized for high-performance-per-watt machines with multiple cores, required in embedded systems with power, thermal, and size constraints
    – Want highly responsive "real-time" systems in many applications, including embedded systems

SLIDE 7

Outline

  • Introduction
  • Recognition for speech applications using GMMs
  • Parallel implementation of the GMM
  • Performance model
  • Conclusions and future work

SLIDE 8

Recognition Systems

Summary

  • A modern language, dialect, or speaker recognition system is composed of two main stages
    – Front-end processing
    – Pattern recognition
  • We will show how a speech signal is processed by modern recognition systems
    – Focus on a recognition technology called Gaussian mixture models

[Diagram: Speech → Front End → Pattern Recognition → Decision on the identity, dialect, or language of the speaker]

SLIDE 9

Recognition Systems

Frame-Based Processing

  • The first step in modern speech systems is to convert incoming speech samples into frames
  • A typical frame rate for a speech stream is 100 frames per second

[Figure: speech samples segmented into a sequence of speech frames, indexed by frame number over time]
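
To make the framing step concrete, here is a minimal sketch that slices a waveform into frames at the 100 frames-per-second rate quoted above. The 8 kHz sample rate and 25 ms window are illustrative assumptions; the slides specify only the frame rate.

```python
import numpy as np

def frame_signal(samples, sample_rate=8000, frame_rate=100, window_ms=25):
    """Slice a waveform into (possibly overlapping) frames.

    sample_rate and window_ms are assumed values for illustration;
    frame_rate=100 matches the 100 frames/second cited in the slides.
    """
    hop = sample_rate // frame_rate            # samples between frame starts
    win = int(sample_rate * window_ms / 1000)  # samples per frame
    n_frames = 1 + (len(samples) - win) // hop
    return np.stack([samples[k * hop : k * hop + win]
                     for k in range(n_frames)])

# One second of noise -> 98 frames of 200 samples each
frames = frame_signal(np.random.randn(8000))
print(frames.shape)
```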

SLIDE 14

Recognition Systems

Front-End Processing

  • Front-end processing converts observed speech frames into an alternative representation, features
    – Lower dimensionality
    – Carries information relevant to the problem
  • The front end produces one feature vector per frame:

$$X = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_K\}$$

[Figure: speech frames passing through the front end to become feature vectors x_1, x_2, x_3, x_4, ..., plotted in feature dimensions 1 and 2]
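
Continuing the sketch, a deliberately toy front end below maps each frame to a low-dimensional feature vector (log energies in a few spectral bands). Real front ends use richer features such as cepstra; the band-energy scheme and the n_features value here are assumptions for illustration, not the authors' front end.

```python
import numpy as np

def toy_front_end(frames, n_features=8):
    """Map each frame to a low-dimensional feature vector.

    Toy stand-in for a real front end: log energy in n_features
    equal FFT bands. n_features=8 is an arbitrary assumption.
    """
    spectra = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    bands = np.array_split(spectra, n_features, axis=1)
    return np.log(np.stack([b.sum(axis=1) for b in bands], axis=1) + 1e-10)

# frames: (K, samples_per_frame) -> X: (K, n_features)
X = toy_front_end(np.random.randn(98, 200))
print(X.shape)  # (98, 8)
```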

SLIDE 15

Recognition Systems

Pattern Recognition Training

  • A recognition system makes decisions about observed data based on a knowledge of past data
  • During training, the system learns about the data it uses to make decisions
    – A set of features is collected from a certain language, dialect, or speaker

[Figure: training feature vectors x_1, x_2, x_3, x_4 plotted in dimensions 1 and 2]

SLIDE 16

Recognition Systems

Pattern Recognition Training

  • A recognition system makes decisions about observed data based on a knowledge of past data
  • During training, the system learns about the data it uses to make decisions
    – A set of features is collected from a certain language, dialect, or speaker
    – A model is generated to represent the data

[Figure: training features in dimensions 1 and 2 and the model density p(x) fit to them]

SLIDE 17

Recognition Systems

Gaussian Mixture Models

  • A Gaussian mixture model (GMM) represents features as the weighted sum of multiple Gaussian distributions
  • Each Gaussian state $i$ has a
    – Mean $\boldsymbol{\mu}_i$
    – Covariance $\Sigma_i$
    – Weight $w_i$

[Figure: GMM density p(x|λ) for a model λ over dimensions 1 and 2]
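
As a minimal sketch of this definition, the function below evaluates p(x|λ) as a weighted sum of Gaussian densities. Diagonal covariances are assumed for simplicity; the slides do not state the covariance structure.

```python
import numpy as np

def gmm_density(x, weights, means, variances):
    """Evaluate p(x|lambda) = sum_i w_i N(x; mu_i, Sigma_i).

    Diagonal covariances assumed: variances has shape (M, D),
    like means; weights has shape (M,) and sums to 1.
    """
    diff = x - means                                   # (M, D)
    quad = np.sum(diff * diff / variances, axis=1)     # Mahalanobis terms
    norm = np.sqrt((2 * np.pi) ** x.size * np.prod(variances, axis=1))
    return np.sum(weights * np.exp(-0.5 * quad) / norm)

# Two-state GMM in two dimensions
w = np.array([0.4, 0.6])
mu = np.array([[0.0, 0.0], [3.0, 3.0]])
var = np.array([[1.0, 1.0], [0.5, 0.5]])
print(gmm_density(np.array([0.1, -0.2]), w, mu, var))
```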

SLIDE 18

Recognition Systems

Gaussian Mixture Models

[Figure: animation builds highlighting the model states and their parameters μ_i, Σ_i, w_i that compose the density p(x)]

SLIDE 20

Recognition Systems

Language, Speaker, and Dialect Models

  • In LID, DID, and SID, we train a set of target models, one for each language, dialect, or speaker

[Figure: model states and parameters for target models λ_1, λ_2, λ_3 and the target density p(x|λ_C) over dimensions 1 and 2]

SLIDE 21

Recognition Systems

Universal Background Model

  • We also train a universal background model $\lambda_{\bar{C}}$ representing all speech

[Figure: UBM states and parameters and the background density p(x|λ_C̄) over dimensions 1 and 2]

SLIDE 22

Recognition Systems

Hypothesis Test

  • Given a set of test observations $X_{\text{test}} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_K\}$, we perform a hypothesis test to determine whether a certain class produced it:

$$H_0: X_{\text{test}} \text{ is from the hypothesized class}$$

$$H_1: X_{\text{test}} \text{ is not from the hypothesized class}$$

[Figure: test feature vectors plotted in dimensions 1 and 2]

SLIDE 23

Recognition Systems

Hypothesis Test

  • Given a set of test observations $X_{\text{test}} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_K\}$, we perform a hypothesis test to determine whether a certain class produced it
  • The test observations are scored against a target model density $p(\mathbf{x}|\lambda_1)$ and the background density $p(\mathbf{x}|\lambda_{\bar{C}})$ to decide between $H_0$ and $H_1$

[Figure: test features compared against the target and background model densities over dimensions 1 and 2]

SLIDE 24

Recognition Systems

Hypothesis Test

  • Given a set of test observations $X_{\text{test}} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_K\}$, we perform a hypothesis test to determine whether a certain class produced it
  • For example, scoring against an English model $p(\mathbf{x}|\lambda_1)$ and the background model $p(\mathbf{x}|\lambda_{\bar{C}})$ answers "English?" versus "Not English?"

[Figure: the same test, labeled with the concrete question "English? Not English?"]

SLIDE 25

Recognition Systems

Log-Likelihood Ratio Score

  • We use the log-likelihood ratio score to decide whether an observed speaker, language, or dialect is the target
  • We determine which hypothesis is true using the ratio:

$$\Lambda(X) = \log[p(X|\lambda_C)] - \log[p(X|\lambda_{\bar{C}})]$$

$$\Lambda(X) \begin{cases} \geq \text{threshold,} & X \text{ generated by } \lambda_C \text{ (accept } H_0\text{)} \\ < \text{threshold,} & X \text{ generated by } \lambda_{\bar{C}} \text{ (reject } H_0\text{)} \end{cases}$$
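
A small sketch of this decision rule, assuming the two log-likelihoods have already been computed (for example by a scoring routine like the one sketched after the next slide). The default threshold of 0.0 is a placeholder, not a value from the slides.

```python
def llr_decision(loglik_target, loglik_background, threshold=0.0):
    """Log-likelihood ratio test: accept H0 iff the score clears the threshold.

    threshold=0.0 is a placeholder; in practice it is tuned on held-out data.
    """
    score = loglik_target - loglik_background   # Lambda(X)
    return score, score >= threshold

score, accept = llr_decision(-1040.2, -1052.8)
print(f"Lambda(X) = {score:.1f}, accept H0: {accept}")
```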

SLIDE 26

Recognition Systems

Log-Likelihood Computation

  • The observation log-likelihood given a model $\lambda$ is:

$$\log[p(X|\lambda)] = \sum_{k=1}^{K} \log\left( \sum_{i=1}^{M} C_i \exp\left( -\tfrac{1}{2} (\mathbf{x}_k - \boldsymbol{\mu}_i)^T \Sigma_i^{-1} (\mathbf{x}_k - \boldsymbol{\mu}_i) \right) \right)$$

  • The quadratic form expands into dot products, the core computation
  • $C_i$ is a constant derived from the weight and covariance of state $i$
  • A table lookup is used to compute the log/exp function
  • The outer sum runs over all $K$ features
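
The formula transcribes almost directly into NumPy. The sketch below assumes diagonal covariances, folds each state's weight and normalization into the constant C_i, and uses a numerically stable log-sum-exp in place of the table lookup the Cell implementation relies on.

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """log p(X|lambda) = sum_k log sum_i C_i exp(-0.5 (x_k-mu_i)^T Sigma_i^-1 (x_k-mu_i)).

    Diagonal covariances assumed. log_C folds each state's weight and
    Gaussian normalization constant into one per-state constant.
    """
    D = X.shape[1]
    log_C = (np.log(weights)
             - 0.5 * (D * np.log(2 * np.pi) + np.log(variances).sum(axis=1)))
    diff = X[:, None, :] - means[None, :, :]          # (K, M, D)
    quad = np.sum(diff * diff / variances, axis=2)    # (K, M) dot products
    state_logs = log_C - 0.5 * quad                   # per-frame, per-state
    m = state_logs.max(axis=1, keepdims=True)         # stable log-sum-exp
    frame_logs = m[:, 0] + np.log(np.exp(state_logs - m).sum(axis=1))
    return frame_logs.sum()                           # sum over all K features

# 5 frames scored against a 2-state model in 2 dimensions
X = np.random.randn(5, 2)
w = np.array([0.4, 0.6]); mu = np.zeros((2, 2)); var = np.ones((2, 2))
print(gmm_log_likelihood(X, w, mu, var))
```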

SLIDE 31

Outline

  • Introduction
  • Recognition for speech applications using GMMs
  • Parallel implementation of the GMM
  • Performance model
  • Conclusions and future work

SLIDE 32

Parallel Implementation of the GMM

Summary

  • We have developed an algorithm to perform GMM scoring on the Cell processor
  • This scoring stage of pattern recognition is where much of the time is spent in current systems
  • This section:
    – Describes the Cell Broadband Engine architecture
    – Summarizes the strengths and limitations of the Cell
    – Discusses step-by-step the algorithm we developed for GMM scoring on the Cell

SLIDE 33

Parallel Implementation of the GMM

Cell Architecture

  • The Cell Broadband Engine has leading performance-per-watt specifications in its class
  • Synergistic processing elements (SPEs)
    – 256 KB of local store memory
    – 25.6 GFLOPs per SPE
    – SIMD instructions
  • PowerPC processor element (PPE)
  • PPE and multiple SPEs operate in parallel and communicate via a high-speed bus
    – 12.8e9 bytes/second (one way)
  • Each SPE can transfer data from main memory using DMA
    – PPE can effectively "send" data to the SPEs using this method

[Diagram: PPE, SPE 0 through SPE N, and main memory connected by the high-speed bus]

SLIDE 34

Parallel Implementation of the GMM

Cell Design Principles

  • Limitations of the Cell processor
    – Size of local store is small: only 256 KB
    – All SPE data must be explicitly transferred in and out of local store
    – The PPE is much slower than the SPEs
  • Solutions to maximize throughput
    – Do computations on SPEs when possible
    – Minimize time when SPEs are idle
    – Keep commonly-used data on SPEs to avoid the cost of transferring to local store

SLIDE 35

Parallel Implementation of the GMM

Algorithm: Background Scoring

  • Begin with a background model and a single feature vector, both resident on the PPE

[Figure: feature vector x and background model density p(x|λ_C̄) over dimensions 1 and 2, both on the PPE]

SLIDE 36

Parallel Implementation of the GMM

Algorithm Step 1

  • Broadcast the background model to the SPEs; see the partitioning sketch below
    – The 616 KB model is split across SPEs since it will not fit on a single SPE
    – Kept on SPEs throughout the scoring procedure

[Figure: background model states partitioned and sent to SPE 0 through SPE 7]
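
A schematic of that split, assuming the 2048 UBM states are dealt out evenly across the 8 SPEs (256 states, about 77 KB at 4 bytes per parameter, apiece); plain NumPy arrays stand in for the Cell's DMA transfers.

```python
import numpy as np

N_SPES = 8
N_STATES, N_PARAMS = 2048, 77  # states and per-state parameters from the slides

# Each row holds one state's parameters (mean, variance, weight-derived constant)
ubm = np.zeros((N_STATES, N_PARAMS), dtype=np.float32)

# Deal the model out evenly: 256 states (~77 KB at float32) per SPE
spe_chunks = np.array_split(ubm, N_SPES, axis=0)
for r, chunk in enumerate(spe_chunks):
    print(f"SPE {r}: {chunk.shape[0]} states, {chunk.nbytes / 1024:.0f} KB")
```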

SLIDE 37

Parallel Implementation of the GMM

Algorithm Step 2

  • Broadcast a copy of the feature vector x to the SPEs

[Figure: feature vector x sent to SPE 0 through SPE 7]

SLIDE 38

Parallel Implementation of the GMM

Algorithm Step 3

  • Score the feature vector against the background model states held on each SPE, producing partial scores $\log[p_0(\mathbf{x}|\lambda_{\bar{C}})]$ through $\log[p_7(\mathbf{x}|\lambda_{\bar{C}})]$

[Figure: SPE 0 through SPE 7 each scoring x against their share of the background model]

SLIDE 39

Parallel Implementation of the GMM

Algorithm Step 4

  • Move the background scores to the PPE and aggregate; a sketch of the aggregation follows

$$\log[p(\mathbf{x}|\lambda_{\bar{C}})] = \log\left( \sum_{r=0}^{7} \exp\left( \log[p_r(\mathbf{x}|\lambda_{\bar{C}})] \right) \right)$$

[Figure: partial scores moving from the SPEs to the PPE for aggregation]
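
A sketch of the aggregation as a log-sum-exp over the eight per-SPE partial scores. The max-subtraction for numerical stability is a choice of this sketch; the slides give only the mathematical form.

```python
import numpy as np

def aggregate_partial_scores(partial_logs):
    """Combine per-SPE partial scores:
    log p(x|lambda) = log sum_r exp(log p_r(x|lambda)).

    Subtracting the max first keeps exp() from underflowing when
    the partial log-scores are large and negative.
    """
    partial_logs = np.asarray(partial_logs)
    m = partial_logs.max()
    return m + np.log(np.exp(partial_logs - m).sum())

# Eight partial log-scores, one per SPE
partials = [-52.1, -49.8, -55.3, -50.2, -61.0, -48.9, -53.4, -51.7]
print(aggregate_partial_scores(partials))
```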

SLIDE 40

Parallel Implementation of the GMM

Algorithm: Target Scoring

  • Begin with a target model and keep the single feature vector on the SPEs

[Figure: target model density p(x|λ_C) on the PPE; feature vector x already resident on the SPEs]

SLIDE 41

Parallel Implementation of the GMM

Algorithm Step 5

  • Distribute target model states to the SPEs; a short-list sketch follows
    – Only a subset of states needs to be scored (called Gaussian short-lists)

[Figure: selected target model states sent to SPE 0 through SPE 7]
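
One common reading of Gaussian short-lists, sketched below: the top-scoring UBM states for the current frame select which target-model states are worth scoring (the performance model later cites 5 states per frame). The function names and the use of per-state log-scores are assumptions of this sketch.

```python
import numpy as np

def shortlist_states(ubm_state_logs, n_best=5):
    """Indices of the n_best UBM states for the current frame.

    Only these states of each target model are then scored,
    per the 5-states-per-frame figure quoted in the slides.
    """
    return np.argsort(ubm_state_logs)[-n_best:]

def score_target_shortlist(target_state_logs, shortlist):
    """Approximate log p(x|target) using only the short-listed states."""
    sub = target_state_logs[shortlist]
    m = sub.max()
    return m + np.log(np.exp(sub - m).sum())

ubm_logs = np.random.randn(2048) - 50.0      # per-state UBM log-scores
tgt_logs = np.random.randn(2048) - 50.0      # per-state target log-scores
idx = shortlist_states(ubm_logs)
print(idx, score_target_shortlist(tgt_logs, idx))
```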

SLIDE 42

Parallel Implementation of the GMM

Algorithm Step 6

  • Score the feature vector against the target model states on each SPE, producing partial scores $\log[p_r(\mathbf{x}|\lambda_C)]$

[Figure: SPE 0 through SPE 7 scoring x against their target model states]

SLIDE 43

Parallel Implementation of the GMM

Algorithm Step 7

  • Collect the target scores from the SPEs and aggregate, as in Step 4:

$$\log[p(\mathbf{x}|\lambda_C)] = \log\left( \sum_{r=0}^{7} \exp\left( \log[p_r(\mathbf{x}|\lambda_C)] \right) \right)$$

[Figure: partial target scores moving from the SPEs to the PPE for aggregation]

SLIDE 44

Parallel Implementation of the GMM

Implementation Challenges

  • We have begun implementing our algorithm on the Cell processor
  • Implementing vectorization is a challenge
    – Concentrate on optimizing the dot-product and aggregation algorithms
  • Designing data transfers is another challenging problem; a double-buffering sketch follows
    – Subdividing and distributing the models to minimize transfer time
    – Timing transfers so that they overlap with computation (double buffering)

$$\log[p(X|\lambda)] = \sum_{k=1}^{K} \log\left( \sum_{i=1}^{M} C_i \exp\left( -\tfrac{1}{2} (\mathbf{x}_k - \boldsymbol{\mu}_i)^T \Sigma_i^{-1} (\mathbf{x}_k - \boldsymbol{\mu}_i) \right) \right)$$
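
To illustrate the double-buffering idea, here is a schematic analogue in Python: a background thread fetches the next block of model states while the current block is being scored. Real Cell code would use SPE DMA transfers between main memory and two local-store buffers; the thread pool is only a stand-in for that overlap.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def fetch(block_id):
    """Stand-in for a DMA transfer of one block of model states."""
    rng = np.random.default_rng(block_id)
    return rng.standard_normal((256, 77)).astype(np.float32)

def score(block, x):
    """Stand-in for scoring x against one block of states."""
    return float(np.dot(block[:, :38], x).sum())

x = np.ones(38, dtype=np.float32)
total = 0.0
with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(fetch, 0)           # prime the first buffer
    for nxt in range(1, 8):
        block = pending.result()              # wait for current buffer
        pending = pool.submit(fetch, nxt)     # start the next transfer...
        total += score(block, x)              # ...while scoring this one
    total += score(pending.result(), x)       # last buffer
print(total)
```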

SLIDE 45

Outline

  • Introduction
  • Recognition for speech applications using GMMs
  • Parallel implementation of the GMM
  • Performance model
  • Conclusions and future work

SLIDE 46

Performance Model

Cell Resources

[Diagram: PPE, SPE 0 through SPE N, and main memory on the high-speed bus; 8 SPEs used (25.6 GFLOPs each), bus bandwidth 12.8e9 bytes per second]

SLIDE 47

Performance Model

Data Structures

  • Feature vectors: $X = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_K\}$, 38-dimensional, at 100 features per second for each stream (152 bytes per frame)
  • Universal background model: 2048 states with 77 parameters each, 616 KB for the entire UBM; 77 KB resides on each SPE
  • Target models: 10 targets of 2048 states with 77 parameters each, 6 MB for all targets; only 5 states (15 KB) are used per frame via short-lists
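
These figures are mutually consistent if each parameter is a 4-byte float and the 77 parameters per state decompose as a 38-dimensional mean, a 38-dimensional diagonal variance, and one weight-derived constant; that decomposition is our inference, not stated on the slides. A quick arithmetic check:

```python
BYTES = 4                      # assume 32-bit floats
D, STATES, TARGETS = 38, 2048, 10
PARAMS = D + D + 1             # mean + diagonal variance + weight constant

print(PARAMS)                                     # 77 parameters per state
print(D * BYTES)                                  # 152 bytes per frame
print(STATES * PARAMS * BYTES / 1024)             # 616 KB for the whole UBM
print(STATES * PARAMS * BYTES / 1024 / 8)         # 77 KB per SPE
print(TARGETS * STATES * PARAMS * BYTES / 2**20)  # ~6 MB for all targets
print(TARGETS * 5 * PARAMS * BYTES / 1024)        # ~15 KB short-listed per frame
```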

SLIDE 51

Performance Model

Simulation and Measurements

[Plots: computational efficiency (percent) versus concurrent real-time speech streams, shown across increasing levels of optimization; annotations mark the region where communication and synchronization costs matter most]

SLIDE 54

Performance Model

Simulation and Measurements

  • The effect of increasing the number of speakers, dialects, or languages (targets) was simulated
    – Changing the number of targets varies the amount of data sent to SPEs and the amount of calculation per SPE

[Plot: concurrent real-time speech streams versus number of speakers, dialects, or languages, assuming 20% computational efficiency and 50% data-transfer efficiency]

SLIDE 55

Outline

  • Introduction
  • Recognition for speech applications using GMMs
  • Parallel implementation of the GMM
  • Performance model
  • Conclusions and future work

SLIDE 56

Conclusions

Summary

  • Language, dialect, and speaker recognition systems are large in scale and will benefit from parallelization due to their need for high throughput
  • GMM scoring is expensive both in terms of computing resources and memory
  • We have designed and modeled an algorithm to perform GMM scoring in an efficient way
    – Preserving often-used data on the SPEs
    – Performing most calculations on the SPEs

SLIDE 57

Conclusions

Future Work

  • Optimization and measurement of the full algorithm to validate the model
  • Compare our system against other state-of-the-art serial and parallel approaches
    – Intel single processor
    – Intel multicore
    – Intel networked
    – Cell PPE
  • Our results will become part of the PVTOL library

SLIDE 58

Acknowledgement

  • Cliff Weinstein
  • Joe Campbell
  • Alan McCree
  • Tom Quatieri
  • Sharon Sacco

SLIDE 59

Backup

SLIDE 60

Gaussian Mixture Model

Equation

  • A Gaussian mixture model (GMM) represents features as the weighted sum of multiple Gaussian distributions
  • Each Gaussian state $i$ has a
    – Mean $\boldsymbol{\mu}_i$
    – Covariance $\Sigma_i$
    – Weight $w_i$

$$p(\mathbf{x}|\lambda) = \sum_{i=1}^{M} \frac{w_i}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^T \Sigma_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) \right)$$