Speech Processing 15-492/18-492 Speech Recognition Intro Acoustic modelling HMMs



slide-1
SLIDE 1

Speech Processing 15-492/18-492

Speech Recognition Intro Acoustic modelling HMMs

slide-2
SLIDE 2

Speech Recognition

  • From acoustics to text
  • Acoustic modeling
  • Recognizing all forms of all phonemes
  • Language modeling
  • Expectation of what might be said
  • We need both to do recognition

slide-3
SLIDE 3

Acoustics are not enough

  • Last Saturday in Hawaii, numerous Waipouli vacationers were shocked to find their beach cordoned off for a UC Berkeley Drama enactment of "Personal office space". The play features exclusively topless men and women in an everyday office environment. Richard Carlson, one of the annoyed tourists and a regular swimmer at Waipouli beach, complained that they really knew how to wreck a nice beach with the nudist play. Many of the tourists appeared ruffled by the content and fled the scene to avoid compromising photos.

  • In yesterday's press release, AT&T unveiled SpeechKit, its new speech recognition toolkit. According to Michael Armstrong, the COO of the company, the most innovative feature of the system is its revolutionary three-dimensional interface, which opens a new universe of possibilities for the speech recognition community. During the official software release, Jonathan Blues, a senior researcher at AT&T Labs, explained how to recognize speech with the new display, and how the toolkit has already played a crucial role in his research.

slide-4
SLIDE 4

Acoustics are not enough

  • Last Saturday in Hawaii, numerous Waipouli vacationers were shocked to find their beach cordoned off for a UC Berkeley Drama enactment of "Personal office space". The play features exclusively topless men and women in an everyday office environment. Richard Carlson, one of the annoyed tourists and a regular swimmer at Waipouli beach, complained that they really knew how to wreck a nice beach with this nudist play. Many of the tourists appeared ruffled by the content and fled the scene to avoid compromising photos.

  • In yesterday's press release, AT&T unveiled SpeechKit, its new speech recognition toolkit. According to Michael Armstrong, the COO of the company, the most innovative feature of the system is its revolutionary three-dimensional interface, which opens a new universe of possibilities for the speech recognition community. During the official software release, Jonathan Blues, a senior researcher at AT&T Labs, explained how to recognize speech with this new display, and how the toolkit has already played a crucial role in his research.

slide-5
SLIDE 5

Split the task

  • Build Acoustic models
  • Probability of phones given acoustics
  • Build Language models
  • Probability of word string

slide-6
SLIDE 6

Acoustic models

  • Represent all ways to say each phoneme
  • Like “templates” for each phoneme
  • Averages over multiple examples
  • Different phonetic contexts
    – “sow” vs “see” etc
  • Different people speaking
  • Different acoustic environment
  • Different channels
    – (assume channel is similar)

slide-7
SLIDE 7

Better Acoustic Models

  • DTW Template
  • Could be averages over multiple examples
  • Need to be time normalized
    – Linear interpolate or try to match
  • Matching probabilistically
    – What is the probability that the example matches
    – Test each frame
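
The "try to match" option for time normalization can be sketched as a small dynamic-programming alignment. This is a minimal illustration, assuming one-dimensional frame values and an absolute-difference local distance; the function name `dtw_distance` and both choices are assumptions for the example, not taken from the slides.

```python
def dtw_distance(template, example):
    """Cumulative cost of the best time alignment between two sequences."""
    n, m = len(template), len(example)
    inf = float("inf")
    # cost[i][j] = best cost of aligning template[:i] with example[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local distance between the two frames (assumed: absolute difference).
            d = abs(template[i - 1] - example[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # advance in the template only
                cost[i][j - 1],      # advance in the example only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]
```

An example that merely lingers on one frame, such as `[1, 2, 2, 3]` against the template `[1, 2, 3]`, aligns at zero cost, which is the point of time normalization.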

slide-8
SLIDE 8

Hidden Markov Models

  • Markov Process

– Future can be predicted from the past

  • Hidden Markov Models:

– When the state is unknown
– A probability is given for each state

slide-9
SLIDE 9

Hidden Markov Model

slide-10
SLIDE 10

Key Requirements

slide-11
SLIDE 11

Find Probability of Observation

  • Given observation O and model M
  • Efficiently find P(O|M)
  • Called decoding
  • Find sum of all path probabilities
  • Each path prob is product of each transition in state sequence
  • Use dynamic programming (generalized DTW)
  • Also used in Chart Parsers, Theorem Provers
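
A minimal sketch of the dynamic-programming sum over paths (usually called the forward algorithm), assuming a discrete-observation HMM stored as plain lists. The model values are hypothetical, and the per-path product here includes the emission probability of each observation as well as the transitions. The brute-force version is included only to show that both compute the same P(O|M).

```python
from itertools import product

def forward_probability(obs, init, trans, emit):
    """P(O|M): sum over all state paths, computed by dynamic programming."""
    n_states = len(init)
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = [init[s] * emit[s][obs[0]] for s in range(n_states)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][o]
            for s in range(n_states)
        ]
    return sum(alpha)

def brute_force_probability(obs, init, trans, emit):
    """Same quantity by enumerating every path (exponential; for checking only)."""
    total = 0.0
    for path in product(range(len(init)), repeat=len(obs)):
        p = init[path[0]] * emit[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans[path[t - 1]][path[t]] * emit[path[t]][obs[t]]
        total += p
    return total

# Hypothetical 2-state model with 2 observation symbols.
INIT = [0.6, 0.4]
TRANS = [[0.7, 0.3], [0.4, 0.6]]
EMIT = [[0.9, 0.1], [0.2, 0.8]]
```

The dynamic-programming version does work proportional to the sequence length times the square of the number of states, whereas the enumeration grows exponentially with the sequence length.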

slide-12
SLIDE 12

Finding the Best Path

  • What is the most probable state sequence
  • Use Viterbi algorithm
  • Maximize best sequence
  • At each point hold list of possible states
  • Hold back-pointer to best previous state
  • Accumulate values along path
  • Because we are looking for BEST
  • Can ignore other back-pointers
  • (When looking for N-best need more complex structure)
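
The back-pointer bookkeeping can be sketched as follows. This is a minimal illustration assuming a discrete-observation HMM held in plain lists; the function name and data layout are assumptions for the example.

```python
def viterbi(obs, init, trans, emit):
    """Return (most probable state sequence, its probability)."""
    n_states = len(init)
    # score[s] = probability of the best path so far that ends in state s
    score = [init[s] * emit[s][obs[0]] for s in range(n_states)]
    backptrs = []  # one list of back-pointers per time step after the first
    for o in obs[1:]:
        new_score, ptr = [], []
        for s in range(n_states):
            # Only the single best predecessor is kept for each state.
            best_prev = max(range(n_states), key=lambda p: score[p] * trans[p][s])
            ptr.append(best_prev)
            new_score.append(score[best_prev] * trans[best_prev][s] * emit[s][o])
        score = new_score
        backptrs.append(ptr)
    # Trace the back-pointers from the best final state.
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, score[last]
```

Compared with summing over all paths, the only change is that the sum over predecessors becomes a max, plus the stored back-pointer that lets the best sequence be read off at the end.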

slide-13
SLIDE 13

Parameter Estimation

  • Called training
  • Use Maximum Likelihood Estimation
  • Baum-Welch (forward/backward algorithm)
  • Special case of EM (Expectation Maximization)
  • Run observation and find current probs (forward)
  • Modify probabilities to make observations best path (backward)
  • Repeat until convergence
  • Not globally optimal
  • May find local maximum
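
One re-estimation step might look like the sketch below. It assumes a discrete-observation HMM, a single training sequence, and strictly positive starting probabilities; the function names and list-of-lists layout are illustrative, and no scaling is applied, so it only suits short sequences.

```python
def forward(obs, init, trans, emit):
    n = len(init)
    alpha = [[init[s] * emit[s][obs[0]] for s in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[p] * trans[p][s] for p in range(n)) * emit[s][o]
                      for s in range(n)])
    return alpha

def backward(obs, trans, emit):
    n = len(trans)
    beta = [[1.0] * n]  # beta at the final time step
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(trans[s][q] * emit[q][o] * nxt[q] for q in range(n))
                        for s in range(n)])
    return beta

def baum_welch_step(obs, init, trans, emit):
    """One EM update; returns new parameters and the likelihood of the old ones."""
    n, n_sym, T = len(init), len(emit[0]), len(obs)
    alpha, beta = forward(obs, init, trans, emit), backward(obs, trans, emit)
    likelihood = sum(alpha[-1])
    # gamma[t][s]: posterior probability of being in state s at time t
    gamma = [[alpha[t][s] * beta[t][s] / likelihood for s in range(n)]
             for t in range(T)]
    # xi[t][p][s]: posterior probability of the transition p -> s at time t
    xi = [[[alpha[t][p] * trans[p][s] * emit[s][obs[t + 1]] * beta[t + 1][s] / likelihood
            for s in range(n)] for p in range(n)] for t in range(T - 1)]
    new_init = gamma[0][:]
    new_trans = [[sum(xi[t][p][s] for t in range(T - 1)) /
                  sum(gamma[t][p] for t in range(T - 1))
                  for s in range(n)] for p in range(n)]
    new_emit = [[sum(gamma[t][s] for t in range(T) if obs[t] == k) /
                 sum(gamma[t][s] for t in range(T))
                 for k in range(n_sym)] for s in range(n)]
    return new_init, new_trans, new_emit, likelihood
```

Calling `baum_welch_step` repeatedly, feeding each result back in, never lowers the likelihood of the training sequence, but where it ends up depends on the starting parameters, which is why only a local maximum is guaranteed.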

slide-14
SLIDE 14

HMM recognition

  • A bunch of HMMs
  • One for each phone type
  • Each observation (e.g. 10ms frame)
  • Probability distribution of possible phone type
  • Thus can find most probable sequence
  • Use Viterbi to find best path

slide-15
SLIDE 15

But that’s not enough

  • But not all phones are equi-probable
  • Find word sequence W that maximizes P(W|O)
  • Using Bayes’ Law: P(W|O) = P(O|W) P(W) / P(O)
  • Combine models

– Use HMMs to provide P(O|W)
– Use language model to provide P(W)
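
A small sketch of combining the two models with Bayes' Law, assuming the usual reading in which the HMMs supply the acoustic likelihood P(O|W) and the language model supplies the prior P(W) for a candidate word string W. The candidate strings echo the "Acoustics are not enough" example; every number below is invented for illustration.

```python
import math

def best_hypothesis(candidates):
    """candidates maps a word string to (P(O|W), P(W)); return the best string."""
    # P(O) is the same for every candidate, so it drops out of the argmax.
    return max(
        candidates,
        key=lambda w: math.log(candidates[w][0]) + math.log(candidates[w][1]),
    )

def best_acoustic_only(candidates):
    """Pick the string with the highest acoustic likelihood, ignoring P(W)."""
    return max(candidates, key=lambda w: candidates[w][0])

# Hypothetical scores: the acoustics slightly prefer the wrong string,
# but the language model considers it far less likely.
CANDIDATES = {
    "recognize speech": (1.0e-5, 1.0e-4),
    "wreck a nice beach": (1.2e-5, 1.0e-7),
}
```

With these made-up numbers the acoustic score alone picks "wreck a nice beach", while the combined score picks "recognize speech".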

slide-16
SLIDE 16

How many HMM models

  • How many models
  • One for each thing you want to recognize:
    – One per phone
    – One per word
    – One per city name …
  • What is the size and shape of the model

slide-17
SLIDE 17

HMM Topology

1 state, 3 state, 3 state with skips
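
These topologies can be written down as transition matrices. The sketch below assumes a left-to-right model in which each state either loops on itself or moves forward, with one extra final column standing for leaving the model; the function name, the default probabilities, and the choice to allow skips only between emitting states are all assumptions for the example.

```python
def left_to_right(n_states, self_loop=0.5, skip=0.0):
    """Transition matrix with n_states rows and n_states + 1 columns.

    Column n_states stands for exiting the model.
    """
    rows = []
    for s in range(n_states):
        row = [0.0] * (n_states + 1)
        row[s] = self_loop
        # A skip jumps over the next state; only allowed if it lands on a real state.
        can_skip = skip > 0.0 and s + 2 < n_states
        row[s + 1] = 1.0 - self_loop - (skip if can_skip else 0.0)
        if can_skip:
            row[s + 2] = skip
        rows.append(row)
    return rows

ONE_STATE = left_to_right(1)
THREE_STATE = left_to_right(3)
THREE_STATE_WITH_SKIPS = left_to_right(3, skip=0.2)
```

In the three-state model with skips, only the first state has a non-zero entry two columns to the right of the diagonal; all other rows match the plain three-state model.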

slide-18
SLIDE 18

How many models

  • Context Independent models:
  • One for each phoneme
  • One for silence, noises
  • Triphone models
  • Context dependent
  • Phone before and after
  • Need lots of data to train this
  • Tied states (semi-continuous)
  • Build full triphone models
  • Combine low frequency “similar” phones
  • Train again on smaller set
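
To see why triphone models need so much data, it helps to expand a phone string into context-dependent labels and count the possibilities. The sketch assumes a `left-centre+right` label notation, a `sil` symbol for silence at the edges, and a 40-phoneme inventory; all three are illustrative assumptions, as are the phone symbols used for "sow" and "see".

```python
def to_triphones(phones):
    """Label each phone with the phone before and after it."""
    padded = ["sil"] + list(phones) + ["sil"]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]

def possible_triphones(n_phones):
    """Upper bound on distinct triphones: any left, centre, and right phone."""
    return n_phones ** 3

SOW = to_triphones(["s", "ow"])  # assumed pronunciation of "sow"
SEE = to_triphones(["s", "iy"])  # assumed pronunciation of "see"
```

The "s" in "sow" and the "s" in "see" get different labels, so they are trained as separate models. With an assumed inventory of 40 phonemes, `possible_triphones(40)` is 64000, many of which will be rare in any training set, which is the motivation for combining similar low-frequency contexts.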

slide-19
SLIDE 19

But even that’s not enough

  • HMM for words
  • For common words or common in domain
  • E.g. City, State (need more than 3 states)

slide-20
SLIDE 20

Search space is very large

  • Prune Viterbi search
  • Best number of paths
  • Some percentage of probability mass
  • Prune lexical trees
  • Restrict vocabulary
  • Use language model
  • Or even grammar
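
The two Viterbi pruning criteria can be sketched as a filter applied to the active hypotheses at each frame. This is a minimal illustration, assuming hypotheses are held as a dictionary from state to log probability; the function and parameter names are hypothetical.

```python
import math

def prune(hypotheses, max_paths=None, mass=None):
    """Keep the best hypotheses by count and/or by share of probability mass.

    hypotheses maps a state (or path id) to its log probability.
    """
    ranked = sorted(hypotheses.items(), key=lambda kv: kv[1], reverse=True)
    if mass is not None:
        total = sum(math.exp(lp) for _, lp in ranked)
        kept, covered = [], 0.0
        for state, lp in ranked:
            # Stop once the kept hypotheses cover the requested share.
            if covered >= mass * total:
                break
            kept.append((state, lp))
            covered += math.exp(lp)
        ranked = kept
    if max_paths is not None:
        ranked = ranked[:max_paths]
    return dict(ranked)
```

Either criterion can discard the path that would have turned out best later on, so pruned search trades a small risk of error for a much smaller search space.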

slide-21
SLIDE 21

Some computational issues

  • Probabilities are multiplied along paths
  • They get very small
  • Treat probabilities as logs
  • Thus add rather than multiply
  • Typically use negative log probabilities
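
A short sketch of the underflow problem and the log-domain fix, assuming natural logarithms and standard double-precision floats; the function names are illustrative.

```python
import math

def path_probability(probs):
    """Multiply probabilities directly; underflows to 0.0 on long paths."""
    result = 1.0
    for p in probs:
        result *= p
    return result

def path_neg_log(probs):
    """Sum negative log probabilities instead; smaller means more probable."""
    return sum(-math.log(p) for p in probs)

# 200 steps of probability 0.01 each: the true product is 1e-400,
# which is below the smallest positive double.
LONG_PATH = [0.01] * 200
```

Here `path_probability(LONG_PATH)` underflows to zero, while `path_neg_log(LONG_PATH)` stays an ordinary number that can still be compared between paths. With negative logs, the best path is the one with the smallest total.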

slide-22
SLIDE 22

Training

  • How much data do you need
  • As much as you can get
  • More than 10Hrs (100Hrs, 1000Hrs)
  • Can take months to train
  • The larger the models
  • The larger the number of parameters
  • More data needs to be used for training

  • Examples are not equi-probable (finding oy-oy examples is hard)

slide-23
SLIDE 23

The right type of data

  • Training data must match intended domain
  • Male/Female, Native/non-native, UK/US
  • As close to target domain as possible
  • Right channel (cell phone/land line)

slide-24
SLIDE 24

How to improve ASR

  • Get more data
  • Fix bugs

slide-25
SLIDE 25

Summary

  • HMMs
  • Find probability of observation (decoding)
  • Find best path (Viterbi)
  • Train the parameters (Baum-Welch)
  • Bayes’ Law
  • Acoustic model and Language model

slide-26
SLIDE 26

Reading

  • Section 8.2 Definition of Hidden Markov Model pp 380-393
  • Section 8.4 Practical Issues in using HMMs pp 398-405
  • In Huang et al.
  • Two page description of the contents emailed to awb@cs.cmu.edu before 3:30pm Monday 13th September

slide-27
SLIDE 27