SLIDE 1

Speech Processing 11-492/18-492

Speech Recognition: Acoustic modeling, Pronunciation dictionary

SLIDE 2

Acoustic Modeling

• Speech and Signal Variability
• Measuring Error
• Pronunciation lexicons

SLIDE 3

Variability in Speech Signal

• “Mr Wright should write to Ms Wright right away about his Ford or four door Honda.”
• Homophones: same pronunciation
    • “wright” “right” “write” / r ay t /
    • “ford or” “four door” / f ao r d ao r /
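
The homophone problem on this slide can be made concrete with a toy lexicon. The following is a minimal sketch, assuming a hand-typed dictionary built from the slide's phone strings; the names `LEXICON`, `homophone_sets`, and `phones` are illustrative, not part of any course toolkit.

```python
from collections import defaultdict

# Toy lexicon using the slide's phone strings; real lexicons are far larger.
LEXICON = {
    "wright": "r ay t",
    "right": "r ay t",
    "write": "r ay t",
    "ford": "f ao r d",
    "or": "ao r",
    "four": "f ao r",
    "door": "d ao r",
}

def homophone_sets(lexicon):
    """Group words that share exactly the same phone string."""
    by_pron = defaultdict(list)
    for word, pron in lexicon.items():
        by_pron[pron].append(word)
    return {pron: sorted(words) for pron, words in by_pron.items() if len(words) > 1}

def phones(words, lexicon):
    """Concatenate the pronunciations of a word sequence."""
    return " ".join(lexicon[w] for w in words)

print(homophone_sets(LEXICON))
# Different word sequences, identical phone string:
print(phones(["ford", "or"], LEXICON) == phones(["four", "door"], LEXICON))
```

Because the phone strings are identical, the acoustic model alone cannot choose between these word sequences; something beyond the acoustics has to decide.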

SLIDE 4

Style Variability

• Different articulation in different situations
    • Clear vs Conversational
    • Whisper vs shouting
    • Talking to machine, talking to others
    • Frustrated speech

SLIDE 5

Speaker variability

• Gender, age, dialect, health
• Speaker dependent systems
• Speaker independent systems
• Speaker adaptive systems
    • Enrolment stage (acoustics and language)

SLIDE 6

Environment Variability

• Different background noises
    • Office vs Outside
• Different applications, different environments
    • Desktop dictation to warehouse picking
    • Single speaker vs Multispeaker
    • Background music

SLIDE 7

Channel Variability

• Telephone vs Desktop
    • 8 kHz vs 16 kHz
• Mobile vs Desktop
    • Close-talking vs far-field
• Cell Phone vs Landline vs VoIP

SLIDE 8

Measuring Speech Recognition Error

• Word Error Rate
    • Substitutions: word is replaced
    • Deletions: word is missed out
    • Insertions: word is added

$$\mathrm{WER} = 100\% \times \frac{\text{Subs} + \text{Dels} + \text{Ins}}{\text{words in correct sentence}}$$

SLIDE 9

Word Error Rate

• WER requires:
    • Transcription (the correct word string)
    • Alignment between ASR output and transcript
        • Not just left-to-right matching
• Sometimes Accuracy is given
    • 100 - WER
    • NOT number of words correct
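
The need for alignment, rather than left-to-right matching, can be sketched as follows. This is a minimal illustration assuming equal cost for substitutions, deletions, and insertions; the function names (`align_counts`, `naive_errors`, `wer`) are hypothetical, and standard scoring tools may break ties between equally good alignments differently.

```python
def align_counts(ref, hyp):
    """Return (subs, dels, ins) from a minimum-edit-distance word alignment."""
    r, h = ref.split(), hyp.split()
    # cost[i][j] = (edits, subs, dels, ins) for r[:i] aligned with h[:j]
    cost = [[None] * (len(h) + 1) for _ in range(len(r) + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, len(r) + 1):
        cost[i][0] = (i, 0, i, 0)              # only deletions
    for j in range(1, len(h) + 1):
        cost[0][j] = (j, 0, 0, j)              # only insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            e, s, d, n = cost[i - 1][j - 1]
            if r[i - 1] == h[j - 1]:
                best = (e, s, d, n)                 # match, no cost
            else:
                best = (e + 1, s + 1, d, n)         # substitution
            e, s, d, n = cost[i - 1][j]
            best = min(best, (e + 1, s, d + 1, n))  # deletion
            e, s, d, n = cost[i][j - 1]
            best = min(best, (e + 1, s, d, n + 1))  # insertion
            cost[i][j] = best
    _, subs, dels, ins = cost[len(r)][len(h)]
    return subs, dels, ins

def naive_errors(ref, hyp):
    """Position-by-position comparison with no alignment (for contrast only)."""
    r, h = ref.split(), hyp.split()
    return sum(a != b for a, b in zip(r, h)) + abs(len(r) - len(h))

def wer(ref, hyp):
    n_ref = len(ref.split())
    if n_ref == 0:
        raise ValueError("reference must contain at least one word")
    return 100.0 * sum(align_counts(ref, hyp)) / n_ref

ref = "the cat sat on the mat"
hyp = "cat sat on the mat"
print(naive_errors(ref, hyp))   # 6: every position looks wrong
print(align_counts(ref, hyp))   # (0, 1, 0): a single deletion
print(round(wer(ref, hyp), 1))  # 16.7
```

Dropping one word shifts every later word by one position, so position-by-position matching counts six errors where the alignment finds one.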

SLIDE 10

Word Error Rate

• Can get > 100%
    • But something is very wrong
• Outputting “the” only, ignoring the speech
    • Sometimes gives WER < 100%
• All words are treated equal
    • “This specimen” vs “The specimen”
    • “Is absent” vs “Is present”
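
These three properties of WER can be checked numerically. The sketch below assumes a plain word-level Levenshtein distance; the sentences other than the slide's “specimen” and “absent/present” phrases are made up for illustration.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance; every edit costs 1."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # match or substitution
        prev = curr
    return prev[-1]

def wer(ref, hyp):
    ref_words = ref.split()
    return 100.0 * edit_distance(ref_words, hyp.split()) / len(ref_words)

# Insertions can push WER past 100%: one reference word, three extra words.
print(wer("yes", "yes it is fine"))                                 # 300.0
# Always answering "the" still scores below 100% on this reference.
print(round(wer("the cat sat on the mat", "the"), 1))               # 83.3
# One substitution each: identical WER, very different consequences.
print(wer("this specimen is absent", "the specimen is absent"))     # 25.0
print(wer("this specimen is absent", "this specimen is present"))   # 25.0
```

The last two lines score the same even though only the second error reverses the meaning of the sentence.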

SLIDE 11

Signal Acquisition

• High signal quality
• Lower sample rate will increase WER
    • 8 kHz: baseline
    • 16 kHz: -10%

SLIDE 12

End-Point Detection

• Long silence will likely increase WER
    • It will recognize phantom words
• Need to find the speech in the signal
    • VAD (Voice Activity Detection)
    • Find beginning and end of speech
• Typically do continuous recognition
    • Recognized while listening
    • But need end point (have to wait)
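
One simple way to find the beginning and end of speech is to threshold short-time energy. The sketch below is only that: a fixed-threshold energy detector, assuming samples scaled to [-1, 1]. The `frame_len`, `threshold`, and `hangover` values are illustrative assumptions, and practical VADs are considerably more robust to noise.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def find_endpoints(samples, frame_len=160, threshold=0.01, hangover=3):
    """Return (start_sample, end_sample) of the speech region, or None.

    threshold and hangover are illustrative values, not tuned settings.
    """
    active = [e > threshold for e in frame_energies(samples, frame_len)]
    if not any(active):
        return None
    first = active.index(True)
    last = len(active) - 1 - active[::-1].index(True)
    # Keep a few trailing frames so a weak word ending is not clipped.
    last = min(last + hangover, len(active) - 1)
    return first * frame_len, (last + 1) * frame_len

# Synthetic signal: silence, a loud burst, then a long silence.
signal = [0.0] * 800 + [0.5 if i % 2 else -0.5 for i in range(1600)] + [0.0] * 1600
print(find_endpoints(signal))  # (800, 2880)
```

The `hangover` frames show why an end point forces the recognizer to wait: the detector cannot declare the end of speech until some silence has already passed.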

SLIDE 13

Feature normalization

• Sometimes do normalization
    • Remove mean from MFCCs
    • Can make recognition more reliable in noise
• Often include deltas and delta deltas
• Sometimes do feature reduction
    • Principal Component Analysis
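
Mean removal and the delta features mentioned above can be sketched in a few lines. This assumes the MFCCs are already computed and stored as a list of per-frame coefficient lists; the two-point difference used for the deltas is a simplification chosen for brevity (a longer regression window is a common alternative), and all function names are illustrative.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-utterance mean of each coefficient."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in frames]

def deltas(frames):
    """Two-point difference (c[t+1] - c[t-1]) / 2, repeating the edge frames."""
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        nxt = frames[min(t + 1, last)]
        prv = frames[max(t - 1, 0)]
        out.append([(a - b) / 2.0 for a, b in zip(nxt, prv)])
    return out

def add_dynamic_features(frames):
    """Concatenate static, delta, and delta-delta coefficients per frame."""
    norm = cepstral_mean_normalize(frames)
    d1 = deltas(norm)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(norm, d1, d2)]

mfcc = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # made-up 2-coefficient frames
print(add_dynamic_features(mfcc)[0])           # [-2.0, -2.0, 1.0, 1.0, 0.5, 0.5]
```

A constant offset added to every frame, which is roughly how a fixed channel shows up in the cepstral domain, disappears after mean removal.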

SLIDE 14

What phones/segments

• Need the best set for discrimination
    • Not necessarily the same as linguistic phones
    • More phones means more training
    • And needs to have a consistent lexicon
• Extra phones
    • t vs dx
    • t vs nx: / t w eh n t iy / vs / t w eh nx iy /
    • Stops as closures and bursts
    • Schwas: ax and ix
    • Syllabics: el, em, en
    • Accents/Tones: ah1, ah0, …

SLIDE 15

Context dependency

• Care about the contexts of each phone
    • Post-vocalic /r/, /n/ and /m/ affect the vowel
    • Utterance start and end affect phonemes
• Need more than simple phone models

SLIDE 16

Tri-phone Models

• Have models for each phone and context
    • 43^3 contexts, about 80K models
• Not all contexts have enough examples
    • oy (oy) oy very rare
    • sh (ax) n very common
• Merge tri-phones that are similar
    • E.g. t(ih)n with d(ih)n
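
The size of the triphone inventory and the way triphone labels are derived from a phone string can be sketched as below. The `sil` boundary symbol used to pad utterance edges and the tiny three-entry corpus are assumptions for illustration.

```python
from collections import Counter

N_PHONES = 43
print(N_PHONES ** 3)  # 79507, i.e. "about 80K"

def triphones(phone_seq, boundary="sil"):
    """Return left(centre)right labels, padding utterance edges with `boundary`."""
    padded = [boundary] + list(phone_seq) + [boundary]
    return [f"{padded[i - 1]}({padded[i]}){padded[i + 1]}"
            for i in range(1, len(padded) - 1)]

print(triphones("p eh n s ih l".split()))

# Counting over a (tiny, made-up) corpus shows how uneven the coverage is.
counts = Counter()
for pron in ["p eh n s ih l", "t uw", "t uw"]:
    counts.update(triphones(pron.split()))
print(counts.most_common(2))
```

On real data most of the 79,507 possible labels are seen rarely or never, which is the motivation for merging similar triphones.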

SLIDE 17

Find phones to merge

• Using phonetic features
    • Most similar features, most similar acoustics
    • Stops, voicing, vowel type …
• Usually automatic clustering of triphones
    • Using CART trees indexed by phonetic features
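
The idea of merging triphones whose contexts share phonetic features can be illustrated with a fixed rule. This is deliberately not the data-driven CART clustering named on the slide, where the feature questions are chosen automatically from training data; it is a hand-written stand-in that ignores voicing, and the small `FEATURES` table is a hypothetical fragment.

```python
# Hypothetical, hand-written feature table covering only a few phones.
# Voicing is left out on purpose so that t and d fall into the same class.
FEATURES = {
    "t": ("stop", "alveolar"),
    "d": ("stop", "alveolar"),
    "p": ("stop", "bilabial"),
    "b": ("stop", "bilabial"),
    "n": ("nasal", "alveolar"),
    "m": ("nasal", "bilabial"),
}

def tie_key(left, centre, right):
    """Triphones with the same key would share one model in this sketch."""
    return (FEATURES[left], centre, FEATURES[right])

def cluster(triphone_list):
    """Group (left, centre, right) triples by their feature-based key."""
    clusters = {}
    for tri in triphone_list:
        clusters.setdefault(tie_key(*tri), []).append(tri)
    return list(clusters.values())

tris = [("t", "ih", "n"), ("d", "ih", "n"), ("p", "ih", "n"), ("b", "ih", "m")]
for group in cluster(tris):
    print(group)
# t(ih)n and d(ih)n end up together; the other two stay separate.
```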

SLIDE 18

Adaptation

• Change behavior after use
• Human adaptation
    • They will change how they speak
• Channel adaptation
    • Cepstral Normalization
• Model adaptation
    • Move the means (or weights on means)
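
“Move the means” can be illustrated with a MAP-style interpolation between the existing Gaussian mean and the average of the new speaker's frames. The slide does not name a specific method, so this choice of update rule is an assumption; `tau` is a hypothetical prior weight.

```python
def adapt_mean(prior_mean, frames, tau=10.0):
    """Shift a Gaussian mean toward the adaptation data.

    tau (a hypothetical prior weight) controls how much data is needed
    before the original mean is overridden.
    """
    n = len(frames)
    dims = len(prior_mean)
    sums = [sum(f[d] for f in frames) for d in range(dims)]
    return [(tau * prior_mean[d] + sums[d]) / (tau + n) for d in range(dims)]

prior = [0.0, 0.0]
speaker_frames = [[2.0, 4.0]] * 10     # made-up frames assigned to this Gaussian
print(adapt_mean(prior, speaker_frames))        # [1.0, 2.0]: halfway with n == tau
print(adapt_mean(prior, speaker_frames * 100))  # close to [2.0, 4.0]
```

With little data the mean barely moves; with a lot of data it approaches the speaker's own average.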

SLIDE 19

Adaptation

• Assume recognition is correct
    • (Maybe with some threshold)
    • Modify model to make answer more correct
• Adaptation to speaker characteristics
• Adaptation to speaker style
• Can improve accuracy by a few %

SLIDE 20

Pronunciation lexicon

• Need list of words and their pronunciation
    • Pencil    p eh n s ih l
    • Two       t uw
    • Too       t uw
    • …
• Need pronunciation of ALL words
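
The requirement that every word be in the lexicon can be checked mechanically before recognition. A minimal sketch, assuming a plain-text lexicon with one `word phone phone ...` entry per line; `parse_lexicon`, `missing_words`, and the example vocabulary are illustrative.

```python
def parse_lexicon(text):
    """Parse 'word phone phone ...' lines into {word: [phones]}."""
    lexicon = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue                      # skip blank lines
        word, *phones = fields
        lexicon[word.lower()] = phones
    return lexicon

LEXICON = parse_lexicon("""
Pencil p eh n s ih l
Two t uw
Too t uw
""")

def missing_words(vocabulary, lexicon):
    """Words the recognizer is expected to handle but cannot pronounce."""
    return sorted(w for w in set(vocabulary) if w.lower() not in lexicon)

print(LEXICON["pencil"])                                     # ['p', 'eh', 'n', 's', 'ih', 'l']
print(missing_words(["two", "pencil", "eraser"], LEXICON))   # ['eraser']
```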

SLIDE 21

What’s a word

• Basic words are clear
• What about morphological variants
    • walk, walks, walked, walking
• Multi-word words
    • Los Angeles, New York
• Contractions
    • Wanna, gonna …
• Yes, ALL words that you will recognize

SLIDE 22

Pronunciation variants

• Homographs (same writing, different pronunciation)
    • bass: / b ae s / (fish), / b ey s / (music)
    • project: N / p r aa jh eh k t /, V / p r ax jh eh k t /
• Natural variants
    • route: / r uw t / and / r aw t /
    • coupon: / k uw p ao n / and / k y uw p ao n /
    • water: / w ao t er / and / w ao dx er /
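
Variants mean the lexicon maps each word to a list of pronunciations rather than a single one. The sketch below assumes a small in-memory dictionary built from the slide's entries (plus “two” from an earlier slide) and enumerates every phone string a word sequence could produce; `VARIANTS` and `phone_sequences` are illustrative names.

```python
from itertools import product

VARIANTS = {
    "route": [["r", "uw", "t"], ["r", "aw", "t"]],
    "water": [["w", "ao", "t", "er"], ["w", "ao", "dx", "er"]],
    "two": [["t", "uw"]],
}

def phone_sequences(words, lexicon):
    """All phone strings the word sequence could produce."""
    choices = [lexicon[w] for w in words]
    return [" ".join(p for pron in combo for p in pron)
            for combo in product(*choices)]

print(phone_sequences(["two", "route"], VARIANTS))
# ['t uw r uw t', 't uw r aw t']
print(len(phone_sequences(["route", "water"], VARIANTS)))  # 4
```

The number of candidate phone strings multiplies with each multi-variant word, which is one reason variants are added to a lexicon sparingly.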

SLIDE 23

CMU Pronunciation Dict

• Free pronunciation lexicon
    • American English
    • Over 100K words
    • Not always consistent
• Words for your application will be missing
    • We can never get a complete lexicon

SLIDE 24

Pronunciation of Unknown Words

• Build statistical model from lexicon
    • Predict pronunciation from letters
    • (Humans do this when they see a new word)
• Typically about 70-85% correct for new words
• Should always check domain words

SLIDE 25

Modeling Variability

• In Gaussians (in HMM state)
    • Multiple mixtures
• In HMM topology
    • Number of states and connectivity
• In State Tying
    • Sharing Gaussians between states
• In Phone choice
    • More/fewer phones
• In Lexical Pronunciation
    • Multiple lexical entries
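
The first item, multiple mixture components per HMM state, can be sketched as a diagonal-covariance Gaussian mixture log-likelihood. The parameter values below are made up; the point is only that two components can cover two clusters of realizations (for example two speaker groups) that a single Gaussian fits poorly.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def log_gmm(x, weights, means, variances):
    """Log-likelihood of x under a mixture, using log-sum-exp for stability."""
    logs = [
        math.log(w) + log_gaussian(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(logs)
    return peak + math.log(sum(math.exp(l - peak) for l in logs))

x = [2.0]
single = log_gmm(x, [1.0], [[0.0]], [[1.0]])
mixture = log_gmm(x, [0.5, 0.5], [[-2.0], [2.0]], [[1.0], [1.0]])
print(mixture > single)  # True: the two-component state explains x better
```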

SLIDE 26

Summary

• Acoustic modeling
• Word Error Rate/Accuracy
• Lexical pronunciation

SLIDE 27