Speech Processing 15-492/18-492 Speech Recognition Acoustic - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Speech Processing 15-492/18-492

Speech Recognition: Acoustic modeling, Pronunciation dictionary

slide-2
SLIDE 2

Acoustic Modeling

  • Speech and Signal Variability
  • Measuring Error
  • Pronunciation lexicons

slide-3
SLIDE 3

Variability in Speech Signal

“Mr Wright should write to Ms Wright right away about his Ford or four door Honda.”

  • Homophones: same pronunciation
      • “wright”, “right”, “write”: / r ay t /
      • “ford or”, “four door”: / f ao r d ao r /

slide-4
SLIDE 4

Style Variability

  • Different articulation in different situations
      • Clear vs Conversational
      • Whisper vs shouting
      • Talking to machine, talking to others
      • Frustrated speech

slide-5
SLIDE 5

Speaker variability

  • Gender, age, dialect, health
  • Speaker dependent systems
  • Speaker independent systems
  • Speaker adaptive systems
      • Enrolment stage (acoustics and language)

slide-6
SLIDE 6

Environment Variability

  • Different background noises
      • Office vs Outside
  • Different applications, different environments
      • From desktop dictation to warehouse picking
      • Single speaker vs Multispeaker
      • Background music

slide-7
SLIDE 7

Channel Variability

  • Telephone vs Desktop
      • 8 kHz vs 16 kHz
  • PDA vs Desktop
  • Close-talking vs far-field
  • Cell Phone vs Landline

slide-8
SLIDE 8

Measuring Speech Recognition Error

  • Word Error Rate
      • Substitutions: word is replaced
      • Deletions: word is missed out
      • Insertions: word is added

WER = 100% x (Subs + Dels + Ins) / (number of words in the correct sentence)

slide-9
SLIDE 9

Word Error Rate

  • WER requires:
      • Transcription (the correct word string)
      • Alignment between ASR output and Transcript
      • Not just left to right matching
  • Sometimes Accuracy is given
      • 100 - WER
      • NOT number of words correct
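The alignment step can be sketched as a standard edit-distance search over the two word strings. This is a minimal illustration, not the scoring tool used in the course: the function names `align_counts` and `word_error_rate` are made up for this example, and ties between equally cheap alignments are broken arbitrarily.

```python
def align_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) for a cheapest alignment."""
    n, m = len(reference), len(hypothesis)
    # cost[i][j] = (total errors, subs, dels, ins) for reference[:i] vs hypothesis[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)  # every hypothesis word inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            total, subs, dels, ins = cost[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diagonal = (total, subs, dels, ins)  # words match, no error
            else:
                diagonal = (total + 1, subs + 1, dels, ins)
            total, subs, dels, ins = cost[i - 1][j]
            deletion = (total + 1, subs, dels + 1, ins)
            total, subs, dels, ins = cost[i][j - 1]
            insertion = (total + 1, subs, dels, ins + 1)
            # Tuples compare on total errors first, so the cheapest path wins.
            cost[i][j] = min(diagonal, deletion, insertion)
    _, subs, dels, ins = cost[n][m]
    return subs, dels, ins


def word_error_rate(reference, hypothesis):
    """WER in percent, relative to the number of words in the reference."""
    subs, dels, ins = align_counts(reference, hypothesis)
    return 100.0 * (subs + dels + ins) / len(reference)


reference = "mr wright should write to ms wright".split()
hypothesis = "mr right should write ms wright now".split()
print(align_counts(reference, hypothesis))
print(word_error_rate(reference, hypothesis))
```

Matching the two strings position by position would count four errors here, because everything after the missing "to" is shifted by one word. The alignment finds one substitution, one deletion and one insertion instead.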

slide-10
SLIDE 10

Word Error Rate

  • Can get > 100%
      • But something is very wrong
  • Outputting “the” only, ignoring the speech
      • Sometimes gives WER < 100%
  • All words are treated equal
      • “This specimen” vs “The specimen”
      • “Is absent” vs “Is present”
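A short worked example may make these points concrete. The sentences are illustrative assumptions built from the phrases above, with $N$ the number of words in the correct sentence and the numerator written as Subs + Dels + Ins.

```latex
\begin{align*}
\text{Reference ``the specimen is absent''} &\quad N = 4 \\
\text{Output ``the''} &\quad \mathrm{WER} = 100\% \times \frac{0 + 3 + 0}{4} = 75\% \\
\text{Output ``the specimen is present''} &\quad \mathrm{WER} = 100\% \times \frac{1 + 0 + 0}{4} = 25\% \\[6pt]
\text{Reference ``is absent''} &\quad N = 2 \\
\text{Output ``this is not absent at all''} &\quad \mathrm{WER} = 100\% \times \frac{0 + 0 + 4}{2} = 200\%
\end{align*}
```

Outputting only "the" scores below 100% without listening to the speech at all, while the output with the lowest error rate of the three reverses the meaning of the sentence.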

slide-11
SLIDE 11

Signal Acquisition

  • High signal quality
  • Lower sample rate will increase WER
      • 8 kHz baseline
      • 16 kHz: -10%

slide-12
SLIDE 12

End-Point Detection

  • Long silence will likely increase WER
      • It will recognize phantom words
  • Need to find the speech in the signal
      • VAD (Voice Activity Detection)
      • Find beginning and end of speech
  • Typically do continuous recognition
      • Recognized while listening
      • But need end point (have to wait)
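One very simple way to find the beginning and end of speech is to threshold the energy of short frames. The sketch below assumes that approach; the slides do not say which VAD method is used, and the names `frame_energies` and `find_endpoints`, the 160-sample frame (10 ms at 16 kHz) and the threshold value are all assumptions chosen for illustration.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def find_endpoints(samples, frame_len=160, threshold=0.01):
    """Return (start_sample, end_sample) of the speech region, or None.

    A frame counts as speech when its energy exceeds the threshold
    (an assumed, untuned value).
    """
    energies = frame_energies(samples, frame_len)
    active = [i for i, energy in enumerate(energies) if energy > threshold]
    if not active:
        return None
    return active[0] * frame_len, (active[-1] + 1) * frame_len


# Synthetic signal: silence, a loud stretch, then silence again.
signal = [0.0] * 800 + [0.5, -0.5] * 400 + [0.0] * 800
print(find_endpoints(signal))
```

Practical detectors usually add smoothing and a hangover period, so that short pauses inside an utterance are not treated as the end point.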

slide-13
SLIDE 13

Feature normalization

  • Sometimes do normalization
      • Remove mean from MFCCs
      • Can make recognition more reliable in noise
  • Often include deltas and delta deltas
  • Sometimes do feature reduction
      • Principal Component Analysis
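Mean removal and delta features can be sketched in a few lines. The code assumes the MFCCs are already computed and stored as one list of coefficients per frame. The delta here is a two-point central difference, a simplification chosen for this example; recognizers often use a wider regression window. All function names are hypothetical.

```python
def remove_mean(frames):
    """Subtract the per-coefficient mean taken over the whole utterance."""
    count = len(frames)
    dims = len(frames[0])
    means = [sum(frame[d] for frame in frames) / count for d in range(dims)]
    return [[frame[d] - means[d] for d in range(dims)] for frame in frames]


def deltas(frames):
    """Central difference over time, repeating the edge frames as padding."""
    padded = [frames[0]] + frames + [frames[-1]]
    dims = len(frames[0])
    return [
        [(padded[t + 2][d] - padded[t][d]) / 2.0 for d in range(dims)]
        for t in range(len(frames))
    ]


def add_dynamic_features(frames):
    """Append deltas and delta-deltas to every frame."""
    first = deltas(frames)
    second = deltas(first)
    return [f + d1 + d2 for f, d1, d2 in zip(frames, first, second)]


# Toy two-coefficient "MFCC" frames, values made up for illustration.
frames = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
normalized = remove_mean(frames)
print(normalized)
print(add_dynamic_features(normalized)[1])
```

The second coefficient is constant across frames, so mean removal sets it to zero. A fixed offset of this kind is what a constant channel effect looks like in the cepstral domain.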

slide-14
SLIDE 14

What phones/segments

  • Need the best set for discrimination
      • Not necessarily the same as Linguistic Phones
  • More phones means more training
      • And needs to have consistent Lexicon
  • Extra phones
      • t vs dx
      • t vs nx: / t w eh n t iy / vs / t w eh nx iy /
      • Stops as closures and bursts
      • Schwas: ax and ix
      • Syllabics: el, em, en
      • Accents/Tones: ah1, ah0, …

slide-15
SLIDE 15

Context dependency

  • Care about the contexts of each phone
      • Post vocalic /r/ and /n/ /m/ affect vowel
      • Utterance start and end affect phonemes
  • Need more than simple phone models

slide-16
SLIDE 16

Tri-phone Models

  • Have models for each phone and context
      • 43^3 contexts, about 80K models
  • Not all contexts have enough examples
      • oy (oy) oy very rare
      • sh (ax) n very common
  • Merge tri-phones that are similar
      • E.g. t(ih)n with d(ih)n
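The sparsity problem is easy to see by counting tri-phones in a lexicon. The sketch below writes each tri-phone as left(centre)right, as in the examples above. The three pronunciations and the `sil` boundary symbol are assumptions made up for illustration, not entries from a real dictionary.

```python
from collections import Counter


def triphones(phones, boundary="sil"):
    """List the left(centre)right tri-phones of one pronunciation."""
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}({padded[i]}){padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


def count_triphones(pronunciations):
    """Count how often each tri-phone occurs across all pronunciations."""
    counts = Counter()
    for phones in pronunciations:
        counts.update(triphones(phones))
    return counts


# Hypothetical pronunciations, for illustration only.
toy_lexicon = [
    ["n", "ey", "sh", "ax", "n"],
    ["s", "t", "ey", "sh", "ax", "n"],
    ["t", "ih", "n"],
]
counts = count_triphones(toy_lexicon)
print(43 ** 3)            # number of possible contexts with 43 phones
print(counts["sh(ax)n"])  # seen twice even in this tiny lexicon
print(len(counts))        # distinct tri-phones actually observed
```

With 43 phones there are 79,507 possible contexts, yet only a small fraction of them ever occur in data, which is why similar tri-phones are merged.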

slide-17
SLIDE 17

Find phones to merge

  • Using phonetic features
      • Most similar feature, most similar acoustics
      • Stops, voicing, vowel type …
  • Usually automatic clustering of triphones
      • Using CART trees indexed by phonetic features
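The idea that similar features suggest similar acoustics can be illustrated by counting shared phonetic features. This is a deliberately crude stand-in for the automatic CART clustering mentioned above, not an implementation of it. The small feature table and the function names are assumptions written for this example.

```python
# Hand-written feature table for four phones (illustrative, not complete).
PHONE_FEATURES = {
    "t": {"type": "stop", "voicing": "unvoiced", "place": "alveolar"},
    "d": {"type": "stop", "voicing": "voiced", "place": "alveolar"},
    "n": {"type": "nasal", "voicing": "voiced", "place": "alveolar"},
    "sh": {"type": "fricative", "voicing": "unvoiced", "place": "postalveolar"},
}


def shared_features(a, b):
    """Number of features on which two phones agree."""
    features_a, features_b = PHONE_FEATURES[a], PHONE_FEATURES[b]
    return sum(1 for name in features_a if features_a[name] == features_b[name])


def most_similar(phone):
    """Phone in the table sharing the most features with the given phone."""
    others = [p for p in PHONE_FEATURES if p != phone]
    return max(others, key=lambda p: shared_features(phone, p))


print(shared_features("t", "d"))
print(most_similar("t"))
```

Here t and d differ only in voicing, which is consistent with merging t(ih)n with d(ih)n when one of them has too few training examples.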

slide-18
SLIDE 18

Adaptation

  • Change behavior after use
  • Human adaptation
      • They will change how they speak
  • Channel adaptation
      • Cepstral Normalization
  • Model adaptation
      • Move the means (or weights on means)

slide-19
SLIDE 19

Adaptation

  • Assume recognition is correct
      • (Maybe with some threshold)
  • Modify model to make answer more correct
  • Adaptation to speaker characteristics
  • Adaptation to speaker style
  • Can improve accuracy by a few %
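Moving a Gaussian mean toward a new speaker's data can be sketched as a weighted average of the old mean and the newly observed frames. This MAP-style interpolation is one common choice and is an assumption here, since the slides do not name a specific method. The frames are assumed to have been assigned to this Gaussian using the recognizer's own output, and `tau` is a hypothetical weight on the prior.

```python
def adapt_mean(old_mean, frames, tau=10.0):
    """Shift a Gaussian mean toward adaptation frames.

    tau acts like a count of pseudo-observations at the old mean:
    few frames barely move the mean, many frames dominate it.
    """
    count = len(frames)
    dims = len(old_mean)
    sums = [sum(frame[d] for frame in frames) for d in range(dims)]
    return [(tau * old_mean[d] + sums[d]) / (tau + count) for d in range(dims)]


speaker_independent_mean = [0.0, 0.0]
# Ten made-up feature frames from the new speaker.
new_speaker_frames = [[2.0, 4.0]] * 10
print(adapt_mean(speaker_independent_mean, new_speaker_frames))
```

With ten frames and `tau` set to 10 the adapted mean lands halfway between the old mean and the new data. With no frames it stays where it was.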

slide-20
SLIDE 20

Pronunciation lexicon

  • Need list of words and their pronunciation
      • Pencil: p eh n s ih l
      • Two: t uw
      • Too: t uw
  • Need pronunciation of ALL words

slide-21
SLIDE 21

What’s a word

  • Basic words are clear
  • What about morphological variants
      • walk, walks, walked, walking
  • Multi-word words
      • Los Angeles, New York
  • Contractions
      • Wanna, gonna …
  • Yes, ALL words that you will recognize

slide-22
SLIDE 22

Pronunciation variants

  • Homographs (same writing, different pronunciation)
      • bass: / b ae s / (fish), / b ey s / (music)
      • project: N / p r aa jh eh k t /, V / p r ax jh eh k t /
  • Natural variants
      • route: / r uw t / and / r aw t /
      • coupon: / k uw p ao n / and / k y uw p ao n /
      • water: / w ao t er / and / w ao dx er /
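A lexicon with variants maps each word to a list of pronunciations rather than to a single one. The sketch below stores a few of the entries from these slides in a plain dictionary. The data structure and the helper names `pronunciations` and `homophones` are assumptions for illustration, not the format of any particular dictionary file.

```python
from collections import defaultdict

# Each word maps to one or more pronunciations (lists of phones).
LEXICON = {
    "two": [["t", "uw"]],
    "too": [["t", "uw"]],
    "bass": [["b", "ae", "s"], ["b", "ey", "s"]],
    "route": [["r", "uw", "t"], ["r", "aw", "t"]],
    "water": [["w", "ao", "t", "er"], ["w", "ao", "dx", "er"]],
}


def pronunciations(word):
    """All known pronunciations of a word, or an empty list if it is missing."""
    return LEXICON.get(word.lower(), [])


def homophones(lexicon):
    """Group words that share at least one pronunciation."""
    by_pronunciation = defaultdict(set)
    for word, variants in lexicon.items():
        for phones in variants:
            by_pronunciation[tuple(phones)].add(word)
    return {p: words for p, words in by_pronunciation.items() if len(words) > 1}


print(pronunciations("bass"))
print(homophones(LEXICON))
print(pronunciations("specimen"))  # missing word: empty list
```

The empty result for a missing word is the case the recognizer cannot handle: a word without a pronunciation can never be recognized.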

slide-23
SLIDE 23

CMU Pronunciation Dict

  • Free pronunciation lexicon
      • American English
      • Over 100K words
  • Not always consistent
  • Words for your application will be missing
  • We can never get a complete lexicon

slide-24
SLIDE 24

Pronunciation of Unknown Words

  • Build statistical model from lexicon
      • Predict pronunciation from letters
      • (Humans do this when they see a new word)
  • Typically about 70-85% correct for new words
  • Should always check domain words
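A toy version of predicting pronunciation from letters is to pick, for each letter, the phone it most often maps to in the lexicon. The sketch assumes the training entries are already aligned one phone per letter, with "_" for a silent letter, and that alignment is hand-made and hypothetical. Real letter-to-sound models also look at neighbouring letters and learn the alignment automatically, so this shows only the basic idea.

```python
from collections import Counter, defaultdict

# Hypothetical hand-aligned entries: one phone (or "_" for silent) per letter.
ALIGNED = [
    ("tin", ["t", "ih", "n"]),
    ("din", ["d", "ih", "n"]),
    ("pit", ["p", "ih", "t"]),
    ("tip", ["t", "ih", "p"]),
    ("knit", ["_", "n", "ih", "t"]),
]


def train(aligned):
    """Map each letter to the phone it most often aligns with."""
    counts = defaultdict(Counter)
    for word, phones in aligned:
        for letter, phone in zip(word, phones):
            counts[letter][phone] += 1
    return {letter: c.most_common(1)[0][0] for letter, c in counts.items()}


def predict(model, word):
    """Predict phones letter by letter; raises KeyError for unseen letters."""
    phones = [model[letter] for letter in word]
    return [p for p in phones if p != "_"]


model = train(ALIGNED)
print(predict(model, "dip"))
print(predict(model, "nip"))
```

Because the toy model ignores context, it treats every "k" as silent after seeing only "knit". Errors of this kind are one reason to check the predicted pronunciations of domain words by hand.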

slide-25
SLIDE 25

Modeling Variability

  • In Gaussians (in HMM state)
      • Multiple mixtures
  • In HMM topology
      • Number of states and connectivity
  • In State Tying
      • Sharing Gaussians between states
  • In Phone choice
      • More/less phones
  • In Lexical Pronunciation
      • Multiple lexical entries
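The first item, multiple mixtures inside an HMM state, can be sketched as the log-likelihood of one feature vector under a mixture of Gaussians. Diagonal covariances are assumed here because they keep the example short; the slides do not specify the covariance type. The function names and the parameter values are made up for illustration.

```python
import math


def log_gaussian(x, mean, variance):
    """Log density of a diagonal-covariance Gaussian at x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, variance)
    )


def log_mixture(x, weights, means, variances):
    """Log-likelihood of x under a weighted mixture of Gaussians."""
    logs = [
        math.log(w) + log_gaussian(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)
    # Log-sum-exp keeps the sum stable when the densities are very small.
    return top + math.log(sum(math.exp(value - top) for value in logs))


# Two components covering two ways the same sound might be produced.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
print(log_mixture([0.1, -0.2], weights, means, variances))
print(log_mixture([2.9, 3.1], weights, means, variances))
```

Both test points score well because each lies close to one of the component means, whereas a single Gaussian placed between them would fit neither well.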

slide-26
SLIDE 26

Summary

  • Acoustic modeling
  • Word Error Rate/Accuracy
  • Lexical pronunciation

slide-27
SLIDE 27

Reading

  • Section 8.2 Definition of Hidden Markov Model, pp 380-393
  • Section 8.4 Practical Issues in using HMMs, pp 398-405
  • In Huang et al.
  • Two page description of the contents emailed to dhuggins@cs.cmu.edu before 3:30pm Monday 15th September
