Speech Processing 15-492/18-492 Speech Recognition Acoustic - - PowerPoint PPT Presentation

Mar 10, 2023

SLIDE 1

Speech Processing 15-492/18-492

Speech Recognition

Acoustic modeling

Pronunciation dictionary

SLIDE 2

Acoustic Modeling

Speech and Signal Variability

Measuring Error

Pronunciation lexicons

SLIDE 3

Variability in Speech Signal

“Mr Wright should write to Ms Wright right away about his Ford or four door Honda.”

Homophones: same pronunciation

“wright” “right” “write” / r ay t /

“ford or” “four door” / f ao r d ao r /
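A minimal sketch of why homophones cannot be separated by acoustics alone: inverting a lexicon shows several spellings landing on the same phone string. The entries are the ones on the slide; the name `group_homophones` and the dictionary layout are illustrative assumptions.

```python
from collections import defaultdict

# Toy lexicon using the phone strings from the slide.
LEXICON = {
    "wright": "r ay t",
    "right": "r ay t",
    "write": "r ay t",
    "ford or": "f ao r d ao r",
    "four door": "f ao r d ao r",
}

def group_homophones(lexicon):
    """Map each phone string to the sorted spellings that share it."""
    by_phones = defaultdict(list)
    for word, phones in lexicon.items():
        by_phones[phones].append(word)
    # Keep only phone strings that are ambiguous (two or more spellings).
    return {p: sorted(ws) for p, ws in by_phones.items() if len(ws) > 1}

homophones = group_homophones(LEXICON)
for phones, words in homophones.items():
    print(f"/ {phones} / <- {words}")
```

Choosing between the spellings then has to come from context, for example a language model, rather than from the signal.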

SLIDE 4

Style Variability

Different articulation in different situations

Clear vs Conversational

Whisper vs shouting

Talking to machine, talking to others

Frustrated speech

SLIDE 5

Speaker variability

Gender, age, dialect, health

Speaker dependent systems

Speaker independent systems

Speaker adaptive systems

Enrolment stage (acoustics and language)

SLIDE 6

Environment Variability

Different background noises

Office vs Outside

Different applications, different environments

Desktop dictation, to Warehouse pick

Single speaker vs Multispeaker

Background music

SLIDE 7

Channel Variability

Telephone vs Desktop

8KHz vs 16KHz

PDA vs Desktop

Close-talking vs far-field

Cell Phone vs Landline

SLIDE 8

Measuring Speech Recognition Error

Word Error Rate

Substitutions: word is replaced

Deletions: word is missed out

Insertions: word is added

WER = 100% x (Subs + Dels + Ins) / (words in correct sentence)

SLIDE 9

Word Error Rate

WER requires:

Transcription (the correct word string)

Alignment between ASR output and Transcript

Not just left to right matching

Sometimes Accuracy is given

100 - WER

NOT number of words correct
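The alignment requirement can be sketched with a standard minimum edit distance over words, which also yields the substitution, deletion and insertion counts. This is one common way to do the alignment, not necessarily the scoring tool used in the course; `align_counts` and `wer` are illustrative names.

```python
def align_counts(reference, hypothesis):
    """Return (subs, dels, ins) from a minimum edit distance alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # cost[i][j] = (total_errors, subs, dels, ins) for ref[:i] vs hyp[:j]
    cost = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)   # every reference word deleted
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, 0, j)   # every hypothesis word inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            t, s, d, n = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                best = (t, s, d, n)             # words match, no error
            else:
                best = (t + 1, s + 1, d, n)     # substitution
            t, s, d, n = cost[i - 1][j]
            best = min(best, (t + 1, s, d + 1, n))   # deletion
            t, s, d, n = cost[i][j - 1]
            best = min(best, (t + 1, s, d, n + 1))   # insertion
            cost[i][j] = best
    _, subs, dels, ins = cost[len(ref)][len(hyp)]
    return subs, dels, ins

def wer(reference, hypothesis):
    """Word error rate in percent, relative to the reference length."""
    subs, dels, ins = align_counts(reference, hypothesis)
    return 100.0 * (subs + dels + ins) / len(reference.split())

print(align_counts("the cat sat on the mat", "the cat sat the mat"))
print(wer("the cat sat on the mat", "the cat sat the mat"))
```

In the example the alignment finds a single deletion ("on"). Matching position by position from the left would instead mark every word after "sat" as wrong, which is why plain left to right matching is not enough. Accuracy, when reported, is 100 minus this value.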

SLIDE 10

Word Error Rate

Can get > 100%

But something is very wrong

Outputting “the” only, ignoring the speech

Sometimes gives WER < 100%

All words are treated equal

“This specimen” vs “The specimen”

“Is absent” vs “Is present”
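These three points can be checked numerically. The sketch below uses a compact word-level edit distance; the sentences are made up around the slide's phrases and the function names are illustrative.

```python
def word_errors(ref_words, hyp_words):
    """Minimum number of substitutions, deletions and insertions."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            curr.append(min(prev[j - 1] + (r != h),  # match or substitution
                            prev[j] + 1,             # deletion
                            curr[j - 1] + 1))        # insertion
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate in percent, relative to the reference length."""
    ref = reference.split()
    return 100.0 * word_errors(ref, hypothesis.split()) / len(ref)

# Many insertions: more errors than reference words, so WER exceeds 100%.
print(wer("is absent", "this is the absent minded one"))

# Outputting only "the" still scores below 100% on this reference.
print(wer("the specimen is absent", "the"))

# One harmless error and one meaning-reversing error cost the same.
print(wer("this specimen is absent", "the specimen is absent"))
print(wer("this specimen is absent", "this specimen is present"))
```

The last two lines print the same value even though only the second hypothesis changes the meaning, which is the sense in which all words are treated equally.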

SLIDE 11

Signal Acquisition

High quality signal

Lower sample rate will increase WER

8KHz baseline

16KHz -10%

SLIDE 12

End-Point Detection

Long silence will likely increase WER

It will recognize phantom words

Need to find the speech in the signal

VAD (Voice Activity Detection)

Find beginning and end of speech

Typically do continuous recognition

Recognized while listening

But need end point (have to wait)
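A minimal sketch of end-point detection using frame energy, assuming samples scaled to the range -1 to 1. The slides do not say which VAD method is used; the frame length (160 samples, 10 ms at 16 kHz), the threshold and the `hangover` count are all assumed values for illustration.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each full, non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def find_endpoints(samples, frame_len=160, threshold=0.01, hangover=3):
    """Return (start, end) sample indices of the first speech segment, or None.

    The end point is declared only after `hangover` consecutive quiet frames,
    which is why an online recognizer has to wait before closing an utterance.
    """
    start = None
    quiet = 0
    energies = frame_energies(samples, frame_len)
    for idx, energy in enumerate(energies):
        if energy > threshold:
            if start is None:
                start = idx          # first loud frame: beginning of speech
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet == hangover:
                end = idx - hangover + 1   # first quiet frame after speech
                return start * frame_len, end * frame_len
    if start is None:
        return None
    # Signal ended before the hangover elapsed: close after the last loud frame.
    return start * frame_len, (len(energies) - quiet) * frame_len

signal = [0.0] * 480 + [0.5, -0.5] * 240 + [0.0] * 800
print(find_endpoints(signal))
```

Real detectors usually adapt the threshold to the background noise level; a fixed threshold like this one would struggle in the noisy environments listed earlier.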

SLIDE 13

Feature normalization

Sometimes do normalization

Remove mean from MFCCs

Can make recognition more reliable in noise

Often include deltas and delta deltas

Sometimes do feature reduction

Principal Component Analysis
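Mean removal and the delta features can be sketched on plain lists of per-frame coefficients. The two-point delta below is a simplification chosen for brevity (many systems use a longer regression window), and the function names are illustrative.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean over the utterance from every frame."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in frames]

def deltas(frames):
    """Simple two-point delta: (next - previous) / 2, repeating edge frames."""
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        prev_f = frames[max(t - 1, 0)]
        next_f = frames[min(t + 1, last)]
        out.append([(b - a) / 2.0 for a, b in zip(prev_f, next_f)])
    return out

def add_dynamic_features(frames):
    """Append deltas and delta-deltas to each mean-normalized frame."""
    normed = cepstral_mean_normalize(frames)
    d1 = deltas(normed)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(normed, d1, d2)]

mfccs = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0]]   # toy 2-coefficient frames
for row in add_dynamic_features(mfccs):
    print(row)
```

Each output frame is three times as long as the input frame: static coefficients, then deltas, then delta-deltas.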

SLIDE 14

What phones/segments

Need the best set for discrimination

Not necessarily the same as Linguistic Phones

More phones means more training

And needs to have consistent Lexicon

Extra phones

t vs dx

t vs nx: / t w eh n t iy / vs / t w eh nx iy /

Stops as closures and bursts

Schwas: ax and ix

Syllabics: el, em, en

Accents/Tones: ah1, ah0, ….

SLIDE 15

Context dependency

Care about the contexts of each phone

Post vocalic /r/ and /n/ /m/ affect vowel

Utterance start and end affect phonemes

Need more than simple phone models

SLIDE 16

Tri-phone Models

Have models for each phone and context

43^3 contexts about 80K models

Not all contexts have enough examples

oy (oy) oy very rare

sh (ax) n very common

Merge tri-phones that are similar

E.g. t(ih)n with d(ih)n
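A short sketch of the triphone bookkeeping, using the slide's left(center)right notation. The `sil` boundary symbol and the function name are assumptions for illustration.

```python
def to_triphones(phones, boundary="sil"):
    """Expand a phone sequence into left(center)right triphone labels."""
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}({padded[i]}){padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]

# Every phone in every left and right context: the "about 80K" on the slide.
print(43 ** 3)

# One word contributes only a handful of those contexts.
print(to_triphones("p eh n s ih l".split()))
```

A training set covers the possible contexts very unevenly, which is what motivates merging similar triphones.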

SLIDE 17

Find phones to merge

Using phonetic features

Most similar feature, most similar acoustics

Stops, voicing, vowel type …

Usually automatic cluster of triphones

Using CART trees indexed by phonetic features
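To illustrate the idea of merging by phonetic features, the sketch below groups triphones whose contexts share a manner of articulation. This is a fixed, hand-written rule standing in for a data-driven CART tree, and the feature table is a hypothetical, heavily reduced one.

```python
from collections import defaultdict

# Hypothetical feature table: phone -> (manner, voiced).
FEATURES = {
    "t": ("stop", False),
    "d": ("stop", True),
    "s": ("fricative", False),
    "n": ("nasal", True),
    "m": ("nasal", True),
}

def parse(triphone):
    """Split 'l(c)r' into (left, center, right)."""
    left, rest = triphone.split("(")
    center, right = rest.split(")")
    return left, center, right

def cluster_by_manner(triphones):
    """Group triphones sharing a center phone and the manner of both contexts."""
    clusters = defaultdict(list)
    for tri in triphones:
        left, center, right = parse(tri)
        key = (FEATURES[left][0], center, FEATURES[right][0])
        clusters[key].append(tri)
    return dict(clusters)

clusters = cluster_by_manner(["t(ih)n", "d(ih)n", "s(ih)n", "t(ih)m"])
for key, members in clusters.items():
    print(key, members)
```

Here t(ih)n and d(ih)n fall into the same cluster because t and d differ only in voicing. A trained tree would instead pick, at each node, the feature question that best separates the acoustic data.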

SLIDE 18

Adaptation

Change behavior after use

Human adaptation

They will change how they speak

Channel adaptation

Cepstral Normalization

Model adaptation

Move the means (or weights on means)
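"Move the means" can be sketched as interpolating a Gaussian mean toward the frames observed from the new speaker. The slides do not name a specific scheme; the update below is a MAP-style form with an assumed prior weight `tau`, and it assumes every frame has already been assigned to this one Gaussian.

```python
def adapt_mean(old_mean, frames, tau=10.0):
    """Shift a Gaussian mean toward the average of the adaptation frames.

    `tau` (an assumed prior weight) controls how much data is needed before
    the new speaker's frames dominate the original mean.
    """
    n = len(frames)
    dims = len(old_mean)
    sums = [sum(f[d] for f in frames) for d in range(dims)]
    return [(tau * old_mean[d] + sums[d]) / (tau + n) for d in range(dims)]

# Ten frames at (2, 4) pull a mean at the origin halfway when tau is 10.
print(adapt_mean([0.0, 0.0], [[2.0, 4.0]] * 10, tau=10.0))
```

With few adaptation frames the mean barely moves, and with many it approaches the speaker's own average.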

SLIDE 19

Adaptation

Assume recognition is correct

(Maybe with some threshold)

Modify model to make answer more correct

Adaptation to speaker characteristics

Adaptation to speaker style

Can improve accuracy by a few %

SLIDE 20

Pronunciation lexicon

Need list of words and their pronunciation

Pencil p eh n s ih l

Two t uw

Too t uw

…

Need pronunciation of ALL words
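A minimal sketch of loading a lexicon in the "word phone phone ..." layout shown above and checking a vocabulary against it. The file layout and the function names are assumptions; real lexicon formats differ in detail.

```python
LEXICON_TEXT = """\
pencil p eh n s ih l
two t uw
too t uw
"""

def load_lexicon(text):
    """Parse 'word phone phone ...' lines into {word: [phones]}."""
    lexicon = {}
    for line in text.splitlines():
        if not line.strip():
            continue                      # skip blank lines
        word, *phones = line.split()
        lexicon[word.lower()] = phones
    return lexicon

def missing_words(vocabulary, lexicon):
    """Words the recognizer is expected to handle but cannot pronounce."""
    return sorted(w for w in set(vocabulary) if w.lower() not in lexicon)

lexicon = load_lexicon(LEXICON_TEXT)
print(lexicon["pencil"])
print(missing_words(["two", "pencil", "erasers"], lexicon))
```

Running a check like `missing_words` over the application vocabulary before building a recognizer is one way to enforce the "ALL words" requirement.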

SLIDE 21

What’s a word

Basic words are clear

What about morphological variants

walk, walks, walked, walking

Multi-word words

Los Angeles, New York

Contractions

Wanna, gonna …

Yes ALL words that you will recognize

SLIDE 22

Pronunciation variants

Homographs: (same writing different pronunciation)

bass: / b ae s / (fish) / b ey s / (music)

project: N / p r aa jh eh k t / V / p r ax jh eh k t /

Natural variants

route: / r uw t / and / r aw t /

coupon: / k uw p ao n / and / k y uw p ao n /

water: / w ao t er / and / w ao dx er /
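One way to hold variants is a lexicon that maps each word to a list of pronunciations. The sketch below uses the slide's entries and enumerates every phone string a word sequence can take; the data layout and the `pronunciations` helper are illustrative assumptions.

```python
from itertools import product

# Each word maps to all of its listed pronunciations.
LEXICON = {
    "bass": ["b ae s", "b ey s"],
    "project": ["p r aa jh eh k t", "p r ax jh eh k t"],
    "route": ["r uw t", "r aw t"],
    "coupon": ["k uw p ao n", "k y uw p ao n"],
    "water": ["w ao t er", "w ao dx er"],
}

def pronunciations(sentence, lexicon):
    """Enumerate every phone string the lexicon allows for a word sequence."""
    options = [lexicon[w] for w in sentence.split()]
    return [" | ".join(choice) for choice in product(*options)]

for variant in pronunciations("water route", LEXICON):
    print(variant)
```

Two words with two variants each already give four candidate phone strings, so the number of paths the recognizer must consider grows quickly with the number of variants.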

SLIDE 23

CMU Pronunciation Dict

Free pronunciation lexicon

American English

Over 100K words

Not always consistent

Words for your application will be missing

We can never get a complete lexicon

SLIDE 24

Pronunciation of Unknown Words

Build statistical model from lexicon

Predict pronunciation from letters

(Humans do this when they see a new word)

Typically about 70-85% correct for new words

Should always check domain words
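A toy sketch of predicting pronunciation from letters. It assumes the training entries are already aligned one phone (or "_" for a silent letter) per letter, and it simply picks the most frequent phone for a letter in its context, backing off to the letter alone. The slides only say "statistical model", so this particular model, the training data and the names are all assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical pre-aligned entries: one phone (or "_" for silent) per letter.
ALIGNED = [
    ("two", ["t", "_", "uw"]),
    ("too", ["t", "uw", "_"]),
    ("ten", ["t", "eh", "n"]),
    ("net", ["n", "eh", "t"]),
    ("pen", ["p", "eh", "n"]),
]

def train(aligned):
    """Count phones per (previous, letter, next) context and per letter."""
    ctx_counts = defaultdict(Counter)
    letter_counts = defaultdict(Counter)
    for word, phones in aligned:
        padded = "#" + word + "#"         # "#" marks the word boundary
        for i, phone in enumerate(phones):
            key = (padded[i], padded[i + 1], padded[i + 2])
            ctx_counts[key][phone] += 1
            letter_counts[padded[i + 1]][phone] += 1
    return ctx_counts, letter_counts

def predict(word, model):
    """Most frequent phone per letter, backing off from context to letter."""
    ctx_counts, letter_counts = model
    padded = "#" + word + "#"
    phones = []
    for i in range(len(word)):
        key = (padded[i], padded[i + 1], padded[i + 2])
        counts = ctx_counts.get(key) or letter_counts.get(padded[i + 1])
        if counts is None:
            phones.append("?")            # letter never seen in training
            continue
        phone = counts.most_common(1)[0][0]
        if phone != "_":                  # silent letters produce no phone
            phones.append(phone)
    return phones

model = train(ALIGNED)
print(predict("pet", model))
```

The word "pet" is not in the training entries, so its middle letter is predicted by the backoff counts. Predictions like this still need checking by hand for domain words.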

SLIDE 25

Modeling Variability

In Gaussians (in HMM state)

Multiple mixtures

In HMM topology

Number of states and connectivity

In State Tying

Sharing Gaussians between states

In Phone choice

More/less phones

In Lexical Pronunciation

Multiple lexical entries
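The first point, multiple mixtures inside an HMM state, can be sketched as the log likelihood of a feature vector under a mixture of diagonal-covariance Gaussians. Diagonal covariances and the toy parameters are assumptions for illustration.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def log_mixture(x, weights, means, variances):
    """Log likelihood of x under a Gaussian mixture (log-sum-exp for stability)."""
    terms = [
        math.log(w) + log_gaussian(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

x = [2.0]
single = log_mixture(x, [1.0], [[0.0]], [[1.0]])
double = log_mixture(x, [0.5, 0.5], [[-2.0], [2.0]], [[1.0], [1.0]])
print(single, double)
```

In this toy case the two-component mixture scores the observation higher than the single Gaussian, because one component sits where the data actually falls. That is how extra mixture components absorb speaker and channel variability within a state.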

SLIDE 26

Summary

Acoustic modeling

Word Error Rate/Accuracy

Lexical pronunciation

SLIDE 27

Reading

Section 8.2 Definition of Hidden Markov Model pp 380-393

Section 8.4 Practical Issues in using HMMs pp 398-405

In Huang et al.

Two page description of the contents emailed to dhuggins@cs.cmu.edu before 3:30pm Monday 15th September
