Speech Processing 15-492/18-492: Speech Recognition Systems, Other ASR techniques

May 04, 2023

SLIDE 1

Speech Processing 15-492/18-492

Speech Recognition Systems Other ASR techniques

SLIDE 2

ASR Systems

How good are they?

Expected ASR

Factors that make things worse

How good do they need to be?

What can you do with low WER?

SLIDE 3

ASR Tasks

SLIDE 4

What makes it worse

Channel:

  Telephone vs wide band

  Close-talking vs far-field

Style:

  Command and Control

  Limited information getting

  Limited domain but general speech

  Machine directed vs human directed speech

  Broadcast (performance) vs conversational

  Single vs dialog vs multiperson

SLIDE 5

Expected WER: Real-time

Command and Control

  Limited vocabulary and directed speech

  < 10% (< 5% for some users)

Simple Dialog

  Machine directed speech with interested users

  < 20% (but sometimes works with < 30%)

Dictation

  Single speaker, well performed

  < 5% for some users, > 30% for (short term) users

Speech-to-Speech Translation

  Machine mediated, target domain

  < 20% (but will vary for different people)
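
The WER figures quoted on these slides are word error rates. The slides do not define the measure, so the following is a minimal sketch of the standard definition: substitutions plus deletions plus insertions, divided by the number of words in the reference transcript. The function name `word_error_rate` and the example sentences are assumptions made for illustration, and a real scoring tool would also normalize case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of four reference words.
print(word_error_rate("turn the lights on", "turn lights on"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.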

SLIDE 6

Expected WER: offline

Broadcast News

  Large vocabulary, well performed

  < 10% but not real-time (maybe 100 times real time)

Conversational Speech (Call Home)

  Large vocabulary, not well performed

  > 40% WER (depends on particular users and conversations)

Information retrieval

  Large vocabulary, very varied content

  > 60% can still give useful results

SLIDE 7

Other uses

TV show subtitling for the deaf

Court transcription

Medical dictation

Air traffic control transcription

SLIDE 8

Other ASR techniques

Including Articulatory/Phonetic Features (Metze)

Build recognizers for

  Voiced/unvoiced

  Nasality

  Closures (quiet part of stops)

  Aspiration (Fricatives)

  Tongue position

Run all in parallel and “join” them

Combine with more standard approaches

Can be more robust to speaking style
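
The slide does not say how the parallel detectors are "joined", so the following is only one possible sketch, not the method from the cited work. It assumes each detector outputs a probability that its feature is present in the current frame, treats the detectors as independent, and scores each phone by summing log probabilities against a feature table. The table `PHONE_FEATURES`, the function names, and the interpolation weight are all hypothetical.

```python
import math

# Hypothetical, heavily simplified feature table: 1 = present, 0 = absent.
PHONE_FEATURES = {
    "m": {"voiced": 1, "nasal": 1, "fricative": 0},
    "b": {"voiced": 1, "nasal": 0, "fricative": 0},
    "s": {"voiced": 0, "nasal": 0, "fricative": 1},
    "z": {"voiced": 1, "nasal": 0, "fricative": 1},
}


def join_detectors(detector_probs, phone_features=PHONE_FEATURES, floor=1e-6):
    """Score each phone from the outputs of parallel feature detectors.

    detector_probs maps a feature name to P(feature present) for one frame.
    Assumes the detectors are independent, which is a simplification.
    """
    scores = {}
    for phone, features in phone_features.items():
        total = 0.0
        for name, expected in features.items():
            p_present = detector_probs[name]
            p_match = p_present if expected else 1.0 - p_present
            total += math.log(max(p_match, floor))  # floor avoids log(0)
        scores[phone] = total
    return scores


def combine_with_standard(feature_scores, standard_scores, weight=0.5):
    """Interpolate feature-based log scores with a standard model's log scores."""
    return {
        phone: weight * feature_scores[phone] + (1.0 - weight) * standard_scores[phone]
        for phone in feature_scores
    }


frame = {"voiced": 0.9, "nasal": 0.8, "fricative": 0.1}
scores = join_detectors(frame)
print(max(scores, key=scores.get))
```

For the example frame, which is strongly voiced and nasal, the highest-scoring phone in this toy table is "m".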

SLIDE 9

Multi-engine Recognition

Use three recognizers and combine results

ROVER

Combine scores per-sentence

Combine lattices

Confusion networks

Cross adaptation

Interleave systems with adaptation

It usually works better when the systems are different (and both of them are good)
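
As a rough illustration of the voting idea behind ROVER-style combination, the sketch below takes a majority vote in each word slot. It is a simplification under stated assumptions: the hypotheses are already aligned to the same length, whereas a real system must first build that alignment, and confidence scores are ignored. The `NULL` marker and the function name `vote` are hypothetical.

```python
from collections import Counter

NULL = "@"  # hypothetical placeholder meaning "no word in this slot"


def vote(aligned_hypotheses):
    """Majority vote per word slot over pre-aligned recognizer outputs.

    Ties go to the word seen first, so earlier systems in the list win ties.
    """
    if not aligned_hypotheses:
        raise ValueError("need at least one hypothesis")
    length = len(aligned_hypotheses[0])
    if any(len(h) != length for h in aligned_hypotheses):
        raise ValueError("hypotheses must be aligned to the same length")

    result = []
    for slot in zip(*aligned_hypotheses):
        word, _count = Counter(slot).most_common(1)[0]
        if word != NULL:  # a winning NULL means the slot is left empty
            result.append(word)
    return result


hypotheses = [
    ["turn", "the", "lights", "on"],
    ["turn", "a", "lights", "on"],
    ["turn", "the", "light", NULL],
]
print(" ".join(vote(hypotheses)))
```

Each system in the example makes a different error, so the vote recovers "turn the lights on". This is consistent with the point that combination helps most when the systems differ.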

SLIDE 10

Whispered Speech

Doesn’t disturb other people

Can use throat mike

Works in noisy environment

SLIDE 11

Muscle Movement

EMG: Electromyographic Signals

Recognize muscle impulses

Can work in noisy environments

Can work without you making a noise

SLIDE 12

Articulatory Movement

Attach metal studs to:

  Lips, teeth, tongue, velum

Record movement in magnetic field

Non-intrusive

SLIDE 13

EMA: Electromagnetic Articulograph

SLIDE 14

ASR Summary

ASR requires:

Acoustic model

  HMMs trained from lots of data

Pronunciation lexicon

  List of pronunciations for words

Language model

  Trigrams trained from lots of data
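
To make the language model component concrete, here is a minimal sketch of a trigram model estimated by counting. It is unsmoothed maximum likelihood, so any unseen trigram gets probability zero, which is why practical systems add smoothing or backoff. The function names, the `<s>` and `</s>` padding symbols, and the three training sentences are assumptions for illustration.

```python
from collections import defaultdict


def train_trigrams(sentences):
    """Count trigrams and their two-word contexts from whitespace-split text."""
    trigram_counts = defaultdict(int)
    context_counts = defaultdict(int)
    for sentence in sentences:
        # Two start symbols give the first word a full two-word context.
        words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for i in range(2, len(words)):
            context = (words[i - 2], words[i - 1])
            trigram_counts[context + (words[i],)] += 1
            context_counts[context] += 1
    return trigram_counts, context_counts


def trigram_prob(trigram_counts, context_counts, w1, w2, w3):
    """P(w3 | w1, w2) by relative frequency; 0.0 if the context was never seen."""
    seen = context_counts.get((w1, w2), 0)
    if seen == 0:
        return 0.0
    return trigram_counts.get((w1, w2, w3), 0) / seen


corpus = ["call home now", "call home later", "call the office"]
tri, ctx = train_trigrams(corpus)
print(trigram_prob(tri, ctx, "call", "home", "now"))
```

In this toy corpus "call home" is followed by "now" in one of its two occurrences, so the estimate is 0.5.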

SLIDE 15

ASR Trade-offs

More/better training data

  Well transcribed and closest to target system

Better signal

  Better microphone, no noise

Better speaker

  Interested party, know how to speak

Time and memory

  Bigger systems do better

  Greater CPU does better

SLIDE 16

SLIDE 17

Homework 1

Build a speech recognition system

  An acoustic model

  A pronunciation lexicon

  A language model

Note it takes time to build

What is your initial WER?

How did you improve it?

Submitted by 3:30pm Monday 29th Sep

SLIDE 18