Speech Processing 15-492/18-492: Speech Recognition Systems, Other ASR techniques


SLIDE 1

Speech Processing 15-492/18-492

Speech Recognition Systems
Other ASR techniques

SLIDE 2

ASR Systems

  • How good are they?
  • Expected ASR
  • Factors that make things worse
  • How good do they need to be?
  • What can you do with low WER?

SLIDE 3

ASR Tasks

SLIDE 4

What makes it worse

  • Channel
      • Telephone vs Wide band
      • Close-talking vs far-field
  • Style:
      • Command and Control
      • Limited information getting
      • Limited domain but general speech
      • Machine directed vs Human directed speech
      • Broadcast (performance) vs Conversational
      • Single vs Dialog vs Multiperson

SLIDE 5

Expected WER: Real-time

  • Command and Control
      • Limited vocabulary and directed speech
      • < 10% (< 5% for some users)
  • Simple Dialog
      • Machine directed speech with interested users
      • < 20% (but sometimes works with < 30%)
  • Dictation
      • Single speaker, well performed
      • < 5% for some users, > 30% for (short term) users
  • Speech-to-Speech Translation
      • Machine mediated, target domain
      • < 20% (but will vary for different people)
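
The WER figures quoted here are word error rates: the number of substituted, deleted and inserted words divided by the number of words in the reference transcript. The slides do not show how it is computed, so the following is only a minimal sketch using the standard word-level edit distance. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()  # assumption: words are separated by whitespace
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # consumed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion: reference word missing
                curr[j - 1] + 1,         # insertion: extra hypothesis word
                prev[j - 1] + (r != h),  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("lights" -> "light") in a four-word reference.
    print(word_error_rate("turn the lights off", "turn the light off"))  # 0.25
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.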

SLIDE 6

Expected WER: offline

  • Broadcast News
      • Large vocabulary, well performed
      • < 10% but not real-time (maybe 100 times real time)
  • Conversational Speech (Call Home)
      • Large vocabulary, not well performed
      • > 40% WER (depends on particular users and conversations)
  • Information retrieval
      • Large vocabulary, very varied content
      • > 60% can still give useful results
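
"100 times real time" means the recognizer needs roughly 100 seconds of computation per second of audio. A small sketch of that arithmetic follows; the helper names and the 30-minute example are illustrative assumptions, not figures from the course.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Ratio of compute time to audio duration (1.0 means exactly real time)."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def processing_hours(audio_hours: float, rtf: float) -> float:
    """Compute time needed for a recording at a given real-time factor."""
    return audio_hours * rtf


if __name__ == "__main__":
    # Hypothetical example: a 30-minute broadcast decoded at 100x real time.
    print(processing_hours(0.5, 100))  # 50.0 hours of compute
```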

SLIDE 7

Other uses

  • TV show subtitling for the deaf
  • Court transcription
  • Medical dictation
  • Air traffic control transcription

SLIDE 8

Other ASR techniques

  • Including Articulatory/Phonetic Features (Metze)
  • Build recognizers for
      • Voiced/unvoiced
      • Nasality
      • Closures (quiet part of stops)
      • Aspiration (Fricatives)
      • Tongue position
  • Run all in parallel and “join” them
  • Combine with more standard approaches
  • Can be more robust to speaking style
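
The slide does not say how the parallel feature recognizers are "joined". One simple possibility, assumed here purely for illustration, is a log-linear combination: each stream (a standard acoustic model plus one detector per feature) gives a log score to every candidate phone, and the scores are summed with per-stream weights. The stream names, weights and probabilities below are made up.

```python
import math
from typing import Dict


def combine_streams(
    stream_scores: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> Dict[str, float]:
    """Weighted sum of per-stream log scores for each candidate phone."""
    # Only score candidates that every stream has an opinion about.
    candidates = set.intersection(*(set(s) for s in stream_scores.values()))
    return {
        c: sum(weights[name] * scores[c] for name, scores in stream_scores.items())
        for c in candidates
    }


if __name__ == "__main__":
    # Hypothetical scores for one frame: the standard model slightly prefers
    # /p/, but the voicing detector is confident the sound is voiced (/b/).
    scores = {
        "standard": {"b": math.log(0.4), "p": math.log(0.6)},
        "voiced": {"b": math.log(0.9), "p": math.log(0.1)},
    }
    weights = {"standard": 1.0, "voiced": 0.5}  # assumed, normally tuned on held-out data
    combined = combine_streams(scores, weights)
    print(max(combined, key=combined.get))  # b
```

In this toy frame the voicing stream overturns the standard model's choice, which is the kind of complementary evidence the feature detectors are meant to supply.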

SLIDE 9

Multi-engine Recognition

  • Use three recognizers and combine results
  • ROVER
  • Combine scores per-sentence
  • Combine lattices
  • Confusion networks
  • Cross adaptation
  • Interleave systems with adaptation
  • It usually works better when the systems are different
  • (and both of them good)
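
The voting idea behind ROVER-style combination can be sketched as follows. This is a simplification under a strong assumption: the hypotheses are already aligned slot by slot, with an empty string marking "no word here". The real ROVER tool builds that alignment itself with dynamic programming and can also use confidence scores, neither of which is shown. The function name `vote` and the tie-breaking rule are illustrative choices.

```python
from collections import Counter
from typing import List

NULL = ""  # placeholder for "this system has no word in this slot"


def vote(aligned: List[List[str]]) -> List[str]:
    """Pick the most frequent word in each aligned slot."""
    if not aligned or len({len(h) for h in aligned}) != 1:
        raise ValueError("hypotheses must be non-empty and aligned to the same length")
    result = []
    for slot in zip(*aligned):
        counts = Counter(slot)
        best = max(counts.values())
        # Ties go to the earliest-listed system that proposed a top-voted word.
        winner = next(word for word in slot if counts[word] == best)
        if winner != NULL:
            result.append(winner)
    return result


if __name__ == "__main__":
    hypotheses = [
        ["turn", "the", "light", "off"],
        ["turn", "a", "lights", "off"],
        ["turn", "the", "lights", NULL],
    ]
    print(" ".join(vote(hypotheses)))  # turn the lights off
```

Each system makes a different mistake, so the majority is right in every slot; three copies of the same recognizer would make the same mistakes and gain nothing.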

SLIDE 10

Whispered Speech

  • Doesn’t disturb other people
  • Can use throat mike
  • Works in noisy environment

SLIDE 11

Muscle Movement

  • EMG: Electromyographic Signals
  • Recognize muscle impulses
  • Can work in noisy environments
  • Can work without you making a noise

SLIDE 12

Articulatory Movement

  • Attach metal studs to:
      • Lips, teeth, tongue, velum
  • Record movement in magnetic field
  • Non-intrusive

SLIDE 13

EMA: Electromagnetoarticulatograph

SLIDE 14

ASR Summary

  • ASR requires:
      • Acoustic model
          • HMMs trained from lots of data
      • Pronunciation lexicon
          • List of pronunciations for words
      • Language model
          • Trigrams trained from lots of data
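
To make "trigrams trained from lots of data" concrete, here is a minimal sketch of a maximum-likelihood trigram model: the probability of a word given the two preceding words is the trigram count divided by the count of that two-word history. The class name, the `<s>`/`</s>` padding symbols and the three training sentences are assumptions for illustration. No smoothing or back-off is applied, so unseen trigrams get probability zero, which practical language models avoid.

```python
from collections import Counter
from typing import Iterable, List


class TrigramModel:
    """Unsmoothed maximum-likelihood trigram language model."""

    def __init__(self, sentences: Iterable[List[str]]):
        self.trigrams = Counter()   # counts of (w1, w2, w3)
        self.histories = Counter()  # counts of (w1, w2) used as a context
        for words in sentences:
            padded = ["<s>", "<s>"] + list(words) + ["</s>"]
            for i in range(2, len(padded)):
                history = (padded[i - 2], padded[i - 1])
                self.histories[history] += 1
                self.trigrams[history + (padded[i],)] += 1

    def prob(self, w1: str, w2: str, w3: str) -> float:
        """P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2)."""
        history_count = self.histories[(w1, w2)]
        if history_count == 0:
            return 0.0  # unseen history; a real model would back off
        return self.trigrams[(w1, w2, w3)] / history_count


if __name__ == "__main__":
    corpus = [
        "turn the lights off".split(),
        "turn the lights on".split(),
        "turn the radio off".split(),
    ]
    lm = TrigramModel(corpus)
    # "turn the" occurs three times and is followed by "lights" twice.
    print(lm.prob("turn", "the", "lights"))
    print(lm.prob("the", "lights", "off"))  # 0.5
```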

SLIDE 15

ASR Trade-offs

  • More/better training data
      • Well transcribed and closest to target system
  • Better signal
      • Better microphone, no noise
  • Better speaker
      • Interested party, know how to speak
  • Time and memory
      • Bigger systems do better
      • Greater CPU does better

SLIDE 16
SLIDE 17

Homework 1

  • Build a speech recognition system
      • An acoustic model
      • A pronunciation lexicon
      • A language model
  • Note it takes time to build
  • What is your initial WER?
  • How did you improve it?
  • Submitted by 3:30pm Monday 29th Sep

SLIDE 18