Speech Processing 11-492/18-492 Spoken Term Detection/Key Word Spotting


Feb 27, 2023


SLIDE 1

Speech Processing 11-492/18-492

Spoken Term Detection/Key Word Spotting

SLIDE 2

Listening for Keywords

• No need to use push-to-talk
• Always on

SLIDE 3

Examples

SLIDE 4

Uses

• Activate a computer/task
  – “Computer, locate Commander Riker”
• Robot/device control
  – “Next slide”
• Broadcast News, Meetings
  – Tell me when “Microsoft” is mentioned
• “Triggers” vs “(General) Keyword spotting”

SLIDE 5

Google Now

• “Okay Google schedule”
• Always on
• Hands free
• Uses battery all the time
• But “Okay Google” is only said when it's meant

SLIDE 6

How to do it

• Full ASR
  – Run full ASR all the time
  – Post-process it to find keyword
  – Very computationally expensive
• Model for Keyword
  – Build an acoustic model just for keyword
  – Run DTW (or similar) on the acoustics
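The "Model for Keyword" route can be illustrated with a minimal dynamic time warping (DTW) sketch. It assumes the keyword template and the incoming audio have already been turned into sequences of feature vectors (for example MFCC frames). The function names, the length normalisation, and the threshold are illustrative assumptions, not part of the lecture.

```python
import math

def dtw_distance(template, observed):
    """Cumulative DTW cost between two sequences of feature vectors."""
    n, m = len(template), len(observed)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning template[:i] with observed[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(template[i - 1], observed[j - 1])  # local frame distance
            # allow a frame to repeat on either side, or both to advance together
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def is_keyword(template, observed, threshold):
    """Hypothetical decision rule: length-normalised DTW cost under a threshold."""
    normalised = dtw_distance(template, observed) / (len(template) + len(observed))
    return normalised <= threshold
```

Because the warping path may repeat frames, a slower or faster rendition of the same keyword can still align with low cost, which is why DTW suits a single-keyword template.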

SLIDE 7

How to measure its success

• False Positives
  – Find examples that aren't there
• False Negatives
  – Miss examples that are there
• What is the relative cost of the error
  – FN:
    Trigger: person will say it again
    KWS: it's lost
  – FP:
    Trigger: an extra command will be interpreted
    KWS: time wasted in looking at example to discard it
• Change your thresholds
  – Trigger: fewer FP
  – KWS: fewer FN
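The threshold trade-off above can be sketched as a small cost calculation. This assumes each candidate detection carries a confidence score and a ground-truth label. The cost weights and function names are made up for illustration, not measured values.

```python
def count_errors(scores, labels, threshold):
    """Return (false_positives, false_negatives) when score >= threshold means 'detected'."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    return fp, fn

def best_threshold(scores, labels, fp_cost, fn_cost):
    """Pick the candidate threshold with the lowest weighted error cost."""
    # every observed score, plus one value above all of them ("detect nothing")
    candidates = sorted(set(scores)) + [max(scores) + 1.0]

    def total_cost(t):
        fp, fn = count_errors(scores, labels, t)
        return fp_cost * fp + fn_cost * fn

    return min(candidates, key=total_cost)

scores = [0.2, 0.4, 0.6, 0.8]
labels = [False, True, False, True]
trigger_threshold = best_threshold(scores, labels, fp_cost=5, fn_cost=1)  # FP hurts more
kws_threshold = best_threshold(scores, labels, fp_cost=1, fn_cost=5)      # FN hurts more
```

With these toy numbers the trigger setting ends up with the higher threshold (fewer false positives) and the keyword-spotting setting with the lower one (fewer false negatives).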

SLIDE 8

Hot Spots

• Only look in good places
• Speech vs non-speech
• Target Speaker vs Other speakers
• “Long” speech vs (very) short speech
• Prosodically interesting parts
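One simple way to approximate the speech vs non-speech filter is a frame-energy gate. The sketch below is an assumption-laden illustration: real systems usually use a trained voice activity detector. The frame length, energy threshold, and minimum run length are hypothetical parameters, and dropping very short bursts is only one possible reading of the "long vs short speech" bullet.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of consecutive non-overlapping frames."""
    return [sum(x * x for x in samples[i:i + frame_len]) / frame_len
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def hot_spots(samples, frame_len, energy_threshold, min_frames):
    """Return (start_frame, end_frame) runs, end exclusive, worth passing to the spotter."""
    spans, start = [], None
    # the -inf sentinel closes a run that reaches the end of the signal
    for i, e in enumerate(frame_energies(samples, frame_len) + [float("-inf")]):
        if e > energy_threshold and start is None:
            start = i                       # a loud run begins
        elif e <= energy_threshold and start is not None:
            if i - start >= min_frames:     # keep only sufficiently long runs
                spans.append((start, i))
            start = None
    return spans
```

Only the returned spans would then be handed to the more expensive keyword model.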

SLIDE 9

Noise Cancellation

• Remove known (irrelevant) channels
  – Remove TV feed from ASR stream
  – Remove others from conference call
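When the interfering channel is available as a separate reference signal (for example the TV feed itself), the crudest form of removal is to subtract a scaled copy of it. The sketch below assumes the reference is already time-aligned with the microphone signal and differs only by a single gain. Practical systems use adaptive filters to handle delay and room acoustics, and the function name is hypothetical.

```python
def remove_known_channel(mixture, reference):
    """Subtract the least-squares-scaled reference signal from the mixture."""
    energy = sum(r * r for r in reference)
    if energy == 0.0:
        return list(mixture)  # silent reference: nothing to remove
    # gain that minimises the energy of (mixture - gain * reference)
    gain = sum(m * r for m, r in zip(mixture, reference)) / energy
    return [m - gain * r for m, r in zip(mixture, reference)]
```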

SLIDE 10

Boosting

• (For Keyword Spotting)
• Words are defined by the company they keep
• Words will typically appear more than once
  – Near to each other
• Recognition with lattices (i.e. choices)
  – If a document has one occurrence, boost others
• If related words in document
  – Boost others
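The "one occurrence boosts others" idea can be sketched on a flat list of scored hypotheses rather than a real lattice. Everything here is an illustrative assumption: the tuple layout, the boost size, the time window, and the confidence cut-off. Boosting by related words is not shown.

```python
def boost_hypotheses(hyps, boost=0.25, window=30.0, confident=0.8):
    """hyps: list of (time_sec, word, score). Raise weak hits that sit near a confident one."""
    # confident detections act as anchors for the same word
    anchors = [(i, t, w) for i, (t, w, s) in enumerate(hyps) if s >= confident]
    boosted = []
    for i, (t, w, s) in enumerate(hyps):
        supported = any(j != i and aw == w and abs(at - t) <= window
                        for j, at, aw in anchors)
        boosted.append((t, w, min(1.0, s + boost) if supported else s))
    return boosted

hyps = [(10.0, "microsoft", 0.9), (25.0, "microsoft", 0.5),
        (300.0, "microsoft", 0.5), (20.0, "apple", 0.5)]
rescored = boost_hypotheses(hyps)
```

In this toy input only the weak "microsoft" hit at 25 s is raised, because it is the only one close to a confident occurrence of the same word.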

SLIDE 11

Choose your Trigger Word

• Something unlikely to appear elsewhere
• Something easy to recognize
• Something not confusable
• Something easy to remember
• Something relevant
• Good examples
  – Affirmative and negative (vs yes and no)
  – “Okay Google”
  – “Nebuchadnezzar”
• Bad examples
  – “huh”, “sass”

SLIDE 12

IARPA Babel Project

• 4 teams
  – CMU/JHU
  – BBN
  – IBM
  – ICSI (and others)
• 35 languages over 5 years
  – Low resource languages
  – Pashto, Bengali, Vietnamese, Cantonese, ...
• 100 hours, 10 hours, and 0 hours

SLIDE 13

0 Data Case

• No labeled data in “unknown” language
  – So can't build initial ASR engine
  – Build index in the audio domain
• Keywords are spoken
  – “Look for 'apple computers'”
• Issues
  – Cross speaker mapping
  – (Use of synthesis – but need data)
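Searching in the audio domain with a spoken query is often done with a subsequence variant of DTW, where the query may start and end anywhere in the longer recording. The sketch below assumes both query and recording are sequences of feature vectors. It is one possible approach, not necessarily what the Babel teams used, and the function name is hypothetical.

```python
import math

def find_query(query, stream):
    """Subsequence DTW: return (normalised_cost, end_index) of the best match in stream."""
    n, m = len(query), len(stream)
    inf = float("inf")
    prev = [0.0] * (m + 1)  # row 0 is free, so a match may start at any stream frame
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        for j in range(1, m + 1):
            d = math.dist(query[i - 1], stream[j - 1])
            cur[j] = d + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    # the match may also end at any stream frame: take the cheapest end point
    end = min(range(1, m + 1), key=lambda j: prev[j])
    return prev[end] / n, end - 1
```

The cross-speaker issue listed above shows up here directly: raw acoustic features from a different speaker can give a high cost even for the same words.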

SLIDE 14

Spoken Term Detection

• “Old” goal but popular again
  – Not in fact much easier than full ASR
  – You can constrain the problem though
    Limited keywords, train people
• Can be used for search
  – But need good ASR in language

SLIDE 15