Speech Processing 11-492/18-492
Spoken Term Detection/Key Word Spotting
Listening for Keywords
No need to use push-to-talk
Always on
Examples
Uses
Activate a computer/task
– “Computer, locate Commander Riker”
Robot/device control
– “Next slide”
Broadcast News, Meetings
– Tell me when “Microsoft” is mentioned
“Triggers” vs “(General) Keyword spotting”
Google Now
“Okay Google schedule”
Always on
Hands free
Uses battery all the time
But “Okay Google” is only said when it's meant
How to do it
Full ASR
– Run full ASR all the time
– Post-process it to find keyword
– Very computationally expensive
Model for Keyword
– Build an acoustic model just for keyword
– Run DTW (or similar) on the acoustics
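The "Run DTW on the acoustics" step can be sketched roughly as follows. This is a minimal illustration, not the course's implementation: it assumes each frame is already a feature vector (a tuple of floats), uses Euclidean distance between frames, and normalizes by the combined length. The function name `dtw_distance` is hypothetical.

```python
import math

def dtw_distance(template, observed):
    """Length-normalized DTW cost between two sequences of feature vectors.

    Assumption: frames are equal-length tuples of floats and the local
    distance is Euclidean. Real systems would use MFCCs or similar.
    """
    n, m = len(template), len(observed)
    # cost[i][j] = cheapest alignment of template[:i] with observed[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(template[i - 1], observed[j - 1])
            # Allowed steps: stretch the template, stretch the input, or advance both
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)

# Toy one-dimensional "features": a slower repetition still aligns perfectly.
keyword = [(0.0,), (1.0,), (2.0,)]
slow_keyword = [(0.0,), (0.0,), (1.0,), (1.0,), (2.0,)]
print(dtw_distance(keyword, slow_keyword))  # 0.0
```

A detector would compare this cost against a threshold; a low cost means the observed audio warps onto the keyword template cheaply.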
How to measure its success
False Positives
– Find examples that aren't there
False Negatives
– Miss examples that are there
What is the relative cost of the error
– FN:
- Trigger: person will say it again
- KWS: it's lost
– FP:
- Trigger: an extra command will be interpreted
- KWS: time wasted in looking at example to discard it
Change your thresholds
– Trigger: fewer FP
– KWS: fewer FN
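The threshold trade-off above can be made concrete with a small sketch. It assumes each candidate detection has a confidence score and a ground-truth label, and that the two error types have fixed per-error costs. The scores, labels, and cost values are made-up toy numbers, and `count_errors` and `best_threshold` are hypothetical names.

```python
def count_errors(scores, labels, threshold):
    """Return (false_positives, false_negatives) when we fire on score >= threshold."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    return fp, fn

def best_threshold(scores, labels, fp_cost, fn_cost):
    """Pick the candidate threshold with the lowest total weighted error cost."""
    # Try every observed score, plus one value above all of them ("never fire").
    candidates = sorted(set(scores)) + [max(scores) + 1.0]

    def total_cost(t):
        fp, fn = count_errors(scores, labels, t)
        return fp_cost * fp + fn_cost * fn

    return min(candidates, key=total_cost)

# Toy data: True means the keyword really was spoken.
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]
labels = [False, False, True, True, True, False]

# Trigger: a false positive runs an unwanted command, so weight FP heavily.
print(best_threshold(scores, labels, fp_cost=5, fn_cost=1))  # 0.7
# KWS: a miss is lost for good, so weight FN heavily.
print(best_threshold(scores, labels, fp_cost=1, fn_cost=5))  # 0.35
```

With these toy numbers the trigger setting ends up with a higher threshold (fewer FP) and the KWS setting with a lower one (fewer FN), matching the slide.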
Hot Spots
Only look in good places
Speech vs non-speech
Target Speaker vs Other speakers
“Long” speech vs (very) short speech
Prosodically interesting parts
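Two of the filters above (speech vs non-speech, and long vs very short speech) can be sketched with a simple frame-energy gate. This is only an illustrative assumption: practical systems typically use trained voice activity detectors, and the function name, frame length, and thresholds here are hypothetical.

```python
def speech_regions(samples, frame_len, energy_threshold, min_frames):
    """Return (start_frame, end_frame) pairs for runs of high-energy frames.

    Runs shorter than min_frames are dropped, which discards very short bursts.
    end_frame is exclusive.
    """
    regions = []
    start = None
    n_frames = len(samples) // frame_len
    for f in range(n_frames + 1):
        if f < n_frames:
            frame = samples[f * frame_len:(f + 1) * frame_len]
            energy = sum(x * x for x in frame) / frame_len
            active = energy >= energy_threshold
        else:
            active = False  # sentinel frame closes a run that reaches the end
        if active and start is None:
            start = f
        elif not active and start is not None:
            if f - start >= min_frames:
                regions.append((start, f))
            start = None
    return regions

# Toy signal: silence, a long burst, silence, a very short burst, silence.
signal = [0.0] * 4 + [1.0] * 8 + [0.0] * 4 + [1.0] * 2 + [0.0] * 2
print(speech_regions(signal, frame_len=2, energy_threshold=0.5, min_frames=2))
# [(2, 6)]
```

Only the returned regions would then be passed to the (more expensive) keyword detector.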
Noise Cancellation
Remove known (irrelevant) channels
– Remove TV feed from ASR stream
– Remove others from conference call
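As a rough illustration of removing a known channel, the sketch below subtracts a scaled copy of a reference signal (for example the TV feed) from the microphone signal. It makes strong simplifying assumptions that are not from the slides: the reference is time-aligned with the microphone, reaches it with a single constant gain, and there is no echo or filtering. Real systems would more likely use an adaptive filter. The name `remove_known_channel` is hypothetical.

```python
def remove_known_channel(mic, reference):
    """Subtract the least-squares scaled copy of `reference` from `mic`.

    Assumption: same length, sample-aligned, single constant gain.
    """
    ref_energy = sum(r * r for r in reference)
    if ref_energy == 0.0:
        return list(mic)  # silent reference: nothing to remove
    # Gain that minimizes the energy of (mic - gain * reference)
    gain = sum(m * r for m, r in zip(mic, reference)) / ref_energy
    return [m - gain * r for m, r in zip(mic, reference)]

# Toy example: the microphone hears speech plus 0.8 times the TV feed.
tv_feed = [1.0, -1.0, 2.0, 0.0]
speech = [0.5, 0.5, 0.0, -1.0]
mic = [s + 0.8 * t for s, t in zip(speech, tv_feed)]
cleaned = remove_known_channel(mic, tv_feed)
```

In this toy case the speech happens to be uncorrelated with the TV feed, so `cleaned` recovers it up to floating-point rounding; correlated signals would be partly removed as well.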
Boosting
(For Keyword Spotting)
Words are defined by the company they keep
Words will typically appear more than once
– Near to each other
Recognition with lattices (i.e. choices)
– If a document has one occurrence
– boost others
If related words in document
– Boost others
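The "one occurrence boosts others" idea can be sketched as a rescoring pass over candidate hits for a single keyword. The data layout (a list of `(time_in_seconds, score)` pairs taken from a lattice), the function name, and all numeric parameters are assumptions made for illustration only.

```python
def boost_scores(hits, confident=0.8, window=30.0, boost=0.2):
    """Raise the score of hits that have a confident hit of the same keyword nearby.

    hits: list of (time_in_seconds, score) candidates for one keyword.
    A hit is boosted if a *different* hit scoring >= confident lies within
    `window` seconds. Scores are capped at 1.0.
    """
    boosted = []
    for i, (t, s) in enumerate(hits):
        has_support = any(
            j != i and s2 >= confident and abs(t2 - t) <= window
            for j, (t2, s2) in enumerate(hits)
        )
        boosted.append((t, min(1.0, s + boost) if has_support else s))
    return boosted

# Toy candidates: one confident hit, a weak hit nearby, a weak hit far away.
candidates = [(10.0, 0.9), (25.0, 0.4), (300.0, 0.4)]
rescored = boost_scores(candidates)
```

Here only the weak hit at 25 s is raised, because it sits within the window of the confident hit at 10 s. Boosting on related words would follow the same pattern with hits for other keywords acting as support.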
Choose your Trigger Word
Something unlikely to appear elsewhere
Something easy to recognize
Something not confusable
Something easy to remember
Something relevant
Good examples
– Affirmative and negative (vs yes and no)
– “Okay Google”
– “Nebuchadnezzar”
Bad examples
– “huh” “sass”
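One crude way to check the "not confusable" criterion is to measure how close a candidate trigger is to other words. The sketch below uses edit distance over spellings as a stand-in; this is an assumption for illustration, since a real check would compare phone sequences or acoustic confusions. The vocabulary is a made-up toy list.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or phone lists)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]

def closest_confusion(candidate, vocabulary):
    """Smallest edit distance from candidate to any other vocabulary word."""
    return min(edit_distance(candidate, w) for w in vocabulary if w != candidate)

# Toy vocabulary of everyday words the system is likely to hear.
vocab = ["hut", "hub", "her", "pass", "sad"]
print(closest_confusion("huh", vocab))             # 1
print(closest_confusion("sass", vocab))            # 1
print(closest_confusion("nebuchadnezzar", vocab))  # much larger
```

Short words sit one edit away from many neighbours, while a long unusual word does not, which is consistent with the good and bad examples above.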
IARPA Babel Project
4 teams
– CMU/JHU
– BBN
– IBM
– ICSI (and others)
35 languages over 5 years
– Low resource languages
– Pashto, Bengali, Vietnamese, Cantonese, ...
100 hours, 10 hours, and 0 hours
0 Data Case
No labeled data in “unknown” language
– So can't build initial ASR engine
– Build index in the audio domain
Keywords are spoken
– “Look for 'apple computers'”
Issues
– Cross speaker mapping
– (Use of synthesis, but need data)
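Searching for a spoken keyword directly in the audio domain can be sketched with subsequence DTW, where the query may start and end anywhere in the longer recording. This is one common query-by-example approach, swapped in here for illustration and not confirmed as the method the Babel teams used. It assumes frames are feature tuples compared with Euclidean distance, and it does nothing about the cross-speaker issue noted above. The name `find_spoken_query` is hypothetical.

```python
import math

def find_spoken_query(query, audio):
    """Return (end_frame_index, normalized_cost) of the best match of query in audio."""
    n, m = len(query), len(audio)
    # Row 0 is all zeros: the match may begin at any audio frame for free.
    prev = [0.0] * (m + 1)
    for i in range(1, n + 1):
        cur = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            d = math.dist(query[i - 1], audio[j - 1])
            cur[j] = d + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    # The match may also end at any audio frame: take the cheapest end point.
    end = min(range(1, m + 1), key=lambda j: prev[j])
    return end - 1, prev[end] / n

# Toy one-dimensional features: the query pattern is embedded in longer audio.
query = [(1.0,), (2.0,), (3.0,)]
audio = [(9.0,), (9.0,), (1.0,), (2.0,), (2.0,), (3.0,), (9.0,)]
print(find_spoken_query(query, audio))  # (5, 0.0)
```

The returned cost would be thresholded like any other detection score; with different speakers for query and audio, the raw features would first need some speaker-independent mapping.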
Spoken Term Detection
“Old” goal but popular again
– Not in fact much easier than full ASR
– You can constrain the problem though
- Limited keywords, train people