Speech Processing 11-492/18-495
Spoken Dialog Systems
Conversing with machines
Spoken Dialog Systems
Not just ASR bolted onto TTS
Different styles of interaction
Question/response systems
Mixed initiative systems
“How May I Help You?” open questions
True conversational machine-human interaction
SDS Overview
Introduction
Building simple dialog systems
VoiceXML
A language for writing systems
Beyond tree-based systems
Beyond spoken language
Non-task-oriented systems
Real-world deployment considerations
SDS Applications
Information giving/request
Flights, buses, stocks and weather
Driving directions
Answer questions, news
Transactional
Reply to your email
Credit card and bank enquiries, product purchase
Maintenance
Technical support
Customer service
SDS Applications
Entertainment
Game characters (NPC), toys, robots
Tutoring
Math, science
Language learning
Health care
Depression screening
Aphasia therapy
Dialog Types
System initiative
Form-filling paradigm
Can switch language models at each turn
Can “know” what is likely to be said
Mixed initiative
Users can go where they like
System or user can lead the discussion
Classifying:
Users can say what they like
But really only “N” operations possible
E.g. AT&T “How may I help you?”
System Initiative
Most common
Machine controls the call
Few choices in the dialog
Simple form filling:
What is your bank account number?
Advantages:
You know what users will say (sort of)
Hard for user to get confused
Hard for system to get confused
Easy to build
Disadvantages:
Limited flexibility in interaction
Fixed dialog structure
Most reliable, but many turns
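The form-filling paradigm above can be sketched in a few lines. This is a minimal illustration, not a production design: the slot names, the prompts, the 8-digit account format and the `scripted` helper are all assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Hypothetical form: (slot name, prompt, validator), asked in this fixed order.
FORM: List[Tuple[str, str, Callable[[str], bool]]] = [
    ("account_number", "What is your bank account number?",
     lambda s: s.isdigit() and len(s) == 8),      # 8 digits is an assumed format
    ("transaction", "Would you like balance or transfer?",
     lambda s: s in {"balance", "transfer"}),
]

def run_form(get_reply: Callable[[str], str],
             max_retries: int = 2) -> Dict[str, Optional[str]]:
    """System-initiative loop: the machine asks, the user only answers."""
    filled: Dict[str, Optional[str]] = {}
    for slot, prompt, is_valid in FORM:
        filled[slot] = None                       # stays None if every attempt fails
        for _ in range(max_retries + 1):
            reply = get_reply(prompt).strip().lower()
            if is_valid(reply):                   # the system knows what to expect here
                filled[slot] = reply
                break
    return filled

def scripted(replies: Iterable[str]) -> Callable[[str], str]:
    """Stand-in for ASR: returns canned user replies one at a time."""
    it = iter(replies)
    return lambda prompt: next(it)

# One failed attempt on the first slot, then two valid answers.
result = run_form(scripted(["um, what?", "12345678", "balance"]))
print(result)
```

Because each prompt has its own validator, the same place in the loop is where a per-turn language model could be swapped in. The cost is visible too: one slot per turn, in a fixed order.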
System Initiative
Let’s Go Bus Information
412 268 3526 (Anytime)
Provides bus information for Pittsburgh
Tell Me
Company getting others to build systems
Stocks, weather, entertainment
Mixed Initiative
User or system takes initiative
More interesting dialogs
“jump” through different parts of dialog state
Advantages
More realistic dialog
Can do more complex tasks
Disadvantages
Can get confusing
Can miss important parts
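One way to picture mixed initiative is a frame that accepts any slot the user mentions, while the system only asks for what is still missing. The sketch below assumes a hypothetical flight-booking frame with three slots; the parsed input is supplied by hand rather than by a real understanding component.

```python
from typing import Dict, Optional

SLOTS = ["origin", "destination", "date"]         # hypothetical frame
PROMPTS = {
    "origin": "Where are you leaving from?",
    "destination": "Where do you want to go?",
    "date": "What day do you want to travel?",
}

def update_frame(frame: Dict[str, Optional[str]], parsed: Dict[str, str]) -> None:
    """User initiative: accept any known slot, whichever one was asked for."""
    for slot, value in parsed.items():
        if slot in frame:
            frame[slot] = value                   # later mentions overwrite earlier ones

def next_prompt(frame: Dict[str, Optional[str]]) -> Optional[str]:
    """System initiative: ask for the first slot that is still empty."""
    for slot in SLOTS:
        if frame[slot] is None:
            return PROMPTS[slot]
    return None                                   # frame complete, nothing left to ask

frame: Dict[str, Optional[str]] = {slot: None for slot in SLOTS}

# The user jumps ahead and gives destination and date before being asked.
update_frame(frame, {"destination": "Boston", "date": "tomorrow"})
first_question = next_prompt(frame)

update_frame(frame, {"origin": "Pittsburgh"})
second_question = next_prompt(frame)
print(first_question, second_question)
```

The "can miss important parts" risk shows up here as well: nothing forces a confirmation step unless the designer adds one.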
Classification Dialogs
Sort out from N things
User says “anything” and system directs them
Receptionist
I have a problem with my bill
What’s the area code for Miami
Did you know I can see the beach from here
Advantages
(Apparently) complex understanding
Solves a very common task
Disadvantages
Actually quite restrictive
Needs data to train from
Needs to be updated
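A classification dialog can be approximated with a small bag-of-words classifier. The sketch below uses a naive Bayes model with add-one smoothing as a stand-in; the training utterances and the three destination labels are made up for illustration and are not from any deployed system.

```python
import math
from collections import Counter, defaultdict

# Toy, made-up training data: (utterance, destination) pairs.
TRAIN = [
    ("i have a problem with my bill", "billing"),
    ("there is a wrong charge on my bill", "billing"),
    ("what is the area code for miami", "directory"),
    ("i need the area code for boston", "directory"),
    ("my phone line is not working", "repair"),
    ("the line is dead", "repair"),
]

def train(examples):
    word_counts = defaultdict(Counter)            # label -> word frequencies
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        words = text.split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab

def classify(text, model):
    """Return the most likely of the N destinations for a free-form utterance."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        n_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total)
        for w in text.lower().split():
            if w in vocab:                        # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1) / (n_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train(TRAIN)
print(classify("i think my bill is wrong", model))
print(classify("the area code for miami please", model))
```

An off-topic utterance such as the beach example is still forced into one of the N classes, which is the sense in which the approach is quite restrictive. New destinations also require new labeled data.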
Beyond Telephones
Telematics
Voice communication in cars
GPS, music selection etc
Web-based dialog systems
Robot Interaction
Robot-robot and robot-human interaction
Animated talking head
Non-player characters – web agents
Speech to Speech translation
CMU Dialport: integrating many dialog systems
Team Talk
Using speech to control multiple robots
Robots have names and distinct voices
They report to each other and to you in voice
Other SDS
Microsoft: Situated Interaction
Talking Head that follows you
CMU SV: Aidas
Restaurant recommendations in situ
True conversation
Requires more than just speech
Non-verbal noises: laughing, er, um, etc
Eye gaze
Proper timing (not waiting 500ms before speaking)
Back-channeling
Movement
Talking about nothing
Roboreceptionist
Entrance to NSH
Keyboard (no ASR)
TTS, face, movement
Range finder to detect people
Significant background character
Mostly talks about nothing
Personal Intelligent Systems
Example: Apple Siri, Google Now, Microsoft Cortana, Amazon Echo, etc.
Hub of all applications
Extendable
Personalization
Cross-Language
Cross-Cultural
Future: interface -> true companion
Speech Processing 11-492/18-492
Spoken Dialog Systems
SDS components
Spoken Dialog Systems
More than just ASR and TTS
Recognition
Language understanding
Manipulation of utterances
Generation of new information
Text generation
Synthesis
SDS Architecture
(Diagram: ASR, Language Understanding, Dialog Manager, Language Generation, Synthesis)
Error Handling Strategies: Non Understanding
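The five components named in the architecture can be read as one function per stage, chained once per turn. The sketch below is only a skeleton under strong assumptions: the recognizer and synthesizer are text pass-throughs, understanding is keyword spotting over two hypothetical cities, and the only error handling is to re-ask on non-understanding.

```python
from typing import Dict

def asr(audio: str) -> str:
    """Stand-in recognizer: the 'audio' is already text in this sketch."""
    return audio.lower()

def understand(words: str) -> Dict[str, str]:
    """Language understanding: words -> structure (toy keyword spotting)."""
    frame = {}
    for city in ("boston", "pittsburgh"):         # hypothetical city list
        if city in words:
            frame["destination"] = city
    return frame

def dialog_manager(frame: Dict[str, str], state: Dict[str, str]) -> Dict[str, str]:
    """Update the dialog state and decide the next system act."""
    state.update(frame)
    if "destination" not in state:                # non-understanding: ask again
        return {"act": "request", "slot": "destination"}
    return {"act": "confirm", "destination": state["destination"]}

def generate(act: Dict[str, str]) -> str:
    """Language generation: structure -> words."""
    if act["act"] == "request":
        return "Where would you like to go?"
    return f"Going to {act['destination'].title()}, is that right?"

def synthesize(text: str) -> str:
    """Stand-in for TTS: returns the text that would be spoken."""
    return text

def turn(audio: str, state: Dict[str, str]) -> str:
    return synthesize(generate(dialog_manager(understand(asr(audio)), state)))

state: Dict[str, str] = {}
print(turn("Hello there", state))
print(turn("Eh, I wanna go to Boston", state))
```

Keeping the dialog state outside the stage functions means each stage can be replaced independently, for example swapping the keyword spotter for a trained parser.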
SDS Internals
Language Understanding
From words to structure
Dialog Manager
State of dialog (who is talking)
Direction of dialog (what next)
References, user profile etc
Interaction with database/internet
Language Generation
From structure to words
Language Understanding
Parsing of SPEECH not TEXT
Eh, I wanna go, wanna go to Boston tomorrow
If it’s not too much trouble I’d be very grateful if one might be able to aid me in arranging my travel arrangements to Boston, Logan airport, at sometime tomorrow morning, thank you.
Boston, tomorrow
Parsing: Output structure
“I wanna go to Boston, tomorrow”
Destination: BOS
Departure: 20081028, AM
Airline: unspecified
Special: unspecified
Convert speech to structure
Sufficient for further processing/query
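The conversion from speech to structure can be sketched as phrase spotting that ignores disfluencies and politeness. Everything specific below is an assumption for illustration: the airport table, the choice of 2008-10-27 as "today" (picked so that "tomorrow" gives the date on the slide), and the rule that AM is added only when "morning" is spoken.

```python
import datetime
import re

AIRPORTS = {"boston": "BOS", "pittsburgh": "PIT"}  # hypothetical lookup table
TODAY = datetime.date(2008, 10, 27)                # assumed date of the conversation

def parse(utterance: str, today: datetime.date = TODAY) -> dict:
    """Spot the few phrases that matter; everything else is skipped."""
    text = utterance.lower()
    frame = {"Destination": "unspecified", "Departure": "unspecified",
             "Airline": "unspecified", "Special": "unspecified"}
    for city, code in AIRPORTS.items():
        if re.search(rf"\b{re.escape(city)}\b", text):
            frame["Destination"] = code
    if re.search(r"\btomorrow\b", text):
        day = today + datetime.timedelta(days=1)
        frame["Departure"] = day.strftime("%Y%m%d")
        if re.search(r"\bmorning\b", text):
            frame["Departure"] += ", AM"
    return frame

short = parse("Eh, I wanna go, wanna go to Boston tomorrow")
polite = parse("If it's not too much trouble I'd be very grateful if one might be "
               "able to aid me in arranging my travel arrangements to Boston, "
               "Logan airport, at sometime tomorrow morning, thank you.")
print(short)
print(polite)
```

Both utterances reduce to nearly the same frame, which is the point of the "Boston, tomorrow" line: the structure is what the rest of the system queries on.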
Interaction Example
User: find a cheap eating place for taiwanese food
Intelligent Agent: Cheap Taiwanese eating places include Din Tai Fung, Boiling Point, etc. What do you want to choose? I can help you go there.
SDS Process
User: find a cheap eating place for taiwanese food
(Figure: dependency parse of the utterance with relations AMOD, NN and PREP_FOR; words mapped to the slots seeking, target, price and food)
Organized Domain Knowledge
Ontology Induction (semantic slot)
Structure Learning (inter-slot relation)
Semantic Decoding
seeking=“find” target=“eating place” price=“cheap” food=“taiwanese”
Intelligent Agent
Automatic Slot Induction
Chen et al. ASRU’13
can i have a cheap restaurant
(Figure: frame-semantic parse of the utterance; Frame: capability, Frame: expensiveness, Frame: locale by use; each frame marked as a domain or general slot candidate)
Parsing vs Language Model
Language Model
Model what actually gets said
Parsing
Extract the information you want
Models *can* be shared
Only accept things in the grammar
Can be over limiting
Neural Networks for SLU
RNN for Slot Filling
Step 1: word embedding
Step 2: capturing short-term dependencies
Step 3: capturing long-term dependencies
Step 4: different types of neural architecture
http://deeplearning.net/tutorial/rnnslu.html#rnnslu
Mesnil et al. 2013
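A common way to capture short-term dependencies (Step 2) is to feed the network a window of neighbouring word indexes rather than one word at a time. The function below is a minimal sketch of that idea; the name `context_window` and the use of -1 as the padding index are choices made for this example.

```python
from typing import List

PAD = -1  # assumed padding index for positions before the start or after the end

def context_window(word_ids: List[int], size: int) -> List[List[int]]:
    """For each word, return the `size` word indexes centred on it."""
    if size < 1 or size % 2 == 0:
        raise ValueError("window size must be a positive odd number")
    half = size // 2
    padded = [PAD] * half + list(word_ids) + [PAD] * half
    return [padded[i:i + size] for i in range(len(word_ids))]

sentence = [0, 1, 2, 3, 4]            # word indexes of a five-word utterance
windows = context_window(sentence, 3)
print(windows)
```

Each window would then be mapped to embeddings and concatenated, so the slot label for a word can depend on its immediate neighbours. Longer-range context (Step 3) is left to the recurrent connections.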
Interactive Learning for SLU
LUIS: Interactive machine learning for language understanding
Advantages:
Non-experts can add knowledge through feature engineering
Active learning reduces heavy labeling
https://www.luis.ai/
Williams et al. 2016
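The labeling saving from active learning comes from asking a person to label only the utterances the current model is least sure about. The sketch below shows plain uncertainty sampling as one possible strategy; it is not a description of how LUIS works internally, and the pool of utterances and their probabilities are invented.

```python
from typing import Callable, Dict, List

def select_for_labeling(utterances: List[str],
                        predict_proba: Callable[[str], Dict[str, float]],
                        k: int = 2) -> List[str]:
    """Return the k utterances whose most likely intent has the lowest probability."""
    def confidence(utterance: str) -> float:
        return max(predict_proba(utterance).values())
    return sorted(utterances, key=confidence)[:k]

# Made-up model outputs for an unlabeled pool: utterance -> intent probabilities.
POOL = {
    "book a flight": {"travel": 0.95, "weather": 0.05},
    "is it going to be nice in boston": {"travel": 0.55, "weather": 0.45},
    "boston tomorrow": {"travel": 0.60, "weather": 0.40},
    "will it rain": {"travel": 0.10, "weather": 0.90},
}

to_label = select_for_labeling(list(POOL), POOL.__getitem__, k=2)
print(to_label)
```

After the selected utterances are labeled, the model is retrained and the selection is repeated, so confident cases never reach the human.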
Dialog Manager
Maintain state
Where are we in the dialog
Whose turn is it
Waiting for speaker
Waiting for database query (stall user)
Deal with barge-in
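The turn-taking bookkeeping above can be sketched as a small state machine. The state names, the stall message and the `log` list are assumptions for this example; a real system would drive the TTS engine and a voice activity detector instead of appending to a list.

```python
from enum import Enum, auto

class Turn(Enum):
    SYSTEM_SPEAKING = auto()
    WAITING_FOR_USER = auto()
    WAITING_FOR_DATABASE = auto()

class DialogManager:
    def __init__(self) -> None:
        self.turn = Turn.WAITING_FOR_USER
        self.log = []                              # actions taken, in order

    def speak(self, text: str) -> None:
        self.turn = Turn.SYSTEM_SPEAKING
        self.log.append(("say", text))

    def finished_speaking(self) -> None:
        self.turn = Turn.WAITING_FOR_USER

    def start_query(self) -> None:
        """Stall the user while the database lookup runs."""
        self.turn = Turn.WAITING_FOR_DATABASE
        self.log.append(("say", "Just a moment while I look that up."))

    def user_speech_detected(self) -> bool:
        """Return True if the user barged in over a system prompt."""
        barged_in = self.turn is Turn.SYSTEM_SPEAKING
        if barged_in:
            self.log.append(("stop_tts", None))    # cut the prompt short
            self.turn = Turn.WAITING_FOR_USER      # yield the floor to the user
        return barged_in

dm = DialogManager()
dm.speak("Welcome to the bus information system. Where are you leaving from?")
interrupted = dm.user_speech_detected()            # user starts talking mid-prompt
print(interrupted, dm.turn)
```

In this sketch speech detected during a database wait is not treated as barge-in; whether to buffer or discard it is a design decision left open here.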