Speech Processing 15-492/18-492
Spoken Dialog Systems: Conversing with machines
Spoken Dialog Systems
- Not just ASR bolted onto TTS
- Different styles of interaction
  - IVR/Tree question/response systems
  - Mixed initiative systems
  - “How May I Help You?” open questions
  - True conversational machine-human interaction
- Strings of characters to words
SDS Overview
- Introduction
- Building simple dialog systems
  - VoiceXML
    - A language for writing systems
- Beyond tree-based systems
  - CMU’s Olympus systems
- Real-world deployment considerations
SDS Applications
- Information giving
  - Flights, buses, stocks, weather
  - Driving directions
  - News
- Information navigators
  - Read your mail
  - Search the web
  - Answer questions
- Provide personalities
  - Game characters (NPC), toys, robots
- Speech-to-speech translation
  - Cross-lingual interaction
Dialog Types
- System initiative
  - Form-filling paradigm
  - Can switch language models at each turn
  - Can “know” what is likely to be said
- Mixed initiative
  - Users can go where they like
  - System or user can lead the discussion
- Classifying:
  - Users can say what they like
  - But really only “N” operations possible
  - E.g. AT&T's “How may I help you?”
System Initiative
- Most common
- Machine controls the call
- Few choices in the dialog
- Simple form filling:
  - What is your bank account number
- Advantages:
  - You know what users will say (sort of)
  - Hard for user to get confused
  - Hard for system to get confused
  - Easy to build
- Disadvantages:
  - Limited flexibility in interaction
  - Fixed dialog structure
- Most reliable, but many turns
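
The form-filling idea above can be sketched in a few lines of Python. This is a minimal illustration, not the way any particular deployed system works: the `FORM` slots, prompts, and accepted answers are invented for the example, typed text stands in for ASR output, and the per-slot answer set plays the role of the language model that is switched at each turn.

```python
from typing import Callable, Dict, Optional

# Hypothetical form: each slot has a prompt and the small set of answers
# the system will accept at that turn (a stand-in for a per-turn grammar).
FORM = [
    ("origin", "Where are you leaving from?", {"downtown", "oakland", "squirrel hill"}),
    ("destination", "Where are you going?", {"downtown", "oakland", "airport"}),
    ("time", "When do you want to leave?", {"now", "morning", "evening"}),
]


def run_form(get_reply: Callable[[str], str], max_retries: int = 2) -> Dict[str, Optional[str]]:
    """Ask for each slot in a fixed order; the machine controls every turn."""
    filled: Dict[str, Optional[str]] = {}
    for slot, prompt, grammar in FORM:
        filled[slot] = None  # stays None if every attempt fails
        for _ in range(max_retries + 1):
            reply = get_reply(prompt).strip().lower()
            if reply in grammar:  # only in-grammar answers are accepted
                filled[slot] = reply
                break
    return filled
```

Passing `input` as `get_reply` turns this into an interactive text dialog. The fixed slot order is what makes the dialog reliable but long: one slot per turn, and the user cannot volunteer information early.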
System Initiative
- Let’s Go Bus Information
  - 412 442 2000 (Evenings)
  - Provides bus information for Pittsburgh East End (61x 5[469]x)
- Tell Me
  - Company getting others to build systems
  - Stocks, weather, entertainment
  - 1 800 555 8355
Mixed Initiative
- User or system takes initiative
- More interesting dialogs
  - “jump” through different parts of dialog state
- Advantages
  - More realistic dialog
  - Can do more complex tasks
- Disadvantages
  - Can get confusing
  - Can miss important parts
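
As a rough sketch of how a mixed-initiative system lets the user "jump" around the dialog state, the Python below fills any slots mentioned in an utterance, in any order, and only prompts for what is still missing. The slot names, regular expressions, and prompts are assumptions made up for this example, and the patterns are deliberately naive (real systems use a proper parser or statistical understanding component).

```python
import re
from typing import Dict, Optional

# Hypothetical, deliberately naive slot patterns for a bus-information task.
SLOT_PATTERNS = {
    "origin": re.compile(r"\bfrom (\w+)"),
    "destination": re.compile(r"\bto (\w+)"),
    "time": re.compile(r"\bat (\d{1,2}(?::\d{2})?\s?(?:am|pm))"),
}

PROMPTS = {
    "origin": "Where are you leaving from?",
    "destination": "Where are you going?",
    "time": "What time do you want to leave?",
}


def update_state(state: Dict[str, str], utterance: str) -> Dict[str, str]:
    """Fill or overwrite every slot the user mentioned, in any order."""
    text = utterance.lower()
    new_state = dict(state)
    for slot, pattern in SLOT_PATTERNS.items():
        match = pattern.search(text)
        if match:
            new_state[slot] = match.group(1)
    return new_state


def next_prompt(state: Dict[str, str]) -> Optional[str]:
    """System takes the initiative only for slots still missing."""
    for slot in SLOT_PATTERNS:
        if slot not in state:
            return PROMPTS[slot]
    return None  # form complete
```

Because `update_state` overwrites slots, the user can also go back and correct an earlier answer at any point, which is part of what makes these dialogs both more natural and easier to get confused in.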
Vera
Classification Dialogs
- Sort out from N things
- User says “anything” and system directs them
- Receptionist
  - I have a problem with my bill
  - What’s the area code for Miami
  - Did you know I can see the beach from here
- Advantages
  - (Apparently) complex understanding
  - Solves a very common task
- Disadvantages
  - Actually quite restrictive
  - Needs data to train from
  - Needs to be updated
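
To make the "only N operations" point concrete, here is a toy call router in Python. It is a sketch under stated assumptions, not AT&T's method: the two routes, the training utterances, the distinctive-word scoring, and the `operator` fallback are all invented for illustration. It learns which words occur under exactly one route and sends an utterance to the route with the most such words, falling back when the evidence is weak.

```python
import re
from collections import Counter
from typing import Dict, List, Set

# Hypothetical labeled training data: route -> example utterances.
TRAINING = {
    "billing": [
        "i have a problem with my bill",
        "there is a wrong charge on my bill",
        "question about my payment",
    ],
    "directory": [
        "what is the area code for boston",
        "i need a phone number",
        "area code for chicago",
    ],
}


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z']+", text.lower())


def train(examples: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Keep, for each route, the words seen only under that route."""
    vocab = {route: {w for u in utts for w in tokenize(u)}
             for route, utts in examples.items()}
    seen_in = Counter(w for words in vocab.values() for w in words)
    return {route: {w for w in words if seen_in[w] == 1}
            for route, words in vocab.items()}


def classify(utterance: str, model: Dict[str, Set[str]],
             min_score: int = 2, fallback: str = "operator") -> str:
    """Route to the best-scoring destination, or fall back if unsure."""
    words = set(tokenize(utterance))
    scores = {route: len(words & cues) for route, cues in model.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_score else fallback
```

With this training data, the three receptionist utterances from the slide go to `billing`, `directory`, and the `operator` fallback respectively. The apparent understanding comes entirely from the training examples, which is why such systems need data and need updating when the set of routes changes.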
Beyond Telephones
- Telematics
  - Voice communication in cars
  - GPS, music selection, etc.
- Robot Interaction
  - Robot-robot and robot-human interaction
- Animated talking head
  - Non-player characters, web agents
- Speech to Speech translation
Team Talk
- Using speech to control multiple robots
- Robots have names and distinct voices
- They report to each other and to you in voice
USI
- Having lots of different interfaces is confusing
- Try to have general expectations and discover
- Try for some level of standardization (like programming applications: file menu)
True conversation
- Requires more than just speech
  - Non-verbal noises: laughing, er, um, etc.
  - Eye gaze
  - Proper timing (not waiting 500ms before speaking)
  - Back-channeling
  - Movement
- Talking about nothing
Roboreceptionist
- Entrance to NSH
- Keyboard (no ASR)
- TTS, face, movement
- Range finder to detect people
- Significant background character
- Mostly talks about nothing