Speech Processing 15-492/18-492: Spoken Dialog Systems, SDS components
Spoken Dialog Systems
- More than just ASR and TTS
- Recognition
- Parsing
- Manipulation of utterances
- Generation of new information
- Text generation
- Synthesis
SDS Architecture
SDS Internals
- Parser
  - From words to structure
- Dialog Manager
  - State of dialog (who is talking)
  - Direction of dialog (what next)
  - References, user profile, etc.
  - Interaction with database/internet
- Language Generation
  - From structure to words
Parsing
- Parsing of SPEECH not TEXT
- Eh, I wanna go, wanna go to Boston tomorrow
- If it's not too much trouble I’d be very grateful if one might be able to aid me in arranging my travel arrangements to Boston, Logan airport, at sometime tomorrow morning, thank you.
- Boston, tomorrow
Parsing: Output structure
- “I wanna go to Boston, tomorrow”
- Convert speech to structure
- Sufficient for further processing/query
Phoenix Parser
[Place]
    (carnegie mellon university)
    (downtown)
    (robinson towne center)
    (the airport)
    (south hills junction)
    (mount oliver)
    (the south side)
    (oakland)
    (bloomfield)
    (polish hill)
    (the strip district)
    (the north side)
;

[NextBus]
    (*WHEN_IS *the next *BUS)
    (*WHEN_IS *the BUS after that *BUS)
WHEN_IS
    (when is)
    (when's)
BUS
    (bus)
    (one)
;
Phoenix Parser
- Parse what is important
- Ignore other parts
- Map known parts to useful information
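The "parse what is important, ignore the rest" idea can be sketched in a few lines of Python. This is not the Phoenix parser itself, only a rough illustration under some assumptions: the place phrases are copied from the [Place] net, the `*` prefix in the [NextBus] net is read as "optional", and the names `PLACES`, `NEXT_BUS`, and `parse` are hypothetical.

```python
import re

# Phrases copied from the [Place] net of the example grammar.
PLACES = [
    "carnegie mellon university", "downtown", "robinson towne center",
    "the airport", "south hills junction", "mount oliver", "the south side",
    "oakland", "bloomfield", "polish hill", "the strip district",
    "the north side",
]

# Rough regex rendering of the [NextBus] net (an assumption, not Phoenix syntax).
# Tokens treated as optional ("when is", "the", a trailing "bus"/"one") do not
# change whether the net matches, so only the required words are tested.
NEXT_BUS = re.compile(r"\b(?:next|(?:bus|one)\s+after\s+that)\b")


def parse(utterance):
    """Return a frame holding only the parts the grammar knows about."""
    text = utterance.lower()
    frame = {}

    found = []
    for place in PLACES:
        match = re.search(r"\b" + re.escape(place) + r"\b", text)
        if match:
            found.append((match.start(), place))
    if found:
        # Keep places in the order they were spoken.
        frame["Place"] = [place for _, place in sorted(found)]

    if NEXT_BUS.search(text):
        frame["NextBus"] = True

    # Everything else ("uh", "please", ...) is simply ignored.
    return frame


print(parse("uh when's the next bus to the airport please"))
```

Filler words never reach the frame, so the dialog manager only sees the slots it can act on.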
Parsing vs Language Model
- Language Model
  - Model what actually gets said
- Parsing
  - Extract the information you want
- Models *can* be shared
  - Only accept things in the grammar
  - Can be overly limiting
Dialog Manager
- Maintain state
  - Where are we in the dialog
  - Whose turn is it
    - Waiting for speaker
    - Waiting for database query (stall user)
- Deal with barge-in
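One way to picture the state tracking above is a small state machine. The sketch below is a minimal illustration, not a description of any particular dialog manager; the class, state, and method names are all hypothetical, and the "actions" are just strings appended to a log.

```python
from enum import Enum, auto


class Turn(Enum):
    """Whose turn it is (hypothetical state names)."""
    SYSTEM_SPEAKING = auto()
    WAITING_FOR_SPEAKER = auto()
    WAITING_FOR_DATABASE = auto()


class DialogManager:
    def __init__(self):
        self.turn = Turn.SYSTEM_SPEAKING
        self.log = []  # actions the system would take, recorded as text

    def on_prompt_finished(self):
        # The system finished its prompt: now it is the user's turn.
        if self.turn is Turn.SYSTEM_SPEAKING:
            self.turn = Turn.WAITING_FOR_SPEAKER

    def on_user_speech(self, frame):
        if self.turn is Turn.WAITING_FOR_DATABASE:
            # A query is still running: stall the user instead of re-querying.
            self.log.append("stall: still looking that up")
            return
        if self.turn is Turn.SYSTEM_SPEAKING:
            # Barge-in: the user spoke over the prompt, so cut the prompt off.
            self.log.append("barge-in: stop prompt")
        self.log.append(f"query: {frame}")
        self.turn = Turn.WAITING_FOR_DATABASE

    def on_query_result(self, result):
        self.log.append(f"say: {result}")
        self.turn = Turn.SYSTEM_SPEAKING


dm = DialogManager()
dm.on_user_speech({"Place": "Boston"})   # user barges in on the opening prompt
dm.on_user_speech({"Place": "Boston"})   # user repeats while the query runs
dm.on_query_result("next flight found")
print(dm.log)
```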
Language Generation
- Query for flights to Boston
- Template fill answer(s)
- The next flight to DEST leaves at DEPART_TIME arriving at ARRIVE_TIME.
- Templates may be much more complex
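Template filling as described above can be sketched with Python's standard `string.Template`. The template text is the one from the slide; the function name `fill` and the slot values (in particular the arrival time) are made up for illustration.

```python
from string import Template

# Template from the slide, with the slot names marked as placeholders.
NEXT_FLIGHT = Template(
    "The next flight to $DEST leaves at $DEPART_TIME arriving at $ARRIVE_TIME."
)


def fill(template, answer):
    """Fill one template from one query answer; raises KeyError if a slot is missing."""
    return template.substitute(answer)


# Hypothetical query answer; the values are illustrative only.
answer = {"DEST": "Boston", "DEPART_TIME": "07:45", "ARRIVE_TIME": "09:15"}
print(fill(NEXT_FLIGHT, answer))
```

Using `substitute` rather than `safe_substitute` means a missing slot fails loudly instead of the system reading "DEST" aloud.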
Language Generation
- Choose which template to use
  - Based on state, answer type
  - Natural variation
  - Statistical variation
- Include <ssml> tags to help synthesis
  - Can <emph>emphasize</emph> parts
  - Can identify dates, numbers etc.
- Humans like variation in the output
  - It is rare for a human to repeat things exactly
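A minimal sketch of template choice with variation might look like the following. It assumes templates are grouped by answer type and that the generator simply avoids repeating its previous wording; the template texts, the `Generator` class, and the answer-type key are hypothetical. The `<emph>` tag follows the slide's notation (the W3C SSML element is spelled `<emphasis>`).

```python
import random

# Hypothetical templates, grouped by answer type.
TEMPLATES = {
    "next_flight": [
        "The next flight to <emph>{dest}</emph> leaves at {depart_time}.",
        "There is a flight to <emph>{dest}</emph> at {depart_time}.",
        "I found a {depart_time} flight to <emph>{dest}</emph>.",
    ],
}


class Generator:
    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.last = {}  # last template used, per answer type

    def generate(self, answer_type, **slots):
        options = TEMPLATES[answer_type]
        # Avoid repeating the previous wording when an alternative exists.
        candidates = [t for t in options if t != self.last.get(answer_type)]
        template = self.rng.choice(candidates or options)
        self.last[answer_type] = template
        return template.format(**slots)


gen = Generator(seed=0)
for _ in range(3):
    print(gen.generate("next_flight", dest="Boston", depart_time="07:45"))
```

A statistical approach could replace the uniform `choice` with weights learned from how people actually phrase such answers; that part is not sketched here.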
Language Generation
- Frame structures to (marked up) text
  - START: Pittsburgh
  - END: Boston
  - DATE: 20081028
  - TIME: 07:45
  - FLIGHT: US075
- Can generate
  - I have US 075 leaving at 07:45 tomorrow
  - US Airways has a flight departing tomorrow at 07:45
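The step from frame to text can be sketched as below. The frame is the one from the slide; the helper names are hypothetical, and the sketch assumes the dialog takes place on 2008-10-27 so that the frame's DATE reads as "tomorrow". It also assumes a flight code is a two-letter carrier followed by the number.

```python
from datetime import date, datetime, timedelta

FRAME = {
    "START": "Pittsburgh",
    "END": "Boston",
    "DATE": "20081028",
    "TIME": "07:45",
    "FLIGHT": "US075",
}


def relative_day(yyyymmdd, today):
    """Render a frame date the way a person would say it."""
    day = datetime.strptime(yyyymmdd, "%Y%m%d").date()
    if day == today:
        return "today"
    if day == today + timedelta(days=1):
        return "tomorrow"
    return "on " + day.isoformat()


def generate(frame, today):
    # Assumed flight code layout: two-letter carrier, then the flight number.
    flight = frame["FLIGHT"]
    spoken_flight = f"{flight[:2]} {flight[2:]}"
    when = relative_day(frame["DATE"], today)
    return f"I have {spoken_flight} leaving at {frame['TIME']} {when}"


# Assumed "today" for the example: the day before the frame's DATE.
print(generate(FRAME, today=date(2008, 10, 27)))
```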
Standardized things
- Help
  - User should be able to get help at any time
  - Explain where they are and what they are expected to say (with explicit examples)
- Errors
  - “I didn’t understand” …
- Confirmation
  - Did you say “Boston”?
Confirmation
- Explicit confirmation
  - System: Where are you traveling to?
  - User: Boston
  - System: Boston, did I get that right?
  - User: Yes
Confirmation
- Implicit confirmation
  - System: Where are you traveling to?
  - User: Boston
  - System: Boston, where … <can barge in>
Confirmation
- Explicit confirmation
  - Safe but slow
- Implicit confirmation
  - Natural, but requires good support for barge-in
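The difference between the two strategies shows up in how the next prompt is built. The sketch below is only an illustration; the function name and the strategy strings are hypothetical, and the follow-up question is left elided exactly as on the slide.

```python
def confirm_prompt(value, next_question, strategy):
    """Build the system's next prompt after hearing `value`."""
    if strategy == "explicit":
        # Costs an extra turn: the user must answer yes/no before moving on.
        return f"{value}, did I get that right?"
    if strategy == "implicit":
        # Echo the value and move on; the user must barge in to correct it.
        return f"{value}, {next_question}"
    raise ValueError(f"unknown strategy: {strategy}")


print(confirm_prompt("Boston", "where …", "explicit"))
print(confirm_prompt("Boston", "where …", "implicit"))
```

With the implicit strategy the system only recovers from a misrecognition if it can hear the user's correction over its own prompt, which is why barge-in support matters.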
Grounding
- Showing evidence the system understands
  - System: Where are you traveling to?
  - User: Boston.
  - System: Right. Where ….
  - or: Boston, right. Where ….
Designing Prompts
- Constrain your questions:
  - How may I help you?
    - Long story reply
  - What bus number would you like schedules for?
    - Expect bus number replies