Speech Processing 11-492/18-492
Spoken Dialog Systems: SDS Components

More than just ASR and TTS:
Recognition
Language understanding
Manipulation of utterances
Generation of new information
Text generation
Synthesis
SDS Architecture

[Pipeline diagram: ASR → Language Understanding → Dialog Manager → Language Generation → Synthesis]
Error handling strategies for non-understanding
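The pipeline above can be sketched as a chain of function calls. Every component below is a hypothetical stand-in (the keyword NLU, the slot names, the canned prompts) written only to show how the stages hand data to one another, not any real toolkit:

```python
# A minimal sketch of the SDS pipeline as a chain of function calls. All
# component implementations here are illustrative stand-ins.

def asr(audio):
    return audio  # stand-in: pretend the audio is already transcribed

def understand(text):
    """Keyword NLU: map the transcript to a slot dictionary."""
    slots = {}
    if "boston" in text.lower():
        slots["destination"] = "BOS"
    if "tomorrow" in text.lower():
        slots["date"] = "tomorrow"
    return slots

def manage(slots):
    """Dialog manager: ask for missing info, otherwise act on the frame."""
    if "destination" not in slots:
        return {"act": "ask", "slot": "destination"}
    return {"act": "inform", "slots": slots}

def generate(action):
    if action["act"] == "ask":
        return "Where would you like to go?"
    return "Looking up flights to {}.".format(action["slots"]["destination"])

def synthesize(text):
    return text  # stand-in for TTS

def turn(audio):
    return synthesize(generate(manage(understand(asr(audio)))))

print(turn("I wanna go to Boston tomorrow"))
```

Each stage only sees the previous stage's output, which is why errors early in the chain (misrecognition) propagate into understanding and management.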
SDS Internals

Language Understanding
From words to structure

Dialog Manager
State of dialog (who is talking)
Direction of dialog (what next)
References, user profile, etc.
Interaction with database/internet

Language Generation
From structure to words
Language Understanding

Parsing of SPEECH not TEXT:
"Eh, I wanna go, wanna go to Boston tomorrow"
"If it's not too much trouble I'd be very grateful if one might be able to aid me in arranging my travel arrangements to Boston, Logan airport, at sometime tomorrow morning, thank you."
Both reduce to: Boston, tomorrow
Parsing: Output Structure

"I wanna go to Boston, tomorrow"
Destination: BOS
Departure: 20081028, AM
Airline: unspecified
Special: unspecified

Convert speech to structure
Sufficient for further processing/query
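The speech-to-structure step above can be illustrated with a toy keyword-spotting parser that maps the disfluent utterances to the slide's frame. The airport table and the matching rules are illustrative assumptions; a real system would use a robust semantic parser:

```python
# Toy keyword-spotting parser: disfluent speech in, slot frame out.
# The AIRPORTS table and matching heuristics are invented for the sketch.

AIRPORTS = {"boston": "BOS", "pittsburgh": "PIT"}

def parse(utterance):
    frame = {"Destination": None, "Departure": None,
             "Airline": "unspecified", "Special": "unspecified"}
    words = utterance.lower().replace(",", "").split()
    for w in words:
        if w in AIRPORTS:
            frame["Destination"] = AIRPORTS[w]
    if "tomorrow" in words:
        # append a day-part qualifier when one is mentioned
        frame["Departure"] = "tomorrow" + (", AM" if "morning" in words else "")
    return frame

print(parse("Eh, I wanna go, wanna go to Boston tomorrow"))
```

Because the parser only spots the content words it cares about, the repetitions and filled pauses ("Eh", "wanna go, wanna go") fall away for free.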
Interaction Example

User: find a cheap eating place for taiwanese food
Intelligent Agent: Cheap Taiwanese eating places include Din Tai Fung, Boiling Point, etc. What do you want to choose? I can help you go there.
SDS Process

User: find a cheap eating place for taiwanese food

[Dependency-parse diagram over the utterance: slots seeking ("find"), target ("eating place"), price ("cheap"), food ("taiwanese"), linked by relations AMOD, NN, PREP_FOR]

Intelligent Agent, over organized domain knowledge:
Ontology Induction (semantic slot)
Structure Learning (inter-slot relation)
Semantic Decoding: seeking="find", target="eating place", price="cheap", food="taiwanese"
Automatic Slot Induction
Chen et al. ASRU'13

"can i have a cheap restaurant"
Frames: capability, expensiveness, locale by use
Slot candidates: domain vs. domain-general
Parsing vs Language Model

Language Model
Model what actually gets said

Parsing
Extract the information you want

Models *can* be shared
Only accept things in the grammar
Can be over-limiting
Neural Networks for SLU

RNN for Slot Filling
Step 1: word embedding
Step 2: capture short-term dependencies
Step 3: capture long-term dependencies
Step 4: different types of neural architecture

http://deeplearning.net/tutorial/rnnslu.html#rnnslu
Mesnil et al. 2013
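Steps 1-3 above can be sketched with a tiny Elman RNN tagger: embed each word, carry a recurrent hidden state across the sentence, and emit one slot tag per word. The weights below are random (untrained), so only the shapes and data flow are meaningful; the vocabulary and IOB-style tag set are made up for the sketch:

```python
import math
import random

# Toy Elman RNN slot tagger (untrained, illustrative only).
random.seed(0)
vocab = {"find": 0, "a": 1, "cheap": 2, "eating": 3, "place": 4}
tags = ["O", "B-price", "B-target", "I-target"]
d_emb, d_hid = 4, 6

def mat(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

E = mat(len(vocab), d_emb)      # step 1: word embedding table
W_xh = mat(d_emb, d_hid)        # input-to-hidden weights
W_hh = mat(d_hid, d_hid)        # steps 2-3: recurrence carries context
W_hy = mat(d_hid, len(tags))    # hidden-to-tag weights

def matvec(v, W):
    return [sum(v[i] * W[i][j] for i in range(len(v))) for j in range(len(W[0]))]

def tag_sentence(words):
    h = [0.0] * d_hid
    out = []
    for w in words:
        x = E[vocab[w]]
        pre = [a + b for a, b in zip(matvec(x, W_xh), matvec(h, W_hh))]
        h = [math.tanh(p) for p in pre]      # hidden state summarizes history
        scores = matvec(h, W_hy)
        out.append(tags[scores.index(max(scores))])
    return out

print(tag_sentence("find a cheap eating place".split()))
```

Step 4 on the slide refers to swapping this vanilla recurrence for LSTMs, GRUs, or bidirectional variants, which change how long-term context is carried but not the per-word tagging interface.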
Interactive Learning for SLU

LUIS: interactive machine learning for language understanding
Advantages:
Non-experts can add knowledge through feature engineering
Active learning reduces heavy labeling

https://www.luis.ai/
Williams et al. 2016
Dialog Manager

Maintain state
Where are we in the dialog
Whose turn is it
Waiting for speaker
Waiting for database query (stall user)
Deal with barge-in
Frame-Based Dialog Manager

Used for transaction dialogs
Generalizes the finite-state approach by allowing multiple paths to acquire info
Central data structure is a frame with slots; the DM monitors the frame, filling in slots
Frame:
Set of information needed
Context for utterance interpretation
Context for dialogue progress
Allows mixed initiative
Allows over-answering
Also called form-based (MIT); often called "slot-filling"
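A frame-based DM can be sketched in a few lines: the frame is a dictionary of slots, and the manager always asks for whichever slot is still empty. Because a single user turn may fill several slots at once, over-answering falls out for free. The slot names and the pattern-matching NLU below are illustrative assumptions:

```python
import re

FRAME_SLOTS = ["origin", "destination", "time"]

def nlu(text):
    """Toy understanding: pull slots from patterns like 'from X', 'to Y', 'at Z'."""
    found = {}
    for slot, pat in [("origin", r"from (\w+)"),
                      ("destination", r"to (\w+)"),
                      ("time", r"at (\w+)")]:
        m = re.search(pat, text.lower())
        if m:
            found[slot] = m.group(1)
    return found

def next_action(frame):
    """The DM monitors the frame and asks for the first empty slot."""
    for slot in FRAME_SLOTS:
        if frame.get(slot) is None:
            return "ask_" + slot
    return "query_database"

frame = dict.fromkeys(FRAME_SLOTS)
# Over-answering: one turn fills two slots, so the DM skips a question.
frame.update(nlu("from CMU to Downtown"))
print(next_action(frame))   # only 'time' is still missing
frame.update(nlu("at noon"))
print(next_action(frame))
```

This is exactly how the frame generalizes a finite-state design: there is no fixed question order, only a check for what is still missing.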
Problems with Frames

Not easily applicable to complex tasks
May not be a single frame
Dynamic construction of information
User access to "product"
Agenda + Frame

Product:
Hierarchical composition of frames
Process:
Agenda
Generalization of a stack
Ordered list of topics
List of handlers
Statistical Approaches to DM

Allow for dialog complexity beyond the human mind
Find optimal decisions for non-trivial design problems
Life-long learning

Decisions
Difficult design decisions arise over the course of an interaction:
When to ask open / directive questions?
When to confirm?
When to barge in / wait?
Which type of feedback to provide? (e.g., an intelligent tutoring system)

Sample-efficient policy search
The policy space is too huge to search with traditional ways of SDS development
System-Initiative vs Mixed-Initiative

System-initiative:
S1: Welcome to CMU Let's Go. Where do you leave from?
U1: CMU
S2: From CMU, did I get that right?
U2: Yes.
S3: Where are you going?
U3: Downtown.
S4: To Downtown, did I get that right?
U4: Yes.

Mixed-initiative:
S1: Welcome to CMU Let's Go. How may I help you?
U1: I'd like to go from CMU to Downtown.
S2: From CMU to Downtown, did I get that right?
U2: Yes.
S3: When are you going to take the bus?
U3: Now
S4: You want the next bus, is that right?
U4: Yes.
Confirm Each vs Confirm All

Confirm each:
S1: Welcome to CMU Let's Go. Where do you leave from?
U1: CMU
S2: From CMU, did I get that right?
U2: Yes.
S3: Where are you going?
U3: Downtown.
S4: To Downtown, did I get that right?
U4: Yes.

Confirm all:
S1: Welcome to CMU Let's Go. Where do you leave from?
U1: CMU
S2: Where are you going?
U2: Downtown.
S3: When are you going to take the bus?
U3: Now
S4: Leaving from CMU going to Downtown immediately, is that correct?
Explicit Confirm vs Implicit Confirm

Explicit:
S1: Welcome to CMU Let's Go. Where do you leave from?
U1: CMU
S2: From CMU, did I get that right?
U2: Yes.
S3: Where are you going?
U3: Downtown.
S4: To Downtown, did I get that right?
U4: Yes.

Implicit:
S1: Welcome to CMU Let's Go. Where do you leave from?
U1: CMU
S2: Departing CMU. Where are you going?
U2: Downtown.
S3: Going to Downtown. When are you going to take the bus?
U3: Now
Corrective Feedback

S1: Welcome to CMU Let's Go. How may I help you?
U1: I likes go CMU to Downtown.
S2: You must not use a third-person singular verb after a first-person singular pronoun. [explicit correction]

S1: Welcome to CMU Let's Go. How may I help you?
U1: I likes go CMU to Downtown.
S2: You said "I like to go from CMU to Downtown." Did I get that right? [recast]
Traditional Policy Design

Traditional SDS development paradigm:
Several versions of a system are developed (each version uses a single dialog policy, intuitively designed by an expert)
Dialog corpora are collected with human users interacting with different versions of the system
A number of evaluation metrics are measured for each dialog
The different versions are statistically compared
Update the system with the "best" dialog policy

Due to the costs of experimentation, only a handful of policies are usually explored in any one experiment
Problem of Traditional Policy

The number of possible policies is much larger than the limited data can distinguish
Modeling

The agent updates its state by observing the environment
Given the state, the agent performs the best action to achieve its goal
Need to learn the mapping h from s ∈ S to a ∈ A

Learning

Supervised learning
For each input state s, the optimal action a is known
Infer the input-output relationship a ≈ h(s)
Examples: neural networks, support vector machines
Markov Decision Process

No data:
Human-human dialogs differ from human-machine ones because of differences between human and machine perception
Wizard of Oz is costly, and the wizard is hard to train

A statistical method to formulate the question:
An MDP computes an optimal dialog policy within a much larger search space, using a relatively small number of training dialogs
An MDP evaluates actions (small number), not policies (large number), by credit assignment of the total reward to each action using dynamic programming
Markov Decision Process

Learning Goal

Dialog Management as MDP
Model dialog with:
States
Actions
Transition function
Reward function
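The four ingredients above can be made concrete with a toy dialog MDP: states are stages of a slot-filling dialog, actions are system moves, and value iteration (the dynamic programming mentioned earlier) assigns credit to individual actions. The states, transitions, and rewards below are invented for illustration:

```python
# Toy dialog MDP solved by value iteration. All numbers are illustrative.
states = ["no_info", "unconfirmed", "confirmed"]
actions = ["ask", "confirm"]

# P[s][a] = list of (next_state, probability); R[s][a] = immediate reward
P = {
    "no_info":     {"ask": [("unconfirmed", 0.8), ("no_info", 0.2)],
                    "confirm": [("no_info", 1.0)]},
    "unconfirmed": {"ask": [("unconfirmed", 1.0)],
                    "confirm": [("confirmed", 0.9), ("no_info", 0.1)]},
    "confirmed":   {"ask": [("confirmed", 1.0)],
                    "confirm": [("confirmed", 1.0)]},
}
R = {"no_info": {"ask": -1, "confirm": -1},      # each extra turn costs -1
     "unconfirmed": {"ask": -1, "confirm": -1},
     "confirmed": {"ask": 0, "confirm": 0}}      # goal state, no further cost

gamma = 0.95
V = {s: 0.0 for s in states}
for _ in range(100):   # value iteration: back up expected returns
    V = {s: max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
                for a in actions)
         for s in states}

# Greedy policy: in each state, pick the action with the best backed-up value.
policy = {s: max(actions,
                 key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a]))
          for s in states}
print(policy)
```

The resulting policy asks when it knows nothing and confirms once it has an unconfirmed value, which is the credit-assignment point from the slide: each individual action, not a whole hand-built policy, gets scored.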
Example: Non-Task-Oriented System Architecture

[Architecture diagram: user input → context tracking → conversation strategy selection (lexical/semantic strategies, engagement strategies, appropriateness strategies) → response generation method → response generation → system output; high vs. low generation confidence determines whether a strategy's trigger condition is met]
Active Participation

Start conversations with a concrete topic
e.g. TickTock: Hi, I am TickTock, I really like sports, let's talk about sports.
Design each strategy with respect to a topic:
Stay on the topic
Change the topic
Stay on the Topic

S1: Tell a joke (joke)
e.g. Do you know that people usually spend far more time watching sports than actually playing any?
S2: Initiate activities (initiate)
e.g. Do you want to see a game together some time?
S3: Talk more (more)
e.g. Let's talk more about sports.
Change the Topic

S4: Switch topics (switch)
e.g. How about we talk about movies?
S5: End topics with an open question (end)
e.g. That's interesting, could you share with me some interesting news on the Internet?
A Failure Conversation Case

TickTock: Hello, I really like politics. Let's talk about politics.
User: No, I don't like politics.
TickTock: Why is that?
User: I just don't like politics.
TickTock: How about we watch a debate together some day? (initiate activity)
User: I told you I don't want to talk about politics!

Not appropriate!
Sentiment Context

TickTock: Hello, I really like politics. Let's talk about politics. [neutral]
User: No, I don't like politics. [negative]
TickTock: Why is that? [neutral]
User: I just don't like politics. [negative]
TickTock: OK, how about we talk about movies? (switch topics)
User: Sure. I do watch movies a lot. [neutral]

Appropriate!
Engagement Maintenance Policy

Goal: improve system appropriateness considering context
Method: reinforcement learning (Q-learning)

State variables (S):
System-appropriateness confidence
All previous utterance-sentiment confidences
Number of times each strategy has been executed
Turn position
Most recently used strategy

Actions (A): engagement strategies (5 types) and the generated utterance
Model Detail

Reward function (R): Cumulated Appropriateness * 0.3 + Conversation Depth * 0.3 + Information Gain * 0.4

Appropriateness: the current response's coherence with the user utterance
Automatic predictor: SVM binary classifier (Inappropriate/Interpretable vs. Appropriate)
Conversation depth: the maximum number of consecutive utterances on the same topic
Automatic predictor: SVM binary classifier (Shallow/Intermediate vs. Deep)
Information gain: the number of unique tokens

Update:
Simulator: A.L.I.C.E. chatbot; interface: multithreaded text-only API (open source)
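The reward above is a weighted sum, so it is a one-liner to compute. The 0.3/0.3/0.4 weights and the "unique tokens" definition of information gain come from the slide; how the per-turn appropriateness labels are accumulated is an assumption of this sketch:

```python
# Direct transcription of the slide's reward function. The weights come from
# the slide; summing per-turn 0/1 appropriateness labels is an assumption.

def reward(appropriateness_scores, conversation_depth, tokens):
    """appropriateness_scores: per-turn 0/1 labels from the SVM classifier,
    conversation_depth: max consecutive same-topic utterances,
    tokens: all tokens produced so far."""
    cumulated_appropriateness = sum(appropriateness_scores)
    information_gain = len(set(tokens))  # number of unique tokens
    return (0.3 * cumulated_appropriateness
            + 0.3 * conversation_depth
            + 0.4 * information_gain)

print(reward([1, 0, 1], 2, "let us talk more about sports let us".split()))
```

Note the heaviest weight is on information gain, which pushes the policy toward responses that add new content rather than repeat safe filler.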
Language Generation

Query for flights to Boston
Template-fill answer(s):
"The next flight to DEST leaves at DEPART_TIME arriving at ARRIVE_TIME."
Templates may be much more complex
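Template filling is just named-placeholder substitution from a result frame into the template string. The template text is from the slide; the frame values, including the arrival time, are invented for the example:

```python
# Filling the slide's template from a result frame. The template is from the
# slide; the frame values (including the arrival time) are illustrative.

TEMPLATE = ("The next flight to {DEST} leaves at {DEPART_TIME} "
            "arriving at {ARRIVE_TIME}.")

frame = {"DEST": "Boston", "DEPART_TIME": "07:45", "ARRIVE_TIME": "09:10"}

def fill(template, frame):
    # str.format handles named-placeholder substitution directly
    return template.format(**frame)

print(fill(TEMPLATE, frame))
```

More complex templates add conditional fragments and per-slot formatting (e.g., rendering 07:45 as "seven forty-five" for synthesis), but the substitution core stays the same.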
Language Generation

Choose which template to use
Based on state, answer type
Natural variation
Statistical variation
Include <ssml> tags to help synthesis
Can <emph>emphasize</emph> parts
Can identify dates, numbers, etc.
Humans like variation in the output
It is rare for a human to repeat things exactly
Language Generation

Frame structures to (marked-up) text:
START: Pittsburgh
END: Boston
DATE: 20081028
TIME: 07:45
FLIGHT: US075

Can generate:
"I have US 075 leaving at 07:45 tomorrow"
"US Airways has a flight departing tomorrow at 07:45"
Standardized Things

Help
User should be able to get help at any time
Explain where they are and what they are expected to say (with explicit examples)

Errors
"I didn't understand" ...

Confirmation
Did you say "Boston"?
Designing Prompts

Constrain your questions:
"How may I help you?" invites a long story reply
"What bus number would you like schedules for?" invites bus-number replies