Speech Processing 11-492/18-492 - PowerPoint PPT Presentation



SLIDE 1

Speech Processing 11-492/18-492

Spoken Dialog Systems: SDS components

SLIDE 2

Spoken Dialog Systems

- More than just ASR and TTS:
  - Recognition
  - Language understanding
  - Manipulation of utterances
  - Generation of new information
  - Text generation
  - Synthesis

SLIDE 3

SDS Architecture

[Architecture diagram: ASR → Language Understanding → Dialog Manager → Language Generation → Synthesis, with error-handling strategies for non-understanding]

SLIDE 4

SDS Internals

- Language Understanding
  - From words to structure
- Dialog Manager
  - State of dialog (who is talking)
  - Direction of dialog (what next)
  - References, user profile, etc.
  - Interaction with database/internet
- Language Generation
  - From structure to words

SLIDE 5

Language Understanding

- Parsing of SPEECH not TEXT:
  - "Eh, I wanna go, wanna go to Boston tomorrow"
  - "If it's not too much trouble I'd be very grateful if one might be able to aid me in arranging my travel arrangements to Boston, Logan airport, at sometime tomorrow morning, thank you."
  - "Boston, tomorrow"

SLIDE 6

Parsing: Output structure

- "I wanna go to Boston, tomorrow"
  - Destination: BOS
  - Departure: 20081028, AM
  - Airline: unspecified
  - Special: unspecified
- Convert speech to structure
- Sufficient for further processing/query
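The speech-to-structure step above can be sketched as a tiny keyword-spotting parser that ignores disfluencies and fills a flat slot frame. The city table, date handling, and function name here are illustrative assumptions, not the actual grammar used in the course:

```python
import re

# Minimal keyword-spotting "parser" for the flight domain.
# CITY_CODES and the naive date arithmetic are invented for illustration.
CITY_CODES = {"boston": "BOS", "pittsburgh": "PIT"}

def parse_utterance(text, today="20081027"):
    """Map a spoken utterance to a flat slot frame, ignoring disfluencies."""
    frame = {"Destination": None, "Departure": None,
             "Airline": "unspecified", "Special": "unspecified"}
    tokens = re.findall(r"[a-z']+", text.lower())
    for tok in tokens:
        if tok in CITY_CODES:
            frame["Destination"] = CITY_CODES[tok]
    if "tomorrow" in tokens:
        # Naive date arithmetic: bump the day field of YYYYMMDD.
        frame["Departure"] = today[:6] + f"{int(today[6:]) + 1:02d}"
    return frame

frame = parse_utterance("Eh, I wanna go, wanna go to Boston tomorrow")
```

Note how the repeated "wanna go" and the filler "Eh" simply fall through: the parser only attends to the tokens it cares about, which is exactly why parsing speech differs from parsing text.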

SLIDE 7

Interaction Example

User: find a cheap eating place for taiwanese food
Intelligent Agent: Cheap Taiwanese eating places include Din Tai Fung, Boiling Point, etc. What do you want to choose? I can help you go there.

SLIDE 8

SDS Process

[Diagram: the user utterance "find a cheap eating place for taiwanese food" is dependency-parsed by the intelligent agent: seeking="find"; price related by AMOD; target by NN; food by PREP_FOR]

SLIDE 9

SDS Process

[Diagram: the same dependency parse, now grounded in organized domain knowledge via ontology induction (semantic slots)]
SLIDE 10

SDS Process

[Diagram: ontology induction (semantic slots) plus structure learning (inter-slot relations) over the organized domain knowledge]
SLIDE 11

SDS Process

[Diagram: the parsed utterance is filled in as seeking="find", target="eating place", price="cheap", food="taiwanese"]
SLIDE 12

SDS Process

[Diagram: semantic decoding maps "find a cheap eating place for taiwanese food" to seeking="find", target="eating place", price="cheap", food="taiwanese"]

SLIDE 13

Automatic Slot Induction (Chen et al. ASRU'13)

[Diagram: "can i have a cheap restaurant" mapped to FrameNet-style frames: capability, expensiveness, locale by use; slot candidates are split into domain-specific vs. general]

SLIDE 14

Parsing vs Language Model

- Language Model
  - Model what actually gets said
- Parsing
  - Extract the information you want
- Models *can* be shared
- Only accept things in the grammar
  - Can be over-limiting

SLIDE 15

Neural Networks for SLU

- RNN for Slot Filling
  - Step 1: word embedding
  - Step 2: capture short-term dependencies
  - Step 3: capture long-term dependencies
  - Step 4: different types of neural architecture

http://deeplearning.net/tutorial/rnnslu.html#rnnslu
(Mesnil et al. 2013)
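The four steps above can be sketched as an untrained Elman-style RNN tagger: an embedding lookup, a 3-word context window for short-term dependencies, a recurrent hidden state for long-term ones, and a per-token slot classifier. All sizes, weights, vocabulary, and slot labels are illustrative, not those of the linked tutorial:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"find": 0, "a": 1, "cheap": 2, "eating": 3, "place": 4}
slots = ["O", "B-price", "B-target", "I-target"]   # invented tag set
d_emb, d_hid = 8, 16

E = rng.normal(size=(len(vocab), d_emb))          # step 1: embeddings
W_in = rng.normal(size=(3 * d_emb, d_hid))        # step 2: 3-word window
W_rec = rng.normal(size=(d_hid, d_hid))           # step 3: recurrence
W_out = rng.normal(size=(d_hid, len(slots)))      # step 4: output layer

def tag(words):
    """Emit one slot label per input word."""
    ids = [vocab[w] for w in words]
    padded = [ids[0]] + ids + [ids[-1]]           # pad window at the edges
    h = np.zeros(d_hid)
    out = []
    for t in range(len(ids)):
        x = E[padded[t:t + 3]].reshape(-1)        # concatenated 3-word window
        h = np.tanh(x @ W_in + h @ W_rec)         # Elman recurrence
        out.append(slots[int(np.argmax(h @ W_out))])
    return out

labels = tag(["find", "a", "cheap", "eating", "place"])
```

With random weights the specific labels are meaningless; training (e.g., cross-entropy over labeled slots, as in the tutorial) is what makes them useful.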

SLIDE 16

Interactive Learning for SLU

LUIS: interactive machine learning for language understanding. Advantages:

- Non-experts can add knowledge through feature engineering
- Active learning reduces heavy labeling

https://www.luis.ai/
(Williams et al. 2016)

SLIDE 17

Dialog Manager

- Maintain state
  - Where are we in the dialog
  - Whose turn is it
  - Waiting for speaker
  - Waiting for database query (stall user)
- Deal with barge-in

SLIDE 18

Frame-Based Dialog Manager

- Used for transaction dialogs
- Generalizes the finite-state approach by allowing multiple paths to acquire info
- Central data structure is a frame with slots
  - DM monitors the frame, filling in slots
- Frame:
  - Set of information needed
  - Context for utterance interpretation
  - Context for dialogue progress
- Allows mixed initiative
  - Allows over-answering
- Also called form-based (MIT); often called "slot-filling"
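A toy frame-based dialog manager in the spirit of this slide: the DM monitors a frame of slots, prompts for the next empty one, and accepts over-answering (several slots filled in one user turn). The slot names and prompt strings are illustrative:

```python
class FrameDM:
    """Minimal slot-filling dialog manager (illustrative slots/prompts)."""
    PROMPTS = {
        "departure": "Where do you leave from?",
        "destination": "Where are you going?",
        "time": "When are you going to take the bus?",
    }

    def __init__(self):
        self.frame = {slot: None for slot in self.PROMPTS}

    def observe(self, parsed):
        """Merge understood slot/value pairs into the frame (mixed initiative)."""
        for slot, value in parsed.items():
            if slot in self.frame:
                self.frame[slot] = value

    def next_prompt(self):
        """Ask for the first unfilled slot, or finish when the frame is full."""
        for slot, value in self.frame.items():
            if value is None:
                return self.PROMPTS[slot]
        return "Looking up your bus..."

dm = FrameDM()
dm.observe({"departure": "CMU", "destination": "Downtown"})  # over-answering
prompt = dm.next_prompt()   # only "time" is still empty
```

Because the frame, not a fixed state graph, drives the next question, the user can supply slots in any order, which is exactly what distinguishes this from a finite-state design.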

SLIDE 19

Problems with Frames

- Not easily applicable to complex tasks
  - May not be a single frame
  - Dynamic construction of information
  - User access to "product"

SLIDE 20

Agenda + Frame

- Product: hierarchical composition of frames
- Process: agenda
  - Generalization of a stack
  - Ordered list of topics
  - List of handlers
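The agenda idea above can be sketched as an ordered list of topic handlers that generalizes a stack: handlers are pushed like stack frames, but any topic can be promoted to the front when the user jumps around the product hierarchy. The topic names and handler functions are invented for illustration:

```python
class Agenda:
    """Ordered list of (topic, handler) pairs; a generalization of a stack."""
    def __init__(self):
        self.handlers = []

    def push(self, topic, handler):
        # Like a stack push: newest topic goes to the front.
        self.handlers.insert(0, (topic, handler))

    def promote(self, topic):
        """Unlike a pure stack, any existing topic can be moved to the front."""
        for i, (name, _) in enumerate(self.handlers):
            if name == topic:
                self.handlers.insert(0, self.handlers.pop(i))
                return

    def current(self):
        return self.handlers[0][0] if self.handlers else None

agenda = Agenda()
agenda.push("hotel", lambda: "Which hotel?")
agenda.push("flight", lambda: "Which flight?")   # flight now on top
agenda.promote("hotel")                          # user changed topic
```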

SLIDE 21

Statistical Approaches to DM

- Allow for dialog complexity beyond what a human designer can hold in mind
- Find optimal decisions for non-trivial design problems
- Life-long learning

SLIDE 22

Decisions

- Difficult design decisions over the course of an interaction:
  - When to ask open / directive questions?
  - When to confirm?
  - When to barge-in / wait?
  - Which type of feedback to provide? (e.g., in an intelligent tutoring system)
- Sample-efficient policy search
  - The policy space is too huge to search with traditional ways of SDS development

SLIDE 23

System-Initiative VS Mixed-Initiative

System-initiative:
S1: Welcome to CMU Let's Go. Where do you leave from?
U1: CMU
S2: From CMU, did I get that right?
U2: Yes.
S3: Where are you going?
U3: Downtown.
S4: To Downtown, did I get that right?
U4: Yes.

Mixed-initiative:
S1: Welcome to CMU Let's Go. How may I help you?
U1: I'd like to go from CMU to Downtown.
S2: From CMU to Downtown, did I get that right?
U2: Yes.
S3: When are you going to take the bus?
U3: Now.
S3: You want the next bus, is that right?
U3: Yes.

SLIDE 24

Confirm Each VS Confirm All

Confirm each:
S1: Welcome to CMU Let's Go. Where do you leave from?
U1: CMU
S2: From CMU, did I get that right?
U2: Yes.
S3: Where are you going?
U3: Downtown.
S4: To Downtown, did I get that right?
U4: Yes.

Confirm all:
S1: Welcome to CMU Let's Go. Where do you leave from?
U1: CMU
S2: Where are you going?
U2: Downtown.
S3: When are you going to take the bus?
U3: Now.
S4: Leaving from CMU going to Downtown immediately, is that correct?

SLIDE 25

Explicit Confirm VS Implicit Confirm

Explicit confirm:
S1: Welcome to CMU Let's Go. Where do you leave from?
U1: CMU
S2: From CMU, did I get that right?
U2: Yes.
S3: Where are you going?
U3: Downtown.
S4: To Downtown, did I get that right?
U4: Yes.

Implicit confirm:
S1: Welcome to CMU Let's Go. Where do you leave from?
U1: CMU
S2: Departing CMU. Where are you going?
U2: Downtown.
S3: Going to Downtown. When are you going to take the bus?
U3: Now.
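The two confirmation styles can be written as prompt templates: an explicit confirm is its own turn, while an implicit confirm is folded into the next question. The function names and wording are illustrative; a real system would choose a style from ASR confidence:

```python
def explicit_confirm(slot_value):
    # A dedicated yes/no turn: costs a turn, but catches errors early.
    return f"From {slot_value}, did I get that right?"

def implicit_confirm(slot_value, next_question):
    # Confirmation is embedded in the next question; the user corrects
    # only if the echoed value is wrong.
    return f"Departing {slot_value}. {next_question}"

e = explicit_confirm("CMU")
i = implicit_confirm("CMU", "Where are you going?")
```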

SLIDE 26

Corrective Feedback

Explicit correction:
S1: Welcome to CMU Let's Go. How may I help you?
U1: I likes go CMU to Downtown.
S2: You must not use a third person singular verb after a first person singular pronoun. [explicit correction]

Recast:
S1: Welcome to CMU Let's Go. How may I help you?
U1: I likes go CMU to Downtown.
S2: You said "I like to go from CMU to Downtown." Did I get that right? [recast]

SLIDE 27

Traditional Policy Design

- Traditional SDS development paradigm:
  - Several versions of a system are developed (each using a single dialog policy, intuitively designed by an expert)
  - Dialog corpora are collected with human users interacting with the different versions of the system
  - A number of evaluation metrics are measured for each dialog
  - The different versions are statistically compared
  - The system is updated with the "best" dialog policy
- Due to the costs of experimentation, only a handful of policies are usually explored in any one experiment

SLIDE 28

Problem of Traditional Policy

The number of possible policies is much larger than the limited data can explore.

SLIDE 29

Modeling

- The agent updates its state by observing the environment
- Given the state, the agent performs the best action to achieve its goal
- Need to learn the mapping h from s ∈ S to a ∈ A

SLIDE 30

Learning

- Supervised learning
  - For each input state s, the optimal action a is known
  - Infer the input-output relationship a ≈ h(s)
  - Examples: neural networks, support vector machines

SLIDE 31

Markov Decision Process

- No data:
  - Human-human dialogs differ from human-machine interaction
  - Wizard-of-Oz collection is costly, and the wizard is hard to train
- A statistical method to formulate the question:
  - MDP computes an optimal dialog policy within a much larger search space, using a relatively small number of training dialogs
  - MDP evaluates actions (small in number), not policies (large in number), by credit assignment of the total reward to each action using dynamic programming
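The credit-assignment point can be made concrete with a tiny invented dialog MDP: states count how many slots are confirmed, and the actions are "ask" (slow but reliable) vs "guess" (fast but may reset the dialog). Value iteration scores each action per state rather than enumerating whole policies; all transition probabilities and rewards here are made up for illustration:

```python
STATES = [0, 1, 2]                 # 2 = both slots confirmed (terminal)
ACTIONS = ["ask", "guess"]
GAMMA = 0.95

def transitions(s, a):
    """Return [(prob, next_state, reward), ...] for state s, action a."""
    if s == 2:
        return [(1.0, 2, 0.0)]
    if a == "ask":                 # always advances, small per-turn cost
        return [(1.0, s + 1, -1.0)]
    # "guess": big reward if right, back to the start if wrong
    return [(0.3, s + 1, 2.0), (0.7, 0, -3.0)]

# Value iteration (the dynamic-programming credit assignment).
V = {s: 0.0 for s in STATES}
for _ in range(200):
    V = {s: max(sum(p * (r + GAMMA * V[ns])
                    for p, ns, r in transitions(s, a))
                for a in ACTIONS)
         for s in STATES}

def best_action(s):
    return max(ACTIONS, key=lambda a: sum(p * (r + GAMMA * V[ns])
                                          for p, ns, r in transitions(s, a)))
```

With these numbers the 70% chance of restarting makes "guess" a losing bet in every state, so the computed policy always asks; changing the reward table changes the policy without redesigning anything by hand.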

SLIDE 32

Markov Decision Process

SLIDE 33

Learning Goal

SLIDE 34

Dialog Management as MDP

SLIDE 35

Model Dialog

SLIDE 36

State

SLIDE 37

Actions

SLIDE 38

Transition Functions

SLIDE 39

Reward Function

SLIDE 40

Example: Non-Task-Oriented System Architecture

[Architecture diagram: user input → context tracking → response generation; high-confidence generations go straight to system output, while low generation confidence triggers conversation strategy selection (lexical, semantic, engagement, and appropriateness strategies) before output]

SLIDE 41

Active Participation

- Start conversations with a concrete topic
  - e.g. TickTock: "Hi, I am TickTock, I really like sports, let's talk about sports."
- Design each strategy with respect to a topic
  - Stay on the topic
  - Change the topic

SLIDE 42

Stay on the Topic

- S1: Tell a joke (joke)
  - e.g. "Do you know that people usually spend far more time watching sports than actually playing any?"
- S2: Initiate activities (initiate)
  - e.g. "Do you want to see a game together some time?"
- S3: Talk more (more)
  - e.g. "Let's talk more about sports."

SLIDE 43

Change the Topic

- S4: Switch topics (switch)
  - e.g. "How about we talk about movies?"
- S5: End topics with an open question (end)
  - e.g. "That's interesting, could you share with me some interesting news on the Internet?"

SLIDE 44

A Failure Conversation Case

TickTock: Hello, I really like politics. Let's talk about politics.
User: No, I don't like politics.
TickTock: Why is that?
User: I just don't like politics.
TickTock: How about we watch a debate together some day? (initiate activity)
User: I told you I don't want to talk about politics!

Not appropriate!

SLIDE 45

Sentiment Context

TickTock: Hello, I really like politics. Let's talk about politics. [neutral]
User: No, I don't like politics. [negative]
TickTock: Why is that? [neutral]
User: I just don't like politics. [negative]
TickTock: OK, how about we talk about movies? (switch topics)
User: Sure. I do watch movies a lot. [neutral]

Appropriate!

SLIDE 46

Engagement Maintenance Policy

- Goal: improve system appropriateness considering context
- Method: reinforcement learning (Q-learning)
- State variables (S):
  - system-appropriateness confidence
  - all previous utterance-sentiment confidences
  - times each strategy has been executed
  - turn position
  - most recently used strategy
- Actions (A): engagement strategies (5 types) and the generated utterance

SLIDE 47

Model Detail

- Reward function (R): Cumulated Appropriateness × 0.3 + Conversation Depth × 0.3 + Information Gain × 0.4
- Appropriateness: the current response's coherence with the user utterance
  - Automatic predictor: SVM binary classifier (Inappropriate/Interpretable vs Appropriate)
- Conversation depth: the maximum number of consecutive utterances on the same topic
  - Automatic predictor: SVM binary classifier (Shallow/Intermediate vs Deep)
- Information gain: the number of unique tokens
- Update:
  - Simulator: A.L.I.C.E. chatbot; interface: multithreaded text-only API (open source)
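The reward function above is a straight weighted sum. A minimal sketch, assuming the three component scores have already been normalized to [0, 1] (the slide does not specify the scaling):

```python
def reward(appropriateness, depth, info_gain):
    """Weighted reward from the slide: 0.3 / 0.3 / 0.4."""
    return 0.3 * appropriateness + 0.3 * depth + 0.4 * info_gain

# e.g. a fully appropriate turn of middling depth and low information gain:
r = reward(appropriateness=1.0, depth=0.5, info_gain=0.25)
```

Because the weights sum to 1, the reward stays in the same [0, 1] range as its inputs, which keeps the Q-learning targets well scaled.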

SLIDE 48

Language Generation

- Query for flights to Boston
- Template-fill answer(s)
  - "The next flight to DEST leaves at DEPART_TIME arriving at ARRIVE_TIME."
- Templates may be much more complex

SLIDE 49

Language Generation

- Choose which template to use
  - Based on state, answer type
  - Natural variation
  - Statistical variation
- Include <ssml> tags to help synthesis
  - Can <emph>emphasize</emph> parts
  - Can identify dates, numbers, etc.
- Humans like variation in the output
  - It is rare for a human to repeat things exactly

SLIDE 50

Language Generation

- Frame structures to (marked-up) text:
  - START: Pittsburgh
  - END: Boston
  - DATE: 20081028
  - TIME: 07:45
  - FLIGHT: US075
- Can generate:
  - "I have US 075 leaving at 07:45 tomorrow"
  - "US Airways has a flight departing tomorrow at 07:45"
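Frame-to-text generation with the natural variation the slides call for can be sketched as a handful of templates chosen at random per turn. The template wording and the airline code-to-name table are invented helpers, not the course's actual generator:

```python
import random

TEMPLATES = [
    "I have {FLIGHT} leaving at {TIME} tomorrow.",
    "{AIRLINE} has a flight departing tomorrow at {TIME}.",
]
AIRLINES = {"US": "US Airways"}    # illustrative code-to-name table

def generate(frame, rng=random):
    """Fill a randomly chosen template from the frame for output variation."""
    tmpl = rng.choice(TEMPLATES)
    airline = AIRLINES[frame["FLIGHT"][:2]]       # derive carrier from code
    return tmpl.format(AIRLINE=airline, **frame)

frame = {"START": "Pittsburgh", "END": "Boston",
         "DATE": "20081028", "TIME": "07:45", "FLIGHT": "US075"}
sentence = generate(frame)
```

Wrapping the filled values in SSML (e.g., an emphasis tag around the time) would be a natural extension at this point, since the generator knows exactly which spans are dates and numbers.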

SLIDE 51

Standardized things

- Help
  - User should be able to get help at any time
  - Explain where they are and what they are expected to say (with explicit examples)
- Errors
  - "I didn't understand" ...
- Confirmation
  - Did you say "Boston"?

SLIDE 52

Designing Prompts

- Constrain your questions:
  - "How may I help you?" invites a long story reply
  - "What bus number would you like schedules for?" expects bus-number replies

slide-53
SLIDE 53