Speech Processing 11-492/18-495

SLIDE 1

Speech Processing 11-492/18-495

Spoken Dialog Systems: Conversing with machines

SLIDE 2

Spoken Dialog Systems

- Not just ASR bolted onto TTS
- Different styles of interaction
  - Question/response systems
  - Mixed initiative systems
  - “How May I Help You?” open questions
  - True conversational machine-human interaction

SLIDE 3

SDS Overview

- Introduction
- Building simple dialog systems
- VoiceXML
  - A language for writing systems
- Beyond tree-based systems
- Beyond spoken language
- Non-task-oriented systems
- Real-world deployment considerations

SLIDE 4

SDS Applications

- Information giving/request
  - Flights, buses, stocks and weather
  - Driving directions
  - Answer questions, news
- Transactional
  - Reply to your email
  - Credit card and bank enquiries, product purchase
- Maintenance
  - Technical support
  - Customer service

SLIDE 5
SLIDE 6

SDS Applications

- Entertainment
  - Game characters (NPC), toys, robots
- Tutoring
  - Math, science
  - Language learning
- Health care
  - Depression screening
  - Aphasia therapy

SLIDE 7

Dialog Types

- System initiative
  - Form-filling paradigm
  - Can switch language models at each turn
  - Can “know” what is likely to be said
- Mixed initiative
  - Users can go where they like
  - System or user can lead the discussion
- Classifying:
  - Users can say what they like
  - But really only “N” operations possible
  - E.g. AT&T’s “How may I help you?”

SLIDE 8

System Initiative

- Most common
  - Machine controls the call
  - Few choices in the dialog
- Simple form filling:
  - What is your bank account number
- Advantages:
  - You know what users will say (sort of)
  - Hard for user to get confused
  - Hard for system to get confused
  - Easy to build
- Disadvantages:
  - Limited flexibility in interaction
  - Fixed dialog structure
- Most reliable, but many turns
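The form-filling style of system initiative can be sketched in a few lines. This is only an illustration, not the course's implementation: the `FORM` table, the 8-digit account rule, and the `scripted_user` stand-in for a recognizer are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical form: slot name, prompt, and a validator for the reply.
# The 8-digit account rule is an assumption made for this sketch.
FORM: List[Tuple[str, str, Callable[[str], bool]]] = [
    ("account", "What is your bank account number?",
     lambda s: s.isdigit() and len(s) == 8),
    ("action", "Say balance or transfer.",
     lambda s: s in {"balance", "transfer"}),
]


def run_form(get_reply: Callable[[str], str], max_tries: int = 3) -> Dict[str, str]:
    """Ask for each slot in a fixed order and re-prompt on invalid replies."""
    filled: Dict[str, str] = {}
    for slot, prompt, is_valid in FORM:
        for _ in range(max_tries):
            reply = get_reply(prompt).strip().lower()
            if is_valid(reply):
                filled[slot] = reply
                break
        else:
            # The machine controls the call, so it gives up after max_tries.
            raise ValueError(f"could not fill slot {slot!r}")
    return filled


def scripted_user(replies: List[str]) -> Callable[[str], str]:
    """Stand-in for ASR: returns canned replies in order."""
    it = iter(replies)
    return lambda prompt: next(it)
```

Because the system asks one fixed question at a time, each validator (or, in a real system, each per-turn language model) only has to cover the answers expected at that point, which is why this style is reliable but takes many turns.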

SLIDE 9

System Initiative

- Let’s Go Bus Information
  - 412 268 3526 (Anytime)
  - Provides bus information for Pittsburgh
- Tell Me
  - Company getting others to build systems
  - Stocks, weather, entertainment

SLIDE 10

Mixed Initiative

- User or system takes initiative
- More interesting dialogs
  - “Jump” through different parts of dialog state
- Advantages
  - More realistic dialog
  - Can do more complex tasks
- Disadvantages
  - Can get confusing
  - Can miss important parts

SLIDE 11

Classification Dialogs

- Sort out from N things
  - User says “anything” and system directs them
- Receptionist
  - I have a problem with my bill
  - What’s the area code for Miami
  - Did you know I can see the beach from here
- Advantages
  - (Apparently) complex understanding
  - Solves a very common task
- Disadvantages
  - Actually quite restrictive
  - Needs data to train from
  - Needs to be updated
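A classification dialog can be sketched as a tiny call router trained from labeled utterances. The training pairs, destination names, the `min_score` threshold, and the word-counting scheme below are all assumptions chosen for illustration; a deployed system would use a real classifier and far more data.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Hypothetical training data: (utterance, destination) pairs.
TRAIN: List[Tuple[str, str]] = [
    ("i have a problem with my bill", "billing"),
    ("my bill is wrong", "billing"),
    ("what's the area code for miami", "directory"),
    ("i need an area code", "directory"),
]


def train(examples: List[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count how often each word was seen for each destination."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for text, label in examples:
        counts[label].update(text.lower().split())
    return dict(counts)


def route(counts: Dict[str, Counter], utterance: str,
          min_score: int = 2, fallback: str = "operator") -> str:
    """Pick one of the N destinations, or fall back to a human."""
    words = utterance.lower().split()
    scores: Dict[str, int] = {}
    for label, vocab in counts.items():
        others = [v for other, v in counts.items() if other != label]
        # Only words that no other destination has seen count as evidence.
        scores[label] = sum(vocab[w] for w in words
                            if not any(w in o for o in others))
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_score else fallback
```

The user can say anything, but the output is always one of a fixed set of destinations, which is the sense in which the understanding is only apparently complex. The dependence on `TRAIN` also shows why such systems need data and need updating when the set of destinations changes.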

SLIDE 12

Beyond Telephones

- Telematics
  - Voice communication in cars
  - GPS, music selection etc.
- Web-based dialog systems
- Robot Interaction
  - Robot-robot and robot-human interaction
- Animated talking head
  - Non-player characters – web agents
- Speech to Speech translation
- CMU Dialport: integrating many dialog systems

SLIDE 13

Team Talk

- Using speech to control multiple robots
- Robots have names and distinct voices
- They report to each other and to you in voice

SLIDE 14

Other SDS

- Microsoft: Situated Interaction
  - Talking Head that follows you
- CMU SV: Aidas
  - Restaurant recommendations in situ

SLIDE 15

True conversation

- Requires more than just speech
  - Non-verbal noises: laughing, er, um, etc.
  - Eye gaze
  - Proper timing (not waiting 500 ms before speaking)
  - Back-channeling
  - Movement
  - Talking about nothing

SLIDE 16

Roboreceptionist

- Entrance to NSH
- Keyboard (no ASR)
- TTS, face, movement
- Range finder to detect people
- Significant background character
- Mostly talks about nothing

SLIDE 17

Personal Intelligent Systems

- Example: Apple Siri, Google Now, Microsoft Cortana, Amazon Echo, etc.
- Hub of all applications
- Extendable
- Personalization
- Cross-Language
- Cross-Cultural
- Future: interface -> true companion

SLIDE 18

Speech Processing 11-492/18-495

Spoken Dialog Systems: SDS components

SLIDE 19

Spoken Dialog Systems

- More than just ASR and TTS
  - Recognition
  - Language understanding
  - Manipulation of utterances
  - Generation of new information
  - Text generation
  - Synthesis

SLIDE 20

SDS Architecture

[Figure: architecture diagram with the components ASR, Language Understanding, Dialog Manager, Language Generation and Synthesis; further labels: Error Handling Strategies, Non Understanding]

SLIDE 21

SDS Internals

- Language Understanding
  - From words to structure
- Dialog Manager
  - State of dialog (who is talking)
  - Direction of dialog (what next)
  - References, user profile etc.
  - Interaction with database/internet
- Language Generation
  - From structure to words
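The division of labor between the three internal components can be sketched as three functions chained per turn. Everything here is a toy under stated assumptions: ASR and synthesis are left out (text in, text out), and the city list, slot names and `PROMPTS` wording are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical prompts, one per slot the system may need to request.
PROMPTS = {
    "destination": "Where would you like to go?",
    "date": "When would you like to travel?",
}


@dataclass
class DialogState:
    slots: Dict[str, str] = field(default_factory=dict)


def understand(words: str) -> Dict[str, str]:
    """Language understanding: from words to structure (toy keyword spotter)."""
    tokens = words.lower().replace(",", " ").split()
    frame: Dict[str, str] = {}
    for city in ("boston", "pittsburgh"):
        if city in tokens:
            frame["destination"] = city
    for day in ("today", "tomorrow"):
        if day in tokens:
            frame["date"] = day
    return frame


def manage(state: DialogState, frame: Dict[str, str]) -> Dict[str, str]:
    """Dialog manager: update the state, then decide what to do next."""
    state.slots.update(frame)
    for slot in ("destination", "date"):
        if slot not in state.slots:
            return {"act": "request", "slot": slot}
    return {"act": "confirm", **state.slots}


def generate(action: Dict[str, str]) -> str:
    """Language generation: from structure to words."""
    if action["act"] == "request":
        return PROMPTS[action["slot"]]
    return f"Booking travel to {action['destination'].title()} {action['date']}."


def turn(state: DialogState, recognized_text: str) -> str:
    """One system turn: understanding, then management, then generation."""
    return generate(manage(state, understand(recognized_text)))
```

Only `manage` touches the dialog state, so the understanding and generation steps stay stateless and could be replaced independently.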

SLIDE 22

Language Understanding

- Parsing of SPEECH not TEXT
  - Eh, I wanna go, wanna go to Boston tomorrow
  - If it’s not too much trouble I’d be very grateful if one might be able to aid me in arranging my travel arrangements to Boston, Logan airport, at sometime tomorrow morning, thank you.
  - Boston, tomorrow

SLIDE 23

Parsing: Output structure

- “I wanna go to Boston, tomorrow”
  - Destination: BOS
  - Departure: 20081028, AM
  - Airline: unspecified
  - Special: unspecified
- Convert speech to structure
  - Sufficient for further processing/query
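A minimal sketch of turning such an utterance into the frame above, assuming a keyword-spotting approach rather than whatever parser the course uses. The `AIRPORTS` table and the reference date passed as `today` are assumptions, and unlike the slide the sketch only adds "AM" when the word "morning" is actually heard.

```python
import re
from datetime import date, timedelta
from typing import Dict

# Hypothetical city-to-airport-code lookup table.
AIRPORTS = {"boston": "BOS", "pittsburgh": "PIT"}


def parse(utterance: str, today: date) -> Dict[str, str]:
    """Spot the words that matter and ignore everything else."""
    frame = {slot: "unspecified"
             for slot in ("Destination", "Departure", "Airline", "Special")}
    tokens = re.findall(r"[a-z']+", utterance.lower())
    for tok in tokens:
        if tok in AIRPORTS:
            frame["Destination"] = AIRPORTS[tok]
    if "tomorrow" in tokens:
        departure = (today + timedelta(days=1)).strftime("%Y%m%d")
        if "morning" in tokens:
            departure += ", AM"
        frame["Departure"] = departure
    return frame
```

Disfluencies and politeness are skipped because the function never tries to account for every word, so the short and the long example utterances fill the same slots.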

SLIDE 24

Interaction Example

User: find a cheap eating place for taiwanese food

Intelligent Agent: Cheap Taiwanese eating places include Din Tai Fung, Boiling Point, etc. What do you want to choose? I can help you go there.

SLIDE 25

SDS Process

[Figure: the user utterance “find a cheap eating place for taiwanese food” is passed to the Intelligent Agent. Dependency labels on the utterance: AMOD, NN, PREP_FOR. Slot candidates: seeking, target, price, food.]

SLIDE 26

SDS Process

[Figure: same utterance and slot candidates; Ontology Induction (semantic slot) feeds the Organized Domain Knowledge of the Intelligent Agent.]

SLIDE 27

SDS Process

[Figure: as on slide 26, with Structure Learning (inter-slot relation) added alongside Ontology Induction (semantic slot).]

SLIDE 28

SDS Process

[Figure: same utterance and slot candidates; the Intelligent Agent outputs seeking=“find” target=“eating place” price=“cheap” food=“taiwanese”.]

SLIDE 29

SDS Process

[Figure: as on slide 28, with the step labeled Semantic Decoding.]

SLIDE 30

Automatic Slot Induction

Chen et al. ASRU’13

[Figure: the utterance “can i have a cheap restaurant” annotated with the frames capability, expensiveness and locale by use; the frames are slot candidates, labeled Domain or General.]

SLIDE 31

Parsing vs Language Model

- Language Model
  - Model what actually gets said
- Parsing
  - Extract the information you want
- Models *can* be shared
  - Only accept things in the grammar
  - Can be overly limiting
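The cost of accepting only what the grammar covers can be seen with a deliberately strict grammar. The regular expression below is a hypothetical stand-in for a recognition grammar, written only for this illustration.

```python
import re
from typing import Dict, Optional

# Hypothetical strict grammar: one fixed carrier phrase, a city, an optional day.
GRAMMAR = re.compile(
    r"^i (?:want|would like) to go to "
    r"(?P<city>boston|pittsburgh)"
    r"(?: (?P<day>today|tomorrow))?$"
)


def grammar_parse(utterance: str) -> Optional[Dict[str, str]]:
    """Return the filled slots, or None if the utterance is out of grammar."""
    match = GRAMMAR.match(utterance.lower().strip())
    if match is None:
        return None
    return {name: value for name, value in match.groupdict().items() if value}
```

A well-formed request parses cleanly, while a disfluent one such as "Eh, I wanna go, wanna go to Boston tomorrow" is rejected outright even though it carries the same information. That is the sense in which sharing one grammar between recognition and parsing can be overly limiting.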

SLIDE 32

Neural Networks for SLU

- RNN for Slot Filling
  - Step 1: word embedding
  - Step 2: capturing short-term dependencies
  - Step 3: capturing long-term dependencies
  - Step 4: different types of neural architecture

http://deeplearning.net/tutorial/rnnslu.html#rnnslu

Mesnil et al. 2013
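Steps 1 to 3 can be sketched as the forward pass of a small Elman-style recurrent tagger in plain Python. This is an untrained toy under stated assumptions: the vocabulary, label set, layer sizes and random weights are invented for illustration, so the predicted tags are meaningless until the weights are learned. It is not the code from the tutorial linked above.

```python
import math
import random
from typing import List

random.seed(0)
VOCAB = ["<pad>", "i", "wanna", "go", "to", "boston", "tomorrow"]  # toy vocabulary
LABELS = ["O", "B-dest", "B-date"]                                 # toy slot labels
EMB, HID, WIN = 4, 5, 3  # embedding size, hidden size, context window (odd)


def rand_matrix(rows: int, cols: int) -> List[List[float]]:
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


E = rand_matrix(len(VOCAB), EMB)        # Step 1: one embedding row per word
W_in = rand_matrix(HID, EMB * WIN)      # context window -> hidden
W_rec = rand_matrix(HID, HID)           # previous hidden -> hidden
W_out = rand_matrix(len(LABELS), HID)   # hidden -> label scores


def matvec(m: List[List[float]], v: List[float]) -> List[float]:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def tag(words: List[str]) -> List[str]:
    """Emit one slot label per input word."""
    ids = [VOCAB.index(w) for w in words]
    pad = [0] * (WIN // 2)
    padded = pad + ids + pad
    h = [0.0] * HID
    tags: List[str] = []
    for t in range(len(ids)):
        # Step 2: short-term context, the embeddings in a window around word t.
        x = [val for i in padded[t:t + WIN] for val in E[i]]
        # Step 3: longer-term context, carried by the recurrent hidden state.
        h = [math.tanh(a + b) for a, b in zip(matvec(W_in, x), matvec(W_rec, h))]
        scores = matvec(W_out, h)
        tags.append(LABELS[scores.index(max(scores))])
    return tags
```

Calling `tag("i wanna go to boston tomorrow".split())` returns six labels, one per word. Step 4 would amount to swapping the recurrence in the loop for a different architecture.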

SLIDE 33

Interactive Learning for SLU

LUIS: Interactive machine learning for language understanding

Advantages:

- Non-experts can add knowledge through feature engineering
- Active learning reduces heavy labeling

https://www.luis.ai/

Williams et al. 2016

SLIDE 34

Dialog Manager

- Maintain state
  - Where are we in the dialog
- Whose turn is it
  - Waiting for speaker
  - Waiting for database query (stall user)
- Deal with barge-in
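The turn-tracking part of a dialog manager can be sketched as a small state machine. The state names, event methods and logged strings below are assumptions for illustration; a real manager would drive audio playback and a database client instead of appending to a log.

```python
from enum import Enum, auto
from typing import List


class Turn(Enum):
    SYSTEM_SPEAKING = auto()
    WAITING_FOR_USER = auto()
    WAITING_FOR_DB = auto()


class DialogManager:
    """Tracks whose turn it is and reacts to three hypothetical events."""

    def __init__(self) -> None:
        self.turn = Turn.SYSTEM_SPEAKING
        self.log: List[str] = []

    def on_prompt_done(self) -> None:
        if self.turn is Turn.SYSTEM_SPEAKING:
            self.turn = Turn.WAITING_FOR_USER

    def on_user_speech(self, text: str) -> None:
        if self.turn is Turn.WAITING_FOR_DB:
            # The query is still running, so stall the user.
            self.log.append("stall: one moment please")
            return
        if self.turn is Turn.SYSTEM_SPEAKING:
            # Barge-in: the user spoke over the prompt, so stop talking.
            self.log.append("barge-in: stop playback")
        self.log.append(f"query: {text}")
        self.turn = Turn.WAITING_FOR_DB

    def on_db_result(self, result: str) -> None:
        if self.turn is Turn.WAITING_FOR_DB:
            self.log.append(f"say: {result}")
            self.turn = Turn.SYSTEM_SPEAKING
```

Keeping the turn as an explicit enum makes each case from the list above (waiting for the speaker, waiting for the database, barge-in) a visible branch rather than an implicit timing assumption.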