
SLIDE 1

Proposal of a Hierarchical Architecture for Multimodal Interactive Systems

Masahiro Araki*1, Tsuneo Nitta*2, Kouichi Katsurada*2, Takuya Nishimoto*3, Tetsuo Amakasu*4, Shinnichi Kawamoto*5
*1 Kyoto Institute of Technology, *2 Toyohashi University of Technology, *3 The University of Tokyo, *4 NTT Cyber Space Labs., *5 ATR

2007/11/16 W3C MMI ws

SLIDE 2

Outline

  • Background

– Introduction of the speech IF committee under ITSCJ
– Introduction to the Galatea toolkit

  • Problems of W3C MMI Architecture

– Modality Component is too large
– Fragile modality fusion and fission functionality
– How to deal with the user model?

  • Our Proposal

– Hierarchical MMI architecture
– “Convention over Configuration” in various layers


SLIDE 3

Background (1)

  • What is ITSCJ?

– Information Technology Standards Commission of Japan

  • under IPSJ (Information Processing Society of Japan)
  • Speech Interface Committee under ITSCJ

– Mission

  • Publish TS (Trial Standard) documents concerning multimodal dialogue systems


SLIDE 4

Background (2)

  • Theme of the committee

– Architecture of MMI system
– Requirements of each component

  • Future directions

– Guideline for implementing practical MMI systems
– Specify markup language


SLIDE 5

Our Aim

  • 1. Propose an MMI architecture that can be used for advanced MMI research

(W3C takes the practical point of view: mobile, accessibility)

  • 2. Examine the validity of the architecture through system implementation

(Galatea Toolkit)

  • 3. Develop a framework and release it as open source

(towards a de facto standard)


SLIDE 6

Galatea Toolkit (1)

  • Platform for developing MMI systems

– Speech recognition
– Speech synthesis
– Face image synthesis


SLIDE 7

Galatea Toolkit (2)

– ASR: Julian
– Dialogue Manager: Galatea DM
– TTS: Galatea talk
– Face: FSM


SLIDE 8

Galatea Toolkit (3)

[Architecture diagram]
– Dialogue Manager: Phoenix
– Agent Manager Macro Control Layer (AM-MCL)
– Agent Manager Direct Control Layer (AM-DCL)
– Modules: ASR (Julian), TTS (Galatea talk), Face (FSM)


SLIDE 9

Problems of W3C MMI (1)

  • The “size” of the Modality Component does not suit life-like agent control

[Diagram: Runtime Framework (Interaction Manager, Data Component, Delivery Context Component) over the Modality Component API, with a Speech Modality (ASR, TTS) and a Face Image Modality (FSM)]


SLIDE 10

Problems of W3C MMI (1)

  • Lip synchronization with speech output

[Diagram: (1) the Interaction Manager sends set Text="ohayou" to the Speech Modality (TTS); (2) phoneme durations (a[65], h[60], ...) are passed to the Face Image Modality; (3) the lip moving sequence is set in the FSM; (4) TTS starts]


SLIDE 11

Problems of W3C MMI (1)

  • Back-channeling mechanism

[Diagram: on a short pause detected during ASR, the Interaction Manager issues (1) a nod command to the Face Image Modality (FSM) and (2) set Text="hai" / start to the Speech Modality (TTS)]


SLIDE 12

Problems of W3C MMI (2)

  • Fragile modality fusion and fission functionality

[Diagram: the Speech Modality (ASR) yields “from here to there” while the Tactile Modality (touch sensor) yields point (120,139) and point (200,300)]

– How to define a multimodal grammar?
– Is simple unification enough?


SLIDE 13

Problems of W3C MMI (2)

  • Fragile modality fusion and fission functionality

[Diagram: the Interaction Manager sends “this is route map” to the Speech Modality (TTS) and SVG to the Graphic Modality (SVG Viewer)]

– Contents planning is suitable for adapting to various devices.


SLIDE 14

Problems of W3C MMI (3)

  • How to deal with the user model?

[Diagram: the user fails many times at the Speech Modality (ASR, TTS) and Face Image Modality (FSM)]

– Where is the user model information stored?

SLIDE 15

Solution

  • Back to the multimodal framework

– smaller modality components

  • Separate state transition descriptions

– task flow
– interaction flow
– modality fusion/fission

→ hierarchical architecture


SLIDE 16

Investigation procedure: Phase 1

use case analysis → requirements for overall system → working draft for MMI architecture


SLIDE 17

Use case analysis

Name | Input modality | Output modality
a. on-line shopping | mouse, speech | display, speech, animated agent
b. voice search | mouse, speech | display, speech
c. site search | mouse, speech, key | display, speech
d. interaction with robot | speech, image, sensor | speech, display
e. negotiation with interactive agent | speech | speech, face image agent
f. kiosk terminal | touch, speech | speech, display


SLIDE 18

Example of use case

Interaction with robot

User: “What is Kasuri?”
Robot: “Nishijin Kasuri is a traditional textile in Kyoto.”


SLIDE 19

Requirements

In common with W3C:

  • 1. general
  • 2. input modality
  • 3. output modality
  • 4. architecture, integration and synchronization point
  • 5. runtimes and deployments

Extensions:

  • 6. dialogue management
  • 7. handling of forms and fields
  • 8. connection with outside application
  • 9. user model and environment information
  • 10. from the viewpoint of the developer


SLIDE 20

[Hierarchical architecture diagram: control and commands flow down, events and results flow up]
– layer 6: application (data model, application logic, user model / device model)
– layer 5: task control (set/get, start dialogue, end event / result)
– layer 4: interaction control (integrated result / event, command)
– layer 3: modality integration (interpreted result / event, control / understanding)
– layer 2: modality component (ASR / touch, TTS / graphical component)
– layer 1: I/O device (ASR, pen / touch, audio output, graphics)
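The message flow above, commands descending and events/results ascending through the six layers, can be sketched as a chain of layer objects. All class and method names here are hypothetical; this is a minimal illustration of the layering idea, not the TS interface.

```python
class Layer:
    """One layer of the proposed hierarchy: commands descend, events ascend."""
    def __init__(self, name, below=None):
        self.name = name
        self.below = below            # next lower layer, if any
        self.above = None
        if below is not None:
            below.above = self

    def command(self, msg, trace=None):
        """Pass a command down toward layer 1; return the layers visited."""
        trace = [] if trace is None else trace
        trace.append(self.name)
        if self.below is not None:
            return self.below.command(msg, trace)
        return trace                  # layer 1 would drive the I/O device here

    def event(self, msg, trace=None):
        """Pass an event/result up toward layer 6."""
        trace = [] if trace is None else trace
        trace.append(self.name)
        if self.above is not None:
            return self.above.event(msg, trace)
        return trace                  # layer 6 would update the data model here

# Wire the six layers bottom-up
stack = Layer("layer 1: I/O device")
for name in ["layer 2: modality component", "layer 3: modality integration",
             "layer 4: interaction control", "layer 5: task control",
             "layer 6: application"]:
    stack = Layer(name, below=stack)
```

The point of the chain is that each layer only knows its immediate neighbours, which is what lets a layer's markup language or implementation be swapped without touching the rest.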

SLIDE 21

Investigation procedure: Phase 2

detailed analysis of use cases → requirements for each layer → publish trial standard → release reference implementation


SLIDE 22

Detailed use case analysis


SLIDE 23

Requirements of each layer

  • Clarify input/output with adjacent layers
  • Define events
  • Clarify inner-layer processing
  • Investigate markup language


SLIDE 24

1st layer: Input/Output module

  • Function

– Uni-modal recognition/synthesis module

  • Input module

– Input: (from outside) signal; (from 2nd layer) information used for recognition
– Output: (to 2nd layer) recognition result
– Example: ASR, touch input, face detection, ...

  • Output module

– Input: (from 2nd layer) output contents
– Output: (to outside) signal
– Example: TTS, face image synthesizer, Web browser, ...

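The input/output contract of this layer can be sketched as two small interfaces. The class names and the dummy engines are invented for illustration; real 1st-layer modules such as Julian or Galatea talk have their own APIs.

```python
class InputModule:
    """1st-layer input module: signal in (from outside), result out (to the 2nd layer)."""
    def recognize(self, signal, hints=None):
        raise NotImplementedError

class OutputModule:
    """1st-layer output module: contents in (from the 2nd layer), signal out."""
    def render(self, contents):
        raise NotImplementedError

class DummyASR(InputModule):
    """Stand-in for a recognizer such as Julian; just normalizes the 'signal'."""
    def recognize(self, signal, hints=None):
        return {"type": "recognition_result", "text": signal.strip().lower()}

class DummyTTS(OutputModule):
    """Stand-in for a synthesizer such as Galatea talk."""
    def render(self, contents):
        return f"<audio:{contents}>"
```

A 2nd-layer component would talk only to these narrow interfaces, which is what makes the wrapper layer described on the next slide possible.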

SLIDE 25

2nd layer: Modality component

  • Function

– wrapper that absorbs the differences of the 1st layer

ex) speech recognition component: grammar: SRGS, semantic analysis: SISR, result: EMMA

– provide multimodal synchronization

ex) TTS with lip synchronization (LS-TTS)

[Diagram: a 2nd-layer LS-TTS modality component wrapping the 1st-layer TTS and FSM Input/Output modules]
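The LS-TTS idea, a 2nd-layer component that coordinates the TTS and the face module so lips move in time with speech (the a[65], h[60], ... timing shown on slide 10), can be sketched as below. The phoneme durations and class names are fabricated placeholders, not the Galatea protocol.

```python
# Hypothetical phoneme durations, in the style of the a[65] h[60] ... sequence
FAKE_DURATIONS = {"ohayou": [("o", 65), ("h", 60), ("a", 70),
                             ("y", 55), ("o", 65), ("u", 80)]}

class LipSyncTTS:
    """2nd-layer sketch: wraps a TTS and a face module so that the lip
    moving sequence is scheduled from phoneme timing before speech starts."""
    def __init__(self, tts, face):
        self.tts, self.face = tts, face

    def speak(self, text):
        timing = FAKE_DURATIONS.get(text, [])
        self.face.set_lip_sequence(timing)   # (1) schedule lip movement
        return self.tts.start(text)          # (2) then start audio output

class RecordingFace:
    """Stand-in for the FSM face module; records what it was told to do."""
    def __init__(self):
        self.sequence = None
    def set_lip_sequence(self, timing):
        self.sequence = timing

class RecordingTTS:
    """Stand-in for the TTS module."""
    def start(self, text):
        return f"speaking:{text}"

ls = LipSyncTTS(RecordingTTS(), RecordingFace())
```

Because the synchronization lives inside one 2nd-layer component, the Interaction Manager above it never has to juggle the four-step message exchange from slide 10.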

SLIDE 26

3rd layer: Modality Fusion

  • Integration of input information

– Interpretation of sequential / simultaneous input
– Output the integrated result in EMMA format

[Diagram: the 2nd-layer Speech IMC and Touch IMC feed the 3rd-layer Modality Fusion, which outputs EMMA:]

<emma:sequence id="seq1">
  <emma:interpretation id="int1" emma:medium="acoustic" emma:mode="speech">
    <action> move </action>
    <object> this </object>
    <destination> here </destination>
  </emma:interpretation>
  <emma:interpretation id="int2" emma:medium="tactile" emma:mode="ink">
    <x>0.253</x> <y>0.124</y>
  </emma:interpretation>
  <emma:interpretation id="int3" emma:medium="tactile" emma:mode="ink">
    <x>0.866</x> <y>0.724</y>
  </emma:interpretation>
</emma:sequence>
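The "is simple unification enough?" question from slide 12 can be made concrete with a toy fusion routine that binds deictic expressions in the speech interpretation to time-ordered touch events. This is a sketch of one simple strategy under assumed data shapes, not the algorithm the TS would specify.

```python
def fuse(speech, touches):
    """Bind deictic slot values ('this', 'here', 'there') in the speech
    interpretation, in slot order, to touch events sorted by time."""
    deictic = {"this", "here", "there"}
    result = dict(speech)
    points = iter(sorted(touches, key=lambda t: t["time"]))
    for slot, value in speech.items():
        if value in deictic:
            result[slot] = next(points)["point"]
    return result

# "move this to here" plus two touches, as in the EMMA sequence above
speech = {"action": "move", "object": "this", "destination": "here"}
touches = [{"time": 1.0, "point": (0.253, 0.124)},
           {"time": 2.0, "point": (0.866, 0.724)}]
fused = fuse(speech, touches)
```

Even this toy shows why plain unification may not be enough: ordering, timing windows, and conflicting bindings all need policy beyond slot matching.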

SLIDE 27

3rd layer: Modality Fission

  • Rendering output information

– Synchronization of sequential / simultaneous output
– Coordination of output modality based on the access device

[Diagram: Modality Fission drives a Speech OMC, “I recommend ‘sushi dai’.”, and a Graphical OMC showing:]

Name | Price | Feature
Sushi dai | 3800 | good taste
kame | 3650 | good service
iwasa | 3500 | shellfish
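The coordination idea, choosing how to split one output across modalities based on the access device, can be sketched as below. The device profiles and the fallback rule are assumptions for illustration, not the TS format.

```python
def plan_output(recommendation, table, device):
    """Minimal fission sketch: speak the recommendation, and either show
    the table graphically or, with no display, read the top row aloud."""
    plan = []
    if device.get("tts"):
        plan.append(("speech", f'I recommend "{recommendation}".'))
    if device.get("display"):
        plan.append(("graphics", table))
    else:
        name, price, feature = table[0]
        plan.append(("speech", f"{name}: {price} yen, {feature}."))
    return plan

table = [("Sushi dai", 3800, "good taste"),
         ("kame", 3650, "good service"),
         ("iwasa", 3500, "shellfish")]
kiosk = plan_output("sushi dai", table, {"tts": True, "display": True})
phone = plan_output("sushi dai", table, {"tts": True, "display": False})
```

This is the "contents planning" point from slide 13: the same abstract content yields different modality plans per device.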

SLIDE 28

4th layer: Inner task control

  • Image

– a piece of dialogue at the client side

S: Please input member ID.
U: 2024
S: Please select food.
U: Meat
S: Is it OK?
U: Yes.

SLIDE 29

4th layer: Inner task control

  • Required functions

– Error handling

ex) check departure time < arrival time

– Default subdialogue

ex) confirmation, retry, ...

– Form filling algorithm

ex) Form Interpretation Algorithm (FIA)

– Slot update information

ex) processing a negative response to a confirmation request (“No, from Kyoto.”)

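The form-filling loop can be sketched in the spirit of VoiceXML's Form Interpretation Algorithm: repeatedly select the first unfilled field, prompt, and fill it. Field names and prompts below are illustrative, and the error handling is reduced to a re-prompt on empty input.

```python
def run_fia(form, answers):
    """Minimal FIA sketch: returns the sequence of prompts issued."""
    log = []
    while True:
        unfilled = [f for f in form if form[f]["value"] is None]
        if not unfilled:
            return log                    # all slots filled: form complete
        field = unfilled[0]               # select the first unfilled field
        log.append(form[field]["prompt"]) # prompt for it
        value = answers.pop(0)            # collect (simulated) user input
        if value:                         # empty input: re-prompt same field
            form[field]["value"] = value

form = {"member_id": {"prompt": "Please input member ID.", "value": None},
        "food":      {"prompt": "Please select food.",     "value": None}}
log = run_fia(form, ["2024", "Meat"])
```

Run against the slide-28 dialogue, the loop reproduces its prompt order; a real 4th layer would add confirmation subdialogues and the slot-update handling listed above.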

SLIDE 30

4th layer: Inner task control

[Interface diagram]
– From the 5th layer: initialize event, start dialogue (URI or code), data; to the 5th layer: end event (status)
– 4th-layer functions: FIA, input analysis (with error check), update data module, update user model, output contents
– To the 3rd layer (Modality Fusion / Modality Fission): initialize event, start input (with interruption), device information; from the 3rd layer: EMMA, end event (status)


SLIDE 31

5th layer: Task control

  • Image

– describes the overall task flow
– server-side controller

  • Possible markup languages

– SCXML
– Controller definition in the MVC model

  • entry points and their processing

– Script language on a Rails application framework

  • contains application logic (6th layer)
  • easy to prototype and customize


SLIDE 32

5th layer: Task control

[Interface diagram]
– 6th layer: data module, application logic, user model / device model (set/get, call)
– 5th-layer functions: state transition, conditional branch, event handling, subdialogue management
– To the 4th layer: initialize event, start dialogue (URI or code), data; from the 4th layer: end event (status), control

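The 5th-layer functions listed above (state transition, event handling, subdialogue management) amount to a state machine over subdialogues, SCXML-like in spirit. States, events, and the "dialogue://" scheme below are invented for illustration.

```python
class TaskControl:
    """Minimal 5th-layer sketch: a transition table keyed by
    (current state, end event from the 4th layer)."""
    TRANSITIONS = {
        ("login",   "end:success"): "order",
        ("login",   "end:failure"): "login",     # retry the login subdialogue
        ("order",   "end:success"): "confirm",
        ("confirm", "end:success"): "done",
    }

    def __init__(self):
        self.state = "login"

    def start_dialogue(self):
        # 'start dialogue (URI or code)' handed down to the 4th layer
        return f"dialogue://{self.state}"

    def on_end_event(self, status):
        # 'end event (status)' received back from the 4th layer
        key = (self.state, f"end:{status}")
        self.state = self.TRANSITIONS.get(key, self.state)
        return self.state

tc = TaskControl()
```

Writing the table declaratively is exactly what makes SCXML a candidate markup for this layer: the 4th layer runs each subdialogue, and the 5th layer only reacts to end events.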

SLIDE 33

6th layer: Application

  • Image

– Processing module outside of the dialogue system

  • accessed from various layers

  • Modules

– application logic

ex) DB access, Web API access

  • persist, update, delete, search of data

– user model / device model

  • persist the user’s information through sessions
  • manage device information defined in an ontology


SLIDE 34

Too many markup languages?

  • Does each level require a different markup language?

– No.
– The simple functionality of the 5th and 4th layers can be provided by a data-model approach (ex) Ruby on Rails)
– The default functions of the 3rd layer can be realized by a simple principle (ex) unification in modality fusion)
– 2nd-layer functions are task/domain independent

“Convention over Configuration”

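The data-model approach can be made concrete with a Rails-flavoured sketch: a default form-filling dialogue is derived from the data-model declaration alone, with no per-task dialogue markup. The field types and the prompt conventions are assumptions for illustration.

```python
def default_dialogue(model):
    """'Convention over Configuration' sketch: derive default prompts
    from field names and types, plus a conventional final confirmation."""
    prompts = []
    for field, ftype in model.items():
        label = field.replace("_", " ")
        if ftype == "choice":
            prompts.append(f"Please select {label}.")
        else:
            prompts.append(f"Please input {label}.")
    prompts.append("Is it OK?")       # conventional confirmation subdialogue
    return prompts

# Declaring the data model is the only task-specific artifact
order_model = {"member_id": "number", "food": "choice"}
dialogue = default_dialogue(order_model)
```

A developer then only writes configuration for the cases where the convention is wrong, which is the point of applying the principle at every layer.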

SLIDE 35

Summary

  • Problems of W3C MMI Architecture

– Modality Component
– Modality fusion and fission functionality
– User model

  • Our Proposal

– Hierarchical MMI architecture
– “Convention over Configuration” in various layers
