Combining Modalities in Multimodal Interfaces
Focus on speech and gestures
Gabriel Skantze, gabriel@speech.kth.se
Common misconceptions
Oviatt, “Ten myths of multimodal interaction”
- 1. If you build a multimodal system, users will interact multimodally.
- 2. Speech and pointing is the dominant multimodal integration pattern.
- 3. Multimodal input involves simultaneous signals.
- 4. Speech is the primary input mode in any multimodal system that includes it.
- 5. Multimodal language does not differ linguistically from unimodal language.
- 6. Multimodal integration involves redundancy of content between modes.
- 7. Individual error-prone recognition technologies combine multimodally to produce even greater unreliability.
- 8. All users’ multimodal commands are integrated in a uniform way.
- 9. Different input modes are capable of transmitting comparable content.
- 10. Enhanced efficiency is the main advantage of multimodal systems.
Multimodal interface = Multimodal interaction?
- Video: BTSLogic provides Directory Assistance and Information Services solutions to telecommunications carriers and operator services companies worldwide.
- Almost all users (95–100%) prefer to interact multimodally if given the choice. This does not mean that all interaction is multimodal; rather, the best option is used for each task. Roughly 20% of the interaction with multimodal interfaces has been observed to be multimodal.
- The proportion depends on the type and complexity of the task.
Myth 2: Speech and pointing is the dominant multimodal integration pattern
Put That There
[Bolt, 1980]
More than “put that there”
- Combinations of written input, manual gesturing, and facial expressions can generate symbolic information that is more richly expressive than simple object selection.
- The speak-and-point pattern comprises only 14% of all spontaneous multimodal utterances.
  – Pen input is used to create graphics, symbols and signs, gestural marks, digits and lexical content.
- In interpersonal multimodal communication, pointing gestures account for less than 20% of all gestures.
- Conclusion: Multimodal systems should handle other input than speak-and-point.
Myth 3: Multimodal input involves simultaneous signals
Simultaneous or Sequential
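The myth fails because a multimodal construction does not require temporal overlap: the component signals may be simultaneous or sequential. As a minimal, hypothetical sketch of how a fusion component can accommodate both patterns, the snippet below pairs speech and pen events whose onsets fall within a configurable time window; the event format and window length are illustrative assumptions, not values from Oviatt’s studies.

```python
from dataclasses import dataclass

@dataclass
class InputEvent:
    modality: str   # "speech" or "pen"
    content: str
    start: float    # onset time (seconds)
    end: float      # offset time (seconds)

def pair_events(events, max_lag=4.0):
    """Pair each speech event with the nearest pen event whose onset
    lies within max_lag seconds: this covers both simultaneous
    (overlapping) and sequential (non-overlapping) integration."""
    pen = [e for e in events if e.modality == "pen"]
    pairs = []
    for s in (e for e in events if e.modality == "speech"):
        near = [p for p in pen if abs(p.start - s.start) <= max_lag]
        if near:
            pairs.append((s, min(near, key=lambda p: abs(p.start - s.start))))
    return pairs

# Sequential pattern: the pen gesture precedes speech by 1.5 s, no overlap
events = [
    InputEvent("pen", "point(apartment_3)", 0.0, 0.4),
    InputEvent("speech", "what does this cost", 1.5, 2.6),
]
print(len(pair_events(events)))   # 1: integrated despite no overlap
```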
Myth 4: Speech is the primary input mode in any multimodal system that includes it
Speech is not everything
- Traditionally, speech has been viewed as the primary modality, with writing, gestures and haptics as merely supporting modalities.
- However, the other modalities can give information that is not present in the speech signal, e.g. spatial information.
- Pen input precedes speech in 99% of sequentially integrated multimodal commands, and in most simultaneously integrated ones.
Myth 5: Multimodal language does not differ linguistically from unimodal language
Speech in multimodality
- Multimodal speech is briefer, syntactically simpler, and less disfluent than users’ unimodal speech.
  – Unimodal: “Place a boat dock on the east, no, west end of Reward Lake.”
  – Multimodal: [drawing rectangle] “Add dock.”
Myth 6: Multimodal integration involves redundancy of content between modes
Complementary, not redundant
- Multimodal input is actually mostly complementary, not redundant.
- Speech and pen give different semantic information (see the fusion sketch after this list):
  – subject, verb, and object are spoken,
  – location is given with the pen.
- Even during multimodal correction of errors, redundant information is given less than 1% of the time.
- During human communication, spontaneous speech and gesturing do not involve duplicate information.
- Designers of multimodal systems therefore should not expect to rely on duplicated information when processing multimodal language.
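Because the modes fill different semantic slots, fusion can be frame merging rather than voting over duplicated content. The following is a minimal sketch under that assumption; the slot names and the conflict-as-error policy are illustrative, not the design of any particular system.

```python
def fuse_frames(speech_frame, pen_frame):
    """Merge complementary partial frames from speech and pen.

    Each mode fills the slots it is good at; overlap is rare, so a
    conflict is treated as a fusion error rather than resolved by voting.
    """
    fused = dict(speech_frame)
    for slot, value in pen_frame.items():
        if slot in fused and fused[slot] != value:
            raise ValueError(f"conflicting values for slot '{slot}'")
        fused[slot] = value
    return fused

# "Add dock" [drawing rectangle on the map]
speech_frame = {"action": "add", "object": "dock"}
pen_frame = {"location": (47.2, 98.6), "shape": "rectangle"}
print(fuse_frames(speech_frame, pen_frame))
# {'action': 'add', 'object': 'dock', 'location': (47.2, 98.6), 'shape': 'rectangle'}
```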
Myth 7: Individual error-prone recognition technologies combine multimodally to produce even greater unreliability
Unimodal errors are corrected
- 1. Users may select the least error-prone modality.
- 2. Users may switch modality when errors occur.
- 3. Mutual disambiguation: one modality can resolve ambiguity in the other.
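Mutual disambiguation is what lets two error-prone recognizers yield a more reliable whole: hypotheses are scored jointly, so a correct speech hypothesis ranked second can be pulled up by the gesture it fits. A minimal sketch; the confidences and the compatibility table are invented for illustration.

```python
# N-best lists: (hypothesis, recognizer confidence)
speech_nbest = [("ditch", 0.55), ("dock", 0.45)]      # correct answer ranked second
gesture_nbest = [("point_at_lake_edge", 0.7), ("point_at_road", 0.3)]

# Which spoken commands make sense with which gesture targets
compatible = {
    ("dock", "point_at_lake_edge"),
    ("ditch", "point_at_road"),
}

def best_joint(speech_nbest, gesture_nbest):
    """Rescore all (speech, gesture) pairs jointly; incompatible pairs
    are pruned, so 'dock' is pulled up from second place."""
    scored = [
        (s_conf * g_conf, s, g)
        for s, s_conf in speech_nbest
        for g, g_conf in gesture_nbest
        if (s, g) in compatible
    ]
    return max(scored) if scored else None

print(best_joint(speech_nbest, gesture_nbest))
# (0.315, 'dock', 'point_at_lake_edge'): the gesture disambiguates the speech
```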
Myth 8: All users’ multimodal commands are integrated in a uniform way
Individual patterns
- There are large individual differences in integration patterns.
- A user keeps using the same pattern from the beginning to the end.
- Hence: multimodal systems that can detect and adapt to a user’s dominant integration pattern can considerably improve recognition rates.
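A minimal sketch of such adaptation, assuming the fusion component observes whether a user’s first few speech/pen commands overlap in time and widens its integration window for sequential integrators; the thresholds and window lengths are illustrative assumptions.

```python
def classify_integrator(overlap_flags):
    """Classify a user's dominant integration pattern from their first
    few multimodal commands: 'simultaneous' if speech and pen usually
    overlap in time, otherwise 'sequential'."""
    simultaneous = sum(overlap_flags) > len(overlap_flags) / 2
    return "simultaneous" if simultaneous else "sequential"

def fusion_window(pattern):
    # A sequential integrator needs a longer window to catch speech
    # that trails the pen input (durations in seconds, illustrative).
    return 1.0 if pattern == "simultaneous" else 4.0

# First three commands: pen and speech never overlapped
pattern = classify_integrator([False, False, False])
print(pattern, fusion_window(pattern))   # sequential 4.0
```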
Myth 9: Different input modes are capable of transmitting comparable content
Strict Multimodality
- Strict modality redundancy:
  – All user actions should be possible to express using each modality.
  – All system information should be possible to present in each modality.
- Motivation:
  – Flexibility, predictability
  – “Design for all”
Coupling content & modality
- All modalities are not equal for all messages.
- Speech/writing can convey much information, but complex spatial shapes, relations among graphic objects, or precise location information are difficult…
  – …but trivial to sketch using a pen.
- Speech delivers information directly and intentionally,
  – but gaze reflects the speaker’s focus of interest more passively and unintentionally.
- Hence: adapt the input modality to the task.
Myth 10: Enhanced efficiency is the main advantage of multimodal systems
Two combination hypotheses
- 1. The combination of human output channels effectively increases the bandwidth of the human–machine channel.
  – This has been observed in many empirical studies of multimodal human–computer interaction.
- 2. Adding an extra output modality requires more neurocomputational resources, which leads to deteriorated output quality. The effective bandwidth is reduced.
  – Two types of effects are usually observed:
    - a slow-down of all output processes, and
    - interference errors, because selective attention cannot be divided between the increased number of output channels.
  – Two examples of this: writing while speaking, and speaking while driving a car.
Other advantages than speed
- Only a 10% speed-up with multimodal pen/voice interaction.
- But there are other advantages:
  – Task-critical errors and disfluencies can drop by 36–50%.
  – Users prefer to interact multimodally.
  – Flexibility to alternate modes to avoid overexertion.
  – Error avoidance and easier error recovery.
  – Wearability.
Three experiments on combining modalities
The AdApt multimodal dialogue system
- 1. Coordination of modalities
- 2. Visualising constraints
- 3. Turn-taking signals
Two ways of referring to objects
- Referring by descriptions
  – Using language to point out an object
- Referring by “deictic” expressions
  – Using gestures to point out an object
  – “this”, “that”
Referring by descriptions
U: “Are there any apartments around Karlaplan?”
S: “There are five apartments around Karlaplan.” (highlights Karlaplan, shows five apartments)
U: “What does this apartment cost?” (points at an apartment)
S: “The red apartment at Karlaplan costs 3,750,000 crowns.”
Referring by “deictic” expressions
U: “Are there any apartments around Karlaplan?”
S: “There are five apartments in this area.” (highlights Karlaplan, shows five apartments)
U: “What does this apartment cost?” (points at an apartment)
S: “This apartment costs 3,750,000 crowns.” (the apartment blinks)
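The two referring strategies call for different resolution evidence: deictic expressions resolve against the most recent pointing gesture, descriptions against object attributes. Below is a minimal, hypothetical resolver along those lines; the apartment records and attribute names are invented for illustration and are not the AdApt implementation.

```python
apartments = [
    {"id": 1, "color": "red", "area": "Karlaplan", "price": 3_750_000},
    {"id": 2, "color": "blue", "area": "Karlaplan", "price": 2_900_000},
]

DEICTIC = {"this", "that", "it"}

def resolve_reference(words, last_pointed_id=None, **attributes):
    """Resolve a referring expression to an apartment.

    Deictic terms ('this apartment' + pointing) use the gesture;
    descriptions ('the red apartment') match attribute constraints.
    """
    if DEICTIC & set(words) and last_pointed_id is not None:
        return next(a for a in apartments if a["id"] == last_pointed_id)
    matches = [a for a in apartments
               if all(a.get(k) == v for k, v in attributes.items())]
    return matches[0] if len(matches) == 1 else None  # ambiguous or no match

# "What does this apartment cost?" (points at apartment 2)
print(resolve_reference({"this", "apartment"}, last_pointed_id=2)["price"])
# "How much does the red apartment cost?"
print(resolve_reference({"red", "apartment"}, color="red")["price"])
```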
Coordination of modalities
- How does the system’s choice of referring expressions affect the user’s choice of referring expressions in a multimodal dialogue system?
- Motivation
  – Reduce variation in system input
  – Usability
  – Error handling
Motivation: reduce variation
- Referring expressions
  – Deictic:
    - “How much does it cost?” (clicking on an apartment)
  – Descriptions:
    - “How much does the red apartment cost?”
    - “How much does the apartment at Karlavägen 108 cost?”
  – Anaphora:
    - “How much does it cost?” (local anaphora)
    - “How much did the apartment we spoke about before cost?” (global anaphora)
Motivation: usability
- Referring expressions can be seen as a device for selecting objects in a multimodal user interface.
  – Apparent in the graphical user interface (local context)
  – Not apparent in the graphical user interface (global context)
- The way the user and the system talk about objects could affect which means of selecting objects will be available to the user.
Motivation: error handling
- The user should be encouraged to use the least error-prone means of expression.
- Different modalities and means of expression could be more or less error-prone for different users.
- The user should be encouraged to alternate means of expression when errors occur (Oviatt 1996).
Coordination of expressions
- Garrod & Anderson (1987)
  – Conceptual pacts
  – Human–human dialogue
- Zoltan-Ford (1991)
  – Verbal shaping
  – Human–computer dialogue
- Brennan (1996)
  – Lexical entrainment
  – Embedded and exposed modeling
Two versions of the system
- 1. Consistently referring by descriptions
  – Using language to point out an object
- 2. Consistently referring by “deictic” expressions
  – Using gestures to point out an object
  – “this”, “that”
Experimental design
- Wizard-of-Oz setup
- Between-group design
- 16 subjects, 8 in each group
- Task: to find two apartments that matched some constraints that were gradually presented as pictures.
- The task was designed to evoke global anaphora.
References to apartments
[Chart: per-group frequency of color descriptions*, non-color descriptions, deictic terms*, and pointing gestures*, comparing the Description group and the Deictic group (scale 0–100).]
Experiment 2: Visualising constraints
Example Dialogue
USER: I would like a two-room apartment in this area [showing with mouse]
SYS: How much are you willing to pay?
USER: One million
SYS: Is there something in particular you want the apartment to have?
USER: Yes... a tiled stove would be nice and... perhaps a balcony
SYS: There are no apartments that satisfy your constraints
Constraints in Information Browsing
- It is not a fixed set of constraints that needs to be specified before found apartments are presented.
- Users give some constraints and then ask about others when apartments are presented.
- To navigate among a large number of apartments, constraints must be added, deleted or changed (see the sketch below).
- The set of constraints that is relevant to specify differs to some degree between users.
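A minimal sketch of a constraint store supporting exactly these operations, with hypothetical slot names; adding an already-present slot changes it, so corrections need no special machinery.

```python
class ConstraintSet:
    """The current set of grounded search constraints, revisable turn by turn."""

    def __init__(self):
        self.constraints = {}

    def add(self, slot, value):
        self.constraints[slot] = value     # re-adding a slot changes it

    def delete(self, slot):
        self.constraints.pop(slot, None)

cs = ConstraintSet()
cs.add("rooms", 2)                 # "a two room apartment"
cs.add("max_price", 1_000_000)     # "one million"
cs.add("balcony", True)            # "perhaps a balcony"
cs.delete("balcony")               # relaxed by the user when nothing matched
cs.add("rooms", 3)                 # "three rooms, actually": changed, not re-asked
print(cs.constraints)              # {'rooms': 3, 'max_price': 1000000}
```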
Problems related to constraint management
- How should the system communicate:
  – which constraints it ‘heard’?
  – how these constraints were (re)interpreted?
  – the current set of search constraints?
- How can the system make it easier for the user to change the current set of constraints?
- Should the system automatically relax user constraints?
Why constraint grounding is needed
- Users want a feeling of control in interfaces, which is hard to obtain if the system relaxes constraints automatically.
- Error handling is made easier, since users can detect and correct errors at once.
- Finding apartments can be seen as a process of grounding constraints.
Verbal grounding
- In every turn:
  – explicit: “Did you say two rooms?”
  – implicit: “How much are you willing to pay for the two-room apartment with a balcony?”
- Resumé before presenting search results: “You are looking for two-room apartments in the old town that cost less than two million and that have a balcony and wooden floor...”
  – explicit: “...Is that correct?”
  – implicit: “...These are shown on the map”
- However, repeated confirmation turns give users the impression that the system is slow, and make the human–computer dialogue appear less natural (Boyce 1999).
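One way to balance grounding against this slowness problem is to choose the grounding move per constraint from recognition confidence: explicit confirmation only when confidence is low, implicit embedding otherwise. The sketch below illustrates the idea; the thresholds and phrasings are assumptions, not the AdApt strategy.

```python
def choose_grounding(heard, confidence, next_question,
                     explicit_below=0.5, implicit_below=0.85):
    """Select a verbal grounding move for a newly heard constraint.

    Low confidence  -> explicit confirmation (costs an extra turn)
    Mid confidence  -> implicit confirmation, embedded in the next question
    High confidence -> no verbal confirmation (or graphical grounding only)
    """
    if confidence < explicit_below:
        return f"Did you say {heard}?"
    if confidence < implicit_below:
        return f"{next_question.rstrip('?')} for the {heard} apartment?"
    return next_question

q = "How much are you willing to pay?"
print(choose_grounding("two rooms", 0.42, q))  # "Did you say two rooms?"
print(choose_grounding("two rooms", 0.70, q))  # implicit, embedded in the question
print(choose_grounding("two rooms", 0.95, q))  # plain next question
```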
Graphical grounding
- Could reduce the cognitive load on the verbal channel.
- Faster dialogue than verbal confirmations.
- Icons are less distracting than written feedback.
- Could make error handling easier:
  – makes it natural to change one constraint at a time
  – fewer verbal meta-utterances
  – makes the recognition grammar simpler during corrections (selecting the price icon while saying “two million” vs. saying “no, I did not say two thousand, but two million”)
- Facilitates importance ranking (lock icon or importance scale).
Using Icons for Graphical Grounding
- According to the ‘idiomatic paradigm’, users learn to connect certain icons to functions, in the same way that people use idioms in language (Cooper 1995).
- In our system, these things are visualized:
  – the constraints that the user provides during the initial part of the dialogue
  – the found constraints in an utterance, not its function
  – the current set of constraints originating from the whole dialogue
- Three kinds of visualized constraint types:
  – Numbers
  – Intervals
  – Set members
- Examples of constraint icons
Direct Manipulation of the Constraint Icons
Experiment 3: Turn-taking signals
Importance of turn-taking signals
U: what does it cost?
(the system needs some time to answer)
S: the ap...
U: how mu...
(the user is uncertain whether the system has noticed the user’s utterance)
S: sorry, I didn’t understand
- Confusion about turn-taking may lead to more false “barge-ins” (the user interrupting the system).
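The underlying problem is that the user cannot distinguish “the system heard me and is preparing an answer” from “the system missed me”. A minimal sketch of the remedy the experiment tests: have the agent emit a visible cue as soon as it takes the turn, before the answer is ready. The states and cues here are illustrative, not the actual AdApt gestures.

```python
import time

class TurnTakingAgent:
    """Emit visible turn-taking cues so the user knows the system
    has taken the turn, even while the answer is being prepared."""

    def show(self, cue):
        print(f"[agent: {cue}]")

    def on_user_turn_end(self, utterance):
        self.show("raises eyebrows")         # "I heard you"
        self.show("looks away, thinking")    # "I have the turn, please wait"
        answer = self.compute_answer(utterance)
        self.show("looks back at user")      # "here comes the answer"
        print(f"SYS: {answer}")

    def compute_answer(self, utterance):
        time.sleep(0.5)  # stand-in for database search and generation
        return "The red apartment costs 3,750,000 crowns."

TurnTakingAgent().on_user_turn_end("what does it cost?")
```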
Turn-taking signals
Barge-ins
- Measured: % of system responses with user “barge-in”.
- The system became slower towards the end of the sessions.
- The no-feedback group waited a long time between utterances at the beginning of the test.
[Chart: barge-in rates for the Gestures, Symbols and No feedback groups, over the whole session and over the last 2/3 of the session.]
User questionnaire
- Number of volunteered comments about turn-taking in the comment section of the survey.
- Positive comments were impossible for the no-feedback group.
- “Somewhat critical” means that the user understood that there were turn-taking signals, but did not quite like them.
[Chart: total feedback comments (positive, somewhat positive, somewhat critical, negative) for the Gestures, Symbols and No feedback groups.]
User survey
- Results of pairwise comparisons in a 10-question user satisfaction evaluation.
- Questions adopted from PARADISE.
- 5-step scale (0–4).
[Table: pairwise preferences among the Gestures, Symbols and No feedback versions.]
Human-human interaction as a model for human-computer interaction
Human–human Communication
- A model for human–computer interaction?
- Why look at the human–human interface?
  – Users already know how to interact.
Why not simply copy human–human interaction?
- Computers have other capabilities and restrictions.
- Human communication patterns may not always be efficient.
Metaphors for spoken interaction
- GUI metaphor vs. human metaphor.
- The goal is not to build a fully human replica, but to build a system that humans interact with as if it were another human.
Replicate Human Interaction in Human–Computer Interaction?
- Requirements on the computer:
  – Human-oriented perception
  – Human-readable action
- e.g.: ENGAGED, ACQUIRED, Pointing (53,92,12), Fixating (47,98,37), Saying “/o’ver[200] \/there[325]”
Human-oriented perception
- Person detection and tracking
- Expression classification
- Speech/prosody recognition
- Touch sensing
Human-readable action
– Express engagement, emotions, locus of attention
– Speech/prosody generation
– Sensorimotor feedback
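The perception capabilities above only help if their outputs are fused into a single picture of the user, as in the ENGAGED/Pointing/Fixating/Saying example on the earlier slide. A minimal sketch of such a fused state record follows; the field names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class PerceptionState:
    """Fused view of the user, filled in by the separate recognizers."""
    engagement: str = "IDLE"                             # e.g. IDLE / ENGAGED
    pointing_at: Optional[Tuple[int, int, int]] = None   # pointing target (x, y, z)
    fixating: Optional[Tuple[int, int, int]] = None      # gaze target
    saying: Optional[str] = None                         # words with prosodic annotation

state = PerceptionState()
state.engagement = "ENGAGED"                # person detection and tracking
state.pointing_at = (53, 92, 12)            # gesture recognition
state.fixating = (47, 98, 37)               # gaze tracking
state.saying = "/o'ver[200] \\/there[325]"  # speech and prosody recognition
print(state)
```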