The SmartKom Multimodal Corpus - Data Collection and End-to-End Evaluation
Nicole Beringer, Institut für Phonetik und Sprachliche Kommunikation, LMU München
Where can the IPSK (LMU) be found within the project?
- Data collection, evaluation, annotation
- Feedback about user reactions
- Modules
- User behaviour? Implementation of problem-solving strategies, improved prototype
Responsibilities of the IPSK group in SmartKom
Overview:
- Data Collection: WOZ design, WOZ experiments, some useful results
- End-to-End Evaluation: problems with multimodality, evaluation framework
- Annotation: transliteration of the audio data, prosodic annotation, annotation of the gestures, annotation of facial expression, annotation of user states
- Data collection, evaluation
- User modelling
- WOZ system, studio recordings
- Annotation of audio, gesture, emotion
- Distribution
- Modules: providing data for recognition
Responsibility Network
Data Collection
- Creating and publishing data for the training of recognizers (speech, prosodic features, gesture, facial expression, emotion), dialogue creation, and generation of information (speech)
- Research: user modelling, evaluation (usability and technical evaluation)
- Software
- Training of recognizers, user modelling
The BIG problem: Wizard-of-Oz
- Different users
- Instruction: „market research“
- 2 recordings (4.5 minutes each)
- Recording of audio (different characteristics)
- Recording of video (face, profile, display, gestures)
- Interview
How to convince users of a non-existent system just by simulation?
- Realistic prototype created by partners and LMU, influence on development
- Playback of atmosphere, creation of the studio
- Reliability: quality of speech output
- Experiment design: WOZ system with technical defects, evocation of behaviour (trial and error, gestures, emotion)
- Instruction: provoking different behaviour (new gestures, anger, new input facilities)
- Design of the display: few associations to existing systems, dialogue with an intelligent machine, no ordinary input facilities
- Good preparation
- Intensive training of the wizards
- System makes mistakes
Perception of the SmartKom system:
- The system is a machine
- The system is a person
- The system is something in between
„That’s a telephone box, I wouldn’t expect to talk to a human. I do not have illusions!“
- Reliability: the deception should not be noticed
- Only few associations to existing systems allowed
- Simulation of a personal assistant: an existing dialogue partner
- Assistant has „personality“
- Assistant leads through the dialogue and makes proposals
[Figure: Polite users. Share of subjects (axis: percent, 0.1 to 1) who used polite expressions, greetings, thanks, and „sorry“.]
[Figure: Positive aspects (axis: N, 2.5 to 20). Categories: verbal interaction with the assistant works well; individual applications or pages rated positively; persona; fast; a good idea overall; clearly laid out; practical; fun to use; multimodality; other.]
- Verbal interaction!
- Multimodality is only noticed by a few users
- Too slow
- Too few possibilities
- More help needed
- Persona not often criticized!
[Figure: Negative aspects (axis: N, 2.5 to 20). Categories: criticism of the speech output; too slow; too limited in scope; too little support; criticism of the speech input; not good overall; street noise is disturbing; criticism of the persona; gesture input not good; display.]
[Figure: What characterizes a comfortable system? Categories: easy operation; speech recognition; hardware/equipment; display layout; speed; range of services; multimodality; synthesis; other.]
SmartKom WOZ−Recordings and Processing of the Data at the LMU
WOZ recordings (recorded streams):
- Coordinates of graphics tablet
- DV video, front
- DV video, side view
- Beamer output
- SIVIT stream
- 11 audio streams
Processing:
- Cutting
- Transliteration (TRL)
- Preparation of gesture label stream
- Holistic US labeling (USH)
- Prosodic US labeling (TRP)
- Gesture labeling (GES)
- US labeling of facial expression (USM)
- Delivery of files to DFKI server
- Recording of DVD
Annotation of emotions
- System is simulated
- Subjects are recorded (audio and video)
- 4.5-minute interaction, e.g. „find a movie for this evening“
- Emotions are partly provoked by the wizards
[Figure: subjects during a recording, front view and side view]
- Orthographical Annotation
- Marking of repetitions, hesitations, noise, speech disfluencies etc.
w001_pkw_003_SMA: <Ger"ausch> @1hier @1sehen <:<#> Sie:> <:<#> eine:> "Ubersicht "uber das Programm der ~Heidelberger Kinos . w001_pkd_004_AAA: mhm [PA] [B3 cont] . <Ger"ausch> oh<Z> [B2] , ~F<Z>ight+Club<ROT> <!1 Flight−Club> [NA] [B2] , ~Das+f"unfte+Element<Z><ROT> [NA] [B2] , ~Drum%<ROT> , ~Jakob+der+L"ugner<ROT> [NA] [B3 cont] . <A> ah<OOT> [PA] [B2] , ich w"urde gerne [NA] ~Aimee+_ <"ah> _und+Jaguar [PA] sehen [B3 fall] . <Ger"ausch> wo [PA] wird das gespielt<Z> [NA] [B3 rise] ? <PP>
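The transliteration sample above mixes words with bracketed labels. As a rough illustration of how such a line could be split into turns and label counts, here is a minimal Python sketch. The regular expressions, the helper names `split_turns` and `count_labels`, and the shortened `sample` string are assumptions made for this example, derived only from the turn shown above; they are not the official SmartKom transliteration conventions or tooling.

```python
import re

# Hypothetical patterns, inferred only from the sample turn above.
TURN_ID = re.compile(r"(w\d+_[a-z]+_\d+_[A-Z]+):")
SQUARE_LABEL = re.compile(r"\[([^\]]+)\]")   # e.g. [PA], [NA], [B3 cont]
ANGLE_MARKER = re.compile(r"<([^<>:]+)>")    # e.g. <Z>, <ROT>, <Ger"ausch>


def split_turns(text):
    """Return a list of (turn_id, turn_text) pairs."""
    parts = TURN_ID.split(text)
    # parts = [prefix, id1, text1, id2, text2, ...]
    return [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts) - 1, 2)]


def count_labels(turn_text):
    """Count square-bracket labels and angle-bracket markers in one turn."""
    square = {}
    for label in SQUARE_LABEL.findall(turn_text):
        square[label] = square.get(label, 0) + 1
    angle = {}
    for marker in ANGLE_MARKER.findall(turn_text):
        angle[marker] = angle.get(marker, 0) + 1
    return square, angle


# Shortened excerpt of the sample above (an assumption for brevity).
sample = (
    'w001_pkw_003_SMA: <Ger"ausch> @1hier @1sehen Sie eine "Ubersicht . '
    'w001_pkd_004_AAA: mhm [PA] [B3 cont] . oh<Z> [B2] , '
    'ich w"urde gerne [NA] sehen [B3 fall] .'
)

for turn_id, turn_text in split_turns(sample):
    square, angle = count_labels(turn_text)
    print(turn_id, square, angle)
```

Nested markers such as `<:<#> Sie:>` are deliberately left out of the excerpt; handling them would need a proper parser rather than flat regular expressions.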
Annotation of gestures in 3 categories:
- Interactional gestures: pointing (long and short), free gestures
- Supporting gestures: reading, searching, counting
- Residual gestures: emotional gestures, not identifiable
[Figure: example gestures, labelled „I-Point (short -)“ and „R-Emotional (+ cubus)“]
3 steps:
- Prosodic annotation: audio only, formal labelling system
- Holistic labelling: facial expression, audio, context
- Facial expression: labelling without audio
Holistic labeling includes context information, which is not relevant for the facial expression recognizer. We therefore added a „facial expression only“ labeling step (no audio). To analyse the prosody, the speech had to be labeled as well. The functional approach did not seem to work for speech, so for the prosody we adopted a formal coding step that was used in Verbmobil (Fischer, 1999). The holistic step and the formal step for the speech can be combined to obtain ecologically valid data.
Annotation of emotions
Categories for the prosody step:
- Pauses between phrases
- Pauses between words
- Pauses between syllables
- Irregular length of syllables
- Emphasized words
- Strongly emphasized words
- Clearly articulated words
- Hyperarticulated words
- Words overlapped by laughing
Labeling with some defined subjective categories:
- „anger/irritation“
- „joy/gratification (being successful)“
- „helplessness“
- „pondering/reflecting“
- „surprise“
- „neutral“
- „unidentifiable episode“
Conclusion (WOZ)
- WOZ: realistic data for man-machine interaction (training of recognizers, observation of user behaviour)
- The WOZ technique is time-consuming and expensive
- BUT: results from user observations and questionnaires can influence the development of the system early on
- Website: http://www.smartkom.org/
- http://www.phonetik.uni-muenchen.de/Forschung/Publications/index.html
- Corpus overview: Schiel, F. et al. (2002): Integration of multi-modal data and annotations into a simple extendable form: the extension of the BAS Partitur Format. LREC Conference.
- Steininger, S. et al. (2002b): User-State Labeling Procedures For The Multimodal Data Collection Of SmartKom. LREC Conference.
- Beringer, N. (2001): Evoking Gestures in SmartKom - Design of the Graphical User Interface. Gesture Workshop 2001, London, UK. To appear in: Springer, "Gesture Workshop 2001, London".
- Labeling of gestures: Steininger, S. et al. (2001): Labeling of Gestures in SmartKom - The Coding System. Gesture Workshop 2001, London.
- Transliteration: Oppermann, D. et al.: Transliterationskonventionen.
General Criteria of Dialogue System Evaluation (End−to−End Evaluation)
- „The performance of the evaluation is very often driven by the characteristics of the system that has to be judged“ [Andenfilger-97].
- An evaluation framework must abstract from the system itself and from different dialogue strategies.
- Combination of the developers’ and the users’ needs as well as the constraints on the evaluation of multimodal systems in general.
- Combination of objective and subjective evaluation criteria.
PARADISE: Paradigm for Dialogue Systems Evaluation
- Comparison of Dialogue Strategies
- Direct Comparison with other Dialogue Systems
- Comparison of usability and objectively measurable results
- Generalization and normalization over measures
- Standardization of the evaluation of successful transactions via Attribute Value Matrices (AVMs)
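To make the AVM-based scoring of transaction success more concrete: in the published PARADISE framework (Walker et al., 1997), task success is summarized by the kappa coefficient of a confusion matrix built from the attribute values. The sketch below computes kappa for such a matrix. The function name `kappa`, the row/column orientation, and the counts are assumptions chosen for illustration, not data from SmartKom.

```python
def kappa(confusion):
    """Kappa coefficient for a square confusion matrix (list of lists).

    Assumed orientation: rows are values observed in the dialogue,
    columns are values in the scenario key. P(A) is the share of counts
    on the diagonal; P(E) is the chance agreement estimated from the
    column totals.
    """
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    p_agree = sum(confusion[i][i] for i in range(size)) / total
    col_totals = [sum(row[j] for row in confusion) for j in range(size)]
    p_chance = sum((t / total) ** 2 for t in col_totals)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_agree - p_chance) / (1 - p_chance)


# Made-up counts for a two-valued attribute, for illustration only.
example = [[20, 5],
           [10, 15]]
print(round(kappa(example), 3))
```

The matrix needs one row and column per possible attribute value, which is why a fixed set of AVM keys has to be known in advance.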
Evaluation framework for unimodal dialogue systems: problems
What about multimodal systems?
- Usability: separation of user satisfaction and dialogue complexity, unique scales
- Objective measures: multimodal costs, higher-dimensional AVMs; there are no static definitions of the "keys" necessary to compute an AVM
Problems with Spoken Dialogue Evaluation Frameworks in Multimodal Dialogue Environments
- How to score multimodal inputs or outputs?
- How to score the use of multimodal technologies?
- How to weight the several multimodal components of recognition systems?
- How to evaluate different scenarios?
Problems with Spoken Dialogue Evaluation Frameworks in Multimodal Dialogue Environments
- How to define an optimal dialogue?
- How to evaluate uncompleted tasks?
- How to deal with bad performance due to user incooperativity?
Usability
- Multimodal evaluation criteria
- Questionnaire adapted to cost functions
- User satisfaction is compiled separately
- Standardization of questions
- User satisfaction ranges from -3 to +3
Objective Evaluation Measures
- Optimal dialogues depend on the system processing
- Length of the dialogue is defined by the user
- Weighting of quality and quantity measures and task success by the correlation between user satisfaction and the objective measure
- Definition of multimodal costs
- Definition of a bipolar function τ for the compilation of task success via biunique information clusters
- Integration of uncompleted tasks: τ(j) = -1: task failure
Definition of Weights

| Quality and quantity measures | Usability question |
|---|---|
| Transaction success; task complexity | The task was easy to solve |
| Misunderstanding of input; offtalk | SmartKom has understood my input |
| Misunderstanding of output | SmartKom can easily be understood |
| Semantic and syntactic correctness; incremental compatibility | SmartKom has answered properly in most cases |
| Mean system response time; mean user response time | The speed of the system was acceptable for each situation |
| Timeout | I always knew what to say |
| Acc. gesture recognition | The gestural input was successful |
| Acc. ASR | The speech input was successful |
Definition of Weights

| Quality and quantity measures | Usability question |
|---|---|
| Dialogue complexity | SmartKom worked as assumed; SmartKom reacted quickly to my input; SmartKom is easy to handle |
| Percentage of appropriate/inappropriate system directive diagnostic utterances | SmartKom offered an adequate amount of high quality information |
| Percentage of explicit recovery answers | SmartKom is easy to handle |
| Repetitions; no. of ambiguities; diagnostic error messages; rejections | SmartKom needs input only once to successfully complete a task |
| Timeout; help-analyzer | SmartKom offers adequate help |
Definition of Weights

| Quality and quantity measures | Usability question |
|---|---|
| Output complexity (display) | The display is clearly designed |
| Mean elapsed time; task completion time; dialogue elapsed time | SmartKom reacted fast to my input |
| Duration of speech input; duration of ASR | SmartKom reacted fast to speech input |
| Duration of gestural input; duration of gesture recognition | SmartKom reacted fast to gestural input |
| BargeIn; cancel | SmartKom allows interrupts |
| Dialogue complexity | Was the task difficult? |
| Gesture turns | Input via graphical display |
| Ways of interaction; display turns | Output via graphical display |
Definition of Weights

| Quality and quantity measures | Usability question |
|---|---|
| Speech input | Speech input |
| Speech synthesis (synchronicity) | Speech output |
| N-way communication; ways of interaction; error rate of questions; input complexity | Possibility to interact in a quasi-human way with SmartKom |
| Recognition/duration of facial expression; prosodic features | SmartKom reacted to my emotional state |
| Synchronicity; graphical output (turns) | How do you score the competence of the agent? |
| Cooperativity | Were actions of the persona natural? |
| Gestural input | Gestural input |
Information Clusters
Extract different superordinate concepts depending on the task at hand.
Example: EPG
- „City of Angels“ (assumption: unique day, time, channel) => one piece of information needed
- Movie today at 8 p.m. on SAT1 (channel) => three pieces of information needed
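The EPG example can be pictured as a small data structure: an information cluster is the set of items the user has to supply, and the bipolar function τ scores the cluster as +1 (task success) or -1 (task failure). The following Python sketch is only an illustration. The slot names, the dictionary representation, and the matching rule inside `tau` are assumptions; the slides define only the two values of τ and that uncompleted tasks count as failures.

```python
def tau(cluster, system_understood):
    """Return +1 if every item of the information cluster was handled
    correctly by the system, otherwise -1.

    A missing item (uncompleted task) also yields -1.
    """
    for slot, value in cluster.items():
        if system_understood.get(slot) != value:
            return -1
    return 1


# „City of Angels“: day, time and channel are assumed to be unique,
# so the title alone is the one piece of information needed.
title_request = {"title": "City of Angels"}

# "Movie today at 8 p.m. on SAT1": three pieces of information needed.
slot_request = {"day": "today", "time": "20:00", "channel": "SAT1"}

print(len(title_request), len(slot_request))
print(tau(slot_request, {"day": "today", "time": "20:00", "channel": "SAT1"}))
print(tau(slot_request, {"day": "today", "time": "20:00"}))
```

Because the cluster is built per request, no static key set has to be fixed in advance.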
User Incooperativity
Example: „Smartakus, do the dishes!“
- Other frameworks: task failure attributed to the system
- Only dialogues with cooperative users are evaluated using empirical methods
- Only dialogues which terminate with finished tasks are evaluated
How to score multimodal inputs or outputs?
- Multimodal cost functions „no. of multiple input“ and „ways of interaction“
- Weighting of recognition scores via defined user satisfaction score
How to evaluate different scenarios?
- Intra-scenario: normalization over tasks
- Inter-scenario: three systems
- Possibility to compute the performance over the three scenarios after all evaluation periods
$$\mathrm{Performance} = \alpha \cdot \bar{\tau} - \sum_{i=1}^{n} \omega_i \cdot N(c_i)$$

$$N(x) = \frac{x - \bar{x}}{\sigma_x}$$

- $j$ = biunique information cluster; $\tau(j) = +1$: task success; $\tau(j) = -1$: task failure
- $c_i$ = cost function $i$
- $\alpha$ = correlation between user satisfaction and the mean value of $\tau$
- $\omega_i$ = correlation between user satisfaction and normalized costs
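The performance formula above can be sketched in a few lines of Python. This is a literal, hedged reading of the slide, not the project's evaluation software. The function names are hypothetical, and three choices are assumptions: Pearson correlation as the correlation measure, the population standard deviation for N(x), and computing the weights over a set of dialogues with one value per dialogue.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / (var_x * var_y) ** 0.5


def z_normalize(xs):
    """N(x) = (x - mean) / standard deviation, applied to each value."""
    m, s = mean(xs), pstdev(xs)
    if s == 0:
        raise ValueError("cannot normalize constant data")
    return [(x - m) / s for x in xs]


def promise_performance(satisfaction, mean_tau, costs):
    """Per-dialogue performance: alpha * mean_tau - sum_i omega_i * N(c_i).

    satisfaction: user satisfaction score per dialogue
    mean_tau:     mean of tau over the information clusters, per dialogue
    costs:        dict mapping a cost name to its raw values per dialogue
    """
    alpha = pearson(satisfaction, mean_tau)
    normalized = {name: z_normalize(values) for name, values in costs.items()}
    omega = {name: pearson(satisfaction, values)
             for name, values in normalized.items()}
    return [
        alpha * mean_tau[d]
        - sum(omega[name] * normalized[name][d] for name in costs)
        for d in range(len(satisfaction))
    ]


# Made-up numbers for four dialogues, for illustration only.
scores = promise_performance(
    satisfaction=[3, 1, -2, 0],
    mean_tau=[1.0, 0.5, -1.0, 0.0],
    costs={"turns": [4, 6, 10, 8], "timeouts": [0, 1, 3, 1]},
)
print([round(s, 3) for s in scores])
```

The signs follow the slide's formula exactly as written; how the sign of each ω_i should be interpreted for a given cost is not specified in the slides.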
Conclusion (Evaluation)
- PROMISE offers an overall evaluation result integrating cost functions and user satisfaction
- PROMISE can deal with multimodality
- PROMISE is independent of task definitions (static or dynamic tasks)
- Beringer et al. (2002): End-to-End Evaluation of Multimodal Dialogue Systems - can we Transfer Established Methods? Proc. of the Third International Conference on Language Resources and Evaluation. Las Palmas, Gran Canaria, Spain.
- Beringer et al. (2002): PROMISE: A Procedure for Multimodal Interactive System Evaluation. Proceedings of the Workshop 'Multimodal Resources and Multimodal Systems Evaluation' 2002, Las Palmas, Gran Canaria, Spain, pp. 77-80.
- Beringer et al. (2002): How to relate User Satisfaction and System Performance in Multimodal Dialogue Situations - a Graphical Approach. Proceedings of the International CLASS Workshop on Natural,