

SLIDE 1

Retrieving Target Gestures Toward Speech Driven Animation with Meaningful Behaviors

NAJMEH SADOUGHI AND CARLOS BUSSO

Multimodal Signal Processing (MSP) Lab
Erik Jonsson School of Engineering and Computer Science
The University of Texas at Dallas

Nov. 11th, 2015

SLIDE 2

Motivation

  • Creating naturalistic nonverbal behaviors is important for conversational agents (CAs)
    • Animations
    • Entertainment
    • Virtual reality
  • More than 90% of human gestures occur while speaking
  • Complex relationship between gestures and speech
    • Cross-modality interplay
    • Synchronization

[Image: virtual human example from ICT-USC]

SLIDE 3

Previous studies on co-verbal gesture synthesis

  • Rule-based frameworks [Cassell et al., 1994; Kopp et al., 2006]
    + Define rules based on the semantics
    - Synchronization is challenging
    - The variation is limited
  • Speech-prosody-driven systems [Levine et al., 2010; Busso et al., 2007]
    + Learn movements and their synchronization from recordings
    + Capture the variation in the data
    - Disregard the context
  • Combinations of data-driven and rule-based methods [Stone et al., 2004; Marsella et al., 2013; Sadoughi et al., 2014]
    + Utilize the advantages and overcome the disadvantages of both

[Diagram: spectrum of approaches from rule-based to speech-driven]

SLIDE 4

Previous studies using both approaches

  • Stone et al. [2004]
    • Search for combinations of speech and motion units whose meaning matches the speech and the planned behaviors
  • Marsella et al. [2013]
    • Create appropriate gestures depending on the communicative goal of the utterance
    • Use speech prosody features to capture the stress and emotional state of the speaker
  • Sadoughi et al. [2014]
    • Constrain a speech-driven animation model based on semantic labels (e.g., Question and Affirmation)

[Diagram: speech and motion units (Si, Mi), (Sk, Mk) aligned to consecutive phrases p and p+1]

SLIDE 5

Our Vision

  • Creating a bridge between rule-based systems and data-driven frameworks
  • SAIBA framework [Kopp et al., 2006]:
    • Considering that the target gesture for synthesis is known
    • Synthesizing behaviors that are temporally aligned and coordinated with speech
    • Synthesizing behaviors that convey the right meaning

[Diagram: SAIBA pipeline (Intent Planning → Behavior Planning → Behavior Realization); rule-based systems cover the planning stages, data-driven systems the realization]

SLIDE 6

Objective of This Study

  • Annotating a few samples of a prototypical gesture
  • Retrieving gestures similar to the examples
  • Training the Behavior Realization model

Goal: Retrieve examples of prototypical gestures

SLIDE 7

Gesture Segmentation and Classification

  • Kovar et al. [2004]
    • Find gestures similar to a target gesture using DTW, and use the retrieved samples to expand the training set
  • Joshi et al. [2015]
    • Train a random forest model using video and depth maps of the joints
    • Use a multi-scale sliding window on new data (forward search)
  • Zhou et al. [2013]
    • Hierarchical aligned cluster analysis (HACA) to dynamically segment and cluster motion capture data into movement primitives

SLIDE 8

MSP-AVATAR Corpus

  • Multimodal database comprising:
    • Motion capture data
    • Video recordings
    • Speech recordings
  • Four dyadic interactions between actors
  • We motion-captured one of the actors
  • The database is rich in terms of discourse functions

SLIDE 9

Discourse Functions in MSP-AVATAR corpus

  • Discourse functions that elicit specific gestural behaviors
  • Selection guided by previous studies
    • Poggi et al. [2005]
    • Marsella et al. [2013]
  • 2-5 scenarios per discourse function
  • We used the recordings from one of the actors (66 mins)

The discourse functions: Contrast, Negation, Question, Uncertainty, Confirmation, Suggest, Warn, Inform, Pronouns (I/You/Other), Large/Small

SLIDE 10

Prototypical Behaviors

[Images: examples of the five prototypical gestures: To-Fro, So-What, Regress (hand); Nods, Shakes (head)]

                               So-What   To-Fro   Regress   Nods   Shakes
Samples (train)                   14        27        26      24       27
Samples (test & development)      21        29        73     138      115

SLIDE 11

Gesture Retrieval Framework Overview

  • Temporal reduction
    • The data is captured at 120 fps and may contain redundant information
  • Gesture segmentation
    • Gestures can happen with arbitrary durations
  • Gesture detection
    • Binary decision per segment

SLIDE 12

Temporal Reduction

  • Reduce the complexity of the system
  • Inspired by Zhou et al. [2013]
  • Non-uniform downsampling (see the sketch below)
    • Based on Linde-Buzo-Gray vector quantization (LBG-VQ)
    • Discard up to 5 consecutive frames if they fall in the same cluster
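As a rough illustration, here is a minimal sketch of this downsampling step. It assumes each frame is a feature vector of joint values and uses scikit-learn's KMeans as a stand-in for the LBG-VQ codebook; the cluster count is likewise an assumption, not a value from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans  # stand-in for LBG-VQ codebook training


def temporal_reduction(frames, n_clusters=64, max_skip=5):
    """Non-uniform downsampling of motion-capture frames (sketch).

    frames: (T, D) array, one row of joint features per 120-fps frame.
    Consecutive frames that fall in the same cluster as the last kept
    frame are dropped, but never more than `max_skip` in a row.
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(frames)
    keep, skipped = [0], 0
    for t in range(1, len(frames)):
        if labels[t] == labels[keep[-1]] and skipped < max_skip:
            skipped += 1            # redundant frame in the same cluster
        else:
            keep.append(t)          # cluster changed, or skip budget spent
            skipped = 0
    return frames[np.asarray(keep)]
```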

SLIDE 13

Gesture Segmentation

  • Window size (Lw)
  • Minimum length of a search segment (Lmin)
  • Maximum length of a search segment (Lmax)
  • Frame increment between iterations
    • Δ = (Lmax - Lmin)/30
  • One winner per window (see the sketch below)

[Diagram: candidate segments of lengths Lmin to Lmax, stepped by Δ, inside each search window over time t]
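A hypothetical enumeration of the candidate segments these parameters imply; the slide fixes only Lmin, Lmax, and Δ, so the way the window advances here is an assumption:

```python
def candidate_segments(n_frames, l_w, l_min, l_max, n_steps=30):
    """Yield (start, end) candidate segments at multiple scales (sketch).

    Within each search window of length l_w, segment lengths from l_min
    to l_max are tried in increments of delta = (l_max - l_min) / n_steps.
    Downstream, only one winning segment per window is kept.
    """
    delta = max(1, (l_max - l_min) // n_steps)
    for start in range(0, n_frames - l_min + 1, l_w):
        for length in range(l_min, l_max + 1, delta):
            if start + length <= n_frames:
                yield start, start + length
```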

SLIDE 14

Gesture Detection

  • One-class SVMs
    • Efficiently reduce the number of candidates
  • Dynamic time alignment kernel (DTAK)
    • To increase precision (the cascade is sketched below)

[Diagram: two-step detection; candidate segments pass through the one-class SVMs and, if accepted (Y=1), are scored with DTAK]
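The two-step cascade can be summarized as follows; `svm_accepts` and `dtak_score` are placeholder callables standing for the fused one-class SVMs and the DTAK similarity described on the next slides:

```python
import numpy as np


def two_step_detect(segments, svm_accepts, dtak_score, templates, threshold):
    """Sketch of the detection cascade (placeholder callables).

    A candidate segment counts as a detection only if it passes the cheap
    one-class-SVM filter and its median DTAK similarity to the annotated
    examples exceeds the threshold.
    """
    hits = []
    for seg in segments:
        if not svm_accepts(seg):       # step 1: prune most candidates
            continue
        score = np.median([dtak_score(seg, t) for t in templates])
        if score >= threshold:         # step 2: precise DTAK check
            hits.append(seg)
    return hits
```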

SLIDE 15

One-Class SVMs

  • Only positive samples
  • Limited number of training instances
  • Train separately for different features
  • Fuse the classifiers using the AND operator (sketched below)
  • Feature selection by cross-validation
    • Sort features according to accuracy
    • Remove features one by one until accuracy > 0.85

[Diagram: one one-class SVM per feature, STD(Joint i-x/y/z) for each joint, fused with an AND operator]
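A minimal sketch of the AND-fused ensemble, assuming one scalar feature per classifier (e.g., the standard deviation of one joint coordinate over the segment); the RBF kernel and ν value are assumptions:

```python
import numpy as np
from sklearn.svm import OneClassSVM


def train_ensemble(X_pos):
    """One one-class SVM per feature column, trained on positives only."""
    return [OneClassSVM(kernel="rbf", nu=0.1).fit(X_pos[:, [j]])
            for j in range(X_pos.shape[1])]


def predict_and_fuse(models, X):
    """AND fusion: a segment is accepted only if every SVM accepts it."""
    votes = np.stack([m.predict(X[:, [j]]) == 1   # +1 means inlier
                      for j, m in enumerate(models)])
    return votes.all(axis=0)
```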

SLIDE 16

DTAK by Zhou et al. [2013]

  • DTAK computes the similarity between two segments, regardless of their lengths, in terms of a (Gaussian) kernel
  • Final score: the median of the similarity measures to the training examples
  • Find a threshold by maximizing the F-score on the development set

$$\kappa_{ij} = \exp\!\left(-\frac{\lVert x_i - y_j \rVert^2}{2\sigma^2}\right)$$

$$u_{i,j} = \max\begin{cases} u_{i-1,j} + \kappa_{ij} \\ u_{i-1,j-1} + 2\,\kappa_{ij} \\ u_{i,j-1} + \kappa_{ij} \end{cases}
\qquad
\tau(X, Y) = \frac{u_{l_x, l_y}}{l_x + l_y}$$
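For concreteness, a direct transcription of this recursion in Python; σ is an assumed bandwidth and the boundary handling follows the usual DTAK initialization:

```python
import numpy as np


def dtak(X, Y, sigma=1.0):
    """Dynamic time alignment kernel (sketch of the recursion above).

    X: (lx, D) and Y: (ly, D) arrays of per-frame features.
    """
    lx, ly = len(X), len(Y)
    # Frame-level Gaussian kernel k[i, j] = exp(-||x_i - y_j||^2 / 2 sigma^2)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    k = np.exp(-d2 / (2.0 * sigma ** 2))
    u = np.zeros((lx, ly))
    for i in range(lx):
        for j in range(ly):
            if i == 0 and j == 0:
                u[i, j] = 2.0 * k[i, j]
            elif i == 0:
                u[i, j] = u[i, j - 1] + k[i, j]
            elif j == 0:
                u[i, j] = u[i - 1, j] + k[i, j]
            else:
                u[i, j] = max(u[i - 1, j] + k[i, j],
                              u[i - 1, j - 1] + 2.0 * k[i, j],
                              u[i, j - 1] + k[i, j])
    return u[-1, -1] / (lx + ly)
```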

SLIDE 17

Evaluation of Retrieved Gestures

  • Precision for head gestures > 0.85
  • Precision for hand gestures > 0.59
  • Head gestures are less complex than hand gestures, which explains their higher precision

Test & development sessions:

Gesture       Precision [%]   Recall [%]
Head Shake        95.65          42.31
Head Nod          87.10          61.36
To-Fro            67.86          67.86
So-What           76.92          47.62
Regress           78.85          57.75

All 19 sessions:

Gesture       Precision [%]
Head Shake        91.32
Head Nod          85.04
To-Fro            59.52
So-What           76.68
Regress           71.77

SLIDE 18

Analysis of Gestures vs. Discourse Functions

  • Histograms of the discourse functions vs. behaviors
  • Different gestures appear with different frequencies across discourse functions
    • Shakes happen in Negation more than in Affirmation
    • Nods happen in Affirmation more than in Negation
    • So-What happens more in Question than in other discourse functions

[Histograms: gesture frequency per discourse function, separate HEAD and HAND panels]

SLIDE 19

Modeling the gestures

  • Gesture retrieval → more samples to train the models
  • Assumptions
    • The target gesture is known
    • Speech prosody features are known
  • How to model the gesture?
  • Speech-driven models
    • Training: speech prosody features, motion capture data, and the prototypical gesture
    • Testing (synthesis): speech prosody features and the prototypical gesture

Gesture       # Retrieved
Head Shake        287
Head Nod          535
To-Fro            223
So-What           114
Regress           262

SLIDE 20

Speech-driven animation

  • Dynamic Bayesian Network
    • Shared hidden variable between speech and head/hand motion
  • Constrained on gestures
    • Add the constraint node as a parent of the hidden state (see the sketch below)
  • More robust to unbalanced data
  • Learns separately:
    • Prior probabilities of the gestures
    • The effect of gestures on the transition matrices

[Diagram: DBN with a shared hidden state Hh&s linking speech and head/hand motion; the gesture constraint node is a parent of the hidden state across time slices t-1 and t]
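A toy illustration of the constraint structure: the gesture node selects which transition matrix drives the shared hidden state, so gesture priors and gesture-conditioned dynamics are handled separately. The sizes and probabilities here are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_gestures = 8, 2                    # hypothetical sizes

# A[g, i, j] = P(h_t = j | h_{t-1} = i, gesture = g): one transition
# matrix per gesture value, as implied by making the gesture node a
# parent of the hidden state.
A = rng.dirichlet(np.ones(n_states), size=(n_gestures, n_states))
pi_gesture = np.array([0.7, 0.3])              # gesture priors (illustrative)


def next_state(h_prev, g):
    """Sample the shared hidden state one time slice forward."""
    return rng.choice(n_states, p=A[g, h_prev])
```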

SLIDE 21

HEAD Synthesis

[Videos: synthesized head motion, Nods and Shakes]

For illustration, the gesture constraint is always "on"

SLIDE 22

HAND Synthesis

[Videos: synthesized hand motion, To-Fro, So-What, and Regress]

For illustration, the gesture constraint is always "on"

SLIDE 23

Conclusions

  • This paper proposed a framework to automatically detect target gestures
    • Using a few examples from a motion capture database
    • The advantage of this framework is its flexibility to retrieve any gesture
  • The approach jointly solves the segmentation and detection of gestures
    • Multi-scale windows
    • Two-step detection framework
  • We used the retrieved samples to synthesize novel realizations of these gestures
    • Speech-driven animations constrained by these target behaviors

SLIDE 24

Future Work

  • Explore the minimum number of examples per gesture needed to achieve acceptable detection rates
  • Use adaptation to generalize the models to retrieve similar gestures from different subjects
    • With more data, a more restrictive threshold can be considered
  • Explore the effects of detection errors on the performance of the speech-driven models

SLIDE 25

Multimodal Signal Processing (MSP)

  • Questions?


http://msp.utdallas.edu/
