

SLIDE 1

Retrieving Target Gestures Toward Speech Driven Animation with Meaningful Behaviors

NAJMEH SADOUGHI AND CARLOS BUSSO

Multimodal Signal Processing (MSP) Lab
Erik Jonsson School of Engineering and Computer Science
The University of Texas at Dallas

Nov. 11th, 2015

SLIDE 2

Motivation

  • Creating naturalistic nonverbal behaviors is important for conversational agents (CAs)
    • Animations
    • Entertainment
    • Virtual reality
  • More than 90% of human gestures occur while speaking
  • Complex relationship between gestures and speech
    • Cross-modality interplay
    • Synchronization

[Image: virtual human example from ICT-USC]

SLIDE 3

Previous studies on co-verbal gesture synthesis

  • Rule-based frameworks [Cassell et al., 1994; Kopp et al., 2006]
    + Define rules based on the semantics
    - Synchronization is challenging
    - The variation is limited
  • Speech-prosody-driven systems [Levine et al., 2010; Busso et al., 2007]
    + Learn movements and their synchronization from recordings
    + Capture the variation in the data
    - Disregard the context
  • Combinations of data-driven and rule-based methods [Stone et al., 2004; Marsella et al., 2013; Sadoughi et al., 2014]
    + Utilize the advantages and overcome the disadvantages of both

[Diagram: spectrum of approaches from rule-based to speech-driven]

SLIDE 4

Previous studies using both approaches

  • Stone et al. [2004]
    • Search for combinations of speech and motion units whose meaning matches the speech and the planned behaviors
  • Marsella et al. [2013]
    • Create appropriate gestures depending on the communicative goal of the utterance
    • Use speech prosody features to capture the stress and emotional state of the speaker
  • Sadoughi et al. [2014]
    • Constrain a speech-driven animation model based on semantic labels (e.g., Question and Affirmation)

[Diagram: speech and motion units (Si, Mi), (Sk, Mk) aligned to consecutive phrases p and p+1]

SLIDE 5

Our Vision

  • Creating a bridge between rule-based systems and data-driven frameworks
  • SAIBA framework [Kopp et al., 2006]:
    • Considering that the target gesture for synthesis is known
    • Synthesizing behaviors that are temporally aligned and coordinated with speech
    • Synthesizing behaviors that convey the right meaning

[Diagram: SAIBA pipeline (Intent Planning → Behavior Planning → Behavior Realization); rule-based systems cover the planning stages, data-driven systems the realization]

SLIDE 6

Objective of This Study

  • Annotating a few samples of a prototypical gesture
  • Retrieving gestures similar to the examples
  • Training the Behavior Realization model

Goal: Retrieve examples of prototypical gestures

SLIDE 7

Gesture Segmentation and Classification

  • Kovar et al. [2004]
    • Find gestures similar to a target gesture using DTW, and use the retrieved samples to expand the training set
  • Joshi et al. [2015]
    • Train a random forest model using video and depth maps of the joints
    • Use a multi-scale sliding window on new data (forward search)
  • Zhou et al. [2013]
    • Hierarchical aligned cluster analysis (HACA) to dynamically segment and cluster motion capture data into movement primitives

SLIDE 8

MSP-AVATAR Corpus

  • Multimodal database comprising:
    • Motion capture data
    • Video recordings
    • Speech recordings
  • Four dyadic interactions between actors
  • We motion-captured one of the actors
  • The database is rich in terms of discourse functions

SLIDE 9

Discourse Functions in MSP-AVATAR corpus

  • Discourse functions that elicit specific gestural behaviors
  • Selection guided by previous studies
    • Poggi et al. [2005]
    • Marsella et al. [2013]
  • 2-5 scenarios per discourse function
  • We used the recordings from one of the actors (66 mins)

The discourse functions: Contrast, Negation, Question, Uncertainty, Confirmation, Suggest, Warn, Inform, Pronouns (I/You/Other), Large/Small

SLIDE 10

Prototypical Behaviors

[Images: examples of the five prototypical gestures: To-Fro, So-What, Regress (hand); Nods, Shakes (head)]

                               So-What   To-Fro   Regress   Nods   Shakes
Samples (train)                   14        27        26      24       27
Samples (test & development)      21        29        73     138      115

SLIDE 11

Gesture Retrieval Framework Overview

  • Temporal reduction
    • The data is captured at 120 fps and may contain redundant information
  • Gesture segmentation
    • Gestures can happen with arbitrary durations
  • Gesture detection
    • Binary decision per segment

SLIDE 12

Temporal Reduction

  • Reduce the complexity of the system
  • Inspired by Zhou et al. [2013]
  • Non-uniform downsampling (see the sketch below)
    • Based on Linde-Buzo-Gray vector quantization (LBG-VQ)
    • Discard up to 5 consecutive frames if they fall in the same cluster
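As a rough illustration, here is a minimal sketch of this downsampling step. It assumes each frame is a feature vector of joint values and uses scikit-learn's KMeans as a stand-in for the LBG-VQ codebook; the cluster count is likewise an assumption, not a value from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans  # stand-in for LBG-VQ codebook training


def temporal_reduction(frames, n_clusters=64, max_skip=5):
    """Non-uniform downsampling of motion-capture frames (sketch).

    frames: (T, D) array, one row of joint features per 120-fps frame.
    Consecutive frames that fall in the same cluster as the last kept
    frame are dropped, but never more than `max_skip` in a row.
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(frames)
    keep, skipped = [0], 0
    for t in range(1, len(frames)):
        if labels[t] == labels[keep[-1]] and skipped < max_skip:
            skipped += 1            # redundant frame in the same cluster
        else:
            keep.append(t)          # cluster changed, or skip budget spent
            skipped = 0
    return frames[np.asarray(keep)]
```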

SLIDE 13

Gesture Segmentation

  • Window size (Lw)
  • Minimum length of a search segment (Lmin)
  • Maximum length of a search segment (Lmax)
  • Frame increment between iterations
    • Δ = (Lmax - Lmin)/30
  • One winner per window (see the sketch below)

[Diagram: candidate segments of lengths Lmin to Lmax, stepped by Δ, inside each search window over time t]
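A hypothetical enumeration of the candidate segments these parameters imply; the slide fixes only Lmin, Lmax, and Δ, so the way the window advances here is an assumption:

```python
def candidate_segments(n_frames, l_w, l_min, l_max, n_steps=30):
    """Yield (start, end) candidate segments at multiple scales (sketch).

    Within each search window of length l_w, segment lengths from l_min
    to l_max are tried in increments of delta = (l_max - l_min) / n_steps.
    Downstream, only one winning segment per window is kept.
    """
    delta = max(1, (l_max - l_min) // n_steps)
    for start in range(0, n_frames - l_min + 1, l_w):
        for length in range(l_min, l_max + 1, delta):
            if start + length <= n_frames:
                yield start, start + length
```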

SLIDE 14

Gesture Detection

  • One-class SVMs
    • Efficiently reduce the number of candidates
  • Dynamic time alignment kernel (DTAK)
    • To increase precision (the cascade is sketched below)

[Diagram: two-step detection; candidate segments pass through the one-class SVMs and, if accepted (Y=1), are scored with DTAK]
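The two-step cascade can be summarized as follows; `svm_accepts` and `dtak_score` are placeholder callables standing for the fused one-class SVMs and the DTAK similarity described on the next slides:

```python
import numpy as np


def two_step_detect(segments, svm_accepts, dtak_score, templates, threshold):
    """Sketch of the detection cascade (placeholder callables).

    A candidate segment counts as a detection only if it passes the cheap
    one-class-SVM filter and its median DTAK similarity to the annotated
    examples exceeds the threshold.
    """
    hits = []
    for seg in segments:
        if not svm_accepts(seg):       # step 1: prune most candidates
            continue
        score = np.median([dtak_score(seg, t) for t in templates])
        if score >= threshold:         # step 2: precise DTAK check
            hits.append(seg)
    return hits
```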

SLIDE 15

One-Class SVMs

  • Only positive samples
  • Limited number of training instances
  • Train separately for different features
  • Fuse the classifiers using the AND operator (sketched below)
  • Feature selection by cross-validation
    • Sort features according to accuracy
    • Remove features one by one until accuracy > 0.85

[Diagram: one one-class SVM per feature, STD(Joint i-x/y/z) for each joint, fused with an AND operator]
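A minimal sketch of the AND-fused ensemble, assuming one scalar feature per classifier (e.g., the standard deviation of one joint coordinate over the segment); the RBF kernel and ν value are assumptions:

```python
import numpy as np
from sklearn.svm import OneClassSVM


def train_ensemble(X_pos):
    """One one-class SVM per feature column, trained on positives only."""
    return [OneClassSVM(kernel="rbf", nu=0.1).fit(X_pos[:, [j]])
            for j in range(X_pos.shape[1])]


def predict_and_fuse(models, X):
    """AND fusion: a segment is accepted only if every SVM accepts it."""
    votes = np.stack([m.predict(X[:, [j]]) == 1   # +1 means inlier
                      for j, m in enumerate(models)])
    return votes.all(axis=0)
```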

SLIDE 16

DTAK by Zhou et al. [2013]

  • DTAK computes the similarity between two segments, regardless of their lengths, in terms of a (Gaussian) kernel
  • Final score: the median of the similarity measures to the training examples
  • Find a threshold by maximizing the F-score on the development set

$$\kappa_{ij} = \exp\!\left(-\frac{\lVert x_i - y_j \rVert^2}{2\sigma^2}\right)$$

$$u_{i,j} = \max\begin{cases} u_{i-1,j} + \kappa_{ij} \\ u_{i-1,j-1} + 2\,\kappa_{ij} \\ u_{i,j-1} + \kappa_{ij} \end{cases}
\qquad
\tau(X, Y) = \frac{u_{l_x, l_y}}{l_x + l_y}$$
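For concreteness, a direct transcription of this recursion in Python; σ is an assumed bandwidth and the boundary handling follows the usual DTAK initialization:

```python
import numpy as np


def dtak(X, Y, sigma=1.0):
    """Dynamic time alignment kernel (sketch of the recursion above).

    X: (lx, D) and Y: (ly, D) arrays of per-frame features.
    """
    lx, ly = len(X), len(Y)
    # Frame-level Gaussian kernel k[i, j] = exp(-||x_i - y_j||^2 / 2 sigma^2)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    k = np.exp(-d2 / (2.0 * sigma ** 2))
    u = np.zeros((lx, ly))
    for i in range(lx):
        for j in range(ly):
            if i == 0 and j == 0:
                u[i, j] = 2.0 * k[i, j]
            elif i == 0:
                u[i, j] = u[i, j - 1] + k[i, j]
            elif j == 0:
                u[i, j] = u[i - 1, j] + k[i, j]
            else:
                u[i, j] = max(u[i - 1, j] + k[i, j],
                              u[i - 1, j - 1] + 2.0 * k[i, j],
                              u[i, j - 1] + k[i, j])
    return u[-1, -1] / (lx + ly)
```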

SLIDE 17

Evaluation of Retrieved Gestures

  • Precision for head gestures > 0.85
  • Precision for hand gestures > 0.59
  • Head gestures are less complex than hand gestures, which explains their higher precision

Test & development sessions:

Gesture       Precision [%]   Recall [%]
Head Shake        95.65          42.31
Head Nod          87.10          61.36
To-Fro            67.86          67.86
So-What           76.92          47.62
Regress           78.85          57.75

All 19 sessions:

Gesture       Precision [%]
Head Shake        91.32
Head Nod          85.04
To-Fro            59.52
So-What           76.68
Regress           71.77

SLIDE 18

Analysis of Gestures vs. Discourse Functions

  • Histograms of the discourse functions vs. behaviors
  • Different gestures appear with different frequencies across discourse functions
    • Shakes happen in Negation more than in Affirmation
    • Nods happen in Affirmation more than in Negation
    • So-What happens more in Question than in other discourse functions

[Histograms: gesture frequency per discourse function, separate HEAD and HAND panels]

SLIDE 19

Modeling the gestures

  • Gesture retrieval → more samples to train the models
  • Assumptions
    • The target gesture is known
    • Speech prosody features are known
  • How to model the gesture?
  • Speech-driven models
    • Training: speech prosody features, motion capture data, and the prototypical gesture
    • Testing (synthesis): speech prosody features and the prototypical gesture

Gesture       # Retrieved
Head Shake        287
Head Nod          535
To-Fro            223
So-What           114
Regress           262

SLIDE 20

Speech-driven animation

  • Dynamic Bayesian Network
    • Shared hidden variable between speech and head/hand motion
  • Constrained on gestures
    • Add the constraint node as a parent of the hidden state (see the sketch below)
  • More robust to unbalanced data
  • Learns separately:
    • Prior probabilities of the gestures
    • The effect of gestures on the transition matrices

[Diagram: DBN with a shared hidden state Hh&s linking speech and head/hand motion; the gesture constraint node is a parent of the hidden state across time slices t-1 and t]
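A toy illustration of the constraint structure: the gesture node selects which transition matrix drives the shared hidden state, so gesture priors and gesture-conditioned dynamics are handled separately. The sizes and probabilities here are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_gestures = 8, 2                    # hypothetical sizes

# A[g, i, j] = P(h_t = j | h_{t-1} = i, gesture = g): one transition
# matrix per gesture value, as implied by making the gesture node a
# parent of the hidden state.
A = rng.dirichlet(np.ones(n_states), size=(n_gestures, n_states))
pi_gesture = np.array([0.7, 0.3])              # gesture priors (illustrative)


def next_state(h_prev, g):
    """Sample the shared hidden state one time slice forward."""
    return rng.choice(n_states, p=A[g, h_prev])
```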

SLIDE 21

HEAD Synthesis

[Videos: synthesized head motion, Nods and Shakes]

For illustration, the gesture constraint is always "on"

SLIDE 22

HAND Synthesis

[Videos: synthesized hand motion, To-Fro, So-What, and Regress]

For illustration, the gesture constraint is always "on"

SLIDE 23

Conclusions

  • This paper proposed a framework to automatically detect target gestures
    • Using a few examples from a motion capture database
    • The advantage of this framework is its flexibility to retrieve any gesture
  • The approach jointly solves the segmentation and detection of gestures
    • Multi-scale windows
    • Two-step detection framework
  • We used the retrieved samples to synthesize novel realizations of these gestures
    • Speech-driven animations constrained by these target behaviors

SLIDE 24

Future Work

  • Explore the minimum number of examples per gesture needed to achieve acceptable detection rates
  • Use adaptation to generalize the models to retrieve similar gestures from different subjects
    • With more data, a more restrictive threshold can be considered
  • Explore the effects of detection errors on the performance of the speech-driven models

SLIDE 25

Multimodal Signal Processing (MSP)

  • Questions?


http://msp.utdallas.edu/
