Head Motion Generation with Synthetic Speech: a Data Driven Approach

Najmeh Sadoughi and Carlos Busso
Multimodal Signal Processing (MSP) Lab
The University of Texas at Dallas, Erik Jonsson School of Engineering and Computer Science
September 2016



Motivation

  • Head motion and speech prosodic patterns are strongly coupled
  • Believable conversational agents should capture this relationship
    • Speech intelligibility [K. G. Munhall et al., 2004]
    • Naturalness [C. Busso et al., 2007; C. Liu et al., 2012; Mariooryad et al., 2013]
  • Rule-based approaches
    • Rely on the content of the message to choose the movement
    • Synchronization with speech is challenging
  • Speech-driven approaches [Sadoughi et al., 2014]
    • Learn the coupling from synchronized motion capture and audio recordings


Motivation

  • Training with synchronized speech and head movement recordings, testing with synthetic speech [Van Welbergen et al., 2015]
  • This paper addresses this mismatch between training and testing conditions

Figure: the speech-driven framework. During training, a DBN with a shared hidden node (Hh&s, at frames t-1 and t) couples the recorded audio with the recorded motion-capture head pose. During testing, the speech input is replaced by synthesized speech (TTS), e.g., "Don't you have anything on file here?", creating a mismatch between training and testing conditions. The goal is to scale the framework to synthetic speech.


Overview

Figure: the proposed approach. A parallel corpus of synthetic speech, time-aligned to the original recordings (waveforms: original, synthesized, aligned), is paired with the recorded motion capture for training or adapting the DBN (hidden node Hh&s at frames t-1 and t). Testing still uses synthesized speech (TTS), so training and testing conditions now match.


Corpus: IEMOCAP

  • Video, audio and MoCap recording
  • Dyadic interactions
  • Script and improvisation scenarios
  • We used 270.16 mins (non-overlapping speech)
  • Three head angular rotations
  • F0 and intensity (Praat)
  • Mean normalization per subject
  • Variance normalization, globally
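The two normalization steps above (mean removal per subject, then a single global variance scaling) can be sketched as follows; the function name and data layout are illustrative assumptions, not from the paper:

```python
import numpy as np

def normalize_features(features_by_subject):
    """Mean-normalize per subject, then variance-normalize globally.

    features_by_subject: dict mapping subject id -> (frames, dims) array
    of prosodic features (e.g., F0 and intensity). Returns a dict with
    the same keys. Illustrative sketch only.
    """
    # Remove each subject's mean so speaker-specific offsets vanish.
    centered = {s: x - x.mean(axis=0, keepdims=True)
                for s, x in features_by_subject.items()}
    # Scale by the standard deviation pooled over all subjects.
    pooled = np.vstack(list(centered.values()))
    global_std = pooled.std(axis=0, keepdims=True)
    return {s: x / global_std for s, x in centered.items()}
```

After this step every subject's features are zero-mean, and each feature dimension has unit variance over the pooled corpus.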

Parallel Corpus

  • OpenMary: open source text-to-speech (TTS)
  • Aligning the synthesized and original speech at the word level [Lotfian and Busso, 2015]
  • Praat warps the speech (pitch-synchronous overlap-add)
  • Replacing the zero segments with silent recordings
  • Mean normalization per voice
  • Variance normalization to match the variance of the neutral segments in IEMOCAP

Figure: waveforms of the synthetic speech, the original speech, and the aligned synthetic speech.
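The word-level alignment amounts to computing, per word, a duration ratio between the synthesized and the original utterance, which Praat then applies via pitch-synchronous overlap-add. A minimal sketch of the ratio computation, assuming word boundaries are already available from forced alignment (names illustrative):

```python
def duration_ratios(orig_words, synth_words):
    """Per-word tempo ratios for aligning TTS output to the original timing.

    Each argument is a list of (start, end) times in seconds for the same
    word sequence. A ratio > 1 means the synthetic word must be compressed
    to match the original duration; the actual warping is done with
    pitch-synchronous overlap-add (e.g., in Praat). Illustrative sketch.
    """
    assert len(orig_words) == len(synth_words)
    ratios = []
    for (o_start, o_end), (s_start, s_end) in zip(orig_words, synth_words):
        ratios.append((s_end - s_start) / (o_end - o_start))
    return ratios
```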


Modeling

  • The dynamic Bayesian network proposed by Mariooryad and Busso (2013)
  • Captures the coupling between speech prosodic features and head pose
  • Full observation during training
  • Partial observation during testing
  • Initialization by vector quantization (VQ)

Figure: DBN with a shared hidden node Hh&s jointly driving the speech and head-pose observations at consecutive frames (t-1, t).
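The VQ initialization can be sketched as plain k-means over the joint speech/head-pose vectors, with each frame's cluster index seeding the discrete hidden state. The seeding strategy and names here are illustrative assumptions, not the paper's exact recipe; training the coupled DBN itself (EM) is not shown:

```python
import numpy as np

def vq_initialize(speech, head_pose, n_states, n_iter=20):
    """Initialize the DBN's discrete hidden node by vector quantization.

    Clusters the joint speech/head-pose observations with k-means
    (deterministic farthest-point seeding, then Lloyd iterations);
    cluster indices give the initial hidden-state labels.
    """
    data = np.hstack([speech, head_pose])     # joint observation vectors
    # Deterministic farthest-point seeding of the codebook.
    centers = np.empty((n_states, data.shape[1]))
    centers[0] = data[0]
    for k in range(1, n_states):
        dist = np.min(np.linalg.norm(data[:, None] - centers[None, :k],
                                     axis=2), axis=1)
        centers[k] = data[dist.argmax()]
    for _ in range(n_iter):
        # Assign each frame to its nearest codeword.
        d = np.linalg.norm(data[:, None, :] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        # Re-estimate codewords; keep the old one if a cluster empties.
        for k in range(n_states):
            if np.any(labels == k):
                centers[k] = data[labels == k].mean(axis=0)
    return labels, centers
```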


Experiments

  • Three training settings:
    • C1 (baseline): train with natural recordings
      − Mismatch
    • C2: train with the parallel corpus
      − Synthetic speech is emotionally neutral
    • C3: train with natural recordings and adapt to synthetic speech
      • Mean and covariance adaptation
      • Adaptation only on speech

MAP adaptation of each state's Gaussian parameters (mean, then covariance):

$\hat{\mu}_i = \dfrac{n_i \bar{x}_i + n_{p_i} \mu_{p_i}}{n_i + n_{p_i}}$, with $\bar{x}_i = \frac{1}{n_i} \sum_t x^{(i)}_t$

$\hat{\Sigma}_i = \dfrac{n_i \Sigma_i + n_{p_i} \Sigma_{p_i} + \frac{n_i n_{p_i}}{n_i + n_{p_i}} (\bar{x}_i - \mu_{p_i})(\bar{x}_i - \mu_{p_i})^{\top}}{n_i + n_{p_i}}$

where $n_i$ counts the adaptation frames assigned to state $i$, and $n_{p_i}$, $\mu_{p_i}$, $\Sigma_{p_i}$ are the prior count, mean, and covariance from the model trained on natural speech.

Figure: adaptation is applied only to the speech observation of the DBN (Hh&s → Speech).
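The mean and covariance adaptation can be sketched as a standard MAP update of a Gaussian, combining statistics of the synthetic-speech adaptation data with priors from the model trained on natural recordings. The function name and the choice of prior weight are illustrative assumptions:

```python
import numpy as np

def map_adapt(x, mu_p, sigma_p, n_p):
    """MAP adaptation of a Gaussian mean and covariance to new data.

    x: (n, d) adaptation frames (synthetic-speech features).
    mu_p, sigma_p: prior mean and covariance from the natural-speech model.
    n_p: prior weight (pseudo-count). Illustrative sketch of the standard
    MAP update; the paper's exact hyperparameters may differ.
    """
    n = len(x)
    x_bar = x.mean(axis=0)
    # Adapted mean: data mean and prior mean weighted by their counts.
    mu = (n * x_bar + n_p * mu_p) / (n + n_p)
    # Adapted covariance: pooled scatter plus a mean-shift correction term.
    scatter = (x - x_bar).T @ (x - x_bar)      # n * sample covariance
    diff = (x_bar - mu_p)[:, None]
    sigma = (scatter + n_p * sigma_p
             + (n * n_p / (n + n_p)) * (diff @ diff.T)) / (n + n_p)
    return mu, sigma
```

With `n_p = 0` the update falls back to the maximum-likelihood estimates; large `n_p` keeps the adapted parameters close to the priors.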


Objective Evaluation

  • 5-fold cross validation
  • CCAs&h
    • CCA between the input speech and the generated head motion sequences
  • KLD
    • The amount of information lost by using the distribution of the synthesized head movements instead of the original one

Turn-based results (CCAs&h: higher is better; KLD: lower is better):

Condition                                  CCAs&h    KLD
M1   (train and test with original)        0.8615    8.4617
C1   (train with original)                 0.8103    8.3530
C2   (train with parallel corpus)          0.7901**  4.7579
C3-1 (mean adaptation)                     0.8399**  8.6299
C3-2 (mean and covariance adaptation)      0.8189*   9.3203

* p < 0.05, ** p < 0.01
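The CCAs&h metric can be sketched as the first canonical correlation between the speech features and the generated head-motion sequence. This is a generic CCA-by-whitening implementation, not necessarily the paper's exact code; the regularizer is an assumption added for numerical stability:

```python
import numpy as np

def first_cca(x, y, reg=1e-8):
    """First canonical correlation between two time-aligned feature streams.

    x: (frames, dx) speech features; y: (frames, dy) head-motion features.
    Whitens each stream with its (regularized) covariance and takes the
    largest singular value of the cross-covariance: the correlation of
    the best-aligned pair of linear projections.
    """
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    n = len(x)
    cxx = x.T @ x / n + reg * np.eye(x.shape[1])
    cyy = y.T @ y / n + reg * np.eye(y.shape[1])
    cxy = x.T @ y / n

    def inv_sqrt(c):
        # Inverse matrix square root via eigendecomposition.
        w, v = np.linalg.eigh(c)
        return v @ np.diag(1.0 / np.sqrt(w)) @ v.T

    m = inv_sqrt(cxx) @ cxy @ inv_sqrt(cyy)
    return np.linalg.svd(m, compute_uv=False)[0]
```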


Subjective Evaluation

  • Smartbody to render BVH files
  • 20 videos with the three conditions (C1, C2, C3-1)
  • 2 consecutive turns, to incorporate enough context
  • Each evaluator is given 10 x 3 videos
  • 30 evaluators in total, recruited on Amazon Mechanical Turk (AMT)
  • Each video is annotated by 15 raters
  • Kruskal-Wallis test (pairwise comparisons)
    • C1 and C3-1 are different (p < 7.4e-7)
    • C1 and C2 are different (p < 3.5e-3)
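In practice one would use scipy.stats.kruskal, but the Kruskal-Wallis H statistic itself is simple: pool and rank all ratings, then measure how far each condition's mean rank deviates from the overall mean rank. A minimal numpy sketch (omitting the tie-correction factor on H):

```python
import numpy as np

def kruskal_h(*groups):
    """Kruskal-Wallis H statistic for k independent groups of ratings.

    Pools all ratings, assigns average ranks to ties, and sums each
    group's squared deviation from the overall mean rank. Under the null
    hypothesis H is approximately chi-square with k-1 degrees of freedom.
    Minimal sketch without the tie correction applied to H.
    """
    pooled = np.concatenate(groups)
    n = len(pooled)
    order = pooled.argsort()
    ranks = np.empty(n)
    ranks[order] = np.arange(1, n + 1)
    # Average the ranks over tied values.
    for v in np.unique(pooled):
        mask = pooled == v
        ranks[mask] = ranks[mask].mean()
    h = 0.0
    start = 0
    for g in groups:
        r = ranks[start:start + len(g)]
        h += len(g) * (r.mean() - (n + 1) / 2) ** 2
        start += len(g)
    return 12.0 / (n * (n + 1)) * h
```

The test is nonparametric, which suits ordinal Likert-style naturalness ratings such as those collected on AMT.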


Subjective Evaluation

Figure: subjective naturalness ratings for the three conditions: trained with the original speech (C1), trained with the aligned synthetic speech (C2), and adapted to the aligned synthetic speech (C3-1).


Conclusions

  • This paper proposed a novel approach to scale a speech-driven framework for head motion generation to synthetic speech
  • We proposed to use a corpus of synthetic speech with signals time-aligned to the natural recordings
  • We used the parallel corpus to retrain or adapt the model to the synthetic speech (C2 and C3)
  • This approach reduces the mismatch between training and testing
  • Both objective and subjective evaluations demonstrate its benefits

Future Work

  • Adding emotional behaviors into our models
  • Including other facial gestures (e.g., eyebrow motion) and hand gestures
  • Constraining the generated behaviors on the underlying discourse function of the message to generate meaningful behaviors


Multimodal Signal Processing (MSP)

  • Questions?

http://msp.utdallas.edu/