SLIDE 1

Speaker Movement Correlates with Prosodic Indicators of Engagement

Rob Voigt, Robert J. Podesva, and Dan Jurafsky

Linguistics Department, Stanford University

SLIDE 2

Links Between Acoustic and Visual Prosody

  • Gestural apices align with pitch accents
    Jannedy and Mendoza-Denton (2006)
  • Production of “visual beats” increases the prominence of the co-occurring speech
    Krahmer and Swerts (2007)
  • Speakers move their head and eyebrows more during prosodically focused words
    Cvejic et al. (2010)

SLIDE 3

Question 1: Is the Relationship Between Acoustic and Visual Prosody Continuous?

  • Previous research
    • Identified discrete relationships
  • Our proposal
    • Examine scalar relationships
    • Particularly between movement and affective measures of engagement

Yu et al. (2004), Mairesse et al. (2007), Gravano et al. (2011), Oertel et al. (2011), MacFarland et al. (2013), etc.

SLIDE 4

Question 2: Methodological Barriers to Studying Visual Prosody

  • Prior studies generally employ
    • Time-intensive annotation schemes
    • Expensive or invasive experimental hardware
  • Thus face limitations
    • Small amounts of data
    • Prohibitive expense
SLIDE 5

Our Solution: New Data Source

Automatically extract visual and acoustic data from YouTube (see the sketch after this list)

  • Potentially huge amounts of data
  • Ecologically valid (“in the wild”)
  • Allows replicability
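
A minimal sketch of this download-and-split step, assuming the yt-dlp and ffmpeg command-line tools are installed; the URL argument and output paths are placeholders, not the authors' actual pipeline:

import subprocess
from pathlib import Path

def fetch_and_split(url, workdir="vlog"):
    out = Path(workdir)
    (out / "frames").mkdir(parents=True, exist_ok=True)
    video = out / "video.mp4"
    # Download the vlog as MP4.
    subprocess.run(["yt-dlp", "-f", "mp4", "-o", str(video), url], check=True)
    # Dump every frame as a numbered PNG.
    subprocess.run(["ffmpeg", "-i", str(video),
                    str(out / "frames" / "%06d.png")], check=True)
    # Extract mono audio for the acoustic analysis.
    subprocess.run(["ffmpeg", "-i", str(video), "-ac", "1",
                    str(out / "audio.wav")], check=True)
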
SLIDE 6

Our Solution: New Data Source

  • We chose “first day of school” video blogs (“vlogs”)
  • 14 videos, 95 minutes of data
  • Static backgrounds and stable cameras
  • Generally engaged, animated speakers
SLIDE 7

Our Solution: Automatic Phrasal Units

Approximate pause-bounded units (PBUs)

  • Our unit of prosodic analysis
  • Calculated with a simple iterative algorithm (sketched below)
    • Find silences (Praat) with a threshold of -30.0 dB; sounding portions are approximate PBUs
    • If average phrase length > 2 seconds, raise the threshold by 3.0 dB and re-extract
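
A minimal sketch of the iterative algorithm, reimplemented in NumPy rather than Praat; the 10 ms frame size and the dB reference (relative to the loudest frame) are my assumptions, while the -30.0 dB start, 3.0 dB step, and 2-second cap come from the slide:

import numpy as np

def extract_pbus(samples, sr, frame=0.01, start_db=-30.0, step_db=3.0, max_mean_len=2.0):
    samples = np.asarray(samples, dtype=float)
    hop = int(frame * sr)
    n_frames = len(samples) // hop
    rms = np.array([np.sqrt(np.mean(samples[i*hop:(i+1)*hop] ** 2))
                    for i in range(n_frames)])
    db = 20 * np.log10(rms / (rms.max() + 1e-12) + 1e-12)  # dB re: loudest frame
    thresh = start_db
    while True:
        sounding = db > thresh
        pbus, start = [], None
        for i, s in enumerate(sounding):          # collect runs of sounding frames
            if s and start is None:
                start = i
            elif not s and start is not None:
                pbus.append((start * frame, i * frame))
                start = None
        if start is not None:
            pbus.append((start * frame, n_frames * frame))
        mean_len = np.mean([e - s for s, e in pbus]) if pbus else 0.0
        if mean_len <= max_mean_len or thresh >= 0.0:
            return pbus                           # (start, end) times in seconds
        thresh += step_db  # phrases too long on average: count more audio as silence
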

SLIDE 8

Our Solution: New Visual Feature

Movement Amplitude

  • Assumes the speaker is talking in front of a static background
  • Quantifies speaker movement as the pixel-by-pixel difference between frames (see the sketch below)
  • Calculated in log space, z-scored per speaker
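
A rough sketch of the feature; grayscale frames and a mean-absolute-difference summary per frame pair are my assumptions, while the log space and per-speaker z-scoring come from the slide:

import numpy as np

def movement_amplitude(frames):
    """frames: sequence of HxW grayscale arrays from one speaker's video."""
    frames = [np.asarray(f, dtype=float) for f in frames]
    # Pixel-by-pixel difference between consecutive frames, one value per pair.
    diffs = np.array([np.abs(b - a).mean() for a, b in zip(frames, frames[1:])])
    log_ma = np.log(diffs + 1e-9)                    # calculated in log space
    return (log_ma - log_ma.mean()) / log_ma.std()   # z-scored per speaker
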
SLIDE 9

Visualization: Continuous Measurements

  • Video at 30 FPS allows observations at 30 Hz
SLIDE 10

Visualization: Movement-Only Video

  • Coarse, but reasonable overall estimation
SLIDE 11

Acoustic Features

  • Following prior work on prosodic engagement
  • Pitch (fundamental frequency) and intensity (loudness)
  • Eight features per phrase (see the sketch below)
    • max, min, mean, standard deviation (std) for both pitch and intensity
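
A small sketch of the per-phrase feature computation; it assumes pitch and intensity contours have already been tracked (e.g., with Praat), and dropping unvoiced (zero) pitch samples is my assumption:

import numpy as np

def phrase_features(pitch, intensity):
    """pitch, intensity: 1-D contours sampled over one pause-bounded unit."""
    feats = {}
    for name, contour in (("pitch", pitch), ("intensity", intensity)):
        vals = np.asarray(contour, dtype=float)
        vals = vals[vals > 0]          # drop unvoiced / silent samples
        feats[f"{name}_max"] = vals.max()
        feats[f"{name}_min"] = vals.min()
        feats[f"{name}_mean"] = vals.mean()
        feats[f"{name}_std"] = vals.std()
    return feats                        # the eight features for this phrase
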

SLIDE 12

Statistical Analysis

  • Movement amplitude measures (max, min, mean, std) are highly collinear
  • PCA for dimensionality reduction (sketched below)
  • Two components explain 96% of variance in MA
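
A sketch of this step with scikit-learn; the input layout (one row per phrase, columns for the four MA summaries) is an assumption:

from sklearn.decomposition import PCA

def movement_components(ma_stats):
    """ma_stats: (n_phrases, 4) array of per-phrase MA max, min, mean, std."""
    pca = PCA(n_components=2)
    components = pca.fit_transform(ma_stats)
    # On the authors' data, two components explain ~96% of the variance.
    print("variance explained:", pca.explained_variance_ratio_.sum())
    return components
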
SLIDE 13

Statistical Analysis

  • Series of linear regressions (see the sketch below)
    • Predicting acoustic variables from OVERALL MOVEMENT and MOVEMENT VARIANCE
  • Controlling for speaker-specific variation by including speakers as random effects
  • Controlling for log(phrase length)
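
One such regression might be fit as below with statsmodels' mixed linear model; the column names are placeholders, and treating the two principal components as the OVERALL MOVEMENT and MOVEMENT VARIANCE predictors follows the slide:

import statsmodels.formula.api as smf

def fit_model(df, outcome="pitch_mean"):
    """df: one row per phrase, with movement components, log length, speaker id."""
    model = smf.mixedlm(
        f"{outcome} ~ overall_movement + movement_variance + log_phrase_len",
        data=df,
        groups=df["speaker"],   # speaker-specific variation as random intercepts
    )
    return model.fit()
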
SLIDE 14

Experimental Pipeline

  • Download videos, extract frames and audio
  • Calculate approximate pause-bounded units (PBUs)
  • Compute movement amplitude for each frame
  • Calculate MA principal components
  • Extract acoustic features
  • Run statistical models
SLIDE 15

Results

[Results table: ** is p < 0.01, *** is p < 0.001, — is no significant relationship]

SLIDE 16

Results

  • During phrases with more OVERALL MOVEMENT, speakers use
    • higher and more variable pitch
    • louder and more variable intensity
  • MOVEMENT VARIANCE was not predictive of any of our acoustic features

SLIDE 17

Visualization: Across Phrases

  • Notice light and dark vertical banding
  • Suggests sequence modeling as future work
SLIDE 18

Moving Forward

  • More advanced vision-based features
  • Face tracking
  • Gesture recognition
  • Expanding the data
  • Genre effects
  • Sociolinguistic variables
  • Movement in interaction
SLIDE 19

Discussion

  • Further empirical evidence for a rich link between acoustic and visual prosody
  • Adds a dimension of quantity / continuous association, in addition to previously demonstrated temporal synchrony
  • Methodological contributions suggest new avenues for multi-modal analysis of prosody
  • Code and Corpus: nlp.stanford.edu/robvoigt/speechprosody