

SLIDE 1

25 August 2004 · http://www.cprince.com/PubRes/EpiRob04

Taking Synchrony Seriously: A Perceptual-Level Model of Infant Synchrony Detection

Christopher G. Prince, George J. Hollich, Nathan A. Helder, Eric J. Mislivec, Anoop Reddy, Sampanna Salunke, & Naveed Memon

Department of Computer Science, University of Minnesota Duluth, Duluth, MN, USA · Department of Psychological Sciences, Purdue University, West Lafayette, IN, USA · chris@cprince.com

SLIDE 2

Outline of Talk

 Types of Synchrony Detection
 A Model of Synchrony Detection
 Comparison to Infant Behavior
 Conclusions

SLIDE 3

Acknowledgements

 Collaborators
   Lakshmi Gogate
 Students
   Soleh Dib, Tyrel Pollak
   Tim Colburn's CS 4531 software engineering class
 Colleagues
   Rocio Alba-Flores, Kang James
 Supported in part by UROP grants and by a donation from Digi-Key

SLIDE 4

Types of Audio-Visual Synchrony Detection

 Punctuate speech-object synchrony
   Two-month-olds can detect (Gogate et al., 2004)
 Face-voice synchrony
   10- to 16-week-old infants (Dodd, 1979)
 Talker with distractor
   E.g., cocktail party (Hollich et al., in press)
 Multiple visual events
   E.g., multiple talkers (Pickens et al., 1994; Hollich & Prince, in progress)
SLIDE 5

Research Question

 Can a single general-purpose synchrony detection mechanism, estimating audio-visual synchrony from low-level signal features, account for infant synchrony detection across a broad range of audio-visual speech integration tasks?

SLIDE 6

Hershey & Movellan (2000)

 Computes mutual information between two sensory channels over a time window (length S)
 Assumes Gaussian-distributed sensory signals
 Synchrony defined as the mutual information between the sensory channels:

M(x, y, t_k) = (1/2) log2( |Σ_A(t_k)| |Σ_V(x, y, t_k)| / |Σ_{A,V}(x, y, t_k)| )

(an audio-dampening term is added to this equation later in the talk)

For other approaches see: http://www.cprince.com/PubRes/Zurich04
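For one audio feature paired with one pixel, the Gaussian mutual-information measure reduces to a function of the sample correlation between the two signals over the window. A minimal pure-Python sketch (the function name and signature are illustrative, not taken from the SenseStream code):

```python
import math

def gaussian_mutual_info(a, v):
    """Mutual information (bits) between two scalar signals over a window,
    under the Gaussian assumption of Hershey & Movellan (2000).
    For 1-D channels, (1/2) log2(|SA| |SV| / |SAV|) reduces to
    -(1/2) log2(1 - rho^2), where rho is the sample correlation."""
    n = len(a)
    ma, mv = sum(a) / n, sum(v) / n
    saa = sum((x - ma) ** 2 for x in a) / n          # audio variance
    svv = sum((y - mv) ** 2 for y in v) / n          # pixel variance
    sav = sum((x - ma) * (y - mv) for x, y in zip(a, v)) / n  # covariance
    rho2 = (sav * sav) / (saa * svv)                 # squared correlation
    return -0.5 * math.log2(1.0 - rho2)
```

Strongly correlated audio and pixel trajectories yield large values; unrelated trajectories yield values near zero, which is the intuition behind the mixelgram.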

SLIDE 7

Synchrony Detection with HM

HM algorithm
   Generates mixelgrams
   Each pixel of the mixelgram is a mixel, a mutual-information pixel

Perceptually relevant mixelgrams typically indicate synchrony between the two input channels (Vuppla, 2004)

SenseStream program: mixels computed from mutual information between audio and visual channels (Mislivec, 2004)

SLIDE 8

Calculation for Each Mixel

 Frames from one channel consist of f single n-element vectors (1 … n), here processed audio features
 Frames from the other channel consist of f sets of h × w m-element vectors (1 … m), here visual image features
 Each calculation uses a window of S frames
 Audio covariance matrix: Σ_A(t_k)   (n × n)
 Visual covariance matrix: Σ_V(x, y, t_k)   (m × m)
 Joint covariance matrix: Σ_{A,V}(x, y, t_k)   ((n+m) × (n+m))

M(x, y, t_k) = (1/2) log2( |Σ_A(t_k)| |Σ_V(x, y, t_k)| / |Σ_{A,V}(x, y, t_k)| )
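Given the three covariance matrices, the per-mixel value follows directly from their determinants. A sketch with a small hand-rolled determinant helper for illustration (production code would use a linear-algebra library):

```python
import math

def det(mat):
    """Determinant via Gaussian elimination with partial pivoting
    (adequate for the small covariance matrices used here)."""
    a = [row[:] for row in mat]
    n = len(a)
    d = 1.0
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(a[r][i]))  # pivot row
        if abs(a[p][i]) < 1e-12:
            return 0.0
        if p != i:
            a[i], a[p] = a[p], a[i]
            d = -d
        d *= a[i][i]
        for r in range(i + 1, n):
            f = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= f * a[i][c]
    return d

def mixel(cov_a, cov_v, cov_av):
    """M = (1/2) log2(|SA| |SV| / |SAV|) from the three covariances."""
    return 0.5 * math.log2(det(cov_a) * det(cov_v) / det(cov_av))
```

When audio and visual features are uncorrelated, the joint determinant equals the product of the marginal determinants and the mixel value is zero; correlation shrinks the joint determinant and drives the value up.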

SLIDE 9

Audio Dampening

 We use an additional multiplicative term on the HM equation to dampen mutual-information outputs when audio is "sub-audible"
 r = max RMS audio value over the S-frame interval
 θ = 50 is a fixed threshold

M(x, y, t_k) = (1/2) log2( |Σ_A(t_k)| |Σ_V(x, y, t_k)| / |Σ_{A,V}(x, y, t_k)| ) · damp(r)

where damp(r) is near 1 when r is well above θ and near 0 when the audio is sub-audible.
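The exact algebraic form of the dampening term is not legible on this slide; one plausible reading, shown purely as an illustration, is a logistic gate in r with threshold θ = 50:

```python
def audio_dampening(r, theta=50.0):
    """Hypothetical dampening factor (the slide's exact term is garbled):
    approaches 0 when the window's max RMS audio value r is far below the
    threshold theta ("sub-audible"), and approaches 1 when far above it."""
    return 1.0 / (1.0 + 2.0 ** (theta - r))
```

Any smooth monotone gate with this behavior would serve the stated purpose: suppressing spurious mutual-information output during near-silent audio.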

SLIDE 10

SenseStream Program Running

[Videos: Original Video | SenseStream running on the video]

SLIDE 11

Quantitative Analysis of Synchrony

 HM algorithm outputs mixelgrams
   Qualitative: depict synchrony graphically
 Also useful to reduce mixelgrams to scalars
   Quantitative synchrony analysis

SLIDE 12

Idea: Connected Regions

[Videos: Original Video | SenseStream running on the video]

SLIDE 13

Connected Region Analysis

 Compute the variance in sizes of connected regions per mixelgram
 Nonzero mixels i and j are said to be connected when j is one of the eight-neighbors of i (edge mixels have fewer neighbors), and

   max( M(i)/M(j), M(j)/M(i) ) ≤ Threshold

 applies, where M(mixel) is the value of the mixel and Threshold = 1.125
 Connected regions are the spatial extent of pairs of mixels that are connected
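The region analysis above can be sketched as a flood fill over the mixelgram. Reading the comparison as the symmetric ratio test max(M(i)/M(j), M(j)/M(i)) ≤ Threshold (an assumption about the garbled slide notation), with the variance of region sizes as the final statistic:

```python
def region_sizes(m, threshold=1.125):
    """Sizes of connected regions in mixelgram m (a 2-D list).
    Nonzero mixels i and j join a region when j is an 8-neighbor of i
    and max(M(i)/M(j), M(j)/M(i)) <= threshold."""
    h, w = len(m), len(m[0])
    seen = [[False] * w for _ in range(h)]
    sizes = []
    for y in range(h):
        for x in range(w):
            if m[y][x] == 0 or seen[y][x]:
                continue
            stack, size = [(y, x)], 0
            seen[y][x] = True
            while stack:                      # iterative flood fill
                cy, cx = stack.pop()
                size += 1
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and not seen[ny][nx] and m[ny][nx] != 0):
                            a, b = m[cy][cx], m[ny][nx]
                            if max(a / b, b / a) <= threshold:
                                seen[ny][nx] = True
                                stack.append((ny, nx))
            sizes.append(size)
    return sizes

def variance(xs):
    """Population variance of the region sizes."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)
```

Note the flood fill makes connectivity transitive: two mixels end up in one region whenever a chain of pairwise-similar neighbors links them.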

SLIDE 14

Edge Detection Method

 Another synchrony estimation method uses general-purpose image processing
 Relies on a similar observation to that of connected region analysis
 With mixelgram M, the scalar estimate is

   Σ_{i=1}^{h·w} Sobel_{3×3}( Gaussian_{15×15}( M ) )_i

 Generally better results than with connected region analysis
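A direct reading of the estimate above: smooth the mixelgram with a 15×15 Gaussian, apply 3×3 Sobel operators, and sum the gradient magnitude over all h·w positions. Pure-Python sketch; the Gaussian sigma (2.5) is an assumed parameter not given on the slide:

```python
import math

def convolve(img, ker):
    """'Same'-size 2-D convolution with zero padding at the borders."""
    h, w = len(img), len(img[0])
    kh, kw = len(ker), len(ker[0])
    oy, ox = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            s = 0.0
            for ky in range(kh):
                for kx in range(kw):
                    iy, ix = y + ky - oy, x + kx - ox
                    if 0 <= iy < h and 0 <= ix < w:
                        s += img[iy][ix] * ker[ky][kx]
            out[y][x] = s
    return out

def gaussian_kernel(n=15, sigma=2.5):
    """Normalized n x n Gaussian kernel (sigma is an assumption)."""
    c = n // 2
    k = [[math.exp(-((y - c) ** 2 + (x - c) ** 2) / (2 * sigma ** 2))
          for x in range(n)] for y in range(n)]
    t = sum(map(sum, k))
    return [[v / t for v in row] for row in k]

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def edge_synchrony(m):
    """Scalar synchrony estimate: sum over all h*w positions of the
    Sobel gradient magnitude of the 15x15-Gaussian-smoothed mixelgram."""
    g = convolve(m, gaussian_kernel())
    gx = convolve(g, SOBEL_X)
    gy = convolve(g, SOBEL_Y)
    return sum(math.hypot(gx[y][x], gy[y][x])
               for y in range(len(m)) for x in range(len(m[0])))
```

A mixelgram with coherent bright regions produces strong smoothed edges and a large sum; a flat or noisy mixelgram produces a small one, matching the connected-region intuition.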

SLIDE 15

Comparison to Infant Synchrony Detection Skills

No.  Type of Synchrony Detection
1    Punctuate speech-object synchrony
2    Face-voice synchrony
3    Multiple events: Cocktail party
4    Multiple events: Cocktail party with oscilloscope visual
5    Multiple events: Two talkers
SLIDE 16

# 1: Punctuate Speech-Object Synchrony: Infants

 Methods
   7-month-olds; each infant habituated in one condition:
     Synchronous movement condition (N = 16 infants): a hand moved 1 of 2 unfamiliar objects (toy crab & porcupine, or lamb chop & star), synchronous with the vowel "ahhh" (e.g., for crab) or "eee" (e.g., for porcupine)
     Asynchronous condition (N = 16): movements were the same; vowels uttered between movements
     Static condition (N = 16): vowels same, but no hand or object movement
   Testing: vowel-object pairings switched
   Measurements: looking time to the display
 Results
   After the synchronous condition only: infants look longer on switched trials relative to control trials (avg: 4.68 s)

(Gogate & Bahrick, 1998)

SLIDE 17

Punctuate Speech-Object Synchrony: Model

[Chart: Stimulus A, Object Motion and Speech; synchrony estimate per frame with speech onset and offset marked]

 Correlation with manually determined word onset/offset times
   r = 0.719, t(872) = 30.7, p < .001

(Stimulus A, Prince & Hollich, submitted)

SLIDE 18

# 2: Face-Voice Synchrony: Infants

 Method
   10- to 16-week-old infants (n = 12)
   Watched a person in a sound-proof chamber reciting nursery rhymes
   Audio switched every 60 s from being in synchrony to being 400 ms out of synchrony (400 ms audio delay)
 Results
   Inattention averaged 14.9% (range: 2.9%-29.2%) when speech was synchronous
   Averaged 34.3% (range: 1%-87%) when asynchronous
   19.36% decrement in infants' attention
   Face-speech synchrony helped infants direct their attention to the talker

(Dodd, 1979)

SLIDE 19

Face-Voice Synchrony: Model

[Charts: Stimulus B Control 1, Single Speaker on Right; Stimulus B Control 2, Single Speaker on Left; centroid, centroid average, and image center per video frame]

(Stimulus B, controls, Prince & Hollich, submitted)

SLIDE 20

# 3: Cocktail Party: Infants

 Methods
   7.5-month-old infants (N = 30 per condition)
   Familiarization: 3 conditions, each with female audio-visual speech & distractor male speech
     1) "Both" video
     2) Asynchronous dynamic female face
     3) Static female face
   Testing: Headturn Preference Procedure; "cup" or "dog" repeated on left or right
 Results
   Only in the "Both" condition did infants look reliably longer when target (vs. non-target) words played in test (p < .0001)

(Hollich, Newman & Jusczyk, in press)

[Video: "Both" condition]

SLIDE 21

Cocktail Party: Model Stimuli

[Videos: Male-only | Both | Female-only]

(Stimulus C, Prince & Hollich, submitted)

SLIDE 22

Cocktail Party: Model Results

 Synchrony estimates: average per mixelgram, edge detection method

           Female Voice   Both Voices   Male Voice
  Mean        21,139.1      19,438.8     17,444.4
  Std Dev     14,973.8      11,220.5     12,116.6
  Max         77,518.4      64,085.1     78,751.6
  Min              3.9           3.4        768.1

Means are statistically significantly different (p’s < 0.005) (Stimulus C, Prince & Hollich, submitted)

SLIDE 23

# 4 & # 5: Model Difficulties

[Chart: Stimulus B, Left and Right Speakers, Faces Moving But Talking Alternately; centroid, centroid average, and image center per video frame]

           Female Voice   Both Voices   Male Voice
  Mean         7,287.2       7,971.9      7,233.3
  Std Dev      4,077.7       2,903.4      3,138.7
  Max         29,822.7      28,309.5     28,056.0
  Min              9.2       1,038.2        535.0

(Stimulus B1 & D, Prince & Hollich, submitted)

SLIDE 24

Infant & Model Summary

No.  Type of Synchrony Detection                                Infants                 Model
1    Punctuate speech-object synchrony                          Speech-obj. assoc.      Good synchrony estimates
2    Face-voice synchrony                                       Synch/Asynch            Left vs. Right discrimination
3    Multiple events: Cocktail party                            Segment words           Varying synchrony estimates
4    Multiple events: Cocktail party with oscilloscope visual   Segment words           Problems with synchrony estimates
5    Multiple events: Two talkers                               Left vs. Right looking  Problems with Left vs. Right discrimination

SLIDE 25

Conclusions

 Progress in perceptual-level modeling of infant synchrony detection using a single, general-purpose mechanism
   Accords with evidence that synchrony detection by infants may be domain-general (Hollich et al., in press)
 Model issues
   Better audio-visual features?
   No visual or audio scene analysis
   No stochastic elements in model yet
     E.g., no learning or attention
   Have a preliminary model using AV synchrony to supervise speech-object learning (Prince & Mislivec, unpub.)