SLIDE 1

Understanding Dyadic Human Spoken Interactions Using Speech Processing Techniques:

Case studies in Autism Spectrum Disorder (ASD) and behavioral Couple Therapy

Jeremy 李祈均

*Materials in this presentation come partly from Daniel Bone, Dr. Matt Black, Prof. Panos Georgiou, Prof. Shri Narayanan

SLIDE 2

Picture credit to the USC SAIL lab: http://sail.usc.edu

SLIDE 3

What is BSP?

Employ and advance signal processing and machine learning to sense human behaviors

  • Aid in, and transform, traditional observational methods
  • Focus on mental health research and practice

Many benefits: speedup, parallel observation capability, large-scale trends, etc.

  • Significance: in the USA, ~10 million people receive psychotherapy every year, and the number is increasing
  • The state of the art hasn’t changed for decades

SLIDE 4

Mental health: traditional observational study

*Picture credit to: S. Narayanan and P. G. Georgiou, "Behavioral Signal Processing: Deriving Human Behavioral Informatics From Speech and Language," Proceedings of the IEEE, vol. 101, pp. 1203–1233, 2013.

SLIDE 5

Mental health: putting BSP in the loop

*Picture credit to: S. Narayanan and P. G. Georgiou, "Behavioral Signal Processing: Deriving Human Behavioral Informatics From Speech and Language," Proceedings of the IEEE, vol. 101, pp. 1203–1233, 2013.

SLIDE 6

SLIDE 7

Case Study I

Domain: behavioral couple therapy
Specifics: problem-solving interactions as part of IBCT
Engineering task: interaction modeling (vocal synchrony quantification)

Chi-Chun Lee, Athanasios Katsamanis, Matthew Black, Brian Baucom, Andrew Christensen, Panayiotis G. Georgiou, and Shrikanth S. Narayanan, "Computing Vocal Entrainment: A Signal-Derived PCA-Based Quantification Scheme with Application to Affect Analysis in Married Couple Interactions," Computer Speech and Language, 28(2): 518–539. doi:10.1016/j.csl.2012.06.006

SLIDE 8

Couple therapy: Integrative Behavioral Couple Therapy (IBCT)

SLIDE 9

  • Collaborative work between UCLA and UW
  • 134 seriously and chronically distressed REAL couples
  • 10 minutes long problem-solving spoken interactions
  • Audio-video recording (far-field microphone, varying noise conditions)
  • 33 global ratings of behavioral codes for each spouse (SSIRS, CIRS)
  • 372 sessions → ~90 hours of data
  • Manual transcripts available

Couple therapy database

Goal: studying this large amount of spontaneous interaction data

SLIDE 10

Application Domain 2: Couples Therapy Research

Automatic pre-processing: automatic speaker segmentation

  • Segment the sessions into meaningful regions
  – Recursive automatic speech-text alignment technique [Moreno 1998]
  – Session split into regions: wife / husband / unknown
  – Segmented >60% of sessions’ words into wife/husband regions for 293/574 sessions

Example (aligned text): “… that she’s known for five months and didn’t tell me …”

AM = Acoustic Model; LM = Language Model; Dict = Dictionary; MFCC = Mel-Frequency Cepstral Coefficients; ASR = Automatic Speech Recognition; HYP = ASR Hypothesized Transcript

*slide content credit to Dr. Matthew P. Black
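The region-splitting step above can be illustrated with a small bookkeeping sketch: given time-aligned words that already carry wife/husband/unknown labels, consecutive same-speaker words are merged into contiguous regions. The `Word` record and the sample data here are invented for illustration; this is not the authors' implementation.

```python
# Hypothetical sketch: merge speaker-labeled, time-aligned words (the output
# of an alignment pass) into contiguous wife/husband/unknown regions.
# The Word record and the sample data are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float
    speaker: str  # "wife", "husband", or "unknown"

def merge_regions(words):
    """Collapse consecutive same-speaker words into (speaker, start, end)."""
    regions = []
    for w in words:
        if regions and regions[-1][0] == w.speaker:
            spk, s, _ = regions[-1]
            regions[-1] = (spk, s, w.end)  # extend the open region
        else:
            regions.append((w.speaker, w.start, w.end))
    return regions

words = [
    Word("that", 0.0, 0.2, "wife"),
    Word("she's", 0.2, 0.5, "wife"),
    Word("known", 0.5, 0.8, "wife"),
    Word("well", 0.9, 1.1, "husband"),
]
print(merge_regions(words))  # [('wife', 0.0, 0.8), ('husband', 0.9, 1.1)]
```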

SLIDE 11

Application Domain 2: Couples Therapy Research

Automatic acoustic feature extraction: LLDs computation

  • Acoustic features shown to be relevant (e.g., [Gottman 1977, Yildirim et al. 2010])
  • 11 low-level descriptors (LLDs) extracted every 10ms with 25ms window

– Voice Activity Detector (VAD), speaking rate, pitch, energy, harmonics-to-noise ratio, voice quality, 13 MFCCs, 26 MFBs, magnitude of spectral centroid, spectral flux

  • Each session split into 3 “domains”: wife, husband, speaker-independent
  • 13 statistics (mean, std. dev. …) across each domain for each LLD

– 2000 features capture the global acoustic properties for each spouse

*slide content credit to Dr. Matthew P. Black
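The functional-extraction step above reduces each frame-level LLD stream to session-level statistics per domain. As a minimal sketch: the statistics below are an illustrative subset (the slide uses 13 but names only mean and standard deviation), and the synthetic pitch track stands in for real LLDs.

```python
# Illustrative sketch of the functional-extraction step: a frame-level LLD
# stream (one value per 10 ms frame) is reduced to session-level statistics
# per domain.  Only a subset of statistics is shown, and the pitch track
# is synthetic stand-in data.
import numpy as np

def functionals(lld):
    """A few example statistics over one low-level descriptor stream."""
    lld = np.asarray(lld, dtype=float)
    return {
        "mean": float(np.mean(lld)),
        "std": float(np.std(lld)),
        "min": float(np.min(lld)),
        "max": float(np.max(lld)),
        "median": float(np.median(lld)),
        "iqr": float(np.percentile(lld, 75) - np.percentile(lld, 25)),
    }

rng = np.random.default_rng(0)
pitch = 200 + 20 * rng.standard_normal(100)  # 100 frames = 1 s at a 10 ms hop

# One LLD x 6 statistics for one domain; repeating this over all LLDs and
# the wife/husband/speaker-independent domains is what yields the slide's
# ~2000 global features per session.
feats = {f"wife_pitch_{k}": v for k, v in functionals(pitch).items()}
print(sorted(feats))
```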

SLIDE 12

What is vocal synchrony?

  • Definition: naturally-spontaneous behavioral matching between partners in dyadic social interactions
  • Purpose in human interactions
  • Achieving communication efficiency* – unintentional effort
  • Communicating interest and engagement* – conscious effort
  • Psychological significance in theory and practice
  • Learning and memory in child-parent interactions
  • Regulating emotion processes*
  • Precursor to empathy
  • Mirror neurons

No quantification method exists! Can we quantify it even when it is beyond human perception – i.e., with no ground truth?

SLIDE 13

Unsupervised signal-derived method
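The cited journal paper names a PCA-based quantification scheme; a toy sketch of the general idea follows. It is heavily simplified relative to the published method (the feature choice, turn pairing, and similarity measure here are assumptions): build a principal subspace from one speaker's vocal feature vectors and ask how much of the partner's variance that subspace explains.

```python
# Toy sketch of a PCA-based entrainment measure (a simplification of the
# published scheme): learn a low-dimensional vocal subspace from speaker A's
# turn-level features, then measure the fraction of speaker B's variance
# captured by that subspace.  The random data stand in for real features.
import numpy as np

def pca_basis(X, k):
    """Top-k principal directions (columns) of row-vector data X."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k].T  # shape (dim, k), orthonormal columns

def explained_fraction(Y, basis):
    """Fraction of Y's centered variance captured by the subspace."""
    Yc = Y - Y.mean(axis=0)
    proj = Yc @ basis @ basis.T  # orthogonal projection onto the subspace
    return float(np.sum(proj ** 2) / np.sum(Yc ** 2))

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 6))                  # speaker A: 50 turns x 6 features
B = 0.8 * A + 0.2 * rng.standard_normal((50, 6))  # speaker B tracks A closely

score = explained_fraction(B, pca_basis(A, k=3))  # higher = more "entrained" to A
print(round(score, 3))
```

Because the projection is orthogonal, the score always lies in [0, 1], which makes it usable as a feature or metric without per-session rescaling.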

SLIDE 14

Verification

SLIDE 15

Study of behavioral codes and vocal synchrony

SLIDE 16

Utilization as features for affect recognition applications

SLIDE 17

04/11/2013

Utilization as quantitative metrics for clinical analysis via MLM (multilevel modeling) analysis

[Figure: wife-to-husband and husband-to-wife entrainment plotted against wife-demander/husband-withdrawer polarization; reported significance levels include p<0.001 within-partner effects, one p<0.01 between-partner effect, and one non-significant between-partner effect]

Clinical Implications · Behavioral Informatics

SLIDE 18

Case Study II

Domain: autism spectrum disorder
Specifics: ADOS Module 3 interview session
Engineering task: interaction modeling (atypical prosody quantification)

*slide content credit to Daniel Bone

Daniel Bone, Chi-Chun Lee, Matthew Black, Marian Williams, Sungbok Lee, Pat Levitt, and Shrikanth S. Narayanan, "The Psychologist as an Interlocutor in ASD Assessment: Insights from a Study of Spontaneous Prosody," Journal of Speech, Language, and Hearing Research, 2014. doi:10.1044/2014_JSLHR-S-13-0062

SLIDE 19

Autism Spectrum Disorder: ADOS session

SLIDE 20

ADOS – Module 3: behavioral codes

SLIDE 21

  • ADOS semi-structured assessment framework
  • Used to help psychologists diagnose autism (one popular tool)
  • Subject interacts with a psychologist for ~30-45 minutes
  • Constrained developmentally-appropriate tasks
  • 4 modules, depending on expressive language level and age
  • Module 1 (less than phrase speech): Free play, response to joint attention
  • Module 2 (some phrase speech): Joint interactive play, bubble play
  • Module 3 (verbally fluent): Make-believe play, telling a story from a book
  • Module 4 (verbally fluent adolescents/adults): More interview style
  • Psychologists rate the child’s socio-communicative skills
  • e.g., speech abnormalities (intonation/volume/rhythm/rate)
  • e.g., reciprocal social interaction (unusual eye contact)
  • Scores on sub-assessments added, and total score is used to diagnose ASD
  • Psychologists trained to administer ADOS using stringent training protocol
SLIDE 22

Atypical Prosody

  • Prosody refers to the way in which something is said (rhythm)
  • Intonation, Volume, Rate, and Voice Quality
  • Critical role in expressivity and social-affective reciprocity
  • Variety of abnormalities
  • Monotonous
  • Atypical lexical stress and pragmatic prosody
  • Speaking Rate
  • “Bizarre” quality to speech
  • Qualitative descriptions are general and mutually contrasting (e.g., “bizarre”)

"slow, rapid, jerky and irregular in rhythm, odd intonation or inappropriate pitch and stress, markedly flat and toneless, or consistently abnormal volume” -[Lord et al. 2003]

SLIDE 23

USC CARE Corpus

  • Child-psychologist ADOS interactions
  • ADOS – Autism Diagnostic Observation Schedule [Lord et al., 2000]
  • Multimodal: 2 HD video and 2 far-field microphones (ecological validity)
SLIDE 24

Experimental Setup: Subject Sample

  • Analysis focused on subjects administered the ADOS Module 3
  • Verbally fluent children and young adults
  • 30 sessions total, 28 appropriate for analysis
  • Manual transcription and segmentation
  • Transcription: spoken words, non-verbal communication, and vocalizations
  • Segmentation: single-speaker utterances, temporal markings
  • Psychologists
  • Three trained clinical psychologists conducted the ADOS sessions
  • Each psychologist administered ~9 sessions
SLIDE 25

Experimental Setup: Labels

  • Coding
  • 60-minute session / 14 subtasks
  • 28 codes scored by the psychologist who is interacting with the child
  • Not all codes used
  • Code of interest – Speech Abnormalities Associated with Autism
  • Scored on an integer scale from ‘0’ (appropriate) to ‘2’ (clearly abnormal)
  • Code of interest – ADOS totals
  • ADOS totals relate to the ‘severity’ of autism spectrum disorder
  • Three total codes: Communication, Social Interaction, and C.+S.I.
  • Higher resolution (min. 0, max. 8–22)
  • Spearman’s ρ = 0.74 (p < 10⁻⁶) for Speech Abnormality and C.+S.I. Total
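A rank correlation of the kind reported above can be sketched as follows. The helper computes Spearman's ρ from average ranks; the per-child scores below are invented toy values, not the study's data.

```python
# Rank-correlation check of the kind reported on the slide.  spearman_rho
# computes Spearman's rho via average ranks (handling ties); the per-child
# scores below are invented toy values, not the study's data.
import numpy as np

def spearman_rho(x, y):
    """Spearman's rank correlation with average ranks for ties."""
    def ranks(v):
        v = np.asarray(v, dtype=float)
        r = np.empty(len(v))
        r[np.argsort(v)] = np.arange(1, len(v) + 1)
        for val in np.unique(v):          # average the ranks of tied values
            r[v == val] = r[v == val].mean()
        return r
    rx, ry = ranks(x), ranks(y)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

speech_abnormality = [0, 1, 2, 1, 0, 2, 1, 0]  # toy code values, 0..2 per child
cs_total = [3, 9, 17, 11, 4, 15, 8, 5]         # toy C.+S.I. totals

print(round(spearman_rho(speech_abnormality, cs_total), 3))  # → 0.945
```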

SLIDE 26

Experimental Setup: Acoustic features

  • ASD literature: intonation, volume, rate, and voice quality
  • 25 acoustic-prosodic features per speaker
  • Intonation and volume (12 functionals)
  – Mean (μ) and Stdv (σ) of 2nd-order coefficients
  • Speaking rate and rhythm (9 functionals)
  – Mean (μ) and 90% quantile (q90) of both turn-end and non-turn-end syllabic speaking rate
  – Mean (μ) and Stdv (σ) of vowel and consonant durations
  – Proportion of vowel speech to total speech
  • Voice quality (4 functionals)
  – Median and inter-quartile range (IQR) of jitter and shimmer
  – Jitter and shimmer are extracted on extended vowels (at least 3·T0)

SLIDE 27

Experimental Setup: Acoustic features

  • Word-level features:
  – Phrase-boundary prosody is most perceptually salient, so we extract word-level features on turn-end words
  – We utilize lexical transcriptions and turn-level alignments
  – HTK alignment, Colorado Corpus children’s models, WSJ adult models (also used for phonetic features)
  • Intonation (pitch) and volume (intensity) contours:
  – Extracted using Praat; both in the log domain
  – Normalized per speaker by subtracting the mean
  – Contours are bounded in the range [-1, 1], then fit by a 2nd-order polynomial
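The contour parameterization just described can be sketched as below, assuming (one reading of the slide) that the time axis is scaled to [-1, 1] before the quadratic fit. The pitch values and speaker mean are invented for illustration.

```python
# Sketch of the contour parameterization: per-speaker mean normalization of
# a log-pitch contour, a time axis scaled to [-1, 1] (an assumption about
# the slide's bounding step), then a 2nd-order polynomial fit whose
# coefficients serve as word-level features.  All values are invented.
import numpy as np

def contour_features(log_f0, speaker_mean):
    """Quadratic-fit coefficients of a mean-normalized contour on t in [-1, 1]."""
    y = np.asarray(log_f0, dtype=float) - speaker_mean  # per-speaker normalization
    t = np.linspace(-1.0, 1.0, len(y))                  # bounded time axis
    c2, c1, c0 = np.polyfit(t, y, deg=2)                # curvature, slope, offset
    return c2, c1, c0

# A rising contour on a turn-end word (log-Hz)
f0 = np.log([180.0, 190.0, 205.0, 220.0, 240.0])
c2, c1, c0 = contour_features(f0, speaker_mean=np.log(200.0))
print(f"curvature={c2:.3f} slope={c1:.3f} offset={c0:.3f}")  # slope > 0: rising
```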

SLIDE 28

SLIDE 29

Speech processing technology in behavioral science

Engineering pros:

Large scale, consistency, parallel observation (third-party) – interaction modeling

Scientific pros:

Quantitative insights, data-driven discovery, objective method

Challenges:

Large amount of heterogeneity in behavioral production, perception, and effects; coupling during interaction (while interaction is essential!)

SLIDE 30

Human-centered Behavioral Signal Processing

Behavioral Interface

Behavioral Informatics

Behavioral Signal Processing

Psychiatry / Psychology

Behavioral Couple Therapy

  • Automating manual observational coding

Autism Spectrum Disorder (ASD)

  • ADOS diagnosis: Prosody modeling
  • RapidABC: speech engagement modeling
  • Virtual game: Physiology signal processing

Rapid Automatized Naming

  • Quantification of eye-voice coordination

Affective Computing

  • Multimodal emotion recognition
  • Multimedia theater acting analysis
  • Cross-corpora recognition

Interaction Synchrony

  • Quantification of vocal behavior synchronization

Automatic performance scoring

  • Impromptu speech scoring

Human-Machine Interface / Education

SLIDE 31

Thank You