SLIDE 1

Prolegomena Acoustic Detection Intent Recognition Participant Characterization Summary

Predicting, Detecting & Explaining the Occurrence of Vocal Activity in Multi-Party Conversation

Kornel Laskowski
PhD Defense
Committee: Richard Stern (chair), Anton Batliner (FAU), Alan Black, Alex Waibel

  • K. Laskowski

Vocal Interaction in Multi-Party Conversation 1

SLIDE 2

A Multi-Party Conversation

a social event
  of duration T
  of K > 2 participants
the predominant activity is talk
What shapes participants’ deployment of talk?

SLIDE 3

The Vocal Interaction Chronogram Q

(Chapple, 1940; Dabbs & Ruback, 1987)
[Figure: chronogram with participant index k on one axis and time t on the other; each cell is marked as either Speaking or notSpeaking]
elides content (“what?”)
expresses form, via evolving local context
  chronemics (“when?”)
  attribution (“who?”)
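A chronogram can be held in memory as a K × T binary matrix. The sketch below is illustrative only: the function name `build_chronogram`, the 100 ms frame step, and the talk-spurt times are assumptions, not taken from the slides.

```python
def build_chronogram(segments, num_participants, num_frames, frame_step=0.1):
    """Return K rows of T cells; 1 means Speaking, 0 means notSpeaking."""
    q = [[0] * num_frames for _ in range(num_participants)]
    for k, start, end in segments:
        first = int(round(start / frame_step))
        last = min(int(round(end / frame_step)), num_frames)
        for t in range(first, last):
            q[k][t] = 1
    return q

# Hypothetical talk spurts: (participant index, start in seconds, end in seconds).
segments = [(0, 0.0, 0.4), (1, 0.3, 0.6), (2, 0.8, 1.0)]
Q = build_chronogram(segments, num_participants=3, num_frames=10)
for row in Q:
    print("".join("#" if cell else "." for cell in row))
```

Each row answers "who?" and each column answers "when?"; nothing about the words spoken is stored.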

SLIDE 4

Modeling Chronograms

Given a chronogram Q, want the probability P(Q).
What does this mean?
Contrastive speech exchange systems (Sacks et al, 1974):
  conversation
  lecture
  formal debate
  ritual

SLIDE 5

Why Model Chronograms?

1 P(Q) can represent a time-independent and participant-independent prior for speech activity detection
    like a language model yields a prior for speech recognition
2 P(Q|G) can yield a similar prior for conversational genre G
    allows for inference of “what genre G is this conversation?”
3 P(Q|t) yields a time-dependent prior
    allows for inference of “what is happening at instant t?”
4 P(Q|k) yields a participant-dependent prior
    allows for inference of “what is the role of participant k?”

SLIDE 6

Past Work on Modeling Chronograms

interaction chronography (Chapple, 1939; Chapple, 1949)
modeling in dialogue: K = 2
  telecommunications (Norwine & Murphy, 1938; Brady, 1969)
  sociolinguistics (Jaffe & Feldstein, 1970)
  psycholinguistics (Dabbs & Ruback, 1987)
  dialogue systems (cf. Raux, 2008)
modeling in multi-party settings: K > 2
  qualitative: Conversation Analysis (Sacks et al, 1974)
  quantitative: THIS THESIS

SLIDE 7

How to Model Multi-Party Chronograms?

That depends very much on the task.

1 acoustic detection
    1 speech
    2 laughter
2 intent recognition
    1 dialog acts
    2 attempts to amuse
3 participant characterization
    1 diffuse social status
    2 assigned role

SLIDE 8

SPEECH DETECTION

SLIDE 9

The Goal of Speech Activity Detection (SAD)

Given multichannel nearfield audio X:
[Figure: audio for channels 1 to 7, time axis from 1260 to 1285]
Produce multi-participant speech chronogram Q:
[Figure: speech chronogram for participants 1 to 7, time axis from 1260 to 1285]

SLIDE 10

Prior Research in SAD in Multi-Party Meetings

nearfield, HMM-based speech activity detection (Acero, 1994)
in meetings: ASR segmentation (Pfau, Ellis & Stolcke, 2001)
  crosstalk is the most serious problem
in meetings: multiple microphone states
  3 states (Huang & Harper, 2005)
  4 states (Wrigley et al, 2005)
in meetings: crosstalk suppression
  energy normalization (Boakye & Stolcke, 2006)
  echo cancellation (Dines et al, 2006)
all of this work decodes participants one at a time

SLIDE 11

The Standard SAD Baseline

hidden Markov model decoder
topology-enforced minimum duration constraints
  16 ms frame step
  500 ms for speech
  500 ms for non-speech
acoustic model
  32 ms frame size
  log-energy, MFCCs, ∆s, ∆∆s (39)
  Gaussian mixture model (GMM) emissions
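A minimum duration can be enforced by topology by chaining states so that a class cannot be left before a fixed number of frames has passed. The sketch below only works out how many chained states the 500 ms constraint implies at a 16 ms frame step. Rounding up, and the names `min_duration_states` and `chain_arcs`, are assumptions; the slides do not give the state count.

```python
import math

FRAME_STEP_MS = 16      # from the slide
MIN_DURATION_MS = 500   # from the slide, for both speech and non-speech

def min_duration_states(min_ms=MIN_DURATION_MS, step_ms=FRAME_STEP_MS):
    """Number of chained states needed to cover min_ms (rounded up: assumption)."""
    return math.ceil(min_ms / step_ms)

def chain_arcs(num_states):
    """Allowed arcs of a left-to-right chain: every state advances by one,
    and only the last state may loop on itself (and later exit)."""
    arcs = [(i, i + 1) for i in range(num_states - 1)]
    arcs.append((num_states - 1, num_states - 1))
    return arcs

n = min_duration_states()
print(n, len(chain_arcs(n)))
```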

SLIDE 12

The Crosstalk Problem

[Figure: nearfield channels of participants mn036, me045, me010, me003, mn015, fe004 and me012, time axis from 1275 to 1280]

SLIDE 13

How Might Chronogram Modeling Help?

Detection is the inference of the chronogram: P(Q|X) ∝ P(X|Q) · P(Q)

1 treat Q as a vector-valued process: $\ldots, q_{t-1}, q_t, q_{t+1}, \ldots$ (each $q_t$ is a column vector with one entry per participant)
2 assume process is 1st-order Markovian:

   $P(Q) = \prod_{t=1}^{T} P(q_t \mid q_{t-1})$
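The first-order Markov assumption can be sketched in a few lines: read the chronogram column by column, count joint-state transitions, and sum log-probabilities. This is a minimal illustration with unsmoothed maximum-likelihood estimates; the function names and the toy chronogram are assumptions, and the initial state is simply taken as given.

```python
import math
from collections import Counter

def columns(chronogram):
    """Turn a K x T chronogram into a list of T joint-state tuples q_t."""
    return list(zip(*chronogram))

def fit_transitions(states):
    """Maximum-likelihood estimate of P(q_t | q_{t-1}) from one state sequence."""
    pair_counts = Counter(zip(states[:-1], states[1:]))
    from_counts = Counter(states[:-1])
    return {(a, b): n / from_counts[a] for (a, b), n in pair_counts.items()}

def log_prob(states, transitions):
    """Sum of log P(q_t | q_{t-1}) over the sequence (unseen pairs raise KeyError)."""
    return sum(math.log(transitions[(a, b)])
               for a, b in zip(states[:-1], states[1:]))

# Toy chronogram: participant 0 speaks for two frames, then participant 1.
toy = [[1, 1, 0, 0],
       [0, 0, 1, 1]]
states = columns(toy)
model = fit_transitions(states)
print(log_prob(states, model))
```

A practical model would need smoothing for unseen transitions, which this sketch leaves out.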

SLIDE 14

The Multi-Participant State Space

if the topology T, of N states, for a single participant is
[Figure: single-participant topology T]
then the K-participant topology is the Cartesian product of T:

   $q \in T \times T \times \cdots \times T$

the number of multi-participant states is $N^K$
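The growth of the joint state space is easy to check by enumeration. The two-state topology and the name `joint_states` below are assumptions chosen for illustration.

```python
from itertools import product

# Assumed minimal single-participant topology with N = 2 states.
SINGLE_TOPOLOGY = ("notSpeaking", "Speaking")

def joint_states(topology, num_participants):
    """All K-tuples drawn from the single-participant topology."""
    return list(product(topology, repeat=num_participants))

states = joint_states(SINGLE_TOPOLOGY, 3)
print(len(states))  # N ** K joint states
```

With a larger per-participant topology the count quickly becomes impractical, which motivates the more compact transition models on the following slides.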

SLIDE 15

Joint Transition Model: Degree of Overlap

Want the transition from $q_{t-1}$ to $q_t$ to be:
  invariant to participant index rotation
  independent of number K of participants

3 replace q with ‖q‖, the number of speaking participants

[Figure: example joint-state transitions mapped to degree-of-overlap transitions, e.g. 2 → 1]

SLIDE 16

Joint Transition Model: Extended Degree of Overlap

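Unfortunately, distinct joint-state transitions can map to the same degree-of-overlap transition:

[Figure: two different joint-state transitions, both mapped to 1 → 1]

4 augment “to”-state with number of same participants speaking at both t and t − 1

[Figure: the same two transitions, now mapped to 1 → [ 1, 1 ] and 1 → [ 0, 1 ]]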

SLIDE 17

Joint Acoustic Model

Can assume multi-channel acoustics to be independent,

   $P(X \mid Q) = \prod_{k=1}^{K} P(X[k] \mid Q[k])$

but crosstalk proves that they are not.
The covariance matrix Σ of log-energy has size K × K
  off-diagonal entries are non-zero
  off-diagonal entries generalize poorly
    depend on room acoustics
    depend on inter-participant proximity

SLIDE 18

Two-Pass Decoding

A solution to this problem is to:

1 obtain models on test conversation
2 high-precision first pass (Laskowski & Schultz, 2004; 2006)
    Non-Target-Normalization of Cross-Correlation Maxima
    compute cross-correlation maxima for all channel pairs
    can infer relative geometry of all participants
3 train full-covariance log-energy model (from scratch)
4 interpolate with supervised single-participant models

SLIDE 19

Classification error on EvalSet

[Bar chart: classification error (%) on EvalSet for five system configurations, two series of bars]

              16ms    100ms   +JTM    +JAM    +DUR
series 1      8.68    8.13    4.67    3.95    4.06
series 2      3.96    3.80    3.73    3.50    3.49

series 1, 16ms to +DUR: 4.62 %abs, 53.2 %rel
series 2, 16ms to +DUR: 0.47 %abs, 11.9 %rel
additional annotation: 6.6 %rel
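The %abs and %rel annotations are consistent with the end points 8.68 → 4.06 and 3.96 → 3.49 shown on the slide. A short check, with `reductions` as a hypothetical helper name:

```python
def reductions(baseline, improved):
    """Absolute and relative (percent) error reduction, rounded as on the slide."""
    absolute = baseline - improved
    relative = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative, 1)

print(reductions(8.68, 4.06))
print(reductions(3.96, 3.49))
```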

SLIDE 24

Summary

1 Chronograms make it easy to model joint behavior.
2 This enables control over hypothesized degree of overlap.
    participants take turns to talk
3 Limiting potential overlap reduces impact of crosstalk.
4 Error rates reduced by 40-70% relative to standard baseline.

SLIDE 25

LAUGHTER DETECTION

SLIDE 26

Laughter is Surprisingly Frequent

what else are participants doing (that can be heard)?
analysis in ICSI Meeting Corpus (67 hours of conversation)
laughter is (Laskowski & Burger, 2007):
  the most frequently transcribed non-verbal vocalization
  >13,000 bouts of laughter in total
  accounts for 9% of all vocal effort
  bouts containing some voicing: 66%
  bouts containing no voicing: 34%

SLIDE 27

Laughter Detection Results, Briefly

extend 2-class SAD topology to 3-class topology
achieves F-scores in the range 30-50%
  ERR = misses + false alarms of about 20-30%
  higher than reported for the 4000 most audible voiced bouts
  EERs < 10% (Truong & van Leeuwen, 2007; Knox et al, 2008)
  obtained F = 47.7%
  only available baseline for all laughter
joint modeling improves F-scores only by ≈ 2%abs
  and only for small topologies

SLIDE 28

Why Laughter Detection Poorer than Speech Detection

1 laughter not very confusable with speech
2 laughter most confusable with silence
    laughter syllables contain long intervening pauses
    also, unvoiced syllables sound just like breathing
3 highest F-scores achieved by extending minimum duration constraints
    well beyond the most likely durations of laugh bouts
4 large topologies prohibit joint participant decoding
5 also: joint participant decoding of only limited viability
    participants wait their turn to talk
    participants do not wait their turn to laugh

SLIDE 29

DIALOG ACT RECOGNITION

SLIDE 30

Time-Dependent Modeling of Chronograms

condition transition probabilities at instant t:
  not only on whether participants are talking or not talking
  but on what they are trying to achieve by talking
  → inference of intent
encoded in content-independent dialog act (DA) type
  e.g., statements, questions, backchannels
enables text-independent DA recognition (Laskowski & Shriberg, 2009; 2010)
assign to each instant t, at which a participant is talking,
  a DA type
  optionally, a DA boundary type
recognition ≡ segmentation AND classification

SLIDE 33

Prior Research on Dialog Act Recognition

lots of work in meetings, e.g.
  Ang, Liu & Shriberg, ICASSP 2005.
  Ji & Bilmes, ICASSP 2005.
  Zimmermann, Stolcke & Shriberg, ICASSP 2006.
  Dielmann & Renals, MLMI 2007.
relying on one or more of
  true DA boundaries (i.e., DA classification only)
  word identities (true or ASR)
  word boundaries (true or ASR)
work in which DA boundaries, word boundaries, and word identities are not assumed had not been done

SLIDE 34

DA Types in ICSI Meetings

Propositional Content DA Types
  statement, s (85%)
  question, q (6.6%)
“Short” DA Types
  Feedback Types (5.4%)
    backchannel, b (2.8%)
    acknowledgment, bk (1.5%)
    assert, aa (1.1%)
  Floor Mechanism Types (3.6%)
    floor holder, fh (2.7%)
    floor grabber, fg (0.6%)
    hold, h (0.3%)

SLIDE 35

The Single-Participant DA State Space

one DA-specific sub-topology for each of 8 DA types
fully connected via silence sub-topologies
[Figure: the 8 DA sub-topologies s, q, h, fh, fg, bk, b, aa]

SLIDE 36

A DA Sub-Topology

[Figure: a DA sub-topology. Labels: DA-terminal TSF subtopology; non-DA-terminal TSF subtopologies; intra-DA gap subtopology; inter-DA gap subtopologies (from other DA type, to other DA type)]

SLIDE 37

Time-Dependent Modeling of Chronograms

single-participant state space consists of hundreds of states
therefore, model participant transitions independently
but capture a local chronogram snapshot as an emission
[Figure: snapshot window extending T/2 to either side of the current instant, with rows OTH2, OTH1, SPKR, OTH3, OTH4]
K-independence: retain only 3 most talkative interlocutors
rotation invariance: rotate interlocutors by talkativity rank
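A snapshot of this kind can be sketched as follows. Ranking interlocutors by how much they speak inside the window, padding with silent rows when fewer than three interlocutors exist, and the name `snapshot` are all assumptions made for illustration; the slides do not specify these details.

```python
def snapshot(chronogram, target, center, half_width, num_others=3):
    """Rows: the target speaker first, then the most talkative interlocutors."""
    num_frames = len(chronogram[0])
    lo = max(0, center - half_width)
    hi = min(num_frames, center + half_width)
    window = {k: list(row[lo:hi]) for k, row in enumerate(chronogram)}
    others = [k for k in window if k != target]
    # Rank by speaking time inside the window; ties go to the lower index.
    others.sort(key=lambda k: (-sum(window[k]), k))
    rows = [window[target]] + [window[k] for k in others[:num_others]]
    # Pad with silent rows so the feature size does not depend on K.
    while len(rows) < 1 + num_others:
        rows.append([0] * (hi - lo))
    return rows

# Toy chronogram with K = 5 participants and T = 6 frames.
Q = [[0, 1, 1, 1, 0, 0],
     [1, 1, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0],
     [0, 0, 0, 1, 1, 1],
     [0, 0, 1, 0, 0, 0]]
feature = snapshot(Q, target=0, center=3, half_width=2)
for row in feature:
    print(row)
```

Because rows are ordered by rank rather than by participant index, the same feature results however the participants are numbered.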

SLIDE 38

The Probability of Speaking Near DA Types

[Figure: probability of speaking near two DA types, statement and floor grabber. Upper panel: most talkative interlocutor. Lower panel: target participant producing the DA.]

SLIDE 39

Average 8-class F-scores, EvalSet

[Bar chart. Systems compared: TOPO; TOPO + CHRONOGRAM; TOPO + PROSODY; TOPO + CHRONOGRAM + PROSODY; TOPO + TRUE WORDS. Bar values (pairing with systems not preserved): 54.5, 39.8, 33.7, 31.1, 21.8]

SLIDE 40

Punctuation F-scores, EvalSet

[Bar chart. Systems compared: TOPO; TOPO + CHRONOGRAM; TOPO + PROSODY; TOPO + CHRONOGRAM + PROSODY; TOPO + TRUE WORDS. Bar values (pairing with systems not preserved): 53.9, 68.2, 62.6, 68.6, 71.3]

SLIDE 41

Summary

1 local snapshots of speech chronograms correlate with production of specific dialog act types
2 correlation sufficiently strong to form the basis of a text-independent DA recognizer
3 local snapshots of speech chronograms complementary with prosodic features
4 for several dialog act types and dialog act boundary types, performance approaches that using models of manually transcribed word sequences

SLIDE 42

HUMOR DETECTION

SLIDE 43

Why Care About Humor?

talk produced not only to communicate facts or control floor
also to regulate socio-emotional state of interlocutors
humor qualifies seriousness of propositional content
only prior research in meetings (Clark & Popescu-Belis, 2004)
  indicated detectability not above chance

SLIDE 44

Humor Annotation in ICSI Meetings

both statements (s) and questions (q) license the optional j
  attempts to amuse or attempts at sarcasm
  accounts for 0.6% of speech by time
break j out as a 9th DA type
then run DA recognition, as shown earlier
score only detection of j
[Figure: DA state space, the 8 DA types (s, q, h, fh, fg, bk, b, aa) extended with j]

SLIDE 45

Humor Detection Error Rates, EvalSet

[Bar chart. Systems compared: TOPO; TOPO + TRUE WORDS; TOPO + SPEECH; TOPO + LAUGHTER; TOPO + SPEECH + LAUGHTER. Bar values (pairing with systems not preserved): 23.7, 19.4, 71.4, 83.3, 94.2]

laughter chronograms = best single source of information for detecting humor
combination with speech chronograms leads to improvement
combination with lexical system leads to no improvement

SLIDE 46

Interlocutor Probability of Laughing

[Figure: probability of laughing for the locally 1st and locally 2nd most laughing interlocutors, compared between ¬j DAs and j DAs]

SLIDE 48

Target Speaker Probability of Laughing

[Figure: probability of laughing for the target speaker, compared between ¬j DAs and j DAs]

How well do we do with laughter only from the target speaker?
ERR = 31% rather than 23.7%

SLIDE 51

Summary

speech chronograms play an important role in text-independent DA recognition
  text-independent: without using words
  system approaches the performance achievable by a text-dependent system
laughter chronograms play a crucial role in detection of attempts to amuse
  in either text-independent or text-dependent systems
jokers, in work-place conversations, appear to signal that they have joked by laughing themselves

SLIDE 52

STATUS CLASSIFICATION

SLIDE 53

What can be said of individuals?

observing only the vocal interaction chronogram
(Laskowski, Ostendorf & Schultz, 2007; 2008)

SLIDE 54

Prior Research

static characterization of meeting participants
  dominance rankings: Rienks & Heylen, 2005
  influence rankings: Rienks et al., 2006
static characterization of radio talk show participants
  roles: Vinciarelli, 2007
dynamic characterization of meeting participants
  roles: Banerjee & Rudnicky, 2004
  roles: Zancanaro et al., 2006
  roles: Rienks et al., 2006
lots of work in social psychology, for dialogue
  human resource allocation
  diagnosis of psychological disorders

SLIDE 58

Modeling Individual Participation

1 assume participant behavior to be conditionally independent, given prior joint participant behavior

   $P(q_t \mid q_{t-1}) = \prod_{k=1}^{K} P(q_t[k] \mid q_{t-1})$

2 infer model for each participant, given test conversation
3 extract specific probabilities as features
4 model using independent Gaussian emission probabilities
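The conditional-independence assumption in step 1 means the joint transition probability is a product of per-participant terms, each of which still sees the full previous joint state. A minimal sketch with two made-up participant models follows; the probabilities and function names are hypothetical.

```python
from itertools import product

def joint_transition_prob(q_prev, q_next, speak_prob):
    """Product over participants k of P(q_t[k] | q_{t-1}).

    speak_prob[k] maps the previous joint state to P(q_t[k] = 1 | q_{t-1})."""
    p = 1.0
    for k, state in enumerate(q_next):
        p_speak = speak_prob[k](q_prev)
        p *= p_speak if state else 1.0 - p_speak
    return p

# Hypothetical per-participant models (numbers invented for illustration).
speak_prob = [
    lambda prev: 0.9 if prev[0] else 0.1,        # participant 0 tends to keep state
    lambda prev: 0.05 if any(prev) else 0.2,     # participant 1 avoids overlap
]

prev = (1, 0)
total = sum(joint_transition_prob(prev, nxt, speak_prob)
            for nxt in product((0, 1), repeat=2))
print(joint_transition_prob(prev, (1, 0), speak_prob), total)
```

The per-participant terms are what step 3 reads off as features.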

SLIDE 59

Features F Describing Participant Classes

1 probability of vocalizing (V): $f^{V}_{k}$
2 probability of initiating vocalization (VI) in prior silence: $f^{VI}_{k}$
3 probability of continuing vocalization (VC) in prior non-overlap: $f^{VC}_{k}$
4 probability of initiating overlap (OI) in prior non-overlap: $f^{OI}_{k,j}$
5 probability of continuing overlap (OC) in prior overlap: $f^{OC}_{k,j}$

[Figure: chronogram diagrams, one per feature, marking participant k and, for OI and OC, interlocutor j]
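The five feature types can be estimated by counting over consecutive chronogram columns. The exact conditioning events are not spelled out on the slides, so the definitions coded below are one plausible reading and should be treated as assumptions: "prior silence" means nobody spoke at t − 1, "prior non-overlap" means exactly one participant spoke at t − 1, and overlap between k and j means both speak. The name `features` is likewise hypothetical.

```python
def features(chronogram, k, j):
    """Estimate V, VI, VC for participant k and OI, OC for the pair (k, j), k != j."""
    cols = list(zip(*chronogram))
    counts = {name: [0, 0] for name in ("VI", "VC", "OI", "OC")}  # [hits, trials]
    for prev, nxt in zip(cols[:-1], cols[1:]):
        speakers = sum(prev)
        if speakers == 0:                    # prior silence
            name, hit = "VI", nxt[k]
        elif speakers == 1 and prev[k]:      # k alone was speaking
            name, hit = "VC", nxt[k]
        elif speakers == 1 and prev[j]:      # j alone was speaking
            name, hit = "OI", nxt[k] and nxt[j]
        elif prev[k] and prev[j]:            # k and j were in overlap
            name, hit = "OC", nxt[k] and nxt[j]
        else:
            continue
        counts[name][0] += 1 if hit else 0
        counts[name][1] += 1
    out = {"V": sum(chronogram[k]) / len(cols)}
    for name, (hits, trials) in counts.items():
        out[name] = hits / trials if trials else None  # None: context never seen
    return out

# Toy two-participant chronogram.
Q = [[0, 1, 1, 1, 1, 0, 0, 0],
     [0, 0, 0, 1, 1, 1, 0, 0]]
print(features(Q, k=0, j=1))
```

Features for a participant are then collected across all interlocutors j and passed to the classifier.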

SLIDE 65

What Participant Classes Can We Identify?

ICSI Meeting Corpus
naturally occurring meetings
participants self-reported as one of three:
  professor (Prof)
  possessing PhD (PhD)
  graduate students (Stud)
  → organizational seniority
67 meetings of one of three types:
  professor-student discussions (Bed)
  annotation discussions (Bmr)
  research discussions (Bro)
presumably, people behave differently in different settings

SLIDE 66

Seniority Classification Accuracy

[Bar chart. Conditions: CONDITIONED ON INFERRED MEETING TYPE; CONDITIONED ON TRUE MEETING TYPE; GLOBAL MODELS; GUESSING PRIORS. Accuracy values (pairing with conditions not preserved): 73, 67, 61, 45]

1st-best feature type: continuation of overlap
2nd-best feature type: initiation of overlap
3rd-best feature type: total speaking time proportion

SLIDE 67

Seniority Level Feature Distributions

[Figure: distributions of feature fV and feature fOC by seniority level, with classes (GRAD,GRAD), (GRAD,PHD), (GRAD,PROF), (PHD,GRAD), (PHD,PHD), (PHD,PROF), (PROF,GRAD), (PROF,PHD)]

SLIDE 68

ROLE CLASSIFICATION

SLIDE 69

Assigned Role in Meetings

Can we detect a role given to a participant, independent of their diffuse status characteristics?
AMI Meeting Corpus
always K = 4 participants
always 4 roles
  project manager (PM)
  marketing expert (ME)
  user interface designer (UI)
  industrial designer (ID)
classification paradigm identical to seniority classification, except that roles are mutually exclusive
classification: 53% accuracy (guessing priors: 25%)
detection of PM: 75% accuracy

SLIDE 70

Feature Distributions for Finding Project Managers

[Figure: distributions of feature fVI and feature fOI, with classes (¬L,¬L), (¬L,L), (L,¬L)]

SLIDE 71

Participant Characterization Summary

aspects of chronogram patterns correlated with characterizations of individual participants
  diffuse characteristics, e.g. seniority
  (temporarily) assigned roles
correlation sufficiently strong to allow for inference of participant type
first baselines for both text-independent tasks

SLIDE 72

Conclusions

1 the chronogram is a deceptively sparse representation
2 appears to contain very rich information
    particularly that information which is not explicitly stated
3 it makes it easy to consider participants’ joint behavior
    vocal behavior readily synchronizable across participants
4 chronograms are amenable to various modeling alternatives, leading to successful inference
    1 time- and participant-independent: detection of vocal activity
    2 time-dependent: recognition of intent
    3 participant-dependent: characterization of participants

SLIDE 73

Contributions

1 explicit framework and techniques for modeling chronograms
    in a variety of ways, depending on application
    shown to corroborate many findings in the social sciences
2 a text-independent conversation understanding system
    allowing inference of many aspects of conversation
    without ever needing to recognize a word
3 first-ever text-independent baselines for several tasks
    detection of all laughter
    segmentation and classification of dialog acts
    detection of attempts to amuse
    classification of (tacit) participant seniority
    classification of assigned participant role

SLIDE 74

Potential Future Impact

1 it is now possible to automatically compare conversations
    across genres
    across cultures
    across languages
2 it is now possible to perform large-scale, automated validation of the qualitative findings of
    conversation analysis
    social psychology
    anthropology
    and others ...
3 merging conversational content with conversation form is promising
4 many of the presented systems are amenable to immediate improvement

SLIDE 75

THANK YOU.

Special thanks to: Anton Batliner, Alan Black, Susi Burger, Jaime Carbonell, Jens Edlund, Christian Fügen, Mattias Heldner, Qin Jin, Rob Malkin, Florian Metze, Mari Ostendorf, Matthias Paulik, Tanja Schultz, Liz Shriberg, Richard Stern, Ashish Venugopal, Stephan Vogel, Alex Waibel & Mattias Wölfel.