Predicting, Detecting & Explaining the Occurrence of Vocal - PowerPoint PPT Presentation

Prolegomena | Acoustic Detection | Intent Recognition | Participant Characterization | Summary
Predicting, Detecting & Explaining the Occurrence of Vocal Activity in Multi-Party Conversation
Kornel Laskowski
PhD Defense
Committee: Richard Stern (chair), Anton Batliner (FAU), Alan Black, Alex Waibel
K. Laskowski, Vocal Interaction in Multi-Party Conversation

A Multi-Party Conversation
  a social event of duration T
  of K > 2 participants
  the predominant activity is talk
What shapes participants' deployment of talk?

The Vocal Interaction Chronogram Q (Chapple, 1940; Dabbs & Ruback, 1987)
  [Figure: chronogram; axes are participant index k and time t; each cell is either Speaking or notSpeaking]
  elides content ("what?")
  expresses form, via evolving local context
    chronemics ("when?")
    attribution ("who?")

Modeling Chronograms
Given a chronogram Q, we want the probability P(Q). What does this mean?
Contrastive speech exchange systems (Sacks et al., 1974):
  conversation
  formal debate
  lecture
  ritual

Why Model Chronograms?
1. P(Q) can represent a time-independent and participant-independent prior for speech activity detection
   much as a language model yields a prior for speech recognition
2. P(Q | G) can yield a similar prior for conversational genre G
   allows for inference of "what genre G is this conversation?"
3. P(Q | t) yields a time-dependent prior
   allows for inference of "what is happening at instant t?"
4. P(Q | k) yields a participant-dependent prior
   allows for inference of "what is the role of participant k?"
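The genre inference in item 2 can be written out with Bayes' rule. This is only a sketch: the genre prior $P(G)$ is an assumption introduced here for illustration and does not appear on the slide.

```latex
\hat{G} \;=\; \arg\max_{G} \, P(G \mid Q)
        \;=\; \arg\max_{G} \, \frac{P(Q \mid G)\, P(G)}{\sum_{G'} P(Q \mid G')\, P(G')}
        \;=\; \arg\max_{G} \, P(Q \mid G)\, P(G)
```

The denominator does not depend on $G$, so it can be dropped from the maximization. With a uniform $P(G)$ this reduces to picking the genre whose chronogram model assigns the observed $Q$ the highest probability.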

Past Work on Modeling Chronograms
  interaction chronography (Chapple, 1939; Chapple, 1949)
  modeling in dialogue: K = 2
    telecommunications (Norwine & Murphy, 1938; Brady, 1969)
    sociolinguistics (Jaffe & Feldstein, 1970)
    psycholinguistics (Dabbs & Ruback, 1987)
    dialogue systems (cf. Raux, 2008)
  modeling in multi-party settings: K > 2
    qualitative: Conversation Analysis (Sacks et al., 1974)
    quantitative: THIS THESIS

How to Model Multi-Party Chronograms?
That depends very much on the task.
1. acoustic detection
   1. speech
   2. laughter
2. intent recognition
   1. dialog acts
   2. attempts to amuse
3. participant characterization
   1. diffuse social status
   2. assigned role

SPEECH DETECTION

The Goal of Speech Activity Detection (SAD)
Given multichannel nearfield audio X:
  [Figure: audio for rows 1 to 7; time axis from 1260 to 1285]
Produce multi-participant speech chronogram Q:
  [Figure: chronogram for rows 1 to 7; time axis from 1260 to 1285]

Prior Research in SAD in Multi-Party Meetings
  nearfield, HMM-based speech activity detection (Acero, 1994)
  in meetings: ASR segmentation (Pfau, Ellis & Stolcke, 2001)
    crosstalk is the most serious problem
  in meetings: multiple microphone states
    3 states (Huang & Harper, 2005)
    4 states (Wrigley et al., 2005)
  in meetings: crosstalk suppression
    energy normalization (Boakye & Stolcke, 2006)
    echo cancellation (Dines et al., 2006)
  all of this work decodes participants one at a time

The Standard SAD Baseline
  hidden Markov model decoder
  topology enforces minimum duration constraints
    16 ms frame step
    500 ms for speech
    500 ms for non-speech
  acoustic model
    32 ms frame size
    log-energy, MFCCs, Δs, ΔΔs (39)
    Gaussian mixture model (GMM) emissions
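One common way to enforce a minimum duration in an HMM is to unroll each class into a left-to-right chain of states, with a self-loop and an exit only on the last state. The sketch below assumes that construction and assumes the frame count is rounded up; the slide gives only the 16 ms step and the 500 ms constraints, so the function names and the exact topology are illustrative.

```python
import math

def min_duration_frames(min_ms, step_ms):
    """Number of frames needed to cover min_ms at a step of step_ms (rounded up)."""
    return math.ceil(min_ms / step_ms)

def min_duration_topology(n_speech, n_nonspeech):
    """Return the set of allowed (from_state, to_state) transitions.

    States 0 .. n_speech-1 form the speech chain; the remaining states form
    the non-speech chain. Only the last state of a chain may repeat or exit.
    """
    allowed = set()
    chains = ((0, n_speech, n_speech), (n_speech, n_nonspeech, 0))
    for start, length, other_start in chains:
        last = start + length - 1
        for i in range(start, last):
            allowed.add((i, i + 1))        # forced advance through the chain
        allowed.add((last, last))          # stay in the class once the minimum is met
        allowed.add((last, other_start))   # or switch to the other class
    return allowed

frames = min_duration_frames(500, 16)      # 500 ms at a 16 ms frame step
topology = min_duration_topology(frames, frames)
```

Under these assumptions each class needs 32 chain states, so a single participant has 64 states in total.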

The Crosstalk Problem
  [Figure: nearfield channels for participants mn036, me045, me010, me003, mn015, fe004, me012; time axis from 1275 to 1280]

How Might Chronogram Modeling Help?
Detection is the inference of the chronogram:
  $P(Q \mid X) \propto P(X \mid Q) \cdot P(Q)$
1. treat Q as a vector-valued process:
  $\ldots,\ q_{t-1},\ q_t,\ q_{t+1},\ \ldots$, where each $q_t$ is the column of Q at time t, with one Speaking/notSpeaking entry per participant
2. assume the process is 1st-order Markovian:
  $P(Q) = \prod_{t=1}^{T} P(q_t \mid q_{t-1})$
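A minimal sketch of the first-order Markov prior over joint states, assuming a chronogram stored as K rows of 0/1 values and transition probabilities estimated by relative frequency. The function names, the probability floor for unseen transitions, and the omission of the initial-state term are choices made here for illustration.

```python
import math
from collections import defaultdict

def columns(Q):
    """Turn K rows of length T into T joint-state tuples q_t."""
    return list(zip(*Q))

def train_transitions(Q):
    """Relative-frequency estimate of P(q_t | q_{t-1}) from one chronogram."""
    counts = defaultdict(lambda: defaultdict(int))
    cols = columns(Q)
    for prev, cur in zip(cols, cols[1:]):
        counts[prev][cur] += 1
    return {
        prev: {cur: n / sum(nxt.values()) for cur, n in nxt.items()}
        for prev, nxt in counts.items()
    }

def log_prob(Q, trans, floor=1e-12):
    """Sum of log P(q_t | q_{t-1}) over t = 2..T; unseen transitions get `floor`."""
    cols = columns(Q)
    return sum(
        math.log(trans.get(prev, {}).get(cur, floor))
        for prev, cur in zip(cols, cols[1:])
    )

# Two participants, four frames: participant 0 speaks, then participant 1.
Q = [[1, 1, 0, 0],
     [0, 0, 1, 1]]
trans = train_transitions(Q)
score = log_prob(Q, trans)
```

Keying the table on whole columns is what makes the model joint over participants; it is also what makes the table grow quickly with K.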

The Multi-Participant State Space
  if the topology for a single participant is $\mathcal{T}$, of N states,
  then the K-participant topology is the Cartesian product $\mathcal{T} \times \mathcal{T} \times \cdots \times \mathcal{T}$
  $q \in \mathcal{T} \times \mathcal{T} \times \cdots \times \mathcal{T}$
  the number of multi-participant states is $N^K$
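The growth of the joint state space can be checked by direct enumeration. This sketch uses a two-state single-participant topology purely as an example; the state labels are placeholders.

```python
from itertools import product

def joint_states(single_states, K):
    """All K-participant states: the K-fold Cartesian product of one topology."""
    return list(product(single_states, repeat=K))

single = ("Speaking", "notSpeaking")   # N = 2 states, for illustration only
states = joint_states(single, K=3)
n_states = len(states)                 # equals N ** K
```

Enumeration is only practical for small N and K, which motivates replacing the full joint state with a lower-dimensional summary.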

Joint Transition Model: Degree of Overlap
Want the transition from $q_{t-1}$ to $q_t$ to be:
  invariant to participant index rotation
  independent of number K of participants
3. replace q with $\|q\|$, the number of speaking participants
  [Examples: two joint-state transitions, summarized as $0 \to 1$ and $2 \to 1$]
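A short sketch of the degree-of-overlap mapping, assuming joint states are tuples of 0/1 values with 1 meaning Speaking. The function names are illustrative.

```python
def degree_of_overlap(q):
    """Number of participants speaking in joint state q."""
    return sum(q)

def do_transition(prev, cur):
    """Summarize a joint-state transition by its degrees of overlap."""
    return degree_of_overlap(prev), degree_of_overlap(cur)

def rotate(q, shift):
    """Rotate participant indices by `shift` positions."""
    return q[shift:] + q[:shift]

a = do_transition((0, 0, 0), (0, 1, 0))                         # nobody -> one speaker
b = do_transition((1, 1, 0), (0, 1, 0))                         # two speakers -> one
c = do_transition(rotate((1, 1, 0), 1), rotate((0, 1, 0), 1))   # same as b, rotated
```

Because only counts are kept, the summary is unchanged by any relabeling of participants, and its range does not depend on which participants are present.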

Joint Transition Model: Extended Degree of Overlap
Unfortunately, distinct joint-state transitions can collapse to the same degree-of-overlap transition:
  [Examples: two different transitions, both summarized as $1 \to 1$]
4. augment "to"-state with number of same participants speaking at both t and t − 1
  [Examples: the same two transitions, now summarized as $1 \to [1, 1]$ and $1 \to [0, 1]$]
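The augmented summary can be sketched as follows, assuming joint states are tuples of 0/1 values. The ordering of the pair, (number of same speakers, number of speakers), is a reading of the slide's "[1, 1]" and "[0, 1]" examples rather than something the slide states outright.

```python
def edo_transition(prev, cur):
    """Extended degree-of-overlap summary of a transition prev -> cur.

    Returns (n_prev, (n_same, n_cur)):
      n_prev  participants speaking at t-1
      n_same  participants speaking at both t-1 and t
      n_cur   participants speaking at t
    """
    n_prev = sum(prev)
    n_same = sum(1 for p, c in zip(prev, cur) if p and c)
    return n_prev, (n_same, sum(cur))

keeps_floor = edo_transition((1, 0, 0), (1, 0, 0))   # same speaker continues
swaps_floor = edo_transition((1, 0, 0), (0, 1, 0))   # a different speaker takes over
```

Both transitions would read as 1 to 1 under plain degree of overlap; the extra count separates a speaker continuing from a speaker change.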

Joint Acoustic Model
Can assume multi-channel acoustics to be independent,
  $P(X \mid Q) = \prod_{k=1}^{K} P(X[k] \mid Q[k])$
but crosstalk proves that they are not.
The covariance matrix Σ of log-energy
  has size K × K
  off-diagonal entries are non-zero
  off-diagonal entries generalize poorly
    depend on room acoustics
    depend on inter-participant proximity
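The independence assumption amounts to summing per-channel log-likelihoods. The sketch below uses a single Gaussian per state on one log-energy value per channel; the means and variances are made-up numbers, and the slide's baseline uses GMMs over a 39-dimensional feature vector rather than this simplification.

```python
import math

def gauss_logpdf(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def independent_loglik(x, q, params):
    """log P(X | Q) = sum_k log P(X[k] | Q[k]) under channel independence.

    x       per-channel log-energy values for one frame
    q       per-channel state (0 = notSpeaking, 1 = Speaking)
    params  {state: (mean, var)}, hypothetical values for illustration
    """
    return sum(gauss_logpdf(xk, *params[qk]) for xk, qk in zip(x, q))

params = {0: (0.0, 1.0), 1: (5.0, 1.0)}   # assumed, not from the slides
x = [5.0, 0.0]                            # channel 0 loud, channel 1 quiet
ll_match = independent_loglik(x, (1, 0), params)
ll_swap = independent_loglik(x, (0, 1), params)
```

This factorization ignores exactly the off-diagonal covariance terms that crosstalk makes non-zero: a loud talker raises log-energy on neighbouring channels as well as their own.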

Two-Pass Decoding
A solution to this problem is to:
1. obtain models on test conversation
2. high-precision first pass (Laskowski & Schultz, 2004; 2006)
   Non-Target-Normalization of Cross-Correlation Maxima
     compute cross-correlation maxima for all channel pairs
     can infer relative geometry of all participants
3. train full-covariance log-energy model (from scratch)
4. interpolate with supervised single-participant models
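The cross-correlation maximum for one channel pair can be sketched as below. This shows only the raw maximum and its lag on toy signals; the non-target normalization named on the slide is not reproduced here, and the function name and lag window are assumptions.

```python
def xcorr_max(a, b, max_lag):
    """Maximum of the cross-correlation of a and b over lags in [-max_lag, max_lag].

    Returns (value, lag), where a positive lag means b is delayed relative to a.
    """
    best_value, best_lag = float("-inf"), 0
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for i, sample in enumerate(a):
            j = i + lag
            if 0 <= j < len(b):
                total += sample * b[j]
        if total > best_value:
            best_value, best_lag = total, lag
    return best_value, best_lag

# Toy example: channel b carries channel a's signal, attenuated and 2 samples late.
a = [0, 0, 1, 2, 1, 0, 0, 0]
b = [0, 0, 0, 0, 0.5, 1.0, 0.5, 0]
peak, lag = xcorr_max(a, b, max_lag=3)
```

The sign of the lag indicates which microphone the talker is closer to; this is the sense in which pairwise maxima carry information about relative geometry.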


Classification error on EvalSet
  [Bar chart: classification error; bar values 8.68, 8.13, 4.67, 4.06, 3.95, 3.96, 3.80, 3.73, 3.50, 3.49; annotated reductions 4.62 %abs (53.2 %rel), 0.47 %abs (11.9 %rel), 6.6 %rel; axis categories 16ms, 100ms, +JTM, +JAM, +DUR]
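The "%abs" and "%rel" annotations are absolute and relative error reductions. The sketch below shows the arithmetic; pairing 8.68 with 4.06 is an inference from the numbers (it reproduces the 4.62 %abs and 53.2 %rel annotation), not a pairing the slide states explicitly.

```python
def error_reduction(baseline, improved):
    """Return (absolute reduction, relative reduction in percent)."""
    absolute = baseline - improved
    relative = 100.0 * absolute / baseline
    return absolute, relative

abs_red, rel_red = error_reduction(8.68, 4.06)
summary = f"{abs_red:.2f} %abs, {rel_red:.1f} %rel"
```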
