
Predicting, Detecting & Explaining the Occurrence of Vocal - PowerPoint PPT Presentation



  1. Predicting, Detecting & Explaining the Occurrence of Vocal Activity in Multi-Party Conversation. Kornel Laskowski, PhD Defense. Committee: Richard Stern (chair), Anton Batliner (FAU), Alan Black, Alex Waibel. Sections: Prolegomena, Acoustic Detection, Intent Recognition, Participant Characterization, Summary. Footer: K. Laskowski, Vocal Interaction in Multi-Party Conversation.

  2. A Multi-Party Conversation: a social event of duration T among K > 2 participants, in which the predominant activity is talk. What shapes participants' deployment of talk?

  3. The Vocal Interaction Chronogram Q (Chapple, 1940; Dabbs & Ruback, 1987): a grid with participant index k on one axis and time t on the other, in which each cell is either Speaking or notSpeaking. The chronogram elides content ("what?") and expresses form, via evolving local context: chronemics ("when?") and attribution ("who?").

  4. Modeling Chronograms: Given a chronogram Q, we want the probability $P(Q)$. What does this mean? Contrastive speech exchange systems (Sacks et al, 1974): conversation, formal debate, lecture, ritual.

  5. Why Model Chronograms? (1) $P(Q)$ can represent a time-independent and participant-independent prior for speech activity detection, just as a language model yields a prior for speech recognition. (2) $P(Q \mid G)$ can yield a similar prior for conversational genre G; it allows for inference of "what genre G is this conversation?" (3) $P(Q \mid t)$ yields a time-dependent prior; it allows for inference of "what is happening at instant t?" (4) $P(Q \mid k)$ yields a participant-dependent prior; it allows for inference of "what is the role of participant k?"

  6. Past Work on Modeling Chronograms: interaction chronography (Chapple, 1939; Chapple, 1949). Modeling in dialogue, K = 2: telecommunications (Norwine & Murphy, 1938; Brady, 1969); sociolinguistics (Jaffe & Feldstein, 1970); psycholinguistics (Dabbs & Ruback, 1987); dialogue systems (cf. Raux, 2008). Modeling in multi-party settings, K > 2: qualitative: Conversation Analysis (Sacks et al, 1974); quantitative: THIS THESIS.

  7. How to Model Multi-Party Chronograms? That depends very much on the task. (1) Acoustic detection: speech; laughter. (2) Intent recognition: dialog acts; attempts to amuse. (3) Participant characterization: diffuse social status; assigned role.

  8. SPEECH DETECTION

  9. The Goal of Speech Activity Detection (SAD): Given multichannel nearfield audio X [figure: waveforms for channels 1 to 7, time axis from 1260 to 1285], produce the multi-participant speech chronogram Q [figure: chronogram for participants 1 to 7 over the same time axis].

  10. Prior Research in SAD in Multi-Party Meetings: nearfield, HMM-based speech activity detection (Acero, 1994). In meetings, ASR segmentation (Pfau, Ellis & Stolcke, 2001); crosstalk is the most serious problem. In meetings, multiple microphone states: 3 states (Huang & Harper, 2005); 4 states (Wrigley et al, 2005). In meetings, crosstalk suppression: energy normalization (Boakye & Stolcke, 2006); echo cancellation (Dines et al, 2006). All of this work decodes participants one at a time.

  11. The Standard SAD Baseline: Decoder: hidden Markov model; 16 ms frame step; topology-enforced minimum duration constraints of 500 ms for speech and 500 ms for non-speech. Acoustic model: 32 ms frame size; log-energy, MFCCs, ∆s, ∆∆s (39); Gaussian mixture model (GMM) emissions.
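
A minimum duration constraint in an HMM is commonly enforced by chaining states so that a segment cannot end before every state in the chain has been visited. The sketch below assumes that construction and rounds up to whole frames; the slide does not say how many states the baseline actually uses, and the function names are hypothetical.

```python
import math

def min_duration_frames(min_duration_ms, frame_step_ms):
    """Smallest whole number of frames that spans the minimum duration."""
    return math.ceil(min_duration_ms / frame_step_ms)

def left_to_right_chain(n_states):
    """Arcs (i, j) of a chain that must be traversed in full.

    Every state advances to its successor; only the last state may
    self-loop, so any path through the chain lasts at least n_states frames.
    """
    arcs = [(i, i + 1) for i in range(n_states - 1)]
    arcs.append((n_states - 1, n_states - 1))
    return arcs

# 500 ms at a 16 ms frame step (values from the slide).
n = min_duration_frames(500, 16)
speech_chain = left_to_right_chain(n)
```

Under these assumptions each of the speech and non-speech chains has 32 states, since 500 / 16 = 31.25 frames rounds up to 32.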

  12. The Crosstalk Problem [figure: nearfield channels for participants mn036, me045, me010, me003, mn015, fe004 and me012, time axis from 1275 to 1280].

  13. How Might Chronogram Modeling Help? Detection is the inference of the chronogram: $P(Q \mid X) \propto P(X \mid Q) \cdot P(Q)$. (1) Treat Q as a vector-valued process $\cdots, \mathbf{q}_{t-1}, \mathbf{q}_{t}, \mathbf{q}_{t+1}, \cdots$, where each $\mathbf{q}_t$ is the column of Speaking/notSpeaking states of all participants at time t. (2) Assume the process is 1st-order Markovian: $P(Q) = \prod_{t=1}^{T} P(\mathbf{q}_t \mid \mathbf{q}_{t-1})$.
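
The first-order Markov factorization of P(Q) can be sketched as a sum of log transition probabilities over consecutive chronogram columns. The function name, the dictionary-based transition table, and the explicit initial distribution (standing in for the transition out of the state before t = 1) are assumptions made for illustration, and the toy probabilities are invented.

```python
import math

def chronogram_log_prob(frames, trans, init):
    """Log P(Q) for a chronogram under a first-order Markov model.

    frames: list of tuples of 0/1, one tuple per time step, one entry
            per participant (1 = Speaking, 0 = notSpeaking).
    trans:  dict mapping (previous_tuple, current_tuple) to a probability.
    init:   dict mapping the first tuple to its probability.
    """
    logp = math.log(init[frames[0]])
    for prev, cur in zip(frames, frames[1:]):
        logp += math.log(trans[(prev, cur)])
    return logp

# Toy single-participant model with made-up probabilities.
init = {(0,): 0.5, (1,): 0.5}
trans = {
    ((0,), (0,)): 0.9, ((0,), (1,)): 0.1,
    ((1,), (0,)): 0.2, ((1,), (1,)): 0.8,
}
lp = chronogram_log_prob([(0,), (0,), (1,)], trans, init)
```

Working in the log domain avoids underflow when T is large, which is the usual reason to accumulate sums rather than multiply probabilities directly.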

  14. The Multi-Participant State Space: if the topology for a single participant is $\mathcal{T}$, of N states, then the K-participant topology is the Cartesian product $\mathbf{q} \in \mathcal{T} \times \mathcal{T} \times \cdots \times \mathcal{T}$, and the number of multi-participant states is $N^K$.
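
A short sketch of the Cartesian-product state space, using `itertools.product` from the Python standard library. The function name `joint_states` and the two-state single-participant topology are illustrative assumptions.

```python
from itertools import product

def joint_states(single_states, k):
    """All K-participant states: the K-fold Cartesian product of single_states."""
    return list(product(single_states, repeat=k))

# Two single-participant states and K = 3 participants.
states = joint_states(("notSpeaking", "Speaking"), 3)
n_states = len(states)
```

The count grows as N to the power K, which is why the slides that follow look for a more compact description of transitions.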

  15. Joint Transition Model: Degree of Overlap. Want the transition from $\mathbf{q}_{t-1}$ to $\mathbf{q}_t$ to be invariant to participant index rotation and independent of the number K of participants. (3) Replace $\mathbf{q}$ with $\|\mathbf{q}\|$, the number of speaking participants. [Figure: two example transitions between joint states, labeled 0 → 1 and 2 → 1.]

  16. Joint Transition Model: Extended Degree of Overlap. Unfortunately, distinct transitions can receive the same label. [Figure: two different transitions between joint states, both labeled 1 → 1.] (4) Augment the "to"-state with the number of same participants speaking at both t and t − 1. [Figure: the same two transitions, now labeled 1 → [1, 1] and 1 → [0, 1].]
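
The degree-of-overlap labels can be sketched as follows. The bracket order [number of same participants, number speaking at t] is inferred from the 1 → [0, 1] example, since the number of continuing speakers cannot exceed the number speaking at t; the function names are hypothetical.

```python
def degree_of_overlap(q):
    """Number of participants speaking in joint state q (a tuple of 0/1)."""
    return sum(q)

def edo_transition(q_prev, q_cur):
    """Extended degree-of-overlap label for the transition q_prev -> q_cur.

    Returns (n_prev, (n_same, n_cur)), where n_same counts participants
    speaking at both t - 1 and t.
    """
    n_same = sum(1 for a, b in zip(q_prev, q_cur) if a and b)
    return degree_of_overlap(q_prev), (n_same, degree_of_overlap(q_cur))

# The same speaker continues versus a different speaker takes over.
keeps_floor = edo_transition((1, 0, 0), (1, 0, 0))
changes_speaker = edo_transition((1, 0, 0), (0, 1, 0))
```

Both transitions have plain degree-of-overlap label 1 → 1, but the extended labels differ, and neither label depends on which participant index is involved.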

  17. Joint Acoustic Model: One can assume multi-channel acoustics to be independent, $P(X \mid Q) = \prod_{k=1}^{K} P(X[k] \mid Q[k])$, but crosstalk proves that they are not. The covariance matrix $\Sigma$ of log-energy has size K × K; its off-diagonal entries are non-zero; and those off-diagonal entries generalize poorly, because they depend on room acoustics and on inter-participant proximity.

  18. Two-Pass Decoding: A solution to this problem is to (1) obtain models on the test conversation; (2) run a high-precision first pass (Laskowski & Schultz, 2004; 2006), Non-Target-Normalization of Cross-Correlation Maxima: compute cross-correlation maxima for all channel pairs, from which the relative geometry of all participants can be inferred; (3) train a full-covariance log-energy model (from scratch); (4) interpolate with supervised single-participant models.
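
The step "compute cross-correlation maxima for all channel pairs" can be illustrated with a plain time-domain cross-correlation. This is only a sketch of that one step: the normalization used in the cited first pass is not described on the slide and is not implemented here, and the function names and the `max_lag` parameter are assumptions.

```python
from itertools import combinations

def xcorr_max(x, y, max_lag):
    """Return (best_lag, peak) of the raw cross-correlation sum_n x[n] * y[n + lag]."""
    best_lag, peak = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for n, xn in enumerate(x):
            m = n + lag
            if 0 <= m < len(y):
                total += xn * y[m]
        if total > peak:
            best_lag, peak = lag, total
    return best_lag, peak

def all_pair_maxima(channels, max_lag):
    """Cross-correlation maximum for every unordered pair of channels."""
    return {
        (i, j): xcorr_max(channels[i], channels[j], max_lag)
        for i, j in combinations(range(len(channels)), 2)
    }

# Channel 1 is channel 0 delayed by two samples.
x = [0.0, 1.0, 0.0, 0.0, 0.0]
y = [0.0, 0.0, 0.0, 1.0, 0.0]
lag, peak = xcorr_max(x, y, max_lag=3)
```

A positive lag here means the second channel receives the signal later than the first, which is the kind of cue from which relative geometry might be inferred.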

  19. Classification error on EvalSet [figure: bar chart over the configurations 16ms, 100ms, +JTM, +JAM, +DUR. Error values shown: 8.68, 8.13, 4.67, 4.06, 3.95, 3.96, 3.80, 3.73, 3.50, 3.49. Annotated reductions: 4.62 %abs (53.2 %rel); 0.47 %abs (11.9 %rel); 6.6 %rel.]
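
The %abs and %rel annotations on this chart follow the usual definitions of absolute and relative error reduction. The sketch below checks that arithmetic; pairing 8.68 with 4.06, and 3.96 with 3.49, is an inference from the numbers working out, not something the slide text states, and the function name is hypothetical.

```python
def error_reduction(baseline, improved):
    """Return (absolute reduction, relative reduction in percent)."""
    absolute = baseline - improved
    return absolute, 100.0 * absolute / baseline

abs_1, rel_1 = error_reduction(8.68, 4.06)
abs_2, rel_2 = error_reduction(3.96, 3.49)
```

Rounded to the precision shown on the slide, these pairs reproduce the 4.62 %abs / 53.2 %rel and 0.47 %abs / 11.9 %rel annotations.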
