Speaker and Emotion Recognition of TV-Series Data Using Multimodal and Multitask Deep Learning


  1. Speaker and Emotion Recognition of TV-Series Data Using Multimodal and Multitask Deep Learning. Sashi Novitasari 1, Quoc Truong Do 1, Sakriani Sakti 1,3, Dessi Lestari 2, Satoshi Nakamura 1,3. 1 Graduate School of Information Science, Nara Institute of Science and Technology; 2 Department of Informatics, Bandung Institute of Technology; 3 RIKEN AIP. 1 {sashi.novitasari.si3, do.truong.di3, ssakti, s-nakamura}@is.naist.jp 2 {dessipuji}@informatika.org

  2. Outline: 1. Introduction 2. Data 3. Model Architectures 4. Features 5. Experiment 6. Conclusion

  3. I. Introduction
  ● Real-life communication involves both linguistic and paralinguistic aspects
  ● Multimodal and multitask recognition of non-verbal aspects of speech
  ● Recognition of an utterance's speaker and emotion from emotion-rich data
  ● Previous works: multimodal or multitask emotion-speaker recognition, but not integrated (Tang et al., 2016; Tian et al., 2016; Vallet et al., 2013)

  4. II. Data
  ● TV-series data → expressive conversation
  ○ Video: facial features
  ○ Audio: acoustic features
  ○ Subtitle: lexical features
  ● Language: English
  ● Utterance-level annotation
  ○ Speaker: 57 names
  ○ Emotion - valence: 3 classes (negative - neutral - positive)
  ○ Emotion - arousal: 3 classes (negative - neutral - positive)

  5. III. Model Architectures
  ● Multilayer perceptron models (5 layers)
  ● Multimodal classification
  ● Multitask classification
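As a concrete illustration, the 5-layer perceptron classifier named above amounts to a plain feed-forward pass. This is a minimal NumPy sketch, not the authors' implementation: the 256-unit hidden sizes and random weights are assumptions for illustration (1582 is the INTERSPEECH 2010 acoustic feature-set dimension, 57 the number of speaker classes from the data slide).

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mlp_forward(x, weights, biases):
    """Forward pass through a 5-layer MLP: 4 hidden ReLU layers + softmax output."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)
    return softmax(h @ weights[-1] + biases[-1])

# Hypothetical sizes: 1582-dim acoustic input, 256-unit hidden layers, 57 speakers
sizes = [1582, 256, 256, 256, 256, 57]
weights = [rng.normal(0.0, 0.01, (a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

probs = mlp_forward(rng.normal(size=(1, 1582)), weights, biases)
print(probs.shape)  # (1, 57): a distribution over the 57 speaker classes
```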

  6. III. Model Architectures - Multimodal Classification
  Two evaluated approaches:
  a. Feature concatenation
  b. Hierarchical feature fusion
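The difference between the two fusion approaches can be sketched as follows. This is a schematic NumPy forward pass under assumed dimensions (136-dim facial and 64-unit per-modality encoders are illustrative; 1582 is the INTERSPEECH 2010 acoustic set, 300 the Word2Vec dimension), with randomly initialized layers standing in for trained ones.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def dense(x, d_out, seed):
    # Randomly initialized layer, standing in for a trained one
    W = np.random.default_rng(seed).normal(0.0, 0.01, (x.shape[-1], d_out))
    return relu(x @ W)

rng = np.random.default_rng(0)
acoustic = rng.normal(size=(1, 1582))  # INTERSPEECH 2010 feature-set size
facial   = rng.normal(size=(1, 136))   # assumed facial-feature dimension
lexical  = rng.normal(size=(1, 300))   # Word2Vec dimension

# (a) Feature concatenation: join raw modality features, then one shared network
h_concat = dense(np.concatenate([acoustic, facial, lexical], axis=-1), 256, seed=1)

# (b) Hierarchical fusion: encode each modality separately, then fuse the encodings
h_a = dense(acoustic, 64, seed=2)
h_f = dense(facial, 64, seed=3)
h_l = dense(lexical, 64, seed=4)
h_fused = dense(np.concatenate([h_a, h_f, h_l], axis=-1), 256, seed=5)

print(h_concat.shape, h_fused.shape)  # (1, 256) (1, 256)
```

Concatenation lets the network see raw cross-modal correlations, while hierarchical fusion lets each modality learn its own representation before merging.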

  7. III. Model Architectures - Multitask Classification
  Performs classification on several tasks at once.
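Structurally, multitask classification here means one shared representation feeding a separate output head per task. A minimal sketch, assuming a 256-dim shared layer (the head layout follows the three annotation tasks on the data slide; sizes and weights are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def head(h, n_classes, seed):
    # Task-specific output layer on top of the shared representation
    W = np.random.default_rng(seed).normal(0.0, 0.01, (h.shape[-1], n_classes))
    return softmax(h @ W)

# Output of the shared hidden layers for one utterance (size assumed)
h_shared = np.random.default_rng(0).normal(size=(1, 256))

speaker = head(h_shared, 57, seed=1)  # 57 speaker classes
valence = head(h_shared, 3, seed=2)   # negative / neutral / positive
arousal = head(h_shared, 3, seed=3)   # negative / neutral / positive
print(speaker.shape, valence.shape, arousal.shape)  # (1, 57) (1, 3) (1, 3)
```

Training would sum the per-task losses, so the shared layers learn features useful for speaker identity and both emotion dimensions at once.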

  8. IV. Features
  1. Acoustic (main)
  ○ INTERSPEECH 2010 feature configuration
  ○ openSMILE toolkit (Eyben et al., 2010)
  2. Lexical
  ○ Average of word vectors
  ○ Pre-trained Google Word2Vec (Mikolov et al., 2013)
  3. Facial
  ○ Facial contours and angles
  ○ OpenFace toolkit (Baltrusaitis et al., 2016)
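The lexical feature (average of a subtitle line's word vectors) is simple enough to sketch directly. Here a tiny random table stands in for the pre-trained 300-dimensional Google Word2Vec embeddings; the zero-vector fallback for out-of-vocabulary lines is an assumption.

```python
import numpy as np

# Toy embedding table standing in for pre-trained Google Word2Vec (300-dim)
rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=300) for w in ["i", "love", "this", "show"]}

def lexical_feature(subtitle, embeddings, dim=300):
    """Average the word vectors of a subtitle line; zeros if no word is known."""
    vecs = [embeddings[w] for w in subtitle.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

feat = lexical_feature("I love this show", vocab)
print(feat.shape)  # (300,)
```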

  9. V. Experiment

  10. V. Experiment
  ● Train set: 2460 utterances
  ○ Speaker: 57 speakers (imbalanced)
  ○ Valence: negative 31%, neutral 60%, positive 9%
  ○ Arousal: negative 4%, neutral 75%, positive 21%
  ● Evaluated on 300 utterances
  ○ Speaker: 10 speakers, 30 samples each
  ○ Valence: negative 32%, neutral 57%, positive 11%
  ○ Arousal: negative 1%, neutral 78%, positive 21%
  ● Compared unimodal, multimodal, single-task, and multitask models
  ● Evaluated by F1-score (%) on the evaluation set
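Since the class distributions above are heavily imbalanced, F1-score is a more informative metric than plain accuracy. As a reference, here is a pure-Python sketch of macro-averaged F1 (the choice of macro averaging is an assumption; the slides do not state which variant was used):

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class F1 averaged with equal class weight,
    so minority classes count as much as the majority class."""
    classes = set(y_true) | set(y_pred)
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

y_true = ["neu", "neu", "neg", "pos", "neu", "neg"]
y_pred = ["neu", "neg", "neg", "neu", "neu", "neg"]
print(round(macro_f1(y_true, y_pred), 3))  # 0.489
```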

  11. V. Experiment Result

  12. V. Experiment Result: Speaker
  F1-scores (%) on evaluation set
  Multimodal approaches: U - Unimodal, C - Feature concatenation, H - Hierarchical feature fusion
  Feature types: A - Acoustic, F - Facial, L - Lexical

  13. V. Experiment Result: Emotion
  F1-scores (%) on evaluation set
  Multimodal approaches: U - Unimodal, C - Feature concatenation, H - Hierarchical feature fusion
  Feature types: A - Acoustic, F - Facial, L - Lexical

  14. V. Experiment Result Summary
  Multimodal approaches: U - Unimodal, C - Feature concatenation, H - Hierarchical feature fusion
  Feature types: A - Acoustic, F - Facial, L - Lexical

  15. VI. Conclusion
  ● We built a multimodal and multitask speaker-emotion recognition model using deep learning and TV-series data
  ● The multitask model outperforms the single-task model, especially when recognizing emotion from acoustic features only
  ● The multimodal-multitask model did not yield a significant improvement (larger data might be needed)

  16. Thank You
