SLIDE 1

Fusical: Multimodal Fusion for Video Sentiment

Boyang Tom Jin

Stanford University tomjin@stanford.edu

Leila Abdelrahman

University of Miami lxa215@miami.edu

Cong Kevin Chen

University of California Berkeley kevincong95@berkeley.edu

Amil Khanzada

University of California Berkeley amil@berkeley.edu

ACM International Conference on Multimodal Interaction, October 24-29, 2020

SLIDE 2

Tom Jin | Leila Abdelrahman | Cong Kevin Chen | Amil Khanzada

Introductions

SLIDE 3

Introduction: Understanding Video Sentiment

SLIDE 4

Dataset: EmotiW 2020 Video Group Emotion [1,2]

SLIDE 5

Our Approach

SLIDE 6

Multimodal Hidden Layer Ensembling
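A minimal sketch of the hidden-layer ensembling idea, assuming each per-modality network exposes its penultimate-layer feature vector; the feature sizes and fusion-head widths below are illustrative, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

# Hypothetical fusion head: concatenate hidden (penultimate-layer) features
# from each modality network and classify into the three sentiment classes
# (negative / neutral / positive).
class HiddenLayerFusion(nn.Module):
    def __init__(self, modality_dims, num_classes=3, hidden=64):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(sum(modality_dims), hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, features):
        # features: list of per-modality hidden vectors, each (batch, dim)
        return self.head(torch.cat(features, dim=1))

# Toy usage with made-up feature sizes for scene, pose, audio, face, captions.
dims = [128, 32, 64, 64, 50]
model = HiddenLayerFusion(dims)
batch = [torch.randn(4, d) for d in dims]
logits = model(batch)  # shape (4, 3)
```

Fusing hidden features rather than final class probabilities lets the head exploit richer per-modality representations.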

SLIDE 7

Scene

SLIDE 8

Scene Modality

SLIDE 9

Scene: Activated on People

  • Strong activations on foreground individuals
  • Activations followed individuals frame-to-frame

SLIDE 10

Scene: ResNet-50 Outperforms Inception-v3

  • ResNet-50 activates on foreground people
  • Inception-v3 gets distracted by background lighting

SLIDE 11

Image Captioning

SLIDE 12

Captions Tell Descriptive Nouns

SLIDE 13

Pose

SLIDE 14

Pose Model

SLIDE 15

Pose: Importance of Upper Body Joints

  • Elbows
  • Hands
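A minimal sketch of selecting upper-body joints from a pose estimator's output, assuming COCO keypoint ordering; the helper name and feature layout are hypothetical:

```python
import numpy as np

# Keep only the upper-body joints that mattered most (elbows, wrists/hands),
# indexed by their positions in the COCO 17-keypoint layout.
COCO_UPPER_BODY = {"left_elbow": 7, "right_elbow": 8,
                   "left_wrist": 9, "right_wrist": 10}

def upper_body_features(keypoints):
    """keypoints: (17, 2) array of (x, y) joint coordinates for one person."""
    idx = sorted(COCO_UPPER_BODY.values())
    return keypoints[idx].reshape(-1)  # flat (8,) feature vector

person = np.random.rand(17, 2)  # stand-in for pose-estimator output
feat = upper_body_features(person)
```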

SLIDE 16

Audio and Laughter

SLIDE 17

Audio Model System Diagram

  • CNN-LSTM
  • Time-dependent frames
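A hedged sketch of the CNN-LSTM pattern: a small CNN encodes each spectrogram frame, and an LSTM models the time-dependent frame sequence. Layer sizes here are illustrative, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class AudioCNNLSTM(nn.Module):
    def __init__(self, n_mels=64, num_classes=3):
        super().__init__()
        # Per-frame CNN encoder over the mel-frequency axis
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(8),  # per-frame embedding of length 16*8
        )
        # LSTM over the sequence of frame embeddings
        self.lstm = nn.LSTM(input_size=16 * 8, hidden_size=32, batch_first=True)
        self.fc = nn.Linear(32, num_classes)

    def forward(self, x):
        # x: (batch, time, n_mels) spectrogram frames
        b, t, m = x.shape
        z = self.cnn(x.reshape(b * t, 1, m)).reshape(b, t, -1)
        out, _ = self.lstm(z)
        return self.fc(out[:, -1])  # classify from the last time step

clip = torch.randn(2, 20, 64)  # 2 clips, 20 frames, 64 mel bins
logits = AudioCNNLSTM()(clip)  # shape (2, 3)
```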
SLIDE 18

Laughter Model

SLIDE 19

Facial

SLIDE 20

Facial Pipeline

SLIDE 21

Results & Error Analysis

SLIDE 22

Independent Modality Results

Modality          Accuracy  F1-Score
Scene             0.546     0.541
Pose              0.486     0.489
Audio             0.577     0.577
Face              0.400     0.348
Image Captioning  0.505     0.506

SLIDE 23

Fully Connected Early Fusion Results

Dataset          Accuracy
Baseline [Test]  0.479
Validation       0.640
Test             0.639

SLIDE 24

Ablation Study on Modalities

  • ResNet scene modality had high positive-class saliency
  • Every modality except scene struggled to predict positive videos accurately

Ensemble: 64.0%

SLIDE 25

Modality Activation Contributions

SLIDE 26

Summary & Future

SLIDE 27

Summary

  • Required research into various modalities and ensembling methods
  • We provide two novel modalities: image captioning and laughter
  • Early fusion methods improved classification performance
  • Beat the baseline test accuracy of 47.9% by 16 percentage points: 63.9%

SLIDE 28

Future

  • Transitory facial expression datasets
  • 3D pose points, YOLO and object sentiment analysis
  • Real world / research: YouTube likes, self-driving cars, telehealth
  • Affective image captioning
SLIDE 29

Acknowledgements

  • Dr. Fei-Fei Li, Dr. Ranjay Krishna, Christina Yuan, and the rest of the Stanford CS231n teaching staff guided us through the project.
  • Dr. Pawan Nandakishore reviewed our approaches and provided guidance.
  • Vincent La helped us explore using YOLOv3 to perform object detection and text-based sentiment analysis.
  • The EmotiW competition organizers provided an interesting challenge and a large dataset.

SLIDE 30

References

[1] Garima Sharma, Shreya Ghosh, and Abhinav Dhall. Automatic group level affect and cohesion prediction in videos. In International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), pages 161-167. IEEE, 2019.

[2] Roland Goecke, Abhinav Dhall, Garima Sharma, and Tom Gedeon. EmotiW 2020: Driver gaze, group emotion, student engagement and physiological signal based challenges. In ACM International Conference on Multimodal Interaction (ICMI), 2020.
SLIDE 31

Thank You!

We’re on GitHub! https://github.com/kevincong95/cs231n-emotiw