SLIDE 1

Fusical: Multimodal Fusion for Video Sentiment

Boyang Tom Jin

Stanford University tomjin@stanford.edu

Leila Abdelrahman

University of Miami lxa215@miami.edu

Cong Kevin Chen

University of California Berkeley kevincong95@berkeley.edu

Amil Khanzada

University of California Berkeley amil@berkeley.edu

ACM International Conference on Multimodal Interaction, October 24-29, 2020

SLIDE 2

Tom Jin | Leila Abdelrahman | Cong Kevin Chen | Amil Khanzada

Introductions

SLIDE 3

Introduction: Understanding Video Sentiment

SLIDE 4

Dataset: EmotiW 2020 Video Group Emotion [1,2]

SLIDE 5

Our Approach

SLIDE 6

Multimodal Hidden Layer Ensembling
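A minimal sketch of the hidden-layer ensembling idea, assuming each per-modality network exposes its penultimate-layer feature vector; the feature sizes and fusion-head widths below are illustrative, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

# Hypothetical fusion head: concatenate hidden (penultimate-layer) features
# from each modality network and classify into the three sentiment classes
# (negative / neutral / positive).
class HiddenLayerFusion(nn.Module):
    def __init__(self, modality_dims, num_classes=3, hidden=64):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(sum(modality_dims), hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, features):
        # features: list of per-modality hidden vectors, each (batch, dim)
        return self.head(torch.cat(features, dim=1))

# Toy usage with made-up feature sizes for scene, pose, audio, face, captions.
dims = [128, 32, 64, 64, 50]
model = HiddenLayerFusion(dims)
batch = [torch.randn(4, d) for d in dims]
logits = model(batch)  # shape (4, 3)
```

Fusing hidden features rather than final class probabilities lets the head exploit richer per-modality representations.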

SLIDE 7

Scene

SLIDE 8

Scene Modality

SLIDE 9

Scene: Activated on People

  • Strong activations on foreground individuals
  • Activations followed individuals frame-to-frame

SLIDE 10

Scene: ResNet-50 Outperforms Inception-v3

  • ResNet-50 activates on foreground people
  • Inception-v3 gets distracted by background lighting

SLIDE 11

Image Captioning

SLIDE 12

Captions Tell Descriptive Nouns

SLIDE 13

Pose

SLIDE 14

Pose Model

SLIDE 15

Pose: Importance of Upper Body Joints

  • Elbows
  • Hands
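A minimal sketch of selecting upper-body joints from a pose estimator's output, assuming COCO keypoint ordering; the helper name and feature layout are hypothetical:

```python
import numpy as np

# Keep only the upper-body joints that mattered most (elbows, wrists/hands),
# indexed by their positions in the COCO 17-keypoint layout.
COCO_UPPER_BODY = {"left_elbow": 7, "right_elbow": 8,
                   "left_wrist": 9, "right_wrist": 10}

def upper_body_features(keypoints):
    """keypoints: (17, 2) array of (x, y) joint coordinates for one person."""
    idx = sorted(COCO_UPPER_BODY.values())
    return keypoints[idx].reshape(-1)  # flat (8,) feature vector

person = np.random.rand(17, 2)  # stand-in for pose-estimator output
feat = upper_body_features(person)
```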

SLIDE 16

Audio and Laughter

SLIDE 17

Audio Model System Diagram

  • CNN-LSTM
  • Time-dependent frames
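A hedged sketch of the CNN-LSTM pattern: a small CNN encodes each spectrogram frame, and an LSTM models the time-dependent frame sequence. Layer sizes here are illustrative, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class AudioCNNLSTM(nn.Module):
    def __init__(self, n_mels=64, num_classes=3):
        super().__init__()
        # Per-frame CNN encoder over the mel-frequency axis
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(8),  # per-frame embedding of length 16*8
        )
        # LSTM over the sequence of frame embeddings
        self.lstm = nn.LSTM(input_size=16 * 8, hidden_size=32, batch_first=True)
        self.fc = nn.Linear(32, num_classes)

    def forward(self, x):
        # x: (batch, time, n_mels) spectrogram frames
        b, t, m = x.shape
        z = self.cnn(x.reshape(b * t, 1, m)).reshape(b, t, -1)
        out, _ = self.lstm(z)
        return self.fc(out[:, -1])  # classify from the last time step

clip = torch.randn(2, 20, 64)  # 2 clips, 20 frames, 64 mel bins
logits = AudioCNNLSTM()(clip)  # shape (2, 3)
```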
SLIDE 18

Laughter Model

SLIDE 19

Facial

SLIDE 20

Facial Pipeline

SLIDE 21

Results & Error Analysis

SLIDE 22

Independent Modality Results

Modality          Accuracy  F1-Score
Scene             0.546     0.541
Pose              0.486     0.489
Audio             0.577     0.577
Face              0.400     0.348
Image Captioning  0.505     0.506

SLIDE 23

Fully Connected Early Fusion Results

Dataset          Accuracy
Baseline [Test]  0.479
Validation       0.640
Test             0.639

SLIDE 24

Ablation Study on Modalities

  • ResNet scene modality had high positive-class saliency
  • Every modality except scene struggled to predict positive videos accurately

Ensemble: 64.0%

SLIDE 25

Modality Activation Contributions

SLIDE 26

Summary & Future

SLIDE 27

Summary

  • Required research into various modalities and ensembling methods
  • We provide two novel modalities: image captioning and laughter
  • Early fusion methods improved classification performance
  • Beat the baseline test accuracy of 47.9% by 16 percentage points: 63.9%

SLIDE 28

Future

  • Transitory facial expression datasets
  • 3D pose points, YOLO and object sentiment analysis
  • Real world / research: YouTube likes, self-driving cars, telehealth
  • Affective image captioning
SLIDE 29

Acknowledgements

  • Dr. Fei-Fei Li, Dr. Ranjay Krishna, Christina Yuan, and the rest of the Stanford CS231n teaching staff guided us through the project.
  • Dr. Pawan Nandakishore reviewed our approaches and provided guidance.
  • Vincent La helped us explore using YOLOv3 to perform object detection and text-based sentiment analysis.
  • The EmotiW competition organizers provided an interesting challenge and a large dataset.

SLIDE 30

References

[1] Garima Sharma, Shreya Ghosh, and Abhinav Dhall. Automatic group level affect and cohesion prediction in videos. In International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), pages 161-167. IEEE, 2019.

[2] Roland Goecke, Abhinav Dhall, Garima Sharma, and Tom Gedeon. EmotiW 2020: Driver gaze, group emotion, student engagement and physiological signal based challenges. In ACM International Conference on Multimodal Interaction (ICMI), 2020.
SLIDE 31

Thank You!

We’re on GitHub! https://github.com/kevincong95/cs231n-emotiw