Efficient Low-rank Multimodal Fusion with Modality-specific Factors
Zhun Liu, Ying Shen, Varun Bharadwaj, Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency
“This movie is sick”
[Figure: the speaker’s behaviors (a smile, a loud voice) and the sentiment intensity of the utterance plotted over time]
[Figure: unimodal, bimodal, and trimodal interactions combined into a multimodal representation (multimodal fusion)]
① Intra-modal Interactions
② Cross-modal Interactions
③ Computational Efficiency
[Figure: Tensor Fusion Network fusing visual and language embeddings into unimodal and bimodal terms]
“Tensor Fusion Network for Multimodal Sentiment Analysis” by Zadeh, A., et al. (2017)
The input tensor is the outer product of the modality embeddings, each extended with a constant 1 so that the unimodal terms are kept alongside the bimodal interactions:
$\mathcal{a} = \begin{bmatrix} A_v \\ 1 \end{bmatrix} \otimes \begin{bmatrix} A_l \\ 1 \end{bmatrix}$
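As a sketch of this construction (the embedding sizes below are illustrative, not from the slides), the extended outer product takes a few lines of NumPy:

```python
import numpy as np

# Hypothetical embedding sizes for the two modalities (not from the slides).
d_v, d_l = 3, 4
A_v = np.random.randn(d_v)   # visual embedding
A_l = np.random.randn(d_l)   # language embedding

# Append a constant 1 to each embedding so the outer product keeps the
# unimodal features (last row/column) alongside the bimodal interactions.
z_v = np.concatenate([A_v, [1.0]])   # shape (d_v + 1,)
z_l = np.concatenate([A_l, [1.0]])   # shape (d_l + 1,)

# Tensor fusion: outer product of the extended embeddings.
a = np.outer(z_v, z_l)               # shape (d_v + 1, d_l + 1)

assert a.shape == (d_v + 1, d_l + 1)
assert np.allclose(a[:-1, -1], A_v)  # unimodal visual term survives
assert np.allclose(a[-1, :-1], A_l)  # unimodal language term survives
```

The appended 1 is what makes the fused tensor contain the original embeddings as slices, not only their pairwise products.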
Intra-modal interactions Cross-modal interactions Computational efficiency
[Figure: the fused input tensor 𝒶 is mapped by a weight tensor to the multimodal representation $h$ of size $|h|$]
The number of parameters grows with the product of the modality dimensions: $P(e_2 \times e_3)$ for two modalities (M = 2), $P(e_2 \times e_3 \times e_4)$ for three (M = 3), and in general $P\left(\prod_{n=2}^{N} e_n\right)$, i.e. exponential in the number of modalities.
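A quick back-of-the-envelope helper (the dimensions below are hypothetical, not the paper's) shows how the fused tensor blows up as modalities are added:

```python
import math

# Hypothetical extended embedding sizes per modality (not from the slides):
# e.g. a 128-dim language embedding and 32-dim visual/acoustic embeddings,
# each with the extra constant 1 appended.
dims = {"language": 129, "visual": 33, "acoustic": 33}

def tensor_elements(sizes):
    """Entries in the fused tensor: the product of the per-modality
    dimensions, i.e. exponential in the number of modalities M."""
    return math.prod(sizes)

print(tensor_elements([dims["language"], dims["visual"]]))   # M = 2 -> 4257
print(tensor_elements(list(dims.values())))                  # M = 3 -> 140481
```

Adding one 33-dim modality multiplies the tensor (and the weight that consumes it) by 33.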
Low-rank Multimodal Fusion (LMF)
Three steps:
① Decomposition of the weight tensor 𝒳.
② Decomposition of the input tensor 𝒶.
③ Rearranging the computation of $h$.
[Figure: side-by-side diagrams of Low-rank Multimodal Fusion and Tensor Fusion Networks, each fusing visual and language inputs]
Rank of tensor 𝒳: the minimum number $r$ of vector tuples whose summed outer products reconstruct 𝒳 exactly: $\mathcal{X} = \sum_{i=1}^{r} x_v^{(i)} \otimes x_l^{(i)}$
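A small NumPy sketch of this definition (rank and sizes are illustrative, not from the slides): summing $r$ rank-1 outer products yields a tensor of rank at most $r$:

```python
import numpy as np

rng = np.random.default_rng(0)
r, d_v, d_l = 3, 5, 4  # illustrative rank and sizes, not from the slides

# A sum of r outer products (rank-1 tensors) built from r vector tuples.
factors = [(rng.standard_normal(d_v), rng.standard_normal(d_l))
           for _ in range(r)]
X = sum(np.outer(u, v) for u, v in factors)

# By construction the result has rank at most r: the r vector tuples
# suffice for exact reconstruction.
assert np.linalg.matrix_rank(X) <= r
```

For two modalities 𝒳 is matrix-shaped, so `matrix_rank` checks the claim directly; with more modalities the same CP-style sum of outer products applies.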
Retain the dimension of the multimodal representation $h$ during the decomposition: each modality-specific factor $x_m^{(i)}$ keeps the output dimension $|h|$, so the sum $\sum_{i=1}^{r} x_v^{(i)} \otimes x_l^{(i)}$ still maps the input tensor 𝒶 to $h \in \mathbb{R}^{|h|}$.
[Figure: the weight tensor 𝒳 decomposed into rank-1 factors, each slice retaining the $|h|$ dimension]
Substituting the decomposition into the fusion:
$h = \mathcal{X} \cdot \mathcal{a} = \left( x_v^{(1)} \otimes x_l^{(1)} + \cdots + x_v^{(r)} \otimes x_l^{(r)} \right) \cdot \left( A_v \otimes A_l \right)$
Rearranging the computation so that neither 𝒶 nor 𝒳 is ever materialized: each modality is first projected by its own factors, and the projections are combined by element-wise product:
$h = \sum_{i=1}^{r} \left( x_v^{(i)} \cdot A_v \right) \circ \left( x_l^{(i)} \cdot A_l \right)$
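The rearrangement can be checked numerically. The sketch below (illustrative sizes and generic factor names, not the paper's released implementation) computes $h$ both ways and confirms they agree:

```python
import numpy as np

rng = np.random.default_rng(0)
d_v, d_l, d_h, r = 4, 5, 6, 3  # illustrative sizes, not from the slides

A_v = rng.standard_normal(d_v)            # visual embedding
A_l = rng.standard_normal(d_l)            # language embedding
# Modality-specific factors; each slice keeps the output dimension |h|.
x_v = rng.standard_normal((r, d_v, d_h))
x_l = rng.standard_normal((r, d_l, d_h))

# Naive route: materialise the weight tensor and the input tensor.
X = np.einsum('rih,rjh->ijh', x_v, x_l)   # sum of r outer products
a = np.outer(A_v, A_l)                    # input tensor A_v (x) A_l
h_naive = np.einsum('ij,ijh->h', a, X)

# LMF route: project each modality by its factors, combine element-wise,
# then sum over the rank. Neither X nor a is ever built.
h_lmf = (np.einsum('rih,i->rh', x_v, A_v) *
         np.einsum('rjh,j->rh', x_l, A_l)).sum(axis=0)

assert np.allclose(h_naive, h_lmf)        # both routes give the same h
```

The low-rank route touches only $r(d_v + d_l)|h|$ weights instead of $d_v d_l |h|$, which is where the efficiency gain comes from.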
Intra-modal interactions Cross-modal interactions Computational complexity
IEMOCAP: Emotion Recognition, 10,039 video segments, segment-level annotations
CMU-MOSI: Sentiment Analysis, 2,199 video segments, segment-level annotations
POM: Speaker Trait Recognition, 1,000 full video clips, video-level annotations
[Charts: Low-rank Multimodal Fusion (our model) vs. Tensor Fusion Networks (Zadeh, et al., 2017)]
CMU-MOSI (LMF vs. TFN): MAE 0.91 vs. 0.97, Correlation 0.67 vs. 0.63, Acc-2 76.4 vs. 73.9, F1 75.7 vs. 73.4, Acc-7 32.8 vs. 32.1
POM (LMF vs. TFN): MAE 0.80 vs. 0.89, Correlation 0.40 vs. 0.09
IEMOCAP (LMF vs. TFN): F1-Happy 85.8 vs. 83.6, F1-Sad 85.9 vs. 82.8
[Chart: Mean Average Error (MAE) on CMU-MOSI across models (lower is better): Deep Fusion 1.143, MV-LSTM 1.019, TFN 0.970, MARN 0.968, MFN 0.965, LMF 0.912]
Models compared: Low-rank Multimodal Fusion (our model); Memory Fusion Networks (Zadeh, et al., 2018); Multi-attention Recurrent Networks (Zadeh, et al., 2018); Tensor Fusion Networks (Zadeh, et al., 2017); Multi-view LSTM (Rajagopalan, et al., 2016); Deep Fusion (Nojavanasghari, et al., 2016)
[Charts: LMF vs. the best-performing baseline (MFN): CMU-MOSI Correlation 0.668 vs. 0.632; POM MAE 0.796 vs. 0.805, Correlation 0.396 vs. 0.349; IEMOCAP F1-Angry 89.0 vs. 84.3, F1-Sad 85.9 vs. 82.8]
[Chart: efficiency on CMU-MOSI; metric: number of data samples processed per second]
LMF (Ours): training 1177.17 samples/s, testing 2249.9 samples/s
TFN (Zadeh, et al., 2017): training 340.74 samples/s, testing 1134.82 samples/s
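A rough micro-benchmark in the same spirit (hypothetical sizes, batched NumPy einsum rather than the paper's PyTorch code) contrasts the two fusion routes:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sizes chosen only to make the gap visible; not the paper's setup.
d, d_h, r, batch = 64, 32, 4, 256
A_v = rng.standard_normal((batch, d))
A_l = rng.standard_normal((batch, d))
x_v = rng.standard_normal((r, d, d_h))
x_l = rng.standard_normal((r, d, d_h))

def fuse_naive():
    # Materialise the (batch, d, d) input tensor, then contract it with
    # the reconstructed weight tensor.
    a = np.einsum('bi,bj->bij', A_v, A_l)
    X = np.einsum('rih,rjh->ijh', x_v, x_l)
    return np.einsum('bij,ijh->bh', a, X)

def fuse_lmf():
    # Project each modality with its factors, multiply, sum over the rank.
    pv = np.einsum('bi,rih->rbh', A_v, x_v)
    pl = np.einsum('bj,rjh->rbh', A_l, x_l)
    return (pv * pl).sum(axis=0)

assert np.allclose(fuse_naive(), fuse_lmf())

for f in (fuse_naive, fuse_lmf):
    t0 = time.perf_counter()
    for _ in range(10):
        f()
    print(f.__name__, time.perf_counter() - t0)
```

Absolute timings depend on the machine and BLAS build; the point is only that the low-rank route skips the large intermediate tensors entirely.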
Intra-modal interactions Cross-modal interactions Computational complexity State-of-the-art results
Code: https://github.com/Justin1904/Low-rank-Multimodal-Fusion
http://multicomp.cs.cmu.edu/