Efficient Low-rank Multimodal Fusion with Modality-specific Factors
Zhun Liu, Ying Shen, Varun Bharadwaj, Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency
“This movie is sick”
[Figure: the speaker’s behaviors (a smile, a loud voice) and the sentiment intensity of the utterance plotted over time]
[Figure: unimodal, bimodal, and trimodal interactions combined into a multimodal representation (multimodal fusion)]
① Intra-modal Interactions
② Cross-modal Interactions
③ Computational Efficiency
[Figure: Tensor Fusion Network fusing visual and language embeddings into unimodal and bimodal terms]
“Tensor Fusion Network for Multimodal Sentiment Analysis” by Zadeh, A., et al. (2017)
The input tensor is the outer product of the modality embeddings, each extended with a constant 1 so that the unimodal terms are kept alongside the bimodal interactions:
$\mathcal{a} = \begin{bmatrix} A_v \\ 1 \end{bmatrix} \otimes \begin{bmatrix} A_l \\ 1 \end{bmatrix}$
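As a sketch of this construction (the embedding sizes below are illustrative, not from the slides), the extended outer product takes a few lines of NumPy:

```python
import numpy as np

# Hypothetical embedding sizes for the two modalities (not from the slides).
d_v, d_l = 3, 4
A_v = np.random.randn(d_v)   # visual embedding
A_l = np.random.randn(d_l)   # language embedding

# Append a constant 1 to each embedding so the outer product keeps the
# unimodal features (last row/column) alongside the bimodal interactions.
z_v = np.concatenate([A_v, [1.0]])   # shape (d_v + 1,)
z_l = np.concatenate([A_l, [1.0]])   # shape (d_l + 1,)

# Tensor fusion: outer product of the extended embeddings.
a = np.outer(z_v, z_l)               # shape (d_v + 1, d_l + 1)

assert a.shape == (d_v + 1, d_l + 1)
assert np.allclose(a[:-1, -1], A_v)  # unimodal visual term survives
assert np.allclose(a[-1, :-1], A_l)  # unimodal language term survives
```

The appended 1 is what makes the fused tensor contain the original embeddings as slices, not only their pairwise products.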
Intra-modal interactions Cross-modal interactions Computational efficiency
[Figure: the fused input tensor 𝒶 is mapped by a weight tensor to the multimodal representation $h$ of size $|h|$]
The number of parameters grows with the product of the modality dimensions: $P(e_2 \times e_3)$ for two modalities (M = 2), $P(e_2 \times e_3 \times e_4)$ for three (M = 3), and in general $P\left(\prod_{n=2}^{N} e_n\right)$, i.e. exponential in the number of modalities.
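A quick back-of-the-envelope helper (the dimensions below are hypothetical, not the paper's) shows how the fused tensor blows up as modalities are added:

```python
import math

# Hypothetical extended embedding sizes per modality (not from the slides):
# e.g. a 128-dim language embedding and 32-dim visual/acoustic embeddings,
# each with the extra constant 1 appended.
dims = {"language": 129, "visual": 33, "acoustic": 33}

def tensor_elements(sizes):
    """Entries in the fused tensor: the product of the per-modality
    dimensions, i.e. exponential in the number of modalities M."""
    return math.prod(sizes)

print(tensor_elements([dims["language"], dims["visual"]]))   # M = 2 -> 4257
print(tensor_elements(list(dims.values())))                  # M = 3 -> 140481
```

Adding one 33-dim modality multiplies the tensor (and the weight that consumes it) by 33.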
Low-rank Multimodal Fusion (LMF)
Three steps:
① Decomposition of the weight tensor 𝒳.
② Decomposition of the input tensor 𝒶.
③ Rearranging the computation of $h$.
[Figure: side-by-side diagrams of Low-rank Multimodal Fusion and Tensor Fusion Networks, each fusing visual and language inputs]
Rank of tensor 𝒳: the minimum number $r$ of vector tuples whose summed outer products reconstruct 𝒳 exactly: $\mathcal{X} = \sum_{i=1}^{r} x_v^{(i)} \otimes x_l^{(i)}$
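A small NumPy sketch of this definition (rank and sizes are illustrative, not from the slides): summing $r$ rank-1 outer products yields a tensor of rank at most $r$:

```python
import numpy as np

rng = np.random.default_rng(0)
r, d_v, d_l = 3, 5, 4  # illustrative rank and sizes, not from the slides

# A sum of r outer products (rank-1 tensors) built from r vector tuples.
factors = [(rng.standard_normal(d_v), rng.standard_normal(d_l))
           for _ in range(r)]
X = sum(np.outer(u, v) for u, v in factors)

# By construction the result has rank at most r: the r vector tuples
# suffice for exact reconstruction.
assert np.linalg.matrix_rank(X) <= r
```

For two modalities 𝒳 is matrix-shaped, so `matrix_rank` checks the claim directly; with more modalities the same CP-style sum of outer products applies.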
Retain the dimension of the multimodal representation $h$ during the decomposition: each modality-specific factor $x_m^{(i)}$ keeps the output dimension $|h|$, so the sum $\sum_{i=1}^{r} x_v^{(i)} \otimes x_l^{(i)}$ still maps the input tensor 𝒶 to $h \in \mathbb{R}^{|h|}$.
[Figure: the weight tensor 𝒳 decomposed into rank-1 factors, each slice retaining the $|h|$ dimension]
Substituting the decomposition into the fusion:
$h = \mathcal{X} \cdot \mathcal{a} = \left( x_v^{(1)} \otimes x_l^{(1)} + \cdots + x_v^{(r)} \otimes x_l^{(r)} \right) \cdot \left( A_v \otimes A_l \right)$
Rearranging the computation so that neither 𝒶 nor 𝒳 is ever materialized: each modality is first projected by its own factors, and the projections are combined by element-wise product:
$h = \sum_{i=1}^{r} \left( x_v^{(i)} \cdot A_v \right) \circ \left( x_l^{(i)} \cdot A_l \right)$
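The rearrangement can be checked numerically. The sketch below (illustrative sizes and generic factor names, not the paper's released implementation) computes $h$ both ways and confirms they agree:

```python
import numpy as np

rng = np.random.default_rng(0)
d_v, d_l, d_h, r = 4, 5, 6, 3  # illustrative sizes, not from the slides

A_v = rng.standard_normal(d_v)            # visual embedding
A_l = rng.standard_normal(d_l)            # language embedding
# Modality-specific factors; each slice keeps the output dimension |h|.
x_v = rng.standard_normal((r, d_v, d_h))
x_l = rng.standard_normal((r, d_l, d_h))

# Naive route: materialise the weight tensor and the input tensor.
X = np.einsum('rih,rjh->ijh', x_v, x_l)   # sum of r outer products
a = np.outer(A_v, A_l)                    # input tensor A_v (x) A_l
h_naive = np.einsum('ij,ijh->h', a, X)

# LMF route: project each modality by its factors, combine element-wise,
# then sum over the rank. Neither X nor a is ever built.
h_lmf = (np.einsum('rih,i->rh', x_v, A_v) *
         np.einsum('rjh,j->rh', x_l, A_l)).sum(axis=0)

assert np.allclose(h_naive, h_lmf)        # both routes give the same h
```

The low-rank route touches only $r(d_v + d_l)|h|$ weights instead of $d_v d_l |h|$, which is where the efficiency gain comes from.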
Intra-modal interactions Cross-modal interactions Computational complexity
IEMOCAP: Emotion Recognition, 10,039 video segments, segment-level annotations
CMU-MOSI: Sentiment Analysis, 2,199 video segments, segment-level annotations
POM: Speaker Trait Recognition, 1,000 full video clips, video-level annotations
[Charts: Low-rank Multimodal Fusion (our model) vs. Tensor Fusion Networks (Zadeh, et al., 2017)]
CMU-MOSI (LMF vs. TFN): MAE 0.91 vs. 0.97, Correlation 0.67 vs. 0.63, Acc-2 76.4 vs. 73.9, F1 75.7 vs. 73.4, Acc-7 32.8 vs. 32.1
POM (LMF vs. TFN): MAE 0.80 vs. 0.89, Correlation 0.40 vs. 0.09
IEMOCAP (LMF vs. TFN): F1-Happy 85.8 vs. 83.6, F1-Sad 85.9 vs. 82.8
[Chart: Mean Average Error (MAE) on CMU-MOSI across models (lower is better): Deep Fusion 1.143, MV-LSTM 1.019, TFN 0.970, MARN 0.968, MFN 0.965, LMF 0.912]
Models compared: Low-rank Multimodal Fusion (our model); Memory Fusion Networks (Zadeh, et al., 2018); Multi-attention Recurrent Networks (Zadeh, et al., 2018); Tensor Fusion Networks (Zadeh, et al., 2017); Multi-view LSTM (Rajagopalan, et al., 2016); Deep Fusion (Nojavanasghari, et al., 2016)
[Charts: LMF vs. the best-performing baseline (MFN): CMU-MOSI Correlation 0.668 vs. 0.632; POM MAE 0.796 vs. 0.805, Correlation 0.396 vs. 0.349; IEMOCAP F1-Angry 89.0 vs. 84.3, F1-Sad 85.9 vs. 82.8]
[Chart: efficiency on CMU-MOSI; metric: number of data samples processed per second]
LMF (Ours): training 1177.17 samples/s, testing 2249.9 samples/s
TFN (Zadeh, et al., 2017): training 340.74 samples/s, testing 1134.82 samples/s
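A rough micro-benchmark in the same spirit (hypothetical sizes, batched NumPy einsum rather than the paper's PyTorch code) contrasts the two fusion routes:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sizes chosen only to make the gap visible; not the paper's setup.
d, d_h, r, batch = 64, 32, 4, 256
A_v = rng.standard_normal((batch, d))
A_l = rng.standard_normal((batch, d))
x_v = rng.standard_normal((r, d, d_h))
x_l = rng.standard_normal((r, d, d_h))

def fuse_naive():
    # Materialise the (batch, d, d) input tensor, then contract it with
    # the reconstructed weight tensor.
    a = np.einsum('bi,bj->bij', A_v, A_l)
    X = np.einsum('rih,rjh->ijh', x_v, x_l)
    return np.einsum('bij,ijh->bh', a, X)

def fuse_lmf():
    # Project each modality with its factors, multiply, sum over the rank.
    pv = np.einsum('bi,rih->rbh', A_v, x_v)
    pl = np.einsum('bj,rjh->rbh', A_l, x_l)
    return (pv * pl).sum(axis=0)

assert np.allclose(fuse_naive(), fuse_lmf())

for f in (fuse_naive, fuse_lmf):
    t0 = time.perf_counter()
    for _ in range(10):
        f()
    print(f.__name__, time.perf_counter() - t0)
```

Absolute timings depend on the machine and BLAS build; the point is only that the low-rank route skips the large intermediate tensors entirely.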
Intra-modal interactions Cross-modal interactions Computational complexity State-of-the-art results
Code: https://github.com/Justin1904/Low-rank-Multimodal-Fusion
http://multicomp.cs.cmu.edu/