Efficient Low-rank Multimodal Fusion With Modality-specific Factors - PowerPoint PPT Presentation



SLIDE 1

Efficient Low-rank Multimodal Fusion With Modality-specific Factors

Zhun Liu, Ying Shen, Varun Bharadwaj, Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency

SLIDE 2

Artificial Intelligence

SLIDE 3

Sentiment and Emotion Analysis

“This movie is sick” : ?

[Figure: speaker’s behaviors (Smile, Loud) and sentiment intensity over time]

SLIDE 4

Multimodal Sentiment and Emotion Analysis

“This movie is sick” : ?

[Figure: speaker’s behaviors (Smile, Loud) and sentiment intensity over time]

Unimodal · Bimodal · Trimodal → Multimodal Representation (Multimodal Fusion)

① Intra-modal interactions
② Cross-modal interactions
③ Computational efficiency

SLIDE 5

Multimodal Fusion using Tensor Representation

“This movie is sick” (Language + Visual)

Unimodal and bimodal terms collected in one tensor:

𝒵 = [𝑨ᵥ; 1] ⊗ [𝑨ₗ; 1]

··· feeding a multimodal representation of dimension |ℎ|

Intra-modal interactions · Cross-modal interactions · Computational efficiency

“Tensor Fusion Network for Multimodal Sentiment Analysis”, Zadeh, A., et al. (2017)
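The bimodal tensor fusion above can be sketched in a few lines of numpy. This is an illustrative reconstruction, not the authors' code; the names `tensor_fusion`, `z_v`, `z_l` are assumptions:

```python
import numpy as np

def tensor_fusion(z_v, z_l):
    # Append a constant 1 to each unimodal vector, then take the outer
    # product.  The result contains both unimodal terms (the rows/columns
    # paired with the constant 1) and all bimodal interaction terms.
    z_v1 = np.concatenate([z_v, [1.0]])
    z_l1 = np.concatenate([z_l, [1.0]])
    return np.outer(z_v1, z_l1)  # shape (d_v + 1, d_l + 1)

Z = tensor_fusion(np.arange(4.0), np.arange(8.0))
print(Z.shape)  # (5, 9)
```

The corner entry `Z[-1, -1]` is always 1, and the last row/column reproduce the raw unimodal vectors, which is exactly why the 1-appending trick captures unimodal as well as bimodal interactions.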

SLIDE 6

Computational Complexity – Tensor Product

The number of parameters grows with the size of the fused tensor: 𝑷(𝒆𝟐 × 𝒆𝟑) for M = 2 modalities, 𝑷(𝒆𝟐 × 𝒆𝟑 × 𝒆𝟒) for M = 3, and in general 𝑷(∏ₙ 𝒆ₙ) over n = 2…N, i.e. exponential in the number of modalities.
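To see that growth concretely, a small helper (illustrative, not from the talk; the dimension 32 is an arbitrary example) counts the entries of the fused tensor:

```python
import math

def fused_tensor_size(dims):
    # Entries in the fused tensor for modality dimensions d_1..d_M,
    # each vector appended with a constant 1: prod over m of (d_m + 1).
    return math.prod(d + 1 for d in dims)

print(fused_tensor_size([32, 32]))       # M = 2: 33 * 33 = 1089
print(fused_tensor_size([32, 32, 32]))   # M = 3: 33^3 = 35937
```

Adding one modality multiplies the tensor size (and the weight tensor acting on it) by another factor of d + 1, which is the efficiency problem low-rank fusion addresses.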

SLIDE 7

CORE CONTRIBUTIONS

Low-rank Multimodal Fusion (LMF)


SLIDE 8

From Tensor Representation to Low-rank Fusion


Visual Language

① Decomposition of the weight tensor 𝒲. ② Decomposition of the input tensor 𝒵. ③ Rearranging the computation of ℎ.

Visual Language

Low-rank Multimodal Fusion Tensor Fusion Networks

SLIDE 9

Canonical Polyadic (CP) Decomposition of Tensors


Rank of a tensor 𝒳: the minimum number of vector tuples (rank-1 outer products) needed to reconstruct it exactly

SLIDE 10

Canonical Polyadic (CP) Decomposition of 3D tensors


[Figure: a 3D tensor written as a sum of rank-1 outer products (⨂ + ⨂ + ⋯), each factor retaining the output dimension |ℎ|]
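The sum-of-outer-products picture can be sketched with `einsum`; the function name and the example dimensions are illustrative assumptions, not part of the talk:

```python
import numpy as np

def cp_reconstruct(A, B, C):
    # Rebuild a 3D tensor from rank-r CP factors: the i-th columns of
    # A, B, C form the i-th vector tuple, and the tensor is the sum of
    # the r rank-1 outer products a_i (x) b_i (x) c_i.
    return np.einsum('ir,jr,kr->ijk', A, B, C)

rng = np.random.default_rng(0)
r = 2  # rank: number of vector tuples
A, B, C = (rng.standard_normal((d, r)) for d in (3, 4, 5))
T = cp_reconstruct(A, B, C)
print(T.shape)  # (3, 4, 5)
```

A rank-r decomposition stores r·(d₁ + d₂ + d₃) numbers instead of d₁·d₂·d₃, which is the saving the low-rank fusion builds on.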

SLIDE 11

Modality-specific Decomposition


Retain the dimension |ℎ| of the multimodal representation ℎ in every factor during decomposition

SLIDE 12

① Decomposition of the weight tensor 𝒲

[Figure: ℎ = 𝒲 · 𝒵, where 𝒵 is the fused tensor built from 𝑨ᵥ and 𝑨ₗ]

SLIDE 13

① Decomposition of the weight tensor 𝒲

𝒲 = 𝑥ᵥ⁽¹⁾ ⊗ 𝑥ₗ⁽¹⁾ + 𝑥ᵥ⁽²⁾ ⊗ 𝑥ₗ⁽²⁾ + ⋯

SLIDE 14

② Decomposition of the input tensor 𝒵

𝒵 = [𝑨ᵥ; 1] ⊗ [𝑨ₗ; 1] is already rank 1: the outer product of the 1-appended unimodal representations

SLIDE 15

③ Rearranging the computation: ℎ = 𝒲 · 𝒵 = Σᵢ (𝑥ᵥ⁽ⁱ⁾ · [𝑨ᵥ; 1]) ∘ (𝑥ₗ⁽ⁱ⁾ · [𝑨ₗ; 1]), so the full tensor 𝒵 is never materialized

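The rearranged computation can be sketched in numpy as follows. This is a minimal sketch under assumed shapes (each `factors[m]` of shape `(r, d_m + 1, d_h)`), not the authors' released implementation; their code, linked on the final slide, is the authoritative version:

```python
import numpy as np

def lmf(zs, factors):
    # Low-rank multimodal fusion: project each 1-appended modality
    # vector through its own rank-r modality-specific factors, take the
    # elementwise product across modalities, then sum over the rank.
    # This equals W . Z without ever building the fused tensor Z.
    prod = None
    for z, W in zip(zs, factors):
        z1 = np.concatenate([z, [1.0]])        # append constant 1
        proj = np.einsum('d,rdh->rh', z1, W)   # (r, d_h) per modality
        prod = proj if prod is None else prod * proj
    return prod.sum(axis=0)                    # (d_h,)

rng = np.random.default_rng(0)
dims, d_h, r = (3, 5, 7), 16, 4                # three modalities
zs = [rng.standard_normal(d) for d in dims]
factors = [rng.standard_normal((r, d + 1, d_h)) for d in dims]
h = lmf(zs, factors)
print(h.shape)  # (16,)
```

Note the loop is linear in the number of modalities: adding a fourth modality adds one more projection and one more elementwise product, rather than another multiplicative factor in a tensor size.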

SLIDE 16

Low-rank Multimodal Fusion


SLIDE 17

Easily scales to more modalities

  • Intra-modal interactions
  • Cross-modal interactions
  • Computational complexity

SLIDE 18

EXPERIMENTS AND RESULTS


SLIDE 19

Datasets

IEMOCAP: Emotion Recognition, 10,039 video segments

  • Dyadic interactions
  • From 302 videos

Segment-level annotations

  • 10 classes of emotions
  • Categorical annotations

CMU-MOSI: Sentiment Analysis, 2,199 video segments

  • Single speaker
  • From 93 movie reviews

Segment-level annotations

  • Sentiment
  • Real-valued annotations

POM: Speaker Trait Recognition, 1,000 full video clips

  • Single speaker
  • Movie reviews

Video-level annotations

  • 16 types of speaker traits
  • Categorical annotations

SLIDE 20

Compare to full-rank tensor fusion (CMU-MOSI)

Low-rank Multimodal Fusion (our model) vs. Tensor Fusion Networks (Zadeh, et al., 2017):

Metric          LMF     TFN
MAE (↓)         0.91    0.97
Correlation (↑) 0.67    0.63
Acc-2 (↑)       76.4    73.9
F1 (↑)          75.7    73.4
Acc-7 (↑)       32.8    32.1

SLIDE 21

Compare to full-rank tensor fusion (LMF vs. TFN)

CMU-MOSI:  MAE 0.91 vs 0.97 · Correlation 0.67 vs 0.63
POM:       MAE 0.80 vs 0.89 · Correlation 0.40 vs 0.09
IEMOCAP:   F1-Happy 85.8 vs 83.6 · F1-Sad 85.9 vs 82.8

SLIDE 22

Compare with State-of-the-Art Approaches

Mean Absolute Error (MAE) on CMU-MOSI, lower is better:

Deep Fusion (Nojavanasghari, et al., 2016): 1.143
MV-LSTM, Multi-view LSTM (Rajagopalan, et al., 2016): 1.019
TFN, Tensor Fusion Networks (Zadeh, et al., 2017): 0.970
MARN, Multi-attention Recurrent Networks (Zadeh, et al., 2018): 0.968
MFN, Memory Fusion Networks (Zadeh, et al., 2018): 0.965
LMF, Low-rank Multimodal Fusion (our model): 0.912

SLIDE 23

Compare with Top-2 State-of-the-Art Approaches (LMF vs. best prior model)

CMU-MOSI:  Correlation 0.668 vs 0.632
POM:       MAE 0.796 vs 0.805 · Correlation 0.396 vs 0.349
IEMOCAP:   F1-Angry 89.0 vs 84.3 · F1-Sad 85.9 vs 82.8

(baselines shown: TFN, MV-LSTM, MARN, MFN)

SLIDE 24

Efficiency Improvement

CMU-MOSI. Efficiency metric: number of data samples processed per second.

                           Training (samples/s)   Testing (samples/s)
LMF (Ours)                 1134.82                2249.9
TFN (Zadeh, et al., 2017)   340.74                1177.17

SLIDE 25

Conclusions

  • Intra-modal interactions
  • Cross-modal interactions
  • Computational complexity
  • State-of-the-art results

SLIDE 26

Code: https://github.com/Justin1904/Low-rank-Multimodal-Fusion

Thank you!

http://multicomp.cs.cmu.edu/