Multimodal Affective Analysis Using Hierarchical Attention Strategy


SLIDE 1

Multimodal Affective Analysis Using Hierarchical Attention Strategy with Word-Level Alignment

Yue Gu*, Kangning Yang, Shiyu Fu, Shuhong Chen, Xinyu Li, Ivan Marsic
Multimedia Image Processing Lab
Electrical and Computer Engineering Department
Rutgers, The State University of New Jersey

SLIDE 2

Why is affective analysis necessary?

  • Applications: Question and Answer, Recommendation System, AI Assistant
  • Diagram labels: Human Speech, Affects, Accurate Response

SLIDE 3

Progress of Affective Computing

  • Affective Analysis
    • Emotion Recognition: Happy, Excited; Sadness; Anger; Neutral; Frustration
    • Sentiment Analysis: Strong Positive; Positive; Neutral; Negative; Strong Negative
  • Speech Signal Processing: MFCCs, Prosody, Vocal Quality
  • Natural Language Processing: BoW, POS, CNNs, LSTMs
  • Multi-Modality

SLIDE 4

Is multi-modality needed?

  • Vocal signal prominence
    • “Oh you don’t like that you are west-sider”: Neutral or Frustration

SLIDE 5

Is multi-modality needed?

  • Vocal signal prominence
    • “Oh you don’t like that you are west-sider”: Neutral, Frustration, or Happy

SLIDE 6

Is multi-modality needed?

  • Vocal signal prominence
    • “Oh you don’t like that you are west-sider”: Neutral, Frustration, or Happy
  • Acoustic ambiguity
    • “I love this city!” versus “I hate this city!”

SLIDE 7

Challenges: Feature Extraction

  • Gap between features and actual affective states
  • Lack of high-level associations
  • Not all parts contribute equally

SLIDE 8

Challenges: Modality Fusion

  • Decision-level Fusion
    • Lack of mutual association learning
  • Feature-level Fusion
    • Fails to learn time-dependent interactions
    • Lack of consistency
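
The two baseline fusion families named above can be contrasted with a minimal sketch. The function names, the weighted-average rule, and the toy numbers are illustrative assumptions, not taken from the slides; decision-level fusion can be done in several other ways.

```python
def feature_level_fusion(text_feat, audio_feat):
    """Feature-level fusion: concatenate per-utterance feature vectors
    before a single classifier sees them."""
    return list(text_feat) + list(audio_feat)


def decision_level_fusion(text_probs, audio_probs, w_text=0.5):
    """Decision-level fusion: each modality is classified on its own and
    only the class probabilities are combined (here: a weighted average,
    which is one assumed choice among many)."""
    fused = [w_text * t + (1.0 - w_text) * a
             for t, a in zip(text_probs, audio_probs)]
    total = sum(fused)
    return [f / total for f in fused]


# Toy inputs (made up for illustration).
joint_features = feature_level_fusion([0.1, 0.2], [0.3])
joint_probs = decision_level_fusion([0.9, 0.1], [0.4, 0.6])
```

In the decision-level case the modalities never interact before the final vote, and in the feature-level case a single utterance-level vector hides when each cue occurred, which is how the two challenges listed on the slide can be read.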

SLIDE 9

Proposed Solutions

  • Feature Extraction
    • Hierarchical attention-based bidirectional GRUs
  • Modality Fusion
    • Word-level fusion with attention
  • An end-to-end multimodal network

SLIDE 10

Data Pre-processing

  • Text Branch
    • Word Embedding: word2vec
  • Audio Branch
    • Mel-frequency spectral coefficients (MFSCs)
  • Synchronization
    • Word-level forced alignment
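
The synchronization step can be pictured as grouping acoustic frames by the word they belong to. The sketch below assumes a forced aligner has already produced (word, start, end) timestamps in seconds and that frames are spaced 10 ms apart; the tuple format, the hop size, and the function name are hypothetical, not details given on the slide.

```python
def frames_per_word(word_spans, hop_s=0.010, n_frames=None):
    """Map each aligned word to the indices of the acoustic frames it covers.

    word_spans: list of (word, start_s, end_s) tuples (assumed aligner output).
    hop_s:      frame hop in seconds (assumed to be 10 ms).
    n_frames:   optional total number of frames, used to clip indices.
    """
    groups = []
    for word, start_s, end_s in word_spans:
        first = int(round(start_s / hop_s))
        last = int(round(end_s / hop_s))  # exclusive upper bound
        if n_frames is not None:
            last = min(last, n_frames)
        groups.append((word, list(range(first, max(first, last)))))
    return groups


# Toy timestamps (made up for illustration).
spans = [("I", 0.00, 0.03), ("mean", 0.03, 0.08), ("guys", 0.08, 0.10)]
aligned = frames_per_word(spans, hop_s=0.01, n_frames=10)
```

Each word then owns a variable-length block of MFSC frames, so the text and audio branches can be indexed by the same word positions.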

SLIDE 11

[Figure: model architecture]

  • Text branch: Text (Embedded) (“I”, “mean”, “guys”, …) → BiGRU → t_h1, t_h2, …, t_hN → Word-level Textual Attention (t_e1, t_e2, …, t_eN and t_α1, t_α2, …, t_αN)
  • Audio branch: Audio (MFSC) → BiGRU → f_h21, f_h22, …, f_h2L → Frame-level Acoustic Attention (f_e2j and f_α2j) → BiGRU → w_h1, w_h2, …, w_hN → Word-level Acoustic Attention (w_e1, w_e2, …, w_eN and w_α1, w_α2, …, w_αN)
  • Word-level Fusion (Text Audio Fusion) → V_1, V_2, …, V_N → CNN → Softmax Layer → Result

Frame-level acoustic attention:

$$f\_e_{ij} = \tanh\left(W_f \, f\_h_{ij} + b_f\right)$$

$$f\_\alpha_{ij} = \frac{\exp\left(f\_e_{ij}^{\top} v_f\right)}{\sum_{k=1}^{L} \exp\left(f\_e_{ik}^{\top} v_f\right)}$$
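
The attention step on this slide (a tanh projection, a dot product with a context vector, and a softmax over the sequence) can be sketched in plain Python. The final weighted sum, the toy dimensions, and the parameter values are assumptions added for illustration; the slide itself only gives the two attention equations.

```python
import math


def attention_pool(hidden, W, b, v):
    """Attention over a sequence of hidden vectors.

    hidden: list of L vectors (lists of d floats), e.g. frame-level states
    W, b:   projection matrix (list of rows) and bias, for e = tanh(W h + b)
    v:      context vector scored against each e
    Returns (alpha, pooled): the attention weights and, as an assumed next
    step, the attention-weighted sum of the hidden vectors.
    """
    scores = []
    for h in hidden:
        e = [math.tanh(sum(w_k * h_k for w_k, h_k in zip(row, h)) + b_r)
             for row, b_r in zip(W, b)]
        scores.append(sum(e_k * v_k for e_k, v_k in zip(e, v)))
    # Softmax over the sequence; subtracting the max keeps exp() stable.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [x / total for x in exps]
    dim = len(hidden[0])
    pooled = [sum(a * h[k] for a, h in zip(alpha, hidden)) for k in range(dim)]
    return alpha, pooled


# Toy parameters (made up for illustration).
W_f = [[1.0, 0.0], [0.0, 1.0]]
b_f = [0.0, 0.0]
v_f = [1.0, 1.0]
frames = [[0.1, 0.2], [0.9, -0.4], [0.3, 0.3]]
alpha, pooled = attention_pool(frames, W_f, b_f, v_f)
```

The same routine applies at both levels of the hierarchy: once over the frames inside a word, and once over the words of an utterance.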

SLIDE 12

Word-level Fusion

  • (a) Horizontal Fusion: inputs t_hi, t_αi, w_hi, w_αi; intermediate t_Vi and w_Vi; Dense Layer; output V_i
  • (b) Vertical Fusion: inputs t_hi, t_αi, w_hi, w_αi; concatenation (c); Dense Layer; h_i; s_αi; output V_i
  • (c) Fine-tuning Attention Fusion: as in (b), with an added Attention Layer producing u_αi

Fine-tuning attention:

$$u\_\alpha_{i} = \frac{\exp\left(u\_e_{i}^{\top} v_u\right)}{\sum_{k=1}^{N} \exp\left(u\_e_{k}^{\top} v_u\right)} + s\_\alpha_{i}$$

Legend:

  • w_αi: Word-level acoustic attention distribution
  • t_αi: Word-level textual attention distribution
  • w_hi: Word-level acoustic contextual state
  • t_hi: Word-level textual contextual state
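
A rough sketch of how the three fusion variants could differ is given below. Several details are assumptions rather than facts stated on the slide: that attention weights scale the contextual states by multiplication, that the shared attention s_αi is the mean of the textual and acoustic weights, and that `dense` and `score` are caller-supplied stand-ins for the trained Dense Layer and Attention Layer.

```python
import math


def horizontal_fusion(t_h, t_a, w_h, w_a, dense):
    """Scale each modality's state by its own attention, then fuse per word."""
    fused = []
    for th, ta, wh, wa in zip(t_h, t_a, w_h, w_a):
        t_V = [ta * x for x in th]
        w_V = [wa * x for x in wh]
        fused.append(dense(t_V + w_V))
    return fused


def vertical_fusion(t_h, t_a, w_h, w_a, dense):
    """Fuse the states first, then scale by a shared attention weight."""
    fused = []
    for th, ta, wh, wa in zip(t_h, t_a, w_h, w_a):
        h = dense(th + wh)
        s_a = 0.5 * (ta + wa)  # assumed: shared attention is the mean
        fused.append([s_a * x for x in h])
    return fused


def fine_tuning_attention_fusion(t_h, t_a, w_h, w_a, dense, score):
    """Vertical fusion plus a learned correction: u_a = softmax(score) + s_a."""
    h = [dense(th + wh) for th, wh in zip(t_h, w_h)]
    s_a = [0.5 * (ta + wa) for ta, wa in zip(t_a, w_a)]
    scores = [score(hi) for hi in h]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    u_a = [e / total + s for e, s in zip(exps, s_a)]
    return [[u * x for x in hi] for u, hi in zip(u_a, h)], u_a


# Toy two-word utterance; identity "dense" and sum "score" are placeholders.
identity = lambda x: list(x)
t_h = [[1.0, 0.0], [0.0, 1.0]]
w_h = [[0.5, 0.5], [1.0, 1.0]]
t_a = [0.7, 0.3]
w_a = [0.4, 0.6]
hf = horizontal_fusion(t_h, t_a, w_h, w_a, identity)
vf = vertical_fusion(t_h, t_a, w_h, w_a, identity)
haf, u_a = fine_tuning_attention_fusion(t_h, t_a, w_h, w_a, identity, sum)
```

Because the softmax term sums to one and is added to s_αi, the fine-tuning weights u_αi start from the shared attention and only redistribute extra emphasis across words.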

SLIDE 13

Baselines

  • Sentiment Analysis
    • BL-SVM, LSTM-SVM
    • C-MKL, TFN, LSTM(A)
  • Emotion Recognition
    • SVM Trees, GSV-eVector
    • C-MKL, H-DMS
  • Fusion
    • Decision-level, Feature-level (utterance-level)

SLIDE 14

Sentiment Analysis Result

[Bar chart: MOSI; Weighted Accuracy and Weighted F1; axis range 60 to 78]
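
For readers unfamiliar with the second metric in this chart, weighted F1 is commonly computed as the per-class F1 score averaged with weights proportional to each class's number of true samples. The sketch below follows that common definition; the slide does not spell out the formula, and the labels are toy data, not results from the talk.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1
    return total / len(y_true)


# Toy labels (made up for illustration).
y_true = ["pos", "pos", "pos", "neg"]
y_pred = ["pos", "pos", "neg", "neg"]
score = weighted_f1(y_true, y_pred)
```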

SLIDE 15

Emotion Recognition Result

[Bar chart: IEMOCAP; Weighted Accuracy and Unweighted Accuracy; axis range 50 to 75]
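
The two accuracy figures in this chart differ in how they treat class imbalance. In common speech emotion recognition usage, weighted accuracy is the overall fraction of correct predictions and unweighted accuracy is the mean of per-class recalls. The sketch assumes those definitions, which the slide does not state, and uses toy labels.

```python
def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: every sample counts equally."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean per-class recall: every class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Toy labels (made up): a classifier that always predicts the majority class.
y_true = ["anger"] * 4 + ["happy"]
y_pred = ["anger"] * 5
wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
```

In this toy case the majority-class predictor scores 0.8 on weighted accuracy but only 0.5 on unweighted accuracy, which is why both are usually reported on imbalanced emotion data.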

SLIDE 16

Multimodal architecture is needed

[Bar charts comparing T, A, and T+A (text, audio, text plus audio): MOSI (Weighted Accuracy, Weighted F1; axis range 50 to 80) and IEMOCAP (Weighted Accuracy, Weighted F1; axis range 55 to 75)]

SLIDE 17

Generalization

[Bar charts comparing Ours-HF, Ours-VF, and Ours-HAF: MOSI to YouTube (Weighted Accuracy, Weighted F1; axis range 60 to 68) and IEMOCAP to EmotiW (Weighted Accuracy, Weighted F1; axis range 56 to 62)]

SLIDE 18

Attention Visualization

“What about the business what the hell is this” (Label: anger)

[Figure: attention distributions over the words of the utterance]

  • w_αi: Word-level acoustic attention distribution
  • t_αi: Word-level textual attention distribution
  • s_αi: Shared attention distribution
  • u_αi: Fine-tuning attention distribution

  • Carry representative information in both text and audio
  • Successfully combine both textual and acoustic attentions

SLIDE 19

Attention Visualization

“Oh you don’t like that you’re west-sider” (Label: happy)

[Figure: attention distributions over the words of the utterance]

  • w_αi: Word-level acoustic attention distribution
  • t_αi: Word-level textual attention distribution
  • s_αi: Shared attention distribution
  • u_αi: Fine-tuning attention distribution

  • Capture emphasis and importance variation
  • Vocal signal prominence

SLIDE 20

Summary

  • A hierarchical attention-based multimodal structure
  • The word-level fusion strategies
  • Word-level attention visualization

SLIDE 21

Thank you!

Email: yg202@scarletmail.rutgers.edu
Homepage: www.ieyuegu.com
