Multimodal Affective Analysis Using Hierarchical Attention Strategy with Word-Level Alignment
Yue Gu*, Kangning Yang, Shiyu Fu, Shuhong Chen, Xinyu Li, Ivan Marsic
Multimedia Image Processing Lab, Electrical and Computer Engineering Department
Why is affective analysis necessary?
- Question and Answer, Recommendation System, AI Assistant
- Diagram labels: Human Speech, Affects, Accurate Response
Progress of Affective Computing
- Emotion Recognition: Happy, Excited, Sadness, Anger, Neutral, Frustration
- Sentiment Analysis: Strong Positive, Positive, Neutral, Negative, Strong Negative
- Affective Analysis with Speech Signal Processing: MFCCs, Prosody, Vocal Quality
- Affective Analysis with Natural Language Processing: BoW, POS, CNNs, LSTMs
- Multi-Modality
Is multi-modality needed?
- Vocal signal prominence
- “Oh you don’t like that you are west-sider”: the same words appear with the labels Neutral, Frustration, and Happy, with the word “that” singled out in the first example
- Acoustic ambiguity
- “I love this city!” vs. “I hate this city!”
Challenges: Feature Extraction
- Gap between features and actual affective states
- Lack of high-level associations
- Not all parts contribute equally
Challenges: Modality Fusion
- Decision-level Fusion
- Lack of mutual association learning
- Feature-level Fusion
- Fails to learn time-dependent interactions
- Lack of consistency
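The two fusion families named on this slide can be contrasted in a few lines. This is a minimal sketch, not the code of any baseline: the function names and the plain probability averaging are illustrative assumptions.

```python
def decision_level_fusion(p_text, p_audio):
    """Each modality has its own classifier; only their class
    probabilities are merged (here by averaging, an assumption)."""
    return [(pt + pa) / 2 for pt, pa in zip(p_text, p_audio)]


def feature_level_fusion(x_text, x_audio):
    """Utterance-level feature vectors are concatenated once,
    then fed to a single classifier (not shown)."""
    return list(x_text) + list(x_audio)


fused_probs = decision_level_fusion([0.7, 0.3], [0.4, 0.6])
fused_feats = feature_level_fusion([1.0, 2.0], [3.0])
```

Neither variant sees which audio frames belong to which word, which is the gap that word-level fusion is meant to close.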
Proposed Solutions
- Feature Extraction
- Hierarchical attention-based bidirectional GRUs
- Modality Fusion
- Word-level fusion with attention
- An End-to-End multimodal network
Data Pre-processing
- Text Branch
- Word Embedding: word2vec
- Audio Branch
- Mel-frequency spectral coefficients (MFSCs)
- Synchronization
- Word-level forced alignment
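To make the synchronization step concrete, the sketch below maps word timestamps from a forced aligner onto ranges of audio frames, so that each word owns a slice of the MFSC sequence. The `(word, start, end)` tuple format, the 10 ms frame hop, and the function name are assumptions for illustration; the slides do not specify them.

```python
def frames_for_words(alignment, num_frames, hop_s=0.010):
    """Map each aligned word to a [first, last) range of frame indices.

    alignment: list of (word, start_seconds, end_seconds) tuples,
               a hypothetical forced-aligner output format.
    hop_s:     assumed hop between consecutive audio frames, in seconds.
    """
    spans = []
    for word, start_s, end_s in alignment:
        first = min(int(round(start_s / hop_s)), num_frames)
        # Give every word at least one frame, without running past the audio.
        last = min(max(int(round(end_s / hop_s)), first + 1), num_frames)
        spans.append((word, first, last))
    return spans


alignment = [("I", 0.00, 0.12), ("mean", 0.12, 0.40), ("guys", 0.40, 0.85)]
spans = frames_for_words(alignment, num_frames=85)
# Each (first, last) pair can then slice the MFSC frame matrix for that word.
```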
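[Figure: model architecture. Recoverable labels and symbols:]
- Text branch: Text (Embedded), e.g. “I”, “mean”, “guys”, … -> BiGRU -> Word-level Textual Attention, with states $t\_h_1, \dots, t\_h_N$, projections $t\_e_1, \dots, t\_e_N$, and attention weights $t\_\alpha_1, \dots, t\_\alpha_N$
- Audio branch: Audio (MFSC) -> BiGRU -> Frame-level Acoustic Attention over the frames of each word (for word 2: states $f\_h_{21}, \dots, f\_h_{2L}$, with $f\_e_{2j}$ and $f\_\alpha_{2j}$) -> BiGRU -> Word-level Acoustic Attention, with $w\_h_i$, $w\_e_i$, and $w\_\alpha_i$ for $i = 1, \dots, N$
- Word-level Fusion of Text and Audio, with word representations $V_1, \dots, V_N$ -> CNN -> Softmax Layer -> Result

Frame-level acoustic attention, for frame $j$ of word $i$:
$$f\_e_{ij} = \tanh(W_f \, f\_h_{ij} + b_f)$$
$$f\_\alpha_{ij} = \frac{\exp(f\_e_{ij}^\top v_f)}{\sum_{k=1}^{L} \exp(f\_e_{ik}^\top v_f)}$$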
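The frame-level acoustic attention on this slide is a tanh projection of each frame state followed by a softmax over the frames of one word. Below is a minimal pure-Python sketch of that computation for a single word. The function names and toy values are illustrative, and the final attention-weighted sum used to form a word vector is an assumption, since the slide gives only the two attention equations.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def frame_attention(f_h, W_f, b_f, v_f):
    """f_h: L frame states (lists of floats) belonging to one word."""
    # Projection: f_e = tanh(W_f f_h + b_f), one vector per frame.
    f_e = [[math.tanh(sum(w * x for w, x in zip(row, h)) + b)
            for row, b in zip(W_f, b_f)] for h in f_h]
    # Score each frame against the context vector v_f, then normalize.
    f_alpha = softmax([sum(e * v for e, v in zip(e_j, v_f)) for e_j in f_e])
    # Assumption: the word vector is the attention-weighted sum of frame states.
    dim = len(f_h[0])
    word_vec = [sum(a * h[d] for a, h in zip(f_alpha, f_h)) for d in range(dim)]
    return f_alpha, word_vec


f_h = [[0.1, 0.2], [0.9, -0.3], [0.4, 0.4]]      # toy states for 3 frames
W_f, b_f, v_f = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [1.0, 0.0]
f_alpha, word_vec = frame_attention(f_h, W_f, b_f, v_f)
```

In a trained model `W_f`, `b_f`, and `v_f` would be learned parameters; here they are fixed toy values so the weights can be inspected by hand.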
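Word-level Fusion
[Figure: three fusion strategies for word $i$. Recoverable labels:]
- (a) Horizontal Fusion: inputs $w\_h_i$, $w\_\alpha_i$, $t\_h_i$, $t\_\alpha_i$; per-modality representations $w\_V_i$ and $t\_V_i$; Dense Layer; output $V_i$
- (b) Vertical Fusion: inputs $w\_h_i$, $w\_\alpha_i$, $t\_h_i$, $t\_\alpha_i$; a node marked "c" and a Dense Layer giving $h_i$; shared attention $s\_\alpha_i$; output $V_i$
- (c) Fine-tuning Attention Fusion: as in (b), plus an Attention Layer giving $u\_\alpha_i$; output $V_i$

Legend:
- $w\_\alpha_i$: Word-level acoustic attention distribution
- $t\_\alpha_i$: Word-level textual attention distribution
- $w\_h_i$: Word-level acoustic contextual state
- $t\_h_i$: Word-level textual contextual state

Fine-tuning attention:
$$u\_\alpha_i = \frac{\exp(u\_e_i^\top v_u)}{\sum_{k=1}^{N} \exp(u\_e_k^\top v_u)} + s\_\alpha_i$$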
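The fine-tuning attention formula on this slide adds the shared attention to a softmax over the words of the utterance. The sketch below implements only that formula. How the shared attention is obtained is not stated on the slide, so the averaging of acoustic and textual attention used here is an assumption, as are the function name and the toy values.

```python
import math


def fine_tuning_attention(u_e, v_u, s_alpha):
    """Softmax over word scores u_e[i] . v_u, plus shared attention s_alpha[i]."""
    scores = [sum(e * v for e, v in zip(e_i, v_u)) for e_i in u_e]
    m = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total + s for e, s in zip(exps, s_alpha)]


w_alpha = [0.2, 0.5, 0.3]                 # toy word-level acoustic attention
t_alpha = [0.6, 0.1, 0.3]                 # toy word-level textual attention
# Assumption: shared attention is the element-wise mean of the two.
s_alpha = [(w + t) / 2 for w, t in zip(w_alpha, t_alpha)]

u_e = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]   # zero scores: uniform softmax
v_u = [1.0, 1.0]
u_alpha = fine_tuning_attention(u_e, v_u, s_alpha)
```

As written, the formula does not renormalize after the addition, so when the shared attention itself sums to 1 the resulting weights sum to 2.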
Baselines
- Sentiment Analysis
- BL-SVM, LSTM-SVM
- C-MKL, TFN, LSTM(A)
- Emotion Recognition
- SVM Trees, GSV-eVector
- C-MKL, H-DMS
- Fusion
- Decision-level, Feature-level (utterance-level)
Sentiment Analysis Result
[Bar chart: MOSI. Series: Weighted Accuracy, Weighted F1. Axis range: 60 to 78.]
Emotion Recognition Result
[Bar chart: IEMOCAP. Series: Weighted Accuracy, Unweighted Accuracy. Axis range: 50 to 75.]
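The two IEMOCAP metrics are easy to confuse. The sketch below follows the convention common in speech emotion recognition, which is assumed here rather than taken from the slides: weighted accuracy is the overall fraction of correct predictions, and unweighted accuracy is the mean of the per-class recalls. The function name and toy labels are illustrative.

```python
from collections import defaultdict


def weighted_and_unweighted_accuracy(y_true, y_pred):
    """Return (WA, UA) under the assumed convention described above."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    wa = sum(correct.values()) / len(y_true)                      # overall accuracy
    ua = sum(correct[c] / total[c] for c in total) / len(total)   # mean class recall
    return wa, ua


y_true = ["anger", "anger", "anger", "happy"]
y_pred = ["anger", "anger", "anger", "anger"]
wa, ua = weighted_and_unweighted_accuracy(y_true, y_pred)
```

On imbalanced data the two diverge: a model that always predicts the majority class scores well on weighted accuracy but poorly on unweighted accuracy.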
Multimodal architecture is needed
[Two bar charts comparing T (text), A (audio), and T+A inputs. MOSI: Weighted Accuracy and Weighted F1, axis range 50 to 80. IEMOCAP: Weighted Accuracy and Weighted F1, axis range 55 to 75.]
Generalization
[Two bar charts comparing Ours-HF, Ours-VF, and Ours-HAF. MOSI to YouTube: Weighted Accuracy and Weighted F1, axis range 60 to 68. IEMOCAP to EmotiW: Weighted Accuracy and Weighted F1, axis range 56 to 62.]
Attention Visualization
“What about the business what the hell is this”
Label: anger
[Figure: per-word attention maps. Legend:]
- $w\_\alpha_i$: Word-level acoustic attention distribution
- $t\_\alpha_i$: Word-level textual attention distribution
- $s\_\alpha_i$: Shared attention distribution
- $u\_\alpha_i$: Fine-tuning attention distribution
Observations:
- Carry representative information in both text and audio
- Successfully combine both textual and acoustic attentions
Attention Visualization
“Oh you don’t like that you’re west-sider”
Label: happy
[Figure: per-word attention maps. Legend:]
- $w\_\alpha_i$: Word-level acoustic attention distribution
- $t\_\alpha_i$: Word-level textual attention distribution
- $s\_\alpha_i$: Shared attention distribution
- $u\_\alpha_i$: Fine-tuning attention distribution
Observations:
- Capture emphasis and importance variation
- Vocal signal prominence
Summary
- A hierarchical attention-based multimodal structure
- The word-level fusion strategies
- Word-level attention visualization
Thank you!
Email: yg202@scarletmail.rutgers.edu
Homepage: www.ieyuegu.com