1
Presenter: Paul Pu Liang
Paul Pu Liang, Ziyin Liu, Amir Zadeh, Louis-Philippe Morency
Multimodal Language Analysis with Recurrent Multistage Fusion
Paul Pu Liang Multimodal Language Analysis with Recurrent Multistage Fusion
2
Progress of Artificial Intelligence
- Multimedia Content
- Intelligent Personal Assistants
- Robots and Virtual Agents
3-4
Multimodal Language Modalities
Language: lexicon, syntax, pragmatics
Visual: gestures, body language, eye contact, facial expressions
Acoustic: prosody, vocal expressions
Prediction targets
Sentiment: positive, negative
Emotion: anger, disgust, fear, happiness, sadness, surprise
Personality: confidence, persuasion, passion
5
Challenge 1: Intra-modal Interactions
a) Temporal sequences
[Figure: speaker's behaviors ("This movie is great", smile, head nod) and sentiment intensity, each plotted over time]
6
Challenge 2: Cross-modal Interactions
a) Multiple co-occurring interactions
b) Different weighted combinations
[Figure: speaker's behaviors ("This movie is great", smile, loud voice) and sentiment intensity, plotted over time]
7-9
Multistage Aggregation in Humans (Parsini et al. 2015, Taylor et al. 2017)
[Figure, built up over three slides: first wide smile and loud voice; then positive reaction and positive words; then excitement and joyous]
10
Computational Model for Multistage Fusion
[Figure: the cues excitement, joyous, wide smile, loud voice, positive reaction and positive words, shown with a box labeled "Computational Model"]
11-14
Multimodal Descriptors
[Figure, built up over four slides: Language, Visual and Acoustic streams along a time axis; slide labels: "type", "He's", "time", "average", "multimodal descriptors"]
Language descriptors: neutral word, …
Visual descriptors: shrug, frown, …
Acoustic descriptors: loud voice, speech elongation, …
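The "average" label on the descriptor slides suggests that frame-level visual and acoustic features are pooled to word level. The sketch below illustrates that idea only. It assumes each word has a start and end time and each frame has a timestamp; the helper name `average_over_words` and the toy numbers are hypothetical and do not come from the talk.

```python
from typing import List, Tuple


def average_over_words(
    word_spans: List[Tuple[float, float]],
    frame_times: List[float],
    frame_feats: List[List[float]],
) -> List[List[float]]:
    """Average frame-level features inside each word's [start, end) interval.

    Hypothetical helper: a word that covers no frame gets an all-zero vector.
    """
    dim = len(frame_feats[0])
    out = []
    for start, end in word_spans:
        # Collect the feature rows whose timestamp falls inside this word.
        rows = [f for t, f in zip(frame_times, frame_feats) if start <= t < end]
        if rows:
            out.append([sum(col) / len(rows) for col in zip(*rows)])
        else:
            out.append([0.0] * dim)
    return out


# Toy input: two words, four acoustic frames with two features each.
spans = [(0.0, 0.5), (0.5, 1.0)]
times = [0.1, 0.3, 0.6, 0.9]
feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
word_level = average_over_words(spans, times, feats)
print(word_level)  # [[2.0, 3.0], [6.0, 7.0]]
```

After this step every modality has one vector per word, so the three streams share the same time axis.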
15-23
Multistage Fusion
Input descriptors: neutral word, frown, shrug, loud voice, speech elongation, …
Each stage applies HIGHLIGHT and then FUSE. Labels shown as the example is built up over nine slides:
stage 1: negative, negative
stage 2: emphasis, strongly negative
stage 3: ambivalence, disappointed
24
Intra-modal Recurrent Networks
[Figure: one LSTHM per modality (LSTHM_l, LSTHM_v, LSTHM_a), each drawn at time t and time t + 1, producing h_t^l, h_t^v and h_t^a]
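The slide shows a separate recurrent network for each modality. The sketch below illustrates that layout only. It swaps in a plain tanh recurrent cell where the talk uses an LSTHM, and the class name `SimpleRecurrentCell`, the modality keys `l`, `v`, `a`, and all sizes are assumptions made for the example.

```python
import math
import random
from typing import List

Vector = List[float]


def make_matrix(rows: int, cols: int, rng: random.Random) -> List[Vector]:
    """Small random weight matrix (toy initialization)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m: List[Vector], v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class SimpleRecurrentCell:
    """Plain tanh RNN cell standing in for an LSTHM (an assumption)."""

    def __init__(self, input_dim: int, hidden_dim: int, rng: random.Random):
        self.w_in = make_matrix(hidden_dim, input_dim, rng)
        self.w_rec = make_matrix(hidden_dim, hidden_dim, rng)

    def step(self, x: Vector, h_prev: Vector) -> Vector:
        a = matvec(self.w_in, x)
        b = matvec(self.w_rec, h_prev)
        return [math.tanh(p + q) for p, q in zip(a, b)]


rng = random.Random(0)
input_dims = {"l": 4, "v": 3, "a": 2}  # toy feature sizes per modality
cells = {m: SimpleRecurrentCell(d, 5, rng) for m, d in input_dims.items()}
hidden = {m: [0.0] * 5 for m in input_dims}

# Two time steps of toy inputs; each modality updates only its own state.
for t in range(2):
    inputs = {m: [rng.uniform(-1, 1) for _ in range(d)] for m, d in input_dims.items()}
    hidden = {m: cells[m].step(inputs[m], hidden[m]) for m in input_dims}
```

Keeping one cell per modality means intra-modal dynamics are modeled separately before any cross-modal step sees them.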
25-33
Multistage Fusion Process
[Figure, built up over nine slides: h_t^l, h_t^v and h_t^a enter the Multistage Fusion Process. Stages 1, 2, …, K each apply HIGHLIGHT followed by FUSE; after stage K, SUMMARIZE produces the output z_t. The recurrent modules are labeled "Highlight LSTM" and "Fuse LSTM".]
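The HIGHLIGHT, FUSE and SUMMARIZE steps can be illustrated with a toy loop. This is a sketch under stated assumptions and not the model from the talk: the slides use a Highlight LSTM and a Fuse LSTM, whereas the functions below are simple stand-ins (softmax attention, a tanh update, and a mean), and every function name is hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def highlight(h: Vector, fused: Vector) -> Vector:
    """Toy HIGHLIGHT: attention over the entries of h, conditioned on the
    running fused state so that later stages can attend differently."""
    return softmax([x + s for x, s in zip(h, fused)])


def fuse(highlighted: Vector, fused_prev: Vector) -> Vector:
    """Toy FUSE: blend the highlighted signal into the running state."""
    return [math.tanh(p + q) for p, q in zip(fused_prev, highlighted)]


def summarize(fused: Vector) -> float:
    """Toy SUMMARIZE: reduce the final fused state to a single value."""
    return sum(fused) / len(fused)


def multistage_fusion(
    h_l: Vector, h_v: Vector, h_a: Vector, num_stages: int = 3
) -> Tuple[float, List[Vector]]:
    h = h_l + h_v + h_a  # concatenate the three modality vectors
    fused = [0.0] * len(h)
    attention_per_stage = []
    for _ in range(num_stages):
        attn = highlight(h, fused)
        highlighted = [a * x for a, x in zip(attn, h)]
        fused = fuse(highlighted, fused)
        attention_per_stage.append(attn)
    return summarize(fused), attention_per_stage


z_t, attn = multistage_fusion([0.2, -0.1], [0.5], [0.9, -0.3])
```

Returning the per-stage attention alongside the output makes it possible to inspect which entries each stage weighted most.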
34-35
Recurrent Multistage Fusion Network
[Figure: LSTHM_l, LSTHM_v and LSTHM_a at time t and time t + 1. Their outputs h_t^l, h_t^v and h_t^a pass through the Multistage Fusion Process (HIGHLIGHT and FUSE at stages 1, 2, …, K, then SUMMARIZE) to give z_t.]
36
Baseline Models
1. Non-temporal Models
- SVM (Cortes and Vapnik, 1995), DF (Nojavanasghari et al., 2016)
2. Early Fusion
- EF-LSTM (Hochreiter and Schmidhuber, 1997), EF-RHN (Zilly et al., 2016)
3. Late Fusion
- LMF (Liu et al., 2018), TFN (Zadeh et al., 2017), BC-LSTM (Poria et al., 2017)
4. Multi-view Learning
- MV-LSTM (Rajagopalan et al., 2016)
5. Memory-based Models
- MARN, MFN (Zadeh et al., 2018)
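The difference between the early-fusion and late-fusion categories can be shown with a toy example. This sketch only contrasts where the modalities are combined; the listed baselines are far more elaborate, and the linear scorers, weights and function names here are illustrative assumptions.

```python
from typing import Dict, List

Vector = List[float]


def linear_score(weights: Vector, features: Vector) -> float:
    return sum(w * x for w, x in zip(weights, features))


def early_fusion(features: Dict[str, Vector], weights: Vector) -> float:
    """Concatenate all modalities, then apply one model to the joint vector."""
    joint = features["language"] + features["visual"] + features["acoustic"]
    return linear_score(weights, joint)


def late_fusion(
    features: Dict[str, Vector], weights_per_modality: Dict[str, Vector]
) -> float:
    """Score each modality separately, then average the per-modality scores."""
    scores = [linear_score(weights_per_modality[m], features[m]) for m in features]
    return sum(scores) / len(scores)


features = {"language": [1.0, 0.0], "visual": [0.5], "acoustic": [2.0]}
early = early_fusion(features, [1.0, 1.0, 2.0, 0.5])
late = late_fusion(
    features,
    {"language": [1.0, 1.0], "visual": [2.0], "acoustic": [0.5]},
)
print(early, late)  # 3.0 1.0
```

Early fusion lets one model see all features at once, while late fusion keeps the modalities apart until the decision level.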
37
State-of-the-art Results
[Bar chart: CMU-MOSI Sentiment (Binary Accuracy), axis from 73 to 77, baseline models vs. RMFN. Baselines on the axis: SVM-MD, DF, EF-RHN, EF-LSTM, TFN, BC-LSTM, MV-LSTM, MARN, MFN, Graph-MFN. Value labeled on the chart: 78.4%]
38
State-of-the-art Results
[Four bar charts, each comparing the best baseline model with RMFN]
CMU-MOSI Sentiment (Correlation), axis from 0.45 to 0.55
POM Personality Traits (Multiclass Accuracy), axis from 44.65 to 45.15
IEMOCAP Happy Emotion (Binary Accuracy), axis from 60 to 63
IEMOCAP Sad Emotion (Binary Accuracy), axis from 60 to 70
Best-baseline labels on the charts: MV-LSTM, MFN, MARN, MFN
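The two sentiment metrics named on the result slides, binary accuracy and correlation, can be computed as follows. This is a minimal sketch: it assumes real-valued sentiment scores, assumes that the binary label is obtained by thresholding at zero, and uses made-up numbers that are not results from the talk.

```python
import math
from typing import List


def binary_accuracy(pred: List[float], gold: List[float]) -> float:
    """Fraction of items where prediction and label fall on the same side of 0.

    Thresholding at zero is an assumption made for this example.
    """
    hits = sum((p >= 0) == (g >= 0) for p, g in zip(pred, gold))
    return hits / len(gold)


def pearson_correlation(pred: List[float], gold: List[float]) -> float:
    """Pearson correlation between predicted and gold scores."""
    n = len(gold)
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold))
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    std_g = math.sqrt(sum((g - mean_g) ** 2 for g in gold))
    return cov / (std_p * std_g)


# Made-up scores for illustration only.
gold = [2.0, -1.0, 0.5, -2.5]
pred = [1.5, 0.2, 1.0, -2.0]
acc = binary_accuracy(pred, gold)
corr = pearson_correlation(pred, gold)
print(acc)  # 0.75
```

Binary accuracy only checks the sign of each prediction, while correlation also rewards getting the relative intensity right.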
39
Results
[Bar chart: IEMOCAP Neutral Emotion (Binary Accuracy), axis from 60 to 74, best baseline model (MFN) vs. RMFN]
40
Multiple Stages are Important
[Two charts, x-axis: number of stages (1 to 5)]
CMU-MOSI Sentiment Analysis (Binary Accuracy), axis from 74 to 80
CMU-MOSI Sentiment Analysis (Multiclass Accuracy), axis from 40 to 56
41
Ablation Studies
42-51
Interpretable Fusion
Example utterance: "I thought it was fun", annotated with (elongation) and (emphasis)
[Figure, built up over ten slides: maps on a low-to-high color scale over the Language, Visual and Acoustic representations h_t^l, h_t^v and h_t^a, shown across stages and across time from t = 1 to t = 5]
Points highlighted on successive slides:
- Across Stages
- Across Time
- Multimodal Priors
- Synchronized Interactions
52-54
Asynchronous Trimodal Interactions
Example utterance: "He delivers a lot", annotated with (emphasis), (smile), (smile)
[Figure, built up over three slides: maps on a low-to-high color scale over the Language, Visual and Acoustic representations h_t^l, h_t^v and h_t^a, shown across stages and across time from t = 1 to t = 6]
55-56
Bimodal Interactions
Example utterance: "It doesn’t give any insight or help", annotated with (soft), (emphasis), (disappointed)
[Figure: maps on a low-to-high color scale over the Language, Visual and Acoustic representations h_t^l, h_t^v and h_t^a, shown across stages and across time from t = 1 to t = 7]
57
Recurrent Multistage Fusion Network
[Figure: LSTHM_l, LSTHM_v and LSTHM_a at time t and time t + 1. Their outputs h_t^l, h_t^v and h_t^a pass through the Multistage Fusion Process (HIGHLIGHT and FUSE at stages 1, 2, …, K, then SUMMARIZE) to give z_t.]
58
Website: www.cs.cmu.edu/~pliang
Email: pliang@cs.cmu.edu
Twitter: @pliang279
[Figure: the Multistage Fusion Process diagram, with h_t^l, h_t^v and h_t^a as inputs, HIGHLIGHT and FUSE at stages 1, 2, …, K, and SUMMARIZE producing z_t]