Backpropagating through Structured Argmax using a SPIGOT
Hao Peng, Sam Thomson, Noah A. Smith @ACL July 17, 2018
Overview
[Figure: the input sentence "Shareholders took their money" goes to a Parser; the arg max parse of the sentence goes to a Downstream task, which produces Loss L.]
❖ Head token (Yang and Mitchell, 2017)
❖ Tree-RNN (Tai et al., 2015)
❖ Graph CNN (Kipf and Welling, 2017)
❖ …
arg max: a layer in the computation graph?
Non-differentiable
Aim
[Figure: "Shareholders took their money" → Intermediate parser θ → arg max → parse → Downstream task → Loss L. The backward path through arg max is marked "?": how to obtain $\nabla_\theta L$.]
Motivation
Ji and Smith, 2017; Oepen et al., 2017
universally optimal.
Williams, 2017
Challenges
SPIGOT: Structured Prediction Intermediate Gradients Optimization Technique
A proxy for the gradient through arg max
Method
❖ Background: structured prediction as linear programs
❖ Method: SPIGOT algorithm
❖ Experiments
Input → Score → Output
Arcs: one candidate arc per (head, mod) pair
$s_\theta = [s_\theta(\cdot), s_\theta(\cdot), s_\theta(\cdot), \dots, s_\theta(\cdot)]^\top$, one score per candidate arc
$z = [1, 0, 1, \dots, 0]^\top$, with $z_i = 1$ if arc $i$ is in the output
Output: $\hat{z} = \arg\max_z z^\top s_\theta$ s.t. z forms a tree
[Figure: candidate arcs over "Shareholders took their money", e.g. between "their" and "money", "took" and "their", "took" and "money", "their" and "took"]
Roth and Yih, 2004; Martins et al., 2009
Relaxation: $z_i \in \{0, 1\}$ becomes $z_i \in [0, 1]$
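To make the linear-program view concrete, here is a minimal sketch that maximizes $z^\top s_\theta$ over all head assignments that form a tree. It is a brute-force enumeration for a four-word sentence, not the decoder used in the talk. The arc scores, the artificial root node 0, and the function names are assumptions made for illustration, and a single-root constraint is not enforced.

```python
from itertools import product


def is_tree(heads):
    """heads[m - 1] is the head of word m; 0 denotes an artificial root.

    Returns True if every word reaches the root without revisiting a node.
    """
    for start in range(1, len(heads) + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen:  # cycle (this also rejects self-loops)
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


def arg_max_tree(scores, n):
    """Brute-force arg max of z^T s over head assignments that form a tree."""
    best_heads, best_score = None, float("-inf")
    for heads in product(range(n + 1), repeat=n):
        if not is_tree(heads):
            continue
        total = sum(scores[(h, m)] for m, h in enumerate(heads, start=1))
        if total > best_score:
            best_heads, best_score = heads, total
    return best_heads, best_score


words = ["Shareholders", "took", "their", "money"]
n = len(words)

# Hypothetical scores s_theta(head -> mod): 0 everywhere except four arcs.
scores = {(h, m): 0.0 for h in range(n + 1) for m in range(1, n + 1) if h != m}
for arc in [(0, 2), (2, 1), (2, 4), (4, 3)]:
    scores[arc] = 1.0

heads, score = arg_max_tree(scores, n)

# Indicator vector z: one entry per candidate arc, 1 if the arc is in the tree.
z = {arc: int(heads[arc[1] - 1] == arc[0]) for arc in scores}
objective = sum(z[arc] * scores[arc] for arc in scores)
```

Here `heads` lists the head index of each word in order, and `objective` recomputes the score as the inner product of the indicator vector with the arc scores.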
Method: SPIGOT algorithm
[Figure: "Shareholders took their money" → scores $s_\theta(\cdot)$ for each candidate arc → arg max s.t. z forms a tree → parse → Downstream task → Loss L]
Backward pass: Loss L → Backprop → $\nabla_{\hat{z}} L$ → Proxy → $\nabla_s L$ → Backprop → $\nabla_\theta L$
We need: $\nabla_s L$
We have: $\nabla_{\hat{z}} L$
Chain rule (Leibniz, 1676): $\nabla_s L = \left(\frac{\partial \hat{z}}{\partial s}\right)^\top \nabla_{\hat{z}} L$
But $\hat{z}$ is the solution of an arg max s.t. z forms a tree: Jacobian not defined
Straight-through Estimator (STE): Hinton, 2012; Bengio et al., 2013
Straight-through Estimator (STE)
$\hat{z} = [1, 0, 1, \dots, 0]^\top$
$\nabla_{\hat{z}} L = [0.3, 0.5, 0.4, \dots, 0.2]$
STE: $\nabla_s L \triangleq \nabla_{\hat{z}} L$
[Figure: two parses of "Shareholders took their money"]
SPIGOT
$p = \hat{z} - \eta \nabla_{\hat{z}} L$ (gradient step from $\hat{z}$ with step size $\eta$)
$q = \mathrm{proj}(p)$ (projection back onto the relaxed feasible set)
SPIGOT: $\nabla_s L \triangleq \hat{z} - q$
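The two gradient proxies can be compared numerically. The sketch below uses four-entry versions of the slide's vectors (the elided middle entries are dropped) and makes a loud simplifying assumption: the feasible set is only the box $z_i \in [0, 1]$, so the projection is a clip. The real relaxed set also carries the tree constraints, and projecting onto it needs a dedicated solver. The function names and the step size `eta = 1.0` are hypothetical.

```python
def ste_gradient(grad_z):
    """STE: pass the gradient w.r.t. z_hat through unchanged."""
    return list(grad_z)


def project_box(p):
    """Projection onto [0, 1]^n (assumption: tree constraints are ignored)."""
    return [min(1.0, max(0.0, x)) for x in p]


def spigot_gradient(z_hat, grad_z, eta=1.0, project=project_box):
    """SPIGOT: step from z_hat, project back, return z_hat - q."""
    p = [z - eta * g for z, g in zip(z_hat, grad_z)]
    q = project(p)
    return [z - qi for z, qi in zip(z_hat, q)]


z_hat = [1.0, 0.0, 1.0, 0.0]
grad_z = [0.3, 0.5, 0.4, 0.2]

ste = ste_gradient(grad_z)
spigot = spigot_gradient(z_hat, grad_z)
```

Under these assumptions `ste` keeps all four gradient entries, while `spigot` is approximately `[0.3, 0.0, 0.4, 0.0]`: the entries that would push an absent arc below 0 are removed by the projection.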
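Forward: Input → Parser θ → scores $s_\theta(\cdot)$ for each candidate arc → $\hat{z}$ = arg max s.t. z forms a tree → Downstream task φ → Loss L
Backward: Backprop → Project → Backprop
❖ Backprop: Loss L gives $\nabla_{\hat{z}} L$
❖ Project onto the relaxed feasible set: $p = \hat{z} - \eta \nabla_{\hat{z}} L$, $q = \mathrm{proj}(p)$, $\nabla_s L \triangleq \hat{z} - q$
❖ Backprop: $\nabla_s L$ gives $\nabla_\theta L$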
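The forward and backward passes above can be traced end to end on a toy problem. This is a sketch under strong assumptions, not the setup from the talk: the "parser" chooses one of three candidates (so the feasible set is the probability simplex rather than the tree polytope), the scores are the parameters themselves ($s = \theta$), and the downstream loss is a hypothetical per-candidate cost. The simplex projection is the standard sort-based algorithm.

```python
def project_simplex(p):
    """Euclidean projection of p onto {q : q_i >= 0, sum(q) = 1}."""
    u = sorted(p, reverse=True)
    cumsum, tau = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumsum += uj
        t = (cumsum - 1.0) / j
        if uj - t > 0:  # holds for a prefix of the sorted entries
            tau = t
    return [max(x - tau, 0.0) for x in p]


def forward(theta, costs):
    """Scores are s = theta; z_hat is the one-hot arg max; L = z_hat . costs."""
    best = max(range(len(theta)), key=lambda i: theta[i])
    z_hat = [1.0 if i == best else 0.0 for i in range(len(theta))]
    loss = sum(z * c for z, c in zip(z_hat, costs))
    return z_hat, loss


def spigot_backward(z_hat, grad_z, eta):
    """Proxy gradient w.r.t. the scores: z_hat - proj(z_hat - eta * grad_z)."""
    p = [z - eta * g for z, g in zip(z_hat, grad_z)]
    q = project_simplex(p)
    return [z - qi for z, qi in zip(z_hat, q)]


theta = [0.2, 0.1, 0.0]   # hypothetical initial parser parameters
costs = [0.9, 0.1, 0.5]   # hypothetical downstream loss of each candidate
eta, lr = 1.0, 0.5
losses = []
for _ in range(3):
    z_hat, loss = forward(theta, costs)
    losses.append(loss)
    grad_z = costs  # dL/dz_hat, since L is linear in z_hat
    grad_s = spigot_backward(z_hat, grad_z, eta)
    # grad_theta equals grad_s here because s = theta.
    theta = [t - lr * g for t, g in zip(theta, grad_s)]
```

In this toy run the first update moves the arg max from the highest-cost candidate to the lowest-cost one, after which the proxy gradient is close to zero and the choice stays put.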
Structured Attention: Kim et al., 2017
[Table: Pipeline, STE, Structured Att., and SPIGOT compared by "Hard decision on $\hat{z}$" and "Backprop"; entries include "Marginal" and "Projection"]
SPIGOT: $\hat{z} = \arg\max(\dots)$
Structured Attention: $\hat{z} = \mathrm{softmax}(\dots)$
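The difference between a hard decision and a soft one shows up in the forward pass. The sketch below contrasts a one-hot arg max with a softmax over the same scores. It covers only the unstructured case: structured attention computes marginals over structures, which this plain softmax merely stands in for. The scores are hypothetical.

```python
import math


def hard_arg_max(scores):
    """One-hot vector selecting the highest-scoring candidate."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1.0 if i == best else 0.0 for i in range(len(scores))]


def softmax(scores):
    """Soft selection: a probability for every candidate."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [1.0, 2.0, 0.5]  # hypothetical scores
hard = hard_arg_max(scores)
soft = softmax(scores)
```

The hard output commits to a single candidate, so the downstream task sees a discrete structure. The soft output spreads weight over all candidates and is differentiable, at the price of feeding the downstream task a mixture.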
Joint learning
Swayamdipta et al., 2016
[Figure: Parser θ → arg max → Downstream task φ, with Training data. Parser loss L1 gives $\nabla_\theta L_1$; downstream Loss L2 gives $\nabla_\phi L_2$ and $\nabla_\theta L_2$.]
Induce latent structures
Yogatama et al., 2017; Williams et al., 2017
[Figure: Parser θ → arg max → Downstream task φ → Loss L, with Training data. Loss L gives $\nabla_\theta L$ and $\nabla_\phi L$.]
Experiments
[Pipeline: Input "Shareholders took their money" → Syntactic Parser → arg max → Syntactic tree → Semantic Parser φ → Semantic graph with arcs arg1, arg2, poss]
Syntactic Parser
❖ BiLSTM + MLP (Kiperwasser and Goldberg, 2016)
❖ Eisner Algorithm (Eisner, 1996)
Semantic Parser φ
❖ NeurboParser (Peng et al., 2017)
❖ Concat head token embedding (head tokens in the figure: took, took, money, root)
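"Concat head token embedding" can be read as: for each token, append the embedding of its syntactic head to its own embedding before passing it to the semantic parser. A minimal sketch of that reading follows. The two-dimensional embeddings, the zero vector for the root, and the function name are hypothetical, and the head indices are one plausible tree for the example sentence.

```python
def concat_head_embeddings(tokens, heads, embeddings, root_vector):
    """For each token, concatenate its embedding with its head's embedding.

    heads[i] is the 1-based index of the head of tokens[i]; 0 means root.
    """
    features = []
    for token, head in zip(tokens, heads):
        head_vec = root_vector if head == 0 else embeddings[tokens[head - 1]]
        features.append(embeddings[token] + head_vec)  # list concatenation
    return features


tokens = ["Shareholders", "took", "their", "money"]
heads = [2, 0, 4, 2]  # Shareholders <- took, took <- root, their <- money, money <- took
embeddings = {
    "Shareholders": [1.0, 0.0],
    "took": [0.0, 1.0],
    "their": [1.0, 1.0],
    "money": [0.5, 0.5],
}
root_vector = [0.0, 0.0]

features = concat_head_embeddings(tokens, heads, embeddings, root_vector)
```

Each entry of `features` is twice the embedding size: the token's own vector followed by its head's vector. Changing the predicted tree changes only the second half, which is how the parse reaches the downstream model.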
SemEval ’15. Micro-averaged labeled F1
[Bar chart: in-domain F1, axis from 80 to 88, for Neurbo, Pipeline, STE, Structured Att., and SPIGOT]
[Table: Syntax and Backprop for each of the five systems; entries include N/A, Hard decision ($\hat{z}$), and Projection]
Neurbo: Peng et al., 2017
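Micro-averaging pools the counts before computing precision and recall, rather than averaging per-item F1 scores. A minimal sketch of the computation, with made-up counts that are not results from the talk:

```python
def micro_f1(counts):
    """counts: list of (num_correct, num_predicted, num_gold) triples."""
    correct = sum(c for c, _, _ in counts)
    predicted = sum(p for _, p, _ in counts)
    gold = sum(g for _, _, g in counts)
    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical labeled-arc counts for two evaluation items.
counts = [(8, 10, 10), (1, 2, 4)]
f1 = micro_f1(counts)
```

With pooled counts, larger items weigh more: here the total is 9 correct out of 12 predicted and 14 gold arcs.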
[Pipeline: Input "Shareholders took their money" → Semantic Parser → arg max → Semantic graph with arcs arg1, arg2, poss → Classifier φ → Positive? Negative?]
Semantic Parser
❖ NeurboParser (Peng et al., 2017)
❖ AD3 (Martins et al., 2011)
Classifier φ
❖ BiLSTM+MLP
❖ Concat head token and role (took: arg1; took: arg2; their: poss; …)
Stanford Sentiment Treebank accuracy
[Bar chart: Accuracy, axis from 82 to 88, for BiLSTM, Pipeline, STE, and SPIGOT]