Backpropagating through Structured Argmax using a SPIGOT, Hao Peng (presentation slides)

SLIDE 1

Backpropagating through Structured Argmax using a SPIGOT

Hao Peng, Sam Thomson, Noah A. Smith @ACL July 17, 2018

SLIDE 2

Overview

[Figure: the sentence "Shareholders took their money" is passed to a parser; the parser's arg max output, a tree over the sentence, is passed to a downstream task with loss L.]

SLIDE 3

Overview

Downstream models that consume the parser's output:

Head token (Yang and Mitchell, 2017)

Tree-RNN (Tai et al., 2015)

Graph CNN (Kipf and Welling, 2017)

…

SLIDE 4

Overview

Is the parser's arg max a layer in the computation graph?

SLIDE 5

Overview

A layer in the computation graph? The arg max is non-differentiable.

SLIDE 6

Aim

Structured prediction as a layer.

Overview

[Figure: "Shareholders took their money" → intermediate parser θ → arg max → parse of the sentence → downstream task → loss L. Open question (marked "?"): how to obtain $\nabla_\theta L$.]

Motivation

Structures help (Ji and Smith, 2017; Oepen et al., 2017).

Linguistic structures may not be universally optimal (Williams, 2017).

SLIDE 7

Challenges

argmax is non-differentiable.

SLIDE 8

Method

SPIGOT: Structured Prediction Intermediate Gradients Optimization Technique

A proxy for the gradient of the non-differentiable arg max.

SLIDE 9

Outline

❖ Background: structured prediction as linear programs

❖ Method: SPIGOT algorithm

❖ Experiments

SLIDE 10

Structured Prediction Reviewed

Input: the sentence "Shareholders took their money"

Output: a tree over the same sentence

SLIDE 11

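Structured Prediction Reviewed

Input: "Shareholders took their money"

Score: the score of a tree is the sum of the scores of its arcs, each arc going from a head to a modifier:

$S_\theta(\text{tree}) = \sum_{(\text{head} \to \text{mod}) \in \text{arcs}} s_\theta(\text{head} \to \text{mod})$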

SLIDE 12

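Structured Prediction Reviewed

Input: "Shareholders took their money"

Score: one score per candidate arc (the slide draws arcs between word pairs such as took/their and took/money), stacked into a vector:

$s_\theta = [\, s_\theta(\cdot),\ s_\theta(\cdot),\ s_\theta(\cdot),\ \ldots,\ s_\theta(\cdot) \,]^\top$

$z = [\, 1?,\ 0?,\ 1?,\ \ldots,\ 0? \,]^\top$, one binary entry per candidate arc

Output:

$\hat z = \arg\max_z z^\top s_\theta$ s.t. $z$ forms a tree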
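The arg max above can be sketched by brute force on the example sentence. This is only an illustration under stated assumptions: the arc scores in `ARC_SCORES` are made up (they are not from the talk), word 0 is an artificial root, and exhaustive enumeration stands in for a real decoder.

```python
from itertools import product

WORDS = ["Shareholders", "took", "their", "money"]  # word m is WORDS[m - 1]; 0 is an artificial root

# Hypothetical arc scores s(head -> mod); every arc not listed scores 0.
ARC_SCORES = {(0, 2): 5, (2, 1): 4, (2, 4): 4, (4, 3): 3}


def arc_score(head, mod):
    return ARC_SCORES.get((head, mod), 0)


def reaches_root(heads):
    """True if following head pointers from every word ends at the root (no cycles)."""
    for start in range(1, len(heads) + 1):
        seen, node = set(), start
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


def argmax_tree(n):
    """Enumerate all head assignments for n words and keep the best-scoring tree."""
    best_heads, best_total = None, float("-inf")
    for heads in product(range(n + 1), repeat=n):
        if not reaches_root(heads):
            continue
        total = sum(arc_score(h, m) for m, h in enumerate(heads, start=1))
        if total > best_total:
            best_heads, best_total = heads, total
    return best_heads, best_total


best_heads, best_total = argmax_tree(len(WORDS))

# The same score written as z^T s over all candidate arcs.
arcs = [(h, m) for h in range(len(WORDS) + 1) for m in range(1, len(WORDS) + 1) if h != m]
z = [1 if best_heads[m - 1] == h else 0 for h, m in arcs]
s = [arc_score(h, m) for h, m in arcs]
z_dot_s = sum(zi * si for zi, si in zip(z, s))
```

Here `z` has exactly one active arc per word, and `z_dot_s` equals the tree score found by the search. Enumeration is exponential in sentence length, so it only serves to make the objective concrete.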

SLIDE 13

Linear Programming Formulation

$\hat z = \arg\max_z z^\top s_\theta$ s.t. $z$ forms a tree

The tree constraint is written as linear constraints, $Az \le b$:

$\hat z = \arg\max_z z^\top s_\theta$ s.t. $Az \le b$

Roth and Yih, 2004; Martins et al., 2009

SLIDE 14

Linear Programming Formulation

Relaxation: $z_i \in \{0, 1\}$ becomes $z_i \in [0, 1]$.

SLIDE 15

Outline

❖ Background: structured prediction as linear programs

❖ Method: SPIGOT algorithm

❖ Experiments

SLIDE 16

Backprop

[Figure: parser scores $s_\theta$ → $\hat z = \arg\max_z z^\top s_\theta$ s.t. $z$ forms a tree → downstream task → loss $L$. Goal: $\nabla_\theta L$.]

SLIDE 17

Backprop

Backprop from the loss $L$ to the parse $\hat z$ gives $\nabla_{\hat z} L$. Goal: $\nabla_\theta L$.

SLIDE 18

Backprop

Backprop from $L$ to $\hat z$ gives $\nabla_{\hat z} L$. Backprop from $\nabla_s L$ through the parser gives $\nabla_\theta L$.

SLIDE 19

Backprop

Between the two backprop steps, going from $\nabla_{\hat z} L$ to $\nabla_s L$ across the arg max uses a proxy.

SLIDE 20

Backprop

We need: $\nabla_s L$

We have: $\nabla_{\hat z} L$

SLIDE 21

Backprop

Chain rule (Leibniz, 1676): $\nabla_s L = J \, \nabla_{\hat z} L$, where $J$ is the Jacobian relating $\hat z$ and $s$.

SLIDE 22

Backprop

For $\hat z = \arg\max_z z^\top s_\theta$ s.t. $z$ forms a tree, the Jacobian $J$ in $\nabla_s L = J \, \nabla_{\hat z} L$ is not defined.

SLIDE 23

Backprop

Straight-through Estimator (STE) (Hinton, 2012; Bengio et al., 2013):

$\nabla_s L \triangleq \nabla_{\hat z} L$

SLIDE 24

Some Geometry…

[Figure: the polytope $Az \le b$, with the parse $\hat z = [1, 0, 1, \cdots, 0]^\top$ at one of its vertices.]

Straight-through Estimator (STE): $\nabla_s L \triangleq \nabla_{\hat z} L$

SLIDE 25

Some Geometry…

Example gradient: $\nabla_{\hat z} L = [0.3, 0.5, 0.4, \ldots, 0.2]$

SLIDE 26

Some Geometry…

Stepping from $\hat z$ against the gradient: $p = \hat z - \eta \nabla_{\hat z} L$, with step size $\eta$.

SLIDE 27

Some Geometry…

[Figure: SPIGOT adds a point $q$ in the polytope $Az \le b$, alongside $p = \hat z - \eta \nabla_{\hat z} L$.]

SLIDE 28

Some Geometry…

SPIGOT:

$p = \hat z - \eta \nabla_{\hat z} L$

$q = \operatorname{proj}(p)$, the projection of $p$ onto the polytope $Az \le b$

$\nabla_s L \triangleq \hat z - q$
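To make the difference between STE and SPIGOT concrete, here is a minimal sketch under loud assumptions: the polytope is replaced by the unit box $[0, 1]^n$ (so the projection is plain clipping, which is not the tree polytope), the vectors are cut to four entries taken from the slide's example values, and the step size `eta` is set to 1. The function names are illustrative only.

```python
def clip_to_unit_box(p):
    """Euclidean projection onto [0, 1]^n, a stand-in for the polytope Az <= b."""
    return [min(1.0, max(0.0, x)) for x in p]


def ste_gradient(grad_z):
    """Straight-through estimator: pass the gradient through unchanged."""
    return list(grad_z)


def spigot_gradient(z_hat, grad_z, eta=1.0):
    """SPIGOT: step against the gradient, project back, and return z_hat - q."""
    p = [z - eta * g for z, g in zip(z_hat, grad_z)]
    q = clip_to_unit_box(p)
    return [z - qi for z, qi in zip(z_hat, q)]


z_hat = [1.0, 0.0, 1.0, 0.0]
grad_z = [0.3, 0.5, 0.4, 0.2]

ste = ste_gradient(grad_z)
spigot = spigot_gradient(z_hat, grad_z)
```

With these numbers, STE keeps all four gradient entries, while the SPIGOT proxy is approximately `[0.3, 0.0, 0.4, 0.0]`: the entries that would push an already-zero coordinate below zero are removed by the projection. For a real parser the projection onto the tree polytope is a harder optimization problem than clipping.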

SLIDE 29

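Some Geometry…

[Figure: two panels, each showing $\hat z$, the point $\hat z - \eta \nabla_{\hat z} L$, and the resulting $\nabla_s L$; one panel is labeled SPIGOT.]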

SLIDE 30

Algorithm

[Figure, forward pass: input "Shareholders took their money" → parser θ → $\hat z = \arg\max_z z^\top s_\theta$ s.t. $z$ forms a tree.]

SLIDE 31

Algorithm

Forward pass, continued: $\hat z$ → downstream task φ → loss $L$.

SLIDE 32

Algorithm

Backward pass: backprop from the loss $L$ to $\hat z$ gives $\nabla_{\hat z} L$.

SLIDE 33

Algorithm

Project onto the polytope:

$p = \hat z - \eta \nabla_{\hat z} L$

$q = \operatorname{proj}(p)$

$\nabla_s L \triangleq \hat z - q$

SLIDE 34

Algorithm

Backprop from $\nabla_s L$ through the parser gives $\nabla_\theta L$.
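The steps of the algorithm (forward through the parser and arg max, downstream loss, backprop to $\hat z$, projection, backprop into the parser) can be strung together in a toy training step. Everything here is an assumption made for illustration: the "parser" is a linear scorer `W x`, the structure is a one-hot choice so the polytope is the probability simplex, and the downstream loss is a squared distance to a target `y` with no parameters φ of its own.

```python
def one_hot_argmax(s):
    """Hard decision z_hat: one-hot vector at the highest score."""
    best = max(range(len(s)), key=lambda k: s[k])
    return [1.0 if k == best else 0.0 for k in range(len(s))]


def project_onto_simplex(p):
    """Euclidean projection onto {q : q >= 0, sum(q) = 1} (sort-based method)."""
    u = sorted(p, reverse=True)
    cumulative, tau = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:
            tau = t
    return [max(x - tau, 0.0) for x in p]


def spigot_step(W, x, y, eta=1.0, lr=0.5):
    """One training step; returns the updated W and the parse used in the forward pass."""
    # Forward: scores s = W x, then the hard arg max.
    s = [sum(w * xj for w, xj in zip(row, x)) for row in W]
    z_hat = one_hot_argmax(s)
    # Toy downstream loss L = 0.5 * ||z_hat - y||^2, so grad wrt z_hat is z_hat - y.
    grad_z = [z - t for z, t in zip(z_hat, y)]
    # SPIGOT proxy: step, project, and take z_hat - q as the gradient wrt s.
    p = [z - eta * g for z, g in zip(z_hat, grad_z)]
    q = project_onto_simplex(p)
    grad_s = [z - qi for z, qi in zip(z_hat, q)]
    # Backprop through s = W x: dL/dW[i][j] = grad_s[i] * x[j].
    new_W = [[w - lr * gs * xj for w, xj in zip(row, x)] for row, gs in zip(W, grad_s)]
    return new_W, z_hat


W0 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
x = [2.0, 1.0]
y = [0.0, 1.0, 0.0]

W1, z_first = spigot_step(W0, x, y)
W2, z_second = spigot_step(W1, x, y)
```

In this toy run the first forward pass picks the wrong item, the proxy gradient moves the scores, and the second forward pass picks the target. Once the parse matches the target the gradient is zero and `W` stops changing.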

SLIDE 35

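Connections to Related Work

[Table: columns Pipeline, STE, Structured Att., SPIGOT; rows "Hard decision on $\hat z$" and "Backprop"; recoverable cell text: Marginal, Projection.]

Structured Attention: Kim et al., 2017

[Figure: STE and SPIGOT panels, each showing $\hat z$, the point $\hat z - \eta \nabla_{\hat z} L$, and $\nabla_s L$.]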

SLIDE 36

Connections to Related Work

Structured Attention (Kim et al., 2017): instead of the hard decision $\hat z = \arg\max(\ldots)$, use the soft $\hat z = \operatorname{softmax}(\ldots)$.

[Table: columns Pipeline, STE, Structured Att., SPIGOT; rows "Hard decision on $\hat z$" and "Backprop"; recoverable cell text: Marginal, Projection.]
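The contrast between the hard arg max and the soft relaxation can be shown on a flat score vector. This is a simplified sketch: it uses an ordinary softmax over independent items, whereas structured attention computes marginals over whole structures; the scores are invented for the example.

```python
import math


def hard_argmax(scores):
    """Hard decision: one-hot vector at the highest score (not differentiable in scores)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1.0 if i == best else 0.0 for i in range(len(scores))]


def softmax(scores):
    """Soft decision: a differentiable distribution over the same items."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5]
z_hard = hard_argmax(scores)
z_soft = softmax(scores)
```

`z_hard` commits to a single item, while `z_soft` spreads mass over all items and therefore passes a dense, soft structure to the downstream model.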

SLIDE 37

Applications

Joint learning (Swayamdipta et al., 2016)

[Figure: parser θ with arg max output, trained on its own training data with loss $L_1$ and gradient $\nabla_\theta L_1$.]

SLIDE 38

Applications

Joint learning (Swayamdipta et al., 2016), continued: the downstream task φ adds loss $L_2$. The parser receives $\nabla_\theta L_1$ and $\nabla_\theta L_2$; the downstream model receives $\nabla_\phi L_2$.

SLIDE 39

Applications

Induce latent structures (Yogatama et al., 2017; Williams et al., 2017)

[Figure: parser θ and downstream task φ with a single downstream loss $L$ on the training data, with gradients $\nabla_\theta L$ and $\nabla_\phi L$.]

SLIDE 40

Outline

❖ Background: structured prediction as linear programs

❖ Method: SPIGOT algorithm

❖ Experiments

SLIDE 41

Experiments: Syntactic-then-semantic Parsing

[Figure: input "Shareholders took their money" → syntactic parser θ → arg max → syntactic tree → semantic parser φ → semantic graph with edges labeled arg1, arg2, poss.]

SLIDE 42

Experiments: Syntactic-then-semantic Parsing

Syntactic parser θ: BiLSTM + MLP (Kiperwasser and Goldberg, 2016), with the Eisner algorithm (Eisner, 1996).

SLIDE 43

Experiments: Syntactic-then-semantic Parsing

Semantic parser φ: NeurboParser (Peng et al., 2017).

Concat head token embedding: each token is paired with its head in the syntactic tree (head labels on the slide: took, took, money, root).

SLIDE 44

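SemEval ’15. Micro-averaged labeled F1

[Figure: bar chart of F1 (axis 80 to 88), in-domain and out-of-domain, for Neurbo, Pipeline, STE, Structured Att., and SPIGOT. Neurbo: Peng et al., 2017.]

[Table: columns Neurbo, Pipeline, STE, Structured Att., SPIGOT; rows Syntax and Backprop; recoverable cell text: N/A, Hard decision on $\hat z$, Projection.]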

SLIDE 48

Semantic Parsing for Sentiment Classification

[Figure: input "Shareholders took their money" → semantic parser θ → arg max → semantic graph with edges labeled arg1, arg2, poss → classifier φ → Positive? Negative?]

SLIDE 49

Semantic Parsing for Sentiment Classification

Semantic parser θ: NeurboParser (Peng et al., 2017), with AD3 (Martins et al., 2011).

Classifier φ: BiLSTM+MLP.

Concat head token and role, e.g. took:arg1; took:arg2; their:poss; …

SLIDE 50

Stanford Sentiment Treebank accuracy

[Figure: bar chart of accuracy (axis 82 to 88) for BiLSTM, Pipeline, STE, and SPIGOT.]

SLIDE 51

Conclusion

Problem

SLIDE 52

Conclusion

Method: SPIGOT

SLIDE 53

Conclusion

Results

SLIDE 54