Backpropagating through Structured Argmax using a SPIGOT



slide-1
SLIDE 1

Backpropagating through Structured Argmax using a SPIGOT

Hao Peng, Sam Thomson, Noah A. Smith @ACL July 17, 2018

slide-3
SLIDE 3

Overview

"Shareholders took their money" → Parser → arg max → parse tree → Downstream task → Loss $L$

Ways for the downstream task to use the parse tree:

  • Head token (Yang and Mitchell, 2017)
  • Tree-RNN (Tai et al., 2015)
  • Graph CNN (Kipf and Welling, 2017)

slide-5
SLIDE 5

Overview

"Shareholders took their money" → Parser → arg max → parse tree → Downstream task → Loss $L$

Is arg max a layer in the computation graph? It is non-differentiable.

slide-8
SLIDE 8

Aim

  • Structured prediction as a layer.

Overview

"Shareholders took their money" → Intermediate parser θ → arg max → parse tree → Downstream task → Loss $L$

$\nabla_\theta L$ = ?

Motivation

  • Structures help (Ji and Smith, 2017; Oepen et al., 2017).
  • Linguistic structures may not be universally optimal (Williams, 2017).

Challenges

  • argmax is non-differentiable.

Method

  • SPIGOT (Structured Prediction Intermediate Gradients Optimization Technique): a proxy.

slide-9
SLIDE 9

Outline

  ❖ Background: structured prediction as linear programs
  ❖ Method: SPIGOT algorithm
  ❖ Experiments

slide-10
SLIDE 10

Structured Prediction Reviewed

Input: "Shareholders took their money"

Output: a tree over "Shareholders took their money"

slide-11
SLIDE 11

Structured Prediction Reviewed

Input: "Shareholders took their money"

Score: the score of a tree is the sum of the scores of its arcs (head → mod):

$S_\theta(\text{tree}) = \sum_{\text{arcs}} s_\theta(\text{head} \to \text{mod})$

slide-12
SLIDE 12

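Structured Prediction Reviewed

Input: "Shareholders took their money"

Score: one entry per candidate arc between word pairs such as took/their, took/money, their/took, their/money:

$s_\theta = [\, s_\theta(\cdot),\ s_\theta(\cdot),\ s_\theta(\cdot),\ \ldots,\ s_\theta(\cdot) \,]^\top$

$z = [\, 1?,\ 0?,\ 1?,\ \ldots,\ 0? \,]^\top$

Output:

$\hat{z} = \arg\max_{z} z^\top s_\theta \quad \text{s.t. } z \text{ forms a tree}$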

slide-14
SLIDE 14

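Linear Programming Formulation

Roth and Yih, 2004; Martins et al., 2009

$\hat{z} = \arg\max_{z} z^\top s_\theta \quad \text{s.t. } z \text{ forms a tree}$

Here $s_\theta = [\, s_\theta(\cdot),\ s_\theta(\cdot),\ s_\theta(\cdot),\ \ldots,\ s_\theta(\cdot) \,]^\top$ holds one score per candidate arc, and the constraint "$z$ forms a tree" is written as linear constraints:

$z \text{ forms a tree} \;=\; Az \le b$

Relaxation: $z_i \in \{0, 1\}$ becomes $z_i \in [0, 1]$.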
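
As a minimal sketch of the arg max over trees (not the decoder used in the talk), the brute-force search below enumerates every head assignment for a four-word sentence and keeps the highest-scoring one that forms a tree. The function names, the score layout (row = head, column = modifier, index 0 = an artificial root), and the toy scores are assumptions made for illustration; the sketch does not enforce a single root.

```python
import itertools
import numpy as np

def is_tree(heads):
    # heads[m - 1] is the head of word m; 0 denotes the artificial root.
    for m in range(1, len(heads) + 1):
        seen, node = set(), m
        while node != 0:
            if node in seen:          # revisiting a word means a cycle
                return False
            seen.add(node)
            node = heads[node - 1]
    return True

def argmax_tree(scores):
    # scores[h, m] is the (assumed) score of the arc h -> m.
    n = scores.shape[0] - 1
    best_score, best_heads = -np.inf, None
    for heads in itertools.product(range(n + 1), repeat=n):
        if not is_tree(heads):
            continue
        total = sum(scores[h, m + 1] for m, h in enumerate(heads))
        if total > best_score:
            best_score, best_heads = total, heads
    z_hat = np.zeros_like(scores)     # arc indicator, z in {0, 1}
    for m, h in enumerate(best_heads):
        z_hat[h, m + 1] = 1.0
    return best_heads, z_hat

# Words: 1 Shareholders, 2 took, 3 their, 4 money (toy scores).
scores = np.zeros((5, 5))
for h, m in [(2, 1), (0, 2), (4, 3), (2, 4)]:
    scores[h, m] = 1.0
heads, z_hat = argmax_tree(scores)
objective = float((z_hat * scores).sum())   # z^T s_theta for the winner
```

Enumeration is exponential in sentence length, so it only serves to make the objective $z^\top s_\theta$ concrete on a toy input.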

slide-15
SLIDE 15

Outline

  ❖ Background: structured prediction as linear programs
  ❖ Method: SPIGOT algorithm
  ❖ Experiments

slide-19
SLIDE 19

Backprop

Forward: "Shareholders took their money" → $s_\theta$ → $\hat{z} = \arg\max_{z} z^\top s_\theta$ s.t. $z$ forms a tree → Downstream task → Loss $L$

Backward:

  • Backprop: $\nabla_{\hat{z}} L$
  • Proxy: $\nabla_{s} L$
  • Backprop: $\nabla_\theta L$

slide-22
SLIDE 22

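Backprop

We need: $\nabla_{s} L$

We have: $\nabla_{\hat{z}} L$

Chain rule (Leibniz, 1676): "$\nabla_{s} L = J\, \nabla_{\hat{z}} L$"

For $\hat{z} = \arg\max_{z} z^\top s_\theta$ s.t. $z$ forms a tree, the Jacobian $J$ is not defined.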
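
A small numerical check can illustrate the problem on an unstructured arg max, which is an assumed simplification of the tree-constrained case. Away from ties the output does not change under small perturbations, so finite differences give a zero Jacobian; at a tie the output jumps and the finite-difference estimate blows up. The helper names and the score vectors are hypothetical.

```python
import numpy as np

def hard_argmax(s):
    # One-hot indicator of the highest-scoring entry.
    z = np.zeros_like(s)
    z[np.argmax(s)] = 1.0
    return z

def finite_difference_jacobian(f, s, eps=1e-4):
    # Central differences, one column per input coordinate.
    cols = [(f(s + eps * e) - f(s - eps * e)) / (2 * eps)
            for e in np.eye(len(s))]
    return np.stack(cols, axis=1)

J = finite_difference_jacobian(hard_argmax, np.array([2.0, 0.5, 1.0]))
J_tie = finite_difference_jacobian(hard_argmax, np.array([1.0, 1.0, 0.0]))
```

Here `J` is all zeros, so a plain chain rule would pass no signal back to the scores, while `J_tie` contains entries on the order of `1 / (2 * eps)`.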

slide-23
SLIDE 23

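Backprop

We need: $\nabla_{s} L$

We have: $\nabla_{\hat{z}} L$

Chain rule (Leibniz, 1676): "$\nabla_{s} L = J\, \nabla_{\hat{z}} L$"

Straight-through Estimator (STE) (Hinton, 2012; Bengio et al., 2013):

$\nabla_{s} L \triangleq \nabla_{\hat{z}} L$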

slide-26
SLIDE 26

Some Geometry…

"Shareholders took their money": $\hat{z} = [1, 0, 1, \cdots, 0]^\top$, inside the feasible set $Az \le b$

$\nabla_{\hat{z}} L = [0.3, 0.5, 0.4, \ldots, 0.2]$

$p = \hat{z} - \nabla_{\hat{z}} L$

Straight-through Estimator (STE): $\nabla_{s} L \triangleq \nabla_{\hat{z}} L$

slide-28
SLIDE 28

Some Geometry…

"Shareholders took their money": $\hat{z} = [1, 0, 1, \cdots, 0]^\top$, inside the feasible set $Az \le b$

$\nabla_{\hat{z}} L = [0.3, 0.5, 0.4, \ldots, 0.2]$

SPIGOT

  • $p = \hat{z} - \nabla_{\hat{z}} L$
  • $q = \mathrm{proj}(p)$
  • $\nabla_{s} L \triangleq \hat{z} - q$

slide-29
SLIDE 29

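Some Geometry…

[Figure: two panels, each showing $\hat{z}$, the point $\hat{z} - \nabla_{\hat{z}} L$, and the resulting $\nabla_{s} L$; one panel is labeled SPIGOT.]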

slide-34
SLIDE 34

Algorithm

  • Input: "Shareholders took their money". Parser θ computes the arc scores $s_\theta$.
  • Decode: $\hat{z} = \arg\max_{z} z^\top s_\theta$ s.t. $z$ forms a tree.
  • Downstream task φ takes $\hat{z}$ and produces the loss $L$.
  • Backprop: $\nabla_{\hat{z}} L$
  • Project onto the feasible set: $p = \hat{z} - \nabla_{\hat{z}} L$, $q = \mathrm{proj}(p)$, $\nabla_{s} L \triangleq \hat{z} - q$
  • Backprop: $\nabla_\theta L$
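
The backward step can be sketched in a few lines, under two assumptions that are not from the slides: the feasible set is the probability simplex (the unstructured arg max case, chosen because its Euclidean projection has a simple sort-based form), and a step size `eta` is exposed, with `eta=1.0` matching the formula $p = \hat{z} - \nabla_{\hat{z}} L$. The function names are hypothetical, and the straight-through estimator is included only for contrast.

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection onto {z : z >= 0, sum(z) = 1} (sort-based).
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    ks = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * ks > css - 1)[0][-1]
    tau = (css[rho] - 1) / (rho + 1)
    return np.maximum(v - tau, 0.0)

def spigot_grad(z_hat, grad_z_hat, eta=1.0, project=project_simplex):
    p = z_hat - eta * grad_z_hat   # step away from the decoded structure
    q = project(p)                 # pull the point back into the polytope
    return z_hat - q               # proxy for the gradient w.r.t. the scores

def ste_grad(grad_z_hat):
    # Straight-through estimator: pass the gradient through unchanged.
    return grad_z_hat

z_hat = np.array([1.0, 0.0, 0.0])
grad_z_hat = np.array([0.5, -0.3, 0.1])
g_spigot = spigot_grad(z_hat, grad_z_hat)   # approximately [0.4, -0.4, 0.0]
g_ste = ste_grad(grad_z_hat)
```

In this toy case the proxy gradient sums to zero: a descent step on the scores lowers the score of the decoded item and raises the score of the item the loss prefers. For tree-structured outputs the projection would be onto the relaxed polytope $Az \le b$, which this sketch does not implement.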

slide-35
SLIDE 35

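Connections to Related Work

Systems compared: Pipeline, STE, Structured Att. (Structured Attention: Kim et al., 2017), SPIGOT

[Table: one column per system, with rows "Hard decision on $\hat{z}$" and "Backprop"; entries include Marginal and Projection.]

[Figure: STE and SPIGOT panels, each showing $\hat{z}$, $\hat{z} - \nabla_{\hat{z}} L$, and $\nabla_{s} L$.]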

slide-36
SLIDE 36

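Connections to Related Work

Structured Attention (Kim et al., 2017):

$\hat{z} = \arg\max(\ldots)$ versus $\hat{z} = \mathrm{softmax}(\ldots)$

Systems compared: Pipeline, STE, Structured Att., SPIGOT

[Table: one column per system, with rows "Hard decision on $\hat{z}$" and "Backprop"; entries include Marginal and Projection.]

[Figure: SPIGOT panel showing $\hat{z}$, $\hat{z} - \nabla_{\hat{z}} L$, and $\nabla_{s} L$.]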
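
The contrast between a hard and a soft decision can be seen on an unstructured example, which is an assumed simplification: structured attention works with marginals over structures, and a plain softmax is its simplest analogue. The names and scores below are hypothetical.

```python
import numpy as np

def hard_argmax(s):
    # Hard decision: a one-hot vector, piecewise constant in s.
    z = np.zeros_like(s)
    z[np.argmax(s)] = 1.0
    return z

def softmax(s):
    # Soft decision: a dense, differentiable distribution over the items.
    e = np.exp(s - np.max(s))
    return e / e.sum()

s = np.array([2.0, 0.5, 1.0])
z_hard = hard_argmax(s)
z_soft = softmax(s)
```

The soft output is differentiable in `s`, but every entry is nonzero, so the downstream model no longer receives a single discrete structure.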

slide-39
SLIDE 39

Applications

Joint learning (Swayamdipta et al., 2016)

  • "Shareholders took their money" → Parser θ → arg max → parse tree → Downstream task φ → Loss $L_2$
  • The parser also has its own loss $L_1$, computed against training data.
  • Gradients: $\nabla_\phi L_2$, $\nabla_\theta L_2$, $\nabla_\theta L_1$

Induce latent structures (Yogatama et al., 2017; Williams et al., 2017)

  • "Shareholders took their money" → Parser θ → arg max → parse tree → Downstream task φ → Loss $L$, computed against training data
  • Gradients: $\nabla_\theta L$, $\nabla_\phi L$

slide-40
SLIDE 40

Outline

  ❖ Background: structured prediction as linear programs
  ❖ Method: SPIGOT algorithm
  ❖ Experiments

slide-43
SLIDE 43

Experiments: Syntactic-then-semantic Parsing

Input → Syntactic tree → Semantic graph

  • Input: "Shareholders took their money"
  • Syntactic parser θ: BiLSTM + MLP (Kiperwasser and Goldberg, 2016); arg max with the Eisner algorithm (Eisner, 1996). Output: syntactic tree.
  • Semantic parser φ: NeurboParser (Peng et al., 2017). Output: semantic graph with arcs labeled arg1, arg2, poss.
  • Concat head token embedding: each token is paired with its syntactic head (Shareholders: took; took: root; their: money; money: took).
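
One way to read "concat head token embedding" is sketched below: each token's vector is extended with the vector of its syntactic head. The function name, the zero root vector, and the toy embeddings are assumptions for illustration; the actual feature layout in the talk's model is not specified here.

```python
import numpy as np

def concat_head_embeddings(emb, heads, root_emb):
    # emb: (n, d) token embeddings; heads[i] in 0..n, where 0 is the root.
    table = np.vstack([root_emb[None, :], emb])        # row 0 = root vector
    return np.concatenate([emb, table[heads]], axis=1)  # shape (n, 2d)

# Tokens: 1 Shareholders, 2 took, 3 their, 4 money.
emb = np.arange(8, dtype=float).reshape(4, 2)   # toy 2-dimensional embeddings
heads = np.array([2, 0, 4, 2])                  # took, root, money, took
features = concat_head_embeddings(emb, heads, root_emb=np.zeros(2))
```

With a hard parse, `heads` is a discrete lookup, which is why the gradient with respect to the parser's scores needs a proxy such as STE or SPIGOT.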

slide-44
SLIDE 44

SemEval ’15. Micro-averaged labeled F1

[Bar chart: F1 (axis 80 to 88), in-domain and out-of-domain, for Neurbo (Peng et al., 2017), Pipeline, STE, Structured Att., and SPIGOT.]

[Table: one column per system, with rows "Syntax" and "Backprop"; entries include N/A, Hard decision $\hat{z}$, and Projection.]
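
For reference, micro-averaged labeled F1 pools the counts of correct, predicted, and gold labeled arcs over all sentences before computing precision and recall. The sketch below assumes arcs are represented as (head, dependent, label) triples; the function name and the toy graphs are hypothetical, and this is not the official SemEval scorer.

```python
def micro_labeled_f1(gold, pred):
    # gold, pred: lists of sets of (head, dependent, label) triples.
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    n_gold = sum(len(g) for g in gold)
    n_pred = sum(len(p) for p in pred)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [
    {("took", "Shareholders", "arg1"), ("took", "money", "arg2"),
     ("their", "money", "poss")},
    {("a", "b", "arg1")},
]
pred = [
    {("took", "Shareholders", "arg1"), ("took", "money", "arg1")},
    {("a", "b", "arg1")},
]
f1 = micro_labeled_f1(gold, pred)   # pooled counts: tp=2, gold=4, pred=3
```

Pooling gives 4/7 here, whereas averaging the two per-sentence F1 scores (0.4 and 1.0) would give 0.7, so longer sentences weigh more under micro-averaging.
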
slide-49
SLIDE 49

Semantic Parsing for Sentiment Classification

Input → Semantic graph → Positive? Negative?

  • Input: "Shareholders took their money"
  • Semantic parser θ: NeurboParser (Peng et al., 2017); arg max with AD3 (Martins et al., 2011). Output: semantic graph with arcs labeled arg1, arg2, poss.
  • Classifier φ: BiLSTM+MLP
  • Concat head token and role: took:arg1; took:arg2; their:poss; …

slide-50
SLIDE 50

Stanford Sentiment Treebank accuracy

[Bar chart: accuracy (axis 82 to 88) for BiLSTM, Pipeline, STE, and SPIGOT.]

slide-53
SLIDE 53

Conclusion

Problem Method Results

SPIGOT

slide-54
SLIDE 54

Thank you!