Dual-Decomposed Learning with Factorwise Oracles for Structured Prediction of Large Output Domain - PowerPoint PPT Presentation



SLIDE 1

Dual-Decomposed Learning with Factorwise Oracles for Structured Prediction of Large Output Domain

Xiangru Huang ∗

Joint work 1 with Ian E.H. Yen†, Kai Zhong∗, Ruohan Zhang∗, Chia Dai†, Pradeep Ravikumar† and Inderjit Dhillon∗.

∗ University of Texas at Austin    † Carnegie Mellon University

1 [1] Dual-Decomposed Learning with Factorwise Oracles for Structural SVM of Large Output Domain. NIPS 2016.
SLIDE 2

Outline

◮ Motivations
◮ Key Idea
◮ Methodology Sketch
◮ Experimental Results

SLIDE 3

Problem Setting

◮ Classification: learn function g : X → Y

SLIDE 4

Problem Setting

◮ Classification: learn function g : X → Y
◮ Structural: assume structured dependencies on the output

g : X → Y1 × Y2 × · · · × Ym

SLIDE 5

Example: Sequence Labeling

◮ Unigram Factor:

θu : Yt × Xt → R

◮ Bigram Factor:

Yb = Yt−1 × Yt,  θb : Yb → R

Figure: Sequence Labeling
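To make the two factor types concrete, here is a minimal Python sketch of how unigram and bigram factors jointly score one label sequence. The names (`score_sequence`, `unigram`, `bigram`) and the toy numbers are illustrative, not from the talk.

```python
# Hypothetical sketch: scoring a label sequence y_1..y_T with
# unigram factors theta_u(y_t, x_t) and bigram factors theta_b(y_{t-1}, y_t).

def score_sequence(labels, unigram, bigram):
    """labels: list of label ids y_1..y_T
    unigram[t][y]   : score theta_u(y, x_t) of label y at position t
    bigram[(a, b)]  : transition score theta_b(a, b)
    """
    s = sum(unigram[t][y] for t, y in enumerate(labels))
    s += sum(bigram[(labels[t - 1], labels[t])] for t in range(1, len(labels)))
    return s

unigram = [{0: 1.0, 1: 0.2}, {0: 0.1, 1: 0.8}]          # T = 2, two labels
bigram = {(0, 0): 0.0, (0, 1): 0.5, (1, 0): -0.5, (1, 1): 0.0}
print(round(score_sequence([0, 1], unigram, bigram), 3))  # 1.0 + 0.8 + 0.5 = 2.3
```

The structured predictor g returns the labeling that maximizes this score; the cost of that maximization is exactly the inference bottleneck discussed next.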

SLIDE 6

Example: Multi-Label Classification with Pairwise Interaction

◮ Unigram Factor :

θu : Yk × X → R

◮ Bigram Factor :

Yb = Yk × Yk′,  θb : Yb → R

Figure: Multi-Label with Pairwise Interaction

SLIDE 7

Motivations

◮ g : X → Y1 × Y2 × · · · × Ym

SLIDE 8

Motivations

◮ g : X → Y1 × Y2 × · · · × Ym
◮ Learning requires inference at every iteration.
◮ Exact inference is slow: each iteration takes O(|Yi|^n) for an n-gram factor, where |Yi| ≥ 3000.

SLIDE 9

Motivations

◮ g : X → Y1 × Y2 × · · · × Ym
◮ Learning requires inference at every iteration.
◮ Exact inference is slow: each iteration takes O(|Yi|^n) for an n-gram factor, where |Yi| ≥ 3000.
◮ Approximation degrades performance.

SLIDE 10

Key Idea: Dual Decomposed Learning

◮ Structural Oracle (joint inference) is too expensive.

SLIDE 11

Key Idea: Dual Decomposed Learning

◮ Structural Oracle (joint inference) is too expensive.
◮ Reduce Structural SVM to Multiclass SVMs via soft enforcement of consistency between factors.

SLIDE 12

Key Idea: Dual Decomposed Learning

◮ Structural Oracle (joint inference) is too expensive.
◮ Reduce Structural SVM to Multiclass SVMs via soft enforcement of consistency between factors.
◮ (Cheap) Active Sets + Factorwise Oracles + Message Passing (between factors).

SLIDE 13

Key Idea: Factorwise Oracles

◮ Inner-Product (unigram) Factor: θw(x, y) = ⟨wy, x⟩.
  ◮ Reduces to a primal- and dual-sparse Extreme Multiclass SVM.
  ◮ Reduces O(D · |Yi|) to O(|Fu| · |Ai|), where D is the feature dimension, |Fu| the number of unigram factors, and Ai the active set (details in [2])².
◮ Indicator (bigram) Factor: θ(y1, y2) = v_{y1,y2}.
  ◮ Maintain a priority queue on v_{y1,y2}.
  ◮ Reduces O(|Y1||Y2|) to O(|A1||A2|), where A1, A2 are the active sets.

2 [2] PD-Sparse: A Primal and Dual Sparse Approach to Extreme Multiclass and Multilabel Classification. ICML 2016.
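The bigram oracle can be sketched in a few lines. This is an illustrative simplification, not the paper's exact procedure: where the paper maintains a priority queue on the v entries, the sketch simply scans the sparse nonzero entries, which conveys the same point — the maximizer of v[y1, y2] + m1[y1] + m2[y2] is either the best pair under the messages alone (where v = 0) or one of the few nonzero v entries, so the search touches only the active sets and nnz(v) instead of all |Y1| × |Y2| pairs. All names (`bigram_oracle`, `m1`, `m2`) are assumptions for exposition.

```python
# Illustrative bigram factorwise oracle over active sets.
# m1, m2: message scores over active labels A1, A2 (dicts label -> score).
# v: sparse dict (y1, y2) -> weight; entries not present are zero.

def bigram_oracle(v, m1, m2):
    # Candidate 1: best pair ignoring v (v = 0 there), over active sets only.
    y1 = max(m1, key=m1.get)
    y2 = max(m2, key=m2.get)
    best_score, best = m1[y1] + m2[y2], (y1, y2)
    # Candidate 2: the few nonzero v entries.
    for (a, b), val in v.items():
        s = val + m1.get(a, 0.0) + m2.get(b, 0.0)
        if s > best_score:
            best_score, best = s, (a, b)
    return best

m1 = {0: 0.5, 1: 0.1}
m2 = {0: 0.2, 1: 0.4}
v = {(1, 1): 1.0}                  # one nonzero bigram weight
print(bigram_oracle(v, m1, m2))    # (1, 1): 1.0 + 0.1 + 0.4 beats 0.5 + 0.4
```

Restricting the search to the active sets is the approximation the method makes by design; the priority queue in the talk just makes the nonzero-entry scan lazier.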

SLIDE 14

Methodology Sketch

◮ Original problem:

min_w  (1/2)‖w‖² + C ∑_{i=1}^n L(w; x_i, y_i)

where L is the structural hinge loss.

3 Lacoste-Julien et al. Block-Coordinate Frank-Wolfe Optimization for Structural SVMs. ICML 2013.

SLIDE 15

Methodology Sketch

◮ Original problem:

min_w  (1/2)‖w‖² + C ∑_{i=1}^n L(w; x_i, y_i)

where L is the structural hinge loss.

◮ Dual-decomposed into independent problems:

min_{α_f ∈ Δ^{|Y_f|}}  G(α) := (1/2) ∑_{f∈F} ‖φ(x_f, y_f)^T α_f‖² − ∑_{j∈V} δ_j^T α_j

(independent Multiclass SVMs) with consistency constraints M_{jf} α_f = α_j, ∀(j, f) ∈ E.

◮ The standard approach³ finds a feasible descent direction, which however needs joint inference.

3 Lacoste-Julien et al. Block-Coordinate Frank-Wolfe Optimization for Structural SVMs. ICML 2013.
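The consistency constraint M_{jf} α_f = α_j can be read as a marginalization: for a bigram factor f over Y1 × Y2, M_{jf} sums the factor variable α_f (indexed by label pairs) onto the unigram variable α_j. A toy illustration, with names and data layout assumed purely for exposition:

```python
# Toy view of M_jf alpha_f = alpha_j: marginalize a bigram dual variable
# (a weight per label pair) onto one of its unigram endpoints.

def marginalize(alpha_f, side, K):
    """alpha_f: dict (y1, y2) -> weight; side 0 or 1 selects which
    unigram variable j to marginalize onto; K = |Y_j|."""
    alpha_j = [0.0] * K
    for (y1, y2), a in alpha_f.items():
        alpha_j[(y1, y2)[side]] += a
    return alpha_j

alpha_f = {(0, 1): 0.25, (1, 1): 0.75}
print(marginalize(alpha_f, 0, 2))  # [0.25, 0.75]
print(marginalize(alpha_f, 1, 2))  # [0.0, 1.0]
```

Enforcing these equalities only softly (next slide) is what lets each multiclass SVM be solved independently.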

SLIDE 16

Methodology Sketch

◮ Dual-decomposed into independent problems:

min_{α_f ∈ Δ^{|Y_f|}}  G(α) := (1/2) ∑_{f∈F} ‖φ(x_f, y_f)^T α_f‖² − ∑_{j∈V} δ_j^T α_j

with consistency constraints M_{jf} α_f = α_j, ∀(j, f) ∈ E.

◮ Augmented Lagrangian Method:

L(α, λ) := ∑_F G_F(α_F) + (ρ/2) ∑_{(j,f)∈E} ‖M_{jf} α_f − α_j + λ^t_{jf}‖²

where the G_F are independent Multiclass SVMs, the penalty terms are (sparse) messages between factors, and the multipliers are updated incrementally:

λ^{t+1}_{jf} = λ^t_{jf} + η(M_{jf} α^{t+1}_f − α^{t+1}_j)

SLIDE 17

Methodology Sketch

◮ Augmented Lagrangian Method:

L(α, λ) := ∑_F G_F(α_F) + (ρ/2) ∑_{(j,f)∈E} ‖M_{jf} α_f − α_j + λ^t_{jf}‖²

where the G_F are independent Multiclass SVMs, the penalty terms are (sparse) messages between factors, and the multipliers are updated incrementally:

λ^{t+1}_{jf} = λ^t_{jf} + η(M_{jf} α^{t+1}_f − α^{t+1}_j)

◮ Update α and λ alternately.
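The alternating scheme can be demonstrated on a toy consensus problem. This is a minimal sketch under assumed data, not the paper's solver: minimize f(a_f) + g(a_j) subject to a_f = a_j via an augmented Lagrangian, taking f(x) = (x − 2)² and g(x) = (x − 4)² so each block minimization has a closed form.

```python
# ADMM-style alternation: closed-form block updates on alpha, then an
# incremental multiplier update on the consistency residual.

def solve(rho=1.0, eta=1.0, iters=200):
    a_f = a_j = lam = 0.0
    for _ in range(iters):
        # alpha-step: minimize each block of L(alpha, lambda) in closed form
        a_f = (2 * 2 + rho * (a_j - lam)) / (2 + rho)   # argmin of f + penalty
        a_j = (2 * 4 + rho * (a_f + lam)) / (2 + rho)   # argmin of g + penalty
        # lambda-step: lam^{t+1} = lam^t + eta * (residual)
        lam += eta * (a_f - a_j)
    return a_f, a_j

a_f, a_j = solve()
print(round(a_f, 3), round(a_j, 3))   # 3.0 3.0 (the consensus optimum)
```

Both blocks converge to the consensus value 3 (the minimizer of (x − 2)² + (x − 4)²), mirroring how the α_f and α_j in the slide are driven to agreement by the penalty and multiplier updates.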

SLIDE 18

Experiments: Sequence Labeling (on ChineseOCR)

◮ ChineseOCR: N = 12,064, T = 14.4, D = 400, K = 3,039.
◮ |Yb| = 3,039² = 9,235,521 (bigram language model).
◮ Decoding: Viterbi Algorithm.

Figure: Test error vs. training time on ChineseOCR (BCFW, GDMM-subFMO, SSG, Soft-BCFW ρ=1, Soft-BCFW ρ=10).

Figure: Objective vs. training time on ChineseOCR (same methods).
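The Viterbi decoding used here is the standard dynamic program: O(T · K²) instead of the O(K^T) brute force. A self-contained sketch with toy sizes (not the 3,039-label OCR model):

```python
# Standard Viterbi decoding for a chain with unigram and bigram scores.

def viterbi(unigram, bigram):
    """unigram[t][y]: score of label y at position t;
    bigram[a][b]: transition score a -> b.  Returns the best label sequence."""
    T, K = len(unigram), len(unigram[0])
    score = list(unigram[0])           # best score ending in each label
    back = []                          # backpointers per position
    for t in range(1, T):
        prev = score
        score, ptr = [], []
        for y in range(K):
            b, s = max(((a, prev[a] + bigram[a][y]) for a in range(K)),
                       key=lambda p: p[1])
            score.append(s + unigram[t][y])
            ptr.append(b)
        back.append(ptr)
    y = max(range(K), key=lambda k: score[k])
    path = [y]
    for ptr in reversed(back):         # follow backpointers to recover path
        y = ptr[y]
        path.append(y)
    return path[::-1]

unigram = [[1.0, 0.2], [0.1, 0.8], [0.6, 0.5]]
bigram = [[0.0, 0.5], [-0.5, 0.0]]
print(viterbi(unigram, bigram))        # [0, 1, 1]
```

At K = 3,039 each Viterbi step is a K × K transition maximization, which is exactly the cost the active-set bigram oracle avoids during training.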

SLIDE 19

Experiments: Multi-Label Classification (on RCV1)

◮ RCV1-regions: N = 23,149, D = 47,236, K = 228.
◮ |Fb| = 228² = 51,984 (pairwise interaction).
◮ Decoding: Linear Program.

Figure: Test error vs. training time on RCV1-regions (BCFW, GDMM-subFMO, SSG, Soft-BCFW ρ=1, Soft-BCFW ρ=10).

Figure: Objective vs. training time on RCV1-regions (same methods).