


  1. Dual-Decomposed Learning with Factorwise Oracles for Structured Prediction of Large Output Domain. Xiangru Huang∗, joint work¹ with Ian E.H. Yen†, Kai Zhong∗, Ruohan Zhang∗, Chia Dai†, Pradeep Ravikumar† and Inderjit Dhillon∗. ∗University of Texas at Austin, †Carnegie Mellon University. ¹ [1] Dual Decomposed Learning with Factorwise Oracle for Structural SVM of Large Output Domain. NIPS 2016.

  2. Outline ◮ Motivations ◮ Key Idea ◮ Methodology Sketch ◮ Experimental Results

  3. Problem Setting ◮ Classification: learn function g : X → Y

  4. Problem Setting ◮ Classification: learn function g : X → Y ◮ Structured: assuming structured dependencies among the outputs, g : X → Y_1 × Y_2 × · · · × Y_m

  5. Example: Sequence Labeling ◮ Unigram factor: θ_u : Y_t × X_t → R ◮ Bigram factor: Y_b = Y_{t−1} × Y_t, θ_b : Y_b → R. Figure: Sequence Labeling

  6. Example: Multi-Label Classification with Pairwise Interaction ◮ Unigram factor: θ_u : Y_k × X → R ◮ Bigram factor: Y_b = Y_k × Y_{k′}, θ_b : Y_b → R. Figure: Multi-Label with Pairwise Interaction
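As a toy illustration of these two factor types (the dense arrays `W` and `V`, the dimensions, and the function name are illustrative assumptions, not the paper's parameterization), a joint labeling can be scored by summing unigram factor values ⟨W[k, :, y_k], x⟩ and bigram table lookups V[k, k′, y_k, y_k′]:

```python
import numpy as np

def factor_scores(x, W, V, y):
    """Score a joint labeling y under unigram + pairwise bigram factors.

    Toy dense parameterization (illustrative only):
      unigram: theta_u(y_k, x)    = <W[k, :, y_k], x>
      bigram:  theta_b(y_k, y_k') = V[k, k', y_k, y_k']  for k < k'
    """
    K = len(y)
    score = sum(float(W[k, :, y[k]] @ x) for k in range(K))
    score += sum(float(V[k, kp, y[k], y[kp]])
                 for k in range(K) for kp in range(k + 1, K))
    return score

rng = np.random.default_rng(0)
x = rng.normal(size=4)             # one shared feature vector (D = 4)
W = rng.normal(size=(3, 4, 2))     # K = 3 outputs, 2 candidate labels each
V = rng.normal(size=(3, 3, 2, 2))  # pairwise score tables
print(factor_scores(x, W, V, (1, 0, 1)))
```

The score decomposes over factors, which is exactly what makes the factorwise treatment later in the deck possible.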

  7. Motivations ◮ g : X → Y_1 × Y_2 × · · · × Y_m

  8. Motivations ◮ g : X → Y_1 × Y_2 × · · · × Y_m ◮ Learning requires inference per iteration. ◮ Exact inference is slow: each iteration takes O(|Y_i|^n) for an n-gram factor, where |Y_i| ≥ 3000.

  9. Motivations ◮ g : X → Y_1 × Y_2 × · · · × Y_m ◮ Learning requires inference per iteration. ◮ Exact inference is slow: each iteration takes O(|Y_i|^n) for an n-gram factor, where |Y_i| ≥ 3000. ◮ Approximate inference degrades performance.
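To make the cost concrete: exhaustive joint inference over m outputs with K labels each enumerates K^m assignments. A minimal brute-force sketch for a bigram chain (toy scores, illustrative only) shows why this is untenable at |Y_i| ≥ 3000:

```python
from itertools import product

def brute_force_map(unary, pairwise):
    """Exact MAP over a chain by enumeration: K**T candidate assignments.

    unary:    T x K list of unigram scores
    pairwise: K x K bigram score table (shared across adjacent positions)
    """
    T, K = len(unary), len(unary[0])
    best, best_y = float("-inf"), None
    for y in product(range(K), repeat=T):   # K**T iterations
        s = sum(unary[t][y[t]] for t in range(T))
        s += sum(pairwise[y[t - 1]][y[t]] for t in range(1, T))
        if s > best:
            best, best_y = s, y
    return best, best_y

unary = [[0.0, 1.0], [2.0, 0.0], [0.0, 0.5]]   # T = 3, K = 2
pairwise = [[0.0, -1.0], [-1.0, 0.0]]          # penalize label changes
print(brute_force_map(unary, pairwise))
```

At K = 3,039 and T = 14, K^T is astronomically large; even dynamic programming still pays O(T · K²) per inference call, which motivates the factorwise approach below.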

  10. Key Idea: Dual Decomposed Learning ◮ Structural Oracle (joint inference) is too expensive.

  11. Key Idea: Dual Decomposed Learning ◮ Structural Oracle (joint inference) is too expensive. ◮ Reduce Structural SVM to Multiclass SVMs via soft enforcement of consistency between factors.

  12. Key Idea: Dual Decomposed Learning ◮ Structural Oracle (joint inference) is too expensive. ◮ Reduce Structural SVM to Multiclass SVMs via soft enforcement of consistency between factors. ◮ (Cheap) Active Sets + Factorwise Oracles + Message Passing (between factors).

  13. Key Idea: Factorwise Oracles ◮ Inner-product (unigram) factor: θ_w(x, y) = ⟨w_y, x⟩. ◮ Reduces to a primal and dual sparse Extreme Multiclass SVM. ◮ Reduce O(|F_u| · D · |Y_i|) to O(|F_u| · |A_i|), where |F_u| is the number of unigram factors, D the feature dimension, and A_i the active set (details in [2])². ◮ Indicator (bigram) factor: θ(y_1, y_2) = v_{y_1,y_2}. ◮ Maintain a priority queue on v_{y_1,y_2}. ◮ Reduce O(|Y_1||Y_2|) to O(|A_1||A_2|), where |A_1|, |A_2| are the active-set sizes. ² [2] PD-Sparse: A Primal and Dual Sparse Approach to Extreme Multiclass and Multilabel Classification. ICML 2016.
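One way to realize the "priority queue on v_{y_1,y_2}" bullet is a max-heap with lazy deletion: updates push new entries in O(log n), and queries pop stale entries until the top matches the current table, instead of scanning all |Y_1||Y_2| pairs. This is a sketch under that assumption, not necessarily the paper's exact data structure:

```python
import heapq

class BigramOracle:
    """Max-score lookup over a bigram table v[(y1, y2)] via a lazy max-heap.

    Entries whose stored score no longer matches v are discarded on pop
    ("lazy deletion"), so the oracle avoids a full O(|Y1||Y2|) scan per query.
    """
    def __init__(self, v):
        self.v = dict(v)
        self.heap = [(-s, y) for y, s in self.v.items()]
        heapq.heapify(self.heap)

    def update(self, y, s):
        self.v[y] = s
        heapq.heappush(self.heap, (-s, y))   # old entry becomes stale

    def top(self):
        while True:
            s, y = self.heap[0]
            if self.v.get(y) == -s:          # entry still current
                return y, -s
            heapq.heappop(self.heap)         # stale: discard and retry

oracle = BigramOracle({(0, 0): 1.0, (0, 1): 3.0, (1, 1): 2.0})
print(oracle.top())          # current highest-scoring label pair
oracle.update((0, 1), 0.5)   # lower the previous maximum
print(oracle.top())
```

Lazy deletion is a standard priority-queue idiom when keys are updated in place; the heap may hold duplicates, but each stale entry is popped at most once.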

  14. Methodology Sketch ◮ Original problem: min_w (1/2)‖w‖² + C Σ_{i=1}^n L(w; x_i, y_i), where L is the structured hinge loss. ³ Lacoste-Julien et al. Block-Coordinate Frank-Wolfe Optimization for Structural SVMs. ICML 2013.

  15. Methodology Sketch ◮ Original problem: min_w (1/2)‖w‖² + C Σ_{i=1}^n L(w; x_i, y_i), where L is the structured hinge loss. ◮ Dual-decomposed into independent problems: min_{α_f ∈ Δ^{|Y_f|}} G(α) := (1/2)‖Σ_{f∈F} φ(x_f, y_f)^T α_f‖² − Σ_{j∈V} δ_j^T α_j (independent multiclass SVMs), with consistency constraints M_{jf} α_f = α_j, ∀(j, f) ∈ E. ◮ The standard approach³ finds a feasible descent direction, which however needs joint inference. ³ Lacoste-Julien et al. Block-Coordinate Frank-Wolfe Optimization for Structural SVMs. ICML 2013.
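Each dual block α_f is constrained to a probability simplex Δ^{|Y_f|}, so subproblem solvers for blocks like these typically need Euclidean projection onto the simplex. A sketch of the standard sort-and-threshold projection (a generic building block, not code from the paper):

```python
import numpy as np

def project_simplex(z):
    """Euclidean projection of z onto the probability simplex.

    Standard O(n log n) algorithm: sort descending, find the largest index
    rho with u[rho] - (cumsum(u)[rho] - 1)/(rho + 1) > 0, then shift and clip.
    """
    u = np.sort(z)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(z) + 1) > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(z - theta, 0.0)

p = project_simplex(np.array([0.2, 1.4, -0.3]))
print(p, p.sum())
```

The output is always nonnegative and sums to one, so each projected block is a valid point of Δ^{|Y_f|}.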

  16. Methodology Sketch ◮ Dual-decomposed into independent problems: min_{α_f ∈ Δ^{|Y_f|}} G(α) := (1/2)‖Σ_{f∈F} φ(x_f, y_f)^T α_f‖² − Σ_{j∈V} δ_j^T α_j, with consistency constraints M_{jf} α_f = α_j, ∀(j, f) ∈ E. ◮ Augmented Lagrangian method: L(α, λ) := G_F(α_F) + (ρ/2) Σ_{(j,f)∈E} ‖M_{jf} α_f − α_j + λ_{jf}^t‖², where G_F(α_F) collects the independent multiclass SVMs and the penalty terms are the (sparse) messages between factors, with incrementally updated multipliers λ_{jf}^{t+1} = λ_{jf}^t + η(M_{jf} α_f^{t+1} − α_j^{t+1}).

  17. Methodology Sketch ◮ Augmented Lagrangian method: L(α, λ) := G_F(α_F) + (ρ/2) Σ_{(j,f)∈E} ‖M_{jf} α_f − α_j + λ_{jf}^t‖², where G_F(α_F) collects the independent multiclass SVMs and the penalty terms are the (sparse) messages between factors, with incrementally updated multipliers λ_{jf}^{t+1} = λ_{jf}^t + η(M_{jf} α_f^{t+1} − α_j^{t+1}). ◮ Update α and λ alternately.
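The alternating α/λ scheme can be illustrated on a toy two-variable consensus problem (illustrative only; the talk applies it to the multiclass-SVM blocks): minimize ½(a − c₁)² + ½(b − c₂)² subject to a = b, using exact block minimization of the augmented Lagrangian followed by a multiplier step. The iterates converge to a = b = (c₁ + c₂)/2:

```python
def alm_consensus(c1=1.0, c2=3.0, rho=1.0, eta=1.0, iters=200):
    """Toy augmented-Lagrangian consensus:
        min 0.5*(a - c1)**2 + 0.5*(b - c2)**2   s.t.  a == b
    L(a, b, lam) = f(a) + g(b) + 0.5*rho*(a - b + lam)**2  (scaled multiplier)
    Alternate exact minimization in a, then b, then a dual ascent step on lam.
    """
    a = b = lam = 0.0
    for _ in range(iters):
        a = (c1 + rho * (b - lam)) / (1 + rho)   # argmin_a L(a, b, lam)
        b = (c2 + rho * (a + lam)) / (1 + rho)   # argmin_b L(a, b, lam)
        lam += eta * (a - b)                     # multiplier update
    return a, b

print(alm_consensus())
```

The penalty term plays the same "message" role as the ‖M_{jf} α_f − α_j + λ_{jf}^t‖² terms above: it softly pulls the two blocks toward consistency while each block is optimized independently.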

  18. Experiments: Sequence Labeling (on ChineseOCR) ◮ ChineseOCR: N = 12,064, T = 14.4, D = 400, K = 3,039. ◮ |Y_b| = 3,039² = 9,235,521 (bigram language model). ◮ Decoding: Viterbi algorithm. [Figures: Test Error and Objective vs. training time for BCFW, GDMM-subFMO, SSG, Soft-BCFW-ρ=1, and Soft-BCFW-ρ=10.]
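The sequence-labeling experiments decode with the Viterbi algorithm; as a generic sketch (toy scores and plain lists, not the paper's model), a bigram-chain Viterbi finds the exact MAP in O(T · K²) instead of O(K^T) enumeration:

```python
def viterbi(unary, pairwise):
    """Exact MAP decoding of a bigram chain in O(T * K^2).

    unary:    T x K unigram scores
    pairwise: K x K bigram scores (shared across adjacent positions)
    """
    T, K = len(unary), len(unary[0])
    dp = [unary[0][:]]       # dp[t][y] = best score of a prefix ending in y
    back = []
    for t in range(1, T):
        row, ptr = [], []
        for y in range(K):
            best_p = max(range(K), key=lambda p: dp[-1][p] + pairwise[p][y])
            row.append(dp[-1][best_p] + pairwise[best_p][y] + unary[t][y])
            ptr.append(best_p)
        dp.append(row)
        back.append(ptr)
    y = max(range(K), key=lambda q: dp[-1][q])
    score, path = dp[-1][y], [y]
    for ptr in reversed(back):   # follow backpointers to recover the path
        path.append(ptr[path[-1]])
    return score, path[::-1]

unary = [[0.0, 1.0], [2.0, 0.0], [0.0, 0.5]]   # T = 3, K = 2
pairwise = [[0.0, -1.0], [-1.0, 0.0]]          # penalize label changes
print(viterbi(unary, pairwise))
```

Even this exact O(T · K²) dynamic program costs roughly 14 × 3,039² ≈ 1.3 × 10⁸ operations per ChineseOCR decoding call, which is why the training loop above avoids joint inference per iteration.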

  19. Experiments: Multi-Label Classification (on RCV1) ◮ RCV1-regions: N = 23,149, D = 47,236, K = 228. ◮ |F_b| = 228² = 51,984 (pairwise interactions). ◮ Decoding: Linear Program. [Figures: Objective and Test Error vs. training time for BCFW, GDMM-subFMO, SSG, Soft-BCFW-ρ=1, and Soft-BCFW-ρ=10.]
