  1. Paired-Dual Learning for Fast Training of Latent Variable Hinge-Loss MRFs Stephen H. Bach* Bert Huang* Jordan Boyd-Graber Lise Getoor * Equal Contributors Maryland Virginia Tech Colorado UC Santa Cruz ICML 2015

  2. This Talk § In rich, structured domains, latent variables can capture fundamental aspects and increase accuracy § Learning with latent variables needs repeated inference § Recent work has overcome the inference bottleneck in discrete models, but using continuous variables introduces new challenges § We introduce paired-dual learning (PDL) § PDL is so fast that it often finishes before traditional methods make a single parameter update

  3. Latent Variable Models

  4. Community Detection [Diagram: social network with latent community memberships]

  5. Latent User Attributes [Diagram: social network with latent attribute labels: Connector? Popular? Introverted?]

  6. Image Reconstruction § Latent variables can represent archetypical components [Images: original faces, reconstructions with latent variables, and without] § Learned components for face reconstruction: [Image: learned components]

  7. Learning with Latent Variables

  8. Model § Observations x § Targets y with ground-truth labels ŷ § Latent (unlabeled) variables z § Parameters w
  \[ P(y, z \mid x; w) = \frac{1}{Z(x; w)} \exp\left(-w^\top \phi(x, y, z)\right) \]
  \[ Z(x; w) = \sum_{y, z} \exp\left(-w^\top \phi(x, y, z)\right) \]

  9. Learning Objective
  \[ \log P(\hat{y} \mid x; w) = \log Z(x, \hat{y}; w) - \log Z(x; w) \]
  \[ = \min_{\rho \in \Delta(y, z)} \max_{q \in \Delta(z)} \; \mathbb{E}_\rho\!\left[w^\top \phi(x, y, z)\right] - H(\rho) - \mathbb{E}_q\!\left[w^\top \phi(x, \hat{y}, z)\right] + H(q) \]
  [Diagram: optimize w; inference in P(y, z | x; w); inference in P(z | x, ŷ; w)]
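  The saddle point on this slide follows from the standard variational form of the log-partition function; as a sketch of the step (a textbook identity, not stated on the slide):
  \[ \log Z(x; w) = \max_{\rho \in \Delta(y, z)} \mathbb{E}_\rho\!\left[-w^\top \phi(x, y, z)\right] + H(\rho), \qquad \log Z(x, \hat{y}; w) = \max_{q \in \Delta(z)} \mathbb{E}_q\!\left[-w^\top \phi(x, \hat{y}, z)\right] + H(q) \]
  Negating the first identity turns its max over ρ into the min above; since the two inner problems share no variables, the min/max order is immaterial.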

  10. Traditional Method § Perform full inference in each distribution § Compute the gradient with respect to w § Update w using the gradient [Diagram: loop between optimizing w via ∇w and inference in P(y, z | x; w) and P(z | x, ŷ; w)]
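  The expense is visible in a minimal sketch of this scheme (the inference and feature callables here are hypothetical stand-ins, not the authors' code): both inference problems are solved to convergence before every single gradient step.

```python
import numpy as np

def traditional_latent_learning(infer_free, infer_clamped, phi, x, y_hat,
                                w, lam=0.1, lr=0.05, steps=100):
    """Traditional scheme (sketch): full inference before every update of w.

    infer_free(w)    -> (y, z)  inferred values under P(y, z | x; w)
    infer_clamped(w) -> z'      inferred values under P(z | x, y_hat; w)
    phi(x, y, z)     -> feature vector
    All three are hypothetical callables standing in for real inference/features.
    """
    for _ in range(steps):
        y, z = infer_free(w)          # expensive: run to convergence
        z_clamped = infer_clamped(w)  # expensive: run to convergence
        # Gradient of lam/2 ||w||^2 - log P(y_hat | x; w) under a
        # point (MAP) approximation of the two expectations:
        grad = lam * w - phi(x, y, z) + phi(x, y_hat, z_clamped)
        w = w - lr * grad             # one parameter update per double inference
    return w
```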

  11. How can we solve the inference bottleneck?

  12. Smart Supervised Learning § The supervised learning objective contains an inner inference problem § Interleave inference and learning - e.g., Taskar et al. [ICML 2005], Meshi et al. [ICML 2010], Hazan and Urtasun [NIPS 2010] § Idea: turn the saddle-point optimization into a joint minimization by dualizing the inner inference problem

  13. Smart Latent Variable Learning § For discrete models, Schwing et al. [ICML 2012] proposed dualizing one of the inferences and interleaving it with parameter updates [Diagram: optimize w via ∇w; dual updates ∇δ for inference in P(y, z | x; w); inference in P(z | x, ŷ; w)]

  14. How can we solve the inference bottleneck for continuous models?

  15. Continuous Structured Prediction § The learning objective contains expectations and entropy functions that are intractable for continuous distributions § Recently, there’s been a lot of work on developing - continuous probabilistic graphical models - continuous probabilistic programming languages

  16. Hinge-Loss Markov Random Fields § Natural language processing - Beltagy et al. [ACL 2014], Foulds et al. [ICML 2015] § Social network analysis - Huang et al. [SBP 2013], West et al. [TACL 2014], Li et al. [2014] § Massive open online course (MOOC) analysis - Ramesh et al. [AAAI 2014, ACL 2015] § Bioinformatics - Fakhraei et al. [TCBB 2014]

  17. Hinge-Loss Markov Random Fields § MRFs over continuous variables in [0, 1] with hinge-loss potential functions
  \[ P(y) \propto \exp\left(-\sum_{j=1}^{m} w_j \left(\max\{\ell_j(y), 0\}\right)^{p_j}\right) \]
  where each \( \ell_j \) is a linear function and \( p_j \in \{1, 2\} \)
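  A minimal sketch of that density's energy in code (the representation of a potential as a weight, a linear callable, and an exponent is an assumption for illustration):

```python
import numpy as np

def hlmrf_log_density_unnormalized(y, potentials):
    """Unnormalized log-density of an HL-MRF over y in [0,1]^n.

    potentials: list of (w_j, ell_j, p_j), where ell_j is a linear function
    of y (a callable here) and p_j is 1 or 2.
    """
    return -sum(w * max(ell(y), 0.0) ** p for w, ell, p in potentials)

# Usage (toy): one potential w * max(y0 - y1, 0)^2 penalizes y0 exceeding y1.
y = np.array([0.8, 0.3])
print(hlmrf_log_density_unnormalized(y, [(1.0, lambda v: v[0] - v[1], 2)]))
```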

  18. MAP Inference in HL-MRFs § Exact MAP inference in HL-MRFs is very fast, thanks to the alternating direction method of multipliers (ADMM) § ADMM decomposes inference by - forming the augmented Lagrangian \( L_w(y, z, \alpha, \bar{y}, \bar{z}) \) - iteratively updating blocks of variables
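  A minimal consensus-ADMM sketch for squared-hinge potentials (p_j = 2). It simplifies by giving every potential a local copy of the full variable vector, where real HL-MRF inference copies only each potential's scope; the closed-form local step and the clipped consensus step are the essential moves:

```python
import numpy as np

def map_inference_admm(n, potentials, rho=1.0, iters=200):
    """Consensus ADMM for min_y sum_j w_j * max(a_j @ y + b_j, 0)^2, y in [0,1]^n.

    potentials: list of (w_j, a_j, b_j) with a_j an n-vector and b_j a scalar.
    """
    y_bar = np.full(n, 0.5)                        # consensus variable
    local = [y_bar.copy() for _ in potentials]     # local copies per potential
    dual = [np.zeros(n) for _ in potentials]       # scaled dual variables
    for _ in range(iters):
        for j, (w, a, b) in enumerate(potentials):
            c = y_bar - dual[j]
            s = a @ c + b
            if s > 0:   # hinge active: closed-form minimizer of
                        # w*(a@v + b)^2 + (rho/2)*||v - c||^2
                local[j] = c - (2.0 * w * s / (rho + 2.0 * w * (a @ a))) * a
            else:       # hinge inactive: the quadratic term alone decides
                local[j] = c
        # Consensus step, projected onto the box [0,1]^n.
        y_bar = np.clip(np.mean([v + u for v, u in zip(local, dual)], axis=0), 0, 1)
        for j in range(len(potentials)):
            dual[j] += local[j] - y_bar            # dual ascent
    return y_bar
```

  Each pass touches every potential once; the paired-dual idea on the following slides exploits the fact that a few such warm-started passes between parameter updates are enough.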

  19. Paired-Dual Learning

  20. Continuous Latent Variables § The objective is the same, but the expectations and entropies are intractable
  \[ \arg\min_w \max_{\rho \in \Delta(y, z)} \min_{q \in \Delta(z)} \; \frac{\lambda}{2}\|w\|^2 - \mathbb{E}_\rho\!\left[w^\top \phi(x, y, z)\right] + H(\rho) + \mathbb{E}_q\!\left[w^\top \phi(x, \hat{y}, z)\right] - H(q) \]

  21. Variational Approximations § We can restrict the distribution families to single points - In other words, we can approximate expectations with MAP states - Great for models with fast, convex inference, like HL-MRFs § But the entropy of a point distribution is always zero
  \[ \arg\min_w \max_{y, z} \min_{z'} \; \frac{\lambda}{2}\|w\|^2 - w^\top \phi(x, y, z) + w^\top \phi(x, \hat{y}, z') \]
  § Therefore, w = 0 is always a global optimum
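  A short argument for that last claim, filling in a step the slide leaves implicit: for any w,
  \[ \max_{y, z} \left(-w^\top \phi(x, y, z)\right) \;\ge\; \max_{z'} \left(-w^\top \phi(x, \hat{y}, z')\right) = -\min_{z'} w^\top \phi(x, \hat{y}, z') \]
  so the two inference terms sum to a nonnegative value, the whole objective is bounded below by \( \frac{\lambda}{2}\|w\|^2 \ge 0 \), and w = 0 attains exactly 0. Without an entropy-like term, the model gains nothing from non-zero weights.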

  22. Entropy Surrogates § We design surrogates to fill the role of the entropy terms - They need to be tractable - The choice should be tailored to the problem and model - Options include the curvature and one-sided vs. two-sided penalties § Goal: require non-zero parameters to predict the ground truth § Example: \( -\max\{y, 0\}^2 - \max\{1 - y, 0\}^2 \)
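  The example surrogate in code, to make its effect concrete (a sketch; the quadratic pieces are from the slide, summing over a vector of variables is an assumption):

```python
import numpy as np

def entropy_surrogate(v):
    """Example surrogate h(y) = -max(y,0)^2 - max(1-y,0)^2, summed over entries.

    On [0,1] each term peaks at y = 0.5 (value -0.5) and falls to -1 at the
    endpoints, so, like true entropy, it penalizes confident predictions:
    fitting the ground truth then requires non-zero parameters.
    """
    v = np.asarray(v, dtype=float)
    return float(np.sum(-np.maximum(v, 0.0) ** 2 - np.maximum(1.0 - v, 0.0) ** 2))

print(entropy_surrogate([0.0, 0.5, 1.0]))  # -1.0 + -0.5 + -1.0 = -2.5
```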

  23. Paired-Dual Learning
  \[ \arg\min_w \max_{y, z} \min_{z'} \; \frac{\lambda}{2}\|w\|^2 - w^\top \phi(x, y, z) + h(y, z) + w^\top \phi(x, \hat{y}, z') - h(\hat{y}, z') \]
  § Repeatedly solving the inner inference problems with ADMM still becomes expensive § But we can replace the inference problems with their augmented Lagrangians

  24. Paired-Dual Learning
  \[ \arg\min_{w} \; \max_{v, \bar{v}} \min_{\alpha} \; \min_{v', \bar{v}'} \max_{\alpha'} \; \frac{\lambda}{2}\|w\|^2 + L'_w(v', \alpha', \bar{v}') - L_w(v, \alpha, \bar{v}) \]
  where v = (y, z) and v' = z'
  [Diagram: interleaved block updates — optimize w via ∇w; optimize L'_w(z', α', z̄') over its Lagrangian variables; optimize L_w(y, z, α, ȳ, z̄) over its Lagrangian variables]
  § If the inner maxes and mins were solved to convergence, this objective would be equivalent § Instead, paired-dual learning iteratively updates the parameters and blocks of Lagrangian variables
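  A self-contained toy of the paired-dual schedule (a sketch, not the authors' code: the bilinear feature map is hypothetical, and projected-gradient steps stand in for the ADMM block updates; what the sketch preserves is the schedule itself — warm-started inner states and only N cheap inner updates per parameter update):

```python
import numpy as np

def pdl_toy(x, y_hat, dim_z, lam=0.1, lr=0.05, step=0.25, N=1, T=200, seed=0):
    """Toy paired-dual learning loop with phi(x, y, z) = vec(x [y; z]^T).

    Inner states (y, z) and z' persist across outer iterations (warm starts);
    each outer iteration takes N inner updates, then one gradient step on w.
    Uses the surrogate h(v) = -sum(max(v,0)^2 + max(1-v,0)^2) from slide 22.
    """
    rng = np.random.default_rng(seed)
    dy, dz = len(y_hat), dim_z
    W = np.zeros((len(x), dy + dz))          # parameters, reshaped as a matrix
    v = rng.uniform(size=dy + dz)            # free inference state (y, z)
    zc = rng.uniform(size=dz)                # clamped inference state z'
    for _ in range(T):
        g = W.T @ x                          # d(w^T phi)/d(y, z), constant in v
        for _ in range(N):                   # N cheap, warm-started inner updates
            # Free problem: max over (y, z) of -w^T phi + h
            # <=> projected descent on w^T phi - h.
            v = np.clip(v - step * (g + 2 * v - 2 * (1 - v)), 0, 1)
            # Clamped problem: min over z' of w^T phi(x, y_hat, z') - h(y_hat, z').
            zc = np.clip(zc - step * (g[dy:] + 2 * zc - 2 * (1 - zc)), 0, 1)
        # One gradient step on the outer objective in w (h does not depend on w):
        grad_W = lam * W - np.outer(x, v) + np.outer(x, np.r_[y_hat, zc])
        W -= lr * grad_W
    return W

W = pdl_toy(x=np.array([1.0, -0.5]), y_hat=np.array([1.0, 0.0]), dim_z=2)
```

  With N = 1 the inner problems are never solved to convergence, yet the warm starts let the joint optimization still make progress — the source of the speedups on the following slides.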

  25. Evaluation

  26. Evaluation § Three real-world problems: - Community detection - Latent user attributes - Image reconstruction § Learning methods: - Paired-dual learning (PDL) (N = 1, N = 10) - Expectation maximization (EM) - Primal gradient descent (Primal) § Evaluated: - Learning objective - Predictive performance - Both as a function of ADMM (inference) iterations

  27. Community Detection § Case Study: 2012 Venezuelan Presidential Election - Incumbent: Hugo Chávez - Challenger: Henrique Capriles [Photos: Chávez (left), Capriles (right)] Left: photograph produced by Agência Brasil, a public Brazilian news agency; licensed under the Creative Commons Attribution 3.0 Brazil license. Right: photograph produced by Wilfredor; licensed under the Creative Commons Attribution-Share Alike 3.0 Unported license.

  28. Twitter (One Fold) [Chart: learning objective (×10⁴) vs. ADMM iterations, comparing PDL (N=1), PDL (N=10), EM, and Primal]

  29. Twitter (One Fold) [Chart: AuPR vs. ADMM iterations, comparing PDL (N=1), PDL (N=10), EM, and Primal]

  30. Latent User Attributes § Task: trust prediction in the Epinions social network [Richardson et al., ISWC 2003] § Latent variables represent whether users are: Trusting? Trustworthy? [Diagram: trust network]

  31. Epinions (One Fold) [Chart: learning objective vs. ADMM iterations, comparing PDL (N=1), PDL (N=10), EM, and Primal]

  32. Epinions (One Fold) [Chart: AuPR vs. ADMM iterations, comparing PDL (N=1), PDL (N=10), EM, and Primal]

  33. Image Reconstruction § Tested on Olivetti faces [Samaria and Harter, 1994], using the experimental protocol of Poon and Domingos [UAI 2012] § Latent variables capture facial structure [Images: originals, reconstructions with latent variables, and without]

  34. Image Reconstruction [Chart: learning objective vs. ADMM iterations, comparing PDL (N=1), PDL (N=10), EM, and Primal]

  35. Image Reconstruction [Chart: MSE vs. ADMM iterations, comparing PDL (N=1), PDL (N=10), EM, and Primal]

  36. Conclusion

  37. Conclusion § Continuous latent variables - Capture rich, nuanced information in structured domains - Learning them introduces new challenges § Paired-dual learning - Learns accurate models much faster than traditional methods, often before they make a single parameter update - Makes large-scale, latent-variable hinge-loss MRFs practical § Open questions - Convergence proof for paired-dual learning - Should we also use it for discrete models? Thank You! bach@cs.umd.edu @stevebach
