

SLIDE 1

Paired-Dual Learning for Fast Training of Latent Variable Hinge-Loss MRFs

Stephen H. Bach* (Maryland), Bert Huang* (Virginia Tech), Jordan Boyd-Graber (Colorado), Lise Getoor (UC Santa Cruz)

ICML 2015

* Equal Contributors

SLIDE 2

This Talk

§ In rich, structured domains, latent variables can capture fundamental aspects and increase accuracy
§ Learning with latent variables requires repeated inference
§ Recent work has overcome the inference bottleneck in discrete models, but using continuous variables introduces new challenges
§ We introduce paired-dual learning (PDL)
§ PDL is so fast that it often finishes before traditional methods make a single parameter update

SLIDE 3

Latent Variable Models

SLIDE 4

Community Detection

SLIDE 5

Latent User Attributes

  • Popular?
  • Introverted?
  • Connector?

SLIDE 6

Image Reconstruction

§ Latent variables can represent archetypical components
§ Learned components for face reconstruction:

[Figure: face reconstructions (Originals, With LVs, Without)]

SLIDE 7

Learning with Latent Variables

SLIDE 8

Model

§ Observations: x
§ Targets: y, with ground-truth labels ŷ
§ Latent variables (unlabeled): z
§ Parameters: w

$$P(y, z \mid x; w) = \frac{1}{Z(x; w)} \exp\left(-w^\top \phi(x, y, z)\right)$$

$$Z(x; w) = \sum_{y, z} \exp\left(-w^\top \phi(x, y, z)\right)$$
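Read as code, the model is just an exponentiated negative linear energy. A minimal sketch in Python (names are illustrative, not from the paper; `phi_xyz` stands for a precomputed feature vector φ(x, y, z)):

```python
import numpy as np

def energy(w, phi_xyz):
    """Linear energy w^T phi(x, y, z); lower energy means more probable."""
    return float(np.dot(w, phi_xyz))

def unnormalized_density(w, phi_xyz):
    """exp(-energy); dividing by Z(x; w) would give P(y, z | x; w)."""
    return float(np.exp(-energy(w, phi_xyz)))
```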

SLIDE 9

Learning Objective

[Diagram: optimize w by alternating with inference in P(y, z | x; w) and inference in P(z | x, ŷ; w)]

$$\log P(\hat{y} \mid x; w) = \log Z(x, \hat{y}; w) - \log Z(x; w) = \min_{\rho \in \Delta(y,z)} \; \max_{q \in \Delta(z)} \; \mathbb{E}_\rho\!\left[w^\top \phi(x, y, z)\right] - H(\rho) - \mathbb{E}_q\!\left[w^\top \phi(x, \hat{y}, z)\right] + H(q)$$
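The saddle-point form follows from the standard variational characterization of the log-partition function, stated here in the deck's notation (a textbook identity, not specific to this paper):

```latex
\log Z(x; w) = \max_{\rho \in \Delta(y,z)} \; \mathbb{E}_\rho\!\left[-w^\top \phi(x, y, z)\right] + H(\rho)
```

Applying this identity to log Z(x, ŷ; w) (over q ∈ Δ(z), with y clamped to ŷ) and to log Z(x; w) (over ρ ∈ Δ(y, z)), then negating the second term, yields the min-max expression above.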

SLIDE 10

Traditional Method

§ Perform full inference in each distribution
§ Compute the gradient with respect to w
§ Update w using the gradient

[Diagram: optimize w; each gradient step ∇w requires full inference in both P(y, z | x; w) and P(z | x, ŷ; w)]
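A minimal sketch of this traditional loop in Python, under the assumption that each full inference call returns the expected feature vector of its distribution (the two inference callables are hypothetical placeholders, not the paper's API):

```python
import numpy as np

def traditional_learning(w, infer_joint, infer_clamped,
                         step_size=0.1, num_steps=100, lam=1.0):
    """infer_joint(w)   -> E_rho[phi(x, y, z)]   (full inference, expensive)
    infer_clamped(w) -> E_q[phi(x, y_hat, z)] (full inference, expensive)"""
    for _ in range(num_steps):
        e_joint = infer_joint(w)      # run to convergence every step
        e_clamped = infer_clamped(w)  # run to convergence every step
        # Gradient of the L2-regularized negative log-likelihood.
        grad = lam * w - e_joint + e_clamped
        w = w - step_size * grad
    return w
```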

SLIDE 11

How can we solve the inference bottleneck?

SLIDE 12

Smart Supervised Learning

§ The supervised learning objective contains an inner inference problem
§ Interleave inference and learning

  • e.g., Taskar et al. [ICML 2005], Meshi et al. [ICML 2010], Hazan and Urtasun [NIPS 2010]

§ Idea: turn the saddle-point optimization into a joint minimization by dualizing the inner inference problem

SLIDE 13

Smart Latent Variable Learning

§ For discrete models, Schwing et al. [ICML 2012] proposed dualizing one of the inferences and interleaving with parameter updates

[Diagram: optimize w with gradient steps ∇w interleaved with inference in P(y, z | x; w) and P(z | x, ŷ; w)]

SLIDE 14

How can we solve the inference bottleneck for continuous models?

SLIDE 15

Continuous Structured Prediction

§ The learning objective contains expectations and entropy functions that are intractable for continuous distributions
§ Recently, there’s been a lot of work on developing

  • continuous probabilistic graphical models
  • continuous probabilistic programming languages

SLIDE 16

Hinge-Loss Markov Random Fields

§ Natural language processing

  • Beltagy et al. [ACL 2014], Foulds et al. [ICML 2015]

§ Social network analysis

  • Huang et al. [SBP 2013], West et al. [TACL 2014], Li et al. [2014]

§ Massive open online course (MOOC) analysis

  • Ramesh et al. [AAAI 2014, ACL 2015]

§ Bioinformatics

  • Fakhraei et al. [TCBB 2014]

SLIDE 17

Hinge-Loss Markov Random Fields

§ MRFs over continuous variables in [0,1] with hinge-loss potential functions, where each ℓ_j is a linear function and p_j ∈ {1, 2}:

$$P(y) \propto \exp\left(-\sum_{j=1}^{m} w_j \left(\max\{\ell_j(y),\, 0\}\right)^{p_j}\right)$$
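
As a concrete reading, here is a minimal Python sketch of the HL-MRF energy, assuming each linear function is parameterized as ℓ_j(y) = A[j]·y + b[j] (the parameterization and names are illustrative):

```python
import numpy as np

def hinge_loss_energy(y, weights, A, b, p):
    """Energy sum_j w_j * max(ell_j(y), 0)^(p_j), ell_j(y) = A[j] @ y + b[j];
    the HL-MRF density is P(y) ∝ exp(-energy) for y in [0, 1]^n."""
    ell = A @ y + b                     # linear functions ell_j(y)
    hinges = np.maximum(ell, 0.0) ** p  # hinge-loss potentials, p_j in {1, 2}
    return float(weights @ hinges)

# One squared-hinge potential 2 * max(y0 - y1, 0)^2:
w = np.array([2.0])
A = np.array([[1.0, -1.0]])
b = np.array([0.0])
p = np.array([2])
print(hinge_loss_energy(np.array([0.9, 0.2]), w, A, b, p))  # 2 * 0.7^2 = 0.98
```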

SLIDE 18

MAP Inference in HL-MRFs

§ Exact MAP inference in HL-MRFs is very fast, thanks to the alternating direction method of multipliers (ADMM)
§ ADMM decomposes inference by

  • Forming the augmented Lagrangian $L_w(y, z, \alpha, \bar{y}, \bar{z})$
  • Iteratively updating blocks of variables
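To illustrate the block-update pattern ADMM uses (a toy consensus problem, not the paper's actual solver): each term gets a local copy of the variable, and ADMM alternates local updates, a consensus-averaging step, and dual updates.

```python
import numpy as np

def consensus_admm(targets, rho=1.0, iters=50):
    """Toy: minimize sum_k (y - t_k)^2 by giving each term a local copy y_k
    constrained to equal a consensus variable y_bar. The closed-form local
    update is specific to this quadratic toy; HL-MRF potentials have their
    own closed-form updates."""
    m = len(targets)
    y_local = np.zeros(m)  # local copies y_k
    y_bar = 0.0            # consensus variable
    u = np.zeros(m)        # scaled dual variables
    for _ in range(iters):
        # Local blocks: argmin (y_k - t_k)^2 + (rho/2)(y_k - y_bar + u_k)^2
        y_local = (2 * targets + rho * (y_bar - u)) / (2 + rho)
        y_bar = np.mean(y_local + u)  # consensus step
        u += y_local - y_bar          # dual updates enforce agreement
    return y_bar

print(consensus_admm(np.array([0.0, 1.0, 2.0])))  # converges to the mean, 1.0
```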

SLIDE 19

Paired-Dual Learning

SLIDE 20

Continuous Latent Variables

§ The objective is the same, but the expectations and entropies are intractable

$$\arg\min_w \; \max_{\rho \in \Delta(y,z)} \; \min_{q \in \Delta(z)} \; \frac{\lambda}{2}\|w\|^2 - \mathbb{E}_\rho\!\left[w^\top \phi(x, y, z)\right] + H(\rho) + \mathbb{E}_q\!\left[w^\top \phi(x, \hat{y}, z)\right] - H(q)$$

SLIDE 21

Variational Approximations

§ We can restrict the distribution families to single points

  • In other words, we can approximate expectations with MAP
  • Great for models with fast, convex inference, like HL-MRFs

§ But the entropy of a point distribution is always zero
§ Therefore, w = 0 is always a global optimum: the max over (y, z) can always match the minimizing clamped configuration (ŷ, z'), so the objective below is at least λ/2 ‖w‖², which w = 0 minimizes

$$\arg\min_w \; \max_{y,z} \; \min_{z'} \; \frac{\lambda}{2}\|w\|^2 - w^\top \phi(x, y, z) + w^\top \phi(x, \hat{y}, z')$$
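A tiny numeric check of this degeneracy (the feature table is made up purely for illustration): with point distributions and no entropy terms, the saddle value is bounded below by the regularizer alone, so w = 0 always wins.

```python
import numpy as np

# Toy feature table phi(x, y, z) over two label values and two latent values.
phi = {("y0", "z0"): np.array([1.0, 0.0]),
       ("y0", "z1"): np.array([0.5, 0.5]),
       ("y1", "z0"): np.array([0.0, 1.0]),
       ("y1", "z1"): np.array([0.2, 0.8])}

def saddle_value(w, y_hat="y1", lam=1.0):
    max_joint = max(-(w @ f) for f in phi.values())               # max over (y, z)
    min_clamped = min(w @ phi[(y_hat, z)] for z in ("z0", "z1"))  # min over z'
    return lam / 2 * (w @ w) + max_joint + min_clamped

print(saddle_value(np.zeros(2)))            # 0.0: the global optimum
print(saddle_value(np.array([1.0, -1.0])))  # strictly positive
```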

SLIDE 22

Entropy Surrogates

§ We design surrogates to fill the role of the entropy terms

  • They need to be tractable
  • The choice should be tailored to the problem and model
  • Options include the curvature and one-sided vs. two-sided forms

§ Goal: require non-zero parameters to predict the ground truth
§ Example:

$$-\max\{y, 0\}^2 - \max\{1 - y, 0\}^2$$
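A quick sketch of this example surrogate: like a true entropy, it is concave and peaks at y = 0.5, so confident predictions near 0 or 1 must be driven by non-zero parameters rather than by the surrogate.

```python
import numpy as np

def entropy_surrogate(y):
    """Example surrogate from the slide: -max(y, 0)^2 - max(1 - y, 0)^2.
    Concave on [0, 1] with its maximum at y = 0.5."""
    y = np.asarray(y, dtype=float)
    return -np.maximum(y, 0.0) ** 2 - np.maximum(1.0 - y, 0.0) ** 2

print(entropy_surrogate([0.0, 0.5, 1.0]))  # [-1.  -0.5 -1. ]
```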

SLIDE 23

Paired-Dual Learning

§ Repeatedly solving the inner inference problems with ADMM is still expensive
§ But we can replace the inference problems with their augmented Lagrangians

$$\arg\min_w \; \max_{y,z} \; \min_{z'} \; \frac{\lambda}{2}\|w\|^2 - w^\top \phi(x, y, z) + h(y, z) + w^\top \phi(x, \hat{y}, z') - h(\hat{y}, z')$$

SLIDE 24

Paired-Dual Learning

§ If the inner maxes and mins were solved to convergence, this objective would be equivalent
§ Instead, paired-dual learning iteratively updates the parameters and blocks of Lagrangian variables

$$\arg\min_w \; \max_{v, \bar{v}} \; \min_{\alpha} \; \min_{v', \bar{v}'} \; \max_{\alpha'} \; \frac{\lambda}{2}\|w\|^2 + L'_w(v', \alpha', \bar{v}') - L_w(v, \alpha, \bar{v})$$

[Diagram: interleaved optimization, alternating gradient steps ∇w on the parameters with block updates of the Lagrangian variables (y, z, α, ȳ, z̄) in L_w and (z', α', z̄') in L'_w]
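A schematic of the interleaving in Python (hypothetical structure, not the authors' code): instead of running ADMM to convergence before each gradient step, PDL takes only N cheap ADMM block updates on each augmented Lagrangian per parameter update.

```python
def paired_dual_learning(w, full_lag, clamped_lag,
                         num_epochs=100, n_admm_steps=1,
                         step_size=0.1, lam=1.0):
    """full_lag holds (y, z, alpha, ...) for L_w; clamped_lag holds
    (z', alpha', ...) for L'_w. Both are assumed to expose one ADMM
    block update and a gradient with respect to w (placeholders here)."""
    for _ in range(num_epochs):
        for _ in range(n_admm_steps):   # N block updates, not convergence
            full_lag.block_update(w)
            clamped_lag.block_update(w)
        # Gradient of lam/2 ||w||^2 + L'_w - L_w at the current duals.
        grad = lam * w + clamped_lag.grad_w(w) - full_lag.grad_w(w)
        w = w - step_size * grad
    return w
```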

SLIDE 25

Evaluation

SLIDE 26

Evaluation

§ Three real-world problems:

  • Community detection
  • Latent user attributes
  • Image reconstruction

§ Learning methods:

  • Paired-dual learning (PDL) (N=1, N=10)
  • Expectation maximization (EM)
  • Primal gradient descent (Primal)

§ Evaluated:

  • Learning objective
  • Predictive performance
  • Both plotted against ADMM (inference) iterations

SLIDE 27

Community Detection

§ Case Study: 2012 Venezuelan Presidential Election

  • Incumbent: Hugo Chávez
  • Challenger: Henrique Capriles
[Photos: Chávez (left, by Agência Brasil, CC BY 3.0 Brazil) and Capriles (right, by Wilfredor, CC BY-SA 3.0 Unported)]

SLIDE 28

[Plot: learning objective vs. ADMM iterations, Twitter (one fold); curves for PDL N=1, PDL N=10, EM, and Primal]

SLIDE 29

[Plot: AuPR vs. ADMM iterations, Twitter (one fold); curves for PDL N=1, PDL N=10, EM, and Primal]

SLIDE 30

Latent User Attributes

§ Task: trust prediction in the Epinions social network [Richardson et al., ISWC 2003]
§ Latent variables represent whether users are:

  • Trusting?
  • Trustworthy?

SLIDE 31

[Plot: learning objective vs. ADMM iterations, Epinions (one fold); curves for PDL N=1, PDL N=10, EM, and Primal]

SLIDE 32

[Plot: AuPR vs. ADMM iterations, Epinions (one fold); curves for PDL N=1, PDL N=10, EM, and Primal]

SLIDE 33

Image Reconstruction

§ Tested on the Olivetti faces dataset [Samaria and Harter, 1994], using the experimental protocol of Poon and Domingos [UAI 2012]
§ Latent variables capture facial structure

[Figure: face reconstructions (Originals, With LVs, Without)]

SLIDE 34

[Plot: learning objective vs. ADMM iterations, image reconstruction; curves for PDL N=1, PDL N=10, EM, and Primal]

SLIDE 35

[Plot: MSE vs. ADMM iterations, image reconstruction; curves for PDL N=1, PDL N=10, EM, and Primal]

SLIDE 36

Conclusion

SLIDE 37

Conclusion

§ Continuous latent variables

  • Capture rich, nuanced information in structured domains
  • Learning them introduces new challenges

§ Paired-dual learning

  • Learns accurate models much faster than traditional methods, often before they make a single parameter update
  • Makes large-scale, latent variable hinge-loss MRFs practical

§ Open questions

  • Convergence proof for paired-dual learning
  • Should we also use it for discrete models?

Thank You!

bach@cs.umd.edu @stevebach