Paired-Dual Learning for Fast Training of Latent Variable Hinge-Loss MRFs
Stephen H. Bach* (University of Maryland), Bert Huang* (Virginia Tech), Jordan Boyd-Graber (University of Colorado), Lise Getoor (UC Santa Cruz)
ICML 2015
* Equal Contributors
This Talk
§ In rich, structured domains, latent variables can capture fundamental aspects and increase accuracy
§ Learning with latent variables requires repeated inference
§ Recent work has overcome the inference bottleneck in discrete models, but continuous variables introduce new challenges
§ We introduce paired-dual learning (PDL)
§ PDL is so fast that it often finishes before traditional methods make a single parameter update
Community Detection
Latent User Attributes
Introverted? Connector?
Image Reconstruction
§ Latent variables can represent archetypical components
§ Learned components for face reconstruction:
[Figure: face reconstructions: originals, with latent variables, without]
Model
§ Observations $x$
§ Targets $y$ with ground-truth labels $\hat{y}$
§ Latent variables $z$ (unlabeled)
§ Parameters $w$
$$P(y, z \mid x; w) = \frac{1}{Z(x; w)} \exp\left(w^\top \phi(x, y, z)\right), \qquad Z(x; w) = \sum_{y, z} \exp\left(w^\top \phi(x, y, z)\right)$$
Learning Objective
[Diagram: optimizing w requires inference in P(y, z | x; w) and in P(z | x, ŷ; w)]

$$\log P(\hat{y} \mid x; w) = \log Z(x, \hat{y}; w) - \log Z(x; w) = \min_{\rho \in \Delta(y,z)} \max_{q \in \Delta(z)} -\mathbb{E}_\rho\left[w^\top \phi(x, y, z)\right] - H(\rho) + \mathbb{E}_q\left[w^\top \phi(x, \hat{y}, z)\right] + H(q)$$
Traditional Method
§ Perform full inference in each distribution
§ Compute the gradient ∇w with respect to the parameters w
§ Update w using the gradient

[Diagram: each update of w requires full inference in P(y, z | x; w) and P(z | x, ŷ; w) to compute ∇w]
Smart Supervised Learning
§ The supervised learning objective contains an inner inference problem
§ Hazan and Urtasun [NIPS 2010]: interleave inference and learning
§ Idea: turn the saddle-point optimization into a joint minimization by dualizing the inner inference problem
Smart Latent Variable Learning
§ For discrete models, Schwing et al. [ICML 2012] proposed dualizing one of the inferences and interleaving with parameter updates
[Diagram: optimization of w interleaved with inference in P(y, z | x; w) and P(z | x, ŷ; w), using gradients ∇w and ∇δ for the dual variables]
Continuous Structured Prediction
§ The learning objective contains expectations and entropy functions that are intractable for continuous distributions
§ Recently, there has been a lot of work on developing expressive, scalable models for continuous structured prediction
Hinge-Loss Markov Random Fields
§ Natural language processing
§ Social network analysis
§ Massive open online course (MOOC) analysis
§ Bioinformatics
Hinge-Loss Markov Random Fields
§ MRFs over continuous variables in [0, 1] with hinge-loss potential functions, where each $\ell_j$ is a linear function and $p_j \in \{1, 2\}$:

$$P(y) \propto \exp\left(-\sum_{j=1}^{m} w_j \left(\max\{\ell_j(y), 0\}\right)^{p_j}\right)$$
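Illustrative only (not from the talk): a minimal Python sketch of evaluating this unnormalized log-density, with two hypothetical hinge potentials and hand-picked weights.

```python
def hinge_potential(ell_y, p):
    # Hinge-loss potential: (max{ell_j(y), 0})^p_j with p_j in {1, 2}.
    return max(ell_y, 0.0) ** p

def unnormalized_log_density(y, weights, linear_fns, powers):
    # log of the unnormalized density: -sum_j w_j (max{ell_j(y), 0})^p_j
    return -sum(w * hinge_potential(ell(y), p)
                for w, ell, p in zip(weights, linear_fns, powers))

# Two toy potentials over y = (y1, y2) in [0,1]^2:
# ell_1 penalizes y1 > y2 (linear hinge); ell_2 penalizes y2 > 0.5 (squared hinge).
linear_fns = [lambda y: y[0] - y[1], lambda y: y[1] - 0.5]
weights = [1.0, 2.0]
powers = [1, 2]

print(unnormalized_log_density([0.9, 0.2], weights, linear_fns, powers))  # -0.7
```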
MAP Inference in HL-MRFs
§ Exact MAP inference in HL-MRFs is very fast, thanks to the alternating direction method of multipliers (ADMM)
§ ADMM decomposes inference into independent subproblems, coordinated through an augmented Lagrangian $L_w(y, z, \alpha, \bar{y}, \bar{z})$
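A minimal consensus-ADMM sketch of this kind of decomposition, on a toy one-variable problem with two hinge potentials. The grid-search subproblem solver and all constants are illustrative choices for brevity, not the talk's implementation.

```python
import numpy as np

# Minimize f1(y) + f2(y) over y in [0,1] by giving each potential its own
# local copy y_k, coupled to a global consensus variable y_bar.
f = [lambda y: 2.0 * max(y - 0.3, 0.0) ** 2,   # squared hinge pushing y <= 0.3
     lambda y: 1.0 * max(0.6 - y, 0.0) ** 2]   # squared hinge pushing y >= 0.6

eta = 1.0                       # ADMM penalty / step-size parameter
grid = np.linspace(0.0, 1.0, 1001)
y_local = np.zeros(2)           # local copies, one per potential
alpha = np.zeros(2)             # scaled dual variables
y_bar = 0.5                     # global consensus variable

for _ in range(100):
    # 1. Each potential independently minimizes its local augmented objective
    #    (solved here by brute-force grid search to keep the sketch short).
    for k in range(2):
        vals = [f[k](y) + (eta / 2) * (y - y_bar + alpha[k]) ** 2 for y in grid]
        y_local[k] = grid[int(np.argmin(vals))]
    # 2. Consensus update: average the local copies plus duals, clipped to [0,1].
    y_bar = float(np.clip(np.mean(y_local + alpha), 0.0, 1.0))
    # 3. Dual update penalizes disagreement with the consensus.
    alpha += y_local - y_bar

print(y_bar)  # settles near 0.4, between the two hinges' preferred regions
```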
Continuous Latent Variables
§ The objective is the same, but the expectations and entropies are intractable
$$\arg\min_w \; \max_{\rho \in \Delta(y,z)} \; \min_{q \in \Delta(z)} \; \frac{\lambda}{2}\|w\|^2 + \mathbb{E}_\rho\left[w^\top \phi(x, y, z)\right] + H(\rho) - \mathbb{E}_q\left[w^\top \phi(x, \hat{y}, z)\right] - H(q)$$
Variational Approximations
§ We can restrict the distribution families to single points:

$$\arg\min_w \; \max_{y,z} \; \min_{z'} \; \frac{\lambda}{2}\|w\|^2 + w^\top \phi(x, y, z) - w^\top \phi(x, \hat{y}, z')$$

§ But the entropy of a point distribution is always zero
§ Therefore, $w = 0$ is always a global optimum: for any $w$, the maximizing $(y, z)$ scores at least as well as the best $(\hat{y}, z')$, so the inner value is at least $\frac{\lambda}{2}\|w\|^2 \geq 0$, with equality at $w = 0$
Entropy Surrogates
§ We design surrogates to fill the role of entropy terms
§ Goal: require non-zero parameters to predict the ground truth
§ Example:

$$h(y) = -\max\{y, 0\}^2 - \max\{1 - y, 0\}^2$$
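A quick check (not from the talk) that this example surrogate behaves like entropy on [0, 1]:

```python
# h(y) = -max{y,0}^2 - max{1-y,0}^2 is largest at y = 0.5 and smallest at
# y in {0, 1}: like entropy, it rewards uncertainty, so only non-zero
# weights can pull predictions toward the ground truth.
def h(y):
    return -max(y, 0.0) ** 2 - max(1.0 - y, 0.0) ** 2

for y in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(y, h(y))  # -1.0, -0.625, -0.5, -0.625, -1.0
```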
Paired-Dual Learning
§ Repeatedly solving the inner inference problems with ADMM is still expensive
§ But we can replace the inference problems with their augmented Lagrangians:
$$\arg\min_w \; \max_{y,z} \; \min_{z'} \; \frac{\lambda}{2}\|w\|^2 + w^\top \phi(x, y, z) + h(y, z) - w^\top \phi(x, \hat{y}, z') - h(\hat{y}, z')$$
Paired-Dual Learning
§ If the inner maxes and mins were solved to convergence, this objective would be equivalent
§ Instead, paired-dual learning iteratively updates the parameters and blocks of Lagrangian variables, as in the sketch after the diagram below
$$\arg\min_w \; \max_{v,\bar{v}} \; \min_\alpha \; \min_{v',\bar{v}'} \; \max_{\alpha'} \; \frac{\lambda}{2}\|w\|^2 + L_w(v, \alpha, \bar{v}) - L'_w(v', \alpha', \bar{v}')$$
[Diagram: PDL alternates gradient steps ∇w on the parameters with block updates of the Lagrangian variables (y, z, α, ȳ, z̄) of L_w and (z', α', z̄') of L'_w]
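A schematic of the PDL update pattern, with the inference internals stubbed out; all function names here are hypothetical, and this sketches only the loop structure, not the paper's implementation.

```python
import numpy as np

# Instead of running each inner ADMM inference to convergence before every
# gradient step, PDL takes only N ADMM steps on each augmented Lagrangian,
# then immediately updates w using the (partial) inference states.

def admm_step(state, w, clamp_labels):
    """One block update of an augmented Lagrangian's variables (stub)."""
    return state  # a real implementation updates local copies, consensus, duals

def lagrangian_gradient(state_free, state_clamped, w, lam):
    """Gradient of the paired-dual objective w.r.t. w at the current states (stub)."""
    return lam * w  # a real gradient adds the feature-expectation difference

def paired_dual_learning(w, lam=0.01, step=0.1, N=1, epochs=100):
    state_free, state_clamped = {}, {}   # ADMM states for the two inferences
    for _ in range(epochs):
        for _ in range(N):               # N cheap ADMM steps per parameter update
            state_free = admm_step(state_free, w, clamp_labels=False)
            state_clamped = admm_step(state_clamped, w, clamp_labels=True)
        w = w - step * lagrangian_gradient(state_free, state_clamped, w, lam)
    return w

print(paired_dual_learning(np.ones(3)))
```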
Evaluation
§ Three real-world problems: community detection, latent user attributes, and image reconstruction
§ Learning methods: PDL (N = 1 and N = 10), EM, and a primal (traditional) method
§ Evaluated: learning objective and prediction quality (AuPR, MSE) versus ADMM iterations
Community Detection
§ Case Study: 2012 Venezuelan Presidential Election
[Figure: candidates Chávez and Capriles]
[Plot: learning objective vs. ADMM iterations, Twitter (one fold); methods: PDL N=1, PDL N=10, EM, primal]
[Plot: AuPR vs. ADMM iterations, Twitter (one fold); methods: PDL N=1, PDL N=10, EM, primal]
Latent User Attributes
§ Task: trust prediction in the Epinions social network [Richardson et al., ISWC 2003]
§ Latent variables represent whether users are trustworthy
[Plot: learning objective vs. ADMM iterations, Epinions (one fold); methods: PDL N=1, PDL N=10, EM, primal]
[Plot: AuPR vs. ADMM iterations, Epinions (one fold); methods: PDL N=1, PDL N=10, EM, primal]
Image Reconstruction
§ Tested on Olivetti faces [Samaria and Harter, 1994], using the experimental protocol of Poon and Domingos [UAI 2012]
§ Latent variables capture facial structure
[Figure: face reconstructions: originals, with latent variables, without]
[Plot: learning objective vs. ADMM iterations, image reconstruction; methods: PDL N=1, PDL N=10, EM, primal]
[Plot: MSE vs. ADMM iterations, image reconstruction; methods: PDL N=1, PDL N=10, EM, primal]
Conclusion
§ Continuous latent variables capture important structure in rich domains
§ Paired-dual learning makes training fast by interleaving inference and parameter updates
§ Open questions
Thank You!
bach@cs.umd.edu @stevebach