SDCA-Powered Inexact Dual Augmented Lagrangian Method for Fast CRF Learning
Guillaume Obozinski
Swiss Data Science Center
Joint work with Shell Xu Hu
Imaging and Machine Learning workshop, IHP, April 2nd 2019
Outline
1. Motivation and context
2. Formulation for CRF learning
3. Relaxing and reformulating in the dual
4. Dual augmented Lagrangian formulation and algorithm
5. Convergence results
6. Experiments
7. Conclusions
[Figure: Cityscapes dataset (Cordts et al., 2016)]
$$\min_w\; F(w) + \frac{\lambda}{2}\|w\|_2^2 \qquad\text{with}\qquad F(w) = \sum_{s=1}^n F_s(w)$$
and typically $F_s(w) = f_s(w^\top \phi(x_s)) = \ell(w^\top \phi(x_s), y_s)$.
Stochastic gradient methods with variance reduction

Iterate: pick $s$ at random and update $w^{t+1} = w^t - \eta\, g^t$ with

(SVRG) $g^t = \nabla F_s(w^t) - \nabla F_s(\tilde w) + \frac{1}{n}\nabla F(\tilde w)$, with $\tilde w = w_{\text{epoch}}$

(SAG) $g^t = \frac{1}{n}\big(\nabla F_s(w^t) - g^{t-1}_s\big) + \bar g^{\,t-1}$ and $g^t_s = \nabla F_s(w^t)$

(SAGA) $g^t = \nabla F_s(w^t) - g^{t-1}_s + \frac{1}{n}\sum_i g^{t-1}_i$ and $g^t_s = \nabla F_s(w^t)$
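As an illustration of these update rules (a minimal sketch of my own, not code from the talk), here is a SAGA-style epoch for a generic finite sum $F(w) = \sum_s F_s(w)$, where `grad_Fs(w, s)` is a hypothetical per-term gradient oracle and `table` stores the last gradient computed for each term:

```python
import numpy as np

def saga_epoch(w, grad_Fs, n, eta, table):
    """One pass of SAGA-style updates; table has shape (n, dim)."""
    table_mean = table.mean(axis=0)              # average of the stored gradients
    for _ in range(n):
        s = np.random.randint(n)
        g_new = grad_Fs(w, s)
        g = g_new - table[s] + table_mean        # variance-reduced direction
        w = w - eta * g
        table_mean += (g_new - table[s]) / n     # keep the running average in sync
        table[s] = g_new
    return w, table
```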
Stochastic Dual Coordinate Ascent (implicit variance reduction)

$$\max_{\alpha_1,\dots,\alpha_n}\; -\sum_{s=1}^n f^*_s(\alpha_s) - \frac{1}{2\lambda}\Big\|\sum_{s=1}^n \phi(x_s)\,\alpha_s\Big\|_2^2$$

Iterate: pick $s$ at random and update
$$\alpha^{t+1}_s \leftarrow \mathrm{Prox}_{\frac{\lambda}{L_s} f^*_s}\Big(\alpha^t_s - \tfrac{1}{L_s}\,\phi(x_s)^\top w^t\Big), \qquad \alpha^{t+1}_i \leftarrow \alpha^t_i, \;\; \forall i \neq s.$$
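For concreteness (a sketch of my own under one sign/scaling convention, not the talk's code), an SDCA epoch for scalar dual variables could look as follows, where `prox_fstar(s, z, tau)` is a hypothetical callback returning $\mathrm{Prox}_{\tau f^*_s}(z)$ and $w$ is kept equal to $-\frac{1}{\lambda}\sum_s \phi(x_s)\alpha_s$:

```python
import numpy as np

def sdca_epoch(alpha, w, phi, lam, L, prox_fstar):
    """One pass of SDCA updates; phi[s] is the feature vector of example s."""
    n = len(alpha)
    for _ in range(n):
        s = np.random.randint(n)
        z = alpha[s] - phi[s].dot(w) / L[s]         # gradient-like step on coordinate s
        alpha_new = prox_fstar(s, z, lam / L[s])    # proximal step on f_s^*
        w -= phi[s] * (alpha_new - alpha[s]) / lam  # maintain w = -(1/lam) * sum_s phi(x_s) alpha_s
        alpha[s] = alpha_new
    return alpha, w
```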
$\kappa$: condition number, $d$: ambient dimension. Running times to have $\mathrm{Obj}(w) - \mathrm{Obj}(w^*) \le \varepsilon$:

Stochastic GD: $d\,\kappa\,\frac{1}{\varepsilon}$
GD: $d\, n\kappa \log\frac{1}{\varepsilon}$
Accelerated GD: $d\, n\sqrt{\kappa} \log\frac{1}{\varepsilon}$
SAG(A), SVRG, SDCA, MISO: $d\,(n + \kappa)\log\frac{1}{\varepsilon}$
Accelerated variants: $d\,(n + \sqrt{n\kappa})\log\frac{1}{\varepsilon}$
Exploiting the sum structure yields faster algorithms for
$$\min_w \sum_{s=1}^n \ell(w^\top\phi(x_s), y_s) + \frac{\lambda}{2}\|w\|_2^2.$$
[Figure: graphical model over outputs $y_1, y_2, \dots, y_n$ with inputs $x_1, x_2, \dots, x_n$]
Input image $x$. Features at pixel $s$: $\phi_s(x)$. Encoding of the class at pixel $s$: $y_s = (y_{s1}, \dots, y_{sK})$ with $y_{sk} = 1$ if pixel $s$ is in class $k$ and $y_{sk} = 0$ otherwise.

Options:
1. Predict each pixel class individually: multiclass logistic regression
$$p(y_s \mid x) \propto \exp\Big(\sum_{k=1}^K y_{sk}\, w_k^\top \phi_s(x)\Big)$$
2. Predict all pixel classes jointly while accounting for dependencies: CRF
$$p(y_1, \dots, y_S \mid x) \propto \exp\Big(\sum_{s\in V}\sum_{k=1}^K y_{sk}\, w_{\tau_1,k}^\top \phi_s(x) + \sum_{\{s,t\}\in E}\sum_{k,l=1}^K w_{\tau_2,kl}\, y_{sk}\, y_{tl}\Big)$$
(a toy computation of the CRF score is sketched right after this list)
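Toy sketch (my own naming, not the talk's code) of the score inside the exponential of option 2, assuming a dense one-hot label array `Y` of shape (S, K), per-pixel features stacked as rows of `Phi` (shape (S, d)), `w1` of shape (d, K), `w2` of shape (K, K), and an edge list `edges`:

```python
import numpy as np

def crf_score(Y, Phi, edges, w1, w2):
    """sum_s sum_k y_sk w1_k^T phi_s + sum_{(s,t) in E} sum_{k,l} w2_kl y_sk y_tl."""
    unary = np.sum(Y * (Phi @ w1))                      # unary (data) terms
    pairwise = sum(Y[s] @ w2 @ Y[t] for s, t in edges)  # pairwise coupling terms
    return unary + pairwise
```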
$$-\log p(y^o \mid x^o) = -\langle w, \phi(y^o, x^o)\rangle + A_{x^o}(w) = -\langle w, \phi(y^o, x^o)\rangle + \log \sum_y \exp\langle w, \phi(y, x^o)\rangle$$
$$= \log \sum_y \exp\langle w, \phi(y, x^o) - \phi(y^o, x^o)\rangle = \log \sum_y \exp \sum_{c\in\mathcal{C}} \langle w_{\tau_c},\, \phi_c(y_c, x^o) - \phi_c(y^o_c, x^o)\rangle$$
$$= \log \sum_y \exp \sum_{c\in\mathcal{C}} \langle y_c, \theta_{(c)}\rangle, \qquad \theta_{(c)} = \big(\langle w_{\tau_c},\, \phi_c(y_c, x^o) - \phi_c(y^o_c, x^o)\rangle\big)_{y_c \in \mathcal{Y}_c}.$$
$$p(y^o \mid x^o) \propto \exp\Big(\sum_{s\in V}\sum_{k=1}^K y^o_{sk}\, w_{\tau_1,k}^\top \phi_s(x^o) + \sum_{\{s,t\}\in E}\sum_{k,l=1}^K w_{\tau_2,kl}\, y^o_{sk}\, y^o_{tl}\Big)$$
$$= \exp\Big(\sum_{s\in V} \langle w_{\tau_1}, \phi_s(y^o_s, x^o)\rangle + \sum_{\{s,t\}\in E} \langle w_{\tau_2}, \phi_{st}(y^o_s, y^o_t, x^o)\rangle\Big),$$
so that
$$\log p_w(y^o \mid x^o) = \sum_{c\in\mathcal{C}} \langle w_{\tau_c}, \phi_c(x^o, y^o_c)\rangle - \log Z(x^o, w),$$
with $y_{\{s,t\}} = y_s y_t^\top$ and $Z(x^o, w) = \sum_{y_1}\cdots\sum_{y_S} \exp \sum_{c\in\mathcal{C}} \langle w_{\tau_c}, \phi_c(x^o, y_c)\rangle$, and therefore
$$-\log p_w(y^o \mid x^o) = \log \sum_y \exp \sum_{c\in\mathcal{C}} \langle w_{\tau_c},\, \phi_c(x^o, y_c) - \phi_c(x^o, y^o_c)\rangle = \log \sum_y \exp \sum_{c\in\mathcal{C}} \langle \Psi_{(c)}^\top w,\, y_c\rangle =: f(\Psi^\top w) = \log \sum_y \exp \sum_{c\in\mathcal{C}} \langle \theta_{(c)}, y_c\rangle.$$
The regularized maximum likelihood estimation problem
$$\min_w\; -\log p_w(y^o \mid x^o) + \frac{\lambda}{2}\|w\|_2^2$$
is reformulated as
$$\min_w\; f(\Psi^\top w) + \frac{\lambda}{2}\|w\|_2^2 \qquad\text{with}\qquad f(\theta) = \log \sum_y \exp \sum_{c\in\mathcal{C}} \langle \theta_{(c)}, y_c\rangle;$$
$f$ is essentially another way of writing the log-partition function $A$.
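To make $f$ concrete, and to see why it becomes the bottleneck, here is a brute-force evaluation of $f(\theta) = \log\sum_y \exp\sum_c \langle\theta_{(c)}, y_c\rangle$ for a tiny pairwise model (a toy sketch of my own; it enumerates all $K^S$ labelings, which is exactly what becomes intractable on real graphs):

```python
import itertools
import numpy as np

def f_bruteforce(theta_nodes, theta_edges, edges, K):
    """theta_nodes: (S, K) array; theta_edges: dict {(s, t): (K, K) array}."""
    S = theta_nodes.shape[0]
    scores = []
    for y in itertools.product(range(K), repeat=S):      # all K**S labelings
        score = sum(theta_nodes[s, y[s]] for s in range(S))
        score += sum(theta_edges[(s, t)][y[s], y[t]] for s, t in edges)
        scores.append(score)
    m = max(scores)                                       # stable log-sum-exp
    return m + np.log(sum(np.exp(v - m) for v in scores))
```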
Major issue: NP-hardness of inference in graphical models

$f$ and its gradient are NP-hard to compute ⇒ the maximum likelihood estimator is intractable. $f$ or $\nabla f$ can be estimated with MCMC methods that perform approximate inference. Approximate inference can also be cast as an optimization problem and tackled with variational methods.
For the pixel-wise model (option 1 above), the learning problem
$$\min_w\; \sum_{s=1}^S -\log p_w(y^o_s \mid x^o) + \frac{\lambda}{2}\|w\|_2^2$$
takes the form
$$\min_w\; \sum_{s=1}^S f_s(\psi_s^\top w) + \frac{\lambda}{2}\|w\|_2^2 \qquad\text{with}\qquad f_s(\theta_{(s)}) := \log\sum_{y_s}\exp\langle\theta_{(s)}, y_s\rangle.$$
$f_s$ is easy to compute: it is a sum of $K$ terms. The objective is a sum of a large number of terms ⇒ very fast randomized algorithms can be used to solve this problem:
SAG (Roux et al., 2012), SVRG (Johnson and Zhang, 2013), SAGA (Defazio et al., 2014), etc., SDCA (Shalev-Shwartz and Zhang, 2016).
$$\max_{\alpha_1,\dots,\alpha_S}\; -\sum_{s=1}^S f^*_s(\alpha_s) - \frac{1}{2\lambda}\Big\|\sum_{s=1}^S \psi_s\,\alpha_s\Big\|_2^2$$
Could we do the same for CRFs? With SDCA?
$$f(\theta) := \log \sum_y \exp \sum_{c\in\mathcal{C}} \langle \theta_{(c)}, y_c\rangle = \max_{\mu \in \mathcal{M}}\; \langle \mu, \theta\rangle + H_{\text{Shannon}}(\mu),$$
where the marginal polytope $\mathcal{M}$ is the set of all realizable moment vectors,
$$\mathcal{M} := \big\{\mu \;\big|\; \exists\, \text{a distribution on } \mathcal{Y} \text{ s.t. } \forall c \in \mathcal{C},\; \mu_c = \mathbb{E}[Y_c]\big\},$$
and $H_{\text{Shannon}}(\mu)$ is the Shannon entropy of the maximum-entropy distribution with moments $\mu$. With
$$P^\#(w) := f(\Psi^\top w) + \frac{\lambda}{2}\|w\|_2^2, \qquad D^\#(\mu) := H_{\text{Shannon}}(\mu) - \iota_{\mathcal{M}}(\mu) - \frac{1}{2\lambda}\|\Psi\mu\|_2^2,$$
$\min_w P^\#(w)$ and $\max_\mu D^\#(\mu)$ form a pair of primal and dual optimization problems. Both $H_{\text{Shannon}}$ and $\mathcal{M}$ are intractable → NP-hard problem in general.
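As a sanity check on this variational identity (a standard fact, not specific to the talk), for a single node with $K$ classes the marginal polytope is just the probability simplex and the identity reduces to the classical conjugacy between log-sum-exp and the negative Shannon entropy:
$$\log\sum_{k=1}^K e^{\theta_k} \;=\; \max_{\mu \in \triangle_K}\; \langle\mu, \theta\rangle - \sum_{k=1}^K \mu_k\log\mu_k, \qquad \text{attained at } \mu_k = \frac{e^{\theta_k}}{\sum_l e^{\theta_l}}.$$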
A classical relaxation for $\mathcal{M}$: the local polytope $\mathcal{L}$

For $\mathcal{C} = E \cup V$, node and edge simplex constraints:
$$\forall s \in V,\; \triangle_s := \{\mu_s \in \mathbb{R}^K_+ \mid \mu_s^\top \mathbf{1} = 1\}, \qquad \forall \{s,t\} \in E,\; \triangle_{\{s,t\}} := \{\mu_{st} \in \mathbb{R}^{K\times K}_+ \mid \mathbf{1}^\top \mu_{st}\, \mathbf{1} = 1\},$$
$$\mathcal{I} := \{\mu = (\mu_c)_{c\in\mathcal{C}} \mid \mu_c \in \triangle_c\}, \qquad \mathcal{L} := \{\mu \in \mathcal{I} \mid \forall \{s,t\} \in E,\; \mu_{st}\mathbf{1} = \mu_s,\; \mu_{st}^\top \mathbf{1} = \mu_t\} = \mathcal{I} \cap \{\mu \mid A\mu = 0\},$$
for an appropriate definition of $A$ (one possible construction is sketched below).
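The talk leaves $A$ abstract; as an illustration only (my own ordering and construction, not necessarily the paper's), here is one way to build a matrix $A$ such that $A\mu = 0$ encodes the marginalization constraints above, when $\mu$ stacks the $S$ node blocks of size $K$ first and then one row-major $K\times K$ block per edge:

```python
import numpy as np

def marginalization_matrix(S, K, edges):
    """A with A mu = 0 iff mu_st 1 = mu_s and mu_st^T 1 = mu_t for every edge {s, t}."""
    n_mu = S * K + len(edges) * K * K
    rows = []
    for e_idx, (s, t) in enumerate(edges):
        off = S * K + e_idx * K * K          # start of this edge's K*K block
        for k in range(K):
            r = np.zeros(n_mu)               # row constraint: sum_l mu_st[k, l] = mu_s[k]
            r[off + k * K: off + (k + 1) * K] = 1.0
            r[s * K + k] = -1.0
            rows.append(r)
            r = np.zeros(n_mu)               # column constraint: sum_l mu_st[l, k] = mu_t[k]
            r[off + k: off + K * K: K] = 1.0
            r[t * K + k] = -1.0
            rows.append(r)
    return np.vstack(rows)
```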
Various entropy surrogates exist, e.g. the Bethe entropy (nonconvex) and the tree-reweighted entropy (TRW) (concave on $\mathcal{L}$ but not on $\mathcal{I}$).

Separable surrogates $H_{\text{approx}}$: we consider surrogates of the form $H_{\text{approx}}(\mu) = \sum_{c} h_c(\mu_c)$, such that each function $h_c$ is smooth (i.e. has Lipschitz gradients) and concave on $\triangle_c$, and $H_{\text{approx}}$ is strongly concave on $\mathcal{L}$. In particular we propose to use the Gini entropy $h_c(\mu_c) = 1 - \|\mu_c\|_F^2$, a quadratic counterpart of the oriented tree-reweighted entropy.
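To see why this quadratic, separable surrogate keeps the per-clique subproblems cheap, here is a small computation (my own sketch; the actual inner updates in the paper may involve additional linear and quadratic coupling terms): with $h_c(\mu_c) = 1 - \|\mu_c\|_F^2$, the term $-h_c + \iota_{\triangle_c}$ (denoted $f^*_c$ below) has a proximal operator that reduces to a Euclidean projection onto the simplex,
$$\mathrm{Prox}_{\tau(-h_c + \iota_{\triangle_c})}(z) \;=\; \operatorname*{arg\,min}_{\mu\in\triangle_c}\; \tau\|\mu\|_F^2 + \tfrac12\|\mu - z\|_F^2 \;=\; \Pi_{\triangle_c}\!\Big(\frac{z}{1+2\tau}\Big),$$
which can be computed in $O(|\mathcal{Y}_c|\log|\mathcal{Y}_c|)$ time by sorting.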
$$\mathcal{M} \;\xrightarrow{\text{relax to}}\; \mathcal{L} = \mathcal{I} \cap \{\mu \mid A\mu = 0\}, \qquad H_{\text{Shannon}} \;\xrightarrow{\text{relax to}}\; H_{\text{approx}}(\mu) := \sum_c h_c(\mu_c).$$

Problem relaxation
$$D^\#(\mu) := H_{\text{Shannon}}(\mu) - \iota_{\mathcal{M}}(\mu) - \frac{1}{2\lambda}\|\Psi\mu\|_2^2 \quad\xrightarrow{\text{relax to}}\quad D(\mu) := H_{\text{approx}}(\mu) - \iota_{\mathcal{I}}(\mu) - \iota_{\{A\mu=0\}}(\mu) - \frac{1}{2\lambda}\|\Psi\mu\|_2^2$$
so that, with
$$f^*_c(\mu_c) := -h_c(\mu_c) + \iota_{\triangle_c}(\mu_c) \qquad\text{and}\qquad g^*(\mu) = \frac{1}{2\lambda}\|\Psi\mu\|_2^2,$$
we have
$$D(\mu) = -\sum_{c\in\mathcal{C}} f^*_c(\mu_c) - g^*(\mu) - \iota_{\{A\mu=0\}}(\mu).$$
Idea: without the linear constraint $\{A\mu = 0\}$, $D$ has exactly the structure that SDCA can exploit (a sum of per-clique conjugate terms plus a quadratic). We therefore handle the constraint via a dual augmented Lagrangian:
$$D_\rho(\mu, \xi) = -\sum_{c\in\mathcal{C}} f^*_c(\mu_c) - g^*(\mu) - \langle \xi, A\mu\rangle - \frac{1}{2\rho}\|A\mu\|_2^2.$$
By strong duality, we need to solve $\min_\xi d(\xi)$ with $d(\xi) := \max_\mu D_\rho(\mu, \xi)$.
Note that $\nabla d(\xi) = A\mu_\xi$ with $\mu_\xi = \arg\max_\mu D_\rho(\mu, \xi)$.
Combining an inexact dual Lagrangian method with a subsolver $\mathcal{A}$

At epoch $t$:
- Maximize $D_\rho$ partially w.r.t. $\mu$, using a fixed number of steps of a (stochastic) linearly convergent algorithm $\mathcal{A}$, to get $\hat\mu^t$ from $\hat\mu^{t-1}$.
- Take an inexact gradient step on $d$ with $\xi^{t+1} = \xi^t - \frac{1}{L} A\hat\mu^t$.

(A schematic version of this loop is sketched below.)
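Schematically (a sketch of the loop just described, not the paper's implementation; `inner_solver` is a hypothetical callback running $T_{\mathrm{in}}$ stochastic ascent steps on $\mu \mapsto D_\rho(\mu, \xi)$ from a warm start, and `L` is a step-size constant for the outer gradient step):

```python
def idal(mu, xi, A, inner_solver, T_in, L, n_epochs):
    """Inexact dual augmented Lagrangian loop."""
    for _ in range(n_epochs):
        mu = inner_solver(mu, xi, n_steps=T_in)   # partial maximization, warm-started at mu
        xi = xi - (A @ mu) / L                    # inexact gradient step, using grad d = A mu
    return mu, xi
```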
Let $\xi^t$ (resp. $\hat\mu^t$) be the value of $\xi$ (resp. $\mu$) at the end of epoch $t$. Let $\hat\Delta_t := \max_\mu D_\rho(\mu, \xi^t) - D_\rho(\hat\mu^t, \xi^t)$ and $\Gamma_t := d(\xi^t) - d(\xi^*)$. Let $\Delta^0_t := \max_\mu D_\rho(\mu, \xi^t) - D_\rho(\mu^t_0, \xi^t)$.

If the algorithm $\mathcal{A}$ used at epoch $t$ to maximize $D_\rho(\mu, \xi^t)$ w.r.t. $\mu$ is such that $\exists \beta \in (0,1)$ with $\mathbb{E}[\hat\Delta_t] \le \beta\,\mathbb{E}[\Delta^0_t]$, then $\exists\,\kappa \in (0,1)$ characterizing $d$ and $\exists\, C > 0$ such that, if $\mu^t_0 = \hat\mu^{t-1}$, after $T_{\mathrm{ex}}$ epochs
$$\big(\mathbb{E}[\hat\Delta_{T_{\mathrm{ex}}}],\; \mathbb{E}[\Gamma_{T_{\mathrm{ex}}}]\big) \;\le\; C\,\lambda_{\max}(\beta)^{T_{\mathrm{ex}}}\; \big(\mathbb{E}[\hat\Delta_0],\; \mathbb{E}[\Gamma_0]\big) \quad\text{(componentwise)},$$
where $\lambda_{\max}(\beta)$ is the largest eigenvalue of a $2\times 2$ matrix $M(\beta)$ whose entries involve $3\beta$, $1$ and $1-\kappa$ (the exact matrix is given in the paper).
Let $\mathcal{A}$ be an iterative algorithm used to partially solve $\max_\mu D_\rho(\mu, \xi)$. Let $\xi^t$ (resp. $\hat\mu^t$) be the value of $\xi$ (resp. $\mu$) at the end of epoch $t$. Let $\hat\Delta_t := \max_\mu D_\rho(\mu, \xi^t) - D_\rho(\hat\mu^t, \xi^t)$ and $\Gamma_t := d(\xi^t) - d(\xi^*)$.

Proposition: If
- $\mathcal{A}$ is a linearly convergent algorithm,
- at each epoch $t$, $\mathcal{A}$ is initialized with $\hat\mu^{t-1}$ (→ use of warm starts),
- $\mathcal{A}$ is run for a number $T_{\mathrm{in}}$ of iterations fixed in advance at each epoch,

then
- $\hat\Delta_t, \Gamma_t \to 0$ linearly a.s.,
- the residuals $\|A\hat\mu^t\|_2^2 \to 0$ linearly a.s.,
- the smooth part of the objective converges linearly a.s.
Let $P$ be the relaxed primal objective $P(w) := F_{\mathcal{L}}(\Psi^\top w) + \frac{\lambda}{2}\|w\|_2^2$, with $F_{\mathcal{L}}(\theta) := \max_{\mu\in\mathcal{L}}\; \langle\theta, \mu\rangle + H_{\text{approx}}(\mu)$.

Corollary
Let $\hat w^t = -\frac{1}{\lambda}\Psi\hat\mu^t$. If $\mathcal{A}$ is a linearly convergent algorithm and the function $\mu \mapsto -H_{\text{approx}}(\mu) + \frac{1}{2\rho}\|A\mu\|_2^2$ is strongly convex, then $P(\hat w^t) - P(w^\star)$ converges to $0$ linearly a.s. Since a fixed number of inner iterations is performed at each epoch, the linear convergence holds as a function of the total number of clique updates.
A lot of work on approximate inference for CRFs:
Komodakis et al. (2007); Sontag et al. (2008); Savchynskyy et al. (2011)
Learning methods going beyond saddle-point formulations:
Meshi et al. (2010); Hazan and Urtasun (2010); Lacoste-Julien et al. (2013)
Learning in the dual for structured SVMs with only clique-wise updates:
- with relaxation + smoothing of the linear constraints, Meshi et al. (2015), using block-coordinate Frank-Wolfe (BCFW) or block-coordinate ascent;
- with multipliers and a greedy primal-dual algorithm, Yen et al. (2016), who show a global linear convergence result in the dual.
Convergence rates for approximate gradient descent
Schmidt et al. (2011); Devolder et al. (2014); Lan and Monteiro (2016); Lin et al. (2017)
Related work on BCFW with linear constraints: Gidel et al. (2018)
Compared methods:
- SoftBCFW: stochastic block-coordinate Frank-Wolfe + penalty method (Meshi et al., 2015)
- SoftSDCA: stochastic block-coordinate prox ascent + penalty method
- GDMM: dual decomposed learning with factorwise oracle (Yen et al., 2016)
- IDAL: our algorithm
Gaussian mixture Potts model: 10 × 10 grid graph with 5 classes; Gaussian features in $\mathbb{R}^{10}$ ($w_{\tau_1} \in \mathbb{R}^{10\times 5}$, $w_{\tau_2} \in \mathbb{R}^{5\times 5}$); 50 training grids.
Semantic segmentation of images: MSRC-21 dataset (Shotton et al., 2006); 21 classes; 50 features ($w_{\tau_1} \in \mathbb{R}^{50\times 21}$, $w_{\tau_2} \in \mathbb{R}^{21\times 21}$); 335 training images.
[Figure: Gaussian mixture Potts model experiments: bound on duality gap, gap on marginalization constraints, dual objective, and accuracy on test data, comparing SoftBCFW, SoftSDCA, GDMM and IDAL.]
[Figure: MSRC-21 experiments: bound on duality gap, gap on marginalization constraints, dual objective, and accuracy on test data, comparing SoftBCFW, SoftSDCA, GDMM and IDAL.]
We proposed an algorithm combining SDCA and an inexact dual Lagrangian method that obtains:
- global linear convergence for the relaxed objective, both in the primal and for the dual augmented Lagrangian formulation;
- good practical performance.

Other contributions:
- computable duality gaps to track convergence in the primal;
- a representer theorem in the structured learning case, "inside the graph";
- a unified derivation connecting the formulations of previous work;
- SDCA can accommodate linear constraints on the dual parameter.

Open questions:
- Use a better approximation of the entropy, such as OTRW (→ non-Lipschitz gradients)?
- Be stochastic on ξ as well?

Paper:
SDCA-Powered Inexact Dual Augmented Lagrangian Method for Fast CRF Learning, X. Hu, G. Obozinski, AIStats, 2018.
References

Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., and Schiele, B. (2016). The Cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Defazio, A., Bach, F., and Lacoste-Julien, S. (2014). SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, pages 1646–1654.
Devolder, O., Glineur, F., and Nesterov, Y. (2014). First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146(1-2):37–75.
Gidel, G., Pedregosa, F., and Lacoste-Julien, S. (2018). Frank-Wolfe splitting via augmented Lagrangian method. In AIStats.
Hazan, T. and Urtasun, R. (2010). A primal-dual message-passing algorithm for approximated large scale structured prediction. In NIPS, pages 838–846.
Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323.
Komodakis, N., Paragios, N., and Tziritas, G. (2007). MRF optimization via dual decomposition: Message-passing revisited. In ICCV, pages 1–8.
Lacoste-Julien, S., Jaggi, M., Schmidt, M., and Pletscher, P. (2013). Block-coordinate Frank-Wolfe optimization for structural SVMs. In ICML, pages 53–61.
Lan, G. and Monteiro, R. D. (2016). Iteration-complexity of first-order augmented Lagrangian methods for convex programming. Mathematical Programming, 155(1-2):511–547.
Lin, H., Mairal, J., and Harchaoui, Z. (2017). QuickeNing: A generic quasi-Newton algorithm for faster gradient-based optimization. arXiv preprint arXiv:1610.00960.
Meshi, O., Sontag, D., Globerson, A., and Jaakkola, T. S. (2010). Learning efficiently with approximate inference via dual losses. In ICML, pages 783–790.
Meshi, O., Srebro, N., and Hazan, T. (2015). Efficient training of structured SVMs via soft constraints. In AIStats.
Roux, N. L., Schmidt, M., and Bach, F. R. (2012). A stochastic gradient method with an exponential convergence rate for finite training sets. In NIPS, pages 2663–2671.
Savchynskyy, B., Kappes, J., Schmidt, S., and Schnörr, C. (2011). A study of Nesterov's scheme for Lagrangian decomposition and MAP labeling. In CVPR, pages 1817–1823.
Schmidt, M., Le Roux, N., and Bach, F. R. (2011). Convergence rates of inexact proximal-gradient methods for convex optimization. In NIPS, pages 1458–1466.
Shalev-Shwartz, S. and Zhang, T. (2016). Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1-2):105–145.
Shotton, J., Winn, J., Rother, C., and Criminisi, A. (2006). TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In ECCV. Springer.
Sontag, D., Meltzer, T., Globerson, A., Jaakkola, T., and Weiss, Y. (2008). Tightening LP relaxations for MAP using message passing. In UAI, pages 503–510.
Yen, I. E.-H., Huang, X., Zhong, K., Zhang, R., Ravikumar, P. K., and Dhillon, I. S. (2016). Dual decomposed learning with factorwise oracle for structural SVM of large output domain. In NIPS, pages 5024–5032.