SLIDE 1

SDCA-Powered Inexact Dual Augmented Lagrangian Method for Fast CRF Learning

Guillaume Obozinski

Swiss Data Science Center

Joint work with Shell Xu Hu

Imaging and Machine Learning workshop, IHP, April 2nd 2019

SLIDE 2

Outline

1. Motivation and context
2. Formulation for CRF learning
3. Relaxing and reformulating in the dual
4. Dual augmented Lagrangian formulation and algorithm
5. Convergence results
6. Experiments
7. Conclusions

SLIDE 3

A motivating example: semantic segmentation

Cityscapes dataset (Cordts et al., 2016)

SLIDE 4

Recent fast algorithms for large sums of functions

$$\min_w F(w) + \frac{\lambda}{2}\|w\|_2^2 \quad\text{with}\quad F(w) = \sum_{s=1}^n F_s(w)$$

and typically $F_s(w) = f_s(w^\top \phi(x_s)) = \ell(w^\top \phi(x_s), y_s)$.

Stochastic gradient methods with variance reduction

Iterate: pick $s$ at random and update $w^{t+1} = w^t - \eta\, g^t$ with

◮ (SVRG) $g^t = \nabla F_s(w^t) - \nabla F_s(\bar{w}) + \frac{1}{n}\nabla F(\bar{w})$, with $\bar{w} = w^{\mathrm{epoch}}$
◮ (SAG) $g^t = \frac{1}{n}\big(\nabla F_s(w^t) - g_s^{t-1}\big) + \bar{g}^{t-1}$, then $g_s^t = \nabla F_s(w^t)$, where $\bar{g}^{t-1} = \frac{1}{n}\sum_i g_i^{t-1}$
◮ (SAGA) $g^t = \nabla F_s(w^t) - g_s^{t-1} + \frac{1}{n}\sum_i g_i^{t-1}$, then $g_s^t = \nabla F_s(w^t)$

Stochastic Dual Coordinate Ascent (implicit variance reduction)

$$\max_{\alpha_1,\dots,\alpha_n} \; -\sum_{s=1}^n f_s^*(\alpha_s) - \frac{1}{2\lambda}\Big\|\sum_{s=1}^n \phi(x_s)\,\alpha_s\Big\|_2^2$$

Iterate:

$$\alpha_s^{t+1} \leftarrow \operatorname{Prox}_{\frac{\lambda}{L_s} f_s^*}\Big(\alpha_s^t - \frac{1}{L_s}\,\phi(x_s)^\top w^t\Big), \qquad \alpha_i^{t+1} \leftarrow \alpha_i^t, \; \forall i \neq s.$$
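As a concrete illustration of the variance-reduction idea, here is a minimal SVRG sketch on a toy $\ell_2$-regularized least-squares problem; the data, step size $\eta$, and epoch count are illustrative choices, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy l2-regularized least squares:
#   min_w (1/n) sum_s 1/2 (x_s^T w - y_s)^2 + lam/2 ||w||^2
n, d, lam = 200, 5, 0.1
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def grad_s(w, s):                        # gradient of the s-th data term
    return (X[s] @ w - y[s]) * X[s]

w = np.zeros(d)
eta = 0.02                               # illustrative step size
for epoch in range(30):
    w_snap = w.copy()                    # snapshot point (the w-bar of SVRG)
    g_full = X.T @ (X @ w_snap - y) / n  # full data gradient at the snapshot
    for _ in range(n):
        s = rng.integers(n)
        # variance-reduced direction: unbiased, with vanishing variance at w*
        g = grad_s(w, s) - grad_s(w_snap, s) + g_full + lam * w
        w -= eta * g

# Compare with the closed-form ridge solution
w_star = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
print(np.linalg.norm(w - w_star))
```

After a handful of epochs the iterate matches the closed-form solution to high accuracy, which is the practical signature of the linear rates on the next slide.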

SLIDE 5

Variance reduction techniques yield improved rates

$\kappa$: condition number, $d$: ambient dimension. Running times to reach $\mathrm{Obj}(w) - \mathrm{Obj}(w^*) \le \varepsilon$:

Stochastic GD:               $d\,\kappa/\varepsilon$
GD:                          $d\,n\kappa \log(1/\varepsilon)$
Accelerated GD:              $d\,n\sqrt{\kappa} \log(1/\varepsilon)$
SAG(A), SVRG, SDCA, MISO:    $d\,(n + \kappa) \log(1/\varepsilon)$
Accelerated variants:        $d\,(n + \sqrt{n\kappa}) \log(1/\varepsilon)$

Exploiting sum structure yields faster algorithms...

$$\min_w \sum_{s=1}^n \ell(w^\top \phi(x_s), y_s) + \frac{\lambda}{2}\|w\|_2^2$$

[Figure: graphical model with independent outputs y1, ..., yn given inputs x1, ..., xn]

SLIDE 6

Conditional Random Fields

Input image $x$. Features at pixel $s$: $\phi_s(x)$. Encoding of the class at pixel $s$: $y_s = (y_{s1}, \dots, y_{sK})$ with
◮ $y_{sk} = 1$ if pixel $s$ is in class $k$,
◮ $y_{sk} = 0$ otherwise.

Options:

1. Predict each pixel class individually: multiclass logistic regression
$$p(y_s \mid x) \propto \exp\Big(\sum_{k=1}^K y_{sk}\, w_k^\top \phi_s(x)\Big)$$

2. View the image as a grid graph with vertices $V$ and edges $E$, and predict all pixel classes jointly while accounting for dependencies: CRF
$$p(y_1, \dots, y_S \mid x) \propto \exp\Big(\sum_{s \in V}\sum_{k=1}^K y_{sk}\, w_{\tau_1,k}^\top \phi_s(x) + \sum_{\{s,t\} \in E}\sum_{k,l=1}^K w_{\tau_2,kl}\, y_{sk}\, y_{tl}\Big)$$

SLIDE 7

Trick: log-likelihood as log-partition

$$-\log p(y^o \mid x^o) = -\langle w, \phi(y^o, x^o)\rangle + A_{x^o}(w) = -\langle w, \phi(y^o, x^o)\rangle + \log \sum_y \exp \langle w, \phi(y, x^o)\rangle$$
$$= \log \sum_y \exp \langle w, \phi(y, x^o) - \phi(y^o, x^o)\rangle = \log \sum_y \exp \sum_{c \in C} \langle w_{\tau_c},\, \phi_c(y_c, x^o) - \phi_c(y^o_c, x^o)\rangle$$
$$= \log \sum_y \exp \sum_{c \in C} \langle y_c, \theta^{(c)}\rangle
\quad\text{with}\quad
\theta^{(c)} = \Big(\langle w_{\tau_c},\, \phi_c(y'_c, x^o) - \phi_c(y^o_c, x^o)\rangle\Big)_{y'_c \in \mathcal{Y}_c}.$$

SLIDE 8

Conditional Random Fields (slide repeated from Slide 6)

SLIDE 9

Abstract CRF model

$$p(y^o \mid x^o) \propto \exp\Big(\sum_{s \in V}\sum_{k=1}^K y^o_{sk}\, w_{\tau_1,k}^\top \phi_s(x^o) + \sum_{\{s,t\}\in E}\sum_{k,l=1}^K w_{\tau_2,kl}\, y^o_{sk}\, y^o_{tl}\Big)$$

$$p(y^o \mid x^o) \propto \exp\Big(\sum_{s \in V} \langle w_{\tau_1}, \phi_s(y^o_s, x^o)\rangle + \sum_{\{s,t\}\in E} \langle w_{\tau_2}, \phi_{st}(y^o_s, y^o_t, x^o)\rangle\Big)$$

Let $C = V \cup E$. Then
$$\log p_w(y^o \mid x^o) = \sum_{c\in C} \langle w_{\tau_c}, \phi_c(x^o, y^o_c)\rangle - \log Z(x^o, w),$$
with $y_{\{s,t\}} = y_s y_t^\top$ and
$$Z(x^o, w) = \sum_{y_1}\cdots\sum_{y_S} \exp \sum_{c\in C} \langle w_{\tau_c}, \phi_c(x^o, y_c)\rangle.$$

In fact
$$-\log p_w(y^o \mid x^o) = \log \sum_y \exp \sum_{c\in C} \langle w_{\tau_c},\, \phi_c(x^o, y_c) - \phi_c(x^o, y^o_c)\rangle = \log \sum_y \exp \sum_{c\in C} \langle \Psi_{(c)}^\top w,\, y_c\rangle =: f(\Psi^\top w),$$
with $f(\theta) = \log \sum_y \exp \sum_{c\in C} \langle \theta^{(c)}, y_c\rangle$.

SLIDE 10

Regularized maximum likelihood estimation

The regularized maximum likelihood estimation problem
$$\min_w \; -\log p_w(y^o \mid x^o) + \frac{\lambda}{2}\|w\|_2^2$$
is reformulated as
$$\min_w \; f(\Psi^\top w) + \frac{\lambda}{2}\|w\|_2^2$$
with $f(\theta) = \log \sum_y \exp \sum_{c\in C} \langle \theta^{(c)}, y_c\rangle$; $f$ is essentially another way of writing the log-partition function $A$.

Major issue: NP-hardness of inference in graphical models

$f$ and its gradient are NP-hard to compute, so the maximum likelihood estimator is intractable. $f$ and $\nabla f$ can be estimated using MCMC methods to perform approximate inference. Approximate inference can also be cast as an optimization problem and solved with variational methods.

SLIDE 11

Compare with the "disconnected graph" case

$$\min_w \; -\sum_{s=1}^S \log p_w(y^o_s \mid x^o) + \frac{\lambda}{2}\|w\|_2^2 \quad\Longleftrightarrow\quad \min_w \; \sum_{s=1}^S f_s(\psi_s^\top w) + \frac{\lambda}{2}\|w\|_2^2$$

with $f_s(\theta^{(s)}) := \log \sum_{y_s} \exp\langle \theta^{(s)}, y_s\rangle$. Each $f_s$ is easy to compute: a sum of $K$ terms. The objective is a sum of a large number of terms, so very fast randomized algorithms can be used to solve this problem:

SAG (Roux et al., 2012), SVRG (Johnson and Zhang, 2013), SAGA (Defazio et al., 2014), etc., and SDCA (Shalev-Shwartz and Zhang, 2016):

$$\max_{\alpha_1,\dots,\alpha_S} \; -\sum_{s=1}^S f_s^*(\alpha_s) - \frac{1}{2\lambda}\Big\|\sum_{s=1}^S \psi_s\, \alpha_s\Big\|_2^2$$

Could we do the same for CRFs? With SDCA?

SLIDE 12

Fenchel conjugate of the log-partition function

$$f(\theta) := \log \sum_y \exp \sum_{c\in C} \langle \theta^{(c)}, y_c\rangle = \max_{\mu \in \mathcal{M}} \; \langle \mu, \theta\rangle + H_{\mathrm{Shannon}}(\mu),$$

The marginal polytope $\mathcal{M}$ is the set of all realizable moment vectors:
$$\mathcal{M} := \big\{\mu = (\mu_c)_{c\in C} \mid \exists\, Y \text{ s.t. } \forall c \in C,\ \mu_c = \mathbb{E}[Y_c]\big\}.$$

$H_{\mathrm{Shannon}}(\mu)$ is the Shannon entropy of the maximum-entropy distribution with moments $\mu$.

$$P^\#(w) := f(\Psi^\top w) + \frac{\lambda}{2}\|w\|_2^2 \qquad\qquad D^\#(\mu) := H_{\mathrm{Shannon}}(\mu) - \iota_{\mathcal{M}}(\mu) - \frac{1}{2\lambda}\|\Psi\mu\|_2^2$$

$\min_w P^\#(w)$ and $\max_\mu D^\#(\mu)$ form a pair of primal and dual optimization problems. Both $H_{\mathrm{Shannon}}$ and $\mathcal{M}$ are intractable → NP-hard problem in general.
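On a graph small enough to enumerate, the conjugacy relation above can be checked numerically: the maximum is attained at the exact clique marginals $\mu$ of $p_\theta$, and $H_{\mathrm{Shannon}}(\mu)$ is then the entropy of $p_\theta$ itself. A brute-force sketch on a 3-node chain (graph size and parameters are illustrative):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# Tiny CRF: a chain of S = 3 nodes with K = 2 classes; cliques C = nodes + edges
S, K = 3, 2
edges = [(0, 1), (1, 2)]
theta_node = rng.normal(size=(S, K))               # theta^{(s)}
theta_edge = rng.normal(size=(len(edges), K, K))   # theta^{(st)}

def score(yy):                                     # sum_c <theta^{(c)}, y_c>
    val = sum(theta_node[i, yy[i]] for i in range(S))
    val += sum(theta_edge[e, yy[a], yy[b]] for e, (a, b) in enumerate(edges))
    return val

# Brute-force log-partition f(theta) over all K^S labelings
configs = list(product(range(K), repeat=S))
scores = np.array([score(yy) for yy in configs])
f_theta = np.log(np.exp(scores).sum())

# Exact distribution p_theta, the value <mu, theta> at its clique marginals
# (which equals E_p[score]), and its Shannon entropy
p = np.exp(scores - f_theta)
inner = float(p @ scores)
H = float(-(p * np.log(p)).sum())

# Variational identity, attained at mu = marginals of p_theta:
#   f(theta) = <mu, theta> + H_Shannon(mu)
print(f_theta, inner + H)
```

The two printed values agree to machine precision; what makes the general problem NP-hard is that neither the enumeration nor the entropy term is available for large graphs.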

SLIDE 13

Relaxing the marginal polytope into the local polytope

A classical relaxation for $\mathcal{M}$: the local polytope $\mathcal{L}$. For $C = V \cup E$, node and edge simplex constraints:

$$\forall s \in V,\quad \triangle_s := \{\mu_s \in \mathbb{R}^K_+ \mid \mu_s^\top \mathbf{1} = 1\}, \qquad \forall \{s,t\} \in E,\quad \triangle_{\{s,t\}} := \{\mu_{st} \in \mathbb{R}^{K\times K}_+ \mid \mathbf{1}^\top \mu_{st} \mathbf{1} = 1\}.$$

$$\mathcal{I} := \{\mu = (\mu_c)_{c\in C} \mid \forall c \in C,\ \mu_c \in \triangle_c\}$$
$$\mathcal{L} := \{\mu \in \mathcal{I} \mid \forall \{s,t\} \in E,\ \mu_{st}\mathbf{1} = \mu_s,\ \mu_{st}^\top \mathbf{1} = \mu_t\}$$
$$\mathcal{L} = \mathcal{I} \cap \{\mu \mid A\mu = 0\}$$

for an appropriate definition of A...
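One possible instantiation of such an $A$, for a single edge $\{s,t\}$ with $K$ classes and $\mu$ stacked as $(\mu_s, \mu_t, \mathrm{vec}(\mu_{st}))$ row-major, encodes the row-sum and column-sum constraints above. The stacking order is an illustrative choice, not the talk's:

```python
import numpy as np

rng = np.random.default_rng(0)

# A for a single edge {s, t} with K classes, with mu stacked as
# (mu_s, mu_t, vec(mu_st)) and mu_st flattened row-major.
K = 2
A = np.zeros((2 * K, 2 * K + K * K))
for k in range(K):                    # row sums:    (mu_st 1)_k   = (mu_s)_k
    A[k, k] = -1.0
    A[k, 2 * K + k * K: 2 * K + (k + 1) * K] = 1.0
for l in range(K):                    # column sums: (mu_st^T 1)_l = (mu_t)_l
    A[K + l, K + l] = -1.0
    A[K + l, 2 * K + l:: K] = 1.0

# The marginals of any joint distribution on the edge satisfy A mu = 0
joint = rng.random((K, K))
joint /= joint.sum()
mu = np.concatenate([joint.sum(axis=1), joint.sum(axis=0), joint.ravel()])
print(np.abs(A @ mu).max())
```

For a full grid graph, $A$ is the block-diagonal-like stacking of one such block per edge, sharing the node columns.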

SLIDE 14

Surrogates for the entropy

Various entropy surrogates exist, e.g. the Bethe entropy (nonconvex) and the tree-reweighted entropy (TRW) (concave on $\mathcal{L}$ but not on $\mathcal{I}$).

Separable surrogates $H_{\mathrm{approx}}$

We consider surrogates of the form $H_{\mathrm{approx}}(\mu) = \sum_{c\in C} h_c(\mu_c)$, such that each function $h_c$ is smooth (i.e. has Lipschitz gradients) and concave on $\triangle_c$, and $H_{\mathrm{approx}}$ is strongly concave on $\mathcal{L}$. In particular we propose to use the Gini entropy
$$h_c(\mu_c) = 1 - \|\mu_c\|_F^2,$$
a quadratic counterpart of the oriented tree-reweighted entropy.

SLIDE 15

Relaxed dual problem

$$\mathcal{M} \;\xrightarrow{\text{relax to}}\; \mathcal{L} = \mathcal{I} \cap \{\mu \mid A\mu = 0\} \qquad\qquad H_{\mathrm{Shannon}} \;\xrightarrow{\text{relax to}}\; H_{\mathrm{approx}}(\mu) := \sum_{c\in C} h_c(\mu_c).$$

Problem relaxation:

$$D^\#(\mu) := H_{\mathrm{Shannon}}(\mu) - \iota_{\mathcal{M}}(\mu) - \frac{1}{2\lambda}\|\Psi\mu\|_2^2$$

relaxes to

$$D(\mu) := H_{\mathrm{approx}}(\mu) - \iota_{\mathcal{I}}(\mu) - \iota_{\{A\mu=0\}} - \frac{1}{2\lambda}\|\Psi\mu\|_2^2$$

so that with $f_c^*(\mu_c) := -h_c(\mu_c) + \iota_{\triangle_c}(\mu_c)$ and $g^*(\mu) = \frac{1}{2\lambda}\|\Psi\mu\|_2^2$ we have

$$D(\mu) = -\sum_{c\in C} f_c^*(\mu_c) - g^*(\mu) - \iota_{\{A\mu=0\}}.$$

SLIDE 16

A dual augmented Lagrangian formulation

$$D(\mu) = -\sum_{c\in C} f_c^*(\mu_c) - g^*(\mu) - \iota_{\{A\mu=0\}}$$

Idea: without the linear constraint, we could exploit the form of the objective to use a fast algorithm such as stochastic dual coordinate ascent.

$$D_\rho(\mu, \xi) = -\sum_{c\in C} f_c^*(\mu_c) - g^*(\mu) - \langle \xi, A\mu\rangle - \frac{1}{2\rho}\|A\mu\|_2^2$$

By strong duality, we need to solve $\min_\xi d(\xi)$ with $d(\xi) := \max_\mu D_\rho(\mu, \xi)$.

SLIDE 17

The algorithm

We need to solve $\min_\xi d(\xi)$ with $d(\xi) := \max_\mu D_\rho(\mu, \xi)$, where
$$D_\rho(\mu, \xi) = -\sum_{c\in C} f_c^*(\mu_c) - g^*(\mu) - \langle \xi, A\mu\rangle - \frac{1}{2\rho}\|A\mu\|_2^2.$$

Note that we have $\nabla d(\xi) = A\mu_\xi$ with $\mu_\xi = \arg\max_\mu D_\rho(\mu, \xi)$.

Combining an inexact dual augmented Lagrangian method with a subsolver $\mathcal{A}$. At epoch $t$:
◮ Maximize $D_\rho$ partially w.r.t. $\mu$, using a fixed number of steps of a (stochastic) linearly convergent algorithm $\mathcal{A}$ started from $\hat\mu_{t-1}$, to get $\hat\mu_t$.
◮ Take an inexact gradient step on $d$: $\xi_{t+1} = \xi_t - \frac{1}{L} A\hat\mu_t$.
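The two-step scheme above can be sketched on a toy quadratic stand-in for $D_\rho$, with exact coordinate sweeps playing the role of the subsolver $\mathcal{A}$. The instance, the sign convention for the multiplier term, and the step size are illustrative assumptions, not the talk's setting:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the dual: maximize  b^T mu - 1/2 mu^T Q mu  s.t.  A mu = 0.
# Q plays the role of the curvature coming from -H_approx and g*,
# A that of the marginalization constraints.
n, m, rho = 8, 3, 1.0
G = rng.normal(size=(n, n))
Q = G @ G.T + n * np.eye(n)              # positive-definite curvature
A = rng.normal(size=(m, n))
b = rng.normal(size=n)

# Augmented objective  D_rho(mu, xi) = b^T mu - 1/2 mu^T Q mu
#                                      + <xi, A mu> - 1/(2 rho) ||A mu||^2.
# (The sign of the multiplier term is a convention; this choice gives
#  grad d(xi) = A mu_xi, matching the update xi <- xi - (1/L) A mu.)
H = Q + (A.T @ A) / rho                  # curvature of -D_rho in mu

def subsolver(mu, xi, n_sweeps):
    """Inexact maximization of D_rho in mu: a few exact coordinate sweeps."""
    c = b + A.T @ xi                     # linear term of D_rho in mu
    for _ in range(n_sweeps):
        for i in range(len(mu)):
            mu[i] += (c[i] - H[i] @ mu) / H[i, i]
    return mu

# Step size 1/L for the outer gradient steps on d
L = np.linalg.norm(A @ np.linalg.solve(H, A.T), 2)

mu, xi = np.zeros(n), np.zeros(m)
for epoch in range(1000):
    mu = subsolver(mu, xi, n_sweeps=2)   # warm-started, fixed inner budget
    xi = xi - (A @ mu) / L               # inexact gradient step on d

# Reference: exact equality-constrained maximizer via the KKT system
KKT = np.block([[Q, A.T], [A, np.zeros((m, m))]])
mu_star = np.linalg.solve(KKT, np.concatenate([b, np.zeros(m)]))[:n]
print(np.linalg.norm(A @ mu), np.linalg.norm(mu - mu_star))
```

Warm-starting the subsolver at $\hat\mu_{t-1}$ is what allows a fixed inner budget to suffice; this is exactly the assumption of the lemma on the next slide.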

SLIDE 18

Main technical lemma

Let $\xi_t$ (resp. $\hat\mu_t$) be the value of $\xi$ (resp. $\mu$) at the end of epoch $t$. Let
$$\hat\Delta_t := \max_\mu D_\rho(\mu, \xi_t) - D_\rho(\hat\mu_t, \xi_t), \qquad \Gamma_t := d(\xi_t) - d(\xi^*), \qquad \Delta^0_t := \max_\mu D_\rho(\mu, \xi_t) - D_\rho(\mu^0_t, \xi_t).$$

If the algorithm $\mathcal{A}$ used at epoch $t$ to maximize $D_\rho(\mu, \xi)$ w.r.t. $\mu$ is such that $\exists\, \beta \in (0,1)$ with $\mathbb{E}[\hat\Delta_t] \le \beta\, \mathbb{E}[\Delta^0_t]$, then $\exists\, \kappa \in (0,1)$ characterizing $d$ and $\exists\, C > 0$ such that, if $\mu^0_t = \hat\mu_{t-1}$,
$$\begin{pmatrix} \mathbb{E}[\hat\Delta_{T_{\mathrm{ex}}}] \\ \mathbb{E}[\Gamma_{T_{\mathrm{ex}}}] \end{pmatrix} \le C\, \lambda_{\max}(\beta)^{T_{\mathrm{ex}}} \begin{pmatrix} \mathbb{E}[\hat\Delta_0] \\ \mathbb{E}[\Gamma_0] \end{pmatrix},$$
where $\lambda_{\max}(\beta)$ is the largest eigenvalue of a $2 \times 2$ matrix $M(\beta)$ whose entries involve $3\beta$, $1$, and $1-\kappa$.

SLIDE 19

Main theoretical result: linear convergence in the dual

Let $\mathcal{A}$ be an iterative algorithm used to partially solve $\max_\mu D_\rho(\mu, \xi)$. Let $\xi_t$ (resp. $\hat\mu_t$) be the value of $\xi$ (resp. $\mu$) at the end of epoch $t$, and let $\hat\Delta_t := \max_\mu D_\rho(\mu, \xi_t) - D_\rho(\hat\mu_t, \xi_t)$ and $\Gamma_t := d(\xi_t) - d(\xi^*)$.

Proposition: If at each epoch $t$,
◮ $\mathcal{A}$ is a linearly convergent algorithm,
◮ $\mathcal{A}$ is initialized with $\hat\mu_{t-1}$ (→ use of warm starts),
◮ $\mathcal{A}$ is run for a number of iterations $T_{\mathrm{in}}$ fixed ahead of time,
then
◮ $\hat\Delta_t$ and $\Gamma_t \xrightarrow{\text{a.s.}} 0$ linearly,
◮ the residuals $\|A\hat\mu_t\|_2^2 \xrightarrow{\text{a.s.}} 0$ linearly,
◮ the smooth part of the objective converges linearly a.s.

SLIDE 20

Global linear convergence in the primal

Let $P$ be the relaxed primal objective
$$P(w) := F_{\mathcal{L}}(\Psi^\top w) + \frac{\lambda}{2}\|w\|_2^2, \qquad\text{with}\qquad F_{\mathcal{L}}(\theta) := \max_{\mu\in\mathcal{L}} \; \langle \theta, \mu\rangle + H_{\mathrm{approx}}(\mu).$$

Corollary

Let $\hat w_t = -\frac{1}{\lambda}\Psi\hat\mu_t$. If $\mathcal{A}$ is a linearly convergent algorithm and the function $\mu \mapsto -H_{\mathrm{approx}}(\mu) + \frac{1}{2\rho}\|A\mu\|_2^2$ is strongly convex, then $P(\hat w_t) - P(w^\star)$ converges to 0 linearly a.s.

Since a fixed number of inner iterations is performed at each epoch, the linear convergence holds as a function of the total number of clique updates.

SLIDE 21

Related work

A lot of work on approximate inference for CRFs: Komodakis et al. (2007); Sontag et al. (2008); Savchynskyy et al. (2011)

Learning methods going beyond saddle-point formulations: Meshi et al. (2010); Hazan and Urtasun (2010); Lacoste-Julien et al. (2013)

Learning in the dual for structured SVMs with only clique-wise updates:
◮ With relaxation and smoothing of the linear constraints: Meshi et al. (2015), using block-coordinate Frank-Wolfe (BCFW) or block coordinate ascent.
◮ With multipliers and a greedy primal-dual algorithm: Yen et al. (2016), who show a global linear convergence result in the dual.

Convergence rates for approximate gradient descent: Schmidt et al. (2011); Devolder et al. (2014); Lan and Monteiro (2016); Lin et al. (2017)

Related work on BCFW with linear constraints: Gidel et al. (2018)

SLIDE 22

Experiments: Algorithms

◮ SoftBCFW: stochastic block-coordinate Frank-Wolfe + penalty method (Meshi et al., 2015)
◮ SoftSDCA: stochastic block-coordinate proximal ascent + penalty method
◮ GDMM: dual decomposed learning with factorwise oracle (Yen et al., 2016)
◮ IDAL: our algorithm

SLIDE 23

Datasets

Gaussian mixture Potts model:
◮ 10 × 10 grid graph with 5 classes
◮ Gaussian features in $\mathbb{R}^{10}$ ($w_{\tau_1} \in \mathbb{R}^{10\times 5}$, $w_{\tau_2} \in \mathbb{R}^{5\times 5}$)
◮ 50 training grids

Semantic segmentation of images:
◮ MSRC-21 dataset (Shotton et al., 2006), 21 classes
◮ 50 features ($w_{\tau_1} \in \mathbb{R}^{50\times 21}$, $w_{\tau_2} \in \mathbb{R}^{21\times 21}$)
◮ 335 training images

SLIDE 24

Results for Gaussian mixture Potts model (λ = 10, ρ = 1)

[Figure: four panels comparing SoftBCFW, SoftSDCA, GDMM, and IDAL — bound on duality gap, gap on marginalization constraints, dual objective, and accuracy on test data.]

SLIDE 25

Results on the segmentation dataset, max-margin variant (λ = 1, ρ = 0.1)

[Figure: four panels comparing SoftBCFW, SoftSDCA, GDMM, and IDAL — bound on duality gap, gap on marginalization constraints, dual objective, and accuracy on test data.]

SLIDE 26

Summary and conclusions

We proposed an algorithm combining SDCA and an inexact dual augmented Lagrangian method that
◮ obtains global linear convergence for the relaxed objective, both in the primal and for the dual augmented Lagrangian formulation,
◮ obtains good practical performance.

Other contributions:
◮ Computable duality gaps to track convergence in the primal
◮ A representer theorem in the structured learning case, "inside the graph"
◮ A unified derivation connecting the formulations of previous work
◮ SDCA can accommodate linear constraints on the dual parameter

Open questions:
◮ Use a better approximation of the entropy, like OTRW (→ non-Lipschitz gradients)?
◮ Be stochastic on ξ as well?

Paper: SDCA-Powered Inexact Dual Augmented Lagrangian Method for Fast CRF Learning, X. Hu and G. Obozinski, AISTATS 2019.

SLIDE 27

References I

Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., and Schiele, B. (2016). The Cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Defazio, A., Bach, F., and Lacoste-Julien, S. (2014). SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, pages 1646–1654.
Devolder, O., Glineur, F., and Nesterov, Y. (2014). First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146(1-2):37–75.
Gidel, G., Pedregosa, F., and Lacoste-Julien, S. (2018). Frank-Wolfe splitting via augmented Lagrangian method. In AISTATS.
Hazan, T. and Urtasun, R. (2010). A primal-dual message-passing algorithm for approximated large scale structured prediction. In NIPS, pages 838–846.
Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323.
Komodakis, N., Paragios, N., and Tziritas, G. (2007). MRF optimization via dual decomposition: Message-passing revisited. In ICCV, pages 1–8.
Lacoste-Julien, S., Jaggi, M., Schmidt, M., and Pletscher, P. (2013). Block-coordinate Frank-Wolfe optimization for structural SVMs. In ICML, pages 53–61.
Lan, G. and Monteiro, R. D. (2016). Iteration-complexity of first-order augmented Lagrangian methods for convex programming. Mathematical Programming, 155(1-2):511–547.
Lin, H., Mairal, J., and Harchaoui, Z. (2017). QuickeNing: A generic quasi-Newton algorithm for faster gradient-based optimization. arXiv preprint arXiv:1610.00960.

SLIDE 28

References II

Meshi, O., Sontag, D., Globerson, A., and Jaakkola, T. S. (2010). Learning efficiently with approximate inference via dual losses. In ICML, pages 783–790.
Meshi, O., Srebro, N., and Hazan, T. (2015). Efficient training of structured SVMs via soft constraints. In AISTATS, pages 699–707.
Roux, N. L., Schmidt, M., and Bach, F. R. (2012). A stochastic gradient method with an exponential convergence rate for finite training sets. In NIPS, pages 2663–2671.
Savchynskyy, B., Kappes, J., Schmidt, S., and Schnörr, C. (2011). A study of Nesterov's scheme for Lagrangian decomposition and MAP labeling. In CVPR, pages 1817–1823.
Schmidt, M., Le Roux, N., and Bach, F. R. (2011). Convergence rates of inexact proximal-gradient methods for convex optimization. In NIPS, pages 1458–1466.
Shalev-Shwartz, S. and Zhang, T. (2016). Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1-2):105–145.
Shotton, J., Winn, J., Rother, C., and Criminisi, A. (2006). TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In ECCV. Springer.
Sontag, D., Meltzer, T., Globerson, A., Jaakkola, T., and Weiss, Y. (2008). Tightening LP relaxations for MAP using message passing. In UAI, pages 503–510.
Yen, I. E.-H., Huang, X., Zhong, K., Zhang, R., Ravikumar, P. K., and Dhillon, I. S. (2016). Dual decomposed learning with factorwise oracle for structural SVM of large output domain. In NIPS, pages 5024–5032.