

SLIDE 1

Safe screening for the generalized conditional gradient method

Yifan Sun, joint work with Francis Bach. INRIA Paris, March 2020.

1 / 20

SLIDE 2

Problem statement

minimize_x  f(x) + φ(κ(x))

  f : R^n → R    convex and smooth
  κ : R^n → R_+  gauge function (norm-like function that promotes sparsity)
  φ : R_+ → R_+  convex, monotonically nondecreasing

Examples:

  min_x  Σ_i log(1 + exp(−b_i a_i^T x)) + ‖x‖₁²              sparse logistic regression

  min_x  ½ x^T Q x − 1^T x   s.t.  0 ≤ x ≤ C                  support vector machine

  min_x  L(Kx − b)   s.t.  Σ_{i=2}^n |x_i − x_{i−1}| ≤ ε      image denoising

2 / 20

SLIDE 3

Generalized conditional gradient method (gCGM)

minimize_x  f(x) + φ(κ(x)),    writing h(x) := φ(κ(x))

Scheme:

  s^(t) = argmin_s  ∇f(x^(t))^T s + h(s)         (generalized LMO)
  x^(t+1) = (1 − θ^(t)) x^(t) + θ^(t) s^(t)       (convex merging)

If h(x) constrains x ∈ P → vanilla CGM; the LMO is easy for many sparse norms (ℓ₁ norm, nuclear norm, group norms, ...).

CGM: Frank & Wolfe ’56; Dunn & Harshbarger ’78; Clarkson ’10; Lacoste-Julien & Jaggi ’13; ...  gCGM: Yu et al. ’17; Harchaoui et al. ’15; Bredies & Lorenz ’08; Bach ’12; ...
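To make the scheme concrete, here is a minimal sketch in Python; the generalized LMO is passed in as a callable (the names gcgm, grad_f, and gen_lmo are illustrative, not from the talk), and the open-loop step size θ^(t) = 2/(t + 1) used later in the talk is assumed.

```python
import numpy as np

def gcgm(grad_f, gen_lmo, x0, iters=1000):
    """Generalized conditional gradient (sketch).

    grad_f:  callable, x -> ∇f(x)
    gen_lmo: callable, g -> argmin_s  g^T s + h(s)   (generalized LMO)
    """
    x = np.asarray(x0, dtype=float).copy()
    for t in range(1, iters + 1):
        s = gen_lmo(grad_f(x))             # generalized LMO step
        theta = 2.0 / (t + 1)              # open-loop step size θ^(t)
        x = (1.0 - theta) * x + theta * s  # convex merging
    return x
```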

3 / 20

SLIDE 4

Goals

min_x  f(x) + φ(κ(x))
       (loss)   (sparsity)

• Screening: Given x^(t) → x*, where x* is a sparse combination of essential atoms, can I omit nonessential atoms based on x^(t)?

• Method: Is gCGM well-defined? What kinds of φ can we use?

• Support recovery of screened method: Is there a finite time t when we can exactly recover the support of x* from x^(t)?

4 / 20


SLIDE 6

Generalized sparsity

Atoms: P = {p₁, ..., p_m} ⊂ R^n (e.g. vertices of a polytope, basis vectors, eigenvectors)

Gauge function: κ_P(x) is the optimal value of

  minimize_{c_i ≥ 0}  Σ_{i=1}^m c_i   s.t.  Σ_{i=1}^m c_i p_i = x

• includes norms, seminorms, convex cone indicators
• positively homogeneous, subadditive, not necessarily symmetric

Examples:
• P = columns of a matrix P, then κ_P(x) = ‖P⁻¹x‖₁
• P = {βe₁, ..., βe_n}, then κ_P(x) = β‖x‖₁ + δ₊(x)
• P = {±(e₂ − e₁), ..., ±(e_n − e_{n−1})}, then κ_P(x) = Σ_{i=2}^n |x_i − x_{i−1}|   (smoothing)

Freund ’87, Chandrasekaran ’12, Friedlander ’13
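Since κ_P(x) is itself a linear program when P is finite, it can be evaluated directly; a minimal sketch, assuming the atoms are stored as the columns of a matrix and using scipy.optimize.linprog (the function name gauge is illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def gauge(P, x):
    """Evaluate κ_P(x) = min { Σ c_i : Σ c_i p_i = x, c ≥ 0 } (sketch).

    P: (n, m) array whose m columns are the atoms p_i.
    Returns +inf when x is not a conic combination of the atoms.
    """
    m = P.shape[1]
    res = linprog(c=np.ones(m), A_eq=P, b_eq=x, bounds=[(0, None)] * m)
    return res.fun if res.success else np.inf
```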

5 / 20

SLIDE 7

Generalized screening

  minimize_{c_i ≥ 0}  Σ_{i=1}^m c_i   s.t.  Σ_{i=1}^m c_i p_i = x      (1)

Support (may not be unique):  supp_P(x) = {p_i ∈ P : c_i > 0 in (1)}

Screening: Given x ≈ x*, can we safely guarantee that some p ∈ P is not in supp_P(x*)?

El Ghaoui et al. ’12; Fercoq et al. ’15; Xiang & Ramadge ’12; Wang et al. ’14; Liu et al. ’13; Malti & Herzet ’16; Ndiaye et al. ’15; Bonnefoy et al. ’15; Zhou & Zhao ’15; among many others

6 / 20

SLIDE 8

Support recovery optimality condition

minimize_x  f(x) + φ(κ_P(x))

Property
If p ∈ supp_P(x*) then −∇f(x*)^T p = max_{p′∈P} −∇f(x*)^T p′.

Example: Squared penalty

  min_x  f(x) + ½‖x‖₁²

Then at optimality, with z* := −∇f(x*),

  |z*_i| = ‖x*‖₁   if x*_i ≠ 0
  |z*_i| ≤ ‖x*‖₁   if x*_i = 0.

7 / 20

SLIDE 9

Support recovery optimality condition

minimize_x  f(x) + φ(κ_P(x))

Property
If p ∈ supp_P(x*) then −∇f(x*)^T p = max_{p′∈P} −∇f(x*)^T p′.

Example: Polyhedral constraint

  min_x  f(x)   subject to  ‖Ax‖₁ ≤ c

Then P = cols(A⁻¹), and writing a_i^T for the rows of A and b_i for the columns of B := A⁻¹,

  a_i^T x* ≠ 0  ⇒  |b_i^T ∇f(x*)| = ‖B^T ∇f(x*)‖_∞.

Proof: Normal cone condition: −∇f(x*)^T y ≤ −∇f(x*)^T x* for all y with ‖Ay‖₁ ≤ c. At optimality −∇f(x*) = A^T u for some u, so

  max_{y : ‖Ay‖₁ ≤ c}  u^T A y = c‖u‖_∞   (attainable),

and optimality requires u^T A x* = c‖u‖_∞. Since u^T A x* ≤ ‖u‖_∞ ‖Ax*‖₁ ≤ c‖u‖_∞, both inequalities are tight, which forces |u_i| = ‖u‖_∞ whenever a_i^T x* ≠ 0; since b_i^T ∇f(x*) = −u_i, the claim follows.

7 / 20

SLIDE 10

Observation for gradient screening

minimize_x  f(x) + φ(‖x‖₁)

Optimality conditions:

  |∇f(x*)_i| < ‖∇f(x*)‖_∞  ⇒  x*_i = 0

If x^(t) ≈ x*, then by smoothness ∇f(x^(t)) ≈ ∇f(x*).
If ‖∇f(x^(t)) − ∇f(x*)‖_∞ ≤ ε, then

  ‖∇f(x^(t))‖_∞ − |∇f(x^(t))_i| > 2ε  ⇒  x*_i = 0
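As a sketch of this test in Python (the function name is illustrative; ε is assumed to be a known bound on ‖∇f(x^(t)) − ∇f(x*)‖_∞):

```python
import numpy as np

def l1_gradient_screen(grad, eps):
    """Screening for min f(x) + φ(||x||_1) (sketch).

    grad: ∇f(x^(t));  eps: bound on ||∇f(x^(t)) - ∇f(x*)||_∞.
    Returns a boolean mask: True  =>  x*_i = 0 is guaranteed.
    """
    g = np.abs(grad)
    return g.max() - g > 2.0 * eps
```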

8 / 20

SLIDE 11

Observation for gradient screening (generalized)

minimize_x  f(x) + φ(κ_P(x))

Optimality conditions:

  −∇f(x*)^T p < max_{p′∈P} −∇f(x*)^T p′  ⇒  p ∉ supp_P(x*)

Measure of gradient error (the "dual gauge"), with P̄ := P ∪ −P:

  σ°_P̄(z − z*) := max_{p∈P̄} (z − z*)^T p

If σ°_P̄(∇f(x) − ∇f(x*)) ≤ ε, then

  max_{p′∈P} |∇f(x)^T p′| − |∇f(x)^T p| > 2ε  ⇒  p ∉ supp_P(x*)

If I know ε, this condition does not depend on x*!

9 / 20

SLIDE 12

Gap bounds gradient error

Primal-dual pair:

  (P)  min_x  Ψ(x) := f(x) + h(x)
  (D)  max_z  Ω(z) := −f*(−z) − h*(z)

where f*(z) = sup_x x^T z − f(x) is the convex conjugate.

The gap bounds the gradient error:

  gap(x, −∇f(x)) = Ψ(x) − Ω(−∇f(x))
                 ≥ (x − x*)^T (∇f(x) − ∇f(x*))
                 ≥ (1/L) σ°_P̄(∇f(x*) − ∇f(x))²     (smooth f)

Smoothness: with P̄ = P ∪ −P,

  f(x) − f(y) ≤ ∇f(y)^T (x − y) + (L/2) κ_P̄(x − y)²,   ∀x, y.

10 / 20

SLIDE 13

Theorem (Screening)

For any x and any p ∈ P,

  max_{p′∈P} −∇f(x)^T (p′ − p) > 2 √(L · gap(x, −∇f(x)))

implies that p ∉ supp_P(x*).

Theorem (Support recovery)

The support is recovered when

  √(L · min_{i≤t} gap(x^(i), −∇f(x^(i)))) < δ/4,

where

  δ = min_{p ∈ supp_P(x*), p′ ∉ supp_P(x*)}  −∇f(x*)^T (p − p′).

See also El Ghaoui et al. ’12; Ndiaye et al. ’15; ...
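A sketch of the resulting safe rule in Python, assuming the atoms are the columns of an array P and that the duality gap and smoothness constant L are computed elsewhere (the function name is illustrative):

```python
import numpy as np

def safe_screen(grad, P, L, gap):
    """Gap-based safe screening (sketch of the theorem above).

    grad: ∇f(x);  P: (n, m) array of atoms;  L: smoothness constant;
    gap:  gap(x, -∇f(x)).
    Returns a boolean mask: True  =>  p_i is provably not in supp_P(x*).
    """
    scores = -grad @ P                    # -∇f(x)^T p_i for each atom
    threshold = 2.0 * np.sqrt(L * gap)    # 2 √(L · gap)
    return scores.max() - scores > threshold
```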

11 / 20

SLIDE 14

Goals

min_x  f(x) + φ(κ(x))
       (loss)   (sparsity)

• Screening: Given x^(t) → x*, where x* is a sparse combination of essential atoms, can I omit nonessential atoms based on x^(t)?

• Method: Is gCGM well-defined? What kinds of φ can we use?

• Support recovery of screened method: Is there a finite time t when we can exactly recover the support of x* from x^(t)?

12 / 20


SLIDE 16

Conditional gradient method (Warm-up)

minimize_{x∈P}  f(x)

Vanilla CGM (P = {x : ‖x‖₁ ≤ 1}):

  s^(t) = argmin_{s∈P} ∇f(x^(t))^T s                              (LMO)
        = −sign(∇f(x^(t))_k) e_k,  where |∇f(x^(t))_k| = ‖∇f(x^(t))‖_∞
  x^(t+1) = (1 − θ^(t)) x^(t) + θ^(t) s^(t)
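A minimal sketch of this LMO in Python (the function name is illustrative):

```python
import numpy as np

def lmo_l1(grad):
    """LMO for the unit l1 ball: argmin_{||s||_1 <= 1} grad^T s (sketch)."""
    k = int(np.argmax(np.abs(grad)))   # coordinate achieving ||grad||_∞
    s = np.zeros_like(grad)
    s[k] = -np.sign(grad[k])           # signed vertex -sign(grad_k) e_k
    return s
```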

13 / 20


SLIDE 18

Generalized conditional gradient method

minimize_x  f(x) + φ(κ_P(x))

  s^(t) = argmin_s  ∇f(x^(t))^T s + φ(κ_P(s)) = ξ ŝ       (generalized LMO)

      ŝ = argmin_{ŝ∈P}  ∇f(x^(t))^T ŝ                      (direction; set ν := ∇f(x^(t))^T ŝ)
      ξ = argmin_{ξ≥0}  ξν + φ(ξ)                           (magnitude)

  x^(t+1) = (1 − θ^(t)) x^(t) + θ^(t) s^(t)                 (convex merging)

Computational complexity: ≈ same as vanilla CGM (LMO + 1-D optimization)
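As a concrete instance, here is a sketch of the two-step generalized LMO for κ_P = ‖·‖₁ with the quadratic penalty φ(ξ) = (λ/2)ξ²; the closed-form magnitude ξ = max(0, −ν)/λ follows from the 1-D optimality condition ν + λξ = 0 (function and parameter names are illustrative):

```python
import numpy as np

def gen_lmo_l1_quadratic(grad, lam):
    """Generalized LMO for κ_P = ||.||_1 with φ(ξ) = lam·ξ²/2 (sketch)."""
    # direction: vanilla l1-ball LMO, ŝ = -sign(grad_k) e_k at the largest |grad_k|
    k = int(np.argmax(np.abs(grad)))
    s_hat = np.zeros_like(grad)
    s_hat[k] = -np.sign(grad[k])
    nu = float(grad @ s_hat)         # ν = ∇f(x)^T ŝ  (≤ 0 by construction)
    xi = max(0.0, -nu) / lam         # magnitude: argmin_{ξ≥0} ξν + lam·ξ²/2
    return xi * s_hat
```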

14 / 20


SLIDE 20

Computing magnitude

  ξ = argmin_{ξ≥0}  −ξ·ν + φ(ξ) = (φ*)′(ν)

where φ*(ν) = sup_ξ ξ·ν − φ(ξ) is the convex conjugate.

Example: φ(ξ) = p⁻¹ ξ^p

• p = 1:   ξ = 0 if ν ≤ 1, +∞ otherwise   (doesn't work for LASSO)
• p = 2:   ξ = ν                          (not bounded)
• p = +∞:  ξ = 1                          (vanilla CGM)
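For intermediate powers 1 < p < ∞, the same one-dimensional problem has a closed form; a quick added check (not on the slide), assuming ν ≥ 0:

```latex
\xi = \operatorname*{argmin}_{\xi \ge 0} \; -\xi\nu + \tfrac{1}{p}\xi^{p}
\;\Longrightarrow\; -\nu + \xi^{p-1} = 0
\;\Longrightarrow\; \xi = \nu^{1/(p-1)},
```

which recovers ξ = ν at p = 2 and tends to the vanilla-CGM magnitude ξ = 1 as p → ∞ (for any ν > 0).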

15 / 20

SLIDE 21

Assumptions on φ

minimize_x  f(x) + φ(κ_P(x))

Assumption
The function φ : R_+ → R_+ is monotonically nondecreasing over all ξ ≥ 0, with

• subdifferentials not upper-bounded:  sup{ν : ν ∈ ∂φ(ξ)} → +∞ as ξ → +∞   (so a finite magnitude ξ always exists)

• a quadratic lower bound:  φ(ξ) ≥ μ_φ ξ² − φ₀ for some μ_φ > 0 and φ₀ ∈ R   (so ξ doesn't grow too fast)

16 / 20

SLIDE 22

Theorem (Convergence)

Take θ^(t) = 2/(t + 1). Then

• the gap satisfies  min_{i≤t} gap(x^(i), −∇f(x^(i))) = O(1/t);

• the support is fully recovered after a finite number of steps t = O(1/δ²).

Experiment:

  minimize_x  ½‖Ax − b‖₂² + (λ/p) ‖x‖₁^p,    A, b random,  S^(t) = unscreened atoms

[Figure: duality gap vs. iterations t for p ∈ {1.75, 2, 3, 5, +∞}; number of unscreened atoms |S^(t)| vs. iterations for λ ∈ {0.03, 0.05, 0.10, 0.25, 0.50, 1.00}.]

17 / 20

SLIDE 23

Convergence

minimize_x  ½‖Ax − b‖₂² + ½‖x‖₁²

[Figure: residuals vs. iterations t (log-log): |f(x^(t)) − f*|, gap^(t), σ°(z^(t) − z*)², support error, the level δ², and an O(1/t) reference line.]

18 / 20

SLIDE 24

Application: MNIST binary classification

φ₁(ξ) = δ_{·≤C}(ξ),    φ₂(ξ) = (λ/2) ξ²,    φ₃(ξ) = −λ log(1 − ξ) + λξ

[Figure: number of features selected and misclassification rate as the regularization varies: over C for p = +∞, over λ for p = 2, and over λ for the log barrier. Red = # unscreened atoms at t = 10000 / 5000 / 1000; blue = train / test misclassification; green = # nonzeros of x^(10000) (= x*).]

19 / 20

SLIDE 25

Summary

Problem

  minimize_x  f(x) + φ(κ_P(x))

  f: smooth convex loss;  κ_P: gauge;  φ ∘ κ_P: sparse penalty or constraint

• Optimality conditions allow the support of x* to be recovered from a noisy ∇f(x)

• The duality gap bounds the gradient error, giving a screening rule that does not require x*

Method

  s^(t) = argmin_s  ∇f(x^(t))^T s + φ(κ_P(s))      (gen-LMO)
  x^(t+1) = (1 − θ^(t)) x^(t) + θ^(t) s^(t)

• requires mild curvature assumptions on φ and L-smoothness of f
• gap converges at rate O(1/t) (without assuming bounded iterates)
• identifies the support at a finite t = O(1/δ²)
• in practice, the rule is pessimistic (but safe)

20 / 20