Harnessing Structure in Optimization for Machine Learning, Franck Iutzeler



SLIDE 1

Harnessing Structure in Optimization for Machine Learning

Franck Iutzeler

LJK, Univ. Grenoble Alpes

Optimization for Machine Learning, CIRM, 9-13 March 2020

SLIDE 2

>>> Regularization in Learning

Structure ↔ Regularization: sparsity r = ‖·‖₁; anti-sparsity r = ‖·‖∞; low rank r = ‖·‖∗; ...

Linear inverse problems: for a chosen regularization, we seek

x⋆ ∈ arg min_x r(x)  such that  Ax = b

Regularized Empirical Risk Minimization problem: Find

x⋆ ∈ arg min_{x ∈ Rn} R(x; {ai, bi}_{i=1}^m) + λ r(x)

where R is obtained from the chosen statistical modeling and r is the chosen regularization.

e.g. Lasso: Find

x⋆ ∈ arg min_{x ∈ Rn} Σ_{i=1}^m ½ (ai⊤x − bi)² + λ ‖x‖₁

Regularization can improve statistical properties (generalization, stability, ...).

⋄ Tibshirani: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society (1996)
⋄ Tibshirani et al.: Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society (2004)
⋄ Vaiter, Peyré, Fadili: Model consistency of partly smooth regularizers. IEEE Trans. on Information Theory (2017)

1 / 18

SLIDE 3

>>> Optimization for Machine Learning Composite minimization

Find x⋆ ∈ arg min_{x ∈ Rn} R(x; {ai, bi}_{i=1}^m) + λ r(x)   ⇔   Find x⋆ ∈ arg min_{x ∈ Rn} f(x) + g(x)

with f smooth and g non-smooth.

> f: differentiable surrogate of the empirical risk ⇒ Gradient
  a non-linear smooth function that depends on all the data
> g: non-smooth but chosen regularization ⇒ Proximity operator
  non-differentiability on some manifolds implies structure on the solutions

proxγg(u) = arg min_{y ∈ Rn} { g(y) + 1/(2γ) ‖y − u‖₂² }

  • closed form/easy for many regularizations: g(x) = ‖x‖₁, g(x) = TV(x), g(x) = indicatorC(x)

Natural optimization method: proximal gradient
  uk+1 = xk − γ∇f(xk)
  xk+1 = proxγg(uk+1)
and its stochastic variants: proximal SGD, etc.
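The proximal gradient iteration above is short enough to sketch directly. Below is a minimal Python/NumPy version (not from the slides): grad_f and prox_g are user-supplied, and the least-squares/ℓ1 instantiation at the bottom is purely illustrative.

    import numpy as np

    def proximal_gradient(grad_f, prox_g, x0, gamma, n_iter=500):
        # uk+1 = xk - gamma * grad f(xk) ; xk+1 = prox_{gamma g}(uk+1)
        x = x0.copy()
        for _ in range(n_iter):
            u = x - gamma * grad_f(x)
            x = prox_g(u, gamma)
        return x

    # Illustrative instantiation: f(x) = 0.5 * ||Ax - b||^2, g(x) = lam * ||x||_1
    rng = np.random.default_rng(0)
    A, b, lam = rng.normal(size=(20, 50)), rng.normal(size=20), 0.1
    grad_f = lambda x: A.T @ (A @ x - b)
    prox_l1 = lambda u, g: np.sign(u) * np.maximum(np.abs(u) - lam * g, 0.0)   # soft-thresholding
    gamma = 1.0 / np.linalg.norm(A, 2) ** 2                                    # step <= 1/L with L = ||A||^2
    x_hat = proximal_gradient(grad_f, prox_l1, np.zeros(50), gamma)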

2 / 18

SLIDE 4

>>> Structure, Non-differentiability, and Proximity operator Example: LASSO

Find x⋆ ∈ arg min_{x ∈ Rn} R(x; {ai, bi}_{i=1}^m) + λ r(x)   ⇒   Find x⋆ ∈ arg min_{x ∈ Rn} ½ ‖Ax − b‖₂² + λ ‖x‖₁

with the quadratic part smooth and the ℓ1 part non-smooth.

Coordinates: Structure ↔ Optimality conditions

  ∀i:  x⋆i = 0  ⇔  Ai⊤(Ax⋆ − b) ∈ [−λ, λ]

Proximity Operator: per coordinate

  [proxγλ‖·‖₁(u)]i = ui − λγ if ui > λγ;  0 if ui ∈ [−λγ, λγ];  ui + λγ if ui < −λγ

Proximal Gradient (aka ISTA):

  uk+1 = xk − γA⊤(Axk − b)
  xk+1 = proxγλ‖·‖₁(uk+1)

[Figure: soft-thresholding of |·|, mapping [−1, 1] to {0} on each coordinate]
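The three cases of the per-coordinate formula translate directly into code. A small illustrative sketch follows (the coordinate loop is kept explicit to mirror the slide; the usual vectorized sign/max shortcut computes the same thing).

    import numpy as np

    def prox_l1(u, gamma, lam):
        # [prox_{gamma*lam*||.||_1}(u)]_i, case by case as on the slide
        out = np.empty_like(u)
        for i, ui in enumerate(u):
            if ui > lam * gamma:
                out[i] = ui - lam * gamma      # shrink from above
            elif ui < -lam * gamma:
                out[i] = ui + lam * gamma      # shrink from below
            else:
                out[i] = 0.0                   # the whole interval [-lam*gamma, lam*gamma] maps to 0
        return out

    print(prox_l1(np.array([-2.0, -0.3, 0.0, 0.7, 3.0]), gamma=1.0, lam=1.0))
    # -> [-1.  0.  0.  0.  2.]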

3 / 18

SLIDE 5

>>> Structure, Non-differentiability, and Proximity operator Example: LASSO

Find x⋆ ∈ arg min_{x ∈ Rn} R(x; {ai, bi}_{i=1}^m) + λ r(x)   ⇒   Find x⋆ ∈ arg min_{x ∈ Rn} ½ ‖Ax − b‖₂² + λ ‖x‖₁

with the quadratic part smooth and the ℓ1 part non-smooth.

Coordinates: Structure ↔ Optimality conditions ↔ Proximity operation

  ∀i:  x⋆i = 0  ⇔  Ai⊤(Ax⋆ − b) ∈ [−λ, λ]  ⇔  [proxγλ‖·‖₁(u⋆)]i = 0,  where u⋆ = x⋆ − γA⊤(Ax⋆ − b)

  [proxγλ‖·‖₁(u)]i = ui − λγ if ui > λγ;  0 if ui ∈ [−λγ, λγ];  ui + λγ if ui < −λγ

Proximal Gradient (aka ISTA):

  uk+1 = xk − γA⊤(Axk − b)
  xk+1 = proxγλ‖·‖₁(uk+1)

[Figure: contour plot of the objective with the Proximal Gradient iterates converging to x⋆]

Iterates (xk) reach the same structure as x⋆ in finite time!
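To see this finite-time identification numerically, here is a small illustrative sketch (synthetic data; all names and sizes are assumptions): it runs ISTA and records the last iteration at which the support of xk changed.

    import numpy as np

    rng = np.random.default_rng(1)
    n, m, lam = 50, 30, 0.5
    A = rng.normal(size=(m, n))
    x_true = np.zeros(n); x_true[:3] = [2.0, -1.5, 1.0]           # sparse ground truth
    b = A @ x_true + 0.01 * rng.normal(size=m)

    gamma = 1.0 / np.linalg.norm(A, 2) ** 2
    soft = lambda u, t: np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

    x = np.zeros(n)
    support, last_change = np.zeros(n, dtype=bool), 0
    for k in range(2000):
        x = soft(x - gamma * A.T @ (A @ x - b), gamma * lam)      # ISTA step
        new_support = x != 0
        if not np.array_equal(new_support, support):
            support, last_change = new_support, k
    print("support frozen after iteration", last_change, "with", support.sum(), "nonzeros")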

3 / 18

SLIDE 6

>>> Mathematical properties of Proximal Algorithms Proximal Algorithms:

  uk+1 = xk − γ∇f(xk)
  xk+1 = proxγg(uk+1)

[Figure: contour plot of the objective with the Proximal Gradient iterates converging to x⋆]

> project on manifolds

Let M be a manifold and uk be such that xk = proxγg(uk) ∈ M and (uk − xk)/γ ∈ ri ∂g(xk).
If g is partly smooth at xk relative to M, then proxγg(u) ∈ M for any u close to uk.

⋄ Hare, Lewis: Identifying active constraints via partial smoothness and prox-regularity. Journal of Convex Analysis (2004)
⋄ Daniilidis, Hare, Malick: Geometrical interpretation of the predictor-corrector type algorithms in structured optimization problems. Optimization (2006)

4 / 18

SLIDE 7

>>> Mathematical properties of Proximal Algorithms Proximal Algorithms:

  uk+1 = xk − γ∇f(xk)
  xk+1 = proxγg(uk+1)

[Figure: contour plot with the Proximal Gradient iterates converging to x⋆, and the soft-thresholding of u⋆ onto x⋆]

> project on manifolds
> identify the optimal structure

Let (xk) and (uk) be a pair of sequences such that xk = proxγg(uk) → x⋆ = proxγg(u⋆), and let M be a manifold. If x⋆ ∈ M and
  (QC)  ∃ε > 0 such that proxγg(u) ∈ M for all u ∈ B(u⋆, ε)
holds, then, after some finite but unknown time, xk ∈ M.
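For g = λ‖·‖₁ the condition (QC) can be checked in closed form: the prox is coordinatewise soft-thresholding, so it maps a whole neighborhood of u⋆ into M = {x : xi = 0, i ∈ I} exactly when the thresholded coordinates of u⋆ lie strictly inside [−γλ, γλ]. A small illustrative sketch of that check, assuming u⋆ is available:

    import numpy as np

    def qc_holds_l1(u_star, gamma, lam, tol=1e-12):
        # zero pattern produced by prox_{gamma*lam*||.||_1} at u_star
        x_star = np.sign(u_star) * np.maximum(np.abs(u_star) - gamma * lam, 0.0)
        zero = (x_star == 0.0)
        # (QC) for the sparse manifolds: every u near u_star is thresholded to the
        # same zeros, i.e. those entries of u_star are strictly below the threshold
        return bool(np.all(np.abs(u_star[zero]) < gamma * lam - tol))

    print(qc_holds_l1(np.array([2.0, 0.3, -0.1]), gamma=1.0, lam=1.0))   # True: strict zeros
    print(qc_holds_l1(np.array([2.0, 1.0, -0.1]), gamma=1.0, lam=1.0))   # False: one entry sits exactly at the threshold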

⋄ Lewis: Active sets, nonsmoothness, and sensitivity. SIAM Journal on Optimization (2002)
⋄ Fadili, Malick, Peyré: Sensitivity analysis for mirror-stratifiable convex functions. SIAM Journal on Optimization (2018)

4 / 18

SLIDE 8

>>> “Nonsmoothness can help”

> Nonsmoothness is actively studied in Numerical Optimization...

Subgradients, Partial Smoothness/prox-regularity, Bregman metrics, Error Bounds/Kurdyka-Łojasiewicz, etc.

⋄ Hare, Lewis: Identifying active constraints via partial smoothness and prox-regularity. Journal of Convex Analysis (2004)
⋄ Lemarechal, Oustry, Sagastizabal: The U-Lagrangian of a convex function. Transactions of the AMS (2000)
⋄ Bolte, Daniilidis, Lewis: The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization (2007)
⋄ Chen, Teboulle: A proximal-based decomposition method for convex minimization problems. Mathematical Programming (1994)

5 / 18

SLIDE 9

>>> “Nonsmoothness can help”

> Nonsmoothness is actively studied in Numerical Optimization...

Subgradients, Partial Smoothness/prox-regularity, Bregman metrics, Error Bounds/Kurdyka-Łojasiewicz, etc.

> ...but it is often suffered rather than exploited, due to a lack of structure/explicit expressions.

Bundle methods, Gradient Sampling, Smoothing, Inexact proximal methods, etc.

⋄ Nesterov: Smooth minimization of non-smooth functions. Mathematical Programming (2005)
⋄ Burke, Lewis, Overton: A robust gradient sampling algorithm for nonsmooth, nonconvex optimization. SIAM Journal on Optimization (2005)
⋄ Solodov, Svaiter: A hybrid projection-proximal point algorithm. Journal of Convex Analysis (1999)
⋄ de Oliveira, Sagastizábal: Bundle methods in the XXIst century: A bird’s-eye view. Pesquisa Operacional (2014)

5 / 18

SLIDE 10

>>> “Nonsmoothness can help”

> Nonsmoothness is actively studied in Numerical Optimization...

Subgradients, Partial Smoothness/prox-regularity, Bregman metrics, Error Bounds/Kurdyka-Łojasiewicz, etc.

> ...but it is often suffered rather than exploited, due to a lack of structure/explicit expressions.

Bundle methods, Gradient Sampling, Smoothing, Inexact proximal methods, etc.

> For Machine Learning objectives, it can often be harnessed:
  • Explicit/“proximable” regularizations (ℓ1, nuclear norm)
  • We know the expressions and activity of the sought structures (sparsity, rank)

See the talks of ...

⋄ Bach et al.: Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning (2012)
⋄ Massias, Salmon, Gramfort: Celer: a fast solver for the lasso with dual extrapolation. ICML (2018)
⋄ Liang, Fadili, Peyré: Local linear convergence of forward–backward under partial smoothness. NeurIPS (2014)
⋄ O’Donoghue, Candes: Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics (2015)

5 / 18

SLIDE 11

>>> Noticeable Structure

Find x⋆ ∈ arg min_{x ∈ Rn} R(x; {ai, bi}_{i=1}^m) + λ r(x)   ⇔   Find x⋆ ∈ arg min_{x ∈ Rn} f(x) + g(x)

with f smooth and g non-smooth.

A reason why the nonsmoothness of ML problems can be leveraged is their noticeable structure, that is: we can design a lookout collection C = {M1, .., Mp} of closed sets such that:
(i) we have a projection mapping projMi onto Mi for all i;
(ii) proxγg(u) is a singleton and can be computed explicitly for any u and γ;
(iii) upon computation of x = proxγg(u), we know whether x ∈ Mi or not, for all i.
⇒ Identification can be directly harnessed.

Example: Sparse structure and g = ‖·‖₁, ‖·‖_{0.5}^{0.5}, ‖·‖₀, ...
C = {M1, . . . , Mn} with Mi = {x ∈ Rn : xi = 0}
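For this sparse lookout collection, the three points are one-liners. An illustrative sketch (the names are mine) of the projection onto Mi, the explicit prox, and the membership test (iii) after a prox computation:

    import numpy as np

    def proj_Mi(x, i):
        # projection onto Mi = {x : x_i = 0}: zero out coordinate i
        y = x.copy(); y[i] = 0.0
        return y

    def prox_l1(u, t):
        # point (ii): a singleton, explicit for any u and t
        return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

    def active_manifolds(x):
        # point (iii): after computing x = prox(u), we know exactly which Mi contain x
        return {i for i in range(x.size) if x[i] == 0.0}

    x = prox_l1(np.array([0.3, -2.0, 0.05, 1.4]), t=0.5)
    print(active_manifolds(x))   # -> {0, 2}: coordinates set exactly to zero by the prox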

6 / 18

SLIDE 12

>>> Question

Lookout collection C = {M1, .., Mp} of closed sets such that:
(i) we have a projection mapping projMi onto Mi for all i;
(ii) proxγg(u) is a singleton and can be computed explicitly for any u and γ;
(iii) upon computation of x = proxγg(u), we know whether x ∈ Mi or not, for all i.
(QC) ∃ε > 0 such that proxγg(u) ∈ M for all u ∈ B(u⋆, ε)

Take any proximal algorithm
  uk+1 = Update(f; {xℓ}ℓ≤k; {uℓ}ℓ≤k; γ)
  xk+1 = proxγg(uk+1)
(prox-Update) such that (uk) converges almost surely to a point u⋆ with x⋆ = proxγg(u⋆) a solution of the problem.

Let's use the structure. What can we do on the way to identification, or when screening is inefficient?
(not close to x⋆, no explicit or bad dual (non-convex), proxγg(uk) difficult to evaluate)

7 / 18

SLIDE 13

>>> Question

Lookout collection C = {M1, .., Mp} of closed sets such that:
(i) we have a projection mapping projMi onto Mi for all i;
(ii) proxγg(u) is a singleton and can be computed explicitly for any u and γ;
(iii) upon computation of x = proxγg(u), we know whether x ∈ Mi or not, for all i.
(QC) ∃ε > 0 such that proxγg(u) ∈ M for all u ∈ B(u⋆, ε)

Take any proximal algorithm
  uk+1 = Update(f; {xℓ}ℓ≤k; {uℓ}ℓ≤k; γ)
  xk+1 = proxγg(uk+1)
(prox-Update) such that (uk) converges almost surely to a point u⋆ with x⋆ = proxγg(u⋆) a solution of the problem.

Define Mk := Rn ∩ ⋂_{i : xk ∈ Mi} Mi and M⋆ := Rn ∩ ⋂_{i : x⋆ ∈ Mi} Mi. Then:
  Mk ⊂ Rn : partial identification/screening,   and   Mk = M⋆ after some finite time : identification.

1. Observing Mk can help reduce the dimension of the problem along the way. Can we efficiently restrict Update using Mk?
2. The structure uncovered along the way bears valuable information. Does accelerated proximal gradient identify as well as vanilla?

7 / 18

SLIDE 14

ADAPTIVE SUBSPACE DESCENT
INTERPLAY BETWEEN ACCELERATION AND IDENTIFICATION

SLIDE 15

>>> Adaptive Proximal Gradient ADAPTIVE DESCENT
Disclaimer: this part of the talk assumes that the identified manifolds are linear subspaces, e.g. ‖Dx‖₁.

  yk = xk − γ∇f(xk)
  zk = yk
  xk+1 = proxγg(zk)

> Vanilla Proximal Gradient identifies but does not use it: the full gradient is computed at each iteration.

Example: Sparse structure and g = ‖·‖₁; C = {M1, . . . , Mn} with Mi = {x ∈ Rn : xi = 0}; Mk = {x ∈ Rn : xi = 0 for all i such that xk,i = 0}

8 / 18

SLIDE 16

>>> Adaptive Proximal Gradient ADAPTIVE DESCENT
Disclaimer: this part of the talk assumes that the identified manifolds are linear subspaces, e.g. ‖Dx‖₁.

  Observe Mk = Rn ∩ ⋂_{i : xk ∈ Mi} Mi
  yk = xk − γ∇f(xk)
  zk = projMk(yk) + proj⊥Mk(zk−1)
  xk+1 = proxγg(zk)

> Direct use of identification may not converge, e.g. when starting from 0.

Example: Sparse structure and g = ‖·‖₁; C = {M1, . . . , Mn} with Mi = {x ∈ Rn : xi = 0}; Mk = {x ∈ Rn : xi = 0 for all i such that xk,i = 0}

8 / 18

SLIDE 17

>>> Adaptive Proximal Gradient ADAPTIVE DESCENT
Disclaimer: this part of the talk assumes that the identified manifolds are linear subspaces, e.g. ‖Dx‖₁.

  Observe Mk = Rn ∩ ⋂_{i : xk ∈ Mi} (ξk,i Mi + (1 − ξk,i) Rn) with ξk,i ∼ B(p)
  yk = xk − γ∇f(xk)
  zk = projMk(yk) + proj⊥Mk(zk−1)
  xk+1 = proxγg(zk)

> Mixing identification and randomized coordinate descent biases the gradient: convergence issues.

Example: Sparse structure and g = ‖·‖₁; C = {M1, . . . , Mn} with Mi = {x ∈ Rn : xi = 0}; Mk = {x ∈ Rn : xi = 0 for a random subset of the i such that xk,i = 0}
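A rough sketch of one such randomized update for the sparse case, under my reading of the slide's notation (a currently-zero coordinate stays constrained with probability p and is freed otherwise); the Qk correction of the next slide is omitted, so this is the biased variant whose convergence issues are mentioned here. All names are illustrative.

    import numpy as np

    def adaptive_step(x, z_prev, grad_f, prox_g, gamma, p, rng):
        # One iteration of the randomized update sketched on this slide (simplified).
        zero = (x == 0.0)
        constrained = zero & (rng.random(x.size) < p)   # zero coordinates kept in Mk with prob. p
        y = x - gamma * grad_f(x)                       # yk = xk - gamma * grad f(xk)
        z = np.where(constrained, z_prev, y)            # zk = proj_Mk(yk) + proj_Mk^perp(z_{k-1})
        return prox_g(z, gamma), z

    # Illustrative usage on a tiny lasso instance:
    rng = np.random.default_rng(0)
    A, b, lam = rng.normal(size=(20, 40)), rng.normal(size=20), 0.3
    soft = lambda u, t: np.sign(u) * np.maximum(np.abs(u) - lam * t, 0.0)
    gamma = 1.0 / np.linalg.norm(A, 2) ** 2
    x = z = np.zeros(40)
    for _ in range(300):
        x, z = adaptive_step(x, z, lambda v: A.T @ (A @ v - b), soft, gamma, p=0.5, rng=rng)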

8 / 18

SLIDE 18

>>> Adaptive Proximal Gradient ADAPTIVE DESCENT
Disclaimer: this part of the talk assumes that the identified manifolds are linear subspaces, e.g. ‖Dx‖₁.

  Observe Mk = Rn ∩ ⋂_{i : xk ∈ Mi} (ξk,i Mi + (1 − ξk,i) Rn) with ξk,i ∼ B(p)
  yk = xk − γ∇f(xk)
  zk = Qk^{-1} (projMk(Qk yk) + proj⊥Mk(zk−1))
  xk+1 = proxγg(zk)

> With Qk := (E projMk)^{−1/2}, this works after identification, but before... no, which prevents identification...

TV-regularized logistic regression:
[Figure: iterate structural sparsity and suboptimality vs. iteration, adapting at every iteration vs. as in theory]

8 / 18

SLIDE 19

>>> Adaptive Proximal Gradient ADAPTIVE DESCENT
Disclaimer: this part of the talk assumes that the identified manifolds are linear subspaces, e.g. ‖Dx‖₁.

  Observe Mk = Rn ∩ ⋂_{i : xℓ ∈ Mi} (ξk,i Mi + (1 − ξk,i) Rn) with ξk,i ∼ B(p)
  yk = xk − γ∇f(xk)
  zk = Qk^{-1} (projMk(Qk yk) + proj⊥Mk(zk−1))
  xk+1 = proxγg(zk)
  Check if an adaptation can be performed; if so, ℓ ← k + 1

> Generalized support: adaptation can be performed at some iterations; it depends on the amount of change Qk Qk+1^{-1} and on the harshness of the sparsification λmin(Qk).

TV-regularized logistic regression:
[Figure: iterate structural sparsity and suboptimality vs. iteration, adapting at every iteration vs. as in theory]

8 / 18

SLIDE 20

>>> Adaptive Subspace descent ADAPTIVE DESCENT TV-reg. logistic regression on a1a (1605 × 143), 90% final jump sparsity

[Figure: iterate density, suboptimality vs. iteration, and suboptimality vs. number of subspaces explored, for PGD, 20% sampling w/o identification, and ARPSD with 10%, 20%, 50% sampling]

> The iterate structure enforced by nonsmooth regularizers can be used to adapt the selection probabilities of coordinate descent/sketching;
> Before identification, adaptation has to be moderate.

⊲ Grishchenko, I., & Malick: Proximal Gradient Methods with Adaptive Subspace Sampling, in revision for Mathematics of Operations Research
  (available on my webpage, more details at SMAI MODE)

9 / 18

SLIDE 21

ADAPTIVE SUBSPACE DESCENT
INTERPLAY BETWEEN ACCELERATION AND IDENTIFICATION

SLIDE 22

>>> Acceleration of the Proximal Gradient ACCELERATION?

  uk+1 = yk − γ∇f(yk)
  xk+1 = proxγg(uk+1)
  yk+1 = xk+1 + αk+1(xk+1 − xk)   ← inertia/acceleration

> αk+1 ≡ 0 : vanilla Proximal Gradient
> αk+1 = (k−1)/(k+3) : accelerated Proximal Gradient (aka FISTA)

Optimal rate for composite problems (coefficients may vary a little):

                                 PG        Accel. PG
  F(xk) − F⋆                    O(1/k)     O(1/k²)
  iterates convergence          yes        yes
  monotone functional decrease  yes        no
  Fejér-monotone iterates       yes        no
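An illustrative sketch of the accelerated iteration with αk+1 = (k−1)/(k+3) in Python (names are mine; setting alpha to zero recovers the vanilla proximal gradient above):

    import numpy as np

    def accelerated_proximal_gradient(grad_f, prox_g, x0, gamma, n_iter=500):
        # FISTA-type scheme: uk+1 = yk - gamma * grad f(yk);
        # xk+1 = prox_{gamma g}(uk+1); yk+1 = xk+1 + alpha_{k+1} (xk+1 - xk)
        x = y = x0.copy()
        for k in range(1, n_iter + 1):
            x_new = prox_g(y - gamma * grad_f(y), gamma)
            alpha = (k - 1) / (k + 3)              # inertia coefficient from the slide
            y = x_new + alpha * (x_new - x)
            x = x_new
        return x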

⋄ Nesterov: A method for solving the convex programming problem with convergence rate O(1/k²). Dokladi A.N. SSSR (1983)
⋄ Beck, Teboulle: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences (2009)
⋄ Chambolle, Dossal: On the convergence of the iterates of “FISTA”. Journal of Optimization Theory and Applications (2015)
⋄ I., Malick: On the Proximal Gradient Algorithm with Alternated Inertia. Journal of Optimization Theory and Applications (2018)

10 / 18

SLIDE 23

>>> Interplay between Acceleration and Identification ACCELERATION?

min_{x ∈ R²} ‖Ax − b‖₂² + λ g(x)

[Figure: contour plots with the Proximal Gradient iterates converging to x⋆, for the two regularizers below]

Left: g(x) = ‖x‖₁ (ℓ1-norm regularization).   Right: g(x) = max(‖x‖_{1.3} − 1, 0) (distance to the 1.3-norm unit ball).

11 / 18

SLIDE 24

>>> Interplay between Acceleration and Identification ACCELERATION?

min_{x ∈ R²} ‖Ax − b‖₂² + λ g(x)

[Figure: contour plots with the Proximal Gradient and Accelerated Proximal Gradient iterates converging to x⋆ (and the manifold M on the right), for the two regularizers below]

Left: g(x) = ‖x‖₁ (ℓ1-norm regularization).   Right: g(x) = max(‖x‖_{1.3} − 1, 0) (distance to the 1.3-norm unit ball).

> PG identifies well;
> Accelerated PG explores well, identifies eventually, but erratically.

Can we converge fast and identify well?

11 / 18

SLIDE 25

>>> A test-based algorithm ACCELERATION?

T is a boolean function of the past iterates; it decides whether to accelerate or not.

  uk+1 = yk − γ∇f(yk)
  xk+1 = proxγg(uk+1)
  yk+1 = xk+1 + αk+1(xk+1 − xk)   if T = 1
  yk+1 = xk+1                     if T = 0

Proposed tests (using our lookout collection C):

  1. No acceleration, i.e. T1 = 0, when reaching a new manifold:
     xk+1 ∈ M and xk ∉ M for some M ∈ C.

  2. No acceleration, i.e. T2 = 0, if accelerating means leaving one:
     Tγ(xk+1) ∈ M and Tγ(xk+1 + αk+1(xk+1 − xk)) ∉ M for some M ∈ C.

where Tγ := proxγg(· − γ∇f(·)) is the proximal gradient operator. For analysis reasons, we allow no acceleration only when ‖Tγ(yk) − yk‖₂ ≤ δ and F(Tγ(yk)) ≤ F(x0).
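An illustrative sketch of how test T1 could look for the sparse lookout collection (my reading, not the authors' code): acceleration is skipped exactly when the new iterate lands on a manifold Mi = {x : xi = 0} that the previous iterate was not on; the δ and F(x0) safeguards mentioned above are omitted for brevity.

    import numpy as np

    def prox_grad_with_test_T1(grad_f, prox_g, x0, gamma, n_iter=500):
        # Accelerate by default; set T1 = 0 (no acceleration) when xk+1 reaches
        # a manifold Mi = {x : x_i = 0} that xk was not on (a new zero appears).
        x = y = x0.copy()
        for k in range(1, n_iter + 1):
            x_new = prox_g(y - gamma * grad_f(y), gamma)
            reached_new_manifold = np.any((x_new == 0.0) & (x != 0.0))
            if reached_new_manifold:               # T1 = 0: restart the inertia from xk+1
                y = x_new
            else:                                  # T1 = 1: usual inertial step
                y = x_new + (k - 1) / (k + 3) * (x_new - x)
            x = x_new
        return x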

12 / 18

SLIDE 26

>>> Convergence result ACCELERATION?

Theorem. Let f, g be two convex functions such that f is L-smooth, g is lower semi-continuous, and f + g is semi-algebraic with a minimizer. Take γ ∈ (0, 1/L]. Then, the iterates of the proposed method with test T1 or T2 verify

  F(xk+1) − F⋆ ≤ 9‖x0 − x⋆‖² / (2γ(k+2)²) + 9kR / (2γ(k+2)²) = O(1/k)

for some R > 0. Furthermore, if the problem has a unique minimizer x⋆ and the qualifying constraint (QC) holds, then the iterates sequence (xk) converges, finite-time identification happens, and

  F(xk+1) − F(x⋆) ≤ 9‖x0 − x⋆‖² / (2γ(k+2)²) + 9KR / (2γ(k+2)²) = O(1/k²)

for some finite K > 0.

L-smooth means that f is differentiable and ∇f is L-Lipschitz continuous.

13 / 18

SLIDE 27

>>> Back to initial problems: ℓ1 norm ACCELERATION?

min_{x ∈ R²} ‖Ax − b‖₂² + λ‖x‖₁

[Figure: contour plot with the Proximal Gradient and Accelerated Proximal Gradient iterates converging to x⋆]

14 / 18

SLIDE 28

>>> Back to initial problems: ℓ1 norm ACCELERATION?

min_{x ∈ R²} ‖Ax − b‖₂² + λ‖x‖₁

[Figure: contour plot with the iterates of Proximal Gradient, Accelerated Proximal Gradient, T1, and T2; convergence plot of F(xk) − F⋆ vs. number of proximal gradient steps for Proximal Gradient, Accel. Proximal Gradient, Prov. Alg – T1, and Prov. Alg – T2; ⊕ marks identification time]

14 / 18

SLIDE 29

>>> Back to initial problems: distance to 1.3-norm ball ACCELERATION?

min_{x ∈ R²} ‖Ax − b‖₂² + λ max(‖x‖_{1.3} − 1, 0)

[Figure: contour plot with the manifold M and the Proximal Gradient and Accelerated Proximal Gradient iterates converging to x⋆]

15 / 18

SLIDE 30

>>> Back to initial problems: distance to 1.3-norm ball ACCELERATION?

min_{x ∈ R²} ‖Ax − b‖₂² + λ max(‖x‖_{1.3} − 1, 0)

[Figure: contour plot with the iterates of Proximal Gradient, Accelerated Proximal Gradient, T1, and T2; convergence plot of F(xk) − F⋆ vs. number of proximal gradient steps for Proximal Gradient, Accel. Proximal Gradient, Prov. Alg – T1, and Prov. Alg – T2; ⊕ marks identification time]

15 / 18

SLIDE 31

>>> Matrix regression with nuclear-norm regularization ACCELERATION?

min_{X ∈ R^{20×20}} ‖AX − B‖F² + λ‖X‖∗

> S ∈ R^{20×20} is a rank-3 matrix;
> A ∈ R^{(16×16)×(20×20)} is drawn from the normal distribution;
> B = AS + E with E drawn from the normal distribution with variance .01
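The proximity operator of the nuclear norm soft-thresholds the singular values. Below is a sketch of this experiment's flavour in Python: the slide's exact operator A is not reproduced, a random linear map and a plain proximal gradient loop (with a 0.5-scaled least-squares term) are stand-ins, and all names are mine.

    import numpy as np

    def prox_nuclear(M, t):
        # prox_{t ||.||_*}(M): soft-threshold the singular values of M
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt

    rng = np.random.default_rng(0)
    n, r, lam = 20, 3, 1.0
    S = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))        # rank-3 ground truth
    A = rng.normal(size=(16 * 16, n * n))                        # random linear map (stand-in for the slide's A)
    B = A @ S.ravel() + 0.1 * rng.normal(size=16 * 16)           # noisy measurements

    gamma = 1.0 / np.linalg.norm(A, 2) ** 2
    X = np.zeros((n, n))
    for _ in range(500):
        G = (A.T @ (A @ X.ravel() - B)).reshape(n, n)            # gradient of 0.5 * ||A vec(X) - B||^2
        X = prox_nuclear(X - gamma * G, gamma * lam)
    print("rank of X:", np.linalg.matrix_rank(X, tol=1e-6))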

[Figure: for Proximal Gradient, Accel. Proximal Gradient, T1, and T2: suboptimality F(xk) − F⋆ and dim Ker(Xk)/dim Ker(S) (in %) vs. iterations]

16 / 18

SLIDE 32

>>> On the Interplay between Acceleration and Identification ACCELERATION?

> acceleration can hurt identification for the proximal gradient algorithm;
> we proposed a method with stable identification behavior, maintaining an accelerated convergence rate.

⊲ Bareilles & I.: On the Interplay between Acceleration and Identification for the Proximal Gradient algorithm. arXiv:1909.08944
  Try it in Julia on https://github.com/GillesBareilles/Acceleration-Identification

17 / 18

SLIDE 33

>>> Harnessing Structure in Optimization for ML ACCELERATION?

> Machine Learning problems often have a noticeable structure;
> We can design a lookout collection C = {M1, .., Mp} of sets: (i) with easy projections; (ii) identified by proximity operations; (iii) for which we know whether they are identified or not;
> This structure can and should be harnessed, but doing so may be tricky before identification.

⊲ Malick & I.: Nonsmoothness can help! On the Specific Structure of Machine Learning problems, review/pedagogical paper coming hopefully soon
  (thanks to this week at CIRM, but it also depends on whether we go hiking/running in the calanques, which may very well be the case)

Thanks to ANR JCJC STROLL & IDEX UGA IRS DOLL & PGMO

Thank you! – Franck IUTZELER

http://www.iutzeler.org

18 / 18