Harnessing Structure in Optimization for Machine Learning
Franck Iutzeler, LJK, Univ. Grenoble Alpes
Optimization for Machine Learning, CIRM, 9-13 March 2020
>>> Regularization in Learning
Structure and associated regularization:
- sparsity: $r = \|\cdot\|_1$
- anti-sparsity: $r = \|\cdot\|_\infty$
- low rank: $r = \|\cdot\|_*$
- ...
Linear inverse problems: for a chosen regularization, we seek
  $x^\star \in \arg\min_x r(x)$ such that $Ax = b$
Regularized Empirical Risk Minimization problem: Find
  $x^\star \in \arg\min_{x\in\mathbb{R}^n} R(x; \{a_i, b_i\}_{i=1}^m) + \lambda\, r(x)$
  where $R$ is obtained from the chosen statistical modeling and $r$ is the chosen regularization
e.g. Lasso: Find
  $x^\star \in \arg\min_{x\in\mathbb{R}^n} \sum_{i=1}^m \frac{1}{2}(a_i^\top x - b_i)^2 + \lambda\|x\|_1$
Regularization can improve statistical properties (generalization, stability, ...).
⋄ Tibshirani: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society (1996) ⋄ Tibshirani et al.: Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society (2004) ⋄ Vaiter, Peyré, Fadili: Model consistency of partly smooth regularizers. IEEE Trans. on Information Theory (2017)
>>> Optimization for Machine Learning
Composite minimization
Find
  $x^\star \in \arg\min_{x\in\mathbb{R}^n} R(x; \{a_i, b_i\}_{i=1}^m) + \lambda\, r(x)$
Find
  $x^\star \in \arg\min_{x\in\mathbb{R}^n} f(x) + g(x)$, with $f$ smooth and $g$ non-smooth
> f: differentiable surrogate of the empirical risk ⇒ Gradient
non-linear smooth function that depends on all the data
> g: non-smooth but chosen regularization ⇒ Proximity operator
non-differentiability on some manifolds implies structure on the solutions
  $\mathrm{prox}_{\gamma g}(u) = \arg\min_{y\in\mathbb{R}^n} \left\{ g(y) + \frac{1}{2\gamma}\|y - u\|_2^2 \right\}$
- closed form/easy for many regularizations:
  – $g(x) = \|x\|_1$
  – $g(x) = \mathrm{TV}(x)$
  – $g(x) = \mathrm{indicator}_C(x)$
Natural optimization method: proximal gradient
  $u_{k+1} = x_k - \gamma\nabla f(x_k)$
  $x_{k+1} = \mathrm{prox}_{\gamma g}(u_{k+1})$
and its stochastic variants: proximal SGD, etc.
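The two-step iteration above can be sketched generically, with the gradient of f and the proximity operator of g passed in as functions. This is a minimal illustration, not the speaker's implementation: the names `proximal_gradient`, `grad_f`, `prox_g` and the toy problem (f(x) = ½‖x − c‖², g = λ‖·‖₁) are assumptions chosen for this example.

```python
from typing import Callable, List

Vector = List[float]


def soft_threshold(u: float, t: float) -> float:
    """Proximity operator of t*|.| applied to a scalar."""
    if u > t:
        return u - t
    if u < -t:
        return u + t
    return 0.0


def proximal_gradient(
    grad_f: Callable[[Vector], Vector],
    prox_g: Callable[[Vector, float], Vector],
    x0: Vector,
    gamma: float,
    n_iter: int = 100,
) -> Vector:
    """Iterate u = x - gamma * grad_f(x), then x = prox_{gamma g}(u)."""
    x = list(x0)
    for _ in range(n_iter):
        g = grad_f(x)
        u = [xi - gamma * gi for xi, gi in zip(x, g)]  # forward (gradient) step
        x = prox_g(u, gamma)                           # backward (proximal) step
    return x


# Toy problem (assumption): f(x) = 0.5 * ||x - c||^2 and g = lam * ||.||_1.
c = [3.0, -0.5, 1.0]
lam = 1.0


def grad_f(x: Vector) -> Vector:
    return [xi - ci for xi, ci in zip(x, c)]


def prox_g(u: Vector, gamma: float) -> Vector:
    return [soft_threshold(ui, lam * gamma) for ui in u]


x_hat = proximal_gradient(grad_f, prox_g, [0.0, 0.0, 0.0], gamma=1.0, n_iter=10)
print(x_hat)
```

On this toy problem f is 1-smooth, so γ = 1 is admissible and the minimizer is the soft-thresholding of c.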
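>>> Structure, Non-differentiability, and Proximity operator
Example: LASSO
Find
  $x^\star \in \arg\min_{x\in\mathbb{R}^n} R(x; \{a_i, b_i\}_{i=1}^m) + \lambda\, r(x)$
Find
  $x^\star \in \arg\min_{x\in\mathbb{R}^n} \frac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1$, with the first term smooth and the second non-smooth
Coordinates
Structure ↔ Optimality conditions
  for all $i$: $x^\star_i = 0 \Leftrightarrow A_i^\top(Ax^\star - b) \in [-\lambda, \lambda]$
Proximity Operator: per coordinate
  $\left[\mathrm{prox}_{\gamma\lambda\|\cdot\|_1}(u)\right]_i = \begin{cases} u_i - \lambda\gamma & \text{if } u_i > \lambda\gamma \\ 0 & \text{if } u_i \in [-\lambda\gamma; \lambda\gamma] \\ u_i + \lambda\gamma & \text{if } u_i < -\lambda\gamma \end{cases}$
Proximal Gradient (aka ISTA):
  $u_{k+1} = x_k - \gamma A^\top(Ax_k - b)$
  $x_{k+1} = \mathrm{prox}_{\gamma\lambda\|\cdot\|_1}(u_{k+1})$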
[Figure: the absolute value $|\cdot|$ and the SoftThresholding operator; $[-1, 1] \to \{0\}$ per coordinate.]
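As a hedged illustration of ISTA on the Lasso, the sketch below uses plain Python lists; the helper names (`matvec`, `rmatvec`, `ista`, `correlations`) and the small diagonal problem are assumptions made for this example. It also evaluates $A^\top(Ax - b)$ at the output so the optimality condition on zero coordinates can be checked by hand.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(A: Matrix, x: Vector) -> Vector:
    """Compute A x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def rmatvec(A: Matrix, r: Vector) -> Vector:
    """Compute A^T r."""
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(A[0]))]


def soft_threshold(u: float, t: float) -> float:
    if u > t:
        return u - t
    if u < -t:
        return u + t
    return 0.0


def correlations(A: Matrix, b: Vector, x: Vector) -> Vector:
    """Gradient of the smooth part: A^T (A x - b)."""
    residual = [ri - bi for ri, bi in zip(matvec(A, x), b)]
    return rmatvec(A, residual)


def ista(A: Matrix, b: Vector, lam: float, gamma: float, n_iter: int = 100) -> Vector:
    """Proximal gradient on 0.5*||Ax - b||^2 + lam*||x||_1, started at 0."""
    x = [0.0] * len(A[0])
    for _ in range(n_iter):
        grad = correlations(A, b, x)
        u = [xi - gamma * gi for xi, gi in zip(x, grad)]
        x = [soft_threshold(ui, lam * gamma) for ui in u]
    return x


# Small illustrative problem (assumption): diagonal A, so L = 4 and gamma = 1/4.
A = [[2.0, 0.0], [0.0, 1.0]]
b = [2.0, 0.5]
lam = 1.0
x_hat = ista(A, b, lam, gamma=0.25, n_iter=50)
corr = correlations(A, b, x_hat)
print(x_hat, corr)
```

For this diagonal example the second coordinate is zero and the corresponding entry of `corr` lies inside $[-\lambda, \lambda]$, as the optimality condition requires.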
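Structure ↔ Optimality conditions ↔ Proximity operation
  for all $i$: $x^\star_i = 0 \Leftrightarrow A_i^\top(Ax^\star - b) \in [-\lambda, \lambda] \Leftrightarrow \left[\mathrm{prox}_{\gamma\lambda\|\cdot\|_1}(u^\star)\right]_i = 0$
  where $u^\star = x^\star - \gamma A^\top(Ax^\star - b)$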
[Figure: level sets of the Lasso objective in the plane, with the Proximal Gradient iterates converging to $x^\star$.]
Iterates (xk) reach the same structure as x⋆ in finite time!
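Finite-time identification can be observed numerically by recording the support of each iterate. The sketch below is an illustration under assumptions: the 2×2 problem, the step size and the helper names (`ista_with_supports`, `identification_time`) are invented for this example and are not taken from the talk.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def soft_threshold(u: float, t: float) -> float:
    if u > t:
        return u - t
    if u < -t:
        return u + t
    return 0.0


def gradient(A: Matrix, b: Vector, x: Vector) -> Vector:
    """A^T (A x - b) for small dense lists."""
    r = [sum(a * xi for a, xi in zip(row, x)) - bi for row, bi in zip(A, b)]
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(x))]


def ista_with_supports(A, b, lam, gamma, n_iter) -> Tuple[Vector, List[Tuple[int, ...]]]:
    """Run ISTA from 0 and record the support (nonzero indices) of x_0, x_1, ..."""
    x = [0.0] * len(A[0])
    supports = [tuple(i for i, xi in enumerate(x) if xi != 0.0)]
    for _ in range(n_iter):
        g = gradient(A, b, x)
        x = [soft_threshold(xi - gamma * gi, lam * gamma) for xi, gi in zip(x, g)]
        supports.append(tuple(i for i, xi in enumerate(x) if xi != 0.0))
    return x, supports


def identification_time(supports: List[Tuple[int, ...]]) -> int:
    """Smallest k such that all recorded supports from x_k on equal the last one."""
    final = supports[-1]
    k = len(supports) - 1
    while k > 0 and supports[k - 1] == final:
        k -= 1
    return k


# Illustrative problem (assumption): the solution is (1.2, 0), gamma < 1/L with L = 2.25.
A = [[1.0, 0.5], [0.5, 1.0]]
b = [2.0, 0.0]
x_hat, supports = ista_with_supports(A, b, lam=0.5, gamma=0.4, n_iter=200)
k_id = identification_time(supports)
print(x_hat, k_id)
```

Note that `identification_time` only reports stability over the recorded run; in practice the identification time is unknown in advance, as stated later in the talk.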
>>> Mathematical properties of Proximal Algorithms
Proximal Algorithms:
  $u_{k+1} = x_k - \gamma\nabla f(x_k)$
  $x_{k+1} = \mathrm{prox}_{\gamma g}(u_{k+1})$
[Figure: level sets of the objective with the Proximal Gradient iterates converging to $x^\star$.]
> project on manifolds
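Let $\mathcal{M}$ be a manifold and $u_k$ such that $x_k = \mathrm{prox}_{\gamma g}(u_k) \in \mathcal{M}$ and $\frac{u_k - x_k}{\gamma} \in \mathrm{ri}\,\partial g(x_k)$. If $g$ is partly smooth at $x_k$ relative to $\mathcal{M}$, then $\mathrm{prox}_{\gamma g}(u) \in \mathcal{M}$ for any $u$ close to $u_k$.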
⋄ Hare, Lewis: Identifying active constraints via partial smoothness and prox-regularity. Journal of Convex Analysis (2004) ⋄ Daniilidis, Hare, Malick: Geometrical interpretation of the predictor-corrector type algorithms in structured optimization problems. Optimization (2006)
> identify the optimal structure
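Let $(x_k)$ and $(u_k)$ be a pair of sequences such that $x_k = \mathrm{prox}_{\gamma g}(u_k) \to x^\star = \mathrm{prox}_{\gamma g}(u^\star)$ and $\mathcal{M}$ be a manifold. If $x^\star \in \mathcal{M}$ and
  (QC) $\exists \varepsilon > 0$ such that for all $u \in \mathcal{B}(u^\star, \varepsilon)$, $\mathrm{prox}_{\gamma g}(u) \in \mathcal{M}$
holds, then, after some finite but unknown time, $x_k \in \mathcal{M}$.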
⋄ Lewis: Active sets, nonsmoothness, and sensitivity. SIAM Journal on Optimization (2002) ⋄ Fadili, Malick, Peyré: Sensitivity analysis for mirror-stratifiable convex functions. SIAM Journal on Optimization (2018)
>>> “Nonsmoothness can help”
> Nonsmoothness is actively studied in Numerical Optimization...
Subgradients, Partial Smoothness/prox-regularity, Bregman metrics, Error Bounds/Kurdyka-Łojasiewicz, etc.
⋄ Hare, Lewis: Identifying active constraints via partial smoothness and prox-regularity. Journal of Convex Analysis (2004) ⋄ Lemarechal, Oustry, Sagastizabal: The U-Lagrangian of a convex function. Transactions of the AMS (2000) ⋄ Bolte, Daniilidis, Lewis: The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization (2007) ⋄ Chen, Teboulle: A proximal-based decomposition method for convex minimization problems. Mathematical Programming (1994)
> ...but it is often endured rather than exploited, for lack of structure/explicit expression.
Bundle methods, Gradient Sampling, Smoothing, Inexact proximal methods, etc.
⋄ Nesterov: Smooth minimization of non-smooth functions. Mathematical Programming (2005) ⋄ Burke, Lewis, Overton: A robust gradient sampling algorithm for nonsmooth, nonconvex optimization. SIAM Journal on Optimization (2005) ⋄ Solodov, Svaiter: A hybrid projection-proximal point algorithm. Journal of convex analysis (1999) ⋄ de Oliveira, Sagastizábal: Bundle methods in the XXIst century: A bird’s-eye view. Pesquisa Operacional (2014)
> For Machine Learning objectives, it can often be harnessed
- Explicit/“proximable” regularizations: ℓ1, nuclear norm
- We know the expressions and activity of sought structures: sparsity, rank
See the talks of ...
⋄ Bach, et al.: Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning (2012) ⋄ Massias, Salmon, Gramfort: Celer: a fast solver for the lasso with dual extrapolation. ICML (2018) ⋄ Liang, Fadili, Peyré: Local linear convergence of forward–backward under partial smoothness. NeurIPS (2014) ⋄ O’Donoghue, Candes: Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics (2015)
>>> Noticeable Structure
Find
  $x^\star \in \arg\min_{x\in\mathbb{R}^n} R(x; \{a_i, b_i\}_{i=1}^m) + \lambda\, r(x)$
Find
  $x^\star \in \arg\min_{x\in\mathbb{R}^n} f(x) + g(x)$, with $f$ smooth and $g$ non-smooth
A reason why the nonsmoothness of ML problems can be leveraged is their noticeable structure, that is:
We can design a lookout collection $\mathsf{C} = \{\mathcal{M}_1, \dots, \mathcal{M}_p\}$ of closed sets such that:
(i) we have a projection mapping $\mathrm{proj}_{\mathcal{M}_i}$ onto $\mathcal{M}_i$ for all $i$;
(ii) $\mathrm{prox}_{\gamma g}(u)$ is a singleton and can be computed explicitly for any $u$ and $\gamma$;
(iii) upon computation of $x = \mathrm{prox}_{\gamma g}(u)$, we know if $x \in \mathcal{M}_i$ or not for all $i$.
⇒ Identification can be directly harnessed.
Example: Sparse structure and $g = \|\cdot\|_1$, $\|\cdot\|_{0.5}^{0.5}$, $\|\cdot\|_0$, ...
  $\mathsf{C} = \{\mathcal{M}_1, \dots, \mathcal{M}_n\}$ with $\mathcal{M}_i = \{x \in \mathbb{R}^n : x_i = 0\}$
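For the sparse example, the three properties of a lookout collection can be made concrete in a few lines. This is a sketch under assumptions: the function names (`project_onto`, `prox_l1`, `identified_sets`) are hypothetical, and membership in a set of the collection is tested by exact comparison to zero, which is reasonable here because soft-thresholding returns exact zeros.

```python
from typing import List

Vector = List[float]


def project_onto(x: Vector, i: int) -> Vector:
    """(i) Projection onto M_i = {x : x_i = 0}: zero out coordinate i."""
    return [0.0 if j == i else xj for j, xj in enumerate(x)]


def prox_l1(u: Vector, t: float) -> Vector:
    """(ii) Explicit, single-valued proximity operator of t * ||.||_1."""
    out = []
    for ui in u:
        if ui > t:
            out.append(ui - t)
        elif ui < -t:
            out.append(ui + t)
        else:
            out.append(0.0)
    return out


def identified_sets(x: Vector) -> List[int]:
    """(iii) Indices i such that x belongs to M_i, read off the prox output."""
    return [i for i, xi in enumerate(x) if xi == 0.0]


x = prox_l1([1.5, -0.2, 0.3], 0.5)
print(x, identified_sets(x))
```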
>>> Question
Lookout collection $\mathsf{C} = \{\mathcal{M}_1, \dots, \mathcal{M}_p\}$ of closed sets such that:
(i) we have a projection mapping $\mathrm{proj}_{\mathcal{M}_i}$ onto $\mathcal{M}_i$ for all $i$;
(ii) $\mathrm{prox}_{\gamma g}(u)$ is a singleton and can be computed explicitly for any $u$ and $\gamma$;
(iii) upon computation of $x = \mathrm{prox}_{\gamma g}(u)$, we know if $x \in \mathcal{M}_i$ or not for all $i$.
(QC) $\exists \varepsilon > 0$ such that for all $u \in \mathcal{B}(u^\star, \varepsilon)$, $\mathrm{prox}_{\gamma g}(u) \in \mathcal{M}$
Take any proximal algorithm (prox-Update)
  $u_{k+1} = \mathrm{Update}(f; \{x_\ell\}_{\ell\le k}; \{u_\ell\}_{\ell\le k}; \gamma)$
  $x_{k+1} = \mathrm{prox}_{\gamma g}(u_{k+1})$
such that $(u_k)$ converges almost surely to a point $u^\star$ with $x^\star = \mathrm{prox}_{\gamma g}(u^\star)$ a solution of the problem.
Let’s use the structure
What can we do on the way to identification/when screening is inefficient?
  not close to $x^\star$, no explicit or bad dual (non-convex), $\mathrm{prox}_{\gamma g}(u_k)$ difficult to evaluate
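Define $\mathsf{M}_k = \mathbb{R}^n \cap \bigcap_{i:\, x_k \in \mathcal{M}_i} \mathcal{M}_i$ and $\mathsf{M}^\star := \mathbb{R}^n \cap \bigcap_{i:\, x^\star \in \mathcal{M}_i} \mathcal{M}_i$, then:
  $\mathsf{M}_k \subset \mathbb{R}^n$ (partial identification/screening) and $\mathsf{M}_k = \mathsf{M}^\star$ after some finite time (identification)
1– Observing $\mathsf{M}_k$ can help reduce the dimension of the problem on the way
  Can we efficiently restrict Update using $\mathsf{M}_k$?
2– The uncovered structure along the way bears valuable information
  Does accelerated proximal gradient identify as well as vanilla?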
ADAPTIVE SUBSPACE DESCENT
INTERPLAY BETWEEN ACCELERATION AND IDENTIFICATION
>>> Adaptive Proximal Gradient
Disclaimer: This part of the talk assumes that the identified manifolds are linear subspaces, e.g. for $g = \|Dx\|_1$.
  $y_k = x_k - \gamma\nabla f(x_k)$
  $z_k = y_k$
  $x_{k+1} = \mathrm{prox}_{\gamma g}(z_k)$
> Vanilla Proximal gradient identifies but does not use it
  full gradient computed at each iteration
Example: Sparse structure and $g = \|\cdot\|_1$
  $\mathsf{C} = \{\mathcal{M}_1, \dots, \mathcal{M}_n\}$ with $\mathcal{M}_i = \{x \in \mathbb{R}^n : x_i = 0\}$
  $\mathsf{M}_k = \{x \in \mathbb{R}^n : x_i = 0 \text{ for all } i \text{ such that } x_{k,i} = 0\}$
>>> Adaptive Proximal Gradient
Observe $\mathsf{M}_k = \mathbb{R}^n \cap \bigcap_{i:\, x_k \in \mathcal{M}_i} \mathcal{M}_i$
  $y_k = x_k - \gamma\nabla f(x_k)$
  $z_k = \mathrm{proj}_{\mathsf{M}_k}(y_k) + \mathrm{proj}^\perp_{\mathsf{M}_k}(z_{k-1})$
  $x_{k+1} = \mathrm{prox}_{\gamma g}(z_k)$
> Direct use of identification may not converge
  e.g. starting with 0
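The failure mode "starting with 0" can be reproduced on a tiny Lasso instance. The sketch below is illustrative only: the problem data, the convention $z_{-1} = x_0$ and the function name `direct_identification_pg` are assumptions. In the sparse case, projecting onto the observed subspace amounts to updating only the coordinates where the current iterate is nonzero and keeping the previous value of z elsewhere.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def soft_threshold(u: float, t: float) -> float:
    if u > t:
        return u - t
    if u < -t:
        return u + t
    return 0.0


def gradient(A: Matrix, b: Vector, x: Vector) -> Vector:
    """A^T (A x - b)."""
    r = [sum(a * xi for a, xi in zip(row, x)) - bi for row, bi in zip(A, b)]
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(x))]


def direct_identification_pg(A, b, lam, gamma, x0, n_iter) -> Vector:
    """z_k = proj_{M_k}(y_k) + proj_{M_k}^perp(z_{k-1}) for the sparse collection."""
    x = list(x0)
    z_prev = list(x0)  # assumption: z_{-1} = x_0
    for _ in range(n_iter):
        frozen = {i for i, xi in enumerate(x) if xi == 0.0}  # coordinates fixed by M_k
        g = gradient(A, b, x)
        y = [xi - gamma * gi for xi, gi in zip(x, g)]
        # Coordinates in `frozen` keep their previous z value; the others follow y.
        z = [z_prev[i] if i in frozen else y[i] for i in range(len(x))]
        x = [soft_threshold(zi, lam * gamma) for zi in z]
        z_prev = z
    return x


# Illustrative problem (assumption): the Lasso solution is (1.2, 0).
A = [[1.0, 0.5], [0.5, 1.0]]
b = [2.0, 0.0]
stuck = direct_identification_pg(A, b, 0.5, 0.4, [0.0, 0.0], 200)
fine = direct_identification_pg(A, b, 0.5, 0.4, [1.0, 1.0], 200)
print(stuck, fine)
```

Started at 0, every coordinate is frozen from the first iteration and the method never moves, while the start at (1, 1) happens to reach the solution on this instance.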
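>>> Adaptive Proximal Gradient
Observe $\mathsf{M}_k = \mathbb{R}^n \cap \bigcap_{i:\, x_k \in \mathcal{M}_i} \left(\xi_{k,i}\mathcal{M}_i + (1 - \xi_{k,i})\mathbb{R}^n\right)$ for $\xi_{k,i} \sim \mathcal{B}(p)$
  $y_k = x_k - \gamma\nabla f(x_k)$
  $z_k = \mathrm{proj}_{\mathsf{M}_k}(y_k) + \mathrm{proj}^\perp_{\mathsf{M}_k}(z_{k-1})$
  $x_{k+1} = \mathrm{prox}_{\gamma g}(z_k)$
> Mixing Identification and Randomized coordinate descent biases gradient
  convergence issues
Example: Sparse structure and $g = \|\cdot\|_1$
  $\mathsf{C} = \{\mathcal{M}_1, \dots, \mathcal{M}_n\}$ with $\mathcal{M}_i = \{x \in \mathbb{R}^n : x_i = 0\}$
  $\mathsf{M}_k = \{x \in \mathbb{R}^n : x_i = 0 \text{ for the sampled } i\ (\xi_{k,i} = 1) \text{ such that } x_{k,i} = 0\}$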
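In the sparse case, the randomized subspace can be sampled by drawing one Bernoulli variable per identified coordinate. The sketch below only illustrates this sampling step; the names `sample_constraints` and `project` are hypothetical, and $\mathcal{B}(p)$ is read as a Bernoulli distribution of parameter p, which is an assumption about the slide's notation.

```python
import random
from typing import List, Set

Vector = List[float]


def sample_constraints(x: Vector, p: float, rng: random.Random) -> Set[int]:
    """Indices i with x_i = 0 whose constraint is kept (xi_{k,i} = 1 with prob. p)."""
    return {i for i, xi in enumerate(x) if xi == 0.0 and rng.random() < p}


def project(y: Vector, constrained: Set[int]) -> Vector:
    """Projection onto the sampled subspace: zero out the constrained coordinates."""
    return [0.0 if i in constrained else yi for i, yi in enumerate(y)]


rng = random.Random(0)  # fixed seed, only for reproducibility of the illustration
x_k = [0.0, 1.3, 0.0, -0.7, 0.0]
kept = sample_constraints(x_k, p=0.5, rng=rng)
print(kept, project([1.0, 2.0, 3.0, 4.0, 5.0], kept))
```

Coordinates where the iterate is nonzero are never constrained, so they are always updated; only the identified coordinates are skipped at random.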
>>> Adaptive Proximal Gradient
Observe $\mathsf{M}_k = \mathbb{R}^n \cap \bigcap_{i:\, x_k \in \mathcal{M}_i} \left(\xi_{k,i}\mathcal{M}_i + (1 - \xi_{k,i})\mathbb{R}^n\right)$ for $\xi_{k,i} \sim \mathcal{B}(p)$
  $y_k = x_k - \gamma\nabla f(x_k)$
  $z_k = Q_k^{-1}\left(\mathrm{proj}_{\mathsf{M}_k}(Q_k y_k) + \mathrm{proj}^\perp_{\mathsf{M}_k}(z_{k-1})\right)$
  $x_{k+1} = \mathrm{prox}_{\gamma g}(z_k)$
> With $Q_k := (\mathbb{E}\,\mathrm{proj}_{\mathsf{M}_k})^{-1/2}$, this works after identification
  but not before, which prevents identification...
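As a worked illustration of the correction matrix, consider the sparse collection with the Bernoulli sampling described above. The computation below is a sketch under assumptions: $Z_k$ is notation introduced here for the zero coordinates of $x_k$, the expectation is taken conditionally on $x_k$, and $p < 1$ is needed for the inverse to exist.

```latex
% Sparse case: \mathcal{M}_i = \{x : x_i = 0\}, \quad Z_k = \{ i : x_{k,i} = 0 \} (notation assumed here).
\[
\big[\mathrm{proj}_{\mathsf{M}_k}\big]_{ii} =
\begin{cases}
0 & \text{if } i \in Z_k \text{ and } \xi_{k,i} = 1,\\
1 & \text{otherwise,}
\end{cases}
\qquad
\big[\mathbb{E}\,\mathrm{proj}_{\mathsf{M}_k}\big]_{ii} =
\begin{cases}
1 - p & \text{if } i \in Z_k,\\
1 & \text{otherwise,}
\end{cases}
\]
\[
Q_k = \big(\mathbb{E}\,\mathrm{proj}_{\mathsf{M}_k}\big)^{-1/2}
    = \mathrm{diag}(q_1, \dots, q_n),
\qquad
q_i =
\begin{cases}
(1 - p)^{-1/2} & \text{if } i \in Z_k,\\
1 & \text{otherwise.}
\end{cases}
\]
```

In this diagonal setting $Q_k$ simply rescales the coordinates that are updated less often, which compensates on average for the randomly skipped updates.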
TV-regularized logistic regression:
[Figure: two panels against iterations, comparing adaptation "every iteration" with adaptation "as in theory". Left: structural sparsity of the iterates. Right: suboptimality (log scale).]
>>> Adaptive Proximal Gradient
Observe $\mathsf{M}_k = \mathbb{R}^n \cap \bigcap_{i:\, x_\ell \in \mathcal{M}_i} \left(\xi_{k,i}\mathcal{M}_i + (1 - \xi_{k,i})\mathbb{R}^n\right)$ for $\xi_{k,i} \sim \mathcal{B}(p)$
  $y_k = x_k - \gamma\nabla f(x_k)$
  $z_k = Q_k^{-1}\left(\mathrm{proj}_{\mathsf{M}_k}(Q_k y_k) + \mathrm{proj}^\perp_{\mathsf{M}_k}(z_{k-1})\right)$
  $x_{k+1} = \mathrm{prox}_{\gamma g}(z_k)$
  Check if an adaptation can be performed, if so $\ell \leftarrow k + 1$
> Generalized Support adaptation can be performed at some iterations
  depends on the amount of change $\|Q_k Q_{k+1}^{-1}\|$ and harshness of the sparsification $\lambda_{\min}(Q_k)$
>>> Adaptive Subspace descent
TV-reg. logistic regression on a1a (1605 × 143), 90% final jump sparsity
[Figure: three panels comparing PGD, a 20% variant without identification, and ARPSD with 10%, 20% and 50% sampling. Panel 1: iterate density vs. iteration. Panel 2: suboptimality vs. iteration. Panel 3: suboptimality vs. number of subspaces explored.]
> Iterate structure enforced by nonsmooth regularizers can be used to adapt the selection probabilities of coordinate descent/sketching;
> Before identification, adaptation has to be moderate.
⊲ Grishchenko, I., & Malick: Proximal Gradient Methods with Adaptive Subspace Sampling, in revision for Mathematics of Operations Research
available on my webpage, more details at SMAI MODE
ADAPTIVE SUBSPACE DESCENT
INTERPLAY BETWEEN ACCELERATION AND IDENTIFICATION
>>> Acceleration of the Proximal Gradient
  $u_{k+1} = y_k - \gamma\nabla f(y_k)$
  $x_{k+1} = \mathrm{prox}_{\gamma g}(u_{k+1})$
  $y_{k+1} = x_{k+1} + \alpha_{k+1}(x_{k+1} - x_k)$   (inertia/acceleration)
> $\alpha_{k+1} \equiv 0$: vanilla Proximal Gradient
> $\alpha_{k+1} = \frac{k-1}{k+3}$: accelerated Proximal Gradient (aka FISTA)
  Optimal rate for composite problems (coefficients may vary a little)

                                 PG        Accel. PG
  $F(x_k) - F^\star$             O(1/k)    O(1/k²)
  iterates convergence           yes       yes
  monotone functional decrease   yes       no
  Fejér-monotone iterates        yes       no
⋄ Nesterov: A method for solving the convex programming problem with convergence rate O(1/k²). Doklady AN SSSR (1983) ⋄ Beck, Teboulle: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences (2009) ⋄ Chambolle, Dossal: On the convergence of the iterates of “FISTA”. Journal of Optimization Theory and Applications (2015) ⋄ I., Malick: On the Proximal Gradient Algorithm with Alternated Inertia. Journal of Optimization Theory and Applications (2018)
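The inertial scheme of this slide can be sketched on a small Lasso instance, with a flag to switch between the vanilla and accelerated variants. This is a minimal sketch under assumptions: the indexing starts at k = 1 (so the first inertial coefficient is 0), the problem data are invented, and `accelerated_pg` is a hypothetical name.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def soft_threshold(u: float, t: float) -> float:
    if u > t:
        return u - t
    if u < -t:
        return u + t
    return 0.0


def prox_grad_step(A: Matrix, b: Vector, lam: float, gamma: float, y: Vector) -> Vector:
    """One step prox_{gamma*lam*||.||_1}(y - gamma * A^T(Ay - b))."""
    r = [sum(a * yi for a, yi in zip(row, y)) - bi for row, bi in zip(A, b)]
    g = [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(y))]
    return [soft_threshold(yi - gamma * gi, lam * gamma) for yi, gi in zip(y, g)]


def accelerated_pg(A, b, lam, gamma, x0, n_iter, accelerate=True) -> Vector:
    """Inertial proximal gradient; accelerate=False gives the vanilla method."""
    x = list(x0)
    y = list(x0)
    for k in range(1, n_iter + 1):  # assumption: iteration counter starts at 1
        x_next = prox_grad_step(A, b, lam, gamma, y)
        alpha = (k - 1) / (k + 3) if accelerate else 0.0
        y = [xn + alpha * (xn - xo) for xn, xo in zip(x_next, x)]  # inertial step
        x = x_next
    return x


# Illustrative problem (assumption): the Lasso solution is (1.2, 0).
A = [[1.0, 0.5], [0.5, 1.0]]
b = [2.0, 0.0]
x_pg = accelerated_pg(A, b, 0.5, 0.4, [0.0, 0.0], 300, accelerate=False)
x_apg = accelerated_pg(A, b, 0.5, 0.4, [0.0, 0.0], 300, accelerate=True)
print(x_pg, x_apg)
```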
>>> Interplay between Acceleration and Identification
  $\min_{x\in\mathbb{R}^2} \|Ax - b\|_2^2 + \lambda g(x)$
[Figure: two panels of level sets with the Proximal Gradient iterates converging to $x^\star$. Left: $g(x) = \|x\|_1$, 1-norm regularization. Right: $g(x) = \max(\|x\|_{1.3} - 1; 0)$, distance to the 1.3-norm unit ball.]
[Figure: the same two panels, now with both Proximal Gradient and Accelerated Proximal Gradient iterates; the manifold $\mathcal{M}$ is marked on the right panel.]
> PG identifies well; > Accelerated PG explores well, identifies eventually, but erratically.
Can we converge fast and identify well?
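>>> A test-based algorithm
T is a boolean function of past iterates; it decides whether to accelerate or not.
  $u_{k+1} = y_k - \gamma\nabla f(y_k)$
  $x_{k+1} = \mathrm{prox}_{\gamma g}(u_{k+1})$
  $y_{k+1} = \begin{cases} x_{k+1} + \alpha_{k+1}(x_{k+1} - x_k) & \text{if } T = 1 \\ x_{k+1} & \text{if } T = 0 \end{cases}$
Proposed tests: We use our lookout collection $\mathsf{C}$
- 1. No Acceleration, i.e. $T^1 = 0$, when reaching a new one:
  $x_{k+1} \in \mathcal{M}$ and $x_k \notin \mathcal{M}$ for some $\mathcal{M} \in \mathsf{C}$.
- 2. No Acceleration, i.e. $T^2 = 0$, if this means leaving:
  $T_\gamma(x_{k+1}) \in \mathcal{M}$ and $T_\gamma(x_{k+1} + \alpha_{k+1}(x_{k+1} - x_k)) \notin \mathcal{M}$ for some $\mathcal{M} \in \mathsf{C}$,
where $T_\gamma := \mathrm{prox}_{\gamma g}(\cdot - \gamma\nabla f(\cdot))$ is the proximal gradient operator.
For analysis reasons, we allow no acceleration only when $\|T_\gamma(y_k) - y_k\|^2 \le \delta$ and $F(T_\gamma(y_k)) \le F(x_0)$.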
>>> Convergence result
Theorem
Let $f$, $g$ be two convex functions such that $f$ is $L$-smooth, $g$ is lower semi-continuous, and $f + g$ is semi-algebraic with a minimizer. Take $\gamma \in (0, 1/L]$. Then, the iterates of the proposed methods with test $T^1$ or $T^2$ verify
  $F(x_{k+1}) - F^\star \le \frac{9\|x_0 - x^\star\|^2}{2\gamma(k+2)^2} + \frac{9kR}{2\gamma(k+2)^2} = O\left(\frac{1}{k}\right)$ for some $R > 0$.
Furthermore, if the problem has a unique minimizer $x^\star$ and the qualifying constraint (QC) holds, then the iterates sequence $(x_k)$ converges, finite-time identification happens and
  $F(x_{k+1}) - F(x^\star) \le \frac{9\|x_0 - x^\star\|^2}{2\gamma(k+2)^2} + \frac{9KR}{2\gamma(k+2)^2} = O\left(\frac{1}{k^2}\right)$ for some finite $K > 0$.
$L$-smooth means that $f$ is differentiable and $\nabla f$ is $L$-Lipschitz continuous.
>>> Back to initial problems: ℓ1 norm
  $\min_{x\in\mathbb{R}^2} \|Ax - b\|_2^2 + \lambda\|x\|_1$
[Figure: level sets with the iterates of Proximal Gradient and Accelerated Proximal Gradient converging to $x^\star$.]
[Figure: the same level sets with the iterates of Proximal Gradient, Accelerated Proximal Gradient, and the proposed algorithm with tests T1 and T2; next to it, $F(x_k) - F^\star$ (log scale) against the number of proximal gradient steps for Proximal Gradient, Accel. Proximal Gradient, Prov. Alg – T1 and Prov. Alg – T2. ⊕ marks identification time.]
>>> Back to initial problems: distance to 1.3-norm ball
  $\min_{x\in\mathbb{R}^2} \|Ax - b\|_2^2 + \lambda\max(\|x\|_{1.3} - 1; 0)$
[Figure: level sets with the iterates of Proximal Gradient and Accelerated Proximal Gradient converging to $x^\star$; the manifold $\mathcal{M}$ is marked.]
[Figure: the same level sets with the iterates of Proximal Gradient, Accelerated Proximal Gradient, and the proposed algorithm with tests T1 and T2; next to it, $F(x_k) - F^\star$ (log scale) against the number of proximal gradient steps for Proximal Gradient, Accel. Proximal Gradient, Prov. Alg – T1 and Prov. Alg – T2. ⊕ marks identification time.]
>>> Matrix regression with nuclear-norm regularization
  $\min_{X\in\mathbb{R}^{20\times 20}} \|AX - B\|_F^2 + \lambda\|X\|_*$
> $S \in \mathbb{R}^{20\times 20}$ is a rank 3 matrix;
> $A \in \mathbb{R}^{(16\times 16)\times(20\times 20)}$ is drawn from the normal distribution;
> $B = AS + E$ with $E$ drawn from the normal distribution with variance .01
[Figure: four panels (Proximal Gradient, Accel. Proximal Gradient, T1, T2), each showing $F(x_k) - F^\star$ (log scale) and dim Ker($X_k$)/dim Ker($S$) (in %) against iterations.]
>>> On the Interplay between Acceleration and Identification
> acceleration can hurt identification for the proximal gradient algorithm;
> we proposed a method with stable identification behavior, maintaining an accelerated convergence rate.
⊲ Bareilles & I.: On the Interplay between Acceleration and Identification for the Proximal Gradient algorithm. arXiv:1909.08944
Try it in Julia on https://github.com/GillesBareilles/Acceleration-Identification
>>> Harnessing Structure in Optimization for ML
> Machine Learning problems often have a noticeable structure;
> We can design a lookout collection $\mathsf{C} = \{\mathcal{M}_1, \dots, \mathcal{M}_p\}$ of sets: (i) with easy projections; (ii) identified by proximity operations; (iii) we know if these sets are identified or not;
> This structure can/should be harnessed but may be tricky before identification.
⊲ Malick & I.: Nonsmoothness can help! On the Specific Structure of Machine Learning problems, review/pedagogical paper coming hopefully soon
thanks to this week at CIRM, but it also depends on whether we go hiking/running in the calanques, which may very well be the case
Thanks to ANR JCJC STROLL & IDEX UGA IRS DOLL & PGMO
Thank you! – Franck IUTZELER
http://www.iutzeler.org