Convex optimization based on global lower second-order models
Nikita Doikov, Yurii Nesterov
UCLouvain, Belgium
NeurIPS 2020
Problem

Composite convex optimization problem:

    min_x F(x) := f(x) + ψ(x)

◮ f is convex, differentiable.
◮ ψ : R^n → R ∪ {+∞} is convex, simple.
◮ dom ψ is bounded, D := diam(dom ψ).

Example: ψ(x) = 0 if ‖x‖ ≤ D/2, and +∞ otherwise
⇒ the problem with ball regularization: min_{‖x‖ ≤ D/2} f(x).
Review: Gradient Methods

Let ∇f be Lipschitz continuous: ‖∇f(y) − ∇f(x)‖_* ≤ L‖y − x‖.

The Gradient Method:

    x_{k+1} = argmin_y { f(x_k) + ⟨∇f(x_k), y − x_k⟩ + (L/2)‖y − x_k‖² + ψ(y) }.

◮ Global convergence: F(x_k) − F* ≤ O(1/k).

The Conditional Gradient Method [Frank-Wolfe, 1956]:

    v_{k+1} = argmin_y { f(x_k) + ⟨∇f(x_k), y − x_k⟩ + ψ(y) },
    x_{k+1} = γ_k v_{k+1} + (1 − γ_k) x_k.

◮ Set γ_k = 2/(k+2). Then F(x_k) − F* ≤ O(1/k).

Note: near-optimal for ‖·‖_∞-balls [Guzmán-Nemirovski, 2015].
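As a quick illustration, here is a minimal Python sketch of both updates for the ball example ψ from the problem slide (the indicator of ‖x‖ ≤ D/2), where the Frank-Wolfe linear minimization step has a closed form. The quadratic test function, names, and parameter values are our assumptions for the demo, not part of the talk.

```python
import numpy as np

def frank_wolfe_step(x, grad, D, k):
    """One Conditional Gradient step with ψ = indicator of {‖v‖ ≤ D/2}.

    Over a Euclidean ball the linear minimization oracle has the
    closed form v = -(D/2) * g / ‖g‖.
    """
    v = -(D / 2) * grad / np.linalg.norm(grad)
    gamma = 2.0 / (k + 2)                 # step size from the slide
    return gamma * v + (1 - gamma) * x

def projected_gradient_step(x, grad, D, L):
    """One Gradient Method step; for the ball indicator ψ the argmin
    reduces to projecting the gradient step back onto the ball."""
    y = x - grad / L
    norm_y = np.linalg.norm(y)
    return y if norm_y <= D / 2 else (D / 2) * y / norm_y

# Demo on a simple convex quadratic f(x) = ½‖Ax − b‖² (an assumption).
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
L, D = np.linalg.norm(A.T @ A, 2), 2.0
x = np.zeros(5)
for k in range(50):
    g = A.T @ (A @ x - b)
    x = frank_wolfe_step(x, g, D, k)
print("f(x) after 50 FW steps:", 0.5 * np.linalg.norm(A @ x - b) ** 2)
```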
Review: Second-Order Methods

Let ∇²f be Lipschitz continuous: ‖∇²f(x) − ∇²f(y)‖ ≤ L‖x − y‖.

Newton Method:

    x_{k+1} = argmin_y { ⟨∇f(x_k), y − x_k⟩ + (1/2)⟨∇²f(x_k)(y − x_k), y − x_k⟩ + ψ(y) }.

◮ Quadratic convergence (if ∇²f(x*) ≻ 0 and x_0 is close to x*).
◮ No global convergence. A heuristic: use line search in practice.

Newton Method with Cubic Regularization:

    x_{k+1} = argmin_y { ⟨∇f(x_k), y − x_k⟩ + (1/2)⟨∇²f(x_k)(y − x_k), y − x_k⟩ + (L/6)‖y − x_k‖³ + ψ(y) }.

◮ Global rate: F(x_k) − F* ≤ O(1/k²) [Nesterov-Polyak, 2006].
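For the unconstrained case (ψ ≡ 0), the cubic step admits a classical computation: the minimizer s satisfies (∇²f(x_k) + (L r/2) I) s = −∇f(x_k) with r = ‖s‖, which can be found by a one-dimensional search. Below is a minimal sketch under the assumption that f is convex (Hessian positive semidefinite); this is a standard way to solve the subproblem, not code from the paper.

```python
import numpy as np

def cubic_newton_step(g, H, M, tol=1e-10):
    """Solve min_s ⟨g, s⟩ + ½⟨Hs, s⟩ + (M/6)‖s‖³ for H ⪰ 0.

    The stationarity condition is (H + (M r / 2) I) s = -g with r = ‖s‖,
    so we bisect on r until the fixed point ‖s(r)‖ = r is reached.
    """
    def s_of(r):
        return np.linalg.solve(H + 0.5 * M * r * np.eye(len(g)), -g)

    lo, hi = 0.0, 1.0
    while np.linalg.norm(s_of(hi)) > hi:      # bracket the fixed point
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(s_of(mid)) > mid:
            lo = mid
        else:
            hi = mid
    return s_of(hi)                           # step: x ← x + s
```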
Overview of the Contributions

New second-order algorithms with global convergence proofs.
◮ The methods are universal (no unknown parameters).
◮ Affine-invariant (the norm is not fixed).

Stochastic methods (basic and with variance reduction).

Numerical experiments.
Second-Order Lower Model

1. f is convex: f(y) ≥ f(x) + ⟨∇f(x), y − x⟩.
2. ∇²f is Lipschitz continuous: ‖∇²f(x) − ∇²f(y)‖ ≤ L‖x − y‖.

Convexity + smoothness ⇒ a tighter lower bound: for all t ∈ [0, 1],

    f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (t/2)⟨∇²f(x)(y − x), y − x⟩ − (t²L/6)‖y − x‖³.

[Figure: the second-order lower model vs. the first-order (linear) lower bound.]
New Algorithm

Contracting-Domain Newton Method:

    v_{k+1} = argmin_y { ⟨∇f(x_k), y − x_k⟩ + (γ_k/2)⟨∇²f(x_k)(y − x_k), y − x_k⟩ + ψ(y) },
    x_{k+1} = γ_k v_{k+1} + (1 − γ_k) x_k.
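A minimal sketch of one iteration for the ball example ψ. The subproblem is handed to a generic constrained solver (SciPy's SLSQP) purely for illustration; the paper does not prescribe a subproblem solver, and this quadratic can be minimized over a ball far more efficiently.

```python
import numpy as np
from scipy.optimize import minimize

def contracting_newton_step(x, grad, hess, D, k):
    """One Contracting-Domain Newton iteration for ψ = indicator
    of the ball {‖v‖ ≤ D/2} (the example from the problem slide)."""
    gamma = 3.0 / (k + 3)                       # schedule from Theorem 1 below

    def model(v):                               # ⟨g, v−x⟩ + (γ/2)⟨H(v−x), v−x⟩
        d = v - x
        return grad @ d + 0.5 * gamma * d @ hess @ d

    cons = {"type": "ineq", "fun": lambda v: D / 2 - np.linalg.norm(v)}
    v_next = minimize(model, x, constraints=cons, method="SLSQP").x
    return gamma * v_next + (1 - gamma) * x     # contracted step
```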
Trust-Region Interpretation

Contracting-Domain Newton Method (reformulation):

    x_{k+1} = argmin_y { ⟨∇f(x_k), y − x_k⟩ + (1/2)⟨∇²f(x_k)(y − x_k), y − x_k⟩ + γ_k ψ(x_k + (1/γ_k)(y − x_k)) }.

Regularization of the quadratic model by an asymmetric trust region: the model is minimized over the contracted set (1 − γ_k) x_k + γ_k · dom ψ.
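The equivalence of the two formulations is a direct change of variables; a short check in the notation of the previous slide:

```latex
% Substitute y = x_k + \gamma_k (v - x_k), i.e. v - x_k = (y - x_k)/\gamma_k,
% into the subproblem of the Contracting-Domain Newton Method:
\langle \nabla f(x_k), v - x_k \rangle
  + \tfrac{\gamma_k}{2}\,\langle \nabla^2 f(x_k)(v - x_k), v - x_k \rangle
  + \psi(v)
  \;=\; \tfrac{1}{\gamma_k}\Bigl[\,\langle \nabla f(x_k), y - x_k \rangle
  + \tfrac{1}{2}\,\langle \nabla^2 f(x_k)(y - x_k), y - x_k \rangle
  + \gamma_k\,\psi\bigl(x_k + \tfrac{1}{\gamma_k}(y - x_k)\bigr)\Bigr].
% The positive factor 1/\gamma_k does not change the argmin, and the optimal y
% is exactly x_{k+1} = \gamma_k v_{k+1} + (1 - \gamma_k) x_k.
```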
Global Convergence

Let ∇²f be Lipschitz continuous: ‖∇²f(x) − ∇²f(y)‖ ≤ L‖x − y‖ (w.r.t. an arbitrary norm).

Theorem 1. Set γ_k = 3/(k+3). Then

    F(x_k) − F* ≤ O(LD³/k²).

Theorem 2. Let ψ be strongly convex with parameter μ > 0.

◮ Set γ_k = 5/(k+5). Then

    F(x_k) − F* ≤ O((LD/μ) · (LD³/k⁴)).

◮ Set γ_k = 1/(1+ω), where ω := [LD/(2μ)]^{1/2}. Then

    F(x_k) − F* ≤ exp(−(k−1)/(1+ω)) · LD³/2.
Experiments: Logistic Regression

    min_{‖x‖₂ ≤ D/2} ∑_{i=1}^{M} f_i(x),   f_i(x) = log(1 + exp(⟨a_i, x⟩)).

D plays the role of a regularization parameter.
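A minimal numpy sketch of the full oracle for this objective (the matrix A stacking the rows a_i and the function name are our notation, not from the talk); the formulas follow from σ(t) = 1/(1 + e^{−t}):

```python
import numpy as np

def logistic_oracle(A, x):
    """Value, gradient and Hessian of f(x) = Σ_i log(1 + exp(⟨a_i, x⟩)).

    ∇f(x)  = Σ_i σ(⟨a_i, x⟩) a_i
    ∇²f(x) = Σ_i σ(⟨a_i, x⟩)(1 − σ(⟨a_i, x⟩)) a_i a_iᵀ
    """
    t = A @ x
    sigma = 1.0 / (1.0 + np.exp(-t))
    value = np.sum(np.logaddexp(0.0, t))    # numerically stable log(1 + e^t)
    grad = A.T @ sigma
    hess = (A * (sigma * (1.0 - sigma))[:, None]).T @ A   # Aᵀ diag(w) A
    return value, grad, hess
```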
[Figure: functional residual vs. iterations on w8a, for D = 20 and D = 100. Methods: Frank-Wolfe, Gradient Method, Fast Gradient Method, Contracting Newton, Aggregated Newton; running times are marked on the curves.]
For larger D the problem becomes more ill-conditioned.
Stochastic Methods for Logistic Regression

Approximate ∇f(x), ∇²f(x) by stochastic estimates.
[Figure: functional residual vs. epochs on YearPredictionMSD, D = 20. Methods: SGD, SVRG, SNewton, SVRNewton; running times are marked on the curves.]
A problem with a large dataset (M = 463715) and small dimension (n = 90).
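A sketch of the subsampled estimates such methods use, reusing the hypothetical logistic_oracle from the earlier sketch; the batch size and the rescaling convention are our assumptions for illustration:

```python
import numpy as np

def stochastic_oracle(A, x, batch_size, rng):
    """Unbiased minibatch estimates of ∇f(x) and ∇²f(x),
    rescaled by M / batch_size to match the full sum.

    Assumes logistic_oracle from the earlier sketch is in scope.
    """
    M = A.shape[0]
    idx = rng.choice(M, size=batch_size, replace=False)
    _, grad, hess = logistic_oracle(A[idx], x)
    scale = M / batch_size
    return scale * grad, scale * hess
```

With batch_size = M this recovers the exact oracle; variance-reduced variants (SVRG, SVRNewton) additionally anchor such estimates at a periodically refreshed reference point.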
Conclusions

Second-order information helps in the case of
◮ ill-conditioning;
◮ small or moderate dimension (the subproblems are more expensive).

No need to tune a stepsize.

Can be preferable for solving problems over sets with a non-Euclidean geometry.
Follow-Up Results

Nikita Doikov and Yurii Nesterov. "Affine-invariant contracting-point methods for Convex Optimization". arXiv:2009.08894 (2020).

◮ General framework of Contracting-Point Methods.
◮ Contracting-Point Tensor Methods of order p ≥ 1: F(x_k) − F* ≤ O(1/k^p).
◮ Affine-invariant smoothness condition ⇒ affine-invariant analysis.
Thank you for your attention!