

SLIDE 1

Convex optimization based on global lower second-order models

Nikita Doikov Yurii Nesterov

UCLouvain, Belgium

NeurIPS 2020

SLIDE 2

Problem

Composite convex optimization problem:

    min_x  F(x) := f(x) + ψ(x)

◮ f is convex, differentiable.
◮ ψ : R^n → R ∪ {+∞} is convex, simple.
◮ dom ψ is bounded, with D := diam(dom ψ).

Example: ψ(x) = 0 if ‖x‖ ≤ D/2, and +∞ otherwise.

⇒ The problem with ball-regularization: min_{‖x‖ ≤ D/2} f(x).


SLIDE 3

Review: Gradient Methods

Let ∇f be Lipschitz continuous: ‖∇f(y) − ∇f(x)‖* ≤ L‖y − x‖.

The Gradient Method:

    x_{k+1} = argmin_y { f(x_k) + ⟨∇f(x_k), y − x_k⟩ + (L/2)‖y − x_k‖² + ψ(y) }.

◮ Global convergence: F(x_k) − F* ≤ O(1/k).

The Conditional Gradient Method [Frank-Wolfe, 1956]:

    v_{k+1} = argmin_y { f(x_k) + ⟨∇f(x_k), y − x_k⟩ + ψ(y) },
    x_{k+1} = γ_k v_{k+1} + (1 − γ_k) x_k.

◮ Set γ_k = 2/(k+2). Then F(x_k) − F* ≤ O(1/k).

Note: Near-optimal for ‖·‖_∞-balls [Guzmán-Nemirovski, 2015].
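For the ball-constrained example from Slide 2 (ψ is the indicator of {‖x‖ ≤ D/2}), the linear minimization step of the Conditional Gradient Method has a closed form. Below is a minimal NumPy sketch under that assumption; grad_f, x0 and the other names are illustrative and not part of the slides.

import numpy as np

def frank_wolfe_ball(grad_f, x0, D, num_iters=200):
    """Conditional Gradient (Frank-Wolfe) for min f(x) over the ball ||x|| <= D/2."""
    x = x0.copy()
    for k in range(num_iters):
        g = grad_f(x)
        # Linear minimization oracle over the ball: argmin_{||v|| <= D/2} <g, v>.
        v = -(D / 2) * g / (np.linalg.norm(g) + 1e-12)
        gamma = 2.0 / (k + 2)                # step size rule from the slide
        x = gamma * v + (1 - gamma) * x
    return x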


SLIDE 4

Review: Second-Order Methods

Let ∇²f be Lipschitz continuous: ‖∇²f(x) − ∇²f(y)‖ ≤ L‖x − y‖.

Newton Method:

    x_{k+1} = argmin_y { ⟨∇f(x_k), y − x_k⟩ + (1/2)⟨∇²f(x_k)(y − x_k), y − x_k⟩ + ψ(y) }.

◮ Quadratic convergence (if ∇²f(x*) ≻ 0 and x_0 is close to x*).
◮ No global convergence. A heuristic: use line search in practice.

Newton Method with Cubic Regularization:

    x_{k+1} = argmin_y { ⟨∇f(x_k), y − x_k⟩ + (1/2)⟨∇²f(x_k)(y − x_k), y − x_k⟩ + (L/6)‖y − x_k‖³ + ψ(y) }.

◮ Global rate: F(x_k) − F* ≤ O(1/k²) [Nesterov-Polyak, 2006].
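The cubic step has no closed form in general. For the unregularized case ψ ≡ 0 with ∇²f(x_k) ⪰ 0, the minimizer h satisfies (∇²f(x_k) + (L/2)‖h‖ I) h = −∇f(x_k), so it can be found by a one-dimensional search over r = ‖h‖. A minimal NumPy sketch under those assumptions (not the authors' implementation):

import numpy as np

def cubic_newton_step(g, H, L):
    """Minimize <g,h> + 0.5*h'Hh + (L/6)*||h||^3 for convex H (psi = 0)."""
    def h_of(r):
        # Solve (H + (L/2) r I) h = -g for a trial radius r > 0.
        return np.linalg.solve(H + 0.5 * L * r * np.eye(len(g)), -g)

    lo, hi = 0.0, 1.0
    while np.linalg.norm(h_of(hi)) > hi:     # grow the bracket until ||h(r)|| <= r
        hi *= 2.0
    for _ in range(100):                     # bisect on the fixed point r = ||h(r)||
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(h_of(mid)) > mid:
            lo = mid
        else:
            hi = mid
    return h_of(hi)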


SLIDE 5

Overview of the Contributions

New second-order algorithms with global convergence proofs.
◮ The methods are universal (no unknown parameters).
◮ Affine-invariant (the norm is not fixed).

Stochastic methods (basic and with variance reduction).

Numerical experiments.


SLIDE 6

Second-Order Lower Model

1. f is convex: f(y) ≥ f(x) + ⟨∇f(x), y − x⟩.
2. ∇²f is Lipschitz continuous: ‖∇²f(x) − ∇²f(y)‖ ≤ L‖x − y‖.

Convexity + Smoothness ⇒ tighter lower bound: for all t ∈ [0, 1],

    f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (t/2)⟨∇²f(x)(y − x), y − x⟩ − (t²L/6)‖y − x‖³.

[Figure: the second-order lower model compared with the first-order (linear) one.]
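A quick numerical sanity check of this inequality on the one-dimensional function f(x) = log(1 + e^x), whose third derivative is bounded by about 0.096, so its Hessian is Lipschitz with L = 0.1; the test function and constant are illustrative choices, not from the slides.

import numpy as np

L = 0.1                                       # Hessian Lipschitz constant for softplus (illustrative)
f   = lambda x: np.logaddexp(0.0, x)          # log(1 + e^x), computed stably
df  = lambda x: 1.0 / (1.0 + np.exp(-x))      # sigmoid
d2f = lambda x: df(x) * (1.0 - df(x))

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.uniform(-5.0, 5.0, size=2)
    for t in np.linspace(0.0, 1.0, 11):
        lower = (f(x) + df(x) * (y - x)
                 + 0.5 * t * d2f(x) * (y - x) ** 2
                 - (t ** 2) * L * abs(y - x) ** 3 / 6.0)
        assert f(y) >= lower - 1e-12          # the second-order lower model never exceeds f(y)
print("Lower bound held for all sampled points and all t in [0, 1].")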


SLIDE 7

New Algorithm

Contracting-Domain Newton Method:

    v_{k+1} = argmin_y { ⟨∇f(x_k), y − x_k⟩ + (γ_k/2)⟨∇²f(x_k)(y − x_k), y − x_k⟩ + ψ(y) },
    x_{k+1} = γ_k v_{k+1} + (1 − γ_k) x_k.
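A minimal NumPy sketch of this method for the ball-constrained case ψ = indicator of {‖x‖ ≤ D/2}. The convex quadratic subproblem is solved approximately by projected gradient descent; that inner solver and all function names are illustrative assumptions, and γ_k = 3/(k+3) anticipates Theorem 1 below.

import numpy as np

def project_ball(v, radius):
    nrm = np.linalg.norm(v)
    return v if nrm <= radius else v * (radius / nrm)

def contracting_domain_newton(grad_f, hess_f, x0, D, num_iters=50, inner_iters=200):
    """Sketch of the Contracting-Domain Newton Method with psi = indicator of {||x|| <= D/2}."""
    x = x0.copy()
    for k in range(num_iters):
        g, H = grad_f(x), hess_f(x)
        gamma = 3.0 / (k + 3)
        step = 1.0 / (gamma * np.linalg.norm(H, 2) + 1e-12)
        v = x.copy()
        # Approximately minimize <g, v - x> + (gamma/2) (v - x)' H (v - x) over the ball.
        for _ in range(inner_iters):
            model_grad = g + gamma * (H @ (v - x))
            v = project_ball(v - step * model_grad, D / 2)
        x = gamma * v + (1 - gamma) * x
    return x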


SLIDE 8

Trust-Region Interpretation

Contracting-Domain Newton Method (reformulation):

    x_{k+1} = argmin_y { ⟨∇f(x_k), y − x_k⟩ + (1/2)⟨∇²f(x_k)(y − x_k), y − x_k⟩ + γ_k ψ(x_k + (1/γ_k)(y − x_k)) }.

This is a regularization of the quadratic model by an asymmetric trust region.
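Why the two formulations coincide: substituting y = x_k + γ_k(v − x_k) into the objective above scales it by γ_k > 0, so its minimizer is exactly x_{k+1} from the previous slide. In LaTeX notation:

% Change of variables y = x_k + \gamma_k (v - x_k):
\langle \nabla f(x_k), y - x_k \rangle
  + \tfrac{1}{2}\langle \nabla^2 f(x_k)(y - x_k), y - x_k \rangle
  + \gamma_k\, \psi\bigl(x_k + \tfrac{1}{\gamma_k}(y - x_k)\bigr)
= \gamma_k \Bigl[ \langle \nabla f(x_k), v - x_k \rangle
  + \tfrac{\gamma_k}{2}\langle \nabla^2 f(x_k)(v - x_k), v - x_k \rangle
  + \psi(v) \Bigr].
% Minimizing over y is therefore equivalent to minimizing the model of the
% previous slide over v, with y^* = x_k + \gamma_k (v_{k+1} - x_k) = x_{k+1}.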


SLIDE 9

Global Convergence

Let ∇²f be Lipschitz continuous: ‖∇²f(x) − ∇²f(y)‖ ≤ L‖x − y‖ (w.r.t. an arbitrary norm).

Theorem 1. Set γ_k = 3/(k+3). Then

    F(x_k) − F* ≤ O(LD³/k²).

Theorem 2. Let ψ be strongly convex with parameter µ > 0.

◮ Set γ_k = 5/(k+5). Then

    F(x_k) − F* ≤ O((LD/µ) · LD³/k⁴).

◮ Set γ_k = 1/(1+ω), where ω := [LD/(2µ)]^(1/2). Then

    F(x_k) − F* ≤ exp(−(k−1)/(1+ω)) · LD³/2.


SLIDE 10

Experiments: Logistic Regression

    min_{‖x‖_2 ≤ D/2} ∑_{i=1}^{M} f_i(x),   f_i(x) = log(1 + exp(⟨a_i, x⟩)).

D plays the role of the regularization parameter.

[Figure: function residual (log scale) vs. iterations on w8a, for D = 20 and D = 100; methods compared: Frank-Wolfe, Grad. Method, Fast Grad. Method, Contr. Newton, Aggr. Newton; wall-clock times are annotated on the curves.]

For larger D, the problem becomes more ill-conditioned.
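For reference, a short NumPy sketch of the objective, gradient, and Hessian that a second-order method needs for this logistic problem; the data matrix A stacking the rows a_i is an assumed input, and this is not the authors' experimental code.

import numpy as np

def logistic_parts(A, x):
    """Objective, gradient, and Hessian of f(x) = sum_i log(1 + exp(<a_i, x>))."""
    z = A @ x                                     # z_i = <a_i, x>
    s = 1.0 / (1.0 + np.exp(-z))                  # sigmoid(z_i)
    f = np.sum(np.logaddexp(0.0, z))              # stable log(1 + e^{z_i})
    grad = A.T @ s                                # sum_i sigmoid(z_i) a_i
    hess = A.T @ (A * (s * (1.0 - s))[:, None])   # A' diag(s(1-s)) A
    return f, grad, hess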


SLIDE 11

Stochastic Methods for Logistic Regression

Approximate ∇f(x) and ∇²f(x) by stochastic estimates.

[Figure: function residual (log scale) vs. epochs on YearPredictionMSD, D = 20; methods compared: SGD, SVRG, SNewton, SVRNewton; wall-clock times are annotated on the curves.]

A problem with a large dataset (M = 463715) and a small dimension (n = 90).
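The slides do not specify the estimator; one standard choice, shown here purely as an illustrative assumption, is to subsample the finite sum for both the gradient and the Hessian.

import numpy as np

def stochastic_logistic_parts(A, x, batch_size, rng):
    """Minibatch estimates of the gradient and Hessian of sum_i log(1 + exp(<a_i, x>))."""
    M = A.shape[0]
    idx = rng.choice(M, size=batch_size, replace=False)    # sample a minibatch of rows
    Ab = A[idx]
    s = 1.0 / (1.0 + np.exp(-(Ab @ x)))
    scale = M / batch_size                                 # rescale so the estimates are unbiased
    grad = scale * (Ab.T @ s)
    hess = scale * (Ab.T @ (Ab * (s * (1.0 - s))[:, None]))
    return grad, hess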


SLIDE 12

Conclusions

Second-order information helps in the case of:
◮ ill-conditioning;
◮ small or moderate dimension (the subproblems are more expensive).

There is no need to tune a stepsize. The methods can be preferable for solving problems over sets with a non-Euclidean geometry.


SLIDE 13

Follow-Up Results

Nikita Doikov and Yurii Nesterov. "Affine-invariant contracting-point methods for Convex Optimization". arXiv:2009.08894 (2020).

◮ General framework of Contracting-Point Methods.
◮ Contracting-Point Tensor Methods of order p ≥ 1: F(x_k) − F* ≤ O(1/k^p).
◮ Affine-invariant smoothness condition ⇒ affine-invariant analysis.

Thank you for your attention!
