SLIDE 1

Inexact Tensor Methods with Dynamic Accuracies

Nikita Doikov Yurii Nesterov

UCLouvain, Belgium

ICML 2020

SLIDE 2

Plan of the talk

  • 1. Introduction: Tensor Methods in Convex Optimization
  • 2. Inexact Tensor Methods
  • 3. Acceleration
  • 4. Numerical Example

SLIDE 3

Plan of the talk

  • 1. Introduction: Tensor Methods in Convex Optimization
  • 2. Inexact Tensor Methods
  • 3. Acceleration
  • 4. Numerical Example

SLIDE 4

Gradient Method

Composite optimization problem:

    min_{x ∈ dom F} F(x) := f(x) + ψ(x),

◮ f is convex and smooth;
◮ ψ : R^n → R ∪ {+∞} is convex (possibly nonsmooth, but simple).

The Gradient Method:

    x_{k+1} = argmin_y { ⟨∇f(x_k), y − x_k⟩ + (H/2)‖y − x_k‖² + ψ(y) },  k ≥ 0.

◮ The gradient of f is Lipschitz continuous: ‖∇f(y) − ∇f(x)‖ ≤ L_1‖y − x‖ ⇒ H := L_1.
◮ Global sublinear convergence: F(x_k) − F* ≤ O(1/k).
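As a concrete illustration, here is a minimal sketch of one such step in Python, assuming the common special case ψ(y) = λ‖y‖₁, for which the argmin has a closed form (soft-thresholding). The function names are ours, not from the talk.

```python
import numpy as np

def gradient_step(x, grad_f, H, lam):
    """One composite gradient step, assuming psi(y) = lam * ||y||_1.

    The argmin of <grad_f(x), y - x> + (H/2)||y - x||^2 + lam*||y||_1
    is soft-thresholding applied to the plain gradient step x - grad_f(x)/H.
    """
    z = x - grad_f(x) / H                                     # forward step
    return np.sign(z) * np.maximum(np.abs(z) - lam / H, 0.0)  # prox of l1
```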

SLIDE 5

Newton Method with Cubic Regularization

◮ The Hessian of f is Lipschitz continuous: ‖∇²f(y) − ∇²f(x)‖ ≤ L_2‖y − x‖.

Cubic Newton:

    x_{k+1} = argmin_y { ⟨∇f(x_k), y − x_k⟩ + (1/2)⟨∇²f(x_k)(y − x_k), y − x_k⟩ + (H/6)‖y − x_k‖³ + ψ(y) },  k ≥ 0.

◮ H := 0 ⇒ classical Newton.
◮ H := L_2 ⇒ global convergence: F(x_k) − F* ≤ O(1/k²). [Nesterov-Polyak, 2006]
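For intuition, a minimal sketch of solving this subproblem when ψ ≡ 0 and f is convex: the minimizer h = y − x_k satisfies (∇²f(x_k) + (H r/2) I) h = −∇f(x_k) with r = ‖h‖, so one can bisect on the scalar r. This is an illustration under stated assumptions (dense solve per trial r, our own names), not the efficient single-factorization implementation.

```python
import numpy as np

def cubic_newton_step(g, A, H, r_max=1e8, tol=1e-10):
    """Minimize <g, h> + 0.5*<A h, h> + (H/6)*||h||^3 over h (psi == 0),
    assuming A (the Hessian) is positive semidefinite, i.e. f is convex.

    Optimality: (A + (H*r/2) I) h = -g with r = ||h||; since ||h(r)||
    is nonincreasing in r, we bisect on the scalar r.
    """
    n = g.shape[0]
    def h(r):
        return np.linalg.solve(A + 0.5 * H * r * np.eye(n), -g)
    lo, hi = 0.0, r_max
    while hi - lo > tol * max(1.0, hi):
        r = 0.5 * (lo + hi)
        if np.linalg.norm(h(r)) > r:
            lo = r   # ||h(r)|| still exceeds r: the fixed point lies above
        else:
            hi = r
    return h(hi)
```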

SLIDE 6

Tensor Methods

Let x ∈ R^n be fixed; consider an arbitrary h ∈ R^n and the one-dimensional function φ(t) := f(x + th), t ∈ R. Then

    φ(0) = f(x),  φ′(0) = ⟨∇f(x), h⟩,  φ″(0) = ⟨∇²f(x)h, h⟩.

Denote D^p f(x)[h]^p := φ^(p)(0). The model:

    Ω_H(x; y) := Σ_{i=1}^{p} (1/i!) D^i f(x)[y − x]^i + (H/(p+1)!)‖y − x‖^{p+1} + ψ(y).

Tensor Method of order p ≥ 1:

    x_{k+1} = argmin_y Ω_H(x_k; y),  k ≥ 0.

◮ The p-th derivative is Lipschitz continuous: ‖D^p f(y) − D^p f(x)‖ ≤ L_p‖y − x‖.
◮ Global convergence: F(x_k) − F* ≤ O(1/k^p). [Baes, 2009]
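To make the definition D^p f(x)[h]^p = φ^(p)(0) concrete, here is a small sketch that approximates it by a p-th order central finite difference along the direction h; this numerical check is our illustration, not part of the method.

```python
from math import comb

def dpf(f, x, h, p, eps=1e-4):
    """Approximate D^p f(x)[h]^p = phi^(p)(0) for phi(t) = f(x + t*h)
    via the central finite-difference formula of order p:
    phi^(p)(0) ~ eps^(-p) * sum_i (-1)^i C(p, i) phi((p/2 - i)*eps)."""
    return sum((-1) ** i * comb(p, i) * f(x + (p / 2 - i) * eps * h)
               for i in range(p + 1)) / eps ** p
```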

SLIDE 7

Tensor Methods: Solving the Subproblem

At each iteration k ≥ 0, the subproblem is

    min_y Ω_H(x_k; y) := Σ_{i=1}^{p} (1/i!) D^i f(x_k)[y − x_k]^i + (H/(p+1)!)‖y − x_k‖^{p+1} + ψ(y).

◮ H ≥ pL_p ⇒ Ω_H(x_k; y) is convex in y. [Nesterov, 2018]
◮ For p = 3: efficient implementation, using the Gradient Method with a relative smoothness condition [Van Nguyen, 2017; Bauschke-Bolte-Teboulle, 2016; Lu-Freund-Nesterov, 2018].

The cost of minimizing Ω_H(x_k; ·) is O(n³) + Õ(n).

SLIDE 8

Some Recent Results

◮ Accelerated Tensor Methods: F(x_k) − F* ≤ O(1/k^{p+1}) [Baes, 2009; Nesterov, 2018].
◮ Optimal Tensor Methods: F(x_k) − F* ≤ O(1/k^{(3p+1)/2}) [Gasnikov et al., 2019; Kamzolov-Gasnikov-Dvurechensky, 2020]. The oracle complexity matches the lower bound (up to a logarithmic factor) from [Arjevani-Shamir-Shiff, 2017].
◮ Universal Tensor Methods: [Grapiglia-Nesterov, 2019].
◮ Stochastic Tensor Methods: [Lucchi-Kohler, 2019].
◮ . . .

SLIDE 9

Plan of the talk

  • 1. Introduction: Tensor Methods in Convex Optimization
  • 2. Inexact Tensor Methods
  • 3. Acceleration
  • 4. Numerical Example

SLIDE 10

Definition of Inexactness

Use a point T = T_{H,δ}(x_k) with a small residual in function value:

    Ω_H(x_k; T) − min_y Ω_H(x_k; y) ≤ δ.

◮ Easier to achieve by an inner method.
◮ Can be controlled in practice using the duality gap.

Set H := pL_p. Then F(T) ≤ F(x_k) + δ.
◮ The inexact step can be nonmonotone.
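Schematically, the inner solver is run until a computable upper bound on this residual (e.g., a duality gap) drops below δ. A minimal sketch, where `inner_iteration` and `gap_bound` are hypothetical callables standing in for the inner method and its gap certificate:

```python
def inexact_tensor_step(x, inner_iteration, gap_bound, delta, max_iters=10_000):
    """Run an inner method on y -> Omega_H(x; y) until the certificate
    gap_bound(y), an upper bound on Omega_H(x; y) - min_y Omega_H(x; y),
    drops below delta (sketch; both callables are assumptions)."""
    y = x.copy()
    for _ in range(max_iters):
        if gap_bound(y) <= delta:
            break               # the delta-residual condition is certified
        y = inner_iteration(y)  # one step of the inner method
    return y
```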

SLIDE 11

Monotone Inexact Tensor Methods

Initialization: choose x_0 ∈ dom F, set H := pL_p.
Iterations, k ≥ 0:
1: Pick δ_{k+1} ≥ 0.
2: Compute an inexact monotone tensor step T such that
   Ω_H(x_k; T) − min_y Ω_H(x_k; y) ≤ δ_{k+1}  and  F(T) < F(x_k).
3: x_{k+1} := T.

Theorem 1. Set δ_k := c/k^{p+1}, for c ≥ 0. Then F(x_k) − F* ≤ O(1/k^p).
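Putting the pieces together, a minimal sketch of this outer loop with the dynamic accuracies of Theorem 1; `F` and the hypothetical oracle `step_oracle(x, delta)` (e.g., a partial application of the inner-loop sketch above) are assumed to be supplied.

```python
def monotone_inexact_method(x0, F, step_oracle, c, p, num_iters):
    """Outer loop with dynamic accuracies delta_k = c / k^(p+1) and a
    monotonicity check, as in Theorem 1. step_oracle(x, delta) must
    return T with Omega_H(x; T) - min_y Omega_H(x; y) <= delta."""
    x = x0
    for k in range(1, num_iters + 1):
        delta = c / k ** (p + 1)   # dynamic inner accuracy
        T = step_oracle(x, delta)
        if F(T) < F(x):            # accept only monotone steps
            x = T
    return x
```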

SLIDE 12

Adaptive Strategy for Inner Accuracy

Let us set δ_k := c(F(x_{k−2}) − F(x_{k−1})).

Theorem 2 (general convex case). F(x_k) − F* ≤ O(1/k^p).

Theorem 3 (uniformly convex objective). Let

    F(y) ≥ F(x) + ⟨F′(x), y − x⟩ + (σ_{p+1}/(p+1))‖y − x‖^{p+1}.

Denote ω_p := max{ (p+1)² L_p / (p! σ_{p+1}), 1 }. Then we have the linear rate

    F(x_{k+1}) − F* ≤ ( 1 − p ω_p^{−1/p} / (2(p+1)) ) (F(x_k) − F*).

◮ This works for methods of any order p ≥ 1.

Theorem 4. For p ≥ 2 and a strongly convex objective, we have a local superlinear rate.
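A minimal sketch of this adaptive rule, computing δ_k from the recorded objective values; the initialization for k < 2 is our assumption, not specified on the slide.

```python
def adaptive_deltas(F_history, c=1.0, delta0=1.0):
    """Adaptive accuracies delta_k = c * (F(x_{k-2}) - F(x_{k-1})),
    read off the recorded values F_history[j] = F(x_j).
    The gaps are nonnegative whenever the method is monotone."""
    deltas = [delta0, delta0]  # hypothetical choice for k = 0, 1
    for k in range(2, len(F_history) + 1):
        deltas.append(c * (F_history[k - 2] - F_history[k - 1]))
    return deltas
```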

SLIDE 13

Plan of the talk

  • 1. Introduction: Tensor Methods in Convex Optimization
  • 2. Inexact Tensor Methods
  • 3. Acceleration
  • 4. Numerical Example

SLIDE 14

Contracting Proximal Scheme

◮ Fix a prox-function d(x). Bregman divergence: β_d(x; y) := d(y) − d(x) − ⟨∇d(x), y − x⟩.
◮ Two sequences of points {x_k}_{k≥0}, {v_k}_{k≥0}, with v_0 = x_0.
◮ A sequence of positive coefficients {a_k}, with A_k := Σ_{i=1}^{k} a_i.

Iterations, k ≥ 0:
1. Compute
   v_{k+1} = argmin_y { A_{k+1} f( (a_{k+1} y + A_k x_k)/A_{k+1} ) + a_{k+1} ψ(y) + β_d(v_k; y) }.
2. Put x_{k+1} = (a_{k+1} v_{k+1} + A_k x_k)/A_{k+1}.

The rate of convergence: F(x_k) − F* ≤ β_d(x_0; x*)/A_k. [Doikov-Nesterov, 2019]
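A minimal sketch of this scheme; `prox_subproblem` is a hypothetical oracle for step 1 (in the talk it is itself solved inexactly by tensor steps), and `coeffs` holds a_1, a_2, ….

```python
def contracting_proximal(x0, prox_subproblem, coeffs):
    """Contracting proximal scheme (sketch). prox_subproblem(A_next, a, x, v)
    must return argmin_y { A_next * f((a*y + A*x)/A_next) + a * psi(y)
    + beta_d(v; y) } for the current A = A_next - a."""
    x, v = x0.copy(), x0.copy()
    A = 0.0
    for a in coeffs:
        A_next = A + a
        v = prox_subproblem(A_next, a, x, v)  # (inexact) subproblem solve
        x = (a * v + A * x) / A_next          # contracted update of x
        A = A_next
    return x
```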

SLIDE 15

Acceleration of Tensor Steps

For the Tensor Method of order p ≥ 1:

◮ Set d(x) := (1/(p+1))‖x − x_0‖^{p+1}.
◮ A_{k+1} := (k+1)^{p+1}/L_p.

For the contracted objective with regularization

    h_{k+1}(y) := A_{k+1} f( (a_{k+1} y + A_k x_k)/A_{k+1} ) + a_{k+1} ψ(y) + β_d(v_k; y),

we compute an inexact minimizer v_{k+1}:

    h_{k+1}(v_{k+1}) − h*_{k+1} ≤ c/(k+1)^{p+2}.

◮ This requires Õ(1) inexact Tensor Steps.

Theorem. For the outer iterations, we obtain the accelerated rate:

    F(x_k) − F* ≤ O(1/k^{p+1}).
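For concreteness, a small sketch of this coefficient schedule, recovering a_k as differences of the prescribed A_k (our helper, not from the talk):

```python
def tensor_coefficients(p, Lp, num_iters):
    """Coefficient schedule from the slide: A_k = k^(p+1) / L_p,
    hence a_k = A_k - A_{k-1} for k = 1, ..., num_iters."""
    A = [k ** (p + 1) / Lp for k in range(num_iters + 1)]
    return [A[k] - A[k - 1] for k in range(1, num_iters + 1)]
```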

SLIDE 16

Plan of the talk

  • 1. Introduction: Tensor Methods in Convex Optimization
  • 2. Inexact Tensor Methods
  • 3. Acceleration
  • 4. Numerical Example

SLIDE 17

Log-sum-exp

    min_{x ∈ R^n} f(x) := μ log( Σ_{i=1}^{m} exp( (⟨a_i, x⟩ − b_i)/μ ) )   (SoftMax).

◮ a_1, …, a_m, b — given data.
◮ μ > 0 — smoothing parameter.
◮ Denote B ≡ Σ_{i=1}^{m} a_i a_iᵀ ⪰ 0, and use ‖x‖ ≡ ⟨Bx, x⟩^{1/2}.

We have L_1 ≤ 1/μ,  L_2 ≤ 2/μ²,  L_3 ≤ 4/μ³.

◮ Cubic Newton (p = 2).
◮ Each step is computed (inexactly) by the Fast Gradient Method.
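A minimal sketch of evaluating this objective stably in Python, subtracting the maximal exponent before exponentiating to avoid overflow for small μ; the function name is ours.

```python
import numpy as np

def softmax_objective(A, b, mu, x):
    """f(x) = mu * log(sum_i exp((<a_i, x> - b_i) / mu)), with rows of A
    being the a_i. Evaluated via the stable log-sum-exp trick."""
    z = (A @ x - b) / mu
    z_max = z.max()
    return mu * (z_max + np.log(np.exp(z - z_max).sum()))
```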

SLIDE 18

Log-sum-exp: Constant strategies

◮ δ_k := const.

[Figure: Log-sum-exp, μ = 0.05, constant strategies δ ∈ {10⁻², 10⁻⁴, 10⁻⁶, 10⁻⁸}: functional residual vs. iterations (left) and vs. Hessian-vector products (right).]

SLIDE 19

Log-sum-exp: Dynamic strategies

◮ δ_k := 1/k^α.

[Figure: Log-sum-exp, μ = 0.05, dynamic strategies δ_k ∈ {1/k, 1/k², 1/k³, 1/k⁴}: functional residual vs. iterations (left) and vs. Hessian-vector products (right).]

SLIDE 20

Log-sum-exp: Adaptive strategies

◮ δ_k := (F(x_{k−1}) − F(x_k))^α.

[Figure: Log-sum-exp, μ = 0.05, adaptive strategies with α ∈ {1, 1.5, 2}: functional residual vs. iterations (left) and vs. Hessian-vector products (right).]

SLIDE 21

Log-sum-exp: Cubic Newton vs. Tensor Method

[Figure: Log-sum-exp, μ = 0.1: functional residual vs. time (s), comparing Cubic Newton (p = 2), Tensor (p = 3) with exact steps, and Tensor (p = 3) with the adaptive strategy.]

◮ H is fixed.

SLIDE 22

Conclusion

Inexact Tensor Methods of degree p ≥ 1:
◮ p = 1: Gradient Method.
◮ p = 2: Newton Method with cubic regularization.
◮ p = 3: third-order Tensor Method.

We allow solving the subproblem inexactly, with δ_k the accuracy in functional residual for the subproblem.
◮ Dynamic strategy: δ_k := c/k^{p+1}.
◮ Adaptive strategy: δ_k := c(F(x_{k−2}) − F(x_{k−1})).
Global rate of convergence: F(x_k) − F* ≤ O(1/k^p).
◮ Using contracting proximal iterations we obtain the accelerated O(1/k^{p+1}) rate.

Thank you for your attention!
