SLIDE 1

Randomized Block Cubic Newton Method

Nikita Doikov¹  Peter Richtárik²,³,⁴

¹Higher School of Economics, Russia  ²King Abdullah University of Science and Technology, Saudi Arabia  ³The University of Edinburgh, United Kingdom  ⁴Moscow Institute of Physics and Technology, Russia

International Conference on Machine Learning, Stockholm, July 12, 2018

SLIDE 2

Plan of the Talk

  • 1. Review: Gradient Descent and Cubic Newton methods
  • 2. RBCN: Randomized Block Cubic Newton
  • 3. Application: Empirical Risk Minimization

2 / 20

SLIDE 3

Plan of the Talk

  • 1. Review: Gradient Descent and Cubic Newton methods
  • 2. RBCN: Randomized Block Cubic Newton
  • 3. Application: Empirical Risk Minimization

3 / 20

SLIDE 4

Review: Classical Gradient Descent

Optimization problem: min_{x ∈ ℝ^N} F(x).

◮ Main assumption: gradient of F is Lipschitz-continuous:
  ‖∇F(x) − ∇F(y)‖ ≤ L‖x − y‖, ∀x, y ∈ ℝ^N.
◮ From which we get the global upper bound for the function:
  F(y) ≤ F(x) + ⟨∇F(x), y − x⟩ + (L/2)‖y − x‖², ∀x, y ∈ ℝ^N.
◮ The Gradient Descent:
  x⁺ = argmin_{y ∈ ℝ^N} [ F(x) + ⟨∇F(x), y − x⟩ + (L/2)‖y − x‖² ] = x − (1/L)∇F(x).

4 / 20
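A minimal NumPy sketch of the step above, on a hypothetical least-squares objective (the function names and data here are illustrative, not from the talk):

```python
import numpy as np

def gradient_descent(grad_F, x0, L, num_iters=100):
    """Gradient Descent: x+ = x - (1/L) * grad_F(x), repeated num_iters times."""
    x = x0.copy()
    for _ in range(num_iters):
        x = x - grad_F(x) / L
    return x

# Toy example: F(x) = 0.5 * ||A x - b||^2 has gradient A^T (A x - b)
# and Lipschitz constant L = ||A^T A||_2 (largest eigenvalue of A^T A).
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
L = np.linalg.norm(A.T @ A, 2)
x_min = gradient_descent(lambda x: A.T @ (A @ x - b), np.zeros(5), L)
```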

SLIDE 5

Review: Cubic Newton

Optimization problem: min_{x ∈ ℝ^N} F(x).

◮ New assumption: Hessian of F is Lipschitz-continuous:
  ‖∇²F(x) − ∇²F(y)‖ ≤ H‖x − y‖, ∀x, y ∈ ℝ^N.
◮ Corresponding global upper bound for the function:
  Q(x; y) ≡ F(x) + ⟨∇F(x), y − x⟩ + (1/2)⟨∇²F(x)(y − x), y − x⟩, then
  F(y) ≤ Q(x; y) + (H/6)‖y − x‖³, ∀x, y ∈ ℝ^N.
◮ Newton method with cubic regularization¹:
  x⁺ = argmin_{y ∈ ℝ^N} [ Q(x; y) + (H/6)‖y − x‖³ ] = x − (∇²F(x) + (H‖x⁺ − x‖/2) I)⁻¹ ∇F(x).

¹ Yurii Nesterov and Boris T. Polyak. "Cubic regularization of Newton's method and its global performance". Mathematical Programming 108.1 (2006), pp. 177–205.

5 / 20
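The cubic step is implicit in ‖x⁺ − x‖. One standard way to compute it for convex F (so ∇²F(x) ⪰ 0) is to solve a one-dimensional equation in r = ‖x⁺ − x‖; the sketch below does this with bisection and is an illustration only, not the authors' implementation:

```python
import numpy as np

def cubic_newton_step(grad, hess, H, tol=1e-10):
    """One cubically regularized Newton step, assuming hess is positive semidefinite and H > 0.

    Solves r = || (hess + (H*r/2) I)^{-1} grad || for r, then returns
    the displacement x+ - x = -(hess + (H*r/2) I)^{-1} grad.
    """
    I = np.eye(grad.shape[0])

    def step_norm(r):
        return np.linalg.norm(np.linalg.solve(hess + 0.5 * H * r * I, grad))

    lo, hi = 0.0, 1.0
    while step_norm(hi) > hi:           # bracket the unique fixed point
        hi *= 2.0
    while hi - lo > tol:                # bisection on step_norm(r) - r, which decreases in r
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if step_norm(mid) > mid else (lo, mid)
    r = 0.5 * (lo + hi)
    return -np.linalg.solve(hess + 0.5 * H * r * I, grad)
```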

SLIDE 6

Gradient Descent vs. Cubic Newton

Optimization problem: F* = min_{x ∈ ℝ^N} F(x).

◮ Goal: F(x_K) − F* ≤ ε. What is K?
◮ Let F be convex: F(y) ≥ F(x) + ⟨∇F(x), y − x⟩.
◮ Iteration complexity estimates: K = O(1/ε) for GD, and K = O(1/√ε) for CN (much better).
◮ But, cost of one iteration: O(N) for GD and O(N³) for CN. N is huge for modern applications. Even O(N) is too much!

6 / 20

SLIDE 7

Our Motivation

Recent advances in block coordinate methods:

  • 1. Paul Tseng and Sangwoon Yun. "A coordinate gradient descent method for nonsmooth separable minimization". Mathematical Programming 117.1-2 (2009), pp. 387–423.
  • 2. Peter Richtárik and Martin Takáč. "Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function". Mathematical Programming 144.1-2 (2014), pp. 1–38.
  • 3. Zheng Qu et al. "SDNA: stochastic dual Newton ascent for empirical risk minimization". International Conference on Machine Learning. 2016, pp. 1823–1832.

Computationally effective steps, with convergence guarantees as for the full methods.
Aim: to create a second-order method with global complexity guarantees and a low cost of every iteration.

7 / 20

SLIDE 8

Plan of the Talk

  • 1. Review: Gradient Descent and Cubic Newton methods
  • 2. RBCN: Randomized Block Cubic Newton
  • 3. Application: Empirical Risk Minimization

8 / 20

SLIDE 9

Problem Structure

◮ Consider the following decomposition of F : ℝ^N → ℝ:
  F(x) ≡ φ(x) [twice differentiable] + g(x) [differentiable].
◮ For a given space decomposition ℝ^N ≡ ℝ^{N_1} × · · · × ℝ^{N_n}, x ≡ (x^{(1)}, . . . , x^{(n)}), x^{(i)} ∈ ℝ^{N_i},
  assume a block-separable structure of φ:  φ(x) ≡ ∑_{i=1}^{n} φ_i(x^{(i)}).
◮ Block-separability of g : ℝ^N → ℝ is not assumed.

9 / 20
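To make the block notation concrete, a small NumPy sketch of a block-separable φ with hypothetical block sizes and per-block functions (none of this is from the talk):

```python
import numpy as np

# Hypothetical decomposition R^N = R^{N_1} x ... x R^{N_n}.
block_sizes = [3, 2, 4]                    # N_1, N_2, N_3
offsets = np.cumsum([0] + block_sizes)     # start index of each block

def get_block(x, i):
    """Return x^{(i)}, the i-th block of x."""
    return x[offsets[i]:offsets[i + 1]]

def phi(x, phi_blocks):
    """Block-separable part: phi(x) = sum_i phi_i(x^{(i)})."""
    return sum(phi_i(get_block(x, i)) for i, phi_i in enumerate(phi_blocks))

# Example: each phi_i is a smooth convex function of its own block only.
phi_blocks = [lambda v: np.sum(v ** 4),
              lambda v: np.sum(np.cosh(v)),
              lambda v: v @ v]
value = phi(np.arange(9, dtype=float), phi_blocks)
```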

SLIDE 10

Main Assumptions

Optimization problem: min_{x ∈ Q} F(x), where F(x) ≡ ∑_{i=1}^{n} φ_i(x^{(i)}) + g(x).

◮ Every φ_i, i ∈ {1, . . . , n}, is twice differentiable and convex, with Lipschitz-continuous Hessian:
  ‖∇²φ_i(x) − ∇²φ_i(y)‖ ≤ H_i‖x − y‖, ∀x, y ∈ ℝ^{N_i}.
◮ g is differentiable, and for some fixed positive semidefinite matrices A ⪰ G ⪰ 0 we have the bounds, for all x, y ∈ ℝ^N:
  ◮ g(y) ≤ g(x) + ⟨∇g(x), y − x⟩ + (1/2)⟨A(y − x), y − x⟩,
  ◮ g(y) ≥ g(x) + ⟨∇g(x), y − x⟩ + (1/2)⟨G(y − x), y − x⟩.
◮ Q ⊂ ℝ^N is a simple convex set.

10 / 20
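As a sanity check of the two quadratic bounds on g, here is a small numerical verification for a hypothetical choice g(w) = Σⱼ log(1 + exp(wⱼ)) (softplus), whose Hessian satisfies 0 ⪯ ∇²g(w) ⪯ (1/4)I, so the bounds hold with A = (1/4)I and G = 0; this example is ours, not from the talk:

```python
import numpy as np

# Softplus: g(w) = sum_j log(1 + exp(w_j)); its gradient is the elementwise sigmoid,
# and its Hessian is diag(sigma(w_j) * (1 - sigma(w_j))), bounded between 0 and (1/4) I.
g      = lambda w: np.sum(np.logaddexp(0.0, w))
grad_g = lambda w: 1.0 / (1.0 + np.exp(-w))

A_coef, G_coef = 0.25, 0.0       # A = (1/4) I, G = 0
rng = np.random.default_rng(1)
for _ in range(1000):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    lin = g(x) + grad_g(x) @ (y - x)
    sq = (y - x) @ (y - x)
    assert g(y) <= lin + 0.5 * A_coef * sq + 1e-12   # upper quadratic bound
    assert g(y) >= lin + 0.5 * G_coef * sq - 1e-12   # lower quadratic bound (convexity)
```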

SLIDE 11

Model of the Objective

Objective: F(x) ≡ ∑_{i=1}^{n} φ_i(x^{(i)}) + g(x). We want to build a model of F.

◮ Fix a subset of blocks S ⊂ {1, . . . , n}.
◮ For y ∈ ℝ^N denote by y_[S] ∈ ℝ^N the vector with the blocks i ∉ S zeroed out.
◮ M_{H,S}(x; y) ≡ F(x) + ⟨∇φ(x), y_[S]⟩ + (1/2)⟨∇²φ(x) y_[S], y_[S]⟩ + (H/6)‖y_[S]‖³ + ⟨∇g(x), y_[S]⟩ + (1/2)⟨A y_[S], y_[S]⟩.
◮ From smoothness: F(x + y_[S]) ≤ M_{H,S}(x; y), ∀x, y ∈ ℝ^N, for H ≥ ∑_{i∈S} H_i.

11 / 20
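A sketch of how the model M_{H,S}(x; y) could be evaluated for a chosen set of blocks S, with y_[S] obtained by zeroing the blocks outside S; the block layout mirrors the earlier sketch and everything here is illustrative:

```python
import numpy as np

def restrict(y, S, offsets):
    """Return y_[S]: a copy of y with every block i not in S set to zero."""
    y_S = np.zeros_like(y)
    for i in S:
        y_S[offsets[i]:offsets[i + 1]] = y[offsets[i]:offsets[i + 1]]
    return y_S

def model_M(F_x, grad_phi_x, hess_phi_x, grad_g_x, A, H, y, S, offsets):
    """M_{H,S}(x; y) = F(x) + <grad phi(x), y_S> + 0.5 <hess phi(x) y_S, y_S>
                       + (H/6) ||y_S||^3 + <grad g(x), y_S> + 0.5 <A y_S, y_S>."""
    y_S = restrict(y, S, offsets)
    return (F_x
            + grad_phi_x @ y_S
            + 0.5 * y_S @ (hess_phi_x @ y_S)
            + (H / 6.0) * np.linalg.norm(y_S) ** 3
            + grad_g_x @ y_S
            + 0.5 * y_S @ (A @ y_S))
```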

SLIDE 12

RBCN: Randomized Block Cubic Newton Method

◮ Method step: T_{H,S}(x) ≡ argmin_{y ∈ ℝ^N_[S] : x + y ∈ Q} M_{H,S}(x; y).
◮ Algorithm:
  Initialization: choose x_0 ∈ ℝ^N and a uniform random block sampling Ŝ.
  Iterations: k ≥ 0.
  1: Sample S_k ∼ Ŝ.
  2: Find H_k > 0 such that F(x_k + T_{H_k,S_k}(x_k)) ≤ M_{H_k,S_k}(x_k; T_{H_k,S_k}(x_k)).
  3: Make the step: x_{k+1} := x_k + T_{H_k,S_k}(x_k).

12 / 20
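Putting the pieces together, the loop could look like the sketch below; the search for H_k is done here by doubling until the model upper-bounds the true function value (one simple way to satisfy the condition in step 2), and `solve_cubic_subproblem` is a placeholder for a solver of the step on this slide, not an API from the paper:

```python
def rbcn(F, x0, sample_blocks, solve_cubic_subproblem, model_M, H0=1.0, num_iters=100):
    """Sketch of the Randomized Block Cubic Newton loop.

    F(x)                            -- objective value
    sample_blocks()                 -- draws a random subset S_k of blocks
    solve_cubic_subproblem(x, H, S) -- returns the step T_{H,S}(x) (placeholder)
    model_M(x, y, H, S)             -- evaluates M_{H,S}(x; y) at the increment y
    """
    x, H = x0.copy(), H0
    for _ in range(num_iters):
        S = sample_blocks()
        H = max(H / 2.0, 1e-8)                     # try a smaller H first
        while True:
            T = solve_cubic_subproblem(x, H, S)
            if F(x + T) <= model_M(x, T, H, S):    # step 2: accept H_k once the model bounds F
                break
            H *= 2.0                               # otherwise increase H and retry
        x = x + T                                  # step 3
    return x
```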

SLIDE 13

Convergence Results

We want to get: P( F(x_K) − F* ≤ ε ) ≥ 1 − ρ, where ε > 0 is the required accuracy level and ρ ∈ (0, 1) is the confidence level.

Theorem 1 (general conditions):  K = O( (1/ε) · (n/τ) · (1 + log(1/ρ)) ), where τ ≡ E[|Ŝ|].

Theorem 2 (σ ∈ [0, 1] is a condition number, σ ≥ λ_min(G)/λ_max(A) > 0):  K = O( (1/√ε) · (n/τ) · (1/σ) · (1 + log(1/ρ)) ).

Theorem 3 (strongly convex case: µ ≡ λ_min(G) > 0):  K = O( log(1/(ερ)) · (n/τ) · (1/σ) · √(max{HD/µ, 1}) ), where D ≥ ‖x_0 − x*‖.

13 / 20

SLIDE 14

Plan of the Talk

  • 1. Review: Gradient Descent and Cubic Newton methods
  • 2. RBCN: Randomized Block Cubic Newton
  • 3. Application: Empirical Risk Minimization

14 / 20

SLIDE 15

Empirical Risk Minimization

ERM problem: min_{w ∈ ℝ^d} [ P(w) ≡ ∑_{i=1}^{n} φ_i(b_i^T w)  (loss)  + g(w)  (regularizer) ]

◮ SVM: φ_i(a) = max{0, 1 − y_i a},
◮ Logistic regression: φ_i(a) = log(1 + exp(−y_i a)),
◮ Regression: φ_i(a) = (a − y_i)² or φ_i(a) = |a − y_i|,
◮ Support vector regression: φ_i(a) = max{0, |a − y_i| − ν},
◮ Generalized linear models.

15 / 20
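Written as plain Python, the listed losses are just scalar functions of a = b_i^T w; the helper names below are ours, for illustration:

```python
import numpy as np

# Per-example losses phi_i(a), where a = b_i^T w.
def hinge(a, y):               return max(0.0, 1.0 - y * a)           # SVM
def logistic(a, y):            return np.log1p(np.exp(-y * a))        # logistic regression
def squared(a, y):             return (a - y) ** 2                    # regression
def absolute(a, y):            return abs(a - y)                      # regression (robust)
def eps_insensitive(a, y, nu): return max(0.0, abs(a - y) - nu)       # support vector regression

def erm_objective(w, B, y, loss, g):
    """P(w) = sum_i phi_i(b_i^T w) + g(w), with one loss applied to every example."""
    return sum(loss(B[i] @ w, y[i]) for i in range(B.shape[0])) + g(w)
```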

SLIDE 16

Constrained Problem Reformulation

min_{w ∈ ℝ^d} P(w) = min_{w ∈ ℝ^d} [ ∑_{i=1}^{n} φ_i(b_i^T w) + g(w) ]
                   = min_{(w, µ) ∈ Q} [ ∑_{i=1}^{n} φ_i(µ_i)  (separable, twice differentiable)  + g(w)  (differentiable) ],
where Q ≡ { w ∈ ℝ^d, µ ∈ ℝ^n | b_i^T w = µ_i }.

◮ Approximate the φ_i by second-order models with cubic regularization;
◮ Treat g as a quadratic function;
◮ Project onto the simple constraint set Q.

16 / 20
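The set Q here is an affine subspace, so projecting onto it reduces to one small linear solve; the sketch below is our illustration of that projection, not code from the paper:

```python
import numpy as np

def project_onto_Q(w0, mu0, B):
    """Euclidean projection of (w0, mu0) onto Q = {(w, mu) : B w = mu}.

    With C = [B, -I] and z = (w, mu), the constraint is C z = 0 and the projection
    is z - C^T (C C^T)^{-1} C z, where C C^T = B B^T + I.
    """
    residual = B @ w0 - mu0                                  # C z for the current point
    lam = np.linalg.solve(B @ B.T + np.eye(B.shape[0]), residual)
    return w0 - B.T @ lam, mu0 + lam

# Quick check: the projected point satisfies the constraints b_i^T w = mu_i.
rng = np.random.default_rng(2)
B = rng.standard_normal((4, 6))
w, mu = project_onto_Q(rng.standard_normal(6), rng.standard_normal(4), B)
assert np.allclose(B @ w, mu)
```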

SLIDE 17

Proof of Concept: Does second-order information help?

[Figure: total computational time (s) vs. block size (10 to 8K), comparing block coordinate Gradient Descent and block coordinate Cubic Newton on the leukemia and duke breast-cancer datasets, tolerance = 1e-6.]

◮ Training Logistic Regression, d = 7129.
◮ Cubic Newton beats Gradient Descent for 10 ≤ |S| ≤ 50.
◮ Second-order information improves convergence.

17 / 20

SLIDE 18

Maximization of the Dual Problem

Initial objective: P(w) ≡ ∑_{i=1}^{n} φ_i(b_i^T w) + g(w).

We have the Primal and Dual problems: min_{w ∈ ℝ^d} P(w) ≥ max_{α ∈ ℝ^n} D(α).

Introducing the Fenchel conjugate f*(s) ≡ sup_x [ s^T x − f(x) ], we have
D(α) ≡ ∑_{i=1}^{n} −φ_i*(α_i)  (separable, twice differentiable)  − g*(−B^T α)  (differentiable).

Solve the Dual problem by our framework:
◮ Approximate the φ_i* by second-order cubic models;
◮ Treat g* as a quadratic function;
◮ Project onto Q ≡ ⋂_{i=1}^{n} dom φ_i*.

18 / 20
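As a small illustration of the Fenchel conjugate used here: for the squared loss from the primal slide, φ_i(a) = (a − y_i)², the conjugate has the closed form φ_i*(s) = s·y_i + s²/4, which the snippet below checks numerically (our example of a standard convex-analysis fact, not from the talk):

```python
import numpy as np

def conjugate_numeric(f, s, grid):
    """Fenchel conjugate f*(s) = sup_x [ s*x - f(x) ], approximated over a grid."""
    return np.max(s * grid - f(grid))

# Squared loss phi_i(a) = (a - y_i)^2: maximizing s*a - (a - y_i)^2 over a
# gives a = y_i + s/2, hence phi_i^*(s) = s*y_i + s^2/4.
y_i = 1.5
phi = lambda a: (a - y_i) ** 2
grid = np.linspace(-20.0, 20.0, 200_001)
for s in (-2.0, 0.3, 4.0):
    closed_form = s * y_i + s ** 2 / 4.0
    assert abs(conjugate_numeric(phi, s, grid) - closed_form) < 1e-6
```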

SLIDE 19

Training Poisson Regression

◮ Solving the dual of Poisson regression.

[Figure: duality gap vs. epoch on the Synthetic and Montreal bike lanes datasets, comparing Cubic, SDNA, and SDCA with block sizes 8, 32, and 256.]

SDNA: Zheng Qu et al. "SDNA: stochastic dual Newton ascent for empirical risk minimization". International Conference on Machine Learning. 2016, pp. 1823–1832.

SDCA: Shai Shalev-Shwartz and Tong Zhang. "Stochastic dual coordinate ascent methods for regularized loss minimization". Journal of Machine Learning Research 14 (2013), pp. 567–599.

19 / 20

SLIDE 20

Conclusion

New second-order algorithm for convex optimization:
◮ Based on cubic regularization.
◮ Utilizes problem structure.
◮ Performs randomized block updates (computationally cheap).
◮ Has global complexity guarantees.

New Primal-Dual method for Empirical Risk Minimization:
◮ Outperforms the state of the art in terms of the number of data accesses.

Thank you for your attention! See you at Poster #156.

20 / 20