Sub-Sampled Newton Methods for Machine Learning - Jorge Nocedal - PowerPoint PPT Presentation


SLIDE 1

Sub-Sampled Newton Methods for Machine Learning Jorge Nocedal

Northwestern University
Goldman Lecture, Sept 2016

SLIDE 2

Collaborators

Raghu Bollapragada (Northwestern University), Richard Byrd (University of Colorado)

SLIDE 3

Optimization, Statistics, Machine Learning

Dramatic breakdown of barriers between optimization and statistics, partly stimulated by the rise of machine learning.

  • Learning processes and learning measures are defined using continuous optimization models, both convex and non-convex
  • Nonlinear optimization algorithms are used to train statistical models, a great challenge with the advent of high-dimensional models and huge data sets
  • The stochastic gradient method currently plays a central role

SLIDE 4

Stochastic Gradient Method (Robbins-Monro, 1951)

  • 1. Why has it risen to such prominence?
  • 2. What is the main mechanism that drives it?
  • 3. What can we say about its behavior in convex and non-convex cases?
  • 4. What ideas have been proposed to improve upon SG?

"Optimization Methods for Large-Scale Machine Learning", Bottou, Curtis, Nocedal (2016)

SLIDE 5

Problem statement

Given a training set $\{(x_1,y_1),\dots,(x_n,y_n)\}$
Given a loss function $\ell(z,y)$ (hinge loss, logistic, ...)
Find a prediction function $h(x;w)$ (linear, DNN, ...)

$$\min_w \ \frac{1}{n}\sum_{i=1}^{n} \ell(h(x_i;w),\,y_i)$$

Notation: $f_i(w) = \ell(h(x_i;w),\,y_i)$

$$R_n(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w) \qquad \text{(empirical risk)}$$

Random variable $\xi = (x,y)$:

$$F(w) = \mathbb{E}[\,f(w;\xi)\,] \qquad \text{(expected risk)}$$

SLIDE 6

Stochastic Gradient Method

$$w_{k+1} = w_k - \alpha_k \nabla f_i(w_k), \qquad i \in \{1,\dots,n\} \text{ chosen at random}$$

  • Very cheap, noisy iteration; gradient w.r.t. just 1 data point
  • Not a gradient descent method
  • Stochastic process dependent on the choice of i
  • Descent in expectation

We first present algorithms for empirical risk minimization:

$$R_n(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)$$
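For concreteness, here is a minimal sketch of the SG iteration, assuming a generic per-example gradient callback; the least-squares data, diminishing stepsize schedule, and all names below are illustrative choices, not taken from the talk.

```python
import numpy as np

def sgd(grad_fi, w0, n, alpha0=0.5, num_iters=2000, seed=0):
    """Stochastic gradient: w_{k+1} = w_k - alpha_k * grad f_i(w_k),
    with i drawn uniformly at random at each iteration."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for k in range(num_iters):
        i = rng.integers(n)              # one data point, chosen at random
        alpha_k = alpha0 / (1 + k)       # diminishing stepsize, alpha_k -> 0
        w -= alpha_k * grad_fi(w, i)     # cheap, noisy step
    return w

# Toy usage: least-squares components f_i(w) = 0.5*(x_i^T w - y_i)^2
X = np.random.randn(200, 5)
y = X @ np.ones(5)
grad_fi = lambda w, i: (X[i] @ w - y[i]) * X[i]
w_hat = sgd(grad_fi, w0=np.zeros(5), n=len(y))
```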

SLIDE 7

Batch Optimization Methods

$$w_{k+1} = w_k - \alpha_k \nabla R_n(w_k) = w_k - \frac{\alpha_k}{n}\sum_{i=1}^{n}\nabla f_i(w_k) \qquad \text{(batch gradient method)}$$

Why has SG emerged as the preeminent method?

  • More expensive, accurate step
  • Can choose among a wide range of optimization algorithms
  • Opportunities for parallelism

Computational trade-offs between stochastic and batch methods.
Ability to minimize F (the generalization error).

SLIDE 8

Practical Experience

Fast initial progress of SG, followed by a drastic slowdown after about 10 epochs. Can we explain this?

[Figure: empirical risk $R_n$ vs. accessed data points for SGD and batch L-BFGS; logistic regression on speech data.]

SLIDE 9

Intuition

SG employs information more efficiently than batch methods.
Argument 1: Suppose the data consists of 10 copies of a set S. An iteration of the batch method is then 10 times more expensive, while SG performs the same computations as it would on S alone.

SLIDE 10

Computational complexity

Total work to obtain $R_n(w_k) \le R_n(w_*) + \epsilon$:

  • Batch gradient method: $n\log(1/\epsilon)$
  • Stochastic gradient method: $1/\epsilon$

Think of $\epsilon = 10^{-3}$. Which one is better?

More precisely: Batch: $n\,d\,\kappa\,\log(1/\epsilon)$; SG: $d\,\nu\,\kappa^2/\epsilon$
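As a rough back-of-the-envelope check of these counts (the value $n = 10^6$ below is an assumption chosen only for illustration):

```latex
\[
\text{With } \epsilon = 10^{-3},\ n = 10^{6}:\qquad
\underbrace{n\log(1/\epsilon) \approx 10^{6}\cdot\ln(10^{3}) \approx 7\times 10^{6}}_{\text{batch}}
\quad\text{vs.}\quad
\underbrace{1/\epsilon = 10^{3}}_{\text{SG}}
\ \text{component-gradient evaluations.}
\]
```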

SLIDE 11

Example by Bertsekas

Region of confusion. Note that this is a geographical argument.

Analysis: given $w_k$, what is the expected decrease in the objective function $R_n$ as we choose one of the quadratics at random?

$$R_n(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)$$
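A tiny numerical illustration of the region of confusion (not from the talk; the one-dimensional quadratics $f_i(w) = \tfrac{1}{2}(w - c_i)^2$, their centers, and the fixed stepsize are made-up choices):

```python
import numpy as np

# Ten 1-D quadratics f_i(w) = 0.5*(w - c_i)^2 with scattered minimizers c_i.
# R_n is minimized at mean(c); SG with a fixed stepsize makes fast progress
# from far away, then keeps bouncing inside the region [min(c), max(c)].
rng = np.random.default_rng(0)
c = rng.uniform(-1.0, 1.0, size=10)      # individual minimizers
w, alpha = 3.0, 0.5                      # start outside the region, fixed step
for k in range(200):
    i = rng.integers(len(c))
    w -= alpha * (w - c[i])              # SG step on the sampled quadratic
print("final iterate:", w)
print("region of confusion:", (c.min(), c.max()), "minimizer of R_n:", c.mean())
```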

SLIDE 12

A fundamental observation

To ensure convergence, $\alpha_k \to 0$ in the SG method, in order to control the variance:

$$\mathbb{E}[R_n(w_{k+1}) - R_n(w_k)] \;\le\; -\alpha_k\,\|\nabla R_n(w_k)\|_2^2 \;+\; \alpha_k^2\, \mathbb{E}\|\nabla f_{i_k}(w_k)\|^2$$

Initially the gradient-decrease term dominates; later the variance in the gradient hinders progress. Sub-sampled Newton methods directly control the noise in the last term. What can we say when $\alpha_k = \alpha$ is constant?

SLIDE 13

  • With a constant stepsize, SG only converges to a neighborhood of the optimal value.
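For reference, the standard constant-stepsize result has the following flavor (a simplified form of the bound in Bottou, Curtis, Nocedal (2016), assuming strong convexity with constant $c$, an $L$-Lipschitz gradient, a bounded second moment $\mathbb{E}\|\nabla f_{i_k}(w_k)\|^2 \le M$, and a sufficiently small $\alpha$; the exact constants are not from these slides):

```latex
\[
\mathbb{E}\big[R_n(w_k) - R_n(w_*)\big] \;\le\;
(1-\alpha c)^{\,k-1}\Big(R_n(w_1) - R_n(w_*) - \tfrac{\alpha L M}{2c}\Big)
\;+\; \tfrac{\alpha L M}{2c}.
\]
```

The first term decays linearly; the second is the residual "neighborhood", proportional to the stepsize $\alpha$.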
SLIDE 14

SLIDE 15

Deep Neural Networks

Although much is understood about the SG method, some great mysteries remain: why is it so much better than batch methods in DNNs?
SLIDE 16

Sharp and flat minima (Keskar et al., 2016)

[Figure: $R$ observed along the line from the SG solution to the batch solution, following Goodfellow et al.]

Setup: deep convolutional neural net on CIFAR-10; SG with a mini-batch of size 256; batch method using 10% of the training set; ADAM optimizer.

SLIDE 17

Testing accuracy and sharpness (Keskar et al., 2016)

[Figure: testing accuracy vs. batch size, and sharpness of the minimizer vs. batch size. Sharpness: maximum of $R$ in a small box around the minimizer.]

SLIDE 18

Drawback of SG method: distributed computing

SLIDE 19

Sub-sampled Newton Methods

SLIDE 20

Iteration

$$\nabla^2 F_S(w_k)\, p = -\nabla F_X(w_k), \qquad w_{k+1} = w_k + \alpha_k\, p$$

Choose $S \subset \{1,\dots,n\}$ and $X \subset \{1,\dots,n\}$ uniformly and independently:

$$\nabla F_X(w_k) = \frac{1}{|X|}\sum_{i \in X}\nabla f_i(w_k), \qquad
\nabla^2 F_S(w_k) = \frac{1}{|S|}\sum_{i \in S}\nabla^2 f_i(w_k)$$

Sub-sampled gradient and Hessian. The true Newton method is impractical in large-scale machine learning, and we will not achieve scale invariance or quadratic convergence. But the stochastic nature of the objective creates opportunities: coordinate the Hessian sample S and the gradient sample X for optimal complexity.

$$R_n(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)$$
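A minimal sketch of one such iteration, assuming per-example gradient and Hessian callbacks and a dense linear solve (all names, sample sizes, and the exact solve are illustrative assumptions, not choices from the talk):

```python
import numpy as np

def subsampled_newton_step(w, grad_fi, hess_fi, n, size_X, size_S, alpha=1.0, rng=None):
    """One step: solve  H_S(w) p = -g_X(w),  then  w <- w + alpha * p,
    where g_X and H_S average gradients/Hessians over independent samples X, S."""
    rng = rng or np.random.default_rng()
    X = rng.choice(n, size=size_X, replace=False)    # gradient sample
    S = rng.choice(n, size=size_S, replace=False)    # Hessian sample, independent of X
    g = np.mean([grad_fi(w, i) for i in X], axis=0)  # sub-sampled gradient
    H = np.mean([hess_fi(w, i) for i in S], axis=0)  # sub-sampled Hessian
    p = np.linalg.solve(H, -g)                       # exact solve; fine for small d
    return w + alpha * p
```

For large d one would avoid forming H and instead use Hessian-vector products with an iterative solver, as in the inexact methods discussed later.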

SLIDE 21

Active research area

  • Friedlander and Schmidt (2011)
  • Byrd, Chin, Neveitt, N. (2011)
  • Royset (2012)
  • Erdogdu and Montanari (2015)
  • Roosta-Khorasani and Mahoney (2016)
  • Agarwal, Bullins and Hazan (2016)
  • Pilanci and Wainwright (2015)
  • Pasupathy, Glynn, Ghosh, Hashemi (2015)
  • Xu, Yang, Roosta-Khorasani, Ré, Mahoney (2016)
SLIDE 22

Linear convergence

$$\nabla^2 F_{S_k}(w_k)\, p = -\nabla F_{X_k}(w_k), \qquad w_{k+1} = w_k + \alpha\, p$$

The following result is well known for strongly convex objectives.

Theorem. Under standard assumptions, if
(a) $\alpha = \mu/L$,
(b) $|S_k| =$ constant,
(c) $|X_k| = \eta^k$ with $\eta > 1$ (geometric growth),
then $\mathbb{E}[\,\|w_k - w_*\|\,] \to 0$ at a linear rate, and the work complexity matches that of the stochastic gradient method.

Here $\mu$ is the smallest eigenvalue of any subsampled Hessian and $L$ the largest eigenvalue of the Hessian of $F$.

SLIDE 23

Local superlinear convergence. Objective: the expected risk $F$.

We can show the linear-quadratic result

$$\mathbb{E}_k[\,\|w_{k+1} - w_*\|\,] \;\le\; C_1\,\|w_k - w_*\|^2 \;+\; \frac{\sigma}{\mu\sqrt{|S_k|}}\,\|w_k - w_*\| \;+\; \frac{\nu}{\mu\sqrt{|X_k|}}$$

To obtain superlinear convergence: (i) $|S_k| \to \infty$; (ii) $|X_k|$ must increase faster than geometrically.
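A small sketch of the sample-size schedules implied by the last two slides (the constants, the growth exponent, and the cap at n are illustrative assumptions):

```python
# Gradient-sample schedules |X_k| for sub-sampled Newton.
# Geometric growth eta**k gives linear convergence (previous slide); growing
# faster than geometrically, e.g. eta**(k**1.2), is needed for superlinear
# convergence, together with |S_k| -> infinity for the Hessian sample.
eta, n = 2.0, 10**6
X_geometric = [min(n, int(eta ** k)) for k in range(15)]
X_supergeom = [min(n, int(eta ** (k ** 1.2))) for k in range(15)]
S_linear    = [64] * 15                           # constant |S_k|: linear rate
S_superlin  = [64 * (k + 1) for k in range(15)]   # growing |S_k|: superlinear rate
print(X_geometric)
print(X_supergeom)
```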

SLIDE 24

Closer look at the constant

We can show the linear-quadratic result

$$\mathbb{E}_k[\,\|w_{k+1} - w_*\|\,] \;\le\; C_1\,\|w_k - w_*\|^2 \;+\; \frac{\sigma}{\mu\sqrt{|S_k|}}\,\|w_k - w_*\| \;+\; \frac{\nu}{\mu\sqrt{|X_k|}}$$

where the constants are defined by

$$\big\|\,\mathbb{E}\big[(\nabla^2 F_i(w) - \nabla^2 F(w))^2\big]\big\| \;\le\; \sigma^2, \qquad \mathrm{tr}\big(\mathrm{Cov}(\nabla f_i(w))\big) \;\le\; \nu^2$$

SLIDE 25

Achieving faster convergence

We can show the linear-quadratic result

$$\mathbb{E}_k[\,\|w_{k+1} - w_*\|\,] \;\le\; C_1\,\|w_k - w_*\|^2 \;+\; \frac{\sigma}{\mu\sqrt{|S_k|}}\,\|w_k - w_*\| \;+\; \frac{\nu}{\mu\sqrt{|X_k|}}$$

To obtain it we used the bound

$$\mathbb{E}_k\big[\,\|(\nabla^2 F_{S_k}(w_k) - \nabla^2 F(w_k))(w_k - w_*)\|\,\big] \;\le\; \frac{\sigma}{\sqrt{|S_k|}}\,\|w_k - w_*\|,$$

not matrix concentration inequalities.

SLIDE 26

Formal superlinear convergence

Theorem. Under the conditions just stated, there is a neighborhood of $w_*$ such that, for the sub-sampled Newton method with $\alpha_k = 1$,

$$\mathbb{E}[\,\|w_k - w_*\|\,] \to 0 \quad \text{superlinearly.}$$

SLIDE 27

Observations on quasi-Newton methods

$$B_k\, p = -\nabla F_X(w_k), \qquad w_{k+1} = w_k + \alpha_k\, p, \qquad \text{and suppose } w_k \to w_*$$

Dennis-Moré: if

$$\frac{\|[B_k - \nabla^2 f(w_*)]\,p_k\|}{\|p_k\|} \to 0$$

then the convergence rate is superlinear.

  • 1. $B_k$ need not converge to the true Hessian at $w_*$; it only needs to be a good approximation along the search directions.

SLIDE 28

Hacker’s definition of second-order method

1. One for which a unit steplength is acceptable (yields sufficient reduction) most of the time.
2. Why? How can one learn the scale of the search directions without having learned the curvature of the function in some relevant spaces?
3. This "definition" is problem dependent.

SLIDE 29

Inexact Methods

$$\nabla^2 F_S(w_k)\, p = -\nabla F_X(w_k), \qquad w_{k+1} = w_k + \alpha_k\, p$$

1. Exact method: the Hessian approximation is inverted (e.g. Newton sketch)
2. Inexact method: solve the linear system inexactly by an iterative solver
3. Conjugate gradient
4. Stochastic gradient
5. Both require only Hessian-vector products

Newton-CG chooses a fixed sample S and applies CG to

$$q_k(p) = F(w_k) + \nabla F(w_k)^T p + \tfrac{1}{2}\, p^T \nabla^2 F_S(w_k)\, p$$
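A sketch of the Hessian-free Newton-CG idea, where the sub-sampled Hessian is accessed only through Hessian-vector products (the CG tolerance, iteration cap, and callback names are assumptions made for this illustration):

```python
import numpy as np

def cg(hvp, b, tol=1e-2, max_iter=50):
    """Conjugate gradient for H p = b, with H (assumed positive definite)
    accessed only through Hessian-vector products hvp(v) = H @ v."""
    p = np.zeros_like(b)
    r = b - hvp(p)                          # residual
    d = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) <= tol * np.linalg.norm(b):
            break                           # inexact solve: stop early
        Hd = hvp(d)
        a = rs / (d @ Hd)
        p += a * d
        r -= a * Hd
        rs_new = r @ r
        d = r + (rs_new / rs) * d
        rs = rs_new
    return p

def newton_cg_step(w, grad_X, hvp_S, alpha=1.0):
    """Inexact sub-sampled Newton step: approximately minimize q_k with CG."""
    p = cg(hvp_S, -grad_X(w))               # solve H_S(w) p = -g_X(w) inexactly
    return w + alpha * p
```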

SLIDE 30

Newton-SGI (stochastic gradient iteration)

If we apply the standard gradient method to

$$q_k(p) = F(w_k) + \nabla F(w_k)^T p + \tfrac{1}{2}\, p^T \nabla^2 F(w_k)\, p$$

we obtain the iteration

$$p_k^{i+1} = p_k^i - \nabla q_k(p_k^i) = \big(I - \nabla^2 F(w_k)\big)\, p_k^i - \nabla F(w_k),$$

a gradient method that uses the exact gradient and an estimate of the Hessian.

Consider instead the semi-stochastic gradient iteration, which changes the sampled Hessian at each inner iteration:

  • 1. Choose an index $j$ at random;
  • 2. $p_k^{i+1} = \big(I - \nabla^2 F_j(w_k)\big)\, p_k^i - \nabla F(w_k)$

This method is implicit in Agarwal, Bullins, Hazan (2016).
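A sketch of the Newton-SGI inner loop based on these formulas (the inner iteration count and callback names are illustrative; the recursion contracts only if the eigenvalues of each component Hessian lie in (0, 2), which is assumed here, e.g. after scaling):

```python
import numpy as np

def newton_sgi_step(w, grad_F, hess_fj, n, inner_iters=100, alpha=1.0, rng=None):
    """Semi-stochastic gradient iteration on the quadratic model q_k:
    p <- (I - H_j(w)) p - grad F(w), resampling the component Hessian H_j
    at every inner iteration, followed by the outer step w <- w + alpha * p."""
    rng = rng or np.random.default_rng()
    g = grad_F(w)                            # exact (or sub-sampled) gradient
    p = np.zeros_like(g)
    for _ in range(inner_iters):
        j = rng.integers(n)                  # fresh component Hessian each time
        p = p - hess_fj(w, j) @ p - g        # p^{i+1} = (I - H_j) p^i - g
    return w + alpha * p
```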

SLIDE 31

Comparing Newton-CG and Newton-SGI

Number of Hessian-vector products to achieve $\|w_{k+1} - w_*\| \le \tfrac{1}{2}\|w_k - w_*\|$ (*):

  • Newton-SGI: $O\big((\hat\kappa_l^{\max})^2\, \hat\kappa_l \log(\hat\kappa_l)\log(d)\big)$
  • Newton-CG: $O\big((\hat\kappa_l^{\max})^2\, \sqrt{\hat\kappa_l^{\max}} \log(\hat\kappa_l^{\max})\big)$

Results in Agarwal, Bullins and Hazan (2016) and Xu, Yang, Ré, Roosta-Khorasani, Mahoney (2016) obtain the decrease (*) at each step with probability $1-p$; our results give convergence of the whole sequence in expectation. The complexity bounds are very pessimistic, particularly for CG.

SLIDE 32

Final Remarks

  • The search for effective optimization algorithms for machine learning is ongoing, in spite of the total dominance of the SG method at present on very large-scale applications
  • SG does not parallelize well
  • SG is a first-order method affected by ill conditioning
  • A method that is noisy at the start and gradually becomes more accurate seems attractive