1
Sub-Sampled Newton Methods for Machine Learning
Jorge Nocedal
Northwestern University
Goldman Lecture, Sept 2016
2
Collaborators
Raghu Bollapragada (Northwestern University) and Richard Byrd (University of Colorado)
3
Optimization, Statistics, Machine Learning
- Dramatic breakdown of barriers between optimization and statistics, partly stimulated by the rise of machine learning
- Learning processes and learning measures defined using continuous optimization models – convex and non-convex
- Nonlinear optimization algorithms used to train statistical models – a great challenge with the advent of high-dimensional models and huge data sets
- The stochastic gradient method currently plays a central role
4
Stochastic Gradient Method, Robbins-Monro (1951)
1. Why has it risen to such prominence?
2. What is the main mechanism that drives it?
3. What can we say about its behavior in convex and non-convex cases?
4. What ideas have been proposed to improve upon SG?
"Optimization Methods for Large-Scale Machine Learning", Bottou, Curtis, Nocedal (2016)
5
Problem statement
Given a training set $\{(x_1,y_1),\dots,(x_n,y_n)\}$ and a loss function $\ell(z,y)$ (hinge loss, logistic, ...), find a prediction function $h(x;w)$ (linear, DNN, ...) by solving
$$\min_w \; \frac{1}{n}\sum_{i=1}^{n} \ell(h(x_i;w),\,y_i)$$
Notation: $f_i(w) = \ell(h(x_i;w),y_i)$, and $R_n(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)$ is the empirical risk.
For the random variable $\xi = (x,y)$, $F(w) = \mathbb{E}[f(w;\xi)]$ is the expected risk.
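To make the setup concrete, here is a minimal NumPy sketch of $f_i$ and $R_n$ for logistic regression with a linear predictor (the function names and data layout are our illustrative assumptions, not from the talk):

```python
import numpy as np

def loss(z, y):
    """Logistic loss l(z, y) = log(1 + exp(-y z)) for labels y in {-1, +1}."""
    return np.logaddexp(0.0, -y * z)   # numerically stable log(1 + exp(.))

def f_i(w, x_i, y_i):
    """f_i(w) = l(h(x_i; w), y_i) with the linear predictor h(x; w) = w^T x."""
    return loss(x_i @ w, y_i)

def R_n(w, X, y):
    """Empirical risk R_n(w) = (1/n) * sum_i f_i(w)."""
    return np.mean(loss(X @ w, y))
```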
6
Stochastic Gradient Method
$$w_{k+1} = w_k - \alpha_k \nabla f_i(w_k), \qquad i \in \{1,\dots,n\} \text{ chosen at random}$$
- Very cheap, noisy iteration; gradient w.r.t. just 1 data point
- Not a gradient descent method
- Stochastic process dependent on the choice of i
- Descent in expectation
We first present algorithms for empirical risk minimization: $R_n(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)$.
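A minimal sketch of the SG iteration for the logistic-regression setup above (the diminishing step-size schedule and the names are our assumptions):

```python
import numpy as np

def grad_f_i(w, x_i, y_i):
    """Gradient of f_i(w) = log(1 + exp(-y_i x_i^T w)):  -y_i x_i / (1 + exp(y_i x_i^T w))."""
    return (-y_i / (1.0 + np.exp(y_i * (x_i @ w)))) * x_i

def sgd(w0, X, y, alpha0=1.0, iters=10000, seed=0):
    """SG iteration: w_{k+1} = w_k - alpha_k * grad f_i(w_k), i chosen at random."""
    rng = np.random.default_rng(seed)
    w, n = w0.astype(float), X.shape[0]
    for k in range(iters):
        i = rng.integers(n)           # gradient w.r.t. just one data point
        alpha_k = alpha0 / (k + 1)    # alpha_k -> 0 to control the variance
        w -= alpha_k * grad_f_i(w, X[i], y[i])
    return w
```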
7
Batch Optimization Methods
$$w_{k+1} = w_k - \alpha_k \nabla R_n(w_k) \qquad \text{(batch gradient method)}$$
Why has SG emerged as the preeminent method?
$$w_{k+1} = w_k - \frac{\alpha_k}{n}\sum_{i=1}^{n} \nabla f_i(w_k)$$
- More expensive, accurate step
- Can choose among a wide range of optimization algorithms
- Opportunities for parallelism
- Computational trade-offs between stochastic and batch methods
- Ability to minimize $F$ (generalization error)
8
Practical Experience
[Figure: empirical risk $R_n$ vs. accessed data points for SGD and batch L-BFGS; logistic regression on speech data. SG makes fast initial progress over roughly 10 epochs, followed by a drastic slowdown. Can we explain this?]
9
Intuition
SG employs information more efficiently than batch methods.
Argument 1: Suppose the data consists of 10 copies of a set S. An iteration of the batch method is then 10 times more expensive, while SG performs the same computations.
10
Computational complexity
Total work to obtain $R_n(w_k) \le R_n(w^*) + \epsilon$:
- Batch gradient method: $n \log(1/\epsilon)$
- Stochastic gradient method: $1/\epsilon$

Think of $\epsilon = 10^{-3}$. Which one is better?
More precisely: Batch: $n d \kappa \log(1/\epsilon)$; SG: $d \nu \kappa^2 / \epsilon$
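Plugging numbers into the simpler bounds makes the comparison concrete (our arithmetic; the training-set size $n = 10^6$ is an assumed illustration):

```latex
\[
\varepsilon = 10^{-3},\; n = 10^{6}:\qquad
\underbrace{n\log(1/\varepsilon) \approx 10^{6}\cdot 6.9 \approx 7\times 10^{6}}_{\text{batch}}
\quad\text{vs.}\quad
\underbrace{1/\varepsilon = 10^{3}}_{\text{SG}}
\]
```

The SG bound is independent of $n$, which is the source of its advantage on huge data sets.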
11
Example by Bertsekas
Region of confusion. Note that this is a geometrical argument.
Analysis: given $w_k$, what is the expected decrease in the objective function $R_n(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)$ as we choose one of the quadratics at random?
12
A fundamental observation
To ensure convergence, $\alpha_k \to 0$ in the SG method to control the variance:
$$\mathbb{E}[R_n(w_{k+1}) - R_n(w_k)] \le -\alpha_k \|\nabla R_n(w_k)\|_2^2 + \alpha_k^2\, \mathbb{E}\|\nabla f_{i_k}(w_k)\|^2$$
Initially the gradient decrease dominates; then the variance in the gradient hinders progress. Sub-sampled Newton methods directly control the noise in the last term. What can we say when $\alpha_k = \alpha$ is constant?
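A standard short derivation of this bound, assuming $R_n$ is $L$-smooth (with constants absorbed into the last term):

```latex
\begin{align*}
R_n(w_{k+1}) &\le R_n(w_k) + \nabla R_n(w_k)^T(w_{k+1}-w_k) + \tfrac{L}{2}\|w_{k+1}-w_k\|^2 \\
\mathbb{E}_{i_k}[R_n(w_{k+1})] - R_n(w_k)
  &\le -\alpha_k\|\nabla R_n(w_k)\|_2^2 + \tfrac{L}{2}\alpha_k^2\,\mathbb{E}\|\nabla f_{i_k}(w_k)\|^2
\end{align*}
```

using $w_{k+1} = w_k - \alpha_k \nabla f_{i_k}(w_k)$ and $\mathbb{E}[\nabla f_{i_k}(w_k)] = \nabla R_n(w_k)$.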
13
- Only converges to a neighborhood of the optimal value.
14
15
Deep Neural Networks
Although much is understood about the SG method, some great mysteries remain: why is it so much better than batch methods on DNNs?
16
Sharp and flat minima Keskar et al. (2016)
[Figure: observing $R$ along the line from the SG solution to the batch solution (after Goodfellow et al.). Deep convolutional neural net on CIFAR-10; SG with mini-batch of size 256, batch method with 10% of the training set; ADAM optimizer.]
17
Testing accuracy and sharpness, Keskar et al. (2016)
[Figure: testing accuracy vs. batch size, and sharpness of the minimizer vs. batch size. Sharpness: max of $R$ in a small box around the minimizer.]
18
Drawback of SG method: distributed computing
19
Sub-sampled Newton Methods
20
Iteration
$$\nabla^2 F_S(w_k)\, p = -\nabla F_X(w_k), \qquad w_{k+1} = w_k + \alpha_k p$$
Choose $S \subset \{1,\dots,n\}$ and $X \subset \{1,\dots,n\}$ uniformly and independently. The sub-sampled gradient and Hessian are
$$\nabla F_X(w_k) = \frac{1}{|X|}\sum_{i \in X} \nabla f_i(w_k), \qquad \nabla^2 F_S(w_k) = \frac{1}{|S|}\sum_{i \in S} \nabla^2 f_i(w_k),$$
where $R_n(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)$.

- The true Newton method is impractical in large-scale machine learning
- We will not achieve scale invariance or quadratic convergence
- But the stochastic nature of the objective creates opportunities: coordinate the Hessian sample $S$ and the gradient sample $X$ for optimal complexity
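A minimal NumPy sketch of one such iteration (our illustration: grad_f_i and hess_f_i are assumed per-example callbacks, and the dense solve stands in for the Hessian-vector-product solvers discussed later):

```python
import numpy as np

def subsampled_newton_step(w, data, grad_f_i, hess_f_i, S_size, X_size,
                           alpha=1.0, seed=0):
    """Solve  grad^2 F_S(w) p = -grad F_X(w),  then take  w + alpha * p."""
    rng = np.random.default_rng(seed)
    n = len(data)                                    # data: list of (x_i, y_i) pairs
    S = rng.choice(n, size=S_size, replace=False)    # Hessian sample
    Xk = rng.choice(n, size=X_size, replace=False)   # gradient sample, independent
    g = np.mean([grad_f_i(w, *data[i]) for i in Xk], axis=0)
    H = np.mean([hess_f_i(w, *data[i]) for i in S], axis=0)
    p = np.linalg.solve(H, -g)   # exact solve; inexact variants use CG or SGI instead
    return w + alpha * p
```

For the linear-convergence result below, the outer loop would keep $|S_k|$ constant and grow the gradient sample $|X_k|$ geometrically.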
21
Active research area
- Friedlander and Schmidt (2011)
- Byrd, Chin, Neveitt, N. (2011)
- Royset (2012)
- Erdogdu and Montanari (2015)
- Roosta-Khorasani and Mahoney (2016)
- Agarwal, Bullins and Hazan (2016)
- Pilanci and Wainwright (2015)
- Pasupathy, Glynn, Ghosh, Hashemi (2015)
- Xu, Yang, Roosta-Khorasani, Ré, Mahoney (2016)
22
Linear convergence
$$\nabla^2 F_{S_k}(w_k)\, p = -\nabla F_{X_k}(w_k), \qquad w_{k+1} = w_k + \alpha p$$
The following result is well known for strongly convex objectives.
Theorem: Under standard assumptions, if
a) $\alpha = \mu/L$,
b) $|S_k| =$ constant,
c) $|X_k| = \eta^k$, $\eta > 1$ (geometric growth),
then $\mathbb{E}[\|w_k - w^*\|] \to 0$ at a linear rate, and the work complexity matches that of the stochastic gradient method.
Here $\mu$ is the smallest eigenvalue of any sub-sampled Hessian and $L$ is the largest eigenvalue of the Hessian of $F$.
23
Local superlinear convergence (objective: expected risk $F$)

We can show the linear-quadratic result
$$\mathbb{E}_k[\|w_{k+1} - w^*\|] \le C_1 \|w_k - w^*\|^2 + \frac{\sigma \|w_k - w^*\|}{\mu \sqrt{|S_k|}} + \frac{\nu}{\mu \sqrt{|X_k|}}$$
To obtain superlinear convergence: (i) $|S_k| \to \infty$; (ii) $|X_k|$ must increase faster than geometrically.
24
Closer look at the constant
We can show the linear-quadratic result
$$\mathbb{E}_k[\|w_{k+1} - w^*\|] \le C_1 \|w_k - w^*\|^2 + \frac{\sigma \|w_k - w^*\|}{\mu \sqrt{|S_k|}} + \frac{\nu}{\mu \sqrt{|X_k|}}$$
where the constants satisfy
$$\left\| \mathbb{E}\!\left[ (\nabla^2 F_i(w) - \nabla^2 F(w))^2 \right] \right\| \le \sigma^2, \qquad \mathrm{tr}\!\left( \mathrm{Cov}(\nabla f_i(w)) \right) \le \nu^2$$
25
Achieving faster convergence
We can show the linear-quadratic result
$$\mathbb{E}_k[\|w_{k+1} - w^*\|] \le C_1 \|w_k - w^*\|^2 + \frac{\sigma \|w_k - w^*\|}{\mu \sqrt{|S_k|}} + \frac{\nu}{\mu \sqrt{|X_k|}}$$
To obtain the middle term we used the bound
$$\mathbb{E}_k\!\left[ \|\nabla^2 F_{S_k}(w_k) - \nabla^2 F(w_k)\| \right] \le \frac{\sigma}{\sqrt{|S_k|}},$$
not matrix concentration inequalities.
26
Formal superlinear convergence
Theorem: Under the conditions just stated, there is a neighborhood of $w^*$ such that for the sub-sampled Newton method with $\alpha_k = 1$,
$$\mathbb{E}[\|w_k - w^*\|] \to 0 \quad \text{superlinearly.}$$
27
Observations on quasi-Newton methods
$$B_k p = -\nabla F_X(w_k), \qquad w_{k+1} = w_k + \alpha_k p, \quad \text{and suppose } w_k \to w^*$$
Dennis-Moré: if
$$\frac{\|[B_k - \nabla^2 f(w^*)]\, p_k\|}{\|p_k\|} \to 0,$$
the convergence rate is superlinear.
1. $B_k$ need not converge to $\nabla^2 f(w^*)$; it need only be a good approximation along the search directions.
28
Hacker’s definition of second-order method
1. One for which unit steplength is acceptable (yields sufficient reduction) most of the time.
2. Why? How can one learn the scale of search directions without having learned the curvature of the function in some relevant spaces?
3. This "definition" is problem dependent.
29
Inexact Methods
$$\nabla^2 F_S(w_k)\, p = -\nabla F_X(w_k), \qquad w_{k+1} = w_k + \alpha_k p$$
1. Exact method: the Hessian approximation is inverted (e.g., Newton sketch)
2. Inexact method: solve the linear system inexactly by an iterative solver, such as conjugate gradient or stochastic gradient
3. Both require only Hessian-vector products

Newton-CG chooses a fixed sample $S$ and applies CG to
$$q_k(p) = F(w_k) + \nabla F(w_k)^T p + \tfrac{1}{2}\, p^T \nabla^2 F_S(w_k)\, p$$
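A sketch of the inexact solve via CG using only Hessian-vector products (generic textbook CG applied to $\nabla^2 F_S(w_k)\,p = -\nabla F(w_k)$; the callback name and stopping rule are our assumptions):

```python
import numpy as np

def cg_direction(hess_vec, g, max_iters=50, tol=1e-8):
    """Approximately minimize q_k by solving H p = -g, given only v -> H v."""
    p = np.zeros_like(g)
    r = -g.copy()                  # residual of H p = -g at p = 0
    d = r.copy()
    rs = r @ r
    for _ in range(max_iters):
        Hd = hess_vec(d)           # the only access to the sub-sampled Hessian
        a = rs / (d @ Hd)
        p += a * d
        r -= a * Hd
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        d = r + (rs_new / rs) * d  # next conjugate direction
        rs = rs_new
    return p
```

For logistic regression the product $\nabla^2 F_S(w)v$ can be formed without building the matrix, e.g. as $\frac{1}{|S|} X_S^T (D\,(X_S v))$ for a diagonal $D$, which is what makes the inexact approach attractive.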
30
Newton-SGI (stochastic gradient iteration)
If we apply the standard gradient method to
$$q_k(p) = F(w_k) + \nabla F(w_k)^T p + \tfrac{1}{2}\, p^T \nabla^2 F(w_k)\, p,$$
we obtain the iteration
$$p_k^{i+1} = p_k^i - \nabla q_k(p_k^i) = (I - \nabla^2 F(w_k))\, p_k^i - \nabla F(w_k)$$
This is a gradient method that uses the exact gradient and an estimate of the Hessian; the method is implicit in Agarwal, Bullins, Hazan (2016).
Consider instead the semi-stochastic gradient iteration, which changes the sample Hessian at each inner iteration:
1. Choose an index $j$ at random;
2. $p_k^{i+1} = (I - \nabla^2 F_j(w_k))\, p_k^i - \nabla F(w_k)$
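A sketch of this semi-stochastic inner loop (our illustration; it assumes the quadratic is scaled so that $\|I - \nabla^2 F_j(w_k)\| < 1$, otherwise a step size must multiply the update):

```python
import numpy as np

def sgi_direction(w, data, grad_F, hess_f_i, inner_iters=100, seed=0):
    """Inner iteration: p <- (I - hess f_j(w)) p - grad F(w), fresh j each time."""
    rng = np.random.default_rng(seed)
    g = grad_F(w)                      # exact gradient, fixed during the inner loop
    p = np.zeros_like(w)
    for _ in range(inner_iters):
        j = rng.integers(len(data))    # re-sample the Hessian at each inner iteration
        Hj = hess_f_i(w, *data[j])     # one-example Hessian estimate
        p = p - Hj @ p - g             # equals (I - Hj) p - g
    return p
```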
31
Comparing Newton-CG and Newton-SGI
Number of Hessian-vector products to achieve $\|w_{k+1} - w^*\| \le \tfrac{1}{2}\|w_k - w^*\|$ (*):

- Newton-SGI: $O\!\left( (\hat\kappa_{\max}^l)^2\, \hat\kappa^l \log(\hat\kappa^l) \log(d) \right)$
- Newton-CG: $O\!\left( (\hat\kappa_{\max}^l)^2\, \hat\kappa_{\max}^l \log(\hat\kappa_{\max}^l) \right)$

In the results of Agarwal, Bullins and Hazan (2016) and Xu, Yang, Ré, Roosta-Khorasani, Mahoney (2016), the decrease (*) is obtained at each step with probability $1-p$; our results give convergence of the whole sequence in expectation. The complexity bounds are very pessimistic, particularly for CG.
32
Final Remarks
- The search for effective optimization algorithms for machine learning is ongoing, in spite of the total dominance of the SG method at present on very large scale applications
- SG does not parallelize well
- SG is a first-order method affected by ill conditioning
- A method that is noisy at the start and gradually becomes more accurate is an appealing alternative