

SLIDE 1

Second order machine learning

Michael W. Mahoney, ICSI and Department of Statistics, UC Berkeley

SLIDE 2

Outline

Machine Learning’s “Inverse” Problem. Your choice:

1st Order Methods: FLAG n’ FLARE
  - disentangle geometry from the sequence of iterates

2nd Order Methods: Stochastic Newton-Type Methods
  - “simple” methods for convex problems
  - “more subtle” methods for non-convex problems

SLIDE 3

Introduction

Big Data ... Massive Data ...

SLIDE 4

Introduction

Humongous Data ...

SLIDE 5

Introduction

Big Data

How do we view BIG data?

SLIDE 6

Introduction

Algorithmic & Statistical Perspectives ...

Computer Scientists
  - Data: are a record of everything that happened.
  - Goal: process the data to find interesting patterns and associations.
  - Methodology: develop approximation algorithms under different models of data access, since the goal is typically computationally hard.

Statisticians (and Natural Scientists, etc.)
  - Data: are a particular random instantiation of an underlying process describing unobserved patterns in the world.
  - Goal: extract information about the world from noisy data.
  - Methodology: make inferences (perhaps about unseen events) by positing a model that describes the random variability of the data around the deterministic model.

SLIDE 7

Introduction

... are VERY different paradigms

Statistics, natural sciences, scientific computing, etc.:
  - Problems often involve computation, but the study of computation per se is secondary
  - Only makes sense to develop algorithms for well-posed problems¹
  - First, write down a model, and think about computation later

Computer science:
  - Easier to study computation per se in discrete settings, e.g., Turing machines, logic, complexity classes
  - Theory of algorithms divorces computation from data
  - First, run a fast algorithm, and ask what it means later

¹Solution exists, is unique, and varies continuously with input data.

SLIDE 8

Introduction

Context: My first stab at deep learning

SLIDE 9

Introduction

A blog about my first stab at deep learning

SLIDE 10

Introduction

A blog about my first stab at deep learning

SLIDE 11

Efficient and Effective Optimization Methods

Problem Statement

Problem 1: Composite Optimization Problem

    min_{x ∈ X ⊆ R^d} F(x) = f(x) + h(x)

f: convex and smooth; h: convex and (non-)smooth

Problem 2: Minimizing Finite Sum Problem

    min_{x ∈ X ⊆ R^d} F(x) = (1/n) ∑_{i=1}^n f_i(x)

f_i: (non-)convex and smooth; n ≫ 1

SLIDE 12

Efficient and Effective Optimization Methods

Modern “Big-Data” + Classical Optimization Algorithms ⇒ Effective but Inefficient

Need to design variants that are:

1. Efficient, i.e., low per-iteration cost
2. Effective, i.e., fast convergence rate

SLIDE 13

Efficient and Effective Optimization Methods

Scientific Computing and Machine Learning share the same challenges, and use the same means, but to get to different ends! Machine Learning has been, and continues to be, very busy designing efficient and effective optimization methods

SLIDE 14

Efficient and Effective Optimization Methods

First Order Methods

Variants of Gradient Descent (GD):

    x^(k+1) = x^(k) − α_k ∇F(x^(k))

  - Reduce the per-iteration cost of GD ⇒ efficiency
  - Achieve the convergence rate of GD ⇒ effectiveness
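As a concrete illustration, here is a minimal NumPy sketch of this update on a least-squares objective; the objective, data, and step-size choice are illustrative assumptions, not from the slides:

```python
import numpy as np

# Illustrative objective: F(x) = 0.5 * ||A x - b||^2, with gradient A^T (A x - b).
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 10))
b = rng.standard_normal(100)

def grad_F(x):
    return A.T @ (A @ x - b)

x = np.zeros(10)
alpha = 1.0 / np.linalg.norm(A, 2) ** 2   # constant step size 1/L, with L = ||A||_2^2
for k in range(500):
    x = x - alpha * grad_F(x)             # x^(k+1) = x^(k) - alpha_k * grad F(x^(k))
```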

SLIDE 15

Efficient and Effective Optimization Methods

First Order Methods

E.g.: SAG, SDCA, SVRG, Prox-SVRG, Acc-Prox-SVRG, Acc-Prox-SDCA, S2GD, mS2GD, MISO, SAGA, AMSVRG, ...

SLIDE 16

Efficient and Effective Optimization Methods

But why?

Q: Why do we use (stochastic) 1st order methods?
  - Cheaper iterations? (i.e., n ≫ 1 and/or d ≫ 1)
  - Avoiding over-fitting?

SLIDE 17

Efficient and Effective Optimization Methods

1st order method and “over-fitting”

Challenges with “simple” 1st order methods for “over-fitting”:
  - Highly sensitive to ill-conditioning
  - Very difficult to tune (many) hyper-parameters

“Over-fitting” is difficult with “simple” 1st order methods!

SLIDE 18

Efficient and Effective Optimization Methods

Remedy?

1. “Not-so-simple” 1st order methods, e.g., accelerated and adaptive methods
2. 2nd order methods, e.g.,

    x^(k+1) = x^(k) − [∇²F(x^(k))]^{-1} ∇F(x^(k))
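For concreteness, a minimal NumPy sketch of this Newton update on a strongly convex quadratic; the quadratic objective is an illustrative assumption:

```python
import numpy as np

# Illustrative objective: F(x) = 0.5 x^T Q x + q^T x, so grad F = Q x + q and
# the Hessian is the constant positive definite matrix Q.
rng = np.random.default_rng(1)
M = rng.standard_normal((10, 10))
Q = M @ M.T + np.eye(10)
q = rng.standard_normal(10)

x = np.zeros(10)
for k in range(10):
    g = Q @ x + q                  # grad F(x^(k))
    p = np.linalg.solve(Q, g)      # Newton direction: solve H p = grad F
    x = x - p                      # unit step; exact in one step for a quadratic
```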

SLIDE 19

Efficient and Effective Optimization Methods

Your Choice Of....

SLIDE 20

Efficient and Effective Optimization Methods

Which Problem?

1. “Not-so-simple” 1st order methods: FLAG n’ FLARE

   Problem 1: Composite Optimization Problem

       min_{x ∈ X ⊆ R^d} F(x) = f(x) + h(x)

   f: convex and smooth; h: convex and (non-)smooth

2. 2nd order methods: Stochastic Newton-Type Methods (Stochastic Newton, Trust Region, Cubic Regularization)

   Problem 2: Minimizing Finite Sum Problem

       min_{x ∈ X ⊆ R^d} F(x) = (1/n) ∑_{i=1}^n f_i(x)

   f_i: (non-)convex and smooth; n ≫ 1

SLIDE 21

Efficient and Effective Optimization Methods

Collaborators

FLAG n’ FLARE:
  Fred Roosta (UC Berkeley), Xiang Cheng (UC Berkeley), Stefan Palombo (UC Berkeley), Peter L. Bartlett (UC Berkeley & QUT)

Sub-Sampled Newton-Type Methods for Convex:
  Fred Roosta (UC Berkeley), Peng Xu (Stanford), Jiyan Yang (Stanford), Christopher Ré (Stanford)

Sub-Sampled Newton-Type Methods for Non-convex:
  Fred Roosta (UC Berkeley), Peng Xu (Stanford)

Implementations on GPU, etc.:
  Fred Roosta (UC Berkeley), Sudhir Kylasa (Purdue), Ananth Grama (Purdue)

SLIDE 22

First-order methods: FLAG n’ FLARE

Subgradient Method

Composite Optimization Problem

    min_{x ∈ X ⊆ R^d} F(x) = f(x) + h(x)

f: convex, (non-)smooth; h: convex, (non-)smooth

SLIDE 23

First-order methods: FLAG n’ FLARE

Subgradient Method

Algorithm 1 Subgradient Method
1: Input: x_1 and T
2: for k = 1, 2, ..., T − 1 do
3:   g_k ∈ ∂(f(x_k) + h(x_k))
4:   x_{k+1} = argmin_{x ∈ X} ⟨g_k, x⟩ + (1/(2α_k)) ||x − x_k||²
5: end for
6: Output: x̄ = (1/T) ∑_{t=1}^T x_t

α_k: step size
  - Constant step size: α_k = α
  - Diminishing step size: ∑_{k=1}^∞ α_k = ∞ and lim_{k→∞} α_k = 0
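A minimal NumPy sketch of Algorithm 1 on X = R^d, where the update in line 4 reduces to x_{k+1} = x_k − α_k g_k; the ℓ1-regularized least-squares objective and the problem sizes are illustrative assumptions:

```python
import numpy as np

# F(x) = 0.5*||Ax - b||^2 + lam*||x||_1; a subgradient is
# A^T(Ax - b) + lam*sign(x), with sign(0) = 0 picking one element of the subdifferential.
rng = np.random.default_rng(2)
A = rng.standard_normal((200, 50))
b = rng.standard_normal(200)
lam = 0.1

def subgrad(x):
    return A.T @ (A @ x - b) + lam * np.sign(x)

T = 1000
x = np.zeros(50)
x_sum = np.zeros(50)
for k in range(1, T):
    alpha_k = 1.0 / np.sqrt(k)        # diminishing step size
    x = x - alpha_k * subgrad(x)      # line 4 with X = R^d
    x_sum += x
x_bar = x_sum / (T - 1)               # averaged iterate, as in the output step
```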

SLIDE 24

First-order methods: FLAG n’ FLARE

Example: Logistic Regression

{a_i, b_i}: features and labels, a_i ∈ {0, 1}^d, b_i ∈ {0, 1}

    F(x) = ∑_{i=1}^n [ log(1 + e^{⟨a_i, x⟩}) − b_i ⟨a_i, x⟩ ]

    ∇F(x) = ∑_{i=1}^n ( 1/(1 + e^{−⟨a_i, x⟩}) − b_i ) a_i

Infrequent Features ⇒ Small Partial Derivative
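A minimal NumPy sketch of this objective and gradient; the random binary feature matrix is an illustrative assumption:

```python
import numpy as np

# Rows of A are the feature vectors a_i in {0,1}^d; b holds labels in {0,1}.
rng = np.random.default_rng(3)
A = (rng.random((100, 20)) < 0.1).astype(float)   # sparse binary features
b = rng.integers(0, 2, 100).astype(float)

def F(x):
    z = A @ x
    return np.sum(np.logaddexp(0.0, z) - b * z)   # log(1 + e^z), computed stably

def grad_F(x):
    z = A @ x
    return A.T @ (1.0 / (1.0 + np.exp(-z)) - b)   # sum_i (sigma(z_i) - b_i) a_i
```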

SLIDE 25

First-order methods: FLAG n’ FLARE

predictive vs. irrelevant features

Very infrequent features ⇒ highly predictive (e.g., “CANON” in document classification)
Very frequent features ⇒ highly irrelevant (e.g., “and” in document classification)

SLIDE 26

First-order methods: FLAG n’ FLARE

AdaGrad [Duchi et al., 2011]

Frequent features ⇒ large partial derivative ⇒ learning rate ↓
Infrequent features ⇒ small partial derivative ⇒ learning rate ↑

Replace α_k with an adaptively formed scaling matrix...

Many follow-up works: RMSProp, Adam, Adadelta, etc.

SLIDE 27

First-order methods: FLAG n’ FLARE

AdaGrad [Duchi et al., 2011]

Algorithm 2 AdaGrad
1: Input: x_1, η and T
2: for k = 1, 2, ..., T − 1 do
3:   g_k ∈ ∂f(x_k)
4:   Form scaling matrix S_k based on {g_t; t = 1, ..., k}
5:   x_{k+1} = argmin_{x ∈ X} ⟨g_k, x⟩ + h(x) + (1/2)(x − x_k)^T S_k (x − x_k)
6: end for
7: Output: x̄ = (1/T) ∑_{t=1}^T x_t
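A minimal NumPy sketch of the common diagonal special case of Algorithm 2 (h = 0, X = R^d), where the scaling reduces to a per-coordinate step built from the running sum of squared gradients; the least-squares objective is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((200, 50))
b = rng.standard_normal(200)

def grad(x):
    return A.T @ (A @ x - b)

eta = 0.5
x = np.zeros(50)
G = np.zeros(50)                                 # running sum of squared gradients
for k in range(500):
    g = grad(x)
    G += g * g
    x = x - eta * g / (np.sqrt(G) + 1e-12)       # per-coordinate adaptive step
```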

SLIDE 28

First-order methods: FLAG n’ FLARE

Convergence

Convergence. Let x* be an optimum point. We have:

AdaGrad [Duchi et al., 2011]:

    F(x̄) − F(x*) ≤ O( √d D_∞ α / √T ),

where α ∈ [1/√d, 1] and D_∞ = max_{x,y ∈ X} ||y − x||_∞, and

Subgradient descent:

    F(x̄) − F(x*) ≤ O( D_2 / √T ),

where D_2 = max_{x,y ∈ X} ||y − x||_2.

SLIDE 29

First-order methods: FLAG n’ FLARE

Comparison

Competitive factor: √d D_∞ α / D_2

D_∞ and D_2 depend on the geometry of X, e.g., if X = {x; ||x||_∞ ≤ 1} then D_2 = √d D_∞.

    α = ( ∑_{i=1}^d √( ∑_{t=1}^T [g_t]_i² ) ) / √( d ∑_{t=1}^T ||g_t||² )

depends on {g_t; t = 1, ..., T}.

SLIDE 30

First-order methods: FLAG n’ FLARE

Improving the T dependence

Problem 1: Composite Optimization Problem

    min_{x ∈ X ⊆ R^d} F(x) = f(x) + h(x)

f: convex and smooth (with L-Lipschitz gradient); h: convex and (non-)smooth

  - Subgradient methods: O(1/√T)
  - ISTA: O(1/T)
  - FISTA [Beck and Teboulle, 2009]: O(1/T²)

SLIDE 31

First-order methods: FLAG n’ FLARE

Best of both worlds?

Accelerated gradient methods ⇒ optimal rate, e.g., 1/T² vs. 1/T vs. 1/√T

Adaptive gradient methods ⇒ better constant: √d D_∞ α vs. D_2

How about accelerated and adaptive gradient methods?

SLIDE 32

First-order methods: FLAG n’ FLARE

FLAG: Fast Linearly-Coupled Adaptive Gradient Method
FLARE: FLAg RElaxed

SLIDE 33

First-order methods: FLAG n’ FLARE

FLAG [CRPBM, 2016]

Algorithm 3 FLAG
1: Input: x_0 = y_0 = z_0 and L
2: for k = 1, 2, ..., T do
3:   y_{k+1} = Prox(x_k)
4:   Gradient mapping: g_k = −L(y_{k+1} − x_k)
5:   Form S_k based on {g_t / ||g_t||; t = 1, ..., k}
6:   Compute η_k
7:   z_{k+1} = argmin_{z ∈ X} η_k ⟨g_k, z − z_k⟩ + (1/2)(z − z_k)^T S_k (z − z_k)
8:   x_k = LinearlyCouple(y_{k+1}, z_{k+1})
9: end for
10: Output: y_{T+1}

where

    Prox(x_k) := argmin_{x ∈ X} ⟨∇f(x_k), x⟩ + h(x) + (L/2) ||x − x_k||₂²

SLIDE 34

First-order methods: FLAG n’ FLARE

FLAG Simplified

Algorithm 4 Bird’s-Eye View of FLAG
1: Input: x_0
2: for k = 1, 2, ..., T do
3:   y_k: usual gradient step
4:   Form gradient history
5:   z_k: scaled gradient step
6:   Find mixing weight w via binary search
7:   x_{k+1} = (1 − w) y_{k+1} + w z_{k+1}
8: end for
9: Output: y_{T+1}
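A structural NumPy sketch of this loop, only to show the data flow: the gradient-history scaling mirrors the AdaGrad-style step, and the constant mixing weight w stands in for the binary search. All of it is an illustrative assumption, not the actual FLAG implementation:

```python
import numpy as np

def flag_skeleton(grad, x0, L, T, eta=0.5, w=0.5):
    x, z = x0.copy(), x0.copy()
    G = np.zeros_like(x0)                        # history of normalized gradients
    for k in range(T):
        g = grad(x)
        y = x - g / L                            # usual gradient step
        g_tilde = g / (np.linalg.norm(g) + 1e-12)
        G += g_tilde ** 2
        z = z - eta * g / (np.sqrt(G) + 1e-12)   # scaled gradient step
        x = (1 - w) * y + w * z                  # linear coupling with weight w
    return y                                     # output the y-iterate
```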

SLIDE 35

First-order methods: FLAG n’ FLARE

Convergence

Convergence. Let x* be an optimum point. We have:

FLAG [CRPBM, 2016]:

    F(x̄) − F(x*) ≤ O( d D_∞² β / T² ),

where β ∈ [1/d, 1] and D_∞ = max_{x,y ∈ X} ||y − x||_∞, and

FISTA [Beck and Teboulle, 2009]:

    F(x̄) − F(x*) ≤ O( D_2² / T² ),

where D_2 = max_{x,y ∈ X} ||y − x||_2.

SLIDE 36

First-order methods: FLAG n’ FLARE

Comparison

Competitive factor: d D_∞² β / D_2²

D_∞ and D_2 depend on the geometry of X, e.g., if X = {x; ||x||_∞ ≤ 1} then D_2 = √d D_∞.

    β = ( ∑_{i=1}^d √( ∑_{t=1}^T [g̃_t]_i² ) )² / (d T)

depends on {g̃_t := g_t / ||g_t||; t = 1, ..., T}.

SLIDE 37

First-order methods: FLAG n’ FLARE

Linear Coupling

Linearly couple (y_{k+1}, z_{k+1}) via an “ǫ-binary search”: find an ǫ-approximation to the root of the nonlinear equation

    ⟨Prox(t y + (1 − t) z) − (t y + (1 − t) z), y − z⟩ = 0,

where

    Prox(x) := argmin_{y ∈ C} h(y) + (L/2) ||y − (x − (1/L) ∇f(x))||₂².

  - At most log(1/ǫ) steps using bisection
  - At most 2 + log(1/ǫ) Prox evals per iteration more than FISTA
  - Can be expensive!

SLIDE 38

First-order methods: FLAG n’ FLARE

Linear Coupling

Linearly approximate:

    ⟨t Prox(y) + (1 − t) Prox(z) − (t y + (1 − t) z), y − z⟩ = 0.

This is a linear equation in t, so it has a closed-form solution:

    t = ⟨z − Prox(z), y − z⟩ / ⟨(z − Prox(z)) − (y − Prox(y)), y − z⟩

  - At most 2 Prox evals per iteration more than FISTA
  - Equivalent to ǫ-binary search with ǫ = 1/3
  - Better, but might not be good enough!

SLIDE 39

First-order methods: FLAG n’ FLARE

FLARE: FLAg RElaxed

Basic idea: choose the mixing weight by an intelligent “futuristic” guess.

Guess now and, at the next iteration, correct if the guess was wrong.

FLARE: exactly the same number of Prox evals per iteration as FISTA!
FLARE: has a similar theoretical guarantee to FLAG!

SLIDE 40

First-order methods: FLAG n’ FLARE

    L(x_1, x_2, ..., x_C) = ∑_{i=1}^n ∑_{c=1}^C −1(b_i = c) log( e^{⟨a_i, x_c⟩} / (1 + ∑_{b=1}^{C−1} e^{⟨a_i, x_b⟩}) )

                          = ∑_{i=1}^n [ log( 1 + ∑_{c=1}^{C−1} e^{⟨a_i, x_c⟩} ) − ∑_{c=1}^{C−1} 1(b_i = c) ⟨a_i, x_c⟩ ]
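A minimal NumPy sketch of this multi-class logistic loss, with class C as the reference class (its parameter vector fixed to zero); the random data are an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, C = 500, 40, 5
A = rng.standard_normal((n, d))                  # rows are a_i
labels = rng.integers(1, C + 1, n)               # b_i in {1, ..., C}

def loss(X):
    """X: d x (C-1) matrix whose columns are x_1, ..., x_{C-1}."""
    Z = A @ X                                    # logits <a_i, x_c>, c < C
    # log(1 + sum_c e^{z_c}) as a stable logsumexp over [0, z_1, ..., z_{C-1}]
    lse = np.logaddexp.reduce(np.column_stack([np.zeros(n), Z]), axis=1)
    hit = np.where(labels[:, None] == np.arange(1, C)[None, :], Z, 0.0).sum(axis=1)
    return np.sum(lse - hit)
```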

SLIDE 41

First-order methods: FLAG n’ FLARE

Classification: 20 Newsgroups

Prediction across 20 different newsgroups

Data            Train Size   Test Size   d        Classes
20 Newsgroups   10,142       1,127       53,975   20

    min_{||x||_∞ ≤ 1} L(x_1, x_2, ..., x_C)

SLIDE 42

First-order methods: FLAG n’ FLARE

Classification: 20 Newsgroups

SLIDE 43

First-order methods: FLAG n’ FLARE

Classification: Forest CoverType

Predicting forest cover type from cartographic variables

Data               Train Size   Test Size   d    Classes
Forest CoverType   435,759      145,253     54   7

    min_{x ∈ R^d} L(x_1, x_2, ..., x_C) + λ ||x||_1

SLIDE 44

First-order methods: FLAG n’ FLARE

Classification: Forest CoverType

SLIDE 45

First-order methods: FLAG n’ FLARE

Regression: BlogFeedback

Prediction of the number of comments in the next 24 hours for blogs

Data           Train Size   Test Size   d
BlogFeedback   47,157       5,240       280

    min_{x ∈ R^d} (1/2) ||Ax − b||₂² + λ ||x||_1

SLIDE 46

First-order methods: FLAG n’ FLARE

Regression: BlogFeedback

SLIDE 47

Second-order methods: Stochastic Newton-Type Methods

2nd order methods: Stochastic Newton-Type Methods
  - Stochastic Newton (think: convex)
  - Stochastic Trust Region (think: non-convex)
  - Stochastic Cubic Regularization (think: non-convex)

Problem 2: Minimizing Finite Sum Problem

    min_{x ∈ X ⊆ R^d} F(x) = (1/n) ∑_{i=1}^n f_i(x)

f_i: (non-)convex and smooth; n ≫ 1

SLIDE 48

Second-order methods: Stochastic Newton-Type Methods

Second Order Methods

  - Use both gradient and Hessian information
  - Fast convergence rate
  - Resilient to ill-conditioning
  - They “over-fit” nicely!
  - However, per-iteration cost is high!

SLIDE 49

Second-order methods: Stochastic Newton-Type Methods

Sensorless Drive Diagnosis

n = 50,000, p = 528, no. of classes = 11, λ = 0.0001

Figure: Test Accuracy

SLIDE 50

Second-order methods: Stochastic Newton-Type Methods

Sensorless Drive Diagnosis

n = 50,000, p = 528, no. of classes = 11, λ = 0.0001

Figure: Time/Iteration

SLIDE 51

Second-order methods: Stochastic Newton-Type Methods

Second Order Methods

Deterministically approximating second order information cheaply:
  - Quasi-Newton, e.g., BFGS and L-BFGS [Nocedal, 1980]

Randomly approximating second order information cheaply:
  - Sub-sampling the Hessian [Byrd et al., 2011; Erdogdu et al., 2015; Martens, 2010; RM-I, RM-II; XYRRM, 2016; Bollapragada et al., 2016; ...]
  - Sketching the Hessian [Pilanci et al., 2015]
  - Sub-sampling the Hessian and the gradient [RM-I & RM-II, 2016; Bollapragada et al., 2016; ...]

SLIDE 52

Second-order methods: Stochastic Newton-Type Methods

Iterative Scheme

    x^(k+1) = argmin_{x ∈ D ∩ X} { F(x^(k)) + (x − x^(k))^T g(x^(k)) + (1/(2α_k)) (x − x^(k))^T H(x^(k)) (x − x^(k)) }

SLIDE 53

Second-order methods: Stochastic Newton-Type Methods

Hessian Sub-Sampling

    g(x) = ∇F(x),    H(x) = (1/|S|) ∑_{j ∈ S} ∇²f_j(x)
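A minimal NumPy sketch of this estimator, plus one Newton-type step built from it, for the illustrative finite sum f_i(x) = (1/2)(a_i^T x − b_i)², whose Hessian is a_i a_i^T; the data, sample size, and small ridge term are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 10000, 20
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def full_gradient(x):
    return A.T @ (A @ x - b) / n                 # g(x) = grad F(x), exact

def subsampled_hessian(x, s):
    S = rng.choice(n, size=s, replace=False)     # uniform sample of indices
    As = A[S]
    return As.T @ As / s                         # (1/|S|) sum_{j in S} a_j a_j^T

x = np.zeros(d)
H = subsampled_hessian(x, s=200)
g = full_gradient(x)
x = x - np.linalg.solve(H + 1e-8 * np.eye(d), g) # one sub-sampled Newton step
```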

SLIDE 54

Second-order methods: Stochastic Newton-Type Methods

First, let’s consider the convex case....

SLIDE 55

Second-order methods: Stochastic Newton-Type Methods

Convex Problems

  - Each f_i is smooth and weakly convex
  - F is γ-strongly convex

SLIDE 56

Second-order methods: Stochastic Newton-Type Methods

“We want to design methods for machine learning that are not as ideal as Newton’s method but have [these] properties: first of all, they tend to turn towards the right directions and they have the right length, [i.e.,] the step size of one is going to be working most of the time...and we have to have an algorithm that scales up for machine learning.”

  • Prof. Jorge Nocedal

IPAM Summer School, 2012 Tutorial on Optimization Methods for ML (Video - Part I: 50’ 03”)

SLIDE 57

Second-order methods: Stochastic Newton-Type Methods

What do we need?

Requirements:

(R.1) Scale up: |S| must be independent of n, or at least smaller than n, and for p ≫ 1, allow for inexactness.
(R.2) Turn to right directions: H(x) must preserve the spectrum of ∇²F(x) as much as possible.
(R.3) Not ideal but close: fast local convergence rate, close to that of Newton.
(R.4) Right step length: unit step length eventually works.

SLIDE 62

Second-order methods: Stochastic Newton-Type Methods

Sub-sampling Hessian

Requirements:

(R.1) Scale up: |S| must be independent of n, or at least smaller than n, and for p ≫ 1, allow for inexactness.
(R.2) Turn to right directions: H(x) must preserve the spectrum of ∇²F(x) as much as possible.
(R.3) Not ideal but close: fast local convergence rate, close to that of Newton.
(R.4) Right step length: unit step length eventually works.

SLIDE 63

Second-order methods: Stochastic Newton-Type Methods

Sub-sampling Hessian

Lemma (Uniform Hessian Sub-Sampling)
Given any 0 < ǫ < 1, 0 < δ < 1, and x ∈ R^p, if

    |S| ≥ 2κ² ln(2p/δ) / ǫ²,

then

    Pr( (1 − ǫ) ∇²F(x) ⪯ H(x) ⪯ (1 + ǫ) ∇²F(x) ) ≥ 1 − δ.
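The sample-size bound translates directly into code; a tiny sketch (the numerical inputs are illustrative assumptions):

```python
import math

# |S| >= 2 * kappa^2 * ln(2p / delta) / eps^2, from the lemma above.
def hessian_sample_size(kappa, p, eps, delta):
    return math.ceil(2 * kappa**2 * math.log(2 * p / delta) / eps**2)

print(hessian_sample_size(kappa=10, p=100, eps=0.1, delta=0.01))  # required |S|
```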

SLIDE 65

Second-order methods: Stochastic Newton-Type Methods

Error Recursion: Hessian Sub-Sampling

Theorem (Error Recursion)
Using α_k = 1, with high probability we have

    ||x^(k+1) − x*|| ≤ ρ_0 ||x^(k) − x*|| + ξ ||x^(k) − x*||²,

where ρ_0 = ǫ / (1 − ǫ) and ξ = L / (2(1 − ǫ)γ).

ρ_0 is problem-independent! ⇒ Can be made arbitrarily small!

SLIDE 66

Second-order methods: Stochastic Newton-Type Methods

SSN-H: Q-Linear Convergence

Theorem (Q-Linear Convergence)
Consider any 0 < ρ_0 < ρ < 1 and ǫ ≤ ρ_0 / (1 + ρ_0). If

    ||x^(0) − x*|| ≤ (ρ − ρ_0) / ξ,

we get locally Q-linear convergence,

    ||x^(k) − x*|| ≤ ρ ||x^(k−1) − x*||,  k = 1, ..., k_0,

with high probability. It is possible to get a superlinear rate as well.

SLIDE 68

Second-order methods: Stochastic Newton-Type Methods

Sub-Sampling Hessian

Lemma (Uniform Hessian Sub-Sampling)
Given any 0 < ǫ < 1, 0 < δ < 1, and x ∈ R^p, if

    |S| ≥ 2κ ln(p/δ) / ǫ²,

then

    Pr( (1 − ǫ)γ ≤ λ_min(H(x)) ) ≥ 1 − δ.

SLIDE 69

Second-order methods: Stochastic Newton-Type Methods

SSN-H: Inexact Update

Assume X = R^p.

Descent direction:

    ||H(x^(k)) p_k + ∇F(x^(k))|| ≤ θ_1 ||∇F(x^(k))||

Step size:

    α_k = arg max α  s.t.  α ≤ 1  and  F(x^(k) + α p_k) ≤ F(x^(k)) + α β p_k^T ∇F(x^(k))

Update:

    x^(k+1) = x^(k) + α_k p_k

with 0 < β, θ_1, θ_2 < 1.
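A minimal sketch of the backtracking (Armijo-style) step-size rule above: start from the unit step and shrink until sufficient decrease holds; the halving schedule is an illustrative assumption:

```python
import numpy as np

def armijo_step(F, x, p, g, beta=1e-4, shrink=0.5, max_backtracks=50):
    """Largest alpha <= 1 along a halving schedule with
    F(x + alpha p) <= F(x) + alpha * beta * p^T g."""
    alpha, fx = 1.0, F(x)
    for _ in range(max_backtracks):
        if F(x + alpha * p) <= fx + alpha * beta * (p @ g):
            return alpha
        alpha *= shrink
    return alpha
```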

SLIDE 70

Second-order methods: Stochastic Newton-Type Methods

SSN-H Algorithm: Inexact Update

Algorithm 5 Globally Convergent SSN-H with Inexact Solve
1: Input: x^(0), 0 < δ < 1, 0 < ǫ < 1, 0 < β, θ_1, θ_2 < 1
2: Set the sample size |S| using ǫ and δ
3: for k = 0, 1, 2, ... until termination do
4:   Select a sample set S of size |S| and form H(x^(k))
5:   Update x^(k+1) with H(x^(k)) and inexact solve
6: end for

SLIDE 71

Second-order methods: Stochastic Newton-Type Methods

Global Convergence of SSN-H: Inexact Update

Theorem (Global Convergence of Algorithm 5)
Using Algorithm 5 with θ_1 ≈ 1/√κ, with high probability we have

    F(x^(k+1)) − F(x*) ≤ (1 − ρ) ( F(x^(k)) − F(x*) ),

where ρ = α_k β / κ and α_k ≥ 2(1 − θ_2)(1 − β)(1 − ǫ) / κ.

SLIDE 72

Second-order methods: Stochastic Newton-Type Methods

Local + Global

Theorem
For any ρ < 1 and ǫ ≈ ρ/√κ, Algorithm 5 is globally convergent and, after O(κ²) iterations, with high probability achieves “problem-independent” Q-linear convergence, i.e.,

    ||x^(k+1) − x*|| ≤ ρ ||x^(k) − x*||.

Moreover, the step size α_k = 1 passes the Armijo rule for all subsequent iterations.

SLIDE 73

Second-order methods: Stochastic Newton-Type Methods

“Any optimization algorithm for which the unit step length works has some wisdom. It is too much of a fluke if the unit step length [accidentally] works.”

  • Prof. Jorge Nocedal

IPAM Summer School, 2012 Tutorial on Optimization Methods for ML (Video - Part I: 56’ 32”)

SLIDE 74

Second-order methods: Stochastic Newton-Type Methods

So far these efforts mostly treated convex problems.... Now, it is time for non-convexity!

SLIDE 75

Second-order methods: Stochastic Newton-Type Methods

Non-Convex Is Hard!

  - Saddle points, local minima, local maxima
  - Optimization of a degree-four polynomial: NP-hard [Hillar et al., 2013]
  - Checking whether a point is not a local minimum: NP-complete [Murty et al., 1987]

SLIDE 76

Second-order methods: Stochastic Newton-Type Methods

All convex problems are the same, while every non-convex problem is different.

Not sure whose quote this is!

SLIDE 77

Second-order methods: Stochastic Newton-Type Methods

(ǫ_g, ǫ_H)-Optimality:

    ||∇F(x)|| ≤ ǫ_g,    λ_min(∇²F(x)) ≥ −ǫ_H

SLIDE 78

Second-order methods: Stochastic Newton-Type Methods

Trust Region: classical method for non-convex problems [Sorensen, 1982; Conn et al., 2000]

    s^(k) = argmin_{||s|| ≤ ∆_k} ⟨s, ∇F(x^(k))⟩ + (1/2) ⟨s, ∇²F(x^(k)) s⟩

Cubic Regularization: more recent method for non-convex problems [Griewank, 1981; Nesterov et al., 2006; Cartis et al., 2011a; Cartis et al., 2011b]

    s^(k) = argmin_{s ∈ R^d} ⟨s, ∇F(x^(k))⟩ + (1/2) ⟨s, ∇²F(x^(k)) s⟩ + (σ_k/3) ||s||³

SLIDE 79

Second-order methods: Stochastic Newton-Type Methods

To get iteration complexity, all previous work required

    ||( H(x^(k)) − ∇²F(x^(k)) ) s^(k)|| ≤ C ||s^(k)||²    (1)

This is stronger than Dennis-Moré:

    lim_{k→∞} ||( H(x^(k)) − ∇²F(x^(k)) ) s^(k)|| / ||s^(k)|| = 0.

We relaxed (1) to

    ||( H(x^(k)) − ∇²F(x^(k)) ) s^(k)|| ≤ ǫ ||s^(k)||    (2)

Quasi-Newton, sketching, and sub-sampling satisfy Dennis-Moré and (2), but not necessarily (1).

SLIDE 80

Second-order methods: Stochastic Newton-Type Methods

Recall...

    F(x) = (1/n) ∑_{i=1}^n f_i(x)

SLIDE 81

Second-order methods: Stochastic Newton-Type Methods

Lemma (Complexity of Uniform Sampling)
Suppose ||∇²f_i(x)|| ≤ K for all i. Given any 0 < ǫ < 1, 0 < δ < 1, and x ∈ R^d, if

    |S| ≥ (16 K² / ǫ²) log(2d/δ),

then for H(x) = (1/|S|) ∑_{j ∈ S} ∇²f_j(x) we have

    Pr( ||H(x) − ∇²F(x)|| ≤ ǫ ) ≥ 1 − δ.

Only the top eigenvalues/eigenvectors need to be preserved.

SLIDE 82

Second-order methods: Stochastic Newton-Type Methods

    F(x) = (1/n) ∑_{i=1}^n f_i(a_i^T x)

    p_i = |f_i''(a_i^T x)| ||a_i||₂² / ∑_{j=1}^n |f_j''(a_j^T x)| ||a_j||₂²
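A minimal NumPy sketch of these sampling probabilities for the illustrative choice f_i(z) = log(1 + e^z) (logistic loss, with f''(z) = σ(z)(1 − σ(z))); the random data are an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((1000, 30))              # rows are a_i
x = rng.standard_normal(30)

z = A @ x
sig = 1.0 / (1.0 + np.exp(-z))
curvature = sig * (1.0 - sig)                    # |f_i''(a_i^T x)|
row_norms2 = np.sum(A * A, axis=1)               # ||a_i||_2^2
p = curvature * row_norms2
p /= p.sum()                                     # p_i as in the formula above
S = rng.choice(len(p), size=100, replace=True, p=p)  # non-uniform sample
```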

SLIDE 83

Second-order methods: Stochastic Newton-Type Methods

Lemma (Complexity of Non-Uniform Sampling)
Suppose ||∇²f_i(x)|| ≤ K_i, i = 1, 2, ..., n. Given any 0 < ǫ < 1, 0 < δ < 1, and x ∈ R^d, if

    |S| ≥ (16 K̄² / ǫ²) log(2d/δ),

then for H(x) = (1/|S|) ∑_{j ∈ S} (1/(n p_j)) ∇²f_j(x) we have

    Pr( ||H(x) − ∇²F(x)|| ≤ ǫ ) ≥ 1 − δ,

where K̄ = (1/n) ∑_{i=1}^n K_i.

SLIDE 84

Second-order methods: Stochastic Newton-Type Methods

Non-Convex Problems

Algorithm 6 Stochastic Trust-Region Algorithm
1: Input: x_0, ∆_0 > 0, η ∈ (0, 1), γ > 1, 0 < ǫ, ǫ_g, ǫ_H < 1
2: for k = 0, 1, 2, ... until termination do
3:   s_k ≈ argmin_{||s|| ≤ ∆_k} m_k(s) := ∇F(x^(k))^T s + (1/2) s^T H(x^(k)) s
4:   ρ_k := ( F(x^(k) + s_k) − F(x^(k)) ) / m_k(s_k)
5:   if ρ_k ≥ η then
6:     x^(k+1) = x^(k) + s_k and ∆_{k+1} = γ ∆_k
7:   else
8:     x^(k+1) = x^(k) and ∆_{k+1} = γ^{-1} ∆_k
9:   end if
10: end for
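A minimal NumPy sketch of one iteration of this loop, solving the subproblem in line 3 approximately with the Cauchy point (the minimizer of the model along −∇F inside the ball), which is one standard cheap choice; this surrogate solver is an illustrative assumption, not the paper's method:

```python
import numpy as np

def tr_step(F, grad, subsampled_hess, x, Delta, eta=0.1, gamma=2.0):
    g = grad(x)
    gn = np.linalg.norm(g)
    if gn < 1e-12:
        return x, Delta                          # (near-)first-order stationary
    H = subsampled_hess(x)
    gHg = g @ H @ g
    # Cauchy point: minimize m_k(-t*g) subject to t*||g|| <= Delta.
    t = Delta / gn if gHg <= 0 else min(g @ g / gHg, Delta / gn)
    s = -t * g
    m = g @ s + 0.5 * s @ H @ s                  # model value m_k(s_k) < 0
    rho = (F(x + s) - F(x)) / m                  # line 4
    if rho >= eta:
        return x + s, gamma * Delta              # accept step, expand radius
    return x, Delta / gamma                      # reject step, shrink radius
```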

SLIDE 85

Second-order methods: Stochastic Newton-Type Methods

Theorem (Complexity of Stochastic TR)
If ǫ ∈ O(ǫ_H), then Stochastic TR terminates after

    T ∈ O( max{ ǫ_g^{−2} ǫ_H^{−1}, ǫ_H^{−3} } )

iterations, upon which, with high probability, we have

    ||∇F(x)|| ≤ ǫ_g  and  λ_min(∇²F(x)) ≥ −(ǫ + ǫ_H).

This is tight!

SLIDE 86

Second-order methods: Stochastic Newton-Type Methods

Non-Convex Problems

Algorithm 7 Stochastic Adaptive Regularization with Cubics (ARC)
1: Input: x_0, σ_0 > 0, η ∈ (0, 1), γ > 1, 0 < ǫ, ǫ_g, ǫ_H < 1
2: for k = 0, 1, 2, ... until termination do
3:   s_k ≈ argmin_{s ∈ R^d} m_k(s) := ∇F(x^(k))^T s + (1/2) s^T H(x^(k)) s + (σ_k/3) ||s||³
4:   ρ_k := ( F(x^(k) + s_k) − F(x^(k)) ) / m_k(s_k)
5:   if ρ_k ≥ η then
6:     x^(k+1) = x^(k) + s_k and σ_{k+1} = γ^{-1} σ_k
7:   else
8:     x^(k+1) = x^(k) and σ_{k+1} = γ σ_k
9:   end if
10: end for
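A minimal NumPy sketch of one iteration, solving the cubic subproblem approximately along −∇F (a Cauchy-like step whose optimal length solves a scalar quadratic); this one-dimensional surrogate is an illustrative assumption, not the paper's subproblem solver:

```python
import numpy as np

def arc_step(F, grad, subsampled_hess, x, sigma, eta=0.1, gamma=2.0):
    g = grad(x)
    gn = np.linalg.norm(g)
    if gn < 1e-12:
        return x, sigma                          # (near-)first-order stationary
    H = subsampled_hess(x)
    # Along s = -t*g: m(t) = -t*gn^2 + (t^2/2) g^T H g + (sigma/3) t^3 gn^3;
    # m'(t) = 0 is a quadratic in t with exactly one positive root.
    bq, aq = g @ H @ g, sigma * gn**3
    t = (-bq + np.sqrt(bq * bq + 4 * aq * gn**2)) / (2 * aq)
    s = -t * g
    m = g @ s + 0.5 * s @ H @ s + sigma / 3 * np.linalg.norm(s)**3
    rho = (F(x + s) - F(x)) / m                  # line 4
    if rho >= eta:
        return x + s, sigma / gamma              # success: relax regularization
    return x, gamma * sigma                      # failure: regularize more
```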

SLIDE 87

Second-order methods: Stochastic Newton-Type Methods

Theorem (Complexity of Stochastic ARC)
If ǫ ∈ O(ǫ_g, ǫ_H), then Stochastic ARC terminates after

    T ∈ O( max{ ǫ_g^{−3/2}, ǫ_H^{−3} } )

iterations, upon which, with high probability, we have

    ||∇F(x)|| ≤ ǫ_g  and  λ_min(∇²F(x)) ≥ −(ǫ + ǫ_H).

This is tight!

SLIDE 88

Second-order methods: Stochastic Newton-Type Methods

For ǫ_H² = ǫ_g = ǫ = ǫ_0:

  - Stochastic TR: T ∈ O(ǫ_0^{−3})
  - Stochastic ARC: T ∈ O(ǫ_0^{−3/2})

SLIDE 89

Second-order methods: Stochastic Newton-Type Methods

Non-Linear Least Squares

    min_{x ∈ R^d} (1/n) ∑_{i=1}^n ( b_i − Φ(a_i^T x) )²

SLIDE 90

Second-order methods: Stochastic Newton-Type Methods

Non-Linear Least Squares: synthetic, n = 1,000,000, d = 1,000, s = 1%

(a) Train loss vs. time  (b) Train loss vs. time

SLIDE 91

Second-order methods: Stochastic Newton-Type Methods

“Preliminary results” (1 of 5)

resiliency to problem ill-conditioning

SLIDE 92

Second-order methods: Stochastic Newton-Type Methods

“Preliminary results” (2 of 5)

good generalization error and robustness to hyper-parameter tuning

SLIDE 93

Second-order methods: Stochastic Newton-Type Methods

“Preliminary results” (3 of 5)

ability to escape undesirable saddle-points

SLIDE 94

Second-order methods: Stochastic Newton-Type Methods

“Preliminary results” (4 of 5)

low-communication costs in distributed settings

γ = 10⁻³: q = 8;  γ = 10⁻⁴: q = 26;  γ = 10⁻⁵: q = 72

SLIDE 95

Second-order methods: Stochastic Newton-Type Methods

“Preliminary results” (5 of 5)

computational advantages offered by leveraging the power of GPUs

SLIDE 96

Conclusion

Conclusions: Second order machine learning

Second order methods
  - A simple way to go beyond first order methods
  - Obviously, don’t be naïve about the details

FLAG n’ FLARE
  - Combine acceleration and adaptivity to get the best of both worlds

Can aggressively sub-sample gradient and/or Hessian
  - Improve running time at each step
  - Maintain strong second-order convergence

Apply to non-convex problems
  - Trust region methods and cubic regularization methods
  - Converge to second order stationary points
  - Quite promising “preliminary results” in ML/DA applications
