(recent advancements in) Optimization in the "Big Data" Regime: presentation transcript, Sham M. Kakade

SLIDE 1

(recent advancements in) Optimization in the "Big Data" Regime

Sham M. Kakade
Computer Science & Engineering, Statistics
University of Washington

SLIDE 2

Machine Learning, Optimization, and more...

ML is having a profound impact: speech recognition (Siri, Echo), computer vision (ImageNet), game playing (AlphaGo), robotics (self-driving cars?), personalized health care, music recommendation (Spotify), ... Optimization underlies machine learning. How can we optimize faster?

SLIDE 3

Machine Learning and the Big Data Regime...

Goal: find a d-dimensional parameter vector which minimizes the loss on n training examples.

have n training examples (x_1, y_1), ..., (x_n, y_n)
have a parametric classifier h(x, w), where w is d-dimensional

$$\min_w \sum_{i=1}^n \mathrm{loss}(h(x_i, w), y_i)$$

"Big Data Regime": How do you optimize this when n and d are large? Memory? Parallelization? Can we obtain linear-time algorithms?

SLIDE 4

This tutorial:

Part 1: convexity: regression (and more...) (optimization for prediction)
Part 2: non-convexity: PCA (and more...) (optimization for representation)
Part 3: statistics (what we care about)
Part 4: thoughts and open problems: parallelization, second order methods, non-convexity, ...

SLIDE 5

Part 1: Least Squares

$$\min_w \sum_{i=1}^n (w \cdot x_i - y_i)^2 + \lambda \|w\|^2$$

How much computation time is required to get ε accuracy? n points, d dimensions. "Big Data Regime": How do you optimize this when n and d are large? Aside: think of x as a large feature representation.

SLIDE 6

Review: Direct Solution

$$\min_w \sum_{i=1}^n (w \cdot x_i - y_i)^2 + \lambda \|w\|^2$$

Solution: $w = (X^\top X + \lambda I)^{-1} X^\top Y$, where X is the n × d matrix whose rows are the x_i, and Y is an n-dim vector. Time complexity O(nd^2) and memory O(d^2). Not feasible due to both time and memory.
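A minimal numpy sketch of this direct solve (the function name and arguments are illustrative, not from the slides):

```python
import numpy as np

def ridge_direct(X, Y, lam):
    """Direct ridge solve: w = (X^T X + lam*I)^{-1} X^T Y.
    Forming X^T X costs O(n d^2) time and O(d^2) memory."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
```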

SLIDE 7

Review: Gradient Descent (and Conjugate GD)

$$\min_w \sum_{i=1}^n (w \cdot x_i - y_i)^2 + \lambda \|w\|^2$$

n points, d dimensions; λ_max, λ_min are the eigenvalues of the "design/data matrix". Computation time to get ε accuracy:

Gradient Descent (GD): $\frac{\lambda_{\max}}{\lambda_{\min}} \, nd \log(1/\epsilon)$
Conjugate Gradient Descent: $\sqrt{\frac{\lambda_{\max}}{\lambda_{\min}}} \, nd \log(1/\epsilon)$

memory: O(d). Better runtime and memory, but still costly.
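A gradient-descent sketch for this objective (step size and iteration count are left as assumptions; a step size below 1/λ_max keeps the iteration stable):

```python
import numpy as np

def ridge_gd(X, Y, lam, eta, iters):
    """Full-batch gradient descent on sum_i (w.x_i - y_i)^2 + lam*||w||^2.
    Each iteration costs O(nd); memory is O(d)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = 2 * X.T @ (X @ w - Y) + 2 * lam * w
        w -= eta * grad
    return w
```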

SLIDE 8

Review: Stochastic Gradient Descent (SGD)

SGD update rule: at each time t,

sample a point (x_i, y_i) and update: $w \leftarrow w - \eta (w \cdot x_i - y_i) x_i$

Problem: even if w = w∗, the update changes w. Rate: the convergence rate is O(1/ε), with decaying η. A simple algorithm, light on memory, but with a poor convergence rate.
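A sketch of the SGD update above (the 1/√t step-size decay is one common choice, assumed here):

```python
import numpy as np

def lsq_sgd(X, Y, eta0, steps, seed=0):
    """SGD on least squares: one random point per step, O(d) work."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, steps + 1):
        i = rng.integers(n)
        eta = eta0 / np.sqrt(t)  # decaying step size
        w -= eta * (w @ X[i] - Y[i]) * X[i]
    return w
```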

SLIDE 11

Regression in the big data regime?

$$\min_w \sum_{i=1}^n (w \cdot x_i - y_i)^2 + \lambda \|w\|^2$$

How much computation time is required to get ε accuracy? "Big Data Regime": How do you optimize this when n and d are large? Convex optimization: all results can be generalized to smooth + strongly convex loss functions.

SLIDE 12

Duality (without Duality)

$$w = (X^\top X + \lambda I)^{-1} X^\top Y = X^\top (X X^\top + \lambda I)^{-1} Y := \frac{1}{\lambda} X^\top \alpha, \quad \text{where } \alpha = (I + X X^\top / \lambda)^{-1} Y.$$

Idea: let's compute the n-dim vector α, and let's do this with coordinate ascent.
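A quick numerical check of this identity (the sizes and regularizer value are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 10, 0.5
X, Y = rng.normal(size=(n, d)), rng.normal(size=n)

w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
alpha = np.linalg.solve(np.eye(n) + X @ X.T / lam, Y)  # n-dim system
assert np.allclose(w_primal, X.T @ alpha / lam)        # same w
```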

SLIDE 13

SDCA: stochastic dual coordinate ascent

$$G(\alpha_1, \alpha_2, \ldots, \alpha_n) = \|\alpha - Y\|^2 + \frac{1}{\lambda} \alpha^\top X X^\top \alpha$$

The minimizer of G(α) is $\alpha = (I + X X^\top/\lambda)^{-1} Y$. SDCA:

start with α = 0
choose a coordinate i randomly, and update: $\alpha_i = \mathrm{argmin}_z \, G(\alpha_1, \ldots, \alpha_{i-1}, z, \alpha_{i+1}, \ldots, \alpha_n)$ (easy to do, as we touch just one datapoint)
return $w = \frac{1}{\lambda} X^\top \alpha$

SLIDE 14

SDCA: the algorithm

$$G(\alpha_1, \alpha_2, \ldots, \alpha_n) = \|\alpha - Y\|^2 + \frac{1}{\lambda} \alpha^\top X X^\top \alpha$$

Start with α = 0, $w = \frac{1}{\lambda} X^\top \alpha$.

1. Choose a coordinate i randomly, and compute the difference:
$$\Delta\alpha_i = \frac{(y_i - w \cdot x_i) - \alpha_i}{1 + \|x_i\|^2/\lambda}$$
2. Update: $\alpha_i \leftarrow \alpha_i + \Delta\alpha_i$, $\; w \leftarrow w + \frac{1}{\lambda} \Delta\alpha_i \, x_i$
3. Return $w = \frac{1}{\lambda} X^\top \alpha$.
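A sketch of this update loop (the pass count and naming are assumptions):

```python
import numpy as np

def sdca_ridge(X, Y, lam, passes, seed=0):
    """SDCA for ridge regression using the closed-form coordinate update
    from the slide. O(d) per step; O(n + d) memory for (alpha, w)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha, w = np.zeros(n), np.zeros(d)
    for _ in range(passes * n):
        i = rng.integers(n)
        delta = ((Y[i] - w @ X[i]) - alpha[i]) / (1 + X[i] @ X[i] / lam)
        alpha[i] += delta
        w += (delta / lam) * X[i]
    return w
```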

SLIDE 15

Guarantees: speedups for the big data regime

n points, d dimensions, λ_av the average eigenvalue. Computation time to get ε accuracy (Shalev-Shwartz & Zhang '12; Frostig, Ge, K., Sidford '15):

GD vs SDCA:
$$\frac{\lambda_{\max}}{\lambda_{\min}} \, nd \log(1/\epsilon) \;\to\; \left(n + \frac{\lambda_{\mathrm{av}}}{\lambda_{\min}}\right) d \log(1/\epsilon)$$

Conjugate GD vs acceleration + SDCA:
$$\sqrt{\frac{\lambda_{\max}}{\lambda_{\min}}} \, nd \log(1/\epsilon) \;\to\; \left(n + \sqrt{\frac{n \, \lambda_{\mathrm{av}}}{\lambda_{\min}}}\right) d \log(1/\epsilon)$$

memory: O(n + d)

SLIDE 16

(another idea) Stochastic Variance Reduced Gradient (SVRG)

1. Exact gradient computation: at stage s, using $\tilde{w}_s$, compute
$$\nabla L(\tilde{w}_s) = \frac{1}{n} \sum_{i=1}^n \nabla \mathrm{loss}(\tilde{w}_s, (x_i, y_i))$$
2. Corrected SGD: initialize $w \leftarrow \tilde{w}_s$. For m steps: sample a point (x, y) and update
$$w \leftarrow w - \eta \left( \nabla \mathrm{loss}(w, (x, y)) - \nabla \mathrm{loss}(\tilde{w}_s, (x, y)) + \nabla L(\tilde{w}_s) \right)$$
3. Update and repeat: $\tilde{w}_{s+1} \leftarrow w$.

Two ideas: if w = w∗, then there is no update; the updates are unbiased, since the correction term (the last two terms) has mean 0.
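A sketch of SVRG for the (unregularized) least-squares sum, following the three steps above (step size, stage count, and inner-loop length m are left to the caller):

```python
import numpy as np

def svrg_lsq(X, Y, eta, stages, m, seed=0):
    """SVRG: one exact gradient per stage, then m corrected SGD steps."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w_snap = np.zeros(d)

    def grad_i(w, i):  # gradient of the i-th summand (w.x_i - y_i)^2
        return 2 * (w @ X[i] - Y[i]) * X[i]

    for _ in range(stages):
        full_grad = 2 * X.T @ (X @ w_snap - Y) / n  # exact gradient at snapshot
        w = w_snap.copy()
        for _ in range(m):
            i = rng.integers(n)
            w -= eta * (grad_i(w, i) - grad_i(w_snap, i) + full_grad)
        w_snap = w
    return w_snap
```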

SLIDE 18

Guarantees of SVRG

n points, d dimensions, λ_av the average eigenvalue. Computation time to get ε accuracy (Johnson & Zhang '13):

GD vs SVRG:
$$\frac{\lambda_{\max}}{\lambda_{\min}} \, nd \log(1/\epsilon) \;\to\; \left(n + \frac{\lambda_{\mathrm{av}}}{\lambda_{\min}}\right) d \log(1/\epsilon)$$

Conjugate GD vs ??:
$$\sqrt{\frac{\lambda_{\max}}{\lambda_{\min}}} \, nd \log(1/\epsilon) \;\to\; ??$$

memory: O(d)

SLIDE 19

Part 1: Summary

Methods extend to sums of convex functions,
$$L(w) = \sum_i \mathrm{loss}(h(x_i, w), y_i),$$
for smooth loss(·) and strongly convex L(·).

Take home: natural stochastic algorithms, similar to SGD, which enjoy "numerical accuracy" guarantees.

Other good ideas: sketching is good in the large-n but "medium sized" d regime (Rokhlin and Tygert, 2008). Improve upon conjugate gradient in the big data regime.

SLIDE 20

Part 2: PCA

We have n vectors x_1, ..., x_n in d dimensions, and the matrix
$$A = \sum_{i=1}^n x_i x_i^\top.$$
How much computation time do you need to get an ε-approximation of the top eigenvector?

Constructing and storing the matrix may be costly. Computation: how do you accurately estimate the leading eigenvector of A, in terms of n, d, the "gap", etc.? Aside: (with modifications/CCA) this is the simplest way to learn embeddings.

SLIDE 21

Part 2: outline

Similar to least squares, we provide speedups for eigenvector computations:

Power method → improvements for large n.
Lanczos method → improvements for large n.

Key idea: "shift and invert" preconditioning, which utilizes the faster least squares algorithms.

SLIDE 22

Review: Algebraic Methods

$$\max_w \frac{w^\top A w}{\|w\|^2}, \qquad A = \sum_{i=1}^n x_i x_i^\top$$

n points, d dimensions; time complexity O(nd^2) and memory O(d^2). No "gap" dependence. "Big data regime": what about the large-n regime?

SLIDE 23

Review: The Power Method and Lanczos

$$\max_w \frac{w^\top A w}{\|w\|^2}, \qquad A = \sum_{i=1}^n x_i x_i^\top$$

n points, d dimensions; gap = (λ_1 − λ_2)/λ_1; nnz(A) is the number of nonzeros in A. Computation time to get ε accuracy:

Power method: $\frac{\mathrm{nnz}(A)}{\mathrm{gap}} \log(1/\epsilon) \approx \frac{nd}{\mathrm{gap}} \log(1/\epsilon)$
Lanczos method: $\frac{\mathrm{nnz}(A)}{\sqrt{\mathrm{gap}}} \log(1/\epsilon) \approx \frac{nd}{\sqrt{\mathrm{gap}}} \log(1/\epsilon)$

"Big data regime": what about the large-n regime?
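A power-method sketch that never forms A explicitly (each multiply is done as X^T(Xw), costing O(nd) per iteration):

```python
import numpy as np

def power_method(X, iters, seed=0):
    """Power iteration for the top eigenvector of A = sum_i x_i x_i^T."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    for _ in range(iters):
        w = X.T @ (X @ w)      # A @ w without materializing A
        w /= np.linalg.norm(w)
    return w
```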

SLIDE 24

Review: Oja’s algorithm and SGD

$$\max_w \frac{w^\top A w}{\|w\|^2}, \qquad A = \sum_{i=1}^n x_i x_i^\top$$

Initialize w randomly (w = 0 is a fixed point) and then repeat:

1. for a datapoint i sampled randomly: $w \leftarrow (I + \eta \, x_i x_i^\top) w$
2. normalize: $w \leftarrow w / \|w\|$

Computation time to get ε accuracy: O(1/ε). Memory: O(d).
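A sketch of Oja's updates as written above (random initialization assumed):

```python
import numpy as np

def oja(X, eta, steps, seed=0):
    """Oja's algorithm: stochastic updates toward the top eigenvector.
    O(d) work and memory per step."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = rng.normal(size=d)
    w /= np.linalg.norm(w)
    for _ in range(steps):
        i = rng.integers(n)
        w += eta * (X[i] @ w) * X[i]  # (I + eta x_i x_i^T) w
        w /= np.linalg.norm(w)
    return w
```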

SLIDE 25

PCA in the “Big Data” Regime?

How do you find the top eigenvector of
$$A = \sum_{i=1}^n x_i x_i^\top\,?$$
"Big Data Regime": How do you optimize this when n and d are large?

SLIDE 26

Classical: Shift and Invert Preconditioning

Can we make the gap larger? Consider B = λI − A. For the i-th eigenvalue, we have
$$\lambda_i(B^{-1}) = \frac{1}{\lambda - \lambda_i(A)}.$$
For an appropriate λ (slightly larger than λ_1), gap(B^{-1}) ≥ 1/2. The "shift and invert" power method, w ← B^{-1}w, converges in O(log(1/ε)) iterations. (Saad '92)

SLIDE 27

Robust Shift and Invert Preconditioning

The rub: we must now solve linear systems. Setting λ appropriately (use λ = (1 + const · gap) · λ_1) makes gap(B^{-1}) ≥ 1/2, but the condition number of B scales as 1/gap. Robust Shift and Invert Power Method: $w_{t+1} \leftarrow \mathrm{ApproxLeastSquares}(B, w_t)$, an approximation to $B^{-1} w_t$. Solving each linear system from scratch is costly. Key idea: start the linear system solver at time t from the previous solution at t − 1.
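A sketch of the warm-started scheme. Here a few conjugate-gradient steps stand in for the fast stochastic least-squares solvers of Part 1, and the shift lam_shift is assumed to exceed λ_1 so that B is positive definite:

```python
import numpy as np

def shift_invert_power(X, lam_shift, outer_iters, cg_iters=20, seed=0):
    """Shift-and-invert power method for A = X^T X: power iteration on
    B^{-1} = (lam_shift*I - A)^{-1}, each application solved approximately
    and warm-started from the previous solution."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    w /= np.linalg.norm(w)
    v = w.copy()  # warm start carried across outer iterations

    def B(u):  # apply B = lam_shift*I - X^T X without forming it
        return lam_shift * u - X.T @ (X @ u)

    for _ in range(outer_iters):
        r = w - B(v)               # residual of the system B v = w
        p = r.copy()
        for _ in range(cg_iters):  # approximate solve by conjugate gradient
            Bp = B(p)
            a = (r @ r) / (p @ Bp)
            v += a * p
            r_new = r - a * Bp
            p = r_new + ((r_new @ r_new) / (r @ r)) * p
            r = r_new
        w = v / np.linalg.norm(v)
    return w
```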

SLIDE 28

Guarantees: speedups in the big data regime

$$\max_w \frac{w^\top A w}{\|w\|^2}, \qquad A = \sum_{i=1}^n x_i x_i^\top$$

Lanczos runtime vs. "shift and invert" + accelerated SDCA (Garber, Hazan, Jain, Jin, K., Musco, Netrapalli, Sidford, 2016):
$$\frac{nd}{\sqrt{\mathrm{gap}}} \log(1/\epsilon) \;\to\; \frac{n^{3/4} d^{1/4} \cdot d}{\sqrt{\mathrm{gap}}} \log(1/\epsilon)$$

SLIDE 29

Part 2: Summary

Take home: natural stochastic algorithms, based on regression, which enjoy "numerical accuracy" guarantees. Other good ideas: sketching is also good in the large-n, large-d regime, but does not have great gap dependence (Halko, Martinsson, and Tropp, 2011). Improve upon Lanczos in the big data regime.

SLIDE 30

Part 2: Non-convexity and beyond PCA

Methods extend to CCA and generalized eigenvalue problems. Local search methods work for other matrix-based problems:

matrix square root: Jain, Jin, K., Netrapalli (2015)
linear dynamical systems: Hardt, Ma, Recht (2016)

Related: alternating minimization, tensor decompositions, etc.

SLIDE 31

Part 3: Statistics

How do we optimize when we only have a stochastic approximation?
$$\mathbb{E}[\mathrm{loss}(h(x, w), y)] \approx \frac{1}{n} \sum_{i=1}^n \mathrm{loss}(h(x_i, w), y_i)$$
Numerical accuracy isn't the right notion.

For sums of convex functions, SGD + averaging (or variants) is optimal. Juditsky & Polyak (1992); Dieuleveut & Bach (2014); Frostig, Ge, K., & Sidford (2014)

For PCA, Oja's algorithm (SGD) is "sample efficient". Jain, Jin, K., Netrapalli, Sidford (2016)

SLIDE 32

Statistics: Regression

Suppose y = w_* · x + η where η ∼ N(0, σ²). The MLE, with n samples, has excess squared-error risk
$$\mathbb{E}[\mathrm{Loss}(w_{\mathrm{MLE},n}) - \mathrm{Loss}(w_*)] \to \frac{d\sigma^2}{n} \quad \text{(for large } n\text{)}.$$

SGD + averaging:
$$w_{t+1} = w_t - \eta (w_t \cdot x_t - y_t) x_t, \qquad \bar{w}_T = \frac{1}{T} \sum_{t=1}^T w_t \quad \text{(keep a running average)}$$

The running average is near-optimal statistically. After t samples (in a streaming model):
$$\mathbb{E}[\mathrm{Loss}(\bar{w}_t) - \mathrm{Loss}(w_*)] \to \mathrm{const} \cdot \frac{d\sigma^2}{t} \quad \text{(for large } t\text{)}$$
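A streaming sketch of SGD with iterate averaging (the generator interface for the data stream is an assumption):

```python
import numpy as np

def sgd_averaged(stream, d, eta):
    """SGD with Polyak-Ruppert iterate averaging for streaming least
    squares; `stream` yields (x, y) pairs."""
    w = np.zeros(d)
    w_bar = np.zeros(d)
    for t, (x, y) in enumerate(stream, start=1):
        w -= eta * (w @ x - y) * x  # plain SGD step
        w_bar += (w - w_bar) / t    # running average of the iterates
    return w_bar
```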

SLIDE 33

Part 3: Statistics: PCA

Oja's algorithm / SGD:

1. for a datapoint i sampled randomly: $w \leftarrow (I + \eta \, x_i x_i^\top) w$
2. normalize: $w \leftarrow w / \|w\|$

For PCA, vanilla Oja's algorithm (SGD) with decaying step sizes is "sample efficient". Jain, Jin, K., Netrapalli, Sidford (2016)

SLIDE 34

Part 4: Thoughts and Open Problems

parallelization
second order methods
non-convexity

SLIDE 35

Parallelization

How do we parallelize? Common approaches are:

distribute the gradient computation?
split up the data, run multiple SGD jobs, and average their answers?
mini-batching for SGD?

How well do they work? (forthcoming) HOGWILD!: asynchronous updating, Niu, Recht, Ré & Wright (2012); works for sparse x, but it is unclear what can be done in the dense case.

SLIDE 36

Second Order Methods

L-BFGS works well in practice for both convex and non-convex optimization.

There is no sharp analysis of L-BFGS. L-BFGS is sometimes feasible for large-scale problems. Can we find a scalable (and provable) variant?

SLIDE 37

Non-convex optimization

What should we do? Again, L-BFGS is a good idea; is there a scalable alternative? SVRG "works" in the non-convex case:

recent local convergence analyses: Allen-Zhu & Hazan (2016); Reddi, Hefny, Sra, Poczos & Smola (2016)
Is it actually helpful in practice?

SLIDE 38

Thanks!

Improved algorithms for optimization in the big data regime:

using reductions to least squares solvers
for more general convex loss functions
also improvements for eigenvalues and CCA
future/better non-convex optimization?

Selected references provided in supplementary material.
