Recent Progresses in Stochastic Algorithms for Big Data Optimization

SLIDE 1

Recent Progresses in Stochastic Algorithms for Big Data Optimization

Tong Zhang

Rutgers University & Baidu Inc.

collaborators: Shai Shalev-Shwartz, Rie Johnson, Lin Xiao, Ohad Shamir and Nathan Srebro


SLIDES 2-4

Outline

Background:
  • big data optimization problem
  • 1st order stochastic gradient versus batch gradient: pros and cons

Stochastic gradient algorithms with variance reduction:
  • algorithm 1: SVRG (Stochastic Variance Reduced Gradient)
  • algorithm 2: SDCA (Stochastic Dual Coordinate Ascent)
  • algorithm 3: accelerated SDCA (with Nesterov acceleration)

Strategies for distributed computing:
  • algorithm 4: DANE (Distributed Approximate NEwton-type method)
  • behaves like 2nd order stochastic sampling
SLIDE 5

Mathematical Problem

Big Data Optimization Problem in machine learning:

  min_w f(w),   f(w) = (1/n) Σ_{i=1}^n f_i(w)

Special structure: sum over data. Big data (n large) requires distributed training.

SLIDE 6

Assumptions on loss function

λ-strong convexity:
  f(w′) ≥ f(w) + ∇f(w)⊤(w′ − w) + (λ/2)‖w′ − w‖_2^2

L-smoothness:
  f_i(w′) ≤ f_i(w) + ∇f_i(w)⊤(w′ − w) + (L/2)‖w′ − w‖_2^2

SLIDES 7-9

Example: Computational Advertising

Large scale regularized logistic regression:

  min_w f(w),   f(w) = (1/n) Σ_{i=1}^n f_i(w),   f_i(w) = ln(1 + e^{−w⊤x_i y_i}) + (λ/2)‖w‖_2^2

  • data (x_i, y_i) with y_i ∈ {±1}; parameter vector w
  • the objective is λ-strongly convex and L-smooth with L = 0.25 max_i ‖x_i‖_2^2 + λ
  • big data: n ∼ 10–100 billion; high dimension: dim(x_i) ∼ 10–100 billion

How to solve big optimization problems efficiently?
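As a concrete illustration of this objective, here is a minimal NumPy sketch of the regularized logistic loss and its gradient. The function and variable names (reg_logistic_loss_grad, X, y, lam) are illustrative placeholders, not part of the original slides.

```python
import numpy as np

def reg_logistic_loss_grad(w, X, y, lam):
    """f(w) = (1/n) sum_i log(1 + exp(-y_i w^T x_i)) + (lam/2) ||w||^2 and its gradient."""
    n = X.shape[0]
    margins = y * (X @ w)                       # y_i * w^T x_i for all i
    # log(1 + exp(-m)) computed stably as logaddexp(0, -m)
    loss = np.mean(np.logaddexp(0.0, -margins)) + 0.5 * lam * (w @ w)
    sigma = 1.0 / (1.0 + np.exp(margins))       # sigmoid(-margins)
    grad = -(X.T @ (sigma * y)) / n + lam * w   # chain rule through m_i = y_i x_i^T w
    return loss, grad
```

With X an (n, d) array and labels y in {±1}, any of the optimizers discussed in the following slides can be driven by this loss/gradient pair.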

SLIDES 10-12

Statistical Thinking: sampling

Objective function: f(w) = (1/n) Σ_{i=1}^n f_i(w)
  • sample the objective function: only optimize an approximate objective

1st order gradient: ∇f(w) = (1/n) Σ_{i=1}^n ∇f_i(w)
  • sample the 1st order gradient (stochastic gradient): converges to the exact optimum
  • variance reduction: fast rate

2nd order gradient: ∇²f(w) = (1/n) Σ_{i=1}^n ∇²f_i(w)
  • sample the 2nd order gradient (stochastic Newton): converges to the exact optimum with a fast rate, suited to distributed computing

SLIDES 13-14

Batch Optimization Method: Gradient Descent

Solve w∗ = arg min_w f(w), where f(w) = (1/n) Σ_{i=1}^n f_i(w).

Gradient Descent (GD): w_k = w_{k−1} − η_k ∇f(w_{k−1}).

How fast does this method converge to the optimal solution?

General result: converges to a local minimum under suitable conditions; the convergence rate depends on conditions on f(·). For λ-strongly convex and L-smooth problems the rate is linear:
  f(w_k) − f(w∗) = O((1 − ρ)^k),
where ρ = O(λ/L) is the inverse condition number.
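A minimal sketch of the batch GD update, assuming a full-gradient oracle such as the reg_logistic_loss_grad function sketched earlier (names are illustrative only):

```python
def gradient_descent(grad_fn, w0, eta, num_iters):
    """Batch gradient descent: w_k = w_{k-1} - eta * grad f(w_{k-1})."""
    w = w0.copy()
    for _ in range(num_iters):
        _, g = grad_fn(w)   # full gradient over all n examples: the costly step
        w -= eta * g        # for L-smooth, lambda-strongly convex f, eta = 1/L gives the linear rate
    return w
```

For example, grad_fn = lambda w: reg_logistic_loss_grad(w, X, y, lam) turns this into batch training of the logistic model above.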

SLIDES 15-16

Stochastic Approximate Gradient Computation

If f(w) = (1/n) Σ_{i=1}^n f_i(w), GD requires the computation of the full gradient, which is extremely costly:
  ∇f(w) = (1/n) Σ_{i=1}^n ∇f_i(w)

Idea: stochastic optimization employs a random sample (mini-batch) B to approximate
  ∇f(w) ≈ (1/|B|) Σ_{i∈B} ∇f_i(w)

  • it is an unbiased estimator
  • more efficient computation, but introduces variance
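A minimal sketch of the mini-batch gradient estimator and the resulting SGD loop, assuming per-example gradients ∇f_i are available through a hypothetical grad_fi(w, i) callback (all names are illustrative assumptions):

```python
import numpy as np

def minibatch_grad(w, grad_fi, n, batch_size, rng):
    """Unbiased estimate of grad f(w) = (1/n) sum_i grad f_i(w) from a random mini-batch B."""
    batch = rng.integers(0, n, size=batch_size)          # sample B uniformly with replacement
    return sum(grad_fi(w, i) for i in batch) / batch_size

def sgd(grad_fi, n, w0, batch_size=1, num_iters=1000, eta0=0.1, rng=None):
    """Plain SGD; the decaying step size eta0/(1+t) is the usual choice behind the O(1/t) rate."""
    rng = rng or np.random.default_rng(0)
    w = w0.copy()
    for t in range(num_iters):
        g = minibatch_grad(w, grad_fi, n, batch_size, rng)
        w -= eta0 / (1.0 + t) * g
    return w
```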

SLIDE 17

SGD versus GD

SGD: faster computation per step
  • sublinear convergence, due to the variance of the gradient approximation: f(w_t) − f(w∗) = Õ(1/t)

GD: slower computation per step
  • linear convergence: f(w_t) − f(w∗) = O((1 − ρ)^t)

SLIDES 18-19

Improving SGD via Variance Reduction

GD converges fast but each step is slow; SGD steps are fast but it converges slowly
  • slow convergence is due to the inherent variance

SGD as a statistical estimator of the gradient:
  • let g_i = ∇f_i
  • unbiasedness: E g_i = (1/n) Σ_{i=1}^n g_i = ∇f
  • error of using g_i to approximate ∇f: the variance E‖g_i − E g_i‖_2^2

Statistical thinking:
  • relate the variance to the optimization design
  • find other unbiased gradient estimators with smaller variance

SLIDES 20-22

Relating Statistical Variance to Optimization

Want to optimize min_w f(w). Full gradient: ∇f(w).

Given an unbiased random estimator g_i of ∇f(w) and the SGD rule w → w − ηg_i, the reduction of the objective satisfies

  E f(w − ηg_i) ≤ f(w) − (η − η²L/2)‖∇f(w)‖_2^2   [non-random part]
                       + (η²L/2) E‖g_i − E g_i‖_2^2   [variance part]

Smaller variance implies bigger reduction.
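The bound on this slide follows from L-smoothness plus the decomposition of E‖g_i‖²; a short derivation, filled in here using only the assumptions already stated on the earlier slides:

```latex
% L-smoothness of f gives, for the step w -> w - \eta g_i:
%   f(w - \eta g_i) \le f(w) - \eta \nabla f(w)^\top g_i + \tfrac{\eta^2 L}{2}\|g_i\|_2^2 .
% Take expectations, use unbiasedness E g_i = \nabla f(w), and
%   E\|g_i\|_2^2 = \|E g_i\|_2^2 + E\|g_i - E g_i\|_2^2 :
\begin{aligned}
E f(w - \eta g_i)
 &\le f(w) - \eta \|\nabla f(w)\|_2^2
      + \tfrac{\eta^2 L}{2}\bigl(\|\nabla f(w)\|_2^2 + E\|g_i - E g_i\|_2^2\bigr) \\
 &=   f(w) - \bigl(\eta - \tfrac{\eta^2 L}{2}\bigr)\|\nabla f(w)\|_2^2
      + \tfrac{\eta^2 L}{2}\, E\|g_i - E g_i\|_2^2 .
\end{aligned}
```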

SLIDES 23-25

Outline (repeated)

SLIDE 26

Stochastic Variance Reduced Gradient (SVRG) I

Objective function:
  f(w) = (1/n) Σ_{i=1}^n f_i(w) = (1/n) Σ_{i=1}^n f̃_i(w),
where
  f̃_i(w) = f_i(w) − (∇f_i(w̃) − ∇f(w̃))⊤w    [the correction terms sum to zero over i]

Pick w̃ to be an approximate solution (close to w∗). The SVRG rule (control variates) is

  w_t = w_{t−1} − η_t ∇f̃_i(w_{t−1}) = w_{t−1} − η_t [∇f_i(w_{t−1}) − ∇f_i(w̃) + ∇f(w̃)]    [small variance]

SLIDE 27

Stochastic Variance Reduced Gradient (SVRG) II

Assume that w̃ ≈ w∗ and w_{t−1} ≈ w∗. Then
  ∇f(w̃) ≈ ∇f(w∗) = 0 and ∇f_i(w_{t−1}) ≈ ∇f_i(w̃),
so ∇f_i(w_{t−1}) − ∇f_i(w̃) + ∇f(w̃) → 0.

It is therefore possible to choose a constant step size η_t = η instead of requiring η_t → 0. One can achieve comparable linear convergence with SVRG:
  E f(w_t) − f(w∗) = O((1 − ρ̃)^t),  where ρ̃ = O(λn/(L + λn));
convergence is faster than GD.
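A minimal sketch of SVRG as described on the last two slides, assuming per-example and full-gradient callbacks (grad_fi, grad_f); the names and the choice of updating the anchor to the last inner iterate (one common variant) are illustrative, not prescribed by the slides.

```python
import numpy as np

def svrg(grad_fi, grad_f, w0, n, eta, num_epochs, inner_iters, rng=None):
    """SVRG: inner updates use grad_fi(w, i) - grad_fi(w_snap, i) + grad_f(w_snap)."""
    rng = rng or np.random.default_rng(0)
    w_snap = w0.copy()                      # the anchor point w~
    for _ in range(num_epochs):
        mu = grad_f(w_snap)                 # one full gradient per epoch
        w = w_snap.copy()
        for _ in range(inner_iters):
            i = rng.integers(n)
            g = grad_fi(w, i) - grad_fi(w_snap, i) + mu   # variance-reduced gradient
            w -= eta * g                    # constant step size is fine here
        w_snap = w                          # update the anchor
    return w_snap
```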

SLIDE 28

Compare SVRG to Batch Gradient Descent Algorithm

Number of examples needed to achieve ε accuracy:
  • Batch GD: Õ(n · (L/λ) log(1/ε))
  • SVRG: Õ((n + L/λ) log(1/ε))

Assume an L-smooth loss: ‖∇f_i(w) − ∇f_i(w′)‖ ≤ L‖w − w′‖, and a λ-strongly convex objective: ‖∇f(w) − ∇f(w′)‖ ≥ λ‖w − w′‖.

The gain of SVRG over the batch algorithm is significant when n is large.
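To make the gap concrete, a worked example with illustrative numbers (not from the slides): take n = 10^9 examples and condition number L/λ = 10^4.

```latex
% Examples processed per factor of \log(1/\epsilon), under the bounds above:
\text{Batch GD: } n \cdot \tfrac{L}{\lambda} = 10^{9} \cdot 10^{4} = 10^{13},
\qquad
\text{SVRG: } n + \tfrac{L}{\lambda} = 10^{9} + 10^{4} \approx 10^{9},
% roughly a 10^4-fold reduction, which is why the gain grows with n.
```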

SLIDES 29-31

Outline (repeated)

SLIDES 32-34

Motivation of SDCA: regularized loss minimization

Assume we want to solve the Lasso problem:
  min_w (1/n) Σ_{i=1}^n (w⊤x_i − y_i)² + λ‖w‖_1

or the ridge regression problem:
  min_w (1/n) Σ_{i=1}^n (w⊤x_i − y_i)²   [loss]   + (λ/2)‖w‖_2^2   [regularization]

Our goal: solve regularized loss minimization problems as fast as we can.

A good solution leads to a stochastic algorithm called proximal Stochastic Dual Coordinate Ascent (Prox-SDCA). We show fast convergence of SDCA for many regularized loss minimization problems in machine learning.

SLIDE 35

General Problem

Want to solve:
  min_w P(w) := (1/n) Σ_{i=1}^n φ_i(X_i⊤ w) + λ g(w),
where the X_i are matrices and g(·) is strongly convex.

Examples:
  • Multi-class logistic loss: φ_i(X_i⊤ w) = ln Σ_{ℓ=1}^K exp(w⊤X_{i,ℓ}) − w⊤X_{i,y_i}
  • L1–L2 regularization: g(w) = (1/2)‖w‖_2^2 + (σ/λ)‖w‖_1

SLIDE 36

Dual Formulation

Primal:
  min_w P(w) := (1/n) Σ_{i=1}^n φ_i(X_i⊤ w) + λ g(w)

Dual:
  max_α D(α) := (1/n) Σ_{i=1}^n −φ_i^∗(−α_i) − λ g^∗( (1/(λn)) Σ_{i=1}^n X_i α_i ),

with the relationship
  w = ∇g^∗( (1/(λn)) Σ_{i=1}^n X_i α_i ).

The convex conjugate (dual) is defined as φ_i^∗(a) = sup_z (az − φ_i(z)).

SDCA: randomly pick i and optimize D(α) by varying α_i while keeping the other dual variables fixed.
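As an illustration of the "optimize one α_i at a time" idea, here is a minimal SDCA sketch for the special case of ridge regression (squared loss with g(w) = ½‖w‖²), where the single-coordinate maximization has a closed form. This concrete instantiation, including the per-epoch random permutation, is my own sketch rather than something specified on the slides.

```python
import numpy as np

def sdca_ridge(X, y, lam, num_epochs=10, rng=None):
    """SDCA for ridge regression: min_w (1/n) sum_i (x_i^T w - y_i)^2 + (lam/2)||w||^2.

    Maintains the primal-dual relationship w = (1/(lam*n)) * sum_i alpha_i x_i; for the
    squared loss the best dual coordinate step has the closed form used below.
    """
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    sq_norms = np.einsum("ij,ij->i", X, X)        # ||x_i||^2, precomputed once
    for _ in range(num_epochs):
        for i in rng.permutation(n):
            delta = (y[i] - X[i] @ w - 0.5 * alpha[i]) / (0.5 + sq_norms[i] / (lam * n))
            alpha[i] += delta                     # dual coordinate ascent step
            w += delta * X[i] / (lam * n)         # keep w consistent with alpha
    return w
```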

SLIDE 37

Example: L1–L2 Regularized Logistic Regression

Primal:
  P(w) = (1/n) Σ_{i=1}^n ln(1 + e^{−w⊤X_i Y_i})   [φ_i]   + (λ/2) w⊤w + σ‖w‖_1   [λ g(w)]

Dual: with α_i Y_i ∈ [0, 1],
  D(α) = (1/n) Σ_{i=1}^n [ −α_i Y_i ln(α_i Y_i) − (1 − α_i Y_i) ln(1 − α_i Y_i) ]   [−φ_i^∗(−α_i)]   − (λ/2)‖trunc(v, σ/λ)‖_2^2
  s.t. v = (1/(λn)) Σ_{i=1}^n α_i X_i;   w = trunc(v, σ/λ),

where the soft-thresholding operator is
  trunc(u, δ)_j = u_j − δ if u_j > δ;  0 if |u_j| ≤ δ;  u_j + δ if u_j < −δ.

SLIDE 38

Proximal-SDCA for L1-L2 Regularization

Algorithm: keep dual variables α and v = (λn)^{−1} Σ_i α_i X_i.

  • Randomly pick i.
  • Find ∆_i by approximately maximizing
      −φ_i^∗(α_i + ∆_i) − trunc(v, σ/λ)⊤X_i ∆_i − (1/(2λn))‖X_i‖_2^2 ∆_i²,
    where φ_i^∗(α_i + ∆) = (α_i + ∆)Y_i ln((α_i + ∆)Y_i) + (1 − (α_i + ∆)Y_i) ln(1 − (α_i + ∆)Y_i).
  • Update α = α + ∆_i · e_i and v = v + (λn)^{−1} ∆_i · X_i.

Let w = trunc(v, σ/λ).
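A minimal sketch of the soft-thresholding step and the outer structure of this Prox-SDCA loop; the inner one-dimensional maximization is left as a callback (solve_delta), since the slide only requires it to be solved approximately. All names and the loop ordering are illustrative assumptions.

```python
import numpy as np

def trunc(u, delta):
    """Soft-thresholding: shrink each coordinate of u toward zero by delta."""
    return np.sign(u) * np.maximum(np.abs(u) - delta, 0.0)

def prox_sdca_l1l2(X, Y, lam, sigma, solve_delta, num_epochs=10, rng=None):
    """Skeleton of Prox-SDCA for L1-L2 regularized losses.

    solve_delta(alpha_i, xi_dot_w, sq_norm_over_lam_n, y_i) should (approximately)
    maximize the one-dimensional dual objective from the slide and return Delta_i.
    """
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    alpha = np.zeros(n)
    v = np.zeros(d)                          # v = (1/(lam*n)) * sum_i alpha_i X_i
    for _ in range(num_epochs):
        for i in rng.permutation(n):
            w = trunc(v, sigma / lam)        # current primal iterate
            delta = solve_delta(alpha[i], X[i] @ w, (X[i] @ X[i]) / (lam * n), Y[i])
            alpha[i] += delta
            v += delta * X[i] / (lam * n)
    return trunc(v, sigma / lam)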

SLIDE 39

Convergence rate

The number of iterations needed to achieve ε accuracy:
  • for a (1/γ)-smooth loss: Õ( (n + 1/(γλ)) log(1/ε) )
  • for an L-Lipschitz loss: Õ( n + L²/(λε) )

SLIDES 40-43

Solving L1 with Smooth Loss

Assume we want to solve an L1 regularized problem to accuracy ε with smooth φ_i:
  min_w (1/n) Σ_{i=1}^n φ_i(w) + σ‖w‖_1.

Apply Prox-SDCA with the extra term (λ/2)‖w‖_2^2, where λ = O(ε):
  • number of iterations needed is Õ(n + 1/ε).

Compare to Dual Averaging SGD (Xiao): number of iterations needed is Õ(1/ε²).

Compare to batch FISTA (Nesterov's accelerated proximal gradient): number of iterations needed is Õ(n/√ε).

Prox-SDCA wins in the statistically interesting regime: ε > Ω(1/n²).

One can design an accelerated Prox-SDCA procedure that is always superior to FISTA.

SLIDES 44-46

Outline (repeated)

SLIDES 47-48

Accelerated Prox-SDCA

Solving:
  P(w) := (1/n) Σ_{i=1}^n φ_i(X_i⊤ w) + λ g(w)

  • The convergence rate of Prox-SDCA depends on O(1/λ).
  • This is inferior to accelerated methods when λ is very small (≪ O(1/n)), since acceleration has only an O(1/√λ) dependency.

Inner-outer Iteration Accelerated Prox-SDCA:
  • Pick a suitable κ = Θ(1/n) and β.
  • For t = 2, 3, . . . (outer iteration):
      • Let g̃_t(w) = λ g(w) + (κ/2)‖w‖_2^2 − κ w⊤y^{(t−1)}   (κ-strongly convex)
      • Let P̃_t(w) = P(w) − λ g(w) + g̃_t(w)   (redefines P(·); κ-strongly convex)
      • Approximately solve P̃_t(w) for (w^{(t)}, α^{(t)}) with Prox-SDCA   (inner iteration)
      • Let y^{(t)} = w^{(t)} + β(w^{(t)} − w^{(t−1)})   (acceleration)
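A minimal sketch of the inner-outer structure above, treating the inner Prox-SDCA solve as a black-box callback (inner_solver); the names and the fixed iteration count are illustrative assumptions, not specified on the slide.

```python
def accelerated_outer_loop(inner_solver, w_init, beta, num_outer):
    """Outer acceleration wrapper: each round approximately solves the kappa-strongly
    convex modified problem ~P_t centered at y, then extrapolates (Nesterov-style)."""
    w_prev = w_init.copy()
    w = w_init.copy()
    y = w_init.copy()
    for _ in range(num_outer):
        # inner_solver(y) approximately minimizes P(w) - lam*g(w) + g_t(w), where
        # g_t(w) = lam*g(w) + (kappa/2)||w||^2 - kappa * w^T y
        w_prev, w = w, inner_solver(y)
        y = w + beta * (w - w_prev)          # acceleration / extrapolation step
    return w
```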

SLIDE 49

Performance Comparisons

  Problem            Algorithm                        Runtime
  SVM                SGD                              d/(λε)
                     AGD (Nesterov)                   d n √(1/(λε))
                     Acc-Prox-SDCA                    d (n + min{1/(λε), √(n/(λε))})
  Lasso              SGD and variants                 d/ε²
                     Stochastic Coordinate Descent    d n/ε
                     FISTA                            d n √(1/ε)
                     Acc-Prox-SDCA                    d (n + min{1/ε, √(n/ε)})
  Ridge Regression   Exact                            d²n + d³
                     SGD, SDCA                        d (n + 1/λ)
                     AGD                              d n √(1/λ)
                     Acc-Prox-SDCA                    d (n + min{1/λ, √(n/λ)})

SLIDE 50

Summary of Sampling

1st order gradient: ∇f(w) = (1/n) Σ_{i=1}^n ∇f_i(w)
  • sample the 1st order gradient (stochastic gradient): variance reduction leads to fast linear convergence

2nd order gradient: ∇²f(w) = (1/n) Σ_{i=1}^n ∇²f_i(w)
  • sample the 2nd order gradient (stochastic Newton): converges to the exact optimum with a fast rate in the distributed computing setting

SLIDES 51-53

Outline (repeated)

SLIDE 54

Distributed Computing

Assume: data distributed over machines
  • m processors, each with n/m examples

Simple Computational Strategy: One Shot Averaging (OSA)
  • run the optimization on the m machines separately, obtaining parameters w^{(1)}, . . . , w^{(m)}
  • average the parameters: w̄ = m^{−1} Σ_{i=1}^m w^{(i)}

SLIDES 55-56

Improvement

OSA strategy's advantages:
  • machines run independently
  • simple and computationally efficient; asymptotically good in theory

Disadvantage:
  • practically inferior to training all examples on a single machine

Traditional solution in optimization: ADMM.
New idea: via 2nd order gradient sampling: Distributed Approximate NEwton (DANE).

SLIDE 57

Distribution Scheme

Assume: data distributed over machines with the decomposed problem
  f(w) = Σ_{ℓ=1}^m f^{(ℓ)}(w).

  • m processors; each f^{(ℓ)}(w) is built from n/m randomly partitioned examples
  • each machine holds a complete set of parameters

SLIDES 58-59

DANE

Start with w̃ obtained by OSA.

Iterate:
  • Take w̃ and define f̃^{(ℓ)}(w) = f^{(ℓ)}(w) − (∇f^{(ℓ)}(w̃) − ∇f(w̃))⊤w.
  • On each machine, solve w^{(ℓ)} = arg min_w f̃^{(ℓ)}(w) independently.
  • Take the partial average as the next w̃.

The gradient correction is similar to SVRG.
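A minimal single-process simulation of one DANE round, assuming the data are pre-partitioned into m blocks and a local solver (local_argmin) is available for the corrected objective. Everything here (names, treating f^(ℓ) as the average loss on block ℓ with f as the average of the f^(ℓ), the plain average in place of the partial average, the local solver interface) is an illustrative assumption.

```python
import numpy as np

def dane_round(blocks, grad_local, local_argmin, w_tilde):
    """One DANE iteration over m data blocks (simulated sequentially).

    grad_local(block, w): gradient of the local objective f^(l) at w.
    local_argmin(block, correction): minimizer of f^(l)(w) - correction^T w,
        i.e. the corrected local problem from the slide.
    """
    # global gradient at the anchor point (an all-reduce in a real cluster)
    grads = [grad_local(b, w_tilde) for b in blocks]
    grad_global = np.mean(grads, axis=0)
    # each machine removes its local gradient bias and solves independently
    local_solutions = [
        local_argmin(b, g_local - grad_global)   # min_w f^(l)(w) - (grad f^(l) - grad f)^T w
        for b, g_local in zip(blocks, grads)
    ]
    return np.mean(local_solutions, axis=0)      # averaged parameters become the next w~
```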

SLIDES 60-62

Relationship to 2nd Order Gradient Sampling

Each f^{(ℓ)}(w) takes n/m out of the n samples from f(w) = n^{−1} Σ_i f_i(w).

  f̃^{(ℓ)}(w) = f^{(ℓ)}(w) − (∇f^{(ℓ)}(w̃) − ∇f(w̃))⊤w

Expanding f^{(ℓ)}(w) around w̃:
  f̃^{(ℓ)}(w) ≈ f^{(ℓ)}(w̃) + ∇f^{(ℓ)}(w̃)⊤(w − w̃) + (1/2)(w − w̃)⊤∇²f^{(ℓ)}(w̃)(w − w̃) − (∇f^{(ℓ)}(w̃) − ∇f(w̃))⊤w.

Hence min_w f̃^{(ℓ)}(w) is equivalent to the approximate minimization of
  min_w [ f^{(ℓ)}(w̃) + ∇f(w̃)⊤(w − w̃) + (1/2)(w − w̃)⊤∇²f^{(ℓ)}(w̃)(w − w̃) ],
where ∇²f^{(ℓ)}(w̃) is a 2nd order gradient sample of ∇²f(w̃).

This leads to fast convergence.

SLIDE 63

Comparisons

[Figure: objective value versus iteration t on the COV1, ASTRO, and MNIST-47 datasets, comparing DANE, ADMM, OSA, and Opt.]

SLIDES 64-65

Summary

Optimization in machine learning: sum-over-data structure
  • suitable for statistical sampling

Traditional methods: gradient-based batch algorithms
  • do not take advantage of the special structure

Recent progress: stochastic optimization (focusing on my own work)
  • sampling of the 1st order gradient: variance reduction leads to a fast linear convergence rate
  • sampling of the 2nd order gradient: leads to fast distributed algorithms

SLIDES 66-67

References

  • Rie Johnson and T. Zhang. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction. NIPS 2013.
  • Lin Xiao and T. Zhang. A Proximal Stochastic Gradient Method with Progressive Variance Reduction. Tech Report arXiv:1403.4699, March 2014.
  • Shai Shalev-Shwartz and T. Zhang. Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization. JMLR 14:567-599, 2013.
  • Shai Shalev-Shwartz and T. Zhang. Accelerated Proximal Stochastic Dual Coordinate Ascent for Regularized Loss Minimization. Tech Report arXiv:1309.2375, Sep 2013.
  • Ohad Shamir, Nathan Srebro, and Tong Zhang. Communication-Efficient Distributed Optimization using an Approximate Newton-type Method. ICML 2014.

My friend Lin Xiao and his collaborators have made further improvements on multiple fronts in this line of big data optimization research ...