Recent Progresses in Stochastic Algorithms for Big Data Optimization

SLIDE 1

Recent Progresses in Stochastic Algorithms for Big Data Optimization

Tong Zhang

Rutgers University & Baidu Inc.

collaborators: Shai Shalev-Shwartz, Rie Johnson, Lin Xiao, Ohad Shamir and Nathan Srebro


SLIDES 2-4

Outline

Background:
  • big data optimization problem
  • 1st order stochastic gradient versus batch gradient: pros and cons

Stochastic gradient algorithms with variance reduction:
  • algorithm 1: SVRG (Stochastic Variance Reduced Gradient)
  • algorithm 2: SDCA (Stochastic Dual Coordinate Ascent)
  • algorithm 3: accelerated SDCA (with Nesterov acceleration)

Strategies for distributed computing:
  • algorithm 4: DANE (Distributed Approximate NEwton-type method)
  • behaves like 2nd order stochastic sampling
SLIDE 5

Mathematical Problem

Big Data Optimization Problem in machine learning:

  min_w f(w),   f(w) = (1/n) Σ_{i=1}^n f_i(w)

Special structure: sum over data. Big data (n large) requires distributed training.

SLIDE 6

Assumptions on loss function

λ-strong convexity:
  f(w′) ≥ f(w) + ∇f(w)⊤(w′ − w) + (λ/2)‖w′ − w‖_2^2

L-smoothness:
  f_i(w′) ≤ f_i(w) + ∇f_i(w)⊤(w′ − w) + (L/2)‖w′ − w‖_2^2

SLIDES 7-9

Example: Computational Advertising

Large scale regularized logistic regression:

  min_w f(w),   f(w) = (1/n) Σ_{i=1}^n f_i(w),   f_i(w) = ln(1 + e^{−w⊤x_i y_i}) + (λ/2)‖w‖_2^2

  • data (x_i, y_i) with y_i ∈ {±1}; parameter vector w
  • the objective is λ-strongly convex and L-smooth with L = 0.25 max_i ‖x_i‖_2^2 + λ
  • big data: n ∼ 10–100 billion; high dimension: dim(x_i) ∼ 10–100 billion

How to solve big optimization problems efficiently?
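As a concrete illustration of this objective, here is a minimal NumPy sketch of the regularized logistic loss and its gradient. The function and variable names (reg_logistic_loss_grad, X, y, lam) are illustrative placeholders, not part of the original slides.

```python
import numpy as np

def reg_logistic_loss_grad(w, X, y, lam):
    """f(w) = (1/n) sum_i log(1 + exp(-y_i w^T x_i)) + (lam/2) ||w||^2 and its gradient."""
    n = X.shape[0]
    margins = y * (X @ w)                       # y_i * w^T x_i for all i
    # log(1 + exp(-m)) computed stably as logaddexp(0, -m)
    loss = np.mean(np.logaddexp(0.0, -margins)) + 0.5 * lam * (w @ w)
    sigma = 1.0 / (1.0 + np.exp(margins))       # sigmoid(-margins)
    grad = -(X.T @ (sigma * y)) / n + lam * w   # chain rule through m_i = y_i x_i^T w
    return loss, grad
```

With X an (n, d) array and labels y in {±1}, any of the optimizers discussed in the following slides can be driven by this loss/gradient pair.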

SLIDES 10-12

Statistical Thinking: sampling

Objective function: f(w) = (1/n) Σ_{i=1}^n f_i(w)
  • sample the objective function: only optimize an approximate objective

1st order gradient: ∇f(w) = (1/n) Σ_{i=1}^n ∇f_i(w)
  • sample the 1st order gradient (stochastic gradient): converges to the exact optimum
  • variance reduction: fast rate

2nd order gradient: ∇²f(w) = (1/n) Σ_{i=1}^n ∇²f_i(w)
  • sample the 2nd order gradient (stochastic Newton): converges to the exact optimum with a fast rate, suited to distributed computing

SLIDES 13-14

Batch Optimization Method: Gradient Descent

Solve w∗ = arg min_w f(w), where f(w) = (1/n) Σ_{i=1}^n f_i(w).

Gradient Descent (GD): w_k = w_{k−1} − η_k ∇f(w_{k−1}).

How fast does this method converge to the optimal solution?

General result: converges to a local minimum under suitable conditions; the convergence rate depends on conditions on f(·). For λ-strongly convex and L-smooth problems the rate is linear:
  f(w_k) − f(w∗) = O((1 − ρ)^k),
where ρ = O(λ/L) is the inverse condition number.
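A minimal sketch of the batch GD update, assuming a full-gradient oracle such as the reg_logistic_loss_grad function sketched earlier (names are illustrative only):

```python
def gradient_descent(grad_fn, w0, eta, num_iters):
    """Batch gradient descent: w_k = w_{k-1} - eta * grad f(w_{k-1})."""
    w = w0.copy()
    for _ in range(num_iters):
        _, g = grad_fn(w)   # full gradient over all n examples: the costly step
        w -= eta * g        # for L-smooth, lambda-strongly convex f, eta = 1/L gives the linear rate
    return w
```

For example, grad_fn = lambda w: reg_logistic_loss_grad(w, X, y, lam) turns this into batch training of the logistic model above.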

SLIDES 15-16

Stochastic Approximate Gradient Computation

If f(w) = (1/n) Σ_{i=1}^n f_i(w), GD requires the computation of the full gradient, which is extremely costly:
  ∇f(w) = (1/n) Σ_{i=1}^n ∇f_i(w)

Idea: stochastic optimization employs a random sample (mini-batch) B to approximate
  ∇f(w) ≈ (1/|B|) Σ_{i∈B} ∇f_i(w)

  • it is an unbiased estimator
  • more efficient computation, but introduces variance
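A minimal sketch of the mini-batch gradient estimator and the resulting SGD loop, assuming per-example gradients ∇f_i are available through a hypothetical grad_fi(w, i) callback (all names are illustrative assumptions):

```python
import numpy as np

def minibatch_grad(w, grad_fi, n, batch_size, rng):
    """Unbiased estimate of grad f(w) = (1/n) sum_i grad f_i(w) from a random mini-batch B."""
    batch = rng.integers(0, n, size=batch_size)          # sample B uniformly with replacement
    return sum(grad_fi(w, i) for i in batch) / batch_size

def sgd(grad_fi, n, w0, batch_size=1, num_iters=1000, eta0=0.1, rng=None):
    """Plain SGD; the decaying step size eta0/(1+t) is the usual choice behind the O(1/t) rate."""
    rng = rng or np.random.default_rng(0)
    w = w0.copy()
    for t in range(num_iters):
        g = minibatch_grad(w, grad_fi, n, batch_size, rng)
        w -= eta0 / (1.0 + t) * g
    return w
```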

SLIDE 17

SGD versus GD

SGD: faster computation per step
  • sublinear convergence, due to the variance of the gradient approximation: f(w_t) − f(w∗) = Õ(1/t)

GD: slower computation per step
  • linear convergence: f(w_t) − f(w∗) = O((1 − ρ)^t)

SLIDES 18-19

Improving SGD via Variance Reduction

GD converges fast but each step is slow; SGD steps are fast but it converges slowly
  • slow convergence is due to the inherent variance

SGD as a statistical estimator of the gradient:
  • let g_i = ∇f_i
  • unbiasedness: E g_i = (1/n) Σ_{i=1}^n g_i = ∇f
  • error of using g_i to approximate ∇f: the variance E‖g_i − E g_i‖_2^2

Statistical thinking:
  • relate the variance to the optimization design
  • find other unbiased gradient estimators with smaller variance

SLIDES 20-22

Relating Statistical Variance to Optimization

Want to optimize min_w f(w). Full gradient: ∇f(w).

Given an unbiased random estimator g_i of ∇f(w) and the SGD rule w → w − ηg_i, the reduction of the objective satisfies

  E f(w − ηg_i) ≤ f(w) − (η − η²L/2)‖∇f(w)‖_2^2   [non-random part]
                       + (η²L/2) E‖g_i − E g_i‖_2^2   [variance part]

Smaller variance implies bigger reduction.
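The bound on this slide follows from L-smoothness plus the decomposition of E‖g_i‖²; a short derivation, filled in here using only the assumptions already stated on the earlier slides:

```latex
% L-smoothness of f gives, for the step w -> w - \eta g_i:
%   f(w - \eta g_i) \le f(w) - \eta \nabla f(w)^\top g_i + \tfrac{\eta^2 L}{2}\|g_i\|_2^2 .
% Take expectations, use unbiasedness E g_i = \nabla f(w), and
%   E\|g_i\|_2^2 = \|E g_i\|_2^2 + E\|g_i - E g_i\|_2^2 :
\begin{aligned}
E f(w - \eta g_i)
 &\le f(w) - \eta \|\nabla f(w)\|_2^2
      + \tfrac{\eta^2 L}{2}\bigl(\|\nabla f(w)\|_2^2 + E\|g_i - E g_i\|_2^2\bigr) \\
 &=   f(w) - \bigl(\eta - \tfrac{\eta^2 L}{2}\bigr)\|\nabla f(w)\|_2^2
      + \tfrac{\eta^2 L}{2}\, E\|g_i - E g_i\|_2^2 .
\end{aligned}
```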

SLIDES 23-25

Outline (repeated)

SLIDE 26

Stochastic Variance Reduced Gradient (SVRG) I

Objective function:
  f(w) = (1/n) Σ_{i=1}^n f_i(w) = (1/n) Σ_{i=1}^n f̃_i(w),
where
  f̃_i(w) = f_i(w) − (∇f_i(w̃) − ∇f(w̃))⊤w    [the correction terms sum to zero over i]

Pick w̃ to be an approximate solution (close to w∗). The SVRG rule (control variates) is

  w_t = w_{t−1} − η_t ∇f̃_i(w_{t−1}) = w_{t−1} − η_t [∇f_i(w_{t−1}) − ∇f_i(w̃) + ∇f(w̃)]    [small variance]

SLIDE 27

Stochastic Variance Reduced Gradient (SVRG) II

Assume that w̃ ≈ w∗ and w_{t−1} ≈ w∗. Then
  ∇f(w̃) ≈ ∇f(w∗) = 0 and ∇f_i(w_{t−1}) ≈ ∇f_i(w̃),
so ∇f_i(w_{t−1}) − ∇f_i(w̃) + ∇f(w̃) → 0.

It is therefore possible to choose a constant step size η_t = η instead of requiring η_t → 0. One can achieve comparable linear convergence with SVRG:
  E f(w_t) − f(w∗) = O((1 − ρ̃)^t),  where ρ̃ = O(λn/(L + λn));
convergence is faster than GD.
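A minimal sketch of SVRG as described on the last two slides, assuming per-example and full-gradient callbacks (grad_fi, grad_f); the names and the choice of updating the anchor to the last inner iterate (one common variant) are illustrative, not prescribed by the slides.

```python
import numpy as np

def svrg(grad_fi, grad_f, w0, n, eta, num_epochs, inner_iters, rng=None):
    """SVRG: inner updates use grad_fi(w, i) - grad_fi(w_snap, i) + grad_f(w_snap)."""
    rng = rng or np.random.default_rng(0)
    w_snap = w0.copy()                      # the anchor point w~
    for _ in range(num_epochs):
        mu = grad_f(w_snap)                 # one full gradient per epoch
        w = w_snap.copy()
        for _ in range(inner_iters):
            i = rng.integers(n)
            g = grad_fi(w, i) - grad_fi(w_snap, i) + mu   # variance-reduced gradient
            w -= eta * g                    # constant step size is fine here
        w_snap = w                          # update the anchor
    return w_snap
```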

SLIDE 28

Compare SVRG to Batch Gradient Descent Algorithm

Number of examples needed to achieve ε accuracy:
  • Batch GD: Õ(n · (L/λ) log(1/ε))
  • SVRG: Õ((n + L/λ) log(1/ε))

Assume an L-smooth loss: ‖∇f_i(w) − ∇f_i(w′)‖ ≤ L‖w − w′‖, and a λ-strongly convex objective: ‖∇f(w) − ∇f(w′)‖ ≥ λ‖w − w′‖.

The gain of SVRG over the batch algorithm is significant when n is large.
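To make the gap concrete, a worked example with illustrative numbers (not from the slides): take n = 10^9 examples and condition number L/λ = 10^4.

```latex
% Examples processed per factor of \log(1/\epsilon), under the bounds above:
\text{Batch GD: } n \cdot \tfrac{L}{\lambda} = 10^{9} \cdot 10^{4} = 10^{13},
\qquad
\text{SVRG: } n + \tfrac{L}{\lambda} = 10^{9} + 10^{4} \approx 10^{9},
% roughly a 10^4-fold reduction, which is why the gain grows with n.
```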

SLIDES 29-31

Outline (repeated)

SLIDES 32-34

Motivation of SDCA: regularized loss minimization

Assume we want to solve the Lasso problem:
  min_w (1/n) Σ_{i=1}^n (w⊤x_i − y_i)² + λ‖w‖_1

or the ridge regression problem:
  min_w (1/n) Σ_{i=1}^n (w⊤x_i − y_i)²   [loss]   + (λ/2)‖w‖_2^2   [regularization]

Our goal: solve regularized loss minimization problems as fast as we can.

A good solution leads to a stochastic algorithm called proximal Stochastic Dual Coordinate Ascent (Prox-SDCA). We show fast convergence of SDCA for many regularized loss minimization problems in machine learning.

SLIDE 35

General Problem

Want to solve:
  min_w P(w) := (1/n) Σ_{i=1}^n φ_i(X_i⊤ w) + λ g(w),
where the X_i are matrices and g(·) is strongly convex.

Examples:
  • Multi-class logistic loss: φ_i(X_i⊤ w) = ln Σ_{ℓ=1}^K exp(w⊤X_{i,ℓ}) − w⊤X_{i,y_i}
  • L1–L2 regularization: g(w) = (1/2)‖w‖_2^2 + (σ/λ)‖w‖_1

SLIDE 36

Dual Formulation

Primal:
  min_w P(w) := (1/n) Σ_{i=1}^n φ_i(X_i⊤ w) + λ g(w)

Dual:
  max_α D(α) := (1/n) Σ_{i=1}^n −φ_i^∗(−α_i) − λ g^∗( (1/(λn)) Σ_{i=1}^n X_i α_i ),

with the relationship
  w = ∇g^∗( (1/(λn)) Σ_{i=1}^n X_i α_i ).

The convex conjugate (dual) is defined as φ_i^∗(a) = sup_z (az − φ_i(z)).

SDCA: randomly pick i and optimize D(α) by varying α_i while keeping the other dual variables fixed.
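As an illustration of the "optimize one α_i at a time" idea, here is a minimal SDCA sketch for the special case of ridge regression (squared loss with g(w) = ½‖w‖²), where the single-coordinate maximization has a closed form. This concrete instantiation, including the per-epoch random permutation, is my own sketch rather than something specified on the slides.

```python
import numpy as np

def sdca_ridge(X, y, lam, num_epochs=10, rng=None):
    """SDCA for ridge regression: min_w (1/n) sum_i (x_i^T w - y_i)^2 + (lam/2)||w||^2.

    Maintains the primal-dual relationship w = (1/(lam*n)) * sum_i alpha_i x_i; for the
    squared loss the best dual coordinate step has the closed form used below.
    """
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    sq_norms = np.einsum("ij,ij->i", X, X)        # ||x_i||^2, precomputed once
    for _ in range(num_epochs):
        for i in rng.permutation(n):
            delta = (y[i] - X[i] @ w - 0.5 * alpha[i]) / (0.5 + sq_norms[i] / (lam * n))
            alpha[i] += delta                     # dual coordinate ascent step
            w += delta * X[i] / (lam * n)         # keep w consistent with alpha
    return w
```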

SLIDE 37

Example: L1–L2 Regularized Logistic Regression

Primal:
  P(w) = (1/n) Σ_{i=1}^n ln(1 + e^{−w⊤X_i Y_i})   [φ_i]   + (λ/2) w⊤w + σ‖w‖_1   [λ g(w)]

Dual: with α_i Y_i ∈ [0, 1],
  D(α) = (1/n) Σ_{i=1}^n [ −α_i Y_i ln(α_i Y_i) − (1 − α_i Y_i) ln(1 − α_i Y_i) ]   [−φ_i^∗(−α_i)]   − (λ/2)‖trunc(v, σ/λ)‖_2^2
  s.t. v = (1/(λn)) Σ_{i=1}^n α_i X_i;   w = trunc(v, σ/λ),

where the soft-thresholding operator is
  trunc(u, δ)_j = u_j − δ if u_j > δ;  0 if |u_j| ≤ δ;  u_j + δ if u_j < −δ.

SLIDE 38

Proximal-SDCA for L1-L2 Regularization

Algorithm: keep dual variables α and v = (λn)^{−1} Σ_i α_i X_i.

  • Randomly pick i.
  • Find ∆_i by approximately maximizing
      −φ_i^∗(α_i + ∆_i) − trunc(v, σ/λ)⊤X_i ∆_i − (1/(2λn))‖X_i‖_2^2 ∆_i²,
    where φ_i^∗(α_i + ∆) = (α_i + ∆)Y_i ln((α_i + ∆)Y_i) + (1 − (α_i + ∆)Y_i) ln(1 − (α_i + ∆)Y_i).
  • Update α = α + ∆_i · e_i and v = v + (λn)^{−1} ∆_i · X_i.

Let w = trunc(v, σ/λ).
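A minimal sketch of the soft-thresholding step and the outer structure of this Prox-SDCA loop; the inner one-dimensional maximization is left as a callback (solve_delta), since the slide only requires it to be solved approximately. All names and the loop ordering are illustrative assumptions.

```python
import numpy as np

def trunc(u, delta):
    """Soft-thresholding: shrink each coordinate of u toward zero by delta."""
    return np.sign(u) * np.maximum(np.abs(u) - delta, 0.0)

def prox_sdca_l1l2(X, Y, lam, sigma, solve_delta, num_epochs=10, rng=None):
    """Skeleton of Prox-SDCA for L1-L2 regularized losses.

    solve_delta(alpha_i, xi_dot_w, sq_norm_over_lam_n, y_i) should (approximately)
    maximize the one-dimensional dual objective from the slide and return Delta_i.
    """
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    alpha = np.zeros(n)
    v = np.zeros(d)                          # v = (1/(lam*n)) * sum_i alpha_i X_i
    for _ in range(num_epochs):
        for i in rng.permutation(n):
            w = trunc(v, sigma / lam)        # current primal iterate
            delta = solve_delta(alpha[i], X[i] @ w, (X[i] @ X[i]) / (lam * n), Y[i])
            alpha[i] += delta
            v += delta * X[i] / (lam * n)
    return trunc(v, sigma / lam)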

SLIDE 39

Convergence rate

The number of iterations needed to achieve ε accuracy:
  • for a (1/γ)-smooth loss: Õ( (n + 1/(γλ)) log(1/ε) )
  • for an L-Lipschitz loss: Õ( n + L²/(λε) )

SLIDES 40-43

Solving L1 with Smooth Loss

Assume we want to solve an L1 regularized problem to accuracy ε with smooth φ_i:
  min_w (1/n) Σ_{i=1}^n φ_i(w) + σ‖w‖_1.

Apply Prox-SDCA with the extra term (λ/2)‖w‖_2^2, where λ = O(ε):
  • number of iterations needed is Õ(n + 1/ε).

Compare to Dual Averaging SGD (Xiao): number of iterations needed is Õ(1/ε²).

Compare to batch FISTA (Nesterov's accelerated proximal gradient): number of iterations needed is Õ(n/√ε).

Prox-SDCA wins in the statistically interesting regime: ε > Ω(1/n²).

One can design an accelerated Prox-SDCA procedure that is always superior to FISTA.

SLIDES 44-46

Outline (repeated)

SLIDES 47-48

Accelerated Prox-SDCA

Solving:
  P(w) := (1/n) Σ_{i=1}^n φ_i(X_i⊤ w) + λ g(w)

  • The convergence rate of Prox-SDCA depends on O(1/λ).
  • This is inferior to accelerated methods when λ is very small (≪ O(1/n)), since acceleration has only an O(1/√λ) dependency.

Inner-outer Iteration Accelerated Prox-SDCA:
  • Pick a suitable κ = Θ(1/n) and β.
  • For t = 2, 3, . . . (outer iteration):
      • Let g̃_t(w) = λ g(w) + (κ/2)‖w‖_2^2 − κ w⊤y^{(t−1)}   (κ-strongly convex)
      • Let P̃_t(w) = P(w) − λ g(w) + g̃_t(w)   (redefines P(·); κ-strongly convex)
      • Approximately solve P̃_t(w) for (w^{(t)}, α^{(t)}) with Prox-SDCA   (inner iteration)
      • Let y^{(t)} = w^{(t)} + β(w^{(t)} − w^{(t−1)})   (acceleration)
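A minimal sketch of the inner-outer structure above, treating the inner Prox-SDCA solve as a black-box callback (inner_solver); the names and the fixed iteration count are illustrative assumptions, not specified on the slide.

```python
def accelerated_outer_loop(inner_solver, w_init, beta, num_outer):
    """Outer acceleration wrapper: each round approximately solves the kappa-strongly
    convex modified problem ~P_t centered at y, then extrapolates (Nesterov-style)."""
    w_prev = w_init.copy()
    w = w_init.copy()
    y = w_init.copy()
    for _ in range(num_outer):
        # inner_solver(y) approximately minimizes P(w) - lam*g(w) + g_t(w), where
        # g_t(w) = lam*g(w) + (kappa/2)||w||^2 - kappa * w^T y
        w_prev, w = w, inner_solver(y)
        y = w + beta * (w - w_prev)          # acceleration / extrapolation step
    return w
```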

SLIDE 49

Performance Comparisons

  Problem            Algorithm                        Runtime
  SVM                SGD                              d/(λε)
                     AGD (Nesterov)                   d n √(1/(λε))
                     Acc-Prox-SDCA                    d (n + min{1/(λε), √(n/(λε))})
  Lasso              SGD and variants                 d/ε²
                     Stochastic Coordinate Descent    d n/ε
                     FISTA                            d n √(1/ε)
                     Acc-Prox-SDCA                    d (n + min{1/ε, √(n/ε)})
  Ridge Regression   Exact                            d²n + d³
                     SGD, SDCA                        d (n + 1/λ)
                     AGD                              d n √(1/λ)
                     Acc-Prox-SDCA                    d (n + min{1/λ, √(n/λ)})

SLIDE 50

Summary of Sampling

1st order gradient: ∇f(w) = (1/n) Σ_{i=1}^n ∇f_i(w)
  • sample the 1st order gradient (stochastic gradient): variance reduction leads to fast linear convergence

2nd order gradient: ∇²f(w) = (1/n) Σ_{i=1}^n ∇²f_i(w)
  • sample the 2nd order gradient (stochastic Newton): converges to the exact optimum with a fast rate in the distributed computing setting

SLIDES 51-53

Outline (repeated)

SLIDE 54

Distributed Computing

Assume: data distributed over machines
  • m processors, each with n/m examples

Simple Computational Strategy: One Shot Averaging (OSA)
  • run the optimization on the m machines separately, obtaining parameters w^{(1)}, . . . , w^{(m)}
  • average the parameters: w̄ = m^{−1} Σ_{i=1}^m w^{(i)}

SLIDES 55-56

Improvement

OSA strategy's advantages:
  • machines run independently
  • simple and computationally efficient; asymptotically good in theory

Disadvantage:
  • practically inferior to training all examples on a single machine

Traditional solution in optimization: ADMM.
New idea: via 2nd order gradient sampling: Distributed Approximate NEwton (DANE).

SLIDE 57

Distribution Scheme

Assume: data distributed over machines with the decomposed problem
  f(w) = Σ_{ℓ=1}^m f^{(ℓ)}(w).

  • m processors; each f^{(ℓ)}(w) is built from n/m randomly partitioned examples
  • each machine holds a complete set of parameters

SLIDES 58-59

DANE

Start with w̃ obtained by OSA.

Iterate:
  • Take w̃ and define f̃^{(ℓ)}(w) = f^{(ℓ)}(w) − (∇f^{(ℓ)}(w̃) − ∇f(w̃))⊤w.
  • On each machine, solve w^{(ℓ)} = arg min_w f̃^{(ℓ)}(w) independently.
  • Take the partial average as the next w̃.

The gradient correction is similar to SVRG.
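A minimal single-process simulation of one DANE round, assuming the data are pre-partitioned into m blocks and a local solver (local_argmin) is available for the corrected objective. Everything here (names, treating f^(ℓ) as the average loss on block ℓ with f as the average of the f^(ℓ), the plain average in place of the partial average, the local solver interface) is an illustrative assumption.

```python
import numpy as np

def dane_round(blocks, grad_local, local_argmin, w_tilde):
    """One DANE iteration over m data blocks (simulated sequentially).

    grad_local(block, w): gradient of the local objective f^(l) at w.
    local_argmin(block, correction): minimizer of f^(l)(w) - correction^T w,
        i.e. the corrected local problem from the slide.
    """
    # global gradient at the anchor point (an all-reduce in a real cluster)
    grads = [grad_local(b, w_tilde) for b in blocks]
    grad_global = np.mean(grads, axis=0)
    # each machine removes its local gradient bias and solves independently
    local_solutions = [
        local_argmin(b, g_local - grad_global)   # min_w f^(l)(w) - (grad f^(l) - grad f)^T w
        for b, g_local in zip(blocks, grads)
    ]
    return np.mean(local_solutions, axis=0)      # averaged parameters become the next w~
```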

SLIDES 60-62

Relationship to 2nd Order Gradient Sampling

Each f^{(ℓ)}(w) takes n/m out of the n samples from f(w) = n^{−1} Σ_i f_i(w).

  f̃^{(ℓ)}(w) = f^{(ℓ)}(w) − (∇f^{(ℓ)}(w̃) − ∇f(w̃))⊤w

Expanding f^{(ℓ)}(w) around w̃:
  f̃^{(ℓ)}(w) ≈ f^{(ℓ)}(w̃) + ∇f^{(ℓ)}(w̃)⊤(w − w̃) + (1/2)(w − w̃)⊤∇²f^{(ℓ)}(w̃)(w − w̃) − (∇f^{(ℓ)}(w̃) − ∇f(w̃))⊤w.

Hence min_w f̃^{(ℓ)}(w) is equivalent to the approximate minimization of
  min_w [ f^{(ℓ)}(w̃) + ∇f(w̃)⊤(w − w̃) + (1/2)(w − w̃)⊤∇²f^{(ℓ)}(w̃)(w − w̃) ],
where ∇²f^{(ℓ)}(w̃) is a 2nd order gradient sample of ∇²f(w̃).

This leads to fast convergence.

SLIDE 63

Comparisons

[Figure: objective value versus iteration t on the COV1, ASTRO, and MNIST-47 datasets, comparing DANE, ADMM, OSA, and Opt.]

SLIDES 64-65

Summary

Optimization in machine learning: sum-over-data structure
  • suitable for statistical sampling

Traditional methods: gradient-based batch algorithms
  • do not take advantage of the special structure

Recent progress: stochastic optimization (focusing on my own work)
  • sampling of the 1st order gradient: variance reduction leads to a fast linear convergence rate
  • sampling of the 2nd order gradient: leads to fast distributed algorithms

SLIDES 66-67

References

  • Rie Johnson and T. Zhang. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction. NIPS 2013.
  • Lin Xiao and T. Zhang. A Proximal Stochastic Gradient Method with Progressive Variance Reduction. Tech Report arXiv:1403.4699, March 2014.
  • Shai Shalev-Shwartz and T. Zhang. Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization. JMLR 14:567-599, 2013.
  • Shai Shalev-Shwartz and T. Zhang. Accelerated Proximal Stochastic Dual Coordinate Ascent for Regularized Loss Minimization. Tech Report arXiv:1309.2375, Sep 2013.
  • Ohad Shamir, Nathan Srebro, and Tong Zhang. Communication-Efficient Distributed Optimization using an Approximate Newton-type Method. ICML 2014.

My friend Lin Xiao and his collaborators have made further improvements on multiple fronts in this line of big data optimization research ...