

slide-1
SLIDE 1

Stochastic Optimization Techniques for Big Data Machine Learning

Tong Zhang

Rutgers University & Baidu Inc.

  • T. Zhang

Big Data Optimization 1 / 73

slide-2
SLIDE 2

Outline

Background:

big data optimization in machine learning: special structure

  • T. Zhang

Big Data Optimization 2 / 73

slide-3
SLIDE 3

Outline

Background:

big data optimization in machine learning: special structure

Single machine optimization

  • stochastic gradient (1st order) versus batch gradient: pros and cons
  • algorithm 1: SVRG (Stochastic variance reduced gradient)
  • algorithm 2: SAGA (Stochastic Average Gradient ameliore)
  • algorithm 3: SDCA (Stochastic Dual Coordinate Ascent)
  • algorithm 4: accelerated SDCA (with Nesterov acceleration)

  • T. Zhang

Big Data Optimization 2 / 73

slide-4
SLIDE 4

Outline

Background:

big data optimization in machine learning: special structure

Single machine optimization

  • stochastic gradient (1st order) versus batch gradient: pros and cons
  • algorithm 1: SVRG (Stochastic variance reduced gradient)
  • algorithm 2: SAGA (Stochastic Average Gradient ameliore)
  • algorithm 3: SDCA (Stochastic Dual Coordinate Ascent)
  • algorithm 4: accelerated SDCA (with Nesterov acceleration)

Distributed optimization

  • algorithm 5: accelerated minibatch SDCA
  • algorithm 6: DANE (Distributed Approximate NEwton-type method); behaves like 2nd order stochastic sampling

  • T. Zhang

Big Data Optimization 2 / 73

slide-5
SLIDE 5

Mathematical Problem

Big Data Optimization Problem in machine learning:

min_w f(w),   f(w) = (1/n) ∑_{i=1}^n fi(w)

Special structure: sum over data, with large n

  • T. Zhang

Big Data Optimization 3 / 73

slide-6
SLIDE 6

Mathematical Problem

Big Data Optimization Problem in machine learning:

min_w f(w),   f(w) = (1/n) ∑_{i=1}^n fi(w)

Special structure: sum over data, with large n

Assumptions on the loss function:

  • λ-strong convexity: f(w′) ≥ f(w) + ∇f(w)⊤(w′ − w) + (λ/2)‖w′ − w‖₂²   (quadratic lower bound)
  • L-smoothness: fi(w′) ≤ fi(w) + ∇fi(w)⊤(w′ − w) + (L/2)‖w′ − w‖₂²   (quadratic upper bound)
  • T. Zhang

Big Data Optimization 3 / 73

slide-7
SLIDE 7

Example: Computational Advertising

Large scale regularized logistic regression:

min_w (1/n) ∑_{i=1}^n [ ln(1 + e^{−w⊤xi yi}) + (λ/2)‖w‖₂² ],   where the i-th summand is fi(w)

data (xi, yi) with yi ∈ {±1}; parameter vector w.

The objective is λ-strongly convex and L-smooth with L = 0.25 maxi ‖xi‖₂² + λ.

  • T. Zhang

Big Data Optimization 4 / 73

slide-8
SLIDE 8

Example: Computational Advertising

Large scale regularized logistic regression:

min_w (1/n) ∑_{i=1}^n [ ln(1 + e^{−w⊤xi yi}) + (λ/2)‖w‖₂² ],   where the i-th summand is fi(w)

data (xi, yi) with yi ∈ {±1}; parameter vector w.

The objective is λ-strongly convex and L-smooth with L = 0.25 maxi ‖xi‖₂² + λ.

big data: n ∼ 10 to 100 billion; high dimension: dim(xi) ∼ 10 to 100 billion

  • T. Zhang

Big Data Optimization 4 / 73

slide-9
SLIDE 9

Example: Computational Advertising

Large scale regularized logistic regression:

min_w (1/n) ∑_{i=1}^n [ ln(1 + e^{−w⊤xi yi}) + (λ/2)‖w‖₂² ],   where the i-th summand is fi(w)

data (xi, yi) with yi ∈ {±1}; parameter vector w.

The objective is λ-strongly convex and L-smooth with L = 0.25 maxi ‖xi‖₂² + λ.

big data: n ∼ 10 to 100 billion; high dimension: dim(xi) ∼ 10 to 100 billion

How to solve big optimization problems efficiently?

  • T. Zhang

Big Data Optimization 4 / 73
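To make this objective concrete, here is a minimal NumPy sketch of the regularized logistic loss and a per-example gradient. It is my own illustration, not from the slides; the data, dimensions, and λ are made up.

```python
import numpy as np

def logistic_obj(w, X, y, lam):
    """f(w) = (1/n) sum_i ln(1 + exp(-y_i w.x_i)) + (lam/2)||w||^2."""
    margins = y * (X @ w)
    return np.mean(np.log1p(np.exp(-margins))) + 0.5 * lam * (w @ w)

def logistic_grad_i(w, x_i, y_i, lam):
    """Gradient of one summand f_i(w); sigma is the logistic sigmoid."""
    sigma = 1.0 / (1.0 + np.exp(y_i * (x_i @ w)))
    return -sigma * y_i * x_i + lam * w

# toy usage with synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sign(rng.normal(size=100))
w = np.zeros(5)
print(logistic_obj(w, X, y, lam=0.01))
```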

slide-10
SLIDE 10

Optimization Problem: Communication Complexity

From simple to complex:
  • Single machine, single-core: can employ sequential algorithms

  • T. Zhang

Big Data Optimization 5 / 73

slide-11
SLIDE 11

Optimization Problem: Communication Complexity

From simple to complex:
  • Single machine, single-core: can employ sequential algorithms
  • Single machine, multi-core: relatively cheap communication

  • T. Zhang

Big Data Optimization 5 / 73

slide-12
SLIDE 12

Optimization Problem: Communication Complexity

From simple to complex:
  • Single machine, single-core: can employ sequential algorithms
  • Single machine, multi-core: relatively cheap communication
  • Multi-machine (synchronous): expensive communication

  • T. Zhang

Big Data Optimization 5 / 73

slide-13
SLIDE 13

Optimization Problem: Communication Complexity

From simple to complex:
  • Single machine, single-core: can employ sequential algorithms
  • Single machine, multi-core: relatively cheap communication
  • Multi-machine (synchronous): expensive communication
  • Multi-machine (asynchronous): break synchronization to reduce communication

  • T. Zhang

Big Data Optimization 5 / 73

slide-14
SLIDE 14

Optimization Problem: Communication Complexity

From simple to complex:
  • Single machine, single-core: can employ sequential algorithms
  • Single machine, multi-core: relatively cheap communication
  • Multi-machine (synchronous): expensive communication
  • Multi-machine (asynchronous): break synchronization to reduce communication

We want to solve the simple problems well first, then the more complex ones.

  • T. Zhang

Big Data Optimization 5 / 73

slide-15
SLIDE 15

Outline

Background:

big data optimization in machine learning: special structure

  • T. Zhang

Big Data Optimization 6 / 73

slide-16
SLIDE 16

Outline

Background:

big data optimization in machine learning: special structure

Single machine optimization

  • stochastic gradient (1st order) versus batch gradient: pros and cons
  • algorithm 1: SVRG (Stochastic variance reduced gradient)
  • algorithm 2: SAGA (Stochastic Average Gradient ameliore)
  • algorithm 3: SDCA (Stochastic Dual Coordinate Ascent)
  • algorithm 4: accelerated SDCA (with Nesterov acceleration)

  • T. Zhang

Big Data Optimization 6 / 73

slide-17
SLIDE 17

Outline

Background:

big data optimization in machine learning: special structure

Single machine optimization

  • stochastic gradient (1st order) versus batch gradient: pros and cons
  • algorithm 1: SVRG (Stochastic variance reduced gradient)
  • algorithm 2: SAGA (Stochastic Average Gradient ameliore)
  • algorithm 3: SDCA (Stochastic Dual Coordinate Ascent)
  • algorithm 4: accelerated SDCA (with Nesterov acceleration)

Distributed optimization

  • algorithm 5: accelerated minibatch SDCA
  • algorithm 6: DANE (Distributed Approximate NEwton-type method); behaves like 2nd order stochastic sampling

  • T. Zhang

Big Data Optimization 6 / 73

slide-18
SLIDE 18

Batch Optimization Method: Gradient Descent

Solve w∗ = arg min_w f(w),   f(w) = (1/n) ∑_{i=1}^n fi(w).

Gradient Descent (GD): wk = wk−1 − ηk ∇f(wk−1).

How fast does this method converge to the optimal solution?

  • T. Zhang

Big Data Optimization 7 / 73

slide-19
SLIDE 19

Batch Optimization Method: Gradient Descent

Solve w∗ = arg min_w f(w),   f(w) = (1/n) ∑_{i=1}^n fi(w).

Gradient Descent (GD): wk = wk−1 − ηk ∇f(wk−1).

How fast does this method converge to the optimal solution?

The convergence rate depends on the conditions of f(·). For λ-strongly convex and L-smooth problems, the rate is linear: f(wk) − f(w∗) = O((1 − ρ)^k), where ρ = O(λ/L) is the inverse condition number.

  • T. Zhang

Big Data Optimization 7 / 73

slide-20
SLIDE 20

Stochastic Approximate Gradient Computation

If f(w) = (1/n) ∑_{i=1}^n fi(w), GD requires the computation of the full gradient, which is extremely costly:

∇f(w) = (1/n) ∑_{i=1}^n ∇fi(w)

  • T. Zhang

Big Data Optimization 8 / 73

slide-21
SLIDE 21

Stochastic Approximate Gradient Computation

If f(w) = (1/n) ∑_{i=1}^n fi(w), GD requires the computation of the full gradient, which is extremely costly:

∇f(w) = (1/n) ∑_{i=1}^n ∇fi(w)

Idea: stochastic optimization employs a random sample (mini-batch) B to approximate

∇f(w) ≈ (1/|B|) ∑_{i∈B} ∇fi(w)

It is an unbiased estimator; more efficient computation, but introduces variance.

  • T. Zhang

Big Data Optimization 8 / 73
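The following short NumPy sketch (my own toy illustration on a least-squares objective, with made-up data and batch size) shows the point numerically: a random mini-batch gives an unbiased estimate of the full gradient at a fraction of the cost.

```python
import numpy as np

# f(w) = (1/n) sum_i 0.5*(w.x_i - y_i)^2; compare full vs. mini-batch gradient
rng = np.random.default_rng(0)
n, d, b = 10_000, 5, 64
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)
w = np.zeros(d)

full_grad = X.T @ (X @ w - y) / n              # O(n d) work
B = rng.choice(n, size=b, replace=False)       # random mini-batch of size b
mini_grad = X[B].T @ (X[B] @ w - y[B]) / b     # O(b d) work, unbiased estimate
print(np.linalg.norm(full_grad - mini_grad))   # estimation error due to variance
```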

slide-22
SLIDE 22

SGD versus GD

SGD: faster computation per step; sublinear convergence due to the variance of the gradient approximation: f(wt) − f(w∗) = Õ(1/t).

GD: slower computation per step; linear convergence: f(wt) − f(w∗) = O((1 − ρ)^t).

  • T. Zhang

Big Data Optimization 9 / 73

slide-23
SLIDE 23

SGD versus GD

SGD: faster computation per step; sublinear convergence due to the variance of the gradient approximation: f(wt) − f(w∗) = Õ(1/t).

GD: slower computation per step; linear convergence: f(wt) − f(w∗) = O((1 − ρ)^t).

Overall: SGD is fast in the beginning but slow asymptotically.

  • T. Zhang

Big Data Optimization 9 / 73

slide-24
SLIDE 24

SGD versus GD

[Figure: training error vs. computational cost for stochastic gradient descent and gradient descent]

One strategy: use SGD first; after a while, switch to batch methods such as L-BFGS.

  • T. Zhang

Big Data Optimization 10 / 73

slide-25
SLIDE 25

SGD versus GD

[Figure: training error vs. computational cost for stochastic gradient descent and gradient descent]

One strategy: use SGD first; after a while, switch to batch methods such as L-BFGS. However, one can do better.

  • T. Zhang

Big Data Optimization 10 / 73

slide-26
SLIDE 26

Improving SGD via Variance Reduction

GD converges fast but computation is slow; SGD computation is fast but it converges slowly

  • slow convergence is due to the inherent variance

SGD as a statistical estimator of the gradient:

  • let gi = ∇fi; unbiasedness: E gi = (1/n) ∑_{i=1}^n gi = ∇f
  • error of using gi to approximate ∇f: variance E‖gi − E gi‖₂²

  • T. Zhang

Big Data Optimization 11 / 73

slide-27
SLIDE 27

Improving SGD via Variance Reduction

GD converges fast but computation is slow; SGD computation is fast but it converges slowly

  • slow convergence is due to the inherent variance

SGD as a statistical estimator of the gradient:

  • let gi = ∇fi; unbiasedness: E gi = (1/n) ∑_{i=1}^n gi = ∇f
  • error of using gi to approximate ∇f: variance E‖gi − E gi‖₂²

Statistical thinking:

  • relating variance to optimization
  • designing other unbiased gradient estimators with smaller variance

  • T. Zhang

Big Data Optimization 11 / 73

slide-28
SLIDE 28

Improving SGD using Variance Reduction

The idea leads to modern stochastic algorithms for big data machine learning with fast convergence rate

  • T. Zhang

Big Data Optimization 12 / 73

slide-29
SLIDE 29

Improving SGD using Variance Reduction

The idea leads to modern stochastic algorithms for big data machine learning with fast convergence rates:

  • Collins et al (2008): for special problems, with a relatively complicated algorithm (Exponentiated Gradient on the dual)
  • Le Roux, Schmidt, Bach (NIPS 2012): a variant of SGD called SAG (stochastic average gradient), and later SAGA (Defazio, Bach, Lacoste-Julien, NIPS 2014)
  • Johnson and Zhang (NIPS 2013): SVRG (stochastic variance reduced gradient)
  • Shalev-Shwartz and Zhang (JMLR 2013): SDCA (stochastic dual coordinate ascent), and later a variant with Zheng Qu and Peter Richtarik

  • T. Zhang

Big Data Optimization 12 / 73

slide-30
SLIDE 30

Outline

Background:

big data optimization in machine learning: special structure

  • T. Zhang

Big Data Optimization 13 / 73

slide-31
SLIDE 31

Outline

Background:

big data optimization in machine learning: special structure

Single machine optimization

  • stochastic gradient (1st order) versus batch gradient: pros and cons
  • algorithm 1: SVRG (Stochastic variance reduced gradient)
  • algorithm 2: SAGA (Stochastic Average Gradient ameliore)
  • algorithm 3: SDCA (Stochastic Dual Coordinate Ascent)
  • algorithm 4: accelerated SDCA (with Nesterov acceleration)

  • T. Zhang

Big Data Optimization 13 / 73

slide-32
SLIDE 32

Outline

Background:

big data optimization in machine learning: special structure

Single machine optimization

  • stochastic gradient (1st order) versus batch gradient: pros and cons
  • algorithm 1: SVRG (Stochastic variance reduced gradient)
  • algorithm 2: SAGA (Stochastic Average Gradient ameliore)
  • algorithm 3: SDCA (Stochastic Dual Coordinate Ascent)
  • algorithm 4: accelerated SDCA (with Nesterov acceleration)

Distributed optimization

  • algorithm 5: accelerated minibatch SDCA
  • algorithm 6: DANE (Distributed Approximate NEwton-type method); behaves like 2nd order stochastic sampling

  • T. Zhang

Big Data Optimization 13 / 73

slide-33
SLIDE 33

Relating Statistical Variance to Optimization

Want to optimize min_w f(w). Full gradient: ∇f(w).

  • T. Zhang

Big Data Optimization 14 / 73

slide-34
SLIDE 34

Relating Statistical Variance to Optimization

Want to optimize min_w f(w). Full gradient: ∇f(w).

Given an unbiased random estimator gi of ∇f(w) and the SGD rule w → w − η gi, the reduction of the objective is

E f(w − η gi) ≤ f(w) − (η − η²L/2) ‖∇f(w)‖₂²   (non-random term)   + (η²L/2) E‖gi − E gi‖₂²   (variance term).

  • T. Zhang

Big Data Optimization 14 / 73

slide-35
SLIDE 35

Relating Statistical Variance to Optimization

Want to optimize min_w f(w). Full gradient: ∇f(w).

Given an unbiased random estimator gi of ∇f(w) and the SGD rule w → w − η gi, the reduction of the objective is

E f(w − η gi) ≤ f(w) − (η − η²L/2) ‖∇f(w)‖₂²   (non-random term)   + (η²L/2) E‖gi − E gi‖₂²   (variance term).

Smaller variance implies bigger reduction.

  • T. Zhang

Big Data Optimization 14 / 73

slide-36
SLIDE 36

Statistical Thinking: variance reduction techniques

Given an unbiased estimator gi of ∇f, how do we design other unbiased estimators with reduced variance?

  • T. Zhang

Big Data Optimization 15 / 73

slide-37
SLIDE 37

Statistical Thinking: variance reduction techniques

Given an unbiased estimator gi of ∇f, how do we design other unbiased estimators with reduced variance?

Control variates (leads to SVRG):
  • find g̃i ≈ gi
  • use the estimator g′i := gi − g̃i + E g̃i

  • T. Zhang

Big Data Optimization 15 / 73

slide-38
SLIDE 38

Statistical Thinking: variance reduction techniques

Given an unbiased estimator gi of ∇f, how do we design other unbiased estimators with reduced variance?

Control variates (leads to SVRG):
  • find g̃i ≈ gi
  • use the estimator g′i := gi − g̃i + E g̃i

Importance sampling (Zhao and Zhang, ICML 2014):
  • sample gi with probability proportional to ρi (E ρi = 1)
  • use the estimator gi/ρi

  • T. Zhang

Big Data Optimization 15 / 73

slide-39
SLIDE 39

Statistical Thinking: variance reduction techniques

Given an unbiased estimator gi of ∇f, how do we design other unbiased estimators with reduced variance?

Control variates (leads to SVRG):
  • find g̃i ≈ gi
  • use the estimator g′i := gi − g̃i + E g̃i

Importance sampling (Zhao and Zhang, ICML 2014):
  • sample gi with probability proportional to ρi (E ρi = 1)
  • use the estimator gi/ρi

Stratified sampling (Zhao and Zhang)

  • T. Zhang

Big Data Optimization 15 / 73
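A minimal NumPy sketch of the control-variate idea on a gradient estimate (my own illustration with made-up per-example gradients, not from the slides): if g̃i approximates gi and E g̃i is available, then gi − g̃i + E g̃i is still unbiased but has much smaller variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
G = rng.normal(size=(n, d)) + 3.0              # per-example gradients g_i (made up)
G_tilde = G + 0.1 * rng.normal(size=(n, d))    # approximations g~_i of g_i

full_grad = G.mean(axis=0)                     # E g_i, the quantity we want to estimate
mean_tilde = G_tilde.mean(axis=0)              # E g~_i, computed once

i = rng.integers(n)
plain = G[i]                                   # plain SGD estimator
cv = G[i] - G_tilde[i] + mean_tilde            # control-variate estimator (SVRG-style)

# both are unbiased for full_grad; the control-variate one has far smaller variance
var_plain = np.mean([np.sum((G[j] - full_grad) ** 2) for j in range(n)])
var_cv = np.mean([np.sum((G[j] - G_tilde[j] + mean_tilde - full_grad) ** 2) for j in range(n)])
print(var_plain, var_cv)
```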

slide-40
SLIDE 40

Stochastic Variance Reduced Gradient (SVRG) I

Objective function: f(w) = (1/n) ∑_{i=1}^n fi(w) = (1/n) ∑_{i=1}^n f̃i(w), where

f̃i(w) = fi(w) − (∇fi(w̃) − ∇f(w̃))⊤w   (the correction terms sum to zero over i).

Pick w̃ to be an approximate solution (close to w∗). The SVRG rule (control variates) is

wt = wt−1 − ηt ∇f̃i(wt−1) = wt−1 − ηt [∇fi(wt−1) − ∇fi(w̃) + ∇f(w̃)]   (small variance).

  • T. Zhang

Big Data Optimization 16 / 73

slide-41
SLIDE 41

Stochastic Variance Reduced Gradient (SVRG) II

Assume that w̃ ≈ w∗ and wt−1 ≈ w∗. Then ∇f(w̃) ≈ ∇f(w∗) = 0 and ∇fi(wt−1) ≈ ∇fi(w̃). This means ∇fi(wt−1) − ∇fi(w̃) + ∇f(w̃) → 0.

It is therefore possible to choose a constant step size ηt = η instead of requiring ηt → 0. One can achieve comparable linear convergence with SVRG: E f(wt) − f(w∗) = O((1 − ρ̃)^t), where ρ̃ = O(λn/(L + λn)); convergence is faster than GD.

  • T. Zhang

Big Data Optimization 17 / 73

slide-42
SLIDE 42

SVRG Algorithm

Procedure SVRG
Parameters: update frequency m and learning rate η
Initialize w̃0
Iterate: for s = 1, 2, . . .
    w̃ = w̃s−1
    µ̃ = (1/n) ∑_{i=1}^n ∇ψi(w̃)
    w0 = w̃
    Iterate: for t = 1, 2, . . . , m
        Randomly pick it ∈ {1, . . . , n} and update the weight
        wt = wt−1 − η (∇ψit(wt−1) − ∇ψit(w̃) + µ̃)
    end
    Set w̃s = wm
end

  • T. Zhang

Big Data Optimization 18 / 73
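A runnable NumPy sketch of the SVRG procedure above, applied to a toy ridge-regression instance. The function names, data, and step size are my own assumptions for illustration, not part of the slides.

```python
import numpy as np

def svrg(grad_i, full_grad, w0, n, eta, m, n_epochs, rng):
    """SVRG: the outer loop recomputes the full gradient at a snapshot w~,
    the inner loop takes variance-reduced stochastic steps."""
    w_tilde = w0.copy()
    for s in range(n_epochs):
        mu = full_grad(w_tilde)                # full gradient at the snapshot
        w = w_tilde.copy()
        for t in range(m):
            i = rng.integers(n)
            g = grad_i(w, i) - grad_i(w_tilde, i) + mu   # control-variate gradient
            w -= eta * g
        w_tilde = w                            # update the snapshot
    return w_tilde

# toy ridge regression: f_i(w) = 0.5*(w.x_i - y_i)^2 + 0.5*lam*||w||^2
rng = np.random.default_rng(0)
n, d, lam = 200, 10, 0.1
X = rng.normal(size=(n, d)); y = X @ rng.normal(size=d)
grad_i = lambda w, i: (X[i] @ w - y[i]) * X[i] + lam * w
full_grad = lambda w: X.T @ (X @ w - y) / n + lam * w
w = svrg(grad_i, full_grad, np.zeros(d), n, eta=0.01, m=2 * n, n_epochs=10, rng=rng)
print(np.linalg.norm(full_grad(w)))            # gradient norm should be small
```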

slide-43
SLIDE 43

SVRG v.s. Batch Gradient Descent: fast convergence

Number of examples needed to achieve ε accuracy:
  • Batch GD: Õ(n · (L/λ) log(1/ε))
  • SVRG: Õ((n + L/λ) log(1/ε))

Assume L-smooth loss fi and a λ-strongly convex objective function. SVRG has fast convergence: the condition number is effectively reduced. The gain of SVRG over the batch algorithm is significant when n is large.

  • T. Zhang

Big Data Optimization 19 / 73

slide-44
SLIDE 44

SVRG: variance

  • Convex case (left): least squares on MNIST;

Nonconvex case (right): neural nets on CIFAR-10. The numbers in the legends are learning rates.

  • T. Zhang

Big Data Optimization 20 / 73

slide-45
SLIDE 45

SVRG: convergence

  • Convex case (left): least squares on MNIST;

Nonconvex case (right): neural nets on CIFAR-10. The numbers in the legends are learning rates.

  • T. Zhang

Big Data Optimization 21 / 73

slide-46
SLIDE 46

Variance Reduction using Importance Sampling (combined with SVRG)

f(w) = (1/n) ∑_{i=1}^n fi(w). Li: smoothness param of fi(w); λ: strong convexity param of f(w).

Number of examples needed to achieve ε accuracy:
  • With uniform sampling: Õ((n + L/λ) log(1/ε)), where L = maxi Li
  • With importance sampling: Õ((n + L̄/λ) log(1/ε)), where L̄ = (1/n) ∑_{i=1}^n Li

  • T. Zhang

Big Data Optimization 22 / 73

slide-47
SLIDE 47

Outline

Background:

big data optimization in machine learning: special structure

  • T. Zhang

Big Data Optimization 23 / 73

slide-48
SLIDE 48

Outline

Background:

big data optimization in machine learning: special structure

Single machine optimization

  • stochastic gradient (1st order) versus batch gradient: pros and cons
  • algorithm 1: SVRG (Stochastic variance reduced gradient)
  • algorithm 2: SAGA (Stochastic Average Gradient ameliore)
  • algorithm 3: SDCA (Stochastic Dual Coordinate Ascent)
  • algorithm 4: accelerated SDCA (with Nesterov acceleration)

  • T. Zhang

Big Data Optimization 23 / 73

slide-49
SLIDE 49

Outline

Background:

big data optimization in machine learning: special structure

Single machine optimization

  • stochastic gradient (1st order) versus batch gradient: pros and cons
  • algorithm 1: SVRG (Stochastic variance reduced gradient)
  • algorithm 2: SAGA (Stochastic Average Gradient ameliore)
  • algorithm 3: SDCA (Stochastic Dual Coordinate Ascent)
  • algorithm 4: accelerated SDCA (with Nesterov acceleration)

Distributed optimization

  • algorithm 5: accelerated minibatch SDCA
  • algorithm 6: DANE (Distributed Approximate NEwton-type method); behaves like 2nd order stochastic sampling

  • T. Zhang

Big Data Optimization 23 / 73

slide-50
SLIDE 50

Motivation

Solve w∗ = arg min_w f(w),   f(w) = (1/n) ∑_{i=1}^n fi(w).

SGD with variance reduction via SVRG:

wt = wt−1 − ηt [∇fi(wt−1) − ∇fi(w̃) + ∇f(w̃)]   (small variance).

  • T. Zhang

Big Data Optimization 24 / 73

slide-51
SLIDE 51

Motivation

Solve w∗ = arg min_w f(w),   f(w) = (1/n) ∑_{i=1}^n fi(w).

SGD with variance reduction via SVRG:

wt = wt−1 − ηt [∇fi(wt−1) − ∇fi(w̃) + ∇f(w̃)]   (small variance).

Compute the full gradient ∇f(w̃) periodically at an intermediate w̃.

  • T. Zhang

Big Data Optimization 24 / 73

slide-52
SLIDE 52

Motivation

Solve w∗ = arg min_w f(w),   f(w) = (1/n) ∑_{i=1}^n fi(w).

SGD with variance reduction via SVRG:

wt = wt−1 − ηt [∇fi(wt−1) − ∇fi(w̃) + ∇f(w̃)]   (small variance).

Compute the full gradient ∇f(w̃) periodically at an intermediate w̃.

How to avoid computing ∇f(w̃)? Answer: keep previously calculated gradients.

  • T. Zhang

Big Data Optimization 24 / 73

slide-53
SLIDE 53

Stochastic Average Gradient ameliore: SAGA

Initialize: g̃i = ∇fi(w0) and g̃ = (1/n) ∑_{j=1}^n g̃j.

SAGA update rule: randomly select i, and

wt = wt−1 − ηt [∇fi(wt−1) − g̃i + g̃]
g̃ = g̃ + (∇fi(wt−1) − g̃i)/n
g̃i = ∇fi(wt−1)

Equivalent to:

wt = wt−1 − ηt [∇fi(wt−1) − ∇fi(w̃i) + (1/n) ∑_{j=1}^n ∇fj(w̃j)]   (small variance),   with w̃i = wt−1.

Compare to SVRG: wt = wt−1 − ηt [∇fi(wt−1) − ∇fi(w̃) + ∇f(w̃)]   (small variance).

  • T. Zhang

Big Data Optimization 25 / 73
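A NumPy sketch of the SAGA rule above, again on a toy ridge-regression instance of my own (data, step size, and helper names are assumptions for illustration).

```python
import numpy as np

def saga(grad_i, w0, n, eta, n_steps, rng):
    """SAGA: keep a table of the most recent gradient for every example,
    plus their running average, and use them as a control variate."""
    w = w0.copy()
    g_table = np.array([grad_i(w, i) for i in range(n)])  # g~_i = grad f_i(w0)
    g_avg = g_table.mean(axis=0)                          # g~ = average of the table
    for _ in range(n_steps):
        i = rng.integers(n)
        g_new = grad_i(w, i)
        w -= eta * (g_new - g_table[i] + g_avg)           # variance-reduced step
        g_avg += (g_new - g_table[i]) / n                 # update the running average
        g_table[i] = g_new                                # refresh the stored gradient
    return w

# toy usage on the same ridge-regression instance as the SVRG sketch
rng = np.random.default_rng(1)
n, d, lam = 200, 10, 0.1
X = rng.normal(size=(n, d)); y = X @ rng.normal(size=d)
grad_i = lambda w, i: (X[i] @ w - y[i]) * X[i] + lam * w
w = saga(grad_i, np.zeros(d), n, eta=0.01, n_steps=5000, rng=rng)
```

Note the trade-off against SVRG: SAGA avoids the periodic full-gradient pass at the cost of storing one gradient per example.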

slide-54
SLIDE 54

Variance Reduction

The gradient estimator of SAGA is unbiased:

E [∇fi(wt−1) − ∇fi(w̃i) + (1/n) ∑_{j=1}^n ∇fj(w̃j)] = ∇f(wt−1).

Since w̃i → w∗, we have

∇fi(wt−1) − ∇fi(w̃i) + (1/n) ∑_{j=1}^n ∇fj(w̃j) → 0.

Therefore the variance of the gradient estimator goes to zero.

  • T. Zhang

Big Data Optimization 26 / 73

slide-55
SLIDE 55

Theory of SAGA

Similar to SVRG, we have fast convergence for SAGA. Number of examples needed to achieve ε accuracy:
  • Batch GD: Õ(n · (L/λ) log(1/ε))
  • SVRG: Õ((n + L/λ) log(1/ε))
  • SAGA: Õ((n + L/λ) log(1/ε))

Assume L-smooth loss fi and a λ-strongly convex objective function.

  • T. Zhang

Big Data Optimization 27 / 73

slide-56
SLIDE 56

Outline

Background:

big data optimization in machine learning: special structure

  • T. Zhang

Big Data Optimization 28 / 73

slide-57
SLIDE 57

Outline

Background:

big data optimization in machine learning: special structure

Single machine optimization

  • stochastic gradient (1st order) versus batch gradient: pros and cons
  • algorithm 1: SVRG (Stochastic variance reduced gradient)
  • algorithm 2: SAGA (Stochastic Average Gradient ameliore)
  • algorithm 3: SDCA (Stochastic Dual Coordinate Ascent)
  • algorithm 4: accelerated SDCA (with Nesterov acceleration)

  • T. Zhang

Big Data Optimization 28 / 73

slide-58
SLIDE 58

Outline

Background:

big data optimization in machine learning: special structure

Single machine optimization

  • stochastic gradient (1st order) versus batch gradient: pros and cons
  • algorithm 1: SVRG (Stochastic variance reduced gradient)
  • algorithm 2: SAGA (Stochastic Average Gradient ameliore)
  • algorithm 3: SDCA (Stochastic Dual Coordinate Ascent)
  • algorithm 4: accelerated SDCA (with Nesterov acceleration)

Distributed optimization

  • algorithm 5: accelerated minibatch SDCA
  • algorithm 6: DANE (Distributed Approximate NEwton-type method); behaves like 2nd order stochastic sampling

  • T. Zhang

Big Data Optimization 28 / 73

slide-59
SLIDE 59

Motivation of SDCA: regularized loss minimization

Assume we want to solve the Lasso problem:

min_w (1/n) ∑_{i=1}^n (w⊤xi − yi)² + λ‖w‖₁

  • T. Zhang

Big Data Optimization 29 / 73

slide-60
SLIDE 60

Motivation of SDCA: regularized loss minimization

Assume we want to solve the Lasso problem:

min_w (1/n) ∑_{i=1}^n (w⊤xi − yi)² + λ‖w‖₁

or the ridge regression problem:

min_w (1/n) ∑_{i=1}^n (w⊤xi − yi)²   (loss)   + (λ/2)‖w‖₂²   (regularization)

Goal: solve regularized loss minimization problems as fast as we can.

  • T. Zhang

Big Data Optimization 29 / 73

slide-61
SLIDE 61

Motivation of SDCA: regularized loss minimization

Assume we want to solve the Lasso problem:

min_w (1/n) ∑_{i=1}^n (w⊤xi − yi)² + λ‖w‖₁

or the ridge regression problem:

min_w (1/n) ∑_{i=1}^n (w⊤xi − yi)²   (loss)   + (λ/2)‖w‖₂²   (regularization)

Goal: solve regularized loss minimization problems as fast as we can.

Solution: proximal Stochastic Dual Coordinate Ascent (Prox-SDCA). Can show: fast convergence of SDCA.

  • T. Zhang

Big Data Optimization 29 / 73

slide-62
SLIDE 62

Loss Minimization with L2 Regularization

min_w P(w) := (1/n) ∑_{i=1}^n φi(w⊤xi) + (λ/2)‖w‖₂²
  • T. Zhang

Big Data Optimization 30 / 73

slide-63
SLIDE 63

Loss Minimization with L2 Regularization

min_w P(w) := (1/n) ∑_{i=1}^n φi(w⊤xi) + (λ/2)‖w‖₂²

Examples:

  Loss                      φi(z)                    Lipschitz   Smooth
  SVM                       max{0, 1 − yi z}         ✓           ✗
  Logistic regression       log(1 + exp(−yi z))      ✓           ✓
  Abs-loss regression       |z − yi|                 ✓           ✗
  Square-loss regression    (z − yi)²                ✗           ✓

  • T. Zhang

Big Data Optimization 30 / 73

slide-64
SLIDE 64

Dual Formulation

Primal problem:

w∗ = arg min_w P(w) := (1/n) ∑_{i=1}^n φi(w⊤xi) + (λ/2)‖w‖₂²

Dual problem:

α∗ = arg max_{α∈Rⁿ} D(α) := (1/n) ∑_{i=1}^n −φ∗i(−αi) − (λ/2) ‖(1/(λn)) ∑_{i=1}^n αi xi‖₂²,

and the convex conjugate (dual) is defined as φ∗i(a) = sup_z (a z − φi(z)).

  • T. Zhang

Big Data Optimization 31 / 73

slide-65
SLIDE 65

Relationship of Primal and Dual Solutions

Weak duality: P(w) ≥ D(α) for all w and α.

Strong duality: P(w∗) = D(α∗), with the relationship

w∗ = (1/(λn)) ∑_{i=1}^n α∗,i · xi,   α∗,i = −φ′i(w∗⊤xi).

  • T. Zhang

Big Data Optimization 32 / 73

slide-66
SLIDE 66

Relationship of Primal and Dual Solutions

Weak duality: P(w) ≥ D(α) for all w and α.

Strong duality: P(w∗) = D(α∗), with the relationship

w∗ = (1/(λn)) ∑_{i=1}^n α∗,i · xi,   α∗,i = −φ′i(w∗⊤xi).

Duality gap: for any w and α,

P(w) − D(α)   (duality gap)   ≥   P(w) − P(w∗)   (primal sub-optimality).

  • T. Zhang

Big Data Optimization 32 / 73

slide-67
SLIDE 67

Example: Linear Support Vector Machine

Primal formulation:

P(w) = (1/n) ∑_{i=1}^n max(0, 1 − w⊤xi yi) + (λ/2)‖w‖₂²

Dual formulation:

D(α) = (1/n) ∑_{i=1}^n αi yi − (1/(2λn²)) ‖∑_{i=1}^n αi xi yi‖₂²,   with αi yi ∈ [0, 1].

Relationship: w∗ = (1/(λn)) ∑_{i=1}^n α∗,i xi

  • T. Zhang

Big Data Optimization 33 / 73

slide-68
SLIDE 68

Dual Coordinate Ascent (DCA)

Solve the dual problem using coordinate ascent:

max_{α∈Rⁿ} D(α),

and keep the corresponding primal solution using the relationship w = (1/(λn)) ∑_{i=1}^n αi xi.

DCA: at each iteration, optimize D(α) w.r.t. a single coordinate, while the rest of the coordinates are kept intact.

Stochastic Dual Coordinate Ascent (SDCA): choose the updated coordinate uniformly at random.

  • T. Zhang

Big Data Optimization 34 / 73

slide-69
SLIDE 69

Dual Coordinate Ascent (DCA)

Solve the dual problem using coordinate ascent:

max_{α∈Rⁿ} D(α),

and keep the corresponding primal solution using the relationship w = (1/(λn)) ∑_{i=1}^n αi xi.

DCA: at each iteration, optimize D(α) w.r.t. a single coordinate, while the rest of the coordinates are kept intact.

Stochastic Dual Coordinate Ascent (SDCA): choose the updated coordinate uniformly at random.

SMO (John Platt), Liblinear (Hsieh et al.), etc. implemented DCA.

  • T. Zhang

Big Data Optimization 34 / 73

slide-70
SLIDE 70

SDCA vs. SGD — update rule

Stochastic Gradient Descent (SGD) update rule:

w(t+1) = (1 − 1/t) w(t) − (1/(λt)) φ′i(w(t)⊤xi) xi

SDCA update rule:

  1. ∆i = argmax_{∆∈R} D(α(t) + ∆ ei)
  2. w(t+1) = w(t) + (1/(λn)) ∆i xi

Rather similar update rules. SDCA has several advantages:
  • Stopping criterion: duality gap smaller than a threshold
  • No need to tune the learning rate

  • T. Zhang

Big Data Optimization 35 / 73

slide-71
SLIDE 71

SDCA vs. SGD — update rule — Example

SVM with the hinge loss: φi(w) = max{0, 1 − yi w⊤xi}

SGD update rule:

w(t+1) = (1 − 1/t) w(t) + (1/(λt)) 1[yi xi⊤w(t) < 1] yi xi

SDCA update rule:

  1. ∆i = yi max(0, min(1, (1 − yi xi⊤w(t−1)) / (‖xi‖₂²/(λn)) + yi αi(t−1))) − αi(t−1)
  2. α(t+1) = α(t) + ∆i ei
  3. w(t+1) = w(t) + (1/(λn)) ∆i xi

  • T. Zhang

Big Data Optimization 36 / 73
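A NumPy sketch of this closed-form hinge-loss SDCA step, as I read the update above; the dataset, λ, and epoch count are made up for illustration.

```python
import numpy as np

def sdca_hinge(X, y, lam, n_epochs, rng):
    """SDCA for the L2-regularized hinge loss, maintaining w = (1/(lam*n)) sum_i alpha_i x_i."""
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    sq_norms = np.einsum('ij,ij->i', X, X)       # ||x_i||^2, precomputed
    for _ in range(n_epochs * n):
        i = rng.integers(n)
        # closed-form coordinate maximization of the dual for the hinge loss
        z = (1.0 - y[i] * (X[i] @ w)) / (sq_norms[i] / (lam * n)) + y[i] * alpha[i]
        delta = y[i] * max(0.0, min(1.0, z)) - alpha[i]
        alpha[i] += delta
        w += delta / (lam * n) * X[i]
    return w, alpha

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = np.sign(X @ rng.normal(size=20))
w, alpha = sdca_hinge(X, y, lam=1e-3, n_epochs=20, rng=rng)
```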

slide-72
SLIDE 72

SDCA vs. SGD — experimental observations

On the CCAT dataset, λ = 10⁻⁶, smoothed loss.

[Figure: primal suboptimality (log scale, 10⁻⁶ to 10⁻¹) vs. epochs; curves for SDCA, SDCA-Perm, and SGD]

  • T. Zhang

Big Data Optimization 37 / 73

slide-73
SLIDE 73

SDCA vs. SGD — experimental observations

On the CCAT dataset, λ = 10⁻⁶, smoothed loss.

[Figure: primal suboptimality (log scale, 10⁻⁶ to 10⁻¹) vs. epochs; curves for SDCA, SDCA-Perm, and SGD]

The convergence of SDCA is shockingly fast! How to explain this?

  • T. Zhang

Big Data Optimization 37 / 73

slide-74
SLIDE 74

SDCA vs. SGD — experimental observations

On the CCAT dataset, λ = 10⁻⁵, hinge-loss.

[Figure: primal suboptimality (log scale, 10⁻⁶ to 10⁻¹) vs. epochs; curves for SDCA, SDCA-Perm, and SGD]

How to understand the convergence behavior?

  • T. Zhang

Big Data Optimization 38 / 73

slide-75
SLIDE 75

Derivation of SDCA I

Consider the following optimization problem:

w∗ = arg min_w f(w),   f(w) = (1/n) ∑_{i=1}^n φi(w) + 0.5 λ w⊤w.

The optimality condition is

(1/n) ∑_{i=1}^n ∇φi(w∗) + λ w∗ = 0.

We have the dual representation:

w∗ = ∑_{i=1}^n α∗i,   α∗i = −(1/(λn)) ∇φi(w∗)

  • T. Zhang

Big Data Optimization 39 / 73

slide-76
SLIDE 76

Derivation of SDCA II

If we maintain the relationship w = ∑_{i=1}^n αi, then the SGD rule is

wt = wt−1 − η ∇φi(wt−1) − η λ wt−1   (large variance),

with the property E[wt | wt−1] = wt−1 − η ∇f(wt−1).

The dual representation of the SGD rule is αt,j = (1 − ηλ) αt−1,j − η ∇φi(wt−1) δi,j.

  • T. Zhang

Big Data Optimization 40 / 73

slide-77
SLIDE 77

Derivation of SDCA III

The alternative SDCA rule is to replace −ηλwt−1 by −ηλnαi: the primal update is

wt = wt−1 − η (∇φi(wt−1) + λ n αi)   (small variance),

and the dual update is αt,j = αt−1,j − η (∇φi(wt−1) + λ n αi) δi,j.

It is unbiased: E[wt | wt−1] = wt−1 − η ∇f(wt−1).

  • T. Zhang

Big Data Optimization 41 / 73

slide-78
SLIDE 78

Benefit of SDCA

Variance reduction effect: as w → w∗ and α → α∗, ∇φi(wt−1) + λnαi → 0, thus the stochastic variance goes to zero.

  • T. Zhang

Big Data Optimization 42 / 73

slide-79
SLIDE 79

Benefit of SDCA

Variance reduction effect: as w → w∗ and α → α∗, ∇φi(wt−1) + λnαi → 0, thus the stochastic variance goes to zero.

Fast convergence rate result: E f(wt) − f(w∗) = O(µ^k), where µ = 1 − O(λn/(1 + λn)). The convergence rate is fast even when λ = O(1/n). Better than batch methods.

  • T. Zhang

Big Data Optimization 42 / 73

slide-80
SLIDE 80

Fast Convergence of SDCA

The number of iterations needed to achieve ε accuracy:
  • For L-smooth loss: Õ((n + L/λ) log(1/ε))
  • For non-smooth but G-Lipschitz loss (bounded gradient): Õ(n + G²/(λε))

  • T. Zhang

Big Data Optimization 43 / 73

slide-81
SLIDE 81

Fast Convergence of SDCA

The number of iterations needed to achieve ε accuracy:
  • For L-smooth loss: Õ((n + L/λ) log(1/ε))
  • For non-smooth but G-Lipschitz loss (bounded gradient): Õ(n + G²/(λε))

Similar to that of SVRG, and effective when n is large.
  • T. Zhang

Big Data Optimization 43 / 73

slide-82
SLIDE 82

SDCA vs. DCA — Randomization is Crucial!

On the CCAT dataset, λ = 10⁻⁴, smoothed hinge-loss.

[Figure: primal suboptimality (log scale, 10⁻⁶ to 10⁻¹) vs. epochs; curves for SDCA, DCA-Cyclic, SDCA-Perm, and the theoretical bound]

  • T. Zhang

Big Data Optimization 44 / 73

slide-83
SLIDE 83

SDCA vs. DCA — Randomization is Crucial!

On the CCAT dataset, λ = 10⁻⁴, smoothed hinge-loss.

[Figure: primal suboptimality (log scale, 10⁻⁶ to 10⁻¹) vs. epochs; curves for SDCA, DCA-Cyclic, SDCA-Perm, and the theoretical bound]

Randomization is crucial!

  • T. Zhang

Big Data Optimization 44 / 73

slide-84
SLIDE 84

Proximal SDCA for General Regularizer

Want to solve:

min_w P(w) := (1/n) ∑_{i=1}^n φi(Xi⊤w) + λ g(w),

where the Xi are matrices and g(·) is strongly convex. Examples:

  • Multi-class logistic loss: φi(Xi⊤w) = ln ∑_{ℓ=1}^K exp(w⊤Xi,ℓ) − w⊤Xi,yi
  • L1-L2 regularization: g(w) = (1/2)‖w‖₂² + (σ/λ)‖w‖₁

  • T. Zhang

Big Data Optimization 45 / 73

slide-85
SLIDE 85

Dual Formulation

Primal:

min_w P(w) := (1/n) ∑_{i=1}^n φi(Xi⊤w) + λ g(w)

Dual:

max_α D(α) := (1/n) ∑_{i=1}^n −φ∗i(−αi) − λ g∗((1/(λn)) ∑_{i=1}^n Xi αi)

with the relationship

w = ∇g∗((1/(λn)) ∑_{i=1}^n Xi αi).

Prox-SDCA: extension of SDCA for arbitrary strongly convex g(w).

  • T. Zhang

Big Data Optimization 46 / 73

slide-86
SLIDE 86

Prox-SDCA

Dual:

max_α D(α) := (1/n) ∑_{i=1}^n −φ∗i(−αi) − λ g∗(v),   v = (1/(λn)) ∑_{i=1}^n Xi αi.

Assume g(w) is strongly convex with respect to a norm ‖·‖P with dual norm ‖·‖D.

  • T. Zhang

Big Data Optimization 47 / 73

slide-87
SLIDE 87

Prox-SDCA

Dual:

max_α D(α) := (1/n) ∑_{i=1}^n −φ∗i(−αi) − λ g∗(v),   v = (1/(λn)) ∑_{i=1}^n Xi αi.

Assume g(w) is strongly convex with respect to a norm ‖·‖P with dual norm ‖·‖D.

For each α, with the corresponding v and w, define the prox-dual

D̃α(∆α) = (1/n) ∑_{i=1}^n −φ∗i(−(αi + ∆αi)) − λ [ g∗(v) + ∇g∗(v)⊤ (1/(λn)) ∑_{i=1}^n Xi ∆αi + (1/2) ‖(1/(λn)) ∑_{i=1}^n Xi ∆αi‖D² ],

where the bracketed term is an upper bound of g∗(·).

  • T. Zhang

Big Data Optimization 47 / 73

slide-88
SLIDE 88

Prox-SDCA

Dual:

max_α D(α) := (1/n) ∑_{i=1}^n −φ∗i(−αi) − λ g∗(v),   v = (1/(λn)) ∑_{i=1}^n Xi αi.

Assume g(w) is strongly convex with respect to a norm ‖·‖P with dual norm ‖·‖D.

For each α, with the corresponding v and w, define the prox-dual

D̃α(∆α) = (1/n) ∑_{i=1}^n −φ∗i(−(αi + ∆αi)) − λ [ g∗(v) + ∇g∗(v)⊤ (1/(λn)) ∑_{i=1}^n Xi ∆αi + (1/2) ‖(1/(λn)) ∑_{i=1}^n Xi ∆αi‖D² ],

where the bracketed term is an upper bound of g∗(·).

Prox-SDCA: randomly pick i and update ∆αi by maximizing D̃α(·).

  • T. Zhang

Big Data Optimization 47 / 73

slide-89
SLIDE 89

Proximal-SDCA for L1-L2 Regularization

Algorithm: keep the dual α and v = (λn)⁻¹ ∑_i αi Xi

  • Randomly pick i
  • Find ∆i by approximately maximizing
    −φ∗i(αi + ∆i) − trunc(v, σ/λ)⊤Xi ∆i − (1/(2λn)) ‖Xi‖₂² ∆i²,
    where φ∗i(αi + ∆) = (αi + ∆)Yi ln((αi + ∆)Yi) + (1 − (αi + ∆)Yi) ln(1 − (αi + ∆)Yi)
  • α = α + ∆i · ei;   v = v + (λn)⁻¹ ∆i · Xi
  • Let w = trunc(v, σ/λ)

  • T. Zhang

Big Data Optimization 48 / 73
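For the L1-L2 regularizer g(w) = (1/2)‖w‖₂² + (σ/λ)‖w‖₁, the trunc(·, ·) operation that maps v to the primal iterate is coordinate-wise soft-thresholding; a minimal NumPy sketch of my reading of it follows.

```python
import numpy as np

def trunc(v, tau):
    """Coordinate-wise soft-threshold: trunc(v, tau)_j = sign(v_j) * max(|v_j| - tau, 0).
    With g(w) = 0.5*||w||^2 + (sigma/lam)*||w||_1, the primal iterate is w = trunc(v, sigma/lam)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

v = np.array([0.3, -0.05, 1.2, -0.8])
print(trunc(v, 0.1))   # -> [ 0.2  0.   1.1 -0.7]
```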

slide-90
SLIDE 90

Solving L1 with Smooth Loss

Want to solve L1 regularization to accuracy ε with smooth φi:

(1/n) ∑_{i=1}^n φi(w) + σ‖w‖₁.

Apply Prox-SDCA with the extra term 0.5 λ‖w‖₂², where λ = O(ε): the number of iterations needed by Prox-SDCA is Õ(n + 1/ε).

  • T. Zhang

Big Data Optimization 49 / 73

slide-91
SLIDE 91

Solving L1 with Smooth Loss

Want to solve L1 regularization to accuracy ε with smooth φi:

(1/n) ∑_{i=1}^n φi(w) + σ‖w‖₁.

Apply Prox-SDCA with the extra term 0.5 λ‖w‖₂², where λ = O(ε): the number of iterations needed by Prox-SDCA is Õ(n + 1/ε).

Compare to (number of examples needed to go through):
  • Dual Averaging SGD (Xiao): Õ(1/ε²)
  • FISTA (Nesterov's batch accelerated proximal gradient): Õ(n/√ε)

Prox-SDCA wins in the statistically interesting regime: ε > Ω(1/n²).

  • T. Zhang

Big Data Optimization 49 / 73

slide-92
SLIDE 92

Solving L1 with Smooth Loss

Want to solve L1 regularization to accuracy ε with smooth φi:

(1/n) ∑_{i=1}^n φi(w) + σ‖w‖₁.

Apply Prox-SDCA with the extra term 0.5 λ‖w‖₂², where λ = O(ε): the number of iterations needed by Prox-SDCA is Õ(n + 1/ε).

Compare to (number of examples needed to go through):
  • Dual Averaging SGD (Xiao): Õ(1/ε²)
  • FISTA (Nesterov's batch accelerated proximal gradient): Õ(n/√ε)

Prox-SDCA wins in the statistically interesting regime: ε > Ω(1/n²).

One can design an accelerated Prox-SDCA that is always superior to FISTA.

  • T. Zhang

Big Data Optimization 49 / 73

slide-93
SLIDE 93

Outline

Background:

big data optimization in machine learning: special structure

  • T. Zhang

Big Data Optimization 50 / 73

slide-94
SLIDE 94

Outline

Background:

big data optimization in machine learning: special structure

Single machine optimization

  • stochastic gradient (1st order) versus batch gradient: pros and cons
  • algorithm 1: SVRG (Stochastic variance reduced gradient)
  • algorithm 2: SAGA (Stochastic Average Gradient ameliore)
  • algorithm 3: SDCA (Stochastic Dual Coordinate Ascent)
  • algorithm 4: accelerated SDCA (with Nesterov acceleration)

  • T. Zhang

Big Data Optimization 50 / 73

slide-95
SLIDE 95

Outline

Background:

big data optimization in machine learning: special structure

Single machine optimization

  • stochastic gradient (1st order) versus batch gradient: pros and cons
  • algorithm 1: SVRG (Stochastic variance reduced gradient)
  • algorithm 2: SAGA (Stochastic Average Gradient ameliore)
  • algorithm 3: SDCA (Stochastic Dual Coordinate Ascent)
  • algorithm 4: accelerated SDCA (with Nesterov acceleration)

Distributed optimization

  • algorithm 5: accelerated minibatch SDCA
  • algorithm 6: DANE (Distributed Approximate NEwton-type method); behaves like 2nd order stochastic sampling

  • T. Zhang

Big Data Optimization 50 / 73

slide-96
SLIDE 96

Accelerated Prox-SDCA

Solving: P(w) := (1/n) ∑_{i=1}^n φi(Xi⊤w) + λ g(w)

The convergence rate of Prox-SDCA depends on O(1/λ); this is inferior to accelerated methods, which have an O(1/√λ) dependency, when λ is very small (≪ O(1/n)).

  • T. Zhang

Big Data Optimization 51 / 73

slide-97
SLIDE 97

Accelerated Prox-SDCA

Solving: P(w) := (1/n) ∑_{i=1}^n φi(Xi⊤w) + λ g(w)

The convergence rate of Prox-SDCA depends on O(1/λ); this is inferior to accelerated methods, which have an O(1/√λ) dependency, when λ is very small (≪ O(1/n)).

Inner-outer Iteration Accelerated Prox-SDCA
Pick a suitable κ = Θ(1/n) and β
For t = 2, 3, . . . (outer iteration)
    Let g̃t(w) = λ g(w) + 0.5 κ ‖w − y(t−1)‖₂²   (κ-strongly convex)
    Let P̃t(w) = P(w) − λ g(w) + g̃t(w)   (redefined P(·), now κ-strongly convex)
    Approximately solve P̃t(w) for (w(t), α(t)) with Prox-SDCA (inner iteration)
    Let y(t) = w(t) + β (w(t) − w(t−1))   (acceleration)

  • T. Zhang

Big Data Optimization 51 / 73

slide-98
SLIDE 98

Performance Comparisons

Problem             Algorithm                        Runtime
SVM                 SGD                              1/(λε)
                    AGD (Nesterov)                   n √(1/(λε))
                    Acc-Prox-SDCA                    n + min{1/(λε), √(n/(λε))}
Lasso               SGD and variants                 d/ε²
                    Stochastic Coordinate Descent    n/ε
                    FISTA                            n √(1/ε)
                    Acc-Prox-SDCA                    n + min{1/ε, √(n/ε)}
Ridge Regression    SGD, SDCA                        n + 1/λ
                    AGD                              n √(1/λ)
                    Acc-Prox-SDCA                    n + min{1/λ, √(n/λ)}

  • T. Zhang

Big Data Optimization 52 / 73

slide-99
SLIDE 99

Experiments of L1-L2 regularization

Smoothed hinge loss + (λ/2)‖w‖₂² + σ‖w‖₁ on the CCAT dataset with σ = 10⁻⁵.

[Figure: two panels (λ = 10⁻⁷ and λ = 10⁻⁹) showing objective value vs. epochs for AccProxSDCA, ProxSDCA, and FISTA]

  • T. Zhang

Big Data Optimization 53 / 73

slide-100
SLIDE 100

Additional Related Work on Acceleration

Methods achieving fast accelerated convergence comparable to Acc-Prox-SDCA.

Upper bounds:
  • Qihang Lin, Zhaosong Lu, Lin Xiao. An Accelerated Proximal Coordinate Gradient Method and its Application to Regularized Empirical Risk Minimization, arXiv, 2014 (APCG: accelerated proximal coordinate gradient)
  • Yuchen Zhang, Lin Xiao. Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization, ICML 2015

Matching lower bound:
  • Alekh Agarwal and Leon Bottou. A Lower Bound for the Optimization of Finite Sums, ICML 2015
  • T. Zhang

Big Data Optimization 54 / 73

slide-101
SLIDE 101

Distributed Computing: Distribution Schemes

Distribute data (data parallelism):
  • all machines have the same parameters
  • each machine has a different set of data

Distribute features (model parallelism):
  • all machines have the same data
  • each machine has a different set of parameters

Distribute data and features (data & model parallelism):
  • each machine has a different set of data
  • each machine has a different set of parameters

  • T. Zhang

Big Data Optimization 55 / 73

slide-102
SLIDE 102

Main Issues in Distributed Large Scale Learning

System design and network communication:
  • data parallelism: need to transfer a reasonably sized chunk of data each time (mini-batch)
  • model parallelism: distributed parameter vector

  • T. Zhang

Big Data Optimization 56 / 73

slide-103
SLIDE 103

Main Issues in Distributed Large Scale Learning

System design and network communication:
  • data parallelism: need to transfer a reasonably sized chunk of data each time (mini-batch)
  • model parallelism: distributed parameter vector

Model update strategy:
  • synchronous
  • asynchronous

  • T. Zhang

Big Data Optimization 56 / 73

slide-104
SLIDE 104

Outline

Background:

big data optimization in machine learning: special structure

  • T. Zhang

Big Data Optimization 57 / 73

slide-105
SLIDE 105

Outline

Background:

big data optimization in machine learning: special structure

Single machine optimization

  • stochastic gradient (1st order) versus batch gradient: pros and cons
  • algorithm 1: SVRG (Stochastic variance reduced gradient)
  • algorithm 2: SAGA (Stochastic Average Gradient ameliore)
  • algorithm 3: SDCA (Stochastic Dual Coordinate Ascent)
  • algorithm 4: accelerated SDCA (with Nesterov acceleration)

  • T. Zhang

Big Data Optimization 57 / 73

slide-106
SLIDE 106

Outline

Background:

big data optimization in machine learning: special structure

Single machine optimization

  • stochastic gradient (1st order) versus batch gradient: pros and cons
  • algorithm 1: SVRG (Stochastic variance reduced gradient)
  • algorithm 2: SAGA (Stochastic Average Gradient ameliore)
  • algorithm 3: SDCA (Stochastic Dual Coordinate Ascent)
  • algorithm 4: accelerated SDCA (with Nesterov acceleration)

Distributed optimization

  • algorithm 5: accelerated minibatch SDCA
  • algorithm 6: DANE (Distributed Approximate NEwton-type method); behaves like 2nd order stochastic sampling

  • T. Zhang

Big Data Optimization 57 / 73

slide-107
SLIDE 107

MiniBatch

Vanilla SDCA (or SGD) is difficult to parallelize.

Solution: use minibatches (thousands to hundreds of thousands).

  • T. Zhang

Big Data Optimization 58 / 73

slide-108
SLIDE 108

MiniBatch

Vanilla SDCA (or SGD) is difficult to parallelize.

Solution: use minibatches (thousands to hundreds of thousands).

Problem: a simple minibatch implementation slows down convergence; limited gain from using parallel computing.

  • T. Zhang

Big Data Optimization 58 / 73

slide-109
SLIDE 109

MiniBatch

Vanilla SDCA (or SGD) is difficult to parallelize.

Solution: use minibatches (thousands to hundreds of thousands).

Problem: a simple minibatch implementation slows down convergence; limited gain from using parallel computing.

Solution:
  • use Nesterov acceleration
  • use second order information (e.g. approximate Newton steps)

  • T. Zhang

Big Data Optimization 58 / 73

slide-110
SLIDE 110

MiniBatch SDCA with Acceleration

Parameters: scalars λ, γ and θ ∈ [0, 1]; mini-batch size m
Initialize α(0)₁ = · · · = α(0)ₙ = ᾱ(0) = 0, w(0) = 0
Iterate: for t = 1, 2, . . .
    u(t−1) = (1 − θ) w(t−1) + θ ᾱ(t−1)
    Randomly pick a subset I ⊂ {1, . . . , n} of size m and update
        αi(t) = (1 − θ) αi(t−1) − θ ∇fi(u(t−1))/(λn)   for i ∈ I
        αj(t) = αj(t−1)   for j ∉ I
    ᾱ(t) = ᾱ(t−1) + ∑_{i∈I} (αi(t) − αi(t−1))
    w(t) = (1 − θ) w(t−1) + θ ᾱ(t)
end

Better than vanilla block SDCA, and allows a large batch size.

  • T. Zhang

Big Data Optimization 59 / 73

slide-111
SLIDE 111

Theory

Generally, when the minibatch size m increases, it is easier to parallelize, but convergence slows down.

Theorem

If θ ≤ (1/4) min{ 1, √(γλn/m), γλn, (γλn)^{2/3}/m^{1/3} }, then after performing

t ≥ (n/(mθ)) log( (m ∆P(x(0)) + n ∆D(α(0))) / ε )

iterations, we have that E[P(x(t)) − D(α(t))] ≤ ε.
  • T. Zhang

Big Data Optimization 60 / 73

slide-112
SLIDE 112

Interesting Cases

The number of iterations required in several interesting regimes:

  Algorithm   γλn = Θ(1)   γλn = Θ(1/m)   γλn = Θ(m)
  SDCA        n            nm             n
  ASDCA       n/√m         n              n/m
  AGD         √n           √(nm)          √(n/m)

The number of examples processed in several interesting regimes:

  Algorithm   γλn = Θ(1)   γλn = Θ(1/m)   γλn = Θ(m)
  SDCA        n            nm             n
  ASDCA       n√m          nm             n
  AGD         n√n          n√(nm)         n√(n/m)
  • T. Zhang

Big Data Optimization 61 / 73

slide-113
SLIDE 113

Example

[Figure: primal suboptimality (10⁻³ to 10⁻¹) vs. number of processed examples (10⁶ to 10⁸) for minibatch sizes m = 52, 523, 5229, compared with AGD and SDCA]

MiniBatch SDCA with acceleration can employ large minibatch size.

  • T. Zhang

Big Data Optimization 62 / 73

slide-114
SLIDE 114

Outline

Background:

big data optimization in machine learning: special structure

  • T. Zhang

Big Data Optimization 63 / 73

slide-115
SLIDE 115

Outline

Background:

big data optimization in machine learning: special structure

Single machine optimization

  • stochastic gradient (1st order) versus batch gradient: pros and cons
  • algorithm 1: SVRG (Stochastic variance reduced gradient)
  • algorithm 2: SAGA (Stochastic Average Gradient ameliore)
  • algorithm 3: SDCA (Stochastic Dual Coordinate Ascent)
  • algorithm 4: accelerated SDCA (with Nesterov acceleration)

  • T. Zhang

Big Data Optimization 63 / 73

slide-116
SLIDE 116

Outline

Background:

big data optimization in machine learning: special structure

Single machine optimization

  • stochastic gradient (1st order) versus batch gradient: pros and cons
  • algorithm 1: SVRG (Stochastic variance reduced gradient)
  • algorithm 2: SAGA (Stochastic Average Gradient ameliore)
  • algorithm 3: SDCA (Stochastic Dual Coordinate Ascent)
  • algorithm 4: accelerated SDCA (with Nesterov acceleration)

Distributed optimization

  • algorithm 5: accelerated minibatch SDCA
  • algorithm 6: DANE (Distributed Approximate NEwton-type method); behaves like 2nd order stochastic sampling

  • T. Zhang

Big Data Optimization 63 / 73

slide-117
SLIDE 117

Improvement

OSA strategy's advantages:
  • machines run independently
  • simple and computationally efficient; asymptotically good in theory

Disadvantage:
  • practically inferior to training all examples on a single machine

  • T. Zhang

Big Data Optimization 64 / 73

slide-118
SLIDE 118

Improvement

OSA strategy's advantages:
  • machines run independently
  • simple and computationally efficient; asymptotically good in theory

Disadvantage:
  • practically inferior to training all examples on a single machine

Traditional solution in optimization: ADMM

New idea: Distributed Approximate NEwton (DANE), via 2nd order gradient sampling

  • T. Zhang

Big Data Optimization 64 / 73

slide-119
SLIDE 119

Distribution Scheme

Assume: data distributed over machines with the decomposed problem

f(w) = ∑_{ℓ=1}^m f(ℓ)(w).

  • m processors
  • each f(ℓ)(w) has n/m randomly partitioned examples
  • each machine holds a complete set of parameters

  • T. Zhang

Big Data Optimization 65 / 73

slide-120
SLIDE 120

DANE Algorithm

Starting with w̃ obtained using OSA
Iterate:
    Take w̃ and define f̃(ℓ)(w) = f(ℓ)(w) − (∇f(ℓ)(w̃) − ∇f(w̃))⊤w
    On each machine, solve w(ℓ) = arg min_w f̃(ℓ)(w) independently
    Take the partial average as the next w̃

  • T. Zhang

Big Data Optimization 66 / 73

slide-121
SLIDE 121

DANE Algorithm

Starting with w̃ obtained using OSA
Iterate:
    Take w̃ and define f̃(ℓ)(w) = f(ℓ)(w) − (∇f(ℓ)(w̃) − ∇f(w̃))⊤w
    On each machine, solve w(ℓ) = arg min_w f̃(ℓ)(w) independently
    Take the partial average as the next w̃

Leads to fast convergence: O((1 − ρ)^ℓ) with ρ ≈ 1.

  • T. Zhang

Big Data Optimization 66 / 73
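Below is a runnable NumPy sketch of one synchronous DANE round on distributed ridge regression. It is my own toy instance, assuming f(w) is the average of the local objectives f^(l)(w); the local subproblem is quadratic, so each machine can solve it exactly with a linear system.

```python
import numpy as np

def dane_round(w_tilde, X_parts, y_parts, lam, mu=0.0):
    """One synchronous DANE round, assuming f(w) = (1/m) sum_l f^(l)(w),
    f^(l)(w) = ||X_l w - y_l||^2 / (2 n_l) + (lam/2)||w||^2 (toy setup)."""
    m = len(X_parts)
    d = w_tilde.size
    local_grad = lambda X, y, w: X.T @ (X @ w - y) / len(y) + lam * w
    # one round of communication: average the local gradients at the current w~
    grad_full = sum(local_grad(X, y, w_tilde) for X, y in zip(X_parts, y_parts)) / m

    solutions = []
    for X, y in zip(X_parts, y_parts):
        n_l = len(y)
        c = local_grad(X, y, w_tilde) - grad_full        # gradient correction term
        # local objective f^(l)(w) - c^T w + (mu/2)||w - w~||^2 is quadratic,
        # so each machine solves a d x d linear system independently
        A = X.T @ X / n_l + (lam + mu) * np.eye(d)
        b = X.T @ y / n_l + c + mu * w_tilde
        solutions.append(np.linalg.solve(A, b))
    return sum(solutions) / m                            # average as the next w~

# toy usage: 4 machines, 500 examples each
rng = np.random.default_rng(0)
w_true = rng.normal(size=10)
X_parts = [rng.normal(size=(500, 10)) for _ in range(4)]
y_parts = [X @ w_true + 0.01 * rng.normal(size=500) for X in X_parts]
w = np.zeros(10)
for _ in range(5):
    w = dane_round(w, X_parts, y_parts, lam=0.1)
print(np.linalg.norm(w - w_true))
```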

slide-122
SLIDE 122

Reason: Approximate Newton Step

On each machine, we solve min_w f̃(ℓ)(w). It can be regarded as approximate minimization of

min_w [ f(w̃) + ∇f(w̃)⊤(w − w̃) + (1/2)(w − w̃)⊤ ∇²f(ℓ)(w̃) (w − w̃) ],

where ∇²f(ℓ)(w̃) is a 2nd order (Hessian) sample of ∇²f(w̃).

Approximate Newton step with a sampled approximation of the Hessian.

  • T. Zhang

Big Data Optimization 67 / 73

slide-123
SLIDE 123

Quadratic Loss

Newton:

w(t) = w(t−1) − [ (1/m) ∑_ℓ ∇²f(ℓ) ]⁻¹ ∇f(w(t−1))   (inverse of the full Hessian)

DANE:

w(t) = w(t−1) − (1/m) ∑_ℓ [ ∇²f(ℓ) + µI ]⁻¹ ∇f(w(t−1))   (inverse Hessian on machine ℓ),

where a small µ is added to regularize the Hessian.

  • T. Zhang

Big Data Optimization 68 / 73

slide-124
SLIDE 124

Comparisons

[Figure: objective value vs. iteration t on COV1, ASTRO, and MNIST-47; curves for DANE, ADMM, OSA, and the optimum]

  • T. Zhang

Big Data Optimization 69 / 73

slide-125
SLIDE 125

Summary

Optimization in machine learning: sum-over-data structure

Traditional methods: gradient based batch algorithms
  • do not take advantage of the special structure

Recent progress: stochastic optimization with fast rates
  • take advantage of the special structure: suitable for a single machine

  • T. Zhang

Big Data Optimization 70 / 73

slide-126
SLIDE 126

Summary

Optimization in machine learning: sum-over-data structure

Traditional methods: gradient based batch algorithms
  • do not take advantage of the special structure

Recent progress: stochastic optimization with fast rates
  • take advantage of the special structure: suitable for a single machine

Distributed computing (data parallelism and synchronous updates):
  • minibatch SDCA
  • DANE (batch algorithm on each machine + synchronization)

  • T. Zhang

Big Data Optimization 70 / 73

slide-127
SLIDE 127

Other Developments

Distributed large scale computing:
  • algorithmic side: ADMM, asynchronous updates (Hogwild), etc.
  • system side: distributed vector computing

Nonconvex methods:
  • nonconvex regularization and loss
  • neural networks and complex models

Closer integration of optimization and statistics

  • T. Zhang

Big Data Optimization 71 / 73

slide-128
SLIDE 128

References

SVRG:
  • Rie Johnson and TZ. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction, NIPS 2013.
  • Lin Xiao and TZ. A Proximal Stochastic Gradient Method with Progressive Variance Reduction, SIAM J. Optimization, 2014.

SAGA:
  • Defazio, Bach, and Lacoste-Julien. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives, NIPS 2014.

SDCA:
  • Shai Shalev-Shwartz and TZ. Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization, JMLR 2013.
  • Shai Shalev-Shwartz and TZ. Accelerated Proximal Stochastic Dual Coordinate Ascent for Regularized Loss Minimization, Math Programming, 2015.
  • Zheng Qu, Peter Richtarik, and TZ. Randomized Dual Coordinate Ascent with Arbitrary Sampling, arXiv, 2014.

  • T. Zhang

Big Data Optimization 72 / 73

slide-129
SLIDE 129

References (continued)

Mini-batch SDCA with acceleration:
  • Shai Shalev-Shwartz and TZ. Accelerated Mini-Batch Stochastic Dual Coordinate Ascent, NIPS 2013.

DANE:
  • Ohad Shamir, Nathan Srebro, and TZ. Communication-Efficient Distributed Optimization using an Approximate Newton-type Method, ICML 2014.

  • T. Zhang

Big Data Optimization 73 / 73