CoCoA: Communication-Efficient Coordinate Ascent (Virginia Smith) - PowerPoint Presentation



SLIDE 1

COCOA Communication-Efficient Coordinate Ascent

Virginia Smith

Martin Jaggi, Martin Takáč, Jonathan Terhorst,
 Sanjay Krishnan, Thomas Hofmann, & Michael I. Jordan

SLIDES 2-6

Outline: LARGE-SCALE OPTIMIZATION | COCOA | RESULTS

SLIDES 7-9

Machine Learning with Large Datasets

Applications: image/music/video tagging, document categorization, item recommendation, click-through rate prediction, sequence tagging, protein structure prediction, sensor data prediction, spam classification, fraud detection

SLIDES 10-13

Machine Learning Workflow

DATA & PROBLEM: classification, regression, collaborative filtering, …

MACHINE LEARNING MODEL: logistic regression, lasso, support vector machines, …

OPTIMIZATION ALGORITHM: gradient descent, coordinate descent, Newton's method, …

SLIDES 14-22

Example: SVM Classification

Margin geometry: the supporting hyperplanes w^T x - b = 1 and w^T x - b = -1 lie on either side of the decision boundary w^T x - b = 0, separated by a margin of width 2 / ||w||.

Objective (hinge loss plus L2 regularization):

    \min_{w \in \mathbb{R}^d} \; \frac{\lambda}{2} \|w\|^2 + \frac{1}{n} \sum_{i=1}^{n} \ell_{\mathrm{hinge}}(y_i w^\top x_i)

Many solvers apply: descent algorithms and line search methods; acceleration, momentum, and conjugate gradients; Newton and quasi-Newton methods; coordinate descent; stochastic and incremental gradient methods; SMO; SVMlight; LIBLINEAR.
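A minimal NumPy sketch (not from the slides; the data and the names lam, X, y are illustrative) that evaluates this hinge-loss objective, just to make the formula concrete:

    import numpy as np

    def svm_objective(w, X, y, lam):
        margins = y * (X @ w)                     # y_i * w^T x_i for every example
        hinge = np.maximum(0.0, 1.0 - margins)    # per-example hinge loss
        return 0.5 * lam * (w @ w) + hinge.mean()

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))                 # n = 100 examples, d = 5 features
    y = np.sign(X[:, 0] + 0.1 * rng.normal(size=100))
    print(svm_objective(np.zeros(5), X, y, lam=0.01))   # 1.0 at w = 0 (every hinge term is 1)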

SLIDES 23-30

Linear Regularized Loss Minimization

    \min_{w \in \mathbb{R}^d} \; \frac{\lambda}{2} \|w\|^2 + \frac{1}{n} \sum_{i=1}^{n} \ell_i(w^\top x_i)

Instances: support vector machines, logistic regression, lasso regression, ridge regression, etc.

Applications: image/music/video tagging, document categorization, item recommendation, click-through rate prediction, sequence tagging, protein structure prediction, sensor data prediction, spam classification, fraud detection
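The same template with different per-example losses gives different models. A hedged NumPy sketch (standard loss definitions; variable names are illustrative, not from the slides) for the hinge, logistic, and squared losses:

    import numpy as np

    def regularized_loss(w, X, y, lam, loss):
        z = X @ w                                  # the linear predictions w^T x_i
        return 0.5 * lam * (w @ w) + loss(z, y).mean()

    hinge    = lambda z, y: np.maximum(0.0, 1.0 - y * z)   # support vector machines
    logistic = lambda z, y: np.log1p(np.exp(-y * z))       # logistic regression
    squared  = lambda z, y: 0.5 * (z - y) ** 2             # ridge regression

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(50, 3)), rng.choice([-1.0, 1.0], size=50)
    w = rng.normal(size=3)
    for name, loss in [("svm", hinge), ("logistic", logistic), ("ridge", squared)]:
        print(name, regularized_loss(w, X, y, lam=0.1, loss=loss))

(Lasso swaps the L2 regularizer for an L1 norm; the loss-plus-regularizer structure is the same.)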

SLIDES 31-33

Machine Learning Workflow

DATA & PROBLEM: classification, regression, collaborative filtering, …

MACHINE LEARNING MODEL: logistic regression, lasso, support vector machines, …

OPTIMIZATION ALGORITHM: gradient descent, coordinate descent, Newton's method, …

SYSTEMS SETTING: multi-core, cluster, cloud, supercomputer, …

Open Problem: efficiently solving the objective when the data is distributed

SLIDES 34-53

Distributed Optimization

"Always communicate": every update is synchronized across the K machines with a reduce step,

    reduce:  w \leftarrow w - \alpha \sum_k \Delta w_k

✔ convergence guarantees
✗ high communication

"Never communicate": each machine solves its local problem and the results are averaged once (ZDWJ, 2012),

    average:  w := \frac{1}{K} \sum_k w_k

✔ low communication
✗ convergence not guaranteed
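An illustrative single-process simulation of the two extremes on a least-squares problem (made-up names and data, not the slides' code):

    import numpy as np

    rng = np.random.default_rng(0)
    K, n, d = 4, 400, 10
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = X @ w_true + 0.01 * rng.normal(size=n)
    parts = np.array_split(np.arange(n), K)          # data partitioned over K "machines"

    def local_grad(w, idx):
        return X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)

    # "always communicate": reduce the K local gradients on every round
    w = np.zeros(d)
    for _ in range(200):
        w -= 0.1 * sum(local_grad(w, idx) for idx in parts) / K

    # "never communicate": each machine solves its local problem, average once at the end
    w_avg = np.mean([np.linalg.lstsq(X[idx], y[idx], rcond=None)[0] for idx in parts], axis=0)

    print("reduce every round :", np.linalg.norm(w - w_true))
    print("one-shot averaging :", np.linalg.norm(w_avg - w_true))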

SLIDES 54-61

Mini-batch

    reduce:  w \leftarrow w - \frac{\alpha}{|b|} \sum_{i \in b} \Delta w_i

✔ convergence guarantees
✔ tunable communication

a natural middle-ground
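One reduce step of this form, sketched in NumPy with hinge subgradients (names are illustrative, not from the slides). The batch size |b| is the knob that trades computation per round against the number of communication rounds:

    import numpy as np

    def minibatch_step(w, X, y, batch, alpha):
        margins = y[batch] * (X[batch] @ w)
        # per-example hinge subgradients for the examples in the batch b
        grads = np.where(margins[:, None] < 1.0, -y[batch][:, None] * X[batch], 0.0)
        return w - alpha / len(batch) * grads.sum(axis=0)

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(100, 5)), rng.choice([-1.0, 1.0], size=100)
    batch = rng.choice(100, size=16, replace=False)   # one mini-batch of 16 examples
    print(minibatch_step(np.zeros(5), X, y, batch, alpha=0.1))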

SLIDES 62-65

Mini-batch Limitations

  1. METHODS BEYOND SGD
  2. STALE UPDATES
  3. AVERAGE OVER BATCH SIZE
slide-66
SLIDE 66

LARGE-SCALE OPTIMIZATION COCOA RESULTS

slide-67
SLIDE 67

LARGE-SCALE OPTIMIZATION COCOA RESULTS

SLIDES 68-70

Mini-batch Limitations

  1. METHODS BEYOND SGD → Use Primal-Dual Framework
  2. STALE UPDATES → Immediately apply local updates
  3. AVERAGE OVER BATCH SIZE → Average over K << batch size

Communication-Efficient Distributed Dual Coordinate Ascent (CoCoA)

SLIDES 71-76

1. Primal-Dual Framework

PRIMAL:

    \min_{w \in \mathbb{R}^d} \; \Big[ P(w) := \frac{\lambda}{2} \|w\|^2 + \frac{1}{n} \sum_{i=1}^{n} \ell_i(w^\top x_i) \Big]

DUAL:

    \max_{\alpha \in \mathbb{R}^n} \; \Big[ D(\alpha) := -\frac{\lambda}{2} \|A\alpha\|^2 - \frac{1}{n} \sum_{i=1}^{n} \ell_i^*(-\alpha_i) \Big], \qquad A_i = \frac{1}{\lambda n} x_i

Stopping criteria given by the duality gap
Good performance in practice
Default in software packages, e.g. liblinear
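A hedged sketch of this primal-dual pair for the hinge loss, with labels folded into the data (rows a_i = y_i x_i, so ℓ_i(z) = max(0, 1 - z) and ℓ_i*(-α_i) = -α_i on α_i ∈ [0, 1]); variable names are illustrative, not from the CoCoA code base. The duality gap P(w(α)) - D(α) ≥ 0 is the computable stopping criterion:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, lam = 200, 10, 0.1
    A_data = rng.normal(size=(n, d))               # rows are a_i = y_i * x_i

    def primal(w):
        return 0.5 * lam * (w @ w) + np.maximum(0.0, 1.0 - A_data @ w).mean()

    def dual(alpha):
        w = A_data.T @ alpha / (lam * n)           # w(alpha) = A alpha, with A_i = a_i / (lam * n)
        return -0.5 * lam * (w @ w) + alpha.mean()

    alpha = rng.uniform(size=n)                    # any feasible dual point in [0, 1]^n
    w = A_data.T @ alpha / (lam * n)
    print("duality gap:", primal(w) - dual(alpha)) # always >= 0, and -> 0 at the optimum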

SLIDES 77-79

2. Immediately Apply Updates

STALE (updates computed against the old w, applied only after the batch):

    for i ∈ b
        Δw ← Δw - α ∇_i P(w)
    end
    w ← w + Δw

FRESH (each update applied to w right away):

    for i ∈ b
        Δw ← Δw - α ∇_i P(w)
        w ← w + Δw
    end
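An illustrative contrast between the two orderings, using simple per-example hinge subgradient steps (assumed names, not the slides' code):

    import numpy as np

    def hinge_subgrad(w, x_i, y_i):
        return -y_i * x_i if y_i * (x_i @ w) < 1.0 else np.zeros_like(w)

    def stale_pass(w, X, y, batch, alpha):
        dw = np.zeros_like(w)
        for i in batch:                      # every update is computed at the OLD w
            dw -= alpha * hinge_subgrad(w, X[i], y[i])
        return w + dw                        # applied once, after the whole batch

    def fresh_pass(w, X, y, batch, alpha):
        for i in batch:                      # each update sees the latest w
            w = w - alpha * hinge_subgrad(w, X[i], y[i])
        return w

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(100, 5)), rng.choice([-1.0, 1.0], size=100)
    batch = rng.choice(100, size=32, replace=False)
    print("stale:", stale_pass(np.zeros(5), X, y, batch, 0.1))
    print("fresh:", fresh_pass(np.zeros(5), X, y, batch, 0.1))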

SLIDES 80-81

3. Average over K

    reduce:  w \leftarrow w + \frac{1}{K} \sum_k \Delta w_k

SLIDES 82-86

CoCoA

Algorithm 1: CoCoA
  Input: T ≥ 1, scaling parameter 1 ≤ β_K ≤ K (default: β_K := 1)
  Data: {(x_i, y_i)}_{i=1}^n distributed over K machines
  Initialize: α^(0)_[k] ← 0 for all machines k, and w^(0) ← 0
  for t = 1, 2, ..., T
      for all machines k = 1, 2, ..., K in parallel
          (Δα_[k], Δw_k) ← LocalDualMethod(α^(t-1)_[k], w^(t-1))
          α^(t)_[k] ← α^(t-1)_[k] + (β_K / K) Δα_[k]
      end
      reduce: w^(t) ← w^(t-1) + (β_K / K) Σ_{k=1}^K Δw_k
  end

Procedure A: LocalDualMethod (dual algorithm on machine k)
  Input: local α_[k] ∈ R^{n_k}, and w ∈ R^d consistent with the other coordinate blocks of α, i.e. w = Aα
  Data: local {(x_i, y_i)}_{i=1}^{n_k}
  Output: Δα_[k] and Δw := A_[k] Δα_[k]

✔ <10 lines of code in Spark
✔ primal-dual framework allows for any internal optimization method
✔ local updates applied immediately
✔ average over K
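A compact single-process simulation of Algorithm 1 (hinge loss with labels folded in, H local SDCA coordinate steps as the LocalDualMethod, β_K = 1). This is an illustrative sketch only, not the reference Spark implementation at github.com/gingsmith/cocoa:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, lam, K, T, H = 400, 10, 0.1, 4, 20, 100
    A_data = rng.normal(size=(n, d))                 # rows a_i = y_i * x_i
    parts = np.array_split(np.arange(n), K)          # dual coordinates owned by each machine

    def local_sdca(alpha, w, idx, H):
        """LocalDualMethod: H coordinate-ascent steps on machine k's block."""
        d_alpha, dw = np.zeros(n), np.zeros(d)
        for i in rng.choice(idx, size=H):
            a_i = A_data[i]
            old = alpha[i] + d_alpha[i]
            # closed-form maximization over alpha_i in [0, 1] for the hinge loss
            new = np.clip(old + (1.0 - a_i @ (w + dw)) * lam * n / (a_i @ a_i), 0.0, 1.0)
            d_alpha[i] += new - old
            dw += (new - old) * a_i / (lam * n)      # local updates applied immediately
        return d_alpha, dw

    alpha, w = np.zeros(n), np.zeros(d)
    for t in range(T):
        updates = [local_sdca(alpha, w, idx, H) for idx in parts]   # "in parallel"
        for d_alpha, dw in updates:                                 # reduce, with beta_K = 1
            alpha += d_alpha / K
            w += dw / K
    primal = 0.5 * lam * (w @ w) + np.maximum(0.0, 1.0 - A_data @ w).mean()
    dual = -0.5 * lam * (w @ w) + alpha.mean()
    print("duality gap after", T, "rounds:", primal - dual)

Here H plays the role of the amount of local work per outer round, the knob varied in the experiments later in the deck.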

SLIDES 87-93

Convergence

Assumptions: the losses ℓ_i are (1/γ)-smooth, and LocalDualMethod makes improvement Θ per step, e.g. for SDCA

    \Theta = \Big( 1 - \frac{\lambda n \gamma}{1 + \lambda n \gamma} \cdot \frac{1}{\tilde n} \Big)^H

Theorem:

    \mathbb{E}\big[ D(\alpha^*) - D(\alpha^{(T)}) \big] \le \Big( 1 - (1 - \Theta) \, \frac{1}{K} \, \frac{\lambda n \gamma}{\sigma + \lambda n \gamma} \Big)^T \Big( D(\alpha^*) - D(\alpha^{(0)}) \Big)

The bound applies also to the duality gap. Here σ is a measure of the difficulty of the data partition, with 0 ≤ σ ≤ n/K.

✔ it converges!
✔ inherits the convergence rate of the locally used method
✔ convergence rate is linear for smooth losses
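Illustrative arithmetic only (the numbers below are made up, not from the paper): plugging sample values into the bound to see the per-round contraction factor it predicts.

    # made-up values: lam = 1e-4, n = 1e5 points, (1/gamma)-smooth loss with gamma = 1,
    # K = 10 machines, H = 1e4 local SDCA steps per round, n_tilde = n / K local points
    lam, n, gamma, K, H = 1e-4, 100_000, 1.0, 10, 10_000
    n_tilde = n // K
    s = lam * n * gamma
    theta = (1.0 - s / (1.0 + s) / n_tilde) ** H        # local progress parameter Theta
    sigma = n / K                                       # worst-case partition difficulty
    factor = 1.0 - (1.0 - theta) / K * s / (sigma + s)  # per-round contraction of D(a*) - D(a)
    print(theta, factor)                                # ~0.40 and just below 1: geometric decay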

SLIDES 94-95

LARGE-SCALE OPTIMIZATION | COCOA | RESULTS!

SLIDE 96

Empirical Results in Spark

Dataset     Training (n)   Features (d)   Sparsity   Workers (K)
Cov         522,911        54             22.22%     4
Rcv1        677,399        47,236         0.16%      8
Imagenet    32,751         160,000        100%       32

SLIDES 97-98

[Figure: log primal suboptimality vs. time (s), one panel per dataset. Imagenet: COCOA (H=1e3), mini-batch-CD (H=1), local-SGD (H=1e3), mini-batch-SGD (H=10). Cov: COCOA (H=1e5), mini-batch-CD (H=100), local-SGD (H=1e5), batch-SGD (H=1). RCV1: COCOA (H=1e5), mini-batch-CD (H=100), local-SGD (H=1e4), batch-SGD (H=100).]

SLIDE 99

Effect of H on COCOA

[Figure: log primal suboptimality vs. time (s) for H ∈ {1e5, 1e4, 1e3, 100, 1}.]

SLIDES 100-106

COCOA Take-Aways

A framework for distributed optimization
Uses state-of-the-art primal-dual methods
Reduces communication:
  • applies updates immediately
  • averages over the # of machines
Strong convergence guarantees & results in practice

NIPS '14
github.com/gingsmith/cocoa

SLIDES 107-111

Future Work

  • Optimal scaling between adding & averaging
  • Similar rates for local-SGD?
  • Additional experiments: models/datasets
  • Integration into Spark MLlib

SLIDE 112

Thanks!

github.com/gingsmith/cocoa


 NIPS ‘14