Towards More Efficient Distributed Machine Learning

Jialei Wang

University of Chicago

ISE, NCSU, 12/13/2017

The empirical success of machine learning

Big Data + Advanced Modeling + Massive Computing

My research

Efficient ML
§ Distributed: Sparsity, Minibatch Prox, Dual Alternating
§ Opt & Sketch: Variance reduction, Primal-dual methods, Sketching
§ Online: Confidence-Weighted, Budget OGD, Cost-sensitive
§ Applications: Cloud removal, Collaborative ranking, Portfolio Opt


This talk

Focus: the Distributed branch of the research map above, specifically Sparsity and Minibatch Prox.

Motivation for Distributed Learning

Data Size
§ Data cannot be stored or processed on a single machine.
§ Use distributed computing to handle big data sets.
§ Example: Click-through rate prediction problem.

Data Collection
§ Data are naturally distributed on different machines.
§ Use distributed computing to learn from decentralized data.
§ Example: Google's federated learning problem.

Challenges in Distributed Learning

Efficiency in multiple dimensions
§ Sample: sample complexity matches the centralized solution.
§ Computation: floating point operations.
§ Communication: bandwidth (number of bits transmitted) + latency (rounds of communication).
§ Memory etc.

latency ≫ bandwidth ≫ FLOPS

Learning as Optimization

Stochastic Optimization Problems

$\min_{w \in \Omega} F(w) := \mathbb{E}_{z \sim \mathcal{D}}\,[\ell(w, z)].$

[Figure: a feed-forward neural network with an input layer, a hidden layer, and an output layer.]

Distributed Optimization for Learning

Reduction from (Distributed) Learning to Optimization
§ $m$ machines, each machine collects $n$ data instances $\{z_{ij}\}_{j=1}^{n}$.
§ Global objective: $\min_w f(w) := \frac{1}{m}\sum_{i=1}^{m}\big(\frac{1}{n}\sum_{j=1}^{n}\ell(w, z_{ij})\big)$.
§ Distributed consensus: $f_i(w) := \frac{1}{n}\sum_{j=1}^{n}\ell(w, z_{ij})$, $\;\; f(w) := \frac{1}{m}\sum_{i=1}^{m} f_i(w)$.

[Figure: five machines holding local objectives $f_1(w), \dots, f_5(w)$.]
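
To make the reduction concrete, here is a minimal single-process sketch (not from the talk) that builds the local objectives $f_i$ and checks that their average equals the pooled empirical objective; squared loss and simulated data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 5, 100, 10                      # machines, samples per machine, dimension
w_star = rng.normal(size=d)

# Each machine i holds its own sample {z_ij} = {(x_ij, y_ij)}, all drawn from the same D.
X = [rng.normal(size=(n, d)) for _ in range(m)]
y = [X_i @ w_star + 0.1 * rng.normal(size=n) for X_i in X]

def local_objective(w, i):
    """f_i(w) = (1/n) * sum_j loss(w, z_ij), here with squared loss."""
    r = X[i] @ w - y[i]
    return 0.5 * np.mean(r ** 2)

def global_objective(w):
    """f(w) = (1/m) * sum_i f_i(w): the consensus form of the ERM objective."""
    return np.mean([local_objective(w, i) for i in range(m)])

w = np.zeros(d)
# The consensus objective equals the loss averaged over all mn pooled points.
X_all, y_all = np.vstack(X), np.concatenate(y)
centralized = 0.5 * np.mean((X_all @ w - y_all) ** 2)
print(global_objective(w), centralized)   # the two values agree
```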

Distributed Optimization for Learning

What's special about Machine Learning?
§ Learning cares about the population objective $F(w) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(w, z)]$.
§ Stochastic nature of the data: the local objectives $f_i(w)$ are related, since each machine's sample $\{z_{ij}\}_{j=1}^{n}$ is drawn from the same distribution $\mathcal{D}$.

How to exploit similarity/relatedness between machines when designing distributed learning algorithms?

This talk: two specific problems

1. How to efficiently learn sparse linear predictors in a distributed environment?
2. How to parallelize stochastic gradient descent (SGD)?

Efficient Distributed Learning with Sparsity

International Conference on Machine Learning (ICML), 2017.

Joint work with Mladen Kolar, Nathan Srebro, and Tong Zhang.

High-level Overview

Problem
Efficient Distributed Sparse Learning with Optimal Statistical Accuracy.

Sparse Learning in High Dimension
§ On a single machine, use classical methods such as Lasso.
§ Statistical accuracy versus computation.

Distributed Learning with Big Data
§ Data are distributed on multiple machines.
§ Statistical accuracy versus computation and communication (efficiency).

High-dimensional Sparse Model

Number of variables ($p$) is often very large.
[Figure: predicting a phenotype value (e.g., 2.5) from genotype sequences such as GTGCATCTGACTCCTGAGGAGTAG ...]

Sparsity
§ Only a few variables are predictive.
§ $w^{*} = \arg\min_{w} \mathbb{E}_{(x,y) \sim \mathcal{D}}\,[\ell(y, \langle x, w \rangle)]$.
§ $S := \mathrm{support}(w^{*}) = \{ j \in [p] \mid w^{*}_{j} \neq 0 \}$ and $s = |S| \ll p$.

$\ell_1$ regularization (Tibshirani, 1996; Chen et al., 1998)
§ Statistical accuracy: good statistical properties.
§ Computational efficiency: convex surrogate of $\ell_0$.

Sparse Regression

Statistical Model
§ $y = \langle x, w^{*} \rangle + \text{noise}$.

Centralized $\ell_1$ regularization

$\hat{w}^{\mathrm{cent}} = \arg\min_{w} \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \ell(y_{ij}, \langle x_{ij}, w \rangle) + \lambda \lVert w \rVert_1.$

(Optimal) statistical accuracy: $\lVert \hat{w}^{\mathrm{cent}} - w^{*} \rVert_2 = O\Big(\sqrt{\tfrac{s \log p}{mn}}\Big)$.

Efficient method achieving optimal statistical accuracy?
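
For reference, the centralized baseline can be sketched in a few lines. The sketch below uses scikit-learn's Lasso, which minimizes $\frac{1}{2n}\lVert y - Xw\rVert_2^2 + \alpha\lVert w\rVert_1$, so $\alpha$ plays the role of $\lambda$; the data sizes and the constant in $\lambda$ are assumptions for illustration, not the paper's settings.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
m, n, p, s = 10, 200, 500, 10             # machines, per-machine samples, dimensions, sparsity
N = m * n

# Sparse ground truth: only s of the p coordinates are nonzero.
w_star = np.zeros(p)
w_star[:s] = rng.normal(size=s)

X = rng.normal(size=(N, p))
y = X @ w_star + 0.5 * rng.normal(size=N)

# Centralized l1-regularized estimator on the pooled data (requires shipping all N*p entries).
lam = np.sqrt(np.log(p) / N)              # lambda on the order of sqrt(log p / (mn)); ad hoc constant
w_cent = Lasso(alpha=lam, max_iter=10000).fit(X, y).coef_

print("||w_cent - w*||_2 =", np.linalg.norm(w_cent - w_star))
```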

This work

A communication- and computation-efficient approach to achieve optimal statistical accuracy:

Regime n ≳ m s² log p
  Approach    | Communication | Computation
  Centralize  | n · p         | T_lasso(mn, p)
  Avg-Debias  | p             | p · T_lasso(n, p)
  This work   | p             | 2 · T_lasso(n, p)

Regime m s² log p ≳ n ≳ s² log p
  Approach    | Communication | Computation
  Centralize  | n · p         | T_lasso(mn, p)
  Avg-Debias  | ✗             | ✗
  This work   | log m · p     | log m · T_lasso(n, p)

T_lasso(n, p): runtime for solving a lasso problem of size n × p.

The Proposed Approach

Step 0: Local $\ell_1$ Regularized Problem
Solve
$\hat{w}_{1} = \arg\min_{w} f_{1}(w) + \lambda_{1} \lVert w \rVert_1.$

Steps 1, ..., t: Shifted $\ell_1$ Regularized Problem
Communicate $\hat{w}_{t}$ and the local gradients $\nabla f_{1}(\hat{w}_{t}), \dots, \nabla f_{m}(\hat{w}_{t})$, then solve
$\hat{w}_{t+1} = \arg\min_{w}\; f_{1}(w) + \Big\langle \tfrac{1}{m} \textstyle\sum_{j \in [m]} \nabla f_{j}(\hat{w}_{t}) - \nabla f_{1}(\hat{w}_{t}),\; w \Big\rangle + \lambda_{t+1} \lVert w \rVert_1.$
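
To illustrate one round of this update under simplifying assumptions (squared loss, simulated data, a plain proximal-gradient inner solver), the sketch below forms the gradient shift $\frac{1}{m}\sum_j \nabla f_j(\hat w_t) - \nabla f_1(\hat w_t)$ on machine 1 and solves the shifted lasso by iterative soft-thresholding. This is only a cartoon of the method, not the paper's implementation.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def shifted_lasso(X1, y1, shift, lam, iters=500):
    """argmin_w f_1(w) + <shift, w> + lam*||w||_1 via proximal gradient, where
    f_1(w) = (1/(2n))*||X1 w - y1||^2 and shift = mean_j grad f_j(w_t) - grad f_1(w_t)."""
    n, p = X1.shape
    step = 1.0 / (np.linalg.norm(X1, 2) ** 2 / n)   # 1 / Lipschitz constant of grad f_1
    w = np.zeros(p)
    for _ in range(iters):
        grad = X1.T @ (X1 @ w - y1) / n + shift
        w = soft_threshold(w - step * grad, step * lam)
    return w

# Tiny illustration: machine 1's data plus the aggregated gradients from all machines.
rng = np.random.default_rng(0)
m, n, p = 5, 100, 50
w_star = np.zeros(p); w_star[:5] = 1.0
Xs = [rng.normal(size=(n, p)) for _ in range(m)]
ys = [X @ w_star + 0.1 * rng.normal(size=n) for X in Xs]

w_t = np.zeros(p)                                    # current iterate (e.g., from Step 0)
grads = [X.T @ (X @ w_t - y) / n for X, y in zip(Xs, ys)]
shift = np.mean(grads, axis=0) - grads[0]            # global minus local first-order info
w_next = shifted_lasso(Xs[0], ys[0], shift, lam=0.05)
print(np.nonzero(np.round(w_next, 2))[0])            # recovered support (illustrative)
```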

Interpretation

Communication efficiency
$\arg\min_{w}\; f_{1}(w) + \Big\langle \tfrac{1}{m} \textstyle\sum_{j \in [m]} \nabla f_{j}(\hat{w}_{t}) - \nabla f_{1}(\hat{w}_{t}),\; w \Big\rangle + \lambda_{t+1} \lVert w \rVert_1.$
§ Combine global first-order and local higher-order information.

Quadratic objective without $\ell_1$ regularization
§ Closed-form update
$\hat{w}_{t+1} = \hat{w}_{t} - \big( \nabla^2 f_{1}(\hat{w}_{t}) \big)^{-1} \Big( \tfrac{1}{m} \textstyle\sum_{j \in [m]} \nabla f_{j}(\hat{w}_{t}) \Big)$: a sub-sampled Newton direction.

Key: adjust the $\ell_1$ regularization level at each round!
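
A small numpy check of this interpretation, under the assumption of squared loss (so $f_1$ is exactly quadratic): minimizing the shifted objective without the $\ell_1$ term reproduces the sub-sampled Newton step $\hat w_t - (\nabla^2 f_1(\hat w_t))^{-1}\big(\frac{1}{m}\sum_j \nabla f_j(\hat w_t)\big)$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, p = 4, 200, 8
w_star = rng.normal(size=p)
Xs = [rng.normal(size=(n, p)) for _ in range(m)]
ys = [X @ w_star + 0.1 * rng.normal(size=n) for X, in zip(Xs)] if False else \
     [X @ w_star + 0.1 * rng.normal(size=n) for X in Xs]

w_t = np.zeros(p)
grads = [X.T @ (X @ w_t - y) / n for X, y in zip(Xs, ys)]   # local gradients at w_t
g_bar = np.mean(grads, axis=0)                              # averaged (global) gradient
H1 = Xs[0].T @ Xs[0] / n                                    # local Hessian on machine 1

# Minimizing f_1(w) + <g_bar - grads[0], w> for quadratic f_1 gives the sub-sampled Newton step:
w_next = w_t - np.linalg.solve(H1, g_bar)
print(np.linalg.norm(w_next - w_star))                      # already close to w* after one step
```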

Related Work

Distributed first-order methods
§ e.g., (accelerated) proximal gradient.
§ Slow convergence → heavy communication.

Average after de-biasing (Avg-Debias) method (Lee et al., 2017)
§ Matches the centralized statistical error in one shot under certain conditions.
§ Not practical because of (i) computation and (ii) a stringent condition.

Theoretical Results (regression for example)

Regularization level
$\lambda_{t+1} \asymp \underbrace{\sqrt{\tfrac{\log p}{mn}}}_{\text{centralize}} + \underbrace{\sqrt{\tfrac{\log p}{n}} \left( s \sqrt{\tfrac{\log p}{n}} \right)^{t}}_{\text{decaying}}.$

Statistical Accuracy
$\lVert \hat{w}_{t+1} - w^{*} \rVert_1 \lesssim_P \underbrace{s \sqrt{\tfrac{\log p}{mn}}}_{\text{centralize}} + \underbrace{\left( s \sqrt{\tfrac{\log p}{n}} \right)^{t+1}}_{\text{decaying}}, \qquad \lVert \hat{w}_{t+1} - w^{*} \rVert_2 \lesssim_P \underbrace{\sqrt{\tfrac{s \log p}{mn}}}_{\text{centralize}} + \underbrace{\sqrt{\tfrac{s \log p}{n}} \left( s \sqrt{\tfrac{\log p}{n}} \right)^{t}}_{\text{decaying}}.$

Theoretical Comparison

Centralize
$\lVert \hat{w}^{\mathrm{cent}} - w^{*} \rVert_2 \lesssim_P \sqrt{\tfrac{s \log p}{mn}}.$

Proposed Method
After one round of communication:
$\lVert \hat{w}_{2} - w^{*} \rVert_2 \lesssim_P \sqrt{\tfrac{s \log p}{mn}} + \tfrac{s^{3/2} \log p}{n}.$
§ Matches Avg-Debias; matches Centralize when $n \gtrsim m s^{2} \log p$.

After $t \gtrsim \log m$ rounds of communication (when $n \gtrsim s^{2} \log p$):
$\lVert \hat{w}_{t+1} - w^{*} \rVert_2 \lesssim_P \sqrt{\tfrac{s \log p}{mn}}.$

Experiments (simulation)

Sparse Regression, comparing Local, Prox-GD, Centralize, Avg-Debias, and EDSL (this work) for m = 5, 10, 20 machines.

[Figure: estimation error versus rounds of communication (1-9) for two designs with s = 10 and X ∼ N(0, Σ): Σ_ij = 0.5^{|i−j|} and Σ_ij = 0.5^{|i−j|/5}.]

Experiments (real data)

Regression: connect4, dna, year (normalized MSE versus rounds of communication).

Classification: w8a, mitface, spambase (classification error versus rounds of communication).

[Figure: EDSL compared with Local, Prox-GD, Centralize, and Avg-Debias on the six datasets over rounds 1-9.]

Summary

Distributed sparse learning achieving optimal statistical accuracy
§ A communication- and computation-efficient approach for distributed learning with sparsity.
§ $O(\log m)$ rounds of communication to match centralized performance; each round solves a local $\ell_1$ regularized problem.

Future Directions
§ Even weaker assumptions: e.g., relax the sample requirement per machine (currently $s^{2} \log p$ samples per machine).
§ More efficient approaches for distributed multi-task learning with shared sparsity.

Memory and Communication Efficient Distributed Stochastic Optimization with Minibatch-Prox

Conference on Learning Theory (COLT), 2017.

Joint work with Weiran Wang and Nathan Srebro.

Stochastic Gradient Descent

$\min_{w} F(w) := \mathbb{E}_{z \sim \mathcal{D}}\,[\ell(w, z)].$

Stochastic Gradient
At time $t$, sample $z_t$ and compute $w_{t+1} = w_t - \eta_t \nabla \ell(w_t, z_t)$.

SGD in Machine Learning
§ Computation and memory efficient.
§ Matches the ERM generalization guarantee with one-pass SGD.
§ Back-propagation in training neural networks.
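
A minimal one-pass SGD sketch on a simulated stream (squared loss and a $1/\sqrt{t}$ step size are illustrative assumptions): each sample is touched once and only the current iterate is stored.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 10, 5000
w_star = rng.normal(size=d)

def sample_z():
    """Draw one z = (x, y) from the data distribution D (here: a linear model plus noise)."""
    x = rng.normal(size=d)
    return x, x @ w_star + 0.1 * rng.normal()

def grad_loss(w, z):
    """Gradient of the squared loss l(w, z) = 0.5*(x'w - y)^2."""
    x, y = z
    return (x @ w - y) * x

# One-pass SGD: each sample is used exactly once, so memory stays O(d).
w = np.zeros(d)
for t in range(1, N + 1):
    eta = 0.1 / np.sqrt(t)          # an assumed step-size schedule
    w -= eta * grad_loss(w, sample_z())

print(np.linalg.norm(w - w_star))
```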

How to parallelize SGD?

Minibatch Stochastic Gradient Descent
At time $t$, sample a minibatch $I_t$ and update $w_{t+1} = w_t - \frac{\eta_t}{|I_t|} \sum_{z \in I_t} \nabla \ell(w_t, z)$.

[Figure: the minibatch gradient is computed in parallel, each machine summing $\nabla \ell(w_t, z)$ over its share of $I_t$.]

Larger minibatch, better parallelism
§ Reduce communication rounds and bits.

Not too large! (Dekel et al., 2012)
For convex smooth $\ell$, we have
$F(\bar{w}_T) \le F(w^{*}) + O\left( \sqrt{\tfrac{1}{bT}} + \tfrac{1}{T} \right).$
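
The parallelization pattern, simulated here in a single process: each of $m$ workers averages the gradient over its $b/m$ samples, the worker averages are combined (one round of communication per update), and a single step is taken. Loss, data, and step size are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, b, T = 10, 4, 64, 500               # dimension, workers, total minibatch size, rounds
w_star = rng.normal(size=d)

def sample_batch(size):
    X = rng.normal(size=(size, d))
    return X, X @ w_star + 0.1 * rng.normal(size=size)

def avg_grad(w, X, y):
    """Average squared-loss gradient over one worker's slice of the minibatch."""
    return X.T @ (X @ w - y) / len(y)

w = np.zeros(d)
for t in range(1, T + 1):
    # Each worker gets b/m fresh samples; in a real system these live on different machines.
    slices = [sample_batch(b // m) for _ in range(m)]
    g = np.mean([avg_grad(w, X, y) for X, y in slices], axis=0)   # one round of communication
    w -= 0.1 * g                                                  # assumed constant step size

print(np.linalg.norm(w - w_star))
```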

Limits of Minibatch SGD

Minibatch SGD
$F(\bar{w}_T) \le F(w^{*}) + O\left( \sqrt{\tfrac{1}{bT}} + \tfrac{1}{T} \right).$
§ $b \le \sqrt{N}$ to ensure sample efficiency ($N = bT$).

Accelerated Minibatch SGD
$F(\bar{w}_T) \le F(w^{*}) + O\left( \sqrt{\tfrac{1}{bT}} + \tfrac{1}{T^{2}} \right).$
§ $b \le N^{3/4}$ to ensure sample efficiency ($N = bT$).

Minibatch Proximal Update

Minibatch SGD
$\arg\min_{w} \Big\langle \tfrac{1}{b} \textstyle\sum_{z \in I_t} \nabla \ell(w_t, z),\; w \Big\rangle + \tfrac{1}{2 \eta_t} \lVert w - w_t \rVert^2.$
§ Solves a linear approximation problem on a minibatch.
§ $w_{t+1} = w_t - \tfrac{\eta_t}{b} \sum_{z \in I_t} \nabla \ell(w_t, z)$.

Minibatch Prox
$\arg\min_{w} \tfrac{1}{b} \textstyle\sum_{z \in I_t} \ell(w, z) + \tfrac{1}{2 \eta_t} \lVert w - w_t \rVert^2.$
§ Solves the original problem on a minibatch.
§ $w_{t+1} = w_t - \tfrac{\eta_t}{b} \sum_{z \in I_t} \nabla \ell(w_{t+1}, z)$ (an implicit update).
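
To contrast the two updates on the same minibatch, the sketch below assumes squared loss, for which the minibatch-prox subproblem is a small ridge-like system with a closed-form solution; it also verifies the implicit form of the prox update.

```python
import numpy as np

rng = np.random.default_rng(0)
d, b = 10, 64
w_star = rng.normal(size=d)
X = rng.normal(size=(b, d))
y = X @ w_star + 0.1 * rng.normal(size=b)

w_t, eta = np.zeros(d), 1.0

# Minibatch SGD: one explicit gradient step on the minibatch.
w_sgd = w_t - eta * X.T @ (X @ w_t - y) / b

# Minibatch prox: argmin_w (1/b)*sum 0.5*(x'w - y)^2 + (1/(2*eta))*||w - w_t||^2.
# With squared loss this is a ridge-like problem, solvable in closed form.
A = X.T @ X / b + np.eye(d) / eta
rhs = X.T @ y / b + w_t / eta
w_prox = np.linalg.solve(A, rhs)

# The prox step satisfies the implicit relation w_{t+1} = w_t - (eta/b) * sum grad l(w_{t+1}, z).
print(np.linalg.norm(w_prox - (w_t - eta * X.T @ (X @ w_prox - y) / b)))   # ~ 0
print(np.linalg.norm(w_sgd - w_star), np.linalg.norm(w_prox - w_star))
```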

Minibatch Prox Convergence

Convex case
$F(\bar{w}_T) \le F(w^{*}) + O\left( \sqrt{\tfrac{1}{bT}} \right).$
§ Always matches the optimal rate regardless of minibatch size.
§ Does not require smoothness on $\ell$.

$\lambda$-strongly convex case
$F(\bar{w}_T) \le F(w^{*}) + O\left( \tfrac{1}{\lambda bT} \right).$

§ Results can be extended to inexact minibatch prox updates.
§ Minibatch prox subproblems are easy to solve.

Solve the Minibatch Prox Problem

Minibatch Prox
$\arg\min_{w} \phi_{I_t}(w) := \tfrac{1}{b} \textstyle\sum_{z \in I_t} \ell(w, z) + \tfrac{1}{2 \eta_t} \lVert w - w_t \rVert^2.$
§ $\eta_t$ is chosen as $\eta_t \sim \sqrt{b/T}$.

Distributed SVRG (on $m$ machines)
Each machine samples $b/m$ data points.
§ Compute and communicate the full gradient $\nabla \phi_{I_t}(\tilde{w})$.
§ One machine performs the variance-reduced stochastic update: $w \leftarrow w - \alpha \big( \nabla \ell(w, z) + \nabla \phi_{I_t}(\tilde{w}) - \nabla \ell(\tilde{w}, z) \big)$.
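
A single-process sketch of the variance-reduced inner solver under illustrative assumptions (squared loss, ad hoc step size and epoch counts). The full gradient $\nabla\phi_{I_t}(\tilde w)$ is computed once per epoch (the quantity that would be communicated), and cheap corrected stochastic steps run against it; here the proximal term is folded into each component gradient so the loop is a standard SVRG on $\phi_{I_t}$, a simplification of the update written above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, b = 10, 256
w_star = rng.normal(size=d)
X = rng.normal(size=(b, d))
y = X @ w_star + 0.1 * rng.normal(size=b)
w_t, eta = np.zeros(d), 1.0               # prox center w_t and prox parameter eta_t

def component_grad(w, i):
    """Gradient of the i-th component: squared loss at z_i plus its share of the prox term."""
    return (X[i] @ w - y[i]) * X[i] + (w - w_t) / eta

def full_grad(w):
    """grad phi_{I_t}(w), the quantity communicated once per epoch."""
    return X.T @ (X @ w - y) / b + (w - w_t) / eta

w, alpha = w_t.copy(), 0.01
for epoch in range(10):
    w_ref, g_ref = w.copy(), full_grad(w)           # snapshot point and its full gradient
    for _ in range(2 * b):
        i = rng.integers(b)
        # Variance-reduced step: unbiased for grad phi, with variance shrinking near the optimum.
        w -= alpha * (component_grad(w, i) - component_grad(w_ref, i) + g_ref)

# Compare with the exact prox solution (closed form for squared loss).
w_exact = np.linalg.solve(X.T @ X / b + np.eye(d) / eta, X.T @ y / b + w_t / eta)
print(np.linalg.norm(w - w_exact))
```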

Solving distributed stochastic convex optimization

$\min_{w} F(w) := \mathbb{E}_{z \sim \mathcal{D}}\,[\ell(w, z)].$

  Method         | Samples | Comm.       | Comp.         | Memory
  Ideal solution | N(ε)    | O(1)        | N(ε)/m        | O(1)
  Accelerated GD | N(ε)    | N(ε)^{1/4}  | N(ε)^{5/4}/m  | N(ε)/m
  Acc. MB-SGD    | N(ε)    | N(ε)^{1/4}  | N(ε)/m        | O(1)
  DANE           | N(ε)    | m           | N(ε)          | N(ε)/m
  DiSCO          | N(ε)    | m^{1/4}     | N(ε)/m^{3/4}  | N(ε)/m
  AIDE           | N(ε)    | m^{1/4}     | N(ε)/m^{3/4}  | N(ε)/m
  DSVRG          | N(ε)    | O(1)        | N(ε)/m        | N(ε)/m
  MB-DSVRG       | N(ε)    | N(ε)/(mb)   | N(ε)/m        | b

[Figure: memory versus communication trade-off among accelerated minibatch SGD, DSVRG, and MB-DSVRG.]

Summary

Minibatch Prox
§ Allows arbitrary minibatch size without slowing down convergence.
§ Allows trade-off between communication and memory.

Future Directions
§ Analyze more algorithms used in practice, such as averaging iterates from SGD and minibatch prox.

We covered ...

Efficient ML
§ Distributed: Sparsity, Minibatch Prox, Dual Alternating
§ Opt & Sketch: Variance reduction, Primal-dual methods, Sketching
§ Online: Confidence-Weighted, Budget OGD, Cost-sensitive
§ Applications: Cloud removal, Collaborative ranking, Portfolio Opt


Distributed Learning & Optimization

§ Design ML algorithms on distributed computing platforms.
§ [WKSZ, ICML 2017], [WWS, COLT 2017], [ZWXXZ, JMLR 2017], [WKS, AISTATS 2016]

Efficient Machine Learning → Distributed Learning & Optimization
§ Homogeneous: Sparsity, Minibatch Prox, Dual Alternating Maximization, Gradient Sparsification, Iterative Sketching
§ Coupled: Shared Sparsity, Shared Subspace, Graph Based

Randomized Optimization & Sketching

§ Efficient optimization and sketching with provable guarantees.
§ [WZ, arXiv 2017], [WX, ICML 2017], [WWGS, NIPS 2016], [WLMKS, AISTATS 2017, EJS 2017]

Efficient Machine Learning → Randomized Optimization & Sketching
§ Variance reduction, Primal-dual methods, (Generalized) eigenvalues, Iterative Sketching

Online Learning

§ Efficient algorithms under non-stationary environments.
§ [WZH, ICML 2012], [ZWH, ICML 2012], [WZH, TKDE 2013], [HWZ, JMLR 2014], [WZH, TKDE 2013]

Efficient Machine Learning → Online Learning
§ Confidence-Weighted, Budget OGD, Cost-sensitive, LIBOL, Online feature selection

Applications

§ Apply ML techniques to vision, NLP, etc.
§ [WOCL, CVPR 2016], [WWHZ, ACL 2015], [WSE, KDD 2014], [LWHH, Quantitative Finance 2017]

Efficient Machine Learning → Applications
§ Cloud removal, Web ranking, Collaborative filtering, Portfolio Optimization

Thank you!

Efficient Machine Learning: Distributed Learning & Optimization · Randomized Optimization & Sketching · Online Learning · Applications

References I

S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM J. Sci. Comput., 20(1):33–61, 1998.

O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. J. Mach. Learn. Res., 13:165–202, 2012.

J. D. Lee, Y. Sun, Q. Liu, and J. E. Taylor. Communication-efficient sparse regression: a one-shot approach. J. Mach. Learn. Res., 2017.

R. J. Tibshirani. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B, 58(1):267–288, 1996.