Towards More Efficient Distributed Machine Learning
Jialei Wang, University of Chicago
ISE, NCSU, 12/13/2017

The empirical success of machine learning
Big Data + Massive Computing + Advanced Modeling
My research

Efficient ML
§ Distributed: Sparsity, Minibatch Prox, Dual Alternating
§ Opt & Sketch: Variance reduction, Primal-dual methods, Sketching
§ Online: Confidence-Weighted, Budget OGD, Cost-sensitive
§ Applications: Cloud removal, Collaborative ranking, Portfolio Opt
This talk

Efficient ML
§ Distributed: Sparsity, Minibatch Prox, Dual Alternating
§ Opt & Sketch: Variance reduction, Primal-dual methods, Sketching
§ Online: Confidence-Weighted, Budget OGD, Cost-sensitive
§ Applications: Cloud removal, Collaborative ranking, Portfolio Opt
Motivation for Distributed Learning

Data Size
§ Data cannot be stored or processed on a single machine.
§ Use distributed computing to handle big data sets.
§ Example: click-through rate prediction.

Data Collection
§ Data are naturally distributed on different machines.
§ Use distributed computing to learn from decentralized data.
§ Example: Google's federated learning problem.
Challenges in Distributed Learning

Efficiency in multiple dimensions
§ Sample: sample complexity matches the centralized solution.
§ Computation: floating point operations.
§ Communication: bandwidth (number of bits transmitted) + latency (rounds of communication).
§ Memory, etc.

[Figure: relative costs of latency, bandwidth, and FLOPS.]
Learning as Optimization

Stochastic Optimization Problems

  min_{w ∈ Ω} F(w) := E_{z∼D}[ℓ(w, z)].

[Figures: a regression example and a feed-forward neural network with input, hidden, and output layers.]
Distributed Optimization for Learning

Reduction from (Distributed) Learning to Optimization
§ m machines; each machine collects n data instances {z_ij}_{j=1}^n.
§ Global objective: min_w f(w) := (1/m) Σ_{i=1}^m ( (1/n) Σ_{j=1}^n ℓ(w, z_ij) ).
§ Distributed consensus: f_i(w) := (1/n) Σ_{j=1}^n ℓ(w, z_ij), and f(w) := (1/m) Σ_{i=1}^m f_i(w).

[Figure: m machines, each holding its local objective f_1(w), ..., f_m(w).]
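To make the reduction concrete, here is a minimal NumPy sketch (not from the slides) of how the local objectives f_i and the global objective f compose, assuming squared loss ℓ(w, z) = ½(y − ⟨x, w⟩)² for z = (x, y); the sizes m, n, p are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 5, 100, 20                    # machines, samples per machine, dimension
w_star = rng.normal(size=p)

# Each machine i holds its own sample {z_ij}_{j=1}^n, drawn i.i.d. from D.
X = rng.normal(size=(m, n, p))
y = X @ w_star + 0.1 * rng.normal(size=(m, n))

def f_i(w, i):
    """Local empirical objective on machine i: (1/n) * sum_j l(w, z_ij)."""
    residual = y[i] - X[i] @ w
    return 0.5 * np.mean(residual ** 2)

def f(w):
    """Global objective: the average of the m local objectives."""
    return np.mean([f_i(w, i) for i in range(m)])

print(f(w_star))   # roughly 0.5 * noise variance
```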
Distributed Optimization for Learning

What's special about Machine Learning?
§ Learning cares about the population objective F(w) = E_{z∼D}[ℓ(w, z)].
§ Stochastic nature of the data: each machine's sample {z_ij}_{j=1}^n is drawn i.i.d. from the same distribution D, so the local objectives f_i(w) = (1/n) Σ_j ℓ(w, z_ij) are related.

How can we exploit the similarity/relatedness between machines when designing distributed learning algorithms?
This talk: two specific problems
1. How to efficiently learn sparse linear predictors in a distributed environment?
2. How to parallelize stochastic gradient descent (SGD)?
Efficient Distributed Learning with Sparsity
International Conference on Machine Learning (ICML), 2017.
Joint work with Mladen Kolar, Nathan Srebro, and Tong Zhang.
High-level Overview

Problem
Efficient distributed sparse learning with optimal statistical accuracy.

Sparse Learning in High Dimension
§ On a single machine, use classical methods such as the Lasso.
§ Statistical accuracy versus computation.

Distributed Learning with Big Data
§ Data are distributed on multiple machines.
§ Statistical accuracy versus computation and communication (i.e., efficiency).
High-dimensional Sparse Model

The number of variables (p) is often very large (e.g., predicting a phenotype from genotype data).

Sparsity
§ Only a few variables are predictive.
§ w* = argmin_w E_{(x,y)∼D}[ℓ(y, ⟨x, w⟩)].
§ S := support(w*) = {j ∈ [p] : w*_j ≠ 0} and s = |S| ≪ p.

ℓ1 regularization (Tibshirani, 1996; Chen et al., 1998)
§ Statistical accuracy: good statistical properties.
§ Computational efficiency: convex surrogate of the ℓ0 penalty.
Sparse Regression

Statistical Model
§ y = ⟨x, w*⟩ + noise.

Centralized ℓ1 regularization

  ŵ_cent = argmin_w (1/(mn)) Σ_{i=1}^m Σ_{j=1}^n ℓ(y_ij, ⟨x_ij, w⟩) + λ ||w||_1.

(Optimal) statistical accuracy: ||ŵ_cent − w*||_2 = O( √(s log p / (mn)) ).

Is there an efficient method achieving optimal statistical accuracy?
This work

A communication- and computation-efficient approach. To achieve optimal statistical accuracy:

Regime n ≳ m s² log p:
  Approach    | Communication | Computation
  Centralize  | n · p         | T_lasso(mn, p)
  Avg-Debias  | p             | p · T_lasso(n, p)
  This work   | p             | 2 · T_lasso(n, p)

Regime m s² log p ≳ n ≳ s² log p:
  Approach    | Communication | Computation
  Centralize  | n · p         | T_lasso(mn, p)
  Avg-Debias  | ✗             | ✗
  This work   | log m · p     | log m · T_lasso(n, p)

T_lasso(n, p): runtime for solving a lasso problem of size n × p.
The Proposed Approach

Step 0: Local ℓ1 Regularized Problem
Solve on the first machine:

  ŵ_1 = argmin_w f_1(w) + λ_1 ||w||_1.

Steps 1, ..., t: Shifted ℓ1 Regularized Problem
Communicate ŵ_t and the local gradients ∇f_1(ŵ_t), ..., ∇f_m(ŵ_t), then solve:

  ŵ_{t+1} = argmin_w f_1(w) + ⟨ (1/m) Σ_{j∈[m]} ∇f_j(ŵ_t) − ∇f_1(ŵ_t), w ⟩ + λ_{t+1} ||w||_1.
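A sketch of one such round, assuming squared loss and a plain proximal-gradient inner solver (the slides do not prescribe an inner solver); function names, the step size, and the iteration count are illustrative. In an actual distributed run, the entries of `grads` would be computed on the m machines and gathered at machine 1 before this call.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def shifted_l1_round(X1, y1, grads, w_t, lam, step=None, iters=500):
    """One round of the shifted l1 problem, solved on machine 1 (squared loss).

    grads : list of local gradients grad f_j(w_t), gathered from all m machines.
    """
    n = X1.shape[0]
    if step is None:
        step = n / np.linalg.norm(X1, 2) ** 2        # 1 / smoothness of f_1
    g_bar = np.mean(grads, axis=0)                   # global first-order information
    g1 = X1.T @ (X1 @ w_t - y1) / n                  # local gradient grad f_1(w_t)
    shift = g_bar - g1                               # the linear "shift" term
    w = w_t.copy()
    for _ in range(iters):
        grad_smooth = X1.T @ (X1 @ w - y1) / n + shift
        w = soft_threshold(w - step * grad_smooth, step * lam)
    return w
```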
Interpretation

Communication efficiency

  argmin_w f_1(w) + ⟨ (1/m) Σ_{j∈[m]} ∇f_j(ŵ_t) − ∇f_1(ŵ_t), w ⟩ + λ_{t+1} ||w||_1.

§ Combines global first-order information with local higher-order information.

Quadratic objective without ℓ1 regularization
§ Closed-form update:

  ŵ_{t+1} = ŵ_t − (∇²f_1(ŵ_t))^{-1} ( m^{-1} Σ_{j∈[m]} ∇f_j(ŵ_t) ),

  a sub-sampled Newton direction.

Key: adjust the ℓ1 regularization level at each round!
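To sanity-check the sub-sampled Newton interpretation above, a small sketch of the closed-form update with the ℓ1 term dropped, assuming squared loss and an invertible local Hessian (i.e., the low-dimensional case); names are illustrative.

```python
import numpy as np

def newton_like_step(X1, y1, grads, w_t):
    """w_{t+1} = w_t - (hess f_1(w_t))^{-1} * mean_j grad f_j(w_t)."""
    n = X1.shape[0]
    H1 = X1.T @ X1 / n                 # local Hessian on machine 1
    g_bar = np.mean(grads, axis=0)     # averaged gradient over all m machines
    return w_t - np.linalg.solve(H1, g_bar)
```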
Related Work

Distributed first-order methods
§ e.g., (accelerated) proximal gradient.
§ Slow convergence → heavy communication.

Average-after-debiasing method (Lee et al., 2017)
§ Matches the centralized statistical error in one shot under certain conditions.
§ Not practical because of (i) computation and (ii) stringent conditions.
Theoretical Results (regression, for example)

Regularization level

  λ_{t+1} ≍ √(log p / (mn))  [centralized term]  +  √(log p / n) · ( s √(log p / n) )^t  [decaying term].

Statistical Accuracy

  ||ŵ_{t+1} − w*||_1 ≲_P s √(log p / (mn))  [centralized term]  +  ( s √(log p / n) )^{t+1}  [decaying term],

  ||ŵ_{t+1} − w*||_2 ≲_P √(s log p / (mn))  [centralized term]  +  √(s log p / n) · ( s √(log p / n) )^t  [decaying term].
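The schedule is easy to instantiate; a small sketch with all constants absorbed into a single factor c, since the slide states the relation only up to constants.

```python
import numpy as np

def lambda_schedule(t, n, m, p, s, c=1.0):
    """lambda_{t+1} ~ sqrt(log p/(mn)) + sqrt(log p/n) * (s * sqrt(log p/n))**t."""
    centralized = np.sqrt(np.log(p) / (m * n))
    decaying = np.sqrt(np.log(p) / n) * (s * np.sqrt(np.log(p) / n)) ** t
    return c * (centralized + decaying)
```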
Theoretical Comparison

Centralize

  ||ŵ_cent − w*||_2 ≲_P √(s log p / (mn)).

Proposed Method
After one round of communication:

  ||ŵ_2 − w*||_2 ≲_P √(s log p / (mn)) + s^{3/2} log p / n.

§ Matches Avg-Debias; matches Centralize when n ≳ m s² log p.

After t ≳ log m rounds of communication (when n ≳ s² log p):

  ||ŵ_{t+1} − w*||_2 ≲_P √(s log p / (mn)).
Experiments

[Figure: estimation error versus rounds of communication (1–9) for Local, Prox-GD, Centralize, Avg-Debias, and EDSL.]
Experiments (simulation)

Sparse Regression: m = 5, 10, 20 machines; s = 10; X ∼ N(0, Σ).

[Figures: estimation error versus rounds of communication (1–9) for Local, Prox-GD, Centralize, Avg-Debias, and EDSL; top row Σ_ij = 0.5^{|i−j|}, bottom row Σ_ij = 0.5^{|i−j|/5}.]
Experiments (real data)

Regression datasets: connect4, dna, year.
[Figures: normalized MSE versus rounds of communication (1–9) for Local, Prox-GD, Centralize, Avg-Debias, and EDSL.]

Classification datasets: w8a, mitface, spambase.
[Figures: classification error versus rounds of communication (1–9) for Local, Prox-GD, Centralize, Avg-Debias, and EDSL.]
Summary

Distributed sparse learning achieving optimal statistical accuracy
§ A communication- and computation-efficient approach for distributed learning with sparsity.
§ O(log m) rounds of communication to match centralized performance; each round solves a local ℓ1 regularized problem.

Future Directions
§ Even weaker assumptions: e.g., relax the per-machine sample requirement (currently n ≳ s² log p).
§ More efficient approaches for distributed multi-task learning with shared sparsity.
Memory and Communication Efficient Distributed Stochastic Optimization with Minibatch-Prox
Conference on Learning Theory (COLT), 2017.
Joint work with Weiran Wang and Nathan Srebro.
Stochastic Gradient Descent

  min_w F(w) := E_{z∼D}[ℓ(w, z)].

Stochastic Gradient
At time t, sample z_t and compute w_{t+1} = w_t − η_t ∇ℓ(w_t, z_t).

SGD in Machine Learning
§ Computation and memory efficient.
§ Matches the ERM generalization guarantee with one-pass SGD.
§ Back-propagation in training neural networks.
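A minimal one-pass SGD sketch for this objective, assuming squared loss and an η_t = η_0/√t step size (both illustrative choices, not specified on the slides).

```python
import numpy as np

def sgd(data_stream, p, eta0=0.1):
    """One-pass SGD: w_{t+1} = w_t - eta_t * grad l(w_t, z_t)."""
    w = np.zeros(p)
    for t, (x, y) in enumerate(data_stream, start=1):
        eta = eta0 / np.sqrt(t)
        grad = (x @ w - y) * x          # gradient of 0.5 * (y - <x, w>)^2
        w -= eta * grad
    return w
```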
How to parallelize SGD?

Minibatch Stochastic Gradient Descent
At time t, sample a minibatch I_t and update

  w_{t+1} = w_t − (η_t / |I_t|) Σ_{z∈I_t} ∇ℓ(w_t, z).

The partial sums Σ_{z∈I_i} ∇ℓ(w_t, z) can be computed in parallel across machines.

Larger minibatch, better parallelism
§ Reduces communication rounds and bits.

But not too large! (Dekel et al., 2012)
For convex smooth ℓ,

  F(w̄_T) ≤ F(w*) + O( √(1/(bT)) + 1/T ).
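A sketch of a single minibatch step with the per-machine partial gradient sums written out (squared loss assumed; the parallel computation is simulated by a list comprehension, and names are illustrative).

```python
import numpy as np

def minibatch_sgd_step(w, minibatches, eta):
    """minibatches: one (X_i, y_i) block per machine; the blocks together form I_t."""
    partials = [Xi.T @ (Xi @ w - yi) for Xi, yi in minibatches]   # done in parallel in practice
    total = sum(len(yi) for _, yi in minibatches)                 # |I_t|
    grad = sum(partials) / total          # (1/|I_t|) * sum_{z in I_t} grad l(w, z)
    return w - eta * grad
```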
Limits of Minibatch SGD

Minibatch SGD

  F(w̄_T) ≤ F(w*) + O( √(1/(bT)) + 1/T ).

§ Need b ≤ √N to ensure sample efficiency (N = bT).

Accelerated Minibatch SGD

  F(w̄_T) ≤ F(w*) + O( √(1/(bT)) + 1/T² ).

§ Need b ≤ N^{3/4} to ensure sample efficiency (N = bT).
Minibatch Proximal Update

Minibatch SGD

  argmin_w ⟨ (1/b) Σ_{z∈I_t} ∇ℓ(w_t, z), w ⟩ + (1/(2η_t)) ||w − w_t||².

§ Solves a linear approximation of the objective on a minibatch.
§ Equivalent to w_{t+1} = w_t − (η_t/b) Σ_{z∈I_t} ∇ℓ(w_t, z).

Minibatch Prox

  argmin_w (1/b) Σ_{z∈I_t} ℓ(w, z) + (1/(2η_t)) ||w − w_t||².

§ Solves the original problem on a minibatch.
§ Equivalent to the implicit update w_{t+1} = w_t − (η_t/b) Σ_{z∈I_t} ∇ℓ(w_{t+1}, z).
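A sketch of one implicit (prox) step, assuming squared loss and solving the subproblem with a few plain gradient steps (the paper allows inexact subproblem solutions; the inner solver, step size, and names are illustrative).

```python
import numpy as np

def minibatch_prox_step(w_t, Xb, yb, eta, inner_iters=50, inner_lr=None):
    """argmin_w (1/b) * sum_{z in I_t} l(w, z) + (1/(2*eta)) * ||w - w_t||^2  (squared loss)."""
    b = len(yb)
    if inner_lr is None:
        # 1 / smoothness of the subproblem: ||Xb||_2^2 / b + 1/eta
        inner_lr = 1.0 / (np.linalg.norm(Xb, 2) ** 2 / b + 1.0 / eta)
    w = w_t.copy()
    for _ in range(inner_iters):
        grad = Xb.T @ (Xb @ w - yb) / b + (w - w_t) / eta
        w -= inner_lr * grad
    return w
```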
Minibatch Prox Convergence

Convex case

  F(w̄_T) ≤ F(w*) + O( √(1/(bT)) ).

§ Always matches the optimal rate, regardless of the minibatch size.
§ Does not require smoothness of ℓ.

λ-strongly convex case

  F(w̄_T) ≤ F(w*) + O( 1/(λbT) ).

§ The results extend to inexact minibatch prox updates.
§ The minibatch prox subproblems are easy to solve.
Solve the Minibatch Prox Problem

Minibatch Prox subproblem

  argmin_w φ_{I_t}(w) := (1/b) Σ_{z∈I_t} ℓ(w, z) + (1/(2η_t)) ||w − w_t||².

§ η_t is chosen as η_t ∼ √(b/T).

Distributed SVRG (on m machines)
Each machine samples b/m data points.
§ Compute and communicate the full gradient ∇φ_{I_t}(w̃).
§ One machine performs the variance-reduced stochastic update:

  w ← w − α ( ∇ℓ(w, z) + ∇φ_{I_t}(w̃) − ∇ℓ(w̃, z) ).
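A single-process sketch of the SVRG-style inner loop on the subproblem φ_{I_t}, assuming squared loss; here each per-sample gradient includes the proximal term, one standard way of setting up the finite sum. In the distributed version, the snapshot gradient ∇φ_{I_t}(w̃) would be averaged across the m machines; all names are illustrative.

```python
import numpy as np

def svrg_on_prox(w_t, Xb, yb, eta, alpha=0.01, epochs=5):
    """Minimize phi(w) = (1/b) * sum_j 0.5*(y_j - <x_j, w>)^2 + (1/(2*eta))*||w - w_t||^2
    with SVRG-style variance-reduced stochastic updates."""
    b = len(yb)
    rng = np.random.default_rng(0)
    w = w_t.copy()
    for _ in range(epochs):
        w_snap = w.copy()
        # Full gradient of phi at the snapshot (communicated in the distributed setting).
        full_grad = Xb.T @ (Xb @ w_snap - yb) / b + (w_snap - w_t) / eta
        for _ in range(b):
            j = rng.integers(b)
            g_w = (Xb[j] @ w - yb[j]) * Xb[j] + (w - w_t) / eta
            g_snap = (Xb[j] @ w_snap - yb[j]) * Xb[j] + (w_snap - w_t) / eta
            w -= alpha * (g_w - g_snap + full_grad)   # variance-reduced gradient estimate
    return w
```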
Solving distributed stochastic convex optimization

  min_w F(w) := E_{z∼D}[ℓ(w, z)].

  Method         | Samples | Comm.       | Comp.         | Memory
  Ideal solution | N(ε)    | O(1)        | N(ε)/m        | O(1)
  Accelerated GD | N(ε)    | N(ε)^{1/4}  | N(ε)^{5/4}/m  | N(ε)/m
  Acc. MB-SGD    | N(ε)    | N(ε)^{1/4}  | N(ε)/m        | O(1)
  DANE           | N(ε)    | m           | N(ε)          | N(ε)/m
  DiSCO          | N(ε)    | m^{1/4}     | N(ε)/m^{3/4}  | N(ε)/m
  AIDE           | N(ε)    | m^{1/4}     | N(ε)/m^{3/4}  | N(ε)/m
  DSVRG          | N(ε)    | O(1)        | N(ε)/m        | N(ε)/m
  MB-DSVRG       | N(ε)    | N(ε)/(mb)   | N(ε)/m        | b

[Figure: memory versus communication trade-off for Acc. minibatch SGD, DSVRG, and MB-DSVRG.]
Summary

Minibatch Prox
§ Allows an arbitrary minibatch size without slowing down convergence.
§ Allows a trade-off between communication and memory.

Future Directions
§ Analyze more algorithms used in practice, such as averaging iterates from SGD and minibatch prox.
We covered ...

Efficient ML
§ Distributed: Sparsity, Minibatch Prox, Dual Alternating
§ Opt & Sketch: Variance reduction, Primal-dual methods, Sketching
§ Online: Confidence-Weighted, Budget OGD, Cost-sensitive
§ Applications: Cloud removal, Collaborative ranking, Portfolio Opt

My research
Distributed Learning & Optimization
§ Design ML algorithms on distributed computing platforms.
§ [WKSZ, ICML 2017], [WWS, COLT 2017], [ZWXXZ, JMLR 2017], [WKS, AISTATS 2016]
§ Homogeneous data: Sparsity, Minibatch Prox, Dual Alternating Maximization, Gradient Sparsification, Iterative Sketching.
§ Coupled data: Shared Sparsity, Shared Subspace, Graph Based.
Randomized Optimization & Sketching
§ Efficient optimization and sketching with provable guarantees.
§ [WZ, arXiv 2017], [WX, ICML 2017], [WWGS, NIPS 2016], [WLMKS, AISTATS 2017, EJS 2017]
§ Topics: Variance reduction, Primal-dual methods, (Generalized) eigenvalue problems, Iterative Sketching.
Online Learning
§ Efficient algorithms for non-stationary environments.
§ [WZH, ICML 2012], [ZWH, ICML 2012], [WZH, TKDE 2013], [HWZ, JMLR 2014]
§ Topics: Confidence-Weighted learning, Budget OGD, Cost-sensitive learning, LIBOL, Online feature selection.
Applications
§ Apply ML techniques to vision, NLP, etc.
§ [WOCL, CVPR 2016], [WWHZ, ACL 2015], [WSE, KDD 2014], [LWHH, Quantitative Finance 2017]
§ Topics: Cloud removal, Web ranking, Collaborative filtering, Portfolio Optimization.
Thank you!

Efficient Machine Learning: Distributed Learning & Optimization, Randomized Optimization & Sketching, Online Learning, Applications.
References

S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM J. Sci. Comput., 20(1):33–61, 1998.

O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. J. Mach. Learn. Res., 13:165–202, 2012.

J. D. Lee, Y. Sun, Q. Liu, and J. E. Taylor. Communication-efficient sparse regression: a one-shot approach. J. Mach. Learn. Res., 2017.

R. J. Tibshirani. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B, 58(1):267–288, 1996.