Towards More Efficient Distributed Machine Learning

Jialei Wang

University of Chicago

ISE, NCSU, 12/13/2017

The empirical success of machine learning

Big Data + Advanced Modeling + Massive Computing

My research

Efficient ML
§ Distributed: Sparsity, Minibatch Prox, Dual Alternating
§ Opt & Sketch: Variance reduction, Primal-dual methods, Sketching
§ Online: Confidence-Weighted, Budget OGD, Cost-sensitive
§ Applications: Cloud removal, Collaborative ranking, Portfolio Opt


This talk

Focus: the Distributed branch of the research map above, specifically Sparsity and Minibatch Prox.

Motivation for Distributed Learning

Data Size
§ Data cannot be stored or processed on a single machine.
§ Use distributed computing to handle big data sets.
§ Example: Click-through rate prediction problem.

Data Collection
§ Data are naturally distributed on different machines.
§ Use distributed computing to learn from decentralized data.
§ Example: Google's federated learning problem.

Challenges in Distributed Learning

Efficiency in multiple dimensions
§ Sample: sample complexity matches the centralized solution.
§ Computation: floating point operations.
§ Communication: bandwidth (number of bits transmitted) + latency (rounds of communication).
§ Memory etc.

latency ≫ bandwidth ≫ FLOPS

Learning as Optimization

Stochastic Optimization Problems

$\min_{w \in \Omega} F(w) := \mathbb{E}_{z \sim \mathcal{D}}\,[\ell(w, z)].$

[Figure: a feed-forward neural network with an input layer, a hidden layer, and an output layer.]

Distributed Optimization for Learning

Reduction from (Distributed) Learning to Optimization
§ $m$ machines, each machine collects $n$ data instances $\{z_{ij}\}_{j=1}^{n}$.
§ Global objective: $\min_w f(w) := \frac{1}{m}\sum_{i=1}^{m}\big(\frac{1}{n}\sum_{j=1}^{n}\ell(w, z_{ij})\big)$.
§ Distributed consensus: $f_i(w) := \frac{1}{n}\sum_{j=1}^{n}\ell(w, z_{ij})$, $\;\; f(w) := \frac{1}{m}\sum_{i=1}^{m} f_i(w)$.

[Figure: five machines holding local objectives $f_1(w), \dots, f_5(w)$.]
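
To make the reduction concrete, here is a minimal single-process sketch (not from the talk) that builds the local objectives $f_i$ and checks that their average equals the pooled empirical objective; squared loss and simulated data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 5, 100, 10                      # machines, samples per machine, dimension
w_star = rng.normal(size=d)

# Each machine i holds its own sample {z_ij} = {(x_ij, y_ij)}, all drawn from the same D.
X = [rng.normal(size=(n, d)) for _ in range(m)]
y = [X_i @ w_star + 0.1 * rng.normal(size=n) for X_i in X]

def local_objective(w, i):
    """f_i(w) = (1/n) * sum_j loss(w, z_ij), here with squared loss."""
    r = X[i] @ w - y[i]
    return 0.5 * np.mean(r ** 2)

def global_objective(w):
    """f(w) = (1/m) * sum_i f_i(w): the consensus form of the ERM objective."""
    return np.mean([local_objective(w, i) for i in range(m)])

w = np.zeros(d)
# The consensus objective equals the loss averaged over all mn pooled points.
X_all, y_all = np.vstack(X), np.concatenate(y)
centralized = 0.5 * np.mean((X_all @ w - y_all) ** 2)
print(global_objective(w), centralized)   # the two values agree
```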

Distributed Optimization for Learning

What's special about Machine Learning?
§ Learning cares about the population objective $F(w) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(w, z)]$.
§ Stochastic nature of the data: the local objectives $f_i(w)$ are related, since each machine's sample $\{z_{ij}\}_{j=1}^{n}$ is drawn from the same distribution $\mathcal{D}$.

How to exploit similarity/relatedness between machines when designing distributed learning algorithms?

This talk: two specific problems

1. How to efficiently learn sparse linear predictors in a distributed environment?
2. How to parallelize stochastic gradient descent (SGD)?

Efficient Distributed Learning with Sparsity

International Conference on Machine Learning (ICML), 2017.

Joint work with Mladen Kolar, Nathan Srebro, and Tong Zhang.

High-level Overview

Problem
Efficient Distributed Sparse Learning with Optimal Statistical Accuracy.

Sparse Learning in High Dimension
§ On a single machine, use classical methods such as Lasso.
§ Statistical accuracy versus computation.

Distributed Learning with Big Data
§ Data are distributed on multiple machines.
§ Statistical accuracy versus computation and communication (efficiency).

High-dimensional Sparse Model

Number of variables ($p$) is often very large.
[Figure: predicting a phenotype value (e.g., 2.5) from genotype sequences such as GTGCATCTGACTCCTGAGGAGTAG ...]

Sparsity
§ Only a few variables are predictive.
§ $w^{*} = \arg\min_{w} \mathbb{E}_{(x,y) \sim \mathcal{D}}\,[\ell(y, \langle x, w \rangle)]$.
§ $S := \mathrm{support}(w^{*}) = \{ j \in [p] \mid w^{*}_{j} \neq 0 \}$ and $s = |S| \ll p$.

$\ell_1$ regularization (Tibshirani, 1996; Chen et al., 1998)
§ Statistical accuracy: good statistical properties.
§ Computational efficiency: convex surrogate of $\ell_0$.

Sparse Regression

Statistical Model
§ $y = \langle x, w^{*} \rangle + \text{noise}$.

Centralized $\ell_1$ regularization

$\hat{w}^{\mathrm{cent}} = \arg\min_{w} \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \ell(y_{ij}, \langle x_{ij}, w \rangle) + \lambda \lVert w \rVert_1.$

(Optimal) statistical accuracy: $\lVert \hat{w}^{\mathrm{cent}} - w^{*} \rVert_2 = O\Big(\sqrt{\tfrac{s \log p}{mn}}\Big)$.

Efficient method achieving optimal statistical accuracy?
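
For reference, the centralized baseline can be sketched in a few lines. The sketch below uses scikit-learn's Lasso, which minimizes $\frac{1}{2n}\lVert y - Xw\rVert_2^2 + \alpha\lVert w\rVert_1$, so $\alpha$ plays the role of $\lambda$; the data sizes and the constant in $\lambda$ are assumptions for illustration, not the paper's settings.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
m, n, p, s = 10, 200, 500, 10             # machines, per-machine samples, dimensions, sparsity
N = m * n

# Sparse ground truth: only s of the p coordinates are nonzero.
w_star = np.zeros(p)
w_star[:s] = rng.normal(size=s)

X = rng.normal(size=(N, p))
y = X @ w_star + 0.5 * rng.normal(size=N)

# Centralized l1-regularized estimator on the pooled data (requires shipping all N*p entries).
lam = np.sqrt(np.log(p) / N)              # lambda on the order of sqrt(log p / (mn)); ad hoc constant
w_cent = Lasso(alpha=lam, max_iter=10000).fit(X, y).coef_

print("||w_cent - w*||_2 =", np.linalg.norm(w_cent - w_star))
```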

This work

A communication- and computation-efficient approach to achieve optimal statistical accuracy:

Regime n ≳ m s² log p
  Approach    | Communication | Computation
  Centralize  | n · p         | T_lasso(mn, p)
  Avg-Debias  | p             | p · T_lasso(n, p)
  This work   | p             | 2 · T_lasso(n, p)

Regime m s² log p ≳ n ≳ s² log p
  Approach    | Communication | Computation
  Centralize  | n · p         | T_lasso(mn, p)
  Avg-Debias  | ✗             | ✗
  This work   | log m · p     | log m · T_lasso(n, p)

T_lasso(n, p): runtime for solving a lasso problem of size n × p.

The Proposed Approach

Step 0: Local $\ell_1$ Regularized Problem
Solve
$\hat{w}_{1} = \arg\min_{w} f_{1}(w) + \lambda_{1} \lVert w \rVert_1.$

Steps 1, ..., t: Shifted $\ell_1$ Regularized Problem
Communicate $\hat{w}_{t}$ and the local gradients $\nabla f_{1}(\hat{w}_{t}), \dots, \nabla f_{m}(\hat{w}_{t})$, then solve
$\hat{w}_{t+1} = \arg\min_{w}\; f_{1}(w) + \Big\langle \tfrac{1}{m} \textstyle\sum_{j \in [m]} \nabla f_{j}(\hat{w}_{t}) - \nabla f_{1}(\hat{w}_{t}),\; w \Big\rangle + \lambda_{t+1} \lVert w \rVert_1.$
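
To illustrate one round of this update under simplifying assumptions (squared loss, simulated data, a plain proximal-gradient inner solver), the sketch below forms the gradient shift $\frac{1}{m}\sum_j \nabla f_j(\hat w_t) - \nabla f_1(\hat w_t)$ on machine 1 and solves the shifted lasso by iterative soft-thresholding. This is only a cartoon of the method, not the paper's implementation.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def shifted_lasso(X1, y1, shift, lam, iters=500):
    """argmin_w f_1(w) + <shift, w> + lam*||w||_1 via proximal gradient, where
    f_1(w) = (1/(2n))*||X1 w - y1||^2 and shift = mean_j grad f_j(w_t) - grad f_1(w_t)."""
    n, p = X1.shape
    step = 1.0 / (np.linalg.norm(X1, 2) ** 2 / n)   # 1 / Lipschitz constant of grad f_1
    w = np.zeros(p)
    for _ in range(iters):
        grad = X1.T @ (X1 @ w - y1) / n + shift
        w = soft_threshold(w - step * grad, step * lam)
    return w

# Tiny illustration: machine 1's data plus the aggregated gradients from all machines.
rng = np.random.default_rng(0)
m, n, p = 5, 100, 50
w_star = np.zeros(p); w_star[:5] = 1.0
Xs = [rng.normal(size=(n, p)) for _ in range(m)]
ys = [X @ w_star + 0.1 * rng.normal(size=n) for X in Xs]

w_t = np.zeros(p)                                    # current iterate (e.g., from Step 0)
grads = [X.T @ (X @ w_t - y) / n for X, y in zip(Xs, ys)]
shift = np.mean(grads, axis=0) - grads[0]            # global minus local first-order info
w_next = shifted_lasso(Xs[0], ys[0], shift, lam=0.05)
print(np.nonzero(np.round(w_next, 2))[0])            # recovered support (illustrative)
```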

Interpretation

Communication efficiency
$\arg\min_{w}\; f_{1}(w) + \Big\langle \tfrac{1}{m} \textstyle\sum_{j \in [m]} \nabla f_{j}(\hat{w}_{t}) - \nabla f_{1}(\hat{w}_{t}),\; w \Big\rangle + \lambda_{t+1} \lVert w \rVert_1.$
§ Combine global first-order and local higher-order information.

Quadratic objective without $\ell_1$ regularization
§ Closed-form update
$\hat{w}_{t+1} = \hat{w}_{t} - \big( \nabla^2 f_{1}(\hat{w}_{t}) \big)^{-1} \Big( \tfrac{1}{m} \textstyle\sum_{j \in [m]} \nabla f_{j}(\hat{w}_{t}) \Big)$: a sub-sampled Newton direction.

Key: adjust the $\ell_1$ regularization level at each round!
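
A small numpy check of this interpretation, under the assumption of squared loss (so $f_1$ is exactly quadratic): minimizing the shifted objective without the $\ell_1$ term reproduces the sub-sampled Newton step $\hat w_t - (\nabla^2 f_1(\hat w_t))^{-1}\big(\frac{1}{m}\sum_j \nabla f_j(\hat w_t)\big)$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, p = 4, 200, 8
w_star = rng.normal(size=p)
Xs = [rng.normal(size=(n, p)) for _ in range(m)]
ys = [X @ w_star + 0.1 * rng.normal(size=n) for X, in zip(Xs)] if False else \
     [X @ w_star + 0.1 * rng.normal(size=n) for X in Xs]

w_t = np.zeros(p)
grads = [X.T @ (X @ w_t - y) / n for X, y in zip(Xs, ys)]   # local gradients at w_t
g_bar = np.mean(grads, axis=0)                              # averaged (global) gradient
H1 = Xs[0].T @ Xs[0] / n                                    # local Hessian on machine 1

# Minimizing f_1(w) + <g_bar - grads[0], w> for quadratic f_1 gives the sub-sampled Newton step:
w_next = w_t - np.linalg.solve(H1, g_bar)
print(np.linalg.norm(w_next - w_star))                      # already close to w* after one step
```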

Related Work

Distributed first-order methods
§ e.g., (accelerated) proximal gradient.
§ Slow convergence → heavy communication.

Average after de-biasing (Avg-Debias) method (Lee et al., 2017)
§ Matches the centralized statistical error in one shot under certain conditions.
§ Not practical because of (i) computation and (ii) a stringent condition.

Theoretical Results (regression for example)

Regularization level
$\lambda_{t+1} \asymp \underbrace{\sqrt{\tfrac{\log p}{mn}}}_{\text{centralize}} + \underbrace{\sqrt{\tfrac{\log p}{n}} \left( s \sqrt{\tfrac{\log p}{n}} \right)^{t}}_{\text{decaying}}.$

Statistical Accuracy
$\lVert \hat{w}_{t+1} - w^{*} \rVert_1 \lesssim_P \underbrace{s \sqrt{\tfrac{\log p}{mn}}}_{\text{centralize}} + \underbrace{\left( s \sqrt{\tfrac{\log p}{n}} \right)^{t+1}}_{\text{decaying}}, \qquad \lVert \hat{w}_{t+1} - w^{*} \rVert_2 \lesssim_P \underbrace{\sqrt{\tfrac{s \log p}{mn}}}_{\text{centralize}} + \underbrace{\sqrt{\tfrac{s \log p}{n}} \left( s \sqrt{\tfrac{\log p}{n}} \right)^{t}}_{\text{decaying}}.$

Theoretical Comparison

Centralize
$\lVert \hat{w}^{\mathrm{cent}} - w^{*} \rVert_2 \lesssim_P \sqrt{\tfrac{s \log p}{mn}}.$

Proposed Method
After one round of communication:
$\lVert \hat{w}_{2} - w^{*} \rVert_2 \lesssim_P \sqrt{\tfrac{s \log p}{mn}} + \tfrac{s^{3/2} \log p}{n}.$
§ Matches Avg-Debias; matches Centralize when $n \gtrsim m s^{2} \log p$.

After $t \gtrsim \log m$ rounds of communication (when $n \gtrsim s^{2} \log p$):
$\lVert \hat{w}_{t+1} - w^{*} \rVert_2 \lesssim_P \sqrt{\tfrac{s \log p}{mn}}.$

Experiments (simulation)

Sparse Regression, comparing Local, Prox-GD, Centralize, Avg-Debias, and EDSL (this work) for m = 5, 10, 20 machines.

[Figure: estimation error versus rounds of communication (1-9) for two designs with s = 10 and X ∼ N(0, Σ): Σ_ij = 0.5^{|i−j|} and Σ_ij = 0.5^{|i−j|/5}.]

Experiments (real data)

Regression: connect4, dna, year (normalized MSE versus rounds of communication).

Classification: w8a, mitface, spambase (classification error versus rounds of communication).

[Figure: EDSL compared with Local, Prox-GD, Centralize, and Avg-Debias on the six datasets over rounds 1-9.]

Summary

Distributed sparse learning achieving optimal statistical accuracy
§ A communication- and computation-efficient approach for distributed learning with sparsity.
§ $O(\log m)$ rounds of communication to match centralized performance; each round solves a local $\ell_1$ regularized problem.

Future Directions
§ Even weaker assumptions: e.g., relax the sample requirement per machine (currently $s^{2} \log p$ samples per machine).
§ More efficient approaches for distributed multi-task learning with shared sparsity.

Memory and Communication Efficient Distributed Stochastic Optimization with Minibatch-Prox

Conference on Learning Theory (COLT), 2017.

Joint work with Weiran Wang and Nathan Srebro.

Stochastic Gradient Descent

$\min_{w} F(w) := \mathbb{E}_{z \sim \mathcal{D}}\,[\ell(w, z)].$

Stochastic Gradient
At time $t$, sample $z_t$ and compute $w_{t+1} = w_t - \eta_t \nabla \ell(w_t, z_t)$.

SGD in Machine Learning
§ Computation and memory efficient.
§ Matches the ERM generalization guarantee with one-pass SGD.
§ Back-propagation in training neural networks.
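
A minimal one-pass SGD sketch on a simulated stream (squared loss and a $1/\sqrt{t}$ step size are illustrative assumptions): each sample is touched once and only the current iterate is stored.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 10, 5000
w_star = rng.normal(size=d)

def sample_z():
    """Draw one z = (x, y) from the data distribution D (here: a linear model plus noise)."""
    x = rng.normal(size=d)
    return x, x @ w_star + 0.1 * rng.normal()

def grad_loss(w, z):
    """Gradient of the squared loss l(w, z) = 0.5*(x'w - y)^2."""
    x, y = z
    return (x @ w - y) * x

# One-pass SGD: each sample is used exactly once, so memory stays O(d).
w = np.zeros(d)
for t in range(1, N + 1):
    eta = 0.1 / np.sqrt(t)          # an assumed step-size schedule
    w -= eta * grad_loss(w, sample_z())

print(np.linalg.norm(w - w_star))
```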

How to parallelize SGD?

Minibatch Stochastic Gradient Descent
At time $t$, sample a minibatch $I_t$ and update $w_{t+1} = w_t - \frac{\eta_t}{|I_t|} \sum_{z \in I_t} \nabla \ell(w_t, z)$.

[Figure: the minibatch gradient is computed in parallel, each machine summing $\nabla \ell(w_t, z)$ over its share of $I_t$.]

Larger minibatch, better parallelism
§ Reduce communication rounds and bits.

Not too large! (Dekel et al., 2012)
For convex smooth $\ell$, we have
$F(\bar{w}_T) \le F(w^{*}) + O\left( \sqrt{\tfrac{1}{bT}} + \tfrac{1}{T} \right).$
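
The parallelization pattern, simulated here in a single process: each of $m$ workers averages the gradient over its $b/m$ samples, the worker averages are combined (one round of communication per update), and a single step is taken. Loss, data, and step size are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, b, T = 10, 4, 64, 500               # dimension, workers, total minibatch size, rounds
w_star = rng.normal(size=d)

def sample_batch(size):
    X = rng.normal(size=(size, d))
    return X, X @ w_star + 0.1 * rng.normal(size=size)

def avg_grad(w, X, y):
    """Average squared-loss gradient over one worker's slice of the minibatch."""
    return X.T @ (X @ w - y) / len(y)

w = np.zeros(d)
for t in range(1, T + 1):
    # Each worker gets b/m fresh samples; in a real system these live on different machines.
    slices = [sample_batch(b // m) for _ in range(m)]
    g = np.mean([avg_grad(w, X, y) for X, y in slices], axis=0)   # one round of communication
    w -= 0.1 * g                                                  # assumed constant step size

print(np.linalg.norm(w - w_star))
```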

Limits of Minibatch SGD

Minibatch SGD
$F(\bar{w}_T) \le F(w^{*}) + O\left( \sqrt{\tfrac{1}{bT}} + \tfrac{1}{T} \right).$
§ $b \le \sqrt{N}$ to ensure sample efficiency ($N = bT$).

Accelerated Minibatch SGD
$F(\bar{w}_T) \le F(w^{*}) + O\left( \sqrt{\tfrac{1}{bT}} + \tfrac{1}{T^{2}} \right).$
§ $b \le N^{3/4}$ to ensure sample efficiency ($N = bT$).

Minibatch Proximal Update

Minibatch SGD
$\arg\min_{w} \Big\langle \tfrac{1}{b} \textstyle\sum_{z \in I_t} \nabla \ell(w_t, z),\; w \Big\rangle + \tfrac{1}{2 \eta_t} \lVert w - w_t \rVert^2.$
§ Solves a linear approximation problem on a minibatch.
§ $w_{t+1} = w_t - \tfrac{\eta_t}{b} \sum_{z \in I_t} \nabla \ell(w_t, z)$.

Minibatch Prox
$\arg\min_{w} \tfrac{1}{b} \textstyle\sum_{z \in I_t} \ell(w, z) + \tfrac{1}{2 \eta_t} \lVert w - w_t \rVert^2.$
§ Solves the original problem on a minibatch.
§ $w_{t+1} = w_t - \tfrac{\eta_t}{b} \sum_{z \in I_t} \nabla \ell(w_{t+1}, z)$ (an implicit update).
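
To contrast the two updates on the same minibatch, the sketch below assumes squared loss, for which the minibatch-prox subproblem is a small ridge-like system with a closed-form solution; it also verifies the implicit form of the prox update.

```python
import numpy as np

rng = np.random.default_rng(0)
d, b = 10, 64
w_star = rng.normal(size=d)
X = rng.normal(size=(b, d))
y = X @ w_star + 0.1 * rng.normal(size=b)

w_t, eta = np.zeros(d), 1.0

# Minibatch SGD: one explicit gradient step on the minibatch.
w_sgd = w_t - eta * X.T @ (X @ w_t - y) / b

# Minibatch prox: argmin_w (1/b)*sum 0.5*(x'w - y)^2 + (1/(2*eta))*||w - w_t||^2.
# With squared loss this is a ridge-like problem, solvable in closed form.
A = X.T @ X / b + np.eye(d) / eta
rhs = X.T @ y / b + w_t / eta
w_prox = np.linalg.solve(A, rhs)

# The prox step satisfies the implicit relation w_{t+1} = w_t - (eta/b) * sum grad l(w_{t+1}, z).
print(np.linalg.norm(w_prox - (w_t - eta * X.T @ (X @ w_prox - y) / b)))   # ~ 0
print(np.linalg.norm(w_sgd - w_star), np.linalg.norm(w_prox - w_star))
```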

Minibatch Prox Convergence

Convex case
$F(\bar{w}_T) \le F(w^{*}) + O\left( \sqrt{\tfrac{1}{bT}} \right).$
§ Always matches the optimal rate regardless of minibatch size.
§ Does not require smoothness on $\ell$.

$\lambda$-strongly convex case
$F(\bar{w}_T) \le F(w^{*}) + O\left( \tfrac{1}{\lambda bT} \right).$

§ Results can be extended to inexact minibatch prox updates.
§ Minibatch prox subproblems are easy to solve.

Solve the Minibatch Prox Problem

Minibatch Prox
$\arg\min_{w} \phi_{I_t}(w) := \tfrac{1}{b} \textstyle\sum_{z \in I_t} \ell(w, z) + \tfrac{1}{2 \eta_t} \lVert w - w_t \rVert^2.$
§ $\eta_t$ is chosen as $\eta_t \sim \sqrt{b/T}$.

Distributed SVRG (on $m$ machines)
Each machine samples $b/m$ data points.
§ Compute and communicate the full gradient $\nabla \phi_{I_t}(\tilde{w})$.
§ One machine performs the variance-reduced stochastic update: $w \leftarrow w - \alpha \big( \nabla \ell(w, z) + \nabla \phi_{I_t}(\tilde{w}) - \nabla \ell(\tilde{w}, z) \big)$.
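
A single-process sketch of the variance-reduced inner solver under illustrative assumptions (squared loss, ad hoc step size and epoch counts). The full gradient $\nabla\phi_{I_t}(\tilde w)$ is computed once per epoch (the quantity that would be communicated), and cheap corrected stochastic steps run against it; here the proximal term is folded into each component gradient so the loop is a standard SVRG on $\phi_{I_t}$, a simplification of the update written above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, b = 10, 256
w_star = rng.normal(size=d)
X = rng.normal(size=(b, d))
y = X @ w_star + 0.1 * rng.normal(size=b)
w_t, eta = np.zeros(d), 1.0               # prox center w_t and prox parameter eta_t

def component_grad(w, i):
    """Gradient of the i-th component: squared loss at z_i plus its share of the prox term."""
    return (X[i] @ w - y[i]) * X[i] + (w - w_t) / eta

def full_grad(w):
    """grad phi_{I_t}(w), the quantity communicated once per epoch."""
    return X.T @ (X @ w - y) / b + (w - w_t) / eta

w, alpha = w_t.copy(), 0.01
for epoch in range(10):
    w_ref, g_ref = w.copy(), full_grad(w)           # snapshot point and its full gradient
    for _ in range(2 * b):
        i = rng.integers(b)
        # Variance-reduced step: unbiased for grad phi, with variance shrinking near the optimum.
        w -= alpha * (component_grad(w, i) - component_grad(w_ref, i) + g_ref)

# Compare with the exact prox solution (closed form for squared loss).
w_exact = np.linalg.solve(X.T @ X / b + np.eye(d) / eta, X.T @ y / b + w_t / eta)
print(np.linalg.norm(w - w_exact))
```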

Solving distributed stochastic convex optimization

$\min_{w} F(w) := \mathbb{E}_{z \sim \mathcal{D}}\,[\ell(w, z)].$

  Method         | Samples | Comm.       | Comp.         | Memory
  Ideal solution | N(ε)    | O(1)        | N(ε)/m        | O(1)
  Accelerated GD | N(ε)    | N(ε)^{1/4}  | N(ε)^{5/4}/m  | N(ε)/m
  Acc. MB-SGD    | N(ε)    | N(ε)^{1/4}  | N(ε)/m        | O(1)
  DANE           | N(ε)    | m           | N(ε)          | N(ε)/m
  DiSCO          | N(ε)    | m^{1/4}     | N(ε)/m^{3/4}  | N(ε)/m
  AIDE           | N(ε)    | m^{1/4}     | N(ε)/m^{3/4}  | N(ε)/m
  DSVRG          | N(ε)    | O(1)        | N(ε)/m        | N(ε)/m
  MB-DSVRG       | N(ε)    | N(ε)/(mb)   | N(ε)/m        | b

[Figure: memory versus communication trade-off among accelerated minibatch SGD, DSVRG, and MB-DSVRG.]

Summary

Minibatch Prox
§ Allows arbitrary minibatch size without slowing down convergence.
§ Allows trade-off between communication and memory.

Future Directions
§ Analyze more algorithms used in practice, such as averaging iterates from SGD and minibatch prox.

We covered ...

Efficient ML
§ Distributed: Sparsity, Minibatch Prox, Dual Alternating
§ Opt & Sketch: Variance reduction, Primal-dual methods, Sketching
§ Online: Confidence-Weighted, Budget OGD, Cost-sensitive
§ Applications: Cloud removal, Collaborative ranking, Portfolio Opt


Distributed Learning & Optimization

§ Design ML algorithms on distributed computing platforms.
§ [WKSZ, ICML 2017], [WWS, COLT 2017], [ZWXXZ, JMLR 2017], [WKS, AISTATS 2016]

Efficient Machine Learning → Distributed Learning & Optimization
§ Homogeneous: Sparsity, Minibatch Prox, Dual Alternating Maximization, Gradient Sparsification, Iterative Sketching
§ Coupled: Shared Sparsity, Shared Subspace, Graph Based

Randomized Optimization & Sketching

§ Efficient optimization and sketching with provable guarantees.
§ [WZ, arXiv 2017], [WX, ICML 2017], [WWGS, NIPS 2016], [WLMKS, AISTATS 2017, EJS 2017]

Efficient Machine Learning → Randomized Optimization & Sketching
§ Variance reduction, Primal-dual methods, (Generalized) eigenvalues, Iterative Sketching

Online Learning

§ Efficient algorithms under non-stationary environments.
§ [WZH, ICML 2012], [ZWH, ICML 2012], [WZH, TKDE 2013], [HWZ, JMLR 2014], [WZH, TKDE 2013]

Efficient Machine Learning → Online Learning
§ Confidence-Weighted, Budget OGD, Cost-sensitive, LIBOL, Online feature selection

Applications

§ Apply ML techniques to vision, NLP, etc.
§ [WOCL, CVPR 2016], [WWHZ, ACL 2015], [WSE, KDD 2014], [LWHH, Quantitative Finance 2017]

Efficient Machine Learning → Applications
§ Cloud removal, Web ranking, Collaborative filtering, Portfolio Optimization

Thank you!

Efficient Machine Learning: Distributed Learning & Optimization · Randomized Optimization & Sketching · Online Learning · Applications

References I

S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM J. Sci. Comput., 20(1):33–61, 1998.

O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. J. Mach. Learn. Res., 13:165–202, 2012.

J. D. Lee, Y. Sun, Q. Liu, and J. E. Taylor. Communication-efficient sparse regression: a one-shot approach. J. Mach. Learn. Res., 2017.

R. J. Tibshirani. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B, 58(1):267–288, 1996.