Towards More Efficient Distributed Machine Learning Jialei Wang - PowerPoint PPT Presentation

  1. Towards More Efficient Distributed Machine Learning. Jialei Wang, University of Chicago. ISE, NCSU, 12/13/2017.

  3. The empirical success of machine learning: Big Data + Advanced Modeling + Massive Computing.

  4. My research (a topic map around "Efficient ML"): variance reduction, dual and alternating methods, Minibatch Prox, primal-dual methods, Opt&Sketch, distributed sparsity, sketching; applications: portfolio optimization, confidence-weighted learning, online collaborative ranking, budget OGD, cloud removal, cost-sensitive learning.

  7. This talk: the same research map, narrowed to the topics covered in this talk.

  8. Motivation for Distributed Learning - Data Size
     - Data cannot be stored or processed on a single machine.
     - Use distributed computing to handle big data sets.
     - Example: click-through rate prediction.

  11. Motivation for Distributed Learning - Data Collection
     - Data are naturally distributed on different machines.
     - Use distributed computing to learn from decentralized data.
     - Example: Google's federated learning problem.

  15. Challenges in Distributed Learning - Efficiency in multiple dimensions
     - Sample: sample complexity should match the centralized solution.
     - Computation: floating-point operations.
     - Communication: bandwidth (number of bits transmitted) + latency (rounds of communication).
     - Memory, etc.
     Per-unit cost: latency ≫ bandwidth ≫ FLOPS.

  16. Learning as Optimization - Stochastic Optimization Problems
     min_{w ∈ Ω} F(w) := E_{z∼D}[ℓ(w, z)].

  18. Learning as Optimization - example: [figure: a data-fitting/regression plot].

  19. Learning as Optimization - example: [figure: a feedforward neural network with input, hidden, and output layers].
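To make the stochastic optimization view concrete, here is a minimal sketch (not from the talk) of minimizing F(w) = E_{z∼D}[ℓ(w, z)] by stochastic gradient descent on i.i.d. samples z = (x, y); the squared loss, Gaussian data, and step-size schedule are illustrative assumptions.

```python
# Illustrative sketch: minimize F(w) = E_{z~D}[ l(w, z) ] with SGD on fresh samples.
# Assumed example: z = (x, y), squared loss l(w, (x, y)) = 0.5 * (<x, w> - y)^2.
import numpy as np

rng = np.random.default_rng(0)
d = 10
w_star = rng.normal(size=d)                      # assumed "true" predictor behind D

def sample_z():
    """Draw one sample z = (x, y) from the assumed distribution D."""
    x = rng.normal(size=d)
    y = x @ w_star + 0.1 * rng.normal()
    return x, y

def sgd(steps=5000, lr=0.05):
    w = np.zeros(d)
    for t in range(1, steps + 1):
        x, y = sample_z()                        # one fresh z ~ D per step
        grad = (x @ w - y) * x                   # gradient of the sampled loss at w
        w -= lr / np.sqrt(t) * grad              # decaying step size
    return w

w_hat = sgd()
print("||w_hat - w_star||_2 =", np.linalg.norm(w_hat - w_star))
```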

  20. Distributed Optimization for Learning - Reduction from (Distributed) Learning to Optimization
     - m machines; machine i collects n data instances {z_ij}_{j=1..n}.
     - Global objective: min_w f(w) := (1/m) Σ_{i=1..m} (1/n) Σ_{j=1..n} ℓ(w, z_ij).
     - Distributed consensus: f_i(w) := (1/n) Σ_{j=1..n} ℓ(w, z_ij), and f(w) := (1/m) Σ_{i=1..m} f_i(w).

  21. Distributed Optimization for Learning - [diagram: local objectives f_1(w), …, f_5(w) held on five machines].
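A minimal sketch of the decomposition on the two slides above: the global objective f(w) is the average of the local objectives f_i(w), and it coincides with the empirical risk over the pooled m·n samples. The squared loss and synthetic Gaussian data are illustrative assumptions, not the talk's setup.

```python
# Sketch of the decomposition: f(w) = (1/m) sum_i f_i(w),
# with local objectives f_i(w) = (1/n) sum_j l(w, z_ij) held on machine i.
import numpy as np

rng = np.random.default_rng(1)
m, n, d = 5, 100, 10
w_star = rng.normal(size=d)

data = []                                        # machine i holds {z_ij}_{j=1..n}, z = (x, y)
for _ in range(m):
    X = rng.normal(size=(n, d))
    data.append((X, X @ w_star + 0.1 * rng.normal(size=n)))

def f_i(w, i):
    X, y = data[i]
    return 0.5 * np.mean((X @ w - y) ** 2)       # local empirical risk on machine i

def f(w):
    return np.mean([f_i(w, i) for i in range(m)])  # consensus objective: average of local risks

w0 = np.zeros(d)
X_all = np.vstack([X for X, _ in data])
y_all = np.concatenate([y for _, y in data])
print("f(w0)                 =", f(w0))
print("pooled empirical risk =", 0.5 * np.mean((X_all @ w0 - y_all) ** 2))  # identical value
```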

  22. Distributed Optimization for Learning - What's special about Machine Learning?
     - Learning cares about the population objective F(w) = E_{z∼D}[ℓ(w, z)].
     - Stochastic nature of the data: the local objectives f_i(w) are related.

  23. Distributed Optimization for Learning - [diagram: machine i holds the local empirical objective Σ_{j=1..n} ℓ(w, z_ij), for i = 1, …, 5].

  24. Distributed Optimization for Learning - [diagram: each machine's data {z_ij}_{j=1..n} is drawn i.i.d. from the same distribution D].

  25. Distributed Optimization for Learning - How to exploit the similarity/relatedness between machines when designing distributed learning algorithms?

  27. This talk: two specific problems
     1. How to efficiently learn sparse linear predictors in a distributed environment?
     2. How to parallelize stochastic gradient descent (SGD)?
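As context for question 2, here is a minimal sketch (not the method proposed in the talk) of the simplest way to parallelize SGD: run SGD independently on each machine's shard and average the resulting parameters in a single communication round (one-shot averaging). The data model and step size are illustrative assumptions.

```python
# One-shot parameter averaging: the simplest parallel-SGD baseline
# (context only; not the approach developed later in the talk).
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 4, 1000, 10
w_star = rng.normal(size=d)

def local_sgd(X, y, lr=0.05):
    w = np.zeros(d)
    for i in rng.permutation(len(y)):            # one pass of SGD over the local shard
        w -= lr * (X[i] @ w - y[i]) * X[i]
    return w

local_models = []
for _ in range(m):                               # each machine trains on its own n samples
    X = rng.normal(size=(n, d))
    y = X @ w_star + 0.1 * rng.normal(size=n)
    local_models.append(local_sgd(X, y))

w_avg = np.mean(local_models, axis=0)            # single communication round: average the models
print("||w_avg - w_star||_2 =", np.linalg.norm(w_avg - w_star))
```

One-shot averaging costs a single round of communication; whether such schemes can match the sample complexity of the centralized solution is exactly the kind of efficiency question raised on the challenges slide.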

  28. Efficient Distributed Learning with Sparsity. International Conference on Machine Learning (ICML), 2017. Joint work with Mladen Kolar, Nathan Srebro, and Tong Zhang.

  31. High-level Overview
     - Problem: Efficient Distributed Sparse Learning with Optimal Statistical Accuracy.
     - Sparse Learning in High Dimension:
       - On a single machine, use classical methods such as the Lasso.
       - Statistical accuracy versus computation.
     - Distributed Learning with Big Data:
       - Data are distributed on multiple machines.
       - Statistical accuracy versus computation and communication (together: efficiency).

  32. High-dimensional Sparse Model - Number of variables (p) is often very large.
     Example: predict a phenotype value (e.g., 2.5) from a genotype sequence (…GTGCATCTGACTCCTGAGGAGTAG… / …CACGTAGACTGAGGACTCCTCATC…).

  35. High-dimensional Sparse Model - Number of variables (p) is often very large.
     - Sparsity:
       - Only a few variables are predictive.
       - w* = argmin_w E_{(x,y)∼D}[ℓ(y, ⟨x, w⟩)].
       - S := support(w*) = {j ∈ [p] | w*_j ≠ 0} and s = |S| ≪ p.
     - ℓ1 regularization (Tibshirani, 1996; Chen et al., 1998):
       - Statistical accuracy: good statistical properties.
       - Computational efficiency: convex surrogate of ℓ0.

  36. Sparse Regression
     - Statistical Model: y = ⟨x, w*⟩ + noise.
     - Centralized ℓ1 regularization:
       ŵ_cent = argmin_w (1/(mn)) Σ_{i=1..m} Σ_{j=1..n} ℓ(y_ij, ⟨x_ij, w⟩) + λ ‖w‖_1.
     - (Optimal) statistical accuracy: ‖ŵ_cent − w*‖_2 = O(√(s log p / (mn))).
     - Efficient method achieving optimal statistical accuracy?
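A hedged simulation sketch of the centralized estimator above: pool the m·n samples, fit ℓ1-regularized least squares (Lasso) with λ on the order of √(log p / (mn)), and compare the estimation error with the √(s log p / (mn)) rate quoted on the slide. The data-generating model, the constant in λ, and the use of scikit-learn are assumptions for illustration.

```python
# Centralized l1-regularized regression on the pooled m*n samples (illustrative sketch).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
m, n, p, s, sigma = 10, 200, 500, 5, 0.5
w_star = np.zeros(p)
w_star[:s] = 1.0                                  # s-sparse ground truth

N = m * n                                         # total pooled sample size
X = rng.normal(size=(N, p))
y = X @ w_star + sigma * rng.normal(size=N)

lam = 2 * sigma * np.sqrt(np.log(p) / N)          # lambda ~ sqrt(log p / (m n)) (assumed constant)
w_cent = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(X, y).coef_

print("||w_cent - w*||_2     =", np.linalg.norm(w_cent - w_star))
print("sqrt(s log p / (m n)) =", np.sqrt(s * np.log(p) / N))   # the rate from the slide
```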

  38. This work - A communication- and computation-efficient approach.
     To achieve optimal statistical accuracy when n ≳ m s² log p:
       Approach    | Communication | Computation
       Centralize  | n · p         | T_lasso(mn, p)
       Avg-Debias  | p             | p · T_lasso(n, p)
       This work   | p             | 2 · T_lasso(n, p)
     When m s² log p ≳ n ≳ s² log p:
       Approach    | Communication | Computation
       Centralize  | n · p         | T_lasso(mn, p)
       Avg-Debias  | ✗             | ✗
       This work   | log m · p     | log m · T_lasso(n, p)
     T_lasso(n, p): runtime for solving a lasso problem of size n × p.

  39. The Proposed Approach - Step 0: Local ℓ1 Regularized Problem
     Solve ŵ_1 = argmin_w f_1(w) + λ_1 ‖w‖_1.

  40. The Proposed Approach - Steps 1, …, t: Shifted ℓ1 Regularized Problem
     Communicate ŵ_t and the local gradients ∇f_1(ŵ_t), …, ∇f_5(ŵ_t).
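A rough sketch of Steps 0 and 1, …, t. This excerpt does not spell out the shifted objective, so the surrogate used below, argmin_w f_1(w) − ⟨∇f_1(ŵ_t) − ∇f(ŵ_t), w⟩ + λ_{t+1} ‖w‖_1 with ∇f(ŵ_t) the average of the communicated local gradients, is an assumption in the spirit of communication-efficient surrogate-objective methods; read it as illustrative only, not as the authors' exact algorithm.

```python
# Hedged sketch of Step 0 (local l1 problem) and Steps 1..t (a "shifted" l1 problem
# built from one round of gradient communication). The shifted objective is an ASSUMPTION.
import numpy as np

rng = np.random.default_rng(0)
m, n, p, s = 5, 200, 100, 3
w_star = np.zeros(p)
w_star[:s] = 1.0
machines = []                                    # machine i holds (X_i, y_i) with n rows
for _ in range(m):
    X = rng.normal(size=(n, p))
    machines.append((X, X @ w_star + 0.1 * rng.normal(size=n)))

def local_grad(i, w):                            # gradient of f_i(w) = (1/2n) ||y_i - X_i w||^2
    X, y = machines[i]
    return X.T @ (X @ w - y) / n

def solve_shifted_l1(lam, shift, iters=500):
    """argmin_w f_1(w) - <shift, w> + lam * ||w||_1 on machine 1, via proximal gradient (ISTA)."""
    X, y = machines[0]
    L = np.linalg.norm(X, 2) ** 2 / n            # Lipschitz constant of grad f_1
    w = np.zeros(p)
    for _ in range(iters):
        g = X.T @ (X @ w - y) / n - shift        # gradient of the smooth (shifted) part
        z = w - g / L
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft-thresholding
    return w

w_t = solve_shifted_l1(lam=0.1, shift=np.zeros(p))       # Step 0: plain local lasso on machine 1
for t in range(1, 4):                                     # Steps 1..t: one communication round each
    grads = [local_grad(i, w_t) for i in range(m)]        # broadcast w_t, gather local gradients
    shift = local_grad(0, w_t) - np.mean(grads, axis=0)   # grad f_1(w_t) - grad f(w_t)
    w_t = solve_shifted_l1(lam=0.1 / (t + 1), shift=shift)
print("||w_t - w*||_2 =", np.linalg.norm(w_t - w_star))
```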
