
Manifold Identification for Ultimately Communication-Efficient Distributed Optimization - PowerPoint PPT Presentation



  1. Manifold Identification for Ultimately Communication-Efficient Distributed Optimization
     Yu-Sheng Li, joint work with Wei-Lin Chiang (NTU) and Ching-pei Lee (NUS)

  2. Outline: Overview, Empirical Risk Minimization, The Proposed Algorithm, Experiments

  3. Distributed Machine Learning
     Read 1 MB sequentially from memory        3 µs
     Read 1 MB sequentially from network      22 µs
     Read 1 MB sequentially from disk (SSD)   49 µs
     Round trip within the same datacenter   500 µs
     (Latency Numbers Every Programmer Should Know [1])
     [1] Originally by Jeff Dean in 2010, updated by Colin Scott at https://colin-scott.github.io/personal_website/research/interactive_latency.html

  4. Distributed Machine Learning
     (Latency numbers as on the previous slide.)
     ◮ Inter-machine communication may be more time-consuming than local computations within a machine
     ◮ Comm. cost = (# comm. rounds) × (bytes communicated per round)
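
For a rough sense of scale (a back-of-the-envelope estimate, assuming 8-byte double-precision entries): for the news20 problem on the following slides with d = 1,355,191 features, one O(d) round moves about 10.8 MB per machine, i.e. roughly 0.24 ms of sequential network read time at the 22 µs/MB figure above, on top of the ~500 µs round-trip latency, so both factors in the product matter.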

  5. Sparsity-inducing Regularization
     ◮ To avoid overfitting and to enforce some desired structure in the solution, a sparsity-inducing regularizer is usually introduced

  6. Sparsity-inducing Regularization
     ◮ To avoid overfitting and to enforce some desired structure in the solution, a sparsity-inducing regularizer is usually introduced
     ◮ Example: ℓ2- vs. ℓ1-regularized logistic regression on news20
        Relative reg. strength   Nonzeros in solution   Test accuracy
        ℓ2-regularized
          2^0                    1,355,191 (100%)       99.7449%
          2^10                   1,355,191 (100%)       97.0044%

  7. Sparsity-inducing Regularization
     ◮ To avoid overfitting and to enforce some desired structure in the solution, a sparsity-inducing regularizer is usually introduced
     ◮ Example: ℓ2- vs. ℓ1-regularized logistic regression on news20
        Relative reg. strength   Nonzeros in solution   Test accuracy
        ℓ2-regularized
          2^0                    1,355,191 (100%)       99.7449%
          2^10                   1,355,191 (100%)       97.0044%
        ℓ1-regularized
          2^0                       67,071 (4.95%)      99.7499%
          2^2                       42,020 (3.10%)      99.7499%
          2^4                       14,524 (1.07%)      99.7449%
          2^6                        5,432 (0.40%)      99.6749%
          2^8                        1,472 (0.11%)      97.3495%
          2^10                         546 (0.04%)      92.8936%
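
A comparison of this kind can be sketched with off-the-shelf tools. The following is a minimal sketch, assuming scikit-learn is installed and the LIBSVM news20.binary file has been downloaded locally (the file name below is an assumption); it reproduces the qualitative effect (ℓ1 zeroes out most weights, ℓ2 keeps them all), not the exact numbers in the table:

    import numpy as np
    from sklearn.datasets import load_svmlight_file
    from sklearn.linear_model import LogisticRegression

    # Load the news20.binary data in LIBSVM format (file location is an assumption).
    X, y = load_svmlight_file("news20.binary")

    for penalty in ("l2", "l1"):
        # liblinear handles both penalties; C is the inverse of the regularization strength.
        clf = LogisticRegression(penalty=penalty, solver="liblinear", C=1.0)
        clf.fit(X, y)
        nnz = np.count_nonzero(clf.coef_)
        print(f"{penalty}: {nnz} of {clf.coef_.size} weights are nonzero "
              f"({100.0 * nnz / clf.coef_.size:.2f}%)")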

  8. Our contributions
     Recall: Comm. cost = (# comm. rounds) × (bytes communicated per round)

  9. Our contributions
     Recall: Comm. cost = (# comm. rounds) × (bytes communicated per round)
     ◮ Focusing on the small subproblem ⇒ fewer bytes to communicate

  10. Our contributions
     Recall: Comm. cost = (# comm. rounds) × (bytes communicated per round)
     ◮ Focusing on the small subproblem ⇒ fewer bytes to communicate
     ◮ Acceleration by smooth optimization in the correct manifold ⇒ fewer rounds of communication

  11. Results (ours: MADPQN)
     [Figure: convergence plots on news20, epsilon, and webspam comparing OWLQN, L-COMM, DPLBFGS, and MADPQN.
      y-axis: relative distance to the optimal value (log-scaled);
      x-axis: communication cost in units of d bytes (upper row) and training time in seconds (lower row).]

  12. Outline: Overview, Empirical Risk Minimization, The Proposed Algorithm, Experiments

  13. Outline: Overview, Empirical Risk Minimization, The Proposed Algorithm, Experiments

  14. Distributed Empirical Risk Minimization (ERM)
     ◮ Train a model by minimizing a function that measures the performance on the training data:
        arg min_{w ∈ R^d} f(w) := Σ_{k=1}^K f_k(w)
     ◮ There are K machines, and f_k is exclusively available on machine k

  15. Distributed Empirical Risk Minimization (ERM)
     ◮ Train a model by minimizing a function that measures the performance on the training data:
        arg min_{w ∈ R^d} f(w) := Σ_{k=1}^K f_k(w)
     ◮ There are K machines, and f_k is exclusively available on machine k
     ◮ Synchronize w or ∇f(w) by communication: the communication cost per iteration is O(d)
     ◮ How to reduce the O(d) cost?
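
A toy single-process sketch of the synchronization step described above (an illustration only, not the presenters' code): each simulated machine k holds its own data shard and computes the gradient of its local logistic loss f_k, and the full gradient Σ_k ∇f_k(w) is obtained by summing the K length-d vectors, which is exactly the O(d)-per-round communication in a real cluster:

    import numpy as np

    def local_logistic_grad(X_k, y_k, w):
        """Gradient of f_k(w) = sum_i log(1 + exp(-y_i * x_i^T w)) on machine k's shard."""
        z = y_k * (X_k @ w)
        return X_k.T @ (-y_k / (1.0 + np.exp(z)))

    rng = np.random.default_rng(0)
    d, K = 1000, 4
    w = np.zeros(d)
    # Each "machine" holds 50 examples with labels in {-1, +1}.
    shards = [(rng.standard_normal((50, d)), rng.choice([-1.0, 1.0], size=50))
              for _ in range(K)]

    # One round of "communication": sum the K local gradients (d floats from each machine).
    grad = sum(local_logistic_grad(X_k, y_k, w) for X_k, y_k in shards)
    print(grad.shape)  # (1000,) -- every round costs O(d) per machine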

  16. Sparsity-inducing Regularizer
     ◮ If w is sparse throughout the training process, we only need to synchronize a shorter vector
     ◮ Regularized ERM: min_w f(w) + R(w)

  17. Sparsity-inducing Regularizer
     ◮ If w is sparse throughout the training process, we only need to synchronize a shorter vector
     ◮ Regularized ERM: min_w f(w) + R(w)
     ◮ An ideal regularization term for forcing sparsity is the ℓ0 norm: ‖w‖_0 = number of nonzeros in w

  18. Sparsity-inducing Regularizer
     ◮ If w is sparse throughout the training process, we only need to synchronize a shorter vector
     ◮ Regularized ERM: min_w f(w) + R(w)
     ◮ An ideal regularization term for forcing sparsity is the ℓ0 norm: ‖w‖_0 = number of nonzeros in w
     ◮ But this norm is not continuous and hence hard to optimize
     ◮ A good surrogate is the ℓ1 norm: ‖w‖_1 = Σ_{i=1}^d |w_i|
     ◮ Our algorithm works for other partly smooth R, e.g. group-LASSO
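
As a quick numerical illustration of the two norms: for w = (0, −3, 0, 0.5) we have ‖w‖_0 = 2 (two nonzero entries) and ‖w‖_1 = 3.5.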

  19. The Regularized Problem
     ◮ Now the problem becomes min_w f(w) + ‖w‖_1, which is harder to minimize than f(w) alone since ‖w‖_1 is not differentiable
     ◮ As the gradient may not even exist, gradient descent or Newton's method cannot be applied directly

  20. Proximal Quasi-Newton
     ◮ Proximal gradient is a simple algorithm that solves
        min_{w′} ∇f(w)^⊤ (w′ − w) + (1/2α) ‖w′ − w‖_2^2 + ‖w′‖_1,
        where α is the step size for the current iteration
     ◮ Each calculation of ∇f requires one round of communication

  21. Proximal Quasi-Newton
     ◮ Proximal gradient is a simple algorithm that solves
        min_{w′} ∇f(w)^⊤ (w′ − w) + (1/2α) ‖w′ − w‖_2^2 + ‖w′‖_1,
        where α is the step size for the current iteration
     ◮ Each calculation of ∇f requires one round of communication
     ◮ To reduce the amount of communication, we include some second-order information: fewer iterations ⇒ fewer rounds of communication
     ◮ Replace the term ‖w′ − w‖_2^2 / (2α) with (w′ − w)^⊤ H (w′ − w) / 2 for some H ≈ ∇²f(w)
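
A minimal sketch of the proximal gradient step above (not the proposed algorithm itself): with the (1/2α)‖w′ − w‖_2^2 term, the subproblem has the closed-form soft-thresholding solution, which is what makes exact zeros appear; the proximal quasi-Newton variant with (w′ − w)^⊤ H (w′ − w)/2 has no closed form and needs an inner iterative solver, which is not shown here:

    import numpy as np

    def soft_threshold(v, tau):
        """Proximal operator of tau * ||.||_1: shrinks v toward zero, producing exact zeros."""
        return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

    def prox_gradient_step(w, grad_f_w, alpha):
        """One step of proximal gradient for min_w f(w) + ||w||_1 with step size alpha."""
        return soft_threshold(w - alpha * grad_f_w, alpha)

    w = np.array([0.3, -0.05, 0.0, 1.2])
    g = np.array([0.5, -0.1, 0.02, -0.4])        # pretend this is grad f(w)
    print(prox_gradient_step(w, g, alpha=0.1))   # small entries land exactly at zero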

  22. Outline: Overview, Empirical Risk Minimization, The Proposed Algorithm, Experiments

  23. Utilizing Sparsity
     ◮ Even if we only update the nonzero entries of w, computing the whole gradient ∇f(w) still keeps the communication cost at O(d)

  24. Utilizing Sparsity
     ◮ Even if we only update the nonzero entries of w, computing the whole gradient ∇f(w) still keeps the communication cost at O(d)
     ◮ Guess: if w_i = 0 at some iteration and it is likely to stay 0 at the next iteration, then it remains 0 at the final solution
     ◮ Then we only solve the subproblem with respect to the coordinates that are likely to be nonzero

  25. Utilizing Sparsity
     ◮ Even if we only update the nonzero entries of w, computing the whole gradient ∇f(w) still keeps the communication cost at O(d)
     ◮ Guess: if w_i = 0 at some iteration and it is likely to stay 0 at the next iteration, then it remains 0 at the final solution
     ◮ Then we only solve the subproblem with respect to the coordinates that are likely to be nonzero
     ◮ A progressive shrinking approach: once we guess w_i = 0, we remove those coordinates from the problem in future iterations
     ◮ So the number of nonzeros in w (i.e. ‖w‖_0) gradually decreases
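
A generic working-set sketch of this shrinking idea (the selection rule below is an assumption; the exact rule used in the presented algorithm may differ): a coordinate is kept if it is currently nonzero, or if the ℓ1 optimality condition is violated there so it is likely to move off zero; everything else is frozen at zero and dropped from the subproblem, so computation and communication scale with the working-set size instead of d:

    import numpy as np

    def select_working_set(w, grad, reg=1.0, tol=1e-6):
        """Indices to keep: currently nonzero, or zero but with |grad_i| > reg (likely to move)."""
        nonzero = np.abs(w) > 0
        # At an optimum, w_i = 0 requires |grad_i| <= reg; a larger |grad_i| suggests
        # coordinate i wants to become nonzero, so it stays in the working set.
        violated = (~nonzero) & (np.abs(grad) > reg + tol)
        return np.flatnonzero(nonzero | violated)

    w = np.array([0.0, 0.7, 0.0, -0.2, 0.0])
    grad = np.array([1.8, -0.3, 0.4, 0.1, -0.9])
    print(select_working_set(w, grad))  # [0 1 3] -- coordinates 2 and 4 are frozen at zero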

  26. Convergence Issue
     ◮ What if our guess was wrong at some iteration?

  27. Convergence Issue
     ◮ What if our guess was wrong at some iteration?
     ◮ We need to double-check: when a stopping criterion is met, we restart with all coordinates
     ◮ Training is terminated only when the model can hardly be improved using all coordinates
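
A schematic outer loop for the double-checking idea on this slide (a sketch only, not the actual implementation): the inner solver works on the shrunken problem until its own stopping criterion fires, after which optimality is re-checked over all coordinates, and training stops only if that full-dimensional check passes:

    import numpy as np

    def optimality_violation(w, grad, reg=1.0):
        """Max violation of the optimality conditions of min_w f(w) + reg*||w||_1 (0 at an optimum)."""
        return float(np.max(np.where(w != 0,
                                     np.abs(grad + reg * np.sign(w)),
                                     np.maximum(np.abs(grad) - reg, 0.0))))

    def optimize_with_restarts(w, full_gradient, select_working_set, inner_solve,
                               outer_tol=1e-4, max_restarts=10):
        for _ in range(max_restarts):
            grad = full_gradient(w)                     # one O(d) round over all coordinates
            if optimality_violation(w, grad) <= outer_tol:
                return w                                # can hardly be improved using all coordinates
            work = select_working_set(w, grad)          # restart shrinking from the full problem
            w = inner_solve(w, work)                    # run on the reduced problem until its
                                                        # own stopping criterion is met
        return w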

  28. More Acceleration by Smooth Optimization
     ◮ |w_i| becomes twice-differentiable when w_i ≠ 0
     ◮ If the coordinates where w_i ≠ 0 are fixed, the proximal approach is not needed anymore
     ◮ The problem can then be transformed into a smooth one for faster convergence

  29. More Acceleration by Smooth Optimization
     ◮ |w_i| becomes twice-differentiable when w_i ≠ 0
     ◮ If the coordinates where w_i ≠ 0 are fixed, the proximal approach is not needed anymore
     ◮ The problem can then be transformed into a smooth one for faster convergence
     ◮ When the nonzero pattern (manifold) does not change for some iterations, it is likely to be the final pattern

  30. More Acceleration by Smooth Optimization
     ◮ |w_i| becomes twice-differentiable when w_i ≠ 0
     ◮ If the coordinates where w_i ≠ 0 are fixed, the proximal approach is not needed anymore
     ◮ The problem can then be transformed into a smooth one for faster convergence
     ◮ When the nonzero pattern (manifold) does not change for some iterations, it is likely to be the final pattern
     ◮ Example with d = 5: {1, 2, 3, 4, 5}
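
A sketch of the switch to smooth optimization described above (an illustration under a stated assumption, not the presenters' implementation): once the nonzero pattern and the signs s_i = sign(w_i) are taken as fixed, ‖w‖_1 restricted to that manifold equals Σ_i s_i w_i, which is linear, so f(w) + ‖w‖_1 becomes a smooth function of the nonzero coordinates only and can be handed to an off-the-shelf smooth solver such as L-BFGS (a real algorithm would also guard against sign changes):

    import numpy as np
    from scipy.optimize import minimize

    def smooth_on_manifold(f_and_grad, w):
        """Optimize f(w) + ||w||_1 over the manifold defined by w's current nonzero pattern."""
        support = np.flatnonzero(w)          # indices of the identified nonzero pattern
        signs = np.sign(w[support])

        def restricted(v):                   # v holds only the nonzero coordinates
            full = np.zeros_like(w)
            full[support] = v
            fval, grad = f_and_grad(full)
            # On this manifold ||w||_1 = signs . v, so objective and gradient are smooth.
            return fval + signs @ v, grad[support] + signs

        res = minimize(restricted, w[support], jac=True, method="L-BFGS-B")
        w_new = np.zeros_like(w)
        w_new[support] = res.x
        return w_new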
