
HiPS: Hierarchical Parameter Synchronization in Large-Scale Distributed Machine Learning - PowerPoint PPT Presentation

  1. HiPS: Hierarchical Parameter Synchronization in Large-Scale Distributed Machine Learning. Jinkun Geng, Dan Li, Yang Cheng, Shuai Wang, and Junfeng Li

  2. ACM SIGCOMM Workshop on NetAI (Network for AI)

  3. Background: Distributed Machine Learning = Computation + Communication

  4. Background: Strong Computation Power (GPU & TPU)

  5. Background: Communication Challenge. TCP suffers from high latency, low throughput, kernel overheads, etc.; RDMA is a promising alternative to TCP

  6. Background: An MNIST benchmark with 1 million parameters

  7. Background: RoCE/RDMA has a multi-vendor ecosystem, but Fat-Tree based deployment raises many problems

  8. Background: Fat-Tree based deployment. (1) PFC pause frame storm [SIGCOMM'15, '16; NS-3 simulation]; (2) Resilient RoCE: performance sacrifice [Chelsio Tech]; (3) Synchronization performance

  9. Background: Fat-Tree based deployment. (1) PFC pause frame storm [SIGCOMM'15, '16]; (2) Resilient RoCE: performance sacrifice

  10. Background: Fat-Tree based deployment. (1) Synchronization performance

  11. Background: Server-Centric Networks. (1) Fewer hops lead to fewer PFC pause frames; (2) Servers prevent the cascading effect of PFC pause frames

  12. Background: Synchronization Algorithms. (1) PS-based; (2) Mesh-based; (3) Ring-based

  13. Background: Synchronization Algorithm: PS-based (Pull + Push)
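
To make the pull + push pattern concrete, the following is a minimal single-process sketch, assuming one logical parameter server and a few workers; the class and function names are illustrative and not taken from the paper.

    # Minimal sketch of PS-based synchronization (pull + push).
    # One logical parameter server, N workers; gradients are faked with noise.
    import numpy as np

    class ParameterServer:
        def __init__(self, dim, lr=0.1):
            self.params = np.zeros(dim)   # global model parameters
            self.lr = lr

        def pull(self):
            # Workers pull the latest global parameters before computing gradients.
            return self.params.copy()

        def push(self, grads):
            # Workers push local gradients; the server applies the averaged update.
            self.params -= self.lr * np.mean(grads, axis=0)

    def one_iteration(server, num_workers=4, dim=8):
        local_grads = []
        for _ in range(num_workers):
            params = server.pull()          # pull phase: params would feed the local forward/backward pass
            grad = np.random.randn(dim)     # stand-in for a real local gradient
            local_grads.append(grad)
        server.push(local_grads)            # push phase

    server = ParameterServer(dim=8)
    one_iteration(server)
    print(server.params)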

  14. Background: Synchronization Algorithm: Mesh-based (Diffuse + Collect)
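
The diffuse + collect pattern can be sketched in the same spirit: each worker owns one slice of the gradient, aggregates that slice from all peers (diffuse), and then gathers every reduced slice back (collect). This is a hedged, single-process numpy illustration, not the paper's implementation.

    import numpy as np

    def mesh_allreduce(grads):
        """grads: list of equal-length 1-D arrays, one per worker."""
        n = len(grads)
        slices = [np.array_split(g, n) for g in grads]     # partition each gradient

        # Diffuse: worker j receives slice j from every peer and averages it.
        reduced = [np.mean([slices[w][j] for w in range(n)], axis=0)
                   for j in range(n)]

        # Collect: every worker gathers the reduced slices to rebuild the full result.
        full = np.concatenate(reduced)
        return [full.copy() for _ in range(n)]

    grads = [np.ones(8) * (w + 1) for w in range(4)]
    print(mesh_allreduce(grads)[0])   # every worker ends with the same averaged gradient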

  15. Background: Synchronization Algorithm: Ring-based (Scatter + Gather)

  16. Background: Synchronization Algorithm: Ring-based (Scatter + Gather)
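
The scatter + gather pattern is the classic ring allreduce: n - 1 scatter-reduce steps in which each worker passes one gradient chunk to its ring neighbour, followed by n - 1 allgather steps that circulate the fully reduced chunks. The sketch below simulates the ring in a single process with numpy; the chunk indexing follows the common textbook formulation rather than anything on the slides.

    import numpy as np

    def ring_allreduce(grads):
        """grads: list of equal-length 1-D arrays, one per worker on the ring."""
        n = len(grads)
        chunks = [np.array_split(g.astype(float), n) for g in grads]

        # Scatter-reduce: in step s, worker w sends chunk (w - s) mod n to its
        # right neighbour, which adds it to its own copy of that chunk.
        for s in range(n - 1):
            for w in range(n):
                src, dst = w, (w + 1) % n
                c = (w - s) % n
                chunks[dst][c] = chunks[dst][c] + chunks[src][c]

        # After n - 1 steps, worker w holds the fully reduced chunk (w + 1) mod n.
        # Allgather: circulate the reduced chunks so every worker has all of them.
        for s in range(n - 1):
            for w in range(n):
                src, dst = w, (w + 1) % n
                c = (w + 1 - s) % n
                chunks[dst][c] = chunks[src][c]

        return [np.concatenate(ch) / n for ch in chunks]   # average across workers

    grads = [np.ones(8) * (w + 1) for w in range(4)]
    print(ring_allreduce(grads)[0])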

  17. HiPS Design: Map the logical view onto the physical structure. (1) Flexible (topology-aware); (2) Hierarchical (efficient)
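
The hierarchical idea behind this mapping can be summarized as: synchronize within small topology-aware groups first, then across groups, then broadcast the result back. The sketch below is a rough two-level illustration with an assumed group layout and a trivial intra-group reducer; the actual mapping onto BCube and Torus is what the following slides illustrate.

    import numpy as np

    def hierarchical_sync(grads, groups):
        """grads: {server_id: 1-D array}; groups: list of lists of server ids."""
        # Phase 1: intra-group synchronization (any flat scheme could be used here,
        # e.g. mesh- or ring-based within each group).
        group_sums = [np.sum([grads[s] for s in g], axis=0) for g in groups]

        # Phase 2: inter-group synchronization among one representative per group.
        global_sum = np.sum(group_sums, axis=0)

        # Phase 3: broadcast the global average back inside each group.
        avg = global_sum / sum(len(g) for g in groups)
        return {s: avg.copy() for g in groups for s in g}

    grads = {s: np.ones(4) * s for s in range(4)}
    print(hierarchical_sync(grads, groups=[[0, 1], [2, 3]])[0])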

  18. HiPS Design: HiPS in BCube

  19. HiPS Design: HiPS in BCube

  20. HiPS Design: HiPS in BCube

  21. HiPS Design: HiPS in BCube (Server <01>)

  22. HiPS Design: HiPS in BCube

  23. HiPS Design: HiPS in Torus

  24. Theoretical Evaluation

  25. Theoretical Evaluation

  26. Theoretical Evaluation

  27. Future Work: Conduct further comparative study; integrate HiPS into DML systems

  28. Simulation Evaluation: NS-3 simulation with a VGG workload. (1) BCube: GST (global synchronization time) reduced by 37.5% ~ 61.9%; (2) Torus: GST reduced by 49.6% ~ 66.4%. Figures: GST comparison with RDMA in BCube; GST comparison with RDMA in Torus

  29. Testbed Evaluation: system instance of HiPS is BML. (1) Add an OP in TensorFlow; (2) 9 servers, each equipped with 2 RNICs (BCube(3,1)); (3) MNIST and VGG19 as benchmarks; (4) baselines: Ring Allreduce in Ring and Mesh-based (P2P) sync in Fat-Tree
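
BML integrates the synchronization step as a TensorFlow OP backed by RDMA. As a rough, hedged illustration of where such an op sits in a training iteration, this sketch uses a plain Python hips_sync stand-in that merely averages gradients locally; the name and behaviour are assumptions for illustration, not BML's actual API.

    import numpy as np

    def hips_sync(local_grads):
        # Stand-in for the synchronization OP: in BML this would run the
        # hierarchical synchronization over RDMA instead of a local mean.
        return np.mean(local_grads, axis=0)

    def train_step(weights, all_worker_grads, lr=0.01):
        synced = hips_sync(all_worker_grads)   # invoked once per iteration,
        return weights - lr * synced           # between backprop and the weight update

    w = np.zeros(4)
    grads = [np.random.randn(4) for _ in range(3)]   # gradients from 3 simulated workers
    w = train_step(w, grads)
    print(w)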

  30. Testbed Evaluation

  31. Testbed Evaluation: 18.7% ~ 56.4%

  32. Ongoing Work: Conduct further comparative study; optimize HiPS in DML systems; more cases of Network for AI

  33. Thanks! NASP Research Group, https://nasp.cs.tsinghua.edu.cn/
