HiPS: Hierarchical Parameter Synchronization in Large-Scale Distributed Machine Learning
Jinkun Geng, Dan Li, Yang Cheng, Shuai Wang, and Junfeng Li
ACM SIGCOMM Workshop on NetAI
Background
Distributed Machine Learning: Computation & Communication
Background
Strong Computation Power (GPU & TPU)
Background
Communication challenge with TCP: high latency, low throughput, kernel overheads, etc.
RDMA: a promising alternative to TCP
Background
An MNIST benchmark with 1 million parameters
Background
RoCE/RDMA: multi-vendor ecosystem, but many problems in Fat-Tree based deployment
Background
Fat-Tree based Deployment
1. PFC pause frame storm [SIGCOMM'15, '16; NS-3 simulation]
2. Resilient RoCE: performance sacrifice [Chelsio-Tech]
3. Synchronization performance
Background
Server-Centric Networks
1. Fewer hops lead to fewer PFC pause frames
2. Servers prevent the cascading effect of PFC pause frames
Background
Synchronization Algorithm
1. PS-based
2. Mesh-based
3. Ring-based
Background
Synchronization Algorithm
PS-based (Pull + Push)
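The PS-based (Pull + Push) pattern can be sketched as a minimal in-process simulation; `ParameterServer`, `push`, and `pull` are illustrative names assumed here, not APIs from the paper or any framework.

```python
# Hedged sketch of PS-based (Pull + Push) synchronization, simulated
# in-process. All names and the learning rate are illustrative.

class ParameterServer:
    def __init__(self, params):
        self.params = list(params)

    def push(self, grads, lr=0.1):
        # Push: a worker sends its gradients; the server applies them.
        self.params = [p - lr * g for p, g in zip(self.params, grads)]

    def pull(self):
        # Pull: a worker fetches the latest global parameters.
        return list(self.params)

ps = ParameterServer([1.0, 2.0])
for grads in [[0.1, 0.2], [0.3, 0.4]]:   # two workers push in turn
    ps.push(grads)
new_params = ps.pull()                    # all workers then pull
```

The single server aggregating every worker's traffic is exactly the bottleneck the later mesh- and ring-based schemes avoid.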
Background
Synchronization Algorithm
Mesh-based (Diffuse + Collect)
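The Diffuse + Collect pattern can be sketched as follows: each of n workers splits its gradient into n blocks, block j is diffused to worker j for aggregation, and the aggregated blocks are then collected by everyone. The function name and scalar blocks are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of mesh-based (Diffuse + Collect) synchronization.
# grads[i][j] = worker i's value for block j (scalars for simplicity).

def mesh_allreduce(grads):
    n = len(grads)
    # Diffuse: block j is sent to worker j, which sums the n contributions.
    reduced = [sum(g[j] for g in grads) for j in range(n)]
    # Collect: every worker gathers all n aggregated blocks.
    return reduced

grads = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # 3 workers, 3 blocks each
print(mesh_allreduce(grads))  # [12, 15, 18]
```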
Background
Synchronization Algorithm
Ring-based (Scatter + Gather)
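The Scatter + Gather pattern is the classic ring allreduce: n-1 scatter-reduce steps in which each worker passes one chunk to its ring neighbor for accumulation, then n-1 all-gather steps that circulate the fully reduced chunks. This is a generic sketch of that algorithm under assumed scalar chunks, not the paper's code.

```python
# Hedged sketch of ring-based (Scatter + Gather) synchronization.
# chunks[i][j] = worker i's value for chunk j (scalars for simplicity).

def ring_allreduce(chunks):
    n = len(chunks)
    buf = [list(c) for c in chunks]
    # Scatter-reduce: in step s, worker i sends chunk (i - s) % n to
    # worker (i + 1) % n, which accumulates it. Sends are snapshotted
    # first to model simultaneous neighbor exchanges.
    for s in range(n - 1):
        sends = [(i, (i - s) % n, buf[i][(i - s) % n]) for i in range(n)]
        for i, j, val in sends:
            buf[(i + 1) % n][j] += val
    # Now worker i owns the fully reduced chunk (i + 1) % n.
    # All-gather: circulate the reduced chunks around the ring.
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n, buf[i][(i + 1 - s) % n]) for i in range(n)]
        for i, j, val in sends:
            buf[(i + 1) % n][j] = val
    return buf

out = ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# every worker ends with the global sums [12, 15, 18]
```

Each worker only ever talks to its two ring neighbors, which keeps per-link traffic constant as the number of workers grows.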
HiPS Design
Map Logical View onto Physical Structure
1. Flexible (topology-aware)
2. Hierarchical (efficient)
HiPS Design
HiPS in BCube (example: Server <01>)
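A two-level flavor of this hierarchical mapping can be sketched for a 2-D BCube-style server grid: servers that differ in only one address digit form a group, groups synchronize along one dimension per phase, and after both phases every server holds the global sum. `hips_sync` and the scalar gradients are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of two-phase hierarchical synchronization over a
# BCube(n, 1)-style grid. Servers are addressed as (d0, d1); phase p
# groups servers that share the other digit and allreduces within the
# group. Names are illustrative.
from itertools import product

def hips_sync(values):
    vals = dict(values)                  # (d0, d1) -> local gradient scalar
    for phase in range(2):               # one phase per BCube dimension
        groups = {}
        for addr in vals:
            groups.setdefault(addr[1 - phase], []).append(addr)
        for members in groups.values():
            total = sum(vals[a] for a in members)  # intra-group allreduce
            for a in members:
                vals[a] = total
    return vals

servers = {(d0, d1): d0 * 2 + d1 + 1 for d0, d1 in product(range(2), repeat=2)}
print(hips_sync(servers))   # every server ends with the global sum 10
```

Because each phase only uses links within one BCube dimension, the scheme stays topology-aware: traffic follows the physical one-hop neighborhoods rather than crossing the whole fabric.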
HiPS Design
HiPS in Torus
Theoretical Evaluation
Future Work
1. Conduct further comparative study
2. Integrate HiPS into DML systems
Simulation Evaluation
GST comparison with RDMA in Torus and in BCube
NS-3 simulation with a VGG workload
1. BCube: GST reduced by 37.5%∼61.9%
2. Torus: GST reduced by 49.6%∼66.4%
Testbed Evaluation
System instance of HiPS: BML
1. Add an OP in TensorFlow
2. 9 servers, each equipped with 2 RNICs (BCube(3,1))
3. MNIST and VGG19 as benchmarks
4. Baselines: Ring AllReduce in Ring topology and Mesh-based (P2P) sync in Fat-Tree
Ongoing Work
1. Conduct further comparative study
2. Optimize HiPS in DML systems
3. More cases of Network for AI
Thanks! NASP Research Group https://nasp.cs.tsinghua.edu.cn/