Parameter Hub
A Rack-Scale Parameter Server for Efficient Cloud-based Distributed Deep Neural Network Training
Liang Luo, Jacob Nelson, Luis Ceze, Amar Phanishayee, and Arvind Krishnamurthy
DNN training is computationally expensive and needs to be distributed across many workers, typically coordinated by a parameter server.

[Figure: timeline of two workers training synchronously with a parameter server. Each worker runs a (F)orward pass and (B)ackward pass over its minibatch, pushes gradients to the parameter server for (A)ggregation and (O)ptimization, then pulls the updated parameters before starting the next iteration.]
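As a point of reference, below is a minimal sketch of the synchronous loop the timeline depicts, in plain Python/NumPy; the toy model, worker count, and SGD update are illustrative placeholders, not PHub's or MXNet's implementation.

```python
# A minimal sketch of the synchronous parameter-server loop from the timeline.
# The "model", loss, and optimizer are placeholders, not the real framework.
import numpy as np

def forward_backward(params, batch):
    """Placeholder (F)orward/(B)ackward pass: returns a gradient with the
    same shape as the parameters. A real worker would run the DNN here."""
    x, y = batch
    pred = x @ params                      # toy linear "model"
    return x.T @ (pred - y) / len(x)       # gradient of squared error

def parameter_server_step(params, grads, lr=0.01):
    """(A)ggregation and (O)ptimization on the parameter server."""
    agg = np.mean(grads, axis=0)           # aggregate gradients from all workers
    return params - lr * agg               # simple SGD optimizer step

# Two workers, as in the figure.
rng = np.random.default_rng(0)
params = rng.normal(size=(4, 1))
batches = [(rng.normal(size=(8, 4)), rng.normal(size=(8, 1))) for _ in range(2)]

for step in range(3):                      # three iterations of the timeline
    grads = [forward_backward(params, b) for b in batches]   # F + B on each worker
    params = parameter_server_step(params, grads)            # A + O on the server
```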
In the cloud, machines with GPUs sit in racks behind top-of-rack (ToR) switches that connect to a shared network core. Workers and parameter server shards end up spread across racks, so gradient traffic must traverse the core on every iteration.

[Figure: a two-rack cloud deployment, with Worker 1 and PS 1 in one rack and Worker 2 and PS 2 in another, connected through their ToR switches and the network core.]
GPUs have become dramatically faster with each generation (GRID 520 in 2012, K80 in 2014, M60 in 2015, V100 in 2017), so per-iteration compute time keeps shrinking while communication does not: the bottleneck shifts from computation to communication. Faster GPUs alone do little to increase training throughput without matching network resources; with each newer GPU, a larger fraction of each iteration is spent with the GPU idle, waiting on the network.

[Figure: per-iteration compute time for ResNet 269 across GPU generations, and the split between "GPU and network active" and "GPU idle, waiting on network" for AlexNet, GoogleNet, Inception V3, and ResNet 269.]
Where does the time go? Overheads come from the GPU, the training framework, and the network, so we profile a full training iteration end to end.

[Figure: per-iteration time breakdown in seconds for AlexNet, GoogleNet, Inception, and ResNet 269, split into compute, data copy and communication, aggregator, optimizer, and synchronization and other overheads; a large share of each iteration is spent outside GPU compute.]
How much bandwidth would each of these popular networks need so that communication does not bottleneck computation? With 8 workers (GTX 1080 Ti, MXNet, central parameter servers), the minimum is roughly 40 Gbps for GoogleNet and Inception, 100 Gbps for ResNet, and 1,200 Gbps for AlexNet. Cloud instances typically provide only 10 to 25 Gbps, far short of what the most communication-heavy models require.
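A rough way to see where such numbers come from: each iteration a worker pushes its gradients and pulls updated parameters (about twice the model size), and with central parameter servers all of that traffic converges on one place. The sketch below uses an assumed model size and compute time purely for illustration; the figures above are the authors' measurements.

```python
# Back-of-envelope bandwidth estimate. Model size and per-iteration compute
# time are assumed placeholders, not the measured values from the slide.

def required_gbps(model_mbytes, compute_seconds, workers):
    bits_per_iter = 2 * model_mbytes * 1e6 * 8           # push gradients + pull params
    per_worker = bits_per_iter / compute_seconds / 1e9   # Gbps on each worker's link
    return per_worker, per_worker * workers              # Gbps arriving at a central PS

per_worker, at_server = required_gbps(model_mbytes=240,     # assumed model size
                                      compute_seconds=0.08, # assumed GPU time per iteration
                                      workers=8)
print(f"per worker: {per_worker:.0f} Gbps, at central PS: {at_server:.0f} Gbps")
```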
Even within the cloud, bandwidth between hosts is uneven. Measuring pairwise bandwidth between 8 hosts shows two distinct levels, roughly 9 Gbps and 4 Gbps:

Host    1    2    3    4    5    6    7
  2    4.7
  3    8.9  4.7
  4    8.9  4.7  8.9
  5    8.9  4.7  8.9  8.9
  6    4.7  9.0  4.7  4.7  4.7
  7    8.9  4.7  9.0  8.9  9.0  4.7
  8    4.7  9.0  4.7  4.7  4.7  9.0  4.7

(Pairwise bandwidth in Gbps.) The hosts fall into two groups: Cluster 1 = {1, 3, 4, 5, 7} and Cluster 2 = {2, 6, 8}. Pairs within a cluster see about 9 Gbps; pairs across clusters see about 4.7 Gbps, reflecting oversubscription between parts of the cloud network.
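One simple way to recover that grouping automatically (a sketch, not necessarily the procedure used in the work): threshold the pairwise bandwidths and take connected components of the "fast link" graph.

```python
# Group hosts by measured pairwise bandwidth using a union-find over
# links faster than a chosen threshold. Threshold value is an assumption.

# bw[(i, j)] is the measured bandwidth (Gbps) between hosts i and j, i > j.
bw = {
    (2, 1): 4.7,
    (3, 1): 8.9, (3, 2): 4.7,
    (4, 1): 8.9, (4, 2): 4.7, (4, 3): 8.9,
    (5, 1): 8.9, (5, 2): 4.7, (5, 3): 8.9, (5, 4): 8.9,
    (6, 1): 4.7, (6, 2): 9.0, (6, 3): 4.7, (6, 4): 4.7, (6, 5): 4.7,
    (7, 1): 8.9, (7, 2): 4.7, (7, 3): 9.0, (7, 4): 8.9, (7, 5): 9.0, (7, 6): 4.7,
    (8, 1): 4.7, (8, 2): 9.0, (8, 3): 4.7, (8, 4): 4.7, (8, 5): 4.7, (8, 6): 9.0, (8, 7): 4.7,
}

THRESHOLD = 7.0  # Gbps; links above this are assumed to stay within one cluster

def clusters(hosts, bw, threshold):
    """Group hosts into connected components of the 'fast link' graph."""
    parent = {h: h for h in hosts}
    def find(h):
        while parent[h] != h:
            parent[h] = parent[parent[h]]   # path halving
            h = parent[h]
        return h
    for (a, b), gbps in bw.items():
        if gbps > threshold:
            parent[find(a)] = find(b)       # union the two hosts
    groups = {}
    for h in hosts:
        groups.setdefault(find(h), []).append(h)
    return sorted(sorted(g) for g in groups.values())

print(clusters(range(1, 9), bw, THRESHOLD))
# -> [[1, 3, 4, 5, 7], [2, 6, 8]], matching Cluster 1 and Cluster 2 above
```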
These measurements point to three bottlenecks: overheads in the software data path that gradients traverse between GPU and network (data copy, aggregation, optimization), insufficient bandwidth at the parameter server machine, and cross-rack traffic over the oversubscribed network core. PHub is a codesign that attacks all three, starting with the data path: gradients arriving at the server are processed by the CPU out of memory, so aggregation and optimization must be fast and cache-friendly.

[Figure: the gradient path GPU -> data copy -> aggregation -> optimization -> network, and the parameter server's CPU and memory where the gradients are processed.]
How should the server aggregate gradients? One option is for each core to read the input queue from a different worker and, for each input queue, launch a series of threads that write to different locations in the output queue; this requires synchronization. PHub instead uses tall aggregation: each core sequentially aggregates the same portion of the gradients within each worker's queue, giving great locality with no synchronization, and across sockets the cores are organized into a NUMA-aware tree reduction.

To make this work, keys are chunked and chunks are mapped to cores deterministically: a given chunk of a given key's gradients always goes to a particular core on the server, so cores aggregate independently and each chunk stays on one core, maintaining maximum locality.

[Figure: the gradient array for key 0 arriving from 8 workers, the chunk-to-core mappings, and the aggregated result.]
When aggregation of a chunk is done, PHub runs the optimizer for that chunk on the very core that aggregated it, so the aggregated values are still resident in that core's cache when they are optimized.

[Figure: the gradient array for key 0 from 8 workers, first aggregated and then optimized chunk by chunk on the owning cores.]
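A sketch of the idea in plain Python/NumPy; the chunk size, core count, key layout, and SGD update are illustrative assumptions, and the per-core work runs serially here where PHub would run it in parallel.

```python
# Tall aggregation with chunk-local optimization (illustrative sketch).
import numpy as np

WORKERS, CORES, KEY_LEN, CHUNK = 8, 4, 1024, 256

rng = np.random.default_rng(0)
params = rng.normal(size=KEY_LEN)              # parameters for key 0
grads = rng.normal(size=(WORKERS, KEY_LEN))    # key 0's gradients from 8 workers

def core_for_chunk(chunk_idx):
    """Deterministic chunk -> core mapping: the same chunk of the same key
    always lands on the same core, so no cross-core synchronization is needed."""
    return chunk_idx % CORES

def aggregate_and_optimize(core_id, lr=0.01):
    """Work done by one core: sequentially sum its chunks across all workers
    (tall aggregation), then run the optimizer on the same chunks while they
    are still hot in that core's cache."""
    for c in range(KEY_LEN // CHUNK):
        if core_for_chunk(c) != core_id:
            continue
        lo, hi = c * CHUNK, (c + 1) * CHUNK
        agg = np.zeros(CHUNK)
        for w in range(WORKERS):               # one worker's chunk at a time
            agg += grads[w, lo:hi]
        params[lo:hi] -= lr * agg / WORKERS    # chunk-local SGD step

for core in range(CORES):                      # PHub runs these in parallel,
    aggregate_and_optimize(core)               # one per core; serial here for clarity
```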
The second bottleneck is the parameter server machine itself. Its internal bandwidth is plentiful (on the order of 14,000 Gbps at the CPU, 1,100 Gbps at memory, and 800 Gbps across PCIe), but a typical cloud NIC delivers only 10 Gbps, so the network interface starves everything behind it; a second NIC only raises that to 20 Gbps. PBox rebalances the machine by adding NICs until the network side, at about 800 Gbps, matches the PCIe and memory bandwidth behind it.

[Figure: bandwidth available at each level of the parameter server, from CPU and memory down to PCIe and the NICs, as NICs are added.]
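The rebalancing arithmetic is straightforward; in the sketch below the internal bandwidths are the rough figures above, while the per-NIC speed is an assumption chosen only for illustration.

```python
# How many NICs until the network is no longer the narrowest link?
# Internal bandwidths are the approximate figures above; NIC speed is assumed.
import math

cpu_gbps, mem_gbps, pcie_gbps = 14_000, 1_100, 800   # internal bandwidth (approximate)
nic_gbps = 100                                       # assumed per-NIC bandwidth

bottleneck = min(cpu_gbps, mem_gbps, pcie_gbps)      # narrowest internal link (PCIe here)
nics_needed = math.ceil(bottleneck / nic_gbps)
print(f"{nics_needed} NICs (~{nics_needed * nic_gbps} Gbps) to match {bottleneck} Gbps")
```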
The third bottleneck is the oversubscribed network core. PHub therefore deploys one PBox per rack, attached to the rack's ToR switch: the Worker/PS machines in a rack send their gradients to the local PBox, which aggregates them at rack level, so only the aggregated gradients cross the cluster network, where they are aggregated again across racks. With N workers per rack, this hierarchical aggregation gives an N times reduction in traffic over the core.

[Figure: racks of Worker/PS machines behind ToR switches connected to the cluster network, with a CM and a PBox in each rack; aggregation happens first at the PBox within each rack and then across racks.]
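The traffic arithmetic behind that claim, as a sketch: assume the baseline places parameter server shards outside the rack, so every worker's gradients cross the core, while with a PBox only the rack-level aggregate does. The model size and rack size below are placeholders.

```python
# Cross-rack traffic with and without rack-level aggregation (illustrative).

def cross_rack_gbytes(model_gbytes, workers_per_rack, with_pbox):
    """Gradient bytes one rack sends across the network core per iteration."""
    if with_pbox:
        return model_gbytes                    # only the rack-level aggregate crosses the core
    return model_gbytes * workers_per_rack     # every worker's gradients cross the core

model_gbytes, n = 0.25, 8                      # assumed: ~250 MB of gradients, 8 workers/rack
baseline = cross_rack_gbytes(model_gbytes, n, with_pbox=False)
pbox = cross_rack_gbytes(model_gbytes, n, with_pbox=True)
print(f"{baseline:.2f} GB vs {pbox:.2f} GB per iteration: {baseline / pbox:.0f}x reduction")
```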
End-to-end training speedup of PBox over an InfiniBand-enhanced MXNet baseline (8 workers, GTX 1080 Ti; batch size 64 for ResNext, 128 for ResNet 269, 256 for all others):

AlexNet       2.7x
VGG 11        2.3x
VGG 19        2.2x
GoogleNet     1.3x
Inception V3  1.3x
ResNet 18     2.2x
ResNet 50     1.9x
ResNet 269    2.3x
ResNext 269   2.3x
[Figure: the earlier per-iteration time breakdown (seconds) for AlexNet, GoogleNet, Inception, and ResNet 269, revisited.]
[Figure: memory bandwidth (MB/s) versus number of active workers (1 to 16), comparing PHub during training against a microbenchmark limit.]
Scaling further: 120 machines training ResNet 50. Overhead (%) by number of racks:

Racks        2      3      4      5      6      7      8
AlexNet    14.3%  14.7%  15.1%  15.5%  15.9%  16.2%  16.5%
ResNet 50   0.4%   1.6%   1.8%   1.8%   1.4%   1.5%   1.9%
PBox is also more cost-effective: throughput per $1,000 is 82.2 for PBox versus 60.4 for standard sharded parameter servers, accounting for network devices (switch ports, network adapters, network cables), GPU costs, and PBox's entire machine cost, and assuming 2:1 core oversubscription.
In summary, PHub is a software, hardware, and cluster configuration codesign that targets the three major bottlenecks to efficient cloud-based distributed DNN training.