TensorLights: End-Host Traffic Scheduling for Distributed Deep Learning
Xin Sunny Huang, Ang Chen, T. S. Eugene Ng
Rice University

This Work
The Parameter Server (PS) architecture is the most popular approach for distributed Deep Learning. Disadvantage: traffic contention at the PS introduces harmful stragglers. TensorLights mitigates stragglers, improving application performance and machine utilization.
Classic AI problems: Language Processing, Image Recognition. Also used for: System Security [2], Power Scheduling [1], Network Routing [3], Database Index [4].
[1] DeepMind AI reduces Google data centre cooling bill by 40%. (2016)
[2] Abadi, M. et al. Learning to protect communications with adversarial neural cryptography. (arXiv 2016)
[3] Valadarsky, A. et al. Learning to route. (HotNets 2017)
[4] Kraska, T. et al. The case for learned index structures. (SIGMOD 2018)
[5] Gu, J. et al. Tiresias: A GPU Cluster Manager for Distributed Deep Learning. (NSDI 2019)
Parameter Server (PS)
In each step, workers send gradient updates to the PS and the PS sends model updates back; a barrier synchronizes the workers between steps. Steps per job: 1,000s to 1,000,000s [1].
[1] Szegedy, C. et al. Going Deeper with Convolutions. (CVPR 2015)
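To make the PS workflow concrete, the following is a minimal sketch of one synchronous training loop under the PS architecture. It is an illustration only: the ParameterServer and Worker classes and their methods are hypothetical stand-ins, not the authors' code or a real framework API.

```python
import numpy as np

class ParameterServer:
    """Holds the model and applies the gradient updates pushed by workers."""
    def __init__(self, model_size, lr=0.01):
        self.model = np.zeros(model_size)
        self.lr = lr

    def apply_gradients(self, gradients):
        # Aggregate the workers' gradient updates and produce the new model,
        # i.e. the "model update" the PS sends back to every worker.
        self.model -= self.lr * np.mean(gradients, axis=0)
        return self.model

class Worker:
    """Computes a gradient update on its local data shard."""
    def __init__(self, model_size):
        self.model = np.zeros(model_size)

    def compute_gradient(self):
        return np.random.randn(*self.model.shape)  # placeholder for forward/backward

ps = ParameterServer(model_size=(1000,))
workers = [Worker(model_size=(1000,)) for _ in range(4)]

for step in range(3):  # real jobs run 1,000s to 1,000,000s of steps
    grads = [w.compute_gradient() for w in workers]  # workers push gradient updates
    new_model = ps.apply_gradients(grads)            # PS aggregates them
    for w in workers:
        w.model = new_model.copy()  # PS sends the model update back; the barrier:
                                    # no worker starts the next step before all have it
```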
Cluster scheduler to manage the lifecycles of DL jobs.
Grid Search: run many DL jobs that train the same model with different hyperparameter configurations (e.g., model initialization methods) to find the best configuration.
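As an illustration of this grid-search pattern, the sketch below launches one DL job per hyperparameter configuration. It is only a hedged example: submit_job, train.py, and the hyperparameter names are assumptions, not part of the paper or of any particular cluster scheduler's API.

```python
import itertools
import subprocess

# Hypothetical hyperparameter grid: every combination becomes one DL job,
# all training the same model architecture.
initializers = ["xavier", "he", "random_normal"]
learning_rates = [0.1, 0.01, 0.001]

def submit_job(init, lr):
    """Hypothetical helper: launch one PS + 20 workers for a single configuration."""
    cmd = [
        "python", "train.py",            # assumed training entry point
        "--initializer", init,
        "--learning-rate", str(lr),
        "--num-ps", "1",
        "--num-workers", "20",
    ]
    return subprocess.Popen(cmd)

# One job per configuration; the best configuration is picked after all jobs finish.
jobs = [submit_job(i, lr) for i, lr in itertools.product(initializers, learning_rates)]
for job in jobs:
    job.wait()
```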
[Figure: a cluster scheduler (e.g., YARN [1]) places each DL job's worker (w) and parameter server (PS) tasks across the hosts of a shared cluster; as more jobs arrive, PSes from different jobs become collocated on the same host.]
Contention among collocated PSes! How would PS contention impact the performance of distributed DL jobs?
[1] Vavilapalli, V. K. et al. Apache Hadoop YARN: Yet another resource negotiator. (ACM SoCC 2013)
Workload: 21 concurrent DL jobs in TensorFlow [1]; each job has 1 PS and 20 workers and trains a ResNet model [2] on the CIFAR dataset [3] until 30,000 global steps are reached.
Testbed: CPU cluster with 21 hosts, all connected to an Ethernet switch with 10 Gbps link rate.
Task placement: each job's 21 tasks are each placed on a different host. PS placement varies from skewed (e.g., all 21 PSes on one host, or 7 PSes per host) to uniform (one PS per host).
[1] https://www.tensorflow.org/
[2] He, K. et al. Deep residual learning for image recognition. (IEEE CVPR 2016)
[3] Krizhevsky, A. Learning multiple layers of features from tiny images. (University of Toronto Technical Report 2009)
Application performance degrades due to contention at the PS.
Job Completion Time (JCT) under various PS placements (lower is better). Placements are indexed from intense (#1) to mild (#8) traffic contention among PSes:

Placement index  #1    #2    #3    #4    #5    #6    #7    #8
JCT (seconds)    1830  1471  1213  1153  1110  1092  1078  1045

The most contended placement (#1) takes about 75% longer than the least contended one (#8).
[Figure: on a host machine, collocated PS1 and PS2 send their model updates to their workers through a shared FIFO transmit queue, so the two jobs' transmissions interleave over time.]
Possible stragglers detected! Workers (of PS1) receiving the tail part will delay the progress of the whole job.
Intra-job level: one straggling worker delays the whole job, including the other workers.
Inter-job level: multiple jobs are delayed simultaneously if each job has a few stragglers.
The result: application performance degradation and machine underutilization.
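A toy worked example of this effect, with assumed numbers (a 10 Gbps NIC shared by the two PSes, 1 GB model update per worker, 2 workers per job) rather than measurements from the paper:

```python
# Toy model of FIFO transmission from two collocated PSes sharing one NIC.
LINK_GBPS = 10
UPDATE_GB = 1.0
SEND_TIME = UPDATE_GB * 8 / LINK_GBPS          # 0.8 s to push one per-worker update

# Order in which the FIFO transmit queue drains the queued sends: (job, worker).
fifo_queue = [("job1", "w1"), ("job2", "w1"), ("job1", "w2"), ("job2", "w2")]

finish, clock = {}, 0.0
for job, worker in fifo_queue:
    clock += SEND_TIME                         # sends are serialized on the shared link
    finish[(job, worker)] = clock

for job in ("job1", "job2"):
    times = sorted(t for (j, _), t in finish.items() if j == job)
    # The barrier means the whole job advances only once its LAST worker has the update.
    print(f"{job}: workers get the update at {times} s -> job advances at {max(times)} s")
```

In this toy run, job 1's workers get the update at 0.8 s and 2.4 s, so the barrier stalls the whole job until 2.4 s (intra-job straggler); job 2 is similarly stalled until 3.2 s, so both jobs are delayed at once (inter-job).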
[Figure: with priority "1>2", the host sends all of PS1's model update before PS2's, instead of interleaving the two FIFO over time.]
One priority for one job's model update (from the PS).
Traffic prioritization mitigates stragglers: workers of the same job are expected to wait for similar lengths of time.
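One possible way to attach such a per-job priority to a PS's outgoing traffic at the end host is to tag its connections with a socket priority that a priority-aware queueing discipline can honor. This is only an illustrative sketch under that assumption; the paper's actual implementation uses htb in the Linux tc module (described later), and the port and priority values here are arbitrary.

```python
import socket

# Illustrative only: tag each PS -> worker connection with its job's priority so a
# priority-aware qdisc on the host can drain one job's model update before another's.
# SO_PRIORITY is Linux-specific; its value is 12 if the constant is not exposed.
SO_PRIORITY = getattr(socket, "SO_PRIORITY", 12)

def set_job_priority(conn: socket.socket, job_priority: int) -> None:
    """Mark one PS -> worker connection with the priority assigned to its job."""
    conn.setsockopt(socket.SOL_SOCKET, SO_PRIORITY, job_priority)

# Hypothetical usage inside a PS process, after accepting a worker's connection:
#   conn, _ = ps_listening_socket.accept()
#   set_job_priority(conn, job_priority=6 if job_id == 1 else 1)   # "1>2"
```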
TensorLights-One: keep a single fixed priority assignment (e.g., "1>2") for the collocated PSes.
TensorLights-RoundRobin: rotate the priority assignments "1>2" and "2>1".
[Figure: compared to FIFO, both TensorLights policies send each job's model update as a contiguous burst instead of interleaving the two; RoundRobin alternates which job goes first.]
Reducing stragglers with priority while achieving fair progress among concurrent jobs!
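A minimal sketch of the rotation idea, assuming one rotation per scheduling round over the jobs collocated on a host (the round granularity and priority values are assumptions, not taken from the paper):

```python
from itertools import cycle

# Rotate which collocated job's PS traffic gets the higher priority, one rotation
# per scheduling round, so every job is favored equally often over time.
orderings = cycle([("job1", "job2"), ("job2", "job1")])   # "1>2", then "2>1", then ...

def priorities_for(ordering):
    """Earlier in the ordering = higher priority value."""
    return {job: len(ordering) - rank for rank, job in enumerate(ordering)}

for round_id in range(4):
    print(f"round {round_id}: {priorities_for(next(orderings))}")
# round 0: {'job1': 2, 'job2': 1}
# round 1: {'job2': 2, 'job1': 1}
# ...
```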
TensorLights vs. other communication acceleration strategies:
Scheduling overhead: TensorLights is local and light-weight (✓); others require global coordination (✗).
Resource scheduling: TensorLights is work conserving (✓); for others, an inaccurate rate leads to bandwidth under-utilization (✗).
Deployment: TensorLights needs no change to the application, cluster scheduler, ... (✓); others require modifying the DL framework at various levels (✗).
Workload, testbed, and task placement: same as the previous measurement study.
TensorLights implementation: hierarchical token bucket (htb) in the traffic control (tc) module under Linux.
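As a hedged illustration of what such an htb-based setup could look like (not the authors' exact configuration: the device name, rates, ports, and class layout below are all assumptions), one could install an htb hierarchy and steer each job's PS traffic into a class with its own priority:

```python
import subprocess

DEV = "eth0"   # assumed NIC name

def tc(*args):
    """Run one tc command (requires root)."""
    subprocess.run(["tc", *args], check=True)

# Root htb qdisc with a single 10 Gbps parent class.
tc("qdisc", "add", "dev", DEV, "root", "handle", "1:", "htb", "default", "30")
tc("class", "add", "dev", DEV, "parent", "1:", "classid", "1:1", "htb", "rate", "10gbit")

# One child class per collocated job's PS traffic; 'prio' decides who borrows
# spare bandwidth first, realizing the "1>2" assignment.
tc("class", "add", "dev", DEV, "parent", "1:1", "classid", "1:10",
   "htb", "rate", "1gbit", "ceil", "10gbit", "prio", "0")     # job 1 (higher priority)
tc("class", "add", "dev", DEV, "parent", "1:1", "classid", "1:20",
   "htb", "rate", "1gbit", "ceil", "10gbit", "prio", "1")     # job 2 (lower priority)
tc("class", "add", "dev", DEV, "parent", "1:1", "classid", "1:30",
   "htb", "rate", "1gbit", "ceil", "10gbit", "prio", "2")     # everything else

# Steer each PS's model-update flows into its job's class by source port
# (assumed ports 2222 and 2223 for the two PSes).
tc("filter", "add", "dev", DEV, "protocol", "ip", "parent", "1:", "prio", "1",
   "u32", "match", "ip", "sport", "2222", "0xffff", "flowid", "1:10")
tc("filter", "add", "dev", DEV, "protocol", "ip", "parent", "1:", "prio", "1",
   "u32", "match", "ip", "sport", "2223", "0xffff", "flowid", "1:20")
```

Because htb is work conserving, the lower-priority job still uses the link whenever the higher-priority job has nothing to send, which matches the "work conserving" property noted in the comparison above.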
Results:
Normalized Job Completion Time (JCT) under various PS placements (lower is better, normalized to FIFO = 1.00). Placements are indexed from intense (#1) to mild (#4) traffic contention among PSes:

Placement index          #1    #2    #3    #4
TensorLights-One         0.73  0.84  0.98  1.01
TensorLights-RoundRobin  0.81  0.87  1.01  1.00

TensorLights is more effective in high-contention cases; it improves the average JCT by up to 27%.
Metrics: the average (and standard variance) of elapsed waiting time at the same barrier among workers of the same job.
[Figure: Distribution of Barrier Wait Time; CDFs of (a) the average and (b) the standard variance of barrier wait time per barrier (seconds, log scale) under FIFO, TensorLights-RoundRobin, and TensorLights-One; smaller is better.]
Comparable averages under all policies.
TensorLights-One reduces the variance by 26% on average (TensorLights-RoundRobin: 15%).
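For concreteness, this is how that per-barrier metric could be computed from per-worker wait times; the numbers are made up, and the slide's "standard variance" is read here as the standard deviation (an assumption):

```python
import numpy as np

# Made-up wait times (seconds) of one job's workers at a single barrier:
# most workers arrive early and sit waiting for the straggler.
wait_times = np.array([1.8, 1.7, 1.9, 1.6, 0.1])   # the 0.1 s entry is the straggler itself

avg_in_barrier = wait_times.mean()    # metric (a): average wait in one barrier
spread_in_barrier = wait_times.std()  # metric (b): spread of waits in one barrier
print(f"average = {avg_in_barrier:.2f} s, spread = {spread_in_barrier:.2f} s")
```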
With more efficient barrier waiting, TensorLights also improves machine utilization.
[Figure: CPU, network inbound, and network outbound utilization under FIFO, TensorLights-RoundRobin, and TensorLights-One (higher is better); the presented utilization gains over FIFO are 4%, 13%, 20%, 20%, 21%, and 20%. *Presented numbers are for TensorLights-One; TensorLights-RoundRobin has similar results.]
Collocated PSes introduce more network traffic contention.
TensorLights uses end-host traffic scheduling to manage traffic contention.
TensorLights can mitigate stragglers, accelerate DL jobs, and increase resource utilization.
Open Source Code & Benchmark https://github.com/TensorLights