TensorLights: End-Host Traffic Scheduling for Distributed Deep Learning



SLIDE 1

TensorLights: End-Host Traffic Scheduling for Distributed Deep Learning

Xin Sunny Huang, Ang Chen, T. S. Eugene Ng

Rice University

SLIDE 2

This Work

The Parameter Server (PS) architecture is the most popular approach for distributed Deep Learning. Disadvantage: traffic contention at the PS introduces harmful stragglers. TensorLights mitigates these stragglers, improving both application performance and machine utilization.

SLIDES 3-4

The Rise of Deep Learning (DL)

Classic AI problems: language processing, image recognition. Also used for: power scheduling [1], system security [2], network routing [3], database indexes [4].

10.5× increase of DL training jobs in Microsoft [5].

[1] DeepMind AI reduces Google data centre cooling bill by 40%. (2016)
[2] Abadi, M. et al. Learning to protect communications with adversarial neural cryptography. (arXiv 2016)
[3] Valadarsky, A. et al. Learning to route. (HotNets 2017)
[4] Kraska, T. et al. The case for learned index structures. (SIGMOD 2018)
[5] Gu, J. et al. Tiresias: A GPU cluster manager for distributed deep learning. (NSDI 2019)

SLIDE 5

Distributed Deep Learning (DL) with Parameter Server (PS)

Diagram: a Parameter Server (PS) exchanges traffic with worker 1 and worker 2 at each step (step=1, step=2): workers push gradient updates, the PS sends back model updates, and a barrier synchronizes every step. Steps per job: 1,000s to 1,000,000s [1].

[1] Szegedy, C. et al. Going deeper with convolutions. (CVPR 2015)
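
To make the barrier semantics concrete, below is a minimal sketch (my illustration, not the authors' code) of one synchronous training step: the PS waits for a gradient from every worker, applies their average, and pushes the model update back before the next step can begin. `ParameterServer` and `worker_gradient` are hypothetical names.

```python
# Minimal sketch of one synchronous PS training step; illustrative only.
import random

class ParameterServer:
    def __init__(self, dim):
        self.model = [0.0] * dim

    def apply_gradients(self, grads, lr=0.01):
        # Barrier: called only once a gradient from every worker has
        # arrived; applies their average, then releases the next step.
        avg = [sum(per_dim) / len(grads) for per_dim in zip(*grads)]
        self.model = [w - lr * g for w, g in zip(self.model, avg)]
        return self.model  # the "model update" pushed back to all workers

def worker_gradient(model):
    # Stand-in for a forward/backward pass over a local mini-batch.
    return [random.gauss(0, 1) for _ in model]

ps = ParameterServer(dim=4)
for step in (1, 2):  # real jobs run 1,000s to 1,000,000s of steps
    grads = [worker_gradient(ps.model) for _ in range(2)]  # workers 1 and 2
    print(f"step={step}", ps.apply_gradients(grads))
```

Because the barrier admits the next step only after every worker holds the new model, the slowest worker's network wait sets the pace for the whole job; this is the straggler effect examined in the later slides.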

SLIDES 6-9

Supporting DL at Scale

  • Cluster scheduler (e.g. YARN [1]): manages the lifecycles of DL jobs.
  • Grid search: run many DL jobs that train the same model under different hyperparameter configurations (e.g. model initialization methods) to find the best configuration (a sketch of this fan-out follows this slide).

Diagram: as the cluster scheduler places one DL job after another, each made up of PSes (PS) and workers (w) spread across hosts, multiple PSes end up collocated on the same host.

Contention among collocated PSes! How would PS contention impact the performance of distributed DL jobs?

[1] Vavilapalli, V. K. et al. Apache Hadoop YARN: Yet another resource negotiator. (SoCC 2013)
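
As a concrete illustration of how grid search fans out into many concurrent jobs, the sketch below enumerates hyperparameter combinations and hands each one to the cluster scheduler. `submit_job` is a hypothetical placeholder, not a YARN or TensorFlow API, and the grid values are made up.

```python
# Enumerate hyperparameter combinations; each becomes one DL job.
from itertools import product

grid = {
    "init": ["xavier", "he", "normal"],  # model initialization methods
    "lr": [0.1, 0.01],
    "batch_size": [64, 128],
}

jobs = [dict(zip(grid, combo)) for combo in product(*grid.values())]
print(f"{len(jobs)} jobs, e.g.", jobs[0])  # 3 * 2 * 2 = 12 jobs

def submit_job(config):
    # Hypothetical hook into the cluster scheduler.
    print("submitting DL job with", config)

for config in jobs:
    submit_job(config)
```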

SLIDE 10

Measurement Setup

  • Workload:
  • Each TensorFlow [1] job: 1 parameter server (PS) and 20 workers, with every task on a different machine.
  • Each job trains the ResNet-32 [2] model on the CIFAR-10 [3] dataset until 30,000 global steps are reached.
  • 21 concurrent jobs in total.

[1] https://www.tensorflow.org/
[2] He, K. et al. Deep residual learning for image recognition. (CVPR 2016)
[3] Krizhevsky, A. Learning multiple layers of features from tiny images. (University of Toronto Technical Report 2009)
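
A sketch of this topology with hypothetical host names and ports: each job gets 1 PS and 20 workers, every task of a job lands on a different host, and the pattern repeats for all 21 jobs. As written it realizes one PS per host; the skewed placements on the next slide instead pin several PSes to the same host.

```python
# Hypothetical hosts/ports for the 1 PS + 20 workers x 21 jobs workload.
def cluster_spec(job_id, n_hosts=21, n_workers=20, base_port=2222):
    return {
        # PS of job j on host j: here every host runs exactly one PS.
        "ps": [f"host{job_id}:{base_port}"],
        # Workers fill the remaining 20 hosts; the per-job port offset
        # keeps collocated tasks of different jobs from clashing.
        "worker": [f"host{(job_id + i) % n_hosts}:{base_port + 1 + job_id}"
                   for i in range(1, n_workers + 1)],
    }

specs = [cluster_spec(j) for j in range(21)]  # 21 concurrent jobs
print(specs[0]["ps"], "+", len(specs[0]["worker"]), "workers")
```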

SLIDE 11

Measurement Setup (cont.)

  • Testbed: CPU cluster with 21 hosts, all connected to an Ethernet switch with a 10 Gbps link rate.
  • Task placement: each job's 21 tasks are on different hosts. A range of PS placements is measured, from skewed (all 21 PSes on one host) through intermediate (e.g. 7 PSes on each of 3 hosts) to uniform (one PS per host).
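
The placement knob can be sketched as follows: spread the 21 PSes over k hosts, from fully skewed (k=1) to fully uniform (k=21). The intermediate values of k below are illustrative and do not enumerate the slide's exact set of measured placements.

```python
# Spread n_ps parameter servers over n_hosts hosts, as evenly as possible.
def placement(n_ps=21, n_hosts=1):
    per_host = [n_ps // n_hosts] * n_hosts
    for i in range(n_ps % n_hosts):
        per_host[i] += 1  # distribute any remainder
    return per_host

for k in (1, 3, 7, 21):
    print(f"{k:2d} host(s):", placement(n_hosts=k))
#  1 host(s): [21]            <- skewed: all PSes collocated
#  3 host(s): [7, 7, 7]       <- 7 PSes per host on 3 hosts
#  7 host(s): [3, 3, 3, 3, 3, 3, 3]
# 21 host(s): [1, 1, ..., 1]  <- uniform: one PS per host
```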

SLIDE 12

Impact of PS Placements

Figure: Job Completion Time (JCT) under various PS placements, ordered from intense contention (#1) to mild contention (#8); lower is better.

Placement index: #1, #2, #3, #4, #5, #6, #7, #8
JCT (seconds): 1830, 1471, 1213, 1153, 1110, 1092, 1078, 1045

Application performance degrades due to contention at the PS: JCT under the most contended placement (#1) is about 75% higher than under the least contended one (#8).

SLIDES 13-16

Stragglers under Contention

Diagram: on one host machine, collocated PS1 and PS2 send model updates to their workers through a FIFO queue; the two updates (1 and 2) interleave over time, so part of each update drains only at the tail.

Possible stragglers detected! Workers (of PS1) receiving the tail part will delay the progress of the whole job.

  • Intra-job level: one straggling worker delays the whole job, including the other workers.
  • Inter-job level: multiple jobs are delayed simultaneously if each job has a few stragglers.

The result: application performance degradation and machine underutilization.
SLIDES 17-19

Mitigate Stragglers with Traffic Priority

Diagram: on the host machine, PS1 and PS2 send model updates to workers under FIFO versus with priority "1>2"; with the priority in place, model update 1 drains completely before model update 2 starts.

  • One priority level for each job's model update (from its PS).
  • Traffic prioritization mitigates stragglers: workers of the same job are expected to wait for similar lengths of time.
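
A toy fluid model (my own illustration, not the paper's analysis) makes the argument quantitative: two collocated PSes each push a model update of size U over a shared link of rate R. Under FIFO fair sharing both jobs' workers wait 2U/R; under strict priority "1>2", job 1's workers wait only U/R while job 2's workers still wait 2U/R, so job 1's tail disappears and no job finishes later than before.

```python
# Toy fluid model: two equal-size model updates share one outgoing link.
U, R = 100.0, 10.0  # update size (MB) and link rate (MB/s)

def fifo_finish(n_flows):
    # Fair sharing: all flows drain together and finish at the worst time.
    return [n_flows * U / R] * n_flows

def priority_finish(n_flows):
    # Strict priority: flow i waits for flows 0..i-1 to drain completely.
    return [(i + 1) * U / R for i in range(n_flows)]

print("FIFO finish times:    ", fifo_finish(2))      # [20.0, 20.0]
print("Priority finish times:", priority_finish(2))  # [10.0, 20.0]
```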

SLIDES 20-21

Reducing Stragglers with TensorLights

Diagram: compared with FIFO, TensorLights-One keeps a fixed priority order for the model updates of PS1 and PS2, while TensorLights-RoundRobin rotates the priority assignments "1>2" and "2>1" over time.

Reducing stragglers with priority while achieving fair progress among concurrent jobs!
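
The rotation idea can be sketched as a simple round-robin schedule (my illustration; the slide does not specify the rotation granularity): each interval, a different job's PS traffic gets top priority, so jobs take turns finishing first and make fair progress on average.

```python
# Rotate which job's PS traffic is sent with top priority each interval.
from collections import deque

jobs = deque([1, 2])                 # collocated jobs on this host
for interval in range(4):
    order = list(jobs)               # order[0] outranks order[1], and so on
    print(f"interval {interval}: priority {' > '.join(map(str, order))}")
    jobs.rotate(-1)                  # next interval, the next job goes first
# interval 0: priority 1 > 2
# interval 1: priority 2 > 1
# interval 2: priority 1 > 2
# interval 3: priority 2 > 1
```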

SLIDE 22

Scheduling Model with TensorLights

Diagram: the same pair of collocated PSes (PS1, PS2) scheduled under three policies: FIFO, TensorLights-One, and TensorLights-RoundRobin.

SLIDE 23

TensorLights

Compared with other communication acceleration strategies:

  • Scheduling overhead: ✓ local, lightweight (vs. ✗ global coordination).
  • Resource scheduling: ✓ work-conserving (vs. ✗ inaccurate rate estimation leading to bandwidth under-utilization).
  • Deployment: ✓ no change to the app, cluster scheduler, or hardware (vs. ✗ modifying the DL stack at various levels).

SLIDE 24

Evaluation

  • Workload, testbed, and task placement: same as the previous measurement study.
  • TensorLights implementation: hierarchical token bucket (htb) in the traffic control (tc) module under Linux, deployed at each local host that has concurrent PSes (a sketch of such a setup follows this list).
  • Results:
  • Improvement in job completion time
  • Improvement in barrier waiting efficiency
  • Improvement in machine utilization
  • Sensitivity to traffic contention intensity
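
For flavor, here is a hedged sketch of that deployment: Linux `tc` builds an htb hierarchy on the host with collocated PSes, one class per PS, with strict `prio` ordering among them. The device name, rates, ports, and class ids are assumptions for illustration; this must run as root and should be checked against tc(8) before use.

```python
# Hedged sketch: per-PS htb priority classes via Linux tc (run as root).
# DEV, rates, ports and class ids are illustrative assumptions.
import subprocess

DEV = "eth0"       # assumed NIC name
LINK = "10gbit"    # testbed link rate

def tc(*args):
    subprocess.run(["tc", *args], check=True)

# Root htb qdisc; unclassified traffic falls into default class 1:99.
tc("qdisc", "add", "dev", DEV, "root", "handle", "1:", "htb", "default", "99")
tc("class", "add", "dev", DEV, "parent", "1:", "classid", "1:1",
   "htb", "rate", LINK)
tc("class", "add", "dev", DEV, "parent", "1:1", "classid", "1:99",
   "htb", "rate", "1gbit", "ceil", LINK, "prio", "7")

# One class per collocated PS: a lower prio value is served first, and every
# class may borrow up to the link ceiling, so the link stays work-conserving.
for i, ps_port in enumerate([2222, 2223], start=1):
    tc("class", "add", "dev", DEV, "parent", "1:1", "classid", f"1:{10 + i}",
       "htb", "rate", "1gbit", "ceil", LINK, "prio", str(i))
    tc("filter", "add", "dev", DEV, "protocol", "ip", "parent", "1:",
       "prio", "1", "u32", "match", "ip", "sport", str(ps_port), "0xffff",
       "flowid", f"1:{10 + i}")
```

Borrowing up to the link ceiling is what keeps the scheme work-conserving, the property the comparison on SLIDE 23 highlights.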

SLIDE 25

Improvement in Job Completion Time

Figure: Normalized JCT (relative to FIFO = 1.0) under various PS placements, from intense contention (#1) to mild contention (#4); lower is better.

TensorLights-One: #1: 0.73, #2: 0.81, #3: 0.98, #4: 1.01
TensorLights-RoundRobin: #1: 0.84, #2: 0.87, #3: 1.01, #4: 1.00

TensorLights is more effective in the high-contention cases and improves the average JCT by up to 27%.

SLIDE 26

Reduction in Synchronization Overhead

Figure: Distribution (CDF) of barrier wait time (seconds, log scale) under FIFO, TensorLights-RoundRobin, and TensorLights-One: (a) average within one barrier, (b) standard deviation within one barrier. Smaller is better.

  • Metric: average (or standard deviation) of the elapsed waiting time for the same barrier among workers of the same job.
  • Comparable averages under all policies.
  • TensorLights-One reduces the deviation by 26% on average (TensorLights-RoundRobin: 15%).
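
Writing the slide's metric out explicitly (a reconstruction; the slide gives only the words): for job $j$ at barrier $b$, let $t_{j,b,w}$ be the wait time of worker $w$ among the job's $W$ workers. Panels (a) and (b) then plot the distributions of

```latex
\mu_{j,b} = \frac{1}{W}\sum_{w=1}^{W} t_{j,b,w},
\qquad
\sigma_{j,b} = \sqrt{\frac{1}{W}\sum_{w=1}^{W}\left(t_{j,b,w}-\mu_{j,b}\right)^{2}},
```

assuming the population form of the deviation. A smaller $\sigma_{j,b}$ means the job's workers stop waiting at nearly the same moment, i.e. fewer stragglers.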

SLIDE 27

Improvement in Utilization

Figure: CPU, network inbound, and network outbound utilization under FIFO, TensorLights-RoundRobin, and TensorLights-One; higher is better. Annotated gains range from 4% to 21% over FIFO across the three resources (presented numbers are for TensorLights-One; TensorLights-RoundRobin has similar results).

With more efficient barrier waiting, TensorLights also improves machine utilization.

SLIDE 28

Conclusions

  • Trends to scale up DL applications continue to introduce more network traffic contention.
  • Job-level traffic prioritization helps manage traffic contention.
  • TensorLights leverages traffic prioritization to mitigate stragglers, accelerate DL jobs, and increase resource utilization.

Open Source Code & Benchmark: https://github.com/TensorLights

Thank You!