TensorLights: End-Host Traffic Scheduling for Distributed Deep Learning
Xin Sunny Huang, Ang Chen, T. S. Eugene Ng
Rice University

This Work
The Parameter Server (PS) architecture is the most popular approach for distributed Deep Learning. Disadvantage: traffic contention at the PS introduces harmful stragglers. TensorLights mitigates stragglers, improving application performance and machine utilization.
Classic AI problems: Language Processing, Image Recognition. Also used for: System Security [2], Power Scheduling [1], Network Routing [3], Database Index [4].
[1] DeepMind AI reduces Google data centre cooling bill by 40%. (2016)
[2] Abadi, M. et al. Learning to protect communications with adversarial neural cryptography. (arXiv 2016)
[3] Valadarsky, A. et al. Learning to route. (HotNets 2017)
[4] Kraska, T. et al. The case for learned index structures. (SIGMOD 2018)
[5] Gu, J. et al. Tiresias: A GPU Cluster Manager for Distributed Deep Learning. (NSDI 2019)
Parameter Server (PS)
In each step, workers send gradient updates to the PS and the PS sends model updates back; a barrier synchronizes the workers between steps. Steps per job: 1,000s to 1,000,000s [1].
[1] Szegedy, C. et al. Going Deeper with Convolutions. (CVPR 2015)
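To make the PS workflow concrete, the following is a minimal sketch of one synchronous training loop under the PS architecture. It is an illustration only: the ParameterServer and Worker classes and their methods are hypothetical stand-ins, not the authors' code or a real framework API.

```python
import numpy as np

class ParameterServer:
    """Holds the model and applies the gradient updates pushed by workers."""
    def __init__(self, model_size, lr=0.01):
        self.model = np.zeros(model_size)
        self.lr = lr

    def apply_gradients(self, gradients):
        # Aggregate the workers' gradient updates and produce the new model,
        # i.e. the "model update" the PS sends back to every worker.
        self.model -= self.lr * np.mean(gradients, axis=0)
        return self.model

class Worker:
    """Computes a gradient update on its local data shard."""
    def __init__(self, model_size):
        self.model = np.zeros(model_size)

    def compute_gradient(self):
        return np.random.randn(*self.model.shape)  # placeholder for forward/backward

ps = ParameterServer(model_size=(1000,))
workers = [Worker(model_size=(1000,)) for _ in range(4)]

for step in range(3):  # real jobs run 1,000s to 1,000,000s of steps
    grads = [w.compute_gradient() for w in workers]  # workers push gradient updates
    new_model = ps.apply_gradients(grads)            # PS aggregates them
    for w in workers:
        w.model = new_model.copy()  # PS sends the model update back; the barrier:
                                    # no worker starts the next step before all have it
```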
Cluster scheduler to manage the lifecycles of DL jobs.
Grid Search: run many DL jobs that train the same model with different hyperparameter configurations (e.g., model initialization methods) to find the best configuration.
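As an illustration of this grid-search pattern, the sketch below launches one DL job per hyperparameter configuration. It is only a hedged example: submit_job, train.py, and the hyperparameter names are assumptions, not part of the paper or of any particular cluster scheduler's API.

```python
import itertools
import subprocess

# Hypothetical hyperparameter grid: every combination becomes one DL job,
# all training the same model architecture.
initializers = ["xavier", "he", "random_normal"]
learning_rates = [0.1, 0.01, 0.001]

def submit_job(init, lr):
    """Hypothetical helper: launch one PS + 20 workers for a single configuration."""
    cmd = [
        "python", "train.py",            # assumed training entry point
        "--initializer", init,
        "--learning-rate", str(lr),
        "--num-ps", "1",
        "--num-workers", "20",
    ]
    return subprocess.Popen(cmd)

# One job per configuration; the best configuration is picked after all jobs finish.
jobs = [submit_job(i, lr) for i, lr in itertools.product(initializers, learning_rates)]
for job in jobs:
    job.wait()
```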
[Figure: a cluster scheduler (e.g., YARN [1]) places each DL job's worker (w) and parameter server (PS) tasks across the hosts of a shared cluster; as more jobs arrive, PSes from different jobs become collocated on the same host.]
Contention among collocated PSes! How would PS contention impact the performance of distributed DL jobs?
[1] Vavilapalli, V. K. et al. Apache Hadoop YARN: Yet another resource negotiator. (ACM SoCC 2013)
Workload: 21 concurrent DL jobs in TensorFlow [1]; each job has 1 PS and 20 workers and trains a ResNet model [2] on the CIFAR dataset [3] until 30,000 global steps are reached.
Testbed: CPU cluster with 21 hosts, all connected to an Ethernet switch with 10 Gbps link rate.
Task placement: each job's 21 tasks are each placed on a different host. PS placement varies from skewed (e.g., all 21 PSes on one host, or 7 PSes per host) to uniform (one PS per host).
[1] https://www.tensorflow.org/
[2] He, K. et al. Deep residual learning for image recognition. (IEEE CVPR 2016)
[3] Krizhevsky, A. Learning multiple layers of features from tiny images. (University of Toronto Technical Report 2009)
Application performance degrades due to contention at the PS.
Job Completion Time (JCT) under various PS placements (lower is better). Placements are indexed from intense (#1) to mild (#8) traffic contention among PSes:

Placement index  #1    #2    #3    #4    #5    #6    #7    #8
JCT (seconds)    1830  1471  1213  1153  1110  1092  1078  1045

The most contended placement (#1) takes about 75% longer than the least contended one (#8).
[Figure: on a host machine, collocated PS1 and PS2 send their model updates to their workers through a shared FIFO transmit queue, so the two jobs' transmissions interleave over time.]
Possible stragglers detected! Workers (of PS1) receiving the tail part will delay the progress of the whole job.
Intra-job level: one straggling worker delays the whole job, including the other workers.
Inter-job level: multiple jobs are delayed simultaneously if each job has a few stragglers.
The result: application performance degradation and machine underutilization.
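A toy worked example of this effect, with assumed numbers (a 10 Gbps NIC shared by the two PSes, 1 GB model update per worker, 2 workers per job) rather than measurements from the paper:

```python
# Toy model of FIFO transmission from two collocated PSes sharing one NIC.
LINK_GBPS = 10
UPDATE_GB = 1.0
SEND_TIME = UPDATE_GB * 8 / LINK_GBPS          # 0.8 s to push one per-worker update

# Order in which the FIFO transmit queue drains the queued sends: (job, worker).
fifo_queue = [("job1", "w1"), ("job2", "w1"), ("job1", "w2"), ("job2", "w2")]

finish, clock = {}, 0.0
for job, worker in fifo_queue:
    clock += SEND_TIME                         # sends are serialized on the shared link
    finish[(job, worker)] = clock

for job in ("job1", "job2"):
    times = sorted(t for (j, _), t in finish.items() if j == job)
    # The barrier means the whole job advances only once its LAST worker has the update.
    print(f"{job}: workers get the update at {times} s -> job advances at {max(times)} s")
```

In this toy run, job 1's workers get the update at 0.8 s and 2.4 s, so the barrier stalls the whole job until 2.4 s (intra-job straggler); job 2 is similarly stalled until 3.2 s, so both jobs are delayed at once (inter-job).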
[Figure: with priority "1>2", the host sends all of PS1's model update before PS2's, instead of interleaving the two FIFO over time.]
One priority for one job's model update (from the PS).
Traffic prioritization mitigates stragglers: workers of the same job are expected to wait for similar lengths of time.
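One possible way to attach such a per-job priority to a PS's outgoing traffic at the end host is to tag its connections with a socket priority that a priority-aware queueing discipline can honor. This is only an illustrative sketch under that assumption; the paper's actual implementation uses htb in the Linux tc module (described later), and the port and priority values here are arbitrary.

```python
import socket

# Illustrative only: tag each PS -> worker connection with its job's priority so a
# priority-aware qdisc on the host can drain one job's model update before another's.
# SO_PRIORITY is Linux-specific; its value is 12 if the constant is not exposed.
SO_PRIORITY = getattr(socket, "SO_PRIORITY", 12)

def set_job_priority(conn: socket.socket, job_priority: int) -> None:
    """Mark one PS -> worker connection with the priority assigned to its job."""
    conn.setsockopt(socket.SOL_SOCKET, SO_PRIORITY, job_priority)

# Hypothetical usage inside a PS process, after accepting a worker's connection:
#   conn, _ = ps_listening_socket.accept()
#   set_job_priority(conn, job_priority=6 if job_id == 1 else 1)   # "1>2"
```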
TensorLights-One: keep a single fixed priority assignment (e.g., "1>2") for the collocated PSes.
TensorLights-RoundRobin: rotate the priority assignments "1>2" and "2>1".
[Figure: compared to FIFO, both TensorLights policies send each job's model update as a contiguous burst instead of interleaving the two; RoundRobin alternates which job goes first.]
Reducing stragglers with priority while achieving fair progress among concurrent jobs!
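A minimal sketch of the rotation idea, assuming one rotation per scheduling round over the jobs collocated on a host (the round granularity and priority values are assumptions, not taken from the paper):

```python
from itertools import cycle

# Rotate which collocated job's PS traffic gets the higher priority, one rotation
# per scheduling round, so every job is favored equally often over time.
orderings = cycle([("job1", "job2"), ("job2", "job1")])   # "1>2", then "2>1", then ...

def priorities_for(ordering):
    """Earlier in the ordering = higher priority value."""
    return {job: len(ordering) - rank for rank, job in enumerate(ordering)}

for round_id in range(4):
    print(f"round {round_id}: {priorities_for(next(orderings))}")
# round 0: {'job1': 2, 'job2': 1}
# round 1: {'job2': 2, 'job1': 1}
# ...
```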
TensorLights vs. other communication acceleration strategies:
Scheduling overhead: TensorLights is local and light-weight (✓); others require global coordination (✗).
Resource scheduling: TensorLights is work conserving (✓); for others, an inaccurate rate leads to bandwidth under-utilization (✗).
Deployment: TensorLights needs no change to the application, cluster scheduler, ... (✓); others require modifying the DL framework at various levels (✗).
Workload, testbed, and task placement: same as the previous measurement study.
TensorLights implementation: hierarchical token bucket (htb) in the traffic control (tc) module under Linux.
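As a hedged illustration of what such an htb-based setup could look like (not the authors' exact configuration: the device name, rates, ports, and class layout below are all assumptions), one could install an htb hierarchy and steer each job's PS traffic into a class with its own priority:

```python
import subprocess

DEV = "eth0"   # assumed NIC name

def tc(*args):
    """Run one tc command (requires root)."""
    subprocess.run(["tc", *args], check=True)

# Root htb qdisc with a single 10 Gbps parent class.
tc("qdisc", "add", "dev", DEV, "root", "handle", "1:", "htb", "default", "30")
tc("class", "add", "dev", DEV, "parent", "1:", "classid", "1:1", "htb", "rate", "10gbit")

# One child class per collocated job's PS traffic; 'prio' decides who borrows
# spare bandwidth first, realizing the "1>2" assignment.
tc("class", "add", "dev", DEV, "parent", "1:1", "classid", "1:10",
   "htb", "rate", "1gbit", "ceil", "10gbit", "prio", "0")     # job 1 (higher priority)
tc("class", "add", "dev", DEV, "parent", "1:1", "classid", "1:20",
   "htb", "rate", "1gbit", "ceil", "10gbit", "prio", "1")     # job 2 (lower priority)
tc("class", "add", "dev", DEV, "parent", "1:1", "classid", "1:30",
   "htb", "rate", "1gbit", "ceil", "10gbit", "prio", "2")     # everything else

# Steer each PS's model-update flows into its job's class by source port
# (assumed ports 2222 and 2223 for the two PSes).
tc("filter", "add", "dev", DEV, "protocol", "ip", "parent", "1:", "prio", "1",
   "u32", "match", "ip", "sport", "2222", "0xffff", "flowid", "1:10")
tc("filter", "add", "dev", DEV, "protocol", "ip", "parent", "1:", "prio", "1",
   "u32", "match", "ip", "sport", "2223", "0xffff", "flowid", "1:20")
```

Because htb is work conserving, the lower-priority job still uses the link whenever the higher-priority job has nothing to send, which matches the "work conserving" property noted in the comparison above.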
Results:
Normalized Job Completion Time (JCT) under various PS placements (lower is better, normalized to FIFO = 1.00). Placements are indexed from intense (#1) to mild (#4) traffic contention among PSes:

Placement index          #1    #2    #3    #4
TensorLights-One         0.73  0.84  0.98  1.01
TensorLights-RoundRobin  0.81  0.87  1.01  1.00

TensorLights is more effective in high-contention cases; it improves the average JCT by up to 27%.
Metrics: the average (and standard variance) of elapsed waiting time at the same barrier among workers of the same job.
[Figure: Distribution of Barrier Wait Time; CDFs of (a) the average and (b) the standard variance of barrier wait time per barrier (seconds, log scale) under FIFO, TensorLights-RoundRobin, and TensorLights-One; smaller is better.]
Comparable averages under all policies.
TensorLights-One reduces the variance by 26% on average (TensorLights-RoundRobin: 15%).
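For concreteness, this is how that per-barrier metric could be computed from per-worker wait times; the numbers are made up, and the slide's "standard variance" is read here as the standard deviation (an assumption):

```python
import numpy as np

# Made-up wait times (seconds) of one job's workers at a single barrier:
# most workers arrive early and sit waiting for the straggler.
wait_times = np.array([1.8, 1.7, 1.9, 1.6, 0.1])   # the 0.1 s entry is the straggler itself

avg_in_barrier = wait_times.mean()    # metric (a): average wait in one barrier
spread_in_barrier = wait_times.std()  # metric (b): spread of waits in one barrier
print(f"average = {avg_in_barrier:.2f} s, spread = {spread_in_barrier:.2f} s")
```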
With more efficient barrier waiting, TensorLights also improves machine utilization.
[Figure: CPU, network inbound, and network outbound utilization under FIFO, TensorLights-RoundRobin, and TensorLights-One (higher is better); the presented utilization gains over FIFO are 4%, 13%, 20%, 20%, 21%, and 20%. *Presented numbers are for TensorLights-One; TensorLights-RoundRobin has similar results.]
Collocated PSes introduce more network traffic contention.
TensorLights uses end-host traffic scheduling to manage traffic contention.
TensorLights can mitigate stragglers, accelerate DL jobs, and increase resource utilization.
Open Source Code & Benchmark https://github.com/TensorLights