SLIDE 1

Parameter Hub

A Rack-Scale Parameter Server for Efficient Cloud-based Distributed Deep Neural Network Training

Liang Luo, Jacob Nelson, Luis Ceze, Amar Phanishayee, and Arvind Krishnamurthy

SLIDE 2
  • DNN training is computationally expensive.
  • Training must be done in a distributed fashion.
  • People use the cloud for DDNN training.

Major cloud providers all have an ecosystem for cloud-based DDNN training.

SLIDE 3

Distributed Training

INDEPENDENT FORWARD/BACKWARD PASSES + COORDINATED PARAMETER EXCHANGE

[Figure: timeline of synchronous training with a parameter server and two workers. Each worker runs a (F)orward pass and a (B)ackward pass and sends gradients to the parameter server, which performs (A)ggregation and (O)ptimization before the workers begin the next iteration.]
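To make the exchange concrete, here is a minimal sketch of one synchronous step with a central parameter server. The function names, the placeholder gradient, and the SGD learning rate are illustrative assumptions, not details from the talk.

```python
import numpy as np

def compute_gradient(params, batch):
    """(F)orward + (B)ackward pass on one worker; a placeholder
    gradient stands in for real model computation."""
    rng = np.random.default_rng(0)
    return rng.standard_normal(params.shape)

def training_step(params, batches, lr=0.01):
    # Workers independently run F and B on their own batches.
    grads = [compute_gradient(params, b) for b in batches]
    # Parameter server: (A)ggregate across workers, then (O)ptimize
    # (plain SGD here) and send the updated parameters back.
    return params - lr * np.mean(grads, axis=0)

params = np.zeros(1000)
params = training_step(params, batches=[None, None])  # two workers
```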


SLIDE 5

Cloud-based Distributed Training Today

IN THE CONTEXT OF THE CLOUD

[Figure: a datacenter network: machines with GPUs and ordinary machines under top-of-rack (ToR) switches, connected through the network core.]

SLIDE 6

Cloud-based Distributed Training Today

FORWARD AND BACKWARD PASSES IN WORKER

[Figure: workers and parameter servers (PS) placed in different racks, communicating through ToR switches and the network core.]

SLIDE 7

Cloud-based Distributed Training Today

AGGREGATION AND OPTIMIZATION IN PS


SLIDE 8

DDNN training is communication bound

[Chart: per-batch time (0–2 s) for ResNet 269 on successive GPU generations: GRID 520 (2012), K80 (2014), M60 (2015), V100 (2017); each bar split into "GPU and network active" vs. "GPU idle, waiting on network".]

  • The problem gets worse over time: the bottleneck shifts from computation to communication.
  • With modern GPUs, most of the time is spent on communication.
  • Making GPUs faster will do little to increase throughput.
  • Compute resources are wasted while GPUs wait on the network.


SLIDE 9

DDNN training is communication bound

[Chart: the same time breakdown for Inception V3, AlexNet, GoogleNet, and ResNet 269.]

SLIDE 10

Bottlenecks in Cloud-based DDNN training


MAPPING OF TRAINING WORKLOAD TO THE CLOUD IS INEFFICIENT.


SLIDE 11

Bottlenecks in Cloud-based DDNN training


FRAMEWORK BOTTLENECKS

[Diagram: inside a worker, gradients flow from the GPU through the training framework to the network.]

SLIDE 12

Bottlenecks in Cloud-based DDNN training

FRAMEWORK BOTTLENECKS

[Chart: per-batch time (0–1.6 s) for AlexNet, GoogleNet, Inception, and ResNet 269, broken down into compute, data copy and communication, aggregator, optimizer, and synchronization and other overheads.]


SLIDE 14

Bottlenecks in Cloud-based DDNN training

BANDWIDTH BOTTLENECK

SLIDE 15

Bottlenecks in Cloud-based DDNN training

INSUFFICIENT BANDWIDTH

What is the minimum bandwidth each popular NN requires so that communication does not bottleneck computation? (8 workers, GTX 1080 Ti, central parameter servers, MXNet)

  • GoogleNet / Inception: 40 Gbps
  • ResNet: 100 Gbps
  • AlexNet: 1200 Gbps

Cloud bandwidth today: 10–25 Gbps.
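These requirements fall out of a simple ratio: per step, each worker pushes a full gradient and pulls updated parameters, and that transfer must finish within one batch of GPU compute. A back-of-the-envelope sketch; the 250 MB model size and 0.1 s compute time are assumed for illustration, not values from the slide.

```python
def min_bandwidth_gbps(model_size_mb, compute_time_s):
    """Per-worker bandwidth needed so that pushing gradients and
    pulling parameters fully overlaps one batch of computation."""
    bits_per_step = model_size_mb * 8e6 * 2  # gradient out + params back
    return bits_per_step / compute_time_s / 1e9

# Assumed example: a 250 MB model with 0.1 s/batch of compute
# needs ~40 Gbps -- already above typical cloud bandwidth.
print(f"{min_bandwidth_gbps(250, 0.1):.0f} Gbps")
```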


SLIDE 17

Bottlenecks in Cloud-based DDNN training

DEPLOYMENT-RELATED OVERHEAD


SLIDE 18

Bottlenecks in Cloud-based DDNN training

DEPLOYMENT-RELATED OVERHEAD

  • Transient congestion, or oversubscription by design.
  • Cross-rack communication cost is higher than intra-rack communication.
  • Communication is bottlenecked by the slowest link.

Measured pairwise bandwidth between hosts (Gbps; intra-rack links run at ~9 Gbps, cross-rack at ~4.7 Gbps). A clustering sketch follows the table.

Host    1    2    3    4    5    6    7
  2    4.7
  3    8.9  4.7
  4    8.9  4.7  8.9
  5    8.9  4.7  8.9  8.9
  6    4.7  9.0  4.7  4.7  4.7
  7    8.9  4.7  9.0  8.9  9.0  4.7
  8    4.7  9.0  4.7  4.7  4.7  9.0  4.7

Cluster 1: hosts 1, 3, 4, 5, 7. Cluster 2: hosts 2, 6, 8.
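The two clusters can be recovered mechanically from the matrix: treat host pairs with near-line-rate bandwidth as likely rack-mates and take connected components. A minimal sketch (the 8 Gbps threshold is an assumption):

```python
# Pairwise bandwidth (Gbps) between hosts, from the matrix above.
bw = {
    (1, 2): 4.7, (1, 3): 8.9, (1, 4): 8.9, (1, 5): 8.9, (1, 6): 4.7, (1, 7): 8.9, (1, 8): 4.7,
    (2, 3): 4.7, (2, 4): 4.7, (2, 5): 4.7, (2, 6): 9.0, (2, 7): 4.7, (2, 8): 9.0,
    (3, 4): 8.9, (3, 5): 8.9, (3, 6): 4.7, (3, 7): 9.0, (3, 8): 4.7,
    (4, 5): 8.9, (4, 6): 4.7, (4, 7): 8.9, (4, 8): 4.7,
    (5, 6): 4.7, (5, 7): 9.0, (5, 8): 4.7,
    (6, 7): 4.7, (6, 8): 9.0,
    (7, 8): 4.7,
}

def clusters(hosts, bw, threshold=8.0):
    """Group hosts whose pairwise bandwidth exceeds the threshold
    (likely the same rack) via union-find connected components."""
    parent = {h: h for h in hosts}
    def find(h):
        while parent[h] != h:
            parent[h] = parent[parent[h]]  # path halving
            h = parent[h]
        return h
    for (a, b), gbps in bw.items():
        if gbps >= threshold:
            parent[find(a)] = find(b)
    groups = {}
    for h in hosts:
        groups.setdefault(find(h), []).append(h)
    return list(groups.values())

print(clusters(range(1, 9), bw))  # -> [[1, 3, 4, 5, 7], [2, 6, 8]]
```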

SLIDE 19

Parameter Hub Optimizations

CODESIGNING SOFTWARE, HARDWARE AND CLUSTER CONFIGURATION FOR EFFICIENT CLOUD-BASED DDNN TRAINING


SLIDE 20

Eliminating framework bottlenecks:

[Diagram: the training pipeline GPU → data copy → aggregation → optimization → network.]

PHub optimizations: streamlining the DDNN training pipeline.


SLIDE 22

Software Optimizations

[Diagram: gradients arrive at the parameter server over the network and are processed by the CPU and memory.]

SLIDE 23

Software Optimizations

GRADIENT AGGREGATION AND OPTIMIZATION

Three candidate designs for aggregating and optimizing gradients on the server:

  • Wide aggregation (used in MXNet): for each input queue, launch a series of threads for aggregation; each core reads the input queue from different workers and writes to different locations in the output queue. Requires synchronization.
  • Tall aggregation: each core sequentially aggregates the same portion of the gradients within each queue. Great locality; no synchronization.
  • NUMA-aware tree reduction: organize processors into a hierarchy and reduce across NUMA domains. Too much coherence and synchronization.

SLIDE 24

Software Optimizations

TALL AGGREGATION AND OPTIMIZATION

  • Chunk a gradient into a series of virtual gradients deterministically.
  • A virtual gradient is mapped to a particular core on the server.
  • Virtual gradients are transferred independently.
  • A chunk is only processed by a single core, maintaining maximum locality (see the sketch below).

[Figure: the gradient array for key 0 from 8 workers, its chunk-to-core mappings, and the aggregated result.]
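A minimal sketch of the deterministic chunking and chunk-to-core mapping; the chunk size and the modulo placement policy are assumptions for illustration, not PHub's actual policy.

```python
import numpy as np

CHUNK = 4096       # elements per virtual gradient (assumed)
NUM_CORES = 10

def owner_core(key, chunk_idx):
    """Deterministic mapping: every worker's copy of the same
    virtual gradient lands on the same server core."""
    return (key * 31 + chunk_idx) % NUM_CORES

def virtual_gradients(key, grad):
    """Split one gradient into independently transferable chunks."""
    for i in range(0, len(grad), CHUNK):
        yield owner_core(key, i // CHUNK), grad[i:i + CHUNK]

def aggregate(worker_copies):
    """Runs entirely on the owning core: sum one chunk across all
    workers, touching no other core's data (maximum locality)."""
    return np.sum(worker_copies, axis=0)

# Example: key 0 from 8 workers, as in the figure.
chunks = [list(virtual_gradients(0, np.ones(10000))) for _ in range(8)]
first = aggregate([worker[0][1] for worker in chunks])  # chunk 0, all workers
```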

SLIDE 25

Software Optimizations

TALL AGGREGATION AND OPTIMIZATION

When aggregation of a chunk is done:

  • PHub optimizes the chunk on the same core that aggregated it.

[Figure: the gradient array for key 0 from 8 workers, aggregated.]

SLIDE 26

Software Optimizations

TALL AGGREGATION AND OPTIMIZATION

When aggregation of a chunk is done:

  • PHub optimizes the chunk on the same core that aggregated it.
  • This allows overlapping of aggregation, optimization, and gradient transmission, as sketched below.

[Figure: the gradient array for key 0 from 8 workers, aggregated and optimized.]
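Since each chunk is independent, one chunk can be aggregated, optimized, and shipped back while later chunks are still arriving. A sketch of that per-chunk pipeline on a single core; the queue-based structure and plain-SGD update are assumptions for illustration.

```python
import queue
import threading
import numpy as np

NUM_WORKERS = 8
inbox, outbox = queue.Queue(), queue.Queue()
params = {0: np.zeros(4), 1: np.zeros(4)}  # two chunks of one key

def core_loop():
    """One server core: the moment all workers' copies of a chunk
    have arrived, aggregate and optimize it on this same core and
    send it back -- while other chunks are still in flight."""
    pending = {}
    for _ in range(NUM_WORKERS * len(params)):
        chunk_id, grad = inbox.get()
        pending.setdefault(chunk_id, []).append(grad)
        if len(pending[chunk_id]) == NUM_WORKERS:
            agg = np.mean(pending.pop(chunk_id), axis=0)  # aggregation
            params[chunk_id] -= 0.01 * agg                # optimization
            outbox.put((chunk_id, params[chunk_id]))      # transmission

t = threading.Thread(target=core_loop)
t.start()
for _ in range(NUM_WORKERS):              # workers stream their chunks
    for chunk_id in params:
        inbox.put((chunk_id, np.ones(4)))
t.join()
```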

SLIDE 27


Software Optimizations

NOT ENOUGH ON THEIR OWN! The typical server configuration is unbalanced.

[Diagram: bandwidth available at successive levels of a typical server: 14000, 1100, 800, and 10 Gbps.]

SLIDE 28

Eliminating bandwidth bottlenecks. PBox hardware: balanced computation and communication resources.


SLIDE 30


Hardware Optimization

THE PBOX

  • Balanced computation and communication.
  • Extends the balance and locality notion across NUMA domains and NICs.

[Diagram: PBox bandwidth ladder: 14000, 1100, 800, and 10 Gbps.]

SLIDE 31


Hardware Optimization

THE PBOX

  • Balanced computation and communication.
  • Extends the balance and locality notion across NUMA domains and NICs.

[Diagram: PBox bandwidth ladder: 14000, 1100, 800, and 20 Gbps.]

SLIDE 32


Hardware Optimization

THE PBOX

  • Balanced computation and communication.
  • Extends the balance and locality notion across NUMA domains and NICs.

[Diagram: PBox bandwidth ladder: 14000, 1100, 800, and 800 Gbps.]

SLIDE 33


Eliminating deployment bottlenecks. PHub hierarchical reduction: reducing cross-rack traffic.


SLIDE 35

PBox Deployment

RACK-SCALE PARAMETER SERVICE

[Diagram: a cluster network of racks; each rack holds worker/PS machines under a ToR switch, with a cluster manager (CM).]

SLIDE 36

PBox Deployment

RACK-SCALE PARAMETER SERVICE

[Diagram: the same topology with a PBox added to each rack.]

SLIDE 37

Two-Phase Hierarchical Aggregation

ADAPTING TO THE DATACENTER NETWORK TOPOLOGY

[Diagram: two racks with PBoxes; aggregation happens first within each rack, then across racks.]

  1. Intra-rack central aggregation.
  2. Inter-rack aggregation.

N times traffic reduction! (A sketch of the two phases follows.)
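A minimal sketch of the two phases (the mean-of-gradients semantics is an assumption; the traffic arithmetic is the point): with N workers per rack, phase 1 collapses N gradient streams into one at the rack's PBox, so only one stream per rack ever crosses the network core.

```python
import numpy as np

def two_phase_aggregate(racks):
    """racks: one list of per-worker gradients per rack.
    Phase 1 (intra-rack): each PBox sums its own workers' gradients.
    Phase 2 (inter-rack): the per-rack partial sums are combined."""
    rack_sums = [np.sum(rack, axis=0) for rack in racks]  # at each PBox
    total = np.sum(rack_sums, axis=0)                     # across racks
    return total / sum(len(rack) for rack in racks)       # mean gradient

# Two racks of 4 workers: 8 gradient streams hit the ToRs, but only
# 2 (one per rack) cross the core -- an N-fold reduction for N = 4.
racks = [[np.full(3, w) for w in range(4)] for _ in range(2)]
print(two_phase_aggregate(racks))
```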

SLIDE 38

Up to 2.7x performance in a 10 Gbps cloud-like environment

Speedup over the baseline:

Model          Speedup
AlexNet         2.7x
VGG 11          2.3x
VGG 19          2.2x
GoogleNet       1.3x
Inception V3    1.3x
ResNet 18       2.2x
ResNet 50       1.9x
ResNet 269      2.3x
ResNeXt 269     2.3x

8 workers, GTX 1080 Ti. Baseline: MXNet with an InfiniBand-enhanced setup, vs. PBox. Batch size 64 for ResNeXt, 128 for ResNet 269, 256 for all others.

SLIDE 39

Framework Bottlenecks

  • Data Copy
  • Aggregation and Optimization
  • Synchronization

[Chart: the framework-bottleneck time breakdown (0–1.6 s per batch) for AlexNet, GoogleNet, Inception, and ResNet 269, revisited.]

SLIDE 40

Scalability

LINEAR SCALING IN COMM. ONLY BENCHMARK

[Chart: memory bandwidth (10,000–100,000 MB/s) vs. number of active workers (1–16), comparing the microbenchmark limit with PHub training.]

SLIDE 41

Scalability

PCI-E TO MEMORY SUBSYSTEM BRIDGE

[Chart: the same memory bandwidth vs. active workers plot; the limit is set by the PCIe-to-memory-subsystem bridge.]

120 machines training ResNet 50.

SLIDE 42

Scalability Beyond a Single Rack

EMULATING HIERARCHICAL AGGREGATION

Overhead of PHub cross-rack synchronization:

Racks        2      3      4      5      6      7      8
AlexNet    14.3%  14.7%  15.1%  15.5%  15.9%  16.2%  16.5%
ResNet 50   0.4%   1.6%   1.8%   1.8%   1.4%   1.5%   1.9%

SLIDE 43

Cost Analysis – for infrastructure builders

25% BETTER THROUGHPUT/$

Throughput per $1,000: PBox 82.2 vs. standard sharded PS 60.4.

Accounting for network devices (switch ports, network adapters, network cables), GPU costs, and PBox's entire machine cost. Core oversubscription 2:1.

SLIDE 44

Parameter Hub

A software, hardware, and cluster configuration codesign that targets the three major bottlenecks in the cloud for more efficient DDNN training.