Scalable Distributed Training with Parameter Hub: a whirlwind tour


  1. Scalable Distributed Training with Parameter Hub: a whirlwind tour

  2. TVM Stack. [Diagram of the stack: High-Level Differentiable IR → Tensor Expression IR → LLVM, CUDA, Metal / VTA, with AutoTVM and AutoVTA for optimization, targeting edge and cloud hardware (ASIC, FPGA, FPGA fleet).]

  3. [The same TVM stack diagram, now extended with "Active Topology Probing" and "Your Cloud".] Groundwork for bringing TVM to the distributed world for training and inference, on commercial cloud or in your own cluster.

  4. Parameter Hub: an optimized, topology-aware, and dynamic mechanism for inter-machine communication. Liang Luo, Jacob Nelson, Luis Ceze, Amar Phanishayee, and Arvind Krishnamurthy.

  5. Parameter Hub: an optimized, topology-aware, and dynamic mechanism for inter-machine communication, in the cloud-based training context. Liang Luo, Jacob Nelson, Luis Ceze, Amar Phanishayee, and Arvind Krishnamurthy.

  6. Deep learning constitutes an important workload in the cloud today; major cloud providers all have an ecosystem for cloud-based learning.

  8. Server demand for DL inference across data centers nearly quadrupled in less than 2 years. (Source: Facebook)

  10. EC2 reclaims your GPU instances as it runs out of capacity.

  12. Distributed training: independent forward/backward passes + coordinated parameter exchange. [Timeline: Worker 1 and Worker 2 each run F1 B1 F2 B2 F3 B3, while the parameter server runs A1 O1 A2 O2. (F)orward pass, (B)ackward pass, (A)ggregation, (O)ptimization.]
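
The timeline above corresponds to the usual synchronous data-parallel loop: workers compute gradients independently, and the parameter server aggregates them and applies the optimizer. The sketch below is a minimal single-process illustration of that loop; the model size, the stand-in gradient computation, and the plain SGD update are assumptions for illustration, not the MXNet/PHub implementation.

```python
import numpy as np

NUM_WORKERS = 2
MODEL_SIZE = 1_000_000        # number of parameters (assumed)
LEARNING_RATE = 0.1           # assumed optimizer setting

def forward_backward(worker_id, params):
    """(F)orward + (B)ackward pass on one worker; returns a local gradient.
    A random vector stands in for the real gradient computation."""
    rng = np.random.default_rng(worker_id)
    return rng.standard_normal(params.size).astype(np.float32)

def train_iteration(params):
    # Independent forward/backward passes on each worker.
    grads = [forward_backward(w, params) for w in range(NUM_WORKERS)]
    # Coordinated exchange: (A)ggregation then (O)ptimization on the server.
    agg = np.mean(grads, axis=0)
    return params - LEARNING_RATE * agg      # simple SGD step

params = np.zeros(MODEL_SIZE, dtype=np.float32)   # lives on the parameter server
params = train_iteration(params)
```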

  14. Distributed training today, in the context of the cloud. [Diagram: network core connecting two ToR switches; racks contain machines with GPUs and machines without.]

  15. Distributed training today: forward and backward passes run in the workers. [Topology diagram: network core, ToR switches, Worker 1, Worker 2, PS 1, PS 2.]

  16. Distributed training today: aggregation and optimization run in the parameter servers (PS). [Same topology diagram.]

  17. Distributed training is communication bound. [Chart: per-iteration time (seconds, 0 to 1.8) for ResNet-269 on GRID 520 (2012), K80 (2014), M60 (2015), and V100 (2017), split into "GPU and network active" vs. "GPU idle, waiting on network".] The problem gets worse over time: the bottleneck shifts from GPU to network. With modern GPUs, most of the time is spent on communication, so making GPUs faster will do little to increase throughput and wastes compute resources.

  18. Distributed training is communication bound. [Chart: the same breakdown for AlexNet, ResNet-269, GoogleNet, and Inception V3.]

  19. Bottlenecks in DDNN training: the mapping of the training workload to the cloud is inefficient. [Topology diagram.]

  20. Bottlenecks in DDNN training: framework bottlenecks. [Topology diagram.]

  21. Bottlenecks in DDNN training: framework bottlenecks. [Diagram: inside a worker, data flows GPU → training framework → … → network, on top of the cluster topology.]

  22. Bottlenecks in DDNN training: framework bottlenecks.

  23. Bottlenecks in DDNN training: framework bottlenecks. [Chart: per-iteration time breakdown (seconds, 0 to 1.6) for AlexNet, GoogleNet, Inception, and ResNet-269, split into compute, data copy and communication, aggregator, optimizer, and synchronization and other overheads.]

  24. Bottlenecks in DDNN training: the mapping of the training workload to the cloud is inefficient. [Topology diagram.]

  25. Bottlenecks in DDNN training: bandwidth bottleneck. [Topology diagram.]

  26-30. Bottlenecks in cloud-based DDNN training: insufficient bandwidth. What is the minimum bandwidth required for each of the popular NNs so that communication does not bottleneck computation? (8 workers, GTX 1080 Ti, central parameter servers, MXNet.) [Chart, 10 Gbps to 1300 Gbps: cloud bandwidth sits at 10-25 Gbps, while GoogleNet/Inception needs about 40 Gbps, ResNet about 100 Gbps, and AlexNet about 1200 Gbps.]
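
These requirements follow from simple arithmetic: every iteration, each worker pushes a full set of gradients and pulls a full set of parameters, and that traffic must fit inside the GPU compute time. The sketch below redoes that back-of-envelope calculation; the model sizes and per-iteration times are rough assumptions for illustration, not the measured values behind the slide's figures.

```python
def min_bandwidth_gbps(model_mb, iter_time_s, workers):
    """Aggregate bandwidth needed at the parameter servers so that pushing
    gradients and pulling parameters (~2x the model size per worker per
    iteration) finishes within one iteration's compute time."""
    bits_per_worker = model_mb * 8e6 * 2          # MB -> bits, push + pull
    return workers * bits_per_worker / iter_time_s / 1e9

# Assumed model sizes (MB) and per-iteration compute times (s) on a GTX 1080 Ti.
for name, mb, t in [("GoogleNet", 27, 0.08),
                    ("ResNet-269", 390, 0.60),
                    ("AlexNet", 240, 0.03)]:
    print(f"{name}: ~{min_bandwidth_gbps(mb, t, workers=8):.0f} Gbps")
```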

  31. Bottlenecks in cloud-based DDNN training: the mapping of the training workload to the cloud is inefficient. [Topology diagram.]

  32. Bottlenecks in cloud-based DDNN training: deployment-related overhead. [Topology diagram.]

  33. Bottlenecks in cloud-based DDNN training: deployment-related overhead. Transient congestion, or oversubscription by design; cross-rack communication cost is higher than intra-rack communication. [Heat map: pairwise bandwidth between hosts 1-8, roughly 8.9-9.0 Gbps within a cluster and 4.7 Gbps across clusters; Cluster 1: hosts 1, 3, 4, 5, 7; Cluster 2: hosts 2, 6, 8.]
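
Measurements like the heat map above are also what make topology awareness ("active topology probing") possible: if intra-rack links consistently run several Gbps faster than cross-rack links, pairwise probing can recover which hosts share a rack. The sketch below illustrates that idea; the measure_bandwidth_gbps callback, the 6 Gbps threshold, and the greedy grouping are hypothetical placeholders rather than PHub's actual probing procedure.

```python
def infer_racks(hosts, measure_bandwidth_gbps, intra_rack_threshold=6.0):
    """Greedy grouping: a host joins an existing rack if its measured
    bandwidth to that rack's first member is at least the threshold."""
    racks = []
    for h in hosts:
        for rack in racks:
            if measure_bandwidth_gbps(h, rack[0]) >= intra_rack_threshold:
                rack.append(h)
                break
        else:
            racks.append([h])
    return racks

# Toy measurement matching slide 33's clusters {1, 3, 4, 5, 7} and {2, 6, 8}.
CLUSTER = {1: "A", 3: "A", 4: "A", 5: "A", 7: "A", 2: "B", 6: "B", 8: "B"}
probe = lambda a, b: 8.9 if CLUSTER[a] == CLUSTER[b] else 4.7
print(infer_racks(sorted(CLUSTER), probe))   # -> [[1, 3, 4, 5, 7], [2, 6, 8]]
```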

  34. Parameter Hub optimizations: codesigning software and hardware with the cluster configuration for efficient cloud-based DDNN training. [Topology diagram.]

  35. Eliminating framework bottlenecks. PHub optimizations: streamlining the DDNN training pipeline of data copy, aggregation, and optimization. [Diagram: the GPU → … → network pipeline over the cluster topology.]
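
One way to read "streamlining the pipeline" is that data copy, aggregation, and optimization should operate on small pieces of a gradient rather than on whole gradients, so the stages can overlap. The sketch below shows the chunk-at-a-time structure of that idea in a single thread; the chunk size and the plain SGD step are assumptions, and a real server would additionally overlap these steps with network receives.

```python
import numpy as np

CHUNK = 4096   # elements processed per pipeline step (assumed)

def streamed_update(params, worker_grads, lr=0.1):
    """Process the gradient chunk by chunk: aggregate a slice from every
    worker, then immediately run the optimizer on that slice, instead of
    finishing a full-gradient copy before any aggregation starts."""
    n = params.size
    for i in range(0, n, CHUNK):
        sl = slice(i, min(i + CHUNK, n))
        agg = np.zeros(sl.stop - sl.start, dtype=np.float32)
        for g in worker_grads:                       # aggregation stage
            agg += g[sl]
        params[sl] -= lr * agg / len(worker_grads)   # optimizer stage
    return params

params = np.zeros(10_000, dtype=np.float32)
grads = [np.ones(10_000, dtype=np.float32) for _ in range(8)]
print(streamed_update(params, grads)[:3])            # -> [-0.1 -0.1 -0.1]
```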

  37. Software optimizations. [Diagram: gradients flow over the network into memory and the CPU on the parameter server; cluster topology (network core, ToR switches, workers, PS) in the background.]

  39. Software optimizations: gradient aggregation and optimization. Four ways to organize aggregation on the server are compared: (1) each core reads the input queue from a different worker and writes to different locations in the output queue, which requires synchronization; (2) for each input queue, launch a series of threads for aggregation, as MXNet does (wide aggregation), which incurs too much coherence and synchronization; (3) sequentially aggregate the same portion of gradients within each queue (tall aggregation), giving great locality and no synchronization; (4) organize processors into a hierarchy (NUMA 0 / NUMA 1) and perform NUMA-aware tree reduction, also with great locality and no synchronization.

  41. Software optimizations: tall aggregation and optimization. Chunk a gradient into a series of virtual gradients deterministically; each virtual gradient is mapped to a particular core on the server. [Figure: core mappings over the gradient array for key 0 from 8 workers.]

  43. Software optimizations: tall aggregation and optimization. Chunk a gradient into a series of virtual gradients deterministically; each virtual gradient is mapped to a particular core on the server, and virtual gradients are transferred independently. [Figure: core mappings over the gradient array for key 0 from 8 workers.]
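
A minimal sketch of this scheme is below, assuming a fixed chunk size, a simple (key, chunk) → core hash, and numpy in place of PHub's actual RDMA transport and core scheduling; the point is only that each core owns the same slices of every worker's gradient, so aggregation needs no cross-core synchronization.

```python
import numpy as np

NUM_CORES = 4
CHUNK = 256                      # elements per virtual gradient (assumed)

def chunk_to_core(key, chunk_idx):
    """Deterministic virtual-gradient -> core mapping (assumed hash)."""
    return (key + chunk_idx) % NUM_CORES

def tall_aggregate(key, worker_grads):
    """Each core sums its own slices across all workers, one chunk at a time."""
    length = worker_grads[0].size
    out = np.zeros(length, dtype=np.float32)
    n_chunks = (length + CHUNK - 1) // CHUNK
    for core in range(NUM_CORES):                    # conceptually parallel
        for c in range(n_chunks):
            if chunk_to_core(key, c) != core:
                continue                             # chunk owned by another core
            sl = slice(c * CHUNK, min((c + 1) * CHUNK, length))
            for g in worker_grads:                   # great locality, no locks
                out[sl] += g[sl]
    return out / len(worker_grads)

grads = [np.full(1000, w, dtype=np.float32) for w in range(8)]   # 8 workers
print(tall_aggregate(key=0, worker_grads=grads)[:3])             # -> [3.5 3.5 3.5]
```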
