SLIDE 1

Parameter Hub

A Rack-Scale Parameter Server for Efficient Cloud-based Distributed Deep Neural Network Training

Liang Luo, Jacob Nelson, Luis Ceze, Amar Phanishayee, and Arvind Krishnamurthy

SLIDE 2
  • DNN training is computationally expensive.
  • Training must be done in a distributed fashion.
  • People use the cloud for DDNN training.

Major cloud providers all have an ecosystem for cloud-based DDNN training.

SLIDE 3

Distributed Training

INDEPENDENT FORWARD/BACKWARD PASSES + COORDINATED PARAMETER EXCHANGE

[Figure: timeline of synchronous training with a parameter server and two workers. Each worker runs a (F)orward pass and a (B)ackward pass and sends gradients to the parameter server, which performs (A)ggregation and (O)ptimization before the workers begin the next iteration.]
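To make the exchange concrete, here is a minimal sketch of one synchronous step with a central parameter server. The function names, the placeholder gradient, and the SGD learning rate are illustrative assumptions, not details from the talk.

```python
import numpy as np

def compute_gradient(params, batch):
    """(F)orward + (B)ackward pass on one worker; a placeholder
    gradient stands in for real model computation."""
    rng = np.random.default_rng(0)
    return rng.standard_normal(params.shape)

def training_step(params, batches, lr=0.01):
    # Workers independently run F and B on their own batches.
    grads = [compute_gradient(params, b) for b in batches]
    # Parameter server: (A)ggregate across workers, then (O)ptimize
    # (plain SGD here) and send the updated parameters back.
    return params - lr * np.mean(grads, axis=0)

params = np.zeros(1000)
params = training_step(params, batches=[None, None])  # two workers
```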


SLIDE 5

Cloud-based Distributed Training Today

IN THE CONTEXT OF THE CLOUD

[Figure: a datacenter network: machines with GPUs and ordinary machines under top-of-rack (ToR) switches, connected through the network core.]

SLIDE 6

Cloud-based Distributed Training Today

FORWARD AND BACKWARD PASSES IN WORKER

[Figure: workers and parameter servers (PS) placed in different racks, communicating through ToR switches and the network core.]

SLIDE 7

Cloud-based Distributed Training Today

AGGREGATION AND OPTIMIZATION IN PS


SLIDE 8

DDNN training is communication bound

[Chart: per-batch time (0–2 s) for ResNet 269 on successive GPU generations: GRID 520 (2012), K80 (2014), M60 (2015), V100 (2017); each bar split into "GPU and network active" vs. "GPU idle, waiting on network".]

  • The problem gets worse over time: the bottleneck shifts from computation to communication.
  • With modern GPUs, most of the time is spent on communication.
  • Making GPUs faster will do little to increase throughput.
  • Compute resources are wasted while GPUs wait on the network.


SLIDE 9

DDNN training is communication bound

[Chart: the same time breakdown for Inception V3, AlexNet, GoogleNet, and ResNet 269.]

SLIDE 10

Bottlenecks in Cloud-based DDNN training


MAPPING OF TRAINING WORKLOAD TO THE CLOUD IS INEFFICIENT.


SLIDE 11

Bottlenecks in Cloud-based DDNN training


FRAMEWORK BOTTLENECKS

[Diagram: inside a worker, gradients flow from the GPU through the training framework to the network.]

SLIDE 12

Bottlenecks in Cloud-based DDNN training

FRAMEWORK BOTTLENECKS

[Chart: per-batch time (0–1.6 s) for AlexNet, GoogleNet, Inception, and ResNet 269, broken down into compute, data copy and communication, aggregator, optimizer, and synchronization and other overheads.]


SLIDE 14

Bottlenecks in Cloud-based DDNN training

BANDWIDTH BOTTLENECK

SLIDE 15

Bottlenecks in Cloud-based DDNN training

INSUFFICIENT BANDWIDTH

What is the minimum bandwidth each popular NN requires so that communication does not bottleneck computation? (8 workers, GTX 1080 Ti, central parameter servers, MXNet)

  • GoogleNet / Inception: 40 Gbps
  • ResNet: 100 Gbps
  • AlexNet: 1200 Gbps

Cloud bandwidth today: 10–25 Gbps.
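These requirements fall out of a simple ratio: per step, each worker pushes a full gradient and pulls updated parameters, and that transfer must finish within one batch of GPU compute. A back-of-the-envelope sketch; the 250 MB model size and 0.1 s compute time are assumed for illustration, not values from the slide.

```python
def min_bandwidth_gbps(model_size_mb, compute_time_s):
    """Per-worker bandwidth needed so that pushing gradients and
    pulling parameters fully overlaps one batch of computation."""
    bits_per_step = model_size_mb * 8e6 * 2  # gradient out + params back
    return bits_per_step / compute_time_s / 1e9

# Assumed example: a 250 MB model with 0.1 s/batch of compute
# needs ~40 Gbps -- already above typical cloud bandwidth.
print(f"{min_bandwidth_gbps(250, 0.1):.0f} Gbps")
```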


SLIDE 17

Bottlenecks in Cloud-based DDNN training

DEPLOYMENT-RELATED OVERHEAD


SLIDE 18

Bottlenecks in Cloud-based DDNN training

DEPLOYMENT-RELATED OVERHEAD

  • Transient congestion, or oversubscription by design.
  • Cross-rack communication cost is higher than intra-rack communication.
  • Communication is bottlenecked by the slowest link.

Measured pairwise bandwidth between hosts (Gbps; intra-rack links run at ~9 Gbps, cross-rack at ~4.7 Gbps). A clustering sketch follows the table.

Host    1    2    3    4    5    6    7
  2    4.7
  3    8.9  4.7
  4    8.9  4.7  8.9
  5    8.9  4.7  8.9  8.9
  6    4.7  9.0  4.7  4.7  4.7
  7    8.9  4.7  9.0  8.9  9.0  4.7
  8    4.7  9.0  4.7  4.7  4.7  9.0  4.7

Cluster 1: hosts 1, 3, 4, 5, 7. Cluster 2: hosts 2, 6, 8.
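The two clusters can be recovered mechanically from the matrix: treat host pairs with near-line-rate bandwidth as likely rack-mates and take connected components. A minimal sketch (the 8 Gbps threshold is an assumption):

```python
# Pairwise bandwidth (Gbps) between hosts, from the matrix above.
bw = {
    (1, 2): 4.7, (1, 3): 8.9, (1, 4): 8.9, (1, 5): 8.9, (1, 6): 4.7, (1, 7): 8.9, (1, 8): 4.7,
    (2, 3): 4.7, (2, 4): 4.7, (2, 5): 4.7, (2, 6): 9.0, (2, 7): 4.7, (2, 8): 9.0,
    (3, 4): 8.9, (3, 5): 8.9, (3, 6): 4.7, (3, 7): 9.0, (3, 8): 4.7,
    (4, 5): 8.9, (4, 6): 4.7, (4, 7): 8.9, (4, 8): 4.7,
    (5, 6): 4.7, (5, 7): 9.0, (5, 8): 4.7,
    (6, 7): 4.7, (6, 8): 9.0,
    (7, 8): 4.7,
}

def clusters(hosts, bw, threshold=8.0):
    """Group hosts whose pairwise bandwidth exceeds the threshold
    (likely the same rack) via union-find connected components."""
    parent = {h: h for h in hosts}
    def find(h):
        while parent[h] != h:
            parent[h] = parent[parent[h]]  # path halving
            h = parent[h]
        return h
    for (a, b), gbps in bw.items():
        if gbps >= threshold:
            parent[find(a)] = find(b)
    groups = {}
    for h in hosts:
        groups.setdefault(find(h), []).append(h)
    return list(groups.values())

print(clusters(range(1, 9), bw))  # -> [[1, 3, 4, 5, 7], [2, 6, 8]]
```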

SLIDE 19

Parameter Hub Optimizations

CODESIGNING SOFTWARE, HARDWARE AND CLUSTER CONFIGURATION FOR EFFICIENT CLOUD-BASED DDNN TRAINING


SLIDE 20

Eliminating framework bottlenecks:

[Diagram: the training pipeline GPU → data copy → aggregation → optimization → network.]

PHub optimizations: streamlining the DDNN training pipeline.


SLIDE 22

Software Optimizations

[Diagram: gradients arrive at the parameter server over the network and are processed by the CPU and memory.]

SLIDE 23

Software Optimizations

GRADIENT AGGREGATION AND OPTIMIZATION

Three candidate designs for aggregating and optimizing gradients on the server:

  • Wide aggregation (used in MXNet): for each input queue, launch a series of threads for aggregation; each core reads the input queue from different workers and writes to different locations in the output queue. Requires synchronization.
  • Tall aggregation: each core sequentially aggregates the same portion of the gradients within each queue. Great locality; no synchronization.
  • NUMA-aware tree reduction: organize processors into a hierarchy and reduce across NUMA domains. Too much coherence and synchronization.

SLIDE 24

Software Optimizations

TALL AGGREGATION AND OPTIMIZATION

  • Chunk a gradient into a series of virtual gradients deterministically.
  • A virtual gradient is mapped to a particular core on the server.
  • Virtual gradients are transferred independently.
  • A chunk is only processed by a single core, maintaining maximum locality (see the sketch below).

[Figure: the gradient array for key 0 from 8 workers, its chunk-to-core mappings, and the aggregated result.]
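A minimal sketch of the deterministic chunking and chunk-to-core mapping; the chunk size and the modulo placement policy are assumptions for illustration, not PHub's actual policy.

```python
import numpy as np

CHUNK = 4096       # elements per virtual gradient (assumed)
NUM_CORES = 10

def owner_core(key, chunk_idx):
    """Deterministic mapping: every worker's copy of the same
    virtual gradient lands on the same server core."""
    return (key * 31 + chunk_idx) % NUM_CORES

def virtual_gradients(key, grad):
    """Split one gradient into independently transferable chunks."""
    for i in range(0, len(grad), CHUNK):
        yield owner_core(key, i // CHUNK), grad[i:i + CHUNK]

def aggregate(worker_copies):
    """Runs entirely on the owning core: sum one chunk across all
    workers, touching no other core's data (maximum locality)."""
    return np.sum(worker_copies, axis=0)

# Example: key 0 from 8 workers, as in the figure.
chunks = [list(virtual_gradients(0, np.ones(10000))) for _ in range(8)]
first = aggregate([worker[0][1] for worker in chunks])  # chunk 0, all workers
```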

SLIDE 25

Software Optimizations

TALL AGGREGATION AND OPTIMIZATION

When aggregation of a chunk is done:

  • PHub optimizes the chunk on the same core that aggregated it.

[Figure: the gradient array for key 0 from 8 workers, aggregated.]

SLIDE 26

Software Optimizations

TALL AGGREGATION AND OPTIMIZATION

When aggregation of a chunk is done:

  • PHub optimizes the chunk on the same core that aggregated it.
  • This allows overlapping of aggregation, optimization, and gradient transmission, as sketched below.

[Figure: the gradient array for key 0 from 8 workers, aggregated and optimized.]
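Since each chunk is independent, one chunk can be aggregated, optimized, and shipped back while later chunks are still arriving. A sketch of that per-chunk pipeline on a single core; the queue-based structure and plain-SGD update are assumptions for illustration.

```python
import queue
import threading
import numpy as np

NUM_WORKERS = 8
inbox, outbox = queue.Queue(), queue.Queue()
params = {0: np.zeros(4), 1: np.zeros(4)}  # two chunks of one key

def core_loop():
    """One server core: the moment all workers' copies of a chunk
    have arrived, aggregate and optimize it on this same core and
    send it back -- while other chunks are still in flight."""
    pending = {}
    for _ in range(NUM_WORKERS * len(params)):
        chunk_id, grad = inbox.get()
        pending.setdefault(chunk_id, []).append(grad)
        if len(pending[chunk_id]) == NUM_WORKERS:
            agg = np.mean(pending.pop(chunk_id), axis=0)  # aggregation
            params[chunk_id] -= 0.01 * agg                # optimization
            outbox.put((chunk_id, params[chunk_id]))      # transmission

t = threading.Thread(target=core_loop)
t.start()
for _ in range(NUM_WORKERS):              # workers stream their chunks
    for chunk_id in params:
        inbox.put((chunk_id, np.ones(4)))
t.join()
```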

SLIDE 27


Software Optimizations

NOT ENOUGH ON THEIR OWN! The typical server configuration is unbalanced.

[Diagram: bandwidth available at successive levels of a typical server: 14000, 1100, 800, and 10 Gbps.]

SLIDE 28

Eliminating bandwidth bottlenecks. PBox hardware: balanced computation and communication resources.


SLIDE 30


Hardware Optimization

THE PBOX

  • Balanced computation and communication.
  • Extends the balance and locality notion across NUMA domains and NICs.

[Diagram: PBox bandwidth ladder: 14000, 1100, 800, and 10 Gbps.]

SLIDE 31


Hardware Optimization

THE PBOX

  • Balanced computation and communication.
  • Extends the balance and locality notion across NUMA domains and NICs.

[Diagram: PBox bandwidth ladder: 14000, 1100, 800, and 20 Gbps.]

SLIDE 32


Hardware Optimization

THE PBOX

  • Balanced computation and communication.
  • Extends the balance and locality notion across NUMA domains and NICs.

[Diagram: PBox bandwidth ladder: 14000, 1100, 800, and 800 Gbps.]

SLIDE 33


Eliminating deployment bottlenecks. PHub hierarchical reduction: reducing cross-rack traffic.


SLIDE 35

PBox Deployment

RACK-SCALE PARAMETER SERVICE

[Diagram: a cluster network of racks; each rack holds worker/PS machines under a ToR switch, with a cluster manager (CM).]

SLIDE 36

PBox Deployment

RACK-SCALE PARAMETER SERVICE

[Diagram: the same topology with a PBox added to each rack.]

SLIDE 37

Two-Phase Hierarchical Aggregation

ADAPTING TO THE DATACENTER NETWORK TOPOLOGY

[Diagram: two racks with PBoxes; aggregation happens first within each rack, then across racks.]

  1. Intra-rack central aggregation.
  2. Inter-rack aggregation.

N times traffic reduction! (A sketch of the two phases follows.)
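A minimal sketch of the two phases (the mean-of-gradients semantics is an assumption; the traffic arithmetic is the point): with N workers per rack, phase 1 collapses N gradient streams into one at the rack's PBox, so only one stream per rack ever crosses the network core.

```python
import numpy as np

def two_phase_aggregate(racks):
    """racks: one list of per-worker gradients per rack.
    Phase 1 (intra-rack): each PBox sums its own workers' gradients.
    Phase 2 (inter-rack): the per-rack partial sums are combined."""
    rack_sums = [np.sum(rack, axis=0) for rack in racks]  # at each PBox
    total = np.sum(rack_sums, axis=0)                     # across racks
    return total / sum(len(rack) for rack in racks)       # mean gradient

# Two racks of 4 workers: 8 gradient streams hit the ToRs, but only
# 2 (one per rack) cross the core -- an N-fold reduction for N = 4.
racks = [[np.full(3, w) for w in range(4)] for _ in range(2)]
print(two_phase_aggregate(racks))
```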

SLIDE 38

Up to 2.7x performance in a 10 Gbps cloud-like environment

Speedup over the baseline:

Model          Speedup
AlexNet         2.7x
VGG 11          2.3x
VGG 19          2.2x
GoogleNet       1.3x
Inception V3    1.3x
ResNet 18       2.2x
ResNet 50       1.9x
ResNet 269      2.3x
ResNeXt 269     2.3x

8 workers, GTX 1080 Ti. Baseline: MXNet with an InfiniBand-enhanced setup, vs. PBox. Batch size 64 for ResNeXt, 128 for ResNet 269, 256 for all others.

SLIDE 39

Framework Bottlenecks

  • Data Copy
  • Aggregation and Optimization
  • Synchronization

[Chart: the framework-bottleneck time breakdown (0–1.6 s per batch) for AlexNet, GoogleNet, Inception, and ResNet 269, revisited.]

SLIDE 40

Scalability

LINEAR SCALING IN COMM. ONLY BENCHMARK

[Chart: memory bandwidth (10,000–100,000 MB/s) vs. number of active workers (1–16), comparing the microbenchmark limit with PHub training.]

SLIDE 41

Scalability

PCI-E TO MEMORY SUBSYSTEM BRIDGE

[Chart: the same memory bandwidth vs. active workers plot; the limit is set by the PCIe-to-memory-subsystem bridge.]

120 machines training ResNet 50.

SLIDE 42

Scalability Beyond a Single Rack

EMULATING HIERARCHICAL AGGREGATION

Overhead of PHub cross-rack synchronization:

Racks        2      3      4      5      6      7      8
AlexNet    14.3%  14.7%  15.1%  15.5%  15.9%  16.2%  16.5%
ResNet 50   0.4%   1.6%   1.8%   1.8%   1.4%   1.5%   1.9%

SLIDE 43

Cost Analysis – for infrastructure builders

25% BETTER THROUGHPUT/$

Throughput per $1,000: PBox 82.2 vs. standard sharded PS 60.4.

Accounting for network devices (switch ports, network adapters, network cables), GPU costs, and PBox's entire machine cost. Core oversubscription 2:1.

SLIDE 44

Parameter Hub

A software, hardware, and cluster configuration codesign that targets the three major bottlenecks in the cloud for more efficient DDNN training.