

SLIDE 1

Fast and Generic Collectives for Distributed ML

Guanhua Wang, Shivaram Venkataraman, Amar Phanishayee, Jorgen Thelin, Nikhil R. Devanur, Ion Stoica

SLIDE 2

DNNs empower state-of-the-art results across many different applications:

Image Classification, Speech Recognition, Game Playing, Robot Control


SLIDE 5

Speed-up DNN training: Data Parallelism

Data-parallel training speed-up on the ImageNet-1K dataset* significantly reduces training time.

* https://software.intel.com/en-us/articles/caffe-training-on-multi-node-distributed-memory-systems

Model synchronization: each of the N workers computes a local gradient ∇W_i, and all workers aggregate

∇W = ∇W_1 + ∇W_2 + ⋯ + ∇W_N
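In practice this synchronization is an AllReduce over the gradients. A minimal PyTorch sketch of the aggregation step (a hypothetical training setup, assuming torch.distributed has been initialized across the N workers; this is an illustration, not the talk's implementation):

    import torch
    import torch.distributed as dist

    def synchronize_gradients(model: torch.nn.Module) -> None:
        # Sum each gradient across all N workers (∇W = ∇W_1 + ... + ∇W_N),
        # then average so every worker applies the same update.
        world_size = dist.get_world_size()
        for param in model.parameters():
            if param.grad is not None:
                dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
                param.grad /= world_size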


SLIDE 8

Despite many performance optimizations, model synchronization is a big overhead in data-parallel training on cloud servers.

Multi-GPU scaling performance using TensorFlow*: >50% communication overhead.

Communication overhead of data-parallel training on multi-GPU servers using PyTorch^: up to 90% communication overhead.

To alleviate communication bottlenecks, there have recently been big improvements in hardware and software.

*Horovod: fast and easy distributed deep learning in TensorFlow, arXiv:1802.05799, 2018
^PipeDream: Generalized Pipeline Parallelism for DNN Training, SOSP 2019

SLIDE 9

State of the art (hardware)

NVIDIA DGX-1 and NVIDIA DGX-2


SLIDE 12

What is inside?

  • Computation
    NVIDIA P100: 5.3 Tera-FLOPs double precision; NVIDIA V100: 7.8 Tera-FLOPs double precision
  • Faster interconnects
    PCIe 3.0 (x16): ~10 GB/s, shared
    NVLink: point-to-point; 1st gen (P100) ~18 GB/s; 2nd gen (V100) ~23 GB/s
    NVSwitch: fully connected crossbar; 6x NVLink 2nd-gen bandwidth, ~130 GB/s

SLIDE 13

State of the art (software)

Ring-based collective communication protocols

NCCL (NVIDIA Collective Communication Library)

SLIDE 14

Ring-based collectives (e.g. Broadcast)

[Figure: 4-GPU topology (GPU0–GPU3) and a ring Broadcast from GPU0, stepping data around the ring GPU0 → GPU1 → GPU2 → GPU3]
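For intuition, a minimal sketch of the pipelined ring schedule in plain Python (a simulation of the per-step transfers shown on the slide, not real GPU code; list appends stand in for GPU-to-GPU copies):

    def ring_broadcast(chunks, num_gpus, root=0):
        # Each GPU forwards the chunk it received last step to its ring
        # successor, so once the pipeline fills, every link carries one
        # chunk per step. Chunk c reaches the GPU at ring distance d from
        # the root at step c + d - 1.
        received = {g: list(chunks) if g == root else [] for g in range(num_gpus)}
        num_chunks = len(chunks)
        for step in range(num_chunks + num_gpus - 2):
            for d in range(num_gpus - 1, 0, -1):
                c = step - (d - 1)
                if 0 <= c < num_chunks:
                    received[(root + d) % num_gpus].append(chunks[c])
        return received

    # 4 GPUs, 3 chunks: the broadcast completes in 3 + 4 - 2 = 5 steps.
    print(ring_broadcast(["c0", "c1", "c2"], num_gpus=4))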




SLIDE 23

Can these hardware & software improvements alleviate the communication bottleneck in data-parallel training? Not really.

SLIDE 24

High communication overheads even with state-of-the-art hardware (NVLink) and software (NCCL)

Cross-GPU communication measured as the percentage of total epoch time when running within a single 8-GPU DGX-1 box. There are many different 4-GPU allocations within a server.

SLIDE 25

High communication overheads even with state-of-the-art hardware (NVLink) and software (NCCL)

Cross-GPU communication measured as the percentage of total epoch time within a single 8-GPU DGX-1 box, comparing the 4-GPU allocations with the highest and lowest overhead.


SLIDE 28

High communication overhead is consistent across different numbers of workers and for a range of DNNs, and becomes more pronounced as GPU compute power increases (cross-GPU communication measured as the percentage of total epoch time within a single 8-GPU DGX-1 box).

We need faster collective communication protocols.

SLIDE 29

Talk Outline

  • Motivation
  • Challenges to achieving faster collective communication
  • Design
  • Evaluation


SLIDE 31

Challenge 1: Different server configurations

Protocols need to be topology-aware to use the hardware links effectively.

[Figure: 8-GPU topologies of DGX1-P100 (NVLink 1st gen, ~18 GB/s) and DGX1-V100 (NVLink 2nd gen, ~23 GB/s)]

SLIDE 33

Challenge 2: Link heterogeneity

[Figure: NVLink topology vs. PCIe topology for the same 8 GPUs]

Ring-based collectives can only utilize homogeneous links. Why not heterogeneous links? Because the slowest link gates the whole ring: e.g., a PCIe link would become the bottleneck if included in an NVLink ring.
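A quick illustration of that bottleneck, using the per-link bandwidths from the earlier hardware slide (a back-of-the-envelope sketch, not a measurement):

    # A ring's steady-state throughput is capped by its slowest link.
    nvlink_ring = [23, 23, 23, 23]   # GB/s per link, NVLink 2nd gen (V100)
    mixed_ring  = [23, 23, 23, 10]   # same ring with one PCIe 3.0 link added

    print(min(nvlink_ring))  # 23 GB/s
    print(min(mixed_ring))   # 10 GB/s: the PCIe link throttles the whole ring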

SLIDE 34

Challenge 3: Fragmentation in multi-tenant clusters

Examples of fragmented allocations (an 8-GPU job across 2 servers): 3 + 5, 2 + 6.

[Figure: histogram of the number of GPUs allocated within each 8-GPU server, across 40,000 multi-GPU jobs at Microsoft]


SLIDE 36

Challenge 3: Fragmentation in multi-tenant clusters

Why fragmentation? Many cluster schedulers are not topology-aware, and without support for efficient migration, DNN jobs must embrace fragmentation to avoid queuing delays.

Irregular topology → no ring. Existing solutions (NCCL) fall back to PCIe if they cannot form an NVLink ring.

SLIDE 37

Can we do better than the state of the art?

Ring-based collective communication protocols vs. topology heterogeneity:
  • 1. Different server configurations
  • 2. Link heterogeneity
  • 3. Fragmentation in multi-tenant clusters

SLIDE 38

BLINK

Can we do better than the state of the art?

Topology heterogeneity:
  • 1. Different server configurations
  • 2. Link heterogeneity
  • 3. Fragmentation in multi-tenant clusters
SLIDE 39

Talk Outline

  • Motivation
  • Challenges to achieving high-performance collective communication
  • 1. Different server configurations
  • 2. Link heterogeneity
  • 3. Fragmentation in multi-tenant clusters
  • Design
  • Evaluation


SLIDE 43

How Blink handles topology heterogeneity

Topology heterogeneity → Blink:
  • Different server configurations → probe available links at job run time (see the sketch below)
  • Link heterogeneity → concurrent data transfers over heterogeneous links
  • Fragmentation in multi-tenant clusters (irregular topology) → spanning trees (vs. rings) are more flexible and optimal
  • NCCL-compatible API: seamless integration with TensorFlow, PyTorch, etc.
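A hedged sketch of what run-time link probing can look like from Python, using PyTorch's peer-access query (Blink's actual probing is internal to its runtime; this is an illustration, not its API):

    import torch

    def probe_topology():
        # Map each GPU to the peers it can reach directly
        # (e.g., over NVLink or a shared PCIe switch).
        n = torch.cuda.device_count()
        links = {i: [] for i in range(n)}
        for i in range(n):
            for j in range(n):
                if i != j and torch.cuda.can_device_access_peer(i, j):
                    links[i].append(j)
        return links

    print(probe_topology())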

SLIDE 44

Blink workflow


SLIDE 46

Broadcast comparison (Trees vs. Rings)

[Figure: 6-GPU topology]

SLIDE 47

Broadcast comparison (Trees vs. Rings)

Broadcast from GPU3. NCCL builds 2 rings, leaving one link unused.

SLIDE 48

Broadcast comparison (Trees vs. Rings)

Broadcast from GPU3. Blink builds 3 spanning trees. 3 spanning trees > 2 rings: using all available links → optimal.

SLIDE 51

TreeGen: packing max. spanning trees

  • Given the available topology, pack max. unidirectional spanning trees.

[Figure: example topology over GPU1–GPU3 and a packing of unidirectional spanning trees]

Optimization problem: maximize the sum of bandwidth usage across all links. Constraint: when packing multiple trees, total bandwidth use must not exceed any link's capacity (a sketch of this formulation follows below).

Too many trees! There are 181 spanning trees for the 8-GPU DGX-1V, so the data size per tree is too small to fully saturate link bandwidth.
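The packing step can be posed as a linear program: give each candidate spanning tree i a rate x_i, maximize the total rate Σ x_i, and require that the trees sharing a link not exceed its capacity. A toy sketch with scipy (the 2-tree, 4-link instance is hypothetical):

    from scipy.optimize import linprog

    trees = [["a", "b"], ["c", "d"]]       # links used by each candidate tree
    links = ["a", "b", "c", "d"]
    capacity = {l: 1.0 for l in links}     # normalized link capacities

    # One constraint row per link: sum of rates of trees crossing it <= capacity.
    A_ub = [[1.0 if l in t else 0.0 for t in trees] for l in links]
    b_ub = [capacity[l] for l in links]

    # linprog minimizes, so negate the objective to maximize the total rate.
    res = linprog(c=[-1.0] * len(trees), A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
    print("per-tree rates:", res.x, "total bandwidth:", -res.fun)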

SLIDE 52

TreeGen: packing max. spanning trees

  • Given the available topology, pack max. unidirectional spanning trees.

Approximation: either a tree uses ALL the bandwidth of a link, or it does not use the link at all. This reduces 181 trees to 6 trees for the 8-GPU DGX-1V.

SLIDE 54

TreeGen

  • Given the available topology, pack max. unidirectional spanning trees.
  • Direct support for one-to-many/many-to-one primitives, e.g. Reduce, Broadcast.
  • Extend to many-to-many primitives (e.g. AllReduce): pick a root node, Reduce toward the root, then Broadcast in the reverse direction (AllReduce = Reduce → root, then Broadcast ← root; see the sketch below).
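A minimal sketch of that construction over a toy tree (plain Python; values[g] stands in for GPU g's local gradient and the recursion stands in for per-link transfers):

    def tree_allreduce(values, children, root):
        # Reduce: sum values bottom-up toward the root.
        def reduce_up(node):
            total = values[node]
            for child in children.get(node, []):
                total += reduce_up(child)
            return total

        # Broadcast: push the reduced result back down the same tree.
        def broadcast_down(node, result):
            values[node] = result
            for child in children.get(node, []):
                broadcast_down(child, result)

        broadcast_down(root, reduce_up(root))
        return values

    # GPU0 is the root with children GPU1 and GPU2; GPU2 has child GPU3.
    print(tree_allreduce({0: 1, 1: 2, 2: 3, 3: 4}, {0: [1, 2], 2: [3]}, root=0))
    # every GPU ends up with 10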


SLIDE 56

TreeGen for NVSwitch (DGX-2)

  • With NVSwitch, the connectivity among any subset of GPUs is uniform.
  • NCCL constructs a multi-hop ring.

[Figure: 4-GPU Reduce (G1 → G4) over NVSwitch]
SLIDES 57-60

TreeGen for NVSwitch (DGX-2)

[Figure: NCCL's multi-hop ring for the 4-GPU Reduce (G1 → G4) over NVSwitch; the hop count grows step by step from 1 to 4]


SLIDE 62

TreeGen for NVSwitch (DGX-2)

  • DGX-2 single-hop trees: a 1-hop Reduce tree achieves minimum latency.
  • AllReduce → Reduce + Broadcast: for N GPUs, use N 1-hop trees, with each tree responsible for 1/N of the data (see the sketch below).
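A sketch of the resulting schedule (plain Python stand-in for the NVSwitch transfers): partition the buffer N ways and run one 1-hop tree per partition, each rooted at a different GPU:

    def one_hop_allreduce(values, num_gpus):
        # values[g][r] is GPU g's share of partition r. Partition r is reduced
        # at root r in one hop, then broadcast back to all GPUs in one hop.
        for r in range(num_gpus):
            total = sum(values[g][r] for g in range(num_gpus))  # 1-hop Reduce
            for g in range(num_gpus):                           # 1-hop Broadcast
                values[g][r] = total
        return values

    # 4 GPUs x 4 partitions of ones: every partition sums to 4 on every GPU.
    print(one_hop_allreduce([[1, 1, 1, 1] for _ in range(4)], num_gpus=4))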

SLIDE 63

Blink workflow

SLIDE 66

CodeGen

  • Translate TreeGen output (spanning trees) into actual data-transfer commands.
  • CodeGen optimizations: pipeline data chunks to reduce latency.

What chunk size to use? Too small, and it cannot fully utilize link bandwidth; too big, and latency is high. Blink therefore picks the chunk size automatically with MIAD (multiplicative-increase, additive-decrease), sketched below.
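A hedged sketch of such a controller (the saturation threshold, step sizes, and the measure_bandwidth probe are made-up illustrations; the actual tuning loop may differ):

    def miad_chunk_size(measure_bandwidth, link_capacity,
                        chunk=1 << 20, max_chunk=1 << 28, step=1 << 20):
        # Multiplicative increase: double the chunk until the link looks saturated.
        while chunk < max_chunk and measure_bandwidth(chunk) < 0.95 * link_capacity:
            chunk *= 2
        # Additive decrease: shrink in small steps while saturation holds,
        # trading no bandwidth for lower per-chunk latency.
        while chunk > step and measure_bandwidth(chunk - step) >= 0.95 * link_capacity:
            chunk -= step
        return chunk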
SLIDE 70

Blink design recap

  • Packing spanning trees while minimizing the number of trees
  • Single-hop trees for DGX-2 (NVSwitch)
  • Chunking and pipelining transfers for maximum link utilization
  • Automatic chunk-size selection with MIAD
  • GPU stream reuse for fair sharing of links
  • PCIe + NVLink hybrid transfers
  • Support for multi-machine collectives

Drop-in NCCL replacement (swapped in at load time, no recompilation).

SLIDE 71

Talk Outline

  • Motivation
  • Challenges to achieving high-performance collective communication
  • 1. Different server configurations
  • 2. Link heterogeneity
  • 3. Fragmentation in multi-tenant clusters
  • Design
  • Evaluation
  • AllReduce and Broadcast microbenchmarks
  • End-to-end improvements
  • Benefits of one-hop trees over rings or double binary trees
  • Rest of the extensive evaluation → refer to the paper

SLIDE 74

Microbenchmarks (DGX-1V)

AllReduce on the 8-GPU DGX-1V topology: NCCL2 builds 2 rings; Blink builds 3 spanning trees.


SLIDE 76

Microbenchmarks (DGX-1V)

AllReduce: up to 8x speed-up (2x geo-mean). Broadcast: up to 6x speed-up (2x geo-mean).


SLIDE 78

End-to-end Benchmarks (DGX-1V)

Blink end-to-end results on ImageNet-1K: up to 87% communication-time reduction (51% on average) and up to 40% end-to-end training-time reduction.

SLIDE 79

Microbenchmarks (DGX-2)

16-GPU AllReduce: up to 3.5x throughput speed-up and up to 3.32x latency reduction.

The biggest wins are at small chunk sizes, because the 1-hop trees achieve minimum latency.

SLIDE 80

  • Topology heterogeneity results in link underutilization for collectives.
  • Blink packs spanning trees for optimal link utilization.
  • Auto-generates one-to-all, all-to-one, and all-to-all collectives: Broadcast, AllReduce, etc.
  • Faster collective communication than NCCL:
  • Up to 6x faster Broadcast (2x geo-mean)
  • Up to 8x faster AllReduce (2x geo-mean)
  • Up to 7.7x (2x geo-mean) communication-time reduction in end-to-end data-parallel training on DGX-1 machines.

Guanhua Wang, guanhua@cs.berkeley.edu

SLIDE 81

Back-ups

SLIDE 82

TreeGen

  • Handle hybrid communication (e.g. PCIe & NVLink).
  • Balance the amount of data transferred over different link types based on link bandwidth (see the sketch below).
  • Take link-type switching latency (i.e. disable_peer_access) into account.
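A worked sketch of the balancing rule under simple assumptions (an illustrative closed form, not the paper's exact model): split total data D so the NVLink and PCIe paths finish at the same time, charging the PCIe path the link-switching latency.

    def hybrid_split(total_gb, bw_nvlink, bw_pcie, switch_latency_s):
        # Equalize finish times: d / bw_nvlink = (total - d) / bw_pcie + latency
        # => d = bw_nvlink * (total + bw_pcie * latency) / (bw_nvlink + bw_pcie)
        d_nvlink = bw_nvlink * (total_gb + bw_pcie * switch_latency_s) \
                   / (bw_nvlink + bw_pcie)
        d_nvlink = min(d_nvlink, total_gb)  # for tiny transfers, skip PCIe entirely
        return d_nvlink, total_gb - d_nvlink

    # 1 GB over ~23 GB/s NVLink plus ~10 GB/s PCIe, with 1 ms switching latency:
    print(hybrid_split(1.0, 23.0, 10.0, 0.001))  # ≈ (0.704, 0.296)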

SLIDE 83

TreeGen

  • Multi-server transfers

SLIDE 84

Multiple DGX-1s DNN Training

  • 8-GPU job on 2 DGX-1V machines (5-3 GPU placement)
  • Inter-server throughput (40 Gb/s) < intra-server throughput (40 GB/s)
  • Projections with 100/400 Gbps inter-server bandwidth highlight Blink's advantage.