HiPS: Hierarchical Parameter Synchronization in Large-Scale Distributed Machine Learning (PowerPoint PPT Presentation)
SLIDE 1

HiPS: Hierarchical Parameter Synchronization in Large-Scale Distributed Machine Learning

Jinkun Geng, Dan Li, Yang Cheng, Shuai Wang, and Junfeng Li

SLIDE 2

ACM SIGCOMM Workshop on NetAI (Network for AI)

SLIDE 3

Background

Distributed Machine Learning = Computation + Communication

SLIDE 4

Background

• Strong Computation Power (GPU & TPU)

SLIDE 5

Background

• Communication Challenge: TCP suffers from high latency, low throughput, kernel overheads, etc.
• RDMA: a promising alternative to TCP

SLIDE 6

Background

• An MNIST benchmark with 1 million parameters

SLIDE 7

Background

• RoCE/RDMA: multi-vendor ecosystem
• Many problems in Fat-Tree-based deployment

SLIDE 8

Background

• Fat-Tree-based Deployment
1. PFC pause frame storm [SIGCOMM'15, '16; NS-3 simulation]
2. Resilient RoCE: performance sacrifice [Chelsio-Tech]
3. Synchronization performance

SLIDE 9

Background

• Fat-Tree-based Deployment
1. PFC pause frame storm [SIGCOMM'15, '16]
2. Resilient RoCE: performance sacrifice

SLIDE 10

Background

• Fat-Tree-based Deployment
1. Synchronization performance

SLIDE 11

Background

• Server-Centric Networks
1. Fewer hops lead to fewer PFC pause frames
2. Servers prevent the cascading effect of PFC pause frames

SLIDE 12

Background

• Synchronization Algorithms
1. PS-based
2. Mesh-based
3. Ring-based

SLIDE 13

Background

• Synchronization Algorithms
1. PS-based (Pull + Push)
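The PS-based scheme can be sketched in a few lines: every worker pushes its gradient to a logical parameter server, the server aggregates, and each worker pulls the averaged result back. A minimal single-process sketch (function and variable names are illustrative, not from the paper):

```python
def ps_sync(worker_grads):
    """PS-based synchronization: Push (workers -> server), then Pull (server -> workers)."""
    n = len(worker_grads)
    dim = len(worker_grads[0])
    # Push: the server sums the gradients received from all workers.
    server = [0.0] * dim
    for g in worker_grads:
        for i, v in enumerate(g):
            server[i] += v
    # The server averages the accumulated gradient.
    avg = [v / n for v in server]
    # Pull: every worker retrieves the same averaged gradient.
    return [list(avg) for _ in range(n)]
```

With two workers holding `[1, 2]` and `[3, 4]`, both end up with the average `[2.0, 3.0]`. The server is the communication bottleneck here, which is exactly what the mesh- and ring-based variants on the next slides avoid.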

SLIDE 14

Background

• Synchronization Algorithms
1. Mesh-based (Diffuse + Collect)
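In the mesh-based scheme, each worker owns one chunk of the parameter vector: chunks are diffused to their owners, reduced there, and the reduced chunks are then collected by everyone. A minimal single-process sketch, assuming the gradient length is divisible by the number of workers (names are illustrative):

```python
def mesh_sync(worker_grads):
    """Mesh-based synchronization: Diffuse (chunk j -> worker j), then Collect (all-gather)."""
    n = len(worker_grads)
    dim = len(worker_grads[0])
    chunk = dim // n  # assumption: dim is divisible by n
    # Diffuse: worker j receives chunk j from every peer and reduces (averages) it.
    reduced = []
    for j in range(n):
        part = [0.0] * chunk
        for g in worker_grads:
            for i in range(chunk):
                part[i] += g[j * chunk + i]
        reduced.append([v / n for v in part])
    # Collect: every worker gathers all reduced chunks into the full vector.
    full = [v for part in reduced for v in part]
    return [list(full) for _ in range(n)]
```

Unlike the PS scheme, the reduction load is spread evenly: each worker reduces only 1/n of the parameters.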

SLIDE 15

Background

• Synchronization Algorithms
1. Ring-based (Scatter + Gather)
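Ring-based synchronization (as in ring all-reduce) runs in two phases on a logical ring: a scatter-reduce pass in which each worker forwards one chunk per step to its successor and accumulates what it receives, then an all-gather pass that circulates the fully reduced chunks. A single-process sketch of the data movement, assuming the gradient length is divisible by the ring size (this computes the sum; dividing by the worker count yields the average):

```python
def ring_allreduce(worker_grads):
    """Ring-based synchronization: scatter-reduce, then all-gather, on a logical ring."""
    n = len(worker_grads)
    dim = len(worker_grads[0])
    chunk = dim // n  # assumption: dim is divisible by n
    bufs = [list(g) for g in worker_grads]
    # Scatter-reduce: in step s, worker w sends chunk (w - s) mod n to worker (w + 1) mod n,
    # which adds it into its own buffer. After n-1 steps, worker w fully owns chunk (w+1) mod n.
    for s in range(n - 1):
        sends = []
        for w in range(n):
            c = (w - s) % n
            sends.append((c, bufs[w][c * chunk:(c + 1) * chunk]))
        for w in range(n):
            c, data = sends[(w - 1) % n]
            for i in range(chunk):
                bufs[w][c * chunk + i] += data[i]
    # All-gather: circulate the reduced chunks around the ring, overwriting instead of adding.
    for s in range(n - 1):
        sends = []
        for w in range(n):
            c = (w + 1 - s) % n
            sends.append((c, bufs[w][c * chunk:(c + 1) * chunk]))
        for w in range(n):
            c, data = sends[(w - 1) % n]
            bufs[w][c * chunk:(c + 1) * chunk] = data
    return bufs
```

Each worker sends and receives only 2(n-1)/n of the gradient in total, which is why this scheme is bandwidth-optimal on a ring.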

SLIDE 16

Background

• Synchronization Algorithms
1. Ring-based (Scatter + Gather)

SLIDE 17

HiPS Design

• Map Logical View onto Physical Structure
1. Flexible (topology-aware)
2. Hierarchical (efficient)
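The hierarchical idea can be illustrated with a two-level sketch: gradients are first reduced within each group (e.g., along one BCube dimension), the per-group results are then reduced across groups, and the global result is broadcast back down. This is an illustrative simplification of topology-aware hierarchical synchronization, not the paper's exact algorithm:

```python
def hierarchical_sync(groups):
    """Two-level synchronization sketch.

    groups: list of groups, each a list of per-worker gradient vectors.
    Level 1: reduce within each group to a group leader.
    Level 2: reduce across group leaders (global sum).
    Level 3: broadcast the global result back within each group.
    """
    dim = len(groups[0][0])
    # Level 1: intra-group reduce to the group leader.
    leaders = []
    for g in groups:
        s = [0.0] * dim
        for grad in g:
            for i, v in enumerate(grad):
                s[i] += v
        leaders.append(s)
    # Level 2: inter-group reduce across leaders (a simple sum here; in practice
    # this level would itself run a mesh- or ring-based exchange).
    total = [0.0] * dim
    for l in leaders:
        for i, v in enumerate(l):
            total[i] += v
    # Level 3: broadcast the global result back to every worker in every group.
    return [[list(total) for _ in g] for g in groups]
```

The payoff is that only one leader per group crosses the inter-group links, so traffic at each level matches the capacity of the corresponding layer of the topology.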

SLIDE 18

HiPS Design

• HiPS in BCube

SLIDE 19

HiPS Design

• HiPS in BCube

SLIDE 20

HiPS Design

• HiPS in BCube

SLIDE 21

HiPS Design

• HiPS in BCube (Server <01>)

SLIDE 22

HiPS Design

• HiPS in BCube

SLIDE 23

HiPS Design

• HiPS in Torus

SLIDE 24

Theoretical Evaluation

SLIDE 25

Theoretical Evaluation

SLIDE 26

Theoretical Evaluation

SLIDE 27

Future Work

• Conduct further comparative study
• Integrate HiPS into DML systems

SLIDE 28

Simulation Evaluation

[Figures: GST comparison with RDMA in Torus; GST comparison with RDMA in BCube]

• NS-3 simulation with VGG workload
1. BCube: GST reduced by 37.5%∼61.9%
2. Torus: GST reduced by 49.6%∼66.4%

SLIDE 29

Testbed Evaluation

• System instance of HiPS: BML
1. Add an OP in TensorFlow
2. 9 servers, each equipped with 2 RNICs (BCube(3,1))
3. MNIST and VGG19 as benchmarks
4. Ring AllReduce in Ring and Mesh-based (P2P) sync in Fat-Tree as baselines

SLIDE 30

Testbed Evaluation

SLIDE 31

Testbed Evaluation

[Figure: 18.7%~56.4%]

SLIDE 32

Ongoing Work

• Conduct further comparative study
• Optimize HiPS in DML systems
• More cases of Network for AI

SLIDE 33

Thanks!

NASP Research Group
https://nasp.cs.tsinghua.edu.cn/
