CLUSTAR: AI Training Platform Powered by High Performance Networking
Junxue ZHANG, EVP of CLUSTAR; PhD, SING Lab, HKUST
August 1, 2018

Deep Learning Is Becoming Increasingly Important
Computer Vision, Natural Language Processing, Self-driving Cars
A Toy Example: One Training Iteration

Model: y_pred = w·x + b, with Input Layer and Output Layer; initialized w = 1, b = 1; learning rate η = 0.1.
Mini-batch of (x, y_label) pairs: (1, 5), (2, 7).

Forward Pass: y_pred(1) = 1·1 + 1 = 2, y_pred(2) = 1·2 + 1 = 3.

Calculating Loss: L = 1/2 · Σ (y_label − y_pred)²

Backpropagation:
∂L/∂w = Σ (y_pred − y_label)·x = (2 − 5)·1 + (3 − 7)·2 = −11
∂L/∂b = Σ (y_pred − y_label) = (2 − 5) + (3 − 7) = −7
w ← w − η·∂L/∂w = 1 − 0.1·(−11) = 2.1
b ← b − η·∂L/∂b = 1 − 0.1·(−7) = 1.7

Next Iteration: repeat with the next mini-batch, (3, 9), (5, 13).
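The update above can be sketched in a few lines of Python (an illustrative toy, not CLUSTAR code):

```python
# One forward pass + backpropagation step of the slide's toy example:
# fit y = w*x + b under the squared-error loss L = 1/2 * sum((y - y_pred)^2),
# with learning rate 0.1.

def sgd_step(w, b, batch, lr=0.1):
    """One mini-batch gradient-descent step on the linear model."""
    dw = db = 0.0
    for x, y in batch:
        y_pred = w * x + b           # forward pass
        err = y_pred - y             # dL/dy_pred
        dw += err * x                # dL/dw accumulated over the batch
        db += err                    # dL/db accumulated over the batch
    return w - lr * dw, b - lr * db  # gradient-descent update

w, b = 1.0, 1.0
w, b = sgd_step(w, b, [(1, 5), (2, 7)])
print(round(w, 6), round(b, 6))  # 2.1 1.7 -- matching the slide
```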
With Hidden Layers

Input Layer → Hidden Layers → Output Layer
Each iteration runs a forward pass through every layer, calculates the loss, then backpropagates layer by layer to update each layer's parameters.
Uber: training usually takes weeks or longer to complete [1]
Distributed Training with a Parameter Server

The parameter server holds the model parameters (x₁, x₂, …); Worker 1 and Worker 2 each train on their own data partition (Data Partition 1, Data Partition 2). Every iteration:

1. Pull parameters from the servers — each worker fetches the latest x₁, x₂.
2. Forward pass — each worker runs its input batch through the model.
3. Calculating loss — each worker computes the loss on its partition.
4. Backpropagation — each worker computes its local updates x₁′, x₂′.
5. Push parameters to the servers — the server aggregates the workers' updates.

Every step that moves parameters or updates crosses the network: networking is critical to performance!
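The five steps can be sketched as a single-process simulation (an illustrative toy reusing the earlier slides' model y = w·x + b, not CLUSTAR code; in a real deployment each worker is a separate machine, and the pull/push steps are the network traffic at issue):

```python
# Parameter-server loop: every worker pulls the current parameters, computes
# gradients on its own data partition, and pushes them back; the server
# aggregates the gradients and applies the update.

def worker_gradients(params, batch):
    """Steps 2-4: forward pass, loss, and backpropagation on one partition."""
    w, b = params
    dw = db = 0.0
    for x, y in batch:
        err = (w * x + b) - y        # forward pass + dL/dy_pred
        dw += err * x
        db += err
    return dw, db

def train_step(params, partitions, lr):
    """One iteration: pull -> compute on every worker -> push/aggregate."""
    grads = [worker_gradients(params, p) for p in partitions]
    dw = sum(g[0] for g in grads)    # server-side aggregation
    db = sum(g[1] for g in grads)
    return params[0] - lr * dw, params[1] - lr * db

partitions = [[(1, 5), (2, 7)],      # Worker 1's data partition
              [(3, 9), (5, 13)]]     # Worker 2's data partition
params = (1.0, 1.0)
for _ in range(1000):
    params = train_step(params, partitions, lr=0.02)
print(round(params[0], 2), round(params[1], 2))  # 2.0 3.0 (data fits y = 2x + 3)
```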
Speedup from high-performance networking, per model:

Model                   Speedup
Logistic Regression     2.59x
Multi-layer Perceptron  3.45x
AlexNet                 1.6x
VGG-16                  1.33x
ResNet-50               1.03x
Key Technologies (World-Leading Research Achievements)

GDR — between 2 machines — "wider roads"
ParaExpress — smart networking scheduling and aggregation across multiple machines, for both Parameter Server and Ring AllReduce — "traffic scheduling"
MLT — an AI protocol — "a new traffic rule for AI"

The importance of networking to an AI system equals that of the traffic system to a city.
Full-Stack AI Infrastructure

Applications: computer vision, speech recognition, natural language processing, autonomous driving, intelligent anti-fraud, intelligent marketing; industry solutions for finance, security, internet, manufacturing, healthcare, and government.

Platform: data preprocessing, offline training, online training, multi-tenant management, task scheduling, operations monitoring, Spark optimization, TensorFlow optimization, container orchestration engine, interactive programming; P4, E8 Storage, Clustar AI Fabrics, RoCE, smart NICs, programmable networking.

Hardware: CPU, GPU, FPGA, ASIC, RDMA networking, all-flash storage — Intel, Nvidia, AMD, Cambricon, Mellanox, Broadcom.
The Data Copy Problem

Each server has two CPU sockets (Socket 1, Socket 2), each with its own memory and GPUs; an RDMA NIC connects Server 1 and Server 2 through the data center network. Without GDR, data leaving a GPU must first be copied into host memory before the RNIC can transmit it, and copied again on the receiving side. The unnecessary copy between RNIC/memory and GPU RAM/memory enlarges latency, degrades throughput, and burns CPU.
GDR removes the unnecessary copy to boost performance: the RNIC reads from and writes to GPU memory directly, bypassing host memory.
On the host side, RDMA can only send from pinned memory. With OS-managed application memory, every allocated object must be copied into the pinned RDMA memory sending buffer before transmission; this unnecessary data copy between the pinned buffer and application memory degrades performance. GDR further reduces the data copy by manually managing malloc() and free() over pre-pinned memory, so objects are allocated directly inside the pinned region and can be sent without the extra copy.
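One way to picture "malloc() and free() over pre-pinned memory" is a free-list pool. A minimal sketch (an assumption-laden illustration, not CLUSTAR's allocator: fixed-size slots, and a plain bytearray standing in for the RDMA-registered region):

```python
# Free-list allocator over one pre-"pinned" region. A real implementation
# registers the region with the RDMA verbs API once; afterwards, malloc/free
# only move slot indices around -- no data is copied and nothing is re-pinned.

class PinnedPool:
    """Hands out fixed-size slots from one pre-registered buffer."""

    def __init__(self, slot_size=4096, slots=16):
        self.slot_size = slot_size
        self.region = bytearray(slot_size * slots)  # "pinned" once, up front
        self.free_slots = list(range(slots))        # free list of slot indices

    def malloc(self):
        """Return the byte offset of a free slot inside the pinned region."""
        if not self.free_slots:
            raise MemoryError("pinned pool exhausted")
        return self.free_slots.pop() * self.slot_size

    def free(self, offset):
        """Return a slot to the free list; no copy, no unpinning."""
        self.free_slots.append(offset // self.slot_size)

pool = PinnedPool()
off = pool.malloc()
pool.region[off:off + 5] = b"hello"  # write directly into "pinned" memory
pool.free(off)                       # slot is reusable, region stays pinned
```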
GDR has been contributed to the TensorFlow community (a commercial version is also available): https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/gdr
[Benchmark chart: results on VGG16, BERT, and AlexNet]
Limitations at Scale

Parameter Server largely degrades from congested links due to over-subscribed networking: traffic from Workers A, B, and C converges on a bottleneck link into the parameter server.

Ring AllReduce (Workers A-D in a ring) has a long dependency chain: once one hop is delayed due to congestion, downstream workers cannot start transferring, and the whole job may wait.
ParaExpress: Generate Optimal Parameter Aggregation

[Diagram: nodes 1-8 across Rack 1 and Rack 2 under two ToR switches; given real-time networking conditions, ParaExpress assigns root, aggregator, and leaf roles.] The generated parameter-aggregation topology has the advantages of both the tree structure (Parameter Server) and the ring structure (Ring AllReduce).
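For reference, the ring structure mentioned above can be simulated in a few lines; this is an illustrative sketch of plain ring all-reduce, not ParaExpress's generated topology:

```python
# Ring all-reduce over n simulated nodes: each node's buffer is split into n
# chunks; chunks travel around the ring for n-1 reduce-scatter steps and n-1
# all-gather steps, after which every node holds the full element-wise sum.

def ring_allreduce(tensors):
    n = len(tensors)
    chunks = [list(t) for t in tensors]                # each node's local copy
    size = len(chunks[0])
    bounds = [(k * size // n, (k + 1) * size // n) for k in range(n)]

    def send(src, dst, c, accumulate):
        lo, hi = bounds[c]
        for j in range(lo, hi):
            if accumulate:
                chunks[dst][j] += chunks[src][j]       # reduce-scatter: add
            else:
                chunks[dst][j] = chunks[src][j]        # all-gather: overwrite

    for step in range(n - 1):                          # reduce-scatter phase
        for i in range(n):
            send(i, (i + 1) % n, (i - step) % n, accumulate=True)
    for step in range(n - 1):                          # all-gather phase
        for i in range(n):
            send(i, (i + 1) % n, (i + 1 - step) % n, accumulate=False)
    return chunks

print(ring_allreduce([[1, 2, 3, 4], [10, 20, 30, 40]]))
# [[11, 22, 33, 44], [11, 22, 33, 44]]
```

Each node sends only one chunk per step, which is why the ring uses bandwidth well but stalls end to end when any single hop is delayed.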
MLT Architecture

Components: task queue, completion queue, execution-graph resolver, operation pool, request manager, and a traffic prioritization module. The module embeds plan prioritization into the execution graph and changes the DSCP-to-priority mapping on the high-speed network interface.
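Priority-to-DSCP marking is commonly applied per socket via the IP TOS byte. A minimal sketch, assuming a Linux host and arbitrary example values (not CLUSTAR's actual mapping):

```python
import socket

# Map a flow priority to a DSCP value and stamp it on a socket. DSCP occupies
# the top 6 bits of the IP TOS byte, so the kernel expects dscp << 2. The
# mapping below is an illustrative assumption (best-effort ... EF).

PRIORITY_TO_DSCP = {0: 0, 1: 10, 2: 26, 3: 46}

def open_prioritized_socket(priority):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    dscp = PRIORITY_TO_DSCP[priority]
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp << 2)
    return sock

sock = open_prioritized_socket(3)  # mark this flow as highest priority
tos = sock.getsockopt(socket.IPPROTO_IP, socket.IP_TOS)
print(tos >> 2)  # 46
sock.close()
```

Switches configured for DSCP-based queuing then schedule the marked flows accordingly, which is the per-flow knob a prioritization module can turn.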
ParaExpress achieves 1.5-4.3x better results than Ring AllReduce.
9 papers in top-tier networking conferences (SIGCOMM/NSDI) in the last 5 years — first in Asia.

Accepted papers from universities in greater China:

University                              Accepted papers
HKUST                                   9 (all from CLUSTAR's teams)
Tsinghua University                     5 (different professors and labs)
Chinese Academy of Sciences             3 (different professors and labs)
Peking University                       1
Fudan University                        1
National Supercomputing Center in Wuxi  1
Selected Clients

GDR
- WeChat: Moments classification, ~3x speedup
- SAIC: computer vision, ~1.6x speedup
- An AI unicorn¹: Federated Learning over long-distance communication — progress: developing

ParaExpress
- Improving training performance in a sophisticated cloud environment — progress: POC
- High-speed networking virtualization — progress: developing

AI Consulting
- Smart customer support; building a next-generation AI platform together to speed up AI training — progress: developing

¹ Client name withheld due to NDA.
Team

- Kai CHEN — Founder (SIGCOMM/NSDI)
- Qiang YANG — Co-founder (HKUST)
- Shuihai HU — VP of Technology
- Pin LYU — Director of Algorithm
- Junxue ZHANG — EVP
- Yajing LYU — VP of Business
- Junhuan SUN — VP of Engineering
- Weiyan WANG — AI Scientist
Milestones

- 2018.05 — CLUSTAR is founded!
- 2018.01 / 2018.09 — Join Nvidia Inception Program
- 2018.11 — CLUSTAR v1.0 launched! Cooperation with SAIC
- 2019.01 — CLUSTAR v1.1 launched! Cooperation with WeChat
- 2017.08 / 2018.11 — Cooperation with Sunshine Insurance; angel funding
- 2018.03