CLUSTAR: AI Training Platform Powered by High Performance Networking



SLIDE 1

Junxue ZHANG, EVP, CLUSTAR; PhD, SING Lab, HKUST

August 1, 2018

CLUSTAR: AI Training Platform Powered by High Performance Networking

SLIDE 2

Deep Learning Is Becoming Increasingly Important

Computer Vision, Natural Language Processing, Self-driving Cars

SLIDE 3

How Does Deep Learning Work?

z = w · y + b

A single-neuron model: input layer y, output layer z, initial parameters w = 1, b = 1. Mini-batch (y, z_label): (1, 5), (2, 7).

SLIDE 4

How Does Deep Learning Work?

z = w · y + b

Forward pass: with w = 1, b = 1, the mini-batch inputs y = 1, 2 give predictions z = 2, 3 (labels z_label = 5, 7).

SLIDE 5

How Does Deep Learning Work?

z = w · y + b

Calculating loss over the mini-batch:

L = (1/2) Σ (z − z_label)²

SLIDE 6

How Does Deep Learning Work?

z = w · y + b,  L = (1/2) Σ (z − z_label)²

Backpropagation (learning rate η = 0.1):

∂L/∂w = ∂L/∂z × ∂z/∂w = Σ (z − z_label) · y = −11
∂L/∂b = ∂L/∂z × ∂z/∂b = Σ (z − z_label) = −7

w ← w − η · ∂L/∂w = 1 − 0.1 × (−11) = 2.1
b ← b − η · ∂L/∂b = 1 − 0.1 × (−7) = 1.7

SLIDE 7

How Does Deep Learning Work?

z = w · y + b

Next iteration: with the updated parameters w = 2.1, b = 1.7, training repeats on the next mini-batch (y, z_label): (3, 9), (5, 13).

SLIDE 8

How Does Deep Learning Work?

A deeper network: Input Layer → Hidden Layer → Output Layer.

SLIDE 9

How Does Deep Learning Work?

Each layer runs its forward pass in turn; after the loss is calculated, backpropagation runs back through the layers, updating every layer's weights.

SLIDE 10

Big Data Drives a New Paradigm for Training

1. The data is too large to fit in a single machine
2. The training time is too long

Uber: training usually takes weeks or longer to complete [1]

SLIDE 11

Networking Plays an Important Role

Distributed training: Worker 1 and Worker 2, each with its own data partition, connect over the network to a Parameter Server holding the parameters x₁, x₂, ….

SLIDE 12

Networking Plays an Important Role

Step 1: each worker pulls the parameters x₁, x₂, … from the parameter server.

SLIDE 13

Networking Plays an Important Role

Step 2: each worker runs the forward pass on inputs from its own data partition.

SLIDE 14

Networking Plays an Important Role

Step 3: each worker calculates the loss on its mini-batch.

SLIDE 15

Networking Plays an Important Role

Step 4: each worker runs backpropagation to compute updates for x₁, x₂, ….

SLIDE 16

Networking Plays an Important Role

Step 5: each worker pushes its parameter updates back to the servers.

SLIDE 17

Networking Plays an Important Role

Every iteration repeats this pull–compute–push cycle over the network. Networking is critical to performance!
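The pull / forward / loss / backprop / push cycle above can be sketched as a small serialized simulation. The class, data partitions, and learning rate are illustrative assumptions, not CLUSTAR's API; real systems run the workers in parallel over the network.

```python
import numpy as np

# Minimal parameter-server simulation: two workers, each holding one
# data partition of a linear model, exercising the five steps from
# the slides in turn.
class ParameterServer:
    def __init__(self, dim):
        self.params = np.zeros(dim)
    def pull(self):                       # step 1: pull parameters
        return self.params.copy()
    def push(self, grads, lr=0.05):       # step 5: push updates
        self.params -= lr * grads

def worker_iteration(ps, data, labels):
    params = ps.pull()
    preds = data @ params                 # step 2: forward pass
    residual = preds - labels             # step 3: loss = 0.5*sum(residual**2)
    grads = data.T @ residual             # step 4: backpropagation
    ps.push(grads)

# Features are (y, 1), so params converge to (w, b) = (2, 3):
# both partitions are drawn from z = 2*y + 3.
partitions = [
    (np.array([[1.0, 1.0], [2.0, 1.0]]), np.array([5.0, 7.0])),
    (np.array([[3.0, 1.0], [5.0, 1.0]]), np.array([9.0, 13.0])),
]
ps = ParameterServer(dim=2)
for epoch in range(500):
    for data, labels in partitions:       # each tuple plays one worker
        worker_iteration(ps, data, labels)
```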

SLIDE 18

Networking Plays an Important Role

The speedup achieved after utilizing 40Gbps networking bandwidth with CLUSTAR:

  • Logistic Regression: 2.59x
  • Multi-layer Perceptron: 3.45x
  • AlexNet: 1.6x
  • VGG-16: 1.33x
  • ResNet-50: 1.03x

SLIDE 19

CLUSTAR: AI Training Platform Powered by High Performance Networking

Key Technology (World-leading Research Achievements)

Smart Networking Scheduling
  • Co-flow scheduling
  • Elephant & mice flow scheduling

GDR
  • Towards zero-copy data flow
  • Utilizes RDMA and GPUDirect
  • Integrated with TensorFlow

ParaExpress
  • Resilient and adaptive parameter aggregation
  • Tackles the disadvantages of Parameter Server & Ring AllReduce

MLT
  • Utilizes the SGD nature of AI training
  • Semi-loss tolerance
  • Model quality awareness

Analogy: between 2 machines, wider roads; across multiple machines, traffic scheduling; an AI protocol as new traffic rules for AI. Networking is as important to an AI system as the traffic system is to a city.



SLIDE 22

CLUSTAR Platform

Application layer: computer vision, speech recognition, natural language processing, autonomous driving, intelligent anti-fraud, intelligent drones; industry applications for finance, security, internet, manufacturing, healthcare, and government.

Cloud platform layer: Clustar AI Fabrics, RoCE, smart NICs, programmable networking, data preprocessing, offline training, online training, multi-tenant management, task scheduling, operations monitoring, Spark optimization, TensorFlow optimization, container orchestration engine, interactive programming interface.

Infrastructure layer (general-purpose hardware): CPU, GPU, FPGA, ASIC, RDMA networking, all-flash storage; vendors include Intel, Nvidia, AMD, Cambricon, Mellanox, Broadcom, P4, E8 Storage.

SLIDE 23

GDR: Towards Zero Copy Data Flow

Setup: two servers, each with two CPU sockets (CPU, memory, GPUs) and an RDMA NIC, connected through the data center network. The unnecessary copies between the RNIC and memory, and between GPU RAM and memory, enlarge latency, degrade throughput, and burn CPU cycles.


SLIDE 31

GDR: Towards Zero Copy Data Flow

GDR removes the unnecessary copies: the RDMA NIC transfers data directly to and from GPU memory, boosting performance.


SLIDE 35

Memory Management

In the baseline, objects live in OS-managed application memory and must be copied into a pinned RDMA memory sending buffer before transmission. This unnecessary data copy between the pinned buffer and application memory degrades performance. GDR further reduces data copies by managing the objects manually over pinned memory.
SLIDE 39

Memory Management

The fix: manually manage malloc() and free() over pre-pinned memory, so allocated objects already reside in the pinned RDMA sending buffer and the data copy from application memory disappears.
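A minimal sketch of the idea (illustrative only, not CLUSTAR's allocator): a first-fit malloc()/free() over one pre-allocated buffer standing in for pinned RDMA memory, so allocated objects already live inside the sending buffer.

```python
# First-fit allocator over a single pre-allocated buffer that stands
# in for a pinned RDMA region: objects are placed directly in the
# sending buffer, so no staging copy is needed before transmission.
class PinnedPool:
    def __init__(self, size):
        self.buf = bytearray(size)    # stand-in for pinned memory
        self.offset = 0               # bump pointer for fresh space
        self.free_list = []           # (offset, length) of freed blocks
    def malloc(self, n):
        for i, (off, length) in enumerate(self.free_list):
            if length >= n:           # first fit; wastes length - n bytes
                del self.free_list[i]
                return off
        if self.offset + n > len(self.buf):
            raise MemoryError("pinned pool exhausted")
        off, self.offset = self.offset, self.offset + n
        return off
    def free(self, off, n):
        self.free_list.append((off, n))

pool = PinnedPool(1024)
a = pool.malloc(100)   # object placed directly in the pinned region
pool.free(a, 100)
b = pool.malloc(64)    # reuses the freed block instead of fresh space
```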
slide-40
SLIDE 40

Memory Management

65

OS Managed Application Memory Pinned RDMA Memory Sending Buffer Allocated Object Data Copy Allocated Object Unnecessary Data Copy between pinned buffer and application memory degrades performance Pinned RDMA Memory Sending Buffer Manually manage malloc() and free() over pre-pinned memory Allocated Object GDR further reduces the data copy by managing the

  • bjects manually over pinned memory
slide-41
SLIDE 41

Memory Management

66

OS Managed Application Memory Pinned RDMA Memory Sending Buffer Allocated Object Data Copy Allocated Object Unnecessary Data Copy between pinned buffer and application memory degrades performance Pinned RDMA Memory Sending Buffer Manually manage malloc() and free() over pre-pinned memory Allocated Object GDR further reduces the data copy by managing the

  • bjects manually over pinned memory
SLIDE 42

TensorFlow GDR

GDR has been contributed to the TensorFlow community (a commercial version is also available): https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/gdr
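In the contrib implementation, the GDR transport is selected through the server's protocol string. A minimal TF 1.x-style sketch; the host names and ports are placeholders, and the exact API surface may differ across TensorFlow versions:

```python
import tensorflow as tf  # TensorFlow 1.x, built with the contrib GDR module

# Placeholder cluster definition; real deployments list actual hosts.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# Protocol "grpc+gdr" routes tensor transfers through GPUDirect RDMA
# instead of copying through host memory over plain gRPC.
server = tf.train.Server(cluster, job_name="worker", task_index=0,
                         protocol="grpc+gdr")
```

This is a configuration fragment: it only makes sense on machines with RDMA NICs and a matching TensorFlow build.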

SLIDE 43

Benchmark

Benchmark charts for VGG16, BERT, and AlexNet (figures omitted).

SLIDE 44

The Evil of Parameter Server & Ring AllReduce

Parameter Server: Workers A, B, and C all exchange parameters with a single server, so a bottleneck link largely degrades performance in an over-subscribed network. Ring AllReduce: Workers A, B, C, and D pass data around a ring.


SLIDE 49

The Evil of Parameter Server & Ring AllReduce

In Ring AllReduce, when one hop is delayed due to congestion, the downstream workers cannot start transferring: the long dependency chain may cause the whole job to wait once a single hop blocks.
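To make the dependency chain concrete, here is a small simulated ring all-reduce (illustrative NumPy code, not CLUSTAR's implementation): each of the 2(n−1) steps consumes the chunk delivered by the previous hop, so one blocked hop stalls every later step.

```python
import numpy as np

# Simulated ring all-reduce over n workers: reduce-scatter followed
# by all-gather, 2*(n-1) sequential hops per chunk. Every hop reads
# the chunk delivered by the previous hop: the long dependency chain
# the slide criticizes.
def ring_allreduce(tensors):
    n = len(tensors)
    chunks = [list(np.array_split(t.astype(float), n)) for t in tensors]
    # Reduce-scatter: after n-1 steps, worker w owns the fully
    # reduced chunk (w + 1) % n.
    for step in range(n - 1):
        for w in range(n):
            c = (w - step) % n                  # chunk w forwards now
            chunks[(w + 1) % n][c] += chunks[w][c]
    # All-gather: circulate the fully reduced chunks for n-1 more steps.
    for step in range(n - 1):
        for w in range(n):
            c = (w + 1 - step) % n              # full chunk w owns now
            chunks[(w + 1) % n][c] = chunks[w][c]
    return [np.concatenate(chunks[w]) for w in range(n)]

grads = [np.arange(8.0) + i for i in range(4)]  # 4 workers' gradients
reduced = ring_allreduce(grads)                 # every worker: sum of all
```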

SLIDE 50

ParaExpress: Networking-aware Parameter Aggregation

Workers 1–8 sit in two racks (Rack 1 and Rack 2), each under its own ToR switch. Using real-time networking conditions, ParaExpress generates an optimal parameter aggregation topology of root, aggregator, and leaf nodes. The generated topology combines the advantages of the tree structure (Parameter Server) and the ring structure (Ring AllReduce).
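As a toy illustration of why aggregation topology matters (this is not ParaExpress itself, and the rack sizes and values are made up): reducing inside each rack first means only one flow per rack crosses the over-subscribed ToR uplink.

```python
import numpy as np

# Two-level aggregation sketch: leaf workers reduce inside their rack
# over cheap intra-rack links, then one aggregator per rack sends a
# single flow up to the root.
def hierarchical_aggregate(racks):
    rack_sums = [np.sum(workers, axis=0) for workers in racks]  # intra-rack
    return np.sum(rack_sums, axis=0)                            # cross-rack

racks = [np.ones((4, 3)), 2.0 * np.ones((4, 3))]  # 2 racks x 4 workers each
total = hierarchical_aggregate(racks)             # elementwise 4*1 + 4*2
```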

SLIDE 51

ParaExpress Architecture

ParaExpress Master: task queue, completion queue, execution graph resolver, operation pool, request manager, and traffic prioritization module. The master sends the embedding plan and prioritization to the ParaExpress Agent, which issues MPI requests over the high-speed network interface and changes the DSCP–priority mapping. The execution graph pipelines each tensor through receive (R1 … Rn), aggregate (A1 … An), and send (S1 … Sn) operations.

SLIDE 52

Highlighted Results

  • Compared with TensorFlow PS, Baidu Ring AllReduce, and Horovod, the software optimization of ParaExpress achieves 1.5–4.3X better performance.
  • In real-world environments, ParaExpress achieves 2.6X better results than Parameter Server and 3X better results than Ring AllReduce.

SLIDE 53

About CLUSTAR

SLIDE 54

World-leading Research Achievements

9 papers in top-tier networking conferences (SIGCOMM/NSDI) in the past 5 years. First in Asia.

  • "AuTO: Scaling Deep Reinforcement Learning to Enable Datacenter-Scale Automatic Traffic Optimization", ACM SIGCOMM 2018
  • "PowerMan: An Out-of-Band Management Network for Datacenters using Power Line Communication", USENIX NSDI 2018
  • "Resilient Datacenter Load Balancing in the Wild", ACM SIGCOMM 2017
  • "Enabling Wide-spread Communications on Optical Fabric with MegaSwitch", USENIX NSDI 2017
  • "Scheduling Mix-flows in Commodity Datacenters with Karuna", ACM SIGCOMM 2016
  • "CODA: Toward Automatically Identifying and Scheduling Coflows in the Dark", ACM SIGCOMM 2016
  • "Enabling ECN in Multi-Service Multi-Queue Data Centers", USENIX NSDI 2016
  • "Information-Agnostic Flow Scheduling for Commodity Data Centers", USENIX NSDI 2015
  • "Explicit Path Control in Commodity Data Centers: Design and Applications", USENIX NSDI 2015

Statistics for universities in Greater China (accepted papers):

  • HKUST: 9 (all from teams of CLUSTAR)
  • Tsinghua University: 5 (from different professors and labs)
  • Chinese Academy of Sciences: 3 (from different professors and labs)
  • Peking University: 1
  • Fudan University: 1
  • National Supercomputing Center in Wuxi: 1

SLIDE 55

Selected Clients

GDR
  • WeChat: GDR boosts the performance of moments classification (~3X).
  • SAIC: GDR boosts the performance of computer vision (~1.6X).
  • Federated learning: GDR boosts training performance, and MLT boosts long-distance communication. Progress: developing.

ParaExpress
  • ParaExpress improves the performance of AI training in a sophisticated cloud environment. Progress: POC.
  • High-speed networking virtualization for an AI unicorn¹. Progress: developing.

AI Consulting
  • Smart customer support system: the CLUSTAR platform speeds up AI training; building a next-gen AI platform together. Progress: developing.

¹ Client name withheld (NDA).

SLIDE 56

CLUSTAR Team

Kai CHEN, Founder
  • PhD, Northwestern University
  • Associate Professor, HKUST
  • Director of SING Lab, HKUST
  • 10+ years of research experience on DCN
  • 50+ top-tier networking conference papers (SIGCOMM/NSDI)

Qiang YANG, Co-founder
  • PhD, University of Maryland
  • Chair Professor and Department Head of CSE, HKUST
  • President of IJCAI
  • Director of WHAT Lab, HKUST
  • Founder of Transfer Learning
  • IEEE/ACM/AAAI Fellow
  • Founding director of the Huawei Noah's Ark Research Lab

Shuihai HU, VP of Technology
  • PhD, HKUST
  • Expertise in RDMA

Pin LYU, Director of Algorithms
  • 7 years of IBM software development

Junxue ZHANG, EVP
  • PhD, HKUST
  • Architect of the CLUSTAR platform

Yajing LYU, VP of Business
  • MBA, ESSEC
  • 6+ years of business experience

Junhuan SUN, VP of Engineering
  • 10+ years of engineering experience

Weiyan WANG, AI Scientist
  • PhD, HKUST
  • AutoML systems
SLIDE 57

Milestones

  • 2018.05: CLUSTAR is founded!
  • 2018.09: Join the Nvidia Inception Program
  • 2018.11: CLUSTAR v1.0 launched! Cooperation with SAIC
  • 2018.11: Cooperation with Sunshine Insurance
  • 2019.01: CLUSTAR v1.1 launched! Cooperation with WeChat
  • Angel funding
SLIDE 58

THANK YOU

jzhangcs@clustar.ai https://www.clustarai.com