CLUSTAR: AI Training Platform Powered by High Performance Networking
Junxue ZHANG, EVP of CLUSTAR; PhD, SING Lab, HKUST
August 1, 2018

Deep Learning Is Becoming Increasingly Important
Computer Vision, Natural Language Processing, Self-driving Cars
A Toy Example: One Training Iteration

Model: y_pred = w·x + b, with Input Layer and Output Layer; initialized w = 1, b = 1; learning rate η = 0.1.
Mini-batch of (x, y_label) pairs: (1, 5), (2, 7).

Forward Pass: y_pred(1) = 1·1 + 1 = 2, y_pred(2) = 1·2 + 1 = 3.

Calculating Loss: L = 1/2 · Σ (y_label − y_pred)²

Backpropagation:
∂L/∂w = Σ (y_pred − y_label)·x = (2 − 5)·1 + (3 − 7)·2 = −11
∂L/∂b = Σ (y_pred − y_label) = (2 − 5) + (3 − 7) = −7
w ← w − η·∂L/∂w = 1 − 0.1·(−11) = 2.1
b ← b − η·∂L/∂b = 1 − 0.1·(−7) = 1.7

Next Iteration: repeat with the next mini-batch, (3, 9), (5, 13).
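The update above can be sketched in a few lines of Python (an illustrative toy, not CLUSTAR code):

```python
# One forward pass + backpropagation step of the slide's toy example:
# fit y = w*x + b under the squared-error loss L = 1/2 * sum((y - y_pred)^2),
# with learning rate 0.1.

def sgd_step(w, b, batch, lr=0.1):
    """One mini-batch gradient-descent step on the linear model."""
    dw = db = 0.0
    for x, y in batch:
        y_pred = w * x + b           # forward pass
        err = y_pred - y             # dL/dy_pred
        dw += err * x                # dL/dw accumulated over the batch
        db += err                    # dL/db accumulated over the batch
    return w - lr * dw, b - lr * db  # gradient-descent update

w, b = 1.0, 1.0
w, b = sgd_step(w, b, [(1, 5), (2, 7)])
print(round(w, 6), round(b, 6))  # 2.1 1.7 -- matching the slide
```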
With Hidden Layers

Input Layer → Hidden Layers → Output Layer
Each iteration runs a forward pass through every layer, calculates the loss, then backpropagates layer by layer to update each layer's parameters.
Uber: training usually takes weeks or longer to complete [1]
Distributed Training with a Parameter Server

The parameter server holds the model parameters (x₁, x₂, …); Worker 1 and Worker 2 each train on their own data partition (Data Partition 1, Data Partition 2). Every iteration:

1. Pull parameters from the servers — each worker fetches the latest x₁, x₂.
2. Forward pass — each worker runs its input batch through the model.
3. Calculating loss — each worker computes the loss on its partition.
4. Backpropagation — each worker computes its local updates x₁′, x₂′.
5. Push parameters to the servers — the server aggregates the workers' updates.

Every step that moves parameters or updates crosses the network: networking is critical to performance!
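The five steps can be sketched as a single-process simulation (an illustrative toy reusing the earlier slides' model y = w·x + b, not CLUSTAR code; in a real deployment each worker is a separate machine, and the pull/push steps are the network traffic at issue):

```python
# Parameter-server loop: every worker pulls the current parameters, computes
# gradients on its own data partition, and pushes them back; the server
# aggregates the gradients and applies the update.

def worker_gradients(params, batch):
    """Steps 2-4: forward pass, loss, and backpropagation on one partition."""
    w, b = params
    dw = db = 0.0
    for x, y in batch:
        err = (w * x + b) - y        # forward pass + dL/dy_pred
        dw += err * x
        db += err
    return dw, db

def train_step(params, partitions, lr):
    """One iteration: pull -> compute on every worker -> push/aggregate."""
    grads = [worker_gradients(params, p) for p in partitions]
    dw = sum(g[0] for g in grads)    # server-side aggregation
    db = sum(g[1] for g in grads)
    return params[0] - lr * dw, params[1] - lr * db

partitions = [[(1, 5), (2, 7)],      # Worker 1's data partition
              [(3, 9), (5, 13)]]     # Worker 2's data partition
params = (1.0, 1.0)
for _ in range(1000):
    params = train_step(params, partitions, lr=0.02)
print(round(params[0], 2), round(params[1], 2))  # 2.0 3.0 (data fits y = 2x + 3)
```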
Speedup from high-performance networking, per model:

Model                   Speedup
Logistic Regression     2.59x
Multi-layer Perceptron  3.45x
AlexNet                 1.6x
VGG-16                  1.33x
ResNet-50               1.03x
Key Technologies (World-Leading Research Achievements)

GDR — between 2 machines — "wider roads"
ParaExpress — smart networking scheduling and aggregation across multiple machines, for both Parameter Server and Ring AllReduce — "traffic scheduling"
MLT — an AI protocol — "a new traffic rule for AI"

The importance of networking to an AI system equals that of the traffic system to a city.
Full-Stack AI Infrastructure

Applications: computer vision, speech recognition, natural language processing, autonomous driving, intelligent anti-fraud, intelligent marketing; industry solutions for finance, security, internet, manufacturing, healthcare, and government.

Platform: data preprocessing, offline training, online training, multi-tenant management, task scheduling, operations monitoring, Spark optimization, TensorFlow optimization, container orchestration engine, interactive programming; P4, E8 Storage, Clustar AI Fabrics, RoCE, smart NICs, programmable networking.

Hardware: CPU, GPU, FPGA, ASIC, RDMA networking, all-flash storage — Intel, Nvidia, AMD, Cambricon, Mellanox, Broadcom.
The Data Copy Problem

Each server has two CPU sockets (Socket 1, Socket 2), each with its own memory and GPUs; an RDMA NIC connects Server 1 and Server 2 through the data center network. Without GDR, data leaving a GPU must first be copied into host memory before the RNIC can transmit it, and copied again on the receiving side. The unnecessary copy between RNIC/memory and GPU RAM/memory enlarges latency, degrades throughput, and burns CPU.
GDR removes the unnecessary copy to boost performance: the RNIC reads from and writes to GPU memory directly, bypassing host memory.
On the host side, RDMA can only send from pinned memory. With OS-managed application memory, every allocated object must be copied into the pinned RDMA memory sending buffer before transmission; this unnecessary data copy between the pinned buffer and application memory degrades performance. GDR further reduces the data copy by manually managing malloc() and free() over pre-pinned memory, so objects are allocated directly inside the pinned region and can be sent without the extra copy.
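One way to picture "malloc() and free() over pre-pinned memory" is a free-list pool. A minimal sketch (an assumption-laden illustration, not CLUSTAR's allocator: fixed-size slots, and a plain bytearray standing in for the RDMA-registered region):

```python
# Free-list allocator over one pre-"pinned" region. A real implementation
# registers the region with the RDMA verbs API once; afterwards, malloc/free
# only move slot indices around -- no data is copied and nothing is re-pinned.

class PinnedPool:
    """Hands out fixed-size slots from one pre-registered buffer."""

    def __init__(self, slot_size=4096, slots=16):
        self.slot_size = slot_size
        self.region = bytearray(slot_size * slots)  # "pinned" once, up front
        self.free_slots = list(range(slots))        # free list of slot indices

    def malloc(self):
        """Return the byte offset of a free slot inside the pinned region."""
        if not self.free_slots:
            raise MemoryError("pinned pool exhausted")
        return self.free_slots.pop() * self.slot_size

    def free(self, offset):
        """Return a slot to the free list; no copy, no unpinning."""
        self.free_slots.append(offset // self.slot_size)

pool = PinnedPool()
off = pool.malloc()
pool.region[off:off + 5] = b"hello"  # write directly into "pinned" memory
pool.free(off)                       # slot is reusable, region stays pinned
```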
GDR has been contributed to the TensorFlow community (a commercial version is also available): https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/gdr
[Benchmark chart: results on VGG16, BERT, and AlexNet]
Limitations at Scale

Parameter Server largely degrades from congested links due to over-subscribed networking: traffic from Workers A, B, and C converges on a bottleneck link into the parameter server.

Ring AllReduce (Workers A-D in a ring) has a long dependency chain: once one hop is delayed due to congestion, downstream workers cannot start transferring, and the whole job may wait.
ParaExpress: Generate Optimal Parameter Aggregation

[Diagram: nodes 1-8 across Rack 1 and Rack 2 under two ToR switches; given real-time networking conditions, ParaExpress assigns root, aggregator, and leaf roles.] The generated parameter-aggregation topology has the advantages of both the tree structure (Parameter Server) and the ring structure (Ring AllReduce).
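For reference, the ring structure mentioned above can be simulated in a few lines; this is an illustrative sketch of plain ring all-reduce, not ParaExpress's generated topology:

```python
# Ring all-reduce over n simulated nodes: each node's buffer is split into n
# chunks; chunks travel around the ring for n-1 reduce-scatter steps and n-1
# all-gather steps, after which every node holds the full element-wise sum.

def ring_allreduce(tensors):
    n = len(tensors)
    chunks = [list(t) for t in tensors]                # each node's local copy
    size = len(chunks[0])
    bounds = [(k * size // n, (k + 1) * size // n) for k in range(n)]

    def send(src, dst, c, accumulate):
        lo, hi = bounds[c]
        for j in range(lo, hi):
            if accumulate:
                chunks[dst][j] += chunks[src][j]       # reduce-scatter: add
            else:
                chunks[dst][j] = chunks[src][j]        # all-gather: overwrite

    for step in range(n - 1):                          # reduce-scatter phase
        for i in range(n):
            send(i, (i + 1) % n, (i - step) % n, accumulate=True)
    for step in range(n - 1):                          # all-gather phase
        for i in range(n):
            send(i, (i + 1) % n, (i + 1 - step) % n, accumulate=False)
    return chunks

print(ring_allreduce([[1, 2, 3, 4], [10, 20, 30, 40]]))
# [[11, 22, 33, 44], [11, 22, 33, 44]]
```

Each node sends only one chunk per step, which is why the ring uses bandwidth well but stalls end to end when any single hop is delayed.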
MLT Architecture

Components: task queue, completion queue, execution-graph resolver, operation pool, request manager, and a traffic prioritization module. The module embeds plan prioritization into the execution graph and changes the DSCP-to-priority mapping on the high-speed network interface.
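Priority-to-DSCP marking is commonly applied per socket via the IP TOS byte. A minimal sketch, assuming a Linux host and arbitrary example values (not CLUSTAR's actual mapping):

```python
import socket

# Map a flow priority to a DSCP value and stamp it on a socket. DSCP occupies
# the top 6 bits of the IP TOS byte, so the kernel expects dscp << 2. The
# mapping below is an illustrative assumption (best-effort ... EF).

PRIORITY_TO_DSCP = {0: 0, 1: 10, 2: 26, 3: 46}

def open_prioritized_socket(priority):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    dscp = PRIORITY_TO_DSCP[priority]
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp << 2)
    return sock

sock = open_prioritized_socket(3)  # mark this flow as highest priority
tos = sock.getsockopt(socket.IPPROTO_IP, socket.IP_TOS)
print(tos >> 2)  # 46
sock.close()
```

Switches configured for DSCP-based queuing then schedule the marked flows accordingly, which is the per-flow knob a prioritization module can turn.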
ParaExpress achieves 1.5-4.3x better results than Ring AllReduce.
9 papers in top-tier networking conferences (SIGCOMM/NSDI) in the last 5 years — first in Asia.

Accepted papers from universities in greater China:

University                              Accepted papers
HKUST                                   9 (all from CLUSTAR's teams)
Tsinghua University                     5 (different professors and labs)
Chinese Academy of Sciences             3 (different professors and labs)
Peking University                       1
Fudan University                        1
National Supercomputing Center in Wuxi  1
Selected Clients

GDR
- WeChat: Moments classification, ~3x speedup
- SAIC: computer vision, ~1.6x speedup
- An AI unicorn¹: Federated Learning over long-distance communication — progress: developing

ParaExpress
- Improving training performance in a sophisticated cloud environment — progress: POC
- High-speed networking virtualization — progress: developing

AI Consulting
- Smart customer support; building a next-generation AI platform together to speed up AI training — progress: developing

¹ Client name withheld due to NDA.
Team

- Kai CHEN — Founder (SIGCOMM/NSDI)
- Qiang YANG — Co-founder (HKUST)
- Shuihai HU — VP of Technology
- Pin LYU — Director of Algorithm
- Junxue ZHANG — EVP
- Yajing LYU — VP of Business
- Junhuan SUN — VP of Engineering
- Weiyan WANG — AI Scientist
Milestones

- 2018.05 — CLUSTAR is founded!
- 2018.01 / 2018.09 — Join Nvidia Inception Program
- 2018.11 — CLUSTAR v1.0 launched! Cooperation with SAIC
- 2019.01 — CLUSTAR v1.1 launched! Cooperation with WeChat
- 2017.08 / 2018.11 — Cooperation with Sunshine Insurance; angel funding
- 2018.03