Toward Highly Available, Intelligent Cloud and ML Systems
Chuanxiong Guo Bytedance NetAI 2018
Outline:
- Background: system/networking meets ML
- Deepview: ML for availability improvement of cloud systems
- RDMA for scalable ML
Networking systems are built from explicitly coded protocols, interfaces, and principles (client/server, sockets, TCP, IP, NICs), so packets can be traced and explained. Machine learning systems instead learn models from data without explicit programming, through a pipeline of labeling, training datasets, training, and inference, as in computer vision and speech.
ML and networking/systems help each other:
- ML helps to improve system/network availability
- Networking helps to scale and accelerate ML systems
[Diagram: software systems connect a code repo to a data repo through deployment/provisioning, resource management, config/management, and monitoring.]
Availability targets: 99.999% (five nines) means about 5 minutes of downtime per year; 99.99% means about 53 minutes of downtime per year.
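The downtime figures can be checked directly from the availability percentages; a minimal sketch (the function name is mine, using a 365.25-day year):

```python
# Downtime per year implied by an availability target
# (a minimal sketch, not from the talk; 365.25-day year assumed).
MIN_PER_YEAR = 365.25 * 24 * 60  # ≈ 525,960 minutes

def downtime_minutes(availability: float) -> float:
    """Expected unavailable minutes per year for a given availability."""
    return (1.0 - availability) * MIN_PER_YEAR

print(round(downtime_minutes(0.99999), 1))  # five nines → ≈ 5.3 min/year
print(round(downtime_minutes(0.9999), 1))   # four nines → ≈ 52.6 min/year
```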
Lessons learned: high availability needs automation across the full operations loop: design and implement → deployment/provisioning → resource management → monitoring → incident detection → localization → resolution and mitigation → prevention. Availability fundamentals and gray failures motivate systems such as Pingmesh, Netbouncer, Deepview, Panorama, and ByteBrain.
Subsystems inside a datacenter: compute clusters (hosts running hypervisors and VMs) and storage clusters, interconnected by a Clos-like network. Virtual hard disks (VHDs) are provisioned from clusters different from the ones hosting the VMs, and the hypervisor redirects disk access to remote storage. When remote storage becomes unreachable (e.g., a power failure takes down a rack), a VM can pause indefinitely; exposing such failures to customers would break their app-level SLAs.
Breakdown of unplanned VM downtime in a year: VHD failure 52%, SW failure 41%, HW failure 6%, unknown 1%.
Bipartite model: compute clusters C1–C4 on one side and storage clusters S1–S3 on the other, with an edge for each (compute, storage) pair whose VMs mount VHDs across it. The grid view renders the VHD failure rate of every pair as one cell.
Example compute cluster failure: C2 fails, and every cell in C2's row of the grid lights up. Example storage cluster failure: S1 fails (here a gray failure), and every cell in S1's column lights up.
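These failure patterns can be sketched on a toy grid; the layout (rows = compute clusters, columns = storage clusters) follows the slide, while the threshold and failure rates below are hypothetical:

```python
import numpy as np

# Toy grid view: rows = compute clusters C1..C4, cols = storage clusters
# S1..S3, cell = VHD failure rate for that (compute, storage) pair.
# A failed compute cluster shows up as a hot row; a failed storage
# cluster as a hot column. Numbers and threshold are hypothetical.
grid = np.zeros((4, 3))
grid[1, :] = 0.8   # hypothetical: C2 failed → its entire row is hot

hot_rows = np.where(grid.mean(axis=1) > 0.5)[0]
hot_cols = np.where(grid.mean(axis=0) > 0.5)[0]
print(hot_rows, hot_cols)  # row index 1 (C2) is flagged, no columns
```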
Summary of our goal: a system to localize VHD failures to underlying failures in compute, storage, or network subsystems within a time budget of 15 minutes. The time budget is set by the production team to meet availability goals.
Remaining challenges:
- A generalized model that includes network devices
- A Lasso regression / hypothesis-testing algorithm
- A streaming data pipeline
System of linear equations (blue: observable; red: unknown; purple: topology). Each VM i traverses a path Path(i) of components: its compute cluster, the Clos network, and its storage cluster. Assume independent failures, and let component j be healthy with probability β_j = exp(x_j). With e_i = number of VMs with VHD failures and n_i = total number of VMs on path i:

    1 − e_i/n_i ≈ ∏_{j ∈ Path(i)} β_j

Taking logarithms yields the linear system

    y_i = Σ_{j ∈ Path(i)} x_j + ε_i,  where y_i = log(1 − e_i/n_i), x_j = log β_j, and ε_i = measurement noise,

or in matrix form y = Ax + ε, with A encoding the topology. Failures are rare, so the solution should be sparse (most x_j ≈ 0), which motivates the Lasso objective function:

    x̂ = argmin_{x ∈ ℝ^n, x ≤ 0} ‖y − Ax‖₂² + λ‖x‖₁

Example (topology: compute clusters C1, C2; network Net; storage clusters S1, S2):

    y_1 = x_C1 + x_Net + x_S1 + ε_1
    y_2 = x_C1 + x_Net + x_S2 + ε_2
    y_3 = x_C2 + x_Net + x_S1 + ε_3
    y_4 = x_C2 + x_Net + x_S2 + ε_4
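The Lasso step can be illustrated in a few lines. This is a hedged sketch, not Deepview's implementation: the function name, the projected-ISTA solver, and the simulated failure are all mine. It solves min ½‖y − Ax‖² + λ‖x‖₁ subject to x ≤ 0 on the four-path example topology:

```python
import numpy as np

def lasso_localize(A, y, lam=0.01, steps=5000):
    """Minimize 0.5*||y - Ax||^2 + lam*||x||_1 s.t. x <= 0 (projected ISTA)."""
    x = np.zeros(A.shape[1])
    lr = 1.0 / np.linalg.norm(A, 2) ** 2      # step size 1/L (L = ||A||_2^2)
    for _ in range(steps):
        z = x - lr * A.T @ (A @ x - y)        # gradient step on the quadratic
        z = np.sign(z) * np.maximum(np.abs(z) - lr * lam, 0.0)  # soft-threshold
        x = np.minimum(z, 0.0)                # project onto x <= 0
    return x

# Columns: [C1, C2, Net, S1, S2]; rows: the four example paths.
A = np.array([[1, 0, 1, 1, 0],
              [1, 0, 1, 0, 1],
              [0, 1, 1, 1, 0],
              [0, 1, 1, 0, 1]], dtype=float)
# Simulate a gray failure of S1: paths through S1 lose ~10% of their VMs.
y = np.log(np.array([0.9, 1.0, 0.9, 1.0]))
x = lasso_localize(A, y)
print(np.argmin(x))  # → 3, i.e. S1 is blamed
```

The nonpositivity projection enforces x_j = log β_j ≤ 0, and the L1 penalty keeps the blame sparse.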
Deepview pipeline (built on the Kusto engine): inputs (VHD failure events, VM info, storage accounts, network topology, VMs per path) arrive over real-time and non-real-time channels into an ingestion pipeline, which maintains a sliding window over the raw data. A near-real-time scheduler runs the algorithm on each window, and the output drives actions, alerts, and visualization.
[Figure: unplanned ToR reboots in a region (ToR_11–ToR_15 across storage clusters STR_01–STR_07).]
[Figure: number of VMs with VHD failures per hour during a storage cluster gray failure.]
ToR availability: estimated from the fraction of ToRs that reboot over the measurement period, splitting software reboots (short) from hardware repairs (long) and weighting each by its average duration. The resulting ToR availability is above five nines, so ToRs are not on the critical path for VMs to achieve five-nines availability.
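The shape of that calculation can be sketched with stand-in numbers; the reboot rate, SW/HW split, and durations below are hypothetical, not the measured values from the talk:

```python
# Illustrative ToR-availability estimate. All inputs are hypothetical
# placeholders, not measurements: the point is the formula, not the numbers.
MIN_PER_YEAR = 365.25 * 24 * 60

tor_reboots_per_tor_year = 0.05        # hypothetical: 5% of ToRs reboot per year
frac_sw, sw_minutes = 0.9, 5.0         # hypothetical: 90% software reboots, ~5 min
frac_hw, hw_minutes = 0.1, 120.0       # hypothetical: 10% hardware repairs, ~2 h

expected_downtime = tor_reboots_per_tor_year * (
    frac_sw * sw_minutes + frac_hw * hw_minutes
)  # expected ToR downtime in minutes per year
availability = 1 - expected_downtime / MIN_PER_YEAR
print(f"{availability:.6%}")  # comfortably above 99.999%
```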
The same loop, seen through OPS and Admin roles: design and implement → deployment/provisioning → resource management → monitoring → incident detection → localization → resolution and mitigation → prevention.
Bytedance content platform.
[Figure: image classification example — AlexNet recognizing a cat.]
DNN training alternates a forward pass and a backward pass over each minibatch. In distributed training, GPU servers compute gradients and exchange them with parameter servers, which aggregate updates and serve the latest parameters.
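A minimal parameter-server sketch (a toy illustration, not Arnold's implementation; the class and method names are my own):

```python
import numpy as np

# Toy parameter server: workers push per-shard gradients; the server
# averages them, applies an SGD step, and returns the updated parameters.
class ParameterServer:
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def push_pull(self, grads):
        """Average pushed gradients, apply one SGD step, return new weights."""
        self.w -= self.lr * np.mean(grads, axis=0)
        return self.w

ps = ParameterServer(dim=3)
worker_grads = [np.array([1.0, 0.0, 2.0]), np.array([3.0, 0.0, 0.0])]
w = ps.push_pull(worker_grads)
print(w)  # averaged gradient [2, 0, 1] scaled by -lr
```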
Arnold (ByteDance's ML platform) stack: compute (GPU, CPU, FPGA, ASIC), network (RDMA), and storage (CephFS) at the bottom; Mesos, Nvidia Docker, Metis, and SCM for resource management; the Arnold agent, Arnold API, Arnold SDK, and Web UI on top, supporting TensorFlow, MXNet, Caffe, and TensorBoard.
TCP/IP vs. RDMA stacks (user / kernel / hardware): a TCP/IP app crosses the kernel TCP/IP stack and NIC driver before reaching the Ethernet NIC, whereas an RDMA app issues RDMA verbs and the NIC's RDMA transport DMAs data directly to and from user memory, bypassing the kernel, over a lossless Ethernet network with congestion control [SIGCOMM'15, CoNEXT'17].
- TCP/IP: poor TCP performance, low network bandwidth
- RDMA: much higher bandwidth, reduced communication time, scales the cluster to thousands of GPU cards
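A back-of-envelope view of why the bandwidth jump matters for communication time; the model size and link efficiency are assumptions for illustration, not figures from the talk:

```python
# Time to move one full gradient set at a given link speed.
# Hypothetical inputs: ~250 MB of gradients (AlexNet-scale) and 90%
# effective link utilization.
model_bytes = 250e6

def transfer_seconds(bytes_, gbit_per_s, efficiency=0.9):
    """Seconds to transfer bytes_ over a link of gbit_per_s."""
    return bytes_ * 8 / (gbit_per_s * 1e9 * efficiency)

t_tcp = transfer_seconds(model_bytes, 10)    # 10 GbE
t_rdma = transfer_seconds(model_bytes, 100)  # 100 GbE RDMA
print(f"10GbE: {t_tcp:.3f}s  100GbE RDMA: {t_rdma:.3f}s")
```

At the same efficiency, 100GbE cuts per-iteration communication time by 10x, which is what lets the cluster scale out.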
[Figure: throughput of multiple 100GbE sending servers into one receiving server through a switch, with PFC and ECN providing losslessness and congestion control.]
[Figure: training speed comparison at batch sizes 32 and 64.]
Training runs for epochs 0…M, each consisting of minibatches 0…N. Each minibatch performs a forward pass through layers f_0, f_1, …, f_{n-1}, then a backward pass b_{n-1}, …, b_0. As the backward pass produces the gradient g_j of layer j, the corresponding send s_j can start immediately (s_{n-1}, g_{n-1} first, down to s_0, g_0), overlapping gradient communication with the remaining backward computation.
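A toy timing model of this overlap, with hypothetical per-layer compute and send times:

```python
# Hypothetical numbers: n layers, each backward step takes t_b seconds,
# each per-layer gradient send takes t_s seconds (t_s <= t_b assumed, so
# every send finishes before the next gradient is ready).
n, t_b, t_s = 8, 1.0, 0.5

# No overlap: run the whole backward pass, then send all gradients.
serial = n * t_b + n * t_s

# Overlap: s_j runs while layers j-1..0 still compute backward, so only
# the final send (for layer 0) is left exposed after the backward pass.
overlapped = n * t_b + t_s
print(serial, overlapped)  # → 12.0 8.5
```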