Toward Highly Available, Intelligent Cloud and ML Systems - PowerPoint PPT Presentation



SLIDE 1

Toward Highly Available, Intelligent Cloud and ML Systems

Chuanxiong Guo Bytedance NetAI 2018

1

SLIDE 2

Outline

  • Background: System/networking meets ML
  • Deepview: ML for availability improvement of cloud systems
  • RDMA for scalable ML training acceleration
  • Summary

2

SLIDE 3

Two Different Approaches

3

[Diagram: client/server network stack - socket, TCP, IP, packets, NIC; protocols and interfaces]

  • Networks/systems are designed by following principles
  • Interfaces are explicitly defined, protocols are explicitly coded, and packets can be traced and explained

[Diagram: ML workflow - data, labeling, training dataset, training, model, inference]

  • Models in machine learning are learned from data without explicit programming
  • Deep learning made breakthroughs in computer vision and speech

SLIDE 4

Networking Meets Machine Learning

4

ML and networking/systems help each other: ML improves system/network availability, and networking scales and accelerates ML systems.

SLIDE 5

[Diagram: cloud software stack - Code Repo, Data Repo, Software Systems, Deployment/Provisioning, Resource Mgmt, Config/Management, Monitoring]

Software Rules the Clouds

5

SLIDE 6

Incidents, Incidents, Incidents

6

SLIDE 7

System Availability is Plagued by Incidents

99.999% availability = 5 min downtime per year
99.99% availability = 53 min downtime per year

7

Availability A = Σᵢ Uᵢ / (Σᵢ Uᵢ + Σᵢ Dᵢ), where Uᵢ is uptime and Dᵢ is downtime

SLIDE 8

[Diagram: cloud software stack - Code Repo, Data Repo, Software Systems, Deployment/Provisioning, Resource Mgmt, Config/Management, Monitoring]

Incident Handling Practice

8

Lessons learned

SLIDE 9

[Diagram: Dev/Ops incident loop - Design & implement (Dev); Deployment, Provisioning, Monitoring, Resource mgmt (Ops); incident detection & localization; incident resolution & mitigation; incident prevention; automation]

9

Availability fundamentals and related systems: Gray failure, Panorama, ByteBrain, Deepview, NetBouncer, Pingmesh

SLIDE 10

10

Deepview for Virtual Disk Failure Diagnosis

  • A case where ML helps system availability
SLIDE 11

VM Availability

  • IaaS is one of the largest cloud services today
  • High VM availability is a key performance metric
  • Yet, achieving 99.999% VM uptime remains a challenge

11

  • 1. What is the VM availability bottleneck?
  • 2. How to eliminate it?
SLIDE 12

Clos Network

IaaS Architecture

  • Compute and storage clusters connected by a Clos-like network
  • Compute-storage separation
  • VMs and Virtual Hard Disks (VHDs) provisioned from different clusters
  • Hypervisor transparently redirects disk access to remote storage
  • Keeps data available during a localized power failure to a rack

[Diagram: subsystems inside a datacenter - Compute Cluster (Host running Hypervisor and VMs), Clos Network, Storage Cluster]

12

SLIDE 13

A New Type of Failure: VHD Failures

  • Infra failures can disrupt VHD access
  • Hypervisor can retry, but not indefinitely
  • Hypervisor will crash the VM to surface failures to the customer
  • Allows customers to take actions to keep their app-level SLAs

13

[Diagram: subsystems inside a datacenter - Compute Cluster (Host, Hypervisor, VMs), Clos Network, Storage Cluster]

How much do VHD failures impact VM availability?

SLIDE 14

Availability Bottleneck

  • VHD failure localization is the bottleneck
  • 52% of unplanned VM downtime
  • Takes tens of minutes to hours to localize
  • This talk: quick and accurate failure localization

Breakdown of unplanned VM downtime in a year: VHD failure 52%, SW failure 41%, HW failure 6%, unknown 1%

14

SLIDE 15

Failure Triage was Slow and Inaccurate

  • SREs from each team check their subsystem for anomalies to match the incident
  • e.g. compute host heart-beats, storage perf-counters, network link discards
  • Incidents get ping-ponged among different teams due to false positives
  • Inaccurate diagnosis and delayed mitigation
  • Gray failures in network and storage are hard to catch
  • Troubled but not totally down, e.g. performance issues or software bugs
  • Only fail a subset of VHD requests
  • Can take hours to localize

15

SLIDE 16

Deepview Approach: Global View

[Figure: bipartite model and grid view - compute clusters C1-C4 on the left, storage clusters S1-S3 on the right]

  • Isolate failures by examining interactions between subsystems
  • Instead of alerting every SRE team to check if their subsystem is at fault
  • Bipartite model
  • Compute Clusters (left) : Storage Clusters (right)
  • VMs are provisioned from compute/storage cluster pair
  • Edge weight = VHD failure rate
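The edge weights above can be sketched as follows (the per-VM records and cluster names are hypothetical, purely for illustration):

```python
from collections import defaultdict

# Hypothetical per-VM records: (compute_cluster, storage_cluster, had_vhd_failure)
vms = [
    ("C1", "S1", False), ("C1", "S1", True), ("C1", "S2", False),
    ("C2", "S1", True), ("C2", "S1", True), ("C2", "S2", False),
]

counts = defaultdict(lambda: [0, 0])  # (compute, storage) -> [failures, total]
for c, s, failed in vms:
    counts[(c, s)][1] += 1
    if failed:
        counts[(c, s)][0] += 1

# Edge weight = VHD failure rate for each compute/storage cluster pair
weights = {edge: f / n for edge, (f, n) in counts.items()}
```

A failed cluster then shows up as a row or column of hot edges in the grid view.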

16

SLIDE 17

Our Approach: Global View

[Figure: grid views of example failures - a compute cluster failure (C2 fails, affecting its entire row) and a storage cluster gray failure (S1 fails, affecting its entire column)]

17

SLIDE 18

Challenges

Remaining challenges:

  • 1. Need to pinpoint network failures
  • 2. Need to handle gray failures
  • 3. Need to be near-real-time

Corresponding solutions: (1) a generalized model that includes network devices; (2) a Lasso regression / hypothesis-testing algorithm; (3) a streaming data pipeline.

Summary of our goal: a system that localizes VHD failures to underlying failures in the compute, storage, or network subsystems within a time budget of 15 minutes.

18

Time budget set by production team to meet availability goals

SLIDE 19

Deepview Model: Include the Network

19

[Diagram: compute cluster and storage cluster connected by a Clos network]

  • Need to handle multipath and ECMP
  • Simplify Clos network to a tree by aggregating network devices
  • Can model at the granularity of clusters or ToRs
SLIDE 20

Deepview Model: Estimate Component Health

๐๐ฌ๐ฉ๐œ ๐ช๐›๐ฎ๐ข ๐ฃ ๐ฃ๐ญ ๐ข๐Ÿ๐›๐ฆ๐ฎ๐ข๐ณ = เท‘

๐คโˆˆ๐ช๐›๐ฎ๐ข(๐ฃ)

๐๐ฌ๐ฉ๐œ ๐๐ฉ๐ง๐ช๐ฉ๐จ๐Ÿ๐จ๐ฎ ๐ค ๐ฃ๐ญ ๐ข๐Ÿ๐›๐ฆ๐ฎ๐ข๐ณ ๐Ÿ โˆ’ ๐Ÿ๐ฃ ๐จ๐ฃ = เท‘

๐คโˆˆ๐ช๐›๐ฎ๐ข(๐ฃ)

๐ช๐ค ๐ฆ๐ฉ๐ก ๐Ÿ โˆ’ ๐Ÿ๐ฃ ๐จ๐ฃ = เท

๐คโˆˆ๐ช๐›๐ฎ๐ข(๐ฃ)

๐ฆ๐ฉ๐ก ๐ช๐ค ๐ณ๐ฃ = เท

๐ค=๐Ÿ ๐Ž

๐›„๐ค ๐ฒ๐ฃ๐ค+ ๐›‡๐ฃ

๐ณ๐ฃ=๐ฆ๐ฉ๐ก ๐Ÿ โˆ’

๐Ÿ๐ฃ ๐จ๐ฃ

๐›„๐ค=๐ฆ๐ฉ๐ก ๐ช๐ค ๐›‡๐ฃ=measurement noise System of Linear Equations Blue: observable Red: unknown Purple: topology

20

Component j is healthy with ๐ช๐ค = ๐Ÿ๐ฒ๐ช(๐›„๐ค)

  • ฮฒj = 0, clear component j
  • ฮฒj โ‰ช 0, may blame it

*Assume independent failures ๐Ÿ๐ฃ=num of VMs crashed ๐’๐ฃ=num of VMs
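The linear system can be assembled mechanically from per-path failure counts; a minimal sketch, with hypothetical paths and counts:

```python
import math

# Hypothetical paths: (components on the path, VMs crashed e_i, VMs total n_i)
components = ["C1", "C2", "Net", "S1", "S2"]
paths = [
    (["C1", "Net", "S1"], 5, 100),
    (["C1", "Net", "S2"], 0, 100),
    (["C2", "Net", "S1"], 4, 80),
    (["C2", "Net", "S2"], 0, 120),
]

# y_i = log(1 - e_i/n_i); x_ij = 1 iff component j lies on path i
y = [math.log(1.0 - e / n) for _, e, n in paths]
X = [[1 if c in comps else 0 for c in components] for comps, _, _ in paths]
```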

SLIDE 21

Deepview Algorithm: Prefer Simpler Explanation via Lasso

  • Potentially #unknowns > #equations
  • Traditional least-square regression would fail

Sparsity

เทก ๐›„ = ๐›๐ฌ๐ก๐ง๐ฃ๐จ

๐›„โˆˆโ„๐Ž,๐›„โ‰ค๐Ÿ

๐ณ โˆ’ ๐˜๐›„ ๐Ÿ‘ + ๐› ๐›„ ๐Ÿ Lasso Objective Function: ๐ณ๐Ÿ = ๐›„๐๐Ÿ + ๐›„๐จ๐Ÿ๐ฎ + ๐›„๐ญ๐Ÿ + ๐›‡๐Ÿ ๐ณ๐Ÿ‘ = ๐›„๐๐Ÿ + ๐›„๐จ๐Ÿ๐ฎ + ๐›„๐ญ๐Ÿ‘ + ๐›‡๐Ÿ‘ ๐ณ๐Ÿ’ = ๐›„๐๐Ÿ‘ + ๐›„๐จ๐Ÿ๐ฎ + ๐›„๐ญ๐Ÿ + ๐›‡๐Ÿ’ ๐ณ๐Ÿ“ = ๐›„๐๐Ÿ‘ + ๐›„๐จ๐Ÿ๐ฎ + ๐›„๐ญ๐Ÿ‘ + ๐›‡๐Ÿ“

Net C1 C2 S1 S2

๐ณ๐ฃ = เท

๐ค=๐Ÿ ๐Ž

๐›„๐ค ๐ฒ๐ฃ๐ค+ ๐›‡๐ฃ

21

Example:

  • But multiple simultaneous failures are rare
  • How to encode this domain knowledge mathematically?
  • Equivalent to preferring most β_j to be zero
  • Lasso regression can get sparse solutions efficiently
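A minimal sketch of the constrained Lasso step (a toy coordinate-descent solver, not the production algorithm; the topology and failure counts are hypothetical):

```python
import numpy as np

# Paths (rows) x components (cols): Net, C1, C2, S1, S2 -- hypothetical toy topology
X = np.array([
    [1, 1, 0, 1, 0],   # path C1 -> Net -> S1
    [1, 1, 0, 0, 1],   # path C1 -> Net -> S2
    [1, 0, 1, 1, 0],   # path C2 -> Net -> S1
    [1, 0, 1, 0, 1],   # path C2 -> Net -> S2
], dtype=float)

crashed = np.array([5.0, 0.0, 5.0, 0.0])   # e_i: VMs crashed per path
total = np.array([100.0, 100.0, 100.0, 100.0])  # n_i: VMs per path
y = np.log(1.0 - crashed / total)  # y_i = log(1 - e_i/n_i)

def lasso_nonpositive(X, y, lam=0.01, iters=200):
    """Coordinate descent for min ||y - X b||^2 + lam*||b||_1 subject to b <= 0."""
    beta = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(iters):
        for j in range(len(beta)):
            r = y - X @ beta + X[:, j] * beta[j]  # residual excluding component j
            b = (X[:, j] @ r + lam / 2.0) / col_sq[j]
            beta[j] = min(0.0, b)  # on b <= 0 the |b| penalty is -lam*b, so just clip
    return beta

beta = lasso_nonpositive(X, y)
# S1 alone explains both failing paths, so the sparse solution blames S1
```

Here both failing paths share S1, and the L1 penalty makes blaming S1 alone cheaper than blaming C1 and C2 separately.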
SLIDE 22

Deepview Algorithm: Principled Blame Decision via Hypothesis Testing

  • Need a binary decision (flag/clear) for each component
  • Ad-hoc thresholds do not work reliably
  • Can we make a principled decision?
  • If the estimated failure probability is worse than average, it is likely a real failure
  • Automate this empirical decision criterion with a hypothesis test:
  • Rejecting H0_j means blaming component j
  • Otherwise, clear component j

๐ˆ๐Ÿ ๐ค : ๐›„๐ค = เดฅ ๐›„ ๐ฐ๐ญ. ๐ˆ๐ ๐ค : ๐›„๐ค < เดฅ ๐›„

22

SLIDE 23

Deepview System Architecture: NRT Data Pipeline

23

[Diagram: near-real-time data pipeline - inputs (real-time: VHD Failure, VM Info, StorageAcct; non-real-time: Net Topo, VMsPerPath) flow into an ingestion pipeline built on the Kusto engine (raw data, sliding window of input); a near-real-time scheduler runs the algorithm; outputs are actions, alerts, and visualization]

SLIDE 24

Some Statistics

  • Analyzed Deepview results for one month
  • Daily VHD failures: hundreds to tens of thousands
  • Detected 100 failure instances
  • 70 matched with existing tickets, 30 were previously undetected
  • Reduced unclassified VHD failures to fewer than 500 per day
  • The remainder are single-host failures or customer mistakes (e.g. expired storage accounts)

24

SLIDE 25

Case Study 1: Unplanned ToR Reboot

  • Unplanned ToR reboots can cause VMs to crash
  • We knew this could happen, but not where or when
  • Deepview can flag those ToRs
  • The figure shows a ToR down in one small region
  • Blamed the right ToR among 288 components
  • Associates VM downtime with ToR failures
  • Quantifies the impact of the ToR as a single point of failure on VM availability

[Figure: grid view of ToR_11-ToR_15 vs. STR_01-STR_07 during an unplanned ToR reboot in a region]

25

SLIDE 26

Case Study 2: Storage Cluster Gray Failure

  • Impacts only a subset of VMs
  • A storage cluster was brought online with a bug that puts some VHDs in a negative cache
  • Deepview flagged the faulty storage cluster almost immediately, while manual triage took 20+ hours

[Figure: number of VMs with VHD failures per hour during a storage cluster gray failure]

26

SLIDE 27

Deepview Insight: ToR as a Single Point of Failure

  • Reduced network cost vs. availability cost of using a single ToR per rack
  • Unplanned ToR failures: soft failures (recoverable by reboot) vs. hard failures

ToR Availability
= 1 − (%soft × soft dur. + %hard × hard dur.) × (frac. of ToRs rebooted per month) / (time in a month)
= 1 − (90% × 20 min + 10% × 120 min) × 0.1% / (30 × 24 × 60 min)
= 99.99993%

  • Dependent services (ToRs) need to provide one extra nine over the target service (VMs)

27

ToRs are not on the critical path for VMs to achieve five-nines availability
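The arithmetic behind the ToR availability estimate can be sketched as follows (figures as reconstructed from the slide, so treat them as approximate):

```python
# Reconstructed ToR availability arithmetic
SOFT_SHARE, SOFT_MIN = 0.90, 20    # soft failures: recoverable by reboot
HARD_SHARE, HARD_MIN = 0.10, 120   # hard failures: need repair/replacement
REBOOT_FRAC = 0.001                # fraction of ToRs rebooted per month
MONTH_MIN = 30 * 24 * 60           # minutes in a month

# Expected ToR downtime per month, weighted by failure type
expected_downtime = (SOFT_SHARE * SOFT_MIN + HARD_SHARE * HARD_MIN) * REBOOT_FRAC
tor_availability = 1.0 - expected_downtime / MONTH_MIN  # roughly six nines
```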

SLIDE 28

Deepview Insight: VMs and their Storage Co-location

  • For load balancing, VMs can mount VHDs from any storage cluster in the same region
  • Some VMs have storage that is further away
  • Can longer network paths impact VM availability?

28

Some benefit to co-locating VMs and their VHDs

  • At Azure, 52% of VM-VHD pairs are two-hop, 41% are three-hop
  • Compute daily VHD failure rates: r̄₀ (two-hop), r̄₁ (three-hop)
  • Averaged over 3 months
  • Yes!

ฮค เดฅ ๐ฌ๐Ÿ โˆ’ เดฅ ๐ฌ๐Ÿ เดฅ ๐ฌ๐Ÿ = ๐Ÿ๐Ÿ. ๐Ÿ“% ๐ฃ๐จ๐๐ฌ๐Ÿ๐›๐ญ๐Ÿ

SLIDE 29

29

[Diagram: Dev/Ops incident loop revisited - Design & implement (Dev); Deployment, Provisioning, Monitoring, Resource mgmt (Ops/Admin); incident detection & localization; incident resolution & mitigation; incident prevention]

SLIDE 30

30

RDMA for ML Training Acceleration

  • A case where networking helps ML to scale
SLIDE 31

Background

Bytedance AI

Bytedance Content Platform: Content Creation and Content Distribution

SLIDE 32

Content Understanding using DNN

32

[Figure: AlexNet classifying an image as "cat"]

SLIDE 33

DNN Training: BP

33

[Figure: forward and backward passes of backpropagation]

SLIDE 34

Distributed Training Acceleration

  • GPU, with mini-batch
  • Distributed training (data parallel)

34

[Diagram: data-parallel training - GPU servers exchanging gradients and parameters with parameter servers]
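Data-parallel training with a parameter server can be sketched as follows (a toy linear model stands in for a DNN; all names and values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

class ParameterServer:
    """Holds the shared model; averages worker gradients and applies an SGD step."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr
    def update(self, grads):
        self.w -= self.lr * np.mean(grads, axis=0)  # averaged-gradient step

def worker_grad(w, x, y):
    # gradient of 0.5*||Xw - y||^2 / len(y) on this worker's minibatch shard
    return x.T @ (x @ w - y) / len(y)

ps = ParameterServer(dim=3)
X = rng.normal(size=(64, 3))
true_w = np.array([1.0, -2.0, 0.5])
Y = X @ true_w
shards = np.array_split(np.arange(64), 4)  # 4 "GPU servers"

for step in range(300):
    grads = [worker_grad(ps.w, X[idx], Y[idx]) for idx in shards]  # parallel in practice
    ps.update(grads)
```

Every step exchanges gradients and parameters between workers and the server, which is exactly the traffic RDMA accelerates.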

SLIDE 35

Arnold Training System

[Diagram: Arnold stack - frameworks (TensorFlow, MXNet, Caffe) and TensorBoard on top of the Arnold SDK, Arnold API, Web UI, and Arnold agent; scheduling via Mesos, Nvidia Docker, Metis, and SCM; infrastructure: Compute (GPU, CPU, FPGA, ASIC), Network (RDMA), Storage (CephFS)]

SLIDE 36

When Communication Becomes the Bottleneck

36

SLIDE 37

RDMA/RoCEv2 background

  • RDMA addresses TCP's latency and CPU overhead problems
  • RDMA offloads the transport layer to the NIC
  • RDMA needs a lossless network
  • RoCEv2: RDMA over commodity Ethernet
  • PFC for hop-by-hop flow control
  • DCQCN for connection-level congestion control [SIGCOMM'15]
  • Many issues addressed [SIGCOMM'16, CoNEXT'17]

[Diagram: TCP/IP stack vs. RDMA stack - TCP/IP runs in the kernel above the NIC driver, while RDMA apps issue verbs that the NIC's offloaded RDMA transport executes via DMA over a lossless network]

SLIDE 38

RDMA Cluster for Arnold Training

  • 100 Gbps throughput between any two servers
  • Microsecond end-to-end latency
  • Minimal CPU overhead for packet processing

  • Many models spend a large amount of time on communication
    ✗ Poor TCP performance
    ✗ Low network bandwidth
  • 100GbE RDMA network
    ✓ Much higher bandwidth
    ✓ Reduces communication time
    ✓ Scales the cluster to thousands of GPU cards

SLIDE 39

RDMA Many-To-One

39

[Figure: many-to-one incast - sending servers connect through a switch to one receiving server at 100GbE; throughput governed by PFC and ECN]

SLIDE 40

RDMA for ML Training Acceleration (CNN)

40

[Figures: training speedup at batch size 32 and batch size 64]

SLIDE 41

RDMA for ML Training Acceleration (RNN)

41

SLIDE 42

When RDMA Acceleration Helps

[Figure: training timeline - training runs epochs (Epoch 0 ... Epoch M), each consisting of minibatches (Minibatch 0 ... Minibatch N); within a minibatch, forward passes f_0 ... f_{n-1} are followed by backward passes b_{n-1} ... b_0, with per-layer gradient sends s_i and parameter gets g_i overlapping the backward pass]

SLIDE 43

When RDMA Acceleration Helps

  • Big models
  • ResNet50 (98MB), VGG19 (548MB)
  • Communication/computation ratio is large
  • Layers with large parameter size
  • Small minibatch size
  • When TCP is slow

43
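To see why parameter size matters, a rough transfer-time sketch (the ~10 Gbps effective TCP bandwidth is an assumption for illustration, not a measured number):

```python
# Time to move a model's parameters once over the network
def transfer_time_s(model_bytes, gbps):
    return model_bytes * 8 / (gbps * 1e9)

VGG19 = 548e6     # ~548 MB of parameters
RESNET50 = 98e6   # ~98 MB of parameters

tcp_s = transfer_time_s(VGG19, 10)    # assumed ~10 Gbps effective TCP
rdma_s = transfer_time_s(VGG19, 100)  # 100 Gbps RDMA line rate
# The bigger the parameters relative to per-minibatch compute, the more RDMA helps
```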

SLIDE 44

Summary

  • ML will be a core part of building highly available systems
  • Deeper availability understanding
  • Automatic incident localization, mitigation, prevention
  • Intelligent system/network design
  • System/networking for ML
  • Scalable ML systems
  • Hardware, systems, ML services integrated design

44

SLIDE 45

Acknowledgement

  • Deepview (NSDI'18): Qiao Zhang, Guo Yu, Yingnong Dang, Nick Swanson, Xinsheng Yang, Randolph Yao, Murali Chintalapati, Arvind Krishnamurthy, and Thomas Anderson

  • ByteDance Networking Team
  • Bytedance ML System Team

45

SLIDE 46

Q&A

46