Toward Highly Available, Intelligent Cloud and ML Systems - PowerPoint PPT Presentation



SLIDE 1

Toward Highly Available, Intelligent Cloud and ML Systems

Chuanxiong Guo Bytedance NetAI 2018

1

SLIDE 2

Outline

  • Background: System/networking meets ML
  • Deepview: ML for availability improvement of cloud systems
  • RDMA for scalable ML training acceleration
  • Summary

2

SLIDE 3

Two Different Approaches

3

[Diagram: client/server network stack - socket, TCP, IP, packets, NIC; protocols and interfaces]

  • Networks/systems are designed by following principles
  • Interfaces are explicitly defined, protocols are explicitly coded, and packets can be traced and explained

[Diagram: ML workflow - data, labeling, training dataset, training, model, inference]

  • Models in machine learning are learned from data without explicit programming
  • Deep learning made breakthroughs in computer vision and speech

SLIDE 4

Networking Meets Machine Learning

4

ML and networking/systems help each other: ML improves system/network availability, and networking scales and accelerates ML systems.

SLIDE 5

[Diagram: cloud software stack - Code Repo, Data Repo, Software Systems, Deployment/Provisioning, Resource Mgmt, Config/Management, Monitoring]

Software Rules the Clouds

5

SLIDE 6

Incidents, Incidents, Incidents

6

SLIDE 7

System Availability is Plagued by Incidents

99.999% availability = 5 min downtime per year
99.99% availability = 53 min downtime per year

7

Availability A = Σᵢ Uᵢ / (Σᵢ Uᵢ + Σᵢ Dᵢ), where Uᵢ is uptime and Dᵢ is downtime

SLIDE 8

[Diagram: cloud software stack - Code Repo, Data Repo, Software Systems, Deployment/Provisioning, Resource Mgmt, Config/Management, Monitoring]

Incident Handling Practice

8

Lessons learned

SLIDE 9

[Diagram: Dev/Ops incident loop - Design & implement (Dev); Deployment, Provisioning, Monitoring, Resource mgmt (Ops); incident detection & localization; incident resolution & mitigation; incident prevention; automation]

9

Availability fundamentals and related systems: Gray failure, Panorama, ByteBrain, Deepview, NetBouncer, Pingmesh

SLIDE 10

10

Deepview for Virtual Disk Failure Diagnosis

  • A case where ML helps system availability
SLIDE 11

VM Availability

  • IaaS is one of the largest cloud services today
  • High VM availability is a key performance metric
  • Yet, achieving 99.999% VM uptime remains a challenge

11

  • 1. What is the VM availability bottleneck?
  • 2. How to eliminate it?
SLIDE 12

Clos Network

IaaS Architecture

  • Compute and storage clusters connected by a Clos-like network
  • Compute-storage separation
  • VMs and Virtual Hard Disks (VHDs) provisioned from different clusters
  • Hypervisor transparently redirects disk access to remote storage
  • Keeps data available during a localized power failure to a rack

[Diagram: subsystems inside a datacenter - Compute Cluster (Host running Hypervisor and VMs), Clos Network, Storage Cluster]

12

SLIDE 13

A New Type of Failure: VHD Failures

  • Infra failures can disrupt VHD access
  • Hypervisor can retry, but not indefinitely
  • Hypervisor will crash the VM to surface failures to the customer
  • Allows customers to take actions to keep their app-level SLAs

13

[Diagram: subsystems inside a datacenter - Compute Cluster (Host, Hypervisor, VMs), Clos Network, Storage Cluster]

How much do VHD failures impact VM availability?

SLIDE 14

Availability Bottleneck

  • VHD failure localization is the bottleneck
  • 52% of unplanned VM downtime
  • Takes tens of minutes to hours to localize
  • This talk: quick and accurate failure localization

Breakdown of unplanned VM downtime in a year: VHD failure 52%, SW failure 41%, HW failure 6%, unknown 1%

14

SLIDE 15

Failure Triage was Slow and Inaccurate

  • SREs from each team check their subsystem for anomalies to match the incident
  • e.g. compute host heart-beats, storage perf-counters, network link discards
  • Incidents get ping-ponged among different teams due to false positives
  • Inaccurate diagnosis and delayed mitigation
  • Gray failures in network and storage are hard to catch
  • Troubled but not totally down, e.g. performance issues or software bugs
  • Only fail a subset of VHD requests
  • Can take hours to localize

15

SLIDE 16

Deepview Approach: Global View

[Figure: bipartite model and grid view - compute clusters C1-C4 on the left, storage clusters S1-S3 on the right]

  • Isolate failures by examining interactions between subsystems
  • Instead of alerting every SRE team to check if their subsystem is at fault
  • Bipartite model
  • Compute Clusters (left) : Storage Clusters (right)
  • VMs are provisioned from compute/storage cluster pair
  • Edge weight = VHD failure rate
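The edge weights above can be sketched as follows (the per-VM records and cluster names are hypothetical, purely for illustration):

```python
from collections import defaultdict

# Hypothetical per-VM records: (compute_cluster, storage_cluster, had_vhd_failure)
vms = [
    ("C1", "S1", False), ("C1", "S1", True), ("C1", "S2", False),
    ("C2", "S1", True), ("C2", "S1", True), ("C2", "S2", False),
]

counts = defaultdict(lambda: [0, 0])  # (compute, storage) -> [failures, total]
for c, s, failed in vms:
    counts[(c, s)][1] += 1
    if failed:
        counts[(c, s)][0] += 1

# Edge weight = VHD failure rate for each compute/storage cluster pair
weights = {edge: f / n for edge, (f, n) in counts.items()}
```

A failed cluster then shows up as a row or column of hot edges in the grid view.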

16

SLIDE 17

Our Approach: Global View

[Figure: grid views of example failures - a compute cluster failure (C2 fails, affecting its entire row) and a storage cluster gray failure (S1 fails, affecting its entire column)]

17

SLIDE 18

Challenges

Remaining challenges:

  • 1. Need to pinpoint network failures
  • 2. Need to handle gray failures
  • 3. Need to be near-real-time

Corresponding solutions: (1) a generalized model that includes network devices; (2) a Lasso regression / hypothesis-testing algorithm; (3) a streaming data pipeline.

Summary of our goal: a system that localizes VHD failures to underlying failures in the compute, storage, or network subsystems within a time budget of 15 minutes.

18

Time budget set by production team to meet availability goals

SLIDE 19

Deepview Model: Include the Network

19

[Diagram: compute cluster and storage cluster connected by a Clos network]

  • Need to handle multipath and ECMP
  • Simplify Clos network to a tree by aggregating network devices
  • Can model at the granularity of clusters or ToRs
SLIDE 20

Deepview Model: Estimate Component Health

๐๐ฌ๐ฉ๐œ ๐ช๐›๐ฎ๐ข ๐ฃ ๐ฃ๐ญ ๐ข๐Ÿ๐›๐ฆ๐ฎ๐ข๐ณ = เท‘

๐คโˆˆ๐ช๐›๐ฎ๐ข(๐ฃ)

๐๐ฌ๐ฉ๐œ ๐๐ฉ๐ง๐ช๐ฉ๐จ๐Ÿ๐จ๐ฎ ๐ค ๐ฃ๐ญ ๐ข๐Ÿ๐›๐ฆ๐ฎ๐ข๐ณ ๐Ÿ โˆ’ ๐Ÿ๐ฃ ๐จ๐ฃ = เท‘

๐คโˆˆ๐ช๐›๐ฎ๐ข(๐ฃ)

๐ช๐ค ๐ฆ๐ฉ๐ก ๐Ÿ โˆ’ ๐Ÿ๐ฃ ๐จ๐ฃ = เท

๐คโˆˆ๐ช๐›๐ฎ๐ข(๐ฃ)

๐ฆ๐ฉ๐ก ๐ช๐ค ๐ณ๐ฃ = เท

๐ค=๐Ÿ ๐Ž

๐›„๐ค ๐ฒ๐ฃ๐ค+ ๐›‡๐ฃ

๐ณ๐ฃ=๐ฆ๐ฉ๐ก ๐Ÿ โˆ’

๐Ÿ๐ฃ ๐จ๐ฃ

๐›„๐ค=๐ฆ๐ฉ๐ก ๐ช๐ค ๐›‡๐ฃ=measurement noise System of Linear Equations Blue: observable Red: unknown Purple: topology

20

Component j is healthy with ๐ช๐ค = ๐Ÿ๐ฒ๐ช(๐›„๐ค)

  • ฮฒj = 0, clear component j
  • ฮฒj โ‰ช 0, may blame it

*Assume independent failures ๐Ÿ๐ฃ=num of VMs crashed ๐’๐ฃ=num of VMs
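The linear system can be assembled mechanically from per-path failure counts; a minimal sketch, with hypothetical paths and counts:

```python
import math

# Hypothetical paths: (components on the path, VMs crashed e_i, VMs total n_i)
components = ["C1", "C2", "Net", "S1", "S2"]
paths = [
    (["C1", "Net", "S1"], 5, 100),
    (["C1", "Net", "S2"], 0, 100),
    (["C2", "Net", "S1"], 4, 80),
    (["C2", "Net", "S2"], 0, 120),
]

# y_i = log(1 - e_i/n_i); x_ij = 1 iff component j lies on path i
y = [math.log(1.0 - e / n) for _, e, n in paths]
X = [[1 if c in comps else 0 for c in components] for comps, _, _ in paths]
```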

SLIDE 21

Deepview Algorithm: Prefer Simpler Explanation via Lasso

  • Potentially #unknowns > #equations
  • Traditional least-square regression would fail

Sparsity

เทก ๐›„ = ๐›๐ฌ๐ก๐ง๐ฃ๐จ

๐›„โˆˆโ„๐Ž,๐›„โ‰ค๐Ÿ

๐ณ โˆ’ ๐˜๐›„ ๐Ÿ‘ + ๐› ๐›„ ๐Ÿ Lasso Objective Function: ๐ณ๐Ÿ = ๐›„๐๐Ÿ + ๐›„๐จ๐Ÿ๐ฎ + ๐›„๐ญ๐Ÿ + ๐›‡๐Ÿ ๐ณ๐Ÿ‘ = ๐›„๐๐Ÿ + ๐›„๐จ๐Ÿ๐ฎ + ๐›„๐ญ๐Ÿ‘ + ๐›‡๐Ÿ‘ ๐ณ๐Ÿ’ = ๐›„๐๐Ÿ‘ + ๐›„๐จ๐Ÿ๐ฎ + ๐›„๐ญ๐Ÿ + ๐›‡๐Ÿ’ ๐ณ๐Ÿ“ = ๐›„๐๐Ÿ‘ + ๐›„๐จ๐Ÿ๐ฎ + ๐›„๐ญ๐Ÿ‘ + ๐›‡๐Ÿ“

Net C1 C2 S1 S2

๐ณ๐ฃ = เท

๐ค=๐Ÿ ๐Ž

๐›„๐ค ๐ฒ๐ฃ๐ค+ ๐›‡๐ฃ

21

Example:

  • But multiple simultaneous failures are rare
  • How to encode this domain knowledge mathematically?
  • Equivalent to preferring most β_j to be zero
  • Lasso regression can get sparse solutions efficiently
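A minimal sketch of the constrained Lasso step (a toy coordinate-descent solver, not the production algorithm; the topology and failure counts are hypothetical):

```python
import numpy as np

# Paths (rows) x components (cols): Net, C1, C2, S1, S2 -- hypothetical toy topology
X = np.array([
    [1, 1, 0, 1, 0],   # path C1 -> Net -> S1
    [1, 1, 0, 0, 1],   # path C1 -> Net -> S2
    [1, 0, 1, 1, 0],   # path C2 -> Net -> S1
    [1, 0, 1, 0, 1],   # path C2 -> Net -> S2
], dtype=float)

crashed = np.array([5.0, 0.0, 5.0, 0.0])   # e_i: VMs crashed per path
total = np.array([100.0, 100.0, 100.0, 100.0])  # n_i: VMs per path
y = np.log(1.0 - crashed / total)  # y_i = log(1 - e_i/n_i)

def lasso_nonpositive(X, y, lam=0.01, iters=200):
    """Coordinate descent for min ||y - X b||^2 + lam*||b||_1 subject to b <= 0."""
    beta = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(iters):
        for j in range(len(beta)):
            r = y - X @ beta + X[:, j] * beta[j]  # residual excluding component j
            b = (X[:, j] @ r + lam / 2.0) / col_sq[j]
            beta[j] = min(0.0, b)  # on b <= 0 the |b| penalty is -lam*b, so just clip
    return beta

beta = lasso_nonpositive(X, y)
# S1 alone explains both failing paths, so the sparse solution blames S1
```

Here both failing paths share S1, and the L1 penalty makes blaming S1 alone cheaper than blaming C1 and C2 separately.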
SLIDE 22

Deepview Algorithm: Principled Blame Decision via Hypothesis Testing

  • Need a binary decision (flag/clear) for each component
  • Ad-hoc thresholds do not work reliably
  • Can we make a principled decision?
  • If the estimated failure probability is worse than average, it is likely a real failure
  • Automate this empirical decision criterion with a hypothesis test:
  • Rejecting H0_j means blaming component j
  • Otherwise, clear component j

๐ˆ๐Ÿ ๐ค : ๐›„๐ค = เดฅ ๐›„ ๐ฐ๐ญ. ๐ˆ๐ ๐ค : ๐›„๐ค < เดฅ ๐›„

22

SLIDE 23

Deepview System Architecture: NRT Data Pipeline

23

[Diagram: near-real-time data pipeline - inputs (real-time: VHD Failure, VM Info, StorageAcct; non-real-time: Net Topo, VMsPerPath) flow into an ingestion pipeline built on the Kusto engine (raw data, sliding window of input); a near-real-time scheduler runs the algorithm; outputs are actions, alerts, and visualization]

SLIDE 24

Some Statistics

  • Analyzed Deepview results for one month
  • Daily VHD failures: hundreds to tens of thousands
  • Detected 100 failure instances
  • 70 matched with existing tickets, 30 were previously undetected
  • Reduced unclassified VHD failures to fewer than 500 per day
  • The remainder are single-host failures or customer mistakes (e.g. expired storage accounts)

24

SLIDE 25

Case Study 1: Unplanned ToR Reboot

  • Unplanned ToR reboots can cause VMs to crash
  • We knew this could happen, but not where or when
  • Deepview can flag those ToRs
  • The figure shows a ToR down in one small region
  • Blamed the right ToR among 288 components
  • Associates VM downtime with ToR failures
  • Quantifies the impact of the ToR as a single point of failure on VM availability

[Figure: grid view of ToR_11-ToR_15 vs. STR_01-STR_07 during an unplanned ToR reboot in a region]

25

SLIDE 26

Case Study 2: Storage Cluster Gray Failure

  • Impacts only a subset of VMs
  • A storage cluster was brought online with a bug that puts some VHDs in a negative cache
  • Deepview flagged the faulty storage cluster almost immediately, while manual triage took 20+ hours

[Figure: number of VMs with VHD failures per hour during a storage cluster gray failure]

26

SLIDE 27

Deepview Insight: ToR as a Single Point of Failure

  • Reduced network cost vs. availability cost of using a single ToR per rack
  • Unplanned ToR failures: soft failures (recoverable by reboot) vs. hard failures

ToR Availability
= 1 − (%soft × soft dur. + %hard × hard dur.) × (frac. of ToRs rebooted per month) / (time in a month)
= 1 − (90% × 20 min + 10% × 120 min) × 0.1% / (30 × 24 × 60 min)
= 99.99993%

  • Dependent services (ToRs) need to provide one extra nine over the target service (VMs)

27

ToRs are not on the critical path for VMs to achieve five-nines availability
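The arithmetic behind the ToR availability estimate can be sketched as follows (figures as reconstructed from the slide, so treat them as approximate):

```python
# Reconstructed ToR availability arithmetic
SOFT_SHARE, SOFT_MIN = 0.90, 20    # soft failures: recoverable by reboot
HARD_SHARE, HARD_MIN = 0.10, 120   # hard failures: need repair/replacement
REBOOT_FRAC = 0.001                # fraction of ToRs rebooted per month
MONTH_MIN = 30 * 24 * 60           # minutes in a month

# Expected ToR downtime per month, weighted by failure type
expected_downtime = (SOFT_SHARE * SOFT_MIN + HARD_SHARE * HARD_MIN) * REBOOT_FRAC
tor_availability = 1.0 - expected_downtime / MONTH_MIN  # roughly six nines
```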

SLIDE 28

Deepview Insight: VMs and their Storage Co-location

  • For load balancing, VMs can mount VHDs from any storage cluster in the same region
  • Some VMs have storage that is further away
  • Can longer network paths impact VM availability?

28

Some benefit to co-locating VMs and their VHDs

  • At Azure, 52% of VM-VHD pairs are two-hop, 41% are three-hop
  • Compute daily VHD failure rates: r̄₀ (two-hop), r̄₁ (three-hop)
  • Averaged over 3 months
  • Yes!

ฮค เดฅ ๐ฌ๐Ÿ โˆ’ เดฅ ๐ฌ๐Ÿ เดฅ ๐ฌ๐Ÿ = ๐Ÿ๐Ÿ. ๐Ÿ“% ๐ฃ๐จ๐๐ฌ๐Ÿ๐›๐ญ๐Ÿ

SLIDE 29

29

[Diagram: Dev/Ops incident loop revisited - Design & implement (Dev); Deployment, Provisioning, Monitoring, Resource mgmt (Ops/Admin); incident detection & localization; incident resolution & mitigation; incident prevention]

SLIDE 30

30

RDMA for ML Training Acceleration

  • A case where networking helps ML to scale
SLIDE 31

Background

Bytedance AI

Bytedance Content Platform: Content Creation and Content Distribution

SLIDE 32

Content Understanding using DNN

32

[Figure: AlexNet classifying an image as "cat"]

SLIDE 33

DNN Training: BP

33

[Figure: forward and backward passes of backpropagation]

SLIDE 34

Distributed Training Acceleration

  • GPU, with mini-batch
  • Distributed training (data parallel)

34

[Diagram: data-parallel training - GPU servers exchanging gradients and parameters with parameter servers]
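Data-parallel training with a parameter server can be sketched as follows (a toy linear model stands in for a DNN; all names and values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

class ParameterServer:
    """Holds the shared model; averages worker gradients and applies an SGD step."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr
    def update(self, grads):
        self.w -= self.lr * np.mean(grads, axis=0)  # averaged-gradient step

def worker_grad(w, x, y):
    # gradient of 0.5*||Xw - y||^2 / len(y) on this worker's minibatch shard
    return x.T @ (x @ w - y) / len(y)

ps = ParameterServer(dim=3)
X = rng.normal(size=(64, 3))
true_w = np.array([1.0, -2.0, 0.5])
Y = X @ true_w
shards = np.array_split(np.arange(64), 4)  # 4 "GPU servers"

for step in range(300):
    grads = [worker_grad(ps.w, X[idx], Y[idx]) for idx in shards]  # parallel in practice
    ps.update(grads)
```

Every step exchanges gradients and parameters between workers and the server, which is exactly the traffic RDMA accelerates.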

SLIDE 35

Arnold Training System

[Diagram: Arnold stack - frameworks (TensorFlow, MXNet, Caffe) and TensorBoard on top of the Arnold SDK, Arnold API, Web UI, and Arnold agent; scheduling via Mesos, Nvidia Docker, Metis, and SCM; infrastructure: Compute (GPU, CPU, FPGA, ASIC), Network (RDMA), Storage (CephFS)]

SLIDE 36

When Communication Becomes the Bottleneck

36

SLIDE 37

RDMA/RoCEv2 background

  • RDMA addresses TCP's latency and CPU overhead problems
  • RDMA offloads the transport layer to the NIC
  • RDMA needs a lossless network
  • RoCEv2: RDMA over commodity Ethernet
  • PFC for hop-by-hop flow control
  • DCQCN for connection-level congestion control [SIGCOMM'15]
  • Many issues addressed [SIGCOMM'16, CoNEXT'17]

[Diagram: TCP/IP stack vs. RDMA stack - TCP/IP runs in the kernel above the NIC driver, while RDMA apps issue verbs that the NIC's offloaded RDMA transport executes via DMA over a lossless network]

SLIDE 38

RDMA Cluster for Arnold Training

  • 100 Gbps throughput between any two servers
  • Microsecond end-to-end latency
  • Minimal CPU overhead for packet processing

  • Many models spend a large amount of time on communication
    ✗ Poor TCP performance
    ✗ Low network bandwidth
  • 100GbE RDMA network
    ✓ Much higher bandwidth
    ✓ Reduces communication time
    ✓ Scales the cluster to thousands of GPU cards

SLIDE 39

RDMA Many-To-One

39

[Figure: many-to-one incast - sending servers connect through a switch to one receiving server at 100GbE; throughput governed by PFC and ECN]

SLIDE 40

RDMA for ML Training Acceleration (CNN)

40

[Figures: training speedup at batch size 32 and batch size 64]

SLIDE 41

RDMA for ML Training Acceleration (RNN)

41

SLIDE 42

When RDMA Acceleration Helps

[Figure: training timeline - training runs epochs (Epoch 0 ... Epoch M), each consisting of minibatches (Minibatch 0 ... Minibatch N); within a minibatch, forward passes f_0 ... f_{n-1} are followed by backward passes b_{n-1} ... b_0, with per-layer gradient sends s_i and parameter gets g_i overlapping the backward pass]

SLIDE 43

When RDMA Acceleration Helps

  • Big models
  • ResNet50 (98MB), VGG19 (548MB)
  • Communication/computation ratio is large
  • Layers with large parameter size
  • Small minibatch size
  • When TCP is slow

43
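To see why parameter size matters, a rough transfer-time sketch (the ~10 Gbps effective TCP bandwidth is an assumption for illustration, not a measured number):

```python
# Time to move a model's parameters once over the network
def transfer_time_s(model_bytes, gbps):
    return model_bytes * 8 / (gbps * 1e9)

VGG19 = 548e6     # ~548 MB of parameters
RESNET50 = 98e6   # ~98 MB of parameters

tcp_s = transfer_time_s(VGG19, 10)    # assumed ~10 Gbps effective TCP
rdma_s = transfer_time_s(VGG19, 100)  # 100 Gbps RDMA line rate
# The bigger the parameters relative to per-minibatch compute, the more RDMA helps
```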

SLIDE 44

Summary

  • ML will be a core part of building highly available systems
  • Deeper availability understanding
  • Automatic incident localization, mitigation, prevention
  • Intelligent system/network design
  • System/networking for ML
  • Scalable ML systems
  • Hardware, systems, ML services integrated design

44

SLIDE 45

Acknowledgement

  • Deepview (NSDI'18): Qiao Zhang, Guo Yu, Yingnong Dang, Nick Swanson, Xinsheng Yang, Randolph Yao, Murali Chintalapati, Arvind Krishnamurthy, and Thomas Anderson

  • ByteDance Networking Team
  • Bytedance ML System Team

45

SLIDE 46

Q&A

46