Life lessons and datacenter performance analysis (PowerPoint presentation transcript)



SLIDE 1

Life lessons and datacenter performance analysis

Dan Ardelean, Amer Diwan, Rick Hank, Christian Kurmann, Balaji Raghavan, Matt Seegmiller

SLIDE 2

Google Confidential and Proprietary

The need to solve performance crimes

A performance crime is anything that unnecessarily increases:

  • Latency
  • Resource usage

Solving performance crimes is a necessity, not a luxury. This talk shares our experiences solving performance crimes in Gmail.

Performance crimes...

  • degrade end user experience
  • waste valuable resources (energy, cost, etc.)
SLIDE 3

But email is such a simple app … right?

[Diagram: a delivery service writing into a storage system that holds your email]

We have all used this...in the 1990s!

SLIDE 4

Email today is a lot more...

To enable sharing across applications (e.g., contacts), each component is a service.

[Diagram: the delivery service and storage system now sit among many services: spam filtering, virus detection, web client support, sync client support, backups, smart search, mail classification, labels/folders, filters, images/attachments, contacts, ...]

SLIDE 5

Each component is a service running in its own processes. This provides modularity, parallelism, and reliability.

[Diagram: Gmail and Calendar app logic issuing RPCs to authentication, contacts, body, and events services]

SLIDE 6

Lesson 1:

No RPC left behind

SLIDE 7

Each user request involves O(100) RPCs. We cannot ignore the rarely slow RPCs:

  • 1/100 slow RPC affects 63% of the requests
  • 1/1M slow RPC affects 0.01% of the requests

Blue moon... daily! We cannot ignore the rarely slow requests either:

  • 1/1M event affects O(10M) requests daily
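The arithmetic behind these bullets can be checked directly. A minimal sketch, taking the 100-RPCs-per-request figure from the slide and assuming RPC latencies are independent:

```python
def p_request_affected(p_slow_rpc: float, rpcs_per_request: int = 100) -> float:
    """Probability that at least one of a request's RPCs is slow,
    assuming independent RPC latencies."""
    return 1.0 - (1.0 - p_slow_rpc) ** rpcs_per_request

# A 1-in-100 slow RPC touches ~63% of requests...
print(f"{p_request_affected(1 / 100):.0%}")        # 63%
# ...while a 1-in-1M slow RPC touches ~0.01% of requests.
print(f"{p_request_affected(1 / 1_000_000):.2%}")  # 0.01%
```

With 100 RPCs per request, even a one-in-a-hundred slowdown is effectively the common case.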
SLIDE 8

Latency of RPCs follows a complex distribution

  • Many layers of many servers
  • Continuously varying load
  • Countless code paths

[Plot: latency distribution; axes: latency vs. fraction of RPCs]

SLIDE 9

Even for the “simplest” components, data is rarely normal. Abnormal is the new normal!

Log scale latency distribution of a critical (but “simple”) component at Google

[Plot: mean and mean ± sd marked on the distribution; the mean ± sd band (66%) misses the long tail. Axes: latency vs. fraction of RPCs]

SLIDE 10

When is better actually better? A tighter distribution may be better than a better median

[Plot: two latency distributions. One has constant overhead at every request but fast recovery; the other has minimal overhead at every request but expensive recovery. Axes: latency vs. fraction of RPCs]

SLIDE 11

Challenges

Normal distributions are rare at Google

  • Optimizing only for the “common case” is inadequate

Many statistical truths assume properties of the data

  • Ensure that analysis is appropriate for the shape of the data

But... don’t invent statistics; just use them correctly!
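One concrete way to keep the analysis appropriate for the shape of the data is to report percentiles rather than mean ± sd. A sketch on synthetic heavy-tailed latencies (the log-normal shape and its parameters are illustrative stand-ins, not Gmail's real distribution):

```python
import random

random.seed(42)
# Synthetic RPC latencies in ms: log-normal, a common heavy-tailed shape.
latencies = sorted(random.lognormvariate(mu=3.0, sigma=1.0) for _ in range(100_000))

def percentile(sorted_xs, q):
    """Nearest-rank percentile of an already-sorted sample."""
    idx = min(len(sorted_xs) - 1, int(q / 100 * len(sorted_xs)))
    return sorted_xs[idx]

mean = sum(latencies) / len(latencies)
p50, p99 = percentile(latencies, 50), percentile(latencies, 99)
# For heavy-tailed data the mean sits well above the median and says
# little about the tail; percentiles describe both the typical case and the tail.
print(f"mean={mean:.1f}ms  p50={p50:.1f}ms  p99={p99:.1f}ms")
```

On data like this, mean ± sd can even imply negative latencies, while p50/p99/p999 stay meaningful no matter the shape.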

SLIDE 12

Lesson 2:

Prepare for the storm

SLIDE 13

The long tail also shows up in resource usage. There are two causes for peaks: (i) load; (ii) unusual events.

We must reserve resources for peaks.

[Plot: CPU usage over time, with peaks]

SLIDE 14

Cause 1: load varies over time

Users are not “randomly” spread out over time zones

Who is using Gmail here?

Working-hour overlaps between time zones (e.g., North America and Europe) cause large usage peaks.

SLIDE 15

Cause 2: storms! We must plan for hardware and software updates.

What happened here?

[Plot: CPU usage over time, with an abrupt spike]

SLIDE 16

Consequences of Lessons 1 and 2

Aggregate metrics (e.g., mean latency) cannot distinguish between:

[Two latency distributions that share the same aggregate metrics but differ in shape]

We must reason with traces for long-tail events.

SLIDE 17

Challenges with traces

Profiling tools are (mostly) OK; now we must invest in tracing.

  • Different traces may use different clocks
    ○ We use large amounts of data to do time alignment after the fact
  • Reasoning is hard and laborious
    ○ Traces from a single machine may contain 100K events per second
    ○ We use a language based on temporal logic to reason over traces
  • Need to coordinate tracing across machines
    ○ Use coordinated-bursty tracing
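A sketch of the coordinated-bursty idea: if every machine derives its trace on/off decision from a (roughly) synchronized wall clock, bursts line up across machines with no coordination traffic. The constants and function names here are illustrative, not Google's implementation:

```python
import time

BURST_MS = 64      # trace for 64 ms out of...
PERIOD_MS = 4096   # ...every 4096 ms (a power of two keeps the check cheap)

def tracing_now(now_ms: int) -> bool:
    """Trace iff the low bits of wall-clock time fall inside the burst window.
    Machines with synchronized clocks all answer the same way."""
    return (now_ms % PERIOD_MS) < BURST_MS

def maybe_trace(event, now_ms=None):
    """Record `event` only during a burst; returns whether it was recorded."""
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    if not tracing_now(now_ms):
        return False
    # In a real system this would append `event` to a trace buffer.
    return True

# Two "machines" evaluating at the same wall-clock instant agree:
assert maybe_trace("rpc_start", now_ms=4096 * 7 + 10)       # inside a burst
assert not maybe_trace("rpc_start", now_ms=4096 * 7 + 100)  # outside
```

Because the decision is a pure function of the clock, traces captured during the same burst on different machines can later be stitched into one cross-machine view.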

SLIDE 18

Lesson 3:

Ask nicely!

SLIDE 19

Is your pattern of access reasonable?

[Cartoon: an interactive client asks Gmail for “my message please” and gets “here you go”; a sync client demands “all 50,000 messages NOW” and Gmail groans “Aaargh!”]

A single not-so-nice request can degrade many requests

SLIDE 20

Some problems are more subtle

[Diagram: a Gmail search for “Chicken and Egg” is rewritten into “(Chicken OR Chickens OR ...) and Egg” before it reaches the backend]

Query rewriting can dramatically inflate a small request

SLIDE 21

Any layer can potentially overwhelm the next layer

[Diagram: a Gmail front end issuing lots of small reads to the Gmail storage layer, versus one issuing fewer but larger reads]

Requests to a layer should match its strengths
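As an illustration of matching requests to a layer's strengths, many small reads can be coalesced into fewer, larger ones before they reach storage. All identifiers below are illustrative, not Gmail's actual API:

```python
def coalesce_reads(offsets, length, max_gap=4096):
    """Merge byte ranges [offset, offset+length) whose gaps are small
    enough that one larger read is cheaper than two round trips."""
    if not offsets:
        return []
    ranges = sorted((off, off + length) for off in offsets)
    merged = [list(ranges[0])]
    for start, end in ranges[1:]:
        if start - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)  # extend the current read
        else:
            merged.append([start, end])              # start a new read
    return [tuple(r) for r in merged]

# Four small reads become two larger ones:
print(coalesce_reads([0, 100, 200, 100_000], length=50))
# [(0, 250), (100000, 100050)]
```

The `max_gap` knob encodes the layer's strength: a disk-backed store happily over-reads a few KB to avoid an extra seek or round trip.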

SLIDE 22

Challenges

Our APIs capture functionality, not performance. How do we know that a method that we are calling...

  • ...makes expensive RPC calls?
  • ...acquires locks?
  • ...performs IO?

Can we express and reason over performance in our APIs?

SLIDE 23

Lesson 4:

Share!

SLIDE 24

Lock contention often causes long-tail latency. A burst can cause contention when there is normally none.

SLIDE 25

Layer interactions may aggravate or alleviate contention

[Diagram: a storage layer calling into a low-level layer for accessing disk]

An inefficient layer presents an easier request stream to the next layer.

[Two plots: requests over time, before and after the layer]

SLIDE 26

Remember Little’s law!

If processing a request holds a lock for 100 ms, we cannot process more than 10 requests per second.

At 5 qps with a 0.1 s hold time, the average queue depth is 5 × 0.1 = 0.5 (Little's law: L = λW).

If we want to double parallelism, we must halve the holding time.
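The numbers on the slide follow directly from Little's law, L = λW; a minimal sketch:

```python
# Little's law: average number in the system L = arrival rate λ × time in system W.
def avg_queue_depth(arrival_qps: float, hold_seconds: float) -> float:
    return arrival_qps * hold_seconds

# A fully serialized lock held for W seconds caps throughput at 1/W.
def max_qps_under_lock(hold_seconds: float) -> float:
    return 1.0 / hold_seconds

print(avg_queue_depth(5, 0.1))   # 5 qps × 0.1 s -> average depth 0.5
print(max_qps_under_lock(0.1))   # 100 ms hold   -> at most 10 qps
```

Halving the hold time to 50 ms doubles the bound to 20 qps, which is exactly the "double parallelism, halve holding time" trade.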

SLIDE 27

Lesson 5:

Confront weaknesses

SLIDE 28

Caches can hide the latency of “weak” components...

[Diagram: a cache in front of a slow component, alongside a fast component]

...but it is often much better to fix the root cause!

SLIDE 29

Why should we care if caches reduce latency?

[Diagram: a cache between Gmail’s storage layer and the file system layer]

Glass half full: wow, a 98.5% cache hit rate! Glass half empty: wow, why are we asking for the same data 98.5% of the time?

SLIDE 30

Fix the problem not the symptom

If a cache is performing too well, something is wrong with the requests.

Fix the request pattern, then remove the cache: fewer resources used, cleaner code.

Caches are great abuse detectors!
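A sketch of using a cache as an abuse detector: instead of celebrating the hit rate, surface the hottest keys, since they point at the callers re-requesting the same data. The 98.5% figure is from the slide; the keys and counts below are made up for illustration:

```python
from collections import Counter

def hit_rate_and_hot_keys(requested_keys, top=3):
    """Estimate cache hit rate (every repeat request is a hit for an
    unbounded cache) and report the most re-requested keys."""
    counts = Counter(requested_keys)
    hits = sum(c - 1 for c in counts.values())
    rate = hits / len(requested_keys)
    return rate, counts.most_common(top)

keys = ["inbox_index"] * 97 + ["msg_1", "msg_2", "msg_3"]
rate, hot = hit_rate_and_hot_keys(keys)
print(f"hit rate: {rate:.0%}, hottest: {hot[0]}")  # 96% of requests re-fetch inbox_index
```

The output names the misbehaving request pattern directly, which is the lead you need to fix the caller rather than grow the cache.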

SLIDE 31

Lesson 6:

Get your priorities right

SLIDE 32

Priorities only matter in a crunch

[Diagram: high- and low-priority queues in front of a resource. Short queues: no problem! Long queues: big problem!]

Overprovisioned resources mask poor priority settings

SLIDE 33

Setting priorities is hard!


First try: put only user-facing requests in high priority

SLIDE 34

...but that can easily backfire when there are dependencies

[Timeline: a high-priority request is queued, then stuck waiting for a lower-priority request, before it finally runs]

Avoiding priority inversion makes for a complex priority model

SLIDE 35

Lesson 7:

Question everything

SLIDE 36

Suspect every layer, every component, every lock, ...

But even more so, look every gift horse in the mouth

If data looks too good to be true, it is probably wrong!

SLIDE 37

Rookie mistake: Compare Monday data to Sunday data! The load varies day to day, hour to hour!

[Plot: requests per second over time, varying by day and by hour]
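A minimal guard against that rookie mistake is to baseline each sample against the same weekday and hour in prior weeks, so Mondays compare with Mondays. The helper name is illustrative:

```python
from datetime import datetime

def baseline_key(ts: datetime):
    """Bucket load samples by (weekday, hour): compare like with like."""
    return (ts.weekday(), ts.hour)

# Monday 9am compares with next Monday 9am, not with Sunday 9am:
assert baseline_key(datetime(2024, 1, 1, 9)) == baseline_key(datetime(2024, 1, 8, 9))
assert baseline_key(datetime(2024, 1, 7, 9)) != baseline_key(datetime(2024, 1, 8, 9))
```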

SLIDE 38

Lesson 8:

Real life is real life; tests are tests!

SLIDE 39

Loadtest versus production

The loadtest executes the same binaries... but the load is different.

[Two plots: CPU usage during loadtest vs. CPU usage in production]

SLIDE 40

Why the differences?

  • Real user patterns are more complex than what we can synthesize
    ○ Users use a variety of email clients, which affects requests
    ○ Users differ in their mailbox size and usage
      ■ e.g., Googlers are 10x more of everything compared to the “average” user
    ○ Users have different settings (e.g., filters) and usage styles (clean-inbox versus everything-in-inbox)
    ○ ...
  • The loadtest attempts to model these…
    ○ but it cannot possibly model everything

Many performance problems must be debugged in production

SLIDE 41

Debugging in production

“If only I knew X, I could get to the bottom of this puzzle.”

Sure… but only if:

  • Recording X does not violate any compliance, contractual, or privacy promises to our users
  • It is carefully reviewed so that we do not introduce bugs or regressions
  • And it still needs to go through a careful rollout process

Challenge is to infer what we really need from what we have

SLIDE 42

Call to action: are you up to it?

  • Trace analysis tools
    ○ How to combine and analyze diverse sources of data
    ○ How can we derive high-level knowledge from low-level data?
  • Performance APIs
    ○ How can we express and check performance specifications?
  • Focus on analyzing large production systems when possible
    ○ How can our students learn from and impact real systems?