Life lessons and datacenter performance analysis
Dan Ardelean, Amer Diwan, Rick Hank, Christian Kurmann, Balaji Raghavan, Matt Seegmiller
Google Confidential and Proprietary
The need to solve performance crimes
A performance crime is anything that unnecessarily increases:
- Latency
- Resource usage
Solving performance crimes is a necessity, not a luxury
This talk shares our experiences in solving performance crimes in Gmail
Performance crimes...
- degrade end user experience
- waste valuable resources (energy, cost, etc.)
But email is such a simple app … right?
[Diagram: delivery service, storage system, your email]
We have all used this...in the 1990s!
Email today is a lot more...
[Diagram: delivery service, storage system, and your email, plus the components below]
- Spam filtering
- Virus detection
- Web client support
- Sync client support
- Backups
- Smart search
- Mail classification
- Labels/folder
- Filters
- Images/Attachments
- Contacts
- ...
To enable sharing across applications (e.g., contacts), each component is a service
Each component is a service running in its own processes
This provides modularity, parallelism, and reliability
[Diagram: services communicating over RPC. Labels: Gmail, Authentication, App Logic, Contacts, Body, Calendar, App Logic, Events]
Lesson 1:
No RPC left behind
Each user request involves O(100) RPCs
We cannot ignore the rarely slow RPCs
- 1/100 slow RPC affects 63% of the requests
- 1/1M slow RPC affects 0.01% of the requests
Blue moon... daily!
We cannot ignore the rarely slow requests
- 1/1M event affects O(10M) requests daily
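The percentages above are consistent with a simple model: if each RPC is slow independently with probability p and a request issues n RPCs, the request sees at least one slow RPC with probability 1 - (1 - p)^n. A minimal Python sketch, assuming independence between RPCs (a simplification of ours, not a claim from the slides); the function name is illustrative:

```python
def fraction_of_requests_affected(p_slow_rpc: float, rpcs_per_request: int) -> float:
    """Probability that at least one RPC in a request is slow.

    Assumes each RPC is slow independently with probability p_slow_rpc.
    """
    return 1.0 - (1.0 - p_slow_rpc) ** rpcs_per_request


# O(100) RPCs per request, as on the slide.
for p in (1e-2, 1e-6):
    share = fraction_of_requests_affected(p, 100)
    print(f"slow RPC rate {p:g}: {share:.2%} of requests affected")
```

With 100 RPCs per request this gives roughly 63% for a 1/100 slow RPC and roughly 0.01% for a 1/1M slow RPC.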
Latency of RPCs follows a complex distribution
- Many layers of many servers
- Continuously varying load
- Countless code paths
[Plot: fraction of RPCs vs. latency]
Even for the “simplest” components, data is rarely normal
Abnormal is the new normal!
Log scale latency distribution of a critical (but “simple”) component at Google
[Plot: fraction of RPCs vs. latency (log scale), annotated with mean, mean - sd, mean + sd, 66%, and the long tail]
When is better actually better?
A tighter distribution may be better than a better median
[Plot: fraction of RPCs vs. latency for two designs]
- Constant overhead at every request, but fast recovery
- Minimal overhead at every request, but expensive recovery
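To make the trade-off concrete, the sketch below compares the median and the 99th percentile of two made-up latency samples. The numbers, the names `steady` and `lumpy`, and the nearest-rank `percentile` helper are all illustrative assumptions, not data from the talk:

```python
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(samples)
    rank = max(1, (len(ordered) * pct + 99) // 100)  # ceiling division
    return ordered[rank - 1]


# Hypothetical latencies in ms for 1,000 requests under each design.
steady = [12] * 990 + [20] * 10   # constant overhead, fast recovery
lumpy = [10] * 980 + [500] * 20   # minimal overhead, expensive recovery

for name, data in (("steady", steady), ("lumpy", lumpy)):
    print(name, "median:", statistics.median(data), "p99:", percentile(data, 99))
```

In this made-up sample the second design has the better median (10 ms vs. 12 ms) but a far worse 99th percentile (500 ms vs. 12 ms), which is the situation the slide warns about.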
Challenges
Normal distributions are rare at Google
- Optimizing only for the “common case” is inadequate
Many statistical truths assume properties of the data
- Ensure that analysis is appropriate for the shape of the data
But...don’t invent statistics; just use it correctly!
Lesson 2:
Prepare for the storm
The long tail also shows up in resource usage
Two causes for peaks: (i) load; (ii) unusual events
Must reserve resources for peaks
[Plot: CPU usage over time]
Cause 1: load varies over time
Users are not “randomly” spread out over time zones
Who is using Gmail here?
Overlapping working hours between time zones (e.g., North America and Europe) cause large usage peaks
Cause 2: storms!
Must plan for hardware and software updates
What happened here?
[Plot: CPU usage over time]
Consequences of Lessons 1 and 2
Aggregate metrics (e.g., mean latency) cannot distinguish between:
[Figure] and [Figure]
We must reason with traces for long-tail events
Challenges with traces
Profiling tools are (mostly) ok; now we must invest in tracing
- Different traces may use different clocks
○ We use large amounts of data to do time alignment after the fact
- Reasoning is hard and laborious
○ Traces from a single machine may be 100K events per second
○ We use a language based on temporal logic to reason over traces
- Need to coordinate tracing across machines
○ Use coordinated-bursty tracing
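The slides do not say how the after-the-fact time alignment is done. As one illustration of the general idea, the sketch below uses the standard NTP-style offset estimate over many RPCs observed between two machines and takes the median to damp asymmetric network delays. The function name and the tuple layout are assumptions for this example, not the authors' method:

```python
import statistics


def estimate_clock_offset(rpcs):
    """Estimate how far the server clock runs ahead of the client clock.

    rpcs: iterable of (client_send, server_recv, server_send, client_recv)
    timestamps, the first and last read from the client clock and the
    middle two from the server clock.
    """
    offsets = [((t1 - t0) + (t2 - t3)) / 2 for t0, t1, t2, t3 in rpcs]
    return statistics.median(offsets)


# Hypothetical trace: the server clock is 5 time units ahead of the client.
trace = [
    (100, 106, 108, 104),  # symmetric network delay
    (200, 206, 207, 205),  # slower response path
    (300, 308, 309, 305),  # slower request path
]
print(estimate_clock_offset(trace))
```

Using more RPC pairs makes the median less sensitive to any single asymmetric exchange, which is one reason large amounts of trace data help with alignment.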
Lesson 3:
Ask nicely!
Is your pattern of access reasonable?
[Diagram: a client asks Gmail “my message please”; Gmail answers “here you go”]
[Diagram: a sync client asks Gmail for “all 50,000 messages NOW”; Gmail answers “Aaargh!”]
A single not-so-nice request can degrade many requests
Some problems are more subtle
[Diagram: a Gmail search for "Chicken and Egg" is expected to return results for "Chicken and Egg", but after rewriting it returns results for "(Chicken OR Chickens OR ...) and Egg"]
Query rewriting can dramatically inflate a small request
Any layer can potentially overwhelm the next layer
[Diagram: Gmail front end to Gmail storage layer, lots of small reads]
[Diagram: Gmail front end to Gmail storage layer, fewer but larger reads]
Requests to a layer should match its strengths
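One way to turn many small reads into fewer, larger ones is to coalesce reads whose byte ranges overlap or nearly touch before sending them to the next layer. The sketch below is illustrative only: `coalesce_reads` and `max_gap` are hypothetical names, and the slides do not describe how Gmail changed its request pattern.

```python
def coalesce_reads(ranges, max_gap=0):
    """Merge (offset, length) reads that overlap or lie within max_gap bytes.

    Returns a sorted list of (offset, length) covering every input range.
    """
    merged = []
    for start, length in sorted(ranges):
        end = start + length
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous read: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e - s) for s, e in merged]


small_reads = [(0, 10), (10, 10), (100, 5), (25, 5)]
print(coalesce_reads(small_reads, max_gap=8))
```

A larger `max_gap` trades some wasted bytes for fewer requests; which side of that trade suits the lower layer depends on its strengths.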
Challenges
Our APIs capture functionality, not performance
How do we know that a method that we are calling...
- ...makes expensive RPC calls?
- ...acquires locks?
- ...performs IO?
Can we express and reason over performance in our APIs?
Lesson 4:
Share!
Lock contention often causes long-tail latency
A burst can cause contention when there is normally none
Layer interactions may aggravate or alleviate contention
[Diagram: storage layer and the low-level layer for accessing disk]
An inefficient layer presents an easier request stream to the next layer
[Two plots: number of requests over time]
Remember Little’s law!
If processing a request holds a lock for 100ms we cannot process more than 10 requests per second
Example: 5 qps, each request holding the lock for 0.1 sec
Average queue depth: 5 * 0.1 = 0.5
If we want to double parallelism we must halve holding time
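The arithmetic on this slide can be checked with Little's law, L = λW: the average number of requests in the system equals the arrival rate times the time each request spends there. A small Python sketch; the function names are illustrative:

```python
def max_throughput_qps(hold_time_s: float, concurrent_holders: int = 1) -> float:
    """Upper bound on requests per second through a resource that each
    request holds for hold_time_s, with at most concurrent_holders at once."""
    return concurrent_holders / hold_time_s


def average_in_flight(arrival_rate_qps: float, hold_time_s: float) -> float:
    """Little's law: L = lambda * W."""
    return arrival_rate_qps * hold_time_s


print(max_throughput_qps(0.1))       # lock held for 100 ms
print(average_in_flight(5, 0.1))     # 5 qps, 0.1 s hold time
print(max_throughput_qps(0.05))      # halving the hold time
```

For a single lock held for 100 ms the bound is 10 requests per second, and halving the hold time to 50 ms doubles that bound to 20.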
Lesson 5:
Confront weaknesses
Caches can hide the latency of “weak” components...
[Diagram: fast component, cache, slow component]
...but it is often much better to fix the root cause!
Why should we care if caches reduce latency?
[Diagram: Gmail’s storage layer, cache, file system layer]
Glass is half full... Wow, a 98.5% cache hit rate!
Glass is half empty... Wow, why are we asking for the same data 98.5% of the time?
Fix the problem not the symptom
If a cache is performing too well, something is wrong with the requests
- Fix request pattern
- Remove cache
- Use fewer resources
- Cleaner code
Caches are great abuse detectors!
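A rough way to use request logs as an abuse detector is to measure how often the same key is requested again and which keys dominate. The sketch below assumes a log that is simply a sequence of requested keys; `repeat_request_report` is a hypothetical helper, and the "repeat rate" it reports is the hit rate an unbounded cache would achieve, not a measurement from any real cache:

```python
from collections import Counter


def repeat_request_report(keys, top=3):
    """Return (repeat_rate, most_requested) for a sequence of requested keys.

    repeat_rate is the fraction of requests for a key already seen earlier.
    """
    counts = Counter(keys)
    total = sum(counts.values())
    if total == 0:
        return 0.0, []
    repeats = total - len(counts)
    return repeats / total, counts.most_common(top)


# Hypothetical log: one key is fetched again and again.
log = ["msg:42"] * 98 + ["msg:7", "msg:9"]
rate, hot = repeat_request_report(log, top=1)
print(f"repeat rate {rate:.0%}, hottest key {hot[0]}")
```

A very high repeat rate concentrated on a few keys suggests looking at the caller that keeps asking, rather than at the cache.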
Lesson 6:
Get your priorities right
Priorities only matter in a crunch
[Diagram: high-prio and low-prio queues in front of a resource]
Short queues: no problem!
Long queues: big problem!
Overprovisioned resources mask poor priority settings
Setting priorities is hard!
[Diagram: high-prio and low-prio queues]
First try: put only user-facing requests in high priority
...but that can easily backfire when there are dependencies
[Timeline of high-prio and low-prio requests. Labels: queued, run, waiting for lower-priority req, run]
Avoiding priority inversion makes for a complex priority model
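A textbook remedy for priority inversion is priority inheritance: a request that others are waiting on temporarily runs at the highest priority among its waiters. The sketch below illustrates that general technique only; it is not a description of Gmail's priority model, and the request names and the `effective_priorities` helper are hypothetical:

```python
def effective_priorities(priority, waits_on):
    """Propagate priorities along wait-for edges (priority inheritance).

    priority: dict mapping request -> int, higher means more urgent.
    waits_on: dict mapping a waiting request -> the request it is blocked on.
    """
    effective = dict(priority)
    changed = True
    while changed:  # repeat so that chains of dependencies are covered
        changed = False
        for waiter, blocker in waits_on.items():
            if effective[waiter] > effective[blocker]:
                effective[blocker] = effective[waiter]
                changed = True
    return effective


priority = {"user_read": 10, "index_update": 1, "compaction": 1, "backup": 1}
waits_on = {"user_read": "index_update", "index_update": "compaction"}
print(effective_priorities(priority, waits_on))
```

Here the user-facing read lifts both requests it transitively depends on, while the unrelated backup stays at low priority. Tracking these dependencies is part of what makes the priority model complex.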
Lesson 7:
Question everything
Suspect every layer, every component, every lock, ...
But even more so, look every gift horse in the mouth
If data looks too good to be true, it is probably wrong!
Rookie mistake: Compare Monday data to Sunday data!
The load varies day to day, hour to hour!
[Plot: requests per second over time]
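A simple guard against this mistake is to compare each measurement with the same weekday and hour one week earlier. The sketch below assumes samples keyed by timestamp; the data and the `week_over_week_change` helper are made up for illustration:

```python
from datetime import datetime, timedelta


def week_over_week_change(samples, ts):
    """Relative change of samples[ts] against the same weekday and hour
    one week earlier. samples maps datetime -> requests per second."""
    baseline = samples[ts - timedelta(weeks=1)]
    return (samples[ts] - baseline) / baseline


# Hypothetical requests-per-second readings at 14:00.
samples = {
    datetime(2024, 1, 1, 14): 1000.0,  # Monday
    datetime(2024, 1, 7, 14): 400.0,   # Sunday
    datetime(2024, 1, 8, 14): 1100.0,  # Monday, one week later
}
print(f"{week_over_week_change(samples, datetime(2024, 1, 8, 14)):+.0%}")
```

Against the previous Monday the load is up 10%; against the Sunday reading it would look like a 175% jump that says nothing about the system.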
Lesson 8:
Real life is real life; tests are tests!
Loadtest versus production
The loadtest executes the same binaries... but the load is different
[Plots: CPU usage during loadtest; CPU usage in production]
Why the differences?
- Real user patterns are more complex than what we can synthesize
○ Users use a variety of email clients, which affects requests
○ Users differ in their mailbox size and usage
■ e.g., Googlers have 10x more of everything compared to the “average” user
○ Users have different settings (e.g., filters) and usage styles (clean-inbox versus everything-in-inbox)
○ ...
- The loadtest attempts to model these…
○ but it cannot possibly model everything
Many performance problems must be debugged in production
Debugging in production
If only I knew X, I could get to the bottom of this puzzle
Sure… but only if
- Recording X does not violate any compliance, contractual, or privacy promises to our users
- It is carefully reviewed so that we do not introduce bugs or regressions
- And it still needs to go through a careful rollout process
Challenge is to infer what we really need from what we have
Call to action
Are you up to it?
- Trace analysis tools
○ How can we combine and analyze diverse sources of data?
○ How can we derive high-level knowledge from low-level data?
- Performance APIs
○ How can we express and check performance specifications?
- Focus on analyzing large production systems when possible
○ How can our students learn from and impact real systems?