Life lessons and datacenter performance analysis (PowerPoint presentation transcript)



SLIDE 1

Life lessons and datacenter performance analysis

Dan Ardelean, Amer Diwan, Rick Hank, Christian Kurmann, Balaji Raghavan, Matt Seegmiller

SLIDE 2

Google Confidential and Proprietary

The need to solve performance crimes

A performance crime is anything that unnecessarily increases:

  • Latency
  • Resource usage

Solving performance crimes is a necessity, not a luxury. This talk shares our experiences solving performance crimes in Gmail.

Performance crimes...

  • degrade end user experience
  • waste valuable resources (energy, cost, etc.)
SLIDE 3

But email is such a simple app … right?

[Diagram: a delivery service writing into a storage system that holds your email]

We have all used this...in the 1990s!

SLIDE 4

Email today is a lot more...

To enable sharing across applications (e.g., contacts), each component is a service.

[Diagram: the delivery service and storage system now sit among many services: spam filtering, virus detection, web client support, sync client support, backups, smart search, mail classification, labels/folders, filters, images/attachments, contacts, ...]

SLIDE 5

Each component is a service running in its own processes. This provides modularity, parallelism, and reliability.

[Diagram: Gmail and Calendar app logic issuing RPCs to authentication, contacts, body, and events services]

SLIDE 6

Lesson 1:

No RPC left behind

SLIDE 7

Each user request involves O(100) RPCs. We cannot ignore the rarely slow RPCs:

  • 1/100 slow RPC affects 63% of the requests
  • 1/1M slow RPC affects 0.01% of the requests

Blue moon... daily! We cannot ignore the rarely slow requests either:

  • 1/1M event affects O(10M) requests daily
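The arithmetic behind these bullets can be checked directly. A minimal sketch, taking the 100-RPCs-per-request figure from the slide and assuming RPC latencies are independent:

```python
def p_request_affected(p_slow_rpc: float, rpcs_per_request: int = 100) -> float:
    """Probability that at least one of a request's RPCs is slow,
    assuming independent RPC latencies."""
    return 1.0 - (1.0 - p_slow_rpc) ** rpcs_per_request

# A 1-in-100 slow RPC touches ~63% of requests...
print(f"{p_request_affected(1 / 100):.0%}")        # 63%
# ...while a 1-in-1M slow RPC touches ~0.01% of requests.
print(f"{p_request_affected(1 / 1_000_000):.2%}")  # 0.01%
```

With 100 RPCs per request, even a one-in-a-hundred slowdown is effectively the common case.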
SLIDE 8

Latency of RPCs follows a complex distribution

  • Many layers of many servers
  • Continuously varying load
  • Countless code paths

[Plot: latency distribution; axes: latency vs. fraction of RPCs]

SLIDE 9

Even for the “simplest” components, data is rarely normal. Abnormal is the new normal!

Log scale latency distribution of a critical (but “simple”) component at Google

[Plot: mean and mean ± sd marked on the distribution; the mean ± sd band (66%) misses the long tail. Axes: latency vs. fraction of RPCs]

SLIDE 10

When is better actually better? A tighter distribution may be better than a better median

[Plot: two latency distributions. One has constant overhead at every request but fast recovery; the other has minimal overhead at every request but expensive recovery. Axes: latency vs. fraction of RPCs]

SLIDE 11

Challenges

Normal distributions are rare at Google

  • Optimizing only for the “common case” is inadequate

Many statistical truths assume properties of the data

  • Ensure that analysis is appropriate for the shape of the data

But... don’t invent statistics; just use them correctly!
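One concrete way to keep the analysis appropriate for the shape of the data is to report percentiles rather than mean ± sd. A sketch on synthetic heavy-tailed latencies (the log-normal shape and its parameters are illustrative stand-ins, not Gmail's real distribution):

```python
import random

random.seed(42)
# Synthetic RPC latencies in ms: log-normal, a common heavy-tailed shape.
latencies = sorted(random.lognormvariate(mu=3.0, sigma=1.0) for _ in range(100_000))

def percentile(sorted_xs, q):
    """Nearest-rank percentile of an already-sorted sample."""
    idx = min(len(sorted_xs) - 1, int(q / 100 * len(sorted_xs)))
    return sorted_xs[idx]

mean = sum(latencies) / len(latencies)
p50, p99 = percentile(latencies, 50), percentile(latencies, 99)
# For heavy-tailed data the mean sits well above the median and says
# little about the tail; percentiles describe both the typical case and the tail.
print(f"mean={mean:.1f}ms  p50={p50:.1f}ms  p99={p99:.1f}ms")
```

On data like this, mean ± sd can even imply negative latencies, while p50/p99/p999 stay meaningful no matter the shape.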

SLIDE 12

Lesson 2:

Prepare for the storm

SLIDE 13

The long tail also shows up in resource usage. There are two causes for peaks: (i) load; (ii) unusual events.

We must reserve resources for peaks.

[Plot: CPU usage over time, with peaks]

SLIDE 14

Cause 1: load varies over time

Users are not “randomly” spread out over time zones

Who is using Gmail here?

Working-hour overlaps between time zones (e.g., North America and Europe) cause large usage peaks.

SLIDE 15

Cause 2: storms! We must plan for hardware and software updates.

What happened here?

[Plot: CPU usage over time, with an abrupt spike]

SLIDE 16

Consequences of Lessons 1 and 2

Aggregate metrics (e.g., mean latency) cannot distinguish between:

[Two latency distributions that share the same aggregate metrics but differ in shape]

We must reason with traces for long-tail events.

SLIDE 17

Challenges with traces

Profiling tools are (mostly) OK; now we must invest in tracing.

  • Different traces may use different clocks
    ○ We use large amounts of data to do time alignment after the fact
  • Reasoning is hard and laborious
    ○ Traces from a single machine may contain 100K events per second
    ○ We use a language based on temporal logic to reason over traces
  • Need to coordinate tracing across machines
    ○ Use coordinated-bursty tracing
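A sketch of the coordinated-bursty idea: if every machine derives its trace on/off decision from a (roughly) synchronized wall clock, bursts line up across machines with no coordination traffic. The constants and function names here are illustrative, not Google's implementation:

```python
import time

BURST_MS = 64      # trace for 64 ms out of...
PERIOD_MS = 4096   # ...every 4096 ms (a power of two keeps the check cheap)

def tracing_now(now_ms: int) -> bool:
    """Trace iff the low bits of wall-clock time fall inside the burst window.
    Machines with synchronized clocks all answer the same way."""
    return (now_ms % PERIOD_MS) < BURST_MS

def maybe_trace(event, now_ms=None):
    """Record `event` only during a burst; returns whether it was recorded."""
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    if not tracing_now(now_ms):
        return False
    # In a real system this would append `event` to a trace buffer.
    return True

# Two "machines" evaluating at the same wall-clock instant agree:
assert maybe_trace("rpc_start", now_ms=4096 * 7 + 10)       # inside a burst
assert not maybe_trace("rpc_start", now_ms=4096 * 7 + 100)  # outside
```

Because the decision is a pure function of the clock, traces captured during the same burst on different machines can later be stitched into one cross-machine view.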

SLIDE 18

Lesson 3:

Ask nicely!

SLIDE 19

Is your pattern of access reasonable?

[Cartoon: an interactive client asks Gmail for “my message please” and gets “here you go”; a sync client demands “all 50,000 messages NOW” and Gmail groans “Aaargh!”]

A single not-so-nice request can degrade many requests

SLIDE 20

Some problems are more subtle

[Diagram: a Gmail search for “Chicken and Egg” is rewritten into “(Chicken OR Chickens OR ...) and Egg” before it reaches the backend]

Query rewriting can dramatically inflate a small request

SLIDE 21

Any layer can potentially overwhelm the next layer

[Diagram: a Gmail front end issuing lots of small reads to the Gmail storage layer, versus one issuing fewer but larger reads]

Requests to a layer should match its strengths
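As an illustration of matching requests to a layer's strengths, many small reads can be coalesced into fewer, larger ones before they reach storage. All identifiers below are illustrative, not Gmail's actual API:

```python
def coalesce_reads(offsets, length, max_gap=4096):
    """Merge byte ranges [offset, offset+length) whose gaps are small
    enough that one larger read is cheaper than two round trips."""
    if not offsets:
        return []
    ranges = sorted((off, off + length) for off in offsets)
    merged = [list(ranges[0])]
    for start, end in ranges[1:]:
        if start - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)  # extend the current read
        else:
            merged.append([start, end])              # start a new read
    return [tuple(r) for r in merged]

# Four small reads become two larger ones:
print(coalesce_reads([0, 100, 200, 100_000], length=50))
# [(0, 250), (100000, 100050)]
```

The `max_gap` knob encodes the layer's strength: a disk-backed store happily over-reads a few KB to avoid an extra seek or round trip.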

SLIDE 22

Challenges

Our APIs capture functionality, not performance. How do we know that a method that we are calling...

  • ...makes expensive RPC calls?
  • ...acquires locks?
  • ...performs IO?

Can we express and reason over performance in our APIs?

SLIDE 23

Lesson 4:

Share!

SLIDE 24

Lock contention often causes long-tail latency. A burst can cause contention when there is normally none.

SLIDE 25

Layer interactions may aggravate or alleviate contention

[Diagram: a storage layer calling into a low-level layer for accessing disk]

An inefficient layer presents an easier request stream to the next layer.

[Two plots: requests over time, before and after the layer]

SLIDE 26

Remember Little’s law!

If processing a request holds a lock for 100 ms, we cannot process more than 10 requests per second.

At 5 qps with a 0.1 s hold time, the average queue depth is 5 × 0.1 = 0.5 (Little's law: L = λW).

If we want to double parallelism, we must halve the holding time.
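The numbers on the slide follow directly from Little's law, L = λW; a minimal sketch:

```python
# Little's law: average number in the system L = arrival rate λ × time in system W.
def avg_queue_depth(arrival_qps: float, hold_seconds: float) -> float:
    return arrival_qps * hold_seconds

# A fully serialized lock held for W seconds caps throughput at 1/W.
def max_qps_under_lock(hold_seconds: float) -> float:
    return 1.0 / hold_seconds

print(avg_queue_depth(5, 0.1))   # 5 qps × 0.1 s -> average depth 0.5
print(max_qps_under_lock(0.1))   # 100 ms hold   -> at most 10 qps
```

Halving the hold time to 50 ms doubles the bound to 20 qps, which is exactly the "double parallelism, halve holding time" trade.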

SLIDE 27

Lesson 5:

Confront weaknesses

SLIDE 28

Caches can hide the latency of “weak” components...

[Diagram: a cache in front of a slow component, alongside a fast component]

...but it is often much better to fix the root cause!

SLIDE 29

Why should we care if caches reduce latency?

[Diagram: a cache between Gmail’s storage layer and the file system layer]

Glass half full: wow, a 98.5% cache hit rate! Glass half empty: wow, why are we asking for the same data 98.5% of the time?

SLIDE 30

Fix the problem not the symptom

If a cache is performing too well, something is wrong with the requests.

Fix the request pattern, then remove the cache: fewer resources used, cleaner code.

Caches are great abuse detectors!
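A sketch of using a cache as an abuse detector: instead of celebrating the hit rate, surface the hottest keys, since they point at the callers re-requesting the same data. The 98.5% figure is from the slide; the keys and counts below are made up for illustration:

```python
from collections import Counter

def hit_rate_and_hot_keys(requested_keys, top=3):
    """Estimate cache hit rate (every repeat request is a hit for an
    unbounded cache) and report the most re-requested keys."""
    counts = Counter(requested_keys)
    hits = sum(c - 1 for c in counts.values())
    rate = hits / len(requested_keys)
    return rate, counts.most_common(top)

keys = ["inbox_index"] * 97 + ["msg_1", "msg_2", "msg_3"]
rate, hot = hit_rate_and_hot_keys(keys)
print(f"hit rate: {rate:.0%}, hottest: {hot[0]}")  # 96% of requests re-fetch inbox_index
```

The output names the misbehaving request pattern directly, which is the lead you need to fix the caller rather than grow the cache.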

SLIDE 31

Lesson 6:

Get your priorities right

SLIDE 32

Priorities only matter in a crunch

[Diagram: high- and low-priority queues in front of a resource. Short queues: no problem! Long queues: big problem!]

Overprovisioned resources mask poor priority settings

SLIDE 33

Setting priorities is hard!


First try: put only user-facing requests in high priority

SLIDE 34

...but that can easily backfire when there are dependencies

[Timeline: a high-priority request is queued, then stuck waiting for a lower-priority request, before it finally runs]

Avoiding priority inversion makes for a complex priority model

SLIDE 35

Lesson 7:

Question everything

SLIDE 36

Suspect every layer, every component, every lock, ...

But even more so, look every gift horse in the mouth

If data looks too good to be true, it is probably wrong!

SLIDE 37

Rookie mistake: Compare Monday data to Sunday data! The load varies day to day, hour to hour!

[Plot: requests per second over time, varying by day and by hour]
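A minimal guard against that rookie mistake is to baseline each sample against the same weekday and hour in prior weeks, so Mondays compare with Mondays. The helper name is illustrative:

```python
from datetime import datetime

def baseline_key(ts: datetime):
    """Bucket load samples by (weekday, hour): compare like with like."""
    return (ts.weekday(), ts.hour)

# Monday 9am compares with next Monday 9am, not with Sunday 9am:
assert baseline_key(datetime(2024, 1, 1, 9)) == baseline_key(datetime(2024, 1, 8, 9))
assert baseline_key(datetime(2024, 1, 7, 9)) != baseline_key(datetime(2024, 1, 8, 9))
```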

SLIDE 38

Lesson 8:

Real life is real life; tests are tests!

SLIDE 39

Loadtest versus production

The loadtest executes the same binaries... but the load is different.

[Two plots: CPU usage during loadtest vs. CPU usage in production]

SLIDE 40

Why the differences?

  • Real user patterns are more complex than what we can synthesize
    ○ Users use a variety of email clients, which affects requests
    ○ Users differ in their mailbox size and usage
      ■ e.g., Googlers are 10x more of everything compared to the “average” user
    ○ Users have different settings (e.g., filters) and usage styles (clean-inbox versus everything-in-inbox)
    ○ ...
  • The loadtest attempts to model these…
    ○ but it cannot possibly model everything

Many performance problems must be debugged in production

SLIDE 41

Debugging in production

“If only I knew X, I could get to the bottom of this puzzle.”

Sure… but only if:

  • Recording X does not violate any compliance, contractual, or privacy promises to our users
  • It is carefully reviewed so that we do not introduce bugs or regressions
  • And it still needs to go through a careful rollout process

Challenge is to infer what we really need from what we have

SLIDE 42

Call to action: are you up to it?

  • Trace analysis tools
    ○ How to combine and analyze diverse sources of data
    ○ How can we derive high-level knowledge from low-level data?
  • Performance APIs
    ○ How can we express and check performance specifications?
  • Focus on analyzing large production systems when possible
    ○ How can our students learn from and impact real systems?