Do You Really Know Your Response Times? Daniel Rolls, March 2017

SLIDE 1

Do You Really Know Your Response Times?

Daniel Rolls March 2017

SLIDE 2

Sky Over The Top Delivery

◮ Web Services
◮ Over The Top Asset Delivery
◮ NowTV/Sky Go
◮ Always up
◮ High traffic
◮ Highly concurrent

SLIDE 3

OTT Endpoints

[Diagram: a GET /stuff request to Our App, returning {“foo”: “bar”}]

◮ How much traffic is hitting that endpoint?
◮ How quickly are we responding to a typical customer?
◮ One customer complained we respond slowly. How slow do we get?
◮ What’s the difference between the fastest and slowest responses?
◮ I don’t care about anomalies, but how slow are the slowest 1%?

SLIDE 4

Collecting Response Times

[Diagram: My App sending measurements to a Response Time Collection System]

◮ Large volumes of network traffic
◮ Risk of losing data (the network may fail)
◮ Affects application performance
◮ Needs measuring itself!

SLIDE 5

Our Setup

[Diagram: Application instances 1 and 2 reporting to Graphite, visualised in Grafana]

SLIDE 6

Dropwizard Metrics Library: Types of Metric

◮ Counter
◮ Gauge — ‘instantaneous measurement of a value’
◮ Meter (counts, rates)
◮ Histogram — min, max, mean, stddev, percentiles
◮ Timer — Meter + Histogram
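As a rough illustration of what a Histogram snapshot reports, here is a stdlib-only Java sketch over a complete sample set. The class and method names are made up for illustration; the real library samples through a reservoir rather than keeping every value.

```java
import java.util.Arrays;

// Illustrative only: the summary statistics a metrics Histogram exposes,
// computed over an in-memory sample set of response times.
public class HistogramStats {
    private final long[] sorted;

    public HistogramStats(long[] samples) {
        this.sorted = samples.clone();
        Arrays.sort(this.sorted);
    }

    public long min() { return sorted[0]; }
    public long max() { return sorted[sorted.length - 1]; }

    public double mean() {
        double sum = 0;
        for (long s : sorted) sum += s;
        return sum / sorted.length;
    }

    public double stdDev() {
        // Population standard deviation of the samples.
        double m = mean(), acc = 0;
        for (long s : sorted) acc += (s - m) * (s - m);
        return Math.sqrt(acc / sorted.length);
    }
}
```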

SLIDE 7

Example Dashboard

SLIDE 8

Dropwizard Metrics

◮ Use Dropwizard and you get
  ◮ Metrics infrastructure for free
  ◮ Metrics from Cassandra and Dropwizard bundles for free
  ◮ Timers you can add just by annotating methods
◮ Ports exist for other languages
◮ Developers, architects, managers: everybody loves graphs
◮ We trust and depend on them
◮ We rarely understand them
◮ We lie to ourselves and to our managers with them

SLIDE 9

Goals of this talk

◮ Understand how we can measure service time latencies
◮ Ensure meaningful statistics are given back to managers
◮ Learn how to use appropriate dashboards for monitoring and alerting

SLIDE 10

What is the 99th Percentile Response Time?

?

SLIDE 11

What is the 99th Percentile?
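One common answer is the nearest-rank definition: the 99th percentile is the smallest recorded value that at least 99% of responses fall at or below. A minimal sketch (class and method names are hypothetical; interpolating definitions differ slightly):

```java
import java.util.Arrays;

public class P99 {
    // Nearest-rank method: sort, then take the value at rank ceil(p/100 * n).
    public static long percentile(long[] values, double p) {
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        // 100 samples: 99 responses of 20 ms and a single 500 ms outlier.
        long[] times = new long[100];
        Arrays.fill(times, 20);
        times[57] = 500;
        System.out.println(percentile(times, 99.0));  // 20: one outlier hides above p99
        System.out.println(percentile(times, 100.0)); // 500: only the max catches it
    }
}
```

Note how a single slow response out of 100 is invisible at p99; this is why the later slides keep coming back to the max.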

SLIDE 12

Our Setup

[Diagram: Application instances 1 and 2, each with its own reservoir, reporting to Graphite, visualised in Grafana]

SLIDE 13

Reservoirs

[Diagram: a reservoir holding 1000 elements]

SLIDE 14

Types of Reservoir

◮ Sliding window
◮ Time-based sliding window
◮ Exponentially decaying
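A sliding-window reservoir is the simplest of the three: keep only the last N measurements in a ring buffer, so statistics reflect recent traffic. A stdlib-only sketch of the idea (not Dropwizard's implementation):

```java
// Fixed-size sliding-window reservoir: holds the last N values,
// overwriting the oldest once the window is full.
public class SlidingWindowReservoir {
    private final long[] window;
    private long count = 0;

    public SlidingWindowReservoir(int size) {
        this.window = new long[size];
    }

    public synchronized void update(long value) {
        window[(int) (count++ % window.length)] = value; // overwrite oldest slot
    }

    public synchronized long[] snapshot() {
        int n = (int) Math.min(count, window.length);
        long[] out = new long[n];
        System.arraycopy(window, 0, out, 0, n);
        return out;
    }
}
```

The snapshot comes back in ring order, not arrival order; for percentile maths the order does not matter.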

SLIDE 15

Forward Decay

[Diagram: samples v1 = 25 ms, v2 = 38 ms, v3 = 45 ms, v4 = 57 ms arriving at times x1…x4, measured forward from a landmark time]

wᵢ = e^(α·xᵢ)

SLIDE 16

[Diagram: weights w1…w8 paired with values v1…v8, sorted by value]

Getting at the percentiles

◮ Normalise weights: Σᵢ wᵢ = 1
◮ Look up by normalised weight

Data retention

◮ Sorted map indexed by w ÷ random number
◮ Smaller indices removed first
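A sketch of that retention scheme, assuming each sample's priority is its weight divided by a uniform random draw and that the smallest priority is evicted first, as in Dropwizard's ExponentiallyDecayingReservoir. The class below is an illustrative simplification, not the real implementation:

```java
import java.util.Random;
import java.util.TreeMap;

// Simplified decaying reservoir: values are keyed by weight / random,
// and the lowest-priority entry is dropped once the reservoir is full.
public class DecayingReservoir {
    private final TreeMap<Double, Long> map = new TreeMap<>();
    private final int size;
    private final double alpha;
    private final Random random = new Random(42); // fixed seed for reproducibility
    private final long landmark = 0;              // measurement start, in seconds

    public DecayingReservoir(int size, double alpha) {
        this.size = size;
        this.alpha = alpha;
    }

    public void update(long value, long timestampSeconds) {
        double weight = Math.exp(alpha * (timestampSeconds - landmark));
        double priority = weight / random.nextDouble();
        map.put(priority, value);
        if (map.size() > size) {
            map.pollFirstEntry(); // evict the entry with the smallest priority
        }
    }

    public int count() { return map.size(); }
}
```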

SLIDE 17

Response Time Jumps for 4 Minutes

SLIDE 18

One Percent Rise from 20ms to 500ms

SLIDE 19

One Percent Rise from 20ms to 500ms

SLIDE 20

Trade-off

◮ Autonomous teams
  ◮ Know one app well
  ◮ Feel responsible for app performance
◮ But. . .
  ◮ Can’t know everything
  ◮ Will make mistakes with numbers
  ◮ We might even ignore mistakes

SLIDE 21

One Long Request Blocks New Requests

SLIDE 22

One Long Request Blocks New Requests

SLIDE 23

Spikes and Tower Blocks

SLIDE 24

Splitting Things Up

[Diagram: My App traffic split by client (iOS, Android, Web), each client split again by Brand A and Brand B]

SLIDE 25

Metric Imbalance Visualised

[Diagram: Reservoirs 1–3 holding 10 ms and 100 ms samples in different proportions, each reporting a max of 100 ms]

SLIDE 26

Metric Imbalance

◮ One pool gives more accurate results
◮ Multiple pools allow drilling down, but. . .
  ◮ Some pools may have inaccurate performance measurements
  ◮ Only those with sufficient rates should be analysed
  ◮ How can we narrow down on just those?
◮ Simpson’s Paradox

SLIDE 27

Simpson’s Paradox

Explanation

◮ Two variables have a positive correlation
◮ Grouped data shows a negative correlation
◮ There’s a lurking third variable
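A made-up numeric example (not from the talk) showing the effect with two systems and two brands: each subgroup of System Y is faster than the matching subgroup of System X, yet pooling the groups makes Y look slower, because Y receives far more of the slow brand-B traffic.

```java
// Hypothetical numbers illustrating Simpson's paradox with response times.
public class SimpsonsParadox {
    static double mean(double sum, double count) { return sum / count; }

    public static void main(String[] args) {
        // System X: 900 brand-A requests @ 20 ms, 100 brand-B requests @ 100 ms
        double xMean = mean(900 * 20 + 100 * 100, 1000); // 28 ms pooled
        // System Y: 100 brand-A requests @ 10 ms, 900 brand-B requests @ 90 ms
        double yMean = mean(100 * 10 + 900 * 90, 1000);  // 82 ms pooled
        // Per group Y wins (10 < 20 and 90 < 100); pooled, Y looks worse.
        System.out.println(xMean + " ms vs " + yMean + " ms");
    }
}
```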

SLIDE 28

Simpson’s Paradox

[Diagram: traffic split between System X and System Y]

◮ Increasing traffic ⇒ X gets slower
◮ Increasing traffic ⇒ Y gets faster
◮ We move % traffic to System Y
◮ We wait for prime time peak
◮ System gets slower???
◮ 100% of brand B traffic still goes to X
◮ Results are pooled by client and brand
◮ Classic example: UC Berkeley gender bias

SLIDE 29

Lessons Learnt

◮ Want fast alerting?
  ◮ Use max
  ◮ If you don’t graph the max you are hiding the bad
◮ Don’t just look at fixed percentiles
  ◮ Understand the distribution of the data (HdrHistogram)
  ◮ A few fixed percentiles tell you very little as a test
◮ Monitor one metric per endpoint
◮ When aggregating response times
  ◮ Use maxSeries
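For instance, with Graphite's maxSeries function (metric paths here are hypothetical). Percentile series from different instances cannot be meaningfully averaged, but taking the max across instances at least never understates the worst one:

```text
# Aggregate per-instance response times across application instances.
maxSeries(stats.app.*.requests.get_stuff.max)   # worst max across instances
maxSeries(stats.app.*.requests.get_stuff.p99)   # worst per-instance p99
# avoid: averageSeries(...p99) — averaged percentiles hide slow instances
```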

SLIDE 30

So We’re Living a Lie, Does it Matter?

SLIDE 31

Conclusions and Thoughts

◮ Don’t immediately assume numbers on dashboards are meaningful
◮ Understand what you are graphing
◮ Test assumptions
◮ Provide these tools and developers will confidently use them
  ◮ Although maybe not correctly!
  ◮ Most developers are not mathematicians
◮ Keep it simple!
◮ Know which numbers are real and which are lies!

SLIDE 32

Thank you