Do You Really Know Your Response Times? Daniel Rolls, March 2017

SLIDE 1

Do You Really Know Your Response Times?

Daniel Rolls March 2017

SLIDE 2

Sky Over The Top Delivery

◮ Web Services
◮ Over The Top Asset Delivery
◮ NowTV/Sky Go
◮ Always up
◮ High traffic
◮ Highly concurrent

SLIDE 3

OTT Endpoints

[Diagram: a GET /stuff request to Our App, returning {“foo”: “bar”}]

◮ How much traffic is hitting that endpoint?
◮ How quickly are we responding to a typical customer?
◮ One customer complained we respond slowly. How slow do we get?
◮ What’s the difference between the fastest and slowest responses?
◮ I don’t care about anomalies, but how slow are the slowest 1%?

SLIDE 4

Collecting Response Times

[Diagram: My App sending measurements to a Response Time Collection System]

◮ Large volumes of network traffic
◮ Risk of losing data (the network may fail)
◮ Affects application performance
◮ Needs measuring itself!

SLIDE 5

Our Setup

[Diagram: Application instances 1 and 2 reporting to Graphite, visualised in Grafana]

SLIDE 6

Dropwizard Metrics Library: Types of Metric

◮ Counter
◮ Gauge — ‘instantaneous measurement of a value’
◮ Meter (counts, rates)
◮ Histogram — min, max, mean, stddev, percentiles
◮ Timer — Meter + Histogram
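As a rough illustration of what a Histogram snapshot reports, here is a stdlib-only Java sketch over a complete sample set. The class and method names are made up for illustration; the real library samples through a reservoir rather than keeping every value.

```java
import java.util.Arrays;

// Illustrative only: the summary statistics a metrics Histogram exposes,
// computed over an in-memory sample set of response times.
public class HistogramStats {
    private final long[] sorted;

    public HistogramStats(long[] samples) {
        this.sorted = samples.clone();
        Arrays.sort(this.sorted);
    }

    public long min() { return sorted[0]; }
    public long max() { return sorted[sorted.length - 1]; }

    public double mean() {
        double sum = 0;
        for (long s : sorted) sum += s;
        return sum / sorted.length;
    }

    public double stdDev() {
        // Population standard deviation of the samples.
        double m = mean(), acc = 0;
        for (long s : sorted) acc += (s - m) * (s - m);
        return Math.sqrt(acc / sorted.length);
    }
}
```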

SLIDE 7

Example Dashboard

SLIDE 8

Dropwizard Metrics

◮ Use Dropwizard and you get
  ◮ Metrics infrastructure for free
  ◮ Metrics from Cassandra and Dropwizard bundles for free
  ◮ Timers you can add just by annotating methods
◮ Ports exist for other languages
◮ Developers, architects, managers: everybody loves graphs
◮ We trust and depend on them
◮ We rarely understand them
◮ We lie to ourselves and to our managers with them

SLIDE 9

Goals of this talk

◮ Understand how we can measure service time latencies
◮ Ensure meaningful statistics are given back to managers
◮ Learn how to use appropriate dashboards for monitoring and alerting

SLIDE 10

What is the 99th Percentile Response Time?

?

SLIDE 11

What is the 99th Percentile?
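One common answer is the nearest-rank definition: the 99th percentile is the smallest recorded value that at least 99% of responses fall at or below. A minimal sketch (class and method names are hypothetical; interpolating definitions differ slightly):

```java
import java.util.Arrays;

public class P99 {
    // Nearest-rank method: sort, then take the value at rank ceil(p/100 * n).
    public static long percentile(long[] values, double p) {
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        // 100 samples: 99 responses of 20 ms and a single 500 ms outlier.
        long[] times = new long[100];
        Arrays.fill(times, 20);
        times[57] = 500;
        System.out.println(percentile(times, 99.0));  // 20: one outlier hides above p99
        System.out.println(percentile(times, 100.0)); // 500: only the max catches it
    }
}
```

Note how a single slow response out of 100 is invisible at p99; this is why the later slides keep coming back to the max.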

SLIDE 12

Our Setup

[Diagram: Application instances 1 and 2, each with its own reservoir, reporting to Graphite, visualised in Grafana]

SLIDE 13

Reservoirs

[Diagram: a reservoir holding 1000 elements]

SLIDE 14

Types of Reservoir

◮ Sliding window
◮ Time-based sliding window
◮ Exponentially decaying
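A sliding-window reservoir is the simplest of the three: keep only the last N measurements in a ring buffer, so statistics reflect recent traffic. A stdlib-only sketch of the idea (not Dropwizard's implementation):

```java
// Fixed-size sliding-window reservoir: holds the last N values,
// overwriting the oldest once the window is full.
public class SlidingWindowReservoir {
    private final long[] window;
    private long count = 0;

    public SlidingWindowReservoir(int size) {
        this.window = new long[size];
    }

    public synchronized void update(long value) {
        window[(int) (count++ % window.length)] = value; // overwrite oldest slot
    }

    public synchronized long[] snapshot() {
        int n = (int) Math.min(count, window.length);
        long[] out = new long[n];
        System.arraycopy(window, 0, out, 0, n);
        return out;
    }
}
```

The snapshot comes back in ring order, not arrival order; for percentile maths the order does not matter.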

SLIDE 15

Forward Decay

[Diagram: samples v1 = 25 ms, v2 = 38 ms, v3 = 45 ms, v4 = 57 ms arriving at times x1…x4, measured forward from a landmark time]

wᵢ = e^(α·xᵢ)

SLIDE 16

[Diagram: weights w1…w8 paired with values v1…v8, sorted by value]

Getting at the percentiles

◮ Normalise weights: Σᵢ wᵢ = 1
◮ Look up by normalised weight

Data retention

◮ Sorted map indexed by w ÷ random number
◮ Smaller indices removed first
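A sketch of that retention scheme, assuming each sample's priority is its weight divided by a uniform random draw and that the smallest priority is evicted first, as in Dropwizard's ExponentiallyDecayingReservoir. The class below is an illustrative simplification, not the real implementation:

```java
import java.util.Random;
import java.util.TreeMap;

// Simplified decaying reservoir: values are keyed by weight / random,
// and the lowest-priority entry is dropped once the reservoir is full.
public class DecayingReservoir {
    private final TreeMap<Double, Long> map = new TreeMap<>();
    private final int size;
    private final double alpha;
    private final Random random = new Random(42); // fixed seed for reproducibility
    private final long landmark = 0;              // measurement start, in seconds

    public DecayingReservoir(int size, double alpha) {
        this.size = size;
        this.alpha = alpha;
    }

    public void update(long value, long timestampSeconds) {
        double weight = Math.exp(alpha * (timestampSeconds - landmark));
        double priority = weight / random.nextDouble();
        map.put(priority, value);
        if (map.size() > size) {
            map.pollFirstEntry(); // evict the entry with the smallest priority
        }
    }

    public int count() { return map.size(); }
}
```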

SLIDE 17

Response Time Jumps for 4 Minutes

SLIDE 18

One Percent Rise from 20ms to 500ms

SLIDE 19

One Percent Rise from 20ms to 500ms

SLIDE 20

Trade-off

◮ Autonomous teams
  ◮ Know one app well
  ◮ Feel responsible for app performance
◮ But. . .
  ◮ Can’t know everything
  ◮ Will make mistakes with numbers
  ◮ We might even ignore mistakes

SLIDE 21

One Long Request Blocks New Requests

SLIDE 22

One Long Request Blocks New Requests

SLIDE 23

Spikes and Tower Blocks

SLIDE 24

Splitting Things Up

[Diagram: My App traffic split by client (iOS, Android, Web), each client split again by Brand A and Brand B]

SLIDE 25

Metric Imbalance Visualised

[Diagram: Reservoirs 1–3 holding 10 ms and 100 ms samples in different proportions, each reporting a max of 100 ms]

SLIDE 26

Metric Imbalance

◮ One pool gives more accurate results
◮ Multiple pools allow drilling down, but. . .
  ◮ Some pools may have inaccurate performance measurements
  ◮ Only those with sufficient rates should be analysed
  ◮ How can we narrow down on just those?
◮ Simpson’s Paradox

SLIDE 27

Simpson’s Paradox

Explanation

◮ Two variables have a positive correlation
◮ Grouped data shows a negative correlation
◮ There’s a lurking third variable
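A made-up numeric example (not from the talk) showing the effect with two systems and two brands: each subgroup of System Y is faster than the matching subgroup of System X, yet pooling the groups makes Y look slower, because Y receives far more of the slow brand-B traffic.

```java
// Hypothetical numbers illustrating Simpson's paradox with response times.
public class SimpsonsParadox {
    static double mean(double sum, double count) { return sum / count; }

    public static void main(String[] args) {
        // System X: 900 brand-A requests @ 20 ms, 100 brand-B requests @ 100 ms
        double xMean = mean(900 * 20 + 100 * 100, 1000); // 28 ms pooled
        // System Y: 100 brand-A requests @ 10 ms, 900 brand-B requests @ 90 ms
        double yMean = mean(100 * 10 + 900 * 90, 1000);  // 82 ms pooled
        // Per group Y wins (10 < 20 and 90 < 100); pooled, Y looks worse.
        System.out.println(xMean + " ms vs " + yMean + " ms");
    }
}
```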

SLIDE 28

Simpson’s Paradox

[Diagram: traffic split between System X and System Y]

◮ Increasing traffic ⇒ X gets slower
◮ Increasing traffic ⇒ Y gets faster
◮ We move % traffic to System Y
◮ We wait for prime time peak
◮ System gets slower???
◮ 100% of brand B traffic still goes to X
◮ Results are pooled by client and brand
◮ Classic example: UC Berkeley gender bias

SLIDE 29

Lessons Learnt

◮ Want fast alerting?
  ◮ Use max
  ◮ If you don’t graph the max you are hiding the bad
◮ Don’t just look at fixed percentiles
  ◮ Understand the distribution of the data (HdrHistogram)
  ◮ A few fixed percentiles tell you very little as a test
◮ Monitor one metric per endpoint
◮ When aggregating response times
  ◮ Use maxSeries
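For instance, with Graphite's maxSeries function (metric paths here are hypothetical). Percentile series from different instances cannot be meaningfully averaged, but taking the max across instances at least never understates the worst one:

```text
# Aggregate per-instance response times across application instances.
maxSeries(stats.app.*.requests.get_stuff.max)   # worst max across instances
maxSeries(stats.app.*.requests.get_stuff.p99)   # worst per-instance p99
# avoid: averageSeries(...p99) — averaged percentiles hide slow instances
```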

SLIDE 30

So We’re Living a Lie, Does it Matter?

SLIDE 31

Conclusions and Thoughts

◮ Don’t immediately assume numbers on dashboards are meaningful
◮ Understand what you are graphing
◮ Test assumptions
◮ Provide these tools and developers will confidently use them
  ◮ Although maybe not correctly!
  ◮ Most developers are not mathematicians
◮ Keep it simple!
◮ Know which numbers are real and which are lies!

SLIDE 32

Thank you