
Do You Really Know Your Response Times? Daniel Rolls, March 2017 (presentation transcript)



  1. Do You Really Know Your Response Times? Daniel Rolls March 2017

  2. Sky Over The Top Delivery ◮ Web Services ◮ Over The Top Asset Delivery ◮ NowTV/Sky Go ◮ Always up ◮ High traffic ◮ Highly concurrent

  3. OTT Endpoints (diagram: GET /stuff → Our App → { "foo": "bar" }) ◮ How much traffic is hitting that endpoint? ◮ How quickly are we responding to a typical customer? ◮ One customer complained we respond slowly. How slow do we get? ◮ What's the difference between the fastest and slowest responses? ◮ I don't care about anomalies, but how slow are the slowest 1%?

  4. Collecting Response Times My App Response Time Collection System ◮ Large volumes of network traffic ◮ Risk of losing data (network may fail) ◮ Affects application performance ◮ Needs measuring itself!

  5. Our Setup Application instance 1 Graphite Grafana Application instance 2

  6. Dropwizard Metrics Library: Types of Metric ◮ Counter ◮ Gauge — ‘instantaneous measurement of a value’ ◮ Meter (counts, rates) ◮ Histogram — min, max, mean, stddev, percentiles ◮ Timer — Meter + Histogram
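The statistics a Histogram (and hence a Timer) reports can be sketched in a few lines of Python. This is an illustrative stand-in, not Dropwizard's actual implementation; the nearest-rank percentile rule used here is one common choice:

```python
import statistics

def snapshot(samples):
    """Summarise response times the way a metrics histogram snapshot
    does: min, max, mean, stddev and percentiles (nearest-rank style)."""
    ordered = sorted(samples)

    def percentile(q):
        # nearest-rank style lookup: index q*N, clamped to the last element
        idx = min(len(ordered) - 1, int(q * len(ordered)))
        return ordered[idx]

    return {
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "stddev": statistics.pstdev(ordered),
        "p50": percentile(0.50),
        "p99": percentile(0.99),
    }

times_ms = [12, 14, 15, 15, 16, 18, 20, 25, 40, 500]
snap = snapshot(times_ms)
```

Note how a single 500 ms outlier barely moves the median but dominates both the mean and the 99th percentile, which is why the deck keeps asking which of these numbers you are actually looking at.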

  7. Example Dashboard

  8. Dropwizard Metrics ◮ Use Dropwizard and you get ◮ Metrics infrastructure for free ◮ Metrics from Cassandra and Dropwizard bundles for free ◮ You can easily add timers to metrics just by adding annotations ◮ Ports exist for other languages ◮ Developers, architects, managers everybody loves graphs ◮ We trust and depend on them ◮ We rarely understand them ◮ We lie to ourselves and to our managers with them

  9. Goals of this talk ◮ Understand how we can measure service time latencies ◮ Ensure meaningful statistics are given back to managers ◮ Learn how to use appropriate dashboards for monitoring and alerting

  10. What is the 99th Percentile Response Time?

  11. What is the 99th Percentile?
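As a sketch of one standard answer to that question: under the nearest-rank definition, the q-th percentile is the smallest sample with at least q·N samples at or below it. The numbers below are invented for illustration:

```python
import math

def percentile_nearest_rank(samples, q):
    """q-th percentile: the smallest sample with at least
    ceil(q * N) samples at or below it (1-based nearest rank)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

# 100 requests taking 1..100 ms: the 99th percentile is the value
# that 99% of requests were at least as fast as
latencies = list(range(1, 101))
p99 = percentile_nearest_rank(latencies, 0.99)
```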

  12. Our Setup Application instance 1 Reservoir Graphite Grafana Application instance 2 Reservoir

  13. Reservoirs Reservoir (1000 elements)

  14. Types of Reservoir ◮ Sliding window ◮ Time-based sliding window ◮ Exponentially decaying
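The simplest of the three, a sliding-window reservoir, can be sketched with a bounded deque. The class name and window size below are illustrative, not Dropwizard's actual code:

```python
from collections import deque

class SlidingWindowReservoir:
    """Sketch of a sliding-window reservoir: keep only the most
    recent `size` measurements."""

    def __init__(self, size):
        self._values = deque(maxlen=size)

    def update(self, value):
        self._values.append(value)  # the oldest value drops off when full

    def snapshot(self):
        return sorted(self._values)

r = SlidingWindowReservoir(3)
for v in (10, 20, 30, 40):
    r.update(v)
# the window holds only the last three measurements
```

A time-based sliding window would evict by timestamp instead of by count; the exponentially decaying reservoir on the next slides keeps a weighted sample instead of a strict window.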

  15. Forward Decay (diagram: samples v1…v4 arrive at times x1…x4, measured forward from a landmark L) ◮ Each sample gets weight w_i = e^(α·x_i)
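The weight formula on this slide, w_i = e^(α·x_i) with x_i measured forward from the landmark L, can be computed directly. The α value and timestamps below are made up for illustration:

```python
import math

def forward_decay_weight(arrival_time, landmark, alpha):
    """w_i = e^(alpha * x_i), where x_i = arrival_time - landmark.
    Later arrivals get exponentially larger weights, so recent
    samples dominate the reservoir."""
    return math.exp(alpha * (arrival_time - landmark))

# illustrative values: landmark at t=0, alpha=0.1, samples 10 s apart
L = 0.0
weights = [forward_decay_weight(t, L, alpha=0.1) for t in (0, 10, 20)]
# each 10 s step multiplies the weight by e^(0.1 * 10) = e
```

Measuring time forward from a fixed landmark (rather than backward from "now") means existing weights never need rescaling as the clock advances; only the landmark is occasionally moved.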

  16. Getting at the percentiles (diagram: weights w1…w8 paired with values v1…v8, sorted by value) ◮ Normalise weights: Σᵢ wᵢ = 1 ◮ Lookup by normalised weight Data retention ◮ Sorted map indexed by w · random number ◮ Smaller indices removed first
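The percentile-lookup steps above (sort by value, normalise the weights, walk the cumulative weight) can be sketched as follows; this illustrates the idea rather than reproducing the reservoir's actual code:

```python
def weighted_percentile(values, weights, q):
    """Sort samples by value, normalise the weights so they sum to 1,
    then walk the cumulative weight until it reaches q."""
    pairs = sorted(zip(values, weights))
    total = sum(weights)
    cum = 0.0
    for value, weight in pairs:
        cum += weight / total
        if cum >= q:
            return value
    return pairs[-1][0]

vals = [10, 20, 30, 40]
# equal weights reduce to the ordinary percentile
median_equal = weighted_percentile(vals, [1, 1, 1, 1], 0.5)
# heavily weighting the slowest sample drags the median up
median_skewed = weighted_percentile(vals, [1, 1, 1, 10], 0.5)
```

This is why forward decay works: recent samples carry large weights, so the reported percentiles track recent behaviour even though old samples are still in the reservoir.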

  17. Response Time Jumps for 4 Minutes

  18. One Percent Rise from 20ms to 500ms

  19. One Percent Rise from 20ms to 500ms

  20. Trade-off ◮ Autonomous teams ◮ Know one app well ◮ Feel responsible for app performance ◮ But . . . ◮ Can’t know everything ◮ Will make mistakes with numbers ◮ We might even ignore mistakes

  21. One Long Request Blocks New Requests

  22. One Long Request Blocks New Requests

  23. Spikes and Tower Blocks

  24. Splitting Things Up (diagram: My App's metrics split by client — iOS, Android, Web — and by brand A/B)

  25. Metric Imbalance Visualised (diagram: three reservoirs of response-time samples; Reservoirs 2 and 3 hold mostly 100 ms values, Reservoir 1 mixes 100 ms and 10 ms values, with the max marked)

  26. Metric Imbalance ◮ One pool gives more accurate results ◮ Multiple pools allow drilling down, but . . . ◮ Some pools may have inaccurate performance measurements ◮ Only those with sufficient rates should be analysed ◮ How can we narrow down on just those? ◮ Simpson’s Paradox

  27. Simpson’s Paradox Explanation ◮ Two variables have a positive correlation ◮ Grouped data shows a negative correlation ◮ There’s a lurking third variable

  28. Simpson’s Paradox ◮ Increasing traffic ⇒ X gets slower ◮ Increasing traffic ⇒ Y gets faster ◮ We move % traffic to System Y ◮ We wait for prime time peak ◮ System gets slower??? ◮ 100% of brand B traffic still goes to X ◮ Results are pooled by client and brand ◮ Classic example: UC Berkeley gender bias
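The paradox on this slide can be reproduced with invented numbers: system Y is faster than X within each brand, yet because Y carries most of the slow brand-B traffic, its pooled mean looks worse. All figures below are made up for illustration:

```python
def pooled_mean(groups):
    """Mean latency over all requests, pooling the per-brand groups.
    Each group maps brand -> (request_count, mean_latency_ms)."""
    total = sum(count for count, _ in groups.values())
    return sum(count * mean for count, mean in groups.values()) / total

# invented traffic mix: Y is faster per brand but serves mostly brand B
system_x = {"brandA": (900, 50.0), "brandB": (100, 200.0)}
system_y = {"brandA": (100, 40.0), "brandB": (900, 180.0)}

faster_per_brand = all(
    system_y[b][1] < system_x[b][1] for b in ("brandA", "brandB")
)
# ...yet pooled_mean(system_y) > pooled_mean(system_x)
```

The lurking variable is the traffic mix: comparing pooled means across systems silently compares different brand populations.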

  29. Lessons Learnt ◮ Want fast alerting? Use max ◮ If you don’t graph the max, you are hiding the bad ◮ Don’t just look at fixed percentiles ◮ Understand the distribution of the data (HdrHistogram) ◮ A few fixed percentiles tell you very little as a test ◮ Monitor one metric per endpoint ◮ When aggregating response times, use maxSeries
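The maxSeries advice can be illustrated with a point-wise max across per-instance series. This is a sketch of what Graphite's maxSeries() function produces, not its implementation, and the latency series are invented:

```python
def max_series(*series):
    """Point-wise max across per-instance series, so a spike on any
    one instance survives aggregation instead of being averaged away."""
    return [max(points) for points in zip(*series)]

instance1 = [20, 22, 500, 21]  # this instance spikes to 500 ms
instance2 = [19, 21, 23, 20]
combined = max_series(instance1, instance2)
# averaging the third point would report 261.5 ms; maxSeries keeps 500
```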

  30. So We’re Living a Lie, Does it Matter?

  31. Conclusions and Thoughts ◮ Don’t immediately assume numbers on dashboards are meaningful ◮ Understand what you are graphing ◮ Test assumptions ◮ Provide these tools and developers will confidently use them ◮ Although maybe not correctly! ◮ Most developers are not mathematicians ◮ Keep it simple! ◮ Know which numbers are real and which are lies!

  32. Thank you
