
Avoiding alerts overload from microservices



  1. Avoiding alerts overload from microservices. Sarah Wells, Principal Engineer, Financial Times @sarahjwells

  2. Knowing when there’s a problem isn’t enough @sarahjwells

  3. You only want an alert when you need to take action

  4. Hello @sarahjwells

  5. [Slides 5 to 8: a diagram built up in four steps; only its labels 1, 2, 3 and 4 survive]

  9. Monitoring this system… @sarahjwells

  10. Microservices make it worse @sarahjwells

  11. “microservices (n,pl): an efficient device for transforming business problems into distributed transaction problems” @drsnooks

  12. The services *themselves* are simple… @sarahjwells

  13. There’s a lot of complexity around them @sarahjwells

  14. Why do they make monitoring harder? @sarahjwells

  15. You have a lot more services @sarahjwells

  16. 99 functional microservices 350 running instances @sarahjwells

  17. 52 non-functional services 218 running instances @sarahjwells

  18. That’s 568 separate running instances @sarahjwells

  19. If we checked each service every minute… @sarahjwells

  20. 817,920 checks per day @sarahjwells

  21. What about system checks? @sarahjwells

  22. 16,358,400 checks per day @sarahjwells

  23. “One-in-a-million” issues would hit us 16 times every day @sarahjwells

  24. Running containers on shared VMs reduces this to 92,160 system checks per day @sarahjwells

  25. For a total of 910,080 checks per day @sarahjwells

  26. It’s a distributed system @sarahjwells

  27. Services are not independent @sarahjwells

  28. http://devopsreactions.tumblr.com/post/122408751191/alerts-when-an-outage-starts

  29. You have to change how you think about monitoring @sarahjwells

  30. How can you make it better?

  31. 1. Build a system you can support @sarahjwells

  32. The basic tools you need @sarahjwells

  33. Log aggregation @sarahjwells

  34. Logs go missing or get delayed more now @sarahjwells

  35. Which means log-based alerts may miss stuff @sarahjwells

  36. Monitoring @sarahjwells

  37. Limitations of our Nagios integration… @sarahjwells

  38. No ‘service-level’ view @sarahjwells

  39. Default checks included things we couldn’t fix @sarahjwells

  40. A new approach for our container stack @sarahjwells

  41. We care about each service @sarahjwells

  42. We care about each VM @sarahjwells

  43. We care about unhealthy instances @sarahjwells

  44. Monitoring needs aggregating somehow @sarahjwells

  45. SAWS @sarahjwells

  46. Built by Silvano Dossan. See our Engine room blog: http://bit.ly/1GATHLy

  47. "I imagine most people do exactly what I do - create a Google filter to send all Nagios emails straight to the bin" @sarahjwells

  48. "Our screens have a viewing angle of about 10 degrees" @sarahjwells

  49. "It never seems to show the page I want" @sarahjwells

  50. Code at: https://github.com/muce/SAWS @sarahjwells

  51. Dashing @sarahjwells

  52. Graphing of metrics @sarahjwells

  53. https://www.flickr.com/photos/davidmasters/2564786205/

  54. The things that make those tools WORK @sarahjwells

  55. Effective log aggregation needs a way to find all related logs @sarahjwells

  56. Transaction ids tie all microservices together

  57. Make it easy for any language you use @sarahjwells

  58. @sarahjwells

  59. Services need to report on their own health @sarahjwells

  60. The FT healthcheck standard: GET http://{service}/__health

  61. The FT healthcheck standard: GET http://{service}/__health returns 200 if the service can run the healthcheck

  62. The FT healthcheck standard: GET http://{service}/__health returns 200 if the service can run the healthcheck; each check will return "ok": true or "ok": false

  63. Knowing about problems before your clients do @sarahjwells

  64. Synthetic requests tell you about problems early https://www.flickr.com/photos/jted/5448635109

  65. 2. Concentrate on the stuff that matters @sarahjwells

  66. It’s the business functionality you should care about @sarahjwells

  67. We care about whether content got published successfully

  68. When people call our APIs, we care about speed

  69. … we also care about errors

  70. But it's the end-to-end that matters https://www.flickr.com/photos/robef/16537786315/

  71. If you just want information, create a dashboard or report

  72. Checking the services involved in a business flow @sarahjwells

  73. /__health?categories=lists-publish

  74. 3. Cultivate your alerts @sarahjwells
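
Slides 15 to 25 walk through how many checks per day the system generates. The sketch below reproduces those figures. The factor of 20 system checks per instance per minute is not stated on the slides; it is inferred here from 16,358,400 / 817,920, and the 64 system checks per minute for shared VMs is likewise derived from 92,160 / 1,440. All variable names are illustrative.

```python
MINUTES_PER_DAY = 24 * 60  # one check per minute, as on slide 19

# Figures taken from slides 16 and 17.
functional_instances = 350
non_functional_instances = 218
instances = functional_instances + non_functional_instances

# Slide 20: one service-level check per instance per minute.
service_checks_per_day = instances * MINUTES_PER_DAY

# Slide 22 gives 16,358,400 system checks per day. The per-instance factor
# is an inference from that total, not a number quoted in the talk.
system_checks_per_day = 16_358_400
system_checks_per_instance = system_checks_per_day // service_checks_per_day

# Slide 23: expected daily hits for a "one-in-a-million" failure.
one_in_a_million_hits = system_checks_per_day / 1_000_000

# Slides 24 and 25: containers on shared VMs cut the system checks.
shared_vm_system_checks_per_day = 92_160
shared_vm_checks_per_minute = shared_vm_system_checks_per_day // MINUTES_PER_DAY
total_checks_per_day = service_checks_per_day + shared_vm_system_checks_per_day

if __name__ == "__main__":
    print(f"instances: {instances}")
    print(f"service checks/day: {service_checks_per_day:,}")
    print(f"system checks per instance per minute (inferred): {system_checks_per_instance}")
    print(f"one-in-a-million hits/day: {one_in_a_million_hits:.1f}")
    print(f"total checks/day on shared VMs: {total_checks_per_day:,}")
```

The point of the arithmetic is the order of magnitude: even after the reduction, close to a million checks run every day, so rare failures show up routinely.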

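
Slides 60 to 62 and 72 to 73 describe the healthcheck convention: `GET /__health` returns 200 whenever the healthcheck itself can run, each check reports `"ok": true` or `"ok": false`, and a `categories` query parameter narrows the view to one business flow. The sketch below is one possible reading of that behaviour, not the FT implementation. Only the `"ok"` field and the `lists-publish` category come from the slides; the `"checks"` and `"name"` fields, the function names, and the service names in the usage example are assumptions made for illustration.

```python
import json
from typing import Callable, Dict, List, Set, Tuple

Check = Callable[[], bool]


def run_healthcheck(checks: Dict[str, Check]) -> Tuple[int, str]:
    """Run every check and return (HTTP status, JSON body).

    The status is 200 as long as the healthcheck could run at all;
    failing checks are reported in the body, not via the status code.
    """
    results: List[dict] = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            # A check that raises is reported as failing, not as a 5xx.
            ok = False
        results.append({"name": name, "ok": ok})
    return 200, json.dumps({"checks": results})


def all_ok(body: str) -> bool:
    """True only if every check in a healthcheck body reports ok."""
    return all(check["ok"] for check in json.loads(body)["checks"])


def select_by_category(service_categories: Dict[str, Set[str]],
                       category: str) -> List[str]:
    """Names of the services tagged with the given category (hypothetical mapping)."""
    return sorted(name for name, cats in service_categories.items()
                  if category in cats)


if __name__ == "__main__":
    status, body = run_healthcheck({
        "database reachable": lambda: True,
        "queue reachable": lambda: False,
    })
    print(status, body, all_ok(body))

    # Hypothetical service names; only "lists-publish" appears in the talk.
    tagged = {
        "lists-api": {"lists-publish"},
        "image-api": {"image-publish"},
        "publisher": {"lists-publish", "image-publish"},
    }
    print(select_by_category(tagged, "lists-publish"))
```

Keeping the status at 200 while reporting failures in the body lets a monitoring tool distinguish "the service is too broken to answer" from "the service answered and says a dependency is unhealthy".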