How to build observable distributed systems

March 8th, 2018, QCon London. How to build observable distributed systems. Pierre Vincent (@PierreVincent, pvincent.io), SRE Manager at Poppulo.


  1. March 8th, 2018 – QCon London: How to build observable distributed systems (@PierreVincent, pvincent.io)

  3. Pierre Vincent, SRE Manager at Poppulo (@PierreVincent, pvincent.io)

  4. Reaching production is only the beginning

  5. No system is immune to failure. Be ready to recover.

  6. When distributing a system, we’re also distributing the places where things might go wrong

  7. Monitoring only applies to known failure modes. What about everything else?

  8. “Monitoring tells you whether the system works. Observability lets you ask why it's not working.” – Baron Schwartz

  9. Metrics, Healthchecks, Tracing, Logging

  10. Healthchecks

  11. Healthchecks: Is it running? Can it perform its task? Can it accept more work?

  12. Healthchecks: Broadcast, Register, Expose

  13. Healthchecks: GET http://1.2.3.4:8080/health responds 200 OK with { "service": "registration-service", "healthy": true, "workload": { "healthy": true }, "dependencies": [ { "name": "cassandra", "healthy": true }, { "name": "billing-svc", "healthy": true } ] }
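
The response on slide 13 could be assembled along these lines. This is a minimal sketch, not the speaker's implementation: the function name `build_health_report`, the idea of passing dependency checks as callables, and the 503 status for an unhealthy service are all assumptions (the slide only shows the 200 OK case). The JSON field names are taken from the slide.

```python
from typing import Callable, Dict, Tuple


def build_health_report(
    service: str,
    workload_healthy: bool,
    dependency_checks: Dict[str, Callable[[], bool]],
) -> Tuple[int, dict]:
    """Return (HTTP status, JSON-serialisable body) for a /health endpoint.

    Hypothetical helper: names and the 503 status are assumptions.
    """
    dependencies = []
    for name, check in dependency_checks.items():
        try:
            healthy = bool(check())
        except Exception:
            # A check that raises is reported as unhealthy rather than
            # crashing the healthcheck itself.
            healthy = False
        dependencies.append({"name": name, "healthy": healthy})

    overall = workload_healthy and all(d["healthy"] for d in dependencies)
    body = {
        "service": service,
        "healthy": overall,
        "workload": {"healthy": workload_healthy},
        "dependencies": dependencies,
    }
    # Assumption: 200 when healthy, 503 otherwise.
    return (200 if overall else 503), body


status, body = build_health_report(
    "registration-service",
    workload_healthy=True,
    dependency_checks={"cassandra": lambda: True, "billing-svc": lambda: True},
)
```

Whether a failing dependency should make the whole service report unhealthy is a judgment call; slide 14 warns that overzealous healthchecks can be counter-productive, so a real service might report dependency status without letting it drive the overall flag.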

  14. Overzealous Healthchecks can be counter-productive. Source: "HTTP Healthchecks for a Resilient Platform", Chris O’Dell, skeltonthatcher.com/blog/http-healthchecks-for-a-resilient-platform

  15. Metrics

  16. Metrics: System metrics (CPU usage), Application metrics (Error rates), Business metrics (Customer conversions)

  17. Metrics: sources (Servers / VMs, Appliances/Infra, Services) feed a Metrics collector and a Metrics query engine, which drive Dashboards and Alerts

  18. Metrics: Servers / VMs, Appliances/Infra and Services each expose /metrics, collected by Prometheus
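
To make the /metrics idea on slide 18 concrete, here is a hand-rolled sketch of a counter rendered in the Prometheus text exposition format. It is illustrative only: the `Counter` class and the metric name `http_requests_total` are assumptions, label-value escaping is omitted, and a real service would normally use an official Prometheus client library instead.

```python
from collections import defaultdict


class Counter:
    """Minimal labelled counter (hypothetical, not a Prometheus client)."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self._values = defaultdict(float)

    def inc(self, amount: float = 1.0, **labels) -> None:
        # Sort labels so the same label set always maps to the same series.
        key = tuple(sorted((k, str(v)) for k, v in labels.items()))
        self._values[key] += amount

    def render(self) -> str:
        """Render the counter as text suitable for a /metrics response body."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            if key:
                label_str = ",".join(f'{k}="{v}"' for k, v in key)
                lines.append(f"{self.name}{{{label_str}}} {value}")
            else:
                lines.append(f"{self.name} {value}")
        return "\n".join(lines) + "\n"


requests_total = Counter("http_requests_total", "Total HTTP requests")
requests_total.inc(status="200")
requests_total.inc(status="500")
requests_total.inc(status="500")
metrics_text = requests_total.render()
```

Keeping label values low-cardinality (status codes rather than customer IDs) matches the limitation called out on slide 19.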

  19. Watch out for over-reliance on metrics. Not every metric deserves an alert: limit alerting to user-impacting symptoms. Poor fine-grained debugging: limitations at high-cardinality, e.g. CustomerId. Real-time querying means some trade-offs on retention: not suitable for long-term trend analysis.

  20. Logging

  21. Logging: Making sense of (a lot of) logs: Centralised, Searchable, Correlated

  22. Log Correlation: a request crosses a graph of services (A, B, C, D, E, F, G, H, J), and every log line carries the same trace ID, a1b2c3:
      ERROR [svc=A][trace=a1b2c3] Failed to process order. Cause: Order process manager responded with 500
      ERROR [svc=F][trace=a1b2c3] Failed to complete order. Cause: Shipping service responded with 500
      INFO [svc=G][trace=a1b2c3] Items verified in stock
      ERROR [svc=H][trace=a1b2c3] Failed to save order. Cause: Cassandra timeout exception
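
One way to get the `[svc=...][trace=...]` prefix of slide 22 onto every log line is to carry the trace ID in a context variable and inject it with a logging filter. The sketch below uses only the Python standard library; the header name `X-Trace-Id`, the helper names and the logger naming scheme are assumptions, not something the slides specify.

```python
import contextvars
import io
import logging
import uuid

# Holds the trace ID of the request currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceFilter(logging.Filter):
    """Attach the service name and current trace ID to every record."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def filter(self, record: logging.LogRecord) -> bool:
        record.svc = self.service
        record.trace = trace_id_var.get()
        return True


def make_logger(service: str, stream) -> logging.Logger:
    logger = logging.getLogger(f"corr.{service}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s [svc=%(svc)s][trace=%(trace)s] %(message)s")
    )
    handler.addFilter(TraceFilter(service))
    logger.handlers = [handler]
    return logger


def handle_request(incoming_headers: dict, logger: logging.Logger) -> dict:
    """Reuse the caller's trace ID, or start a new one, and log with it."""
    # "X-Trace-Id" is a hypothetical header name.
    trace_id = incoming_headers.get("X-Trace-Id") or uuid.uuid4().hex[:6]
    trace_id_var.set(trace_id)
    logger.error("Failed to save order")
    # Headers to forward on any downstream call, so the next service
    # logs under the same trace ID.
    return {"X-Trace-Id": trace_id}


stream = io.StringIO()
outgoing = handle_request({"X-Trace-Id": "a1b2c3"}, make_logger("H", stream))
```

With every service following the same convention, searching the centralised logs for the trace ID returns the whole chain of lines for that request.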

  23. Plain log line: 2018-02-20T16:38:23+00:00 ERROR Read timed out ("Hmmm thanks… ?"). The same event as structured JSON, with fields grouped by the question they answer:
      When did it happen? timestamp: 2018-02-20T16:38:23+00:00
      What is the message? level: ERROR, log: Read timed out
      What is it? service: registration-service, team: events
      What is it running? commit: 542a8b8e, build: 542a8b8e.7, runtime: java-1.8.0_161
      Where is it? region: europe-west2, node: node_e79f3e52
      Who caused it? customerID: 55123, userID: 458
      Can I trace it? requestID: ec667cb45
      Any other info? ...
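
A structured log entry like the one on slide 23 can be produced with a custom formatter on top of Python's standard `logging` module. This is a sketch under assumptions: the `JsonFormatter` class, the `static_fields` argument and the use of `extra={"context": ...}` for per-request fields are illustrative choices; only the field names come from the slide.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each record as one JSON object per line (hypothetical helper)."""

    def __init__(self, static_fields: dict) -> None:
        super().__init__()
        # Fields that are the same for every line from this process,
        # e.g. service, team, commit, build, runtime, region, node.
        self.static_fields = static_fields

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(timespec="seconds"),
            "level": record.levelname,
            "log": record.getMessage(),
        }
        entry.update(self.static_fields)
        # Per-event fields passed via logger.error(..., extra={"context": {...}}).
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    JsonFormatter({"service": "registration-service", "team": "events"})
)
logger = logging.getLogger("structured.example")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.handlers = [handler]

logger.error(
    "Read timed out",
    extra={"context": {"customerID": 55123, "userID": 458, "requestID": "ec667cb45"}},
)
entry = json.loads(stream.getvalue())
```

Splitting static fields from per-event context keeps call sites short: the code that logs only supplies what it alone knows.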

  24. Structured logs unleash high-cardinality exploration: error rate spike isolated by build version; activity spike isolated for a single customer
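
The kind of exploration slide 24 describes amounts to grouping structured events by an arbitrary field. A log platform would do this with its query language; the sketch below shows the same idea in plain Python. The `count_by` helper and the sample events are made up for illustration (only the build `542a8b8e.7` and customer `55123` echo values from slide 23).

```python
from collections import Counter

# Hypothetical sample of already-parsed structured log events.
SAMPLE_EVENTS = [
    {"level": "ERROR", "build": "542a8b8e.7", "customerID": 55123},
    {"level": "ERROR", "build": "542a8b8e.7", "customerID": 55123},
    {"level": "INFO", "build": "542a8b8e.7", "customerID": 70001},
    {"level": "ERROR", "build": "0000000.1", "customerID": 70002},
    {"level": "INFO", "build": "0000000.1", "customerID": 55123},
]


def count_by(events, field, level=None):
    """Count events per value of `field`, optionally keeping one level only."""
    counts = Counter()
    for event in events:
        if level is not None and event.get("level") != level:
            continue
        counts[event.get(field, "unknown")] += 1
    return counts


errors_per_build = count_by(SAMPLE_EVENTS, "build", level="ERROR")
activity_per_customer = count_by(SAMPLE_EVENTS, "customerID")
```

Because the field is just a parameter, the same query works for build, customer, node or any other attribute present in the events.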

  25. Tracing

  26. Tracing: one Trace on a timeline from 0ms to 172ms (ticks at 50ms, 100ms, 150ms), made of Spans:
      event-mgt-api get_confirmed_attendees 172ms
      attendees-service get_attendees 73ms
      cassandra/select 54ms
      registration-service get_confirm_status 78ms
      mysql/select 41ms
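
The trace on slide 26 is a set of timed, nested spans. The sketch below records spans with a context manager to show the data model; it is a single-threaded toy, not a real tracing client, and the `Tracer` class and its fields are assumptions. Reading `cassandra/select` as a child of `attendees-service` is also an interpretation of the slide's layout.

```python
import time
from contextlib import contextmanager


class Tracer:
    """Toy in-process span recorder (hypothetical, single-threaded only)."""

    def __init__(self, clock=time.monotonic) -> None:
        self.clock = clock
        self.spans = []    # every span, in start order
        self._stack = []   # spans currently open, innermost last

    @contextmanager
    def span(self, service: str, operation: str):
        record = {
            "service": service,
            "operation": operation,
            # The innermost open span, if any, is this span's parent.
            "parent": self._stack[-1]["operation"] if self._stack else None,
            "start": self.clock(),
        }
        self.spans.append(record)
        self._stack.append(record)
        try:
            yield record
        finally:
            self._stack.pop()
            record["duration_ms"] = (self.clock() - record["start"]) * 1000


tracer = Tracer()
with tracer.span("event-mgt-api", "get_confirmed_attendees"):
    with tracer.span("attendees-service", "get_attendees"):
        with tracer.span("attendees-service", "cassandra/select"):
            pass  # the database call would happen here
```

In a distributed system the parent reference and trace ID travel with the request between services, which is what lets a tracing backend stitch spans from different processes into one timeline.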

  28. Known Unknowns (Monitoring & Resiliency): Healthchecks, Metrics (Alerting). Unknown Unknowns (Debugging & Exploration): Logs, Metrics (Queries), Events, Tracing.

  29. Usability of tooling is key to adoption: Cheap to instrument, Easy to explore, Reliable & Trustworthy

  30. Visibility builds trust but requires safety

  31. Visibility helps justify decisions

  32. Visibility enables operability

  33. Test (a little bit*) less. *Don’t spend all your time testing ...keep some for instrumenting

  34. Thank you! Pierre Vincent, SRE Manager at Poppulo (@PierreVincent, pvincent.io)
