How to build observable distributed systems

March 8th, 2018 – QCon London How to build observable distributed systems @PierreVincent pvincent.io


Pierre Vincent, SRE Manager at Poppulo

Reaching production is only the beginning

No system is immune to failure. Be ready to recover.

When distributing a system, we’re also distributing the places where things might go wrong

Monitoring only applies to known failure modes. What about everything else?

“Monitoring tells you whether the system works. Observability lets you ask why it's not working.” – Baron Schwartz

Metrics, Healthchecks, Tracing, Logging

Healthchecks

Healthchecks
Is it running?
Can it perform its task?
Can it accept more work?

Healthchecks
Broadcast, Register, Expose

Healthchecks
GET http://1.2.3.4:8080/health
200 OK
{
  "service": "registration-service",
  "healthy": true,
  "workload": { "healthy": true },
  "dependencies": [
    { "name": "cassandra", "healthy": true },
    { "name": "billing-svc", "healthy": true }
  ]
}
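A minimal sketch of how a service might expose a /health endpoint shaped like the payload above, using only the Python standard library. The probe functions, the 503 status for an unhealthy report, and the rule that any failing dependency marks the whole service unhealthy are assumptions for illustration, not something the slides prescribe.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_cassandra() -> bool:
    """Hypothetical probe; a real one might run a cheap query."""
    return True


def check_billing() -> bool:
    """Hypothetical probe; a real one might call billing-svc's own /health."""
    return True


# Dependency name -> probe function (names taken from the slide).
DEPENDENCY_CHECKS = {"cassandra": check_cassandra, "billing-svc": check_billing}


def build_health_report(service, workload_healthy, checks):
    """Build a report with the same fields as the slide's /health response."""
    dependencies = []
    for name, probe in checks.items():
        try:
            healthy = bool(probe())
        except Exception:
            healthy = False  # a probe that raises counts as unhealthy
        dependencies.append({"name": name, "healthy": healthy})
    # Assumption: the service is healthy only if everything is healthy.
    overall = workload_healthy and all(d["healthy"] for d in dependencies)
    return {
        "service": service,
        "healthy": overall,
        "workload": {"healthy": workload_healthy},
        "dependencies": dependencies,
    }


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        report = build_health_report("registration-service", True, DEPENDENCY_CHECKS)
        body = json.dumps(report).encode("utf-8")
        self.send_response(200 if report["healthy"] else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Blocking call; run it from the service's entry point."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling serve() starts the endpoint on port 8080. Whether a failing dependency should flip the top-level "healthy" flag is a design decision; reporting dependencies without failing the whole check is an equally valid choice.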

Overzealous healthchecks can be counter-productive
Source: HTTP Healthchecks for a Resilient Platform - Chris O’Dell, skeltonthatcher.com/blog/http-healthchecks-for-a-resilient-platform

Metrics

Metrics
System metrics: CPU usage
Application metrics: Error rates
Business metrics: Customer conversions

Metrics
Sources (Servers / VMs, Appliances/Infra, Services) → Metrics collector → Metrics query engine → Dashboards, Alerts

Metrics
Prometheus scrapes /metrics from Servers / VMs, Appliances/Infra and Services
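To make the /metrics idea concrete, here is a small sketch of a counter rendered in the Prometheus text exposition format. SimpleCounter is a hypothetical stand-in written for this example; in practice an official Prometheus client library would normally do this, and label-value escaping is left out for brevity.

```python
from collections import defaultdict


class SimpleCounter:
    """Hypothetical, minimal counter; not the Prometheus client library."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self._values = defaultdict(float)  # label set -> running total

    def inc(self, amount=1.0, **labels):
        # Sort labels so the same label set always maps to the same key.
        key = tuple(sorted(labels.items()))
        self._values[key] += amount

    def render(self):
        """Return the body a /metrics endpoint could serve for this counter."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            if key:
                label_str = ",".join(f'{k}="{v}"' for k, v in key)
                lines.append(f"{self.name}{{{label_str}}} {value}")
            else:
                lines.append(f"{self.name} {value}")
        return "\n".join(lines) + "\n"


requests_total = SimpleCounter("http_requests_total", "Total HTTP requests.")
requests_total.inc(status="200")
requests_total.inc(status="500")
print(requests_total.render())
```

Each distinct label combination becomes its own time series, which is why labels such as a customer ID get expensive quickly.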

Watch out for over-reliance on metrics
Not every metric deserves an alert: Limit alerting to user-impacting symptoms
Poor fine-grained debugging: Limitations at high-cardinality, e.g. CustomerId
Real-time querying means some trade-offs on retention: Not suitable for long-term trend analysis

Logging

Logging
Making sense of (a lot of) logs: Centralised, Searchable, Correlated

Log Correlation
Call graph of services A, B, C, D, E, F, G, H, J; trace ID a1b2c3 travels with the request
ERROR [svc=A][trace=a1b2c3] Failed to process order. Cause: Order process manager responded with 500
ERROR [svc=F][trace=a1b2c3] Failed to complete order. Cause: Shipping service responded with 500
INFO [svc=G][trace=a1b2c3] Items verified in stock
ERROR [svc=H][trace=a1b2c3] Failed to save order. Cause: Cassandra timeout exception
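One way to get the same trace ID onto every log line is sketched below, using Python's logging and contextvars modules. The X-Trace-Id header name, the six-character ID, and the helper names are assumptions made for this example; the slides do not say how the ID is carried between services.

```python
import contextvars
import io
import logging
import sys
import uuid

# Holds the trace ID for the request currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the current trace ID onto every record so the formatter can use it."""

    def filter(self, record):
        record.trace = trace_id_var.get()
        return True


def make_logger(service, stream=sys.stderr):
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter(f"%(levelname)s [svc={service}][trace=%(trace)s] %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


def handle_request(headers):
    """Reuse the caller's trace ID if present, otherwise start a new trace.

    Returns the headers to attach to downstream calls (hypothetical header name).
    """
    trace_id = headers.get("X-Trace-Id") or uuid.uuid4().hex[:6]
    trace_id_var.set(trace_id)
    return {"X-Trace-Id": trace_id}


def demo():
    """Log one line for service A while handling a request that carries a trace ID."""
    buf = io.StringIO()
    log = make_logger("A", stream=buf)
    downstream_headers = handle_request({"X-Trace-Id": "a1b2c3"})
    log.error("Failed to process order")
    return buf.getvalue(), downstream_headers
```

Searching the centralised logs for trace=a1b2c3 then returns the lines from every service that touched the request.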

2018-02-20T16:38:23+00:00 ERROR Read timed out
Hmmm thanks… ?

The same event as structured JSON:
When did it happen? timestamp: 2018-02-20T16:38:23+00:00
What is the message? level: ERROR, log: Read timed out
What is it? service: registration-service, team: events
What is it running? commit: 542a8b8e, build: 542a8b8e.7, runtime: java-1.8.0_161
Where is it? region: europe-west2, node: node_e79f3e52
Who caused it? customerID: 55123, userID: 458
Can I trace it? requestID: ec667cb45
Any other info? ...
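A sketch of how such a structured event could be emitted with Python's standard logging module. The field names follow the slide; JsonFormatter, the "context" key passed through extra, and the split between static and per-request fields are assumptions for this example.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter that renders each record as one JSON object."""

    def __init__(self, static_context):
        super().__init__()
        # Fields that never change for this process: service, build, region...
        self.static_context = dict(static_context)

    def format(self, record):
        event = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(timespec="seconds"),
            "level": record.levelname,
            "log": record.getMessage(),
        }
        event.update(self.static_context)
        # Per-request fields passed as logger.error(..., extra={"context": {...}}).
        event.update(getattr(record, "context", {}))
        return json.dumps(event)


def make_json_logger(static_context):
    logger = logging.getLogger(static_context.get("service", "app"))
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(static_context))
    logger.addHandler(handler)
    return logger
```

Usage might look like this, with values copied from the slide:

```python
log = make_json_logger({"service": "registration-service", "team": "events",
                        "build": "542a8b8e.7", "region": "europe-west2"})
log.error("Read timed out",
          extra={"context": {"customerID": 55123, "userID": 458,
                             "requestID": "ec667cb45"}})
```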

Structured logs unleash high-cardinality exploration
Error rate spike isolated by build version
Activity spike isolated for single customer
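As a rough illustration of that kind of exploration, the sketch below counts JSON log events by an arbitrary field, optionally filtered by level. count_by is a hypothetical helper, and a log platform's query language would normally do this; the point is only that any field, however high its cardinality, can become the grouping key.

```python
import json
from collections import Counter


def count_by(log_lines, field, level=None):
    """Count structured log events per value of `field`.

    Lines that are not JSON objects are skipped rather than treated as errors.
    """
    counts = Counter()
    for line in log_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(event, dict):
            continue
        if level is not None and event.get("level") != level:
            continue
        counts[event.get(field, "<missing>")] += 1
    return counts
```

Calling count_by(lines, "build", level="ERROR") would show whether an error spike is concentrated in one build, and count_by(lines, "customerID") whether an activity spike comes from a single customer.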

Tracing

Tracing
Trace (0ms to 172ms), made of spans:
event-mgt-api get_confirmed_attendees 172ms
  attendees-service get_attendees 73ms
    cassandra/select 54ms
  registration-service get_confirm_status 78ms
    mysql/select 41ms


Known Unknowns (Monitoring & Resiliency): Healthchecks, Metrics (Alerting)
Unknown Unknowns (Debugging & Exploration): Logs, Metrics (Queries), Events, Tracing

Usability of tooling is key to adoption
Cheap to instrument
Easy to explore
Reliable & Trustworthy

Visibility builds trust but requires safety

Visibility helps justify decisions

Visibility enables operability

Test (a little bit*) less
*Don’t spend all your time testing ...keep some for instrumenting

Thank you! Pierre Vincent, SRE Manager at Poppulo @PierreVincent pvincent.io
