How to build observable distributed systems



SLIDE 1

@PierreVincent

How to build observable distributed systems

March 8th, 2018 – QCon London @PierreVincent pvincent.io

SLIDE 2

SLIDE 3

Pierre Vincent

SRE Manager at Poppulo

@PierreVincent pvincent.io

SLIDE 4

Reaching production is only the beginning

SLIDE 5

No system is immune to failure

Be ready to recover

SLIDE 6

When distributing a system, we’re also distributing the places where things might go wrong

SLIDE 7

Monitoring only applies to known failure modes. What about everything else?

SLIDE 8

“Monitoring tells you whether the system works. Observability lets you ask why it's not working.”

– Baron Schwartz

SLIDE 9

Healthchecks
Metrics
Logging
Tracing

SLIDE 10

Healthchecks

SLIDE 11

Healthchecks

Is it running?
Can it perform its task?
Can it accept more work?

SLIDE 12

Healthchecks

Broadcast
Register
Expose

SLIDE 13

Healthchecks

/health

    GET http://1.2.3.4:8080/health
    200 OK

    {
      "service": "registration-service",
      "healthy": true,
      "workload": { "healthy": true },
      "dependencies": [
        { "name": "cassandra", "healthy": true },
        { "name": "billing-svc", "healthy": true }
      ]
    }
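A minimal sketch of how a service might assemble the /health payload shown above, using only the Python standard library. The function name `build_health_report`, the check callables, and the use of HTTP 503 for an unhealthy service are assumptions made for illustration; the slides only show the healthy 200 response and do not prescribe an implementation.

```python
import json
from typing import Callable, Dict, Tuple


def build_health_report(
    service: str,
    workload_check: Callable[[], bool],
    dependency_checks: Dict[str, Callable[[], bool]],
) -> Tuple[int, str]:
    """Run every check and return (HTTP status code, JSON body)."""

    def safe(check: Callable[[], bool]) -> bool:
        # A check that raises counts as unhealthy instead of crashing /health.
        try:
            return bool(check())
        except Exception:
            return False

    workload_ok = safe(workload_check)
    dependencies = [
        {"name": name, "healthy": safe(check)}
        for name, check in dependency_checks.items()
    ]
    # Assumption: the top-level flag requires the workload and all
    # dependencies to be healthy. This is the strictest possible policy.
    healthy = workload_ok and all(d["healthy"] for d in dependencies)
    body = {
        "service": service,
        "healthy": healthy,
        "workload": {"healthy": workload_ok},
        "dependencies": dependencies,
    }
    # Assumption: 503 signals "unhealthy"; the slide only shows the 200 case.
    return (200 if healthy else 503), json.dumps(body)


# Hypothetical checks standing in for real Cassandra / billing-svc probes.
status, body = build_health_report(
    "registration-service",
    workload_check=lambda: True,
    dependency_checks={"cassandra": lambda: True, "billing-svc": lambda: True},
)
print(status, body)
```

Whether a failing dependency should flip the top-level flag is a design decision, and a strict policy like this one can take healthy instances out of rotation when a shared dependency has a brief outage.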

SLIDE 14

Overzealous healthchecks can be counter-productive

Source: HTTP Healthchecks for a Resilient Platform - Chris O’Dell
skeltonthatcher.com/blog/http-healthchecks-for-a-resilient-platform

SLIDE 15

Metrics

SLIDE 16

Metrics

System metrics: CPU usage
Application metrics: Error rates
Business metrics: Customer conversions

SLIDE 17

Metrics

[Diagram: Servers / VMs, Appliances / Infra and Services feed a Metrics collector and a Metrics query engine, which drive Dashboards and Alerts]

SLIDE 18

Metrics

[Diagram: Servers / VMs, Appliances / Infra and Services each expose a /metrics endpoint, collected by Prometheus]
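To make the /metrics endpoints on this slide concrete, here is a minimal sketch that renders counters in the Prometheus text exposition format, using only the Python standard library. The `CounterRegistry` class and the metric names are hypothetical; a real service would more likely use an official Prometheus client library, and this sketch does not escape special characters in label values.

```python
from typing import Dict, List, Tuple

LabelPairs = Tuple[Tuple[str, str], ...]


class CounterRegistry:
    """Hypothetical in-memory store of counters, keyed by name and labels."""

    def __init__(self) -> None:
        self._values: Dict[Tuple[str, LabelPairs], float] = {}

    def inc(self, name: str, amount: float = 1.0, **labels: str) -> None:
        # Sort labels so the same label set always maps to the same series.
        key = (name, tuple(sorted(labels.items())))
        self._values[key] = self._values.get(key, 0.0) + amount

    def render(self) -> str:
        """Return the body a /metrics endpoint would serve as plain text."""
        lines: List[str] = []
        seen = set()
        for (name, labels), value in sorted(self._values.items()):
            if name not in seen:
                lines.append(f"# TYPE {name} counter")
                seen.add(name)
            label_text = ",".join(f'{k}="{v}"' for k, v in labels)
            series = f"{name}{{{label_text}}}" if label_text else name
            lines.append(f"{series} {value}")
        return "\n".join(lines) + "\n"


registry = CounterRegistry()
registry.inc("http_requests_total", method="GET", status="200")
registry.inc("http_requests_total", method="GET", status="500")
print(registry.render())
```

Serving the string returned by `render()` over HTTP at /metrics is what lets a collector pull the values on its own schedule.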

SLIDE 19

Watch out for over-reliance on metrics

Limitations at high-cardinality
    Poor fine-grained debugging, e.g. CustomerId

Not every metric deserves an alert
    Limit alerting to user-impacting symptoms

Real-time querying means some trade-offs on retention
    Not suitable for long-term trend analysis
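A rough illustration of why a label such as CustomerId strains a metrics system: the number of time series a single metric can produce is bounded by the product of the distinct values of each label. The label names and counts below are made up for the example, and the result is an upper bound, since not every combination occurs in practice.

```python
from math import prod
from typing import Mapping


def series_count(label_cardinalities: Mapping[str, int]) -> int:
    """Upper bound on time series for one metric: product of label sizes."""
    return prod(label_cardinalities.values())


# Hypothetical label sizes for one request-counter metric.
base = {"method": 5, "status": 10, "endpoint": 40}
print(series_count(base))

# Adding a per-customer label multiplies the bound by the customer count.
print(series_count({**base, "customer_id": 50_000}))
```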

SLIDE 20

Logging

SLIDE 21

Logging

Making sense of (a lot of) logs

Centralised
Searchable
Correlated

SLIDE 22

Log Correlation

[Diagram: services A, B, C, D, E, F, G, H and J, with trace ID a1b2c3 attached to the calls between them]

    ERROR [svc=H][trace=a1b2c3] Failed to save order
      Cause: Cassandra timeout exception
    ERROR [svc=F][trace=a1b2c3] Failed to complete order
      Cause: Shipping service responded with 500
    ERROR [svc=A][trace=a1b2c3] Failed to process order
      Cause: Order process manager responded with 500

    INFO [svc=G][trace=a1b2c3] Items verified in stock
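One possible way to get the `[svc=...][trace=...]` prefix on every log line is sketched below with Python's standard `logging` module. The header name `X-Trace-Id`, the helper names, and the six-character ID length are assumptions for illustration; the slides do not say how the trace ID is carried between services.

```python
import logging
import uuid
from typing import Dict, Mapping

TRACE_HEADER = "X-Trace-Id"  # Assumed header name; not specified in the slides.


def trace_id_from(headers: Mapping[str, str]) -> str:
    """Reuse the caller's trace ID, or start a new trace at the edge."""
    # Note: this lookup is case-sensitive, unlike real HTTP header handling.
    return headers.get(TRACE_HEADER) or uuid.uuid4().hex[:6]


def service_logger(service: str, trace_id: str) -> logging.LoggerAdapter:
    """Logger that stamps every record with the service name and trace ID."""
    logger = logging.getLogger(service)
    return logging.LoggerAdapter(logger, {"svc": service, "trace": trace_id})


def outgoing_headers(trace_id: str) -> Dict[str, str]:
    """Headers to attach to downstream calls so the trace ID travels along."""
    return {TRACE_HEADER: trace_id}


logging.basicConfig(
    format="%(levelname)s [svc=%(svc)s][trace=%(trace)s] %(message)s",
    level=logging.INFO,
)

trace_id = trace_id_from({"X-Trace-Id": "a1b2c3"})
log = service_logger("H", trace_id)
log.error("Failed to save order")
```

With every service logging the same ID, a single search for `trace=a1b2c3` in a centralised log store returns the whole chain of failures.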

SLIDE 23

JSON

Plain log line:

    2018-02-20T16:38:23+00:00 ERROR Read timed out

    Hmmm thanks… ?

Structured log event:

    timestamp   2018-02-20T16:38:23+00:00
    requestID   ec667cb45
    level       ERROR
    team        events
    service     registration-service
    commit      542a8b8e
    build       542a8b8e.7
    node        node_e79f3e52
    log         Read timed out
    region      europe-west2
    runtime     java-1.8.0_161
    customerID  55123
    userID      458
    ...         ...

Questions the fields answer:

    When did it happen?
    Can I trace it?
    What is it?
    What is it running?
    Where is it?
    What is the message?
    Who caused it? (customerID, userID)
    Any other info? (...)
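A minimal sketch of emitting events like the one above as JSON, using Python's standard `logging` module. The `JsonFormatter` class and the convention of passing per-event fields through `extra={"fields": {...}}` are assumptions for illustration, not something the slides specify; the field names are taken from the slide.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with shared context fields."""

    def __init__(self, **static_fields: str) -> None:
        super().__init__()
        # Fields that are identical for every event from this process.
        self.static_fields = static_fields

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(timespec="seconds"),
            "level": record.levelname,
            "log": record.getMessage(),
            **self.static_fields,
            # Per-event fields passed via extra={"fields": {...}}.
            **getattr(record, "fields", {}),
        }
        return json.dumps(event)


handler = logging.StreamHandler()
handler.setFormatter(
    JsonFormatter(
        service="registration-service", team="events", region="europe-west2"
    )
)
logger = logging.getLogger("registration-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # Avoid duplicate output through the root logger.

logger.error(
    "Read timed out",
    extra={"fields": {"requestID": "ec667cb45", "customerID": 55123, "userID": 458}},
)
```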

SLIDE 24

Structured logs unleash high-cardinality exploration

Error rate spike isolated by build version

Activity spike isolated for single customer
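The kind of exploration described on this slide can be sketched as a group-by over structured events: once every event carries fields such as `build` or `customerID`, isolating a spike is a matter of counting by that field. The helper name and the sample events below are hypothetical; in practice the query would run in a log store rather than in application code.

```python
from collections import Counter
from typing import Iterable, Mapping


def error_counts_by(events: Iterable[Mapping[str, object]], field: str) -> Counter:
    """Count ERROR events grouped by the value of one structured field."""
    return Counter(e.get(field) for e in events if e.get("level") == "ERROR")


# Made-up events; only the field names come from the structured log slide.
events = [
    {"level": "ERROR", "build": "542a8b8e.7", "customerID": 55123},
    {"level": "ERROR", "build": "542a8b8e.7", "customerID": 60001},
    {"level": "INFO", "build": "542a8b8e.7", "customerID": 55123},
    {"level": "ERROR", "build": "3f1c9d2a.4", "customerID": 55123},
]

# Which build accounts for most errors?
print(error_counts_by(events, "build").most_common())

# The same question, asked per customer instead.
print(error_counts_by(events, "customerID").most_common())
```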

SLIDE 25

Tracing

SLIDE 26

Tracing

Trace Spans

[Figure: trace timeline from 0ms to 172ms. Services: event-mgt-api, attendees-service, registration-service. Spans: get_confirmed_attendees, get_attendees, get_confirm_status, cassandra/select, mysql/select. Span durations shown: 172ms, 73ms, 54ms, 78ms, 41ms.]
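To show what a span records, here is a minimal in-process sketch: each span has a name, a parent, and a duration, and nesting spans yields the kind of timeline pictured above. The `Tracer` and `Span` classes are hypothetical and not a real tracing library; the nesting of the span names is an assumption, and real distributed tracing also has to propagate the trace context across service boundaries.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class Span:
    name: str
    parent: Optional[str]
    start: float
    duration_ms: float = 0.0


class Tracer:
    """Hypothetical single-process tracer that records nested spans."""

    def __init__(self) -> None:
        self.finished: List[Span] = []
        self._stack: List[Span] = []

    @contextmanager
    def span(self, name: str) -> Iterator[Span]:
        # The innermost open span, if any, becomes the parent.
        parent = self._stack[-1].name if self._stack else None
        current = Span(name, parent, time.perf_counter())
        self._stack.append(current)
        try:
            yield current
        finally:
            current.duration_ms = (time.perf_counter() - current.start) * 1000
            self._stack.pop()
            self.finished.append(current)


tracer = Tracer()
# Assumed nesting; the sleeps stand in for real database calls.
with tracer.span("get_confirmed_attendees"):
    with tracer.span("get_attendees"):
        with tracer.span("cassandra/select"):
            time.sleep(0.01)
    with tracer.span("get_confirm_status"):
        with tracer.span("mysql/select"):
            time.sleep(0.01)

for s in tracer.finished:
    print(f"{s.name} parent={s.parent} {s.duration_ms:.1f}ms")
```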

SLIDE 27

SLIDE 28

Known Unknowns: Monitoring & Resiliency
    Healthchecks
    Metrics (Alerting)

Unknown Unknowns: Debugging & Exploration
    Logs
    Metrics (Queries)
    Tracing
    Events

SLIDE 29

Usability of tooling is key to adoption

Cheap to instrument
Easy to explore
Reliable & Trustworthy

SLIDE 30

Visibility builds trust but requires safety

SLIDE 31

Visibility helps justify decisions

SLIDE 32

Visibility enables operability

SLIDE 33

Test (a little bit*) less

Don’t spend all your time testing

...keep some for instrumenting


SLIDE 34

Thank you!

Pierre Vincent

SRE Manager at Poppulo

@PierreVincent pvincent.io