Three Pillars with Zero Answers: A New Observability Scorecard - PowerPoint PPT Presentation




SLIDE 1

Three Pillars with Zero Answers

A New Observability Scorecard

November 5, 2018

SLIDE 2

First, a Critique

SLIDE 3

  • Observing microservices is hard
  • Google and Facebook solved this (right???)
  • They used Metrics, Logging, and Distributed Tracing…
  • So we should, too.

The Conventional Wisdom

SLIDE 4

The Three Pillars of Observability

  • Metrics
  • Logging
  • Distributed Tracing
SLIDE 5

Metrics!

SLIDE 6

Logging!

SLIDE 7

Tracing!

SLIDE 8
SLIDE 9

Fatal Flaws

SLIDE 10

A word nobody knew in 2015…

Dimensions (aka “tags”) can explain variance in timeseries data (aka “metrics”)… but cardinality…

SLIDE 11

Logging Data Volume: a reality check

transaction rate x all microservices x cost of net+storage x weeks of retention

  • way too much $$$$
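The formula above can be turned into a back-of-envelope estimate. A minimal sketch, in which every input number (transaction rate, byte sizes, unit cost, retention) is an invented assumption, not a figure from the talk:

```python
# Back-of-envelope logging cost, per the slide's formula:
# transaction rate x all microservices x cost of net+storage x retention.
# All input numbers below are illustrative assumptions.

def monthly_log_cost(tx_per_sec, microservices, bytes_per_tx_per_svc,
                     dollars_per_gb, weeks_retained):
    seconds_per_week = 7 * 24 * 3600
    total_bytes = (tx_per_sec * seconds_per_week * weeks_retained
                   * microservices * bytes_per_tx_per_svc)
    return total_bytes / 1e9 * dollars_per_gb

# 2,000 tx/s, 100 services, ~1 KB of logs per service per transaction,
# $0.50/GB for network + storage, 4 weeks of retention:
cost = monthly_log_cost(2000, 100, 1000, 0.50, 4)
print(f"${cost:,.0f}")  # roughly a quarter of a million dollars
```

Even with modest per-service log volume, the multiplication across services and retention is what makes the total "way too much $$$$".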
SLIDE 12

The Life of Transaction Data: Dapper

Stage                        | Overhead affects…          | Retained
Instrumentation executed     | App                        | 100.00%
Buffered within app process  | App                        |   0.10%
Flushed out of process       | App                        |   0.10%
Centralized regionally       | Regional network + storage |   0.10%
Centralized globally         | WAN + storage              |   0.01%
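One way to see why those retention rates hurt: under head-based sampling, the chance of keeping even one trace of a rare failure mode is small. A minimal sketch, where the error count and sample rate are illustrative assumptions:

```python
# With Dapper-style head-based sampling, only ~0.10% of traces survive
# the app process (see the table above). For a rare failure mode, that
# usually means no example trace survives at all.
# The error count and sample rate below are invented for illustration.

def p_keep_at_least_one(error_count, sample_rate):
    # P(at least one of the errant transactions was sampled)
    return 1 - (1 - sample_rate) ** error_count

# 100 errant transactions at a 0.10% sampling rate:
p = p_keep_at_least_one(100, 0.001)
print(round(p, 3))  # ~0.095: a ~90% chance the failure was never traced
```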

SLIDE 13

                                        | Logs | Metrics | Dist. Traces
TCO scales gracefully                   |  –   |    ✓    |      ✓
Accounts for all data (i.e., unsampled) |  ✓   |    ✓    |      –
Immune to cardinality                   |  ✓   |    –    |      ✓

Fatal Flaws

SLIDE 14

Data vs UI

SLIDE 15

Data vs UI

Metrics Logs Traces

SLIDE 16

Metrics, Logs, and Traces are Just Data… not a feature or use case.

SLIDE 17

A New Scorecard for Observability

SLIDE 18

“SLI” = “Service Level Indicator”

TL;DR: An SLI is an indicator of health that a service’s consumers would care about… not an indicator of its inner workings.

Observability: Quick Vocab Refresher
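A consumer-facing SLI can be as simple as the fraction of requests that were both successful and fast. A minimal sketch, where the 300 ms latency budget and the sample window are invented for illustration:

```python
# A minimal consumer-facing SLI: the fraction of requests that were both
# successful and fast. The 300 ms budget and sample data are assumptions.

def availability_sli(requests):
    # requests: iterable of (http_status, latency_ms) pairs
    requests = list(requests)
    if not requests:
        return 1.0
    good = sum(1 for status, ms in requests if status < 500 and ms <= 300)
    return good / len(requests)

window = [(200, 120), (200, 450), (503, 80), (200, 90)]
print(availability_sli(window))  # 0.5: one slow, one 5xx, two good
```

Note that nothing in the indicator mentions the service’s internals (queue depths, GC pauses); it measures only what a consumer would observe.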

SLIDE 19

Observability: Two Fundamental Goals

  • Gradually improving an SLI (days, weeks, months…)
  • Rapidly restoring an SLI (NOW!!!!)

Reminder: “SLI” = “Service Level Indicator”

SLIDE 20
  1. Detection: perfect SLI capture
  2. Refinement: reduce the search space

Observability: Two Fundamental Activities

SLIDE 21

An interlude about stats frequency

SLIDE 22

Specificity:

  • Arbitrary dimensionality and cardinality
  • Any layer of the stack, including mobile+web!

Fidelity:

  • Correct stats!!!
  • High stats frequency (i.e., “beware smoothing”!)

Freshness: ≤ 5 second lag

Scorecard >> Detection
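The “beware smoothing” warning can be made concrete: a latency spike that is obvious at high stats frequency disappears into a longer averaging window. A toy illustration with invented numbers:

```python
# A 1-second-resolution latency series with a single 2000 ms spike.
# At high stats frequency the spike is unmissable; smoothed into a
# 60-second average it looks almost healthy. Numbers are invented.

series = [20.0] * 60      # steady 20 ms latency, one sample per second
series[30] = 2000.0       # one huge spike

high_freq_peak = max(series)            # what a 1 s window reports
smoothed = sum(series) / len(series)    # what a 60 s average reports
print(high_freq_peak, smoothed)         # 2000.0 vs 53.0
```

The smoothed value (53 ms) might not even cross an alert threshold, which is exactly why detection needs high stats frequency.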

SLIDE 23

  • # of things your users actually care about
  • # of microservices
  • # of failure modes

Must reduce the search space!

Scorecard >> Refinement

SLIDE 24

Scorecard >> Refinement

  • Identify Variance
  • Explain Variance

SLIDE 25

An interlude about variance and “p99”

SLIDE 26

Scorecard >> Refinement

Identifying Variance:

  • Cardinality: understand which tag changed
  • Robust stats: histograms (see prev slide)
  • Data retention: always “Know What’s Normal”

Explaining variance:

  • Correct stats!!!
  • “Suppress the messengers” of microservice failures
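“Correct stats!!!” matters because percentiles don’t aggregate: the average of per-host p99s is not the global p99. A toy sketch using the nearest-rank percentile on invented samples (a real system would merge histograms instead of sorting raw values):

```python
import math

# Percentiles don't aggregate: averaging per-host p99s understates the
# real tail. Sample data is invented; sorting stands in for the
# histogram merging a production system would do.

def p99(samples):
    # nearest-rank percentile
    s = sorted(samples)
    return s[math.ceil(0.99 * len(s)) - 1]

host_a = [10] * 95 + [500] * 5   # 5% of host A's requests are slow
host_b = [10] * 100              # host B is uniformly fast

avg_of_p99s = (p99(host_a) + p99(host_b)) / 2   # 255.0 -- misleading
true_p99 = p99(host_a + host_b)                 # 500 -- the real tail
print(avg_of_p99s, true_p99)
```

This is why the scorecard asks for robust stats (histograms) rather than pre-aggregated percentiles.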
SLIDE 27

Wrapping up…

SLIDE 28

(first, a hint at my perspective)

SLIDE 29

The Life of Transaction Data: Dapper

Stage                        | Overhead affects…          | Retained
Instrumentation executed     | App                        | 100.00%
Buffered within app process  | App                        |   0.10%
Flushed out of process       | App                        |   0.10%
Centralized regionally       | Regional network + storage |   0.10%
Centralized globally         | WAN + storage              |   0.01%

(Review)

SLIDE 30

The Life of Transaction Data: Dapper → LightStep

Stage                        | Overhead affects…          | Retained
Instrumentation executed     | App                        | 100.00%
Buffered within app process  | App                        | 100.00%
Flushed out of process       | App                        | 100.00%
Centralized regionally       | Regional network + storage | 100.00%
Centralized globally         | WAN + storage              | on-demand
SLIDE 31

An Observability Scorecard

Detection

  • Specificity: unlimited cardinality, across the entire stack
  • Fidelity: correct stats, high stats frequency
  • Freshness: ≤ 5 seconds

Refinement

  • Identifying variance: unlimited cardinality, hi-fi histograms, data retention
  • “Suppress the messengers”
SLIDE 32

Thank you!

Ben Sigelman, Co-founder and CEO
twitter: @el_bhs
email: bhs@lightstep.com