Traces Are the Fuel, Not the Car: Making Distributed Tracing Valuable


March 4, 2019. Ben Sigelman, CEO and Co-founder, LightStep.


SLIDE 1

Traces Are the Fuel, Not the Car

Making Distributed Tracing Valuable

March 4, 2019
Ben Sigelman, CEO and Co-founder, LightStep

SLIDE 2

Part I

Observability Dogma: A Critique

SLIDE 3

The Conventional Wisdom

  • Observing microservices is hard
  • Google and Facebook solved this (right???)
  • They used Metrics, Logging, and Distributed Tracing…
  • So we should, too.

SLIDE 4

The Three Pillars of Observability

Metrics! Traces! Logs!

SLIDE 5

Fatal Flaws

SLIDE 6

So Many Flaws, So Little Time…

SLIDE 7

Fatal Flaws: “TL;DR” edition

                                          Logs   Metrics   Dist. Traces
TCO scales gracefully                      –       ✓          ✓
Accounts for all data (i.e., unsampled)    ✓       ✓          –
Immune to cardinality                      ✓       –          ✓

SLIDE 8

A fun game!

Design your own (positive-ROI) observability system:

  • High-throughput
  • High-cardinality
  • Unsampled
  • Lengthy retention window

Choose three.

SLIDE 9

Metrics, Logs, and Traces are Just Data, … not a feature or use case.

SLIDE 10

The Three Pillars of Observability

Metrics! Traces! Logs!

SLIDE 11

The Three Pipes (not Pillars) of Observability

Metrics! Traces! Logs!

SLIDE 12

Part II Service-Centric Observability

SLIDE 13

A microservices architecture

The WAN

SLIDE 14

A microservices architecture

A narrow scope of understanding = great!

The WAN

Faster releases, smaller teams, less friction, etc.

Consider a single service...

SLIDE 15

A microservices architecture (with a slowdown)

A B C D E

A narrow scope of understanding = great! … problematic. Decoupling cuts both ways.

The WAN

SLIDE 16

Hands-on with a single distributed trace

SLIDE 17

Distributed traces, in summary

  • One distributed trace per transaction
  • Crosses microservice boundaries
  • They are necessary if we want to understand the relationships between distant actors in our architecture …

… and yet:

  • too numerous to centralize in “standard” ways
  • too data-dense for our brains to process without help

SLIDE 18

“Distributed Tracing” != “Distributed Traces”

  • Distributed traces: basically just structs
  • Distributed tracing: the art and science of making distributed traces valuable
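
To make "basically just structs" concrete, here is a minimal sketch of what one span of a distributed trace might look like. The field names (`trace_id`, `span_id`, `parent_id`, `start_us`, `duration_us`, `tags`) and the helper `services_crossed` are illustrative assumptions, not taken from the talk or from any particular tracing library.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """One timed operation inside one service (hypothetical schema)."""
    trace_id: str              # shared by every span in the same transaction
    span_id: str               # unique within the trace
    parent_id: Optional[str]   # None for the root span
    service: str
    operation: str
    start_us: int              # start time in microseconds
    duration_us: int
    tags: Dict[str, str] = field(default_factory=dict)


def services_crossed(trace: List[Span]) -> List[str]:
    """Return the distinct services a trace touched, in first-seen order."""
    seen: List[str] = []
    for span in sorted(trace, key=lambda s: s.start_us):
        if span.service not in seen:
            seen.append(span.service)
    return seen


# Hypothetical three-span trace; the service names are made up for illustration.
trace = [
    Span("t1", "a", None, "api-proxy", "GET /charge", 0, 900),
    Span("t1", "b", "a", "api-server", "charge", 50, 800),
    Span("t1", "c", "b", "payment-gateway", "authorize", 100, 600),
]
```

The struct itself carries no insight. Everything useful comes from what is computed over many of them, which is the distinction the slide draws.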

SLIDE 19

So… how do we make distributed traces valuable?

SLIDE 20

Quick Vocab Refresher: SLIs

“SLI” = “Service Level Indicator”

TL;DR: An SLI is an indicator of health that a service’s consumers would care about.

… not an indicator of its inner workings

SLIDE 21

Two Fundamental Goals

  • Gradually improving an SLI (days, weeks, months…)
  • Rapidly restoring an SLI (NOW!!!!)

Reminder: “SLI” = “Service Level Indicator”

SLIDE 22

Two Fundamental Activities

  • 1. Detection: measuring SLIs precisely
  • 2. Explaining variance: recognizing and explaining variance, often iteratively
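
As a rough illustration of the detection step, the sketch below computes one possible SLI: the fraction of requests that succeeded within a latency threshold. The function name `latency_sli`, the `(latency_ms, ok)` tuple shape, and the 500 ms threshold are all assumptions chosen for the example. The talk does not prescribe a specific SLI definition.

```python
from typing import Iterable, Tuple


def latency_sli(requests: Iterable[Tuple[float, bool]], threshold_ms: float) -> float:
    """Fraction of requests that succeeded AND finished within threshold_ms."""
    total = 0
    good = 0
    for latency_ms, ok in requests:
        total += 1
        if ok and latency_ms <= threshold_ms:
            good += 1
    if total == 0:
        raise ValueError("no requests observed")
    return good / total


# Hypothetical observations: (latency in ms, succeeded?)
observed = [(120.0, True), (80.0, True), (950.0, True), (60.0, False)]
sli = latency_sli(observed, threshold_ms=500.0)
```

Note that the slow request and the failed request both count against the SLI. A consumer of the service experiences either one as unhealthy, regardless of what happened internally.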

SLIDE 23

The Refinement Process

Recognize Variance Explain Variance

Fix Something

SLIDE 24

A Service-Centric Approach

Given any service:

  • 1. Start with an SLI
  • 2. Find variance
  • 3. Explain it

The WAN

SLIDE 25

Part III “Show & Tell”

SLIDE 26

A simple microservices architecture

📲iOS📲, web client, api-proxy, api-server, generic-cache, database, charger, payment-gateway, auth-service, geofencer, geofence-server, tile-db

The WAN

SLIDE 27

Recognizing Variance

  • 1. Discovering SLIs (slide)
  • 2. High-percentile latency measurement
  • 3. “Performance is a shape” (and knowing what’s normal)
  • 4. Examining individual traces

(link)
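
High-percentile measurement and "performance is a shape" can be sketched together. The example below uses a nearest-rank percentile and a power-of-two histogram. Both are common choices but are assumptions here, as is the sample data. The talk demonstrated this live and does not specify a method.

```python
import math
from collections import Counter
from typing import Dict, List


def percentile(latencies_ms: List[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not latencies_ms:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(latencies_ms)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def log2_histogram(latencies_ms: List[float]) -> Dict[int, int]:
    """Count samples per power-of-two bucket; key k covers [2**k, 2**(k+1)) ms.

    Assumes every latency is strictly positive.
    """
    buckets: Counter = Counter()
    for value in latencies_ms:
        buckets[math.floor(math.log2(value))] += 1
    return dict(sorted(buckets.items()))


# Hypothetical sample: mostly fast requests plus a small slow mode.
samples = [10.0] * 90 + [12.0] * 8 + [900.0, 1000.0]
p50 = percentile(samples, 50)
p99 = percentile(samples, 99)
shape = log2_histogram(samples)
```

In this made-up sample the median looks healthy while the 99th percentile sits in a separate, much slower mode. The histogram shows two distinct humps, which a single average would hide.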

SLIDE 28

A blast from the past…

SLI advice from earlier today...

SLIDE 29

Service Diagrams

  • 1. “Where’s Waldo” antipatterns (next slide)
  • 2. Finding the common-case bottleneck
  • 3. Finding the latency-outlier bottleneck

(link)
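
One simple way to look for a bottleneck in a single trace is to attribute "self time" to each service: a span's duration minus the time spent in its direct children. This is a sketch under a strong assumption, namely that child spans run sequentially and do not overlap. The `Span` shape and the function names are hypothetical, and this is not necessarily how the service diagrams in the talk were computed.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional


class Span(NamedTuple):
    span_id: str
    parent_id: Optional[str]   # None for the root span
    service: str
    duration_ms: float


def self_time_by_service(trace: List[Span]) -> Dict[str, float]:
    """Sum, per service, each span's duration minus its direct children's durations."""
    child_total: Dict[str, float] = defaultdict(float)
    for span in trace:
        if span.parent_id is not None:
            child_total[span.parent_id] += span.duration_ms

    totals: Dict[str, float] = defaultdict(float)
    for span in trace:
        # Clamp at zero in case children overlap and exceed the parent.
        totals[span.service] += max(0.0, span.duration_ms - child_total[span.span_id])
    return dict(totals)


def bottleneck(trace: List[Span]) -> str:
    """Service with the largest total self time in this trace."""
    totals = self_time_by_service(trace)
    return max(totals, key=totals.get)


# Hypothetical trace with made-up durations.
trace = [
    Span("a", None, "api-proxy", 900.0),
    Span("b", "a", "api-server", 800.0),
    Span("c", "b", "payment-gateway", 600.0),
    Span("d", "b", "auth-service", 50.0),
]
```

Running the same aggregation over typical traces and over latency-outlier traces separately is one way to distinguish the common-case bottleneck from the outlier bottleneck.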

SLIDE 30

Service Diagrams and “Actionability”

SLIDE 31

Explaining Variance With Many Dimensions

  • 1. A “cardinality refresher” (next slide)
  • 2. Exploring data with no cardinality limits
  • 3. Explaining variance across the stack

(link)
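
Explaining variance with dimensions can be as simple as splitting the SLI by one tag at a time and seeing which split separates good from bad. The sketch below groups request latencies by a tag and reports the nearest-rank p99 per group. The tag names (`client`, `region`), the data, and the function names are assumptions made up for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Request = Tuple[float, Dict[str, str]]  # (latency_ms, tags)


def p99(values: List[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(99 * len(ordered) / 100)
    return ordered[rank - 1]


def p99_by_tag(requests: List[Request], tag: str) -> Dict[str, float]:
    """Group latencies by the value of one tag and compute p99 per group."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for latency_ms, tags in requests:
        groups[tags.get(tag, "(missing)")].append(latency_ms)
    return {value: p99(latencies) for value, latencies in groups.items()}


# Hypothetical traffic: only requests from one region are slow.
requests = (
    [(20.0, {"client": "ios", "region": "eu"})] * 50
    + [(22.0, {"client": "web", "region": "eu"})] * 45
    + [(800.0, {"client": "web", "region": "us"})] * 5
)
by_region = p99_by_tag(requests, "region")
by_client = p99_by_tag(requests, "client")
```

In this sample both splits show a slow group, but only the `region` split isolates it cleanly: every slow request is in `us`, whereas the `web` group mixes fast and slow requests. Repeating the split across many tags is the iterative part of explaining variance.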

SLIDE 32

A word nobody knew in 2015…

Dimensions (aka “tags”) can explain variance in timeseries data (aka “metrics”) …

… but cardinality

SLIDE 33

Wrapping up…

SLIDE 34

What we’ve learned

  • Microservices helped us reduce human comms overhead
  • … and that created huge problems for observability
  • Distributed traces are necessary but not sufficient
  • Distributed tracing is much more than distributed traces
  • A service-centric approach with a modern, sophisticated distributed tracing system can do amazing things

SLIDE 35

Thank you!

Ben Sigelman, Co-founder and CEO
twitter: @el_bhs
email: bhs@lightstep.com

PS: LightStep announced something cool today! Stop by Booth #3 to learn more.

I am friendly and would love to chat… please say hello, I don’t make it to Europe often!

SLIDE 36

Extra slides