SLIDE 1
Ben Sigelman (@el_bhs, bhs@lightstep.com)
Co-founder & CEO: LightStep
Co-creator: OpenTracing, OpenTelemetry, Google Dapper, Google Monarch
Architectures that Scale Deep:
Regaining Control in Deep Systems
QCon SF, November 2019
SLIDE 2
Part I
Scaling, and Deep Systems
SLIDE 3
What is scale, anyway?
SLIDE 4
Scaling wide
SLIDE 5
Scaling wide
SLIDE 6
Scaling wide
SLIDE 7
Scaling wide
SLIDE 8
Scaling wide
SLIDE 9
Scaling deep
SLIDE 10
Scaling deep
SLIDE 11
Scaling deep
SLIDE 12
Scaling deep
SLIDE 13
Scaling deep
SLIDE 14
How does this look for software?
SLIDE 15
Software: Scaling wide
SLIDE 16
Software: Scaling deep
SLIDE 17
How do real-world systems look?
SLIDE 18
Microservices at scale aren’t just wide systems, they’re deep systems
SLIDE 19
Deep Systems
Architectures with ≥ 4 layers of independently operated services
(including external/cloud dependencies)
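One way to make this definition concrete is to count layers along the longest call path in a service dependency graph. The sketch below is illustrative only: the service names and the `CALLS` graph are hypothetical, every node is assumed to be independently operated, and the graph is assumed to be acyclic.

```python
from functools import lru_cache

# Hypothetical call graph: each service maps to the services it calls.
# Every node is assumed to be independently operated.
CALLS = {
    "web": ["api-gateway"],
    "api-gateway": ["checkout", "search"],
    "checkout": ["payments", "inventory"],
    "payments": ["cloud-kms"],  # stands in for an external/cloud dependency
    "inventory": [],
    "search": [],
    "cloud-kms": [],
}


def layers(graph, root):
    """Count the services on the longest call path starting at root.

    Assumes the graph has no cycles.
    """
    @lru_cache(maxsize=None)
    def depth(service):
        callees = graph.get(service, [])
        return 1 + max((depth(c) for c in callees), default=0)

    return depth(root)


def is_deep(graph, root, threshold=4):
    """Apply the slide's rule of thumb: at least `threshold` layers."""
    return layers(graph, root) >= threshold
```

With this hypothetical graph, the path web, api-gateway, checkout, payments, cloud-kms has five layers, so the system counts as deep when viewed from "web".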
SLIDE 20
What do deep systems sound like?
SLIDE 21
What do deep systems sound like?
“Don’t deploy on Fridays”
SLIDE 22
What do deep systems sound like?
“Where’s Chris?! I’m dealing with a P0 and they’re the only one who knows how to debug this.”
SLIDE 23
What do deep systems sound like?
“It can’t be our fault, our dashboard says we’re healthy”
SLIDE 24
What do deep systems sound like?
“Kafka is on fire”
SLIDE 25
What do deep systems sound like?
“I need 100% availability from your team. One hundred percent.”
SLIDE 26
What do deep systems sound like?
“I didn’t know I depended on that region”
SLIDE 27
What do deep systems sound like?
“That was on a dashboard but I can’t find it”
SLIDE 28
What do deep systems sound like?
Lots of challenges:
- People-management
- Security
- Multi-tenancy
- “Big-customer” success
- Performance
- Observability
SLIDE 29
Part II
Control Theory: TL;DR Edition
SLIDE 30
Why do we care so much about observability, anyway?
SLIDE 31
SLIDE 32
Inputs → A System → Outputs
… and its state vector,
SLIDE 33
Inputs → A System → Outputs
… and its state vector,
Observability
How well can you infer internal state using only the outputs?
SLIDE 34
Inputs → A System → Outputs
… and its state vector,
Controllability
How well can you control internal state using only the inputs?
SLIDE 35
Controllability is the dual of Observability
SLIDE 36
Controllability is the dual of Observability
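For readers who want the formal statement, the usual textbook version is sketched below. It assumes a linear time-invariant system with state vector x of dimension n, inputs u, and outputs y; the slides do not commit to a specific model, so the matrices A, B, C, D are assumptions introduced here for illustration.

```latex
\begin{gather*}
\dot{x}(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t) + D\,u(t) \\[4pt]
(A, C)\ \text{observable} \iff
\operatorname{rank}
\begin{bmatrix} C \\ CA \\ \vdots \\ CA^{n-1} \end{bmatrix} = n \\[4pt]
(A, B)\ \text{controllable} \iff
\operatorname{rank}
\begin{bmatrix} B & AB & \cdots & A^{n-1}B \end{bmatrix} = n \\[4pt]
(A, B)\ \text{controllable} \iff (A^{\top}, B^{\top})\ \text{observable}
\end{gather*}
```

The last line is the duality: the observability matrix of the pair (Aᵀ, Bᵀ) is the transpose of the controllability matrix of (A, B), so the two rank conditions coincide.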
SLIDE 37
Part III
What Deep Systems Mean for Observability
SLIDE 38
Architectural evolution
[Chart: axes “# of services” and “developers per service”; regions labeled “Pure Monoliths” and “Deep Systems”]
SLIDE 39
Stress (n): responsibility without control
[Diagram: “Stress” is the gap between “what you can control” and “what you are responsible for”]
SLIDE 40
SLIDE 41
Observability: Shrink This Gap
SLIDE 42
Mental models
A System
SLIDE 43
Managing Deep Systems
Services must have SLOs
(“Service Level Objectives”: latency, errors, etc.)
For effective service management, only three things matter:
0. Releasing service functionality
1. Gradually improving SLOs
2. Rapidly restoring SLOs
In a deep system, we must control the entire “triangle” to maintain our SLOs
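As a rough illustration of what “having an SLO” can mean in practice, the sketch below checks a window of requests against a latency objective and an error-rate objective. The `SLO` fields, the thresholds, and the nearest-rank percentile are assumptions chosen for the example, not anything prescribed by the talk.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    p99_latency_ms: float  # hypothetical latency objective
    max_error_rate: float  # hypothetical error-rate objective (0.0 to 1.0)


def p99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def meets_slo(slo, latencies_ms, errors):
    """Return True if the window satisfies both objectives.

    `errors` is a list of booleans, one per request (True means failed).
    """
    error_rate = sum(errors) / len(errors)
    return p99(latencies_ms) <= slo.p99_latency_ms and error_rate <= slo.max_error_rate
```

A check like this says whether an SLO is being met; restoring or improving it in a deep system still requires knowing which layer is responsible.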
SLIDE 44
Controllability == Observability
There’s that word again…
SLIDE 45
Observability: “The Conventional Wisdom”
Observing microservices is hard
Google and Facebook solved this (right???)
They used Metrics, Logging, and Distributed Tracing…
… So we should, too.
SLIDE 46
3 Pillars, 3 Experiences
Metrics, Logs, Traces
SLIDE 47
SLIDE 48
Three Pillars? Two giant pipes…
Logs, Metrics
Without Traces: Cognitive Load ≈ O(depth²)
SLIDE 49
Three Pillars? Two giant pipes…
Logs, Metrics
SLIDE 50
SLIDE 51
Two giant pipes…
Logs, Metrics
Without Traces: Cognitive Load ≈ O(depth²)
SLIDE 52
SLIDE 53
Traces
SLIDE 54
Traces provide Context
SLIDE 55
Traces provide Context And context rules out invalid hypotheses
SLIDE 56
Two giant pipes and a filter
Logs, Metrics
Context (from traces)
SLIDE 57
Context reduces cognitive load
Context (from traces)
Relevant Metrics, Relevant Logs
With Traces: Cognitive Load ≈ O(depth)
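The “filter” idea can be sketched in a few lines: given the trace ID of one slow or failing request, keep only the log records that share it and list the services that request touched. The record layout, trace IDs, and service names below are hypothetical, and real tracing systems propagate context in their own formats.

```python
# Hypothetical log records from many services. Only some of them belong to
# the request under investigation (trace "t1").
LOGS = [
    {"trace_id": "t1", "service": "api-gateway", "msg": "request received"},
    {"trace_id": "t2", "service": "search", "msg": "cache miss"},
    {"trace_id": "t1", "service": "payments", "msg": "retrying upstream call"},
    {"trace_id": "t3", "service": "inventory", "msg": "stock refreshed"},
]


def relevant_logs(logs, trace_id):
    """Keep only the records emitted while serving the given trace."""
    return [record for record in logs if record["trace_id"] == trace_id]


def services_on_trace(logs, trace_id):
    """List the services the traced request touched, sorted by name."""
    return sorted({record["service"] for record in relevant_logs(logs, trace_id)})
```

Everything not on the trace (here, "search" and "inventory") drops out of consideration, which is the sense in which context rules out invalid hypotheses.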
SLIDE 58
Observability: Shrink This Gap
SLIDE 59
SLIDE 60
Let’s Review
SLIDE 61
Microservices don’t just scale wide, they scale deep
Recognize deep systems
SLIDE 62
Stress (n): responsibility without control
[Diagram: “Stress” is the gap between “what you can control” and “what you are responsible for”]
SLIDE 63
“Controllability” (of SLOs) depends on observability
SLIDE 64
“The Three Pillars of Observability” is a lousy metaphor
… and traces are not sprinkles
SLIDE 65
Tracing can reduce cognitive load from O(depth²) to O(depth)
SLIDE 66
Tracing is the backbone of simple observability in deep systems
SLIDE 67 Thank You
Feedback always welcome:
twitter → @el_bhs
the emails → bhs@lightstep.com
Play with LightStep, for free, anytime:
(no email address required!)
lightstep.com/play