SLIDE 1 Conquering Microservices Complexity @Uber
With Distributed Tracing
Yuri Shkuro
SOFTWARE ENGINEER @ UBER
SLIDE 2 Agenda
Why Distributed Tracing
Trace as a Narrative
Trace vs. Trace
Traces vs. Trace
Data Lineage
Q & A
SLIDE 3 Yuri Shkuro
Software Engineer, Uber Technologies
shkuro.com
Founder & Maintainer, jaegertracing.io
Co-founder of OpenTracing & OpenTelemetry
Author of "Mastering Distributed Tracing", by Packt Publishing
SLIDE 4
Quick Poll
SLIDE 5
Why Distributed Tracing
SLIDE 6
Scaling With Users
Distributed Systems
SLIDE 7 Scaling With Engineering Organization
Monoliths to Microservices
SLIDE 8 Scaling With CPU Cores
Asynchronous Programming Models, Distributed Concurrency
BASIC CONCURRENCY ASYNC CONCURRENCY DISTRIBUTED CONCURRENCY
SLIDE 9
SLIDE 10
In microservices architectures the number of failure modes increases exponentially
SLIDE 11
Observability of distributed transactions is paramount!
SLIDE 12
Observability vs. monitoring
SLIDE 13
Observability vs. monitoring
SLIDE 14 Observability
System’s ability to answer questions
Which services did the request go through?
What did every service do when processing the request?
If the request was slow, where were the bottlenecks?
If the request failed, where did the errors happen?
How different was the execution from the normal system behavior?
  Structural differences
  Performance differences
What was on the critical path of the request?
Who should be paged?
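One of these questions, "what was on the critical path", can be answered directly from span timings. The following is a minimal sketch, not the algorithm used by any particular tracing tool. The `Span` record, its field names, and the sample operations are hypothetical, and the heuristic (walk backwards from the end of each span and follow the child that finished last) is a deliberate simplification.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    # Hypothetical minimal span record; times are in milliseconds.
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    operation: str
    start: int
    end: int


def critical_path(spans: List[Span]) -> List[str]:
    """Return the operations on the critical path, ordered by start time."""
    children: Dict[Optional[str], List[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    on_path: List[Span] = []

    def visit(span: Span) -> None:
        on_path.append(span)
        cursor = span.end
        # Walk backwards from the end of the span: the child that finished
        # last before the cursor is what the parent was waiting on.
        by_end = sorted(children.get(span.span_id, []),
                        key=lambda s: s.end, reverse=True)
        for child in by_end:
            if child.end <= cursor:
                visit(child)
                cursor = child.start

    for root in children.get(None, []):
        visit(root)
    return [s.operation for s in sorted(on_path, key=lambda s: s.start)]


spans = [
    Span("1", None, "GET /ride", 0, 100),
    Span("2", "1", "auth", 5, 20),
    Span("3", "1", "pricing", 25, 90),
    Span("4", "1", "maps", 25, 60),   # runs in parallel with pricing
    Span("5", "3", "db", 30, 80),
]
path = critical_path(spans)
print(path)
```

In this sample the `maps` call overlaps with the slower `pricing` call, so speeding it up would not shorten the request and it is left off the path.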
SLIDE 15
Distributed tracing can answer these questions and accelerate root cause analysis
SLIDE 16
Distributed Tracing in a Nutshell
SLIDE 17
Trace as a narrative
SLIDE 18
Trace Timeline
Classic trace view as Gantt chart
SLIDE 19 Trace Timeline
Parent → Child → Grandchild
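The parent → child → grandchild nesting in the timeline comes from each span recording the ID of its parent. A minimal sketch of that idea follows. The `Span` record and the sample operation names are hypothetical and much simpler than a real tracing data model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    # Hypothetical, simplified span record for illustration only.
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    operation: str
    start_us: int
    duration_us: int


def render_timeline(spans: List[Span]) -> List[str]:
    """Return one indented line per span, children nested under parents."""
    children: Dict[Optional[str], List[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines: List[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        # Siblings are listed in start-time order, as in a Gantt chart.
        siblings = sorted(children.get(parent_id, []), key=lambda s: s.start_us)
        for span in siblings:
            lines.append(f"{'  ' * depth}{span.operation} ({span.duration_us}us)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


spans = [
    Span("a", None, "frontend /dispatch", 0, 900),
    Span("b", "a", "customer /lookup", 50, 300),
    Span("c", "b", "mysql SELECT", 80, 250),
    Span("d", "a", "driver /find", 400, 450),
]
timeline = render_timeline(spans)
print("\n".join(timeline))
```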
SLIDE 20 Trace Timeline
Time + Mini-Map
SLIDE 21 Trace Timeline
Blocking operation
SLIDE 22 Trace Timeline
Sequential operations
SLIDE 23 Trace Timeline
Errors
SLIDE 24
Span details
SLIDE 25 Span details
Database query
SLIDE 26 Span details
Timed events (logs)
SLIDE 27
We can also trace asynchronous workflows
SLIDE 28
Tracing Talk Application
Mastering Distributed Tracing, Chapter 5
SLIDE 29
Tracing Talk Application
Architecture
SLIDE 30
Tracing Talk Application
Request trace
SLIDE 31 Tracing Talk Application
Message sent
SLIDE 32 Tracing Talk Application
Message received
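A message's send and receive can land in the same trace only if the trace context travels with the message. The sketch below illustrates this under stated assumptions: an in-memory queue stands in for a real message broker, and the header names and helper functions are hypothetical rather than taken from any tracing library.

```python
import secrets
from collections import deque
from typing import Deque, Dict

# In-memory queue standing in for a real message broker (assumption).
topic: Deque[dict] = deque()


def new_id(num_bytes: int) -> str:
    """Random lowercase hex identifier of the given byte length."""
    return secrets.token_hex(num_bytes)


def send(payload: str, trace_id: str, sender_span_id: str) -> None:
    # The producer copies its trace context into the message headers.
    # The header names are hypothetical.
    topic.append({
        "headers": {"trace-id": trace_id, "span-id": sender_span_id},
        "payload": payload,
    })


def receive() -> Dict[str, str]:
    # The consumer reads the context back and starts a span in the same
    # trace, pointing at the sender's span as its parent.
    message = topic.popleft()
    headers = message["headers"]
    return {
        "trace_id": headers["trace-id"],
        "parent_id": headers["span-id"],
        "span_id": new_id(8),
        "operation": "receive",
    }


trace_id, send_span_id = new_id(16), new_id(8)
send("hello", trace_id, send_span_id)
receive_span = receive()
print(receive_span)
```

Because the consumer reuses the trace ID and records the sender's span as its parent, both sides of the asynchronous hop appear in one trace.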
SLIDE 33 Single Trace
Pros and cons
+ Tells a story about a single transaction
+ Allows deep contextual drill-down
+ Acts as a distributed stack trace
- One trace can be overwhelmingly complex
- Tells a story about a single transaction. What if it’s an anomaly?
SLIDE 34
Too Much Complexity
One request - 30 services, 100+ RPCs
SLIDE 35
Too Much Complexity
Some traces have hundreds of thousands of spans
SLIDE 36
Reducing complexity by smarter visualizations
SLIDE 37
Trace graph
Time ordered, repeated edges collapsed
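Collapsing repeated edges can be sketched as counting calls per caller/callee pair while keeping the order in which each pair first appears. This is an illustrative sketch only. The dictionary-based span format and the service names are assumptions, not the representation used by the tool shown on the slide.

```python
from typing import Dict, List, Tuple


def collapse_edges(spans: List[dict]) -> Dict[Tuple[str, str], int]:
    """Map each (caller, callee) service pair to its number of calls.

    Edges are inserted in order of first occurrence, so iterating over
    the result gives a time-ordered view of the trace graph.
    """
    by_id = {span["id"]: span for span in spans}
    edges: Dict[Tuple[str, str], int] = {}
    for span in sorted(spans, key=lambda s: s["start"]):
        parent = by_id.get(span["parent"])
        if parent is None:
            continue  # the root span has no incoming edge
        edge = (parent["service"], span["service"])
        edges[edge] = edges.get(edge, 0) + 1
    return edges


spans = [
    {"id": 1, "parent": None, "service": "frontend", "start": 0},
    {"id": 2, "parent": 1, "service": "driver", "start": 10},
    {"id": 3, "parent": 2, "service": "redis", "start": 12},
    {"id": 4, "parent": 2, "service": "redis", "start": 20},
    {"id": 5, "parent": 2, "service": "redis", "start": 28},
    {"id": 6, "parent": 1, "service": "route", "start": 40},
]
graph = collapse_edges(spans)
for (caller, callee), calls in graph.items():
    print(f"{caller} -> {callee} x{calls}")
```

Six spans become three edges, which is the kind of reduction that makes a trace with many repeated calls readable.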
SLIDE 38
Trace graph
Latency heat map
SLIDE 39
Finding anomalies is easier when we look at differences in performance profiles
SLIDE 40
Trace vs. Trace
SLIDE 41
Comparing Trace Structures
Just like a Code Diff
SLIDE 42 Comparing Trace Structures
Shared Structure
SLIDE 43 Comparing Trace Structures
Absent in One of the Traces
SLIDE 44 Comparing Trace Structures
More or Fewer Spans Within a Node
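The comparison categories named so far (shared, absent in one trace, more or fewer spans within a node) can be sketched as a comparison of span counts per node. This is a simplified illustration, assuming each trace is reduced to a flat list of hypothetical "service:operation" node keys. A real structural diff would also account for where each node sits in the call graph.

```python
from collections import Counter
from typing import Dict, List


def diff_structures(trace_a: List[str], trace_b: List[str]) -> Dict[str, str]:
    """Classify every node key by how its span count differs between traces."""
    count_a, count_b = Counter(trace_a), Counter(trace_b)
    result: Dict[str, str] = {}
    for node in sorted(set(count_a) | set(count_b)):
        a, b = count_a[node], count_b[node]
        if a == 0:
            result[node] = "only in B"
        elif b == 0:
            result[node] = "only in A"
        elif a == b:
            result[node] = "shared"
        else:
            # Signed difference, e.g. "+2 spans in B" or "-1 spans in B".
            result[node] = f"{b - a:+d} spans in B"
    return result


trace_a = ["api:GET", "auth:check", "db:query", "db:query"]
trace_b = ["api:GET", "db:query", "db:query", "db:query", "db:query", "billing:charge"]
diff = diff_structures(trace_a, trace_b)
for node, status in diff.items():
    print(f"{node}: {status}")
```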
SLIDE 45 Comparing Trace Structures
Substantial Divergence
SLIDE 46 Deep Linking to Raw Traces & Spans
Error: "You have an outstanding balance…"
SLIDE 47
Production story
Migrating services to a nearby datacenter
Request latency doubles
SLIDE 48
Investigating latency
Structural comparison not always useful
SLIDE 49 Investigating latency
Very similar structure
SLIDE 50 Investigating latency
Left trace 2.74 seconds
SLIDE 51 Investigating latency
Right trace 4.2 seconds
SLIDE 52 Investigating latency
Due to structural differences?
SLIDE 53 Investigating latency
Or dispersed contributors?
SLIDE 54
Heat-maps!
SLIDE 55
Comparing trace durations
Heat-map of latencies
SLIDE 56 Comparing trace durations
Similar durations (grey)
SLIDE 57 Comparing trace durations
Nodes that are not shared (white)
SLIDE 58 Comparing trace durations
Red heat-map for latency differences
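The color scheme (grey for similar durations, white for nodes that are not shared, red for latency differences) can be sketched as a per-node comparison. The 10% tolerance, the function name, and the sample durations below are assumptions made for illustration, not values from the tool on the slides.

```python
from typing import Dict


def heat_map(durations_a: Dict[str, float],
             durations_b: Dict[str, float],
             tolerance: float = 0.10) -> Dict[str, str]:
    """Assign a color to every node based on how its duration changed.

    `tolerance` is a hypothetical threshold: relative changes at or
    below it count as "similar".
    """
    colors: Dict[str, str] = {}
    for node in sorted(set(durations_a) | set(durations_b)):
        if node not in durations_a or node not in durations_b:
            colors[node] = "white"  # present in only one of the traces
            continue
        a, b = durations_a[node], durations_b[node]
        baseline = max(a, b) or 1.0  # avoid dividing by zero
        change = abs(b - a) / baseline
        colors[node] = "grey" if change <= tolerance else "red"
    return colors


# Durations in milliseconds per node (made-up sample data).
left = {"api": 100.0, "auth": 20.0, "db": 50.0}
right = {"api": 180.0, "auth": 21.0, "cache": 5.0}
colors = heat_map(left, right)
for node, color in colors.items():
    print(f"{node}: {color}")
```

A fuller version would scale the intensity of red with the size of the change instead of using a single color.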
SLIDE 59
Comparing trace durations
Details on Mouse-Over
SLIDE 60
Comparing trace durations
Details on Mouse-Over
SLIDE 61 How Are These Approaches Different?
Summary
Surface less information
Condense the structural representation
Emphasize the differences
Distinct comparison modes simplify the comparisons
SLIDE 62
Challenges
Individual traces can be outliers. The user must find the right baseline.
SLIDE 63
Traces vs. Trace
SLIDE 64
What Went Wrong?
Root Cause Analysis
SLIDE 65 Top Level Outcome
Including Request/Response Payloads
SLIDE 66 Link to the Trace
Can Always Go Back to Raw Data
SLIDE 67 Trace Structure
Nodes Are Sorted Chronologically
SLIDE 68 Present and Missing Nodes
Color-Coding
SLIDE 69 A Node With Error Data
SLIDE 70 Error Data Panel
SLIDE 71 How Is This Approach Different?
Summary
Much broader context: aggregate vs.
One purpose: root cause analysis of reliability issues
SLIDE 72
Tackling Data Complexity
SLIDE 73 Uber is a data company
OK, and a transportation company
More data is derived from other data
Data undergoes many transformations
Microservices / RPCs
Streams / Kafka
Data lake / HDFS
Debugging data quality is difficult
SLIDE 74 Data Lineage
Debugging Data Quality
Microservices / RPCs
Streams / Kafka
Data lake / HDFS
SLIDE 75
Observability requires high quality instrumentation.
SLIDE 76 Our Software Is Highly Composable
Often from Open Source Components
Microservice
Queue Driver, DB Driver, DB, Queue, Server Framework, RPC Framework
Threads
SLIDE 77
Tracing breaks if components don’t understand each other.
SLIDE 78 Standardization Efforts
Instrumentation and Data Formats
Effective observability requires high-quality telemetry.
OpenTelemetry makes robust, portable telemetry a built-in feature of cloud-native software.
Distributed Tracing Working Group: data formats for on-the-wire trace context & correlation-context, and out-of-band trace data.
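As an illustration of an on-the-wire trace context format, the W3C Trace Context specification defines a `traceparent` header with four dash-separated hex fields: version, trace ID, parent span ID, and flags. The sketch below formats and parses that header for version `00` only. The function names are hypothetical, and the validation is simplified compared to what the specification requires.

```python
import re
from typing import Dict, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def format_traceparent(trace_id: str, parent_id: str, sampled: bool) -> str:
    """Build a version-00 traceparent header value."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{parent_id}-{flags}"


def parse_traceparent(header: str) -> Optional[Dict[str, object]]:
    """Parse a traceparent value; return None if it is not valid.

    Simplified: only the four-field shape is accepted, so headers from
    future versions with extra fields are rejected here.
    """
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


header = format_traceparent(
    "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", sampled=True
)
context = parse_traceparent(header)
print(header)
print(context)
```

When every framework and driver in the stack reads and writes the same header, the trace survives hops between components built by different teams.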
SLIDE 79
In Summary
Distributed tracing helps us to deal with the overwhelming complexity of microservices
SLIDE 80
In Summary
Creative visualizations are essential in performance analysis
SLIDE 81
In Summary
Distributed tracing empowers unparalleled insights into our distributed systems
SLIDE 82
Q&A
Thank You
Find me @ shkuro.com