Conquering Microservices Complexity @Uber With Distributed Tracing



slide-1
SLIDE 1

Conquering Microservices Complexity @Uber

With Distributed Tracing

Yuri Shkuro

SOFTWARE ENGINEER @ UBER

slide-2
SLIDE 2

Agenda

  • Why Distributed Tracing
  • Trace as a Narrative
  • Trace vs. Trace
  • Traces vs. Trace
  • Data Lineage
  • Q & A

slide-3
SLIDE 3

Yuri Shkuro

Software Engineer, Uber Technologies
shkuro.com

Founder & Maintainer of CNCF Jaeger
jaegertracing.io

Co-founder of OpenTracing & OpenTelemetry

Author of "Mastering Distributed Tracing", by Packt Publishing

slide-4
SLIDE 4

Quick Poll

slide-5
SLIDE 5

Why Distributed Tracing

slide-6
SLIDE 6

Scaling With Users

Distributed Systems

slide-7
SLIDE 7

Scaling With Engineering Organization

Monoliths to Microservices


slide-8
SLIDE 8

Scaling With CPU Cores

Asynchronous Programming Models, Distributed Concurrency

  • Basic concurrency
  • Async concurrency
  • Distributed concurrency

slide-9
SLIDE 9
slide-10
SLIDE 10

In microservices architectures the number of failure modes increases exponentially

slide-11
SLIDE 11

Observability of distributed transactions is paramount!

slide-12
SLIDE 12

Observability vs. monitoring

slide-13
SLIDE 13

Observability vs. monitoring

slide-14
SLIDE 14

Observability

System’s ability to answer questions

  • Which services did the request go through?
  • What did every service do when processing the request?
  • If the request was slow, where were the bottlenecks?
  • If the request failed, where did the errors happen?
  • How different was the execution from the normal system behavior?
      • Structural differences
      • Performance differences
  • What was on the critical path of the request?
  • Who should be paged?
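A minimal sketch of how recorded spans could answer a few of these questions. The `Span` fields, the helper names, and the sample trace are assumptions made up for illustration; they are not Jaeger's data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Hypothetical, simplified span record (not a real tracer's schema)."""
    span_id: str
    parent_id: Optional[str]
    service: str
    operation: str
    start_ms: int
    duration_ms: int
    error: bool = False


def services_visited(spans):
    """Which services did the request go through? (first-seen order)"""
    seen = []
    for span in sorted(spans, key=lambda s: s.start_ms):
        if span.service not in seen:
            seen.append(span.service)
    return seen


def error_spans(spans):
    """If the request failed, where did the errors happen?"""
    return [(s.service, s.operation) for s in spans if s.error]


def self_time_ms(span, spans):
    """Time spent in a span itself, excluding its direct children.

    Assumes children run sequentially and do not overlap.
    """
    children = sum(c.duration_ms for c in spans if c.parent_id == span.span_id)
    return span.duration_ms - children


def slowest_by_self_time(spans):
    """If the request was slow, where was the bottleneck?"""
    return max(spans, key=lambda s: self_time_ms(s, spans))


# Made-up trace for illustration.
trace = [
    Span("a", None, "frontend", "GET /ride", 0, 120),
    Span("b", "a", "auth", "check", 5, 10),
    Span("c", "a", "pricing", "quote", 20, 90),
    Span("d", "c", "db", "SELECT fares", 25, 70, error=True),
]
```

Self time is used here instead of raw duration because a parent span's duration always includes its children, so the root span would otherwise always look like the bottleneck.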

slide-15
SLIDE 15

Distributed tracing can answer these questions and accelerate root cause analysis

slide-16
SLIDE 16

Distributed Tracing in a Nutshell
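The core idea can be sketched in a few lines: every unit of work records a span, and each span carries the trace ID plus a pointer to its parent. The tracer below is a toy built on Python's `contextvars`; the function names and span fields are assumptions for this sketch, not the API of Jaeger or OpenTelemetry.

```python
import contextvars
import itertools
from contextlib import contextmanager

# The span that is currently active in this execution context.
_current_span = contextvars.ContextVar("current_span", default=None)
_ids = itertools.count(1)
finished = []  # stand-in for a tracing backend


@contextmanager
def start_span(operation):
    """Start a span as a child of the active span, if there is one."""
    parent = _current_span.get()
    span = {
        "trace_id": parent["trace_id"] if parent else f"trace-{next(_ids)}",
        "span_id": f"span-{next(_ids)}",
        "parent_id": parent["span_id"] if parent else None,
        "operation": operation,
    }
    token = _current_span.set(span)
    try:
        yield span
    finally:
        # Restore the previous active span and report this one.
        _current_span.reset(token)
        finished.append(span)


def handle_request():
    with start_span("frontend"):
        with start_span("pricing"):
            with start_span("db-query"):
                pass
        with start_span("auth"):
            pass


handle_request()
```

After `handle_request()` returns, `finished` holds four spans that share one trace ID, and the `parent_id` links are enough to rebuild the call tree.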

slide-17
SLIDE 17

Trace as a narrative

slide-18
SLIDE 18

Trace Timeline

Classic trace view as Gantt chart
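As a rough illustration of the Gantt-style view, the sketch below draws each span as a bar positioned by its start time and sized by its duration, with indentation showing nesting. The tuple shape `(name, depth, start_ms, duration_ms)` and the sample values are assumptions for this example, and it expects at least one span with a non-zero end time.

```python
def render_gantt(spans, width=40):
    """Render spans as rows of a text Gantt chart.

    `spans` is a list of (name, depth, start_ms, duration_ms) tuples,
    a hypothetical shape chosen for this sketch.
    """
    total = max(start + dur for _, _, start, dur in spans)
    rows = []
    for name, depth, start, dur in spans:
        offset = start * width // total         # where the bar begins
        length = max(1, dur * width // total)   # at least one cell per span
        label = ("  " * depth + name).ljust(20)
        bar = (" " * offset + "#" * length).ljust(width)
        rows.append(f"{label}|{bar}|")
    return "\n".join(rows)


spans = [
    ("frontend", 0, 0, 100),
    ("auth", 1, 5, 10),
    ("pricing", 1, 20, 70),
    ("db", 2, 25, 50),
]
print(render_gantt(spans))
```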

slide-19
SLIDE 19

Trace Timeline

Parent → Child → Grandchild

slide-20
SLIDE 20

Trace Timeline

Time + Mini-Map

slide-21
SLIDE 21

Trace Timeline

Blocking operation

slide-22
SLIDE 22

Trace Timeline

Sequential operations

slide-23
SLIDE 23

Trace Timeline

Errors

slide-24
SLIDE 24

Span details

slide-25
SLIDE 25

Span details

Database query

slide-26
SLIDE 26

Span details

Timed events (logs)

slide-27
SLIDE 27

We can also trace asynchronous workflows

slide-28
SLIDE 28

Tracing Talk Application

Mastering Distributed Tracing, Chapter 5

slide-29
SLIDE 29

Tracing Talk Application

Architecture

slide-30
SLIDE 30

Tracing Talk Application

Request trace

slide-31
SLIDE 31

Tracing Talk Application

Message sent

slide-32
SLIDE 32

Tracing Talk Application

Message received
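For an asynchronous workflow, the trace survives the queue only if the producer writes its trace context into the message and the consumer reads it back. The sketch below is loosely modeled on the inject/extract idea from OpenTracing; the header names, span fields, and helper names are invented for this example.

```python
import uuid


def new_id():
    """Random 16-hex-character identifier."""
    return uuid.uuid4().hex[:16]


def inject(span, headers):
    """Producer side: copy the span's identity into message headers.

    The header names are made up for this sketch.
    """
    headers["x-trace-id"] = span["trace_id"]
    headers["x-parent-span-id"] = span["span_id"]


def extract_and_start(headers, operation):
    """Consumer side: start a span that continues the producer's trace."""
    return {
        "trace_id": headers.get("x-trace-id") or new_id(),
        "span_id": new_id(),
        "parent_id": headers.get("x-parent-span-id"),
        "operation": operation,
    }


# Producer: the "message sent" span.
send_span = {"trace_id": new_id(), "span_id": new_id(),
             "parent_id": None, "operation": "send"}
message = {"headers": {}, "body": "hello"}
inject(send_span, message["headers"])

# ... the message sits in a queue for an arbitrary amount of time ...

# Consumer: the "message received" span, possibly much later.
receive_span = extract_and_start(message["headers"], "receive")
```

If the headers are missing, the consumer in this sketch starts a fresh trace, which is how a broken propagation link shows up in practice: as two unrelated traces instead of one.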

slide-33
SLIDE 33

Single Trace

Pros and cons

Pros
  • Tells a story about a single transaction
  • Allows deep contextual drill-down
  • Acts as a distributed stack trace

Cons
  • One trace can be overwhelmingly complex
  • Tells a story about a single transaction. What if it’s an anomaly?
slide-34
SLIDE 34

Too Much Complexity

One request - 30 services, 100+ RPCs

slide-35
SLIDE 35

Too Much Complexity

Some traces have hundreds of thousands of spans

slide-36
SLIDE 36

Reducing complexity by smarter visualizations

slide-37
SLIDE 37

Trace graph

Time ordered, repeated edges collapsed
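One way to picture this view: walk the spans in start-time order and merge every repeated caller-to-callee call into a single edge with a count. This is a sketch of the idea only, not the algorithm Jaeger uses; the span keys and the sample data are assumptions.

```python
def collapse_to_graph(spans):
    """Collapse a trace into caller -> callee edges with call counts.

    `spans` is a list of dicts with hypothetical keys
    span_id, parent_id, service, start_ms.
    """
    by_id = {s["span_id"]: s for s in spans}
    edges = {}  # dicts keep insertion order, so edges stay time-ordered
    for span in sorted(spans, key=lambda s: s["start_ms"]):
        parent = by_id.get(span["parent_id"])
        if parent is None:
            continue  # the root span has no incoming edge
        edge = (parent["service"], span["service"])
        edges[edge] = edges.get(edge, 0) + 1
    return edges


# Made-up trace: pricing calls the database three times.
spans = [
    {"span_id": 1, "parent_id": None, "service": "frontend", "start_ms": 0},
    {"span_id": 2, "parent_id": 1, "service": "auth", "start_ms": 5},
    {"span_id": 3, "parent_id": 1, "service": "pricing", "start_ms": 20},
    {"span_id": 4, "parent_id": 3, "service": "db", "start_ms": 25},
    {"span_id": 5, "parent_id": 3, "service": "db", "start_ms": 40},
    {"span_id": 6, "parent_id": 3, "service": "db", "start_ms": 55},
]
graph = collapse_to_graph(spans)
```

Six spans shrink to three edges, which is the point of the view: a trace with thousands of repeated calls becomes a graph small enough to read.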

slide-38
SLIDE 38

Trace graph

Latency heat map

slide-39
SLIDE 39

Finding anomalies is easier when we look at differences in performance profiles

slide-40
SLIDE 40

Trace vs. Trace

slide-41
SLIDE 41

Comparing Trace Structures

Just like a Code Diff

slide-42
SLIDE 42

Comparing Trace Structures

Shared Structure

slide-43
SLIDE 43

Comparing Trace Structures

Absent in One of the Traces

slide-44
SLIDE 44

Comparing Trace Structures

More or Fewer Spans Within a Node

slide-45
SLIDE 45

Comparing Trace Structures

Substantial Divergence
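The comparison categories from the last few slides (shared structure, nodes absent in one trace, more or fewer spans within a node) can be sketched as a diff over per-node span counts. Treating a node as a `(service, operation)` pair is an assumption of this sketch, as are the function names and the sample traces.

```python
from collections import Counter


def node_counts(spans):
    """Count spans per (service, operation) node."""
    return Counter((s["service"], s["operation"]) for s in spans)


def diff_traces(spans_a, spans_b):
    """Classify every node the way a code diff classifies lines."""
    a, b = node_counts(spans_a), node_counts(spans_b)
    result = {"shared": [], "only_in_a": [], "only_in_b": [], "count_changed": {}}
    for node in sorted(set(a) | set(b)):
        if node not in b:
            result["only_in_a"].append(node)
        elif node not in a:
            result["only_in_b"].append(node)
        elif a[node] == b[node]:
            result["shared"].append(node)
        else:
            # Positive: more spans in B. Negative: fewer spans in B.
            result["count_changed"][node] = b[node] - a[node]
    return result


def span(service, operation):
    return {"service": service, "operation": operation}


# Made-up traces: a successful request and a failed one.
good = [span("frontend", "GET"), span("auth", "check"),
        span("pricing", "quote"), span("db", "select"), span("db", "select")]
bad = [span("frontend", "GET"), span("auth", "check"),
       span("billing", "charge"), span("db", "select")]
diff = diff_traces(good, bad)
```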

slide-46
SLIDE 46

Deep Linking to Raw Traces & Spans

Error: "You have an outstanding balance…"

slide-47
SLIDE 47

Production story

  • Migrating services to a nearby datacenter
  • Request latency doubles

slide-48
SLIDE 48

Investigating latency

Structural comparison not always useful

slide-49
SLIDE 49

Investigating latency

Very similar structure

slide-50
SLIDE 50

Investigating latency

Left trace: 2.74 seconds

slide-51
SLIDE 51

Investigating latency

Right trace: 4.2 seconds

slide-52
SLIDE 52

Investigating latency

Due to structural differences?

slide-53
SLIDE 53

Investigating latency

Or dispersed contributors?

slide-54
SLIDE 54

Heat-maps!

slide-55
SLIDE 55

Comparing trace durations

Heat-map of latencies

slide-56
SLIDE 56

Comparing trace durations

Similar durations (grey)

slide-57
SLIDE 57

Comparing trace durations

Nodes that are not shared (white)

slide-58
SLIDE 58

Comparing trace durations

Red heat-map for latency differences
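The color scheme described on these slides can be approximated by comparing the total duration of each node in the two traces. The sketch below assumes a 10% tolerance for "similar" and marks any larger difference red in either direction, returning the signed relative change alongside the color; both choices, and all names, are assumptions rather than the tool's actual rules.

```python
def duration_by_node(spans):
    """Total duration per (service, operation) node."""
    totals = {}
    for s in spans:
        key = (s["service"], s["operation"])
        totals[key] = totals.get(key, 0) + s["duration_ms"]
    return totals


def classify(spans_a, spans_b, tolerance=0.10):
    """Map each node to (color, relative change of B versus A)."""
    a, b = duration_by_node(spans_a), duration_by_node(spans_b)
    colors = {}
    for node in set(a) | set(b):
        if node not in a or node not in b:
            colors[node] = ("white", None)  # node is not shared
            continue
        base = a[node]
        if base == b[node]:
            change = 0.0
        elif base == 0:
            change = float("inf")
        else:
            change = (b[node] - base) / base
        color = "grey" if abs(change) <= tolerance else "red"
        colors[node] = (color, change)
    return colors


# Made-up spans; the root durations echo the 2.74 s and 4.2 s traces.
left = [
    {"service": "frontend", "operation": "GET", "duration_ms": 2740},
    {"service": "auth", "operation": "check", "duration_ms": 100},
    {"service": "cache", "operation": "get", "duration_ms": 3},
]
right = [
    {"service": "frontend", "operation": "GET", "duration_ms": 4200},
    {"service": "auth", "operation": "check", "duration_ms": 105},
]
heat = classify(left, right)
```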

slide-59
SLIDE 59

Comparing trace durations

Details on Mouse-Over

slide-60
SLIDE 60

Comparing trace durations

Details on Mouse-Over

slide-61
SLIDE 61

How Are These Approaches Different?

Summary

  • Surface less information
  • Condense the structural representation
  • Emphasize the differences
  • Distinct comparison modes simplify the comparisons

slide-62
SLIDE 62

Challenges

Individual traces can be outliers. The user must find the right baseline.

slide-63
SLIDE 63

Traces vs. Trace

slide-64
SLIDE 64

What Went Wrong?

Root Cause Analysis

slide-65
SLIDE 65

Top Level Outcome

Including Request/Response Payloads

slide-66
SLIDE 66

Link to the Trace

Can Always Go Back to Raw Data

slide-67
SLIDE 67

Trace Structure

Nodes Are Sorted Chronologically

slide-68
SLIDE 68

Present and Missing Nodes

Color-Coding
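The slides do not describe how the aggregate is built, so the following is only one plausible sketch: compute how often each node appears across a set of baseline traces, then mark which of the commonly seen nodes are present or missing in the failed trace. The 0.9 threshold, the node names, and the function names are arbitrary assumptions, and the baseline is assumed to be non-empty.

```python
from collections import Counter


def node_frequency(traces):
    """Fraction of baseline traces in which each node appears."""
    counts = Counter()
    for trace in traces:
        counts.update(set(trace))  # count each node once per trace
    return {node: n / len(traces) for node, n in counts.items()}


def compare_to_baseline(failed_trace, baseline_traces, threshold=0.9):
    """Split nodes into present, missing, and unexpected.

    A node is 'expected' when it appears in at least `threshold` of the
    baseline traces.
    """
    freq = node_frequency(baseline_traces)
    expected = {n for n, f in freq.items() if f >= threshold}
    seen = set(failed_trace)
    return {
        "present": sorted(expected & seen),
        "missing": sorted(expected - seen),
        "unexpected": sorted(seen - set(freq)),
    }


# Made-up data: ten successful traces, one of which has an extra node.
baseline = [["gateway", "auth", "pricing", "dispatch"]] * 9
baseline.append(["gateway", "auth", "promo", "pricing", "dispatch"])
failed = ["gateway", "auth", "billing"]
report = compare_to_baseline(failed, baseline)
```

Comparing against many traces instead of one avoids the baseline problem noted earlier: a single outlier in the baseline (here, the `promo` node) does not distort the result.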

slide-69
SLIDE 69

A Node With Error Data

slide-70
SLIDE 70

Error Data Panel

slide-71
SLIDE 71

How Is This Approach Different?

Summary

Much broader context: aggregate vs. one trace

One purpose: root cause analysis of reliability issues

slide-72
SLIDE 72

Tackling Data Complexity

slide-73
SLIDE 73

Uber is a data company

OK, and a transportation company

  • More data is derived from other data
  • Data undergoes many transformations

  • Microservices / RPCs
  • Streams / Kafka
  • Data lake / HDFS

Debugging data quality is difficult

slide-74
SLIDE 74

Data Lineage

Debugging Data Quality

  • Microservices / RPCs
  • Streams / Kafka
  • Data lake / HDFS
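The slides do not say how lineage is recorded. As one way to picture the debugging task, the sketch below treats lineage as a "derived from" mapping and walks it to find every upstream source of a suspect dataset. All dataset names and the `upstream` helper are hypothetical.

```python
def upstream(dataset, derived_from):
    """Return every dataset that `dataset` is directly or indirectly derived from."""
    seen = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        for parent in derived_from.get(current, []):
            if parent not in seen:  # also guards against cycles
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical lineage spanning RPCs, a stream, and data lake tables.
derived_from = {
    "trip_events_topic": ["trip_service_rpc"],
    "trips_raw_table": ["trip_events_topic"],
    "fares_table": ["pricing_service_rpc"],
    "daily_revenue_report": ["trips_raw_table", "fares_table"],
}
sources = upstream("daily_revenue_report", derived_from)
```

When the report looks wrong, `sources` is the list of places where the bad data could have originated.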

slide-75
SLIDE 75

Observability requires high quality instrumentation.

slide-76
SLIDE 76

Our Software Is Highly Composable

Often from Open Source Components

Microservice

Queue Driver DB Driver DB Queue Server Framework RPC Framework

Threads

slide-77
SLIDE 77

Tracing breaks if components don’t understand each other.

slide-78
SLIDE 78

Standardization Efforts

Instrumentation and Data Formats

OpenTelemetry
Effective observability requires high-quality telemetry. OpenTelemetry makes robust, portable telemetry a built-in feature of cloud-native software.

Distributed Tracing Working Group
Data formats for on-the-wire trace context & correlation-context, and out-of-band trace data.
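As an example of an on-the-wire trace context format, the sketch below parses and builds a header in the layout of the W3C Trace Context `traceparent` header: a 2-hex-digit version, a 32-hex-digit trace ID, a 16-hex-digit parent span ID, and 2 hex digits of flags, separated by dashes. It handles only version `00` and is a simplified reading of the format, not a complete implementation; the function names are made up.

```python
import re

_TRACEPARENT = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Parse a version-00 `traceparent` header; return None if invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match or match["version"] != "00":
        return None
    if match["trace_id"] == "0" * 32 or match["parent_id"] == "0" * 16:
        return None  # all-zero IDs are not valid
    return {
        "trace_id": match["trace_id"],
        "parent_id": match["parent_id"],
        "sampled": bool(int(match["flags"], 16) & 0x01),
    }


def format_traceparent(trace_id, span_id, sampled):
    """Build the header value to send with an outgoing request."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
context = parse_traceparent(header)
```

Because every component reads and writes the same header, a trace can pass through frameworks and drivers from different vendors without breaking.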

slide-79
SLIDE 79

In Summary

Distributed tracing helps us to deal with the overwhelming complexity of microservices

slide-80
SLIDE 80

In Summary

Creative visualizations are essential in performance analysis

slide-81
SLIDE 81

In Summary

Distributed tracing provides unparalleled insights into our distributed systems

slide-82
SLIDE 82

Q&A

Thank You

Find me @ shkuro.com