Conquering Microservices Complexity @Uber With Distributed Tracing

Conquering Microservices Complexity @Uber With Distributed Tracing Yuri Shkuro SOFTWARE ENGINEER @ UBER

Agenda: Why Distributed Tracing; Trace as a Narrative; Trace vs. Trace; Traces vs. Trace; Data Lineage; Q & A

Yuri Shkuro: Software Engineer, Uber Technologies. Founder & Maintainer of CNCF Jaeger (jaegertracing.io). Co-founder of OpenTracing & OpenTelemetry. Author of "Mastering Distributed Tracing", by Packt Publishing. shkuro.com

Quick Poll

Why Distributed Tracing

Scaling With Users Distributed Systems

Scaling With Engineering Organization: Monoliths to Microservices (diagram: components A, B, C, D)

Scaling With CPU Cores: Asynchronous Programming Models, Distributed Concurrency (panels: basic concurrency, async concurrency, distributed concurrency)

In microservices architectures the number of failure modes increases exponentially

Observability of distributed transactions is paramount!

Observability vs. monitoring

Observability: a system’s ability to answer questions. Which services did the request go through? What did every service do when processing the request? If the request was slow, where were the bottlenecks? If the request failed, where did the errors happen? How different was the execution from the normal system behavior (structural differences, performance differences)? What was on the critical path of the request? Who should be paged?

Distributed tracing can answer these questions and accelerate root cause analysis

Distributed Tracing in a Nutshell

Trace as a narrative

Trace Timeline Classic trace view as Gantt chart

Trace Timeline Parent → Child → Grandchild

Trace Timeline Time + Mini-Map

Trace Timeline Blocking operation

Trace Timeline Sequential operations

Trace Timeline Errors
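The timeline slides above describe a trace as a tree of spans (parent, child, grandchild) drawn as a Gantt chart, with errors highlighted. A minimal sketch of that idea follows. The `Span` fields, the operation names, and the text rendering are assumptions made for illustration, not the data model of Jaeger or any other tracer.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    # Hypothetical minimal span record; real tracers carry many more fields.
    span_id: str
    parent_id: Optional[str]
    operation: str
    start_ms: int
    duration_ms: int
    error: bool = False


def render_timeline(spans: List[Span], width: int = 40) -> List[str]:
    """Return one text row per span, indented by depth, with a Gantt-style bar."""
    children: Dict[Optional[str], List[Span]] = {}
    for s in spans:
        children.setdefault(s.parent_id, []).append(s)
    t0 = min(s.start_ms for s in spans)
    t1 = max(s.start_ms + s.duration_ms for s in spans)
    scale = width / max(t1 - t0, 1)
    rows: List[str] = []

    def visit(span: Span, depth: int) -> None:
        offset = int((span.start_ms - t0) * scale)
        length = max(int(span.duration_ms * scale), 1)
        # Spans with an error are drawn with '!' instead of '#'.
        bar = " " * offset + ("!" if span.error else "#") * length
        rows.append(f"{'  ' * depth + span.operation:<24}|{bar}")
        for child in sorted(children.get(span.span_id, []), key=lambda c: c.start_ms):
            visit(child, depth + 1)

    for root in sorted(children.get(None, []), key=lambda c: c.start_ms):
        visit(root, 0)
    return rows


# Made-up example trace: one root, two children, one grandchild.
trace = [
    Span("1", None, "frontend /dispatch", 0, 100),
    Span("2", "1", "customer lookup", 5, 40),
    Span("3", "2", "mysql SELECT", 10, 30),
    Span("4", "1", "driver search", 50, 45, error=True),
]
for row in render_timeline(trace):
    print(row)
```

Sequential operations show up as bars that start where the previous sibling ends, and a blocking operation as a child bar that spans most of its parent.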

Span details

Span details Database query

Span details Timed events (logs)

We can also trace asynchronous workflows

Tracing Talk Application ("Mastering Distributed Tracing", Chapter 5)

Tracing Talk Application Architecture

Tracing Talk Application Request trace

Tracing Talk Application Message sent

Tracing Talk Application Message received
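Tracing an asynchronous workflow such as the message sent / message received pair above relies on carrying the trace identity inside the message itself. The sketch below shows one way this could look with an in-process queue. The header names and the dictionary-based span are assumptions for illustration only; they are not the format used by the application in the talk.

```python
import queue
import uuid
from typing import Dict, Tuple

# Hypothetical header names, chosen for this sketch only.
TRACE_ID_HEADER = "x-trace-id"
PARENT_SPAN_HEADER = "x-parent-span-id"


def new_id() -> str:
    """Generate a random 16-hex-character identifier."""
    return uuid.uuid4().hex[:16]


def send(q: "queue.Queue", payload: str, trace_id: str) -> str:
    """Producer side: start a 'send' span and inject its identity into the headers."""
    send_span_id = new_id()
    headers = {TRACE_ID_HEADER: trace_id, PARENT_SPAN_HEADER: send_span_id}
    q.put({"headers": headers, "payload": payload})
    return send_span_id


def receive(q: "queue.Queue") -> Tuple[Dict[str, str], str]:
    """Consumer side: extract the context and start a 'receive' span in the same trace."""
    msg = q.get()
    headers = msg["headers"]
    span = {
        "trace_id": headers[TRACE_ID_HEADER],
        "parent_id": headers[PARENT_SPAN_HEADER],
        "span_id": new_id(),
        "operation": "receive",
    }
    return span, msg["payload"]
```

Because the receive span reuses the trace ID and points at the send span as its parent, both sides of the queue end up in one trace even though they run at different times.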

Single Trace Pros and cons. Pros: Tells a story about a single transaction. Allows deep contextual drill-down. Acts as a distributed stack trace. Cons: Tells a story about a single transaction. What if it’s an anomaly? One trace can be overwhelmingly complex.

Too Much Complexity One request - 30 services, 100+ RPCs

Too Much Complexity Some traces have hundreds of thousands of spans

Reducing complexity by smarter visualizations

Trace graph Time ordered, repeated edges collapsed

Trace graph Latency heat map
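The "repeated edges collapsed" idea can be sketched as folding all spans between the same pair of services into one edge with a call count. The tuple layout for a span and the service names are assumptions for this example; a real trace graph would likely key nodes on more than the service name.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Each span is (span_id, parent_id, service); a hypothetical minimal shape.
SpanRow = Tuple[str, Optional[str], str]


def collapse_edges(spans: List[SpanRow]) -> Dict[Tuple[str, str], int]:
    """Map (caller service, callee service) to the number of calls in the trace."""
    service_of = {span_id: service for span_id, _, service in spans}
    edges: Counter = Counter()
    for _, parent_id, service in spans:
        # Root spans and spans with an unknown parent contribute no edge.
        if parent_id is not None and parent_id in service_of:
            edges[(service_of[parent_id], service)] += 1
    return dict(edges)


# Made-up trace: "api" calls "auth" once and "db" three times.
spans = [
    ("1", None, "api"),
    ("2", "1", "auth"),
    ("3", "1", "db"),
    ("4", "1", "db"),
    ("5", "1", "db"),
]
print(collapse_edges(spans))
```

If the input spans are sorted by start time, the resulting dictionary keeps edges in first-seen order, which approximates the time ordering mentioned on the slide.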

Finding anomalies is easier when we look at differences in performance profiles

Trace vs. Trace

Comparing Trace Structures Just like a Code Diff

Comparing Trace Structures Shared Structure

Comparing Trace Structures Absent in One of the Traces

Comparing Trace Structures More or Fewer Spans Within a Node

Comparing Trace Structures Substantial Divergence
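The comparison categories above (shared structure, absent in one of the traces, more or fewer spans within a node) can be sketched as a diff over node counts. Identifying a node by a (service, operation) pair is an assumption made for this example; the slides do not say how nodes are keyed.

```python
from collections import Counter
from typing import Dict, List, Tuple

# A node is identified by (service, operation); a hypothetical grouping key.
Node = Tuple[str, str]


def diff_traces(left: List[Node], right: List[Node]) -> Dict[Node, str]:
    """Classify every node seen in either trace, similar to a code diff."""
    a, b = Counter(left), Counter(right)
    result: Dict[Node, str] = {}
    for node in sorted(set(a) | set(b)):
        if node not in b:
            result[node] = "only in left"
        elif node not in a:
            result[node] = "only in right"
        elif a[node] == b[node]:
            result[node] = "shared"
        else:
            # Same node in both traces, but a different number of spans.
            result[node] = f"shared, {b[node] - a[node]:+d} spans in right"
    return result


# Made-up traces, one entry per span.
left = [("api", "GET /"), ("db", "query"), ("cache", "get")]
right = [("api", "GET /"), ("db", "query"), ("db", "query"), ("billing", "charge")]
for node, status in diff_traces(left, right).items():
    print(node, status)
```

This sketch ignores the parent-child position of each node, so it cannot by itself detect the substantial divergence case where the same operations appear under different callers.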

Deep Linking to Raw Traces & Spans Error: "You have an outstanding balance…"

Production story: Migrating services to a nearby datacenter. Request latency doubles.

Investigating latency Structural comparison not always useful

Investigating latency Very similar structure

Investigating latency Left trace: 2.74 seconds

Investigating latency Right trace: 4.2 seconds

Investigating latency Due to structural differences?

Investigating latency Or dispersed contributors?

Heat-maps!

Comparing trace durations Heat-map of latencies

Comparing trace durations Similar durations (grey)

Comparing trace durations Nodes that are not shared (white)

Comparing trace durations Red heat-map for latency differences

Comparing trace durations Details on Mouse-Over
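The colour scheme on the slides above (grey for similar durations, white for nodes that are not shared, red for latency differences) can be sketched as a per-node comparison of two traces. The 10% tolerance, the node names, and the choice to report the relative change alongside the colour are all assumptions for this example; the slides do not state the actual thresholds.

```python
from typing import Dict, Optional, Tuple


def latency_heatmap(
    left_ms: Dict[str, float],
    right_ms: Dict[str, float],
    tolerance: float = 0.10,  # assumed threshold for "similar" durations
) -> Dict[str, Tuple[str, Optional[float]]]:
    """Return node -> (colour, relative change of right versus left)."""
    colours: Dict[str, Tuple[str, Optional[float]]] = {}
    for node in sorted(set(left_ms) | set(right_ms)):
        if node not in left_ms or node not in right_ms:
            colours[node] = ("white", None)  # node is not shared
            continue
        # Guard against a zero baseline when computing the relative change.
        base = max(left_ms[node], 1e-9)
        change = (right_ms[node] - left_ms[node]) / base
        colour = "grey" if abs(change) <= tolerance else "red"
        colours[node] = (colour, change)
    return colours


# Made-up per-node durations in milliseconds.
left = {"api": 100.0, "db": 50.0, "cache": 5.0}
right = {"api": 105.0, "db": 150.0, "billing": 20.0}
for node, (colour, change) in latency_heatmap(left, right).items():
    print(node, colour, change)
```

A visualization could map the magnitude of the relative change to the intensity of red, so that dispersed contributors each show a faint tint while a single dominant contributor stands out.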

How Are These Approaches Different? Summary: Distinct comparison modes simplify the comparisons. Surface less information. Condense the structural representation. Emphasize the differences.

Challenges: Individual traces can be outliers. The user must find the right baseline.

Traces vs. Trace

What Went Wrong? Root Cause Analysis

Top Level Outcome Including Request/Response Payloads

Link to the Trace Can Always Go Back to Raw Data

Trace Structure Nodes Are Sorted Chronologically

Present and Missing Nodes Color-Coding

A Node With Error Data

Error Data Panel

How Is This Approach Different? Summary: Much broader context: aggregate vs. one trace. One purpose: root cause analysis of reliability issues.

Tackling Data Complexity

Uber is a data company (OK, and a transportation company). Streams / Kafka, Data lake / HDFS, Microservices / RPCs. Data undergoes many transformations. More data is derived from other data. Debugging data quality is difficult.

Data Lineage: Debugging Data Quality (Streams / Kafka, Data lake / HDFS, Microservices / RPCs)

Observability requires high quality instrumentation.

Our Software Is Highly Composable, Often from Open Source Components (diagram labels: Server Framework, RPC Framework, Microservice, Threads, Queue Driver, DB Driver, DB, Queue)

Tracing breaks if components don’t understand each other.

Standardization Efforts: Instrumentation and Data Formats. Effective observability requires high-quality telemetry. OpenTelemetry makes robust, portable telemetry a built-in feature of cloud-native software. Distributed Tracing Working Group: data formats for on-the-wire trace context & correlation-context, and out-of-band trace data.
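A shared on-the-wire format is what lets independently written components continue the same trace. As an illustration, the sketch below parses and formats a `traceparent` header, assuming the "on-the-wire trace context" mentioned above refers to the W3C Trace Context format; the slides do not name the format explicitly. The `TraceContext` type and function names are hypothetical, and only the version `00` layout is handled.

```python
import re
from typing import NamedTuple, Optional

# version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a version-00 traceparent value; return None if it is not valid."""
    m = _TRACEPARENT.match(header.strip())
    if m is None:
        return None
    version, trace_id, parent_id, flags = m.groups()
    # Simplification: this sketch rejects any version other than "00".
    if version != "00":
        return None
    # All-zero identifiers are treated as invalid.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # The lowest bit of the flags byte marks the trace as sampled.
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def format_traceparent(ctx: TraceContext) -> str:
    """Serialize a context back into a version-00 traceparent value."""
    return f"00-{ctx.trace_id}-{ctx.parent_id}-{'01' if ctx.sampled else '00'}"


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(header)
print(ctx)
```

A component that receives this header would start its own span with a new span ID, keep the trace ID, and send the updated header on every outgoing call.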

In Summary Distributed tracing helps us to deal with the overwhelming complexity of microservices

In Summary Creative visualizations are essential in performance analysis

In Summary Distributed tracing empowers unparalleled insights into our distributed systems

Thank You Find me @ shkuro.com Q&A
