Distributed Tracing: Understand how your components work together


slide-1
SLIDE 1

Distributed Tracing

Understand how your components work together

slide-2
SLIDE 2

About me

José Carlos Chávez

  • Software Engineer at Typeform

focused on the aggregate of responses services.

  • Zipkin core team and open source

contributor for Observability projects.

slide-3
SLIDE 3

Distributed Systems

slide-4
SLIDE 4

Distributed systems

A collection of independent components appears to its users as a single coherent system. Characteristics:

  • Concurrency
  • No global clock
  • Independent failures
slide-5
SLIDE 5

Distributed systems

[Figure: plumbing diagram with labels Water heater, Gas supplier, Cold water storage tank, Shutoff valve, First floor branch, Tank valve]

slide-6
SLIDE 6

Distributed systems: Understanding failures

[Figure: service diagram. Components: API Proxy, Media API, Auth service, Images service, Videos service, DB1, DB2, DB3, DB4. Request: GET /media/e5k2. Errors shown: TCP error (2003), 500 Internal Error, 500 Internal Error]

slide-7
SLIDE 7

Distributed systems: Understanding failures

[Figure: plumbing diagram with labels Water heater, Gas supplier, Cold water storage tank, Shutoff valve, First floor branch, Tank valve]

I AM HERE! First floor distributor is clogged!

slide-8
SLIDE 8

We do have that, it is called logs!

slide-9
SLIDE 9

Logs & Concurrency

[Figure: service diagram. Components: API Proxy, Auth service, Media API, Images service, Videos service, DB1, DB2, DB3, DB4. Request: GET /media/e5k2. Errors shown: TCP error (2003), 500 Internal Error, 500 Internal Error]

slide-10
SLIDE 10

Logs & Concurrency

[24/Oct/2017 13:50:07 +0000] “GET /media HTTP/1.1” 200 … **0/13548”
[24/Oct/2017 13:50:07 +0000] “GET /media HTTP/1.1” 200 … **0/23948”
[24/Oct/2017 13:50:08 +0000] “GET /media HTTP/1.1” 200 … **0/12396”
[24/Oct/2017 13:50:07 +0000] “GET /videos HTTP/1.1” 200 … **0/23748”
[24/Oct/2017 13:50:07 +0000] “GET /images HTTP/1.1” 200 … **0/23248”
[24/Oct/2017 13:50:07 +0000] “GET /auth HTTP/1.1” 200 … **0/26548”
[24/Oct/2017 13:50:07 +0000] “POST /media HTTP/1.1” 200 … **0/13148”
[24/Oct/2017 13:50:07 +0000] “GET /media HTTP/1.1” 200 … **0/2588”
[24/Oct/2017 13:50:07 +0000] “GET /auth HTTP/1.1” 500 … **0/3248”
[24/Oct/2017 13:50:07 +0000] “POST /media HTTP/1.1” 200 … **0/23548”
[24/Oct/2017 13:50:07 +0000] “GET /images HTTP/1.1” 200 … **0/22598”
[24/Oct/2017 13:50:07 +0000] “GET /videos HTTP/1.1” 200 … **0/13948”
... ? ?

slide-11
SLIDE 11

Distributed systems: Understanding failures

[Figure: plumbing diagram with labels Water heater, Gas supplier, Cold water storage tank, Shutoff valve, First floor branch, Tank valve]

I AM HERE! First floor distributor is clogged!

slide-12
SLIDE 12

Distributed Tracing to unclog your pipes

slide-13
SLIDE 13

Distributed tracing

[Figure: trace timeline. Spans plotted over Time for API Proxy, Media API, Auth, Videos, Images; one span is marked "500 error"]

Span log: [1508410442] no cache for resource, retrieving from DBc
TraceID: d52d38b69b0fb15efa

slide-14
SLIDE 14

Distributed Tracing: What answers do I get?

  • What services did a request pass through?
  • What occurred in each service for a given request?
  • Where did the error happen?
  • Where are the bottlenecks?
  • What is the critical path for a request?
  • Who should I page?
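
A few of these questions can be answered mechanically once spans are collected. The sketch below is only an illustration: the span records, service names, timings and the failing service are hypothetical, and the dictionary layout is an assumption rather than any tracer's real format.

```python
from typing import Dict, List, Optional

# Hypothetical spans of one trace; times are in milliseconds.
spans: List[Dict] = [
    {"service": "api-proxy", "name": "GET /media", "start": 0, "end": 120, "error": False},
    {"service": "media-api", "name": "GET /media", "start": 5, "end": 115, "error": False},
    {"service": "auth", "name": "GET /auth", "start": 10, "end": 30, "error": False},
    {"service": "videos", "name": "GET /videos", "start": 30, "end": 70, "error": False},
    {"service": "images", "name": "GET /images", "start": 70, "end": 100, "error": True},
]


def services_visited(trace: List[Dict]) -> List[str]:
    """Which services did the request pass through, in order of first appearance?"""
    seen: List[str] = []
    for span in sorted(trace, key=lambda s: s["start"]):
        if span["service"] not in seen:
            seen.append(span["service"])
    return seen


def first_error(trace: List[Dict]) -> Optional[Dict]:
    """Where did the error happen? Returns the earliest span flagged as failed."""
    failed = [s for s in trace if s["error"]]
    return min(failed, key=lambda s: s["start"]) if failed else None


def slowest(trace: List[Dict]) -> Dict:
    """A first hint at a bottleneck: the span with the longest duration."""
    return max(trace, key=lambda s: s["end"] - s["start"])
```

Calling `first_error(spans)` points at the hypothetical `images` span, which is also a reasonable starting point for deciding who to page.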
slide-15
SLIDE 15

Distributed Tracing & friends

Credits: Peter Bourgon

slide-16
SLIDE 16

Benefits of Distributed Tracing

  • (almost) Immediate feedback
  • System insight: clarifies non-trivial interactions
  • Visibility into critical paths and dependencies
  • Understand latencies
  • Request scoped, not request’s lifecycle scoped.
slide-17
SLIDE 17

Trace’s Anatomy

  • A trace shows an execution path through a distributed system.
  • A span in the trace represents a logical unit of work (with a start and end).
  • A context includes information that should be propagated across services.
  • Tags and logs (optional) add complementary information to spans.

[Figure: a trace drawn over Time, with spans /things, auth.Auth, GET /videos, mysql.Get]
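
The anatomy above can be sketched as a small data model. This is a minimal illustration, not Zipkin's actual model: the `Span` class, its field names, the IDs and the parent/child arrangement of the four spans are assumptions made for the example; only the span names come from the figure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Span:
    """A logical unit of work with a start and an end (hypothetical model)."""
    trace_id: str                # shared by every span in the same trace
    span_id: str                 # unique within the trace
    parent_id: Optional[str]     # None for the root span
    name: str
    start: float                 # seconds on an arbitrary clock
    end: float
    tags: Dict[str, str] = field(default_factory=dict)            # optional key/value details
    logs: List[Tuple[float, str]] = field(default_factory=list)   # optional timestamped events

    @property
    def duration(self) -> float:
        return self.end - self.start


def children_of(spans: List[Span], parent: Span) -> List[Span]:
    """Return the direct children of a span, ordered by start time."""
    kids = [s for s in spans if s.parent_id == parent.span_id]
    return sorted(kids, key=lambda s: s.start)


# An assumed arrangement of the spans named in the figure.
root = Span("t1", "a", None, "/things", 0.0, 1.0)
auth = Span("t1", "b", "a", "auth.Auth", 0.1, 0.3)
videos = Span("t1", "c", "a", "GET /videos", 0.3, 0.9)
query = Span("t1", "d", "c", "mysql.Get", 0.4, 0.8, tags={"db.type": "mysql"})
trace = [root, auth, videos, query]
```

The trace itself is just the set of spans sharing a `trace_id`; the tree shape is recovered from the `parent_id` links.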

slide-18
SLIDE 18

Elements of distributed tracing

Credits: Nic Munroe

  • Leg 1: inbound propagation
  • Leg 2: outbound propagation
  • Leg 3: in-process propagation

[Figure label: Distributed Tracing]

slide-19
SLIDE 19

Leg 1: Inbound propagation

When your service processes a request or consumes a message.

[Figure: API Proxy calls Media API with GET /media, carrying TraceID: fAf3oXL6DS, SpanID: dZ0xHIBa1A ...]
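
As a rough sketch of inbound propagation, the function below reads the trace context from incoming HTTP headers and starts a fresh trace when none is present. It assumes the B3 header names (`X-B3-TraceId`, `X-B3-SpanId`, `X-B3-ParentSpanId`); the `TraceContext` class and the `extract` and `new_id` helpers are hypothetical names, not part of any Zipkin library.

```python
import secrets
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TraceContext:
    trace_id: str
    span_id: str
    parent_id: Optional[str] = None


def new_id() -> str:
    """Generate a random 64-bit identifier as 16 lowercase hex characters."""
    return secrets.token_hex(8)


def extract(headers: Mapping[str, str]) -> TraceContext:
    """Continue the caller's trace if B3 headers are present, else start a new trace."""
    lowered = {key.lower(): value for key, value in headers.items()}
    trace_id = lowered.get("x-b3-traceid")
    span_id = lowered.get("x-b3-spanid")
    if trace_id and span_id:
        return TraceContext(trace_id, span_id, lowered.get("x-b3-parentspanid"))
    # No usable context on the request: this service becomes the root of a new trace.
    return TraceContext(trace_id=new_id(), span_id=new_id())
```

With the identifiers from the figure, `extract({"X-B3-TraceId": "fAf3oXL6DS", "X-B3-SpanId": "dZ0xHIBa1A"})` yields a context that keeps both values, so the Media API's work is attached to the trace started by the API Proxy.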

slide-20
SLIDE 20

Leg 2: Outbound propagation

When your service makes an outbound call to another service

[Figure: Media API calls Video service with GET /videos (http/get), carrying TraceID: fAf3oXL6DS, ParentID: dZ0xHIBa1A, SpanID: y74fr5udj]
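
Outbound propagation could look roughly like this: before calling the next service, create a child span ID and send it together with the unchanged trace ID and the current span as parent. The B3 header names are assumed, and `child_headers` is a hypothetical helper written for this illustration.

```python
import secrets
from typing import Dict, Tuple


def child_headers(trace_id: str, current_span_id: str) -> Tuple[str, Dict[str, str]]:
    """Build the headers for an outbound call made while `current_span_id` is active."""
    child_id = secrets.token_hex(8)  # new span for the outbound call
    headers = {
        "X-B3-TraceId": trace_id,              # unchanged across the whole trace
        "X-B3-ParentSpanId": current_span_id,  # the span that makes the call
        "X-B3-SpanId": child_id,
    }
    return child_id, headers
```

For the call in the figure, `child_headers("fAf3oXL6DS", "dZ0xHIBa1A")` keeps the trace ID, sets the parent to `dZ0xHIBa1A`, and generates a new span ID in place of `y74fr5udj`.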

slide-21
SLIDE 21

Leg 3: In-process propagation

When performing an operation inside the service

[Figure: Media API handling GET /images, with calls to Cache service and Images service; spans redis.Get and mysql.Query]

slide-22
SLIDE 22

Distributed tracing

[Figure: trace timeline. Spans plotted over Time for API Proxy, Media API, Auth, Videos, Images; one span is marked "500 error"]

Span log: [1508410442] no cache for resource, retrieving from DBc
TraceID: d52d38b69b0fb15efa

slide-23
SLIDE 23

Any overhead?

For users:

  • Observability tools are meant to be unintrusive
  • Sampling reduces overhead
  • (Don’t) trace every single operation

For developers:

  • Not all libraries are ready to have instrumentation plugged in
  • Instrumentation can be delegated to common frameworks
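
To make the sampling point concrete, here is a minimal sketch of a rate-based sampler. It assumes hexadecimal trace IDs and derives the decision from the ID itself, so the same trace gets the same answer wherever it is evaluated; `is_sampled` is a hypothetical helper, and real tracers often propagate the decision made at the root instead (B3 has an `X-B3-Sampled` header for that).

```python
def is_sampled(trace_id: str, rate: float) -> bool:
    """Decide whether to record a trace, keeping roughly `rate` of all traces."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    # Map the hex trace ID onto 10,000 buckets and keep the lowest `rate` share.
    bucket = int(trace_id, 16) % 10_000
    return bucket < rate * 10_000
```

A rate of `1.0` records every trace and `0.0` records none; anything in between trades completeness for lower overhead.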

slide-24
SLIDE 24

Introducing Apache Zipkin

slide-25
SLIDE 25

Apache Zipkin

Based on BigBrotherBird (B3) and inspired by Google Dapper (2010). It was open sourced by Twitter (2012) and joined the Apache Incubator in September 2018.

  • Mature tracing model that emerged from users’ needs.
  • Used by large companies like Netflix, SoundCloud and Yelp, but also by smaller ones.
  • Strong community:

○ @zipkinproject
○ gitter.im/openzipkin

slide-26
SLIDE 26

Zipkin: architecture

  • Service (instrumented): collects spans
  • Transport (http/kafka/grpc)
  • Collector: receives spans, deserializes them and schedules them for storage
  • Storage DB (Cassandra/MySQL/ElasticSearch): stores spans
  • API: retrieves data
  • UI: visualizes
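
What an instrumented service hands to the transport can be sketched as a JSON document. The example below builds a payload in the shape of Zipkin's v2 span format (a JSON array, with timestamps and durations in microseconds), which the collector accepts over HTTP at `/api/v2/spans`. The `to_zipkin_v2` helper, the service name and the tag are assumptions for illustration, and nothing is sent over the network.

```python
import json
from typing import Dict, Optional


def to_zipkin_v2(
    trace_id: str,
    span_id: str,
    parent_id: Optional[str],
    name: str,
    service: str,
    start_us: int,
    duration_us: int,
    tags: Dict[str, str],
) -> dict:
    """Shape one span as a Zipkin v2 JSON object (hypothetical helper)."""
    span = {
        "traceId": trace_id,
        "id": span_id,
        "name": name,
        "timestamp": start_us,    # epoch microseconds
        "duration": duration_us,  # microseconds
        "localEndpoint": {"serviceName": service},
        "tags": tags,
    }
    if parent_id is not None:
        span["parentId"] = parent_id  # omitted for the root span
    return span


# The collector expects a JSON array of spans.
payload = json.dumps([
    to_zipkin_v2(
        trace_id="463ac35c9f6413ad48485a3953bb6124",
        span_id="a2fb4a1d1a96d312",
        parent_id=None,
        name="get /media",
        service="media-api",
        start_us=1508410442000000,
        duration_us=25000,
        tags={"http.path": "/media"},
    )
])
```

In a real setup the reporter would batch such payloads and send them asynchronously, so that reporting stays off the request's critical path.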

slide-27
SLIDE 27

Zipkin: traces

slide-28
SLIDE 28

Zipkin: traces

slide-29
SLIDE 29

Zipkin: traces

slide-30
SLIDE 30

Zipkin: trace overview

slide-31
SLIDE 31

Zipkin: tags and logs

slide-32
SLIDE 32

Zipkin: traces with errors

slide-33
SLIDE 33

Zipkin: traces for async operations

slide-34
SLIDE 34

Zipkin: dependency graph

slide-35
SLIDE 35

Zipkin: dependency graph

slide-36
SLIDE 36

Q&As

twitter.com/jcchavezs
Find more: http://bit.ly/dist-trac