Tag and Release: Monitoring Increasingly Distributed Applications (PowerPoint presentation)



slide-1
SLIDE 1

Tag and Release

Monitoring Increasingly Distributed Applications

dkuebric / dan@appneta.com

slide-2
SLIDE 2

Outline

  • What is distributed tracing?
  • Who’s doing it, and how?
  • Challenges, and future directions?
slide-3
SLIDE 3
slide-4
SLIDE 4

Thrift Shop

  • Frontend web app: PHP
  • Text search: Lucene-based, via Thrift
  • Pricing service: Erlang, via Thrift
  • Spelling corrector: Python bindings around Xapian, via Thrift
  • Content provider search: Ruby, via Thrift
  • ...
slide-5
SLIDE 5

[Architecture diagram: perlbal front ends (fw1, fw2); Apache/PHP app servers (app1, ...); memcached caches; Lucene search nodes; MySQL databases (db1, db2); API services for search (Ruby), pricing (Erlang), and spelling (Python)]
slide-6
SLIDE 6

Q: Why do you remember this so well?

slide-7
SLIDE 7

Q: Why do you remember this so well?

A: ops

slide-8
SLIDE 8

“Close enough” architectural diagram

https://www.flickr.com/photos/clonedmilkmen/3604999084
slide-9
SLIDE 9

Things we had

  • Ganglia
  • Nagios
  • Thrift
      ○ Per-service status page
      ○ Service status page
  • Logs
slide-10
SLIDE 10

Sample performance / debug workflow

1. Are any services outright down?
2. Hit refresh N times -- how many times were problematic?
3. Systematically tail the logs of every service on every machine
4. Check database processlist
5. SSH in and poke around
6. Deploy new release with debug logging
7. Google

slide-11
SLIDE 11

X-Trace

slide-12
SLIDE 12

Example: Drupal request handling

[Diagram: a request passing through the web server (Apache) and application (PHP) layers, with the application calling out to SQL, memcached, and APIs]
slide-13
SLIDE 13

Drupal TraceView project

D6/7: https://www.drupal.org/project/traceview
D8: https://www.drupal.org/node/2113637
slide-14
SLIDE 14

Drupal 8 request handling

https://helloapp.tv.appneta.com/traces/view/FECA51A4134E765EBB04717C1D07F64352DE49E0
slide-15
SLIDE 15

Example Drupal 7 request

slide-27
SLIDE 27

Example Drupal request: more distributed

[Diagram: a request passing through the web server (Apache) and application (PHP) layers, with the application calling out to the database, a service, the cache, APIs, and Solr]
slide-28
SLIDE 28

Example Drupal request

slide-29
SLIDE 29

Example Drupal request

slide-30
SLIDE 30

Great minds...

  • Distributed tracing based on ID propagation
      ○ Google Dapper (200x? Published paper 2010)
      ○ Twitter Zipkin (Open-sourced 2012, 3rd party PHP support)
      ○ Etsy Cross Stitch (2014ish)
      ○ OpenTracing (2016ish)
  • Commercial APM -- semi-distributed tracing
      ○ New Relic
      ○ AppDynamics
slide-31
SLIDE 31

Challenges: Instrumentation Points

function interesting_method(...) {
    log_entry(...);
    _do_stuff();
    log_exit(...);
}
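
The entry/exit pattern on this slide can be applied without editing each method body. Below is a minimal sketch in Python (not the PHP of the talk), assuming a hypothetical in-memory `EVENTS` list as the sink and hypothetical names `log_event` and `instrumented`; it is an illustration of the idea, not TraceView's implementation.

```python
import functools
import time

# Hypothetical in-memory sink; a real tracer would ship events elsewhere.
EVENTS = []

def log_event(layer, label, **info):
    """Record one trace event with a monotonic timestamp."""
    EVENTS.append({"layer": layer, "label": label, "time": time.monotonic(), **info})

def instrumented(layer):
    """Wrap a function so every call emits an entry and an exit event."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            log_event(layer, "entry", function=func.__name__)
            try:
                return func(*args, **kwargs)
            finally:
                # The exit event is emitted even if the call raises.
                log_event(layer, "exit", function=func.__name__)
        return wrapper
    return decorator

@instrumented("pricing")
def interesting_method(x):
    return x * 2
```

The challenge the slide points at remains: someone still has to decide which methods are interesting enough to wrap.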

slide-32
SLIDE 32

Challenges: Trace ID Propagation

function interesting_method(trace_id, ...) {
    log_entry(trace_id, ...);
    _do_stuff(?);
    log_exit(trace_id, ...);
}

Optional in PHP! Could use globals due to single-request handling model.
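
In runtimes that may serve several requests in one process, a request-scoped variable can play the role that a global plays in PHP. The following is a sketch in Python using the standard `contextvars` module; `start_trace`, `log_event`, and the `LOG` list are hypothetical names for illustration only.

```python
import contextvars
import uuid

# Request-scoped trace ID; the default None means "not tracing".
_trace_id = contextvars.ContextVar("trace_id", default=None)

# Hypothetical in-memory log of (trace_id, label) pairs.
LOG = []

def start_trace():
    """Begin a trace at the edge of the system and return its new ID."""
    trace_id = uuid.uuid4().hex
    _trace_id.set(trace_id)
    return trace_id

def log_event(label):
    """Attach the ambient trace ID; no function signature has to change."""
    LOG.append((_trace_id.get(), label))

def _do_stuff():
    log_event("doing stuff")

def interesting_method():
    log_event("entry")
    _do_stuff()
    log_event("exit")
```

Here `_do_stuff()` never receives a `trace_id` argument, yet its events carry the same ID as the caller's.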
slide-33
SLIDE 33

Challenges: Trace ID Propagation

function http_rpc_call(...) {
    log_entry(...);
    $opt = array(modified_headers);
    drupal_http_request($url, $opt);
    log_exit(...);
}
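
Across process boundaries the ID has to travel with the request itself, for example in an HTTP header. A minimal Python sketch of both sides follows; the header name `X-Trace-Id` and the helpers `inject` and `extract` are assumptions made for this example, not the header or API of any particular tracer.

```python
import uuid

# Hypothetical header name, chosen for this sketch only.
TRACE_HEADER = "X-Trace-Id"

def inject(headers, trace_id):
    """Caller side: return a copy of the outgoing headers with the trace ID added."""
    out = dict(headers)
    out[TRACE_HEADER] = trace_id
    return out

def extract(headers):
    """Callee side: continue the caller's trace if present, else start a new one."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == TRACE_HEADER.lower():
            return value
    return uuid.uuid4().hex
```

Every transport in the system (HTTP, Thrift, queues) needs an equivalent pair, which is why propagation is listed as a challenge.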

slide-34
SLIDE 34

Challenges: Extracting Value

slide-35
SLIDE 35

Rich data set

  • Distributed tracing “only”
      ○ Follow request flow through application
      ○ Understand end-to-end latency
      ○ Associate backend load with frontend requests
      ○ Provide errors with distributed context
  • While you’re in there
      ○ Latency of queries, RPC calls, in each tier
      ○ Slow code
      ○ Cache hit/miss ratio
      ○ Errors and exceptions
      ○ Custom tagging/categorization of data
      ○ ...
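
To make a couple of these items concrete, here is a rough Python sketch that derives per-layer latency and a cache hit ratio from the entry/exit events of one trace. The event shape (`layer`, `label`, `t`, `hit`) and the sample data are invented for illustration, and the sketch assumes a layer is never re-entered while one of its calls is still open.

```python
from collections import defaultdict

# Hypothetical events for one traced request (times in milliseconds).
events = [
    {"layer": "php", "label": "entry", "t": 0},
    {"layer": "memcached", "label": "entry", "t": 5},
    {"layer": "memcached", "label": "exit", "t": 7, "hit": False},
    {"layer": "mysql", "label": "entry", "t": 8},
    {"layer": "mysql", "label": "exit", "t": 48},
    {"layer": "memcached", "label": "entry", "t": 50},
    {"layer": "memcached", "label": "exit", "t": 51, "hit": True},
    {"layer": "php", "label": "exit", "t": 60},
]

def summarize(events):
    """Return (time spent per layer, cache hit ratio) for one trace."""
    open_at = {}                 # layer -> entry time of the currently open call
    time_in = defaultdict(int)   # layer -> total ms between entry and exit
    hits = lookups = 0
    for e in events:
        if e["label"] == "entry":
            open_at[e["layer"]] = e["t"]
        else:
            time_in[e["layer"]] += e["t"] - open_at.pop(e["layer"])
            if "hit" in e:
                lookups += 1
                hits += e["hit"]
    ratio = hits / lookups if lookups else None
    return dict(time_in), ratio
```

The time reported for the outermost layer (`php` here) is inclusive of the calls beneath it, so it doubles as the end-to-end latency of the request.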
slide-36
SLIDE 36

How does it actually work?

  • PHP extension
      ○ Hook into core methods
  • TraceView Module
      ○ Hook into key events -- take timing and attributes
  • Drupal 8 module, for example:
      ○ Event Dispatcher -- log timing of different kernel actions, etc
      ○ Event Subscriber -- figure out if user is anon/authenticated/admin
      ○ Service Provider -- alter base template class
          ■ Wrapper for Twig -- get timing and info on templates
slide-37
SLIDE 37

How does it actually work?

class TraceViewContainerAwareEventDispatcher extends ContainerAwareEventDispatcher {
  public function dispatch($eventName, Event $event = null) {
    // On an untraced request, bail out early.
    if (!oboe_is_tracing()) {
      return parent::dispatch($eventName, $event);
    }
    …
    // Figure out what event we’re dispatching
    if ($is_request) {
      oboe_log(($event->getRequestType() === HttpKernelInterface::MASTER_REQUEST) ? 'HttpKernel.master_request' : 'HttpKernel.sub_request', "entry", array('Event' => get_class($event)), TRUE);
      oboe_log(NULL, "profile_entry", array('Event' => get_class($event), 'ProfileName' => $eventName), TRUE);
    } elseif ($is_finish_request) {
      ...
    // Try to dispatch the event as normal.
    try {
      $ret = parent::dispatch($eventName, $event);
    // Catch any exceptions that occur during dispatch.
    } catch (\Exception $e) {
      ...
    }
    // And mark the end timing as well
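
The overall shape of that dispatcher (bail out when not tracing, log entry, delegate to the parent, log exit) can be shown end to end in a small self-contained sketch. This is Python rather than PHP, and `EventDispatcher`, `TracingEventDispatcher`, and the list-based `log` are hypothetical stand-ins, not the Symfony or TraceView classes.

```python
class EventDispatcher:
    """Stand-in for a framework dispatcher (hypothetical, for this sketch)."""
    def __init__(self):
        self.listeners = {}

    def add_listener(self, name, fn):
        self.listeners.setdefault(name, []).append(fn)

    def dispatch(self, name, event=None):
        for fn in self.listeners.get(name, []):
            fn(event)
        return event

class TracingEventDispatcher(EventDispatcher):
    """Emit entry/exit records around every dispatch while tracing is on."""
    def __init__(self, log, is_tracing=lambda: True):
        super().__init__()
        self.log = log
        self.is_tracing = is_tracing

    def dispatch(self, name, event=None):
        # On an untraced request, bail out early.
        if not self.is_tracing():
            return super().dispatch(name, event)
        self.log.append((name, "entry"))
        try:
            return super().dispatch(name, event)
        except Exception as exc:
            # Record the failure, then let the caller see it unchanged.
            self.log.append((name, "error", type(exc).__name__))
            raise
        finally:
            # Mark the end of the dispatch whether or not a listener raised.
            self.log.append((name, "exit"))
```

Because the subclass keeps the parent's `dispatch` signature, it can be swapped in wherever the original dispatcher is used, which is the role the Service Provider plays in the Drupal 8 module described earlier.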
slide-38
SLIDE 38

Aggregate performance

slide-39
SLIDE 39

Outliers, trends

slide-40
SLIDE 40

Topology mapping

slide-41
SLIDE 41

Thanks!

twitter.com/dkuebric
appneta.com

dkuebric / dan@appneta.com