Enhancing End-to-End Tracing Systems for Automated Performance - PowerPoint PPT Presentation

Enhancing End-to-End Tracing Systems for Automated Performance Debugging in Distributed Systems Jethro S. Sun January 23, 2018 MassOpenCloud Research Group 1

Introduction

A Sad Story ... 2

A Sad Story ... A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable. – Leslie Lamport 2

A Sad Story ... What developers and operators really need is a way to understand and troubleshoot a distributed system as a whole . 2

Performance Diagnosis in OpenStack OPENSTACK Bug # 1587777 was filed against HORIZON. 3

Performance Diagnosis in OpenStack And only took 10 Month to figure out it was something wrong in KEYSTONE . 3

Performance Diagnosis in OpenStack Q uestion: Is there a way to make developers’ and operators’ life less miserable? 3

Performance Diagnosis in OpenStack Q uestion: Is there a way to make developers’ and operators’ life less miserable? YES. End-to-end tracing 3

End-to-End Tracing, what is it and where we are today?

End-to-End Tracing Definition (End-to-End Tracing) E nd-to-end tracing captures the workflow of causally-related activity (e.g., work done to process a request) within and among every component of a distributed system. 1 Request work � ows Boundary 3ms 2ms 3ms Work 1ms 2ms 2ms Storage nodes Client Server App server Distributed � lesystem Table store 1 So, you want to trace your distributed system? Key design insights from 4 years of practical experience. Raja Sambasivan et al.

A Typical End-to-End Tracing Infrastructure Definition (Trace Metadata) F ields propagated with causally-related event to identify their workflows. They are usually unique IDs or in a format of logical clock stored thread-locally or context-locally. Definition (Trace Points) Instrumentation points in the system used to identify individual work done, and also propagate necessary metadata. Definition (Backend) Central collector that gathers pieces of trace data and reconstruct them into full feature-riched trace. 5

End-to-end Tracing gains its popularity gradually... TABLE 1 T imeline 2002 • Pinpoint 2004 • Magpie, SDI 2005 • Causeway 2006 • Pip, Stardust 2007 • X-Trace 2010 • Google Dapper 2012 • Zipkin, HTrace 2013 • Node.js CLS 2014 • Apple Activity Tracing, Blkin 2015 • AppNeta, AppDynamics, NewRelic, OSProfiler 2017 • 6

End-to-end Tracing gains its popularity gradually... TABLE 1 T imeline 2002 • Pinpoint 2004 • Magpie, SDI 2005 • Causeway 2006 • Pip, Stardust 2007 • X-Trace 2010 • Google Dapper 2012 • Zipkin, HTrace 2013 • Node.js CLS 2014 • Apple Activity Tracing, Blkin 2015 • AppNeta, AppDynamics, NewRelic, OSProfiler 2017 • ..., Twitter, Prezi, SoundCloud, HDFS, HBase, Accumulo, Phoenix, Baidu, Neflit, Pivotal, Coursera, Census (Google), Canopy (Facebook), Jaeger (Uber), ... 6

End-to-End Tracing Systems Service Model T o distinguish tracing systems: 7

End-to-End Tracing Systems Service Model T o distinguish tracing systems: • On-demand (Rudimentary) 7

End-to-End Tracing Systems Service Model T o distinguish tracing systems: • On-demand (Rudimentary) • Be always on (Smart Sampling) 7

End-to-End Tracing Systems Service Model T o distinguish tracing systems: • On-demand (Rudimentary) • Be always on (Smart Sampling) • Collect trace data asynchronously 7

End-to-End Tracing Systems Service Model T o distinguish tracing systems: • On-demand (Rudimentary) • Be always on (Smart Sampling) • Collect trace data asynchronously • DAG-based model to represent events 7

End-to-End Tracing Systems Service Model T o distinguish tracing systems: • On-demand (Rudimentary) • Be always on (Smart Sampling) • Collect trace data asynchronously • DAG-based model to represent events • Logical clock support 7

Comparing End-to-End Tracing Systems Table 2: C omparing end-to-end tracing systems features between Jaeger, Zipkin, Pivot Tracing, Dapper, Canopy, OSProfiler and Blkin. Systems Can Be Applied to Rudimentary Features Needed to Be Always on Advanced Features On-demand Sampling Async. Collect. DAG-based Model Interval Tree Clock Jaeger Tracing Broadly (K8s, OpenShift) ✗ � � ✗ ✗ Zipkin Tracing Broadly ✗ � � ✗ ✗ Pivot Tracing Hadoop/Java based systems ✗ � � � � Dapper N/A ✗ � � ✗ ✗ Canopy N/A ✗ � � � ✗ OSProfiler Blkin 8

Comparing End-to-End Tracing Systems Table 2: C omparing end-to-end tracing systems features between Jaeger, Zipkin, Pivot Tracing, Dapper, Canopy, OSProfiler and Blkin. Systems Can Be Applied to Rudimentary Features Needed to Be Always on Advanced Features On-demand Sampling Async. Collect. DAG-based Model Interval Tree Clock Jaeger Tracing Broadly (K8s, OpenShift) ✗ � � ✗ ✗ Zipkin Tracing Broadly ✗ � � ✗ ✗ Pivot Tracing Hadoop/Java based systems ✗ � � � � Dapper N/A ✗ � � ✗ ✗ Canopy N/A ✗ � � � ✗ OSProfiler OpenStack � ✗ ✗ ✗ ✗ Blkin Ceph � ✗ ✗ ✗ ✗ 8

Approaches for Enabling Sophisticated Tracing in OpenStack

Jaeger vs OSProfiler J aeger Tracing DISADVANTAGES ADVANTA GES • Doesn’t support • Support smart DAG-based model sampling • Doesn’t use advanced • Support collecting logical clock as the trace data async. metadata 9

Jaeger vs OSProfiler OSP rofiler DISADVANTAGES • Doesn’t have sampling ADVANTAGES • Doesn’t collect trace data • Rudimentary on-demand asynchronously tracing • Doesn’t support • Already adopt by DAG-based model OpenStack and have • Doesn’t use advanced instrumentation logical clock as the metadata 9

Jaeger vs OSProfiler OSP rofiler DISADVANTAGES Doesn’t have sampling • ADVANTAGES • Doesn’t collect trace data • Rudimentary on-demand asynchronously tracing • Doesn’t support • Already adopt by DAG-based model OpenStack and have • Doesn’t use advanced instrumentation logical clock as the metadata 9

Jaeger vs OSProfiler OSP rofiler with Jaeger Tracing ADVANTAGES DISADVANTAGES • Rudimentary on-demand • Doesn’t have sampling tracing • Doesn’t collect trace data • Already adopt by asynchronously OpenStack and have • Doesn’t support instrumentation DAG-based model • Doesn’t use advanced logical clock as the metadata 9

Jaeger vs OSProfiler OSP rofiler with Jaeger Tracing ADVANTAGES DISADVANTAGES • Rudimentary on-demand • Doesn’t have sampling tracing • Doesn’t collect trace data • Already adopt by asynchronously OpenStack and have Doesn’t support • instrumentation DAG-based model • Modifications we done • Doesn’t use advanced can be directly other logical clock as the Jaeger instrumented metadata systems 9

Feasibility K ey Challenges: Trace Metadata/OSProfiler library change • Implement CONTEXT generation using Jaeger • Implement CONTEXT propagation using Jaeger Trace Points/OpenStack instrumentation • All of the instrumentation will be able to be reused 2 Backend side • Need to deploy Backend/Collector for Jaeger Tracing 2 Modifying instrumentation for the purpose of our research is orthogonal. 10

Feasibility K ey Challenges: Trace Metadata/OSProfiler library change • Implement CONTEXT generation using Jaeger • Implement CONTEXT propagation using Jaeger Trace Points/OpenStack instrumentation • All of the instrumentation will be able to be reused 2 � Backend side • Need to deploy Backend/Collector for Jaeger Tracing � 2 Modifying instrumentation for the purpose of our research is orthogonal. 10

Feasibility Definition (Context) C ontext is an abstraction of the metadata so that it is easier to interact with (injecting/extracting a trace to/from). Example Implementation // Context holds the basic metadata. type Context struct { TraceID uint64 SpanID uint64 Sampled bool Baggage map[string]string // initialized on first use } 11

Feasibility: Context Generation CONTEXT generation : All of the modification will be done in OSProfiler library 3 • The span context generation will be done using Jaeger to substitute the OSProfiler implementation. 3 In OpenStack developers instrument their codebase using functionalities implemented in OSProfiler library. 12

Feasibility: Context Propagation CONTEXT pr opagation : OpenStack Instrumentation side • REST API Transform the metadata propagation in OpenStack clients to propagate Jaeger metadata. We might only need to change OSProfiler library. • RPC API Need to implement helper functions for metadata propagation RPC. We might need to modify component codebase depends on the RCP is handled in different components. OSProfiler Library side • Need to deploy Backend/Collector for Jaeger 12 Tracing

Status Update CONTEXT generation : • A talk during 2017 OpenStack Sydney Summit demonstrates how easy to plainly record all the OSProfiler tracing information in Jaeger. ( i.e. Context generation is done in OSProfiler) • Additionally we need to generate context using Jaeger tracing. CONTEXT propagation : • Will begin to look at ways to enforce metadata propagation in OpenStack RPC API and REST API 13

Jaeger Tracing Approach

Enhancing End-to-End Tracing Systems for Automated Performance - PowerPoint PPT Presentation

Enhancing End-to-End Tracing Systems for Automated Performance Debugging in Distributed Systems Jethro S. Sun January 23, 2018 MassOpenCloud Research Group 1 Introduction A Sad Story ... 2 A Sad Story ... A distributed system is one in

Advanced Ray Tracing 1 2/8/2006 Distributed Ray Tracing Distributed ray tracing is an

Computer Graphics - Ray-Tracing II - Hendrik Lensch Computer Graphics WS07/08 Ray Tracing II

1 minute Path tracing Bidirectional path tracing Progressive photon mapping 1 minute

MIT 6.837 - Ray Tracing Ray Tracing MIT EECS 6.837 Most slides are taken from Frdo Durand and

Advanced Ray Tracing Stochastic ray tracing: distribute rays stochastically across pixel

61A Extra Lecture 9 Announcements Pixels (Demo) Ray Tracing Ray Tracing A technique for

Computer Graphics - Ray Tracing I - Hendrik Lensch Computer Graphics WS07/08 Ray Tracing I

Knowledge Tracing Machines: Factorization Machines for Knowledge Tracing Jill-Jnn Vie Hisashi

Introduction to Path Tracing Marc Sunet Table of contents From Ray Tracing to Path Tracing The

Ray Tracing 1 Ray Tracing Ray Tracing kills two birds with one stone: Solves the Hidden

Tracing with Perf tools Namhyung Kim 2013-11-13 Wed Namhyung Kim Tracing with Perf tools

Low-Overhead System Tracing With eBPF Akshay Kapoor DevOps Engineer @ SAP Labs May 2018

Overview FDA Food Safety FDA Food Safety Modernization Act & Product Tracing Product Tracing

Accelerating Sphere Tracing Csaba Blint , Gbor Valasek Etvs Lornd University, Hungary

Relativistic Ray Tracing in Julia Ryan McKinnon November 30, 2015 Introduction Ray tracing is

COVID-19 Contact Tracing in Schools Louisiana Department of Health Office of Public Health

Quiz Define linear combination and give two examples using the 3-vectors v 1 = [1 , 1 , 0] , v

Activity Suppose you have available a procedure is independent(L) , which takes a list L of Vec s

Computational Complexity (Continued) 15-150 1 Story so far We need to model the efficiency

Span-based Localizing Network for Natural Language Video Localization Hao Zhang 1,2 , Aixin Sun 1

Cable-Stayed Bridges (Schrgseilbrcken) 26.05.2020 ETH Zrich | Chair of Concrete

nineteen wrought-iron rivets steel construction: trusses, decks & plate girders

4.3 Linearly Independent Sets McDonald Fall 2018, MATH 2210Q, 4.3 Slides 4.3 Homework : Read

Using Kieker with Elastic APM: An Experience Report Valentin Seifermann Duan Okanovi SSP

Enhancing End-to-End Tracing Systems for Automated Performance - PowerPoint PPT Presentation

Enhancing End-to-End Tracing Systems for Automated Performance Debugging in Distributed Systems Jethro S. Sun January 23, 2018 MassOpenCloud Research Group 1 Introduction A Sad Story ... 2 A Sad Story ... A distributed system is one in

Advanced Ray Tracing 1 2/8/2006 Distributed Ray Tracing Distributed ray tracing is an

Computer Graphics - Ray-Tracing II - Hendrik Lensch Computer Graphics WS07/08 Ray Tracing II

1 minute Path tracing Bidirectional path tracing Progressive photon mapping 1 minute

MIT 6.837 - Ray Tracing Ray Tracing MIT EECS 6.837 Most slides are taken from Frdo Durand and

Advanced Ray Tracing Stochastic ray tracing: distribute rays stochastically across pixel

61A Extra Lecture 9 Announcements Pixels (Demo) Ray Tracing Ray Tracing A technique for

Computer Graphics - Ray Tracing I - Hendrik Lensch Computer Graphics WS07/08 Ray Tracing I

Knowledge Tracing Machines: Factorization Machines for Knowledge Tracing Jill-Jnn Vie Hisashi

Introduction to Path Tracing Marc Sunet Table of contents From Ray Tracing to Path Tracing The

Ray Tracing 1 Ray Tracing Ray Tracing kills two birds with one stone: Solves the Hidden

Tracing with Perf tools Namhyung Kim 2013-11-13 Wed Namhyung Kim Tracing with Perf tools

Low-Overhead System Tracing With eBPF Akshay Kapoor DevOps Engineer @ SAP Labs May 2018

Overview FDA Food Safety FDA Food Safety Modernization Act &amp; Product Tracing Product Tracing

Accelerating Sphere Tracing Csaba Blint , Gbor Valasek Etvs Lornd University, Hungary

Relativistic Ray Tracing in Julia Ryan McKinnon November 30, 2015 Introduction Ray tracing is

COVID-19 Contact Tracing in Schools Louisiana Department of Health Office of Public Health

Quiz Define linear combination and give two examples using the 3-vectors v 1 = [1 , 1 , 0] , v

Activity Suppose you have available a procedure is independent(L) , which takes a list L of Vec s

Computational Complexity (Continued) 15-150 1 Story so far We need to model the efficiency

Span-based Localizing Network for Natural Language Video Localization Hao Zhang 1,2 , Aixin Sun 1

Cable-Stayed Bridges (Schrgseilbrcken) 26.05.2020 ETH Zrich | Chair of Concrete

nineteen wrought-iron rivets steel construction: trusses, decks &amp; plate girders

4.3 Linearly Independent Sets McDonald Fall 2018, MATH 2210Q, 4.3 Slides 4.3 Homework : Read

Using Kieker with Elastic APM: An Experience Report Valentin Seifermann Duan Okanovi SSP

Overview FDA Food Safety FDA Food Safety Modernization Act & Product Tracing Product Tracing

nineteen wrought-iron rivets steel construction: trusses, decks & plate girders