 
1 An Automated, Cross-Layer Instrumentation Framework for Diagnosing Performance Problems in Distributed Applications
By Emre Ates (1), Lily Sturmann (2), Mert Toslali (1), Orran Krieger (1), Richard Megginson (2), Ayse K. Coskun (1), and Raja Sambasivan (3)
(1) Boston University; (2) RedHat, Inc.; (3) Tufts University
ACM Symposium on Cloud Computing, November 21, 2019, Santa Cruz, CA
2 Debugging Distributed Systems
Challenging: Where is the problem? It could be in:
● One of many components
● One of several stack levels
    ● VM vs. hypervisor
    ● Application vs. kernel
● Inter-component interactions
3 Today’s Debugging Methods
[Figure: instrumentation data; each marker = an instrumentation point]
● Different problems benefit from different instrumentation points.
● You can’t instrument everything: too much overhead, too much data.
4 Today’s Debugging Cycle
[Figure: cycle] Gather data from current instrumentation -> Able to identify problem source? Usually no… -> Use data to guess where to add instrumentation -> (repeat)
5 Our Research Question
[Figure: cycle] Gather data from current instrumentation -> Able to identify problem source? Sometimes yes! -> Report to developers. Otherwise: use data to guess where to add instrumentation -> (repeat)
Can we create a continuously running instrumentation framework for production distributed systems that automatically explores instrumentation choices across stack layers for a newly observed performance problem?
6 Key insight: Performance variation indicates where to instrument
● If requests that are expected to perform similarly do not:
    ● There is something unknown about their workflows, which could represent performance problems
    ● Localizing the source of variation gives insight into where instrumentation is needed
[Figure: workflows of three READ requests from a storage client (axes: time, hierarchy), with trace points client start/end, LB start/end, and metadata read.]
7 Key Enabler: Workflow-centric Tracing
● Used to get workflows from running systems
● Works by propagating common context with requests (e.g., request ID)
● Trace points record important events with context
● Granularity is determined by the instrumentation in the system
[Figure: two request workflows (axes: time, hierarchy), with trace points client start/end, LB start/end, and metadata read.]
8 Vision of Pythia
18 Challenge 1: Grouping
19 Which Requests are Expected to Perform Similarly?
● Depends on the distributed application being debugged
● Generally applicable: requests of the same type that access the same services
● Additional app-specific details could be incorporated
[Figure: two groups of request workflows. Expectation 1: Read requests (client start/end, LB start/end, metadata read). Expectation 2: Auth requests (client start/end, auth start/end).]
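A minimal sketch of the generally applicable grouping rule, assuming each trace is a dictionary with a request `type` and an ordered list of `trace_points`. That record layout is an assumption made for illustration, not a format defined in the talk.

```python
from collections import defaultdict


def group_requests(traces):
    """Group traces by (request type, ordered trace point names).

    Requests in the same group are the ones expected to perform similarly.
    The trace dictionary layout is hypothetical.
    """
    groups = defaultdict(list)
    for trace in traces:
        key = (trace["type"], tuple(trace["trace_points"]))
        groups[key].append(trace)
    return dict(groups)
```

App-specific details could be folded in by extending the key, for example with a payload size bucket, at the cost of smaller groups.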
20 Challenge 2: Localization
21 Localizing Performance Variations
● Order groups and edges within groups.
● How to quantify performance variation?
● Multiple metrics to measure variation:
    ● Variance/standard deviation
    ● Coefficient of variation (std. dev. / mean)
        ● Intuitive
        ● Very small mean -> very high CoV
    ● Multimodality
        ● Multiple modes of operation
[Figure: two histograms of # reqs completed vs. time. The first marks an acceptable std dev threshold; the second shows multiple modes.]
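The first two metrics can be compared with a short sketch. The function names and the latency values used to exercise them are made up for illustration, and multimodality detection is left out.

```python
import statistics


def variation_metrics(latencies):
    """Return mean, sample standard deviation, and coefficient of variation."""
    mean = statistics.fmean(latencies)
    std = statistics.stdev(latencies)
    cov = std / mean if mean > 0 else float("inf")
    return {"mean": mean, "std": std, "cov": cov}


def rank_groups(groups, metric="cov"):
    """Order group names from most to least variable under the chosen metric.

    groups: dict mapping a group name to a list of request latencies.
    """
    scores = {name: variation_metrics(lat)[metric] for name, lat in groups.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

With a group whose latencies are all a few milliseconds, the standard deviation is tiny but the CoV can still be large, so the two metrics may rank groups differently. This is the "very small mean -> very high CoV" caveat from the slide.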
22 Challenge 3: What to enable
23 Search Space and Search Strategies
Search Space
● How to represent all of the instrumentation that Pythia can control?
● How to find relevant next trace points after the problem is narrowed down?
● Trade-offs:
    ● Quick to access
    ● Compact
    ● Limit spurious instrumentation choices
Search Strategies
● How to explore the search space?
    ● Quickly converge on problems
    ● Keep instrumentation overhead low
    ● Reduce time-to-solution
● Many possible options
    ● Pluggable design
24 Search Space: Calling Context Trees
● One node for each calling context, i.e., stack trace
● Leverages the hierarchy of distributed system architecture
● Construction: offline profiling
● Trade-offs
    ● Quick to access
    ● Compact
    ● Limit spurious instrumentation choices
[Figure: an offline-collected trace of start/end events for nova, keystone, glance, and neutron, shown beside the resulting search space, a calling context tree over the same services.]
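One way to picture the offline construction is to fold traces of start/end events into a tree with one node per calling context. The sketch below assumes each trace is a list of `(name, "start" | "end")` pairs; the class and function names are hypothetical.

```python
class CCTNode:
    """One calling context: a name plus the contexts reached from it."""

    def __init__(self, name):
        self.name = name
        self.children = {}  # child name -> CCTNode

    def child(self, name):
        return self.children.setdefault(name, CCTNode(name))


def build_cct(traces):
    """Merge offline-collected start/end traces into one calling context tree."""
    root = CCTNode("root")
    for events in traces:
        stack = [root]
        for name, kind in events:
            if kind == "start":
                # Reuse the node if this context was already seen (keeps the tree compact).
                stack.append(stack[-1].child(name))
            elif kind == "end":
                if len(stack) == 1 or stack[-1].name != name:
                    raise ValueError(f"unbalanced end event for {name!r}")
                stack.pop()
            else:
                raise ValueError(f"unknown event kind {kind!r}")
    return root


def contexts(node, prefix=()):
    """Yield every calling context as a tuple of names from the top down."""
    for child in node.children.values():
        path = prefix + (child.name,)
        yield path
        yield from contexts(child, path)
```

Because repeated contexts collapse into a single node, the tree stays compact however many traces are merged, and the children of a node are exactly the candidate trace points to consider next.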
25 Search Strategy: Hierarchical Search
● One of many choices
● Search trace point choices top-down
● Very compatible with Calling Context Trees
[Figure: an offline-collected trace and the search space tree over nova, keystone, glance, and neutron, explored top-down.]
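A top-down search over such a tree could look roughly like the sketch below. It is one possible reading of the strategy, not Pythia's implementation: `measure_variation` is a hypothetical callback that, in a running system, would enable a trace point, collect traces, and return a variation score such as CoV, and the threshold is an assumed tuning parameter.

```python
def hierarchical_search(tree, root, measure_variation, threshold):
    """Descend from the root toward the context with the highest variation.

    tree: dict mapping a context to the list of its child contexts.
    Returns (context where the search stopped, trace points enabled so far).
    """
    enabled = [root]
    current = root
    while True:
        children = tree.get(current, [])
        if not children:
            return current, enabled  # reached a leaf of the search space
        enabled.extend(children)  # enable the next level of trace points
        scores = {child: measure_variation(child) for child in children}
        worst = max(scores, key=scores.get)
        if scores[worst] < threshold:
            return current, enabled  # no child accounts for the variation
        current = worst
```

Only the children of the currently suspected context are enabled at each step, which keeps the number of active trace points, and hence the overhead, bounded by the depth and fan-out of the path taken.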
30 Explaining Variation Using Key-Value Pairs in Trace Points
● Canonical Correlation Analysis (CCA)
● Used to find important key-value pairs in the traces
$a' = \arg\max_{a} \operatorname{corr}(a^{T} X, Y)$
$Y = (t_1, t_2, \ldots, t_n)$: the request durations
$X = (x_1, x_2, \ldots, x_m)$: the collected variables
$a' \in \mathbb{R}^{m}$: the coefficients indicating the most correlated variables
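When the second set of variables is a single column of request durations, the direction that maximizes the correlation has a closed form, proportional to the inverse covariance of the collected variables times their covariance with the durations (the least-squares direction). The sketch below uses that special case with the standard library only. It is an illustration under that assumption, not necessarily how Pythia computes CCA, and the function names are hypothetical.

```python
def _center(column):
    mean = sum(column) / len(column)
    return [v - mean for v in column]


def _solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("collected variables are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def cca_coefficients(variables, durations):
    """Return unit-norm coefficients a maximizing corr(a^T X, Y) for 1-D Y.

    variables: dict mapping a variable name to its per-request values.
    durations: per-request durations, same length as each variable.
    """
    names = list(variables)
    cols = []
    for name in names:
        centered = _center(variables[name])
        scale = sum(v * v for v in centered) ** 0.5
        if scale == 0:
            raise ValueError(f"variable {name!r} is constant")
        # Standardize so coefficients are comparable across variables.
        cols.append([v / scale for v in centered])
    y = _center(durations)
    s_xx = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    s_xy = [sum(p * q for p, q in zip(ci, y)) for ci in cols]
    coeffs = _solve(s_xx, s_xy)
    norm = sum(c * c for c in coeffs) ** 0.5
    if norm == 0:
        raise ValueError("durations are uncorrelated with every variable")
    return {name: c / norm for name, c in zip(names, coeffs)}
```

Variables whose coefficients have the largest magnitude are the candidates to report as important key-value pairs; a coefficient near zero suggests the variable adds little once the others are accounted for.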
31 Vision of Pythia – Completing the Cycle
32 Validating Pythia’s Approach
● Can performance variation guide instrumentation choices?
● Run exploratory analysis for OpenStack
    ● Start with default instrumentation
    ● Localize performance variation
    ● Find next instrumentation to enable
    ● Use CCA for finding important key-value pairs
33 Validating Pythia’s Approach - Setup
● OpenStack: an open source cloud platform, written in Python
● OSProfiler: OpenStack’s tracing framework
    ● We implemented controllable trace points
    ● Store more variables such as queue lengths
● Running on MOC
    ● 8 vCPUs, 32 GB memory
● Workload
    ● 9 request types, VM/floating IP/volume create/list/delete
    ● Simultaneously execute 20 workloads
34 Step 1: Grouping & Localization
● Collect latency values for each request
● Grouping: same request type with same trace points
● Server create requests have unusually high variance and latency
● Pythia would focus on this group
35 Step 2: Enable additional instrumentation
● Pythia localizes variation into a semaphore in server create
● After adding the queue length variable to the traces, we see 3 distinct latency groups
● CCA also finds this variable important
[Figure: groups with different queue lengths]
TAKEAWAY: Pythia’s approach identifies the instrumentation needed to debug this problem
36 Open Questions
● What is the ideal structure of the search space? What are possible search strategies? What are the trade-offs?
● How can we formulate and choose an “instrumentation budget”?
● How granular should the performance expectations be?
● How can we integrate multiple stack layers into Pythia?
37 More in the paper
● Pythia architecture
● Problem scenarios
● Instrumentation plane requirements
● Cross-layer instrumentation
38 Concluding Remarks
● It is very difficult to debug distributed systems
● Automating instrumentation choice is a promising solution to overcome this difficulty
More info in our paper (bu.edu/peaclab/publications)
Please send feedback to ates@bu.edu or join us at the poster session