SLIDE 1

An Automated, Cross-Layer Instrumentation Framework for Diagnosing Performance Problems in Distributed Applications

By Emre Ates¹, Lily Sturmann², Mert Toslali¹, Orran Krieger¹, Richard Megginson², Ayse K. Coskun¹, and Raja Sambasivan³

¹Boston University; ²Red Hat, Inc.; ³Tufts University

ACM Symposium on Cloud Computing, November 21, 2019, Santa Cruz, CA

SLIDE 2

Debugging Distributed Systems

Challenging: Where is the problem? It could be in:

  • One of many components
  • One of several stack levels
      • VM vs. hypervisor
      • Application vs. kernel
  • Inter-component interactions

SLIDE 3

Today’s Debugging Methods

Different problems benefit from different instrumentation points.

You can’t instrument everything: too much overhead, too much data.

[Figure: system diagram with a legend marking instrumentation points and a label for the instrumentation data they produce]

SLIDE 4

Today’s Debugging Cycle

  • Gather data from current instrumentation
  • Use data to guess where to add instrumentation
  • Able to identify problem source? Usually no…

SLIDE 5

Our Research Question

Can we create a continuously-running instrumentation framework for production distributed systems that will automatically explore instrumentation choices across stack layers for a newly-observed performance problem?

  • Gather data from current instrumentation
  • Use data to guess where to add instrumentation
  • Able to identify problem source? Sometimes yes!! Report to developers

SLIDE 6

Key insight: Performance variation indicates where to instrument

  • If requests that are expected to perform similarly do not, there is something unknown about their workflows, which could represent performance problems.
  • Localizing the source of variation gives insight into where instrumentation is needed.

[Figure: timelines of three requests (a READ request from storage, Request #2, Request #3), each showing trace points clientstart, LBstart, metadata read, LBend, clientend; axis labels: time, hierarchy]

SLIDE 7

Key Enabler: Workflow-centric Tracing

  • Used to get workflows from running systems
  • Works by propagating common context with requests (e.g., request ID)
  • Trace points record important events with context
  • Granularity is determined by instrumentation in the system

[Figure: two request timelines showing trace points clientstart, LBstart, metadata read, LBend, clientend; axis labels: time, hierarchy]
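
A minimal Python sketch of the idea on this slide, assuming a single process stands in for the distributed components. The names `RequestContext` and `trace_point` are hypothetical and are not part of any tracing framework named in the talk. The context object carries the request ID through every call, and each trace point records an event tagged with that ID.

```python
import itertools
import time
from dataclasses import dataclass, field

_request_ids = itertools.count(1)


@dataclass
class RequestContext:
    """Hypothetical context that travels with one request."""
    request_id: int
    # A real system would ship events to a trace store; a list keeps
    # this sketch self-contained.
    events: list = field(default_factory=list)


def trace_point(ctx, name):
    """Record an event together with the propagated request ID."""
    ctx.events.append((ctx.request_id, name, time.monotonic()))


def metadata_service(ctx):
    trace_point(ctx, "metadata read")


def load_balancer(ctx):
    trace_point(ctx, "LBstart")
    metadata_service(ctx)  # the context is passed along with the call
    trace_point(ctx, "LBend")


def client_read():
    ctx = RequestContext(request_id=next(_request_ids))
    trace_point(ctx, "clientstart")
    load_balancer(ctx)
    trace_point(ctx, "clientend")
    return ctx


ctx = client_read()
workflow = [name for _, name, _ in ctx.events]
print(workflow)
```

Adding or removing `trace_point` calls changes the granularity of the recovered workflow, which is the knob the rest of the talk is about.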

SLIDE 8

Vision of Pythia

SLIDE 18

Challenge 1: Grouping

SLIDE 19

Which Requests Are Expected to Perform Similarly?

  • Depends on the distributed application being debugged
  • Generally applicable: requests of the same type that access the same services
  • Additional app-specific details could be incorporated

[Figure: example request workflows grouped by expectation]
  • Expectation 1: Read requests (clientstart, LBstart, metadata read, LBend, clientend)
  • Expectation 2: Auth requests (clientstart, authstart, authend, clientend)
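
A minimal Python sketch of the generally applicable grouping rule, assuming each request is a dictionary with a type, the list of trace points it hit, and a latency. The field names and the latency values are made up for illustration.

```python
from collections import defaultdict

# Hypothetical workflows matching the two expectations on the slide.
READ = ["clientstart", "LBstart", "metadata read", "LBend", "clientend"]
AUTH = ["clientstart", "authstart", "authend", "clientend"]


def group_key(request):
    """Same request type + same trace points => same expectation group."""
    return (request["type"], tuple(request["trace_points"]))


def group_requests(requests):
    """Map each group key to the latencies of the requests in that group."""
    groups = defaultdict(list)
    for request in requests:
        groups[group_key(request)].append(request["latency_ms"])
    return dict(groups)


requests = [
    {"type": "read", "trace_points": READ, "latency_ms": 12.0},
    {"type": "auth", "trace_points": AUTH, "latency_ms": 3.0},
    {"type": "read", "trace_points": READ, "latency_ms": 48.0},
    {"type": "read", "trace_points": READ, "latency_ms": 13.0},
    {"type": "auth", "trace_points": AUTH, "latency_ms": 3.5},
]

groups = group_requests(requests)
for key, latencies in groups.items():
    print(key[0], latencies)
```

App-specific details could be incorporated by extending `group_key` with extra fields.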

SLIDE 20

Challenge 2: Localization

SLIDE 21

Localizing Performance Variations

  • Order groups and edges within groups.
  • How to quantify performance variation?
  • Multiple metrics to measure variation
      • Variance/standard deviation
      • Coefficient of variation (CoV = std. dev. / mean)
          • Intuitive
          • Very small mean -> very high CoV
      • Multimodality
          • Multiple modes of operation

[Figure: histograms of # reqs completed vs. time, with a marker for the acceptable std dev threshold]
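
A minimal Python sketch of two of the metrics listed above, standard deviation and coefficient of variation, used to order groups. The group names and latency values are made up. The same functions could be applied to per-edge latencies within a group. Multimodality detection is not covered here.

```python
from statistics import mean, pstdev


def variation_metrics(latencies):
    """Return mean, standard deviation, and CoV for one group."""
    mu = mean(latencies)
    sigma = pstdev(latencies)
    cov = sigma / mu if mu > 0 else float("inf")
    return {"mean": mu, "std": sigma, "cov": cov}


def rank_by_variation(groups, metric):
    """Order group names from most to least variation under one metric."""
    scored = {name: variation_metrics(vals) for name, vals in groups.items()}
    return sorted(scored, key=lambda name: scored[name][metric], reverse=True)


# Hypothetical latencies in milliseconds.
groups = {
    "server create": [900.0, 2100.0, 3300.0, 950.0],
    "volume list": [40.0, 42.0, 41.0, 43.0],
    "tiny op": [0.1, 0.5, 0.1, 0.5],
}

by_std = rank_by_variation(groups, "std")
by_cov = rank_by_variation(groups, "cov")
print("by std:", by_std)
print("by CoV:", by_cov)
```

With this data, the group with the very small mean ranks last by standard deviation but first by CoV, which is the caveat noted on the slide.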

SLIDE 22

Challenge 3: What to enable

SLIDE 23

Search Space

  • How to represent all of the instrumentation that Pythia can control?
  • How to find relevant next trace points after the problem is narrowed down?
  • Trade-offs:
      • Quick to access
      • Compact
      • Limit spurious instrumentation choices

Search Strategies

  • How to explore the search space?
      • Quickly converge on problems
      • Keep instrumentation overhead low
      • Reduce time-to-solution
  • Many possible options
      • Pluggable design
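
One way a pluggable design might look, sketched in Python. The interface name `SearchStrategy`, its `choose` method, and both example strategies are hypothetical and are not taken from Pythia. The point is only that the exploration loop depends on an interface, so strategies can be swapped.

```python
from typing import Protocol, Sequence, Tuple

Context = Tuple[str, ...]  # a calling context, e.g. ("nova", "neutron")


class SearchStrategy(Protocol):
    """Hypothetical plug-in interface: pick which trace points to enable."""

    def choose(self, candidates: Sequence[Context]) -> list:
        ...


class EnableAll:
    """Enable every candidate trace point (fast convergence, more overhead)."""

    def choose(self, candidates):
        return list(candidates)


class EnableWithinBudget:
    """Enable at most `budget` candidates (lower overhead, slower search)."""

    def __init__(self, budget: int):
        self.budget = budget

    def choose(self, candidates):
        return list(candidates)[: self.budget]


def next_trace_points(strategy: SearchStrategy, candidates):
    """The exploration loop only talks to the strategy interface."""
    return strategy.choose(candidates)


candidates = [("nova", "keystone"), ("nova", "neutron"), ("nova", "glance")]
all_points = next_trace_points(EnableAll(), candidates)
limited = next_trace_points(EnableWithinBudget(2), candidates)
print(all_points)
print(limited)
```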

SLIDE 24

Search Space: Calling Context Trees

  • One node for each calling context, i.e., stack trace
  • Leverages the hierarchy of distributed system architecture
  • Construction: offline profiling
  • Trade-offs
      • Quick to access
      • Compact
      • Limit spurious instrumentation choices

[Figure: an offline-collected trace (start/end events for nova, keystone, neutron, glance) and the resulting search space (a tree of nova, keystone, neutron, glance nodes)]
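
A minimal Python sketch of building a calling context tree from an offline trace, assuming the trace is a list of (service, "start" or "end") events. The event order below is a hypothetical example using the service names from the figure. Each node is keyed by its full calling context, so keystone reached through neutron is a different node from keystone reached directly from nova.

```python
from collections import defaultdict


def build_cct(trace):
    """Return {calling context: sorted child contexts} for a start/end trace."""
    children = defaultdict(set)
    stack = []
    for service, kind in trace:
        if kind == "start":
            parent = tuple(stack)
            stack.append(service)
            # A set keeps the tree compact: repeated calls made in the
            # same context collapse into a single node.
            children[parent].add(tuple(stack))
        elif kind == "end":
            if not stack or stack[-1] != service:
                raise ValueError(f"unbalanced trace at {service} end")
            stack.pop()
        else:
            raise ValueError(f"unknown event kind: {kind}")
    return {ctx: sorted(kids) for ctx, kids in children.items()}


# Hypothetical offline-collected trace.
trace = [
    ("nova", "start"),
    ("keystone", "start"), ("keystone", "end"),
    ("neutron", "start"),
    ("keystone", "start"), ("keystone", "end"),
    ("neutron", "end"),
    ("glance", "start"),
    ("keystone", "start"), ("keystone", "end"),
    ("glance", "end"),
    ("nova", "end"),
]

cct = build_cct(trace)
for context, kids in cct.items():
    print(context, "->", kids)
```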

SLIDE 25

Search Strategy: Hierarchical Search

  • One of many choices
  • Search trace point choices top-down
  • Very compatible with Calling Context Trees

[Figure: an offline-collected trace (start/end events for nova, keystone, neutron, glance) and the resulting search space (a tree of nova, keystone, neutron, glance nodes)]
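
A minimal Python sketch of a top-down search over a calling context tree. It assumes the tree is a dictionary from a calling context to its child contexts, and that `measure_variation` is a hypothetical callback returning the latency standard deviation observed for a context once its trace points are enabled. The tree, the measurements, and the threshold are invented for illustration, and this is one possible reading of "top-down", not the talk's implementation.

```python
def hierarchical_search(cct, measure_variation, threshold):
    """Follow the child with the most variation until none exceeds threshold.

    Returns the path of contexts followed and the contexts that had to be
    instrumented along the way.
    """
    node, path, measured = (), [], []
    while True:
        candidates = []
        for child in cct.get(node, []):
            measured.append(child)  # trace points enabled for this child
            variation = measure_variation(child)
            if variation > threshold:
                candidates.append((variation, child))
        if not candidates:
            return path, measured
        _, node = max(candidates)
        path.append(node)


# Hypothetical search space.
cct = {
    (): [("nova",)],
    ("nova",): [("nova", "glance"), ("nova", "keystone"), ("nova", "neutron")],
    ("nova", "neutron"): [("nova", "neutron", "keystone")],
    ("nova", "glance"): [("nova", "glance", "keystone")],
}

# Hypothetical measurements (std dev in ms) for the contexts that get enabled.
observed_std_ms = {
    ("nova",): 50.0,
    ("nova", "glance"): 3.0,
    ("nova", "keystone"): 2.0,
    ("nova", "neutron"): 40.0,
    ("nova", "neutron", "keystone"): 1.0,
}

path, measured = hierarchical_search(
    cct, observed_std_ms.__getitem__, threshold=5.0
)
print("path:", path)
print("instrumented:", measured)
```

In this example the search stops at neutron, and the subtree under glance is never instrumented, which is how a top-down strategy keeps overhead low.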

SLIDE 30

Explaining Variation Using Key-Value Pairs in Trace Points

  • Canonical Correlation Analysis (CCA)
  • Used to find important key-value pairs in the traces

$$a' = \arg\max_{a} \operatorname{corr}(a^{T} X, Y)$$

where

  • $Y = (t_1, t_2, \dots, t_n)$: the request durations
  • $X = (x_1, x_2, \dots, x_m)$: the collected variables
  • $a' \in \mathbb{R}^{m}$: the coefficients indicating the most correlated variables
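
A simplified Python stand-in for this step, not full CCA: it scores each collected variable separately by the absolute Pearson correlation between that variable and the request durations, which is what the objective above reduces to when X holds a single variable. The variable names and values are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_variables(variables, durations):
    """Sort variables by |correlation with duration|, strongest first."""
    scores = {
        name: abs(pearson(values, durations))
        for name, values in variables.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical key-value pairs recorded at trace points, one value per request.
variables = {
    "queue_length": [0, 1, 2, 0, 1, 2],
    "payload_kb": [4, 4, 8, 8, 4, 8],
    "host_id": [1, 2, 1, 2, 1, 2],
}
durations = [1.0, 2.1, 3.0, 1.1, 2.0, 3.2]  # seconds, hypothetical

ranking = rank_variables(variables, durations)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Unlike this per-variable scoring, CCA weighs the variables jointly, so it can account for variables that are correlated with each other.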

SLIDE 31

Vision of Pythia – Completing the Cycle

SLIDE 32

Validating Pythia’s Approach

  • Can performance variation guide instrumentation choices?
  • Run exploratory analysis for OpenStack
      • Start with default instrumentation
      • Localize performance variation
      • Find next instrumentation to enable
      • Use CCA for finding important key-value pairs

SLIDE 33

Validating Pythia’s Approach - Setup

  • OpenStack: an open source cloud platform, written in Python
  • OSProfiler: OpenStack’s tracing framework
      • We implemented controllable trace points
      • Store more variables such as queue lengths
  • Running on MOC
      • 8 vCPUs, 32 GB memory
  • Workload
      • 9 request types: create/list/delete for VMs, floating IPs, and volumes
      • Simultaneously execute 20 workloads

SLIDE 34

Step 1: Grouping & Localization

  • Collect latency values for each request
  • Grouping: same request type with same trace points
  • Server create requests have unusually high variance and latency
  • Pythia would focus on this group

SLIDE 35

Step 2: Enable additional instrumentation

  • Pythia localizes variation to a semaphore in server create
  • After adding the queue length variable to the traces, we see 3 distinct latency groups
      • Groups with different queue lengths
      • CCA also finds this variable important

TAKEAWAY: Pythia’s approach identifies the instrumentation needed to debug this problem

SLIDE 36

Open Questions

  • What is the ideal structure of the search space? What are possible search strategies? What are the trade-offs?
  • How can we formulate and choose an “instrumentation budget”?
  • How granular should the performance expectations be?
  • How can we integrate multiple stack layers into Pythia?

SLIDE 37

More in the paper

  • Pythia architecture
  • Problem scenarios
  • Instrumentation plane requirements
  • Cross-layer instrumentation

SLIDE 38

Concluding Remarks

  • It is very difficult to debug distributed systems
  • Automating instrumentation choice is a promising solution to overcome this difficulty

More info in our paper (bu.edu/peaclab/publications)
Please send feedback to ates@bu.edu or join us at the poster session