How To Keep Your Head Above Water While Detecting Errors (PowerPoint PPT presentation)


Slide 1/27

How To Keep Your Head Above Water While Detecting Errors

Ignacio Laguna, Fahad A. Arshad, David M. Grothe, Saurabh Bagchi

Dependable Computing System Lab, School of Electrical and Computer Engineering, Purdue University

USENIX / ACM / IFIP 10th International Middleware Conference

Slide 2/27

Impact of Failures in Internet Services

  • Internet services are expected to be running 24/7

– System downtime can cost $1 million / hour

(Source: Meta Group, 2002)

  • Service degradation is the most frequent problem

– Service is slower than usual, or almost unavailable
– Can be difficult to detect and diagnose

  • Internet-based applications are very large and dynamic

– Complexity increases as new components are added

Slide 3/27

Complexity of Internet-based Applications

Software components: Servlets, JavaBeans, EJBs

  • Each tier has multiple components
  • Components can be stateful

Slide 4/27

The Monitor Detection System (TDSC ’06)

[Diagram: the Monitor observing the system]

  • Non-intrusive: observes messages between components
  • Online: faster detection than offline approaches
  • Black-box detection: components are treated as black boxes

– No knowledge of components’ internals

PREVIOUS WORK

Slide 5/27

Stateful Rule-based Detection

Components of the Monitor: a Packet Capturer, a Finite State Machine (an event-based state transition model), and Normal-Behavior Rules.

Detection process:
1. A message is captured
2. The current application state is deduced
3. Rules are matched based on the deduced state
4. If a rule is not satisfied, an alarm is signaled
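The four-step loop can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the FSM transitions, message fields, and rules below are invented.

```python
# Minimal sketch of stateful rule-based detection (hypothetical
# transitions, message fields, and rules).

# Transition model: (current state, message type) -> next state
FSM = {
    ("S0", "login"): "S1",
    ("S1", "query_balance"): "S2",
    ("S1", "transfer"): "S3",
}

# Normal-behavior rules, keyed by the deduced state. Each rule
# returns True when the message looks normal.
RULES = {
    "S2": [lambda msg: msg["latency_ms"] < 200],
    "S3": [lambda msg: msg["amount"] >= 0],
}

def process(state, msg):
    """Steps 1-4: a message arrives, the state is deduced, the rules
    for that state are matched, and violated rules become alarms."""
    state = FSM.get((state, msg["type"]), state)                # step 2
    violated = [r for r in RULES.get(state, []) if not r(msg)]  # step 3
    return state, violated                                      # step 4: alarms

state, alarms = process("S0", {"type": "login", "latency_ms": 10})
state, alarms = process(state, {"type": "query_balance", "latency_ms": 500})
# The second message violates the latency rule, so one alarm is raised.
```

Because the components are black boxes, only the message stream feeds this loop, which is what makes the per-message cost (and hence the breaking point on the next slides) matter.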

PREVIOUS WORK

Slide 6/27

The Breaking Point at High Message Rates

[Figure: Detection Latency (msec) vs. Captured Packets/second (incoming message rate at the Monitor), with the breaking point marked]

  • After the breaking point, latency increases sharply
  • The true-alarm rate decreases because packets are dropped
  • A breaking point is expected in any stateful detection system

PREVIOUS WORK

Slide 7/27

Avoiding the Breaking Point: Random Sampling (SRDS ‘07)

  • Processing Load in Monitor

δ = R × C

R: Incoming message rate, C: Processing cost per message

  • Processing Load δ is reduced by reducing R

– Only a portion of incoming messages is processed
– n out of m messages are randomly sampled

  • Sampling is activated if R ≥ breaking point
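The sampling decision can be sketched as follows. The numeric values (the per-message cost C and the resulting breaking point) are made up for illustration; the slides give no concrete figures.

```python
import random

def should_process(rate, breaking_point, n, m):
    """Process every message while R is below the breaking point;
    above it, randomly sample n out of every m messages on average."""
    if rate < breaking_point:
        return True
    return random.random() < n / m

# Hypothetical figures: if the per-message processing cost C is 0.8 ms,
# the Monitor saturates near R = 1/C = 1250 messages/second, where the
# load delta = R * C reaches 1.
C = 0.0008                 # seconds per message (assumed)
breaking_point = 1.0 / C

random.seed(0)
kept = sum(should_process(2000, breaking_point, n=1, m=3)
           for _ in range(10_000))
# At R = 2000 msg/s, roughly a third of the 10,000 messages are kept.
```

Reducing R this way reduces the load δ = R × C linearly, at the price of the non-determinism discussed on the next slide.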

PREVIOUS WORK

Slide 8/27

The Non-Determinism Caused by Sampling

Definitions:

State Vector (SV): the state(s) of the application from the Monitor’s point of view (the deduced states)

Non-Determinism: the Monitor is no longer aware of the exact state the application is in (because of dropped messages)

[Diagram: a portion of a Finite State Machine, with messages m1–m5 leading from S0 toward states S1–S5]

Events in the Monitor:
(1) SV = { S0 }
(2) A message is dropped
(3) SV = { S1, S2 }
(4) A message is sampled
(5) The message is m5
(6) SV = { S5 }
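The SV bookkeeping in this timeline can be sketched directly. The transition table below is one assumed reading of the FSM fragment in the figure (S0 branches to S1 and S2, both of which can reach S5 via m5):

```python
# Assumed transitions for the FSM fragment shown on the slide.
FSM = {
    ("S0", "m1"): "S1", ("S0", "m2"): "S2",
    ("S1", "m3"): "S3", ("S1", "m5"): "S5",
    ("S2", "m4"): "S4", ("S2", "m5"): "S5",
}

def on_dropped(sv, fsm):
    """A message was dropped: the SV grows to every possible next state."""
    return {nxt for (s, _msg), nxt in fsm.items() if s in sv}

def on_sampled(sv, fsm, msg):
    """A message was sampled: keep only states reachable via that message."""
    return {nxt for (s, m), nxt in fsm.items() if s in sv and m == msg}

sv = {"S0"}
sv = on_dropped(sv, FSM)        # non-determinism: sv == {"S1", "S2"}
sv = on_sampled(sv, FSM, "m5")  # resolved again:   sv == {"S5"}
```

Each drop multiplies the candidate states, which is why keeping the SV small (next slides) matters for both latency and accuracy.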

PREVIOUS WORK

Slide 9/27

Remaining Agenda

I. Addressing the problem of non-determinism
  A. Intelligent Sampling
  B. Hidden Markov Model (HMM) for state determination
II. Experimental Test-bed
III. Performance Results
IV. Efficient Rule Matching and Triggering

CURRENT WORK

Slide 10/27

Challenges with Non-Determinism

Challenges:

  • Decrease detection latency (a large SV increases rule-matching time)
  • Increase true alarms (reduce the effect of incorrect messages)
  • Decrease false alarms (reduce incorrect states in the SV)

Techniques for handling non-determinism: Intelligent Sampling, Hidden Markov Model, SV reduction

Slide 11/27

Intelligent Sampling

[Diagram: Random Sampling vs. Intelligent Sampling. With random sampling, the sampled messages drive the Finite State Machine into a large State Vector (e.g., { S1, S2, S3, S4, S5, S6 }). Intelligent sampling prefers messages with a desirable property, keeping the State Vector small (e.g., { S1, S3, S4 }).]

Slide 12/27

What is a Desirable Property of a Message?

Discriminative Size: the number of times a message appears in transitions to different states in the FSM

  • The desirable property is a small discriminative size

[Diagram: a portion of a Finite State Machine in which message m1 leads from { S1, S2, S3 } to S4, S5, and S6; m2 leads to S7 and S8; and m3 leads to S9]

Suppose SV = { S1, S2, S3 } and Sampling Rate = 1/3:

Sampled Message | Resulting SV | Discriminative Size
m1 | { S4, S5, S6 } | 3
m2 | { S7, S8 } | 2
m3 | { S9 } | 1
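The table can be reproduced with a small helper; the transition table below encodes the edges of the FSM fragment in the figure.

```python
# Transitions read off the FSM fragment: from {S1, S2, S3}, message m1
# can lead to three states, m2 to two, and m3 to one.
FSM = {
    ("S1", "m1"): "S4", ("S2", "m1"): "S5", ("S3", "m1"): "S6",
    ("S1", "m2"): "S7", ("S2", "m2"): "S8",
    ("S1", "m3"): "S9",
}

def discriminative_size(sv, fsm, msg):
    """Number of distinct states msg can lead to from the current SV."""
    return len({nxt for (s, m), nxt in fsm.items() if s in sv and m == msg})

def pick_message(sv, fsm, candidates):
    """Intelligent sampling prefers the smallest discriminative size."""
    return min(candidates, key=lambda m: discriminative_size(sv, fsm, m))

sv = {"S1", "S2", "S3"}
sizes = {m: discriminative_size(sv, FSM, m) for m in ("m1", "m2", "m3")}
# sizes == {"m1": 3, "m2": 2, "m3": 1}, so pick_message(...) returns "m3"
```

Sampling m3 collapses the SV to a single state, which is exactly the "desirable property" of the slide.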

Slide 13/27

Benefits of Intelligent Sampling

Random Sampling

  • SV can grow large
  • Multiple incorrect states

– Increase in false alarms

Intelligent Sampling

  • SV is kept small

– Reduced detection latency

  • Fewer incorrect states in the SV

– Fewer false alarms

Slide 14/27

The Problem of Sampling an Incorrect Message

  • What if an incorrect message is sampled?

– The message is incorrect in the current states, e.g., a message from a buggy component

[Diagram: the same FSM fragment, with messages m1–m5 leading from S0 toward states S1–S5]

  • Suppose SV = { S1, S2 } and m3 is changed to m5

⇒ SV = { S5 } (incorrect SV! It should be { S3 })

Slide 15/27

Probabilistic State Vector Reduction: A Hidden Markov Model Approach

  • Hidden Markov Model (HMM) used to reduce SV

– An HMM is an extended Markov Model in which the states are not observable
– The states are hidden, as in the monitored application

  • Given an HMM, we can ask:

What is the probability of the application being in any given state, given a sequence of messages? The cost is O(N³L), where N is the number of states and L is the sequence length.

  • The HMM is trained (offline) with application traces
Slide 16/27

State Vector Reduction with the HMM

  • The Monitor asks the HMM for {p1, p2, …, pN}

pi = P(Si | O), where Si is application state i and O is the observation sequence

Pipeline: Messages (observations) → HMM → p1, p2, …, pN → sort by probability → α = top k probabilities → α ∩ SV = new SV

The new SV is robust to incorrect messages
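A minimal version of this query, using the standard forward algorithm on a toy two-state HMM. All parameters here are invented for illustration; in the paper the model is trained offline from application traces.

```python
# Toy HMM: 2 hidden application states, 2 observable message types.
A  = [[0.7, 0.3], [0.4, 0.6]]   # state-transition probabilities (assumed)
B  = [[0.9, 0.1], [0.2, 0.8]]   # P(observed message | state) (assumed)
pi = [0.5, 0.5]                 # initial state distribution (assumed)

def state_posterior(obs):
    """Forward algorithm: P(current state | message sequence)."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[j] * A[j][i] for j in range(n)) * B[i][o]
                 for i in range(n)]
    total = sum(alpha)
    return [a / total for a in alpha]

def reduce_sv(sv, obs, k):
    """alpha = top-k most probable states; new SV = alpha intersect SV."""
    probs = state_posterior(obs)
    top_k = set(sorted(range(len(probs)), key=probs.__getitem__)[-k:])
    return sv & top_k

new_sv = reduce_sv({0, 1}, obs=[0, 0, 1], k=1)   # collapses the SV to {1}
```

Because the posterior is computed over the whole observation sequence, a single corrupted message has limited influence, which is why the reduced SV is robust to incorrect messages.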

Slide 17/27

Experimental Testbed: Java Duke’s Bank Application

  • Simulates a multi-tier online banking system
  • User transactions:

– Access account information, transfer money, etc.

  • Application stressed with different workloads

– Incoming message rate at the Monitor varies with user load

Slide 18/27

Web Interaction: A Sequence of Calls and Returns

Components: Java servlets, JavaBeans, EJBs

Slide 19/27

Error Injection Types

  • Errors are injected in components touched by web interactions

– A web interaction is faulty if at least one of its components is faulty

Error Type | Description
Response Delays | a response delay in a method call
Null Calls | a call to a component that is never executed
Unhandled Exceptions | an exception thrown during execution that is never caught by the program
Incorrect Message Sequences | randomly change the web interaction structure

Slide 20/27

Performance Metrics Used in Experiments

Accuracy (True Alarms) | % of true detections out of web interactions where errors were injected
Precision (False Alarms) | % of true detections out of the total number of detections
Detection Latency | time elapsed between the error injection and its detection

Example: over 5 web interactions, errors were injected in 3 of them, and alarms were signaled for 4 interactions, 2 of which were true detections.

Accuracy = 2/3 = 0.67
Precision = 2/4 = 0.5
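The example numbers can be reproduced directly. The specific interaction indices below are hypothetical, chosen only to be consistent with the slide's totals (3 injections, 4 detections, 2 overlapping):

```python
def accuracy_precision(injected, detected):
    """Accuracy = true detections / interactions with injected errors;
    Precision = true detections / total number of detections."""
    true_detections = injected & detected
    return (len(true_detections) / len(injected),
            len(true_detections) / len(detected))

# Hypothetical assignment matching the slide's totals.
injected = {1, 3, 5}
detected = {1, 2, 3, 4}
acc, prec = accuracy_precision(injected, detected)
# acc == 2/3 (approx. 0.67), prec == 2/4 == 0.5
```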

Slide 21/27

Results: State Vector Reduction

  • Peaks are not observed with intelligent sampling (IS)

– IS is able to select messages with a small discriminative size

  • An SV of size 1 is more frequent with IS

[Figures: State Vector size over discrete time for Random Sampling and Intelligent Sampling; CDF of the pruned State Vector size for both]

Slide 22/27

Results: Monitor vs. Pinpoint (Accuracy, Precision)

  • Monitor and Pinpoint show similar levels of accuracy
  • Precision in Monitor (1.0) is higher than in Pinpoint (0.9)

[Figures: Accuracy and Precision vs. number of concurrent users (4–24), Monitor vs. Pinpoint-PCFG]

  • Pinpoint (NSDI ’04) traces paths through multiple components
  • Uses a Probabilistic Context-Free Grammar (PCFG) to detect abnormal paths
Slide 23/27

Results: Monitor vs. Pinpoint (Detection Latency)

  • Detection latency in Monitor is on the order of milliseconds, while in Pinpoint it is on the order of seconds
  • The PCFG has high space and time complexity

[Figures: Detection Latency vs. concurrent users (4–24): Monitor in msec, Pinpoint-PCFG in sec]

Slide 24/27

Results: Memory Consumption (MB)

  • Monitor does not rely on large data structures
  • The PCFG in Pinpoint has high space and time complexity

– O(RL²) space and O(L³) time, where R is the number of rules in the grammar and L is the size of a web interaction

  • Pinpoint thrashes due to its high memory requirements

| Virtual Memory | RAM
Monitor | 282.27 | 25.53
Pinpoint-PCFG | 933.56 | 696.06

Slide 25/27

Efficient Rule Matching

[Flowchart: for each rule, ask "Expensive to match?"; if no, match it; if yes, ask "System unstable?"; if yes, match it, otherwise skip to the next rule]

  • Selectively match computationally expensive rules

– Expensive rules don’t have to be matched all the time

  • Rules are matched only if instability is present
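The selective matching loop can be sketched as follows. The rule names, the instability test (a standard-deviation threshold on a window of recent metric samples), and the cheap/expensive split are illustrative; in the paper's example the expensive rule is ARIMA-based.

```python
import statistics

def match_rules(rules, window, sigma_threshold):
    """Always match cheap rules; match expensive rules only when the
    recent metric window is unstable (std dev >= threshold)."""
    unstable = statistics.pstdev(window) >= sigma_threshold
    alarms = []
    for rule in rules:
        if rule["expensive"] and not unstable:
            continue                     # skip the costly check while stable
        if not rule["check"](window):
            alarms.append(rule["name"])
    return alarms

rules = [
    {"name": "cheap-bound", "expensive": False,
     "check": lambda w: max(w) < 1000},
    {"name": "trend-model", "expensive": True,   # stand-in for an ARIMA rule
     "check": lambda w: (w[-1] - w[0]) < 50},
]

# A steadily growing memory curve is unstable and trips only the
# expensive trend rule.
alarms = match_rules(rules, [100, 200, 350, 600], sigma_threshold=0.5)
```

A stable window never pays the cost of the expensive rule, which is the source of the latency reductions reported on the next slide.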
Slide 26/27

Efficient Rule Matching Example: Detecting a Memory Leak

  • Efficiently detects a memory leak in the Apache web server

– The memory leak is injected probabilistically with web requests

  • An expensive ARIMA-based rule detects abnormal memory usage

– Average matching latency is reduced

Rule Matching Criteria | Memory Leak Detected | Average Matching Latency (msec)
Always matched | yes | 19.283
σ ≥ 0.5 | yes | 7.115
σ ≥ 1.0 | no | 1.25

Slide 27/27

Concluding Remarks

  • Contributions:

– Sampling is used to scale a stateful detection system under high message rates
– Intelligent Sampling reduces the non-determinism caused by sampling
– The HMM approach handles incorrect messages
– The techniques can be applied to any stateful detection system
– Monitor performs better than other approaches

  • Future Work:

– The Efficient Rule Matching technique will be extended
– Sampling only sequences of messages that lead to errors
– Automatic generation of rules from traces