Slide 1/27
USENIX / ACM / IFIP 10th International Middleware Conference
How To Keep Your Head Above Water While Detecting Errors
Ignacio Laguna, Fahad A. Arshad, David M. Grothe, Saurabh Bagchi
Dependable Computing Systems Lab, School of Electrical and Computer Engineering, Purdue University
Slide 2/27
Impact of Failures in Internet Services
- Internet services are expected to be running 24/7
– System downtime can cost $1 million / hour
(Source: Meta Group, 2002)
- Service degradation is the most frequent problem
– Service is slower than usual; almost unavailable
– Can be difficult to detect and diagnose
- Internet-based applications are very large and dynamic
– Complexity increases as new components are added
Slide 3/27
Complexity of Internet-based Applications
Software components:
- Each tier has multiple components
- Components can be stateful
- Examples: Servlets, JavaBeans, EJBs
Slide 4/27
The Monitor Detection System (TDSC ’06)
[Diagram: the Monitor observes messages exchanged between components of the observed system]
- Non-intrusive: observes messages between components
- Online: faster detection than offline approaches
- Black-box detection: components are treated as black boxes
– No knowledge of components’ internals
PREVIOUS WORK
Slide 5/27
Stateful Rule-based Detection
Monitor components: Packet Capturer, Finite State Machine (event-based state transition model), Normal-Behavior Rules.
Detection process (a code sketch of this loop follows below):
1. A message is captured
2. Deduce the current application state
3. Match rules based on the deduced state
4. If a rule is not satisfied, signal an alarm
PREVIOUS WORK
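The detection loop above maps naturally to code. Below is a minimal Python sketch of the four steps; the state names, messages, and rule encoding are illustrative stand-ins, not the Monitor's actual structures:

```python
# Minimal sketch of the Monitor's stateful detection loop.
# FSM as a transition table: (current_state, message) -> next_state
TRANSITIONS = {
    ("S0", "login"): "S1",
    ("S1", "transfer"): "S2",
    ("S2", "logout"): "S0",
}

# Normal-behavior rules keyed by deduced state: state -> allowed messages
RULES = {
    "S0": {"login"},
    "S1": {"transfer", "logout"},
    "S2": {"logout"},
}

def detect(message_stream):
    state = "S0"                                  # deduced application state
    for msg in message_stream:                    # 1. a message is captured
        if msg not in RULES.get(state, set()):    # 3. match rules for the deduced state
            print(f"ALARM: unexpected '{msg}' in state {state}")  # 4. signal an alarm
        state = TRANSITIONS.get((state, msg), state)  # 2. deduce the next state

detect(["login", "transfer", "logout", "transfer"])  # the last message raises an alarm
```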
Slide 6/27
The Breaking Point in High Rate of Messages
[Figure: Detection Latency (msec) vs. Captured Packets/second (incoming message rate at the Monitor); latency rises sharply past the breaking point]
- After the breaking point, latency increases sharply
- The true-alarm rate decreases because packets are dropped
- A breaking point is expected in any stateful detection system
PREVIOUS WORK
Slide 7/27
Avoiding the Breaking Point: Random Sampling (SRDS ’07)
- Processing Load in Monitor
δ = R × C
R: Incoming message rate, C: Processing cost per message
- Processing Load δ is reduced by reducing R
– Only a portion of incoming messages is processed
– n out of m messages are randomly sampled (see the sketch below)
- Sampling is activated if R ≥ breaking point
PREVIOUS WORK
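A small sketch of how the load equation could gate sampling; the breaking-point value and the n-of-m windowing below are illustrative assumptions:

```python
import random

BREAKING_POINT = 900  # messages/sec; illustrative value, not the paper's number

def processing_load(R, C):
    """delta = R * C: incoming message rate R times per-message processing cost C."""
    return R * C

def sample_n_of_m(messages, n, m):
    """Randomly keep n out of every m messages, reducing the effective R."""
    kept = []
    for i in range(0, len(messages), m):
        window = messages[i:i + m]
        kept.extend(random.sample(window, min(n, len(window))))
    return kept

def admit(messages, rate_R, n=1, m=3):
    """Sampling is activated only when R reaches the breaking point."""
    if rate_R >= BREAKING_POINT:
        return sample_n_of_m(messages, n, m)
    return messages

print(admit(list(range(12)), rate_R=1200))  # 4 of 12 messages survive sampling
```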
Slide 8/27
The Non-Determinism Caused by Sampling
Definitions:
- State Vector (SV): the state(s) of the application from the Monitor’s point of view (deduced states)
- Non-determinism: the Monitor is no longer aware of the exact state the application is in (because of dropped messages)
[Figure: a portion of a Finite State Machine rooted at S0, with messages m1–m5 leading to states S1–S5]
Events in the Monitor (the SV updates are sketched below):
(1) SV = { S0 }
(2) A message is dropped
(3) SV = { S1, S2 }
(4) A message is sampled
(5) The message is m5
(6) SV = { S5 }
PREVIOUS WORK
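The SV bookkeeping in the event trace can be made concrete. The sketch below uses an illustrative FSM arranged to reproduce the trace above: when a message is dropped the SV expands to all successor states, and a sampled message narrows it again:

```python
# Sketch of State Vector (SV) updates under sampling; the FSM is illustrative.
FSM = {
    "S0": {"m1": "S1", "m2": "S2"},
    "S1": {"m3": "S3", "m5": "S5"},
    "S2": {"m4": "S4", "m5": "S5"},
}

def on_dropped(sv):
    """A message was dropped: expand the SV to every possible successor state."""
    return {nxt for s in sv for nxt in FSM.get(s, {}).values()}

def on_sampled(sv, msg):
    """A message was sampled: keep only successors reachable via that message."""
    return {FSM[s][msg] for s in sv if msg in FSM.get(s, {})}

sv = {"S0"}
sv = on_dropped(sv)        # SV = {S1, S2}: exact state unknown (non-determinism)
sv = on_sampled(sv, "m5")  # SV = {S5}: observing m5 disambiguates the state
print(sv)
```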
Slide 9/27
Remaining Agenda
I. Addressing the problem of non-determinism
   A. Intelligent Sampling
   B. Hidden Markov Model (HMM) for state determination
II. Experimental Test-bed
III. Performance Results
IV. Efficient Rule Matching and Triggering
CURRENT WORK
Slide 10/27
Challenges with Non-Determinism
- Decrease detection latency (a large SV increases rule-matching time)
- Increase true alarms (reduce the effect of incorrect messages)
- Decrease false alarms (reduce incorrect states in the SV)
Approaches to non-determinism: Intelligent Sampling and HMM-based State Vector reduction.
Slide 11/27
Intelligent Sampling
[Figure: two panels contrasting sampling strategies over the same Finite State Machine. Random Sampling: sampled messages can leave a large State Vector (e.g., {S1, S2, S3, S4, S5, S6}). Intelligent Sampling: messages with a desirable property are preferentially sampled, keeping the State Vector small (e.g., {S1, S3, S4})]
Slide 12/27
What is a Desirable Property of a Message?
- Discriminative Size: the number of times a message appears in transitions to different states in the FSM
- The desirable property is a small discriminative size
[Figure: a portion of a Finite State Machine where, from the states in the SV, message m1 can lead to S4, S5, or S6; m2 to S7 or S8; and m3 only to S9]
Suppose SV = { S1, S2, S3 } and the sampling rate is 1/3 (a sketch of the computation follows the table):
Sampled Message | Resulting SV | Discriminative Size
m1 | {S4, S5, S6} | 3
m2 | {S7, S8} | 2
m3 | {S9} | 1
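A sketch of how discriminative size could be computed and used to pick the next message to sample; the FSM mirrors the slide's example, while the function names are illustrative:

```python
# Sketch of intelligent sampling by discriminative size.
FSM = {
    "S1": {"m1": "S4", "m2": "S7", "m3": "S9"},
    "S2": {"m1": "S5", "m2": "S8"},
    "S3": {"m1": "S6"},
}

def discriminative_size(msg, sv):
    """Number of distinct states msg can lead to from the states in the SV."""
    return len({FSM[s][msg] for s in sv if msg in FSM.get(s, {})})

def pick_message(candidates, sv):
    """Prefer the candidate message with the smallest (nonzero) size."""
    valid = [m for m in candidates if discriminative_size(m, sv) > 0]
    return min(valid, key=lambda m: discriminative_size(m, sv))

sv = {"S1", "S2", "S3"}
for m in ("m1", "m2", "m3"):
    print(m, discriminative_size(m, sv))      # m1: 3, m2: 2, m3: 1
print(pick_message(["m1", "m2", "m3"], sv))   # m3 keeps the SV smallest
```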
Slide 13/27
Benefits of Intelligent Sampling
Random Sampling
- The SV can grow large
- Multiple incorrect states
– Increase of false alarms
Intelligent Sampling
- SV is kept small
– Detection latency reduction
- Fewer incorrect states in the SV
– False alarms reduction
Slide 14/27
The Problem of Sampling an Incorrect Message
- What if an incorrect message is sampled?
– The message is invalid in the current states, e.g., a message from a buggy component
[Figure: a portion of a Finite State Machine rooted at S0, with messages m1–m5 leading to states S1–S5]
- Suppose SV = { S1, S2 } and m3 is changed to m5
⇒ SV = { S5 } (an incorrect SV! It should be { S3 })
Slide 15/27
Probabilistic State Vector Reduction: A Hidden Markov Model Approach
- Hidden Markov Model (HMM) used to reduce SV
– An HMM is an extended Markov Model where states are not observable
– States are hidden, as in the monitored application
- Given an HMM, we can ask:
What is the probability of the application being in each state, given a sequence of messages? The cost is O(N³L), where N is the number of states and L is the sequence length.
- The HMM is trained (offline) with application traces
Slide 16/27
State Vector Reduction with the HMM
- The Monitor asks the HMM for {p1, p2, …, pN}, where pi = P(Si | O), Si is application state i, and O is the sequence of observations (messages)
- The probabilities are sorted; α = the states with the top-k probabilities; α ∩ SV = the new SV
- The new SV is robust to incorrect messages (see the sketch below)
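A minimal sketch of the reduction step with a toy HMM: a forward pass yields the posterior state probabilities, and the new SV is the intersection of the old one with the top-k states. All matrix values and k are illustrative assumptions, and the forward pass stands in for whatever inference the Monitor actually runs:

```python
import numpy as np

# Toy HMM with N = 3 hidden states and 2 observation symbols.
A = np.array([[0.7, 0.2, 0.1],   # A[i, j]: transition probability i -> j
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
B = np.array([[0.9, 0.1],        # B[i, o]: probability state i emits symbol o
              [0.2, 0.8],
              [0.5, 0.5]])
pi = np.array([0.6, 0.3, 0.1])   # initial state distribution

def state_probabilities(obs):
    """Forward pass: P(current state | observed message sequence)."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha / alpha.sum()   # normalize into posterior probabilities p_i

def reduce_sv(sv, obs, k=2):
    """alpha ∩ SV: keep only SV states among the top-k most probable states."""
    p = state_probabilities(obs)
    top_k = set(int(i) for i in np.argsort(p)[-k:])  # indices of the k largest p_i
    return sv & top_k

print(reduce_sv({0, 1, 2}, obs=[0, 1, 1], k=2))  # states indexed 0..N-1
```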
Slide 17/27
Experimental Testbed: Java Duke’s Bank Application
- Simulates a multi-tier online banking system
- User transactions:
– Access account information, transfer money, etc.
- The application is stressed with different workloads
– The incoming message rate at the Monitor varies with the user load
Slide 18/27
Web Interaction: A Sequence of Calls and Returns
Components: Java servlets, JavaBeans, EJBs
Slide 19/27
Error Injection Types
- Errors are injected into components touched by web interactions
– A web interaction is faulty if at least one of its components is faulty
Error Type | Description
Response Delays | a response delay in a method call
Null Calls | a call to a component that is never executed
Unhandled Exceptions | an exception thrown during execution that is never caught by the program
Incorrect Message Sequences | randomly change the structure of the web interaction
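A sketch of how the four error types could be realized around a component call; the function, error-type names, and mechanisms are illustrative assumptions, not the paper's injection harness:

```python
import random
import time

def inject(call, error_type, interaction=None):
    """Apply one of the four error types to a wrapped component call."""
    if error_type == "response_delay":
        time.sleep(0.5)                        # delay the method call's response
        return call()
    if error_type == "null_call":
        return None                            # the component is never executed
    if error_type == "unhandled_exception":
        raise RuntimeError("injected fault")   # never caught by the program
    if error_type == "incorrect_sequence" and interaction is not None:
        random.shuffle(interaction)            # randomly change the interaction structure
    return call()

# Usage: wrap a stand-in for a servlet/EJB method call.
print(inject(lambda: "account-balance", "response_delay"))
```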
Slide 20/27
Performance Metrics Used in Experiments
- Accuracy (True Alarms): % of true detections out of web interactions where errors were injected
- Precision (False Alarms): % of true detections out of the total number of detections
- Detection Latency: time elapsed between the error injection and its detection
Example: out of 5 web interactions, errors are injected into 3; the Monitor signals 4 alarms, of which 2 are true detections.
Accuracy = 2/3 = 0.67
Precision = 2/4 = 0.5
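The two ratios are straightforward to compute; a minimal sketch using the slide's example counts:

```python
def accuracy(true_detections, injected_errors):
    """Fraction of injected-error interactions that were detected."""
    return true_detections / injected_errors

def precision(true_detections, total_detections):
    """Fraction of raised alarms that were true detections."""
    return true_detections / total_detections

# Slide example: errors injected into 3 interactions, 4 alarms raised,
# 2 of which are true detections.
print(f"Accuracy  = {accuracy(2, 3):.2f}")   # 0.67
print(f"Precision = {precision(2, 4):.2f}")  # 0.50
```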
Slide 21/27
Results: State Vector Reduction
- Peaks are not observed in intelligent sampling (IS)
– Thanks to IS’s ability to select messages with a small discriminative size
- SV of size 1 is more frequent in IS
[Figure: State Vector size over discrete time for Random Sampling vs. Intelligent Sampling; large peaks appear only under random sampling]
[Figure: CDF of the pruned State Vector size; intelligent sampling yields size 1 far more often than random sampling]
Slide 22/27
Results: Monitor vs. Pinpoint (Accuracy, Precision)
- Monitor and Pinpoint achieve similar levels of accuracy
- Precision in Monitor (1.0) is higher than in Pinpoint (0.9)
[Figure: Accuracy and Precision vs. number of concurrent users (4–24) for Monitor and Pinpoint-PCFG]
- Pinpoint (NSDI ’04) traces paths through multiple components
- It uses a probabilistic context-free grammar (PCFG) to detect abnormal paths
Slide 23/27
Results: Monitor vs. Pinpoint (Detection Latency)
- Detection latency in Monitor is on the order of milliseconds, while in Pinpoint it is on the order of seconds
- The PCFG has a high space and time complexity
[Figure: Detection latency vs. number of concurrent users (4–24); Monitor’s latency is measured in msec, Pinpoint-PCFG’s in seconds]
Slide 24/27
Results: Memory Consumption (MB)
- Monitor doesn’t rely on large data structures
- PCFG in Pinpoint has high space and time complexity
– O(RL²) and O(L³), where R is the number of rules in the grammar and L is the size of a web interaction
- Pinpoint thrashes due to high memory requirements
System | Virtual Memory (MB) | RAM (MB)
Monitor | 282.27 | 25.53
Pinpoint-PCFG | 933.56 | 696.06
Slide 25/27
Efficient Rule Matching
[Flowchart: for each rule, if it is not expensive to match, match it; if it is expensive, match it only when the system is unstable; otherwise skip to the next rule]
- Selectively match computationally expensive rules
– Expensive rules don’t have to be matched all the time
- Expensive rules are matched only if instability is present (see the sketch below)
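A sketch of the selective-matching gate; the instability heuristic below uses a sigma-deviation test (in the spirit of the σ thresholds on the next slide), and all names and values are illustrative:

```python
def system_unstable(value, mean, stddev, sigmas=1.0):
    """Instability heuristic: metric deviates from its mean by >= sigmas * stddev."""
    return abs(value - mean) >= sigmas * stddev

def match_rules(rules, message, metric, mean, stddev):
    for rule in rules:
        if rule["expensive"] and not system_unstable(metric, mean, stddev):
            continue  # skip the expensive rule while the system looks stable
        if not rule["check"](message):
            print(f"ALARM: rule '{rule['name']}' violated")

rules = [
    {"name": "cheap-sequence-check", "expensive": False,
     "check": lambda m: m != "bad-message"},
    {"name": "expensive-model-check", "expensive": True,
     "check": lambda m: True},  # stand-in for, e.g., an ARIMA model fit
]
match_rules(rules, "bad-message", metric=120, mean=100, stddev=15)
```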
Slide 26/27
Efficient Rule Matching Example: Detecting a Memory Leak
- Efficiently detects a memory leak in the Apache web server
– The memory leak is injected probabilistically with web requests
- An expensive ARIMA-based rule detects abnormal memory usage
– The average matching latency is reduced

Rule Matching Criteria | Memory Leak Detected | Average Matching Latency (msec)
Always matched | yes | 19.283
σ ≥ 0.5 | yes | 7.115
σ ≥ 1.0 | no | 1.25
Slide 27/27
Concluding Remarks
- Contributions:
– Sampling is used to scale a stateful detection system under a high rate of messages
– Intelligent Sampling reduces the non-determinism caused by sampling
– The HMM approach handles incorrect messages
– The techniques can be applied to any stateful detection system
– Monitor performs better than other approaches
- Future Work: