

SLIDE 1

Automatic Failure Diagnosis based on Timing Behavior Anomaly Detection in Distributed Java Web Applications

Diploma Thesis Presentation Nina S. Marwede

Abteilung Software Engineering Fakultät II – Department für Informatik

August 26, 2008

First examiner

  • Prof. Dr. Wilhelm Hasselbring

Second examiner

  • Matthias Rohr

Advisors

  • Dipl.-Inform. André van Hoorn
  • Matthias Rohr

SLIDE 2

Contents

1. Motivation
2. Foundations
3. Goals
4. Approach
5. Case Study
6. Conclusions

Nina Marwede (Univ. of Oldenburg) Failure Diagnosis based on Timing Behavior Aug 26, 2008 2 / 36

SLIDE 3

Motivation

Motivation for Automatic Failure Diagnosis

• Software systems are practically never free of faults
• Software failures have a great influence on our lives
• Manual diagnosis and debugging require large effort
• Automated processes are required:

1. Failure detection
2. Fault localization
3. Fault removal




SLIDE 6

Foundations

Monitoring of System Behavior

• Log files
• User interfaces
• Resources
• Control flow
• Timing behavior
→ Instrumentation of hardware/software

Kieker [Rohr et al., 2008]

[Figure: Kieker architecture – a software system and its database instrumented with monitoring probes; :Tpmon and :TpmonControl collect the data, and :Tpan's analysis components (:SequenceAnalysis, :DependencyAnalysis, :TimingAnalysis, :ExecutionModelAnalysis) produce sequence diagrams, dependency graphs, Markov chains, and timing diagrams]
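The timing-behavior instrumentation above can be pictured as a probe that wraps each monitored operation and logs its response time. The sketch below illustrates that idea in plain Java under simplified assumptions; the class and method names are invented here and are not Kieker's actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Illustrative monitoring probe: wraps an operation call and records
// its start time and duration, similar in spirit to Kieker's Tpmon.
public class MonitoringProbe {

    public record ExecutionRecord(String operation, long startNanos, long durationNanos) {}

    private final List<ExecutionRecord> log = new ArrayList<>();

    // Run 'body' as the monitored operation and append one record to the log.
    public <T> T monitor(String operation, Supplier<T> body) {
        long start = System.nanoTime();
        try {
            return body.get();
        } finally {
            log.add(new ExecutionRecord(operation, start, System.nanoTime() - start));
        }
    }

    public List<ExecutionRecord> records() {
        return log;
    }
}
```

In Kieker, such records are collected by Tpmon and analyzed offline by Tpan; the probe here only keeps them in memory for illustration.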

SLIDE 7

Foundations

Failure Diagnosis

• Model checking: explicit messages
• Timing behavior: throughput, latency, response times
• Anomaly detection: statistical analysis
• Correlation: connecting information from different sources
• Goal: the cause instead of the symptoms
• Visualization


SLIDE 8

Goals

Goals

1. Design of an approach for fault localization
   ◮ Timing behavior anomaly detection [Rohr, 2008]
   ◮ Calling dependencies between components (dependency graphs)
   ◮ Focus on event correlation ⇒ “Anomaly Correlator”

2. Evaluation in a case study
   ◮ Java web application: iBATIS JPetStore
   ◮ Workload generation: Markov4JMeter [van Hoorn et al., 2008]
   ◮ Fault injection


SLIDE 10

Approach Solution Idea

Solution Idea

Correlation: Draw conclusions from the arrangement of the anomalies in the calling dependency graph

[Figure: calling dependency graph with nodes A–G, shaded anomalous / unsure / normal]
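The three-way labelling in the figure (anomalous / unsure / normal) can be read as a threshold classification over each node's aggregated anomaly score. The sketch below assumes scores in [−1, 1] and two illustrative thresholds; the actual calibration in the thesis may differ.

```java
// Illustrative three-way classification of a node in the calling
// dependency graph by its aggregated anomaly score.
public class NodeClassifier {

    public enum Label { ANOMALOUS, UNSURE, NORMAL }

    // The thresholds 0.5 and 0.0 are assumptions for illustration only.
    public static Label classify(double anomalyScore) {
        if (anomalyScore > 0.5) return Label.ANOMALOUS;
        if (anomalyScore > 0.0) return Label.UNSURE;
        return Label.NORMAL;
    }
}
```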

SLIDE 11

Approach Implementation

Implementation

• Extension of the existing software “Kieker” [Rohr et al., 2008]
• Tpmon stores monitoring data; Tpan with its plug-ins analyzes it
• The Correlator is a plug-in for Tpan

[Figure: analysis pipeline – the Anomaly Detector feeds the Correlator inside Tpan; the Correlator's four stages are (1) model building, (2) execution aggregation, (3) cause estimation, and (4) visualization as anomaly graphs and textual output]

SLIDE 12

Approach Assumptions

Assumptions

• Correct failure detection
• Correct anomaly scoring
• The failure has a distinct cause
• Exactly one failure in the observation period
• Anomaly propagation


SLIDE 13

Approach Input Data

Input Data

1. Calling dependencies between operations
2. Anomalies in the timing behavior of executions

Comp  VM  Start  RT  Anomaly
A     X   0001    8     0.6
C     Y   0002    1    −0.2
B     X   0004    4     0.9
C     Y   0006    2     0.3
...

[Figure: example calling dependency graph over components A, B, and C]

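A row of the input data above pairs an execution's component, deployment context (VM), start time, and response time with its anomaly score. A minimal parsing sketch; the class and field names are invented here and are not the thesis' actual data model.

```java
// Illustrative record for one row of the Correlator's input data,
// e.g. "A X 0001 8 0.6" (component, VM, start, response time, anomaly score).
public class InputRecord {

    final String component, vm;
    final long start, responseTime;
    final double anomalyScore; // assumed to lie in [-1, 1]

    InputRecord(String component, String vm, long start, long responseTime, double anomalyScore) {
        this.component = component;
        this.vm = vm;
        this.start = start;
        this.responseTime = responseTime;
        this.anomalyScore = anomalyScore;
    }

    // Parse one whitespace-separated row of the table above.
    static InputRecord parse(String row) {
        String[] f = row.trim().split("\\s+");
        return new InputRecord(f[0], f[1], Long.parseLong(f[2]),
                Long.parseLong(f[3]), Double.parseDouble(f[4]));
    }
}
```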

SLIDE 14

Approach Step 1: Preparation

Step 1: Preparation of Data Structures

• Generation of calling dependency graphs from traces
• Connection of anomalies with the software architecture

[Figure: calling dependency graph on the operation level, including doGet(HttpServletRequest,HttpServletResponse), doPost(HttpServletRequest,HttpServletResponse), addItemToCart(), newOrder(), viewItem(), viewCategory(), getItem(String), insertOrder(Order), signon(), getProductListByCategory(String), and getCategory(String)]

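Generating the calling dependency graph can be sketched as counting caller→callee pairs over the monitored traces. The sketch below simplifies a trace to a flat sequence of operation names; real traces encode nesting (stack depth), which this illustration ignores.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative calling dependency graph: directed edges between operation
// names, weighted by the number of observed calls.
public class DependencyGraph {

    // edge "caller->callee" mapped to its observed call count
    final Map<String, Integer> edges = new HashMap<>();

    void addCall(String caller, String callee) {
        edges.merge(caller + "->" + callee, 1, Integer::sum);
    }

    // Treats each trace as a linear call chain: consecutive operations
    // form a caller/callee pair (a simplification of nested call trees).
    static DependencyGraph fromTraces(List<List<String>> traces) {
        DependencyGraph graph = new DependencyGraph();
        for (List<String> trace : traces)
            for (int i = 0; i + 1 < trace.size(); i++)
                graph.addCall(trace.get(i), trace.get(i + 1));
        return graph;
    }
}
```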

SLIDE 15

Approach Challenges

Challenges (1/2)

Aggregation: How to aggregate a number of anomaly scores into one value?

Four places are involved: three architectural levels, plus neighbors on the operation level. Five methods are evaluated: median, power mean (three exponents), and maximum.

[Figure: histogram of anomaly scores (number of executions vs. anomaly score)]

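The aggregation methods named above can be sketched as follows: the power mean generalizes the arithmetic mean (exponent 1), and the particular exponents would be a configuration choice. The sketch assumes non-negative inputs, e.g. anomaly scores shifted into [0, 1].

```java
import java.util.Arrays;

// Illustrative aggregation functions for a set of anomaly scores:
// median, power mean (arithmetic mean for p = 1), and maximum.
public class Aggregation {

    // Power mean with exponent p; assumes non-negative inputs.
    static double powerMean(double[] scores, double p) {
        double sum = 0;
        for (double s : scores) sum += Math.pow(s, p);
        return Math.pow(sum / scores.length, 1.0 / p);
    }

    static double median(double[] scores) {
        double[] sorted = scores.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    static double max(double[] scores) {
        double m = scores[0];
        for (double s : scores) m = Math.max(m, s);
        return m;
    }
}
```

Larger exponents pull the power mean toward the maximum, so a single strongly anomalous execution dominates the aggregate; the median, in contrast, is robust against such outliers.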

SLIDE 16

Approach Challenges

Challenges (2/2)

Correlation: How to recognize the propagation of an anomaly?

Consider the perspective of each component Three algorithms are evaluated: Trivial, simple, advanced

[Figure: calling dependency graph with nodes A–G]


SLIDE 19

Approach Step 2: Processing

Step 2: Processing of Anomaly Scores

Three algorithms:

1. Trivial: simple aggregation, no correlation
2. Simple: simple aggregation, “pessimistic” correlation
3. Advanced: weighted, configurable aggregation, “optimistic” correlation


SLIDE 20

Approach Step 2: Processing

Trivial Algorithm

• Aggregation: unweighted arithmetic mean on each level
• Correlation: none

[Figure: aggregation hierarchy – application → deployment contexts → components → operations → executions]
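The trivial algorithm amounts to repeated unweighted averaging up this hierarchy. A sketch for one level, e.g. rating a component from the execution scores of each of its operations:

```java
import java.util.List;

// Illustrative bottom-up aggregation of the trivial algorithm:
// every level's rating is the unweighted arithmetic mean of its children.
public class TrivialAlgorithm {

    static double mean(List<Double> scores) {
        double sum = 0;
        for (double s : scores) sum += s;
        return sum / scores.size();
    }

    // Rating of a parent from its child groups, e.g. a component rated
    // from the per-operation lists of execution anomaly scores.
    static double rate(List<List<Double>> childGroups) {
        double sum = 0;
        for (List<Double> group : childGroups) sum += mean(group);
        return sum / childGroups.size();
    }
}
```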


SLIDE 22

Approach Step 2: Processing

Simple Algorithm

1. Rule 1: Is the mean of the anomaly ratings of the directly connected callers relatively high? ⇒ Increase the rating
2. Rule 2: Is the maximum of the anomaly ratings of the directly connected callees relatively high? ⇒ Decrease the rating

[Figure: calling dependency graph with nodes A–G, annotated with example anomaly ratings 1.0, 0.5, 0.2, −0.7, −0.6, −0.2, −0.6]
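The two rules can be sketched for a single component as follows. “Relatively high” is interpreted here as “higher than the component's own rating”, and the fixed adjustment step of 0.1 is an assumption; the thesis' actual conditions and step sizes may differ.

```java
// Illustrative "pessimistic" correlation of the simple algorithm:
// Rule 1 raises a component's rating when its direct callers look anomalous,
// Rule 2 lowers it when a direct callee looks even more anomalous.
public class SimpleCorrelation {

    static final double STEP = 0.1; // assumed adjustment step

    static double correlate(double own, double[] callerRatings, double[] calleeRatings) {
        double adjusted = own;
        if (callerRatings.length > 0) {
            double mean = 0;
            for (double r : callerRatings) mean += r;
            mean /= callerRatings.length;
            if (mean > own) adjusted += STEP; // Rule 1: increase the rating
        }
        if (calleeRatings.length > 0) {
            double max = calleeRatings[0];
            for (double r : calleeRatings) max = Math.max(max, r);
            if (max > own) adjusted -= STEP; // Rule 2: the cause probably lies deeper
        }
        return adjusted;
    }
}
```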

SLIDE 23

Approach Step 2: Processing

Advanced Algorithm

• Aggregation
  ◮ In addition to the arithmetic mean: median, power mean, maximum
• Correlation
  ◮ Consideration of call frequencies (edges in the CDG)
  ◮ Transitive closure of callers
  ◮ Transitive closure of callees

[Figure: calling dependency graph with nodes A–M; edges annotated with call frequencies such as 3256, 231, 564, 4612, 4312, 958]

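The use of call frequencies can be sketched as a frequency-weighted mean: ratings reached over heavily used CDG edges influence the result more than those over rarely used ones. This shows only the underlying idea, not the thesis' exact weighting formula.

```java
// Illustrative frequency-weighted aggregation: each neighbor's rating is
// weighted by the call frequency on the connecting CDG edge.
public class WeightedRating {

    static double weightedMean(double[] ratings, long[] callFrequencies) {
        double numerator = 0, denominator = 0;
        for (int i = 0; i < ratings.length; i++) {
            numerator += ratings[i] * callFrequencies[i];
            denominator += callFrequencies[i];
        }
        return numerator / denominator;
    }
}
```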

SLIDE 24

Approach Step 3: Output

Step 3: Output

1. Text output: components sorted by cause rating in descending order:

   ==================
   8.89%  persistence.sqlmapdao.ItemSqlMapDao
   7.78%  service.hessian.server.CatalogService
   6.91%  persistence.sqlmapdao.ProductSqlMapDao
   6.83%  presentation.OrderBean
   6.71%  org.apache.struts.action.ActionServlet
   6.03%  persistence.sqlmapdao.AccountSqlMapDao
   ...

Cause ratings are produced for:

◮ Deployment contexts (e.g., virtual machines)
◮ Components (e.g., Java classes)
◮ Operations (e.g., Java methods)

SLIDE 25

Approach Step 3: Output

Step 3: Output (cont’d)

2 Visualization of the application hierarchy structure

[Figure: hierarchy visualization – virtual machines, components, and operations annotated with execution counts, anomaly score, and cause rating, e.g. Virtual Machine 'klotz' [131294/212418 | 0.02 | 25.33%]]


SLIDE 26

Approach Step 3: Output

Visualization Parameters

• Hierarchy levels
• Node and edge annotations
• Color shade spectrum
• Embedded anomaly score histograms
• Additional explanations, caption, legend, ...

[Figure: example visualization spanning virtual machines ’klotz’, ’scooter’, and ’puck’ with their components and operations]


SLIDE 28

Case Study

Goals & Metrics

Goals

• Proof of concept
• Quantitative evaluation of the Correlator
• Appropriate visualization

Metrics

• Success rating: a percentage relating the highest-rated element to the element where the fault was injected.
• Clearness rating: reflects the visual impression of contrast.

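The success rating can be sketched as the injected element's cause rating relative to the highest rating in the diagnosis, in percent, so that 100% means the fault location was ranked first. This reading of the metric is an assumption; the thesis' exact formula may differ.

```java
import java.util.Map;

// Illustrative success rating: cause rating of the element where the fault
// was injected, relative to the highest-rated element, in percent.
public class SuccessRating {

    static double successPercent(Map<String, Double> causeRatings, String injectedElement) {
        double top = Double.NEGATIVE_INFINITY;
        for (double rating : causeRatings.values()) top = Math.max(top, rating);
        return 100.0 * causeRatings.get(injectedElement) / top;
    }
}
```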

SLIDE 29

Case Study Experiment Setup

Experiment Setup

• Non-trivial software system: JPetStore
• Application distributed over 4 machines
• Workload generation
  ◮ Probabilistic user behavior
  ◮ Constant number of users
• Fault injection
• Monitoring using Tpmon

[Figure: experiment setup – JMeter with Markov4JMeter generates workload for the distributed JPetStore, whose components are instrumented with Tpmon monitoring probes and injected faults; the collected data flows into Tpan's Anomaly Detector and Correlator]

SLIDE 30

Case Study Experiment Setup

Distributed JPetStore

• 4 deployment contexts + DBMS
• 34 operations are instrumented with monitoring probes

[Figure: UML deployment diagram – execution environments Presentation, Catalog, Order, and Account, each hosting the component of the same name, plus a database device hosting the OrderDatabase, CatalogDatabase, and AccountDatabase components]

SLIDE 31

Case Study Experiment Setup

Fault Injection

1. Programming faults
   ◮ Duplicated code execution
   ◮ Empty DB query result set
2. Database connection slowdown
   ◮ Thread.sleep(10)
3. Hard disk misconfiguration
   ◮ hdparm -X mdma1 /dev/hda
4. Resource-intensive processes
   ◮ “Reiner’s Fork Bomb”
5. CPU throttling
   ◮ Simulates a broken CPU cooling system
   ◮ cpufreq-set -f 800
   ◮ Clock duty cycle of 50%
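The database connection slowdown (injection 2) can be sketched as a wrapper that delays every call with Thread.sleep(10) before the real query runs, as on the slide; the wrapper interface itself is invented for illustration.

```java
// Illustrative fault injection for the DB connection slowdown: each call
// is delayed by a fixed amount before the real query executes.
public class SlowdownInjector {

    // Runs 'query' after an injected delay and returns the observed
    // response time in milliseconds.
    static long delayedCall(Runnable query, long delayMillis) {
        long start = System.nanoTime();
        try {
            Thread.sleep(delayMillis); // the injected slowdown (slide: Thread.sleep(10))
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        query.run(); // the real database call
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

Because every execution through such a wrapper carries at least the injected latency, the anomaly detector should see consistently raised response times on exactly this dependency.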

SLIDE 32

Case Study Experiment Statistics

Experiment Statistics

• 42 experiment runs + 3 fault-free runs
• 20 hours total experiment time
• 16 million monitored executions
• 100 MB of data per experiment run


SLIDE 33

Case Study Analysis

Correlator Configuration

• Algorithm selection – 3 implemented; extendable
• Algorithm parameters – 11 variables for the advanced algorithm
• Result export detail level and file names – 7 variables
• Visualization parameters – 12 feature switches, 9 color selections, 5 font settings, 4 others (30 total)
→ configured via Java properties


SLIDE 34

Case Study Results

Results: Quality of the Correlation Algorithms

No.   Injection Variant    Trivial  Simple  Advanced  Optimized
1     Programming fault 1  +        +       o         –
2     Programming fault 2  +        +       o         ++
3     Programming fault 3  o        +       o         ++
4     DB conn. slowdown 1  +        ++      o         ++
5     DB conn. slowdown 2  +        ++      +         ++
6–14  Other                o        –       –         –
1–5   Averages             2.4      2.0     2.8       1.4


SLIDE 35

Case Study Results

Results: Visualization Clearness – Trivial vs. Optimized

[Figure: side-by-side hierarchy visualizations of the trivial and the optimized configuration across virtual machines ’tier’, ’klotz’, ’scooter’, and ’puck’ with their components]


SLIDE 36

Case Study Results

Results

[Figure: clearness plotted against the neighborhood mean distance exponent (0.1–10.0) for fault scenario 5 (DB connection slowdown) and fault scenario 3 (programming fault), on deployment context, component, and operation level; bar charts compare the clearness of the trivial, simple, advanced, and optimized configurations]


SLIDE 37

Conclusions Summary

Summary & Conclusions

Summary
• New approach for failure diagnosis
• Implementation and evaluation of correlation algorithms
• Visualization of the results (vector graphics export)

Conclusions
• Good chance of pointing to the right cause
• Small risk of reporting false positives
• At least large parts of the system are declared fault-free
• The simpler algorithms show the effect rather than the cause


SLIDE 38

Conclusions Future Work

Future Work

• Continuous analysis and visualization (“Software-Betriebsleitstand”) [Giesecke et al., 2006] (in progress)
• Application to industrial system data (in progress)
• Recognition of known anomaly patterns learned from training data
• Improved interactive user interface


SLIDE 39

Bibliography

Bibliography

Avižienis, A. A., Laprie, J.-C., Randell, B., and Landwehr, C. E. Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, 1(1):11–33, 2004.

Giesecke, S., Rohr, M., and Hasselbring, W. Software-Betriebs-Leitstände für Unternehmensanwendungslandschaften. In Proceedings of the Workshop “Software-Leitstände: Integrierte Werkzeuge zur Softwarequalitätssicherung”, volume P-94 of Lecture Notes in Informatics, pages 110–117. Gesellschaft für Informatik, October 2006.

Rohr, M. Workload-sensitive Timing Behavior Anomaly Detection for Automatic Software Failure Diagnosis. PhD thesis, Department of Computing Science, University of Oldenburg, Oldenburg, Germany, 2008. Work in progress.

Rohr, M., van Hoorn, A., Matevska, J., Sommer, N., Stoever, L., Giesecke, S., and Hasselbring, W. Kieker: Continuous monitoring and on demand visualization of Java software behavior. In Claus Pahl, editor, Proceedings of the IASTED International Conference on Software Engineering 2008 (SE 2008), pages 80–85, Anaheim, February 2008. ACTA Press.

van Hoorn, A., Rohr, M., and Hasselbring, W. Generating probabilistic and intensity-varying workload for web-based software systems. In Samuel Kounev, Ian Gorton, and Kai Sachs, editors, Performance Evaluation – Metrics, Models and Benchmarks: Proceedings of the SPEC International Performance Evaluation Workshop (SIPEW ’08), volume 5119 of Lecture Notes in Computer Science (LNCS), pages 124–143, Heidelberg, June 2008. Springer.


SLIDE 40

Thanks

Thanks for your attention.


SLIDE 41

Fault Injection Requirements

Requirements for Fault Injection

• Noticeable effect
• No administrative messages
• Diversity in position: all hierarchy levels
• Realistic and repeatable
• Increasing & decreasing the response times
