SLIDE 1

Automatic Failure Diagnosis Support in Distributed Large-Scale Software Systems based on Timing Behavior Anomaly Correlation

Presentation at the 13th European Conference on Software Maintenance and Reengineering

Nina Marwede (1), Matthias Rohr (1), André van Hoorn (2), Wilhelm Hasselbring (3)

(1) BTC Business Technology Consulting AG, Germany
(2) Graduate School TrustSoft, University of Oldenburg, Germany
(3) Software Engineering Group, University of Kiel, Germany

Contact: matthias.rohr@btc-ag.com

March 25, 2009

Matthias Rohr (BTC AG) Failure Diagnosis based on Timing Behavior 25.03.2009 CSMR Kaiserslautern 1 / 25


SLIDE 5

Motivation

[Figure: users and administrators interacting with a complex software system]

  • Complex software systems are almost never free of faults.
  • Software faults are a major cause of system failures [Küng and Krause, 2007; Gray, 1986].
  • Manual failure diagnosis is time-consuming and error-prone:
      • Huge number of program states (space and time) [Cleve and Zeller, 2005]
      • Temporal and spatial chasms between cause and symptom [Eisenstadt, 1997]
      • Many systems are not completely known by any single person
      • Some failures are hard to reproduce, e.g., Heisenbugs

SLIDE 6

Motivation

Most common failure diagnosis methods [Eisenstadt, 1997]:

  • Data gathering (e.g., adding print statements to the source code, memory dumps)
  • Interactive execution using debugging tools

SLIDE 7

Motivation

Strategy to support failure diagnosis

  • Runtime behavior is indicative of failures and error propagation.
  • Automatic fault localization using anomaly detection on monitoring data.
  • Analysis and visualization in the context of automatically derived architecture models.

SLIDE 8

Outline

1 Motivation
2 Foundations
3 Approach
4 Case Study
5 Summary & Conclusions


SLIDE 10

Foundations

Online failure diagnosis based on anomaly detection

Anomalies

  • Anomalies are deviations from normal system behavior.

[Figure: system influences, system behavior, and anomaly detection, shown for the whole system and per component]

Fault localization activities

  • Anomaly detection
  • Anomaly correlation (often plain aggregation)
  • Visualization and/or reporting


SLIDE 12

Foundations

Propagation and Anomaly Detection

Error propagation

[Figure: components :A and :B inside a system; a fault (dormant / active) leads to an error, which leads to a (system service) failure]

  • Many errors propagate along calling dependencies.

Anomaly correlation

  • Anomalies propagate as well, so a compensating analysis is required.
  • Some approaches analyze anomalies in the context of calling dependency graphs.

SLIDE 13

Foundations

Dependency Graphs

[Figure: example calling dependency graph with the nodes $, ActionServlet, CatalogBean, and CartBean and the edge weights 250, 113, and 210]

Calling Dependency Graphs

  • Nodes: e.g., operations, components, deployment contexts, virtual machines
  • Directed edges represent call actions
  • Weights quantify call frequencies
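A minimal sketch of such a graph, assuming the monitoring data can be reduced to (caller, callee) pairs. The function names and the sample call records are hypothetical, and "$" is assumed here to stand for the entry point outside the system.

```python
from collections import Counter


def build_cdg(calls):
    """Build a weighted calling dependency graph.

    calls: iterable of (caller, callee) pairs, one per observed call.
    Returns a dict mapping each directed edge (caller, callee) to its
    weight, i.e. the number of times that call was observed.
    """
    return dict(Counter(calls))


def callees(cdg, node):
    """Nodes that `node` calls directly."""
    return sorted({callee for (caller, callee) in cdg if caller == node})


def callers(cdg, node):
    """Nodes that call `node` directly."""
    return sorted({caller for (caller, callee) in cdg if callee == node})


# Hypothetical call records (not taken from the case study).
calls = [
    ("$", "ActionServlet"),
    ("$", "ActionServlet"),
    ("ActionServlet", "CatalogBean"),
    ("ActionServlet", "CartBean"),
    ("ActionServlet", "CatalogBean"),
]
cdg = build_cdg(calls)
print(cdg[("ActionServlet", "CatalogBean")])
print(callees(cdg, "ActionServlet"))
```

The same structure works at any node granularity (operation, component, virtual machine) as long as the pairs are recorded at that granularity.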


SLIDE 15

Approach

Overview

SLIDE 16

Approach

Input Data

1 Calling dependencies between operations
2 Anomaly scores provided by a timing behavior anomaly detector

Comp  VM  Start  RT  Anomaly
...
A     X   0001   8    0.6
C     Y   0002   1   −0.2
B     X   0004   4    0.9
C     Y   0006   2    0.3
...

[Figure: calling dependencies between the components A, B, and C]
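A sketch of how such records could be read and grouped, assuming one whitespace-separated record per line in the column order shown on the slide (Comp, VM, Start, RT, Anomaly). The class name, field names, and functions are hypothetical; the four records are the ones from the slide.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Execution:
    component: str
    vm: str
    start: int          # start time, unit not stated on the slide
    response_time: int  # RT column, unit not stated on the slide
    anomaly: float      # anomaly score from the detector


def parse_records(text):
    """Parse one record per line; blank lines are skipped."""
    records = []
    for line in text.strip().splitlines():
        comp, vm, start, rt, anomaly = line.split()
        records.append(Execution(comp, vm, int(start), int(rt), float(anomaly)))
    return records


def scores_by_component(records):
    """Collect the anomaly scores of each component in arrival order."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record.component].append(record.anomaly)
    return dict(grouped)


SAMPLE = """
A X 0001 8 0.6
C Y 0002 1 -0.2
B X 0004 4 0.9
C Y 0006 2 0.3
"""
records = parse_records(SAMPLE)
grouped = scores_by_component(records)
print(grouped)
```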

SLIDE 17

Approach

Architectural model creation

[Figure: calling dependency graph (class granularity) for iBATIS JPetStore; nodes are classes such as ActionServlet, RequestProcessor, CatalogBean, CatalogService, and ProductSqlMapDao, and edges are labeled with call frequencies]

Two alternative methods for creating the CDG:

  • Analysis of monitoring data
  • Static (source code) analysis


SLIDE 19

Approach

Aggregation and integration into the architectural model

Approach

  • Each architectural element's anomaly scores are aggregated into a single value
      • Several metrics explored (mean, median, power mean, ...)
  • The aggregation reduces the complexity of the correlation activity

[Figure: distribution of anomaly scores (x-axis: anomaly score, y-axis: number of executions), labeled with the value 0.2]

Example result: three operations with assigned anomaly scores
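The aggregation step can be sketched as follows. The slides name the metrics but not their exact definitions, so the sign-preserving power mean and its exponent are assumptions, as are all function names and the sample scores.

```python
import math
from statistics import mean, median


def signed_power_mean(scores, p=3.0):
    """Power mean that keeps the sign of each score.

    Assumption: scores lie in [-1, 1], so a plain power mean is not
    defined for every exponent; the sign is therefore handled separately.
    """
    powered = [math.copysign(abs(s) ** p, s) for s in scores]
    m = mean(powered)
    return math.copysign(abs(m) ** (1.0 / p), m)


AGGREGATORS = {
    "mean": mean,
    "median": median,
    "power_mean": signed_power_mean,
}


def aggregate(scores_by_element, metric="mean"):
    """Reduce the list of scores of each element to a single value."""
    agg = AGGREGATORS[metric]
    return {element: agg(scores) for element, scores in scores_by_element.items()}


# Hypothetical scores for two elements.
scores = {"A": [0.6, 0.2], "B": [0.9, 0.9, -0.3]}
for metric in AGGREGATORS:
    print(metric, aggregate(scores, metric))
```

With an exponent above 1, the power mean weights strongly anomalous executions more heavily than the arithmetic mean does, which is one possible reason to explore it alongside mean and median.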


SLIDE 21

Approach

Correlation of anomaly ratings

Approach

  • Rules are applied that recompute an element's anomaly score in the context of its callers and callees
      • Similar to a cellular automaton
  • The rules encapsulate error and anomaly propagation knowledge

Example scenario: Is A's anomaly score just the result of a fault in B?


SLIDE 23

Approach

Rules

  • Rule 1: Is the mean of the anomaly ratings of the directly connected callers relatively high? ⇒ Increase rating
  • Rule 2: Is the maximum of the anomaly ratings of the directly connected callees relatively high? ⇒ Decrease rating

[Figure: example calling dependency graph with the nodes A to G and the anomaly ratings 1.0, 0.5, 0.2, −0.7, −0.6, −0.2, −0.6]

SLIDE 24

Approach

Rules

Additional rules:

  • Consideration of call frequencies (edges in CDG)
  • Transitive closure of callers
  • Transitive closure of callees
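A rough sketch of Rule 1 and Rule 2 as one synchronous pass over the graph, in the spirit of a cellular automaton. The slides do not define "relatively high" or the size of the adjustment, so both are assumptions here: a neighbor value counts as relatively high if it exceeds the mean rating of the whole graph, and ratings move by a fixed step. The additional rules are not modeled, and all names and sample values are hypothetical.

```python
def correlate(ratings, edges, step=0.25):
    """Apply Rule 1 and Rule 2 once to every node.

    ratings: dict node -> anomaly rating in [-1, 1]; must cover every
             node that appears in `edges`
    edges:   iterable of (caller, callee) pairs
    All rules read the old ratings, so the update order does not matter.
    """
    callers = {node: [] for node in ratings}
    callees = {node: [] for node in ratings}
    for caller, callee in edges:
        callees[caller].append(callee)
        callers[callee].append(caller)

    # Assumed reference point for "relatively high".
    reference = sum(ratings.values()) / len(ratings)

    updated = {}
    for node, rating in ratings.items():
        new_rating = rating
        # Rule 1: anomalous callers suggest this node may be their cause.
        if callers[node]:
            caller_mean = sum(ratings[c] for c in callers[node]) / len(callers[node])
            if caller_mean > reference:
                new_rating += step
        # Rule 2: an anomalous callee suggests this node only shows a
        # propagated effect.
        if callees[node]:
            callee_max = max(ratings[c] for c in callees[node])
            if callee_max > reference:
                new_rating -= step
        updated[node] = max(-1.0, min(1.0, new_rating))
    return updated


# Hypothetical scenario: A calls B and C, and the fault is in B.
ratings = {"A": 0.5, "B": 0.9, "C": -0.6}
edges = [("A", "B"), ("A", "C")]
result = correlate(ratings, edges)
print(result)
```

In this scenario A's rating drops because its callee B looks anomalous, while B's rating rises because its caller A does, which matches the example question of whether A's score is just the result of a fault in B.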

SLIDE 25

Approach Visualization

Visualization - Three visualization granularity levels

Granularity levels:

  • Deployment context level / virtual machine level
  • Component level
  • Operation level

[Figure: the calling dependency graph drawn at deployment context level (virtual machines 'tier', 'scooter', 'puck', and 'klotz'), at component level (e.g., org.apache.struts.action.ActionServlet, presentation.CatalogBean, service.hessian.client.CatalogService), and at operation level (e.g., getItemListByProduct(String)); nodes carry bracketed annotations such as [ 9167/9167 | 0,995 | 0,488 | 4,69% ] and edges carry call frequencies]
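One way to obtain ratings for the coarser levels is to roll up operation-level ratings along the containment hierarchy. The sketch below assumes a plain mean per group, which the slides do not confirm; the function name and the ratings are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def roll_up(operation_ratings, level):
    """Aggregate operation ratings to a coarser granularity level.

    operation_ratings: dict (vm, component, operation) -> rating
    level: 1 = virtual machine, 2 = component, 3 = operation
    Returns a dict keyed by the first `level` parts of each key.
    """
    grouped = defaultdict(list)
    for key, rating in operation_ratings.items():
        grouped[key[:level]].append(rating)
    return {key: mean(values) for key, values in grouped.items()}


# Hypothetical operation-level ratings.
ratings = {
    ("klotz", "presentation.CatalogBean", "viewProduct()"): 0.8,
    ("klotz", "presentation.CatalogBean", "viewItem()"): 0.2,
    ("tier", "service.hessian.server.CatalogService", "getProduct(String)"): -0.2,
}
print(roll_up(ratings, 2))
print(roll_up(ratings, 1))
```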

SLIDE 26

Approach Visualization

Visualization - Deployment context / Virtual Machine level

[Figure: calling dependency graph at deployment context / virtual machine level]

SLIDE 27

Approach Visualization

Component level

[Figure: component-level view across the virtual machines 'tier', 'klotz', 'scooter', and 'puck', showing org.apache.struts.action.ActionServlet, presentation.OrderBean, presentation.CatalogBean, presentation.CartBean, presentation.AccountBean, the client and server variants of service.hessian CatalogService, OrderService, and AccountService, and persistence.sqlmapdao ItemSqlMapDao, ProductSqlMapDao, OrderSqlMapDao, and AccountSqlMapDao]

SLIDE 28

Approach Visualization

Operation level

[Figure: operation-level view of virtual machine 'klotz' with the components org.apache.struts.action.ActionServlet, presentation.CatalogBean, presentation.AccountBean, presentation.CartBean, and service.hessian.client.CatalogService; operation nodes carry bracketed annotations, e.g., viewProduct() [ 9045/9045 | 0,786 | 0,786 | 5,01% ], and edges carry call frequencies]


SLIDE 30

Case Study

Goals & Metrics

Goals

  • Proof of concept
  • Quantitative evaluation
  • Visualization evaluation

Metrics

  • Accuracy: Are injected faults accurately localized?
  • Clearness: Are the results ranked clearly (with sufficient contrast)?

SLIDE 31

Case Study

Experiment Setup

  • Distributed variant of iBATIS JPetStore (5 nodes)
  • 34 operations are instrumented with monitoring probes
  • Workload generation
      • Probabilistic user behavior
  • Fault injection
      • Programming faults
      • Database connection slowdown
      • Hard disk misconfiguration
      • Resource-intensive concurrent processes
      • CPU throttling

SLIDE 32

Case Study Results

Results: Experiment statistics and fault localization quality

Experiment statistics

  • 42 experiment scenarios
  • 20 hours total experiment time
  • 16 million monitored executions
  • 100 MB data per experiment run

Fault localization quality (Accuracy and Clearness)

Scenario  Injection     "Trivial"  "Simple"  "Advanced"
No. 1     Progr. fault  +          +         +
No. 2     Progr. fault  +          +         ++
No. 3     Progr. fault  •          +
No. 4     DB slowdown   +          ++        ++
No. 5     DB slowdown   •          +         ++
Averages                3.4        3.8       4.6

SLIDE 33

Case Study Results

Visualization Clearness: No correlation vs. our approach

[Figure: two component-level views of the same graph (virtual machines 'klotz', 'tier', 'scooter', and 'puck' with their components), one without correlation and one with the presented approach]


SLIDE 35

Summary & Conclusions

Issues

  • Number of monitoring points:
      • Too few: architecture and its dependencies are not discovered
      • Too many: large overhead
      • Trade-off: major component services and entry points
  • Monitoring overhead:
      • Approximately a few microseconds per observation
  • Maintainability:
      • Approach automatically adapts to architectural changes
      • Non-intrusive monitoring instrumentation
  • Anomaly detector requirements:
      • False alarms (false positives) can be tolerated if they are equally distributed over the architecture
  • Computational requirements:
      • 35,000 executions/sec on a 1.5 GHz desktop

SLIDE 36

Summary & Conclusions

Summary

  • New approach for failure diagnosis (focus on correlation and visualization)
  • Evaluation of accuracy and clearness of correlation algorithms
  • Case study with a distributed web application, fault injection, and probabilistic workload

Conclusions

  • Good chance of localizing the fault
  • Large parts of the system are ruled out as the cause of a fault
  • Approaches without correlation show a fault's effect, not its origin
  • Multi-granularity visualization is required even for small systems

SLIDE 37

Thanks

Questions?

SLIDE 38

Bibliography

Holger Cleve and Andreas Zeller. Locating causes of program failures. In Proceedings of the 27th International Conference on Software Engineering (ICSE'05), pages 342–351. ACM Press, May 2005. ISBN 1595939632.

Marc Eisenstadt. My hairiest bug war stories. Commun. ACM, 40(4):30–37, 1997. ISSN 0001-0782. doi:10.1145/248448.248456.

Simon Giesecke, Matthias Rohr, and Wilhelm Hasselbring. Software-Betriebs-Leitstände für Unternehmensanwendungslandschaften. In Proceedings of the Workshop "Software-Leitstände: Integrierte Werkzeuge zur Softwarequalitätssicherung", volume P-94 of Lecture Notes in Informatics, pages 110–117. Gesellschaft für Informatik, October 2006. ISBN 978-3-88579-188-1.

Jim Gray. Why do computers stop and what can be done about it? In Proceedings of Symposium on Reliability in Distributed Software and Database Systems (SRDS-5), pages 3–12. IEEE, 1986.

Peter Küng and Heinrich Krause. Why do software applications fail and what can software engineers do about it? A case study. In Proceedings IRMA Conference: Managing Worldwide Operations and Communications with Information Technology, pages 319–322. IGI Publishing, 2007. ISBN 978-1-59904-929-8.

Matthias Rohr, André van Hoorn, Jasminka Matevska, Nils Sommer, Lena Stoever, Simon Giesecke, and Wilhelm Hasselbring. Kieker: Continuous monitoring and on demand visualization of Java software behavior. In Proceedings of the IASTED International Conference on Software Engineering 2008, pages 80–85. ACTA Press, February 2008. ISBN 978-0-88986-715-4.


SLIDE 40

Monitoring of System Behavior

Characteristics monitored by online fault localization approaches

  • Hardware platform (e.g., CPU, network, memory)
  • Operating system and middleware (e.g., OS services, application server, virtual machine)
  • Internal software application behavior:
      • Operation execution sequences (control flow)
      • Response times (end-to-end and of single software operations)
      • Application output

Monitoring Framework Kieker [Rohr et al., 2008]

[Figure: Kieker architecture with a software system with monitoring instrumentation (monitoring points M) and a database; the components :Tpmon, :TpmonControl, and :Tpan; the analysis components :SequenceAnalysis, :DependencyAnalysis, :TimingAnalysis, and :ExecutionModelAnalysis; outputs include timing diagrams, Markov chains, dependency graphs, and sequence diagrams]

SLIDE 41

Future Work

Open and Future Work

Future work:

  • Field studies on accuracy and clearness
  • Evaluation of the visualization method (field or lab study)
      • Are three architectural levels better than two or four?
  • Evaluation of the complete approach
      • What is the benefit in terms of repair time reduction?
  • Continuous analysis and visualization ("Leitstand" / "Cockpit") [Giesecke et al., 2006]

SLIDE 42

Future Work

Fault Injection

1 Programming faults
      • Duplicated code execution
      • Empty DB query result set
2 Database connection slowdown
      • Thread.sleep(10)
3 Hard disk misconfiguration
      • hdparm -X mdma1 /dev/hda
4 Resource-intensive concurrent processes
5 CPU throttling
      • Simulation of a broken CPU cooling system
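The database connection slowdown is injected in the case study with Java's Thread.sleep(10), i.e. a 10 ms delay. A comparable injection can be sketched in Python as a decorator; this is an illustration of the idea rather than the instrumentation used in the experiments, and inject_delay and run_query are hypothetical names.

```python
import functools
import time


def inject_delay(delay_ms=10, enabled=lambda: True):
    """Decorator that delays every call to the wrapped function.

    delay_ms: injected delay in milliseconds
    enabled:  callable that decides per call whether to inject the fault,
              so the slowdown can be switched on and off during a run
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if enabled():
                time.sleep(delay_ms / 1000.0)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_delay(delay_ms=10)
def run_query(sql):
    # Hypothetical stand-in for a real database call.
    return f"result of {sql}"


start = time.perf_counter()
result = run_query("SELECT 1")
elapsed = time.perf_counter() - start
print(result, f"{elapsed * 1000:.1f} ms")
```

Because the delay sits in front of the real call, the response time of every caller up the call chain grows as well, which is the propagation effect the correlation rules are meant to compensate for.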