
Fault Diagnosis of Software Systems
Rui Abreu, Dept. of Informatics Engineering, Faculty of Engineering, University of Porto
Thanks: Peter Zoeteweij, Tom Janssen, Arjan J.C. van Gemund, Johan de Kleer, Wolfgang Mayer


  1. 44 Fault diagnosis. 1. Spectra for N test cases and M components:

         x_11  x_12  …  x_1M | e_1
         x_21  x_22  …  x_2M | e_2
          …     …    …   …   |  …
         x_N1  x_N2  …  x_NM | e_N

  2. 45 Fault diagnosis. 1. Spectra for N test cases. Row i: the blocks that are executed in test case i.

  3. 46 Fault diagnosis. 1. Spectra for N test cases. Column j: the test cases in which block j was executed.

  4. 47 Fault diagnosis. 1. Spectra for N test cases. 2. Error detection per test case: e_i = 1 if an error is observed in the i-th test, e_i = 0 if not.

  5. 48 Statistics-based Fault diagnosis. Compare the column vector of every block j with the error vector; this yields a similarity s_j per block.

  6. 49 Statistics-based Fault diagnosis. Jaccard similarity coefficient: s_j = n_11 / (n_11 + n_10 + n_01), where n_11 counts the failed runs that execute block j, n_10 the passed runs that execute it, and n_01 the failed runs that do not execute it.

  7. 50 Statistics-based Fault diagnosis. Jaccard similarity coefficient: s_j = n_11 / (n_11 + n_10 + n_01).

  8. 51 Statistics-based Fault diagnosis. For the example block, n_11 = 2: s_j = 2 / (2 + n_10 + n_01).

  9. 52 Statistics-based Fault diagnosis. With n_10 = 1: s_j = 2 / (2 + 1 + n_01).

  10. 53 Statistics-based Fault diagnosis. With n_01 = 1: s_j = 2 / (2 + 1 + 1) = 0.5.

  11. 54 Statistics-based Fault diagnosis. For every block, compute the similarity with the error “block” (the error vector) over the N test cases, yielding s_1, s_2, …, s_M. The component with the highest s_j most likely contains the fault.

  12. 55 Statistics-based Fault diagnosis. Worked example, s = n_11 / (n_11 + n_10 + n_01); component d ranks highest (see the sketch below, which reproduces these values):

         component   a   b   c   d   e   f   g   fail
         test 1      0   1   1   1   1   0   0    0
         test 2      0   0   0   1   0   1   1    1
         test 3      1   1   1   1   0   0   0    1
         test 4      0   0   0   0   0   0   0    0
         test 5      1   1   0   1   1   0   1    1
         s           ⅔   ½   ¼   ¾   ¼   ⅓   ⅔
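A minimal C sketch (an addition to this transcript, not code from the slides or the Zoltar toolset) that computes the Jaccard coefficient for each component of the worked example above; running it prints 0.67, 0.50, 0.25, 0.75, 0.25, 0.33, 0.67 for components a to g, matching the table.

    /* jaccard.c - Jaccard similarity per component for the worked example */
    #include <stdio.h>

    #define N 5   /* test cases      */
    #define M 7   /* components a..g */

    int main(void) {
        /* hit spectra: x[i][j] = 1 iff component j was executed in test i */
        int x[N][M] = {
            {0, 1, 1, 1, 1, 0, 0},   /* test 1 (passed) */
            {0, 0, 0, 1, 0, 1, 1},   /* test 2 (failed) */
            {1, 1, 1, 1, 0, 0, 0},   /* test 3 (failed) */
            {0, 0, 0, 0, 0, 0, 0},   /* test 4 (passed) */
            {1, 1, 0, 1, 1, 0, 1},   /* test 5 (failed) */
        };
        int e[N] = { 0, 1, 1, 0, 1 };              /* error vector */

        for (int j = 0; j < M; j++) {
            int n11 = 0, n10 = 0, n01 = 0;
            for (int i = 0; i < N; i++) {
                if (x[i][j] && e[i])   n11++;      /* executed, run failed     */
                if (x[i][j] && !e[i])  n10++;      /* executed, run passed     */
                if (!x[i][j] && e[i])  n01++;      /* not executed, run failed */
            }
            double s = (n11 + n10 + n01) ? (double)n11 / (n11 + n10 + n01) : 0.0;
            printf("component %c: s = %.2f\n", 'a' + j, s);
        }
        return 0;
    }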

  13. 56 Example: rational bubble sort

        void RationalSort( int n, int *num, int *den ) {
            int i, j;                                    /* block 1 */
            for ( i = n-1; i >= 0; i-- ) {
                assert( den[i] != 0 );                   /* block 2 */
                for ( j = 0; j < i; j++ ) {
                    if ( RationalGT( num[j], den[j],     /* block 3 */
                                     num[j+1], den[j+1] ) ) {
                        swap( &num[j], &num[j+1] );      /* block 4 */
                        /* swap( &den[j], &den[j+1] ); */
                    }
                }
            }
        }

     Fault: forgot to swap the denominators.
     Error: the sequence is not a permutation of the input sequence.
     Failure: the output is not a sorted version of the input.
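To connect this example to the spectra above, here is a sketch of how a block-hit row of the spectrum matrix could be recorded for one execution of RationalSort. The manual instrumentation, the hit array, and the RationalSortInstrumented name are illustrative assumptions of this transcript (the Zoltar toolset instruments code automatically via LLVM), and RationalGT and swap are assumed to be linked in as in the slide.

    #include <assert.h>
    #include <string.h>

    #define NBLOCKS 4
    static int hit[NBLOCKS + 1];        /* hit[1..4]: one flag per block */

    extern int RationalGT(int na, int da, int nb, int db);
    extern void swap(int *a, int *b);

    void RationalSortInstrumented(int n, int *num, int *den) {
        int i, j;
        memset(hit, 0, sizeof hit);     /* start a fresh row for this test case */
        hit[1] = 1;                                           /* block 1 ran */
        for (i = n - 1; i >= 0; i--) {
            hit[2] = 1;                                       /* block 2 ran */
            assert(den[i] != 0);
            for (j = 0; j < i; j++) {
                hit[3] = 1;                                   /* block 3 ran */
                if (RationalGT(num[j], den[j], num[j+1], den[j+1])) {
                    hit[4] = 1;                               /* block 4 ran */
                    swap(&num[j], &num[j+1]);
                    /* swap(&den[j], &den[j+1]); */           /* the missing swap */
                }
            }
        }
        /* hit[1..4] is now the row x_i1 .. x_i4 of the spectrum for this test */
    }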

  14. 57 Example (2). [Trace of RationalSort on the earlier example: the numerators are reordered while the denominators stay in place, so the resulting sequence is not a permutation of the input. ERROR!]

  15. 58 Example (3) Block 4 has highest similarity coefficient -> most likely suspect

  16. 59 Reasoning-based Fault Diagnosis
     • MBD (model-based diagnosis)
       – Reasoning approach based on behavioral component models
       – High(er) diagnostic accuracy
       – Prohibitive (modeling and/or diagnosis) cost
     • SFL (spectrum-based fault localization)
       – Statistical approach based on execution spectra
       – Lower diagnostic accuracy: cannot reason over multiple faults
       – No modeling (except the test oracle) and low diagnosis cost

  17. 60 Idea: Extend SFL with MBD
     • Combine the best of both worlds
     • MBD
       – Reasoning approach based on behavioral component models
       – High(er) diagnostic accuracy
       – Prohibitive (modeling and/or diagnosis) cost
     • SFL
       – Statistical approach based on execution spectra
       – Lower diagnostic accuracy: cannot reason over multiple faults
       – No modeling (except the test oracle) and low diagnosis cost

  18. 61 Working Example

  19. 62 SFL: TARANTULA

  20. 63 Reasoning

  21. 64 Reasoning

  22. 65 Reasoning

  23. 66

  24. 67 Ranking Candidates
     • Probabilities are updated according to Bayes’ rule:
           Pr(d_k | e) = Pr(e | d_k) · Pr(d_k) / Pr(e)
       where Pr(e | d_k) is the probability of the observations e given candidate d_k (the ε-policies on the next slide) and Pr(d_k) is the candidate’s prior probability.

  25. 68 Ranking Candidates
     • Many ε-policies exist
       – Ideally, e.g.: ε = (1 − h_1) · (1 − h_2) · (1 − h_1) · (1 − h_2) · h_1 · h_2
       – But estimating h_j is far from trivial, hence approximations have been used so far (BAYES-A, [Abreu et al., WODA’08])

  26. 69 Barinel
     • Barinel’s key idea: for each candidate d_k, compute the h_j of the candidate’s faulty components that maximize Pr(e | d_k), the probability of the observations e occurring conditioned on candidate d_k.

  27. 70 Barinel Algorithm

         c1  c2  c3 | e
          1   1   0 | 1 (F)
          0   1   1 | 1 (F)
          1   0   0 | 1 (F)
          1   0   1 | 0 (P)

     1. Compute the set of valid diagnosis candidates: D = {d1 = {1,2}, d2 = {1,3}}
     2. Derive Pr(e|d):
        • Pr(e|d1) = (1 − h1·h2) · (1 − h2) · (1 − h1) · h1
        • Pr(e|d2) = (1 − h1) · (1 − h3) · (1 − h1) · h3 · h1

  28. 71 Barinel Algorithm
     3. Compute the h_j that maximize Pr(e|d)
        – Maximum likelihood estimation, via a gradient ascent procedure
        – Pr(e|d1): h1 = 0.47, h2 = 0.19 => Pr(d1) = 0.19
        – Pr(e|d2): h1 = 0.41, h3 = 0.50 => Pr(d2) = 0.04
     4. Rank the candidates according to Pr(d)
        – D = <{1,2}, {1,3}>
        – Inspection starts with components 1 and 2
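As an illustration of step 2, a minimal C sketch (an addition to this transcript, not the authors’ Barinel implementation) that evaluates Pr(e|d) for a candidate given per-component health values h_j, i.e. the probability that component j behaves correctly when executed; the maximization of step 3 is not shown. For d1 = {c1, c2} it evaluates the product (1 − h1·h2)(1 − h2)(1 − h1)·h1 from the previous slide.

    #include <stdio.h>

    #define N 4   /* observed runs     */
    #define M 3   /* components c1..c3 */

    /* Pr(e|d): in each run, the candidate components that were executed all
     * behave correctly with probability prod = product of their h[j]; a passing
     * run contributes prod, a failing run contributes (1 - prod). */
    double pr_e_given_d(int x[N][M], const int e[N],
                        const int *d, int dlen, const double h[M]) {
        double pr = 1.0;
        for (int i = 0; i < N; i++) {
            double prod = 1.0;
            for (int k = 0; k < dlen; k++)
                if (x[i][d[k]])
                    prod *= h[d[k]];
            pr *= e[i] ? (1.0 - prod) : prod;
        }
        return pr;
    }

    int main(void) {
        int x[N][M] = { {1,1,0}, {0,1,1}, {1,0,0}, {1,0,1} };   /* spectrum   */
        int e[N]    = { 1, 1, 1, 0 };                           /* F, F, F, P */
        int d1[]    = { 0, 1 };                   /* candidate {c1, c2}       */
        double h[M] = { 0.47, 0.19, 0.0 };        /* health values from above */
        printf("Pr(e|d1) = %.2f\n", pr_e_given_d(x, e, d1, 2, h));
        return 0;
    }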

  29. 72 LIVE DEMO
     • Requirements: the Zoltar toolset (www.fdir.org/zoltar), LLVM, OS: Linux
     • Zoltar will soon also be available as an Eclipse plugin, with support for Java
     • T. Janssen, R. Abreu, and A.J.C. van Gemund. Zoltar: A Toolset for Automatic Fault Localization. In Proceedings of the 24th International Conference on Automated Software Engineering (ASE’09), Tools Track, pp. 662-664, Auckland, New Zealand, November 2009. IEEE Computer Society. (Best Demo Award)

  30. 73 Model-based vs. Spectrum-based
     • Model-based
       – Model used primarily for reasoning
       – All generated explanations are valid
       – The most likely diagnosis need not be the actual cause
       – Well suited for hardware
     • Spectrum-based
       – Model used primarily for error detection
       – The ranking may contain invalid explanations
       – Invalid explanations may rank high
       – Well suited for software

  31. 74 Outline • Part I – Diagnosis principles – Model-Based Diagnosis – Spectrum-Based Fault Localization – Live Demo • Part II – Existing systems – Lessons learned – Case studies – Further applications – Related work

  32. 75 Existing applications • PinPoint : large on-line transaction processing systems (search engines, web mail) [Chen02] • Tarantula : visualizing test information to aid manual debugging [Jones02] • Ochiai [TAIC PART07; JSS09] • Barinel [ASE09] • …

  33. 76 Similarity Coefficients
     • Jaccard (PinPoint): s_j = n_11 / (n_11 + n_10 + n_01)
     • Tarantula: s_j = (n_11 / (n_11 + n_01)) / (n_11 / (n_11 + n_01) + n_10 / (n_10 + n_00))
     • Ochiai (from molecular biology): s_j = n_11 / √((n_11 + n_01) · (n_11 + n_10))
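For reference, a minimal C sketch (an addition to this transcript) of the three coefficients as functions of the counters n11 (block executed, run failed), n10 (executed, passed), n01 (not executed, failed) and n00 (not executed, passed); divisions by zero are not guarded.

    #include <stdio.h>
    #include <math.h>

    /* Jaccard: n00 does not appear in the formula */
    double jaccard(double n11, double n10, double n01) {
        return n11 / (n11 + n10 + n01);
    }

    /* Tarantula: fraction of failed runs hitting the block, normalized by the
     * sum of the failed-run and passed-run hit fractions */
    double tarantula(double n11, double n10, double n01, double n00) {
        double fail_rate = n11 / (n11 + n01);
        double pass_rate = n10 / (n10 + n00);
        return fail_rate / (fail_rate + pass_rate);
    }

    /* Ochiai: geometric-mean-style denominator */
    double ochiai(double n11, double n10, double n01) {
        return n11 / sqrt((n11 + n01) * (n11 + n10));
    }

    int main(void) {
        /* counters of component b from the worked example: 2, 1, 1, 1 */
        printf("jaccard=%.2f tarantula=%.2f ochiai=%.2f\n",
               jaccard(2, 1, 1), tarantula(2, 1, 1, 1), ochiai(2, 1, 1));
        return 0;
    }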

  34. 77 Fault diagnosis. Jaccard similarity coefficient: s_j = n_11 / (n_11 + n_10 + n_01).

  35. 78 Fault diagnosis. Jaccard similarity coefficient: s_j = n_11 / (n_11 + n_10 + n_01).

  36. 79 Fault diagnosis. For the example block, n_11 = 2: s_j = 2 / (2 + n_10 + n_01).

  37. 80 Fault diagnosis. With n_10 = 1: s_j = 2 / (2 + 1 + n_01).

  38. 81 Fault diagnosis. With n_01 = 1: s_j = 2 / (2 + 1 + 1) = 0.5.

  39. 82 Diagnostic quality • Measured as the percentage of blocks that need not be inspected when the ranking is followed.

  40. 83 Discussion • Under the specific conditions of our experiment, Ochiai outperforms 8 other coefficients. • Why? • To what extent does this depend on the conditions of our experiment? – Quality of the passed / failed information – Number of runs – Artificial bugs in the Siemens set

  41. 84 Ochiai outperforms Tarantula. Tarantula’s coefficient is
         (n_11 / (n_11 + n_01)) / (n_11 / (n_11 + n_01) + n_10 / (n_10 + n_00))
     For n_11 > 0 this can be rewritten as
         1 / (1 + (n_10 / (n_10 + n_00)) · ((n_11 + n_01) / n_11))
       = 1 / (1 + c · n_10 / n_11), with c = (n_11 + n_01) / (n_10 + n_00) = N_F / N_P
     where N_F and N_P are the total numbers of failed and passed runs, the same for every block.

  42. 85 Ochiai outperforms Tarantula
     • Tarantula: 1 / (1 + c · n_10 / n_11)
     • Ochiai: n_11 / √((n_11 + n_01) · (n_11 + n_10))

  43. 86 Ochiai outperforms Tarantula
     • Tarantula, 1 / (1 + c · n_10 / n_11): only presence in passed runs (n_10) lowers the similarity
     • Ochiai, n_11 / √((n_11 + n_01) · (n_11 + n_10)): absence in failed runs (n_01) also lowers the similarity

  44. 87 Ochiai outperforms Jaccard
     • Jaccard: n_11 / (n_11 + n_01 + n_10)
     • Ochiai: n_11 / √((n_11 + n_01) · (n_11 + n_10))

  45. 88 Ochiai outperforms Jaccard. Starting from Ochiai, n_11 / √((n_11 + n_01) · (n_11 + n_10)):
     square it:                                  n_11² / ((n_11 + n_01) · (n_11 + n_10))
     rewrite the denominator:                    n_11² / (n_11² + n_11·n_10 + n_11·n_01 + n_01·n_10)
     divide numerator and denominator by n_11:   n_11 / (n_11 + n_10 + n_01 + n_01·n_10 / n_11)
     None of these steps modifies the ranking!

  46. 89 Ochiai outperforms Jaccard
     • Jaccard: n_11 / (n_11 + n_01 + n_10)
     • Ochiai (rewritten): n_11 / (n_11 + n_10 + n_01 + n_01·n_10 / n_11)
     • Because of the extra n_01·n_10 / n_11 term, differences are amplified.
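A small numerical check (an addition to this transcript, not on the slides) on the counters of the earlier 7-component example: the rewritten form is exactly Ochiai squared, so it orders the components the same way Ochiai does, while Jaccard lacks the extra n01·n10/n11 term in the denominator.

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        /* (n11, n10, n01) for components a..g of the worked example */
        int n11[] = { 2, 2, 1, 3, 1, 1, 2 };
        int n10[] = { 0, 1, 1, 1, 1, 0, 0 };
        int n01[] = { 1, 1, 2, 0, 2, 2, 1 };

        for (int j = 0; j < 7; j++) {
            double jac = (double)n11[j] / (n11[j] + n10[j] + n01[j]);
            double och = n11[j] / sqrt((double)(n11[j] + n01[j]) * (n11[j] + n10[j]));
            double rew = n11[j] / (n11[j] + n10[j] + n01[j]
                                   + (double)n01[j] * n10[j] / n11[j]);
            printf("%c: jaccard=%.2f  ochiai=%.2f  rewritten=%.2f  ochiai^2=%.2f\n",
                   'a' + j, jac, och, rew, och * och);
        }
        return 0;
    }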

  47. 90 Quality of the passed / failed info • Failure detection is a crude error detection mechanism. • q_e = n_11 / (n_11 + n_10) • In the Siemens set, q_e ranges from 1.4% on average for schedule2 to 20.3% on average for tot_info. • Can be increased by excluding a run that contributes to n_10 • Can be decreased by excluding a run that contributes to n_11

  48. 91 Quality of the passed / failed info Small fraction of fault activations detected is enough

  49. 92 Number of runs • On average, for the Siemens set: – Adding more failed tests is safe – 6 failed tests are enough – The number of passed tests has no influence • However: – For individual runs the effect of adding passed tests differs – It stabilizes around 20 passed tests

  50. 93 Influence of #runs

  51. 94 Influence of #runs • On average, for our benchmark: – Adding failed runs is safe – 6 failed runs is enough – The number of passed runs has no influence • However – For individual runs, the effect of more passed runs differs – It stabilizes around 20

  52. 95 Dependence on Siemens set faults • Investigate industrial relevance in TRADER project: improve the user-perceived reliability of high-volume consumer electronics devices • Test case: television platform from NXP • Partners: – Universities of Delft, Twente, Leiden, – Embedded Systems Institute, Design Technology Institute, IMEC Leuven – NXP (former Philips Semiconductors)

  53. 96 Embedded systems • Low overhead • Little infrastructure needed • Consumer electronics – No time for exhaustive debugging – Helps to identify responsible teams / developers • Diagnosis can drive a recovery mechanism, e.g., rebooting suspect processes

  54. 97 Case study – platform • Control software of an analog TV • Decodes RC input, displays the on-screen menu and teletext, optimizes parameters for audio / video processing based on signal analysis, etc. • 450 K lines of C code • 2 MB of RAM + 2 MB in the development version • CPU: MIPS running a small multi-tasking OS • Work is organized in 315 logical threads • UART connection to a PC

  55. 98 Case study 1. Load problem, scenario: TV -> TXT -> TV

  56. 99 Diagnosis • 150 hit spectra of 315 functions, corresponding to the logical threads (one spectrum per second): 60 sec. TV, 30 sec. TXT, 60 sec. TV • Marked the last 60 spectra as failed • The faulty function ranked 2nd of the 315 functions

  57. 100 Case study 2. Teletext lock-up: – Existing problem in another product line – Copied to our platform, triggered by a remote control key sequence – Inconsistency in two state variables, for which only specific combinations are allowed
