SLIDE 1

Timing Behavior Anomaly Detection for Automatic Failure Detection and Diagnosis

Research visit at Charles University Prague
Matthias Rohr (matthias.rohr@informatik.uni-oldenburg.de)
Graduate School TrustSoft, Software Engineering Group
Department of Computing Science, University of Oldenburg

10th of April 2007

Matthias Rohr (TrustSoft) Timing Behavior Anomaly Detection 10th of April 2007 1 / 32


SLIDE 5

Motivation

Complex Software System with Monitoring

[Diagram: users and administrators interact with a complex software system; monitoring probes (M) write a log of runtime behavior measurements, which feeds failure diagnosis and yields a diagnosis report]

- Failure diagnosis in business-critical software systems
- Manual failure diagnosis is time-consuming and error-prone
- Runtime behavior observations are indicative for failure diagnosis

SLIDE 6

Motivation

[Component diagram: the components :Bookshop, :CRM, and :Catalog, annotated with probabilities P_ft = 0.08, P_ft = 0.8, and P_ft = 0.12]

Vision: Automatic localization of faults through runtime behavior evaluation


SLIDE 8

Motivation

Approach

Automatic localization of faults through runtime behavior evaluation
Automatic detection of timing behavior anomalies in software systems

Research questions:
- How can anomalies be detected in timing behavior?
- How can system usage variations be addressed in timing behavior evaluation?
- What is the relation between software faults and runtime timing behavior?

SLIDE 9

Outline

1. Foundations (Dependability, Anomaly Detection, Software Performance)
2. Creation of the timing behavior profile
3. Fault Localization
4. Evaluation
5. Related work
6. Conclusions

SLIDE 11

Foundations: Dependability

Dependability Terminology [Avižienis et al., 2004]

Threats to dependability:
- Fault: root cause of a failure
- Error: incorrect system state
- Failure: deviation from correct system behavior visible to the user

Failure Diagnosis:
- Failure detection
- Identification of faults
- Fault localization

SLIDE 13

Foundations: Dependability

Availability

Common definition (e.g., [Musa et al., 1987]):

    Availability = MTTF / (MTTF + MTTR)

MTTF: Mean Time to Failure
MTTR: Mean Time to Repair

Two alternative strategies to increase availability:
- Increase the mean time to failure (reliability)
- Decrease the mean time to repair, e.g., through failure diagnosis support
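The availability definition can be checked with a few lines of Python. The MTTF/MTTR values below are made-up illustration numbers, not figures from the talk:

```python
def availability(mttf: float, mttr: float) -> float:
    """Availability = MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)

# Hypothetical example: a failure every 500 h on average, 1 h mean repair time.
a_slow_repair = availability(mttf=500.0, mttr=1.0)   # ≈ 0.998
# Halving the repair time raises availability without touching reliability:
a_fast_repair = availability(mttf=500.0, mttr=0.5)   # ≈ 0.999
```

This illustrates the slide's second strategy: faster failure diagnosis shrinks MTTR and thereby raises availability even when MTTF stays fixed.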

SLIDE 15

Foundations: Anomaly Detection

Anomaly Detection (1/2)

[Diagram: system influences act on the system; the observed system behavior feeds anomaly detection and anomaly analysis]

An anomaly is a deviation from "normal" system behavior.

Normal system behavior can be defined by:
- Static reference values (e.g., mean response time over a day ≤ T)
- Analytical or statistical models depending on system influences and historical system behavior

SLIDE 18

Foundations: Anomaly Detection

Anomaly Detection (2/2)

Methods to create normal behavior profiles:
- Manual specification
- Automatic profile learning from observations

Challenges of anomaly detection:
- False alarms
- System usage
- Nonlinear system behavior, modeling uncertainties

Typical application domains:
- Industrial manufacturing, large-scale control systems [Palade et al., 2006]
- Network management [Maxion, 1990]
- Intrusion detection (security) [Denning, 1987]

SLIDE 19

Foundations: Software Performance

Software Timing Behavior

Influences on software timing behavior:
- System architecture: hardware resource capacity, software design
- System usage [cp. Sabetta and Koziolek, 2007]: workload intensity (e.g., number of active users), service demand characteristics (e.g., individual request parameters)
- System state
- Performance tuning (e.g., caching, load balancing), server virtualization, ...

SLIDE 20

Outline

1. Foundations
2. Creation of the timing behavior profile (Instrumentation, Monitoring, Analysis of Execution Sequences, Analysis of Workload Intensity)
3. Fault Localization
4. Evaluation
5. Related work
6. Conclusions

SLIDE 23

Creation of the timing behavior profile

Failure diagnosis through online timing behavior evaluation

Timing behavior anomalies: deviations from the normal timing behavior (here: response times) of operations of a software system, e.g., exceptionally high or low response times.

Relation between software faults and timing behavior anomalies:
- Software faults tend to cause timing behavior anomalies [Kao et al., 1993]
- Successful fault localization based on timing behavior anomalies [Agarwal et al., 2004]
- Response times in enterprise resource planning (ERP) systems are often log-normally distributed [Mielke, 2006]

SLIDE 24

Creation of the timing behavior profile

Overview

Timing behavior anomaly detection for failure diagnosis:

[Activity diagram: initial activities (instrumentation for monitoring, creation of the timing behavior profile) are followed by continuous activities (monitoring, update of the timing behavior profile); during failure diagnosis, anomaly detection and anomaly analysis produce a diagnosis report. The monitoring log contains response times and execution sequences.]

SLIDE 27

Creation of the timing behavior profile

Overview

Creation of the timing behavior profile proceeds in steps: instrumentation for monitoring, monitoring, execution sequence analysis, workload intensity analysis, and operation analysis, resulting in the timing behavior profile.

SLIDE 30

Creation of the timing behavior profile: Instrumentation

[Component diagram: the components :Bookshop, :CRM, and :Catalog instrumented with monitoring probes (M)]

Monitoring of:
- Response times (start and end of an operation execution)
- Execution sequences of operations for each thread

Instrumentation challenges:
- Measurement metrics
- Number and position of measurement points [Focke et al., 2007a]
- Maintainable integration of measurement logic [Focke et al., 2007b]


SLIDE 33

Creation of the timing behavior profile: Monitoring

[Sequence diagram: components A, B, C with monitoring probes; executions a, b, c1, c2 and control-flow transfers $a, ac1, c1a, ab, bc2, c2b, ba, a$]

Monitoring log:

Operation | TraceID | tin  | tout | tout − tin
a         | 1       | 0000 | 0150 | 150
c         | 1       | 0030 | 0050 | 20
b         | 1       | 0060 | 0140 | 80
c         | 1       | 0090 | 0130 | 40
...       | ...     | ...  | ...  | ...
c         | 2       | 0340 | 0358 | 18
c         | 2       | 0400 | 0437 | 37

Trace reconstruction of execution sequences from the monitoring log:
- Operations: O = {a, b, c}
- Executions with TraceID 1: E1 = {a, b, c1, c2}
- Execution sequence t1 = ($a, ac1, c1a, ab, bc2, c2b, ba, a$)
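The trace reconstruction above can be implemented from the (tin, tout) intervals alone: interval nesting within one TraceID determines the caller/callee structure, and sorting entry/exit events by time yields the control-flow transfer sequence. A minimal sketch (the function and its event handling are my own reading, not code from the talk):

```python
def reconstruct_sequence(rows):
    """rows: (operation, tin, tout) tuples of one trace.
    Returns the execution sequence as transfer labels, e.g. '$a', 'ac1'."""
    # Label repeated executions of the same operation c as c1, c2, ... by entry order.
    counts = {}
    execs = []
    for op, tin, tout in sorted(rows, key=lambda r: r[1]):
        n = counts[op] = counts.get(op, 0) + 1
        multiple = sum(1 for o, _, _ in rows if o == op) > 1
        execs.append((f"{op}{n}" if multiple else op, tin, tout))
    # Entry/exit events; exits sort before entries at equal timestamps.
    events = [(tin, 1, name) for name, tin, tout in execs]
    events += [(tout, 0, name) for name, tin, tout in execs]
    seq, stack = [], ["$"]
    for _, kind, name in sorted(events):
        if kind == 1:            # entry: transfer caller -> callee
            seq.append(stack[-1] + name)
            stack.append(name)
        else:                    # exit: transfer callee -> caller
            stack.pop()
            seq.append(name + stack[-1])
    return seq

# The TraceID-1 rows from the slide's monitoring log:
trace1 = [("a", 0, 150), ("c", 30, 50), ("b", 60, 140), ("c", 90, 130)]
```

Applied to the slide's TraceID-1 rows, this yields exactly t1 = ($a, ac1, c1a, ab, bc2, c2b, ba, a$).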


SLIDE 36

Creation of the timing behavior profile: Monitoring

Associating response times with operations: RT denotes all response times, RT(o) the response times of one operation o.

- All response times: RT = (150, 80, ...)
- Response times per operation: RT(a) = (150, ...), RT(b) = (80, ...), RT(c) = (40, 20, 37, 18, ...)

Statistical description of RT(o) = (rt1, ..., rtn):
- Probability density functions, histograms
- Location parameters: mean, median, mode
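Grouping the log's response times by operation and computing the location parameters is a one-pass job; a small sketch using the sample values from the slides:

```python
from collections import defaultdict
from statistics import mean, median

def response_times_per_operation(rows):
    """rows: (operation, response_time) pairs -> {operation: [response times]}."""
    rt = defaultdict(list)
    for op, duration in rows:
        rt[op].append(duration)
    return dict(rt)

# (operation, tout - tin) pairs taken from the monitoring-log slide:
log = [("a", 150), ("c", 20), ("b", 80), ("c", 40), ("c", 18), ("c", 37)]
rt = response_times_per_operation(log)
# rt["c"] == [20, 40, 18, 37]; mean(rt["c"]) == 28.75; median(rt["c"]) == 28.5
```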


SLIDE 40

Creation of the timing behavior profile: Analysis of Execution Sequences

[Figure: probability densities of response times (ms) for the :Bookshop, :CRM, and :Catalog components; a mixed density is separated into per-context densities]

Separation to achieve "trace-aware" timing behavior evaluation.


SLIDE 44

Creation of the timing behavior profile: Analysis of Execution Sequences

Prefix of an execution sequence t ∈ T of an execution e ∈ E:

    p : T × E → T;  (t, e) ↦ (m1, ..., mj),  with mj being the pair (e′, e)

Example for t1 = ($a, ac1, c1a, ab, bc2, c2b, ba, a$):
- p(t1, c1) = ($a, ac1)
- p(t1, c2) = ($a, ac1, c1a, ab, bc2)
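The prefix function simply returns the subsequence of t up to and including the transfer that enters execution e. A minimal sketch; representing transfers as (caller, callee) pairs is my own choice:

```python
def prefix(t, e):
    """Prefix p(t, e): the transfers of t up to and including the pair (e', e)
    that enters execution e. Transfers are (caller, callee) pairs."""
    for i, (_, callee) in enumerate(t):
        if callee == e:
            return t[:i + 1]
    raise ValueError(f"execution {e!r} is not entered in t")

# t1 from the slide, as (caller, callee) pairs:
t1 = [("$", "a"), ("a", "c1"), ("c1", "a"), ("a", "b"),
      ("b", "c2"), ("c2", "b"), ("b", "a"), ("a", "$")]
# prefix(t1, "c1") -> ($a, ac1); prefix(t1, "c2") -> ($a, ac1, c1a, ab, bc2)
```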


SLIDE 46

Creation of the timing behavior profile: Analysis of Execution Sequences

Distinction of response times based on prefixes: the timing behavior observations of an operation o are distinguished based on their prefix p, denoted RTp = (rt1, ..., rtn).

Example (with t1 = t2):

    t1 = ($a, ac1, c1a, ab, bc2, c2b, ba, a$)
    p(t1, c1) = ($a, ac1)
    p(t1, c2) = ($a, ac1, c1a, ab, bc2)

    RT($a,ac1) = (20, 18, ...)
    RT($a,ac1,c1a,ab,bc2) = (40, 37, ...)

O   | TID | tout − tin
a   | 1   | 150
c   | 1   | 20
b   | 1   | 80
c   | 1   | 40
... | ... | ...
c   | 2   | 18
c   | 2   | 37

SLIDE 47

Creation of the timing behavior profile: Analysis of Execution Sequences

- All response times: RT = (150, 80, 20, 40, 18, 37, ...)
- Response times per operation: RT(a) = (150, ...), RT(b) = (80, ...), RT(c) = (20, 40, 18, 37, ...)
- Distinction based on prefix: RTp1 = (150, ...), RTp2 = (80, ...), RTp3 = (20, 18, ...), RTp4 = (40, 37, ...)
  with p3 = ($a, ac1) and p4 = ($a, ac1, c1a, ab, bc2)


SLIDE 50

Creation of the timing behavior profile: Analysis of Workload Intensity

The workload intensity during an execution influences the response times. What is the expected response time distribution of an operation for a particular workload intensity?

Metric for workload intensity w(e): average number of active application threads during the operation execution e.

[Scatter plot: response time (up to 8000 ms) over workload intensity (up to 120); dashed blue line: average response time in ms; dotted red line: median response time in ms]
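One way to compute w(e) from the monitoring log is to average, over e's duration, the number of executions whose intervals overlap it. The interval-overlap formulation below is my own reading of the metric, not code from the talk:

```python
def workload_intensity(e, executions):
    """w(e): average number of active executions (threads) during e = (tin, tout),
    computed as the total overlap time of all executions with e's interval,
    divided by e's duration."""
    tin, tout = e
    total_overlap = 0.0
    for s, t in executions:      # includes e itself: its own thread is active
        total_overlap += max(0.0, min(tout, t) - max(tin, s))
    return total_overlap / (tout - tin)

running = [(0, 100), (20, 60), (40, 140)]   # made-up intervals
w = workload_intensity((0, 100), running)
# overlaps: 100 + 40 + 60 = 200 -> w = 2.0 threads active on average
```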

SLIDE 57

Creation of the timing behavior profile: Analysis of Workload Intensity

Process:
- Determine the workload intensity for each monitored execution
- RTp = (rt1, ..., rtn) is extended to RT′p = ((rt1, w1), ..., (rtn, wn))
- Approximation of normalized probability density functions
  f^w_RTp : R → [0, 1];  rt ↦ f^w_RTp(rt)

Example: approximated normal distributions for response times in dependence on the workload intensity w, normalized to [0, 1].

[Figure: normalized probability densities pdf(rt) over response time (1000 to 1100 ms) for workload intensities w = 1, 5, 10, 15]
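Under the normal-distribution approximation, the normalized density for a workload intensity w can be obtained by fitting a mean and standard deviation to the response times observed near w and scaling the Gaussian so its peak equals 1. A sketch; the binning-by-rounding step and the sample values are my own simplification:

```python
import math
from collections import defaultdict
from statistics import mean, pstdev

def fit_profiles(rt_w_pairs):
    """RT'_p = ((rt1, w1), ...) -> {w: (mu, sigma)} per (rounded) workload intensity."""
    by_w = defaultdict(list)
    for rt, w in rt_w_pairs:
        by_w[round(w)].append(rt)
    return {w: (mean(v), pstdev(v)) for w, v in by_w.items() if len(v) > 1}

def normalized_density(rt, mu, sigma):
    """Gaussian scaled so the mode has value 1: normalized_density(mu, ...) == 1.0."""
    return math.exp(-((rt - mu) ** 2) / (2 * sigma ** 2))

# Invented (rt, w) observations for one prefix:
profiles = fit_profiles([(1000, 1), (1010, 1), (1020, 1), (1050, 5), (1070, 5), (1060, 5)])
mu, sigma = profiles[1]
# normalized_density peaks at 1.0 for rt == mu and falls toward 0 away from it
```

Dropping the usual 1/(sigma*sqrt(2*pi)) factor is exactly what maps the density into [0, 1], as the slide's normalization requires.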

SLIDE 58

Creation of the timing behavior profile in summary

- All monitored response times: RT
- Response times per operation: RT(a) = (150, ...), RT(b) = (80, ...), RT(c) = (20, 40, 18, 37, ...)
- Distinction based on prefix: RTp1 = (150, ...), RTp2 = (80, ...), RTp3 = (20, 18, ...), RTp4 = (40, 37, ...)
  with p1 = ($a), p2 = ($a, ac1, c1a, ab), p3 = ($a, ac1), p4 = ($a, ac1, c1a, ab, bc2)
- Modeling of workload intensity

Timing behavior profile: the profile consists of a function f^w_RTp for each prefix of the monitoring data. The values f^w_RTp(rt) ∈ [0, 1] describe how "normal" a response time rt is under consideration of a workload intensity w and a prefix p.

SLIDE 59

Outline

1. Foundations
2. Creation of the timing behavior profile
3. Fault Localization
4. Evaluation
5. Related work
6. Conclusions

SLIDE 60

Fault Localization

Overview Fault Localization

[Activity diagram: after the initial activities (instrumentation for monitoring, creation of the timing behavior profile) and the continuous activities (monitoring, update of the timing behavior profile), the activities during diagnosis after detection of a failure are anomaly detection and anomaly analysis, producing a diagnosis report. Input is the monitoring log of some time period before the failure: response times and execution sequences.]

slide-64
SLIDE 64

Fault Localization

Activities of Fault Localization (1/2)

After detection of a failure at time ta:

1. Determination of response times, prefixes, and workload intensities (for each execution) for the time period [t_a − δ, t_a]:

   O                    TraceID  t_in  t_out  t_out − t_in  p    w
   Catalog.getBook(..)  121      1182  1201   19            p17  17   . . .
   Bookshop.query(..)   131      1195  1221   26            p41  21   . . .

2. Anomaly detection through computation of 1 − f^w_RTp(rt):

   O                    TraceID  t_in  t_out  t_out − t_in  p    w   1 − f^w_RTp(rt)
   Catalog.getBook(..)  121      1182  1201   19            p17  17  1 − f^17_RTp17(19) = 0.75   . . .
   Bookshop.query(..)   131      1195  1221   26            p41  21  1 − f^21_RTp41(26) = 0.21   . . .
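The two fault-localization steps can be sketched as below — a hedged illustration: the record field names (op, t_in, t_out, p, w) and the profile signature f(p, w, rt) are assumptions mirroring the table, not the presentation's API.

```python
def localize(log, f, t_a, delta):
    """Score executions observed in the window [t_a - delta, t_a].

    `log` is a list of dicts with keys: op, trace_id, t_in, t_out, p, w.
    `f(p, w, rt)` is the timing behavior profile (values in [0, 1]).
    Returns (record, anomaly) pairs with anomaly = 1 - f(p, w, rt).
    Sketch of steps 1 and 2; the profile lookup is left abstract.
    """
    scored = []
    for rec in log:
        if not (t_a - delta <= rec["t_out"] <= t_a):
            continue                                  # outside diagnosis window
        rt = rec["t_out"] - rec["t_in"]               # step 1: response time
        anomaly = 1.0 - f(rec["p"], rec["w"], rt)     # step 2: anomaly value
        scored.append((rec, anomaly))
    return scored

# Stand-in profile that rates every execution as 25% normal (illustrative).
f = lambda p, w, rt: 0.25
log = [
    {"op": "Catalog.getBook(..)", "trace_id": 121, "t_in": 1182, "t_out": 1201, "p": "p17", "w": 17},
    {"op": "Bookshop.query(..)",  "trace_id": 131, "t_in": 1195, "t_out": 1221, "p": "p41", "w": 21},
    {"op": "Startup.init(..)",    "trace_id":  90, "t_in":   10, "t_out":   20, "p": "p1",  "w": 3},
]
scored = localize(log, f, t_a=1300, delta=200)
# Only the two executions ending within [1100, 1300] are scored.
```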


slide-66
SLIDE 66

Fault Localization

Activities of fault localization (2/2)

3. Anomaly analysis: Aggregation of many anomaly values
   • Mean degree of anomaly for each operation / component / deployment context
   • Analysis of anomalies in combination with component dependency graphs
   • Neural networks [Stransky, 2006]
   • Event correlation techniques [Steinder and Sethi, 2004]

4. Presentation of results (diagnosis report)

[Diagram] Example diagnosis report: the components :Bookshop, :CRM, and :Catalog (<<Component>>) annotated with aggregated anomaly ratings 0.08, 0.8, and 0.12.
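The simplest aggregation from step 3 — the mean degree of anomaly per component — might look like this sketch; the mapping from executions to components and the example scores are illustrative, not taken from a case study.

```python
from collections import defaultdict

def mean_anomaly_per_component(scored):
    """Aggregate per-execution anomaly values to a mean per component.

    `scored` is a list of (component, anomaly) pairs, e.g. obtained by
    mapping each scored operation to its component. Returns {component: mean}.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for component, anomaly in scored:
        sums[component] += anomaly
        counts[component] += 1
    return {c: sums[c] / counts[c] for c in sums}

scored = [(":Bookshop", 0.10), (":Bookshop", 0.06), (":CRM", 0.8), (":Catalog", 0.12)]
report = mean_anomaly_per_component(scored)
# :Bookshop averages to 0.08; :CRM stands out as the prime suspect.
```

Ranking components by this mean is only one option; the slide also names dependency-graph analysis, neural networks, and event correlation as alternatives.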


slide-67
SLIDE 67

Evaluation

Outline

1. Foundations
2. Creation of the timing behavior profile
3. Fault Localization
4. Evaluation (Lab studies, Field studies)
5. Related work
6. Conclusions


slide-68
SLIDE 68

Evaluation Lab studies

Evaluation – Lab studies

Evaluation goals:
• Proof of concept: failure diagnosis for injected faults
• Efficiency of anomaly detection and anomaly analysis
Setting:
• Generation of artificial (probabilistic) system usage
• Fault injection
Example applications:
• Sun Java PetStore Demo Application (and reimplementations)
• RUBiS Benchmark
• TPC-App Benchmark
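Generating artificial probabilistic system usage could be sketched, for instance, as sessions drawn from a simple transition model — the operations and probabilities below are invented for illustration; the slides do not specify the usage model used with these applications.

```python
import random

def generate_sessions(n_sessions, transition, start="home", seed=42):
    """Generate artificial system usage as probabilistic user sessions.

    `transition` maps an operation to a dict of candidate next operations
    with probabilities; the key None ends a session. Purely illustrative.
    """
    rng = random.Random(seed)  # fixed seed for reproducible workloads
    sessions = []
    for _ in range(n_sessions):
        op, session = start, []
        while op is not None:
            session.append(op)
            nxt, probs = zip(*transition[op].items())
            op = rng.choices(nxt, weights=probs)[0]
        sessions.append(session)
    return sessions

transition = {
    "home":   {"browse": 0.7, None: 0.3},
    "browse": {"buy": 0.2, "browse": 0.5, None: 0.3},
    "buy":    {None: 1.0},
}
sessions = generate_sessions(3, transition)
```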


slide-72
SLIDE 72

Evaluation Field studies

Evaluation – Field studies

Evaluation goals:
• Applicability in real-world systems (complex system usage, long execution sequences)
• Effectiveness of anomaly detection
Field study in progress: Evaluation of 12 months of timing behavior data from a customer portal of a mid-size telecommunication company (only highly aggregated response times)
Field studies in preparation:
• Telecommunication system of Siemens
• E-learning management platform StudIP


slide-76
SLIDE 76

Related work

Related work

Failure diagnosis based on analysis of timing behavior:
• [Agarwal et al., 2004]: Response-time analysis in the context of average historic response times and SLA violations
• [Diaconescu and Murphy, 2005]: Anomalies as violations of relative threshold values (based on historic averages)
Failure diagnosis based on analysis of (component) execution sequences:
• [Kiciman and Fox, 2005; Kiciman, 2005; Aguilera et al., 2003; Barham et al., 2004]: Component interaction probabilities and component dependency graphs
Failure diagnosis based on multiple runtime behavior metrics:
• [Cohen et al., 2005]: Monitoring and evaluation of 62 platform metrics (including response times) for the diagnosis of performance problems
• [Salfner and Malek, 2005]: Prediction of failures based on runtime behavior monitoring (for rejuvenation)


slide-78
SLIDE 78

Conclusions

Conclusions

• New approach to the detection of timing behavior anomalies for the localization of faults
• Improvement of timing behavior analysis:
  – Workload intensity awareness
  – Awareness of service demand characteristics
• Anomaly detection is used to increase the availability and reliability of enterprise-scale software systems
• Empirical evaluation requires much effort (fault injection & complex usage)


slide-79
SLIDE 79

References

  • M. K. Agarwal, K. Appleby, M. Gupta, G. Kar, A. Neogi, and A. Sailer. Problem determination using dependency graphs and run-time behavior models. In 15th IFIP/IEEE International Workshop on Distributed Systems: Operations and Management (DSOM’04), volume 3278 of Lecture Notes in Computer Science, pages 171–182. Springer, 2004. ISBN 3-540-23631-7. doi:10.1007/b102082.
  • M. K. Aguilera, J. C. Mogul, J. L. Wiener, P. Reynolds, and A. Muthitacharoen. Performance debugging for distributed systems of black boxes. In SOSP ’03: Proceedings of the nineteenth ACM symposium on Operating systems principles, pages 74–89, New York, NY, USA, 2003. ACM Press. ISBN 1-58113-757-5. doi:10.1145/945445.945454.
  • A. Avižienis, J.-C. Laprie, B. Randell, and C. Landwehr. Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, 1(1):11–33, 2004. ISSN 1545-5971. doi:10.1109/TDSC.2004.2.
  • P. T. Barham, A. Donnelly, R. Isaacs, and R. Mortier. Using Magpie for request extraction and workload modelling. In 6th Symposium on Operating Systems Design and Implementation (OSDI’04), pages 259–272, 2004.
  • I. Cohen, S. Zhang, M. Goldszmidt, J. Symons, T. Kelly, and A. Fox. Capturing, indexing, clustering, and retrieving system history. In SOSP ’05: Proceedings of the twentieth ACM symposium on Operating systems principles, pages 105–118, New York, NY, USA, 2005. ACM Press. ISBN 1-59593-079-5. doi:10.1145/1095810.1095821.
  • D. Denning. An intrusion-detection model. IEEE Transactions on Software Engineering, 13(2):222–232, Feb. 1987.
  • A. Diaconescu and J. Murphy. Automating the performance management of component-based enterprise systems through the use of redundancy. In Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering (ASE ’05), pages 44–53, New York, NY, USA, 2005. ACM Press. ISBN 1-59593-993-4. doi:10.1145/1101908.1101918.
  • T. Focke, W. Hasselbring, M. Rohr, and J.-G. Schute. Ein Vorgehensmodell für Performance-Monitoring von Informationssystemlandschaften. EMISA Forum, 27(1), Jan. 2007a. ISSN 1610-3351.
  • T. Focke, W. Hasselbring, M. Rohr, and J.-G. Schute. Instrumentierung zum Monitoring mittels Aspekt-orientierter Programmierung. In W.-G. Bleek, H. Schwentner, and H. Züllighoven, editors, Proceedings Software Engineering 2007, Hamburg, volume 106 of GI-Edition – Lecture Notes in Informatics. Gesellschaft für Informatik, Bonner Köllen Verlag, Mar. 2007b. ISBN 978-3-88579-200-0.
  • W.-I. Kao, R. Iyer, and D. Tang. FINE: A fault injection and monitoring environment for tracing the UNIX system behavior under faults. IEEE Transactions on Software Engineering, 19(11):1105–1118, 1993. ISSN 0098-5589. doi:10.1109/32.256857.
  • E. Kiciman. Using Statistical Monitoring to Detect Failures in Internet Services. PhD thesis, Stanford University, Sept. 2005.
  • E. Kiciman and A. Fox. Detecting application-level failures in component-based internet services. IEEE Transactions on Neural Networks, 16(5):1027–1041, Sept. 2005. doi:10.1109/TNN.2005.853411.
  • R. A. Maxion. Anomaly detection for network diagnosis. In B. Randell, editor, Proceedings of the 20th International Symposium on Fault-Tolerant Computing (FTCS ’90), pages 20–27. IEEE, June 1990. ISBN 0-8186-2051-X.
  • A. Mielke. Elements for response-time statistics in ERP transaction systems. Performance Evaluation, 63(7):635–653, July 2006. doi:10.1016/j.peva.2005.05.006.
  • J. D. Musa, A. Iannino, and K. Okumoto. Software Reliability: Measurement, Prediction, Application. McGraw-Hill, New York, first edition, 1987. ISBN 0-07-044093-X.
  • V. Palade, C. D. Bocaniala, and L. C. Jain, editors. Computational Intelligence in Fault Diagnosis. Advanced Information and Knowledge Processing. Springer, 2006. ISBN 978-1-84628-343-7.
  • A. Sabetta and H. Koziolek. Measuring performance metrics: Techniques and tools. In I. Eusgeld, F. Freiling, and R. Reussner, editors, Dependability Metrics, Lecture Notes in Computer Science, 2007. To appear in LNCS.
  • F. Salfner and M. Malek. Proactive fault handling for system availability enhancement. In IEEE Proceedings of the DPDNS Workshop in conjunction with IPDPS 2005, Denver, Colorado, 2005.
  • M. Steinder and A. S. Sethi. A survey of fault localization techniques in computer networks. Science of Computer Programming, 53(2):165–194, Nov. 2004. doi:10.1016/j.scico.2004.01.010.
  • F. Stransky. Automatisierte Lokalisierung von Fehlerursachen bei Performance-Problemen in J2EE Anwendungen. Individual project (bachelor thesis), 2006.