Mining Anomaly Detectors Paolo Tonella Software Engineering - PowerPoint PPT Presentation

Mining Anomaly Detectors Paolo Tonella Software Engineering Research Unit Fondazione Bruno Kessler Trento, Italy http://se.fbk.eu/tonella

Outline
• Role and classification of (mined) oracles
• Oracle mining techniques
• Empirical validation of mined oracles
• Future research directions

Role of oracles
Relations among program P, specification S, tests T and oracle O:
• P and S: P attempts to implement S.
• P and T: the structure of P may be used to define T; the semantics of P determines the propagation of errors.
• P and O: the observability of P limits the information available in O.
• S and T: S may be used to define T.
• O and S: O approximates S.
• T and O: the effectiveness of testing depends on O; T may influence which variables to consider in O.
For a given program P, what combination of tests T and oracle O achieves the highest fault revealing level?
M. Staats, M. W. Whalen and M. P. E. Heimdahl, Programs, Tests, and Oracles: The Foundations of Testing Revisited. ICSE 2011.

Mutation testing & testability
Mutation adequacy (revised for any arbitrary oracle o): $\mathit{Mut}_M(p \times s \times TS \times o) \Rightarrow \forall m \in M,\ \exists t \in TS : \neg o(t, m)$
Effectiveness of mutation testing depends on the power of o.
Testability of program location loc is defined as the probability that the system fails if location loc is faulty.
Propagation probability (revised): probability that a perturbed value at location loc affects a variable used by oracle o.
Testability of a program depends also on the oracle. Low testability locations can be made more testable by using a more powerful oracle.

Oracle comparison
Oracle power ($o_1 \geq_{TS} o_2$): $\forall t \in TS,\ o_1(t, p) \Rightarrow o_2(t, p)$
Oracle power is a partial order relation (not all pairs of oracles satisfy the oracle power relation in either direction), hence there are oracles that are incomparable according to power.
Probabilistic better ($o_1\ PB_{TS}\ o_2$): for a randomly selected $t \in TS$: $P[o_1(t, p) = F] \geq P[o_2(t, p) = F]$
Probabilistic better is a total order relation. Probabilistic better is weaker than (subsumed by) the oracle power relation.
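The two relations can be made concrete on a finite test suite. The sketch below is only illustrative: it assumes an oracle is a function from a test identifier to a verdict (True for pass, False for fail), with the program p left implicit, and the names `has_greater_power`, `probabilistically_better` and the three example oracles are hypothetical.

```python
from typing import Callable, Iterable

# Assumption: an oracle maps a test id to a verdict, True = pass, False = fail.
Oracle = Callable[[str], bool]


def has_greater_power(o1: Oracle, o2: Oracle, tests: Iterable[str]) -> bool:
    """o1 >=_TS o2: every test that o1 passes is also passed by o2."""
    return all(o2(t) for t in tests if o1(t))


def probabilistically_better(o1: Oracle, o2: Oracle, tests: Iterable[str]) -> bool:
    """o1 PB_TS o2: a uniformly drawn test fails under o1 at least as often as under o2."""
    tests = list(tests)
    fails1 = sum(not o1(t) for t in tests)
    fails2 = sum(not o2(t) for t in tests)
    return fails1 >= fails2


def failing_on(failing: set) -> Oracle:
    """Build a toy oracle that fails exactly on the given tests."""
    return lambda t: t not in failing


TS = ["t1", "t2", "t3"]
o_a = failing_on({"t1"})
o_b = failing_on({"t2"})
o_c = failing_on({"t1", "t2"})

# o_c fails on a superset of o_a's failures, so it has greater power.
c_over_a = has_greater_power(o_c, o_a, TS)
# o_a and o_b fail on different tests: neither has greater power than the other.
a_over_b = has_greater_power(o_a, o_b, TS)
b_over_a = has_greater_power(o_b, o_a, TS)
# Yet they are still comparable under the probabilistic relation.
a_pb_b = probabilistically_better(o_a, o_b, TS)
b_pb_a = probabilistically_better(o_b, o_a, TS)
```

Here `o_a` and `o_b` are incomparable by power but tie under the probabilistic relation, which illustrates why the latter can compare any pair of oracles.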

Classes of oracles
Complete oracle: $\mathit{corr}(t, p, s) \Rightarrow o(t, p)$
• Faults revealed by o are real faults; pass runs may miss a fault.
Sound oracle: $o(t, p) \Rightarrow \mathit{corr}(t, p, s)$
• Oracle proves correctness; no fault is missed.
Perfect oracle: $o(t, p) \Leftrightarrow \mathit{corr}(t, p, s)$
corr(t, p, s): spec s holds for p when t is run.
1. Unsound/complete [FN ≥ 0; FP = 0]
• Pre/post-conditions; invariants; assertions
2. Unsound/incomplete [FN ≥ 0; FP ≥ 0]
• Anomaly detectors (oracle/spec mining/learning)

Mining oracles
1. Mining finite state machines
2. Mining temporal properties / association rules
3. Mining data invariants
Common assumption [well-enough debugged program]: during mining (training) only or mostly correct program behaviors are observed.
INPUT: static traces (paths) or dynamic traces (logs).
OUTPUT: oracles/specifications that can be checked dynamically or statically (e.g., through model checking).

Mining finite state machines
Dynamic traces (execution logs) are the input of FSM inference.
[Figure: inferred FSM for Formatter, with transitions labelled Formatter(), format(), flush(), locale(), out(), close()]

State abstraction
ADABU [Dallmeier et al.; WODA 2006]
Execution logs contain the events Formatter, println, format and close, together with the concrete object state observed at each step. Three runs are logged, each with its own object identities:
[in=In@6f3321a3,out=Out@5d0385c1], then [in=null,out=Out@5d0385c1] after close
[in=In@4a3922f3,out=Out@5f0476d2], then [in=null,out=Out@5f0476d2] after close
[in=In@1b25672c,out=Out@34ab4411], then [in=null,out=Out@34ab4411] after close
Concrete states are abstracted into two abstract states: "in ≠ null, out ≠ null" and "in = null, out ≠ null".
[Figure: FSM over the two abstract states, with transitions labelled Formatter, println, format, close]

Event sequence abstraction
kTail [Biermann & Feldman; Trans Comp 1972]
KLFA [Mariani & Pastore; ISSRE 2008]
Synoptic [Beschastnikh et al.; FSE 2011]
[Ammons et al.; POPL 2002]
[Whaley et al.; ISSTA 2002]
Based on grammar inference, usually under the constraint that no negative example is available.
[Figure: execution logs as sequences of the events Formatter, println, format, close, and the FSM inferred from them]

Grammar inference
Based on a sample of strings that belong to a language L, we want to build a regular grammar whose accepted language is as close as possible to L.
k-tail principle: two states are merged (matched) if they have the same k-tails.
[Figure: automaton built from sample strings over the alphabet {a, b, c, d}; example 2-tails: <b, c>, <b, d>]
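The k-tail principle can be sketched in a few lines. This is a simplified variant, not the exact algorithm of any tool cited here: it assumes every prefix of a trace is an initial state, takes as k-tail of a state the set of event sequences of length at most k that follow that prefix in the sample, and merges states whose k-tail sets are equal. The function name `k_tails_fsm` and the Formatter traces are hypothetical.

```python
from collections import defaultdict


def k_tails_fsm(traces, k=2):
    """Merge trace prefixes that share the same set of k-tails.

    Returns (state_of, transitions): state_of maps each prefix to its merged
    state (represented by its frozenset of k-tails), and transitions is a set
    of (source_state, event, target_state) triples.
    """
    traces = [tuple(t) for t in traces]

    # Collect the k-tails observed after every prefix of every trace.
    tails = defaultdict(set)
    for trace in traces:
        for i in range(len(trace) + 1):
            tails[trace[:i]].add(trace[i:i + k])

    # Prefixes with identical k-tail sets collapse into one state.
    state_of = {prefix: frozenset(t) for prefix, t in tails.items()}

    # Replay the traces over the merged states to obtain the transitions.
    transitions = set()
    for trace in traces:
        for i, event in enumerate(trace):
            transitions.add((state_of[trace[:i]], event, state_of[trace[:i + 1]]))
    return state_of, transitions


# Hypothetical traces of a Formatter object.
sample = [
    ["Formatter", "format", "close"],
    ["Formatter", "format", "format", "close"],
    ["Formatter", "format", "format", "format", "close"],
]
state_of, transitions = k_tails_fsm(sample, k=1)
merged_states = set(state_of.values())
```

With k = 1, the prefixes ending after one and after two `format` events have the same 1-tails (`format` or `close`), so they are merged and the resulting machine gains a self-loop on `format`. The result may be nondeterministic; larger k merges fewer states and generalizes less.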

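Active learning
LearnLib [Raffelt et al.; STTT 2009]
A Learner poses queries about event sequences to a Teacher (the software system), e.g., "println, Formatter, close?" or "println, Formatter, println?", and receives yes / no answers.
[Figure: Learner and Teacher exchanging queries and answers, with the learned FSM over the events Formatter, println, format, close]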

Mining temporal properties
OCD [Gabel & Su; ICSE 2010]
Micro-pattern templates:
• Sequencing: ab
• Loop begin: ab+
• Loop end: a+b
• Pre-condition: ab?
• Post-condition: a?b
• Generalized pre-cond: a+b*
• Generalized post-cond: a*b+
• Association rule: (ab | ba)
• General assoc rule: (a+b+ | b+a+)
IsEnforcing(sat: int, fail: int) → {ENFORCE, LEARN, DEAD}
Perracotta [Yang et al.; ICSE 2006]
• Alternation rule for a, b: (ab)*. E.g.: lock/unlock
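Checking a trace against the alternation rule (ab)* amounts to projecting the trace onto the two events of interest and matching the projection against the regular expression. The sketch below assumes traces are lists of event names; `satisfies_alternation` is a hypothetical helper, not part of Perracotta or OCD.

```python
import re

# The alternation template over the two-letter alphabet {a, b}.
ALTERNATION = re.compile(r"(ab)*")


def satisfies_alternation(trace, a, b):
    """True if, ignoring all other events, a and b strictly alternate as (ab)*."""
    # Project the trace onto {a, b} and encode it as a string of 'a'/'b'.
    projected = "".join("a" if event == a else "b"
                        for event in trace if event in (a, b))
    return ALTERNATION.fullmatch(projected) is not None


ok_trace = ["lock", "read", "unlock", "lock", "write", "unlock"]
bad_trace = ["lock", "lock", "unlock"]

ok = satisfies_alternation(ok_trace, "lock", "unlock")
bad = satisfies_alternation(bad_trace, "lock", "unlock")
```

A miner would run such a check for every candidate pair of events over all traces and keep the pairs that are satisfied often enough; how satisfied and failed instances are weighed is tool-specific and is not modelled here.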

Association rule mining
Itemset database: D = {{a, b, c, d, e}, {a, b, d, e, f}, {a, b, d, g}, {a, c, h, i}}
Support of itemsets: support({a, b, d}) = 3
Frequent itemsets (support > 2): F = {{a}, {b}, {d}, {a, b}, {a, d}, {b, d}, {a, b, d}}
Association rules and confidence for frequent itemset {a, b, d}: c(A ⇒ B) = P[B | A] = support(A ∪ B) / support(A)
• {a} ⇒ {b, d}: c = 3/4 = 75%
• {a, b} ⇒ {d}: c = 100%
• {b} ⇒ {a, d}: c = 100%
DynaMine [Livshits & Zimmermann; FSE 2005]: a ⇒ b. Resorts to mining software revisions (co-added method calls) to find rule instances.
Other approaches: [Thummalapenta & Xie; ICSE 2009], [Weimer & Necula; TACAS 2005]
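The numbers on this slide can be reproduced with a short script. It is a brute-force sketch that enumerates every candidate itemset, which is fine for nine items but is not how a real miner (e.g., an Apriori-style algorithm) would proceed; the function names are hypothetical.

```python
from itertools import combinations

# Itemset database D from the slide.
D = [set("abcde"), set("abdef"), set("abdg"), set("achi")]


def support(itemset, db=D):
    """Number of transactions in db that contain every item of itemset."""
    items = set(itemset)
    return sum(1 for transaction in db if items <= transaction)


def frequent_itemsets(db, min_support):
    """All non-empty itemsets whose support is at least min_support (brute force)."""
    universe = sorted(set().union(*db))
    return [
        set(candidate)
        for size in range(1, len(universe) + 1)
        for candidate in combinations(universe, size)
        if support(candidate, db) >= min_support
    ]


def confidence(lhs, rhs, db=D):
    """c(A => B) = support(A | B) / support(A)."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)


F = frequent_itemsets(D, min_support=3)   # support > 2
c_a = confidence("a", "bd")               # {a} => {b, d}
c_ab = confidence("ab", "d")              # {a, b} => {d}
c_b = confidence("b", "ad")               # {b} => {a, d}
```

Running it yields the seven frequent itemsets listed above and confidences of 0.75, 1.0 and 1.0 for the three rules.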

Mining data invariants
Daikon [Ernst et al.; ICSE 1999]
Diduce [Hangal & Lam; ICSE 2002]
Invariant templates:
• x == c
• a <= x <= b
• x = a y + b z + c
• x = abs(y)
• x = max(y, z)
• x < y
• x == y, x + y == c, x - y == c
• sorted(x[])
• subsequence(x[], y[])
• c in x[], y in x[]
• strcmp(x, y) < 0
Dynamically discovered invariants are reported if the probability for them to be coincidental is < confidence threshold (e.g., prob(N_occur) < 0.01).
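Template-based invariant mining can be illustrated by instantiating a few templates over observed variable values and keeping those that no observation falsifies. The sketch below covers only four of the templates (constant, range, x < y, x == abs(y)) and deliberately omits the statistical confidence test mentioned above; `infer_invariants` and the sample data are hypothetical and do not reflect the API of Daikon or Diduce.

```python
from itertools import permutations


def infer_invariants(samples):
    """Return the template instances that hold on every observed sample.

    samples is a list of dicts mapping variable names to integer values,
    one dict per observation at the same program point.
    """
    names = sorted(samples[0])
    found = []

    # Unary templates: x == c, otherwise the observed range a <= x <= b.
    for v in names:
        values = [s[v] for s in samples]
        if len(set(values)) == 1:
            found.append(f"{v} == {values[0]}")
        else:
            found.append(f"{min(values)} <= {v} <= {max(values)}")

    # Binary templates over ordered pairs of distinct variables.
    for x, y in permutations(names, 2):
        if all(s[x] < s[y] for s in samples):
            found.append(f"{x} < {y}")
        if all(s[x] == abs(s[y]) for s in samples):
            found.append(f"{x} == abs({y})")
    return found


observations = [
    {"x": 3, "y": -3, "z": 7},
    {"x": 5, "y": 5, "z": 7},
    {"x": 1, "y": -1, "z": 7},
]
invariants = infer_invariants(observations)
```

With only three observations, invariants such as `x < z` may well be coincidental, which is exactly what the confidence threshold in a real tool is meant to filter out.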

Empirical validation
Mined oracles are unsound (FN ≥ 0) and incomplete (FP ≥ 0). Are they useful in practice?
Key research questions:
1. Missed faults (FN): how many faults are not exposed by the mined oracle?
2. False alarms (FP): how many false alarms are raised by the mined oracle?
3. Fault characterization (FC): is there a particular class of faults that is specifically addressed by the mined oracle? How relevant is such fault class?

Empirical studies
Oracle mining tool           FN   FP   FC
ADABU [WODA 2006]
kTail [Trans Comp 1972]
KLFA [ISSRE 2008]
Synoptic [FSE 2011]
LearnLib [STTT 2009]
OCD [ICSE 2010]
Perracotta [ICSE 2006]
DynaMine [FSE 2005]
Daikon [ICSE 1999]
Diduce [ICSE 2002]
Most experimental validations focus on the accuracy of the mined models/specs and conduct in-depth analysis of a few sample anomalies, without any attempt at a systematic evaluation.

Future work
Solid, empirical validation of mined oracles:
• Experimental framework
• Benchmark (programs, test cases, traces, faults, …)
• Key research questions
• Metrics
• Comparative evaluations
• Characterization by fault class
We (probably) do not need more oracle mining techniques; we (definitely) need to better understand and compare the effectiveness of existing techniques.
