SLIDE 1

Albert-Ludwigs-Universität Freiburg Fakultät für Angewandte Wissenschaften Dec 5, 2006

Alert classification to reduce false positives in intrusion detection

Tadeusz Pietraszek /tʌ'deuʃ pɪe'trʌʃek/

tadek@pietraszek.org

PhD Defense Presentation

SLIDE 2

Thesis Statement

The thesis lies at the intersection of machine learning and computer security:
  • 1. Using machine learning, it is possible to train classifiers in the form of human-readable classification rules by observing the human analyst.
  • 2. Abstaining classifiers can significantly reduce the number of misclassified alerts with an acceptable abstention rate and are useful in intrusion detection.
  • 3. Combining supervised and unsupervised learning in a two-stage alert-processing system forms a robust framework for alert processing.

SLIDE 3

Outline

  • Background and problem statement.
  • 1. Adaptive learning for alert classification.
  • 2. Abstaining classifiers.
  • 3. Combining supervised and unsupervised learning.

  • Summary and conclusions.
SLIDE 4

Intrusion Detection Background

  • Intrusion Detection Systems (IDSs) [And80, Den87] detect intrusions, i.e., sets of actions that attempt to compromise the integrity, confidentiality, or availability of a computing resource [HLMS90].
  • IDSs have to be effective (detect as many intrusions as possible) and keep false positives at an acceptable level; however, in real environments 95–99% of alerts are false positives [Axe99, Jul01, Jul03].
  • Eliminating false positives is a difficult problem:
  – intrusions may differ only slightly from normal actions (IDSs have limited context-processing capabilities),
  – writing a good signature is a difficult task (specific vs. general),
  – actions considered intrusive in one system may be normal in others,
  – viewed as a statistical problem – the base-rate fallacy.

SLIDE 5

Global picture – IDS monitoring

Manual knowledge acquisition is not used for classifying alerts.
  – Fact 1: Large database of historical alerts.
  – Fact 2: Analyst typically analyzes alerts in real time.

SLIDE 6

Problem statement

  • Given

  – A sequence of alerts (A1, A2, …, Ai, …) in an alert log L
  – A set of classes C = {C1, C2, …, Cn}
  – An intrusion detection analyst O sequentially and in real time assigning classes to alerts
  – A utility function U describing the value of a classifier to the analyst O

  • Find

– A system classifying alerts, maximizing the utility function U

  • Misclassified alerts
  • Analyst’s workload
  • Abstentions
SLIDE 7

Outline

  • Background and problem statement.
  • 1. Adaptive learning for alert classification.
  • 2. Abstaining classifiers.
  • 3. Combining supervised and unsupervised learning.

  • Summary and conclusions.
SLIDE 8

ALAC (Adaptive Learner for Alert Classification)

Automatically learn an alert classifier based on analyst’s feedback using machine learning techniques.

[Architecture: the IDS sends alerts to the alert classifier (rules and parameters, plus background knowledge); classified alerts go to the ID analyst, whose feedback provides training examples for the machine-learning component, which updates the rules and the model.]

Recommender mode:
  • Misclassifications
SLIDE 9

ALAC (Adaptive Learner for Alert Classification)

[Architecture: as in recommender mode, but with a confidence test – alerts classified with sufficient confidence are processed automatically and not shown to the analyst; the remaining alerts go to the ID analyst, whose feedback is used to update the model. See the sketch below.]

Agent mode:
  • Misclassifications
  • Analyst’s workload
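A minimal Python sketch of the agent-mode control loop described above. The classifier interface (predict, should_retrain, retrain), the confidence threshold and the sampling rate are illustrative assumptions, not the thesis implementation.

import random

def agent_mode(alerts, classifier, analyst_label, confidence_threshold=0.9, sampling_rate=0.1):
    # Agent-mode loop (sketch): confident false-positive predictions are discarded
    # automatically; everything else (plus a random audit sample) goes to the
    # analyst, whose labels become new training examples.
    training, processed = [], []
    for alert in alerts:
        label, confidence = classifier.predict(alert)          # assumed API
        audit = random.random() < sampling_rate                # keep sampling the auto-discarded stream
        if label == "FALSE_POSITIVE" and confidence >= confidence_threshold and not audit:
            processed.append((alert, label))                   # handled without the analyst
        else:
            true_label = analyst_label(alert, suggestion=label)    # recommender-style feedback
            processed.append((alert, true_label))
            training.append((alert, true_label))
            if classifier.should_retrain(len(training)):       # assumed API: batch-incremental retraining
                classifier.retrain(training)
    return processed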
SLIDE 10

Why does learning work and why can it be difficult?

  • The approach hinges on two assumptions:
  – Analysts are able to classify most alerts correctly.
  – It is possible to learn a classifier from historical alerts.
  • It is a difficult learning problem. The system should:
  1. Use the analyst’s feedback (learning from training examples).
  2. Generate the rules in a human-readable form (so their correctness can be verified).
  3. Be efficient for large data files.
  4. Use background knowledge.
  5. Assess the confidence of classification.
  6. Work with skewed class distributions / misclassification costs.
  7. Adapt to environment changes.

SLIDE 11

Requirements - revisited

  • 1. Core algorithm – RIPPER.
  • 2. Rules in readable form.
  • 3. Efficient on large datasets.
  • 4. Background knowledge represented in attribute–value form.
  • 5. Confidence – rule performance on testing data with Laplace correction (see the sketch below).
  • 6. Cost sensitivity – weighted examples.
  • 7. Incremental learning – “batch-incremental approach”; the batch size depends on the current classification accuracy.
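For requirement 5, a small sketch of the standard Laplace-corrected confidence estimate. The function name and the plain (p + 1)/(n + k) form are assumptions; the thesis only states that rule performance on testing data is Laplace-corrected.

def laplace_confidence(covered_correct, covered_total, n_classes=2):
    # Laplace-corrected estimate of a rule's accuracy on the examples it covers:
    # (p + 1) / (n + k), with p correct covered examples, n covered examples,
    # k classes. A rule covering 8 examples, 7 correctly, gets
    # (7 + 1) / (8 + 2) = 0.8 instead of the optimistic 7/8.
    return (covered_correct + 1) / (covered_total + n_classes)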

SLIDE 12

Results - Thesis Statement (1)

Adaptive Learner for Alert Classification (ALAC)

  • Human feedback, background knowledge, ML techniques.
  – Recommender mode (focusing on the misclassifications in the utility function U):
    • Good performance: fn = 0.025, fp = 0.038 (DARPA); fn = 0.003, fp = 0.12 (Data Set B).
  – Agent mode (focusing on the misclassifications and the workload in the utility function U):
    • Similar number of misclassifications, and more than 66% of false positives are discarded automatically.
  – Many rules are interpretable.

SLIDE 13

Outline

  • Background and problem statement.
  • 1. Adaptive learning for alert classification.
  • 2. Abstaining classifiers.
  • 3. Combining supervised and unsupervised learning.

  • Summary and conclusions.
SLIDE 14

Metaclassifier Aα,β

An abstaining binary classifier A is a classifier that in certain cases can refrain from classification. We construct it from two binary classifiers Cα, Cβ as follows:

$$A_{\alpha,\beta}(x) = \begin{cases} + & \text{if } C_\alpha(x) = + \\ ? & \text{if } C_\alpha(x) = - \,\wedge\, C_\beta(x) = + \\ - & \text{if } C_\beta(x) = - \end{cases}$$

where Cα, Cβ are such that

$$\forall x:\; \bigl(C_\alpha(x) = + \Rightarrow C_\beta(x) = +\bigr) \;\wedge\; \bigl(C_\beta(x) = - \Rightarrow C_\alpha(x) = -\bigr)$$

    Cα   Cβ   Result
    +    +    +
    −    +    ?
    −    −    −
    +    −    impossible

(These are the conditions used by Flach & Wu [FW05] in their work on repairing concavities of ROC curves; they are met in particular if Cα, Cβ are constructed from a single scoring classifier R.) Can we optimally select Cα, Cβ?
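A minimal sketch of the construction hinted at above: when Cα and Cβ come from a single scoring classifier R, two thresholds θα ≥ θβ automatically satisfy the required condition. Names and the threshold convention are illustrative.

def abstaining_metaclassifier(score, theta_alpha, theta_beta):
    # C_alpha and C_beta are one ranker R thresholded at theta_alpha >= theta_beta,
    # so C_alpha(x) = "+" implies C_beta(x) = "+", as required above.
    assert theta_alpha >= theta_beta
    c_alpha = score >= theta_alpha
    c_beta = score >= theta_beta
    if c_alpha:        # both classifiers say "+"
        return "+"
    if c_beta:         # the two classifiers disagree -> abstain
        return "?"
    return "-"         # both say "-"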

SLIDE 15

“Optimal” Metaclassifier Aα,β

How do we compare binary classifiers and abstaining classifiers? How do we select an optimal classifier? There is no single answer; we consider:
  – a cost-based model (an extension of [Tor04]),
  – boundary conditions:
    • a maximum number of instances classified as “?” (bounded-abstention model),
    • a maximum misclassification cost (bounded-improvement model).
SLIDE 16

Cost-based model – a simulated example

[Figure: an ROC curve with two optimal classifiers A and B, and surface plots of the misclassification cost for different combinations of A and B as a function of FP(a) and FP(b).]

The optimal Cα and Cβ satisfy

$$f'_{ROC}(fp_\alpha) = \frac{c_{21} - c_{23}}{c_{13}}\cdot\frac{N}{P}, \qquad f'_{ROC}(fp_\beta) = \frac{c_{23}}{c_{12} - c_{13}}\cdot\frac{N}{P}$$
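A small sketch that turns the slope conditions reconstructed above into code: given a 2×3 cost matrix and the class sizes N, P, it returns the target ROCCH slopes for Cα and Cβ, and refuses when the non-triviality condition on the 2×3 matrix (discussed later in the deck) is violated. The numeric values in the usage comment are made up for illustration.

def target_slopes(c12, c21, c13, c23, N, P):
    # Target ROCCH slopes for C_alpha and C_beta; returns None when only a
    # trivial binary classifier is optimal.
    nontrivial = (c12 > c13) and (c21 > c23) and (c12 * c21 >= c13 * c21 + c23 * c12)
    if not nontrivial:
        return None
    slope_alpha = (c21 - c23) / c13 * N / P
    slope_beta = c23 / (c12 - c13) * N / P
    return slope_alpha, slope_beta

# Illustrative values only:
# target_slopes(c12=1.0, c21=50.0, c13=0.5, c23=0.5, N=10000, P=500)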

SLIDE 17

Bounded models

Problem: a 2×3 cost matrix is not always given and would have to be estimated; moreover, the resulting classifier is very sensitive to c13 and c23. We therefore look for other optimization criteria for an abstaining classifier that use only a standard 2×2 cost matrix:
  – calculate the misclassification cost per classified instance,
  – follow the same reasoning to find the optimal classifier.

SLIDE 18

Bounded models equation

We obtain the following equations, which determine the relationship between k and rc as a function of the classifiers Cα, Cβ:

$$rc = \frac{1}{(1-k)(N+P)}\bigl(c_{21}\,FP_\alpha + c_{12}\,FN_\beta\bigr)$$

$$k = \frac{1}{N+P}\bigl((FP_\beta - FP_\alpha) + (FN_\alpha - FN_\beta)\bigr)$$

  – Constrain k, minimize rc → bounded-abstention model.
  – Constrain rc, minimize k → bounded-improvement model.

There is no algebraic solution; however, for a convex ROCCH we can give an efficient algorithm.
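The two quantities expressed as a small helper, assuming the confusion counts of Cα and Cβ are known; a sketch only, with illustrative names.

def rc_and_k(FP_a, FN_a, FP_b, FN_b, N, P, c12, c21):
    # k  = fraction of instances the metaclassifier abstains on;
    # rc = misclassification cost per *classified* instance (equations above).
    k = ((FP_b - FP_a) + (FN_a - FN_b)) / (N + P)
    rc = (c21 * FP_a + c12 * FN_b) / ((1.0 - k) * (N + P))
    return rc, k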

SLIDE 19

Bounded-abstention model

Among classifiers abstaining on no more than a fraction kMAX of instances, find the one that minimizes rc.

This is useful in real-time processing, where the non-classified instances will be processed by another classifier with a limited processing speed.

Algorithm: a three-step derivation.
  – Step 1: Show an (impractical) solution for a smooth ROCCH and the equality k = kMAX.
  – Step 2: Extend it to the inequality k ≤ kMAX.
  – Step 3: Derive an algorithm for a piecewise-linear ROCCH.

SLIDE 20

Bounded-abstention model – Step 1 and 2

Using the Lagrange method (constrained optimization under equality conditions), ∇rc × ∇k = 0, we obtain

$$f'_{ROC}(fp_\beta)\left(f'_{ROC}(fp_\alpha) + \frac{N}{P}\Bigl(1 - \frac{c_{21}}{c_{12}}\Bigr)\right) = \left(\frac{N}{P}\right)^{2}\frac{c_{21}}{c_{12}}$$

Starting from a known optimal classifier for a given k, we can construct an optimal classifier path for k + δk.
  – The known points can be either the optimal binary classifier or the all-abstaining classifier.
  – Such a solution is impractical.

We can show that, except for a very special boundary case, this classifier is also optimal for k ≤ kMAX.

SLIDE 21

Bounded-abstention model – Step 3

The ROCCH consists of line segments connecting points Pi, Pi+1 with coefficients Ai and Bi such that tp = Ai·fp + Bi. Using reasoning similar to Step 1 we obtain that:
  – either Cα or Cβ is located on a vertex Pi or Pj,
  – the optimal classifier depends on the sign of X:

$$X = A_j\left(A_i + \frac{N}{P}\Bigl(1 - \frac{c_{21}}{c_{12}}\Bigr)\right) - \left(\frac{N}{P}\right)^{2}\frac{c_{21}}{c_{12}}$$

  – Case 1 (X > 0): Pi, Pj, Pi+1;  Case 2 (X < 0): Pi, Pj, Pj−1.

This yields an O(n) algorithm for finding the optimal classifier.
SLIDE 22

Bounded-abstention model – a simulated example

[Figure: surface and path plots of the misclassification cost as a function of FP(a) and FP(b), showing the optimal classifier path in the bounded-abstention model.]

SLIDE 23

How can we use it in ALAC? ALAC+

The ALAC architecture naturally fits a tri-state classifier.

[Architecture: the IDS sends alerts to a tri-state alert classifier; alerts assigned a class (+/−) are processed automatically, while alerts classified as “?” (abstentions) go to the ID analyst, whose feedback provides training examples for learning the tri-state classifier.]

SLIDE 24

Results - Thesis Statement (2)

  • We applied abstaining classifiers to alert classification (ALAC+):
  – Recommender mode:
    • DARPA: comparable fn, significantly lower fp (up to 97%), cost reduction by 15–20%.
    • Data Set B: lowered fn (76%) and fp (97%), cost reduction by 87%.
  – Agent mode:
    • DARPA: comparable fn, much lower fp, comparable cost.
    • Data Set B: lowered fn (60%) and fp (96%), cost reduction by 72%.
  – ALAC+ reduced the overall number of misclassifications (in particular fp) and, in most cases, the misclassification costs.
  – Higher precision is better for human analysts [Axe99].

SLIDE 25

Outline

  • Background and problem statement.
  • 1. Adaptive learning for alert classification.
  • 2. Abstaining classifiers.
  • 3. Combining supervised and unsupervised learning.
  • Summary, conclusions and contributions.

SLIDE 26

Clustering (CLARAty)

  • Julisch [Jul03] observed that a great number of alerts can be attributed to a small number of root causes, which are persistent over time.
  – Julisch used a modified AOI [Jul03] to generate human-readable cluster descriptions.
  – Root causes can be identified and removed.
  • Inputs:
  – Alerts
  – Generalization hierarchies (mostly for IP addresses)
  • Outputs:
  – Clusters (in the form of generalized alerts)

SLIDE 27

Two-stage alert classification system

[Architecture: alerts from the IDS first go through alert clustering; the analyst interprets the clusters and finds root causes, which feed back into the environment (investigating network and configuration problems, investigating intrusions), into the IDS (filtering rules, modified signatures) and into an alert filter. The remaining alerts are passed to the adaptive alert classifier (agent mode with a confidence test), which is trained from the analyst’s feedback using background knowledge, rules and parameters.]

Alert clustering + adaptive alert classification:
  • CLARAty is used for filtering and labeling alerts:
  – Filtering mode (FI)
  – Feature-construction mode (FC)
  • Alerts are subsequently passed on to ALAC (see the sketch below).
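A minimal sketch of the two-stage pipeline in the two modes named above; the cluster-matching and ALAC interfaces are assumed helpers, not the thesis code.

def two_stage(alerts, fp_clusters, alac, mode="FI"):
    # Stage 1: CLARAty-style FP-only clusters either filter alerts (FI) or add a
    # cluster-id feature (FC). Stage 2: the remaining alerts go through ALAC.
    for alert in alerts:
        cluster = next((c for c in fp_clusters if c.matches(alert)), None)  # assumed helper
        if cluster is not None and mode == "FI":
            continue                              # filtered as a known false positive
        if cluster is not None and mode == "FC":
            alert["cluster_id"] = cluster.id      # feature construction for stage 2
        yield alac.classify(alert)                # assumed ALAC interface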
SLIDE 28

Results - Thesis Statement (3)

  • We proposed a two-stage alert classification system based on CLARAty [Jul03]:
  – using clusters for retrospective alert analysis,
  – an automated cluster-processing system,
  – a two-stage alert-processing system.
  • Feature construction (FC) does not yield significant improvements.
  • Filtering (FI) performs better (in terms of FN) and comparably (in terms of FP), most likely because the “easy” alerts have already been removed.
  • Thanks to the first stage, the number of alerts to be processed in the second stage (the analyst’s workload) has been reduced by 63%.

SLIDE 29

Outline

  • Background and problem statement.
  • 1. Adaptive learning for alert classification.
  • 2. Abstaining classifiers.
  • 3. Combining supervised and unsupervised learning.

  • Summary and conclusions.
SLIDE 30

Conclusions

  • Evolution of IDSs:
  – Level 1: Improving IDSs themselves
  – Level 2: Leveraging the environment
  – Level 3: Alert postprocessing
  – Level 4: Analyst’s involvement
  • Used ML techniques for IDS alert classification.
  • Verified the three-part thesis statement.
  • The system works, but there is an inherent risk that some attacks might be missed.
  • A step towards a more efficient and reliable alert-management system.

SLIDE 31

Thank you!

SLIDE 32

Future Work

  • Combining with existing multi-stage alert correlation systems.
  • Other learning algorithms: SVM, Bayesian, predictive clustering rules?
  • Multi-class classification.
  • Link mining.
  • Dynamic ROC evaluation in incremental settings.
  • HCI issues.

SLIDE 33

Can Machine Learning be secure? [NKS06], [BNSJ+06] ML does not deal with active attackers [CB06]:
  – “The Mutagenesis dataset never tried to evade your classifier.”

All automated classification systems bear a certain risk (it is a matter of trade-offs!):
  – an attacker may try to hide their activities among background alerts, hoping to evade detection,
  – BUT they do this anyway, because such attacks already have a lower chance of being caught!

  • By removing irrelevant alerts the system can highlight the important ones, but there is no guarantee.
  • It is also possible that this effect is amplified by ALAC.
SLIDE 34

Can Machine Learning be secure? [NKS06], [BNSJ+06]

Good news:
  – ALAC does not provide immediate feedback.
  – The interaction with background knowledge is complex.
  – There are only so many attacks the attacker can try.
  – Such attempts might be treated as noise.

Bad news:
  – There is no guarantee.
  – Once such systems are common, they may turn into an “arms race” (cf. spam). But for this to happen, IDSs would have to be much better than they are now. Let’s see how spam filters and automated signature generation deal with it first ;-)

SLIDE 35

Publication List

  • Tadeusz Pietraszek. On the use of ROC analysis for the optimization of abstaining classifiers. Machine Learning Journal, (accepted with minor revisions, to appear), 2007.
  • Tadeusz Pietraszek. Classification of intrusion detection alerts using abstaining classifiers. Intelligent Data Analysis Journal, 11(3):(to appear), 2007.
  • Tadeusz Pietraszek and Axel Tanner. Data Mining and Machine Learning – Towards Reducing False Positives in Intrusion Detection. Information Security Technical Report Journal, 10(3):169–183, 2005.
  • Tadeusz Pietraszek and Chris Vanden Berghe. Defending against Injection Attacks through Context-Sensitive String Evaluation. In Recent Advances in Intrusion Detection (RAID 2005), volume 3858 of Lecture Notes in Computer Science, pages 124–145, Seattle, WA, 2005.
  • Tadeusz Pietraszek. Optimizing Abstaining Classifiers using ROC Analysis. In Proceedings of the 22nd International Conference on Machine Learning (ICML 2005), pages 665–672, Bonn, Germany, 2005.
  • Tadeusz Pietraszek. Using Adaptive Alert Classification to Reduce False Positives in Intrusion Detection. In Recent Advances in Intrusion Detection (RAID 2004), volume 3324 of Lecture Notes in Computer Science, pages 102–124, Sophia Antipolis, France, 2004.

SLIDE 36

References (1)

  • [And80] James P. Anderson. Computer security threat monitoring and surveillance. Technical report, James P. Anderson Co., 1980.
  • [Axe05] Stefan Axelsson. Understanding Intrusion Detection Through Visualization. PhD thesis, Chalmers University of Technology, 2005.
  • [Axe99] Stefan Axelsson. The base-rate fallacy and its implications for intrusion detection. In Proceedings of the 6th ACM Conference on Computer and Communications Security, pages 1–7, Kent Ridge Digital Labs, Singapore, 1999.
  • [BNSJ+06] Marco Barreno, Blaine Nelson, Russell Sears, Anthony D. Joseph, and J. D. Tygar. Can Machine Learning Be Secure? In Proceedings of the 2006 ACM Symposium on Information, Computer and Communications Security, pages 16–25, Taipei, Taiwan, 2006.
  • [Bugtraq03] SecurityFocus. BugTraq. Web page at http://www.securityfocus.com/bid, 1998–2004.
  • [CAMB02] Frederic Cuppens, Fabien Autrel, Alexandre Miege, and Salem Benferhat. Correlation in an intrusion detection process. In Proceedings of Sécurité des Communications sur Internet (SECI02), pages 153–171, 2002.
  • [CB06] Alvaro A. Cárdenas and John S. Baras. Evaluation of Classifiers and Learning Rules: Considerations for Security Applications. In Proceedings of the AAAI 2006 Workshop on Evaluation Methods for Machine Learning, Boston, Massachusetts, July 17, 2006.
  • [DC01] Olivier Dain and Robert K. Cunningham. Fusing a heterogeneous alert stream into scenarios. In Proceedings of the 2001 ACM Workshop on Data Mining for Security Applications, pages 1–13, Philadelphia, PA, 2001.
  • [Den87] Dorothy E. Denning. An intrusion detection model. IEEE Transactions on Software Engineering, SE-13(2):222–232, 1987.

SLIDE 37

References (2)

  • [Der03] Renaud Deraison. The Nessus Project. Web page at http://www.nessus.org, 2000–2003.
  • [DW01] Herve Debar and Andreas Wespi. Aggregation and correlation of intrusion-detection alerts. In Recent Advances in Intrusion Detection (RAID 2001), volume 2212 of Lecture Notes in Computer Science, pages 85–103. Springer-Verlag, 2001.
  • [FW05] P. A. Flach and S. Wu. Repairing concavities in ROC curves. In Proceedings of the 2003 UK Workshop on Computational Intelligence, pages 38–44, Bristol, UK, 2003.
  • [HLMS90] Richard Heady, George Luger, Arthur Maccabe, and Mark Servilla. The architecture of a network level intrusion detection system. Technical report, University of New Mexico, 1990.
  • [How97] John D. Howard. An Analysis of Security Incidents on the Internet 1989–1995. PhD thesis, Carnegie Mellon University, 1997.
  • [IBM03] IBM. IBM Tivoli Risk Manager. Tivoli Risk Manager User's Guide, Version 4.1, 2002.
  • [Jul01] Klaus Julisch. Mining Alarm Clusters to Improve Alarm Handling Efficiency. In Proceedings of the 17th Annual Computer Security Applications Conference, pages 12–21, New Orleans, LA, December 2001.
  • [Jul03a] Klaus Julisch. Clustering intrusion detection alarms to support root cause analysis. ACM Transactions on Information and System Security (TISSEC), 6(4):443–471, 2003.
  • [Jul03b] Klaus Julisch. Using Root Cause Analysis to Handle Intrusion Detection Alarms. PhD thesis, University of Dortmund, Germany, 2003.
  • [Krs98] Ivan Victor Krsul. Software Vulnerability Analysis. PhD thesis, Purdue University, 1998.

SLIDE 38

References (3)

  • [LBMC94] Carl E. Landwehr, Alan R. Bull, John P. McDermott, and William S. Choi. A taxonomy of computer program security flaws. ACM Computing Surveys (CSUR), 26(3):211–254, 1994.
  • [LWS02] Richard Lippmann, Seth Webster, and Douglas Stetson. The effect of identifying vulnerabilities and patching software on the utility of network intrusion detection. In Recent Advances in Intrusion Detection (RAID 2002), volume 2516 of Lecture Notes in Computer Science, pages 307–326. Springer-Verlag, 2002.
  • [MC03] Matthew V. Mahoney and Philip K. Chan. An analysis of the 1999 DARPA/Lincoln Laboratory evaluation data for network anomaly detection. In Recent Advances in Intrusion Detection (RAID 2003), volume 2820 of Lecture Notes in Computer Science, pages 220–237. Springer-Verlag, 2003.
  • [McH00] John McHugh. The 1998 Lincoln Laboratory IDS evaluation. A critique. In Recent Advances in Intrusion Detection (RAID 2000), volume 1907 of Lecture Notes in Computer Science, pages 145–161. Springer-Verlag, 2000.
  • [MIT03] MITRE. Common Vulnerabilities and Exposures. Web page at http://cve.mitre.org, 1999–2004.
  • [MHL94] Biswanath Mukherjee, Todd L. Heberlein, and Karl N. Levitt. Network intrusion detection. IEEE Network, 8(3):26–41, 1994.
  • [NKS06] James Newsome, Brad Karp, and Dawn Song. Paragraph: Thwarting Signature Learning by Training Maliciously. In Recent Advances in Intrusion Detection (RAID 2006), Hamburg, Germany, 2006.
  • [PB88] Mark Paradies and David Busch. Root cause analysis at Savannah River Plant. In Proceedings of the IEEE Conference on Human Factors and Power Plants, 1988.

SLIDE 39

References (4)

  • [PV05] Tadeusz Pietraszek and Chris Vanden Berghe. Defending against injection attacks through context-sensitive string evaluation. In Recent Advances in Intrusion Detection (RAID 2005), volume 3858 of Lecture Notes in Computer Science, pages 124–145, Seattle, WA, 2005. Springer-Verlag.
  • [RZD05] James Riordan, Diego Zamboni, and Yann Duponchel. Billy Goat, an accurate worm-detection system (revised version) (RZ 3609). Technical report, IBM Zurich Research Laboratory, 2005.
  • [SP01] Umesh Shankar and Vern Paxson. Active mapping: Resisting NIDS evasion without altering traffic. In Proceedings of the 2003 IEEE Symposium on Security and Privacy, pages 44–62, Oakland, CA, 2003.
  • [SP03] Robin Sommer and Vern Paxson. Enhancing byte-level network intrusion detection signatures with context. In Proceedings of the 10th ACM Conference on Computer and Communications Security, pages 262–271, Washington, DC, 2003.
  • [VS01] Alfonso Valdes and Keith Skinner. Probabilistic alert correlation. In Recent Advances in Intrusion Detection (RAID 2001), volume 2212 of Lecture Notes in Computer Science, pages 54–68. Springer-Verlag, 2001.
  • [VVCK04] F. Valeur, G. Vigna, C. Kruegel, and R. Kemmerer. A comprehensive approach to intrusion detection alert correlation. IEEE Transactions on Dependable and Secure Computing, 1(3):146–169, 2004.
  • [WF00] Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers, San Francisco, CA, 2000.

SLIDE 40

Security – supporting slides

SLIDE 41

Computer Security

  • Confidentiality – prevention of (un)intentional unauthorized disclosure of data.
  • Integrity – prevention of (un)intentional unauthorized modification of data.
  • Availability – prevention of unauthorized withholding of computing resources.
  • Intrusion – any set of actions that attempt to compromise the confidentiality, integrity or availability of a computing resource.

SLIDE 42

Intrusion detection

The traditional approach to security is to build a “protective shield” around systems [MHL94]:
  – there is a trade-off between security and usability,
  – open systems are more productive,
  – “secure systems” are vulnerable to attacks exploiting internal errors (e.g., buffer overflows, injection attacks, race conditions),
  – systems are vulnerable to insider attacks (intentional or unintentional).

Intrusion detection [Den87]: retrofit systems with security by detecting attacks and alerting an SSO.

SLIDE 43

Intrusion Detection Systems (IDS)

  • An Intrusion Detection System is an automated system for detecting and alerting on any situation where an intrusion has taken place or is about to take place [Axe05].

[Architecture: the monitored system feeds audit collection and audit storage; a processing component uses active/processing data, reference data and configuration data to raise alerts to the SSO/ID analyst, who issues a manual response; automated responses are also possible.]

SLIDE 44

Intrusion Detection Systems: anomaly vs. misuse

  • Anomaly-based model.
  • Misuse-based model – detects only known attacks. Why not just prevent them if they are known?
  – Window of vulnerability
  – Detecting failed attacks
  – Detecting policy violations
  – Additional layer of protection
  – Generalized and intent-guessing signatures

SLIDE 45

Snort – an open source IDS

SLIDE 46

Snort – signature examples

SLIDE 47

Striving to reduce false positives

  • Level 1: Improving IDSs themselves
  – More sophisticated protocol analyzers, state keeping [Roe05, Pax99]
  – Highly specialized IDSs: Billy Goat [RZD05], CSSE [PV05]
  • Level 2: Leveraging the environment
  – Active mapping [SP01], context signatures [SP03]
  – Vulnerability correlation [LWS02, VVCK04]
  • Level 3: Alert postprocessing
  – Data mining [Jul03b]
  – Alert correlation systems [CAMB02, DW01, VS01, VVCK04]
  • Level 4: Analyst’s involvement
  – The idea pursued in this thesis, mostly orthogonal to the other approaches.

[Figure: the four levels placed along the path IDS → Alerts → ID Analyst.]

SLIDE 48

Binary vs. multi-class classification

  • Analysts analyze:
  – the root cause of alerts [PB88],
  – the impact on the environment,
  – the actions that need to be taken.
  • Taxonomizing root causes is a difficult task [How97, Jul03b, Krs98, LBMC94].
  • Ad-hoc classifications exist, for example:
  – intentional/malicious (e.g., scanning, unauthorized access, privilege escalation, policy violation, DoS attack),
  – inadvertent/non-malicious (e.g., network misconfiguration, normal activities).
  • The main distinction for the analyst is “Is the alert actionable or not?”
  – This is determined by the combination of the root cause and the impact on the environment.
  – For our purposes we assume that this is equivalent to our two classes: true positives and false positives.

SLIDE 49

ALAC – supporting slides

SLIDE 50

Evaluation problem – two datasets [Pie04, PT05]

  • DARPA 1999 Data Set
  – Used network traces, run through the Snort IDS
  – Used attack truth tables to label the alerts
  • Data Set B
  – Real network traces collected in a mid-sized corporate network
  – Used the Snort IDS to generate alerts
  – Manually labeled (bias!)

SLIDE 51

Evaluation Problem – new dataset

  • MSSD Datasets
  – Real datasets from MSSD; different commercial NIDSs, some companies with more than one.
  – We looked at some 20 companies over a time period of 6 months.
  – Some alerts belong to incidents, labeled by security analysts.

SLIDE 52

Evaluation problem

  • There is a lack of publicly available data sources for the evaluation of IDSs:
  – No common reference for evaluation.
  – Everybody can install an IDS in their own network:
    • yes, but this data often cannot be shared (sensitive information),
    • and it has no labels.
  – Honeypot data [PDP05]:
    • all data is by definition suspicious,
    • more useful for detecting automated attacks than real attackers.
  • DARPA 1998 and DARPA 1999 efforts:
  – MIT Lincoln Labs simulated environment.
  – Many flaws have been identified [McH01, MC03].
  – Still used in many papers (e.g., UCI Dataset and KDD CUP 1999).
  • A recent effort was presented at ETRICS 2006 (Qian et al.).
  • Proprietary data:
  – Data Set B: undisclosed customer, collected with Snort, classified by the author.
  – MSSD Data Sets: data from IBM’s SOC, implicitly classified by real security analysts.

SLIDE 53

Evaluation Problem – Summary

  • These datasets are quite different!
  – DARPA 1999 Data Set & Data Set B:
    • On average 1472 alerts per company per day, out of which 359 are true positives (24%).
  – MSSD Dataset:
    • On average 3250 alerts per company per day, out of which 11 are true positives (0.34%).
    • Most alerts are clustered in incidents, on average 1 incident every 9 days.
  – Moreover, we are not sure whether all the labels are correct:
    • some incidents could have been missed,
    • some incidents may have turned out to be false positives,
    • we should probably handle them differently.
SLIDE 54

Background Knowledge

  • Network topology
  – Classification of IP addresses
  – Create rules using generalized concepts
  • Installed software
  • Alert semantics
  – How do we understand the attack? CVE [MIT03], Bugtraq [Bugtraq03]
  – Was the attack successful? IDD [IBM03], Nessus [Der03]
  • Alert context, i.e., alerts related to the current one (correlation in intrusion detection, e.g. [DW01, DC01, VS01])
  – Set or sequence of alerts related to the current one
  – Additional features (aggregates, alert summaries, alert categories), expressing domain knowledge in intrusion detection

SLIDE 55

Background Knowledge

Alerts with their classification have been written to a relational database. We use scripts to generate background knowledge in attribute–value form:
  – IP address classification
  – OS classification
  – Aggregates, all in three different time windows – 1 min, 5 min, 30 min (see the sketch below):
    • number of alerts coming from the same IP addresses (src, dst, src–dst),
    • number of alerts of the same type,
    • number of alerts with a similar classification.
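A minimal sketch of one such aggregate (the 1-minute same-source-IP count), assuming time-sorted alert tuples; attribute names and the tuple layout are illustrative.

from collections import deque

def windowed_counts(alerts, window_seconds=60):
    # alerts: (timestamp, src_ip, dst_ip, signature) tuples sorted by time.
    # For each alert, count how many alerts in the preceding window share its source IP.
    recent = deque()
    counts = []
    for ts, src, dst, sig in alerts:
        while recent and ts - recent[0][0] > window_seconds:
            recent.popleft()
        counts.append(sum(1 for _, s, _, _ in recent if s == src))
        recent.append((ts, src, dst, sig))
    return counts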
SLIDE 56

Misclassification, Statistics (DARPA1999)

[Plots: false negative rate (fn) and false positive rate (fp) versus the number of alerts processed by the system, for agent mode with sampling rates 0.1, 0.25 and 0.5, recommender mode, and batch classification.]

SLIDE 57

Misclassification, Statistics (Data Set B)

[Plots: false negative rate (fn) and false positive rate (fp) versus the number of alerts processed by the system (Data Set B), for agent mode with sampling rates 0.1, 0.25 and 0.5, recommender mode, and batch classification.]

SLIDE 58

Automatic Processing

[Plots: discarded false positive rate versus the number of alerts processed by the system, for agent mode with sampling rates 0.1, 0.25 and 0.5, on the DARPA 1999 data (left) and Data Set B (right).]

SLIDE 59

Understanding the Rules

The rules are quite understandable. They use attributes generated from the background knowledge.

(cnt_intr_w1 <= 0) and (cnt_sign_w3 >= 1) and (cnt_sign_w1 >= 1) and (cnt_dstIP_w1 >= 1) => class=FALSE
(cnt_srcIP_w3 <= 6) and (cnt_int_w2 <= 0) and (cnt_ip_w2 >= 2) and (sign = ICMP PING NMAP) => class=FALSE

“If there have been similar alerts recently and they were all false alarms (no intrusions), then the current alert is a false alert.”
“If the number of NMAP pings is small and there are no intrusions, the alert is a false alert.”

SLIDE 60

Experiments - Setting ALAC Parameters

  • Using the ROC curve, one can choose the optimal classifier:
  – one needs to know the target class distributions and misclassification costs;
  – we did not have such data, so we selected the value ad hoc: CR = 50 (more on this later!).
  • Classification accuracy:
  – when to retrain the model;
  – we selected a value based on the performance in ROC analysis.
  • Automatic processing – confidence:
  – currently chosen ad hoc; we are looking for something better (more on this later!).
SLIDE 61

ML – supporting slides

SLIDE 62

Evaluating classifiers – confusion and cost matrices

Cost matrix (A = actual, C = classified as):

    A\C    +      −
    +      0      c12
    −      c21    0

Confusion matrix:

    A\C    +      −
    +      TP     FN     P
    −      FP     TN     N

$$tp = \frac{TP}{TP+FN}, \quad fn = \frac{FN}{TP+FN}, \quad fp = \frac{FP}{FP+TN}, \quad tn = \frac{TN}{FP+TN}, \quad CR = \frac{c_{21}}{c_{12}}$$
SLIDE 63

ROC Background

ROC (Receiver Operating Characteristic) analysis is used for model evaluation and model selection for binary classifiers – multiple-class extensions are not used in practice. It allows one to evaluate model performance under all class and cost distributions:
  – a 2D plot of fp × tp (X axis – false positive rate, Y axis – true positive rate),
  – one point corresponds to one classifier.

SLIDE 64

ROC Background

A classifier C produces a single point (fp, tp) in ROC space. A classifier Cτ (or a machine-learning method Lτ) has a parameter τ; varying it produces multiple points. Therefore we consider an ROC curve a function f: τ ↦ (fpτ, tpτ). We can find an inverse function f⁻¹: (fpτ, tpτ) ↦ τ and approximate it with f̂⁻¹.
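A short sketch of how such a curve is obtained empirically from a scoring classifier: each threshold τ yields one (fp, tp) point. It assumes binary labels and that higher scores mean "more positive".

def roc_points(scores, labels):
    # Each threshold tau gives one classifier C_tau and hence one (fp, tp) point.
    P = sum(1 for y in labels if y)
    N = len(labels) - P
    points = []
    for tau in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= tau and y) / P
        fp = sum(1 for s, y in zip(scores, labels) if s >= tau and not y) / N
        points.append((fp, tp))
    return sorted(points)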

SLIDE 65

ROC Background

The ROC Convex Hull (ROCCH) is a piecewise-linear, convex-down curve fR with the following properties:
  • fR(0) = 0, fR(1) = 1;
  • the slope of fR is monotonically non-increasing;
  – we assume that for any value m there exists a point with f′R(x) = m [PF98]:
    • vertices have “slopes” assuming values between the slopes of the adjacent edges,
    • we assume sentinel edges: a 0th edge with slope ∞ and an (n+1)th edge with slope 0.
  – We will use the ROCCH instead of the ROC curve.

SLIDE 66

Abstaining Classifiers – supporting slides

SLIDE 67

Selecting the Optimal Classifier

The criterion is to minimize the misclassification cost:

$$rc = \frac{1}{N+P}\bigl(c_{21}\,FP + c_{12}\,FN\bigr) = \frac{1}{N+P}\Bigl(c_{21}\,N\,fp + c_{12}\,P\,\bigl(1 - f_{ROC}(fp)\bigr)\Bigr)$$

$$\frac{d\,rc}{d\,fp} = 0 \;\Rightarrow\; f'_{ROC}(fp) = \frac{c_{21}}{c_{12}}\cdot\frac{N}{P}$$

SLIDE 68

Cost Minimizing Criteria for One Classifier

$$f'_{ROC}(fp) = CR\cdot\frac{N}{P}$$

  • These are the known iso-performance lines [PF98].
SLIDE 69

Cost-based model - selecting the optimal classifier

Similar criterion – minimize the cost:

$$rc = \frac{1}{N+P}\Bigl(\underbrace{c_{12}\,FN_\beta + c_{21}\,FP_\alpha}_{\text{misclassified}} + \underbrace{c_{13}\,(FN_\alpha - FN_\beta) + c_{23}\,(FP_\beta - FP_\alpha)}_{\text{disagreement (abstained)}}\Bigr)$$

$$\frac{\partial rc}{\partial fp_\alpha} = 0 \;\wedge\; \frac{\partial rc}{\partial fp_\beta} = 0 \;\Rightarrow\; f'_{ROC}(fp_\alpha) = \frac{c_{21}-c_{23}}{c_{13}}\cdot\frac{N}{P}, \quad f'_{ROC}(fp_\beta) = \frac{c_{23}}{c_{12}-c_{13}}\cdot\frac{N}{P}$$

The optimum depends only on the slopes of the ROC curve (similar to the single-classifier case).

SLIDE 70

Cost-based model - understanding cost matrices

The 2×2 cost matrix is well understood. 2×3 cost matrices have some interesting properties, e.g., under which conditions the optimal classifier is an abstaining classifier. Our derivation is valid for

$$(c_{12} > c_{13}) \;\wedge\; (c_{21} \ge c_{23}) \;\wedge\; (c_{12}\,c_{21} \ge c_{13}\,c_{21} + c_{23}\,c_{12}) \qquad (*)$$

and we can prove that if this condition is not met, the optimal classifier is a trivial binary classifier.

SLIDE 71

Cost-based model - understanding cost matrices

  • Theorem. If (*) is not met, the optimal classifier is a trivial binary classifier.

Proof (sketch):
  – show that for an optimal classifier $f'_R(fp^*_\alpha) \ge f'_R(fp^*) \ge f'_R(fp^*_\beta)$, where $fp^*$ corresponds to the optimal binary classifier;
  – show that if (*) is not met, $\partial rc/\partial fp_\alpha$ is positive for $fp^*_\alpha < fp^*$ and $\partial rc/\partial fp_\beta$ is positive for $fp^*_\beta > fp^*$;
  – therefore $fp^*_\alpha = fp^* = fp^*_\beta$.

$$(c_{12} > c_{13}) \;\wedge\; (c_{21} \ge c_{23}) \;\wedge\; (c_{12}\,c_{21} \ge c_{13}\,c_{21} + c_{23}\,c_{12}) \qquad (*)$$

SLIDE 72

Cost-based model - interesting cases

How should c13 and c23 be set so that the optimal classifier is a non-trivial abstaining classifier? Two interesting cases:
  – the symmetric case (c13 = c23):

$$c_{13} = c_{23} \le \frac{c_{12}\,c_{21}}{c_{12}+c_{21}}$$

  – the proportional case (c13/c12 = c23/c21):

$$c_{13} \le \frac{c_{12}}{2} \;\Leftrightarrow\; c_{23} \le \frac{c_{21}}{2}$$

SLIDE 73

Bounded-abstention model – Algorithm

SLIDE 74

Experiments

We tested the approach with 15 UCI KDD datasets, using averaged cross-validation. In each model we used one independent parameter: c13 = c23, k, or f. The classifier was a Bayesian classifier from Weka [WF00]. In many cases we obtained a large cost reduction even with a small abstention rate k = 0.1. We are now applying it to alert classification.

SLIDE 75

Results with abstaining classifiers

SLIDE 76

CLARAty – supporting slides

SLIDE 77

CLARAty algorithm [Jul03b]

SLIDE 78

CLARAty & cluster labeling

  • Running CLARAty with no labels and trying to label the resulting clusters as:
  – containing only false positives,
  – containing only true positives,
  – mixed.
  • Two main purposes:
  – Retroactive alert analysis:
    • by looking at cluster descriptions again, the analysts may spot previously missed incidents or large groups of alerts indicating problems,
    • rules recognizing some incidents can be written.
  – Predictive value.

[Diagram: alerts and historical alert data feed alert clustering; historical incident data is correlated with the clusters to create trigger rules (TP-only clusters), filtering rules (FP-only clusters) for the alert filter, or to split clusters / investigate missed true positives.]
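A minimal sketch of the labeling step in the diagram, assuming clusters are given as sets of alert ids and incidents as a set of alert ids; the names are illustrative.

def label_clusters(clusters, incident_alert_ids):
    # clusters: {cluster_id: set of alert ids it generalizes}
    # incident_alert_ids: alert ids that belong to labeled incidents
    labels = {}
    for cid, alert_ids in clusters.items():
        hits = sum(1 for a in alert_ids if a in incident_alert_ids)
        if hits == 0:
            labels[cid] = "FP-only"      # candidate filtering rule
        elif hits == len(alert_ids):
            labels[cid] = "TP-only"      # candidate trigger rule
        else:
            labels[cid] = "mixed"        # investigate / split cluster
    return labels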

SLIDE 79

Clustering Conclusions

  • Clusters are persistent:
  – average clustering coverage 90%,
  – average filtering coverage 63%.
  • Most of the clusters are FP-only clusters (avg. 95%):
  – these clusters tend to be persistent.
  • There are only very few TP-only clusters (avg. 0.2%):
  – these clusters are ephemeral.
  • Mixed clusters (avg. 5%):
  – these clusters need to be investigated more carefully.
  • Filtering works well, although if clusters are not reviewed some attacks can be missed.
  – We investigated these cases; they were mostly due to incorrect labeling.

SLIDE 80

Automated Clustering & Analysis Framework

  • Only FP-only clusters are filtered out.
  – The evaluation verifies that no true positives are missed.
  – Applied to the DARPA 1999 Data Set and Data Set B.
  – Applied 26 times to each company on the MSSD data (weekly clustering).

SLIDE 81

Cluster filtering (DARPA1999 and Data Set B)

[Plots (DARPA 1999 Data and Data Set B, clustering period 1 week): number of alerts, positives and clusters per week, and the corresponding fractions of alerts, Feb 28–Mar 30 (DARPA) and Nov 14–Dec 09 (Data Set B).]

SLIDE 82

Cluster persistency (DARPA1999 and Data Set B)

[Plots “Filtering using Clustering” (DARPA 1999 Data and Data Set B, clustering period 1 week): per week, the number of alerts, alerts covered by clustering, alerts covered by FP-only clusters, filtered alerts, positives and missed positives, both as counts and as fractions.]

SLIDE 83

Cluster Accuracy & Coverage (DARPA1999)

[Plots: clustering accuracy (per cluster) and clustering coverage (per incident) at the clustering and filtering stages, DARPA 1999 data.]
SLIDE 84

Cluster Accuracy & Coverage (Data Set B)

[Plots: clustering accuracy (per cluster) and clustering coverage (per incident) at the clustering and filtering stages, Data Set B.]
SLIDE 85

Two-stage alert classification – ROC analysis

  • Feature construction performs only marginally better.
  • Filtering performs much better for DARPA and comparably for Data Set B.

[Plots: ROC curves (fp vs. tp) of the two-stage system on DARPA and Data Set B, comparing the original system, feature construction (2FC), filtering (2FI), and filtering (2FI, rescaled), annotated with cost-ratio operating points.]

SLIDE 86

Misclassifications (two-stage) (DARPA)

[Plots: false negative rate (fn) and false positive rate (fp) versus the number of alerts processed by the system (DARPA), for recommender and agent modes, each in the original, 2FC and 2FI variants.]

SLIDE 87

Misclassifications (two-stage) (Data Set B)

[Plots: false negative rate (fn) and false positive rate (fp) versus the number of alerts processed by the system (Data Set B), for recommender and agent modes, each in the original, 2FC and 2FI variants.]

SLIDE 88

Automatic Processing (two-stage)

[Plots: discarded false positive rate versus the number of alerts processed by the system, for the agent, agent (2FC) and agent (2FI) variants, on the DARPA 1999 data and Data Set B.]

SLIDE 89

END!