Anomaly Detection Lecture Notes for Chapter 9 Introduction to Data Mining, 2nd Edition by Tan, Steinbach, Karpatne, Kumar
4/14/2019 Introduction to Data Mining, 2nd Edition 1
Anomaly/Outlier Detection
What are anomalies/outliers?
In 1985 three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels.
Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?
The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded!
Sources: http://exploringdata.cqu.edu.au/ozone.html http://www.epa.gov/ozone/science/hole/size.html
Noise is erroneous, perhaps random, values or objects
Noise doesn't necessarily produce unusual values or objects
Noise is not interesting
Anomalies may be interesting if they are not a result of noise
Noise and anomalies are related but distinct concepts
Many anomalies are defined in terms of a single attribute
Can be hard to find an anomaly using all attributes
However, an object may not be anomalous in any one attribute
Many anomaly detection techniques provide only a binary categorization: an object either is an anomaly or is not
Other approaches assign a score to all points
In the end, you often need a binary decision
How many anomalies are there?
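The step from scores to a binary decision can be sketched as follows. This is a minimal illustration, not a method prescribed by the notes: the function names, the example scores, the threshold of 1.5, and the count n = 1 are all assumptions chosen for the example. It shows two common ways to answer "how many anomalies are there?": fix a score threshold, or fix the number of points to flag.

```python
def flag_by_threshold(scores, threshold):
    """Label a point anomalous when its score exceeds a fixed threshold."""
    return [s > threshold for s in scores]


def flag_top_n(scores, n):
    """Label the n highest-scoring points anomalous (ties broken by position)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:n])
    return [i in top for i in range(len(scores))]


# Hypothetical outlier scores for five points (higher = more anomalous).
scores = [0.2, 0.3, 4.6, 0.25, 1.9]

by_threshold = flag_by_threshold(scores, threshold=1.5)
by_count = flag_top_n(scores, n=1)
```

The two rules can disagree: the threshold rule flags every point above the cut, while the top-n rule commits in advance to how many anomalies exist.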
Find all anomalies at once or one at a time
Evaluation
Efficiency
Context
Build a model for the data and see how well each point fits it
Anomalies are those points that don't fit well
Anomalies are those points that distort the model
Examples:
– Statistical distribution
– Clusters
– Regression
– Geometric
– Graph
When labels are available, anomalies are regarded as a rare class
Need to have training data
Usually assume a parametric model describing the distribution of the data
Apply a statistical test that depends on the assumed distribution and its parameters
Issues:
– Heavy-tailed distribution
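A minimal sketch of such a test, assuming the data are roughly normal: estimate the mean and standard deviation, then flag values whose z-score magnitude exceeds a cutoff. The function names, the sample data, and the cutoff of 3 are assumptions made for illustration; with a heavy-tailed distribution this cutoff would flag far more points than intended.

```python
import statistics


def z_scores(values):
    """Standardize values using the sample mean and sample standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.stdev(values)
    return [(v - mu) / sigma for v in values]


def gaussian_outliers(values, cutoff=3.0):
    """Return the values whose |z-score| exceeds the cutoff (normality assumed)."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > cutoff]


# Hypothetical measurements: twenty values near 10 and one extreme value.
readings = [9.8, 9.9, 10.0, 10.1, 10.2] * 4 + [25.0]
flagged = gaussian_outliers(readings)
```

Note that the extreme value is included when the mean and standard deviation are estimated, so the test only works here because the normal points heavily outnumber it.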
[Figure: probability density plotted over x and y]
Let $L_{t+1}(D)$ be the new log likelihood.
Compute the difference, $\Delta = L_t(D) - L_{t+1}(D)$.
If $\Delta > c$ (some threshold), then $x_t$ is declared as an anomaly.
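With $M_t$ and $A_t$ the sets of normal and anomalous points at time $t$, and $\lambda$ the expected fraction of anomalies, the likelihood of $D$ is

$$\prod_{i=1}^{N} P_D(x_i) = \left( (1-\lambda)^{|M_t|} \prod_{x_i \in M_t} P_{M_t}(x_i) \right) \left( \lambda^{|A_t|} \prod_{x_i \in A_t} P_{A_t}(x_i) \right)$$

and the log likelihood is

$$L_t(D) = |M_t| \log(1-\lambda) + \sum_{x_i \in M_t} \log P_{M_t}(x_i) + |A_t| \log \lambda + \sum_{x_i \in A_t} \log P_{A_t}(x_i)$$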
Firm mathematical foundation
Can be very efficient
Good results if distribution is known
In many cases, data distribution may not be known
For high dimensional data, it may be difficult to estimate the true distribution
Anomalies can distort the parameters of the distribution
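The last point can be illustrated with a small sketch. The data values and the cutoff of 3 are assumptions chosen for the example: the same value, 25.0, is scored once against parameters estimated from clean data and once against parameters estimated from data that already contain a few extreme values.

```python
import statistics


def z_score(value, data):
    """z-score of `value` under the sample mean and standard deviation of `data`."""
    return (value - statistics.fmean(data)) / statistics.stdev(data)


# Hypothetical data: twenty clean values, then the same values plus three extremes.
clean = [9.8, 9.9, 10.0, 10.1, 10.2] * 4
contaminated = clean + [25.0, 26.0, 27.0]

z_clean = z_score(25.0, clean)                 # parameters from clean data only
z_contaminated = z_score(25.0, contaminated)   # parameters distorted by the extremes

print(f"z-score of 25.0 with clean estimates:        {z_clean:.2f}")
print(f"z-score of 25.0 with contaminated estimates: {z_contaminated:.2f}")
```

With the contaminated estimates, the mean is pulled toward the extreme values and the standard deviation is inflated, so 25.0 no longer exceeds a cutoff of 3 even though it is far from the bulk of the data.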
[Figure: scatter plot shaded by outlier score (color scale 0.4 to 2); point D labeled]
[Figure: scatter plot shaded by outlier score (color scale 0.05 to 0.55); point D labeled]
[Figure: scatter plot shaded by outlier score (color scale 0.4 to 2); point D labeled]
[Figure: scatter plot shaded by outlier score (color scale 0.2 to 1.8); point D labeled]
[Figure: scatter plot shaded by outlier score (color scale 1 to 6); labeled points A, C, D; annotated scores 6.85, 1.33, 1.40]
For each point, compute the density of its local neighborhood
Compute local outlier factor (LOF) of a sample p as the average of the ratios of the density of each of p's nearest neighbors to the density of p
Outliers are points with largest LOF value
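The procedure above can be sketched in a simplified form. This is not the full LOF algorithm, which defines density through reachability distances; here density is assumed to be the inverse of the average distance to the k nearest neighbors. The function names, the choice k = 2, and the sample points are illustrative assumptions, and the sketch assumes no duplicate points (which would give a zero average distance).

```python
import math


def knn(points, i, k):
    """Indices of the k nearest neighbors of points[i], excluding itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def local_density(points, i, k):
    """Inverse of the average distance from points[i] to its k nearest neighbors."""
    neighbors = knn(points, i, k)
    avg = sum(math.dist(points[i], points[j]) for j in neighbors) / k
    return 1.0 / avg


def simplified_lof(points, k=2):
    """Average ratio of each neighbor's density to the point's own density."""
    density = [local_density(points, i, k) for i in range(len(points))]
    scores = []
    for i in range(len(points)):
        neighbors = knn(points, i, k)
        scores.append(sum(density[j] for j in neighbors) / (k * density[i]))
    return scores


# Hypothetical data: four points on a unit square and one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
lof = simplified_lof(points, k=2)
```

Points inside the square have the same density as their neighbors, so their score is close to 1; the distant point is much less dense than its neighbors and receives the largest score.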
[Figure: two points, p1 and p2, each marked with ×]
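Clustering-based Outlier: An object is a cluster-based outlier if it does not strongly belong to any cluster
Other issues include the impact of outliers on the clusters and the number of clusters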
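One simple way to turn this idea into a score, sketched here under assumptions: take cluster centroids as given (for example from a prior k-means run) and score each point by its distance to the closest centroid. The function name, the centroids, and the points are hypothetical values chosen for the example.

```python
import math


def centroid_distance_scores(points, centroids):
    """Outlier score of each point = distance to its closest cluster centroid."""
    return [min(math.dist(p, c) for c in centroids) for p in points]


# Hypothetical centroids, assumed to come from an earlier clustering step.
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.5, 0.0), (10.0, 9.0), (5.0, 5.0)]

scores = centroid_distance_scores(points, centroids)
```

The point halfway between the two centroids belongs strongly to neither cluster and gets the highest score. Because the centroids themselves are computed from the data, outliers present during clustering can shift them and change these scores.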
[Figure: scatter plot shaded by outlier score (color scale 0.5 to 4.5); labeled points D, C, A; annotated scores 1.2, 0.17, 4.6]
[Figure: scatter plot shaded by outlier score (color scale 0.5 to 4)]