Anomaly Detection Qi Liu University of Science and Technology of - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Anomaly Detection

Qi Liu
University of Science and Technology of China
qiliuql@ustc.edu.cn

slide-2
SLIDE 2

Data Mining Tasks …

Data

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
12   Yes     Divorced        220K            No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes

slide-3
SLIDE 3

Anomaly/Outlier Detection

What are anomalies/outliers?
  • The set of data points that are considerably different than the remainder of the data

Natural implication is that anomalies are relatively rare
  • One in a thousand occurs often if you have lots of data
  • Context is important, e.g., freezing temps in July

Can be important or a nuisance
  • 10 foot tall 2 year old
  • Unusually high blood pressure

slide-4
SLIDE 4

Importance of Anomaly Detection

Ozone Depletion History
  • In 1985 three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels
  • Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?
  • The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded!

Sources:
http://exploringdata.cqu.edu.au/ozone.html
http://www.epa.gov/ozone/science/hole/size.html

slide-5
SLIDE 5

Causes of Anomalies

Data from different classes
  • Measuring the weights of oranges, but a few grapefruit are mixed in

Natural variation
  • Unusually tall people

Data errors
  • 200 pound 2 year old

slide-6
SLIDE 6

Distinction Between Noise and Anomalies

Noise is erroneous, perhaps random, values or contaminating objects
  • Weight recorded incorrectly
  • Grapefruit mixed in with the oranges

Noise doesn’t necessarily produce unusual values or objects

Noise is not interesting

Anomalies may be interesting if they are not a result of noise

Noise and anomalies are related but distinct concepts

slide-7
SLIDE 7

General Issues: Number of Attributes

Many anomalies are defined in terms of a single attribute
  • Height
  • Shape
  • Color

Can be hard to find an anomaly using all attributes
  • Noisy or irrelevant attributes
  • Object is only anomalous with respect to some attributes

However, an object may not be anomalous in any one attribute

slide-8
SLIDE 8

General Issues: Anomaly Scoring

Many anomaly detection techniques provide only a binary categorization
  • An object is an anomaly or it isn’t
  • This is especially true of classification-based approaches

Other approaches assign a score to all points
  • This score measures the degree to which an object is an anomaly
  • This allows objects to be ranked

In the end, you often need a binary decision
  • Should this credit card transaction be flagged?
  • Still useful to have a score

How many anomalies are there?

slide-9
SLIDE 9

Other Issues for Anomaly Detection

Find all anomalies at once or one at a time
  • Swamping
  • Masking

Evaluation
  • How do you measure performance?
  • Supervised vs. unsupervised situations

Efficiency

Context
  • Professional basketball team

slide-10
SLIDE 10

Variants of Anomaly Detection Problems

Given a data set D, find all data points x ∈ D with anomaly scores greater than some threshold t

Given a data set D, find all data points x ∈ D having the top-n largest anomaly scores

Given a data set D, containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D
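The three problem variants can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the scoring rule used for the third variant (deviation from the mean of D in units of standard deviation) is an assumption standing in for whatever anomaly score is actually chosen.

```python
import statistics


def above_threshold(scores, t):
    """Variant 1: indices of points whose anomaly score exceeds t."""
    return [i for i, s in enumerate(scores) if s > t]


def top_n(scores, n):
    """Variant 2: indices of the n points with the largest anomaly scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


def score_against(D, x):
    """Variant 3: score a test point x against mostly normal data D.

    Assumption: the score is |x - mean(D)| divided by the sample
    standard deviation of D; any other scoring rule could be used.
    """
    return abs(x - statistics.mean(D)) / statistics.stdev(D)


scores = [0.1, 0.9, 0.3, 2.5, 0.2]
flagged = above_threshold(scores, 0.5)   # every point scoring above 0.5
ranked = top_n(scores, 2)                # the two highest-scoring points
```

The threshold variant returns a variable number of points, while the top-n variant always returns exactly n, even when none of them is very unusual.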

slide-11
SLIDE 11

Model-Based Anomaly Detection

Build a model for the data and see

Unsupervised
  • Anomalies are those points that don’t fit well
  • Anomalies are those points that distort the model
  • Examples: Statistical distribution, Clusters, Regression, Geometric, Graph

Supervised
  • Anomalies are regarded as a rare class
  • Need to have training data

slide-12
SLIDE 12

Additional Anomaly Detection Techniques

Proximity-based
  • Anomalies are points far away from other points
  • Can detect this graphically in some cases

Density-based
  • Low density points are outliers

Pattern matching
  • Create profiles or templates of atypical but important events or objects
  • Algorithms to detect these patterns are usually simple and efficient

slide-13
SLIDE 13

Graphical Approaches

Boxplots or scatter plots

Limitations
  • Not automatic
  • Subjective

slide-14
SLIDE 14

Convex Hull Method

Extreme points are assumed to be outliers

Use convex hull method to detect extreme values

What if the outlier occurs in the middle of the data?
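The closing question can be checked with a small experiment. The sketch below computes a 2-D convex hull with Andrew's monotone chain algorithm (one standard way to do it; the slides do not say which hull algorithm is meant) on made-up data: two small clusters with one isolated point between them.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Hull vertices in counter-clockwise order (collinear points dropped)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


# Two clusters (illustrative data) and one isolated point in the gap.
left = [(0, 0), (0, 1), (1, 0), (1, 1)]
right = [(10, 0), (10, 1), (11, 0), (11, 1)]
isolated = (5.5, 0.5)

hull = convex_hull(left + right + [isolated])
```

Here the hull consists only of outer corners of the two clusters. The isolated point lies strictly inside the hull, so a method that flags hull vertices never reports it, while ordinary cluster members on the boundary are flagged.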

slide-15
SLIDE 15

Statistical Approaches

Probabilistic definition of an outlier: An outlier is an object that has a low probability with respect to a probability distribution model of the data.

Usually assume a parametric model describing the distribution of the data (e.g., normal distribution)

Apply a statistical test that depends on
  • Data distribution
  • Parameters of distribution (e.g., mean, variance)
  • Number of expected outliers (confidence limit)

Issues
  • Identifying the distribution of a data set
  • Heavy tailed distribution
  • Number of attributes
  • Is the data a mixture of distributions?

slide-16
SLIDE 16

Normal Distributions

[Figure: a one-dimensional Gaussian (probability density against x) and a two-dimensional Gaussian (probability density over x and y)]

slide-17
SLIDE 17

Grubbs’ Test

Detect outliers in univariate data

Assume data comes from normal distribution

Detects one outlier at a time, remove the outlier, and repeat
  • H0: There is no outlier in data
  • HA: There is at least one outlier

Grubbs’ test statistic:

$$G = \frac{\max |X - \bar{X}|}{s}$$

Reject H0 if:

$$G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{(\alpha/N,\,N-2)}}{N - 2 + t^2_{(\alpha/N,\,N-2)}}}$$
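A minimal sketch of the Grubbs' statistic and its rejection threshold, using only the Python standard library. The function names are illustrative. The critical value of the t distribution (N - 2 degrees of freedom, significance level α/N) is passed in by the caller, because the standard library has no t quantile function; it could come from a table or, for instance, from `scipy.stats.t.ppf`.

```python
import math
import statistics


def grubbs_statistic(data):
    """G = max |x - mean| / s, with s the sample standard deviation."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    return max(abs(x - mean) for x in data) / s


def grubbs_threshold(n, t_crit):
    """Right-hand side of the rejection rule for a sample of size n.

    t_crit is assumed to be the critical value of the t distribution
    with n - 2 degrees of freedom at significance level alpha / n.
    """
    return (n - 1) / math.sqrt(n) * math.sqrt(t_crit ** 2 / (n - 2 + t_crit ** 2))


def has_outlier(data, t_crit):
    """True if H0 (no outlier in the data) is rejected."""
    return grubbs_statistic(data) > grubbs_threshold(len(data), t_crit)
```

Because the test finds one outlier at a time, a full procedure would remove the point with the largest deviation whenever `has_outlier` returns true and then repeat on the remaining data with a critical value recomputed for the smaller N.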

slide-18
SLIDE 18

Statistical-based – Likelihood Approach

Assume the data set D contains samples from a mixture of two probability distributions:
  • M (majority distribution)
  • A (anomalous distribution)

General Approach:
  • Initially, assume all the data points belong to M
  • Let L_t(D) be the log likelihood of D at time t
  • For each point x_t that belongs to M, move it to A
  • Let L_{t+1}(D) be the new log likelihood
  • Compute the difference, Δ = L_t(D) – L_{t+1}(D)
  • If Δ > c (some threshold), then x_t is declared as an anomaly and moved permanently from M to A

slide-19
SLIDE 19

Statistical-based – Likelihood Approach

Data distribution, D = (1 – λ) M + λ A

M is a probability distribution estimated from data
  • Can be based on any modeling method (naïve Bayes, maximum entropy, etc.)

A is initially assumed to be uniform distribution

Likelihood at time t:

$$L_t(D) = \prod_{i=1}^{N} P_D(x_i) = \left( (1-\lambda)^{|M_t|} \prod_{x_i \in M_t} P_{M_t}(x_i) \right) \left( \lambda^{|A_t|} \prod_{x_i \in A_t} P_{A_t}(x_i) \right)$$

$$LL_t(D) = |M_t| \log(1-\lambda) + \sum_{x_i \in M_t} \log P_{M_t}(x_i) + |A_t| \log \lambda + \sum_{x_i \in A_t} \log P_{A_t}(x_i)$$
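The log likelihood can be evaluated directly for one-dimensional data. The sketch below makes several assumptions that are not in the slides: M is modeled as a Gaussian refitted to the points currently in M, A is uniform over the range of the data, and λ is fixed by the caller. It reports the change LL_{t+1} - LL_t caused by moving a single point from M to A, so a large positive value means the data became much more likely once that point was treated as anomalous; treating that as the quantity to compare with the threshold c is also an assumption about the intended sign convention.

```python
import math
import statistics


def log_likelihood(M, A, lam, lo, hi):
    """LL(D) for the mixture (1 - lam) * M + lam * A.

    Assumptions: M is a Gaussian fitted to the points in M (needs at
    least two distinct values); A is uniform on the interval [lo, hi].
    """
    mu = statistics.mean(M)
    var = statistics.pvariance(M)
    ll = len(M) * math.log(1 - lam) + len(A) * math.log(lam)
    ll += sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
              for x in M)
    ll += len(A) * math.log(1.0 / (hi - lo))
    return ll


def gain_if_moved(data, i, lam):
    """LL_{t+1} - LL_t when data[i] is moved from M to an empty A."""
    lo, hi = min(data), max(data)
    before = log_likelihood(data, [], lam, lo, hi)
    rest = data[:i] + data[i + 1:]
    after = log_likelihood(rest, [data[i]], lam, lo, hi)
    return after - before


data = [9.8, 10.1, 10.0, 9.9, 10.2, 25.0]   # illustrative values
gain_outlier = gain_if_moved(data, 5, lam=0.1)
gain_normal = gain_if_moved(data, 2, lam=0.1)
```

Moving the value 25.0 lets the Gaussian tighten around the remaining points, which raises the log likelihood substantially; moving a typical value such as 10.0 does not.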

slide-20
SLIDE 20

Strengths/Weaknesses of Statistical Approaches

Firm mathematical foundation

Can be very efficient

Good results if distribution is known

In many cases, data distribution may not be known

For high dimensional data, it may be difficult to estimate the true distribution

Anomalies can distort the parameters of the distribution

slide-21
SLIDE 21

Distance-Based Approaches

Several different techniques

An object is an outlier if a specified fraction of the objects is more than a specified distance away (Knorr, Ng 1998)
  • Some statistical definitions are special cases of this

The outlier score of an object is the distance to its kth nearest neighbor
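The kth-nearest-neighbor score can be sketched as a brute-force computation. The function name and the sample points are illustrative, and Euclidean distance is assumed.

```python
import math


def knn_outlier_scores(points, k):
    """Outlier score of each point: distance to its kth nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]   # illustrative data
scores = knn_outlier_scores(points, k=1)
```

Every point is compared with every other point, which is the source of the quadratic cost of this family of methods. The choice of k matters: with k = 1, two outliers that sit next to each other both receive a small score.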

slide-22
SLIDE 22

One Nearest Neighbor - One Outlier

[Figure: outlier scores, with labeled point D]

slide-23
SLIDE 23

One Nearest Neighbor - Two Outliers

[Figure: outlier scores, with labeled point D]

slide-24
SLIDE 24

Five Nearest Neighbors - Small Cluster

[Figure: outlier scores, with labeled point D]

slide-25
SLIDE 25

Five Nearest Neighbors - Differing Density

[Figure: outlier scores, with labeled point D]

slide-26
SLIDE 26

Strengths/Weaknesses of Distance-Based Approaches

Simple

Expensive – O(n^2)

Sensitive to parameters

Sensitive to variations in density

Distance becomes less meaningful in high-dimensional space

slide-27
SLIDE 27

Density-Based Approaches

Density-based Outlier: The outlier score of an object is the inverse of the density around the object.
  • Can be defined in terms of the k nearest neighbors
  • One definition: Inverse of distance to kth neighbor
  • Another definition: Inverse of the average distance to k neighbors
  • DBSCAN definition

If there are regions of different density, this approach can have problems
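As a small illustration of the second definition, the sketch below takes density to be the inverse of the average distance to the k nearest neighbors and the outlier score to be the inverse of that density. Function names and data are made up, Euclidean distance is assumed, and duplicate points (zero average distance) are not handled.

```python
import math


def knn_density(points, i, k):
    """Density at points[i]: inverse of the mean distance to its k nearest neighbors."""
    dists = sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
    return 1.0 / (sum(dists[:k]) / k)


def density_outlier_score(points, i, k):
    """Outlier score: inverse of the density around points[i]."""
    return 1.0 / knn_density(points, i, k)


points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]   # illustrative data
inlier_score = density_outlier_score(points, 0, k=2)
outlier_score = density_outlier_score(points, 4, k=2)
```

With this definition the score is simply the average distance to the k nearest neighbors, so a point in a sparse but perfectly normal region scores as high as a genuine outlier near a dense region.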

slide-28
SLIDE 28

Relative Density

Consider the density of a point relative to that of its k nearest neighbors
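One way to turn this idea into a score is to divide the average density of a point's k nearest neighbors by the point's own density. The sketch below assumes that ratio as the outlier score and assumes density is the inverse of the mean distance to the k nearest neighbors; neither choice is spelled out in the slides, and the names are illustrative.

```python
import math


def k_nearest(points, i, k):
    """Indices of the k nearest neighbors of points[i]."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def density(points, i, k):
    """Inverse of the mean distance from points[i] to its k nearest neighbors."""
    nbrs = k_nearest(points, i, k)
    return k / sum(math.dist(points[i], points[j]) for j in nbrs)


def relative_density_score(points, i, k):
    """Mean density of the neighbors divided by the point's own density."""
    nbrs = k_nearest(points, i, k)
    neighbor_density = sum(density(points, j, k) for j in nbrs) / k
    return neighbor_density / density(points, i, k)


points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]   # illustrative data
score_inlier = relative_density_score(points, 0, k=2)
score_outlier = relative_density_score(points, 4, k=2)
```

A point whose density matches that of its neighbors scores close to 1 regardless of how dense or sparse its region is, which is what makes the relative score usable when the data contain regions of different density.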

slide-29
SLIDE 29

Relative Density Outlier Scores

[Figure: relative density outlier scores, with labeled points C (6.85), D (1.40) and A (1.33)]

slide-30
SLIDE 30

Density-based: LOF approach

For each point, compute the density of its local neighborhood

Compute local outlier factor (LOF) of a sample p as the average of the ratios of the density of sample p and the density of its nearest neighbors

Outliers are points with largest LOF value

In the NN approach, p2 is not considered as outlier, while the LOF approach finds both p1 and p2 as outliers

[Figure: data set with points p1 and p2 marked]

slide-31
SLIDE 31

Strengths/Weaknesses of Density-Based Approaches

Simple

Expensive – O(n^2)

Sensitive to parameters

Density becomes less meaningful in high-dimensional space

slide-32
SLIDE 32

Clustering-Based Approaches

Clustering-based Outlier: An object is a cluster-based outlier if it does not strongly belong to any cluster
  • For prototype-based clusters, an object is an outlier if it is not close enough to a cluster center
  • For density-based clusters, an object is an outlier if its density is too low
  • For graph-based clusters, an object is an outlier if it is not well connected

Other issues include the impact of outliers on the clusters and the number of clusters
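For prototype-based clusters, the score can be sketched as the distance from each point to its closest centroid. The example also computes a relative version, dividing that distance by the median distance of the points assigned to the same centroid; this particular normalization is an assumption. The centroids are taken as given (for instance from a previous k-means run), and all names and data are illustrative.

```python
import math
import statistics


def nearest_centroid(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))


def centroid_distance_scores(points, centroids):
    """Outlier score: distance from each point to its closest centroid."""
    return [math.dist(p, centroids[nearest_centroid(p, centroids)]) for p in points]


def relative_distance_scores(points, centroids):
    """Distance to the closest centroid divided by the cluster's median distance."""
    assign = [nearest_centroid(p, centroids) for p in points]
    dist = [math.dist(p, centroids[c]) for p, c in zip(points, assign)]
    median = {c: statistics.median(d for d, a in zip(dist, assign) if a == c)
              for c in set(assign)}
    return [d / median[c] for d, c in zip(dist, assign)]


# A tight cluster near (0, 0) and a loose one near (10, 10).
points = [(0.1, 0), (-0.1, 0), (0, 0.1), (0, -0.1), (0.5, 0),
          (11, 10), (9, 10), (10, 11), (10, 9), (10, 12)]
centroids = [(0, 0), (10, 10)]

absolute = centroid_distance_scores(points, centroids)
relative = relative_distance_scores(points, centroids)
```

In this data the point (10, 12) is farther from its centroid than (0.5, 0) is, so it has the higher absolute score, but relative to the spread of its own cluster (0.5, 0) is the more unusual point and gets the higher relative score.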

slide-33
SLIDE 33

Distance of Points from Closest Centroids

[Figure: outlier scores, with labeled points C (4.6), D (0.17) and A (1.2)]

slide-34
SLIDE 34

Relative Distance of Points from Closest Centroid

[Figure: outlier scores, with labeled points C (76.9), D (15.0) and A (13.1)]

slide-35
SLIDE 35

Strengths/Weaknesses of Cluster-Based Approaches

Simple

Many clustering techniques can be used

Can be difficult to decide on a clustering technique

Can be difficult to decide on number of clusters

Outliers can distort the clusters

slide-36
SLIDE 36

Co-anomaly Event Detection in Multiple Temperature Series

slide-37
SLIDE 37

Co-anomaly Event Detection in Multiple Temperature Series

slide-38
SLIDE 38

Co-anomaly Event Detection in Multiple Temperature Series

slide-39
SLIDE 39

Co-anomaly Event Detection in Multiple Temperature Series