SLIDE 1

Methods for Anomaly Detection: a Survey

Kalinichenko L.A., Shanin I., Taraban I. IPI RAN RCDL 2014

SLIDE 2

Outline

  • Introduction
  • Motivation
  • Methods taxonomy
  • Metric Data Oriented Methods
    • Distance-based Data Methods
    • Probabilistically Distributed Data Methods
    • Data with Correlated Features Methods
  • Evolving Data Oriented Methods
    • Discrete Sequences Data
    • Time Series
  • Graph-based Data Oriented Methods
    • Methods for many small graphs analysis
    • Methods for large single graph analysis
  • Text-based Data Oriented Methods
  • Future work

RCDL-2014

SLIDE 3

Popular anomaly detection applications

  • Credit Card (and Mobile Phone) Fraud Detection
  • Suspicious Web Site Detection
  • Whole-Genome DNA Matching
  • ECG-Signal Filtering
  • Suspicious transaction detection
  • Analysis of Digital Sky Surveys
  • Social Network Analysis

SLIDE 4

Introduction

Anomaly Detection

  • Finding data objects that have suspicious origin.

Purposes:

  • 1. Data filtering
  • 2. Finding rare events
  • 3. Finding surprising data patterns

SLIDE 5

Motivation

Sometimes data has a non-obvious form

SLIDE 6

Motivation

  • Sometimes anomalies are difficult to interpret

SLIDE 8

Data forms

  • The definition of an outlier depends on the specific problem and its data representation. The most popular data forms are:
    • Metric Data (data as objects in a feature space)
    • Evolving Data (time series and sequences)
    • Text-based Data (e.g. poll answers)
    • Graph-based Data (e.g. social networks)

SLIDE 10

Metric Data Oriented Methods

  • Data is a set of objects in the “feature” space

There are three main groups of methods:

  • Distance-based and density-based methods (with no additional assumptions)
  • Methods that assume that the data has a certain probabilistic distribution
  • Methods that assume that the “feature” space has highly correlated dimensions

SLIDE 12

Distance-based Data Oriented Methods

  • Data is represented as a set of objects in a «feature» space with a natural Euclidean metric.
  • An anomaly can be defined as a data object in this space that is located far enough from the other objects.
  • If no other assumptions about the data are given, there are two main approaches to detecting anomalies:
    • k-Nearest-Neighbor based methods
    • Clustering based methods

SLIDE 13

Clustering-based methods

  • The purpose is to form large clusters of data objects that define the «normal» data.
  • The most popular methods are:
    • k-means
    • EM-algorithm
    • Self-organizing map
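
As a rough illustration of the clustering idea, the sketch below runs a tiny k-means and scores each object by its distance to the nearest centroid, so objects far from every «normal» cluster get the highest score. The data, the naive initialization (first k points) and the helper names kmeans and centroid_distance are assumptions made for this example, not part of the survey.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means; centroids start at the first k points (naive, illustrative init)."""
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[c]
            for c, cl in enumerate(clusters)
        ]
    return centroids

def centroid_distance(point, centroids):
    """Anomaly score: distance from the point to its closest centroid."""
    return min(math.dist(point, c) for c in centroids)

# Two tight groups plus one object lying between them (made-up data).
points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10), (5, 4)]
centroids = kmeans(points, k=2)
scores = [centroid_distance(p, centroids) for p in points]
outlier_index = max(range(len(points)), key=scores.__getitem__)
print(points[outlier_index])
```

Note that the anomalous object still pulls its centroid toward itself, which is one reason practical variants cluster on a cleaned sample or use robust centers.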

SLIDE 14

K-Nearest Neighbors

  • These methods use the set of distances from a given object to its k closest neighbors.
  • Assumption: an anomalous object has much larger distances than normal data points.
  • There are many versions of the kNN algorithm that consider the density of the neighborhood:
    • kNN with Local Outlier Factor (LOF)
    • kNN with Local Correlation Integral (LOCI)
    • kNN-DD (using Kolmogorov-Smirnov test)
    • etc.
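
The basic kNN assumption can be sketched in a few lines: score every object by the distance to its k-th nearest neighbor and treat the largest scores as candidate anomalies. This is the plain distance variant, not LOF, LOCI or kNN-DD; the function name knn_scores and the sample points are assumptions for illustration.

```python
import math

def knn_scores(points, k):
    """For each point, return the Euclidean distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Four objects packed closely together and one far away (made-up data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = knn_scores(points, k=2)
outlier_index = max(range(len(points)), key=scores.__getitem__)
print(outlier_index, scores[outlier_index])
```

The brute-force search costs O(n²) distance computations; it is meant to show the scoring rule, not to scale.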

SLIDE 16

Probabilistically distributed data

  • The data has a natural probabilistic distribution

The main approaches:

  • Extreme value analysis (Markov, Chebyshev, Hoeffding inequalities, t-value test, etc.)
  • Probabilistic distribution estimation (Bayesian methods)
  • Probabilistic mixture modeling (EM-algorithm)
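
As one possible reading of extreme value analysis, the sketch below uses the Chebyshev inequality P(|X − μ| ≥ kσ) ≤ 1/k² to bound how likely each observed deviation is, and flags values whose bound falls below a chosen level. Estimating μ and σ from the same sample is a heuristic shortcut, and the names chebyshev_tail_bounds and alpha as well as the data are assumptions for this example.

```python
import statistics

def chebyshev_tail_bounds(values):
    """Upper bound on P(|X - mu| >= |v - mu|) for each value v, via Chebyshev."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        return [1.0] * len(values)  # no spread, nothing can be called extreme
    bounds = []
    for v in values:
        k = abs(v - mu) / sigma
        # The inequality is only informative for k > 1.
        bounds.append(1.0 if k <= 1 else 1.0 / k**2)
    return bounds

values = [10, 11, 9, 10, 12, 10, 9, 11, 10, 50]  # made-up measurements
alpha = 0.2  # illustrative significance level
bounds = chebyshev_tail_bounds(values)
flagged = [v for v, b in zip(values, bounds) if b < alpha]
print(flagged)
```

The bound is distribution-free and therefore loose; when the distribution family is known, a tighter test for that family is usually preferable.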

SLIDE 18

Correlated dimensions data methods

  • Different dimensions are highly correlated with one another, so linear data models can be used.
  • Main assumption: the data is embedded in a lower-dimensional subspace.
  • Main approaches:
    • Linear regression modeling
    • Principal component analysis
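
A minimal sketch of the linear regression variant, assuming two correlated features: fit a least-squares line and treat the object with the largest residual as the candidate anomaly, since it lies farthest from the one-dimensional subspace the rest of the data follows. The helper fit_line and the numbers are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# y roughly follows 2x, except for the object at x = 5 (made-up data).
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.0, 8.1, 2.0, 12.0]
slope, intercept = fit_line(xs, ys)
residuals = [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]
outlier_index = max(range(len(xs)), key=residuals.__getitem__)
print(xs[outlier_index], ys[outlier_index])
```

The same residual idea carries over to principal component analysis, where the score is the distance from an object to the subspace spanned by the leading components.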

SLIDE 20

Evolving Data Oriented Methods

  • Data is generated by a continuous or discrete temporal process, so there are two main groups of methods:
    • Discrete sequence oriented methods
      • Example: Genome sequence analysis
      • Example: User-action sequence analysis
    • Time series oriented methods
      • Example: Detecting novel events in social media
  • The data can be presented in a multidimensional form (such as data streams)

SLIDE 22

Discrete Sequence Oriented Methods

  • There are several ways to define an anomaly in a discrete sequence:
    • as a position anomaly (a single value is anomalous)
    • as a combination anomaly (the whole sequence is anomalous)
  • Main approaches to determining the rarity of a value or a combination:
    • Distance-based
      • analysis of the pair-wise distance matrix
    • Frequency-based
    • Model-based (Hidden Markov Models)
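
The frequency-based approach can be sketched with bigram counts: a whole sequence is scored by the relative frequency of its rarest bigram (combination anomaly), and the positions of rare bigrams point at position anomalies. Using bigrams, the max_count cutoff, and the function names are assumptions made for this example.

```python
from collections import Counter

def bigrams(seq):
    """Adjacent symbol pairs of a sequence."""
    return list(zip(seq, seq[1:]))

def rarity_scores(sequences, counts):
    """Per sequence: relative frequency of its rarest bigram (lower is more anomalous)."""
    total = sum(counts.values())
    return [min(counts[bg] / total for bg in bigrams(s)) for s in sequences]

def rare_positions(seq, counts, max_count=1):
    """Positions whose bigram occurs at most max_count times in the whole collection."""
    return [i for i, bg in enumerate(bigrams(seq)) if counts[bg] <= max_count]

sequences = ["ABABAB", "BABABA", "ABABAB", "ABAXBA"]  # made-up symbol sequences
counts = Counter(bg for s in sequences for bg in bigrams(s))
scores = rarity_scores(sequences, counts)
anomalous_sequence = min(range(len(sequences)), key=scores.__getitem__)
positions = rare_positions(sequences[anomalous_sequence], counts)
print(anomalous_sequence, positions)
```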

SLIDE 24

Time Series Oriented Methods

  • There are several ways to define an anomaly in a time series:
    • abrupt change in the time series
    • unexpected trend in the time series
    • time series of unusual shape
  • Main approaches are:
    • correlation across time or series
    • time series representation
    • HMM (Hidden Markov Model)
    • ARMA (Autoregressive Moving Average model)
    • Wavelets
    • Spectral
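
To make the «abrupt change» case concrete, here is a simple sliding-window baseline: each value is compared with the mean and standard deviation of the preceding window and flagged when it deviates by more than a chosen number of standard deviations. This is a deliberately simple stand-in, not an HMM, ARMA, wavelet or spectral method; the window size, threshold and data are assumptions for illustration.

```python
import statistics

def abrupt_changes(series, window=5, threshold=3.0):
    """Indices whose value deviates strongly from the preceding window."""
    flagged = []
    for t in range(window, len(series)):
        history = series[t - window:t]
        mu = statistics.fmean(history)
        sigma = statistics.pstdev(history)
        # Skip perfectly flat windows to avoid dividing by zero.
        if sigma > 0 and abs(series[t] - mu) / sigma > threshold:
            flagged.append(t)
    return flagged

series = [10.0, 10.2, 9.9, 10.1, 10.0, 10.1, 9.8, 15.0, 10.0, 10.1, 9.9]  # made-up readings
flagged = abrupt_changes(series)
print(flagged)
```

After a spike enters the window it inflates the local standard deviation, so the values right after it are judged leniently; a model-based predictor avoids this at the cost of more machinery.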

SLIDE 26

Graph Data Oriented Methods

  • Data is represented as a graph of a certain structure

There are two different cases:

  • The data is presented as many small graphs (e.g. chemical and biological compounds)
  • The data is presented as a single large graph (e.g. social network, web, etc.)

SLIDE 28

«Many small graphs» case

  • A single graph can be considered an anomaly.
  • The easy way: extract features from each graph and use a metric data oriented method.
  • Different similarity functions can be considered that make it possible to apply clustering and kNN-like methods:
    • Largest common subgraph
    • Graph edit distance
    • Largest matching node set
  • Similarity computation can be difficult
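
The «easy way» can be sketched as follows: turn every small graph into a short feature vector and then apply a metric method, here the distance to the nearest other graph in feature space. The chosen features (node count, edge count, maximum degree, mean degree), the edge-list input format and the function names are assumptions for this example; a real application would pick and scale features for its domain.

```python
import math
from collections import Counter

def graph_features(edges):
    """Feature vector of an undirected graph given as a list of (u, v) edges."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n = len(degree)  # nodes that appear in at least one edge
    m = len(edges)
    return (n, m, max(degree.values()), 2 * m / n)

def nearest_neighbor_distance(features):
    """For each feature vector, the Euclidean distance to its closest other vector."""
    return [
        min(math.dist(f, g) for j, g in enumerate(features) if j != i)
        for i, f in enumerate(features)
    ]

# Three path-like graphs and one complete graph on five nodes (made-up data).
graphs = [
    [(0, 1), (1, 2), (2, 3)],
    [(5, 6), (6, 7), (7, 8)],
    [(0, 1), (1, 2), (2, 3), (3, 4)],
    [(a, b) for a in range(5) for b in range(a + 1, 5)],
]
features = [graph_features(g) for g in graphs]
scores = nearest_neighbor_distance(features)
outlier_index = max(range(len(graphs)), key=scores.__getitem__)
print(outlier_index, features[outlier_index])
```

This avoids graph edit distance or common-subgraph computation entirely, which is exactly why it is cheap and also why it can miss purely structural differences the features do not capture.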

SLIDE 30

«Single large graph» case

  • Depending on the problem specification an anomaly can be defined differently:
    • as a node anomaly
    • as a linkage anomaly
    • as a subgraph anomaly
  • Features can be extracted from the 1-step neighborhood (egonet) node-wise and edge-wise.
  • For linkage anomaly detection, random graph modeling can be used.
  • Other popular methods include spectral methods and the matrix factorization approach.
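
Egonet feature extraction can be sketched like this: for every node, count its neighbors and the edges inside its 1-step neighborhood. A node whose egonet has about as many edges as neighbors looks like a star, while one with many more edges looks like a clique, and such feature pairs can then be passed to any metric method. The function names and the sample graph are assumptions for illustration.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) edges."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return dict(adjacency)

def egonet_features(adjacency):
    """Map each node to (number of neighbors, number of edges inside its egonet)."""
    features = {}
    for node, neighbors in adjacency.items():
        ego = neighbors | {node}
        # Every edge inside the egonet is seen from both endpoints, so halve the sum.
        edge_count = sum(len(adjacency[u] & ego) for u in ego) // 2
        features[node] = (len(neighbors), edge_count)
    return features

# A star centered at node 0, a triangle 6-7-8, and a bridge 5-6 (made-up graph).
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (6, 7), (7, 8), (6, 8), (5, 6)]
features = egonet_features(build_adjacency(edges))
print(features[0], features[7])
```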

SLIDE 32

Text Data Oriented Methods

  • Anomaly detection can be considered as:
    • Noise reduction problem
      • Latent Semantic Indexing (LSI) is used.
    • Novelty detection (first story detection)
      • TF-IDF representation based algorithms
      • Probabilistic Latent Semantic Indexing (PLSI)
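
A minimal sketch of TF-IDF based first story detection: documents are processed in arrival order, and a document is reported as novel when its highest cosine similarity to all earlier documents stays below a threshold. Computing IDF over the whole collection (rather than incrementally), the smoothed IDF formula, whitespace tokenization, the threshold value and the sample headlines are all simplifying assumptions for this example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Sparse TF-IDF vectors (term -> weight) with a smoothed IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    return [
        {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in Counter(tokens).items()}
        for tokens in tokenized
    ]

def cosine(a, b):
    """Cosine similarity of two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def first_stories(docs, threshold=0.2):
    """Indices of documents that resemble none of the earlier documents."""
    vectors = tfidf_vectors(docs)
    novel = []
    for i, v in enumerate(vectors):
        best = max((cosine(v, vectors[j]) for j in range(i)), default=0.0)
        if best < threshold:
            novel.append(i)
    return novel

docs = [
    "earthquake hits coastal city",
    "coastal city earthquake damage reported",
    "team wins championship final",
    "earthquake damage in coastal city grows",
]
novel = first_stories(docs)
print(novel)
```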

SLIDE 34

Future work

  • Research in anomaly detection in massive digital sky astronomy surveys
  • Preparing a master-level course in anomaly detection methods.
