
Methods for Anomaly Detection: a Survey



  1. RCDL 2014. Methods for Anomaly Detection: a Survey. Kalinichenko L.A., Shanin I., Taraban I. (IPI RAN)

  2. Outline
      • Introduction
      • Motivation
      • Methods taxonomy
      • Metric Data Oriented Methods
          • Distance-based Data Oriented Methods
          • Probabilistically Distributed Data Methods
          • Data with Correlated Features Methods
      • Evolving Data Oriented Methods
          • Discrete Sequences Data
          • Time Series
      • Graph-based Data Oriented Methods
          • Methods for many small graphs analysis
          • Methods for large single graph analysis
      • Text-based Data Oriented Methods
      • Future work

  3. Popular anomaly detection applications
      • Credit Card (and Mobile Phone) Fraud Detection
      • Suspicious Web Site Detection
      • Whole-Genome DNA Matching
      • ECG-Signal Filtering
      • Suspicious Transaction Detection
      • Analysis of Digital Sky Surveys
      • Social Network Analysis

  4. Introduction
      Anomaly Detection
      • Finding data objects that have a suspicious origin.
      Purposes:
      1. Data filtering
      2. Finding rare events
      3. Finding surprising data patterns

  5. Motivation
      • Sometimes the data has a non-obvious form

  6. Motivation
      • Sometimes anomalies are difficult to interpret

  8. Data forms
      • The definition of an outlier depends on the specific problem and its data representation.
      The most popular data forms are:
      • Metric Data (data as objects in a feature space)
      • Evolving Data (time series and sequences)
      • Text-based Data (e.g. poll answers)
      • Graph-based Data (e.g. social networks)

  10. Metric Data Oriented Methods
      • Data is a set of objects in the “feature” space
      There are three main groups of methods:
      • Distance-based and density-based methods (with no additional assumptions)
      • Methods that assume that the data has a certain probabilistic distribution
      • Methods that assume that the “feature” space has highly correlated dimensions

  12. Distance-based Data Oriented Methods
      • Data is represented as a set of objects in a “feature” space with a natural Euclidean metric.
      • An anomaly can be defined as a data object in this space that is located far enough from the other objects.
      • If no other assumptions on the data are given, there are two main approaches to detecting anomalies:
          • k-Nearest-Neighbor based methods
          • Clustering based methods

  13. Clustering-based methods
      • The purpose is to form large clusters of data objects that define the “normal” data.
      • The most popular methods are:
          • k-means
          • EM-algorithm
          • Self-organizing map
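
The clustering idea on this slide can be sketched with k-means: fit centroids, treat only sufficiently large clusters as "normal", and score each object by its distance to the nearest such centroid. This is a minimal illustration, not a method taken from the survey; the function names, the explicit `init_centroids` argument, and the `min_size` cutoff are assumptions made for the example.

```python
import math

def kmeans(points, init_centroids, iters=10):
    """Lloyd's k-means on tuples; returns (centroids, sizes from the last assignment)."""
    centroids = list(init_centroids)
    sizes = [0] * len(centroids)
    for _ in range(iters):
        members = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda c: math.dist(p, centroids[c]))
            members[nearest].append(p)
        sizes = [len(m) for m in members]
        # New centroid = coordinate-wise mean; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(col) / len(m) for col in zip(*m)) if m else centroids[c]
            for c, m in enumerate(members)
        ]
    return centroids, sizes

def cluster_scores(points, centroids, sizes, min_size=3):
    """Anomaly score = distance to the nearest centroid of a 'large' cluster.

    min_size is a hypothetical cutoff for what counts as a large cluster.
    """
    large = [c for c, n in zip(centroids, sizes) if n >= min_size]
    if not large:
        raise ValueError("no cluster has at least min_size members")
    return [min(math.dist(p, c) for c in large) for p in points]

# Two tight groups plus one distant object (made-up data).
points = [(0, 0), (1, 0), (0, 1), (1, 1),
          (10, 10), (11, 10), (10, 11), (11, 11),
          (5, 30)]
centroids, sizes = kmeans(points, init_centroids=[(0, 0), (10, 10)])
scores = cluster_scores(points, centroids, sizes)
most_anomalous = points[scores.index(max(scores))]
```

Note that the distant object still pulls its cluster's centroid toward itself, which is one reason clustering-based scores are usually treated as a rough first pass.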

  14. K-Nearest Neighbors
      • These methods use the set of distances from a given object to its k closest neighbors.
      • Assumption: an anomalous object has much larger distances than normal data points.
      • There are many versions of the kNN algorithm that consider the density of the neighborhood:
          • kNN with Local Outlier Factor (LOF)
          • kNN with Local Correlation Integral (LOCI)
          • kNN-DD (using the Kolmogorov-Smirnov test)
          • etc.
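
The basic kNN assumption above can be sketched as follows: score every object by the distance to its k-th nearest neighbor and treat the largest scores as anomaly candidates. This is the plain distance variant only; it does not implement LOF, LOCI, or kNN-DD, and the name `knn_scores` and the sample data are assumptions for illustration.

```python
import math

def knn_scores(points, k):
    """Anomaly score of each point = distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances to every other point, smallest first (brute force, O(n^2)).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# A small square of points with one object far away (made-up data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (8.0, 8.0)]
scores = knn_scores(points, k=2)
outlier = points[scores.index(max(scores))]
```

The density-aware variants listed on the slide refine this by comparing an object's neighborhood with the neighborhoods of its neighbors, which helps when normal regions have different densities.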

  16. Probabilistically distributed data
      • The data has a natural probabilistic distribution
      The main approaches:
      • Extreme value analysis (Markov, Chebyshev, Hoeffding inequalities, t-value test, etc.)
      • Probabilistic distribution estimation (Bayesian methods)
      • Probabilistic mixture modeling (EM-algorithm)
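
As an illustration of extreme value analysis, Chebyshev's inequality states that for any distribution with finite variance, P(|X − μ| ≥ kσ) ≤ 1/k². A value can therefore be flagged when this bound on its tail probability falls below a chosen level. The sketch below assumes μ and σ are estimated from the same sample; the function name and the `alpha` parameter are hypothetical.

```python
import statistics

def chebyshev_outliers(values, alpha=0.2):
    """Flag values whose Chebyshev tail bound 1/k^2 is below alpha."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    flagged = []
    for x in values:
        # k = number of standard deviations between x and the mean.
        k = abs(x - mu) / sigma if sigma > 0 else 0.0
        # The bound is only informative for k > 1; otherwise it is trivially 1.
        bound = 1.0 if k <= 1.0 else 1.0 / (k * k)
        if bound < alpha:
            flagged.append(x)
    return flagged

values = [10, 11, 9, 10, 12, 10, 9, 11, 10, 50]  # made-up measurements
flagged = chebyshev_outliers(values)
```

Because the extreme value itself inflates the estimated mean and standard deviation, the bound is conservative on small samples; estimating μ and σ from a separate reference sample would avoid this.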

  18. Correlated dimensions data methods
      • Different dimensions are highly correlated with one another, so linear data models can be used
      • Main assumption: the data is embedded in a lower-dimensional subspace.
      Main approaches:
      • Linear regression modeling
      • Principal component analysis
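
The subspace assumption can be sketched with a one-component PCA: estimate the leading principal axis and score each object by its distance from that axis, so objects that break the correlation structure receive large scores. This is a simplified illustration using power iteration on the sample covariance matrix; the function names and the data are assumptions, and a real analysis would normally keep more than one component.

```python
import math

def first_principal_axis(points, iters=100):
    """Return (mean, unit vector of the leading principal component)."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration; assumes the start vector is not orthogonal to the
    # leading axis and that the covariance matrix is not all zeros.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def residual_scores(points, mean, axis):
    """Distance from each point to the line through `mean` along `axis`."""
    scores = []
    for p in points:
        r = [x - m for x, m in zip(p, mean)]
        t = sum(a * b for a, b in zip(r, axis))  # coordinate along the axis
        scores.append(math.sqrt(max(0.0, sum(a * a for a in r) - t * t)))
    return scores

# Points on the line y = 2x, plus one object that violates the correlation.
points = [(0, 0), (1, 2), (2, 4), (3, 6), (4, 8), (5, 10), (2.5, 9)]
mean, axis = first_principal_axis(points)
scores = residual_scores(points, mean, axis)
outlier = points[scores.index(max(scores))]
```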

  20. Evolving Data Oriented Methods
      • Data is generated by a continuous or discrete temporal process, so there are two main groups of methods:
          • Discrete sequence oriented methods
              • Example: Genome sequence analysis
              • Example: User-action sequence analysis
          • Time series oriented methods
              • Example: Detecting novel events in social media
      • The data can be presented in a multidimensional form (such as data streams)

  22. Discrete Sequence Oriented Methods
      • There are several ways to define an anomaly in a discrete sequence:
          • as a position anomaly (a single value is anomalous)
          • as a combination anomaly (the whole sequence is anomalous)
      • Main approaches to determining the rarity of a value or a combination:
          • Distance-based
              • analysis of the pair-wise distance matrix
          • Frequency-based
          • Model-based (Hidden Markov Models)
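
The frequency-based approach can be sketched with n-gram counts: count the n-grams seen in reference sequences, then report rare n-grams in a new sequence as position anomalies and use their share as a score for the whole sequence (a combination anomaly). The function names, the `min_count` threshold, and the user-action data are assumptions for illustration.

```python
from collections import Counter

def ngram_counts(sequences, n):
    """Count every n-gram occurring in the reference sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n])] += 1
    return counts

def rare_positions(seq, counts, n, min_count=1):
    """Start indices of n-grams seen fewer than min_count times in the reference data."""
    return [i for i in range(len(seq) - n + 1)
            if counts[tuple(seq[i:i + n])] < min_count]

def sequence_score(seq, counts, n, min_count=1):
    """Fraction of n-grams in seq that are rare (0.0 if seq is shorter than n)."""
    windows = len(seq) - n + 1
    if windows <= 0:
        return 0.0
    return len(rare_positions(seq, counts, n, min_count)) / windows

# Made-up user-action logs used as the reference ("normal") data.
train = [
    ["login", "view", "buy", "logout"],
    ["login", "view", "view", "logout"],
    ["login", "buy", "logout"],
]
counts = ngram_counts(train, n=2)

session = ["login", "view", "delete", "logout"]
positions = rare_positions(session, counts, n=2)
score = sequence_score(session, counts, n=2)
```

Here `positions` marks the bigrams that involve the unseen "delete" action, while `score` summarizes how unusual the session is as a whole.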

  24. Time Series Oriented Methods
      • There are several ways to define an anomaly in a time series:
          • an abrupt change in the time series
          • an unexpected trend in the time series
          • a time series of unusual shape
      • Main approaches are:
          • correlation across time or series
          • time series representation
          • HMM (Hidden Markov Model)
          • ARMA (autoregressive moving-average model)
          • Wavelets
          • Spectral
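
Detecting an abrupt change with an autoregressive model can be sketched as follows: fit the model to the series, then flag time steps whose one-step prediction error is unusually large. The example uses AR(1), the simplest autoregressive special case of the ARMA family named above, fitted by ordinary least squares; the function names, the `threshold` parameter, and the series are assumptions for illustration.

```python
import statistics

def ar1_residuals(series):
    """Fit x[t] = a + b * x[t-1] by least squares; return (a, b, residuals)."""
    prev, curr = series[:-1], series[1:]
    mp, mc = statistics.fmean(prev), statistics.fmean(curr)
    # Assumes the series is not constant, so the denominator is non-zero.
    sxx = sum((p - mp) ** 2 for p in prev)
    b = sum((p - mp) * (c - mc) for p, c in zip(prev, curr)) / sxx
    a = mc - b * mp
    residuals = [c - (a + b * p) for p, c in zip(prev, curr)]
    return a, b, residuals

def abrupt_changes(series, threshold=2.5):
    """Indices t where |prediction error| exceeds threshold * residual std dev."""
    _, _, res = ar1_residuals(series)
    sigma = statistics.pstdev(res)
    if sigma == 0:
        return []
    # residual i belongs to time step i + 1 of the original series
    return [i + 1 for i, r in enumerate(res) if abs(r) > threshold * sigma]

# An oscillating signal with a single spike at index 7 (made-up data).
series = [10, 11, 10, 11, 10, 11, 10, 25, 10, 11, 10, 11]
changes = abrupt_changes(series)
```

The spike also distorts the fitted coefficients and inflates the residual spread, so in practice the model is usually fitted on a reference window that is believed to be clean.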
