
Course Content Week 12 (May 26) Introduction to Data Mining - PowerPoint PPT Presentation



  1. Lecture 7: Outlier Detection
     Week 12 (May 26)
     33459-01 Principles of Knowledge Discovery in Data
     Lecture by: Dr. Osmar R. Zaïane
     33459-01: Principles of Knowledge Discovery in Data – March-June, 2006 (Dr. O. Zaiane)

     Course Content
     • Introduction to Data Mining
     • Association analysis
     • Sequential Pattern Analysis
     • Classification and prediction
     • Contrast Sets
     • Data Clustering
     • Outlier Detection
     • Web Mining

     What is an Outlier?
     • An observation (or measurement) that is unusually different (large or small) relative to the other values in a data set.
     • Outliers typically are attributable to one of the following causes:
       – Error: the measurement or event is observed, recorded, or entered into the computer incorrectly.
       – Contamination: the measurement or event comes from a different population.
       – Inherent variability: the measurement or event is correct, but represents a rare event.

     Many Names for Outlier Detection
     • Outlier detection
     • Outlier analysis
     • Anomaly detection
     • Intrusion detection
     • Misuse detection
     • Surprise discovery
     • Rarity detection
     • Detection of unusual events

  2. Lecture Outline
     Part I: What is Outlier Detection (30 minutes)
     • Introduction to outlier analysis
     • Definitions and Relative Notions
     • Motivating Examples for outlier detection
     • Taxonomy of Major Outlier Detection Algorithms
     Part II: Statistics Approaches
     • Distribution-Based (Univariate and multivariate)
     • Depth-Based
     • Graphical Aids
     Part III: Data Mining Approaches
     • Clustering-Based
     • Distance-Based
     • Density-Based
     • Resolution-Based

     Finding Gems
     • If data mining is about finding gems in a database, then of all the data mining tasks (characterization, classification, clustering, association analysis, contrasting, …), outlier detection is the closest to this metaphor.
     [Figure: Data Mining can discover “gems” in the data]

     Global versus Local Outliers
     • Global outliers: vis-à-vis the whole dataset
     • Local outliers: vis-à-vis a subset of the data
     • Is one anomaly more of an outlier than other outliers?
     • Could we rank outliers?

     Different Definitions
     • An observation that deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism. [Hawkins, 1980]
     • An outlier is an observation (or subset of observations) which appears to be inconsistent with the remainder of that dataset. [Barnett & Lewis, 1994]
     • An outlier is an observation that lies outside the overall pattern of a distribution. [Moore & McCabe, 1999]
     • Outliers are those data records that do not follow any pattern in an application. [Chen, Tan & Fu, 2003]

  3. More Definitions
     • An object O in a dataset is a DB(p,D)-outlier if at least a fraction p of the other objects in the dataset lie at a distance greater than D from O. [Knorr & Ng, 1997]
     • An outlier in a set of data is an observation or a point that is considerably dissimilar or inconsistent with the remainder of the data. [Ramaswamy, Rastogi & Shim, 2000]
     • Given an input data set with N points and parameters n and k, a point p is a D^k_n outlier if there are no more than n-1 other points p' such that D^k(p') > D^k(p), where D^k(p) denotes the distance of point p from its k-th nearest neighbor. [Ramaswamy, Rastogi & Shim, 2000]
     • Given a set of observations X, an outlier is an observation that is an element of this set X but which is inconsistent with the majority of the data or inconsistent with a sub-group of X to which the element is meant to be similar. [Fan, Zaïane, Foss & Wu, 2006]

     Relativity of an Outlier
     • The notion of outlier is subjective and highly application-domain-dependent.
     • There is an ambiguity in defining an outlier.

     Topology for Outlier Detection
     [Diagram: Outlier Detection Methods; the diagram also carries a Top-n label]
     • Statistical Methods: Distribution-Based, Depth-Based, Visual
     • Data Mining Methods
       – Generic: Clustering-Based, Distance-Based, Density-Based, Resolution-Based
       – Specific: Spatial, Sequence-Based

     Application of Anomaly Detection
     • Data Cleaning: elimination of noise (abnormal data)
       – Noise can significantly affect data modeling (Data Quality)
     • Network Intrusion (hackers, DoS, etc.)
     • Fraud detection (credit cards, stocks, financial transactions, communications, voting irregularities, etc.)
     • Surveillance
     • Performance Analysis (for scouting athletes, etc.)
     • Weather Prediction (environmental protection, disaster prevention, etc.)
     • Real-time anomaly detection in various monitoring systems, such as structural health and transportation
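The two distance-based definitions on the "More Definitions" slide (the DB(p,D)-outlier and the k-th nearest neighbor ranking) can be sketched in a few lines. This is a minimal brute-force illustration, not the algorithms from the cited papers: the function names, the Euclidean distance, and the toy data are assumptions made for the example, and ties in the ranking are broken arbitrarily.

```python
import math

def db_outliers(points, p, D):
    """Return the points O for which at least a fraction p of the
    other points lie at a distance greater than D from O."""
    result = []
    for i, o in enumerate(points):
        others = [q for j, q in enumerate(points) if j != i]
        far = sum(1 for q in others if math.dist(o, q) > D)
        if others and far / len(others) >= p:
            result.append(o)
    return result

def kth_nn_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor (k >= 1)."""
    dists = sorted(math.dist(points[i], q)
                   for j, q in enumerate(points) if j != i)
    return dists[k - 1]

def top_n_knn_outliers(points, n, k):
    """Rank points by k-th nearest neighbor distance and keep the n
    points with the largest values."""
    scored = [(kth_nn_distance(points, i, k), pt)
              for i, pt in enumerate(points)]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [pt for _, pt in scored[:n]]

# Hypothetical toy data: a tight square plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(db_outliers(points, p=0.9, D=5.0))
print(top_n_knn_outliers(points, n=1, k=2))
```

Both calls single out (10, 10) on this toy data. Note that the first definition needs two thresholds (p and D), while the second only needs k and the number n of outliers wanted, which is what makes ranking possible.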

  4. Lecture Outline
     Part I: What is Outlier Detection (30 minutes)
     • Introduction to outlier analysis
     • Definitions and Relative Notions
     • Motivating Examples for outlier detection
     • Taxonomy of Major Outlier Detection Algorithms
     Part II: Statistics Approaches
     • Distribution-Based (Univariate and multivariate)
     • Depth-Based
     • Graphical Aids
     Part III: Data Mining Approaches
     • Clustering-Based
     • Distance-Based
     • Density-Based
     • Resolution-Based

     Outliers and Statistics
     • Currently, in most applications, outlier detection still depends to a large extent on traditional statistical methods.
     • In statistics, before building a multivariate (or any) statistical representation from the process data, a pre-screening/pre-treatment procedure is essential to remove noise that can affect models and seriously bias statistical estimates.
     • Assume a statistical distribution and find records which deviate significantly from the assumed model.
     [Diagram excerpt: Distribution-Based, Depth-Based, Visual]

     Chebyshev Theorem
     • Univariate
       According to Chebyshev's theorem, almost all the observations in a data set will have a z-score less than 3 in absolute value, i.e. almost all data fall into the interval [µ-3σ, µ+3σ], where µ is the mean and σ is the standard deviation.
     • The Russian mathematician P. L. Chebyshev (1821-1894) discovered that the fraction of observations falling between two distinct values, whose differences from the mean have the same absolute value, is related to the variance of the population. Chebyshev's Theorem gives a conservative estimate of this fraction.
     Theorem: The fraction of any data set lying within k standard deviations of the mean is at least 1 - 1/k^2.
     • For any population or sample, at least 1 - 1/k^2 of the observations in the data set fall within k standard deviations of the mean, where k >= 1. For k = 3 this is at least 8/9, about 89%.

     Distribution-Based Outlier Detection
     • Univariate
       The definition is based on a standard probability model (Normal, Poisson, Binomial). Assumes or fits a distribution to the data.
       Z-score: z = (x - µ)/σ
       The z-score for each data point is computed, and the observations with a z-score greater than 3 in absolute value are declared outliers.
     • Any problem with this?
       µ and σ are themselves very sensitive to outliers. Extreme values skew the mean. Consider: the mean of {1, 2, 3, 4, 5} is 3, while the mean of {1, 2, 3, 4, 1000} is 202.
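The z-score rule and the slide's two five-point data sets can be tried directly. The sketch below is only an illustration under stated assumptions: the helper names are made up for this example, and it uses the population standard deviation (the slides do not say whether σ is the population or sample estimate).

```python
import statistics

def z_scores(values):
    """z = (x - mu) / sigma, with mu and sigma estimated from the same
    data (population standard deviation; an assumption of this sketch)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]

def chebyshev_lower_bound(k):
    """Minimum fraction of any data set within k standard deviations."""
    return 1 - 1 / k**2

clean = [1, 2, 3, 4, 5]
dirty = [1, 2, 3, 4, 1000]

print(statistics.fmean(clean))   # mean of the clean set
print(statistics.fmean(dirty))   # mean pulled toward the extreme value
print(max(abs(z) for z in z_scores(dirty)))
print(chebyshev_lower_bound(3))
```

In this example the extreme value 1000 inflates σ so much that its own z-score stays below 3, so the "z-score greater than 3" rule would not flag it. That is one concrete form of the problem the slide asks about: µ and σ are estimated from the very data that contains the outlier.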
