Course Content Week 12 (May 26) Introduction to Data Mining - PowerPoint PPT Presentation

Lecture 7
33459-01 Principles of Knowledge Discovery in Data
Outlier Detection
Lecture by: Dr. Osmar R. Zaïane
33459-01: Principles of Knowledge Discovery in Data – March-June, 2006 (Dr. O. Zaiane)

Course Content
Week 12 (May 26)
• Introduction to Data Mining
• Association analysis
• Sequential Pattern Analysis
• Classification and prediction
• Contrast Sets
• Data Clustering
• Outlier Detection
• Web Mining

What is an Outlier?
• An observation (or measurement) that is unusually different (large or small) relative to the other values in a data set.
• Outliers typically are attributable to one of the following causes:
– Error: the measurement or event is observed, recorded, or entered into the computer incorrectly.
– Contamination: the measurement or event comes from a different population.
– Inherent variability: the measurement or event is correct, but represents a rare event.

Many Names for Outlier Detection
• Outlier detection
• Outlier analysis
• Anomaly detection
• Intrusion detection
• Misuse detection
• Surprise discovery
• Rarity detection
• Detection of unusual events

Lecture Outline
Part I: What is Outlier Detection (30 minutes)
• Introduction to outlier analysis
• Definitions and Relative Notions
• Motivating Examples for outlier detection
• Taxonomy of Major Outlier Detection Algorithms
Part II: Statistics Approaches
• Distribution-Based (Univariate and multivariate)
• Depth-Based
• Graphical Aids
Part III: Data Mining Approaches
• Clustering-Based
• Distance-Based
• Density-Based
• Resolution-Based

Finding Gems
• If data mining is about finding gems in a database, then of all the data mining tasks (characterization, classification, clustering, association analysis, contrasting, ...), outlier detection is the closest to this metaphor.
• Data mining can discover "gems" in the data.

Global versus Local Outliers
• Global outliers: vis-à-vis the whole dataset
• Local outliers: vis-à-vis a subset of the data
• Is one anomaly more of an outlier than other outliers?
• Could we rank outliers?

Different Definitions
• An observation that deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism. [Hawkins, 1980]
• An outlier is an observation (or subset of observations) which appears to be inconsistent with the remainder of that dataset. [Barnett & Lewis, 1994]
• An outlier is an observation that lies outside the overall pattern of a distribution. [Moore & McCabe, 1999]
• Outliers are those data records that do not follow any pattern in an application. [Chen, Tan & Fu, 2003]

More Definitions
• An object O in a dataset is a DB(p, D)-outlier if at least a fraction p of the other objects in the dataset lie at a distance greater than D from O. [Knorr & Ng, 1997]
• An outlier in a set of data is an observation or a point that is considerably dissimilar or inconsistent with the remainder of the data. [Ramaswamy, Rastogi & Shim, 2000]
• Given an input data set with N points and parameters n and k, a point p is a $D^k_n$ outlier if there are no more than n-1 other points p' such that $D^k(p') > D^k(p)$, where $D^k(p)$ denotes the distance of point p from its k-th nearest neighbor. [Ramaswamy, Rastogi & Shim, 2000]
• Given a set of observations X, an outlier is an observation that is an element of this set X but which is inconsistent with the majority of the data or inconsistent with a sub-group of X to which the element is meant to be similar. [Fan, Zaïane, Foss & Wu, 2006]

Relativity of an Outlier
• The notion of outlier is subjective and highly application-domain-dependent.
• There is an ambiguity in defining an outlier.

Topology for Outlier Detection
Outlier Detection Methods
• Statistical Methods
– Distribution-Based
– Depth-Based
– Visual
• Data Mining Methods
– Generic: Distance-Based, Clustering-Based, Density-Based, Resolution-Based
– Specific: Spatial, Sequence-Based
– Top-n

Application of Anomaly Detection
• Data Cleaning: elimination of noise (abnormal data)
– Noise can significantly affect data modeling (Data Quality)
• Network Intrusion (Hackers, DoS, etc.)
• Fraud detection (Credit cards, stocks, financial transactions, communications, voting irregularities, etc.)
• Surveillance
• Performance Analysis (for scouting athletes, etc.)
• Weather Prediction (Environmental protection, disaster prevention, etc.)
• Real-time anomaly detection in various monitoring systems, such as structural health and transportation
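
The two distance-based definitions on the "More Definitions" slide can be sketched in a few lines of Python. This is only an illustrative brute-force version, not the algorithms from the cited papers: the function names, the toy data set, and the parameter values (p = 0.9, D = 2.0, n = 1, k = 2) are all assumptions chosen for the example, and Euclidean distance is assumed as the metric.

```python
import math

def db_outliers(points, p, D):
    """Indices of DB(p, D)-outliers: points for which at least a
    fraction p of the other points lie farther than D away."""
    flagged = []
    for i, a in enumerate(points):
        others = [b for j, b in enumerate(points) if j != i]
        far = sum(1 for b in others if math.dist(a, b) > D)
        if others and far / len(others) >= p:
            flagged.append(i)
    return flagged

def kth_nn_distance(points, i, k):
    """D^k(p): distance from points[i] to its k-th nearest neighbor."""
    dists = sorted(math.dist(points[i], b)
                   for j, b in enumerate(points) if j != i)
    return dists[k - 1]

def top_n_knn_outliers(points, n, k):
    """Indices of the n points with the largest D^k values.
    Ties are cut arbitrarily here, a simplification of the definition."""
    scores = [(kth_nn_distance(points, i, k), i) for i in range(len(points))]
    scores.sort(reverse=True)
    return [i for _, i in scores[:n]]

# Hypothetical toy data: a tight cluster near the origin plus one far point.
data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.0, 0.3), (0.3, 0.0), (5.0, 5.0)]
db_result = db_outliers(data, p=0.9, D=2.0)
knn_result = top_n_knn_outliers(data, n=1, k=2)
print(db_result, knn_result)
```

Both functions compare every pair of points, so they run in quadratic time; they are meant to show what the definitions say, not how to compute them efficiently on large data.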

Lecture Outline
Part II: Statistics Approaches
• Distribution-Based (Univariate and multivariate)
• Depth-Based
• Graphical Aids

Outliers and Statistics
• Currently, in most applications outlier detection still depends to a large extent on traditional statistical methods.
• In statistics, prior to the building of a multivariate (or any) statistical representation from the process data, a pre-screening/pre-treatment procedure is essential to remove noise that can affect models and seriously bias and influence statistical estimates.
• Assume a statistical distribution and find records which deviate significantly from the assumed model.

Chebyshev Theorem
• Univariate
According to Chebyshev's theorem, almost all the observations in a data set will have a z-score less than 3 in absolute value, i.e. nearly all data fall into the interval $[\mu - 3\sigma, \mu + 3\sigma]$, where $\mu$ is the mean and $\sigma$ is the standard deviation.
• The Russian mathematician P. L. Chebyshev (1821-1894) discovered that the fraction of observations falling between two distinct values, whose differences from the mean have the same absolute value, is related to the variance of the population. Chebyshev's Theorem gives a conservative estimate of that fraction.
Theorem: The fraction of any data set lying within k standard deviations of the mean is at least $1 - 1/k^2$.
• For any population or sample, at least $1 - 1/k^2$ of the observations in the data set fall within k standard deviations of the mean, where $k \ge 1$. For k = 3 this guarantees at least 8/9 (about 89%) of the observations.

Distribution-Based Outlier Detection
• Univariate
The definition is based on a standard probability model (Normal, Poisson, Binomial). Assumes or fits a distribution to the data.
Z-score: $z = (x - \mu)/\sigma$
The z-score for each data point is computed, and the observations with an absolute z-score greater than 3 are declared outliers.
• Any problem with this?
$\mu$ and $\sigma$ are themselves very sensitive to outliers. Extreme values skew the mean. Consider that the mean of {1, 2, 3, 4, 5} is 3 while the mean of {1, 2, 3, 4, 1000} is 202.
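
The z-score rule and the weakness raised on the last slide can be checked numerically. The sketch below is a minimal illustration, not code from the lecture: the function names are made up for the example, and the population standard deviation is assumed for $\sigma$. It reuses the slide's data set {1, 2, 3, 4, 1000}.

```python
import statistics

def z_scores(values):
    """z = (x - mu) / sigma for each value, using the population
    standard deviation (an assumption; undefined if all values are equal)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]

def z_outliers(values, threshold=3.0):
    """Values whose absolute z-score exceeds the threshold."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]

def chebyshev_lower_bound(k):
    """Minimum fraction of data within k standard deviations of the mean."""
    return 1.0 - 1.0 / (k * k)

sample = [1, 2, 3, 4, 1000]
mean_value = statistics.fmean(sample)   # the extreme value drags the mean to 202
scores = z_scores(sample)
flagged = z_outliers(sample)            # values the 3-sigma rule would report
bound_k3 = chebyshev_lower_bound(3)     # 8/9, the guarantee for k = 3

print(mean_value, scores[-1], flagged, bound_k3)
```

In this sample the value 1000 inflates $\sigma$ so much that its own z-score is only about 2, so the 3-sigma rule flags nothing. This is the sensitivity of $\mu$ and $\sigma$ to outliers that the slide warns about.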
