

SLIDE 1

LUDWIG-MAXIMILIANS-UNIVERSITY MUNICH, INSTITUTE FOR INFORMATICS, DATABASE SYSTEMS GROUP

Outlier Detection in Axis-Parallel Subspaces of High Dimensional Data

PAKDD 2009

Hans-Peter Kriegel, Peer Kröger, Erich Schubert, Arthur Zimek Ludwig-Maximilians-Universität München Munich, Germany http://www.dbs.ifi.lmu.de {kriegel,kroegerp,schube,zimek}@dbs.ifi.lmu.de

SLIDE 2

DATABASE SYSTEMS GROUP | Kriegel et al.: Outlier Detection in Axis-Parallel Subspaces of High Dimensional Data (PAKDD 2009)

Outline

  • 1. Motivation
  • 2. Subspace Outlier
  • 3. Reference Set for Outliers
  • 4. Comparison to Existing Approaches
  • 5. Conclusion
SLIDE 4

Motivation

  • Hawkins' definition:

“An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.”

  • collecting data with high dimensionality leads to the “curse of dimensionality”
  • two aspects here:

– Euclidean distances (as commonly used) lose their expressiveness: no outlier can be detected that deviates considerably from the majority of points in comparison to other points
– a “generating mechanism” to identify may be responsible for a subset of the features only (local feature relevance)
SLIDE 5

Motivation

  • try to find outliers in subspaces, i.e., based on the subset of features related to a “generating mechanism”
  • subspace {A1}: the point is an outlier
  • subspace {A2}: the point is not an outlier
  • full-dimensional space {A1, A2}: the point is not an outlier
  • the distribution of attribute values in A2 appears not to be relevant for the “mechanism” in question

SLIDE 6

Outline

  • 1. Motivation
  • 2. Subspace Outlier
  • 3. Reference Set for Outliers
  • 4. Comparison to Existing Approaches
  • 5. Conclusion
SLIDE 7

Subspace Outlier

general idea:

  • assign a set of reference points to a point o (e.g., its k-nearest neighbors; but keep in mind the “curse of dimensionality”: local feature relevance vs. meaningful distances)
  • find the subspace spanned by these reference points (allowing some jitter)
  • analyze how well the point o fits this subspace
SLIDE 8

Subspace Outlier

  • the subspace spanned by a set of points S is orthogonal to a subspace minimizing the variance while maximizing the number of attributes: a hyperplane more or less accommodating the set S of reference points
  • within this subspace, the variance of the points in S is high
  • in the perpendicular space, the variance is low

SLIDE 9

Subspace Outlier

  • variance VAR^S: averaged squared distance of the points in S to the mean μ^S:

$$\mathit{VAR}^S = \frac{\sum_{p \in S} dist(p, \mu^S)^2}{|S|}$$

  • variance along attribute i:

$$\mathit{var}_i^S = \frac{\sum_{p \in S} dist(p_i, \mu_i^S)^2}{|S|}$$
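The two variance definitions can be sketched in plain Python (a minimal illustration; the helper names are mine, not the paper's):

```python
# Minimal sketch of the slide's variance definitions (helper names are my own).
from math import dist  # Euclidean distance (Python 3.8+)

def mean_point(S):
    """Coordinate-wise mean mu^S of the point set S."""
    d, n = len(S[0]), len(S)
    return [sum(p[i] for p in S) / n for i in range(d)]

def total_variance(S):
    """VAR^S: averaged squared Euclidean distance of the points in S to mu^S."""
    mu = mean_point(S)
    return sum(dist(p, mu) ** 2 for p in S) / len(S)

def attribute_variance(S, i):
    """var_i^S: averaged squared distance along attribute i only."""
    mu = mean_point(S)
    return sum((p[i] - mu[i]) ** 2 for p in S) / len(S)

# Points spread along the first attribute, nearly constant in the second:
S = [(0.0, 0.0), (2.0, 0.1), (4.0, -0.1)]
```

Note that for Euclidean distance VAR^S decomposes into the sum of var_i^S over all d attributes, which is what makes the per-attribute comparison on the next slide meaningful.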

SLIDE 10

Subspace Outlier

  • derive the subspace: the subspace defining vector specifies the relevant attributes of the subspace defined by a reference set, i.e., the attributes where the reference points exhibit low variance
  • over all d attributes, the points have a total variance of VAR^S
  • the expected variance along attribute i is VAR^S / d
  • the variance along attribute i is low if it falls below the expected variance scaled by a predefined coefficient α:

$$v_i^S = \begin{cases} 1 & \text{if } \mathit{var}_i^S < \alpha \cdot \frac{\mathit{VAR}^S}{d} \\ 0 & \text{else} \end{cases}$$
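This selection rule might be sketched as follows (α = 0.8 is an assumed parameter choice, not prescribed by this slide):

```python
# Sketch of the subspace defining vector v^S; alpha is an assumed parameter choice.
def subspace_defining_vector(S, alpha=0.8):
    """v_i^S = 1 if var_i^S < alpha * VAR^S / d, else 0."""
    d, n = len(S[0]), len(S)
    mu = [sum(p[i] for p in S) / n for i in range(d)]
    var = [sum((p[i] - mu[i]) ** 2 for p in S) / n for i in range(d)]
    VAR = sum(var)  # total variance = sum of per-attribute variances (Euclidean case)
    return [1 if var[i] < alpha * VAR / d else 0 for i in range(d)]

# Points vary strongly in attribute 1 and barely in attribute 2:
S = [(0.0, 0.0), (2.0, 0.1), (4.0, -0.1)]
print(subspace_defining_vector(S))  # → [0, 1]: only the low-variance attribute 2 is "relevant"
```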

SLIDE 11

Subspace Outlier

  • the subspace hyperplane H(S) of a reference set S is defined by the mean value μ^S and the subspace defining vector v^S
  • example: points in the reference set R(o) of o that form a line in three-dimensional space yield v^{R(o)} = (1, 0, 1)

SLIDE 12

Subspace Outlier

  • distance of o to the reference hyperplane H(S):

$$dist(o, H(S)) = \sqrt{\sum_{i=1}^{d} v_i^S \cdot (o_i - \mu_i^S)^2}$$

  • the higher this distance, the more the point o deviates from the behavior of the reference set, and the more likely it is an outlier
SLIDE 13

Subspace Outlier

subspace outlier degree (SOD) of a point p, i.e., the distance normalized by the number of contributing attributes:

$$SOD_{R(p)}(p) = \frac{dist(p, H(R(p)))}{\lVert v^{R(p)} \rVert_1}$$

a possible normalization to a probability value in [0, 1] relates the distance to the distribution of distances of all points in S
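Putting the pieces together, the SOD score could be sketched as below (self-contained; α = 0.8 and the fallback when no attribute has low variance are my assumptions):

```python
# Sketch of the subspace outlier degree; alpha and the empty-subspace fallback are assumptions.
from math import sqrt

def sod(p, R, alpha=0.8):
    """dist(p, H(R(p))) normalized by ||v^{R(p)}||_1, the number of relevant attributes."""
    d, n = len(p), len(R)
    mu = [sum(q[i] for q in R) / n for i in range(d)]
    var = [sum((q[i] - mu[i]) ** 2 for q in R) / n for i in range(d)]
    VAR = sum(var)
    v = [1 if var[i] < alpha * VAR / d else 0 for i in range(d)]
    norm = sum(v)
    if norm == 0:          # no low-variance attribute: p cannot deviate from H(R(p))
        return 0.0
    distance = sqrt(sum(v[i] * (p[i] - mu[i]) ** 2 for i in range(d)))
    return distance / norm

R = [(0.0, 0.0), (2.0, 0.1), (4.0, -0.1)]
print(sod((2.0, 5.0), R))  # → 5.0: strong deviation in the low-variance attribute
print(sod((9.0, 0.0), R))  # → 0.0: extreme in attribute 1, yet on the hyperplane
```

The second call illustrates the subspace intuition from the motivation: a point far away in an attribute where the reference set itself varies widely is not penalized.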

SLIDE 14

Outline

  • 1. Motivation
  • 2. Subspace Outlier
  • 3. Reference Set for Outliers
  • 4. Comparison to Existing Approaches
  • 5. Conclusion
SLIDE 15

Reference Set for Outliers

  • recall the “curse of dimensionality”:

– local feature relevance: need for a local reference set
– distances lose expressiveness: how to choose a meaningful local reference set?

  • consider the l nearest neighbors in terms of the shared nearest neighbor (SNN) similarity:

– given a primary distance function dist (e.g. Euclidean distance)
– N_k(p): the k-nearest neighbors of p in terms of dist
– SNN similarity for two points p and q:

$$sim_{SNN}(p, q) = \lvert N_k(p) \cap N_k(q) \rvert$$

– reference set R(p): the l-nearest neighbors of p using sim_SNN

  • observations back the assumption that SNN stabilizes neighborhoods in high dimensional data
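A brute-force sketch of this construction (naive O(n²) neighbor search; the helper names are mine):

```python
# Naive sketch of SNN similarity and the SNN-based reference set R(p).
from math import dist  # primary distance: Euclidean

def nk(points, i, k):
    """N_k(p): indices of the k nearest neighbors of points[i] under the primary distance."""
    others = sorted((j for j in range(len(points)) if j != i),
                    key=lambda j: dist(points[i], points[j]))
    return set(others[:k])

def sim_snn(points, i, j, k):
    """sim_SNN(p, q) = |N_k(p) ∩ N_k(q)|."""
    return len(nk(points, i, k) & nk(points, j, k))

def reference_set(points, i, k, l):
    """R(p): indices of the l points most similar to p under sim_SNN (ties broken arbitrarily)."""
    others = sorted((j for j in range(len(points)) if j != i),
                    key=lambda j: -sim_snn(points, i, j, k))
    return others[:l]

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (50.0, 50.0)]
R = reference_set(points, 0, k=2, l=2)  # indices of the 2 SNN-nearest points to points[0]
```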

SLIDE 16

Outline

  • 1. Motivation
  • 2. Subspace Outlier
  • 3. Reference Set for Outliers
  • 4. Comparison to Existing Approaches
  • 5. Conclusion
SLIDE 17

Comparison to Existing Approaches

complexity:

  • determine the set of k-nearest neighbors for each of the n points: O(d·n²)
  • determine the reference set for each point (l-nearest neighbors based on sim_SNN): O(k·n)
  • overall (since k ≪ n): O(d·n²), comparable to most existing outlier detection algorithms

SLIDE 18

Comparison to Existing Approaches

  • 2-d sample data: comparison of the rankings produced by LOF, ABOD, and SOD (figure)

SLIDE 19

Comparison to Existing Approaches

  • Gaussian distribution in 3 dimensions, 20 outliers
  • adding 7, 17, 27, 47, 67, 97 irrelevant attributes
SLIDE 20

Outline

  • 1. Motivation
  • 2. Subspace Outlier
  • 3. Reference Set for Outliers
  • 4. Comparison to Existing Approaches
  • 5. Conclusion
SLIDE 21

Conclusion

  • SOD is a new approach to model outliers in high dimensional data.
  • SOD explores outliers in subspaces of the original feature space by combining the tasks of outlier detection and finding the relevant subspace.
  • SOD is relatively stable with increasing dimensionality by determining the set of locally relevant neighbors based on SNN.
  • SOD finds interesting and meaningful outliers in high dimensional data based on a different intuition compared to full-dimensional outlier models, without adding computational costs.