DBSCAN Presented by: Garrett Poppe

A density-based algorithm for discovering clusters in large spatial databases with noise, by Martin Ester, Hans-Peter Kriegel, Jörg S, Xiaowei Xu. Slides adapted from resources outlined in the resources slide.

Summary
● K-Means Clustering Method
● Density-Based Clustering
● DBSCAN
  – Points
  – Optimal Eps & MinPts
  – Algorithm
  – Flaws
  – Complexity
● Resources
● Questions

The K-Means Clustering Method: for numerical attributes
Given k, the k-means algorithm is implemented in four steps:
● Partition the objects into k non-empty subsets
● Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
● Assign each object to the cluster with the nearest seed point
● Go back to Step 2; stop when there are no more new assignments
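The four steps above can be sketched in plain Python. This is a minimal illustration rather than the presenter's code: the names `kmeans` and `centroid`, the round-robin initial partition, the `max_iter` cap, and the sample points are all assumptions made for the example.

```python
import math

def centroid(points):
    """Mean point of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, max_iter=100):
    if len(points) < k:
        raise ValueError("need at least k points")
    # Step 1: partition the objects into k non-empty subsets
    # (round-robin split; an assumption, any partition would do).
    labels = [i % k for i in range(len(points))]
    centers = [None] * k
    for _ in range(max_iter):
        # Step 2: compute the centroid of each cluster of the current partition.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster became empty
                centers[c] = centroid(members)
        # Step 3: assign each object to the cluster with the nearest seed point.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Step 4: stop when no assignment changed, otherwise go back to Step 2.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers

# Hypothetical sample data: two well-separated groups.
sample = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centers = kmeans(sample, k=2)
print(labels)
print(centers)
```

The result depends on the initial partition, so a different starting split can converge to a different clustering.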

The K-Means Clustering Method
X      Y
1      2
2      4
3      3
4      2
2.5    2.75   (mean point)
The mean point can be a virtual point, and the mean point can be influenced by an outlier.

The K-Means Clustering Method ● Example (K=2)
[Figure: scatter plots on 0 to 10 axes showing the k-means loop: arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means; reassign.]
June 23, 2014, Data Mining: Concepts and Techniques

The K-Means Clustering Method
[Figure: six scatter plots, Iteration 1 through Iteration 6; axes x (-2 to 2) and y (0 to 3).]

The K-Means Clustering Method
The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.
[Figure: two scatter plots on 0 to 10 axes.]
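To make the outlier sensitivity concrete, the sketch below compares the mean point with the medoid on the four points from the earlier mean-point slide, before and after adding one extreme point. The outlier `(100, 100)` and the helper names `mean_point` and `medoid` are assumptions for illustration; the medoid is taken here as the object with the smallest total distance to all other objects.

```python
import math

def mean_point(points):
    """Coordinate-wise mean; may be a virtual point that is not in the data."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def medoid(points):
    """The actual object with the smallest total distance to all objects."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

cluster = [(1, 2), (2, 4), (3, 3), (4, 2)]   # points from the mean-point slide
with_outlier = cluster + [(100, 100)]        # hypothetical outlier

print(mean_point(cluster))        # (2.5, 2.75)
print(mean_point(with_outlier))   # (22.0, 22.2)
print(medoid(with_outlier))       # (3, 3)
```

The mean is dragged far away from the four original points, while the medoid stays on one of them.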

Density-Based Clustering Methods
Clustering based on density (local cluster criterion), such as density-connected points
Major features:
● Discover clusters of arbitrary shape
● Handle noise
● One scan
● Need density parameters as termination condition
Several interesting studies:
● DBSCAN: Ester, et al. (KDD’96)
● OPTICS: Ankerst, et al. (SIGMOD’99)
● DENCLUE: Hinneburg & D. Keim (KDD’98)
● CLIQUE: Agrawal, et al. (SIGMOD’98)

Density-Based Clustering
Clustering based on density (local cluster criterion), such as density-connected points.
Each cluster has a considerably higher density of points than the area outside of the cluster.

DBSCAN
DBSCAN is a density-based algorithm. Density = number of points within a specified radius r (Eps)
● A point is a core point if it has at least a specified number of points (MinPts) within Eps. These are points that are in the interior of a cluster
● A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
● A noise point is any point that is not a core point or a border point

DBSCAN: Core, Border, and Noise points

DBSCAN
Two parameters (Eps and MinPts):
● $\varepsilon$ (Eps): maximum radius of the neighbourhood
● MinPts: minimum number of points in an Eps-neighbourhood of that point
● $N_\varepsilon(p) = \{\, q \in D \mid \mathrm{dist}(p, q) \le \varepsilon \,\}$
Directly density-reachable: a point p is directly density-reachable from a point q wrt. $\varepsilon$, MinPts if
1) $p \in N_\varepsilon(q)$
2) core point condition: $|N_\varepsilon(q)| \ge \mathrm{MinPts}$
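The definitions above translate almost directly into code. The following is a brute-force sketch under stated assumptions: Euclidean distance, points stored as tuples, and hypothetical helper names (`region_query`, `is_core`, `directly_density_reachable`, `classify`) and sample data that are not part of the slides.

```python
import math

def region_query(D, p, eps):
    """N_eps(p): every q in D with dist(p, q) <= eps (p itself included)."""
    return [q for q in D if math.dist(p, q) <= eps]

def is_core(D, q, eps, min_pts):
    """Core point condition: |N_eps(q)| >= MinPts."""
    return len(region_query(D, q, eps)) >= min_pts

def directly_density_reachable(D, p, q, eps, min_pts):
    """True if p lies in N_eps(q) and q is a core point."""
    return math.dist(p, q) <= eps and is_core(D, q, eps, min_pts)

def classify(D, p, eps, min_pts):
    """Label p as 'core', 'border', or 'noise'."""
    if is_core(D, p, eps, min_pts):
        return "core"
    if any(is_core(D, q, eps, min_pts) for q in region_query(D, p, eps)):
        return "border"
    return "noise"

# Hypothetical data: a small square, one point on its edge, one far away.
D = [(0, 0), (0, 1), (1, 0), (1, 1), (2.5, 1), (10, 10)]
eps, min_pts = 1.5, 4

print(classify(D, (1, 1), eps, min_pts))     # core
print(classify(D, (2.5, 1), eps, min_pts))   # border
print(classify(D, (10, 10), eps, min_pts))   # noise
# The relation is not symmetric: (2.5, 1) is reachable from the core
# point (1, 1), but (1, 1) is not reachable from the non-core (2.5, 1).
print(directly_density_reachable(D, (2.5, 1), (1, 1), eps, min_pts))  # True
print(directly_density_reachable(D, (1, 1), (2.5, 1), eps, min_pts))  # False
```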

Density-Reachable and Density-Connected (w.r.t. Eps, MinPts)
● Let p be a core point; then every point in its Eps-neighborhood is said to be directly density-reachable from p.
● A point p is density-reachable from a core point q if there is a chain of points $p_1, \dots, p_n$ with $p_1 = q$, $p_n = p$, such that each $p_{i+1}$ is directly density-reachable from $p_i$.
● A point p is density-connected to a point q if there is a point o such that both p and q are density-reachable from o.
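Density-reachability is a chain of directly density-reachable steps, so it can be computed with a breadth-first search that only expands through core points. The sketch below is one way to do this, not code from the slides; the function names, the choice to count a core point as reachable from itself, and the sample data are assumptions.

```python
import math
from collections import deque

def neighbors(D, p, eps):
    """Eps-neighborhood of p in D (p itself included)."""
    return [q for q in D if math.dist(p, q) <= eps]

def density_reachable_from(D, q, eps, min_pts):
    """Set of points density-reachable from q; empty if q is not a core point."""
    if len(neighbors(D, q, eps)) < min_pts:
        return set()
    reached = {q}
    queue = deque([q])
    while queue:
        cur = queue.popleft()
        nbrs = neighbors(D, cur, eps)
        if len(nbrs) < min_pts:
            continue  # border point: reached, but the chain cannot pass through it
        for n in nbrs:
            if n not in reached:
                reached.add(n)
                queue.append(n)
    return reached

def density_connected(D, p, q, eps, min_pts):
    """True if some point o has both p and q density-reachable from it."""
    for o in D:
        reached = density_reachable_from(D, o, eps, min_pts)
        if p in reached and q in reached:
            return True
    return False

# Hypothetical data: five points on a line plus one isolated point.
D = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (9, 9)]
eps, min_pts = 1, 3

print((0, 0) in density_reachable_from(D, (2, 0), eps, min_pts))  # True
print(density_reachable_from(D, (4, 0), eps, min_pts))            # set()
print(density_connected(D, (0, 0), (4, 0), eps, min_pts))         # True
print(density_connected(D, (0, 0), (9, 9), eps, min_pts))         # False
```

The two end points are border points, so neither is density-reachable from the other, yet they are density-connected through the core points between them.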

DBSCAN: Large Eps
[Figure panels: Original Points; Point types: core, border and noise]

DBSCAN: Optimal Eps
[Figure panels: Original Points; Clusters]

Determining Eps and MinPts
● The idea is that for points in a cluster, their k-th nearest neighbors are at roughly the same distance
● Noise points have their k-th nearest neighbor at a farther distance
● So, plot the sorted distance of every point to its k-th nearest neighbor (e.g., k=4)
Thus, eps=10 in the plotted example.
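The sorted k-distance values can be computed with a brute-force pass over the data. This is a sketch under assumptions: the name `k_distances`, Euclidean distance, and the sample points are illustrative, and plotting is left out so that the block needs only the standard library.

```python
import math

def k_distances(points, k):
    """Sorted distance from every point to its k-th nearest neighbor."""
    result = []
    for i, p in enumerate(points):
        # Distances to all other points, excluding p itself by index.
        dists = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        result.append(dists[k - 1])
    return sorted(result)

# Hypothetical data: a tight group of five points and one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
for d in k_distances(points, k=4):
    print(round(d, 2))
```

In this toy data the last value is much larger than the rest; in a plot, a sharp bend of that kind is where Eps would be read off.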

DBSCAN: Algorithm
Let ClusterCount=0. For every point p:
1. If p is not a core point, assign a null label to it [e.g., zero]
2. If p is a core point, a new cluster is formed [with label ClusterCount := ClusterCount+1]. Then find all points density-reachable from p and classify them in the cluster. [Reassign the zero labels but not the others]
Repeat this process until all of the points have been visited.
Since all the zero labels of border points have been reassigned in step 2, the remaining points with a zero label are noise.
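The steps above can be sketched as follows. This is an illustrative brute-force version, not the original authors' implementation: it assumes Euclidean distance, uses `None` to mean "not yet visited" (so that points already placed in a cluster are skipped), and the names `dbscan`, `region_query`, and the sample data are made up for the example.

```python
import math
from collections import deque

NOISE = 0  # the "null label" from the slide

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (i itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None = not yet visited
    cluster_count = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already in a cluster or already marked with the null label
        nbrs = region_query(points, i, eps)
        if len(nbrs) < min_pts:
            labels[i] = NOISE  # step 1: not a core point
            continue
        # Step 2: core point, so a new cluster is formed.
        cluster_count += 1
        labels[i] = cluster_count
        queue = deque(nbrs)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_count  # reassign a zero label (border point)
            if labels[j] is not None:
                continue  # other labels are left untouched
            labels[j] = cluster_count
            j_nbrs = region_query(points, j, eps)
            if len(j_nbrs) >= min_pts:
                queue.extend(j_nbrs)  # only core points extend the cluster
    return labels

# Hypothetical data: a border point listed first, two squares, one noise point.
points = [(2.4, 1), (0, 0), (0, 1), (1, 0), (1, 1),
          (5, 5), (5, 6), (6, 5), (6, 6), (10, 0)]
print(dbscan(points, eps=1.5, min_pts=4))
# [1, 1, 1, 1, 1, 2, 2, 2, 2, 0]
```

The first point is initially given the null label and is later pulled into cluster 1 as a border point, which is the reassignment the slide describes.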

DBSCAN: Flaws
● Varying densities
● High-dimensional data
[Figure panels: Original Points; (MinPts=4, Eps=large value); (MinPts=4, Eps=small value; min density increases)]

DBSCAN: Complexity
Time Complexity: $O(n^2)$, since for each point it has to be determined whether it is a core point. This can be reduced to $O(n \log n)$ in lower-dimensional spaces by using efficient data structures (n is the number of objects to be clustered).
Space Complexity: $O(n)$.

Resources
● Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining. Michigan State University. University of Minnesota. http://www-users.cs.umn.edu/~kumar/dmbook/index.php
● http://www.cse.ust.hk/~qyang/337/slides/cluster.ppt
● http://www2.cs.uh.edu/~ceick/ML/Topic9.ppt
● www.cs.uiuc.edu/~hanj and Martin Pfeifle www.dbs.informatik.uni-muenchen.de
● http://www.cs.ucla.edu/classes/spring08/cs240B/notes/clusteringCont.ppt

Questions?
