
Data Warehousing and Machine Learning: Density-based Clustering



  1. Data Warehousing and Machine Learning: Density-based Clustering. Thomas D. Nielsen, Aalborg University, Department of Computer Science, Spring 2008.

  2. Density-Based Clustering: DBSCAN. Idea: identify contiguous regions of high density.


  3. Density-Based Clustering. Step 1: classification of points.
     1. Choose parameters ε and k.
     2. Label as core points: points with at least k other points within distance ε.
     3. Label as border points: non-core points within distance ε of a core point.
     4. Label as isolated points: all remaining points.
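A minimal sketch of Step 1, assuming Python with NumPy and Euclidean distance (the slides do not name a language, and the names classify_points, eps, and k are illustrative):

```python
import numpy as np

def classify_points(X, eps, k):
    """Label each row of X as 'core', 'border', or 'isolated' (DBSCAN Step 1)."""
    n = len(X)
    # Pairwise Euclidean distances between all points.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    labels = np.full(n, "isolated", dtype=object)
    # Core points: at least k *other* points within distance eps.
    neighbor_counts = (dists <= eps).sum(axis=1) - 1   # subtract the point itself
    core = neighbor_counts >= k
    labels[core] = "core"
    # Border points: non-core points within eps of some core point.
    for i in np.where(~core)[0]:
        if np.any(core & (dists[i] <= eps)):
            labels[i] = "border"
    return labels
```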


  4. Density-Based Clustering. Step 2: define connectivity.
     1. Two core points are directly connected if they are within distance ε of each other.
     2. Each border point is directly connected to one randomly chosen core point within distance ε.
     3. Each connected component of the directly-connected relation (with at least one core point) is a cluster.
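Continuing the sketch, Step 2 amounts to a connected-components computation over the directly-connected relation. The function below is a hedged illustration that reuses classify_points from the Step 1 sketch; it attaches each border point to its nearest core point rather than a randomly chosen one:

```python
import numpy as np

def build_clusters(X, eps, k):
    """DBSCAN Step 2: grow clusters as connected components of core points,
    then attach each border point to one core point within eps."""
    labels = classify_points(X, eps, k)          # from the Step 1 sketch above
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    cluster = np.full(len(X), -1)                # -1 = isolated / noise
    current = 0
    for i in np.where(labels == "core")[0]:
        if cluster[i] != -1:
            continue
        # Flood-fill over directly connected core points.
        queue = [i]
        cluster[i] = current
        while queue:
            p = queue.pop()
            for q in np.where((labels == "core") & (dists[p] <= eps))[0]:
                if cluster[q] == -1:
                    cluster[q] = current
                    queue.append(q)
        current += 1
    # Attach each border point to one (here: the nearest) core point within eps.
    for i in np.where(labels == "border")[0]:
        core_idx = np.where(labels == "core")[0]
        nearest = core_idx[np.argmin(dists[i, core_idx])]
        cluster[i] = cluster[nearest]
    return cluster
```

In practice one would normally use a library implementation such as sklearn.cluster.DBSCAN, which follows the same core/border/noise logic (note that its min_samples parameter counts the point itself, so it corresponds roughly to k + 1 here).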

  5. Density-Based Clustering. Setting k and ε: for fixed k there are heuristic methods for choosing ε, based on the distribution over the data of the distance to the k-th nearest neighbor.
     Pros and cons:
     + Can detect clusters of highly irregular shape.
     + Robust with respect to outliers.
     - Difficulties with clusters of varying density.
     - The parameters k and ε must be suitably chosen.
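One common form of this heuristic is the sorted k-distance plot: compute each point's distance to its k-th nearest neighbor, sort the values, and read a candidate ε off the "knee" where the curve bends sharply. A hedged sketch (kth_neighbor_distances is an illustrative name, and the knee is identified visually):

```python
import numpy as np

def kth_neighbor_distances(X, k):
    """Distance from each point to its k-th nearest neighbor (excluding itself)."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    dists.sort(axis=1)              # column 0 is the point itself (distance 0)
    return np.sort(dists[:, k])     # sorted k-dist values; plot and look for a knee

# A candidate eps can be read from the knee of this curve, e.g. with matplotlib:
# import matplotlib.pyplot as plt; plt.plot(kth_neighbor_distances(X, k=4)); plt.show()
```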

  6. EM Clustering. A probabilistic model for clustering. Assumption:
     • The data a_1, ..., a_N is generated by a mixture of k probability distributions P_1, ..., P_k, i.e. $P(a) = \sum_{i=1}^{k} \lambda_i P_i(a)$ with $\sum_{i=1}^{k} \lambda_i = 1$.
     • The cluster label of an instance is the (index of the) distribution from which the instance was drawn.
     • The P_i are not (exactly) known.
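To illustrate the generative assumption, the sketch below samples data from a mixture of three one-dimensional Gaussians, where each instance's hidden cluster label is the index of the component that generated it. The particular means, standard deviations, and weights are made up for illustration and are not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
lambdas = np.array([0.2, 0.3, 0.5])      # mixture weights, summing to 1
means   = np.array([-2.0, 0.0, 3.0])     # illustrative component parameters
sigmas  = np.array([0.5, 1.0, 0.8])

N = 1000
labels = rng.choice(len(lambdas), size=N, p=lambdas)    # hidden cluster labels
data = rng.normal(means[labels], sigmas[labels])        # observed instances a_1..a_N
# In the clustering setting only `data` is observed; `labels` is what we try to recover.
```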

  7. EM Clustering. Clustering principle: try to find the most likely explanation of the data, i.e.
     • determine (the parameters of) P_1, ..., P_k and λ_1, ..., λ_k such that
     • the likelihood function $P(a_1, \ldots, a_N \mid P_1, \ldots, P_k, \lambda_1, \ldots, \lambda_k) = \prod_{j=1}^{N} P(a_j)$ is maximized.
     • Instance a is assigned to cluster $j = \arg\max_{i=1,\ldots,k} \lambda_i P_i(a)$.
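Given fixed parameters, the likelihood and the assignment rule translate directly into code. A sketch for the 1D Gaussian mixture above (mixture_pdf, log_likelihood, and assign are illustrative names; scipy.stats.norm is an assumed dependency):

```python
import numpy as np
from scipy.stats import norm

def mixture_pdf(a, lambdas, means, sigmas):
    """P(a) = sum_i lambda_i * P_i(a) for a 1D Gaussian mixture."""
    return sum(l * norm.pdf(a, m, s) for l, m, s in zip(lambdas, means, sigmas))

def log_likelihood(data, lambdas, means, sigmas):
    """log prod_j P(a_j) = sum_j log P(a_j)."""
    return np.sum(np.log(mixture_pdf(data, lambdas, means, sigmas)))

def assign(a, lambdas, means, sigmas):
    """Assign instance a to cluster j = argmax_i lambda_i * P_i(a)."""
    scores = [l * norm.pdf(a, m, s) for l, m, s in zip(lambdas, means, sigmas)]
    return int(np.argmax(scores))
```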

  8. EM Clustering. Standard normal distribution.
     [Figure: probability density of the standard normal distribution for x in [-5, 5].]
     A standard normal distribution (normal distribution with mean µ = 0 and standard deviation σ = 1):
     $P(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
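The density formula translates directly into code; a small sketch with a cross-check against scipy.stats.norm (the cross-check is just a sanity test, not part of the slide):

```python
import numpy as np
from scipy.stats import norm

def normal_pdf(x, mu=0.0, sigma=1.0):
    """P(x | mu, sigma) = 1 / (sqrt(2 pi) sigma) * exp(-(x - mu)^2 / (2 sigma^2))."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

x = np.linspace(-5, 5, 11)
assert np.allclose(normal_pdf(x), norm.pdf(x))   # matches the library implementation
```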

  9. EM Clustering. Bivariate normal distribution.
     [Figure: surface and contour plots of a bivariate normal density over x and y.]
     A bivariate normal distribution with mean $\mu = \begin{pmatrix} 2 \\ 2 \end{pmatrix}$ and covariance matrix $\Sigma = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 0.5 \end{pmatrix}$:
     $P(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{N/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$
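The multivariate density can be evaluated the same way. A sketch using the µ and Σ as reconstructed above, with scipy's multivariate_normal as a cross-check (mvn_pdf is an illustrative name):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_pdf(x, mu, Sigma):
    """P(x | mu, Sigma) for an N-dimensional Gaussian."""
    N = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (N / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm_const

mu = np.array([2.0, 2.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 0.5]])
x = np.array([1.0, 2.5])
assert np.isclose(mvn_pdf(x, mu, Sigma), multivariate_normal(mu, Sigma).pdf(x))
```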

  10. EM Clustering. Mixture of Gaussians: a mixture of three Gaussian distributions with weights λ = (0.2, 0.3, 0.5), each component being a bivariate normal density
      $P_i(\mathbf{x} \mid \boldsymbol{\mu}_i, \Sigma_i) = \frac{1}{(2\pi)^{N/2} |\Sigma_i|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^T \Sigma_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)\right)$
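To reproduce a plot of such a mixture, the density $\sum_i \lambda_i P_i(\mathbf{x})$ can be evaluated on a grid. In the sketch below only the weights (0.2, 0.3, 0.5) come from the slide; the component means and covariances are made up for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

weights = [0.2, 0.3, 0.5]                                # from the slide
means   = [np.array([0.0, 0.0]), np.array([3.0, 1.0]),   # illustrative parameters
           np.array([1.0, 4.0])]
covs    = [np.eye(2), np.diag([0.5, 1.5]), np.eye(2) * 0.8]

xs, ys = np.meshgrid(np.linspace(-3, 6, 200), np.linspace(-3, 7, 200))
grid = np.dstack([xs, ys])
density = sum(w * multivariate_normal(m, c).pdf(grid)
              for w, m, c in zip(weights, means, covs))
# `density` can now be visualized, e.g. with plt.contour(xs, ys, density).
```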


  11. EM Clustering. Mixture model → data: equipotential lines and centers of the mixture components; a sample drawn from the mixture; the data we actually see (the sample without its component labels).


  12. EM Clustering. Data → clustering: fit a mixture of three Gaussians to the data, then assign each instance to its most probable mixture component.
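A hedged sketch of this fit-then-assign procedure using scikit-learn's GaussianMixture, which estimates the parameters with EM. The library choice and the synthetic `data` are assumptions; the slides only specify fitting a mixture of three Gaussians and assigning instances to their most probable components:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic (N, 2) data drawn from three well-separated blobs, for illustration only.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal([0, 0], 0.8, size=(100, 2)),
                  rng.normal([3, 3], 0.8, size=(100, 2)),
                  rng.normal([0, 4], 0.8, size=(100, 2))])

gmm = GaussianMixture(n_components=3, random_state=0)
gmm.fit(data)                         # EM: estimate weights, means, covariances
clusters = gmm.predict(data)          # assign each instance to its most probable component
posteriors = gmm.predict_proba(data)  # soft responsibilities, if needed
```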

  13. EM Clustering. Gaussian mixture models: each mixture component is a Gaussian distribution, determined by
      • a mean vector (the “center”), and
      • a covariance matrix.
      Usually all components are assumed to have the same covariance matrix; fitting the mixture to data then amounts to finding the weights and mean vectors of the mixture components. If the covariance matrix is a diagonal matrix with constant entries on the diagonal, then fitting the Gaussian mixture model is equivalent to minimizing the sum of squared errors (the within-cluster point scatter), i.e. the k-means algorithm effectively fits such a Gaussian mixture model.
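A small, hedged illustration of this connection: fit a GMM whose components have spherical covariances and compare its hard assignments with those of k-means. Note that scikit-learn's "spherical" covariance type fits one variance per component rather than a single shared σ²I, so this is only an approximation of the slide's assumption and close agreement is typical on well-separated data but not guaranteed:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(c, 0.6, size=(100, 2)) for c in ([0, 0], [3, 3], [0, 4])])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
gmm = GaussianMixture(n_components=3, covariance_type="spherical",
                      random_state=0).fit(data)

# Compare the two hard clusterings (1.0 means identical up to relabelling).
print(adjusted_rand_score(km.labels_, gmm.predict(data)))
```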

  14. EM Clustering. Naive Bayes mixture model (for discrete attributes): each mixture component is a distribution in which the attributes are independent.
      [Figure: a naive Bayes network with class variable C as parent of the attribute nodes A_1, ..., A_7.]
      The model is determined by the parameters:
      • λ_1, ..., λ_k (the prior probabilities of the class variable), and
      • P(A_j = a | C = c) for a ∈ States(A_j) and c ∈ States(C).
      Fitting the model means finding parameters that maximize the probability of the observed instances.
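A minimal sketch of how an instance's cluster posterior is computed in such a model, with binary attributes. The parameter values and the names lambdas, theta, and cluster_posterior are purely illustrative; fitting these parameters to unlabeled data would again be done with EM:

```python
import numpy as np

lambdas = np.array([0.4, 0.6])           # prior probabilities of the class variable (k = 2)
# theta[c][j] = P(A_j = 'y' | C = c) for 7 binary attributes (illustrative values).
theta = np.array([[0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.5],
                  [0.2, 0.6, 0.1, 0.9, 0.4, 0.8, 0.5]])

def cluster_posterior(instance, lambdas, theta):
    """P(C = c | A_1..A_7) proportional to lambda_c * prod_j P(A_j = a_j | C = c)."""
    x = np.array(instance)               # instance encoded as 0/1 values ('n'/'y')
    likelihood = np.prod(theta ** x * (1 - theta) ** (1 - x), axis=1)
    joint = lambdas * likelihood
    return joint / joint.sum()

print(cluster_posterior([1, 0, 1, 0, 1, 0, 1], lambdas, theta))
```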

  15. EM Clustering. Clustering as fitting incomplete data. Clustering data as incomplete labeled data:

      SL    SW    PL    PW    Cluster
      5.1   3.5   1.4   0.2   ?
      4.9   3.0   1.4   0.2   ?
      6.3   2.9   6.0   2.1   ?
      6.3   2.5   4.9   1.5   ?
      ...   ...   ...   ...   ...

      SubAllCap   TrustSend   InvRet   ...   B'zambia'   Cluster
      y           n           n        ...   n           ?
      n           n           n        ...   n           ?
      n           y           n        ...   n           ?
      n           n           n        ...   n           ?
      ...         ...         ...      ...   ...         ...

