Data Warehousing and Machine Learning: Density-based clustering (PowerPoint PPT Presentation)


SLIDE 1

Data Warehousing and Machine Learning

Density-based clustering

Thomas D. Nielsen

Aalborg University Department of Computer Science

Spring 2008

Density-based Clustering DWML Spring 2008 1 / 29

SLIDE 2

Density-based Clustering

DBSCAN Idea: identify contiguous regions of high density.

Density-based Clustering DWML Spring 2008 2 / 29

SLIDE 3

Density-based Clustering

Step 1: classification of points

  1. Choose parameters ε, k.
  2. Label as core points: points with at least k other points within distance ε.
  3. Label as border points: points within distance ε of a core point.
  4. Label as isolated points: all remaining points.

Density-based Clustering DWML Spring 2008 3 / 29
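
To make Step 1 concrete, here is a minimal NumPy sketch of the classification. The function name, the brute-force distance computation, and the toy data are my own choices, not from the slides:

```python
import numpy as np

def classify_points(X, eps, k):
    """Label each row of X as 'core', 'border', or 'isolated' (Step 1 of DBSCAN)."""
    n = len(X)
    # Pairwise Euclidean distances (brute force; fine for small n).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Core: at least k OTHER points within distance eps (exclude the point itself).
    neighbor_counts = (dists <= eps).sum(axis=1) - 1
    is_core = neighbor_counts >= k
    labels = np.full(n, 'isolated', dtype=object)
    labels[is_core] = 'core'
    # Border: non-core points within eps of some core point.
    near_core = (dists[:, is_core] <= eps).any(axis=1)
    labels[~is_core & near_core] = 'border'
    return labels

# Example: two dense blobs plus one far-away outlier.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),
               rng.normal(5, 0.3, (20, 2)),
               [[10.0, 10.0]]])
print(classify_points(X, eps=1.0, k=3))
```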

SLIDE 4

Density-based Clustering

Step 2: Define Connectivity

  1. Two core points are directly connected if they are within distance ε of each other.
  2. Each border point is directly connected to one randomly chosen core point within distance ε.
  3. Each connected component of the directly-connected relation (with at least one core point) is a cluster.

Density-based Clustering DWML Spring 2008 4 / 29
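
Step 2 amounts to a connected-components computation. Below is a hedged end-to-end sketch (the union-find helper and all names are mine; the slides leave the implementation open):

```python
import numpy as np

def dbscan_clusters(X, eps, k):
    """Steps 1 and 2: classify points, connect them, and return cluster labels
    (label -1 marks isolated points / components without a core point)."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    is_core = (dists <= eps).sum(axis=1) - 1 >= k

    parent = list(range(n))                      # union-find over point indices
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    def union(i, j):
        parent[find(i)] = find(j)

    core_idx = np.flatnonzero(is_core)
    # Directly connect core points within eps of each other.
    for a in core_idx:
        for b in core_idx:
            if a < b and dists[a, b] <= eps:
                union(a, b)
    # Connect each border point to one core point within eps
    # (the slide says "randomly chosen"; we simply take the first).
    for p in range(n):
        if not is_core[p]:
            near = core_idx[dists[p, core_idx] <= eps]
            if len(near) > 0:
                union(p, near[0])
    # Components containing at least one core point are the clusters.
    roots = [find(p) for p in range(n)]
    has_core = {}
    for p in range(n):
        has_core[roots[p]] = has_core.get(roots[p], False) or bool(is_core[p])
    labels, comp = np.full(n, -1), {}
    for p in range(n):
        if has_core[roots[p]]:
            labels[p] = comp.setdefault(roots[p], len(comp))
    return labels
```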

SLIDE 5

Density-based Clustering

Setting k and ε

For fixed k there exist heuristic methods for choosing ε based on the distribution, in the data, of the distance to the k-th nearest neighbor.

Pros and Cons

  + Can detect clusters of highly irregular shape
  + Robust with respect to outliers
  − Difficulties with clusters of varying density
  − Parameters k, ε must be suitably chosen

Density-based Clustering DWML Spring 2008 5 / 29
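
One common version of this heuristic (my reading of the slide, which does not spell it out) is to sort the distances to each point's k-th nearest neighbor and pick ε near the "knee" of the resulting curve. A sketch:

```python
import numpy as np

def kth_neighbor_distances(X, k):
    """Distance from each point to its k-th nearest neighbor, sorted ascending.
    Plotting this curve and picking eps at the 'knee' is the usual heuristic."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    dists.sort(axis=1)               # column 0 is the distance to the point itself (0)
    return np.sort(dists[:, k])      # k-th OTHER neighbor

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
curve = kth_neighbor_distances(X, k=4)
# A crude knee estimate: the point of maximum increase along the sorted curve.
eps_guess = curve[np.argmax(np.diff(curve))]
print(f"suggested eps ≈ {eps_guess:.3f}")
```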

SLIDE 6

EM Clustering

Probabilistic Model for Clustering

Assumption:

  • Data a_1, ..., a_N are generated by a mixture of k probability distributions P_1, ..., P_k, i.e.

        P(a) = \sum_{i=1}^{k} \lambda_i P_i(a)    (with \sum_{i=1}^{k} \lambda_i = 1)

  • Cluster label of an instance = (index of the) distribution from which the instance was drawn
  • The P_i are not (exactly) known

Density-based Clustering DWML Spring 2008 6 / 29
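
To make the generative assumption concrete, here is a small sketch (all names mine) that samples from a one-dimensional mixture of k Gaussians: first draw a component index with probabilities λ, then draw from that component.

```python
import numpy as np

rng = np.random.default_rng(42)

# Mixture of k = 3 one-dimensional Gaussians: weights, means, standard deviations.
lam = np.array([0.2, 0.3, 0.5])     # must sum to 1
mu  = np.array([-3.0, 0.0, 4.0])
sig = np.array([0.5, 1.0, 1.5])

# Generate N instances: the hidden cluster label is the component index.
N = 1000
z = rng.choice(len(lam), size=N, p=lam)     # which P_i generated each instance
a = rng.normal(mu[z], sig[z])               # the observed data a_1, ..., a_N

# Mixture density P(a) = sum_i lam_i * P_i(a) at a single point.
def mixture_density(x):
    comp = np.exp(-(x - mu)**2 / (2 * sig**2)) / (np.sqrt(2 * np.pi) * sig)
    return float(np.sum(lam * comp))

print(mixture_density(0.0))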

SLIDE 7

EM Clustering

Clustering principle

Try to find the most likely explanation of the data, i.e.

  • determine (parameters of) P_1, ..., P_k and λ_1, ..., λ_k such that the likelihood function

        P(a_1, ..., a_N | P_1, ..., P_k, λ_1, ..., λ_k) = \prod_{j=1}^{N} P(a_j)

    is maximized;

  • instance a is assigned to cluster j = \arg\max_{i=1,...,k} λ_i P_i(a).

Density-based Clustering DWML Spring 2008 7 / 29
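
The likelihood and the assignment rule translate directly into code; a minimal sketch, continuing the 1-D Gaussian example above (names mine):

```python
import numpy as np

def log_likelihood(a, lam, mu, sig):
    """log P(a_1,...,a_N | parameters) = sum_j log sum_i lam_i P_i(a_j)."""
    # densities[j, i] = P_i(a_j) for 1-D Gaussian component i
    densities = (np.exp(-(a[:, None] - mu)**2 / (2 * sig**2))
                 / (np.sqrt(2 * np.pi) * sig))
    return float(np.sum(np.log(densities @ lam)))

def assign(a, lam, mu, sig):
    """Assign each instance to cluster argmax_i lam_i P_i(a)."""
    densities = (np.exp(-(a[:, None] - mu)**2 / (2 * sig**2))
                 / (np.sqrt(2 * np.pi) * sig))
    return np.argmax(lam * densities, axis=1)

lam = np.array([0.2, 0.3, 0.5])
mu  = np.array([-3.0, 0.0, 4.0])
sig = np.array([0.5, 1.0, 1.5])
a = np.array([-2.9, 0.3, 4.5])
print(log_likelihood(a, lam, mu, sig), assign(a, lam, mu, sig))
```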

SLIDE 8

EM Clustering

Standard normal distribution

[Figure: density curve of the standard normal distribution, x from −5 to 5, peak density ≈ 0.4 at x = 0]

A standard normal distribution (normal distribution with mean µ = 0 and standard deviation σ = 1):

        P(x | µ, σ) = \frac{1}{\sqrt{2\pi}\,σ} \exp\left(-\frac{(x-µ)^2}{2σ^2}\right)

Density-based Clustering DWML Spring 2008 8 / 29

SLIDE 9

EM Clustering

Bivariate normal distribution

[Figure: contour and surface plots of a bivariate normal density over x and y]

A bivariate normal distribution with mean µ and covariance matrix Σ:

        µ = \begin{pmatrix} 1 \\ 2 \end{pmatrix},    Σ = \begin{pmatrix} 2 & 0.5 \\ 0.5 & 0.5 \end{pmatrix}

        P(x | µ, Σ) = \frac{1}{(2\pi)^{N/2} |Σ|^{1/2}} \exp\left(-\frac{1}{2}(x-µ)^T Σ^{-1} (x-µ)\right)

Density-based Clustering DWML Spring 2008 9 / 29
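
Evaluating this density numerically takes only a few lines; a sketch (for real applications one would typically reach for scipy.stats.multivariate_normal instead):

```python
import numpy as np

def mvn_density(x, mu, Sigma):
    """Multivariate normal density P(x | mu, Sigma) for an N-dimensional x."""
    N = len(mu)
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (N / 2) * np.linalg.det(Sigma) ** 0.5)
    # Quadratic form diff^T Sigma^{-1} diff via a linear solve (no explicit inverse).
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

# The example from the slide:
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 0.5]])
print(mvn_density(np.array([1.0, 2.0]), mu, Sigma))  # density at the mean
```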

SLIDE 10

EM Clustering

Mixture of Gaussians

Mixture of three Gaussian distributions with weights λ = (0.2, 0.3, 0.5), each component of the form

        P_i(x | µ_i, Σ_i) = \frac{1}{(2\pi)^{N/2} |Σ_i|^{1/2}} \exp\left(-\frac{1}{2}(x-µ_i)^T Σ_i^{-1} (x-µ_i)\right)

Density-based Clustering DWML Spring 2008 10 / 29

SLIDE 11

EM Clustering

Mixture Model → Data

[Figure sequence: equi-potential lines and centers of the mixture components; a sample from the mixture; the data we see]

Density-based Clustering DWML Spring 2008 11 / 29

SLIDE 12

EM Clustering

Data → Clustering

  • Fit a mixture of three Gaussians to the data
  • Assign instances to their most probable mixture components

Density-based Clustering DWML Spring 2008 12 / 29

SLIDE 13

EM Clustering

Gaussian Mixture Models

Each mixture component is a Gaussian distribution. A Gaussian distribution is determined by

  • a mean vector (“center”)
  • a covariance matrix

Usually all components are assumed to have the same covariance matrix; to fit the mixture to data, one then needs to find the weights and mean vectors of the mixture components. If the covariance matrix is a diagonal matrix with constant entries on the diagonal, then fitting the Gaussian mixture model is equivalent to minimizing the sum of squared errors (the within-cluster point scatter), i.e. the k-means algorithm effectively fits such a Gaussian mixture model.

Density-based Clustering DWML Spring 2008 13 / 29
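
To illustrate the claimed correspondence, a hedged experiment using scikit-learn (assuming it is installed; it is not part of the course material). Note that covariance_type='spherical' gives one isotropic variance per component, which is close to, but not exactly, the slide's shared-variance assumption; on well-separated blobs the two methods typically agree almost perfectly:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.5, (100, 2)) for m in ([0, 0], [4, 0], [2, 4])])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
gm = GaussianMixture(n_components=3, covariance_type='spherical',
                     random_state=0).fit(X)

# Compare partitions: fraction of point pairs on which the two clusterings agree.
km_l, gm_l = km.labels_, gm.predict(X)
same_km = km_l[:, None] == km_l[None, :]
same_gm = gm_l[:, None] == gm_l[None, :]
print("pairwise agreement:", (same_km == same_gm).mean())
```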

SLIDE 14

EM Clustering

Naive Bayes Mixture Model (for discrete attributes)

Each mixture component is a distribution in which the attributes are independent given the cluster variable C:

[Figure: naive Bayes structure with class node C and attribute nodes A1, ..., A7]

Model determined by parameters:

  • λ_1, ..., λ_k (prior probabilities of the class variable)
  • P(A_j = a | C = c)  for a ∈ States(A_j), c ∈ States(C)

Fitting the model: finding parameters that maximize the probability of the observed instances.

Density-based Clustering DWML Spring 2008 14 / 29

SLIDE 15

EM Clustering

Clustering as fitting Incomplete Data

Clustering data as incompletely labeled data:

  SL   SW   PL   PW   Cluster
  5.1  3.5  1.4  0.2  ?
  4.9  3.0  1.4  0.2  ?
  6.3  2.9  6.0  2.1  ?
  6.3  2.5  4.9  1.5  ?
  ...  ...  ...  ...  ...

  SubAllCap  TrustSend  InvRet  ...  B'zambia'  Cluster
  y          n          n       ...  n          ?
  n          n          n       ...  n          ?
  n          y          n       ...  n          ?
  n          n          n       ...  n          ?
  ...        ...        ...     ...  ...        ...

Density-based Clustering DWML Spring 2008 15 / 29

SLIDE 16

EM Clustering

Given:

  • incomplete data with unobserved Cluster variable
  • a mixture model for the joint distribution of the mixture component (= cluster variable) and the attributes; the model specifies the number of states of the cluster variable

Want:

  • the (parameters of the) mixture distribution that best fits the data
  • (as a side effect) the index of the most likely mixture component for each instance

Can use the EM algorithm for parameter estimation from incomplete data! When applied to a Gaussian mixture model, EM proceeds in a similar way as k-means.

Density-based Clustering DWML Spring 2008 16 / 29
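
For the Gaussian case, a compact EM sketch (entirely my own minimal implementation, shown only to illustrate the E-step/M-step structure): the E-step computes soft responsibilities, the M-step re-estimates weights, means and variances, much like the assign/update loop of k-means but with soft assignments.

```python
import numpy as np

def em_gmm_1d(a, k, iters=50, seed=0):
    """EM for a 1-D Gaussian mixture with k components (minimal sketch)."""
    rng = np.random.default_rng(seed)
    N = len(a)
    lam = np.full(k, 1.0 / k)
    mu = rng.choice(a, size=k, replace=False)   # init means at random data points
    sig = np.full(k, a.std())
    for _ in range(iters):
        # E-step: responsibilities r[j, i] = P(component i | a_j)
        dens = (np.exp(-(a[:, None] - mu)**2 / (2 * sig**2))
                / (np.sqrt(2 * np.pi) * sig))
        r = lam * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted re-estimation of the parameters
        Nk = r.sum(axis=0)
        lam = Nk / N
        mu = (r * a[:, None]).sum(axis=0) / Nk
        sig = np.sqrt((r * (a[:, None] - mu)**2).sum(axis=0) / Nk)
    return lam, mu, sig

rng = np.random.default_rng(1)
a = np.concatenate([rng.normal(-3, 0.5, 200), rng.normal(2, 1.0, 300)])
print(em_gmm_1d(a, k=2))
```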

SLIDE 17

Clustering: Evaluation

Scoring a Clustering

The goal of a clustering algorithm is to find a clustering that maximizes a given score function. These score functions are often highly domain- and problem-specific. The k-means algorithm tries to minimize the sum of squared errors

        W(S_1, ..., S_k) := \sum_{i=1}^{k} \sum_{s, s' \in S_i} d(s, s')

(but is not guaranteed to produce a global minimum). In clustering there is no gold standard (unlike in classification, where a classifier with 100% accuracy is optimal according to every evaluation criterion)!

Density-based Clustering DWML Spring 2008 17 / 29
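
The score W translates directly into code; a sketch, assuming squared Euclidean distance for d (matching the sum-of-squared-errors reading):

```python
import numpy as np

def within_cluster_scatter(clusters):
    """W(S_1,...,S_k) = sum_i sum_{s,s' in S_i} d(s, s'), with d = squared Euclidean."""
    W = 0.0
    for S in clusters:                       # each S is an (n_i, d) array
        diffs = S[:, None, :] - S[None, :, :]
        W += np.sum(diffs**2)                # sums d(s, s') over all ordered pairs
    return W

rng = np.random.default_rng(0)
clusters = [rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (40, 2))]
print(within_cluster_scatter(clusters))
```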

SLIDE 18

Clustering: Evaluation

Axioms for Clustering

Try to specify, on an abstract level, properties that a clustering algorithm should have. For example: for any cluster S_i, points s, s' ∈ S_i, and s'' ∉ S_i:

        d(s, s') < d(s, s'')

This is intuitive in many cases, but not fulfilled e.g. by the “correct” clustering of two concentric circles. Kleinberg [2002] shows that there does not exist a clustering method that simultaneously satisfies three seemingly intuitive axioms.

Density-based Clustering DWML Spring 2008 18 / 29

SLIDE 19

Data Warehousing and Machine Learning

Association Rules Thomas D. Nielsen

Aalborg University Department of Computer Science

Spring 2008

Association Rules DWML Spring 2008 19 / 29

SLIDE 20

Association rules

Market basket data

A database:

  Transaction   Items bought
  1             Beer, Soap, Milk, Butter
  2             Beer, Chips, Butter
  3             Milk, Spaghetti, Butter, Tomatos
  ...           ...

The database consists of a list of itemsets, i.e. subsets of a set \mathcal{I} of possible items. An alternative representation would be a (sparse!) 0/1-matrix:

  Transaction   Aalborg Aquavit   ...   Beer   ...   Chips   ...   Milk   ...   ZZtop CD
  1                               ...   1      ...          ...   1      ...
  2                               ...   1      ...   1      ...          ...
  3                               ...          ...          ...   1      ...
  ...           ...               ...   ...    ...   ...    ...   ...    ...

Association Rules DWML Spring 2008 20 / 29

SLIDE 21

Association rules

Rule structure

Example: {Beer, Chips} ⇒ {Pizza}.

In general, an association rule is a pattern of the form

        α : {I_{α,1}, ..., I_{α,j}} ⇒ i

where the left-hand side {I_{α,1}, ..., I_{α,j}} ⊆ \mathcal{I} is the antecedent, the right-hand side i ∈ \mathcal{I} is the consequent, and \mathcal{I} is the set of items.

Issues to consider when learning association rules

Two issues to be addressed:

  • What characterizes an “interesting” association rule?
  • The number of possible association rules is huge: k · 2^{k−1}; for 100 items we would get 100 · 2^{99} ≈ 6.4 · 10^{31}.

Association Rules DWML Spring 2008 21 / 29

SLIDE 22

Association rules

Interesting association rules

We would like rules that either:

  • appear often,
  • have a high accuracy,
  • or possibly both.

The frequency of an itemset I ⊆ \mathcal{I} in a database of itemsets (transactions) I_1, I_2, ..., I_N:

        freq(I) := |{ j | I ⊆ I_j }|

Support of a rule α : I ⇒ i (in the given database):

        supp(α) := freq(I ∪ {i}) / N

Confidence (or accuracy) of α:

        conf(α) := freq(I ∪ {i}) / freq(I)

Association Rules DWML Spring 2008 22 / 29
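
freq, supp and conf take only a few lines of Python; a sketch (transactions represented as frozensets, function names mine, using the market basket data from Slide 20):

```python
def freq(itemset, transactions):
    """Number of transactions containing the itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t)

def supp(antecedent, consequent, transactions):
    return freq(set(antecedent) | {consequent}, transactions) / len(transactions)

def conf(antecedent, consequent, transactions):
    return freq(set(antecedent) | {consequent}, transactions) / freq(antecedent, transactions)

db = [frozenset(t) for t in (
    {"Beer", "Soap", "Milk", "Butter"},
    {"Beer", "Chips", "Butter"},
    {"Milk", "Spaghetti", "Butter", "Tomatos"},
)]
print(supp({"Beer"}, "Butter", db))   # 2/3
print(conf({"Beer"}, "Butter", db))   # 1.0
```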

SLIDE 23

Association rules

Support, Confidence

[Table: seven transactions (rows 1–7) over the items Asparagus, Beans, Broccoli, Corn, Pepper, Squash, Tomatos; a 1 marks that the transaction contains the item]

        freq(Asparagus, Beans, Broccoli) = 1
        supp({Asparagus, Beans} ⇒ Broccoli) = 1/7
        conf({Asparagus, Beans} ⇒ Broccoli) = 1/2

        freq(Squash, Tomatos) = 2
        supp({Squash} ⇒ Tomatos) = 2/7
        conf({Squash} ⇒ Tomatos) = 2/3

Association Rules DWML Spring 2008 23 / 29

SLIDE 24

Association rules

Support, Accuracy

Example: The rule {Beer, ZZtop CD, Olives, Milk} ⇒ {Eggs} may have a high confidence (only one transaction contained Beer, ZZtop CD, Olives, Milk, and that transaction also contained Eggs), but it is uninteresting due to its low support.

Example: In fraud detection we would be interested in rules with high confidence even though they have low support.

Association rule mining: Fix s, a ∈ ℝ. Find all association rules with support at least s and accuracy at least a.

Association Rules DWML Spring 2008 24 / 29

SLIDE 25

Association rules

Frequent sets

Fix s ∈ ℝ. Call an itemset I frequent if freq(I) ≥ s.

[Figure: the lattice of all itemsets over \mathcal{I} = {i_1, ..., i_4}, from ∅ at the bottom to {i_1, i_2, i_3, i_4} at the top]

An itemset can only be frequent if all its subsets are frequent.

Association Rules DWML Spring 2008 25 / 29

SLIDE 26

Association rules

The APriori Algorithm

Given a database over \mathcal{I} = {i_1, ..., i_k} and frequency threshold s, compute all frequent itemsets:

  1. Let F_{k−1} be the frequent itemsets of size k − 1.
  2. Construct a set C_k of candidate itemsets of size k from F_{k−1}:
       For all I, I′ ∈ F_{k−1} s.t. |I ∪ I′| = k:
         If J ∈ F_{k−1} for all J ⊆ I ∪ I′ with |J| = k − 1, then C_k := C_k ∪ {I ∪ I′}.
  3. Construct F_k from C_k:
       For all I ∈ C_k:
         If freq(I) ≥ s, then F_k := F_k ∪ {I}.

Association Rules DWML Spring 2008 26 / 29
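
As an illustration, a compact level-wise implementation sketch (itemsets as Python frozensets; the pruning test mirrors step 2 above; all names and the toy database are mine):

```python
from itertools import combinations

def apriori(transactions, s):
    """Return all itemsets with frequency >= s, computed level by level."""
    def freq(I):
        return sum(1 for t in transactions if I <= t)

    items = {i for t in transactions for i in t}
    F = [{frozenset([i]) for i in items if freq(frozenset([i])) >= s}]  # F_1
    while F[-1]:
        prev = F[-1]
        k = len(next(iter(prev))) + 1
        # Candidates: unions of two frequent (k-1)-sets that have size k,
        # kept only if every (k-1)-subset is itself frequent.
        C = set()
        for I in prev:
            for J in prev:
                U = I | J
                if len(U) == k and all(frozenset(sub) in prev
                                       for sub in combinations(U, k - 1)):
                    C.add(U)
        F.append({I for I in C if freq(I) >= s})
    return [I for level in F for I in level]

db = [frozenset(t) for t in (
    {"Asparagus", "Corn", "Tomatos"},
    {"Beans", "Corn", "Squash"},
    {"Asparagus", "Beans", "Corn", "Tomatos"},
)]
print(sorted(apriori(db, s=2), key=len))
```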

SLIDE 27

Association rules

The APriori Algorithm

Given the data below and frequency threshold s = 2:

[Table: the seven-transaction vegetable database from Slide 23]

Frequent 1-item sets (F_1):

        freq(Asparagus) = 4, freq(Beans) = 3, freq(Broccoli) = 2, freq(Corn) = 5,
        freq(Pepper) = 2, freq(Squash) = 3, freq(Tomatos) = 4

Candidate 2-item sets (C_2): all possible combinations.

Frequent 2-item sets (F_2):

[Table: upper-triangular matrix of pairwise frequencies, e.g. freq(Asparagus, Beans) = 2, freq(Asparagus, Broccoli) = 1, freq(Corn, Tomatos) = 3, freq(Squash, Tomatos) = 2; the pairs with frequency ≥ s = 2 are the frequent 2-item sets]

Association Rules DWML Spring 2008 27 / 29

SLIDE 28

Association rules

Frequent sets → Association rules

Let I_1, ..., I_p be the itemsets with frequency at least s. To compute all association rules with support s and accuracy a:

        for j = 1, ..., p:
          for all i ∈ I_j:
            if conf(I_j \ {i} ⇒ i) = freq(I_j) / freq(I_j \ {i}) ≥ a:
              output I_j \ {i} ⇒ i

[Figure: the itemset I_j with its maximal subsets I_j \ {i_1}, ..., I_j \ {i_k}]

Association Rules DWML Spring 2008 28 / 29
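
The extraction loop in code, as a sketch with my own names; the frequent sets fed in are two of those that APriori with s = 2 produces on the toy database above (hand-picked for brevity):

```python
def rules_from_frequent_sets(frequent, transactions, a):
    """For each frequent itemset I_j and item i in it, output I_j \\ {i} => i
    whenever conf = freq(I_j) / freq(I_j \\ {i}) >= a."""
    def freq(I):
        return sum(1 for t in transactions if I <= t)

    for Ij in frequent:
        if len(Ij) < 2:
            continue                      # need a non-empty antecedent
        for i in Ij:
            ante = Ij - {i}
            confidence = freq(Ij) / freq(ante)
            if confidence >= a:
                yield ante, i, confidence

db = [frozenset(t) for t in (
    {"Asparagus", "Corn", "Tomatos"},
    {"Beans", "Corn", "Squash"},
    {"Asparagus", "Beans", "Corn", "Tomatos"},
)]
frequent = [frozenset(s) for s in ({"Corn", "Tomatos"},
                                   {"Asparagus", "Corn", "Tomatos"})]
for ante, i, c in rules_from_frequent_sets(frequent, db, a=0.9):
    print(set(ante), "=>", i, f"(conf={c:.2f})")
```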

SLIDE 29

Association rules

Not all rules are interesting

[Table: the seven-transaction vegetable database from Slide 23]

Consider the rule {Broccoli} ⇒ Corn:

        conf({Broccoli} ⇒ Corn) = freq(Broccoli, Corn) / freq(Broccoli) = 1/2

But freq(Corn) = 5, so Corn appears in most transactions anyway and the rule tells us little. Instead you may consider the confidence ratio:

        conf-ratio({Broccoli} ⇒ Corn) = freq(Broccoli, Corn) / (freq(Broccoli) · freq(Corn))

Association Rules DWML Spring 2008 29 / 29
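
The confidence ratio in code (a sketch; note that multiplying this quantity by the number of transactions N gives what the association-rule literature usually calls "lift"). The stand-in database below merely reproduces the slide's counts, since the original table is not fully recoverable:

```python
def conf_ratio(antecedent, consequent, transactions):
    """conf-ratio(I => i) = freq(I u {i}) / (freq(I) * freq({i}))."""
    def freq(I):
        I = frozenset(I)
        return sum(1 for t in transactions if I <= t)
    return (freq(set(antecedent) | {consequent})
            / (freq(antecedent) * freq({consequent})))

# Stand-in: freq(Broccoli) = 2, freq(Corn) = 5, freq(Broccoli, Corn) = 1.
db = [frozenset(s) for s in ({"Broccoli", "Corn"}, {"Broccoli"},
                             {"Corn"}, {"Corn"}, {"Corn"}, {"Corn"}, set())]
print(conf_ratio({"Broccoli"}, "Corn", db))   # 1 / (2 * 5) = 0.1
```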