SLIDE 1

Jian Pei: CMPT 459/741 Clustering (3) 1

Distance-based Methods: Drawbacks

  • Hard to find clusters with irregular shapes
  • Hard to specify the number of clusters
  • Heuristic: a cluster must be dense
SLIDE 2

How to Find Irregular Clusters?

  • Divide the whole space into many small areas
    – The density of an area can be estimated
    – Areas may or may not be exclusive
    – A dense area is likely in a cluster
  • Start from a dense area, traverse connected dense areas, and discover clusters of irregular shape

SLIDE 3

Directly Density Reachable

  • Parameters
    – Eps: maximum radius of the neighborhood
    – MinPts: minimum number of points in an Eps-neighborhood of that point
    – NEps(p) = {q | dist(p, q) ≤ Eps}
  • Core object p: |NEps(p)| ≥ MinPts
    – A core object is in a dense area
  • Point q is directly density-reachable from p iff q ∈ NEps(p) and p is a core object

[Figure: q in the Eps-neighborhood of core point p; MinPts = 3, Eps = 1 cm]
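These definitions translate almost directly into code. A minimal Python sketch (the function names and toy points are mine, not from the slides):

```python
from math import dist  # Euclidean distance (Python 3.8+)

def eps_neighborhood(points, p, eps):
    """N_Eps(p) = {q | dist(p, q) <= Eps}; note p itself is included."""
    return [q for q in points if dist(p, q) <= eps]

def is_core(points, p, eps, min_pts):
    """Core object p: |N_Eps(p)| >= MinPts."""
    return len(eps_neighborhood(points, p, eps)) >= min_pts

def directly_density_reachable(points, q, p, eps, min_pts):
    """q is directly density-reachable from p iff q is in N_Eps(p)
    and p is a core object."""
    return dist(p, q) <= eps and is_core(points, p, eps, min_pts)
```

Note that the relation is asymmetric: a border point can be directly density-reachable from a core point, but not the other way around.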

SLIDE 4

Density-Based Clustering

  • Density-reachable
    – If p1 → p2, p2 → p3, …, pn−1 → pn are each directly density-reachable steps, then pn is density-reachable from p1
  • Density-connected
    – If points p and q are both density-reachable from some point o, then p and q are density-connected

[Figure: a chain of points from p1 to p illustrating density-reachability, and points p, q density-connected through a common point]
SLIDE 5

DBSCAN

  • A cluster: a maximal set of density-connected points
    – Discovers clusters of arbitrary shape in spatial databases with noise

[Figure: core, border, and outlier points; Eps = 1 cm, MinPts = 5]

SLIDE 6

DBSCAN: the Algorithm

  • Arbitrarily select a point p
  • Retrieve all points density-reachable from p wrt Eps and MinPts
  • If p is a core point, a cluster is formed
  • If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
  • Continue the process until all of the points have been processed
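The steps above can be sketched as a straightforward quadratic-time Python implementation; real systems retrieve neighborhoods through a spatial index. The structure and names below are mine:

```python
from math import dist  # Euclidean distance (Python 3.8+)

def dbscan(points, eps, min_pts):
    """Assign each point a cluster id (0, 1, ...) or -1 for noise.
    A direct transcription of the steps on the slide."""
    labels = {}
    cluster_id = 0
    for i, p in enumerate(points):
        if i in labels:                       # already processed
            continue
        neighbors = [j for j, q in enumerate(points) if dist(p, q) <= eps]
        if len(neighbors) < min_pts:          # border point or noise, for now
            labels[i] = -1
            continue
        labels[i] = cluster_id                # p is a core point: new cluster
        frontier = [j for j in neighbors if j != i]
        while frontier:
            j = frontier.pop()
            if labels.get(j) == -1:           # noise reached from a core point
                labels[j] = cluster_id        # ... becomes a border point
                continue
            if j in labels:
                continue
            labels[j] = cluster_id
            j_nbrs = [k for k, q in enumerate(points) if dist(points[j], q) <= eps]
            if len(j_nbrs) >= min_pts:        # expand only through core points
                frontier.extend(j_nbrs)
        cluster_id += 1
    return labels
```

Border points get a cluster label when a cluster grows into them; points never reached from any core point stay labeled −1 (noise).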

SLIDE 7

Challenges for DBSCAN

  • Different clusters may have very different densities
  • Clusters may be in hierarchies
SLIDE 8

OPTICS: A Cluster-ordering Method

  • Idea: order the points to identify the clustering structure
  • “Group” points by density connectivity
    – Hierarchies of clusters
  • Visualize the clusters and the hierarchy
SLIDE 9

Ordering Points

  • Points strongly density-connected should be close to one another
  • Clusters that are density-connected should be close to one another and form a “cluster” of clusters

SLIDE 10

OPTICS: An Example

[Figure: reachability-distance plotted over the cluster-order of the objects; valleys below the thresholds ε and ε′ correspond to clusters, and the reachability-distance is undefined at the start of the order]
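The y-axis of such a plot is the reachability-distance. The slides do not define it; in the original OPTICS paper it is built from the core-distance, roughly as sketched below (function names mine):

```python
from math import dist

def core_distance(points, p, eps, min_pts):
    """Distance from p to its MinPts-th nearest neighbor (counting p itself),
    or None ("undefined") if p is not a core object within Eps."""
    d = sorted(dist(p, q) for q in points)  # includes dist(p, p) = 0
    if len(d) < min_pts or d[min_pts - 1] > eps:
        return None
    return d[min_pts - 1]

def reachability_distance(points, o, p, eps, min_pts):
    """Reachability-distance of o w.r.t. p: max(core-distance(p), dist(p, o)),
    undefined when p is not a core object."""
    cd = core_distance(points, p, eps, min_pts)
    return None if cd is None else max(cd, dist(p, o))
```

OPTICS visits points in order of smallest reachability-distance, which is what makes the valleys in the plot correspond to dense clusters.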

SLIDE 11

DENCLUE: Using Density Functions

  • DENsity-based CLUstEring
  • Major features

    – Solid mathematical foundation
    – Good for data sets with large amounts of noise
    – Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets
    – Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45)
    – But needs a large number of parameters

SLIDE 12

DENCLUE: Techniques

  • Use grid cells
    – Only keep grid cells actually containing data points
    – Manage cells in a tree-based access structure
  • Influence function: describes the impact of a data point on its neighborhood
  • The overall density of the data space is the sum of the influence functions of all data points
  • Clustering by identifying density attractors
    – Density attractor: a local maximum of the overall density function
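As an illustration, with a Gaussian influence function (one common choice; the slides do not fix a particular one) the overall density is just a sum over the data points:

```python
from math import dist, exp

def gaussian_influence(x, y, sigma=1.0):
    """Influence of data point y at location x: exp(-d(x, y)^2 / (2 sigma^2)).
    A common DENCLUE choice; other influence functions are possible."""
    return exp(-dist(x, y) ** 2 / (2 * sigma ** 2))

def overall_density(x, data, sigma=1.0):
    """Overall density at x: the sum of the influences of all data points."""
    return sum(gaussian_influence(x, y, sigma) for y in data)
```

Density attractors would then be found by hill-climbing on `overall_density` along its gradient; a location inside a tight group of points scores far higher than one in empty space.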

SLIDE 13

Density Attractor

SLIDE 14

Center-defined and Arbitrary Clusters

SLIDE 15

A Shrinking-based Approach

  • Difficulties of multi-dimensional clustering
    – Noise (outliers)
    – Clusters of various densities
    – Shapes that are not well defined
  • A novel preprocessing concept: “shrinking”
  • A shrinking-based clustering approach
SLIDE 16

Intuition & Purpose

  • For data points in a data set, what if we could make them move towards the centroid of the natural subgroup they belong to?
  • Natural sparse subgroups become denser, and thus easier to detect
    – Noise points are further isolated

SLIDE 17

Inspiration

  • Newton’s Universal Law of Gravitation
    – Any two objects exert a gravitational force of attraction on each other
    – The direction of the force is along the line joining the objects
    – The magnitude of the force is directly proportional to the product of the gravitational masses of the objects, and inversely proportional to the square of the distance between them
    – G: universal gravitational constant
      • G = 6.67 × 10⁻¹¹ N·m²/kg²

      F_g = G · m1 · m2 / r²
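In code form, with the constant and formula from the slide:

```python
G = 6.67e-11  # universal gravitational constant, N*m^2/kg^2

def gravitational_force(m1, m2, r):
    """F_g = G * m1 * m2 / r^2 (Newton's law of universal gravitation)."""
    return G * m1 * m2 / r ** 2

# Two 1 kg masses 1 m apart attract with a force of G newtons.
```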

SLIDE 18

The Concept of Shrinking

  • A data preprocessing technique
    – Aims to optimize the inner structure of real data sets
  • Each data point is “attracted” by the other data points and moves in the direction in which the attraction is strongest
  • Can be applied in different fields
SLIDE 19

Applying Shrinking to the Clustering Field

  • Shrink the natural sparse clusters to make them much denser, to facilitate the subsequent cluster-detecting process

[Figure: a multi-attribute hyperspace]

SLIDE 20

Data Shrinking

  • Each data point moves along the direction of the density gradient, and the data set shrinks towards the inside of the clusters
  • Points are “attracted” by their neighbors and move to create denser clusters
  • The process runs iteratively, repeated until the data are stabilized or the number of iterations exceeds a threshold

SLIDE 21

Approximation & Simplification

  • Problem: computing the mutual attraction of every pair of data points is too time-consuming: O(n²)
    – Solution: drop Newton’s constant G and set the masses m1 and m2 to unit values
  • Only aggregate the gravitation surrounding each data point
  • Use grids to simplify the computation
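A toy sketch of one shrinking iteration under these simplifications (unit masses, no G): aggregating the attraction of a point's neighbors amounts to moving the point toward their centroid. The actual method aggregates over grid cells; this brute-force version is only illustrative, and the radius parameter is my stand-in for the neighborhood size:

```python
from math import dist

def shrink_step(points, radius):
    """One simplified data-shrinking iteration: move each point to the
    centroid of its neighbors within the given radius (the point itself
    counts as its own neighbor, so isolated points do not move)."""
    moved = []
    for p in points:
        neighbors = [q for q in points if dist(p, q) <= radius]
        centroid = tuple(sum(c) / len(neighbors) for c in zip(*neighbors))
        moved.append(centroid)
    return moved
```

After one step, two nearby points collapse toward their midpoint (the sparse group gets denser) while a distant outlier stays where it is, i.e. becomes further isolated relative to the cluster.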
SLIDE 22

Termination condition

  • The average movement of all points in the current iteration is less than a threshold
  • The number of iterations exceeds a threshold

SLIDE 23

OPTICS on Pendigits Data

[Figures: OPTICS reachability plots before data shrinking and after data shrinking]

SLIDE 24

Biclustering

  • Clustering both objects and attributes simultaneously
  • Four requirements
    – Only a small set of objects in a cluster (bicluster)
    – A bicluster only involves a small number of attributes
    – An object may participate in multiple biclusters, or in none
    – An attribute may be involved in multiple biclusters, or in none

Jian Pei: Big Data Analytics -- Clustering 24

SLIDE 25

Application Examples

  • Recommender systems
    – Objects: users
    – Attributes: items
    – Values: user ratings
  • Microarray data
    – Objects: genes
    – Attributes: samples
    – Values: expression levels

[Figure: an n × m gene–sample/condition matrix W = (w_ij), rows = genes 1…n, columns = samples/conditions 1…m, with entries w11, w12, …, wnm]

SLIDE 26

Biclusters with Constant Values

A bicluster with constant values (objects a1, a33, a86 on attributes b6, b12, b36, b99):

        b6   b12   b36   b99
  a1    60    60    60    60
  a33   60    60    60    60
  a86   60    60    60    60

Constant values on rows:

  10 10 10 10 10
  20 20 20 20 20
  50 50 50 50 50

SLIDE 27

Biclusters with Coherent Values

  • Also known as pattern-based clusters


SLIDE 28

Biclusters with Coherent Evolutions

  • Only up- or down-regulated changes over rows or columns

Coherent evolutions on rows:

  10   50   30   70  20
  20  100   50 1000  30
  50  100   90  120  80
  80   20  100   10
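A simple way to test coherent evolution on two rows is to compare the signs of their consecutive differences; only the direction of each change matters, not its magnitude. A sketch (helper names mine):

```python
def same_evolution(row_a, row_b):
    """True iff the two rows rise and fall in the same places, i.e. the
    signs of consecutive differences agree (coherent evolution on rows)."""
    sign = lambda v: (v > 0) - (v < 0)
    pattern = lambda row: [sign(b - a) for a, b in zip(row, row[1:])]
    return pattern(row_a) == pattern(row_b)
```

For example, 10 50 30 70 20 and 20 100 50 1000 30 both go up, down, up, down, so they evolve coherently even though their values differ wildly.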

SLIDE 29

Differences from Subspace Clustering

  • Subspace clustering uses a global distance/similarity measure
  • Pattern-based clustering looks at patterns
  • A subspace cluster found with a globally defined similarity measure may not follow one common pattern

SLIDE 30

Objects Follow the Same Pattern?

[Figure: Objectblue and Objectgreen plotted on dimensions D1 and D2]

The smaller the pScore, the more consistent the objects.

SLIDE 31

Pattern-based Clusters

  • pScore: the similarity between two objects rx, ry on two attributes au, av

      pScore([[rx.au, rx.av], [ry.au, ry.av]]) = |(rx.au − ry.au) − (rx.av − ry.av)|

  • δ-pCluster (R, D): for any objects rx, ry ∈ R and any attributes au, av ∈ D (δ ≥ 0),

      pScore([[rx.au, rx.av], [ry.au, ry.av]]) ≤ δ
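In Python, with objects stored as rows of a matrix, the definition transcribes directly (function names mine):

```python
from itertools import combinations

def pscore(rx, ry, u, v):
    """pScore of the 2x2 submatrix of objects rx, ry on attributes u, v:
    |(rx[u] - ry[u]) - (rx[v] - ry[v])|."""
    return abs((rx[u] - ry[u]) - (rx[v] - ry[v]))

def is_delta_pcluster(rows, attrs, data, delta):
    """(R, D) is a delta-pCluster iff every pair of objects and every pair
    of attributes has pScore <= delta."""
    return all(pscore(data[x], data[y], u, v) <= delta
               for x, y in combinations(rows, 2)
               for u, v in combinations(attrs, 2))
```

For instance, two rows that differ by a constant shift on every attribute form a 0-pCluster, while a row that breaks the shift pattern pushes the pScore above 0.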

SLIDE 32

Maximal pCluster

  • If (R, D) is a δ-pCluster, then every sub-cluster (R’, D’) with R’ ⊆ R and D’ ⊆ D is a δ-pCluster
    – An anti-monotonic property
    – A large pCluster is accompanied by many small pClusters, so enumerating them all is wasteful
  • Idea: mine only the maximal pClusters!
    – A δ-pCluster is maximal if no proper super-cluster of it is a δ-pCluster

SLIDE 33

Mining Maximal pClusters

  • Given
    – A cluster threshold δ
    – An attribute threshold mina
    – An object threshold mino
  • Task: mine the complete set of significant maximal δ-pClusters
    – A significant δ-pCluster has at least mino objects on at least mina attributes
SLIDE 34

pClusters and Frequent Itemsets

  • A transaction database can be modeled as a binary matrix
  • Frequent itemset: a sub-matrix of all 1’s
    – A 0-pCluster on binary data
    – mino: the support threshold
    – mina: at least mina attributes
    – Maximal pClusters correspond to closed itemsets
  • Frequent itemset mining algorithms cannot be extended straightforwardly to mine pClusters on numeric data

SLIDE 35

Where Should We Start from?

  • How about the pClusters having only 2 objects or 2 attributes?
    – MDS (maximal dimension set)
    – A pCluster must have at least 2 objects and 2 attributes
  • Finding MDSs

    Attribute:   a   b   c   d   e   f   g   h
    Object x:   13  11   9   7   9  13   2  15
    Object y:    7   4  10   1  12   3   4   7
    x − y:       6   7  −1   6  −3  10  −2   8
SLIDE 36

How to Assemble Larger pClusters?

  • Systematically enumerate every combination of attributes D
    – For each attribute subset, find the maximal subsets of objects R s.t. (R, D) is a pCluster
    – Check whether (R, D) is maximal
  • Prune search branches as early as possible
  • Why attributes first and objects later?
    – # of objects >> # of attributes
  • Algorithm MaPle (Pei et al., 2003)

SLIDE 37

More Pruning Techniques

  • Only possible attributes should be considered to grow larger pClusters
  • Prune local maximal pClusters having insufficient possible attributes
  • Extract common attributes from the possible attribute set directly
  • Prune non-maximal pClusters
SLIDE 38

Gene-Sample-Time Series Data

[Figure: a gene × sample × time cube; slicing it yields the Gene-Time, Gene-Sample, and Sample-Time matrices; each cell stores the expression level of gene i on sample j at time k]

SLIDE 39

Mining GST Microarray Data

  • Reduce the gene-sample-time series data to gene-sample data
    – Use Pearson’s correlation coefficient as the coherence measure
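Pearson's correlation coefficient for two equal-length series (here, e.g., the time series of one gene on two samples) is a standard formula, self-contained below:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length series.
    Assumes neither series is constant (nonzero standard deviations)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Values near +1 mean the two series rise and fall together (coherent), values near −1 mean they move in opposite directions.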

SLIDE 40

Basic Approaches

  • Sample-gene search
    – Enumerate the subsets of samples systematically
    – For each subset of samples, find the genes that are coherent on those samples
  • Gene-sample search
    – Enumerate the subsets of genes systematically
    – For each subset of genes, find the samples on which the genes are coherent

SLIDE 41

Basic Tools

  • Set enumeration tree
  • Sample-gene search and gene-sample search are not symmetric!
    – Many genes, but only a few samples
    – No requirement that the samples be coherent on the genes

SLIDE 42

Phenotypes and Informative Genes

[Figure: an expression matrix over samples 1–7, with the genes (gene1–gene7) split into informative genes and non-informative genes; the informative genes separate the sample phenotypes]

SLIDE 43

The Phenotype Mining Problem

  • Input: a microarray matrix and k
  • Output: phenotypes and informative genes
    – A partition of the samples into k exclusive subsets: the phenotypes
    – Informative genes discriminating the phenotypes
  • Machine learning methods
    – Heuristic search
    – Mutual reinforcing adjustment

SLIDE 44

Requirements

  • The expression levels of each informative gene should be similar over the samples within each phenotype
  • The expression levels of each informative gene should display a clear dissimilarity between each pair of phenotypes

SLIDE 45

To-Do List

  • Read Chapters 10.4 and 11.2
  • Assignment 3
