CLUSTER ANALYSIS Agenda Introduction to cluster analysis and - PowerPoint PPT Presentation

CLUSTER ANALYSIS

Agenda
● Introduction to cluster analysis and applications
● Feature Selection for Clustering
● Clustering Algorithm
● DBScan
● Cluster Validation

Introduction
Clustering is the partitioning of a large number of data points into a smaller number of groups.
Applications
● Data Summarization
● Customer Segmentation
● Social Network Analysis
● Relationship to other data mining problems

Feature Selection in Clustering
The aim of feature selection is to remove noisy attributes that do not cluster well. This is difficult in clustering because it is unsupervised.

Student ID   Name    Score   Class
T00525452    John    80      Grade 5
T00423514    Peter   79      Grade 5
T00321452    James   55      Grade 3
T00715261    Tony    63      Grade 4

Student ID is unique to each student, so it carries no grouping information.
● Filter Model
● Wrapper Model

Filter Model and Wrapper Model
● Filter Model (preprocessing phase) - a specific criterion is used to filter out features or data points that do not meet a specified score and would otherwise harm the clustering tendency.
● Wrapper Model (integrated into the clustering process) - a clustering algorithm is used to evaluate the quality of the features.

Common Filter Model Criteria
● Term Strength - fraction of similar document pairs in which a term appears in both documents, given that it appears in the first (sparse domains)
● Predictive Attribute Dependence - correlated features form better clusters
  ○ Classification Algorithm
  ○ Regression Modeling (Numeric)
● Entropy - clustered data reflects its clustering characteristics in its distance distribution
● Hopkins statistic - measures the clustering tendency of a data set
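To make the Hopkins statistic concrete, here is a minimal pure-Python sketch. It assumes one common formulation: sample some real points and an equal number of synthetic points drawn uniformly from the data's bounding box, then compare nearest-neighbour distances. The function names, the `sample_size` and `seed` parameters, and the bounding-box sampling are illustrative assumptions, not something the slides specify.

```python
import math
import random


def nearest_distance(point, data, skip_index=None):
    """Distance from `point` to its nearest neighbour in `data`.

    `skip_index` excludes the point itself when it is a member of `data`.
    """
    return min(
        math.dist(point, other)
        for i, other in enumerate(data)
        if i != skip_index
    )


def hopkins_statistic(data, sample_size, seed=0):
    """Hypothetical helper: Hopkins statistic under the formulation above.

    Values near 0.5 suggest uniformly scattered data; values near 1
    suggest a strong clustering tendency.
    """
    rng = random.Random(seed)
    dims = len(data[0])
    lows = [min(p[d] for p in data) for d in range(dims)]
    highs = [max(p[d] for p in data) for d in range(dims)]

    # Nearest-neighbour distances for a sample of real data points.
    sampled = rng.sample(range(len(data)), sample_size)
    real = sum(nearest_distance(data[i], data, skip_index=i) for i in sampled)

    # Nearest-neighbour distances for synthetic, uniformly drawn points.
    synthetic = 0.0
    for _ in range(sample_size):
        point = [rng.uniform(lows[d], highs[d]) for d in range(dims)]
        synthetic += nearest_distance(point, data)

    return synthetic / (synthetic + real)
```

On tightly clustered data the real points sit much closer to their neighbours than the synthetic points do, so the ratio moves toward 1. Some references define the statistic with the roles reversed, in which case clustered data scores near 0.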

CLUSTERING ALGORITHM

Representative-Based Algorithm
● Simplest clustering algorithm
● No hierarchical relationship
● Clusters are created in one shot
● Uses partitioning representatives

The objective is to minimize the total distance between each data point and its closest representative.
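As a sketch, this objective can be written as a formula. The symbols are assumptions introduced here for illustration: X_1, ..., X_n are the data points, Y_1, ..., Y_k are the representatives, and Dist is whichever distance function the algorithm uses.

```latex
O = \sum_{i=1}^{n} \min_{1 \le j \le k} \mathrm{Dist}(X_i, Y_j)
```

Choosing Dist as the squared Euclidean distance gives the k-means objective, and choosing the Manhattan distance gives the k-medians objective.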

Types of Representative-Based Algorithm
● k-Means - Euclidean distance; does not work well when clusters have arbitrary shapes
● Kernel k-Means - can determine clusters of arbitrary shape
● k-Medians - Manhattan distance
● k-Medoids - representatives are selected from the database, because k-means can be distorted by outliers
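A minimal sketch of the k-means iteration, assuming the usual alternating scheme: assign each point to its nearest representative, then move each representative to the mean of its points. The function names and the `max_iterations` parameter are illustrative, and the initial centroids are supplied by the caller rather than chosen by any particular seeding rule.

```python
import math


def assign(points, centroids):
    """Group points by the index of their nearest centroid (Euclidean)."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        clusters[nearest].append(p)
    return clusters


def k_means(points, initial_centroids, max_iterations=100):
    """Hypothetical helper: alternate assignment and mean updates."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iterations):
        clusters = assign(points, centroids)
        updated = [
            # An empty cluster keeps its previous centroid.
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centroids:  # No centroid moved: converged.
            break
        centroids = updated
    return centroids, assign(points, centroids)
```

Replacing the mean with a per-dimension median and the Euclidean distance with the Manhattan distance would turn this sketch into k-medians. Restricting each representative to an actual data point would give a k-medoids variant.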

Hierarchical Clustering Algorithm
● Clusters data using distances
● A distance function is not compulsory
● Density-based or graph-based algorithms can be used
[Figure: hierarchy of topics and subtopics]

Types of Hierarchical Clustering Algorithm
● Bottom-up (agglomerative) methods - individual data points are successively agglomerated into higher-level clusters
● Top-down (divisive) methods - partition the data points into a tree-like structure
● Dendrogram
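The bottom-up approach can be sketched as follows. This example assumes single linkage (the distance between two clusters is the distance between their closest members), which is only one of several possible merge criteria. The function name and the `target_clusters` stopping rule are illustrative.

```python
import math


def single_linkage(points, target_clusters):
    """Hypothetical helper: agglomerative clustering with single linkage.

    Starts with one cluster per point and repeatedly merges the two
    closest clusters. The recorded merge distances are the heights a
    dendrogram would show.
    """
    clusters = [[p] for p in points]
    merge_distances = []
    while len(clusters) > target_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(p, q) for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merge_distances.append(d)
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return clusters, merge_distances
```

Running the loop until a single cluster remains (`target_clusters=1`) yields the full merge sequence. Cutting that sequence at a chosen distance is equivalent to cutting the dendrogram at that height.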

Probabilistic Model-Based Algorithm
● Soft algorithm
● Each data point may have a nonzero assignment probability to many clusters
● A soft solution can be converted to a hard solution by assigning each data point to the cluster for which its probability is highest
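The conversion from a soft to a hard solution can be illustrated with a short sketch. It assumes the soft solution is given as one row of cluster probabilities per data point. The function name and the input layout are illustrative.

```python
def harden(soft_assignments):
    """Hypothetical helper: pick the most probable cluster for each point.

    `soft_assignments[i][j]` is the probability that point i belongs to
    cluster j. Ties go to the lowest cluster index.
    """
    return [
        max(range(len(row)), key=row.__getitem__)
        for row in soft_assignments
    ]


soft = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.4, 0.5, 0.1],
]
hard = harden(soft)
```

The hard solution discards how confident each assignment was. The third point, for example, is only marginally more likely to belong to its chosen cluster than to the first one.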

Grid-Based and Density-Based Algorithm
Grid-Based
● Unlike k-means, the number of clusters is not defined in advance
● Returns the natural clusters within the data, with their corresponding shapes

Density-Based Clustering Method
● Clustering based on density, such as density-connected points, or on an explicitly constructed density function
● Can handle noise
● Handles clusters of all shapes
● One scan
● Does not work well with varying densities or high-dimensional data

DBScan
● DBScan is a density-based algorithm used to find associations and structures in data that are hard to find manually but can be relevant for finding patterns and predicting trends
● Density = number of points within a specified radius (Eps)
● A point is a core point if it has more than a specified number of points (MinPts) within Eps
● A border point has fewer than MinPts points within Eps, but is in the neighbourhood of a core point
● A noise point is any point that is neither a core point nor a border point
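The three point types can be sketched in a few lines. This example makes two assumptions that are not fixed by the slides: a point counts as its own neighbour, and a core point needs at least `min_pts` points within `eps`. To match the "more than" wording literally, change `>=` to `>`. The function name is illustrative, and the sketch only labels points; it does not go on to link core points into clusters.

```python
import math


def classify_points(points, eps, min_pts):
    """Hypothetical helper: label each point as core, border, or noise."""
    # Indices of all points within eps of each point (including itself).
    neighbourhoods = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, nb in enumerate(neighbourhoods) if len(nb) >= min_pts}

    labels = []
    for i, nb in enumerate(neighbourhoods):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            labels.append("border")  # Not dense itself, but near a core point.
        else:
            labels.append("noise")
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
labels = classify_points(points, eps=1.0, min_pts=3)
```

Here the four corner points of the unit square are core points, (2, 1) is a border point because its only neighbour is the core point (1, 1), and (10, 10) is noise.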

Practical Applications of DBScan
● Satellite Imagery
● X-ray Crystallography
● Anomaly Detection in Temperature Data
● Credit Fraud Detection
● E-commerce (recommending relevant products to customers)

DBScan vs K-means clustering
● In terms of location analytics, DBScan does a wonderful job of identifying high crime areas through density filtering
● K-means clustering works fine but doesn’t provide additional insight

Cluster Validation
● Cluster validation refers to procedures that evaluate the results of clustering in a quantitative and objective fashion [Jain & Dubes, 1988]
● Quantitative means employing measures, while objective means validating those measures
● Cluster validation is needed to compare clustering algorithms, determine the number of clusters, and avoid finding patterns in noise

Measures of Cluster Validation
External Index
● Comparing the clustering results to ground truth (externally known results)
● Comparing two clusterings
Internal Index
● Evaluating the quality of clusters without reference to external information
● Validating with different numbers of clusters
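As an illustration of the two kinds of index, the sketch below uses the Rand index as an external measure and the within-cluster sum of squared errors (SSE) as an internal one. Both are common choices but are assumptions here, since the slides do not name specific measures. The function names are illustrative.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """External index: fraction of point pairs on which two labelings agree.

    A pair agrees when both labelings put the two points together, or
    both put them apart. Label names themselves do not matter.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


def within_cluster_sse(points, labels):
    """Internal index: squared distance of each point to its cluster mean."""
    groups = {}
    for p, label in zip(points, labels):
        groups.setdefault(label, []).append(p)
    total = 0.0
    for members in groups.values():
        centroid = [sum(coord) / len(members) for coord in zip(*members)]
        total += sum(
            sum((x - m) ** 2 for x, m in zip(p, centroid))
            for p in members
        )
    return total
```

Computing the SSE for several candidate numbers of clusters and looking for the point where it stops dropping sharply is one way to validate with different numbers of clusters. The Rand index needs a reference labeling, either a ground truth or the output of a second algorithm.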

Thank you
