Data Mining Clustering Hamid Beigy Sharif University of Technology

Data Mining Clustering Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 1 / 41

Table of contents

1. Introduction
2. Data matrix and dissimilarity matrix
3. Proximity Measures
4. Clustering methods
   - Partitioning methods
   - Hierarchical methods
   - Model-based clustering
   - Density based clustering
   - Grid-based clustering
5. Cluster validation and assessment

Introduction

Clustering is the process of grouping a set of data objects into multiple groups or clusters so that objects within a cluster have high similarity, but are very dissimilar to objects in other clusters. Dissimilarities and similarities are assessed based on the attribute values describing the objects and often involve distance measures. Clustering as a data mining tool has its roots in many application areas such as biology, security, business intelligence, and Web search.

Requirements for cluster analysis

Clustering is a challenging research field and the following are its typical requirements.

- Scalability
- Ability to deal with different types of attributes
- Discovery of clusters with arbitrary shape
- Requirements for domain knowledge to determine input parameters
- Ability to deal with noisy data
- Incremental clustering and insensitivity to input order
- Capability of clustering high-dimensionality data
- Constraint-based clustering
- Interpretability and usability

Comparing clustering methods

The clustering methods can be compared using the following aspects:

- The partitioning criteria: In some methods, all the objects are partitioned so that no hierarchy exists among the clusters, while other methods partition the objects hierarchically.
- Separation of clusters: In some methods, the data are partitioned into mutually exclusive clusters, while in other methods the clusters may not be exclusive, that is, a data object may belong to more than one cluster.
- Similarity measure: Some methods determine the similarity between two objects by the distance between them, while in other methods the similarity may be defined by connectivity based on density or contiguity.
- Clustering space: Many clustering methods search for clusters within the entire data space. These methods are useful for low-dimensionality data sets. With high-dimensional data, however, there can be many irrelevant attributes, which can make similarity measurements unreliable. Consequently, clusters found in the full space are often meaningless. It is often better to search instead for clusters within different subspaces of the same data set.

Data matrix and dissimilarity matrix

Suppose that we have $n$ objects described by $p$ attributes. The objects are $x_1 = (x_{11}, x_{12}, \ldots, x_{1p})$, $x_2 = (x_{21}, x_{22}, \ldots, x_{2p})$, and so on, where $x_{ij}$ is the value for object $x_i$ of the $j$th attribute. For brevity, we hereafter refer to object $x_i$ as object $i$. The objects may be tuples in a relational database, and are also referred to as data samples or feature vectors.

Main memory-based clustering and nearest-neighbor algorithms typically operate on either of the following two data structures:

Data matrix: This structure stores the $n$ objects in the form of a table or $n \times p$ matrix.

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

Dissimilarity matrix: This structure stores a collection of proximities that are available for all pairs of objects. It is often represented by an $n \times n$ matrix or table:

$$
\begin{bmatrix}
0      & d(1,2) & d(1,3) & \cdots & d(1,n) \\
d(2,1) & 0      & d(2,3) & \cdots & d(2,n) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
d(n,1) & d(n,2) & d(n,3) & \cdots & 0
\end{bmatrix}
$$
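As a rough illustration of how the two structures relate, the sketch below derives a dissimilarity matrix from a small data matrix. It assumes numeric attributes, Euclidean distance, and a symmetric measure with d(i, j) = d(j, i); the function names `euclidean` and `dissimilarity_matrix` and the sample data are illustrative and do not come from the slides.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dissimilarity_matrix(data, dist=euclidean):
    """Build the n x n dissimilarity matrix from an n x p data matrix.

    Assumes dist is symmetric, so only the upper triangle is computed
    and then mirrored. The diagonal stays 0 because d(i, i) = 0.
    """
    n = len(data)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = dist(data[i], data[j])
    return d


# Hypothetical data matrix: n = 3 objects, p = 2 attributes.
data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = dissimilarity_matrix(data)
for row in D:
    print(row)
```

Storing the full n x n table costs O(n^2) memory, which is one reason this representation suits main memory-based methods on small to medium data sets.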

Proximity Measures

Proximity measures for nominal attributes: Let the number of states of a nominal attribute be $M$. The dissimilarity between two objects $i$ and $j$ can be computed based on the ratio of mismatches:

$$d(i,j) = \frac{p - m}{p}$$

where $m$ is the number of matches and $p$ is the total number of attributes describing the objects.

Proximity measures for binary attributes: Binary attributes are either symmetric or asymmetric. The values of two objects $i$ and $j$ can be summarized in a contingency table:

                    Object j
                    1        0        sum
    Object i   1    q        r        q + r
               0    s        t        s + t
               sum  q + s    r + t    p

For symmetric binary attributes, dissimilarity is calculated as

$$d(i,j) = \frac{r + s}{q + r + s + t}$$

For asymmetric binary attributes, when the number of negative matches, $t$, is unimportant and the number of positive matches, $q$, is important, dissimilarity is calculated as

$$d(i,j) = \frac{r + s}{q + r + s}$$

The coefficient $1 - d(i,j)$ is called the Jaccard coefficient.
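The formulas on this slide can be sketched in a few lines of Python. The function names and the sample objects are illustrative assumptions, binary values are assumed to be encoded as 0 and 1, and raising an error when q + r + s = 0 is a choice made here because the asymmetric formula is undefined in that case.

```python
def nominal_dissimilarity(a, b):
    """d(i, j) = (p - m) / p, with m matches out of p nominal attributes."""
    p = len(a)
    m = sum(1 for x, y in zip(a, b) if x == y)
    return (p - m) / p


def binary_counts(a, b):
    """Return the contingency counts (q, r, s, t) for two 0/1 vectors."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    t = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    return q, r, s, t


def symmetric_binary_dissimilarity(a, b):
    """d(i, j) = (r + s) / (q + r + s + t)."""
    q, r, s, t = binary_counts(a, b)
    return (r + s) / (q + r + s + t)


def asymmetric_binary_dissimilarity(a, b):
    """d(i, j) = (r + s) / (q + r + s); negative matches t are ignored."""
    q, r, s, _ = binary_counts(a, b)
    if q + r + s == 0:
        # Assumption: treat the all-zero case as an error, since the
        # formula has a zero denominator there.
        raise ValueError("undefined when both objects are all zeros")
    return (r + s) / (q + r + s)


def jaccard_coefficient(a, b):
    """Similarity 1 - d(i, j) for asymmetric binary attributes."""
    return 1.0 - asymmetric_binary_dissimilarity(a, b)


# Hypothetical objects with six binary attributes each.
obj_i = (1, 0, 1, 0, 0, 0)
obj_j = (1, 1, 0, 0, 0, 0)
print(binary_counts(obj_i, obj_j))                    # (q, r, s, t)
print(symmetric_binary_dissimilarity(obj_i, obj_j))
print(asymmetric_binary_dissimilarity(obj_i, obj_j))
print(nominal_dissimilarity(("red", "small", "round"),
                            ("red", "large", "round")))
```

For these two objects the counts are q = 1, r = 1, s = 1, t = 3, so ignoring the three negative matches raises the dissimilarity from 2/6 to 2/3.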

Proximity Measures (cont.)

Dissimilarity of numeric attributes: The most popular distance measure is Euclidean distance

$$d(i,j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ip} - x_{jp})^2}$$

Another well-known measure is Manhattan distance

$$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$

Minkowski distance is a generalization of Euclidean and Manhattan distances

$$d(i,j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$$

Setting $h = 1$ gives Manhattan distance and $h = 2$ gives Euclidean distance.

Dissimilarity of ordinal attributes: We first replace each $x_{if}$ by its corresponding rank $r_{if} \in \{1, \ldots, M_f\}$ and then normalize it using

$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

Then dissimilarity can be computed using distance measures for numeric attributes using $z_{if}$.
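A minimal sketch of the numeric and ordinal measures follows. The names `minkowski` and `normalize_rank`, the sample vectors, and the three-state ordinal scale are illustrative assumptions. Ranks are assumed to start at 1 and each ordinal attribute to have at least two states.

```python
def minkowski(a, b, h):
    """Minkowski distance of order h between two numeric vectors.

    h = 1 gives Manhattan distance and h = 2 gives Euclidean distance.
    """
    return sum(abs(x - y) ** h for x, y in zip(a, b)) ** (1.0 / h)


def normalize_rank(rank, num_states):
    """Map a rank r in {1, ..., M_f} onto [0, 1] via z = (r - 1) / (M_f - 1)."""
    return (rank - 1) / (num_states - 1)


# Hypothetical objects with two numeric attributes.
a = (1.0, 2.0)
b = (3.0, 5.0)
manhattan = minkowski(a, b, 1)   # |1 - 3| + |2 - 5|
euclid = minkowski(a, b, 2)      # sqrt(2^2 + 3^2)
print(manhattan, euclid)

# Hypothetical ordinal attribute with M_f = 3 ordered states.
ranks = {"fair": 1, "good": 2, "excellent": 3}
z = {state: normalize_rank(r, len(ranks)) for state, r in ranks.items()}
print(z)
```

After normalization every ordinal attribute lies in [0, 1], so attributes with many states do not outweigh those with few when they are fed into a numeric distance.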

Proximity Measures (cont.)

Dissimilarity for attributes of mixed types: A preferable approach is to process all attribute types together, performing a single analysis.

$$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$

where the indicator $\delta_{ij}^{(f)} = 0$ if either

- $x_{if}$ or $x_{jf}$ is missing, or
- $x_{if} = x_{jf} = 0$ and attribute $f$ is asymmetric binary,

and otherwise $\delta_{ij}^{(f)} = 1$.

The distance $d_{ij}^{(f)}$ is computed based on the type of attribute $f$.
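The weighted average above can be sketched as follows. The slide does not spell out the per-type distances, so this sketch assumes a common convention: a numeric attribute contributes |x_if - x_jf| divided by the attribute's range, and a nominal or binary attribute contributes 0 on a match and 1 on a mismatch. Missing values are assumed to be encoded as `None`. The function name, the type labels, and the sample objects are hypothetical.

```python
def mixed_dissimilarity(a, b, kinds, ranges):
    """Dissimilarity for mixed attribute types.

    kinds[f]  is "numeric", "nominal", or "asymmetric_binary".
    ranges[f] is max - min of attribute f for numeric attributes
              (ignored for the other kinds).
    """
    numerator = 0.0
    denominator = 0.0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        # delta = 0 when either value is missing.
        if x is None or y is None:
            continue
        # delta = 0 for a negative match on an asymmetric binary attribute.
        if kind == "asymmetric_binary" and x == 0 and y == 0:
            continue
        if kind == "numeric":
            d = abs(x - y) / rng          # assumed range normalization
        else:
            d = 0.0 if x == y else 1.0    # match / mismatch
        numerator += d
        denominator += 1.0                # delta = 1
    if denominator == 0:
        # Assumption: no comparable attribute means the value is undefined.
        raise ValueError("no attribute with delta = 1")
    return numerator / denominator


kinds = ("numeric", "nominal", "asymmetric_binary", "asymmetric_binary")
ranges = (40.0, None, None, None)
obj_i = (30.0, "red", 1, None)   # last attribute missing
obj_j = (50.0, "blue", 0, 1)
d_ij = mixed_dissimilarity(obj_i, obj_j, kinds, ranges)
print(d_ij)
```

In this example the fourth attribute is skipped because one value is missing, so the result averages the three remaining contributions 0.5, 1, and 1.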

Clustering methods

There are many clustering algorithms in the literature. It is difficult to provide a crisp categorization of clustering methods because these categories may overlap so that a method may have features from several categories. In general, the major fundamental clustering methods can be classified into the following categories.

    Method                 General characteristics
    ---------------------  ----------------------------------------------------------
    Partitioning methods   - Find mutually exclusive clusters of spherical shape
                           - Distance-based
                           - May use mean or medoid (etc.) to represent cluster center
                           - Effective for small- to medium-size data sets
    Hierarchical methods   - Clustering is a hierarchical decomposition (i.e., multiple levels)
                           - Cannot correct erroneous merges or splits
                           - May incorporate other techniques like microclustering or
                             consider object "linkages"
    Density-based methods  - Can find arbitrarily shaped clusters
                           - Clusters are dense regions of objects in space that are
                             separated by low-density regions
                           - Cluster density: each point must have a minimum number of
                             points within its "neighborhood"
                           - May filter out outliers
    Grid-based methods     - Use a multiresolution grid data structure
                           - Fast processing time (typically independent of the number
                             of data objects, yet dependent on grid size)
