

SLIDE 1

Cluster Analysis

  • Grouping the data items into a number of sets such that the members of each set have “more in common” with each other than with any members of any other set

– “More in common” can be defined in many ways, but some form of distance metric based on the characteristics of each data item is normal
– Data items belonging to a cluster will be nearer to each other in terms of this distance measure than to data items in any other cluster

  • Clustering algorithms can be divided into 2 types

– Hierarchical
– Non-hierarchical
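As a concrete instance of the distance metric mentioned above, here is a minimal Euclidean-distance sketch in Python (the function name and sample items are illustrative, not from the slides):

```python
import math

def euclidean_distance(a, b):
    """Distance between two data items, each a tuple of numeric characteristics."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Items that are close in characteristic space belong in the same cluster.
print(euclidean_distance((1.0, 2.0), (4.0, 6.0)))  # 5.0
```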

Hierarchical Clustering

  • Hierarchical clustering produces a family of alternative clusterings
  • If we have n data items then we start with n clusters – this is our first clustering

  • We merge the two clusters which are “closest” according to some metric to form n-1 clusters – this is our second clustering

  • We continue to merge the closest pairs of clusters – producing successive clusterings – until we have just one cluster which contains all of the data items

  • This can be visualised in a dendrogram
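The merging procedure above can be sketched in Python. This is an illustrative implementation using a minimum-distance (single-linkage) cluster metric and a naive search for the closest pair; the names and sample data are assumptions, not taken from the slides:

```python
import math

def single_linkage(c1, c2):
    # Distance between clusters = minimum distance over all cross-cluster item pairs.
    return min(math.dist(a, b) for a in c1 for b in c2)

def hierarchical_clusterings(items):
    """Return the family of clusterings, from n singleton clusters down to one."""
    clusters = [[it] for it in items]          # first clustering: n clusters
    family = [list(clusters)]
    while len(clusters) > 1:
        # Find the closest pair of clusters under the linkage metric.
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda p: single_linkage(clusters[p[0]], clusters[p[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        family.append(list(clusters))          # next clustering: one fewer cluster
    return family

family = hierarchical_clusterings([(0.0,), (0.1,), (5.0,), (5.2,)])
# family holds four clusterings, of sizes 4, 3, 2 and 1
```

Each element of `family` is one of the alternative clusterings; the sequence of merges is exactly what a dendrogram records.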
SLIDE 2

Dendrogram Example

Distance Metrics for Clusters

  • Clearly the distance/difference metrics we have considered so far cannot be applied to clusters

– Clusters will, in general, contain more than one data item so there will be more than one value for each characteristic within a cluster
  • Common metrics for clusters include

– Set the distance between two clusters to be the minimum distance between any pair of data items, where one data item is in one cluster and the other data item is in the other cluster (single linkage)
– Set the distance between two clusters to be the maximum distance between any such pair of data items (complete linkage)
– Set the distance between two clusters to be the average of the distances between all such pairs of data items (average linkage)
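The three cluster metrics above might be written as follows, assuming data items are numeric tuples and `math.dist` gives the pairwise item distance (the function names are illustrative):

```python
import math
from itertools import product

def min_linkage(c1, c2):
    # Minimum distance over all cross-cluster item pairs.
    return min(math.dist(a, b) for a, b in product(c1, c2))

def max_linkage(c1, c2):
    # Maximum distance over all cross-cluster item pairs.
    return max(math.dist(a, b) for a, b in product(c1, c2))

def avg_linkage(c1, c2):
    # Average distance over all cross-cluster item pairs.
    dists = [math.dist(a, b) for a, b in product(c1, c2)]
    return sum(dists) / len(dists)

c1, c2 = [(0.0,), (1.0,)], [(3.0,), (5.0,)]
# pair distances are 3, 5, 2, 4 -> min 2.0, max 5.0, average 3.5
```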

SLIDE 3

Non-Hierarchical Clustering

  • Non-hierarchical methods are many and varied, but they all produce just one clustering of the data items

– The number of clusters to be formed is supplied as an input to the process

  • Each cluster is characterised by a centroid

– The centroid of a cluster is usually defined to be the set of average values of the characteristics of the data items in the cluster

  • Initially, our clusters will contain no data items, so we assign default centroid values to each cluster, carefully chosen to ensure a spread across the range of possibilities

Non-Hierarchical Clustering Method

  • First, each data item is assigned to a cluster based on the distance between that item and the centroids of the clusters

  • After this assignment the clusters will contain actual data items and we can calculate their real centroids

  • Next we re-evaluate each data item and transfer it from its current cluster to the cluster whose centroid is closest to it

– We note that this will change the centroids of both the cluster which the data item is removed from and the cluster to which it is added

  • We now iteratively repeat the evaluation of each data item until no further transfers are required
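The iterative method above can be sketched as follows. The names, the 1-D sample data, and the choice of initial centroids are illustrative; in this sketch a cluster that temporarily loses all its items simply keeps its previous centroid:

```python
import math

def centroid(cluster):
    # Average value of each characteristic across the cluster's items.
    return tuple(sum(vals) / len(vals) for vals in zip(*cluster))

def cluster_items(items, initial_centroids):
    """Reassign items to the nearest centroid until no further transfers occur."""
    centroids = list(initial_centroids)
    assignment = [None] * len(items)
    while True:
        # Assign each item to the cluster with the closest centroid.
        new = [min(range(len(centroids)),
                   key=lambda c: math.dist(items[i], centroids[c]))
               for i in range(len(items))]
        if new == assignment:            # no further transfers required
            return [[items[i] for i in range(len(items)) if new[i] == c]
                    for c in range(len(centroids))]
        assignment = new
        for c in range(len(centroids)):  # recompute the real centroids
            members = [items[i] for i in range(len(items)) if assignment[i] == c]
            if members:
                centroids[c] = centroid(members)

clusters = cluster_items([(1.0,), (1.2,), (8.0,), (8.3,)],
                         initial_centroids=[(0.0,), (10.0,)])
```

The number of clusters is fixed by the number of initial centroids supplied, matching the note above that the cluster count is an input to the process.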

SLIDE 4

Nearest Neighbour Methods

  • Nearest neighbour methods can be used for both clustering and classification

  • We form a training set of data items which are intended to be typical of a certain class/cluster of such items

  • We next form a response value for each data item, based on some function of its characteristics

  • For each class/cluster we determine the average response value
  • Data items can then be assigned to classes/clusters according to the

response value that each generates
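A sketch of the response-value scheme described above, assuming a simple sum-of-characteristics response function (an illustrative choice; the slides leave the function unspecified, and the class labels and sample data are hypothetical):

```python
def response(item):
    # Illustrative response function: sum of the item's characteristics.
    return sum(item)

def class_responses(training):
    """training maps class label -> list of typical data items for that class."""
    return {label: sum(response(it) for it in items) / len(items)
            for label, items in training.items()}

def classify(item, avg_responses):
    # Assign to the class whose average response is nearest the item's response.
    r = response(item)
    return min(avg_responses, key=lambda label: abs(avg_responses[label] - r))

training = {"low": [(1.0, 2.0), (2.0, 1.0)], "high": [(8.0, 9.0), (9.0, 8.0)]}
avgs = class_responses(training)   # {"low": 3.0, "high": 17.0}
print(classify((2.5, 1.0), avgs))  # low
```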