SLIDE 1

Finding Clusters

Types of Clustering Approaches:

  • Linkage Based, e.g. Hierarchical Clustering
  • Clustering by Partitioning, e.g. k-Means
  • Density Based Clustering, e.g. DBScan
  • Grid Based Clustering

Compendium slides for “Guide to Intelligent Data Analysis”, Springer 2011. © Michael R. Berthold, Christian Borgelt, Frank Höppner, Frank Klawonn and Iris Adä.

SLIDE 2

Hierarchical Clustering

SLIDE 3

Hierarchical clustering

(Figure: two-dimensional MDS scatter plot of the Iris data; legend: Iris setosa, Iris versicolor, Iris virginica.)

In the two-dimensional MDS (Sammon mapping) representation of the Iris data set, two clusters can be identified. (The colours, indicating the species of the flowers, are ignored here.)

SLIDE 4

Hierarchical clustering

Hierarchical clustering builds clusters step by step.

SLIDE 5

Hierarchical clustering

Hierarchical clustering builds clusters step by step. Usually a bottom-up strategy is applied by first considering each data object as a separate cluster and then step by step joining clusters together that are close to each other. This approach is called agglomerative hierarchical clustering.

SLIDE 6

Hierarchical clustering

Hierarchical clustering builds clusters step by step. Usually a bottom-up strategy is applied by first considering each data object as a separate cluster and then step by step joining clusters together that are close to each other. This approach is called agglomerative hierarchical clustering. In contrast to agglomerative hierarchical clustering, divisive hierarchical clustering starts with the whole data set as a single cluster and then divides clusters step by step into smaller clusters.

SLIDE 7

Hierarchical clustering

Hierarchical clustering builds clusters step by step. Usually a bottom-up strategy is applied by first considering each data object as a separate cluster and then step by step joining clusters together that are close to each other. This approach is called agglomerative hierarchical clustering. In contrast to agglomerative hierarchical clustering, divisive hierarchical clustering starts with the whole data set as a single cluster and then divides clusters step by step into smaller clusters. In order to decide which data objects should belong to the same cluster, a (dis-)similarity measure is needed.

SLIDE 8

Hierarchical clustering

Hierarchical clustering builds clusters step by step. Usually a bottom-up strategy is applied by first considering each data object as a separate cluster and then step by step joining clusters together that are close to each other. This approach is called agglomerative hierarchical clustering. In contrast to agglomerative hierarchical clustering, divisive hierarchical clustering starts with the whole data set as a single cluster and then divides clusters step by step into smaller clusters. In order to decide which data objects should belong to the same cluster, a (dis-)similarity measure is needed. Note: We do not need to have access to features; all that is needed for hierarchical clustering is an n × n matrix [di,j], where di,j is the (dis-)similarity of data objects i and j (n is the number of data objects).

SLIDE 9

Hierarchical clustering: Dissimilarity matrix

The dissimilarity matrix [di,j] should at least satisfy the following conditions:

  • di,j ≥ 0, i.e. dissimilarity cannot be negative.
  • di,i = 0, i.e. each data object is completely similar to itself.
  • di,j = dj,i, i.e. data object i is (dis-)similar to data object j to the same degree as data object j is (dis-)similar to data object i.

It is often useful if the dissimilarity is a (pseudo-)metric, also satisfying the triangle inequality di,k ≤ di,j + dj,k.
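These conditions are easy to check in code. A minimal sketch in Python (not part of the slides; the function name and tolerance are illustrative), assuming NumPy is available:

```python
# Hedged sketch: check the conditions a dissimilarity matrix should satisfy.
import numpy as np

def check_dissimilarity_matrix(D, check_triangle=False, tol=1e-12):
    D = np.asarray(D, dtype=float)
    ok = (
        D.ndim == 2 and D.shape[0] == D.shape[1]      # square n x n matrix
        and np.all(D >= -tol)                         # d_ij >= 0
        and np.allclose(np.diag(D), 0.0, atol=tol)    # d_ii = 0
        and np.allclose(D, D.T, atol=tol)             # d_ij = d_ji
    )
    if ok and check_triangle:
        # pseudo-metric property: d_ik <= d_ij + d_jk for all i, j, k
        n = D.shape[0]
        for j in range(n):
            ok = ok and np.all(D <= D[:, [j]] + D[[j], :] + tol)
    return ok

D = np.array([[0.0, 1.0, 2.0], [1.0, 0.0, 1.5], [2.0, 1.5, 0.0]])
print(check_dissimilarity_matrix(D, check_triangle=True))  # True
```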

SLIDE 10

Agglomerative hierarchical clustering: Algorithm

Input: n × n dissimilarity matrix [di,j].

1. Start with n clusters: each data object forms its own cluster.
2. Reduce the number of clusters by joining the two clusters that are most similar (least dissimilar).
3. Repeat step 2 until there is only one cluster left containing all data objects.
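Assuming SciPy is available, this agglomerative scheme can be run directly on a precomputed dissimilarity matrix; a hedged sketch with made-up matrix values:

```python
# Hedged sketch: agglomerative clustering from an n x n dissimilarity matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Symmetric dissimilarity matrix with zero diagonal (illustrative values).
D = np.array([[0.0,  2.0,  6.0, 10.0],
              [2.0,  0.0,  5.0,  9.0],
              [6.0,  5.0,  0.0,  4.0],
              [10.0, 9.0,  4.0,  0.0]])

# SciPy expects the condensed (upper-triangle) form of the matrix.
Z = linkage(squareform(D), method="average")   # one merge per step
print(Z)                                       # merge history
print(fcluster(Z, t=2, criterion="maxclust"))  # labels for 2 clusters
```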

SLIDE 11

Measuring dissimilarity between clusters

The dissimilarity between two clusters containing only one data object each is simply the dissimilarity of the two data objects specified in the dissimilarity matrix [di,j].

SLIDE 12

Measuring dissimilarity between clusters

The dissimilarity between two clusters containing only one data object each is simply the dissimilarity of the two data objects specified in the dissimilarity matrix [di,j]. But how do we compute the dissimilarity between clusters that contain more than one data object?

SLIDE 13

Measuring dissimilarity between clusters

SLIDE 14

Measuring dissimilarity between clusters

Centroid: Distance between the centroids (mean value vectors) of the two clusters.¹

¹ Requires that we can compute the mean vector!

SLIDE 15

Measuring dissimilarity between clusters

Centroid: Distance between the centroids (mean value vectors) of the two clusters.¹
Average Linkage: Average dissimilarity between all pairs of points of the two clusters.

¹ Requires that we can compute the mean vector!

SLIDE 16

Measuring dissimilarity between clusters

Centroid: Distance between the centroids (mean value vectors) of the two clusters.¹
Average Linkage: Average dissimilarity between all pairs of points of the two clusters.
Single Linkage: Dissimilarity between the two most similar data objects of the two clusters.

¹ Requires that we can compute the mean vector!

SLIDE 17

Measuring dissimilarity between clusters

Centroid: Distance between the centroids (mean value vectors) of the two clusters.¹
Average Linkage: Average dissimilarity between all pairs of points of the two clusters.
Single Linkage: Dissimilarity between the two most similar data objects of the two clusters.
Complete Linkage: Dissimilarity between the two most dissimilar data objects of the two clusters.

¹ Requires that we can compute the mean vector!
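For two clusters given as point sets, these measures can be computed directly. A small sketch, assuming Euclidean distance and SciPy; the two clusters are made-up 2-D points:

```python
# Hedged sketch: the four inter-cluster dissimilarities for two point sets.
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # cluster 1
B = np.array([[4.0, 4.0], [5.0, 4.0]])               # cluster 2

D = cdist(A, B)                          # pairwise Euclidean distances
print("single linkage  :", D.min())      # most similar pair
print("complete linkage:", D.max())      # most dissimilar pair
print("average linkage :", D.mean())     # mean over all pairs
print("centroid        :", np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)))
```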

SLIDE 19

Measuring dissimilarity between clusters

Single linkage can “follow chains” in the data (may be desirable in certain applications). Complete linkage leads to very compact clusters. Average linkage also tends clearly towards compact clusters.

SLIDE 20

Measuring dissimilarity between clusters

(Figure: clustering results with single linkage and with complete linkage.)

SLIDE 21

Measuring dissimilarity between clusters

Ward’s method

Another strategy for merging clusters. In contrast to single, complete or average linkage, it takes the number of data objects in each cluster into account.

SLIDE 22

Measuring dissimilarity between clusters

The updated dissimilarity between the newly formed cluster C ∪ C′ and the cluster C′′ is computed in the following way:

d′(C ∪ C′, C′′) =
  • single linkage: min{ d′(C, C′′), d′(C′, C′′) }
  • complete linkage: max{ d′(C, C′′), d′(C′, C′′) }
  • average linkage: ( |C| · d′(C, C′′) + |C′| · d′(C′, C′′) ) / ( |C| + |C′| )
  • Ward: ( (|C| + |C′′|) · d′(C, C′′) + (|C′| + |C′′|) · d′(C′, C′′) − |C′′| · d′(C, C′) ) / ( |C| + |C′| + |C′′| )
  • centroid²: ( 1 / ( |C ∪ C′| · |C′′| ) ) · Σ_{x ∈ C ∪ C′} Σ_{y ∈ C′′} d(x, y)

² If metric, usually the mean vector needs to be computed!
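The single, complete, average and Ward rules above can be written as a small update helper (the centroid rule, which needs the individual points or mean vectors, is left out). A hedged sketch; the function name and argument order are my own:

```python
# Hedged sketch: dissimilarity of the merged cluster (C u C') to a third cluster C''.
def merged_dissimilarity(method, d_c_cc, d_cp_cc, d_c_cp=0.0,
                         n_c=1, n_cp=1, n_cc=1):
    """d_c_cc = d'(C, C''), d_cp_cc = d'(C', C''), d_c_cp = d'(C, C');
    n_c, n_cp, n_cc are the cluster sizes |C|, |C'|, |C''|."""
    if method == "single":
        return min(d_c_cc, d_cp_cc)
    if method == "complete":
        return max(d_c_cc, d_cp_cc)
    if method == "average":
        return (n_c * d_c_cc + n_cp * d_cp_cc) / (n_c + n_cp)
    if method == "ward":
        return ((n_c + n_cc) * d_c_cc + (n_cp + n_cc) * d_cp_cc
                - n_cc * d_c_cp) / (n_c + n_cp + n_cc)
    raise ValueError(method)

# e.g. merging two singleton clusters with distances 2.0 and 6.0 to C''
print(merged_dissimilarity("average", 2.0, 6.0))  # 4.0
```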

SLIDE 23

Dendrograms

The cluster merging process arranges the data points in a binary tree. Draw the data tuples at the bottom or on the left (equally spaced if they are multi-dimensional). Draw a connection between clusters that are merged, with the distance to the data points representing the distance between the clusters.

SLIDE 24

Hierarchical clustering

Example

Clustering of the 1-dimensional data set {2, 12, 16, 25, 29, 45}. All three approaches to measure the distance between clusters lead to different dendrograms.
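This example can be reproduced with SciPy's linkage function; a minimal sketch (dendrogram drawing is omitted) that prints the merge heights for the three methods:

```python
# Hedged sketch: hierarchical clustering of the 1-D data set {2, 12, 16, 25, 29, 45}.
import numpy as np
from scipy.cluster.hierarchy import linkage

data = np.array([[2.0], [12.0], [16.0], [25.0], [29.0], [45.0]])

for method in ("centroid", "single", "complete"):
    Z = linkage(data, method=method)   # merge heights differ per method
    print(method, Z[:, 2])             # distances at which clusters are merged
```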

SLIDE 25

Hierarchical clustering

(Figure: dendrograms of the example for centroid, single linkage, and complete linkage.)

SLIDE 26

Dendrograms

SLIDE 29

Choosing the right clusters

Simplest Approach:

  • Specify a minimum desired distance between clusters.
  • Stop merging clusters if the closest two clusters are farther apart than this distance.

Visual Approach:

  • Merge clusters until all data points are combined into one cluster.
  • Draw the dendrogram and find a good cut level.
  • Advantage: The cut need not be strictly horizontal.

More Sophisticated Approaches:

  • Analyze the sequence of distances in the merging process.
  • Try to find a step in which the distance between the two clusters merged is considerably larger than the distance of the previous step.
  • Several heuristic criteria exist for this step selection (see the sketch below).
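The "largest gap" idea from the last list can be sketched as follows, assuming SciPy: compare successive merge distances in the linkage matrix and cut just above the last merge before the largest jump (the thresholding details are illustrative):

```python
# Hedged sketch: choose a cut level from the largest jump in merge distances.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

data = np.array([[2.0], [12.0], [16.0], [25.0], [29.0], [45.0]])
Z = linkage(data, method="single")

heights = Z[:, 2]                        # merge distances, non-decreasing
gaps = np.diff(heights)                  # jumps between successive merges
cut = heights[np.argmax(gaps)] + 1e-9    # cut just above the merge before the largest jump
labels = fcluster(Z, t=cut, criterion="distance")
print(heights, labels)
```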

SLIDE 30

Heatmaps

A heatmap combines a dendrogram resulting from clustering the data, a dendrogram resulting from clustering the attributes and colours to indicate the values of the attributes.
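Such a combined heatmap-plus-dendrograms plot can be produced, for example, with seaborn's clustermap function (assumed available); a minimal sketch on random data:

```python
# Hedged sketch: heatmap with row and column dendrograms on illustrative data.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(10, 4)), columns=["a1", "a2", "a3", "a4"])

# Rows and columns are clustered hierarchically; cell colours show the values.
sns.clustermap(data, method="average", metric="euclidean", cmap="viridis")
plt.show()
```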

SLIDE 31

Example: Heatmap and dendrogram

(Figure: heatmap with dendrograms for a small two-attribute (x, y) data set of 10 objects; colour key and histogram shown.)

SLIDE 32

Example: Heatmap and dendrogram

(Figure: heatmap and dendrogram for a data set of 150 objects; colour key and histogram shown.)

SLIDE 33

Example: Heatmap and dendrogram

(Figure: heatmap and dendrogram for a data set of 100 objects; colour key and histogram shown.)

SLIDE 34

Iris Data: Heatmap and dendrogram

(Figure: heatmap and dendrograms for the Iris data with attribute columns sw, sl, pl, pw; colour key and histogram shown.)

SLIDE 35

Divisive hierarchical clustering

The top-down approach of divisive hierarchical clustering is rarely used. In agglomerative clustering the minimum of the pairwise dissimilarities has to be determined, leading to a quadratic complexity in each step (quadratic in the number of clusters still present in the corresponding step). In divisive clustering, for each cluster all possible splits would have to be considered. In the first step, there are 2^(n−1) − 1 possible splits, where n is the number of data objects.

SLIDE 36

What is Similarity?

SLIDE 37

How to cluster these objects?

SLIDE 40

Clustering example

SLIDE 41

Clustering example

SLIDE 42

Clustering example

SLIDE 43

Scaling

The previous three slides show the same data set. In the second slide, the unit on the x-axis was changed to centi-units. In the third slide, the unit on the y-axis was changed to centi-units.

SLIDE 44

Scaling

The previous three slides show the same data set. In the second slide, the unit on the x-axis was changed to centi-units. In the third slide, the unit on the y-axis was changed to centi-units. Clusters should not depend on the measurement unit!

SLIDE 45

Scaling

The previous three slides show the same data set. In the second slide, the unit on the x-axis was changed to centi-units. In the third slide, the unit on the y-axis was changed to centi-units. Clusters should not depend on the measurement unit! Therefore, some kind of normalisation (see the chapter on data preparation) should be carried out before clustering.

SLIDE 46

Complex Similarities: An Example

A few Adrenalin-like drug candidates: Adrenalin and the candidates (B), (C), (D), (E).

SLIDE 47

Complex Similarities: An Example

Similarity: Polarity

SLIDE 48

Complex Similarities: An Example

Dissimilarity: Hydrophobic / Hydrophilic

SLIDE 49

Complex Similarities: An Example

Similar to Adrenalin... Adrenalin, Amphetamin, Ephedrin, Dopamin, MDMA.

SLIDE 50

Complex Similarities: An Example

Similar to Adrenalin... but some cross the blood-brain barrier: Adrenalin, Amphetamin (Speed), Ephedrin, Dopamin, MDMA (Ecstasy).

SLIDE 51

Similarity Measures

SLIDE 52

Notion of (dis-)similarity: Numerical attributes

Various choices for dissimilarities between two numerical vectors x and y:

  • Minkowski (Lp): dp(x, y) = ( Σ_{i=1..n} |xi − yi|^p )^(1/p)
  • Euclidean (L2): dE(x, y) = √( (x1 − y1)² + … + (xn − yn)² )
  • Manhattan (L1): dM(x, y) = |x1 − y1| + … + |xn − yn|
  • Tschebyschew (L∞): d∞(x, y) = max{ |x1 − y1|, …, |xn − yn| }
  • Cosine: dC(x, y) = 1 − x⊤y / ( ‖x‖ · ‖y‖ )
  • Tanimoto: dT(x, y) = x⊤y / ( ‖x‖² + ‖y‖² − x⊤y )
  • Pearson: Euclidean distance of the z-score transformed x and y
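These measures translate directly into code. A hedged sketch using NumPy/SciPy; the two vectors are made up:

```python
# Hedged sketch: the dissimilarity measures listed above for two numerical vectors.
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 4.0])

print("Minkowski (p=3):", distance.minkowski(x, y, p=3))
print("Euclidean      :", distance.euclidean(x, y))
print("Manhattan      :", distance.cityblock(x, y))
print("Tschebyschew   :", distance.chebyshev(x, y))
print("Cosine         :", distance.cosine(x, y))            # 1 - cosine of the angle
print("Tanimoto       :", x @ y / (x @ x + y @ y - x @ y))  # as defined above
z = lambda v: (v - v.mean()) / v.std()                      # z-score transform
print("Pearson        :", distance.euclidean(z(x), z(y)))
```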

SLIDE 53

Notion of (dis-)similarity: Binary attributes

The two values (e.g. 0 and 1) of a binary attribute can be interpreted as some property being absent (0) or present (1). In this sense, a vector of binary attributes can be interpreted as the set of properties that the corresponding object has.

Example

The binary vector (0, 1, 1, 0, 1) corresponds to the set of properties {a2, a3, a5}. The binary vector (0, 0, 0, 0, 0) corresponds to the empty set. The binary vector (1, 1, 1, 1, 1) corresponds to the set {a1, a2, a3, a4, a5}.

SLIDE 54

Notion of (dis-)similarity: Binary attributes

Dissimilarity measures for two vectors of binary attributes. Each data object is represented by the corresponding set of properties that are present (X and Y denote these sets, Ω the set of all properties).

  • simple match: dS = 1 − (b + n) / (b + n + x)
  • Russel & Rao: dR = 1 − b / (b + n + x)   (for sets: 1 − |X ∩ Y| / |Ω|)
  • Jaccard: dJ = 1 − b / (b + x)   (for sets: 1 − |X ∩ Y| / |X ∪ Y|)
  • Dice: dD = 1 − 2b / (2b + x)   (for sets: 1 − 2|X ∩ Y| / (|X| + |Y|))

where, counting the predicates (binary attributes),
  b = number that hold in both records,
  n = number that hold in neither record,
  x = number that hold in only one of the two records.

Example: for the binary vectors 101000 and 111000, i.e. the sets X = {a1, a3} and Y = {a1, a2, a3}, we get b = 2, n = 3, x = 1, and therefore dS = 1/6 ≈ 0.167, dR = 2/3 ≈ 0.667, dJ = 1/3 ≈ 0.333, dD = 0.20.
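The example row can be verified with a few lines of Python (variable names are my own; m is used for the mismatch count to avoid clashing with the vector x):

```python
# Hedged sketch: binary dissimilarity measures for the example vectors above.
x = [1, 0, 1, 0, 0, 0]   # set X = {a1, a3}
y = [1, 1, 1, 0, 0, 0]   # set Y = {a1, a2, a3}

b = sum(xi == 1 and yi == 1 for xi, yi in zip(x, y))   # hold in both
n = sum(xi == 0 and yi == 0 for xi, yi in zip(x, y))   # hold in neither
m = sum(xi != yi for xi, yi in zip(x, y))              # hold in only one

d_simple  = 1 - (b + n) / (b + n + m)   # 1/6 ~ 0.167
d_russel  = 1 - b / (b + n + m)         # 2/3 ~ 0.667
d_jaccard = 1 - b / (b + m)             # 1/3 ~ 0.333
d_dice    = 1 - 2 * b / (2 * b + m)     # 0.20
print(d_simple, d_russel, d_jaccard, d_dice)
```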

SLIDE 55

Notion of (dis-)similarity: Nominal attributes

Nominal attributes may be transformed into a set of binary attributes, each of them indicating one particular feature of the attribute (1-of-n coding).

SLIDE 56

Notion of (dis-)similarity: Nominal attributes

Nominal attributes may be transformed into a set of binary attributes, each of them indicating one particular feature of the attribute (1-of-n coding).

Example

Attribute Manufacturer with the values BMW, Chrysler, Dacia, Ford, Volkswagen:

  manufacturer → binary vector
  Volkswagen → 00001
  Dacia → 00100
  Ford → 00010

Then one of the dissimilarity measures for binary attributes can be applied.
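A hedged sketch of 1-of-n coding for the Manufacturer attribute, followed by a Jaccard-style comparison of two encoded values (the helper function is my own):

```python
# Hedged sketch: 1-of-n coding of a nominal attribute plus a binary dissimilarity.
values = ["BMW", "Chrysler", "Dacia", "Ford", "Volkswagen"]

def one_of_n(value):
    # one binary attribute per possible value
    return [1 if v == value else 0 for v in values]

a = one_of_n("Volkswagen")   # [0, 0, 0, 0, 1]
b = one_of_n("Dacia")        # [0, 0, 1, 0, 0]

# Jaccard dissimilarity on the binary vectors (1.0 for different values, 0.0 for equal)
both = sum(ai and bi for ai, bi in zip(a, b))
one  = sum(ai != bi for ai, bi in zip(a, b))
print(a, b, 1 - both / (both + one))
```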

SLIDE 57

Notion of (dis-)similarity: Nominal attributes

Nominal attributes may be transformed into a set of binary attributes, each of them indicating one particular feature of the attribute (1-of-n coding).

Example

Attribute Manufacturer with the values BMW, Chrysler, Dacia, Ford, Volkswagen:

  manufacturer → binary vector
  Volkswagen → 00001
  Dacia → 00100
  Ford → 00010

Then one of the dissimilarity measures for binary attributes can be applied. Another way to measure similarity between two vectors of nominal attributes is to compute the proportion of attributes where both vectors have the same value, leading to the Russel & Rao dissimilarity measure.

SLIDE 58

Prototype Based Clustering

SLIDE 59

Prototype Based Clustering

Given: a data set of size n.
Return: a set of typical examples (prototypes) of size k ≪ n.

SLIDE 60

k-Means clustering

  • Choose a number k of clusters to be found (user input).
  • Initialize the cluster centres randomly (for instance, by randomly selecting k data points).
  • Data point assignment: Assign each data point to the cluster centre that is closest to it (i.e. closer than any other cluster centre).
  • Cluster centre update: Compute new cluster centres as the mean vectors of the assigned data points. (Intuitively: centre of gravity if each data point has unit weight.)
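The two alternating steps translate into a short NumPy sketch (initialisation by randomly selecting k data points, as suggested above; data, convergence test and the handling of edge cases are illustrative):

```python
# Hedged sketch: basic k-means loop (empty clusters are not handled here).
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]  # k random data points
    for _ in range(n_iter):
        # data point assignment: closest centre for every point
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # cluster centre update: mean vector of the assigned points
        new_centres = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, size=(20, 2)) for c in (0.0, 5.0)])
print(kmeans(X, k=2)[0])
```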

SLIDE 61

k-Means clustering

Repeat these two steps (data point assignment and cluster centre update) until the cluster centres do not change anymore. It can be shown that this scheme must converge, i.e. the update of the cluster centres cannot go on forever.

SLIDE 62

k-Means clustering

Aim: Minimize the objective function

  f = Σ_{i=1..k} Σ_{j=1..n} uij · dij

under the constraints uij ∈ {0, 1} and Σ_{i=1..k} uij = 1 for all j = 1, …, n, where uij indicates whether data point j is assigned to cluster i and dij is the (typically squared Euclidean) distance between data point j and cluster centre i.
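The objective can be evaluated for given centres and assignments; a small sketch in which dij is taken to be the squared Euclidean distance (an assumption, since the slide leaves dij unspecified):

```python
# Hedged sketch: evaluate the k-means objective for given centres and labels.
import numpy as np

def kmeans_objective(X, centres, labels):
    # f = sum_i sum_j u_ij * d_ij, with u_ij = 1 iff point j is assigned to cluster i
    d = np.linalg.norm(X - centres[labels], axis=1) ** 2
    return d.sum()

X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
centres = np.array([[0.5, 0.0], [5.0, 5.0]])
labels = np.array([0, 0, 1])
print(kmeans_objective(X, centres, labels))   # 0.25 + 0.25 + 0.0 = 0.5
```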

SLIDE 63

Alternating optimization

Assuming the cluster centres to be fixed, uij = 1 should be chosen for the cluster i to which data object xj has the smallest distance, in order to minimize the objective function.

Assuming the assignments to the clusters to be fixed, each cluster centre should be chosen as the mean vector of the data objects assigned to the cluster, in order to minimize the objective function.

SLIDE 64

k-Means clustering: Example

SLIDE 70

k-Means clustering: Local minima

Clustering is successful in this example: The clusters found are those that would have been formed intuitively. Convergence is achieved after only 5 steps. (This is typical: convergence is usually very fast.) However: The clustering result is fairly sensitive to the initial positions of the cluster centres. With a bad initialisation clustering may fail (the alternating update process gets stuck in a local minimum).
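A common remedy is to run k-means several times from different random initialisations and keep the run with the smallest objective value; scikit-learn's KMeans does this via its n_init parameter. A minimal sketch, library assumed available:

```python
# Hedged sketch: multiple random restarts to reduce the risk of bad local minima.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(30, 2)) for c in (0.0, 4.0, 8.0)])

# 10 random initialisations; the run with the lowest objective (inertia) is kept.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_, km.cluster_centers_)
```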

SLIDE 71

k-Means clustering: Local minima

SLIDE 72

Gaussian Mixture Models

SLIDE 73

Gaussian mixture models – EM clustering

Assumption: Data was generated by sampling a set of normal distributions. (The probability density is a mixture of normal distributions.) Aim: Find the parameters for the normal distributions and how much each normal distribution contributes to the data.

SLIDE 74

Gaussian mixture models

(Figure: two normal distributions.)

SLIDE 75

Gaussian mixture models

(Figures: two normal distributions, and the mixture model in which both normal distributions contribute 50%.)

SLIDE 76

Gaussian mixture models

(Figures: two normal distributions; the mixture model in which both contribute 50%; and a mixture model in which one normal distribution contributes 10% and the other 90%.)

SLIDE 77

Gaussian mixture models

SLIDE 78

Gaussian mixture models – EM clustering

Assumption: Data were generated by sampling a set of normal distributions. (The probability density is a mixture of normal distributions.) Aim: Find the parameters for the normal distributions and how much each normal distribution contributes to the data. Algorithm: EM clustering (expectation maximisation). Alternating scheme in which the parameters of the normal distributions and the likelihoods of the data points to be generated by the corresponding normal distributions are estimated.
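Such a mixture model can be fitted, for instance, with scikit-learn's GaussianMixture, which implements an EM scheme of this kind; a minimal sketch on illustrative 1-D data:

```python
# Hedged sketch: fitting a Gaussian mixture model with EM (data are made up).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# sample from two normal distributions contributing 50% each
X = np.concatenate([rng.normal(-1.0, 0.5, 200),
                    rng.normal(2.0, 1.0, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.weights_)              # estimated mixture proportions
print(gmm.means_.ravel())        # estimated means of the normal distributions
print(gmm.predict_proba(X[:3]))  # soft cluster memberships of the first points
```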

SLIDE 79

Density Based Clustering

SLIDE 80

Density-based clustering

For numerical data, density-based clustering algorithms often yield the best results. Principle: A connected region with high data density corresponds to one cluster. DBScan is one of the most popular density-based clustering algorithms.

SLIDE 81

Density-based clustering: DBScan

Principle idea of DBScan:

1. Find a data point where the data density is high, i.e. in whose ε-neighbourhood there are at least ℓ other points. (ε and ℓ are parameters of the algorithm to be chosen by the user.)
2. All the points in the ε-neighbourhood are considered to belong to one cluster.
3. Expand this ε-neighbourhood (the cluster) as long as the high density criterion is satisfied.
4. Remove the cluster (all data points assigned to the cluster) from the data set and continue with step 1 as long as data points with a high data density around them can be found.
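DBScan's ε and ℓ correspond to the eps and min_samples parameters of scikit-learn's DBSCAN implementation; a minimal sketch on illustrative data:

```python
# Hedged sketch: DBSCAN on two made-up blobs.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0.0, 3.0)])

# eps ~ the neighbourhood radius; min_samples ~ the density threshold
# (note: min_samples counts the point itself, unlike "at least l other points").
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
print(set(db.labels_))   # cluster labels; -1 marks noise points
```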

SLIDE 82

Density-based clustering: DBScan

(Figure: DBScan illustration on a grid; legend: grid cell, neighbourhood, cell with at least 3 hits.)
