SLIDE 1

Data Warehousing and Machine Learning

Preprocessing

Thomas D. Nielsen

Aalborg University, Department of Computer Science

Spring 2008

SLIDE 2

Preprocessing

Before you can start on the actual data mining, the data may require some preprocessing:

  • Attributes may be redundant.
  • Values may be missing.
  • The data contains outliers.
  • The data is not in a suitable format.
  • The values appear inconsistent.

Garbage in, garbage out


SLIDES 3-13

Preprocessing

Data Cleaning

ID    Zip     Gender  Income    Age  Marital status  Transaction amount
1001  10048   M       75000     C    M               5000
1002  J2S7K7  F       −40000    40   W               4000
1003  90210           10000000  45   S               7000
1004  6269    M       50000          S               1000
1005  55101   F       99999     30   D               3000

Problems flagged in the table:

  • Correct zip code?
  • Missing value!
  • Error/outlier!
  • Error!
  • Unexpected precision.
  • Categorical value?
  • Error/missing value?

Other issues:

  • What are the semantics of the marital status?
  • What is the unit of measure for the transaction field?

SLIDE 14

Preprocessing

Missing Values

In many real-world databases you will be faced with the problem of missing data:

Id.  Savings  Assets  Income ($1000s)  Credit Risk
1    Medium   High    75               Good
2    Low      Low     50               Bad
3                     25               Bad
4    Medium   Medium                   Good
5    Low      Medium  100              Good
6    High     High    25               Good
7    Low              25               Bad
8    Medium   Medium  75               Good

By simply discarding the records with missing data we might unintentionally bias the data.

SLIDE 15

Preprocessing

Missing Values

Possible strategies for handling missing data:

  • Use a predefined constant.
  • Use the mean (for numerical variables) or the mode (for categorical variables).
  • Use a value drawn randomly from the observed distribution.

Id.  Savings  Assets  Income ($1000s)  Credit Risk
1    Medium   High    75               Good
2    Low      Low     50               Bad
3                     25               Bad
4    Medium   Medium                   Good
5    Low      Medium  100              Good
6    High     High    25               Good
7    Low              25               Bad
8    Medium   Medium  75               Good

SLIDE 16

Preprocessing

Missing Values

Possible strategies for handling missing data:

  • Use a predefined constant.
  • Use the mean (for numerical variables) or the mode (for categorical variables).
  • Use a value drawn randomly from the observed distribution.

Id.  Savings  Assets  Income ($1000s)  Credit Risk
1    Medium   High    75               Good
2    Low      Low     50               Bad
3    Low              25               Bad
4    Medium   Medium                   Good
5    Low      Medium  100              Good
6    High     High    25               Good
7    Low              25               Bad
8    Medium   Medium  75               Good

Both Low and Medium are ’modes’ for Savings; here Low has been filled in for record 3.

SLIDE 17

Preprocessing

Missing Values

Possible strategies for handling missing data:

  • Use a predefined constant.
  • Use the mean (for numerical variables) or the mode (for categorical variables).
  • Use a value drawn randomly from the observed distribution.

Id.  Savings  Assets  Income ($1000s)  Credit Risk
1    Medium   High    75               Good
2    Low      Low     50               Bad
3    Low      High    25               Bad
4    Medium   Medium                   Good
5    Low      Medium  100              Good
6    High     High    25               Good
7    Low      Medium  25               Bad
8    Medium   Medium  75               Good

High and Medium (records 3 and 7) are drawn randomly from the observed distribution for Assets.

SLIDE 18

Preprocessing

Missing Values

Possible strategies for handling missing data:

  • Use a predefined constant.
  • Use the mean (for numerical variables) or the mode (for categorical variables).
  • Use a value drawn randomly from the observed distribution.

Id.  Savings  Assets  Income ($1000s)  Credit Risk
1    Medium   High    75               Good
2    Low      Low     50               Bad
3    Low      High    25               Bad
4    Medium   Medium  54               Good
5    Low      Medium  100              Good
6    High     High    25               Good
7    Low      Medium  25               Bad
8    Medium   Medium  75               Good

54 ≈ (75 + 50 + 25 + 100 + 25 + 25 + 75) / 7, the mean of the observed Income values.
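A minimal sketch of the three strategies in Python/pandas. The DataFrame below merely mirrors the slide's toy table; the column names and the fill constants are illustrative assumptions, not part of the original slides.

```python
import pandas as pd
import numpy as np

# Toy data mirroring the slide: NaN marks the missing entries.
df = pd.DataFrame({
    "Savings": ["Medium", "Low", np.nan, "Medium", "Low", "High", "Low", "Medium"],
    "Assets":  ["High", "Low", np.nan, "Medium", "Medium", "High", np.nan, "Medium"],
    "Income":  [75, 50, 25, np.nan, 100, 25, 25, 75],
})

# 1. Predefined constant.
df_const = df.fillna({"Savings": "Unknown", "Assets": "Unknown", "Income": 0})

# 2. Mean for numerical variables, mode for categorical variables.
df_mean_mode = df.copy()
df_mean_mode["Income"] = df["Income"].fillna(df["Income"].mean())   # ~53.6
for col in ["Savings", "Assets"]:
    df_mean_mode[col] = df[col].fillna(df[col].mode()[0])

# 3. Random draw from the observed (empirical) distribution of each column.
rng = np.random.default_rng(0)
df_random = df.copy()
for col in df.columns:
    observed = df[col].dropna().to_numpy()
    missing = df_random[col].isna()
    df_random.loc[missing, col] = rng.choice(observed, size=missing.sum())

print(df_mean_mode)
```

Note that pandas' mode() returns all tied modes in sorted order, so mode()[0] silently picks one of them (here Low), mirroring the slide's remark that both Low and Medium are modes for Savings.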

SLIDE 19

Preprocessing

Discretization

Some data mining algorithms can only handle discrete attributes. Possible solution: divide the continuous range into intervals.

Example: (Income, Risk) = (25, B), (25, B), (50, G), (51, B), (54, G), (75, G), (75, G), (100, G), (100, G)

Unsupervised discretization

Equal-width binning (width 25):
  • Bin 1: 25, 25             [25, 50)
  • Bin 2: 50, 51, 54         [50, 75)
  • Bin 3: 75, 75, 100, 100   [75, 100]

Equal-frequency binning (bin density 3):
  • Bin 1: 25, 25, 50         [25, 50.5)
  • Bin 2: 51, 54, 75, 75     [50.5, 87.5)
  • Bin 3: 100, 100           [87.5, 100]
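For the unsupervised case, a small NumPy sketch of both binning schemes. The equal-frequency cut points come from sample quantiles, so they may differ slightly from the hand-picked boundaries on the slide.

```python
import numpy as np

income = np.array([25, 25, 50, 51, 54, 75, 75, 100, 100])

# Equal-width binning: split [min, max] into bins of equal width.
width = 25
edges_width = np.arange(income.min(), income.max() + width, width)   # 25, 50, 75, 100
bins_width = np.digitize(income, edges_width[1:-1])                  # bin index per value

# Equal-frequency binning: each bin gets (roughly) the same number of points.
n_bins = 3
quantile_edges = np.quantile(income, [1/3, 2/3])                     # interior cut points
bins_freq = np.digitize(income, quantile_edges)

for name, bins in [("equal width", bins_width), ("equal frequency", bins_freq)]:
    print(name, [income[bins == b].tolist() for b in range(n_bins)])
```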

SLIDE 20

Preprocessing

Supervised discretization

Take the class distribution into account when selecting the intervals. For example, recursively bisect the interval by selecting the split point v giving the highest information gain:

Gain(S, v) = Ent(S) − [ (|S≤v| / |S|) · Ent(S≤v) + (|S>v| / |S|) · Ent(S>v) ]

until some stopping criterion is met.

(Income, Risk) = (25, B), (25, B), (50, G), (51, B), (54, G), (75, G), (75, G), (100, G), (100, G)

Ent(S) = − ( (3/9) · log2(3/9) + (6/9) · log2(6/9) ) = 0.9183

Split  E-Ent   Intervals
25     0.4602  (−∞, 25], (25, ∞)
50     0.7395  (−∞, 50], (50, ∞)
51     0.3606  (−∞, 51], (51, ∞)
54     0.5394  (−∞, 54], (54, ∞)
75     0.7663  (−∞, 75], (75, ∞)

E-Ent is the expected (weighted) entropy after the split; the split maximizing the gain is the one minimizing E-Ent, here 51.
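The entropy computation behind the table can be sketched in a few lines of Python; it reproduces Ent(S) and the E-Ent column for each candidate split point.

```python
import math
from collections import Counter

data = [(25, "B"), (25, "B"), (50, "G"), (51, "B"), (54, "G"),
        (75, "G"), (75, "G"), (100, "G"), (100, "G")]

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def expected_entropy(data, v):
    """Weighted entropy after splitting on income <= v vs. income > v."""
    left = [r for x, r in data if x <= v]
    right = [r for x, r in data if x > v]
    n = len(data)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

print("Ent(S) =", round(entropy([r for _, r in data]), 4))   # 0.9183
for v in (25, 50, 51, 54, 75):
    print(v, round(expected_entropy(data, v), 4))            # smallest E-Ent at split 51
```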

SLIDES 21-23

Preprocessing

Data Transformation

Some data mining tools tend to give variables with a large range a higher significance than variables with a smaller range. For example:

  • Age versus income.

The typical approach is to standardize the scales:

Min-max normalization: X* = (X − min(X)) / (max(X) − min(X))

[Plot: normalized values against original values for attributes A1 and A2]

Z-score standardization: X* = (X − mean(X)) / SD(X)

[Plot: standardized values against original values for attributes A1 and A2]
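Both rescalings in a few lines of Python; the age and income values below are made up purely for illustration.

```python
import numpy as np

age = np.array([23, 35, 47, 52, 61], dtype=float)                      # small range
income = np.array([20000, 45000, 75000, 99000, 120000], dtype=float)   # large range

def min_max(x):
    """Scale to [0, 1]: (x - min) / (max - min)."""
    return (x - x.min()) / (x.max() - x.min())

def z_score(x):
    """Center on the mean and scale to unit standard deviation."""
    return (x - x.mean()) / x.std()

print(min_max(income))
print(z_score(age))
```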

SLIDES 24-25

Preprocessing

Outliers

Data: 1, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 6, 7, 7, 8, 8, 8, 20.

[Histogram of the data values]

Summary statistics:

  • First quartile (1Q): 25% of the data = 4.
  • Second quartile (2Q): 50% of the data = 6.
  • Third quartile (3Q): 75% of the data = 7.

Interquartile range IQR = 3Q − 1Q = 3.

A data point may be an outlier if:

  • It is lower than 1Q − 1.5 · IQR = 4 − 1.5 · 3 = −0.5.
  • It is higher than 3Q + 1.5 · IQR = 7 + 1.5 · 3 = 11.5.
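The quartile/IQR rule sketched in Python; for this particular data set, NumPy's default quantile interpolation reproduces the quartiles quoted above.

```python
import numpy as np

data = np.array([1, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 6, 7, 7, 8, 8, 8, 20])

q1, q3 = np.quantile(data, [0.25, 0.75])    # 4.0 and 7.0 for this data
iqr = q3 - q1                               # 3.0

lower_fence = q1 - 1.5 * iqr                # -0.5
upper_fence = q3 + 1.5 * iqr                # 11.5

outliers = data[(data < lower_fence) | (data > upper_fence)]
print(outliers)                             # [20]
```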

SLIDE 26

Data Warehousing and Machine Learning

Clustering

Thomas D. Nielsen

Aalborg University, Department of Computer Science

Spring 2008

SLIDE 27

Clustering

Unlabeled Data

The Iris data with class labels removed:

SL   SW   PL   PW
5.1  3.5  1.4  0.2
4.9  3.0  1.4  0.2
6.3  2.9  6.0  2.1
6.3  2.5  4.9  1.5
...  ...  ...  ...

Unlabeled data in general: (discrete or continuous) attributes, no class variable.

SLIDE 28

Clustering

Clustering

A clustering of the data S = {s1, . . . , sN} consists of a set C = {c1, . . . , ck} of cluster labels and a cluster assignment ca : S → C.

Clustering Iris with C = {blue, red}: [figure]

Note: a clustering partitions the data points, not necessarily the instance space. When the cluster labels have no particular significance, a clustering can also be identified with a partition S = S1 ∪ . . . ∪ Sk, where Si = ca⁻¹(ci).

SLIDE 29

Clustering

Clustering goal

[Figure: a candidate clustering (indicated by colors) of data cases in the instance space; arrows indicate selected between-cluster and within-cluster distances.]

General goal: find a clustering with large between-cluster variation (sum of between-cluster distances) and small within-cluster variation (sum of within-cluster distances). The concrete goal varies according to the exact distance definition.

SLIDE 30

Clustering

Examples

  • Group plants/animals into families or related species, based on
    • morphological features
    • molecular features
  • Identify types of customers based on attributes in a database (they can then be targeted by special advertising campaigns)
  • Web mining: group web pages according to content

SLIDE 31

Clustering

Clustering vs. Classification

The cluster label can be interpreted as a hidden class variable

  • that is never observed,
  • whose number of states is unknown,
  • on which the distribution of attribute values depends.

Clustering is often called unsupervised learning, in contrast to the supervised learning of classifiers: in supervised learning, correct class labels for the training data are provided to the learning algorithm by a supervisor, or teacher.

One key problem in clustering is determining the “right” number of clusters. Two different approaches:

  • Partition-based clustering
  • Hierarchical clustering

All clustering methods require a distance measure on the instance space!

SLIDE 32

Clustering

Partition-based Clustering

The number k of clusters is fixed (user defined). Partition the data into k clusters.

k-means clustering

Assume that

  • there is a distance function d(s, s′) defined between data items,
  • we can compute the mean value of a collection {s1, . . . , sl} of data items.

Initialize: randomly pick initial cluster centers c = (c1, . . . , ck) from S
repeat
    for i = 1, . . . , k:
        Si := {s ∈ S | ci = arg min_{c ∈ c} d(c, s)}
        c_old,i := ci
        ci := mean(Si)
        ca(s) := ci   (for all s ∈ Si)
until c = c_old
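A compact Python rendering of the pseudocode above, assuming Euclidean distance and initial centers sampled from the data; empty clusters are not handled, to keep the sketch short.

```python
import numpy as np

def k_means(S, k, rng=np.random.default_rng(0)):
    """Cluster the rows of S (an n x d array) into k clusters."""
    centers = S[rng.choice(len(S), size=k, replace=False)]    # initial centers picked from S
    while True:
        # Assign every point to its nearest center.
        dists = np.linalg.norm(S[:, None, :] - centers[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned points.
        new_centers = np.array([S[assignment == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centers, centers):                 # c = c_old: converged
            return assignment, centers
        centers = new_centers

# Example usage on random 2-D data.
data = np.random.default_rng(1).normal(size=(300, 2))
labels, centers = k_means(data, k=3)
```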

SLIDES 33-42

Clustering

Example, k = 3:

[Sequence of figures: starting from initial centers c1, c2, c3, the data are repeatedly partitioned into S1, S2, S3 and the centers recomputed, until the assignment stabilizes.]

SLIDE 43

Clustering

Example (cont.)

Result of clustering the same data with k = 2: [figure with centers c1, c2 and clusters S1, S2]

The result can depend on the choice of initial cluster centers!

SLIDE 44

Clustering

Outliers

The result of partitional clustering can be skewed by outliers. Example with k = 2: [figure]

Useful preprocessing: outlier detection and elimination (but be careful not to eliminate interesting outliers!).

SLIDES 45-46

Clustering

k-Means as optimization

With a Euclidean distance function dist we can use the sum of squared errors for evaluating a clustering:

SSE = Σ_{i=1..k} Σ_{x ∈ Ci} dist(x, ci)²

k-means directly tries to minimize this error:

Initialize: randomly pick initial cluster centers c = (c1, . . . , ck) from S
repeat
    for i = 1, . . . , k:
        Si := {s ∈ S | ci = arg min_{c ∈ c} d(c, s)}   // minimizes the SSE for the current centers
        c_old,i := ci
        ci := mean(Si)                                  // the centroid that minimizes the SSE for the assigned objects
        ca(s) := ci   (for all s ∈ Si)
until c = c_old

Only guaranteed to find a local minimum.
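As a small companion to the formula, the SSE of a clustering returned by the k_means sketch from Slide 32 could be computed like this (data, labels, and centers are the illustrative names used there, not fixed by the slides).

```python
import numpy as np

def sse(S, assignment, centers):
    """Sum of squared Euclidean distances from each point to its assigned cluster center."""
    return sum(np.sum((S[assignment == i] - c) ** 2) for i, c in enumerate(centers))

print(sse(data, labels, centers))
```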

SLIDES 47-48

Hierarchical Clustering

Reducing SSE

Choosing initial centroids:

  • Perform multiple runs with random initializations.
  • Initialize centroids based on results from another algorithm (e.g. hierarchical clustering).
  • . . .

Postprocessing:

  • Split a cluster.
  • Disperse a cluster (choose the one whose removal increases the SSE the least).
  • Merge two clusters (the two with closest centroids, or the two whose merge increases the SSE the least).

SLIDES 49-52

Hierarchical Clustering

Hierarchical clustering

The “right” number of clusters may not only be unknown, it may also be quite ambiguous:

[Figures: the same data set clustered at several different granularities]

Provide an explicit representation of nested clusterings of different granularity.

SLIDE 53

Hierarchical Clustering

Agglomerative hierarchical clustering

Extend the distance function d(s, s′) to a distance function D(S, S′) between sets of data items. Two out of many possibilities:

D_average(S, S′) := (1 / (|S| · |S′|)) · Σ_{s ∈ S, s′ ∈ S′} d(s, s′)

D_min(S, S′) := min_{s ∈ S, s′ ∈ S′} d(s, s′)

for i = 1, . . . , N:  Si := {si}
while the current partition S1 ∪ . . . ∪ Sk of S contains more than one element:
    (i, j) := arg min_{i, j ∈ 1, . . . , k, i ≠ j} D(Si, Sj)
    form a new partition by merging Si and Sj

When D_average is used, this is also called average link clustering; for D_min: single link clustering.
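A naive O(N³) Python sketch of the agglomerative procedure, parameterized by the set distance so that both D_min (single link) and D_average (average link) fit; it follows the pseudocode above rather than any particular library, and the example points are made up.

```python
import numpy as np
from itertools import combinations

def d(s, t):
    return np.linalg.norm(s - t)                      # Euclidean point distance

def D_min(A, B):                                      # single link
    return min(d(a, b) for a in A for b in B)

def D_average(A, B):                                  # average link
    return sum(d(a, b) for a in A for b in B) / (len(A) * len(B))

def agglomerative(S, set_distance):
    """Return the sequence of partitions produced while merging, coarsest last."""
    partition = [[s] for s in S]                      # start: every point is its own cluster
    history = [list(map(list, partition))]
    while len(partition) > 1:
        # Find the pair of clusters with minimal set distance and merge them.
        i, j = min(combinations(range(len(partition)), 2),
                   key=lambda ij: set_distance(partition[ij[0]], partition[ij[1]]))
        partition[i] = partition[i] + partition[j]
        del partition[j]
        history.append(list(map(list, partition)))
    return history

points = [np.array(p, dtype=float) for p in [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]]
for step in agglomerative(points, D_min):
    print([len(c) for c in step])                     # cluster sizes after each merge
```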

SLIDES 54-66

Hierarchical Clustering

[Sequence of figure-only slides illustrating the successive merge steps of agglomerative clustering on an example data set.]

SLIDES 67-68

Hierarchical Clustering

Dendrogram Representation of Hierarchical Clustering

[Dendrogram figure; the vertical axis shows the distance at which components are merged, and horizontal cuts yield, e.g., a 3-clustering or a 5-clustering.]

The length of the distance interval corresponding to a specific clustering can be interpreted as a measure of the significance of this particular clustering.

SLIDES 69-73

Hierarchical Clustering

Single link vs. Average link

[Figures: the 4-clustering for single link and average link, the single link 2-clustering, and the average link 2-clustering.]

Generally: single link will produce rather elongated, linear clusters; average link produces more convex clusters.

SLIDES 74-76

Hierarchical Clustering

Another Example

[Figures: a data set, its single link 2-clustering, and an average link 2-clustering (or similar).]

SLIDE 77

Data Warehousing and Machine Learning

Self Organizing Maps

Thomas D. Nielsen

Aalborg University, Department of Computer Science

Spring 2008

SLIDE 78

Self Organizing Maps

SOMs as Special Neural Networks

[Figure: input layer fully connected to an output layer arranged as a grid]

  • Neural network structure without hidden layers
  • Output neurons structured as a two-dimensional array
  • Connection from the ith input to the jth output has weight w_{i,j}
  • No activation function for output nodes

SLIDE 79

Self Organizing Maps

Kohonen Learning

Given: Unlabeled data a1, . . . , aN ∈ Rⁿ
       Distance measure dn(·, ·) on Rⁿ
       Distance measure dout(·, ·) on the output neurons
       Update function η(t, d) : N × R → R, decreasing in t and d

1. Initialize weight vectors w_j^(0) for the output nodes o_j
2. t := 0
3. repeat
4.     t := t + 1
5.     for i = 1, . . . , N:
6.         let o_j be the output neuron minimizing dn(w_j, a_i)
7.         for all output nodes o_h:
8.             w_h^(t) := w_h^(t−1) + η(t, dout(o_h, o_j)) · (a_i − w_h^(t−1))
9. until termination condition applies

SLIDE 80

Self Organizing Maps

Distances etc.

Possible choices:

  • dn: Euclidean distance
  • dout(o_j, o_h): e.g. 1 if o_j, o_h are neighbors (rectangular or hexagonal layout), or the Euclidean distance on the grid indices
  • η(t, d): e.g. α(t) · exp(−d² / (2σ²(t))) with α(t), σ(t) decreasing in t
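A bare-bones sketch of Kohonen learning in Python, following steps 1-9 above; the grid size, the learning-rate schedule α(t), the neighborhood width σ(t), and the example data are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def train_som(data, grid=(4, 4), epochs=20, rng=np.random.default_rng(0)):
    """Fit a SOM with grid[0] x grid[1] output neurons to data (an N x n array)."""
    rows, cols = grid
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    W = rng.normal(size=(rows * cols, data.shape[1]))           # step 1: initial weights
    for t in range(1, epochs + 1):                              # steps 2-4
        alpha = 1.0 / t                                         # decreasing learning rate
        sigma = max(rows, cols) / t                             # shrinking neighborhood
        for a in data:                                          # step 5
            j = np.argmin(np.linalg.norm(W - a, axis=1))        # step 6: winning neuron
            d_out = np.linalg.norm(coords - coords[j], axis=1)  # grid distance to winner
            eta = alpha * np.exp(-d_out**2 / (2 * sigma**2))    # steps 7-8: update all neurons
            W += eta[:, None] * (a - W)
    return W, coords

# Example usage on random 2-D data.
data = np.random.default_rng(1).uniform(size=(200, 2))
weights, grid_coords = train_som(data)
```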

SLIDE 81

Self Organizing Maps

Intuition

SOM learning can be understood as fitting a 2-dimensional surface to the data:

[Figure: a grid of output neurons, with corners labelled (0,0), (1,0), (0,1), (1,1), bent to fit the data points. Colors indicate association with different output neurons, not data attributes. Some output neurons may not have any associated data cases.]

SLIDE 82

Self Organizing Maps

Example (from Tan et al.)

Data: word occurrence data (?) from 3204 articles from the Los Angeles Times, with (hidden) section labels Entertainment, Financial, Foreign, Metro, National, Sports.

Result of SOM clustering on a 4 × 4 hexagonal grid:

[Figure: 4 × 4 grid of output nodes labelled with the majority label of their associated cases (Sports, Metro, Foreign, Entertainment, National, Financial) and colored according to the number of cases associated with each node (fictional), from low to high density.]

SLIDE 83

Self Organizing Maps

SOMs and k-means

In spite of their roots in neural networks, SOMs are more closely related to k-means clustering:

  • Weight vectors w_j are cluster centers.
  • Kohonen updating associates data cases with cluster centers and repositions the cluster centers to fit the associated data cases.

Differences:

  • 2-dim. “spatial” relationship among cluster centers
  • Data cases are associated with more than one cluster center
  • On-line updating (one case at a time)

SLIDE 84

Self Organizing Maps

Pros and Cons

+ Provides more insight than a basic clustering (i.e. a partitioning of the data)
+ Can produce intuitive representations of clustering results
− No well-defined objective function that is optimized