
Machine Learning: Algorithms and Applications

Floriano Zini
Free University of Bozen-Bolzano, Faculty of Computer Science
Academic Year 2011-2012
Lecture 10: 14 May 2012

Unsupervised Learning (cont…)

Slides courtesy of Bing Liu: www.cs.uic.edu/~liub/WebMiningBook.html


Road map

n Basic concepts n K-means algorithm n Representation of clusters n Hierarchical clustering n Distance functions n Data standardization n Handling mixed attributes n Which clustering algorithm to use? n Cluster evaluation n Summary

Hierarchical Clustering

Produces a nested sequence of clusters, a tree, also called a dendrogram
- Singleton clusters are at the bottom of the tree
- One root cluster covers all the data points
- Sibling clusters partition the data points of their common parent


Types of hierarchical clustering

n Agglomerative (bottom up) clustering: it builds the

dendrogram (tree) from the bottom level, and

q merges the most similar (or nearest) pair of clusters q stops when all the data points are merged into a single

cluster (i.e., the root cluster)

n Divisive (top down) clustering: it starts with all data

points in one cluster, the root

q splits the root into a set of child clusters q each child cluster is recursively divided further q stops when only singleton clusters of individual data points

remain

Agglomerative clustering

It is more popular than divisive methods
- At the beginning, each data point forms a cluster (also called a node)
- Merge the nodes/clusters that have the least distance
- Go on merging
- Eventually all nodes belong to one cluster


Agglomerative clustering algorithm

[Slide figures: the algorithm's pseudocode and an example of the working of the algorithm]
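The pseudocode itself is not preserved in this extraction. As a substitute, here is a minimal Python sketch of the agglomerative procedure, assuming numeric vectors and the single-link distance defined later in this lecture; the function names are illustrative:

```python
import math

def euclidean(p, q):
    """Euclidean distance between two numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_link(c1, c2):
    """Single-link distance: distance of the two closest points across clusters."""
    return min(euclidean(p, q) for p in c1 for q in c2)

def agglomerative(points, target_k=1, cluster_dist=single_link):
    """Repeatedly merge the two nearest clusters until target_k clusters remain."""
    clusters = [[p] for p in points]  # each data point starts as a singleton cluster
    while len(clusters) > target_k:
        # find the pair of clusters with the least distance
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: cluster_dist(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[i].extend(clusters[j])  # merge cluster j into cluster i
        del clusters[j]
    return clusters

# Two natural groups emerge from four 2-D points
print(agglomerative([(0, 0), (0, 1), (5, 5), (5, 6)], target_k=2))
```

Stopping at `target_k` clusters is a convenience for the example; running with `target_k=1` reproduces the full merge sequence that builds the dendrogram.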


Measuring the distance of two clusters

n A few ways to measure distances of two

clusters

q k-means uses only the distances between

centroids

n Different variations of the algorithm

q Single link q Complete link q Average link q Centroids q …

Single link method

n The distance between

two clusters is the distance between two closest data points in the two clusters

q one data point from each

cluster

n It can find arbitrarily

shaped clusters, but

q It may cause the

undesirable “chain effect” by noisy points (in black)

The two natural clusters (in red) are not found


Complete link method

n The distance between two clusters is the distance of two

furthest data points in the two clusters

n It is sensitive to outliers (in black) because they are far away n It usually produces better clusters than the single-link method

Average link and centroid methods

Average link method
- A compromise between
  - the sensitivity of complete-link clustering to outliers and
  - the tendency of single-link clustering to form long chains that do not correspond to the intuitive notion of clusters as compact, spherical objects
- The distance between two clusters is the average distance of all pair-wise distances between the data points in the two clusters

Centroid method
- The distance between two clusters is the distance between their centroids
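As a sketch, the four measures can be written directly from their definitions (Euclidean base distance assumed; the helper names are illustrative):

```python
import numpy as np

def euclidean(p, q):
    return float(np.linalg.norm(np.asarray(p, dtype=float) - np.asarray(q, dtype=float)))

def single_link(c1, c2):
    # distance of the two closest points, one from each cluster
    return min(euclidean(p, q) for p in c1 for q in c2)

def complete_link(c1, c2):
    # distance of the two furthest points, one from each cluster
    return max(euclidean(p, q) for p in c1 for q in c2)

def average_link(c1, c2):
    # average of all pair-wise distances between points of the two clusters
    return sum(euclidean(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

def centroid_link(c1, c2):
    # distance between the centroids of the two clusters
    return euclidean(np.mean(c1, axis=0), np.mean(c2, axis=0))
```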


The complexity

n All the hierarchical algorithms are at least O(n2)

q n is the number of data points

n Single link can be done in O(n2) n Complete and average links can be done in O(n2log n) n Due the complexity, hierarchical algorithms are hard to use for

large data sets

q Perform hierarchical clustering on a sample of data points

and then assign the others by distance or by supervised learning (see lecture 9)

q Use scale-up methods (e.g., BIRCH) that

n

find many small clusters using an efficient algorithm

n

use these clusters as the starting nodes for the hierarchical clustering

Road map

n Basic concepts n K-means algorithm n Representation of clusters n Hierarchical clustering n Distance functions n Data standardization n Handling mixed attributes n Which clustering algorithm to use? n Cluster evaluation n Summary


Distance functions

n Key to clustering

q “similarity” and “dissimilarity” are other commonly

used terms

n There are numerous distance functions for

q Different types of data

n Numeric data n Nominal data n …

q Different specific applications

Distance functions for numeric attributes

n We denote distance with dist(xi, xj), where xi and

xj are data points (vectors)

n Most commonly used functions are

q Euclidean distance and q Manhattan (city block) distance

n They are special cases of Minkowski distance

h is positive integer, r is the number of attributes

dist(xi,x j ) = xi1 ! x j1

h + xi2 ! x j2 h +...+ xir ! x jr h

( )

1 h
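A minimal sketch of the Minkowski distance, with its two special cases (the function name is illustrative):

```python
def minkowski(xi, xj, h):
    """Minkowski distance between two numeric vectors, for a positive integer h."""
    return sum(abs(a - b) ** h for a, b in zip(xi, xj)) ** (1 / h)

xi, xj = (1.0, 2.0), (4.0, 6.0)
print(minkowski(xi, xj, 2))  # h = 2, Euclidean distance: 5.0
print(minkowski(xi, xj, 1))  # h = 1, Manhattan distance: 7.0
```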


Euclidean distance and Manhattan distance

n If h = 2, it is the Euclidean distance n If h = 1, it is the Manhattan distance n Weighted Euclidean distance

2 2 2 2 2 1 1

) ( ... ) ( ) ( ) , (

jr ir j i j i j i

x x x x x x dist − + + − + − = x x

| | ... | | | | ) , (

2 2 1 1 jr ir j i j i j i

x x x x x x dist − + + − + − = x x

2 2 2 2 2 2 1 1 1

) ( ... ) ( ) ( ) , (

jr ir r j i j i j i

x x w x x w x x w dist − + + − + − = x x

Squared distance and Chebychev distance

n Squared Euclidean distance: to place

progressively greater weight on data points that are further apart

n Chebychev distance: one wants to define two

data points as “different” if they are different

  • n any one of the attributes

2 2 2 2 2 1 1

) ( ... ) ( ) ( ) , (

jr ir j i j i j i

x x x x x x dist − + + − + − = x x

dist(xi,x j ) = max xi1 ! x j1 , xi2 ! x j2 ,…, xir ! x jr

( )
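Both variants, sketched directly from the formulas above:

```python
def squared_euclidean(xi, xj):
    """Squared Euclidean distance: larger gaps receive progressively more weight."""
    return sum((a - b) ** 2 for a, b in zip(xi, xj))

def chebychev(xi, xj):
    """Chebychev distance: the largest difference over any single attribute."""
    return max(abs(a - b) for a, b in zip(xi, xj))

print(squared_euclidean((1, 2), (4, 6)))  # 25
print(chebychev((1, 2), (4, 6)))          # 4
```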


Distance functions for binary and nominal attributes

n Binary attribute: has two values or states but

no ordering relationships,

q E.g., Gender: female and male q The 2 values are conventionally represented by 1

and 0

n We use a confusion matrix to introduce the

distance functions/measures

n Let the ith and jth data points be xi and xj

(vectors)

Confusion matrix

Counting over the r binary attributes of xi and xj:

              xj = 1    xj = 0
    xi = 1      a         b
    xi = 0      c         d

- a: number of attributes with value 1 in both xi and xj
- b: number of attributes with value 1 in xi and 0 in xj
- c: number of attributes with value 0 in xi and 1 in xj
- d: number of attributes with value 0 in both xi and xj


Symmetric binary attributes

n A binary attribute is symmetric if both of its states

(0 and 1) have equal importance, e.g., female and male of the attribute Gender

n Distance function: Simple Matching Distance, proportion of

mismatches of their values

n There are variations, adding weights

d c b a c b dist

j i

+ + + + = ) , ( x x

dist(xi,x j ) = 2(b+c) a + d + 2(b+c)

dist(xi,x j ) = b+c 2(a + d)+ b+c

To mismatches To matches

(1)

n x1 and x2 are two data points n Each of the 7 attributes is symmetric binary n The simple matching distance is n If there is a weight on mismatches

Symmetric binary attributes: example

dist(x1,x2) = b+c a + b+c+ d = 2 +1 2 + 2 +1+ 2 = 3 7 = 0.429 dist(x1,x2) = 2(b+c) a + 2(b+c)+ d = 2(2 +1) 2 + 2(2 +1)+ 2 = 6 10 = 0.6
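The same numbers can be checked with a short sketch; the two vectors below are illustrative values chosen to reproduce the counts a = 2, b = 2, c = 1, d = 2:

```python
def match_counts(xi, xj):
    """Confusion-matrix counts (a, b, c, d) for two binary vectors."""
    a = sum(1 for u, v in zip(xi, xj) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(xi, xj) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(xi, xj) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(xi, xj) if u == 0 and v == 0)
    return a, b, c, d

def simple_matching(xi, xj):
    """Simple matching distance: proportion of mismatching attributes."""
    a, b, c, d = match_counts(xi, xj)
    return (b + c) / (a + b + c + d)

x1 = [1, 1, 1, 1, 0, 0, 0]
x2 = [1, 1, 0, 0, 1, 0, 0]
print(round(simple_matching(x1, x2), 3))  # 0.429
```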


Asymmetric binary attributes

n Asymmetric: if one of the states is more

important or valuable than the other

q By convention, state 1 represents the more important

state, which is typically the rare or infrequent state

q Jaccard distance is a popular measure q There are variations, adding weights

c b a c b dist

j i

+ + + = ) , ( x x

dist(xi,x j ) = 2(b+c) a + 2(b+c)

dist(xi,x j ) = b+c 2a + b+c

To mismatches To matches of the important state

(2)

n x1 and x2 are two data points n Each of the 7 attributes is asymmetric binary n The Jaccard distance is n If there is a weight on matches of the important state

Asymmetric binary attributes: example

dist(x1,x2) = b+c a + b+c = 2 +1 2 + 2 +1 = 3 5 = 0.6 dist(x1,x2) = b+c 2a + b+c = 2 +1 2*2 + 2 +1 = 3 7 = 0.429
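Continuing the previous sketch, the Jaccard distance simply drops d from the denominator:

```python
def jaccard_distance(xi, xj):
    """Jaccard distance for asymmetric binary attributes: 0-0 matches (d) are ignored."""
    a, b, c, _ = match_counts(xi, xj)  # match_counts as defined in the earlier sketch
    return (b + c) / (a + b + c)

print(round(jaccard_distance(x1, x2), 1))  # 0.6 for the same x1, x2 as before
```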


Nominal attributes

n Nominal attributes: with more than two states

  • r values

q the commonly used distance measure is also

based on the simple matching method

q Given two data points xi and xj, let the number of

attributes be r, and the number of values that match in xi and xj be q

r q r dist

j i

− = ) , ( x x

(3)
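A minimal sketch of equation (3):

```python
def nominal_distance(xi, xj):
    """Simple matching distance for nominal attributes: fraction of mismatching values."""
    r = len(xi)                                   # number of attributes
    q = sum(1 for u, v in zip(xi, xj) if u == v)  # number of matching values
    return (r - q) / r

print(nominal_distance(["Apple", "red", "small"], ["Apple", "green", "small"]))  # 1/3
```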

Road map

n Basic concepts n K-means algorithm n Representation of clusters n Hierarchical clustering n Distance functions n Data standardization n Handling mixed attributes n Which clustering algorithm to use? n Cluster evaluation n Summary


Data standardization

n In the Euclidean space, standardization of attributes

is recommended so that all attributes can have equal impact on the computation of distances

n Consider the following pair of data points

q xi: (0.1, 20) and xj: (0.9, 720)

n The distance is almost completely dominated by

(720-20) = 700

n Standardize attributes: to force the attributes to have

a common value range

dist(xi,x j ) = (0.9 ! 0.1)2 +(720 ! 20)2 = 700.000457

Interval-scaled attributes

n Their values are real numbers following a

linear scale

q E.g., the difference in Age between 10 and 20 is

the same as that between 40 and 50

q The key idea is that intervals keep the same

importance through out the scale

n Two main approaches to standardize interval

scaled attributes, range and z-score


Interval-scaled attributes (cont …)

n Range: transform the values of an attribute f so that they are

between 0 and 1

n Z-score: transform the values of an attribute f based on the mean

and standard deviation of the attribute

q

Indicates how far and in what direction the value deviates from the mean

q

The deviation is expressed in units of the standard deviation of the attribute

! f = (xif !µ f )2

i=1 n

"

n !1

µ f = 1 n xif

i=1 n

!

z(xif ) = xif !µ f ! f

Z-score:

rg(xif ) = xif ! min( f ) max( f )! min( f )
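A sketch of both transforms over a single attribute column (numpy assumed; `ddof=1` matches the n − 1 in the formula above):

```python
import numpy as np

def range_standardize(col):
    """Map an attribute column onto [0, 1] via the range transform."""
    col = np.asarray(col, dtype=float)
    return (col - col.min()) / (col.max() - col.min())

def z_score(col):
    """Express each value as its deviation from the mean, in standard-deviation units."""
    col = np.asarray(col, dtype=float)
    return (col - col.mean()) / col.std(ddof=1)  # ddof=1: divide by n - 1

ages = [10, 20, 40, 50]
print(range_standardize(ages))  # [0.   0.25 0.75 1.  ]
print(z_score(ages))
```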

Ratio-scaled attributes

n Numeric attributes, but unlike interval-scaled

attributes, their scales are exponential

n For example, the total amount of

microorganisms that evolve in a time t is approximately given by

q where A and B are positive constants

n Approach

1.

Do log transform

2.

Then treat xif’ as an interval-scaled attribute xif

' = log(xif )

AeBt
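A brief sketch of the approach, with illustrative values for A and B:

```python
import numpy as np

A, B = 2.0, 0.5                       # illustrative positive constants
t = np.array([1.0, 2.0, 3.0, 4.0])
x = A * np.exp(B * t)                 # ratio-scaled values ~ A * e^(B*t)

x_log = np.log(x)                     # step 1: log transform (now linear in t)
z = (x_log - x_log.mean()) / x_log.std(ddof=1)  # step 2: treat as interval-scaled
print(z)
```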


Nominal (unordered categorical) attributes

n Sometime, we need to transform nominal

attributes to numeric attributes

n Transform nominal attributes to binary attributes

q The number of values of a nominal attribute is v q Create v binary attributes to represent the values q If a data instance for the nominal attribute takes a

particular value, the value of its binary attribute is set to 1, otherwise it is set to 0

n The resulting binary attributes can be used as

numeric attributes, with two values, 0 and 1

Nominal attributes: an example

n Nominal attribute fruit: has three values

q Apple, Orange, and Pear

n We create three binary attributes called,

Apple, Orange, and Pear in the new data

n If a particular data instance in the original

data has Apple as the value for fruit

q then in the transformed data, we set the value of

the attribute Apple to 1, and

q the values of attributes Orange and Pear to 0


Ordinal (ordered categorical) attributes

n Ordinal attribute: it is like a nominal attribute,

but its values have a numerical ordering

n E.g.,

q Age attribute with ordered values: Young,

MiddleAge, and Old

q Common approach to standardization: treat is as

an interval-scaled attribute

n E.g., Young à 0, MiddleAge à 1, Old à 2