Jian Pei: CMPT 459/741, Clustering (4)



Fuzzy Clustering

  • Each point xi takes a probability wij to belong

to a cluster Cj

  • Requirements

– For each point xi: $\sum_{j=1}^{k} w_{ij} = 1$
– For each cluster Cj: $0 < \sum_{i=1}^{m} w_{ij} < m$


Fuzzy C-Means (FCM)

Select an initial fuzzy pseudo-partition, i.e., assign values to all the wij
Repeat
  Compute the centroid of each cluster using the fuzzy pseudo-partition
  Recompute the fuzzy pseudo-partition, i.e., the wij
Until the centroids do not change (or the change is below some threshold)


Critical Details

  • Optimization on the sum of the squared error (SSE):

    $SSE(C_1, \ldots, C_k) = \sum_{j=1}^{k} \sum_{i=1}^{m} w_{ij}^{p} \, dist(x_i, c_j)^2$

  • Computing centroids:

    $c_j = \sum_{i=1}^{m} w_{ij}^{p} x_i \Big/ \sum_{i=1}^{m} w_{ij}^{p}$

  • Updating the fuzzy pseudo-partition:

    $w_{ij} = \left(1 / dist(x_i, c_j)^2\right)^{\frac{1}{p-1}} \Big/ \sum_{q=1}^{k} \left(1 / dist(x_i, c_q)^2\right)^{\frac{1}{p-1}}$

– When p = 2: $w_{ij} = \left(1 / dist(x_i, c_j)^2\right) \Big/ \sum_{q=1}^{k} \left(1 / dist(x_i, c_q)^2\right)$
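
These two update rules map directly onto a few lines of NumPy. The sketch below is a minimal illustration of one FCM iteration, not the lecture's reference implementation; the names fcm_step, X, W, and p are introduced only for this example.

import numpy as np

def fcm_step(X, W, p=2):
    """One FCM iteration: recompute centroids from W, then recompute W.
    X: (m, d) data matrix, W: (m, k) fuzzy pseudo-partition, p > 1."""
    Wp = W ** p
    # c_j = sum_i w_ij^p x_i / sum_i w_ij^p
    centroids = (Wp.T @ X) / Wp.sum(axis=0)[:, None]
    # squared distances dist(x_i, c_j)^2
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    d2 = np.maximum(d2, 1e-12)                       # guard against division by zero
    # w_ij = (1/d2_ij)^(1/(p-1)) / sum_q (1/d2_iq)^(1/(p-1))
    inv = (1.0 / d2) ** (1.0 / (p - 1))
    return centroids, inv / inv.sum(axis=1, keepdims=True)

Iterating fcm_step until the centroids stop moving gives the loop described on the previous slide.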


Choice of P

  • When p → 1, FCM behaves like traditional

k-means

  • When p is larger, the cluster centroids

approach the global centroid of all data points

  • The partition becomes fuzzier as p

increases


Effectiveness


Mixture Models

  • A cluster can be modeled as a probability

distribution

– Practically, assume each distribution can be approximated well by a multivariate normal distribution

  • Multiple clusters form a mixture of different

probability distributions

  • A data set is a set of observations from a

mixture of models


Object Probability

  • Suppose there are k clusters and a set X of

m objects

– Let the j-th cluster have parameter θj = (µj, σj) – The probability that a point is in the j-th cluster is wj, w1 + …+ wk = 1

  • The probability of an object x is

    $prob(x \mid \Theta) = \sum_{j=1}^{k} w_j \, p_j(x \mid \theta_j)$

  • The probability of the data set X is

    $prob(X \mid \Theta) = \prod_{i=1}^{m} prob(x_i \mid \Theta) = \prod_{i=1}^{m} \sum_{j=1}^{k} w_j \, p_j(x_i \mid \theta_j)$
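
For a 1-d Gaussian mixture these two formulas take only a few lines of NumPy. The sketch below is illustrative; mixture_prob and its arguments are names invented for this example, and the component densities are assumed to be univariate Gaussians as in the example on the next slide (the equal weights 0.5 are also an assumption of this sketch).

import numpy as np

def mixture_prob(X, weights, mus, sigmas):
    """Per-object prob(x_i | Theta) = sum_j w_j p_j(x_i | theta_j),
    and prob(X | Theta) as the product of the per-object values."""
    X = np.asarray(X, dtype=float)[:, None]                      # (m, 1)
    dens = np.exp(-(X - mus) ** 2 / (2 * sigmas ** 2)) / (np.sqrt(2 * np.pi) * sigmas)
    per_object = (weights * dens).sum(axis=1)                    # (m,)
    return per_object, per_object.prod()

# parameters from the next slide's example, with assumed equal weights
probs, likelihood = mixture_prob([-3.9, 4.1], np.array([0.5, 0.5]),
                                 np.array([-4.0, 4.0]), np.array([2.0, 2.0]))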


Example

A univariate Gaussian:

$prob(x \mid \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

With $\theta_1 = (-4, 2)$ and $\theta_2 = (4, 2)$:

$prob(x \mid \Theta) = \frac{1}{2\sqrt{2\pi}} e^{-\frac{(x+4)^2}{8}} + \frac{1}{2\sqrt{2\pi}} e^{-\frac{(x-4)^2}{8}}$


Maximum Likelihood Estimation

  • Maximum likelihood principle: if we know a

set of objects is from one distribution but do not know the parameter, we can choose the parameter that maximizes the probability

  • Maximize

    $prob(X \mid \Theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}$

– Equivalently, maximize

    $\log prob(X \mid \Theta) = -\sum_{i=1}^{m} \frac{(x_i - \mu)^2}{2\sigma^2} - 0.5\, m \log 2\pi - m \log \sigma$
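
The log form follows from taking the logarithm of the product of Gaussian densities; a short derivation (standard algebra, added here for completeness):

$\log prob(X \mid \Theta)
  = \log \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}
  = \sum_{i=1}^{m} \left( -\frac{(x_i - \mu)^2}{2\sigma^2} - \log(\sqrt{2\pi}\,\sigma) \right)
  = -\sum_{i=1}^{m} \frac{(x_i - \mu)^2}{2\sigma^2} - 0.5\, m \log 2\pi - m \log \sigma$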


EM Algorithm

  • Expectation Maximization algorithm

Select an initial set of model parameters
Repeat
  Expectation Step: for each object, calculate the probability that it belongs to each distribution θi, i.e., prob(xi | θi)
  Maximization Step: given the probabilities from the expectation step, find the new estimates of the parameters that maximize the expected likelihood
Until the parameters are stable
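
A minimal sketch of EM for a 1-d Gaussian mixture, assuming Gaussian components and NumPy; em_gmm and its defaults are illustrative names, not the lecture's code.

import numpy as np

def em_gmm(X, k, n_iter=100, seed=0):
    """E-step: membership probabilities; M-step: re-estimate weights, means, sigmas."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    w = np.full(k, 1.0 / k)
    mu = rng.choice(X, size=k, replace=False)
    sigma = np.full(k, X.std())
    for _ in range(n_iter):
        # E-step: prob(cluster j | x_i), i.e. the normalized w_j p_j(x_i | theta_j)
        dens = np.exp(-(X[:, None] - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
        resp = w * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: parameters that maximize the expected likelihood
        nk = resp.sum(axis=0)
        w = nk / len(X)
        mu = (resp * X[:, None]).sum(axis=0) / nk
        sigma = np.maximum(np.sqrt((resp * (X[:, None] - mu) ** 2).sum(axis=0) / nk), 1e-6)
    return w, mu, sigma

A convergence check on the parameters (rather than a fixed n_iter) matches the "until the parameters are stable" condition above.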


Advantages and Disadvantages

  • Mixture models are more general than k-

means and fuzzy c-means

  • Clusters can be characterized by a small

number of parameters

  • The results may satisfy the statistical

assumptions of the generative models

  • Computationally expensive
  • Need large data sets
  • Hard to estimate the number of clusters

Grid-based Clustering Methods

  • Ideas

– Using multi-resolution grid data structures – Using dense grid cells to form clusters

  • Several interesting methods

– CLIQUE – STING – WaveCluster


CLIQUE

  • CLustering In QUEst
  • Automatically identify subspaces of a high

dimensional data space

  • Both density-based and grid-based

CLIQUE: the Ideas

  • Partition each dimension into the same

number of equal length intervals

– Partition an m-dimensional data space into non-overlapping rectangular units
  • A unit is dense if the number of data points

in the unit exceeds a threshold

  • A cluster is a maximal set of connected

dense units within a subspace


CLIQUE: the Method

  • Partition the data space and find the number of

points in each cell of the partition

– Apriori: a k-d cell cannot be dense if one of its (k-1)-d projections is not dense

  • Identify clusters:

– Determine dense units in all subspaces of interest, and determine connected dense units in all subspaces of interest

  • Generate minimal description for the clusters

– Determine the minimal cover for each cluster
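
As a rough illustration of the grid-and-density step only (not the full CLIQUE algorithm, which also runs the Apriori-style search over subspaces and builds the minimal descriptions), the sketch below partitions each dimension into equal-length intervals, counts points per cell of the full space, and keeps the dense cells; dense_units, n_intervals, and threshold are names chosen for this example.

import numpy as np

def dense_units(X, n_intervals=10, threshold=5):
    """Partition every dimension into equal-length intervals, count points
    per grid cell, and return the cells whose count exceeds the threshold."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)           # avoid zero-width dimensions
    idx = np.clip(((X - lo) / span * n_intervals).astype(int), 0, n_intervals - 1)
    cells, counts = np.unique(idx, axis=0, return_counts=True)
    return {tuple(c): int(n) for c, n in zip(cells, counts) if n > threshold}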


CLIQUE: An Example

[Figure: dense units in the (age, salary (×$10,000)) and (age, vacation (week)) subspaces, with age ranging over 20-60]


CLIQUE: Pros and Cons

  • Automatically find subspaces of the highest

dimensionality with high density clusters

  • Insensitive to the order of input

– Does not presume any canonical data distribution

  • Scale linearly with the size of input
  • Scale well with the number of dimensions
  • The clustering result may be degraded at

the expense of simplicity of the method


Bad Cases for CLIQUE

Parts of a cluster may be missed
A cluster from CLIQUE may contain noise


Dimensionality Reduction

  • Clustering a high dimensional data set is

challenging

– Distance between two points could be dominated by noise

  • Dimensionality reduction: choosing the informative

dimensions for clustering analysis

– Feature selection: choosing a subset of existing dimensions – Feature construction: construct a new (small) set of informative attributes


Variance and Covariance

  • Given a set of 1-d points, how different are

those points?

– Standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}}$
– Variance: $s^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}$

  • Given a set of 2-d points, are the two dimensions correlated?

– Covariance: $cov(X, Y) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$


Principal Components

Art work and example from http://csnet.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf


Step 1: Mean Subtraction

  • Subtract the mean from each dimension for each

data point

  • Intuition: centralizing the data set

Step 2: Covariance Matrix

$C = \begin{pmatrix} cov(D_1, D_1) & cov(D_1, D_2) & \cdots & cov(D_1, D_n) \\ cov(D_2, D_1) & cov(D_2, D_2) & \cdots & cov(D_2, D_n) \\ \vdots & \vdots & \ddots & \vdots \\ cov(D_n, D_1) & cov(D_n, D_2) & \cdots & cov(D_n, D_n) \end{pmatrix}$

where $D_1, \ldots, D_n$ are the n dimensions of the data

slide-24
SLIDE 24

Jian Pei: CMPT 459/741 Clustering (4) 24

Step 3: Eigenvectors and Eigenvalues

  • Compute the eigenvectors and the

eigenvalues of the covariance matrix

– Intuition: find those direction-invariant vectors as candidates for new attributes – Eigenvalues indicate how much the direction-invariant vectors are scaled; the larger, the better for manifesting the data variance


Step 4: Forming New Features

  • Choose the principal components and form new

features

– Typically, choose the top-k components


New Features

NewData = RowFeatureVector x RowDataAdjust

The first principal component is used
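
The four steps map onto a short NumPy routine. The sketch below is a minimal illustration (the name pca_transform is this example's, not the tutorial's): mean-subtract, build the covariance matrix, keep the top-k eigenvectors as the feature vector, and project the data onto them.

import numpy as np

def pca_transform(X, k):
    """Steps 1-4: mean subtraction, covariance matrix, top-k eigenvectors, projection."""
    X = np.asarray(X, dtype=float)
    X_adj = X - X.mean(axis=0)                       # Step 1: mean subtraction
    C = np.cov(X_adj, rowvar=False)                  # Step 2: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)             # Step 3: eigenvectors and eigenvalues
    feature_vector = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    return X_adj @ feature_vector                    # Step 4: NewData

# e.g. project 2-d points onto the first principal component: pca_transform(points, k=1)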


Clustering in Derived Space

[Figure: 2-d data points plotted against axes X and Y with origin O]

  • 0.707x + 0.707y

Spectral Clustering

[Framework: Data → Affinity matrix W = [W_ij] → A = f(W) → compute the leading k eigenvectors of A (Av = λv) → clustering in the new space → project back to W to cluster the original data]


Affinity Matrix

  • Using a distance measure:

    $W_{ij} = e^{-\frac{dist(o_i, o_j)}{\sigma}}$

    where σ is a scaling parameter controlling how fast the affinity Wij decreases as the distance increases

  • In the Ng-Jordan-Weiss algorithm, Wii is set to 0


Clustering

  • In the Ng-Jordan-Weiss algorithm, we define a diagonal matrix D such that

    $D_{ii} = \sum_{j=1}^{n} W_{ij}$

  • Then, $A = D^{-\frac{1}{2}} W D^{-\frac{1}{2}}$

  • Use the k leading eigenvectors of A to form a new space

  • Map the original data to the new space and conduct clustering
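
A compact sketch of the Ng-Jordan-Weiss pipeline, assuming the affinity from the previous slide and using NumPy plus scikit-learn's KMeans for the clustering step in the new space; njw_spectral and sigma are names for this example.

import numpy as np
from sklearn.cluster import KMeans

def njw_spectral(X, k, sigma=1.0):
    """Affinity W, A = D^{-1/2} W D^{-1/2}, k leading eigenvectors, then k-means."""
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    W = np.exp(-d / sigma)                           # affinity as on the previous slide
    np.fill_diagonal(W, 0.0)                         # W_ii = 0 in NJW
    D_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    A = D_inv_sqrt @ W @ D_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(A)
    V = eigvecs[:, np.argsort(eigvals)[::-1][:k]]    # k leading eigenvectors
    V /= np.linalg.norm(V, axis=1, keepdims=True)    # row normalization, as in NJW
    return KMeans(n_clusters=k, n_init=10).fit_predict(V)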


Is a Clustering Good?

  • Feasibility

– Applying any clustering methods on a uniformly distributed data set is meaningless

  • Quality

– Are the clustering results meeting users' interest? – Clustering patients into clusters corresponding to various diseases or sub-phenotypes is meaningful – Clustering patients into clusters corresponding to male or female is not meaningful



Major Tasks

  • Assessing clustering tendency

– Are there non-random structures in the data?

  • Determining the number of clusters or other

critical parameters

  • Measuring clustering quality



Uniformly Distributed Data

  • Clustering uniformly distributed data is

meaningless

  • A uniformly distributed data set is generated

by a uniform data distribution



Hopkins Statistic

  • Hypothesis: the data is generated by a

uniform distribution in a space

  • Sample n points, p1, …, pn, uniformly from

the space of D

  • For each point pi, find the nearest neighbor of pi in D; let xi be the distance between pi and its nearest neighbor in D:

    $x_i = \min_{v \in D} \{dist(p_i, v)\}$


Hopkins Statistic

  • Sample n points, q1, …, qn, uniformly from D

  • For each qi, find the nearest neighbor of qi in D – {qi}; let yi be the distance between qi and its nearest neighbor in D – {qi}:

    $y_i = \min_{v \in D,\, v \neq q_i} \{dist(q_i, v)\}$

  • Calculate the Hopkins Statistic H:

    $H = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i}$
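
A minimal NumPy sketch of the statistic as defined above. The name hopkins and the choice to draw the uniform probes from the bounding box of D are assumptions of this example.

import numpy as np

def hopkins(D, n, seed=0):
    """x_i: uniform probes to their nearest data point; y_i: sampled data
    points to their nearest other data point; H = sum(y) / (sum(x) + sum(y))."""
    rng = np.random.default_rng(seed)
    D = np.asarray(D, dtype=float)
    lo, hi = D.min(axis=0), D.max(axis=0)
    probes = rng.uniform(lo, hi, size=(n, D.shape[1]))            # p_1, ..., p_n
    x = np.linalg.norm(probes[:, None, :] - D[None, :, :], axis=2).min(axis=1)
    samples = D[rng.choice(len(D), size=n, replace=False)]        # q_1, ..., q_n
    dq = np.linalg.norm(samples[:, None, :] - D[None, :, :], axis=2)
    dq[dq == 0] = np.inf                                          # exclude the point itself
    y = dq.min(axis=1)
    return y.sum() / (x.sum() + y.sum())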


Explanation

  • If D is uniformly distributed, then $\sum_{i=1}^{n} y_i$ and $\sum_{i=1}^{n} x_i$ would be close to each other, and thus H would be around 0.5

  • If D is skewed, then $\sum_{i=1}^{n} y_i$ would be substantially smaller, and thus H would be close to 0

  • If H > 0.5, then it is unlikely that D has statistically significant clusters


Finding the Number of Clusters

  • Depending on many factors

– The shape and scale of the distribution in the data set – The clustering resolution required by the user

  • Many methods exist

– Set $k = \sqrt{n/2}$; each cluster has $\sqrt{2n}$ points on average – Plot the sum of within-cluster variances with respect to k, and find the first (or the most significant) turning point
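
For the second heuristic, the within-cluster variance curve can be computed with scikit-learn's KMeans, whose inertia_ attribute is the sum of within-cluster squared distances; the helper below is an illustrative sketch, not part of the lecture.

import numpy as np
from sklearn.cluster import KMeans

def within_cluster_variances(X, k_values):
    """Inertia for each candidate k; plot against k and look for the turning point."""
    return {k: KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in k_values}

# e.g. scores = within_cluster_variances(X, range(1, 11))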


A Cross-Validation Method

  • Divide the data set D into m parts
  • Use m – 1 parts to find a clustering
  • Use the remaining part as the test set to test

the quality of the clustering

– For each point in the test set, find the closest centroid or cluster center – Use the squared distances between all points in the test set and the corresponding centroids to measure how well the clustering model fits the test set

  • Repeat m times for each value of k, use the

average as the quality measure
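
A sketch of this procedure for one value of k, assuming NumPy arrays and scikit-learn's KMeans as the clustering method; cv_clustering_score and its defaults are illustrative names for this example.

import numpy as np
from sklearn.cluster import KMeans

def cv_clustering_score(X, k, m=10, seed=0):
    """Fit k-means on m-1 folds, sum the squared distances from held-out points
    to their closest centroids, and average over the m repetitions."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), m)
    scores = []
    for i in range(m):
        test = X[folds[i]]
        train = X[np.concatenate([folds[j] for j in range(m) if j != i])]
        centroids = KMeans(n_clusters=k, n_init=10).fit(train).cluster_centers_
        d2 = ((test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        scores.append(d2.min(axis=1).sum())
    return float(np.mean(scores))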



Measuring Clustering Quality

  • Ground truth: the ideal clustering

determined by human experts

  • Two situations

– There is a known ground truth – the extrinsic (supervised) methods, comparing the clustering against the ground truth – The ground truth is unavailable – the intrinsic (unsupervised) methods, measuring how well the clusters are separated



Quality in Extrinsic Methods

  • Cluster homogeneity: the more pure the

clusters in a clustering are, the better the clustering

  • Cluster completeness: objects in the same

cluster in the ground truth should be clustered together

  • Rag bag: putting a heterogeneous object into a

pure cluster is worse than putting it into a rag bag

  • Small cluster preservation: splitting a small

cluster in the ground truth into pieces is worse than splitting a bigger one



Bcubed Precision and Recall

  • D = {o1, …, on}

– L(oi) is the cluster of oi given by the ground truth

  • C is a clustering on D

– C(oi) is the cluster-id of oi in C

  • For two objects oi and oj, the correctness is 1 if L(oi) = L(oj) ⟺ C(oi) = C(oj), and 0 otherwise


Bcubed Precision and Recall

  • Precision:

    $\text{Precision BCubed} = \frac{1}{n} \sum_{i=1}^{n} \frac{\sum_{j:\, i \neq j,\, C(o_i) = C(o_j)} Correctness(o_i, o_j)}{\|\{o_j \mid i \neq j,\, C(o_i) = C(o_j)\}\|}$

  • Recall:

    $\text{Recall BCubed} = \frac{1}{n} \sum_{i=1}^{n} \frac{\sum_{j:\, i \neq j,\, L(o_i) = L(o_j)} Correctness(o_i, o_j)}{\|\{o_j \mid i \neq j,\, L(o_i) = L(o_j)\}\|}$
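
A direct, unoptimized sketch of the two formulas; bcubed is a name chosen for this example, and objects whose cluster (or ground-truth category) contains no other object are skipped to avoid a zero denominator.

import numpy as np

def bcubed(C, L):
    """BCubed precision/recall from clustering labels C and ground-truth labels L."""
    C, L = np.asarray(C), np.asarray(L)
    n = len(C)
    precisions, recalls = [], []
    for i in range(n):
        others = np.arange(n) != i
        same_c = (C == C[i]) & others                # objects sharing o_i's cluster
        same_l = (L == L[i]) & others                # objects sharing o_i's category
        correct = same_c & same_l                    # Correctness(o_i, o_j) = 1
        if same_c.any():
            precisions.append(correct.sum() / same_c.sum())
        if same_l.any():
            recalls.append(correct.sum() / same_l.sum())
    return float(np.mean(precisions)), float(np.mean(recalls))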


Silhouette Coefficient

  • No ground truth is assumed
  • Suppose a data set D of n objects is partitioned

into k clusters, C1, …, Ck

  • For each object o,

– Calculate a(o), the average distance between o and every other object in the same cluster (compactness of a cluster; the smaller, the better) – Calculate b(o), the minimum average distance from o to every object in a cluster that o does not belong to (degree of separation from other clusters; the larger, the better)


Silhouette Coefficient

  • Then

    $a(o) = \frac{\sum_{o' \in C_i,\, o' \neq o} dist(o, o')}{|C_i| - 1}$

    $b(o) = \min_{C_j:\, o \notin C_j} \left\{ \frac{\sum_{o' \in C_j} dist(o, o')}{|C_j|} \right\}$

    $s(o) = \frac{b(o) - a(o)}{\max\{a(o), b(o)\}}$

  • Use the average silhouette coefficient of all objects as the overall measure
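
A small NumPy sketch of a(o), b(o), and s(o) as defined above; silhouette is an illustrative name, and singleton clusters are skipped because a(o) is undefined for them. scikit-learn's silhouette_score computes the same overall measure.

import numpy as np

def silhouette(X, labels):
    """Average silhouette coefficient over all objects."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() < 2:
            continue                                 # a(o) undefined for a singleton cluster
        a = dist[i, same].sum() / (same.sum() - 1)   # distance to itself is 0, so this is a(o)
        b = min(dist[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))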


Multi-Clustering

  • A data set may be clustered in different

ways

– In different subspaces, that is, using different attributes – Using different similarity measures – Using different clustering methods

  • Some different clusterings may capture

different meanings of categorization

– Orthogonal clusterings

  • Putting users in the loop



To-Do List

  • Read Chapters 10.5, 10.6, and 11.1
  • Find out how Gaussian mixture can be used

in SPARK MLlib

  • (for thesis-based graduate students only)

Learn LDA (Latent Dirichlet allocation) by yourself
