Clustering and K-means
Root Mean Square Error (RMS)
Data: $\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N \in \mathbb{R}^d$
Approximations: $\vec{z}_1, \vec{z}_2, \ldots, \vec{z}_N \in \mathbb{R}^d$

$$\text{RMS error} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left\|\vec{x}_i - \vec{z}_i\right\|_2^2}$$
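To make the formula concrete, here is a minimal NumPy sketch (the function name `rms_error` and the `(N, d)` array layout are choices of this writeup, not from the slides):

```python
import numpy as np

def rms_error(X, Z):
    """RMS error between data points X and approximations Z, both (N, d)."""
    # Squared Euclidean distance per point, averaged over points, then rooted.
    return np.sqrt(np.mean(np.sum((X - Z) ** 2, axis=1)))
```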
PCA-based prediction
Data: $\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N \in \mathbb{R}^d$
Mean vector: $\vec{\mu}$. Top $k$ eigenvectors: $\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_k$

Approximation of $\vec{x}_j$:

$$\vec{z}_j = \vec{\mu} + \sum_{i=1}^{k}\left(\vec{v}_i \cdot (\vec{x}_j - \vec{\mu})\right)\vec{v}_i$$
$$\text{RMS error} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left\|\vec{x}_i - \vec{z}_i\right\|_2^2}$$
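As a sketch of how such an approximation can be computed, assuming the eigenvectors come from the covariance matrix of the data (the helper name `pca_approximation` is illustrative):

```python
import numpy as np

def pca_approximation(X, k):
    """Approximate each row of X (shape (N, d)) by the top-k PCA directions."""
    mu = X.mean(axis=0)
    Xc = X - mu                                    # center the data
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    V = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # top-k eigenvectors, (d, k)
    # z_j = mu + sum_i (v_i . (x_j - mu)) v_i : project onto span(V), add mu back
    return mu + (Xc @ V) @ V.T
```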
Regression-based prediction
Data: $(\vec{x}_1, y_1), (\vec{x}_2, y_2), \ldots, (\vec{x}_N, y_N) \in \mathbb{R}^d \times \mathbb{R}$
Input: $\vec{x} \in \mathbb{R}^d$. Output: $y \in \mathbb{R}$

Approximation of $y$ given $\vec{x}$:

$$\hat{y} = a_0 + \sum_{i=1}^{d} a_i x_i$$
$$\text{RMS error} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$$
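A minimal sketch of fitting the coefficients by least squares and evaluating this error (the helper name and the use of `np.linalg.lstsq` are choices here, not prescribed by the slides):

```python
import numpy as np

def regression_rms(X, y):
    """Fit y_hat = a0 + sum_i a_i * x_i by least squares; return the RMS error."""
    A = np.hstack([np.ones((X.shape[0], 1)), X])  # column of 1s for the intercept a0
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    y_hat = A @ coeffs
    return np.sqrt(np.mean((y - y_hat) ** 2))
```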
K-means clustering
Data: $\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N \in \mathbb{R}^d$
Model: $k$ representatives $\vec{r}_1, \vec{r}_2, \ldots, \vec{r}_k \in \mathbb{R}^d$

Approximation of $\vec{x}_j$:

$$\vec{z}_j = \operatorname*{argmin}_{\vec{r}_i} \left\|\vec{x}_j - \vec{r}_i\right\|_2^2 = \text{the representative closest to } \vec{x}_j$$

$$\text{RMS error} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left\|\vec{x}_i - \vec{z}_i\right\|_2^2}$$
K-means Algorithm
Initialize $k$ representatives $\vec{r}_1, \vec{r}_2, \ldots, \vec{r}_k \in \mathbb{R}^d$

Iterate until convergence:
a. Associate each $\vec{x}_i$ with its closest representative: $\vec{x}_i \to \vec{r}_j$
b. Replace each representative $\vec{r}_j$ with the mean of the points assigned to it.

Both the a step and the b step reduce the RMS error.
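A minimal NumPy sketch of these two steps (plain Lloyd iterations; the convergence test and the handling of empty clusters are choices of this writeup):

```python
import numpy as np

def kmeans(X, reps, max_iter=100):
    """X: (N, d) data; reps: (k, d) initial representatives, updated in place."""
    assign = None
    for _ in range(max_iter):
        # Step a: associate each x_i with its closest representative.
        dists = ((X[:, None, :] - reps[None, :, :]) ** 2).sum(axis=2)  # (N, k)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(assign, new_assign):
            break                                  # assignments stopped changing
        assign = new_assign
        # Step b: replace each representative with the mean of its points.
        for j in range(len(reps)):
            pts = X[assign == j]
            if len(pts):                           # keep the old rep if unassigned
                reps[j] = pts.mean(axis=0)
    return reps, assign
```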
Simple Initialization

Simplest initialization: choose the representatives from the data points independently at random.
– Problem: some representatives end up close to each other, and some parts of the data get no representatives.
– K-means is a local search method, so it can get stuck in local minima.
K-means++

A different method for initializing the representatives:
– Spreads out the initial representatives.
– Adds representatives one by one.
– Before adding a representative, defines a distribution over the unselected data points.

Data: $\vec{x}_1, \ldots, \vec{x}_N$. Current representatives: $\vec{r}_1, \ldots, \vec{r}_j$

Distance of an example to the representatives: $d(\vec{x}, \{\vec{r}_1, \ldots, \vec{r}_j\}) = \min_{1 \le i \le j} \left\|\vec{x} - \vec{r}_i\right\|$

Probability of selecting example $\vec{x}$ as the next representative: $P(\vec{x}) = \frac{1}{Z}\, d(\vec{x}, \{\vec{r}_1, \ldots, \vec{r}_j\})$, where $Z$ normalizes the probabilities to sum to 1.
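A sketch of this initialization, weighting each point by its distance to the current representatives as in the formula above (note: the standard k-means++ of Arthur and Vassilvitskii weights by the squared distance; the slide's formula uses the distance itself, and the code follows the slide):

```python
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """Pick k rows of X as representatives, one by one."""
    rng = rng or np.random.default_rng()
    reps = [X[rng.integers(len(X))]]               # first rep: uniform at random
    while len(reps) < k:
        R = np.array(reps)
        # d(x, {r_1..r_j}) = min_i ||x - r_i||; already-chosen points get d = 0
        d = np.linalg.norm(X[:, None, :] - R[None, :, :], axis=2).min(axis=1)
        p = d / d.sum()                            # P(x) = d(x, reps) / Z
        reps.append(X[rng.choice(len(X), p=p)])
    return np.array(reps)
```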
Example for K-means++

This is an unlikely initialization for K-means++.
Parallelized Kmeans
- Suppose the data points are par>>oned randomly across
several machines.
- We want to perform the a,b steps with minimal
communica>on btwn machines.
- 1. Choose ini>al representa>ves and broadcast to all
machines.
- 2. Each machine par>>ons its own data points according to
closest representa>ve. Defines (key,value) pairs where key=index of closest representa>ve. Value=example.
- 3. Compute the mean for each set by performing
- reduceByKey. (most of the summing done locally on each
machine).
- 4. Broadcast new reps to all machines.
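A sketch of one such iteration, assuming PySpark, with `points` an RDD of NumPy vectors and `sc` the SparkContext (the function and variable names are illustrative):

```python
import numpy as np

def parallel_kmeans_step(sc, points, reps):
    """One iteration: broadcast reps, assign points, reduceByKey the sums."""
    reps_b = sc.broadcast(reps)                    # steps 1/4: ship reps to workers
    def to_pair(x):
        # step 2: key = index of the closest representative, value = (x, 1)
        j = int(np.argmin([np.sum((x - r) ** 2) for r in reps_b.value]))
        return (j, (x, 1))
    # step 3: per-cluster (sum, count); reduceByKey combines locally on each
    # machine before shuffling, so communication stays small.
    sums = points.map(to_pair).reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    means = {j: s / n for j, (s, n) in sums.collect()}
    # keep the old representative for any cluster that received no points
    return np.array([means.get(j, reps[j]) for j in range(len(reps))])
```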
Clustering stability
[Figures: clusterings obtained using starting points 1, 2, and 3.]
Measuring clustering stability
               x1  x2  x3  x4  x5  x6   ⋯   xn
Clustering 1    1   1   3   1   3   2  2 2   3
Clustering 2    2   2   1   2   1   3  3 3   1
Clustering 3    2   2   3   2   3   1  1 1   3
Clustering 4    1   1   1   1   3   3  3 3   1

The entry in row "Clustering j", column "xi" contains the index of the representative closest to xi in clustering j. The first three clusterings are completely consistent with each other: they differ only by a relabeling of the clusters. The fourth clustering has a disagreement at x5.
How to quantify stability?

– We say that a clustering is stable if the examples are always grouped in the same way.
– When we have thousands of examples, we cannot expect all of them to always be grouped the same way.
– We need a way to quantify the stability.
– Basic idea: measure how much the groupings differ between clusterings.
Entropy
A partition $G$ of the data defines a distribution over the parts: $p_1 + p_2 + \cdots + p_k = 1$

The information in this partition is measured by the entropy:

$$H(G) = H(p_1, p_2, \ldots, p_k) = \sum_{i=1}^{k} p_i \log_2 \frac{1}{p_i}$$

$H(G)$ is a number between $0$ (one part with probability 1) and $\log_2 k$ ($p_1 = p_2 = \cdots = p_k = \frac{1}{k}$).
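A direct sketch of this formula, where the input is each example's part label (e.g. one row of the table above):

```python
import numpy as np

def entropy(labels):
    """H(G) for a partition, given each example's part label."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()                      # p_1 + ... + p_k = 1
    return float(np.sum(p * np.log2(1.0 / p)))     # sum_i p_i log2(1/p_i)
```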
Entropy of a combined partition

$H(G_1, \ldots, G_i)$ denotes the entropy of the combined partition, whose parts are the intersections of the parts of $G_1, \ldots, G_i$. If clustering 1 and clustering 2 partition the data in exactly the same way, then $G_1 = G_2$ and $H(G_1, G_2) = H(G_1) = H(G_2)$.

Suppose we produce many clusterings, using many starting points, and plot $H(G_1), H(G_1, G_2), \ldots, H(G_1, G_2, \ldots, G_i), \ldots$ as a function of $i$:
– If the graph increases like $i \log_2 k$, the clustering is completely unstable.
– If the graph stops increasing after some $i$, we have reached stability.
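One way to compute $H(G_1, \ldots, G_i)$ under the definition above: two examples share a part of the combined partition iff every clustering labels them identically, so the tuple of labels identifies the part. This sketch assumes integer label arrays of equal length:

```python
import numpy as np

def combined_entropy(clusterings):
    """H(G1,...,Gi) for a list of label arrays, one array per clustering."""
    rows = np.stack(clusterings, axis=1)           # row i = the label tuple of x_i
    _, counts = np.unique(rows, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

# Plot combined_entropy(clusterings[:i]) against i: a curve that flattens
# indicates stability; growth like i * log2(k) indicates complete instability.
```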