CS6220: DATA MINING TECHNIQUES
Matrix Data: Clustering: Part 2
Instructor: Yizhou Sun, yzsun@ccs.neu.edu
October 19, 2014
Methods to Learn

| | Matrix Data | Set Data | Sequence Data | Time Series | Graph & Network |
|---|---|---|---|---|---|
| Classification | Decision Tree; Naïve Bayes; Logistic Regression; SVM; kNN | | HMM | | Label Propagation |
| Clustering | K-means; hierarchical clustering; DBSCAN; Mixture Models; kernel k-means | | | | SCAN; Spectral Clustering |
| Frequent Pattern Mining | | Apriori; FP-growth | GSP; PrefixSpan | | |
| Prediction | Linear Regression | | | Autoregression | |
| Similarity Search | | | | DTW | P-PageRank |
| Ranking | | | | | PageRank |
Matrix Data: Clustering: Part 2
- Revisit K-means
- Mixture Model and EM algorithm
- Kernel K-means
- Summary
Recall K-Means
- Objective function: $J = \sum_{j=1}^{k} \sum_{i: C(i)=j} \|x_i - c_j\|^2$
  - Total within-cluster variance
- Re-arrange the objective function: $J = \sum_{j=1}^{k} \sum_i w_{ij} \|x_i - c_j\|^2$
  - $w_{ij} \in \{0, 1\}$
  - $w_{ij} = 1$ if $x_i$ belongs to cluster $j$; $w_{ij} = 0$ otherwise
- Looking for:
  - The best assignment $w_{ij}$
  - The best centers $c_j$
Solution of K-Means
- Iterations
  - Step 1: Fix centers $c_j$, find the assignment $w_{ij}$ that minimizes $J$
    - => $w_{ij} = 1$ if $\|x_i - c_j\|^2$ is the smallest
  - Step 2: Fix assignment $w_{ij}$, find centers that minimize $J$
    - => set the first derivative of $J$ to 0: $\frac{\partial J}{\partial c_j} = -2 \sum_i w_{ij} (x_i - c_j) = 0$
    - => $c_j = \frac{\sum_i w_{ij}\, x_i}{\sum_i w_{ij}}$
    - Note $\sum_i w_{ij}$ is the total number of objects in cluster $j$
- Converges! Why? Each of the two steps can only decrease $J = \sum_{j=1}^{k} \sum_i w_{ij} \|x_i - c_j\|^2$, and $J \ge 0$ is bounded below, so the iteration converges (a code sketch of the two steps follows).
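As a minimal NumPy sketch of this alternation (the function name, defaults, and initialization scheme are illustrative, not from the slides):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Alternate the two steps above; each step can only decrease J."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # init centers from data
    for _ in range(n_iter):
        # Step 1: fix centers c_j, pick w_ij minimizing ||x_i - c_j||^2
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)  # n x k
        labels = d2.argmin(axis=1)
        # Step 2: fix assignments, set each center to its cluster mean
        for j in range(k):
            if (labels == j).any():          # keep the old center if a cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```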
Limitations of K-Means
- K-means has problems when clusters are of differing
  - Sizes
  - Densities
  - Non-spherical shapes
Limitations of K-Means: Different Density and Size
Limitations of K-Means: Non-Spherical Shapes
Demo
- http://webdocs.cs.ualberta.ca/~yaling/Cluster/Applet/Code/Cluster.html
Connections of K-means to Other Methods
- Gaussian Mixture Model: k-means is a special case (soft assignment with $\sigma^2 \to 0$, shown below)
- Kernel K-means: the same objective in a non-linearly mapped feature space
Matrix Data: Clustering: Part 2
- Revisit K-means
- Mixture Model and EM algorithm
- Kernel K-means
- Summary
Fuzzy Set and Fuzzy Cluster
- Clustering methods discussed so far
  - Every data object is assigned to exactly one cluster
- Some applications may need fuzzy or soft cluster assignment
  - Ex. An e-game could belong to both entertainment and software
- Methods: fuzzy clusters and probabilistic model-based clusters
- Fuzzy cluster: a fuzzy set S with membership function $F_S: X \to [0, 1]$ (value between 0 and 1)
Probabilistic Model-Based Clustering
- Cluster analysis is to find hidden categories.
- A hidden category (i.e., probabilistic cluster) is a distribution over the data space, which can be mathematically represented using a probability density function (or distribution function).
  - Ex. categories for digital cameras sold: consumer line vs. professional line; density functions f1, f2 for C1, C2, obtained by probabilistic clustering
- A mixture model assumes that a set of observed objects is a mixture of instances from multiple probabilistic clusters, and conceptually each observed object is generated independently
- Our task: infer a set of k probabilistic clusters that is most likely to generate D using the above data generation process
Mixture Model-Based Clustering
- A set C of k probabilistic clusters C1, ..., Ck with probability density functions f1, ..., fk, respectively, and their probabilities w1, ..., wk, where $\sum_j w_j = 1$
- Probability of an object $x_i$ generated by cluster $C_j$: $P(x_i, z_i = C_j) = w_j\, f_j(x_i)$
- Probability of $x_i$ generated by the set of clusters C: $P(x_i) = \sum_j w_j\, f_j(x_i)$
Maximum Likelihood Estimation
- Since objects are assumed to be generated independently, for a data set D = {x1, ..., xn}, we have
  - $P(D) = \prod_i P(x_i) = \prod_i \sum_j w_j\, f_j(x_i)$
- Task: Find a set C of k probabilistic clusters s.t. P(D) is maximized
The EM (Expectation Maximization) Algorithm
- The EM algorithm: a framework to approach maximum likelihood or maximum a posteriori estimates of parameters in statistical models.
- E-step: assigns objects to clusters according to the current fuzzy clustering or parameters of probabilistic clusters
  - $w_{ij}^t = p(z_i = C_j \mid \theta^t, x_i) \propto p(x_i \mid C_j, \theta_j^t)\, p(C_j^t)$
- M-step: finds the new clustering or parameters that maximize the expected likelihood
Case 1: Gaussian Mixture Model
- Generative model
  - For each object:
    - Pick its distribution component: $Z \sim \mathrm{Mult}(w_1, \ldots, w_k)$
    - Sample a value from the selected distribution: $X \sim N(\mu_Z, \sigma_Z^2)$
- Overall likelihood function
  - $L(D \mid \theta) = \prod_i \sum_j w_j\, p(x_i \mid \mu_j, \sigma_j^2)$
- Q: What is $\theta$ here?
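A small sketch of this generative process (here $\theta$ collects all the $w_j$, $\mu_j$, $\sigma_j^2$; the 1-D example and parameter values are illustrative):

```python
import numpy as np

def sample_gmm(n, w, mu, sigma2, seed=0):
    """Draw n points: Z ~ Mult(w_1,...,w_k), then X ~ N(mu_Z, sigma2_Z)."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(w), size=n, p=w)                       # component pick per object
    x = rng.normal(np.asarray(mu)[z], np.sqrt(np.asarray(sigma2)[z]))
    return x, z

# e.g. x, z = sample_gmm(500, w=[0.3, 0.7], mu=[-2.0, 3.0], sigma2=[1.0, 0.5])
```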
Estimating Parameters
- $L(D; \theta) = \sum_i \log \sum_j w_j\, p(x_i \mid \mu_j, \sigma_j^2)$
- Considering the first derivative with respect to $\mu_j$:
  - $\frac{\partial L}{\partial \mu_j} = \sum_i \frac{w_j}{\sum_{j'} w_{j'}\, p(x_i \mid \mu_{j'}, \sigma_{j'}^2)} \cdot \frac{\partial p(x_i \mid \mu_j, \sigma_j^2)}{\partial \mu_j}$
  - $= \sum_i \frac{w_j\, p(x_i \mid \mu_j, \sigma_j^2)}{\sum_{j'} w_{j'}\, p(x_i \mid \mu_{j'}, \sigma_{j'}^2)} \cdot \frac{1}{p(x_i \mid \mu_j, \sigma_j^2)} \cdot \frac{\partial p(x_i \mid \mu_j, \sigma_j^2)}{\partial \mu_j}$
  - $= \sum_i \frac{w_j\, p(x_i \mid \mu_j, \sigma_j^2)}{\sum_{j'} w_{j'}\, p(x_i \mid \mu_{j'}, \sigma_{j'}^2)} \cdot \frac{\partial \log p(x_i \mid \mu_j, \sigma_j^2)}{\partial \mu_j}$
- The fraction is exactly $w_{ij} = p(Z = j \mid X = x_i, \theta)$, multiplying $\partial \log p(x_i \mid \mu_j, \sigma_j^2)/\partial \mu_j$
- Setting these derivatives to zero directly is intractable!
- It looks like weighted likelihood estimation, but the weight is itself determined by the parameters!
Apply EM algorithm
- An iterative algorithm (at iteration t+1)
- E(expectation)-step
  - Evaluate the weight $w_{ij}$ when $\mu_j$, $\sigma_j$, $w_j$ are given:
  - $w_{ij}^t = \frac{w_j^t\, p(x_i \mid \mu_j^t, (\sigma_j^2)^t)}{\sum_{j'} w_{j'}^t\, p(x_i \mid \mu_{j'}^t, (\sigma_{j'}^2)^t)}$
- M(maximization)-step
  - Evaluate $\mu_j$, $\sigma_j$, $w_j$ when the $w_{ij}$'s are given, maximizing the weighted likelihood
  - This is equivalent to Gaussian distribution parameter estimation where each point carries a weight for each distribution (see the code sketch below):
  - $\mu_j^{t+1} = \frac{\sum_i w_{ij}^t\, x_i}{\sum_i w_{ij}^t}; \quad (\sigma_j^2)^{t+1} = \frac{\sum_i w_{ij}^t\, (x_i - \mu_j^{t+1})^2}{\sum_i w_{ij}^t}; \quad w_j^{t+1} \propto \sum_i w_{ij}^t$
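A minimal sketch of these E/M updates for a 1-D Gaussian mixture in NumPy (function name, initialization, and defaults are illustrative assumptions):

```python
import numpy as np

def em_gmm(x, k, n_iter=100, seed=0):
    """EM for a 1-D Gaussian mixture, implementing the E/M updates above."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    w = np.full(k, 1.0 / k)                       # mixing weights w_j
    mu = rng.choice(x, size=k, replace=False)     # init means from the data
    s2 = np.full(k, x.var())                      # init variances
    for _ in range(n_iter):
        # E-step: w_ij = w_j N(x_i | mu_j, s2_j) / sum_j' w_j' N(x_i | mu_j', s2_j')
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
        r = w * dens                              # n x k, unnormalized weights
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted Gaussian parameter estimation
        nj = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nj
        s2 = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nj
        w = nj / nj.sum()
    return w, mu, s2
```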
K-Means: A Special Case of Gaussian Mixture Model
- When each Gaussian component has covariance matrix $\sigma^2 I$
  - Soft K-means: $p(x_i \mid \mu_j, \sigma^2) \propto \exp\{-\|x_i - \mu_j\|^2 / 2\sigma^2\}$, so each weight $w_{ij}$ depends only on the distance from $x_i$ to $\mu_j$
- When $\sigma^2 \to 0$
  - Soft assignment becomes hard assignment
  - $w_{ij} \to 1$ iff $x_i$ is closest to $\mu_j$ (why? the nearest center's exponential term dominates all others; see the numerical check below)
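A quick numerical check of this limit (the point, centers, and $\sigma^2$ values are illustrative):

```python
import numpy as np

# Soft assignment w_ij ∝ exp(-||x_i - mu_j||^2 / (2 sigma^2)) hardens as sigma^2 -> 0
x, mus = 1.0, np.array([0.0, 3.0])                # one point, two centers
for s2 in [10.0, 1.0, 0.1, 0.01]:
    logits = -(x - mus) ** 2 / (2 * s2)
    wij = np.exp(logits - logits.max())           # stabilized softmax
    wij /= wij.sum()
    print(s2, wij)                                # weights approach [1, 0]
```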
Case 2: Multinomial Mixture Model
- Generative model
  - For each object:
    - Pick its distribution component: $Z \sim \mathrm{Mult}(w_1, \ldots, w_k)$
    - Sample a value from the selected distribution: $X \sim \mathrm{Mult}(\beta_{Z1}, \beta_{Z2}, \ldots, \beta_{Zm})$
- Overall likelihood function
  - $L(D \mid \theta) = \prod_i \sum_j w_j\, p(x_i \mid \beta_j)$
  - Constraints: $\sum_j w_j = 1$; $\sum_l \beta_{jl} = 1$
- Q: What is $\theta$ here?
Application: Document Clustering
- A vocabulary containing m words
- Each document i:
  - An m-dimensional count vector: $(c_{i1}, c_{i2}, \ldots, c_{im})$
  - $c_{il}$ is the number of occurrences of word l in document i
- Under the unigram assumption:
  - $p(x_i \mid \beta_j) = \frac{(\sum_l c_{il})!}{c_{i1}! \cdots c_{im}!}\, \beta_{j1}^{c_{i1}} \cdots \beta_{jm}^{c_{im}}$
  - The multinomial coefficient depends only on the length of the document, so it is constant with respect to all parameters
Estimating Parameters
- $L(D; \theta) = \sum_i \log \sum_j w_j \prod_l \beta_{jl}^{c_{il}}$
- Apply EM algorithm (see the code sketch below)
- E-step:
  - $w_{ij} = \frac{w_j\, p(x_i \mid \beta_j)}{\sum_{j'} w_{j'}\, p(x_i \mid \beta_{j'})}$
- M-step: maximize the weighted likelihood $\sum_j \sum_i w_{ij} \sum_l c_{il} \log \beta_{jl}$
  - $\beta_{jl} = \frac{\sum_i w_{ij}\, c_{il}}{\sum_{l'} \sum_i w_{ij}\, c_{il'}}$ (the weighted percentage of word l in cluster j); $\quad w_j \propto \sum_i w_{ij}$
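A minimal sketch of these updates for document clustering, assuming a dense n x m word-count matrix C (the smoothing constant and initialization are illustrative):

```python
import numpy as np

def em_multinomial_mixture(C, k, n_iter=100, seed=0):
    """EM for the multinomial mixture, on an n x m matrix of word counts C."""
    C = np.asarray(C, dtype=float)
    rng = np.random.default_rng(seed)
    n, m = C.shape
    w = np.full(k, 1.0 / k)                       # cluster priors w_j
    beta = rng.dirichlet(np.ones(m), size=k)      # k x m word distributions beta_jl
    for _ in range(n_iter):
        # E-step (log space): log w_ij = log w_j + sum_l c_il log beta_jl + const
        logr = np.log(w) + C @ np.log(beta).T     # n x k
        logr -= logr.max(axis=1, keepdims=True)   # stabilize before exponentiating
        r = np.exp(logr)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: beta_jl = weighted fraction of word l in cluster j
        counts = r.T @ C + 1e-12                  # k x m; tiny smoothing, illustrative
        beta = counts / counts.sum(axis=1, keepdims=True)
        w = r.sum(axis=0) / n
    return w, beta, r
```

The multinomial coefficient is dropped in the E-step because, as noted above, it is constant with respect to all parameters and cancels in the normalization.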
Better Way for Topic Modeling
- Topic: a word distribution
- Unigram multinomial mixture model
  - Once the topic of a document is decided, all its words are generated from that topic
- PLSA (probabilistic latent semantic analysis)
  - Every word of a document can be sampled from different topics
- LDA (Latent Dirichlet Allocation)
  - Assumes priors on the word distributions and/or document cluster distribution
Why EM Works?
- E-step: computes a tight lower bound f of the original objective function $\ell$ at $\theta^{old}$
- M-step: finds $\theta^{new}$ to maximize the lower bound
- $\ell(\theta^{new}) \ge f(\theta^{new}) \ge f(\theta^{old}) = \ell(\theta^{old})$, so the objective never decreases

*How to Find a Tight Lower Bound?
- Jensen's inequality
- When does "=" hold, giving a tight lower bound?
- $q(h) = p(h \mid x, \theta)$ (why?), where $q(h)$ is the distribution defining the lower bound we want to get and h is the hidden cluster variable
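To make the bound concrete, here is the standard Jensen's-inequality derivation for a single observation x, with h ranging over the hidden cluster assignments:

```latex
\log p(x \mid \theta)
  = \log \sum_h q(h)\,\frac{p(x, h \mid \theta)}{q(h)}
  \;\ge\; \sum_h q(h)\,\log \frac{p(x, h \mid \theta)}{q(h)}
  \;=\; f(\theta; q)
```

Equality holds when the ratio $p(x, h \mid \theta)/q(h)$ is constant in h, which happens exactly at $q(h) = p(h \mid x, \theta)$; plugging this in makes the lower bound touch the objective at the current $\theta$, which is why the E-step bound is tight.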
Advantages and Disadvantages of Mixture Models
- Strength
  - Mixture models are more general than partitioning
  - Clusters can be characterized by a small number of parameters
  - The results may satisfy the statistical assumptions of the generative models
- Weakness
  - Converges to a local optimum (overcome: run multiple times with random initialization)
  - Computationally expensive if the number of distributions is large, or the data set contains very few observed data points
  - Needs large data sets
  - Hard to estimate the number of clusters
Matrix Data: Clustering: Part 2
- Revisit K-means
- Mixture Model and EM algorithm
- Kernel K-means
- Summary
Kernel K-Means
- How to cluster data whose clusters are not linearly separable? (the original slide shows such a data set; figure omitted)
- A non-linear map $\phi: R^n \to F$
  - Map a data point into a higher/infinite dimensional space: $x \mapsto \phi(x)$
- Dot product matrix $K_{ij}$
  - $K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$
Typical Kernel Functions
- Recall kernel SVM; the same kernels apply here, e.g., the polynomial kernel $K(x, y) = (x^\top y + c)^d$ and the Gaussian (RBF) kernel $K(x, y) = \exp\{-\|x - y\|^2 / 2\sigma^2\}$ (standard examples; the original slide lists them in a table)
Solution of Kernel K-Means
- Objective function under the new feature space:
  - $J = \sum_{j=1}^{k} \sum_i w_{ij} \|\phi(x_i) - c_j\|^2$
- Algorithm
  - By fixing the assignment $w_{ij}$:
    - $c_j = \sum_i w_{ij}\, \phi(x_i) / \sum_i w_{ij}$
  - In the assignment step, assign the data points to the closest center:
    - $d(x_i, c_j) = \left\| \phi(x_i) - \frac{\sum_{i'} w_{i'j}\, \phi(x_{i'})}{\sum_{i'} w_{i'j}} \right\|^2 = \phi(x_i) \cdot \phi(x_i) - 2\, \frac{\sum_{i'} w_{i'j}\, \phi(x_i) \cdot \phi(x_{i'})}{\sum_{i'} w_{i'j}} + \frac{\sum_{i'} \sum_{i''} w_{i'j}\, w_{i''j}\, \phi(x_{i'}) \cdot \phi(x_{i''})}{(\sum_{i'} w_{i'j})^2}$
- We do not really need to know $\phi(x)$: every term is a dot product, so the kernel matrix $K_{ij}$ suffices (see the sketch below)
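A minimal sketch of this kernel-only assignment step in NumPy (function name, initialization, and the RBF choice in the usage note are illustrative assumptions):

```python
import numpy as np

def kernel_kmeans(K, k, n_iter=50, seed=0):
    """Kernel k-means using only the kernel (dot product) matrix K (n x n)."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(k, size=n)              # random initial assignment
    for _ in range(n_iter):
        d2 = np.zeros((n, k))
        for j in range(k):
            mask = labels == j
            nj = mask.sum()
            if nj == 0:                           # empty cluster: never chosen
                d2[:, j] = np.inf
                continue
            # ||phi(x_i) - c_j||^2 expanded into kernel entries only
            d2[:, j] = (np.diag(K)
                        - 2 * K[:, mask].sum(axis=1) / nj
                        + K[np.ix_(mask, mask)].sum() / nj**2)
        labels = d2.argmin(axis=1)
    return labels

# Usage idea: build K from data X (n x d) with an RBF kernel,
# K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma**2)), then call kernel_kmeans(K, k).
```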
Advantages and Disadvantages of Kernel K-Means
- Advantages
  - The algorithm can identify non-linear cluster structures.
- Disadvantages
  - The number of cluster centers needs to be predefined.
  - The algorithm is more complex and its time complexity is large (the full kernel matrix alone takes $O(n^2)$ space).
- References
  - Kernel k-means and Spectral Clustering, by Max Welling.
  - Kernel k-means, Spectral Clustering and Normalized Cut, by Inderjit S. Dhillon, Yuqiang Guan, and Brian Kulis.
  - An Introduction to Kernel Methods, by Colin Campbell.
Matrix Data: Clustering: Part 2
- Revisit K-means
- Mixture Model and EM algorithm
- Kernel K-means
- Summary
Summary
- Revisit k-means
  - Derivative-based solution of the objective
- Mixture models
  - Gaussian mixture model; multinomial mixture model; EM algorithm; connection to k-means
- Kernel k-means
  - Objective function; solution; connection to k-means