SLIDE 1

CS6220: DATA MINING TECHNIQUES

Matrix Data: Clustering: Part 2

Instructor: Yizhou Sun (yzsun@ccs.neu.edu)

October 19, 2014

SLIDE 2

Methods to Learn

By data type (matrix data, set data, sequence data, time series, graph & network):

  • Classification: Decision Tree; Naïve Bayes; Logistic Regression; SVM; kNN; HMM; Label Propagation
  • Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models; kernel k-means; SCAN; Spectral Clustering
  • Frequent Pattern Mining: Apriori; FP-growth; GSP; PrefixSpan
  • Prediction: Linear Regression; Autoregression
  • Similarity Search: DTW; P-PageRank
  • Ranking: PageRank

2

SLIDE 3

Matrix Data: Clustering: Part 2

  • Revisit K-means
  • Mixture Model and EM algorithm
  • Kernel K-means
  • Summary

3

SLIDE 4

Recall K-Means

  • Objective function
  • $J = \sum_{k=1}^{K} \sum_{C(i)=k} \|x_i - c_k\|^2$
  • Total within-cluster variance
  • Re-arrange the objective function
  • $J = \sum_{k=1}^{K} \sum_{i} w_{ik} \|x_i - c_k\|^2$
  • $w_{ik} \in \{0,1\}$; $w_{ik} = 1$ if $x_i$ belongs to cluster $k$, and $w_{ik} = 0$ otherwise
  • Looking for:
  • The best assignment $w_{ik}$
  • The best centers $c_k$

4

SLIDE 5

Solution of K-Means

  • Iterations
  • Step 1: Fix the centers $c_k$, find the assignment $w_{ik}$ that minimizes $J$
  • => $w_{ik} = 1$ if $\|x_i - c_k\|^2$ is the smallest
  • Step 2: Fix the assignment $w_{ik}$, find the centers that minimize $J$
  • => set the first derivative of $J$ to 0
  • => $\frac{\partial J}{\partial c_k} = -2 \sum_i w_{ik}(x_i - c_k) = 0$
  • => $c_k = \frac{\sum_i w_{ik}\, x_i}{\sum_i w_{ik}}$
  • Note that $\sum_i w_{ik}$ is the total number of objects in cluster $k$

5

$J = \sum_{k=1}^{K} \sum_{i} w_{ik} \|x_i - c_k\|^2$

SLIDE 6–10

(No text content was extracted from these slides.)

SLIDE 11

Converges! Why?

SLIDE 12

Limitations of K-Means

  • K-means has problems when clusters are of differing
  • Sizes
  • Densities
  • Non-spherical shapes

12

SLIDE 13

Limitations of K-Means: Different Density and Size

13

SLIDE 14

Limitations of K-Means: Non-Spherical Shapes

14

SLIDE 15

Demo

  • http://webdocs.cs.ualberta.ca/~yaling/Cluster/Applet/Code/Cluster.html

15

SLIDE 16

Connections of K-means to Other Methods

16

K-means, Gaussian Mixture Model, Kernel K-means

SLIDE 17

Matrix Data: Clustering: Part 2

  • Revisit K-means
  • Mixture Model and EM algorithm
  • Kernel K-means
  • Summary

17

SLIDE 18

Fuzzy Set and Fuzzy Cluster

  • Clustering methods discussed so far
  • Every data object is assigned to exactly one cluster
  • Some applications may need fuzzy or soft cluster assignment
  • Ex. an e-game could belong to both entertainment and software
  • Methods: fuzzy clusters and probabilistic model-based clusters
  • Fuzzy cluster: a fuzzy set $S$: $F_S : X \to [0, 1]$ (value between 0 and 1)

18

SLIDE 19

Probabilistic Model-Based Clustering

  • Cluster analysis is to find hidden categories.
  • A hidden category (i.e., probabilistic cluster) is a distribution over the data space, which can be mathematically represented using a probability density function (or distribution function).
  • Ex. categories for digital cameras sold
  • consumer line vs. professional line
  • density functions f1, f2 for C1, C2
  • obtained by probabilistic clustering
  • A mixture model assumes that a set of observed objects is a mixture of instances from multiple probabilistic clusters, and conceptually each observed object is generated independently
  • Our task: infer a set of k probabilistic clusters that is most likely to generate D using the above data generation process

19

SLIDE 20

Mixture Model-Based Clustering

  • A set C of k probabilistic clusters $C_1, \ldots, C_k$ with probability density functions $f_1, \ldots, f_k$, respectively, and their probabilities $w_1, \ldots, w_k$, where $\sum_k w_k = 1$
  • Probability of an object $x_i$ being generated by cluster $C_k$: $P(x_i, z_i = C_k) = w_k f_k(x_i)$
  • Probability of $x_i$ being generated by the set of clusters C: $P(x_i) = \sum_k w_k f_k(x_i)$

20
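As a quick numerical illustration (the numbers are ours, not from the slides): with two clusters where $w_1 = 0.6$, $w_2 = 0.4$ and densities $f_1(x_i) = 0.5$, $f_2(x_i) = 0.2$ at some point $x_i$, we get $P(x_i, z_i = C_1) = 0.6 \times 0.5 = 0.30$ and $P(x_i) = 0.6 \times 0.5 + 0.4 \times 0.2 = 0.38$.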

SLIDE 21

Maximum Likelihood Estimation

  • Since objects are assumed to be generated independently, for a data set $D = \{x_1, \ldots, x_n\}$ we have $P(D) = \prod_i P(x_i) = \prod_i \sum_k w_k f_k(x_i)$
  • Task: find a set C of k probabilistic clusters such that P(D) is maximized

21

SLIDE 22

The EM (Expectation Maximization) Algorithm

  • The (EM) algorithm: a framework to approach maximum likelihood or maximum a posteriori estimates of parameters in statistical models.
  • E-step: assigns objects to clusters according to the current fuzzy clustering or parameters of the probabilistic clusters
  • $w_{ik}^t = p(z_i = k \mid \theta_k^t, x_i) \propto p(x_i \mid C_k^t, \theta_k^t)\, p(C_k^t)$
  • M-step: finds the new clustering or parameters that maximize the expected likelihood

22

SLIDE 23

Case 1: Gaussian Mixture Model

  • Generative model
  • For each object:
  • Pick its distribution component: $Z \sim \mathrm{Multi}(w_1, \ldots, w_K)$
  • Sample a value from the selected distribution: $X \sim N(\mu_Z, \sigma_Z^2)$
  • Overall likelihood function
  • $L(D \mid \theta) = \prod_i \sum_k w_k\, p(x_i \mid \mu_k, \sigma_k^2)$
  • Q: What is $\theta$ here?

23

SLIDE 24

Estimating Parameters

  • ๐‘€ ๐ธ; ๐œ„ = ๐‘— log ๐‘˜ ๐‘ฅ

๐‘˜๐‘ž(๐‘ฆ๐‘—|๐œˆ๐‘˜, ๐œ ๐‘˜ 2)

  • Considering the first derivative of ๐œˆ๐‘˜:
  • ๐œ–๐‘€

๐œ–๐‘ฃ๐‘˜ = ๐‘— ๐‘ฅ๐‘˜ ๐‘˜ ๐‘ฅ๐‘˜๐‘ž(๐‘ฆ๐‘—|๐œˆ๐‘˜,๐œ๐‘˜

2)

๐œ–๐‘ž(๐‘ฆ๐‘—|๐œˆ๐‘˜,๐œ๐‘˜

2)

๐œ–๐œˆ๐‘˜

  • = ๐‘—

๐‘ฅ๐‘˜๐‘ž(๐‘ฆ๐‘—|๐œˆ๐‘˜,๐œ๐‘˜

2)

๐‘˜ ๐‘ฅ๐‘˜๐‘ž(๐‘ฆ๐‘—|๐œˆ๐‘˜,๐œ๐‘˜

2)

1 ๐‘ž(๐‘ฆ๐‘—|๐œˆ๐‘˜,๐œ๐‘˜

2)

๐œ–๐‘ž(๐‘ฆ๐‘—|๐œˆ๐‘˜,๐œ๐‘˜

2)

๐œ–๐œˆ๐‘˜

  • = ๐‘—

๐‘ฅ๐‘˜๐‘ž(๐‘ฆ๐‘—|๐œˆ๐‘˜,๐œ๐‘˜

2)

๐‘˜ ๐‘ฅ๐‘˜๐‘ž(๐‘ฆ๐‘—|๐œˆ๐‘˜,๐œ๐‘˜

2)

๐œ–๐‘š๐‘๐‘•๐‘ž(๐‘ฆ๐‘—|๐œˆ๐‘˜,๐œ๐‘˜

2)

๐œ–๐‘ฃ๐‘˜

24

๐‘ฅ๐‘—๐‘˜ = ๐‘„(๐‘Ž = ๐‘˜|๐‘Œ = ๐‘ฆ๐‘—, ๐œ„) ๐œ–๐‘š(๐‘ฆ๐‘—)/๐œ–๐œˆ๐‘˜

Intractable!

Like weighted likelihood estimation; But the weight is determined by the parameters!

SLIDE 25

Apply EM algorithm

  • An iterative algorithm (at iteration t+1)
  • E (expectation) step
  • Evaluate the weights $w_{ik}$ when $\mu_k$, $\sigma_k$, $w_k$ are given
  • $w_{ik}^t = \frac{w_k^t\, p(x_i \mid \mu_k^t, (\sigma_k^2)^t)}{\sum_{k'} w_{k'}^t\, p(x_i \mid \mu_{k'}^t, (\sigma_{k'}^2)^t)}$
  • M (maximization) step
  • Evaluate $\mu_k$, $\sigma_k$, $w_k$ when the $w_{ik}$'s are given, so that the weighted likelihood is maximized
  • It is equivalent to Gaussian distribution parameter estimation in which each point carries a weight of belonging to each distribution
  • $\mu_k^{t+1} = \frac{\sum_i w_{ik}^t\, x_i}{\sum_i w_{ik}^t}$;  $(\sigma_k^2)^{t+1} = \frac{\sum_i w_{ik}^t\, (x_i - \mu_k^{t+1})^2}{\sum_i w_{ik}^t}$;  $w_k^{t+1} \propto \sum_i w_{ik}^t$

25
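A minimal NumPy/SciPy sketch of these updates for one-dimensional data (illustrative only; the function name `gmm_em` and the use of `scipy.stats.norm` are our own choices, not from the slides):

```python
import numpy as np
from scipy.stats import norm

def gmm_em(x, K, n_iters=50, seed=0):
    """EM for a 1-D Gaussian mixture, following the E/M updates above."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, K, replace=False)   # initial means: K random data points
    var = np.full(K, x.var())              # initial variances
    w = np.full(K, 1.0 / K)                # initial mixture weights
    for _ in range(n_iters):
        # E-step: posterior weight w_ik of each point under each component.
        dens = np.stack([w[k] * norm.pdf(x, mu[k], np.sqrt(var[k]))
                         for k in range(K)], axis=1)          # shape (n, K)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted means/variances; weights proportional to responsibilities.
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        w = nk / nk.sum()
    return w, mu, var
```

Each iteration never decreases the data log-likelihood, which is exactly the guarantee discussed on Slides 32 and 33.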

SLIDE 26

K-Means: A Special Case of Gaussian Mixture Model

  • When each Gaussian component has covariance matrix $\sigma^2 I$
  • Soft K-means
  • $p(x_i \mid \mu_k, \sigma^2) \propto \exp\{-\|x_i - \mu_k\|^2 / \sigma^2\}$
  • When $\sigma^2 \to 0$
  • Soft assignment becomes hard assignment
  • $w_{ik} \to 1$ if $x_i$ is closest to $\mu_k$ (why?)

26

Distance!
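To spell out the limit (our own expansion of the slide's "why?"): dividing the E-step weight by its numerator gives

$$w_{ik} = \frac{w_k \exp\{-\|x_i - \mu_k\|^2/\sigma^2\}}{\sum_{k'} w_{k'} \exp\{-\|x_i - \mu_{k'}\|^2/\sigma^2\}} = \frac{w_k}{\sum_{k'} w_{k'} \exp\{(\|x_i - \mu_k\|^2 - \|x_i - \mu_{k'}\|^2)/\sigma^2\}}.$$

If $\mu_k$ is the closest center, every exponent with $k' \neq k$ is negative, so those terms vanish as $\sigma^2 \to 0$ and $w_{ik} \to w_k / w_k = 1$; for any other component the weight goes to 0.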

SLIDE 27

Case 2: Multinomial Mixture Model

  • Generative model
  • For each object:
  • Pick its distribution component: $Z \sim \mathrm{Multi}(w_1, \ldots, w_K)$
  • Sample a value from the selected distribution: $X \sim \mathrm{Multi}(\beta_{Z1}, \beta_{Z2}, \ldots, \beta_{Zm})$
  • Overall likelihood function
  • $L(D \mid \theta) = \prod_i \sum_k w_k\, p(x_i \mid \beta_k)$
  • $\sum_k w_k = 1$; $\sum_l \beta_{kl} = 1$
  • Q: What is $\theta$ here?

27

SLIDE 28

Application: Document Clustering

  • A vocabulary containing m words
  • Each document i:
  • An m-dimensional vector: $(c_{i1}, c_{i2}, \ldots, c_{im})$
  • $c_{il}$ is the number of occurrences of word $l$ in document $i$
  • Under the unigram assumption:
  • $p(x_i \mid \beta_k) = \frac{(\sum_l c_{il})!}{c_{i1}! \cdots c_{im}!}\, \beta_{k1}^{c_{i1}} \cdots \beta_{km}^{c_{im}}$

28

($\sum_l c_{il}$ is the length of document $i$; the multinomial coefficient is constant with respect to all parameters.)

SLIDE 29

Example

29

SLIDE 30

Estimating Parameters

  • ๐‘š ๐ธ; ๐œ„ = ๐‘— log ๐‘˜ ๐œ•๐‘˜ ๐‘š ๐‘‘๐‘—๐‘š๐‘š๐‘๐‘•๐›พ๐‘˜๐‘š
  • Apply EM algorithm
  • E-step:
  • w๐‘—๐‘˜ =

๐‘ฅ๐‘˜๐‘ž(๐’š๐‘—|๐œธ๐‘˜) ๐‘˜ ๐‘ฅ๐‘˜๐‘ž(๐’š๐‘—|๐œธ๐‘˜)

  • M-step: maximize weighted likelihood

๐‘— ๐‘ฅ๐‘—๐‘˜ ๐‘š ๐‘‘๐‘—๐‘š๐‘š๐‘๐‘•๐›พ๐‘˜๐‘š

  • ๐›พ๐‘˜๐‘š =

๐‘— ๐‘ฅ๐‘—๐‘˜๐‘‘๐‘—๐‘š ๐‘šโ€ฒ ๐‘— ๐‘ฅ๐‘—๐‘˜๐‘‘๐‘—๐‘šโ€ฒ ; ๐œ•๐‘˜ โˆ ๐‘— ๐‘ฅ๐‘—๐‘˜

30

($\beta_{kl}$ is the weighted percentage of word $l$ in cluster $k$.)
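A minimal NumPy/SciPy sketch of this document-clustering EM, working in log space (illustrative; the name `counts` for the document-by-word count matrix is our own):

```python
import numpy as np
from scipy.special import logsumexp

def multinomial_mixture_em(counts, K, n_iters=50, seed=0, eps=1e-12):
    """EM for a multinomial (unigram) mixture over a document-word count matrix."""
    n_docs, n_words = counts.shape
    rng = np.random.default_rng(seed)
    beta = rng.dirichlet(np.ones(n_words), size=K)   # word distribution per cluster
    w = np.full(K, 1.0 / K)                          # cluster priors
    for _ in range(n_iters):
        # E-step: log p(doc i | cluster k), up to the constant multinomial coefficient.
        log_lik = counts @ np.log(beta + eps).T                  # (n_docs, K)
        log_post = np.log(w + eps) + log_lik
        r = np.exp(log_post - logsumexp(log_post, axis=1, keepdims=True))
        # M-step: beta_kl = weighted percentage of word l in cluster k; w_k ∝ sum_i w_ik.
        weighted_counts = r.T @ counts                           # (K, n_words)
        beta = weighted_counts / weighted_counts.sum(axis=1, keepdims=True)
        w = r.sum(axis=0) / n_docs
    return w, beta, r
```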

SLIDE 31

Better Way for Topic Modeling

  • Topic: a word distribution
  • Unigram multinomial mixture model
  • Once the topic of a document is decided, all its

words are generated from that topic

  • PLSA (probabilistic latent semantic analysis)
  • Every word of a document can be sampled from

different topics

  • LDA (Latent Dirichlet Allocation)
  • Assume priors on word distribution and/or

document cluster distribution

31

SLIDE 32

Why EM Works?

  • E-step: compute a tight lower bound $f$ of the original objective function at $\theta^{old}$
  • M-step: find $\theta^{new}$ that maximizes the lower bound
  • $\log L(\theta^{new}) \geq f(\theta^{new}) \geq f(\theta^{old}) = \log L(\theta^{old})$

32

SLIDE 33

*How to Find Tight Lower Bound?

  • Jensen's inequality
  • When does "=" hold, giving a tight lower bound?
  • $q(z) = p(z \mid x, \theta)$ (why?)

33

$q(z)$: the distribution that gives the tight lower bound we want.
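As a sketch of the bound itself (our own notation, written for a single data point $x$ for brevity): for any distribution $q(z)$ over the latent cluster variable,

$$\log p(x \mid \theta) = \log \sum_z q(z)\, \frac{p(x, z \mid \theta)}{q(z)} \;\geq\; \sum_z q(z) \log \frac{p(x, z \mid \theta)}{q(z)},$$

by Jensen's inequality (log is concave, so the log of an average is at least the average of the logs). Equality holds exactly when $p(x, z \mid \theta)/q(z)$ is constant in $z$, i.e., when $q(z) = p(z \mid x, \theta)$, which is why the E-step sets $q$ to the posterior.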

SLIDE 34

Advantages and Disadvantages of Mixture Models

  • Strength
  • Mixture models are more general than partitioning
  • Clusters can be characterized by a small number of parameters
  • The results may satisfy the statistical assumptions of the generative models
  • Weakness
  • Converges to a local optimum (overcome: run multiple times with random initialization)
  • Computationally expensive if the number of distributions is large, or if the data set contains very few observed data points
  • Needs large data sets
  • Hard to estimate the number of clusters

34

SLIDE 35

Matrix Data: Clustering: Part 2

  • Revisit K-means
  • Mixture Model and EM algorithm
  • Kernel K-means
  • Summary

35

SLIDE 36

Kernel K-Means

  • How to cluster the following data?
  • A non-linear map $\phi: \mathbb{R}^n \to F$
  • Map a data point into a higher- or infinite-dimensional space: $x \to \phi(x)$
  • Dot product matrix $K_{ij}$
  • $K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$

36

SLIDE 37

Typical Kernel Functions

  • Recall kernel SVM:
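(The table of kernels on this slide was not extracted. Two standard examples, noted here for completeness: the polynomial kernel $K(x_i, x_j) = (x_i \cdot x_j + c)^d$ and the Gaussian (RBF) kernel $K(x_i, x_j) = \exp\{-\|x_i - x_j\|^2 / 2\sigma^2\}$.)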

37

SLIDE 38

Solution of Kernel K-Means

  • Objective function under the new feature space:
  • $J = \sum_{k=1}^{K} \sum_i w_{ik} \|\phi(x_i) - c_k\|^2$
  • Algorithm
  • By fixing the assignment $w_{ik}$:
  • $c_k = \sum_i w_{ik}\, \phi(x_i) / \sum_i w_{ik}$
  • In the assignment step, assign each data point to the closest center:
  • $d(x_i, c_k) = \left\|\phi(x_i) - \frac{\sum_j w_{jk}\,\phi(x_j)}{\sum_j w_{jk}}\right\|^2 = \phi(x_i)\cdot\phi(x_i) - 2\,\frac{\sum_j w_{jk}\,\phi(x_i)\cdot\phi(x_j)}{\sum_j w_{jk}} + \frac{\sum_j \sum_l w_{jk}\, w_{lk}\, \phi(x_j)\cdot\phi(x_l)}{\left(\sum_j w_{jk}\right)^2}$

38

We do not really need to know $\phi(x)$; only the kernel matrix $K_{ij}$ is needed.
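A minimal NumPy sketch of this kernel-based assignment step (illustrative; the RBF kernel and the names `rbf_kernel` / `kernel_kmeans` are our own choices, not from the slides):

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    """Gaussian (RBF) kernel matrix: K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2 * sigma ** 2))

def kernel_kmeans(K, n_clusters, n_iters=100, seed=0):
    """Kernel k-means that touches only the kernel matrix K, never phi(x) itself."""
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    assign = rng.integers(n_clusters, size=n)
    for _ in range(n_iters):
        dist = np.zeros((n, n_clusters))
        for k in range(n_clusters):
            mask = assign == k
            nk = max(mask.sum(), 1)
            # d(x_i, c_k) = K_ii - 2 * (sum_{j in k} K_ij) / n_k + (sum_{j,l in k} K_jl) / n_k^2
            dist[:, k] = (np.diag(K)
                          - 2 * K[:, mask].sum(axis=1) / nk
                          + K[np.ix_(mask, mask)].sum() / nk ** 2)
        new_assign = dist.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break
        assign = new_assign
    return assign

# Hypothetical usage: labels = kernel_kmeans(rbf_kernel(X), n_clusters=2)
```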

SLIDE 39

Advantages and Disadvantages of Kernel K-Means

  • Advantages
  • The algorithm is able to identify non-linear structures.
  • Disadvantages
  • The number of cluster centers needs to be predefined.
  • The algorithm is complex in nature, and its time complexity is large.
  • References
  • Kernel k-means and Spectral Clustering, by Max Welling.
  • Kernel k-means, Spectral Clustering and Normalized Cuts, by Inderjit S. Dhillon, Yuqiang Guan and Brian Kulis.
  • An Introduction to Kernel Methods, by Colin Campbell.

39

SLIDE 40

Matrix Data: Clustering: Part 2

  • Revisit K-means
  • Mixture Model and EM algorithm
  • Kernel K-means
  • Summary

40

SLIDE 41

Summary

  • Revisit k-means
  • Derivative-based solution
  • Mixture models
  • Gaussian mixture model; multinomial mixture model; EM algorithm; connection to k-means
  • Kernel k-means
  • Objective function; solution; connection to k-means

41