CS6220: DATA MINING TECHNIQUES Matrix Data: Clustering: Part 2


  1. CS6220: DATA MINING TECHNIQUES Matrix Data: Clustering: Part 2 Instructor: Yizhou Sun yzsun@ccs.neu.edu October 19, 2014

  2. Methods to Learn (course overview: methods by task and data type)
  • Classification: Decision Tree; Naïve Bayes; Logistic Regression; SVM; kNN (matrix data); HMM (sequence data); Label Propagation (graph & network)
  • Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models; kernel k-means (matrix data); SCAN; Spectral Clustering (graph & network)
  • Frequent Pattern Mining: Apriori; FP-growth (set data); GSP; PrefixSpan (sequence data)
  • Prediction: Linear Regression (matrix data); Autoregression (time series)
  • Similarity Search: DTW (time series); P-PageRank (graph & network)
  • Ranking: PageRank (graph & network)

  3. Matrix Data: Clustering: Part 2
  • Revisit K-means
  • Mixture Model and EM algorithm
  • Kernel K-means
  • Summary

  4. Recall K-Means
  • Objective function: $J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \|x_i - c_j\|^2$
  • Total within-cluster variance
  • Re-arrange the objective function: $J = \sum_{j=1}^{k} \sum_{i} w_{ij} \|x_i - c_j\|^2$, where $w_{ij} \in \{0, 1\}$
  • $w_{ij} = 1$ if $x_i$ belongs to cluster $j$; $w_{ij} = 0$ otherwise
  • Looking for:
  • The best assignment $w_{ij}$
  • The best centers $c_j$
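As a quick illustration of the objective (a minimal NumPy sketch of our own, not from the slides; the name `kmeans_objective` is ours), the total within-cluster variance can be evaluated directly once assignments and centers are given:

```python
import numpy as np

def kmeans_objective(X, centers, assign):
    """J = sum_j sum_{i in C_j} ||x_i - c_j||^2.

    X:       (n, d) data matrix
    centers: (k, d) cluster centers c_j
    assign:  (n,) index of the cluster each object is assigned to
    """
    return np.sum((X - centers[assign]) ** 2)
```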

  5. Solution of K-Means
  • $J = \sum_{j=1}^{k} \sum_{i} w_{ij} \|x_i - c_j\|^2$
  • Iterations:
  • Step 1: Fix the centers $c_j$, find the assignment $w_{ij}$ that minimizes $J$
  • => $w_{ij} = 1$ if $\|x_i - c_j\|^2$ is the smallest
  • Step 2: Fix the assignment $w_{ij}$, find the centers that minimize $J$
  • => set the first derivative of $J$ to 0: $\partial J / \partial c_j = -2 \sum_i w_{ij} (x_i - c_j) = 0$
  • => $c_j = \sum_i w_{ij} x_i / \sum_i w_{ij}$
  • Note: $\sum_i w_{ij}$ is the total number of objects in cluster $j$
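A minimal sketch of these two alternating steps (our own code, not the instructor's; random initialization is an assumption):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Alternate: fix centers -> assign each point to its closest center,
    fix assignments -> recompute each center as the mean of its points."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Step 1: w_ij = 1 for the center with the smallest ||x_i - c_j||^2
        dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dist2.argmin(axis=1)
        # Step 2: c_j = sum_i w_ij x_i / sum_i w_ij (mean of assigned points)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return centers, assign
```

Each step can only decrease (or keep) $J$, and there are finitely many assignments, which is why the iterations converge (see the next slide's question).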

  6. Converges! Why?

  7. Limitations of K-Means
  • K-means has problems when clusters are of differing:
  • Sizes
  • Densities
  • Non-spherical shapes

  8. Limitations of K-Means: Different Density and Size

  9. Limitations of K-Means: Non-Spherical Shapes

  10. Demo
  • http://webdocs.cs.ualberta.ca/~yaling/Cluster/Applet/Code/Cluster.html

  11. Connections of K-means to Other Methods
  • K-means
  • Gaussian Mixture Model
  • Kernel K-means

  12. Matrix Data: Clustering: Part 2
  • Revisit K-means
  • Mixture Model and EM algorithm
  • Kernel K-means
  • Summary

  13. Fuzzy Set and Fuzzy Cluster
  • Clustering methods discussed so far: every data object is assigned to exactly one cluster
  • Some applications may need fuzzy or soft cluster assignment
  • Ex. an e-game could belong to both entertainment and software
  • Methods: fuzzy clusters and probabilistic model-based clusters
  • Fuzzy cluster: a fuzzy set $S$ with membership function $F_S: X \to [0, 1]$ (a value between 0 and 1)

  14. Probabilistic Model-Based Clustering
  • Cluster analysis is to find hidden categories.
  • A hidden category (i.e., a probabilistic cluster) is a distribution over the data space, which can be mathematically represented using a probability density function (or distribution function).
  • Ex. categories for digital cameras sold: consumer line vs. professional line; density functions $f_1, f_2$ for $C_1, C_2$ are obtained by probabilistic clustering
  • A mixture model assumes that a set of observed objects is a mixture of instances from multiple probabilistic clusters, and conceptually each observed object is generated independently
  • Our task: infer a set of $k$ probabilistic clusters that is most likely to generate $D$ using the above data generation process

  15. Mixture Model-Based Clustering
  • A set $C$ of $k$ probabilistic clusters $C_1, \dots, C_k$ with probability density functions $f_1, \dots, f_k$, respectively, and their probabilities $w_1, \dots, w_k$, where $\sum_j w_j = 1$
  • Probability of an object $x_i$ being generated by cluster $C_j$: $P(x_i, z_i = C_j) = w_j f_j(x_i)$
  • Probability of $x_i$ being generated by the set of clusters $C$: $P(x_i) = \sum_j w_j f_j(x_i)$
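A small illustration of these two quantities (our own example, not from the slides; the Gaussian components and weights are assumptions made just to have concrete densities $f_1, f_2$):

```python
import numpy as np
from scipy.stats import norm

w = np.array([0.6, 0.4])                              # mixture weights, sum to 1
densities = [norm(0.0, 1.0).pdf, norm(5.0, 2.0).pdf]  # density functions f_1, f_2

def p_joint(x, j):
    """P(x_i, z_i = C_j) = w_j f_j(x_i)."""
    return w[j] * densities[j](x)

def p_marginal(x):
    """P(x_i) = sum_j w_j f_j(x_i)."""
    return sum(p_joint(x, j) for j in range(len(w)))

print(p_marginal(1.0))
```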

  16. Maximum Likelihood Estimation
  • Since objects are assumed to be generated independently, for a data set $D = \{x_1, \dots, x_n\}$ we have $P(D) = \prod_i P(x_i) = \prod_i \sum_j w_j f_j(x_i)$
  • Task: find a set $C$ of $k$ probabilistic clusters such that $P(D)$ is maximized

  17. The EM (Expectation Maximization) Algorithm
  • The EM algorithm: a framework to approach maximum likelihood or maximum a posteriori estimates of parameters in statistical models.
  • E-step: assigns objects to clusters according to the current fuzzy clustering or the current parameters of the probabilistic clusters: $w_{ij}^t = p(z_i = j \mid \theta_j^t, x_i) \propto p(x_i \mid C_j^t, \theta_j^t)$
  • M-step: finds the new clustering or parameters that maximize the expected likelihood
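The framework itself is just an alternation of these two steps; a generic sketch (our own skeleton, with hypothetical callback names `e_step` and `m_step`) might look like:

```python
def em(data, params, e_step, m_step, n_iter=100):
    """Generic EM loop: the E-step computes soft assignments w_ij from the
    current parameters, the M-step re-estimates parameters from those weights."""
    weights = None
    for _ in range(n_iter):
        weights = e_step(data, params)   # w_ij proportional to p(x_i | C_j, theta_j)
        params = m_step(data, weights)   # maximize the expected (weighted) likelihood
    return params, weights
```

The two concrete cases below (Gaussian and multinomial mixtures) fill in what the E-step and M-step compute.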

  18. Case 1: Gaussian Mixture Model
  • Generative model: for each object,
  • Pick its distribution component: $Z \sim \mathrm{Multi}(w_1, \dots, w_k)$
  • Sample a value from the selected distribution: $X \sim N(\mu_Z, \sigma_Z^2)$
  • Overall likelihood function: $L(D \mid \theta) = \prod_i \sum_j w_j \, p(x_i \mid \mu_j, \sigma_j^2)$
  • Q: What is $\theta$ here?
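A minimal sketch of this generative process (our own example; the particular weights, means, and variances are made-up values), where $\theta$ collects $\{w_j, \mu_j, \sigma_j^2\}$:

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([0.5, 0.3, 0.2])      # component weights w_j
mu = np.array([0.0, 4.0, 10.0])    # component means mu_j
sigma = np.array([1.0, 0.5, 2.0])  # component standard deviations sigma_j

def sample(n):
    # Pick a component Z ~ Multi(w_1, ..., w_k), then draw X ~ N(mu_Z, sigma_Z^2)
    z = rng.choice(len(w), size=n, p=w)
    return rng.normal(mu[z], sigma[z]), z

X, Z = sample(1000)
```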

  19. Estimating Parameters
  • $L(D; \theta) = \sum_i \log \sum_j w_j \, p(x_i \mid \mu_j, \sigma_j^2)$ — intractable to maximize directly!
  • Considering the first derivative with respect to $\mu_j$:
  • $\frac{\partial L}{\partial \mu_j} = \sum_i \frac{w_j}{\sum_{j'} w_{j'} \, p(x_i \mid \mu_{j'}, \sigma_{j'}^2)} \cdot \frac{\partial p(x_i \mid \mu_j, \sigma_j^2)}{\partial \mu_j}$
  • $= \sum_i \frac{w_j \, p(x_i \mid \mu_j, \sigma_j^2)}{\sum_{j'} w_{j'} \, p(x_i \mid \mu_{j'}, \sigma_{j'}^2)} \cdot \frac{1}{p(x_i \mid \mu_j, \sigma_j^2)} \cdot \frac{\partial p(x_i \mid \mu_j, \sigma_j^2)}{\partial \mu_j}$
  • $= \sum_i \frac{w_j \, p(x_i \mid \mu_j, \sigma_j^2)}{\sum_{j'} w_{j'} \, p(x_i \mid \mu_{j'}, \sigma_{j'}^2)} \cdot \frac{\partial \log p(x_i \mid \mu_j, \sigma_j^2)}{\partial \mu_j}$
  • This looks like weighted likelihood estimation, with weight $w_{ij} = P(Z = j \mid X = x_i, \theta)$; but the weight is itself determined by the parameters!

  20. Apply the EM Algorithm
  • An iterative algorithm (at iteration $t+1$):
  • E (expectation) step: evaluate the weights $w_{ij}$ given the current $\mu_j, \sigma_j, w_j$:
  • $w_{ij}^t = \frac{w_j^t \, p(x_i \mid \mu_j^t, (\sigma_j^2)^t)}{\sum_{j'} w_{j'}^t \, p(x_i \mid \mu_{j'}^t, (\sigma_{j'}^2)^t)}$
  • M (maximization) step: evaluate the $\mu_j, \sigma_j, w_j$ that maximize the weighted likelihood given the $w_{ij}$'s
  • This is equivalent to Gaussian distribution parameter estimation where each point carries a weight of belonging to each distribution:
  • $\mu_j^{t+1} = \frac{\sum_i w_{ij}^t x_i}{\sum_i w_{ij}^t}$; $(\sigma_j^2)^{t+1} = \frac{\sum_i w_{ij}^t (x_i - \mu_j^{t+1})^2}{\sum_i w_{ij}^t}$; $w_j^{t+1} \propto \sum_i w_{ij}^t$
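A sketch of these updates for a 1-D Gaussian mixture (our own code under our own initialization choices, not the instructor's implementation):

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(X, k, n_iter=50, seed=0):
    """EM for a 1-D Gaussian mixture, following the E/M updates above."""
    rng = np.random.default_rng(seed)
    w = np.full(k, 1.0 / k)                     # mixture weights w_j
    mu = rng.choice(X, size=k, replace=False)   # means mu_j
    var = np.full(k, X.var())                   # variances sigma_j^2
    for _ in range(n_iter):
        # E-step: w_ij = w_j p(x_i | mu_j, sigma_j^2) / sum_j' w_j' p(x_i | mu_j', sigma_j'^2)
        dens = norm.pdf(X[:, None], mu[None, :], np.sqrt(var)[None, :])  # (n, k)
        resp = w * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: weighted mean, weighted variance, weight proportional to sum_i w_ij
        nk = resp.sum(axis=0)
        mu = (resp * X[:, None]).sum(axis=0) / nk
        var = (resp * (X[:, None] - mu) ** 2).sum(axis=0) / nk
        w = nk / nk.sum()
    return w, mu, var
```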

  21. K-Means: A Special Case of the Gaussian Mixture Model
  • When each Gaussian component has covariance matrix $\sigma^2 I$:
  • Soft K-means: $p(x_i \mid \mu_j, \sigma^2) \propto \exp\{-\|x_i - \mu_j\|^2 / \sigma^2\}$ (note the squared distance in the exponent!)
  • When $\sigma^2 \to 0$:
  • Soft assignment becomes hard assignment
  • $w_{ij} \to 1$ if $x_i$ is closest to $\mu_j$ (why?)
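A tiny numerical illustration of this limit (our own example values): the soft assignments are a softmax of $-\|x_i - \mu_j\|^2 / \sigma^2$, and as $\sigma^2 \to 0$ the closest center dominates the exponentials, so the weights collapse to a one-hot (hard) assignment.

```python
import numpy as np

x = np.array([1.0, 2.0])
mu = np.array([[0.0, 0.0], [2.0, 2.0], [5.0, 5.0]])  # three centers
d2 = ((x - mu) ** 2).sum(axis=1)                     # squared distances to each center

for sigma2 in [10.0, 1.0, 0.01]:
    logits = -d2 / sigma2
    w = np.exp(logits - logits.max())                # numerically stabilized softmax
    w /= w.sum()
    print(sigma2, np.round(w, 3))
# As sigma2 shrinks, w approaches the one-hot vector of the nearest center.
```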

  22. Case 2: Multinomial Mixture Model
  • Generative model: for each object,
  • Pick its distribution component: $Z \sim \mathrm{Multi}(w_1, \dots, w_k)$
  • Sample a value from the selected distribution: $X \sim \mathrm{Multi}(\beta_{Z1}, \beta_{Z2}, \dots, \beta_{Zm})$
  • Overall likelihood function: $L(D \mid \theta) = \prod_i \sum_j w_j \, p(\mathbf{x}_i \mid \boldsymbol{\beta}_j)$
  • with $\sum_j w_j = 1$ and $\sum_l \beta_{jl} = 1$
  • Q: What is $\theta$ here?

  23. Application: Document Clustering
  • A vocabulary containing $m$ words
  • Each document $i$ is an $m$-dimensional count vector $(c_{i1}, c_{i2}, \dots, c_{im})$, where $c_{il}$ is the number of occurrences of word $l$ in document $i$
  • Under the unigram assumption: $p(\mathbf{x}_i \mid \boldsymbol{\beta}_j) = \frac{(\sum_l c_{il})!}{c_{i1}! \cdots c_{im}!} \, \beta_{j1}^{c_{i1}} \cdots \beta_{jm}^{c_{im}}$
  • $\sum_l c_{il}$ is the length of document $i$, and the multinomial coefficient is constant with respect to all parameters
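A small sketch of evaluating this document probability in log space (our own code; the function name and the example counts/word distribution are made up). Since the multinomial coefficient does not depend on the parameters, it is skipped by default:

```python
import numpy as np

def unigram_log_prob(counts, beta_j, include_coeff=False):
    """log p(x_i | beta_j) for a word-count vector under the unigram model."""
    counts = np.asarray(counts, dtype=float)
    logp = np.sum(counts * np.log(beta_j))      # sum_l c_il log beta_jl
    if include_coeff:
        from scipy.special import gammaln
        # log of the multinomial coefficient (constant in the parameters)
        logp += gammaln(counts.sum() + 1) - gammaln(counts + 1).sum()
    return logp

# Example: a 4-word vocabulary, one document, one cluster's word distribution
print(unigram_log_prob([2, 0, 1, 3], np.array([0.4, 0.1, 0.2, 0.3])))
```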

  24. Example

  25. Estimating Parameters
  • $\ell(D; \theta) = \sum_i \log \sum_j w_j \prod_l \beta_{jl}^{c_{il}}$ (dropping the constant multinomial coefficient)
  • Apply the EM algorithm
  • E-step: $w_{ij} = \frac{w_j \, p(\mathbf{x}_i \mid \boldsymbol{\beta}_j)}{\sum_{j'} w_{j'} \, p(\mathbf{x}_i \mid \boldsymbol{\beta}_{j'})}$
  • M-step: maximize the weighted likelihood $\sum_i \sum_j w_{ij} \sum_l c_{il} \log \beta_{jl}$
  • $\beta_{jl} = \frac{\sum_i w_{ij} c_{il}}{\sum_{l'} \sum_i w_{ij} c_{il'}}$ (the weighted percentage of word $l$ in cluster $j$); $w_j \propto \sum_i w_{ij}$
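A sketch of these E/M updates on a document-word count matrix (our own code under our own initialization; `C` is the $n \times m$ matrix of counts $c_{il}$, and the Dirichlet initialization is an assumption):

```python
import numpy as np

def em_multinomial_mixture(C, k, n_iter=50, seed=0, eps=1e-12):
    """EM for the multinomial mixture: soft document-cluster weights w_ij in the
    E-step, re-estimated word distributions beta_jl and weights w_j in the M-step."""
    rng = np.random.default_rng(seed)
    n, m = C.shape
    w = np.full(k, 1.0 / k)                     # cluster weights w_j
    beta = rng.dirichlet(np.ones(m), size=k)    # word distributions beta_jl, rows sum to 1
    for _ in range(n_iter):
        # E-step: w_ij proportional to w_j * prod_l beta_jl^{c_il}, computed in log space
        log_resp = np.log(w + eps) + C @ np.log(beta.T + eps)   # (n, k)
        log_resp -= log_resp.max(axis=1, keepdims=True)
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: beta_jl = sum_i w_ij c_il / sum_l' sum_i w_ij c_il'; w_j proportional to sum_i w_ij
        counts = resp.T @ C                      # (k, m) weighted word counts per cluster
        beta = counts / counts.sum(axis=1, keepdims=True)
        w = resp.sum(axis=0) / n
    return w, beta, resp
```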

  26. Better Way for Topic Modeling
  • Topic: a word distribution
  • Unigram multinomial mixture model: once the topic of a document is decided, all its words are generated from that topic
  • PLSA (probabilistic latent semantic analysis): every word of a document can be sampled from a different topic
  • LDA (Latent Dirichlet Allocation): assumes priors on the word distributions and/or the document cluster distribution
