
CS145: INTRODUCTION TO DATA MINING - Text Data: Topic Model (PowerPoint PPT presentation)



  1. CS145: INTRODUCTION TO DATA MINING Text Data: Topic Model Instructor: Yizhou Sun yzsun@cs.ucla.edu December 4, 2017

  2. Methods to be Learnt
     • Classification: Logistic Regression; Decision Tree; KNN; SVM; NN (vector data); Naïve Bayes for Text (text data)
     • Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models (vector data); pLSA (text data)
     • Prediction: Linear Regression; GLM* (vector data)
     • Frequent Pattern Mining: Apriori; FP growth (set data); GSP; PrefixSpan (sequence data)
     • Similarity Search: DTW (sequence data)

  3. Text Data: Topic Models
     • Text Data and Topic Models
     • Revisit of Mixture Model
     • Probabilistic Latent Semantic Analysis (pLSA)
     • Summary

  4. Text Data
     • Word/term
     • Document: a sequence of words
     • Corpus: a collection of documents

  5. Represent a Document
     • Most common way: Bag-of-Words
       • Ignore the order of words; keep only the counts
     • Vector space model (the original slide shows an example term-document count matrix over documents c1-c5 and m1-m4)
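To make the bag-of-words idea concrete, here is a minimal sketch that turns a few toy documents into count vectors over a vocabulary built from the corpus itself; the example documents are made up for illustration.

```python
# Minimal bag-of-words sketch (toy documents; vocabulary built from the corpus itself).
from collections import Counter

docs = [
    "data mining mining frequent pattern",
    "web information retrieval information",
]

# Vocabulary: sorted list of distinct words across the corpus.
vocab = sorted({w for d in docs for w in d.split()})

def bag_of_words(doc, vocab):
    """Return a count vector aligned with `vocab`; word order is discarded."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

for d in docs:
    print(bag_of_words(d, vocab))
```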

  6. Topics
     • A topic is represented by a word distribution
     • A topic usually relates to an issue

  7. Topic Models
     • Topic modeling
       • Get topics automatically from a corpus
       • Assign documents to topics automatically
     • Most frequently used topic models
       • pLSA
       • LDA

  8. Text Data: Topic Models
     • Text Data and Topic Models
     • Revisit of Mixture Model
     • Probabilistic Latent Semantic Analysis (pLSA)
     • Summary

  9. Mixture Model-Based Clustering
     • A set C of k probabilistic clusters C_1, ..., C_k
       • Probability density/mass functions: f_1, ..., f_k
       • Cluster prior probabilities: w_1, ..., w_k, with Σ_j w_j = 1
     • Joint probability of an object x_i and its cluster C_j:
       • P(x_i, z_i = C_j) = w_j f_j(x_i)
       • z_i: hidden random variable
     • Probability of x_i:
       • P(x_i) = Σ_j w_j f_j(x_i)
     (the slide also plots two example density functions f_1(x) and f_2(x))

  10. Maximum Likelihood Estimation
      • Since objects are assumed to be generated independently, for a data set D = {x_1, ..., x_n} we have
        • P(D) = ∏_i P(x_i) = ∏_i Σ_j w_j f_j(x_i)
        • log P(D) = Σ_i log P(x_i) = Σ_i log Σ_j w_j f_j(x_i)
      • Task: find a set C of k probabilistic clusters such that P(D) is maximized
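As a concrete reading of the log-likelihood above, the sketch below evaluates log P(D) = Σ_i log Σ_j w_j f_j(x_i) for a two-component Gaussian mixture; the weights, means, standard deviations, and data points are illustrative values, not from the slides.

```python
# Sketch: mixture log-likelihood  log P(D) = sum_i log sum_j w_j * f_j(x_i).
# All parameter values below are illustrative only.
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

w = np.array([0.4, 0.6])          # cluster priors, sum to 1
mu = np.array([-1.0, 2.0])        # component means
sigma = np.array([0.5, 1.0])      # component standard deviations
D = np.array([-1.2, 0.3, 1.9, 2.5])

# P(x_i) = sum_j w_j f_j(x_i); shape (n, k) before summing over components.
per_component = w * gaussian_pdf(D[:, None], mu, sigma)
log_likelihood = np.log(per_component.sum(axis=1)).sum()
print(log_likelihood)
```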

  11. Gaussian Mixture Model
      • Generative model: for each object
        • Pick its cluster, i.e., a distribution component: Z ~ Multinoulli(w_1, ..., w_k)
        • Sample a value from the selected distribution: X | Z ~ N(μ_Z, σ_Z²)
      • Overall likelihood function
        • L(D | θ) = ∏_i Σ_j w_j p(x_i | μ_j, σ_j²), s.t. Σ_j w_j = 1 and w_j ≥ 0
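The generative story on this slide can be simulated directly: pick a component according to the priors, then sample from that component's Gaussian. A minimal sketch with made-up parameters:

```python
# Sketch of the GMM generative process: Z ~ Multinoulli(w), X | Z ~ N(mu_Z, sigma_Z^2).
# Parameter values are illustrative, not from the lecture.
import numpy as np

rng = np.random.default_rng(0)
w = np.array([0.4, 0.6])
mu = np.array([-1.0, 2.0])
sigma = np.array([0.5, 1.0])

def sample_gmm(n):
    z = rng.choice(len(w), size=n, p=w)   # pick the cluster for each object
    x = rng.normal(mu[z], sigma[z])       # sample from the selected component
    return z, x

z, x = sample_gmm(5)
print(list(zip(z, np.round(x, 2))))
```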

  12. Multinomial Mixture Model
      • For documents with bag-of-words representation
        • x_d = (x_{d1}, x_{d2}, ..., x_{dN}), where x_{dn} is the count of the nth word in the vocabulary
      • Generative model: for each document
        • Sample its cluster label z ~ Multinoulli(π)
          • π = (π_1, π_2, ..., π_K), where π_k is the proportion of the kth cluster, i.e., p(z = k) = π_k
        • Sample its word vector x_d ~ Multinomial(β_z)
          • β_z = (β_{z1}, β_{z2}, ..., β_{zN}), where β_{zn} is the parameter associated with the nth word in the vocabulary
          • p(x_d | z = k) = [(Σ_n x_{dn})! / ∏_n x_{dn}!] ∏_n β_{kn}^{x_{dn}} ∝ ∏_n β_{kn}^{x_{dn}}
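A minimal sketch of the multinomial mixture generative story for a single bag-of-words document, with made-up values of π and β: sample one cluster label, then draw the whole count vector in a single multinomial draw.

```python
# Sketch: multinomial mixture generative model for a bag-of-words document.
# pi, beta, and the document length are illustrative values.
import numpy as np

rng = np.random.default_rng(1)
pi = np.array([0.5, 0.5])                  # cluster proportions pi_k
beta = np.array([                          # beta[k, n]: prob. of the nth vocab word in cluster k
    [0.4, 0.3, 0.2, 0.1],
    [0.1, 0.1, 0.3, 0.5],
])

def sample_document(length):
    k = rng.choice(len(pi), p=pi)          # z ~ Multinoulli(pi)
    x_d = rng.multinomial(length, beta[k]) # x_d ~ Multinomial(beta_z): all words share one cluster
    return k, x_d

print(sample_document(length=10))
```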

  13. Likelihood Function
      • For a set of M documents:
        • L = ∏_d p(x_d) = ∏_d Σ_k p(x_d, z = k)
            = ∏_d Σ_k p(x_d | z = k) p(z = k)
            ∝ ∏_d Σ_k p(z = k) ∏_n β_{kn}^{x_{dn}}

  14. Mixture of Unigrams
      • For documents represented by a sequence of words
        • w_d = (w_{d1}, w_{d2}, ..., w_{dN_d}), where N_d is the length of document d and w_{dn} is the word at the nth position of the document
      • Generative model: for each document
        • Sample its cluster label z ~ Multinoulli(π)
          • π = (π_1, π_2, ..., π_K), where π_k is the proportion of the kth cluster, i.e., p(z = k) = π_k
        • For each word in the sequence
          • Sample the word w_{dn} ~ Multinoulli(β_z), i.e., p(w_{dn} | z = k) = β_{k w_{dn}}
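For contrast with the multinomial mixture, here is a sketch of the mixture-of-unigrams story: one cluster label per document, after which every word position is drawn independently from that cluster's word distribution (the vocabulary and parameters are again illustrative).

```python
# Sketch: mixture of unigrams -- one topic per document, words sampled position by position.
# Vocabulary and parameters are illustrative.
import numpy as np

rng = np.random.default_rng(2)
vocab = ["data", "mining", "web", "retrieval"]
pi = np.array([0.5, 0.5])
beta = np.array([
    [0.4, 0.3, 0.2, 0.1],
    [0.1, 0.1, 0.3, 0.5],
])

def sample_document(length):
    k = rng.choice(len(pi), p=pi)                      # single cluster label for the whole document
    words = rng.choice(vocab, size=length, p=beta[k])  # w_dn ~ Multinoulli(beta_z) at every position
    return k, list(words)

print(sample_document(length=6))
```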

  15. Likelihood Function
      • For a set of M documents:
        • L = ∏_d p(w_d) = ∏_d Σ_k p(w_d, z = k)
            = ∏_d Σ_k p(w_d | z = k) p(z = k)
            = ∏_d Σ_k p(z = k) ∏_n β_{k w_{dn}}

  16. Question
      • Are the multinomial mixture model and the mixture of unigrams model equivalent? Why?

  17. Text Data: Topic Models
      • Text Data and Topic Models
      • Revisit of Mixture Model
      • Probabilistic Latent Semantic Analysis (pLSA)
      • Summary

  18. Notations
      • Word, document, topic: w, d, z
      • Word count in document: c(w, d)
      • Word distribution for each topic (β_z)
        • β_{zw} = p(w | z)
      • Topic distribution for each document (θ_d)
        • θ_{dz} = p(z | d) (yes, this is soft clustering)

  19. Issues of Mixture of Unigrams
      • All the words in the same document are sampled from the same topic
      • In practice, people switch topics during their writing

  20. Illustration of pLSA

  21. Generative Model for pLSA
      • Describes how a document d is generated probabilistically
      • For each position in d, n = 1, ..., N_d
        • Generate the topic for the position: z_n ~ Multinoulli(θ_d), i.e., p(z_n = k) = θ_{dk} (note: a 1-trial multinomial, i.e., the categorical distribution)
        • Generate the word for the position: w_n ~ Multinoulli(β_{z_n}), i.e., p(w_n = w) = β_{z_n w}
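The key difference from the mixture of unigrams is that pLSA re-samples the topic at every position. A minimal sketch of this per-position generative process, with a made-up θ_d, β, and vocabulary:

```python
# Sketch of the pLSA generative process for one document:
# for each position n, z_n ~ Multinoulli(theta_d), then w_n ~ Multinoulli(beta[z_n]).
# theta_d, beta, and the vocabulary are illustrative.
import numpy as np

rng = np.random.default_rng(3)
vocab = ["data", "mining", "web", "retrieval"]
theta_d = np.array([0.7, 0.3])     # topic distribution of this document
beta = np.array([
    [0.4, 0.4, 0.1, 0.1],          # topic 0: word distribution
    [0.1, 0.1, 0.4, 0.4],          # topic 1: word distribution
])

def generate_document(n_d):
    doc = []
    for _ in range(n_d):
        z_n = rng.choice(len(theta_d), p=theta_d)  # topic for this position
        w_n = rng.choice(vocab, p=beta[z_n])       # word for this position
        doc.append(w_n)
    return doc

print(generate_document(n_d=8))
```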

  22. Graphical Model
      • Note: sometimes, people also add parameters such as θ and β into the graphical model

  23. The Likelihood Function for a Corpus
      • Probability of a word:
        • p(w | d) = Σ_k p(w, z = k | d) = Σ_k p(w | z = k) p(z = k | d) = Σ_k β_{kw} θ_{dk}
      • Likelihood of a corpus:
        • L = ∏_d ∏_w p(w, d)^{c(w,d)}, with p(w, d) = π_d p(w | d); π_d = p(d) is usually considered as uniform, i.e., 1/M

  24. Re-arrange the Likelihood Function
      • Group the same word from different positions together:
        • max log L = Σ_{d,w} c(w, d) log Σ_z θ_{dz} β_{zw}
        • s.t. Σ_z θ_{dz} = 1 and Σ_w β_{zw} = 1
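Once the corpus is summarized as a count matrix c(w, d), the grouped objective is straightforward to evaluate. A sketch with placeholder counts and parameters:

```python
# Sketch: evaluate log L = sum_{d,w} c(w,d) * log( sum_z theta[d,z] * beta[z,w] ).
# The count matrix and parameters below are placeholders.
import numpy as np

c = np.array([                 # c[d, w]: count of word w in document d
    [5, 4, 3, 0],
    [0, 2, 3, 4],
])
theta = np.array([             # theta[d, z]: topic distribution per document (rows sum to 1)
    [0.7, 0.3],
    [0.2, 0.8],
])
beta = np.array([              # beta[z, w]: word distribution per topic (rows sum to 1)
    [0.4, 0.4, 0.1, 0.1],
    [0.1, 0.1, 0.4, 0.4],
])

p_w_given_d = theta @ beta     # p(w|d) = sum_z theta[d,z] * beta[z,w]
log_L = (c * np.log(p_w_given_d)).sum()
print(log_L)
```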

  25. Optimization: EM Algorithm
      • Repeat until convergence
        • E-step: for each word in each document, calculate its conditional probability of belonging to each topic
          • p(z | w, d) ∝ p(w | z, d) p(z | d) = β_{zw} θ_{dz}, i.e., p(z | w, d) = β_{zw} θ_{dz} / Σ_{z'} β_{z'w} θ_{dz'}
        • M-step: given the conditional distribution, find the parameters that maximize the expected likelihood
          • β_{zw} ∝ Σ_d p(z | w, d) c(w, d), i.e., β_{zw} = Σ_d p(z | w, d) c(w, d) / Σ_{w',d} p(z | w', d) c(w', d)
          • θ_{dz} ∝ Σ_w p(z | w, d) c(w, d), i.e., θ_{dz} = Σ_w p(z | w, d) c(w, d) / N_d
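Below is a compact sketch of these E- and M-steps on a toy count matrix; the corpus, the random initialization, and the fixed number of iterations are all illustrative choices, and no smoothing or convergence test is included.

```python
# Sketch of the pLSA EM updates from the slide, on a toy corpus.
# E-step:  p(z|w,d) = beta[z,w] * theta[d,z] / sum_z' beta[z',w] * theta[d,z']
# M-step:  beta[z,w]  proportional to sum_d p(z|w,d) * c(w,d)
#          theta[d,z] = sum_w p(z|w,d) * c(w,d) / N_d
import numpy as np

rng = np.random.default_rng(0)
c = np.array([                      # c[d, w]: toy word counts, 2 documents x 4 vocabulary words
    [5, 4, 3, 0],
    [0, 2, 3, 4],
], dtype=float)
D, W = c.shape
K = 2                               # number of topics

# Random initialization of theta (D x K) and beta (K x W), rows normalized.
theta = rng.random((D, K))
theta /= theta.sum(axis=1, keepdims=True)
beta = rng.random((K, W))
beta /= beta.sum(axis=1, keepdims=True)

for _ in range(50):
    # E-step: posterior p(z | w, d) for every (d, w) pair, shape (D, W, K).
    joint = theta[:, None, :] * beta.T[None, :, :]   # theta[d,z] * beta[z,w]
    post = joint / joint.sum(axis=2, keepdims=True)
    # M-step: re-estimate beta and theta from expected counts.
    expected = post * c[:, :, None]                  # p(z|w,d) * c(w,d)
    beta = expected.sum(axis=0).T                    # numerator, shape (K, W)
    beta /= beta.sum(axis=1, keepdims=True)          # normalize over the vocabulary
    theta = expected.sum(axis=1)                     # numerator, shape (D, K)
    theta /= c.sum(axis=1, keepdims=True)            # divide by document length N_d

print(np.round(beta, 3))
print(np.round(theta, 3))
```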

  26. Example
      • Two documents, two topics
      • Vocabulary: {data, mining, frequent, pattern, web, information, retrieval}
      • At some iteration of the EM algorithm, the E-step produces p(z | w, d) for each word-document pair (shown as a table on the original slide)

  27. Example (Continued)
      • M-step
        • β_{11} = (0.8 × 5 + 0.5 × 2) / (11.8 + 5.8) = 5 / 17.6
        • β_{12} = (0.8 × 4 + 0.5 × 3) / (11.8 + 5.8) = 4.7 / 17.6
        • β_{13} = 3 / 17.6,  β_{14} = 1.6 / 17.6,  β_{15} = 1.3 / 17.6,  β_{16} = 1.2 / 17.6,  β_{17} = 0.8 / 17.6
        • θ_{11} = 11.8 / 17,  θ_{12} = 5.2 / 17
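As a quick check of the arithmetic above, assuming (as the visible fragments suggest) that 0.8 and 0.5 are E-step values p(z = 1 | w, d) for the corresponding words in documents 1 and 2, that 11.8 + 5.8 is the M-step normalizer of β_{1w}, and that 17 is the length of document 1:

```python
# Arithmetic check of the M-step fractions shown on the slide.
# 0.8 and 0.5 are taken as E-step posteriors; 5, 2, 4, 3 as word counts; 17 as the length of document 1.
beta_11 = (0.8 * 5 + 0.5 * 2) / (11.8 + 5.8)   # = 5 / 17.6
beta_12 = (0.8 * 4 + 0.5 * 3) / (11.8 + 5.8)   # = 4.7 / 17.6
theta_11 = 11.8 / 17
theta_12 = 5.2 / 17
print(beta_11, beta_12, theta_11, theta_12, theta_11 + theta_12)  # the theta values sum to 1
```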

  28. Text Data: Topic Models
      • Text Data and Topic Models
      • Revisit of Mixture Model
      • Probabilistic Latent Semantic Analysis (pLSA)
      • Summary

  29. Summary
      • Basic concepts
        • Word/term, document, corpus, topic
      • Mixture of unigrams
      • pLSA
        • Generative model
        • Likelihood function
        • EM algorithm

  30. Quiz
      • Q1: Is multinomial Naïve Bayes a linear classifier?
      • Q2: In pLSA, does the same word appearing at different positions in a document have the same conditional probability p(z | w, d)?
