CS145: INTRODUCTION TO DATA MINING
Text Data: Topic Model
Instructor: Yizhou Sun (yzsun@cs.ucla.edu)
December 4, 2017
Methods to be Learnt
- Classification: Logistic Regression; Decision Tree; KNN; SVM; NN (vector data); Naïve Bayes for Text (text data)
- Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models (vector data); pLSA (text data)
- Prediction: Linear Regression; GLM* (vector data)
- Frequent Pattern Mining: Apriori; FP-growth (set data); GSP; PrefixSpan (sequence data)
- Similarity Search: DTW (sequence data)
Text Data: Topic Models
- Text Data and Topic Models
- Revisit of Mixture Model
- Probabilistic Latent Semantic Analysis (pLSA)
- Summary
Text Data
- Word/term
- Document: a sequence of words
- Corpus: a collection of documents
Represent a Document
- Most common way: Bag-of-Words
  - Ignore the order of words
  - Keep the counts
[Table: word-count vectors for documents c1-c5 and m1-m4 under the vector space model]
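As a minimal illustration of the bag-of-words idea (code and example text are my own, not from the slides):

```python
from collections import Counter

def bag_of_words(doc, vocabulary):
    """Count how often each vocabulary word occurs in a document,
    ignoring word order (bag-of-words representation)."""
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocabulary]

# Hypothetical toy vocabulary and document (not from the slides)
vocab = ["data", "mining", "frequent", "pattern", "web", "information", "retrieval"]
doc = "data mining finds frequent pattern from data"
print(bag_of_words(doc, vocab))  # [2, 1, 1, 1, 0, 0, 0]
```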
Topics
- Topic
  - A topic is represented by a word distribution
  - Relates to an issue
Topic Models
- Topic modeling
  - Get topics automatically from a corpus
  - Assign documents to topics automatically
- Most frequently used topic models
  - pLSA
  - LDA
Text Data: Topic Models
- Text Data and Topic Models
- Revisit of Mixture Model
- Probabilistic Latent Semantic Analysis (pLSA)
- Summary
Mixture Model-Based Clustering
- A set C of k probabilistic clusters $C_1, \dots, C_k$
  - Probability density/mass functions: $f_1, \dots, f_k$
  - Cluster prior probabilities: $w_1, \dots, w_k$, with $\sum_j w_j = 1$
- Joint probability of an object $x_i$ and its cluster $C_j$:
  - $P(x_i, Z_i = C_j) = w_j f_j(x_i)$
  - $Z_i$: hidden random variable
- Probability of $x_i$:
  - $P(x_i) = \sum_j w_j f_j(x_i)$
[Figure: two component density curves $f_1(x)$ and $f_2(x)$]
Maximum Likelihood Estimation
- Since objects are assumed to be generated independently, for a data set $D = \{x_1, \dots, x_n\}$ we have
  $P(D) = \prod_i P(x_i) = \prod_i \sum_j w_j f_j(x_i)$
  $\Rightarrow \log P(D) = \sum_i \log P(x_i) = \sum_i \log \sum_j w_j f_j(x_i)$
- Task: find a set C of k probabilistic clusters s.t. $P(D)$ is maximized
Gaussian Mixture Model
- Generative model
  - For each object:
    - Pick its cluster, i.e., a distribution component: $z \sim \mathrm{Multinoulli}(w_1, \dots, w_k)$
    - Sample a value from the selected distribution: $X \mid z \sim N(\mu_z, \sigma_z^2)$
- Overall likelihood function
  - $L(D \mid \theta) = \prod_i \sum_j w_j\, p(x_i \mid \mu_j, \sigma_j^2)$, s.t. $\sum_j w_j = 1$ and $w_j \ge 0$
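A minimal sketch of evaluating this likelihood for 1-D data (my own code; all parameter values are made up, not from the slides):

```python
import numpy as np
from scipy.stats import norm

def gmm_log_likelihood(x, w, mu, sigma):
    """log L(D | theta) = sum_i log sum_j w_j * N(x_i | mu_j, sigma_j^2)."""
    # densities[i, j] = w_j * N(x_i | mu_j, sigma_j^2)
    densities = w * norm.pdf(x[:, None], loc=mu, scale=sigma)
    return np.log(densities.sum(axis=1)).sum()

# Hypothetical 1-D data and a 2-component mixture (made-up numbers)
x = np.array([0.2, 1.9, 2.1, -0.3])
w = np.array([0.4, 0.6])       # cluster priors, sum to 1
mu = np.array([0.0, 2.0])      # component means
sigma = np.array([1.0, 0.5])   # component standard deviations
print(gmm_log_likelihood(x, w, mu, sigma))
```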
Multinomial Mixture Model
- For documents with bag-of-words representation
  - $\mathbf{x}_d = (x_{d1}, x_{d2}, \dots, x_{dN})$, where $x_{dn}$ is the count of the nth vocabulary word in document d
- Generative model
  - For each document
    - Sample its cluster label $z \sim \mathrm{Multinoulli}(\boldsymbol{\pi})$
      - $\boldsymbol{\pi} = (\pi_1, \pi_2, \dots, \pi_K)$, where $\pi_k$ is the proportion of the kth cluster
      - $p(z = k) = \pi_k$
    - Sample its word vector $\mathbf{x}_d \sim \mathrm{multinomial}(\boldsymbol{\beta}_z)$
      - $\boldsymbol{\beta}_z = (\beta_{z1}, \beta_{z2}, \dots, \beta_{zN})$, where $\beta_{zn}$ is the parameter associated with the nth word in the vocabulary
      - $p(\mathbf{x}_d \mid z = k) = \frac{(\sum_n x_{dn})!}{\prod_n x_{dn}!} \prod_n \beta_{kn}^{x_{dn}} \propto \prod_n \beta_{kn}^{x_{dn}}$
Likelihood Function
- For a set of M documents,
  $L = \prod_d p(\mathbf{x}_d) = \prod_d \sum_k p(\mathbf{x}_d, z = k) = \prod_d \sum_k p(\mathbf{x}_d \mid z = k)\, p(z = k) \propto \prod_d \sum_k p(z = k) \prod_n \beta_{kn}^{x_{dn}}$
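A minimal sketch of this log-likelihood for bag-of-words counts (my own code; it drops the count-only multinomial coefficient, matching the $\propto$ on the previous slide, and uses log-sum-exp for numerical stability):

```python
import numpy as np

def multinomial_mixture_log_likelihood(X, pi, beta):
    """log L = sum_d log sum_k pi_k * prod_n beta_kn^{x_dn}
    (the count-only multinomial coefficient is dropped)."""
    # log_comp[d, k] = log pi_k + sum_n x_dn * log beta_kn
    log_comp = np.log(pi) + X @ np.log(beta).T
    # log-sum-exp over clusters, then sum over documents
    m = log_comp.max(axis=1, keepdims=True)
    return (m.squeeze(1) + np.log(np.exp(log_comp - m).sum(axis=1))).sum()

# Hypothetical counts: 2 documents over a 3-word vocabulary (made-up numbers)
X = np.array([[2, 0, 1],
              [0, 3, 1]])
pi = np.array([0.5, 0.5])                 # cluster proportions
beta = np.array([[0.7, 0.2, 0.1],         # word distribution, cluster 1
                 [0.1, 0.6, 0.3]])        # word distribution, cluster 2
print(multinomial_mixture_log_likelihood(X, pi, beta))
```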
Mixture of Unigrams
- For documents represented by a sequence of words
  - $\mathbf{w}_d = (w_{d1}, w_{d2}, \dots, w_{dN_d})$, where $N_d$ is the length of document d and $w_{dn}$ is the word at the nth position of the document
- Generative model
  - For each document
    - Sample its cluster label $z \sim \mathrm{Multinoulli}(\boldsymbol{\pi})$
      - $\boldsymbol{\pi} = (\pi_1, \pi_2, \dots, \pi_K)$, where $\pi_k$ is the proportion of the kth cluster
      - $p(z = k) = \pi_k$
    - For each word in the sequence
      - Sample the word $w_{dn} \sim \mathrm{Multinoulli}(\boldsymbol{\beta}_z)$
      - $p(w_{dn} \mid z = k) = \beta_{k w_{dn}}$
Likelihood Function
- For a set of M documents,
  $L = \prod_d p(\mathbf{w}_d) = \prod_d \sum_k p(\mathbf{w}_d, z = k) = \prod_d \sum_k p(\mathbf{w}_d \mid z = k)\, p(z = k) = \prod_d \sum_k p(z = k) \prod_n \beta_{k w_{dn}}$
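For comparison, a minimal sketch (my own, with made-up numbers) of one document's likelihood under the mixture of unigrams, where the document is a sequence of word indices:

```python
import numpy as np

def unigram_mixture_likelihood(doc, pi, beta):
    """p(w_d) = sum_k pi_k * prod_n beta[k, w_dn]
    for a document given as a sequence of word indices."""
    # per_cluster[k] = product over positions of beta[k, word]
    per_cluster = beta[:, doc].prod(axis=1)
    return float(pi @ per_cluster)

# Hypothetical document over a 3-word vocabulary (made-up numbers)
doc = [0, 2, 0, 1]                 # word indices w_d1 .. w_d4
pi = np.array([0.5, 0.5])
beta = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.6, 0.3]])
print(unigram_mixture_likelihood(doc, pi, beta))
```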
Question
- Are the multinomial mixture model and the mixture-of-unigrams model equivalent? Why?
Text Data: Topic Models
- Text Data and Topic Models
- Revisit of Mixture Model
- Probabilistic Latent Semantic Analysis (pLSA)
- Summary
Notations
- Word, document, topic
  - $w$, $d$, $z$
- Word count in document
  - $c(w, d)$
- Word distribution for each topic ($\boldsymbol{\beta}_z$)
  - $\beta_{zw} = p(w \mid z)$
- Topic distribution for each document ($\boldsymbol{\theta}_d$)
  - $\theta_{dz} = p(z \mid d)$ (yes, this is soft clustering)
Issues of Mixture of Unigrams
- All the words in the same document are sampled from the same topic
- In practice, people switch topics during their writing
Illustration of pLSA
[Figure: illustration of pLSA]
Generative Model for pLSA
- Describes how a document is generated probabilistically
- For each position in d, $n = 1, \dots, N_d$
  - Generate the topic for the position as $z_n \sim \mathrm{Multinoulli}(\boldsymbol{\theta}_d)$, i.e., $p(z_n = k) = \theta_{dk}$ (note: a 1-trial multinomial, i.e., the categorical distribution)
  - Generate the word for the position as $w_n \sim \mathrm{Multinoulli}(\boldsymbol{\beta}_{z_n})$, i.e., $p(w_n = w) = \beta_{z_n w}$
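The two sampling steps translate directly into code; below is a minimal sketch (my own, with made-up parameter values) that samples one document position by position:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_document(theta_d, beta, length):
    """Sample one document under the pLSA generative model:
    for each position, draw a topic z_n ~ Categorical(theta_d),
    then a word w_n ~ Categorical(beta[z_n])."""
    words = []
    for _ in range(length):
        z = rng.choice(len(theta_d), p=theta_d)   # topic for this position
        w = rng.choice(beta.shape[1], p=beta[z])  # word given the topic
        words.append(w)
    return words

# Hypothetical parameters: 2 topics over a 4-word vocabulary (made-up numbers)
theta_d = np.array([0.7, 0.3])             # p(z | d)
beta = np.array([[0.5, 0.3, 0.1, 0.1],     # p(w | z=0)
                 [0.1, 0.1, 0.4, 0.4]])    # p(w | z=1)
print(generate_document(theta_d, beta, length=6))  # word indices
```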
Graphical Model
[Figure: graphical model for pLSA]
Note: Sometimes people also add the parameters $\theta$ and $\beta$ to the graphical model.
The Likelihood Function for a Corpus
- Probability of a word
  $p(w \mid d) = \sum_k p(w, z = k \mid d) = \sum_k p(w \mid z = k)\, p(z = k \mid d) = \sum_k \beta_{kw}\theta_{dk}$
- Likelihood of a corpus
  $L = \prod_d \pi_d \prod_w p(w \mid d)^{c(w,d)}$
  ($\pi_d = p(d)$ is usually considered as uniform, i.e., 1/M)
Re-arrange the Likelihood Function
- Group the same word from different positions together:
  $\max \log L = \sum_{d,w} c(w, d) \log \sum_z \theta_{dz}\beta_{zw}$
  s.t. $\sum_z \theta_{dz} = 1$ and $\sum_w \beta_{zw} = 1$
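A minimal sketch (my own) of evaluating this objective given a count matrix and the two parameter matrices:

```python
import numpy as np

def plsa_log_likelihood(C, theta, beta):
    """log L = sum_{d,w} c(w,d) * log sum_z theta_dz * beta_zw.
    C: (D, W) word counts; theta: (D, K) p(z|d); beta: (K, W) p(w|z)."""
    P = theta @ beta  # P[d, w] = sum_z theta_dz * beta_zw = p(w | d)
    return float((C * np.log(P + 1e-12)).sum())  # small epsilon guards log(0)
```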
Optimization: EM Algorithm
- Repeat until convergence
  - E-step: for each word in each document, calculate its conditional probability of belonging to each topic:
    $p(z \mid w, d) \propto p(w \mid z, d)\, p(z \mid d) = \beta_{zw}\theta_{dz}$, i.e., $p(z \mid w, d) = \frac{\beta_{zw}\theta_{dz}}{\sum_{z'} \beta_{z'w}\theta_{dz'}}$
  - M-step: given the conditional distribution, find the parameters that maximize the expected likelihood:
    $\beta_{zw} \propto \sum_d p(z \mid w, d)\, c(w, d)$, i.e., $\beta_{zw} = \frac{\sum_d p(z \mid w, d)\, c(w, d)}{\sum_{w',d} p(z \mid w', d)\, c(w', d)}$
    $\theta_{dz} \propto \sum_w p(z \mid w, d)\, c(w, d)$, i.e., $\theta_{dz} = \frac{\sum_w p(z \mid w, d)\, c(w, d)}{N_d}$
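Putting both steps together, here is a compact NumPy sketch of the pLSA EM loop (my own implementation; the count-matrix layout, random initialization, and iteration count are assumptions, not from the slides):

```python
import numpy as np

def plsa_em(C, K, n_iters=100, seed=0):
    """EM for pLSA on a document-word count matrix C (shape D x W).
    Returns theta (D x K, p(z|d)) and beta (K x W, p(w|z))."""
    rng = np.random.default_rng(seed)
    D, W = C.shape
    theta = rng.random((D, K)); theta /= theta.sum(axis=1, keepdims=True)
    beta = rng.random((K, W)); beta /= beta.sum(axis=1, keepdims=True)
    for _ in range(n_iters):
        # E-step: post[d, w, k] = p(z=k | w, d) ∝ beta_kw * theta_dk
        post = theta[:, None, :] * beta.T[None, :, :]
        post /= post.sum(axis=2, keepdims=True) + 1e-12
        # M-step: weighted counts p(z|w,d) * c(w,d)
        weighted = C[:, :, None] * post            # (D, W, K)
        beta = weighted.sum(axis=0).T              # (K, W): sum over documents
        beta /= beta.sum(axis=1, keepdims=True)
        theta = weighted.sum(axis=1)               # (D, K): sum over words
        theta /= theta.sum(axis=1, keepdims=True)  # equals dividing by N_d
    return theta, beta

# Usage on hypothetical counts (made-up numbers):
theta, beta = plsa_em(np.array([[2, 1, 0], [0, 1, 3]]), K=2)
```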
Example
- Two documents, two topics
- Vocabulary: {data, mining, frequent, pattern, web, information, retrieval}
- At some iteration of the EM algorithm, E-step:
[Table: E-step results, $p(z \mid w, d)$ for each word in each document]
Example (Continued)
- M-step
  $\beta_{11} = \frac{0.8 \times 5 + 0.5 \times 2}{11.8 + 5.8} = 5/17.6$
  $\beta_{12} = \frac{0.8 \times 4 + 0.5 \times 3}{11.8 + 5.8} = 4.7/17.6$
  $\beta_{13} = 3/17.6$, $\beta_{14} = 1.6/17.6$, $\beta_{15} = 1.3/17.6$, $\beta_{16} = 1.2/17.6$, $\beta_{17} = 0.8/17.6$
  $\theta_{11} = 11.8/17$, $\theta_{12} = 5.2/17$
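A quick numeric check of the first update, plugging the slide's E-step weights and counts into the M-step formula:

```python
# M-step update for beta_11, using the numbers on the slide
numerator = 0.8 * 5 + 0.5 * 2    # sum_d p(z=1 | w=1, d) * c(w=1, d)
denominator = 11.8 + 5.8         # total weighted counts for topic 1
print(numerator / denominator)   # 5 / 17.6 ≈ 0.284
```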
Text Data: Topic Models
- Text Data and Topic Models
- Revisit of Mixture Model
- Probabilistic Latent Semantic Analysis (pLSA)
- Summary
Summary
- Basic Concepts
  - Word/term, document, corpus, topic
- Mixture of unigrams
- pLSA
  - Generative model
  - Likelihood function
  - EM algorithm
Quiz
- Q1: Is Multinomial Naïve Bayes a linear classifier?
- Q2: In pLSA, does the same word at different positions in a document have the same conditional probability $p(z \mid w, d)$?