

SLIDE 1

CS145: INTRODUCTION TO DATA MINING

Text Data: Topic Model

Instructor: Yizhou Sun
yzsun@cs.ucla.edu

December 4, 2017

SLIDE 2

Methods to be Learnt


  • Classification: Logistic Regression; Decision Tree; KNN; SVM; NN (vector data); Naïve Bayes for Text (text data)
  • Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models (vector data); PLSA (text data)
  • Prediction: Linear Regression; GLM* (vector data)
  • Frequent Pattern Mining: Apriori; FP growth (set data); GSP; PrefixSpan (sequence data)
  • Similarity Search: DTW (sequence data)

SLIDE 3

Text Data: Topic Models

  • Text Data and Topic Models
  • Revisit of Mixture Model
  • Probabilistic Latent Semantic Analysis (pLSA)

  • Summary

SLIDE 4

Text Data

  • Word/term
  • Document
    • A sequence of words
  • Corpus
    • A collection of documents

SLIDE 5

Represent a Document

  • Most common way: Bag-of-Words
    • Ignore the order of words
    • Keep the count

[Figure: term-document count matrix with columns labeled c1 to c5 and m1 to m4]

Vector space model
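
As a concrete illustration (not from the slides; the toy documents below are hypothetical), a minimal sketch of turning a small corpus into bag-of-words count vectors, one row of the term-document matrix per document:

```python
from collections import Counter

# Toy corpus (hypothetical example documents).
docs = [
    "data mining mining frequent pattern",
    "web information retrieval information",
]

# Build the vocabulary from the corpus.
vocab = sorted({w for doc in docs for w in doc.split()})

# Bag-of-words: keep only the count of each word, ignore word order.
def bag_of_words(doc, vocab):
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

for doc in docs:
    print(bag_of_words(doc, vocab))
# Each document becomes one row of the term-document count matrix
# (the vector space model).
```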

SLIDE 6

Topics

  • Topic
    • A topic is represented by a word distribution
    • Relates to an issue

SLIDE 7

Topic Models

  • Topic modeling
    • Get topics automatically from a corpus
    • Assign documents to topics automatically
  • Most frequently used topic models
    • pLSA
    • LDA

SLIDE 8

Text Data: Topic Models

  • Text Data and Topic Models
  • Revisit of Mixture Model
  • Probabilistic Latent Semantic Analysis (pLSA)

  • Summary

SLIDE 9

Mixture Model-Based Clustering

  • A set C of k probabilistic clusters C1, …, Ck
    • Probability density/mass functions: $f_1, \ldots, f_k$
    • Cluster prior probabilities: $w_1, \ldots, w_k$, with $\sum_j w_j = 1$

  • Joint probability of an object i and its cluster Cj is:
    • $P(x_i, z_i = C_j) = w_j f_j(x_i)$
    • $z_i$: hidden random variable
  • Probability of i is:
    • $P(x_i) = \sum_j w_j f_j(x_i)$

[Figure: two component densities $f_1(x)$ and $f_2(x)$]

SLIDE 10

Maximum Likelihood Estimation

  • Since objects are assumed to be generated independently, for a data set D = {x1, …, xn}, we have
      $P(D) = \prod_i P(x_i) = \prod_i \sum_j w_j f_j(x_i)$
      $\Rightarrow \log P(D) = \sum_i \log P(x_i) = \sum_i \log \sum_j w_j f_j(x_i)$

  • Task: Find a set C of k probabilistic clusters such that P(D) is maximized (a numerical sketch follows below)
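
To make this concrete, here is a minimal Python sketch (not from the slides; the two Gaussian components, their weights, and the toy data set are assumed for illustration). It evaluates $P(x_i) = \sum_j w_j f_j(x_i)$ for each object and sums the logs to get $\log P(D)$:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 2-component mixture: cluster priors and component parameters.
w = np.array([0.6, 0.4])           # w_j, sum to 1
mu = np.array([0.0, 5.0])          # component means
sigma = np.array([1.0, 2.0])       # component standard deviations

x = np.array([-0.5, 0.3, 4.1, 6.2])   # toy data set D = {x_1, ..., x_n}

# f_j(x_i): density of component j at x_i, shape (n, k).
densities = norm.pdf(x[:, None], loc=mu, scale=sigma)

# P(x_i) = sum_j w_j f_j(x_i)
p_x = densities @ w

# log P(D) = sum_i log P(x_i)   (objects generated independently)
print(np.log(p_x).sum())
```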

SLIDE 11

Gaussian Mixture Model

  • Generative model
    • For each object:
      • Pick its cluster, i.e., a distribution component: $Z \sim \text{Multinoulli}(w_1, \ldots, w_k)$
      • Sample a value from the selected distribution: $X|Z \sim N(\mu_Z, \sigma_Z^2)$

  • Overall likelihood function
      $L(D|\theta) = \prod_i \sum_j w_j p(x_i|\mu_j, \sigma_j^2)$
      s.t. $\sum_j w_j = 1$ and $w_j \ge 0$
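
A minimal sketch of this generative process (the weights, means, and standard deviations below are assumed toy values, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mixture parameters.
w = np.array([0.6, 0.4])        # cluster priors, non-negative and summing to 1
mu = np.array([0.0, 5.0])       # mu_j for each component
sigma = np.array([1.0, 2.0])    # sigma_j for each component

def sample_one():
    # Pick the cluster: Z ~ Multinoulli(w_1, ..., w_k)
    z = rng.choice(len(w), p=w)
    # Sample a value from the selected component: X | Z ~ N(mu_Z, sigma_Z^2)
    x = rng.normal(mu[z], sigma[z])
    return z, x

samples = [sample_one() for _ in range(5)]
print(samples)
```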

SLIDE 12

Multinomial Mixture Model

  • For documents with bag-of-words representation
    • $\boldsymbol{x}_d = (x_{d1}, x_{d2}, \ldots, x_{dN})$, where $x_{dn}$ is the count of the nth vocabulary word in document d

  • Generative model
    • For each document
      • Sample its cluster label $z \sim \text{Multinoulli}(\boldsymbol{\pi})$
        • $\boldsymbol{\pi} = (\pi_1, \pi_2, \ldots, \pi_K)$, where $\pi_k$ is the proportion of the kth cluster
        • $p(z = k) = \pi_k$
      • Sample its word vector $\boldsymbol{x}_d \sim \text{multinomial}(\boldsymbol{\beta}_z)$
        • $\boldsymbol{\beta}_z = (\beta_{z1}, \beta_{z2}, \ldots, \beta_{zN})$, where $\beta_{zn}$ is the parameter associated with the nth word in the vocabulary
        • $p(\boldsymbol{x}_d | z = k) = \frac{(\sum_n x_{dn})!}{\prod_n x_{dn}!} \prod_n \beta_{kn}^{x_{dn}} \propto \prod_n \beta_{kn}^{x_{dn}}$
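
A minimal sampling sketch of this model (the cluster proportions and word distributions below are hypothetical toy values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: K = 2 clusters, N = 4 vocabulary words.
pi = np.array([0.5, 0.5])                     # cluster proportions pi_k
beta = np.array([[0.5, 0.3, 0.1, 0.1],        # word distribution of cluster 1
                 [0.1, 0.1, 0.4, 0.4]])       # word distribution of cluster 2

def sample_document(doc_length):
    # Sample the cluster label z ~ Multinoulli(pi).
    z = rng.choice(len(pi), p=pi)
    # Sample the word-count vector x_d ~ multinomial(doc_length, beta_z).
    x_d = rng.multinomial(doc_length, beta[z])
    return z, x_d

print(sample_document(10))   # e.g. a cluster label and a count vector of length 4
```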

SLIDE 13

Likelihood Function

  • For a set of M documents

      $L = \prod_d p(\boldsymbol{x}_d) = \prod_d \sum_k p(\boldsymbol{x}_d, z = k) = \prod_d \sum_k p(\boldsymbol{x}_d | z = k)\, p(z = k) \propto \prod_d \sum_k p(z = k) \prod_n \beta_{kn}^{x_{dn}}$
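
A small sketch (toy parameters and counts assumed, not from the slides) that evaluates this likelihood in log space, dropping the multinomial coefficient just as the proportionality above does:

```python
import numpy as np
from scipy.special import logsumexp

# Hypothetical parameters and bag-of-words counts.
pi = np.array([0.5, 0.5])
beta = np.array([[0.5, 0.3, 0.1, 0.1],
                 [0.1, 0.1, 0.4, 0.4]])
X = np.array([[5, 3, 1, 1],      # x_d: word counts of each document
              [0, 1, 4, 5]])

# log p(x_d, z = k) = log pi_k + sum_n x_dn * log beta_kn
# (the multinomial coefficient is dropped; it does not depend on k or the parameters)
log_joint = np.log(pi)[None, :] + X @ np.log(beta).T      # shape (M, K)

# log L = sum_d log sum_k p(x_d, z = k)
log_likelihood = logsumexp(log_joint, axis=1).sum()
print(log_likelihood)
```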

SLIDE 14

Mixture of Unigrams

  • For documents represented by a sequence of words
    • $\boldsymbol{w}_d = (w_{d1}, w_{d2}, \ldots, w_{dN_d})$, where $N_d$ is the length of document d and $w_{dn}$ is the word at the nth position of the document
  • Generative model
    • For each document
      • Sample its cluster label $z \sim \text{Multinoulli}(\boldsymbol{\pi})$
        • $\boldsymbol{\pi} = (\pi_1, \pi_2, \ldots, \pi_K)$, where $\pi_k$ is the proportion of the kth cluster
        • $p(z = k) = \pi_k$
      • For each word in the sequence
        • Sample the word $w_{dn} \sim \text{Multinoulli}(\boldsymbol{\beta}_z)$
        • $p(w_{dn} | z = k) = \beta_{k w_{dn}}$
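
A minimal sampling sketch of this model (toy vocabulary and parameters assumed); note that one cluster label is drawn per document and then every word position uses the same word distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: K = 2 clusters over a 4-word vocabulary.
vocab = ["data", "mining", "web", "retrieval"]
pi = np.array([0.5, 0.5])
beta = np.array([[0.5, 0.3, 0.1, 0.1],
                 [0.1, 0.1, 0.4, 0.4]])

def sample_document(length):
    # One cluster label z per document ...
    z = rng.choice(len(pi), p=pi)
    # ... then every word position is drawn from the same distribution beta_z.
    words = [vocab[rng.choice(len(vocab), p=beta[z])] for _ in range(length)]
    return z, words

print(sample_document(6))
```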

SLIDE 15

Likelihood Function

  • For a set of M documents

      $L = \prod_d p(\boldsymbol{w}_d) = \prod_d \sum_k p(\boldsymbol{w}_d, z = k) = \prod_d \sum_k p(\boldsymbol{w}_d | z = k)\, p(z = k) = \prod_d \sum_k p(z = k) \prod_n \beta_{k w_{dn}}$
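
To connect this to the question on the next slide, a small sketch (toy parameters and a toy document assumed) that computes both likelihoods for the same document; once the word sequence is reduced to counts, $\prod_n \beta_{k w_{dn}} = \prod_n \beta_{kn}^{x_{dn}}$, so the two likelihoods differ only by the document's multinomial coefficient, a constant in the parameters:

```python
import numpy as np
from math import factorial, prod

# Toy parameters and one toy document (hypothetical values).
pi = np.array([0.5, 0.5])
beta = np.array([[0.5, 0.3, 0.1, 0.1],
                 [0.1, 0.1, 0.4, 0.4]])
words = [0, 0, 1, 2, 0, 3]                      # word indices w_d1 .. w_dN_d
counts = np.bincount(words, minlength=4)        # bag-of-words vector x_d

# Mixture of unigrams: p(w_d) = sum_k pi_k * prod_n beta_{k, w_dn}
p_seq = sum(pi[k] * prod(beta[k, w] for w in words) for k in range(2))

# Multinomial mixture: p(x_d) = coeff * sum_k pi_k * prod_n beta_kn^{x_dn}
coeff = factorial(len(words)) / prod(factorial(int(c)) for c in counts)
p_bow = coeff * sum(pi[k] * np.prod(beta[k] ** counts) for k in range(2))

print(p_bow / p_seq)   # equals the multinomial coefficient `coeff`
```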

SLIDE 16

Question

  • Are the multinomial mixture model and the mixture of unigrams model equivalent? Why?

SLIDE 17

Text Data: Topic Models

  • Text Data and Topic Models
  • Revisit of Mixture Model
  • Probabilistic Latent Semantic Analysis (pLSA)

  • Summary

SLIDE 18

Notations

  • Word, document, topic
    • $w$, $d$, $z$
  • Word count in document
    • $c(w, d)$
  • Word distribution for each topic ($\boldsymbol{\beta}_z$)
    • $\beta_{zw}$: $p(w|z)$
  • Topic distribution for each document ($\boldsymbol{\theta}_d$)
    • $\theta_{dz}$: $p(z|d)$ (Yes, soft clustering)

SLIDE 19

Issues of Mixture of Unigrams

  • All the words in the same document are sampled from the same topic
  • In practice, people switch topics during their writing

SLIDE 20

Illustration of pLSA

SLIDE 21

Generative Model for pLSA

  • Describes how a document is generated probabilistically
    • For each position in d, $n = 1, \ldots, N_d$
      • Generate the topic for the position as $z_n \sim \text{Multinoulli}(\boldsymbol{\theta}_d)$, i.e., $p(z_n = k) = \theta_{dk}$
        (Note: a 1-trial multinomial, i.e., a categorical distribution)
      • Generate the word for the position as $w_n \sim \text{Multinoulli}(\boldsymbol{\beta}_{z_n})$, i.e., $p(w_n = w) = \beta_{z_n w}$
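
A minimal sketch of this per-position generative process (toy vocabulary, topic distribution, and word distributions assumed); in contrast to the mixture of unigrams, each position gets its own topic draw:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["data", "mining", "web", "retrieval"]
theta_d = np.array([0.7, 0.3])                # topic distribution of document d
beta = np.array([[0.5, 0.3, 0.1, 0.1],        # word distribution of topic 1
                 [0.1, 0.1, 0.4, 0.4]])       # word distribution of topic 2

def generate_document(length):
    words = []
    for _ in range(length):
        # Each position gets its own topic: z_n ~ Multinoulli(theta_d).
        z_n = rng.choice(len(theta_d), p=theta_d)
        # The word at that position: w_n ~ Multinoulli(beta_{z_n}).
        w_n = rng.choice(len(vocab), p=beta[z_n])
        words.append(vocab[w_n])
    return words

print(generate_document(8))
```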

SLIDE 22

Graphical Model

Note: Sometimes, people add the parameters $\theta$ and $\beta$ into the graphical model

SLIDE 23

The Likelihood Function for a Corpus

  • Probability of a word
      $p(w|d) = \sum_k p(w, z = k|d) = \sum_k p(w|z = k)\, p(z = k|d) = \sum_k \beta_{kw}\theta_{dk}$

  • Likelihood of a corpus
      $\log L = \sum_d \sum_w c(w, d) \log p(d, w) = \sum_d \sum_w c(w, d) \log \left( \pi_d \sum_k \theta_{dk}\beta_{kw} \right)$
      where $\pi_d = p(d)$ is usually considered as uniform, i.e., 1/M
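
A small sketch (toy parameters and counts assumed) that evaluates $p(w|d) = \sum_k \beta_{kw}\theta_{dk}$ for all document-word pairs and sums the corpus log-likelihood; the uniform $\pi_d = 1/M$ term only adds a constant and is omitted here:

```python
import numpy as np

# Hypothetical parameters: K = 2 topics, N = 4 words, M = 2 documents.
beta = np.array([[0.5, 0.3, 0.1, 0.1],      # beta_kw = p(w | z = k)
                 [0.1, 0.1, 0.4, 0.4]])
theta = np.array([[0.9, 0.1],               # theta_dk = p(z = k | d)
                  [0.2, 0.8]])
C = np.array([[5, 4, 1, 0],                 # c(w, d): word counts per document
              [1, 0, 4, 5]])

# p(w|d) = sum_k beta_kw * theta_dk, for every (d, w) pair at once.
p_w_given_d = theta @ beta                  # shape (M, N)

# log L = sum_{d,w} c(w, d) * log p(w|d)   (uniform pi_d term dropped)
log_likelihood = (C * np.log(p_w_given_d)).sum()
print(log_likelihood)
```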

SLIDE 24

Re-arrange the Likelihood Function

  • Group the same word from different positions together
      $\max \log L = \sum_{d,w} c(w, d) \log \sum_{z} \theta_{dz}\beta_{zw}$
      s.t. $\sum_z \theta_{dz} = 1$ and $\sum_w \beta_{zw} = 1$

SLIDE 25

Optimization: EM Algorithm

  • Repeat until convergence (a numpy sketch of these updates follows below)
    • E-step: for each word in each document, calculate its conditional probability of belonging to each topic
        $p(z|w, d) \propto p(w|z, d)\, p(z|d) = \beta_{zw}\theta_{dz}$  $\left(\text{i.e., } p(z|w, d) = \frac{\beta_{zw}\theta_{dz}}{\sum_{z'} \beta_{z'w}\theta_{dz'}}\right)$
    • M-step: given the conditional distribution, find the parameters that maximize the expected likelihood
        $\beta_{zw} \propto \sum_d p(z|w, d)\, c(w, d)$  $\left(\text{i.e., } \beta_{zw} = \frac{\sum_d p(z|w, d)\, c(w, d)}{\sum_{w', d} p(z|w', d)\, c(w', d)}\right)$
        $\theta_{dz} \propto \sum_w p(z|w, d)\, c(w, d)$  $\left(\text{i.e., } \theta_{dz} = \frac{\sum_w p(z|w, d)\, c(w, d)}{N_d}\right)$
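
A compact numpy sketch of these two steps (the word-count matrix below is a toy example, not the one from the following slides); it alternates the E-step and M-step for a fixed number of iterations:

```python
import numpy as np

rng = np.random.default_rng(0)

# c(w, d): word-count matrix, M documents x N vocabulary words (toy values).
C = np.array([[5.0, 4.0, 3.0, 1.0, 1.0, 1.0, 2.0],
              [2.0, 3.0, 1.0, 1.0, 3.0, 4.0, 2.0]])
M, N = C.shape
K = 2                                     # number of topics

# Random initialization; rows of theta (M x K) and beta (K x N) sum to 1.
theta = rng.random((M, K))
theta /= theta.sum(axis=1, keepdims=True)
beta = rng.random((K, N))
beta /= beta.sum(axis=1, keepdims=True)

for _ in range(100):
    # E-step: posterior[d, w, z] = beta_zw * theta_dz, normalized over z.
    joint = theta[:, None, :] * beta.T[None, :, :]        # (M, N, K)
    posterior = joint / joint.sum(axis=2, keepdims=True)

    # M-step: responsibility-weighted counts c(w, d) * p(z | w, d).
    weighted = C[:, :, None] * posterior                  # (M, N, K)
    beta = weighted.sum(axis=0).T                         # beta_zw before normalization
    beta /= beta.sum(axis=1, keepdims=True)               # enforce sum_w beta_zw = 1
    theta = weighted.sum(axis=1)                          # theta_dz before normalization
    theta /= theta.sum(axis=1, keepdims=True)             # divides by N_d

print(np.round(theta, 3))
print(np.round(beta, 3))
```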

SLIDE 26

Example

  • Two documents, two topics
  • Vocabulary: {data, mining, frequent, pattern, web, information, retrieval}
  • At some iteration of the EM algorithm, the E-step gives $p(z|w, d)$ for every word and document [table omitted]

SLIDE 27

Example (Continued)

  • M-step


      $\beta_{11} = \frac{0.8 \times 5 + 0.5 \times 2}{11.8 + 5.8} = 5/17.6$
      $\beta_{12} = \frac{0.8 \times 4 + 0.5 \times 3}{11.8 + 5.8} = 4.7/17.6$
      $\beta_{13} = 3/17.6$, $\beta_{14} = 1.6/17.6$, $\beta_{15} = 1.3/17.6$, $\beta_{16} = 1.2/17.6$, $\beta_{17} = 0.8/17.6$
      $\theta_{11} = 11.8/17$, $\theta_{12} = 5.2/17$
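
A quick arithmetic check of these updates (the counts 5, 2, 4, 3 and the responsibilities 0.8, 0.5 are read off the fractions above; the full E-step table is not reproduced here):

```python
# beta_1w numerators: responsibility-weighted counts of each word for topic 1.
num_beta_11 = 0.8 * 5 + 0.5 * 2      # = 5.0
num_beta_12 = 0.8 * 4 + 0.5 * 3      # = 4.7

# Normalizer: total weighted counts assigned to topic 1 across both documents.
denom = 11.8 + 5.8                   # = 17.6

print(num_beta_11 / denom, num_beta_12 / denom)   # 5/17.6 and 4.7/17.6

# theta_1z is normalized by the length of document 1 (N_1 = 17), so it sums to 1.
print(11.8 / 17 + 5.2 / 17)                       # = 1.0
```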

SLIDE 28

Text Data: Topic Models

  • Text Data and Topic Models
  • Revisit of Mixture Model
  • Probabilistic Latent Semantic Analysis (pLSA)

  • Summary

SLIDE 29

Summary

  • Basic Concepts
    • Word/term, document, corpus, topic
  • Mixture of unigrams
  • pLSA
    • Generative model
    • Likelihood function
    • EM algorithm

SLIDE 30

Quiz

  • Q1: Is Multinomial Naïve Bayes a linear classifier?
  • Q2: In pLSA, does the same word appearing at different positions in a document have the same conditional probability $p(z|w, d)$?
