

slide-1
SLIDE 1

School of Data Science, Fudan University (复旦大学大数据学院)

DATA130006 Text Management and Analysis

Language Model for Topic Analysis

Zhongyu Wei (魏忠钰)

October 18th, 2017

Adapted from UIUC CS410

slide-2
SLIDE 2

Outline

§ What is topic mining?
§ Topic = term vs. topic = word distribution
§ Mixture of unigram language models and the EM algorithm
§ Probabilistic Latent Semantic Analysis (PLSA) and its applications

slide-3
SLIDE 3

Topic Mining and Analysis: Motivation

§ Topic ≈ the main idea discussed in text data

§ Theme/subject of a discussion or conversation
§ Different granularities (e.g., the topic of a sentence, of an article, etc.)

§ Many applications require discovery of topics in text

§ What are Weibo users talking about today?
§ What are the current research topics in data mining? How are they different from those of 5 years ago?
§ What were the major topics debated in the 2012 presidential election?

slide-4
SLIDE 4

Topics as Knowledge About the World

[Diagram: the real world produces text data and non-text data (context: time, location, …); topic mining extracts knowledge about the world from the text, in the form of Topic 1, Topic 2, …, Topic k]

slide-5
SLIDE 5

Tasks of Topic Mining and Analysis

Task 1: Discover k topics.
Task 2: Figure out which documents cover which topics.

[Diagram: text data (Doc 1, Doc 2, …) mapped onto Topic 1, Topic 2, …, Topic k]

slide-6
SLIDE 6

Formal Definition of Topic Mining and Analysis

§ Input

§ A collection of N text documents C = {d1, …, dN}
§ Number of topics: k

§ Output

§ k topics: {θ1, …, θk}
§ Coverage of topics in each di: {πi1, …, πik}
§ πij = probability of di covering topic θj

Constraint: Σ_{j=1}^{k} πij = 1

How to define θi ?

slide-7
SLIDE 7

Initial Idea: Topic = Term

[Diagram: documents Doc 1, Doc 2, …, Doc N mapped onto topical terms θ1 = “Sports”, θ2 = “Travel”, …, θk = “Science”, with coverages such as π11 = 30%, π12 = 12%, π1k = 8%, and π21 = 0, πN1 = 0 for documents that do not cover θ1]

slide-8
SLIDE 8

Mining k Topical Terms from Collection C

§ Parse text in C to obtain candidate terms (e.g., term = word).
§ Design a scoring function to measure how good each term is as a topic (a sketch in code follows this list).

§ Favor a representative term (high frequency is favored).
§ Avoid words that are too frequent (e.g., “the”, “a”, stop words).
§ TF-IDF weighting from retrieval can be very useful.
§ Domain-specific heuristics are possible (e.g., favor title words, hashtags in microblogs).

§ Pick k terms with the highest scores but try to minimize redundancy.

§ If multiple terms are very similar or closely related, pick only one of them and ignore others.
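A minimal sketch of this recipe in Python (the TF-IDF-style score, the stop-word filter, and the 80% co-occurrence redundancy test are illustrative choices, not prescribed by the slides):

```python
import math
from collections import Counter

def mine_topical_terms(docs, k, stopwords=frozenset()):
    """Pick k topical terms from a collection of tokenized documents."""
    tf = Counter()   # collection-wide term frequency
    df = Counter()   # document frequency
    for doc in docs:
        tf.update(doc)
        df.update(set(doc))
    n_docs = len(docs)

    def score(term):
        # favor frequent terms, penalize terms that appear in almost every doc
        return tf[term] * math.log((n_docs + 1) / (df[term] + 0.5))

    def redundant(term, chosen):
        # crude redundancy test: the term co-occurs with an already chosen term
        # in more than 80% of the documents that contain it
        return any(sum(1 for d in docs if term in d and c in d) > 0.8 * df[term]
                   for c in chosen)

    topics = []
    for term in sorted((t for t in tf if t not in stopwords), key=score, reverse=True):
        if len(topics) == k:
            break
        if not redundant(term, topics):
            topics.append(term)
    return topics

print(mine_topical_terms([["sports", "game", "nba"],
                          ["travel", "hotel", "trip"],
                          ["science", "star", "telescope"]], k=3))
# ['sports', 'travel', 'science']
```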

slide-9
SLIDE 9

Computing Topic Coverage: πij

Example: doc di with topical terms θ1 = “Sports”, θ2 = “Travel”, …, θk = “Science”, and counts count(“sports”, di) = 4, count(“travel”, di) = 2, count(“science”, di) = 1.

πij = count(θj, di) / Σ_{L=1}^{k} count(θL, di)
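In code this is just normalizing the term counts (a tiny hypothetical helper, assuming the same token-list representation as the sketch above):

```python
def topic_coverage(doc, topic_terms):
    """pi_ij = count(theta_j, d_i) / sum over L of count(theta_L, d_i)."""
    counts = [doc.count(t) for t in topic_terms]
    total = sum(counts)
    if total == 0:
        return [1.0 / len(topic_terms)] * len(topic_terms)  # no evidence: fall back to uniform
    return [c / total for c in counts]

doc_i = ["sports", "game", "sports", "sports", "sports", "travel", "travel", "science"]
print(topic_coverage(doc_i, ["sports", "travel", "science"]))  # [0.571..., 0.285..., 0.142...]
```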

slide-10
SLIDE 10

How Well Does This Approach Work?

Example doc di: “Cavaliers vs. Golden State Warriors: NBA playoff finals … basketball game … travel to Cleveland … star …”, scored against the topical terms θ1 = “Sports”, θ2 = “Travel”, …, θk = “Science”:

πi1 ∝ c(“sports”, di)    πi2 ∝ c(“travel”, di)    πik ∝ c(“science”, di)

Problems:
1. Need to count related words also!
2. “Star” can be ambiguous (e.g., star in the sky).
3. How do we mine more complicated topics?

slide-11
SLIDE 11

Problems with “Term as Topic”

  • 1. Lack of expressive power
  • Can only represent simple/general topics
  • Can’t represent complicated topics
  • 2. Incompleteness in vocabulary coverage
  • Can’t capture variations of vocabulary (e.g., related words)
  • 3. Word sense ambiguity
  • A topical term or related term can be ambiguous (e.g., basketball star vs. star in the sky)

⇒ Topic = {multiple words} + weights on words
⇒ Split an ambiguous word across topics

A probabilistic topic model can do all of these!

slide-12
SLIDE 12

Improved Idea: Topic = Word Distribution

Vocabulary set: V = {w1, w2, …}; each topic θi is a word distribution over V with Σ_{w∈V} p(w|θi) = 1.

“Sports” θ1, p(w|θ1): sports 0.02, game 0.01, basketball 0.005, football 0.004, play 0.003, star 0.003, …, nba 0.001, …, travel 0.0005, …
“Travel” θ2, p(w|θ2): travel 0.05, attraction 0.03, trip 0.01, flight 0.004, hotel 0.003, island 0.003, …, culture 0.001, …, play 0.0002, …
“Science” θk, p(w|θk): science 0.04, scientist 0.03, spaceship 0.006, telescope 0.004, genomics 0.004, star 0.002, …, genetics 0.001, …, travel 0.00001, …

slide-13
SLIDE 13

Probabilistic Topic Mining and Analysis

§ Input

§ A collection of N text documents C = {d1, …, dN}
§ Vocabulary set: V = {w1, …, wM}
§ Number of topics: k

§ Output

§ k topics, each a word distribution: {θ1, …, θk}
§ Coverage of topics in each di: {πi1, …, πik}
§ πij = probability of di covering topic θj

Constraints: Σ_{j=1}^{k} πij = 1    and    Σ_{w∈V} p(w|θi) = 1

slide-14
SLIDE 14

The Computation Task

Doc 2 Doc N

Doc 1

q1 q2 qk

p11 p12 p1k p21=0% p22 p2k pN1=0% pN2 pNk 30% 12% 8%

sports 0.02 game 0.01 basketball 0.005 football 0.004 … science 0.04 scientist 0.03 spaceship 0.006 … travel 0.05 attraction 0.03 trip 0.01 …

INPUT: C, k, V OUTPUT: { q1, …, qk }, { pi1, …, pik }

Text Data

slide-15
SLIDE 15

Generative Model for Text Mining

Modeling of data generation: p(Data | Model, Λ), where Λ = ({θ1, …, θk}, {π11, …, π1k}, …, {πN1, …, πNk})

Parameter estimation / inference: Λ* = argmax_Λ p(Data | Model, Λ)

How many parameters are there in total?

slide-16
SLIDE 16

Simplest Case of Topic Model: Mining One Topic

INPUT: C = {d}, V    OUTPUT: {θ}

[Diagram: a single document d generated entirely (100% coverage) from one topic θ; p(w|θ) is unknown for words such as text, mining, association, database, …, query, …]

slide-17
SLIDE 17

Language Model Setup

§ Data: document d = x1 x2 … x|d|, where each xi ∈ V = {w1, …, wM} is a word
§ Model: unigram LM θ: {θi = p(wi|θ)}, i = 1, …, M; θ1 + … + θM = 1
§ Likelihood function:
  p(d|θ) = p(x1|θ) × … × p(x|d||θ) = p(w1|θ)^c(w1,d) × … × p(wM|θ)^c(wM,d) = Π_{i=1}^{M} θi^c(wi,d)
§ ML estimate:
  (θ̂1, …, θ̂M) = argmax_{θ1,…,θM} p(d|θ) = argmax_{θ1,…,θM} Π_{i=1}^{M} θi^c(wi,d)
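As a quick check of these two formulas, here is a sketch (assuming a document is just a list of tokens) that computes the log-likelihood under a given unigram LM and the ML estimate c(w, d)/|d|:

```python
import math
from collections import Counter

def log_likelihood(doc, theta):
    """log p(d|theta) = sum_i c(w_i, d) * log p(w_i|theta)."""
    return sum(c * math.log(theta[w]) for w, c in Counter(doc).items())

def mle_unigram(doc):
    """ML estimate: theta_i = c(w_i, d) / |d|."""
    n = len(doc)
    return {w: c / n for w, c in Counter(doc).items()}

doc = ["text", "mining", "text", "clustering"]
theta_hat = mle_unigram(doc)           # {'text': 0.5, 'mining': 0.25, 'clustering': 0.25}
print(log_likelihood(doc, theta_hat))  # about -4.16; no other theta gives a higher value
```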

slide-18
SLIDE 18

Computation of Maximum Likelihood Estimate

Maximize p(d|θ); equivalently, maximize the log-likelihood:
(θ̂1, …, θ̂M) = argmax_{θ1,…,θM} log p(d|θ) = argmax_{θ1,…,θM} Σ_{i=1}^{M} c(wi, d) log θi

Subject to the constraint: Σ_{i=1}^{M} θi = 1

Use the Lagrange multiplier approach. Lagrange function:
f(θ|d) = Σ_{i=1}^{M} c(wi, d) log θi + λ (Σ_{i=1}^{M} θi − 1)

∂f(θ|d)/∂θi = c(wi, d)/θi + λ = 0   →   θi = −c(wi, d)/λ
Σ_{i=1}^{M} −c(wi, d)/λ = 1   →   λ = −Σ_{i=1}^{M} c(wi, d)
→   θ̂i = p(wi|θ̂) = c(wi, d) / Σ_{i=1}^{M} c(wi, d) = c(wi, d) / |d|   (normalized counts)

slide-19
SLIDE 19

What Does the Topic Look Like?

p(w|θ) estimated from a text mining paper d: the 0.031, a 0.018, …, text 0.04, mining 0.035, association 0.03, clustering 0.005, computer 0.0009, …, food 0.000001, …

Can we get rid of these common words?

slide-20
SLIDE 20

Factoring out Background Words

p(w|θ) estimated from a text mining paper d: the 0.031, a 0.018, …, text 0.04, mining 0.035, association 0.03, clustering 0.005, computer 0.0009, …, food 0.000001, …

How can we get rid of these common words?

slide-21
SLIDE 21

Generate d Using Two Word Distributions

Topic θd (text mining paper d), p(w|θd): text 0.04, mining 0.035, association 0.03, clustering 0.005, …, the 0.000001, …
Background topic θB, p(w|θB): the 0.03, a 0.02, is 0.015, we 0.01, food 0.003, …, text 0.000006, …

Topic choice: p(θd) = 0.5, p(θB) = 0.5, with p(θd) + p(θB) = 1

slide-22
SLIDE 22

What’s the probability of observing a word w?

Topic θd, p(w|θd): text 0.04, mining 0.035, association 0.03, clustering 0.005, …, the 0.000001, …
Background topic θB, p(w|θB): the 0.03, a 0.02, is 0.015, we 0.01, food 0.003, …, text 0.000006, …
Topic choice: p(θd) = 0.5, p(θB) = 0.5, p(θd) + p(θB) = 1

What is the probability of observing “text”? Of “the”?
p(“the”) = p(θd) p(“the”|θd) + p(θB) p(“the”|θB) = 0.5 × 0.000001 + 0.5 × 0.03
p(“text”) = p(θd) p(“text”|θd) + p(θB) p(“text”|θB) = 0.5 × 0.04 + 0.5 × 0.000006
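The same arithmetic in a few lines of Python, using the toy probabilities from this slide:

```python
p_topic = {"text": 0.04, "mining": 0.035, "association": 0.03,
           "clustering": 0.005, "the": 0.000001}          # p(w|theta_d)
p_bg = {"the": 0.03, "a": 0.02, "is": 0.015, "we": 0.01,
        "food": 0.003, "text": 0.000006}                  # p(w|theta_B)
p_d = p_B = 0.5                                           # topic choice

def p_word(w):
    # p(w) = p(theta_d) p(w|theta_d) + p(theta_B) p(w|theta_B)
    return p_d * p_topic.get(w, 0.0) + p_B * p_bg.get(w, 0.0)

print(p_word("the"))   # 0.0150005
print(p_word("text"))  # 0.020003
```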

slide-23
SLIDE 23

The Idea of a Mixture Model

θd: text 0.04, mining 0.035, association 0.03, clustering 0.005, …, the 0.000001    θB: the 0.03, a 0.02, is 0.015, we 0.01, food 0.003, …, text 0.000006
Topic choice: p(θd) = 0.5, p(θB) = 0.5, p(θd) + p(θB) = 1

[Diagram: to generate a word w (“text”? “the”?), first choose θd or θB according to the topic-choice probabilities, then sample w from the chosen distribution; this is a mixture model]

slide-24
SLIDE 24

As a Generative Model…

The mixture formally defines the following generative model: p(w) = p(θd) p(w|θd) + p(θB) p(w|θB)

What if p(θd) = 1 or p(θB) = 1? Estimating the model “discovers” two topics + the topic coverage.

slide-25
SLIDE 25

Mixture of Two Unigram Language Models

§ Data: document d
§ Mixture model parameters: Λ = ({p(w|θd)}, {p(w|θB)}, p(θB), p(θd))
§ Two unigram LMs: θd (the topic of d); θB (background topic)
§ Mixing weight (topic choice): p(θd) + p(θB) = 1
§ Likelihood function:
  p(d|Λ) = Π_{i=1}^{|d|} p(xi|Λ) = Π_{i=1}^{|d|} [p(θd) p(xi|θd) + p(θB) p(xi|θB)] = Π_{i=1}^{M} [p(θd) p(wi|θd) + p(θB) p(wi|θB)]^c(wi,d)
§ ML estimate: Λ* = argmax_Λ p(d|Λ)
  subject to Σ_{i=1}^{M} p(wi|θd) = 1,  Σ_{i=1}^{M} p(wi|θB) = 1,  p(θd) + p(θB) = 1

slide-26
SLIDE 26

Back to Factoring out Background Words

θd: text 0.04, mining 0.035, association 0.03, clustering 0.005, …, the 0.000001    θB: the 0.03, a 0.02, is 0.015, we 0.01, food 0.003, …, text 0.000006
Topic choice: p(θd) = 0.5, p(θB) = 0.5, p(θd) + p(θB) = 1

Text mining paper d: “… text mining... is… clustering… we…. text.. the …”

slide-27
SLIDE 27

Estimation of One Topic: p(w|θd)

θd: text ?, mining ?, association ?, clustering ?, …, the ?    θB (known): the 0.03, a 0.02, is 0.015, we 0.01, food 0.003, …, text 0.000006
Topic choice: p(θd) = 0.5, p(θB) = 0.5, p(θd) + p(θB) = 1

d: “… text mining... is… clustering… we…. text.. the …”

Adjust θd to maximize p(d|Λ) (all other parameters are known). Would the ML estimate demote background words in θd?

slide-28
SLIDE 28

Behavior of a Mixture Model

d = “text the”
θd: text ?, the ?    θB: the 0.9, text 0.1    p(θd) = 0.5, p(θB) = 0.5

Likelihood:
p(“text”) = p(θd) p(“text”|θd) + p(θB) p(“text”|θB) = 0.5·p(“text”|θd) + 0.5 × 0.1
p(“the”) = 0.5·p(“the”|θd) + 0.5 × 0.9
p(d|Λ) = p(“text”|Λ) p(“the”|Λ) = [0.5·p(“text”|θd) + 0.5 × 0.1] × [0.5·p(“the”|θd) + 0.5 × 0.9]

How can we set p(“text”|θd) and p(“the”|θd) to maximize it? Note that p(“text”|θd) + p(“the”|θd) = 1.

slide-29
SLIDE 29

“Collaboration” and “Competition” of θd and θB

d = “text the”
θd: text ?, the ?    θB: the 0.9, text 0.1    p(θd) = 0.5, p(θB) = 0.5

p(d|Λ) = p(“text”|Λ) p(“the”|Λ) = [0.5·p(“text”|θd) + 0.5 × 0.1] × [0.5·p(“the”|θd) + 0.5 × 0.9]
Note that p(“text”|θd) + p(“the”|θd) = 1.
If x + y = constant, then xy reaches its maximum when x = y:
0.5·p(“text”|θd) + 0.5 × 0.1 = 0.5·p(“the”|θd) + 0.5 × 0.9   ⇒   p(“text”|θd) = 0.9 >> p(“the”|θd) = 0.1 !

Behavior 1: if p(w1|θB) > p(w2|θB), then p(w1|θd) < p(w2|θd)
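A brute-force numerical check of this argument (not on the slides; it simply searches over p(“text”|θd) on a grid):

```python
# d = "text the"; background fixed at p("the"|theta_B) = 0.9, p("text"|theta_B) = 0.1
def likelihood(p_text):
    p_the = 1.0 - p_text            # constraint: p("text"|theta_d) + p("the"|theta_d) = 1
    return (0.5 * p_text + 0.5 * 0.1) * (0.5 * p_the + 0.5 * 0.9)

best = max((i / 1000 for i in range(1001)), key=likelihood)
print(best, likelihood(best))       # 0.9 0.25 -- theta_d pushes its mass onto "text"
```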

slide-30
SLIDE 30

Response to Data Frequency

d = “text the”
p(d|Λ) = [0.5·p(“text”|θd) + 0.5 × 0.1] × [0.5·p(“the”|θd) + 0.5 × 0.9]   ⇒   p(“text”|θd) = 0.9 >> p(“the”|θd) = 0.1 !

What if we generate more “the”?
d’ = “text the the the the … the”
p(d’|Λ) = [0.5·p(“text”|θd) + 0.5 × 0.1] × [0.5·p(“the”|θd) + 0.5 × 0.9] × [0.5·p(“the”|θd) + 0.5 × 0.9] × [0.5·p(“the”|θd) + 0.5 × 0.9] × …

What is the optimal solution now? p(“the”|θd) > 0.1, or p(“the”|θd) < 0.1?

Behavior 2: higher-frequency words get higher p(w|θd).

slide-31
SLIDE 31

Estimation of One Topic: p(w|θd)

θd: text ?, mining ?, association ?, clustering ?, …, the ?    θB (known): the 0.03, a 0.02, is 0.015, we 0.01, food 0.003, …, text 0.000006
Topic choice: p(θd) = 0.5, p(θB) = 0.5, p(θd) + p(θB) = 1

d: “… text mining... is… clustering… we…. text.. the …”

How should we set θd to maximize p(d|Λ)? (all other parameters are known)

slide-32
SLIDE 32

If we know which word is from which distribution…

θd: text ?, mining ?, association ?, clustering ?, …, the ?    θB (known): the 0.03, a 0.02, is 0.015, we 0.01, food 0.003, …, text 0.000006
Topic choice: p(θd) = 0.5, p(θB) = 0.5, p(θd) + p(θB) = 1

d: “… text mining... is… clustering… we…. text.. the …”

If d’ denotes the part of d consisting of the words generated from θd, then
p(wi|θd) = c(wi, d’) / Σ_{w’∈V} c(w’, d’)

slide-33
SLIDE 33

Infer the Distribution a Word Is From…

θd, p(w|θd): text 0.04, mining 0.035, association 0.03, clustering 0.005, …, the 0.000001    θB, p(w|θB): the 0.03, a 0.02, is 0.015, we 0.01, food 0.003, …, text 0.000006
Topic choice: p(θd) = 0.5, p(θB) = 0.5, p(θd) + p(θB) = 1

Is “text” more likely from θd or from θB?
From θd (z = 0)?  p(θd) p(“text”|θd)        From θB (z = 1)?  p(θB) p(“text”|θB)

p(z = 0 | w = “text”) = p(θd) p(“text”|θd) / [p(θd) p(“text”|θd) + p(θB) p(“text”|θB)]

slide-34
SLIDE 34

The Expectation-Maximization (EM) Algorithm

Hidden variable: z ∈ {0, 1}, where z = 0 means the word was generated from θd and z = 1 means it came from θB (e.g., in “the paper presents a text mining algorithm for clustering …”, function words such as “the”, “a”, “for” would get z = 1).

Initialize p(w|θd) with random values. Then iteratively improve it using the E-step and M-step. Stop when the likelihood no longer changes.

E-step: p^(n)(z = 0 | w) = p(θd) p^(n)(w|θd) / [p(θd) p^(n)(w|θd) + p(θB) p(w|θB)]   (how likely w is from θd)
M-step: p^(n+1)(w|θd) = c(w, d) p^(n)(z = 0 | w) / Σ_{w’∈V} c(w’, d) p^(n)(z = 0 | w’)

slide-35
SLIDE 35

EM Computation in Action (in-class practice)

Assume p(θd) = p(θB) = 0.5 and p(w|θB) is known. Fill in the blanks:

Word     #   p(w|θB)   Iter 1: p(w|θ)  p(z=0|w)   Iter 2: p(w|θ)  p(z=0|w)   Iter 3: p(w|θ)  p(z=0|w)
The      4   0.5       0.25            ____       ____            ____       ____            ____
Paper    2   0.3       0.25            ____       ____            ____       ____            ____
Text     4   0.1       0.25            ____       ____            ____       ____            ____
Mining   2   0.1       0.25            ____       ____            ____       ____            ____
Log-likelihood         ____                       ____                       ____

E-step: p^(n)(z = 0 | w) = p(θd) p^(n)(w|θd) / [p(θd) p^(n)(w|θd) + p(θB) p(w|θB)]
M-step: p^(n+1)(w|θd) = c(w, d) p^(n)(z = 0 | w) / Σ_{w’∈V} c(w’, d) p^(n)(z = 0 | w’)

slide-36
SLIDE 36

EM Computation in Action (in-class practice)

Assume p(θd) = p(θB) = 0.5 and p(w|θB) is known.

Word     #   p(w|θB)   Iter 1: p(w|θ)  p(z=0|w)   Iter 2: p(w|θ)  p(z=0|w)   Iter 3: p(w|θ)  p(z=0|w)
The      4   0.5       0.25            0.33       0.20            0.29       0.18            0.26
Paper    2   0.3       0.25            0.45       0.14            0.32       0.10            0.25
Text     4   0.1       0.25            0.71       0.44            0.81       0.50            0.83
Mining   2   0.1       0.25            0.71       0.22            0.69       0.22            0.69
Log-likelihood         -16.96                     -16.13                     -16.02
(the log-likelihood is computed with each iteration's p(w|θ) column)

The likelihood increases at every iteration.

E-step: p^(n)(z = 0 | w) = p(θd) p^(n)(w|θd) / [p(θd) p^(n)(w|θd) + p(θB) p(w|θB)]
M-step: p^(n+1)(w|θd) = c(w, d) p^(n)(z = 0 | w) / Σ_{w’∈V} c(w’, d) p^(n)(z = 0 | w’)
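The table above can be reproduced with a short script; this is a sketch of the two-component EM, using the natural log for the log-likelihood:

```python
import math

counts = {"the": 4, "paper": 2, "text": 4, "mining": 2}          # c(w, d)
p_bg = {"the": 0.5, "paper": 0.3, "text": 0.1, "mining": 0.1}    # p(w|theta_B), known
p_d = p_B = 0.5                                                  # topic choice, known
p_topic = {w: 0.25 for w in counts}                              # initial p(w|theta_d)

def log_likelihood():
    return sum(c * math.log(p_d * p_topic[w] + p_B * p_bg[w]) for w, c in counts.items())

for it in range(1, 4):
    print(f"iteration {it}: log-likelihood = {log_likelihood():.2f}")
    # E-step: p(z=0|w), the probability that w was generated by theta_d
    z0 = {w: p_d * p_topic[w] / (p_d * p_topic[w] + p_B * p_bg[w]) for w in counts}
    # M-step: re-estimate p(w|theta_d) from the counts "allocated" to theta_d
    alloc = {w: counts[w] * z0[w] for w in counts}
    total = sum(alloc.values())
    p_topic = {w: alloc[w] / total for w in counts}
    print("   p(z=0|w):     ", {w: round(v, 2) for w, v in z0.items()})
    print("   p(w|theta_d): ", {w: round(v, 2) for w, v in p_topic.items()})
```

This prints log-likelihoods of roughly -16.96, -16.13 and -16.0, increasing at each iteration, and the same E-step/M-step values as the table up to rounding.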

slide-37
SLIDE 37

Document as a Sample of Mixed Topics

Topic θ1: government 0.3, response 0.2, …    Topic θ2: donate 0.1, relief 0.05, help 0.02, …    Topic θk: city 0.2, new 0.1, orleans 0.05, …    Background θB: the 0.04, a 0.03, …

Blog article about “Hurricane Katrina”: [Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response] to the [flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated] … [Over seventy countries pledged monetary donations or other assistance]. …

Many applications are possible if we can “decode” the topics in text…

slide-38
SLIDE 38

Mining Multiple Topics from Text

INPUT: C, k, V    OUTPUT: {θ1, …, θk}, {πi1, …, πik}

[Diagram: text data (Doc 1, Doc 2, …, Doc N) linked to word distributions θ1 (sports 0.02, game 0.01, basketball 0.005, football 0.004, …), θ2 (travel 0.05, attraction 0.03, trip 0.01, …), …, θk (science 0.04, scientist 0.03, spaceship 0.006, …), with coverages such as π11 = 30%, π12 = 12%, π1k = 8%, and π21 = 0%, πN1 = 0%]

slide-39
SLIDE 39

Generating Text with Multiple Topics: p(w)=?

Topic θ1: government 0.3, response 0.2, …    Topic θ2: donate 0.1, relief 0.05, help 0.02, …    Topic θk: city 0.2, new 0.1, orleans 0.05, …    Background θB: the 0.04, a 0.03, …

Topic choice: p(θB) = λB; p(θ1) = πd,1, p(θ2) = πd,2, …, p(θk) = πd,k, with Σ_{i=1}^{k} πd,i = 1

To generate a word w: with probability λB draw it from the background; otherwise (probability 1 − λB) pick topic θj with probability πd,j and draw w from p(w|θj). Hence

p(w) = λB p(w|θB) + (1 − λB) [πd,1 p(w|θ1) + πd,2 p(w|θ2) + … + πd,k p(w|θk)]
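In code, for k topics plus a background (the λB and coverage values below are made up for illustration; the word distributions are the toy ones from the figure):

```python
lambda_B = 0.1                                   # assumed fraction of background words
pi_d = [0.5, 0.3, 0.2]                           # pi_{d,1}, ..., pi_{d,k}; sums to 1
topics = [{"government": 0.3, "response": 0.2},            # theta_1
          {"donate": 0.1, "relief": 0.05, "help": 0.02},   # theta_2
          {"city": 0.2, "new": 0.1, "orleans": 0.05}]      # theta_k
background = {"the": 0.04, "a": 0.03}            # theta_B

def p_word(w):
    # p(w) = lambda_B p(w|theta_B) + (1 - lambda_B) * sum_j pi_{d,j} p(w|theta_j)
    topical = sum(pi * theta.get(w, 0.0) for pi, theta in zip(pi_d, topics))
    return lambda_B * background.get(w, 0.0) + (1 - lambda_B) * topical

print(p_word("government"))  # 0.1*0 + 0.9*(0.5*0.3) = about 0.135
print(p_word("the"))         # 0.1*0.04 + 0.9*0 = about 0.004
```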

slide-40
SLIDE 40

Probabilistic Latent Semantic Analysis (PLSA)

pd(w) = λB p(w|θB) + (1 − λB) Σ_{j=1}^{k} πd,j p(w|θj)

log p(d) = Σ_{w∈V} c(w, d) log[ λB p(w|θB) + (1 − λB) Σ_{j=1}^{k} πd,j p(w|θj) ]

log p(C|Λ) = Σ_{d∈C} Σ_{w∈V} c(w, d) log[ λB p(w|θB) + (1 − λB) Σ_{j=1}^{k} πd,j p(w|θj) ]

Unknown parameters: Λ = ({πd,j}, {θj}), j = 1, …, k
Known: λB (percentage of background words) and the background LM p(w|θB). Unknown: πd,j (coverage of topic θj in doc d) and p(w|θj) (probability of word w under topic θj).

How many unknown parameters are there in total?

slide-41
SLIDE 41

ML Parameter Estimation

log p(C|Λ) = Σ_{d∈C} Σ_{w∈V} c(w, d) log[ λB p(w|θB) + (1 − λB) Σ_{j=1}^{k} πd,j p(w|θj) ]

Constrained optimization:
Λ* = argmax_Λ p(C|Λ)
subject to  ∀j ∈ [1, k]: Σ_{i=1}^{M} p(wi|θj) = 1   and   ∀d ∈ C: Σ_{j=1}^{k} πd,j = 1

slide-42
SLIDE 42

EM Algorithm for PLSA: E-Step

Hidden variable (topic indicator): zd,w ∈ {B, 1, 2, …, k}

By Bayes' rule:

p(zd,w = j) = πd,j^(n) p^(n)(w|θj) / Σ_{j'=1}^{k} πd,j'^(n) p^(n)(w|θj')   (probability that w in doc d is generated from topic θj)

p(zd,w = B) = λB p(w|θB) / [ λB p(w|θB) + (1 − λB) Σ_{j=1}^{k} πd,j^(n) p^(n)(w|θj) ]   (probability that w in doc d is generated from the background θB)

slide-43
SLIDE 43

EM Algorithm for PLSA: M-Step

Hidden variable (topic indicator): zd,w ∈ {B, 1, 2, …, k}

Re-estimated probability of doc d covering topic θj:
πd,j^(n+1) = Σ_{w∈V} c(w, d) (1 − p(zd,w = B)) p(zd,w = j) / Σ_{j'=1}^{k} Σ_{w∈V} c(w, d) (1 − p(zd,w = B)) p(zd,w = j')

Re-estimated probability of word w for topic θj:
p^(n+1)(w|θj) = Σ_{d∈C} c(w, d) (1 − p(zd,w = B)) p(zd,w = j) / Σ_{w'∈V} Σ_{d∈C} c(w', d) (1 − p(zd,w' = B)) p(zd,w' = j)

Both are ML estimates based on the word counts “allocated” to topic θj.

slide-44
SLIDE 44

Computation of the EM Algorithm

§ Initialize all unknown parameters randomly
§ Repeat until the likelihood converges:
  § E-step:
    p(zd,w = j) ∝ πd,j^(n) p^(n)(w|θj), normalized so that Σ_{j=1}^{k} p(zd,w = j) = 1
    p(zd,w = B) ∝ λB p(w|θB)   (what is the normalizer for this one?)
  § M-step:
    πd,j^(n+1) ∝ Σ_{w∈V} c(w, d) (1 − p(zd,w = B)) p(zd,w = j),  with ∀d ∈ C: Σ_{j=1}^{k} πd,j = 1
    p^(n+1)(w|θj) ∝ Σ_{d∈C} c(w, d) (1 − p(zd,w = B)) p(zd,w = j),  with ∀j ∈ [1, k]: Σ_{w∈V} p(w|θj) = 1

In general, accumulate counts, and then normalize
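A compact sketch of this loop (the function shape, the random initialization, and the smoothing guards are my choices; the update equations are the ones on the E-step/M-step slides above):

```python
import random
from collections import Counter

def plsa(docs, k, vocab, p_bg, lam_B, iters=100, seed=0):
    """PLSA with a fixed background model.

    docs: list of token lists; vocab: list of words; p_bg: dict p(w|theta_B)
    covering every word in vocab; lam_B: known fraction of background words.
    Returns (pi, topics).
    """
    rng = random.Random(seed)
    vset = set(vocab)
    counts = [Counter(w for w in d if w in vset) for d in docs]

    # random initialization of p(w|theta_j); uniform initialization of pi_{d,j}
    topics = []
    for _ in range(k):
        raw = {w: rng.random() for w in vocab}
        s = sum(raw.values())
        topics.append({w: v / s for w, v in raw.items()})
    pi = [[1.0 / k] * k for _ in docs]

    for _ in range(iters):
        new_topics = [dict.fromkeys(vocab, 0.0) for _ in range(k)]
        new_pi = [[0.0] * k for _ in docs]
        for d, cnt in enumerate(counts):
            for w, c in cnt.items():
                # E-step: posterior over topics, and probability of background
                post = [pi[d][j] * topics[j][w] for j in range(k)]
                s = sum(post) or 1e-100
                post = [p / s for p in post]                                 # p(z_{d,w} = j)
                p_B = lam_B * p_bg[w] / (lam_B * p_bg[w] + (1 - lam_B) * s)  # p(z_{d,w} = B)
                # M-step numerators: word counts allocated to each topic
                for j in range(k):
                    share = c * (1 - p_B) * post[j]
                    new_topics[j][w] += share
                    new_pi[d][j] += share
        # M-step: normalize the accumulated counts
        topics = [{w: v / (sum(t.values()) or 1.0) for w, v in t.items()} for t in new_topics]
        pi = [[x / (sum(row) or 1.0) for x in row] for row in new_pi]
    return pi, topics
```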

slide-45
SLIDE 45

Applications of Topic Models for Text Mining: Illustration with 2 Topics

Likelihood (λ = coverage of θ1, 1 − λ = coverage of θ2):
p(d|θ1, θ2) = Π_{w∈V} [λ p(w|θ1) + (1 − λ) p(w|θ2)]^c(w,d)
log p(d|θ1, θ2) = Σ_{w∈V} c(w, d) log[ λ p(w|θ1) + (1 − λ) p(w|θ2) ]

Application scenarios:
  • p(w|θ1) & p(w|θ2) are known; estimate λ: the doc is about text mining and food nutrition; how much of it is about text mining?
  • p(w|θ1) & λ are known; estimate p(w|θ2): 30% of the doc is about text mining; what is the rest about?
  • p(w|θ1) is known; estimate λ & p(w|θ2): the doc is about text mining; is it also about some other topic, and if so to what extent?
  • λ is known; estimate p(w|θ1) & p(w|θ2): 30% of the doc is about one topic and 70% about another; what are these two topics?
  • Estimate λ, p(w|θ1), p(w|θ2): the doc is about two subtopics; find out what they are and to what extent the doc covers each.
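For instance, the first scenario (both word distributions known, only λ unknown) needs just a one-parameter EM; a sketch with made-up word distributions:

```python
from collections import Counter

def estimate_lambda(doc, theta1, theta2, iters=100, floor=1e-12):
    """Estimate lambda, the coverage of theta1 in doc, with theta1 and theta2 fixed."""
    counts = Counter(doc)
    lam = 0.5                                            # initial guess
    for _ in range(iters):
        # E-step: probability that each word occurrence came from theta1
        z1 = {w: lam * theta1.get(w, floor) /
                 (lam * theta1.get(w, floor) + (1 - lam) * theta2.get(w, floor))
              for w in counts}
        # M-step: lambda = expected fraction of word occurrences from theta1
        lam = sum(c * z1[w] for w, c in counts.items()) / sum(counts.values())
    return lam

text_mining = {"text": 0.2, "mining": 0.2, "clustering": 0.1, "the": 0.1, "food": 0.001}
nutrition = {"food": 0.2, "nutrition": 0.2, "healthy": 0.1, "the": 0.1, "text": 0.001}
doc = ["text", "mining", "the", "food", "nutrition", "text"]
print(round(estimate_lambda(doc, text_mining, nutrition), 2))  # about 0.6: mostly text mining
```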

slide-46
SLIDE 46

Use PLSA for Text Mining

§ PLSA produces:
  § Topic coverage in each document: πd,j
  § A word distribution for each topic: p(w|θj)
  § A topic assignment at the word level for each document
§ The number of topics must be given in advance
§ These probabilities can be used in many different ways:
  § θj naturally serves as a word cluster
  § πd,j can be used for document clustering: assign d to cluster j* = argmax_j πd,j
§ Contextual text mining: make these parameters conditioned on context, e.g.,
  § p(θj | time), from which we can compute/plot p(time | θj)
  § p(θj | location), from which we can compute/plot p(loc | θj)

slide-47
SLIDE 47

Sample Topics from TDT Corpus [Hofmann 99b]

slide-48
SLIDE 48

How to Help Users Interpret a Topic Model? [Mei et al. 07b]

  • Use top words: automatic, but hard to make sense of
  • Use human-generated labels: they make sense, but cannot scale up

Example topic: term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independence 0.03, model 0.03, frequent 0.02, probabilistic 0.02, document 0.02, …  →  human label: “Retrieval Models”; top words: term, relevance, weight, feedback

Question: Can we automatically generate understandable labels for topics? For example, what label fits the topic “insulin, foraging, foragers, collected, grains, loads, collection, nectar, …”?

slide-49
SLIDE 49

What is a Good Label?

  • Semantically close (relevance)
  • Understandable – phrases?
  • High coverage inside topic
  • Discriminative across topics

Example (a topic from [Mei & Zhai 06b]): term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …  →  label: “Retrieval models”

slide-50
SLIDE 50

Automatic Labeling of Topics [Mei et al. 07b]

[Pipeline: (1) build a candidate label pool from the collection (the context) using an NLP chunker and n-gram statistics, e.g., “database system, clustering algorithm, r tree, functional dependency, iceberg cube, concurrency control, index structure, …”; (2) for each multinomial topic model, rank the candidate labels by a relevance score and re-rank for coverage and discrimination, yielding a ranked list of labels such as “clustering algorithm; distance measure; …”]

slide-51
SLIDE 51

Relevance: the Zero-Order Score

§ Intuition: prefer phrases that cover the topic's top words well

Example: latent topic θ with p(“clustering”|θ) = 0.4, p(“dimensional”|θ) = 0.3, …, p(“shape”|θ) = 0.01, p(“body”|θ) = 0.001, … (other top words: algorithm, birch, …). Good label (l1): “clustering algorithm”; bad label (l2): “body shape”.

Zero-order score: score(l, θ) = log p(l|θ) = Σ_{w∈l} log p(w|θ)   (treating the label's words as generated independently from θ), so
score(“clustering algorithm”, θ) > score(“body shape”, θ)
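A sketch of scoring under this zero-order view (the topic dictionary below reuses the slide's numbers for “clustering”, “dimensional”, “body”, and “shape”; the other entries and the floor for unseen words are filler):

```python
import math

def zero_order_score(label, topic, floor=1e-6):
    """score(l, theta) = sum of log p(w|theta) over the words of the label phrase."""
    return sum(math.log(topic.get(w, floor)) for w in label.split())

topic = {"clustering": 0.4, "dimensional": 0.3, "algorithm": 0.1,
         "birch": 0.05, "shape": 0.01, "body": 0.001}
print(zero_order_score("clustering algorithm", topic))  # about -3.2
print(zero_order_score("body shape", topic))            # about -11.5 -- much worse
```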

slide-52
SLIDE 52

Relevance: the First-Order Score

§ Intuition: prefer phrases whose context distribution is similar to the topic's distribution

Example (context: SIGMOD Proceedings): topic θ with p(w|θ) concentrated on “clustering, dimension, partition, algorithm, hash, …”. The context distribution of the good label (l1) “clustering algorithm”, p(w | clustering algorithm) ≈ “clustering, hash, dimension, algorithm, partition, …”, is close to p(w|θ), while that of the bad label (l2) “hash join”, p(w | hash join) ≈ “clustering, hash, dimension, key, algorithm, …”, is farther away:

D(θ || clustering algorithm) < D(θ || hash join)

First-order score: Score(l, θ) ∝ Σ_w p(w|θ) PMI(w, l | C)

slide-53
SLIDE 53

Results: Sample Topic Labels

sampling 0.06, estimation 0.04, approximate 0.04, histograms 0.03, selectivity 0.03, histogram 0.02, answers 0.02, accurate 0.02, …  →  label: “selectivity estimation …”
tree 0.09, trees 0.08, spatial 0.08, b 0.05, r 0.04, disk 0.02, array 0.01, cache 0.01, …  →  labels: “r tree / b tree …”, “indexing methods”
north 0.02, case 0.01, trial 0.01, iran 0.01, documents 0.01, walsh 0.009, reagan 0.009, charges 0.007, …  →  label: “iran contra …”
the, of, a, and, to, data (each > 0.02), …, clustering 0.02, time 0.01, clusters 0.01, databases 0.01, large 0.01, performance 0.01, quality 0.005, …  →  labels: “clustering algorithm”, “clustering structure”, …; “large data, data quality, high data, data application, …”

slide-54
SLIDE 54

Results: Contextual-Sensitive Labeling

Topic 1 (sampling, estimation, approximation, histogram, selectivity, histograms, …):
  Context: Database (SIGMOD Proceedings)  →  selectivity estimation; random sampling; approximate answers; …
  Context: IR (SIGIR Proceedings)  →  distributed retrieval; parameter estimation; mixture models; …

Topic 2 (dependencies, functional, cube, multivalued, iceberg, buc, …):
  Context: Database (SIGMOD Proceedings)  →  multivalue dependency; functional dependency; iceberg cube; …
  Context: IR (SIGIR Proceedings)  →  term dependency; independence assumption; …

slide-55
SLIDE 55

Using PLSA to Discover Temporal Topic Trends [Mei & Zhai 05]

[Plot: normalized strength of theme over time (1999-2004) for themes such as Biology Data, Web Information, Time Series, Classification, Association Rule, Clustering, and Business]

Example themes: gene 0.0173, expressions 0.0096, probability 0.0081, microarray 0.0038, … (Biology Data); marketing 0.0087, customer 0.0086, model 0.0079, business 0.0048, … (Business); rules 0.0142, association 0.0064, support 0.0053, … (Association Rule)

slide-56
SLIDE 56

Use PLSA to Integrate Opinions [Lu & Zhai 08]

[Diagram - Input: an expert review of the iPod organized by aspects (Design: “cute… tiny… ..thicker..”, Battery: “last many hrs” / “die out soon”, Price: “could afford it” / “still expensive”) plus a text collection of ordinary opinions (e.g., weblogs). Output: an integrated summary that aligns similar opinions with the review aspects (Design, Battery, Price) and adds supplementary opinions under extra aspects (e.g., iTunes: “easy to use…”, warranty: “better to extend..”)]