复旦大学大数据学院
School of Data Science, Fudan University
DATA130006 Text Management and Analysis
Language Model for Topic Analysis
魏忠钰
October 18th, 2017
Adapted from UIUC CS410
Outline
§ What is topic mining?
Topic Mining and Analysis: Motivation
§ Topic ≈ the main idea discussed in text data
§ Theme/subject of a discussion or conversation
§ Different granularities (e.g., topic of a sentence, an article, etc.)
§ Many applications require discovery of topics in text
§ What are Weibo users talking about today?
§ What are the current research topics in data mining? How are they different from those 5 years ago?
§ What were the major topics debated in the 2012 presidential election?
Topics as Knowledge About the World
[Diagram: the real world is observed through text data and non-text data plus context (time, location, …); topic mining extracts Topic 1, Topic 2, …, Topic k as knowledge about the world]
Tasks of Topic Mining and Analysis
§ Task 1: Discover k topics
§ Task 2: Figure out which documents cover which topics
[Diagram: text data (Doc 1, Doc 2, …) mapped to Topic 1, Topic 2, …, Topic k]
Formal Definition of Topic Mining and Analysis
§ Input
§ A collection of N text documents C = {d1, …, dN}
§ Number of topics: k
§ Output
§ k topics: { θ_1, …, θ_k }
§ Coverage of topics in each d_i: { π_i1, …, π_ik }
§ π_ij = probability of d_i covering topic θ_j, with $\sum_{j=1}^{k} \pi_{ij} = 1$

How to define θ_i?
Initial Idea: Topic = Term
[Diagram: each topic is a single term, θ_1 = "Sports", θ_2 = "Travel", …, θ_k = "Science"; each document d_i in the text data (Doc 1, Doc 2, …, Doc N) gets coverage values π_i1, π_i2, …, π_ik (e.g., 30%, 12%, 8%), with π_21 = 0 and π_N1 = 0 for documents not covering "Sports"]
Mining k Topical Terms from Collection C
§ Parse text in C to obtain candidate terms (e.g., term = word).
§ Design a scoring function to measure how good each term is as a topic (see the sketch after this list).
§ Favor a representative term (high frequency is favored).
§ Avoid words that are too frequent (e.g., "the", "a", stop words).
§ TF-IDF weighting from retrieval can be very useful.
§ Domain-specific heuristics are possible (e.g., favor title words, hashtags in microblogs).
§ Pick k terms with the highest scores but try to minimize redundancy.
§ If multiple terms are very similar or closely related, pick only one of them and ignore others.
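A minimal sketch of such a scoring function, assuming whitespace-tokenized documents held in memory; the TF-IDF-style formula, the function name, and the toy corpus are illustrative, and redundancy filtering is only noted in a comment.

```python
import math
from collections import Counter

def score_topical_terms(docs, k, stopwords=frozenset()):
    """Rank candidate terms by a TF-IDF-style score: frequent in the collection,
    but down-weighted if they appear in (almost) every document; stop words removed."""
    n_docs = len(docs)
    tf = Counter()   # collection term frequency
    df = Counter()   # document frequency
    for doc in docs:
        tokens = [w.lower() for w in doc.split() if w.lower() not in stopwords]
        tf.update(tokens)
        df.update(set(tokens))
    scores = {w: tf[w] * math.log((n_docs + 1) / (df[w] + 0.5)) for w in tf}
    # Pick the k highest-scoring terms; a real system would also drop near-duplicate
    # or closely related terms to minimize redundancy.
    return sorted(scores, key=scores.get, reverse=True)[:k]

docs = ["nba basketball game star", "travel flight hotel trip", "science telescope star data"]
print(score_topical_terms(docs, k=3, stopwords={"the", "a", "of"}))
```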
Computing Topic Coverage: π_ij
"Sports" θ_1, "Travel" θ_2, …, "Science" θ_k
[Example: in document d_i, count("sports", d_i) = 4, count("travel", d_i) = 2, count("science", d_i) = 1]

$\pi_{ij} = \dfrac{count(\theta_j, d_i)}{\sum_{L=1}^{k} count(\theta_L, d_i)}$
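A tiny sketch of this normalization, assuming per-topic term counts for one document are already available (names are illustrative):

```python
def topic_coverage(topic_counts):
    """pi_ij = count(theta_j, d_i) / sum_L count(theta_L, d_i) for one document d_i."""
    total = sum(topic_counts.values())
    return {topic: c / total for topic, c in topic_counts.items()}

# Counts from the slide: count("sports", d_i) = 4, count("travel", d_i) = 2, count("science", d_i) = 1
print(topic_coverage({"sports": 4, "travel": 2, "science": 1}))
# {'sports': 0.571..., 'travel': 0.285..., 'science': 0.142...}
```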
How Well Does This Approach Work?
"Sports" θ_1, "Travel" θ_2, …, "Science" θ_k
[Example document d_i: "Cavaliers vs. Golden State Warriors: NBA playoff finals … basketball game … travel to Cleveland … star …"]

§ π_i1 ∝ c("sports", d_i) = 0, even though the document is clearly about sports ⇒ must count related words also!
§ π_i2 ∝ c("travel", d_i) ≥ 1 > 0, so "Travel" is scored higher than "Sports"
§ π_ik ∝ c("science", d_i): "star" is ambiguous (basketball star vs. star in the sky)
Problems with “Term as Topic”
§ A single term cannot express a complicated topic ⇒ Topic = {Multiple Words}
§ Related words are not counted ⇒ add weights on words
§ A term can be ambiguous (e.g., basketball star vs. star in the sky) ⇒ split an ambiguous word

A probabilistic topic model can do all these!
Improved Idea: Topic = Word Distribution
"Sports" θ_1, P(w|θ_1): sports 0.02, game 0.01, basketball 0.005, football 0.004, play 0.003, star 0.003, …, nba 0.001, …, travel 0.0005, …
"Travel" θ_2, P(w|θ_2): travel 0.05, attraction 0.03, trip 0.01, flight 0.004, hotel 0.003, island 0.003, …, culture 0.001, …, play 0.0002, …
"Science" θ_k, P(w|θ_k): science 0.04, scientist 0.03, spaceship 0.006, telescope 0.004, genomics 0.004, star 0.002, …, genetics 0.001, …, travel 0.00001, …

$\sum_{w \in V} p(w \mid \theta_i) = 1$
Vocabulary Set: V={w1, w2,….}
Probabilistic Topic Mining and Analysis
§ Input
§ A collection of N text documents C = {d1, …, dN}
§ Vocabulary set: V = {w1, …, wM}
§ Number of topics: k
§ Output
§ k topics, each a word distribution: { θ_1, …, θ_k }, with $\sum_{w \in V} p(w \mid \theta_i) = 1$
§ Coverage of topics in each d_i: { π_i1, …, π_ik }, with $\sum_{j=1}^{k} \pi_{ij} = 1$
§ π_ij = probability of d_i covering topic θ_j
The Computation Task
Doc 2 Doc N
Doc 1
q1 q2 qk
p11 p12 p1k p21=0% p22 p2k pN1=0% pN2 pNk 30% 12% 8%
sports 0.02 game 0.01 basketball 0.005 football 0.004 … science 0.04 scientist 0.03 spaceship 0.006 … travel 0.05 attraction 0.03 trip 0.01 …
INPUT: C, k, V OUTPUT: { q1, …, qk }, { pi1, …, pik }
Text Data
Generative Model for Text Mining
§ Modeling of data generation: P(Data | Model, Λ), where Λ = ({ θ_1, …, θ_k }, { π_11, …, π_1k }, …, { π_N1, …, π_Nk })
§ Parameter estimation / inference: $\Lambda^* = \arg\max_{\Lambda} p(Data \mid Model, \Lambda)$
How many parameters in total?
Simplest Case of Topic Model: Mining One Topic
[Diagram: a single document d (text data) covering one topic θ with coverage 100%; the word distribution P(w|θ) is to be estimated: text ?, mining ?, association ?, database ?, …, query ?, …]

INPUT: C = {d}, V    OUTPUT: { θ }
Language Model Setup
§ Data: document d = x_1 x_2 … x_|d|, where each x_i ∈ V = {w_1, …, w_M} is a word
§ Model: unigram LM θ: {θ_i = p(w_i | θ)}, i = 1, …, M; θ_1 + … + θ_M = 1
§ Likelihood function:
$p(d \mid \theta) = p(x_1 \mid \theta) \times \cdots \times p(x_{|d|} \mid \theta) = p(w_1 \mid \theta)^{c(w_1, d)} \times \cdots \times p(w_M \mid \theta)^{c(w_M, d)} = \prod_{i=1}^{M} \theta_i^{c(w_i, d)}$
§ ML estimate:
$(\hat\theta_1, \ldots, \hat\theta_M) = \arg\max_{\theta_1, \ldots, \theta_M} p(d \mid \theta)$
Computation of Maximum Likelihood Estimate
§ Maximize p(d | θ):
$(\hat\theta_1, \ldots, \hat\theta_M) = \arg\max_{\theta_1, \ldots, \theta_M} p(d \mid \theta) = \arg\max_{\theta_1, \ldots, \theta_M} \prod_{i=1}^{M} \theta_i^{c(w_i, d)}$
§ Equivalently, maximize the log-likelihood:
$(\hat\theta_1, \ldots, \hat\theta_M) = \arg\max_{\theta_1, \ldots, \theta_M} \log p(d \mid \theta) = \arg\max_{\theta_1, \ldots, \theta_M} \sum_{i=1}^{M} c(w_i, d) \log\theta_i$
subject to the constraint $\sum_{i=1}^{M} \theta_i = 1$.
§ Use the Lagrange multiplier approach. Lagrange function:
$f(\theta \mid d) = \sum_{i=1}^{M} c(w_i, d)\log\theta_i + \lambda\Big(\sum_{i=1}^{M} \theta_i - 1\Big)$
$\dfrac{\partial f(\theta \mid d)}{\partial \theta_i} = \dfrac{c(w_i, d)}{\theta_i} + \lambda = 0 \;\Rightarrow\; \theta_i = -\dfrac{c(w_i, d)}{\lambda}$
$\sum_{i=1}^{M} -\dfrac{c(w_i, d)}{\lambda} = 1 \;\Rightarrow\; \lambda = -\sum_{i=1}^{M} c(w_i, d)$
$\Rightarrow\; \hat\theta_i = p(w_i \mid \hat\theta) = \dfrac{c(w_i, d)}{\sum_{i=1}^{M} c(w_i, d)} = \dfrac{c(w_i, d)}{|d|}$
§ The ML estimate is simply the normalized word counts.
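A minimal sketch of this normalized-counts estimate, assuming whitespace tokenization (names are illustrative):

```python
from collections import Counter

def mle_unigram(doc):
    """ML estimate of a unigram LM: theta_i = c(w_i, d) / |d| (normalized counts)."""
    counts = Counter(doc.split())
    length = sum(counts.values())     # |d|
    return {w: c / length for w, c in counts.items()}

d = "text mining text data mining algorithm"
print(mle_unigram(d))   # p('text') = 2/6, p('mining') = 2/6, p('data') = 1/6, p('algorithm') = 1/6
```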
What Does the Topic Look Like?
[Estimated topic p(w|θ) from a text mining paper: the 0.031, a 0.018, …, text 0.04, mining 0.035, association 0.03, clustering 0.005, computer 0.0009, …, food 0.000001, …]

Can we get rid of these common words?
Factoring out Background Words
[Same estimated topic p(w|θ) from the text mining paper: the 0.031, a 0.018, …, text 0.04, mining 0.035, association 0.03, clustering 0.005, computer 0.0009, …, food 0.000001, …]

How can we get rid of these common words?
Generate d Using Two Word Distributions
§ Topic θ_d (the topic of the text mining paper), P(w|θ_d): text 0.04, mining 0.035, association 0.03, clustering 0.005, …, the 0.000001, …
§ Background topic θ_B, p(w|θ_B): the 0.03, a 0.02, is 0.015, we 0.01, food 0.003, …, text 0.000006, …
§ Topic choice: P(θ_d) = 0.5, P(θ_B) = 0.5, with p(θ_d) + p(θ_B) = 1
What’s the probability of observing a word w?
§ Topic θ_d, P(w|θ_d): text 0.04, mining 0.035, association 0.03, clustering 0.005, …, the 0.000001, …
§ Background θ_B, p(w|θ_B): the 0.03, a 0.02, is 0.015, we 0.01, food 0.003, …, text 0.000006, …
§ Topic choice: P(θ_d) = 0.5, P(θ_B) = 0.5, with p(θ_d) + p(θ_B) = 1

P("the") = p(θ_d)·p("the"|θ_d) + p(θ_B)·p("the"|θ_B) = 0.5 × 0.000001 + 0.5 × 0.03
P("text") = p(θ_d)·p("text"|θ_d) + p(θ_B)·p("text"|θ_B) = 0.5 × 0.04 + 0.5 × 0.000006
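A small sketch of this two-component word probability, using the toy values from the slide (variable names are illustrative):

```python
# Two-component mixture: p(w) = p(theta_d) p(w|theta_d) + p(theta_B) p(w|theta_B)
p_theta_d, p_theta_B = 0.5, 0.5
p_w_given_d = {"text": 0.04, "the": 0.000001}     # topic theta_d (toy values from the slide)
p_w_given_B = {"text": 0.000006, "the": 0.03}     # background theta_B

def p_word(w):
    return p_theta_d * p_w_given_d[w] + p_theta_B * p_w_given_B[w]

print(p_word("the"))    # 0.5*0.000001 + 0.5*0.03     = 0.0150005
print(p_word("text"))   # 0.5*0.04     + 0.5*0.000006 = 0.020003
```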
The Idea of a Mixture Model
[Diagram: a word w is generated by first choosing a component (θ_d with probability 0.5, θ_B with probability 0.5, p(θ_d) + p(θ_B) = 1), then sampling w (e.g., "text" or "the") from the chosen distribution]

As a generative model, the mixture formally defines: p(w) = p(θ_d)·p(w|θ_d) + p(θ_B)·p(w|θ_B)

What if p(θ_d) = 1 or p(θ_B) = 1? Estimating the model "discovers" two topics plus the topic coverage.
Mixture of Two Unigram Language Models
§ Data: document d
§ Mixture model parameters: Λ = ({p(w|θ_d)}, {p(w|θ_B)}, p(θ_B), p(θ_d))
§ Two unigram LMs: θ_d (the topic of d); θ_B (background topic)
§ Mixing weight (topic choice): p(θ_d) + p(θ_B) = 1
§ Likelihood function:
$p(d \mid \Lambda) = \prod_{i=1}^{|d|} p(x_i \mid \Lambda) = \prod_{i=1}^{|d|} [\,p(\theta_d)p(x_i \mid \theta_d) + p(\theta_B)p(x_i \mid \theta_B)\,] = \prod_{i=1}^{M} [\,p(\theta_d)p(w_i \mid \theta_d) + p(\theta_B)p(w_i \mid \theta_B)\,]^{c(w_i, d)}$
§ ML estimate: $\Lambda^* = \arg\max_{\Lambda} p(d \mid \Lambda)$
subject to $\sum_{i=1}^{M} p(w_i \mid \theta_d) = \sum_{i=1}^{M} p(w_i \mid \theta_B) = 1$ and $p(\theta_d) + p(\theta_B) = 1$.
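A small sketch of this log-likelihood, assuming the two word distributions are given as (possibly truncated) dictionaries and a tiny floor probability is used for unseen words (names and the floor value are assumptions):

```python
import math
from collections import Counter

def mixture_log_likelihood(doc, p_d, p_w_d, p_w_B, floor=1e-12):
    """log p(d|Lambda) = sum_w c(w,d) * log[ p(theta_d) p(w|theta_d) + p(theta_B) p(w|theta_B) ]."""
    p_B = 1.0 - p_d
    counts = Counter(doc.split())
    return sum(c * math.log(p_d * p_w_d.get(w, floor) + p_B * p_w_B.get(w, floor))
               for w, c in counts.items())

theta_d = {"text": 0.04, "mining": 0.035, "the": 0.000001}   # truncated toy distributions
theta_B = {"the": 0.03, "a": 0.02, "text": 0.000006}
print(mixture_log_likelihood("text mining the text", p_d=0.5, p_w_d=theta_d, p_w_B=theta_B))
```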
Back to Factoring out Background Words
§ Topic θ_d, P(w|θ_d): text 0.04, mining 0.035, association 0.03, clustering 0.005, …, the 0.000001, …
§ Background θ_B, p(w|θ_B): the 0.03, a 0.02, is 0.015, we 0.01, food 0.003, …, text 0.000006, …
§ Topic choice: P(θ_d) = 0.5, P(θ_B) = 0.5, with p(θ_d) + p(θ_B) = 1

[Text mining paper d: "… text mining... is… clustering… we…. text.. the …"]
Estimation of One Topic: P(w|θ_d)
§ Topic θ_d to estimate, P(w|θ_d): text ?, mining ?, association ?, clustering ?, …, the ?
§ Background θ_B (known), p(w|θ_B): the 0.03, a 0.02, is 0.015, we 0.01, food 0.003, …, text 0.000006, …
§ Topic choice: P(θ_d) = 0.5, P(θ_B) = 0.5, with p(θ_d) + p(θ_B) = 1

[Text mining paper d: "… text mining... is… clustering… we…. text.. the …"]

Adjust θ_d to maximize p(d|Λ) (all other parameters are known). Would the ML estimate demote background words in θ_d?
Behavior of a Mixture Model
§ Toy document d = "text the"; θ_B is known: p("the"|θ_B) = 0.9, p("text"|θ_B) = 0.1; P(θ_d) = P(θ_B) = 0.5; the unknowns are p("text"|θ_d) and p("the"|θ_d)
§ Likelihood:
P("text") = p(θ_d)·p("text"|θ_d) + p(θ_B)·p("text"|θ_B) = 0.5·p("text"|θ_d) + 0.5·0.1
P("the") = 0.5·p("the"|θ_d) + 0.5·0.9
p(d|Λ) = p("text"|Λ) · p("the"|Λ) = [0.5·p("text"|θ_d) + 0.5·0.1] × [0.5·p("the"|θ_d) + 0.5·0.9]
§ How can we set p("text"|θ_d) and p("the"|θ_d) to maximize it? Note that p("text"|θ_d) + p("the"|θ_d) = 1.
"Collaboration" and "Competition" of θ_d and θ_B
§ d = "text the"; p("the"|θ_B) = 0.9, p("text"|θ_B) = 0.1; P(θ_d) = P(θ_B) = 0.5
§ p(d|Λ) = p("text"|Λ) · p("the"|Λ) = [0.5·p("text"|θ_d) + 0.5·0.1] × [0.5·p("the"|θ_d) + 0.5·0.9]
§ Note that p("text"|θ_d) + p("the"|θ_d) = 1
§ If x + y = constant, then xy reaches its maximum when x = y:
0.5·p("text"|θ_d) + 0.5·0.1 = 0.5·p("the"|θ_d) + 0.5·0.9 ⇒ p("text"|θ_d) = 0.9 >> p("the"|θ_d) = 0.1
Behavior 1: if p(w1|θ_B) > p(w2|θ_B), then p(w1|θ_d) < p(w2|θ_d)
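A short worked step spelling out the arithmetic behind the claim above, writing x = p("text"|θ_d) and y = p("the"|θ_d):

$p(d \mid \Lambda) = (0.5x + 0.05)(0.5y + 0.45), \qquad x + y = 1.$
The two factors sum to $0.5(x + y) + 0.5 = 1$, a constant, so the product is largest when the two factors are equal:
$0.5x + 0.05 = 0.5y + 0.45 \;\Rightarrow\; x - y = 0.8 \;\Rightarrow\; x = p(\text{"text"} \mid \theta_d) = 0.9,\quad y = p(\text{"the"} \mid \theta_d) = 0.1.$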
Response to Data Frequency
§ d = "text the":
p(d|Λ) = [0.5·p("text"|θ_d) + 0.5·0.1] × [0.5·p("the"|θ_d) + 0.5·0.9]  ⇒  p("text"|θ_d) = 0.9 >> p("the"|θ_d) = 0.1
§ What if we generate more "the"? d' = "text the the the … the":
p(d'|Λ) = [0.5·p("text"|θ_d) + 0.5·0.1] × [0.5·p("the"|θ_d) + 0.5·0.9] × [0.5·p("the"|θ_d) + 0.5·0.9] × [0.5·p("the"|θ_d) + 0.5·0.9] × …
§ What's the optimal solution now? p("the"|θ_d) > 0.1, or p("the"|θ_d) < 0.1?
§ Behavior 2: higher-frequency words get higher p(w|θ_d)
Estimation of One Topic: P(w|θ_d)
§ Topic θ_d to estimate, P(w|θ_d): text ?, mining ?, association ?, clustering ?, …, the ?
§ Background θ_B (known), p(w|θ_B): the 0.03, a 0.02, is 0.015, we 0.01, food 0.003, …, text 0.000006, …
§ Topic choice: P(θ_d) = 0.5, P(θ_B) = 0.5, with p(θ_d) + p(θ_B) = 1

[Text mining paper d: "… text mining... is… clustering… we…. text.. the …"]

How to set θ_d to maximize p(d|Λ)? (All other parameters are known.)
If we know which word is from which distribution…
§ Topic θ_d to estimate, P(w|θ_d): text ?, mining ?, association ?, clustering ?, …, the ?
§ Background θ_B (known), p(w|θ_B): the 0.03, a 0.02, is 0.015, we 0.01, food 0.003, …, text 0.000006, …
§ Topic choice: P(θ_d) = 0.5, P(θ_B) = 0.5, with p(θ_d) + p(θ_B) = 1

[Text mining paper d: "… text mining... is… clustering… we…. text.. the …"]

If we knew which words came from θ_d, the estimate would again be normalized counts over those words (d' = the part of d generated by θ_d):
$p(w_i \mid \theta_d) = \dfrac{c(w_i, d')}{\sum_{w' \in V} c(w', d')}$
Infer the Distribution a Word Is From…
§ Topic θ_d, P(w|θ_d): text 0.04, mining 0.035, association 0.03, clustering 0.005, …, the 0.000001, …
§ Background θ_B, p(w|θ_B): the 0.03, a 0.02, is 0.015, we 0.01, food 0.003, …, text 0.000006, …
§ Topic choice: P(θ_d) = 0.5, P(θ_B) = 0.5, with p(θ_d) + p(θ_B) = 1
§ Is "text" more likely from θ_d (z = 0) or from θ_B (z = 1)? Compare p(θ_d)·p("text"|θ_d) with p(θ_B)·p("text"|θ_B); by Bayes' rule:
$p(z = 0 \mid w = \text{"text"}) = \dfrac{p(\theta_d)\, p(\text{"text"} \mid \theta_d)}{p(\theta_d)\, p(\text{"text"} \mid \theta_d) + p(\theta_B)\, p(\text{"text"} \mid \theta_B)}$
The Expectation-Maximization (EM) Algorithm
§ Hidden variable: z ∈ {0, 1} (z = 0: the word is generated from θ_d; z = 1: from θ_B)
[Example: "the paper presents a text mining algorithm for clustering ...", each word carrying a hidden z value]
§ Initialize p(w|θ_d) with random values; then iteratively improve it using the E-step and M-step; stop when the likelihood doesn't change.
§ E-step (how likely w is from θ_d):
$p^{(n)}(z = 0 \mid w) = \dfrac{p(\theta_d)\, p^{(n)}(w \mid \theta_d)}{p(\theta_d)\, p^{(n)}(w \mid \theta_d) + p(\theta_B)\, p(w \mid \theta_B)}$
§ M-step:
$p^{(n+1)}(w \mid \theta_d) = \dfrac{c(w, d)\, p^{(n)}(z = 0 \mid w)}{\sum_{w' \in V} c(w', d)\, p^{(n)}(z = 0 \mid w')}$
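A minimal, self-contained sketch of this E-step/M-step loop for one topic plus a known background, assuming p(θ_d) = p(θ_B) = 0.5 by default; function and variable names are illustrative.

```python
import math

def em_one_topic(word_counts, p_w_B, p_d=0.5, max_iter=100, tol=1e-8):
    """Estimate p(w | theta_d) for one unknown topic mixed with a known background theta_B.

    word_counts: dict w -> c(w, d); p_w_B: dict w -> p(w | theta_B) (known);
    p_d: p(theta_d), with p(theta_B) = 1 - p_d.
    """
    p_B = 1.0 - p_d
    vocab = list(word_counts)
    p_w_d = {w: 1.0 / len(vocab) for w in vocab}   # uniform init (random values also work)
    prev_ll = None
    for _ in range(max_iter):
        # E-step: p(z=0|w) = p_d*p(w|theta_d) / [p_d*p(w|theta_d) + p_B*p(w|theta_B)]
        p_z0 = {w: p_d * p_w_d[w] / (p_d * p_w_d[w] + p_B * p_w_B[w]) for w in vocab}
        # M-step: p(w|theta_d) proportional to c(w,d) * p(z=0|w), normalized over V
        norm = sum(word_counts[w] * p_z0[w] for w in vocab)
        p_w_d = {w: word_counts[w] * p_z0[w] / norm for w in vocab}
        # Stop when the log-likelihood no longer changes
        ll = sum(c * math.log(p_d * p_w_d[w] + p_B * p_w_B[w]) for w, c in word_counts.items())
        if prev_ll is not None and abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return p_w_d

# Toy counts from the in-class practice example below:
counts = {"the": 4, "paper": 2, "text": 4, "mining": 2}
background = {"the": 0.5, "paper": 0.3, "text": 0.1, "mining": 0.1}
print(em_one_topic(counts, background))
```

Run on these toy counts, "text" and "mining" end up carrying most of the probability mass in θ_d, while "the" is demoted relative to its raw frequency.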
EM Computation in Action (In class practice)
Word     #   p(w|θ_B)   Iteration 1           Iteration 2           Iteration 3
                        P(w|θ)   p(z=0|w)     P(w|θ)   p(z=0|w)     P(w|θ)   p(z=0|w)
The      4   0.5        0.25
Paper    2   0.3        0.25
Text     4   0.1        0.25
Mining   2   0.1        0.25
Log-Likelihood

Assume p(θ_d) = p(θ_B) = 0.5 and p(w|θ_B) is known.
E-step: $p^{(n)}(z = 0 \mid w) = \dfrac{p(\theta_d)\, p^{(n)}(w \mid \theta_d)}{p(\theta_d)\, p^{(n)}(w \mid \theta_d) + p(\theta_B)\, p(w \mid \theta_B)}$
M-step: $p^{(n+1)}(w \mid \theta_d) = \dfrac{c(w, d)\, p^{(n)}(z = 0 \mid w)}{\sum_{w' \in V} c(w', d)\, p^{(n)}(z = 0 \mid w')}$
EM Computation in Action (In class practice)
Word     #   p(w|θ_B)   Iteration 1           Iteration 2           Iteration 3
                        P(w|θ)   p(z=0|w)     P(w|θ)   p(z=0|w)     P(w|θ)   p(z=0|w)
The      4   0.5        0.25     0.33         0.20     0.29         0.18     0.26
Paper    2   0.3        0.25     0.45         0.14     0.32         0.10     0.25
Text     4   0.1        0.25     0.71         0.44     0.81         0.50     0.93
Mining   2   0.1        0.25     0.71         0.22     0.69         0.22     0.69
Log-Likelihood: increasing across iterations

Assume p(θ_d) = p(θ_B) = 0.5 and p(w|θ_B) is known.
E-step: $p^{(n)}(z = 0 \mid w) = \dfrac{p(\theta_d)\, p^{(n)}(w \mid \theta_d)}{p(\theta_d)\, p^{(n)}(w \mid \theta_d) + p(\theta_B)\, p(w \mid \theta_B)}$
M-step: $p^{(n+1)}(w \mid \theta_d) = \dfrac{c(w, d)\, p^{(n)}(z = 0 \mid w)}{\sum_{w' \in V} c(w', d)\, p^{(n)}(z = 0 \mid w')}$
Document as a Sample of Mixed Topics
§ Topics θ_1, θ_2, …, θ_k and background θ_B, e.g.: (government 0.3, response 0.2, ...), (donate 0.1, relief 0.05, help 0.02, ...), (city 0.2, new 0.1, ...), and θ_B: (the 0.04, a 0.03, ...)

Blog article about "Hurricane Katrina": "[Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response] to the [flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated] … [Over seventy countries pledged monetary donations or other assistance]. …"

Many applications are possible if we can "decode" the topics in text…
Mining Multiple Topics from Text
[Diagram: text data (Doc 1, Doc 2, …, Doc N); topics as word distributions, e.g., sports 0.02, game 0.01, basketball 0.005, football 0.004, …; travel 0.05, attraction 0.03, trip 0.01, …; science 0.04, scientist 0.03, spaceship 0.006, …; per-document coverage π_i1, …, π_ik, e.g., 30%, 12%, 8%, with π_21 = 0% and π_N1 = 0%]

INPUT: C, k, V    OUTPUT: { θ_1, …, θ_k }, { π_i1, …, π_ik }
Generating Text with Multiple Topics: p(w)=?
§ Topics θ_1, θ_2, …, θ_k (e.g., government 0.3, response 0.2, ...; donate 0.1, relief 0.05, help 0.02, ...; city 0.2, new 0.1, ...) and background θ_B (the 0.04, a 0.03, ...)
§ Topic choice: p(θ_B) = λ_B; with probability 1 − λ_B choose one of the content topics, p(θ_j) = π_d,j, where $\sum_{j=1}^{k} \pi_{d,j} = 1$
§ Probability of generating word w:
$p(w) = \lambda_B\, p(w \mid \theta_B) + (1 - \lambda_B)\sum_{j=1}^{k} \pi_{d,j}\, p(w \mid \theta_j)$
Probabilistic Latent Semantic Analysis (PLSA)
$p_d(w) = \lambda_B\, p(w \mid \theta_B) + (1 - \lambda_B)\sum_{j=1}^{k} \pi_{d,j}\, p(w \mid \theta_j)$
$\log p(d) = \sum_{w \in V} c(w, d)\, \log\Big[\lambda_B\, p(w \mid \theta_B) + (1 - \lambda_B)\sum_{j=1}^{k} \pi_{d,j}\, p(w \mid \theta_j)\Big]$
$\log p(C \mid \Lambda) = \sum_{d \in C} \sum_{w \in V} c(w, d)\, \log\Big[\lambda_B\, p(w \mid \theta_B) + (1 - \lambda_B)\sum_{j=1}^{k} \pi_{d,j}\, p(w \mid \theta_j)\Big]$

§ λ_B: percentage of background words (known); θ_B: background LM (known); π_d,j: coverage of topic θ_j in doc d
§ Unknown parameters: Λ = ({π_d,j}, {θ_j}), j = 1, …, k
How many unknown parameters are there in total?
ML Parameter Estimation
$\log p(C \mid \Lambda) = \sum_{d \in C} \sum_{w \in V} c(w, d)\, \log\Big[\lambda_B\, p(w \mid \theta_B) + (1 - \lambda_B)\sum_{j=1}^{k} \pi_{d,j}\, p(w \mid \theta_j)\Big]$

$\Lambda^* = \arg\max_{\Lambda} p(C \mid \Lambda)$

Constrained optimization:
$\forall j \in [1, k]: \sum_{i=1}^{M} p(w_i \mid \theta_j) = 1$
$\forall d \in C: \sum_{j=1}^{k} \pi_{d,j} = 1$
EM Algorithm for PLSA: E-Step
§ Hidden variable (topic indicator): z_{d,w} ∈ {B, 1, 2, …, k}
§ Use of Bayes' rule:
Probability that w in doc d is generated from topic θ_j:
$p(z_{d,w} = j) = \dfrac{\pi_{d,j}^{(n)}\, p^{(n)}(w \mid \theta_j)}{\sum_{j'=1}^{k} \pi_{d,j'}^{(n)}\, p^{(n)}(w \mid \theta_{j'})}$
Probability that w in doc d is generated from the background θ_B:
$p(z_{d,w} = B) = \dfrac{\lambda_B\, p(w \mid \theta_B)}{\lambda_B\, p(w \mid \theta_B) + (1 - \lambda_B)\sum_{j=1}^{k} \pi_{d,j}^{(n)}\, p^{(n)}(w \mid \theta_j)}$
EM Algorithm for PLSA: M-Step
§ Hidden variable (topic indicator): z_{d,w} ∈ {B, 1, 2, …, k}
§ Re-estimated probability of doc d covering topic θ_j:
$\pi_{d,j}^{(n+1)} = \dfrac{\sum_{w \in V} c(w, d)\,(1 - p(z_{d,w} = B))\, p(z_{d,w} = j)}{\sum_{j'} \sum_{w \in V} c(w, d)\,(1 - p(z_{d,w} = B))\, p(z_{d,w} = j')}$
§ Re-estimated probability of word w for topic θ_j (an ML estimate based on the word counts "allocated" to topic θ_j):
$p^{(n+1)}(w \mid \theta_j) = \dfrac{\sum_{d \in C} c(w, d)\,(1 - p(z_{d,w} = B))\, p(z_{d,w} = j)}{\sum_{w' \in V} \sum_{d \in C} c(w', d)\,(1 - p(z_{d,w'} = B))\, p(z_{d,w'} = j)}$
Computation of the EM Algorithm
§ Initialize all unknown parameters randomly
§ Repeat until the likelihood converges:
§ E-step:
$p(z_{d,w} = j) \propto \pi_{d,j}^{(n)}\, p^{(n)}(w \mid \theta_j)$, normalized so that $\sum_{j=1}^{k} p(z_{d,w} = j) = 1$
$p(z_{d,w} = B) \propto \lambda_B\, p(w \mid \theta_B)$ (what's the normalizer for this one?)
§ M-step:
$\pi_{d,j}^{(n+1)} \propto \sum_{w \in V} c(w, d)\,(1 - p(z_{d,w} = B))\, p(z_{d,w} = j)$, normalized so that $\forall d \in C: \sum_{j=1}^{k} \pi_{d,j} = 1$
$p^{(n+1)}(w \mid \theta_j) \propto \sum_{d \in C} c(w, d)\,(1 - p(z_{d,w} = B))\, p(z_{d,w} = j)$, normalized so that $\forall j \in [1, k]: \sum_{w \in V} p(w \mid \theta_j) = 1$
In general, accumulate counts, and then normalize
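A compact sketch of this accumulate-then-normalize loop (PLSA with a fixed background LM), assuming small in-memory tokenized documents, random initialization, and a fixed number of iterations; all names and the toy corpus are illustrative.

```python
import random
from collections import Counter

def plsa_em(docs, k, lam_B, p_w_B, n_iter=50, seed=0):
    """Sketch of PLSA with a fixed background LM (weight lam_B) over tokenized docs.

    docs: list of token lists; p_w_B: dict w -> p(w | theta_B) covering the vocabulary.
    Returns (p_w_topic, pi): p_w_topic[j][w] = p(w | theta_j), pi[d][j] = pi_{d,j}.
    """
    rng = random.Random(seed)
    counts = [Counter(doc) for doc in docs]
    vocab = sorted({w for c in counts for w in c})

    def normalize(d):
        s = sum(d.values())
        return {key: v / s for key, v in d.items()}

    # Random initialization of the unknown parameters
    p_w_topic = [normalize({w: rng.random() for w in vocab}) for _ in range(k)]
    pi = [normalize({j: rng.random() for j in range(k)}) for _ in docs]

    for _ in range(n_iter):
        acc_topic = [dict.fromkeys(vocab, 0.0) for _ in range(k)]  # counts allocated to topics
        acc_pi = [dict.fromkeys(range(k), 0.0) for _ in docs]      # counts allocated per doc
        for d, cnt in enumerate(counts):
            for w, c in cnt.items():
                # E-step: p(z_{d,w} = j) and p(z_{d,w} = B) via Bayes' rule
                mix = [pi[d][j] * p_w_topic[j][w] for j in range(k)]
                s = sum(mix)
                p_z = [m / s for m in mix]                                      # p(z_{d,w}=j)
                p_zB = lam_B * p_w_B[w] / (lam_B * p_w_B[w] + (1 - lam_B) * s)  # p(z_{d,w}=B)
                # M-step accumulation: allocate c(w,d) to topics
                for j in range(k):
                    share = c * (1 - p_zB) * p_z[j]
                    acc_topic[j][w] += share
                    acc_pi[d][j] += share
        # M-step: accumulate counts, then normalize
        p_w_topic = [normalize(acc_topic[j]) for j in range(k)]
        pi = [normalize(acc_pi[d]) for d in range(len(docs))]
    return p_w_topic, pi

# Toy usage with an assumed uniform background LM:
docs = [["text", "mining", "text", "algorithm", "the"],
        ["travel", "hotel", "the", "trip", "travel"]]
vocab = {w for d in docs for w in d}
background = {w: 1 / len(vocab) for w in vocab}
topics, coverage = plsa_em(docs, k=2, lam_B=0.3, p_w_B=background)
print(coverage[0], coverage[1])
```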
Applications of Topic Models for Text Mining: Illustration with 2 Topics
§ Likelihood (two topics, mixing weight λ):
$p(d \mid \Lambda) = \prod_{w \in V} [\,\lambda\, p(w \mid \theta_1) + (1 - \lambda)\, p(w \mid \theta_2)\,]^{c(w, d)}$
$\log p(d \mid \Lambda) = \sum_{w \in V} c(w, d)\, \log[\,\lambda\, p(w \mid \theta_1) + (1 - \lambda)\, p(w \mid \theta_2)\,]$
§ Application scenarios (varying what is known and what must be estimated):
§ The doc is about text mining and food nutrition; how much of it is about text mining?
§ 30% of the doc is about text mining; what's the rest about?
§ The doc is about text mining; is it also about some other topic, and if so, what?
§ 30% of the doc is about one topic and 70% is about another; what are these two topics?
§ The doc is about two subtopics; find out what these two subtopics are and to what extent the doc covers each.
Use PLSA for Text Mining
§ PLSA would be able to generate
§ Topic coverage in each document: π_d,j
§ Word distribution for each topic: p(w|θ_j)
§ Topic assignment at the word level for each document
§ The number of topics must be given in advance
§ These probabilities can be used in many different ways
§ θ_j naturally serves as a word cluster
§ π_d,j can be used for document clustering, e.g., assign d to the cluster $j_d^* = \arg\max_j \pi_{d,j}$
§ Contextual text mining: make these parameters conditioned on context, e.g.,
§ p(θ_j | time), from which we can compute/plot p(time | θ_j)
§ p(θ_j | location), from which we can compute/plot p(loc | θ_j)
Sample Topics from TDT Corpus [Hofmann 99b]
How to Help Users Interpret a Topic Model? [Mei et al. 07b]
[Example topic (word distribution): term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independence 0.03, model 0.03, frequent 0.02, probabilistic 0.02, document 0.02, …; a possible human label: "Retrieval Models"]

Question: Can we automatically generate understandable labels for topics?

[Example topics shown only by their top words: "term, relevance, weight, feedback"; "insulin, foraging, foragers, collected, grains, loads, collection, nectar, …"]
What is a Good Label?
[A topic from [Mei & Zhai 06b]: term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …; candidate label: "Retrieval models"]
Automatic Labeling of Topics [Mei et al. 07b]
[Pipeline diagram:
Multinomial (statistical) topic models, e.g., term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, …
→ Candidate label pool extracted from the collection (context) with an NLP chunker and n-gram statistics, e.g., "database system, clustering algorithm, r tree, functional dependency, iceberg cube, concurrency control, index structure, …"
→ Step 1: Relevance score → ranked list (e.g., "clustering algorithm; distance measure; …")
→ Step 2: Re-ranking by coverage and discrimination]
Relevance: the Zero-Order Score
§ Intuition: prefer phrases that cover the topic's top words well
§ Latent topic θ, p(w|θ): clustering, dimensional, algorithm, birch, shape, …, body, …; e.g., p("clustering"|θ) = 0.4, p("dimensional"|θ) = 0.3, p("shape"|θ) = 0.01, p("body"|θ) = 0.001
§ Good label (l1): "clustering algorithm"; bad label (l2): "body shape"
§ The zero-order score compares how much probability mass θ puts on each label's words, e.g., p("clustering algorithm" | θ) > p("body shape" | θ)
Relevance: the First-Order Score
§ Intuition: prefer phrases whose context (word distribution) is similar to the topic's distribution
§ Topic θ, P(w|θ): clustering, dimension, partition, algorithm, hash, …
§ Good label (l1): "clustering algorithm", with context distribution in the collection (SIGMOD Proceedings), p(w | "clustering algorithm"): clustering, hash, dimension, algorithm, partition, …
§ Bad label (l2): "hash join", with context distribution p(w | "hash join"): clustering, hash, dimension, key, algorithm, …
§ Compare divergences: D(θ | "clustering algorithm") < D(θ | "hash join")
§ $\mathrm{Score}(l, \theta) \propto \sum_{w} p(w \mid \theta)\, \mathrm{PMI}(w, l \mid C)$
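A small sketch of this first-order score, assuming PMI(w, l | C) values have already been estimated from co-occurrence counts in the collection C (all names and toy numbers are illustrative):

```python
def first_order_relevance(label, p_w_topic, pmi):
    """Score(l, theta) proportional to sum_w p(w|theta) * PMI(w, l | C)."""
    return sum(p_w * pmi.get((w, label), 0.0) for w, p_w in p_w_topic.items())

# Toy inputs: a topic distribution and (word, label) PMI values from the collection.
topic = {"clustering": 0.4, "dimension": 0.3, "algorithm": 0.2, "hash": 0.1}
pmi = {("clustering", "clustering algorithm"): 2.0, ("algorithm", "clustering algorithm"): 1.8,
       ("hash", "hash join"): 2.5, ("clustering", "hash join"): 0.2}
print(first_order_relevance("clustering algorithm", topic, pmi))
print(first_order_relevance("hash join", topic, pmi))
```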
Results: Sample Topic Labels
[Sample topics (top words) with generated labels:
- sampling 0.06, estimation 0.04, approximate 0.04, histograms 0.03, selectivity 0.03, histogram 0.02, answers 0.02, accurate 0.02 → "selectivity estimation; …"
- tree 0.09, trees 0.08, spatial 0.08, b 0.05, r 0.04, disk 0.02, array 0.01, cache 0.01 → "r tree; b tree; …" (indexing methods)
- north 0.02, case 0.01, trial 0.01, iran 0.01, documents 0.01, walsh 0.009, reagan 0.009, charges 0.007 → "iran contra; …"
- the, of, a, and, to, data (> 0.02), …, clustering 0.02, time 0.01, clusters 0.01, databases 0.01, large 0.01, performance 0.01, quality 0.005 → "clustering algorithm; clustering structure; …" (also: "large data, data quality, high data, data application, …")]
Results: Context-Sensitive Labeling
[Topic 1 (sampling, estimation, approximation, histogram, selectivity, histograms, …):
- Context: Database (SIGMOD Proceedings) → "selectivity estimation; random sampling; approximate answers; …"
- Context: IR (SIGIR Proceedings) → "distributed retrieval; parameter estimation; mixture models; …"
Topic 2 (dependencies, functional, cube, multivalued, iceberg, buc, …):
- Context: Database → "multivalue dependency; functional dependency; iceberg cube; …"
- Context: IR → "term dependency; independence assumption; …"]
Using PLSA to Discover Temporal Topic Trends [Mei & Zhai 05]
[Plot: normalized strength of themes over time (1999-2004); themes include Biology Data, Web Information, Time Series, Classification, Association Rule, Clustering, Business]
[Example theme word distributions: gene 0.0173, expressions 0.0096, probability 0.0081, microarray 0.0038, …; marketing 0.0087, customer 0.0086, model 0.0079, business 0.0048, …; rules 0.0142, association 0.0064, support 0.0053, …]
Use PLSA to Integrate Opinions [Lu & Zhai 08]
[Diagram. Input: topic "iPod"; an expert review with aspects (Design, Battery, Price); a text collection of opinions, e.g., weblogs.
Output: an integrated summary organized by the review aspects, with similar opinions aligned to them (Design: "cute… tiny… ..thicker..", Battery: "last many hrs", "die out soon", Price: "could afford it", "still expensive"), plus supplementary opinions on extra aspects not in the expert review ("iTunes … easy to use…", "warranty …better to extend..")]