复旦大学大数据学院
School of Data Science, Fudan University
DATA130006 Text Management and Analysis
Text Clustering
魏忠钰
October 18th, 2017
Adapted from UIUC CS410
What Is Text Clustering?
§ Discover "natural structure"
§ Group similar objects together
§ Objects can be documents, terms, passages, websites, …
Not well defined! What does “similar” mean?
The “Clustering Bias”
§ Any two objects can be similar, depending on how you look at them!
§ Are "car" and "horse" similar?
§ A user must define the perspective (i.e., a "bias") for assessing similarity!
Basis for evaluation
Examples of Text Clustering
§ Clustering of documents in the whole collection
§ Term clustering to define a "concept"/"theme"/"topic"
§ Clustering of passages/sentences or any selected text segments from larger text objects
§ Clustering of websites (where a text object has multiple documents)
§ Text clusters can be further clustered to generate a hierarchy
Why Text Clustering?
§ In general, very useful for text mining and exploratory text analysis:
§ Get a sense of the overall content of a collection (e.g., what are some of the "typical"/representative documents in the collection?)
§ Link (similar) text objects (e.g., removing duplicated content)
§ Create a structure on the text data (e.g., for browsing)
§ As a way to induce additional features (i.e., clusters) for classification of text objects
§ Examples of applications:
§ Clustering of search results
§ Understanding major complaints in emails from customers
Topic Mining Revisited
[Figure: k topics, each a word distribution — θ1: sports 0.02, game 0.01, basketball 0.005, football 0.004, …; θ2: science 0.04, scientist 0.03, spaceship 0.006, …; θk: travel 0.05, attraction 0.03, trip 0.01, … — covering Doc 1, Doc 2, …, Doc N with coverage probabilities πij (e.g., 30%, 12%, 8%; π21 = 0%, πN1 = 0%)]
INPUT: C, k, V
OUTPUT: {θ1, …, θk}, {πi1, …, πik}
One Topic(=cluster) Per Document
[Figure: the same k topic word distributions, but each document is generated from a single topic: π11 = 100%, π12 = … = π1k = 0; π21 = 0%, π22 = 100%, …; πN1 = 100%, πN2 = … = πNk = 0]
INPUT: C, k, V
OUTPUT: {θ1, …, θk}, {c1, …, cN}, ci ∈ [1, k]
Mining One Topic Revisited
[Figure: a single document d generated entirely (100%) from one topic θ — a word distribution P(w|θ) over text, mining, association, database, query, … to be estimated]
INPUT: C = {d}, V
OUTPUT: {θ}
(1 Doc, 1 Topic) → (N Docs, N Topics) → with k < N: (N Docs, k Shared Topics) = Clustering!
What Generative Model Can Do Clustering?
[Figure: the same one-topic-per-document setting as on the previous slide]
INPUT: C, k, V
OUTPUT: {θ1, …, θk}, {c1, …, cN}, ci ∈ [1, k]
How can we force every document to be generated using one topic (instead of k topics)?
Generative Topic Model Revisited
[Figure: two word distributions — θ1: text 0.04, mining 0.035, association 0.03, clustering 0.005, …, the 0.000001; θ2: the 0.03, a 0.02, is 0.015, we 0.01, food 0.003, …, text 0.000006 — with topic choice p(θ1) = p(θ2) = 0.5, p(θ1) + p(θ2) = 1; each word w ("text"? "the"?) is generated by first making a topic choice]
Why can’t this model be used for clustering?
Mixture Model for Document Clustering
[Figure: the same two word distributions P(w|θ1) and P(w|θ2) with topic choice p(θ1) = p(θ2) = 0.5, p(θ1) + p(θ2) = 1, but now the topic is chosen once per document d = x1 x2 … xL, and all L words are generated from that single topic]
Difference from the topic model? What if p(θ1) = 1?
Likelihood Function: p(d) = ?   d = x1 x2 … xL
p(d) = p(θ1) p(d|θ1) + p(θ2) p(d|θ2)
     = p(θ1) ∏_{i=1}^{L} p(xi|θ1) + p(θ2) ∏_{i=1}^{L} p(xi|θ2)
How is this different from a topic model?
topic model: p(d) = ∏_{i=1}^{L} [ p(θ1) p(xi|θ1) + p(θ2) p(xi|θ2) ]
How can we generalize it to include k topics/clusters?
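To make the contrast concrete, here is a minimal Python sketch (ours, not the lecture's) computing both likelihoods for a toy two-topic example; the word probabilities are excerpts from the distributions shown above, and the three-word document is made up.

```python
# Sketch: mixture-for-clustering vs. topic-model likelihood on a toy example.
from math import prod

p_theta = [0.5, 0.5]                      # p(theta_1), p(theta_2)
p_w = [{"text": 0.04, "the": 0.000001},   # p(w | theta_1), slide excerpt
       {"text": 0.000006, "the": 0.03}]   # p(w | theta_2), slide excerpt
doc = ["text", "text", "the"]             # d = x_1 x_2 ... x_L (made up)

# Clustering mixture: choose ONE topic, then generate ALL words from it.
#   p(d) = sum_i p(theta_i) * prod_j p(x_j | theta_i)
p_cluster = sum(pt * prod(pw[x] for x in doc) for pt, pw in zip(p_theta, p_w))

# Topic model: make a fresh topic choice PER WORD.
#   p(d) = prod_j sum_i p(theta_i) * p(x_j | theta_i)
p_topic = prod(sum(pt * pw[x] for pt, pw in zip(p_theta, p_w)) for x in doc)

print(p_cluster, p_topic)  # sum-of-products vs. product-of-sums
```

The only difference is where the sum over topics sits relative to the product over words; moving it outside the product is exactly what forces one topic per document.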
Mixture Model for Document Clustering
§ Data: a collection of documents C = {d1, …, dN}
§ Model: a mixture of k unigram LMs: Λ = ({θi}; {p(θi)}), i ∈ [1, k]
§ To generate a document, first choose a θi according to p(θi), and then generate all words in the document using p(w|θi)
§ Likelihood:
p(d|Λ) = ∑_{i=1}^{k} p(θi) ∏_{j=1}^{|d|} p(xj|θi) = ∑_{i=1}^{k} p(θi) ∏_{w∈V} p(w|θi)^{c(w,d)}
§ Maximum Likelihood estimate: Λ* = argmax_Λ p(d|Λ)
Cluster Allocation After Parameter Estimation
§ Parameters of the mixture model: Λ = ({θi}; {p(θi)}), i ∈ [1, k]
§ Each θi represents the content of cluster i: p(w|θi)
§ p(θi) indicates the size of cluster i
§ Which cluster should document d belong to? cd = ?
§ Likelihood only: assign d to the cluster corresponding to the topic θi most likely to have generated d: cd = argmax_i p(d|θi)
§ Likelihood + prior p(θi) (Bayesian): favor large clusters: cd = argmax_i p(θi) p(d|θi)
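A small sketch of this allocation rule (our own helper, not from the slides), working in log space so that long documents do not underflow:

```python
import math

def assign_cluster(d, p_theta, p_w, use_prior=True):
    """d: {word: count}. Returns c_d = argmax_i [p(theta_i)] * p(d|theta_i)."""
    def log_score(i):
        s = math.log(p_theta[i]) if use_prior else 0.0  # prior favors large clusters
        return s + sum(c * math.log(p_w[i][w]) for w, c in d.items())
    return max(range(len(p_theta)), key=log_score)
```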
How Can We Compute the ML Estimate?
§ Data: a collection of documents C = {d1, …, dN}
§ Model: a mixture of k unigram LMs: Λ = ({θi}; {p(θi)}), i ∈ [1, k]
§ To generate a document, first choose a θi according to p(θi), and then generate all words in the document using p(w|θi)
§ Likelihood:
p(d|Λ) = ∑_{i=1}^{k} p(θi) ∏_{w∈V} p(w|θi)^{c(w,d)}
p(C|Λ) = ∏_{j=1}^{N} p(dj|Λ)
§ Maximum Likelihood estimate: Λ* = argmax_Λ p(C|Λ)
EM Algorithm for Document Clustering
§ Initialization: randomly set Λ = ({θi}; {p(θi)}), i ∈ [1, k]
§ Repeat until the likelihood p(C|Λ) converges
§ E-Step: infer which distribution has been used to generate document d (hidden variable Zd ∈ [1, k]):
p^(n)(Zd = i | d) ∝ p^(n)(θi) ∏_{w∈V} p^(n)(w|θi)^{c(w,d)},   with ∑_{i=1}^{k} p^(n)(Zd = i | d) = 1
§ M-Step: re-estimate all parameters:
p^(n+1)(θi) ∝ ∑_{j=1}^{N} p^(n)(Zdj = i | dj),   with ∑_{i=1}^{k} p^(n+1)(θi) = 1
p^(n+1)(w|θi) ∝ ∑_{j=1}^{N} c(w, dj) p^(n)(Zdj = i | dj),   with ∑_{w∈V} p^(n+1)(w|θi) = 1, ∀ i ∈ [1, k]
EM Algorithm for Document Clustering
§ Initialization: Λ = ({θi}; {p(θi)}), i ∈ [1, k]
§ E-Step: compute p(Zd = i | d)
§ M-Step: re-estimate all parameters Λ = ({θi}; {p(θi)})
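A compact Python sketch of these updates, assuming documents are given as word-count dicts; the function and variable names (em_cluster, docs, vocab, n_iter) are our own:

```python
import random

def em_cluster(docs, vocab, k, n_iter=50, seed=0):
    """docs: list of {word: count} dicts; vocab: iterable of all words."""
    rng = random.Random(seed)
    vocab = list(vocab)
    # Initialization: uniform p(theta_i), random normalized p(w|theta_i).
    p_theta = [1.0 / k] * k
    p_w = []
    for _ in range(k):
        raw = {w: rng.random() + 1e-3 for w in vocab}
        z = sum(raw.values())
        p_w.append({w: v / z for w, v in raw.items()})
    post = []
    for _ in range(n_iter):
        # E-step: p(Z_d = i | d) ∝ p(theta_i) * prod_w p(w|theta_i)^c(w,d)
        post = []
        for d in docs:
            scores = []
            for i in range(k):
                s = p_theta[i]
                for w, c in d.items():
                    s *= p_w[i][w] ** c   # may underflow on long docs; see below
                scores.append(s)
            z = sum(scores) or 1.0
            post.append([s / z for s in scores])
        # M-step: re-estimate p(theta_i) and p(w|theta_i) from the posteriors.
        p_theta = [sum(p[i] for p in post) / len(docs) for i in range(k)]
        for i in range(k):
            counts = {w: sum(p[i] * d.get(w, 0) for p, d in zip(post, docs))
                      for w in vocab}
            z = sum(counts.values()) or 1.0
            p_w[i] = {w: c / z for w, c in counts.items()}
    return p_theta, p_w, post
```

Cluster labels then follow from the final E-step posteriors: cd = argmax_i p(Zd = i | d).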
An Example of 2 Clusters
Random Initialization: p(θ1) = p(θ2) = 0.5
E-step hidden variables: Zd ∈ {1, 2}

w        p(w|θ1)  p(w|θ2)  c(w,d)
text     0.5      0.1      2
mining   0.2      0.1      2
medical  0.2      0.75     0
health   0.1      0.05     0

For document d:
p(Zd = 1 | d) = p(θ1) p("text"|θ1)² p("mining"|θ1)² / [ p(θ1) p("text"|θ1)² p("mining"|θ1)² + p(θ2) p("text"|θ2)² p("mining"|θ2)² ]
             = (0.5 × 0.5² × 0.2²) / (0.5 × 0.5² × 0.2² + 0.5 × 0.1² × 0.1²) = 100/101
p(Zd = 2 | d) = ?
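Checking the arithmetic in Python (values copied from the table above; the exponents are the word counts):

```python
p1 = 0.5 * 0.5**2 * 0.2**2   # p(theta_1) * p(text|theta_1)^2 * p(mining|theta_1)^2
p2 = 0.5 * 0.1**2 * 0.1**2   # p(theta_2) * p(text|theta_2)^2 * p(mining|theta_2)^2
print(p1 / (p1 + p2))        # p(Z_d = 1 | d) = 100/101 ≈ 0.9901
print(p2 / (p1 + p2))        # p(Z_d = 2 | d) = 1/101   ≈ 0.0099
```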
Normalization to Avoid Underflow
w        p(w|θ1)  p(w|θ2)  p̄(w)
text     0.5      0.1      (0.5 + 0.1)/2
mining   0.2      0.1      (0.2 + 0.1)/2
medical  0.2      0.75     (0.2 + 0.75)/2
health   0.1      0.05     (0.1 + 0.05)/2

p(Zd = 1 | d) = p(θ1) [p("text"|θ1)/p̄("text")]² [p("mining"|θ1)/p̄("mining")]² / { p(θ1) [p("text"|θ1)/p̄("text")]² [p("mining"|θ1)/p̄("mining")]² + p(θ2) [p("text"|θ2)/p̄("text")]² [p("mining"|θ2)/p̄("mining")]² }

The average of the p(w|θi), p̄(w), is a possible normalizer: dividing numerator and denominator by the same ∏w p̄(w)^c(w,d) keeps the products near 1 without changing the posterior.
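A sketch of this trick in Python (the function name is our own; working in log space, as in the allocation sketch earlier, is a common alternative):

```python
def e_step_normalized(d, p_theta, p_w):
    """d: {word: count}. Divides each p(w|theta_i) by the average p̄(w)."""
    k = len(p_theta)
    avg = {w: sum(p_w[i][w] for i in range(k)) / k for w in d}
    scores = []
    for i in range(k):
        s = p_theta[i]
        for w, c in d.items():
            s *= (p_w[i][w] / avg[w]) ** c  # ratios stay near 1: no underflow
        scores.append(s)
    z = sum(scores)
    return [s / z for s in scores]          # posterior unchanged: p̄ cancels
```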
Summary of Generative Model for Clustering
§ A slight variation of the topic model can be used for clustering documents
§ Each cluster is represented by a unigram LM p(w|θi) → a term cluster
§ A document is generated by first choosing a unigram LM and then generating ALL words in the document using this single LM
§ Estimated model parameters give both a topic characterization of each cluster and a probabilistic assignment of a document to each cluster
§ EM algorithm can be used to compute the ML estimate
§ Normalization is often needed to avoid underflow
Hard vs. soft clustering
§ Hard clustering: each document belongs to exactly one cluster
§ Achieved by forcing a document into the cluster corresponding to the unigram LM most likely used to generate it
§ Soft clustering: a document can belong to more than one cluster
§ E.g., you may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes
Other Clustering Algorithms
§ Flat algorithms
§ Usually start with a random (partial) partitioning
§ Refine it iteratively
§ E.g., K-means clustering
§ Hierarchical algorithms
§ Bottom-up
§ Top-down
K-Means
§ Assumes documents are real-valued vectors.
§ Clusters based on centroids (aka the center of gravity or mean) of points in a cluster c:
µ(c) = (1/|c|) ∑_{x∈c} x
§ Reassignment of instances to clusters is based on distance to the current cluster centroids.
K-Means Algorithm
Select K random docs {s1, s2, …, sK} as seeds.
Until clustering converges (or another stopping criterion is met):
    For each doc di:
        Assign di to the cluster cj such that dist(xi, sj) is minimal.
    (Next, update the seeds to the centroid of each cluster:)
    For each cluster cj:
        sj = µ(cj)
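A minimal NumPy sketch of this loop (ours, not the lecture's), using Euclidean distance and the "centroids unchanged" stopping criterion:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """X: (n_docs, n_dims) array of doc vectors."""
    rng = np.random.default_rng(seed)
    seeds = X[rng.choice(len(X), size=k, replace=False)]  # K random docs as seeds
    for _ in range(n_iter):
        # Assign each doc to the cluster with the nearest current centroid.
        dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each seed to the centroid of its cluster (keep empty ones).
        new_seeds = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else seeds[j] for j in range(k)])
        if np.allclose(new_seeds, seeds):  # centroids unchanged -> converged
            break
        seeds = new_seeds
    return labels, seeds
```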
K Means Example (K=2)
[Figure: 2-D point set — pick seeds, reassign clusters, compute centroids (×), reassign clusters, compute centroids, reassign clusters, converged!]
Termination conditions
§ Several possibilities, e.g.,
§ A fixed number of iterations.
§ Doc partition unchanged.
§ Centroid positions don't change.
Does this mean that the docs in a cluster are unchanged?
Seed Choice
§ Results can vary based on random seed selection.
§ Some seeds can result in a poor convergence rate, or in convergence to sub-optimal clusterings.
§ Select good seeds using a heuristic (e.g., the doc least similar to any existing mean)
§ Try out multiple starting points
§ Initialize with the results of another method
§ Example showing sensitivity to seeds: if you start with B and E as centroids, you converge to {A, B, C} and {D, E, F}; if you start with D and F, you converge to {A, B, D, E} and {C, F}.
How Many Clusters?
§ Number of clusters K is given
§ Partition n docs into a predetermined number of clusters
§ Finding the "right" number of clusters is part of the problem
§ Given docs, partition them into an "appropriate" number of subsets.
§ E.g., for query results, the ideal value of K is not known up front
K not specified in advance
§ Given a clustering, define the Benefit of a doc to be its cosine similarity to its centroid
§ Define the Total Benefit to be the sum of the individual doc Benefits.
Penalize lots of clusters
§ For each cluster, we have a Cost C.
§ Thus, for a clustering with K clusters, the Total Cost is KC.
§ Define the Value of a clustering to be Total Benefit − Total Cost.
§ Find the clustering of highest Value over all choices of K (see the sketch below).
§ Total Benefit increases with increasing K, but we can stop when it doesn't increase by "much"; the Cost term enforces this.
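A sketch of this criterion (our own function; C is a tuning constant we choose ourselves, and `kmeans` is the earlier sketch):

```python
import numpy as np

def clustering_value(X, labels, centroids, cost_per_cluster):
    """Value = Total Benefit - K*C, with Benefit = cosine(doc, its centroid)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Cn = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    total_benefit = float(np.sum(Xn * Cn[labels]))  # sum of per-doc cosines
    return total_benefit - len(centroids) * cost_per_cluster

# e.g., pick K maximizing Value, reusing the kmeans sketch above:
# best_k = max(range(1, 10), key=lambda k: clustering_value(X, *kmeans(X, k), C))
```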
Hierarchical Clustering
§ Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents.
§ One approach: recursive application of a partitional clustering algorithm.
[Figure: dendrogram over animal → vertebrate (fish, reptile, amphib., mammal) and invertebrate (worm, insect, crustacean)]
Dendrogram: a clustering is obtained by cutting the dendrogram at a desired level; each connected component forms a cluster.
Hierarchical Agglomerative Clustering (HAC)
§ Starts with each doc in a separate cluster
§ then repeatedly joins the closest pair of clusters, until there is only one cluster.
§ The history of merging forms a binary tree or hierarchy.
Note: the resulting clusters are still “hard” and induce a partition
Closest pair of clusters
§ Many variants of defining the closest pair of clusters (see the SciPy sketch below)
§ Single-link: similarity of the most cosine-similar pair of points
§ Complete-link: similarity of the "furthest" points, i.e., the least cosine-similar pair
§ Centroid: clusters whose centroids (centers of gravity) are the most cosine-similar
§ Average-link: average cosine between all pairs of elements
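These variants map directly onto SciPy's hierarchical clustering (a sketch under our own setup; SciPy works with distances, so we use cosine distance = 1 − cosine similarity):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.rand(20, 5)               # stand-in for doc vectors
D = pdist(X, metric="cosine")           # condensed pairwise cosine distances
Z = linkage(D, method="single")         # or "complete" / "average";
                                        # "centroid" needs raw Euclidean vectors
labels = fcluster(Z, t=5, criterion="maxclust")  # cut dendrogram into 5 clusters
```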
Single Link Agglomerative Clustering
§ Use the maximum similarity of pairs:
sim(ci, cj) = max_{x∈ci, y∈cj} sim(x, y)
§ Can result in "straggly" (long and thin) clusters due to the chaining effect.
§ After merging ci and cj, the similarity of the resulting cluster to another cluster ck is:
sim((ci ∪ cj), ck) = max( sim(ci, ck), sim(cj, ck) )
Single Link Example
Complete Link
§ Use the minimum similarity of pairs:
sim(ci, cj) = min_{x∈ci, y∈cj} sim(x, y)
§ Makes "tighter," spherical clusters that are typically preferable.
§ After merging ci and cj, the similarity of the resulting cluster to another cluster ck is:
sim((ci ∪ cj), ck) = min( sim(ci, ck), sim(cj, ck) )
Complete Link Example
Group Average
§ Similarity of two clusters = average similarity of all pairs within the merged cluster:
sim(ci, cj) = 1/(|ci ∪ cj| (|ci ∪ cj| − 1)) ∑_{x ∈ ci∪cj} ∑_{y ∈ ci∪cj, y≠x} sim(x, y)
§ Compromise between single- and complete-link.
§ Two options:
§ Average over all ordered pairs in the merged cluster
§ Average over all pairs between the two original clusters
§ No clear difference in efficacy
Computing Group Average Similarity
§ Always maintain the sum of vectors in each cluster:
s(cj) = ∑_{x∈cj} x
§ Compute the similarity of clusters in constant time:
sim(ci, cj) = [ (s(ci) + s(cj)) · (s(ci) + s(cj)) − (|ci| + |cj|) ] / [ (|ci| + |cj|) (|ci| + |cj| − 1) ]
What Is A Good Clustering?
§ Internal criterion: a good clustering will produce high-quality clusters in which:
§ Intra-cluster distances are minimized
§ Inter-cluster distances are maximized
§ The measured quality also depends on the document representation and similarity measure used
External criteria for clustering quality
§ Quality is measured by the clustering's ability to discover some or all of the hidden patterns or latent classes in gold-standard data
§ Assesses a clustering with respect to ground truth … requires labeled data
External Evaluation of Cluster Quality
§ Simple measure: purity, the ratio between the size of the dominant ground-truth class in cluster ωi and the size of cluster ωi:
Purity(ωi) = (1/ni) max_j nij,   j ∈ [1, C]
§ Assume documents come from C gold-standard classes, while our clustering algorithm produces K clusters ω1, ω2, …, ωK, where ωi has ni members, nij of which belong to class j.
§ Biased: purity is maximized by putting each document in its own cluster (n clusters)
Purity example
Circles are clusters produced by the system; colors are ground-truth classes.
Cluster I: Purity = 1/6 × max(5, 1, 0) = 5/6
Cluster II: Purity = 1/6 × max(1, 4, 1) = 4/6
Cluster III: Purity = 1/5 × max(2, 0, 3) = 3/5
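A sketch of the computation in Python (overall purity is the size-weighted average of the per-cluster purities above; the label lists reproduce the slide's three clusters):

```python
from collections import Counter

def purity(clusters):
    """clusters: list of lists of gold-standard class labels."""
    n = sum(len(c) for c in clusters)
    return sum(max(Counter(c).values()) for c in clusters) / n

clusters = [["x"] * 5 + ["o"],            # Cluster I: majority class has 5 of 6
            ["o"] * 4 + ["x", "d"],       # Cluster II: majority class has 4 of 6
            ["d"] * 3 + ["x"] * 2]        # Cluster III: majority class has 3 of 5
print(purity(clusters))                   # (5 + 4 + 3) / 17 ≈ 0.71
```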
Summary of Text Clustering
§ Text clustering is an unsupervised, general text mining technique used to
§ obtain an overall picture of the text content (exploring text data)
§ discover interesting clustering structures in text data
§ Many approaches are possible
§ Strong clusters tend to show up no matter what method is used
§ The effectiveness of a method depends heavily on whether the desired clustering bias is captured appropriately (either through the right generative model or the right similarity function)
§ Deciding the optimal number of clusters is generally a difficult problem for any method, due to the unsupervised nature of the task
§ Evaluation of clustering results can be done both directly and indirectly
Suggested Reading
§ Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge: Cambridge University Press, 2008. (Chapter 16)