

SLIDE 1

School of Data Science, Fudan University (复旦大学大数据学院)

DATA130006 Text Management and Analysis

Text Clustering

Zhongyu Wei (魏忠钰)

October 18th, 2017

Adapted from UIUC CS410

SLIDE 2

What Is Text Clustering?

§ Discover “natural structure”
§ Group similar objects together
§ Objects can be documents, terms, passages, websites, …
§ Example:

Not well defined! What does “similar” mean?

SLIDE 3

The “Clustering Bias”

§ Any two objects can be similar, depending on how you look at them!
§ Are “car” and “horse” similar?
§ A user must define the perspective (i.e., a “bias”) for assessing similarity!

Basis for evaluation

SLIDE 4

Examples of Text Clustering

§ Clustering of documents in the whole collection
§ Term clustering to define “concept”/“theme”/“topic”
§ Clustering of passages/sentences or any selected text segments from larger text objects
§ Clustering of websites (text object has multiple documents)
§ Text clusters can be further clustered to generate a hierarchy

SLIDE 5

Why Text Clustering?

§ In general, very useful for text mining and exploratory text analysis:

§ Get a sense about the overall content of a collection (e.g., what are some of the “typical”/representative documents in a collection?)
§ Link (similar) text objects (e.g., removing duplicated content)
§ Create a structure on the text data (e.g., for browsing)
§ As a way to induce additional features (i.e., clusters) for classification of text objects

§ Examples of applications

§ Clustering of search results
§ Understanding major complaints in emails from customers

SLIDE 6

Topic Mining Revisited

[Diagram: topic mining over text data. INPUT: collection C, number of topics k, vocabulary V. OUTPUT: k topic word distributions { θ1, …, θk } and, for each document di, topic coverage { πi1, …, πik } (e.g., 30%, 12%, 8%, with some πij = 0%). Example topics: θ1 sports (sports 0.02, game 0.01, basketball 0.005, football 0.004, …), θ2 science (science 0.04, scientist 0.03, spaceship 0.006, …), θk travel (travel 0.05, attraction 0.03, trip 0.01, …).]

SLIDE 7

One Topic (= Cluster) Per Document

[Diagram: same setup, but each document is generated by exactly one topic, i.e., one coverage value is 100% and the rest are 0 (π11 = 100%, π12 = … = π1k = 0; π22 = 100%; πN1 = 100%; …). INPUT: C, k, V. OUTPUT: { θ1, …, θk } and cluster assignments { c1, …, cN }, ci ∈ [1, k]. Example topics θ1 (sports), θ2 (science), θk (travel) as before.]

SLIDE 8

Mining One Topic Revisited

[Diagram: mining one topic from one document. INPUT: C = {d}, V. OUTPUT: { θ }, i.e., the word distribution p(w|θ) (text ?, mining ?, association ?, database ?, …, query ?); the document covers this single topic 100%.]

(1 Doc, 1 Topic) → (N Docs, N Topics); with k < N → (N Docs, k Shared Topics) = Clustering!

SLIDE 9

What Generative Model Can Do Clustering?

[Diagram: the one-topic-per-document setup again (π11 = 100%, π22 = 100%, πN1 = 100%, all other πij = 0). INPUT: C, k, V. OUTPUT: { θ1, …, θk } and cluster assignments { c1, …, cN }, ci ∈ [1, k].]

How can we force every document to be generated using one topic (instead of k topics)?

SLIDE 10

Generative Topic Model Revisited

[Diagram: two word distributions — θ1 (text 0.04, mining 0.035, association 0.03, clustering 0.005, …, the 0.000001) and θ2 (the 0.03, a 0.02, is 0.015, we 0.01, food 0.003, …, text 0.000006). Topic choice: p(θ1) = p(θ2) = 0.5, with p(θ1) + p(θ2) = 1. Each word w of document d (“text”? “the”?) is generated by first choosing a topic and then sampling the word from it.]

Why can’t this model be used for clustering?

SLIDE 11

Mixture Model for Document Clustering

[Diagram: the same two word distributions p(w|θ1) and p(w|θ2) and topic choice p(θ1) = p(θ2) = 0.5, p(θ1) + p(θ2) = 1, but the topic is now chosen ONCE per document: all L words of d = x1 x2 … xL are generated from the single chosen distribution.]

Difference from the topic model? What if p(θ1) = 1 or p(θ2) = 1?
SLIDE 12

Likelihood Function: p(d)=? d=x1 x2 … xL

p(d) = p(θ1) p(d|θ1) + p(θ2) p(d|θ2)
     = p(θ1) ∏_{i=1}^{L} p(xi|θ1) + p(θ2) ∏_{i=1}^{L} p(xi|θ2)

How is this different from a topic model?

Topic model: p(d) = ∏_{i=1}^{L} [ p(θ1) p(xi|θ1) + p(θ2) p(xi|θ2) ]
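To see the difference concretely, here is a minimal Python sketch (not from the course; the short document below and the 1e-9 floor for unlisted words are illustrative assumptions) that evaluates both likelihoods using the two example word distributions from the previous slide:

    # Contrast the two likelihoods on this slide, using the example distributions
    # from slide 11; the document d below is made up for illustration.
    theta1 = {"text": 0.04, "mining": 0.035, "association": 0.03,
              "clustering": 0.005, "the": 0.000001}
    theta2 = {"the": 0.03, "a": 0.02, "is": 0.015, "we": 0.01,
              "food": 0.003, "text": 0.000006}
    p1, p2 = 0.5, 0.5                      # p(theta1), p(theta2)
    d = ["the", "text", "mining", "the"]   # hypothetical document d = x1 ... xL

    def mixture_likelihood(doc):
        """Clustering mixture: choose ONE topic, then generate ALL words from it."""
        like1 = like2 = 1.0
        for w in doc:
            like1 *= theta1.get(w, 1e-9)   # tiny floor for words missing from the table
            like2 *= theta2.get(w, 1e-9)
        return p1 * like1 + p2 * like2     # sum over the single hidden topic choice

    def topic_model_likelihood(doc):
        """Topic model: choose a topic independently for EVERY word position."""
        like = 1.0
        for w in doc:
            like *= p1 * theta1.get(w, 1e-9) + p2 * theta2.get(w, 1e-9)
        return like

    print(mixture_likelihood(d), topic_model_likelihood(d))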

SLIDE 13

Likelihood Function: p(d)=? d=x1 x2 … xL

p(d) = p(θ1) p(d|θ1) + p(θ2) p(d|θ2)
     = p(θ1) ∏_{i=1}^{L} p(xi|θ1) + p(θ2) ∏_{i=1}^{L} p(xi|θ2)

How can we generalize it to include k topics/clusters?

SLIDE 14

Mixture Model for Document Clustering

§ Data: a collection of documents C = {d1, …, dN}
§ Model: mixture of k unigram LMs: Λ = ({θi}; {p(θi)}), i ∈ [1, k]

§ To generate a document, first choose a θi according to p(θi), and then generate all words in the document using p(w|θi)

§ Likelihood:
§ Maximum Likelihood estimate

p(d|Λ) = ∑_{i=1}^{k} [ p(θi) ∏_{j=1}^{|d|} p(xj|θi) ] = ∑_{i=1}^{k} [ p(θi) ∏_{w∈V} p(w|θi)^c(w,d) ]

Λ* = arg max_Λ p(d|Λ)

SLIDE 15

Cluster Allocation After Parameter Estimation

§ Parameters of the mixture model: Λ = ({θi}; {p(θi)}), i ∈ [1, k]

§ Each θi represents the content of cluster i: p(w|θi)
§ p(θi) indicates the size of cluster i

§ Which cluster should document d belong to? c_d = ?

§ Likelihood only: assign d to the cluster whose topic θi most likely generated d:  c_d = arg max_i p(d|θi)
§ Likelihood + prior p(θi) (Bayesian): favor large clusters:  c_d = arg max_i p(θi) p(d|θi)
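As an illustration, here is a small sketch of both allocation rules (the toy parameter values and names are assumptions, not estimates from real data), computed in log space so long documents do not underflow:

    import math

    # Toy estimated parameters for k = 2 clusters (illustrative values only).
    p_theta = [0.6, 0.4]                               # p(theta_i): cluster sizes
    p_w = [{"text": 0.5, "mining": 0.2, "medical": 0.2, "health": 0.1},
           {"text": 0.1, "mining": 0.1, "medical": 0.75, "health": 0.05}]

    def log_p_d_given_theta(counts, i):
        """log p(d | theta_i) for a document given as a word-count dict c(w, d)."""
        return sum(c * math.log(p_w[i][w]) for w, c in counts.items())

    def assign(counts, use_prior=False):
        """Likelihood-only rule, or the Bayesian rule that also favors large clusters."""
        scores = [log_p_d_given_theta(counts, i)
                  + (math.log(p_theta[i]) if use_prior else 0.0)
                  for i in range(len(p_theta))]
        return max(range(len(scores)), key=scores.__getitem__)

    doc = {"text": 2, "mining": 2}
    print(assign(doc), assign(doc, use_prior=True))    # cluster index c_d under each rule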

SLIDE 16

How Can We Compute the ML Estimate?

§ Data: a collection of documents C = {d1, …, dN}
§ Model: mixture of k unigram LMs: Λ = ({θi}; {p(θi)}), i ∈ [1, k]

§ To generate a document, first choose a θi according to p(θi) and then generate all words in the document using p(w|θi)

§ Likelihood:
§ Maximum Likelihood estimate

p(C|Λ) = ∏_{j=1}^{N} p(dj|Λ),   where p(d|Λ) = ∑_{i=1}^{k} [ p(θi) ∏_{w∈V} p(w|θi)^c(w,d) ]

Λ* = arg max_Λ p(C|Λ)

SLIDE 17

EM Algorithm for Document Clustering

§ Initialization: Randomly set Λ = ({θi}; {p(θi)}), i ∈ [1, k]
§ Repeat until likelihood p(C|Λ) converges

§ E-Step: Infer which distribution has been used to generate document d: hidden variable Zd ∈ [1, k]
§ M-Step: Re-estimation of all parameters

E-step:
p^(n)(Zd = i | d) ∝ p^(n)(θi) ∏_{w∈V} p^(n)(w|θi)^c(w,d),   with ∑_{i=1}^{k} p^(n)(Zd = i | d) = 1

M-step:
p^(n+1)(θi) ∝ ∑_{j=1}^{N} p^(n)(Zdj = i | dj),   with ∑_{i=1}^{k} p^(n+1)(θi) = 1
p^(n+1)(w|θi) ∝ ∑_{j=1}^{N} c(w, dj) p^(n)(Zdj = i | dj),   with ∑_{w∈V} p^(n+1)(w|θi) = 1, ∀ i ∈ [1, k]

SLIDE 18

EM Algorithm for Document Clustering

§ Initialization: Λ = ({θi}; {p(θi)}), i ∈ [1, k]
§ E-Step: Compute p(Zd = i | d)
§ M-Step: Re-estimate all parameters Λ = ({θi}; {p(θi)})
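Below is a compact sketch of this EM procedure (a hedged illustration, not the course’s reference implementation; the toy corpus, smoothing floor, and variable names are assumptions). It follows the E-step and M-step formulas on the previous slide, working with word counts c(w, d) and log probabilities:

    import math, random
    from collections import Counter

    def em_cluster(docs, k, iters=50, seed=0, eps=1e-9):
        """Cluster docs (lists of tokens) with a mixture of k unigram language models."""
        rng = random.Random(seed)
        vocab = sorted({w for d in docs for w in d})
        counts = [Counter(d) for d in docs]
        # Random initialization of p(theta_i) and p(w|theta_i).
        p_theta = [1.0 / k] * k
        p_w = []
        for _ in range(k):
            weights = [rng.random() + eps for _ in vocab]
            total = sum(weights)
            p_w.append({w: x / total for w, x in zip(vocab, weights)})
        post = []
        for _ in range(iters):
            # E-step: p(Z_d = i | d) proportional to p(theta_i) * prod_w p(w|theta_i)^c(w,d)
            post = []
            for c in counts:
                logs = [math.log(p_theta[i])
                        + sum(n * math.log(p_w[i][w]) for w, n in c.items())
                        for i in range(k)]
                m = max(logs)
                exps = [math.exp(x - m) for x in logs]
                z = sum(exps)
                post.append([e / z for e in exps])
            # M-step: re-estimate p(theta_i) and p(w|theta_i) from the posteriors.
            p_theta = [sum(post[j][i] for j in range(len(docs))) / len(docs)
                       for i in range(k)]
            for i in range(k):
                totals = {w: eps for w in vocab}   # small floor so no word gets probability 0
                for j, c in enumerate(counts):
                    for w, n in c.items():
                        totals[w] += n * post[j][i]
                z = sum(totals.values())
                p_w[i] = {w: x / z for w, x in totals.items()}
        return p_theta, p_w, post

    docs = [["text", "mining", "text"], ["medical", "health", "health"],
            ["text", "clustering", "mining"]]
    p_theta, p_w, post = em_cluster(docs, k=2)
    print([max(range(2), key=p.__getitem__) for p in post])   # hard assignment c_d per doc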

SLIDE 19

An Example of 2 Clusters

Random initialization; E-step. Hidden variables: Zd ∈ {1, 2}

p(θ1) = p(θ2) = 0.5

w         p(w|θ1)   p(w|θ2)
text      0.5       0.1
mining    0.2       0.1
medical   0.2       0.75
health    0.1       0.05

Document d: c(w, d) — text 2, mining 2

p(Zd = 1 | d) = p(θ1) p(“text”|θ1)² p(“mining”|θ1)² / [ p(θ1) p(“text”|θ1)² p(“mining”|θ1)² + p(θ2) p(“text”|θ2)² p(“mining”|θ2)² ]
              = (0.5 × 0.5² × 0.2²) / (0.5 × 0.5² × 0.2² + 0.5 × 0.1² × 0.1²) = 100/101

p(Zd = 2 | d) = ?

SLIDE 20

Normalization to Avoid Underflow

w         p(w|θ1)   p(w|θ2)   p(w|θ̄) (normalizer)
text      0.5       0.1       (0.5 + 0.1)/2
mining    0.2       0.1       (0.2 + 0.1)/2
medical   0.2       0.75      (0.2 + 0.75)/2
health    0.1       0.05      (0.1 + 0.05)/2

p(Zd = 1 | d) = p(θ1) · [ p(“text”|θ1) p(“mining”|θ1) / (p(“text”|θ̄) p(“mining”|θ̄)) ]
              / { p(θ1) · [ p(“text”|θ1) p(“mining”|θ1) / (p(“text”|θ̄) p(“mining”|θ̄)) ] + p(θ2) · [ p(“text”|θ2) p(“mining”|θ2) / (p(“text”|θ̄) p(“mining”|θ̄)) ] }

p(w|θ̄): the average of the p(w|θi), used as a possible normalizer
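A minimal sketch of this rescaling trick (function and variable names are mine; the numbers come from the table above): dividing every p(w|θi) by the cluster-average p(w|θ̄) keeps the per-word factors near 1 so the products do not underflow, while the posterior p(Zd = i | d) is unchanged:

    # Rescaling trick from this slide: divide each p(w|theta_i) by the average
    # p(w|theta_bar) before multiplying; p(Z_d = i | d) stays the same.
    p_w = [{"text": 0.5, "mining": 0.2, "medical": 0.2, "health": 0.1},     # p(w|theta_1)
           {"text": 0.1, "mining": 0.1, "medical": 0.75, "health": 0.05}]   # p(w|theta_2)
    p_theta = [0.5, 0.5]
    doc = {"text": 2, "mining": 2}                     # c(w, d)

    avg = {w: sum(pw[w] for pw in p_w) / len(p_w) for w in p_w[0]}   # p(w|theta_bar)

    def posterior(normalize):
        scores = []
        for i, pw in enumerate(p_w):
            prod = p_theta[i]
            for w, c in doc.items():
                factor = pw[w] / avg[w] if normalize else pw[w]
                prod *= factor ** c
            scores.append(prod)
        z = sum(scores)
        return [s / z for s in scores]

    print(posterior(False))   # raw products (can underflow for long documents)
    print(posterior(True))    # rescaled products, identical posterior (100/101, 1/101)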

SLIDE 21

Summary of Generative Model for Clustering

§ A slight variation of topic model can be used for clustering documents

§ Each cluster is represented by a unigram LM p(w|θi) → a term cluster
§ A document is generated by first choosing a unigram LM and then generating ALL words in the document using this single LM
§ Estimated model parameters give both a topic characterization of each cluster and a probabilistic assignment of a document into each cluster

§ EM algorithm can be used to compute the ML estimate

§ Normalization is often needed to avoid underflow

SLIDE 22

§ More About Text Clustering

SLIDE 23

Hard vs. soft clustering

§ Hard clustering: Each document belongs to exactly one cluster

§ by forcing a document into the cluster corresponding to the unigram LM most likely used to generate the document

§ Soft clustering: A document can belong to more than one cluster.

§ You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes

SLIDE 24

Other Clustering Algorithms

§ Flat algorithms

§ Usually start with a random (partial) partitioning
§ Refine it iteratively
§ K-means clustering

§ Hierarchical algorithms

§ Bottom-up
§ Top-down

SLIDE 25

K-Means

§ Assumes documents are real-valued vectors.
§ Clusters based on centroids (aka the center of gravity or mean) of points in a cluster, c:

§ Reassignment of instances to clusters is based on distance to the current cluster centroids.

μ(c) = (1/|c|) ∑_{x ∈ c} x

SLIDE 26

K-Means Algorithm

Select K random docs {s1, s2, …, sK} as seeds.
Until clustering converges (or other stopping criterion):
    For each doc di:
        Assign di to the cluster cj such that dist(xi, sj) is minimal.
    (Next, update the seeds to the centroid of each cluster)
    For each cluster cj:
        sj = μ(cj)
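A short runnable sketch of this algorithm (assumes documents have already been turned into dense numeric vectors, e.g., TF-IDF; squared Euclidean distance and the toy vectors are illustrative choices, not prescribed by the slide):

    import random

    def k_means(vectors, k, iters=100, seed=0):
        """Plain K-means over a list of equal-length float vectors."""
        rng = random.Random(seed)
        seeds = [list(v) for v in rng.sample(vectors, k)]   # K random docs as initial seeds
        assignment = None
        for _ in range(iters):
            # Assign each doc to the cluster whose current centroid is closest.
            new_assignment = [min(range(k),
                                  key=lambda j: sum((x - s) ** 2 for x, s in zip(v, seeds[j])))
                              for v in vectors]
            if new_assignment == assignment:                # doc partition unchanged -> stop
                break
            assignment = new_assignment
            # Update each seed to the centroid of its cluster (keep the old seed if empty).
            for j in range(k):
                members = [v for v, a in zip(vectors, assignment) if a == j]
                if members:
                    seeds[j] = [sum(col) / len(members) for col in zip(*members)]
        return assignment, seeds

    docs = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]   # toy 2-D "document vectors"
    print(k_means(docs, k=2)[0])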

SLIDE 27

[Diagram: K-Means example (K = 2) — pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged!]

SLIDE 28

Termination conditions

§ Several possibilities, e.g.,

§ A fixed number of iterations.
§ Doc partition unchanged.
§ Centroid positions don’t change.

Does this mean that the docs in a cluster are unchanged?

SLIDE 29

Seed Choice

§ Results can vary based on random seed selection.
§ Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings.

§ Select good seeds using a heuristic (e.g., doc least similar to any existing mean)
§ Try out multiple starting points
§ Initialize with the results of another method.

[Example figure showing sensitivity to seeds: with points A–F, starting with B and E as centroids converges to {A,B,C} and {D,E,F}; starting with D and F converges to {A,B,D,E} and {C,F}.]

SLIDE 30

How Many Clusters?

§ Number of clusters K is given

§ Partition n docs into predetermined number of clusters

§ Finding the “right” number of clusters is part of the problem

§ Given docs, partition into an “appropriate” number of subsets.
§ E.g., for query results, the ideal value of K is not known up front, though the UI may impose limits.
SLIDE 31

K not specified in advance

§ Given a clustering, define the Benefit for a doc to be the cosine similarity to its centroid
§ Define the Total Benefit to be the sum of the individual doc Benefits.

SLIDE 32

Penalize lots of clusters

§ For each cluster, we have a Cost C.
§ Thus for a clustering with K clusters, the Total Cost is KC.
§ Define the Value of a clustering to be Total Benefit − Total Cost.

§ Find the clustering of highest value, over all choices of K.

§ Total benefit increases with increasing K. But can stop when it doesn’t increase by “much”. The Cost term enforces this.
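A sketch of this selection rule (the per-cluster cost value, the helper names, and the usage comment are assumptions; Benefit is the cosine similarity of each doc to its assigned centroid, as defined on the previous slide):

    import math

    def cosine(u, v):
        """Cosine similarity between two dense vectors."""
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    def clustering_value(vectors, assignment, centroids, cost_per_cluster):
        """Value = Total Benefit (cosine to assigned centroid) - Total Cost (K * C)."""
        total_benefit = sum(cosine(v, centroids[a]) for v, a in zip(vectors, assignment))
        return total_benefit - len(centroids) * cost_per_cluster

    # Hypothetical usage with a flat clusterer such as the K-means sketch above:
    #   best_k = max(range(1, 10),
    #                key=lambda k: clustering_value(vectors, *k_means(vectors, k),
    #                                               cost_per_cluster=0.5))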

SLIDE 33

Hierarchical Clustering

  • Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents.
  • One approach: recursive application of a partitional clustering algorithm.

[Example dendrogram: animal → vertebrate (fish, reptile, amphib., mammal) and invertebrate (worm, insect, crustacean)]

SLIDE 34

Dendrogram: Hierarchical Clustering

§ Clustering obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.

SLIDE 35

Hierarchical Agglomerative Clustering (HAC)

§ Starts with each doc in a separate cluster

§ then repeatedly joins the closest pair of clusters, until there is only one cluster.

§ The history of merging forms a binary tree or hierarchy.

Note: the resulting clusters are still “hard” and induce a partition

SLIDE 36

Closest pair of clusters

§ Many variants to defining the closest pair of clusters
§ Single-link

§ Similarity of the most cosine-similar (single-link)

§ Complete-link

§ Similarity of the “furthest” points, the least cosine-similar

§ Centroid

§ Clusters whose centroids (centers of gravity) are the most cosine-similar

§ Average-link

§ Average cosine between all pairs of elements

SLIDE 37

Single Link Agglomerative Clustering

§ Use maximum similarity of pairs:
§ Can result in “straggly” (long and thin) clusters due to a chaining effect.
§ After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:

sim(ci, cj) = max_{x ∈ ci, y ∈ cj} sim(x, y)

sim((ci ∪ cj), ck) = max( sim(ci, ck), sim(cj, ck) )

SLIDE 38

Single Link Example

SLIDE 39

Complete Link

§ Use minimum similarity of pairs:
§ Makes “tighter,” spherical clusters that are typically preferable.
§ After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:

sim(ci, cj) = min_{x ∈ ci, y ∈ cj} sim(x, y)

sim((ci ∪ cj), ck) = min( sim(ci, ck), sim(cj, ck) )
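The two merge rules lend themselves to one naive HAC sketch (an O(N³) pure-Python illustration, not the course’s reference code), where passing max as the linkage gives single-link and min gives complete-link:

    import math

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

    def hac(vectors, linkage=max, num_clusters=1):
        """Naive HAC: linkage=max -> single-link, linkage=min -> complete-link."""
        clusters = [[i] for i in range(len(vectors))]            # start: one doc per cluster
        sim = lambda a, b: linkage(cosine(vectors[x], vectors[y]) for x in a for y in b)
        merges = []
        while len(clusters) > num_clusters:
            # Find and join the closest pair of clusters under the chosen linkage.
            i, j = max(((i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))),
                       key=lambda p: sim(clusters[p[0]], clusters[p[1]]))
            merges.append((clusters[i], clusters[j]))
            clusters[i] = clusters[i] + clusters[j]
            del clusters[j]
        return clusters, merges                                  # merge history = the hierarchy

    docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(hac(docs, linkage=max, num_clusters=2)[0])             # single-link, cut at 2 clusters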

SLIDE 40

Complete Link Example

SLIDE 41

Group Average

§ Similarity of two clusters = average similarity of all pairs within the merged cluster.
§ Compromise between single and complete link.
§ Two options:
§ Averaged across all ordered pairs in the merged cluster
§ Averaged over all pairs between the two original clusters
§ No clear difference in efficacy

sim(ci, cj) = 1 / ( |ci ∪ cj| (|ci ∪ cj| − 1) ) · ∑_{x ∈ ci ∪ cj} ∑_{y ∈ ci ∪ cj, y ≠ x} sim(x, y)

SLIDE 42

Computing Group Average Similarity

§ Always maintain the sum of vectors in each cluster.
§ Compute the similarity of clusters in constant time:

s(cj) = ∑_{x ∈ cj} x

sim(ci, cj) = [ (s(ci) + s(cj)) · (s(ci) + s(cj)) − (|ci| + |cj|) ] / [ (|ci| + |cj|) (|ci| + |cj| − 1) ]
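A small NumPy sketch (toy unit-length vectors; names are mine) checking that this sum-vector formula matches the direct average over all ordered pairs in the merged cluster:

    import numpy as np

    rng = np.random.default_rng(0)
    ci = rng.normal(size=(3, 5)); ci /= np.linalg.norm(ci, axis=1, keepdims=True)  # unit doc vectors
    cj = rng.normal(size=(4, 5)); cj /= np.linalg.norm(cj, axis=1, keepdims=True)

    # Constant-time formula using the maintained sum vectors s(ci) and s(cj).
    s = ci.sum(axis=0) + cj.sum(axis=0)
    m = len(ci) + len(cj)
    fast = (s @ s - m) / (m * (m - 1))

    # Direct group-average over all ordered pairs x != y in the merged cluster.
    merged = np.vstack([ci, cj])
    gram = merged @ merged.T
    slow = (gram.sum() - np.trace(gram)) / (m * (m - 1))

    print(np.isclose(fast, slow))   # True: both compute the same group-average similarity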

SLIDE 43

What Is A Good Clustering

§ Internal criterion: A good clustering will produce high-quality clusters in which:
§ inter-cluster distances are maximized
§ intra-cluster distances are minimized
§ The measured quality depends on the document representation and the similarity measure used

SLIDE 44

External criteria for clustering quality

§ Quality measured by its ability to discover some or all of the hidden patterns or latent classes in gold standard data
§ Assesses a clustering with respect to ground truth … requires labeled data

SLIDE 45

External Evaluation of Cluster Quality

§ Simple measure: purity, the ratio between the number of documents of the dominant ground-truth class in cluster ωi and the size of cluster ωi
§ Assume documents come from C gold standard classes, while our clustering algorithm produces K clusters ω1, ω2, …, ωK with ni members each.
§ Biased: putting each document in its own cluster (n clusters) maximizes purity

Purity(ωi) = (1/ni) · max_j (nij),   j ∈ C, where nij is the number of documents in cluster ωi that belong to class j

SLIDE 46

Purity example

[Figure: three system clusters (circles) containing documents colored by ground-truth class]

Cluster I: Purity = (1/6) × max(5, 1, 0) = 5/6
Cluster II: Purity = (1/6) × max(1, 4, 1) = 4/6
Cluster III: Purity = (1/5) × max(2, 0, 3) = 3/5

Circles are from the system; colors are from the ground truth.
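A minimal sketch of the purity computation (the class labels "x", "o", "d" are placeholders standing in for the colors in the figure; the counts reproduce the example above):

    from collections import Counter

    def purity(cluster_labels):
        """Per-cluster purity: fraction of the dominant ground-truth class in the cluster."""
        counts = Counter(cluster_labels)
        return max(counts.values()) / len(cluster_labels)

    # Ground-truth class labels of the docs in each system cluster (the slide's example).
    clusters = {
        "I":   ["x"] * 5 + ["o"] * 1 + ["d"] * 0,
        "II":  ["x"] * 1 + ["o"] * 4 + ["d"] * 1,
        "III": ["x"] * 2 + ["o"] * 0 + ["d"] * 3,
    }
    for name, labels in clusters.items():
        print(name, purity(labels))     # 5/6, 4/6, 3/5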

SLIDE 47

Summary of Text Clustering

§ Text clustering is an unsupervised general text mining technique to

§ obtain an overall picture of the text content (exploring text data)
§ discover interesting clustering structures in text data

§ Many approaches are possible

§ Strong clusters tend to show up no matter what method is used
§ Effectiveness of a method highly depends on whether the desired clustering bias is captured appropriately (either through using the right generative model or the right similarity function)
§ Deciding the optimal number of clusters is generally a difficult problem for any method due to the unsupervised nature

§ Evaluation of clustering results can be done both directly and indirectly

SLIDE 48

Suggested Reading

§ Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge: Cambridge University Press, 2008. (Chapter 16)