复旦大学大数据学院
School of Data Science, Fudan University
DATA130006 Text Management and Analysis
Text Clustering
魏忠钰
October 18th, 2017
Adapted from UIUC CS410
What Is Text Clustering?
§ Discover "natural structure"
§ Group similar objects together
§ Objects can be documents, terms, passages, websites, …
Not well defined! What does “similar” mean?
The “Clustering Bias”
§ Any two objects can be similar, depending on how you look at them!
§ Are "car" and "horse" similar?
§ A user must define the perspective (i.e., a "bias") for assessing similarity!
Basis for evaluation
Examples of Text Clustering
§ Clustering of documents in the whole collection
§ Term clustering to define a "concept"/"theme"/"topic"
§ Clustering of passages/sentences or any selected text segments from larger text objects
§ Clustering of websites (where a text object has multiple documents)
§ Text clusters can be further clustered to generate a hierarchy
Why Text Clustering?
§ In general, very useful for text mining and exploratory text analysis:
§ Get a sense of the overall content of a collection (e.g., what are some of the "typical"/representative documents in the collection?)
§ Link (similar) text objects (e.g., removing duplicated content)
§ Create a structure on the text data (e.g., for browsing)
§ As a way to induce additional features (i.e., clusters) for classification of text objects
§ Examples of applications:
§ Clustering of search results
§ Understanding major complaints in emails from customers
Topic Mining Revisited
[Figure: k topics, each a word distribution — θ1: sports 0.02, game 0.01, basketball 0.005, football 0.004, …; θ2: science 0.04, scientist 0.03, spaceship 0.006, …; θk: travel 0.05, attraction 0.03, trip 0.01, … — covering Doc 1, Doc 2, …, Doc N with coverage probabilities πij (e.g., 30%, 12%, 8%; π21 = 0%, πN1 = 0%)]
INPUT: C, k, V
OUTPUT: {θ1, …, θk}, {πi1, …, πik}
One Topic(=cluster) Per Document
[Figure: the same k topic word distributions, but each document is generated from a single topic: π11 = 100%, π12 = … = π1k = 0; π21 = 0%, π22 = 100%, …; πN1 = 100%, πN2 = … = πNk = 0]
INPUT: C, k, V
OUTPUT: {θ1, …, θk}, {c1, …, cN}, ci ∈ [1, k]
Mining One Topic Revisited
[Figure: a single document d generated entirely (100%) from one topic θ — a word distribution P(w|θ) over text, mining, association, database, query, … to be estimated]
INPUT: C = {d}, V
OUTPUT: {θ}
(1 Doc, 1 Topic) → (N Docs, N Topics) → with k < N: (N Docs, k Shared Topics) = Clustering!
What Generative Model Can Do Clustering?
[Figure: the same one-topic-per-document setting as on the previous slide]
INPUT: C, k, V
OUTPUT: {θ1, …, θk}, {c1, …, cN}, ci ∈ [1, k]
How can we force every document to be generated using one topic (instead of k topics)?
Generative Topic Model Revisited
[Figure: two word distributions — θ1: text 0.04, mining 0.035, association 0.03, clustering 0.005, …, the 0.000001; θ2: the 0.03, a 0.02, is 0.015, we 0.01, food 0.003, …, text 0.000006 — with topic choice p(θ1) = p(θ2) = 0.5, p(θ1) + p(θ2) = 1; each word w ("text"? "the"?) is generated by first making a topic choice]
Why can’t this model be used for clustering?
Mixture Model for Document Clustering
[Figure: the same two word distributions P(w|θ1) and P(w|θ2) with topic choice p(θ1) = p(θ2) = 0.5, p(θ1) + p(θ2) = 1, but now the topic is chosen once per document d = x1 x2 … xL, and all L words are generated from that single topic]
Difference from the topic model? What if p(θ1) = 1?
Likelihood Function: p(d) = ?   d = x1 x2 … xL
p(d) = p(θ1) p(d|θ1) + p(θ2) p(d|θ2)
     = p(θ1) ∏_{i=1}^{L} p(xi|θ1) + p(θ2) ∏_{i=1}^{L} p(xi|θ2)
How is this different from a topic model?
topic model: p(d) = ∏_{i=1}^{L} [ p(θ1) p(xi|θ1) + p(θ2) p(xi|θ2) ]
How can we generalize it to include k topics/clusters?
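To make the contrast concrete, here is a minimal Python sketch (ours, not the lecture's) computing both likelihoods for a toy two-topic example; the word probabilities are excerpts from the distributions shown above, and the three-word document is made up.

```python
# Sketch: mixture-for-clustering vs. topic-model likelihood on a toy example.
from math import prod

p_theta = [0.5, 0.5]                      # p(theta_1), p(theta_2)
p_w = [{"text": 0.04, "the": 0.000001},   # p(w | theta_1), slide excerpt
       {"text": 0.000006, "the": 0.03}]   # p(w | theta_2), slide excerpt
doc = ["text", "text", "the"]             # d = x_1 x_2 ... x_L (made up)

# Clustering mixture: choose ONE topic, then generate ALL words from it.
#   p(d) = sum_i p(theta_i) * prod_j p(x_j | theta_i)
p_cluster = sum(pt * prod(pw[x] for x in doc) for pt, pw in zip(p_theta, p_w))

# Topic model: make a fresh topic choice PER WORD.
#   p(d) = prod_j sum_i p(theta_i) * p(x_j | theta_i)
p_topic = prod(sum(pt * pw[x] for pt, pw in zip(p_theta, p_w)) for x in doc)

print(p_cluster, p_topic)  # sum-of-products vs. product-of-sums
```

The only difference is where the sum over topics sits relative to the product over words; moving it outside the product is exactly what forces one topic per document.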
Mixture Model for Document Clustering
§ Data: a collection of documents C = {d1, …, dN}
§ Model: a mixture of k unigram LMs: Λ = ({θi}; {p(θi)}), i ∈ [1, k]
§ To generate a document, first choose a θi according to p(θi), and then generate all words in the document using p(w|θi)
§ Likelihood:
p(d|Λ) = ∑_{i=1}^{k} p(θi) ∏_{j=1}^{|d|} p(xj|θi) = ∑_{i=1}^{k} p(θi) ∏_{w∈V} p(w|θi)^{c(w,d)}
§ Maximum Likelihood estimate: Λ* = argmax_Λ p(d|Λ)
Cluster Allocation After Parameter Estimation
§ Parameters of the mixture model: Λ = ({θi}; {p(θi)}), i ∈ [1, k]
§ Each θi represents the content of cluster i: p(w|θi)
§ p(θi) indicates the size of cluster i
§ Which cluster should document d belong to? cd = ?
§ Likelihood only: assign d to the cluster corresponding to the topic θi most likely to have generated d: cd = argmax_i p(d|θi)
§ Likelihood + prior p(θi) (Bayesian): favor large clusters: cd = argmax_i p(θi) p(d|θi)
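A small sketch of this allocation rule (our own helper, not from the slides), working in log space so that long documents do not underflow:

```python
import math

def assign_cluster(d, p_theta, p_w, use_prior=True):
    """d: {word: count}. Returns c_d = argmax_i [p(theta_i)] * p(d|theta_i)."""
    def log_score(i):
        s = math.log(p_theta[i]) if use_prior else 0.0  # prior favors large clusters
        return s + sum(c * math.log(p_w[i][w]) for w, c in d.items())
    return max(range(len(p_theta)), key=log_score)
```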
How Can We Compute the ML Estimate?
§ Data: a collection of documents C = {d1, …, dN}
§ Model: a mixture of k unigram LMs: Λ = ({θi}; {p(θi)}), i ∈ [1, k]
§ To generate a document, first choose a θi according to p(θi), and then generate all words in the document using p(w|θi)
§ Likelihood:
p(d|Λ) = ∑_{i=1}^{k} p(θi) ∏_{w∈V} p(w|θi)^{c(w,d)}
p(C|Λ) = ∏_{j=1}^{N} p(dj|Λ)
§ Maximum Likelihood estimate: Λ* = argmax_Λ p(C|Λ)
EM Algorithm for Document Clustering
§ Initialization: randomly set Λ = ({θi}; {p(θi)}), i ∈ [1, k]
§ Repeat until the likelihood p(C|Λ) converges
§ E-Step: infer which distribution has been used to generate document d (hidden variable Zd ∈ [1, k]):
p^(n)(Zd = i | d) ∝ p^(n)(θi) ∏_{w∈V} p^(n)(w|θi)^{c(w,d)},   with ∑_{i=1}^{k} p^(n)(Zd = i | d) = 1
§ M-Step: re-estimate all parameters:
p^(n+1)(θi) ∝ ∑_{j=1}^{N} p^(n)(Zdj = i | dj),   with ∑_{i=1}^{k} p^(n+1)(θi) = 1
p^(n+1)(w|θi) ∝ ∑_{j=1}^{N} c(w, dj) p^(n)(Zdj = i | dj),   with ∑_{w∈V} p^(n+1)(w|θi) = 1, ∀ i ∈ [1, k]
EM Algorithm for Document Clustering
§ Initialization: Λ = ({θi}; {p(θi)}), i ∈ [1, k]
§ E-Step: compute p(Zd = i | d)
§ M-Step: re-estimate all parameters Λ = ({θi}; {p(θi)})
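A compact Python sketch of these updates, assuming documents are given as word-count dicts; the function and variable names (em_cluster, docs, vocab, n_iter) are our own:

```python
import random

def em_cluster(docs, vocab, k, n_iter=50, seed=0):
    """docs: list of {word: count} dicts; vocab: iterable of all words."""
    rng = random.Random(seed)
    vocab = list(vocab)
    # Initialization: uniform p(theta_i), random normalized p(w|theta_i).
    p_theta = [1.0 / k] * k
    p_w = []
    for _ in range(k):
        raw = {w: rng.random() + 1e-3 for w in vocab}
        z = sum(raw.values())
        p_w.append({w: v / z for w, v in raw.items()})
    post = []
    for _ in range(n_iter):
        # E-step: p(Z_d = i | d) ∝ p(theta_i) * prod_w p(w|theta_i)^c(w,d)
        post = []
        for d in docs:
            scores = []
            for i in range(k):
                s = p_theta[i]
                for w, c in d.items():
                    s *= p_w[i][w] ** c   # may underflow on long docs; see below
                scores.append(s)
            z = sum(scores) or 1.0
            post.append([s / z for s in scores])
        # M-step: re-estimate p(theta_i) and p(w|theta_i) from the posteriors.
        p_theta = [sum(p[i] for p in post) / len(docs) for i in range(k)]
        for i in range(k):
            counts = {w: sum(p[i] * d.get(w, 0) for p, d in zip(post, docs))
                      for w in vocab}
            z = sum(counts.values()) or 1.0
            p_w[i] = {w: c / z for w, c in counts.items()}
    return p_theta, p_w, post
```

Cluster labels then follow from the final E-step posteriors: cd = argmax_i p(Zd = i | d).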
An Example of 2 Clusters
Random Initialization: p(θ1) = p(θ2) = 0.5
E-step hidden variables: Zd ∈ {1, 2}

w        p(w|θ1)  p(w|θ2)  c(w,d)
text     0.5      0.1      2
mining   0.2      0.1      2
medical  0.2      0.75     0
health   0.1      0.05     0

For document d:
p(Zd = 1 | d) = p(θ1) p("text"|θ1)² p("mining"|θ1)² / [ p(θ1) p("text"|θ1)² p("mining"|θ1)² + p(θ2) p("text"|θ2)² p("mining"|θ2)² ]
             = (0.5 × 0.5² × 0.2²) / (0.5 × 0.5² × 0.2² + 0.5 × 0.1² × 0.1²) = 100/101
p(Zd = 2 | d) = ?
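Checking the arithmetic in Python (values copied from the table above; the exponents are the word counts):

```python
p1 = 0.5 * 0.5**2 * 0.2**2   # p(theta_1) * p(text|theta_1)^2 * p(mining|theta_1)^2
p2 = 0.5 * 0.1**2 * 0.1**2   # p(theta_2) * p(text|theta_2)^2 * p(mining|theta_2)^2
print(p1 / (p1 + p2))        # p(Z_d = 1 | d) = 100/101 ≈ 0.9901
print(p2 / (p1 + p2))        # p(Z_d = 2 | d) = 1/101   ≈ 0.0099
```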
Normalization to Avoid Underflow
w        p(w|θ1)  p(w|θ2)  p̄(w)
text     0.5      0.1      (0.5 + 0.1)/2
mining   0.2      0.1      (0.2 + 0.1)/2
medical  0.2      0.75     (0.2 + 0.75)/2
health   0.1      0.05     (0.1 + 0.05)/2

p(Zd = 1 | d) = p(θ1) [p("text"|θ1)/p̄("text")]² [p("mining"|θ1)/p̄("mining")]² / { p(θ1) [p("text"|θ1)/p̄("text")]² [p("mining"|θ1)/p̄("mining")]² + p(θ2) [p("text"|θ2)/p̄("text")]² [p("mining"|θ2)/p̄("mining")]² }

The average of the p(w|θi), p̄(w), is a possible normalizer: dividing numerator and denominator by the same ∏w p̄(w)^c(w,d) keeps the products near 1 without changing the posterior.
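A sketch of this trick in Python (the function name is our own; working in log space, as in the allocation sketch earlier, is a common alternative):

```python
def e_step_normalized(d, p_theta, p_w):
    """d: {word: count}. Divides each p(w|theta_i) by the average p̄(w)."""
    k = len(p_theta)
    avg = {w: sum(p_w[i][w] for i in range(k)) / k for w in d}
    scores = []
    for i in range(k):
        s = p_theta[i]
        for w, c in d.items():
            s *= (p_w[i][w] / avg[w]) ** c  # ratios stay near 1: no underflow
        scores.append(s)
    z = sum(scores)
    return [s / z for s in scores]          # posterior unchanged: p̄ cancels
```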
Summary of Generative Model for Clustering
§ A slight variation of the topic model can be used for clustering documents
§ Each cluster is represented by a unigram LM p(w|θi) → a term cluster
§ A document is generated by first choosing a unigram LM and then generating ALL words in the document using this single LM
§ Estimated model parameters give both a topic characterization of each cluster and a probabilistic assignment of a document to each cluster
§ EM algorithm can be used to compute the ML estimate
§ Normalization is often needed to avoid underflow
Hard vs. soft clustering
§ Hard clustering: each document belongs to exactly one cluster
§ Achieved by forcing a document into the cluster corresponding to the unigram LM most likely used to generate it
§ Soft clustering: a document can belong to more than one cluster
§ E.g., you may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes
Other Clustering Algorithms
§ Flat algorithms
§ Usually start with a random (partial) partitioning
§ Refine it iteratively
§ E.g., K-means clustering
§ Hierarchical algorithms
§ Bottom-up
§ Top-down
K-Means
§ Assumes documents are real-valued vectors.
§ Clusters based on centroids (aka the center of gravity or mean) of points in a cluster c:
µ(c) = (1/|c|) ∑_{x∈c} x
§ Reassignment of instances to clusters is based on distance to the current cluster centroids.
K-Means Algorithm
Select K random docs {s1, s2, …, sK} as seeds.
Until clustering converges (or another stopping criterion is met):
    For each doc di:
        Assign di to the cluster cj such that dist(xi, sj) is minimal.
    (Next, update the seeds to the centroid of each cluster:)
    For each cluster cj:
        sj = µ(cj)
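A minimal NumPy sketch of this loop (ours, not the lecture's), using Euclidean distance and the "centroids unchanged" stopping criterion:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """X: (n_docs, n_dims) array of doc vectors."""
    rng = np.random.default_rng(seed)
    seeds = X[rng.choice(len(X), size=k, replace=False)]  # K random docs as seeds
    for _ in range(n_iter):
        # Assign each doc to the cluster with the nearest current centroid.
        dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each seed to the centroid of its cluster (keep empty ones).
        new_seeds = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else seeds[j] for j in range(k)])
        if np.allclose(new_seeds, seeds):  # centroids unchanged -> converged
            break
        seeds = new_seeds
    return labels, seeds
```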
K Means Example (K=2)
[Figure: 2-D point set — pick seeds, reassign clusters, compute centroids (×), reassign clusters, compute centroids, reassign clusters, converged!]
Termination conditions
§ Several possibilities, e.g.,
§ A fixed number of iterations.
§ Doc partition unchanged.
§ Centroid positions don't change.
Does this mean that the docs in a cluster are unchanged?
Seed Choice
§ Results can vary based on random seed selection.
§ Some seeds can result in a poor convergence rate, or in convergence to sub-optimal clusterings.
§ Select good seeds using a heuristic (e.g., the doc least similar to any existing mean)
§ Try out multiple starting points
§ Initialize with the results of another method
§ Example showing sensitivity to seeds: if you start with B and E as centroids, you converge to {A, B, C} and {D, E, F}; if you start with D and F, you converge to {A, B, D, E} and {C, F}.
How Many Clusters?
§ Number of clusters K is given
§ Partition n docs into a predetermined number of clusters
§ Finding the "right" number of clusters is part of the problem
§ Given docs, partition them into an "appropriate" number of subsets.
§ E.g., for query results, the ideal value of K is not known up front
K not specified in advance
§ Given a clustering, define the Benefit of a doc to be its cosine similarity to its centroid
§ Define the Total Benefit to be the sum of the individual doc Benefits.
Penalize lots of clusters
§ For each cluster, we have a Cost C.
§ Thus, for a clustering with K clusters, the Total Cost is KC.
§ Define the Value of a clustering to be Total Benefit − Total Cost.
§ Find the clustering of highest Value over all choices of K (see the sketch below).
§ Total Benefit increases with increasing K, but we can stop when it doesn't increase by "much"; the Cost term enforces this.
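A sketch of this criterion (our own function; C is a tuning constant we choose ourselves, and `kmeans` is the earlier sketch):

```python
import numpy as np

def clustering_value(X, labels, centroids, cost_per_cluster):
    """Value = Total Benefit - K*C, with Benefit = cosine(doc, its centroid)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Cn = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    total_benefit = float(np.sum(Xn * Cn[labels]))  # sum of per-doc cosines
    return total_benefit - len(centroids) * cost_per_cluster

# e.g., pick K maximizing Value, reusing the kmeans sketch above:
# best_k = max(range(1, 10), key=lambda k: clustering_value(X, *kmeans(X, k), C))
```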
Hierarchical Clustering
§ Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents.
§ One approach: recursive application of a partitional clustering algorithm.
[Figure: dendrogram over animal → vertebrate (fish, reptile, amphib., mammal) and invertebrate (worm, insect, crustacean)]
Dendrogram: a clustering is obtained by cutting the dendrogram at a desired level; each connected component forms a cluster.
Hierarchical Agglomerative Clustering (HAC)
§ Starts with each doc in a separate cluster
§ then repeatedly joins the closest pair of clusters, until there is only one cluster.
§ The history of merging forms a binary tree or hierarchy.
Note: the resulting clusters are still “hard” and induce a partition
Closest pair of clusters
§ Many variants of defining the closest pair of clusters (see the SciPy sketch below)
§ Single-link: similarity of the most cosine-similar pair of points
§ Complete-link: similarity of the "furthest" points, i.e., the least cosine-similar pair
§ Centroid: clusters whose centroids (centers of gravity) are the most cosine-similar
§ Average-link: average cosine between all pairs of elements
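These variants map directly onto SciPy's hierarchical clustering (a sketch under our own setup; SciPy works with distances, so we use cosine distance = 1 − cosine similarity):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.rand(20, 5)               # stand-in for doc vectors
D = pdist(X, metric="cosine")           # condensed pairwise cosine distances
Z = linkage(D, method="single")         # or "complete" / "average";
                                        # "centroid" needs raw Euclidean vectors
labels = fcluster(Z, t=5, criterion="maxclust")  # cut dendrogram into 5 clusters
```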
Single Link Agglomerative Clustering
§ Use the maximum similarity of pairs:
sim(ci, cj) = max_{x∈ci, y∈cj} sim(x, y)
§ Can result in "straggly" (long and thin) clusters due to the chaining effect.
§ After merging ci and cj, the similarity of the resulting cluster to another cluster ck is:
sim((ci ∪ cj), ck) = max( sim(ci, ck), sim(cj, ck) )
Single Link Example
Complete Link
§ Use the minimum similarity of pairs:
sim(ci, cj) = min_{x∈ci, y∈cj} sim(x, y)
§ Makes "tighter," spherical clusters that are typically preferable.
§ After merging ci and cj, the similarity of the resulting cluster to another cluster ck is:
sim((ci ∪ cj), ck) = min( sim(ci, ck), sim(cj, ck) )
Complete Link Example
Group Average
§ Similarity of two clusters = average similarity of all pairs within the merged cluster:
sim(ci, cj) = 1/(|ci ∪ cj| (|ci ∪ cj| − 1)) ∑_{x ∈ ci∪cj} ∑_{y ∈ ci∪cj, y≠x} sim(x, y)
§ Compromise between single- and complete-link.
§ Two options:
§ Average over all ordered pairs in the merged cluster
§ Average over all pairs between the two original clusters
§ No clear difference in efficacy
Computing Group Average Similarity
§ Always maintain the sum of vectors in each cluster:
s(cj) = ∑_{x∈cj} x
§ Compute the similarity of clusters in constant time:
sim(ci, cj) = [ (s(ci) + s(cj)) · (s(ci) + s(cj)) − (|ci| + |cj|) ] / [ (|ci| + |cj|) (|ci| + |cj| − 1) ]
What Is A Good Clustering?
§ Internal criterion: a good clustering will produce high-quality clusters in which:
§ Intra-cluster distances are minimized
§ Inter-cluster distances are maximized
§ The measured quality also depends on the document representation and similarity measure used
External criteria for clustering quality
§ Quality is measured by the clustering's ability to discover some or all of the hidden patterns or latent classes in gold-standard data
§ Assesses a clustering with respect to ground truth … requires labeled data
External Evaluation of Cluster Quality
§ Simple measure: purity, the ratio between the size of the dominant ground-truth class in cluster ωi and the size of cluster ωi:
Purity(ωi) = (1/ni) max_j nij,   j ∈ [1, C]
§ Assume documents come from C gold-standard classes, while our clustering algorithm produces K clusters ω1, ω2, …, ωK, where ωi has ni members, nij of which belong to class j.
§ Biased: purity is maximized by putting each document in its own cluster (n clusters)
Purity example
Circles are clusters produced by the system; colors are ground-truth classes.
Cluster I: Purity = 1/6 × max(5, 1, 0) = 5/6
Cluster II: Purity = 1/6 × max(1, 4, 1) = 4/6
Cluster III: Purity = 1/5 × max(2, 0, 3) = 3/5
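A sketch of the computation in Python (overall purity is the size-weighted average of the per-cluster purities above; the label lists reproduce the slide's three clusters):

```python
from collections import Counter

def purity(clusters):
    """clusters: list of lists of gold-standard class labels."""
    n = sum(len(c) for c in clusters)
    return sum(max(Counter(c).values()) for c in clusters) / n

clusters = [["x"] * 5 + ["o"],            # Cluster I: majority class has 5 of 6
            ["o"] * 4 + ["x", "d"],       # Cluster II: majority class has 4 of 6
            ["d"] * 3 + ["x"] * 2]        # Cluster III: majority class has 3 of 5
print(purity(clusters))                   # (5 + 4 + 3) / 17 ≈ 0.71
```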
Summary of Text Clustering
§ Text clustering is an unsupervised, general text mining technique used to
§ obtain an overall picture of the text content (exploring text data)
§ discover interesting clustering structures in text data
§ Many approaches are possible
§ Strong clusters tend to show up no matter what method is used
§ The effectiveness of a method depends heavily on whether the desired clustering bias is captured appropriately (either through the right generative model or the right similarity function)
§ Deciding the optimal number of clusters is generally a difficult problem for any method, due to the unsupervised nature of the task
§ Evaluation of clustering results can be done both directly and indirectly
Suggested Reading
§ Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge: Cambridge University Press, 2008. (Chapter 16)