October 18 th , 2017 Adapted from UIUC CS410 Outline What is topic - PowerPoint PPT Presentation

DATA130006 Text Management and Analysis Language Model for Topic Analysis 魏忠钰复旦大学大数据学院 School of Data Science, Fudan University October 18 th , 2017 Adapted from UIUC CS410

Outline § What is topic mining?

Topic Mining and Analysis: Motivation § Topic » main idea discussed in text data § Theme/subject of a discussion or conversation § Different granularities (e.g., topic of a sentence, an article, etc.) § Many applications require discovery of topics in text § What are Weibo users talking about today? § What are the current research topics in data mining? How are they different from those 5 years ago? § What were the major topics debated in 2012 presidential election?

Topics as Knowledge About the World Non-Text Data Knowledge about the world Topic 1 Text Data + Context Time Real World Topic 2 … Location … … Topic k

Tasks of Topic Mining and Analysis Task 2: Figure out which documents Doc 1 Doc 2 cover which topics Text Data Topic 1 Topic 2 … Topic k Task 1: Discover k topics

Formal Definition of Topic Mining and Analysis § Input § A collection of N text documents C={d 1 , …, d N } § Number of topics : k § Output § k topics : { q 1 , …, q k } k § Coverage of topics in each d i : { p i1 , …, p ik } å p ij = 1 § p ij = prob. of d i covering topic q j = j 1 How to define q i ?

Initial Idea: Topic = Term … Doc 1 Doc 2 Doc N 30% Text Data p 11 q 1 p 21 =0 p N1 =0 “Sports” p 12 p 22 p N2 q 2 “Travel” … 12% q k p 2k p 1k p Nk “Science” 8%

Mining k Topical Terms from Collection C § Parse text in C to obtain candidate terms (e.g., term = word). § Design a scoring function to measure how good each term is as a topic. § Favor a representative term (high frequency is favored) § Avoid words that are too frequent (e.g., “the”, “a”, stop words). § TF-IDF weighting from retrieval can be very useful. § Domain-specific heuristics are possible (e.g., favor title words, hashtags in microblog). § Pick k terms with the highest scores but try to minimize redundancy. § If multiple terms are very similar or closely related, pick only one of them and ignore others.

Computing Topic Coverage: p ij Doc d i p i1 q 1 count(“sports”, d i )=4 “Sports” q count ( , d ) p i2 count(“travel”, d i ) =2 q 2 p = j i “Travel” … ij k å q count ( , d ) L i = L 1 q k p ik count(“science”, d i )=1 “Science”

How Well Does This Approach Work? Doc d i Cavaliers vs. Golden State Warriors: NBA playoff finals … basketball game … travel to Cleveland … star … q 1 1. Need to count “Sports” p µ = c (" sports " , d ) 0 i 1 i related words also! q 2 p µ = > “Travel” c (" travel " , d ) 1 0 … i 2 i 2. “Star” can be ambiguous (e.g., star in the sky). q k p µ = “Science” c (" science " , d ) 0 3. Mine complicated topics? ik i

Problems with “Term as Topic” 1. Lack of expressive power è Topic = {Multiple Words} • Can only represent simple/general topics • Can’t represent complicated topics 2. Incompleteness in vocabulary coverage + weights on words • Can’t capture variations of vocabulary (e.g., related words) 3. Word sense ambiguity è Split an ambiguous word • A topical term or related term can be ambiguous (e.g., basketball star vs. star in the sky) A probabilistic topic model can do all these!

Improved Idea: Topic = Word Distribution … q 1 q k q 2 “Science” “Travel” “Sports” P(w| q k ) P(w| q 1 ) P(w| q 2 ) travel 0.05 sports 0.02 science 0.04 attraction 0.03 game 0.01 scientist 0.03 trip 0.01 basketball 0.005 spaceship 0.006 flight 0.004 football 0.004 telescope 0.004 hotel 0.003 play 0.003 genomics 0.004 island 0.003 star 0.003 star 0.002 … … … culture 0.001 nba 0.001 genetics 0.001 … … … play 0.0002 travel 0.0005 travel 0.00001 … … … å q = p ( w | ) 1 Vocabulary Set: V={w1, w2,….} i Î w V

Probabilistic Topic Mining and Analysis § Input § A collection of N text documents C={d 1 , …, d N } § Vocabulary set: V={w 1 , …, w M } § Number of topics : k § Output å § k topics, each a word distribution : { q 1 , …, q k } q = p ( w | ) 1 i Î § Coverage of topics in each d i : { p i1 , …, p ik } w V k å p ij = § p ij =prob. of d i covering topic q j 1 = j 1

The Computation Task OUTPUT: { q 1 , …, q k }, { p i1 , …, p ik } INPUT: C, k, V … Doc N Doc 1 Doc 2 Text Data 30% sports 0.02 game 0.01 q 1 p 11 p 21 =0% p N1 =0% basketball 0.005 football 0.004 … q 2 travel 0.05 12% attraction 0.03 p 12 p 22 p N2 trip 0.01 … … 8% science 0.04 q k scientist 0.03 p 2k p 1k p Nk spaceship 0.006 …

Generative Model for Text Mining Modeling of Data Generation: P(Data |Model, L ) L =({ q 1 , …, q k }, { p 11 , …, p 1k }, …, { p N1 , …, p Nk }) How many parameters in total? Parameter Estimation/ Inferences L * = argmax L p(Data| Model, L ) P(Data |Model, L ) L L *

Simplest Case of Topic Model: Mining One Topic OUTPUT: { q } INPUT: C={d}, V P(w| q ) Doc d Text Data 100% text ? q mining ? association ? database ? … query ? …

Language Model Setup § Data : Document d= x 1 x 2 … x |d| , x i Î V={w 1 ,…, w M }is a word § Model : Unigram LM q : { q i =p(w i | q )}, i=1, …, M; q 1 +…+ q M =1 § Likelihood function: q = q ´ ´ q p ( d | ) p ( x | ) ... p ( x | ) 1 | d | = q c ( w , d ) ´ ´ q c ( w , d ) p ( w | ) ... p ( w | ) 1 M 1 M M M Õ Õ c ( w , d ) = q = q c ( w , d ) p ( w | ) i i i i = = i 1 i 1 M § ML estimate: Õ ˆ ˆ q q = q = q c ( w , d ) ( ,..., ) arg max p ( d | ) arg max i q q q q 1 M ,..., ,..., i 1 M 1 M = i 1

Computation of Maximum Likelihood Estimate M Õ ˆ ˆ Maximize p(d| q ) q q = q = q c ( w , d ) ( ,..., ) arg max p ( d | ) arg max i q q q q 1 M ,..., ,..., i 1 M 1 M = i 1 M å ˆ ˆ q q = q = q Max. Log-Likelihood ( ,..., ) arg max log[ p ( d | )] arg max c ( w , d ) log q q q q 1 M ,..., ,..., i i 1 M 1 M = i 1 M å Subject to constraint: q i = 1 Use Lagrange multiplier approach = i 1 M M Normalized ∑ ∑ Lagrange function: f ( θ | d ) = c ( w i , d )log θ i + λ ( − 1) θ i Counts i = 1 i = 1 ∂ f ( θ | d ) = c ( w i , d ) → θ i = − c ( w i , d ) + λ = 0 ∂ θ i θ i λ M N − c ( w i , d ) c ( w i , d ) = c ( w i , d ) θ i = p ( w i | ˆ ˆ ∑ ∑ = 1 c ( w i , d ) θ ) = → λ = − → M | d | λ ∑ i = 1 i = 1 c ( w i , d ) i = 1

What Does the Topic Look Like? p(w| q ) d the 0.031 a 0.018 Can we get rid of … Text mining text 0.04 these common words? paper mining 0.035 association 0.03 clustering 0.005 computer 0.0009 … food 0.000001 …

Factoring out Background Words p(w| q ) d the 0.031 a 0.018 How can we get rid of … Text mining text 0.04 these common words? paper mining 0.035 association 0.03 clustering 0.005 computer 0.0009 … food 0.000001 …

Generate d Using Two Word Distributions text 0.04 Topic: q d p( q d )+( q B )=1 mining 0.035 d association 0.03 clustering 0.005 P(w| q d ) P( q d )=0.5 … the 0.000001 Text mining Topic paper the 0.03 Choice p(w| q B ) a 0.02 P( q B )=0.5 is 0.015 we 0.01 food 0.003 … text 0.000006 Background (topic) q B …

What’s the probability of observing a word w? Topic: q d text 0.04 p( q d )+( q B )=1 mining 0.035 d association 0.03 P(“the”)=p( q d )p(“the”| q d ) + p( q B )p(“the”| q B ) clustering 0.005 P(w| q d ) = 0.5*0.000001+0.5*0.03 P( q d )=0.5 … “the”? the 0.000001 Topic P(“text”)=p( q d )p(“text”| q d ) + p( q B ) p(“text”| q B ) “text”? the 0.03 Choice p(w| q B ) a 0.02 = 0.5*0.04+0.5*0.000006 P( q B )=0.5 is 0.015 we 0.01 food 0.003 … text 0.000006 Background (topic) q B …

The Idea of a Mixture Model text 0.04 q d Mixture Model p( q d )+( q B )=1 mining 0.035 association 0.03 clustering 0.005 P( q d )=0.5 … “the”? the 0.000001 w Topic “text”? the 0.03 Choice q B a 0.02 P( q B )=0.5 is 0.015 we 0.01 food 0.003 … text 0.000006

As a Generative Model… Formally defines the following generative model: w p(w)=p( q d )p(w| q d ) + p( q B )p(w| q B ) Estimate of the model “discovers” two topics + topic coverage What if p( q d )=1 or p( q B )=1?

Mixture of Two Unigram Language Models § Data : Document d § Mixture Model : parameters L =({p(w| q d )}, {p(w| q B )}, p( q B ), p( q d )) § Two unigram LMs: q d (the topic of d); q B (background topic) § Mixing weight (topic choice): p( q d )+p( q B )=1 § Likelihood function: Õ Õ | d | | d | L = L = q q + q q p ( d | ) p ( x | ) [ p ( ) p ( x | ) p ( ) p ( x | )] i d i d B i B = = i 1 i 1 Õ M = q q + q q c ( w , d ) [ p ( ) p ( w | ) p ( ) p ( w | )] d i d B i B = i 1 § ML Estimate: L = L * arg max p ( d | ) L å å M M q = q = q + q = Subject to p ( w | ) p ( w | ) 1 p ( ) p ( ) 1 i d i B d B = = i 1 i 1

October 18 th , 2017 Adapted from UIUC CS410 Outline What is topic - PowerPoint PPT Presentation

DATA130006 Text Management and Analysis Language Model for Topic Analysis School of Data Science, Fudan University October 18 th , 2017 Adapted from UIUC CS410 Outline What is topic mining? Topic

Monday, 16 October 2017 Monday, 16 October 2017 Monday, 16 October 2017 Monday, 16 October 2017

3Q17 Results October, 27 th 2017 3Q 2017 Results October 27 th 2017 / 2 Disclaimer This document

Transportation & Mobility October 11, 2017 October 11, 2017 Community Dialogue Series

1 2 Monday, October 25, 2010 3 4 Monday, October 25, 2010 5 6 Monday, October 25, 2010 7

SBF AGM 2017 CEO Slides SBF AGM 2017 CEO Slides SBF AGM 2017 CEO Slides SBF AGM 2017 CEO Slides

THIRD QUARTER 2017 EARNINGS OCTOBER 31, 2017 THIRD QUARTER 2017 EARNINGS October 31, 2017 SAFE

in Gas & Services Paris, 25 October 2017 2017 Q3 Activity Fabienne Lecorvaisier Executive

2017. October 2017 1 Crowthorne Neighbourhood Area. October 2017 2 Meeting Agenda.

3/19/2017 Resource Aquisition And Transport in Vascular Plants 1 3/19/2017 2 3/19/2017 3

3/3/2017 Rick Guidotti 1 3/3/2017 2 3/3/2017 ALBINISM 3 3/3/2017 4 3/3/2017 5

VBP Bootcamp Managed Long Term Care October 2017 2 10/13/2017 October 2017 2 Agenda Area

VBP Bootcamp Contracting Course October 10, 2017 2 10/13/2017 October 2017 2 Agenda Area

October 2017 OVERVIEW OVERVIEW Demographic Trends Challenges & Opportunities

Proposed Fare Adjustments Public Meetings October 17, 2017 12 noon October 18, 2017 12 noon

Update of 11th meeting of the ERNs Board of Member States 11 October 2017 13th October 2017

Global and regional prospects 7 6.5 2017 (Jan16) 6.2 2017 (Apr16) 6 2017 (Jul16) 2017

PROGRAMMING FOR ADULTS WITH DEVELOPMENTAL DISABILITIES Why and How Barriers to and benefits of

testbed in the Do we need a testbed in the Do we need a COIN community and for what ? COIN

Contest Debjit Sinha 1 , Lus Guerra e Silva 2 , Jia Wang 3 , Shesha Raghunathan 4 , Dileep

1/ 26/ 2017 Know how. Know now. Know how. Know now. Purpose of the Contest An Overview of Changes

Establishing General Methodologies of Creative Problem-Solving / Task-Achieving: Beyond TRIZ

Planning new Districts in Freiburg Organizing political and administrative Processes Keeping

T HE L ATE R OMAN R EPUBLIC IN 2017: Recent Developments Lea Beness & Tom Hillard National

Lecture 2.2: Tautology and contradiction Matthew Macauley Department of Mathematical Sciences

October 18 th , 2017 Adapted from UIUC CS410 Outline What is topic - PowerPoint PPT Presentation

DATA130006 Text Management and Analysis Language Model for Topic Analysis School of Data Science, Fudan University October 18 th , 2017 Adapted from UIUC CS410 Outline What is topic mining? Topic

Monday, 16 October 2017 Monday, 16 October 2017 Monday, 16 October 2017 Monday, 16 October 2017

3Q17 Results October, 27 th 2017 3Q 2017 Results October 27 th 2017 / 2 Disclaimer This document

Transportation &amp; Mobility October 11, 2017 October 11, 2017 Community Dialogue Series

1 2 Monday, October 25, 2010 3 4 Monday, October 25, 2010 5 6 Monday, October 25, 2010 7

SBF AGM 2017 CEO Slides SBF AGM 2017 CEO Slides SBF AGM 2017 CEO Slides SBF AGM 2017 CEO Slides

THIRD QUARTER 2017 EARNINGS OCTOBER 31, 2017 THIRD QUARTER 2017 EARNINGS October 31, 2017 SAFE

in Gas &amp; Services Paris, 25 October 2017 2017 Q3 Activity Fabienne Lecorvaisier Executive

2017. October 2017 1 Crowthorne Neighbourhood Area. October 2017 2 Meeting Agenda.

3/19/2017 Resource Aquisition And Transport in Vascular Plants 1 3/19/2017 2 3/19/2017 3

3/3/2017 Rick Guidotti 1 3/3/2017 2 3/3/2017 ALBINISM 3 3/3/2017 4 3/3/2017 5

VBP Bootcamp Managed Long Term Care October 2017 2 10/13/2017 October 2017 2 Agenda Area

VBP Bootcamp Contracting Course October 10, 2017 2 10/13/2017 October 2017 2 Agenda Area

October 2017 OVERVIEW OVERVIEW Demographic Trends Challenges &amp; Opportunities

Proposed Fare Adjustments Public Meetings October 17, 2017 12 noon October 18, 2017 12 noon

Update of 11th meeting of the ERNs Board of Member States 11 October 2017 13th October 2017

Global and regional prospects 7 6.5 2017 (Jan16) 6.2 2017 (Apr16) 6 2017 (Jul16) 2017

PROGRAMMING FOR ADULTS WITH DEVELOPMENTAL DISABILITIES Why and How Barriers to and benefits of

testbed in the Do we need a testbed in the Do we need a COIN community and for what ? COIN

Contest Debjit Sinha 1 , Lus Guerra e Silva 2 , Jia Wang 3 , Shesha Raghunathan 4 , Dileep

1/ 26/ 2017 Know how. Know now. Know how. Know now. Purpose of the Contest An Overview of Changes

Establishing General Methodologies of Creative Problem-Solving / Task-Achieving: Beyond TRIZ

Planning new Districts in Freiburg Organizing political and administrative Processes Keeping

T HE L ATE R OMAN R EPUBLIC IN 2017: Recent Developments Lea Beness &amp; Tom Hillard National

Lecture 2.2: Tautology and contradiction Matthew Macauley Department of Mathematical Sciences

Transportation & Mobility October 11, 2017 October 11, 2017 Community Dialogue Series

in Gas & Services Paris, 25 October 2017 2017 Q3 Activity Fabienne Lecorvaisier Executive

October 2017 OVERVIEW OVERVIEW Demographic Trends Challenges & Opportunities

T HE L ATE R OMAN R EPUBLIC IN 2017: Recent Developments Lea Beness & Tom Hillard National