SLIDE 1

Stability Analysis For Topic Models

Dr. Derek Greene
Insight @ UCD

SLIDE 2

Motivation

  • Key challenge in topic modeling: selecting an appropriate number of topics for a corpus.
  • Choosing too few topics will produce results that are overly broad.
  • Choosing too many will result in the “over-clustering” of a corpus into many small, highly-similar topics.
  • In the literature, topic modeling results are often presented as lists of top-ranked terms. But how robust are these rankings?
  • Stability analysis has been used elsewhere to measure the ability of an algorithm to produce similar solutions on data originating from the same source (Levine & Domany, 2001).


Proposal: term-centric stability approach for selecting the number of topics in a corpus, based on agreement between term rankings.

SLIDE 3

Term Ranking Similarity

Rank | Ranking R1 | Ranking R2
1    | film       | celebrity
2    | music      | music
3    | awards     | awards
4    | star       | star
5    | band       | ceremony
6    | album      | band
7    | oscar      | movie
8    | movie      | oscar
9    | cinema     | cinema
10   | song       | film

  • Simple approaches:
  • Measure correlation (e.g. Spearman).
  • Measure overlap between the two sets: Jaccard index J(R1, R2) = |R1 ∩ R2| / |R1 ∪ R2|.

  • How do we deal with…
  • Indefiniteness (i.e. missing terms).
  • Positional information.


➡ We propose a “top-weighted” similarity measure that can also handle indefinite rankings.

Initial Problem: Given a pair of ranked lists of terms, how can we measure the similarity between them?

SLIDE 4

Term Ranking Similarity

Average Jaccard (AJ) Similarity: calculate the average of the Jaccard scores between the top d terms of the two ranked lists, over all depths d ∈ [1, t].


AJ(Ri, Rj) = (1/t) · Σ_{d=1}^{t} γd(Ri, Rj),   where   γd(Ri, Rj) = |Ri,d ∩ Rj,d| / |Ri,d ∪ Rj,d|

Example - AJ Similarity for two ranked lists with t=5 terms:

d | R1,d                           | R2,d                           | Jacd  | AJ
1 | album                          | sport                          | 0.000 | 0.000
2 | album, music                   | sport, best                    | 0.000 | 0.000
3 | album, music, best             | sport, best, win               | 0.200 | 0.067
4 | album, music, best, award      | sport, best, win, medal        | 0.143 | 0.086
5 | album, music, best, award, win | sport, best, win, medal, award | 0.429 | 0.154

➡ Differences at the top of the ranked lists have more influence than differences at the tail of the lists.
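The measure above is straightforward to implement. Below is a minimal Python sketch of AJ as defined on this slide; the function name and layout are illustrative and are not taken from the authors' released code.

```python
def average_jaccard(r1, r2):
    """Top-weighted similarity between two ranked term lists of equal length t."""
    t = len(r1)
    total = 0.0
    for d in range(1, t + 1):
        top1, top2 = set(r1[:d]), set(r2[:d])          # top-d terms of each ranking
        total += len(top1 & top2) / len(top1 | top2)   # Jaccard score at depth d
    return total / t                                   # average over depths 1..t

# Reproduces the worked example above (AJ = 0.154 at t=5):
r1 = ["album", "music", "best", "award", "win"]
r2 = ["sport", "best", "win", "medal", "award"]
print(round(average_jaccard(r1, r2), 3))  # 0.154
```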

SLIDE 5

Topic Model Agreement


  • Proposed Strategy:
  • 1. Build a k × k Average Jaccard similarity matrix.
  • 2. Find the optimal match between its rows and columns using the Hungarian assignment method.
  • 3. Measure agreement as the average similarity between matched topics.

Ranking Set #1:                    Ranking Set #2:
R11 = {sport, win, award}          R21 = {finance, bank, economy}
R12 = {bank, finance, money}       R22 = {music, band, award}
R13 = {music, album, band}         R23 = {win, sport, money}

AJ Similarity Matrix:

      R21    R22    R23
R11   0.00   0.07   0.50
R12   0.50   0.00   0.07
R13   0.00   0.61   0.00

Optimal match: π = {(R11, R23), (R12, R21), (R13, R22)}

agree(S1, S2) = (0.50 + 0.50 + 0.61) / 3 = 0.54

Next Problem: How to measure agreement between two topic models, each containing k ranked lists?
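A minimal sketch of this agreement score, reusing the average_jaccard() function from the previous sketch. The optimal matching uses SciPy's linear_sum_assignment, which minimises cost, so the similarity matrix is negated; names are illustrative, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def agreement(ranking_set1, ranking_set2):
    """Mean AJ similarity over the optimal one-to-one matching of k topics."""
    k = len(ranking_set1)
    sim = np.zeros((k, k))
    for i, ri in enumerate(ranking_set1):
        for j, rj in enumerate(ranking_set2):
            sim[i, j] = average_jaccard(ri, rj)   # k x k AJ similarity matrix
    rows, cols = linear_sum_assignment(-sim)      # Hungarian match, maximising similarity
    return sim[rows, cols].mean()

# Reproduces the worked example above (agree(S1, S2) = 0.54):
s1 = [["sport", "win", "award"], ["bank", "finance", "money"], ["music", "album", "band"]]
s2 = [["finance", "bank", "economy"], ["music", "band", "award"], ["win", "sport", "money"]]
print(round(agreement(s1, s2), 2))  # 0.54
```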

SLIDE 6

Model Selection

  • Q. How can we use the agreement between pairs of topic models to choose the number of topics in a corpus?
  • Proposal:
  • Generate topics on different samples of the corpus.
  • Measure term agreement between topics and a “reference set” of topics.
  • Higher agreement between terms ➢ A more stable topic model.

Run 1:

Rank | Topic 1    | Topic 2
1    | oil        | win
2    | bank       | players
3    | election   | minister
4    | policy     | party
5    | government | ireland
6    | match      | club
7    | senate     | year
8    | democracy  | election
9    | firm       | coalition
10   | team       | first

Run 2:

Rank | Topic 1  | Topic 2
1    | cup      | first
2    | labour   | sales
3    | growth   | year
4    | team     | minister
5    | senate   | firm
6    | minister | match
7    | ireland  | coalition
8    | players  | team
9    | year     | election
10   | economy  | policy

Low agreement between top-ranked terms ➡ Low stability for k=2.

SLIDE 7

Model Selection

  • Q. How can we use the agreement between pairs of topic models to choose the number of topics in a corpus?
  • Proposal:
  • Generate topics on different samples of the corpus.
  • Measure term agreement between topics and a “reference set” of topics.
  • Higher agreement between terms ➢ A more stable topic model.

Run 1:

Rank | Topic 1 | Topic 2 | Topic 3
1    | growth  | game    | labour
2    | company | ireland | election
3    | market  | win     | vote
4    | economy | cup     | party
5    | bank    | goal    | government
6    | year    | match   | coalition
7    | firm    | team    | minister
8    | sales   | first   | policy
9    | shares  | club    | democracy
10   | oil     | players | first

Run 2:

Rank | Topic 1 | Topic 2 | Topic 3
1    | game    | growth  | labour
2    | win     | company | election
3    | ireland | market  | government
4    | cup     | economy | party
5    | match   | bank    | vote
6    | team    | shares  | policy
7    | first   | year    | minister
8    | players | firm    | democracy
9    | club    | sales   | senate
10   | goal    | oil     | coalition

High agreement between top-ranked terms ➡ High stability for k=3.

SLIDE 8

Model Selection - Algorithm


  • 1. Randomly generate τ samples of the data set, each containing β × n documents.
  • 2. For each value of k ∈ [kmin, kmax]:
    1. Apply the topic modeling algorithm to the complete data set of n documents to generate k topics, and represent the output as the reference ranking set S0.
    2. For each sample Xi:
       (a) Apply the topic modeling algorithm to Xi to generate k topics, and represent the output as the ranking set Si.
       (b) Calculate the agreement score agree(S0, Si).
    3. Compute the mean agreement score for k over all τ samples (Eqn. 4).
  • 3. Select one or more values for k based upon the highest mean agreement scores.

[Plot: mean agreement vs. number of topics k ∈ [2, 10], showing a single stability peak at k=5.]
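A hedged sketch of the selection loop above; fit_topic_model is a placeholder for any topic modeling algorithm that returns k ranked term lists, and agreement() is the illustrative function sketched earlier, so this is not the authors' implementation.

```python
import random

def select_k(docs, fit_topic_model, k_min=2, k_max=10, tau=20, beta=0.8):
    """Return the mean agreement score per k, following steps 1-3 above."""
    # Step 1: tau random samples, each containing beta * n documents
    samples = [random.sample(docs, int(beta * len(docs))) for _ in range(tau)]
    mean_agreement = {}
    for k in range(k_min, k_max + 1):                 # Step 2
        s0 = fit_topic_model(docs, k)                 # 2.1: reference ranking set S0
        scores = [agreement(s0, fit_topic_model(xi, k)) for xi in samples]  # 2.2
        mean_agreement[k] = sum(scores) / tau         # 2.3: mean agreement for this k
    # Step 3: candidate values of k, highest mean agreement first
    return dict(sorted(mean_agreement.items(), key=lambda kv: -kv[1]))
```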

SLIDE 9

Model Selection - Algorithm

(Algorithm steps as on the previous slide.)

[Plot: mean agreement vs. number of topics k ∈ [2, 10], with two peaks suggesting two potentially good models.]

SLIDE 10

Model Selection - Algorithm

(Algorithm steps as on the previous slide.)

[Plot: mean agreement vs. number of topics k ∈ [2, 10], low across all values of k: no coherent topics in the data?]

SLIDE 11

Aside: NMF For Topic Models

  • Applying NMF to Text Data:
  • 1. Construct a vector space model for the documents (after stop-word filtering), resulting in a document-term matrix A.
  • 2. Apply TF-IDF term weight normalisation to A.
  • 3. Normalize TF-IDF vectors to unit length.
  • 4. Apply Projected Gradient NMF to A.


  • NMF outputs two factors:
  • 1. Basis matrix: the topics in the data. Rank the entries in its columns to produce topic ranking sets.
  • 2. Coefficient matrix: the membership weights for documents relative to each topic.
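A rough scikit-learn sketch of this pipeline. Note that scikit-learn's NMF ships coordinate-descent and multiplicative-update solvers rather than the Projected Gradient solver named on the slide, so this approximates the described setup rather than reproducing the authors' code; the function name is illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

def nmf_topics(documents, k, top_t=10):
    """Return (ranked term lists per topic, document-topic membership weights)."""
    # Steps 1-3: stop-word filtering, TF-IDF weighting, unit-length (l2) rows
    vectorizer = TfidfVectorizer(stop_words="english", norm="l2")
    A = vectorizer.fit_transform(documents)           # document-term matrix A
    terms = vectorizer.get_feature_names_out()

    # Step 4: factorise A ~ W x H (coordinate descent here, not Projected Gradient)
    model = NMF(n_components=k, init="nndsvd")
    W = model.fit_transform(A)   # document membership weights (the coefficient matrix)
    H = model.components_        # per-topic term weights (the slide's basis matrix)

    # Rank the entries of each topic to produce the topic ranking sets
    rankings = [[terms[i] for i in row.argsort()[::-1][:top_t]] for row in H]
    return rankings, W
```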

SLIDE 12

Experimental Evaluation

  • Experimental Setup:
  • Examine topic stability for k ∈ [2, 12].
  • Reference ranking set produced using NNDSVD + NMF on the complete corpus.
  • Generated 100 test ranking sets using random initialisation + NMF, randomly sampling 80% of documents.
  • Measure agreement using the top 20 terms.


  • Comparison:
  • Apply a popular existing approach for selecting the rank for NMF, based on the cophenetic correlation of a consensus matrix (Brunet et al., 2004).
  • Compare both results to the ground truth labels for each corpus.
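For illustration, this setup could be wired onto the earlier sketches roughly as follows. Here corpus is a placeholder list of document strings, and, unlike the slide's setup, the same NNDSVD-initialised factorisation is used for both the reference set and the sampled runs.

```python
# Hypothetical wiring of the experimental setup onto the earlier sketches.
fit = lambda docs, k: nmf_topics(docs, k, top_t=20)[0]   # keep only the ranking sets
scores = select_k(corpus, fit, k_min=2, k_max=12, tau=100, beta=0.8)
print(scores)   # mean agreement per k, highest first
```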
SLIDE 13

Experimental Results

bbc corpus: [Plot: Stability (t=20) vs. Consensus score for k ∈ [2, 12].] k=5 ground truth labels.

bbcsport corpus: [Plot: Stability (t=20) vs. Consensus score for k ∈ [2, 12].] 5 ground truth labels, but “athletics” & “tennis” often merged.

guardian-2013 corpus: [Plot: Stability (t=20) vs. Consensus score for k ∈ [2, 12].] k=6 ground truth labels; “Books”, “Fashion” & “Music” merged into a culture topic at k=3.

SLIDE 14

Experimental Results

irishtimes-2013 corpus: [Plot: Stability (t=20) vs. Consensus score for k ∈ [2, 12].] k=7 ground truth labels; at k=2, “sport” vs everything else.

Top terms for irishtimes-2013 (k = 2):

Rank | Topic 1  | Topic 2
1    | game     | cent
2    | against  | government
3    | team     | court
4    | ireland  | health
5    | players  | ireland
6    | time     | minister
7    | cup      | people
8    | back     | tax
9    | violates | dublin
10   | win      | irish

nytimes-1999 corpus: [Plot: Stability (t=20) vs. Consensus score for k ∈ [2, 12].] At k=2, “sport” vs everything else; ground truth has 4 labels, stability suggests k=6.

Top terms for nytimes-1999 (k = 4):

Rank | Topic 1  | Topic 2   | Topic 3   | Topic 4
1    | game     | company   | yr        | mets
2    | knicks   | stock     | bills     | yankees
3    | team     | market    | bond      | game
4    | season   | business  | rate      | inning
5    | coach    | companies | infl      | valentine
6    | points   | shares    | bds       | season
7    | play     | stocks    | bd        | torre
8    | league   | york      | month     | baseball
9    | players  | investors | municipal | run
10   | sprewell | bank      | buyer     | clemens

Ground truth does not always correspond well to the actual data! This can arise when metadata is used as ground truth for ML experiments.

SLIDE 15

Summary

  • Proposed a new method for choosing the number of topics using a term-centric stability analysis strategy.
  • Using rankings rather than raw factor values or probabilities means we can generalise to any topic modeling approach that represents topics as term rankings.


  • Future work:
  • Evaluate the topic stability method with LDA.
  • Build ensembles of topic models to provide better term rankings and document clusters.
  • Apply term agreement measures in the context of dynamic topic models.

SLIDE 16

Any Questions ?

http://arxiv.org/abs/1404.4606

  • https://github.com/derekgreene/topic-stability
SLIDE 17

References

  • Greene, D., O’Callaghan, D. & Cunningham, P. How Many Topics? Stability Analysis for Topic Models. arXiv pre-print 1404.4606, April 2014.
  • Levine, E. & Domany, E. Resampling method for unsupervised estimation of cluster validity. Neural Computation, 13, 2001.
  • Tibshirani, R., Walther, G., Botstein, D. & Brown, P. Cluster validation by prediction strength. Tech. rep., Dept. Statistics, Stanford University, 2001.
  • Brunet, J.P., Tamayo, P., Golub, T.R. & Mesirov, J.P. Metagenes and molecular pattern discovery using matrix factorization. Proc. National Academy of Sciences 101(12), 2004.
