

  1. Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis
     Jian Tang, Zhaoshi Meng, XuanLong Nguyen, Qiaozhu Mei, Ming Zhang
     ICML 2014 Best Paper
     Presenters: Lu Lin & Tianlu Wang
     CS6501: Text Mining

  2. Outline
     ❖ Background and Motivation
     ❖ Posterior Contraction Analysis
     ❖ Empirical Study & Practical Guidance

  3. Background
     ❖ Latent Dirichlet Allocation (LDA) for topic modeling: D documents, each with N words, generated from K topics
       w_dn: observed words
       θ_d: document-topic proportions
       z_dn: topic indicators
       φ_k: topic-word distributions
     ❖ Generative process:
       φ_k | β ~ Dirichlet(β)
       θ_d | α ~ Dirichlet(α)
       z_dn | θ_d ~ Multinomial(θ_d)
       w_dn | φ, z_dn ~ Multinomial(φ_{z_dn})
     ❖ Bayesian estimation:
       (φ̂, θ̂) = argmax_{φ,θ} p(φ, θ | w) ∝ argmax_{φ,θ} p(w | φ, θ) p(φ, θ)
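The generative process on this slide can be sketched directly in numpy. This is a minimal simulation, not the paper's code; the corpus sizes and hyperparameter values below are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small configuration (not from the slides).
D, N, K, V = 100, 50, 5, 1000   # documents, words per document, topics, vocabulary size
alpha, beta = 0.1, 0.01         # Dirichlet hyperparameters

# phi_k | beta ~ Dirichlet(beta): topic-word distributions, one row per topic (K x V)
phi = rng.dirichlet(np.full(V, beta), size=K)

docs = []
for d in range(D):
    theta = rng.dirichlet(np.full(K, alpha))            # theta_d | alpha ~ Dirichlet(alpha)
    z = rng.choice(K, size=N, p=theta)                  # z_dn | theta_d ~ Multinomial(theta_d)
    w = np.array([rng.choice(V, p=phi[k]) for k in z])  # w_dn ~ Multinomial(phi_{z_dn})
    docs.append(w)
```

Each entry of `docs` is one document's word indices; inference would try to recover `phi` and the `theta` draws from `docs` alone.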

  4. Motivation
     ❖ Latent Dirichlet Allocation (LDA) for topic modeling
     ❖ Questions:
       ➢ Is my data topic-model "friendly"? Why did LDA fail on my data?
       ➢ How many documents do I need to learn 100 topics?
     ❖ What factors affect LDA's performance?
       ➢ Number of documents D
       ➢ Length of individual documents N
       ➢ Number of topics K
       ➢ Dirichlet hyper-parameters
     ❖ Formulating the goal:
       ➢ How fast (at what rate) does the posterior distribution of the latent topics converge to the true value as D and N approach infinity? ⟹ posterior contraction analysis

  5. Posterior Contraction Analysis
     ❖ Latent topic polytope in LDA
       ➢ Representation of the latent topic structure through a convex hull: the topic polytope G = conv(φ_1, …, φ_K)
       ➢ Minimum-matching Euclidean distance between two polytopes G and G':
         d_M(G, G') = max{ d(G, G'), d(G', G) }, where
         d(G, G') = max_{φ ∈ extr(G)} min_{φ' ∈ extr(G')} ||φ − φ'||
         o "extr" denotes the extreme points, i.e., the topics in LDA
         o Equivalent to the Hausdorff metric in convex geometry
     ❖ Posterior contraction analysis
       ➢ How fast does the posterior of G converge to the true polytope G*, i.e., what is the smallest ε such that d_M(G, G*) ≤ ε with high posterior probability?
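The minimum-matching distance on this slide is easy to compute when each polytope is given by its extreme points. A minimal sketch, assuming topics are stored as rows of a matrix:

```python
import numpy as np

def mm_distance(G1, G2):
    """Minimum-matching Euclidean distance d_M between two topic polytopes,
    each represented by its extreme points (one topic per row)."""
    # Pairwise Euclidean distances between every topic in G1 and every topic in G2.
    pair = np.linalg.norm(G1[:, None, :] - G2[None, :, :], axis=-1)  # shape (K1, K2)
    # d(G1, G2): worst-case distance from a topic in G1 to its nearest topic in G2.
    d12 = pair.min(axis=1).max()
    # d(G2, G1): the same in the other direction.
    d21 = pair.min(axis=0).max()
    # d_M is the maximum of the two one-sided distances.
    return max(d12, d21)
```

Identical topic sets give distance 0, and a topic with no nearby match in the other polytope dominates the score, which is what makes this a strict metric on recovered topics.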

  6. Posterior Contraction Analysis
     ❖ Theorem 1
       Let the Dirichlet parameters for the topic proportions satisfy α_k ∈ (0,1], and assume either one holds:
       (A1) K = K*, i.e., the true number of topics is known;
       (A2) the Euclidean distance between each pair of topics is bounded from below.
       Then as D → ∞ and N → ∞ such that N > log D:
         Π( d_M(G, G*) ≤ C ε_{D,N} | data ) → 1,
       where the upper bound on the contraction rate is ε_{D,N} = ( log D / D + log N / N )^{1/2}
     ❖ Insights
       ➢ The length of documents N should be at least on the order of the logarithm of the number of documents D
       ➢ Convergence rate: max{ (log D / D)^{1/2}, (log N / N)^{1/2} }
       ➢ The rate does not depend on the number of topics K if K* is known
       ➢ The overfitted setting, i.e., K ≫ K*, is preferred, because underfitting leads to a persistent error (see Theorem 2)
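The Theorem 1 bound can be evaluated numerically to see how the rate shrinks as the corpus grows. A small sketch of the rate formula ε_{D,N} = (log D / D + log N / N)^{1/2} as stated on this slide:

```python
import math

def contraction_rate(D, N):
    """Theorem 1 upper bound on the contraction rate:
    epsilon_{D,N} = (log D / D + log N / N) ** 0.5."""
    return math.sqrt(math.log(D) / D + math.log(N) / N)

# Growing either D or N tightens the bound; the slower-growing of the
# two terms dominates, matching the max{...} form of the insight above.
for D, N in [(100, 100), (10_000, 100), (10_000, 1_000)]:
    print(D, N, round(contraction_rate(D, N), 4))
```

Note how increasing D alone (second row) helps only until the log N / N term dominates; both D and N must grow for the bound to keep shrinking.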

  7. Posterior Contraction Analysis
     ❖ Theorem 2
       Under the same conditions as the previous theorem, except that neither (A1) nor (A2) holds:
       (A1) K = K*, i.e., the true number of topics is known;
       (A2) the Euclidean distance between each pair of topics is bounded from below.
       Then for K* < K ≤ |V|:
         Π( d_M(G, G*) ≤ C ε_{D,N} | data ) → 1,
       where the upper bound on the contraction rate is ε_{D,N} = ( log D / D + log N / N )^{1/(2(K−1))}
     ❖ Insights
       ➢ The convergence is very slow, and depends on K
       ➢ Underfitting (K < K*) results in a persistent error even with infinite data, and is thus not considered
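Comparing the two theorems numerically makes the "very slow" insight concrete. A sketch assuming the K-dependent exponent 1/(2(K−1)) as read off the theorem statement on this slide:

```python
import math

def rate_known_k(D, N):
    """Theorem 1 bound (K* known): (log D / D + log N / N) ** (1/2)."""
    return (math.log(D) / D + math.log(N) / N) ** 0.5

def rate_overfitted(D, N, K):
    """Theorem 2 bound (no (A1)/(A2)): the same base raised to 1 / (2 * (K - 1)),
    so a larger K flattens the exponent and slows convergence."""
    return (math.log(D) / D + math.log(N) / N) ** (1.0 / (2 * (K - 1)))

# At D = N = 10**4, the overfitted bound with K = 10 is orders of
# magnitude looser than the known-K* bound on the same data.
print(rate_known_k(10**4, 10**4), rate_overfitted(10**4, 10**4, 10))
```

At K = 2 the two bounds coincide (exponent 1/2), and every extra topic beyond that weakens the guarantee, which is the K-dependence the slide warns about.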

  8. Empirical study & practical guidance
     Synthetic data:
     ● Ground-truth number of topics K* = 3
     ● Vocabulary size |V| = 5000
     ● Metric: minimum-matching Euclidean distance defined in (1)
     ● Focus on the variation of the following parameters:
       ○ D: number of documents
       ○ N: length of documents
       ○ β: topic-word Dirichlet hyperparameter
       ○ K: number of topics specified for inference

  9. Synthetic Data
     ● A larger number of documents => better performance
     ● An overly large number of topics in the model => worse performance
     ● Topics are known to be word-sparse, so the topic-word Dirichlet hyperparameter should be set small (e.g., β = 0.01)
     ● Longer documents => better performance

  10. Synthetic Data
      Verifying the exponential form of the theoretical bounds provided by the theorems

  11. Empirical study & practical guidance
      Real data:
      ● Metric: pointwise mutual information (Newman et al., 2011)
      ● Focus on the variation of the following parameters:
        ○ D: number of documents
        ○ N: length of documents
        ○ β: topic-word Dirichlet hyperparameter
        ○ α: document-topic Dirichlet hyperparameter
        ○ K: number of topics specified for inference
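The pointwise-mutual-information metric used for the real data can be sketched from document co-occurrence counts. This is a generic PMI coherence score in the spirit of Newman et al. (2011), not the paper's exact implementation; the smoothing constant `eps` is an assumption added to avoid log(0).

```python
import math
from itertools import combinations

def topic_pmi(topic_words, docs, eps=1e-12):
    """Average pointwise mutual information over word pairs in one topic:
    PMI(w1, w2) = log( p(w1, w2) / (p(w1) * p(w2)) ),
    with probabilities estimated from document-level co-occurrence."""
    doc_sets = [set(doc) for doc in docs]
    total = len(doc_sets)

    def p(*words):
        # Fraction of documents containing all the given words.
        return sum(all(w in s for w in words) for s in doc_sets) / total

    scores = [
        math.log((p(w1, w2) + eps) / (p(w1) * p(w2) + eps))
        for w1, w2 in combinations(topic_words, 2)
    ]
    return sum(scores) / len(scores)
```

Topics whose top words actually co-occur in documents score positive PMI; incoherent topics, whose top words rarely share a document, score near or below zero.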

  12. Real Data
      ● Individual documents associated with a smaller number of topics => better performance
      ● Since each document is associated with only a few topics, the document-topic Dirichlet hyperparameter should be set small (e.g., α = 0.1)
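The recommendation to set α small can be checked directly: a small symmetric Dirichlet parameter concentrates each document's mass on a few topics. A quick numerical sketch, with K = 50 and the comparison value α = 10 chosen here purely for contrast:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 50

# Draw 1000 document-topic proportion vectors under a small vs. a large alpha.
sparse = rng.dirichlet(np.full(K, 0.1), size=1000)   # alpha = 0.1 (as recommended)
dense = rng.dirichlet(np.full(K, 10.0), size=1000)   # alpha = 10 (for contrast)

# Average probability mass carried by each document's top 3 topics.
top_mass_sparse = np.sort(sparse, axis=1)[:, -3:].sum(axis=1).mean()
top_mass_dense = np.sort(dense, axis=1)[:, -3:].sum(axis=1).mean()
print(top_mass_sparse, top_mass_dense)
```

With α = 0.1 the top 3 topics carry most of a document's mass, matching the "each document is associated with few topics" observation; with α = 10 the mass spreads nearly uniformly over all 50 topics.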

