CS6501: Text Mining
Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis
Jian Tang, Zhaoshi Meng, XuanLong Nguyen, Qiaozhu Mei, Ming Zhang
ICML 2014 Best Paper
Presenters: Lu Lin & Tianlu Wang
Outline
❖ Background and Motivation
❖ Posterior Contraction Analysis
❖ Empirical Study & Practical Guidance
Background
❖ Latent Dirichlet Allocation (LDA) for topic modeling
➢ D documents, each with N words, generated from K topics
  - w_dn: observed words
  - θ_d: document-topic proportions
  - z_dn: topic indicators
  - β_k: topic-word proportions
➢ Generative process:
  - β_k | η ~ Dirichlet(η)
  - θ_d | α ~ Dirichlet(α)
  - z_dn | θ_d ~ Multinomial(θ_d)
  - w_dn | β, z_dn ~ Multinomial(β_{z_dn})
➢ Bayesian estimation:
  (β̂, θ̂) = argmax_{β,θ} p(β, θ | w), where p(β, θ | w) ∝ p(w | β, θ) p(β, θ)
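As a concrete illustration, the generative process above can be simulated in a few lines of plain Python (a toy sketch, not the paper's code; the corpus sizes and hyperparameter values below are arbitrary):

```python
import random

def dirichlet(alpha):
    """Sample a probability vector from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]

def categorical(probs):
    """Draw an index i with probability probs[i]."""
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def generate_corpus(D, N, K, V, alpha=0.1, eta=0.01, seed=0):
    """Generate D documents of N words each from a K-topic LDA model over V word types."""
    random.seed(seed)
    beta = [dirichlet([eta] * V) for _ in range(K)]   # beta_k | eta ~ Dirichlet(eta)
    corpus = []
    for _ in range(D):
        theta = dirichlet([alpha] * K)                # theta_d | alpha ~ Dirichlet(alpha)
        doc = []
        for _ in range(N):
            z = categorical(theta)                    # z_dn | theta_d ~ Multinomial(theta_d)
            doc.append(categorical(beta[z]))          # w_dn | beta, z_dn ~ Multinomial(beta_{z_dn})
        corpus.append(doc)
    return corpus

corpus = generate_corpus(D=5, N=20, K=3, V=50)
```

Sampling a Dirichlet as normalized Gamma draws keeps the sketch dependency-free; in practice one would use numpy's `random.dirichlet`.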
Motivation
❖ Latent Dirichlet Allocation (LDA) for topic modeling
❖ Questions:
➢ Is my data topic-model "friendly"? Why did LDA fail on my data?
➢ How many documents do I need to learn 100 topics?
❖ What factors affect LDA's performance?
➢ # documents D
➢ Length of individual documents N
➢ # topics K
➢ Dirichlet hyper-parameters
❖ Formulate the goal:
➢ How fast (at what rate) does the posterior distribution of the topics β_k converge to the true topics as D and N approach infinity? ⟹ posterior contraction analysis
Posterior Contraction Analysis
❖ Latent Topic Polytope in LDA
➢ Representation of the latent topic structure as a convex hull: the topic polytope
  G = conv(β_1, ..., β_K)
➢ Distance between two polytopes G_1 and G_2 (minimum-matching Euclidean distance):
  d_M(G_1, G_2) = max{ d(G_1, G_2), d(G_2, G_1) }
  d(G_1, G_2) = max_{x_1 ∈ extr(G_1)} min_{x_2 ∈ extr(G_2)} ||x_1 - x_2||_2
  - "extr" denotes the extreme points, i.e., the topics in LDA
  - Equivalent to the Hausdorff metric in convex geometry
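Treating each topic as a point in the vocabulary simplex, the minimum-matching distance can be transcribed directly (a sketch assuming the extreme points, i.e., the topic vectors, are given explicitly as lists):

```python
import math

def euclid(x, y):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def directed(G1, G2):
    # d(G1, G2): worst case over topics in G1 of the distance to the closest topic in G2
    return max(min(euclid(x1, x2) for x2 in G2) for x1 in G1)

def min_matching(G1, G2):
    # d_M(G1, G2) = max{ d(G1, G2), d(G2, G1) }: symmetrized over both directions
    return max(directed(G1, G2), directed(G2, G1))
```

The distance is zero whenever the two topic sets coincide up to a permutation, which is what makes it a natural metric for comparing an estimated polytope against the truth.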
❖ Posterior Contraction Analysis
➢ How fast does the posterior distribution of the estimated polytope converge to the true polytope G*?
➢ i.e., d_M(Ĝ, G*) ≤ ?
Posterior Contraction Analysis
❖ Theorem 1
Let the Dirichlet parameters for the topic proportions satisfy α_k ∈ (0, 1], and assume that either of the following holds:
(A1) K = K*, i.e., the true number of topics is known;
(A2) the Euclidean distance between each pair of topics is bounded from below.
Then as D → ∞ and N → ∞ such that N > log D:
  Π( d_M(Ĝ, G*) ≤ C · ε_{D,N} | data ) → 1
where the upper bound on the contraction rate is
  ε_{D,N} = ( log D / D + log N / N + log N / D )^{1/2}
❖ Insights
➢ The document length N should be at least on the order of the logarithm of the number of documents D
➢ The convergence rate behaves like max{ log D / D, log N / N, log N / D } (up to the square root)
➢ The rate does not depend on the number of topics K if K* is known
➢ The overfitted setting, i.e., K ≫ K*, is preferred in practice, because the true K* is rarely known and underfitting incurs a persistent error (cf. Theorem 2)
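Taking the Theorem 1 bound ε_{D,N} = (log D / D + log N / N + log N / D)^{1/2} at face value, and ignoring the unspecified constant C, a small helper makes the scaling behavior concrete:

```python
import math

def contraction_rate(D, N):
    """Theorem 1 upper bound on the posterior contraction rate, up to a constant factor."""
    return math.sqrt(math.log(D) / D + math.log(N) / N + math.log(N) / D)

# Both more documents and longer documents tighten the bound.
small = contraction_rate(100, 100)
large = contraction_rate(10000, 1000)
```

Note that the log N / D term means longer documents cannot fully compensate for having too few documents: D must grow as well for the bound to shrink.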
Posterior Contraction Analysis
❖ Theorem 2
Under the same conditions as Theorem 1, except that neither (A1) nor (A2) is required to hold:
(A1) K = K*, i.e., the true number of topics is known;
(A2) the Euclidean distance between each pair of topics is bounded from below.
Then for K* < K ≤ |V|:
  Π( d_M(Ĝ, G*) ≤ C · ε̄_{D,N} | data ) → 1
where the upper bound on the contraction rate is
  ε̄_{D,N} = ( log D / D + log N / N + log N / D )^{1/(2(K-1))}
❖ Insights
➢ Convergence is very slow and depends on K
➢ Underfitting (K < K*) results in a persistent error even with infinite data, and is therefore not considered
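Evaluating the Theorem 2 bound (the Theorem 1 quantity raised to the power 1/(2(K - 1)), ignoring constant factors) shows how quickly the guarantee weakens as the number of fitted topics grows:

```python
import math

def overfitted_rate(D, N, K):
    """Theorem 2 upper bound on the contraction rate in the overfitted setting (K > K*)."""
    base = math.log(D) / D + math.log(N) / N + math.log(N) / D
    return base ** (1.0 / (2 * (K - 1)))

# The bound creeps toward 1 (i.e., becomes vacuous) as K increases.
rates = [overfitted_rate(10000, 1000, K) for K in (2, 5, 10, 20)]
```

Since the base quantity is below 1 for any reasonably sized corpus, a shrinking exponent pushes the bound toward 1, which is why the slides warn against an overly large K.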
Empirical study & practical guidance
Synthetic Data:
- Ground truth number of topics K* = 3
- Vocabulary size |V| = 5000
- Metric: minimum-matching Euclidean distance (defined above)
- Focus on the variation of the following parameters:
  ○ D: number of documents
  ○ N: length of documents
  ○ β: topic-word Dirichlet hyperparameter
  ○ K: number of topics specified for inference
Synthetic Data
- Larger number of documents => better performance
- Overly large number of topics specified for the model => worse performance
- Topics are known to be word-sparse, so the topic-word hyperparameter should be set small (e.g., β = 0.01)
- Longer documents => better performance
Synthetic Data
To verify the exponential theoretical bounds provided by the theorems
Empirical study & practical guidance
Real Data:
- Metric: point-wise mutual information (Newman et al., 2011)
- Focus on the variation of the following parameters:
  ○ D: number of documents
  ○ N: length of documents
  ○ β: topic-word Dirichlet hyperparameter
  ○ α: document-topic Dirichlet hyperparameter
  ○ K: number of topics specified for inference
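The PMI-based coherence score of Newman et al. (2011) rewards topics whose top words co-occur often. A minimal sketch, assuming co-occurrence is counted at the document level (the reference corpus and smoothing choices vary across implementations):

```python
import math
from itertools import combinations

def topic_pmi(top_words, docs, eps=1e-12):
    """Average pairwise PMI of a topic's top words, estimated from document co-occurrence."""
    doc_sets = [set(d) for d in docs]
    D = len(doc_sets)
    def prob(*words):
        # Fraction of documents containing all the given words.
        return sum(all(w in s for w in words) for s in doc_sets) / D
    scores = [
        math.log((prob(w1, w2) + eps) / (prob(w1) * prob(w2) + eps))
        for w1, w2 in combinations(top_words, 2)
    ]
    return sum(scores) / len(scores)
```

A higher average PMI indicates a more coherent topic: word pairs that co-occur more often than chance contribute positive terms, and pairs that never co-occur contribute large negative ones.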
Real Data
- Documents associated with a smaller number of topics => better performance
- Since each document is associated with only a few topics, the document-topic hyperparameter should be set small (e.g., α = 0.1)