Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis


SLIDE 1

CS6501: Text Mining

Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis

Jian Tang, Zhaoshi Meng, XuanLong Nguyen, Qiaozhu Mei, Ming Zhang (ICML 2014 best paper)

Presenter: Lu Lin & Tianlu Wang

SLIDE 2

Outline

❖ Background and Motivation
❖ Posterior Contraction Analysis
❖ Empirical Study & Practical Guidance

SLIDE 3

Background

❖ Latent Dirichlet Allocation (LDA) for topic modeling

➢ D documents, each with N words, generated from K topics:

  • $w_{dn}$: observed words
  • $\theta_d$: document-topic proportions
  • $z_{dn}$: topic indicators
  • $\varphi_k$: topic-word proportions

➢ Generative process:

  $\varphi_k \mid \beta \sim \mathrm{Dirichlet}(\beta)$
  $\theta_d \mid \alpha \sim \mathrm{Dirichlet}(\alpha)$
  $z_{dn} \mid \theta_d \sim \mathrm{Multinomial}(\theta_d)$
  $w_{dn} \mid \varphi, z_{dn} \sim \mathrm{Multinomial}(\varphi_{z_{dn}})$

➢ Bayesian estimation:

  $\hat{\varphi}, \hat{\theta} = \arg\max_{\varphi,\theta} p(\varphi, \theta \mid w) \propto \arg\max_{\varphi,\theta} p(w \mid \varphi, \theta)\, p(\varphi, \theta)$
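As a concrete illustration, the generative process above can be sketched in a few lines of pure Python. This is a minimal sketch, not code from the paper: the function names and the Dirichlet sampler built from normalized Gamma draws are my own choices.

```python
import random

def dirichlet(concentration, dim, rng):
    """Sample a symmetric Dirichlet vector via normalized Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(draws)
    if total == 0.0:  # guard against floating-point underflow for tiny concentrations
        return [1.0 / dim] * dim
    return [x / total for x in draws]

def generate_corpus(D, N, K, V, alpha=0.1, beta=0.01, seed=0):
    """Generate D documents of N words each from the LDA generative process."""
    rng = random.Random(seed)
    # topic-word proportions: phi_k | beta ~ Dirichlet(beta)
    phi = [dirichlet(beta, V, rng) for _ in range(K)]
    corpus = []
    for _ in range(D):
        # document-topic proportions: theta_d | alpha ~ Dirichlet(alpha)
        theta = dirichlet(alpha, K, rng)
        doc = []
        for _ in range(N):
            # topic indicator: z_dn ~ Multinomial(theta_d)
            z = rng.choices(range(K), weights=theta)[0]
            # observed word: w_dn ~ Multinomial(phi_z)
            doc.append(rng.choices(range(V), weights=phi[z])[0])
        corpus.append(doc)
    return corpus, phi

corpus, phi = generate_corpus(D=5, N=20, K=3, V=50)
```

Note that the small default hyperparameters (α = 0.1, β = 0.01) anticipate the sparsity guidance from the empirical study later in the deck.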

SLIDE 4

Motivation

❖ Latent Dirichlet Allocation (LDA) for topic modeling ❖ Questions:

➢ Is my data topic-model "friendly"? Why did LDA fail on my data?
➢ How many documents do I need to learn 100 topics?

❖ What factors affect LDA’s performance?

➢ # documents D
➢ Length of individual documents N
➢ # topics K
➢ Dirichlet hyper-parameters

❖ Formulate the goal:

➢ At what rate does the posterior distribution of the topics $\varphi_k$ converge to the true values as D and N approach infinity? ⟹ posterior contraction analysis

SLIDE 5

Posterior Contraction Analysis

❖ Latent Topic Polytope in LDA

➢ Representation of the latent topic structure through a convex hull, the topic polytope:

  $G(\varphi) = \mathrm{conv}(\varphi_1, \ldots, \varphi_K)$

➢ Distance between two polytopes $G_1$ and $G_2$ (minimum-matching Euclidean distance):

  $d_{\mathcal{M}}(G_1, G_2) = \max\{\, d(G_1, G_2),\; d(G_2, G_1) \,\}$
  $d(G_1, G_2) = \max_{\varphi \in \mathrm{extr}(G_1)} \min_{\varphi' \in \mathrm{extr}(G_2)} \|\varphi - \varphi'\|_2$

  • "extr" denotes the extreme points, i.e., the topics in LDA
  • Equivalent to the Hausdorff metric in convex geometry

❖ Posterior Contraction Analysis

➢ How fast does the posterior converge to the true topic polytope $G^*$?
➢ i.e., $d_{\mathcal{M}}(G, G^*) \le\ ?$
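The minimum-matching distance is easy to compute when each polytope is given as a list of topic vectors (assuming, as in LDA, that the topics are the extreme points). A small sketch with illustrative inputs; the function names are my own:

```python
import math

def euclidean(p, q):
    """Euclidean distance between two topic vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def one_sided(G1, G2):
    """d(G1, G2): worst-case distance from a topic in G1 to its nearest topic in G2."""
    return max(min(euclidean(p, q) for q in G2) for p in G1)

def min_matching_distance(G1, G2):
    """d_M(G1, G2) = max{ d(G1, G2), d(G2, G1) } -- the symmetrized distance."""
    return max(one_sided(G1, G2), one_sided(G2, G1))

# The metric is invariant to relabeling of topics: the same topics in a
# different order give distance zero.
A = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
B = [[0.1, 0.1, 0.8], [0.7, 0.2, 0.1]]
assert min_matching_distance(A, B) == 0.0
```

The permutation invariance shown at the end is exactly why this metric suits topic models, where topic indices carry no meaning.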

SLIDE 6

Posterior Contraction Analysis

❖ Theorem 1

Let the Dirichlet parameters for the topic proportions satisfy $\alpha_k \in (0, 1]$, and assume that either of the following holds:
  (A1) $K = K^*$, i.e., the true number of topics is known;
  (A2) the Euclidean distance between each pair of topics is bounded from below.
Then as $D \to \infty$ and $N \to \infty$ such that $N > \log D$:

  $\Pi\big( d_{\mathcal{M}}(G, G^*) \le C\, \varepsilon_{D,N} \mid \text{data} \big) \to 1$

where the upper bound on the contraction rate is

  $\varepsilon_{D,N} = \Big( \tfrac{\log D}{D} + \tfrac{\log N}{N} + \tfrac{\log N}{D} \Big)^{1/2}$

❖ Insights

➢ The document length N should be at least on the order of $\log D$, the logarithm of the number of documents
➢ Convergence rate: $\max\{ \tfrac{\log D}{D}, \tfrac{\log N}{N}, \tfrac{\log N}{D} \}$
➢ The rate does not depend on the number of topics K if $K^*$ is known
➢ An overfitted setting, i.e., $K \gg K^*$, is preferable to an underfitted one, since underfitting causes a persistent error even with infinite data

SLIDE 7

Posterior Contraction Analysis

❖ Theorem 2

Under the same conditions as Theorem 1, except that neither (A1) nor (A2) holds:
  (A1) $K = K^*$, i.e., the true number of topics is known;
  (A2) the Euclidean distance between each pair of topics is bounded from below.
Then for $K^* < K \le |V|$:

  $\Pi\big( d_{\mathcal{M}}(G, G^*) \le C\, \varepsilon_{D,N} \mid \text{data} \big) \to 1$

where the upper bound on the contraction rate is

  $\varepsilon_{D,N} = \Big( \tfrac{\log D}{D} + \tfrac{\log N}{N} + \tfrac{\log N}{D} \Big)^{1/(2(K-1))}$

❖ Insights

➢ Convergence is very slow, and the rate degrades with K
➢ Underfitting ($K < K^*$) results in a persistent error even with infinite data, and is therefore not considered
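To see how much looser the overfitted bound of Theorem 2 is than the bound of Theorem 1, the two rates can be evaluated numerically. A sketch under stated assumptions: the multiplicative constant C is dropped, and the branch logic simply mirrors the theorems' conditions; the function name and example values are my own.

```python
import math

def contraction_rate(D, N, K, K_true, topics_separated=False):
    """Upper bound on the posterior contraction rate (constant C ignored)."""
    base = math.log(D) / D + math.log(N) / N + math.log(N) / D
    if K == K_true or topics_separated:
        return base ** 0.5                    # Theorem 1: exponent 1/2
    return base ** (1.0 / (2 * (K - 1)))      # Theorem 2: exponent shrinks with K

D, N = 10_000, 500
fast = contraction_rate(D, N, K=3, K_true=3)    # true number of topics known
slow = contraction_rate(D, N, K=10, K_true=3)   # overfitted, no separation assumption
# for the same data, the overfitted bound is dramatically looser
```

Since the base quantity is below 1, raising it to the tiny exponent 1/(2(K−1)) pushes the bound toward 1, which is the formal sense in which convergence "depends on K" and becomes very slow.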

SLIDE 8

Empirical study & practical guidance

Synthetic Data:

  • Ground truth number of topics K* = 3
  • Vocabulary size |V| = 5000
  • Metric: minimum-matching Euclidean distance $d_{\mathcal{M}}$ defined above
  • Focus on the variation of the following parameters:

○ D: number of documents
○ N: length of documents
○ β: topic-word Dirichlet hyperparameter
○ K: number of topics specified for inference

SLIDE 9

Synthetic Data

• Larger number of documents => better performance
• Overly large number of topics specified for the model => worse performance
• Topics are known to be word-sparse, so the topic-word hyperparameter should be set small (e.g., β = 0.01)
• Longer documents => better performance

SLIDE 10

Synthetic Data

To verify the exponential theoretical bounds provided by the theorems

SLIDE 11

Empirical study & practical guidance

Real Data:

  • Metric: point-wise mutual information (Newman et al., 2011)
  • Focus on the variation of the following parameters:

○ D: number of documents
○ N: length of documents
○ β: topic-word Dirichlet hyperparameter
○ α: document-topic Dirichlet hyperparameter
○ K: number of topics specified for inference
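For intuition about the coherence metric, here is a simplified in-corpus PMI sketch. Newman et al. estimate co-occurrence over a topic's top words from an external reference corpus; the toy version below, with document-level co-occurrence counts and my own function names, only illustrates the idea.

```python
import math
from itertools import combinations

def pmi_coherence(top_words, docs):
    """Average point-wise mutual information over pairs of a topic's top words,
    with probabilities estimated from document co-occurrence counts."""
    n = len(docs)
    contains = {w: {i for i, d in enumerate(docs) if w in d} for w in top_words}
    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1 = len(contains[w1]) / n
        p2 = len(contains[w2]) / n
        p12 = len(contains[w1] & contains[w2]) / n
        if p12 > 0:
            scores.append(math.log(p12 / (p1 * p2)))
    return sum(scores) / len(scores) if scores else 0.0

docs = [{"cat", "dog"}, {"cat", "dog"}, {"stock", "market"}, {"stock", "market"}]
coherent = pmi_coherence(["cat", "dog"], docs)  # words always co-occur: positive PMI
mixed = pmi_coherence(["cat", "stock"], docs)   # never co-occur: no scored pairs
```

A coherent topic's top words co-occur far more often than chance would predict, giving a high average PMI; mixing words from unrelated topics drives the score down.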

SLIDE 12

Real Data

• Individual documents associated with a smaller number of topics => better performance
• Since each document is associated with only a few topics, the document-topic hyperparameter should be set small (e.g., α = 0.1)

SLIDE 13