Correlated Topic Models


  1. Correlated Topic Models. Authors: Blei and Lafferty, 2006. Reviewer: Casey Hanson

  2. Recap: Latent Dirichlet Allocation
     • D ≡ set of documents; K ≡ set of topics; W ≡ set of all words; N words in each doc.
     • θ_d ≡ multinomial over topics for a document d ∈ D. θ_d ~ Dir(α)
     • β_k ≡ multinomial over words in a topic k ∈ K. β_k ~ Dir(η)
     • z_{d,n} ≡ topic selected for word n in document d. z_{d,n} ~ Multi(θ_d)
     • W_{d,n} ≡ the n-th word in document d. W_{d,n} ~ Multi(β_{z_{d,n}})
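
To make the generative process concrete, here is a toy sketch in Python; the dimensions (K, V, N) and hyperparameter values are illustrative assumptions, not values from the slides:

```python
# Toy LDA generative process; K topics, V-word vocabulary, N words per
# document are assumed here purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
K, V, N = 3, 8, 20
alpha, eta = 0.5, 0.1
beta = rng.dirichlet(np.full(V, eta), size=K)      # beta_k ~ Dir(eta), one row per topic

def generate_document():
    theta = rng.dirichlet(np.full(K, alpha))       # theta_d ~ Dir(alpha)
    words = []
    for _ in range(N):
        z = rng.choice(K, p=theta)                 # z_{d,n} ~ Multi(theta_d)
        words.append(rng.choice(V, p=beta[z]))     # W_{d,n} ~ Multi(beta_{z_{d,n}})
    return words

print(generate_document())
```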

  3. Latent Dirichlet Allocation
     • Need to calculate the posterior: P(θ_{1:D}, z_{1:D,1:N}, β_{1:K} | W_{1:D,1:N}, α, η)
       ∝ p(θ_{1:D}, z_{1:D,1:N}, β_{1:K}, W_{1:D,1:N} | α, η)
     • The normalization factor, ∫_θ Σ_z ∫_β p(· · ·), is intractable.
     • Need to use approximate inference: Gibbs sampling.
     • Drawback: no intuitive relationship between topics.
     • Challenge: develop a method similar to LDA that models relationships between topics.

  4. Normal or Gaussian Distribution
     f(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))
     • Continuous distribution.
     • Symmetric and defined for −∞ < x < ∞.
     • Parameters: N(μ, σ²), where μ ≡ mean, σ² ≡ variance, σ ≡ standard deviation.
     • Estimation from data X = x_1 … x_n:
       μ̂ = (1/n) Σ_{i=1}^{n} x_i,   σ̂² = (1/n) Σ_{i=1}^{n} (x_i − μ̂)²
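
A quick numeric illustration of these estimators (the data values are made up):

```python
# Mean and variance MLEs for assumed sample data.
import numpy as np

x = np.array([2.1, 1.9, 2.4, 2.0, 1.6])
mu_hat = x.mean()                      # (1/n) * sum(x_i)
var_hat = ((x - mu_hat) ** 2).mean()   # (1/n) * sum((x_i - mu_hat)**2)
print(mu_hat, var_hat)                 # 2.0 0.068
```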

  5. Multivariate Gaussian Distribution: k dimensions
     f(X) = f(x_1 … x_k) = (1 / ((2π)^{k/2} det(Σ)^{1/2})) · exp(−½ (X − μ)ᵀ Σ⁻¹ (X − μ))
     • X = [x_1 … x_k]ᵀ ~ N(μ, Σ)
     • μ ≡ k × 1 vector of means for each dimension.
     • Σ ≡ k × k covariance matrix.
     • Example, the 2D case:
       μ = E[X] = [E[x_1], E[x_2]]ᵀ = [μ_1, μ_2]ᵀ
       Σ = [ E[(x_1 − μ_1)²]             E[(x_1 − μ_1)(x_2 − μ_2)] ;
             E[(x_1 − μ_1)(x_2 − μ_2)]   E[(x_2 − μ_2)²] ]

  6. 2D Multivariate Gaussian
     • Σ = [ σ²_{x_1}                      ρ_{x_1,x_2} σ_{x_1} σ_{x_2} ;
             ρ_{x_1,x_2} σ_{x_1} σ_{x_2}   σ²_{x_2} ]
     • Topic correlations sit on the off-diagonal:
       ρ_{x_1,x_2} σ_{x_1} σ_{x_2} = E[(x_1 − μ_1)(x_2 − μ_2)] = (1/n) Σ_{i=1}^{n} (x_{i,1} − μ_1)(x_{i,2} − μ_2)
     • The covariance matrix is symmetric!
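
As a sanity check (with simulated data, an assumption for illustration), the off-diagonal entry really is the average product of deviations:

```python
# Estimate a 2D covariance matrix by averaging deviation products.
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=5000)

dev = X - X.mean(axis=0)
cov_manual = dev.T @ dev / len(X)   # (1/n) * sum of deviation products
print(cov_manual)                   # approximately [[1, 0.6], [0.6, 1]]
print(np.cov(X.T, bias=True))       # numpy agrees
```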

  7. Matlab Demo
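
The deck's demo was in Matlab; here is a rough Python stand-in, a sketch that plots 2D Gaussian samples for a few covariance settings to show how the off-diagonal term tilts the point cloud:

```python
# Visualize how the off-diagonal covariance term shapes 2D Gaussian samples.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
rhos = {"independent": 0.0, "positive": 0.8, "negative": -0.8}

fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharex=True, sharey=True)
for ax, (name, rho) in zip(axes, rhos.items()):
    X = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=2000)
    ax.scatter(X[:, 0], X[:, 1], s=2)
    ax.set_title(f"{name} (rho = {rho})")
plt.show()
```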

  8. …Back to Topic Models
     • How can we adapt LDA to have correlations between topics?
     • In LDA, we assume two things:
       • Assumption 1: topics in a document are independent. θ_d ~ Dir(α)
       • Assumption 2: the distribution of words in a topic is stationary. β_k ~ Dir(η)
     • To sample topic distributions for topics that are correlated, we need to relax assumption 1.

  9. Exponential Family of Distributions
     • Family of distributions that can be placed in the following form:
       f(x | θ) = h(x) · exp(η(θ) · T(x) − A(θ))
     • Ex: binomial distribution, θ = p:
       f(x | θ) = C(n, x) p^x (1 − p)^{n−x},   x ∈ {0, 1, 2, …, n}
     • η(θ) = log(p / (1 − p)),  h(x) = C(n, x),  A(θ) = −n log(1 − p),  T(x) = x
       f(x) = C(n, x) · exp(x · log(p / (1 − p)) + n · log(1 − p))
     • η(θ) is the natural parameterization.
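
A quick numerical check (not from the slides) that this natural parameterization reproduces the binomial pmf:

```python
# Verify: C(n, x) * exp(eta * x - A) equals the binomial pmf when
# eta = log(p / (1 - p)) and A = -n * log(1 - p).
import math

n, p = 10, 0.3
eta = math.log(p / (1 - p))     # natural parameter
A = -n * math.log(1 - p)        # log-partition term

for x in range(n + 1):
    pmf = math.comb(n, x) * p**x * (1 - p)**(n - x)
    expfam = math.comb(n, x) * math.exp(eta * x - A)
    assert abs(pmf - expfam) < 1e-12
print("exponential-family form matches the binomial pmf")
```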

  10. Categorical Distribution
     • Multinomial with n = 1:
       f(x = 1) = θ_1;   f(z_1) = θᵀ · z_1
       where z_1 = [1 0 0 … 0]ᵀ (Iverson bracket or indicator vector), Σ_i z_i = 1.
     • Parameters: θ = [p_1 p_2 p_3], where Σ_i p_i = 1.
     • θ′ = [p_1/p_k  p_2/p_k  …  p_k/p_k = 1]
     • log θ′ = [log(p_1/p_k)  log(p_2/p_k)  …  log 1 = 0]

  11. Exponential Family Multinomial with N = 1
     • Recall: f(z_i | θ) = θᵀ · z_i
     • We want: f(x | θ) = h(x) · exp(η(θ) · T(x) − A(θ))
     • f(z_i | η) = exp(ηᵀ z_i − log Σ_{j=1}^{k} exp(η_j)) = exp(ηᵀ z_i) / Σ_{j=1}^{k} exp(η_j)
     • Note: there are only k − 1 independent dimensions in the multinomial.
     • η′ = [log(p_1/p_k)  log(p_2/p_k)  …  0],   η′_i = log(p_i/p_k)
     • f(z_i | η′) = exp(η′ᵀ z_i) / (1 + Σ_{j=1}^{k−1} exp(η′_j))
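
A minimal numeric sketch (the probabilities are made up) showing that mapping p to η′ with the last topic as reference and applying the softmax recovers p exactly:

```python
# Round trip: categorical parameters -> natural parameters -> softmax.
import numpy as np

p = np.array([0.2, 0.5, 0.3])               # categorical parameters, sum to 1
eta = np.log(p / p[-1])                     # eta'_i = log(p_i / p_k); last entry is 0

recovered = np.exp(eta) / np.exp(eta).sum() # softmax inverts the map
print(eta)                                  # [-0.405...  0.511...  0. ]
print(recovered)                            # [0.2 0.5 0.3]
```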

  12. Verify: Classroom Participation
     • Given: η = [log(p_1/p_k)  log(p_2/p_k)  …  0]
     • Show: f(z_i | θ) = θᵀ · z_i = exp(ηᵀ z_i − log Σ_{j=1}^{k} exp(η_j))

  13. Intuition and Demo
     • We can sample η from any number of places.
     • Choose the normal: it allows for correlation between topic dimensions.
     • Get a topic distribution for each document by sampling: η ~ N_{k−1}(μ, Σ)
     • What is the μ?
       • The expected deviation from the last topic: log(p_i/p_k).
       • Negative means push density towards the last topic (η_i < 0 ⇒ p_k > p_i).
     • What about the covariance?
       • It shows the variability, between topics, of the deviation from the last topic.
     • μ = [0 0]ᵀ, Σ = [1 0; 0 1]
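
A small sketch of this sampling step (the three-topic setup is an assumption for illustration): draw η from a (k−1)-dimensional Gaussian, append the reference coordinate 0, and map through the logistic (softmax) transform:

```python
# Sample a topic distribution from the logistic normal.
import numpy as np

rng = np.random.default_rng(0)
mu = np.zeros(2)       # the slide's example: mu = [0 0]^T
Sigma = np.eye(2)      # Sigma = [1 0; 0 1]

eta = rng.multivariate_normal(mu, Sigma)   # (k-1)-dimensional draw, k = 3 topics
eta = np.append(eta, 0.0)                  # the last topic is the reference
theta = np.exp(eta) / np.exp(eta).sum()    # logistic (softmax) map to the simplex
print(theta)                               # a 3-topic distribution, sums to 1
```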

  14. Favoring Topic 3
      μ = [−0.9, −0.9], Σ = [1 −0.9; −0.9 1]   vs.   μ = [−0.9, −0.9], Σ = [1 0; 0 1]

  15. Favoring Topic 3: μ = [−0.9, −0.9], Σ = [1 0.4; 0.4 1]
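
A hedged sketch using the parameter settings from the last two slides: with μ = [−0.9, −0.9] the mass shifts toward topic 3, and the sign of the off-diagonal term controls whether topics 1 and 2 rise and fall together:

```python
# Compare topic-proportion behavior under the three covariance settings.
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([-0.9, -0.9])
settings = {
    "anti-correlated": [[1.0, -0.9], [-0.9, 1.0]],
    "independent":     [[1.0,  0.0], [ 0.0, 1.0]],
    "correlated":      [[1.0,  0.4], [ 0.4, 1.0]],
}

for name, Sigma in settings.items():
    eta = rng.multivariate_normal(mu, Sigma, size=10_000)
    eta = np.hstack([eta, np.zeros((len(eta), 1))])   # reference topic 3
    theta = np.exp(eta)
    theta /= theta.sum(axis=1, keepdims=True)
    corr = np.corrcoef(theta[:, 0], theta[:, 1])[0, 1]
    print(f"{name}: mean proportions {theta.mean(axis=0).round(2)}, "
          f"corr(topic 1, topic 2) = {corr:.2f}")
```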

  16. Exercises

  17. Correlated Topic Model
     • Algorithm:
       • ∀d ∈ D:
         • Draw η_d | μ, Σ ~ N(μ, Σ)
         • ∀n ∈ {1 … N}:
           • Draw the topic assignment: z_{d,n} | η_d ~ Categorical(f(η_d)), where f maps η to the simplex via f(η)_i = exp(η_i) / Σ_j exp(η_j)
           • Draw the word: W_{d,n} | z_{d,n}, β_{1:K} ~ Categorical(β_{z_{d,n}})
     • Parameter estimation:
       • Intractable
       • Use variational inference (later)
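
The generative process above, as a compact sketch; it differs from the earlier LDA sketch only in how the topic proportions are drawn (β is fixed here rather than estimated, since estimation requires variational inference):

```python
# Toy CTM generative process; dimensions are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
K, V, N = 3, 8, 20                           # topics, vocabulary size, words per doc
mu, Sigma = np.zeros(K - 1), np.eye(K - 1)   # logistic-normal hyperparameters
beta = rng.dirichlet(np.ones(V), size=K)     # K topic-word distributions, assumed given

def generate_document():
    eta = rng.multivariate_normal(mu, Sigma)   # eta_d | mu, Sigma ~ N(mu, Sigma)
    eta = np.append(eta, 0.0)                  # reference topic
    theta = np.exp(eta) / np.exp(eta).sum()    # f(eta_d): logistic map
    words = []
    for _ in range(N):
        z = rng.choice(K, p=theta)             # z_{d,n} | eta_d ~ Categorical(f(eta_d))
        words.append(rng.choice(V, p=beta[z])) # W_{d,n} | z, beta ~ Categorical(beta_z)
    return words

print(generate_document())
```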

  18. Evaluation I: CTM on Test Data

  19. Evaluation II: 10-Fold Cross-Validation, LDA vs. CTM
     • ~1500 documents in the corpus.
     • ~5600 unique words (after pruning).
     • Methodology:
       • Partition the data into 10 sets (10-fold cross-validation).
       • Calculate the log likelihood of each held-out set, having trained on the other 9 sets, for both LDA and CTM; a sketch of this loop follows the slide.
     • [Figure: held-out log likelihood (left) and the difference L(CTM) − L(LDA) (right).] CTM shows a much higher log likelihood as the number of topics increases.
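
A schematic of the cross-validation loop; train_model and log_likelihood are hypothetical placeholders standing in for the LDA/CTM fitting and held-out scoring routines, which the paper implements with variational methods:

```python
# 10-fold held-out log likelihood, given pluggable fit/score functions.
import numpy as np

def cross_validated_loglik(docs, train_model, log_likelihood, folds=10):
    idx = np.array_split(np.random.permutation(len(docs)), folds)
    scores = []
    for held_out in idx:
        held = set(held_out.tolist())
        train = [docs[i] for i in range(len(docs)) if i not in held]
        model = train_model(train)             # fit on the other 9 folds
        scores.append(sum(log_likelihood(model, docs[i]) for i in held))
    return np.mean(scores)

# The slide's comparison would then be the difference
# cross_validated_loglik(docs, fit_ctm, ll) - cross_validated_loglik(docs, fit_lda, ll),
# which grows as the number of topics increases.
```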

  20. Evaluation II: Predictive Perplexity
     • Perplexity ≡ the expected number of equally likely words.
     • Lower perplexity means higher word resolution.
     • Suppose you see a percentage of the words in a document: how likely are the rest of the document's words according to your model?
     • CTM does better with lower numbers of observed words.
     • It is able to infer certain words given the topic probabilities.
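
A minimal sketch of the perplexity computation under assumed inputs: given the model's predictive probabilities for the unseen words of a document, perplexity is exp of the negative mean log probability:

```python
# Predictive perplexity from per-word predictive probabilities.
import numpy as np

def perplexity(word_probs):
    """word_probs: model probabilities assigned to each held-out word."""
    return np.exp(-np.mean(np.log(word_probs)))

# Sanity check: a uniform model over a 1000-word vocabulary means
# "1000 equally likely words", so its perplexity is 1000.
print(perplexity(np.full(50, 1 / 1000)))   # -> 1000.0
```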

  21. Conclusions
     • CTM changes the distribution from which the per-document topic proportions are drawn, from a Dirichlet to a logistic normal.
     • Very similar to LDA.
     • Able to model correlations between topics.
     • For larger numbers of topics, CTM performs better than LDA.
     • With known topics, CTM is able to infer word associations better than LDA.
