
Document and Topic Models: pLSA and LDA

Andrew Levandoski and Jonathan Lobo
CS 3750 Advanced Topics in Machine Learning
2 October 2018

Outline

  • Topic Models
  • pLSA
    • LSA
    • Model
    • Fitting via EM
    • pHITS: link analysis
  • LDA
    • Dirichlet distribution
    • Generative process
    • Model
    • Geometric Interpretation
    • Inference

Topic Models: Visual Representation

(Figure: topics, documents, and per-document topic proportions and assignments.)

Topic Models: Importance

  • For a given corpus, we learn two things:
    1. Topics: from the full vocabulary set, we learn important subsets
    2. Topic proportions: we learn what each document is about
  • This can be viewed as a form of dimensionality reduction:
    • From the large vocabulary set, extract basis vectors (topics)
    • Represent each document in topic space (topic proportions)
    • Dimensionality is reduced from a count vector $x_d \in \mathbb{Z}^V$ to topic proportions $\theta \in \mathbb{R}^K$
  • Topic proportions are useful for several applications, including document classification, discovery of semantic structures, sentiment analysis, object localization in images, etc.


Topic Models: Terminology

  • Document model
    • Word: element in a vocabulary set
    • Document: collection of words
    • Corpus: collection of documents
  • Topic model
    • Topic: collection of words (subset of the vocabulary)
    • A document is represented by a (latent) mixture of topics:
      $p(w|d) = \sum_z p(w|z)\,p(z|d)$  ($z$: topic)
  • Note: a document is a collection of words, not a sequence
    • 'Bag of words' assumption
    • In probability, we call this the exchangeability assumption:
      $p(w_1, \dots, w_N) = p(w_{\pi(1)}, \dots, w_{\pi(N)})$  ($\pi$: permutation)

Topic Models: Terminology (cont'd)

  • Represent each document in a vector space
  • A word is an item from a vocabulary indexed by $\{1, \dots, V\}$. We represent words using unit-basis vectors: the $v$th word is represented by a $V$-vector $w$ such that $w^v = 1$ and $w^u = 0$ for $u \neq v$.
  • A document is a sequence of $N$ words denoted by $\mathbf{w} = (w_1, w_2, \dots, w_N)$, where $w_n$ is the $n$th word in the sequence.
  • A corpus is a collection of $M$ documents denoted by $D = \{\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_M\}$.


Probabilistic Latent Semantic Analysis (pLSA)


Motivation

  • Learning from text and natural language
    • Learning the meaning and usage of words without prior linguistic knowledge
  • Modeling semantics
    • Account for polysemous and similar words
    • Difference between what is said and what is meant


Vector Space Model

  • Want to represent documents and terms as vectors in a lower-dimensional space
  • $N \times M$ word-document co-occurrence matrix $\mathbf{N}$:

$$D = \{d_1, \dots, d_N\}, \qquad W = \{w_1, \dots, w_M\}, \qquad \mathbf{N} = \big(n(d_i, w_j)\big)_{ij}$$

  • Limitations: high dimensionality, noisy, sparse
  • Solution: map to a lower-dimensional latent semantic space using SVD


Latent Semantic Analysis (LSA)

  • Goal
    • Map the high-dimensional vector space representation to a lower-dimensional representation in latent semantic space
    • Reveal semantic relations between documents (count vectors)
  • SVD
    • $\mathbf{N} = U \Sigma V^T$
    • $U$: orthogonal matrix with left singular vectors (eigenvectors of $\mathbf{N}\mathbf{N}^T$)
    • $V$: orthogonal matrix with right singular vectors (eigenvectors of $\mathbf{N}^T\mathbf{N}$)
    • $\Sigma$: diagonal matrix with the singular values of $\mathbf{N}$
    • Select the $k$ largest singular values from $\Sigma$ to get the approximation $\widetilde{\mathbf{N}}$ with minimal error
  • Can compute similarity values between document vectors and term vectors
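A minimal numpy sketch of this truncated SVD on a toy count matrix (the matrix values and the choice of $k$ are illustrative assumptions):

```python
# LSA sketch: rank-k truncated SVD of a toy word-document count matrix N.
import numpy as np

N = np.array([[2., 1., 0.],   # rows: words, columns: documents
              [1., 2., 0.],
              [0., 0., 3.],
              [0., 1., 2.]])

U, s, Vt = np.linalg.svd(N, full_matrices=False)
k = 2                                          # number of singular values kept
N_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # best rank-k approximation (Frobenius norm)

term_vecs = U[:, :k] * s[:k]      # terms in the latent semantic space
doc_vecs = Vt[:k, :].T * s[:k]    # documents in the latent semantic space
```

Cosine similarities between rows of `doc_vecs` (or `term_vecs`) give the document-document and term-term comparisons mentioned above.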



LSA Strengths

  • Outperforms the naïve vector space model
  • Unsupervised, simple
  • Noise removal and robustness due to dimensionality reduction
  • Can capture synonymy
  • Language independent
  • Can easily perform queries, clustering, and comparisons


LSA Limitations

  • No probabilistic model of term occurrences
  • Results are difficult to interpret
  • Assumes that words and documents form a joint Gaussian model
  • Arbitrary selection of the number of dimensions k
  • Cannot account for polysemy
  • No generative model


Probabilistic Latent Semantic Analysis (pLSA)

  • Difference between topics and words?
    • Words are observable
    • Topics are not; they are latent
  • Aspect Model
    • Associates an unobserved latent class variable $z \in Z = \{z_1, \dots, z_K\}$ with each observation
    • Defines a joint probability model over documents and words
    • Assumes $w$ is independent of $d$ conditioned on $z$
    • The cardinality of $z$ should be much smaller than that of $d$ and $w$


pLSA Model Formulation

  • Basic generative model
    • Select a document $d$ with probability $P(d)$
    • Select a latent class $z$ with probability $P(z|d)$
    • Generate a word $w$ with probability $P(w|z)$
  • Joint probability model:

$$P(d, w) = P(d)\,P(w|d), \qquad P(w|d) = \sum_{z \in Z} P(w|z)\,P(z|d)$$

pLSA Graphical Model Representation

$$P(d, w) = P(d)\,P(w|d), \qquad P(w|d) = \sum_{z \in Z} P(w|z)\,P(z|d)$$

$$P(d, w) = \sum_{z \in Z} P(z)\,P(d|z)\,P(w|z)$$

(Asymmetric and symmetric parameterizations of the same model.)


pLSA Joint Probability Model

$$P(d, w) = P(d)\,P(w|d), \qquad P(w|d) = \sum_{z \in Z} P(w|z)\,P(z|d)$$

Maximize the log-likelihood:

$$\mathcal{L} = \sum_{d \in D} \sum_{w \in W} n(d, w) \log P(d, w)$$

This corresponds to minimizing the KL divergence (cross-entropy) between the empirical distribution of words and the model distribution $P(w|d)$.

Probabilistic Latent Semantic Space

  • $P(w|d)$ for all documents is approximated by a multinomial combination of all factors $P(w|z)$
  • The weights $P(z|d)$ uniquely define a point in the latent semantic space and represent how topics are mixed in a document


Probabilistic Latent Semantic Space

  • A topic is represented by a probability distribution over words:
    $z_i = (w_1, \dots, w_M)$, e.g. $z_1 = (0.3, 0.1, 0.2, 0.3, 0.1)$
  • A document is represented by a probability distribution over topics:
    $d_j = (z_1, \dots, z_K)$, e.g. $d_1 = (0.5, 0.3, 0.2)$

Model Fitting via Expectation Maximization

  • E-step: compute posterior probabilities for the latent variables $z$ using the current parameters:

$$P(z|d, w) = \frac{P(z)\,P(d|z)\,P(w|z)}{\sum_{z'} P(z')\,P(d|z')\,P(w|z')}$$

  • M-step: update the parameters using the computed posterior probabilities:

$$P(w|z) = \frac{\sum_{d} n(d, w)\,P(z|d, w)}{\sum_{d, w'} n(d, w')\,P(z|d, w')}, \qquad P(d|z) = \frac{\sum_{w} n(d, w)\,P(z|d, w)}{\sum_{d', w} n(d', w)\,P(z|d', w)}$$

$$P(z) = \frac{1}{R} \sum_{d, w} n(d, w)\,P(z|d, w), \qquad R \equiv \sum_{d, w} n(d, w)$$



pLSA Strengths

  • Models word-document co-occurrences as a mixture of conditionally independent multinomial distributions
  • A mixture model, not a clustering model
  • Results have a clear probabilistic interpretation
  • Allows for model combination
  • The problem of polysemy is better addressed



pLSA Limitations

  • Potentially higher computational complexity
  • The EM algorithm gives a local maximum
  • Prone to overfitting
    • Solution: tempered EM
  • Not a well-defined generative model for new documents
    • Solution: latent Dirichlet allocation (LDA)

pLSA Model Fitting Revisited

  • Tempered EM
    • Goals: maximize performance on unseen data, accelerate the fitting process
    • Define a control parameter $\beta$ that is continuously modified
  • Modified E-step:

$$P_\beta(z|d, w) = \frac{\big(P(z)\,P(d|z)\,P(w|z)\big)^\beta}{\sum_{z'} \big(P(z')\,P(d|z')\,P(w|z')\big)^\beta}$$


Tempered EM Steps

1. Split the data into training and validation sets.
2. Set $\beta$ to 1.
3. Perform EM on the training set until performance on the validation set decreases.
4. Decrease $\beta$ by setting it to $\eta\beta$, where $\eta < 1$, and go back to step 3.
5. Stop when decreasing $\beta$ gives no further improvement.
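For illustration, the tempered E-step differs from the plain E-step sketched earlier only in the exponent $\beta$; setting $\beta = 1$ recovers standard EM:

```python
# Tempered E-step sketch: raise the E-step numerator to the power beta.
import numpy as np

def tempered_posterior(Pz, Pdz, Pwz, beta):
    """P_beta(z|d,w) ∝ (P(z) P(d|z) P(w|z))**beta, normalized over z."""
    post = (Pz[None, None, :] * Pdz[:, None, :] * Pwz[None, :, :]) ** beta
    return post / (post.sum(axis=2, keepdims=True) + 1e-12)
```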


Example: Identifying Authoritative Documents



HITS

  • Hubs and Authorities
    • Each webpage has an authority score $x$ and a hub score $y$
    • Authority: value of the page's content to a community (likelihood of being cited)
    • Hub: value of the page's links to other pages (likelihood of citing authorities)
  • A good hub points to many good authorities
  • A good authority is pointed to by many good hubs
  • Principal components correspond to different communities
  • Identify the principal eigenvector of the co-citation matrix

HITS Drawbacks

  • Uses only the largest eigenvectors, which do not necessarily correspond to the only relevant communities
  • Authoritative documents in smaller communities may be given no credit
  • Solution: probabilistic HITS (pHITS)


pHITS

$$P(d, c) = \sum_{z} P(z)\,P(c|z)\,P(d|z)$$

(Graphical model: documents generate citations through latent communities, with factors $P(d|z)$ and $P(c|z)$.)

Interpreting pHITS Results

  • Explain $d$ and $c$ in terms of the latent variable "community" $z$
  • Authority score: $P(c|z)$
    • Probability that document $c$ is cited from within community $z$
  • Hub score: $P(d|z)$
    • Probability that document $d$ contains a reference to community $z$
  • Community membership: $P(z|c)$
    • Can be used to classify documents
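Since pHITS fits the same factorization as pLSA with citations in place of words, the EM sketch from earlier can be reused on a citation matrix. A usage illustration, assuming the `plsa_em` function from the earlier sketch is in scope:

```python
# pHITS sketch: factorize a document-citation matrix A instead of a
# word-document count matrix (reuses plsa_em from the earlier sketch).
import numpy as np

A = np.array([[1., 1., 0., 0.],
              [1., 0., 0., 0.],
              [0., 0., 1., 1.],
              [0., 1., 0., 1.]])      # A[d, c] = 1 if document d cites c

Pz, Pdz, Pcz = plsa_em(A, K=2)        # Pcz: authority scores, Pdz: hub scores
```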



Joint Model of pLSA and pHITS

  • Joint probabilistic model of document content (pLSA) and connectivity (pHITS)
  • Able to answer questions about both structure and content
  • The model can use evidence about link structure to make predictions about document content, and vice versa
  • Reference flow: connection between one topic and another
  • Maximize the log-likelihood function:

$$\mathcal{L} = \sum_{j} \left[ \alpha \sum_{i} \frac{N_{ij}}{\sum_{i'} N_{i'j}} \log \sum_{k} P(w_i|z_k)\,P(z_k|d_j) + (1-\alpha) \sum_{l} \frac{A_{lj}}{\sum_{l'} A_{l'j}} \log \sum_{k} P(c_l|z_k)\,P(z_k|d_j) \right]$$

where $N_{ij}$ are term counts, $A_{lj}$ are citation counts, and $\alpha$ weights the relative importance of content and links.

pLSA: Main Deficiencies

  • Incomplete in that it provides no probabilistic model at the document level, i.e., no proper priors are defined
  • Each document is represented as a list of numbers (the mixing proportions for topics), and there is no generative probabilistic model for these numbers; thus:
    1. The number of parameters in the model grows linearly with the size of the corpus, leading to overfitting
    2. It is unclear how to assign probability to a document outside of the training set
  • Latent Dirichlet allocation (LDA) captures the exchangeability of both words and documents using a Dirichlet distribution, allowing a coherent generative process for test data


Latent Dirichlet Allocation (LDA)


LDA: Dirichlet Distribution

  • A 'distribution of distributions'
  • A multivariate distribution whose components all take values in (0, 1) and sum to one
  • Parameterized by the vector $\alpha$, which has the same number of elements ($k$) as our multinomial parameter $\theta$
  • A generalization of the beta distribution to multiple dimensions
  • The $\alpha$ hyperparameter controls the mixture of topics for a given document
  • The $\beta$ hyperparameter controls the distribution of words per topic
  • Note: ideally we want our composites (documents) to be made up of only a few topics and our parts (words) to belong to only some of the topics. With this in mind, $\alpha$ and $\beta$ are typically set below one.
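A quick numerical illustration of that note (the dimensions and values are arbitrary assumptions): small concentration parameters yield sparse mixtures, large ones yield even mixtures.

```python
# How the Dirichlet concentration shapes topic mixtures: alpha < 1 gives
# sparse mixtures (a few dominant topics), alpha > 1 gives even mixtures.
import numpy as np

rng = np.random.default_rng(0)
for alpha in (0.1, 1.0, 10.0):
    theta = rng.dirichlet(alpha * np.ones(5), size=3)  # 3 draws over 5 topics
    print(f"alpha = {alpha}:")
    print(theta.round(2))   # each row sums to 1
```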


LDA: Dirichlet Distribution (cont'd)

  • A $k$-dimensional Dirichlet random variable $\theta$ can take values in the $(k-1)$-simplex (a $k$-vector $\theta$ lies in the $(k-1)$-simplex if $\theta_i \geq 0$ and $\sum_{i=1}^{k} \theta_i = 1$) and has the following probability density on this simplex:

$$p(\theta|\alpha) = \frac{\Gamma\!\big(\sum_{i=1}^{k} \alpha_i\big)}{\prod_{i=1}^{k} \Gamma(\alpha_i)}\, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1},$$

where the parameter $\alpha$ is a $k$-vector with components $\alpha_i > 0$ and $\Gamma(x)$ is the Gamma function.

  • The Dirichlet is a convenient distribution on the simplex:
    • In the exponential family
    • Has finite-dimensional sufficient statistics
    • Conjugate to the multinomial distribution

LDA: Generative Process

LDA assumes the following generative process for each document $\mathbf{w}$ in a corpus $D$:

1. Choose $N \sim \mathrm{Poisson}(\xi)$.
2. Choose $\theta \sim \mathrm{Dir}(\alpha)$.
3. For each of the $N$ words $w_n$:
   a. Choose a topic $z_n \sim \mathrm{Multinomial}(\theta)$.
   b. Choose a word $w_n$ from $p(w_n|z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$.

Example: assume a group of articles that can be broken down into three topics described by the following words:

  • Animals: dog, cat, chicken, nature, zoo
  • Cooking: oven, food, restaurant, plates, taste
  • Politics: Republican, Democrat, Congress, ineffective, divisive

To generate a new document that is 80% about animals and 20% about cooking (a code sketch follows this list):

  • Choose the length of the article (say, 1000 words)
  • For each word, choose a topic based on the specified mixture (~800 words will come from the topic 'animals')
  • Choose a word based on the word distribution for that topic
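A minimal sketch of that process; the uniform within-topic word distributions are an illustrative assumption, and full LDA would also draw $\theta \sim \mathrm{Dir}(\alpha)$ rather than fixing it:

```python
# Generative sketch for the 80/20 animals/cooking example above.
import numpy as np

rng = np.random.default_rng(0)
topics = {
    "animals": ["dog", "cat", "chicken", "nature", "zoo"],
    "cooking": ["oven", "food", "restaurant", "plates", "taste"],
}
theta = {"animals": 0.8, "cooking": 0.2}   # fixed topic mixture for this document

n_words = rng.poisson(1000)                # 1. choose N ~ Poisson(xi)
doc = []
for _ in range(n_words):
    z = rng.choice(list(theta), p=list(theta.values()))  # 2. topic z_n ~ Mult(theta)
    doc.append(rng.choice(topics[z]))      # 3. word w_n ~ p(w | z_n), uniform here
print(doc[:10])
```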



LDA: Model (Plate Notation)

$\alpha$ is the parameter of the Dirichlet prior on the per-document topic distributions, $\beta$ is the parameter of the Dirichlet prior on the per-topic word distributions, $\theta_m$ is the topic distribution for document $m$, $z_{mn}$ is the topic for the $n$th word in document $m$, and $w_{mn}$ is the word.

LDA: Model

(Figure: the model's components as matrices: $\alpha$, the $K$-vector of Dirichlet parameters; $\theta_{dk}$, the $M \times K$ matrix of per-document topic distributions; $z_{dn} \in \{1, \dots, K\}$, the topic assignment of the $n$th word (of $N_d$) in document $d$; and $\beta_{ki} = p(w|z)$, the $K \times V$ topic-word matrix.)


LDA: Model (cont'd)

($\alpha$ controls the mixture of topics; $\beta$ controls the distribution of words per topic.)

LDA: Model (cont'd)

Given the parameters $\alpha$ and $\beta$, the joint distribution of a topic mixture $\theta$, a set of $N$ topics $\mathbf{z}$, and a set of $N$ words $\mathbf{w}$ is given by:

$$p(\theta, \mathbf{z}, \mathbf{w}|\alpha, \beta) = p(\theta|\alpha) \prod_{n=1}^{N} p(z_n|\theta)\, p(w_n|z_n, \beta),$$

where $p(z_n|\theta)$ is simply $\theta_i$ for the unique $i$ such that $z_n^i = 1$. Integrating over $\theta$ and summing over $z$, we obtain the marginal distribution of a document:

$$p(\mathbf{w}|\alpha, \beta) = \int p(\theta|\alpha) \prod_{n=1}^{N} \sum_{z_n} p(z_n|\theta)\, p(w_n|z_n, \beta)\, d\theta.$$

Finally, taking the product of the marginal probabilities of single documents, we obtain the probability of a corpus:

$$p(D|\alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d|\alpha) \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn}|\theta_d)\, p(w_{dn}|z_{dn}, \beta)\, d\theta_d.$$


LDA: Exchangeability

  • A finite set of random variables $\{x_1, \dots, x_N\}$ is said to be exchangeable if the joint distribution is invariant to permutation. If $\pi$ is a permutation of the integers from 1 to $N$:

$$p(x_1, \dots, x_N) = p(x_{\pi(1)}, \dots, x_{\pi(N)})$$

  • An infinite sequence of random variables is infinitely exchangeable if every finite subsequence is exchangeable
  • We assume that words are generated by topics and that those topics are infinitely exchangeable within a document
  • By de Finetti's theorem:

$$p(\mathbf{w}, \mathbf{z}) = \int p(\theta) \prod_{n=1}^{N} p(z_n|\theta)\, p(w_n|z_n)\, d\theta$$

LDA vs. other latent variable models

  • Unigram model: $p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)$
  • Mixture of unigrams: $p(\mathbf{w}) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n|z)$
  • pLSI: $p(d, w_n) = p(d) \sum_{z} p(w_n|z)\, p(z|d)$


LDA: Geometric Interpretation

  • The topic simplex for three topics is embedded in the word simplex for three words
  • Corners of the word simplex correspond to the three distributions where each word has probability one
  • Corners of the topic simplex correspond to three different distributions over words
  • The mixture of unigrams places each document at one of the corners of the topic simplex
  • pLSI induces an empirical distribution on the topic simplex (denoted by diamonds)
  • LDA places a smooth distribution on the topic simplex (denoted by contour lines)

LDA: Goal of Inference

  • LDA inputs: the set of words per document, for each document in a corpus
  • LDA outputs:
    • Corpus-wide topic-vocabulary distributions
    • Topic assignments per word
    • Topic proportions per document


LDA: Inference

The key inferential problem we need to solve with LDA is computing the posterior distribution of the hidden variables given a document:

$$p(\theta, \mathbf{z}|\mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w}|\alpha, \beta)}{p(\mathbf{w}|\alpha, \beta)}$$

This distribution is intractable to compute in general (the normalizing integral cannot be solved in closed form). To normalize the distribution, we must marginalize over the hidden variables:

$$p(\mathbf{w}|\alpha, \beta) = \frac{\Gamma\!\big(\sum_i \alpha_i\big)}{\prod_i \Gamma(\alpha_i)} \int \Bigg( \prod_{i=1}^{k} \theta_i^{\alpha_i - 1} \Bigg) \Bigg( \prod_{n=1}^{N} \sum_{i=1}^{k} \prod_{j=1}^{V} (\theta_i \beta_{ij})^{w_n^j} \Bigg) d\theta$$

LDA: Variational Inference

  • Basic idea: use Jensen's inequality to obtain an adjustable lower bound on the log likelihood
  • Consider a family of lower bounds indexed by a set of variational parameters, chosen by an optimization procedure that attempts to find the tightest possible lower bound
  • A problematic coupling between $\theta$ and $\beta$ arises due to the edges between $\theta$, $\mathbf{z}$, and $\mathbf{w}$. By dropping these edges and the $\mathbf{w}$ nodes, we obtain a family of distributions on the latent variables characterized by the following variational distribution:

$$q(\theta, \mathbf{z}|\gamma, \phi) = q(\theta|\gamma) \prod_{n=1}^{N} q(z_n|\phi_n)$$

where $\gamma$ and $(\phi_1, \dots, \phi_N)$ are the free variational parameters.


LDA: Variational Inference (cont'd)

  • With this family of probability distributions specified, we set up the following optimization problem to determine $\gamma$ and $\phi$:

$$(\gamma^*, \phi^*) = \arg\min_{\gamma, \phi} \mathrm{D}\big(q(\theta, \mathbf{z}|\gamma, \phi) \,\|\, p(\theta, \mathbf{z}|\mathbf{w}, \alpha, \beta)\big)$$

  • The optimizing values of these parameters are found by minimizing the KL divergence between the variational distribution and the true posterior $p(\theta, \mathbf{z}|\mathbf{w}, \alpha, \beta)$
  • By computing the derivatives of the KL divergence and setting them equal to zero, we obtain the following pair of update equations:

$$\phi_{ni} \propto \beta_{i w_n} \exp\big(E_q[\log \theta_i \,|\, \gamma]\big), \qquad \gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}$$

  • The expectation in the multinomial update can be computed as follows:

$$E_q[\log \theta_i \,|\, \gamma] = \Psi(\gamma_i) - \Psi\Big(\sum_{j=1}^{k} \gamma_j\Big)$$

where $\Psi$ is the first derivative of the $\log \Gamma$ function.



LDA: Parameter Estimation

  • Given a corpus of documents $D = \{\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_M\}$, we wish to find $\alpha$ and $\beta$ that maximize the marginal log likelihood of the data:

$$\ell(\alpha, \beta) = \sum_{d=1}^{M} \log p(\mathbf{w}_d|\alpha, \beta)$$

  • Variational EM yields the following iterative algorithm:
    1. (E-step) For each document $d \in D$, find the optimizing values of the variational parameters $\gamma_d^*, \phi_d^*$
    2. (M-step) Maximize the resulting lower bound on the log likelihood with respect to the model parameters $\alpha$ and $\beta$
  • These two steps are repeated until the lower bound on the log likelihood converges


LDA: Smoothing

  • Introduces Dirichlet smoothing on $\beta$ to avoid the zero-frequency word problem
  • Fully Bayesian approach:

$$q(\beta_{1:k}, \mathbf{z}_{1:M}, \theta_{1:M} \,|\, \lambda, \phi, \gamma) = \prod_{i=1}^{k} \mathrm{Dir}(\beta_i|\lambda_i) \prod_{d=1}^{M} q_d(\theta_d, \mathbf{z}_d|\phi_d, \gamma_d)$$

where $q_d(\theta, \mathbf{z}|\phi, \gamma)$ is the variational distribution defined for LDA. We require an additional update for the new variational parameter $\lambda$:

$$\lambda_{ij} = \eta + \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi^*_{dni}\, w^j_{dn}$$


Topic Model Applications

  • Information Retrieval
  • Visualization
  • Computer Vision
    • Document = image, word = "visual word"
  • Bioinformatics
    • Genomic features, gene sequencing, diseases
  • Modeling networks
    • Cities, social networks

pLSA / LDA Libraries

  • gensim (Python)
  • MALLET (Java)
  • topicmodels (R)
  • Stanford Topic Modeling Toolbox
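As a starting point, here is a minimal example with gensim, one of the libraries listed above (the toy corpus and parameter choices are illustrative assumptions):

```python
# Fit a 2-topic LDA model on a tiny corpus with gensim.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["dog", "cat", "zoo"], ["oven", "food", "taste"], ["dog", "nature", "zoo"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]   # bag-of-words counts

lda = LdaModel(corpus=corpus, num_topics=2, id2word=dictionary, passes=10)
print(lda.print_topics())                  # per-topic word distributions
print(lda.get_document_topics(corpus[0]))  # per-document topic proportions
```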

References

  • David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. JMLR, 2003.
  • David Cohn and Huan Chang. Learning to Probabilistically Identify Authoritative Documents. ICML, 2000.
  • David Cohn and Thomas Hofmann. The Missing Link: A Probabilistic Model of Document Content and Hypertext Connectivity. NIPS, 2000.
  • Thomas Hofmann. Probabilistic Latent Semantic Analysis. UAI, 1999.
  • Thomas Hofmann. Probabilistic Latent Semantic Indexing. SIGIR, 1999.