
An N-gram Topic Model for Time-Stamped Documents

Shoaib Jameel and Wai Lam

The Chinese University of Hong Kong

ECIR 2013, Moscow, Russia


Outline

Introduction and Motivation
  ◮ The Bag-of-Words (BoW) assumption
  ◮ Temporal nature of data

Related Work
  ◮ Temporal Topic Models
  ◮ N-gram Topic Models

Overview of our model
  ◮ Background
    ⋆ Topics Over Time (TOT) model, proposed earlier
    ⋆ Our proposed n-gram model

Empirical Evaluation

Conclusions and Future Directions


The ‘popular’ Bag-of-Words Assumption

Many works in the topic modeling literature assume exchangeability among the words. As a result, they generate ambiguous words in topics. For example, consider a few topics obtained from the NIPS collection using the Latent Dirichlet Allocation (LDA) model:

Example:

Topic 1        Topic 2    Topic 3        Topic 4     Topic 5
architecture   order      connectionist  potential   prior
recurrent      first      role           membrane    bayesian
network        second     binding        current     data
module         analysis   structures     synaptic    evidence
modules        small      distributed    dendritic   experts

The problem with the LDA model

Words in topics are not insightful.


The problem with the bag-of-words assumption

1. The logical structure of the document is lost. For example, we cannot tell whether "the cat saw a dog" or "a dog saw a cat" (see the sketch after this list).

2. Computational models cannot tap the extra word-order information inherent in the text, which hurts performance.

3. The usefulness of maintaining word order has also been demonstrated in Information Retrieval, Computational Linguistics, and many other fields.
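Point 1 is easy to reproduce in code. A minimal Python sketch, using a slightly symmetrized variant of the cat/dog sentences so that only word order distinguishes them:

```python
from collections import Counter

# Two sentences whose meanings differ only through word order.
s1 = "the cat saw the dog".split()
s2 = "the dog saw the cat".split()

# Under the bag-of-words assumption only word counts survive,
# so the two sentences collapse to the same representation.
print(Counter(s1) == Counter(s2))   # True
```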


Why capture topics over time?

1. We know that data evolves over time.

2. What people are talking about today they may not be talking about tomorrow, or a year later. (Example trending topics by year. Year 2010: Burj Khalifa, Volcano, Manila Hostage, Iraq War; Year 2011: Wikipedia, N.Z. Earthquake, Osama bin Laden; Year 2012: Higgs Boson, Gaza Strip, Sachin Tendulkar, China, Apple Inc.)

3. Models such as LDA do not capture such temporal characteristics in data.


Related Work

Temporal Topic Models

Discrete-time assumption models

◮ Blei and Lafferty (2006), Dynamic Topic Models: assume that topics in one year depend on the topics of the previous year.

◮ Knights, Mozer, and Nicolov (2009), Compound Topic Model: train a topic model on the most recent K months of data.

The problem here

One needs to select an appropriate time-slice value manually. The question is which time slice to choose: day, month, year, etc.


Related Work

Temporal Topic Models

Continuous Time Topic Models

◮ Kawamae (2011), Trend Analysis Model: has a probability distribution over temporal words and topics, and a continuous distribution over time.

◮ Nodelman, Shelton, and Koller (2002), Continuous Time Bayesian Networks: builds a graph in which each node holds a variable whose value changes over time.

The problem with the above models

All assume the notion of exchangeability and thus lose important collocation information inherent in the document.


Related Work

N-gram Topic Models

1. Wallach (2006), Bigram Topic Model: maintains word order during the topic generation process, but generates only bigrams in topics.

2. Griffiths, Steyvers, and Tenenbaum (2007), LDA Collocation Model: introduces binary random variables that decide when to generate a unigram or a bigram.

3. Wang, McCallum, and Wei (2007), Topical N-gram Model: extends the LDA Collocation Model and gives a topic assignment to every word in a phrase.

The problem with the above models

Cannot capture the temporal dynamics in data.


Topics Over Time (TOT) (Wang et al., 2006)

1. Our model extends this model.

2. It assumes the notion of word and topic exchangeability.

Generative Process

1. Draw T multinomials φ_z from a Dirichlet prior β, one for each topic z.

2. For each document d, draw a multinomial θ^(d) from a Dirichlet prior α; then for each word w_i^(d) in the document d:

   1. Draw a topic z_i^(d) from Multinomial(θ^(d));
   2. Draw a word w_i^(d) from Multinomial(φ_{z_i^(d)});
   3. Draw a timestamp t_i^(d) from Beta(Ω_{z_i^(d)}).

Topics Over Time Model (TOT)

(Plate diagram: for each of the D documents, θ^(d) is drawn from Dirichlet(α); each of its N_d words gets a topic z, which generates the word w from φ_z with Dirichlet prior β and the timestamp t from Beta(Ω_z); T topics in total.)
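To make the process concrete, the following is a minimal NumPy sketch of the TOT generative story above; the corpus sizes and hyperparameter values are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and symmetric hyperparameters (assumptions, not from the paper).
T, W, D, N_d = 5, 200, 10, 50                 # topics, vocabulary, documents, words per doc
alpha, beta = 0.1, 0.01
Omega = rng.uniform(1.0, 5.0, size=(T, 2))    # per-topic Beta parameters over normalized time

phi = rng.dirichlet(np.full(W, beta), size=T)       # step 1: phi_z ~ Dirichlet(beta), one per topic

corpus = []
for d in range(D):
    theta = rng.dirichlet(np.full(T, alpha))         # step 2: theta^(d) ~ Dirichlet(alpha)
    words, stamps = [], []
    for _ in range(N_d):
        z = rng.choice(T, p=theta)                   # 2.1: topic  z_i ~ Multinomial(theta^(d))
        w = rng.choice(W, p=phi[z])                  # 2.2: word   w_i ~ Multinomial(phi_z)
        t = rng.beta(Omega[z, 0], Omega[z, 1])       # 2.3: time   t_i ~ Beta(Omega_z)
        words.append(w); stamps.append(t)
    corpus.append((words, stamps))
```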


Topics Over Time Model (TOT)

1. The model assumes a continuous distribution over time associated with each topic.

2. Topics are responsible for generating both the observed time-stamps and the words.

3. The model does not capture the sequence of state changes with a Markov assumption.


Topics Over Time Model (TOT)

Posterior Inference

1. In Gibbs sampling, compute the conditional:

$$P(z_i^{(d)} \mid \mathbf{w}, \mathbf{t}, \mathbf{z}_{\neg i}, \alpha, \beta, \Omega) \qquad (1)$$

2. We can thus write the updating equation as:

$$
P(z_i^{(d)} \mid \mathbf{w}, \mathbf{t}, \mathbf{z}_{\neg i}, \alpha, \beta, \Omega) \propto
\left( m_{z_i^{(d)}} + \alpha_{z_i^{(d)}} - 1 \right)
\times \frac{n_{z_i^{(d)} w_i^{(d)}} + \beta_{w_i^{(d)}} - 1}{\sum_{v=1}^{W} \bigl( n_{z_i^{(d)} v} + \beta_v \bigr) - 1}
\times \frac{\bigl(1 - t_i^{(d)}\bigr)^{\Omega_{z_i^{(d)} 1} - 1} \bigl(t_i^{(d)}\bigr)^{\Omega_{z_i^{(d)} 2} - 1}}{B\bigl(\Omega_{z_i^{(d)} 1}, \Omega_{z_i^{(d)} 2}\bigr)}
\qquad (2)
$$
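A minimal NumPy/SciPy sketch of the per-token score in Equation (2). The helper and its variable names are ours; the count arrays are assumed to already exclude the current token, which is what the −1 terms in the equation account for, and we use the standard Beta(a, b) density (the slide writes the (1 − t) factor first, which simply swaps the roles of Ω_z1 and Ω_z2).

```python
import numpy as np
from scipy.stats import beta as beta_dist

def tot_topic_scores(w, t, m_d, n_zw, alpha, beta, Omega):
    """Unnormalized P(z_i = z | ...) over all topics for one token, following Eq. (2).

    w     : word id of the current token
    t     : its time-stamp, rescaled to (0, 1)
    m_d   : (T,)   per-topic token counts in the current document, current token excluded
    n_zw  : (T, W) topic-word counts, current token excluded
    Omega : (T, 2) per-topic Beta parameters over time
    alpha, beta are symmetric hyperparameters here, for simplicity.
    """
    W = n_zw.shape[1]
    doc_term  = m_d + alpha                                           # document-topic factor
    word_term = (n_zw[:, w] + beta) / (n_zw.sum(axis=1) + W * beta)   # topic-word factor
    time_term = beta_dist.pdf(t, Omega[:, 0], Omega[:, 1])            # Beta density of the time-stamp
    return doc_term * word_term * time_term

# Sampling a new topic for the token:
# scores = tot_topic_scores(w, t, m_d, n_zw, alpha=0.1, beta=0.01, Omega=Omega)
# z_new  = np.random.default_rng().choice(len(scores), p=scores / scores.sum())
```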


Our Model

N-gram Topics Over Time Model

1. The model assumes a continuous distribution over time associated with each topic.

2. Topics are responsible for generating both the observed time-stamps and the words.

3. The model does not capture the sequence of state changes with a Markov assumption.

4. It maintains the order of words during the topic generation process.

5. It generates words as unigrams, bigrams, etc. in topics.

6. It results in more interpretable topics.


Graphical Model

N-gram Topics Over Time Model

(Plate diagram of the n-gram Topics Over Time model: per-document θ drawn from Dirichlet(α); topic assignments z_{i−1}, z_i, z_{i+1}; bigram-status indicators x_i, x_{i+1}, x_{i+2}; words w_{i−1}, w_i, w_{i+1}; time-stamps t_{i−1}, t_i, t_{i+1}; parameters ψ with prior γ, φ with prior β, σ with prior δ, and per-topic Ω; plates over D documents, T topics, and TW topic-word pairs.)


Generative Process

N-gram Topics Over Time Model

Draw Discrete(φ_z) from Dirichlet(β) for each topic z;
Draw Bernoulli(ψ_zw) from Beta(γ) for each topic z and each word w;
Draw Discrete(σ_zw) from Dirichlet(δ) for each topic z and each word w;
For every document d, draw Discrete(θ^(d)) from Dirichlet(α);
foreach word w_i^(d) in document d do
    Draw x_i^(d) from Bernoulli(ψ_{z_{i−1}^(d) w_{i−1}^(d)});
    Draw z_i^(d) from Discrete(θ^(d));
    Draw w_i^(d) from Discrete(σ_{z_i^(d) w_{i−1}^(d)}) if x_i^(d) = 1; otherwise, draw w_i^(d) from Discrete(φ_{z_i^(d)});
    Draw a time-stamp t_i^(d) from Beta(Ω_{z_i^(d)});
end
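A minimal NumPy sketch of this generative process; the sizes, hyperparameter values, and the dummy history used before the first word of a document are illustrative assumptions, not choices from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and symmetric hyperparameters (assumptions, not from the paper).
T, W, D, N_d = 5, 200, 10, 50
alpha, beta, delta = 0.1, 0.01, 0.01
gamma = (1.0, 1.0)                                     # Beta prior on the bigram-status switch

phi   = rng.dirichlet(np.full(W, beta),  size=T)       # phi_z    ~ Dirichlet(beta)
psi   = rng.beta(gamma[0], gamma[1], size=(T, W))      # psi_zw   ~ Beta(gamma)
sigma = rng.dirichlet(np.full(W, delta), size=(T, W))  # sigma_zw ~ Dirichlet(delta)
Omega = rng.uniform(1.0, 5.0, size=(T, 2))             # per-topic Beta parameters over time

for d in range(D):
    theta = rng.dirichlet(np.full(T, alpha))           # theta^(d) ~ Dirichlet(alpha)
    z_prev, w_prev = 0, 0                               # dummy history before the first word
    for i in range(N_d):
        x = rng.binomial(1, psi[z_prev, w_prev]) if i > 0 else 0  # x_i ~ Bernoulli(psi_{z_{i-1} w_{i-1}})
        z = rng.choice(T, p=theta)                      # z_i ~ Discrete(theta^(d))
        if x == 1:                                      # continue an n-gram from the previous word
            w = rng.choice(W, p=sigma[z, w_prev])       # w_i ~ Discrete(sigma_{z_i w_{i-1}})
        else:                                           # otherwise draw a unigram
            w = rng.choice(W, p=phi[z])                 # w_i ~ Discrete(phi_{z_i})
        t = rng.beta(Omega[z, 0], Omega[z, 1])          # t_i ~ Beta(Omega_{z_i})
        z_prev, w_prev = z, w
```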


Posterior Inference

Collapsed Gibbs Sampling

$$
P(z_i^{(d)}, x_i^{(d)} \mid \mathbf{w}, \mathbf{t}, \mathbf{x}_{\neg i}^{(d)}, \mathbf{z}_{\neg i}^{(d)}, \alpha, \beta, \gamma, \delta, \Omega) \propto
\left( \gamma_{x_i^{(d)}} + p_{z_{i-1}^{(d)} w_{i-1}^{(d)} x_i} - 1 \right)
\times \left( \alpha_{z_i^{(d)}} + q_{d z_i^{(d)}} - 1 \right)
\times \frac{\bigl(1 - t_i^{(d)}\bigr)^{\Omega_{z_i^{(d)} 1} - 1} \bigl(t_i^{(d)}\bigr)^{\Omega_{z_i^{(d)} 2} - 1}}{B\bigl(\Omega_{z_i^{(d)} 1}, \Omega_{z_i^{(d)} 2}\bigr)}
\times
\begin{cases}
\dfrac{\beta_{w_i^{(d)}} + n_{z_i^{(d)} w_i^{(d)}} - 1}{\sum_{v=1}^{W} \bigl( \beta_v + n_{z_i^{(d)} v} \bigr) - 1} & \text{if } x_i^{(d)} = 0 \\[2ex]
\dfrac{\delta_{w_i^{(d)}} + m_{z_i^{(d)} w_{i-1}^{(d)} w_i^{(d)}} - 1}{\sum_{v=1}^{W} \bigl( \delta_v + m_{z_i^{(d)} w_{i-1}^{(d)} v} \bigr) - 1} & \text{if } x_i^{(d)} = 1
\end{cases}
\qquad (3)
$$

Posterior Estimates

$$\hat{\theta}_z^{(d)} = \frac{\alpha_z + q_{dz}}{\sum_{t=1}^{T} (\alpha_t + q_{dt})} \qquad (4)$$

$$\hat{\phi}_{zw} = \frac{\beta_w + n_{zw}}{\sum_{v=1}^{W} (\beta_v + n_{zv})} \qquad (5)$$

$$\hat{\psi}_{zwk} = \frac{\gamma_k + p_{zwk}}{\sum_{k=0}^{1} (\gamma_k + p_{zwk})} \qquad (6)$$

$$\hat{\sigma}_{zwv} = \frac{\delta_v + m_{zwv}}{\sum_{v'=1}^{W} (\delta_{v'} + m_{zwv'})} \qquad (7)$$

$$\hat{\Omega}_{z1} = \bar{t}_z \left( \frac{\bar{t}_z (1 - \bar{t}_z)}{s_z^2} - 1 \right) \qquad (8)$$

$$\hat{\Omega}_{z2} = (1 - \bar{t}_z) \left( \frac{\bar{t}_z (1 - \bar{t}_z)}{s_z^2} - 1 \right) \qquad (9)$$
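The per-token score in Equation (3) can be sketched the same way as for TOT, now returning a score for every (topic, unigram/bigram status) pair. The helper below is our own illustration: the count arrays are assumed to exclude the current token (absorbing the −1 terms), the hyperparameters are symmetric scalars for simplicity, and the standard Beta(a, b) density is used for the time-stamp.

```python
import numpy as np
from scipy.stats import beta as beta_dist

def ngram_tot_scores(w, w_prev, z_prev, t, q_d, p_zwx, n_zw, m_zww,
                     alpha, beta, gamma, delta, Omega):
    """Unnormalized joint scores over (z_i, x_i) for one token, following Eq. (3).

    q_d   : (T,)      per-topic counts in the current document
    p_zwx : (T, W, 2) bigram-status counts, indexed by previous topic and previous word
    n_zw  : (T, W)    unigram topic-word counts
    m_zww : (T, W, W) bigram topic-word counts, indexed by previous word
    Returns an array of shape (T, 2); column 0 is x = 0, column 1 is x = 1.
    """
    T, W = n_zw.shape
    doc_term  = alpha + q_d                                       # document-topic factor
    time_term = beta_dist.pdf(t, Omega[:, 0], Omega[:, 1])        # Beta density of the time-stamp
    stat_term = gamma + p_zwx[z_prev, w_prev]                     # (2,) unigram/bigram status factor
    uni = (beta  + n_zw[:, w])          / (W * beta  + n_zw.sum(axis=1))              # x = 0 branch
    big = (delta + m_zww[:, w_prev, w]) / (W * delta + m_zww[:, w_prev].sum(axis=1))  # x = 1 branch
    scores = np.stack([uni * stat_term[0], big * stat_term[1]], axis=1)
    return scores * (doc_term * time_term)[:, None]

# Drawing (z, x): flatten, normalize, sample an index; divmod recovers the pair.
# s = ngram_tot_scores(...).ravel()
# z_new, x_new = divmod(np.random.default_rng().choice(s.size, p=s / s.sum()), 2)
```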


Inference Algorithm

Input: γ, δ, α, T, β, Corpus, MaxIteration
Output: Topic assignments for all the n-gram words, with temporal information

Initialization: randomly initialize the n-gram topic assignments for all words;
Zero all count variables;
for iteration ← 1 to MaxIteration do
    for d ← 1 to D do
        for w ← 1 to N_d, following word order, do
            Draw z_w^(d), x_w^(d) as defined in Equation (3);
            if x_w^(d) = 0 then update n_zw;
            else update m_zw;
            Update q_dz and p_zw;
        end
    end
    for z ← 1 to T do
        Update Ω_z by the method of moments as in Equations (8) and (9);
    end
end
Compute the posterior estimates θ̂, φ̂, ψ̂, σ̂ defined in Equations (4), (5), (6), (7);
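The per-topic Ω update in the algorithm is the method-of-moments estimator of Equations (8) and (9); a small sketch follows (our own helper, assuming time-stamps have been rescaled to the unit interval; the eps floor is our own guard, not part of the slides).

```python
import numpy as np

def update_omega(timestamps_z, eps=1e-3):
    """Method-of-moments update of (Omega_z1, Omega_z2) for one topic, Eqs. (8)-(9).

    timestamps_z : time-stamps (rescaled to (0, 1)) of all tokens currently
                   assigned to topic z.
    """
    t_bar = float(np.mean(timestamps_z))
    s2    = float(np.var(timestamps_z))
    common = t_bar * (1.0 - t_bar) / s2 - 1.0          # shared factor in Eqs. (8) and (9)
    return max(t_bar * common, eps), max((1.0 - t_bar) * common, eps)

# Example usage with hypothetical time-stamps assigned to one topic:
# omega_z1, omega_z2 = update_omega(np.array([0.41, 0.43, 0.45, 0.50, 0.52]))
```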


Empirical Evaluation

Data Sets

We have conducted experiments on two datasets:

1. U.S. Presidential State-of-the-Union¹ speeches from 1790 to 2002.

2. NIPS conference papers: the original raw NIPS dataset² consists of 17 years of conference papers; we supplemented it with some newer raw NIPS documents³, giving 19 years of papers in total.

Preprocessing

1. Removed stopwords.

2. Did not perform word stemming.

¹ http://infomotions.com/etexts/gutenberg/dirs/etext04/suall11.txt
² http://www.cs.nyu.edu/roweis/data.html
³ http://ai.stanford.edu/gal/Data/NIPS/


Qualitative Results

(Histograms over the years 1800-2000 for the "Mexican War" topic: left, Our Model; right, TOT.)

Our Model (Mexican War): 1. east bank, 2. american coins, 3. mexican flag, 4. separate independent, 5. american commonwealth, 6. mexican population, 7. texan troops, 8. military, 9. general herrera, 10. foreign coin, 11. military usurper, 12. mexican treasury, 13. invaded texas, 14. veteran troops

TOT (Mexican War): 1. mexico, 2. texas, 3. war, 4. mexican, 5. united, 6. country, 7. government, 8. territory, 9. army, 10. peace, 11. act, 12. policy, 13. foreign, 14. citizens


Qualitative Results

Topics change over time

(Histograms over the years 1800-2000 for the "Panama Canal" topic: left, Our Model; right, TOT.)

Our Model (Panama Canal): 1. panama canal, 2. isthmian canal, 3. isthmus panama, 4. republic panama, 5. united states government, 6. united states, 7. state panama, 8. united states senate, 9. french canal company, 10. caribbean sea, 11. panama canal bonds, 12. panama, 13. american control, 14. canal

TOT (Panama Canal): 1. government, 2. cuba, 3. islands, 4. international, 5. powers, 6. gold, 7. action, 8. spanish, 9. island, 10. act, 11. commission, 12. officers, 13. spain, 14. rico


Qualitative Results

Topics change over time - TOT

NIPS-1987: cells, cell, model, response, firing, activity, input, neurons, stimulus, figure
NIPS-1988: network, learning, input, units, training, output, layer, hidden, weights, networks
NIPS-1995: data, model, algorithm, method, probability, models, problem, distribution, information
NIPS-1996: function, data, set, distribution, model, models, neural, probability, parameters, networks
NIPS-2004 / NIPS-2005: algorithm, state, learning, time, algorithms, step, action, node, policy, learning, data, set, training, algorithm, test, number, kernel, classification, class, set, sequence

Figure: Top probable words from the posterior inference in NIPS, year-wise.


Qualitative Results

Topics change over time - Our Model

NIPS-1987: orientation map, firing threshold, time delay, neural state, low conduction safety, correlogram peak, centric models, long channel, synaptic chip, frog sciatic nerve
NIPS-1988: neural networks, hidden units, hidden layer, neural network, training set, mit press, hidden unit, learning algorithm, output units, output layer
NIPS-1995: linear algebra, input signals, gaussian filters, optical flow, model matching, resistive line, input signal, analog vlsi, depth map, temporal precision
NIPS-1996: probability vector, relevant documents, continuous embedding, doubly stochastic matrix, probability vectors, binding energy, energy costs, variability index, learning bayesian, polynomial time
NIPS-2004: optimal policy, build stack, reinforcement learning, nash equilibrium, suit stack, synthetic items, compressed map, reward function, td networks, intrinsic reward
NIPS-2005: kernel cca, empirical risk, training sample, data clustering, random selection, gaussian regression, online hypothesis, linear separators, covariance operator, line algorithm

Figure: Top ten probable phrases from the posterior inference in NIPS, year-wise.


Qualitative Results

Topics change over time

(Histograms over the years 1990-2005 for the "recurrent NNs" topic.)

Our Model: 1. hidden unit, 2. neural net, 3. input layer, 4. recurrent network, 5. hidden layers, 6. learning algorithms, 7. error signals, 8. recurrent connections, 9. training pattern, 10. recurrent cascade

TOT: 1. state, 2. time, 3. sequence, 4. states, 5. model, 6. sequences, 7. recurrent, 8. models, 9. markov, 10. transition

Figure: A topic related to "recurrent NNs" comprising n-gram words, obtained from both models. The histograms depict how the topic is distributed over time and are fitted with Beta probability density functions.


Quantitative Results

Predicting the decade on the State-of-the-Union dataset

1. We computed the time-stamp prediction performance.

2. We learn a model on a subset of the data randomly sampled from the collection.

3. Given a new document, we compute the likelihood for the decade prediction.

            L1 Error   E(L1)   Accuracy
Our Model   1.60       1.65    0.25
TOT         1.95       1.99    0.20

Table: Results of decade prediction on the State-of-the-Union speeches dataset.
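For reference, the three columns can be computed as follows; this is our reading of L1 error (absolute distance, in decades, between the predicted and true decade), E(L1) (expected L1 under the predictive distribution over decades), and accuracy (exact-decade match). The function name and signature are ours.

```python
import numpy as np

def decade_prediction_metrics(true_decade, decade_dist, decades):
    """L1 error, expected L1 error, and accuracy for decade prediction.

    true_decade : (N,)   true decade index of each test document
    decade_dist : (N, K) per-document predictive distribution over the K decades
    decades     : (K,)   the candidate decade indices
    """
    pred = decades[np.argmax(decade_dist, axis=1)]             # most likely decade per document
    l1   = np.mean(np.abs(pred - true_decade))                 # L1 error of the top prediction
    dist = np.abs(decades[None, :] - true_decade[:, None])     # |candidate - truth| matrix
    e_l1 = np.mean(np.sum(decade_dist * dist, axis=1))         # expected L1 under the distribution
    acc  = np.mean(pred == true_decade)                        # exact-decade accuracy
    return l1, e_l1, acc
```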


Conclusions and Future Work

1. We have presented an n-gram topic model that captures both the temporal structure and the n-gram words in time-stamped documents.

2. Topics found by our model are more interpretable, with better qualitative and quantitative performance on two publicly available datasets.

3. We have derived a collapsed Gibbs sampler for faster posterior inference.

4. An advantage of our model is that it does away with ambiguities that might appear among the words in topics.

Future Work

Explore non-parametric methods for n-gram topics over time.


References

David M. Blei and John D. Lafferty. 2006. Dynamic topic models. In Proc. of ICML, 113-120.
Knights, D., Mozer, M., and Nicolov, N. 2009. Detecting topic drift with compound topic models. In Proc. of ICWSM.
Noriaki Kawamae. 2011. Trend analysis model: Trend consists of temporal words, topics, and timestamps. In Proc. of WSDM, 317-326.
Hanna M. Wallach. 2006. Topic modeling: beyond bag-of-words. In Proc. of ICML, 977-984.
Griffiths, T. L., Steyvers, M., and Tenenbaum, J. B. 2007. Topics in semantic representation. Psychological Review, 114(2), 211.
Wang, X., McCallum, A., and Wei, X. 2007. Topical n-grams: Phrase and topic discovery, with an application to information retrieval. In Proc. of ICDM, 697-702.
Xuerui Wang and Andrew McCallum. 2006. Topics over time: a non-Markov continuous-time model of topical trends. In Proc. of KDD, 424-433.
