SLIDE 1

A Probabilistic Approach to Spatiotemporal Theme Pattern Mining on Weblogs

Qiaozhu Mei†, Chao Liu†, Hang Su‡, and ChengXiang Zhai†

†: University of Illinois at Urbana-Champaign ‡: Vanderbilt University

SLIDE 2

Weblog as an emerging new data…


SLIDE 3

An Example of Weblog Article

[Figure: a weblog article annotated with its time stamp, location info, and blog contents]

SLIDE 4

Characteristics of Weblogs

A weblog article is:

  • Highly personal, with opinions and with mixed topics
  • Associated with time & location
  • Interlinking & forming communities
  • An immediate response to events

SLIDE 5

Existing Work on Weblog Analysis

  • Interlinking and Community Analysis
– Identifying communities
– Monitoring the evolution and bursting of communities
– E.g., [Kumar et al. 2003]

[Figure: # of communities and # of nodes in communities]

  • Content Analysis
– Blog level topic analysis
– Information diffusion through blogspace
– Use topic bursting to predict sales spikes
– E.g., [Gruhl et al. 2005]

[Figure: sales rank vs. blog mentions]

SLIDE 6

How to Perform Spatiotemporal Theme Mining?

  • Given a collection of Weblog articles about a topic, with time and location information
– Discover multiple themes (i.e., subtopics) being discussed in these articles
– For a given location, discover how each theme evolves over time (generate a theme life cycle)
– For a given time, reveal how each theme spreads over locations (generate a theme snapshot)
– Compare theme life cycles in different locations
– Compare theme snapshots in different time periods
– …

SLIDE 7

Spatiotemporal Theme Patterns

  • A theme snapshot (strength over locations, 09/20/05 – 09/26/05): discussion about “Government Response” in articles about Hurricane Katrina
  • Theme life cycles (strength over time, for the United States, China, and Canada): discussion about “Release of iPod Nano” in articles about “iPod Nano”

SLIDE 8

Applications of Spatiotemporal Theme Mining

  • Help answer questions like
– Which country responded first to the release of iPod Nano? China, UK, or Canada?
– Do people in different states (e.g., Illinois vs. Texas) respond differently/similarly to the increase of gas price during Hurricane Katrina?

  • Potentially useful for
– Summarizing search results
– Monitoring public opinions
– Business Intelligence
– …

SLIDE 9

Challenges in Spatiotemporal Theme Mining

  • How to represent a theme?
  • How to model the themes in a collection?
  • How to model their dependency on time and location?
  • How to compute the theme life cycles and theme snapshots?
  • All these must be done in an unsupervised way…

SLIDE 10

Our Solution: Use a Probabilistic Spatiotemporal Theme Model

  • Each theme is represented as a multinomial distribution over the vocabulary (language model)
  • Consider the collection as a sample from a mixture of these theme models
  • Fit the model to the data and estimate the parameters
  • Spatiotemporal theme patterns can then be computed from the estimated model parameters

SLIDE 11

Probabilistic Spatiotemporal Theme Model

[Figure: how a word in document d (time = t, location = l) is generated]

  • Themes θ1, θ2, …, θk are word distributions, e.g., {price 0.3, oil 0.2, ..}, {donate 0.1, relief 0.05, help 0.02, ..}, {city 0.2, new 0.1, orleans 0.05, ..}
  • Background B is also a word distribution: {is 0.05, the 0.04, a 0.03, ..}
  • For each word (e.g., oil, donate, city, the, …): choose a theme θi, then draw a word from θi
  • Probability of choosing theme θi = (1 − λTL) P(θi|d) + λTL P(θi|t, l)
  • λTL = weight on spatiotemporal theme distribution

SLIDE 12

The “Generation” Process

  • A document d of location l and time t is generated, word by word, as follows
– First, decide whether to use the background theme θB
    • With probability λB, we’ll use the background theme and draw a word w from p(w|θB)
– If the background theme is not to be used, we’ll decide how to choose a topic theme
    • With probability λTL, we’ll sample a theme using the “shared spatiotemporal distribution” p(θ|t,l)
    • With probability 1 − λTL, we’ll sample a theme using p(θ|d)
– Draw a word w from the selected theme distribution p(w|θi)

  • Parameters
– {p(w|θB), p(w|θi), p(θ|t,l), p(θ|d)} (will be estimated)
– λB = background noise; λTL = weight on spatiotemporal modeling (will be manually set)
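The generation process on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `generate_document`, the dictionary layout, and all probabilities in the usage note are hypothetical and do not come from the slides.

```python
import random


def generate_document(n_words, p_w_bg, p_w_theme, p_theme_doc, p_theme_tl,
                      lambda_b, lambda_tl, rng):
    """Sample n_words words, one at a time, following the slide's process.

    p_w_bg      : {word: prob}          background theme p(w|theta_B)
    p_w_theme   : {theme: {word: prob}} topic themes p(w|theta_i)
    p_theme_doc : {theme: prob}         document-specific p(theta|d)
    p_theme_tl  : {theme: prob}         shared spatiotemporal p(theta|t,l)
    """
    def draw(dist):
        # Draw one key from a {key: probability} dictionary.
        keys = list(dist)
        return rng.choices(keys, weights=[dist[key] for key in keys], k=1)[0]

    words = []
    for _ in range(n_words):
        # Step 1: with probability lambda_b, use the background theme.
        if rng.random() < lambda_b:
            words.append(draw(p_w_bg))
            continue
        # Step 2: otherwise pick a topic theme, from p(theta|t,l) with
        # probability lambda_tl and from p(theta|d) with probability 1 - lambda_tl.
        if rng.random() < lambda_tl:
            theme = draw(p_theme_tl)
        else:
            theme = draw(p_theme_doc)
        # Step 3: draw the word from the selected theme.
        words.append(draw(p_w_theme[theme]))
    return words
```

For instance, calling it with `p_w_theme = {"oil_price": {"price": 0.6, "oil": 0.4}, "new_orleans": {"city": 0.7, "orleans": 0.3}}`, a background of `{"the": 0.6, "a": 0.4}`, and `rng = random.Random(0)` yields a list of words mixing background and theme vocabulary in proportions governed by `lambda_b` and `lambda_tl`.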

SLIDE 13

The Likelihood Function

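$$
\log p(C) = \sum_{d \in C} \sum_{w \in V} c(w,d) \times \log \Big[ \lambda_B\, p(w|\theta_B) + (1-\lambda_B) \sum_{j=1}^{k} p(w|\theta_j) \big( (1-\lambda_{TL})\, p(\theta_j|d) + \lambda_{TL}\, p(\theta_j|t_d, l_d) \big) \Big]
$$

  • $c(w,d)$: count of word w in document d
  • $\lambda_B\, p(w|\theta_B)$: generating w using the background theme
  • $(1-\lambda_B) \sum_{j} p(w|\theta_j)(\cdot)$: generating w using a topic theme
  • $(1-\lambda_{TL})\, p(\theta_j|d)$: choosing a topic theme according to the document
  • $\lambda_{TL}\, p(\theta_j|t_d, l_d)$: choosing a topic theme according to the spatiotemporal context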
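As a rough illustration, the log-likelihood of this slide can be evaluated directly from word counts. The sketch below assumes a hypothetical data layout that is not part of the slides: each document is a dictionary with `time`, `loc`, and `counts`; themes are indexed 0..k-1.

```python
import math


def log_likelihood(docs, p_w_bg, p_w_theme, p_theme_doc, p_theme_tl,
                   lambda_b, lambda_tl):
    """Log-likelihood of the collection under the mixture model.

    docs        : list of {"time": t, "loc": l, "counts": {word: count}}
    p_w_bg      : {word: prob}                 background theme
    p_w_theme   : list of {word: prob}         one entry per topic theme
    p_theme_doc : list (per doc) of lists      p(theta_j|d)
    p_theme_tl  : {(t, l): list}               p(theta_j|t,l)
    """
    k = len(p_w_theme)
    total = 0.0
    for d, doc in enumerate(docs):
        context = (doc["time"], doc["loc"])
        for word, count in doc["counts"].items():
            # Probability of generating the word from some topic theme.
            topic_part = sum(
                p_w_theme[j].get(word, 0.0)
                * ((1 - lambda_tl) * p_theme_doc[d][j]
                   + lambda_tl * p_theme_tl[context][j])
                for j in range(k)
            )
            # Mix with the background theme, then weight by the word count.
            total += count * math.log(
                lambda_b * p_w_bg[word] + (1 - lambda_b) * topic_part
            )
    return total
```

A word whose mixture probability is zero would raise a `ValueError` in `math.log`; the sketch leaves that case unhandled on the assumption that the background theme gives every observed word a nonzero probability.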

SLIDE 14

Parameter Estimation

  • Use the maximum likelihood estimator
  • Use the Expectation-Maximization (EM) algorithm
  • p(w|θB) is set to the collection word probability

E Step

$$
p(z_{d,w}=j) = \frac{(1-\lambda_B)\, p^{(m)}(w|\theta_j) \big[ (1-\lambda_{TL})\, p^{(m)}(\theta_j|d) + \lambda_{TL}\, p^{(m)}(\theta_j|t_d,l_d) \big]}{\lambda_B\, p(w|\theta_B) + (1-\lambda_B) \sum_{j'=1}^{k} p^{(m)}(w|\theta_{j'}) \big[ (1-\lambda_{TL})\, p^{(m)}(\theta_{j'}|d) + \lambda_{TL}\, p^{(m)}(\theta_{j'}|t_d,l_d) \big]}
$$

$$
p(y_{d,w,j}=1) = \frac{\lambda_{TL}\, p^{(m)}(\theta_j|t_d,l_d)}{(1-\lambda_{TL})\, p^{(m)}(\theta_j|d) + \lambda_{TL}\, p^{(m)}(\theta_j|t_d,l_d)}
$$

M Step

$$
p^{(m+1)}(\theta_j|d) = \frac{\sum_{w \in V} c(w,d)\, p(z_{d,w}=j) \big( 1 - p(y_{d,w,j}=1) \big)}{\sum_{j'=1}^{k} \sum_{w \in V} c(w,d)\, p(z_{d,w}=j') \big( 1 - p(y_{d,w,j'}=1) \big)}
$$

$$
p^{(m+1)}(\theta_j|t,l) = \frac{\sum_{d:\, t_d=t,\, l_d=l} \sum_{w \in V} c(w,d)\, p(z_{d,w}=j)\, p(y_{d,w,j}=1)}{\sum_{j'=1}^{k} \sum_{d:\, t_d=t,\, l_d=l} \sum_{w \in V} c(w,d)\, p(z_{d,w}=j')\, p(y_{d,w,j'}=1)}
$$

$$
p^{(m+1)}(w|\theta_j) = \frac{\sum_{d \in C} c(w,d)\, p(z_{d,w}=j)}{\sum_{w' \in V} \sum_{d \in C} c(w',d)\, p(z_{d,w'}=j)}
$$
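A minimal pure-Python sketch of these EM updates follows. It is not the authors' implementation: the names `normalize`, `em_step`, and `fit`, the document layout (dictionaries with `time`, `loc`, `counts`), the random initialization, and the default values of `lambda_b` and `lambda_tl` are all assumptions made for illustration.

```python
import math
import random


def normalize(values):
    """Scale non-negative numbers to sum to 1 (uniform fallback if all are zero)."""
    total = sum(values)
    if total == 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def em_step(docs, vocab, p_w_bg, p_w_theme, p_theme_doc, p_theme_tl,
            lambda_b, lambda_tl):
    """One EM iteration. Returns updated parameters and the log-likelihood
    of the parameters that were passed in."""
    k = len(p_w_theme)
    acc_w = [{w: 0.0 for w in vocab} for _ in range(k)]
    acc_doc = [[0.0] * k for _ in docs]
    acc_tl = {tl: [0.0] * k for tl in p_theme_tl}
    log_lik = 0.0
    for d, doc in enumerate(docs):
        tl = (doc["time"], doc["loc"])
        for w, c in doc["counts"].items():
            mix = [(1 - lambda_tl) * p_theme_doc[d][j]
                   + lambda_tl * p_theme_tl[tl][j] for j in range(k)]
            joint = [(1 - lambda_b) * p_w_theme[j][w] * mix[j] for j in range(k)]
            denom = lambda_b * p_w_bg[w] + sum(joint)
            log_lik += c * math.log(denom)
            for j in range(k):
                # E step: posteriors of the hidden variables z and y.
                p_z = joint[j] / denom
                p_y = lambda_tl * p_theme_tl[tl][j] / mix[j] if mix[j] > 0 else 0.0
                # Accumulate the expected counts used by the M step.
                acc_w[j][w] += c * p_z
                acc_doc[d][j] += c * p_z * (1 - p_y)
                acc_tl[tl][j] += c * p_z * p_y
    # M step: renormalize the expected counts.
    new_w = [dict(zip(vocab, normalize([acc[w] for w in vocab]))) for acc in acc_w]
    new_doc = [normalize(row) for row in acc_doc]
    new_tl = {tl: normalize(row) for tl, row in acc_tl.items()}
    return new_w, new_doc, new_tl, log_lik


def fit(docs, k, lambda_b=0.9, lambda_tl=0.5, n_iter=20, seed=0):
    """Run EM from a random start; the lambda defaults are arbitrary choices."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc["counts"]})
    total = sum(c for doc in docs for c in doc["counts"].values())
    # The background theme is fixed to the collection word probability.
    p_w_bg = {w: sum(doc["counts"].get(w, 0) for doc in docs) / total
              for w in vocab}
    p_w_theme = [dict(zip(vocab, normalize([rng.random() + 0.1 for _ in vocab])))
                 for _ in range(k)]
    p_theme_doc = [normalize([rng.random() + 0.1 for _ in range(k)]) for _ in docs]
    p_theme_tl = {(doc["time"], doc["loc"]):
                  normalize([rng.random() + 0.1 for _ in range(k)]) for doc in docs}
    history = []
    for _ in range(n_iter):
        p_w_theme, p_theme_doc, p_theme_tl, log_lik = em_step(
            docs, vocab, p_w_bg, p_w_theme, p_theme_doc, p_theme_tl,
            lambda_b, lambda_tl)
        history.append(log_lik)
    return p_w_theme, p_theme_doc, p_theme_tl, history
```

Because every update is a standard EM step, the values in `history` should not decrease from one iteration to the next, which gives a simple sanity check on an implementation like this one.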

SLIDE 15

Probabilistic Analysis of Spatiotemporal Themes

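  • Once the parameters are estimated, we can easily perform probabilistic analysis of spatiotemporal themes
– Computing theme life cycles given location
– Computing theme snapshots given time

Theme life cycle for a given location $\tilde{l}$:

$$
p(t|\theta_j,\tilde{l}) = \frac{p(\theta_j|t,\tilde{l})\, p(t,\tilde{l})}{\sum_{\tilde{t} \in T} p(\theta_j|\tilde{t},\tilde{l})\, p(\tilde{t},\tilde{l})}
$$

Theme snapshot for a given time $\tilde{t}$:

$$
p(\theta_j, l|\tilde{t}) = \frac{p(\theta_j|\tilde{t},l)\, p(\tilde{t},l)}{\sum_{\tilde{l} \in L} \sum_{j'=1}^{k} p(\theta_{j'}|\tilde{t},\tilde{l})\, p(\tilde{t},\tilde{l})}
$$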
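Both patterns are applications of Bayes' rule to the estimated p(θ|t,l), so they can be sketched compactly. In the sketch below the function names and the dictionary layout are hypothetical, and the prior p(t,l) is simply passed in as `p_tl`, since the slides do not say how it is obtained.

```python
def theme_life_cycle(p_theme_tl, p_tl, theme, loc):
    """Return {t: p(t | theme, loc)} over the time stamps observed at loc.

    p_theme_tl : {(t, l): list of p(theta_j|t,l)}
    p_tl       : {(t, l): p(t,l)}  (how this prior is estimated is an assumption
                 left to the caller)
    """
    scores = {t: p_theme_tl[(t, l)][theme] * p_tl[(t, l)]
              for (t, l) in p_theme_tl if l == loc}
    total = sum(scores.values())
    return {t: score / total for t, score in scores.items()}


def theme_snapshot(p_theme_tl, p_tl, time):
    """Return {(j, l): p(theta_j, l | time)} over all themes and locations."""
    scores = {(j, l): prob * p_tl[(t, l)]
              for (t, l), dist in p_theme_tl.items() if t == time
              for j, prob in enumerate(dist)}
    total = sum(scores.values())
    return {key: score / total for key, score in scores.items()}
```

Both functions divide by the total mass at the given location or time, so they assume that at least one (t, l) pair with nonzero probability exists there.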

SLIDE 16

Experiments and Results

  • Three time-stamped data sets of weblogs, each about one event (broad topic):

Data Set     # docs   Time Span (2005)   Query
Katrina      9377     08/16 - 10/04      Hurricane Katrina
Rita         1754     08/16 - 10/04      Hurricane Rita
iPod Nano    1720     09/02 - 10/26      iPod Nano

  • Extract location information from author profiles
  • On each data set, we extract a set of salient themes and their life cycles / theme snapshots

SLIDE 17

Theme Life Cycles for Hurricane Katrina

  • New Orleans: city 0.0634, orleans 0.0541, new 0.0342, louisiana 0.0235, flood 0.0227, evacuate 0.0211, storm 0.0177, …
  • Oil Price: price 0.0772, oil 0.0643, gas 0.0454, increase 0.0210, product 0.0203, fuel 0.0188, company 0.0182, …

SLIDE 18

Theme Snapshots for Hurricane Katrina

  • Week1: The theme is the strongest along the Gulf of Mexico
  • Week2: The discussion moves towards the north and west
  • Week3: The theme distributes more uniformly over the states
  • Week4: The theme is again strong along the east coast and the Gulf of Mexico
  • Week5: The theme fades out in most states

SLIDE 19

Theme life cycles for Hurricane Rita

[Figure: life cycles of the themes “Hurricane Katrina: Government Response”, “Hurricane Rita: Government Response”, and “Hurricane Rita: Storms”]

  • A theme in Hurricane Katrina is inspired again by Hurricane Rita

SLIDE 20

Theme Snapshots for Hurricane Rita

  • Both Hurricane Katrina and Hurricane Rita have the theme “Oil Price”
  • The spatiotemporal patterns of this theme in the same time period are similar

SLIDE 21

Theme Life Cycles for iPod Nano

  • Release of Nano: ipod 0.2875, nano 0.1646, apple 0.0813, september 0.0510, mini 0.0442, screen 0.0242, new 0.0200, …
  • Locations compared: United States, China, United Kingdom, Canada

SLIDE 22

Contributions and Future Work

  • Contributions
– Defined a new problem -- spatiotemporal text mining
– Proposed a general mixture model for the mining task
– Proposed methods for computing two spatiotemporal patterns -- theme life cycles and theme snapshots
– Applied it to Weblog mining with interesting results

  • Future work
– Capture content dependency between adjacent time stamps and locations
– Study granularity selection in spatiotemporal text mining

SLIDE 23

Thank You!