Outtwitting the Twitterers: Predicting Information Cascades in Microblogs



SLIDE 1

Outtwitting the Twitterers: Predicting Information Cascades in Microblogs

Wojciech Galuba, Karl Aberer

EPFL, Switzerland

Dipanjan Chakraborty

IBM Research India

Zoran Despotovic, Wolfgang Kellerer

Docomo Euro-Labs, Munich, Germany

SLIDE 2

Why study information flows in OSNs?

casual link sharing, breaking news, viral marketing, activism, PR campaigns, emergencies

Modeling

- improve how information flows
- new applications
- insights into underlying sociology

SLIDE 3

Information overload?

Median: 23 tw/h, 552 tw/day

Full-time job (reading tweets 40 h a week at 150 WPM)

(Sep 2009 data)
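As a rough sanity check of the numbers above, the sketch below converts the median hourly rate into a daily rate and estimates the incoming rate at which reading tweets would fill a 40-hour week. The words-per-tweet figure and the function names are assumptions made for illustration; they do not come from the slides.

```python
# Back-of-the-envelope check of the information-overload numbers.
MEDIAN_TWEETS_PER_HOUR = 23      # from the slide
READING_SPEED_WPM = 150          # from the slide
READING_HOURS_PER_WEEK = 40      # from the slide
WORDS_PER_TWEET = 15             # ASSUMPTION, not from the slide


def tweets_per_day(tweets_per_hour: float) -> float:
    """Convert an hourly incoming rate into a daily one."""
    return tweets_per_hour * 24


def full_time_rate(wpm: float, hours_per_week: float,
                   words_per_tweet: float) -> float:
    """Incoming rate (tweets per clock hour) that uses up the weekly reading budget."""
    words_per_week = wpm * 60 * hours_per_week
    tweets_per_week = words_per_week / words_per_tweet
    return tweets_per_week / (7 * 24)


if __name__ == "__main__":
    print(tweets_per_day(MEDIAN_TWEETS_PER_HOUR))
    print(full_time_rate(READING_SPEED_WPM, READING_HOURS_PER_WEEK, WORDS_PER_TWEET))
```

The second figure moves inversely with the assumed words per tweet, so it should be read as an order-of-magnitude estimate only.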

SLIDE 4

OSN information spread modeling

- Related work:
  - generative models
  - reproduce statistical properties of info spread
  - predict coarse-grained aggregates
    - # of nodes reached by spread etc.
- Our approach:
  - Look at URL diffusion on Twitter
  - Can we predict which user will mention which URL with what probability?

SLIDE 5

Why predict URL tweets?

- Protect from information overload
  - Sort incoming URLs by probability of retweeting
- Viral marketing
  - Select a subset of users that ensure successful URL propagation
- Spam detection
  - Mispredictions are a sign of anomalous activity

SLIDE 6

SLIDE 7

Data

- 300-hour window in Sep '09
- 22M tweets
- 2.7M unique users
- 15M unique URLs
- 700M connections in the follower graph
- Approx. 1/15th of the Twitter traffic

SLIDE 8

Follower graph*

* active users only: users that have sent at least one URL in the 300 h window

SLIDE 9

Follower graph*

Mean path length (directed): 3.61 hops

* active users only: users that have sent at least one URL in the 300 h window
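The mean above is read here as the mean shortest-path length between users in the directed follower graph. A minimal sketch of how such a figure could be computed with breadth-first search follows; the function name and the toy graph are hypothetical, and the slides do not say how the reported value was obtained (on a graph with hundreds of millions of arcs it would presumably be estimated from a sample of sources).

```python
from collections import deque


def mean_directed_path_length(graph: dict[str, list[str]]) -> float:
    """Mean shortest-path length over all ordered pairs (s, t), s != t,
    for which t is reachable from s by following arcs."""
    total = 0
    pairs = 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in graph.get(node, []):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Hypothetical toy graph: a directed 3-cycle.
TOY_GRAPH = {"alice": ["bob"], "bob": ["charlie"], "charlie": ["alice"]}

if __name__ == "__main__":
    print(mean_directed_path_length(TOY_GRAPH))
```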

SLIDE 10

User activity

SLIDE 11

Per-URL activity

SLIDE 12

Information cascades

Nodes: users that mentioned a given URL

Arcs: information flow

SLIDE 13

Re-tweeting

SLIDE 14

RT-cascade

@alice: http://url.com
@bob: RT @alice http://url.com
@charlie: http://url.com

- Arcs: who retweets whom
  - Irrespective of whether users follow one another
- Single parent
  - only the user name immediately after "RT" is taken into account
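The single-parent rule above can be sketched as a small parser. This is an illustration only: the regular expression, the names `RT_PATTERN` and `rt_arcs`, and the `(author, text)` tweet representation are assumptions, not the authors' implementation.

```python
import re

# Only the first user name right after "RT" is captured,
# so every retweet contributes at most one parent.
RT_PATTERN = re.compile(r"\bRT\s+@(\w+)")


def rt_arcs(tweets: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return (parent, child) arcs for tweets given as (author, text) pairs."""
    arcs = []
    for author, text in tweets:
        match = RT_PATTERN.search(text)
        if match:
            arcs.append((match.group(1), author))
    return arcs


TWEETS = [
    ("alice", "http://url.com"),
    ("bob", "RT @alice http://url.com"),
    ("charlie", "http://url.com"),
]

if __name__ == "__main__":
    print(rt_arcs(TWEETS))
```

Because the follower graph is never consulted, an arc is created whether or not the retweeting user follows the credited one.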

SLIDE 15

F-cascade

@alice: http://url.com
@bob: http://url.com
@charlie: http://url.com

- Arc @a → @b exists if:
  - user @a mentioned the URL before user @b
  - user @b follows user @a
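The two arc conditions can be written down directly. The sketch below is a quadratic-time illustration on toy data; the function name, the `(user, time)` mention format and the `follows` mapping are hypothetical.

```python
def f_cascade_arcs(mentions: list[tuple[str, float]],
                   follows: dict[str, set[str]]) -> list[tuple[str, str]]:
    """mentions: (user, time of that user's first mention of the URL).
    follows[b]: the set of users that b follows.
    Returns arcs (a, b) where a mentioned the URL before b and b follows a."""
    ordered = sorted(mentions, key=lambda m: m[1])
    arcs = []
    for idx, (b, t_b) in enumerate(ordered):
        for a, t_a in ordered[:idx]:
            if t_a < t_b and a in follows.get(b, set()):
                arcs.append((a, b))
    return arcs


MENTIONS = [("alice", 0.0), ("bob", 1.0), ("charlie", 2.0)]
FOLLOWS = {"bob": {"alice"}, "charlie": {"alice", "bob"}}

if __name__ == "__main__":
    print(f_cascade_arcs(MENTIONS, FOLLOWS))
```

In this toy input @charlie ends up with two parents, which is why an F-cascade is in general a DAG rather than a tree.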

SLIDE 16

RT-cascades vs. F-cascades

- RT-cascades are trees
- F-cascades are DAGs
- 33% of the retweets credit a source that the user does not directly follow

SLIDE 17

cascade, subcascade

SLIDE 18

Subcascade size

SLIDE 19

Cascade fragmentation

SLIDE 20

Cascade depth

SLIDE 21

Influence of the root

SLIDE 22

Information diffusion rate

Median: 50 min

SLIDE 23

URL tweeting prediction

- Based on the past URL retweets by users, predict the future ones
- Find the probability $p_i^u$ that user $i$ mentions URL $u$

slide-24
SLIDE 24

Influence

αij α

24

slide-25
SLIDE 25

External influence

βi β

25

slide-26
SLIDE 26

URL virality

γ u γ

http://cnn com/ http://cnn.com/

26

SLIDE 27

Per-user diffusion delay

$\mu_i, \sigma_i^2$
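The summary slide describes the diffusion delay as log-normally distributed. Under that reading, $\mu_i$ and $\sigma_i^2$ would be the mean and variance of the logarithm of user $i$'s delay, and the probability that the user has reacted within a given time is the log-normal CDF. The function below is a sketch of that CDF; its name and the choice of minutes as the unit are assumptions.

```python
import math


def lognormal_cdf(t: float, mu: float, sigma2: float) -> float:
    """P(delay <= t) for a log-normal delay with log-mean mu and log-variance sigma2."""
    if t <= 0:
        return 0.0
    z = (math.log(t) - mu) / math.sqrt(2.0 * sigma2)
    return 0.5 * (1.0 + math.erf(z))


if __name__ == "__main__":
    # Illustrative values only: mu = log(50) makes 50 minutes the median delay.
    print(lognormal_cdf(50.0, math.log(50.0), 1.0))
```

The median of a log-normal delay is $e^{\mu_i}$, so the CDF evaluates to one half at that point for any variance.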

SLIDE 28

Model

- $\alpha_{ij}$: influence
- $\beta_i$: external influence
- $\mu_i, \sigma_i^2$: per-user diffusion delay
- $\gamma_u$: URL virality (http://cnn.com/)

SLIDE 29

At-Least-One (ALO) model

$p_i^u$ = P(at least one event happens) × temporal component

- events: the influence terms $p_j^u \gamma_u \alpha_{ij}$ and the external influence term $\gamma_u \beta_i$
- temporal component: depends on $\mu_i, \sigma_i^2$
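One way to read the ALO model is sketched below, assuming the events are independent so that P(at least one) = 1 − ∏(1 − q_k). The independence assumption, the function name and the treatment of the temporal component as a precomputed number in [0, 1] are choices made for illustration; the slide does not spell out the exact formula.

```python
def alo_probability(gamma_u: float, beta_i: float,
                    influences: list[tuple[float, float]],
                    temporal: float) -> float:
    """influences: (p_j_u, alpha_ij) pairs for the users j acting on user i.
    temporal: value of the temporal component, treated here as a given number.
    ASSUMPTION: events are independent, so
    P(at least one) = 1 - product of (1 - probability of each event)."""
    none_happens = 1.0 - gamma_u * beta_i            # external influence does not fire
    for p_j_u, alpha_ij in influences:
        none_happens *= 1.0 - p_j_u * gamma_u * alpha_ij
    return (1.0 - none_happens) * temporal


if __name__ == "__main__":
    # Two influencers, each firing with probability 0.5, no external influence.
    print(alo_probability(1.0, 0.0, [(1.0, 0.5), (1.0, 0.5)], 1.0))
```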

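SLIDE 30

Linear threshold (LT) model

$p_i^u$ = thresholding function (sigmoid) × temporal component

- inputs to the thresholding function: the influence terms $p_j^u \gamma_u \alpha_{ij}$ and the external influence term $\gamma_u \beta_i$
- temporal component: depends on $\mu_i, \sigma_i^2$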
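A sketch of the LT variant follows, assuming the influence terms are summed before being passed through the sigmoid. The summation, the `steepness` and `threshold` parameters and the function names are assumptions; the slide only names the ingredients.

```python
import math


def sigmoid(x: float, steepness: float, threshold: float) -> float:
    """Smooth thresholding function: 0.5 exactly at the threshold."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - threshold)))


def lt_probability(gamma_u: float, beta_i: float,
                   influences: list[tuple[float, float]],
                   temporal: float,
                   steepness: float = 10.0,       # hypothetical value
                   threshold: float = 0.5) -> float:  # hypothetical value
    """influences: (p_j_u, alpha_ij) pairs for the users j acting on user i.
    ASSUMPTION: the terms are added up (the 'linear' part) and then thresholded."""
    total = gamma_u * beta_i + sum(p * gamma_u * a for p, a in influences)
    return sigmoid(total, steepness, threshold) * temporal


if __name__ == "__main__":
    print(lt_probability(1.0, 0.5, [], 1.0))
    print(lt_probability(1.0, 0.5, [(1.0, 0.25)], 1.0))
```

Unlike the at-least-one combination, small influences here have little effect until their sum approaches the threshold.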

SLIDE 31

Performance metrics

- Recall: fraction of tweets predicted out of all tweets that happened
- Precision: fraction of true positives out of all tweets predicted
- F-score: harmonic mean of recall and precision
- F-score is the optimization goal
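The three metrics can be computed from two sets of (user, URL) pairs: those predicted and those that actually occurred. The function name and the toy sets below are illustrative.

```python
def precision_recall_f(predicted: set, actual: set) -> tuple[float, float, float]:
    """Return (precision, recall, F-score) for sets of (user, url) pairs."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


PREDICTED = {("alice", "u1"), ("bob", "u1"), ("charlie", "u2")}
ACTUAL = {("alice", "u1"), ("bob", "u1"), ("dave", "u1"), ("erin", "u2")}

if __name__ == "__main__":
    print(precision_recall_f(PREDICTED, ACTUAL))
```

With two of three predictions correct and two of four actual tweets found, precision is 2/3, recall is 1/2 and the F-score is 4/7.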

SLIDE 32

Learning

- Input: a time window of tweets
- Computation: gradient ascent method
  - Parameter space: $\sigma_i^2, \mu_i, \gamma_u, \beta_i, \alpha_{ji}$
  - Goal: maximize F-score
- Output: $p_i^u$
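The slide does not say how the gradient of the F-score is obtained. The sketch below shows one possible setup on toy data: a "soft" F-score computed from predicted probabilities, maximized by plain gradient ascent with finite-difference gradients. The two-parameter toy model, the data, the step size and all names are assumptions and stand in for the much larger parameter space listed above.

```python
import math

# Hypothetical toy data: one feature per example, label 1 = "tweet happened".
XS = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
YS = [0, 0, 0, 1, 1, 1]


def soft_f_score(params: list[float]) -> float:
    """Differentiable stand-in for the F-score: predictions are probabilities,
    so F = 2*TP / (predicted positives + actual positives) is smooth in params."""
    w, b = params
    probs = [1.0 / (1.0 + math.exp(-(w * x + b))) for x in XS]
    true_pos = sum(p * y for p, y in zip(probs, YS))
    return 2.0 * true_pos / (sum(probs) + sum(YS))


def gradient_ascent(objective, params, step=0.5, iterations=200, eps=1e-5):
    """Plain gradient ascent using central finite-difference gradients."""
    params = list(params)
    for _ in range(iterations):
        grad = []
        for k in range(len(params)):
            up = params[:k] + [params[k] + eps] + params[k + 1:]
            down = params[:k] + [params[k] - eps] + params[k + 1:]
            grad.append((objective(up) - objective(down)) / (2 * eps))
        params = [p + step * g for p, g in zip(params, grad)]
    return params


if __name__ == "__main__":
    fitted = gradient_ascent(soft_f_score, [0.0, 0.0])
    print(soft_f_score([0.0, 0.0]), soft_f_score(fitted))
```

Finite differences keep the example short; an analytic gradient would be the natural choice at the scale of millions of users and URLs.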

SLIDE 33

Lineup

- LT: Linear Threshold model
- LTr: Linear Threshold model with $\alpha_j$ instead of $\alpha_{ji}$
- ALO: At-Least-One model
- RND: baseline, makes random guesses about $p_i^u$

SLIDE 34

* training data: first 150 h, test data: next 150 h, results for 100 random URLs
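The evaluation setup in the footnote amounts to a chronological split of the 300-hour window. A minimal sketch, assuming tweets are stored as `(timestamp_in_hours, user, url)` tuples (a hypothetical format):

```python
def split_by_time(tweets, window_start=0.0, train_hours=150.0, test_hours=150.0):
    """Split tweets chronologically: the first block trains, the next block tests."""
    train_end = window_start + train_hours
    test_end = train_end + test_hours
    train = [t for t in tweets if window_start <= t[0] < train_end]
    test = [t for t in tweets if train_end <= t[0] < test_end]
    return train, test


TOY_TWEETS = [
    (10.0, "alice", "http://url.com"),
    (149.9, "bob", "http://url.com"),
    (150.0, "charlie", "http://url.com"),
    (299.0, "dave", "http://url.com"),
    (300.0, "erin", "http://url.com"),   # falls outside the 300 h window
]

if __name__ == "__main__":
    train, test = split_by_time(TOY_TWEETS)
    print(len(train), len(test))
```

Splitting by time rather than at random keeps the test tweets strictly in the future of the training data, which matches the goal of predicting future tweets from past ones.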

SLIDE 35

Summary

- Log-normal degree distribution
- Small-world: 3.6 hops from user to user
- Power laws in the user activity and URL mentions
- Cascades are shallow: exponential depth falloff
- Log-normally distributed diffusion delay
- The LT model:
  - predicts more than half of the URL tweets
  - with less than 15% false positive rate

SLIDE 36

Ongoing work

- Investigating mispredictions
  - URLs
  - users
- Scaling up the real-time data mining
  - continuous MapReduce
  - crawler farm
- Website: personalized URL rankings for Twitter users
- Apply to other systems