there is something beyond the twitter network, Karol Węgrzycki



slide-1
SLIDE 1

there is something beyond the twitter network

Karol Węgrzycki 2016-07-11


slide-2
SLIDE 2

modeling information diffusion


slide-3
SLIDE 3

Applications in:

  • sociology
  • critical analysis
  • social policy
  • political science
  • market analysis and marketing
  • recommender systems
  • routing algorithms


slide-4
SLIDE 4

problem with rumour distribution

Figure 1: Real distribution of tweets (log-log plot of probability against cascade size)


slide-5
SLIDE 5

Figure 2: Predicted distribution (log-log plot of probability against cascade size)


slide-6
SLIDE 6

goodness of fit

The goodness of fit of a statistical model describes how well it fits a set of observations. There is an abundance of choice:

  • Kolmogorov–Smirnov test
  • Cramér–von Mises criterion

  • Anderson–Darling test
  • Shapiro–Wilk test
  • Chi-squared test
  • Akaike information criterion
  • Hosmer–Lemeshow test


slide-7
SLIDE 7

ks-test


slide-8
SLIDE 8

$$\sup_x |X(x) - Y(x)|$$
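A minimal sketch of the statistic above for two samples, assuming X and Y are the empirical cumulative distribution functions of the two samples (for example real and simulated cascade sizes). The function name `ks_statistic` is hypothetical and not taken from the slides.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # Empirical CDFs are step functions, so the supremum is reached
    # at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap
```

A value of 0 means the two empirical distributions coincide; a value of 1 means they do not overlap at all.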


slide-9
SLIDE 9

other test

Judging by eye how well a line fits the distribution on a power-law plot is wrong!

  • Lots of distributions give you straight-ish lines on a log-log plot.
  • Abusing linear regression makes Gauss cry.
  • Use maximum likelihood to estimate the scaling exponent.
  • Use the KS test to estimate where the scaling region begins.
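The maximum-likelihood advice can be sketched as follows. This uses the standard continuous power-law estimator, exponent = 1 + n / Σ ln(x_i / x_min), which is only an approximation for discrete data such as cascade sizes. The names `powerlaw_exponent_mle` and `x_min` are assumptions for illustration, and the exponent here is unrelated to the spreading probability α used elsewhere in the slides.

```python
import math


def powerlaw_exponent_mle(values, x_min):
    """Maximum-likelihood scaling exponent of a continuous power law.

    Only observations with value >= x_min (the start of the scaling
    region) take part in the estimate.
    """
    tail = [x for x in values if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum == 0:
        raise ValueError("need at least one observation above x_min")
    return 1.0 + len(tail) / log_sum
```

To choose `x_min`, one option in the spirit of the last bullet is to try several candidates and keep the one whose fitted power law has the smallest KS distance to the data.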


slide-10
SLIDE 10

data and simulation technique

We received 5 GB of tweets from the University of Rome: 500 million tweets, a 10% sample, from May 2013. The retweet graph has 71 million vertices and 230 million edges. And we decided to share them! (We anonymized the data, so sharing it does not violate the Twitter policy.)


slide-11
SLIDE 11

cgm - cascade generation model

According to Leskovec et al. 2007:

  • 1. Uniformly at random pick a starting point of the cascade and add it to the set of newly informed nodes.
  • 2. Every newly informed node, for each of its direct neighbors, makes a separate decision to inform the neighbor with probability α.
  • 3. Let newly informed be the set of nodes that have been informed for the first time in step 2.
  • 4. Add all newly informed nodes to the generated cascade.
  • 5. Repeat steps 2 to 4 until the newly informed set is empty.

In the CGM regime all nodes have identical impact. The final graph is called a cascade.
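The five steps can be sketched in a few lines. This is an illustration under assumptions, not the authors' code: the graph is taken to be a dict mapping each node to a list of its neighbors, and the function returns only the set of informed nodes rather than the cascade graph. The name `simulate_cgm` is hypothetical.

```python
import random


def simulate_cgm(graph, alpha, rng=None):
    """Run one CGM cascade and return the set of informed nodes."""
    rng = rng or random.Random()
    start = rng.choice(list(graph))          # step 1: random starting point
    informed = {start}
    newly_informed = [start]
    while newly_informed:                    # step 5: stop when nothing is new
        next_wave = []
        for node in newly_informed:
            for neighbor in graph.get(node, ()):
                # step 2: a separate decision per neighbor, probability alpha
                if neighbor not in informed and rng.random() < alpha:
                    informed.add(neighbor)   # steps 3-4: first-time informed
                    next_wave.append(neighbor)
        newly_informed = next_wave
    return informed
```

Running it many times and recording `len(simulate_cgm(...))` gives a simulated cascade-size distribution that can be compared with the real one.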


slide-12
SLIDE 12

cgm learning

[Plot: K-S test statistic (0.00 to 0.35) as a function of alpha (0.05 to 0.30)]
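The slide shows only the K-S statistic plotted against alpha. One plausible reading, sketched here as an assumption, is a grid scan: simulate cascade sizes for each candidate alpha and keep the value with the smallest K-S distance to the real sizes. `fit_alpha` and the `simulate_sizes` callback are hypothetical names.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def fit_alpha(real_sizes, simulate_sizes, alphas):
    """Return (best_alpha, best_ks) over the candidate alphas.

    simulate_sizes(alpha) must return a list of simulated cascade sizes.
    """
    best = None
    for alpha in alphas:
        ks = ks_statistic(real_sizes, simulate_sizes(alpha))
        if best is None or ks < best[1]:
            best = (alpha, ks)
    return best
```

Because the simulation is random, the scan is noisy; averaging over several runs per alpha is one way to smooth it.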


slide-13
SLIDE 13

cgm results

[Plot: probability against cascade size on log-log axes; series: model (α) and real]


slide-14
SLIDE 14

exponential model

How about rumour aging? The probability that the rumour is passed on should decay over time.

  • 1. In the first round each neighbor of the initial vertex is informed and then, with probability α, becomes a spreader.
  • 2. During round k, each previously uninformed neighbor of the new spreaders from round k − 1 is informed and subsequently, with probability α^k, becomes a spreader.
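A minimal sketch of the two rules above, assuming the graph is a dict mapping each node to a list of its neighbors and that the spreading probability in round k is α raised to the power k. The name `simulate_exponential` is hypothetical.

```python
import random


def simulate_exponential(graph, alpha, start, rng=None):
    """Run the aging model from `start`; return the set of informed nodes."""
    rng = rng or random.Random()
    informed = {start}
    spreaders = [start]
    k = 1                                    # round number
    while spreaders:
        new_spreaders = []
        for spreader in spreaders:
            for neighbor in graph.get(spreader, ()):
                if neighbor in informed:
                    continue
                informed.add(neighbor)       # every fresh neighbor is informed
                # ...but passes the rumour on only with probability alpha**k
                if rng.random() < alpha ** k:
                    new_spreaders.append(neighbor)
        spreaders = new_spreaders
        k += 1
    return informed
```

Unlike CGM, a neighbor is always informed; what decays is its chance of spreading further.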


slide-15
SLIDE 15

maybe information appears randomly in the network

  • The real structure of social interaction is unknown
  • Can the information appear randomly in the network?


slide-16
SLIDE 16

multi source model

The number of spreaders that learn the information from a different source can be modeled by the binomial distribution: X ∼ B(n, p). By the law of rare events, this can be approximated by the Poisson distribution: X ∼ Pois(np).
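A small numerical check of the approximation, with n and p chosen purely for illustration (they are not values from the slides): for large n and small p the two probability mass functions nearly coincide.

```python
import math


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Pois(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


if __name__ == "__main__":
    n, p = 1000, 0.002  # illustrative values: many nodes, rare event
    for k in range(6):
        print(k, binomial_pmf(k, n, p), poisson_pmf(k, n * p))
```

The practical gain is that the Poisson form needs only the single parameter λ = np.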


slide-17
SLIDE 17

compound poisson process

This is essentially known as a compound Poisson process!

$$X_0 + Y(t) = X_0 + \sum_{i=1}^{N(t)} X_i = \sum_{i=0}^{N(t)} X_i$$

And we can implement it efficiently!
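One draw of the sum above can be sketched as follows. The reading assumed here is that N(t) ∼ Pois(λ) counts the extra sources and each X_i is the size of one cascade; the slides do not spell this out. `sample_poisson`, `compound_poisson_total` and the `draw_cascade_size` callback are hypothetical names, and the sampler is Knuth's simple multiplication method, suitable for small λ.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw from Pois(lam) by multiplying uniforms until below exp(-lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def compound_poisson_total(lam, draw_cascade_size, rng):
    """Return X_0 + X_1 + ... + X_N with N ~ Pois(lam)."""
    n_extra = sample_poisson(lam, rng)
    # range(n_extra + 1) covers i = 0..N, so X_0 is always included.
    return sum(draw_cascade_size(rng) for _ in range(n_extra + 1))
```

If every X_i has mean m, the expected total is (1 + λ)·m, which gives a quick sanity check for any implementation.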


slide-18
SLIDE 18

algorithm

We can model the information diffusion as follows:

  • 1. Randomly choose the first node that will be informed.
  • 2. Propagate the information using the α^k model from the previous section.
  • 3. As long as there are newly informed nodes, in each round randomly choose X ∼ Pois(λ) new source nodes and propagate information from those nodes using the α^k model.

With algorithmic and statistical tricks, this algorithm can be simulated in essentially the same time as CGM!
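A naive sketch of the three steps, without the algorithmic and statistical tricks mentioned on the slide. Assumptions: the graph is a dict mapping each node to a list of its neighbors, the spreading probability in round k is α to the power k, and every new source restarts its own round counter at k = 1 (the slides do not say which counter new sources use). All names are hypothetical.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw from Pois(lam) with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_multi_source(graph, alpha, lam, rng=None):
    """Run the multi-source model; return the set of informed nodes."""
    rng = rng or random.Random()
    nodes = list(graph)
    start = rng.choice(nodes)                    # step 1
    informed = {start}
    frontier = [(start, 1)]                      # (spreader, its round k)
    newly_informed = 1
    while newly_informed:                        # step 3: while news spreads
        newly_informed = 0
        next_frontier = []
        for spreader, k in frontier:             # step 2: one alpha**k round
            for neighbor in graph.get(spreader, ()):
                if neighbor in informed:
                    continue
                informed.add(neighbor)
                newly_informed += 1
                if rng.random() < alpha ** k:
                    next_frontier.append((neighbor, k + 1))
        for _ in range(sample_poisson(lam, rng)):  # Pois(lam) new sources
            source = rng.choice(nodes)
            if source not in informed:
                informed.add(source)
                newly_informed += 1
                next_frontier.append((source, 1))
        frontier = next_frontier
    return informed
```

With `lam = 0` this reduces to the single-source α^k model started from a random node.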


slide-19
SLIDE 19

parameters learning

[Plot: K-S test statistic (0.010 to 0.050) over alpha (0.105 to 0.135) and lambda (0.00 to 0.30)]


slide-20
SLIDE 20

comparison with real distribution

[Plot: probability (10^-8 to 10^0) against cascade size (10^0 to 10^3) on log-log axes; series: multi-source and real]


slide-21
SLIDE 21

further improvements

  • Geographically close nodes might be informed through an unknown social network. Close nodes should be informed with higher probability than distant ones.
  • The probability of randomly informing a node may decrease over time because the information may become obsolete.
  • The social network structure evolves over time.


slide-22
SLIDE 22

all data and code are available online!

(social-networks.mimuw.edu.pl)


slide-23
SLIDE 23

future work

  • Propose a better model of information flow
  • Propose a better metric for comparing data
  • Give a better statistical framework for information modeling
