How Predictable is Information Diffusion? Travis Martin, Jake - - PowerPoint PPT Presentation



slide-1
SLIDE 1

How Predictable is Information Diffusion?

Travis Martin, Jake Hofman, Amit Sharma, Ashton Anderson, and Duncan Watts

How Predictable is Information Diffusion? 1 / 36

slide-2
SLIDE 2

How far will this spread?

slide-4
SLIDE 4

Why is it so difficult to predict success?

Do we need bigger data and better models?

Or is information diffusion inherently unpredictable?

slide-5
SLIDE 5

Outline

  • Understanding diffusion: What we know and how we got here
  • Predicting success: Evaluating the state-of-the-art under a unified framework
  • Theoretical limits: Exploring the limits to predicting success

slide-6
SLIDE 6

Understanding Diffusion

(What we know and how we got here)

slide-8
SLIDE 8

∼1950s: Small-scale surveys of individual interactions

Katz & Lazarsfeld (1955)


slide-9
SLIDE 9

∼1960s: Mathematical models of aggregate adoption

Rogers (1962), Bass (1969)


slide-10
SLIDE 10

∼1960s: Random graph theory

$p > (1 + \epsilon) \frac{\ln n}{n}$

Erdős & Rényi (1959)

slide-11
SLIDE 11

∼1990s: Empirical structure and dynamics of networks

Newman, Barabási, Watts (2006)

slide-12
SLIDE 12

∼2000s: Empirical analyses of large-scale diffusion events

Liben-Nowell & Kleinberg (2007)


slide-13
SLIDE 13

∼2010s: Characterizing online information flows

B receive tweets from A

Category of          % of tweets received from
Twitter users      Celeb    Media      Org     Blog
Celeb              38.27     6.23     1.55     3.98
Media               3.91    26.22     1.66     5.69
Org                 4.64     6.41     8.05     8.70
Blog                4.94     3.89     1.58    22.55

Wu, Hofman, Mason, Watts (2011)

slide-14
SLIDE 14

∼2010s: Cataloging empirical diffusion structures

[Figure, panels A, B, C: Density, Tree Size CCDF, and Tree Depth for Y! Kindness, Zync, Secretary Game, Twitter News, Twitter Videos, Friendsense, and Y! Voice]

Goel, Goldstein, Watts (2012)

slide-15
SLIDE 15

∼2010s: Cataloging empirical diffusion structures

[Figure: six example cascades, each plotted as size against time]

Goel, Anderson, Hofman, Watts (2015)

slide-16
SLIDE 16

2016

  • There is a striking concentration of attention online, in support of the two-step flow of information
  • Most things don’t spread, but when they do, there is a great deal of diversity in diffusion patterns
  • There is almost no correlation between how things diffuse and how far they spread
  • Existing diffusion models fail to account for this diversity in outcomes

slide-17
SLIDE 17

Predicting Success

(Evaluating the state-of-the-art under a unified framework)

slide-19
SLIDE 19

Background: Predicting the success of diffusion events

Bakshy, Hofman, Mason, Watts (2011)

  • Looked at 75M diffusion events across 1M users
  • Found a relatively low correlation (R² ∼ 30%) between predicted and actual cascade sizes
  • Almost all predictive power comes from examining past performance of a user or piece of content

How much better can we do?

slide-20
SLIDE 20

Related work

  • Hong & Davidson (2010): Will a given user be retweeted? Topic model features outperform baselines (F1 = 0.47)
  • Petrovic et al. (2011): Will a given tweet be retweeted? Social and content features beat humans (F1 = 0.46)
  • Jenders et al. (2013): Will a cascade reach a minimum size? Content features lead to good performance (F1 = 0.90)
  • Tan et al. (2014): Which of two tweets will spread further? Detailed wording features are informative (Accuracy = 0.65)
  • Cheng et al. (2014): Will a cascade double in size? Temporal features provide good performance (AUC = 0.88)

slide-21
SLIDE 21

Progress?

Each of these studies examines a different question with a different measure of success, evaluated on a different subset of data, making it difficult to assess overall progress¹

¹ http://hunch.net/?p=22
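
As a rough illustration of why numbers reported under different measures are hard to compare, the sketch below scores two classifiers under both accuracy and F1. The labels and predictions are made-up toy data, not results from any of the studies listed earlier, and the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # with no true positives, F1 is taken to be 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): 100 tweets, of which only 5 are retweeted (label 1).
y_true = [1] * 5 + [0] * 95

# A trivial classifier that always predicts "not retweeted".
always_no = [0] * 100

# A classifier that finds 3 of the 5 retweeted tweets and raises 7 false alarms.
some_skill = [1, 1, 1, 0, 0] + [1] * 7 + [0] * 88

for name, y_pred in [("always_no", always_no), ("some_skill", some_skill)]:
    print(name, round(accuracy(y_true, y_pred), 3), round(f1_score(y_true, y_pred), 3))
```

On this toy data the trivial classifier has the higher accuracy but an F1 of zero, so the ranking of the two classifiers depends on which measure is reported.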

slide-22
SLIDE 22

Ex-ante prediction

We focus on predictions made prior to events of interest:

“X will succeed because of properties A, B, and C”

vs.

“X will succeed tomorrow because it is successful today”

slide-23
SLIDE 23

A unified framework: Luck vs. skill²

  • Model success S as a mix of skill Q and luck ε: $S = f(Q) + \epsilon$
  • Measure the fraction of variance remaining after conditioning on skill: $F = \frac{E[\mathrm{Var}(S \mid Q)]}{\mathrm{Var}(S)} = 1 - R^2$
  • $R^2 = 1$ in a pure skill world, $R^2 = 0$ in a pure luck world

[Figure panels: “Empirical Observation”, P[Success] vs. Success; “Luck World”, P[Success|skill] vs. Success; “Skill World”, P[Success|skill] vs. Success]

² Formalizes Mauboussin (2012)
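
The luck vs. skill decomposition can be checked numerically. The following is a minimal sketch on synthetic data, assuming five discrete skill levels, f(Q) = Q, and Gaussian luck; none of these choices come from the talk, and the function names are hypothetical. It estimates F by grouping outcomes on skill and comparing within-group variance to total variance.

```python
import random
from collections import defaultdict
from statistics import pvariance


def unexplained_fraction(skills, successes):
    """Estimate F = E[Var(S|Q)] / Var(S) by grouping outcomes on discrete skill values."""
    groups = defaultdict(list)
    for q, s in zip(skills, successes):
        groups[q].append(s)
    n = len(successes)
    # E[Var(S|Q)]: within-group variance, weighted by group size.
    within = sum(len(g) * pvariance(g) for g in groups.values()) / n
    return within / pvariance(successes)


def simulate(luck_sd, n=5000, seed=0):
    """Draw (Q, S) pairs with S = f(Q) + eps, using f(Q) = Q as a stand-in."""
    rng = random.Random(seed)
    skills = [rng.choice([1, 2, 3, 4, 5]) for _ in range(n)]
    successes = [q + rng.gauss(0.0, luck_sd) for q in skills]
    return skills, successes


for label, luck_sd in [("mostly skill", 0.1), ("mixed", 1.5), ("mostly luck", 20.0)]:
    q, s = simulate(luck_sd)
    f = unexplained_fraction(q, s)
    print(f"{label}: F = {f:.3f}, R^2 = {1 - f:.3f}")
```

As the luck term grows relative to the spread in skill, F moves from near 0 toward 1 and R² falls accordingly.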

slide-28
SLIDE 28

Data

  • Examined all 1.4B tweets containing URLs posted in February 2015
  • Eliminated spam using an internal Microsoft classifier
  • Restricted attention to tweets containing URLs from the top 100 English-speaking domains with the most unique adopters
  • Resulted in 850M tweets from 50M distinct users covering news, entertainment, videos, images, and products
  • Measured the total cascade size for each seed tweet

slide-29
SLIDE 29

User distribution

Most users in our dataset have relatively few followers, although low-degree users are under-represented

[Figure: CCDF of the number of followers of a user]

slide-30
SLIDE 30

Cascade sizes

Most cascades are small: fewer than 3% reach 10 or more users

[Figure: CCDF of cascade size]
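
For readers unfamiliar with the plot type, here is a minimal sketch of an empirical CCDF, using the convention P[X ≥ x] so that it reads directly as "share of cascades reaching x or more users". The cascade sizes are hypothetical toy values, not the study's data.

```python
from collections import Counter


def ccdf(values):
    """Return sorted (x, P[X >= x]) pairs for the distinct values in `values`."""
    n = len(values)
    counts = Counter(values)
    points = []
    remaining = n  # number of observations >= the current x
    for x in sorted(counts):
        points.append((x, remaining / n))
        remaining -= counts[x]
    return points


# Toy cascade sizes (hypothetical): most cascades never leave the seed user.
sizes = [1] * 90 + [2] * 5 + [5] * 3 + [12] * 1 + [250] * 1

for x, p in ccdf(sizes):
    print(f"P[size >= {x}] = {p:.2f}")

# Share of cascades reaching 10 or more users.
reach_10 = sum(1 for s in sizes if s >= 10) / len(sizes)
print(f"share reaching 10 or more users: {reach_10:.2f}")
```

Plotting these points on log-log axes gives the kind of curve the slide shows.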

slide-31
SLIDE 31

Activity by degree

Most cascades are started by low-degree users

[Figure: number of cascades vs. number of followers of a user, with a legend for number of users]

slide-32
SLIDE 32

Cascade size by degree

Cascades initiated by high-degree users tend to have larger reach

[Figure: mean cascade size for a typical user vs. number of followers of a user, with a legend for number of users]

slide-33
SLIDE 33

Predictive features

Used a random forest to estimate success (cascade size) given skill (available features)

  • Basic content features: URL domain, time of tweet, spam score, ODP category
  • Basic user features: number of followers, number of friends, number of posts, account creation time
  • Topic features: the most probable Latent Dirichlet Allocation topic for each user and tweet, along with an interaction term
  • Past success: the average number of retweets received by each URL and user in the past
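
The sketch below does not reproduce the random forest. It swaps in a much simpler stand-in, predicting each cascade's size by its author's mean past cascade size, only to show how a "past success" feature and an R² score fit together. All data is synthetic, and every name and parameter (user counts, noise level) is an assumption made for illustration.

```python
import random
from statistics import mean


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    avg = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - avg) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


rng = random.Random(42)

# Synthetic users: each has a hidden typical reach; every cascade is a noisy draw around it.
user_reach = {f"user{i}": rng.uniform(1, 50) for i in range(200)}


def draw_cascades(n_per_user):
    """Return {user: [cascade sizes]} with multiplicative noise around the user's reach."""
    return {
        u: [reach * rng.lognormvariate(0, 0.5) for _ in range(n_per_user)]
        for u, reach in user_reach.items()
    }


past = draw_cascades(20)    # history used to build the feature
future = draw_cascades(5)   # held-out outcomes to predict

# Past-success feature: the mean size of each user's earlier cascades.
past_success = {u: mean(sizes) for u, sizes in past.items()}

actual, predicted = [], []
for u, sizes in future.items():
    for s in sizes:
        actual.append(s)
        predicted.append(past_success[u])

print(f"R^2 using past success only: {r_squared(actual, predicted):.2f}")
```

Because the prediction is fixed before the held-out cascades are drawn, this is an ex-ante evaluation in the sense used earlier in the talk.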

slide-34
SLIDE 34

Predictive performance

Our best model explains roughly half of the variance in outcomes


slide-35
SLIDE 35

Predictive performance

Content features alone perform poorly


slide-36
SLIDE 36

Predictive performance

Basic user features provide a reasonable boost in performance


slide-37
SLIDE 37

Predictive performance

Past user success alone accounts for almost all of the predictive power

slide-38
SLIDE 38

Summary of empirical results

  • This is the best known model since Bakshy et al., boosting performance from R² ∼ 30% to R² ∼ 50%
  • Both models derive their predictive power from the same simple feature: a user’s past success
  • Content features are only weakly informative
  • Performance plateaus as we add more features, suggesting a possible limit to the predictability of diffusion outcomes

slide-39
SLIDE 39

Theoretical limits

(Exploring the limits to predicting success)


slide-40
SLIDE 40

Simulations

  • In practice we can never rule out missing features or superior models, so we turn to numerical simulations where we have full access to and control of all relevant information
  • Looked at the variation in outcomes when we repeatedly seed the same user with the same content
  • Examined how this varies with content heterogeneity and estimation error

slide-41
SLIDE 41

Simulations

  • Created a scale-free network similar to Twitter but smaller in size
  • Simulated 8B cascades using a standard SIR model
  • Initiated 1,000 cascades for each combination of 10,000 different seed users and 800 different infectiousness values
  • Carefully matched distributions of user activity and cascade size to our empirical data

[Figure: panels (a) to (d), an example network with nodes r, s, t, u, v, w, x, y, z]
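
The setup of repeatedly seeding the same user with the same content can be sketched at toy scale. This is not the authors' simulation: the network size, the preferential-attachment construction, the infectiousness value, and the one-step recovery rule are all assumptions chosen to keep the example short, and the function names are hypothetical.

```python
import random
from statistics import mean, pstdev


def preferential_attachment_graph(n, m, rng):
    """Undirected graph where each new node links to m existing nodes, chosen
    with probability proportional to degree (a simple scale-free construction)."""
    neighbors = {i: set() for i in range(n)}
    # Start from a small clique of m + 1 nodes.
    for i in range(m + 1):
        for j in range(i):
            neighbors[i].add(j)
            neighbors[j].add(i)
    # Each node appears in `stubs` once per edge endpoint, so sampling is degree-biased.
    stubs = [i for i in range(m + 1) for _ in range(m)]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for t in targets:
            neighbors[new].add(t)
            neighbors[t].add(new)
            stubs.extend([new, t])
    return neighbors


def sir_cascade_size(neighbors, seed_node, beta, rng):
    """Run one SIR cascade in which each infected node gets a single chance to
    infect each susceptible neighbor with probability beta, then recovers."""
    infected = [seed_node]
    ever_infected = {seed_node}
    while infected:
        next_infected = []
        for node in infected:
            for nb in neighbors[node]:
                if nb not in ever_infected and rng.random() < beta:
                    ever_infected.add(nb)
                    next_infected.append(nb)
        infected = next_infected
    return len(ever_infected)


rng = random.Random(7)
graph = preferential_attachment_graph(n=2000, m=2, rng=rng)
seed_user = max(graph, key=lambda node: len(graph[node]))  # a high-degree seed

# Same seed user, same infectiousness, many independent runs.
sizes = [sir_cascade_size(graph, seed_user, beta=0.05, rng=rng) for _ in range(1000)]
print(f"mean size = {mean(sizes):.1f}, std = {pstdev(sizes):.1f}, "
      f"min = {min(sizes)}, max = {max(sizes)}")
```

The spread of `sizes` across runs is the quantity of interest: even with the seed user and infectiousness held fixed, the cascade size varies from run to run.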

slide-42
SLIDE 42

Repeatedly seed the same user with the same content

Outcomes are highly predictable when all content is identical

[Figure labels: Theoretical limit on predictability; Average content quality; Content heterogeneity; Perfect knowledge of identical content]

slide-43
SLIDE 43

Repeatedly seed the same user with the same content

Predictive performance decreases sharply with content diversity (e.g., a 15% variation around $R_0^* = 0.2$ gives an R² of 60%)

[Figure labels: Theoretical limit on predictability; Average content quality; Content heterogeneity; Perfect knowledge of diverse content]

slide-44
SLIDE 44

Repeatedly seed the same user with the same content

Outcomes are highly predictable assuming exact quality estimates

[Figure labels: Theoretical limit on predictability; Average content quality; Error in estimating quality; Perfect knowledge of identical content]

slide-45
SLIDE 45

Repeatedly seed the same user with the same content

Predictive performance decreases sharply with estimation error (e.g., R² < 60% with 30% error in estimating $R_0^* = 0.3$)

[Figure labels: Theoretical limit on predictability; Average content quality; Error in estimating quality; Imperfect knowledge of identical content]

slide-46
SLIDE 46

Summary of theoretical results

  • Our simulations suggest that the unpredictability lies in the diffusion process itself, rather than in limits on our ability to estimate or model it
  • Predictability decreases sharply with content diversity
  • Likewise, small errors in estimating quality severely limit predictability
  • We emphasize the qualitative nature of these results and the approach to assessing predictability, rather than the specific numerical outcomes presented here

slide-48
SLIDE 48

Conclusions

Most things don’t spread, but when they do, it’s difficult to predict success


slide-49
SLIDE 49

Conclusions

Despite a great deal of research on the topic, it’s difficult to assess long-term progress in predicting success


slide-50
SLIDE 50

Conclusions

State-of-the-art models explain roughly half of the variance in outcomes, based primarily on past success

slide-51
SLIDE 51

Conclusions

This is likely due to randomness in the diffusion process itself, rather than to limits on our ability to estimate or model it