
Outtwitting the Twitterers: Predicting Information Cascades in Microblogs



  1. Outtwitting the Twitterers: Predicting Information Cascades in Microblogs
     - Wojciech Galuba, Karl Aberer (EPFL, Switzerland)
     - Dipanjan Chakraborty (IBM Research India)
     - Zoran Despotovic, Wolfgang Kellerer (Docomo Euro-Labs, Munich, Germany)

  2. Why study information flows in OSNs?
     - Phenomena to model: casual link sharing, breaking news, activism, viral marketing, emergencies, PR campaigns
     - Modeling leads to: improving how information flows, new applications, insights into the underlying sociology

  3. Information overload?
     - Full-time job (reading tweets 40 h a week at 150 WPM)
     - Median: 23 tw/h, 552 tw/day (Sep 2009 data)

  4. OSN information spread modeling
     - Related work: generative models
       - reproduce statistical properties of information spread
       - predict coarse-grained aggregates (number of nodes reached by the spread, etc.)
     - Our approach:
       - Look at URL diffusion on Twitter
       - Can we predict which user will mention which URL, and with what probability?

  5. Why predict URL tweets?
     - Protect from information overload
       - Sort incoming URLs by probability of retweeting
     - Viral marketing
       - Select a subset of users that ensures successful URL propagation
     - Spam detection
       - Mispredictions are a sign of anomalous activity

  6. [No text on this slide]

  7. Data
     - 300-hour window in Sep '09
     - 22M tweets
     - 2.7M unique users
     - 15M unique URLs
     - 700M connections in the follower graph
     - Approx. 1/15th of the Twitter traffic

  8. Follower graph*
     * active users only: users that have sent at least one URL in the 300 h window

  9. Follower graph*
     - Mean (directed): 3.61
     * active users only: users that have sent at least one URL in the 300 h window

  10. User activity

  11. Per-URL activity

  12. Information cascades
     - Nodes: users that mentioned a given URL
     - Arcs: information flow

  13. Re-tweeting

  14. RT-cascade
     - Example tweets: "@alice: http://url.com", "@bob: RT @alice http://url.com", "@charlie: http://url.com"
     - Arcs: who retweets whom
       - Irrespective of whether the users follow one another
     - Single parent
       - Only the user name immediately after "RT" is taken into account
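
As a rough illustration of the single-parent rule above, the following sketch extracts RT-cascade arcs from tweet text. The `(author, text)` record layout, the regular expression, and the function name `rt_cascade_arcs` are assumptions made for this example; they are not taken from the slides.

```python
import re

# Assumed convention: a retweet contains "RT @name"; only the first such
# name in the tweet is treated as the parent (single-parent rule).
RT_PATTERN = re.compile(r"\bRT\s+@(\w+)")

def rt_cascade_arcs(tweets):
    """Return arcs (source, retweeter) for tweets given as (author, text)."""
    arcs = []
    for author, text in tweets:
        match = RT_PATTERN.search(text)  # first "RT @name" only
        if match:
            arcs.append((match.group(1), author))
    return arcs

# Hypothetical tweets mentioning the same URL.
tweets = [
    ("alice", "http://url.com"),
    ("bob", "RT @alice http://url.com"),
    ("charlie", "RT @bob RT @alice http://url.com"),
]
arcs = rt_cascade_arcs(tweets)
```

Because every retweet contributes at most one incoming arc, the result is a tree per root, regardless of who follows whom.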

  15. F-cascade
     - Example tweets: "@alice: http://url.com", "@bob: http://url.com", "@charlie: http://url.com"
     - Arc @a → @b exists if:
       - user @a mentioned the URL before user @b
       - user @b follows user @a
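
The two arc conditions above can be sketched as follows. The input layout (a list of `(timestamp, user)` mentions for one URL and a set of `(follower, followee)` pairs) and the choice to use only each user's first mention are assumptions of this example.

```python
def f_cascade_arcs(mentions, follows):
    """Return arcs (a, b): a mentioned the URL before b, and b follows a.

    mentions: list of (timestamp, user) for a single URL.
    follows:  set of (follower, followee) pairs.
    """
    first = {}
    for ts, user in sorted(mentions):
        first.setdefault(user, ts)  # assumption: keep the first mention only
    arcs = set()
    for a, time_a in first.items():
        for b, time_b in first.items():
            if time_a < time_b and (b, a) in follows:
                arcs.add((a, b))
    return arcs

# Hypothetical data: charlie follows both alice and bob.
mentions = [(1, "alice"), (2, "bob"), (3, "charlie")]
follows = {("bob", "alice"), ("charlie", "alice"), ("charlie", "bob")}
arcs = f_cascade_arcs(mentions, follows)
```

In this example `charlie` ends up with two parents, which shows how an F-cascade can be a DAG rather than a tree.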

  16. RT-cascades vs. F-cascades
     - RT-cascades are trees
     - F-cascades are DAGs
     - 33% of the retweets credit a source that the user does not directly follow

  17. Cascade, subcascade

  18. Subcascade size

  19. Cascade fragmentation

  20. Cascade depth

  21. Influence of the root

  22. Information diffusion rate
     - Median: 50 min

  23. URL tweeting prediction
     - Based on the past URL retweets by users, predict the future ones
     - Find the probability that user i mentions URL u: $p_i^u$

  24. Influence $\alpha_{ij}$

  25. External influence $\beta_i$

  26. URL virality $\gamma_u$ (example URL: http://cnn.com/)

  27. Per-user diffusion delay $\mu_i, \sigma_i^2$
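
The deck describes the diffusion delay as log-normally distributed, so one plausible reading of $\mu_i, \sigma_i^2$ is as the parameters of a log-normal delay. The sketch below evaluates the corresponding cumulative distribution using only the standard library. Treating the 50-minute median as a single user's median and choosing `sigma = 1.0` are assumptions made purely for illustration.

```python
import math

def lognormal_cdf(t, mu, sigma):
    """P(delay <= t) when ln(delay) is normal with mean mu and std sigma."""
    if t <= 0:
        return 0.0
    z = (math.log(t) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))

# The median of a log-normal distribution is exp(mu), so a 50-minute
# median corresponds to mu = ln(50). sigma is a hypothetical value.
mu = math.log(50.0)
sigma = 1.0
p_within_median = lognormal_cdf(50.0, mu, sigma)
p_within_10_min = lognormal_cdf(10.0, mu, sigma)
```

By construction, half of the probability mass lies below the median, so `p_within_median` evaluates to 0.5.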

  28. Model
     - Parameters: $\alpha_{ij}$, $\beta_i$, $\mu_i, \sigma_i^2$, $\gamma_u$ (example URL: http://cnn.com/)

  29. At-Least-One (ALO) model
     - $p_i^u = P(\text{at least one event happens}) \times \text{temporal component}$
     - Event inputs: $\alpha_{ij}\,\gamma_u\,p_j^u$ and $\beta_i$
     - Temporal component: $\mu_i, \sigma_i^2$
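
The slide gives only the shape of the ALO formula, so the sketch below fills in one common reading: the events are treated as independent, and "at least one event happens" is computed as one minus the probability that none happens. The independence assumption, the function name `alo_probability`, and the numeric inputs are all hypothetical.

```python
def alo_probability(influences, beta_i, temporal):
    """Assumed ALO form: P(at least one event) times a temporal factor.

    influences: per-neighbour terms (influence * virality * neighbour's
                probability of mentioning the URL), each in [0, 1].
    beta_i:     external influence on user i, in [0, 1].
    temporal:   temporal component, in [0, 1].
    """
    none_happens = 1.0 - beta_i
    for term in influences:
        none_happens *= 1.0 - term  # assumes independent events
    return (1.0 - none_happens) * temporal

# Hypothetical inputs: one neighbour term of 0.5, external influence 0.5.
p = alo_probability([0.5], 0.5, 1.0)
```

With these inputs the probability that neither event happens is 0.25, so `p` is 0.75.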

  30. Linear threshold (LT) model
     - $p_i^u = \text{thresholding function (sigmoid) of the summed inputs} \times \text{temporal component}$
     - Inputs: $\alpha_{ij}\,\gamma_u\,p_j^u$ and $\beta_i$
     - Temporal component: $\mu_i, \sigma_i^2$
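
A minimal sketch of the LT idea, assuming the neighbour terms and the external influence are simply added before the sigmoid is applied. The steepness and threshold of the sigmoid, and the way $\beta_i$ enters the sum, are not specified on the slide and are chosen here only for illustration.

```python
import math

def sigmoid(x, steepness=10.0, threshold=0.5):
    """Smooth thresholding function; both parameters are hypothetical."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - threshold)))

def lt_probability(influences, beta_i, temporal):
    """Assumed LT form: sigmoid of the summed inputs times a temporal factor."""
    total = sum(influences) + beta_i
    return sigmoid(total) * temporal

# Hypothetical inputs that add up to the threshold.
p_at_threshold = lt_probability([0.3, 0.1], 0.1, 1.0)
p_above = lt_probability([0.6, 0.3], 0.1, 1.0)
```

Inputs that sum to the threshold give a probability near 0.5, and larger sums push the value toward 1.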

  31. Performance metrics
     - Recall: fraction of tweets predicted out of all tweets that happened
     - Precision: fraction of true positives out of all tweets predicted
     - F-score: harmonic mean of recall and precision
     - F-score is the optimization goal
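
The three metrics can be computed as in the sketch below, which represents predicted and observed tweets as sets of `(user, url)` pairs. That representation and the sample data are assumptions of this example.

```python
def precision_recall_f(predicted, actual):
    """Return (precision, recall, F-score) for sets of (user, url) pairs."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Hypothetical predictions and observations.
predicted = {("bob", "u1"), ("carol", "u1"), ("dave", "u2")}
actual = {("bob", "u1"), ("dave", "u2"), ("erin", "u2"), ("frank", "u3")}
precision, recall, f_score = precision_recall_f(predicted, actual)
```

Here two of three predictions are correct (precision 2/3) and two of four observed tweets are recovered (recall 1/2), giving an F-score of 4/7.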

  32. Learning
     - Input: a time window of tweets
     - Computation: gradient ascent method
     - Parameter space: $\alpha_{ji}, \beta_i, \gamma_u, \mu_i, \sigma_i^2$
     - Goal: maximize F-score
     - Output: $p_i^u$
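
The slides do not say how the F-score is made differentiable, so the following is only a toy sketch of gradient ascent on a smoothed F-score. It assumes predicted probabilities can stand in for fractional counts, uses a two-parameter logistic model in place of the real parameter space, and estimates gradients by central differences. All names and data are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_f_score(theta, xs, ys):
    """Smoothed F-score: 2*TP / (predicted + actual) with fractional counts."""
    probs = [sigmoid(theta[0] * x + theta[1]) for x in xs]
    soft_tp = sum(p * y for p, y in zip(probs, ys))
    denom = sum(probs) + sum(ys)
    return 2.0 * soft_tp / denom if denom else 0.0

def numerical_gradient(f, theta, eps=1e-6):
    """Central-difference estimate of the gradient of f at theta."""
    grad = []
    for k in range(len(theta)):
        up, down = list(theta), list(theta)
        up[k] += eps
        down[k] -= eps
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad

def gradient_ascent(f, theta, rate=0.5, steps=200):
    """Repeatedly step in the direction of the estimated gradient."""
    theta = list(theta)
    for _ in range(steps):
        grad = numerical_gradient(f, theta)
        theta = [t + rate * g for t, g in zip(theta, grad)]
    return theta

# Toy data: one feature per candidate tweet, label 1 if it happened.
xs = [0.1, 0.2, 0.8, 0.9]
ys = [0, 0, 1, 1]
objective = lambda theta: soft_f_score(theta, xs, ys)
start = [0.0, 0.0]
learned = gradient_ascent(objective, start)
f_before = objective(start)
f_after = objective(learned)
```

Starting from all-zero parameters, every probability is 0.5 and the smoothed F-score is 0.5; the ascent steps then raise it.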

  33. Lineup
     - LT: Linear Threshold model
     - LTr: Linear Threshold model with $\alpha_{ij}$ instead of $\alpha_{ji}$
     - ALO: At-Least-One model
     - RND: baseline, makes random guesses about $p_i^u$

  34. * training data: first 150 h, test data: next 150 h, results for 100 random URLs
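
A time-based split like the one in the footnote can be sketched as below. The record layout `(hours_since_start, user, url)` and the helper name `split_by_time` are assumptions for this example.

```python
def split_by_time(tweets, boundary_hours=150.0):
    """Split tweets into the first window (train) and the next one (test).

    tweets: list of (hours_since_start, user, url) records.
    """
    train = [t for t in tweets if t[0] < boundary_hours]
    test = [t for t in tweets
            if boundary_hours <= t[0] < 2.0 * boundary_hours]
    return train, test

# Hypothetical records spanning the 300-hour window.
tweets = [
    (10.0, "alice", "u1"),
    (149.9, "bob", "u1"),
    (150.0, "carol", "u2"),
    (299.0, "dave", "u2"),
    (300.0, "erin", "u3"),  # falls outside the 300-hour window
]
train, test = split_by_time(tweets)
```

Splitting by time rather than at random keeps the test tweets strictly in the future of the training tweets, which matches the prediction task.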

  35. Summary
     - Log-normal degree distribution
     - Small-world: 3.6 hops from user to user
     - Power laws in the user activity and URL mentions
     - Cascades are shallow: exponential depth falloff
     - Log-normally distributed diffusion delay
     - The LT model:
       - predicts more than half of the URL tweets
       - with less than 15% false positive rate

  36. Ongoing work
     - Investigating mispredictions
       - URLs
       - users
     - Scaling up the real-time data mining
       - continuous MapReduce
       - crawler farm
     - Website: personalized URL rankings for Twitter users
     - Apply to other systems
