We know what you did at 9am: Analysis Systems with Dynamic User Generated Content



SLIDE 1

We know what you did at 9am

Analysis Systems with Dynamic User Generated Content

Christian Wallenta, Oxford
Mohamed Ahmed, UCL
Ian Brown, Oxford
Stephen Hailes, UCL
Felipe Huici, UCL

Multi Service Networks 2008

10.07.2008

SLIDE 2

MSN 2008 10.07.2008 2

Motivation

  • Understand how data enters these systems
  • Understand how data evolves over time
  • -> Derive models that explain when and from where data comes into these systems
  • -> Apply these models to a wider range of applications to optimise their performance

SLIDE 3

Quick Overview

SLIDE 4

Datasets

  • digg.com
– 1.5 million posts, including submission time, author and number of votes, between May and November 2007
– 1.6 million votes for 87,000 posts between 21 November and 1 December 2007
– 240,000 user profiles
  • reddit.com
– 183,000 posts (Nov 07 to Feb 08)
– 13,300 posts + votes (23 Nov to 30 Nov)

SLIDE 5

Content Generation

Trend

50,000 posts in May to 65,000 in November 2007

SLIDE 6

Content Generation

Volume per Week

reddit.com

SLIDE 7

Content Generation

Volume per Week

digg.com

SLIDE 8

Content Generation

User Contribution

SLIDE 9

Content Generation

User Contribution

SLIDE 10

Popularity Analysis

Votes Distribution

  • What % of the votes goes to what % of the posts?

SLIDE 11

Popularity Analysis

Votes Distribution

  • What % of the votes goes to what % of the posts?
  • If votes ~ popularity, then this distribution is always interesting for caching
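
The question on this slide can be answered directly from a list of per-post vote counts. The following is a minimal sketch, not the authors' analysis code: the function name `vote_share_of_top` and the sample numbers are made up for illustration.

```python
def vote_share_of_top(votes_per_post, top_fraction):
    """Return the fraction of all votes held by the top `top_fraction` of posts."""
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    ranked = sorted(votes_per_post, reverse=True)
    total = sum(ranked)
    if total == 0:
        return 0.0
    # Number of posts in the top group (at least one post).
    k = max(1, round(top_fraction * len(ranked)))
    return sum(ranked[:k]) / total


# Made-up vote counts for ten posts, for illustration only.
example_votes = [500, 300, 50, 40, 30, 25, 20, 15, 10, 10]
share = vote_share_of_top(example_votes, 0.2)
print(f"Top 20% of posts hold {share:.0%} of the votes")
```

Sweeping `top_fraction` from 0 to 1 gives the full cumulative curve that a caching study would look at.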

SLIDE 12

Popularity Analysis

Popularity Evolution

  • Now we know static behaviour, but...
  • How fast does this happen?
  • How long does content stay popular?
  • Monitor posts from submission time until they become inactive
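
A rough sketch of how such monitoring could be turned into a lifetime figure. The slides do not say when a post counts as inactive, so the rule used here (no new vote within `inactivity_gap`) is an assumption, as are the function name and the timestamps.

```python
def post_lifetime(submitted_at, vote_times, inactivity_gap):
    """Time from submission to the last vote before the post goes inactive.

    A post counts as inactive once `inactivity_gap` passes without a new vote
    (an assumed definition; the slides do not give one).
    """
    last_active = submitted_at
    for t in sorted(vote_times):
        if t - last_active > inactivity_gap:
            break  # the post went quiet before this vote arrived
        last_active = t
    return last_active - submitted_at


# Hypothetical vote timestamps, in hours after an arbitrary reference point.
votes = [0.5, 1.0, 1.2, 3.0, 4.5, 30.0]
print(post_lifetime(0.0, votes, inactivity_gap=6.0))
```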

SLIDE 13

Popularity Analysis

Popularity Evolution

digg.com

SLIDE 14

Popularity Analysis

Post Lifetime

SLIDE 15

Analysis Summary

  • Lots of content, periodic patterns
  • Few users create most of the content
  • Most votes go to a few posts
  • Content becomes popular fast, and has a short lifetime in contrast to e.g. YouTube

SLIDE 16

Data Generation Model

Motivation

  • Understand where data comes from, and when
  • Develop a simple, generalisable model that describes:
– the volume of content posted at any given sample interval
– the relative contribution of each of the 24 possible time zones
– the expected user behaviour throughout a 24h period

SLIDE 17

Data Generation Model

Identifying the dominant frequencies

Problem: Unprocessed time series is noisy

SLIDE 18

Data Generation Model

Identifying the dominant frequencies

Method: Apply a Fourier transform to identify the dominant frequencies.
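
A minimal sketch of this step on synthetic data, using a plain discrete Fourier transform written with the standard library. The hourly sampling, the function name `dominant_periods` and the synthetic series are assumptions for illustration; a real analysis would more likely use an FFT routine from a numerical library.

```python
import cmath
import math


def dominant_periods(series, top_n=2):
    """Return the `top_n` periods (in samples) with the largest DFT magnitude."""
    n = len(series)
    mean = sum(series) / n
    centred = [x - mean for x in series]  # drop the constant (zero-frequency) term
    magnitudes = []
    # Only frequencies 1 .. n/2 are needed for a real-valued series.
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centred)
        )
        magnitudes.append((abs(coeff), n / k))
    magnitudes.sort(reverse=True)
    return [period for _, period in magnitudes[:top_n]]


# Synthetic hourly post counts over four weeks: a daily and a weekly cycle.
hours = range(24 * 7 * 4)
synthetic = [
    100
    + 30 * math.sin(2 * math.pi * t / 24)    # daily rhythm
    + 10 * math.sin(2 * math.pi * t / 168)   # weekly rhythm
    for t in hours
]
print(dominant_periods(synthetic))
```

Keeping only the few strongest components and discarding the rest is what smooths the noisy raw series.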

SLIDE 19

Data Generation Model

Identifying the dominant frequencies

SLIDE 20

Data Generation Model

Identifying the dominant frequencies

SLIDE 21

Data Generation Model

Step 2: time zone distribution

  • Problem:
– Fourier gives us dominant frequencies, but no information about where the content was submitted from.
  • Method:
– Incorporate user location information into the Fourier model.
  • Assumptions:
– Majority of users state correct location
– Users who do not reveal their location are distributed geographically in the same proportions as those who do
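
Under the two assumptions above, per-zone weights can be estimated from the user profiles alone. This is a sketch only: the profile representation (one UTC offset per user, `None` when no location is given) and the function name are hypothetical.

```python
from collections import Counter


def time_zone_weights(user_zones):
    """Estimate each time zone's share of users.

    `user_zones` holds one UTC offset per user, or None when the profile
    gives no location. Unknown users are spread over the zones in proportion
    to the known users, which amounts to ignoring them when normalising.
    """
    known = Counter(z for z in user_zones if z is not None)
    total_known = sum(known.values())
    if total_known == 0:
        raise ValueError("no user states a location")
    return {zone: count / total_known for zone, count in known.items()}


# Hypothetical profiles: UTC offsets, None = location not stated.
profiles = [-5, -5, -8, 0, 0, 0, 1, None, None, -5]
weights = time_zone_weights(profiles)
print(weights)
```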

SLIDE 22

Data Generation Model

Step 2: time zone distribution

  • Problem: Some countries have more than one time zone
  • Assumption: User distribution is the same as popularity distribution within the zones

SLIDE 23

Data Generation Model

Step 2: time zone distribution

SLIDE 24

Data Generation Model

Step 3: expected user behaviour

  • Idea:
– Content volume per time interval is the sum of contribution of all time zones
  • Assumption:
– Users in different zones follow roughly the same usage pattern

= x ?
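
One way to write the idea and the assumption on this slide as a formula. The notation is assumed here and does not come from the slides: $V(t)$ is the observed content volume in hour $t$ (UTC), $w_z$ is the share of users in the zone with UTC offset $z$, and $u(h)$ is the common usage pattern at local hour $h$.

```latex
V(t) \;=\; \sum_{z} w_z \, u\big((t + z) \bmod 24\big), \qquad t = 0, 1, \dots, 23
```

With hourly samples this gives 24 linear equations in the 24 unknowns $u(0), \dots, u(23)$, where $V$ and $w$ are known and $u$ is the quantity to recover.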

SLIDE 25

Data Generation Model

Step 3: expected user behaviour

SLIDE 26

Data Generation Model

Step 3: expected user behaviour

SLIDE 27

Data Generation Model

Step 3: expected user behaviour

SLIDE 28

Data Generation Model

Model applied to reddit.com

initial fit:

SLIDE 29

Data Generation Model

Model applied to reddit.com

adapted weights:

SLIDE 30

Model Summary

  • Periodic pattern can be modelled with few dominant frequencies
  • Time zone analysis reveals where content comes from
  • Decomposed model describes user behaviour within a single time zone

SLIDE 31

Design Implications

Applying Geo-Temporal Information

  • Energy-efficient load balancing
– (Chen et al, NSDI 2008)
  • Similar patterns exhibited in
– Facebook (Golder et al, CT 2007)
– MSN (Chen et al, NSDI 2008)
– Gaming (Chambers et al, IMC 2005)
  • Peer-to-Peer Churn / Content Distribution
– neighbour selection / replication

SLIDE 32

Example

SLIDE 33

Future Work

  • Comparing different node selection strategies when replicating data in distributed systems
  • Can taking into account time zone information increase performance?
  • Test other datasets
  • How can time zone behaviour be learned in a distributed way?

SLIDE 34

The End

Thank you

SLIDE 35

Content Generation

Link Analysis

  • Aim:
– Understand what “content” is submitted
– What “content” becomes popular
– Does the user filtering achieve anything?

SLIDE 36

Content Generation

Link Analysis

SLIDE 37

Content Generation

Link Analysis

SLIDE 38

Popularity Analysis

Popularity Evolution

reddit.com

SLIDE 39

Data Generation Model

Step 3: expected user behaviour

= x ?

SLIDE 40

Data Generation Model

Step 3: expected user behaviour

  • Solve linear equations
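
A toy sketch of what solving these equations could look like. It assumes the observed hourly volume V(t) is a weighted sum of one shared usage pattern u, shifted by each zone's UTC offset z: V(t) = sum over z of w_z * u((t + z) mod 24). That formulation, the function names and all numbers are assumptions for illustration, not the authors' implementation.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("system is singular; weights do not determine u")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def usage_pattern(volume, weights, hours=24):
    """Recover u from V(t) = sum_z w_z * u((t + z) mod hours)."""
    a = [[0.0] * hours for _ in range(hours)]
    for t in range(hours):
        for zone, w in weights.items():
            a[t][(t + zone) % hours] += w
    return solve_linear(a, volume)


# Toy check with made-up numbers: build V from a known pattern, then invert.
weights = {-5: 0.5, 0: 0.3, 1: 0.2}
true_u = [float(10 + (h % 12)) for h in range(24)]
volume = [
    sum(w * true_u[(t + z) % 24] for z, w in weights.items())
    for t in range(24)
]
recovered = usage_pattern(volume, weights)
print(max(abs(r - u) for r, u in zip(recovered, true_u)))
```

On real data the system is noisy and may be ill-conditioned, so a least-squares fit would likely be a more robust choice than an exact solve.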

SLIDE 41

Data Generation Model

Step 3: expected user behaviour

= x ?

SLIDE 42

Design Implications

Popularity Prediction

  • Content popularity follows 80-20 rule: caching can increase performance
  • Problems/Challenges:
– new content constantly enters the system
– content becomes popular rapidly
– content has a short lifetime

Cacheable content needs to be identified early