SLIDE 12 The Data Collection Tier
Data is acquired from data aggregators, focusing on specific geographic areas (such as at a city level) and distinguishing between:
- 1. streams of posts pushed to the application as soon as they are generated (such as tweets via the Twitter Streaming API),
- 2. new posts that need to be pulled at a given rate (e.g. Google Blogger posts via the Blogger REST API). The data pulling rate determines how close to real time the users' trending topics can be.
[Figure labels: faster content generation rates; slower content generation rates]
The locality of posts is achieved by specifying geolocation parameters to the Twitter Streaming API and by limiting REST requests to new posts from users retrieved from Blogger based on their declared location of residence.
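The push and pull acquisition paths, together with the locality restriction on pulled posts, can be sketched roughly as follows. This is a minimal illustration, not the system's actual code: `Post`, `on_pushed_post`, `pull_new_posts`, `fetch_since` and `local_authors` are hypothetical names, and no real Twitter or Blogger client call is shown.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set


@dataclass
class Post:
    post_id: str
    author: str
    text: str
    timestamp: float  # assumed: seconds since the epoch


def on_pushed_post(post: Post, sink: List[Post]) -> None:
    """Push path: the aggregator delivers each post as soon as it is generated."""
    sink.append(post)


def pull_new_posts(
    fetch_since: Callable[[float], Iterable[Post]],  # hypothetical REST wrapper
    local_authors: Set[str],  # users whose declared residence is in the area
    last_seen: float,
    sink: List[Post],
) -> float:
    """Pull path: run one polling round and return the new high-water mark."""
    newest = last_seen
    for post in fetch_since(last_seen):
        if post.timestamp <= last_seen:
            continue  # already collected in an earlier round
        if post.author in local_authors:
            sink.append(post)  # keep only posts by locally resident users
        newest = max(newest, post.timestamp)
    return newest
```

Calling `pull_new_posts` on a timer gives the pulling rate: a shorter interval brings the collected posts closer to real time at the cost of more requests.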
Hyperlinks are extracted from tweets and a separate process retrieves the associated content from the web page they lead to:
- either simply its title or also additional content if it leads to a blog post (merging of content from different sources)
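The link extraction and title retrieval step might look like the sketch below. It uses only the Python standard library. The URL pattern is a deliberately simple assumption, and `extract_links`, `TitleParser` and `page_title` are illustrative names. Downloading the page is left out, so `page_title` takes HTML that has already been fetched.

```python
import re
from html.parser import HTMLParser
from typing import List

# Simplified assumption: a link is "http(s)://" followed by non-space characters.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_links(text: str) -> List[str]:
    """Return the hyperlinks found in a tweet, without trailing punctuation."""
    return [url.rstrip(".,;:!?)") for url in URL_PATTERN.findall(text)]


class TitleParser(HTMLParser):
    """Collects the text inside the <title> element of a page."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title_parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def page_title(html: str) -> str:
    """Return the page title with whitespace collapsed, or '' if absent."""
    parser = TitleParser()
    parser.feed(html)
    return " ".join("".join(parser.title_parts).split())
```

Keeping the download in a separate process, as the slide describes, means slow web pages do not hold up the tweet stream.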
Three data models
- Tweet: id, text, hashtags, timestamp
- Extended Tweet: id, text (tweet status + referenced web pages titles), hashtags, timestamp, List <Blog post>
- Blog post: id, text, title, tags, timestamp
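One possible encoding of the three data models is sketched below. The field names follow the slide, while the field types, the inheritance between `Tweet` and `ExtendedTweet`, and the `extend` helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BlogPost:
    id: str
    text: str
    title: str
    tags: List[str]
    timestamp: float  # assumed representation


@dataclass
class Tweet:
    id: str
    text: str
    hashtags: List[str]
    timestamp: float


@dataclass
class ExtendedTweet(Tweet):
    # Here `text` holds the tweet status plus the referenced web page titles.
    blog_posts: List[BlogPost] = field(default_factory=list)


def extend(tweet: Tweet, titles: List[str], blog_posts: List[BlogPost]) -> ExtendedTweet:
    """Merge a tweet with content retrieved from the pages it links to."""
    merged_text = " ".join([tweet.text, *titles])
    return ExtendedTweet(
        tweet.id, merged_text, list(tweet.hashtags), tweet.timestamp, list(blog_posts)
    )
```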
12 MSND@WWW 2012, April 16th, Lyon, France