Social Networking Trends and Dynamics Detection via a Cloud-based Framework Design - PowerPoint PPT Presentation



SLIDE 1

Social Networking Trends and Dynamics Detection via a Cloud-based Framework Design

Athena Vakali, Maria Giatsoglou, Stefanos Antaris
Department of Informatics, Aristotle University, Thessaloniki, Greece

{avakali, mgiatsog, santaris}@csd.auth.gr

MSND@WWW 2012, April 16th, Lyon, France

SLIDE 2

Outline

Trend detection from social media:

  • Motivation & Challenges

Current approaches

Problem formulation

The Cloud4Trends 3-tier approach

Current limitations and the Cloud as a solution

The Cloud4Trends Cloud-based architecture

Implementation details

Future outlook

SLIDE 3

Social media for understanding the pulse of the public

Social media have emerged as a popular means of communication and opinion sharing

In social media:

  • Users’ discussions range over a wide variety of topics
  • Users generally express their opinions freely

Social media can be viewed as a reflection of societal concerns, exhibiting ‘bursts’ of content generation when events occur

  • popular topics / interests fluctuate with time

It is challenging for both computer scientists and application developers to reach unbiased, meaningful conclusions about users’ trending opinions and interests

SLIDE 4

Trend detection from social media: Challenges

Massive content sizes and unpredictable content generation rates make analysis difficult

  • scalable analysis is needed

Trending topics should be discovered while they are “fresh”

  • an on-line analysis approach is required

Trends should be meaningful

  • need for contextual trends

Content is dispersed across multiple sources

  • trend detection needs a combined approach

SLIDE 5

Trend detection from text-based social media content

Users daily generate multimedia content in social media

Some approaches detect events from multimedia content (e.g. images), but they are typically limited to off-line detection

Text-based content offers a more flexible and promising ground for online trend detection

  • it is the most frequently user-generated type of content;
  • users’ opinions are more explicitly expressed in text.

SLIDE 6

Blogs and Microblogs as sources for trending topics detection

WHY Blogs?

Blogs have been used for quite a long time and are still popular

As opposed to typical online information sources, blogs are primarily opinion-oriented, reflecting the author’s freely expressed point of view

WHY Microblogs?

Microblogging apps are key actors in real-time information broadcasting

Content generation rates in microblogging apps are very high, with Twitter currently reaching 200 million posts per day

Recent studies indicate that microblogging (via tweets) is a valuable source of latent information about the dynamics of the public’s opinions/views

  • E.g. for prediction of forthcoming films’ revenue and stock prices, real-time identification of earthquakes, analysis of users’ reactions to political debates

Microblogging apps capture the momentum of a large-scale public

  • Twitter alone has more than 300 million registered users

Although posts are short, they usually contain links to external web pages, thus allowing content merging to enrich a post’s context

  • e.g. around 25% of all tweets by Sept. 2010 included at least 1 hyperlink

SLIDE 7

Current approaches

  • Typical trend analysis approach: application of traditional statistical methods, based on the total number of keyword occurrences in texts, to identify temporal trends; most large-scale efforts specialize in search analysis (e.g. Google Hot Trends)
  • Clustering has been used for trend detection in blogs, in an offline manner
  • Most relevant online approaches include:
  • BlogScope & TwitterMonitor: collect information from blogs & other online sources and perform spatiotemporal burst detection. Event discovery is based on a given term query and on the assumptions that: i) there is a burst when an event occurs, ii) events have a temporal & geographical scope;
  • NewsStand: online news aggregator service that monitors feeds from online news sources and detects their geographical focus by analyzing their content. Articles are grouped into news stories with an online text-based clustering technique, and each story’s geographic context is determined by its members’ geographical focus;
  • TwitterStand: focuses on news detection from tweets. It manually selects Seeder users known to post news, and tweets are clustered with an online method based on similarities between tweets’ and clusters’ TF-IDF feature vectors (accounting also for their temporal similarity).
  • Application scalability concerns are not adequately addressed in most research approaches
  • The usefulness of tweet content expansion for trending topic detection has not been studied yet (no generic framework combining tweet & blog analysis for trend detection)

SLIDE 8

Cloud4Trends

Cloud4Trends is a localized microblogging & blogging content collection and analysis framework for detecting currently popular topics of users’ interest

Cloud4Trends:

  • addresses the Web 2.0 large-scale reality by adopting methods for efficiently handling fast-evolving data in real time;
  • supports the analysis, in a unified way, of text data from different web sources which may be generated at various rates;
  • proposes a methodology for unsupervised detection of local contextual trends, combining content from different web sources;
  • captures the shaping and evolution of users’ interests given their broader geographical location and the type of data source;
  • follows a Cloud-based data processing methodology to support a streaming web data clustering scenario

SLIDE 9

Cloud4Trends approach

Trend dynamics are detected using Twitter and the Blogosphere as data sources.

Applies incremental text clustering for detecting & maintaining a set of dynamic clusters

  • assumes that analysis at a “document” instead of a “term” level is more promising for providing trending topics that are meaningful to users

Clustering approach extends earlier work in TwitterStand

  • we expand the original tweet content with additional information by following referenced web sources;
  • we consider active clusters as active topics of users’ interest and rank them based on their observed activity to identify the most popular (trending) ones at their peak;
  • our analysis pertains to certain geographical areas from the data collection phase, rather than examining the geographical scope of the resulting clusters as a post-analysis process;
  • we have designed a parallel Cloud-based architecture to address scalability concerns and enable our application to analyze more content (e.g. pertaining to several cities).

SLIDE 10

Trend detection: Problem formulation

Given a time-ordered stream of users’ posts Pt, t = [1,… ) arriving in real-time (tweets) or at a given time granularity (blog posts), identify topics and associated posts that are popular (“trending”) at any given time, and monitor their dynamics and evolution across time in terms of their popularity.

SLIDE 11

Cloud4Trends 3-tier design

SLIDE 12

The Data Collection Tier

Data are acquired from data aggregators focused on specific geographic areas (such as at a city level), distinguishing between:

  • 1. streams of posts pushed to the application as soon as they are generated (such as tweets via the Twitter Streaming API); these have faster content generation rates,
  • 2. new posts that need to be pulled at a given rate (e.g. Google Blogger posts via the Blogger REST API); these have slower content generation rates. The data pulling rate determines how close to real time the users’ trending topics are.

The locality of posts is achieved by specifying geolocation parameters to the Twitter Streaming API and by limiting REST requests to new posts from users retrieved from Blogger based on their declared location of residence.

Hyperlinks are extracted from tweets and a separate process retrieves the associated content from the web page they lead to

  • either simply its title or also additional content if it leads to a blog post (merging of content from different sources)

Three data models

  • Tweet: id, text, hashtags, timestamp
  • Extended Tweet: tweet status + referenced web pages’ titles, List<Blog post>
  • Blog post: id, text, title, tags, timestamp
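
The three data models can be sketched as plain record types. This is only an illustration in Python: the class and field names, the choice of a float timestamp, and the `full_text` helper are assumptions made here, not the authors' actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BlogPost:
    id: str
    text: str
    title: str
    tags: List[str]
    timestamp: float  # assumed unit: seconds since the epoch


@dataclass
class Tweet:
    id: str
    text: str
    hashtags: List[str]
    timestamp: float


@dataclass
class ExtendedTweet:
    """A tweet enriched with content reached through its hyperlinks."""
    tweet: Tweet
    page_titles: List[str] = field(default_factory=list)
    blog_posts: List[BlogPost] = field(default_factory=list)

    def full_text(self) -> str:
        # Hypothetical helper: merge the tweet text with the linked content.
        parts = [self.tweet.text, *self.page_titles]
        parts += [post.text for post in self.blog_posts]
        return " ".join(parts)
```

Keeping the original tweet nested inside the extended one means the unexpanded form stays available for comparison.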

SLIDE 13

The Data Analysis & Processing Tier: Preprocessing and modeling

1. Preprocessing to filter out posts with very few terms

2. Text sanitization techniques are applied to filter out common words and perform stemming

3. All posts are mapped to a common Resource model containing:

  • id,
  • TF-IDF based key-value map,
  • timestamp,
  • post type (tweet, blog post, or extended tweet)
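
A minimal Python sketch of the first two preprocessing steps follows. The stop-word list, the `MIN_TERMS` threshold, and the crude suffix stripper are all assumptions chosen for illustration; a real deployment would use a proper stemmer (e.g. Porter's) and a full stop-word list.

```python
import re
from typing import List, Optional

STOP_WORDS = {"the", "a", "an", "is", "are", "in", "of", "to", "and"}  # toy list
MIN_TERMS = 3  # assumed threshold for "very few terms"


def naive_stem(word: str) -> str:
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str) -> Optional[List[str]]:
    """Return the sanitized terms of a post, or None if the post is too short."""
    tokens = re.findall(r"[a-z0-9#]+", text.lower())
    if len(tokens) < MIN_TERMS:          # step 1: drop posts with very few terms
        return None
    # step 2: remove common words, then stem what is left
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]
```

For example, `preprocess("The storms are flooding the streets of Lyon")` yields the terms `storm`, `flood`, `street`, `lyon`, while a two-word post is discarded.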

SLIDE 14

The Data Analysis & Processing Tier: Key-Value map generation

A separate index is kept for each resource type and for each of the

  • text, (hash)tags, (title)

attributes via the Lucene Search Engine library.

Via these indices, TF-IDF key-value maps are defined for each attribute:

  • Goal is to represent a resource’s textual content with a single attribute

Observations:

  • a given key (term) may exist in more than one attribute;
  • usually the tags and title attributes are more indicative of the resource’s content than the text

Approach:

  • TF-IDF values for each key are combined in a single value under a weighting scheme assigning more weight to terms in titles and tags
  • E.g. a given term T in a blog post resource R will be assigned a score value SC(T,R) that combines its per-attribute TF-IDF values, where wi : [0,1] stands for the weight for attribute i

The TF-IDF based key-value map of R will be:

[T1:SC(T1,R), T2:SC(T2,R),…, TN:SC(TN,R)]

N: # distinct terms in the 3 attributes
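
The exact scoring formula is not reproduced in this transcript, so the Python sketch below assumes the simplest scheme consistent with the description: a weighted sum of the per-attribute TF-IDF values. The weight values and the function name are hypothetical.

```python
from typing import Dict

# Hypothetical weights in [0, 1]; titles and tags count more than body text.
WEIGHTS = {"title": 1.0, "tags": 0.8, "text": 0.5}


def combine_scores(per_attribute: Dict[str, Dict[str, float]],
                   weights: Dict[str, float] = WEIGHTS) -> Dict[str, float]:
    """Merge per-attribute TF-IDF maps into one term -> score map.

    Assumed rule: SC(T, R) = sum over attributes i of w_i * TFIDF_i(T, R).
    """
    combined: Dict[str, float] = {}
    for attribute, tfidf_map in per_attribute.items():
        w = weights[attribute]
        for term, value in tfidf_map.items():
            combined[term] = combined.get(term, 0.0) + w * value
    return combined
```

A term that appears in both the title and the tags accumulates both weighted contributions, so it ends up ranked above a term with the same raw TF-IDF that appears only in the body text.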

SLIDE 15

The Data Analysis & Processing Tier: Clustering

Web Data Streams Clustering problem formulation: Given a time-point t, a new resource Rt created at t, and a set of n clusters Ci, i = 1…n, active at t, assign Rt to the cluster Ck for which the similarity Sim(Rt,Ci) is maximized. If Sim(Rt,Ci) < sim_threshold for each i in [1,n], then start a new cluster with Rt.

  • Cloud4Trends is flexible, allowing the use of different Similarity Functions and Cluster Representation formats
  • We have implemented:
  • Cluster Representation model containing: id; the cluster’s centroid, i.e. a mean key-value map (keys: the union of all cluster members’ terms, values: for each term the average of the members’ respective score values) and a mean timestamp (the average of the members’ timestamps); list of member resources; creation time; extinction time (when it became inactive)
  • Similarity function: a variation of the cosine similarity function that also takes the time aspect into account, to bring resources closer to clusters that on average contain members with similar timestamps.
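
The assignment rule and the centroid update can be sketched in Python as below. The slides do not give the exact time-aware similarity, so the exponential decay on the timestamp gap, the `time_scale` parameter, and the default threshold are assumptions; only the "maximize similarity, else start a new cluster" logic and the mean-based centroid come from the description.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List

Vector = Dict[str, float]


@dataclass
class Resource:
    id: str
    scores: Vector      # TF-IDF based key-value map
    timestamp: float


@dataclass
class Cluster:
    centroid: Vector    # mean key-value map over the members
    mean_timestamp: float
    members: List[Resource] = field(default_factory=list)

    def add(self, r: Resource) -> None:
        """Add a member and update the running means incrementally."""
        n = len(self.members)
        for term in set(self.centroid) | set(r.scores):
            old = self.centroid.get(term, 0.0)
            self.centroid[term] = (old * n + r.scores.get(term, 0.0)) / (n + 1)
        self.mean_timestamp = (self.mean_timestamp * n + r.timestamp) / (n + 1)
        self.members.append(r)


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def similarity(r: Resource, c: Cluster, time_scale: float = 3600.0) -> float:
    # Assumed time-aware variant: cosine damped by the timestamp distance.
    decay = math.exp(-abs(r.timestamp - c.mean_timestamp) / time_scale)
    return cosine(r.scores, c.centroid) * decay


def assign(r: Resource, clusters: List[Cluster], sim_threshold: float = 0.5) -> Cluster:
    """Put r in its most similar cluster, or start a new one below threshold."""
    best = max(clusters, key=lambda c: similarity(r, c), default=None)
    if best is None or similarity(r, best) < sim_threshold:
        best = Cluster(dict(r.scores), r.timestamp, [r])
        clusters.append(best)
    else:
        best.add(r)
    return best
```

Because the centroid and mean timestamp are updated incrementally, each new resource costs one pass over the active clusters and no re-clustering of earlier posts.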

SLIDE 16

The Trend Detection & Visualization Tier

  • A given cluster is characterized as:
  • active: corresponds to topics that are popular at the given time
  • inactive: corresponds to topics that are no longer trending.
  • Clusters’ update rates are monitored to determine when a cluster should be made inactive.

1st approach

  • a cluster is made inactive when no new resources have been assigned to it for a given timespan threshold

2nd approach

  • for each cluster we keep successive values corresponding to the temporal distance between the timestamps of the last two resources assigned to the cluster
  • the moving average of this value allows us to identify cluster popularity states (rising in popularity, reaching a peak, or getting inactive)

  • To obtain the actual trends, active clusters should be ranked in terms of an activity measure
  • for each type of resource and monitored location, active clusters are ranked based on: i) their number of members, and ii) their mean timestamp
  • A summary description is extracted for each cluster, comprising selected member terms/phrases based on their scores and their significance (hashtags, title terms, etc.)
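
Both inactivity approaches and the ranking step can be illustrated with a short Python sketch. The window size, the timeout, and the way the two ranking criteria are combined (member count first, mean timestamp as tie-breaker) are assumptions made for the example; the slides only name the criteria.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple


class ActivityMonitor:
    """Tracks the gaps between successive resources assigned to one cluster."""

    def __init__(self, window: int = 3, timeout: float = 3600.0) -> None:
        self.gaps: Deque[float] = deque(maxlen=window)  # most recent gaps only
        self.last_timestamp: Optional[float] = None
        self.timeout = timeout

    def record(self, timestamp: float) -> None:
        if self.last_timestamp is not None:
            self.gaps.append(timestamp - self.last_timestamp)
        self.last_timestamp = timestamp

    def moving_average(self) -> Optional[float]:
        # 2nd approach: a shrinking average gap suggests rising popularity.
        return sum(self.gaps) / len(self.gaps) if self.gaps else None

    def is_active(self, now: float) -> bool:
        # 1st approach: inactive once nothing arrived within the timeout.
        return (self.last_timestamp is not None
                and now - self.last_timestamp <= self.timeout)


def rank_clusters(clusters: List[Tuple[str, int, float]]) -> List[str]:
    """Rank (name, member count, mean timestamp) tuples, most active first."""
    ordered = sorted(clusters, key=lambda c: (c[1], c[2]), reverse=True)
    return [name for name, _, _ in ordered]
```

Comparing successive values of `moving_average()` is one way to tell a cluster that is still accelerating from one that has passed its peak.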

SLIDE 17

Current limitations

Tweet streams can have unexpected peaks, while blog post sizes may be considerably large

  • problems in handling large & fluctuating data sizes arise

Several operations in Cloud4Trends need to be parallelized

  • data should be concurrently analyzed for the different geographic areas and their analysis should be fast;
  • blogs and Twitter data should be collected in parallel and undergo the same analysis process;
  • the data collection module should operate in parallel with the data analysis module

As data accumulate, their sizes can become considerably large

SLIDE 18

Trend detection on the Cloud

The Cloud computing paradigm offers significant ground for social stream mining applications, due to its support via scalable and powerful infrastructures

  • Computing resources can scale based on real-time processing demands
  • Heavy mining algorithms (e.g. involving similarity calculations) can be broken down into MapReduce-like steps, with the “Mapping” operations being distributed to separate computer nodes
  • Time-critical operations can be executed on time
  • It is easier to share datasets and applications’ results once they are on the Cloud
  • A Platform as a Service (PaaS) Cloud solution allows developers/scientists to focus on applications’ refinement/testing, rather than on how to set up and operate the underlying infrastructures

SLIDE 19

The VENUS-C platform

Cloud4Trends has been ported to the Cloud with use of the VENUS-C platform

The VENUS-C (Virtual Multidisciplinary EnviroNments USing Cloud Infrastructures) project offers EU research and industry communities an industrial-quality, service-oriented platform based on virtualization technologies, leveraging previous experience on Grids & Supercomputing

  • It aims to facilitate a range of research fields, allowing them to benefit from the advantages of a Cloud computing platform without having to develop custom Cloud-aware solutions

VENUS-C offers two programming models and appropriate data access mechanisms that constitute a convenient abstraction for deploying scientific applications on top of plain virtual machines

Each programming model is enacted behind a job submission service, where researchers can submit jobs and manage their workload

To run their apps, end-users should first upload their executables to a Cloud-based application repository so that they can be accessed by the enactment service, depending on the description of an incoming job request

Other VENUS-C services involve Cloud Data Access and Accounting

SLIDE 20

The Cloud4Trends Cloud-based architecture

Hybrid implementation based on the cooperation of:

i) on-premises client interface components and
ii) multiple job execution components with different functionalities on top of the VENUS-C Cloud services infrastructure.

Uses the VENUS-C Generic Worker programming model for job submission & application deployment on the Azure Cloud

  • Generic Worker is best suited for data-driven task-based job submissions, with each job description specifying the necessary input and output files

SLIDE 21

The proposed framework

SLIDE 22

Cloud4Trends modules

collect module:

  • client module involving data collectors and the interfaces required for the communication with the VENUS-C services (for job submission and data access)
  • interfaces for experiment setting (e.g. inclusion of a new city) and monitoring
  • web interface to communicate the Cloud4Trends results to end users

“on the cloud” module:

  • involves the Data Analysis and Processing Tier, which has been ported to the Cloud
  • the data Parsing, Clustering, & Cluster State Updating modules are deployed via the VENUS-C Services, with the related operations submitted as jobs through the Generic Worker programming model.
  • Clustering is realized by the Splitter, Similarity Calculation (Mapper), and Aggregation (Reducer) modules, under the MapReduce paradigm
  • the indexing module is implemented independently as a set of Cloud services responsible for creating indexes for each type of input data

cloud services module:

  • involves the specific VENUS-C components used for assisting and simplifying the application’s porting to the Cloud
  • VENUS-C Data Access SDK for accessing the Cloud Storage (Blobs and Tables)
  • VENUS-C Execution (enactment) Service for submitting, distributing, and setting up new processing jobs to the Cloud
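
The Splitter / Mapper / Reducer decomposition of the similarity search can be sketched locally in Python. Threads stand in for separate Cloud nodes here, plain cosine similarity is used for brevity (without the time aspect), and all function names are hypothetical; this does not use or imitate the VENUS-C Generic Worker API.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def split(centroids: Dict[str, Vector], parts: int) -> List[Dict[str, Vector]]:
    """Splitter: partition the cluster centroids into non-empty chunks."""
    items = list(centroids.items())
    return [dict(items[i::parts]) for i in range(parts) if items[i::parts]]


def map_similarity(resource: Vector, chunk: Dict[str, Vector]) -> Tuple[str, float]:
    """Mapper: best (cluster id, similarity) pair within one chunk."""
    return max(((cid, cosine(resource, c)) for cid, c in chunk.items()),
               key=lambda pair: pair[1])


def reduce_best(partials: List[Tuple[str, float]]) -> Tuple[str, float]:
    """Reducer: keep the overall best pair across all chunks."""
    return max(partials, key=lambda pair: pair[1])


def best_cluster(resource: Vector, centroids: Dict[str, Vector],
                 parts: int = 2) -> Tuple[str, float]:
    # centroids must be non-empty; each chunk is scored independently.
    chunks = split(centroids, parts)
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda ch: map_similarity(resource, ch), chunks))
    return reduce_best(partials)
```

Since each mapper only needs the resource and its own chunk of centroids, the chunks can be shipped to different workers without any shared state.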

SLIDE 23

Implementation details

The on-premises job submission client submits data for analysis in batches

The data indexing services have been deployed in Azure using the Azure Library for Lucene.NET

A scaling manager has been deployed on the Cloud as an individual service that increases the number of computing nodes when there are many jobs that remain pending for too long, and decreases them when the extra resources are no longer needed

Communication between the Cloud-based data parsers and the indexing services has been achieved via queues

Resources’ representations are “passed” between applications as serialized JSON objects

Cluster entities are stored in Azure Tables

Clusters’ representations are downloaded to the client-side module at a selected time interval and ranked for ‘hottest’ trends detection
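
The scaling manager's decision rule might look like the Python sketch below. The slides give no thresholds or step sizes, so every parameter (`max_wait`, `max_stale_jobs`, the one-node step, the node limits) and the "scale down when the queue is empty" condition are assumptions for illustration only.

```python
from typing import List


def desired_nodes(current_nodes: int,
                  pending_wait_seconds: List[float],
                  max_wait: float = 60.0,
                  max_stale_jobs: int = 5,
                  min_nodes: int = 1,
                  max_nodes: int = 10) -> int:
    """Return the node count the scaling manager should aim for.

    pending_wait_seconds holds, for each pending job, how long it has waited.
    """
    stale = sum(1 for wait in pending_wait_seconds if wait > max_wait)
    if stale > max_stale_jobs:
        # Many jobs have been pending for too long: add one node.
        return min(current_nodes + 1, max_nodes)
    if not pending_wait_seconds:
        # Nothing is waiting: release one node (assumed idle condition).
        return max(current_nodes - 1, min_nodes)
    return current_nodes
```

Changing the node count one step at a time keeps the rule simple and avoids large swings when a tweet stream peaks briefly.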

SLIDE 24

Future outlook

Fine-tuning of the clustering and trend detection algorithm and experimental evaluation of results

Implementation of a shared-based distributed Indexing service, since for the time being the service for each type of resource is deployed on a single instance

Measuring the system’s performance for different design parameters

Porting the visualization modules also to the Cloud
