SLIDE 1

Graph Visualization Tool for Twittersphere users based on a high-scalable Extract, Transform and Load System

Pablo Aragón, Íñigo García and Antonio García

May, 27th 2011

SLIDE 2

INDEX

INTRODUCTION

  • Cierzo Development and SMMART
  • Structure of Twitter
  • Volume of Twitter
  • Detection of influencers

DISTRIBUTED COMPUTATION

  • Hadoop
  • Amazon EC2

PIPELINE DESIGN

  • Crawling Module
  • Metadata Extraction Module
  • Indexing Module
  • Graph Visualization Module

RESULTS

  • Western Sahara Conflict
  • Patxi López
  • Conclusions
  • Future work

SLIDE 3

INTRODUCTION: CIERZO DEVELOPMENT AND SMMART

SMMART (Social Media Marketing Analysis and Reporting Tool) is the system developed by Cierzo Development for:

  • Corporate social reputation
  • Measuring the effectiveness of marketing campaigns
  • Detection of new trends

SLIDE 4

INTRODUCTION: STRUCTURE OF TWITTER

Structure of a profile

SLIDE 5

INTRODUCTION: STRUCTURE OF TWITTER

A user can establish a relationship with another user by:

  • Reply: Update that begins with @username
  • Mention: Update that contains @username in the body of the tweet
  • Retweet: Update that contains the body of another user's tweet and specifies the original author
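The three relationship types above can be told apart from the tweet text alone. The following is a minimal sketch, not the authors' implementation: it assumes the "RT @username" convention for retweets, and the function name `classify` and the regular expressions are illustrative choices.

```python
import re

# "@username" not preceded by a word character, so e-mail addresses are skipped.
MENTION = re.compile(r"(?<!\w)@(\w+)")
# Assumption: retweets follow the "RT @username ..." convention.
RETWEET = re.compile(r"^RT\s+@(\w+)", re.IGNORECASE)


def classify(text):
    """Return (kind, usernames) where kind is reply, mention, retweet or none."""
    text = text.strip()
    retweet = RETWEET.match(text)
    if retweet:
        # The original author is the user named right after "RT".
        return "retweet", [retweet.group(1)]
    users = MENTION.findall(text)
    if users and text.startswith("@"):
        # The update begins with @username.
        return "reply", users
    if users:
        # @username appears somewhere in the body.
        return "mention", users
    return "none", []


if __name__ == "__main__":
    for sample in ["@ana thanks!", "talking with @bob", "RT @carol: breaking news"]:
        print(sample, "->", classify(sample))
```

Checking for a retweet first matters, because a retweet also contains an @username and would otherwise be counted as a plain mention.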

SLIDE 6

INTRODUCTION: VOLUME OF TWITTER

More than 200M users publishing millions of tweets per day

SLIDE 7

INTRODUCTION: DETECTION OF INFLUENCERS

Old metrics are based on data such as:

  • Absolute info: Number of followers
  • Relative info: Quotient of following users and followers
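As a small worked example of the two metrics above, the sketch below computes them for one profile. The `Profile` class, its field names and the sample numbers are hypothetical, and returning `None` for a profile without followers is an assumption made only to avoid a division by zero.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Profile:
    username: str   # hypothetical sample data, not taken from the slides
    followers: int
    following: int


def absolute_metric(profile: Profile) -> int:
    """Absolute info: the number of followers."""
    return profile.followers


def relative_metric(profile: Profile) -> Optional[float]:
    """Relative info: following users divided by followers."""
    if profile.followers == 0:
        return None  # assumption: the quotient is undefined without followers
    return profile.following / profile.followers


if __name__ == "__main__":
    sample = Profile(username="example_user", followers=2000, following=500)
    print(absolute_metric(sample), relative_metric(sample))
```

Both values depend only on profile counters, so neither reflects who actually replies to, mentions or retweets the user.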

SLIDE 8

INTRODUCTION: DETECTION OF INFLUENCERS

Available search engines track Twitter and list results, but they do not assign a value to the users who appear in the response.

SLIDE 9

#spanishrevolution #yeswecamp #15m

SLIDE 10

DISTRIBUTED COMPUTATION

  • Management of large volumes at the lowest cost
  • Automatic adjustment to the daily growth of users and the oscillations in the frequency of publication

SLIDE 11

DISTRIBUTED COMPUTATION: HADOOP

  • MapReduce
  • Distributed File System
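To illustrate the MapReduce model named above, here is a single-process sketch that counts how often each user is mentioned. It only imitates the map, shuffle and reduce phases in plain Python; it does not use the Hadoop API, and the job itself (mention counting) is an assumed example rather than one described in the slides.

```python
import re
from collections import defaultdict

MENTION = re.compile(r"(?<!\w)@(\w+)")


def map_phase(tweet_text):
    """Map: emit one (username, 1) pair per mention in a tweet."""
    for user in MENTION.findall(tweet_text):
        yield user.lower(), 1


def shuffle(pairs):
    """Shuffle: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one username."""
    return key, sum(values)


def run(tweets):
    pairs = (pair for text in tweets for pair in map_phase(text))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(run(["hi @Ana", "@ana and @bob", "no mention here"]))
```

In a real cluster the map and reduce functions run on different nodes and the framework performs the shuffle, which is what lets the same job scale with the volume of tweets.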

SLIDE 12

DISTRIBUTED COMPUTATION: AMAZON EC2

A Hadoop node is defined as a machine image in Amazon Elastic Compute Cloud. The system's balancing mechanism adds and removes Hadoop nodes on demand in real time.

SLIDE 13

PIPELINE DESIGN

  • Crawling Module
  • Metadata Extraction Module
  • Indexing Module
  • Graph Visualization Module

SLIDE 14

PIPELINE DESIGN: CRAWLING MODULE

Based on Nutch

1. Crawl the Twitter profiles stored in a DB
2. Extract outlinks to new profiles
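The two steps above form a loop: every crawled profile can contribute new profiles to crawl. The sketch below shows that loop in plain Python and does not use Nutch. The `fetch` callable, the in-memory `PAGES` stand-in for the profile DB and the rule "a link to twitter.com with a single path segment is a profile" are all assumptions for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href" and value)


def profile_outlinks(html):
    """Step 2: extract outlinks that look like profile URLs (assumed URL shape)."""
    collector = LinkCollector()
    collector.feed(html)
    names = []
    for href in collector.hrefs:
        parts = urlparse(href)
        segments = [s for s in parts.path.split("/") if s]
        if parts.netloc == "twitter.com" and len(segments) == 1:
            names.append(segments[0].lower())
    return names


def crawl(seed_profiles, fetch, max_profiles=100):
    """Step 1: crawl known profiles breadth-first, queueing newly found ones."""
    frontier = deque(seed_profiles)
    seen = set(seed_profiles)
    visited = []
    while frontier and len(visited) < max_profiles:
        name = frontier.popleft()
        html = fetch(name)  # hypothetical fetcher: returns HTML or None
        visited.append(name)
        if html is None:
            continue
        for found in profile_outlinks(html):
            if found not in seen:
                seen.add(found)
                frontier.append(found)
    return visited


# Hypothetical stand-in for fetched profile pages.
PAGES = {
    "ana": '<a href="http://twitter.com/bob">@bob</a> <a href="http://example.com/x">x</a>',
    "bob": '<a href="http://twitter.com/ana">@ana</a> <a href="http://twitter.com/carol">@carol</a>',
}

if __name__ == "__main__":
    print(crawl(["ana"], PAGES.get))
```

The `seen` set keeps profiles that link to each other from being queued twice, and `max_profiles` bounds a crawl that would otherwise keep growing with the network.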

SLIDE 15

PIPELINE DESIGN: METADATA EXTRACTION MODULE

The HTML fragment of a tweet contains a set of metadata:

  • Textual content
  • Publication date
  • Author
  • Mentions of other users
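A possible shape for this extraction step is sketched below. The slides do not show the actual markup, so the HTML fragment, its class names (`tweet-author`, `tweet-date`, `tweet-text`) and the output field names are hypothetical; only the four metadata items come from the list above.

```python
import re
from html.parser import HTMLParser

MENTION = re.compile(r"(?<!\w)@(\w+)")


class TweetParser(HTMLParser):
    """Collects text from flat <span> elements, keyed by their class (hypothetical markup)."""

    FIELDS = {"tweet-author": "author", "tweet-date": "date", "tweet-text": "content"}

    def __init__(self):
        super().__init__()
        self.current = None
        self.record = {"author": "", "date": "", "content": ""}

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.current = self.FIELDS.get(dict(attrs).get("class"))

    def handle_endtag(self, tag):
        if tag == "span":
            self.current = None

    def handle_data(self, data):
        if self.current:
            self.record[self.current] += data


def extract_metadata(fragment):
    """Return content, date, author and mentions for one tweet fragment."""
    parser = TweetParser()
    parser.feed(fragment)
    parser.close()
    record = {key: value.strip() for key, value in parser.record.items()}
    # Mentions are derived from the textual content.
    record["mentions"] = MENTION.findall(record["content"])
    return record


# Hypothetical fragment, written only for this example.
FRAGMENT = (
    '<div class="tweet">'
    '<span class="tweet-author">ana</span>'
    '<span class="tweet-date">2010-11-12T09:30:00</span>'
    '<span class="tweet-text">Reading about the camp with @bob and @carol</span>'
    "</div>"
)

if __name__ == "__main__":
    print(extract_metadata(FRAGMENT))
```

Each resulting record maps naturally onto one document in the index, with one field per metadata item.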
SLIDE 16

PIPELINE DESIGN: INDEXING MODULE

Apache Solr (enterprise search server based on Lucene)

  • Sorting algorithms
  • Stemming
  • Stopword filters
  • Faceted search
  • Multicore architecture with sharding by publication date

SLIDE 17

PIPELINE DESIGN: GRAPH VISUALIZATION MODULE

The Graph Visualization module transforms the responses from the index into a graph, laid out with Yifan Hu's force-based multilevel algorithm provided in the Gephi Toolkit.
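Before any layout can run, the index responses have to become nodes and edges. The sketch below shows one way to do that in plain Python; it does not call the Gephi Toolkit and does not perform the layout. The record fields (`author`, `mentions`) and the use of weighted in-degree as a rough influence score are assumptions, not details given in the slides.

```python
from collections import Counter


def build_mention_graph(records):
    """Build weighted directed edges (author -> mentioned user) from index records."""
    edges = Counter()
    for record in records:
        source = record["author"].lower()
        for target in record["mentions"]:
            edges[(source, target.lower())] += 1
    return edges


def weighted_in_degree(edges):
    """Sum incoming edge weights per user (assumed proxy for influence)."""
    score = Counter()
    for (_source, target), weight in edges.items():
        score[target] += weight
    return score


if __name__ == "__main__":
    # Hypothetical records, shaped like the metadata listed for each tweet.
    sample = [
        {"author": "ana", "mentions": ["bob"]},
        {"author": "carol", "mentions": ["bob", "ana"]},
        {"author": "ana", "mentions": ["Bob"]},
    ]
    graph = build_mention_graph(sample)
    print(graph)
    print(weighted_in_degree(graph).most_common())
```

An edge list like this can then be handed to a layout tool, where heavily mentioned users tend to end up as visible hubs.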

SLIDE 18

SLIDE 19

RESULTS: WESTERN SAHARA CONFLICT

In November 2010, Moroccan security forces intervened in a camp in Western Sahara. This action was criticized by part of Spanish society.

SLIDE 20

RESULTS: WESTERN SAHARA CONFLICT

Search

  • content:'sahara'
  • language:'es'
  • date:[2010-11-10 TO 2010-11-18]

Results

  • 1721 users
  • 3925 tweets
  • 707 mentions
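The search above can be expressed as a single Solr query string. The sketch below only builds the query and the request URL, without contacting a server. The field names come from the slide; the base URL, the helper names and the expansion of the dates into full UTC timestamps (which Solr date fields expect) are assumptions.

```python
from urllib.parse import urlencode


def build_query(filters, start_day, end_day):
    """Combine field filters and a date range into one Solr query string."""
    clauses = [f'{field}:"{value}"' for field, value in filters.items()]
    # Assumption: the slide's short dates stand for whole days in UTC.
    clauses.append(f"date:[{start_day}T00:00:00Z TO {end_day}T23:59:59Z]")
    return " AND ".join(clauses)


def select_url(base_url, query, rows=10):
    """Build a URL for Solr's select handler, asking for a JSON response."""
    return base_url + "/select?" + urlencode({"q": query, "rows": rows, "wt": "json"})


if __name__ == "__main__":
    query = build_query({"content": "sahara", "language": "es"}, "2010-11-10", "2010-11-18")
    print(query)
    # Hypothetical local Solr address.
    print(select_url("http://localhost:8983/solr", query))
```

Swapping the `content` filter for a `mention` filter gives the second kind of search used in the results, centred on a user instead of a topic.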
SLIDE 21

RESULTS: WESTERN SAHARA CONFLICT

SLIDE 22

RESULTS: PATXI LÓPEZ

Patxi López is the President of the Basque Country Government. His campaign included strategies in social networks.

SLIDE 23

RESULTS: PATXI LÓPEZ

Search

  • mention:'patxi_lopez'
  • language:'es'
  • date:[2010-11-10 TO 2010-11-18]

Results

  • 186 users
  • 196 tweets
  • 366 mentions
SLIDE 24

RESULTS: PATXI LÓPEZ

SLIDE 25

RESULTS: CONCLUSIONS

  • The implemented tool identifies the main influencers on a specific topic or around a specific user
  • The highly scalable design adapts to a large social network such as Twitter
  • Enterprises can deploy social media monitoring systems using exclusively open source technologies
  • The tool provides information for performing crisis management

SLIDE 26

SLIDE 27

RESULTS: FUTURE WORK

  • New versions for more social media sources
  • Real-time results
  • New data mining applications
  • Predictive models

SLIDE 28

Thanks for your attention