
SLIDE 1

Exploring Big Data in Social Networks

virgilio@dcc.ufmg.br (meira@dcc.ufmg.br)

INWEB – National Science and Technology Institute for Web
Federal University of Minas Gerais - UFMG

May 2013

SLIDE 2

Some thoughts about computing, future and innovation…

SLIDE 3

What happens in 60 seconds on the Internet?

SLIDE 4


Explosion of Web Data

SLIDE 5


BIG DATA:
data collection,
storage,
management,
automated large-scale analysis

SLIDE 6

Research interests

BIG DATA Algorithms and MACHINE LEARNING SOCIAL and ECONOMICS

characterization
models
incentives
privacy
network effects
crowdsourcing
anti-social behavior
spam and malware

algorithms around social networks

VERY large graphs
data mining
analytics
Systems
Infrastructure
cloud
characterization

SLIDE 7

The fundamental challenge of Big Data is not collecting data -- it's making sense of it.

1) What is the starting point?
2) What are the computation paths to discovery?
3) What are the appropriate algorithms?
4) How to visualize the findings?

SLIDE 8

Experimental Methodology

Measure → Analyze → Model → Synthesize

Models

Analysis Validation

Observations Artifacts Algorithms

Distributions of Random Variables Synthetic Workloads Logs and Traces

What if questions:

SLIDE 9

Challenges in Online Social Networking Research

Explosive growth in size, complexity, and unstructured data;

Enabled by various experimental methods: observational studies, simulations,..., huge amount of data;

It is “big data,” the vast sets of information gathered by researchers at companies like Facebook, Google and Microsoft from patterns of cellphone calls, text messages and Internet clicks by millions of users around the world. Companies often refuse to make such information public, sometimes for competitive reasons and sometimes to protect customers’ privacy. (New York Times, May 21)

SLIDE 10

Enablers of Big Data

Hardware capability

– Storage capacity
– Network bandwidth
– Processing capacity
– Exponentially increasing capability at constant cost

Applications & Algorithms

– Online social networking
– Algorithmic breakthroughs: machine learning and data mining
– Cloud: Cost reductions and scalability improvements in computation
– Sensors everywhere

SLIDE 11

Price of 1 gigabyte of storage over time


Year    Cost
1981    $300,000
1987    $50,000
1990    $10,000
1994    $1,000
1997    $100
2000    $10
2004    $1
2012    $0.10
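The price series above implies a steady compound decline. A minimal sketch of that arithmetic follows, using only the figures listed on the slide; the function name `annual_change` is illustrative and not part of the original talk.

```python
# Year -> approximate cost of 1 GB of storage in US dollars (figures from the slide).
PRICE_PER_GB = {
    1981: 300_000.0,
    1987: 50_000.0,
    1990: 10_000.0,
    1994: 1_000.0,
    1997: 100.0,
    2000: 10.0,
    2004: 1.0,
    2012: 0.10,
}


def annual_change(prices, start, end):
    """Compound annual rate of change between two years (negative = price drop)."""
    years = end - start
    return (prices[end] / prices[start]) ** (1.0 / years) - 1.0


rate = annual_change(PRICE_PER_GB, 1981, 2012)
print(f"Average change per year, 1981-2012: {rate:.1%}")
```

Under these figures the price fell by roughly 38% per year on average over the 31-year span.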

SLIDE 12

OSN Research Focus

1. Understand: characteristics of social graphs of real data;
2. Discover: properties of social graphs;
3. Engineer: social graph built.

SLIDE 13

OSN research approach

Computational sociology: A natural sciences approach

– Gather and analyze OSN data to study problems in sociology

Social computing: An engineering approach

– Build systems that support / leverage human social interactions
– Understand human behavior (as opposed to considering it annoying noise)

Inspired by sociological theories

SLIDE 14

SLIDE 15

The Atlantic


SLIDE 16


SLIDE 17

Understanding Factors that Affect Response Rates in Twitter(*)

Active users can receive ∼1000 tweets per day;

Approximately 36% of all tweets are worth reading, 39% are neutral and 25% are “junk”;

Interesting Questions

– Do Twitter users receive more information than they are able to consume?
– Is it possible to identify factors that affect interactions (replies and retweets)?

(*) ACM Hypertext 2012, joint work with Giovanni Comarela, Mark Crovella, F. Benevenuto

SLIDE 18

Datasets: big data

Collected in August/September 2009, it contains the following information:

Users: 54,981,152
Tweets: 1,755,925,520 (almost a complete history)
Social Graph: 1,963,263,821 social links

It contains information related to Replies and Retweets (interactions)
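To give a sense of scale, the dataset totals can be turned into per-user averages. This is a back-of-the-envelope sketch using only the three counts on the slide; the averages themselves are not reported in the talk.

```python
# Dataset totals as listed on the slide.
USERS = 54_981_152
TWEETS = 1_755_925_520
LINKS = 1_963_263_821

tweets_per_user = TWEETS / USERS
links_per_user = LINKS / USERS

print(f"tweets per user: {tweets_per_user:.1f}")
print(f"social links per user: {links_per_user:.1f}")
```

These are plain means; distributions in social networks are typically highly skewed, so the typical user may look quite different from the average.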

SLIDE 19

Characterization

Waiting Times (overload evidence)

– How long does a tweet wait in the timeline before it is replied to (retweeted)?

Factors that affect interactions

– Message Age
– Previous Interactions
– Sending Rate
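A minimal sketch of the waiting-time measurement, assuming each interaction is available as a pair of timestamps (arrival in the timeline, then reply or retweet). The record layout and the sample values are hypothetical and do not come from the study's data.

```python
from statistics import median

# Hypothetical records: (time the tweet arrived in the timeline,
# time the user replied to or retweeted it), both in seconds.
interactions = [
    (0, 30),
    (100, 160),
    (200, 3800),
    (500, 620),
]


def waiting_times(records):
    """Seconds each tweet waited in the timeline before the interaction."""
    return [acted - arrived for arrived, acted in records if acted >= arrived]


waits = waiting_times(interactions)
print("waiting times (s):", waits)
print("median wait (s):", median(waits))
```

The median is used here because a few very late interactions would dominate a mean.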

SLIDE 20

Waiting Times

SLIDE 21

Message Age

SLIDE 22

Previous interaction

Are users who were previously replied to (retweeted) more likely to be replied to (retweeted) again?

We computed, for each user i, the conditional probability that a message m will be replied to (retweeted) by i, given that i has previously replied to (retweeted) the sender of m.
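One way to estimate this conditional probability for a single user is sketched below. It assumes a chronological list of (sender, replied) pairs for that user; this layout and the function name `reply_prob_given_prior` are illustrative assumptions, not the authors' implementation.

```python
def reply_prob_given_prior(timeline):
    """Estimate P(reply | the user already replied to this sender before).

    timeline: chronological list of (sender, replied) pairs seen by one user.
    Returns None if no message came from a previously replied-to sender.
    """
    replied_before = set()
    eligible = 0  # messages from senders the user had already replied to
    hits = 0      # of those, messages that received a reply
    for sender, replied in timeline:
        if sender in replied_before:
            eligible += 1
            hits += int(replied)
        if replied:
            replied_before.add(sender)
    return hits / eligible if eligible else None


# Hypothetical timeline of one user i.
timeline_i = [
    ("a", True), ("b", False), ("a", True), ("a", False),
    ("b", False), ("c", True), ("c", True),
]
p = reply_prob_given_prior(timeline_i)
print(f"P(reply | prior interaction) = {p:.2f}")
```

The same function works for retweets if the boolean marks a retweet instead of a reply.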

SLIDE 23

Sending rate

Are users with a higher sending rate more likely to be replied to (retweeted)?

For each user i and each j ∈ Out_i, we compared the sending rate of j with the fraction of her tweets replied to (retweeted) by i.
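The comparison can be sketched as follows for one user i. The slides do not define Out_i; it is taken here to be the set of accounts that i follows, and the counts below are hypothetical.

```python
# Hypothetical data for one user i: for each followed account j in Out_i,
# (tweets sent by j, days observed, tweets by j that i replied to).
followed = {
    "j1": (10, 10, 2),
    "j2": (100, 10, 5),
    "j3": (1000, 10, 10),
}


def rate_vs_response(stats):
    """Return {j: (sending rate in tweets/day, fraction of j's tweets replied to by i)}."""
    out = {}
    for j, (sent, days, replied) in stats.items():
        if sent == 0 or days == 0:
            continue  # no basis for a rate or a fraction
        out[j] = (sent / days, replied / sent)
    return out


pairs = rate_vs_response(followed)
for j, (rate, frac) in sorted(pairs.items(), key=lambda kv: kv[1][0]):
    print(f"{j}: {rate:.1f} tweets/day, {frac:.1%} replied to")
```

In this toy data the reply fraction falls as the sending rate rises; whether real data shows that pattern is what the characterization is meant to test.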

SLIDE 24

Reorganizing the Twitter Timeline

Use the knowledge presented to create a new way to show tweets to users

More interesting tweets (more likely to be replied to or retweeted) at the top of the timeline.

Two schemes

– Naive Bayes (NB)
– Support Vector Machine (SVM)

Three attributes

– Age(m): Age of m
– SR(m): Sending rate of the sender of m
– I(m): Binary indicator for previous interactions with the sender of m
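A minimal sketch of the Naive Bayes scheme on the three attributes, written in plain Python so that it is self-contained. It assumes Age(m) and SR(m) have been bucketed into categories ("new"/"old", "low"/"high"). The bucketing, the Laplace smoothing, the training rows and all function names are illustrative assumptions rather than the authors' setup.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows):
    """Train a categorical Naive Bayes model.

    rows: list of (features, label), label 1 = message got a reply/retweet.
    Returns a function mapping features to the log-odds of label 1.
    """
    label_counts = Counter(label for _, label in rows)
    value_counts = defaultdict(Counter)  # (attribute index, label) -> value counts
    values = defaultdict(set)            # attribute index -> distinct values seen
    for feats, label in rows:
        for k, v in enumerate(feats):
            value_counts[(k, label)][v] += 1
            values[k].add(v)

    def log_posterior(feats, label):
        # Log prior plus Laplace-smoothed log likelihood of each attribute.
        lp = math.log(label_counts[label] / len(rows))
        for k, v in enumerate(feats):
            num = value_counts[(k, label)][v] + 1
            den = label_counts[label] + len(values[k])
            lp += math.log(num / den)
        return lp

    def score(feats):
        return log_posterior(feats, 1) - log_posterior(feats, 0)

    return score


# Hypothetical training rows: (Age bucket, SR bucket, I(m)) -> interacted?
train = [
    (("new", "low", 1), 1),
    (("new", "low", 0), 1),
    (("new", "high", 1), 1),
    (("old", "low", 1), 0),
    (("old", "high", 0), 0),
    (("new", "high", 0), 0),
    (("old", "high", 0), 0),
    (("old", "low", 0), 0),
]
score = train_nb(train)

# Reorder a timeline so the messages most likely to get a response come first.
timeline = [("old", "high", 0), ("new", "low", 1), ("new", "high", 0)]
ranked = sorted(timeline, key=score, reverse=True)
print(ranked)
```

Ranking by log-odds rather than by a hard class label gives a full ordering of the timeline, which is what the reorganization needs.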

SLIDE 25

Results

SLIDE 26

Google+


New Kid on the Block: Exploring the Google+ Social Graph, ACM Internet Measurement Conference, Sigcomm, 2012, Boston.
Joint work with: G. Magno, G. Comarela, D. Saez and Meeyong Cha.

SLIDE 27

Online Social Networks

OSNs now reach 82% of the world’s Internet-using population (1.2 billion)

Social Networking accounts for 19% of all time spent online

Social Networking is the most popular online activity worldwide

Source: comScore, December 21, 2011


SLIDE 28

Google+ Growth


Google+ is the fastest growing OSN

[Figure: number of users vs. days]

SLIDE 29

Goal: characterization

Analyze how much and what kind of personal information people share in Google+

Measure statistics of the Google+ social graph and compare with other OSNs

Evaluate the impact of geography on user behavior in Google+


SLIDE 30

Dataset: big data

Nov. 11th to Dec. 27th (2011)

27,556,390 profiles
35,114,957 nodes
575,141,097 edges


SLIDE 31

What kind of information do people share more?

SLIDE 32

Privacy Concerns

Users revealing more information on their profiles face greater privacy risk

In Facebook (young users, to friends)¹:

– 64.1% share e-mail
– 10.7% share telephone
– 10.7% share home address


SLIDE 33

What kind of information do people share more?

In Google+ (public):

– 0.22% share Work contact
– 0.21% share Home contact
– 0.26% share telephone numbers (72,736 users)

Users that shared telephone: tel-users


SLIDE 34

Number of fields shared in profile


Tel-users share more information

SLIDE 35

Information shared by users

Women are less likely to share their phone number.
The majority of tel-users are single; a smaller fraction of them are in a relationship.
The fraction of Indian users in the tel-users group is twice as big as in other countries.


SLIDE 36

How are people connected on Google+?

SLIDE 37

Structural Characteristics of Social Graphs

New network → Lower number of friends
Higher reciprocity = More social
“Hidden” edges → Higher avg. path length
Diameter similar to Twitter, lower than Facebook

SLIDE 38

Structural Characteristics – Clustering Coefficient

Higher Clustering Coefficient than Twitter
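Reciprocity and clustering coefficient can both be computed directly from an edge list. The sketch below uses common textbook definitions, which are an assumption here since the slides do not state the exact ones used: reciprocity is the fraction of directed edges whose reverse also exists, and clustering is the average local coefficient on the undirected version of the graph, skipping nodes with fewer than two neighbours. The toy graph is hypothetical.

```python
from itertools import combinations

# Hypothetical directed follow graph: (u, v) means u follows v.
edges = {("a", "b"), ("b", "a"), ("a", "c"), ("c", "b"), ("d", "a")}


def reciprocity(edges):
    """Fraction of directed edges whose reverse edge also exists."""
    if not edges:
        return 0.0
    return sum((v, u) in edges for u, v in edges) / len(edges)


def avg_clustering(edges):
    """Average local clustering coefficient, ignoring edge direction."""
    nbrs = {}
    for u, v in edges:
        if u != v:
            nbrs.setdefault(u, set()).add(v)
            nbrs.setdefault(v, set()).add(u)
    coeffs = []
    for ns in nbrs.values():
        k = len(ns)
        if k < 2:
            continue  # coefficient is undefined with fewer than two neighbours
        links = sum(1 for x, y in combinations(ns, 2) if y in nbrs[x])
        coeffs.append(2 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs) if coeffs else 0.0


print(f"reciprocity: {reciprocity(edges):.2f}")
print(f"average clustering coefficient: {avg_clustering(edges):.2f}")
```

Some definitions count low-degree nodes as having coefficient 0 instead of skipping them, which lowers the average; comparisons across networks only make sense when the same convention is used for each.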


SLIDE 39

What is the impact of geography on the social relationships?

SLIDE 40

Geo-location Information

Question: is the geographical location of users an important factor in the formation of social links?

Extract GPS coordinates from map image

Retrieve country information

6,621,644 users with valid country information


SLIDE 41

Patterns Across Geo-locations – Average Path Miles

58% of friends were separated by less than a thousand miles
Physical distance has an influence on the intensity of the relationship
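Distances between pairs of users can be derived from their GPS coordinates with the haversine formula. This is a sketch under the assumption of a spherical Earth; the coordinates are approximate city locations chosen for illustration, and the slides do not say which distance formula the study used.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# Hypothetical friend pairs with approximate city coordinates.
belo_horizonte = (-19.92, -43.94)
sao_paulo = (-23.55, -46.63)
boston = (42.36, -71.06)
friend_pairs = [(belo_horizonte, sao_paulo), (belo_horizonte, boston)]

distances = [haversine_miles(*p, *q) for p, q in friend_pairs]
within_1000 = sum(d < 1000 for d in distances) / len(distances)
print(f"fraction of friend pairs within 1000 miles: {within_1000:.0%}")
```

Applying the same threshold count over all friend pairs with valid coordinates yields a statistic of the kind reported on the slide.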


SLIDE 42

Social Links Across Geography

Are users in the same country more likely to be friends than users in different countries?

The US dominates the influx of edges
Populous countries have more self-loops
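The country-level view can be built by collapsing the user graph into counts of edges between countries. In the sketch below a "self-loop" is an edge whose endpoints are in the same country, and the "influx" of a country is the number of edges arriving from other countries. The users, countries and edges are hypothetical.

```python
from collections import Counter

# Hypothetical user -> country mapping and directed follow edges (u follows v).
country = {"u1": "US", "u2": "US", "u3": "BR", "u4": "IN", "u5": "BR"}
edges = [("u3", "u1"), ("u4", "u1"), ("u5", "u3"), ("u1", "u2"), ("u4", "u2")]


def country_edges(edges, country):
    """Count edges per (source country, target country) pair."""
    return Counter(
        (country[u], country[v])
        for u, v in edges
        if u in country and v in country  # skip users without a valid country
    )


counts = country_edges(edges, country)

# Edges that stay inside one country (self-loops in the country graph).
self_loops = {src: n for (src, dst), n in counts.items() if src == dst}

# Edges arriving at each country from other countries.
incoming = Counter()
for (src, dst), n in counts.items():
    if src != dst:
        incoming[dst] += n

print("self-loops:", self_loops)
print("incoming cross-country edges:", dict(incoming))
```

Comparing self-loop counts against what a random assignment of countries would produce is one way to judge whether same-country links are over-represented.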

SLIDE 43

G+ Observations

Google+ is more social than Twitter

– Higher reciprocity
– Higher clustering coefficient
– Reflects offline relationship

Users exhibit different notions and expectations in Google+, based on geography

– Privacy
– Content
– Connections


SLIDE 44

Concluding Remarks

Big data has created new opportunities for scientific discoveries in the realm of social computing:

– user preference understanding
– data mining
– summarization and aggregation
– explorative analysis of large data sets
– privacy
– scalable services