Exploring Big Data in Social Networks virgilio@dcc.ufmg.br - - PowerPoint PPT Presentation




slide-1
SLIDE 1

Exploring Big Data in Social Networks

virgilio@dcc.ufmg.br (meira@dcc.ufmg.br)

INWEB – National Science and Technology Institute for Web Federal University of Minas Gerais - UFMG

May 2013

slide-2
SLIDE 2

Some thoughts about computing, future and innovation…

slide-3
SLIDE 3

What happens in 60 seconds on the Internet?

slide-4
SLIDE 4


Explosion of Web Data

slide-5
SLIDE 5


  • BIG DATA:
  • data collection,
  • storage,
  • management,
  • automated large-scale analysis

slide-6
SLIDE 6

Research interests

BIG DATA | Algorithms and MACHINE LEARNING | SOCIAL and ECONOMICS

  • characterization
  • models
  • incentives
  • privacy
  • network effects
  • crowdsourcing
  • anti-social behavior
  • spam and malware

  • algorithms around social networks

  • VERY large graphs
  • data mining
  • analytics
  • Systems
  • Infrastructure
  • cloud
  • characterization
slide-7
SLIDE 7

The fundamental challenge of Big Data is not collecting data -- it's making sense of it.

1) What is the starting point?
2) What are the computation paths to discovery?
3) What are the appropriate algorithms?
4) How to visualize the findings?

slide-8
SLIDE 8

Experimental Methodology: Measure → Analyze → Model → Synthesize

[Diagram labels: Models; Analysis; Validation; Observations; Artifacts; Algorithms; Distributions of Random Variables; Synthetic Workloads; Logs and Traces; "What if" questions]

slide-9
SLIDE 9

Challenges in Online Social Networking Research

  • Explosive growth in size, complexity, and unstructured data;
  • Enabled by various experimental methods: observational studies, simulations, ..., huge amounts of data;
  • It is “big data,” the vast sets of information gathered by researchers at companies like Facebook, Google and Microsoft from patterns of cellphone calls, text messages and Internet clicks by millions of users around the world. Companies often refuse to make such information public, sometimes for competitive reasons and sometimes to protect customers’ privacy. (New York Times, May 21)

slide-10
SLIDE 10

Enablers of Big Data

Hardware capability
  • Storage capacity
  • Network bandwidth
  • Exponentially increasing capability at constant cost
  • Processing capacity

Applications & Algorithms
  • Online social networking
  • Algorithmic breakthroughs: machine learning and data mining
  • Cloud: cost reductions and scalability improvements in computation
  • Sensors everywhere

slide-11
SLIDE 11

Price of 1 gigabyte of storage over time

Year    Cost
1981    $300,000
1987    $50,000
1990    $10,000
1994    $1,000
1997    $100
2000    $10
2004    $1
2012    $0.10
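
As a rough worked example, the first and last prices listed above can be turned into an average yearly rate of decline. This is only a sketch: it assumes a constant compound rate between 1981 and 2012, which the slide does not claim, and the function name is illustrative.

```python
import math

# Price of 1 GB of storage, copied from the slide (year -> US dollars).
PRICE_PER_GB = {
    1981: 300_000.0, 1987: 50_000.0, 1990: 10_000.0, 1994: 1_000.0,
    1997: 100.0, 2000: 10.0, 2004: 1.0, 2012: 0.10,
}

def annual_price_factor(prices, start, end):
    """Average yearly price multiplier, assuming a constant compound rate."""
    years = end - start
    return (prices[end] / prices[start]) ** (1.0 / years)

factor = annual_price_factor(PRICE_PER_GB, 1981, 2012)
# Number of years for the price to halve under the same assumption.
halving_years = math.log(0.5) / math.log(factor)

print(f"average yearly price multiplier: {factor:.3f}")
print(f"price halves roughly every {halving_years:.2f} years")
```

Under that assumption the price falls by a bit under 40% per year, which is one way to read "exponentially increasing capability at constant cost".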

slide-12
SLIDE 12

OSN Research Focus

1. Understand: characteristics of social graphs of real data;
2. Discover: properties of social graphs;
3. Engineer: social graph built.

slide-13
SLIDE 13

OSN research approach

  • Computational sociology: A natural sciences approach
    – Gather and analyze OSN data to study problems in sociology
  • Social computing: An engineering approach
    – Build systems that support / leverage human social interactions
    – Understand human behavior (as opposed to considering it annoying noise)
  • Inspired by sociological theories
slide-14
SLIDE 14
slide-15
SLIDE 15

The Atlantic


slide-16
SLIDE 16


slide-17
SLIDE 17

Understanding Factors that Affect Response Rates in Twitter(*)

  • Active users can receive ∼1000 tweets per day;
  • Approximately 36% of all tweets are worth reading, 39% are neutral and 25% are “junk”;
  • Interesting Questions
    – Do Twitter users receive more information than they are able to consume?
    – Is it possible to identify factors that affect interactions (replies and retweets)?

(*) ACM Hypertext 2012, joint work with Giovanni Comarela, Mark Crovella, F. Benevenuto

slide-18
SLIDE 18

Datasets: big data

  • Collected in August/September 2009, it contains the following information:
    – Users: 54,981,152
    – Tweets: 1,755,925,520 (almost a complete history)
    – Social Graph: 1,963,263,821 social links
  • It contains information related to Replies and Retweets (interactions)

slide-19
SLIDE 19

Characterization

  • Waiting Times (overload evidence)
    – How long does a tweet wait in the timeline before it is replied to (retweeted)?
  • Factors that affect interactions
    – Message Age
    – Previous Interactions
    – Sending Rate
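
A minimal sketch of how waiting times could be measured. The record layout, pairs of (posted_at, interacted_at) timestamps in seconds, is an assumption made for illustration; the slides do not describe the dataset's actual format, and the sample values are made up.

```python
from statistics import median

def waiting_times(interactions):
    """Seconds between a tweet being posted and the reply/retweet it received.

    `interactions` is a hypothetical list of (posted_at, interacted_at)
    timestamp pairs; pairs with an interaction before the post are skipped.
    """
    return [interacted - posted
            for posted, interacted in interactions
            if interacted >= posted]

def fraction_within(times, limit_seconds):
    """Fraction of interactions that happened within `limit_seconds`."""
    if not times:
        return 0.0
    return sum(t <= limit_seconds for t in times) / len(times)

sample = [(0, 30), (100, 160), (200, 4000), (500, 520)]  # made-up data
times = waiting_times(sample)
print("median waiting time (s):", median(times))
print("fraction answered within 60 s:", fraction_within(times, 60))
```

A distribution concentrated on short waiting times would be consistent with the overload hypothesis: tweets that are not handled soon after they arrive are unlikely to be handled at all.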

slide-20
SLIDE 20

Waiting Times

slide-21
SLIDE 21

Message Age

slide-22
SLIDE 22

Previous interaction

  • Are users who were previously replied to (retweeted) more likely to be replied to (retweeted) again?
  • We computed for each user i the conditional probability that a message m will be replied to (retweeted) by i, given that i has replied to (retweeted) the sender of m before;
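
The conditional probability described here can be estimated by counting. The sketch below assumes a hypothetical time-ordered list of (sender, replied) pairs for the messages received by one user i; it also returns the probability without a prior interaction as a baseline, which is an addition for comparison and not something the slide states.

```python
def reply_probabilities(timeline):
    """Estimate P(reply | prior interaction) and P(reply | no prior interaction).

    `timeline` is a hypothetical time-ordered list of (sender, replied)
    pairs for the messages that one user i received.
    """
    replied_before = set()                  # senders i has already replied to
    counts = {True: [0, 0], False: [0, 0]}  # prior? -> [replies, messages]
    for sender, replied in timeline:
        prior = sender in replied_before
        counts[prior][1] += 1
        if replied:
            counts[prior][0] += 1
            replied_before.add(sender)

    def ratio(pair):
        replies, total = pair
        return replies / total if total else None

    return ratio(counts[True]), ratio(counts[False])

sample = [("a", True), ("a", True), ("b", False), ("a", False),
          ("b", True), ("b", True), ("c", False)]  # made-up data
with_prior, without_prior = reply_probabilities(sample)
print(with_prior, without_prior)
```

The same counting works for retweets by swapping the meaning of the boolean flag.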

slide-23
SLIDE 23

Sending rate

  • Are users with a higher sending rate more likely to be replied to (retweeted)?
  • For each user i, for each j ∈ Out_i, we compared the sending rate of j with the fraction of her tweets replied to (retweeted) by i.
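
A small sketch of this comparison for a single user i. The input mapping, the per-day definition of the sending rate, and the sample counts are all assumptions for illustration; the slides do not say how the rate was measured.

```python
def rate_vs_response(received, observation_days):
    """Pair each followee's sending rate with user i's response fraction.

    `received` maps each j in Out_i to a tuple
    (tweets_sent_by_j, tweets_of_j_replied_to_by_i). Both the mapping and
    the tweets-per-day rate are hypothetical choices.
    """
    pairs = []
    for j, (sent, replied) in received.items():
        if sent == 0:
            continue  # no tweets, so there is no response fraction
        pairs.append((j, sent / observation_days, replied / sent))
    # Sort by sending rate so any trend is easy to read off.
    return sorted(pairs, key=lambda p: p[1])

sample = {"j1": (300, 3), "j2": (30, 6), "j3": (0, 0)}  # made-up counts
result = rate_vs_response(sample, observation_days=30)
for j, rate, fraction in result:
    print(j, rate, fraction)
```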

slide-24
SLIDE 24

Reorganizing the Twitter Timeline

  • Use the knowledge presented to create a new way to show tweets to users
  • More interesting tweets (more likely to be replied to or retweeted) at the top of the timeline.
  • Two schemes
    – Naive Bayes (NB)
    – Support Vector Machine (SVM)
  • Three attributes
    – Age(m): Age of m
    – SR(m): Sending rate of the sender of m
    – I(m): Binary indicator for previous interactions with the sender of m
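
To make the Naive Bayes scheme concrete, here is a minimal pure-Python sketch that ranks tweets by their estimated probability of interaction. It is not the authors' implementation: the discretization thresholds, the add-one smoothing, the tuple layout (id, age, sending rate, indicator) and the training data are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def discretize(age_minutes, sending_rate, interacted):
    """Map Age(m), SR(m), I(m) to categorical features (thresholds are made up)."""
    return (
        "new" if age_minutes < 60 else "old",
        "low" if sending_rate < 10 else "high",
        bool(interacted),
    )

class NaiveBayes:
    """Categorical Naive Bayes with add-one smoothing for a binary label."""

    def fit(self, rows, labels):
        self.class_counts = defaultdict(int)
        self.feature_counts = defaultdict(int)  # (label, position, value) -> count
        self.values = defaultdict(set)          # position -> values seen in training
        for row, label in zip(rows, labels):
            self.class_counts[label] += 1
            for position, value in enumerate(row):
                self.feature_counts[(label, position, value)] += 1
                self.values[position].add(value)
        return self

    def log_score(self, row, label):
        """log P(label) + sum of log P(feature | label); needs both labels in training."""
        total = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total)
        for position, value in enumerate(row):
            num = self.feature_counts[(label, position, value)] + 1
            den = self.class_counts[label] + len(self.values[position])
            score += math.log(num / den)
        return score

    def prob_interaction(self, row):
        yes, no = self.log_score(row, True), self.log_score(row, False)
        return 1.0 / (1.0 + math.exp(no - yes))

def rank_timeline(model, tweets):
    """Sort (id, age, rate, indicator) tuples, most likely interactions first."""
    return sorted(tweets,
                  key=lambda t: model.prob_interaction(discretize(*t[1:])),
                  reverse=True)

# Made-up training data: (age in minutes, sending rate, I(m)) and whether i interacted.
train = [(5, 2, 1), (10, 3, 1), (300, 50, 0), (200, 40, 0), (20, 60, 0), (500, 1, 1)]
labels = [True, True, False, False, False, True]
model = NaiveBayes().fit([discretize(*row) for row in train], labels)

timeline = [("t1", 400, 80, 0), ("t2", 3, 1, 1)]
ranked = rank_timeline(model, timeline)
print([tweet_id for tweet_id, *_ in ranked])
```

An SVM-based scheme would replace only the scoring function; the ranking step stays the same.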

slide-25
SLIDE 25

Results

slide-26
SLIDE 26

Google+

New Kid on the Block: Exploring the Google+ Social Graph, ACM Internet Measurement Conference, Sigcomm, 2012, Boston
Joint work with: G. Magno, G. Comarela, D. Saez and Meeyong Cha.

slide-27
SLIDE 27

Online Social Networks

  • OSNs now reach 82% of the world’s Internet-using population (1.2 billion)
  • Social Networking accounts for 19% of all time spent online
  • Social Networking is the most popular online activity worldwide

Source: comScore, December 21, 2011

slide-28
SLIDE 28

Google+ Growth

Google+ is the fastest growing OSN

[Figure: number of users versus days]

slide-29
SLIDE 29

Goal: characterization

  • Analyze how much and what kind of personal information people share in Google+
  • Measure statistics of the Google+ social graph and compare with other OSNs
  • Evaluate the impact of geography on user behavior in Google+

slide-30
SLIDE 30

Dataset: big data

  • Nov. 11th to Dec. 27th (2011)
  • 27,556,390 profiles
  • 35,114,957 nodes
  • 575,141,097 edges

slide-31
SLIDE 31

What kind of information do people share more?

slide-32
SLIDE 32

Privacy Concerns

  • Users who reveal more information on their profiles face greater privacy risk
  • In Facebook (young users, to friends)¹:
    – 64.1% share e-mail
    – 10.7% share telephone
    – 10.7% share home address

slide-33
SLIDE 33

What kind of information do people share more?

  • In Google+ (public):
    – 0.22% share Work contact
    – 0.21% share Home contact
    – 0.26% share telephone numbers (72,736 users)
  • Users that shared telephone: tel-users

slide-34
SLIDE 34

Number of fields shared in profile


Tel-users share more information

slide-35
SLIDE 35

Information shared by users

  • Women are less likely to share their phone number
  • The majority of tel-users are single; a smaller fraction of them are in a relationship.
  • The fraction of Indian users in the tel-users group is twice as big as in other countries

slide-36
SLIDE 36

How are people connected on Google+?

slide-37
SLIDE 37

Structural Characteristics of Social Graphs

  • New network → Lower number of friends
  • Higher reciprocity = More social
  • “Hidden” edges → Higher avg. path length
  • Diameter similar to Twitter, lower than Facebook
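
Reciprocity can be illustrated on a small directed graph. The sketch uses one common definition, the fraction of directed edges whose reverse edge also exists; the slides do not state which definition the study used, so treat this as an assumption.

```python
def reciprocity(edges):
    """Fraction of directed edges (u, v) whose reverse edge (v, u) also exists."""
    edge_set = {(u, v) for u, v in edges if u != v}  # drop duplicates and self-links
    if not edge_set:
        return 0.0
    mutual = sum((v, u) in edge_set for u, v in edge_set)
    return mutual / len(edge_set)

sample = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "d")]  # made-up graph
print(reciprocity(sample))
```

In the sample, two of the four edges form a mutual pair, so the value is 0.5.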

slide-38
SLIDE 38

Structural Characteristics – Clustering Coefficient

Higher Clustering Coefficient than Twitter
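
For reference, a minimal sketch of the local clustering coefficient on an undirected graph stored as adjacency sets. Definitions for directed graphs such as Google+ or Twitter vary, and the slides do not say which variant was used, so the undirected form here is an assumption.

```python
from itertools import combinations

def local_clustering(adjacency, node):
    """Fraction of pairs of `node`'s neighbours that are themselves connected."""
    neighbours = adjacency[node] - {node}
    k = len(neighbours)
    if k < 2:
        return 0.0  # fewer than two neighbours: no pairs to check
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adjacency[u])
    return 2.0 * links / (k * (k - 1))

def average_clustering(adjacency):
    """Mean of the local clustering coefficient over all nodes."""
    return sum(local_clustering(adjacency, n) for n in adjacency) / len(adjacency)

# Made-up undirected graph: a triangle a-b-c plus a pendant node d attached to a.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
print(local_clustering(graph, "a"), average_clustering(graph))
```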

slide-39
SLIDE 39

What is the impact of geography on the social relationships?
slide-40
SLIDE 40

Geo-location Information

  • Question: is the geographical location of users an important factor in the formation of social links?
  • Extract GPS coordinates from map image
  • Retrieve country information
  • 6,621,644 users with valid country information

slide-41
SLIDE 41

Patterns Across Geo-locations – Average Path Miles

  • 58% of friends were separated by less than a thousand miles
  • Physical distance influences the intensity of the relationship
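
A sketch of how the distance between two connected users could be measured from their coordinates. The slides do not state the distance formula; the great-circle (haversine) distance, the Earth radius constant, and the approximate sample coordinates are assumptions for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, a standard approximation

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (latitude, longitude) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def fraction_of_friends_within(pairs, limit_miles=1000.0):
    """Fraction of friend pairs separated by less than `limit_miles`."""
    if not pairs:
        return 0.0
    close = sum(haversine_miles(*a, *b) < limit_miles for a, b in pairs)
    return close / len(pairs)

# Approximate coordinates, used only as sample input.
belo_horizonte = (-19.92, -43.94)
sao_paulo = (-23.55, -46.63)
boston = (42.36, -71.06)
friend_pairs = [(belo_horizonte, sao_paulo), (belo_horizonte, boston)]
print(fraction_of_friends_within(friend_pairs))
```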

slide-42
SLIDE 42

Social Links Across Geography

Are users in the same country more likely to be friends than users in different countries?

  • US is dominant in the influx of edges
  • Populous countries have more self-loops
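
The country-level view can be sketched by aggregating user-level edges into country pairs. The input structures and sample values are hypothetical; a "self-loop" here means an edge whose two endpoints are in the same country.

```python
from collections import Counter

def country_edge_counts(edges, country_of):
    """Count directed edges between countries, skipping users with no country."""
    counts = Counter()
    for u, v in edges:
        if u in country_of and v in country_of:
            counts[(country_of[u], country_of[v])] += 1
    return counts

def self_loop_fraction(counts, country):
    """Share of a country's outgoing edges that stay inside the country."""
    outgoing = sum(n for (src, _), n in counts.items() if src == country)
    return counts[(country, country)] / outgoing if outgoing else 0.0

def influx(counts, country):
    """Number of edges arriving in `country` from other countries."""
    return sum(n for (src, dst), n in counts.items()
               if dst == country and src != country)

# Made-up users and edges.
country_of = {"u1": "US", "u2": "US", "u3": "BR", "u4": "IN"}
edges = [("u1", "u2"), ("u3", "u1"), ("u4", "u2"), ("u3", "u4"), ("u2", "u1")]
counts = country_edge_counts(edges, country_of)
print(self_loop_fraction(counts, "US"), influx(counts, "US"))
```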

slide-43
SLIDE 43

G+ Observations

  • Google+ is more social than Twitter
    – Higher reciprocity
    – Higher clustering coefficient
    – Reflects offline relationship
  • Users exhibit different notions and expectations in Google+, based on geography
    – Privacy
    – Content
    – Connections

slide-44
SLIDE 44

Concluding Remarks

  • Big data has created new opportunities for scientific discoveries in the realm of social computing:
    – user preference understanding
    – data mining
    – summarization and aggregation
    – explorative analysis of large data sets
    – privacy
    – scalable services