SLIDE 1
Exploring Big Data in Social Networks
virgilio@dcc.ufmg.br (meira@dcc.ufmg.br)
INWEB National Science and Technology Institute for Web
Federal University of Minas Gerais - UFMG
May 2013
Some thoughts about computing, future and
SLIDE 2
SLIDE 3
What happens in 60 seconds on the Internet?
SLIDE 4
Explosion of Web Data
SLIDE 5
- BIG DATA:
- data collection,
- storage,
- management,
- automated large-scale analysis
SLIDE 6
Research interests
BIG DATA
- Systems
- Infrastructure
- cloud
- characterization
Algorithms and MACHINE LEARNING
- algorithms around social networks
- VERY large graphs
- data mining
- analytics
SOCIAL and ECONOMICS
- characterization
- models
- incentives
- privacy
- network effects
- crowdsourcing
- anti-social behavior
- spam and malware
SLIDE 7
The fundamental challenge of Big Data is not collecting data -- it's making sense of it.
1) What is the starting point?
2) What are the computation paths to discovery?
3) What are the appropriate algorithms?
4) How to visualize the findings?
SLIDE 8
Experimental Methodology
[Diagram: Measure, Analyze, Model, Synthesize; labels: Models, Analysis, Validation, Observations, Artifacts, Algorithms, Distributions of Random Variables, Synthetic Workloads, Logs and Traces]
What if questions:
SLIDE 9
Challenges in Online Social Networking Research
- Explosive growth in size, complexity, and unstructured data;
- Enabled by various experimental methods: observational studies, simulations,..., huge amount of data;
- It is “big data,” the vast sets of information gathered by researchers at companies like Facebook, Google and Microsoft from patterns of cellphone calls, text messages and Internet clicks by millions of users around the world. Companies often refuse to make such information public, sometimes for competitive reasons and sometimes to protect customers’ privacy. (New York Times, May 21)
SLIDE 10
Enablers of Big Data
Hardware capability
- Storage capacity
- Network bandwidth
- Exponentially increasing capability at constant cost
- Processing capacity
Applications & Algorithms
- Online social networking
- Algorithmic breakthroughs: machine learning and data mining
- Cloud: Cost reductions and scalability improvements in computation
- Sensors everywhere
SLIDE 11
Price of 1 gigabyte of storage over time
Year    Cost
1981    $300,000
1987    $50,000
1990    $10,000
1994    $1,000
1997    $100
2000    $10
2004    $1
2012    $0.10
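As a rough illustration of the price table, the sketch below computes the compound yearly change implied by its first and last rows. The function name `annual_change` and the choice of a constant-rate fit between two endpoints are assumptions made for illustration; they are not part of the original slides.

```python
# Year -> cost in US dollars of 1 gigabyte of storage, copied from the slide.
PRICE_PER_GB = {
    1981: 300_000, 1987: 50_000, 1990: 10_000, 1994: 1_000,
    1997: 100, 2000: 10, 2004: 1, 2012: 0.10,
}

def annual_change(prices, start, end):
    """Constant yearly rate that turns prices[start] into prices[end]."""
    years = end - start
    return (prices[end] / prices[start]) ** (1 / years) - 1

rate = annual_change(PRICE_PER_GB, 1981, 2012)
print(f"average change per year, 1981-2012: {rate:.1%}")
```

Fitting a single rate between two endpoints hides the uneven pace visible in the table, so the result is only a summary figure.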
SLIDE 12
OSN Research Focus
1. Understand: characteristics of social graphs of real data;
2. Discover: properties of social graphs;
3. Engineer: social graph built.
SLIDE 13
OSN research approach
- Computational sociology: A natural sciences approach
– Gather and analyze OSN data to study problems in sociology
- Social computing: An engineering approach
– Build systems that support / leverage human social interactions
– Understand human behavior (as opposed to treating it as annoying noise)
- Inspired by sociological theories
SLIDE 14
SLIDE 15
The Atlantic
SLIDE 16
SLIDE 17
Understanding Factors that Affect Response Rates in Twitter(*)
- Active users can receive ∼1000 tweets per day;
- Approximately 36% of all tweets are worth reading, 39% are neutral and 25% are “junk”;
- Interesting Questions
– Do Twitter users receive more information than they are able to consume?
– Is it possible to identify factors that affect interactions (replies and retweets)?
(*) ACM Hypertext 2012, joint work with Giovanni Comarela, Mark Crovella, F. Benevenuto
SLIDE 18
Datasets: big data
- Collected in August/September 2009, it contains the following information:
– Users: 54,981,152
– Tweets: 1,755,925,520 (almost a complete history)
– Social Graph: 1,963,263,821 social links
- It contains information related to Replies and Retweets (interactions)
SLIDE 19
Characterization
- Waiting Times (overload evidence)
– How long does a tweet wait in the timeline before it is replied to (retweeted)?
- Factors that affect interactions
– Message Age
– Previous Interactions
– Sending Rate
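A minimal sketch of how the waiting time of a tweet could be measured, assuming each tweet has a posting timestamp and each reply or retweet has its own timestamp, both in seconds. The input layout and the function names are hypothetical; the slides do not describe the data format.

```python
def waiting_times(posted_at, interactions):
    """Seconds between a tweet being posted and each reply/retweet to it.

    posted_at: dict mapping tweet id -> posting time in seconds.
    interactions: iterable of (tweet id, interaction time in seconds).
    Interactions with tweets missing from posted_at are skipped.
    """
    return sorted(t - posted_at[tweet] for tweet, t in interactions if tweet in posted_at)

def fraction_within(waits, limit):
    """Empirical CDF value: share of waiting times no longer than `limit` seconds."""
    return sum(w <= limit for w in waits) / len(waits) if waits else 0.0

waits = waiting_times({1: 0, 2: 100}, [(1, 30), (2, 4100), (3, 5)])
print(waits, fraction_within(waits, 3600))
```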
SLIDE 20
Waiting Times
SLIDE 21
Message Age
SLIDE 22
Previous interaction
- Are users who were previously replied to (retweeted) more likely to be replied to (retweeted) again?
- We computed for each user i the conditional probability that a message m will be replied to (retweeted) by i given that i has previously replied to (retweeted) the sender of m;
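The conditional probability described here can be sketched for a single user i as follows. The input is assumed to be i's timeline in chronological order, as (sender, interacted) pairs; that layout and the function name are illustrative assumptions, not the authors' code.

```python
def interaction_probabilities(timeline):
    """Return (P(interact | prior interaction), P(interact | no prior interaction)).

    timeline: chronological list of (sender, interacted) pairs for one user i,
    where `interacted` is True if i replied to (or retweeted) that message.
    A probability is None when its condition never occurs.
    """
    interacted_before = set()
    # prior interaction? -> [interactions, messages]
    counts = {True: [0, 0], False: [0, 0]}
    for sender, interacted in timeline:
        prior = sender in interacted_before
        counts[prior][1] += 1
        if interacted:
            counts[prior][0] += 1
            interacted_before.add(sender)

    def ratio(hits, total):
        return hits / total if total else None

    return ratio(*counts[True]), ratio(*counts[False])

timeline = [("a", True), ("a", True), ("a", False), ("b", False), ("b", False)]
with_prior, without_prior = interaction_probabilities(timeline)
```

Comparing the two values across many users is one way to see whether a previous interaction makes a new one more likely.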
SLIDE 23
Sending rate
- Are users with a higher sending rate more likely to be replied to (retweeted)?
- For each user i, for each j ∈ Out_i, we compared the sending rate of j with the fraction of her tweets replied to (retweeted) by i.
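The comparison can be sketched by pairing, for one user i, each sender's rate with the share of that sender's tweets i interacted with. The set the slide calls Out_i is assumed here to be the senders whose tweets reach i, and all three input dictionaries are hypothetical.

```python
def rate_vs_response(tweets_per_day, received, answered):
    """Pair each sender's rate with the fraction of their tweets user i answered.

    tweets_per_day: sender j -> sending rate of j.
    received: sender j -> number of j's tweets that reached user i.
    answered: sender j -> number of those tweets i replied to or retweeted.
    Senders with no received tweets are skipped.
    """
    pairs = []
    for j, rate in tweets_per_day.items():
        total = received.get(j, 0)
        if total == 0:
            continue
        pairs.append((rate, answered.get(j, 0) / total))
    return sorted(pairs)

pairs = rate_vs_response({"a": 2, "b": 40, "c": 5}, {"a": 10, "b": 200}, {"a": 3, "b": 4})
print(pairs)
```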
SLIDE 24
Reorganizing the Twitter Timeline
- Use the knowledge presented to create a new way to show tweets to users
- More interesting tweets (more likely to be replied to or retweeted) at the top of the timeline.
- Two schemes
– Naive Bayes (NB)
– Support Vector Machine (SVM)
- Three attributes
– Age(m): Age of m
– SR(m): Sending rate of the sender of m
– I(m): Binary indicator for previous interactions with the sender of m
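To make the Naive Bayes scheme concrete, here is a small sketch that ranks messages by the estimated probability of a reply or retweet, using the three attributes Age(m), SR(m) and I(m). The slides do not say how the attributes were modeled, so the bucketing of Age and SR, the cut points, the Laplace smoothing and every name below are assumptions for illustration only.

```python
import math
from collections import defaultdict

def bucket(value, edges):
    """Index of the first edge that `value` falls below (len(edges) if none)."""
    for k, edge in enumerate(edges):
        if value < edge:
            return k
    return len(edges)

class TimelineNB:
    AGE_EDGES = (60, 3600, 86400)  # seconds; illustrative cut points
    SR_EDGES = (1, 10, 50)         # tweets per day; illustrative cut points

    def __init__(self):
        self.class_counts = defaultdict(int)
        self.feature_counts = defaultdict(int)  # (label, feature index, value) -> count

    def _features(self, age, sr, interacted):
        return (bucket(age, self.AGE_EDGES), bucket(sr, self.SR_EDGES), int(interacted))

    def fit(self, rows):
        """rows: iterable of (age, sr, interacted, label), label 1 = replied/retweeted."""
        for age, sr, interacted, label in rows:
            self.class_counts[label] += 1
            for idx, value in enumerate(self._features(age, sr, interacted)):
                self.feature_counts[(label, idx, value)] += 1

    def score(self, age, sr, interacted):
        """Estimated probability that the message gets a reply or retweet."""
        sizes = (len(self.AGE_EDGES) + 1, len(self.SR_EDGES) + 1, 2)
        total = sum(self.class_counts.values())
        log_post = {}
        for label in (0, 1):
            n = self.class_counts[label]
            lp = math.log((n + 1) / (total + 2))  # smoothed class prior
            for idx, value in enumerate(self._features(age, sr, interacted)):
                count = self.feature_counts[(label, idx, value)]
                lp += math.log((count + 1) / (n + sizes[idx]))  # Laplace smoothing
            log_post[label] = lp
        top = max(log_post.values())
        weights = {k: math.exp(v - top) for k, v in log_post.items()}
        return weights[1] / (weights[0] + weights[1])

def rank_timeline(model, messages):
    """Sort (age, sr, interacted) tuples so the likeliest interactions come first."""
    return sorted(messages, key=lambda m: model.score(*m), reverse=True)

model = TimelineNB()
model.fit([(30, 5, True, 1), (40, 5, True, 1), (100000, 5, False, 0), (90000, 5, False, 0)])
ranked = rank_timeline(model, [(100000, 5, False), (30, 5, True)])
```

An SVM variant would use the same three attributes as inputs; only the classifier changes.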
SLIDE 25
Results
SLIDE 26
Google+
New Kid on the Block: Exploring the Google+ Social Graph, ACM Internet Measurement Conference, Sigcomm, 2012, Boston
Joint work with: G. Magno, G. Comarela, D. Saez and Meeyong Cha.
SLIDE 27
Online Social Networks
- OSNs now reach 82% of the world’s Internet-using population (1.2 billion)
- Social Networking accounts for 19% of all time spent online
- Social Networking is the most popular online activity worldwide
Source: comScore, December 21, 2011
SLIDE 28
Google+ Growth
Google+ is the fastest growing OSN
[Chart axes: Days, # users]
SLIDE 29
Goal: characterization
- Analyze how much and what kind of personal information people share in Google+
- Measure statistics of the Google+ social graph and compare with other OSNs
- Evaluate the impact of geography on user behavior in Google+
SLIDE 30
Dataset: big data
- Nov. 11th to Dec. 27th (2011)
- 27,556,390 profiles
- 35,114,957 nodes
- 575,141,097 edges
SLIDE 31
What kind of information do people share more?
SLIDE 32
Privacy Concerns
- Users revealing more information on their profiles face greater privacy risk
- In Facebook (young users, to friends)¹:
– 64.1% share e-mail
– 10.7% share telephone
– 10.7% share home address
SLIDE 33
What kind of information do people share more?
- In Google+ (public):
– 0.22% share Work contact
– 0.21% share Home contact
– 0.26% share telephone numbers (72,736 users)
- Users that shared telephone: tel-users
SLIDE 34
Number of fields shared in profile
Tel-users share more information
SLIDE 35
Information shared by users
Women are less likely to share a phone number
The majority of tel-users are single; a smaller fraction of them are in a relationship.
Fraction of Indian users in the tel-users group is twice as big as in other countries
SLIDE 36
How are people connected on Google+?
SLIDE 37
Structural Characteristics of Social Graphs
- New network
- Lower number of friends
- Higher reciprocity = More social
- “Hidden” edges
- Higher avg. path length
- Diameter similar to Twitter, lower than Facebook
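Reciprocity is used here as a sign of how social a network is. A minimal sketch of one common definition, the fraction of directed edges whose reverse edge also exists, is given below; the slides do not state which definition was used, so treat this as an assumption.

```python
def reciprocity(edges):
    """Fraction of directed edges (u, v) whose reverse edge (v, u) is also present.

    edges: iterable of (source, target) pairs; self-loops and duplicates are ignored.
    """
    edge_set = {(u, v) for u, v in edges if u != v}
    if not edge_set:
        return 0.0
    mutual = sum((v, u) in edge_set for u, v in edge_set)
    return mutual / len(edge_set)

print(reciprocity([("a", "b"), ("b", "a"), ("a", "c")]))
```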
SLIDE 38
Structural Characteristics – Clust. Coef.
Higher Clustering Coefficient than Twitter
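For reference, the sketch below computes an average local clustering coefficient. It treats every edge as undirected, which is a simplifying assumption; directed graphs such as Google+ or Twitter admit several variants, and the slides do not say which one was measured.

```python
from itertools import combinations

def average_clustering(edges):
    """Mean local clustering coefficient over all nodes, ignoring edge direction.

    For a node with k >= 2 neighbors, the local coefficient is the number of
    links among its neighbors divided by k * (k - 1) / 2. Nodes with fewer
    than two neighbors contribute 0.
    """
    neighbors = {}
    for u, v in edges:
        if u == v:
            continue  # skip self-loops
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    if not neighbors:
        return 0.0
    total = 0.0
    for nbrs in neighbors.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in neighbors[a])
        total += 2 * links / (k * (k - 1))
    return total / len(neighbors)

triangle = [("a", "b"), ("b", "c"), ("a", "c")]
print(average_clustering(triangle))
```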
SLIDE 39
What is the impact of geography on the social relationships?
SLIDE 40
Geo-location Information
- Question: is the geographical location of users an important factor in the formation of social links?
- Extract GPS coordinates from map image
- Retrieve country information
- 6,621,644 users with valid country information
SLIDE 41
Patterns Across Geo-locations – Average Path Miles
58% of friends were separated by less than a thousand miles
Physical distance influences the intensity of the relationship
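A statistic such as the share of friends within a thousand miles needs a distance between two GPS coordinates. The sketch below uses the haversine great-circle formula, a standard choice that the slides do not confirm was used; the function names and the input layout are assumptions.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(min(1.0, math.sqrt(a)))

def share_within(pairs, limit_miles=1000):
    """Fraction of friend pairs ((lat, lon), (lat, lon)) closer than `limit_miles`."""
    if not pairs:
        return 0.0
    close = sum(haversine_miles(*a, *b) < limit_miles for a, b in pairs)
    return close / len(pairs)

friends = [((0.0, 0.0), (0.0, 1.0)), ((0.0, 0.0), (0.0, 90.0))]
print(share_within(friends))
```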
SLIDE 42
Social Links Across Geography
Are users in the same country more likely to be friends than users in different countries?
The US dominates the influx of edges
Populous countries have more self-loops
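One way to examine links across countries is to count edges per (source country, target country) pair, so that edges staying inside one country show up as the self-loops mentioned on the slide. This is a sketch under that reading; the input mapping `country_of` and both function names are hypothetical.

```python
from collections import Counter

def country_edge_counts(edges, country_of):
    """Count edges per (source country, target country) pair.

    edges: iterable of (source user, target user).
    country_of: user -> country; edges with an unknown endpoint are skipped.
    """
    counts = Counter()
    for u, v in edges:
        if u in country_of and v in country_of:
            counts[(country_of[u], country_of[v])] += 1
    return counts

def domestic_share(counts, country):
    """Share of edges leaving `country` that also end in it (a country-level self-loop)."""
    outgoing = sum(n for (src, _), n in counts.items() if src == country)
    return counts[(country, country)] / outgoing if outgoing else 0.0

country_of = {"u1": "BR", "u2": "BR", "u3": "US"}
counts = country_edge_counts([("u1", "u2"), ("u1", "u3"), ("u3", "u9")], country_of)
print(counts, domestic_share(counts, "BR"))
```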
SLIDE 43
G+ Observations
- Google+ is more social than Twitter
– Higher reciprocity
– Higher clustering coefficient
– Reflects offline relationship
- Users exhibit different notions and expectations in Google+, based on geography
– Privacy
– Content
– Connections
SLIDE 44
Concluding Remarks
- Big data has created new opportunities for