Welcome to DS504/CS586: Big Data Analytics
Graph Mining
Prof. Yanhua Li
Time: 6:00pm–8:50pm R, Location: AK232, Fall 2016
Mining of Massive Datasets, http://www.mmds.org

Graph Data: Social Networks
Facebook social graph
4-degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011]
Connections between political blogs
Polarization of the network [Adamic-Glance, 2005]
Citation networks and Maps of science
[Börner et al., 2012]
[Figure labels: domain1, domain2, domain3, router]
Seven Bridges of Königsberg
[Euler, 1735]
Return to the starting point by traveling each link of the graph once and only once.
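Euler's answer to this puzzle is that such a closed walk exists only if the graph is connected and every node has even degree. The sketch below checks that condition on the standard Königsberg multigraph. The land-mass labels A, B, C, D and the helper names are illustrative assumptions, not part of the slides.

```python
from collections import Counter

# Seven bridges as a multigraph edge list. The labels are illustrative:
# A = north bank, B = south bank, C = central island, D = east island.
BRIDGES = [("A", "C"), ("A", "C"), ("B", "C"), ("B", "C"),
           ("A", "D"), ("B", "D"), ("C", "D")]


def degrees(edges):
    """Count how many edge endpoints touch each node."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def is_connected(edges):
    """Check that every node with an edge is reachable from any other."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:
        for nxt in adj[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == len(adj)


def has_euler_circuit(edges):
    """Euler's condition: connected and every degree even."""
    return is_connected(edges) and all(d % 2 == 0 for d in degrees(edges).values())


print(dict(degrees(BRIDGES)))      # every land mass has odd degree
print(has_euler_circuit(BRIDGES))  # False
```

All four land masses have odd degree, so no walk can cross each bridge exactly once and return to its start.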
Link types: undirected links, directed links, signed links, multi-relational links, hyperlinks, …
Examples: friendship, co-authorship, following, one-way road, resistance, wireless channel, trust & distrust, repulsion & cohesion, friend & foe, …

v Network Statistic Analysis (this lecture)
§ Network Size
§ Degree distribution
v Node Ranking (Next lecture)
Random sampling (uniform & independent)
} vertex sampling
} BFS sampling
} random walk sampling
} edge sampling

Random walk: random walk sampling, random walk routing, influence diffusion, molecule in liquid
Undirected !!
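v Adjacency matrix
v Transition Probability Matrix
v |E|: number of links
v Stationary Distribution
The graph is undirected, so A is symmetric:
$$D = \begin{pmatrix} 3 & & & \\ & 2 & & \\ & & 3 & \\ & & & 2 \end{pmatrix}, \qquad A = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix}$$
$$P = D^{-1} A = \begin{pmatrix} 0 & 1/3 & 1/3 & 1/3 \\ 1/2 & 0 & 1/2 & 0 \\ 1/3 & 1/3 & 0 & 1/3 \\ 1/2 & 0 & 1/2 & 0 \end{pmatrix}, \qquad P_{ij} = \frac{1}{k_i} \ \text{if } A_{ij} = 1$$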
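A small numerical check may help connect these quantities. The sketch below assumes the 4-node simple graph implied by the degree sequence (3, 2, 3, 2), row-normalizes A so that P[i][j] = 1/k_i for each neighbor j, and verifies that π_i = k_i / (2|E|) is unchanged by one step of the walk. Variable names are illustrative.

```python
from fractions import Fraction

# Adjacency matrix of the 4-node undirected example (degrees 3, 2, 3, 2).
A = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
]

degree = [sum(row) for row in A]   # k_i, the diagonal of D
num_edges = sum(degree) // 2       # |E|; each link is counted twice in A

# Row-normalize A: P[i][j] = A[i][j] / k_i
P = [[Fraction(a, k) for a in row] for row, k in zip(A, degree)]

# Candidate stationary distribution pi_i = k_i / (2|E|)
pi = [Fraction(k, 2 * num_edges) for k in degree]

# One step of the walk applied to pi: (pi P)_j = sum_i pi_i * P[i][j]
pi_next = [sum(pi[i] * P[i][j] for i in range(len(A))) for j in range(len(A))]

print(degree, num_edges)   # [3, 2, 3, 2] 5
print(pi_next == pi)       # True
```

Exact fractions are used instead of floats so the equality test is not affected by rounding.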
v Adjacency matrix
v Transition Probability Matrix
v |E|: number of links
v Stationary Distribution
Metropolis-Hastings walk on the same undirected graph ($D = \mathrm{diag}(3, 2, 3, 2)$, A symmetric):
$$P^{MH} = \begin{pmatrix} 0 & 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 & 0 \\ 1/3 & 1/3 & 0 & 1/3 \\ 1/3 & 0 & 1/3 & 1/3 \end{pmatrix}$$
$$P^{MH}_{\upsilon,w} = \begin{cases} \dfrac{1}{k_\upsilon} \min\left(1, \dfrac{k_\upsilon}{k_w}\right) & \text{if } w \text{ is a neighbor of } \upsilon \\[2ex] 1 - \sum_{y \neq \upsilon} P^{MH}_{\upsilon,y} & \text{if } w = \upsilon \end{cases}$$
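As a rough check of the Metropolis-Hastings rule, the sketch below builds the transition matrix for a small graph from its adjacency lists. The graph (degrees 3, 2, 3, 2), the node labels 0..3, and the function name `mh_transition_matrix` are assumptions made for illustration.

```python
from fractions import Fraction

# Adjacency lists for a 4-node example with degrees 3, 2, 3, 2
# (node labels 0..3 are illustrative).
GRAPH = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}


def mh_transition_matrix(graph):
    """Build P^MH: move to neighbor w with prob (1/k_v) * min(1, k_v/k_w)."""
    nodes = sorted(graph)
    P = {v: {w: Fraction(0) for w in nodes} for v in nodes}
    for v in nodes:
        k_v = len(graph[v])
        for w in graph[v]:
            k_w = len(graph[w])
            P[v][w] = Fraction(1, k_v) * min(Fraction(1), Fraction(k_v, k_w))
        # The remaining probability mass is the self-loop (stay at v).
        P[v][v] = 1 - sum(P[v][y] for y in nodes if y != v)
    return P


P = mh_transition_matrix(GRAPH)
for v in sorted(P):
    print([str(P[v][w]) for w in sorted(P)])

# The uniform distribution is stationary: every column of P sums to 1.
print(all(sum(P[v][w] for v in P) == 1 for w in P))  # True
```

Because the resulting matrix is symmetric and each row sums to 1, each column also sums to 1, which is why the walk samples nodes uniformly in the long run.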
Walking in Facebook
Minas Gjoka, Maciej Kurant‡, Carter Butts, Athina Markopoulou
UC Irvine, EPFL‡

v Motivation and Problem Statement
v Sampling Methodology
v Data Analysis
v Conclusion

v A network of declared friendships between users
v Allows users to maintain relationships
v Many popular OSNs with different focus
§ Facebook, LinkedIn, Flickr, …
[Figure: social graph on nodes A to H]
v Representative samples desirable
v Obtaining complete dataset difficult
v Obtain a representative sample of
v Graph traversal (BFS)
v Random walks (MHRW, RDS)

v Motivation and Problem Statement
v Sampling Methodology
v Data Analysis
v Conclusion

[Figure: BFS on the social graph with nodes A to H, marked as unexplored, explored, or visited]
v Starting from a seed, BFS explores all neighbor nodes. The process continues iteratively without replacement.
v BFS leads to bias towards high-degree nodes
Lee et al., "Statistical properties of Sampled Networks", Phys. Review E, 2006
v Early measurement studies of OSNs use BFS as the primary sampling technique,
e.g., [Mislove et al.], [Ahn et al.], [Wilson et al.]
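The BFS procedure described above can be sketched as follows. The toy graph, the `budget` parameter, and the function name `bfs_sample` are hypothetical; a real crawler would fetch neighbor lists from the OSN instead of a dictionary.

```python
from collections import deque

# Hypothetical toy graph as adjacency lists (not the graph from the slide figure).
GRAPH = {
    "A": ["B", "C", "E"],
    "B": ["A", "D"],
    "C": ["A", "D", "F"],
    "D": ["B", "C"],
    "E": ["A", "G"],
    "F": ["C", "H"],
    "G": ["E"],
    "H": ["F"],
}


def bfs_sample(graph, seed, budget):
    """Return up to `budget` nodes in breadth-first order from `seed`.

    Each node is sampled at most once (sampling without replacement).
    """
    visited = {seed}
    queue = deque([seed])
    sample = []
    while queue and len(sample) < budget:
        node = queue.popleft()
        sample.append(node)
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return sample


print(bfs_sample(GRAPH, "A", 5))  # ['A', 'B', 'C', 'E', 'D']
```

When the budget is smaller than the graph, the crawl stops early, and the sample covers only the region around the seed.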
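[Figure: social graph on nodes A to H; the current node moves to each of its three next candidates with probability 1/3]
v Explore graph one node at a time with replacement
v Transition probability and stationary distribution:
$$P^{RW}_{\upsilon,w} = \frac{1}{k_\upsilon} \ \text{if } w \text{ is a neighbor of } \upsilon, \qquad \pi_\upsilon = \frac{k_\upsilon}{2|E|}$$
where $k_\upsilon$ is the degree of node υ and $|E|$ is the number of edges.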
v Re-weighting (RWRW) corrects for degree bias at the end of collection
v Without re-weighting, the probability distribution for node property A is:
$$\hat{p}(A_i) = \frac{\sum_{u \in A_i} 1}{\sum_{u \in V} 1}$$
v Re-weighted probability distribution:
$$\hat{p}(A_i) = \frac{\sum_{u \in A_i} 1/k_u}{\sum_{u \in V} 1/k_u}$$
where $A_i$ is the subset of sampled nodes with value i, V is the set of all sampled nodes, and $k_u$ is the degree of node u.
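To illustrate the effect of re-weighting, the sketch below computes both estimates from a list of sampled nodes, giving each sample a weight of 1/k_u in the re-weighted case. The sample, degrees, and property labels are made-up data, and the function names are illustrative.

```python
from collections import defaultdict


def naive_distribution(sample, prop):
    """Unweighted estimate: fraction of sampled nodes with each property value."""
    counts = defaultdict(float)
    for u in sample:
        counts[prop[u]] += 1.0
    total = float(len(sample))
    return {value: c / total for value, c in counts.items()}


def reweighted_distribution(sample, prop, degree):
    """Re-weighted estimate: each sampled node u contributes weight 1/k_u."""
    weights = defaultdict(float)
    for u in sample:
        weights[prop[u]] += 1.0 / degree[u]
    total = sum(weights.values())
    return {value: w / total for value, w in weights.items()}


# Hypothetical data: a random-walk sample (with repeats), node degrees,
# and a node property A (here a made-up "group" label).
degree = {"a": 4, "b": 1, "c": 2, "d": 1}
group = {"a": "x", "b": "y", "c": "x", "d": "y"}
walk_sample = ["a", "a", "a", "a", "c", "c", "b", "d"]

print(naive_distribution(walk_sample, group))               # {'x': 0.75, 'y': 0.25}
print(reweighted_distribution(walk_sample, group, degree))  # {'x': 0.5, 'y': 0.5}
```

In this toy sample each node appears in proportion to its degree, as a random walk would visit it, so the naive estimate over-counts group "x" while the re-weighted estimate recovers the true 50/50 split of nodes.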
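v Explore graph one node at a time with replacement
v In the stationary distribution, $\pi_\upsilon = 1/|V|$
$$P^{MH}_{\upsilon,w} = \begin{cases} \dfrac{1}{k_\upsilon} \min\left(1, \dfrac{k_\upsilon}{k_w}\right) & \text{if } w \text{ is a neighbor of } \upsilon \\[2ex] 1 - \sum_{y \neq \upsilon} P^{MH}_{\upsilon,y} & \text{if } w = \upsilon \end{cases}$$
[Figure: social graph on nodes A to H with current node A and its next candidates; edge labels 1/3, 1/5, 1/3 and self-loop 2/15 = 1 − (1/3 + 1/5 + 1/3); labeled probabilities $P^{MH}_{AC}$ and $P^{MH}_{AA}$]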
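The walk itself can be simulated with a propose-and-accept loop: pick a uniformly random neighbor w of the current node υ and move there with probability min(1, k_υ/k_w), otherwise stay. The following is a minimal sketch on a hypothetical 4-node graph; the function name `mhrw`, the seed, and the step count are illustrative choices.

```python
import random
from collections import Counter

# Hypothetical toy graph as adjacency lists (degrees 3, 2, 3, 2).
GRAPH = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}


def mhrw(graph, seed_node, steps, rng):
    """Metropolis-Hastings random walk; returns the visited nodes in order."""
    current = seed_node
    sample = []
    for _ in range(steps):
        candidate = rng.choice(graph[current])  # propose a uniform neighbor
        k_cur, k_cand = len(graph[current]), len(graph[candidate])
        # Accept with probability min(1, k_cur / k_cand); otherwise stay put.
        if rng.random() < min(1.0, k_cur / k_cand):
            current = candidate
        sample.append(current)  # repeats are kept (sampling with replacement)
    return sample


rng = random.Random(0)  # fixed seed so the run is reproducible
sample = mhrw(GRAPH, seed_node=0, steps=40000, rng=rng)
freq = {v: c / len(sample) for v, c in sorted(Counter(sample).items())}
print(freq)  # each frequency should be close to 1/4
```

Rejected proposals count as a repeat visit to the current node, which is what makes low-degree nodes as likely to be sampled as high-degree ones.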
v As a basis for comparison, we collect
v UNI not a general solution for
Sampling method   MHRW     RW       BFS      UNI
# Valid Users     28x81K   28x81K   28x81K   984K
# Unique Users    957K     2.19M    2.20M    984K
http://odysseas.calit2.uci.edu/research/osn.html
v What information do we collect for each sampled node u?
§ UserID, Name, Networks (Regional, School/Workplace), Privacy settings
§ Friend List: UserID, Name, Networks, and Privacy settings of each friend
§ Privacy settings as four bits (1 1 1 1): Profile Photo, Add as Friend, View Friends, Send Message
Geweke
v Detects convergence for a single walk. Let X be a sequence of samples for the metric of interest, and let $X_a$ and $X_b$ be an early and a late segment of X.
$$z = \frac{E(X_a) - E(X_b)}{\sqrt{Var(X_a) + Var(X_b)}}$$
Geweke, "… posterior moments", in Bayesian Statistics 4, 1992
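A simplified version of the Geweke test can be sketched in a few lines. The sketch compares the first 10% and last 50% of the walk, which is a common choice but an assumption here, and it estimates the variance of each window mean as sample variance divided by window length, ignoring autocorrelation. The traces are synthetic.

```python
import math
from statistics import mean, variance


def geweke_z(x, first=0.1, last=0.5):
    """Geweke z-score comparing an early and a late window of one walk.

    Simplifying assumption: the variance of each window mean is taken as
    sample variance / window length, which ignores autocorrelation.
    """
    n = len(x)
    xa = x[: max(2, int(first * n))]
    xb = x[n - max(2, int(last * n)):]
    var_a = variance(xa) / len(xa)
    var_b = variance(xb) / len(xb)
    return (mean(xa) - mean(xb)) / math.sqrt(var_a + var_b)


# Hypothetical metric traces (e.g. node degree observed along a walk).
stationary = [i % 2 for i in range(100)]       # same mean everywhere
drifting = [i + (i % 2) for i in range(100)]   # mean keeps growing

print(geweke_z(stationary))          # 0.0
print(abs(geweke_z(drifting)) > 2)   # True
```

A z-score near zero suggests the two windows agree; a large absolute value suggests the walk has not yet converged.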
Gelman-Rubin
v Detects convergence for m>1 walks
[Figure: Walk 1, Walk 2, Walk 3]
$$R = \left( \frac{n-1}{n} + \frac{m+1}{mn} \, \frac{B}{W} \right)$$
where B is the between-walks variance, W is the within-walks variance, and n is the length of each walk.
Statistical Science, Volume 7, 1992
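The sketch below evaluates R = (n−1)/n + (m+1)/(mn) · B/W for m walks of equal length n. It assumes the usual convention that B is n times the sample variance of the per-walk means and W is the average per-walk sample variance; the traces are made up.

```python
from statistics import mean, variance


def gelman_rubin(walks):
    """R = (n-1)/n + (m+1)/(m*n) * B/W for m walks of equal length n."""
    m, n = len(walks), len(walks[0])
    walk_means = [mean(w) for w in walks]
    # B: between-walk variance (n times the sample variance of the walk means)
    B = n * variance(walk_means)
    # W: within-walk variance (average of the per-walk sample variances)
    W = mean(variance(w) for w in walks)
    return (n - 1) / n + (m + 1) / (m * n) * B / W


# Hypothetical traces of one metric from three walks.
agree = [[1, 2, 3, 4], [2, 1, 4, 3], [4, 3, 2, 1]]
disagree = [[1, 2, 3, 4], [11, 12, 13, 14], [21, 22, 23, 24]]

print(gelman_rubin(agree))           # 0.75, since B = 0
print(gelman_rubin(disagree) > 1.1)  # True
```

Values of R close to 1 indicate that the walks agree with each other; values well above 1 indicate that at least one walk is still exploring a different region.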
Node Degree
v Poor performance
v MHRW, RWRW
28 crawls
BFS
v Low degree nodes
v BFS is biased
v Degree distribution identical to UNI (MHRW, RWRW)
v RW as biased as BFS but with smaller variance in each walk
v Use MHRW or RWRW. Do not use BFS, RW.
v Use formal convergence diagnostics
§ assess convergence online
§ use multiple parallel walks
v MHRW vs RWRW
§ RWRW slightly better performance
§ MHRW provides a "ready-to-use" sample

v Motivation and Problem Statement
v Sampling Methodology
v Data Analysis
v Conclusion

Degree Distribution, Power Law Distribution: $f(k) = b \cdot k^{-a}$
v Degree distribution not a power law
[Figure: degree distribution with fitted exponents $a_1 = 1.32$ and $a_2 = 3.38$]
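On a log-log plot a power law f(k) = b·k^(−a) is a straight line with slope −a, so one simple (if statistically crude) way to estimate a is a least-squares line through the points (log k, log f). The sketch below does this on synthetic data; the function name `fit_power_law` and the data are assumptions, and the two exponents reported on the slide would correspond to fitting two degree ranges separately.

```python
import math


def fit_power_law(ks, fs):
    """Least-squares line through (log k, log f); returns (a, b) for f = b * k**(-a)."""
    xs = [math.log(k) for k in ks]
    ys = [math.log(f) for f in fs]
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return -slope, math.exp(intercept)


# Synthetic data that follows f(k) = 2 * k**(-1.5) exactly (not Facebook data).
ks = [1, 2, 4, 8, 16, 32]
fs = [2.0 * k ** -1.5 for k in ks]

a, b = fit_power_law(ks, fs)
print(round(a, 6), round(b, 6))  # 1.5 2.0
```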
v Do assigned readings before class
v Submit reviews/critiques
v Attend in-class discussions