Welcome to DS504/CS586: Big Data Analytics
Graph Mining
Prof. Yanhua Li
Time: 6:00pm–8:50pm R    Location: AK 233    Spring 2018
Urban Sensing & Data Acquisition
Participatory Sensing, Crowd Sensing, Mobile Sensing
Traffic, Road Networks, POIs, Air Quality, Human Mobility, Meteorology, Social Media, Energy
Urban Data Management
Spatio-temporal index, streaming, trajectory, and graph data management,...
Urban Data Analytics
Data Mining, Machine Learning, Visualization
Service Providing
Improve urban planning, Ease Traffic Congestion, Save Energy, Reduce Air Pollution, ...
Urban Computing: Concepts, Methodologies, and Applications. Zheng, Y., et al. ACM Transactions on Intelligent Systems and Technology.
Acquisition, Cleaning, Management
Big Graph Data Mining, Big Data Clustering, Recommender Systems, Applications
Mining of Massive Datasets, http://www.mmds.org
Facebook social graph
4-degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011]
Connections between political blogs
Polarization of the network [Adamic-Glance, 2005]
Citation networks and Maps of science
[Börner et al., 2012]
[Figure labels: router, domain1, domain2, domain3]
Partial map of the Internet based on the January 15, 2005 data found on
Seven Bridges of Königsberg
[Euler, 1735]
Return to the starting point by traveling each link of the graph once and only once.
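The bridge puzzle can be checked mechanically: a connected multigraph has a closed walk that uses every link exactly once if and only if every node has even degree. The sketch below is a minimal illustration of that test; the land-mass labels A to D and the helper names (`degrees`, `is_connected`, `has_euler_circuit`) are assumptions made for this example and do not appear in the slides.

```python
from collections import Counter

# Seven bridges between four land masses (labels A-D are hypothetical).
BRIDGES = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
           ("A", "D"), ("B", "D"), ("C", "D")]


def degrees(edges):
    """Count how many link endpoints touch each node (multigraph degree)."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def is_connected(edges):
    """Depth-first search over the nodes that appear in the edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    if not adj:
        return True
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:
        for nb in adj[stack.pop()]:
            if nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return len(seen) == len(adj)


def has_euler_circuit(edges):
    """Connected and every degree even: an Euler circuit exists."""
    return is_connected(edges) and all(
        d % 2 == 0 for d in degrees(edges).values()
    )


print(dict(degrees(BRIDGES)))
print(has_euler_circuit(BRIDGES))
```

Every land mass touches an odd number of bridges, so the check fails for Königsberg, while a simple triangle passes it.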
Link types: Undirected links, Directed links, Signed links, Multi-relational links, Hyperlinks, …
Examples: Friendship, Co-authorship, Following, One-way road, Trust & distrust, Friend & foe, Repulsion & cohesion, Resistance, Wireless channel, …
• Network Statistic Analysis (this lecture)
  ◦ Network size
  ◦ Degree distribution
• Node Ranking (next lecture)
Random sampling (uniform & independent)
• vertex sampling
• BFS sampling
• random walk sampling
• edge sampling
Random walk sampling
Random walk: Routing, Influence diffusion, Molecule in liquid
Undirected !!
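• Adjacency matrix
• Transition probability matrix
• |E|: number of links
• Stationary distribution

Example (undirected graph, so A is symmetric):

$$D = \begin{pmatrix} 3 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 2 \end{pmatrix}, \qquad A = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix}$$

$$P = D^{-1}A = \begin{pmatrix} 0 & 1/3 & 1/3 & 1/3 \\ 1/2 & 0 & 1/2 & 0 \\ 1/3 & 1/3 & 0 & 1/3 \\ 1/2 & 0 & 1/2 & 0 \end{pmatrix}$$

$$P_{ij} = \begin{cases} \dfrac{1}{k_i} & \text{if } j \text{ is a neighbor of } i \\ 0 & \text{otherwise (including } i = j\text{)} \end{cases}$$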
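The relationship between the adjacency matrix, the transition probability matrix, and the stationary distribution can be verified numerically. The sketch below assumes the 4-node example graph with degrees 3, 2, 3, 2 and uses the standard result that a simple random walk on a connected undirected graph has stationary distribution π_i = k_i / (2|E|); the function names are illustrative only.

```python
from fractions import Fraction

# Adjacency matrix of the 4-node example (node degrees 3, 2, 3, 2).
A = [[0, 1, 1, 1],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [1, 0, 1, 0]]


def transition_matrix(adj):
    """Row-normalize: P[i][j] = 1/k_i if j is a neighbor of i, else 0."""
    return [[Fraction(a, sum(row)) for a in row] for row in adj]


def stationary(adj):
    """pi_i = k_i / (2|E|) for a connected undirected graph."""
    total = sum(sum(row) for row in adj)  # sum of degrees = 2|E|
    return [Fraction(sum(row), total) for row in adj]


def step(pi, P):
    """One step of the chain: (pi P)_j = sum_i pi_i * P[i][j]."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]


P = transition_matrix(A)
pi = stationary(A)
print(pi)
print(step(pi, P) == pi)  # stationary: one more step changes nothing
```

Exact fractions are used so that the stationarity check is an equality rather than a floating-point comparison.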
Metropolis-Hastings transition probabilities on the same undirected example graph (degrees 3, 2, 3, 2):

$$P^{MH}_{\upsilon,w} = \begin{cases} \dfrac{1}{k_\upsilon}\min\left(1, \dfrac{k_\upsilon}{k_w}\right) & \text{if } w \text{ is a neighbor of } \upsilon \\ 1 - \sum_{y \neq \upsilon} P^{MH}_{\upsilon,y} & \text{if } w = \upsilon \end{cases}$$

$$P^{MH} = \begin{pmatrix} 0 & 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 & 0 \\ 1/3 & 1/3 & 0 & 1/3 \\ 1/3 & 0 & 1/3 & 1/3 \end{pmatrix}$$
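The Metropolis-Hastings rule can be applied entry by entry to the same 4-node example (degrees 3, 2, 3, 2). The sketch below is a minimal illustration under that assumption; `mh_transition_matrix` and `step` are names chosen for this example. It also checks that the uniform distribution is stationary for the resulting matrix.

```python
from fractions import Fraction

# Adjacency matrix of the 4-node example (node degrees 3, 2, 3, 2).
A = [[0, 1, 1, 1],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [1, 0, 1, 0]]


def mh_transition_matrix(adj):
    """P[v][w] = (1/k_v) * min(1, k_v/k_w) for neighbors; rest stays at v."""
    n = len(adj)
    k = [sum(row) for row in adj]
    P = [[Fraction(0)] * n for _ in range(n)]
    for v in range(n):
        for w in range(n):
            if w != v and adj[v][w]:
                # Propose w with probability 1/k_v, accept with min(1, k_v/k_w).
                P[v][w] = Fraction(1, k[v]) * min(Fraction(1), Fraction(k[v], k[w]))
        P[v][v] = 1 - sum(P[v])  # probability of staying at v
    return P


def step(pi, P):
    """One step of the chain: (pi P)_j = sum_i pi_i * P[i][j]."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]


P_mh = mh_transition_matrix(A)
uniform = [Fraction(1, len(A))] * len(A)
for row in P_mh:
    print([str(p) for p in row])
print(step(uniform, P_mh) == uniform)
```

In this small graph every nonzero entry comes out as 1/3: the low-degree nodes keep the leftover probability as a self-loop instead of passing it to their high-degree neighbors.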
Walking in Facebook
Minas Gjoka, Maciej Kurant‡, Carter Butts, Athina Markopoulou
UC Irvine, EPFL‡

• Motivation and Problem Statement
• Sampling Methodology
• Data Analysis
• Conclusion
• A network of declared friendships between users
• Allows users to maintain relationships
• Many popular OSNs with different focus
  ◦ Facebook, LinkedIn, Flickr, …
[Figure: social graph with nodes A-H]
• Representative samples desirable
• Obtaining complete dataset difficult

• Obtain a representative sample of
• Graph traversal (BFS)
• Random walks (MHRW, RDS)

• Motivation and Problem Statement
• Sampling Methodology
• Data Analysis
• Conclusion
[Figure: graph with nodes A-H; node states: unexplored, explored, visited]
• Starting from a seed, explores all neighbor nodes. Process continues iteratively without replacement.
• BFS leads to bias towards high-degree nodes
  Lee et al., "Statistical properties of sampled networks", Physical Review E, 2006
• Early measurement studies of OSNs use BFS as the primary sampling technique,
  e.g. [Mislove et al.], [Ahn et al.], [Wilson et al.]

$$p(A_i) = \frac{\sum_{u \in A_i} 1}{\sum_{u \in V} 1}$$

where $A_i$ is the subset of sampled nodes with value $i$ and $V$ is the set of all sampled nodes.
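A breadth-first crawl with a fixed node budget can be sketched as follows. The edge list of `GRAPH` is a hypothetical friendship graph on the labels A to H (the slides do not give the edges), and `bfs_sample` is an illustrative helper name. Comparing the mean degree of the crawled nodes with the mean degree of the whole graph hints at the bias towards high-degree nodes.

```python
from collections import deque

# Hypothetical friendship graph on the labels A-H (edges are illustrative).
GRAPH = {
    "A": ["B", "C", "E"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D", "E", "F"],
    "D": ["B", "C", "H"],
    "E": ["A", "C", "G"],
    "F": ["C", "G"],
    "G": ["E", "F", "H"],
    "H": ["D", "G"],
}


def bfs_sample(graph, seed, budget):
    """Visit up to `budget` distinct nodes in breadth-first order from `seed`."""
    visited = [seed]
    seen = {seed}
    queue = deque([seed])
    while queue and len(visited) < budget:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in seen:  # without replacement: each node at most once
                seen.add(nb)
                queue.append(nb)
                visited.append(nb)
                if len(visited) == budget:
                    break
    return visited


def mean_degree(graph, nodes):
    """Average degree over the given nodes."""
    return sum(len(graph[u]) for u in nodes) / len(nodes)


sample = bfs_sample(GRAPH, "A", 4)
print(sample)
print(mean_degree(GRAPH, sample), mean_degree(GRAPH, list(GRAPH)))
```

On this toy graph the partial crawl already reaches the degree-5 hub, so the sample's mean degree exceeds the graph's mean degree.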
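[Figure: graph with nodes A-H; each of the three next-candidate edges from the current node has probability 1/3]
• Explore graph one node at a time with replacement

$$P^{RW}_{\upsilon,w} = \frac{1}{k_\upsilon} \quad \text{if } w \text{ is a neighbor of } \upsilon$$

• Stationary distribution:

$$\pi_\upsilon = \frac{k_\upsilon}{2|E|}$$

where $k_\upsilon$ is the degree of node $\upsilon$ and $|E|$ is the number of edges.

$$p(A_i) = \frac{\sum_{u \in A_i} 1}{\sum_{u \in V} 1}$$

where $A_i$ is the subset of sampled nodes with value $i$ and $V$ is the set of all sampled nodes.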
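A simple random walk and its degree-proportional visiting frequencies can be simulated in a few lines. This is a sketch under assumptions: the edges of `GRAPH` are hypothetical (only the labels A to H come from the slides), and the walk length and seed are arbitrary choices.

```python
import random
from collections import Counter

# Hypothetical friendship graph on the labels A-H (edges are illustrative).
GRAPH = {
    "A": ["B", "C", "E"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D", "E", "F"],
    "D": ["B", "C", "H"],
    "E": ["A", "C", "G"],
    "F": ["C", "G"],
    "G": ["E", "F", "H"],
    "H": ["D", "G"],
}


def random_walk(graph, seed, steps, rng):
    """Move to a uniformly chosen neighbor at every step (with replacement)."""
    node = seed
    samples = []
    for _ in range(steps):
        node = rng.choice(graph[node])
        samples.append(node)
    return samples


rng = random.Random(0)
walk = random_walk(GRAPH, "A", 100_000, rng)
freq = Counter(walk)
two_e = sum(len(nbs) for nbs in GRAPH.values())  # sum of degrees = 2|E|

# Empirical visit frequency next to the predicted k_v / (2|E|).
for v in sorted(GRAPH):
    print(v, round(freq[v] / len(walk), 3), round(len(GRAPH[v]) / two_e, 3))
```

The two printed columns should be close for every node, with the degree-5 node visited most often.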
• Corrects for degree bias at the end of collection
• Without re-weighting, the probability distribution for node property A is:

$$p(A_i) = \frac{\sum_{u \in A_i} 1}{\sum_{u \in V} 1}$$

• Re-weighted probability distribution:

$$\hat{p}(A_i) = \frac{\sum_{u \in A_i} 1/k_u}{\sum_{u \in V} 1/k_u}$$

where $A_i$ is the subset of sampled nodes with value $i$, $V$ is the set of all sampled nodes, and $k_u$ is the degree of node $u$.
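The two estimators differ only in the weight given to each sampled node. The sketch below uses a tiny hypothetical sample in which each node appears in proportion to its degree, which is what a long random walk would produce; the node names, degrees, and function names are assumptions made for illustration. The property being estimated is the node degree itself.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical nodes and degrees; the sample lists each node k_u times,
# mimicking a degree-proportional random walk sample.
DEGREE = {"a": 1, "b": 1, "c": 2, "d": 4}
SAMPLES = ["a", "b", "c", "c", "d", "d", "d", "d"]


def plain_estimate(samples, value_of):
    """p(A_i) = (number of samples with value i) / (number of samples)."""
    counts = defaultdict(Fraction)
    for u in samples:
        counts[value_of(u)] += 1
    return {i: c / len(samples) for i, c in counts.items()}


def reweighted_estimate(samples, value_of, degree_of):
    """p(A_i) = sum_{u in A_i} 1/k_u divided by sum_{u in V} 1/k_u."""
    weights = defaultdict(Fraction)
    for u in samples:
        weights[value_of(u)] += Fraction(1, degree_of(u))
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


plain = plain_estimate(SAMPLES, DEGREE.get)
reweighted = reweighted_estimate(SAMPLES, DEGREE.get, DEGREE.get)
print({i: str(p) for i, p in plain.items()})
print({i: str(p) for i, p in reweighted.items()})
```

Two of the four nodes have degree 1, so the true fraction is 1/2; the plain estimate understates it, and the re-weighted estimate recovers it.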
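• Explore graph one node at a time with replacement
• In the stationary distribution, every node is visited with equal probability

$$P^{MH}_{\upsilon,w} = \begin{cases} \dfrac{1}{k_\upsilon}\min\left(1, \dfrac{k_\upsilon}{k_w}\right) & \text{if } w \text{ is a neighbor of } \upsilon \\ 1 - \sum_{y \neq \upsilon} P^{MH}_{\upsilon,y} & \text{if } w = \upsilon \end{cases}$$

[Figure: graph with nodes A-H; the next-candidate edges from the current node carry probabilities 1/3, 1/5, and 1/3, and the remaining 2/15 is the probability of staying at the current node; labels $P^{MH}_{AC}$ and $P^{MH}_{AA}$]

$$p(A_i) = \frac{\sum_{u \in A_i} 1}{\sum_{u \in V} 1}$$

where $A_i$ is the subset of sampled nodes with value $i$ and $V$ is the set of all sampled nodes.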
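The Metropolis-Hastings walk adds an accept/reject step to the simple random walk. The sketch below is an illustration only: the edges of `GRAPH` are hypothetical (chosen so that A has degree 3 and C has degree 5), and the function names, seed, and walk length are assumptions. It computes exact transition probabilities from A and then simulates the walk to compare visit frequencies with the uniform value 1/8.

```python
import random
from collections import Counter
from fractions import Fraction

# Hypothetical friendship graph on the labels A-H (edges are illustrative).
GRAPH = {
    "A": ["B", "C", "E"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D", "E", "F"],
    "D": ["B", "C", "H"],
    "E": ["A", "C", "G"],
    "F": ["C", "G"],
    "G": ["E", "F", "H"],
    "H": ["D", "G"],
}


def mh_probability(graph, v, w):
    """Exact P^MH_{v,w} computed from the degrees of v and its neighbors."""
    k_v = len(graph[v])

    def move(y):
        return Fraction(1, k_v) * min(Fraction(1), Fraction(k_v, len(graph[y])))

    if w == v:
        return 1 - sum(move(y) for y in graph[v])
    return move(w) if w in graph[v] else Fraction(0)


def mhrw(graph, seed, steps, rng):
    """Propose a uniform neighbor; accept with probability min(1, k_v/k_w)."""
    node = seed
    samples = []
    for _ in range(steps):
        candidate = rng.choice(graph[node])
        if rng.random() < len(graph[node]) / len(graph[candidate]):
            node = candidate
        samples.append(node)  # a rejected move repeats the current node
    return samples


print(mh_probability(GRAPH, "A", "C"), mh_probability(GRAPH, "A", "A"))

rng = random.Random(0)
freq = Counter(mhrw(GRAPH, "A", 100_000, rng))
for v in sorted(GRAPH):
    print(v, round(freq[v] / 100_000, 3))
```

Because rejected proposals count as repeated samples, the collected sequence can be used directly with the unweighted estimator.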
• As a basis for comparison, we collect
• UNI not a general solution for
Sampling method    MHRW      RW        BFS       UNI
# Valid users      28x81K    28x81K    28x81K    984K
# Unique users     957K      2.19M     2.20M     984K
http://odysseas.calit2.uci.edu/research/osn.html
• What information do we collect for each sampled node u?
  ◦ UserID, Name, Networks, Privacy settings
  ◦ Friend list: UserID, Name, Networks, Privacy settings of each friend
  ◦ Networks: Regional, School/Workplace
  ◦ Privacy settings (one bit each, e.g. 1 1 1 1): Profile Photo, Add as Friend, View Friends, Send Message
Geweke
• Detects convergence for a single walk. Let X be a sequence of samples for the metric of interest.
  Geweke, "… posterior moments", in Bayesian Statistics 4, 1992

$$z = \frac{E(X_a) - E(X_b)}{\sqrt{Var(X_a) + Var(X_b)}}$$

where $X_a$ is the first 10% and $X_b$ the last 50% of the samples (a = 10%, b = 50%). Convergence is declared when $z \in [-1, 1]$.
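A minimal sketch of the Geweke score for one walk follows. It assumes the common form of the statistic with a square root and a sum of variances in the denominator, compares the first 10% of the samples with the last 50%, and uses the sample variance from the standard library; the function name and the two test sequences are illustrative.

```python
from math import sqrt
from statistics import mean, variance


def geweke_z(x, first=0.1, last=0.5):
    """z = (E(X_a) - E(X_b)) / sqrt(Var(X_a) + Var(X_b)).

    X_a is the first `first` fraction of x, X_b the last `last` fraction.
    """
    a = x[: int(len(x) * first)]
    b = x[len(x) - int(len(x) * last):]
    return (mean(a) - mean(b)) / sqrt(variance(a) + variance(b))


steady = [0, 1] * 500            # same behaviour at the start and the end
drifting = list(range(1000))     # the metric keeps growing along the walk

print(geweke_z(steady))
print(round(geweke_z(drifting), 2))
```

The steady sequence scores inside [-1, 1], while the drifting one falls well outside, which would signal that the walk has not converged.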
Gelman-Rubin
• Detects convergence for M > 1 walks (Walk 1, Walk 2, Walk 3, …)
  … Statistical Science, Volume 7, 1992

$$R = \frac{N-1}{N} + \frac{M+1}{MN} \cdot \frac{B}{W}$$

where $B$ is the between-walks variance and $W$ is the within-walks variance.
• Convergence is declared when R < 1.02
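The R score for several parallel walks can be sketched as below. The formula follows the slide; the definitions of B (N times the sample variance of the per-walk means) and W (the mean of the per-walk sample variances) are the usual Gelman-Rubin ones and are an assumption here, as are the function name and the toy walks of equal length N.

```python
from statistics import mean, variance


def gelman_rubin(walks):
    """R = (N-1)/N + (M+1)/(M*N) * B/W for M walks of equal length N."""
    m = len(walks)
    n = len(walks[0])
    walk_means = [mean(w) for w in walks]
    b = n * variance(walk_means)              # between-walks variance
    w = mean([variance(x) for x in walks])    # within-walks variance
    return (n - 1) / n + (m + 1) / (m * n) * b / w


# Three walks that explore the same values: the means agree.
mixed = [[0, 1] * 50, [0, 1] * 50, [1, 0] * 50]
# Two walks stuck in different regions: the means disagree.
stuck = [[0, 1] * 50, [10, 11] * 50]

print(gelman_rubin(mixed))
print(gelman_rubin(stuck))
```

Only the first set of walks falls below the 1.02 threshold.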
Node Degree
• Poor performance
• MHRW, RWRW
[Figure: 28 crawls]

BFS
• Low degree nodes
• BFS is biased

• Degree distribution identical to UNI (MHRW, RWRW)
• RW as biased as BFS but with smaller variance in each walk
• Use MHRW or RWRW. Do not use BFS or RW.
• Use formal convergence diagnostics
  ◦ assess convergence online
  ◦ use multiple parallel walks
• MHRW vs RWRW
  ◦ RWRW: slightly better performance
  ◦ MHRW: provides a "ready-to-use" sample
Degree Distribution, Power Law Distribution $f(k) = b \cdot k^{-a}$
• Degree distribution not a power law
[Figure labels: $a_1 = 1.32$, $a_2 = 3.38$]
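For a distribution of the form f(k) = b·k^(-a), the exponent a is the negative slope of a straight line on a log-log plot. The sketch below fits that line by ordinary least squares; this is a simple heuristic chosen for illustration, not necessarily the fitting method behind the exponents quoted on the slide, and the function name and synthetic data are assumptions.

```python
from math import exp, log


def fit_power_law(ks, fs):
    """Least-squares line through (log k, log f); returns (a, b) for f = b * k**-a."""
    xs = [log(k) for k in ks]
    ys = [log(f) for f in fs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return -slope, exp(intercept)


# Synthetic data that follows f(k) = 2 * k^-1.5 exactly.
ks = [1, 2, 4, 8, 16]
fs = [2 * k ** -1.5 for k in ks]
a, b = fit_power_law(ks, fs)
print(round(a, 6), round(b, 6))
```

Fitting separate degree ranges with this helper gives one exponent per range, which is one way to see that a single power law does not describe the whole distribution.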
• Motivation and Problem Statement
• Sampling Methodology
• Data Analysis
• Conclusion
• 5 team presentations
• 30 minutes each team including the
• We will have snacks and soft drinks
• KDD Cup 2018
• Real data, worldwide competition
• Will be announced March 1st, 2018
• Define your own project or KDD Cup
• http://www.kdd.org/News/view/kdd-
• http://www.kdd.org/kdd-cup
• http://www.kdd.org/kdd2017/News/