DS504/CS586: Big Data Analytics, Graph Mining. Prof. Yanhua Li



slide-1
SLIDE 1

Welcome to DS504/CS586: Big Data Analytics

Graph Mining

  • Prof. Yanhua Li
  • Time: 6:00pm – 8:50pm R
  • Location: AK232
  • Fall 2016

slide-2
SLIDE 2

Graph Data: Social Networks

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

Facebook social graph

4-degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011]

slide-3
SLIDE 3

Graph Data: Media Networks

Connections between political blogs

Polarization of the network [Adamic-Glance, 2005]

slide-4
SLIDE 4

Graph Data: Information Nets

Citation networks and Maps of science

[Börner et al., 2012]

slide-5
SLIDE 5

Graph Data: Communication Nets

[Figure: Internet topology; labels: domain1, domain2, domain3, router]

slide-6
SLIDE 6

Graph Data: Topological Networks

Seven Bridges of Königsberg

[Euler, 1735]

Return to the starting point by traveling each link of the graph once and only once.

slide-7
SLIDE 7

Graph representation of networks

  • Link types: undirected links, directed links, signed links, multi-relational links, hyperlinks, …
  • Examples: trust & distrust, repulsion & cohesion, friend & foe, following, one-way road, resistance, wireless channel, friendship, co-authorship, …

slide-8
SLIDE 8

Mining in Big Graphs

  • Network Statistic Analysis (this lecture)
      • Network size
      • Degree distribution
  • Node Ranking (next lecture)
      • Identifying the most influential nodes
      • Viral marketing, resource allocation

slide-9
SLIDE 9

Graph Data: Social Networks

Facebook social graph

4-degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011]

slide-10
SLIDE 10

Sampling graphs

  • Random sampling (uniform & independent)
      • vertex sampling
      • edge sampling
  • Crawling
      • BFS sampling
      • random walk sampling

slide-11
SLIDE 11

Random Walks on Graphs

  • Random walk sampling
  • Random walk routing
  • Influence diffusion
  • Molecule in liquid

slide-12
SLIDE 12

Undirected Graphs

[Figure: undirected graph on six nodes, labeled 1 to 6]

Undirected!

slide-13
SLIDE 13

Random Walk

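  • Adjacency matrix A (the graph is undirected, so A is symmetric)
  • Degree matrix D
  • Transition probability matrix P
  • |E|: number of links
  • Stationary distribution π

[Figure: undirected graph on four nodes, labeled 1 to 4]

$$A = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix}, \qquad D = \begin{pmatrix} 3 & & & \\ & 2 & & \\ & & 3 & \\ & & & 2 \end{pmatrix}$$

$$P = D^{-1} A = \begin{pmatrix} 0 & 1/3 & 1/3 & 1/3 \\ 1/2 & 0 & 1/2 & 0 \\ 1/3 & 1/3 & 0 & 1/3 \\ 1/2 & 0 & 1/2 & 0 \end{pmatrix}, \qquad P_{ij} = \frac{1}{k_i} \text{ if } j \text{ is a neighbor of } i$$

$$\pi_i = \frac{d_i}{2|E|}$$

where $k_i = d_i$ is the degree of node $i$.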
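
As a quick check of the stationary distribution, the sketch below builds the transition probabilities $P_{ij} = 1/k_i$ for the four-node example and confirms that $\pi_i = d_i / (2|E|)$ is left unchanged by one step of the walk. The adjacency lists are an assumption inferred from the degrees (3, 2, 3, 2), and the function names are illustrative rather than taken from the lecture.

```python
from fractions import Fraction

# Four-node example with degrees 3, 2, 3, 2 (nodes 2 and 4 are not linked).
# The adjacency lists are an assumption inferred from those degrees.
ADJ = {1: [2, 3, 4], 2: [1, 3], 3: [1, 2, 4], 4: [1, 3]}


def transition_matrix(adj):
    """Return P as nested dicts with P[i][j] = 1/k_i for each neighbor j of i."""
    return {i: {j: Fraction(1, len(nbrs)) for j in nbrs} for i, nbrs in adj.items()}


def step(dist, P):
    """One step of the walk: new[j] = sum over i of dist[i] * P[i][j]."""
    new = {j: Fraction(0) for j in P}
    for i, row in P.items():
        for j, p in row.items():
            new[j] += dist[i] * p
    return new


def degree_stationary(adj):
    """Closed form pi_i = d_i / (2|E|)."""
    two_e = sum(len(nbrs) for nbrs in adj.values())  # sum of degrees = 2|E|
    return {i: Fraction(len(nbrs), two_e) for i, nbrs in adj.items()}


P = transition_matrix(ADJ)
pi = degree_stationary(ADJ)
# pi is a fixed point of the walk: pi P = pi.
print(step(pi, P) == pi)
```

Exact fractions are used instead of floats so that the fixed-point check is an equality test rather than a tolerance test.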

slide-14
SLIDE 14

Metropolis-Hastings Random Walk

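  • Adjacency matrix A (the graph is undirected, so A is symmetric)
  • Degree matrix D
  • Transition probability matrix P
  • |E|: number of links
  • Stationary distribution π

[Figure: undirected graph on four nodes, labeled 1 to 4]

$$A = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix}, \qquad D = \begin{pmatrix} 3 & & & \\ & 2 & & \\ & & 3 & \\ & & & 2 \end{pmatrix}$$

$$P^{MH}_{\upsilon,w} = \begin{cases} \frac{1}{k_\upsilon} \min\left(1, \frac{k_\upsilon}{k_w}\right) & \text{if } w \text{ is a neighbor of } \upsilon \\ 1 - \sum_{y \neq \upsilon} P^{MH}_{\upsilon,y} & \text{if } w = \upsilon \\ 0 & \text{otherwise} \end{cases}$$

$$P^{MH} = \begin{pmatrix} 0 & 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 & 0 \\ 1/3 & 1/3 & 0 & 1/3 \\ 1/3 & 0 & 1/3 & 1/3 \end{pmatrix}$$

$$\pi_\upsilon = \frac{1}{|V|}$$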
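
To see why the Metropolis-Hastings correction gives a uniform stationary distribution, the following sketch builds $P^{MH}$ for the same four-node example and checks that the uniform distribution is unchanged by one step. The adjacency lists are again inferred from the degrees (3, 2, 3, 2), and the helper name `mh_transition_matrix` is made up for this illustration.

```python
from fractions import Fraction

# Same four-node example (degrees 3, 2, 3, 2); adjacency inferred from the degrees.
ADJ = {1: [2, 3, 4], 2: [1, 3], 3: [1, 2, 4], 4: [1, 3]}


def mh_transition_matrix(adj):
    """P[v][w] = (1/k_v) * min(1, k_v/k_w) for neighbors w; the remainder stays on v."""
    P = {}
    for v, nbrs in adj.items():
        k_v = len(nbrs)
        row = {
            w: Fraction(1, k_v) * min(Fraction(1), Fraction(k_v, len(adj[w])))
            for w in nbrs
        }
        row[v] = 1 - sum(row.values())  # self-loop absorbs the rejected moves
        P[v] = row
    return P


P_MH = mh_transition_matrix(ADJ)
uniform = {v: Fraction(1, len(ADJ)) for v in ADJ}
# One step from the uniform distribution should return the uniform distribution.
after = {w: sum(uniform[v] * P_MH[v].get(w, 0) for v in ADJ) for w in ADJ}
print(after == uniform)
```

Every nonzero entry comes out as 1/3 here, including the self-loops at the two degree-2 nodes.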

slide-15
SLIDE 15

Minas Gjoka, UC Irvine Walking in Facebook 15

Walking in Facebook: A Case Study of Unbiased Sampling of OSNs

Minas Gjoka, Maciej Kurant‡, Carter Butts, Athina Markopoulou

UC Irvine, EPFL‡

slide-16
SLIDE 16

Outline

  • Motivation and Problem Statement
  • Sampling Methodology
  • Data Analysis
  • Conclusion

slide-17
SLIDE 17

Online Social Networks (OSNs)

  • A network of declared friendships between users
  • Allows users to maintain relationships
  • Many popular OSNs with different focus
      • Facebook, LinkedIn, Flickr, …

[Figure: social graph with nodes A, B, C, D, E, F, G, H]

slide-18
SLIDE 18

Why Sample OSNs?

  • Representative samples are desirable
      • study properties
      • test algorithms
  • Obtaining the complete dataset is difficult
      • companies are usually unwilling to share data
      • tremendous overhead to measure everything (~100TB for Facebook)

slide-19
SLIDE 19

Problem statement

  • Obtain a representative sample of users in a given OSN by exploration of the social graph.
      • in this work we sample Facebook (FB)
      • explore the graph using various crawling techniques

slide-20
SLIDE 20

Related Work

  • Graph traversal (BFS)
      • A. Mislove et al., IMC 2007
      • Y. Ahn et al., WWW 2007
      • C. Wilson et al., EuroSys 2009
  • Random walks (MHRW, RDS)
      • M. Henzinger et al., WWW 2000
      • D. Stutzbach et al., IMC 2006
      • A. Rasti et al., Mini Infocom 2009

slide-21
SLIDE 21

Outline

  • Motivation and Problem Statement
  • Sampling Methodology
      • crawling methods
      • data collection
      • convergence evaluation
      • method comparisons
  • Data Analysis
  • Conclusion

slide-22
SLIDE 22

(1) Breadth-First-Search (BFS)

[Figure: social graph with nodes A to H, each marked as unexplored, explored, or visited]

  • Starting from a seed, explores all neighbor nodes. The process continues iteratively without replacement.
  • BFS leads to bias towards high-degree nodes
      • Lee et al., “Statistical properties of sampled networks”, Physical Review E, 2006
  • Early measurement studies of OSNs use BFS as the primary sampling technique
      • e.g. [Mislove et al.], [Ahn et al.], [Wilson et al.]
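
The BFS crawl described on this slide can be sketched as follows. This is a minimal illustration, assuming the graph is available as an in-memory adjacency dictionary; the toy graph `TOY` and the function name `bfs_sample` are hypothetical, and a real OSN crawl would fetch each friend list over the network instead.

```python
from collections import deque


def bfs_sample(adj, seed, max_nodes):
    """Collect up to max_nodes nodes in breadth-first order from seed, without replacement."""
    visited = [seed]
    seen = {seed}
    queue = deque([seed])
    while queue and len(visited) < max_nodes:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                visited.append(nbr)
                queue.append(nbr)
                if len(visited) == max_nodes:
                    break
    return visited


# Hypothetical toy graph: A is a hub, E hangs off C.
TOY = {"A": ["B", "C", "D"], "B": ["A"], "C": ["A", "E"], "D": ["A"], "E": ["C"]}
print(bfs_sample(TOY, "B", 3))
```

Even when the crawl starts from the low-degree node B, it reaches the hub A right away. That is the intuition behind the bias towards high-degree nodes in an incomplete BFS sample.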

slide-23
SLIDE 23

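(2) Random Walk (RW)

[Figure: social graph with nodes A to H; the current node moves to each of its three neighbors (next candidates) with probability 1/3]

  • Explores the graph one node at a time, with replacement:

$$P^{RW}_{\upsilon,w} = \frac{1}{k_\upsilon}$$

  • In the stationary distribution:

$$\pi_\upsilon = \frac{k_\upsilon}{2 \cdot |E|}$$

where $k_\upsilon$ is the degree of node $\upsilon$ and $|E|$ is the number of edges.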

slide-24
SLIDE 24

(3) Re-Weighted Random Walk (RWRW)

Hansen-Hurwitz estimator

  • Corrects for degree bias at the end of collection
  • Without re-weighting, the probability distribution for node property A is:

$$p(A_i) = \frac{\sum_{u \in A_i} 1}{\sum_{u \in V} 1} = \frac{|A_i|}{|V|}$$

  • Re-weighted probability distribution:

$$p(A_i) = \frac{\sum_{u \in A_i} 1/k_u}{\sum_{u \in V} 1/k_u}$$

where $A_i$ is the subset of sampled nodes with value $i$, $V$ is the set of all sampled nodes, and $k_u$ is the degree of node $u$.
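
A minimal sketch of the two estimators on this slide, assuming each random-walk sample is recorded as a `(value, degree)` pair. The sample data and function names are made up for illustration.

```python
from collections import Counter
from fractions import Fraction


def unweighted_distribution(samples):
    """p(A_i) = |A_i| / |V| over the sampled nodes (biased towards high degree)."""
    counts = Counter(value for value, _degree in samples)
    return {value: Fraction(c, len(samples)) for value, c in counts.items()}


def reweighted_distribution(samples):
    """p(A_i) = sum_{u in A_i} 1/k_u divided by sum_{u in V} 1/k_u."""
    weights = {}
    for value, degree in samples:
        weights[value] = weights.get(value, Fraction(0)) + Fraction(1, degree)
    total = sum(weights.values())
    return {value: w / total for value, w in weights.items()}


# Hypothetical walk output: (property value, node degree).
SAMPLES = [("red", 4), ("red", 4), ("blue", 2), ("blue", 1)]
print(unweighted_distribution(SAMPLES))
print(reweighted_distribution(SAMPLES))
```

In this toy data the high-degree "red" nodes make up half of the raw sample but only a quarter of the re-weighted estimate, because each sample counts in inverse proportion to its degree.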

slide-25
SLIDE 25

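(4) Metropolis-Hastings Random Walk (MHRW)

[Figure: social graph with nodes A to H; from the current node A the next candidates are reached with probabilities 1/3, 1/3, and 1/5 (node C), and the walk stays at A with probability 2/15]

  • Explores the graph one node at a time, with replacement:

$$P^{MH}_{\upsilon,w} = \begin{cases} \frac{1}{k_\upsilon} \min\left(1, \frac{k_\upsilon}{k_w}\right) & \text{if } w \text{ is a neighbor of } \upsilon \\ 1 - \sum_{y \neq \upsilon} P^{MH}_{\upsilon,y} & \text{if } w = \upsilon \\ 0 & \text{otherwise} \end{cases}$$

  • In the stationary distribution:

$$\pi_\upsilon = \frac{1}{|V|}$$

  • Example for current node A:

$$P^{MH}_{AC} = \frac{1}{3} \cdot \frac{3}{5} = \frac{1}{5}, \qquad P^{MH}_{AA} = 1 - \left(\frac{1}{3} + \frac{1}{3} + \frac{1}{5}\right) = \frac{2}{15}$$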
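
The arithmetic in this example can be reproduced with a short sketch. It assumes node A has degree 3 and neighbor C has degree 5, as the slide's numbers imply. The other two neighbors are given placeholder names and degrees (any degree of at most 3 gives the same result), and `mh_move_probabilities` is an illustrative helper, not part of the original study.

```python
from fractions import Fraction


def mh_move_probabilities(k_current, neighbor_degrees):
    """Return ({neighbor: P_move}, P_stay) for one MHRW step from the current node."""
    moves = {
        w: Fraction(1, k_current) * min(Fraction(1), Fraction(k_current, k_w))
        for w, k_w in neighbor_degrees.items()
    }
    stay = 1 - sum(moves.values())
    return moves, stay


# A has degree 3; C has degree 5. "other1" and "other2" are placeholder
# neighbors with assumed degrees <= 3, so moves to them are always accepted.
moves, stay = mh_move_probabilities(3, {"C": 5, "other1": 2, "other2": 3})
print(moves["C"], stay)  # 1/5 2/15
```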

slide-26
SLIDE 26

Uniform userID Sampling (UNI)

  • As a basis for comparison, we collect a uniform sample of Facebook userIDs (UNI)
      • rejection sampling on the 32-bit userID space
  • UNI is not a general solution for sampling OSNs
      • the userID space must not be sparse
      • userIDs may be names instead of numbers
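
Rejection sampling on an ID space can be sketched as below: draw a candidate ID uniformly at random and keep it only if it belongs to a real user. The predicate `is_valid_user` is a hypothetical stand-in for whatever lookup the crawler performs, and the toy ID space is much smaller than 32 bits so the example runs instantly.

```python
import random


def uniform_user_sample(is_valid_user, n_samples, id_bits=32, rng=None):
    """Draw n_samples valid IDs uniformly (with possible repeats) by rejection sampling."""
    if rng is None:
        rng = random.Random()
    sample = []
    while len(sample) < n_samples:
        candidate = rng.randrange(2 ** id_bits)  # uniform over the whole ID space
        if is_valid_user(candidate):             # reject IDs that map to no user
            sample.append(candidate)
    return sample


# Toy setting (assumption): an 8-bit ID space in which every fourth ID is a user.
VALID_IDS = set(range(0, 256, 4))
sample = uniform_user_sample(lambda uid: uid in VALID_IDS, 50, id_bits=8,
                             rng=random.Random(0))
print(len(sample), all(uid in VALID_IDS for uid in sample))
```

The expected number of draws per accepted ID is the inverse of the fraction of valid IDs, which is why the method becomes impractical when the userID space is sparse.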

slide-27
SLIDE 27

Summary of Datasets

Sampling method    MHRW      RW        BFS       UNI
# Valid users      28×81K    28×81K    28×81K    984K
# Unique users     957K      2.19M     2.20M     984K

  • Egonets for a subsample of MHRW
      • local properties of nodes
  • Datasets available at: http://odysseas.calit2.uci.edu/research/osn.html

slide-28
SLIDE 28

Data Collection

Basic Node Information

  • What information do we collect for each sampled node u?

[Figure: for node u and each user in its friend list: UserID, Name, Networks (Regional, School/Workplace), Privacy settings (Profile Photo, Add as Friend, View Friends, Send Message; shown as bits 1 1 1 1)]

slide-29
SLIDE 29

Detecting Convergence

  • How many samples (iterations) are needed to lose dependence on the starting point?

slide-30
SLIDE 30

Online Convergence Diagnostics

Geweke

  • Detects convergence for a single walk. Let X be a sequence of samples for the metric of interest.

[Figure: a single walk with an early segment X_a and a late segment X_b]

$$z = \frac{E(X_a) - E(X_b)}{\sqrt{Var(X_a) + Var(X_b)}}$$

  • J. Geweke, “Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments”, in Bayesian Statistics 4, 1992
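
A minimal sketch of the z-score, taking the slide's formula literally with the plain sample mean and sample variance of each window. Two things here are assumptions: the window sizes (first 10% and last 50% of the walk, a common default), and the use of the raw sample variance. Geweke's original diagnostic estimates the variance of each window mean from the spectral density, so treat this as an illustration of the idea only.

```python
from math import sqrt
from statistics import mean, variance


def geweke_z(samples, first=0.1, last=0.5):
    """z = (E(X_a) - E(X_b)) / sqrt(Var(X_a) + Var(X_b)) for an early and a late window."""
    n = len(samples)
    n_a = max(2, round(first * n))  # at least two points so the variance is defined
    n_b = max(2, round(last * n))
    x_a = samples[:n_a]
    x_b = samples[n - n_b:]
    return (mean(x_a) - mean(x_b)) / sqrt(variance(x_a) + variance(x_b))


stable = [1, 3] * 10             # same behavior early and late
drifting = list(range(100))      # the metric keeps growing along the walk
print(geweke_z(stable), geweke_z(drifting))
```

A z-score near zero means the early and late windows agree, while a large magnitude means the walk still depends on where it started.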

slide-31
SLIDE 31

Online Convergence Diagnostics

Gelman-Rubin

  • Detects convergence for m > 1 walks

[Figure: three parallel walks, labeled Walk 1, Walk 2, Walk 3]

$$R = \left( \frac{n-1}{n} + \frac{m+1}{mn} \cdot \frac{B}{W} \right)$$

where $B$ is the between-walks variance and $W$ is the within-walks variance.

  • A. Gelman, D. Rubin, “Inference from iterative simulation using multiple sequences”, Statistical Science, Volume 7, 1992
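
The score R can be computed as sketched below. The slide does not define B and W in detail, so this uses the standard Gelman-Rubin definitions as an assumption: B is n times the sample variance of the per-walk means, and W is the average of the per-walk sample variances, for m walks of equal length n.

```python
from statistics import mean, variance


def gelman_rubin_r(walks):
    """R = (n-1)/n + (m+1)/(m*n) * B/W for m walks of equal length n."""
    m = len(walks)
    n = len(walks[0])
    walk_means = [mean(w) for w in walks]
    B = n * variance(walk_means)            # between-walks variance
    W = mean([variance(w) for w in walks])  # within-walks variance
    return (n - 1) / n + (m + 1) / (m * n) * B / W


mixed = [[1, 2, 3, 4], [4, 3, 2, 1], [2, 1, 4, 3]]   # walks agree on the mean
apart = [[0, 1, 0, 1], [10, 11, 10, 11]]             # walks stuck in different regions
print(gelman_rubin_r(mixed), gelman_rubin_r(apart))
```

When all walks explore the same region, B is small relative to W and R stays near 1. Walks that disagree on the mean push R far above 1.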

slide-32
SLIDE 32

When do we reach equilibrium?

[Figure: convergence diagnostics for node degree]

  • Burn-in determined to be 3K

slide-33
SLIDE 33

Methods Comparison

[Figure: node degree estimates over 28 crawls]

  • Poor performance for BFS, RW
  • MHRW, RWRW produce good estimates
      • per chain
      • overall

slide-34
SLIDE 34

Sampling Bias

BFS

  • Low-degree nodes are under-represented by two orders of magnitude
  • BFS is biased

slide-35
SLIDE 35

Sampling Bias

MHRW, RW, RWRW

  • Degree distribution identical to UNI (MHRW, RWRW)
  • RW is as biased as BFS, but with smaller variance in each walk

slide-36
SLIDE 36

Practical Recommendations for Sampling Methods

  • Use MHRW or RWRW. Do not use BFS or RW.
  • Use formal convergence diagnostics
      • assess convergence online
      • use multiple parallel walks
  • MHRW vs RWRW
      • RWRW has slightly better performance
      • MHRW provides a “ready-to-use” sample

slide-37
SLIDE 37

Outline

  • Motivation and Problem Statement
  • Sampling Methodology
  • Data Analysis
  • Conclusion

slide-38
SLIDE 38

FB Social Graph

Degree Distribution

Power law distribution: $f(k) = b \cdot k^{-a}$

  • Degree distribution is not a power law

[Figure: degree distribution with two fitted exponents, $a_1 = 1.32$ and $a_2 = 3.38$]
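
One simple way to estimate the exponent $a$ in $f(k) = b \cdot k^{-a}$ is a least-squares line fit on the log-log scale. The sketch below is illustrative only: it is not necessarily how the exponents on this slide were obtained, and the data is synthetic. Applying such a fit separately to two ranges of $k$ gives two exponents, which is one way to see that a single power law does not describe the whole distribution.

```python
from math import exp, log


def fit_power_law(ks, fs):
    """Fit f(k) = b * k**(-a) by least squares on (log k, log f); return (a, b)."""
    xs = [log(k) for k in ks]
    ys = [log(f) for f in fs]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return -slope, exp(intercept)


# Synthetic data drawn exactly from f(k) = 2 * k**(-1.5).
ks = [1, 2, 4, 8, 16]
fs = [2.0 * k ** -1.5 for k in ks]
a, b = fit_power_law(ks, fs)
print(round(a, 6), round(b, 6))
```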

slide-39
SLIDE 39

Any Comments & Critiques?

slide-40
SLIDE 40

Next Class: Graph Mining (II)

  • Do assigned readings before class
  • Submit reviews/critiques
  • Attend in-class discussions