DS504/CS586: Big Data Analytics, Graph Mining (Prof. Yanhua Li)



slide-1
SLIDE 1

Welcome to DS504/CS586: Big Data Analytics

Graph Mining

  • Prof. Yanhua Li
  • Time: 6:00pm–8:50pm R
  • Location: AK 233
  • Spring 2018

slide-2
SLIDE 2

Urban Sensing & Data Acquisition

Participatory Sensing, Crowd Sensing, Mobile Sensing

Data sources: Traffic, Road Networks, POIs, Air Quality, Human Mobility, Meteorology, Social Media, Energy

Urban Data Management

Spatio-temporal index, streaming, trajectory, and graph data management,...

Urban Data Analytics

Data Mining, Machine Learning, Visualization

Service Providing

Improve urban planning, Ease Traffic Congestion, Save Energy, Reduce Air Pollution, ...

Zheng, Y., et al. Urban Computing: Concepts, Methodologies, and Applications. ACM Transactions on Intelligent Systems and Technology.

Acquisition, Cleaning, Management

Big Graph Data Mining, Big Data Clustering, Recommender Systems, Applications

slide-3
SLIDE 3

Graph Data: Social Networks

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

Facebook social graph

4-degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011]

slide-4
SLIDE 4

Graph Data: Media Networks


Connections between political blogs

Polarization of the network [Adamic-Glance, 2005]

slide-5
SLIDE 5

Graph Data: Information Nets


Citation networks and Maps of science

[Börner et al., 2012]

slide-6
SLIDE 6

[Figure: the Internet drawn as routers connecting domain1, domain2, and domain3]

Graph Data: Communication Nets


slide-7
SLIDE 7

Questions?

Partial map of the Internet based on the January 15, 2005 data found on opte.org (from http://atheistuniverse.net/group/internet)

slide-8
SLIDE 8

Graph Data: Topological Networks


Seven Bridges of Königsberg

[Euler, 1735]

Return to the starting point by traveling each link of the graph once and only once.
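The bridge puzzle above can be checked mechanically. The sketch below uses Euler's criterion (a connected graph has a closed walk using every link exactly once if and only if every node has even degree). The land-mass labels A to D and the helper name `has_euler_circuit` are assumptions made for illustration, not names from the slides.

```python
from collections import defaultdict

def has_euler_circuit(edges):
    """Return True if the multigraph given by `edges` has an Eulerian circuit."""
    degree = defaultdict(int)
    adj = defaultdict(set)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        adj[u].add(v)
        adj[v].add(u)
    if not degree:
        return True
    # Connectivity check over the vertices that touch at least one edge.
    start = next(iter(degree))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    connected = len(seen) == len(degree)
    # Euler's criterion: connected and every degree even.
    return connected and all(d % 2 == 0 for d in degree.values())

# Seven bridges between four land masses (hypothetical labels A, B, C, D).
KONIGSBERG = [
    ("A", "B"), ("A", "B"),
    ("A", "C"), ("A", "C"),
    ("A", "D"), ("B", "D"), ("C", "D"),
]

print(has_euler_circuit(KONIGSBERG))
```

Every land mass in this encoding has odd degree (5, 3, 3, 3), so no such round trip exists, while a simple triangle passes the check.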

slide-9
SLIDE 9

Graph representation of networks

Link types: undirected links, directed links, signed links, multi-relational links, hyperlinks, …

Example relations: trust & distrust, repulsion & cohesion, friend & foe, following, one-way road, resistance, wireless channel, friendship, co-authorship, …

slide-10
SLIDE 10

Mining in Big Graphs

  • Network Statistic Analysis (this lecture)
      - Network size
      - Degree distribution
  • Node Ranking (next lecture)
      - Identifying most influential nodes
      - Viral marketing, resource allocation

slide-11
SLIDE 11

Graph Data: Social Networks


Facebook social graph

4-degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011]

slide-12
SLIDE 12

Sampling graphs

  • Random sampling (uniform & independent)
      - vertex sampling
      - edge sampling
  • Crawling
      - BFS sampling
      - random walk sampling

slide-13
SLIDE 13

Random Walks on Graphs

  • Random walk sampling
  • Random walk routing
  • Influence diffusion
  • Molecule in liquid

slide-14
SLIDE 14

Undirected Graphs

[Figure: example undirected graph on nodes 1 to 6]

Undirected!

slide-15
SLIDE 15

Random Walk

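  • Adjacency matrix A (symmetric, because the graph is undirected)
  • Degree matrix D
  • Transition probability matrix P
  • |E|: number of links
  • Stationary distribution π

Example: an undirected graph on nodes 1, 2, 3, 4 with degrees 3, 2, 3, 2.

$$D = \begin{pmatrix} 3 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 2 \end{pmatrix}, \qquad A = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix}$$

$$P = D^{-1} A = \begin{pmatrix} 0 & 1/3 & 1/3 & 1/3 \\ 1/2 & 0 & 1/2 & 0 \\ 1/3 & 1/3 & 0 & 1/3 \\ 1/2 & 0 & 1/2 & 0 \end{pmatrix}$$

$$P_{ij} = \begin{cases} 1/k_i & \text{if } j \text{ is a neighbor of } i \\ 0 & \text{otherwise} \end{cases} \qquad \pi_i = \frac{d_i}{2|E|}$$

Here k_i = d_i is the degree of node i.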
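As a quick numerical check of the stationary distribution, the sketch below builds the row-stochastic matrix P[i][j] = A[i][j] / k_i for the 4-node example and verifies that π_i = k_i / (2|E|) is unchanged by one step of the walk. The adjacency matrix is an assumption reconstructed from the degree sequence 3, 2, 3, 2 (the only simple graph on four nodes with those degrees); exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Adjacency matrix of the 4-node example (nodes 1..4 map to indices 0..3).
A = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
]

degrees = [sum(row) for row in A]      # degree k_i of each node
num_edges = sum(degrees) // 2          # |E|: every edge is counted twice

# Row-stochastic transition matrix: P[i][j] = A[i][j] / k_i.
P = [[Fraction(a, degrees[i]) for a in row] for i, row in enumerate(A)]

# Claimed stationary distribution: pi_i = k_i / (2|E|).
pi = [Fraction(k, 2 * num_edges) for k in degrees]

# One step of the walk applied to pi: (pi P)_j = sum_i pi_i * P[i][j].
pi_next = [sum(pi[i] * P[i][j] for i in range(4)) for j in range(4)]

print(pi, pi_next)
```

Because `pi_next` equals `pi`, a simple random walk visits nodes in proportion to their degree, which is the bias that the later sampling slides correct for.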

slide-16
SLIDE 16

Metropolis-Hastings Random Walk

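  • Adjacency matrix A (symmetric, because the graph is undirected)
  • Transition probability matrix P^{MH}
  • |E|: number of links
  • Stationary distribution π

Example: an undirected graph on nodes 1, 2, 3, 4 with degrees 3, 2, 3, 2.

$$D = \begin{pmatrix} 3 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 2 \end{pmatrix}, \qquad A = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix}$$

$$P^{MH}_{\upsilon,w} = \begin{cases} \dfrac{1}{k_\upsilon} \min\left(1, \dfrac{k_\upsilon}{k_w}\right) & \text{if } w \text{ is a neighbor of } \upsilon \\ 1 - \sum_{y \neq \upsilon} P^{MH}_{\upsilon,y} & \text{if } w = \upsilon \end{cases}$$

$$P^{MH} = \begin{pmatrix} 0 & 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 & 0 \\ 1/3 & 1/3 & 0 & 1/3 \\ 1/3 & 0 & 1/3 & 1/3 \end{pmatrix}$$

$$\pi_\upsilon = \frac{1}{|V|}$$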
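The Metropolis-Hastings transition matrix for the same 4-node example can be computed directly from the rule above. This is a minimal sketch: the adjacency matrix is an assumption reconstructed from the degree sequence 3, 2, 3, 2, and `mh_matrix` is a hypothetical helper name.

```python
from fractions import Fraction

# Adjacency matrix of the 4-node example (nodes 1..4 map to indices 0..3).
A = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
]

def mh_matrix(adj):
    """Metropolis-Hastings transition matrix targeting the uniform distribution."""
    n = len(adj)
    k = [sum(row) for row in adj]
    P = [[Fraction(0)] * n for _ in range(n)]
    for v in range(n):
        for w in range(n):
            if v != w and adj[v][w]:
                # Propose w with probability 1/k_v, accept with min(1, k_v/k_w).
                P[v][w] = Fraction(1, k[v]) * min(Fraction(1), Fraction(k[v], k[w]))
        # Rejected proposals keep the walk at v (self-loop).
        P[v][v] = 1 - sum(P[v])
    return P

P_MH = mh_matrix(A)
uniform = [Fraction(1, 4)] * 4
after_one_step = [sum(uniform[v] * P_MH[v][w] for v in range(4)) for w in range(4)]

for row in P_MH:
    print(row)
```

All twelve non-zero entries come out as 1/3, and the uniform distribution is left unchanged by one step, consistent with π_υ = 1/|V|.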

slide-17
SLIDE 17

Minas Gjoka, UC Irvine Walking in Facebook 17

Walking in Facebook: A Case Study of Unbiased Sampling of OSNs

Minas Gjoka, Maciej Kurant ‡, Carter Butts, Athina Markopoulou UC Irvine, EPFL ‡

slide-18
SLIDE 18

Outline

  • Motivation and Problem Statement
  • Sampling Methodology
  • Data Analysis
  • Conclusion

slide-19
SLIDE 19

Online Social Networks (OSNs)

  • A network of declared friendships between users
  • Allows users to maintain relationships
  • Many popular OSNs with different focus
      - Facebook, LinkedIn, Flickr, …

[Figure: social graph on nodes A to H]

slide-20
SLIDE 20

Why Sample OSNs?

  • Representative samples desirable
      - study properties
      - test algorithms
      - we use node distribution in this study
  • Obtaining the complete dataset is difficult
      - companies usually unwilling to share data
      - tremendous overhead to measure all (~100TB for Facebook)

slide-21
SLIDE 21

Problem statement

  • Obtain a representative sample of users in a given OSN by exploration of the social graph.
      - in this work we sample Facebook (FB)
      - explore the graph using various crawling techniques

slide-22
SLIDE 22

Related Work

  • Graph traversal (BFS)
      - A. Mislove et al., IMC 2007
      - Y. Ahn et al., WWW 2007
      - C. Wilson et al., EuroSys 2009
  • Random walks (MHRW, RDS)
      - M. Henzinger et al., WWW 2000
      - D. Stutzbach et al., IMC 2006
      - A. Rasti et al., Mini Infocom 2009

slide-23
SLIDE 23

Outline

  • Motivation and Problem Statement
  • Sampling Methodology
      - crawling methods
      - data collection
      - convergence evaluation
      - method comparisons
  • Data Analysis
  • Conclusion

slide-24
SLIDE 24

(1) Breadth-First-Search (BFS)

[Figure: social graph on nodes A to H, with nodes marked unexplored, explored, or visited]

  • Starting from a seed, BFS explores all neighbor nodes. The process continues iteratively, without replacement.
  • BFS leads to bias towards high-degree nodes
      - Lee et al., "Statistical properties of sampled networks", Physical Review E, 2006
  • Early measurement studies of OSNs use BFS as the primary sampling technique
      - e.g. [Mislove et al.], [Ahn et al.], [Wilson et al.]

Estimated probability distribution of a node property A:

$$p(A_i) = \frac{\sum_{u \in A_i} 1}{\sum_{u \in V} 1} = \frac{|A_i|}{|V|}$$

where A_i is the subset of sampled nodes with value i and V is the set of all sampled nodes.
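A BFS crawl and the unweighted estimator p(A_i) = |A_i| / |V| can be sketched as follows. The toy graph, the node budget, and the helper names `bfs_sample` and `empirical_distribution` are assumptions for illustration; they are not taken from the study.

```python
from collections import Counter, deque

def bfs_sample(graph, seed, budget):
    """Visit up to `budget` nodes in breadth-first order starting from `seed`."""
    discovered, order = {seed}, []
    queue = deque([seed])
    while queue and len(order) < budget:
        node = queue.popleft()
        order.append(node)
        for nbr in graph[node]:
            if nbr not in discovered:   # without replacement
                discovered.add(nbr)
                queue.append(nbr)
    return order

def empirical_distribution(sample, value_of):
    """Unweighted estimate p(A_i) = |A_i| / |V| over the sampled nodes."""
    counts = Counter(value_of(u) for u in sample)
    return {value: count / len(sample) for value, count in counts.items()}

# Hypothetical toy graph as adjacency lists.
GRAPH = {
    "A": ["B", "C", "D"],
    "B": ["A", "C"],
    "C": ["A", "B", "E"],
    "D": ["A"],
    "E": ["C"],
}

sample = bfs_sample(GRAPH, seed="A", budget=3)
degree_dist = empirical_distribution(sample, value_of=lambda u: len(GRAPH[u]))
print(sample, degree_dist)
```

In this toy run the three sampled nodes have degrees 3, 2, and 3, while both degree-1 nodes are missed, which illustrates in miniature how a truncated BFS over-represents well-connected nodes.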

slide-25
SLIDE 25

(2) Random Walk (RW)

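[Figure: social graph on nodes A to H; from the current node, each of its three neighbors is the next candidate with probability 1/3]

  • Explores the graph one node at a time, with replacement

$$P^{RW}_{\upsilon,w} = \frac{1}{k_\upsilon}$$

  • In the stationary distribution

$$\pi_\upsilon = \frac{k_\upsilon}{2 \cdot |E|}$$

where k_υ is the degree of node υ and |E| is the number of edges.

Estimated probability distribution of a node property A:

$$p(A_i) = \frac{\sum_{u \in A_i} 1}{\sum_{u \in V} 1} = \frac{|A_i|}{|V|}$$

where A_i is the subset of sampled nodes with value i and V is the set of all sampled nodes.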
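The degree bias of the simple random walk can be observed in a small simulation. This is a sketch under assumptions: the 4-node graph, the step count, and the fixed seed are chosen for illustration only.

```python
import random
from collections import Counter

# Hypothetical 4-node undirected graph with degrees 3, 2, 3, 2.
GRAPH = {1: [2, 3, 4], 2: [1, 3], 3: [1, 2, 4], 4: [1, 3]}

def random_walk(graph, start, steps, rng):
    """Simple random walk: each step moves to a uniformly chosen neighbor."""
    node, visits = start, []
    for _ in range(steps):
        node = rng.choice(graph[node])
        visits.append(node)
    return visits

def visit_frequencies(visits):
    """Fraction of steps spent at each visited node."""
    counts = Counter(visits)
    return {node: counts[node] / len(visits) for node in sorted(counts)}

num_edges = sum(len(nbrs) for nbrs in GRAPH.values()) // 2
# Stationary distribution from the slide: pi_v = k_v / (2|E|).
expected = {v: len(nbrs) / (2 * num_edges) for v, nbrs in GRAPH.items()}
observed = visit_frequencies(
    random_walk(GRAPH, start=1, steps=100_000, rng=random.Random(0))
)

print(expected)
print(observed)
```

The observed visit frequencies should land close to 0.3, 0.2, 0.3, 0.2, so using raw walk samples with the unweighted estimator over-counts high-degree nodes.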

slide-26
SLIDE 26

(3) Re-Weighted Random Walk (RWRW)

Hansen-Hurwitz estimator

  • Corrects for degree bias at the end of collection
  • Without re-weighting, the probability distribution for node property A is:

$$p(A_i) = \frac{\sum_{u \in A_i} 1}{\sum_{u \in V} 1} = \frac{|A_i|}{|V|}$$

  • Re-weighted probability distribution:

$$p(A_i) = \frac{\sum_{u \in A_i} 1/k_u}{\sum_{u \in V} 1/k_u}$$

where A_i is the subset of sampled nodes with value i, V is the set of all sampled nodes, and k_u is the degree of node u.
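The re-weighting step amounts to giving each sampled node the weight 1/k_u before normalizing. A minimal sketch, with a made-up two-node sample and the hypothetical helper name `reweighted_distribution`:

```python
from collections import defaultdict

def reweighted_distribution(sample, degree_of, value_of):
    """Hansen-Hurwitz style estimate: each sampled node u counts with weight 1/k_u."""
    weights = defaultdict(float)
    for u in sample:
        weights[value_of(u)] += 1.0 / degree_of(u)
    total = sum(weights.values())
    return {value: w / total for value, w in weights.items()}

# Hypothetical walk sample: node "a" (degree 4) was visited four times as often
# as node "b" (degree 1), as a degree-proportional walk would tend to do.
degrees = {"a": 4, "b": 1}
colors = {"a": "red", "b": "blue"}
sample = ["a", "a", "a", "a", "b"]

estimate = reweighted_distribution(sample, degrees.get, colors.get)
print(estimate)
```

Without re-weighting, "red" would be estimated at 0.8; with the 1/k_u weights both values come out at 0.5, matching a population in which one node has each color.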

slide-27
SLIDE 27

(4) Metropolis-Hastings Random Walk (MHRW)

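  • Explores the graph one node at a time, with replacement

$$P^{MH}_{\upsilon,w} = \begin{cases} \dfrac{1}{k_\upsilon} \min\left(1, \dfrac{k_\upsilon}{k_w}\right) & \text{if } w \text{ is a neighbor of } \upsilon \\ 1 - \sum_{y \neq \upsilon} P^{MH}_{\upsilon,y} & \text{if } w = \upsilon \end{cases}$$

  • In the stationary distribution

$$\pi_\upsilon = \frac{1}{|V|}$$

[Figure: social graph on nodes A to H; from the current node A, the next candidates are reached with probabilities 1/3, 1/5, and 1/3, and the walk stays at A with probability 2/15]

$$P^{MH}_{AC} = \frac{1}{3} \cdot \frac{3}{5} = \frac{1}{5}, \qquad P^{MH}_{AA} = 1 - \left(\frac{1}{3} + \frac{1}{3} + \frac{1}{5}\right) = \frac{2}{15}$$

Estimated probability distribution of a node property A:

$$p(A_i) = \frac{\sum_{u \in A_i} 1}{\sum_{u \in V} 1} = \frac{|A_i|}{|V|}$$

where A_i is the subset of sampled nodes with value i and V is the set of all sampled nodes.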
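The worked numbers on this slide (1/5 for the move from A to C and 2/15 for staying at A) can be reproduced from the transition rule. In the sketch below, A has degree 3 and C has degree 5 as in the slide's arithmetic; the other two neighbors, named X and Y here, and their degrees are assumptions chosen so that each is reached with probability 1/3.

```python
from fractions import Fraction

def mh_row(k_current, neighbor_degrees):
    """Return (move probability per neighbor, self-loop probability) for one node."""
    moves = {
        w: Fraction(1, k_current) * min(Fraction(1), Fraction(k_current, k_w))
        for w, k_w in neighbor_degrees.items()
    }
    # Rejected proposals leave the walk at the current node.
    stay = 1 - sum(moves.values())
    return moves, stay

# Current node A has degree 3; C has degree 5; X and Y are hypothetical
# neighbors with degree at most 3, so proposals to them are always accepted.
moves, stay = mh_row(3, {"C": 5, "X": 2, "Y": 3})
print(moves, stay)
```

The lower acceptance probability toward the high-degree node C is what removes the degree bias of the plain random walk.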

slide-28
SLIDE 28

Uniform userID Sampling (UNI)

  • As a basis for comparison, we collect a uniform sample of Facebook userIDs (UNI)
      - rejection sampling on the 32-bit userID space
  • UNI is not a general solution for sampling OSNs
      - the userID space must not be sparse
      - it does not apply when userIDs are names instead of numbers
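Rejection sampling over an ID space can be sketched as: draw a uniformly random ID, keep it only if it belongs to an existing user, and repeat. The `is_valid` predicate below is a stand-in assumption for whatever lookup tells you that a userID exists; it is not an actual Facebook API, and the 16-bit demo space is chosen only to keep the example small.

```python
import random

def uniform_userid_sample(is_valid, n, rng, id_bits=32):
    """Draw n uniformly random valid IDs by rejection sampling.

    Loops forever if no ID is valid, and becomes slow when the ID space is
    sparse, which is why this approach is not a general solution.
    """
    accepted, attempts = [], 0
    while len(accepted) < n:
        candidate = rng.getrandbits(id_bits)   # uniform over [0, 2**id_bits)
        attempts += 1
        if is_valid(candidate):                # reject IDs with no user behind them
            accepted.append(candidate)
    return accepted, attempts

# Hypothetical toy ID space: 16-bit IDs, of which every fourth one is in use.
def toy_is_valid(uid):
    return uid % 4 == 0

ids, attempts = uniform_userid_sample(toy_is_valid, n=100, rng=random.Random(1), id_bits=16)
print(len(ids), attempts)
```

The ratio of accepted IDs to attempts estimates how densely the ID space is populated, which makes the cost of a sparse space visible.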

slide-29
SLIDE 29

Summary of Datasets

Sampling method    MHRW      RW        BFS       UNI
# Valid users      28x81K    28x81K    28x81K    984K
# Unique users     957K      2.19M     2.20M     984K

  • Egonets for a subsample of MHRW
  • Local properties of nodes
  • Datasets available at: http://odysseas.calit2.uci.edu/research/osn.html

slide-30
SLIDE 30

Data Collection

Basic Node Information

  • What information do we collect for each sampled node u?
      - UserID
      - Name
      - Networks: Regional, School/Workplace
      - Privacy settings (four binary flags, e.g. 1 1 1 1): Profile Photo, Add as Friend, View Friends, Send Message
      - Friend list: UserID, Name, Networks, and Privacy settings of each friend

slide-31
SLIDE 31

Detecting Convergence

  • Number of samples (iterations) needed to lose dependence on the starting points?

slide-32
SLIDE 32

Online Convergence Diagnostics

Geweke

  • Detects convergence for a single walk. Let X be a sequence of samples for the metric of interest, X_a its first a = 10% and X_b its last b = 50%.

$$z = \frac{E(X_a) - E(X_b)}{\sqrt{Var(X_a) + Var(X_b)}}$$

  • Convergence is declared when z ∈ [−1, 1]
  • J. Geweke, "Evaluating the accuracy of sampling based approaches to calculate posterior moments", in Bayesian Statistics 4, 1992
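The Geweke check compares the mean of an early window of the walk with the mean of a late window. The sketch below follows the formula as written on the slide, using plain sample variances of the two windows (Geweke's original diagnostic estimates the variances of the two means from the spectral density, which this sketch does not attempt). The function name and the test sequences are assumptions.

```python
from math import sqrt
from statistics import mean, variance

def geweke_z(samples, first=0.1, last=0.5):
    """z-score comparing the first 10% and the last 50% of a sample sequence."""
    n = len(samples)
    xa = samples[: int(first * n)]        # early window X_a
    xb = samples[n - int(last * n):]      # late window X_b
    return (mean(xa) - mean(xb)) / sqrt(variance(xa) + variance(xb))

# Hypothetical sequences: one that fluctuates around a fixed level,
# and one that is still drifting away from its starting point.
settled = [1, 2] * 100
drifting = list(range(200))

print(geweke_z(settled), geweke_z(drifting))
```

The settled sequence gives z = 0, inside [−1, 1], while the drifting one gives a z-score far outside that range, signalling that the walk still depends on its starting point.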

slide-33
SLIDE 33

Online Convergence Diagnostics

Gelman-Rubin

  • Detects convergence for M > 1 walks (Walk 1, Walk 2, Walk 3, ...)

$$R = \sqrt{\frac{N-1}{N} + \frac{M+1}{MN} \cdot \frac{B}{W}}$$

where B is the between-walks variance, W is the within-walks variance, and N is the number of samples per walk.

  • Convergence is declared when R < 1.02
  • A. Gelman, D. Rubin, "Inference from iterative simulation using multiple sequences", Statistical Science, Volume 7, 1992
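The Gelman-Rubin statistic can be computed from M parallel walks of equal length N. The sketch below assumes the usual definitions, which the slide does not spell out: B is N times the sample variance of the per-walk means, W is the average of the per-walk sample variances, and R includes the square root of the usual potential scale reduction factor. The helper name and the toy walks are illustrative.

```python
from math import sqrt
from statistics import mean, variance

def gelman_rubin(walks):
    """R statistic for M equal-length walks of N samples each."""
    m, n = len(walks), len(walks[0])
    walk_means = [mean(walk) for walk in walks]
    # B: between-walks variance (assumed scaling: N * variance of the walk means).
    b = n * variance(walk_means)
    # W: within-walks variance (average of the per-walk sample variances).
    w = mean([variance(walk) for walk in walks])
    return sqrt((n - 1) / n + (m + 1) / (m * n) * b / w)

# Hypothetical walks: three that agree with each other, two that do not.
agreeing = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
disagreeing = [[0, 1, 0, 1], [10, 11, 10, 11]]

print(gelman_rubin(agreeing), gelman_rubin(disagreeing))
```

Walks that explore the same region give R below the 1.02 threshold, while walks stuck in different regions give a value well above it.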

slide-34
SLIDE 34

When do we reach equilibrium?

Burn-in determined to be 3K

Node Degree

slide-35
SLIDE 35

Methods Comparison

Node Degree

  • Poor performance for BFS, RW
  • MHRW, RWRW produce good estimates
      - per chain
      - overall

28 crawls

slide-36
SLIDE 36

Sampling Bias

BFS

  • Low-degree nodes under-represented by two orders of magnitude
  • BFS is biased

slide-37
SLIDE 37

Sampling Bias

MHRW, RW, RWRW

  • Degree distribution identical to UNI (MHRW, RWRW)
  • RW as biased as BFS, but with smaller variance in each walk

slide-38
SLIDE 38

Practical Recommendations for Sampling Methods

  • Use MHRW or RWRW. Do not use BFS, RW.
  • Use formal convergence diagnostics
      - assess convergence online
      - use multiple parallel walks
  • MHRW vs RWRW
      - RWRW: slightly better performance
      - MHRW: provides a "ready-to-use" sample

slide-39
SLIDE 39

FB Social Graph

Degree distribution vs. power-law distribution $f(k) = b \cdot k^{-a}$

  • Degree distribution is not a power law

[Figure: node degree distribution with two fitted exponents, a_1 = 1.32 and a_2 = 3.38]

slide-40
SLIDE 40

Outline

  • Motivation and Problem Statement
  • Sampling Methodology
  • Data Analysis
  • Conclusion

slide-41
SLIDE 41

Any Comments & Critiques?

slide-42
SLIDE 42

Next Week

  • 5 team presentations
  • 30 minutes per team, including the presentation and Q&A
  • We will have snacks and soft drinks

slide-43
SLIDE 43

Project 2 starts next week

  • KDD Cup 2018
  • Real data, worldwide competition
  • Will be announced March 1st, 2018
  • Define your own project or KDD Cup
  • http://www.kdd.org/News/view/kdd-cup-2018-call-for-proposals
  • http://www.kdd.org/kdd-cup
  • http://www.kdd.org/kdd2017/News/view/announcing-kdd-cup-2017-highway-tollgates-traffic-flow-prediction