Sampling Large Graphs: Algorithms and Applications

Don Towsley
College of Information & Computer Science, UMass Amherst
Collaborators: P.H. Wang, J.C.S. Lui, J.Z. Zhou, X. Guan

Measuring, analyzing large networks
- large networks can be represented by graphs
  - Facebook: 1+ billion
  - WWW: 50 billion
  - Twitter: 300 million
  - Ebay: 233 million
- curse of data dimensionality!

Challenges in measurement: information distortion
"World Map" in 1459 (image: www.flickr.com/)
- incomplete (Columbus et al. 1492; Australia, 17th century)
- wrong proportions (Africa & Asia)

Why do we want to understand these networks?
Want to understand or find out:
- how did these networks evolve?
- who are the influential users?
- how does influence propagate?
- communities in these networks?
- etc.
[Figure: high school friendship network]

Goals and challenges
Goals
- generate statistically valid characterization of network structure
- node pairs in this work
Challenges
- large networks
- correcting for biases

How to measure: sampling algorithms
Random sampling (uniform & independent)
- Node sampling
- Edge sampling
Crawling
- Breadth First sampling (BFS)
- Random walk sampling (RW)

Breadth first search sampling
- Orkut data set (Mislove 2007): 3M vertices, 200M edges
- BFS sampling highly biased
- difficult to remove bias
[Figure: CCDF, true distribution vs. BFS with depth = 3]

Random walk sampling
Bias removal?
- Markov model
- at steady state visits edges uniformly at random (edge sampling)
Model:
- $\theta_i$ = P[node degree = i]
- $\pi_i$ = P[visited degree = i]
- $\pi_i = \theta_i \cdot i / \text{avg degree}$, or equivalently $\theta_i = \text{Norm} \cdot \pi_i / i$
[Figure: CCDF of $\pi_i$ and $\theta_i$ under RW sampling]
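The correction from visited degrees back to node degrees can be sketched in a few lines. This is a minimal illustration, not the speaker's code: the toy graph `ADJ` and the function names are hypothetical, and the graph is assumed connected and non-bipartite so the walk reaches its steady state.

```python
import random
from collections import Counter

# Hypothetical toy graph (adjacency lists); connected and non-bipartite.
ADJ = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1], 3: [0], 4: [0]}

def random_walk_degrees(adj, start, steps, rng):
    """Walk `steps` steps from `start` and record the degree of each visited node."""
    node, degrees = start, []
    for _ in range(steps):
        node = rng.choice(adj[node])      # move to a uniformly random neighbor
        degrees.append(len(adj[node]))
    return degrees

def corrected_degree_distribution(visited_degrees):
    """Estimate theta_i from visited degrees via theta_i = Norm * pi_i / i."""
    n = len(visited_degrees)
    pi = {i: c / n for i, c in Counter(visited_degrees).items()}   # empirical pi_i
    unnormalized = {i: p / i for i, p in pi.items()}               # pi_i / i
    norm = 1.0 / sum(unnormalized.values())
    return {i: norm * w for i, w in unnormalized.items()}

rng = random.Random(0)
theta_hat = corrected_degree_distribution(random_walk_degrees(ADJ, 0, 100_000, rng))
print(sorted(theta_hat.items()))   # true values on this graph: {1: 0.4, 2: 0.4, 4: 0.2}
```

Without the division by i, the hub (degree 4) would appear in about 40% of the samples even though it is only 20% of the nodes.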

Node sampling vs. RW: Orkut
- RW: estimates tail well
- node sampling: estimates small degrees well
[Figure: two panels, random walk and node sampling; axes log(CCDF) vs. log(degree)]

Focus of talk
Measure node pair statistics: important for many applications!

Classification of node pairs
Classify node pair $[u, v]$ using shortest path
• 1-hop node pair class if distance(u, v) = 1
• 2-hop node pair class if distance(u, v) = 2
• …

Homophily
Homophily: tendency of users to connect to others with common interests.
P. Singla and M. Richardson. Yes, there is a correlation: from social networks to personal behavior on the web. In WWW 2008 (MSN).
- can infer characteristics and make recommendations
- compare homophily(u, v) between different node pair classes

Pair similarity: proximity
- Proximity(u, v): number of common neighbors of u and v; closeness of u and v
- knowing proximity distribution of node pairs important for
  - friendship prediction
  - interest recommendation
  - …

Pair similarity: distance
- Distance(u, v): length of shortest path between u and v in graph
- measure distance distribution of all node pairs to calculate
  - average distance
    • Twitter: 4.1
    • MSN: 6.6
  - effective diameter (the 90th percentile of all distances)
- small world
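On a graph small enough to enumerate, both statistics can be computed exactly, which gives a reference point for any sampled estimate. The sketch below is illustrative only: the path graph `PATH` and the function names are hypothetical, and the 90th percentile is taken as the smallest distance d such that at least 90% of pairs are within d (other percentile conventions exist).

```python
from collections import deque

# Hypothetical toy graph: a path 0 - 1 - 2 - 3 - 4.
PATH = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

def bfs_distances(adj, source):
    """Shortest-path length from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def pair_distances(adj):
    """Sorted distances over all unordered, connected node pairs."""
    out = []
    for u in adj:
        for v, d in bfs_distances(adj, u).items():
            if u < v:                      # count each unordered pair once
                out.append(d)
    return sorted(out)

def effective_diameter(sorted_distances):
    """Smallest d such that at least 90% of pairs have distance <= d."""
    k = -(-9 * len(sorted_distances) // 10)   # ceil(0.9 * number of pairs)
    return sorted_distances[k - 1]

distances = pair_distances(PATH)
average_distance = sum(distances) / len(distances)
print(average_distance, effective_diameter(distances))
```

One BFS per node is exactly the cost that becomes prohibitive on graphs with hundreds of millions of nodes.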

Problem formulation
- undirected graph $G = (V, E)$
- measure node pair characteristics in following sets:
  - all pairs: $S = \{[u, v] : u, v \in V,\ u \neq v\}$
  - one-hop pairs, pairs of connected nodes: $S^{(1)} = \{[u, v] : (u, v) \in E\}$
  - two-hop pairs, pairs of nodes with at least one common neighbor: $S^{(2)} = \{[u, v] : u, v \in V,\ u \neq v;\ \exists x \in V \text{ s.t. } (x, u), (x, v) \in E\}$

Problem formulation
- $F(u, v)$: similarity of node pair under study, e.g., # of common neighbors of $u, v$
- $\{a_1, \dots, a_K\}$: range of $F(u, v)$
- distribution of $F(u, v)$
  - $S$: $(\omega_1, \dots, \omega_K)$
  - $S^{(1)}$: $(\omega_1^{(1)}, \dots, \omega_K^{(1)})$
  - $S^{(2)}$: $(\omega_1^{(2)}, \dots, \omega_K^{(2)})$
- $\omega_k$, $\omega_k^{(1)}$, $\omega_k^{(2)}$: fractions of node pairs in $S$, $S^{(1)}$, $S^{(2)}$ with property $F(u, v) = a_k$

Challenges
- OSNs large
  - Facebook, Google+, Twitter, LinkedIn, …: $|V| > 500$ million users
  - huge number of node pairs, $|V|^2 > 10^{16}$
- topology not available ⇒ sampling required
- UVS (Uniform Vertex Sampling):
  • unbiased for $S$
  • sampling bias for $S^{(1)}$, $S^{(2)}$
  • sometimes UVS not allowed
- crawling (RW): sampling bias
- need to construct unbiased estimates

Node pair sampling based on UVS
Basic sampling techniques
- UVS: sample nodes from $V$ uniformly
- weighted vertex sampling (WVS): sample nodes from $V$ with desired probability distribution $(\pi_x : x \in V)$
  - independent WVS (IWVS) (if we have topology)
  - Metropolis-Hastings WVS (MHWVS) (if not): at each step, MHWVS selects a node $v$ using UVS and then accepts the sample with probability $\min(\pi_v / \pi_u, 1)$, where $u$ is previous sample; otherwise tries again
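A minimal sketch of the Metropolis-Hastings step, assuming only a uniform node sampler and a weight function proportional to the target distribution. The names `mhwvs` and `DEGREE` are hypothetical. One interpretation is made explicit here: "tries again" is read as the textbook Metropolis-Hastings rule, where a rejected proposal leaves the chain at the previous sample and that sample is recorded again. Recording only accepted proposals would not, in general, give the target distribution.

```python
import random
from collections import Counter

def mhwvs(nodes, weight, num_samples, rng):
    """Draw `num_samples` nodes whose long-run frequencies follow `weight`.

    nodes  : list of node ids that UVS can draw from
    weight : function v -> positive number proportional to the target pi_v
    """
    current = rng.choice(nodes)                 # initial node via UVS
    samples = []
    for _ in range(num_samples):
        candidate = rng.choice(nodes)           # UVS proposal
        accept = min(weight(candidate) / weight(current), 1.0)
        if rng.random() < accept:
            current = candidate                 # accept the proposal
        samples.append(current)                 # on rejection, repeat previous sample
    return samples

# Hypothetical target: probability proportional to degree.
DEGREE = {0: 4, 1: 2, 2: 2, 3: 1, 4: 1}
rng = random.Random(1)
samples = mhwvs(list(DEGREE), DEGREE.get, 50_000, rng)
freq = {v: c / len(samples) for v, c in Counter(samples).items()}
print(freq)   # target on this example: node 0 -> 0.4, nodes 1, 2 -> 0.2, nodes 3, 4 -> 0.1
```

Only the ratio of weights is needed, so the normalizing constant (here 2|E|) never has to be known.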

Node pair sampling based on UVS
All pairs $S$
- Sampling method: select two different nodes $u$ and $v$ uniformly at random
- Estimator: given sampled pairs $[u_i, v_i]$, $i = 1, \dots, n$,
  $\hat{\omega}_k = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(F(u_i, v_i) = a_k)$, $k = 1, \dots, K$
- Accuracy (unbiased): $E[\hat{\omega}_k] = \omega_k$, $k = 1, \dots, K$
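The all-pairs estimator is a plain sample mean of indicators. A small sketch, with a hypothetical toy graph `ADJ` and the number of common neighbors standing in for the similarity function:

```python
import random
from collections import Counter

# Hypothetical toy graph as adjacency lists.
ADJ = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1], 3: [0], 4: [0]}

def common_neighbors(adj, u, v):
    """Example similarity F(u, v): number of common neighbors."""
    return len(set(adj[u]) & set(adj[v]))

def estimate_all_pairs(adj, similarity, n, rng):
    """Fraction of n uniformly sampled pairs with each similarity value."""
    nodes = list(adj)
    counts = Counter()
    for _ in range(n):
        u, v = rng.sample(nodes, 2)          # two different nodes via UVS
        counts[similarity(adj, u, v)] += 1   # indicator 1(F(u_i, v_i) = a_k)
    return {a: c / n for a, c in counts.items()}

rng = random.Random(2)
omega_hat = estimate_all_pairs(ADJ, common_neighbors, 20_000, rng)
print(omega_hat)   # exact values on this graph: {0: 0.2, 1: 0.8}
```

In practice evaluating the similarity requires querying the neighbor lists of both sampled nodes, which is the per-sample cost on a real network.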

Node pair sampling based on UVS
One-hop pairs $S^{(1)}$
Sampling node pair $[u, v]$:
1) sample node $u$ according to probability distribution $(\pi_u^{(1)} : u \in V)$, where $\pi_u^{(1)} = \frac{d_u}{2|E|}$ and $d_u$ is the degree of node $u$
2) select neighbor $v$ of $u$ at random
Each $[u, v]$ sampled uniformly from $S^{(1)}$

Node pair sampling based on UVS
One-hop pairs $S^{(1)}$
- Estimator: given sampled pairs $[u_i, v_i]$, $i = 1, \dots, n$,
  $\hat{\omega}_k^{(1)} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(F(u_i, v_i) = a_k)$, $k = 1, \dots, K$
- Accuracy (unbiased): $E[\hat{\omega}_k^{(1)}] = \omega_k^{(1)}$, $k = 1, \dots, K$
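The one-hop procedure can be sketched as follows. For simplicity the degree-weighted draw is done directly from known degrees (the IWVS case); without topology, a Metropolis-Hastings sampler would take its place. The toy graph and all function names are hypothetical.

```python
import random
from collections import Counter

# Hypothetical toy graph as adjacency lists (5 edges).
ADJ = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1], 3: [0], 4: [0]}

def common_neighbors(adj, u, v):
    """Example similarity F(u, v): number of common neighbors."""
    return len(set(adj[u]) & set(adj[v]))

def sample_one_hop_pair(adj, nodes, degrees, rng):
    """Degree-weighted node, then a uniform neighbor: a uniform edge overall."""
    u = rng.choices(nodes, weights=degrees, k=1)[0]   # P(u) = d_u / (2|E|)
    v = rng.choice(adj[u])                            # uniform neighbor of u
    return u, v

def estimate_one_hop(adj, similarity, n, rng):
    """Fraction of n sampled one-hop pairs with each similarity value."""
    nodes = list(adj)
    degrees = [len(adj[u]) for u in nodes]
    counts = Counter()
    for _ in range(n):
        u, v = sample_one_hop_pair(adj, nodes, degrees, rng)
        counts[similarity(adj, u, v)] += 1
    return {a: c / n for a, c in counts.items()}

rng = random.Random(3)
omega1_hat = estimate_one_hop(ADJ, common_neighbors, 20_000, rng)
print(omega1_hat)   # exact values on this graph: {0: 0.4, 1: 0.6}
```

Uniformity follows because an edge can be reached from either endpoint: d_u/(2|E|) · 1/d_u + d_v/(2|E|) · 1/d_v = 1/|E|.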

Node pair sampling based on UVS
Two-hop pairs $S^{(2)}$
- sampling node pair $[u, v]$:
  1) sample node $x$
  2) select two neighbors $u$ and $v$ of node $x$ at random
- produces asymptotically unbiased estimate of $\omega_k^{(2)}$, $k = 1, \dots, K$
- tight convergence rate
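The slide does not state how node x is weighted or how the bias is corrected, so the following is only one plausible sketch under explicit assumptions: x is drawn with probability proportional to d_x(d_x − 1), which makes a pair appear with probability proportional to its number of common neighbors m(u, v); each sample is then reweighted by 1/m(u, v) and the weights are normalized, giving a ratio estimator that is asymptotically (not exactly) unbiased. The toy graph and all names are hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical toy graph: 4-cycle 0-1-2-3 plus the chord 0-2.
ADJ = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}

def common_neighbors(adj, u, v):
    """Number of common neighbors m(u, v); also used here as the similarity F."""
    return len(set(adj[u]) & set(adj[v]))

def sample_two_hop_pair(adj, nodes, wedge_weights, rng):
    """Assumed scheme: x with P(x) proportional to d_x (d_x - 1), then two distinct neighbors."""
    x = rng.choices(nodes, weights=wedge_weights, k=1)[0]
    u, v = rng.sample(adj[x], 2)
    return u, v

def estimate_two_hop(adj, similarity, n, rng):
    """Self-normalized estimate of the similarity distribution over two-hop pairs."""
    nodes = list(adj)
    wedge_weights = [len(adj[x]) * (len(adj[x]) - 1) for x in nodes]
    weighted = defaultdict(float)
    total = 0.0
    for _ in range(n):
        u, v = sample_two_hop_pair(adj, nodes, wedge_weights, rng)
        w = 1.0 / common_neighbors(adj, u, v)   # undo the bias toward large m(u, v)
        weighted[similarity(adj, u, v)] += w
        total += w
    return {a: s / total for a, s in weighted.items()}

rng = random.Random(5)
omega2_hat = estimate_two_hop(ADJ, common_neighbors, 20_000, rng)
print(omega2_hat)   # exact values on this graph: {1: 2/3, 2: 1/3}
```

On this graph the uncorrected sample would report half of the pairs as having two common neighbors, against a true fraction of one third.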

Node pair sampling based on RW
Why?
- UVS not available, too costly
  - API not provided
  - user IDs sparsely distributed
- only crawling techniques can be used
- random walk: walker moves to random neighbor, samples its information
- we saw for connected non-bipartite graph: $\pi_v = \frac{d_v}{2|E|}$, $v \in V$

Node pair sampling based on RW
All pairs $S$
- sample node pair $[u_i, v_i]$ by two independent RWs, where $u_i$, $v_i$ are nodes sampled by two RWs at step $i$
- node pair $[u, v]$ sampled according to stationary distribution $\pi_{[u,v]} = \frac{d_u d_v}{4|E|^2}$, $u, v \in V$
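Since pairs are visited with probability proportional to the product of the two degrees, one natural correction is to weight each sampled pair by the inverse of that product and normalize. The estimator itself is not given on this slide, so the sketch below is an assumption in the spirit of the degree correction for a single walk; steps where both walkers sit on the same node are skipped because all-pairs statistics require two different nodes. The toy graph and names are hypothetical, and the graph is assumed connected and non-bipartite.

```python
import random
from collections import defaultdict

# Hypothetical toy graph; connected and non-bipartite (triangle 0-1-2).
ADJ = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1], 3: [0], 4: [0]}

def common_neighbors(adj, u, v):
    """Example similarity F(u, v): number of common neighbors."""
    return len(set(adj[u]) & set(adj[v]))

def estimate_all_pairs_rw(adj, similarity, start_u, start_v, steps, rng):
    """Two independent walks; pairs reweighted by 1 / (d_u * d_v) and normalized."""
    u, v = start_u, start_v
    weighted = defaultdict(float)
    total = 0.0
    for _ in range(steps):
        u = rng.choice(adj[u])                  # first walker moves
        v = rng.choice(adj[v])                  # second walker moves independently
        if u == v:
            continue                            # not a pair of different nodes
        w = 1.0 / (len(adj[u]) * len(adj[v]))   # inverse of d_u * d_v
        weighted[similarity(adj, u, v)] += w
        total += w
    return {a: s / total for a, s in weighted.items()}

rng = random.Random(6)
omega_rw = estimate_all_pairs_rw(ADJ, common_neighbors, 0, 3, 100_000, rng)
print(omega_rw)   # exact values on this graph: {0: 0.2, 1: 0.8}
```

The constant 4|E|² cancels in the normalization, so the number of edges never needs to be known.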
