Estimating Clustering Coefficients and Size of Social Networks via Random Walk
Stephen J. Hardiman* (Capital Fund Management, France) and Liran Katzir (Advanced Technology Labs, Microsoft Research, Israel)
*Research was conducted while the author was unaffiliated.

Motivation: Social Networks Qzone Habbo Netlog Sonico.com Bebo Google+ Renren Twitter Flixster Facebook MyLife Classmates.com Tagged Friendster hi5 Sina Weibo Orkut Plaxo LinkedIn Vkontakte

Motivation: External access
[Figure: the online social network (example graph on nodes v1 to v9) accessed from outside by Social Analytics. Labels: Privacy, Disk Space, Communication.]

Task: Estimate parameters
Parameters: Global Clustering Coefficient; Network Average CC; Number of Registered Users.
Uses: predicting social products' potential; business development, advertisement, market size.

Global Clustering Coefficient
$c_g = \dfrac{3 \times \text{number of triangles}}{\text{number of connected triplets}}$
[Figure: example graph on nodes v1 to v9 with one triangle and one connected triplet highlighted.]
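The definition above can be checked by brute force on a small graph. The following is a minimal sketch, assuming the graph fits in memory as a dictionary of neighbour sets; the function name `global_cc` and the toy graph are illustrative and are not taken from the slides.

```python
from itertools import combinations

def global_cc(adj):
    """Exact global clustering coefficient of an undirected simple graph.

    adj maps each node to the set of its neighbours (assumed input format).
    """
    closed = 0    # triplets whose end nodes are adjacent; equals 3 x triangles
    triplets = 0  # connected triplets, counted once per centre node
    for nbrs in adj.values():
        for a, b in combinations(nbrs, 2):
            triplets += 1
            if b in adj[a]:
                closed += 1
    return closed / triplets if triplets else 0.0

# Toy graph (an assumption, not the slide's figure):
# triangle 1-2-3 plus a pendant node 4 attached to node 3.
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(global_cc(toy))  # 3 closed triplets out of 5 connected triplets
```

Counting closed triplets per centre node counts every triangle three times, which is where the factor 3 in the definition comes from.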

Global Clustering Coefficient
Exact: [Alon et al, 1997]
Estimation, input is read at least once:
• Random Access: [Avron, 2010]
• Streaming Model: [Buriol et al, 2006]
Estimation, sampling:
• Random Access: [Schank et al, 2005]
• External Access: This work.

Local Clustering Coefficient
$C_i = \dfrac{\#\text{connections between } v_i\text{'s neighbors}}{d_i (d_i - 1)/2}$, where $d_i$ is the degree of node $i$.
Example: $d_1 = 1$, $d_2 = 3$, $d_9 = 2$, $C_2 = 1/3$.
Network Average CC = average local CC.
[Figure: example graph on nodes v1 to v9.]
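A minimal sketch of the local coefficient and its network average, under the same assumed dictionary-of-sets representation. The names `local_cc` and `average_cc` are illustrative, and returning 0 for nodes of degree below 2 is a convention chosen here, not something stated on the slide.

```python
from itertools import combinations

def local_cc(adj, node):
    """Fraction of pairs of the node's neighbours that are themselves adjacent."""
    d = len(adj[node])
    if d < 2:
        return 0.0  # assumed convention: no pairs of neighbours exist
    links = sum(1 for a, b in combinations(adj[node], 2) if b in adj[a])
    return links / (d * (d - 1) / 2)

def average_cc(adj):
    """Network average CC: the mean of the local coefficients."""
    return sum(local_cc(adj, v) for v in adj) / len(adj)

# Toy graph (an assumption): triangle 1-2-3 plus pendant node 4 attached to 3.
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(local_cc(toy, 3), average_cc(toy))
```

On this toy graph node 3 has three neighbours with one link among them, so its local coefficient is 1/3, the same value as $C_2$ in the slide's example.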

Network Average CC
Exact: Naïve.
Estimation, input is read at least once:
• Streaming Model: [Becchetti et al, 2010]
Estimation, sampling:
• Random Access: [Schank et al, 2005]
• External Access: [Ribeiro et al 2010], [Gjoka et al, 2010], This work (improved accuracy).

Number of Registered Users
Exact: trivial.
Estimation, sampling:
• External Access: [Hardiman et al 2009], [Katzir et al, 2011], This work (improved accuracy).

Random Walk
Sampled nodes: v1, v2, v3, v4, v5
Stationary distribution: $\pi_i = d_i / \sum_{j} d_j$
[Figure: example graph on nodes v1 to v9, each node labelled with its stationary probability $d_i/22$.]
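The degree-proportional stationary distribution can be observed empirically. The sketch below simulates a simple random walk on a small hypothetical graph (not the slide's figure) and compares visit frequencies with $d_i / \sum_j d_j$; the function name `random_walk` and the fixed seed are assumptions made for reproducibility.

```python
import random
from collections import Counter

def random_walk(adj, start, steps, rng):
    """Simple random walk: each step moves to a uniformly random neighbour."""
    neighbours = {v: sorted(nbrs) for v, nbrs in adj.items()}  # lists for rng.choice
    walk = [start]
    for _ in range(steps):
        walk.append(rng.choice(neighbours[walk[-1]]))
    return walk

# Toy graph (an assumption): triangle 1-2-3 plus pendant node 4 attached to 3.
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
walk = random_walk(toy, start=1, steps=200_000, rng=random.Random(0))
total_degree = sum(len(nbrs) for nbrs in toy.values())
visits = Counter(walk)
for v in sorted(toy):
    # empirical visit frequency next to the predicted d_v / D
    print(v, round(visits[v] / len(walk), 3), len(toy[v]) / total_degree)
```

The graph must be connected and non-bipartite for the visit frequencies to converge to the stationary distribution; the triangle in the toy graph ensures the latter.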

Random Walk - Summary
[Figure: example graph on nodes v1 to v9. Legend: sampled nodes, visible nodes, invisible nodes, visible edges, invisible edges.]

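Global CC Algorithm
The estimated global clustering coefficient: $\hat{c}_g = \Phi_g / \Psi_g$, where $v_k$ is the $k$-th sampled node, $d_k$ its degree, and $r$ the number of samples.
1. $\Psi_g$: the sampled nodes' average of (degree $- 1$), $\Psi_g = \frac{1}{r} \sum_{k=1}^{r} (d_k - 1)$.
   $\phi_k = 1$ if there is an edge $v_{k-1} - v_{k+1}$, and $0$ otherwise; that is, $\phi_k = 1$ iff $v_{k-1}, v_k, v_{k+1}$ is a triangle.
2. $\Phi_g$: the sampled nodes' average of $\phi_k d_k$, $\Phi_g = \frac{1}{r-2} \sum_{k=2}^{r-1} \phi_k d_k$.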

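Global CC Example
Walk $v_1, v_2, v_3, v_4, v_5$ with $\phi_2 = 0$, $\phi_3 = 1$, $\phi_4 = 0$:
$\Phi_g = \frac{1}{3}(0 + 2 + 0) = \frac{2}{3}$
$\Psi_g = \frac{1}{5}(0 + 2 + 1 + 3 + 1) = \frac{7}{5}$
$\hat{c}_g = \frac{2}{3} \cdot \frac{5}{7} = \frac{10}{21} \approx 0.476$
Exact value: $c_g = \frac{9}{23} \approx 0.39$
[Figure: example graph with the walk highlighted.]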
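The estimator and the worked example can be sketched in a few lines. The 6-node graph below is hypothetical: it is chosen only so that the walk 1, 2, 3, 4, 5 has degrees 1, 3, 2, 4, 2 and a single triangle indicator at the third step, matching the numbers in the example; it is not the slide's actual figure. The function name `estimate_global_cc` is likewise an assumption.

```python
def estimate_global_cc(adj, walk):
    """Random-walk estimate Phi_g / Psi_g of the global clustering coefficient.

    adj maps each node to its neighbour set; walk is the list of sampled nodes.
    """
    r = len(walk)
    # Psi_g: average of (degree - 1) over all r sampled nodes
    psi = sum(len(adj[v]) - 1 for v in walk) / r
    # Phi_g: average of phi_k * d_k over the r - 2 interior steps
    phi_sum = 0
    for k in range(1, r - 1):
        if walk[k + 1] in adj[walk[k - 1]]:  # phi_k = 1: edge v_{k-1} - v_{k+1}
            phi_sum += len(adj[walk[k]])
    return (phi_sum / (r - 2)) / psi

# Hypothetical graph reproducing the example's degrees and indicators.
example = {
    1: {2},
    2: {1, 3, 4},
    3: {2, 4},
    4: {2, 3, 5, 6},
    5: {4, 6},
    6: {4, 5},
}
print(estimate_global_cc(example, [1, 2, 3, 4, 5]))  # 10/21
```

Only the degrees of the sampled nodes and adjacency tests between nodes two steps apart are needed, which is what makes the estimator usable under external access.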

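Expectation of $\phi_k$
$E[\phi_k d_k] = \sum_{i=1}^{n} \frac{d_i}{D}\, E[\phi_k d_k \mid x_k = v_i]$ (total expectation)
$= \sum_{i=1}^{n} \frac{d_i}{D} \cdot \frac{2 l_i}{d_i\, d_i} \cdot d_i$ (of the $d_i \cdot d_i$ combinations, $2 l_i$ yield $\phi_k = 1$)
$= \frac{1}{D} \sum_{i=1}^{n} 2 l_i$
$d_i$: the degree of node $v_i$. $D = \sum_{i=1}^{n} d_i$. $l_i$: the number of triangles containing $v_i$. $n$: the number of nodes.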

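Global CC Proof
$E[\Phi_g] = E[\phi_k d_k] = \frac{1}{D} \sum_{i=1}^{n} 2 l_i$
$E[\Psi_g] = \frac{1}{D} \sum_{i=1}^{n} d_i (d_i - 1)$
$\hat{c}_g = \dfrac{\Phi_g}{\Psi_g} \cong \dfrac{E[\Phi_g]}{E[\Psi_g]} = \dfrac{\sum_{i=1}^{n} 2 l_i}{\sum_{i=1}^{n} d_i (d_i - 1)} = c_g$, where $\cong$ follows from concentration bounds on $\Phi_g$ and $\Psi_g$.
$d_i$: the degree of node $v_i$. $D = \sum_{i=1}^{n} d_i$. $l_i$: the number of triangles containing $v_i$. $n$: the number of nodes.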

Guarantees
For any $\epsilon \le 1/8$ and $\delta \le 1$, we have
$\Pr\left[(1 - \epsilon)\, c_g \le \hat{c}_g \le (1 + \epsilon)\, c_g\right] \ge 1 - \delta$
when the number of samples, $r$, satisfies $r \ge r_g = O(\text{mixing time}(\epsilon))$.

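Network Average CC Algorithm
The estimated network average CC: $\hat{c}_l = \Phi_l / \Psi_l$
1. $\Psi_l$: the sampled nodes' average of $1/\text{degree}$, $\Psi_l = \frac{1}{r} \sum_{k=1}^{r} \frac{1}{d_k}$.
   $\phi_k = 1$ if there is an edge $v_{k-1} - v_{k+1}$, and $0$ otherwise.
2. $\Phi_l$: the sampled nodes' average of $\phi_k \frac{1}{d_k - 1}$, $\Phi_l = \frac{1}{r-2} \sum_{k=2}^{r-1} \phi_k \frac{1}{d_k - 1}$.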
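A minimal sketch of this estimator, assuming the walk is given as a list of sampled nodes and the graph as a dictionary of neighbour sets. The graph is a hypothetical 6-node example and the name `estimate_average_cc` is illustrative.

```python
def estimate_average_cc(adj, walk):
    """Random-walk estimate Phi_l / Psi_l of the network average CC."""
    r = len(walk)
    # Psi_l: average of 1/degree over all r sampled nodes
    psi = sum(1 / len(adj[v]) for v in walk) / r
    # Phi_l: average of phi_k / (d_k - 1) over the r - 2 interior steps
    phi_sum = 0.0
    for k in range(1, r - 1):
        if walk[k + 1] in adj[walk[k - 1]]:  # phi_k = 1: edge v_{k-1} - v_{k+1}
            # phi_k = 1 implies two distinct neighbours, so d_k >= 2 here
            phi_sum += 1 / (len(adj[walk[k]]) - 1)
    return (phi_sum / (r - 2)) / psi

# Hypothetical graph: the walk 1..5 has degrees 1, 3, 2, 4, 2
# and closes a triangle only at the third step.
example = {
    1: {2},
    2: {1, 3, 4},
    3: {2, 4},
    4: {2, 3, 5, 6},
    5: {4, 6},
    6: {4, 5},
}
print(estimate_average_cc(example, [1, 2, 3, 4, 5]))  # 20/31
```

The weights $1/d_k$ and $1/(d_k - 1)$ undo the degree bias of the walk, so low-degree and high-degree nodes contribute equally to the average.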

Evaluations
Network        n (size)     D/n      c_l      c_g
DBLP             977,987    8.457    0.7231   0.1868
Orkut          3,072,448    76.28    0.1704   0.0413
Flickr         2,173,370    20.92    0.3616   0.1076
Live Journal   4,843,953    17.69    0.3508   0.1179
DBLP facts: the paper with the most co-authors has 119 listed authors; the most prolific author is Vincent Poor, with 798 entries.

Global CC
[Figure: DBLP network. x-axis: percentage of mined nodes (0 to 2); y-axis: relative estimation value (0 to 3.5); curves: Gjoka et al*, Ribeiro et al*, This work.]
Relative improvement ranges between 300% and 500%, depending on the network.

Network Average CC
[Figure: Orkut network. x-axis: percentage of mined nodes (0 to 2); y-axis: relative estimation value (0 to 2.5); curves: Ribeiro et al, Gjoka et al, Random walk.]
Relative improvement ranges between 50% and 400%, depending on the network.

Conclusions
1. New external access estimator for the Global Clustering Coefficient.
2. Improved estimator for the Network Average Clustering Coefficient.
3. Improved estimator for the number of registered users.

Estimating Sizes of Social Networks via Biased Sampling
Oren Somekh, Liran Katzir, Edo Liberty (all Yahoo! Labs, Haifa, Israel)

The Birthday “Paradox”
The expected number of collisions in a list of $r$ i.i.d. samples from a set of $n$ elements is $\dfrac{r(r-1)}{2n}$.
A collision is a pair of identical samples.
Example: samples $X = (d, b, b, a, b, e)$. Total 3 collisions: $(x_2, x_3)$, $(x_2, x_5)$, and $(x_3, x_5)$.
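Counting collisions does not require comparing all pairs: a value seen $c$ times contributes $c(c-1)/2$ pairs. The sketch below applies this to the example list; the function name `count_collisions` is illustrative.

```python
from collections import Counter

def count_collisions(samples):
    """Number of unordered pairs of positions holding identical samples."""
    return sum(c * (c - 1) // 2 for c in Counter(samples).values())

print(count_collisions(["d", "b", "b", "a", "b", "e"]))  # 3
```

Here `b` occurs three times and accounts for all three collisions; every other value occurs once and contributes none.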

Cardinality estimation: uniform samples
When $C$ collisions are observed, $n \cong \dfrac{r(r-1)}{2C}$.
Needs $r = O(\sqrt{n})$ samples to converge.
Used by [Ye et al, 2010] to estimate the size.
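Inverting the birthday formula gives a size estimate from uniform samples. The sketch below is a small simulation under assumed parameters (population size, sample count, and seed are arbitrary choices for illustration); `estimate_size_uniform` is a hypothetical name.

```python
import random
from collections import Counter

def estimate_size_uniform(samples):
    """Estimate n as r(r-1) / (2C) from uniform i.i.d. samples."""
    r = len(samples)
    collisions = sum(c * (c - 1) // 2 for c in Counter(samples).values())
    if collisions == 0:
        raise ValueError("no collisions observed; draw more samples")
    return r * (r - 1) / (2 * collisions)

rng = random.Random(0)   # fixed seed, chosen only for reproducibility
true_n = 10_000          # assumed population size for the simulation
samples = [rng.randrange(true_n) for _ in range(3_000)]
estimate = estimate_size_uniform(samples)
print(estimate)
```

The estimate is undefined until at least one collision has been seen, which is why the number of samples has to grow on the order of $\sqrt{n}$.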

Stationary distribution sampling
Sampled nodes: v5, v2, v5, v4, v2
Stationary distribution: $\pi_i = d_i / \sum_{j} d_j$
[Figure: example graph on nodes v1 to v9, each node labelled with its stationary probability $d_i/22$.]

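Cardinality estimation: stationary distribution samples
When $C$ collisions are observed, $n \cong \dfrac{\left(\sum_{k} d_{x_k}\right)\left(\sum_{k} \frac{1}{d_{x_k}}\right)}{2C}$.
Needs $r = O(\sqrt[4]{n} \log n)$ samples to converge when $d_i \sim \mathrm{Zipf}(n, 2)$.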

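Example: sampled nodes v5, v2, v5, v4, v2, giving $C = 2$ collisions.
$\sum_k d_{x_k} = 2 + 3 + 2 + 4 + 3 = 14$
$\sum_k \frac{1}{d_{x_k}} = \frac{1}{2} + \frac{1}{3} + \frac{1}{2} + \frac{1}{4} + \frac{1}{3} = \frac{23}{12}$
$\hat{n} = \dfrac{14 \cdot \frac{23}{12}}{2 \cdot 2} \approx 6.7$
[Figure: example graph on nodes v1 to v9.]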
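The degree-weighted estimator can be reproduced on the example's five samples. This is a minimal sketch: the function name `estimate_size_stationary` and the dictionary-based degree lookup are assumptions, while the sampled nodes and their degrees are the ones used in the example.

```python
from collections import Counter

def estimate_size_stationary(samples, degree):
    """Estimate n from degree-biased samples: (sum d)(sum 1/d) / (2C)."""
    collisions = sum(c * (c - 1) // 2 for c in Counter(samples).values())
    if collisions == 0:
        raise ValueError("no collisions observed; draw more samples")
    sum_deg = sum(degree[x] for x in samples)
    sum_inv = sum(1 / degree[x] for x in samples)
    return sum_deg * sum_inv / (2 * collisions)

samples = ["v5", "v2", "v5", "v4", "v2"]
degree = {"v2": 3, "v4": 4, "v5": 2}   # degrees as used in the example
print(round(estimate_size_stationary(samples, degree), 2))  # 6.71
```

The two collisions are the repeated visits to v5 and to v2.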

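Size Estimation Proof
For one sample $x$ from the stationary distribution, and for one pair of independent samples:
$E[d_x] = \sum_{i=1}^{n} d_i \frac{d_i}{D}$, $\quad E\left[\frac{1}{d_x}\right] = \sum_{i=1}^{n} \frac{1}{d_i} \frac{d_i}{D} = \frac{n}{D}$, $\quad \Pr[\text{collision}] = \sum_{i=1}^{n} \frac{d_i}{D} \frac{d_i}{D}$
With $r$ samples, the two sums scale by $r$ and $E[C]$ by $r(r-1)/2$, so
$\hat{n} = \dfrac{\left(\sum_k d_{x_k}\right)\left(\sum_k \frac{1}{d_{x_k}}\right)}{2C} \cong \dfrac{E\left[\sum_k d_{x_k}\right] E\left[\sum_k \frac{1}{d_{x_k}}\right]}{2 E[C]} = \dfrac{r^2 \left(\sum_i \frac{d_i^2}{D}\right) \frac{n}{D}}{r(r-1) \sum_i \frac{d_i^2}{D^2}} = \dfrac{r}{r-1}\, n \approx n$,
where $\cong$ follows from concentration bounds.
$d_i$: the degree of node $v_i$. $D = \sum_{i=1}^{n} d_i$. $n$: the number of nodes.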

Improvements 1. Using all samples (Hardiman et al 2009). 2. Using Conditional Monte Carlo (This work).

All Samples
Restrict the computation to index pairs at least $m$ steps apart: $I = \{(k, l) : |k - l| \ge m\}$.
A collision is only considered within $I$: $\Phi = \sum_{(k,l) \in I} 1_{\{x_k = x_l\}}$.
The ratio of degrees is defined similarly: $\Psi = \sum_{(k,l) \in I} \dfrac{d_{x_k}}{d_{x_l}}$.
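A minimal sketch of the restricted sums follows. It assumes the size estimate is then taken as $\Psi / \Phi$, by analogy with the collision-based estimator; the slide defines only $\Phi$ and $\Psi$, so that final ratio, the function name `estimate_size_all_pairs`, and the choice of gap in the usage line are assumptions.

```python
def estimate_size_all_pairs(samples, degree, m):
    """Size estimate from all ordered index pairs at least m steps apart.

    Phi counts collisions and Psi sums degree ratios over the same pairs;
    returning Psi / Phi is an assumption made in this sketch.
    """
    r = len(samples)
    phi = 0
    psi = 0.0
    for k in range(r):
        for l in range(r):
            if abs(k - l) >= m:
                if samples[k] == samples[l]:
                    phi += 1
                psi += degree[samples[k]] / degree[samples[l]]
    if phi == 0:
        raise ValueError("no collisions among pairs at least m steps apart")
    return psi / phi

samples = ["v5", "v2", "v5", "v4", "v2"]
degree = {"v2": 3, "v4": 4, "v5": 2}
print(estimate_size_all_pairs(samples, degree, m=1))
```

The double loop is quadratic in the number of samples; it is written this way for clarity, not efficiency. A larger `m` discards pairs that are close together on the walk and therefore strongly correlated.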

Conditional Monte Carlo
A collision between $x_k$ and $x_l$ is replaced by the conditional expectation of a collision at steps $k+1$ and $l+1$, respectively:
$E\left[1_{\{x_{k+1} = x_{l+1}\}} \mid x_k, x_l\right] = \dfrac{\#\text{common neighbors of } x_k \text{ and } x_l}{d_{x_k}\, d_{x_l}}$

Conditional Monte Carlo
• The pair $(v_4, v_7)$ is not a collision, but it contributes $\frac{1}{12}$ to the collision counter.
[Figure: example graph on nodes v1 to v9.]
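The conditional contribution of a pair is the number of common neighbours divided by the product of the two degrees. The sketch below uses a hypothetical graph in which one node has degree 4, the other degree 3, and they share one neighbour, which gives the 1/12 quoted above; the node names and `conditional_collision` are illustrative, not the slide's figure.

```python
def conditional_collision(adj, u, v):
    """Probability that independent next steps from u and v land on the same node."""
    common = len(adj[u] & adj[v])
    return common / (len(adj[u]) * len(adj[v]))

# Hypothetical graph: p has degree 4, q has degree 3, and they share neighbour s.
adj = {
    "p": {"a", "b", "c", "s"},
    "q": {"s", "e", "f"},
    "s": {"p", "q"},
    "a": {"p"}, "b": {"p"}, "c": {"p"},
    "e": {"q"}, "f": {"q"},
}
print(conditional_collision(adj, "p", "q"))  # 1/12
```

Because every pair with a common neighbour now contributes a fractional amount, the collision counter receives many small updates instead of rare unit increments.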

Size Estimation
[Figure: DBLP network. x-axis: percentage of mined nodes (0.5 to 2.5); y-axis: relative estimation value (0 to 2.5); curves: Prior art, This work.]

Thanks
