Estimating Sizes of Social Networks via Biased Sampling
Liran Katzir, Edo Liberty, and Oren Somekh Yahoo! Labs, Haifa, Israel International World Wide Web Conference, 28th March - 1st April 2011, Hyderabad, India
Yahoo! Labs: WWW’2011 1 / 20
Social Network size estimation
Goal: Obtain estimates for the sizes of (sub)populations in a social network.
Why:
Advertising: estimate market share.
Business development: merger/acquisition or asset valuation.
The Problem
Difficulties:
Social networks have become very large:
Facebook (650,000,000), Qzone (200,000,000), Twitter (175,000,000), ...
There is no public API for population size queries such as:
What is the total number of registered users?
What is the number of registered (self-declared) 20–30 year olds living in New York?
Even if a public API is provided, an independent estimate is needed.
An exhaustive crawl is time/space/communication intensive and violates “politeness”.
Population size estimation
Population sizes can be estimated efficiently using the “birthday paradox”.
The “birthday paradox”: Given r uniform samples from a set of n elements, the expected number of collisions is $\binom{r}{2}\frac{1}{n} \approx \frac{r^2}{2n}$.
A collision is a pair of identical samples.
Example: Samples: $X = (d, b, b, a, b, e)$. Total 3 collisions: $(x_2, x_3)$, $(x_2, x_5)$, and $(x_3, x_5)$.
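As a small illustration of the collision definition above, the following Python sketch counts colliding pairs in a sample list. The function names `count_collisions` and `colliding_pairs` are hypothetical helpers chosen for this example, not part of the talk.

```python
from collections import Counter
from itertools import combinations

def count_collisions(samples):
    """Number of unordered pairs (i, j), i < j, with samples[i] == samples[j]."""
    # A value seen k times contributes k * (k - 1) / 2 colliding pairs.
    return sum(k * (k - 1) // 2 for k in Counter(samples).values())

def colliding_pairs(samples):
    """1-based index pairs of identical samples (quadratic; for small inputs)."""
    return [(i + 1, j + 1)
            for (i, a), (j, b) in combinations(enumerate(samples), 2)
            if a == b]

X = ["d", "b", "b", "a", "b", "e"]
print(count_collisions(X))   # 3
print(colliding_pairs(X))    # [(2, 3), (2, 5), (3, 5)]
```

Counting through value frequencies avoids comparing all pairs, which matters once r reaches the tens of thousands.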
Population size estimation
Using the birthday paradox inversely: when observing C collisions, the population size can be estimated by $n \simeq \frac{r^2}{2C}$.
If $r = \mathrm{const} \cdot \sqrt{n}$, this gives a rather good estimator.
This is similar (and essentially equivalent) to mark-and-recapture, which counts collisions between two sample sets.
Newer versions of mark-and-recapture also handle non-uniform but a priori known distributions [Chao, 1987].
Social network size estimation: [Ye and Wu, 2010].
Alas, we cannot sample users uniformly from most social networks...
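For the uniform case, the inverse birthday estimate $r^2 / (2C)$ can be sketched in Python as follows. This is an illustrative simulation under assumed parameters (a population of 10,000 integer IDs, a fixed seed, and r = 10√n); `birthday_estimate` is a hypothetical helper name.

```python
import random
from collections import Counter

def birthday_estimate(samples):
    """Estimate the population size as r^2 / (2C) from uniform samples."""
    r = len(samples)
    collisions = sum(k * (k - 1) // 2 for k in Counter(samples).values())
    if collisions == 0:
        return None  # no collision observed yet: the estimate is undefined
    return r * r / (2 * collisions)

# Assumed toy setting: n user IDs sampled uniformly with replacement.
rng = random.Random(0)
n = 10_000
r = 10 * int(n ** 0.5)
samples = [rng.randrange(n) for _ in range(r)]
estimate = birthday_estimate(samples)
print(estimate)  # should land in the vicinity of n; varies with the seed
```

With r proportional to √n, the expected number of collisions is a constant (about 50 here), which is what makes the estimate reasonably stable.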
Uniform distribution on graphs
A social network can be viewed as an undirected graph that we can traverse using its public API.
Special random walks can generate close-to-uniform samples:
1. Bipartite query-Web page graph [Bharat and Broder, 1998], [Bar-Yossef and Gurevich, 2007].
2. Social network [Gjoka et al., 2010].
This uses only $r = \mathrm{const} \cdot \sqrt{n}$ samples, but obtaining each sample might be hard.
Graph size estimation
It is possible to estimate the size of some graphs directly.
1. Estimate the size of a tree [Knuth, 1974].
2. Estimate the size of a directed acyclic graph [Pitt, 1987].
We give an estimator for the size of undirected graphs (and subgraphs) which:
1. Counts collisions but uses the graph’s stationary distribution (does not require a uniform sample).
2. Requires asymptotically fewer than $\sqrt{n}$ samples to converge.
3. Obtains samples efficiently (provably small number of random walk steps).
Assumptions
The graph can be traversed from nodes to neighboring nodes.
We can perform a random walk on the graph:
Start at any node.
In each step, proceed to one of the neighbors uniformly at random.
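The random walk described above can be sketched in a few lines of Python. The `neighbors` callable stands in for a social network's friend-list API and `TOY_GRAPH` is a made-up four-node graph; both are assumptions for illustration only.

```python
import random

def random_walk(neighbors, start, steps, rng=random):
    """Yield the nodes visited by a simple random walk of `steps` steps.

    `neighbors(node)` must return a non-empty list of adjacent nodes.
    """
    node = start
    for _ in range(steps):
        node = rng.choice(neighbors(node))  # uniform choice among neighbors
        yield node

# Hypothetical toy graph: a triangle a-b-c with a pendant node d attached to c.
TOY_GRAPH = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}

walk = list(random_walk(TOY_GRAPH.__getitem__, "a", 10, random.Random(1)))
print(walk)
```

Passing the neighbor lookup as a function keeps the walk independent of how the graph is stored or fetched.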
Facts about random walks
This random walk yields the stationary distribution.
1. The probability of getting the i-th node is $d_i / D$.
2. $d_i$ is the i-th node’s degree.
3. $D = \sum_{i=1}^{n} d_i$.
Taking a few steps between samples, or running several walks, ensures (approximate) independence between two consecutive samples.
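The degree-proportional stationary distribution can be checked numerically on a small graph. The sketch below pushes a probability vector through the random-walk step repeatedly and compares the result with $d_i / D$; the graph is a made-up example (connected and non-bipartite, so the walk converges), and `step` is a hypothetical helper.

```python
def step(dist, graph):
    """Apply one random-walk step to a probability distribution over nodes."""
    nxt = {v: 0.0 for v in graph}
    for v, p in dist.items():
        for u in graph[v]:
            nxt[u] += p / len(graph[v])  # mass splits evenly among neighbors
    return nxt

# Hypothetical toy graph: a triangle a-b-c with a pendant node d attached to c.
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}

dist = {v: 0.0 for v in graph}
dist["a"] = 1.0                      # start the walk at node a
for _ in range(200):
    dist = step(dist, graph)

D = sum(len(nb) for nb in graph.values())
stationary = {v: len(graph[v]) / D for v in graph}
for v in graph:
    print(v, round(dist[v], 6), stationary[v])
```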
Algorithm Outline
1. Sample r users using a random walk.
2. C: the number of collisions.
3. $\Psi_1$: the sum of the sampled nodes’ degrees.
4. $\Psi_{-1}$: the sum of the inverses of the sampled nodes’ degrees.
The estimated number of nodes: $\hat{n} = \frac{\Psi_1 \Psi_{-1}}{2C}$.
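A minimal Python sketch of the estimator outlined above, assuming the samples arrive as (node ID, degree) pairs. The function name `estimate_size` is hypothetical; the input is the six-sample sequence d, f, f, c, c, d with degrees 3, 2, 2, 4, 4, 3 from the Example slide. Exact fractions are used so the arithmetic can be checked by hand.

```python
from collections import Counter
from fractions import Fraction

def estimate_size(samples):
    """Return Psi_1 * Psi_{-1} / (2C) for a list of (node_id, degree) pairs."""
    counts = Counter(node for node, _ in samples)
    collisions = sum(k * (k - 1) // 2 for k in counts.values())
    if collisions == 0:
        return None  # estimator undefined until the first collision
    psi_1 = sum(deg for _, deg in samples)                     # sum of degrees
    psi_minus_1 = sum(Fraction(1, deg) for _, deg in samples)  # sum of 1/degree
    return psi_1 * psi_minus_1 / (2 * collisions)

samples = [("d", 3), ("f", 2), ("f", 2), ("c", 4), ("c", 4), ("d", 3)]
print(estimate_size(samples))  # 13/2
```

Here C = 3, $\Psi_1 = 18$, and $\Psi_{-1} = 26/12$, so the estimate is 6.5; the Example slide lists it as 6.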
Example
Input social network graph: [figure not recoverable from the slides]
Sampling process (one column per sample, in sampling order):

Sample                 1     2     3      4      5      6
Sampled node           d     f     f      c      c      d
Sampled node degree    3     2     2      4      4      3
C                      0     0     1      1      2      3
$\Psi_1$               3     5     7      11     15     18
$\Psi_{-1}$            1/3   5/6   16/12  19/12  22/12  26/12
$\hat{n}$              –     –     4      8      6      6

The $\hat{n}$ row is shown truncated to an integer; for instance, after six samples the exact value is $18 \cdot (26/12) / (2 \cdot 3) = 6.5$.
Proof Intuition
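Notation: n is the graph size, r is the number of samples, $d_i$ is the degree of node i, and $D = \sum_{i=1}^{n} d_i$.
Expectations:
$E[\Psi_1] = rD \sum_{i=1}^{n} \left(\frac{d_i}{D}\right)^2$, $E[\Psi_{-1}] = \frac{rn}{D}$, $E[C] = \binom{r}{2} \sum_{i=1}^{n} \left(\frac{d_i}{D}\right)^2$.
Ratio of expectations:
$\frac{E[\Psi_1] E[\Psi_{-1}]}{2 E[C]} = n \frac{r}{r-1} \simeq n$.
Hence:
$\hat{n} = \frac{\Psi_1 \Psi_{-1}}{2C} \simeq \frac{E[\Psi_1] E[\Psi_{-1}]}{2 E[C]} \simeq n$.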
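The ratio of expectations can be verified exactly for any degree sequence. The Python sketch below evaluates the three expectations with rational arithmetic and confirms that the ratio equals n · r / (r − 1); the degree list and r = 10 are arbitrary illustrative values, and `expectation_ratio` is a hypothetical helper.

```python
from fractions import Fraction

def expectation_ratio(degrees, r):
    """Return E[Psi_1] * E[Psi_{-1}] / (2 * E[C]) for the given degrees and r."""
    n = len(degrees)
    D = sum(degrees)
    s2 = sum(Fraction(d, D) ** 2 for d in degrees)  # sum of (d_i / D)^2
    e_psi_1 = r * D * s2                            # E[Psi_1]
    e_psi_minus_1 = Fraction(r * n, D)              # E[Psi_{-1}]
    e_c = Fraction(r * (r - 1), 2) * s2             # E[C] = (r choose 2) * s2
    return e_psi_1 * e_psi_minus_1 / (2 * e_c)

degrees = [2, 2, 3, 1]   # assumed toy degree sequence, n = 4
r = 10
print(expectation_ratio(degrees, r))  # 40/9
```

The degree-dependent term $\sum_i (d_i / D)^2$ cancels between $E[\Psi_1]$ and $E[C]$, which is why the ratio depends only on n and r.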
Analytic Results
Main statement: using $r(n, \epsilon, \delta)$ samples,
$\Pr[n(1 - \epsilon) \le \hat{n} \le n(1 + \epsilon)] \ge 1 - \delta$.
Uniform vs. biased:

Sampling method                                                  Number of samples
Any graph, uniform                                               $O(\sqrt{n})$
Synthetic graph, Zipfian degree distribution                     $O(\sqrt[4]{n} \log n)$
($\alpha = 2$, $d_m = \sqrt{n}$), random walk

Example, $n = 10^9$: $\sqrt{n} \approx 30{,}000$, while $\sqrt[4]{n} \log n \approx 6{,}000$.
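The orders of magnitude in this example can be reproduced with a few lines of Python. The base of the logarithm is not stated on the slide; base 2 is assumed here, and the constants hidden by the O() notation are ignored, so the numbers are only indicative.

```python
import math

n = 10**9

uniform_samples = math.sqrt(n)               # sqrt(n)
biased_samples = n ** 0.25 * math.log2(n)    # n^(1/4) * log n, base 2 assumed

print(round(uniform_samples))  # 31623
print(round(biased_samples))   # a little over 5,300 with a base-2 logarithm
```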
Setup
Networks of known sizes:

Network      Size         Edges
Synthetic    1,000,000    Zipfian, $\alpha = 2$, $d_m = 1000$
DBLP         845,211      co-authorship
IMDB         1,955,508    co-casting
A Synthetic Network, Degree Zipfian α = 2, dm = 1000
[Figure: Synthetic network, confidence interval. x-axis: number of samples (percentage of network size); y-axis: size estimation (relative to network size).]
DBLP - The Digital Bibliography and Library Project
[Figure: DBLP network, confidence interval. x-axis: number of samples (percentage of network size); y-axis: size estimation (relative to network size).]
IMDB - The Internet Movie Database
[Figure: IMDB, confidence interval. x-axis: number of samples (percentage of network size); y-axis: size estimation (relative to network size).]
Date                   April 2009              October 2010
Sampling method        uniform                 random walk
Number of samples      $0.98 \cdot 10^6$       $1 \cdot 10^6$
Collision estimator    $237 \cdot 10^6$        $475 \cdot 10^6$
Facebook report        $200$–$250 \cdot 10^6$  $500 \cdot 10^6$

Thanks to Minas Gjoka for the Facebook crawls.
Conclusions
An efficient algorithm for estimating the size of a social network through its public API was presented.
Its effectiveness was demonstrated on synthetic and real-world networks.
The algorithm outperforms prior methods by using biased sampling.
The algorithm also applies to sub-populations.