SLIDE 1

Estimating Sizes of Social Networks via Biased Sampling

Liran Katzir, Edo Liberty, and Oren Somekh
Yahoo! Labs, Haifa, Israel
International World Wide Web Conference, 28th March - 1st April 2011, Hyderabad, India

Yahoo! Labs: WWW’2011 1 / 20

SLIDE 2

Social Network size estimation

Goal: Obtain estimates for the sizes of (sub)populations in a social network.

Why:
Advertising - estimating market share.
Business development - mergers/acquisitions or asset valuation.

SLIDE 3

The Problem

Difficulties: Social networks have become very large:

Facebook (650,000,000) Qzone (200,000,000) Twitter (175,000,000) ...

No public API for population size queries.

What is the total number of registered users?
What is the number of registered (self-declared) 20–30 year olds living in New York?

Even if a public API is provided, an independent estimate is needed. An exhaustive crawl is time-, space-, and communication-intensive, and violates “politeness”.

SLIDE 4

Population size estimation

Population sizes can be estimated efficiently using the “birthday paradox”.

The “birthday paradox”: Given r uniform samples from a set of n elements, the expected number of collisions is $\frac{r(r-1)}{2n}$. A collision is a pair of identical samples.

Example: Samples: X = (d, b, b, a, b, e). Total of 3 collisions: $(x_2, x_3)$, $(x_2, x_5)$, and $(x_3, x_5)$.
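The collision count in the example can be reproduced with a minimal Python sketch. The function names `count_collisions` and `expected_collisions` are illustrative and do not come from the talk.

```python
from collections import Counter

def count_collisions(samples):
    """Count unordered pairs (i, j), i < j, with samples[i] == samples[j]."""
    counts = Counter(samples)
    # A value seen k times contributes k*(k-1)/2 colliding pairs.
    return sum(k * (k - 1) // 2 for k in counts.values())

def expected_collisions(r, n):
    """Expected number of collisions among r uniform samples from n elements."""
    return r * (r - 1) / (2 * n)

slide_example = ["d", "b", "b", "a", "b", "e"]
print(count_collisions(slide_example))  # 3, the three pairs of "b" samples
```

Counting by value frequency avoids comparing all pairs of samples, which matters once r grows to thousands of samples.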

SLIDE 5

Population size estimation

Using the birthday paradox inversely: when C collisions are observed, the population size can be estimated by $n \approx \frac{r^2}{2C}$. If $r = \text{const} \cdot \sqrt{n}$, this gives a rather good estimator.
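As a rough illustration of the inverse use, the sketch below draws uniform samples from a population of known size and applies the estimate r^2 / (2C). The population size `n_true`, the constant 10, and the random seed are assumptions chosen for the demonstration only.

```python
import random
from collections import Counter

def count_collisions(samples):
    """Count unordered pairs of identical samples."""
    counts = Counter(samples)
    return sum(k * (k - 1) // 2 for k in counts.values())

def estimate_population(samples):
    """Birthday-paradox estimate r^2 / (2C); None if no collision was observed."""
    r = len(samples)
    c = count_collisions(samples)
    if c == 0:
        return None  # the estimate is undefined without at least one collision
    return r * r / (2 * c)

rng = random.Random(0)
n_true = 100_000                 # hypothetical population size
r = 10 * int(n_true ** 0.5)      # r = const * sqrt(n), with const = 10 assumed
samples = [rng.randrange(n_true) for _ in range(r)]
estimate = estimate_population(samples)
print(estimate)
```

With these settings about 50 collisions are expected, so the estimate should typically land within a few tens of percent of `n_true`; a larger constant tightens it at the cost of more samples.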

Similar to mark-and-recapture, which counts collisions between two sample sets (but is essentially equivalent).
Newer versions of mark-and-recapture also handle non-uniform but a priori known distributions [Chao, 1987].
Social network size estimation: [Ye and Wu, 2010].

Alas, we cannot sample users uniformly from most social networks...

SLIDE 6

Uniform distribution on graphs

Social networks can be viewed as undirected graphs, which we can traverse using their public APIs. Special random walks can generate close-to-uniform samples:

1 Bipartite query-web page graph [Bharat and Broder, 1998], [Bar-Yossef and Gurevich, 2007].
2 Social network [Gjoka et al, 2010].

This uses only $r = \text{const} \cdot \sqrt{n}$ samples, but obtaining each sample might be hard.

SLIDE 7

Graph size estimation

It is possible to estimate the size of some graphs directly.

1 Estimate the size of a tree [Knuth, 1974].
2 Estimate the size of a directed acyclic graph [Pitt, 1987].

We give an estimator for the size of undirected graphs (and subgraphs) which:

1 Counts collisions but uses the graph’s stationary distribution (does not require a uniform sample).
2 Requires asymptotically fewer than $\sqrt{n}$ samples to converge.
3 Obtains samples efficiently (a provably small number of random walk steps).

SLIDE 8

Assumptions

The graph can be traversed from nodes to neighboring nodes. We can perform a random walk on the graph: start at any node; in each step, proceed to one of the current node’s neighbors, chosen uniformly at random.
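The walk described above can be sketched as follows. The `neighbors` callback stands in for a social network's "list friends" API call, and `toy_graph` is a made-up four-node example; both are assumptions for illustration.

```python
import random

def random_walk(neighbors, start, steps, rng=random):
    """Walk `steps` steps from `start`, moving to a uniformly random neighbor each time."""
    node = start
    path = [node]
    for _ in range(steps):
        node = rng.choice(neighbors(node))  # uniform choice among the neighbors
        path.append(node)
    return path

# Hypothetical undirected graph as an adjacency list (every edge listed both ways).
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b"],
    "d": ["b"],
}

path = random_walk(lambda v: toy_graph[v], "a", 10, random.Random(1))
print(path)
```

Only local access is needed: the walk never asks for the full node list, only for the neighbors of the node it currently occupies.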

SLIDE 9

Facts about random walks

This random walk yields the stationary distribution.

1 The probability of getting the i’th node is $\frac{d_i}{D}$.
2 $d_i$ is the i’th node’s degree.
3 $D = \sum_{i=1}^{n} d_i$.

Taking a few steps (or several walks) ensures independence between two consecutive samples.
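The degree-proportional stationary distribution can be checked exactly on a small example. The sketch below uses a hypothetical four-node undirected graph and exact fractions; it verifies only that one walk step leaves the distribution unchanged, and says nothing about how fast a walk converges to it.

```python
from fractions import Fraction

# Hypothetical undirected graph as an adjacency list (every edge listed both ways).
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b"],
    "d": ["b"],
}

degree = {v: len(nbrs) for v, nbrs in toy_graph.items()}
D = sum(degree.values())                              # sum of all degrees
pi = {v: Fraction(degree[v], D) for v in toy_graph}   # candidate: d_i / D

def step(dist):
    """Apply one random-walk step to a probability distribution over nodes."""
    out = {v: Fraction(0) for v in toy_graph}
    for u, p in dist.items():
        for v in toy_graph[u]:
            out[v] += p / degree[u]  # u forwards its mass equally to each neighbor
    return out

print(step(pi) == pi)  # True: the distribution d_i / D is unchanged by a step
```

Each neighbor u of v forwards (d_u / D) / d_u = 1 / D, and v has d_v neighbors, so v receives d_v / D in total.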

SLIDE 10

Algorithm Outline

1 Sample r users using a random walk.
2 C – the number of collisions.
3 $\Psi_1$ – the sum of the sampled nodes’ degrees.
4 $\Psi_{-1}$ – the sum of the inverse degrees of the sampled nodes.

The estimated number of nodes: $\hat{n} = \frac{\Psi_1 \Psi_{-1}}{2C}$.
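A minimal sketch of the outlined estimator follows, assuming the samples have already been collected by a random walk. The function name `estimate_graph_size`, the `degree_of` callback, and the toy degrees and samples are hypothetical and only illustrate the arithmetic.

```python
from collections import Counter

def estimate_graph_size(sampled_nodes, degree_of):
    """Return Psi_1 * Psi_{-1} / (2C), or None if no collision was observed."""
    counts = Counter(sampled_nodes)
    collisions = sum(k * (k - 1) // 2 for k in counts.values())  # C
    if collisions == 0:
        return None  # the estimate is undefined without a collision
    psi_1 = sum(degree_of(v) for v in sampled_nodes)            # sum of degrees
    psi_minus_1 = sum(1 / degree_of(v) for v in sampled_nodes)  # sum of inverse degrees
    return psi_1 * psi_minus_1 / (2 * collisions)

toy_degree = {"a": 2, "b": 3, "c": 2, "d": 1}  # hypothetical node degrees
samples = ["b", "a", "b", "d"]                  # hypothetical random-walk samples
estimate = estimate_graph_size(samples, toy_degree.get)
print(estimate)
```

On a regular graph, where every node has the same degree, the product of the two sums equals r^2 and the formula reduces to the uniform birthday-paradox estimate r^2 / (2C). With only four samples, as in this toy call, the result is very noisy.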

SLIDE 11

Example

[Figure: step-by-step sampling process on an input social network graph, tracking the sampled nodes, the sampled node degrees, C, $\Psi_1$, $\Psi_{-1}$, and $\hat{n}$.]

SLIDE 17

Example