Estimating Sizes of Social Networks via Biased Sampling. Liran Katzir, Edo Liberty, and Oren Somekh, Yahoo! Labs, Haifa, Israel. International World Wide Web Conference, 28th March - 1st April 2011, Hyderabad, India.


SLIDE 1

Estimating Sizes of Social Networks via Biased Sampling

Liran Katzir, Edo Liberty, and Oren Somekh
Yahoo! Labs, Haifa, Israel
International World Wide Web Conference, 28th March - 1st April 2011, Hyderabad, India

Yahoo! Labs: WWW’2011 1 / 20

SLIDE 2

Social Network size estimation

Goal: Obtain estimates for the sizes of (sub)populations in a social network.

Why:

  • Advertisement: estimate of market share.
  • Business development: merger/acquisition or asset valuation.

SLIDE 3

The Problem

Difficulties:

Social networks have become very large:

  • Facebook (650,000,000)
  • Qzone (200,000,000)
  • Twitter (175,000,000)
  • ...

No public API for population size queries:

  • What is the total number of registered users?
  • What is the number of registered (self-declared) 20–30 year olds living in New York?

Even if a public API is provided, an independent estimate is needed. An exhaustive crawl is time-, space-, and communication-intensive and violates “politeness”.

SLIDE 4

Population size estimation

Population sizes can be estimated efficiently using the “birthday paradox”.

The “birthday paradox”: given $r$ uniform samples from a set of $n$ elements, the expected number of collisions is $\frac{r(r-1)}{2n}$. A collision is a pair of identical samples.

Example: for the samples $X = (d, b, b, a, b, e)$ there are 3 collisions in total: $(x_2, x_3)$, $(x_2, x_5)$, and $(x_3, x_5)$.
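A minimal Python sketch of the collision count may make the definition concrete. The helper names `count_collisions` and `expected_collisions` are illustrative and do not come from the slides.

```python
from collections import Counter


def count_collisions(samples):
    """Count unordered pairs of identical samples."""
    counts = Counter(samples)
    # A value seen c times contributes c * (c - 1) / 2 colliding pairs.
    return sum(c * (c - 1) // 2 for c in counts.values())


def expected_collisions(r, n):
    """Expected number of collisions among r uniform samples from n elements."""
    return r * (r - 1) / (2 * n)


example = ["d", "b", "b", "a", "b", "e"]
print(count_collisions(example))  # 3, matching the example on the slide
```

Counting through a histogram avoids comparing all pairs of samples explicitly, which matters once the number of samples grows.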

SLIDE 5

Population size estimation

Using the birthday paradox inversely: when observing $C$ collisions, the population size can be estimated by $n \simeq \frac{r^2}{2C}$. If $r = \text{const} \cdot \sqrt{n}$, this gives a rather good estimator.

This is similar (and essentially equivalent) to mark-and-recapture, which counts collisions between two sample sets. Newer versions of mark-and-recapture also handle non-uniform but a priori known distributions [Chao, 1987]. For social network size estimation, see [Ye and Wu, 2010].
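The inverse use of the birthday paradox can be sketched in Python as follows. This is an illustration under assumed parameters: the function name `birthday_estimate`, the population of 100,000 ids, and the 10,000 samples are all hypothetical, and the samples are drawn uniformly, which is exactly the assumption that fails on real social networks.

```python
import random
from collections import Counter


def birthday_estimate(samples):
    """Estimate the population size as r^2 / (2C); None if no collisions."""
    r = len(samples)
    collisions = sum(c * (c - 1) // 2 for c in Counter(samples).values())
    if collisions == 0:
        return None  # too few samples to say anything
    return r * r / (2 * collisions)


# Hypothetical experiment: uniform samples from a population of 100,000 ids.
rng = random.Random(0)
n = 100_000
samples = [rng.randrange(n) for _ in range(10_000)]
print(birthday_estimate(samples))
```

With these numbers the expected collision count is roughly 500, so the estimate should typically land within several percent of the true size.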

Alas, we cannot sample users uniformly from most social networks...

SLIDE 6

Uniform distribution on graphs

Social networks can be viewed as undirected graphs, which we can traverse using their public APIs. Special random walks can generate close-to-uniform samples:

1. Bipartite query-web page graph [Bharat and Broder, 1998], [Bar-Yossef and Gurevich, 2007].
2. Social network [Gjoka et al., 2010].

This uses only $r = \text{const} \cdot \sqrt{n}$ samples, but obtaining each sample might be hard.

SLIDE 7

Graph size estimation

It is possible to estimate the size of some graphs directly:

1. Estimate the size of a tree [Knuth, 1974].
2. Estimate the size of a directed acyclic graph [Pitt, 1987].

We give an estimator for the size of undirected graphs (and subgraphs) which:

1. Counts collisions but uses the graph’s stationary distribution (does not require a uniform sample).
2. Requires asymptotically fewer than $\sqrt{n}$ samples to converge.
3. Obtains samples efficiently (a provably small number of random walk steps).

SLIDE 8

Assumptions

The graph can be traversed from nodes to neighboring nodes.

We can perform a random walk on the graph:

  • Start at any node.
  • In each step, proceed to one of the neighbors uniformly at random.
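The walk described above can be sketched in Python. The toy graph `GRAPH` and the function name `random_walk` are assumptions made for illustration; on a real network the neighbor lists would come from the site's public API rather than from an in-memory dictionary.

```python
import random

# Hypothetical toy graph as adjacency lists (undirected: each edge is listed twice).
GRAPH = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}


def random_walk(neighbors, start, steps, rng):
    """Return the list of nodes visited, beginning with `start`."""
    path = [start]
    for _ in range(steps):
        # Proceed to one of the current node's neighbors uniformly at random.
        path.append(rng.choice(neighbors[path[-1]]))
    return path


path = random_walk(GRAPH, "a", 10, random.Random(1))
print(path)
```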

SLIDE 9

Facts about random walks

This random walk yields the stationary distribution:

1. The probability of getting the i-th node is $\frac{d_i}{D}$.
2. $d_i$ is the degree of the i-th node.
3. $D = \sum_{i=1}^{n} d_i$.

Taking a few steps between samples (or running several walks) ensures independence between two consecutive samples.
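As a sanity check of the degree-proportional stationary distribution, the following Python sketch verifies on a small graph that the distribution with probabilities $d_i / D$ is unchanged by one step of the walk. The toy graph and the function names are hypothetical; exact fractions are used so the equality check is not affected by rounding.

```python
from fractions import Fraction

# Hypothetical toy graph as adjacency lists (undirected).
GRAPH = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}


def stationary(neighbors):
    """Return the distribution assigning each node probability d_i / D."""
    total_degree = sum(len(nbrs) for nbrs in neighbors.values())
    return {u: Fraction(len(nbrs), total_degree) for u, nbrs in neighbors.items()}


def one_step(neighbors, dist):
    """Push a distribution through one step of the simple random walk."""
    nxt = {u: Fraction(0) for u in neighbors}
    for u, nbrs in neighbors.items():
        for v in nbrs:
            # From u, each neighbor is chosen with probability 1 / d_u.
            nxt[v] += dist[u] / len(nbrs)
    return nxt


pi = stationary(GRAPH)
print(one_step(GRAPH, pi) == pi)  # True
```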

SLIDE 10

Algorithm Outline

1. Sample $r$ users using a random walk.
2. $C$: the number of collisions.
3. $\Psi_1$: the sum of the sampled nodes’ degrees.
4. $\Psi_{-1}$: the sum of the inverses of the sampled nodes’ degrees.

The estimated number of nodes: $\hat{n} = \frac{\Psi_1 \Psi_{-1}}{2C}$.
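The outline above can be sketched in Python. This is an illustration, not the authors' implementation: the function name `estimate_size` and the synthetic degree sequence are hypothetical, and the random walk is replaced by independent draws with probability proportional to degree, which is what the walk's stationary distribution would provide.

```python
import random
from collections import Counter


def estimate_size(sampled_nodes, degree):
    """Return Psi_1 * Psi_{-1} / (2C), or None when there are no collisions."""
    collisions = sum(c * (c - 1) // 2 for c in Counter(sampled_nodes).values())
    if collisions == 0:
        return None
    psi_1 = sum(degree[v] for v in sampled_nodes)
    psi_minus_1 = sum(1 / degree[v] for v in sampled_nodes)
    return psi_1 * psi_minus_1 / (2 * collisions)


# Hypothetical degree sequence: 2,000 nodes with degrees cycling through 1..10.
n = 2_000
degree = {v: 1 + v % 10 for v in range(n)}

# Stand-in for the random walk: draw nodes independently with probability
# proportional to degree (the stationary distribution).
rng = random.Random(0)
nodes = list(degree)
samples = rng.choices(nodes, weights=[degree[v] for v in nodes], k=1_000)
print(estimate_size(samples, degree))
```

Returning `None` when no collision has been observed avoids a division by zero and signals that more samples are needed.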

SLIDE 47

Example

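Sampling process:

| Sampled nodes | d | f | f | c | c | d |
|---|---|---|---|---|---|---|
| Sampled node degree | 3 | 2 | 2 | 4 | 4 | 3 |
| $C$ | | | 1 | 1 | 2 | 3 |
| $\Psi_1$ | 3 | 5 | 7 | 11 | 15 | 18 |
| $\Psi_{-1}$ | 1/3 | 5/6 | 16/12 | 19/12 | 22/12 | 26/12 |
| $\hat{n}$ | – | – | 4 | 8 | 6 | 6 |

[Figure: input social network graph]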
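The running values in this example can be reproduced with a short Python sketch. Only the sampled nodes and degrees are taken from the slide; the variable names are illustrative. Computing with exact fractions suggests that the slide's $\hat{n}$ values are the exact estimates rounded down.

```python
import math
from collections import Counter
from fractions import Fraction

# Sampled nodes and their degrees, as listed on the slide.
samples = [("d", 3), ("f", 2), ("f", 2), ("c", 4), ("c", 4), ("d", 3)]

seen = Counter()
collisions = 0
psi_1 = 0
psi_minus_1 = Fraction(0)
estimates = []

for node, deg in samples:
    # Each earlier occurrence of this node forms one new colliding pair.
    collisions += seen[node]
    seen[node] += 1
    psi_1 += deg
    psi_minus_1 += Fraction(1, deg)
    estimate = psi_1 * psi_minus_1 / (2 * collisions) if collisions else None
    estimates.append(estimate)

print(estimates[2:])  # exact values 14/3, 209/24, 55/8, 13/2
print([math.floor(e) for e in estimates[2:]])  # [4, 8, 6, 6]
```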

SLIDE 48

Proof Intuition

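Notation: $n$ is the graph size, $r$ the number of samples, $d_i$ the degree of node $i$, and $D = \sum_{i=1}^{n} d_i$.

Expectations:

$E[\Psi_1] = rD \sum_{i=1}^{n} \left(\frac{d_i}{D}\right)^2$, $E[\Psi_{-1}] = \frac{rn}{D}$, $E[C] = \binom{r}{2} \sum_{i=1}^{n} \left(\frac{d_i}{D}\right)^2$.

$\frac{E[\Psi_1] E[\Psi_{-1}]}{2 E[C]} = n \frac{r}{r-1} \simeq n$.

$\hat{n} = \frac{\Psi_1 \Psi_{-1}}{2C} \simeq \frac{E[\Psi_1] E[\Psi_{-1}]}{2 E[C]} \simeq n$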
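The identity between the ratio of expectations and $n \frac{r}{r-1}$ can be checked exactly on a small case. The Python sketch below uses a hypothetical degree sequence and sample count, assumes independent samples from the stationary distribution $d_i / D$, and evaluates the three expectations directly from their definitions.

```python
from fractions import Fraction
from math import comb

degrees = [2, 2, 3, 1]  # hypothetical degree sequence
n = len(degrees)
D = sum(degrees)
r = 5  # hypothetical number of samples

p = [Fraction(d, D) for d in degrees]  # stationary probabilities d_i / D

# Each sample contributes its degree (or inverse degree) in expectation.
e_psi_1 = r * sum(pi * d for pi, d in zip(p, degrees))
e_psi_minus_1 = r * sum(pi / d for pi, d in zip(p, degrees))
# Each of the comb(r, 2) sample pairs collides with probability sum(p_i^2).
e_c = comb(r, 2) * sum(pi * pi for pi in p)

ratio = e_psi_1 * e_psi_minus_1 / (2 * e_c)
print(ratio, Fraction(n * r, r - 1))  # 5 5
```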

SLIDE 49

Analytic Results

Main statement: using $r(n, \epsilon, \delta)$ samples,

$\Pr[n(1 - \epsilon) \le \hat{n} \le n(1 + \epsilon)] \ge 1 - \delta$

Uniform vs. biased:

| Sampling method | Number of samples |
|---|---|
| Any graph, uniform | $O(\sqrt{n})$ |
| Synthetic graph, Zipfian degree distribution ($\alpha = 2$, $d_m = \sqrt{n}$), random walk | $O(\sqrt[4]{n} \log n)$ |

Example, $n = 10^9$: $\sqrt{n} \approx 30{,}000$ and $\sqrt[4]{n} \log n \approx 6{,}000$.
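The orders of magnitude in this example can be checked with a few lines of Python. The slide does not state the base of the logarithm or the hidden constants, so both the natural and the base-2 logarithm are shown here as assumptions; the slide's figures read as rough roundings rather than exact values.

```python
import math

n = 10**9
sqrt_n = math.sqrt(n)
fourth_root = n ** 0.25

print(round(sqrt_n))                      # 31623, i.e. on the order of 30,000
print(round(fourth_root * math.log(n)))   # assuming a natural logarithm
print(round(fourth_root * math.log2(n)))  # assuming a base-2 logarithm
```

Under either assumption the biased-sampling bound is several times smaller than $\sqrt{n}$ at this network size.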

SLIDE 50

Setup

Networks of known sizes:

| Network | Size | Edges |
|---|---|---|
| Synthetic | 1,000,000 | Zipfian, $\alpha = 2$, $d_m = 1000$ |
| DBLP | 845,211 | co-authorship |
| IMDB | 1,955,508 | co-casting |

SLIDE 51

A Synthetic Network, Degree Zipfian α = 2, dm = 1000

[Figure: Synthetic network, confidence interval. x-axis: number of samples (percentage of network size). y-axis: size estimation (relative to network size). Curves: uniform distribution, non-unique, 95% and 5%; degree distribution, non-unique, 95% and 5%.]

SLIDE 52

DBLP - The Digital Bibliography and Library Project

[Figure: DBLP network, confidence interval. x-axis: number of samples (percentage of network size). y-axis: size estimation (relative to network size). Curves: uniform distribution, non-unique, 95% and 5%; degree distribution, non-unique, 95% and 5%.]

SLIDE 53

IMDB - The Internet Movie Database

[Figure: IMDB, confidence interval. x-axis: number of samples (percentage of network size). y-axis: size estimation (relative to network size). Curves: uniform distribution, non-unique, 95% and 5%; degree distribution, non-unique, 95% and 5%.]

SLIDE 54

Facebook

| Date | April 2009 | October 2010 |
|---|---|---|
| Sampling method | uniform | random walk |
| Number of samples | $0.98 \cdot 10^6$ | $1 \cdot 10^6$ |
| Collision estimator | $237 \cdot 10^6$ | $475 \cdot 10^6$ |
| Facebook report | $200 - 250 \cdot 10^6$ | $500 \cdot 10^6$ |

Thanks to Minas Gjoka for the Facebook crawls.

SLIDE 55

Conclusions

  • An efficient algorithm for estimating the size of a social network using its public API was presented.
  • Its effectiveness was demonstrated on synthetic and real-world networks.
  • The algorithm outperforms prior-art methods by using biased sampling.
  • The algorithm also applies to sub-populations.

SLIDE 56

Thanks!
