Distinct Value Estimators For Zipfian Distributions



SLIDE 1

Distinct Value Estimators For Zipfian Distributions

Sergei Vassilvitskii, Rajeev Motwani
Stanford University

SLIDE 2

Problem Statement

Given a large multiset X with n elements, count the number of distinct elements in X. Alternatively, given samples from a distribution P, estimate the 0-th frequency moment.

X = {a, b, c, a, a, c, b, a}  ⇒  Distinct(X) = 3
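
The example can be checked with an exact count (plain Python, my own illustration). Note that the exact approach needs a full pass over X and memory proportional to the number of distinct values, which is what the sampling-based estimators below avoid:

```python
# Exact distinct count for the slide's example multiset.
# Needs one full pass and O(Distinct(X)) memory.
X = ["a", "b", "c", "a", "a", "c", "b", "a"]
distinct = len(set(X))
print(distinct)  # 3
```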

SLIDE 3

Why Do We Care?

Good planning for SQL queries. Consider:

select * from R, S where R.A = S.B and f(S.C) > k

where f is expensive to compute.

  • If S.C has few distinct values, compute f on the distinct values first, cache the results, then join.
  • If S.C has many distinct values, compute the join first, then check the condition.

Orders of magnitude improvements.
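
The first plan can be sketched in Python (a hypothetical illustration, not from the talk; `expensive_f` and `filter_with_cache` are made-up names): when S.C has few distinct values, evaluating f once per distinct value and caching it dominates evaluating it once per row.

```python
def expensive_f(c):
    # Stand-in for a costly user-defined predicate f(S.C).
    return sum(i * i for i in range(1000)) + c

def filter_with_cache(column_values, k):
    # Evaluate f once per *distinct* value of S.C and reuse the
    # result; this pays off when the column has few distinct values.
    cache = {}
    kept = []
    for c in column_values:
        if c not in cache:
            cache[c] = expensive_f(c)
        if cache[c] > k:
            kept.append(c)
    return kept
```

With d distinct values among n rows, f runs d times instead of n; the distinct-value estimate is exactly what lets the optimizer predict which plan wins.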

SLIDE 4

Classical Problem

Different approaches:

  • Streaming input: minimize the space used.
  • Sample from the input: what guarantees on the approximation?

Given a sample of size r from X, find an approximation D̂ to Distinct(X).

SLIDE 5

Previous Work

Good-Turing Estimator: "The Population Frequencies of Species and the Estimation of Population Parameters," 1953.

Other heuristic estimators:

  • Smoothed Jackknife Estimator (Haas et al.)
  • Adaptive Estimator (Charikar et al.)
  • Many others

SLIDE 6

Previous Work - Theory

Given r samples from a set of size n:

Guaranteed Error Estimator (GEE) [CCMN]. Approximation ratio: O(√(n/r)).

Lower bound: there exist inputs such that, with constant probability, any estimator will have approximation ratio at least √((n − r)/(2r)).
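
As a rough sketch (my own code; the slide only names GEE and its ratio, so the exact formula below is as I recall it from [CCMN] and should be treated as an assumption): GEE scales the number of singletons in the sample by √(n/r) and counts repeated values as-is.

```python
import math

def gee(sample, n):
    # Hedged sketch of the Guaranteed Error Estimator: values seen
    # exactly once in the sample are scaled up by sqrt(n/r); values
    # seen two or more times are counted once each.
    r = len(sample)
    counts = {}
    for x in sample:
        counts[x] = counts.get(x, 0) + 1
    singletons = sum(1 for c in counts.values() if c == 1)
    repeated = sum(1 for c in counts.values() if c >= 2)
    return math.sqrt(n / r) * singletons + repeated
```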

SLIDE 7

Lower Bound Detail

Scenario 1: S = {x, x, ..., x, y1, ..., yk}, |S| = n
Scenario 2: S = {x, x, ..., x}, |S| = n

With k = ((n − r)/(2r))·ln(1/δ), after r samples one cannot distinguish between the two scenarios with probability at least δ.
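
A quick simulation (my own sketch; the parameter values and the helper name `sample_sees_rare` are made up) shows why the scenarios are hard to tell apart: a small sample almost never hits any of the k rare elements.

```python
import random

def sample_sees_rare(n, k, r, seed=0):
    # Scenario 1's multiset: n - k copies of the heavy element x
    # plus rare elements y1..yk. Encode it as n slots, where slots
    # 0..k-1 stand for the rare elements; a uniform slot draw is a
    # uniform draw from the multiset.
    rng = random.Random(seed)
    positions = [rng.randrange(n) for _ in range(r)]
    return any(pos < k for pos in positions)
```

With n = 10^6, k = 10, and r = 100, the sample contains a rare element only about 0.1% of the time, so it almost always looks exactly like Scenario 2's all-x sample.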

SLIDE 8

So Why Are We Here?

Many large datasets are not worst-case. In fact, many follow Zipfian distributions. Examples:

  • In/out-degrees of the web graph
  • Word frequencies in many languages
  • Many, many more

Zipf_θ(i) ∝ 1/i^θ
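
A finite Zipf_θ sampler is easy to sketch in Python (my own helper, with illustrative parameters):

```python
import random

def zipf_sample(D, theta, r, seed=0):
    # Draw r i.i.d. samples from Zipf_theta over {1, ..., D},
    # where Pr[i] is proportional to 1 / i**theta.
    rng = random.Random(seed)
    weights = [1.0 / (i ** theta) for i in range(1, D + 1)]
    return rng.choices(range(1, D + 1), weights=weights, k=r)
```

For θ = 1 and D = 100 the top-ranked element carries roughly 19% of the total mass, so low ranks dominate any sample.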

SLIDE 9

Problem Definition

Suppose X ∼ Zipf_θ on D elements; θ is known, D is unknown. Estimate D by sampling from X.

Two kinds of results:

  • Best-you-can estimation: given a sample from X, return the best estimate of D.
  • Adaptive sampling: sample from X until a stopping condition is met.

SLIDE 10

Results

Let p* be the probability of the least likely element.

  • Adaptive sampling will return D after at most O((log D)/p*) samples, with constant probability.
  • Given r = (1 + ε/2)^(1+θ)/p* samples, can return a (1 + ε)-approximation to D with probability at least 1 − exp(−ε²·Ω(D)).

SLIDE 11

Outline

  • Introduction
  • Techniques
  • Experimental Results
  • Conclusion

SLIDE 12

Approximation Techniques

For a sample of size r, let fr be the number of distinct values in the sample. If D and θ are known, we can compute E_{D,θ}[fr], the expected number of distinct values in the sample. So, letting f*r be the number of distinct values actually observed, the estimator returns the D̂ for which E_{D̂,θ}[fr] = f*r.
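
A minimal sketch of that moment-matching step (my own code and names, assuming θ is known and that E_{D,θ}[fr] increases with D, which holds for Zipf): compute the expectation under a candidate D, then binary-search for the D whose expectation matches the observed distinct count.

```python
def expected_distinct(D, theta, r):
    # E[f_r] = sum_i (1 - (1 - p_i)^r), where p_i is proportional
    # to 1 / i**theta over i = 1..D.
    weights = [1.0 / (i ** theta) for i in range(1, D + 1)]
    Z = sum(weights)
    return sum(1.0 - (1.0 - w / Z) ** r for w in weights)

def estimate_D(observed_distinct, theta, r, D_max=10**6):
    # Smallest D whose expected distinct count reaches the observed
    # one; valid because E[f_r] is increasing in D.
    lo, hi = 1, D_max
    while lo < hi:
        mid = (lo + hi) // 2
        if expected_distinct(mid, theta, r) < observed_distinct:
            lo = mid + 1
        else:
            hi = mid
    return lo
```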

SLIDE 13

Analysis

Lemma (tight distribution of fr): for r large enough,

Pr[ |E[fr] − fr| ≥ ε·E[fr] ] ≤ exp(−ε²·Ω(D))

Proof: parallels the sharp-threshold coupon-collector arguments for uniform distributions.

SLIDE 14

Analysis (2)

Lemma (MLE preserves approximation). Given observed elements with fr ≤ (1 + ε)·E_{D,θ}[fr], let D̂ be such that E_{D̂,θ}[fr] = f*r, and let r ≥ 1/p*. Then:

(1 − 2ε)·D̂ ≤ D ≤ (1 + 2ε)·D̂

SLIDE 15

Outline

  • Introduction
  • Techniques
  • Experimental Results
  • Conclusion

SLIDE 16

The Competition

  • Zipfian Estimator (ZE): performance guarantees, but only for Zipfian distributions.
  • Guaranteed Error Estimator (GEE): O(√(n/r)) error guarantee; works for all distributions.
  • Analytic Estimator (AE): best-performing heuristic; no theoretical guarantees.
SLIDE 17

Datasets

Synthetic data:

  • Vary the number of distinct elements: D ∈ {10k, 50k, 100k}
  • Vary the database size: n ∈ {100k, 500k, 1000k}
  • Vary the skew of the distribution: θ ∈ {0, 0.5, 1}

Real datasets:

  • "Router" dataset: a packet trace from the Internet Traffic Archive. θ ≈ 1.6, n ≈ 4M, D ≈ 250k.

SLIDE 18

Estimating θ

Recall: Zipf_θ(i) ∝ 1/i^θ. Let fi be the frequency of the i-th element. Then E[fi] = c·r·i^(−θ), so log E[fi] = log(cr) − θ·log i. Estimate θ by linear regression on the log fi vs. log i plot.
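
That regression is a few lines of Python (my own sketch; `estimate_theta` is a hypothetical helper): sort the observed frequencies, fit log fi against log i by least squares, and negate the slope.

```python
import math

def estimate_theta(sample):
    # Regress log(frequency) on log(rank), per the slide:
    # log E[f_i] = log(c r) - theta * log(i).
    counts = {}
    for x in sample:
        counts[x] = counts.get(x, 0) + 1
    freqs = sorted(counts.values(), reverse=True)
    xs = [math.log(i) for i in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
            / sum((x - mx) ** 2 for x in xs)
    return -slope  # the fitted slope is approximately -theta
```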

SLIDE 19

Experimental Results

[Plot: ratio error vs. number of samples (×1000) for ZE, AE, and GEE; θ = 0.5, D = 50000, n = 1M.]

SLIDE 20

Experimental Results (2)

[Plot: ratio error vs. % of database sampled for ZE, AE, and GEE on the Router dataset.]

SLIDE 21

Outline

  • Introduction
  • Techniques
  • Experimental Results
  • Conclusion

SLIDE 22

Conclusion

Can have error guarantees if the family of distributions is known ahead of time. Open question: how does the approximation of θ affect the error guarantees?

Subtle problem: disk reads occur in blocks, so the time to sample 10% of the database is equivalent to reading the whole DB.

SLIDE 23

Thank You