Distinct Value Estimators For Zipfian Distributions



SLIDE 1

Distinct Value Estimators For Zipfian Distributions

Sergei Vassilvitskii, Rajeev Motwani
Stanford University

SLIDE 2

Problem Statement

Given a large multiset X with n elements, count the number of distinct elements in X. Alternatively, given samples from a distribution P, estimate the 0-th frequency moment.

X = {a, b, c, a, a, c, b, a}  ⇒  Distinct(X) = 3
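
The example can be checked with an exact count (plain Python, my own illustration). Note that the exact approach needs a full pass over X and memory proportional to the number of distinct values, which is what the sampling-based estimators below avoid:

```python
# Exact distinct count for the slide's example multiset.
# Needs one full pass and O(Distinct(X)) memory.
X = ["a", "b", "c", "a", "a", "c", "b", "a"]
distinct = len(set(X))
print(distinct)  # 3
```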

SLIDE 3

Why Do We Care?

Good planning for SQL queries. Consider:

select * from R, S where R.A = S.B and f(S.C) > k

where f is expensive to compute.

  • If S.C has few distinct values, compute f on the distinct values first, cache the results, then join.
  • If S.C has many distinct values, compute the join first, then check the condition.

Orders of magnitude improvements.
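
The first plan can be sketched in Python (a hypothetical illustration, not from the talk; `expensive_f` and `filter_with_cache` are made-up names): when S.C has few distinct values, evaluating f once per distinct value and caching it dominates evaluating it once per row.

```python
def expensive_f(c):
    # Stand-in for a costly user-defined predicate f(S.C).
    return sum(i * i for i in range(1000)) + c

def filter_with_cache(column_values, k):
    # Evaluate f once per *distinct* value of S.C and reuse the
    # result; this pays off when the column has few distinct values.
    cache = {}
    kept = []
    for c in column_values:
        if c not in cache:
            cache[c] = expensive_f(c)
        if cache[c] > k:
            kept.append(c)
    return kept
```

With d distinct values among n rows, f runs d times instead of n; the distinct-value estimate is exactly what lets the optimizer predict which plan wins.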

SLIDE 4

Classical Problem

Different approaches:

  • Streaming input: minimize the space used.
  • Sample from the input: what guarantees on the approximation?

Given a sample of size r from X, find an approximation D̂ to Distinct(X).

SLIDE 5

Previous Work

Good-Turing Estimator: "The Population Frequencies of Species and the Estimation of Population Parameters," 1953.

Other heuristic estimators:

  • Smoothed Jackknife Estimator (Haas et al.)
  • Adaptive Estimator (Charikar et al.)
  • Many others

SLIDE 6

Previous Work - Theory

Given r samples from a set of size n:

Guaranteed Error Estimator (GEE) [CCMN]. Approximation ratio: O(√(n/r)).

Lower bound: there exist inputs such that, with constant probability, any estimator will have approximation ratio at least √((n − r)/(2r)).
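
As a rough sketch (my own code; the slide only names GEE and its ratio, so the exact formula below is as I recall it from [CCMN] and should be treated as an assumption): GEE scales the number of singletons in the sample by √(n/r) and counts repeated values as-is.

```python
import math

def gee(sample, n):
    # Hedged sketch of the Guaranteed Error Estimator: values seen
    # exactly once in the sample are scaled up by sqrt(n/r); values
    # seen two or more times are counted once each.
    r = len(sample)
    counts = {}
    for x in sample:
        counts[x] = counts.get(x, 0) + 1
    singletons = sum(1 for c in counts.values() if c == 1)
    repeated = sum(1 for c in counts.values() if c >= 2)
    return math.sqrt(n / r) * singletons + repeated
```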

SLIDE 7

Lower Bound Detail

Scenario 1: S = {x, x, ..., x, y1, ..., yk}, |S| = n
Scenario 2: S = {x, x, ..., x}, |S| = n

With k = ((n − r)/(2r))·ln(1/δ), after r samples one cannot distinguish between the two scenarios with probability at least δ.
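
A quick simulation (my own sketch; the parameter values and the helper name `sample_sees_rare` are made up) shows why the scenarios are hard to tell apart: a small sample almost never hits any of the k rare elements.

```python
import random

def sample_sees_rare(n, k, r, seed=0):
    # Scenario 1's multiset: n - k copies of the heavy element x
    # plus rare elements y1..yk. Encode it as n slots, where slots
    # 0..k-1 stand for the rare elements; a uniform slot draw is a
    # uniform draw from the multiset.
    rng = random.Random(seed)
    positions = [rng.randrange(n) for _ in range(r)]
    return any(pos < k for pos in positions)
```

With n = 10^6, k = 10, and r = 100, the sample contains a rare element only about 0.1% of the time, so it almost always looks exactly like Scenario 2's all-x sample.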

SLIDE 8

So Why Are We Here?

Many large datasets are not worst-case. In fact, many follow Zipfian distributions. Examples:

  • In/out-degrees of the web graph
  • Word frequencies in many languages
  • Many, many more

Zipf_θ(i) ∝ 1/i^θ
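
A finite Zipf_θ sampler is easy to sketch in Python (my own helper, with illustrative parameters):

```python
import random

def zipf_sample(D, theta, r, seed=0):
    # Draw r i.i.d. samples from Zipf_theta over {1, ..., D},
    # where Pr[i] is proportional to 1 / i**theta.
    rng = random.Random(seed)
    weights = [1.0 / (i ** theta) for i in range(1, D + 1)]
    return rng.choices(range(1, D + 1), weights=weights, k=r)
```

For θ = 1 and D = 100 the top-ranked element carries roughly 19% of the total mass, so low ranks dominate any sample.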

SLIDE 9

Problem Definition

Suppose X ∼ Zipf_θ on D elements; θ is known, D is unknown. Estimate D by sampling from X.

Two kinds of results:

  • Best-you-can estimation: given a sample from X, return the best estimate of D.
  • Adaptive sampling: sample from X until a stopping condition is met.

SLIDE 10

Results

Let p* be the probability of the least likely element.

  • Adaptive sampling will return D after at most O((log D)/p*) samples, with constant probability.
  • Given r = (1 + ε/2)^(1+θ)/p* samples, can return a (1 + ε)-approximation to D with probability at least 1 − exp(−ε²·Ω(D)).

SLIDE 11

Outline

  • Introduction
  • Techniques
  • Experimental Results
  • Conclusion

SLIDE 12

Approximation Techniques

For a sample of size r, let fr be the number of distinct values in the sample. If D and θ are known, we can compute E_{D,θ}[fr], the expected number of distinct values in the sample. So, letting f*r be the number of distinct values actually observed, the estimator returns the D̂ for which E_{D̂,θ}[fr] = f*r.
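
A minimal sketch of that moment-matching step (my own code and names, assuming θ is known and that E_{D,θ}[fr] increases with D, which holds for Zipf): compute the expectation under a candidate D, then binary-search for the D whose expectation matches the observed distinct count.

```python
def expected_distinct(D, theta, r):
    # E[f_r] = sum_i (1 - (1 - p_i)^r), where p_i is proportional
    # to 1 / i**theta over i = 1..D.
    weights = [1.0 / (i ** theta) for i in range(1, D + 1)]
    Z = sum(weights)
    return sum(1.0 - (1.0 - w / Z) ** r for w in weights)

def estimate_D(observed_distinct, theta, r, D_max=10**6):
    # Smallest D whose expected distinct count reaches the observed
    # one; valid because E[f_r] is increasing in D.
    lo, hi = 1, D_max
    while lo < hi:
        mid = (lo + hi) // 2
        if expected_distinct(mid, theta, r) < observed_distinct:
            lo = mid + 1
        else:
            hi = mid
    return lo
```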

SLIDE 13

Analysis

Lemma (tight distribution of fr): for r large enough,

Pr[ |E[fr] − fr| ≥ ε·E[fr] ] ≤ exp(−ε²·Ω(D))

Proof: parallels the sharp-threshold coupon-collector arguments for uniform distributions.

SLIDE 14

Analysis (2)

Lemma (MLE preserves approximation). Given observed elements with fr ≤ (1 + ε)·E_{D,θ}[fr], let D̂ be such that E_{D̂,θ}[fr] = f*r, and let r ≥ 1/p*. Then:

(1 − 2ε)·D̂ ≤ D ≤ (1 + 2ε)·D̂

SLIDE 15

Outline

  • Introduction
  • Techniques
  • Experimental Results
  • Conclusion

SLIDE 16

The Competition

  • Zipfian Estimator (ZE): performance guarantees, but only for Zipfian distributions.
  • Guaranteed Error Estimator (GEE): O(√(n/r)) error guarantee; works for all distributions.
  • Analytic Estimator (AE): best-performing heuristic; no theoretical guarantees.
SLIDE 17

Datasets

Synthetic data:

  • Vary the number of distinct elements: D ∈ {10k, 50k, 100k}
  • Vary the database size: n ∈ {100k, 500k, 1000k}
  • Vary the skew of the distribution: θ ∈ {0, 0.5, 1}

Real datasets:

  • "Router" dataset: a packet trace from the Internet Traffic Archive. θ ≈ 1.6, n ≈ 4M, D ≈ 250k.

SLIDE 18

Estimating θ

Recall: Zipf_θ(i) ∝ 1/i^θ. Let fi be the frequency of the i-th element. Then E[fi] = c·r·i^(−θ), so log E[fi] = log(cr) − θ·log i. Estimate θ by linear regression on the log fi vs. log i plot.
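
That regression is a few lines of Python (my own sketch; `estimate_theta` is a hypothetical helper): sort the observed frequencies, fit log fi against log i by least squares, and negate the slope.

```python
import math

def estimate_theta(sample):
    # Regress log(frequency) on log(rank), per the slide:
    # log E[f_i] = log(c r) - theta * log(i).
    counts = {}
    for x in sample:
        counts[x] = counts.get(x, 0) + 1
    freqs = sorted(counts.values(), reverse=True)
    xs = [math.log(i) for i in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
            / sum((x - mx) ** 2 for x in xs)
    return -slope  # the fitted slope is approximately -theta
```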

SLIDE 19

Experimental Results

[Plot: ratio error vs. number of samples (×1000) for ZE, AE, and GEE; θ = 0.5, D = 50000, n = 1M.]

SLIDE 20

Experimental Results (2)

[Plot: ratio error vs. % of database sampled for ZE, AE, and GEE on the Router dataset.]

SLIDE 21

Outline

  • Introduction
  • Techniques
  • Experimental Results
  • Conclusion

SLIDE 22

Conclusion

Can have error guarantees if the family of distributions is known ahead of time. Open question: how does the approximation of θ affect the error guarantees?

Subtle problem: disk reads occur in blocks, so the time to sample 10% of the database is equivalent to reading the whole DB.

SLIDE 23

Thank You