compsci 514: algorithms for data science (Cameron Musco)




slide-1
SLIDE 1

compsci 514: algorithms for data science

Cameron Musco University of Massachusetts Amherst. Fall 2019. Lecture 8

slide-2
SLIDE 2

logistics

  • Problem Set 1 was due this morning in Gradescope.
  • Problem Set 2 will be released tomorrow and due 10/10.


slide-3
SLIDE 3

summary

Last Class: Finished up MinHash and LSH.

  • Application to fast similarity search.
  • False positive and negative tuning with length r hash signatures and t hash table repetitions (s-curves).
  • Examples of other locality sensitive hash functions (SimHash).

This Class:

  • The Frequent Elements (heavy-hitters) problem in data streams.
  • Misra-Gries summaries.
  • Count-min sketch.

slide-4
SLIDE 4

upcoming

Next Time: Random compression methods for high dimensional vectors. The Johnson-Lindenstrauss lemma.

  • Building on the idea of SimHash.

After That: Spectral Methods

  • PCA, low-rank approximation, and the singular value decomposition.
  • Spectral clustering and spectral graph theory.

Will use a lot of linear algebra. May be helpful to refresh.

  • Vector dot product, addition, length. Matrix vector multiplication.
  • Linear independence, column span, orthogonal bases, rank.
  • Eigendecomposition.

slide-5
SLIDE 5

hashing for duplicate detection

All different variants of detecting duplicates/finding matches in large datasets. An important problem in many contexts!


slide-6
SLIDE 6

the frequent items problem

k-Frequent Items (Heavy-Hitters) Problem: Consider a stream of n items x1, . . . , xn (with possible duplicates). Return any item that appears at least n/k times. E.g., for n = 9, k = 3, return every item that appears at least 3 times.

  • What is the maximum number of items that must be returned? At most k items can have frequency ≥ n/k.
  • Think of k = 100. Want items appearing ≥ 1% of the time.
  • Easy with O(n) space – store the count for each item and return those that appear ≥ n/k times.
  • Can we do it with less space? I.e., without storing all n items?
  • Similar challenge as with the distinct elements problem.
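The O(n)-space baseline from the bullets above can be sketched in a few lines. This is a minimal illustration, not code from the lecture; the function name and the example stream are assumptions.

```python
from collections import Counter

def exact_frequent_items(stream, k):
    """Return every item appearing at least n/k times.

    Uses one counter per distinct item, so space grows with the
    number of distinct items (O(n) in the worst case).
    """
    counts = Counter(stream)
    n = sum(counts.values())
    # c >= n/k, written with integers to avoid floating point
    return {item for item, c in counts.items() if c * k >= n}

# Illustrative stream with n = 9 (not the one drawn on the slide).
stream = [5, 12, 3, 3, 4, 5, 5, 10, 3]
print(exact_frequent_items(stream, k=3))  # items 3 and 5 each appear 3 times
```

The rest of the lecture is about getting (approximately) the same answer without the per-item counters.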

slide-7
SLIDE 7

the frequent items problem

Applications of Frequent Items:

  • Finding top/viral items (e.g., products on Amazon, videos watched on YouTube, Google searches, etc.)
  • Finding very frequent IP addresses sending requests (to detect DoS attacks/network anomalies).
  • ‘Iceberg queries’ for all items in a database with frequency above some threshold.

Generally want very fast detection, without having to scan through database/logs. I.e., want to maintain a running list of frequent items that appear in a stream.

slide-8
SLIDE 8

frequent itemset mining

Association rule learning: A very common task in data mining is to identify common associations between different events.

  • Identified via frequent itemset counting. Find all sets of k items that appear many times in the same basket.
  • Frequency of an itemset is known as its support.
  • A single basket includes many different itemsets, and with many different baskets an efficient approach is critical. E.g., baskets are Twitter users and itemsets are subsets of who they follow.

slide-9
SLIDE 9

majority in data streams

Majority: Consider a stream of n items x1, . . . , xn, where a single item appears a majority of the time. Return this item.

  • Basically k-Frequent items for k = 2 (and assume a single item has a strict majority).

slide-10
SLIDE 10

boyer-moore algorithm

Boyer-Moore Voting Algorithm: (our first deterministic algorithm)

  • Initialize count c := 0, majority element m := ⊥.
  • For i = 1, . . . , n:
      • If c = 0, set m := xi and c := 1.
      • Else if m = xi, set c := c + 1.
      • Else if m ≠ xi, set c := c − 1.

Just requires O(log n) bits to store c and space to store m.
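The voting steps above translate almost line for line into code. A minimal sketch, assuming the stream is any Python iterable; `None` stands in for ⊥, and the function name is ours.

```python
def boyer_moore_majority(stream):
    """One pass over the stream, keeping a single candidate and counter."""
    count = 0
    candidate = None  # plays the role of m = ⊥
    for x in stream:
        if count == 0:
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1
    return candidate

print(boyer_moore_majority([1, 2, 1, 3, 1, 1, 2]))  # 1 appears 4 times out of 7
```

If no item has a strict majority, the returned candidate is arbitrary; a second pass over the data would be needed to verify it.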

slide-24
SLIDE 24

correctness of boyer-moore

Boyer-Moore Voting Algorithm:

  • Initialize count c := 0, majority element m := ⊥.
  • For i = 1, . . . , n:
      • If c = 0, set m := xi and c := 1.
      • Else if m = xi, set c := c + 1.
      • Else if m ≠ xi, set c := c − 1.

Claim: The Boyer-Moore algorithm always outputs the majority element, regardless of what order the stream is presented in.

Proof: Let M be the true majority element. Let s = c when m = M and s = −c otherwise (s is a ‘helper’ variable).

  • s starts at 0 and changes by exactly 1 at every step.
  • s is incremented each time M appears, so it can only be decremented when some other item appears. It is therefore incremented more than it is decremented (since M appears a majority of times) and ends at a positive value.
  • s > 0 at the end means c > 0 and m = M ⟹ algorithm ends with m = M.
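The helper variable s from the proof can be traced on a small stream to watch the argument play out. This is an illustrative sketch only; the function name and the example stream are assumptions, and the majority element is passed in purely so that s can be computed.

```python
def trace_helper_variable(stream, majority):
    """Run Boyer-Moore and record s after each item.

    s = c when the current candidate m equals `majority`, and s = -c otherwise.
    """
    c, m, trace = 0, None, []
    for x in stream:
        if c == 0:
            m, c = x, 1
        elif m == x:
            c += 1
        else:
            c -= 1
        trace.append(c if m == majority else -c)
    return m, trace

m, trace = trace_helper_variable([2, 1, 1, 3, 1, 2, 1], majority=1)
print(m, trace)  # 1 [-1, 0, 1, 0, 1, 0, 1]
```

Every arrival of the majority item 1 moves s up by one, every step moves it by exactly one, and s ends positive.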

slide-25
SLIDE 25

back to frequent items

k-Frequent Items (Heavy-Hitters) Problem: Consider a stream of n items x1, . . . , xn (with possible duplicates). Return any item that appears at least n/k times.

Boyer-Moore Voting Algorithm:

  • Initialize count c := 0, majority element m := ⊥.
  • For i = 1, . . . , n:
      • If c = 0, set m := xi and c := 1.
      • Else if m = xi, set c := c + 1.
      • Else if m ≠ xi, set c := c − 1.

slide-26
SLIDE 26

back to frequent items

k-Frequent Items (Heavy-Hitters) Problem: Consider a stream of n items x1, . . . , xn (with possible duplicates). Return any item that appears at least n/k times.

Misra-Gries Summary:

  • Initialize counts c1, . . . , ck := 0, elements m1, . . . , mk := ⊥.
  • For i = 1, . . . , n:
      • If mj = xi for some j, set cj := cj + 1.
      • Else let t = arg min_j cj. If ct = 0, set mt := xi and ct := 1.
      • Else cj := cj − 1 for all j.

slide-37
SLIDE 37

misra-gries algorithm

Misra-Gries Summary:

  • Initialize counts c1, . . . , ck := 0, elements m1, . . . , mk := ⊥.
  • For i = 1, . . . , n:
      • If mj = xi for some j, set cj := cj + 1.
      • Else let t = arg min_j cj. If ct = 0, set mt := xi and ct := 1.
      • Else cj := cj − 1 for all j.

Claim: At the end of the stream, all items with frequency ≥ n/k are stored.
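A minimal sketch of the summary above, assuming hashable items. Instead of k explicit slots it keeps a dictionary with at most k entries; a slot with count 0 on the slide corresponds to a missing dictionary entry here. The function name and example stream are illustrative.

```python
def misra_gries(stream, k):
    """Misra-Gries summary with k counters, returned as {item: count}."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1  # a free slot (count 0) is available
        else:
            # No free slot: decrement every counter, freeing those that reach 0.
            for item in list(counters):
                counters[item] -= 1
                if counters[item] == 0:
                    del counters[item]
    return counters

stream = [5, 12, 3, 3, 4, 5, 5, 10, 3]  # n = 9, illustrative
print(misra_gries(stream, k=3))  # {3: 2, 5: 2, 10: 1}
```

Items 3 and 5 (frequency 3 = n/k) are both stored. The stored counts underestimate the true frequencies, and 10 is stored although it appears only once.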

slide-38
SLIDE 38

misra-gries analysis

Claim: At the end of the stream, the Misra-Gries algorithm stores k items, including all those with frequency ≥ n/k.

Intuition:

  • If there are exactly k items, each appearing exactly n/k times, all are stored (since we have k storage slots).
  • If there are k/2 items each appearing ≥ n/k times, there are ≤ n/2 irrelevant items, being inserted into k/2 ‘free slots’.
  • May cause (n/2)/(k/2) = n/k decrement operations. Few enough that the heavy items (appearing n/k times each) are still stored.

Anything undesirable about the Misra-Gries output guarantee? May have false positives – infrequent items that are stored.

slide-40
SLIDE 40

approximate frequent elements

Issue: Misra-Gries algorithm stores k items, including all with frequency ≥ n/k. But may include infrequent items.

  • In fact, no algorithm using o(n) space can output just the items with frequency ≥ n/k. Hard to tell between an item with frequency n/k (should be output) and n/k − 1 (should not be output).

(ϵ, k)-Frequent Items Problem: Consider a stream of n items x1, . . . , xn. Return a set F of items, including all items that appear at least n/k times and only items that appear at least (1 − ϵ) · n/k times.

  • An example of relaxing to a ‘promise problem’: for items with frequencies in [(1 − ϵ) · n/k, n/k] no output guarantee.

slide-41
SLIDE 41

approximate frequent elements with misra-gries

Misra-Gries Summary: (ϵ-error version)

  • Let r := ⌈k/ϵ⌉.
  • Initialize counts c1, . . . , cr := 0, elements m1, . . . , mr := ⊥.
  • For i = 1, . . . , n:
      • If mj = xi for some j, set cj := cj + 1.
      • Else let t = arg min_j cj. If ct = 0, set mt := xi and ct := 1.
      • Else cj := cj − 1 for all j.
  • Return any mj with cj ≥ (1 − ϵ) · n/k.

Claim: For all mj with true frequency f(mj): f(mj) − ϵn/k ≤ cj ≤ f(mj).

Intuition: # items stored r is large, so relatively few decrements. Each decrement step cancels r + 1 stream items (one from each of the r counters plus the arriving item), so at most n/(r + 1) ≤ ϵn/k decrement steps occur.

Implication: If f(mj) ≥ n/k, then cj ≥ (1 − ϵ) · n/k so the item is returned. If f(mj) < (1 − ϵ) · n/k, then cj < (1 − ϵ) · n/k so the item is not returned.

slide-42
SLIDE 42

approximate frequent elements with misra-gries

Upshot: The (ϵ, k)-Frequent Items problem can be solved via the Misra-Gries approach.

  • Space usage is ⌈k/ϵ⌉ counts, i.e., O((k log n)/ϵ) bits, and ⌈k/ϵ⌉ items.
  • Deterministic approximation algorithm.

slide-50
SLIDE 50

frequent elements with count-min sketch

A common alternative to the Misra-Gries approach is the count-min sketch: a randomized method closely related to bloom filters.

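  • Maintain an array A of m counters, all initially 0, and a random hash function h mapping items to positions of A. For each stream item xi, increment A[h(xi)].
  • A major advantage: easily distributed to processing on multiple servers. Build arrays A1, . . . , As separately and then just set A := A1 + . . . + As.
  • Will use A[h(x)] to estimate f(x), the frequency of x in the stream. I.e., |{xi : xi = x}|.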
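The single-array sketch and its merge property can be illustrated as follows. This is a sketch under stated assumptions: items are non-negative integers below p, and `make_hash` is a hypothetical helper using a Carter-Wegman style hash ((a·x + b) mod p) mod m, which is one common choice and not necessarily the one used in the lecture.

```python
import random

def make_hash(m, seed):
    """Hypothetical helper: random hash from integers to {0, ..., m-1}."""
    p = 2_147_483_647  # the prime 2^31 - 1
    rng = random.Random(seed)
    a, b = rng.randrange(1, p), rng.randrange(p)
    return lambda x: ((a * x + b) % p) % m

def build_array(stream, h, m):
    """Increment A[h(x)] once for every stream item x."""
    A = [0] * m
    for x in stream:
        A[h(x)] += 1
    return A

m = 16
h = make_hash(m, seed=0)          # every server must use the same h
part1 = [3, 7, 3, 9]              # items seen by server 1 (illustrative)
part2 = [3, 9, 9, 21]             # items seen by server 2 (illustrative)
A1 = build_array(part1, h, m)
A2 = build_array(part2, h, m)
A = [u + v for u, v in zip(A1, A2)]  # A := A1 + A2, entry by entry

print(A == build_array(part1 + part2, h, m))  # True: same as one pass over everything
print(A[h(3)])  # at least 3, the true frequency of item 3
```

Merging works because each array entry is just a sum of counts, so adding the arrays gives the array of the concatenated stream.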

slide-51
SLIDE 51

count-min sketch accuracy

Use A[h(x)] to estimate f(x).

Claim 1: We always have A[h(x)] ≥ f(x). Why?

  • A[h(x)] counts the number of occurrences of any y with h(y) = h(x), including x itself.
  • $A[h(x)] = f(x) + \sum_{y \neq x:\, h(y) = h(x)} f(y)$.

f(x): frequency of x in the stream (i.e., number of items equal to x). h: random hash function. m: size of count-min sketch array.

slide-52
SLIDE 52

count-min sketch accuracy

$$A[h(x)] = f(x) + \underbrace{\sum_{y \neq x:\, h(y) = h(x)} f(y)}_{\text{error in frequency estimate}}$$

Expected Error:

$$\mathbb{E}\Big[\sum_{y \neq x:\, h(y) = h(x)} f(y)\Big] = \sum_{y \neq x} \Pr(h(y) = h(x)) \cdot f(y) = \sum_{y \neq x} \frac{1}{m} \cdot f(y) = \frac{1}{m} \cdot (n - f(x)) \le \frac{n}{m}$$

What is a bound on probability that the error is ≥ 3n/m?

Markov’s inequality: $\Pr\Big[\sum_{y \neq x:\, h(y) = h(x)} f(y) \ge \frac{3n}{m}\Big] \le \frac{1}{3}$.

What property of h is required to show this bound? 2-universal.

slide-53
SLIDE 53

count-min sketch accuracy

Claim: For any x, with probability at least 2/3, f(x) ≤ A[h(x)] ≤ f(x) + ϵn/k.

To solve the (ϵ, k)-Frequent elements problem, set m = 6k/ϵ.

How can we improve the success probability? Repetition.
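To see how this choice of m gives the claimed error, plug it into the Markov bound. A short worked step (the observation that m = 6k/ϵ leaves a factor of two of slack is ours, not stated on the slide):

```latex
\Pr\left[ A[h(x)] - f(x) \ge \frac{3n}{m} \right] \le \frac{1}{3},
\qquad
m = \frac{6k}{\epsilon}
\;\Longrightarrow\;
\frac{3n}{m} = \frac{\epsilon n}{2k} \le \frac{\epsilon n}{k}.
```

So with probability at least 2/3 the overestimate is below ϵn/(2k), which is within the ϵn/k allowed by the claim.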

slide-58
SLIDE 58

count-min sketch accuracy

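Estimate f(x) with $\tilde{f}(x) = \min_{i \in [t]} A_i[h_i(x)]$, where A1, . . . , At are t arrays built with independent random hash functions h1, . . . , ht. (count-min sketch)

Why min instead of median? The minimum estimate is always the most accurate since they are all overestimates of the true frequency!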

slide-60
SLIDE 60

count-min sketch analysis

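Estimate f(x) by $\tilde{f}(x) = \min_{i \in [t]} A_i[h_i(x)]$.

  • For every x and i ∈ [t], we know that for m = O(k/ϵ), with probability ≥ 2/3: $f(x) \le A_i[h_i(x)] \le f(x) + \frac{\epsilon n}{k}$.
  • What is $\Pr\big[f(x) \le \tilde{f}(x) \le f(x) + \frac{\epsilon n}{k}\big]$? At least $1 - 1/3^t$: the minimum is a bad estimate only if all t independent estimates are bad, which happens with probability at most $(1/3)^t$.
  • To have a good estimate with probability ≥ 1 − δ, set t = log(1/δ).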
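Putting the t repetitions together gives the full count-min sketch. A minimal sketch, assuming integer items below the prime P and the same illustrative ((a·x + b) mod P) mod m hash family for every row; the class and method names are ours.

```python
import random

class CountMinSketch:
    """t arrays of m counters each; estimates never fall below the true frequency."""

    P = 2_147_483_647  # prime modulus for the illustrative hash family

    def __init__(self, m, t, seed=0):
        rng = random.Random(seed)
        self.m = m
        # One (a, b) pair per row: h_i(x) = ((a * x + b) mod P) mod m
        self.params = [(rng.randrange(1, self.P), rng.randrange(self.P))
                       for _ in range(t)]
        self.rows = [[0] * m for _ in range(t)]

    def _index(self, i, x):
        a, b = self.params[i]
        return ((a * x + b) % self.P) % self.m

    def add(self, x):
        """Process one stream item: increment A_i[h_i(x)] in every row."""
        for i, row in enumerate(self.rows):
            row[self._index(i, x)] += 1

    def estimate(self, x):
        """Return the minimum over rows of A_i[h_i(x)]."""
        return min(row[self._index(i, x)] for i, row in enumerate(self.rows))

sketch = CountMinSketch(m=64, t=5)
stream = [7] * 40 + list(range(1000, 1060))  # illustrative data
for x in stream:
    sketch.add(x)
print(sketch.estimate(7))  # at least 40; any excess comes from hash collisions
```

Following the slides, one would pick m on the order of k/ϵ and t on the order of log(1/δ); the values above are arbitrary.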

slide-61
SLIDE 61

count-min sketch

Upshot: Count-min sketch lets us estimate the frequency of every item in a stream up to error ϵn/k with probability ≥ 1 − δ in O(log(1/δ) · k/ϵ) space.

  • Accurate enough to solve the (ϵ, k)-Frequent elements problem.
  • Actually identifying the frequent elements quickly requires a little bit of further work.

One approach: Store potential frequent elements as they come in. At step i remove any elements whose estimated frequency is below i/k. Store at most O(k) items at once and have all items with frequency ≥ n/k stored at the end of the stream.
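The candidate-tracking approach described above might look like the following. This is a rough sketch, not the lecture's implementation: it assumes integer items, uses an illustrative ((a·x + b) mod p) mod m hash family, and prunes the candidate set at every step for simplicity (a real implementation would prune less often).

```python
import random

def frequent_items_count_min(stream, k, m, t, seed=0):
    """Track candidate heavy hitters alongside a count-min sketch."""
    p = 2_147_483_647
    rng = random.Random(seed)
    params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(t)]
    rows = [[0] * m for _ in range(t)]

    def cells(x):
        # Position of x in each of the t rows.
        return [((a * x + b) % p) % m for a, b in params]

    def estimate(x):
        return min(row[j] for row, j in zip(rows, cells(x)))

    candidates = set()
    for i, x in enumerate(stream, start=1):
        for row, j in zip(rows, cells(x)):
            row[j] += 1
        candidates.add(x)
        # At step i, drop candidates whose estimated frequency is below i/k.
        candidates = {y for y in candidates if estimate(y) * k >= i}
    return candidates

stream = [7] * 40 + list(range(1000, 1060))  # n = 100, illustrative
result = frequent_items_count_min(stream, k=4, m=64, t=5)
print(7 in result)  # True: 7 appears 40 >= n/k = 25 times
```

Because the sketch never underestimates, an item with frequency ≥ n/k always survives the pruning after its last occurrence, so it is in the returned set. Infrequent items may also survive if their estimates are inflated by collisions.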

slide-62
SLIDE 62

Questions on Frequent Elements?
