Quantiles and Selection, Lecture 16, October 20, 2020, Chandra (UIUC)

SLIDE 1

CS 498ABD: Algorithms for Big Data

Quantiles and Selection

Lecture 16

October 20, 2020
Chandra (UIUC), CS498ABD, Fall 2020
SLIDE 2

Part I: Introduction
SLIDE 4

Selection

Selection: given a sequence of numbers $a_1, a_2, \ldots, a_n$ and an integer $k \in [n]$, find the rank-$k$ element (the $k$'th element after sorting).
Median: the rank $n/2$ element.

Offline solutions:
  • Sort and pick the $k$'th element. $O(n \log n)$ time. After sorting, any rank can be found in constant time.
  • $O(n)$ time algorithm for Selection of a given rank $k$: randomized QuickSelect, or the deterministic Median-of-Medians algorithm (clever but slow).
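
As a rough illustration of the randomized QuickSelect mentioned above, here is a minimal in-memory sketch. The function name `quickselect` and the three-way partition (smaller / equal / larger) are choices made for this example, not taken from the lecture.

```python
import random


def quickselect(items, k):
    """Return the rank-k element (1-indexed) of items; expected O(n) time."""
    if not 1 <= k <= len(items):
        raise ValueError("k must be in [1, n]")
    pool = list(items)
    while True:
        pivot = random.choice(pool)
        smaller = [x for x in pool if x < pivot]
        equal = [x for x in pool if x == pivot]
        if k <= len(smaller):
            # The answer lies strictly below the pivot.
            pool = smaller
        elif k <= len(smaller) + len(equal):
            # The pivot itself occupies rank k.
            return pivot
        else:
            # Discard everything up to and including the pivot.
            k -= len(smaller) + len(equal)
            pool = [x for x in pool if x > pivot]
```

For example, `quickselect([10, 100, 1, 5, 95, 36], 3)` returns 10, the third smallest value. Each round keeps only one side of the pivot, which is where the expected linear running time comes from.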
SLIDE 7

Selection in Streaming

Question: suppose $a_1, a_2, \ldots, a_n$ arrive in a stream. Can we do Selection in small space?

Exact Selection in one pass requires $\Omega(n)$ space. All elements must be stored, so the trivial solution is optimal.

Relaxations:
  • Approximate selection. Recall sampling to find an $\epsilon$-approximate median using $O(\frac{1}{\epsilon^2} \log(1/\delta))$ samples, where $\delta$ is the failure probability. This can be done in streaming with reservoir sampling.
  • Multiple passes.
  • Assume random order arrival of elements.
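
The sampling relaxation can be sketched as follows. This is an illustrative version only: `reservoir_sample` and `approx_median` are hypothetical names, and the reservoir here samples without replacement, which is an assumption of this sketch rather than something the slides specify.

```python
import random


def reservoir_sample(stream, m, rng=random):
    """Keep a uniform random sample of m items from a stream of unknown length."""
    sample = []
    for t, x in enumerate(stream, start=1):
        if t <= m:
            sample.append(x)
        else:
            # Item t enters the reservoir with probability m / t,
            # evicting a uniformly random current member.
            j = rng.randrange(t)
            if j < m:
                sample[j] = x
    return sample


def approx_median(stream, m, rng=random):
    """Estimate the stream median by the median of a size-m reservoir."""
    sample = sorted(reservoir_sample(stream, m, rng))
    return sample[(len(sample) - 1) // 2]
```

With `m` on the order of $\frac{1}{\epsilon^2}\log(1/\delta)$, the returned element should have rank within about $\epsilon n$ of $n/2$ with probability at least $1 - \delta$; the memory used is `m` items regardless of the stream length.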
SLIDE 12

Selection in Multiple Passes

Multipass model: see the same stream $p$ times for some $p \ge 1$. With larger $p$ one can do more with the same memory bound. Initially motivated by database applications where random-access main memory is small and large external memory (such as tapes) allows reasonably fast sequential scans.

Selection in multiple passes:
  • $\Theta(n)$ space allows 1 pass.
  • $O(1)$ space. How many passes? $O(\log n)$ passes suffice with high probability: implement QuickSelect in $O(1)$ space.
  • $p$ passes? $O(n^{1/p}\,\mathrm{polylog}(n))$ space suffices. Hence $O(\sqrt{n} \log n)$ for 2 passes. [Munro-Paterson 1980]
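
One way to realize QuickSelect with $O(1)$ words of memory is sketched below. It assumes the stream can be replayed through a caller-supplied `make_stream()` factory (a hypothetical interface) and spends two passes per round: one to draw a random pivot from the surviving range, one to rank it.

```python
import random


def multipass_select(make_stream, k, rng=random):
    """Return (rank-k element, passes used), keeping only O(1) values in memory.

    make_stream() must return a fresh iterator over the same data each time.
    """
    lo = hi = None      # open interval (lo, hi) known to contain the answer
    passes = 0
    while True:
        # Pass A: reservoir-sample one pivot uniformly from elements inside (lo, hi).
        pivot, seen = None, 0
        for x in make_stream():
            if (lo is None or x > lo) and (hi is None or x < hi):
                seen += 1
                if rng.randrange(seen) == 0:
                    pivot = x
        # Pass B: compute the pivot's exact rank range over the whole stream.
        less = equal = 0
        for x in make_stream():
            if x < pivot:
                less += 1
            elif x == pivot:
                equal += 1
        passes += 2
        if k <= less:
            hi = pivot          # answer is strictly smaller than the pivot
        elif k <= less + equal:
            return pivot, passes
        else:
            lo = pivot          # answer is strictly larger than the pivot
```

Because each pivot is uniform over the surviving candidates, the candidate set shrinks by a constant factor in expectation per round, so the number of rounds is $O(\log n)$ with high probability. The sketch requires $1 \le k \le n$.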
SLIDE 14

Quantiles

Large numerical/ordered data: say heights/weights/salaries of the population of a country. Exact selection is not as interesting as a high-level summary. Pick some granularity and bucket the data into groups of roughly equal size.

Example: for $\alpha = 1, 2, \ldots, 100$ want the $\alpha$-percentile salaries.
More precision: for $\alpha = 0.1, 0.2, \ldots, 100$ want the $\alpha$-percentile salaries.

In terms of Selection: want the rank-$k$ element for $k = \frac{\alpha}{100} n$ for each $\alpha$. This allows for $\epsilon$-approximate Selection (additive error $\epsilon n$, where $\epsilon$ is the granularity in quantile).
SLIDE 16

Quantile Summaries or Approximate Selection in Streaming

See a stream of numbers $a_1, a_2, \ldots, a_n$. Parameter $\epsilon \in (0, 1)$.

Maintain a small-space summary such that, given any $k \in [n]$, we can output a number $a$ from the stream such that
$$k - \epsilon n \le \mathrm{rank}(a) \le k + \epsilon n.$$

Offline: can do with $O(1/\epsilon)$ space. Store the elements of rank $i \epsilon n$ for $i = 1, 2, \ldots, 1/\epsilon$.

Q: Can we do it in streaming, and how much space do we need?
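
The offline solution can be sketched in a few lines. The names `offline_summary` and `query_offline` are hypothetical, and rounding the spacing $\epsilon n$ down to an integer (plus always keeping the largest element) is an assumption made here to handle cases where $\epsilon n$ is not an integer.

```python
import math


def offline_summary(data, eps):
    """Store the elements of rank step, 2*step, ..., n with step = floor(eps * n)."""
    ordered = sorted(data)
    n = len(ordered)
    step = max(1, math.floor(eps * n))
    ranks = list(range(step, n + 1, step))
    if not ranks or ranks[-1] != n:
        ranks.append(n)             # always keep the largest element
    return [(r, ordered[r - 1]) for r in ranks]


def query_offline(summary, k):
    """Return the stored element whose rank is closest to k."""
    return min(summary, key=lambda rank_value: abs(rank_value[0] - k))[1]
```

Stored ranks are at most $\epsilon n$ apart, so every $k \in [n]$ is within $\epsilon n$ of a stored rank, and the summary holds roughly $1/\epsilon$ elements.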
SLIDE 17

Quantile Summaries or Approximate Selection in Streaming

Recall: given $\epsilon \in (0, 1)$, maintain a small-space summary of the stream $a_1, a_2, \ldots, a_n$ such that, for any $k \in [n]$, we can output a stream element $a$ with $k - \epsilon n \le \mathrm{rank}(a) \le k + \epsilon n$.

Q: Can we do it in streaming, and how much space do we need?
  • $O(\frac{1}{\epsilon} \log^2 n)$ space using the merge and reduce approach.
  • A more involved $O(\frac{1}{\epsilon} \log(n/\epsilon))$ space algorithm that is near optimal.
  • Both are deterministic algorithms.
  • Can be used to derive the Munro-Paterson multi-pass Selection algorithm.
SLIDE 18

Part II: Approximate Quantiles in Streaming
SLIDE 21

Quantile Summary

See a stream of numbers $a_1, a_2, \ldots, a_n$. Parameter $\epsilon \in (0, 1)$.
Note: items can be from any ordered set; we use only comparisons.

What should we store? Take a cue from the offline solution: $1/\epsilon$ equally spaced elements from the sorted list.

Quantile Summary:
  • $Q = \{q_1, q_2, \ldots, q_\ell\}$ where each $q_i$ is an element of the stream.
  • Wlog $q_1 < q_2 < \ldots < q_\ell$, and $q_1$ is the smallest and $q_\ell$ the largest element in the stream.
  • For each $q_i \in Q$ an interval $I(q_i) = [\mathrm{rmin}_Q(q_i), \mathrm{rmax}_Q(q_i)]$ where $\mathrm{rmin}_Q(q_i) \le \mathrm{rank}(q_i) \le \mathrm{rmax}_Q(q_i)$.
SLIDE 23

Quantile Summary

Quantile Summary: $Q = \{q_1, q_2, \ldots, q_\ell\}$ with $q_1 < q_2 < \ldots < q_\ell$, where $q_1$ is the smallest and $q_\ell$ the largest element. For each $q_i \in Q$ an interval $I(q_i) = [\mathrm{rmin}_Q(q_i), \mathrm{rmax}_Q(q_i)]$ where $\mathrm{rmin}_Q(q_i) \le \mathrm{rank}(q_i) \le \mathrm{rmax}_Q(q_i)$.

Given $k \in [n]$, we want to use $Q$ to answer an $\epsilon$-approximate rank-$k$ query. How?

Suppose $I(q_i) \subseteq [k - \epsilon n, k + \epsilon n]$. Then $q_i$ is clearly good to output, since
$$k - \epsilon n \le \mathrm{rmin}(q_i) \le \mathrm{rank}(q_i) \le \mathrm{rmax}(q_i) \le k + \epsilon n.$$
SLIDE 25

$\epsilon$-Approximate Quantile Summary

Quantile Summary: $Q = \{q_1, q_2, \ldots, q_\ell\}$ with $q_1 < q_2 < \ldots < q_\ell$, where $q_1$ is the smallest and $q_\ell$ the largest element. For each $q_i \in Q$ an interval $I(q_i) = [\mathrm{rmin}_Q(q_i), \mathrm{rmax}_Q(q_i)]$ where $\mathrm{rmin}_Q(q_i) \le \mathrm{rank}(q_i) \le \mathrm{rmax}_Q(q_i)$.

Maintain the key invariant: for each $i$,
$$\mathrm{rmax}(q_{i+1}) - \mathrm{rmin}(q_i) \le 2\epsilon n,$$
which also implies $\mathrm{rank}(q_{i+1}) - \mathrm{rank}(q_i) \le 2\epsilon n$.

Lemma: With the invariant, the quantile summary can be used to answer $\epsilon$-approximate rank queries.
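SLIDE 29

Proof of Lemma

Maintain the key invariant: for each $i$, $\mathrm{rmax}(q_{i+1}) - \mathrm{rmin}(q_i) \le 2\epsilon n$.

Claim: there exists $q_j$ such that $I(q_j) \subseteq [k - \epsilon n, k + \epsilon n]$.

  • If $k \ge (1 - \epsilon) n$ then $q_\ell$ satisfies the condition, since $q_\ell$ is the largest element and its rank is $n$.
  • Otherwise, let $j$ be the smallest index such that $\mathrm{rmax}(q_j) \ge k + \epsilon n$ (it exists since $\mathrm{rmax}(q_\ell) = n$ and $k < (1 - \epsilon) n$).
  • $q_{j-1}$ satisfies the condition. Suppose not. By the choice of $j$, $\mathrm{rmax}(q_{j-1}) < k + \epsilon n$. Since the condition is not satisfied by $q_{j-1}$, we must have $\mathrm{rmin}(q_{j-1}) < k - \epsilon n$. But then
$$\mathrm{rmax}(q_j) - \mathrm{rmin}(q_{j-1}) > k + \epsilon n - (k - \epsilon n) = 2\epsilon n,$$
a contradiction to the invariant.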
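
The lemma translates directly into a query procedure. The sketch below assumes a summary is represented as a list of `(value, rmin, rmax)` tuples sorted by value; that representation and the names `satisfies_invariant` and `query` are assumptions of this example.

```python
def satisfies_invariant(summary, eps, n):
    """Check rmax(q_{i+1}) - rmin(q_i) <= 2 * eps * n for consecutive entries."""
    return all(nxt[2] - cur[1] <= 2 * eps * n
               for cur, nxt in zip(summary, summary[1:]))


def query(summary, k, eps, n):
    """Return a value whose rank interval lies inside [k - eps*n, k + eps*n]."""
    lo, hi = k - eps * n, k + eps * n
    for value, rmin, rmax in summary:
        if lo <= rmin and rmax <= hi:
            return value
    # By the lemma this is unreachable when the invariant holds and the
    # summary contains the smallest and largest stream elements.
    raise LookupError("no interval fits inside the query window")
```

A linear scan is used for clarity; since `rmin` and `rmax` are typically nondecreasing along the summary, a binary search would presumably work as well.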
SLIDE 30

Maintaining $\epsilon$-Approx Quantile Summary in Streaming

Question: how to maintain an $\epsilon$-approximate quantile summary in small space in the streaming setting?

Merge and Reduce/Prune Framework (also useful in other settings):
  • Merge: given an $\epsilon_1$-approx $Q_1$ for multiset $S_1$ and an $\epsilon_2$-approx $Q_2$ for multiset $S_2$, obtain an approx $Q$ for $S = S_1 \cup S_2$.
  • Prune: given an $\epsilon$-approx $Q$ for $S$ of size $\ell$, prune it to size $h$ without increasing the error by too much.
SLIDE 32

Merging Summaries

  • $Q_1 = \{q_1, q_2, \ldots, q_\ell\}$ and intervals $I_1(q_1), \ldots, I_1(q_\ell)$ for multiset $S_1$ with $n_1 = |S_1|$.
  • $Q_2 = \{s_1, s_2, \ldots, s_m\}$ and intervals $I_2(s_1), \ldots, I_2(s_m)$ for multiset $S_2$ with $n_2 = |S_2|$.
  • $Q = \{z_1, z_2, \ldots, z_{\ell+m}\}$, the sorted version of $\{q_1, q_2, \ldots, q_\ell, s_1, \ldots, s_m\}$, for multiset $S = S_1 \uplus S_2$ with $n = n_1 + n_2$.

How do we find intervals for $Q$ while maintaining the key invariant?

Consider $z_i$ and assume wlog that $z_i = q_j$ for some $1 \le j \le \ell$.
SLIDE 34

Merging

Consider $z_i$ and assume wlog that $z_i = q_j$ for some $1 \le j \le \ell$.

Find $s_t, s_{t+1}$ such that $s_t \le q_j \le s_{t+1}$ (ignore corner cases).

We know that at least $\mathrm{rmin}_{Q_1}(q_j)$ elements of $S_1$ are no larger than $q_j$, and at least $\mathrm{rmin}_{Q_2}(s_t)$ elements of $S_2$ are no larger than $s_t \le q_j$. Hence it is safe to set
$$\mathrm{rmin}_Q(z_i) = \mathrm{rmin}_{Q_1}(q_j) + \mathrm{rmin}_{Q_2}(s_t).$$

Similarly it is safe to set
$$\mathrm{rmax}_Q(z_i) = \mathrm{rmax}_{Q_1}(q_j) + \mathrm{rmax}_{Q_2}(s_{t+1}) - 1.$$
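
A possible implementation of the merge rule follows, with summaries represented as lists of `(value, rmin, rmax)` tuples sorted by value (a representation assumed for this sketch). The corner cases that the slides skip are handled here in one plausible way: a missing predecessor contributes 0 to `rmin`, and a missing successor contributes the `rmax` of the other summary's largest element. Both summaries are assumed to contain the minimum and maximum of their multisets, and values are assumed distinct.

```python
from bisect import bisect_right


def merge(q1, q2):
    """Merge two quantile summaries into one summary for the combined multiset."""
    def lift(own, other):
        # Re-express each entry of `own` with ranks relative to the union.
        if not other:
            return list(own)
        other_values = [v for v, _, _ in other]
        lifted = []
        for value, rmin, rmax in own:
            t = bisect_right(other_values, value)   # other[t-1] <= value < other[t]
            add_min = other[t - 1][1] if t > 0 else 0
            add_max = other[t][2] - 1 if t < len(other) else other[-1][2]
            lifted.append((value, rmin + add_min, rmax + add_max))
        return lifted

    return sorted(lift(q1, q2) + lift(q2, q1))
```

The result has `len(q1) + len(q2)` entries, so merging alone never shrinks a summary; it only keeps the rank intervals valid for the union.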
SLIDE 35

Merging

Lemma: If $Q_1$ is an $\epsilon_1$-approx quantile summary for $S_1$ and $Q_2$ is an $\epsilon_2$-approx quantile summary for $S_2$, then $Q$ is an $\epsilon = \max\{\epsilon_1, \epsilon_2\}$-approx quantile summary for $S = S_1 \uplus S_2$.

Hence the error does not increase, but $|Q| = |Q_1| + |Q_2|$.

For the proof we need to verify the key invariant. $Q = \{z_1, z_2, \ldots, z_{\ell+m}\}$. Need to show that
$$\mathrm{rmax}_Q(z_{i+1}) - \mathrm{rmin}_Q(z_i) \le 2\epsilon (n_1 + n_2).$$
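SLIDE 37

Merging Analysis

Need to show that $\mathrm{rmax}_Q(z_{i+1}) - \mathrm{rmin}_Q(z_i) \le 2\epsilon (n_1 + n_2)$.

Case 1: $z_i, z_{i+1}$ are in the same summary, say $Q_1$ wlog. Then $z_i = q_j$ and $z_{i+1} = q_{j+1}$ for some $j$. This implies that there are $s_t, s_{t+1}$ in $Q_2$ such that $s_t \le q_j < q_{j+1} \le s_{t+1}$. Hence
$$\begin{aligned}
\mathrm{rmax}_Q(z_{i+1}) - \mathrm{rmin}_Q(z_i) &= \mathrm{rmax}_{Q_1}(q_{j+1}) + \mathrm{rmax}_{Q_2}(s_{t+1}) - 1 - (\mathrm{rmin}_{Q_1}(q_j) + \mathrm{rmin}_{Q_2}(s_t)) \\
&\le (\mathrm{rmax}_{Q_1}(q_{j+1}) - \mathrm{rmin}_{Q_1}(q_j)) + (\mathrm{rmax}_{Q_2}(s_{t+1}) - \mathrm{rmin}_{Q_2}(s_t)) \\
&\le 2\epsilon n_1 + 2\epsilon n_2 \\
&= 2\epsilon (n_1 + n_2).
\end{aligned}$$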
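SLIDE 39

Merging Analysis

Case 2: $z_i, z_{i+1}$ are in different summaries, say $Q_1, Q_2$ wlog. Then $z_i = q_j$ and $z_{i+1} = s_{t+1}$ for some $j, t$. This implies that $s_t \le q_j \le s_{t+1} \le q_{j+1}$ (ignoring corner cases). Hence
$$\begin{aligned}
\mathrm{rmax}_Q(z_{i+1}) - \mathrm{rmin}_Q(z_i) &= \mathrm{rmax}_{Q_1}(q_{j+1}) + \mathrm{rmax}_{Q_2}(s_{t+1}) - 1 - (\mathrm{rmin}_{Q_1}(q_j) + \mathrm{rmin}_{Q_2}(s_t)) \\
&\le (\mathrm{rmax}_{Q_1}(q_{j+1}) - \mathrm{rmin}_{Q_1}(q_j)) + (\mathrm{rmax}_{Q_2}(s_{t+1}) - \mathrm{rmin}_{Q_2}(s_t)) \\
&\le 2\epsilon n_1 + 2\epsilon n_2 \\
&= 2\epsilon (n_1 + n_2).
\end{aligned}$$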
SLIDE 42

Pruning/Reducing Summary

Merging keeps accuracy but increases summary size. Reduce/Prune: reduce size at the expense of accuracy.

Lemma: Given an $\epsilon$-approx quantile summary $Q$ and an integer $h \ge 3$, we can find $Q'$ such that $|Q'| \le h + 1$ and $Q'$ is $\epsilon'$-approximate for $\epsilon' \le \epsilon + \frac{1}{2h}$.

  • $Q = \{q_1, q_2, \ldots, q_\ell\}$ and wlog assume $\ell > h + 1$.
  • Query $Q$ for ranks $1, n/h, 2n/h, \ldots, n$.
  • Create $Q'$ from the output of the queries. Use the same intervals as those in $Q$.
SLIDE 44

Pruning/Reducing Analysis

$Q = \{q_1, q_2, \ldots, q_\ell\}$ and wlog assume $\ell > h + 1$. Query $Q$ for ranks $1, n/h, 2n/h, \ldots, n$. $Q' = \{q'_1, q'_2, \ldots, q'_{h+1}\}$.

Suppose $q'_i = q_a$ and $q'_{i+1} = q_b$ for some $a < b$. Then
$$I(q_a) \subseteq [in/h - \epsilon n,\ in/h + \epsilon n] \quad \text{and} \quad I(q_b) \subseteq [(i+1)n/h - \epsilon n,\ (i+1)n/h + \epsilon n].$$

Therefore,
$$\mathrm{rmax}_{Q'}(q'_{i+1}) - \mathrm{rmin}_{Q'}(q'_i) \le (i+1)n/h + \epsilon n - (in/h - \epsilon n) = 2\epsilon n + n/h = 2(\epsilon + 1/(2h)) n.$$
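
Pruning can be sketched as below, again with `(value, rmin, rmax)` tuples sorted by value. Two choices here are assumptions of this example rather than the lecture's exact procedure: the query picks the entry whose interval deviates least from the target rank (under the key invariant this entry lies within $\epsilon n$ of the target, so no explicit $\epsilon$ is needed), and the first and last entries are always kept so that the smallest and largest elements survive. The stream length `n` is read off the `rmax` of the last entry.

```python
def closest(summary, k):
    """Return the entry whose rank interval deviates least from rank k."""
    return min(summary, key=lambda e: max(k - e[1], e[2] - k))


def prune(summary, h):
    """Reduce a summary to at most h + 1 entries, keeping the original intervals."""
    if len(summary) <= h + 1:
        return list(summary)
    n = summary[-1][2]                      # rmax of the largest element
    picked = {summary[0], summary[-1]}      # ranks 1 and n
    for i in range(1, h):
        picked.add(closest(summary, i * n / h))   # ranks n/h, 2n/h, ...
    return sorted(picked)
```

Consecutive survivors answer queries that are $n/h$ apart, which is where the extra $1/(2h)$ in the error comes from.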
SLIDE 45

Merge and Reduce Streaming Quantiles

Stream: $a_1, a_2, \ldots, a_n$, and given $\epsilon \in (0, 1)$. Want to maintain an $\epsilon$-approximate quantile summary.

$O(\frac{1}{\epsilon} \log^2 n)$ space algorithm based on merge and reduce:
  • Come up with a solution as if the whole stream is available offline.
  • Show how it can be implemented in small space in the streaming setting.
SLIDE 46-49

[Handwritten annotations; most of the text did not survive extraction. Recoverable notes: fix $h$ so that every summary has size at most $h + 1$; the total space is then (number of summaries stored) $\times$ $(h + 1)$, and the remaining question is accuracy. A later page sketches the doubling trick for unknown $n$, with $\epsilon$ fixed and a constant initial guess $n_0$.]
SLIDE 51

Merge and Reduce for Streaming Quantiles

Stream: $a_1, a_2, \ldots, a_n$, and given $\epsilon \in (0, 1)$.

  • Imagine a rooted binary tree with $a_1, a_2, \ldots, a_n$ as leaves in that order (not sorted).
  • At each internal node $v$ let $S_v$ be the leaves under $v$. Compute a summary $Q_v$ for $S_v$ bottom up. $Q_r$ is the output, where $r$ is the root.
  • The summary at a leaf is trivially optimal: it simply stores the element.
  • To compute $Q_v$ with children $a, b$: Merge $Q_a$ and $Q_b$, then Prune to size $h + 1$.
  • This guarantees that $Q_r$ has size at most $h + 1$.

How should we choose $h$ to ensure an $\epsilon$-approx $Q_r$?
SLIDE 54

Merge and Reduce for Streaming Quantiles

If each leaf summary has error $\epsilon_0$, then Merging does not increase the error but Pruning adds $1/(2h)$ at each level. Hence $\epsilon_r$ at the root of a tree with depth $d$ satisfies
$$\epsilon_r \le \epsilon_0 + d/(2h) \le \epsilon_0 + \log n/(2h).$$

To ensure $\epsilon_r \le \epsilon$ we set $h = \Omega(\frac{1}{\epsilon} \log n)$. Hence each summary has size $O(\frac{1}{\epsilon} \log n)$ numbers.

How can we implement the offline algorithm in the streaming setting, and how much space does it require?
SLIDE 56

Merge and Reduce for Streaming Quantiles

Recall: to ensure $\epsilon_r \le \epsilon$ we set $h = \Omega(\frac{1}{\epsilon} \log n)$, so each summary has size $O(\frac{1}{\epsilon} \log n)$ numbers.

How can we implement the offline algorithm in the streaming setting, and how much space does it require?
  • Only $Q_r$ is needed, so it is sufficient to keep only those summaries in the "imaginary" binary tree that suffice to create $Q_r$.
  • It suffices to keep $O(d)$ summaries, where $d$ is the depth. Hence the total space is $O(\frac{1}{\epsilon} \log^2 n)$.
  • Need to know $n$ in advance to set $h$. Otherwise use the doubling trick, at the cost of an extra log factor.
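
Putting the pieces together, the streaming algorithm can be sketched as a binary counter over tree levels: level $i$ holds at most one summary of $2^i$ stream items, and a carry merges and prunes two summaries of the same level. Everything below is an illustrative sketch under stated assumptions: summaries are lists of `(value, rmin, rmax)` tuples, values are distinct, queries pick the entry whose interval deviates least from the target rank, and the names `StreamingQuantiles` and `choose_h` are hypothetical. `n` is assumed known in advance when choosing `h`.

```python
import math
from bisect import bisect_right


def merge(q1, q2):
    """Merge two summaries; missing neighbors are handled as corner cases."""
    def lift(own, other):
        if not other:
            return list(own)
        vals = [v for v, _, _ in other]
        out = []
        for v, rmin, rmax in own:
            t = bisect_right(vals, v)           # other[t-1] <= v < other[t]
            lo = other[t - 1][1] if t > 0 else 0
            hi = other[t][2] - 1 if t < len(other) else other[-1][2]
            out.append((v, rmin + lo, rmax + hi))
        return out
    return sorted(lift(q1, q2) + lift(q2, q1))


def closest(summary, k):
    """Entry whose rank interval deviates least from rank k."""
    return min(summary, key=lambda e: max(k - e[1], e[2] - k))


def prune(summary, h):
    """Keep at most h + 1 entries: both endpoints plus answers to ranks i*n/h."""
    if len(summary) <= h + 1:
        return list(summary)
    n = summary[-1][2]
    picked = {summary[0], summary[-1]}
    for i in range(1, h):
        picked.add(closest(summary, i * n / h))
    return sorted(picked)


def choose_h(n, eps):
    """h >= log2(n) / (2 * eps), so the root error d / (2h) is at most eps."""
    return math.ceil(math.log2(n) / (2 * eps))


class StreamingQuantiles:
    def __init__(self, h):
        self.h = h
        self.levels = []    # levels[i]: summary of 2**i items, or None

    def insert(self, x):
        carry = [(x, 1, 1)]                     # exact leaf summary
        for i in range(len(self.levels)):
            if self.levels[i] is None:
                self.levels[i] = carry
                return
            carry = prune(merge(self.levels[i], carry), self.h)
            self.levels[i] = None
        self.levels.append(carry)

    def query(self, k):
        """Return a stream element whose rank is approximately k."""
        total = []
        for summary in self.levels:
            if summary is not None:
                total = merge(total, summary)   # merging adds no error
        return closest(total, k)[0]
```

At any time there are at most $\lfloor \log_2 n \rfloor + 1$ stored summaries of at most $h + 1$ entries each, matching the $O(\frac{1}{\epsilon} \log^2 n)$ bound. With `h = choose_h(n, eps)`, the rank error of `query` should stay within roughly $\epsilon n$, up to rounding.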
SLIDE 58

Improvements

Instead of a binary tree all the way down, use $1/\epsilon$ nodes at the first level. The depth goes to $\log(\epsilon n)$ and hence the space improves to $O(\frac{1}{\epsilon} \log^2(\epsilon n))$.

[Greenwald-Khanna] gave a more involved scheme that achieves $O(\frac{1}{\epsilon} \log(\epsilon n))$ space. Near-optimal.
SLIDE 59

Part III: Multipass Selection
SLIDE 61

Multipass Selection

Selection in multiple passes:
  • 1 pass requires $\Omega(n)$ space and can be done in $O(n)$ space.
  • $O(1)$ space: $O(\log n)$ passes suffice. Implement QuickSelect in $O(1)$ space.
  • $p$ passes? $O(n^{1/p}\,\mathrm{polylog}(n))$ space suffices. Hence $O(\sqrt{n} \log n)$ for 2 passes. [Munro-Paterson 1980]

Goal: derive the $p$-pass algorithm via approximate quantile summaries.
SLIDE 63

p = 2 case

Goal: Selection of the rank-$k$ element in 2 passes using $\tilde{O}(\sqrt{n})$ space.

Pass 1: store an $\epsilon$-approximate summary with $\epsilon = 1/\sqrt{n}$. Space is $\tilde{O}(1/\epsilon) = \tilde{O}(\sqrt{n})$. The summary allows us to find two numbers $a < b$ that bracket the rank-$k$ element, with $\mathrm{rank}(a) \ge k - O(\epsilon) n$ and $\mathrm{rank}(b) \le k + O(\epsilon) n$.

Pass 2: store all numbers between $a$ and $b$; there are $O(\sqrt{n})$ of them. Compute the exact ranks of $a$ and $b$. How? Find the rank-$k$ element from the stored elements, knowing the ranks of $a$ and $b$. How?
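
One way to answer the two "How?" questions for the second pass is sketched here. It assumes the first pass has already produced $a$ and $b$ with the rank-$k$ element between them (for example by querying the summary at ranks $k - \epsilon n$ and $k + \epsilon n$); the function name `second_pass_select` is hypothetical.

```python
def second_pass_select(stream, k, a, b):
    """Pass 2: return the rank-k element, given a <= (rank-k element) <= b."""
    below = 0       # elements strictly smaller than a, so rank(a) = below + 1
    kept = []       # every element in [a, b]
    for x in stream:
        if x < a:
            below += 1
        elif x <= b:
            kept.append(x)
    kept.sort()
    # The rank-k element overall has rank k - below among the kept elements.
    idx = k - below - 1
    if not 0 <= idx < len(kept):
        raise ValueError("the rank-k element does not lie between a and b")
    return kept[idx]
```

Counting the elements below $a$ gives its exact rank for free, and the memory used is the number of elements in $[a, b]$, which is $O(\epsilon n) = O(\sqrt{n})$ when $\epsilon = 1/\sqrt{n}$.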

SLIDE 65

General p

Goal: Selection of the rank-$k$ element in $p$ passes using $\tilde{O}(n^{1/p})$ space.

Pass 1: store an $\epsilon$-approximate summary with $\epsilon = 1/n^{1/p}$. Space is $\tilde{O}(1/\epsilon) = \tilde{O}(n^{1/p})$. The summary allows us to find two numbers $a < b$ such that $\mathrm{rank}(a) \ge k - O(n^{1-1/p})$ and $\mathrm{rank}(b) \le k + O(n^{1-1/p})$.

In subsequent passes one can restrict attention to the numbers between $a$ and $b$. There are only $O(n^{1-1/p})$ of them. Hence one pass reduces the problem to $n^{1-1/p}$ numbers.

After $(p - 1)$ passes we have $n^{1/p}$ numbers left, and we can store all of them in the $p$'th pass and solve exactly.
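
The shrinking candidate count can be written out as a short recurrence. This is a sketch under the assumption that every pass applies an $\epsilon$-approximate summary with $\epsilon = n^{-1/p}$ to the surviving candidates; $m_i$ (the number of candidates after pass $i$) and $c$ (the constant hidden in the $O(\cdot)$ bracket) are notation introduced here.

$$m_0 = n, \qquad m_i \le c\,\epsilon\, m_{i-1} = c\, n^{-1/p}\, m_{i-1} \quad\Longrightarrow\quad m_{p-1} \le c^{\,p-1}\, n^{1 - (p-1)/p} = c^{\,p-1}\, n^{1/p}.$$

For constant $p$ this is $O(n^{1/p})$, so the survivors fit in memory for the final pass.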
SLIDE 66

Random Order Streams

$\Omega(n)$ lower bound for Selection in the adversarial setting. Can we do better if we assume non-worst-case input?

Random Order Stream Model: the adversary picks some input, and the algorithm sees a random permutation of it. The adversary's power is weakened. Several interesting results are known in this model.

For exact Selection in random order streams:
  • $O(\sqrt{n})$ space in 1 pass suffices with high probability. [Munro-Paterson]
  • $O(\log \log n)$ passes suffice with $O(\mathrm{polylog}(n))$ space whp. [Guha-McGregor]