Tight Lower Bound for Comparison-Based Quantile Summaries



SLIDE 1

Tight Lower Bound for Comparison-Based Quantile Summaries

Pavel Veselý

University of Warwick
8 April 2020
Based on joint work with Graham Cormode (Warwick)

Powered by BeamerikZ

SLIDE 2

Overview of the talk

  • Quantiles & Distributions
  • Big Data Algorithms
  • Streaming Model

[Figure: distribution plot with axis ticks 1 and 0.5; the median is marked]

Pavel Veselý Tight Lower Bound for Quantile Summaries 1 / 10

SLIDE 6

Motivation: Monitoring Latencies of Web Requests

Source: C. Masson, J.E. Rim, and H.K. Lee. DDSketch: A fast and fully-mergeable quantile sketch with relative-error guarantees. PVLDB, 12(12):2195–2205, 2019.

Millions of observations

  • no need to store all observed latencies

What does the distribution look like? What is the median latency?

  • The average latency is too high because of the ∼ 2% of requests with very high latencies
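
A small sketch of why the median is more informative than the average here. The latency values are made up for illustration (98 fast requests and 2 very slow ones); they are not taken from the cited source.

```python
import statistics

# Hypothetical latencies in milliseconds: 98 fast requests, 2 very slow ones.
latencies_ms = [20.0] * 98 + [5000.0] * 2

# The mean is pulled up by the 2% of slow requests.
mean_ms = statistics.mean(latencies_ms)

# The median ignores the slow tail and reflects a typical request.
median_ms = statistics.median(latencies_ms)

print(f"mean = {mean_ms:.1f} ms, median = {median_ms:.1f} ms")
```

With these assumed numbers the mean is roughly six times the median, even though 98% of the requests take 20 ms.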

SLIDE 12

Streaming Model

Motivation: monitoring latencies of requests

Streaming model = one pass over data & limited memory

Streaming algorithm

  • receives data in a stream, item by item
  • uses memory sublinear in N = stream length
  • at the end, computes the answer

Challenges:

  • N very large & not known
  • Data independent
  • Stream ordered arbitrarily
  • No random access to data

Main objective: space

How to summarize the input?

SLIDE 16

Selection Problem & Streaming

  • Input: stream of N numbers
  • Goal: find the k-th smallest
    • e.g.: the median, 99th percentile
  • O(N) time offline algorithm [Blum et al. ’73]
  • Streaming restrictions:
    • just one pass over the data
    • limited memory: o(N)

No streaming algorithm for exact selection

Ω(N) space needed to find the median

[Munro & Paterson ’80, Guha & McGregor ’07]

What about finding an approximate median?
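
For contrast with the streaming restrictions, here is a minimal offline sketch of exact selection. It sorts a full copy of the input, so it runs in O(N log N) time rather than the O(N) of Blum et al., and it keeps all N items in memory, which is exactly what a streaming algorithm cannot afford. The function name `kth_smallest` is chosen here for illustration.

```python
def kth_smallest(items, k):
    """Return the k-th smallest item (1-indexed) by sorting a full copy."""
    if not 1 <= k <= len(items):
        raise ValueError("k must be between 1 and len(items)")
    return sorted(items)[k - 1]

stream = [7, 1, 9, 4, 3, 8, 2]

# Median of 7 items = 4th smallest.
median = kth_smallest(stream, (len(stream) + 1) // 2)
print(median)
```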

SLIDE 22

Approximate Median & Quantiles

How to define an approximate median?

φ-quantile = ⌈φ · N⌉-th smallest element (φ ∈ [0, 1])

  • Median = .5-quantile
  • Quartiles = .25, .5, and .75-quantiles
  • Percentiles = .01, .02, . . . , .99-quantiles

[Figure: sorted data with the .25-quantile, the median, and the .75-quantile marked]

ε-approximate φ-quantile = any φ′-quantile for φ′ ∈ [φ − ε, φ + ε]

  • .01-approximate medians are .49- and .51-quantiles (and items in between)

ε-approximate selection:

  • query k-th smallest → return k′-th smallest for some k′ ∈ [k − εN, k + εN]

Offline summary: sort data & select ∼ 1/(2ε) items

[Figure: selected items along the real line R: min. (0-quantile), 2ε-quantile, 4ε-quantile, . . .]
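
The offline summary can be sketched as follows, assuming all the data fits in memory. The function names and the nearest-quantile query rule are illustrative choices, not taken from the slides. Stored quantiles are 2ε apart, so every φ is within ε of a stored one, up to rounding.

```python
import math

def exact_quantile(sorted_data, phi):
    """Return the ceil(phi * N)-th smallest item (the minimum when phi = 0)."""
    n = len(sorted_data)
    rank = max(1, math.ceil(phi * n))
    return sorted_data[rank - 1]

def build_offline_summary(data, eps):
    """Sort the data and keep the 0-, 2*eps-, 4*eps-, ... quantiles."""
    sorted_data = sorted(data)
    steps = math.ceil(1 / (2 * eps))
    summary = []
    for i in range(steps + 1):
        phi = min(1.0, 2 * eps * i)
        summary.append((phi, exact_quantile(sorted_data, phi)))
    return summary

def query(summary, phi):
    """Answer with the stored item whose quantile is closest to phi."""
    _, item = min(summary, key=lambda entry: abs(entry[0] - phi))
    return item

data = list(range(1, 1001))          # item value equals its rank
summary = build_offline_summary(data, eps=0.05)
approx_median = query(summary, 0.5)  # rank within 0.05 * 1000 of the true median
```

The summary holds about 1/(2ε) items regardless of N, but building it this way requires sorting the whole input first, which is not possible in one streaming pass with sublinear memory.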

SLIDE 28

ε-Approximate Quantile Summaries

Data structure with two operations:

  • Update(x): x = new item from the stream
  • Quantile Query(φ): For φ ∈ [0, 1], return ε-approximate φ-quantile

Additional operations:

  • Rank Query(x):
    • For item x, determine its rank = position in the ordering of the input
  • Merge of two quantile summaries
    • Preserve space bounds, while maintaining accuracy

Quantile summaries → streaming algorithms for:

  • Approximating distributions
  • Equi-depth histograms
  • Streaming Bin Packing [Cormode & V. ’20]
  • . . .

Bottom line: Finding ε-approximate median in data streams
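
The operations above can be sketched as a Python interface. This baseline is an assumption made for illustration: it stores every item, so its answers are exact but its space is linear in N, and it is therefore not a streaming quantile summary. The class and method names are hypothetical.

```python
import bisect
import math

class ExactQuantileSummary:
    """Interface sketch only: stores every item, so space is linear in N."""

    def __init__(self):
        self._items = []  # kept in sorted order

    def update(self, x):
        """Insert a new item from the stream."""
        bisect.insort(self._items, x)

    def quantile_query(self, phi):
        """Return the ceil(phi * N)-th smallest item seen so far."""
        if not self._items:
            raise ValueError("summary is empty")
        rank = max(1, math.ceil(phi * len(self._items)))
        return self._items[rank - 1]

    def rank_query(self, x):
        """Return the number of stored items that are <= x."""
        return bisect.bisect_right(self._items, x)

    def merge(self, other):
        """Return a new summary covering both inputs."""
        merged = ExactQuantileSummary()
        merged._items = sorted(self._items + other._items)
        return merged

s = ExactQuantileSummary()
for x in [5, 1, 4, 2, 3]:
    s.update(x)
print(s.quantile_query(0.5), s.rank_query(4))
```

An actual ε-approximate summary would keep the same interface while discarding most items and answering within ±εN of the true rank.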

SLIDE 33

Approximate Median & Quantiles: Streaming Algorithms

State-of-the-art results (space ∼ # of stored items)

  • O((1/ε) · log εN): deterministic comparison-based [Greenwald & Khanna ’01]
    • maintains a subset of items + bounds on their ranks
  • O((1/ε) · log M): deterministic for integers {1, . . . , M} [Shrivastava et al. ’04]
    • not for floats or strings
  • O(1/ε): randomized [Karnin et al. ’16]
    • const. probability of violating ±εN error guarantee

Many more papers: [Munro & Paterson ’80, Manku et al. ’98, Manku et al. ’99, Hung & Ting ’10, Agarwal et al. ’12, Wang et al. ’13, Felber & Ostrovsky ’15, . . . ]
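
To get a feel for how the three bounds compare, the sketch below evaluates them with all hidden constants dropped and logarithms taken base 2. Both choices are assumptions, so the numbers indicate relative growth only, not actual item counts. The parameter values are arbitrary examples.

```python
import math

def comparison_based_bound(eps, n):
    """(1/eps) * log2(eps * n), constants dropped."""
    return (1 / eps) * math.log2(eps * n)

def integer_universe_bound(eps, m):
    """(1/eps) * log2(m), constants dropped."""
    return (1 / eps) * math.log2(m)

def randomized_bound(eps):
    """1/eps, constants dropped."""
    return 1 / eps

eps, n, m = 0.01, 10**9, 2**32  # example parameters, chosen arbitrarily

print(f"comparison-based: {comparison_based_bound(eps, n):.0f}")
print(f"integer universe: {integer_universe_bound(eps, m):.0f}")
print(f"randomized:       {randomized_bound(eps):.0f}")
```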

SLIDE 39

Approx. Median & Quantiles: Is There a “Perfect” Algorithm?

What would be a “perfect” streaming algorithm?

  • finds ε-approximate median
  • deterministic
  • constant space for fixed ε
    • ideally O(1/ε); or e.g. O(1/ε²)
  • no additional knowledge about items
    • comparison-based

Theorem (Cormode, V. ’20)

There is no perfect streaming algorithm for ε-approximate median

  • Optimal space lower bound Ω((1/ε) · log εN)
  • Matches the result in [Greenwald & Khanna ’01]

SLIDE 46

Approx. Median & Quantiles: Lower Bound Idea

Comparison-based algorithm ⇒ cannot compare with items deleted from the memory

[Figure: real line R with stored items 10 and 50; new item: 30]

How does 30 compare to discarded items between 10 and 50?

Idea: Introduce uncertainty

  • too high uncertainty ⇒ not accurate-enough answers
  • need to show: low uncertainty ⇒ many items stored ⇒ large space needed

→ recursive construction of worst-case stream
→ lower bound Ω((1/ε) · log εN)
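
A toy illustration of the uncertainty, not the recursive construction itself. The two streams below are made up: they agree on the stored items 10 and 50 and on the relative order of all arrivals, so an algorithm that kept only 10 and 50 sees the same comparison outcomes in both, yet the true rank of the new item 30 differs.

```python
def rank(stream, x):
    """Number of items in the stream that are <= x."""
    return sum(1 for item in stream if item <= x)

# Two hypothetical streams with the same stored items (10 and 50)
# but different discarded items in between.
stream_low = [10, 11, 12, 13, 14, 50]   # discarded items all below 30
stream_high = [10, 46, 47, 48, 49, 50]  # discarded items all above 30

stored = [10, 50]
new_item = 30

# Comparisons against the stored items look identical for both streams.
outcomes = [new_item > s for s in stored]

rank_low = rank(stream_low + [new_item], new_item)
rank_high = rank(stream_high + [new_item], new_item)

# The gap equals the number of discarded items between 10 and 50.
uncertainty = rank_low - rank_high
print(outcomes, rank_low, rank_high, uncertainty)
```

The more items are discarded between two stored neighbours, the wider the range of ranks the new item could have, which is why keeping the uncertainty low forces the algorithm to store many items.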

SLIDE 51

Approximating Median & Quantiles: Conclusions & Open Problems

Problem solved:

  • Deterministic algorithms: space Θ((1/ε) · log εN) optimal [Greenwald & Khanna ’01] [Cormode, V. ’20]
  • Randomized algorithms: space Θ(1/ε) optimal (const. probability of too high error) [Karnin et al. ’16]

Future work:

  • Figure out constant factors
  • Randomized algorithm with good expected space, but guaranteed ±εN error
  • A non-trivial lower bound for integers {1, . . . , M}?
    • Or can we do better than O((1/ε) · log M)?
  • Dynamic streams w/ insertions and deletions of items

SLIDE 52

Thank You!