

SLIDE 1

"Probabilistic" Data Structures vs. PostgreSQL
(and similar stuff)

FOSDEM PgDay - January 31, 2020

Tomas Vondra
tomas.vondra@2ndquadrant.com
tv@fuzzy.cz / @fuzzycz

SLIDE 2

HyperLogLog and t-digest

SLIDE 3

Probabilistic data structures ... use hash functions to randomize and compactly represent a set of items. These algorithms use much less memory and have constant query time … and can be easily parallelized. https://dzone.com/articles/introduction-probabilistic-0

SLIDE 4

https://en.wikipedia.org/wiki/Category:Probabilistic_data_structures

  • Bloom Filter

(set membership)

  • HyperLogLog

(count distinct)

  • Count-Min Sketch

(frequency table)

  • MinHash

(set similarity)

  • … random trees, heaps, ...
SLIDE 6

access_log

CREATE TABLE access_log (
    req_date        TIMESTAMPTZ,
    user_id         INTEGER,
    response_time   DOUBLE PRECISION
);

INSERT INTO access_log
SELECT i, 1000000 * random(), 1000 * random()
  FROM generate_series('2019-01-01'::timestamptz,
                       '2020-02-01'::timestamptz,
                       '1 second'::interval) s(i);

SLIDE 7

SELECT COUNT(DISTINCT user_id) FROM access_log

SLIDE 8

COUNT(DISTINCT user_id)

  • has to deduplicate data
  • needs a lot of memory / disk space
  • … so it's slow
  • difficult to precalculate
  • difficult to compute incrementally
  • difficult to parallelize
SLIDE 9

HyperLogLog

SLIDE 10

HyperLogLog

  • when an (accurate) estimate is enough

SELECT COUNT(DISTINCT user_id) FROM access_log;

  • observe the number of leading zeroes in the hash value

○ 1xxxxxxxx => 1/2 of the values
○ 01xxxxxxx => 1/4 of the values
○ …
○ 0000001xx => 1/128 of the values

  • The maximum number of zeroes we've seen is 6. What's the cardinality?
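The leading-zeroes observation is easy to try out. A toy sketch (mine, not from the talk), assuming a 64-bit blake2b hash: with n distinct values, the largest number of leading zero bits seen tends to sit near log2(n).

```python
# Toy illustration (not from the talk): the number of leading zero bits
# in a hash is geometrically distributed, so the maximum over n distinct
# values lands close to log2(n).
import hashlib

HASH_BITS = 64

def h(value):
    # deterministic 64-bit hash (blake2b is an arbitrary choice here)
    digest = hashlib.blake2b(str(value).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def leading_zeros(x, bits=HASH_BITS):
    # zero bits before the first 1 bit ('bits' when x == 0)
    if x == 0:
        return bits
    return bits - x.bit_length()

# seeing k leading zeros has probability 2^-(k+1) per value,
# so with 100k values the maximum tends to sit near log2(100000) ~ 17
max_zeros = max(leading_zeros(h(i)) for i in range(100_000))
print(max_zeros)
```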

SLIDE 11

HyperLogLog

[Diagram: a value is hashed; a prefix of h1(value) selects one of 256 buckets, and the number of leading zeroes in h2(value) (7 in the example) is stored there.]

SLIDE 12

HyperLogLog

[Diagram: the HLL counter as an array of 256 buckets holding per-bucket maxima (4, 5, 6, 3, 5, 8, 5, ...), combined into a single estimate via a harmonic mean plus correction.]
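The whole bucket-and-harmonic-mean pipeline fits in a few lines. A minimal, illustrative HyperLogLog (not the postgresql-hll code; the blake2b hash and bucket count are my assumptions): a hash prefix picks a bucket, each bucket keeps the maximum leading-zero count, and the estimate is a bias-corrected harmonic mean.

```python
# Minimal HyperLogLog sketch, illustrative only (postgresql-hll is the
# real implementation): a hash prefix selects a bucket, each bucket
# stores the maximum leading-zero count seen, and the final estimate
# is a bias-corrected harmonic mean over the buckets.
import hashlib

class MiniHLL:
    def __init__(self, b=10):
        self.b = b                  # bucket index uses b bits
        self.m = 1 << b             # m = 2^b buckets
        self.buckets = [0] * self.m

    def _hash(self, value):
        d = hashlib.blake2b(str(value).encode(), digest_size=8).digest()
        return int.from_bytes(d, "big")

    def add(self, value):
        x = self._hash(value)
        idx = x >> (64 - self.b)                      # prefix -> bucket
        rest = x & ((1 << (64 - self.b)) - 1)         # remaining bits
        rho = (64 - self.b) - rest.bit_length() + 1   # position of first 1 bit
        self.buckets[idx] = max(self.buckets[idx], rho)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)   # bias correction, valid for m >= 128
        harmonic = sum(2.0 ** -r for r in self.buckets)
        return alpha * self.m * self.m / harmonic

hll = MiniHLL()
for i in range(100_000):
    hll.add(i)
print(round(hll.estimate()))   # within a few percent of 100000
```

With m = 1024 buckets the expected relative error is roughly 1.04 / sqrt(m), i.e. about 3%.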

SLIDE 13

https://github.com/citusdata/postgresql-hll

SLIDE 14

Alternative to COUNT(DISTINCT user_id)

  • install the extension

CREATE EXTENSION hll;

  • generate an HLL counter from the user_id values

SELECT hll_add_agg(hll_hash_integer(user_id)) FROM access_log;

  • estimate the cardinality of the user_id values

SELECT #hll_add_agg(hll_hash_integer(user_id)) FROM access_log;

SLIDE 15

Rollup (pre-calculation)

  • create a rollup table

CREATE TABLE access_log_daily (req_day date, req_users hll);

  • pre-calculate daily summaries

INSERT INTO access_log_daily
SELECT date_trunc('day', req_date),
       hll_add_agg(hll_hash_integer(user_id))
  FROM access_log
 GROUP BY 1;

SLIDE 16

Rollup (pre-calculation)

  • use the rollup to summarize a range

SELECT #hll_union_agg(req_users)
  FROM access_log_daily
 WHERE req_day BETWEEN '2019-10-01' AND '2019-10-08';

SLIDE 17

HyperLogLog

  • published in 2007 (an evolution of ideas dating back to ~1990)
  • just an estimate, not an exact cardinality

○ but you can compute the maximum error

  • trade-off between size and accuracy

○ size grows very slowly (with increasing accuracy / number of values)
○ 6kB is more than enough for 1B values with 1% accuracy (1.5kB for 2%, etc.)

  • supports

○ precalculation (rollup)
○ incremental updates
○ ...
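Precalculation and rollups work because HLL counters are mergeable. A standalone toy sketch (again not the extension's code): a counter is just a bucket array, and the union of two counters is their element-wise maximum, which equals the counter built from both inputs at once.

```python
# Toy sketch (not the extension's code) of why rollups work: an HLL
# counter is an array of per-bucket maxima, so the union of two
# counters is their element-wise maximum, which equals the counter
# you would get by feeding both inputs into a single counter.
import hashlib

B = 10                  # bucket index bits
M = 1 << B              # 1024 buckets

def counter(values):
    # build a toy HLL bucket array from an iterable of values
    buckets = [0] * M
    for v in values:
        raw = hashlib.blake2b(str(v).encode(), digest_size=8).digest()
        x = int.from_bytes(raw, "big")
        idx = x >> (64 - B)
        rest = x & ((1 << (64 - B)) - 1)
        buckets[idx] = max(buckets[idx], (64 - B) - rest.bit_length() + 1)
    return buckets

def union(c1, c2):
    # conceptually what hll_union_agg does: keep the larger maximum per bucket
    return [max(a, b) for a, b in zip(c1, c2)]

day1 = counter(range(0, 60_000))         # e.g. users seen on day 1
day2 = counter(range(40_000, 100_000))   # overlapping users on day 2
merged = union(day1, day2)
assert merged == counter(range(0, 100_000))  # same as counting the union directly
```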

SLIDE 18

t-digest

SLIDE 19

percentile_cont / percentile_disc

SELECT percentile_cont(0.95) WITHIN GROUP (ORDER BY response_time) FROM access_log

SLIDE 20

percentile_cont / percentile_disc

SELECT percentile_cont(ARRAY[0.95, 0.99]) WITHIN GROUP (ORDER BY response_time) FROM access_log

SLIDE 21

percentile_cont / percentile_disc

  • accurate results
  • has to store and sort all the data
  • difficult to parallelize
  • can't be precalculated

:-(

SLIDE 22

t-digest

  • published in 2013 by Ted Dunning
  • approximation of CDF (cumulative distribution function)
  • essentially a histogram

○ represented by centroids, i.e. each bin is represented by [mean, count]
○ requires data types with ordering and mean

  • intended for stream processing

○ but hey, each aggregate is processing a stream of data

  • higher accuracy on the tails (close to 0.0 and 1.0)
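To make the centroid idea concrete, a heavily simplified sketch (my toy code, with uniform centroid sizes instead of t-digest's quantile-dependent size function, so it does not reproduce the better tail accuracy):

```python
# Heavily simplified centroid histogram in the spirit of t-digest
# (toy code: uniform centroid sizes rather than the quantile-dependent
# size function, so it lacks t-digest's extra accuracy at the tails).
import random

def build_centroids(values, n_centroids=100):
    # sort the values and collapse equal-sized chunks into (mean, count) pairs
    values = sorted(values)
    size = max(1, len(values) // n_centroids)
    centroids = []
    for i in range(0, len(values), size):
        chunk = values[i:i + size]
        centroids.append((sum(chunk) / len(chunk), len(chunk)))
    return centroids

def percentile(centroids, q):
    # walk the cumulative counts until we cover the q-th fraction
    total = sum(count for _, count in centroids)
    target = q * total
    cum = 0.0
    for mean, count in centroids:
        cum += count
        if cum >= target:
            return mean
    return centroids[-1][0]

random.seed(42)
data = [random.uniform(0, 1000) for _ in range(100_000)]
print(percentile(build_centroids(data), 0.95))   # close to 950
```

The real t-digest keeps centroids near q = 0.0 and q = 1.0 much smaller than those in the middle, which is where the better tail accuracy comes from, and it merges incoming values into existing centroids instead of sorting everything.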
SLIDES 23-30

[Sequence of figures: a distribution of 1000 values and how percentile estimates (80%, 95%, 99%) are read from it.]

SLIDE 31

https://github.com/tvondra/tdigest

SLIDE 32

Trivial example

-- exact
SELECT percentile_cont(0.95) WITHIN GROUP (ORDER BY response_time)
  FROM access_log;

-- approximate
SELECT tdigest_percentile(response_time, 100, 0.95)
  FROM access_log;

SLIDE 33

Precalculation

CREATE TABLE precalc_digests (
    req_day        date,
    req_durations  tdigest
);

INSERT INTO precalc_digests
SELECT date_trunc('day', req_date),
       tdigest(response_time, 100)
  FROM access_log
 GROUP BY 1;

SLIDE 34

t-digest

  • modus operandi similar to HyperLogLog

○ approximation by a simpler / smaller data structure
○ incremental updates
○ possibility to precalculate + rollup

  • result depends on order of input values

○ affects parallel queries

  • no formal accuracy limits

○ better accuracy on the tails
○ worse accuracy close to 0.5 (median)

SLIDE 35

?