Histograms in MariaDB, MySQL and PostgreSQL Sergei Petrunia, - - PowerPoint PPT Presentation

histograms in mariadb mysql and postgresql
SMART_READER_LITE
LIVE PREVIEW

Histograms in MariaDB, MySQL and PostgreSQL Sergei Petrunia, - - PowerPoint PPT Presentation

Histograms in MariaDB, MySQL and PostgreSQL Sergei Petrunia, MariaDB Sergei Petrunia, MariaDB Santa Clara, California | April 24th 27th, 2017 Santa Clara, California | April 24th 27th, 2017 What this talk is about Data statistics


slide-1
SLIDE 1

Histograms in MariaDB, MySQL and PostgreSQL

Sergei Petrunia, MariaDB Sergei Petrunia, MariaDB Santa Clara, California | April 24th – 27th, 2017 Santa Clara, California | April 24th – 27th, 2017

slide-2
SLIDE 2

2

What this talk is about

  • Data statistics histograms in

– MariaDB – MySQL (status so far) – PostgreSQL

  • This is not a competitive comparison

– Rather, a survey

slide-3
SLIDE 3

Histograms and query optimizers

Click to add text Click to add text

slide-4
SLIDE 4

4

Query optimizer needs data statistics

  • Which query plan enumerates fewer rows

– orders->customers or customers->orders?

  • It depends on row counts and condition selectivities
  • Condition selectivity has a big impact on query speed

select * from customers join orders on customer.cust_id=orders.customer_id where customers.balance<1000 and

  • rders.total>10K
slide-5
SLIDE 5

5

Data statistics has a big impact on optimizer

  • A paper "How good are query optimizers, really?"

– Leis et al, VLDB 2015

  • Conclusions section:

– "In contrast to cardinality estimation, the contribution of the cost

model to the overall query performance is limited."

  • This matches our experience
slide-6
SLIDE 6

6

Data statistics usage

  • Need a *cheap* way to answer questions about

– Numbers of rows in the table – Condition selectivity – Column widths – Number of distinct values – …

  • Condition selectivity is the most challenging
slide-7
SLIDE 7

7

Histogram as a compact data summary

  • Partition the value space into buckets
  • Keep an array of (bucket_bounds, n_values)

– Takes O(#buckets) space

slide-8
SLIDE 8

8

Histogram and condition selectivity

col BETWEEN ‘a’ AND ‘b’

  • Sum row counts in the covered

buckets

  • Partially covered bucket?

– Assume a fraction of rows match – This is a source of inaccuracy

  • More buckets – more accurate estimates
slide-9
SLIDE 9

9

Histogram types

  • Different strategies for choosing buckets

– Equi-width – Equi-height – Most Common Values – ...

slide-10
SLIDE 10

10

Equi-width histogram

  • Bucket bounds pre-defined

– Equal, log-scale, etc

  • Easy to understand, easy

to collect.

  • Not very efficient

– Densely and sparsely-populated regions have the same

#buckets

– What if densely-populated regions had more buckets?

slide-11
SLIDE 11

11

Equi-height histogram

  • Pick the bucket bounds such

that each bucket has the same #rows

– Densely populated areas get

more buckets

– Sparsely populated get fewer

buckets

  • Estimation error is limited by

bucket size

– Which is now limited.

slide-12
SLIDE 12

12

Most Common Values histogram

  • Suitable for enum-type

domains

  • All possible values fit in the

histogram

  • Just a list of values and

frequencies

value1 count1 value2 count2 value3 count3 ... ...

slide-13
SLIDE 13

13

Histogram collection algorithms

  • Equi-width

– Find (or guess) min and max value – For each value

  • Find which histogram bin it falls into
  • Increment bin’s counter
  • Equi-height

– Sort the values – First value starts bin #0 – Value at n_values * (1/n_bins) starts bin #2 – Value at n_values * (2/n_bins) starts bin #3 – ...

slide-14
SLIDE 14

14

Histogram collection strategies

  • Scan the whole dataset

– Used by MariaDB – Produces a “perfect” histogram – May be expensive

  • Do random sampling

– Used by PostgreSQL (MySQL going to do it, too?) – Produces imprecise histograms – Non-deterministic results

  • Incremental updates

– hard to do, not used

slide-15
SLIDE 15

15

Summary so far

  • Query optimizers need condition selectivities
  • These are provided by histograms
  • Histograms are compact data summaries
  • Histogram types

– Width-balanced – Height-balanced (better) – Most-Common-Values

  • Histogram collection methods

– Scan the whole dataset – Do random sampling.

slide-16
SLIDE 16

Histograms in MariaDB

Click to add text Click to add text

slide-17
SLIDE 17

17

Histograms in MariaDB

  • Available in MariaDB 10.0

– (Stable since March, 2014)

  • Used in the real world
  • Good for common use cases

– has some limitations

  • Sometimes are called “Engine-Independent Table Statistics”

– Although being engine-independent is not the primary point.

slide-18
SLIDE 18

18

Histogram storage in MariaDB

  • Are stored in mysql.column_stats table

CREATE TABLE mysql.column_stats ( db_name varchar(64) NOT NULL, table_name varchar(64) NOT NULL, column_name varchar(64) NOT NULL, min_value varbinary(255) DEFAULT NULL, max_value varbinary(255) DEFAULT NULL, nulls_ratio decimal(12,4) DEFAULT NULL, avg_length decimal(12,4) DEFAULT NULL, avg_frequency decimal(12,4) DEFAULT NULL, hist_size tinyint unsigned, hist_type enum('SINGLE_PREC_HB','DOUBLE_PREC_HB'), histogram varbinary(255), PRIMARY KEY (db_name,table_name,column_name) );

  • Very compact: max 255 bytes (per column)
slide-19
SLIDE 19

19

Collecting a histogram

set histogram_size=255; set histogram_type='DOUBLE_PREC_HB'; analyze table tbl persistent for all; analyze table tbl persistent for columns (col1, col2) indexes (); +----------+---------+----------+-----------------------------------------+ | Table | Op | Msg_type | Msg_text | +----------+---------+----------+-----------------------------------------+ | test.tbl | analyze | status | Engine-independent statistics collected | | test.tbl | analyze | status | OK | +----------+---------+----------+-----------------------------------------+

  • Manual collection only

set use_stat_tables='preferably'; set optimizer_use_condition_selectivity=4; <query>;

  • Make the optimizer use it
slide-20
SLIDE 20

20

Examining a histogram

select * from mysql.column_stats where table_name='pop1980_cp' and column_name='firstname' *************************** 1. row *************************** db_name: babynames table_name: pop1980_cp column_name: firstname min_value: Aaliyah max_value: Zvi nulls_ratio: 0.0000 avg_length: 6.0551 avg_frequency: 194.4642 hist_size: 32 hist_type: DOUBLE_PREC_HB histogram: C)G[j\\fzzz]3 select decode_histogram(hist_type,histogram) from mysql.column_stats where table_name='pop1980_cp' and column_name='firstname' *************************** 1. row *************************** decode_histogram(hist_type,histogram): 0.00201,0.04048,0.03833,0.03877,0.04158,0.11852,0.07912,0.00218,0.00093,0.03940, 0.07710,0.00124,0.08035,0.11992,0.03877,0.03989,0.24140

slide-21
SLIDE 21

21

Histograms in MariaDB - summary

  • Available since MariaDB 10.0
  • Special ANALYZE command to collect stats

– Does a full table scan – May require a lot of space for big VARCHARs:

MDEV-6529 “EITS ANALYZE uses disk space inefficiently for VARCHAR columns”

  • Not used by the optimizer by default

– Special settings to get optimizer to use them.

slide-22
SLIDE 22

Histograms in PostgreSQL

Click to add text Click to add text

slide-23
SLIDE 23

23

Histograms in PostgreSQL

  • Data statistics

– Fraction of NULL-values – Most common value (MCV) list – Height-balanced histogram (excludes MCV values) – A few other parameters

  • avg_length
  • n_distinct_values
  • ...
  • Collection algorithm

– One-pass random sampling

slide-24
SLIDE 24

24

Collecting histograms in PostgreSQL

  • - Global parameter specifying number of buckets
  • - the default is 100

set default_statistics_target=N;

  • - Can also override for specific columns

alter table tbl alter column_name set statistics N;

  • - Collect the statistics

analyze tablename; # number of inserted/updated/deleted tuples to trigger an ANALYZE autovacuum_analyze_threshold = N # fraction of the table size to add to autovacuum_analyze_threshold # when deciding whether to trigger ANALYZE autovacuum_analyze_scale_factor=N.N postgresql.conf, or per-table

slide-25
SLIDE 25

25

Examining the histogram

select * from pg_stats where tablename='pop1980'; tablename | pop1980 attname | firstname inherited | f null_frac | 0 avg_width | 7 n_distinct | 9320 most_common_vals | {Michael,Jennifer,Christopher,Jason,David,James, Matthew,John,Joshua,Amanda} most_common_freqs | {0.0201067,0.0172667,0.0149067,0.0139,0.0124533, 0.01164,0.0109667,0.0107133,0.0106067,0.01028} histogram_bounds | {Aaliyah,Belinda,Christine,Elsie,Jaron,Kamia, Lindsay,Natasha,Robin,Steven,Zuriel} correlation | 0.0066454 most_common_elems |

slide-26
SLIDE 26

26

Histograms are collected by doing sampling

  • src/backend/commands/analyze.c, std_typanalyze() refers to
  • "Random Sampling for Histogram Construction: How much is enough?”

– Surajit Chaudhuri, Rajeev Motwani, Vivek Narasayya, ACM SIGMOD, 1998.

Histogram size Rows in table (=10^6) Max relative error in bin (=0.5) Error probability (=0.01) Random sample size

  • 100 buckets = 30,000 rows sample
slide-27
SLIDE 27

27

Histogram sampling in PostgreSQL

  • 30K rows are sampled from random locations in the table

– Does a skip scan forward – “Randomly chosen rows in randomly chosen blocks”

  • Choice of Most Common Values

– Sample values that are 25% more common than average – Values that would take more than one histogram bucket. – All seen values are MCVs? No histogram is built.

slide-28
SLIDE 28

28

Beyond single-column histograms

  • Conditions can be correlated

select ... from order_items where shipdate='2015-12-15' AND item_name='christmas light' 'swimsuit'

  • Correlation can have a big effect

– MIN(1/n, 1/m) – (1/n) * (1/m) – 0

  • Multi-column “histograms” are hard
  • “Possible PostgreSQL 10.0 feature: multivariate statistics”
slide-29
SLIDE 29

29

PostgreSQL: Conclusions

  • Collects and uses both

– Height-balanced histogram – Most Common Values list

  • Uses sampling for collection
  • Can run ANALYZE yourself

– Or VACUUM will do it automatically

  • Multivariate stats are in the plans
slide-30
SLIDE 30

30

Histogram test - MariaDB

  • Real world data, people born in 1980

MariaDB [babynames]> analyze select count(*) from pop1980 where firstname='Jennifer'; +------+-------------+---------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | r_rows | filtered | r_filtered | Extra | +------+-------------+---------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+ | 1 | SIMPLE | pop1980 | ALL | NULL | NULL | NULL | NULL | 3444156 | 3444156.00 | 4.69 | 1.70 | Using where | +------+-------------+---------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+ MariaDB [babynames]> analyze select count(*) from pop1980 where firstname='Allison'; +------+-------------+---------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | r_rows | filtered | r_filtered | Extra | +------+-------------+---------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+ | 1 | SIMPLE | pop1980 | ALL | NULL | NULL | NULL | NULL | 3444156 | 3444156.00 | 2.89 | 0.14 | Using where | +------+-------------+---------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+ MariaDB [babynames]> analyze select count(*) from pop1980 where firstname='Jennice'; +------+-------------+---------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | r_rows | filtered | r_filtered | Extra | +------+-------------+---------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+ | 1 | SIMPLE | pop1980 | ALL | NULL | NULL | NULL | NULL | 3444156 | 3444156.00 | 4.69 | 0.00 | Using where | +------+-------------+---------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+

Jennifer 58,381 Allison, 4,868 Jennice, 7

2.75x 20x ?x

slide-31
SLIDE 31

31

Histogram test - PostgreSQL

  • Real world data, people born in 1980

Jennifer 58,381 Allison, 4,868 Jennice, 7

test=# explain analyze select count(*) from pop1980 where firstname='Jennifer'; QUERY PLAN

  • Aggregate (cost=68456.71..68456.71 rows=1 width=8) (actual time=372.593..372.593 rows=1 loops=1)
  • > Seq Scan on pop1980 (cost=0.00..68312.62 rows=57632 width=0) (actual time=0.288..366.058 rows=58591 loops=1)

Filter: ((firstname)::text = 'Jennifer'::text) Rows Removed by Filter: 3385539 Planning time: 0.098 ms Execution time: 372.625 ms test=# explain analyze select count(*) from pop1980 where firstname='Allison'; QUERY PLAN

  • Aggregate (cost=68313.66..68313.67 rows=1 width=8) (actual time=372.415..372.415 rows=1 loops=1)
  • > Seq Scan on pop1980 (cost=0.00..68312.62 rows=413 width=0) (actual time=119.238..372.023 rows=4896 loops=1)

Filter: ((firstname)::text = 'Allison'::text) Rows Removed by Filter: 3439234 Planning time: 0.086 ms Execution time: 372.447 ms test=# explain analyze select count(*) from pop1980 where firstname='Jennice'; QUERY PLAN

  • Aggregate (cost=68313.66..68313.67 rows=1 width=8) (actual time=345.966..345.966 rows=1 loops=1)
  • > Seq Scan on pop1980 (cost=0.00..68312.62 rows=413 width=0) (actual time=190.896..345.961 rows=7 loops=1)

Filter: ((firstname)::text = 'Jennice'::text) Rows Removed by Filter: 3444123 Planning time: 0.388 ms Execution time: 346.010 ms

0.9x 0.08x 103x

slide-32
SLIDE 32

Histograms in MySQL

Click to add text Click to add text

slide-33
SLIDE 33

33

Histograms in MySQL

  • Not available for use in MySQL 8.0.1
  • There are pieces of histogram code, still

– This gives some clues

  • Another feature that uses histograms: P_S statement latencies

– P_S.events_statements_histogram_global

P_S.events_statements_histogram_by_digest

– These are totally different kind of histogram

  • Buckets are log-scale equi-width.
slide-34
SLIDE 34

34

Sampling

  • Currently only has a default implementation only

– Which does a full table scan and “rolls the dice” for each row

  • Assume there will be an InnoDB implementation

enum class enum_sampling_method { SYSTEM }; class handler { ... int ha_sample_init(double sampling_percentage, int sampling_seed, enum_sampling_method sampling_method); int ha_sample_next(uchar *buf); int ha_sample_end();

  • New methods for storage engine API
slide-35
SLIDE 35

35

Histogram storage

  • Will be stored in mysql.column_stats table

CREATE TABLE mysql.column_stats ( database_name varchar(64) COLLATE utf8_bin NOT NULL, table_name varchar(64) COLLATE utf8_bin NOT NULL, column_name varchar(64) COLLATE utf8_bin NOT NULL, histogram json NOT NULL, PRIMARY KEY (database_name,table_name,column_name) );

  • Will be stored as JSON

– No limits on size?

slide-36
SLIDE 36

36

“Singleton” histograms

  • This is what PostgreSQL calls “Most Common Values”

{ "last-updated": "2015-11-04 15:19:51.000000", "histogram-type": "singleton", "null-values": 0.1, // Fraction of NULL values "buckets": [ [ 42, // Value, data type depends on the source column. 0.001978728666831561 // "Cumulative" frequency ], … ] }

slide-37
SLIDE 37

37

Height-balanced histograms

{ "last-updated": "2015-11-04 15:19:51.000000", "histogram-type": "equi-height", "null-values": 0.1, // Fraction of NULL values "buckets": [ [ "bar", // Lower inclusive value "foo", // Upper inclusive value 0.001978728666831561, // Cumulative frequency 10 // Number of distinct values in this bucket ], ... ] }

slide-38
SLIDE 38

38

Height-balanced histograms

... "buckets": [ [ "bar", // Lower inclusive value "foo", // Upper inclusive value 0.001978728666831561, // Cumulative frequency 10 // Number of distinct values in this bucket ], ... ] }

  • Why “upper inclusive value”? To support holes? At cost of 2x histogram size?
  • Why frequency in each bucket? it’s equi-height, so frequencies should be the

same?

  • Per-bucket #distinct is interesting but doesn’t seem high-demand.
slide-39
SLIDE 39

39

Histograms

  • “Singleton”
  • Height-balanced
  • Both kinds store nulls_fraction Fraction of NULLs is stored

– In both kind of histograms so you can’t have both at the same time?

  • Height-balanced allow for “gaps”
  • Each bucket has #distinct (non-optional?)
slide-40
SLIDE 40

40

MySQL histograms summary

  • Seem to be coming in MySQL 8.0
  • Support two types

– “Singleton” – “Height-balanced”

  • Both kinds store null-values so they are not used together?
  • “Height-balanced”

– May have “holes”? – Stores “frequency” for each bin (?)

  • Collection will probably use sampling

– Which has only full scan implementation ATM

slide-41
SLIDE 41

Conclusions

Click to add text Click to add text

slide-42
SLIDE 42

42

Conclusions

  • Histograms are compact data summaries for use by the optimizer
  • PostgreSQL

– Has a mature implementation – Uses sampling and auto-collection

  • MariaDB

– Supports histograms since MariaDB 10.0

  • Compact
  • Height-balanced only

– Need to run ANALYZE manually and set the optimizer to use them

  • MySQL

– Don’t have histograms, still. – Preparing to have them in 8.0 – Will support two kinds

  • Most common values
  • Height-balanced “with gaps” (?)
slide-43
SLIDE 43

43

Thanks!

slide-44
SLIDE 44

44

Rate My Session