Histograms in MariaDB, MySQL and PostgreSQL
Sergei Petrunia, MariaDB Sergei Petrunia, MariaDB Santa Clara, California | April 24th – 27th, 2017 Santa Clara, California | April 24th – 27th, 2017
Histograms in MariaDB, MySQL and PostgreSQL Sergei Petrunia, - - PowerPoint PPT Presentation
Histograms in MariaDB, MySQL and PostgreSQL Sergei Petrunia, MariaDB Sergei Petrunia, MariaDB Santa Clara, California | April 24th 27th, 2017 Santa Clara, California | April 24th 27th, 2017 What this talk is about Data statistics
Sergei Petrunia, MariaDB Sergei Petrunia, MariaDB Santa Clara, California | April 24th – 27th, 2017 Santa Clara, California | April 24th – 27th, 2017
2
– MariaDB – MySQL (status so far) – PostgreSQL
– Rather, a survey
Click to add text Click to add text
4
– orders->customers or customers->orders?
select * from customers join orders on customer.cust_id=orders.customer_id where customers.balance<1000 and
5
– Leis et al, VLDB 2015
– "In contrast to cardinality estimation, the contribution of the cost
6
– Numbers of rows in the table – Condition selectivity – Column widths – Number of distinct values – …
7
– Takes O(#buckets) space
8
– Assume a fraction of rows match – This is a source of inaccuracy
9
– Equi-width – Equi-height – Most Common Values – ...
10
– Equal, log-scale, etc
– Densely and sparsely-populated regions have the same
– What if densely-populated regions had more buckets?
11
– Densely populated areas get
– Sparsely populated get fewer
– Which is now limited.
12
13
– Find (or guess) min and max value – For each value
– Sort the values – First value starts bin #0 – Value at n_values * (1/n_bins) starts bin #2 – Value at n_values * (2/n_bins) starts bin #3 – ...
14
– Used by MariaDB – Produces a “perfect” histogram – May be expensive
– Used by PostgreSQL (MySQL going to do it, too?) – Produces imprecise histograms – Non-deterministic results
– hard to do, not used
15
– Width-balanced – Height-balanced (better) – Most-Common-Values
– Scan the whole dataset – Do random sampling.
Click to add text Click to add text
17
– (Stable since March, 2014)
– has some limitations
– Although being engine-independent is not the primary point.
18
CREATE TABLE mysql.column_stats ( db_name varchar(64) NOT NULL, table_name varchar(64) NOT NULL, column_name varchar(64) NOT NULL, min_value varbinary(255) DEFAULT NULL, max_value varbinary(255) DEFAULT NULL, nulls_ratio decimal(12,4) DEFAULT NULL, avg_length decimal(12,4) DEFAULT NULL, avg_frequency decimal(12,4) DEFAULT NULL, hist_size tinyint unsigned, hist_type enum('SINGLE_PREC_HB','DOUBLE_PREC_HB'), histogram varbinary(255), PRIMARY KEY (db_name,table_name,column_name) );
19
set histogram_size=255; set histogram_type='DOUBLE_PREC_HB'; analyze table tbl persistent for all; analyze table tbl persistent for columns (col1, col2) indexes (); +----------+---------+----------+-----------------------------------------+ | Table | Op | Msg_type | Msg_text | +----------+---------+----------+-----------------------------------------+ | test.tbl | analyze | status | Engine-independent statistics collected | | test.tbl | analyze | status | OK | +----------+---------+----------+-----------------------------------------+
set use_stat_tables='preferably'; set optimizer_use_condition_selectivity=4; <query>;
20
select * from mysql.column_stats where table_name='pop1980_cp' and column_name='firstname' *************************** 1. row *************************** db_name: babynames table_name: pop1980_cp column_name: firstname min_value: Aaliyah max_value: Zvi nulls_ratio: 0.0000 avg_length: 6.0551 avg_frequency: 194.4642 hist_size: 32 hist_type: DOUBLE_PREC_HB histogram: C)G[j\\fzzz]3 select decode_histogram(hist_type,histogram) from mysql.column_stats where table_name='pop1980_cp' and column_name='firstname' *************************** 1. row *************************** decode_histogram(hist_type,histogram): 0.00201,0.04048,0.03833,0.03877,0.04158,0.11852,0.07912,0.00218,0.00093,0.03940, 0.07710,0.00124,0.08035,0.11992,0.03877,0.03989,0.24140
21
– Does a full table scan – May require a lot of space for big VARCHARs:
– Special settings to get optimizer to use them.
Click to add text Click to add text
23
– Fraction of NULL-values – Most common value (MCV) list – Height-balanced histogram (excludes MCV values) – A few other parameters
– One-pass random sampling
24
set default_statistics_target=N;
alter table tbl alter column_name set statistics N;
analyze tablename; # number of inserted/updated/deleted tuples to trigger an ANALYZE autovacuum_analyze_threshold = N # fraction of the table size to add to autovacuum_analyze_threshold # when deciding whether to trigger ANALYZE autovacuum_analyze_scale_factor=N.N postgresql.conf, or per-table
25
select * from pg_stats where tablename='pop1980'; tablename | pop1980 attname | firstname inherited | f null_frac | 0 avg_width | 7 n_distinct | 9320 most_common_vals | {Michael,Jennifer,Christopher,Jason,David,James, Matthew,John,Joshua,Amanda} most_common_freqs | {0.0201067,0.0172667,0.0149067,0.0139,0.0124533, 0.01164,0.0109667,0.0107133,0.0106067,0.01028} histogram_bounds | {Aaliyah,Belinda,Christine,Elsie,Jaron,Kamia, Lindsay,Natasha,Robin,Steven,Zuriel} correlation | 0.0066454 most_common_elems |
26
– Surajit Chaudhuri, Rajeev Motwani, Vivek Narasayya, ACM SIGMOD, 1998.
Histogram size Rows in table (=10^6) Max relative error in bin (=0.5) Error probability (=0.01) Random sample size
27
– Does a skip scan forward – “Randomly chosen rows in randomly chosen blocks”
– Sample values that are 25% more common than average – Values that would take more than one histogram bucket. – All seen values are MCVs? No histogram is built.
28
select ... from order_items where shipdate='2015-12-15' AND item_name='christmas light' 'swimsuit'
– MIN(1/n, 1/m) – (1/n) * (1/m) – 0
29
– Height-balanced histogram – Most Common Values list
– Or VACUUM will do it automatically
30
MariaDB [babynames]> analyze select count(*) from pop1980 where firstname='Jennifer'; +------+-------------+---------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | r_rows | filtered | r_filtered | Extra | +------+-------------+---------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+ | 1 | SIMPLE | pop1980 | ALL | NULL | NULL | NULL | NULL | 3444156 | 3444156.00 | 4.69 | 1.70 | Using where | +------+-------------+---------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+ MariaDB [babynames]> analyze select count(*) from pop1980 where firstname='Allison'; +------+-------------+---------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | r_rows | filtered | r_filtered | Extra | +------+-------------+---------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+ | 1 | SIMPLE | pop1980 | ALL | NULL | NULL | NULL | NULL | 3444156 | 3444156.00 | 2.89 | 0.14 | Using where | +------+-------------+---------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+ MariaDB [babynames]> analyze select count(*) from pop1980 where firstname='Jennice'; +------+-------------+---------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | r_rows | filtered | r_filtered | Extra | +------+-------------+---------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+ | 1 | SIMPLE | pop1980 | ALL | NULL | NULL | NULL | NULL | 3444156 | 3444156.00 | 4.69 | 0.00 | Using where | +------+-------------+---------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+
Jennifer 58,381 Allison, 4,868 Jennice, 7
31
Jennifer 58,381 Allison, 4,868 Jennice, 7
test=# explain analyze select count(*) from pop1980 where firstname='Jennifer'; QUERY PLAN
Filter: ((firstname)::text = 'Jennifer'::text) Rows Removed by Filter: 3385539 Planning time: 0.098 ms Execution time: 372.625 ms test=# explain analyze select count(*) from pop1980 where firstname='Allison'; QUERY PLAN
Filter: ((firstname)::text = 'Allison'::text) Rows Removed by Filter: 3439234 Planning time: 0.086 ms Execution time: 372.447 ms test=# explain analyze select count(*) from pop1980 where firstname='Jennice'; QUERY PLAN
Filter: ((firstname)::text = 'Jennice'::text) Rows Removed by Filter: 3444123 Planning time: 0.388 ms Execution time: 346.010 ms
Click to add text Click to add text
33
– This gives some clues
– P_S.events_statements_histogram_global
– These are totally different kind of histogram
34
– Which does a full table scan and “rolls the dice” for each row
enum class enum_sampling_method { SYSTEM }; class handler { ... int ha_sample_init(double sampling_percentage, int sampling_seed, enum_sampling_method sampling_method); int ha_sample_next(uchar *buf); int ha_sample_end();
35
CREATE TABLE mysql.column_stats ( database_name varchar(64) COLLATE utf8_bin NOT NULL, table_name varchar(64) COLLATE utf8_bin NOT NULL, column_name varchar(64) COLLATE utf8_bin NOT NULL, histogram json NOT NULL, PRIMARY KEY (database_name,table_name,column_name) );
– No limits on size?
36
{ "last-updated": "2015-11-04 15:19:51.000000", "histogram-type": "singleton", "null-values": 0.1, // Fraction of NULL values "buckets": [ [ 42, // Value, data type depends on the source column. 0.001978728666831561 // "Cumulative" frequency ], … ] }
37
{ "last-updated": "2015-11-04 15:19:51.000000", "histogram-type": "equi-height", "null-values": 0.1, // Fraction of NULL values "buckets": [ [ "bar", // Lower inclusive value "foo", // Upper inclusive value 0.001978728666831561, // Cumulative frequency 10 // Number of distinct values in this bucket ], ... ] }
38
... "buckets": [ [ "bar", // Lower inclusive value "foo", // Upper inclusive value 0.001978728666831561, // Cumulative frequency 10 // Number of distinct values in this bucket ], ... ] }
39
– In both kind of histograms so you can’t have both at the same time?
40
– “Singleton” – “Height-balanced”
– May have “holes”? – Stores “frequency” for each bin (?)
– Which has only full scan implementation ATM
Click to add text Click to add text
42
– Has a mature implementation – Uses sampling and auto-collection
– Supports histograms since MariaDB 10.0
– Need to run ANALYZE manually and set the optimizer to use them
– Don’t have histograms, still. – Preparing to have them in 8.0 – Will support two kinds
43
44