histograms in mariadb mysql and postgresql
play

Histograms in MariaDB, MySQL and PostgreSQL Sergei Petrunia, - PowerPoint PPT Presentation

Histograms in MariaDB, MySQL and PostgreSQL Sergei Petrunia, MariaDB Sergei Petrunia, MariaDB Santa Clara, California | April 24th 27th, 2017 Santa Clara, California | April 24th 27th, 2017 What this talk is about Data statistics


  1. Histograms in MariaDB, MySQL and PostgreSQL Sergei Petrunia, MariaDB Sergei Petrunia, MariaDB Santa Clara, California | April 24th – 27th, 2017 Santa Clara, California | April 24th – 27th, 2017

  2. What this talk is about ● Data statistics histograms in – MariaDB – MySQL (status so far) – PostgreSQL ● This is not a competitive comparison – Rather, a survey 2

  3. Histograms and query optimizers Click to add text Click to add text

  4. Query optimizer needs data statistics select * from customers join orders on customer.cust_id=orders.customer_id where customers.balance<1000 and orders.total>10K ● Which query plan enumerates fewer rows – orders->customers or customers->orders? ● It depends on row counts and condition selectivities ● Condition selectivity has a big impact on query speed 4

  5. Data statistics has a big impact on optimizer ● A paper " H ow good are query optimizers, really?" – Leis et al, VLDB 2015 ● Conclusions section: – "In contrast to cardinality estimation, the contribution of the cost model to the overall query performance is limited." ● This matches our experience 5

  6. Data statistics usage ● Need a *cheap* way to answer questions about – Numbers of rows in the table – Condition selectivity – Column widths – Number of distinct values – … ● Condition selectivity is the most challenging 6

  7. Histogram as a compact data summary ● Partition the value space into buckets ● Keep an array of (bucket_bounds, n_values) – Takes O(#buckets) space 7

  8. Histogram and condition selectivity col BETWEEN ‘a’ AND ‘b’ ● Sum row counts in the covered buckets ● Partially covered bucket? – Assume a fraction of rows match – This is a source of inaccuracy ● More buckets – more accurate estimates 8

  9. Histogram types ● Different strategies for choosing buckets – Equi-width – Equi-height – Most Common Values – ... 9

  10. Equi-width histogram ● Bucket bounds pre-defined – Equal, log-scale, etc ● Easy to understand, easy to collect. ● Not very efficient – Densely and sparsely-populated regions have the same #buckets – What if densely-populated regions had more buckets? 10

  11. Equi-height histogram ● Pick the bucket bounds such that each bucket has the same #rows – Densely populated areas get more buckets – Sparsely populated get fewer buckets ● Estimation error is limited by bucket size – Which is now limited. 11

  12. Most Common Values histogram ● Suitable for enum-type value1 count1 domains ● All possible values fit in the value2 count2 histogram value3 count3 ● Just a list of values and ... ... frequencies 12

  13. Histogram collection algorithms ● Equi-width – Find (or guess) min and max value – For each value ● Find which histogram bin it falls into ● Increment bin’s counter ● Equi-height – Sort the values – First value starts bin #0 – Value at n_values * (1/n_bins) starts bin #2 – Value at n_values * (2/n_bins) starts bin #3 – ... 13

  14. Histogram collection strategies ● Scan the whole dataset – Used by MariaDB – Produces a “perfect” histogram – May be expensive ● Do random sampling – Used by PostgreSQL (MySQL going to do it, too?) – Produces imprecise histograms – Non-deterministic results ● Incremental updates – hard to do, not used 14

  15. Summary so far ● Query optimizers need condition selectivities ● These are provided by histograms ● Histograms are compact data summaries ● Histogram types – Width-balanced – Height-balanced (better) – Most-Common-Values ● Histogram collection methods – Scan the whole dataset – Do random sampling. 15

  16. Histograms in MariaDB Click to add text Click to add text

  17. Histograms in MariaDB ● Available in MariaDB 10.0 – (Stable since March, 2014) ● Used in the real world ● Good for common use cases – has some limitations ● Sometimes are called “Engine-Independent Table Statistics” – Although being engine-independent is not the primary point. 17

  18. Histogram storage in MariaDB ● Are stored in mysql.column_stats table CREATE TABLE mysql.column_stats ( db_name varchar(64) NOT NULL, table_name varchar(64) NOT NULL, column_name varchar(64) NOT NULL, min_value varbinary(255) DEFAULT NULL, max_value varbinary(255) DEFAULT NULL, nulls_ratio decimal(12,4) DEFAULT NULL, avg_length decimal(12,4) DEFAULT NULL, avg_frequency decimal(12,4) DEFAULT NULL, hist_size tinyint unsigned, hist_type enum('SINGLE_PREC_HB','DOUBLE_PREC_HB'), histogram varbinary(255), PRIMARY KEY (db_name,table_name,column_name) ); ● Very compact: max 255 bytes (per column) 18

  19. Collecting a histogram ● Manual collection only set histogram_size=255; set histogram_type='DOUBLE_PREC_HB'; analyze table tbl persistent for all; analyze table tbl persistent for columns (col1, col2) indexes (); +----------+---------+----------+-----------------------------------------+ | Table | Op | Msg_type | Msg_text | +----------+---------+----------+-----------------------------------------+ | test.tbl | analyze | status | Engine-independent statistics collected | | test.tbl | analyze | status | OK | +----------+---------+----------+-----------------------------------------+ ● Make the optimizer use it set use_stat_tables='preferably'; set optimizer_use_condition_selectivity=4; <query>; 19

  20. Examining a histogram select * from mysql.column_stats where table_name='pop1980_cp' and column_name='firstname' *************************** 1. row *************************** db_name: babynames table_name: pop1980_cp column_name: firstname min_value: Aaliyah max_value: Zvi nulls_ratio: 0.0000 avg_length: 6.0551 avg_frequency: 194.4642 hist_size: 32 hist_type: DOUBLE_PREC_HB histogram: � ���C)�G�[j\�\�fzz�z]����3� select decode_histogram(hist_type,histogram) from mysql.column_stats where table_name='pop1980_cp' and column_name='firstname' *************************** 1. row *************************** decode_histogram(hist_type,histogram): 0.00201,0.04048,0.03833,0.03877,0.04158,0.11852,0.07912,0.00218,0.00093,0.03940, 0.07710,0.00124,0.08035,0.11992,0.03877,0.03989,0.24140 20

  21. Histograms in MariaDB - summary ● Available since MariaDB 10.0 ● Special ANALYZE command to collect stats – Does a full table scan – May require a lot of space for big VARCHARs: MDEV-6529 “EITS ANALYZE uses disk space inefficiently for VARCHAR columns” ● Not used by the optimizer by default – Special settings to get optimizer to use them. 21

  22. Histograms in PostgreSQL Click to add text Click to add text

  23. Histograms in PostgreSQL ● Data statistics – Fraction of NULL-values – Most common value (MCV) list – Height-balanced histogram (excludes MCV values) – A few other parameters ● avg_length ● n_distinct_values ● ... ● Collection algorithm – One-pass random sampling 23

  24. Collecting histograms in PostgreSQL -- Global parameter specifying number of buckets -- the default is 100 set default_statistics_target=N; -- Can also override for specific columns alter table tbl alter column_name set statistics N; -- Collect the statistics analyze tablename; postgresql.conf, or per-table # number of inserted/updated/deleted tuples to trigger an ANALYZE autovacuum_analyze_threshold = N # fraction of the table size to add to autovacuum_analyze_threshold # when deciding whether to trigger ANALYZE autovacuum_analyze_scale_factor=N.N 24

  25. Examining the histogram select * from pg_stats where tablename='pop1980'; tablename | pop1980 attname | firstname inherited | f null_frac | 0 avg_width | 7 n_distinct | 9320 most_common_vals | {Michael,Jennifer,Christopher,Jason,David,James, Matthew,John,Joshua,Amanda} most_common_freqs | {0.0201067,0.0172667,0.0149067,0.0139,0.0124533, 0.01164,0.0109667,0.0107133,0.0106067,0.01028} histogram_bounds | {Aaliyah,Belinda,Christine,Elsie,Jaron,Kamia, Lindsay,Natasha,Robin,Steven,Zuriel} correlation | 0.0066454 most_common_elems | 25

  26. Histograms are collected by doing sampling src/backend/commands/analyze.c, std_typanalyze() refers to ● "Random Sampling for Histogram Construction: How much is enough?” ● – Surajit Chaudhuri, Rajeev Motwani, Vivek Narasayya, ACM SIGMOD, 1998. Rows in table (=10^6) Histogram size Error probability (=0.01) Random Max relative error in bin (=0.5) sample size ● 100 buckets = 30,000 rows sample 26

  27. Histogram sampling in PostgreSQL ● 30K rows are sampled from random locations in the table – Does a skip scan forward – “Randomly chosen rows in randomly chosen blocks” ● Choice of Most Common Values – Sample values that are 25% more common than average – Values that would take more than one histogram bucket. – All seen values are MCVs? No histogram is built. 27

  28. Beyond single-column histograms ● Conditions can be correlated select ... from order_items where shipdate='2015-12-15' AND item_name='christmas light' 'swimsuit' ● Correlation can have a big effect – MIN(1/n, 1/m) – (1/n) * (1/m) – 0 ● Multi-column “histograms” are hard ● “Possible PostgreSQL 10.0 feature: multivariate statistics” 28

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend