Histograms in MariaDB, MySQL and PostgreSQL Sergei Petrunia, - PowerPoint PPT Presentation

Histograms in MariaDB, MySQL and PostgreSQL Sergei Petrunia, MariaDB Sergei Petrunia, MariaDB Santa Clara, California | April 24th – 27th, 2017 Santa Clara, California | April 24th – 27th, 2017

What this talk is about ● Data statistics histograms in – MariaDB – MySQL (status so far) – PostgreSQL ● This is not a competitive comparison – Rather, a survey 2

Histograms and query optimizers Click to add text Click to add text

Query optimizer needs data statistics select * from customers join orders on customer.cust_id=orders.customer_id where customers.balance<1000 and orders.total>10K ● Which query plan enumerates fewer rows – orders->customers or customers->orders? ● It depends on row counts and condition selectivities ● Condition selectivity has a big impact on query speed 4

Data statistics has a big impact on optimizer ● A paper " H ow good are query optimizers, really?" – Leis et al, VLDB 2015 ● Conclusions section: – "In contrast to cardinality estimation, the contribution of the cost model to the overall query performance is limited." ● This matches our experience 5

Data statistics usage ● Need a *cheap* way to answer questions about – Numbers of rows in the table – Condition selectivity – Column widths – Number of distinct values – … ● Condition selectivity is the most challenging 6

Histogram as a compact data summary ● Partition the value space into buckets ● Keep an array of (bucket_bounds, n_values) – Takes O(#buckets) space 7

Histogram and condition selectivity col BETWEEN ‘a’ AND ‘b’ ● Sum row counts in the covered buckets ● Partially covered bucket? – Assume a fraction of rows match – This is a source of inaccuracy ● More buckets – more accurate estimates 8

Histogram types ● Different strategies for choosing buckets – Equi-width – Equi-height – Most Common Values – ... 9

Equi-width histogram ● Bucket bounds pre-defined – Equal, log-scale, etc ● Easy to understand, easy to collect. ● Not very efficient – Densely and sparsely-populated regions have the same #buckets – What if densely-populated regions had more buckets? 10

Equi-height histogram ● Pick the bucket bounds such that each bucket has the same #rows – Densely populated areas get more buckets – Sparsely populated get fewer buckets ● Estimation error is limited by bucket size – Which is now limited. 11

Most Common Values histogram ● Suitable for enum-type value1 count1 domains ● All possible values fit in the value2 count2 histogram value3 count3 ● Just a list of values and ... ... frequencies 12

Histogram collection algorithms ● Equi-width – Find (or guess) min and max value – For each value ● Find which histogram bin it falls into ● Increment bin’s counter ● Equi-height – Sort the values – First value starts bin #0 – Value at n_values * (1/n_bins) starts bin #2 – Value at n_values * (2/n_bins) starts bin #3 – ... 13

Histogram collection strategies ● Scan the whole dataset – Used by MariaDB – Produces a “perfect” histogram – May be expensive ● Do random sampling – Used by PostgreSQL (MySQL going to do it, too?) – Produces imprecise histograms – Non-deterministic results ● Incremental updates – hard to do, not used 14

Summary so far ● Query optimizers need condition selectivities ● These are provided by histograms ● Histograms are compact data summaries ● Histogram types – Width-balanced – Height-balanced (better) – Most-Common-Values ● Histogram collection methods – Scan the whole dataset – Do random sampling. 15

Histograms in MariaDB Click to add text Click to add text

Histograms in MariaDB ● Available in MariaDB 10.0 – (Stable since March, 2014) ● Used in the real world ● Good for common use cases – has some limitations ● Sometimes are called “Engine-Independent Table Statistics” – Although being engine-independent is not the primary point. 17

Histogram storage in MariaDB ● Are stored in mysql.column_stats table CREATE TABLE mysql.column_stats ( db_name varchar(64) NOT NULL, table_name varchar(64) NOT NULL, column_name varchar(64) NOT NULL, min_value varbinary(255) DEFAULT NULL, max_value varbinary(255) DEFAULT NULL, nulls_ratio decimal(12,4) DEFAULT NULL, avg_length decimal(12,4) DEFAULT NULL, avg_frequency decimal(12,4) DEFAULT NULL, hist_size tinyint unsigned, hist_type enum('SINGLE_PREC_HB','DOUBLE_PREC_HB'), histogram varbinary(255), PRIMARY KEY (db_name,table_name,column_name) ); ● Very compact: max 255 bytes (per column) 18

Collecting a histogram ● Manual collection only set histogram_size=255; set histogram_type='DOUBLE_PREC_HB'; analyze table tbl persistent for all; analyze table tbl persistent for columns (col1, col2) indexes (); +----------+---------+----------+-----------------------------------------+ | Table | Op | Msg_type | Msg_text | +----------+---------+----------+-----------------------------------------+ | test.tbl | analyze | status | Engine-independent statistics collected | | test.tbl | analyze | status | OK | +----------+---------+----------+-----------------------------------------+ ● Make the optimizer use it set use_stat_tables='preferably'; set optimizer_use_condition_selectivity=4; <query>; 19

Examining a histogram select * from mysql.column_stats where table_name='pop1980_cp' and column_name='firstname' *************************** 1. row *************************** db_name: babynames table_name: pop1980_cp column_name: firstname min_value: Aaliyah max_value: Zvi nulls_ratio: 0.0000 avg_length: 6.0551 avg_frequency: 194.4642 hist_size: 32 hist_type: DOUBLE_PREC_HB histogram: � ��C)�G�[j\�\�fzz�z]��3� select decode_histogram(hist_type,histogram) from mysql.column_stats where table_name='pop1980_cp' and column_name='firstname' *************************** 1. row *************************** decode_histogram(hist_type,histogram): 0.00201,0.04048,0.03833,0.03877,0.04158,0.11852,0.07912,0.00218,0.00093,0.03940, 0.07710,0.00124,0.08035,0.11992,0.03877,0.03989,0.24140 20

Histograms in MariaDB - summary ● Available since MariaDB 10.0 ● Special ANALYZE command to collect stats – Does a full table scan – May require a lot of space for big VARCHARs: MDEV-6529 “EITS ANALYZE uses disk space inefficiently for VARCHAR columns” ● Not used by the optimizer by default – Special settings to get optimizer to use them. 21

Histograms in PostgreSQL Click to add text Click to add text

Histograms in PostgreSQL ● Data statistics – Fraction of NULL-values – Most common value (MCV) list – Height-balanced histogram (excludes MCV values) – A few other parameters ● avg_length ● n_distinct_values ● ... ● Collection algorithm – One-pass random sampling 23

Collecting histograms in PostgreSQL -- Global parameter specifying number of buckets -- the default is 100 set default_statistics_target=N; -- Can also override for specific columns alter table tbl alter column_name set statistics N; -- Collect the statistics analyze tablename; postgresql.conf, or per-table # number of inserted/updated/deleted tuples to trigger an ANALYZE autovacuum_analyze_threshold = N # fraction of the table size to add to autovacuum_analyze_threshold # when deciding whether to trigger ANALYZE autovacuum_analyze_scale_factor=N.N 24

Examining the histogram select * from pg_stats where tablename='pop1980'; tablename | pop1980 attname | firstname inherited | f null_frac | 0 avg_width | 7 n_distinct | 9320 most_common_vals | {Michael,Jennifer,Christopher,Jason,David,James, Matthew,John,Joshua,Amanda} most_common_freqs | {0.0201067,0.0172667,0.0149067,0.0139,0.0124533, 0.01164,0.0109667,0.0107133,0.0106067,0.01028} histogram_bounds | {Aaliyah,Belinda,Christine,Elsie,Jaron,Kamia, Lindsay,Natasha,Robin,Steven,Zuriel} correlation | 0.0066454 most_common_elems | 25

Histograms are collected by doing sampling src/backend/commands/analyze.c, std_typanalyze() refers to ● "Random Sampling for Histogram Construction: How much is enough?” ● – Surajit Chaudhuri, Rajeev Motwani, Vivek Narasayya, ACM SIGMOD, 1998. Rows in table (=10^6) Histogram size Error probability (=0.01) Random Max relative error in bin (=0.5) sample size ● 100 buckets = 30,000 rows sample 26

Histogram sampling in PostgreSQL ● 30K rows are sampled from random locations in the table – Does a skip scan forward – “Randomly chosen rows in randomly chosen blocks” ● Choice of Most Common Values – Sample values that are 25% more common than average – Values that would take more than one histogram bucket. – All seen values are MCVs? No histogram is built. 27

Beyond single-column histograms ● Conditions can be correlated select ... from order_items where shipdate='2015-12-15' AND item_name='christmas light' 'swimsuit' ● Correlation can have a big effect – MIN(1/n, 1/m) – (1/n) * (1/m) – 0 ● Multi-column “histograms” are hard ● “Possible PostgreSQL 10.0 feature: multivariate statistics” 28

Histograms in MariaDB, MySQL and PostgreSQL Sergei Petrunia, - PowerPoint PPT Presentation

Histograms in MariaDB, MySQL and PostgreSQL Sergei Petrunia, MariaDB Sergei Petrunia, MariaDB Santa Clara, California | April 24th 27th, 2017 Santa Clara, California | April 24th 27th, 2017 What this talk is about Data statistics

MariaDB State of MariaDB Michael Monty Widenius MariaDB hacker monty@askmonty.org

Performance Guide for MySQL Cluster Mikael Ronstrm, Ph.D Senior MySQL Architect Sun

FOSDEM MariaDB 10 - The Spider Storage Engine (a sharding plugin for MySQL/MariaDB) Stphane

MySQL Proxy Making MySQL more flexible Jan Kneschke jan@mysql.com MySQL Proxy proxy-servers

MySQL Replication Update MySQL 5.5 (GA) & MySQL 5.6.2 (Dev. Milestone) Lars Thalmann

MySQL Proxy meets: binlogs Jan Kneschke MySQL Enterprise Tools mailto: jan@mysql.com What is

MySQL Group Replication & MySQL InnoDB Cluster Production Ready? Kenny Gryp MySQL Practice

ALTER TABLE IMPROVEMENTS IN MARIA DB SERVER Marko Mkel Lead Developer InnoDB MariaDB

Service 1 Service 2 Service 3 Postgresql/ MySQL Postgresql/ MySQL Data size: ~100GB to a few

MariaDB MySQL Corrected Michael Monty Widenius About Monty Original author of MySQL

MySQL Cluster und MySQL Proxy Alles Online Diese Slides gibt es auch unter:

How Microsoft Built MySQL, PostgreSQL and MariaDB for the Cloud Santa Clara, California | April

Matching Histograms to Data 1. Consider the following four histograms. Listed below are five

Chapter 18: The Normal Approximation for Probability Histograms The histograms we saw in Chapter

XL1G: Create Histograms using Excel 2013 Functions V0H 3/31/2017 XL1G: 0H Create Histograms

PHP and MySQL Dr. E. Benoist Winter Term 2006-2007 PHP and MySQL 1 PHP and MySQL Introduction

State and Territory Single Points of Contact (SPOC) Meeting Presentation Track | Palo Verde I

C3.1 AND/OR Graphs The heuristics we will consider can all be understood in terms of

Multiple and Coordinated Views Hauptseminar Information Visualization - Wintersemester

THERMODYNAMICS Course No: ME 209 Department: Mechanical Engineering Instructor: U. N.

Effective Field Theories for Quarkonium recent progress Antonio Vairo Technische Universit

Iterative Decoders Robust to Threshold Voltage Uncertainty D. Declercq, B. Vasic, B.

Comme r c ial L itigation E ve ning se minar : Philip Mo se r QC Chair Spe ake r s: Gre

Yong Tang University of Tokyo KEK, Dec 4-7, 2018 K.Nakayama & Y. Tang , 1810.04975

Histograms in MariaDB, MySQL and PostgreSQL Sergei Petrunia, - PowerPoint PPT Presentation

Histograms in MariaDB, MySQL and PostgreSQL Sergei Petrunia, MariaDB Sergei Petrunia, MariaDB Santa Clara, California | April 24th 27th, 2017 Santa Clara, California | April 24th 27th, 2017 What this talk is about Data statistics

MariaDB State of MariaDB Michael Monty Widenius MariaDB hacker monty@askmonty.org

Performance Guide for MySQL Cluster Mikael Ronstrm, Ph.D Senior MySQL Architect Sun

FOSDEM MariaDB 10 - The Spider Storage Engine (a sharding plugin for MySQL/MariaDB) Stphane

MySQL Proxy Making MySQL more flexible Jan Kneschke jan@mysql.com MySQL Proxy proxy-servers

MySQL Replication Update MySQL 5.5 (GA) &amp; MySQL 5.6.2 (Dev. Milestone) Lars Thalmann

MySQL Proxy meets: binlogs Jan Kneschke MySQL Enterprise Tools mailto: jan@mysql.com What is

MySQL Group Replication &amp; MySQL InnoDB Cluster Production Ready? Kenny Gryp MySQL Practice

ALTER TABLE IMPROVEMENTS IN MARIA DB SERVER Marko Mkel Lead Developer InnoDB MariaDB

Service 1 Service 2 Service 3 Postgresql/ MySQL Postgresql/ MySQL Data size: ~100GB to a few

MariaDB MySQL Corrected Michael Monty Widenius About Monty Original author of MySQL

MySQL Cluster und MySQL Proxy Alles Online Diese Slides gibt es auch unter:

How Microsoft Built MySQL, PostgreSQL and MariaDB for the Cloud Santa Clara, California | April

Matching Histograms to Data 1. Consider the following four histograms. Listed below are five

Chapter 18: The Normal Approximation for Probability Histograms The histograms we saw in Chapter

XL1G: Create Histograms using Excel 2013 Functions V0H 3/31/2017 XL1G: 0H Create Histograms

PHP and MySQL Dr. E. Benoist Winter Term 2006-2007 PHP and MySQL 1 PHP and MySQL Introduction

State and Territory Single Points of Contact (SPOC) Meeting Presentation Track | Palo Verde I

C3.1 AND/OR Graphs The heuristics we will consider can all be understood in terms of

Multiple and Coordinated Views Hauptseminar Information Visualization - Wintersemester

THERMODYNAMICS Course No: ME 209 Department: Mechanical Engineering Instructor: U. N.

Effective Field Theories for Quarkonium recent progress Antonio Vairo Technische Universit

Iterative Decoders Robust to Threshold Voltage Uncertainty D. Declercq, B. Vasic, B.

Comme r c ial L itigation E ve ning se minar : Philip Mo se r QC Chair Spe ake r s: Gre

Yong Tang University of Tokyo KEK, Dec 4-7, 2018 K.Nakayama &amp; Y. Tang , 1810.04975

MySQL Replication Update MySQL 5.5 (GA) & MySQL 5.6.2 (Dev. Milestone) Lars Thalmann

MySQL Group Replication & MySQL InnoDB Cluster Production Ready? Kenny Gryp MySQL Practice

Yong Tang University of Tokyo KEK, Dec 4-7, 2018 K.Nakayama & Y. Tang , 1810.04975