Billion Goods in Few Categories: how Histograms Save a Life? - - PowerPoint PPT Presentation

billion goods in few categories how histograms save a life
SMART_READER_LITE
LIVE PREVIEW

Billion Goods in Few Categories: how Histograms Save a Life? - - PowerPoint PPT Presentation

Billion Goods in Few Categories: how Histograms Save a Life? November, 7, 2018 Sveta Smirnova Table of Contents The Case The Cardinality: Two Levels ANALYZE TABLE Limitations Solutions in Percona Server 5.7 Histograms


slide-1
SLIDE 1

Billion Goods in Few Categories: how Histograms Save a Life?

November, 7, 2018

Sveta Smirnova

slide-2
SLIDE 2
  • The Case
  • The Cardinality: Two Levels
  • ANALYZE TABLE Limitations
  • Solutions in Percona Server 5.7
  • Histograms
  • Conclusion

Table of Contents

2

slide-3
SLIDE 3
  • MySQL Support engineer
  • Author of
  • MySQL Troubleshooting
  • JSON UDF functions
  • FILTER clause for MySQL
  • Speaker
  • Percona Live, OOW, Fosdem,

DevConf, HighLoad... Sveta Smirnova

3

slide-4
SLIDE 4
  • Hardware
  • Wise options
  • Optimized queries
  • Brain

Everything can be Resolved!

4

slide-5
SLIDE 5
  • This talk is about
  • How I spent last two years
  • Resolving the same issue
  • For different customers

Not Everything

5

slide-6
SLIDE 6
  • This talk is about
  • How I spent last two years
  • Resolving the same issue
  • For different customers
  • Task was to speed up the query

Not Everything

5

slide-7
SLIDE 7
  • Specific data distribution
  • Access on different fields
  • ON clause
  • WHERE clause
  • GROUP BY
  • ORDER BY
  • Index cannot be used effectively

Not All the Queries can be Optimized

6

slide-8
SLIDE 8
  • Topic based on real Support cases
  • Couple of them are still in progress

Disclaimer

7

slide-9
SLIDE 9
  • Topic based on real Support cases
  • All examples are 100% fake
  • They created such that
  • No customer can be identified
  • Everything generated

Table names Column names Data

  • Use case itself is fictional

Disclaimer

7

slide-10
SLIDE 10
  • Topic based on real Support cases
  • All examples are 100% fake
  • All examples are simplified
  • Only columns, required to show the issue
  • Everything extra removed
  • Real tables usually store much more data

Disclaimer

7

slide-11
SLIDE 11
  • Topic based on real Support cases
  • All examples are 100% fake
  • All examples are simplified
  • All disasters happened with version 5.7

Disclaimer

7

slide-12
SLIDE 12

The Case

slide-13
SLIDE 13
  • categories
  • Less than 20 rows

Two tables

9

slide-14
SLIDE 14
  • categories
  • Less than 20 rows
  • goods
  • More than 1M rows
  • 20 unique cat id values
  • Many other fields

Price Date: added, last updated, etc. Characteristics Store ...

Two tables

9

slide-15
SLIDE 15

select * from goods join categories

  • n

(categories.id=goods.cat_id) where date_added between ’2018-07-01’ and ’2018-08-01’ and cat_id in (16,11) and price >= 1000 and <=10000 [ and ... ] [ GROUP BY ... [ORDER BY ... [ LIMIT ...]]] ;

JOIN

10

slide-16
SLIDE 16
  • Select from the Small Table

Option 1: Select from the Small Table First

11

slide-17
SLIDE 17
  • Select from the Small Table
  • For each cat id select from the large table

Option 1: Select from the Small Table First

11

slide-18
SLIDE 18
  • Select from the Small Table
  • For each cat id select from the large table
  • Filter result on date added[ and price[...]]

Option 1: Select from the Small Table First

11

slide-19
SLIDE 19
  • Select from the Small Table
  • For each cat id select from the large table
  • Filter result on date added[ and price[...]]
  • Slow with many items in the category

Option 1: Select from the Small Table First

11

slide-20
SLIDE 20
  • Filter rows by date added[ and price[...]]

Option 2: Select from the Large Table First

12

slide-21
SLIDE 21
  • Filter rows by date added[ and price[...]]
  • Get cat id values

Option 2: Select from the Large Table First

12

slide-22
SLIDE 22
  • Filter rows by date added[ and price[...]]
  • Get cat id values
  • Retrieve rows from the small table

Option 2: Select from the Large Table First

12

slide-23
SLIDE 23
  • Filter rows by date added[ and price[...]]
  • Get cat id values
  • Retrieve rows from the small table
  • Slow if number of rows, filtered by

date added, is larger than number of goods

in the selected categories

Option 2: Select from the Large Table First

12

slide-24
SLIDE 24
  • CREATE INDEX index everything

(cat id, date added[, price[, ...]])

  • It resolves the issue

What if use Combined Indexes?

13

slide-25
SLIDE 25
  • CREATE INDEX index everything

(cat id, date added[, price[, ...]])

  • It resolves the issue
  • But not in all cases

What if use Combined Indexes?

13

slide-26
SLIDE 26
  • Maintenance cost
  • Slower INSERT/UPDATE/DELETE
  • Disk space

The Problem

14

slide-27
SLIDE 27
  • Maintenance cost
  • Slower INSERT/UPDATE/DELETE
  • Disk space
  • Tables may have wrong cardinality

The Problem

14

slide-28
SLIDE 28

The Cardinality: Two Levels

slide-29
SLIDE 29
  • Optimizer
  • Engine
  • TokuDB
  • InnoDB
  • Any

MySQL is Layered Architecture

16

slide-30
SLIDE 30
  • Stores statistics on disk
  • mysql.innodb table stats
  • mysql.innodb index stats

InnoDB: Overview

17

slide-31
SLIDE 31
  • Stores statistics on disk
  • Returns statistics to Optimizer

InnoDB: Overview

17

slide-32
SLIDE 32
  • Stores statistics on disk
  • Returns statistics to Optimizer
  • In ha innobase::info
  • handler/ha innodb.cc

InnoDB: Overview

17

slide-33
SLIDE 33
  • Stores statistics on disk
  • Returns statistics to Optimizer
  • In ha innobase::info
  • handler/ha innodb.cc
  • When opens table
  • flag = HA STATUS CONST
  • Reads data from disk
  • Stores it in memory

InnoDB: Overview

17

slide-34
SLIDE 34
  • Stores statistics on disk
  • Returns statistics to Optimizer
  • In ha innobase::info
  • handler/ha innodb.cc
  • When opens table
  • Subsequent table accesses
  • flag = HA STATUS VARIABLE
  • Statistics from memory
  • Up to date Primary Key data

InnoDB: Overview

17

slide-35
SLIDE 35
  • Table created with option STATS AUTO RECALC = 0
  • Before ANALYZE TABLE

mysql> show index from test\G ... *************************** 2. row *************************** Table: test Non_unique: 1 Key_name: f1 Seq_in_index: 1 Column_name: f1 Collation: A Cardinality: 64 ...

InnoDB: Flow

18

slide-36
SLIDE 36
  • Table created with option STATS AUTO RECALC = 0
  • After ANALYZE TABLE

mysql> show index from test\G ... *************************** 2. row *************************** Table: test Non_unique: 1 Key_name: f1 Seq_in_index: 1 Column_name: f1 Collation: A Cardinality: 2 ...

InnoDB: Flow

18

slide-37
SLIDE 37
  • Table created with option STATS AUTO RECALC = 0
  • After inserting rows

mysql> show index from test\G ... *************************** 2. row *************************** Table: test Non_unique: 1 Key_name: f1 Seq_in_index: 1 Column_name: f1 Collation: A Cardinality: 16 ...

InnoDB: Flow

18

slide-38
SLIDE 38
  • Table created with option STATS AUTO RECALC = 0
  • After restart

mysql> show index from test\G ... *************************** 2. row *************************** Table: test Non_unique: 1 Key_name: f1 Seq_in_index: 1 Column_name: f1 Collation: A Cardinality: 2 ...

InnoDB: Flow

18

slide-39
SLIDE 39
  • Takes data from the engine

Optimizer: Overview

19

slide-40
SLIDE 40
  • Takes data from the engine
  • Class ha statistics
  • sql/handler.h

Optimizer: Overview

19

slide-41
SLIDE 41
  • Takes data from the engine
  • Class ha statistics
  • sql/handler.h
  • Does not have Cardinality field at all

Optimizer: Overview

19

slide-42
SLIDE 42
  • Takes data from the engine
  • Class ha statistics
  • sql/handler.h
  • Does not have Cardinality field at all
  • Uses formula to calculate Cardinality

Optimizer: Overview

19

slide-43
SLIDE 43
  • n rows: number of rows in the table
  • Naturally up to date
  • Constantly changing!

Optimizer: Formula

20

slide-44
SLIDE 44
  • n rows: number of rows in the table
  • Naturally up to date
  • Constantly changing!
  • rec per key: number of duplicates per key
  • Calculated by InnoDB in time of ANALYZE
  • rec per key = n rows / unique values
  • Do not change!

Optimizer: Formula

20

slide-45
SLIDE 45
  • n rows: number of rows in the table
  • Naturally up to date
  • Constantly changing!
  • rec per key: number of duplicates per key
  • Calculated by InnoDB in time of ANALYZE
  • rec per key = n rows / unique values
  • Do not change!
  • Cardinality = n rows / rec per key

Optimizer: Formula

20

slide-46
SLIDE 46
  • Engine stores persistent statistics

TokuDB InnoDB Storage Files Tables Statistics As Calculated As Calculated Row Count Persistent Only in Memory

Persistent Statistics Are Not Persistent

21

slide-47
SLIDE 47
  • Engine stores persistent statistics

TokuDB InnoDB Storage Files Tables Statistics As Calculated As Calculated Row Count Persistent Only in Memory

  • Optimizer calculates Cardinality every time

when accesses engine statistics

Persistent Statistics Are Not Persistent

21

slide-48
SLIDE 48
  • Engine stores persistent statistics

TokuDB InnoDB Storage Files Tables Statistics As Calculated As Calculated Row Count Persistent Only in Memory

  • Optimizer calculates Cardinality every time

when accesses engine statistics

  • Weak user control

Persistent Statistics Are Not Persistent

21

slide-49
SLIDE 49

ANALYZE TABLE Limitations

slide-50
SLIDE 50
  • Counts number of pages in the table

How ANALYZE TABLE Works with InnoDB?

23

slide-51
SLIDE 51
  • Counts number of pages in the table
  • Takes STATS SAMPLE PAGES

How ANALYZE TABLE Works with InnoDB?

23

slide-52
SLIDE 52
  • Counts number of pages in the table
  • Takes STATS SAMPLE PAGES
  • Counts number of unique values in

secondary index in these pages

How ANALYZE TABLE Works with InnoDB?

23

slide-53
SLIDE 53
  • Counts number of pages in the table
  • Takes STATS SAMPLE PAGES
  • Counts number of unique values in

secondary index in these pages

  • Divides number of pages in the table on

number of sample pages and multiplies result on number of unique values

How ANALYZE TABLE Works with InnoDB?

23

slide-54
SLIDE 54
  • Number of pages in the table: 20,000
  • STATS SAMPLE PAGES: 20 (default)
  • Unique values in the secondary index:
  • In sample pages: 10
  • In the table: 11

Example

24

slide-55
SLIDE 55
  • Number of pages in the table: 20,000
  • STATS SAMPLE PAGES: 20 (default)
  • Unique values in the secondary index:
  • In sample pages: 10
  • In the table: 11
  • Cardinality: 20,000 * 10 / 20 = 10,000

Example

24

slide-56
SLIDE 56
  • Number of pages in the table: 20,000
  • STATS SAMPLE PAGES: 5,000
  • Unique values in the secondary index:
  • In sample pages: 10
  • In the table: 11
  • Cardinality: 20,000 * 10 / 5,000 = 40

Example 2

25

slide-57
SLIDE 57
  • Time consuming

mysql> select count(*) from goods; +----------+ | count(*) | +----------+ | 80303000 | +----------+ 1 row in set (35.95 sec)

Use Larger STATS SAMPLE PAGES?

26

slide-58
SLIDE 58
  • Time consuming
  • With default STATS SAMPLE PAGES

mysql> analyze table goods; +------------+---------+----------+----------+ | Table | Op | Msg_type | Msg_text | +------------+---------+----------+----------+ | test.goods | analyze | status | OK | +------------+---------+----------+----------+ 1 row in set (0.32 sec)

Use Larger STATS SAMPLE PAGES?

26

slide-59
SLIDE 59
  • Time consuming
  • With bigger number

mysql> alter table goods STATS_SAMPLE_PAGES=5000; Query OK, 0 rows affected (0.04 sec) Records: 0 Duplicates: 0 Warnings: 0 mysql> analyze table goods; +------------+---------+----------+----------+ | Table | Op | Msg_type | Msg_text | +------------+---------+----------+----------+ | test.goods | analyze | status | OK | +------------+---------+----------+----------+ 1 row in set (27.13 sec)

Use Larger STATS SAMPLE PAGES?

26

slide-60
SLIDE 60
  • Time consuming
  • With bigger number
  • 27.13/0.32 = 85 times slower!

Use Larger STATS SAMPLE PAGES?

26

slide-61
SLIDE 61

User Manual claims it does not During the analysis, the table is locked with a read lock for InnoDB and MyISAM.

Does ANALYZE TABLE Block Reads?

27

slide-62
SLIDE 62

User Manual claims it does not

  • But!

Does ANALYZE TABLE Block Reads?

27

slide-63
SLIDE 63

User Manual claims it does not Sometimes it blocks all subsequent queries

+------+-------------------------+---------------------------------+ | Time | State | Info | +------+-------------------------+---------------------------------+ | 32 | Writing to net | select * from t where c > ’%0%’ | | 12 | Waiting for table flush | select * from test.t where i=1 | | 12 | Waiting for table flush | select * from test.t where i=2 | | 12 | Waiting for table flush | select * from test.t where i=3 | | 11 | Waiting for table flush | select * from test.t where i=7 | | 10 | Waiting for table flush | select * from test.t where i=11 | ...

Does ANALYZE TABLE Block Reads?

27

slide-64
SLIDE 64

Is not a solution

Simply Increasing STATS SAMPLE PAGES

28

slide-65
SLIDE 65

Solutions in Percona Server 5.7

slide-66
SLIDE 66

Considered as a bug

  • jira.percona.com/browse/PS-2503
  • lp:1704195
  • bugs.mysql.com/87065

Blocking ANALYZE TABLE

30

slide-67
SLIDE 67

Considered as a bug

  • Fixed in Percona Server

5.6.38-83.0/5.7.20-18

Blocking ANALYZE TABLE

30

slide-68
SLIDE 68
  • Before the fix
  • Opens table statistics

Concurrent DML allowed

  • Updates table statistics

Concurrent DML allowed

  • Update finished
  • Invalidates entry in table definition cache

Concurrent DML forbidden

  • Invalidates query cache

Concurrent DML forbidden

Non-Blocking ANALYZE TABLE

31

slide-69
SLIDE 69
  • After the fix
  • Opens table statistics

Concurrent DML allowed

  • Updates table statistics

Concurrent DML allowed

  • Update finished
  • Invalidates entry in table definition cache

Concurrent DML forbidden

  • Invalidates query cache

Concurrent DML forbidden

Non-Blocking ANALYZE TABLE

31

slide-70
SLIDE 70
  • InnoDB stores its statistics

mysql.innodb index stats

Without the Fix: Manual Update

32

slide-71
SLIDE 71
  • InnoDB stores its statistics

mysql.innodb index stats

  • This table is writable

Without the Fix: Manual Update

32

slide-72
SLIDE 72
  • InnoDB stores its statistics

mysql.innodb index stats

  • This table is writable
  • Updating it with following FLUSH TABLE

allows to fake any statistics

Without the Fix: Manual Update

32

slide-73
SLIDE 73
  • InnoDB stores its statistics

mysql.innodb index stats

  • This table is writable
  • Updating it with following FLUSH TABLE

allows to fake any statistics

  • Hack
  • Not documented
  • Not recommended
  • Can stop working any time

Without the Fix: Manual Update

32

slide-74
SLIDE 74
  • With Percona fix for blocking ANALYZE

TABLE we can use large value for STATS SAMPLE PAGES

  • Does not help when
  • Index cannot be used
  • Data distribution in the index vary a lot

5.7: Resume

33

slide-75
SLIDE 75
  • With Percona fix for blocking ANALYZE

TABLE we can use large value for STATS SAMPLE PAGES

  • Does not help when
  • Index cannot be used
  • Data distribution in the index vary a lot
  • Manual update allows to fix statistics
  • Not recommended
  • Can stop working any time

5.7: Resume

33

slide-76
SLIDE 76

Histograms

slide-77
SLIDE 77
  • Optimizer Column Statistics
  • Engine-independent
  • No fancy calculations
  • Knows about data distribution

What are the Histograms?

35

slide-78
SLIDE 78

1 2 3 4 5 6 7 8 9 10 200 400 600 800

Number of Values in Each Bucket

36

slide-79
SLIDE 79

1 2 3 4 5 6 7 8 9 10 0.2 0.4 0.6 0.8 1

Data in the Histogram

37

slide-80
SLIDE 80
  • Accurate statistics
  • Truly persistent
  • No extra calculations on access
  • Optimizer knows about data distribution
  • Without touching the table!

How Histograms are Helpful?

38

slide-81
SLIDE 81
  • Example data

mysql> create table example(f1 int) engine=innodb; mysql> insert into example values(1),(1),(1),(2),(3); mysql> select f1, count(f1) from example group by f1; +------+-----------+ | f1 | count(f1) | +------+-----------+ | 1 | 3 | | 2 | 1 | | 3 | 1 | +------+-----------+ 3 rows in set (0.00 sec)

Filtered Rows

39

slide-82
SLIDE 82
  • Without a histogram

mysql> explain select * from example where f1 > 0\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: example partitions: NULL type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 5 filtered: 33.33 Extra: Using where

Filtered Rows

39

slide-83
SLIDE 83
  • Without a histogram

mysql> explain select * from example where f1 > 1\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: example partitions: NULL type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 5 filtered: 33.33 Extra: Using where

Filtered Rows

39

slide-84
SLIDE 84
  • Without a histogram

mysql> explain select * from example where f1 > 2\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: example partitions: NULL type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 5 filtered: 33.33 Extra: Using where

Filtered Rows

39

slide-85
SLIDE 85
  • Without a histogram

mysql> explain select * from example where f1 > 3\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: example partitions: NULL type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 5 filtered: 33.33 Extra: Using where

Filtered Rows

39

slide-86
SLIDE 86
  • With the histogram

mysql> analyze table example update histogram on f1 with 3 buckets; +-----------------+-----------+----------+------------------------------+ | Table | Op | Msg_type | Msg_text | +-----------------+-----------+----------+------------------------------+ | hist_ex.example | histogram | status | Histogram statistics created for column ’f1’. | +-----------------+-----------+----------+------------------------------+ 1 row in set (0.03 sec)

Filtered Rows

39

slide-87
SLIDE 87
  • With the histogram

mysql> select * from information_schema.column_statistics

  • > where table_name=’example’\G

*************************** 1. row *************************** SCHEMA_NAME: hist_ex TABLE_NAME: example COLUMN_NAME: f1 HISTOGRAM: "buckets": [[1, 0.6], [2, 0.8], [3, 1.0]], "data-type": "int", "null-values": 0.0, "collation-id": 8, "last-updated": "2018-11-07 09:07:19.791470", "sampling-rate": 1.0, "histogram-type": "singleton", "number-of-buckets-specified": 3 1 row in set (0.00 sec)

Filtered Rows

39

slide-88
SLIDE 88
  • With the histogram

mysql> explain select * from example where f1 > 0\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: example partitions: NULL type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 5 filtered: 100.00 -- all rows Extra: Using where

Filtered Rows

39

slide-89
SLIDE 89
  • With the histogram

mysql> explain select * from example where f1 > 1\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: example partitions: NULL type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 5 filtered: 40.00 -- 2 rows Extra: Using where

Filtered Rows

39

slide-90
SLIDE 90
  • With the histogram

mysql> explain select * from example where f1 > 2\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: example partitions: NULL type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 5 filtered: 20.00 -- one row Extra: Using where

Filtered Rows

39

slide-91
SLIDE 91
  • With the histogram

mysql> explain select * from example where f1 > 3\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: example partitions: NULL type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 5 filtered: 20.00 - one row Extra: Using where

Filtered Rows

39

slide-92
SLIDE 92
  • EXPLAIN without histograms

mysql> explain select goods.* from goods

  • > join categories on (categories.id=goods.cat_id)
  • > where cat_id in (20,2,18,4,16,6,14,1,12,11,10,9,8,17)
  • > and
  • > date_added between ’2000-01-01’ and ’2001-01-01’ -- Large range
  • > order by goods.cat_id
  • > limit 10\G -- We ask for 10 rows only!

Example

40

slide-93
SLIDE 93
  • EXPLAIN without histograms

*************************** 1. row *************************** id: 1 select_type: SIMPLE table: categories -- Small table first partitions: NULL type: index possible_keys: PRIMARY key: PRIMARY key_len: 4 ref: NULL rows: 20 filtered: 70.00 Extra: Using where; Using index; Using temporary; Using filesort

Example

40

slide-94
SLIDE 94
  • EXPLAIN without histograms

*************************** 2. row *************************** id: 1 select_type: SIMPLE table: goods -- Large table partitions: NULL type: ref possible_keys: cat_id_2 key: cat_id_2 key_len: 5 ref: orig.categories.id rows: 51827 filtered: 11.11 -- Default value Extra: Using where 2 rows in set, 1 warning (0.01 sec)

Example

40

slide-95
SLIDE 95
  • Execution time without histograms

mysql> flush status; Query OK, 0 rows affected (0.00 sec) mysql> select goods.* from goods

  • > join categories on (categories.id=goods.cat_id)
  • > where cat_id in (20,2,18,4,16,6,14,1,12,11,10,9,8,17)
  • > and
  • > date_added between ’2000-01-01’ and ’2001-01-01’
  • > order by goods.cat_id
  • > limit 10;

ab9f9bb7bc4f357712ec34f067eda364

  • 10 rows in set (56.47 sec)

Example

40

slide-96
SLIDE 96
  • Engine statistics without histograms

mysql> show status like ’Handler%’; +----------------------------+--------+ | Variable_name | Value | +----------------------------+--------+ ... | Handler_read_next | 964718 | | Handler_read_prev | 0 | | Handler_read_rnd | 10 | | Handler_read_rnd_next | 951671 | ... | Handler_write | 951670 | +----------------------------+--------+ 18 rows in set (0.01 sec)

Example

40

slide-97
SLIDE 97
  • Now lets add the histogram

mysql> analyze table goods update histogram on date_added; +------------+-----------+----------+------------------------------+ | Table | Op | Msg_type | Msg_text | +------------+-----------+----------+------------------------------+ | orig.goods | histogram | status | Histogram statistics created for column ’date_added’. | +------------+-----------+----------+------------------------------+ 1 row in set (2.01 sec)

Example

40

slide-98
SLIDE 98
  • EXPLAIN with the histogram

mysql> explain select goods.* from goods

  • > join categories
  • > on (categories.id=goods.cat_id)
  • > where cat_id in (20,2,18,4,16,6,14,1,12,11,10,9,8,17)
  • > and
  • > date_added between ’2000-01-01’ and ’2001-01-01’
  • > order by goods.cat_id
  • > limit 10\G

Example

40

slide-99
SLIDE 99
  • EXPLAIN with the histogram

*************************** 1. row *************************** id: 1 select_type: SIMPLE table: goods -- Large table first partitions: NULL type: index possible_keys: cat_id_2 key: cat_id_2 key_len: 5 ref: NULL rows: 10 -- Same as we asked filtered: 98.70 -- True numbers Extra: Using where

Example

40

slide-100
SLIDE 100
  • EXPLAIN with the histogram

*************************** 2. row *************************** id: 1 select_type: SIMPLE table: categories -- Small table partitions: NULL type: eq_ref possible_keys: PRIMARY key: PRIMARY key_len: 4 ref: orig.goods.cat_id rows: 1 filtered: 100.00 Extra: Using index 2 rows in set, 1 warning (0.01 sec)

Example

40

slide-101
SLIDE 101
  • Execution time with the histogram

mysql> flush status; Query OK, 0 rows affected (0.00 sec) mysql> select goods.* from goods

  • > join categories on (categories.id=goods.cat_id)
  • > where cat_id in (20,2,18,4,16,6,14,1,12,11,10,9,8,17)
  • > and
  • > date_added between ’2000-01-01’ and ’2001-01-01’
  • > order by goods.cat_id
  • > limit 10;

eeb005fae0dd3441c5c380e1d87fee84

  • 10 rows in set (0.00 sec) -- 56 times faster!

Example

40

slide-102
SLIDE 102
  • Engine statistics with the histogram

mysql> show status like ’Handler%’; +----------------------------+-------++----------------------------+-------+ | Variable_name | Value || Variable_name | Value | +----------------------------+-------++----------------------------+-------+ | Handler_commit | 1 || Handler_read_prev | 0 | | Handler_delete | 0 || Handler_read_rnd | 0 | | Handler_discover | 0 || Handler_read_rnd_next | 0 | | Handler_external_lock | 4 || Handler_rollback | 0 | | Handler_mrr_init | 0 || Handler_savepoint | 0 | | Handler_prepare | 0 || Handler_savepoint_rollback | 0 | | Handler_read_first | 1 || Handler_update | 0 | | Handler_read_key | 3 || Handler_write | 0 | | Handler_read_last | 0 |+----------------------------+-------+ | Handler_read_next | 9 |18 rows in set (0.00 sec)

Example

40

slide-103
SLIDE 103
  • Data distribution is uniform
  • Range optimization can be used
  • Full table scan is fast

When Histogram are not Helpful?

41

slide-104
SLIDE 104

Backward index scan

  • Better Statistics Persistence in InnoDB
  • MySQL bug #80178
  • MySQL bug #84654
  • Better PRIMARY key access

Other Improvements in 8.0

42

slide-105
SLIDE 105

Conclusion

slide-106
SLIDE 106
  • Index statistics collected by the egine
  • Optimizer calculates Cardinality each time

when accesses statistics

  • Indexes not always improve performance
  • Histograms can help

Still new feature

Conclusion

44

slide-107
SLIDE 107

MySQL User Reference Manual Blog by Erik Froseth Blog by Frederic Descamps Talk by Oystein Grovlen @Fosdem Talk by Sergei Petrunia @PerconaLive Talk by Sergei Golubchik @HighLoad++

More information

45

slide-108
SLIDE 108

Rate My Session!

46

slide-109
SLIDE 109

http://www.slideshare.net/SvetaSmirnova https://twitter.com/svetsmirnova https://github.com/svetasmirnova

Thank you!

47

slide-110
SLIDE 110

How TokuDB Updates Statistics

slide-111
SLIDE 111
  • Stores key statistics on disk and in memory
  • tablename status id.tokudb

TokuDB: Overview

49

slide-112
SLIDE 112
  • Stores key statistics on disk and in memory
  • Stores row count on disk and in memory
  • tablename main id.tokudb
  • tablename key keyname id.tokudb

TokuDB: Overview

49

slide-113
SLIDE 113
  • Stores key statistics on disk and in memory
  • Stores row count on disk and in memory
  • Returns statistics to Optimizer

TokuDB: Overview

49

slide-114
SLIDE 114
  • Stores key statistics on disk and in memory
  • Stores row count on disk and in memory
  • Returns statistics to Optimizer
  • In ha tokudb::info (handler/ha tokudb.cc)

TokuDB: Overview

49

slide-115
SLIDE 115
  • Stored on disk

TokuDB: Key Statistics

50

slide-116
SLIDE 116
  • Stored on disk
  • Updated during ANALYZE
  • Background ANALYZE
  • Explicitly called

TokuDB: Key Statistics

50

slide-117
SLIDE 117
  • Stored on disk
  • Updated during ANALYZE
  • Background ANALYZE
  • Explicitly called
  • Not updated when tokudb auto analyze=0

TokuDB: Key Statistics

50

slide-118
SLIDE 118
  • Updated in TOKUDB SHARE::update cardinality counts

TokuDB Key Statistics: Code

51

slide-119
SLIDE 119
  • Updated in TOKUDB SHARE::update cardinality counts
  • Stored in tokudb::set card in status
  • In standard ANALYZE
  • standard t::on run

TokuDB Key Statistics: Code

51

slide-120
SLIDE 120
  • Updated in TOKUDB SHARE::update cardinality counts
  • Stored in tokudb::set card in status
  • Retrieved in tokudb::get card from status
  • When table is open
  • In ha tokudb::initialize share

TokuDB Key Statistics: Code

51

slide-121
SLIDE 121
  • Updated in TOKUDB SHARE::update cardinality counts
  • Stored in tokudb::set card in status
  • Retrieved in tokudb::get card from status
  • Used in TOKUDB SHARE::set cardinality counts in table

for (uint32_t j = 0; j < key->actual_key_parts; j++) { ... assert_always(next_key_part < _rec_per_keys); ulong val = _rec_per_key[next_key_part++]; val = (val * tokudb::sysvars::cardinality_scale_percent) / 100;

TokuDB Key Statistics: Code

51

slide-122
SLIDE 122
  • Stored on disk
  • Updated
  • Each time table is updated
  • When ha tokudb::info called

TokuDB Logical Rows Count

52

slide-123
SLIDE 123

mysql> create table test(

  • > id int not null auto_increment primary key,
  • > f1 int,
  • > ts timestamp,
  • > key(f1)
  • > ) engine=tokudb;

Query OK, 0 rows affected (0.10 sec) mysql> insert into test (f1, ts) values(1, NOW()), (2, NOW()); Query OK, 2 rows affected (0.03 sec) Records: 2 Duplicates: 0 Warnings: 0 ... mysql> insert into test (f1, ts) select f1, NOW() from test; Query OK, 32 rows affected (0.01 sec) Records: 32 Duplicates: 0 Warnings: 0

TokuDB Test Case

53

slide-124
SLIDE 124

mysql> select count(distinct id), count(distinct f1) from test; +--------------------+--------------------+ | count(distinct id) | count(distinct f1) | +--------------------+--------------------+ | 64 | 2 | +--------------------+--------------------+ 1 row in set (0.01 sec)

TokuDB Test Case

53

slide-125
SLIDE 125
  • SHOW INDEX

mysql> show index from test\G *************************** 1. row *************************** Table: test Non_unique: 0 Key_name: PRIMARY Column_name: id Cardinality: 64 ... *************************** 2. row *************************** Table: test Non_unique: 1 Key_name: f1 Column_name: f1 Cardinality: 64 ...

TokuDB: After First run

54

slide-126
SLIDE 126
  • Number of rows

$ ../bin/tokuftdump --header --nodata var/mysqld.1/data/test/test_key_f1_145_1_1d_B_1.tokudb ft: layout_version=29 layout_version_original=29 layout_version_read_from_disk=29 build_id=0 build_id_original=0 time_of_creation= 1537709029 Sun Sep 23 16:23:49 2018 time_of_last_modification=1537709100 Sun Sep 23 16:25:00 2018 ... estimated numrows=64 estimated numbytes=640 logical row count=64

TokuDB: After First run

54

slide-127
SLIDE 127
  • Index Statistics

Thread 44 "mysqld" hit Breakpoint 1, TOKUDB_SHARE::set_cardinality_counts_in_table (this=0x7fd86da54020, table=0x7fd86d90b020) at /home/sveta/src/percona-server/storage/tokudb/ha_tokudb.cc:400 400 if (val == 0 || _rows == 0 || (gdb) p key->name $21 = 0x7fd86d879999 "f1" (gdb) p val $22 = 0

TokuDB: After First run

54

slide-128
SLIDE 128
  • Cardinality = 64 / 0 = 64

TokuDB: After First run

54

slide-129
SLIDE 129

mysql> analyze table test; +-----------+---------+----------+----------+ | Table | Op | Msg_type | Msg_text | +-----------+---------+----------+----------+ | test.test | analyze | status | OK | +-----------+---------+----------+----------+ 1 row in set (0.01 sec)

TokuDB: ANALYZE TABLE

55

slide-130
SLIDE 130
  • SHOW INDEX

mysql> show index from test\G *************************** 1. row *************************** Table: test Non_unique: 0 Key_name: PRIMARY Column_name: id Cardinality: 64 ... *************************** 2. row *************************** Table: test Non_unique: 1 Key_name: f1 Column_name: f1 Cardinality: 2 ...

TokuDB: After ANALYZE TABLE

56

slide-131
SLIDE 131
  • Number of rows

$ ../bin/tokuftdump --header --nodata var/mysqld.1/data/test/test_key_f1_145_1_1d_B_1.tokudb ft: layout_version=29 layout_version_original=29 layout_version_read_from_disk=29 build_id=0 build_id_original=0 time_of_creation= 1537709029 Sun Sep 23 16:23:49 2018 time_of_last_modification=1537709100 Sun Sep 23 16:25:00 2018 ... estimated numrows=64 estimated numbytes=640 logical row count=64

TokuDB: After ANALYZE TABLE

56

slide-132
SLIDE 132
  • Index Statistics

Thread 44 "mysqld" hit Breakpoint 1, TOKUDB_SHARE::set_cardinality_counts_in_table (this=0x7fd86da54020, table=0x7fd86d90b020) at /home/sveta/src/percona-server/storage/tokudb/ha_tokudb.cc:400 400 if (val == 0 || _rows == 0 || (gdb) p key->name $26 = 0x7fd86d879999 "f1" (gdb) p val $27 = 32

TokuDB: After ANALYZE TABLE

56

slide-133
SLIDE 133
  • Cardinality = 64 / 32 = 2

TokuDB: After ANALYZE TABLE

56

slide-134
SLIDE 134

mysql> insert into test (f1, ts) select f1, NOW() from test; Query OK, 64 rows affected (0.01 sec) Records: 64 Duplicates: 0 Warnings: 0 mysql> insert into test (f1, ts) select f1, NOW() from test; Query OK, 128 rows affected (0.01 sec) Records: 128 Duplicates: 0 Warnings: 0 mysql> insert into test (f1, ts) select f1, NOW() from test; Query OK, 256 rows affected (0.02 sec) Records: 256 Duplicates: 0 Warnings: 0 mysql> select count(distinct id), count(distinct f1) from test; +--------------------+--------------------+ | count(distinct id) | count(distinct f1) | +--------------------+--------------------+ | 512 | 2 | +--------------------+--------------------+ 1 row in set (0.01 sec)

TokuDB: Let’s Insert More Data

57

slide-135
SLIDE 135
  • SHOW INDEX

mysql> show index from test\G *************************** 1. row *************************** Table: test Non_unique: 0 Key_name: PRIMARY Column_name: id Cardinality: 512 ... *************************** 2. row *************************** Table: test Non_unique: 1 Key_name: f1 Column_name: f1 Cardinality: 16 ...

TokuDB: After INSERT

58

slide-136
SLIDE 136
  • Number of rows

$ ../bin/tokuftdump --header --nodata var/mysqld.1/data/test/test_key_f1_145_1_1d_B_1.tokudb ft: layout_version=29 layout_version_original=29 layout_version_read_from_disk=29 build_id=0 build_id_original=0 time_of_creation= 1537709029 Sun Sep 23 16:23:49 2018 time_of_last_modification=1537709880 Sun Sep 23 16:38:00 2018 ... estimated numrows=512 estimated numbytes=5120 logical row count=512

TokuDB: After INSERT

58

slide-137
SLIDE 137
  • Index Statistics

Thread 44 "mysqld" hit Breakpoint 1, TOKUDB_SHARE::set_cardinality_counts_in_table (this=0x7fd86da54020, table=0x7fd86d90b020) at /home/sveta/src/percona-server/storage/tokudb/ha_tokudb.cc:400 400 if (val == 0 || _rows == 0 || (gdb) p key->name $30 = 0x7fd86d879999 "f1" (gdb) p val $31 = 32

TokuDB: After INSERT

58

slide-138
SLIDE 138
  • Cardinality = 512 / 32 = 16

TokuDB: After INSERT

58

slide-139
SLIDE 139
  • SHOW INDEX

mysql> show index from test\G *************************** 1. row *************************** Table: test Non_unique: 0 Key_name: PRIMARY Column_name: id Cardinality: 512 ... *************************** 2. row *************************** Table: test Non_unique: 1 Key_name: f1 Column_name: f1 Cardinality: 16 ...

TokuDB: After Restart

59

slide-140
SLIDE 140
  • Index Statistics

Thread 44 "mysqld" hit Breakpoint 1, TOKUDB_SHARE::set_cardinality_counts_in_table (this=0x7fd4e67ea020, table=0x7fd4e6765c20) at /home/sveta/src/percona-server/storage/tokudb/ha_tokudb.cc:400 400 if (val == 0 || _rows == 0 || (gdb) p key->name $3 = 0x7fd4e66d7599 "f1" (gdb) p val $4 = 32

TokuDB: After Restart

59

slide-141
SLIDE 141
  • Cardinality = 512 / 32 = 16

TokuDB: After Restart

59

slide-142
SLIDE 142
  • Cardinality = 512 / 32 = 16
  • Same!

TokuDB: After Restart

59

slide-143
SLIDE 143
  • Index statistics updated only when ANALYZE

TABLE is running

TokuDB: Conclusion

60

slide-144
SLIDE 144
  • Index statistics updated only when ANALYZE

TABLE is running

  • Logical row count updated each time when

number of rows change

TokuDB: Conclusion

60

slide-145
SLIDE 145
  • Index statistics updated only when ANALYZE

TABLE is running

  • Logical row count updated each time when

number of rows change

  • Cardinality based on both numbers

TokuDB: Conclusion

60

slide-146
SLIDE 146
  • Index statistics updated only when ANALYZE

TABLE is running

  • Logical row count updated each time when

number of rows change

  • Cardinality based on both numbers
  • It is expected the cardinality is not the same
  • After updates
  • Even when ANALYZE TABLE never run

TokuDB: Conclusion

60