
slide-1
SLIDE 1

Percona XtraDB: Compressed Columns with Dictionaries – an alternative to InnoDB table compression

Yura Sorokin, Senior Software Engineer at Percona

slide-2
SLIDE 2

Existing compression methods

Overview

slide-3
SLIDE 3

3

Existing compression methods for MySQL

  • InnoDB Table Compression (starting from 5.1)

https://dev.mysql.com/doc/refman/5.7/en/innodb-table-compression.html

  • InnoDB Page Compression (starting from 5.7)

https://dev.mysql.com/doc/refman/5.7/en/innodb-page-compression.html

  • Filesystems with transparent compression (BTRFS, ZFS, NTFS, etc.)

will not be covered in this talk

  • COMPRESS() / UNCOMPRESS() MySQL functions

https://dev.mysql.com/doc/refman/5.5/en/encryption-functions.html#function_compress
https://dev.mysql.com/doc/refman/5.5/en/encryption-functions.html#function_uncompress

  • Something intriguing from Percona Server :)

something you might have already noticed in the title of this talk

slide-4
SLIDE 4

InnoDB Table Compression

ROW_FORMAT=COMPRESSED

slide-5
SLIDE 5

5

InnoDB Table Compression

SET GLOBAL innodb_file_per_table=1;
SET GLOBAL innodb_file_format=Barracuda;

CREATE TABLE tbl (
  id SERIAL PRIMARY KEY,
  value BLOB NOT NULL
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=N;

Based on the zlib compression library (LZ77 compression algorithm).

slide-6
SLIDE 6

6

InnoDB Table Compression

  • An uncompressed “modification log” records changes made to the page (to avoid unnecessary uncompression and recompression).
  • When the “modification log” is full, recompression is needed.
  • Recompression may fail if the compressed data no longer fits; in this case B-tree nodes are split.
  • To avoid frequent compression failures (splitting), there is reserved empty space (padding).

[Page layout diagram: Compressed data | Padding | Modification log]

slide-7
SLIDE 7

7

InnoDB Table Compression

  • BLOB, VARCHAR, and TEXT columns that are not part of the primary key may be stored on separately allocated overflow pages.
  • All data written to overflow pages is compressed “as is” (MySQL applies the zlib compression algorithm to the entire data item).

[Page layout diagram: Compressed data | Padding | Modification log]

slide-8
SLIDE 8

8

InnoDB Table Compression

Tuning:

  • KEY_BLOCK_SIZE=(1|2|4|8|16) combined with innodb_page_size
    (KEY_BLOCK_SIZE must be <= innodb_page_size; table compression does not support 32k and 64k innodb_page_size)

  • innodb_compression_level = (0..9)
    https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_compression_level

  • innodb_compression_pad_pct_max
    https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_compression_pad_pct_max

  • innodb_compression_failure_threshold_pct
    https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_compression_failure_threshold_pct

slide-9
SLIDE 9

9

InnoDB Table Compression

“…In general, compression works best on tables that include a reasonable number of character string columns and where the data is read far more often than it is written…”

https://dev.mysql.com/doc/refman/5.7/en/innodb-compression-tuning.html

slide-10
SLIDE 10

Real world data

JSON documents

slide-11
SLIDE 11

11

Real world data

JSON data sets for JSON Studio tutorials

http://jsonstudio.com/resources/

  • JSON data set of projects funded by the World Bank (436K compressed).
  • JSON data set for US zip (postal) codes (656K compressed).
  • JSON data set of listed stocks (1.6M compressed).
  • JSON data set for Enron emails (3.9M compressed).
  • JSON data set of startup company information (14.8M compressed).

‘companies.json’ repeated 8 times

slide-12
SLIDE 12

12

Real world data

{
  "_id": { "$oid": "52cdef7c4bab8bd675297d8b" },
  "name": "AdventNet",
  "permalink": "abc3",
  "crunchbase_url": "http://www.crunchbase.com/company/adventnet",
  "homepage_url": "http://adventnet.com",
  "blog_url": "",
  "blog_feed_url": "",
  "twitter_username": "manageengine",
  "category_code": "enterprise",
  "number_of_employees": 600,
  "founded_year": 1996,
  "deadpooled_year": 2,
  "tag_list": "",
  "alias_list": "Zoho ManageEngine ",
  "email_address": "pr@adventnet.com",
  "phone_number": "925-924-9500",
  "description": "Server Management Software",
  "created_at": { "$date": 1180121062000 },
  "updated_at": "Wed Oct 31 18:26:09 UTC 2012",
  "overview": "<p>AdventNet is now <a href=\"/company/zoho-manageengine\" title=\"Zoho ManageEngine\" rel=\"nofollow\">Zoho ManageEngine</a>.</p>\n\n<p>Founded in 1996, AdventNet has served a diverse range of enterprise IT, networking and telecom customers.</p>\n\n<p>AdventNet supplies server and network management software.</p>",
  …
}

slide-13
SLIDE 13

13

Real world data

Loading data from the file:

SET max_heap_table_size = 1024 * 1048576;

CREATE TABLE companies_src (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  value LONGTEXT NOT NULL,
  PRIMARY KEY(id)
) ENGINE=Memory;

LOAD DATA INFILE 'companies.json' INTO TABLE companies_src
  FIELDS TERMINATED BY '\t' ENCLOSED BY '' ESCAPED BY '' (value);

slide-14
SLIDE 14

14

Real world data

Parsing JSON documents:

CREATE TABLE companies_src_parsed (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  value JSON NOT NULL,
  PRIMARY KEY(id)
) ENGINE=Memory
AS SELECT * FROM companies_src;

slide-15
SLIDE 15

15

Real world data

Statistics:

SELECT COUNT(*) FROM companies_src_parsed;
-- 150408

SELECT COUNT(*) FROM companies_src_parsed WHERE JSON_VALID(value) <> 1;

SELECT MIN(LENGTH(value)), ROUND(AVG(LENGTH(value))), MAX(LENGTH(value))
  FROM companies_src_parsed;
-- 1027  4160  278016

slide-16
SLIDE 16

16

Real world data

More statistics:

SELECT JSON_KEYS(value), JSON_LENGTH(JSON_KEYS(value))
  FROM companies_src_parsed ORDER BY id LIMIT 1;
-- ["_id", "name", "image", "offices", "blog_url", "overview", "partners",
--  "products", "tag_list", "permalink", "alias_list", "created_at", "milestones",
--  "updated_at", "acquisition", "description", "founded_day", "investments",
--  "screenshots", "acquisitions", "competitions", "founded_year", "homepage_url",
--  "phone_number", "video_embeds", "blog_feed_url", "category_code",
--  "email_address", "founded_month", "providerships", "relationships",
--  "crunchbase_url", "external_links", "funding_rounds", "deadpooled_year",
--  "twitter_username", "total_money_raised", "number_of_employees"]
-- 38

SELECT MIN(JSON_DEPTH(value)), MAX(JSON_DEPTH(value)) FROM companies_src_parsed;
-- 3  7

slide-17
SLIDE 17

InnoDB Table Compression Experiments

Compression ratio and insertion time

slide-18
SLIDE 18

18

InnoDB Table Compression Experiments

Copying data to a new table:

SET GLOBAL innodb_compression_level = LEVEL;

CREATE TABLE companies_compressed (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  value LONGTEXT NOT NULL,
  PRIMARY KEY(id)
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=KBS;

INSERT INTO companies_compressed SELECT * FROM companies_src;

LEVEL = [0, 1, 6, 9]
KBS = [1, 2, 4, 8, 16]

slide-19
SLIDE 19

19

InnoDB Table Compression Experiments

Calculating compressed data size:

SELECT data_length FROM information_schema.tables
  WHERE table_schema = DATABASE() AND table_name = 'companies_compressed';

slide-20
SLIDE 20

20

InnoDB Table Compression Experiments – Identity: data size, MB and query time, sec

                LONGTEXT  JSON
data_length       898 MB  880 MB
query_time         33 s    36 s

slide-21
SLIDE 21

21

InnoDB Table Compression Experiments – Identity: data size, % and query time, %

                        LONGTEXT  JSON
data_length_relative     100.00%   98.00%
query_time_relative      100.00%  109.09%

slide-22
SLIDE 22

22

InnoDB Table Compression Experiments – Data size, MB

        LONGTEXT  JSON  KBS1  KBS2  KBS4  KBS8  KBS16
LVL0      898     880   749   873   938   917   854
LVL1      898     880   359   379   338   401   792
LVL6      898     880   341   364   317   400   791
LVL9      898     880   341   363   316   399   790

slide-23
SLIDE 23

23

InnoDB Table Compression Experiments – Query time, sec

        LONGTEXT  JSON  KBS1  KBS2  KBS4  KBS8  KBS16
LVL0      33       36    64    43    46    27    20
LVL1      33       36    38    37    36    28    21
LVL6      33       36    43    43    48    34    23
LVL9      33       36    45    45    52    37    24

slide-24
SLIDE 24

24

InnoDB Table Compression Experiments – Data size, %

        LONGTEXT  JSON     KBS1     KBS2     KBS4      KBS8     KBS16
LVL0    100.00%   98.00%   83.41%   97.20%   104.43%  102.06%   95.10%
LVL1    100.00%   98.00%   39.97%   42.23%    37.61%   44.65%   88.20%
LVL6    100.00%   98.00%   37.99%   40.48%    35.27%   44.49%   88.08%
LVL9    100.00%   98.00%   37.95%   40.45%    35.19%   44.43%   87.97%

slide-25
SLIDE 25

25

InnoDB Table Compression Experiments – Query time, %

        LONGTEXT  JSON     KBS1     KBS2     KBS4     KBS8     KBS16
LVL0    100.00%   109.09%  193.94%  130.30%  139.39%   81.82%  60.61%
LVL1    100.00%   109.09%  115.15%  112.12%  109.09%   84.85%  63.64%
LVL6    100.00%   109.09%  130.30%  130.30%  145.45%  103.03%  69.70%
LVL9    100.00%   109.09%  136.36%  136.36%  157.58%  112.12%  72.73%

slide-26
SLIDE 26

InnoDB Page Compression

COMPRESSION=("zlib" | "lz4")

slide-27
SLIDE 27

27

InnoDB Page Compression

SET GLOBAL innodb_file_per_table=1;

CREATE TABLE tbl (
  id SERIAL PRIMARY KEY,
  value BLOB NOT NULL
) ENGINE=InnoDB COMPRESSION=("zlib" | "lz4");

Can be configured to use either zlib or lz4 compression.

slide-28
SLIDE 28

28

InnoDB Page Compression

Also referred to as Transparent Page Compression

https://dev.mysql.com/doc/refman/5.7/en/innodb-page-compression.html

Requirements:

  • Sparse file and hole punching support on Linux
  • NTFS on Windows
slide-29
SLIDE 29

29

InnoDB Page Compression

'ls -l tablespace_name.ibd' will show the uncompressed file size. Use 'du --block-size=1 tablespace_name.ibd' instead.

SELECT FS_BLOCK_SIZE, FILE_SIZE, ALLOCATED_SIZE
  FROM INFORMATION_SCHEMA.INNODB_SYS_TABLESPACES;
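The ls-vs-du difference comes from sparse files. A small Python sketch (an illustration, assuming a POSIX filesystem with sparse-file support, e.g. Linux ext4) shows the gap between the apparent size reported by `ls -l` and the allocated size reported by `du`:

```python
import os
import tempfile

# Create a file that is mostly a hole: seek far ahead and write one byte.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.seek(16 * 1024 * 1024 - 1)  # 16 MB apparent size
    f.write(b"\0")
    path = f.name

st = os.stat(path)
apparent = st.st_size            # what `ls -l` reports
allocated = st.st_blocks * 512   # what `du` reports (POSIX: 512-byte units)
print(apparent, allocated)

os.unlink(path)
```

On a filesystem with hole punching, `allocated` is a tiny fraction of `apparent`, which is exactly the situation with transparently compressed tablespaces.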

slide-30
SLIDE 30

InnoDB Page Compression Experiments

Compression ratio and insertion time

slide-31
SLIDE 31

31

InnoDB Page Compression Experiments

Copying data to a new table:

SET GLOBAL innodb_compression_level = LEVEL;

CREATE TABLE companies_compressed (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  value LONGTEXT NOT NULL,
  PRIMARY KEY(id)
) ENGINE=InnoDB COMPRESSION='ALG';

INSERT INTO companies_compressed SELECT * FROM companies_src;

LEVEL = [0, 1, 6, 9]
ALG = [zlib, lz4]

slide-32
SLIDE 32

32

InnoDB Page Compression Experiments

Calculating compressed data size:

SELECT allocated_size FROM information_schema.innodb_sys_tablespaces
  WHERE name = CONCAT(DATABASE(), '/companies_compressed');

slide-33
SLIDE 33

33

InnoDB Page Compression Experiments – Data size, MB

        LONGTEXT  JSON  ZLIB  LZ4
LVL0      898     880   920   507
LVL1      898     880   434   508
LVL6      898     880   373   506
LVL9      898     880   372   507

slide-34
SLIDE 34

34

InnoDB Page Compression Experiments – Query time, sec

        LONGTEXT  JSON  ZLIB  LZ4
LVL0      33       36    18    20
LVL1      33       36    22    19
LVL6      33       36    28    19
LVL9      33       36    30    19

slide-35
SLIDE 35

35

InnoDB Page Compression Experiments – Data size, %

        LONGTEXT  JSON     ZLIB     LZ4
LVL0    100.00%   98.00%   102.45%  56.43%
LVL1    100.00%   98.00%    48.36%  56.58%
LVL6    100.00%   98.00%    41.59%  56.36%
LVL9    100.00%   98.00%    41.37%  56.42%

slide-36
SLIDE 36

36

InnoDB Page Compression Experiments – Query time, %

        LONGTEXT  JSON     ZLIB    LZ4
LVL0    100.00%   109.09%  54.55%  60.61%
LVL1    100.00%   109.09%  66.67%  57.58%
LVL6    100.00%   109.09%  84.85%  57.58%
LVL9    100.00%   109.09%  90.91%  57.58%

slide-37
SLIDE 37

COMPRESS() MySQL function

COMPRESS() / UNCOMPRESS()

slide-38
SLIDE 38

38

COMPRESS() MySQL function

CREATE TABLE tbl (
  id SERIAL PRIMARY KEY,
  value BLOB NOT NULL
) ENGINE=<ANY_ENGINE>;

INSERT INTO tbl VALUES (DEFAULT, COMPRESS(REPEAT('a', 2048)));

SELECT id, UNCOMPRESS(value) FROM tbl;

slide-39
SLIDE 39

39

COMPRESS() MySQL function

  • Based on the zlib algorithm.
  • Inconvenient – field names need to be explicitly wrapped into COMPRESS() / UNCOMPRESS().
  • Loses the actual field type – compressed fields must be VARBINARY or BLOB.
  • No control over compression level.
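The on-disk shape of a COMPRESS() result is easy to mimic: a nonempty input is stored as a 4-byte little-endian uncompressed length followed by a zlib stream, and an empty string stays empty. A hedged Python sketch of that layout (`mysql_compress` / `mysql_uncompress` are illustrative names, not part of any library):

```python
import struct
import zlib

def mysql_compress(data: bytes) -> bytes:
    # Empty input compresses to an empty result, as COMPRESS('') does.
    if not data:
        return b""
    # 4-byte little-endian uncompressed length, then the zlib stream.
    return struct.pack("<I", len(data)) + zlib.compress(data)

def mysql_uncompress(blob: bytes) -> bytes:
    if not blob:
        return b""
    (length,) = struct.unpack("<I", blob[:4])
    out = zlib.decompress(blob[4:])
    assert len(out) == length  # length prefix must match the payload
    return out

payload = b"a" * 2048  # same payload as REPEAT('a', 2048) on the next slide
packed = mysql_compress(payload)
print(len(payload), len(packed))
```

The 4-byte prefix is also why COMPRESS() output on incompressible data is slightly longer than the input.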
slide-40
SLIDE 40

COMPRESS() MySQL function Experiments

Compression ratio and insertion time

slide-41
SLIDE 41

41

COMPRESS MySQL function Experiments

Copying data to a new table:

CREATE TABLE companies_compressed (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  value LONGBLOB NOT NULL,
  PRIMARY KEY(id)
) ENGINE=InnoDB;

INSERT INTO companies_compressed SELECT id, COMPRESS(value) FROM companies_src;

slide-42
SLIDE 42

42

COMPRESS MySQL function Experiments – Data size, MB and query time, sec

                LONGTEXT  JSON    COMPRESS()
data_length       898 MB  880 MB    283 MB
query_time         33 s    36 s      27 s

slide-43
SLIDE 43

43

COMPRESS MySQL function Experiments – Data size, % and query time, %

                        LONGTEXT  JSON     COMPRESS()
data_length_relative     100.00%   98.00%   31.49%
query_time_relative      100.00%  109.09%   81.82%

slide-44
SLIDE 44

XtraDB Compressed Columns

COLUMN_FORMAT COMPRESSED

slide-45
SLIDE 45

45

XtraDB Compressed Columns

Availability:

  • Percona Server 5.6.33-79.0

https://www.percona.com/doc/percona-server/5.6/flexibility/compressed_columns.html

  • Percona Server 5.7.17-11

https://www.percona.com/doc/percona-server/5.7/flexibility/compressed_columns.html

slide-46
SLIDE 46

46

XtraDB Compressed Columns

Key characteristics:

  • It is a data type modifier, independent from user-level SQL and InnoDB data compression, that causes the data stored in the column to be compressed on writing to storage and decompressed on reading.
  • For all other purposes, the data type is identical to the one without the modifier, i.e. no new data types are created.
  • Compression is done using the zlib library.
slide-47
SLIDE 47

47

XtraDB Compressed Columns

CREATE TABLE tbl (
  id SERIAL PRIMARY KEY,
  value TEXT COLUMN_FORMAT COMPRESSED NOT NULL
) ENGINE=InnoDB;

INSERT INTO tbl VALUES (DEFAULT, REPEAT('a', 2048));

SELECT id, value FROM tbl;

slide-48
SLIDE 48

48

XtraDB Compressed Columns

Supported data types:

  • BLOB (including TINYBLOB, MEDIUMBLOB, LONGBLOB)
  • TEXT (including TINYTEXT, MEDIUMTEXT, LONGTEXT)
  • VARCHAR (including NATIONAL VARCHAR)
  • VARBINARY
slide-49
SLIDE 49

49

XtraDB Compressed Columns

Compressing/uncompressing existing data:

ALTER TABLE tbl CHANGE value value TEXT COLUMN_FORMAT DEFAULT NOT NULL;
ALTER TABLE tbl MODIFY value TEXT COLUMN_FORMAT DEFAULT NOT NULL;

ALTER TABLE tbl CHANGE value value TEXT COLUMN_FORMAT COMPRESSED NOT NULL;
ALTER TABLE tbl MODIFY value TEXT COLUMN_FORMAT COMPRESSED NOT NULL;

slide-50
SLIDE 50

50

XtraDB Compressed Columns

5.7-specific features:

  • Support for compressed JSON data type
  • Support for compressed generated stored columns

CREATE TABLE tbl (
  id SERIAL PRIMARY KEY,
  value JSON COLUMN_FORMAT COMPRESSED NOT NULL,
  gen TEXT GENERATED ALWAYS AS (value->"$.bio") STORED COLUMN_FORMAT COMPRESSED
) ENGINE=InnoDB;

slide-51
SLIDE 51

51

XtraDB Compressed Columns

Tuning:

  • innodb_compressed_columns_zip_level = (0..9)
    https://www.percona.com/doc/percona-server/5.7/flexibility/compressed_columns.html#innodb_compressed_columns_zip_level

  • innodb_compressed_columns_threshold (minimum value length, in bytes, for a value to be compressed)
    https://www.percona.com/doc/percona-server/5.7/flexibility/compressed_columns.html#innodb_compressed_columns_threshold

slide-52
SLIDE 52

52

XtraDB Compressed Columns

Limitations (both for 5.6 and 5.7):

  • Compressed columns cannot be used in indices (neither on their own nor as parts of composite keys).
  • ALTER TABLE ... DISCARD/IMPORT TABLESPACE is not supported for tables with compressed columns.

slide-53
SLIDE 53

53

XtraDB Compressed Columns

2-byte compressed BLOB header:

  Field:  R   W   ALGORITHM   LEN-LEN   C    UNUSED
  Bits:   1   2     3..6       7..9     10   11..15

  • R – reserved for future versions (currently must always be 0)
  • ALGORITHM – identifies which algorithm was used to compress this BLOB. Currently, the only supported value is 0 (meaning zlib)
  • W – 'wrap': identifies whether the compression algorithm calculated a checksum (adler32 in the case of zlib) and appended it to the compressed data
  • LEN-LEN – identifies the length of the column length data portion that follows this header

slide-54
SLIDE 54

54

XtraDB Compressed Columns

2-byte compressed BLOB header:

  Field:  R   W   ALGORITHM   LEN-LEN   C    UNUSED
  Bits:   1   2     3..6       7..9     10   11..15

  • C – 'compressed'. If set to 1, this header is immediately followed by 1..8 bytes (depending on the value of the LEN-LEN bitfield) which determine the original (uncompressed) block size. These LEN-LEN bytes are followed by the compressed representation of the original data.
  • If the 'compressed' bit is set to 0, every other bitfield must be ignored. In this case the header is immediately followed by the uncompressed (original) data.

slide-55
SLIDE 55

XtraDB Compressed Columns Experiments

Compression ratio and insertion time

slide-56
SLIDE 56

56

XtraDB Compressed Columns Experiments

Copying data to a new table: CREATE TABLE companies_compressed( id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT, value [ LONGTEXT | JSON ] COLUMN_FORMAT COMPRESSED NOT NULL, PRIMARY KEY(id) ) ENGINE=InnoDB; INSERT INTO companies_compressed SELECT * FROM companies_src;

slide-57
SLIDE 57

57

XtraDB Compressed Columns Experiments Data size, MB

898 880 898 880 898 880 304 387 898 880 284 377 898 880 283 377 100 200 300 400 500 600 700 800 900 1 000 LONGTEXT JSON COMPRESSED TEXT COMPRESSED JSON data_length(LVL0), MB data_length(LVL1), MB data_length(LVL6), MB data_length(LVL9), MB

slide-58
SLIDE 58

58

XtraDB Compressed Columns Experiments Query time, sec

33 36 17 36 33 36 22 48 33 36 28 52 33 36 30 53 10 20 30 40 50 60 LONGTEXT JSON COMPRESSED TEXT COMPRESSED JSON query_time(LVL0), sec query_time(LVL1), sec query_time(LVL6), sec query_time(LVL9), sec

slide-59
SLIDE 59

59

XtraDB Compressed Columns Experiments Data size, %

100,00% 98,00% 100,00% 98,00% 100,00% 98,00% 33,83% 43,08% 100,00% 98,00% 31,60% 41,97% 100,00% 98,00% 31,49% 41,97% 0,00% 20,00% 40,00% 60,00% 80,00% 100,00% 120,00% LONGTEXT JSON COMPRESSED TEXT COMPRESSED JSON data_length_relative(LVL0), % data_length_relative(LVL1), % data_length_relative(LVL6), % data_length_relative(LVL9), %

slide-60
SLIDE 60

60

XtraDB Compressed Columns Experiments Query time, %

100,00% 109,09% 51,52% 109,09% 100,00% 109,09% 66,67% 145,45% 100,00% 109,09% 84,85% 157,58% 100,00% 109,09% 90,91% 160,61% 0,00% 20,00% 40,00% 60,00% 80,00% 100,00% 120,00% 140,00% 160,00% 180,00% LONGTEXT JSON COMPRESSED TEXT COMPRESSED JSON query_time_relative(LVL0), % query_time_relative(LVL1), % query_time_relative(LVL6), % query_time_relative(LVL9), %

slide-61
SLIDE 61

zlib Compression Algorithm

Overview

slide-62
SLIDE 62

62

zlib Compression Algorithm

zlib overview

https://en.wikipedia.org/wiki/Zlib

  • Only supports one algorithm, DEFLATE, which is a variation of LZ77.
  • LZ77 is, in essence, a dictionary coder.
  • When processing data, it keeps track of the most commonly used byte sequences in a dictionary, so that those sequences can be referred to when they occur again in the stream.
slide-63
SLIDE 63

63

zlib Compression Algorithm

zlib compression examples (oversimplified)

Example 1 – no repeating byte sequences

Input  : "Sunday Monday Tuesday"
OutDict: {"Sunday", "Monday", "Tuesday"}
Output : "Sunday Monday Tuesday"

Very bad compression ratio (100%). Just put some words into the dictionary.

slide-64
SLIDE 64

64

zlib Compression Algorithm

zlib compression examples (oversimplified)

Example 2 – a lot of byte sequences are repeated

Input  : "Sunday Monday Tuesday Tuesday Monday Sunday"
OutDict: {"Sunday", "Monday", "Tuesday"}
Output : "Sunday Monday Tuesday \ref[2] \ref[1] \ref[0]"

Good compression ratio (50% + overhead). For the second half of the input stream, dictionary words were used.

slide-65
SLIDE 65

65

zlib Compression Algorithm

zlib compression examples (oversimplified)

Example 3 – an initial state of the dictionary (not ideal) is provided by the user

InDict : {"Sunday", "Monday"}
Input  : "Sunday Monday Tuesday Tuesday Monday Sunday"
OutDict: {"Sunday", "Monday", "Tuesday"}
Output : "\ref[0] \ref[1] Tuesday \ref[2] \ref[1] \ref[0]"

Good compression ratio (16% + overhead). For 5 out of 6 words from the input sequence, dictionary references were used. Still not perfect though.

slide-66
SLIDE 66

66

zlib Compression Algorithm

zlib compression examples (oversimplified)

Example 4 – an initial state of the dictionary (ideal) is provided by the user

InDict : {"Sunday", "Monday", "Tuesday"}
Input  : "Sunday Monday Tuesday Tuesday Monday Sunday"
OutDict: {"Sunday", "Monday", "Tuesday"}
Output : "\ref[0] \ref[1] \ref[2] \ref[2] \ref[1] \ref[0]"

Good compression ratio (0% + overhead). Dictionary references were used for all the words from the input sequence. The best results.

slide-67
SLIDE 67

67

zlib Compression Algorithm

Dictionary usage conclusions:

  • Compression ratio can depend significantly on whether an initial compression dictionary is provided.
  • Compression ratio depends on the quality of that dictionary.
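These conclusions can be reproduced with any zlib binding that exposes preset dictionaries. A minimal Python sketch (the document and dictionary bytes below are made-up examples in the spirit of the slides, not the dictionaries built later in this talk):

```python
import zlib

# A JSON-like document and a hypothetical preset dictionary containing
# strings we expect to see in the data, most common ones towards the end
# (as the zlib manual recommends).
doc = b'{"name": "AdventNet", "category_code": "enterprise", "founded_year": 1996}'
zdict = b'"founded_year""category_code""name"'

def deflate(data, dictionary=None):
    c = (zlib.compressobj(level=6, zdict=dictionary)
         if dictionary else zlib.compressobj(level=6))
    return c.compress(data) + c.flush()

def inflate(blob, dictionary=None):
    d = (zlib.decompressobj(zdict=dictionary)
         if dictionary else zlib.decompressobj())
    return d.decompress(blob) + d.flush()

plain = deflate(doc)          # no preset dictionary
with_dict = deflate(doc, zdict)  # preset dictionary provided up front
print(len(doc), len(plain), len(with_dict))
```

Because the first occurrences of the keys can be encoded as back-references into the preset dictionary, the with-dictionary output is smaller than the plain one; the same decompression side must supply the same dictionary.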
slide-68
SLIDE 68

Compressed Columns with Dictionaries

COLUMN_FORMAT COMPRESSED WITH COMPRESSION_DICTIONARY

slide-69
SLIDE 69

69

XtraDB Compressed Columns with Dictionaries

CREATE COMPRESSION_DICTIONARY dict('SundayMondayTuesday');

SET @dict_content = 'SundayMondayTuesday';
CREATE COMPRESSION_DICTIONARY dict(@dict_content);

DROP COMPRESSION_DICTIONARY dict;

slide-70
SLIDE 70

70

XtraDB Compressed Columns with Dictionaries

CREATE TABLE tbl (
  id SERIAL PRIMARY KEY,
  value TEXT COLUMN_FORMAT COMPRESSED WITH COMPRESSION_DICTIONARY dict NOT NULL
) ENGINE=InnoDB;

INSERT INTO tbl VALUES (DEFAULT, REPEAT('a', 2048));

SELECT id, value FROM tbl;

slide-71
SLIDE 71

71

XtraDB Compressed Columns with Dictionaries

INFORMATION_SCHEMA.XTRADB_ZIP_DICT

  • id BIGINT UNSIGNED – compression dictionary ID
  • name VARCHAR(64) – compression dictionary name ('dict' in the example)
  • zip_dict BLOB – compression dictionary content ('SundayMondayTuesday' in the example)

slide-72
SLIDE 72

72

XtraDB Compressed Columns with Dictionaries

INFORMATION_SCHEMA.XTRADB_ZIP_DICT_COLS

  • table_id BIGINT UNSIGNED – table ID from INFORMATION_SCHEMA.INNODB_SYS_TABLES
  • column_pos BIGINT UNSIGNED – column position (starts from 0, as in INFORMATION_SCHEMA.INNODB_SYS_COLUMNS)
  • dict_id BIGINT UNSIGNED – compression dictionary ID
slide-73
SLIDE 73

Generating Compression Dictionary

Keyword histogram

slide-74
SLIDE 74

74

Generating Compression Dictionary

zlib 1.2.11 Manual

http://www.zlib.net/manual.html

“…The dictionary should consist of strings (byte sequences) that are likely to be encountered later in the data to be compressed, with the most commonly used strings preferably put towards the end of the dictionary…”

slide-75
SLIDE 75

75

Generating Compression Dictionary

A table for storing JSON keys:

CREATE TABLE extracted_words (
  level BIGINT UNSIGNED NOT NULL,
  word VARCHAR(255) NOT NULL,
  weight BIGINT UNSIGNED NOT NULL,
  PRIMARY KEY(level, word)
) ENGINE=InnoDB;

slide-76
SLIDE 76

76

Generating Compression Dictionary

Stored procedure for extracting keys from a JSON value recursively

CREATE PROCEDURE extract_json_keys_from_tree(IN tree JSON, IN path VARCHAR(255), IN level INT)
BEGIN
  DECLARE i, n INT;
  DECLARE tree_keys JSON;
  DECLARE word VARCHAR(255);
  SET tree_keys = JSON_KEYS(tree, path);
  IF tree_keys IS NOT NULL THEN
    SET i = 0;
    SET n = JSON_LENGTH(tree_keys);
    WHILE i < n DO
      SET word = JSON_EXTRACT(tree_keys, CONCAT('$[', i, ']'));
      INSERT INTO extracted_words VALUES(level, JSON_UNQUOTE(word), 1)
        ON DUPLICATE KEY UPDATE weight = weight + 1;
      CALL extract_json_keys_from_tree(tree, CONCAT(path, '.', word), level + 1);
      SET i = i + 1;
    END WHILE;
  ELSEIF JSON_CONTAINS_PATH(tree, 'one', CONCAT(path, '[*]')) THEN
    SET i = 0;
    SET n = JSON_LENGTH(tree, path);
    WHILE i < n DO
      CALL extract_json_keys_from_tree(tree, CONCAT(path, '[', i, ']'), level + 1);
      SET i = i + 1;
    END WHILE;
  END IF;
END

slide-77
SLIDE 77

77

Generating Compression Dictionary

Stored procedure for extracting keys from JSON values stored in a table

CREATE PROCEDURE extract_json_keys_from_table()
BEGIN
  DECLARE done BOOLEAN DEFAULT FALSE;
  DECLARE tree JSON;
  DECLARE cur CURSOR FOR SELECT value FROM companies_src_parsed ORDER BY id;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;
  OPEN cur;
  read_loop: LOOP
    FETCH cur INTO tree;
    IF done THEN
      LEAVE read_loop;
    END IF;
    CALL extract_json_keys_from_tree(tree, '$', 0);
  END LOOP;
  CLOSE cur;
END

slide-78
SLIDE 78

78

Generating Compression Dictionary

Filling extracted_words with JSON keys from companies_src_parsed SET max_sp_recursion_depth = 255; CALL extract_json_keys_from_table();

slide-79
SLIDE 79

79

Generating Compression Dictionary

SELECT * FROM extracted_words;
…
1  source_url                 2677
1  stock_symbol                390
1  term_code                  2677
1  valuation_amount            390
1  valuation_currency_code     390
2  acquired_day               4412
2  acquired_month             4412
2  acquired_year              4412
2  address1                  16705
…

slide-80
SLIDE 80

80

Generating Compression Dictionary

Simplified extracted_words view:

SET group_concat_max_len = 32768;

CREATE VIEW extracted_words_simple AS
  SELECT word, SUM(weight) AS weight, BIT_OR(POWER(2, level)) AS level_mask
  FROM extracted_words GROUP BY word;

slide-81
SLIDE 81

81

Generating Compression Dictionary

Creating compression dictionaries from keywords of the zero level

SELECT GROUP_CONCAT(word ORDER BY weight, word SEPARATOR '') INTO @res
  FROM extracted_words_simple WHERE (level_mask & 1) != 0;
CREATE COMPRESSION_DICTIONARY dict_zero_level(@res);

SELECT GROUP_CONCAT(word ORDER BY weight DESC, word SEPARATOR '') INTO @res
  FROM extracted_words_simple WHERE (level_mask & 1) != 0;
CREATE COMPRESSION_DICTIONARY dict_zero_level_reversed(@res);

slide-82
SLIDE 82

82

Generating Compression Dictionary

Creating compression dictionaries from keywords of all levels

SELECT GROUP_CONCAT(word ORDER BY weight, word SEPARATOR '') INTO @res
  FROM extracted_words_simple;
CREATE COMPRESSION_DICTIONARY dict_all_levels(@res);

SELECT GROUP_CONCAT(word ORDER BY weight DESC, word SEPARATOR '') INTO @res
  FROM extracted_words_simple;
CREATE COMPRESSION_DICTIONARY dict_all_levels_reversed(@res);
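The ordering logic behind GROUP_CONCAT(word ORDER BY weight, word SEPARATOR '') can be sketched outside SQL as well. A Python equivalent (`build_dictionary` is an illustrative name; the sample weights are taken from the extracted_words output a few slides back):

```python
def build_dictionary(word_weights):
    # Ascending weight puts the most common words at the end of the
    # dictionary, which is where zlib prefers them (per the zlib manual);
    # ties are broken alphabetically, like ORDER BY weight, word.
    ordered = sorted(word_weights.items(), key=lambda kv: (kv[1], kv[0]))
    return "".join(word for word, _ in ordered).encode()

weights = {"source_url": 2677, "acquired_day": 4412, "stock_symbol": 390}
print(build_dictionary(weights))
```

Sorting DESC instead would give the "reversed" variants (dict_zero_level_reversed, dict_all_levels_reversed).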

slide-83
SLIDE 83

Compression with Dictionaries Experiments

Compression ratio and insertion time

slide-84
SLIDE 84

84

Compression with Dictionaries Experiments

Copying data to a new table:

CREATE TABLE companies_compressed (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  value LONGTEXT COLUMN_FORMAT COMPRESSED WITH COMPRESSION_DICTIONARY DICT NOT NULL,
  PRIMARY KEY(id)
) ENGINE=InnoDB;

INSERT INTO companies_compressed SELECT * FROM companies_src;

DICT = [ dict_zero_level, dict_zero_level_reversed, dict_all_levels, dict_all_levels_reversed ]

slide-85
SLIDE 85

85

Compression with Dictionaries Experiments – Data size, MB

        LONGTEXT  COMPRESSED TEXT  DICT ZERO  DICT ZERO REV  DICT ALL  DICT ALL REV
LVL0      898          898            898          898          898        898
LVL1      898          304            280          281          273        276
LVL6      898          284            257          257          252        253
LVL9      898          283            256          257          251        252

slide-86
SLIDE 86

86

Compression with Dictionaries Experiments – Query time, sec

        LONGTEXT  COMPRESSED TEXT  DICT ZERO  DICT ZERO REV  DICT ALL  DICT ALL REV
LVL0      33           17             16           16           17         17
LVL1      33           22             21           21           21         22
LVL6      33           28             28           28           29         29
LVL9      33           30             30           30           31         30

slide-87
SLIDE 87

87

Compression with Dictionaries Experiments – Data size, %

        LONGTEXT  COMPRESSED TEXT  DICT ZERO  DICT ZERO REV  DICT ALL  DICT ALL REV
LVL0    100.00%       100.00%       100.00%      100.00%      100.00%    100.00%
LVL1    100.00%        33.83%        31.16%       31.27%       30.38%     30.71%
LVL6    100.00%        31.60%        28.59%       28.59%       28.04%     28.15%
LVL9    100.00%        31.49%        28.48%       28.59%       27.92%     28.04%

slide-88
SLIDE 88

88

Compression with Dictionaries Experiments – Query time, %

        LONGTEXT  COMPRESSED TEXT  DICT ZERO  DICT ZERO REV  DICT ALL  DICT ALL REV
LVL0    100.00%        51.52%        48.48%       48.48%       51.52%     51.52%
LVL1    100.00%        66.67%        63.64%       63.64%       63.64%     66.67%
LVL6    100.00%        84.85%        84.85%       84.85%       87.88%     87.88%
LVL9    100.00%        90.91%        90.91%       90.91%       93.94%     90.91%

slide-89
SLIDE 89

The most important slide

The best from all worlds

slide-90
SLIDE 90

90

The most important slide – Data size, MB

        LONGTEXT  PAGE ZLIB  TABLE KBS4  COMPRESS()  COLUMN COMPRESS  DICT ALL
LVL0      898        920        938         283            898           898
LVL1      898        434        338         283            304           273
LVL6      898        373        317         283            284           252
LVL9      898        372        316         283            283           251

slide-91
SLIDE 91

91

The most important slide – Data size, %

        LONGTEXT  PAGE ZLIB  TABLE KBS4  COMPRESS()  COLUMN COMPRESS  DICT ALL
LVL0    100.00%    102.45%    104.43%      31.49%        100.00%       100.00%
LVL1    100.00%     48.36%     37.61%      31.49%         33.83%        30.38%
LVL6    100.00%     41.59%     35.27%      31.49%         31.60%        28.04%
LVL9    100.00%     41.37%     35.19%      31.49%         31.49%        27.92%

slide-92
SLIDE 92

92

The most important slide – Summary

                   LVL1     LVL6     LVL9
COLUMN COMPRESS   33.83%   31.60%   31.49%
DICT ALL          30.38%   28.04%   27.92%
RATIO             89.79%   88.71%   88.67%

-11.33%
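The RATIO row is simply DICT ALL divided by COLUMN COMPRESS: how much of the compressed-column footprint remains once a dictionary is added (roughly 11% saved on top). A quick sanity check on the percentages from the tables above (small deviations from the slide figures come from rounding):

```python
# Relative data sizes from the earlier percentage tables.
column_compress = {"LVL1": 33.83, "LVL6": 31.60, "LVL9": 31.49}
dict_all = {"LVL1": 30.38, "LVL6": 28.04, "LVL9": 27.92}

# RATIO = DICT ALL / COLUMN COMPRESS, expressed as a percentage.
ratio = {lvl: round(dict_all[lvl] / column_compress[lvl] * 100, 2)
         for lvl in column_compress}
print(ratio)
```

The computed values land within a few hundredths of the slide's 89.79% / 88.71% / 88.67%, since the slide divided the unrounded underlying sizes.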

slide-93
SLIDE 93

93

Thank You, Sponsors!

slide-94
SLIDE 94

94

SAVE THE DATE!

CALL FOR PAPERS OPENING SOON!

www.perconalive.com

April 23-25, 2018

Santa Clara Convention Center

slide-95
SLIDE 95

Thank you

You survived 90 slides