Database Compression
@ Andy_Pavlo // 15-721 // Spring 2019
ADVANCED DATABASE SYSTEMS — Lecture #10, CMU 15-721 (Spring 2019)
UPCOMING DATABASE EVENTS

Splice Machine Tech Talk
→ Thursday Feb 21st @ 12:00pm
→ CIC 4th Floor
→ CEO/Co-Founder Monte Zweben (CMU '85)

2
Compression Background
Naïve Compression
OLAP Columnar Compression
OLTP Index Compression
3
OBSERVATION
I/O is the main bottleneck if the DBMS has to fetch data from disk; for in-memory DBMSs the picture is more complicated. The key trade-off is speed vs. compression ratio.
→ In-memory DBMSs (always?) choose speed.
→ Compressing the database reduces DRAM requirements and processing.
4
REAL-WORLD DATA CHARACTERISTICS
Data sets tend to have highly skewed distributions for attribute values.
→ Example: Zipfian distribution of the Brown Corpus
Data sets tend to have high correlation between attributes of the same tuple.
→ Example: Zip Code to City, Order Date to Ship Date
5
DATABASE COMPRESSION
Goal #1: Must produce fixed-length values.
→ Only exception is var-length data stored in separate pool.
Goal #2: Postpone decompression for as long as possible during query execution.
→ Also known as late materialization.
Goal #3: Must be a lossless scheme.
6
LOSSLESS VS. LOSSY COMPRESSION

When a DBMS uses compression, it is always lossless because people don't like losing data. Any kind of lossy compression has to be performed at the application level. That said, reading less than the entire data set during query execution is sort of like compression…
7
DATA SKIPPING
Approach #1: Approximate Queries (Lossy)
→ Execute queries on a sampled subset of the entire table to produce approximate results. → Examples: BlinkDB, SnappyData, XDB, Oracle (2017)
Approach #2: Zone Maps (Lossless)
→ Pre-compute columnar aggregations per block that allow the DBMS to check whether queries need to access it. → Examples: Oracle, Vertica, MemSQL, Netezza
8
ZONE MAPS
Pre-computed aggregates for blocks of data. DBMS can check the zone map first to decide whether it wants to access the block.
9
Original Data
val: 100 200 300 400 400

Zone Map
MIN=100, MAX=400, AVG=280, SUM=1400, COUNT=5
SELECT * FROM table WHERE val > 600
The DBMS can skip this block entirely: the zone map's MAX is 400, so no value in the block can satisfy val > 600.
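The skipping logic above can be sketched in a few lines. This is a minimal illustration, not any particular DBMS's implementation; the `ZoneMap` and `scan_gt` names are made up for this example.

```python
class ZoneMap:
    """Pre-computed aggregates for one block of values."""
    def __init__(self, values):
        self.min = min(values)
        self.max = max(values)
        self.sum = sum(values)
        self.count = len(values)
        self.avg = self.sum / self.count

def scan_gt(blocks, threshold):
    """Evaluate `val > threshold`, consulting each block's zone map first."""
    result = []
    for values, zmap in blocks:
        if zmap.max <= threshold:   # no value in this block can qualify,
            continue                # so the block is skipped without a read
        result.extend(v for v in values if v > threshold)
    return result

block = [100, 200, 300, 400, 400]
blocks = [(block, ZoneMap(block))]
# WHERE val > 600: the zone map's MAX is 400, so the block is never scanned.
print(scan_gt(blocks, 600))  # → []
```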
OBSERVATION

If we want to add compression to our DBMS, the first question we have to ask ourselves is what we want to compress. This determines what compression schemes are available to us…
10
COMPRESSION GRANULARITY
Choice #1: Block-level
→ Compress a block of tuples for the same table.
Choice #2: Tuple-level
→ Compress the contents of the entire tuple (NSM-only).
Choice #3: Attribute-level
→ Compress a single attribute value within one tuple. → Can target multiple attributes for the same tuple.
Choice #4: Column-level
→ Compress multiple values for one or more attributes stored for multiple tuples (DSM-only).
11
NAÏVE COMPRESSION

Compress data using a general-purpose algorithm. The scope of compression is only based on the data provided as input.
→ LZO (1996), LZ4 (2011), Snappy (2011), Brotli (2013), Oracle OZIP (2014), Zstd (2015)
Considerations
→ Computational overhead → Compress vs. decompress speed.
12
MYSQL INNODB COMPRESSION

14

Disk pages are stored compressed at [1,2,4,8] KB, while the in-memory page size is 16 KB. The buffer pool holds compressed pages, each with a "mod log"; a page is decompressed into an uncompressed 16 KB copy when it needs to be read. Updates are appended to the mod log so the page does not have to be recompressed on every write.

Source: MySQL 5.7 Documentation
NAÏVE COMPRESSION
The data has to be decompressed first before it can be read and (potentially) modified.
→ This limits the “scope” of the compression scheme.
These schemes also do not consider the high-level meaning or semantics of the data.
15
OBSERVATION
We can perform exact-match comparisons and natural joins on compressed data if predicates and data are compressed the same way.
→ Range predicates are more tricky…
16
SELECT * FROM users WHERE name = 'Andy'
→ rewritten over compressed data as:
SELECT * FROM users WHERE name = XX

Original Data          Compressed Data
NAME  SALARY           NAME  SALARY
Andy  99999            XX    AA
Lin   88888            YY    BB
COLUMNAR COMPRESSION

Null Suppression
Run-length Encoding
Bitmap Encoding
Delta Encoding
Incremental Encoding
Mostly Encoding
Dictionary Encoding
17
NULL SUPPRESSION
Consecutive zeros or blanks in the data are replaced with a description of how many there were and where they existed.
→ Example: Oracle’s Byte-Aligned Bitmap Codes (BBC)
Useful in wide tables with sparse data.
18
DATABASE COMPRESSION
SIGMOD RECORD 1993
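A minimal sketch of the idea: runs of zeros are replaced with (start, length) descriptors, and only the non-null values are kept. This is an illustration only, much simpler than schemes like Oracle's BBC; the function name is made up here.

```python
def suppress_nulls(values, null=0):
    """Split a column into (start, length) descriptors for runs of nulls,
    plus (position, value) pairs for the surviving non-null values."""
    runs, kept = [], []
    i = 0
    while i < len(values):
        if values[i] == null:
            j = i
            while j < len(values) and values[j] == null:
                j += 1
            runs.append((i, j - i))     # where the nulls were, and how many
            i = j
        else:
            kept.append((i, values[i]))
            i += 1
    return runs, kept

runs, kept = suppress_nulls([0, 0, 0, 7, 0, 0, 9])
print(runs)  # → [(0, 3), (4, 2)]
print(kept)  # → [(3, 7), (6, 9)]
```

The sparser the column, the more of it collapses into the short run descriptors, which is why this works well for wide tables with sparse data.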
RUN-LENGTH ENCODING
Compress runs of the same value in a single column into triplets:
→ The value of the attribute. → The start position in the column segment. → The # of elements in the run.
Requires the columns to be sorted intelligently to maximize compression opportunities.
19
RUN-LENGTH ENCODING

20

Original Data
id:  1 2 3 4 6 7 8 9
sex: M M M F M F M M

Compressed Data — RLE triplets (value, start position, run length)
sex: (M,0,3) (F,3,1) (M,4,1) (F,5,1) (M,6,2)
RUN-LENGTH ENCODING

20

Sorted Data (tuples reordered so the sex column is sorted)
id:  1 2 3 6 8 9 4 7
sex: M M M M M M F F

Compressed Data
sex: (M,0,6) (F,6,2)
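The triplet scheme above is straightforward to implement. A minimal sketch (the function names are made up for this example):

```python
def rle_encode(column):
    """Compress a column into (value, start position, run length) triplets."""
    triplets = []
    for i, v in enumerate(column):
        if triplets and triplets[-1][0] == v:
            val, start, length = triplets[-1]
            triplets[-1] = (val, start, length + 1)  # extend the current run
        else:
            triplets.append((v, i, 1))               # start a new run
    return triplets

def rle_decode(triplets):
    out = []
    for val, start, length in triplets:
        out.extend([val] * length)
    return out

sex = ['M', 'M', 'M', 'F', 'M', 'F', 'M', 'M']
print(rle_encode(sex))
# → [('M', 0, 3), ('F', 3, 1), ('M', 4, 1), ('F', 5, 1), ('M', 6, 2)]
assert rle_decode(rle_encode(sex)) == sex
```

Running the same encoder on the sorted column `M M M M M M F F` yields just two triplets, which is why intelligent sorting matters so much for RLE.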
BITMAP ENCODING
Store a separate Bitmap for each unique value for a particular attribute where an offset in the vector corresponds to a tuple.
→ The ith position in the Bitmap corresponds to the ith tuple in the table. → Typically segmented into chunks to avoid allocating large blocks of contiguous memory.
Only practical if the value cardinality is low.
21
MODEL 204 ARCHITECTURE AND PERFORMANCE
HIGH PERFORMANCE TRANSACTION SYSTEMS 1987
BITMAP ENCODING

22

Original Data
id:  1 2 3 4 6 7 8 9
sex: M M M F M F M M

Compressed Data (one bitmap per distinct value)
M: 1 1 1 0 1 0 1 1
F: 0 0 0 1 0 1 0 0

Sizes (per the slide): original 9 × 8 bits = 72 bits; compressed 9 × 2 bits = 18 bits.
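Building the per-value bitmaps is simple; a minimal sketch (names invented for this example):

```python
from collections import defaultdict

def bitmap_encode(column):
    """One bitmap per distinct value: bit i is set iff tuple i has that value."""
    bitmaps = defaultdict(lambda: [0] * len(column))
    for i, v in enumerate(column):
        bitmaps[v][i] = 1
    return dict(bitmaps)

sex = ['M', 'M', 'M', 'F', 'M', 'F', 'M', 'M']
bm = bitmap_encode(sex)
print(bm['M'])  # → [1, 1, 1, 0, 1, 0, 1, 1]
print(bm['F'])  # → [0, 0, 0, 1, 0, 1, 0, 0]

# An exact-match predicate "sex = 'F'" becomes a bitmap lookup:
matches = [i for i, bit in enumerate(bm['F']) if bit]
print(matches)  # → [3, 5]
```

Note that this builds one full-length bitmap per distinct value, which is exactly why the scheme is only practical when value cardinality is low.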
BITMAP ENCODING: EXAMPLE

Assume we have 10 million tuples and there are 43,000 zip codes in the US.
→ As a 32-bit integer column: 10000000 × 32 bits = 40 MB
→ As bitmaps: 10000000 × 43000 bits = 53.75 GB

Every time a txn inserts a new tuple, we have to extend 43,000 different bitmaps.
23
CREATE TABLE customer_dim (
  id INT PRIMARY KEY,
  name VARCHAR(32),
  email VARCHAR(64),
  address VARCHAR(64),
  zip_code INT
);
BITMAP ENCODING: COMPRESSION

Approach #1: General Purpose Compression
→ Use standard compression algorithms (e.g., LZ4, Snappy).
→ Have to decompress the bitmap before you can use it to process a query.

Approach #2: Byte-aligned Bitmap Codes
→ Structured run-length encoding compression.
24
ORACLE BYTE-ALIGNED BITMAP CODES

Divide the Bitmap into chunks that contain different categories of bytes:
→ Gap Byte: All the bits are 0s.
→ Tail Byte: Some bits are 1s.

Encode each chunk that consists of some Gap Bytes followed by some Tail Bytes.
→ Gap Bytes are compressed with RLE.
→ Tail Bytes are stored uncompressed, unless the tail is a single byte with only one non-zero bit (which can be encoded in the header).
25
BYTE- ALIGNED BITMAP COMPRESSION
DATA COMPRESSION CONFERENCE 1995
ORACLE BYTE-ALIGNED BITMAP CODES

26

Source: Brian Babcock

Bitmap (18 bytes):
00000000 00000000 00010000  ← Chunk #1 (Bytes 1–3: two Gap Bytes, one Tail Byte)
00000000 ×13                ← Chunk #2 gap (Bytes 4–16)
01000000 00100010           ← Chunk #2 tail (Bytes 17–18)

Header Byte fields:
→ Number of Gap Bytes (Bits 1–3)
→ Is the tail special? (Bit 4)
→ Number of verbatim bytes (if Bit 4 = 0)
→ Index of the 1 bit in the tail byte (if Bit 4 = 1)

Chunk #1 (Bytes 1–3):
→ Header = (010)(1)(0100): 2 gap bytes; the tail is special (a single 1 bit, position encoded as 0100).
→ No gap-length bytes since the gap length < 7.
→ No verbatim bytes since the tail is special.

Chunk #2 (Bytes 4–18):
→ 13 gap bytes, two tail bytes.
→ The gap count is > 7, so an extra gap-length byte is used: 00001101 (= 13).
→ Header = (111)(0)(0010), then the gap-length byte, then two verbatim tail bytes 01000000 00100010.

Compressed Bitmap:
(010)(1)(0100) (111)(0)(0010) 00001101 01000000 00100010

Original: 18 bytes → BBC compressed: 5 bytes.
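To make the gap/tail idea concrete, here is a heavily simplified sketch in the spirit of BBC. It is NOT Oracle's actual format: the packed header bytes, the gap-length-< 7 optimization, and the special-tail case are all omitted; each chunk is just (gap count, verbatim tail bytes).

```python
def byte_aligned_encode(data: bytes):
    """Split a byte string into (gap_count, tail_bytes) chunks: a run of
    zero bytes followed by the non-zero bytes up to the next zero byte."""
    chunks = []
    i = 0
    while i < len(data):
        gap = 0
        while i < len(data) and data[i] == 0:   # count the Gap Bytes
            gap += 1
            i += 1
        tail = bytearray()
        while i < len(data) and data[i] != 0:   # collect the Tail Bytes
            tail.append(data[i])
            i += 1
        chunks.append((gap, bytes(tail)))
    return chunks

def byte_aligned_decode(chunks):
    out = bytearray()
    for gap, tail in chunks:
        out.extend(b'\x00' * gap)
        out.extend(tail)
    return bytes(out)

# The 18-byte bitmap from the slide: byte 3 and bytes 17-18 are non-zero.
bitmap = bytes([0, 0, 0b00010000] + [0] * 13 + [0b01000000, 0b00100010])
chunks = byte_aligned_encode(bitmap)
print(chunks)  # → [(2, b'\x10'), (13, b'@"')]
assert byte_aligned_decode(chunks) == bitmap
```

The two chunks mirror the slide's Chunk #1 (gap of 2, one tail byte) and Chunk #2 (gap of 13, two tail bytes); the real BBC format then squeezes each chunk's metadata into a single header byte.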
OBSERVATION
Oracle's BBC is an obsolete format.
→ Although it provides good compression, it is slower than recent alternatives due to excessive branching. → Word-Aligned Hybrid (WAH) encoding is a patented variation on BBC that provides better performance.
None of these support random access.
→ If you want to check whether a given value is present, you have to start from the beginning and decompress the whole thing.
27
DELTA ENCODING

Recording the difference between values that follow each other in the same column.
→ Store the base value in-line or in a separate look-up table.
→ Can be combined with RLE to get even better compression ratios.

28

Original Data
time: 12:00 12:01 12:02 12:03 12:04
temp: 99.5  99.4  99.5  99.6  99.4

Compressed Data (deltas)
time: 12:00 +1 +1 +1 +1
temp: 99.5  -0.1 +0.1 +0.1 -0.2

Compressed Data (deltas + RLE)
time: 12:00 (+1,4)
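A minimal sketch of delta encoding for the time column above, with times represented as minutes since 12:00 (that representation, and the function names, are choices made for this example):

```python
def delta_encode(values):
    """Store the base value plus differences between consecutive values."""
    base = values[0]
    deltas = [values[i] - values[i - 1] for i in range(1, len(values))]
    return base, deltas

def delta_decode(base, deltas):
    out = [base]
    for d in deltas:
        out.append(out[-1] + d)   # each value = previous value + delta
    return out

time = [0, 1, 2, 3, 4]           # minutes since 12:00
base, deltas = delta_encode(time)
print(base, deltas)  # → 0 [1, 1, 1, 1]
assert delta_decode(base, deltas) == time
```

The four identical `+1` deltas are exactly what RLE then collapses into the single `(+1,4)` pair shown above.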
INCREMENTAL ENCODING

Type of delta encoding whereby common prefixes (or suffixes) and their lengths are recorded so that they need not be duplicated. This works best with sorted data.

29

Original Data
rob, robbed, robbing, robot

Common Prefix (with the previous string)
rob → robbed:    "rob"  (length 3)
robbed → robbing: "robb" (length 4)
robbing → robot:  "rob"  (length 3)

Compressed Data
Prefix Length  Suffix
—              rob
3              bed
4              ing
3              ot
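The table above can be produced with a few lines; a minimal sketch (function names invented for this example):

```python
import os

def prefix_encode(sorted_strings):
    """Encode each string as (shared-prefix length with its predecessor, suffix)."""
    out, prev = [], ''
    for s in sorted_strings:
        n = len(os.path.commonprefix([prev, s]))
        out.append((n, s[n:]))   # only the non-shared suffix is stored
        prev = s
    return out

def prefix_decode(pairs):
    out, prev = [], ''
    for n, suffix in pairs:
        prev = prev[:n] + suffix
        out.append(prev)
    return out

words = ['rob', 'robbed', 'robbing', 'robot']
print(prefix_encode(words))
# → [(0, 'rob'), (3, 'bed'), (4, 'ing'), (3, 'ot')]
assert prefix_decode(prefix_encode(words)) == words
```

On unsorted data adjacent strings share little, so the prefix lengths drop toward zero; that is why this scheme depends on sorted input.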
MOSTLY ENCODING

When the values for an attribute are "mostly" less than the largest size, you can store them as a smaller data type.
→ The remaining values that cannot be compressed are stored in their raw form.

30

Source: Redshift Documentation

Original Data (int64)
2, 4, 6, 99999999, 8

Compressed Data (mostly8)
2, 4, 6, XXX, 8
Look-up table: offset 3 → value 99999999
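A minimal sketch of the idea behind Redshift-style "mostly8" (this is an illustration, not Redshift's actual implementation; the sentinel choice and function names are assumptions of this example):

```python
def mostly8_encode(values):
    """Store values that fit in one signed byte inline; park the rest in an
    exception table keyed by position."""
    SENTINEL = -128                  # reserved code marking "see exceptions"
    small, exceptions = [], {}
    for i, v in enumerate(values):
        if -127 <= v <= 127:         # fits in 8 bits (sentinel excluded)
            small.append(v)
        else:
            small.append(SENTINEL)
            exceptions[i] = v        # raw 64-bit value stored on the side
    return small, exceptions

def mostly8_decode(small, exceptions):
    return [exceptions.get(i, v) for i, v in enumerate(small)]

data = [2, 4, 6, 99999999, 8]
small, exc = mostly8_encode(data)
print(small)  # → [2, 4, 6, -128, 8]
print(exc)    # → {3: 99999999}
assert mostly8_decode(small, exc) == data
```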
DICTIONARY COMPRESSION

Replace frequent patterns with smaller codes. This is the most pervasive compression scheme in DBMSs. It needs to support fast encoding and decoding, and it also needs to support range queries.
31
ADAPTIVE STRING DICTIONARY COMPRESSION IN IN-MEMORY COLUMN-STORED DATABASE SYSTEMS
EDBT 2014
DICTIONARY COMPRESSION

When to construct the dictionary?
What is the scope of the dictionary?
What data structure do we use for the dictionary?
What encoding scheme to use for the dictionary?
32
DICTIONARY CONSTRUCTION

Choice #1: All At Once
→ Compute the dictionary for all the tuples at a given point in time.
→ New tuples must use a separate dictionary, or the dictionary must be recomputed for all tuples.

Choice #2: Incremental
→ Merge new tuples in with an existing dictionary.
→ Likely requires re-encoding of existing tuples.
33
DICTIONARY SCOPE
Choice #1: Block-level
→ Only include a subset of tuples within a single table. → Potentially lower compression ratio, but can add new tuples more easily.
Choice #2: Table-level
→ Construct a dictionary for the entire table. → Better compression ratio, but expensive to update.
Choice #3: Multi-Table
→ Can be either subset or entire tables. → Sometimes helps with joins and set operations.
34
MULTI-ATTRIBUTE ENCODING

Instead of storing a single value per dictionary entry, store entries that span attributes.
→ I'm not sure any DBMS actually implements this.

35

Original Data        Compressed Data
val1  val2           code
A     101            XX
B     202            YY
C     101            ZZ
A     101            XX
B     202            YY

Dictionary: XX → (A,101), YY → (B,202), ZZ → (C,101)
ENCODING / DECODING
A dictionary needs to support two operations:
→ Encode/Locate: For a given uncompressed value, convert it into its compressed form. → Decode/Extract: For a given compressed value, convert it back into its original form.
No magic hash function will do this for us.
36
ORDER-PRESERVING ENCODING

The encoded values need to support sorting in the same order as the original values.

37

Original Data
name: Andrea, Prashanth, Andy, Lin

Compressed Data
name: 10, 40, 20, 30
Dictionary: 10 → Andrea, 20 → Andy, 30 → Lin, 40 → Prashanth

SELECT * FROM users WHERE name LIKE 'And%'
→ rewritten as:
SELECT * FROM users WHERE name BETWEEN 10 AND 20
ORDER-PRESERVING ENCODING

38

SELECT name FROM users WHERE name LIKE 'And%'
→ Still have to perform a sequential scan.

SELECT DISTINCT name FROM users WHERE name LIKE 'And%'
→ Only need to access the dictionary.
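An order-preserving dictionary can be sketched by assigning codes in sorted order, leaving gaps so new values can be slotted in. This is an illustration only; the class name, the gap-of-10 policy, and the `prefix_range` helper (which assumes at least one value matches the prefix) are inventions of this example.

```python
from bisect import bisect_left, bisect_right

class OrderPreservingDict:
    """Codes assigned in sorted order, so comparisons on codes mirror
    comparisons on the original strings."""
    def __init__(self, values, gap=10):
        self.sorted_values = sorted(set(values))
        self.codes = {v: (i + 1) * gap for i, v in enumerate(self.sorted_values)}
        self.decode_map = {c: v for v, c in self.codes.items()}

    def encode(self, value):
        return self.codes[value]

    def decode(self, code):
        return self.decode_map[code]

    def prefix_range(self, prefix):
        """Rewrite LIKE 'prefix%' into a BETWEEN over codes
        (assumes at least one stored value matches the prefix)."""
        lo = bisect_left(self.sorted_values, prefix)
        hi = bisect_right(self.sorted_values, prefix + '\xff')
        return (self.codes[self.sorted_values[lo]],
                self.codes[self.sorted_values[hi - 1]])

d = OrderPreservingDict(['Andrea', 'Prashanth', 'Andy', 'Lin'])
print([d.encode(n) for n in ['Andrea', 'Prashanth', 'Andy', 'Lin']])
# → [10, 40, 20, 30]
print(d.prefix_range('And'))  # → (10, 20), i.e. name BETWEEN 10 AND 20
```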
DICTIONARY DATA STRUCTURES
Choice #1: Array
→ One array of variable length strings and another array with pointers that maps to string offsets. → Expensive to update.
Choice #2: Hash Table
→ Fast and compact. → Unable to support range and prefix queries.
Choice #3: B+Tree
→ Slower than a hash table and takes more memory. → Can support range and prefix queries.
39
SHARED-LEAVES B+TREE

40

The encode index (original value → encoded value) and the decode index (encoded value → original value) are B+Trees that share the same sorted leaf nodes:
→ Front leaf: aab→10, aae→20, aaf→30, aaz→40
→ Back leaf:  zzb→960, zzm→970, zzx→980, zzz→990
The shared leaves are sorted, and incremental encoding is applied within each leaf.
DICTIONARY-BASED ORDER-PRESERVING STRING COMPRESSION FOR MAIN MEMORY COLUMN STORES
SIGMOD 2009
OBSERVATION
An OLTP DBMS cannot use the OLAP compression techniques because we need to support fast random tuple access.
→ Compressing & decompressing “hot” tuples on-the-fly would be too slow to do during a txn.
Indexes consume a large portion of the memory for an OLTP database…
41
OLTP INDEX OVERHEAD

42

Memory breakdown (all indexes combined in parentheses):
           Tuples   Primary Indexes   Secondary Indexes   (All Indexes)
TPC-C      42.5%    33.5%             24.0%               (57.5%)
Articles   64.8%    22.6%             12.6%               (35.2%)
Voter      45.1%    54.9%             0%                  (54.9%)
HYBRID INDEXES

Split a single logical index into two physical stages; data migrates from one stage to the next over time.
→ Dynamic Stage: New data, fast to update.
→ Static Stage: Old data, compressed + read-only.

All updates go to the dynamic stage. Reads may need to check both stages.
43
REDUCING THE STORAGE OVERHEAD OF MAIN-MEMORY OLTP DATABASES WITH HYBRID INDEXES
SIGMOD 2016
HYBRID INDEXES

44

Inserts, updates, and deletes all go to the dynamic index. A read first probes the dynamic index (a Bloom filter can avoid unnecessary probes) and then falls back to the static index. A periodic merge moves entries from the dynamic stage into the static stage.
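The two-stage flow can be sketched as follows. This is a toy illustration of the idea, not the paper's implementation: the dynamic stage is a plain dict, the static stage is a densely packed sorted array (standing in for a compacted read-only index), and the Bloom filter is omitted.

```python
from bisect import bisect_left

class HybridIndex:
    def __init__(self):
        self.dynamic = {}          # new/updated keys, fast to mutate
        self.static_keys = []      # read-only sorted keys after a merge
        self.static_vals = []

    def insert(self, key, value):
        self.dynamic[key] = value  # all writes go to the dynamic stage

    def get(self, key):
        if key in self.dynamic:    # check the dynamic stage first
            return self.dynamic[key]
        i = bisect_left(self.static_keys, key)   # then the static stage
        if i < len(self.static_keys) and self.static_keys[i] == key:
            return self.static_vals[i]
        return None

    def merge(self):
        """Fold the dynamic stage into the compact static stage."""
        merged = dict(zip(self.static_keys, self.static_vals))
        merged.update(self.dynamic)
        items = sorted(merged.items())
        self.static_keys = [k for k, _ in items]
        self.static_vals = [v for _, v in items]
        self.dynamic = {}

idx = HybridIndex()
idx.insert(23, 'a'); idx.insert(6, 'b')
idx.merge()                        # 6 and 23 now live in the static stage
idx.insert(12, 'c')                # new key stays in the dynamic stage
print(idx.get(6), idx.get(12), idx.get(99))  # → b c None
```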
COMPACT B+TREE

45

A regular B+Tree leaves empty slots in its nodes for future inserts (e.g., an inner node with keys 10, 20, 35 over partially filled leaves 6, 12 and 23, 38). A compacted, read-only B+Tree removes the empty slots and packs the entries densely; since node layout is then fixed, child pointers can be replaced with computed offsets.
HYBRID INDEXES

46

Source: Huanchen Zhang
50% Reads / 50% Writes, 50 million entries.

Throughput (Mop/sec)   Random Int   Mono-Inc Int   Email
Original B+Tree        5.1          5.0            1.7
Hybrid B+Tree          6.2          12.6           2.0

Memory (GB)            Random Int   Mono-Inc Int   Email
Original B+Tree        1.3          1.8            3.2
Hybrid B+Tree          0.9          0.9            2.3
PARTING THOUGHTS

Dictionary encoding is probably the most useful compression scheme because it does not require pre-sorting.
The DBMS can combine different approaches for even better compression.
It is important to wait as long as possible during query execution to decompress data.
47
NEXT CLASS
What happens if our database still does not fit in memory even though we compressed it? Let's bring back disk storage and see what changes in our DBMS architecture…
48