SLIDE 1

Data Blocks: Hybrid OLTP and OLAP on Compressed Storage using both Vectorization and Compilation †

Harald Lang¹, Tobias Mühlbauer¹, Florian Funke²,∗, Peter Boncz³,∗, Thomas Neumann¹, Alfons Kemper¹

¹Technical University Munich, ²Snowflake Computing, ³Centrum Wiskunde & Informatica

†To appear at SIGMOD 2016.

* Work done while at Technical University Munich.

SLIDE 2

Goals

◮ Primary goal
  ◮ Reducing the memory footprint in hybrid OLTP&OLAP database systems
  ◮ Retaining high query performance and transactional throughput

◮ Secondary goals / future work
  ◮ Evicting cold data to secondary storage
  ◮ Reducing costly disk I/O

◮ Out of scope
  ◮ Hot/cold clustering (see previous work of Funke et al.: “Compacting Transactional Data in Hybrid OLTP&OLAP Databases”)

SLIDE 3

Compression in Hybrid OLTP&OLAP Database Systems

◮ SAP HANA (existing approach)
  ◮ Compresses entire relations
  ◮ Updates are performed in an uncompressed, write-optimized partition
  ◮ Implicit hot/cold clustering
  ◮ Partitions are merged

◮ HyPer (our approach)
  ◮ Splits relations into fixed-size chunks (e.g., 64 K tuples)
  ◮ Cold chunks are “frozen” into immutable Data Blocks

SLIDE 4

Data Blocks

◮ Compressed columnar storage format
  ◮ Designed for cold data (mostly read)
  ◮ Immutable and self-contained
  ◮ Fast scans and fast point accesses
  ◮ Novel index structure to narrow scan ranges

[Figure: hot chunks stay uncompressed, with mostly point accesses through an
index (some on cold data); cold chunks are compressed into Data Blocks. OLAP
query pipelines scan both the hot and the cold part; OLTP touches mostly hot
data.]

SLIDE 5

Compression Schemes

◮ Lightweight compression only
  ◮ Single value, byte-aligned truncation, ordered dictionary
  ◮ Efficient predicate evaluation, decompression, and point accesses

◮ Optimal compression chosen based on the actual value distribution
  ◮ Improves the compression ratio and amortizes the overhead of light-weight
    compression schemes and the redundancies caused by block-wise compression

[Figure: uncompressed chunks (chunk0: A0 B0 C0; chunk1: A1 B1 C1; …) are
frozen into Data Blocks. Each block picks per-column schemes, e.g., block 0:
truncated (A), dictionary (B) with truncated keys (B), truncated (C);
block 1: single value (A), truncated (B), dictionary (C) with keys (C).]
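The slides state only that the scheme is chosen per block from the actual value distribution, without spelling out the heuristic. A minimal illustrative sketch of such a per-column chooser (function name and thresholds are ours, not HyPer's):

```python
def choose_scheme(column):
    """Pick a lightweight scheme for one column of one block.
    Illustrative heuristic only -- not HyPer's actual selection logic."""
    distinct = set(column)
    if len(distinct) == 1:
        return "single value"            # whole block stores one value
    span = max(column) - min(column)
    if span < 1 << 8:
        return "1-byte truncation"       # store value - min, byte-aligned
    if span < 1 << 16:
        return "2-byte truncation"
    if len(distinct) <= 1 << 16:
        return "ordered dictionary"      # sorted dictionary + truncated keys
    return "4-byte truncation"
```

For example, a block where every tuple carries the same status code collapses to a single value, while a wide-domain column with few distinct values falls back to an ordered dictionary.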

SLIDE 6

Positional SMAs

◮ Lightweight indexing
◮ Extension of traditional SMAs (min/max indexes)
◮ Narrow scan ranges in a Data Block

[Figure: a SARGable predicate σ probes the Positional SMA (PSMA), which
yields the range of the compressed Data Block with potential matches.]

◮ Supported predicates:
  ◮ column ◦ constant, where ◦ ∈ {=, is, <, ≤, ≥, >}
  ◮ column between a and b

SLIDE 7

Positional SMAs - Details

◮ Lookup table where each table entry contains a range with potential matches
◮ For n-byte values, the table consists of n × 256 entries
◮ Only the most significant non-zero byte is considered

[Figure: a value such as 0x0200AA is mapped to a table entry by its most
significant non-zero byte (here 0x02); the leading zero-bytes select one of
the n groups of 256 range entries, and the tail bytes are ignored. The entry
stores a range with potential matches, e.g., [0, 3). The max # of values
sharing an entry grows with the byte position: 1, 2⁸, 2¹⁶, 2²⁴.]

SLIDE 8

Positional SMAs - Details

(Figure repeated from the previous slide.)

◮ The preferred, narrow ranges are achieved by indexing with the delta value,
  value − SMAmin, instead of the raw value

SLIDE 9

Positional SMAs - Example

[Figure: SMA min = 2, SMA max = 999.
Probing value 7: delta = 7 − min = 5, delta bytes 00 00 00 05; the leading
non-zero byte 0x05 at position 0 selects entry 5, which holds range [6, 7).
Probing value 998: delta = 998 − min = 996, delta bytes 00 00 03 E4; the
leading non-zero byte 0x03 at position 1 selects entry 259, which holds
range [1, 17).]
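The slot arithmetic in this example can be sketched as follows (4-byte values assumed, as in the figure; the function name is ours):

```python
def psma_slot(value, sma_min):
    """Return the PSMA lookup-table slot for a value (4-byte values
    assumed, giving a table of 4 x 256 range entries)."""
    delta = value - sma_min
    # scan from the most significant byte down to find the leading
    # non-zero byte; its position and value determine the slot
    for byte_pos in reversed(range(4)):
        b = (delta >> (byte_pos * 8)) & 0xFF
        if b != 0:
            return byte_pos * 256 + b
    return 0  # delta == 0 falls into slot 0
```

With SMA min = 2 as on the slide, probing 7 gives delta 5 and slot 5, while probing 998 gives delta 996 = 0x03E4 and slot 1 · 256 + 3 = 259, matching the two entries shown.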

SLIDE 10

Challenge for JIT-compiling Query Engines

◮ HyPer compiles queries just-in-time (JIT) using the LLVM compiler framework
◮ Generated code is data-centric and processes a tuple at a time

for (const Chunk& c : relation.chunks) {
   for (unsigned row = 0; row != c.rows; ++row) {
      auto attr0 = c.column[0].data[row];
      auto attr3 = c.column[3].data[row];
      // check scan restrictions
      if (tuple qualifies) {
         // code of consuming operator
         ...
      }
   }
}

◮ Data Blocks individually determine the best suitable compression scheme for
  each column on a per-block basis
◮ The variety of physical representations results in either
  ◮ multiple code paths ⇒ exploding compile time,
  ◮ or interpretation overhead ⇒ performance drop at runtime

SLIDE 11

Vectorization to the Rescue

◮ Vectorization greatly reduces the interpretation overhead
◮ Specialized vectorized scan functions for each compression scheme
◮ The vectorized scan extracts matching tuples to temporary storage, where
  the tuples are consumed by the tuple-at-a-time JIT code

[Figure: a single scan interface pushes vectors of e.g., 8192 records into
the JIT-compiled tuple-at-a-time query pipeline.
  • Cold scan on a compressed Data Block: interpreted vectorized scan —
    vectorized evaluation of SARGable predicates on compressed data (narrowed
    by indexes and PSMAs) and unpacking of matches; matches are pushed
    tuple-at-a-time.
  • Hot scan on an uncompressed chunk: interpreted vectorized scan —
    vectorized evaluation of SARGable predicates and copying of matches.
  • Hot scan on an uncompressed chunk: JIT-compiled tuple-at-a-time scan —
    tuple-at-a-time evaluation of scan predicates.]
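The hand-off in the figure — interpreted vectorized scans filling temporary vectors that the JIT-compiled pipeline then consumes — can be outlined schematically as follows (`predicate` and `consume` stand in for the generated predicate code and the compiled query pipeline; names are ours):

```python
VECTOR_SIZE = 8192  # vector granularity shown on the slide

def vectorized_scan(blocks, predicate, consume):
    """Sketch of the single scan interface: gather matching tuples into
    temporary storage, then hand them over vector by vector to the
    tuple-at-a-time consumer."""
    vector = []
    for block in blocks:
        for tup in block:
            if predicate(tup):          # SARGable predicate evaluation
                vector.append(tup)      # unpack/copy match to temp storage
            if len(vector) == VECTOR_SIZE:
                for t in vector:        # push matches tuple-at-a-time
                    consume(t)
                vector.clear()
    for t in vector:                    # flush the last partial vector
        consume(t)
```

The interpretation overhead is paid once per vector rather than once per tuple, which is what makes the interpreted scan competitive with generated code.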

SLIDE 12

Predicate Evaluation using SIMD Instructions

Find initial matches:

[Figure: the scan reads eight aligned values at a time (plus an unaligned
head and remaining tail); predicate evaluation yields a movemask (e.g., 154
in decimal). The mask indexes a precomputed positions table with 256 entries,
each listing the positions of the set bits (entry 255 maps to 0, 1, 2, 3, 4,
5, 6, 7). The looked-up positions are shifted by the global scan position
(here +11, the read offset) and appended to the match-positions vector at the
write offset.]
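A scalar emulation of this step (plain Python standing in for the SIMD intrinsics; names are ours) shows how the 256-entry positions table turns a movemask into match positions:

```python
# Precompute, for every possible 8-bit movemask, the positions of its set
# bits -- the "precomputed positions table" with 256 entries.
POSITIONS = [[i for i in range(8) if mask >> i & 1] for mask in range(256)]

def find_matches(data, lo, hi):
    """Compare 8 values at a time, collapse the comparison results into a
    movemask, then expand the mask into match positions via the table and
    add the global scan position."""
    matches = []
    for base in range(0, len(data), 8):
        mask = 0
        for lane, value in enumerate(data[base:base + 8]):
            if lo <= value <= hi:       # stands in for the SIMD compare
                mask |= 1 << lane       # stands in for movemask
        matches.extend(base + p for p in POSITIONS[mask])
    return matches
```

The table lookup replaces a per-lane branch: one mask extraction plus one list append per vector of eight values.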

SLIDE 13

Predicate Evaluation using SIMD Instructions

Additional Restrictions

[Figure: match positions from the first predicate are read at the read
offset; a gather fetches the corresponding values and the predicate
evaluation yields a movemask (e.g., 172 in decimal). The positions-table
lookup then drives a shuffle that compacts the surviving match positions,
which are stored back into the match vector at the write offset.]

SLIDE 14

Evaluation

SLIDE 15

Compression Ratio

Size of TPC-H, IMDB cast info, and a flight database in HyPer and Vectorwise:

                          TPC-H SF100     IMDB¹ cast info   Flights²
uncompressed
  CSV                     107 GB          1.4 GB            12 GB
  HyPer                   126 GB          1.8 GB            21 GB
  Vectorwise              105 GB          0.72 GB           11 GB
compressed
  HyPer                    66 GB (0.62×)  0.50 GB (0.36×)   4.2 GB (0.35×)
  Vectorwise               54 GB (0.50×)  0.24 GB (0.17×)   3.2 GB (0.27×)

¹ http://www.imdb.com
² http://stat-computing.org/dataexpo/2009/

SLIDE 16

Query Performance

Runtimes of TPC-H queries (scale factor 100) using different scan types on uncompressed and compressed databases in HyPer and Vectorwise.

scan type                        geometric mean    sum
HyPer
  JIT (uncompressed)             0.586 s           21.7 s
  Vectorized (uncompressed)      0.583 s (1.01×)   21.6 s
    + SARG                       0.577 s (1.02×)   21.8 s
  Data Blocks (compressed)       0.555 s (1.06×)   21.5 s
    + SARG/SMA                   0.466 s (1.26×)   20.3 s
    + PSMA                       0.463 s (1.27×)   20.2 s
Vectorwise
  uncompressed storage           2.336 s           74.4 s
  compressed storage             2.527 s (0.92×)   78.5 s

SLIDE 17

Query Performance (cont’d)

Speedup of TPC-H Q6 (scale factor 100) on block-wise sorted³ data (+SORT).

[Chart: speedup over JIT (axis up to 14×) for JIT, VEC, Data Blocks (+PSMA),
+SORT (−PSMA), and +SORT (+PSMA); the difference between the two +SORT bars
shows the gain by PSMA.]

³ sorted by l_shipdate

SLIDE 18

OLTP Performance - Point Access

Throughput (in lookups per second) of random point-access queries

  select * from customer where c_custkey = randomCustKey()

on TPC-H scale factor 100 with a primary key index on c_custkey:

               Throughput [lookups/sec]
Uncompressed   545,554
Data Blocks    294,291 (0.54×)

SLIDE 19

OLTP Performance - TPC-C

TPC-C transaction throughput (5 warehouses), with old neworder records
compressed into Data Blocks:

               Throughput [Tx/sec]
Uncompressed   89,229
Data Blocks    88,699 (0.99×)

Only the read-only TPC-C transactions order status and stock level, with all
relations frozen into Data Blocks:

               Throughput [Tx/sec]
Uncompressed   119,889
Data Blocks    109,649 (0.91×)

SLIDE 20

Performance of SIMD Predicate Evaluation

Speedup of SIMD predicate evaluation of type l ≤ A ≤ r with selectivity 20%:

[Chart: speedup over sequential x86 code (axis up to 5×) for x86, SSE, and
AVX2 across 8-, 16-, 32-, and 64-bit data types.]

SLIDE 21

Performance of SIMD Predicate Evaluation (cont’d)

Cost of applying an additional restriction, varying the selectivity of the
first predicate while the selectivity of the second predicate is fixed
at 40%:

[Chart: cycles per element for x86 and AVX2 over the selectivity of the
first predicate (0–100%), one panel each for 8-, 16-, 32-, and 64-bit data
types.]

SLIDE 22

Advantages of Byte-Addressability

Predicate Evaluation

Cost of evaluating a SARGable predicate of type l ≤ A ≤ r with varying selectivities:

[Chart: cycles per tuple over selectivity (0–100%) for Data Blocks,
horizontal bit-packing, and horizontal bit-packing + positions table.]

◮ dom(A) = [0, 2¹⁶]
◮ Intentionally, the domain exceeds the 2-byte truncation by one bit
◮ 17-bit codes with bit-packing, 32-bit codes with Data Blocks
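The byte-aligned side of this trade-off can be sketched as follows (function names are ours): codes are padded to whole bytes, so the 17-bit domain above costs 32-bit codes, but any tuple can then be fetched with a plain byte offset instead of bit shifting across word boundaries.

```python
def freeze(values):
    """Byte-aligned truncation: store value - min in the smallest
    whole-byte width (1, 2, or 4 bytes in this sketch)."""
    lo = min(values)
    span = max(values) - lo
    width = 1 if span < 1 << 8 else 2 if span < 1 << 16 else 4
    data = b"".join((v - lo).to_bytes(width, "little") for v in values)
    return lo, width, data

def point_access(block, row):
    """O(1) positional access: a byte offset, no bit unpacking."""
    lo, width, data = block
    offset = row * width
    return lo + int.from_bytes(data[offset:offset + width], "little")
```

Bit-packing would spend fewer bits per code here, which is exactly the compression-ratio edge measured on the next slide.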

SLIDE 23

Advantages of Byte-Addressability

Unpacking matching tuples

Cost of unpacking matching tuples:

[Chart: cycles per matching tuple (log scale) over selectivity (1–100%) for
Data Blocks, bit-packed with positional access, and bit-packed (unpack all
and filter).]

◮ 3 attributes, dom(A) = dom(B) = [0, 2¹⁶] and dom(C) = [0, 2⁸]
◮ Intentionally, the domains exceed 1-byte and 2-byte truncation by one bit
◮ The compression ratio of bit-packing is almost two times higher in this
  scenario

SLIDE 24

Thank you!