ADVANCED DATABASE SYSTEMS Storage Models & Data Layout @ - - PowerPoint PPT Presentation

advanced database systems
SMART_READER_LITE
LIVE PREVIEW

ADVANCED DATABASE SYSTEMS Storage Models & Data Layout @ - - PowerPoint PPT Presentation

Lect ure # 09 ADVANCED DATABASE SYSTEMS Storage Models & Data Layout @ Andy_Pavlo // 15- 721 // Spring 2019 CMU 15-721 (Spring 2019) 2 DATA O RGAN IZATIO N Variable-Length Fixed-Length Index Data Blocks Data Blocks Block Id +


slide-1
SLIDE 1

Storage Models & Data Layout

@ Andy_Pavlo // 15- 721 // Spring 2019

ADVANCED DATABASE SYSTEMS

Lect ure # 09

slide-2
SLIDE 2 CMU 15-721 (Spring 2019)

DATA O RGAN IZATIO N

2

Fixed-Length Data Blocks Index

Block Id + Offset

Variable-Length Data Blocks

slide-3
SLIDE 3 CMU 15-721 (Spring 2019)

DATA O RGAN IZATIO N

One can think of an in-memory database as just a large array of bytes.

→ The schema tells the DBMS how to convert the bytes into the appropriate type. → Each tuple is prefixed with a header that contains its meta-data.

Storing tuples with as fixed-length data makes it easy to compute the starting point of any tuple.

3

slide-4
SLIDE 4 CMU 15-721 (Spring 2019)

Type Representation Data Layout / Alignment Storage Models System Catalogs

4

slide-5
SLIDE 5 CMU 15-721 (Spring 2019)

DATA REPRESEN TATIO N

INTEGER/BIGINT/SMALLINT/TINYINT

→ C/C++ Representation

FLOAT/REAL vs. NUMERIC/DECIMAL

→ IEEE-754 Standard / Fixed-point Decimals

TIME/DATE/TIMESTAMP

→ 32/64-bit int of (micro/milli)seconds since Unix epoch

VARCHAR/VARBINARY/TEXT/BLOB

→ Pointer to other location if type is ≥64-bits → Header with length and address to next location (if segmented), followed by data bytes.

5

slide-6
SLIDE 6 CMU 15-721 (Spring 2019)

VARIABLE PRECISIO N N UM BERS

Inexact, variable-precision numeric type that uses the “native” C/C++ types. Store directly as specified by IEEE-754. Typically faster than arbitrary precision numbers.

→ Example: FLOAT, REAL/DOUBLE

6

slide-7
SLIDE 7 CMU 15-721 (Spring 2019)

VARIABLE PRECISIO N N UM BERS

7

#include <stdio.h> int main(int argc, char* argv[]) { float x = 0.1; float y = 0.2; printf("x+y = %.20f\n", x+y); printf("0.3 = %.20f\n", 0.3); }

Rounding Example

x+y = 0.30000001192092895508 0.3 = 0.29999999999999998890

Output

slide-8
SLIDE 8 CMU 15-721 (Spring 2019)

FIXED PRECISIO N N UM BERS

Numeric data types with arbitrary precision and

  • scale. Used when round errors are unacceptable.

→ Example: NUMERIC, DECIMAL

Typically stored in a exact, variable-length binary representation with additional meta-data.

→ Like a VARCHAR but not stored as a string

Postgres Demo

8

slide-9
SLIDE 9 CMU 15-721 (Spring 2019)

PO STGRES: N UM ERIC

9

typedef unsigned char NumericDigit; typedef struct { int ndigits; int weight; int scale; int sign; NumericDigit *digits; } numeric;

# of Digits Weight of 1st Digit Scale Factor Positive/Negative/NaN Digit Storage

slide-10
SLIDE 10 CMU 15-721 (Spring 2019)

PO STGRES: N UM ERIC

9

typedef unsigned char NumericDigit; typedef struct { int ndigits; int weight; int scale; int sign; NumericDigit *digits; } numeric;

# of Digits Weight of 1st Digit Scale Factor Positive/Negative/NaN Digit Storage

slide-11
SLIDE 11 CMU 15-721 (Spring 2019)

DATA LAYO UT

10

CREATE TABLE AndySux ( id INT PRIMARY KEY, value BIGINT ); header id value

char[]

reinterpret_cast<int32_t*>(address)

slide-12
SLIDE 12 CMU 15-721 (Spring 2019)

header

VARIABLE- LEN GTH FIELDS

11

CREATE TABLE AndySux ( value VARCHAR(1024) ); 64-BIT POINTER

char[]

Variable-Length Data Blocks

Andy has the worst hygiene that I have ever seen. I hate LENGTH NEXT him so much. NEXT LENGTH

INSERT INTO AndySux VALUES ("Andy has the worst hygiene that I have ever

  • seen. I hate him so much.");
slide-13
SLIDE 13 CMU 15-721 (Spring 2019)

header id

VARIABLE- LEN GTH FIELDS

11

CREATE TABLE AndySux ( value VARCHAR(1024) ); 64-BIT POINTER

char[]

Variable-Length Data Blocks

Andy|64-BIT POINTER Andy has the worst hygiene that I have ever seen. I hate LENGTH NEXT him so much. NEXT LENGTH

INSERT INTO AndySux VALUES ("Andy has the worst hygiene that I have ever

  • seen. I hate him so much.");
slide-14
SLIDE 14 CMU 15-721 (Spring 2019)

N ULL DATA TYPES

Choice #1: Special Values

→ Designate a value to represent NULL for a particular data type (e.g., INT32_MIN).

Choice #2: Null Column Bitmap Header

→ Store a bitmap in the tuple header that specifies what attributes are null.

Choice #3: Per Attribute Null Flag

→ Store a flag that marks that a value is null. → Have to use more space than just a single bit because this messes up with word alignment.

12

slide-15
SLIDE 15 CMU 15-721 (Spring 2019)

N ULL DATA TYPES

Choice #1: Special Values

→ Designate a value to represent NULL for a particular data type (e.g., INT32_MIN).

Choice #2: Null Column Bitmap Header

→ Store a bitmap in the tuple header that specifies what attributes are null.

Choice #3: Per Attribute Null Flag

→ Store a flag that marks that a value is null. → Have to use more space than just a single bit because this messes up with word alignment.

12

slide-16
SLIDE 16 CMU 15-721 (Spring 2019)

DISCLAIM ER

The truth is that you only need to worry about word-alignment for cache lines (e.g., 64 bytes). I’m going to show you the basic idea using 64-bit words since it’s easier to see…

13

slide-17
SLIDE 17 CMU 15-721 (Spring 2019)

WO RD- ALIGN ED TUPLES

All attributes in a tuple must be word aligned to enable the CPU to access it without any unexpected behavior or additional work.

14

CREATE TABLE AndySux ( id INT PRIMARY KEY, cdate TIMESTAMP, color CHAR(2), zipcode INT ); 64-bit Word 64-bit Word 64-bit Word 64-bit Word

char[]

slide-18
SLIDE 18 CMU 15-721 (Spring 2019)

WO RD- ALIGN ED TUPLES

All attributes in a tuple must be word aligned to enable the CPU to access it without any unexpected behavior or additional work.

14

CREATE TABLE AndySux ( id INT PRIMARY KEY, cdate TIMESTAMP, color CHAR(2), zipcode INT ); 32-bits 64-bit Word 64-bit Word 64-bit Word 64-bit Word id

char[]

slide-19
SLIDE 19 CMU 15-721 (Spring 2019)

WO RD- ALIGN ED TUPLES

All attributes in a tuple must be word aligned to enable the CPU to access it without any unexpected behavior or additional work.

14

CREATE TABLE AndySux ( id INT PRIMARY KEY, cdate TIMESTAMP, color CHAR(2), zipcode INT ); 32-bits 64-bits 64-bit Word 64-bit Word 64-bit Word 64-bit Word id cdate

char[]

slide-20
SLIDE 20 CMU 15-721 (Spring 2019)

WO RD- ALIGN ED TUPLES

All attributes in a tuple must be word aligned to enable the CPU to access it without any unexpected behavior or additional work.

14

CREATE TABLE AndySux ( id INT PRIMARY KEY, cdate TIMESTAMP, color CHAR(2), zipcode INT ); 32-bits 64-bits 16-bits 64-bit Word 64-bit Word 64-bit Word 64-bit Word id cdate c

char[]

slide-21
SLIDE 21 CMU 15-721 (Spring 2019)

WO RD- ALIGN ED TUPLES

All attributes in a tuple must be word aligned to enable the CPU to access it without any unexpected behavior or additional work.

14

CREATE TABLE AndySux ( id INT PRIMARY KEY, cdate TIMESTAMP, color CHAR(2), zipcode INT ); 32-bits 64-bits 16-bits 32-bits 64-bit Word 64-bit Word 64-bit Word 64-bit Word id cdate c zipc

char[]

slide-22
SLIDE 22 CMU 15-721 (Spring 2019)

WO RD- ALIGN ED TUPLES

All attributes in a tuple must be word aligned to enable the CPU to access it without any unexpected behavior or additional work.

14

CREATE TABLE AndySux ( id INT PRIMARY KEY, cdate TIMESTAMP, color CHAR(2), zipcode INT ); 32-bits 64-bits 16-bits 32-bits 64-bit Word 64-bit Word 64-bit Word 64-bit Word id cdate c zipc

char[]

slide-23
SLIDE 23 CMU 15-721 (Spring 2019)

WO RD- ALIGN ED TUPLES

Approach #1: Perform Extra Reads →Execute two reads to load the appropriate parts

  • f the data word and reassemble them.

Approach #2: Random Reads →Read some unexpected combination of bytes assembled into a 64-bit word. Approach #3: Reject →Throw an exception

15

Source: Levente Kurusa

slide-24
SLIDE 24 CMU 15-721 (Spring 2019)

CREATE TABLE AndySux ( id INT PRIMARY KEY, cdate TIMESTAMP, color CHAR(2), zipcode INT );

WO RD- ALIGN M EN T: PADDIN G

Add empty bits after attributes to ensure that tuple is word aligned.

16

64-bit Word 64-bit Word 64-bit Word 64-bit Word id cdate zipc

00000000 00000000 00000000 00000000 0000 0000 0000 0000

char[]

32-bits 64-bits 16-bits 32-bits c

slide-25
SLIDE 25 CMU 15-721 (Spring 2019)

CREATE TABLE AndySux ( id INT PRIMARY KEY, cdate TIMESTAMP, color CHAR(2), zipcode INT );

WO RD- ALIGN M EN T: REO RDERIN G

Switch the order of attributes in the tuples' physical layout to make sure they are aligned.

→ May still have to use padding.

17

64-bit Word 64-bit Word 64-bit Word 64-bit Word id cdate zipc

000000000000 000000000000 000000000000 000000000000

char[]

32-bits 64-bits 16-bits 32-bits c

slide-26
SLIDE 26 CMU 15-721 (Spring 2019)

CM U- DB ALIGN M EN T EXPERIM EN T

18

  • Avg. Throughput

No Alignment 0.523 MB/sec Optimization #1 11.7 MB/sec Optimization #2 814.8 MB/sec

Processor: 1 socket, 4 cores w/ 2×HT Workload: Insert Microbenchmark

Source: Tianyu Li

slide-27
SLIDE 27 CMU 15-721 (Spring 2019)

STO RAGE M O DELS

N-ary Storage Model (NSM) Decomposition Storage Model (DSM) Hybrid Storage Model

19

COLUMN- STORES VS. ROW- STORES: HOW DIFFERENT ARE THEY REALLY?

SIGMOD 2008

slide-28
SLIDE 28 CMU 15-721 (Spring 2019)

STO RAGE M O DELS

N-ary Storage Model (NSM) Decomposition Storage Model (DSM) Hybrid Storage Model

19

COLUMN- STORES VS. ROW- STORES: HOW DIFFERENT ARE THEY REALLY?

SIGMOD 2008

slide-29
SLIDE 29 CMU 15-721 (Spring 2019)

N- ARY STO RAGE M O DEL (N SM )

The DBMS stores all of the attributes for a single tuple contiguously. Ideal for OLTP workloads where txns tend to

  • perate only on an individual entity and insert-

heavy workloads. Use the tuple-at-a-time iterator model.

20

slide-30
SLIDE 30 CMU 15-721 (Spring 2019)

N SM : PH YSICAL STO RAGE

Choice #1: Heap-Organized Tables

→ Tuples are stored in blocks called a heap. → The heap does not necessarily define an order.

Choice #2: Index-Organized Tables

→ Tuples are stored in the primary key index itself. → Not quite the same as a clustered index.

21

slide-31
SLIDE 31 CMU 15-721 (Spring 2019)

N- ARY STO RAGE M O DEL (N SM )

Advantages

→ Fast inserts, updates, and deletes. → Good for queries that need the entire tuple. → Can use index-oriented physical storage.

Disadvantages

→ Not good for scanning large portions of the table and/or a subset of the attributes.

22

slide-32
SLIDE 32 CMU 15-721 (Spring 2019)

DECO M PO SITIO N STO RAGE M O DEL (DSM )

The DBMS stores a single attribute for all tuples contiguously in a block of data. Ideal for OLAP workloads where read-only queries perform large scans over a subset of the table’s attributes. Use the vector-at-a-time iterator model.

23

slide-33
SLIDE 33 CMU 15-721 (Spring 2019)

DECO M PO SITIO N STO RAGE M O DEL (DSM )

1970s: Cantor DBMS 1980s: DSM Proposal 1990s: SybaseIQ (in-memory only) 2000s: Vertica, Vectorwise, MonetDB 2010s: “The Big Three” Cloudera Impala, Amazon Redshift, SAP HANA, MemSQL, Clickhouse, LinkedIn Pinot, and most OLAP systems

24

slide-34
SLIDE 34 CMU 15-721 (Spring 2019)

DSM : TUPLE IDEN TIFICATIO N

Choice #1: Fixed-length Offsets

→ Each value is the same length for an attribute.

Choice #2: Embedded Tuple Ids

→ Each value is stored with its tuple id in a column.

25

Offsets

1 2 3

A B C D

Embedded Ids

A

1 2 3

B

1 2 3

C

1 2 3

D

1 2 3

slide-35
SLIDE 35 CMU 15-721 (Spring 2019)

DSM : Q UERY PRO CESSIN G

Late Materialization Columnar Compression Block/Vectorized Processing Model

26

slide-36
SLIDE 36 CMU 15-721 (Spring 2019)

DECO M PO SITIO N STO RAGE M O DEL (DSM )

Advantages

→ Reduces the amount wasted work because the DBMS

  • nly reads the data that it needs.

→ Better compression.

Disadvantages

→ Slow for point queries, inserts, updates, and deletes because of tuple splitting/stitching.

27

slide-37
SLIDE 37 CMU 15-721 (Spring 2019)

O BSERVATIO N

Data is “hot” when first entered into database

→ A newly inserted tuple is more likely to be updated again the near future.

As a tuple ages, it is updated less frequently.

→ At some point, a tuple is only accessed in read-only queries along with other tuples.

What if we want to use this data to make decisions that affect new txns?

28

slide-38
SLIDE 38 CMU 15-721 (Spring 2019)

H YBRID STO RAGE M O DEL

Single logical database instance that uses different storage models for hot and cold data. Store new data in NSM for fast OLTP Migrate data to DSM for more efficient OLAP

29

slide-39
SLIDE 39 CMU 15-721 (Spring 2019)

H YBRID STO RAGE M O DEL

Choice #1: Separate Execution Engines

→ Use separate execution engines that are optimized for either NSM or DSM databases.

Choice #2: Single, Flexible Architecture

→ Use single execution engine that is able to efficiently

  • perate on both NSM and DSM databases.

30

slide-40
SLIDE 40 CMU 15-721 (Spring 2019)

SEPARATE EXECUTIO N EN GIN ES

Run separate “internal” DBMSs that each only

  • perate on DSM or NSM data.

→ Need to combine query results from both engines to appear as a single logical database to the application. → Have to use a synchronization method (e.g., 2PC) if a txn spans execution engines.

Two approaches to do this:

→ Fractured Mirrors (Oracle, IBM) → Delta Store (SAP HANA)

31

slide-41
SLIDE 41 CMU 15-721 (Spring 2019)

FRACTURED M IRRO RS

Store a second copy of the database in a DSM layout that is automatically updated.

→ All updates are first entered in NSM then eventually copied into DSM mirror.

32

A CASE FOR FRACTURED M MIRRORS

VLDB 2002

NSM (Primary) DSM (Mirror)

Transactions Analytical Queries

slide-42
SLIDE 42 CMU 15-721 (Spring 2019)

DELTA STO RE

Stage updates to the database in an NSM table. A background thread migrates updates from delta store and applies them to DSM data.

33

NSM Delta Store DSM Historical Data

Transactions

slide-43
SLIDE 43 CMU 15-721 (Spring 2019)

CATEGO RIZIN G DATA

Choice #1: Manual Approach

→ DBA specifies what tables should be stored as DSM.

Choice #2: Off-line Approach

→ DBMS monitors access logs offline and then makes decision about what data to move to DSM.

Choice #3: On-line Approach

→ DBMS tracks access patterns at runtime and then makes decision about what data to move to DSM.

34

slide-44
SLIDE 44 CMU 15-721 (Spring 2019)

PELOTO N ADAPTIVE STO RAGE

Employ a single execution engine architecture that is able to operate on both NSM and DSM data.

→ Don’t need to store two copies of the database. → Don’t need to sync multiple database segments.

Note that a DBMS can still use the delta-store approach with this single-engine architecture.

35

BRIDGING THE ARCHIPELAGO BETWEEN ROW- STORES AND COLUMN- STORES FOR HYBRID WORKLOADS

SIGMOD 2016

slide-45
SLIDE 45 CMU 15-721 (Spring 2019)

PELOTO N ADAPTIVE STO RAGE

36

Original Data Adapted Data

SELECT AVG(B) FROM AndySux WHERE C = “yyy” UPDATE AndySux SET A = 123, B = 456, C = 789 WHERE D = “xxx”

A B C D A B C D A B C D

Cold Hot

slide-46
SLIDE 46 CMU 15-721 (Spring 2019)

TILE ARCH ITECTURE

Introduce an indirection layer that abstracts the true layout of tuples from query operators.

37

Tile Group A Tile Group B

A B C D

Tile #1 Tile #2 Tile #3 Tile #4

slide-47
SLIDE 47 CMU 15-721 (Spring 2019)

TILE ARCH ITECTURE

Introduce an indirection layer that abstracts the true layout of tuples from query operators.

37

A B C D

Tile #1 Tile #2 Tile #3 Tile #4

H

+ + + + +

Tile Group Header

slide-48
SLIDE 48 CMU 15-721 (Spring 2019)

AS AS

γ

s

TILE ARCH ITECTURE

Introduce an indirection layer that abstracts the true layout of tuples from query operators.

37

A B C D H

+ + + + +

SELECT AVG(B) FROM AndySux WHERE C = “yyy”

1 2

B

1 2 3

slide-49
SLIDE 49 CMU 15-721 (Spring 2019)

AS AS

γ

s

TILE ARCH ITECTURE

Introduce an indirection layer that abstracts the true layout of tuples from query operators.

37

A B C D H

+ + + + +

SELECT AVG(B) FROM AndySux WHERE C = “yyy”

1 2

B

1 2 3

slide-50
SLIDE 50 CMU 15-721 (Spring 2019)

PELOTO N ADAPTIVE STO RAGE

38

400 800 1200 1600

Row Layout Column Layout Adaptive Layout Sep-15

Scan Insert Scan Insert Scan Insert Scan Insert Scan Insert Scan Insert

Execution Time (ms)

Sep-16 Sep-17 Sep-18 Sep-19 Sep-20

slide-51
SLIDE 51 CMU 15-721 (Spring 2019)

SYSTEM CATALO GS

Almost every DBMS stores their a database's catalog in itself.

→ Wrap object abstraction around tuples. → Specialized code for "bootstrapping" catalog tables.

The entire DBMS should be aware of transactions in order to automatically provide ACID guarantees for DDL commands and concurrent txns.

39

slide-52
SLIDE 52 CMU 15-721 (Spring 2019)

SCH EM A CH AN GES

ADD COLUMN:

→ NSM: Copy tuples into new region in memory. → DSM: Just create the new column segment

DROP COLUMN:

→ NSM #1: Copy tuples into new region of memory. → NSM #2: Mark column as "deprecated", clean up later. → DSM: Just drop the column and free memory.

CHANGE COLUMN:

→ Check whether the conversion is allowed to happen. Depends on default values.

40

slide-53
SLIDE 53 CMU 15-721 (Spring 2019)

IN DEXES

CREATE INDEX:

→ Scan the entire table and populate the index. → Have to record changes made by txns that modified the table while another txn was building the index. → When the scan completes, lock the table and resolve changes that were missed after the scan started.

DROP INDEX:

→ Just drop the index logically from the catalog. → It only becomes "invisible" when the txn that dropped it

  • commits. All existing txns will still have to update it.

41

slide-54
SLIDE 54 CMU 15-721 (Spring 2019)

SEQ UEN CES

Typically stored in the catalog. Used for maintaining a global counter

→ Also called "auto-increment" or "serial" keys

Sequences are not maintained with the same isolation protection as regular catalog entries.

→ Rolling back a txn that incremented a sequence does not rollback the change to that sequence. → All INSERT queries would incur write-write conflicts.

42

slide-55
SLIDE 55 CMU 15-721 (Spring 2019)

PARTIN G TH O UGH TS

We abandoned the hybrid storage model

→ Significant engineering overhead. → Delta version storage + column store is almost equivalent.

Catalogs are hard.

43