ADVANCED DATABASE SYSTEMS: Databases on New Hardware (@Andy_Pavlo, 15-721, Spring 2020)


SLIDE 1

ADVANCED DATABASE SYSTEMS
Lecture #25: Databases on New Hardware
@Andy_Pavlo // 15-721 // Spring 2020

SLIDE 2

ADMINISTRIVIA

April 29: Guest Speaker (Live)
May 4: Code Review #2 Submission
May 5: Final Presentations (Live)
May 13: Final Exam Due Date
May 16: Hack-a-Thon (Extra Credit, Optional)

SLIDE 3

ADMINISTRIVIA

Course Evaluation
→ Please tell me what you really think of me.
→ I take your feedback into consideration.
→ Take revenge on next year's students.

https://cmu.smartevals.com/

SLIDE 4

DATABASE HARDWARE

People have been thinking about using hardware to accelerate DBMSs for decades:
→ 1980s: Database Machines
→ 2000s: FPGAs + Appliances
→ 2010s: FPGAs + GPUs
→ 2020s: PM + FPGAs + GPUs + CSAs + More!

DATABASE MACHINES: AN IDEA WHOSE TIME HAS PASSED? A CRITIQUE OF THE FUTURE OF DATABASE MACHINES
UNIVERSITY OF WISCONSIN 1983

SLIDE 5

Persistent Memory
GPU Acceleration
Hardware Transactional Memory

SLIDE 6

PERSISTENT MEMORY

An emerging storage technology that provides low-latency reads/writes like DRAM, but with persistent writes and large capacities like SSDs.
→ aka Storage-Class Memory, Non-Volatile Memory

The first devices are block-addressable (NVMe); later devices are byte-addressable.

SLIDE 7

FUNDAMENTAL ELEMENTS OF CIRCUITS

Capacitor (1745)
Resistor (1827)
Inductor (1831)

SLIDE 8

FUNDAMENTAL ELEMENTS OF CIRCUITS

In 1971, Leon Chua at Berkeley predicted the existence of a fourth fundamental element: a two-terminal device whose resistance depends on the voltage applied to it, but when that voltage is turned off it permanently remembers its last resistive state.

TWO CENTURIES OF MEMRISTORS
NATURE MATERIALS 2012

SLIDE 9

FUNDAMENTAL ELEMENTS OF CIRCUITS

Capacitor (1745)
Resistor (1827)
Inductor (1831)
Memristor (1971)

SLIDE 10

MEMRISTORS

A team at HP Labs led by Stanley Williams stumbled upon a nano-device with weird properties that they could not understand. It wasn't until they found Chua's 1971 paper that they realized what they had invented.

HOW WE FOUND THE MISSING MEMRISTOR
IEEE SPECTRUM 2008

SLIDE 11

[Image slide. Source: Luke Kilpatrick]

SLIDE 12

TECHNOLOGIES

Phase-Change Memory (PRAM)
Resistive RAM (ReRAM)
Magnetoresistive RAM (MRAM)

SLIDE 13

PHASE-CHANGE MEMORY

The storage cell is comprised of two metal electrodes separated by a resistive heater and the phase-change material (chalcogenide). The value of the cell is changed based on how the material is heated:
→ A short pulse changes the cell to a '0'.
→ A long, gradual pulse changes the cell to a '1'.

PHASE CHANGE MEMORY ARCHITECTURE AND THE QUEST FOR SCALABILITY
COMMUNICATIONS OF THE ACM 2010

[Diagram: PCM cell with bitline, access device, resistive heater, and chalcogenide layer.]

SLIDE 14

RESISTIVE RAM

Two metal layers with two TiO2 layers in between. Running a current in one direction moves electrons from the top TiO2 layer to the bottom, thereby changing the resistance. A potential programmable storage fabric...
→ Bertrand Russell's Material Implication Logic

HOW WE FOUND THE MISSING MEMRISTOR
IEEE SPECTRUM 2008

[Diagram: ReRAM cell with platinum electrodes sandwiching a TiO2 layer and an oxygen-deficient TiO2-x layer.]

SLIDE 15

MAGNETORESISTIVE RAM

Stores data using magnetic storage elements instead of electric charge or current flows. Spin-Transfer Torque (STT-MRAM) is the leading technology for this type of PM.
→ Supposedly able to scale to very small sizes (10nm) and achieve SRAM latencies.

SPIN MEMORY SHOWS ITS MIGHT
IEEE SPECTRUM 2014

[Diagram: MRAM cell with a fixed ferromagnetic layer, an oxide barrier, and a free ferromagnetic layer.]

SLIDE 16

WHY THIS IS FOR REAL

Industry has agreed on standard technologies and form factors (JEDEC). Linux and Microsoft have added support for PM in their kernels (DAX). Intel added new instructions for flushing cache lines to PM (CLFLUSH, CLWB).
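To make the DAX point concrete, here is a minimal sketch of how a DBMS might map PM directly into its address space on Linux. This is not from the lecture; it assumes a file on a DAX-mounted filesystem, a glibc that exposes MAP_SYNC (Linux 4.15+), and the hypothetical helper name map_pm.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Map a file on a DAX filesystem so loads/stores go directly to
     * PM, bypassing the OS page cache. */
    void *map_pm(const char *path, size_t len) {
        int fd = open(path, O_RDWR);
        if (fd < 0) return NULL;
        /* MAP_SYNC makes a flushed store durable without an explicit
         * msync(); it is only valid on DAX mappings. */
        void *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
        close(fd);
        return (base == MAP_FAILED) ? NULL : base;
    }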


SLIDE 18

PM CONFIGURATIONS

[Diagram: two deployment options. (1) PM next to DRAM: the DBMS address space spans both DRAM and PM through the virtual memory subsystem. (2) DRAM as a hardware-managed cache in front of PM.]

Source: Ismail Oukid

SLIDE 19

PM FOR DATABASE SYSTEMS

Block-addressable PM is not that interesting. Byte-addressable PM will be a game changer but will require some work to use correctly.
→ In-memory DBMSs will be better positioned to use byte-addressable PM.
→ Disk-oriented DBMSs will initially treat PM as just a faster SSD.

SLIDE 20

STORAGE & RECOVERY METHODS

Understand how a DBMS will behave on a system that only has byte-addressable PM. Develop PM-optimized implementations of standard DBMS architectures. Based on the N-Store prototype DBMS.

LET'S TALK ABOUT STORAGE & RECOVERY METHODS FOR NON-VOLATILE MEMORY DATABASE SYSTEMS
SIGMOD 2015

SLIDE 21

SYNCHRONIZATION

Existing programming models assume that memory is volatile: the CPU decides when to move data from the caches to DRAM. The DBMS therefore needs a way to ensure that data is flushed from the caches to PM.

[Diagram: a STORE followed by CLWB moves the cache line from the L1/L2 caches to the memory controller, whose ADR domain guarantees durability on power failure.]
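A minimal sketch of that flush path using Intel's intrinsics (compile with -mclwb; persist_u64 is a hypothetical helper name, not from the lecture):

    #include <immintrin.h>
    #include <stdint.h>

    /* Make a single store durable on PM: store, write back the cache
     * line (CLWB), then fence so the write-back completes before the
     * caller proceeds. */
    void persist_u64(uint64_t *dst, uint64_t val) {
        *dst = val;             /* store into the PM-mapped region */
        _mm_clwb(dst);          /* write back the dirty cache line */
        _mm_sfence();           /* order/complete the write-back   */
    }

Unlike CLFLUSH, CLWB leaves the line in the cache, so a subsequent read does not miss.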

SLIDE 22

NAMING

If the DBMS process restarts, we need to make sure that all the pointers for in-memory data point to the same data.

[Diagram: an index pointing to tuples (#00, #01, #02) in the table heap; after a restart, raw virtual-memory pointers may no longer reference the same tuples (e.g., Tuple #00 (v2)).]
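One common workaround, shown here as a hedged sketch rather than this paper's mechanism, is to store base-relative offsets in PM instead of raw pointers:

    #include <stdint.h>

    typedef uint64_t pm_off_t;  /* persistent "pointer" = byte offset */

    /* Convert between raw pointers and base-relative offsets so that
     * references stored in PM stay valid even if the region is
     * remapped at a different address after a restart. */
    static inline pm_off_t to_off(void *base, void *p) {
        return (pm_off_t)((char *)p - (char *)base);
    }

    static inline void *to_ptr(void *base, pm_off_t off) {
        return (char *)base + off;
    }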


SLIDE 25

PM-AWARE MEMORY ALLOCATOR

Feature #1: Synchronization
→ The allocator writes back CPU cache lines to PM using the CLFLUSH instruction.
→ It then issues a SFENCE instruction to wait for the data to become durable on PM.

Feature #2: Naming
→ The allocator ensures that the virtual memory addresses assigned to a memory-mapped region never change, even after the OS or DBMS restarts.
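A sketch of how Feature #2 might be achieved on Linux. Assumptions: PM_BASE is a canonical base address chosen by convention (hypothetical value below), and the kernel supports MAP_FIXED_NOREPLACE (Linux 4.17+).

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>

    #define PM_BASE ((void *)0x600000000000ULL)  /* assumed canonical base */

    /* Remap the PM region at the same virtual address on every restart
     * so pointers stored inside the region remain valid.
     * MAP_FIXED_NOREPLACE fails rather than silently clobbering an
     * existing mapping at that address. */
    void *map_pm_fixed(int fd, size_t len) {
        void *p = mmap(PM_BASE, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_FIXED_NOREPLACE, fd, 0);
        return (p == MAP_FAILED) ? NULL : p;
    }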

SLIDE 26

DBMS ENGINE ARCHITECTURES

Choice #1: In-Place Updates
→ Table heap with a write-ahead log + snapshots.
→ Example: VoltDB

Choice #2: Copy-on-Write
→ Create a shadow copy of the table when updated.
→ No write-ahead log.
→ Example: LMDB

Choice #3: Log-Structured
→ All writes are appended to a log. No table heap.
→ Example: RocksDB


SLIDE 30

IN-PLACE UPDATES ENGINE

[Diagram: (1) the tuple delta is appended to the write-ahead log on durable storage, (2) the tuple (Tuple #01) is updated in place in the in-memory table heap, (3) snapshots are periodically written to durable storage. Drawbacks: duplicate data and recovery latency.]
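For contrast with the PM variants below, a minimal sketch of the classic ordering rule this engine follows: the log record must be durable before the in-place update. The helper name wal_update and the fixed-size record are illustrative, not from the paper.

    #include <string.h>
    #include <unistd.h>

    typedef struct { long tuple_id; char delta[64]; } log_rec;

    /* Classic WAL discipline: persist the delta record first, then
     * apply the update in place in the heap. */
    void wal_update(int log_fd, long id, const char *delta,
                    char *heap_slot) {
        log_rec r = { .tuple_id = id };   /* rest is zero-initialized */
        strncpy(r.delta, delta, sizeof r.delta - 1);
        write(log_fd, &r, sizeof r);      /* 1. append delta to log   */
        fsync(log_fd);                    /* ... force it to storage  */
        memcpy(heap_slot, r.delta, sizeof r.delta); /* 2. update heap */
    }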

SLIDE 31

PM-OPTIMIZED ARCHITECTURES

Leverage the allocator's non-volatile pointers to only record what changed rather than how it changed. The DBMS must only maintain a transient UNDO log for a txn until it commits.
→ Dirty cache lines from an uncommitted txn can be flushed by hardware to the memory controller.
→ No REDO log, because we flush all the changes to PM at the time of commit.
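A hedged sketch of that transient-undo pattern (single-field txn for brevity; the struct and function names are hypothetical, and the undo record lives only in DRAM):

    #include <immintrin.h>
    #include <stdint.h>

    typedef struct { uint64_t *addr; uint64_t old; } undo_rec;  /* DRAM only */

    /* Update in place on PM, keeping the pre-image in a transient
     * DRAM undo record until the txn commits. */
    void txn_update(undo_rec *u, uint64_t *pm_field, uint64_t val) {
        u->addr = pm_field;
        u->old  = *pm_field;
        *pm_field = val;       /* dirty line may be flushed by hardware */
    }

    void txn_commit(uint64_t *pm_field) {
        _mm_clwb(pm_field);    /* flush all changes at commit: no REDO  */
        _mm_sfence();          /* durable; the undo record can be freed */
    }

    void txn_abort(undo_rec *u) {
        *u->addr = u->old;     /* restore the pre-image, then persist   */
        _mm_clwb(u->addr);
        _mm_sfence();
    }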


SLIDE 33

PM IN-PLACE UPDATES ENGINE

[Diagram: (1) the write-ahead log on PM records only tuple pointers, (2) the tuple is updated in place in the PM table heap (Tuple #01 → Tuple #01 (!)).]


SLIDE 38

COPY-ON-WRITE ENGINE

[Diagram: the master record points to the current directory over Leaf 1 / Leaf 2 (Page #00, Page #01). (1) An update creates a copied "Updated Leaf 1" page, (2) a dirty directory references the new page, (3) the master record is switched to the dirty directory. Drawback: expensive copies.]
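The commit step in a copy-on-write engine reduces to one atomic pointer swap. A minimal sketch (hypothetical names; the directory type is elided):

    #include <stdatomic.h>

    typedef struct dir dir;        /* page directory; definition elided */
    _Atomic(dir *) master_record;  /* root of the current tree          */

    /* Commit = a single atomic swap of the master record to the dirty
     * directory; readers see the old tree or the new one, never a mix. */
    void cow_commit(dir *dirty_directory) {
        atomic_store(&master_record, dirty_directory);
    }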


SLIDE 41

PM COPY-ON-WRITE ENGINE

[Diagram: (1) the updated leaf copies only tuple pointers rather than the tuples themselves (Tuple #00 → Tuple #00 (!)), (2) a dirty directory references the updated leaf, (3) the master record switches to it.]


SLIDE 44

LOG-STRUCTURED ENGINE

[Diagram: (1) tuple deltas are appended to the write-ahead log on durable storage, (2) updates go into an in-memory MemTable with a Bloom filter, (3) the MemTable is flushed to an SSTable containing the tuple data. Drawbacks: duplicate data and compactions.]


SLIDE 47

PM LOG-STRUCTURED ENGINE

[Diagram: only the PM-resident MemTable and write-ahead log remain; (1) tuple deltas are appended, with no SSTable flushes or compactions.]

SLIDE 48

OBSERVATION

The WAL serves two purposes:
→ Transform random writes into sequential log writes.
→ Support transaction rollback.
This design makes sense for disks with slow random writes.

But PM supports fast random writes:
→ Directly write data to the multi-versioned database.
→ Only record metadata about committed txns in the log.

SLIDE 49

WRITE-BEHIND LOGGING

A PM-centric logging protocol that provides instant recovery and minimal duplication overhead.
→ Directly propagate changes to the database.
→ Only record metadata in the log.

Recover the database almost instantaneously.
→ Need to record metadata about in-flight transactions.
→ In case of failure, ignore their effects.

WRITE-BEHIND LOGGING
VLDB 2017
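To make the ordering concrete, here is a sketch of a write-behind group commit. This is not the paper's pseudocode: it reuses the CLWB/SFENCE pattern from earlier, and wbl_group_commit, the record layout, and the dirty-address list are all hypothetical.

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct { uint64_t cts_lo, cts_hi; } wbl_record;

    /* Write-behind commit for one group: tuple updates were already
     * made in place on PM; flush them first, then persist only the
     * metadata record describing the group's commit-timestamp range. */
    void wbl_group_commit(void **dirty, size_t n, wbl_record *log_tail,
                          uint64_t lo, uint64_t hi) {
        for (size_t i = 0; i < n; i++)
            _mm_clwb(dirty[i]);    /* flush updated tuples to PM      */
        _mm_sfence();              /* all data durable before logging */

        log_tail->cts_lo = lo;     /* then log just the metadata      */
        log_tail->cts_hi = hi;
        _mm_clwb(log_tail);
        _mm_sfence();
    }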


SLIDE 51

WRITE-BEHIND LOGGING

[Diagram: for UPDATE table SET val=ABC WHERE id=123, the updated tuple is written to the DRAM and PM table heaps (steps 1-2), then only metadata is appended to the PM log (step 3).]

SLIDE 52

WRITE-BEHIND LOGGING

The DBMS assigns timestamps to transactions:
→ Txns get timestamps within the same group-commit timestamp range, so the DBMS can identify and ignore the effects of in-flight txns.

Use the failed group-commit timestamp range:
→ The DBMS uses the range during tuple-visibility checks.
→ It ignores tuples created or updated within this range.
→ UNDO is implicitly done via visibility checks.
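A minimal sketch of that implicit UNDO (hypothetical names; real systems also consult the usual MVCC begin/end timestamps):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct { uint64_t lo, hi; } ts_range;

    /* A tuple version is invisible if its commit timestamp falls
     * inside a failed group-commit range; no physical rollback runs. */
    bool is_visible(uint64_t tuple_ts,
                    const ts_range *failed, size_t n_failed) {
        for (size_t i = 0; i < n_failed; i++)
            if (tuple_ts > failed[i].lo && tuple_ts < failed[i].hi)
                return false;  /* written by an in-flight txn that died */
        return true;
    }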

SLIDE 53

WRITE-BEHIND LOGGING

Recovery consists of only an analysis phase:
→ The DBMS can immediately start processing transactions after a restart, without explicit UNDO/REDO phases.

Garbage collection eventually kicks in to remove the physical versions of uncommitted transactions.
→ Uses the timestamp-range information in the write-behind log.
→ After this finishes, there is no need to do extra visibility checks.

SLIDE 54

METADATA FOR INSTANT RECOVERY

Use the group-commit timestamp range to ignore the effects of transactions in a failed group commit.
→ Maintain a list of failed timestamp ranges.

[Diagram: group-commit ranges (T1, T2), (T2, T3), (T3, T4), (T4, T5) advance over time; the current range plus the list of failed ranges (e.g., (T1, T2)) drive the visibility checks.]

SLIDE 55

WRITE-BEHIND LOGGING RECOVERY

[Chart: recovery time (sec, log scale) replaying a log with 1M TPC-C transactions on HDD, SSD, and PM (PM at 2× latency relative to DRAM). Write-behind logging recovers ~1000× faster than write-ahead logging on all three devices.]

SLIDE 56

WRITE-BEHIND LOGGING RUNTIME

[Chart: TPC-C throughput (txn/sec, log scale; eight warehouses) on HDD, SSD, and PM (PM at 2× latency relative to DRAM). Write-behind logging is ~10× slower than write-ahead logging on HDD and SSD, but ~1.2× faster on PM.]

SLIDE 57

PM SUMMARY

Storage Optimizations
→ Leverage byte-addressability to avoid unnecessary data duplication.

Recovery Optimizations
→ PM-optimized recovery protocols avoid the overhead of processing a log.
→ Non-volatile data structures ensure consistency.

SLIDE 58

GPU ACCELERATION

GPUs excel at performing (relatively simple) repetitive operations on large amounts of data over multiple streams of data.

Target operations that do not require blocking for input or branches:
→ Good: Sequential scans with predicates
→ Bad: B+Tree index probes

AFAIK, GPU memory is not cache coherent with CPU memory.


SLIDE 61

GPU ACCELERATION

[Diagram: interconnect bandwidths between CPU memory and GPUs: DDR4 (~40 GB/s), PCIe bus (~16 GB/s), NVLink (~25 GB/s).]

SLIDE 62

GPU ACCELERATION

Choice #1: Entire Database
→ Store the database in the GPU(s)' VRAM.
→ All queries perform massively parallel sequential scans.

Choice #2: Important Columns
→ Return the offsets of records that match the portion of the query that accesses GPU-resident columns.
→ Must materialize full results on the CPU.

Choice #3: Streaming
→ Transfer data from the CPU to the GPU on the fly.


SLIDE 64

HARDWARE TRANSACTIONAL MEMORY

Create critical sections in software that are managed by hardware.
→ Leverages the same cache-coherency protocol to detect transaction conflicts.
→ Intel x86: Transactional Synchronization Extensions (TSX)

The read/write set of a transaction must fit in the L1 cache.
→ This means that it is not useful for general-purpose txns.
→ It can be used to create latch-free indexes.

TO LOCK, SWAP OR ELIDE: ON THE INTERPLAY OF HARDWARE TRANSACTIONAL MEMORY AND LOCK-FREE INDEXING
VLDB 2015

SLIDE 65

HTM PROGRAMMING MODEL

Hardware Lock Elision (HLE)
→ Optimistically execute a critical section by eliding the write to a lock so that the lock appears to be free to other threads.
→ If there is a conflict, re-execute the code but take the locks the second time.

Restricted Transactional Memory (RTM)
→ Like HLE, but with an optional fallback codepath that the CPU jumps to if the txn aborts.
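A hedged sketch of RTM-style elision using Intel's TSX intrinsics (compile with -mrtm; the spinlock fallback and function names are illustrative):

    #include <immintrin.h>

    volatile int fallback_lock = 0;         /* 0 = free, 1 = held */

    static void lock_slow(void) {
        while (__sync_lock_test_and_set(&fallback_lock, 1))
            while (fallback_lock) _mm_pause();
    }

    /* Run the critical section as a hardware txn; on abort, take the
     * real lock. Reading the lock inside the txn puts it in the read
     * set, so a fallback-path writer forces speculating threads to
     * abort rather than race. */
    void critical_section(void (*body)(void)) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            if (fallback_lock) _xabort(0xff);   /* lock held: bail out  */
            body();                             /* speculative execution */
            _xend();                            /* atomic commit         */
        } else {
            lock_slow();                        /* aborted: take lock    */
            body();
            __sync_lock_release(&fallback_lock);
        }
    }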

SLIDE 66

HTM LATCH ELISION

[Diagram: inserting key 25 into a B+Tree (root A; inner nodes; leaves holding keys 6, 10, 12, 20, 23, 35, 38, 44). The latch-crabbing traversal is elided inside a hardware transaction; a conflicting access by another thread aborts it.]

TSX-START {
  LATCH A → Read A
  LATCH C → UNLATCH A → Read C
  LATCH F → UNLATCH C
} TSX-COMMIT
Insert 25
UNLATCH F


SLIDE 68

PARTING THOUGHTS

Byte-addressable PM is going to be a game changer when it comes out. We are likely to see many new computational components that DBMSs can use in the next decade.
→ The core ideas / algorithms will still be the same.

SLIDE 69

FINAL PARTING THOUGHTS

You are now aware of the major topics involved in building a modern, single-node DBMS. You have a foundation for reasoning about these systems, so you can discern whether claims are legitimate or marketing hype.


SLIDE 72

NEXT CLASS