SLIDE 1

DATABASE SYSTEM IMPLEMENTATION

GT 4420/6422 // SPRING 2019 // @JOY_ARULRAJ LECTURE #8: DATABASES ON NEW HARDWARE

SLIDE 2

ANATOMY OF A DATABASE SYSTEM

Process Manager
→ Connection Manager + Admission Control

Query Processor
→ Query Parser, Query Optimizer, Query Executor

Transactional Storage Manager
→ Lock Manager (Concurrency Control), Access Methods (or Indexes), Buffer Pool Manager, Log Manager

Shared Utilities
→ Memory Manager + Disk Manager, Networking Manager

Source: Anatomy of a Database System
SLIDE 3

DATABASE HARDWARE

People have been thinking about using hardware to accelerate DBMSs for decades.

DATABASE MACHINES: AN IDEA WHOSE TIME HAS PASSED? A CRITIQUE OF THE FUTURE OF DATABASE MACHINES (University of Wisconsin, 1983)

SLIDE 4

DATABASE HARDWARE

1980s: Database Machines

→ Computational accelerators for efficient query processing.
→ Embedded CPUs in active disks push down query predicates to reduce the amount of data subsequently processed by the main CPU.
→ Did not become mainstream because commodity hardware improved quickly, thereby reducing the utility of custom hardware.

DATABASE MACHINES: AN IDEA WHOSE TIME HAS PASSED? A CRITIQUE OF THE FUTURE OF DATABASE MACHINES (University of Wisconsin, 1983)

SLIDE 5

DATABASE HARDWARE

Source: A Domain-Specific Architecture for Deep Neural Networks

SLIDE 6

DATABASE HARDWARE

1980s: RISC microprocessors enjoyed 20 years of rapid performance gains until 2004 as they rode atop both:

→ Moore's Law (2x more transistors at each new semiconductor process node), and
→ Dennard Scaling (2x faster at half the power consumption per transistor at each process node).

Then Dennard Scaling died and individual processors stopped getting faster.

SLIDE 7

DATABASE HARDWARE

2000s: Industry compensated by relying solely on Moore's Law.

→ Rapidly increased the number of processors on one chip, ushering in the multicore era.

SLIDE 8

DATABASE HARDWARE

2010s: The multicore era lasted for 10 years until Amdahl's Law kicked in.

→ There's only so much exploitable parallelism in any given application.
→ Few applications can keep dozens of processors busy.
→ Then Moore's Law died.

SLIDE 9

AMDAHL’S LAW

Amdahl's Law bounds the speedup in latency of a task when a fraction of it is parallelized:

S_latency = 1 / ((1 - P) + P / s)

→ P = fraction of the task that can be parallelized
→ s = speedup of the parallelized fraction
→ For P = 0.9: S_latency = 1 / (0.1 + 0.9/s) < 10x, no matter how large s gets.

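To make the bound concrete, here is a minimal sketch (an illustrative helper, not from the slides) that evaluates Amdahl's Law for P = 0.9:

#include <cstdio>

// Amdahl's Law: overall speedup when a fraction p of the task
// is accelerated by a factor of s.
double amdahl(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}

int main() {
    // With p = 0.9 the speedup stays below 10x no matter how large s gets.
    for (double s : {2.0, 8.0, 64.0, 1024.0}) {
        std::printf("s = %7.0f -> speedup = %.2fx\n", s, amdahl(0.9, s));
    }
}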

SLIDE 10

DATABASE HARDWARE

→ CPUs are not getting any faster; they are constrained by the power wall (the chip's overall temperature and power consumption).
→ High-density DRAM DIMMs have a higher potential for failure in manufacturing. Lower DIMM yields increase DRAM prices.
→ Newer technologies: GPUs, NVM, InfiniBand

DATABASE MACHINES: AN IDEA WHOSE TIME HAS PASSED? A CRITIQUE OF THE FUTURE OF DATABASE MACHINES (University of Wisconsin, 1983)

SLIDE 11

TODAY’S AGENDA

Non-Volatile Memory
GPU Acceleration

SLIDE 12

NON-VOLATILE MEMORY

Emerging storage technology that provides low-latency reads/writes like DRAM, but with persistent writes and large capacities like SSDs.

→ aka Storage-Class Memory, Persistent Memory

First devices are block-addressable (NVMe); later devices will be byte-addressable.

SLIDES 13-16

FUNDAMENTAL ELEMENTS OF CIRCUITS

Capacitor (ca. 1745)
Resistor (ca. 1827)
Inductor (ca. 1831)

SLIDE 17

FUNDAMENTAL ELEMENTS OF CIRCUITS

In 1971, Leon Chua at Berkeley predicted the existence of a fourth fundamental element: a two-terminal device whose resistance depends on the voltage applied to it, but which permanently remembers its last resistive state when that voltage is turned off.

TWO CENTURIES OF MEMRISTORS (Nature Materials, 2012)

SLIDES 18-19

FUNDAMENTAL ELEMENTS OF CIRCUITS

Capacitor (ca. 1745)
Resistor (ca. 1827)
Inductor (ca. 1831)
Memristor (ca. 1971)

SLIDE 20

MEMRISTORS

In the 2000s, a team at HP Labs led by Stanley Williams stumbled upon a nano-device with strange properties that they could not understand. It wasn't until they found Chua's 1971 paper that they realized what they had invented.

HOW WE FOUND THE MISSING MEMRISTOR (IEEE Spectrum, 2008)

SLIDE 21

TECHNOLOGIES

Phase-Change Memory (PRAM)
Resistive RAM (ReRAM)
Magnetoresistive RAM (MRAM)

SLIDE 22

PHASE-CHANGE MEMORY

The storage cell is composed of two metal electrodes separated by a resistive heater and the phase-change material (a chalcogenide). The value of the cell is changed based on how the material is heated:

→ A short pulse changes the cell to a '0'.
→ A long, gradual pulse changes the cell to a '1'.

[Figure: PCM cell with bitline, access device, heater, and chalcogenide layer]
PHASE CHANGE MEMORY ARCHITECTURE AND THE QUEST FOR SCALABILITY (Communications of the ACM, 2010)

SLIDE 23

RESISTIVE RAM

Two metal layers with two TiO2 layers in between. Running a current in one direction moves electrons from the top TiO2 layer to the bottom, thereby changing the resistance. May become a programmable storage fabric…

→ Blurring the gap between storage and compute
→ Bertrand Russell's Material Implication Logic

[Figure: ReRAM cell with platinum electrodes, a TiO2 layer, and a TiO2-x layer]
HOW WE FOUND THE MISSING MEMRISTOR (IEEE Spectrum, 2008)

SLIDE 24

MAGNETORESISTIVE RAM

Stores data using magnetic storage elements instead of electric charge or current flows. Spin-Transfer Torque MRAM (STT-MRAM) is the leading technology for this type of NVM.

→ Supposedly able to scale to very small sizes (10nm) and achieve SRAM latencies.

[Figure: MRAM cell with a fixed ferromagnetic layer, an oxide layer, and a free ferromagnetic layer]
SPIN MEMORY SHOWS ITS MIGHT (IEEE Spectrum, 2014)

SLIDE 25

WHY THIS IS FOR REAL THIS TIME

Industry has agreed on standard technologies and form factors.
Linux and Microsoft have added support for NVM in their kernels (DAX).
Intel has added new instructions for flushing cache lines to NVM (CLFLUSH, CLWB).

SLIDE 26

NVM DIMM FORM FACTORS

NVDIMM-F (2015)

→ Flash only. Has to be paired with DRAM DIMM.

NVDIMM-N (2015)

→ Flash and DRAM together on the same DIMM. → Appears as volatile memory to the OS.

NVDIMM-P (2018)

→ True persistent memory. No DRAM or flash.

SLIDES 27-33

NVM CONFIGURATIONS

Configuration #1: DRAM as Hardware-Managed Cache
[Diagram: the DBMS address space sits on the virtual memory subsystem, with DRAM acting as a hardware-managed cache in front of NVM]

Configuration #2: NVM Next to DRAM
[Diagram: the DBMS address space sits on the virtual memory subsystem, with NVM and DRAM side by side as separate pools]

Configuration #3: NVM as Persistent Memory
[Diagram: the DBMS keeps a buffer pool in DRAM and reaches NVM through an NVM filesystem, alongside a disk filesystem]

Source: Ismail Oukid

SLIDE 34

NVM FOR DATABASE SYSTEMS

Block-addressable NVM is not that interesting. Byte-addressable NVM will be a game changer but will require some work to use correctly.

→ In-memory DBMSs will be better positioned to use byte-addressable NVM.
→ Disk-oriented DBMSs will initially treat NVM as just a faster SSD.

SLIDE 35

STORAGE & RECOVERY METHODS

Understand how a DBMS will behave on a system that only has byte-addressable NVM.
Develop NVM-optimized implementations of standard DBMS architectures.
Based on the N-Store prototype DBMS.

LET'S TALK ABOUT STORAGE & RECOVERY METHODS FOR NON-VOLATILE MEMORY DATABASE SYSTEMS (SIGMOD 2015)

SLIDE 36

SYNCHRONIZATION

Existing programming models assume that any write to memory is non-volatile.

→ CPU decides when to move data from caches to DRAM.

The DBMS needs a way to ensure that data is flushed from caches to NVM.

SLIDES 37-41

SYNCHRONIZATION (build)

[Diagram: a STORE first lands in the CPU's L1/L2 caches, with the memory controller between the caches and NVM. A CLWB instruction writes the dirty cache line back toward the memory controller; ADR (Asynchronous DRAM Refresh) guarantees that data that has reached the memory controller is persisted to NVM on power failure.]

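As a concrete illustration, here is a minimal sketch (not from the slides) of persisting one value to a memory-mapped NVM region, assuming an x86 CPU with CLWB support and that nvm_addr points into that region:

#include <immintrin.h>  // _mm_clwb, _mm_sfence (compile with -mclwb)
#include <cstdint>

// Persist a 64-bit value at an NVM-resident address.
// CLWB writes the cache line back without evicting it; SFENCE orders the
// write-back before any later stores. With ADR, data that reaches the
// memory controller is durable.
void persist_store(uint64_t* nvm_addr, uint64_t value) {
    *nvm_addr = value;     // store lands in the CPU cache
    _mm_clwb(nvm_addr);    // write the cache line back toward NVM
    _mm_sfence();          // order the write-back before subsequent stores
}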

SLIDES 42-46

NAMING

If the DBMS process restarts, we need to make sure that all of the pointers for in-memory data point to the same data.

[Diagram: an index pointing into a table heap holding Tuple #00-#02. After a restart, the OS may map the region at a different virtual address ("Tuple #00 (v2)"), so the raw pointers stored in the index and heap no longer point to valid data.]

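One common way to make such references restart-safe (a sketch under assumptions, not N-Store's actual implementation) is to store offsets from the mapped region's base instead of raw virtual addresses:

#include <cstdint>

// A restart-safe "pointer" into an NVM-mapped region: store the byte
// offset from the region's base rather than a raw virtual address. The
// offset stays valid even if the OS maps the region somewhere else
// after a restart.
struct PersistentPtr {
    uint64_t offset;  // byte offset within the NVM region

    template <typename T>
    T* resolve(char* region_base) const {
        return reinterpret_cast<T*>(region_base + offset);
    }
};

After a restart, the DBMS re-resolves every offset against wherever the region happened to be mapped.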

SLIDE 47

NVM-AWARE MEMORY ALLOCATOR

Feature #1: Synchronization

→ The allocator writes back CPU cache lines to NVM using the CLFLUSH instruction.
→ It then issues an SFENCE instruction to wait for the data to become durable on NVM.

Feature #2: Naming

→ The allocator ensures that virtual memory addresses assigned to a memory-mapped region never change, even after the OS or DBMS restarts.

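A minimal sketch of the naming feature on Linux (the file path and fixed base address are illustrative placeholders, not from the paper):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

// Map an NVM-backed file at the same virtual address on every start so
// that pointers stored inside the region stay valid across restarts.
// FIXED_BASE is an illustrative address; a real allocator must pick one
// that never collides with other mappings.
static char* const FIXED_BASE = reinterpret_cast<char*>(0x600000000000ULL);

void* map_nvm_region(const char* path, size_t size) {
    int fd = open(path, O_RDWR);
    if (fd < 0) return nullptr;
    // MAP_FIXED forces the mapping to FIXED_BASE (and silently replaces
    // any existing mapping there, which must be guarded against).
    void* base = mmap(FIXED_BASE, size, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_FIXED, fd, 0);
    close(fd);
    return (base == MAP_FAILED) ? nullptr : base;
}
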
SLIDE 48

DBMS ENGINE ARCHITECTURES

Choice #1: In-place Updates
→ Table heap with a write-ahead log + snapshots.
→ Example: VoltDB

Choice #2: Copy-on-Write
→ Create a shadow copy of the table when updated.
→ No write-ahead log.
→ Example: LMDB

Choice #3: Log-structured
→ All writes are appended to a log. No table heap.
→ Example: RocksDB

SLIDES 49-55

IN-PLACE UPDATES ENGINE (build)

[Diagram: an in-memory table heap (Tuple #00-#02) and in-memory index, with a write-ahead log and snapshots on durable storage. Steps: (1) append the tuple delta to the write-ahead log; (2) update Tuple #01 in place in the table heap; (3) the updated tuple is also written into the periodic snapshots.]

Problems highlighted: Duplicate Data, Recovery Latency.

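A minimal sketch of the write-ahead discipline the build implies (illustrative types, not VoltDB's actual code): the delta must be durable in the log before the heap is touched, so recovery can replay it.

#include <cstdio>
#include <cstdint>
#include <unistd.h>  // fsync

// In-place update with write-ahead logging:
// (1) append the (id, old, new) delta to the log,
// (2) force the record to durable storage,
// (3) only then update the tuple in the table heap.
struct Tuple { uint64_t id; uint64_t value; };

void update_in_place(std::FILE* wal, Tuple& slot, uint64_t new_value) {
    uint64_t record[3] = {slot.id, slot.value, new_value};
    std::fwrite(record, sizeof(record), 1, wal);  // (1) append delta
    std::fflush(wal);
    fsync(fileno(wal));                           // (2) make it durable
    slot.value = new_value;                       // (3) in-place heap update
}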

SLIDE 56

NVM-OPTIMIZED ARCHITECTURES

Leverage the allocator's non-volatile pointers to only record what changed rather than how it changed. The DBMS only has to maintain a transient UNDO log for a txn until it commits.

→ Dirty cache lines from an uncommitted txn can be flushed by hardware to the memory controller.
→ No REDO log because we flush all the changes to NVM at the time of commit.

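A sketch of that commit protocol under simplifying assumptions (a single 64-bit field per update, one undo slot per txn, persist() is the CLWB+SFENCE helper from the synchronization sketch, and the undo entry fits in one cache line):

#include <immintrin.h>
#include <cstdint>

// The undo entry records *what* changed (a pointer and the old value),
// not how it changed. It must be durable before the in-place NVM update,
// since hardware may flush uncommitted dirty lines at any time.
struct UndoEntry { uint64_t* addr; uint64_t old_value; };

void persist(const void* p) { _mm_clwb(p); _mm_sfence(); }

void txn_update(UndoEntry* undo, uint64_t* nvm_addr, uint64_t new_value) {
    *undo = {nvm_addr, *nvm_addr};  // record pointer + old value
    persist(undo);                  // undo durable before the update
    *nvm_addr = new_value;          // update the NVM tuple in place
}

void txn_commit(UndoEntry* undo) {
    persist(undo->addr);   // flush the change to NVM at commit; no REDO log
    undo->addr = nullptr;  // retire the short-lived undo entry
    persist(undo);
}

void txn_abort(UndoEntry* undo) {
    *undo->addr = undo->old_value;  // roll back the in-place change
    persist(undo->addr);
}
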
SLIDES 57-60

NVM IN-PLACE UPDATES ENGINE (build)

[Diagram: the table heap, index, and write-ahead log all live on NVM. Steps: (1) append tuple pointers (not tuple data) to the write-ahead log; (2) update Tuple #01 in place on the NVM heap. No snapshots are needed.]

SLIDES 61-66

COPY-ON-WRITE ENGINE (build)

[Diagram: a master record points to the current directory, whose leaves (Leaf 1, Leaf 2) point to Page #00 and Page #01. Steps: (1) copy Leaf 1 and its Page #00 into an "Updated Leaf 1" with a new copy of the page; (2) build a dirty directory that references the updated leaf; (3) atomically switch the master record from the current directory to the dirty directory.]

Problem highlighted: Expensive Copies.

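A sketch of the commit step under assumptions (a toy two-leaf directory; the atomic root switch is what makes the update all-or-nothing):

#include <atomic>
#include <cstdint>

// Simplified shadow paging: a transaction never touches the current
// directory. It copies the pages it changes, builds a dirty directory
// that points at the copies, and commits by atomically swinging the
// master record's root pointer.
struct Page { uint64_t data[512]; };
struct Directory { Page* leaf[2]; };

std::atomic<Directory*> master_record;

void cow_update(int leaf_idx, const Page& new_contents) {
    Directory* current = master_record.load();
    Page* copy = new Page(new_contents);         // (1) copy-on-write the page
    Directory* dirty = new Directory(*current);  // (2) dirty directory
    dirty->leaf[leaf_idx] = copy;
    master_record.store(dirty);                  // (3) atomic root switch
    // The old directory and page are reclaimed once no readers remain.
}
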
SLIDES 67-69

NVM COPY-ON-WRITE ENGINE (build)

[Diagram: the same directory structure, but tuples live on NVM. Steps: (1) the updated leaf copies only tuple pointers, so only the changed tuple (Tuple #00 (!)) gets a new version; (2) build the dirty directory; (3) switch the master record. "Only Copy Pointers" replaces the expensive page copies.]

SLIDES 70-74

LOG-STRUCTURED ENGINE (build)

[Diagram: a MemTable with a Bloom filter in memory, plus a write-ahead log and SSTables on durable storage. Steps: (1) append the tuple delta to the write-ahead log; (2) apply the delta to the MemTable; (3) when the MemTable fills up, write its tuple data out as an SSTable.]

Problems highlighted: Duplicate Data, Compactions.

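A sketch of steps (1)-(2) under assumptions (std::map standing in for the MemTable and stdio for the log; real engines such as RocksDB use a skip list and batched commits):

#include <cstdio>
#include <cstdint>
#include <map>
#include <unistd.h>  // fsync

// Log-structured write path: make the delta durable in the WAL first,
// then apply it to the in-memory MemTable. The MemTable is later flushed
// to an immutable SSTable (not shown).
std::map<uint64_t, uint64_t> memtable;  // key -> latest value

void ls_put(std::FILE* wal, uint64_t key, uint64_t value) {
    uint64_t delta[2] = {key, value};
    std::fwrite(delta, sizeof(delta), 1, wal);  // (1) append delta to WAL
    std::fflush(wal);
    fsync(fileno(wal));
    memtable[key] = value;                      // (2) apply to MemTable
}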

SLIDES 75-77

NVM LOG-STRUCTURED ENGINE (build)

[Diagram: the SSTable is crossed out; what remains is the write-ahead log and a MemTable that now lives on NVM. Step (1): the tuple delta is appended once, avoiding the SSTable flushes and the data duplication of the DRAM-based design.]

SLIDE 78

NVM SUMMARY

Storage Optimizations
→ Leverage byte-addressability to avoid unnecessary data duplication.

Recovery Optimizations
→ NVM-optimized recovery protocols avoid the overhead of processing a log.
→ Non-volatile data structures ensure consistency.

SLIDE 79

GPU ACCELERATION

GPUs excel at performing (relatively simple) repetitive operations on large amounts of data over multiple streams of data.

GPUs are better than CPUs at performing repetitive work. But, unlike CPUs, they cannot rapidly switch tasks.

SLIDE 80

GPUs vs CPUs

GPUs have large numbers of ALUs, many more than CPUs, so they can do arithmetic operations on large amounts of data faster than CPUs. Target operations that do not require blocking for input or branches:

→ Good: Sequential scans with predicates
→ Bad: B+Tree index probes

GPU memory is (usually) not cache coherent with CPU memory.

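To illustrate why predicate scans fit GPUs so well, here is a CPU-side sketch (illustrative, not from the lecture) of the branch-free, data-parallel pattern that a GPU kernel would run with one thread per element:

#include <cstdint>
#include <cstddef>

// Branch-free range predicate over a column: every element is evaluated
// identically, producing a 0/1 selection vector. There are no
// data-dependent branches, so the loop body maps directly onto
// thousands of GPU threads.
void scan_predicate(const int32_t* col, size_t n,
                    int32_t lo, int32_t hi, uint8_t* matches) {
    for (size_t i = 0; i < n; i++) {
        matches[i] = static_cast<uint8_t>((col[i] >= lo) & (col[i] <= hi));
    }
}

A B+Tree probe, by contrast, is a chain of data-dependent branches and pointer chases, which is exactly what GPUs handle poorly.
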
SLIDES 81-86

GPU ACCELERATION (build)

[Diagram: interconnect bandwidths between CPU memory and GPUs. DDR4 to the CPU: ~40 GB/s. PCIe bus between CPU and GPU: ~16 GB/s. NVLink: ~25 GB/s per link (two links shown).]

SLIDE 87

GPU-OPTIMIZED DBMSs

Choice #1: Entire Database
→ Store the database in the GPU(s)' VRAM.
→ All queries perform massively parallel sequential scans.

Choice #2: Important Columns
→ Return the offsets of records that match the portion of the query that accesses GPU-resident columns.
→ Have to materialize full results on the CPU.

Choice #3: Streaming Algorithms
→ Transfer data from CPU to GPU on the fly.

SLIDE 88

PARTING THOUGHTS

Designing for NVM is important.
→ Non-volatile data structures provide higher throughput and faster recovery.

Byte-addressable NVM is going to be a game changer when it comes out.

SLIDE 89

NEXT CLASS

Index Locking & Latching
