15-721 ADVANCED DATABASE SYSTEMS Lecture #12 Logging Protocols - - PowerPoint PPT Presentation

15 721
SMART_READER_LITE
LIVE PREVIEW

15-721 ADVANCED DATABASE SYSTEMS Lecture #12 Logging Protocols - - PowerPoint PPT Presentation

15-721 ADVANCED DATABASE SYSTEMS Lecture #12 Logging Protocols Andy Pavlo / / Carnegie Mellon University / / Spring 2016 @Andy_Pavlo // Carnegie Mellon University // Spring 2017 2 TODAYS AGENDA Logging Schemes Crash Course on


slide-1
SLIDE 1

Andy Pavlo / / Carnegie Mellon University / / Spring 2016

ADVANCED

DATABASE SYSTEMS

Lecture #12 – Logging Protocols

15-721

@Andy_Pavlo // Carnegie Mellon University // Spring 2017

slide-2
SLIDE 2

CMU 15-721 (Spring 2017)

TODAY’S AGENDA

Logging Schemes Crash Course on ARIES Physical Logging Command Logging

2

slide-3
SLIDE 3

CMU 15-721 (Spring 2017)

LOGGING & RECOVERY

Recovery algorithms are techniques to ensure database consistency, txn atomicity and durability despite failures. Recovery algorithms have two parts:

→ Actions during normal txn processing to ensure that the DBMS can recover from a failure. → Actions after a failure to recover the database to a state that ensures atomicity, consistency, and durability.

3

slide-4
SLIDE 4

CMU 15-721 (Spring 2017)

LOGGING SCHEMES

Physical Logging

→ Record the changes made to a specific record in the database. → Example: Store the original value and after value for an attribute that is changed by a query.

Logical Logging

→ Record the high-level operations executed by txns. → Example: The UPDATE, DELETE, and INSERT queries invoked by a txn.

4

slide-5
SLIDE 5

CMU 15-721 (Spring 2017)

PHYSICAL VS. LOGICAL LOGGING

Logical logging writes less data in each log record than physical logging. Difficult to implement recovery with logical logging if you have concurrent txns.

→ Hard to determine which parts of the database may have been modified by a query before crash. → Also takes longer to recover because you must re-execute every txn all over again.

5

slide-6
SLIDE 6

CMU 15-721 (Spring 2017)

Logical Log

LOGICAL LOGGING EXAMPLE

6

UPDATE employees SET salary = salary * 1.10 UPDATE employees SET salary = 900 WHERE name = ‘Andy’ NAME SALARY O.D.B. $100 El-P $666 Andy $888

UPDATE employees SET salary = salary * 1.10

slide-7
SLIDE 7

CMU 15-721 (Spring 2017)

Logical Log

LOGICAL LOGGING EXAMPLE

6

UPDATE employees SET salary = salary * 1.10 UPDATE employees SET salary = 900 WHERE name = ‘Andy’ NAME SALARY O.D.B. $100 El-P $666 Andy $888

UPDATE employees SET salary = salary * 1.10

slide-8
SLIDE 8

CMU 15-721 (Spring 2017)

Logical Log

LOGICAL LOGGING EXAMPLE

6

UPDATE employees SET salary = salary * 1.10 UPDATE employees SET salary = 900 WHERE name = ‘Andy’ NAME SALARY O.D.B. $100 El-P $666 Andy $888

UPDATE employees SET salary = salary * 1.10

$110 $732

slide-9
SLIDE 9

CMU 15-721 (Spring 2017)

Logical Log

LOGICAL LOGGING EXAMPLE

6

UPDATE employees SET salary = salary * 1.10 UPDATE employees SET salary = 900 WHERE name = ‘Andy’ NAME SALARY O.D.B. $100 El-P $666 Andy $888

UPDATE employees SET salary = 900 WHERE name = ‘Andy’ UPDATE employees SET salary = salary * 1.10

$110 $732

slide-10
SLIDE 10

CMU 15-721 (Spring 2017)

Logical Log

LOGICAL LOGGING EXAMPLE

6

UPDATE employees SET salary = salary * 1.10 UPDATE employees SET salary = 900 WHERE name = ‘Andy’ NAME SALARY O.D.B. $100 El-P $666 Andy $888

UPDATE employees SET salary = 900 WHERE name = ‘Andy’ UPDATE employees SET salary = salary * 1.10

$110 $732

slide-11
SLIDE 11

CMU 15-721 (Spring 2017)

Logical Log

LOGICAL LOGGING EXAMPLE

6

UPDATE employees SET salary = salary * 1.10 UPDATE employees SET salary = 900 WHERE name = ‘Andy’ NAME SALARY O.D.B. $100 El-P $666 Andy $888 $900

UPDATE employees SET salary = 900 WHERE name = ‘Andy’ UPDATE employees SET salary = salary * 1.10

$110 $732

slide-12
SLIDE 12

CMU 15-721 (Spring 2017)

Logical Log

LOGICAL LOGGING EXAMPLE

6

UPDATE employees SET salary = salary * 1.10 UPDATE employees SET salary = 900 WHERE name = ‘Andy’ NAME SALARY O.D.B. $100 El-P $666 Andy $888 $900

UPDATE employees SET salary = 900 WHERE name = ‘Andy’ UPDATE employees SET salary = salary * 1.10

$110 $732 $990

slide-13
SLIDE 13

CMU 15-721 (Spring 2017)

Logical Log

LOGICAL LOGGING EXAMPLE

6

UPDATE employees SET salary = salary * 1.10 UPDATE employees SET salary = 900 WHERE name = ‘Andy’ NAME SALARY O.D.B. $100 El-P $666 Andy $888 $900

UPDATE employees SET salary = 900 WHERE name = ‘Andy’ UPDATE employees SET salary = salary * 1.10

$110 $732 $990

X

slide-14
SLIDE 14

CMU 15-721 (Spring 2017)

Logical Log

LOGICAL LOGGING EXAMPLE

6

UPDATE employees SET salary = salary * 1.10 UPDATE employees SET salary = 900 WHERE name = ‘Andy’ NAME SALARY O.D.B. $100 El-P $666 Andy $888 $900

UPDATE employees SET salary = 900 WHERE name = ‘Andy’ UPDATE employees SET salary = salary * 1.10

$110 $732 SALARY $110 $732 $900 $990

X

slide-15
SLIDE 15

CMU 15-721 (Spring 2017)

DISK-ORIENTED LOGGING & RECOVERY

The “gold standard” for physical logging & recovery in a disk-oriented DBMS is ARIES.

→ Algorithms for Recovery and Isolation Exploiting Semantics → Invented by IBM Research in the early 1990s.

Relies on STEAL and NO-FORCE buffer pool management policies.

7

ARIES: A TRANSACTION RECOVERY METHOD SUPPORTING FINE-GRANULARITY LOCKING AND PARTIAL ROLLBACKS USING WRITE-AHEAD LOGGING ACM Transactions on Database Systems 1992

slide-16
SLIDE 16

CMU 15-721 (Spring 2017)

ARIES – MAIN IDEAS

Write-Ahead Logging:

→ Any change is recorded in log on stable storage before the database change is written to disk.

Repeating History During Redo:

→ On restart, retrace actions and restore database to exact state before crash.

Logging Changes During Undo:

→ Record undo actions to log to ensure action is not repeated in the event of repeated failures.

8

slide-17
SLIDE 17

CMU 15-721 (Spring 2017)

ARIES – RUNTIME LOGGING

For each modification to the database, the DBMS appends a record to the tail of the log. When a txn commits, its log records are flushed to durable storage.

9

slide-18
SLIDE 18

CMU 15-721 (Spring 2017)

ARIES – RUNTIME CHECKPOINTS

Use fuzzy checkpoints to allow txns to keep on running while writing checkpoint.

→ The checkpoint may contain updates from txns that have not committed and may abort later on.

The DBMS records internal system state as of the beginning of the checkpoint.

→ Active Transaction Table (ATT) → Dirty Page Table (DPT)

10

slide-19
SLIDE 19

CMU 15-721 (Spring 2017)

LOG SEQUENCE NUMBERS

Every log record has a globally unique log sequence number (LSN) that is used to determine the serial

  • rder of those records.

The DBMS keeps track of various LSNs in both volatile and non-volatile storage to determine the

  • rder of almost everything in the system…

11

slide-20
SLIDE 20

CMU 15-721 (Spring 2017)

LOG SEQUENCE NUMBERS

Each page contains a pageLSN that represents the LSN of the most recent update to that page. The DBMS keeps track of the max log record written to disk (flushedLSN). For a page i to be written, the DBMS must flush log at least to the point where pageLSNi ≤ flushedLSN

12

slide-21
SLIDE 21

CMU 15-721 (Spring 2017)

Non-Volatile Storage Buffer Pool WAL (Tail)

LOG SEQUENCE NUMBERS

13

015:<T5 begin> 016:<T5, A, 99, 88> 017:<T5, B, 5, 10> 018:<T5 commit>

pageLSN

A=99 B=5 C=12

pageLSN

A=99 B=5 C=12

001:<T1 begin> 002:<T1, A, 1, 2> 003:<T1 commit> 004:<T2 begin> 005:<T2, A, 2, 3> 006:<T3 begin> 007:<CHECKPOINT> 008:<T2 commit> 009:<T4 begin> 010:<T4, X, 5, 6> 011:<T3, B, 4, 2> 012:<T3 commit> 013:<T4, B, 2, 3> 014:<T4, C, 1, 2>

flushedLSN Master Record

slide-22
SLIDE 22

CMU 15-721 (Spring 2017)

Non-Volatile Storage Buffer Pool WAL (Tail)

LOG SEQUENCE NUMBERS

13

015:<T5 begin> 016:<T5, A, 99, 88> 017:<T5, B, 5, 10> 018:<T5 commit>

pageLSN

A=99 B=5 C=12

pageLSN

A=99 B=5 C=12

001:<T1 begin> 002:<T1, A, 1, 2> 003:<T1 commit> 004:<T2 begin> 005:<T2, A, 2, 3> 006:<T3 begin> 007:<CHECKPOINT> 008:<T2 commit> 009:<T4 begin> 010:<T4, X, 5, 6> 011:<T3, B, 4, 2> 012:<T3 commit> 013:<T4, B, 2, 3> 014:<T4, C, 1, 2>

flushedLSN Master Record

slide-23
SLIDE 23

CMU 15-721 (Spring 2017)

Non-Volatile Storage Buffer Pool WAL (Tail)

LOG SEQUENCE NUMBERS

13

015:<T5 begin> 016:<T5, A, 99, 88> 017:<T5, B, 5, 10> 018:<T5 commit>

pageLSN

A=99 B=5 C=12

pageLSN

A=99 B=5 C=12

001:<T1 begin> 002:<T1, A, 1, 2> 003:<T1 commit> 004:<T2 begin> 005:<T2, A, 2, 3> 006:<T3 begin> 007:<CHECKPOINT> 008:<T2 commit> 009:<T4 begin> 010:<T4, X, 5, 6> 011:<T3, B, 4, 2> 012:<T3 commit> 013:<T4, B, 2, 3> 014:<T4, C, 1, 2>

flushedLSN Master Record

slide-24
SLIDE 24

CMU 15-721 (Spring 2017)

Non-Volatile Storage Buffer Pool WAL (Tail)

LOG SEQUENCE NUMBERS

13

015:<T5 begin> 016:<T5, A, 99, 88> 017:<T5, B, 5, 10> 018:<T5 commit>

pageLSN

A=99 B=5 C=12

pageLSN

A=99 B=5 C=12

001:<T1 begin> 002:<T1, A, 1, 2> 003:<T1 commit> 004:<T2 begin> 005:<T2, A, 2, 3> 006:<T3 begin> 007:<CHECKPOINT> 008:<T2 commit> 009:<T4 begin> 010:<T4, X, 5, 6> 011:<T3, B, 4, 2> 012:<T3 commit> 013:<T4, B, 2, 3> 014:<T4, C, 1, 2>

flushedLSN Master Record

slide-25
SLIDE 25

CMU 15-721 (Spring 2017)

Non-Volatile Storage Buffer Pool WAL (Tail)

LOG SEQUENCE NUMBERS

13

015:<T5 begin> 016:<T5, A, 99, 88> 017:<T5, B, 5, 10> 018:<T5 commit>

pageLSN

A=99 B=5 C=12

pageLSN

A=99 B=5 C=12

001:<T1 begin> 002:<T1, A, 1, 2> 003:<T1 commit> 004:<T2 begin> 005:<T2, A, 2, 3> 006:<T3 begin> 007:<CHECKPOINT> 008:<T2 commit> 009:<T4 begin> 010:<T4, X, 5, 6> 011:<T3, B, 4, 2> 012:<T3 commit> 013:<T4, B, 2, 3> 014:<T4, C, 1, 2>

flushedLSN Master Record

slide-26
SLIDE 26

CMU 15-721 (Spring 2017)

Non-Volatile Storage Buffer Pool WAL (Tail)

LOG SEQUENCE NUMBERS

13

015:<T5 begin> 016:<T5, A, 99, 88> 017:<T5, B, 5, 10> 018:<T5 commit>

pageLSN

A=99 B=5 C=12

pageLSN

A=99 B=5 C=12

001:<T1 begin> 002:<T1, A, 1, 2> 003:<T1 commit> 004:<T2 begin> 005:<T2, A, 2, 3> 006:<T3 begin> 007:<CHECKPOINT> 008:<T2 commit> 009:<T4 begin> 010:<T4, X, 5, 6> 011:<T3, B, 4, 2> 012:<T3 commit> 013:<T4, B, 2, 3> 014:<T4, C, 1, 2>

flushedLSN Master Record

slide-27
SLIDE 27

CMU 15-721 (Spring 2017)

Non-Volatile Storage Buffer Pool WAL (Tail)

LOG SEQUENCE NUMBERS

13

015:<T5 begin> 016:<T5, A, 99, 88> 017:<T5, B, 5, 10> 018:<T5 commit>

pageLSN

A=99 B=5 C=12

pageLSN

A=99 B=5 C=12

001:<T1 begin> 002:<T1, A, 1, 2> 003:<T1 commit> 004:<T2 begin> 005:<T2, A, 2, 3> 006:<T3 begin> 007:<CHECKPOINT> 008:<T2 commit> 009:<T4 begin> 010:<T4, X, 5, 6> 011:<T3, B, 4, 2> 012:<T3 commit> 013:<T4, B, 2, 3> 014:<T4, C, 1, 2>

flushedLSN Master Record

X

slide-28
SLIDE 28

CMU 15-721 (Spring 2017)

DISK-ORIENTED DBMS OVERHEAD

14

BUFFER POOL LOCKING RECOVERY REAL WORK

28% 30% 30%

12%

Measured CPU Cycles

OLTP THROUGH THE LOOKING GLASS, AND WHAT WE FOUND THERE SIGMOD, pp. 981-992, 2008.

slide-29
SLIDE 29

CMU 15-721 (Spring 2017)

OBSERVATION

Often the slowest part of the txn is waiting for the DBMS to flush the log records to disk. Have to wait until the records are safely written before the DBMS can return the acknowledgement to the client.

15

slide-30
SLIDE 30

CMU 15-721 (Spring 2017)

GROUP COMMIT

Batch together log records from multiple txns and flush them together with a single fsync.

→ Logs are flushed either after a timeout or when the buffer gets full. → Originally developed in IBM IMS FastPath in the 1980s

This amortizes the cost of I/O over several txns.

16

slide-31
SLIDE 31

CMU 15-721 (Spring 2017)

EARLY LOCK RELEASE

A txn’s locks can be released before its commit record is written to disk as long as it does not return results to the client before becoming durable. Other txns that read data updated by a pre- committed txn become dependent on it and also have to wait for their predecessor’s log records to reach disk.

17

slide-32
SLIDE 32

CMU 15-721 (Spring 2017)

IN-MEMORY DATABASE RECOVERY

Recovery is slightly easier because the DBMS does not have to worry about tracking dirty pages in case of a crash during recovery. An in-memory DBMS also does not need to store undo records. But the DBMS is still stymied by the slow sync time of non-volatile storage

18

slide-33
SLIDE 33

CMU 15-721 (Spring 2017)

OBSERVATION

The early papers (1980s) on recovery for in- memory DBMSs assume that there is non-volatile memory. This hardware is still not widely available so we want to use existing SSD/HDDs.

19

A RECOVERY ALGORITHM FOR A HIGH-PERFORMANCE MEMORY-RESIDENT DATABASE SYSTEM SIGMOD 1987

slide-34
SLIDE 34

CMU 15-721 (Spring 2017)

SILO – LOGGING AND RECOVERY

SiloR uses the epoch-based OCC that we discussed previously. It achieves high performance by parallelizing all aspects of logging, checkpointing, and recovery. Again, Eddie Kohler is unstoppable.

20

FAST DATABASES WITH FAST DURABILITY AND RECOVERY THROUGH MULTICORE PARALLELISM OSDI 2014

slide-35
SLIDE 35

CMU 15-721 (Spring 2017)

SILOR – LOGGING PROTOCOL

The DBMS assumes that there is one storage device per CPU socket.

→ Assigns one logger thread per device. → Worker threads are grouped per CPU socket.

As the worker executes a txn, it creates new log records that contain the values that were written to the database (i.e., REDO).

21

slide-36
SLIDE 36

CMU 15-721 (Spring 2017)

SILOR – LOGGING PROTOCOL

Each logger thread maintains a pool of log buffers that are given to its worker threads. When a worker’s buffer is full, it gives it back to the logger thread to flush to disk and attempts to acquire a new one.

→ If there are no available buffers, then it stalls.

22

slide-37
SLIDE 37

CMU 15-721 (Spring 2017)

SILOR – LOG FILES

The logger threads write buffers out to files

→ After 100 epochs, it creates a new file. → The old file is renamed with a marker indicating the max epoch of records that it contains.

Log record format:

→ Id of the txn that modified the record (TID). → A set of value log triplets (Table, Key, Value). → The value can be a list of attribute + value pairs.

23

slide-38
SLIDE 38

CMU 15-721 (Spring 2017)

SILOR – LOG FILES

The logger threads write buffers out to files

→ After 100 epochs, it creates a new file. → The old file is renamed with a marker indicating the max epoch of records that it contains.

Log record format:

→ Id of the txn that modified the record (TID). → A set of value log triplets (Table, Key, Value). → The value can be a list of attribute + value pairs.

23

slide-39
SLIDE 39

CMU 15-721 (Spring 2017)

SILOR – LOG FILES

The logger threads write buffers out to files

→ After 100 epochs, it creates a new file. → The old file is renamed with a marker indicating the max epoch of records that it contains.

Log record format:

→ Id of the txn that modified the record (TID). → A set of value log triplets (Table, Key, Value). → The value can be a list of attribute + value pairs.

23

UPDATE people SET isLame = true WHERE name IN (‘Dana’,‘Andy’) Txn#1001 [people, 888, (isLame→true)] [people, 999, (isLame→true)]

slide-40
SLIDE 40

CMU 15-721 (Spring 2017)

Storage

SILOR – ARCHITECTURE

24

Epoch Thread

Worker Logger

Free Buffers Flushing Buffers Log Files

epoch=100

slide-41
SLIDE 41

CMU 15-721 (Spring 2017)

Storage

SILOR – ARCHITECTURE

24

Epoch Thread

Worker Logger

Free Buffers Flushing Buffers Log Files

epoch=100 Log Records

slide-42
SLIDE 42

CMU 15-721 (Spring 2017)

Storage

SILOR – ARCHITECTURE

24

Epoch Thread

Worker Logger

Free Buffers Flushing Buffers Log Files

epoch=100 Log Records

slide-43
SLIDE 43

CMU 15-721 (Spring 2017)

Storage

SILOR – ARCHITECTURE

24

Epoch Thread

Worker Logger

Free Buffers Flushing Buffers Log Files

epoch=100

slide-44
SLIDE 44

CMU 15-721 (Spring 2017)

Storage

SILOR – ARCHITECTURE

24

Epoch Thread

Worker Logger

Free Buffers Flushing Buffers Log Files

epoch=100

slide-45
SLIDE 45

CMU 15-721 (Spring 2017)

Storage

SILOR – ARCHITECTURE

24

Epoch Thread

Worker Logger

Free Buffers Flushing Buffers Log Files

epoch=200

slide-46
SLIDE 46

CMU 15-721 (Spring 2017)

Storage

SILOR – ARCHITECTURE

24

Epoch Thread

Worker Logger

Free Buffers Flushing Buffers Log Files

epoch=200

slide-47
SLIDE 47

CMU 15-721 (Spring 2017)

Storage

SILOR – ARCHITECTURE

24

Epoch Thread

Worker Logger

Free Buffers Flushing Buffers Log Files

epoch=200

slide-48
SLIDE 48

CMU 15-721 (Spring 2017)

Storage

SILOR – ARCHITECTURE

24

Epoch Thread

Worker Logger

Free Buffers Flushing Buffers Log Files

epoch=200

slide-49
SLIDE 49

CMU 15-721 (Spring 2017)

Storage

SILOR – ARCHITECTURE

24

Epoch Thread

Worker Logger

Free Buffers Flushing Buffers Log Files

epoch=200

slide-50
SLIDE 50

CMU 15-721 (Spring 2017)

Storage

SILOR – ARCHITECTURE

24

Epoch Thread

Worker Logger

Free Buffers Flushing Buffers Log Files

epoch=200

slide-51
SLIDE 51

CMU 15-721 (Spring 2017)

Storage

SILOR – ARCHITECTURE

24

Epoch Thread

Worker Logger

Free Buffers Flushing Buffers Log Files

epoch=200

slide-52
SLIDE 52

CMU 15-721 (Spring 2017)

Storage

SILOR – ARCHITECTURE

24

Epoch Thread

Worker Logger

Free Buffers Flushing Buffers Log Files

epoch=200

slide-53
SLIDE 53

CMU 15-721 (Spring 2017)

SILOR – PERSISTENT EPOCH

A special logger thread keeps track of the current persistent epoch (pepoch)

→ Special log file that maintains the highest epoch that is durable across all loggers.

Txns that executed in epoch e can only release their results when the pepoch is durable to non- volatile storage.

25

slide-54
SLIDE 54

CMU 15-721 (Spring 2017)

SILOR – ARCHITECTURE

26

Epoch Thread epoch=100

slide-55
SLIDE 55

CMU 15-721 (Spring 2017)

SILOR – ARCHITECTURE

26

Epoch Thread epoch=100

slide-56
SLIDE 56

CMU 15-721 (Spring 2017)

SILOR – ARCHITECTURE

26

Epoch Thread epoch=100

slide-57
SLIDE 57

CMU 15-721 (Spring 2017)

SILOR – ARCHITECTURE

26

Epoch Thread

P

epoch=200

slide-58
SLIDE 58

CMU 15-721 (Spring 2017)

SILOR – ARCHITECTURE

26

Epoch Thread

P

epoch=200 epoch=200 epoch=200 pepoch=200 epoch=200

slide-59
SLIDE 59

CMU 15-721 (Spring 2017)

SILOR – RECOVERY PROTOCOL

Phase #1: Load Last Checkpoint

→ Install the contents of the last checkpoint that was saved into the database. → All indexes have to be rebuilt.

Phase #2: Replay Log

→ Process logs in reverse order to reconcile the latest version of each tuple.

27

slide-60
SLIDE 60

CMU 15-721 (Spring 2017)

LOG RECOVERY

First check the pepoch file to determine the most recent persistent epoch.

→ Any log record from after the pepoch is ignored.

Log files are processed from newest to oldest.

→ Value logging is able to be replayed in any order. → For each log record, the thread checks to see whether the tuple already exists. → If it does not, then it is created with the value. → If it does, then the tuple’s value is overwritten only if the log TID is newer than tuple’s TID.

28

slide-61
SLIDE 61

CMU 15-721 (Spring 2017)

SILOR – RECOVERY PROTOCOL

29

P

slide-62
SLIDE 62

CMU 15-721 (Spring 2017)

SILOR – RECOVERY PROTOCOL

29

P

pepoch=200

slide-63
SLIDE 63

CMU 15-721 (Spring 2017)

SILOR – RECOVERY PROTOCOL

29

P

pepoch=200

slide-64
SLIDE 64

CMU 15-721 (Spring 2017)

SILOR – RECOVERY PROTOCOL

29

P

pepoch=200

slide-65
SLIDE 65

CMU 15-721 (Spring 2017)

SILOR – RECOVERY PROTOCOL

29

P

pepoch=200

slide-66
SLIDE 66

CMU 15-721 (Spring 2017)

OBSERVATION

The txn ids generated at runtime are enough to determine the serial order on recovery. This is why SiloR does not need to maintain separate log sequence numbers for each entry.

30

slide-67
SLIDE 67

CMU 15-721 (Spring 2017)

EVALUATION

Comparing Silo performance with and without logging and checkpoints YCSB + TPC-C Benchmarks Hardware:

→ Four Intel Xeon E7-4830 CPUs (8 cores per socket) → 256 GB of DRAM → Three Fusion ioDrive2 → RAID-5 Disk Array

31

slide-68
SLIDE 68

CMU 15-721 (Spring 2017)

YCSB-A

32

70% Reads / 30% Writes

Average Throughput SiloR: 8.76M txns/s LogSilo: 9.01M txns/s MemSilo: 10.83M txns/s

slide-69
SLIDE 69

CMU 15-721 (Spring 2017)

TPC-C

33

28 workers, 4 loggers, 4 checkpoint threads

Logging+Checkpoints Logging Only No Recovery

Average Throughput SiloR: 548K txns/s LogSilo: 575K txns/s MemSilo: 592 txns/s

slide-70
SLIDE 70

CMU 15-721 (Spring 2017)

RECOVERY TIMES

34

Recovered Database Checkpoint Log Total YCSB Size

43.2 GB 36 GB 64 GB 100 GB

Recovery

  • 33 sec

73 sec 106 sec

TPC-C Size

72.2 GB 16.7 GB 180 GB 195.7 GB

Recovery

  • 17 sec

194 sec 211 sec

slide-71
SLIDE 71

CMU 15-721 (Spring 2017)

OBSERVATION

Node failures in OLTP databases are rare.

→ OLTP databases are not that big. → They don’t need to run on hundreds of machines.

It’s better to optimize the system for runtime

  • perations rather than failure cases.

35

slide-72
SLIDE 72

CMU 15-721 (Spring 2017)

COMMAND LOGGING

Logical logging scheme where the DBMS only records the stored procedure invocation

→ Stored Procedure Name → Input Parameters → Additional safety checks

Command Logging = Transaction Logging

36

RETHINKING MAIN MEMORY OLTP RECOVERY ICDE 2014

slide-73
SLIDE 73

CMU 15-721 (Spring 2017)

DETERMINISTIC CONCURRENCY CONTROL

For a given state of the database, the execution of a serial schedule will always put the database in the same new state if:

→ The order of txns (or their queries) is defined before they start executing. → The txn logic is deterministic.

37

A=100 A = A + 1 Txn #1 A = A × 3 Txn #2 A = A - 5 Txn #3

slide-74
SLIDE 74

CMU 15-721 (Spring 2017)

DETERMINISTIC CONCURRENCY CONTROL

For a given state of the database, the execution of a serial schedule will always put the database in the same new state if:

→ The order of txns (or their queries) is defined before they start executing. → The txn logic is deterministic.

37

A=100 A = A + 1 Txn #1 A = A × 3 Txn #2 A = A - 5 Txn #3 A=298

slide-75
SLIDE 75

CMU 15-721 (Spring 2017)

DETERMINISTIC CONCURRENCY CONTROL

For a given state of the database, the execution of a serial schedule will always put the database in the same new state if:

→ The order of txns (or their queries) is defined before they start executing. → The txn logic is deterministic.

37

A=100 A = A + 1 Txn #1 A = A × 3 Txn #2 A = A - 5 Txn #3 A = A × NOW()

slide-76
SLIDE 76

CMU 15-721 (Spring 2017)

DETERMINISTIC CONCURRENCY CONTROL

For a given state of the database, the execution of a serial schedule will always put the database in the same new state if:

→ The order of txns (or their queries) is defined before they start executing. → The txn logic is deterministic.

37

A=100 A = A + 1 Txn #1 A = A × 3 Txn #2 A = A - 5 Txn #3 A = A × NOW()

X

slide-77
SLIDE 77

CMU 15-721 (Spring 2017)

VOLTDB – ARCHITECTURE

38

Partitions Single-threaded Execution Engines

slide-78
SLIDE 78

CMU 15-721 (Spring 2017)

VOLTDB – ARCHITECTURE

38

Procedure Name Input Params

slide-79
SLIDE 79

CMU 15-721 (Spring 2017)

VOLTDB – ARCHITECTURE

38

Procedure Name Input Params run(phoneNum, contestantId, currentTime) { result = execute(VoteCount, phoneNum); if (result > MAX_VOTES) { return (ERROR); } execute(InsertVote, phoneNum, contestantId, currentTime); return (SUCCESS); }

VoteCount:

SELECT COUNT(*) FROM votes WHERE phone_num = ?;

InsertVote:

INSERT INTO votes VALUES (?, ?, ?);

slide-80
SLIDE 80

CMU 15-721 (Spring 2017)

VOLTDB – ARCHITECTURE

38

Procedure Name Input Params run(phoneNum, contestantId, currentTime) { result = execute(VoteCount, phoneNum); if (result > MAX_VOTES) { return (ERROR); } execute(InsertVote, phoneNum, contestantId, currentTime); return (SUCCESS); }

VoteCount:

SELECT COUNT(*) FROM votes WHERE phone_num = ?;

InsertVote:

INSERT INTO votes VALUES (?, ?, ?);

slide-81
SLIDE 81

CMU 15-721 (Spring 2017)

VOLTDB – ARCHITECTURE

38

Procedure Name Input Params run(phoneNum, contestantId, currentTime) { result = execute(VoteCount, phoneNum); if (result > MAX_VOTES) { return (ERROR); } execute(InsertVote, phoneNum, contestantId, currentTime); return (SUCCESS); }

VoteCount:

SELECT COUNT(*) FROM votes WHERE phone_num = ?;

InsertVote:

INSERT INTO votes VALUES (?, ?, ?);

slide-82
SLIDE 82

CMU 15-721 (Spring 2017)

VOLTDB – ARCHITECTURE

38

Command Log

TxnId Procedure Name Input Params

slide-83
SLIDE 83

CMU 15-721 (Spring 2017)

VOLTDB – ARCHITECTURE

38

slide-84
SLIDE 84

CMU 15-721 (Spring 2017)

VOLTDB – ARCHITECTURE

38

Snapshots

slide-85
SLIDE 85

CMU 15-721 (Spring 2017)

VOLTDB – LOGGING PROTOCOL

The DBMS logs the txn command before it starts executing once a txn has been assigned its serial

  • rder.

The node with the txn’s “base partition” is responsible for writing the log record.

→ Remote partitions do not log anything. → Replica nodes have to log just like their master.

39

slide-86
SLIDE 86

CMU 15-721 (Spring 2017)

VOLTDB – RECOVERY PROTOCOL

The DBMS loads in the last complete checkpoint from disk. Nodes then re-execute all of the txns in the log that arrived after the checkpoint started.

→ The amount of time elapsed since the last checkpoint in the log determines how long recovery will take. → Txns that are aborted the first still have to be executed.

40

slide-87
SLIDE 87

CMU 15-721 (Spring 2017)

VOLTDB – REPLICATION

Executing a deterministic txn on the multiple copies of the same database in the same order provides strongly consistent replicas.

→ DBMS does not need to use Two-Phase Commit

41

Master Replica

Procedure Name Input Params TxnId Procedure Name Input Params

slide-88
SLIDE 88

CMU 15-721 (Spring 2017)

VOLTDB – REPLICATION

Executing a deterministic txn on the multiple copies of the same database in the same order provides strongly consistent replicas.

→ DBMS does not need to use Two-Phase Commit

41

Master Replica

Procedure Name Input Params OK

slide-89
SLIDE 89

CMU 15-721 (Spring 2017)

PROBLEMS WITH COMMAND LOGGING

If the log contains multi-node txns, then if one node goes down and there are no more replicas, then the entire DBMS has to restart.

42

Partition #1 Partition #2

X ← SELECT X FROM P2 if (X == true) { Y ← UPDATE P2 SET Y = Y+1 } else { Y ← UPDATE P3 SET Y = Y+1 } return (Y)

Partition #3

slide-90
SLIDE 90

CMU 15-721 (Spring 2017)

PROBLEMS WITH COMMAND LOGGING

If the log contains multi-node txns, then if one node goes down and there are no more replicas, then the entire DBMS has to restart.

42

Partition #1 Partition #2

X ← SELECT X FROM P2 if (X == true) { Y ← UPDATE P2 SET Y = Y+1 } else { Y ← UPDATE P3 SET Y = Y+1 } return (Y)

Partition #3

slide-91
SLIDE 91

CMU 15-721 (Spring 2017)

PROBLEMS WITH COMMAND LOGGING

If the log contains multi-node txns, then if one node goes down and there are no more replicas, then the entire DBMS has to restart.

42

Partition #1 Partition #2

X ← SELECT X FROM P2 if (X == true) { Y ← UPDATE P2 SET Y = Y+1 } else { Y ← UPDATE P3 SET Y = Y+1 } return (Y)

Partition #3

??? ???

slide-92
SLIDE 92

CMU 15-721 (Spring 2017)

PARTING THOUGHTS

Physical logging is a general purpose approach that supports all concurrency control schemes. Logical logging is faster but not universal.

43

slide-93
SLIDE 93

CMU 15-721 (Spring 2017)

NEXT CLASS

Checkpoint Schemes Facebook’s Fast Restarts

44