DATABASE SYSTEM IMPLEMENTATION GT 4420/6422 // SPRING 2019 // - - PowerPoint PPT Presentation

database system implementation
SMART_READER_LITE
LIVE PREVIEW

DATABASE SYSTEM IMPLEMENTATION GT 4420/6422 // SPRING 2019 // - - PowerPoint PPT Presentation

DATABASE SYSTEM IMPLEMENTATION GT 4420/6422 // SPRING 2019 // @JOY_ARULRAJ LECTURE #5: LOGGING AND RECOVERY PROTOCOLS 2 TODAYS AGENDA Logging Schemes Crash Course on ARIES protocol Physical Logging Logical Logging 3 LOGGING &


slide-1
SLIDE 1

DATABASE SYSTEM IMPLEMENTATION

GT 4420/6422 // SPRING 2019 // @JOY_ARULRAJ LECTURE #5: LOGGING AND RECOVERY PROTOCOLS

slide-2
SLIDE 2

TODAY’S AGENDA

Logging Schemes Crash Course on ARIES protocol Physical Logging Logical Logging

2

slide-3
SLIDE 3

LOGGING & RECOVERY

Recovery algorithms are techniques to ensure txn atomicity and durability despite failures. Recovery algorithms have two parts:

→ Actions during normal txn processing to ensure that the DBMS can recover from a failure. → Actions after a failure to recover the database to a state that ensures atomicity, consistency, and durability.

3

slide-4
SLIDE 4

LOGGING SCHEMES

Physical Logging

→ Record the changes made to a specific record in the database. → Example: Store the old (before) and new (after) values for an attribute that is changed by a query.

Logical Logging

→ Record the high-level operations executed by txns. → Example: The UPDATE, DELETE, and INSERT queries invoked by a txn.

4

slide-5
SLIDE 5

PHYSICAL VS. LOGICAL LOGGING

Logical logging writes less data in each log record than physical logging. Difficult to implement recovery with logical logging if you have concurrent txns.

→ Hard to determine which parts of the database may have been modified by a query before crash. → Also takes longer to recover because you must re-execute every txn all over again.

5

slide-6
SLIDE 6

Logical Log

LOGICAL LOGGING EXAMPLE

6

UPDATE employees SET salary = salary * 1.10 UPDATE employees SET salary = 900 WHERE name = 'Andy' NAME SALARY O.D.B. $100 El-P $666 Andy $888

slide-7
SLIDE 7

Logical Log

LOGICAL LOGGING EXAMPLE

7

UPDATE employees SET salary = salary * 1.10 UPDATE employees SET salary = 900 WHERE name = 'Andy' NAME SALARY O.D.B. $100 El-P $666 Andy $888

slide-8
SLIDE 8

Logical Log

LOGICAL LOGGING EXAMPLE

8

UPDATE employees SET salary = salary * 1.10 UPDATE employees SET salary = 900 WHERE name = 'Andy' NAME SALARY O.D.B. $100 El-P $666 Andy $888

UPDATE employees SET salary = salary * 1.10

slide-9
SLIDE 9

Logical Log

LOGICAL LOGGING EXAMPLE

9

UPDATE employees SET salary = salary * 1.10 UPDATE employees SET salary = 900 WHERE name = 'Andy' NAME SALARY O.D.B. $100 El-P $666 Andy $888

UPDATE employees SET salary = salary * 1.10

slide-10
SLIDE 10

Logical Log

LOGICAL LOGGING EXAMPLE

10

UPDATE employees SET salary = salary * 1.10 UPDATE employees SET salary = 900 WHERE name = 'Andy' NAME SALARY O.D.B. $100 El-P $666 Andy $888

UPDATE employees SET salary = salary * 1.10

$110 $732

slide-11
SLIDE 11

Logical Log

LOGICAL LOGGING EXAMPLE

11

UPDATE employees SET salary = salary * 1.10 UPDATE employees SET salary = 900 WHERE name = 'Andy' NAME SALARY O.D.B. $100 El-P $666 Andy $888

UPDATE employees SET salary = 900 WHERE name = 'Andy' UPDATE employees SET salary = salary * 1.10

$110 $732

slide-12
SLIDE 12

Logical Log

LOGICAL LOGGING EXAMPLE

12

UPDATE employees SET salary = salary * 1.10 UPDATE employees SET salary = 900 WHERE name = 'Andy' NAME SALARY O.D.B. $100 El-P $666 Andy $888

UPDATE employees SET salary = 900 WHERE name = 'Andy' UPDATE employees SET salary = salary * 1.10

$110 $732

slide-13
SLIDE 13

Logical Log

LOGICAL LOGGING EXAMPLE

13

UPDATE employees SET salary = salary * 1.10 UPDATE employees SET salary = 900 WHERE name = 'Andy' NAME SALARY O.D.B. $100 El-P $666 Andy $888 $900

UPDATE employees SET salary = 900 WHERE name = 'Andy' UPDATE employees SET salary = salary * 1.10

$110 $732

slide-14
SLIDE 14

Logical Log

LOGICAL LOGGING EXAMPLE

14

UPDATE employees SET salary = salary * 1.10 UPDATE employees SET salary = 900 WHERE name = 'Andy' NAME SALARY O.D.B. $100 El-P $666 Andy $888 $900

UPDATE employees SET salary = 900 WHERE name = 'Andy' UPDATE employees SET salary = salary * 1.10

$110 $732 $990

slide-15
SLIDE 15

Logical Log

LOGICAL LOGGING EXAMPLE

15

UPDATE employees SET salary = salary * 1.10 UPDATE employees SET salary = 900 WHERE name = 'Andy' NAME SALARY O.D.B. $100 El-P $666 Andy $888 $900

UPDATE employees SET salary = 900 WHERE name = 'Andy' UPDATE employees SET salary = salary * 1.10

$110 $732 $990

slide-16
SLIDE 16

Logical Log

LOGICAL LOGGING EXAMPLE

16

UPDATE employees SET salary = salary * 1.10 UPDATE employees SET salary = 900 WHERE name = 'Andy' NAME SALARY O.D.B. $100 El-P $666 Andy $888 $900

UPDATE employees SET salary = 900 WHERE name = 'Andy' UPDATE employees SET salary = salary * 1.10

$110 $732 $990

X

slide-17
SLIDE 17

Logical Log

LOGICAL LOGGING EXAMPLE

17

UPDATE employees SET salary = salary * 1.10 UPDATE employees SET salary = 900 WHERE name = 'Andy' NAME SALARY O.D.B. $100 El-P $666 Andy $888 $900

UPDATE employees SET salary = 900 WHERE name = 'Andy' UPDATE employees SET salary = salary * 1.10

$110 $732 SALARY $110 $732 $900 $990

X

slide-18
SLIDE 18

DISK-ORIENTED LOGGING & RECOVERY

The “gold standard” for physical logging & recovery in a disk-oriented DBMS is ARIES.

→ Algorithms for Recovery and Isolation Exploiting Semantics → Invented by IBM Research in the early 1990s.

Relies on STEAL and NO-FORCE buffer pool management policies.

18

ARIES: A TRANSACTION RECOVERY METHOD SUPPORTING FINE-GRANULARITY LOCKING AND PARTIAL ROLLBACKS USING WRITE-AHEAD LOGGING ACM Transactions on Database Systems 1992

slide-19
SLIDE 19

DISK-ORIENTED LOGGING & RECOVERY

The “gold standard” for physical logging & recovery in a disk-oriented DBMS is ARIES.

→ Algorithms for Recovery and Isolation Exploiting Semantics → Invented by IBM Research in the early 1990s.

Relies on STEAL and NO-FORCE buffer pool management policies.

19

ARIES: A TRANSACTION RECOVERY METHOD SUPPORTING FINE-GRANULARITY LOCKING AND PARTIAL ROLLBACKS USING WRITE-AHEAD LOGGING ACM Transactions on Database Systems 1992

slide-20
SLIDE 20

DISK-ORIENTED LOGGING & RECOVERY

The “gold standard” for physical logging & recovery in a disk-oriented DBMS is ARIES.

→ Algorithms for Recovery and Isolation Exploiting Semantics → Invented by IBM Research in the early 1990s.

Relies on STEAL and NO-FORCE buffer pool management policies.

20

ARIES: A TRANSACTION RECOVERY METHOD SUPPORTING FINE-GRANULARITY LOCKING AND PARTIAL ROLLBACKS USING WRITE-AHEAD LOGGING ACM Transactions on Database Systems 1992

slide-21
SLIDE 21

DISK-ORIENTED LOGGING & RECOVERY

The “gold standard” for physical logging & recovery in a disk-oriented DBMS is ARIES.

→ Algorithms for Recovery and Isolation Exploiting Semantics → Invented by IBM Research in the early 1990s.

Relies on STEAL and NO-FORCE buffer pool management policies.

21

ARIES: A TRANSACTION RECOVERY METHOD SUPPORTING FINE-GRANULARITY LOCKING AND PARTIAL ROLLBACKS USING WRITE-AHEAD LOGGING ACM Transactions on Database Systems 1992

slide-22
SLIDE 22

DISK-ORIENTED LOGGING & RECOVERY

The “gold standard” for physical logging & recovery in a disk-oriented DBMS is ARIES.

→ Algorithms for Recovery and Isolation Exploiting Semantics → Invented by IBM Research in the early 1990s.

Relies on STEAL and NO-FORCE buffer pool management policies.

22

ARIES: A TRANSACTION RECOVERY METHOD SUPPORTING FINE-GRANULARITY LOCKING AND PARTIAL ROLLBACKS USING WRITE-AHEAD LOGGING ACM Transactions on Database Systems 1992

slide-23
SLIDE 23

DISK-ORIENTED LOGGING & RECOVERY

The “gold standard” for physical logging & recovery in a disk-oriented DBMS is ARIES.

→ Algorithms for Recovery and Isolation Exploiting Semantics → Invented by IBM Research in the early 1990s.

Relies on STEAL and NO-FORCE buffer pool management policies.

23

ARIES: A TRANSACTION RECOVERY METHOD SUPPORTING FINE-GRANULARITY LOCKING AND PARTIAL ROLLBACKS USING WRITE-AHEAD LOGGING ACM Transactions on Database Systems 1992

slide-24
SLIDE 24

ARIES – MAIN IDEAS

Write-Ahead Logging:

→ Any change is recorded in log on stable storage before the database change is written to disk. → Each log record is assigned a unique identifier (LSN).

Repeating History During Redo:

→ On restart, retrace actions and restore database to exact state before crash.

Logging Changes During Undo:

→ Record undo actions to log to ensure action is not repeated in the event of repeated failures.

24

slide-25
SLIDE 25

ARIES – RECOVERY PHASES

Phase #1: Analysis

→ Read the WAL to identify dirty pages in the buffer pool and active txns at the time of the crash.

Phase #2: Redo

→ Repeat all actions starting from an appropriate point in the log. → Log redo steps in case of crash during recovery.

Phase #3: Undo

→ Reverse the actions of txns that did not commit before the crash.

25

slide-26
SLIDE 26

LOG SEQUENCE NUMBERS

Every log record has a globally unique log sequence number (LSN) that is used to determine the serial

  • rder of those records.

The DBMS keeps track of various LSNs in both volatile and non-volatile storage to determine the

  • rder of almost everything in the system…

26

slide-27
SLIDE 27

LOG SEQUENCE NUMBERS

27

Name Where Definition flushedLSN Memory Last LSN in log on disk pageLSN pagex Newest update to pagex recLSN pagex Oldest update to pagex that dirtied it lastLSN Ti Latest action of txn Ti MasterRecord Disk LSN of latest checkpoint

slide-28
SLIDE 28

WRITING LOG RECORDS

Each data page contains a pageLSN.

→ The LSN of the most recent update to that page.

System keeps track of flushedLSN.

→ The max LSN flushed so far.

Before page x can be written to disk, we must flush log at least to the point where:

→ pageLSNx ≤ flushedLSN

28

slide-29
SLIDE 29

#1: ACTIVE TRANSACTION TABLE

One entry per currently active txn.

→ txnId: Unique txn identifier. → status: The current "mode" of the txn. → lastLSN: Most recent LSN created by txn.

Entry removed when txn commits or aborts. Status Codes:

→ R → Running → C → Committing → U → Candidate for Undo

29

slide-30
SLIDE 30

#2: DIRTY PAGE TABLE

Keep track of which pages in the buffer pool contain changes from uncommitted transactions. One entry per dirty page:

→ recLSN: The LSN of the log record that first caused the page to be dirty.

30

slide-31
SLIDE 31

Non-Volatile Storage Buffer Pool WAL (Tail)

LOG SEQUENCE NUMBERS

31

015:<T5 begin> 016:<T5, A, 99, 88> 017:<T5, B, 5, 10> 018:<T5 commit>

pageLSN

A=99 B=5 C=12

pageLSN

A=99 B=5 C=12

001:<T1 begin> 002:<T1, A, 1, 2> 003:<T1 commit> 004:<T2 begin> 005:<T2, A, 2, 3> 006:<T3 begin> 007:<CHECKPOINT> 008:<T2 commit> 009:<T4 begin> 010:<T4, X, 5, 6> 011:<T3, B, 4, 2> 012:<T3 commit> 013:<T4, B, 2, 3> 014:<T4, C, 1, 2>

flushedLSN Master Record

slide-32
SLIDE 32

Non-Volatile Storage Buffer Pool WAL (Tail)

LOG SEQUENCE NUMBERS

32

015:<T5 begin> 016:<T5, A, 99, 88> 017:<T5, B, 5, 10> 018:<T5 commit>

pageLSN

A=99 B=5 C=12

pageLSN

A=99 B=5 C=12

001:<T1 begin> 002:<T1, A, 1, 2> 003:<T1 commit> 004:<T2 begin> 005:<T2, A, 2, 3> 006:<T3 begin> 007:<CHECKPOINT> 008:<T2 commit> 009:<T4 begin> 010:<T4, X, 5, 6> 011:<T3, B, 4, 2> 012:<T3 commit> 013:<T4, B, 2, 3> 014:<T4, C, 1, 2>

flushedLSN Master Record

slide-33
SLIDE 33

Non-Volatile Storage Buffer Pool WAL (Tail)

LOG SEQUENCE NUMBERS

33

015:<T5 begin> 016:<T5, A, 99, 88> 017:<T5, B, 5, 10> 018:<T5 commit>

pageLSN

A=99 B=5 C=12

pageLSN

A=99 B=5 C=12

001:<T1 begin> 002:<T1, A, 1, 2> 003:<T1 commit> 004:<T2 begin> 005:<T2, A, 2, 3> 006:<T3 begin> 007:<CHECKPOINT> 008:<T2 commit> 009:<T4 begin> 010:<T4, X, 5, 6> 011:<T3, B, 4, 2> 012:<T3 commit> 013:<T4, B, 2, 3> 014:<T4, C, 1, 2>

flushedLSN Master Record

slide-34
SLIDE 34

Non-Volatile Storage Buffer Pool WAL (Tail)

LOG SEQUENCE NUMBERS

34

015:<T5 begin> 016:<T5, A, 99, 88> 017:<T5, B, 5, 10> 018:<T5 commit>

pageLSN

A=99 B=5 C=12

pageLSN

A=99 B=5 C=12

001:<T1 begin> 002:<T1, A, 1, 2> 003:<T1 commit> 004:<T2 begin> 005:<T2, A, 2, 3> 006:<T3 begin> 007:<CHECKPOINT> 008:<T2 commit> 009:<T4 begin> 010:<T4, X, 5, 6> 011:<T3, B, 4, 2> 012:<T3 commit> 013:<T4, B, 2, 3> 014:<T4, C, 1, 2>

flushedLSN Master Record

slide-35
SLIDE 35

Non-Volatile Storage Buffer Pool WAL (Tail)

LOG SEQUENCE NUMBERS

35

015:<T5 begin> 016:<T5, A, 99, 88> 017:<T5, B, 5, 10> 018:<T5 commit>

pageLSN

A=99 B=5 C=12

pageLSN

A=99 B=5 C=12

001:<T1 begin> 002:<T1, A, 1, 2> 003:<T1 commit> 004:<T2 begin> 005:<T2, A, 2, 3> 006:<T3 begin> 007:<CHECKPOINT> 008:<T2 commit> 009:<T4 begin> 010:<T4, X, 5, 6> 011:<T3, B, 4, 2> 012:<T3 commit> 013:<T4, B, 2, 3> 014:<T4, C, 1, 2>

flushedLSN Master Record

slide-36
SLIDE 36

Non-Volatile Storage Buffer Pool WAL (Tail)

LOG SEQUENCE NUMBERS

36

015:<T5 begin> 016:<T5, A, 99, 88> 017:<T5, B, 5, 10> 018:<T5 commit>

pageLSN

A=99 B=5 C=12

pageLSN

A=99 B=5 C=12

001:<T1 begin> 002:<T1, A, 1, 2> 003:<T1 commit> 004:<T2 begin> 005:<T2, A, 2, 3> 006:<T3 begin> 007:<CHECKPOINT> 008:<T2 commit> 009:<T4 begin> 010:<T4, X, 5, 6> 011:<T3, B, 4, 2> 012:<T3 commit> 013:<T4, B, 2, 3> 014:<T4, C, 1, 2>

flushedLSN Master Record

slide-37
SLIDE 37

Non-Volatile Storage Buffer Pool WAL (Tail)

LOG SEQUENCE NUMBERS

37

015:<T5 begin> 016:<T5, A, 99, 88> 017:<T5, B, 5, 10> 018:<T5 commit>

pageLSN

A=99 B=5 C=12

pageLSN

A=99 B=5 C=12

001:<T1 begin> 002:<T1, A, 1, 2> 003:<T1 commit> 004:<T2 begin> 005:<T2, A, 2, 3> 006:<T3 begin> 007:<CHECKPOINT> 008:<T2 commit> 009:<T4 begin> 010:<T4, X, 5, 6> 011:<T3, B, 4, 2> 012:<T3 commit> 013:<T4, B, 2, 3> 014:<T4, C, 1, 2>

flushedLSN Master Record

slide-38
SLIDE 38

Non-Volatile Storage Buffer Pool WAL (Tail)

LOG SEQUENCE NUMBERS

38

015:<T5 begin> 016:<T5, A, 99, 88> 017:<T5, B, 5, 10> 018:<T5 commit>

pageLSN

A=99 B=5 C=12

pageLSN

A=99 B=5 C=12

001:<T1 begin> 002:<T1, A, 1, 2> 003:<T1 commit> 004:<T2 begin> 005:<T2, A, 2, 3> 006:<T3 begin> 007:<CHECKPOINT> 008:<T2 commit> 009:<T4 begin> 010:<T4, X, 5, 6> 011:<T3, B, 4, 2> 012:<T3 commit> 013:<T4, B, 2, 3> 014:<T4, C, 1, 2>

flushedLSN Master Record

slide-39
SLIDE 39

Non-Volatile Storage Buffer Pool WAL (Tail)

LOG SEQUENCE NUMBERS

39

015:<T5 begin> 016:<T5, A, 99, 88> 017:<T5, B, 5, 10> 018:<T5 commit>

pageLSN

A=99 B=5 C=12

pageLSN

A=99 B=5 C=12

001:<T1 begin> 002:<T1, A, 1, 2> 003:<T1 commit> 004:<T2 begin> 005:<T2, A, 2, 3> 006:<T3 begin> 007:<CHECKPOINT> 008:<T2 commit> 009:<T4 begin> 010:<T4, X, 5, 6> 011:<T3, B, 4, 2> 012:<T3 commit> 013:<T4, B, 2, 3> 014:<T4, C, 1, 2>

flushedLSN Master Record

slide-40
SLIDE 40

DISK-ORIENTED DBMS OVERHEAD

40

BUFFER POOL LATCHING LOCKING LOGGING B-TREE KEYS REAL WORK

16% 14%

34%

12%

Measured CPU Instructions

OLTP THROUGH THE LOOKING GLASS, AND WHAT WE FOUND THERE SIGMOD, pp. 981-992, 2008.

16%

7%

slide-41
SLIDE 41

OBSERVATION

Even in an in-memory DBMS, the slowest part of a txn is waiting is flushing the log records to disk. Have to wait until the records are safely written before the DBMS can return the acknowledgement to the client.

41

slide-42
SLIDE 42

OPTIMIZATION #1: GROUP COMMIT

Batch together log records from multiple txns and flush them together with a single fsync.

→ Logs are flushed either after a timeout or when the buffer gets full. → Originally developed in IBM IMS FastPath in the 1980s

This amortizes the cost of I/O over several txns.

42

slide-43
SLIDE 43

OPTIMIZATION #2: EARLY LOCK RELEASE

A txn’s locks can be released before its commit record is written to disk as long as it does not return results to the client before becoming durable. Other txns that read data updated by a pre- committed txn become dependent on it and also have to wait for their predecessor’s log records to reach disk.

43

slide-44
SLIDE 44

OPTIMIZATION #2: EARLY LOCK RELEASE

This increases time spent on useful work.

→ Partially overlap “log flush” (gray) and “useful computation” (black) parts of two transactions depending on the same lock

44 Source: Manos Athanassoulis

slide-45
SLIDE 45

IN-MEMORY DATABASE RECOVERY

Recovery is slightly easier because the DBMS does not have to worry about tracking dirty pages in case of a crash during recovery. An in-memory DBMS also does not need to store undo records. But the DBMS is still stymied by the slow sync time of non-volatile storage.

45

slide-46
SLIDE 46

OBSERVATION

The early papers (1980s) on recovery for in- memory DBMSs assume that there is non-volatile memory.

→ Battery-backed DRAM is large / finnicky → Real NVM is coming…

This hardware is still not widely available so we want to use existing SSD/HDDs.

46

A RECOVERY ALGORITHM FOR A HIGH-PERFORMANCE MEMORY-RESIDENT DATABASE SYSTEM SIGMOD 1987

slide-47
SLIDE 47

SILO – LOGGING AND RECOVERY

SiloR achieves high performance by parallelizing all aspects of logging, checkpointing, and recovery. Again, Eddie Kohler is unstoppable.

47

FAST DATABASES WITH FAST DURABILITY AND RECOVERY THROUGH MULTICORE PARALLELISM OSDI 2014

slide-48
SLIDE 48

SILOR – LOGGING PROTOCOL

The DBMS assumes that there is one storage device per CPU socket.

→ Assigns one dedicated logger thread per device. → Worker threads are grouped per CPU socket.

As the worker executes a txn, it creates new log records that contain the values that were written to the database (i.e., REDO).

48

slide-49
SLIDE 49

SILOR – LOGGING PROTOCOL

Each logger thread maintains a pool of log buffers that are given to its worker threads. When a worker’s buffer is full, it gives it back to the logger thread to flush to disk and attempts to acquire a new one.

→ If there are no available buffers, then it stalls.

49

slide-50
SLIDE 50

SILOR – LOG FILES

The logger threads write buffers out to files

→ After 100 epochs, it creates a new file. → The old file is renamed with a marker indicating the max epoch of records that it contains.

Log record format:

→ Id of the txn that modified the record (TID). → A set of value log triplets (Table, Key, Value). → The value can be a list of attribute + value pairs.

50

slide-51
SLIDE 51

SILOR – LOG FILES

The logger threads write buffers out to files

→ After 100 epochs, it creates a new file. → The old file is renamed with a marker indicating the max epoch of records that it contains.

Log record format:

→ Id of the txn that modified the record (TID). → A set of value log triplets (Table, Key, Value). → The value can be a list of attribute + value pairs.

51

slide-52
SLIDE 52

SILOR – LOG FILES

The logger threads write buffers out to files

→ After 100 epochs, it creates a new file. → The old file is renamed with a marker indicating the max epoch of records that it contains.

Log record format:

→ Id of the txn that modified the record (TID). → A set of value log triplets (Table, Key, Value). → The value can be a list of attribute + value pairs.

52

slide-53
SLIDE 53

SILOR – LOG FILES

The logger threads write buffers out to files

→ After 100 epochs, it creates a new file. → The old file is renamed with a marker indicating the max epoch of records that it contains.

Log record format:

→ Id of the txn that modified the record (TID). → A set of value log triplets (Table, Key, Value). → The value can be a list of attribute + value pairs.

53

slide-54
SLIDE 54

SILOR – LOG FILES

The logger threads write buffers out to files

→ After 100 epochs, it creates a new file. → The old file is renamed with a marker indicating the max epoch of records that it contains.

Log record format:

→ Id of the txn that modified the record (TID). → A set of value log triplets (Table, Key, Value). → The value can be a list of attribute + value pairs.

54

UPDATE people SET isLame = true WHERE name IN ('Prashanth','Andy')

slide-55
SLIDE 55

SILOR – LOG FILES

The logger threads write buffers out to files

→ After 100 epochs, it creates a new file. → The old file is renamed with a marker indicating the max epoch of records that it contains.

Log record format:

→ Id of the txn that modified the record (TID). → A set of value log triplets (Table, Key, Value). → The value can be a list of attribute + value pairs.

55

UPDATE people SET isLame = true WHERE name IN ('Prashanth','Andy')

Txn#1001 [people, 888, (isLame→true)] [people, 999, (isLame→true)]

slide-56
SLIDE 56

Storage

SILOR – ARCHITECTURE

56

Epoch Thread

Worker Logger

Free Buffers Flushing Buffers Log Files

epoch=100

slide-57
SLIDE 57

Storage

SILOR – ARCHITECTURE

57

Epoch Thread

Worker Logger

Free Buffers Flushing Buffers Log Files

epoch=100

slide-58
SLIDE 58

Storage

SILOR – ARCHITECTURE

58

Epoch Thread

Worker Logger

Free Buffers Flushing Buffers Log Files

epoch=100

slide-59
SLIDE 59

Storage

SILOR – ARCHITECTURE

59

Epoch Thread

Worker Logger

Free Buffers Flushing Buffers Log Files

epoch=100

slide-60
SLIDE 60

Storage

SILOR – ARCHITECTURE

60

Epoch Thread

Worker Logger

Free Buffers Flushing Buffers Log Files

epoch=100 Log Records

slide-61
SLIDE 61

Storage

SILOR – ARCHITECTURE

61

Epoch Thread

Worker Logger

Free Buffers Flushing Buffers Log Files

epoch=100 Log Records

slide-62
SLIDE 62

Storage

SILOR – ARCHITECTURE

62

Epoch Thread

Worker Logger

Free Buffers Flushing Buffers Log Files

epoch=100

slide-63
SLIDE 63

Storage

SILOR – ARCHITECTURE

63

Epoch Thread

Worker Logger

Free Buffers Flushing Buffers Log Files

epoch=100

slide-64
SLIDE 64

Storage

SILOR – ARCHITECTURE

64

Epoch Thread

Worker Logger

Free Buffers Flushing Buffers Log Files

epoch=100

slide-65
SLIDE 65

Storage

SILOR – ARCHITECTURE

65

Epoch Thread

Worker Logger

Free Buffers Flushing Buffers Log Files

epoch=200

slide-66
SLIDE 66

Storage

SILOR – ARCHITECTURE

66

Epoch Thread

Worker Logger

Free Buffers Flushing Buffers Log Files

epoch=200

slide-67
SLIDE 67

Storage

SILOR – ARCHITECTURE

67

Epoch Thread

Worker Logger

Free Buffers Flushing Buffers Log Files

epoch=200

slide-68
SLIDE 68

Storage

SILOR – ARCHITECTURE

68

Epoch Thread

Worker Logger

Free Buffers Flushing Buffers Log Files

epoch=200

slide-69
SLIDE 69

Storage

SILOR – ARCHITECTURE

69

Epoch Thread

Worker Logger

Free Buffers Flushing Buffers Log Files

epoch=200

slide-70
SLIDE 70

Storage

SILOR – ARCHITECTURE

70

Epoch Thread

Worker Logger

Free Buffers Flushing Buffers Log Files

epoch=200

slide-71
SLIDE 71

Storage

SILOR – ARCHITECTURE

71

Epoch Thread

Worker Logger

Free Buffers Flushing Buffers Log Files

epoch=200

slide-72
SLIDE 72

Storage

SILOR – ARCHITECTURE

72

Epoch Thread

Worker Logger

Free Buffers Flushing Buffers Log Files

epoch=200

slide-73
SLIDE 73

Storage

SILOR – ARCHITECTURE

73

Epoch Thread

Worker Logger

Free Buffers Flushing Buffers Log Files

epoch=200

slide-74
SLIDE 74

Storage

SILOR – ARCHITECTURE

74

Epoch Thread

Worker Logger

Free Buffers Flushing Buffers Log Files

epoch=200

slide-75
SLIDE 75

SILOR – PERSISTENT EPOCH

A special logger thread keeps track of the current persistent epoch (pepoch)

→ Special log file that maintains the highest epoch that is durable across all loggers.

Txns that executed in epoch e can only release their results when the pepoch is durable to non- volatile storage.

75

slide-76
SLIDE 76

SILOR – ARCHITECTURE

76

Epoch Thread epoch=100

slide-77
SLIDE 77

SILOR – ARCHITECTURE

77

Epoch Thread epoch=100

slide-78
SLIDE 78

SILOR – ARCHITECTURE

78

Epoch Thread epoch=100

slide-79
SLIDE 79

SILOR – ARCHITECTURE

79

Epoch Thread epoch=100

slide-80
SLIDE 80

SILOR – ARCHITECTURE

80

Epoch Thread

P

epoch=100

slide-81
SLIDE 81

SILOR – ARCHITECTURE

81

Epoch Thread

P

epoch=200

slide-82
SLIDE 82

SILOR – ARCHITECTURE

82

Epoch Thread

P

epoch=200 epoch=200 epoch=200 pepoch=200 epoch=200

slide-83
SLIDE 83

SILOR – ARCHITECTURE

83

Epoch Thread

P

epoch=200 epoch=200 epoch=200 pepoch=200 epoch=200

slide-84
SLIDE 84

SILOR – RECOVERY PROTOCOL

Phase #1: Load Last Checkpoint

→ Install the contents of the last checkpoint that was saved into the database. → All indexes have to be rebuilt.

Phase #2: Replay Log

→ Process logs in reverse order to reconcile the latest version of each tuple.

84

slide-85
SLIDE 85

LOG RECOVERY

First check the pepoch file to determine the most recent persistent epoch.

→ Any log record from after the pepoch is ignored.

Log files are processed from newest (tail of log) to

  • ldest.

→ For each log record, the thread checks to see whether the tuple already exists. → If it does not, then it is created with the value. → If it does, then the tuple’s value is overwritten only if the log TID is newer than tuple’s TID.

85

slide-86
SLIDE 86

OBSERVATION

The txn ids generated at runtime are enough to determine the serial order on recovery. This is why SiloR does not need to maintain separate log sequence numbers for each entry.

86

slide-87
SLIDE 87

EVALUATION

Comparing Silo performance with and without logging and checkpoints YCSB + TPC-C Benchmarks Hardware:

→ Four Intel Xeon E7-4830 CPUs (8 cores per socket) → 256 GB of DRAM → Three Fusion ioDrive2 → RAID-5 Disk Array

87

slide-88
SLIDE 88

EVALUATION

Comparing Silo performance with and without logging and checkpoints YCSB + TPC-C Benchmarks Hardware:

→ Four Intel Xeon E7-4830 CPUs (8 cores per socket) → 256 GB of DRAM → Three Fusion ioDrive2 → RAID-5 Disk Array

88

slide-89
SLIDE 89

YCSB-A

89

70% Reads / 30% Writes

Average Throughput SiloR: 8.76M txns/s LogSilo: 9.01M txns/s MemSilo: 10.83M txns/s

slide-90
SLIDE 90

TPC-C

90

28 workers, 4 loggers, 4 checkpoint threads

Logging+Checkpoints Logging Only No Logging

Average Throughput SiloR: 548K txns/s LogSilo: 575K txns/s MemSilo: 592 txns/s

slide-91
SLIDE 91

RECOVERY TIMES

91

Recovered Database Checkpoint Log Total YCSB Size

43.2 GB 36 GB 64 GB 100 GB

Recovery

  • 33 sec

73 sec 106 sec

TPC-C Size

72.2 GB 16.7 GB 180 GB 195.7 GB

Recovery

  • 17 sec

194 sec 211 sec

slide-92
SLIDE 92

OBSERVATION

Node failures in OLTP databases are rare.

→ OLTP databases are not that big. → They don’t need to run on hundreds of machines.

It’s better to optimize the system for runtime

  • perations rather than failure cases.

92

slide-93
SLIDE 93

LOGICAL LOGGING

Logical logging scheme where the DBMS only records the stored procedure invocation

→ Stored Procedure Name → Input Parameters → Additional safety checks

Logical Logging = Command Logging = Transaction Logging

93

RETHINKING MAIN MEMORY OLTP RECOVERY ICDE 2014

slide-94
SLIDE 94

DETERMINISTIC CONCURRENCY CONTROL

For a given state of the database, the execution of a serial schedule will always put the database in the same new state if:

→ The order of txns (or their queries) is defined before they start executing. → The txn logic is deterministic.

94

slide-95
SLIDE 95

DETERMINISTIC CONCURRENCY CONTROL

For a given state of the database, the execution of a serial schedule will always put the database in the same new state if:

→ The order of txns (or their queries) is defined before they start executing. → The txn logic is deterministic.

95

A=100

slide-96
SLIDE 96

DETERMINISTIC CONCURRENCY CONTROL

For a given state of the database, the execution of a serial schedule will always put the database in the same new state if:

→ The order of txns (or their queries) is defined before they start executing. → The txn logic is deterministic.

96

A=100 A = A + 1 Txn #1 A = A × 3 Txn #2 A = A - 5 Txn #3

slide-97
SLIDE 97

DETERMINISTIC CONCURRENCY CONTROL

For a given state of the database, the execution of a serial schedule will always put the database in the same new state if:

→ The order of txns (or their queries) is defined before they start executing. → The txn logic is deterministic.

97

A=100 A = A + 1 Txn #1 A = A × 3 Txn #2 A = A - 5 Txn #3 A=298

slide-98
SLIDE 98

DETERMINISTIC CONCURRENCY CONTROL

For a given state of the database, the execution of a serial schedule will always put the database in the same new state if:

→ The order of txns (or their queries) is defined before they start executing. → The txn logic is deterministic.

98

A=100 A = A + 1 Txn #1 A = A × 3 Txn #2 A = A - 5 Txn #3 A = A × NOW()

slide-99
SLIDE 99

DETERMINISTIC CONCURRENCY CONTROL

For a given state of the database, the execution of a serial schedule will always put the database in the same new state if:

→ The order of txns (or their queries) is defined before they start executing. → The txn logic is deterministic.

99

A=100 A = A + 1 Txn #1 A = A × 3 Txn #2 A = A - 5 Txn #3 A = A × NOW()

X

slide-100
SLIDE 100

VOLTDB – ARCHITECTURE

100

slide-101
SLIDE 101

VOLTDB – ARCHITECTURE

101

Partitions

slide-102
SLIDE 102

VOLTDB – ARCHITECTURE

102

Partitions Single-threaded Execution Engines

slide-103
SLIDE 103

VOLTDB – ARCHITECTURE

103

Procedure Name Input Params

slide-104
SLIDE 104

VOLTDB – ARCHITECTURE

104

Procedure Name Input Params run(phoneNum, contestantId, currentTime) { result = execute(VoteCount, phoneNum); if (result > MAX_VOTES) { return (ERROR); } execute(InsertVote, phoneNum, contestantId, currentTime); return (SUCCESS); }

VoteCount:

SELECT COUNT(*) FROM votes WHERE phone_num = ?;

InsertVote:

INSERT INTO votes VALUES (?, ?, ?);

slide-105
SLIDE 105

VOLTDB – ARCHITECTURE

105

Procedure Name Input Params run(phoneNum, contestantId, currentTime) { result = execute(VoteCount, phoneNum); if (result > MAX_VOTES) { return (ERROR); } execute(InsertVote, phoneNum, contestantId, currentTime); return (SUCCESS); }

VoteCount:

SELECT COUNT(*) FROM votes WHERE phone_num = ?;

InsertVote:

INSERT INTO votes VALUES (?, ?, ?);

slide-106
SLIDE 106

VOLTDB – ARCHITECTURE

106

Procedure Name Input Params run(phoneNum, contestantId, currentTime) { result = execute(VoteCount, phoneNum); if (result > MAX_VOTES) { return (ERROR); } execute(InsertVote, phoneNum, contestantId, currentTime); return (SUCCESS); }

VoteCount:

SELECT COUNT(*) FROM votes WHERE phone_num = ?;

InsertVote:

INSERT INTO votes VALUES (?, ?, ?);

slide-107
SLIDE 107

VOLTDB – ARCHITECTURE

107

Procedure Name Input Params run(phoneNum, contestantId, currentTime) { result = execute(VoteCount, phoneNum); if (result > MAX_VOTES) { return (ERROR); } execute(InsertVote, phoneNum, contestantId, currentTime); return (SUCCESS); }

VoteCount:

SELECT COUNT(*) FROM votes WHERE phone_num = ?;

InsertVote:

INSERT INTO votes VALUES (?, ?, ?);

slide-108
SLIDE 108

VOLTDB – ARCHITECTURE

108

slide-109
SLIDE 109

VOLTDB – ARCHITECTURE

109

Command Log

TxnId Procedure Name Input Params

slide-110
SLIDE 110

VOLTDB – ARCHITECTURE

110

slide-111
SLIDE 111

VOLTDB – ARCHITECTURE

111

Snapshots

slide-112
SLIDE 112

VOLTDB – LOGGING PROTOCOL

The DBMS logs the txn command once a txn has been assigned its serial order even before it starts executing the txn. The DBMS also records txn

  • rdering messages in the log.

At recovery time, the transactions will be sorted again in the same order as at runtime and re- executed.

112

slide-113
SLIDE 113

VOLTDB – RECOVERY PROTOCOL

The DBMS loads in the last complete checkpoint from disk. Nodes then re-execute all of the txns in the log that arrived after the checkpoint started.

→ The amount of time elapsed since the last checkpoint in the log determines how long recovery will take. → Txns that are aborted the first still have to be executed.

113

slide-114
SLIDE 114

VOLTDB – REPLICATION

Executing a deterministic txn on the multiple copies of the same database in the same order provides strongly consistent replicas.

→ DBMS does not need to use Two-Phase Commit

114

Master Replica

slide-115
SLIDE 115

VOLTDB – REPLICATION

Executing a deterministic txn on the multiple copies of the same database in the same order provides strongly consistent replicas.

→ DBMS does not need to use Two-Phase Commit

115

Master Replica

Procedure Name Input Params

slide-116
SLIDE 116

VOLTDB – REPLICATION

Executing a deterministic txn on the multiple copies of the same database in the same order provides strongly consistent replicas.

→ DBMS does not need to use Two-Phase Commit

116

Master Replica

Procedure Name Input Params TxnId Procedure Name Input Params

slide-117
SLIDE 117

VOLTDB – REPLICATION

Executing a deterministic txn on the multiple copies of the same database in the same order provides strongly consistent replicas.

→ DBMS does not need to use Two-Phase Commit

117

Master Replica

Procedure Name Input Params

slide-118
SLIDE 118

VOLTDB – REPLICATION

Executing a deterministic txn on the multiple copies of the same database in the same order provides strongly consistent replicas.

→ DBMS does not need to use Two-Phase Commit

118

Master Replica

Procedure Name Input Params OK

slide-119
SLIDE 119

PROBLEMS WITH LOGICAL LOGGING

If the log contains multi-node txns, then if one node goes down and there are no more replicas, then the entire DBMS has to restart.

119

slide-120
SLIDE 120

PROBLEMS WITH LOGICAL LOGGING

If the log contains multi-node txns, then if one node goes down and there are no more replicas, then the entire DBMS has to restart.

120

Partition #1 Partition #2

X ← SELECT X FROM P2 if (X == true) { Y ← UPDATE P2 SET Y = Y+1 } else { Y ← UPDATE P3 SET Y = Y+1 } return (Y)

Partition #3

slide-121
SLIDE 121

PROBLEMS WITH LOGICAL LOGGING

If the log contains multi-node txns, then if one node goes down and there are no more replicas, then the entire DBMS has to restart.

121

Partition #1 Partition #2

X ← SELECT X FROM P2 if (X == true) { Y ← UPDATE P2 SET Y = Y+1 } else { Y ← UPDATE P3 SET Y = Y+1 } return (Y)

Partition #3

slide-122
SLIDE 122

PROBLEMS WITH LOGICAL LOGGING

If the log contains multi-node txns, then if one node goes down and there are no more replicas, then the entire DBMS has to restart.

122

Partition #1 Partition #2

X ← SELECT X FROM P2 if (X == true) { Y ← UPDATE P2 SET Y = Y+1 } else { Y ← UPDATE P3 SET Y = Y+1 } return (Y)

Partition #3

slide-123
SLIDE 123

PROBLEMS WITH LOGICAL LOGGING

If the log contains multi-node txns, then if one node goes down and there are no more replicas, then the entire DBMS has to restart.

123

Partition #1 Partition #2

X ← SELECT X FROM P2 if (X == true) { Y ← UPDATE P2 SET Y = Y+1 } else { Y ← UPDATE P3 SET Y = Y+1 } return (Y)

Partition #3

??? ???

slide-124
SLIDE 124

PARTING THOUGHTS

Physical logging is a general purpose approach that supports all concurrency control schemes. Logical logging is faster but not universal. Non-volatile memory is coming…

124

slide-125
SLIDE 125

NEXT CLASS

Checkpoint Schemes Reminder: Homework 1 is due on Thursday Jan 24th. Reminder: Review 1 is due on Thursday Jan 24th.

125