Distributed OLTP Databases (Part I), Lecture #22, Andy Pavlo

SLIDE 1

Database Systems (15-445/15-645), Fall 2018. Andy Pavlo, Computer Science, Carnegie Mellon University.


Lecture # 22

Distributed OLTP Databases (Part I)

SLIDE 2

ADMINISTRIVIA

Project #3: TODAY @ 11:59am
Homework #5: Monday Dec 3rd @ 11:59pm
Project #4: Monday Dec 10th @ 11:59pm
Extra Credit: Wednesday Dec 12th @ 11:59pm
Final Exam: Sunday Dec 16th @ 8:30am

SLIDE 3

ADMINISTRIVIA

Monday Dec 3rd – VoltDB Lecture

→ Dr. Ethan Zhang (Lead Engineer)

Wednesday Dec 5th – Potpourri + Review

→ Vote for what system you want me to talk about. → https://cmudb.io/f18-systems

Wednesday Dec 5th – Extra Credit Check

→ Submit your extra credit assignment early to get feedback from me.

SLIDE 4

UPCOMING DATABASE EVENTS

Swarm64 Tech Talk

→ Thursday November 29th @ 12pm → GHC 8102 ← Different Location!

VoltDB Research Talk

→ Monday December 3rd @ 4:30pm → GHC 8102

SLIDE 5

PARALLEL VS. DISTRIBUTED

Parallel DBMSs:

→ Nodes are physically close to each other. → Nodes connected with high-speed LAN. → Communication cost is assumed to be small.

Distributed DBMSs:

→ Nodes can be far from each other. → Nodes connected using public network. → Communication cost and problems cannot be ignored.

SLIDE 6

DISTRIBUTED DBMSs

Use the building blocks that we covered in single-node DBMSs to now support transaction processing and query execution in distributed environments.

→ Optimization & Planning → Concurrency Control → Logging & Recovery

SLIDE 7

OLTP VS. OLAP

On-line Transaction Processing (OLTP):

→ Short-lived read/write txns. → Small footprint. → Repetitive operations.

On-line Analytical Processing (OLAP):

→ Long-running, read-only queries. → Complex joins. → Exploratory queries.

SLIDE 8

TODAY'S AGENDA

System Architectures
Design Issues
Partitioning Schemes
Distributed Concurrency Control

SLIDE 9

SYSTEM ARCHITECTURE

A DBMS's system architecture specifies what shared resources are directly accessible to CPUs. This affects how CPUs coordinate with each other and where they retrieve/store objects in the database.

SLIDE 10

SYSTEM ARCHITECTURE

[Diagram: the four architectures (Shared Everything, Shared Memory, Shared Disk, Shared Nothing), each showing CPUs, memory, and disks connected over a network.]

SLIDE 11

SHARED MEMORY

CPUs have access to common memory address space via a fast interconnect.

→ Each processor has a global view of all the in-memory data structures. → Each DBMS instance on a processor has to "know" about the other instances.


SLIDE 12

SHARED DISK

All CPUs can access a single logical disk directly via an interconnect but each have their own private memories.

→ Can scale execution layer independently from the storage layer. → Have to send messages between CPUs to learn about their current state.


SLIDES 13-18

SHARED DISK EXAMPLE

[Animation: an application server sends requests to compute nodes that all read and write pages in shared storage. Get Id=101 causes one node to fetch Page ABC, Get Id=200 fetches Page XYZ, and a new node can be added without moving data. When a node runs Update 101, the change goes to Page ABC in storage and the other nodes have to be told about it.]

SLIDE 19

SHARED NOTHING

Each DBMS instance has its own CPU, memory, and disk. Nodes only communicate with each other via network.

→ Easy to increase capacity. → Hard to ensure consistency.


SLIDES 20-23

SHARED NOTHING EXAMPLE

[Animation: an application server routes each request to the node that owns the Id. With P1→ID:1-150 and P2→ID:151-300, Get Id=10 goes to the first node and Get Id=200 to the second. Adding a third node changes the assignment to P1→ID:1-100, P3→ID:101-200, P2→ID:201-300.]
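
To make the routing step concrete, here is a minimal Python sketch (illustrative only, not from the lecture) of how a client library or router for a shared-nothing cluster might map an Id to the node that owns it. The partition ranges and node names mirror the example above and are assumptions for the example.

# Minimal sketch of range-based routing in a shared-nothing cluster.
# The partition map mirrors the example above (illustrative values only).
PARTITION_MAP = [
    (1, 100, "node-1"),    # P1 -> ID 1-100
    (101, 200, "node-3"),  # P3 -> ID 101-200
    (201, 300, "node-2"),  # P2 -> ID 201-300
]

def route(record_id: int) -> str:
    """Return the node that owns the partition containing record_id."""
    for lo, hi, node in PARTITION_MAP:
        if lo <= record_id <= hi:
            return node
    raise KeyError(f"no partition owns id {record_id}")

# Example: Get Id=10 goes to node-1, Get Id=200 goes to node-3.
assert route(10) == "node-1"
assert route(200) == "node-3"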

SLIDE 24

EARLY DISTRIBUTED DATABASE SYSTEMS

MUFFIN – UC Berkeley (1979)
SDD-1 – CCA (1979)
System R* – IBM Research (1984)
Gamma – Univ. of Wisconsin (1986)
NonStop SQL – Tandem (1987)

[Photos: Bernstein, Mohan, DeWitt, Gray, Stonebraker]

SLIDE 25

DESIGN ISSUES

How does the application find data? How to execute queries on distributed data?

→ Push query to data. → Pull data to query.

How does the DBMS ensure correctness?

SLIDE 26

HOMOGENEOUS VS. HETEROGENEOUS

Approach #1: Homogeneous Nodes

→ Every node in the cluster can perform the same set of tasks (albeit on potentially different partitions of data). → Makes provisioning and failover "easier".

Approach #2: Heterogeneous Nodes

→ Nodes are assigned specific tasks. → Can allow a single physical node to host multiple "virtual" node types for dedicated tasks.

SLIDES 27-29

MONGODB CLUSTER ARCHITECTURE

[Diagram: a heterogeneous cluster. The application server sends Get Id=101 to a router (mongos); the router consults the config server (mongod), which stores the partition map (P1→ID:1-100, P2→ID:101-200, P3→ID:201-300, P4→ID:301-400), and then forwards the request to the shard (mongod) that holds the matching partition.]

SLIDE 30

DATA TRANSPARENCY

Users should not be required to know where data is physically located, or how tables are partitioned or replicated.

A SQL query that works on a single-node DBMS should work the same on a distributed DBMS.

SLIDE 31

DATABASE PARTITIONING

Split database across multiple resources:

→ Disks, nodes, processors. → Sometimes called "sharding"

The DBMS executes query fragments on each partition and then combines the results to produce a single answer.
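
As a rough illustration of that execute-then-combine pattern, here is a small Python sketch (an assumption-laden toy, not any real DBMS's code): the same query fragment, here just a filter, runs on every partition and the partial results are unioned.

from concurrent.futures import ThreadPoolExecutor

# Hypothetical partition handles; in a real system these would be
# connections to remote nodes rather than in-memory lists of rows.
partitions = {
    "P1": [{"id": 1, "val": "a"}, {"id": 2, "val": "b"}],
    "P2": [{"id": 151, "val": "c"}],
}

def run_fragment(rows, predicate):
    """Execute one query fragment (a filter) on a single partition."""
    return [r for r in rows if predicate(r)]

def scatter_gather(predicate):
    """Send the fragment to every partition and combine the results."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(run_fragment, rows, predicate)
                   for rows in partitions.values()]
        results = []
        for f in futures:
            results.extend(f.result())
    return results

# Example: SELECT * FROM table WHERE id > 1
print(scatter_gather(lambda r: r["id"] > 1))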

SLIDE 32

NAÏVE TABLE PARTITIONING

Each node stores one and only one table. Assumes that each node has enough storage space for a table.

SLIDES 33-35

NAÏVE TABLE PARTITIONING

Ideal Query: SELECT * FROM table

[Diagram: Table1 and Table2 are each assigned whole to a different partition, so a full scan of one table touches only the partition that stores it.]

SLIDE 36

HORIZONTAL PARTITIONING

Split a table's tuples into disjoint subsets.

→ Choose column(s) that divide the database equally in terms of size, load, or usage. → Each tuple contains all of its columns. → Hash Partitioning, Range Partitioning

The DBMS can partition a database physically (shared nothing) or logically (shared disk).

SLIDES 37-39

HORIZONTAL PARTITIONING

Ideal Query: SELECT * FROM table WHERE partitionKey = ?

[Diagram: Table1 holds tuples 101-105 with partitioning-key values a-e (plus their other columns). Each tuple is assigned to one of four partitions by hashing its key: hash(a)%4 = P2, hash(b)%4 = P4, hash(c)%4 = P3, hash(d)%4 = P2, hash(e)%4 = P1.]
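
A minimal Python sketch of the hash-partitioning step shown above. Python's built-in hash() stands in for whatever hash function a real DBMS would use (the real one has to be stable across nodes and restarts, which hash() on strings is not), and the tuples are the illustrative ones from the slide.

NUM_PARTITIONS = 4

def partition_for(key) -> str:
    """Map a partitioning-key value to one of P1..P4 via hashing.
    hash() is a stand-in; a real DBMS uses a stable hash function."""
    return f"P{(hash(key) % NUM_PARTITIONS) + 1}"

tuples = [(101, "a"), (102, "b"), (103, "c"), (104, "d"), (105, "e")]

# Every tuple keeps all of its columns; only its location is decided here.
for row_id, key in tuples:
    print(row_id, key, "->", partition_for(key))

# An "ideal" query supplies the partitioning key, so the DBMS only has to
# visit a single partition:  SELECT * FROM table WHERE partitionKey = 'c'
print("lookup c ->", partition_for("c"))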

SLIDES 40-41

LOGICAL PARTITIONING

[Diagram (shared disk): the application server sends Get Id=1 to one node and Get Id=3 to another; each node answers requests for its subset of the Ids, but all tuples (Id=1 through Id=4) live in the shared storage that both nodes can reach.]

SLIDES 42-45

PHYSICAL PARTITIONING

[Diagram (shared nothing): each node stores its own subset of the tuples (Id=1 through Id=4 split across the nodes); the application server sends Get Id=1 and Get Id=3 directly to the node that physically holds that tuple.]

SLIDE 46

SINGLE-NODE VS. DISTRIBUTED

A single-node txn only accesses data that is contained on one partition.

→ The DBMS does not need to coordinate the behavior of concurrent txns running on other nodes.

A distributed txn accesses data at one or more partitions.

→ Requires expensive coordination.

SLIDE 47

TRANSACTION COORDINATION

If our DBMS supports multi-operation and distributed txns, we need a way to coordinate their execution in the system. Two different approaches:

→ Centralized: Global "traffic cop". → Decentralized: Nodes organize themselves.

SLIDE 48

TP MONITORS

Example of a centralized coordinator. Originally developed in the 1970-80s to provide txns between terminals and mainframe databases.

→ Examples: ATMs, Airline Reservations.

Many DBMSs now support the same functionality internally.

SLIDES 49-54

CENTRALIZED COORDINATOR

[Animation: the application server sends a Lock Request for partitions P1-P4 to the coordinator, which sends back an acknowledgement once the locks are granted. The application server then sends a Commit Request; the coordinator asks the partitions "Safe to commit?" and, after they agree, acknowledges the commit to the application server.]

SLIDES 55-57

CENTRALIZED COORDINATOR

[Animation: the coordinator is a middleware layer that knows the partition map (P1→ID:1-100, P2→ID:101-200, P3→ID:201-300, P4→ID:301-400). The application server sends query requests to the middleware, which routes them to the right partitions; on Commit Request, the middleware asks the partitions "Safe to commit?" before committing.]

SLIDES 58-60

DECENTRALIZED COORDINATOR

[Animation: the application server sends a Begin Request to one of the partitions, then sends its query requests; on Commit Request, that partition asks the other partitions P1-P4 "Safe to commit?".]

SLIDE 61

DISTRIBUTED CONCURRENCY CONTROL

Need to allow multiple txns to execute simultaneously across multiple nodes.

→ Many of the same protocols from single-node DBMSs can be adapted.

This is harder because of:

→ Replication. → Network Communication Overhead. → Node Failures. → Clock Skew.

SLIDES 62-66

DISTRIBUTED 2PL

[Animation: two application servers run txns against Node 1 and Node 2 over the network. T1 sets A=2 on Node 1 (A was 1) while T2 sets B=7 on Node 2 (B was 8). T1 then tries to set B=9 on Node 2 and T2 tries to set A=0 on Node 1, so each txn waits for a lock the other holds. The waits-for graph has an edge from T1 to T2 and from T2 to T1, i.e., a distributed deadlock that neither node can see from its local locks alone.]
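
One standard way to catch the deadlock in this example is for the nodes to periodically ship their local waits-for edges to one site and check the union for a cycle. A minimal Python sketch, assuming each node reports (waiting txn, holding txn) pairs; the edges below come from the example above, everything else is hypothetical.

def find_cycle(edges):
    """Detect a cycle in a waits-for graph given (waiter, holder) edges."""
    graph = {}
    for waiter, holder in edges:
        graph.setdefault(waiter, set()).add(holder)

    def visit(txn, path):
        if txn in path:
            return path[path.index(txn):]      # cycle found
        for nxt in graph.get(txn, ()):
            cycle = visit(nxt, path + [txn])
            if cycle:
                return cycle
        return None

    for txn in graph:
        cycle = visit(txn, [])
        if cycle:
            return cycle
    return None

# Edges reported by each node in the example above:
local_edges_node1 = [("T2", "T1")]   # T2 waits for T1's lock on A
local_edges_node2 = [("T1", "T2")]   # T1 waits for T2's lock on B
print(find_cycle(local_edges_node1 + local_edges_node2))  # ['T2', 'T1'] -> deadlock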

SLIDE 67

OBSERVATION

We have not discussed how to ensure that all nodes agree to commit a txn, and then how to make sure it actually commits once we decide that it should.

→ What happens if a node fails? → What happens if our messages show up late?

SLIDE 68

ATOMIC COMMIT PROTOCOL

When a multi-node txn finishes, the DBMS needs to ask all of the nodes involved whether it is safe to commit.

→ All nodes must agree on the outcome

Examples:

→ Two-Phase Commit → Three-Phase Commit (not used) → Paxos → Raft → ZAB (Apache Zookeeper)
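
To make the two-phase commit message flow on the following slides concrete, here is a minimal in-process Python sketch. The Participant class, its vote flag, and the function names are all assumptions for illustration; a real implementation exchanges network messages and writes each step to a stable log before answering.

class Participant:
    """Toy participant: votes on prepare and applies the final decision."""
    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit
        self.state = "running"

    def prepare(self) -> bool:
        # A real participant flushes its writes and a PREPARE log record
        # before voting OK.
        self.state = "prepared" if self.will_commit else "aborted"
        return self.will_commit

    def finish(self, commit: bool):
        self.state = "committed" if commit else "aborted"


def two_phase_commit(participants) -> bool:
    """Return True if the txn commits, False if it aborts."""
    # Phase 1 (Prepare): every participant must vote OK.
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # (Early acknowledgement optimization: once all votes are OK, the
    #  client could already be told "Success!" before Phase 2 finishes.)
    # Phase 2 (Commit/Abort): tell everyone the outcome.
    for p in participants:
        p.finish(decision)
    return decision


# Success case: both participants vote OK.
print(two_phase_commit([Participant("Node 2"), Participant("Node 3")]))                      # True
# Abort case: one participant votes to abort during the prepare phase.
print(two_phase_commit([Participant("Node 2"), Participant("Node 3", will_commit=False)]))  # False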

SLIDES 69-74

TWO-PHASE COMMIT (SUCCESS)

[Animation: the application server sends a Commit Request to the coordinator (Node 1). Phase 1 (Prepare): the coordinator asks the participants (Node 2 and Node 3) to prepare, and each votes OK. Phase 2 (Commit): the coordinator tells the participants to commit, each acknowledges with OK, and the coordinator returns Success! to the application server.]

SLIDES 75-80

TWO-PHASE COMMIT (ABORT)

[Animation: the application server sends a Commit Request to the coordinator (Node 1). Phase 1 (Prepare): one participant votes ABORT!, so the coordinator immediately reports Aborted to the application server. Phase 2 (Abort): the coordinator tells the participants to abort, and each acknowledges with OK.]

SLIDE 81

2PC OPTIMIZATIONS

Early Prepare Voting

→ If you send a query to a remote node that you know will be the last one you execute there, then that node will also return its vote for the prepare phase with the query result.

Early Acknowledgement After Prepare

→ If all nodes vote to commit a txn, the coordinator can send the client an acknowledgement that their txn was successful before the commit phase finishes.

SLIDES 82-87

EARLY ACKNOWLEDGEMENT

[Animation: same setup as before. After the Commit Request, the coordinator runs Phase 1 (Prepare); as soon as both participants vote OK, it returns Success! to the application server, and only then runs Phase 2 (Commit), which the participants acknowledge with OK.]

SLIDE 88

TWO-PHASE COMMIT

Each node has to record the outcome of each phase in a stable storage log. What happens if the coordinator crashes?

→ Participants have to decide what to do.

What happens if a participant crashes?

→ Coordinator assumes that it responded with an abort if it hasn't sent an acknowledgement yet.

The nodes have to block until they can figure out the correct action to take.

SLIDE 89

PAXOS

Consensus protocol where a coordinator proposes an outcome (e.g., commit or abort) and then the participants vote on whether that outcome should succeed. Does not block if a majority of participants are available and has provably minimal message delays in the best case.

→ First correct protocol that was provably resilient in the face of asynchronous networks.
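
A heavily simplified single-decree Paxos sketch in Python, assuming reliable in-process calls and ignoring failures, retries, and learners; the class and method names are illustrative, not any library's API. It shows the two phases: a proposer first collects promises from a majority of acceptors, then asks them to accept a value, adopting any value an acceptor has already accepted.

class Acceptor:
    def __init__(self):
        self.promised = -1       # highest proposal number promised
        self.accepted = None     # (number, value) last accepted, if any

    def prepare(self, n):
        """Phase 1b: promise to ignore proposals lower than n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        """Phase 2b: accept the value if no higher promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors, n, value):
    """One proposal round; returns the chosen value or None."""
    # Phase 1a/1b: gather promises from a majority.
    promises = [a.prepare(n) for a in acceptors]
    granted = [acc for ok, acc in promises if ok]
    if len(granted) <= len(acceptors) // 2:
        return None
    # If any acceptor already accepted a value, that value must be proposed.
    prior = [acc for acc in granted if acc is not None]
    if prior:
        value = max(prior)[1]
    # Phase 2a/2b: ask the acceptors to accept; need a majority of acks.
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks > len(acceptors) // 2 else None


acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, n=1, value="commit"))  # 'commit'
print(propose(acceptors, n=2, value="abort"))   # still 'commit': the earlier value wins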

SLIDE 90

2PC VS. PAXOS

Two-Phase Commit

→ Blocks if coordinator fails after the prepare message is sent, until coordinator recovers.

Paxos

→ Non-blocking as long as a majority of participants are alive, provided there is a sufficiently long period without further failures.

SLIDE 91

CONCLUSION

I have barely scratched the surface on distributed txn processing… It is really hard to get right. More info (and humiliation):

→ Kyle Kingsbury's Jepsen Project

SLIDE 92

NEXT CLASS

Replication
CAP Theorem
Real-World Examples
