

SLIDE 1

Database Replication in Tashkent

CSEP 545 Transaction Processing, Sameh Elnikety

SLIDE 2

Replication for Performance

• Expensive
• Limited scalability

SLIDE 3

DB Replication is Challenging

• Single database system
  – Large, persistent state
  – Transactions
  – Complex software
• Replication challenges
  – Maintain consistency
  – Middleware replication

SLIDE 4

Background

[Figure: a standalone DBMS serves as Replica 1]

SLIDE 5

Background

[Figure: a load balancer routes clients to Replicas 1, 2, and 3]

SLIDE 6

Read Tx

[Figure: the load balancer routes a read transaction T to a single replica]

A read tx does not change DB state.

SLIDE 7

Update Tx 1/2

An update tx changes DB state.

[Figure: T executes at one replica, producing a writeset (ws)]

SLIDE 8

Update Tx 1/2

An update tx changes DB state; apply (or commit) T everywhere.

[Figure: the writeset (ws) of T is propagated to and applied at every replica]

Example: T1: { set x = 1 }

SLIDE 9

Update Tx 2/2

An update tx changes DB state.

[Figure: two update transactions execute at different replicas; each produces a writeset (ws) that must be applied at all replicas, raising the question of ordering]

SLIDE 10

Update Tx 2/2

[Figure: the writesets of both transactions are applied at every replica in the same order]

Example: T1: { set x = 1 }, T2: { set x = 7 }. Commit updates in order.
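To make the ordering requirement concrete, here is a minimal sketch (illustrative only, not from the talk) of two replicas applying the slide's T1 and T2 in different orders and ending up with divergent states:

```python
# Illustrative: replicas that commit the same writesets in different
# orders diverge. T1 and T2 are the example transactions from the slide.

def apply_writesets(order):
    """Apply a sequence of writesets to an initially empty replica."""
    state = {}
    for ws in order:
        state.update(ws)
    return state

t1 = {"x": 1}   # T1: { set x = 1 }
t2 = {"x": 7}   # T2: { set x = 7 }

replica_a = apply_writesets([t1, t2])   # commits T1 then T2
replica_b = apply_writesets([t2, t1])   # commits T2 then T1

print(replica_a)   # {'x': 7}
print(replica_b)   # {'x': 1} -- the replicas disagree, so a total
                   # commit order must be enforced everywhere
```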

SLIDE 11

Sub-linear Scalability Wall

[Figure: a fourth replica is added; every replica must now apply writesets from all the others, so throughput grows sub-linearly]

SLIDE 12

This Talk

• General scaling techniques
  – Address fundamental bottlenecks
  – Synergistic, implemented in middleware
  – Evaluated experimentally

SLIDE 13

Super-linear Scalability

[Chart: throughput (TPS) of the systems compared in this talk: Single 1x, Base 7x, United 12x, MALB 25x, MALB+UF 37x]

SLIDE 14

Big Picture: Let's Oversimplify

[Figure: a standalone DBMS spends its time on reading (R), updates (U), and logging]

SLIDE 15

Big Picture: Let's Oversimplify

[Figure: to serve N·R reads and N·U updates, each of N traditional replicas performs R reads, U updates, (N-1) writeset applications, and logging]

SLIDE 16

Big Picture: Let's Oversimplify

[Figure: an optimized replica serves the same N·R reads and N·U updates with cheaper components R*, U*, and (N-1)·ws*]

SLIDE 17

Big Picture: Let's Oversimplify

[Figure: the same comparison, annotated with the technique behind each saving: Uniting O & D (logging), MALB (R* and U*), Update Filtering ((N-1)·ws*)]
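The oversimplified model invites a quick calculation. With assumed relative costs (illustrative numbers, not measurements from the talk), each traditional replica performs its local reads and updates plus (N-1) writeset applications, and that last term caps throughput as N grows:

```python
# Back-of-the-envelope model of the "traditional" replica from the
# slides. Costs and capacity are assumed numbers for illustration only.

COST_READ, COST_UPDATE, COST_WS = 1.0, 2.0, 1.0   # relative work per item
CAPACITY = 100.0                                  # work units per replica

def system_throughput(n, update_fraction=0.5):
    """Max tx/sec of n replicas: each tx costs its local read/update
    work, and every update's writeset is also applied at the other
    n-1 replicas."""
    local = (1 - update_fraction) * COST_READ + update_fraction * COST_UPDATE
    ws = update_fraction * COST_WS * (n - 1)
    return n * CAPACITY / (local + ws)

for n in (1, 2, 4, 8, 16):
    print(n, round(system_throughput(n), 1))
# 1 66.7, 2 100.0, 4 133.3, 8 160.0, 16 177.8 -> the (n-1)*ws term
# flattens the curve: the sub-linear scalability wall.
```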

SLIDE 18

Key Points

• 1. Commit updates in order
  – Problem: serial synchronous disk writes
  – Solution: unite ordering and durability
• 2. Load balancing
  – Problem: optimizing for equal load causes memory contention
  – Solution: MALB, optimize for in-memory execution
• 3. Update propagation
  – Problem: updates are propagated everywhere
  – Solution: update filtering, propagate only where needed

SLIDE 19

Roadmap

• Commit updates in order (this section)
• Load balancing
• Update propagation

[Figure: replicated system with load balancer; "Ordering" highlighted]

SLIDE 20

Key Idea

• Traditionally:
  – Commit ordering and durability are separated
• Key idea:
  – Unite commit ordering and durability

SLIDE 21

All Replicas Must Agree

• All replicas agree on
  – which update txs commit
  – their commit order
• Total order
  – Determined by middleware
  – Followed by each replica

[Figure: Tx A and Tx B pass through each replica's durability component in the same order]
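A minimal sketch of the middleware's role here, with assumed design details (the class and method names are hypothetical): a sequencer stamps every committing update transaction with a position in the total order, and each replica applies writesets strictly in stamp order.

```python
import itertools

# Sketch of a middleware sequencer (assumed design): it stamps each
# committing update tx with a global sequence number; replicas apply
# writesets strictly in stamp order, so all agree on the total order.

class Sequencer:
    def __init__(self):
        self._next = itertools.count(1)

    def order(self, tx_id, writeset):
        """Assign the next position in the total commit order."""
        return (next(self._next), tx_id, writeset)

seq = Sequencer()
a = seq.order("A", {"x": 1})
b = seq.order("B", {"x": 7})
# Every replica receives [a, b] and applies them in sequence-number
# order, agreeing on which txs commit and in what order.
```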

SLIDE 22

Order Outside DBMS

[Figure: the middleware orders Tx A before Tx B; each of the three replicas must make them durable and commit them in that order]

SLIDE 23

Order Outside DBMS

[Figure: the commit order A → B is distributed to all three replicas; each replica's durability component commits A before B]

SLIDE 24

Enforce External Commit Order

[Figure: at one replica, a proxy sits in front of the DBMS. Tx A and Tx B arrive with external order A → B; the proxy submits them as Task A and Task B through the SQL interface]

SLIDE 25

Enforce External Commit Order

[Figure: same setup; if Task A and Task B run concurrently, the DBMS may internally commit B before A]

SLIDE 26

Enforce External Commit Order

[Figure: the DBMS commits B before A, violating the external order A → B]

Cannot commit A & B concurrently!

SLIDE 27

Enforce Order = Serial Commit

[Figure: the proxy submits Task A alone; the DBMS commits A first]

SLIDE 28

Enforce Order = Serial Commit

[Figure: only after A commits does the proxy submit Task B; the external order is enforced by committing serially]

SLIDE 29

Commit Serialization is Slow

[Figure: timeline. The proxy holds the commit order A → B → C. The DBMS performs Commit A (CPU work plus a synchronous disk write for durability of A), sends Ack A, and only then starts Commit B, then Commit C. Each commit waits for its own disk write]

SLIDE 30

Commit Serialization is Slow

[Figure: same timeline as the previous slide]

Problem: durability & ordering are separated → serial disk writes.
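A toy calculation (latency numbers are assumptions, not measurements) of what serial commit costs when durability lives inside the DBMS: every ordered commit pays its own synchronous disk write before the next commit may start.

```python
# Toy model of serial ordered commits (assumed latencies).
FSYNC_MS = 5.0        # one synchronous log write inside the DBMS
CPU_COMMIT_MS = 0.05  # the in-memory part of a commit

def serial_commit_time_ms(n_txs):
    """Commits are submitted one at a time to preserve the external
    order, so each transaction pays a full disk write of its own."""
    return n_txs * (CPU_COMMIT_MS + FSYNC_MS)

print(serial_commit_time_ms(100))   # 505.0 ms for 100 ordered commits
```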

SLIDE 31

Unite D. & O. in Middleware

[Figure: durability is turned OFF in the DBMS. The middleware proxy performs both ordering and durability for A → B → C, then issues Commit A, Commit B, Commit C and collects Ack A, Ack B, Ack C; the DBMS commits use only CPU]

SLIDE 32

Unite D. & O. in Middleware

[Figure: same as the previous slide]

Solution: move durability to the middleware. With durability & ordering united in the middleware → group commit.

SLIDE 33

Implementation: Uniting D & O in MW

• Middleware logs tx effects
  – Durability of update txs is guaranteed in middleware
  – Durability is turned off at the database
• Middleware performs durability & ordering
  – United → group commit → fast
• Database commits update txs serially
  – Commit = quick main-memory operation
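A minimal sketch of the united design, under assumed details (the file format and names are hypothetical): the middleware appends writesets to its log in commit order and makes a whole batch durable with a single fsync (group commit); the DBMS, with its own durability turned off, then commits serially as a cheap main-memory operation.

```python
import os

# Sketch of group commit in the middleware (assumed design). Writesets
# are logged in commit order; one fsync makes the whole batch durable,
# so ordering and durability are paid together rather than per tx.

class MiddlewareLog:
    def __init__(self, path="mw.log"):
        self._f = open(path, "ab")

    def group_commit(self, ordered_writesets):
        """Durably log a batch of (seqno, tx_id, writeset) in order."""
        for seqno, tx_id, ws in ordered_writesets:
            self._f.write(f"{seqno} {tx_id} {ws}\n".encode())
        self._f.flush()
        os.fsync(self._f.fileno())   # one disk write for the batch
        # After this returns, every tx in the batch is durable; the
        # DBMS (durability OFF) commits them serially in memory.

log = MiddlewareLog()
log.group_commit([(1, "A", {"x": 1}), (2, "B", {"x": 7}),
                  (3, "C", {"y": 2})])
```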

SLIDE 34

Uniting Improves Throughput

• Metric: throughput
• Workload: TPC-W Ordering (50% updates)
• System: Linux cluster, PostgreSQL, 16 replicas, serializable execution

[Chart: TPC-W throughput (TPS): Single 1x, Base 7x, United 12x]

SLIDE 35

Roadmap

• Commit updates in order
• Load balancing (this section)
• Update propagation

[Figure: replicated system with load balancer]

SLIDE 36

Key Idea

[Figure: a load balancer spreads equal load across n replicas, each with its own memory and disk]

SLIDE 37

Key Idea

[Figure: same as the previous slide]

MALB (Memory-Aware Load Balancing): optimize for in-memory execution instead of equal load.

SLIDE 38

How Does MALB Work?

[Figure: the database holds tables 1, 2, and 3. In the workload, tx A accesses tables 1 and 2, and tx B accesses tables 2 and 3. A replica's memory is smaller than the database]

SLIDE 39

Read Data From Disk

[Figure: a least-loaded balancer alternates the stream A, B, A, B across Replicas 1 and 2, so each replica needs tables 1, 2, and 3 in memory at once]

SLIDE 40

Read Data From Disk

[Figure: neither replica can cache all three tables; both keep reading data from disk: Slow, Slow]

SLIDE 41

Data Fits in Memory

[Figure: MALB sends all A transactions to Replica 1 and all B transactions to Replica 2]

SLIDE 42

Data Fits in Memory

[Figure: Replica 1 caches tables 1 and 2 for A; Replica 2 caches tables 2 and 3 for B: Fast, Fast]

Open questions: where does the memory info come from, and what about many txs and replicas?

SLIDE 43

Estimate Tx Memory Needs

• Exploit tx execution plan
  – Which tables & indices are accessed
  – Their access pattern: linear scan or direct access
• Metadata from database
  – Sizes of tables and indices
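A simplified sketch of the estimate (the heuristic details are assumptions, not the talk's exact rules): a linear scan is charged the full table or index size, while direct access is charged only a small fraction of it.

```python
# Simplified sketch of estimating a transaction's memory needs from
# its execution plan (assumed heuristic): a linear scan touches the
# whole table; direct/index access touches only a small part of it.

TABLE_SIZES_MB = {"t1": 400, "t2": 300, "t3": 500}   # metadata from DB
DIRECT_ACCESS_FRACTION = 0.05                        # assumed constant

def estimate_memory_mb(plan):
    """plan: list of (table, access_pattern) pairs from the exec plan."""
    total = 0.0
    for table, pattern in plan:
        size = TABLE_SIZES_MB[table]
        total += size if pattern == "scan" else size * DIRECT_ACCESS_FRACTION
    return total

tx_a = [("t1", "scan"), ("t2", "direct")]   # tx A touches tables 1, 2
tx_b = [("t2", "direct"), ("t3", "scan")]   # tx B touches tables 2, 3
print(estimate_memory_mb(tx_a), estimate_memory_mb(tx_b))
```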

SLIDE 44

Grouping Transactions

• Objective
  – Construct tx groups that fit together in memory
• Bin packing
  – Item: tx memory needs
  – Bin: memory of a replica
  – Heuristic: Best Fit Decreasing
• Allocate replicas to tx groups
  – Adjust for group loads
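A small sketch of the packing step using the Best Fit Decreasing heuristic named on the slide. The memory figures are made up, chosen so the packing reproduces the groups shown on the "MALB in Action" slides below; items are transaction types with their estimated memory needs, and each bin is one replica's memory.

```python
# Sketch of grouping transactions with Best Fit Decreasing (the
# heuristic named on the slide; sizes here are assumed).

REPLICA_MEM_MB = 500

def best_fit_decreasing(tx_memory, bin_size=REPLICA_MEM_MB):
    bins = []   # each bin: {"free": remaining MB, "txs": [names]}
    for tx, need in sorted(tx_memory.items(), key=lambda kv: -kv[1]):
        # Best fit: the bin with the least remaining space that fits.
        candidates = [b for b in bins if b["free"] >= need]
        if candidates:
            b = min(candidates, key=lambda b: b["free"])
        else:
            b = {"free": bin_size, "txs": []}
            bins.append(b)
        b["free"] -= need
        b["txs"].append(tx)
    return bins

groups = best_fit_decreasing(
    {"A": 450, "B": 300, "C": 180, "D": 250, "E": 150, "F": 90})
for g in groups:
    print(g["txs"], "uses", REPLICA_MEM_MB - g["free"], "MB")
# With these made-up sizes BFD recovers the slide's grouping:
# {A}, {B, C}, {D, E, F}. Replicas are then allocated to groups
# in proportion to each group's load.
```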

SLIDE 45

MALB in Action

[Figure: transaction types A, B, C, D, E, F arrive at MALB]

SLIDE 46

MALB in Action

[Figure: MALB estimates the memory needs of A, B, C, D, E, F]

SLIDE 47

MALB in Action

[Figure: MALB packs the transactions into groups that fit in a replica's memory: Group A, Group B C, Group D E F]

SLIDE 48

MALB in Action

[Figure: each group is assigned to a replica; each replica's memory holds only its group's working set, loaded from its local disk]

SLIDE 49

MALB Summary

• Objective
  – Optimize for in-memory execution
• Method
  – Estimate tx memory needs
  – Construct tx groups
  – Allocate replicas to tx groups

SLIDE 50

Experimental Evaluation

• Implementation
  – No change in consistency
  – Still middleware
• Compare
  – United: efficient baseline system
  – MALB: exploits working-set information
• Same environment
  – Linux cluster running PostgreSQL
  – Workload: TPC-W Ordering (50% update txs)

SLIDE 51

MALB Doubles Throughput

TPC-W Ordering, 16 replicas

[Chart: throughput (TPS): Single 1x, Base 7x, United 12x, MALB 25x; MALB is 105% over United]

SLIDE 52

MALB Doubles Throughput

[Charts: the same TPS chart, plus read I/O normalized to United (scale 0.0–1.0): MALB performs far less read I/O than United]

SLIDE 53

Big Gains with MALB

[Table: MALB's throughput gain over United across combinations of memory size and DB size (big vs. small); reported gains include 0%, 4%, 12%, 29%, 45%, 48%, 75%, 105%, and 182%]

SLIDE 54

Big Gains with MALB

[Table: same as the previous slide, annotated: gains are small at the extremes where the workload runs entirely from memory or entirely from disk, and largest in between]

SLIDE 55

Roadmap

• Commit updates in order
• Load balancing
• Update propagation (this section)

[Figure: replicated system with load balancer]

SLIDE 56

Key Idea

• Traditional:
  – Propagate updates everywhere
• Update filtering:
  – Propagate updates only to where they are needed
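A minimal sketch of the filtering decision, assuming (as in the example that follows) that the middleware knows which tables each replica's transaction group uses; the table and replica names are hypothetical.

```python
# Sketch of update filtering (assumed design): propagate each writeset
# only to replicas whose transaction group uses the updated tables.

REPLICA_TABLES = {             # from MALB's group-to-replica allocation
    "replica1": {"t1", "t2"},  # serves group A
    "replica2": {"t2", "t3"},  # serves group B
}

def destinations(writeset_tables):
    """Replicas that must receive a writeset touching these tables."""
    return [r for r, tables in REPLICA_TABLES.items()
            if tables & writeset_tables]

print(destinations({"t1"}))   # ['replica1'] -- replica2 is skipped
print(destinations({"t2"}))   # both replicas need updates to table 2
```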

SLIDE 57

Update Filtering Example

[Figure: the MALB example again (tx A on tables 1 and 2, tx B on tables 2 and 3, stream A, B, A, B), now with update filtering: MALB + UF]

SLIDE 58

Update Filtering Example

[Figure: Replica 1 serves Group A and caches tables 1 and 2; Replica 2 serves Group B and caches tables 2 and 3]

SLIDE 59

Update Filtering Example

[Figure: an update to table 1 arrives at the middleware]

SLIDE 60

Update Filtering Example

[Figure: the update to table 1 is propagated to Replica 1 only]

SLIDE 61

Update Filtering Example

[Figure: Replica 2, which does not serve table 1, skips the update entirely]

SLIDE 62

Update Filtering Example

[Figure: next, an update to table 3 arrives]

SLIDE 63

Update Filtering Example

[Figure: the update to table 3 is propagated to Replica 2 only]

SLIDE 64

Update Filtering Example

[Figure: each replica applies only the updates to the tables its group uses]

SLIDE 65

Update Filtering in Action

[Figure: the update filter (UF) sits in front of several replicas]

SLIDE 66

Update Filtering in Action

[Figure: an update to the red table arrives at UF]

SLIDE 67

Update Filtering in Action

[Figure: an update to the green table arrives as well]

SLIDE 68

Update Filtering in Action

[Figure: UF forwards each update only to the replicas that hold the corresponding table]

SLIDE 69

Update Filtering in Action

[Figure: replicas apply only the updates they receive; the rest are filtered out]

SLIDE 70

MALB+UF Triples Throughput

TPC-W Ordering, 16 replicas

[Chart: throughput (TPS): Single 1x, Base 7x, United 12x, MALB 25x, MALB+UF 37x; UF adds 49% over MALB]

SLIDE 71

MALB+UF Triples Throughput

[Charts: the same TPS chart, plus propagated updates per replica: 15 under MALB vs. 7 under MALB+UF]

SLIDE 72

Filtering Opportunities

[Charts: ratio of MALB+UF to MALB: 1.49 for the 50%-update Ordering mix vs. 1.02 for the 5%-update Browsing mix; filtering has far more opportunity when the workload is update-heavy]

SLIDE 73

Conclusions

• 1. Commit updates in order
  – Problem: serial synchronous disk writes
  – Solution: unite ordering and durability
• 2. Load balancing
  – Problem: optimizing for equal load causes memory contention
  – Solution: MALB, optimize for in-memory execution
• 3. Update propagation
  – Problem: updates are propagated everywhere
  – Solution: update filtering, propagate only where needed