slide-1
SLIDE 1

Database Systems Do Not Scale to 1000 CPU Cores

And Other Tales of the Macabre

@andy_pavlo

slide-2
SLIDE 2

Three million children die per year due to poor nutrition.

Source: http://www.wfp.org/hunger/stats
slide-3
SLIDE 3

Three days after you die, stomach enzymes start to digest you.

Source: http://discovermagazine.com/2006/sep/10-20thingsdeath
slide-4
SLIDE 4

Everyone in this room will be dead in 65 years.

Source: http://discovermagazine.com/2006/sep/10-20thingsdeath
slide-5
SLIDE 5

Database systems cannot scale to 1000 CPU cores.

Source: http://www.vldb.org/pvldb/vol8/p209-yu.pdf
slide-6
SLIDE 6

[Figure: YCSB plot. DBx1000 on Graphite Simulator; Write-Intensive Workload; High Contention]

slide-7
SLIDE 7

Why This Matters

  • The era of single-core CPU speed-up is over.
  • Database applications are getting larger and more complex.
  • Existing DBMSs are unable to take advantage of future “many-core” CPU architectures.

slide-8
SLIDE 8

Today’s Talk

  • Transaction Processing
  • Experimental Platform
  • Evaluation & Discussion
  • The (Dire) Future

slide-9
SLIDE 9

Transaction Processing

slide-10
SLIDE 10

On-line Transaction Processing

  • Fast operations that ingest new data and then update state using ACID transactions.
  • Transaction Example:
    – Send $50 from user A to user B
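
The transfer example can be sketched as a single atomic transaction. This is only an illustration using Python's built-in sqlite3 module; the `accounts` table, the starting balances, and the `transfer` helper are assumptions made for the sketch, not part of the talk.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one atomic unit of work."""
    # `with conn` commits if the block succeeds and rolls back on an exception,
    # so the debit and the credit are applied together or not at all.
    with conn:
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if balance < amount:
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
        )
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
        )

# Hypothetical data: user A starts with $100, user B with $20.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 20)])
conn.commit()

transfer(conn, "A", "B", 50)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```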

slide-11
SLIDE 11

Concurrency Control

  • Allows transactions to access a database in a multi-programmed fashion while preserving the illusion that each of them is executing alone on a dedicated system.
  • Provides Atomicity + Isolation in ACID

slide-12
SLIDE 12

Concurrency Control

  • Two-Phase Locking (Pessimistic)
  • Timestamp Ordering (Optimistic)

slide-13
SLIDE 13

Two-Phase Locking (2PL)

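[Figure: Transaction #1 timeline from BEGIN to COMMIT. Locks on A and B are acquired in the Growing Phase (LOCK(A), LOCK(B)), the transaction performs READ(A) and WRITE(B), and the locks are released in the Shrinking Phase (UNLOCK(A), UNLOCK(B)).]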
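
The two-phase rule can be sketched as a small wrapper that refuses to acquire a lock once any lock has been released. This is a minimal single-threaded illustration; the `TwoPhaseTxn` class and the in-memory `db` dictionary are hypothetical names, not DBx1000 code.

```python
import threading

class TwoPhaseTxn:
    """Tracks the growing/shrinking phases of one transaction (illustrative only)."""

    def __init__(self, lock_table):
        self.lock_table = lock_table  # record name -> threading.Lock
        self.held = []
        self.shrinking = False        # becomes True at the first unlock

    def lock(self, record):
        if self.shrinking:
            raise RuntimeError("2PL violation: lock requested in shrinking phase")
        self.lock_table[record].acquire()
        self.held.append(record)

    def unlock(self, record):
        self.shrinking = True
        self.held.remove(record)
        self.lock_table[record].release()

locks = {"A": threading.Lock(), "B": threading.Lock()}
db = {"A": 1, "B": 2}

txn = TwoPhaseTxn(locks)
txn.lock("A")            # growing phase
value = db["A"]          # READ(A)
txn.lock("B")
db["B"] = value + 10     # WRITE(B)
txn.unlock("A")          # shrinking phase begins
txn.unlock("B")

try:
    txn.lock("A")        # not allowed after the first unlock
    violated = False
except RuntimeError:
    violated = True
```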

slide-14
SLIDE 14

Two-Phase Locking (2PL)

[Figure: Transaction #1 and Transaction #2 shown side by side, each running from BEGIN to COMMIT. Transaction #1 locks A and then B (READ(A), WRITE(B)); Transaction #2 locks B and then A (WRITE(B), WRITE(A)). Each transaction unlocks both records before COMMIT.]

slide-15
SLIDE 15

Two-Phase Locking (2PL)

[Figure: Transaction #1 and Transaction #2 shown side by side, each running from BEGIN to COMMIT. Transaction #1 locks A and then B (READ(A), WRITE(B)); Transaction #2 locks B and then A (WRITE(B), WRITE(A)). Each transaction unlocks both records before COMMIT.]

slide-16
SLIDE 16

Two-Phase Locking (2PL)

  • Deadlock Detection (DL_DETECT)
  • Non-waiting Deadlock Prevention (NO_WAIT)
  • Wait-and-Die Deadlock Prevention (WAIT_DIE)
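
The two prevention variants differ only in what a transaction does when the lock it wants is already held. A rough sketch of the decision rules follows, assuming that a smaller timestamp means an older transaction; the function names and return strings are made up for illustration.

```python
def no_wait(lock_is_held):
    """NO_WAIT: never wait; abort as soon as a lock request is denied."""
    return "abort" if lock_is_held else "grant"

def wait_die(requester_ts, holder_ts):
    """WAIT_DIE: an older requester waits, a younger requester aborts ("dies").

    holder_ts is None when the lock is free. Smaller timestamp = older
    transaction (an assumption of this sketch).
    """
    if holder_ts is None:
        return "grant"
    return "wait" if requester_ts < holder_ts else "abort"

decisions = [
    no_wait(lock_is_held=True),
    wait_die(requester_ts=5, holder_ts=10),   # requester is older
    wait_die(requester_ts=10, holder_ts=5),   # requester is younger
]
```

Because waiting is only ever allowed from older to younger transactions, a cycle of waiting transactions cannot form under WAIT_DIE.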

slide-17
SLIDE 17

Timestamp Ordering (T/O)

[Figure: table of records A and B with Read Timestamp and Write Timestamp columns, next to Transaction #1 (BEGIN, READ(A), WRITE(B), WRITE(A), COMMIT). Timestamp values shown: 10000 and 10001.]

slide-18
SLIDE 18

Timestamp Ordering (T/O)

[Figure: table of records A and B with Read Timestamp and Write Timestamp columns, next to Transaction #1 (BEGIN, READ(A), WRITE(B), WRITE(A), COMMIT). Timestamp values shown: 10000 and 10001.]

slide-19
SLIDE 19

Timestamp Ordering (T/O)

[Figure: table of records A and B with Read Timestamp and Write Timestamp columns, next to Transaction #1 (BEGIN, READ(A), WRITE(B), WRITE(A), COMMIT). Timestamp values shown: 10000, 10001, and 10005.]

slide-20
SLIDE 20

Timestamp Ordering (T/O)

  • Basic T/O (TIMESTAMP)
  • Multi-Version Concurrency Control (MVCC)
  • Optimistic Concurrency Control (OCC)
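
The basic T/O check can be sketched with per-record read and write timestamps. The scenario below reuses the timestamp values visible on the preceding slides (10000, 10001, 10005), but the exact sequence of events is an assumption, and `Record`, `read`, `write`, and `Abort` are hypothetical names rather than DBx1000 code.

```python
class Abort(Exception):
    """Raised when an operation arrives too late in timestamp order."""

class Record:
    def __init__(self, value, ts):
        self.value = value
        self.read_ts = ts    # largest timestamp that has read this record
        self.write_ts = ts   # timestamp of the last writer

def read(rec, ts):
    if ts < rec.write_ts:                # a newer transaction already wrote it
        raise Abort
    rec.read_ts = max(rec.read_ts, ts)
    return rec.value

def write(rec, ts, value):
    if ts < rec.read_ts or ts < rec.write_ts:   # a newer transaction read or wrote it
        raise Abort
    rec.value, rec.write_ts = value, ts

a, b = Record(0, 10000), Record(0, 10000)

read(a, 10001)           # Transaction #1 (timestamp 10001) reads A
write(b, 10001, 1)       # Transaction #1 writes B
write(a, 10005, 2)       # assumed: another transaction (10005) writes A
try:
    write(a, 10001, 3)   # Transaction #1 is now too old to write A
    aborted = False
except Abort:
    aborted = True
```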

slide-21
SLIDE 21

Concurrency Control Schemes

  • DL_DETECT: 2PL w/ Deadlock Detection
  • NO_WAIT: 2PL w/ Non-waiting Prevention
  • WAIT_DIE: 2PL w/ Wait-and-Die Prevention
  • TIMESTAMP: Basic T/O Algorithm
  • MVCC: Multi-Version T/O
  • OCC: Optimistic Concurrency Control

slide-22
SLIDE 22

Evaluation Testbed

slide-23
SLIDE 23

No DBMS supports multiple CC schemes. No CPU supports 1000 cores.

slide-24
SLIDE 24

Experimental Platform

[Figure: architecture diagram. Labels: DBx1000, Worker Threads, Graphite Simulator, Core, L1, L2, Compute Cluster.]

slide-25
SLIDE 25

Target Workload

  • Yahoo! Cloud Serving Benchmark (YCSB)
    – 20 million tuples
    – Each tuple is 1KB (total database is ~20GB)
  • Each transaction reads/modifies 16 tuples.
  • Varying skew in transaction access patterns.
  • Serializable isolation level.
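
The workload numbers above can be checked with a little arithmetic, and the "varying skew" can be sketched with a Zipf-like key sampler. This is only an illustration: the `theta` parameter, the sampler, and the small key space used here are assumptions, not the benchmark's actual generator.

```python
import random
from itertools import accumulate

NUM_TUPLES = 20_000_000
TUPLE_BYTES = 1_000                           # "1KB" taken as 1000 bytes
total_gb = NUM_TUPLES * TUPLE_BYTES / 1e9     # matches the ~20GB on the slide

def make_skewed_sampler(n, theta, rng):
    """Return a function that samples keys in [0, n) with weight 1 / rank**theta.

    theta = 0 gives uniform access; a larger theta concentrates accesses
    on a few hot keys (higher contention).
    """
    cum = list(accumulate(1.0 / (rank ** theta) for rank in range(1, n + 1)))
    def sample(k):
        # Duplicate keys within one transaction are possible in this sketch.
        return rng.choices(range(n), cum_weights=cum, k=k)
    return sample

rng = random.Random(42)
sample = make_skewed_sampler(n=1000, theta=0.9, rng=rng)   # small key space for the demo
txn_keys = sample(16)                         # one transaction touches 16 tuples
```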

slide-26
SLIDE 26

Evaluation

slide-27
SLIDE 27

[Figure: YCSB plot. DBx1000 on Graphite Simulator; Read-Only Workload; No Contention]

slide-28
SLIDE 28

[Figure: YCSB plot. DBx1000 on Graphite Simulator; Write-Intensive Workload; Medium Contention]

slide-29
SLIDE 29

[Figure: YCSB plot. DBx1000 on Graphite Simulator; Write-Intensive Workload; High Contention]

slide-30
SLIDE 30

[Figure: YCSB plot, Time % Breakdown (512 Cores). DBx1000 on Graphite Simulator; Write-Intensive Workload; High Contention]

slide-31
SLIDE 31

Bottlenecks

  • Lock Thrashing

– DL_DETECT, WAIT_DIE

  • Timestamp Allocation

– All T/O algorithms + WAIT_DIE

  • Memory Allocations

– OCC + MVCC

slide-32
SLIDE 32

Bottlenecks

  • Lock Thrashing

– DL_DETECT, WAIT_DIE

  • Timestamp Allocation

– All T/O algorithms + WAIT_DIE

  • Memory Allocations

– OCC + MVCC

slide-33
SLIDE 33

Lock Thrashing

  • Each transaction waits longer to acquire locks, causing other transactions to wait longer to acquire locks.
  • The best case is a workload where transactions acquire locks in primary key order.
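
Acquiring locks in primary key order can be sketched as sorting the keys before locking. With a single global order, two transactions can never each hold a lock the other is waiting for, so no deadlock arises. The helper and the two-record lock table below are hypothetical.

```python
import threading

def run_with_ordered_locks(lock_table, keys, body):
    """Acquire the locks for `keys` in sorted (primary key) order, then run `body`."""
    ordered = sorted(set(keys))
    for key in ordered:
        lock_table[key].acquire()
    try:
        return body()
    finally:
        for key in reversed(ordered):
            lock_table[key].release()

locks = {key: threading.Lock() for key in ("A", "B")}
counter = {"A": 0, "B": 0}

def bump():
    counter["A"] += 1
    counter["B"] += 1

def worker(keys):
    for _ in range(1000):
        run_with_ordered_locks(locks, keys, bump)

# The two workers name the records in opposite orders, which could deadlock
# if each locked in its own order; sorting removes that possibility.
threads = [threading.Thread(target=worker, args=(keys,))
           for keys in (["A", "B"], ["B", "A"])]
for t in threads:
    t.start()
for t in threads:
    t.join()
```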

slide-34
SLIDE 34

[Figure: YCSB plot. DBx1000 with 2PL DL_DETECT; Write-Intensive Workload; No Deadlocks (Ordered Lock Acquisition)]

slide-35
SLIDE 35

[Figure: YCSB plot. DBx1000 with 2PL DL_DETECT; Write-Intensive Workload; No Deadlocks (Ordered Lock Acquisition)]

slide-36
SLIDE 36

Potential Solutions

slide-37
SLIDE 37

Hardware/Software Co-Design

  • Bottlenecks can only be overcome through new hardware-level optimizations:
    – Hardware-accelerated Lock Sharing
    – Asynchronous Memory Copying
    – Decentralized Memory Controller

slide-38
SLIDE 38

Next Steps

  • Evaluating other main bottlenecks in DBMSs:
    – Logging + Recovery
    – Indexes
  • Extend DBx1000 to support distributed concurrency control algorithms.

slide-39
SLIDE 39

Andy Pavlo, Mike Stonebraker, Srini Devadas, Xiangyao Yu

http://cmudb.io/1000cores

slide-40
SLIDE 40

END

@andy_pavlo