UNIST UNIST Hanyang Univ. UNIST/SKKU Fast but Asymmetric - - PowerPoint PPT Presentation




slide-1
SLIDE 1

Deukyeon Hwang (UNIST), Wook-Hee Kim (UNIST), Youjip Won (Hanyang Univ.), Beomseok Nam (UNIST/SKKU)

slide-2
SLIDE 2

▪ Fast but Asymmetric Access Latency
▪ Non-Volatility
▪ Byte-Addressability
▪ Large Capacity

slide-3
SLIDE 3

[Figure: CPU Caches (Volatile) and Persistent Memory (Non-Volatile). A cache line is flushed (FLUSH) while keys are being shifted, and key 40 is lost (LOST 40!).]

slide-4
SLIDE 4

Inserting 25 into a node

(0) 10 20 30 40
(1) 10 20 30 40 40
(2) 10 20 30 30 40
(3) 10 20 25 30 40

Partially updated tree node is inconsistent → Append-Only Update

slide-5
SLIDE 5

Node Split

Before: Node A: keys 10 20 30 40 60, pointers P1 P2 P3 P4 P6 ʌ
After:  Node A: keys 10 20 30, pointers P1 P2 P3 ʌ
        Node B: keys 40 60, pointers P4 P6 ʌ

Logging → Selective Persistence (Internal node in DRAM)

slide-6
SLIDE 6

▪ Append-Only

  • Unsorted keys

▪ Selective Persistence

  • Internal node → DRAM
  • Internal nodes have to be reconstructed from leaf nodes after failures
  • Logging for leaf nodes

▪ Previous solutions

  • NV-Tree [FAST’15]: Append-Only leaf update + Selective Persistence
  • wB+-Tree [VLDB’15]: Append-Only node update + bitmap/slot array metadata
  • FP-Tree [SIGMOD’16]: Append-Only leaf update + fingerprints + Selective Persistence

slide-7
SLIDE 7

▪ Selective Persistence (DRAM + PM)
▪ Append-Only (Unsorted keys)
▪ Lock-Free Search
▪ Failure-Atomic ShifT (FAST)
▪ Failure-Atomic In-place Rebalancing (FAIR)

slide-8
SLIDE 8

▪ Modern processors reorder instructions to utilize the memory bandwidth
▪ Memory ordering in x86 and ARM
▪ x86 guarantees Total Store Ordering (TSO)
▪ Dependent instructions are not reordered

                        x86   ARM
stores-after-stores     Y     N
stores-after-loads      N     N
loads-after-stores      N     N
loads-after-loads       N     N
Inst. w/ dependency     Y     Y

slide-9
SLIDE 9

▪ Pointers in B+-Tree store unique memory addresses
▪ 8-byte pointer can be atomically updated
▪ Transient inconsistency

  • In-flight state partially updated by a write transaction

Read transactions detect transient inconsistency between duplicate pointers

Keys: 10 20 30 40 40
Pointers: P1 P2 P3 P4 P5 P5
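
The duplicate-pointer rule on this slide can be sketched as a small Python model. This is only an illustration: the list-based layout (key i sits between pointers i and i+1, `None` stands for the null pointer) and the name `visible_keys` are assumptions made for the sketch, not code from the presentation.

```python
def visible_keys(keys, ptrs):
    """Return the keys a reader accepts.

    Assumed layout: keys[i] sits between ptrs[i] and ptrs[i + 1].
    A key flanked by two identical pointers is treated as in-flight
    (transiently inconsistent) and is skipped.
    """
    accepted = []
    i = 0
    while i < len(keys) and ptrs[i + 1] is not None:
        if ptrs[i] != ptrs[i + 1]:
            accepted.append(keys[i])
        i += 1
    return accepted


# State shown on the slide: the second 40 sits between P5 and P5.
print(visible_keys([10, 20, 30, 40, 40],
                   ["P1", "P2", "P3", "P4", "P5", "P5"]))
```

Because the reader compares only two adjacent 8-byte pointers, it needs no lock and no extra metadata to decide whether a key is valid.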

slide-10
SLIDE 10

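[Figure: pointer P5 is duplicated first (keys 10 20 30 40, pointers P1 P2 P3 P4 P5 P5), then key 40 is copied (keys 10 20 30 40 40, pointers P1 P2 P3 P4 P5 P5). Labels: mfence(); mfence(); TSO.]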

slide-11
SLIDE 11

Insert (25, P6) into a node using FAST

Keys: 10 20 30 40 g g
Pointers: P1 P2 P3 P4 P5 ʌ ʌ

g: Garbage, ʌ: Null

Read transactions can succeed in finding a key even if the system crashes at any step.

slide-12
SLIDE 12

Insert (25, P6) into a node using FAST

Keys: 10 20 30 40 g g
Pointers: P1 P2 P3 P4 P5 P5 ʌ

slide-13
SLIDE 13

Insert (25, P6) into a node using FAST

Keys: 10 20 30 40 40 g
Pointers: P1 P2 P3 P4 P5 P5 ʌ

slide-14
SLIDE 14

Insert (25, P6) into a node using FAST

Keys: 10 20 30 40 40 g
Pointers: P1 P2 P3 P4 P5 P5 ʌ

slide-15
SLIDE 15

Insert (25, P6) into a node using FAST

Keys: 10 20 30 40 40 g
Pointers: P1 P2 P3 P4 P5 P5 ʌ

Read transaction: Key 40 between duplicate pointers is ignored!

slide-16
SLIDE 16

Insert (25, P6) into a node using FAST

Keys: 10 20 30 40 40 g
Pointers: P1 P2 P3 P4 P4 P5 ʌ

Shifting P4 invalidates the left 40

slide-17
SLIDE 17

Insert (25, P6) into a node using FAST

Keys: 10 20 30 30 40 g
Pointers: P1 P2 P3 P4 P4 P5 ʌ

slide-18
SLIDE 18

Insert (25, P6) into a node using FAST

Keys: 10 20 30 30 40 g
Pointers: P1 P2 P3 P3 P4 P5 ʌ

slide-19
SLIDE 19

Insert (25, P6) into a node using FAST

Keys: 10 20 25 30 40 g
Pointers: P1 P2 P3 P3 P4 P5 ʌ

slide-20
SLIDE 20

Insert (25, P6) into a node using FAST

Keys: 10 20 25 30 40 g
Pointers: P1 P2 P3 P6 P4 P5 ʌ

Storing P6 validates 25
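
The store order walked through on the FAST slides can be replayed with a short Python simulation. It is a sketch under assumptions, not the authors' implementation: Python lists stand in for the 8-byte key and pointer slots, `None` stands for both garbage and null, and the names `lookup` and `fast_insert` are hypothetical. Fences and cache-line flushes are left out.

```python
def lookup(keys, ptrs, target):
    """Left-to-right search; a key counts only if its two neighbouring pointers differ."""
    i = 0
    while i < len(keys) and ptrs[i + 1] is not None:
        if ptrs[i] != ptrs[i + 1] and keys[i] == target:
            return ptrs[i + 1]
        i += 1
    return None


def fast_insert(keys, ptrs, n, key, ptr):
    """Insert (key, ptr) into a node holding n sorted keys, one store at a time.

    Yields after every store so the caller can inspect each intermediate state.
    The node needs one free key slot and a null pointer after the new last entry.
    """
    pos = sum(1 for k in keys[:n] if k < key)   # slot the new key will occupy
    for i in range(n - 1, pos - 1, -1):
        ptrs[i + 2] = ptrs[i + 1]   # duplicate pointer: key slot i+1 becomes invisible
        yield
        keys[i + 1] = keys[i]       # copy the key into the invisible slot
        yield
    ptrs[pos + 1] = ptrs[pos]       # invalidate slot pos before overwriting its key
    yield
    keys[pos] = key                 # new key is still invisible here
    yield
    ptrs[pos + 1] = ptr             # this single store makes the new key visible
    yield


keys = [10, 20, 30, 40, None, None]
ptrs = ["P1", "P2", "P3", "P4", "P5", None, None]
for _ in fast_insert(keys, ptrs, 4, 25, "P6"):
    # Every existing key stays reachable after each individual store.
    assert [lookup(keys, ptrs, k) for k in (10, 20, 30, 40)] == ["P2", "P3", "P4", "P5"]
print(keys, ptrs)
```

The pointer is always written before the key of the same slot, so a slot's key only changes while that slot is hidden behind duplicate pointers.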

slide-21
SLIDE 21

▪ It is necessary to call clflush at the boundary of a cache line

Before (Cache Line 1 | Cache Line 2):
Keys: 10 20 30 40 g g
Pointers: P1 P2 P3 P4 P5 ʌ ʌ

After (Cache Line 1 | Cache Line 2):
Keys: 10 20 30 30 40 g
Pointers: P1 P2 P3 P3 P4 P5 ʌ

mfence(); clflush(Cache Line 2); mfence()
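
A rough way to see where the flushes fall is to compute, for a right-to-left sequence of slot stores, the points at which the shift leaves one cache line and enters the next. The sketch below assumes 64-byte cache lines, 8-byte slots and a cache-line-aligned node. The function name `shift_schedule` is hypothetical, and `"flush"` stands for the `mfence(); clflush(line); mfence()` sequence named on the slide.

```python
CACHE_LINE = 64   # bytes; assumed line size
SLOT = 8          # bytes; one key or pointer slot


def shift_schedule(first_slot, last_slot):
    """List the stores of a shift that writes last_slot down to first_slot,
    with a flush of each cache line as soon as the shift leaves it."""
    ops = []
    prev_line = None
    for slot in range(last_slot, first_slot - 1, -1):
        line = slot * SLOT // CACHE_LINE
        if prev_line is not None and line != prev_line:
            ops.append(("flush", prev_line))   # crossing a cache-line boundary
        ops.append(("store", slot))
        prev_line = line
    if prev_line is not None:
        ops.append(("flush", prev_line))       # persist the last line touched
    return ops


print(shift_schedule(6, 9))
```

Stores that stay inside one cache line need no flush between them, which is why the number of flushes grows with the number of lines touched rather than with the number of keys shifted.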

slide-22
SLIDE 22

▪ Let’s avoid expensive logging by making read transactions aware of rebalancing operations

[Figure: nodes with keys 10 20 30 40 70 80 90]

▪ Blink-Tree (B-link tree)

slide-23
SLIDE 23

FAIR split a node

Node A: keys 10 20 30 40 60, pointers P1 P2 P3 P4 P6 ʌ
Node B: keys 40 60, pointers P4 P6 ʌ

A read transaction can detect transient inconsistency if keys are out of order

slide-24
SLIDE 24

FAIR split a node

Node A: keys 10 20 30, pointers P1 P2 P3 ʌ
Node B: keys 40 60, pointers P4 P6 ʌ

Setting NULL pointer validates Node B. Node A and Node B are virtually a single node.

slide-25
SLIDE 25

FAIR split a node

Node A: keys 10 20 30, pointers P1 P2 P3 ʌ
Node B: keys 40 60, pointers P4 P6 ʌ

Migrated keys can be accessed via sibling pointer
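
The split steps on the FAIR slides can be modelled in a few lines of Python. This is a sketch under assumptions: the `Node` class, the three-step split and the reader's out-of-order check are an interpretation of the slides, all names are hypothetical, and a real implementation would flush and fence after each step.

```python
class Node:
    def __init__(self, entries):
        self.entries = list(entries) + [None]   # None models the null terminator
        self.sibling = None


def visible(node):
    """Entries up to (not including) the first null terminator."""
    out = []
    for entry in node.entries:
        if entry is None:
            break
        out.append(entry)
    return out


def search(node, key):
    """Scan a node and its siblings as one virtual node."""
    last = None
    while node is not None:
        for k, value in visible(node):
            if last is not None and k <= last:
                return None   # out of order: sibling only holds copies already scanned
            if k == key:
                return value
            if k > key:
                return None
            last = k
        node = node.sibling
    return None


def fair_split(node):
    """Split `node` in three steps, yielding after each one."""
    live = visible(node)
    mid = (len(live) + 1) // 2
    right = Node(live[mid:])        # 1. copy the upper half into a new node
    right.sibling = node.sibling
    yield "copied"
    node.sibling = right            # 2. link it; both nodes now hold the upper half
    yield "linked"
    node.entries[mid] = None        # 3. a null terminator truncates the left node
    yield "truncated"


a = Node([(10, "P1"), (20, "P2"), (30, "P3"), (40, "P4"), (60, "P6")])
for step in fair_split(a):
    print(step, [search(a, k) for k in (10, 20, 30, 40, 60)])
```

At every step each key is reachable from Node A, either directly or through the sibling pointer, so no log is needed to recover from a crash in the middle of the split.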

slide-26
SLIDE 26

FAIR split a node

Node A: keys 10 20 30, pointers P1 P2 P3 ʌ
Node B: keys 40 50 60, pointers P4 P5 P6 ʌ

slide-27
SLIDE 27

Insert a key into the parent node using FAST after FAIR split

Node R (root): keys 10 70 70, pointers C2 C3 C3
Node A: 10 20 30
Node B: 40 50 60
Node C: 70 80 90

slide-28
SLIDE 28

Insert a key into the parent node using FAST after FAIR split

Node R (root): keys 10 70 70, pointers C2 C2 C3
Node A: 10 20 30
Node B: 40 50 60
Node C: 70 80 90

Node B can be accessed from Node A

slide-29
SLIDE 29

Insert a key into the parent node using FAST after FAIR split

Node R (root): keys 10 70 70, pointers C2 C2 C3
Node A: 10 20 30
Node B: 40 50 60
Node C: 70 80 90

➢ Searching the key 50 from the root after a system crash: Node B can be accessed from Node A

(Highlighted in the figure: keys accessed by the read transaction)

slide-30
SLIDE 30

Insert a key into the parent node using FAST after FAIR split

Node R (root): keys 10 40 70, pointers C2 C4 C3
Node A: 10 20 30
Node B: 40 50 60
Node C: 70 80 90

FAST inserting makes Node B visible atomically

slide-31
SLIDE 31

Read transactions can tolerate any inconsistency caused by write transactions
→ Read transactions can access the transiently inconsistent tree node being modified by a write transaction
→ Lock-Free Search

slide-32
SLIDE 32

Read transaction Write transaction

[Example 1] Searching 30 while inserting (15, P6)

Keys: 10 20 30 40 g g
Pointers: P1 P2 P3 P4 P5 ʌ ʌ

Read transaction: read →   Write transaction: shift →

slide-33
SLIDE 33

[Example 1] Searching 30 while inserting (15, P6)

Keys: 10 20 30 40 g g
Pointers: P1 P2 P3 P4 P5 P5 ʌ

Read transaction: read →   Write transaction: shift →

slide-34
SLIDE 34

[Example 1] Searching 30 while inserting (15, P6)

Keys: 10 20 30 40 40 g
Pointers: P1 P2 P3 P4 P5 P5 ʌ

Read transaction: read →   Write transaction: shift →

slide-35
SLIDE 35

[Example 1] Searching 30 while inserting (15, P6)

Keys: 10 20 30 40 40 g
Pointers: P1 P2 P3 P4 P4 P5 ʌ

Read transaction: read →   Write transaction: shift →

slide-36
SLIDE 36

[Example 1] Searching 30 while inserting (15, P6)

Keys: 10 20 30 30 40 g
Pointers: P1 P2 P3 P4 P4 P5 ʌ

Read transaction: read →   Write transaction: shift →

slide-37
SLIDE 37

[Example 1] Searching 30 while inserting (15, P6)

Keys: 10 20 30 30 40 g
Pointers: P1 P2 P3 P3 P4 P5 ʌ

Read transaction: read →   Write transaction: shift →

slide-38
SLIDE 38

[Example 1] Searching 30 while inserting (15, P6)

Keys: 10 20 20 30 40 g
Pointers: P1 P2 P3 P3 P4 P5 ʌ

Read transaction: read →   Write transaction: shift →

slide-39
SLIDE 39

[Example 1] Searching 30 while inserting (15, P6)

Keys: 10 20 20 30 40 g
Pointers: P1 P2 P2 P3 P4 P5 ʌ

Read transaction: read →   Write transaction: shift →

slide-40
SLIDE 40

[Example 1] Searching 30 while inserting (15, P6)

Keys: 10 20 20 30 40 g
Pointers: P1 P2 P2 P3 P4 P5 ʌ

Read transaction: read →   Write transaction: shift →

FOUND!

slide-41
SLIDE 41

[Example 2] Searching 30 while deleting (20, P2)

Keys: 10 20 30 40 g g
Pointers: P1 P2 P3 P4 P5 ʌ ʌ

Read transaction: read →   Write transaction: ← shift

slide-42
SLIDE 42

[Example 2] Searching 30 while deleting (20, P2)

Keys: 10 20 30 40 g g
Pointers: P1 P3 P3 P4 P5 ʌ ʌ

Read transaction: read →   Write transaction: ← shift

slide-43
SLIDE 43

[Example 2] Searching 30 while deleting (20, P2)

Keys: 10 30 30 40 g g
Pointers: P1 P3 P3 P4 P5 ʌ ʌ

Read transaction: read →   Write transaction: ← shift

slide-44
SLIDE 44

[Example 2] Searching 30 while deleting (20, P2)

Keys: 10 30 30 40 g g
Pointers: P1 P3 P4 P4 P5 ʌ ʌ

Read transaction: read →   Write transaction: ← shift

slide-45
SLIDE 45

[Example 2] Searching 30 while deleting (20, P2)

Keys: 10 30 40 40 g g
Pointers: P1 P3 P4 P4 P5 ʌ ʌ

Read transaction: read →   Write transaction: ← shift

slide-46
SLIDE 46

[Example 2] Searching 30 while deleting (20, P2)

Keys: 10 30 40 40 g g
Pointers: P1 P3 P4 P5 P5 ʌ ʌ

Read transaction: read →   Write transaction: ← shift

slide-47
SLIDE 47

[Example 2] Searching 30 while deleting (20, P2)

Keys: 10 30 40 40 g g
Pointers: P1 P3 P4 P5 P5 ʌ ʌ

Read transaction: read →   Write transaction: ← shift

30 NOT FOUND: the read transaction cannot find the key 30 because the shift moves it in the opposite direction of the scan.

slide-48
SLIDE 48

▪ Direction flag:

  • Even Number
    – Insertion shifts to the right.
    – Search must scan from Left to Right
  • Odd Number
    – Deletion shifts to the left.
    – Search must scan from Right to Left

Keys: 10 20 30 40 g g
Pointers: P1 P2 P3 P4 P5 ʌ ʌ

counter: 2   Search 40 (read →)   Insert 25 (shift →)

slide-49
SLIDE 49

▪ Direction flag:

  • Even Number
    – Insertion shifts to the right.
    – Search must scan from Left to Right
  • Odd Number
    – Deletion shifts to the left.
    – Search must scan from Right to Left

Keys: 10 20 30 40 g g
Pointers: P1 P2 P3 P4 P5 ʌ ʌ

counter: 3   Search 40 (← read)   Delete 25 (← shift)

slide-50
SLIDE 50

▪ Direction flag:

  • Even Number
    – Insertion shifts to the right.
    – Search must scan from Left to Right
  • Odd Number
    – Deletion shifts to the left.
    – Search must scan from Right to Left

Keys: 10 20 30 40 g g
Pointers: P1 P2 P3 P4 P5 ʌ ʌ

counter: 2 → 3   Search 40 (read →)   Delete 25 (← shift)

After scanning, the read transaction has to check the counter once again to make sure it has not changed. If it has changed, the read transaction searches the node again.
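
The counter protocol can be exercised with a deterministic Python simulation of Example 2. Everything here is a sketch under assumptions: the `Node` layout, the rule that a writer bumps the counter only when the shift direction changes, and the `racing_writer` hook that interleaves writer stores with reader visits are all hypothetical names and simplifications.

```python
CAP = 6   # assumed number of key slots in a node


class Node:
    def __init__(self):
        self.keys = [10, 20, 30, 40, None, None]                 # None = garbage
        self.ptrs = ["P1", "P2", "P3", "P4", "P5", None, None]   # None = null
        self.counter = 0   # even: last shift went right; odd: last shift went left


def delete_steps(node, n, pos):
    """Left-shift deletion of keys[pos] from n keys; yields after every store."""
    if node.counter % 2 == 0:
        node.counter += 1           # direction changes to "left"
        yield
    for i in range(pos, n):
        node.ptrs[i] = node.ptrs[i + 1]
        yield
        node.keys[i] = node.keys[i + 1]
        yield
    node.ptrs[n] = node.ptrs[n + 1]
    yield


def scan_once(node, target, order, before_visit):
    """One pass over the slots in the given order, skipping invalid slots."""
    for i in order:
        before_visit(i)             # simulation hook: a concurrent writer may run here
        right = node.ptrs[i + 1]
        if right is not None and node.ptrs[i] != right and node.keys[i] == target:
            return True
    return False


def search(node, target, before_visit=lambda i: None):
    """Scan in the direction given by the counter; retry if the counter changed."""
    while True:
        seen = node.counter
        order = range(CAP) if seen % 2 == 0 else range(CAP - 1, -1, -1)
        found = scan_once(node, target, order, before_visit)
        if node.counter == seen:
            return found


def racing_writer(node, plan):
    """Run plan[j] writer stores just before the reader's j-th slot visit."""
    writer = delete_steps(node, n=4, pos=1)
    budget = iter(plan)

    def before_visit(_slot):
        for _ in range(next(budget, 0)):
            next(writer, None)
    return before_visit


node = Node()
print(scan_once(node, 30, range(CAP), racing_writer(node, [0, 2, 2])))  # single left-to-right pass
node = Node()
print(search(node, 30, racing_writer(node, [0, 2, 2])))                 # with counter re-check
```

With this interleaving the single left-to-right pass misses key 30, while the counter-checked search notices the direction change, rescans from right to left, and finds it.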

slide-51
SLIDE 51

Transaction A: BEGIN → INSERT 10 → SUSPENDED → WAKE UP → ABORT
Transaction B: BEGIN → SEARCH 10 (FOUND) → COMMIT

The ordering of Transaction A and Transaction B cannot be determined

Dirty reads problem

slide-52
SLIDE 52

Isolation Level (Highest to Lowest): Serializable, Repeatable reads, Read committed, Read uncommitted

Our Lock-Free Search supports a low isolation level

slide-53
SLIDE 53

[Figure: B+-Tree from Root to Leaf. Labels: Lock-Free Search, Lock Contention (High, Low).]

For a higher isolation level, a read lock is necessary for leaf nodes

slide-54
SLIDE 54

▪ Xeon Haswell-Ex E7-4809 v3 processors

  • 2.0 GHz, 16 vCPUs with hyper-threading enabled, and 20 MB L3 cache
  • Total Store Ordering (TSO) is guaranteed

▪ g++ 4.8.2 with -O3
▪ PM latency

  • Read latency

– A DRAM-based PM latency emulator, Quartz

  • Write latency

– Injecting delay

slide-55
SLIDE 55
  • Sorted keys, cache locality, and memory-level parallelism → up to 20X speedup

slide-56
SLIDE 56

FAST+FAIR → FP-Tree → wB+-Tree → WORT → Skiplist

slide-57
SLIDE 57
  • FAST+Logging uses logging instead of FAIR when splitting a node

WORT, FAST+FAIR, FP-Tree → FAST+Logging → wB+-Tree → Skiplist

  • clflush: I/O time
  • Search: Tree traversal time
  • Node Update: Computation time

slide-58
SLIDE 58

Specification of TPCC workloads (figure axis label: More Range Queries)

      New Order   Payment   Order Status   Delivery   Stock Level
W1    34%         43%       5%             4%         14%
W2    27%         43%       15%            4%         11%
W3    20%         43%       25%            4%         8%
W4    13%         43%       35%            4%         5%

  • FAST+FAIR consistently outperforms other indexes because of its good insertion performance and superior range query performance

slide-59
SLIDE 59

[Figure panels: (a) 50M Search, (b) 50M Insertion, (c) 200M Search / 50M Insertion / 12.5M Deletion]

  • Lock-free search with FAST+FAIR shows high scalability and performance
  • FAST+FAIR+LeafLock shows comparable scalability and provides high concurrency level

slide-60
SLIDE 60

▪ We designed a byte-addressable persistent B+-Tree that

  • stores keys in order
  • avoids expensive logging

▪ FAST and FAIR always transform a B+-Tree into another consistent or transiently inconsistent B+-Tree
▪ Lock-Free search

  • By tolerating transient inconsistency
slide-61
SLIDE 61

Deukyeon Hwang (UNIST), Wook-Hee Kim (UNIST), Youjip Won (Hanyang Univ.), Beomseok Nam (UNIST)

slide-62
SLIDE 62
  • To guarantee the order of instructions, the dmb instruction is used for FAST+FAIR
  • Although dmb adds overhead, FAST+FAIR is less affected by latency