SLIDE 1

Providing Atomic Sector Updates in Software for Persistent Memory

Vishal Verma

vishal.l.verma@intel.com
Vault 2015

SLIDE 2

  • Introduction
  • The Block Translation Table
  • Read and Write Flows
  • Synchronization
  • Performance/Efficiency
  • BTT vs. DAX

SLIDE 3

NVDIMMs and Persistent Memory

  • NVDIMMs are byte-addressable
  • We won't talk of “Total System Persistence”
  • But using persistent memory DIMMs for storage
  • Drivers to present this as a block device - “pmem”

[Figure: storage hierarchy from CPU caches to DRAM to Persistent Memory to Traditional Storage, trading speed for capacity.]

SLIDE 4

Problem Statement

  • Byte addressability is great
    – But not for writing a sector atomically

[Figure: a write() from userspace passes through the 'pmem' driver (/dev/pmem0) and lands on the NVDIMM as a memcpy(), with nothing guaranteeing that the whole sector is written atomically.]

SLIDE 5

Problem Statement

  • On a power failure, there are three possibilities

    1. No blocks are torn (common on modern drives)
    2. A block was torn, but reads back with an ECC error
    3. A block was torn, but reads back without an ECC error (very rare on modern drives)

  • With pmem, we use memcpy()

    – ECC is correct between two stores
    – Torn sectors will almost never trigger ECC on the NVDIMM
    – Case 3 becomes most common!
    – Only file systems with data checksums will survive this case

SLIDE 6

Naive solution

  • Full Data Journaling
  • Write every block to the journal first
  • 2x latency
  • 2x media wear
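
A minimal sketch of this naive journaling approach (names and layout are illustrative, not any particular file system's journal): every sector is written twice, once into a journal area and once in place, which is where the 2x latency and 2x media wear come from.

```c
#include <stdint.h>
#include <string.h>

/* Naive full-data journaling sketch: the same sector is copied twice.
 * 'journal' and 'data' stand for persistent-memory mappings; flush/fence
 * steps are shown only as comments. */
static void journaled_sector_write(uint8_t *journal, uint8_t *data,
                                   uint64_t lba, const void *buf, size_t blk_size)
{
    memcpy(journal + lba * blk_size, buf, blk_size);  /* 1st copy: journal */
    /* persist the journal copy before touching the live block */
    memcpy(data + lba * blk_size, buf, blk_size);     /* 2nd copy: in place */
    /* persist, then the journal entry can be retired */
}
```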
SLIDE 7

Slightly better solution

  • Maintain an 'on-disk' indirection table and an in-memory free block list
  • The map/indirection table has LBA -> actual block offset mappings
  • New writes grab a block from the free list
  • On completing the write, atomically swap the free list entry and the map entry

[Figure: NVDIMM layout showing the data blocks, the on-media Map (LBA -> actual block offset), and the in-memory Free List.]
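
A minimal sketch of this scheme (all names are illustrative, not from the actual pmem driver): the map lives on media, the free list lives only in memory, and completing a write is a single 64-bit store into the map followed by recycling the old block.

```c
#include <stdint.h>
#include <string.h>

#define BLK_SIZE 512

struct simple_btt {
    uint8_t  *data;       /* base of the data area on the NVDIMM */
    uint64_t *map;        /* persistent: external LBA -> actual block offset */
    uint64_t *free_list;  /* in-memory only: blocks not referenced by the map */
    uint64_t  nfree;
};

/* Write one sector: copy data into a free block, then swap that block
 * into the map with a single 64-bit store. */
static void sector_write(struct simple_btt *b, uint64_t lba, const void *buf)
{
    uint64_t new_blk = b->free_list[0];              /* grab a free block (index 0 for simplicity) */
    memcpy(b->data + new_blk * BLK_SIZE, buf, BLK_SIZE);
    /* ...persist the data (flush + fence) before updating the map... */

    uint64_t old_blk = b->map[lba];
    b->map[lba] = new_blk;                           /* the atomic switch-over */
    b->free_list[0] = old_blk;                       /* recycle the old block */
}
```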

SLIDE 8

Slightly better solution

(Same bullets as the previous slide.)

[Figure: a write() to LBA 3 is in progress; the data goes into a block taken from the free list, while the map still points LBA 3 at its old block.]

SLIDE 9

Slightly better solution

(Same bullets as the previous slide.)

[Figure: the write to LBA 3 completes; the map entry for LBA 3 and the free list entry are swapped, so the newly written block is now mapped and the old block becomes free.]

SLIDE 10

Slightly better solution

  • Easy enough to implement
  • Should be performant
  • Caveat:

    – The only way to recreate the free list is to read the entire map
    – Consider a 512GB volume, bs=512 => reading 1073741824 map entries
    – Map entries have to be 64-bit, so we end up reading 8GB at startup
    – Could save the free list to media on clean shutdown
    – But...clunky at best

SLIDE 11

Introduction · The Block Translation Table · Read and Write Flows · Synchronization · Performance/Efficiency · BTT vs. DAX

SLIDE 12

The Block Translation Table

  • nfree: The number of free blocks in reserve
  • Flog: Portmanteau of free list + log
    – Has nfree entries
    – Each entry has two 'slots' that 'flip-flop'
    – Each slot has: the block being written, the old mapping, the new mapping, and a sequence number
  • Info block: Info about the arena - offsets, lbasizes, etc.
  • External LBA: LBA as visible to upper layers
  • ABA: Arena Block Address - block offset within an arena
  • Premap/Postmap ABA: The block offset into the data area as seen prior to/post indirection from the map

[Figure: the backing store is divided into 512G arenas; each arena contains an Arena Info Block (4K), Data Blocks, the BTT Map, an Info Block Copy (4K), and the BTT Flog (8K), with nfree blocks held in reserve.]
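
A sketch of what one flog entry could look like, following the description above (field widths and names are illustrative, not the BTT on-media layout):

```c
#include <stdint.h>

/* One slot of a flog entry: the premap ABA being written, the old and new
 * postmap ABAs, and a small sequence number. Widths are illustrative only. */
struct flog_slot {
    uint32_t lba;      /* premap ABA: the external block being written */
    uint32_t old_map;  /* postmap ABA the map pointed to before the write */
    uint32_t new_map;  /* postmap ABA the data was just written to */
    uint32_t seq;      /* 2-bit sequence number, cycling 01 -> 10 -> 11 -> 01 */
};

/* Each flog entry has two slots that 'flip-flop': writes alternate between
 * them, and the slot holding the newer sequence number is the current one. */
struct flog_entry {
    struct flog_slot slot[2];
};

/* The flog as a whole has nfree entries, one per reserved free block. */
```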

SLIDE 13

What's in a lane?

  • The idea of “lanes” is purely logical
  • num_lanes = min(num_cpus, nfree)
  • lane = cpu % num_lanes
  • If num_cpus > num_lanes, we need locking on lanes

– But if not, we can simply preempt_disable() and need not take a lock

[Figure: CPU 0, 1, and 2 each call get_lane() and get lanes 0, 1, and 2; each lane indexes its own Free List entry (blk, seq, slot) and its own flog entry (two slots of lba/old/new/seq), while the Map is shared by all lanes.]
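
A sketch of the lane math described above (illustrative, not the kernel code):

```c
/* Lanes are purely logical: a way to spread CPUs across the nfree
 * free-list/flog entries. */
static unsigned int num_lanes(unsigned int num_cpus, unsigned int nfree)
{
    return num_cpus < nfree ? num_cpus : nfree;   /* min(num_cpus, nfree) */
}

static unsigned int get_lane(unsigned int cpu, unsigned int lanes)
{
    return cpu % lanes;
}

/* If num_cpus > num_lanes, two CPUs can share a lane, so each lane needs a
 * lock. Otherwise a CPU owns its lane exclusively, so disabling preemption
 * while the lane is in use is enough; no lock is needed. */
```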

SLIDE 14

Introduction · The Block Translation Table · Read and Write Flows · Synchronization · Performance/Efficiency · BTT vs. DAX

SLIDE 15

BTT – Reading a block

  • Convert external LBA to Arena number + pre-map ABA
  • Get a lane (and take lane_lock if needed)
  • Read map to get the mapping
  • If ZERO flag is set, return zeroes
  • If ERROR flag is set, return an error
  • Read data from the block that the map points to
  • Release lane (and lane_lock)

[Worked example, CPU 0 on lane 0: read() of LBA 5; the Map gives 5 -> 10; data is read from block 10; lane 0 is released.]
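
A sketch of the read path following the steps above, over a simplified in-memory model of one arena (the flag bits, helper names, and structures are illustrative; this is not the kernel's btt.c, and lane handling is reduced to comments):

```c
#include <stdint.h>
#include <string.h>

#define MAP_ERROR_FLAG (1u << 30)   /* illustrative flag bits, not the on-media encoding */
#define MAP_ZERO_FLAG  (1u << 31)
#define MAP_ABA_MASK   0x3fffffffu

struct arena {
    uint32_t *map;       /* premap ABA -> postmap ABA plus flag bits */
    uint8_t  *data;      /* data blocks */
    uint32_t  blk_size;
};

/* Read path following the steps above. */
static int btt_read(struct arena *arena, uint32_t premap, void *buf)
{
    /* get a lane here; take lane_lock if num_cpus > num_lanes */
    uint32_t entry = arena->map[premap];

    if (entry & MAP_ZERO_FLAG) {
        memset(buf, 0, arena->blk_size);          /* never-written block: return zeroes */
    } else if (entry & MAP_ERROR_FLAG) {
        return -1;                                /* block marked bad: return an error */
    } else {
        uint32_t postmap = entry & MAP_ABA_MASK;
        memcpy(buf, arena->data + (size_t)postmap * arena->blk_size, arena->blk_size);
    }
    /* release the lane (and lane_lock) here */
    return 0;
}
```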

SLIDE 16

BTT – Writing a block

  • Convert external LBA to Arena number + pre-map ABA
  • Get a lane (and take lane_lock if needed)
  • Use lane to index into free list, write data to this free block
  • Read map to get the existing mapping
  • Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq]
  • Write new post-map ABA into map
  • Write old post-map entry into the free list
  • Calculate next sequence number and write into the free list entry
  • Release lane (and lane_lock)

[Worked example, CPU 0 on lane 0:
  Free List[0] = {blk 2, seq 0b10}
  write() to LBA 5; data is written to free block 2
  Map (old): 5 -> 10
  flog[0][0] = {5, 10, 2, 0b10}
  map[5] = 2 (Map now: 5 -> 2)
  free[0] = {10, 0b11, 1}
  Release lane 0]
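
A sketch of the write path in the same simplified model (illustrative only; persistence flushes/fences between steps and the real flog layout are omitted). Note the ordering: data first, then the flog, then the map, then the free-list entry.

```c
#include <stdint.h>
#include <string.h>

struct free_entry { uint32_t block; uint32_t seq; };

struct arena {
    uint32_t          *map;        /* premap ABA -> postmap ABA */
    uint8_t           *data;       /* data blocks */
    uint32_t           blk_size;
    struct free_entry *free_list;  /* one entry per lane, in memory */
};

/* Placeholder: persist {premap, old, new, seq} into this lane's flog slot. */
static void flog_write(uint32_t lane, uint32_t premap, uint32_t old_post,
                       uint32_t new_post, uint32_t seq)
{
    (void)lane; (void)premap; (void)old_post; (void)new_post; (void)seq;
}

static uint32_t next_seq(uint32_t seq) { return seq == 3 ? 1 : seq + 1; }

static void btt_write(struct arena *a, uint32_t lane, uint32_t premap,
                      const void *buf)
{
    /* lane is already held; take lane_lock first if num_cpus > num_lanes */
    struct free_entry *fe = &a->free_list[lane];

    uint32_t new_post = fe->block;                    /* this lane's free block */
    memcpy(a->data + (size_t)new_post * a->blk_size, buf, a->blk_size);

    uint32_t old_post = a->map[premap];               /* existing mapping */

    flog_write(lane, premap, old_post, new_post, fe->seq);  /* flog first... */
    a->map[premap] = new_post;                              /* ...then the map */

    fe->block = old_post;                             /* old block is now free */
    fe->seq   = next_seq(fe->seq);                    /* 01 -> 10 -> 11 -> 01 */
    /* release the lane (and lane_lock) */
}
```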

SLIDE 17

BTT – Analysis of a write

[Write example from Slide 16, repeated.]

Opportunities for interruption/power failure

SLIDE 18

BTT – Analysis of a write

[Write example from Slide 16, repeated.]

  • On reboot:
    – No on-disk change had happened; everything comes back up as normal

SLIDE 19

BTT – Analysis of a write

[Write example from Slide 16, repeated.]

  • On reboot:
    – Map hasn't been updated
    – Reads will continue to get the 5 → 10 mapping
    – Flog will still show '2' as free and ready to be written to

SLIDE 20

BTT – Analysis of a write

[Write example from Slide 16, repeated.]

  • On reboot:
    – Read flog[0][0] = {5, 10, 2, 0b10}
    – Flog claims map[5] should have been '2', but map[5] is still '10' (== flog.old)
    – Since flog and map disagree, the recovery routine detects an incomplete transaction
    – Flog is assumed to be “true” since it is always written before the map
    – Recovery routine completes the transaction by updating map[5] = 2; free[0] = 10
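
A sketch of that recovery check for one flog entry (illustrative, not the kernel's recovery code; 'cur' is assumed to be the newer valid slot of the pair, as discussed on the next slide):

```c
#include <stdint.h>

struct flog_slot { uint32_t lba, old_map, new_map, seq; };

/* Startup recovery for one flog entry. The flog is written before the map,
 * so if the two disagree the flog wins and the transaction is rolled forward. */
static uint32_t recover_flog_entry(uint32_t *map, const struct flog_slot *cur)
{
    if (map[cur->lba] == cur->old_map) {
        /* The crash landed between the flog write and the map write:
         * complete the interrupted transaction. */
        map[cur->lba] = cur->new_map;
    }
    /* Whether or not we rolled forward, old_map is the block that now seeds
     * this lane's in-memory free-list entry. */
    return cur->old_map;
}
```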

SLIDE 21

BTT – Analysis of a write

[Write example from Slide 16, repeated.]

  • Special case: the flog write itself is torn
  • On reboot:
    – Read flog[0][0] = {5, 10, X, 0b11}; flog[0][1] = {X, X, X, 0b01}
    – Since seq is written last, the half-written flog entry does not show up as “new”
    – The free list is reconstructed using the newest non-torn flog entry, flog[0][1] in this case
    – map[5] remains '10', and '2' remains free
  • Bit sequence for flog.seq: 01 -> 10 -> 11 -> 01 (old to new, wrapping around)
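
A sketch of the sequence-number rules above (illustrative): seq cycles 01 -> 10 -> 11 -> 01, and of a flog entry's two slots, the one whose seq is the successor of the other's is the newer one.

```c
#include <stdint.h>

/* 2-bit sequence number cycling 0b01 -> 0b10 -> 0b11 -> 0b01;
 * 0b00 is treated as uninitialized in this sketch. */
static uint32_t next_seq(uint32_t seq)
{
    return seq == 3 ? 1 : seq + 1;
}

/* Return the index (0 or 1) of the newer slot, given the two seq values.
 * Valid slot pairs always differ by exactly one step of the cycle. */
static int newer_slot(uint32_t seq0, uint32_t seq1)
{
    return next_seq(seq0) == seq1 ? 1 : 0;
}
```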

SLIDE 22

BTT – Analysis of a write

[Write example from Slide 16, repeated.]

  • On reboot:
    – Since both flog and map were updated, free list reconstruction will happen as usual

SLIDE 23

Introduction · The Block Translation Table · Read and Write Flows · Synchronization · Performance/Efficiency · BTT vs. DAX

SLIDE 24

Let's Race! Write vs. Write

  CPU 1                          CPU 2
  write LBA 0                    write LBA 0
  get-free[1] = 5                get-free[2] = 6
  write data - postmap ABA 5     write data - postmap ABA 6
  ...                            ...
  read old_map[0] = 10           read old_map[0] = 10
  write log 0/10/5/xx            write log 0/10/6/xx
  write map = 5                  write map = 6
  write free[1] = 10             write free[2] = 10

SLIDE 25

Let's Race! Write vs. Write

(Same interleaving as the previous slide.)

SLIDE 26

Let's Race! Write vs. Write

(Same interleaving as Slide 24; the span from reading old_map[0] through writing the log, the map, and the free list is the critical section.)

  • Because both CPUs read old_map[0] = 10, both put block 10 back on their free lists and one of the freshly written blocks (5 or 6) is orphaned; this read-modify-write of the map entry has to be serialized

SLIDE 27

Let's Race! Write vs. Write

  • Solution: An array of map_locks indexed by a hash of the premap ABA

  CPU 1                                       CPU 2
  write LBA 0; get-free[1] = 5                write LBA 0; get-free[2] = 6
  write_data to 5                             write_data to 6
  lock map_lock[0 % nfree]
  read old_map[0] = 10
  write log 0/10/5/xx; write map = 5
  free[1] = 10
  unlock map_lock[0 % nfree]
                                              lock map_lock[0 % nfree]
                                              read old_map[0] = 5
                                              write log 0/5/6/xx; write map = 6
                                              free[2] = 5
                                              unlock map_lock[0 % nfree]
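
A sketch of this fix (pthread mutexes stand in for kernel spinlocks; all names are illustrative): the lock array is indexed by a hash of the premap ABA, so two writers to the same external LBA serialize the read-map / write-flog / write-map / update-free-list section, while writers to different LBAs usually take different locks.

```c
#include <stdint.h>
#include <pthread.h>

#define NFREE 256   /* illustrative: one lock per free-list entry */

static pthread_mutex_t map_lock[NFREE];

static void map_locks_init(void)
{
    for (int i = 0; i < NFREE; i++)
        pthread_mutex_init(&map_lock[i], NULL);
}

/* Simple hash: premap ABA modulo the number of locks. */
static pthread_mutex_t *map_lock_for(uint32_t premap)
{
    return &map_lock[premap % NFREE];
}

/* Write-path usage:
 *
 *   pthread_mutex_lock(map_lock_for(premap));
 *   old_post = map[premap];                     // read the old mapping
 *   // write flog, write map[premap] = new_post, free_list[lane] = old_post
 *   pthread_mutex_unlock(map_lock_for(premap));
 */
```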

SLIDE 28

Let's Race! Read vs. Write

  CPU 1 (Reader)                    CPU 2 (Writer)
  read LBA 0                        write LBA 0
  ...                               get-free[2] = 6
  read map[0] = 5                   write data to postmap block 6
  start reading postmap block 5     write meta: map[0] = 6, free[2] = 5
  ...                               another write LBA 12
  ...                               get-free[2] = 5
  ...                               write data to postmap block 5
  finish reading postmap block 5

BUG! – writing a block that is being read from

  • This doesn't corrupt on-disk layout, but the read appears torn
SLIDE 29

Let's Race! Read vs. Write

  CPU 1 (Reader)                    CPU 2 (Writer)
  read LBA 0                        write LBA 0
  read map[0] = 5                   get-free[2] = 6; write data
  write rtt[1] = 5                  write meta: map[0] = 6, free[2] = 5
  start reading postmap block 5     another write LBA 12
  ...                               get-free[2] = 5
  ...                               scan RTT - '5' is present - wait!
  finish reading postmap block 5    ...
  clear rtt[1]                      ...
  ...                               write data to postmap block 5

  • Solution: A Read Tracking Table indexed by lane, tracking in-progress reads
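
A sketch of the Read Tracking Table described above (illustrative; a real implementation would use proper atomics and memory barriers rather than plain volatile accesses): readers publish the postmap block they are reading in their lane's RTT slot, and a writer about to reuse a free block waits until no RTT slot holds that block.

```c
#include <stdint.h>

#define NUM_LANES   64            /* illustrative */
#define RTT_INVALID 0xffffffffu   /* "no read in flight on this lane" */

static volatile uint32_t rtt[NUM_LANES];

/* Reader side: announce/withdraw the postmap block being read. */
static void rtt_publish(unsigned int lane, uint32_t postmap) { rtt[lane] = postmap; }
static void rtt_clear(unsigned int lane)                     { rtt[lane] = RTT_INVALID; }

/* Writer side: before writing data into a just-acquired free block,
 * wait until no in-flight read is still targeting it. */
static void wait_for_readers(uint32_t postmap)
{
    for (unsigned int i = 0; i < NUM_LANES; i++)
        while (rtt[i] == postmap)
            ;   /* spin until the reader on lane i moves off this block */
}
```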
SLIDE 30

Introduction · The Block Translation Table · Read and Write Flows · Synchronization · Performance/Efficiency · BTT vs. DAX

SLIDE 31

That's Great...but is it Fast?

  • Overall, the BTT introduces a ~10% performance overhead
  • We think there is still room for improvement

                         512B block size    4K block size
  Write amplification    ~4.6% [536B]       ~0.5% [4120B]
  Capacity overhead      ~0.8%              ~0.1%

SLIDE 32

Introduction · The Block Translation Table · Read and Write Flows · Synchronization · Performance/Efficiency · BTT vs. DAX

SLIDE 33

BTT vs. DAX

  • DAX stands for Direct Access
  • Patchset by Matthew Wilcox, merged into 4.0-rc1
  • Allows mapping a pmem range directly into userspace via mmap
  • DAX is fundamentally incompatible with the idea of BTT
  • If the application is aware of persistent, byte-addressable memory, and can use it to its advantage, DAX is the best path for it
  • If the application relies on atomic sector update semantics, it must use the BTT
    – It may not know that it relies on this...
  • XFS relies on journal updates being sector atomic
    – For xfs-dax, we'd need to use logdev=/dev/[btt-partition]

SLIDE 34

Resources

  • http://pmem.io - General persistent memory resources. Focuses on the NVML, a library to make persistent memory programming easier
  • The 'pmem' driver on github: https://github.com/01org/prd
  • linux-nvdimm mailing list: https://lists.01.org/mailman/listinfo/linux-nvdimm
  • linux-nvdimm patchwork: https://patchwork.kernel.org/project/linux-nvdimm/list/
  • #pmem on OFTC
SLIDE 35

Q & A