Providing Atomic Sector Updates in Software for Persistent Memory
Vishal Verma
vishal.l.verma@intel.com Vault 2015
– Introduction
– The Block Translation Table
– Read and Write Flows
– Synchronization
– Performance/Efficiency
– BTT vs. DAX
Figure: the storage hierarchy (CPU caches, DRAM, Persistent Memory, Traditional Storage) plotted against Speed and Capacity.
– But not for writing a sector atomically
Figure: Userspace write() -> 'pmem' driver (/dev/pmem0) -> memcpy() (figure labels: 1, 2, 3)
1. No blocks are torn (common on modern drives)
2. A block was torn, but reads back with an ECC error
3. A block was torn, but reads back without an ECC error (very rare on modern drives)
– ECC is correct between two stores
– Torn sectors will almost never trigger ECC on the NVDIMM
– Case 3 becomes most common!
– Only file systems with data checksums will survive this case
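A minimal sketch of case 3 on persistent memory, assuming a 512-byte sector copied as a series of 8-byte stores with power lost partway through. The name torn_memcpy and the 8-byte store size are illustrative assumptions, not taken from the pmem driver.

```python
SECTOR = 512
STORE = 8  # assumed store granularity, for illustration only


def torn_memcpy(media: bytearray, new: bytes, stores_done: int) -> None:
    """Copy `new` over `media` one store at a time, stopping after `stores_done` stores."""
    for i in range(stores_done):
        off = i * STORE
        media[off:off + STORE] = new[off:off + STORE]


old = bytes([0xAA]) * SECTOR
new = bytes([0xBB]) * SECTOR
media = bytearray(old)

# Simulated power failure after 20 of the 64 stores.
torn_memcpy(media, new, stores_done=20)

torn_at = 20 * STORE
is_torn = media[:torn_at] == new[:torn_at] and media[torn_at:] == old[torn_at:]
print("torn:", is_torn)
print("equals old:", bytes(media) == old, "equals new:", bytes(media) == new)
```

Every individual store in the sketch completed cleanly, so nothing at the media level flags the sector as bad. Only a checksum over the whole sector would notice the mix of old and new data.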
– … table and an in-memory free block list
– … LBA -> actual block offset mappings
– … free list
– … atomically swap the free list entry and map entry

Figure (before the write):
Map (LBA -> Actual): 0 -> 42, 1 -> 5050, 2 -> 314, 3 -> 3
Free List: 0, 2, 12
NVDIMM: 42 - LBA 0, 3 - LBA 3, 314 - LBA 2, 0 - Free
Figure (write( to LBA 3 ) arrives):
Map (LBA -> Actual): 0 -> 42, 1 -> 5050, 2 -> 314, 3 -> 3
Free List: 0, 2, 12
NVDIMM: 42 - LBA 0, 314 - LBA 2, 0 - Free, 3 - LBA 3
Figure (after the write):
Map (LBA -> Actual): 0 -> 42, 1 -> 5050, 2 -> 314, 3 -> 0
Free List: 3, 2, 12
NVDIMM: 42 - LBA 0, 3 - Free, 314 - LBA 2, 0 - LBA 3
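The map and free list walk-through above can be sketched as a toy model. This is only an illustration under stated assumptions: NaiveBTT and its fields are hypothetical names, a dict stands in for the NVDIMM, and the "atomic" swap is two plain assignments here.

```python
class NaiveBTT:
    """Toy indirection table: LBA -> actual block, plus an in-memory free list."""

    def __init__(self, block_map, free_list):
        self.map = dict(block_map)    # LBA -> actual block offset
        self.free = list(free_list)   # free actual blocks
        self.media = {}               # actual block -> data (stands in for the NVDIMM)

    def write(self, lba, data):
        new_blk = self.free[0]
        self.media[new_blk] = data    # 1. write into a free block; the old data is untouched
        old_blk = self.map[lba]
        self.map[lba] = new_blk       # 2. point the LBA at the new block ...
        self.free[0] = old_blk        # 3. ... and recycle the old block (not atomic in this toy)

    def read(self, lba):
        return self.media.get(self.map[lba])


btt = NaiveBTT({0: 42, 1: 5050, 2: 314, 3: 3}, [0, 2, 12])
btt.write(3, b"new sector contents")
print(btt.map[3], btt.free)
```

Starting from the state in the first figure, the write to LBA 3 leaves LBA 3 mapped to block 0 and the free list holding 3, 2, 12.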
– The only way to recreate the free list is to read the entire map
– Consider a 512GB volume, bs=512 => reading 1073741824 map entries
– Map entries have to be 64-bit, so we end up reading 8GB at startup
– Could save the free list to media on clean shutdown
– But...clunky at best
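The startup cost quoted above follows from simple arithmetic. A quick check, taking "GB" to mean 2^30 bytes (which is what makes the entry count come out to 1073741824):

```python
volume_bytes = 512 * 2**30   # 512GB volume
block_size = 512             # bs=512
entry_bytes = 8              # 64-bit map entries

map_entries = volume_bytes // block_size
map_bytes = map_entries * entry_bytes

print(map_entries, "map entries")
print(map_bytes // 2**30, "GB of map to read at startup")
```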
– The flog has nfree entries.
– Each entry has two 'slots' that 'flip-flop'
– Each slot has:
    – Block being written
    – Old mapping
    – New mapping
    – Sequence num
– … area as seen prior to/post indirection from the map

Figure: the backing store is split into arenas (Arena 0 512G, Arena 1 512G). Arena contents: Arena Info Block (4K), Data Blocks, nfree reserved blocks, BTT Map, BTT Flog (8K), Info Block Copy (4K).
– But if not, we can simply preempt_disable() and need not take a lock

Figure: one lane per CPU (get_lane() = 0 on CPU 0, 1 on CPU 1, 2 on CPU 2)

Free List
Lane  blk  seq   slot
0     2    0b10  0
1     6    0b10  1
2     14   0b01  0

Flog
Lane  LBA  old  new  seq   LBA`  old`  new`  seq`
0     5    32   2    0b10  XX    XX    XX    XX
1     XX   XX   XX   XX    8     38    6     0b10
2     42   42   14   0b01  XX    XX    XX    XX

Map
pre  post
5    2
8    6
42   14
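A rough sketch of lane selection. Two things here are assumptions rather than statements from the slides: that CPUs are folded onto lanes by modulo when there are more CPUs than lanes, and that the lock is needed only in that case (otherwise disabling preemption suffices). The function needs_lane_lock is a hypothetical helper.

```python
def get_lane(cpu: int, nfree: int) -> int:
    # Assumption: CPUs beyond the lane count wrap around and share lanes.
    return cpu % nfree


def needs_lane_lock(num_cpus: int, nfree: int) -> bool:
    # Assumption: if every CPU has a lane to itself, lanes are never shared,
    # so pinning the task to its CPU (preempt_disable() in the kernel) is enough.
    return num_cpus > nfree


for cpu in range(3):
    print("CPU", cpu, "-> lane", get_lane(cpu, nfree=3))
print("lock needed with 3 CPUs, 3 lanes:", needs_lane_lock(3, 3))
print("lock needed with 8 CPUs, 3 lanes:", needs_lane_lock(8, 3))
```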
Figure: read flow (CPU 0, Lane 0)
1. read() LBA 5
2. Map lookup: pre 5 -> post 10
3. Read data from 10
4. Release Lane 0
– … block
– … postmap_aba / seq]
– … list entry

Figure: write flow (CPU 0, Lane 0)
Free List[0]: blk 2, seq 0b10, slot 0
1. write() LBA 5
2. write data to 2 (Map (old): pre 5 -> post 10)
3. flog[0][0] = {5, 10, 2, 0b10}
4. map[5] = 2 (Map: pre 5 -> post 2)
5. free[0] = {10, 0b11, 1}
6. Release Lane 0
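The read and write flows can be modelled in a few lines. This is a sketch, not the driver's code: Arena, FreeEntry, and the dict standing in for the data blocks are hypothetical, the starting slot of 0 is an assumption inferred from the flog[0][0] index, and lane acquisition and release are omitted.

```python
from dataclasses import dataclass

# Sequence numbers cycle 01 -> 10 -> 11 -> 01.
NEXT_SEQ = {0b01: 0b10, 0b10: 0b11, 0b11: 0b01}


@dataclass
class FreeEntry:
    blk: int    # free postmap block owned by this lane
    seq: int    # sequence number the next flog write will carry
    slot: int   # flog slot the next write will use


class Arena:
    def __init__(self, block_map, free):
        self.map = dict(block_map)                # premap LBA -> postmap block
        self.free = list(free)                    # one FreeEntry per lane
        self.flog = [[None, None] for _ in free]  # two slots per lane
        self.data = {}                            # postmap block -> contents

    def read(self, lba):
        return self.data.get(self.map[lba])

    def write(self, lane, lba, buf):
        f = self.free[lane]
        self.data[f.blk] = buf                               # 1. data goes to the free block
        old = self.map[lba]
        self.flog[lane][f.slot] = (lba, old, f.blk, f.seq)   # 2. log lba/old/new/seq
        self.map[lba] = f.blk                                # 3. switch the map
        self.free[lane] = FreeEntry(old, NEXT_SEQ[f.seq], 1 - f.slot)  # 4. old block becomes free


arena = Arena({5: 10}, [FreeEntry(blk=2, seq=0b10, slot=0)])
arena.write(lane=0, lba=5, buf=b"payload")
print(arena.flog[0][0], arena.map[5], arena.free[0])
```

With the figure's starting state, the model ends with flog[0][0] holding (5, 10, 2, 0b10), map[5] equal to 2, and free[0] equal to {10, 0b11, 1}.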
Opportunities for interruption/power failure
– No on-disk change has happened, so everything comes back up as normal
– Map hasn't been updated
– Reads will continue to get the 5 → 10 mapping
– Flog will still show '2' as free and ready to be written to
– Read flog[0][0] = {5, 10, 2, 0b10}
– Flog claims map[5] should have been '2', but map[5] is still '10' (== flog.old)
– Since flog and map disagree, recovery routine detects an incomplete transaction
– Flog is assumed to be “true” since it is always written before the map
– Recovery routine completes the transaction by updating map[5] = 2; free[0] = 10
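A minimal sketch of that recovery step for a single lane. The function name recover_lane and the tuple layout (lba, old, new, seq) are illustrative assumptions, and selecting the newest flog slot is left out here.

```python
def recover_lane(flog_entry, block_map):
    """Replay one lane's newest flog entry; return the block that is free afterwards."""
    lba, old, new, _seq = flog_entry
    if block_map[lba] == old:
        # Flog was written but the map was not: the flog is trusted,
        # so finish the interrupted transaction.
        block_map[lba] = new
    # Whether or not the map needed fixing, the old block is now the free one.
    return old


block_map = {5: 10}  # map[5] is still 10 after the crash
free_blk = recover_lane((5, 10, 2, 0b10), block_map)
print(block_map[5], free_blk)

# Running recovery again changes nothing, since map and flog now agree.
free_again = recover_lane((5, 10, 2, 0b10), block_map)
```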
– Read flog[0][0] = {5, 10, X, 0b11}; flog[0][1] = {X, X, X, 0b01}
– Since seq is written last, the half-written flog entry does not show up as “new”
– Free list is reconstructed using the newest non-torn flog entry, flog[0][1] in this case
– map[5] remains '10', and '2' remains free.
Bit sequence for flog.seq (old to new): 01->10->11->01
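One way to pick the newer of a lane's two flog slots from the cyclic sequence numbers, sketched under the assumption that both slots hold one of the three valid values (a never-written slot is not modelled). The name newest_slot is hypothetical.

```python
# Successor of each sequence number in the cycle 01 -> 10 -> 11 -> 01.
NEXT_SEQ = {0b01: 0b10, 0b10: 0b11, 0b11: 0b01}


def newest_slot(seq0: int, seq1: int) -> int:
    """Return 0 or 1: the slot whose seq is the successor of the other slot's seq."""
    if NEXT_SEQ.get(seq0) == seq1:
        return 1
    if NEXT_SEQ.get(seq1) == seq0:
        return 0
    raise ValueError("sequence numbers are not adjacent in the cycle")


# Torn flog write: slot 0 still carries its previous seq 0b11, slot 1 has 0b01.
print(newest_slot(0b11, 0b01))

# Completed flog write: slot 0 now carries 0b10, slot 1 still has 0b01.
print(newest_slot(0b10, 0b01))
```

Because seq is the last field written, a half-written slot keeps its previous sequence number and loses the comparison, which is why the torn entry in flog[0][0] is ignored.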
– Since both flog and map were updated, free list reconstruction will happen as usual
CPU 1                          CPU 2
write LBA 0                    write LBA 0
get-free[1] = 5                get-free[2] = 6
write data - postmap ABA 5     write data - postmap ABA 6
...                            ...
read old_map[0] = 10           read old_map[0] = 10
write log 0/10/5/xx            write log 0/10/6/xx
write map = 5                  write map = 6
write free[1] = 10             write free[2] = 10

Critical section: from "read old_map" through "write free"
CPU 1: write LBA 0; get-free[1] = 5; write_data to 5
CPU 2: write LBA 0; get-free[2] = 6; write_data to 6
CPU 1: lock map_lock[0 % nfree]
CPU 1: read old_map[0] = 10
CPU 1: write log 0/10/5/xx; write map = 5; free[1] = 10
CPU 1: unlock map_lock[0 % nfree]
CPU 2: lock map_lock[0 % nfree]
CPU 2: read old_map[0] = 5
CPU 2: write log 0/5/6/xx; write map = 6; free[2] = 5
CPU 2: unlock map_lock[0 % nfree]
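A small threaded model of the map_lock fix, using Python's threading.Lock in place of the kernel lock. The names write_lba and free_by_lane are hypothetical, and the data and flog writes are reduced to comments.

```python
import threading

nfree = 4
block_map = {0: 10}
free_by_lane = {1: 5, 2: 6}   # lane -> free postmap block
map_locks = [threading.Lock() for _ in range(nfree)]


def write_lba(lane: int, lba: int) -> None:
    new = free_by_lane[lane]
    # Data would be written to block `new` here, outside the lock.
    with map_locks[lba % nfree]:   # critical section
        old = block_map[lba]
        # The flog entry lba/old/new/seq would be written here.
        block_map[lba] = new
        free_by_lane[lane] = old


threads = [threading.Thread(target=write_lba, args=(lane, 0)) for lane in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

blocks = sorted([block_map[0], free_by_lane[1], free_by_lane[2]])
print(blocks)
```

Whichever writer wins, blocks 5, 6 and 10 are each accounted for exactly once. Without the lock, both writers could read old_map[0] = 10 and both would put block 10 on their free lists.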
CPU 1 (Reader)                    CPU 2 (Writer)
read LBA 0                        write LBA 0
...                               get-free[2] = 6
read map[0] = 5                   write data to postmap block 6
start reading postmap block 5     write meta: map[0] = 6, free[2] = 5
...                               another write LBA 12
...                               get-free[2] = 5
...                               write data to postmap block 5
finish reading postmap block 5

BUG! – writing a block that is being read from
CPU 1 (Reader)                    CPU 2 (Writer)
read LBA 0                        write LBA 0
read map[0] = 5                   get-free[2] = 6; write data
write rtt[1] = 5                  write meta: map[0] = 6, free[2] = 5
start reading postmap block 5     another write LBA 12
...                               get-free[2] = 5
...                               scan RTT – '5' is present - wait!
finish reading postmap block 5    ...
clear rtt[1]                      ...
                                  write data to postmap block 5
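The RTT check can be sketched as a per-lane table of blocks currently being read. This single-threaded model only shows the bookkeeping. The helper names are hypothetical, and a real writer would wait and rescan rather than return a boolean.

```python
nlanes = 3
rtt = [None] * nlanes   # read tracking table: one entry per lane


def reader_begin(lane: int, postmap_blk: int) -> None:
    rtt[lane] = postmap_blk   # publish the block before reading it


def reader_end(lane: int) -> None:
    rtt[lane] = None          # clear the entry once the read is done


def writer_may_use(free_blk: int) -> bool:
    # Scan the RTT: a free block still being read must not be overwritten yet.
    return free_blk not in rtt


reader_begin(lane=1, postmap_blk=5)
blocked_during_read = not writer_may_use(5)   # writer must wait here
reader_end(lane=1)
allowed_after_read = writer_may_use(5)        # now block 5 can be reused
print(blocked_during_read, allowed_after_read)
```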
                       512B Block size    4K Block size
Write Amplification    ~4.6% [536B]       ~0.5% [4120B]
Capacity Overhead      ~0.8%              ~0.1%
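The write amplification figures can be checked from the bracketed totals, assuming each bracketed number is the total bytes written per block write. How the extra bytes divide between flog and map updates is not stated here, so the sketch only uses the difference. The helper overhead_pct is a hypothetical name.

```python
def overhead_pct(block_size: int, bytes_written: int) -> float:
    """Extra bytes written per block, as a percentage of the block size."""
    return 100 * (bytes_written - block_size) / block_size


for block_size, bytes_written in ((512, 536), (4096, 4120)):
    extra = bytes_written - block_size
    print(block_size, "B block:", extra, "extra bytes,",
          round(overhead_pct(block_size, bytes_written), 2), "%")
```

Both block sizes carry the same 24 extra bytes per write, which is why the relative cost falls by a factor of eight when moving from 512B to 4K blocks.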
– … advantage, DAX is the best path for it
– It may not know that it relies on this...
– For xfs-dax, we'd need to use logdev=/dev/[btt-partition]
NVML, a library to make persistent memory programming easier
https://patchwork.kernel.org/project/linux-nvdimm/list/