SLIDE 1

Providing Atomic Sector Updates in Software for Persistent Memory

Vishal Verma

vishal.l.verma@intel.com
Vault 2015

SLIDE 2

  • Introduction
  • The Block Translation Table
  • Read and Write Flows
  • Synchronization
  • Performance/Efficiency
  • BTT vs. DAX

SLIDE 3

NVDIMMs and Persistent Memory

  • NVDIMMs are byte-addressable
  • We won't talk of “Total System Persistence”
  • But using persistent memory DIMMs for storage
  • Drivers to present this as a block device - “pmem”

[Figure: storage hierarchy from CPU caches to DRAM to Persistent Memory to Traditional Storage, trading speed for capacity.]

SLIDE 4

Problem Statement

  • Byte addressability is great
    – But not for writing a sector atomically

[Figure: a write() from userspace passes through the 'pmem' driver (/dev/pmem0) and lands on the NVDIMM as a memcpy(), with nothing guaranteeing that the whole sector is written atomically.]

SLIDE 5

Problem Statement

  • On a power failure, there are three possibilities

    1. No blocks are torn (common on modern drives)
    2. A block was torn, but reads back with an ECC error
    3. A block was torn, but reads back without an ECC error (very rare on modern drives)

  • With pmem, we use memcpy()

    – ECC is correct between two stores
    – Torn sectors will almost never trigger ECC on the NVDIMM
    – Case 3 becomes most common!
    – Only file systems with data checksums will survive this case

SLIDE 6

Naive solution

  • Full Data Journaling
  • Write every block to the journal first
  • 2x latency
  • 2x media wear
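
A minimal sketch of this naive journaling approach (names and layout are illustrative, not any particular file system's journal): every sector is written twice, once into a journal area and once in place, which is where the 2x latency and 2x media wear come from.

```c
#include <stdint.h>
#include <string.h>

/* Naive full-data journaling sketch: the same sector is copied twice.
 * 'journal' and 'data' stand for persistent-memory mappings; flush/fence
 * steps are shown only as comments. */
static void journaled_sector_write(uint8_t *journal, uint8_t *data,
                                   uint64_t lba, const void *buf, size_t blk_size)
{
    memcpy(journal + lba * blk_size, buf, blk_size);  /* 1st copy: journal */
    /* persist the journal copy before touching the live block */
    memcpy(data + lba * blk_size, buf, blk_size);     /* 2nd copy: in place */
    /* persist, then the journal entry can be retired */
}
```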
SLIDE 7

Slightly better solution

  • Maintain an 'on-disk' indirection table and an in-memory free block list
  • The map/indirection table has LBA -> actual block offset mappings
  • New writes grab a block from the free list
  • On completing the write, atomically swap the free list entry and the map entry

[Figure: NVDIMM layout showing the data blocks, the on-media Map (LBA -> actual block offset), and the in-memory Free List.]
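
A minimal sketch of this scheme (all names are illustrative, not from the actual pmem driver): the map lives on media, the free list lives only in memory, and completing a write is a single 64-bit store into the map followed by recycling the old block.

```c
#include <stdint.h>
#include <string.h>

#define BLK_SIZE 512

struct simple_btt {
    uint8_t  *data;       /* base of the data area on the NVDIMM */
    uint64_t *map;        /* persistent: external LBA -> actual block offset */
    uint64_t *free_list;  /* in-memory only: blocks not referenced by the map */
    uint64_t  nfree;
};

/* Write one sector: copy data into a free block, then swap that block
 * into the map with a single 64-bit store. */
static void sector_write(struct simple_btt *b, uint64_t lba, const void *buf)
{
    uint64_t new_blk = b->free_list[0];              /* grab a free block (index 0 for simplicity) */
    memcpy(b->data + new_blk * BLK_SIZE, buf, BLK_SIZE);
    /* ...persist the data (flush + fence) before updating the map... */

    uint64_t old_blk = b->map[lba];
    b->map[lba] = new_blk;                           /* the atomic switch-over */
    b->free_list[0] = old_blk;                       /* recycle the old block */
}
```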

SLIDE 8

Slightly better solution

(Same bullets as the previous slide.)

[Figure: a write() to LBA 3 is in progress; the data goes into a block taken from the free list, while the map still points LBA 3 at its old block.]

SLIDE 9

Slightly better solution

(Same bullets as the previous slide.)

[Figure: the write to LBA 3 completes; the map entry for LBA 3 and the free list entry are swapped, so the newly written block is now mapped and the old block becomes free.]

SLIDE 10

Slightly better solution

  • Easy enough to implement
  • Should be performant
  • Caveat:

    – The only way to recreate the free list is to read the entire map
    – Consider a 512GB volume, bs=512 => reading 1073741824 map entries
    – Map entries have to be 64-bit, so we end up reading 8GB at startup
    – Could save the free list to media on clean shutdown
    – But...clunky at best

SLIDE 11

Introduction · The Block Translation Table · Read and Write Flows · Synchronization · Performance/Efficiency · BTT vs. DAX

SLIDE 12

The Block Translation Table

  • nfree: The number of free blocks in reserve
  • Flog: Portmanteau of free list + log
    – Has nfree entries
    – Each entry has two 'slots' that 'flip-flop'
    – Each slot has: the block being written, the old mapping, the new mapping, and a sequence number
  • Info block: Info about the arena - offsets, lbasizes, etc.
  • External LBA: LBA as visible to upper layers
  • ABA: Arena Block Address - block offset within an arena
  • Premap/Postmap ABA: The block offset into the data area as seen prior to/post indirection from the map

[Figure: the backing store is divided into 512G arenas; each arena contains an Arena Info Block (4K), Data Blocks, the BTT Map, an Info Block Copy (4K), and the BTT Flog (8K), with nfree blocks held in reserve.]
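
A sketch of what one flog entry could look like, following the description above (field widths and names are illustrative, not the BTT on-media layout):

```c
#include <stdint.h>

/* One slot of a flog entry: the premap ABA being written, the old and new
 * postmap ABAs, and a small sequence number. Widths are illustrative only. */
struct flog_slot {
    uint32_t lba;      /* premap ABA: the external block being written */
    uint32_t old_map;  /* postmap ABA the map pointed to before the write */
    uint32_t new_map;  /* postmap ABA the data was just written to */
    uint32_t seq;      /* 2-bit sequence number, cycling 01 -> 10 -> 11 -> 01 */
};

/* Each flog entry has two slots that 'flip-flop': writes alternate between
 * them, and the slot holding the newer sequence number is the current one. */
struct flog_entry {
    struct flog_slot slot[2];
};

/* The flog as a whole has nfree entries, one per reserved free block. */
```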

SLIDE 13

What's in a lane?

  • The idea of “lanes” is purely logical
  • num_lanes = min(num_cpus, nfree)
  • lane = cpu % num_lanes
  • If num_cpus > num_lanes, we need locking on lanes

– But if not, we can simply preempt_disable() and need not take a lock

[Figure: CPU 0, 1, and 2 each call get_lane() and get lanes 0, 1, and 2; each lane indexes its own Free List entry (blk, seq, slot) and its own flog entry (two slots of lba/old/new/seq), while the Map is shared by all lanes.]
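
A sketch of the lane math described above (illustrative, not the kernel code):

```c
/* Lanes are purely logical: a way to spread CPUs across the nfree
 * free-list/flog entries. */
static unsigned int num_lanes(unsigned int num_cpus, unsigned int nfree)
{
    return num_cpus < nfree ? num_cpus : nfree;   /* min(num_cpus, nfree) */
}

static unsigned int get_lane(unsigned int cpu, unsigned int lanes)
{
    return cpu % lanes;
}

/* If num_cpus > num_lanes, two CPUs can share a lane, so each lane needs a
 * lock. Otherwise a CPU owns its lane exclusively, so disabling preemption
 * while the lane is in use is enough; no lock is needed. */
```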

SLIDE 14

Introduction · The Block Translation Table · Read and Write Flows · Synchronization · Performance/Efficiency · BTT vs. DAX

SLIDE 15

BTT – Reading a block

  • Convert external LBA to Arena number + pre-map ABA
  • Get a lane (and take lane_lock if needed)
  • Read map to get the mapping
  • If ZERO flag is set, return zeroes
  • If ERROR flag is set, return an error
  • Read data from the block that the map points to
  • Release lane (and lane_lock)

[Worked example, CPU 0 on lane 0: read() of LBA 5; the Map gives 5 -> 10; data is read from block 10; lane 0 is released.]
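
A sketch of the read path following the steps above, over a simplified in-memory model of one arena (the flag bits, helper names, and structures are illustrative; this is not the kernel's btt.c, and lane handling is reduced to comments):

```c
#include <stdint.h>
#include <string.h>

#define MAP_ERROR_FLAG (1u << 30)   /* illustrative flag bits, not the on-media encoding */
#define MAP_ZERO_FLAG  (1u << 31)
#define MAP_ABA_MASK   0x3fffffffu

struct arena {
    uint32_t *map;       /* premap ABA -> postmap ABA plus flag bits */
    uint8_t  *data;      /* data blocks */
    uint32_t  blk_size;
};

/* Read path following the steps above. */
static int btt_read(struct arena *arena, uint32_t premap, void *buf)
{
    /* get a lane here; take lane_lock if num_cpus > num_lanes */
    uint32_t entry = arena->map[premap];

    if (entry & MAP_ZERO_FLAG) {
        memset(buf, 0, arena->blk_size);          /* never-written block: return zeroes */
    } else if (entry & MAP_ERROR_FLAG) {
        return -1;                                /* block marked bad: return an error */
    } else {
        uint32_t postmap = entry & MAP_ABA_MASK;
        memcpy(buf, arena->data + (size_t)postmap * arena->blk_size, arena->blk_size);
    }
    /* release the lane (and lane_lock) here */
    return 0;
}
```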

SLIDE 16

BTT – Writing a block

  • Convert external LBA to Arena number + pre-map ABA
  • Get a lane (and take lane_lock if needed)
  • Use lane to index into free list, write data to this free block
  • Read map to get the existing mapping
  • Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq]
  • Write new post-map ABA into map
  • Write old post-map entry into the free list
  • Calculate next sequence number and write into the free list entry
  • Release lane (and lane_lock)

[Worked example, CPU 0 on lane 0:
  Free List[0] = {blk 2, seq 0b10}
  write() to LBA 5; data is written to free block 2
  Map (old): 5 -> 10
  flog[0][0] = {5, 10, 2, 0b10}
  map[5] = 2 (Map now: 5 -> 2)
  free[0] = {10, 0b11, 1}
  Release lane 0]
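
A sketch of the write path in the same simplified model (illustrative only; persistence flushes/fences between steps and the real flog layout are omitted). Note the ordering: data first, then the flog, then the map, then the free-list entry.

```c
#include <stdint.h>
#include <string.h>

struct free_entry { uint32_t block; uint32_t seq; };

struct arena {
    uint32_t          *map;        /* premap ABA -> postmap ABA */
    uint8_t           *data;       /* data blocks */
    uint32_t           blk_size;
    struct free_entry *free_list;  /* one entry per lane, in memory */
};

/* Placeholder: persist {premap, old, new, seq} into this lane's flog slot. */
static void flog_write(uint32_t lane, uint32_t premap, uint32_t old_post,
                       uint32_t new_post, uint32_t seq)
{
    (void)lane; (void)premap; (void)old_post; (void)new_post; (void)seq;
}

static uint32_t next_seq(uint32_t seq) { return seq == 3 ? 1 : seq + 1; }

static void btt_write(struct arena *a, uint32_t lane, uint32_t premap,
                      const void *buf)
{
    /* lane is already held; take lane_lock first if num_cpus > num_lanes */
    struct free_entry *fe = &a->free_list[lane];

    uint32_t new_post = fe->block;                    /* this lane's free block */
    memcpy(a->data + (size_t)new_post * a->blk_size, buf, a->blk_size);

    uint32_t old_post = a->map[premap];               /* existing mapping */

    flog_write(lane, premap, old_post, new_post, fe->seq);  /* flog first... */
    a->map[premap] = new_post;                              /* ...then the map */

    fe->block = old_post;                             /* old block is now free */
    fe->seq   = next_seq(fe->seq);                    /* 01 -> 10 -> 11 -> 01 */
    /* release the lane (and lane_lock) */
}
```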

SLIDE 17

BTT – Analysis of a write

[Write example from Slide 16, repeated.]

Opportunities for interruption/power failure

SLIDE 18

BTT – Analysis of a write

[Write example from Slide 16, repeated.]

  • On reboot:
    – No on-disk change had happened; everything comes back up as normal

SLIDE 19

BTT – Analysis of a write

[Write example from Slide 16, repeated.]

  • On reboot:
    – Map hasn't been updated
    – Reads will continue to get the 5 → 10 mapping
    – Flog will still show '2' as free and ready to be written to

SLIDE 20

BTT – Analysis of a write

[Write example from Slide 16, repeated.]

  • On reboot:
    – Read flog[0][0] = {5, 10, 2, 0b10}
    – Flog claims map[5] should have been '2', but map[5] is still '10' (== flog.old)
    – Since flog and map disagree, the recovery routine detects an incomplete transaction
    – Flog is assumed to be “true” since it is always written before the map
    – Recovery routine completes the transaction by updating map[5] = 2; free[0] = 10
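
A sketch of that recovery check for one flog entry (illustrative, not the kernel's recovery code; 'cur' is assumed to be the newer valid slot of the pair, as discussed on the next slide):

```c
#include <stdint.h>

struct flog_slot { uint32_t lba, old_map, new_map, seq; };

/* Startup recovery for one flog entry. The flog is written before the map,
 * so if the two disagree the flog wins and the transaction is rolled forward. */
static uint32_t recover_flog_entry(uint32_t *map, const struct flog_slot *cur)
{
    if (map[cur->lba] == cur->old_map) {
        /* The crash landed between the flog write and the map write:
         * complete the interrupted transaction. */
        map[cur->lba] = cur->new_map;
    }
    /* Whether or not we rolled forward, old_map is the block that now seeds
     * this lane's in-memory free-list entry. */
    return cur->old_map;
}
```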

SLIDE 21

BTT – Analysis of a write

[Write example from Slide 16, repeated.]

  • Special case: the flog write itself is torn
  • On reboot:
    – Read flog[0][0] = {5, 10, X, 0b11}; flog[0][1] = {X, X, X, 0b01}
    – Since seq is written last, the half-written flog entry does not show up as “new”
    – The free list is reconstructed using the newest non-torn flog entry, flog[0][1] in this case
    – map[5] remains '10', and '2' remains free
  • Bit sequence for flog.seq: 01 -> 10 -> 11 -> 01 (old to new, wrapping around)
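
A sketch of the sequence-number rules above (illustrative): seq cycles 01 -> 10 -> 11 -> 01, and of a flog entry's two slots, the one whose seq is the successor of the other's is the newer one.

```c
#include <stdint.h>

/* 2-bit sequence number cycling 0b01 -> 0b10 -> 0b11 -> 0b01;
 * 0b00 is treated as uninitialized in this sketch. */
static uint32_t next_seq(uint32_t seq)
{
    return seq == 3 ? 1 : seq + 1;
}

/* Return the index (0 or 1) of the newer slot, given the two seq values.
 * Valid slot pairs always differ by exactly one step of the cycle. */
static int newer_slot(uint32_t seq0, uint32_t seq1)
{
    return next_seq(seq0) == seq1 ? 1 : 0;
}
```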

SLIDE 22

BTT – Analysis of a write

[Write example from Slide 16, repeated.]

  • On reboot:
    – Since both flog and map were updated, free list reconstruction will happen as usual

SLIDE 23

Introduction · The Block Translation Table · Read and Write Flows · Synchronization · Performance/Efficiency · BTT vs. DAX

SLIDE 24

Let's Race! Write vs. Write

  CPU 1                          CPU 2
  write LBA 0                    write LBA 0
  get-free[1] = 5                get-free[2] = 6
  write data - postmap ABA 5     write data - postmap ABA 6
  ...                            ...
  read old_map[0] = 10           read old_map[0] = 10
  write log 0/10/5/xx            write log 0/10/6/xx
  write map = 5                  write map = 6
  write free[1] = 10             write free[2] = 10

SLIDE 25

Let's Race! Write vs. Write

(Same interleaving as the previous slide.)

SLIDE 26

Let's Race! Write vs. Write

(Same interleaving as Slide 24; the span from reading old_map[0] through writing the log, the map, and the free list is the critical section.)

  • Because both CPUs read old_map[0] = 10, both put block 10 back on their free lists and one of the freshly written blocks (5 or 6) is orphaned; this read-modify-write of the map entry has to be serialized

SLIDE 27

Let's Race! Write vs. Write

  • Solution: An array of map_locks indexed by a hash of the premap ABA

  CPU 1                                       CPU 2
  write LBA 0; get-free[1] = 5                write LBA 0; get-free[2] = 6
  write_data to 5                             write_data to 6
  lock map_lock[0 % nfree]
  read old_map[0] = 10
  write log 0/10/5/xx; write map = 5
  free[1] = 10
  unlock map_lock[0 % nfree]
                                              lock map_lock[0 % nfree]
                                              read old_map[0] = 5
                                              write log 0/5/6/xx; write map = 6
                                              free[2] = 5
                                              unlock map_lock[0 % nfree]
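
A sketch of this fix (pthread mutexes stand in for kernel spinlocks; all names are illustrative): the lock array is indexed by a hash of the premap ABA, so two writers to the same external LBA serialize the read-map / write-flog / write-map / update-free-list section, while writers to different LBAs usually take different locks.

```c
#include <stdint.h>
#include <pthread.h>

#define NFREE 256   /* illustrative: one lock per free-list entry */

static pthread_mutex_t map_lock[NFREE];

static void map_locks_init(void)
{
    for (int i = 0; i < NFREE; i++)
        pthread_mutex_init(&map_lock[i], NULL);
}

/* Simple hash: premap ABA modulo the number of locks. */
static pthread_mutex_t *map_lock_for(uint32_t premap)
{
    return &map_lock[premap % NFREE];
}

/* Write-path usage:
 *
 *   pthread_mutex_lock(map_lock_for(premap));
 *   old_post = map[premap];                     // read the old mapping
 *   // write flog, write map[premap] = new_post, free_list[lane] = old_post
 *   pthread_mutex_unlock(map_lock_for(premap));
 */
```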

SLIDE 28

Let's Race! Read vs. Write

  CPU 1 (Reader)                    CPU 2 (Writer)
  read LBA 0                        write LBA 0
  ...                               get-free[2] = 6
  read map[0] = 5                   write data to postmap block 6
  start reading postmap block 5     write meta: map[0] = 6, free[2] = 5
  ...                               another write LBA 12
  ...                               get-free[2] = 5
  ...                               write data to postmap block 5
  finish reading postmap block 5

BUG! – writing a block that is being read from

  • This doesn't corrupt on-disk layout, but the read appears torn
SLIDE 29

Let's Race! Read vs. Write

  CPU 1 (Reader)                    CPU 2 (Writer)
  read LBA 0                        write LBA 0
  read map[0] = 5                   get-free[2] = 6; write data
  write rtt[1] = 5                  write meta: map[0] = 6, free[2] = 5
  start reading postmap block 5     another write LBA 12
  ...                               get-free[2] = 5
  ...                               scan RTT - '5' is present - wait!
  finish reading postmap block 5    ...
  clear rtt[1]                      ...
  ...                               write data to postmap block 5

  • Solution: A Read Tracking Table indexed by lane, tracking in-progress reads
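
A sketch of the Read Tracking Table described above (illustrative; a real implementation would use proper atomics and memory barriers rather than plain volatile accesses): readers publish the postmap block they are reading in their lane's RTT slot, and a writer about to reuse a free block waits until no RTT slot holds that block.

```c
#include <stdint.h>

#define NUM_LANES   64            /* illustrative */
#define RTT_INVALID 0xffffffffu   /* "no read in flight on this lane" */

static volatile uint32_t rtt[NUM_LANES];

/* Reader side: announce/withdraw the postmap block being read. */
static void rtt_publish(unsigned int lane, uint32_t postmap) { rtt[lane] = postmap; }
static void rtt_clear(unsigned int lane)                     { rtt[lane] = RTT_INVALID; }

/* Writer side: before writing data into a just-acquired free block,
 * wait until no in-flight read is still targeting it. */
static void wait_for_readers(uint32_t postmap)
{
    for (unsigned int i = 0; i < NUM_LANES; i++)
        while (rtt[i] == postmap)
            ;   /* spin until the reader on lane i moves off this block */
}
```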
SLIDE 30

Introduction · The Block Translation Table · Read and Write Flows · Synchronization · Performance/Efficiency · BTT vs. DAX

SLIDE 31

That's Great...but is it Fast?

  • Overall, the BTT introduces a ~10% performance overhead
  • We think there is still room for improvement

                         512B block size    4K block size
  Write amplification    ~4.6% [536B]       ~0.5% [4120B]
  Capacity overhead      ~0.8%              ~0.1%

SLIDE 32

Introduction · The Block Translation Table · Read and Write Flows · Synchronization · Performance/Efficiency · BTT vs. DAX

SLIDE 33

BTT vs. DAX

  • DAX stands for Direct Access
  • Patchset by Matthew Wilcox, merged into 4.0-rc1
  • Allows mapping a pmem range directly into userspace via mmap
  • DAX is fundamentally incompatible with the idea of BTT
  • If the application is aware of persistent, byte-addressable memory, and can use it to its advantage, DAX is the best path for it
  • If the application relies on atomic sector update semantics, it must use the BTT
    – It may not know that it relies on this...
  • XFS relies on journal updates being sector atomic
    – For xfs-dax, we'd need to use logdev=/dev/[btt-partition]

SLIDE 34

Resources

  • http://pmem.io - General persistent memory resources. Focuses on the NVML, a library to make persistent memory programming easier
  • The 'pmem' driver on github: https://github.com/01org/prd
  • linux-nvdimm mailing list: https://lists.01.org/mailman/listinfo/linux-nvdimm
  • linux-nvdimm patchwork: https://patchwork.kernel.org/project/linux-nvdimm/list/
  • #pmem on OFTC
SLIDE 35

Q & A