 
Failure-atomic Synchronization-free Regions
Vaibhav Gogte, Stephan Diestelhorst$, William Wang$, Satish Narayanasamy, Peter M. Chen, Thomas F. Wenisch
NVMW 2018, San Diego, CA, 03/13/2018
Promise of persistent memory (PM)
• Non-volatility
• Performance
• Density
• Byte-addressable, load-store interface to storage
Persistent memory system
[Figure: Cores 1 to 4, each with a private L1 cache, share an LLC backed by DRAM and persistent memory (PM); a recovery process reads PM.]
Recovery can inspect the data structures in PM to restore the system to a consistent state.
Memory persistency models
• Provide the guarantees required for recoverable software
  – Academia [Condit ‘09][Pelley ‘14][Joshi ‘15][Kolli ‘17] …
  – Industry [Intel ‘14][ARM ‘16]
• Provide primitives that define the state of the program after a failure
• Govern ordering constraints on persists to PM
• Ensure failure atomicity for a group of persists
Semantics for failure-atomicity
• Assure that either all or none of the updates are visible after a failure
• Guaranteed by hardware, library, or language implementations
• Design space for semantics:
  – Individual persists [Intel ‘14][ARM ‘16][Joshi ‘15]…: do not provide clean failure semantics
  – Transactions [Coburn ‘11][Volos ‘11]…: do not compose well with general sync. primitives
  – Outermost critical sections [Chakrabarti ‘14][Boehm ‘16]: have high performance overhead
Existing mechanisms suffer a trade-off between programmability and performance.
Contributions
• Failure-atomic Synchronization-free Regions (SFR)
  – Extend clean SC semantics to post-failure recovery
  – Employ existing synchronization primitives in C++
• Propose failure-atomicity as a part of the language implementation
  – Build a compiler pass that emits logging code for persistent accesses
• Propose two designs: Coupled-SFR and Decoupled-SFR
• Achieve 65% better performance than the state-of-the-art technique (Atlas)
Outline
• Design space for granularities of failure-atomicity
• Our proposal: Failure-atomic SFRs
  – Coupled-SFR design
  – Decoupled-SFR design
• Evaluation
Language-level persistency models [Chakrabarti ’14][Kolli ’17]
• Enable writing portable, recoverable software
• Extend the language memory model with persistency semantics
• Persistency model guarantees:
  – Ordering: How can programmers order persists?
  – Failure-atomicity: Which groups of stores persist atomically?
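Persist ordering
  Thread 1: L1.acq(); A = 100; L1.rel();
  Thread 2: L1.acq(); B = 200; L1.rel();
Thread 1's L1.rel() synchronizes-with (sw) Thread 2's L1.acq(), so St A <hb St B, which implies St A <p St B: A must persist before B.
[Figure: Core-1 and Core-2 with private L1 caches, a shared LLC, and PM holding a and b.]
Ascribe persist ordering using the synchronization primitives of the language.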
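A minimal C++ sketch of this ordering rule, under a toy model that is our assumption and not the paper's runtime: each store to "persistent" data is appended to persist_order while L1 is held, so the recorded order follows happens-before. Thread 2 retries until its acquire observes A == 100, which makes Thread 1's release synchronize-with that acquire. The names persist_order, thread1, thread2, and run_example are hypothetical.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Toy model (assumption): stores are "persisted" by appending to
// persist_order while L1 is held, so the order follows happens-before.
std::mutex L1;
int A = 0, B = 0;
std::vector<std::string> persist_order;

void thread1() {
  std::lock_guard<std::mutex> g(L1);  // L1.acq()
  A = 100;
  persist_order.push_back("A");
}                                     // L1.rel() when g is destroyed

void thread2() {
  for (;;) {
    {
      std::lock_guard<std::mutex> g(L1);  // L1.acq()
      if (A == 100) {                     // observed Thread 1's store: St A <hb St B
        B = 200;
        persist_order.push_back("B");
        return;                           // L1.rel()
      }
    }                                     // L1.rel() after a failed attempt
    std::this_thread::yield();
  }
}

std::vector<std::string> run_example() {
  A = 0;
  B = 0;
  persist_order.clear();
  std::thread t2(thread2), t1(thread1);
  t1.join();
  t2.join();
  return persist_order;
}
```

Regardless of scheduling, run_example() records A before B, because B's store is only performed after an acquire that synchronizes with the release following A's store.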
Why failure-atomicity?
Task: fill a new node and add it to a linked list, safely.
[Figure: in-memory data; fillNewNode(), then a fence, then updateTailPtr().]
Why failure-atomicity?
Task: fill a new node and add it to a linked list, safely.
[Figure: in-memory data; fillNewNode() and updateTailPtr() performed as one atomic unit.]
Failure-atomicity → easier persistent memory programming.
Granularity of failure-atomicity - I
  L1.lock();
    x -= 100;
    y += 100;
    L2.lock();
      a -= 100;
      b += 100;
    L2.unlock();
  L1.unlock();
Individual persists [Condit ‘09][Pelley ‘14][Joshi ‘16][Kolli ‘17]
• Mechanisms ensure atomicity of individual persists
• A non-sequentially consistent state is visible to recovery → need additional custom logging
Granularity of failure-atomicity - II
  L1.lock();
    x -= 100;
    y += 100;
    L2.lock();
      a -= 100;
      b += 100;
    L2.unlock();
  L1.unlock();
Outermost critical sections [Chakrabarti ‘14][Boehm ‘16]
• Guarantee that recovery observes an SC state → easier to build recovery code
• Require complex dependency tracking between critical sections → more than 2x performance cost
Our proposal: Failure-atomic SFRs
  l1.acq();
    x -= 100;   // SFR1
    y += 100;   // SFR1
  l2.acq();
    a -= 100;   // SFR2
    b += 100;   // SFR2
  l2.rel();
  l1.rel();
Synchronization-free regions (SFR): thread regions delimited by synchronization operations or system calls.
Persistent state atomically moves from one sync. operation to the next.
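The SFR boundaries can be illustrated with a small, hypothetical helper (split_into_sfrs is not part of any real tool) that walks one thread's trace and starts a new region at every synchronization operation. Empty regions between back-to-back sync operations are simply skipped in this sketch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sync covers acquires, releases, and system calls: anything that ends an SFR.
enum class OpKind { Store, Sync };

struct Op {
  OpKind kind;
  std::string text;
};

// Groups one thread's stores into SFRs; every Sync op closes the current region.
std::vector<std::vector<std::string>> split_into_sfrs(const std::vector<Op>& trace) {
  std::vector<std::vector<std::string>> sfrs;
  std::vector<std::string> current;
  for (const Op& op : trace) {
    if (op.kind == OpKind::Sync) {
      if (!current.empty()) sfrs.push_back(current);  // skip empty regions
      current.clear();
    } else {
      current.push_back(op.text);
    }
  }
  if (!current.empty()) sfrs.push_back(current);  // trailing region, if any
  return sfrs;
}
```

For the trace on the slide (l1.acq(); x -= 100; y += 100; l2.acq(); a -= 100; b += 100; l2.rel(); l1.rel();) it yields two SFRs, {x -= 100, y += 100} and {a -= 100, b += 100}, matching SFR1 and SFR2.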
Failure-atomicity of SFRs
• Persist SFRs in sequentially consistent order
• Extends clean SC semantics to post-failure recovery
• Allows hardware/compiler optimizations within an SFR
• Persist ordering: synchronizing acquire and release ops in C++
• Failure-atomicity: undo-logging for SFRs
Two logging designs: Coupled-SFR and Decoupled-SFR
Undo-logging for SFRs
  L1.acq();
    x = 100;    // SFR1, failure-atomic
  L1.rel();
Steps for SFR1: createUndoLog (L) → mutateData (M) → persistData (P) → commitLog (C)
These steps must take effect in this order for SFRs to be failure-atomic.
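A toy C++ model of the L, M, P, C ordering, assuming one std::map stands in for PM and another for volatile cache lines, and that undo-log appends are durable immediately (real code would need cache-line flushes and fences for that). ToyPM, evict, end_sfr, and crash_and_recover are hypothetical names. The evict call mimics hardware writing a dirty line back mid-SFR, which is why the old value must be logged before the mutation.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy model, not real PM: `pm` survives a crash, `cache` does not.
// Undo-log appends are assumed to reach PM immediately.
struct UndoEntry {
  std::string addr;
  int old_value;
};

struct ToyPM {
  std::map<std::string, int> pm;     // durable data
  std::vector<UndoEntry> log;        // durable undo log for the current SFR
  std::map<std::string, int> cache;  // volatile dirty lines

  int read(const std::string& a) {
    auto it = cache.find(a);
    return it != cache.end() ? it->second : pm[a];
  }

  // L then M: record the old value before changing it.
  void store(const std::string& a, int v) {
    log.push_back({a, read(a)});  // createUndoLog (L)
    cache[a] = v;                 // mutateData (M)
  }

  // Hardware may write a dirty line back at any time, even mid-SFR.
  void evict(const std::string& a) {
    auto it = cache.find(a);
    if (it == cache.end()) return;
    pm[a] = it->second;
    cache.erase(it);
  }

  // P then C at the end of the SFR.
  void end_sfr() {
    for (const auto& kv : cache) pm[kv.first] = kv.second;  // persistData (P)
    cache.clear();
    log.clear();                                            // commitLog (C)
  }

  // Crash: dirty lines are lost; recovery undoes uncommitted stores, newest first.
  void crash_and_recover() {
    cache.clear();
    for (auto it = log.rbegin(); it != log.rend(); ++it) pm[it->addr] = it->old_value;
    log.clear();
  }
};
```

Storing x = 100, evicting it, and then crashing leaves pm["x"] back at 0 after recovery; once end_sfr() has run, the same crash leaves 100 in place.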
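Design 1: Coupled-SFR
  Thread 1: L1.acq(); x = 100; L1.rel();   // SFR1
  Thread 2: L1.acq(); x = 200; L1.rel();   // SFR2; Thread 1's release synchronizes-with (sw) this acquire
[Figure: timelines for Thread 1 and Thread 2 showing log creation (L1, L2), mutation (M1, M2), persist (P1, P2), and log commit (C1, C2) for SFR1 and SFR2, relative to REL1 and ACQ2.]
+ Persistent state lags execution by at most one SFR → simpler implementation, latest state at failure
- Need to flush updates at the end of each SFR → high performance cost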
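A sketch of where Coupled-SFR places the logging steps, using hypothetical names (CoupledThread, sfr_acquire, sfr_release) and recording each step in a trace instead of performing real PM writes. Every synchronization operation first ends the current SFR, so P and C complete before the lock is released.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-thread Coupled-SFR state. Instead of writing to PM and
// flushing cache lines, each step is recorded in `trace`.
struct CoupledThread {
  std::vector<std::pair<int*, int>> undo_log;  // (address, old value)
  std::vector<std::string> trace;

  void store(int* addr, int value) {
    undo_log.emplace_back(addr, *addr);
    trace.push_back("L");  // createUndoLog
    *addr = value;
    trace.push_back("M");  // mutateData
  }

  // Ends the current SFR. A real runtime would write back every address in
  // undo_log (e.g. with CLWB) and fence before committing the log.
  void end_sfr() {
    if (undo_log.empty()) return;
    trace.push_back("P");  // persistData
    undo_log.clear();
    trace.push_back("C");  // commitLog
  }
};

// Every sync operation ends an SFR, so P and C sit on the critical path.
void sfr_acquire(CoupledThread& t, std::mutex& m) {
  t.end_sfr();
  m.lock();
  t.trace.push_back("acq");
}

void sfr_release(CoupledThread& t, std::mutex& m) {
  t.end_sfr();
  t.trace.push_back("rel");
  m.unlock();
}
```

For L1.acq(); x = 100; L1.rel(); the trace is acq, L, M, P, C, rel: the flush and commit sit between the store and the release, which is where the high performance cost comes from.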
Design 2: Decoupled-SFR
• Coupled-SFR has a simple design, but lower performance
  – Persists and log commits are on the critical execution path
• Key idea: decouple persistent state from program execution
  – Persist updates and commit logs in the background
  – Create undo logs in order
  – On failure, roll back updates in reverse order of log creation
Decoupled-SFR in action
  Thread 1: L1.acq(); x = 100; L1.rel();   // SFR1
  Thread 2: L1.acq(); x = 200; L1.rel();   // SFR2
[Figure: timelines for Thread 1 and Thread 2; log creation (L1, L2) and mutation (M1, M2) occur during execution, while persists (P1, P2) and commits (C1, C2) occur in the background.]
• Create logs in order during execution
• Flush and commit in the background
• Logs must be committed in order → record the order in which logs are created
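A minimal sketch of the "commit in the background" half of Decoupled-SFR, assuming a single application thread and hypothetical names (BackgroundCommitter, sfr_finished). The critical path only enqueues a finished SFR; a std::thread drains the queue in FIFO order, standing in for the flush and commit steps. Ordering across several application threads is not handled here.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: finished SFRs (by id) are committed in creation order
// by a background thread, off the application's critical path.
class BackgroundCommitter {
 public:
  BackgroundCommitter() : worker_([this] { run(); }) {}
  ~BackgroundCommitter() { stop(); }

  // Called at the end of each SFR: only enqueues, never flushes.
  void sfr_finished(int sfr_id) {
    {
      std::lock_guard<std::mutex> g(m_);
      pending_.push_back(sfr_id);
    }
    cv_.notify_one();
  }

  // Drains everything still pending, then joins the worker.
  void stop() {
    {
      std::lock_guard<std::mutex> g(m_);
      done_ = true;
    }
    cv_.notify_one();
    if (worker_.joinable()) worker_.join();
  }

  std::vector<int> committed() {
    std::lock_guard<std::mutex> g(m_);
    return committed_;
  }

 private:
  void run() {
    std::unique_lock<std::mutex> lk(m_);
    for (;;) {
      cv_.wait(lk, [this] { return done_ || !pending_.empty(); });
      if (pending_.empty()) return;  // stopped and fully drained
      int id = pending_.front();
      pending_.pop_front();
      lk.unlock();
      // A real runtime would flush this SFR's updates (P), then commit its log (C).
      lk.lock();
      committed_.push_back(id);
    }
  }

  std::mutex m_;
  std::condition_variable cv_;
  std::deque<int> pending_;
  std::vector<int> committed_;
  bool done_ = false;
  std::thread worker_;  // declared last so it starts after the other members exist
};
```

The FIFO queue is what keeps commits in creation order for one thread; with several application threads, the recorded creation order has to come from somewhere else, such as the sequence numbers below.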
Log ordering in Decoupled-SFR
  Init: X = 0
  Thread 1: X = 100; L1.rel();
  Thread 2: L1.acq(); X = 200;   // Thread 1's L1.rel() synchronizes-with (sw) this acquire
[Figure: Thread 1's log holds (Store X, old value X = 0) and (Rel L1, Seq = 0); Thread 2's log holds (Acq L1, Seq = 1) and (Store X, old value X = 100); a sequence table tracks the sequence number for L1.]
• Sequence numbers in logs record the inter-thread order of log creation
• Background threads commit logs using the recorded sequence numbers
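The sequence-number scheme can be sketched as follows, assuming one std::atomic counter per lock and one SFR per log; LogRecord, SeqEntry, log_sync, seq_of, and rollback_x are hypothetical names, and the scenario is replayed sequentially in its synchronizes-with order rather than with real threads. Each sync operation takes the lock's next number with fetch_add, which reproduces Seq = 0 for Thread 1's release and Seq = 1 for Thread 2's acquire. On recovery, uncommitted logs are undone highest sequence number first.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <string>
#include <vector>

// For "Store", value is the old value; for "Acq"/"Rel", the sequence number.
struct LogRecord {
  std::string kind;  // "Store", "Acq" or "Rel"
  std::string name;  // variable or lock name
  long value;
};

// One sequence-table entry per synchronization variable (assumed layout).
struct SeqEntry {
  std::atomic<long> next{0};
};

// Sync records on the same lock get unique, increasing numbers across threads.
void log_sync(std::vector<LogRecord>& records, SeqEntry& e,
              const std::string& kind, const std::string& lock) {
  long seq = e.next.fetch_add(1);
  records.push_back({kind, lock, seq});
}

// Simplification: each log holds one SFR with a single sync record.
long seq_of(const std::vector<LogRecord>& records) {
  for (const auto& r : records)
    if (r.kind != "Store") return r.value;
  return -1;
}

struct Example {
  std::vector<LogRecord> t1, t2;
  long x;
};

// The slide's scenario, replayed in synchronizes-with order.
Example run_slide_example() {
  SeqEntry l1;
  Example ex{{}, {}, 0};
  ex.t1.push_back({"Store", "X", ex.x});  // Thread 1: X = 100 (old X = 0)
  ex.x = 100;
  log_sync(ex.t1, l1, "Rel", "L1");       // Thread 1: L1.rel() (Seq = 0)
  log_sync(ex.t2, l1, "Acq", "L1");       // Thread 2: L1.acq() (Seq = 1)
  ex.t2.push_back({"Store", "X", ex.x});  // Thread 2: X = 200 (old X = 100)
  ex.x = 200;
  return ex;
}

// Recovery: undo uncommitted logs, highest sequence number first.
long rollback_x(long x, std::vector<const std::vector<LogRecord>*> logs) {
  std::sort(logs.begin(), logs.end(),
            [](const std::vector<LogRecord>* a, const std::vector<LogRecord>* b) {
              return seq_of(*a) > seq_of(*b);
            });
  for (const auto* records : logs)
    for (auto it = records->rbegin(); it != records->rend(); ++it)
      if (it->kind == "Store" && it->name == "X") x = it->value;
  return x;
}
```

Undoing Thread 2's log (X back to 100) and then Thread 1's (X back to 0) restores the initial state; undoing them in the opposite order would leave X = 100, which is why the creation order has to be recorded.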
Evaluation setup
• Implemented our logging designs in LLVM v3.6.0
  – Instruments stores and sync. ops. to emit undo logs
  – Creates log space for managing per-thread undo logs
  – Launches background threads to flush/commit logs in Decoupled-SFR
• Workloads: write-intensive micro-benchmarks
  – 12 threads, 10M operations
• Performed experiments on an Intel Xeon E5-2683 v3
  – 2 GHz, 12 physical cores, 2-way hyper-threading
Performance evaluation
[Figure: normalized execution time (lower is better) of Atlas [Chakrabarti ’14], Coupled-SFR, Decoupled-SFR, and No-persistency on CQ, SPS, PC, RB-tree, TATP, LL, TPCC, and Mean.]
Decoupled-SFR performs 65% better than the state-of-the-art Atlas design.
Performance evaluation
[Figure: the same normalized execution-time chart for Atlas, Coupled-SFR, Decoupled-SFR, and No-persistency on CQ, SPS, PC, RB-tree, TATP, LL, TPCC, and Mean.]
Coupled-SFR performs better than Decoupled-SFR when SFRs contain fewer stores.
Conclusion
• Failure-atomic synchronization-free regions
  – Persistent state moves from one sync. operation to the next
• Coupled-SFR design
  – Easy to reason about PM state after failure; high performance cost
• Decoupled-SFR design
  – Persistent state lags execution; performs 65% better than Atlas
More details in our full paper, “Persistency for Synchronization-free Regions”, appearing in PLDI ’18.
Thank you! Questions?