Failure-atomic Synchronization-free Regions
Vaibhav Gogte, Stephan Diestelhorst$, William Wang$, Satish Narayanasamy, Peter M. Chen, Thomas F. Wenisch
NVMW 2018, San Diego, CA 03/13/2018
Promise of persistent memory (PM): non-volatility
[System diagram: Core-1 through Core-4, each with a private L1 cache, a shared LLC, DRAM, and Persistent Memory (PM); a recovery process inspects PM]
Recovery can inspect the data structures in PM to restore the system to a consistent state.
– Academia: [Condit ‘09][Pelley ‘14][Joshi ‘15][Kolli ‘17] …
– Industry: [Intel ‘14][ARM ‘16]
Existing mechanisms:
– Individual persists [Intel ‘14][ARM ‘16][Joshi ’15]…: have high performance, but do not provide clean failure semantics
– Outermost critical sections [Chakrabarti ‘14][Boehm ‘16]
– Transactions [Coburn ‘11][Volos ‘11]…: do not compose well with general sync. primitives
Existing mechanisms suffer a trade-off between programmability and performance
– Extend clean semantics to post-failure recovery – Employ existing synchronization primitives in C++
– Build compiler pass that emits logging code for persistent accesses
– Coupled-SFR design
– Decoupled-SFR design
– Ordering: How can programmers order persists?
– Failure-atomicity: Which groups of stores persist atomically?
Thread 1: L1.acq(); A = 100; L1.rel();
Thread 2: L1.acq(); B = 200; L1.rel();
[Diagram: Core-1 and Core-2 with private L1 caches, a shared LLC, and PM; a synchronizes-with (sw) edge from Thread 1's release to Thread 2's acquire]
St A <hb St B (happens-before) ⇒ St A <p St B (persist order)
Ascribe persist ordering using synchronization primitives in language
[Ordering example: fillNewNode() and updateTailPtr() update in-memory data; a fence orders fillNewNode()'s persists before updateTailPtr()'s]
[Failure-atomicity example: fillNewNode() and updateTailPtr() update in-memory data and must persist atomically as a group]
L1.lock();
  x -= 100; y += 100;
  L2.lock();
    a -= 100; b += 100;
  L2.unlock();
L1.unlock();
Individual persists [Condit ‘09][Pelley ‘14][Joshi ‘16][Kolli ‘17]
→ Need additional custom logging
L1.lock();
  x -= 100; y += 100;
  L2.lock();
    a -= 100; b += 100;
  L2.unlock();
L1.unlock();
Outermost critical sections [Chakrabarti ‘14][Boehm ‘16]
→ Easier to build recovery code
→ > 2x performance cost
l1.acq();
x -= 100; y += 100;      // SFR1
l2.acq();
a -= 100; b += 100;      // SFR2
l2.rel();
l1.rel();

Synchronization-free regions (SFRs): thread regions delimited by synchronization operations or system calls. Persistent state atomically moves from one sync. operation to the next.
Two logging designs → Coupled-SFR and Decoupled-SFR
L1.acq(); x = 100; L1.rel();   // SFR1, failure-atomic

Undo-logging steps: createUndoLog (L), mutateData (M), persistData (P), commitLog (C)
Need to ensure the ordering of these steps for SFRs to be failure-atomic.
Thread 1 (SFR1): L1.acq(); x = 100; L1.rel();
Thread 2 (SFR2): L1.acq(); x = 200; L1.rel();

Coupled-SFR:
+ Persistent state lags execution by at most one SFR → simpler implementation, latest state at failure
→ High performance cost
[Timeline: Thread 1 runs SFR1 as L1 M1 P1 C1 then REL1; a synchronizes-with (sw) edge orders it before Thread 2's SFR2, which runs ACQ2 then L2 M2 P2 C2]
– Persists and log commits on the critical execution path
Decoupled-SFR:
– Persist updates and commit logs in background
– Create undo logs in order
– Roll back updates in reverse order of creation on failure
Thread 1 (SFR1): L1.acq(); x = 100; L1.rel();
Thread 2 (SFR2): L1.acq(); x = 200; L1.rel();

[Timeline: Thread 1 runs SFR1 as L1 M1 then REL1; a synchronizes-with (sw) edge orders it before Thread 2's SFR2, which runs ACQ2 then L2 M2; P1 C1 P2 C2 are flushed and committed in the background]
Create logs in order during execution.
Need to commit logs in order → record the order in which logs are created.
Thread 1: X = 100; L1.rel();
Thread 2: L1.acq(); X = 200;
(Initially X = 0; a synchronizes-with (sw) edge runs from Thread 1's release to Thread 2's acquire)

Thread 1 log: [Header] Store X, old value X = 0; Rel L1, Seq = 0
Thread 2 log: [Header] Acq L1, Seq = 1; Store X, old value X = 100
Sequence table: L1: 0 → 1
– Instruments stores and sync. ops. to emit undo logs – Creates log space for managing per-thread undo-logs – Launches background threads to flush/commit logs in Decoupled-SFR
– 12 threads, 10M operations
– 2GHz, 12 physical cores, 2-way hyper-threading
[Chart: normalized execution time for CQ, SPS, PC, RB-tree, TATP, LL, TPCC, and Mean, comparing Atlas [Chakrabarti ’14], Coupled-SFR, Decoupled-SFR, and No-persistency; lower is better]
Decoupled-SFR performs 65% better than the state-of-the-art Atlas design.
[Same chart: normalized execution time per benchmark for Atlas [Chakrabarti ’14], Coupled-SFR, Decoupled-SFR, and No-persistency]
Coupled-SFR performs better than Decoupled-SFR when there are fewer stores per SFR.
– SFRs: persistent state moves from one sync. operation to the next
– Coupled-SFR: easy to reason about PM state after failure; high performance cost
– Decoupled-SFR: persistent state lags execution; performs 65% better than Atlas
More details in our full paper on “Persistency for Synchronization-free Regions” appearing in PLDI ’18