Relaxed Persist Ordering Using Strand Persistency

Vaibhav Gogte, William Wang$, Stephan Diestelhorst$, Peter M. Chen, Satish Narayanasamy, Thomas F. Wenisch. ISCA 2020.


slide-1
SLIDE 1

Relaxed Persist Ordering Using Strand Persistency

Vaibhav Gogte, William Wang$, Stephan Diestelhorst$, Peter M. Chen, Satish Narayanasamy, Thomas F. Wenisch

ISCA 2020

$

slide-4
SLIDE 4

Promise of persistent memory (PM)

Non-volatility, performance, density

“Optane DC Persistent Memory will be offered in packages of up to 512GB per stick.”

“… expanding memory per CPU socket to as much as 3TB.” *

* Source: www.extremetech.com

Byte-addressable, load-store interface to durable storage

slide-7
SLIDE 7

Persistent memory system

System: CPU with writeback caches, DRAM, and persistent memory (PM)

After a failure, recovery can inspect PM data structures to restore the system to a consistent state

slide-13
SLIDE 13

Recovery requires PM access ordering

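Hardware systems provide primitives to express persist order to PM

Stores in program order: St a = x; St b = y

With writeback caches between the CPU and PM, St b = y may reach PM before St a = x

The consistency model orders when stores become visible; the persistency model orders when they persist, which is what recovery relies on

Intel x86 primitives: St a = x; CLWB(a); SFENCE; St b = y; CLWB(b)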
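The effect of the fence can be illustrated with a small model of which stores may have reached PM when a failure hits. This is a minimal sketch under a simplified epoch model, not a description of real x86 hardware: it assumes every store is flushed by a CLWB in its own epoch, and that SFENCE makes all earlier flushed stores persist before any later store. The names `epochs` and `crash_states` are hypothetical helpers introduced only for this illustration.

```python
from itertools import combinations

def epochs(program):
    """Group the stored variables of `program` into epochs separated by SFENCE."""
    result, current = [], []
    for op in program:
        if op == "SFENCE":
            result.append(current)
            current = []
        elif op.startswith("St "):
            current.append(op.split()[1])
    result.append(current)
    return result

def crash_states(program):
    """Enumerate the sets of stores that may have reached PM when a failure hits."""
    states = set()
    eps = epochs(program)
    for k, epoch in enumerate(eps):
        # Assumption: every earlier epoch has fully persisted,
        # and any subset of epoch k may have persisted.
        done = frozenset(s for e in eps[:k] for s in e)
        for r in range(len(epoch) + 1):
            for subset in combinations(epoch, r):
                states.add(done | frozenset(subset))
    return states

unordered = ["St a", "CLWB a", "St b", "CLWB b"]
ordered = ["St a", "CLWB a", "SFENCE", "St b", "CLWB b"]

# Without the fence, recovery may observe b persisted while a is not.
print(frozenset({"b"}) in crash_states(unordered))  # True
# With the fence, that state is unreachable in this model.
print(frozenset({"b"}) in crash_states(ordered))    # False
```

In this model the fenced sequence leaves recovery with only three observable states: nothing persisted, only a persisted, or both persisted.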

slide-17
SLIDE 17

Hardware imposes overly strict constraints

Primitives in existing hardware systems overconstrain PM accesses

Ideal DAG (persists A, B, C):

St A = 1; CLWB(A)
St B = 2; CLWB(B)
St C = 3; CLWB(C)

DAG 1:

St A = 1; CLWB(A)
SFENCE
St B = 2; CLWB(B)
St C = 3; CLWB(C)

DAG 2:

St A = 1; CLWB(A)
St C = 3; CLWB(C)
SFENCE
St B = 2; CLWB(B)
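One way to see the overconstraint is to count the persist orders each DAG permits. The sketch below assumes the ideal DAG orders only A before B and leaves C unordered; the slide graphic is not recoverable from the text, so that edge set is an assumption. The edges for DAG 1 and DAG 2 follow from where the SFENCE sits in each code sequence. `allowed_orders` is a hypothetical helper.

```python
from itertools import permutations

def allowed_orders(nodes, edges):
    """List every order in which `nodes` may persist without violating `edges`."""
    orders = []
    for perm in permutations(nodes):
        pos = {n: i for i, n in enumerate(perm)}
        if all(pos[a] < pos[b] for a, b in edges):
            orders.append("".join(perm))
    return orders

# Assumed ideal DAG: only A -> B.
ideal = allowed_orders("ABC", [("A", "B")])
# DAG 1: SFENCE after A also forces A -> C.
dag1 = allowed_orders("ABC", [("A", "B"), ("A", "C")])
# DAG 2: SFENCE before B also forces C -> B.
dag2 = allowed_orders("ABC", [("A", "B"), ("C", "B")])

print(ideal)  # ['ABC', 'ACB', 'CAB']
print(dag1)   # ['ABC', 'ACB']
print(dag2)   # ['ACB', 'CAB']
```

Either fence placement rules out one persist order that the assumed ideal DAG would allow.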

slide-19
SLIDE 19

Contributions

  • Our proposal: StrandWeaver

– Builds strand persistency model in hardware
– Specifies precise persist ordering constraints

  • Comprises primitives: PersistBarrier, NewStrand, and JoinStrand

– Can encode an arbitrary DAG

  • Map language-level persistency models to ISA-level primitives

– Leverage hardware primitives to build persistency models efficiently

StrandWeaver results in 1.45x (avg.) speedup over Intel x86

slide-20
SLIDE 20

Outline

  • Contributions
  • Example: Failure atomicity
  • Existing hardware vs. strand persistency model
  • Our proposal: StrandWeaver
  • Evaluation


slide-22
SLIDE 22

Example: Failure atomicity

Failure atomicity: Which group of stores persists atomically?

atomic_begin()
x = 100;
y = 200;
atomic_end()

The code between atomic_begin() and atomic_end() is a failure-atomic region

Failure atomicity limits the state that recovery can observe after failure

slide-25
SLIDE 25

Undo logging for failure atomicity

Init: x = 0; y = 0

atomic_begin()
x = 1;
y = 2;
atomic_end()

Steps in the failure-atomic region: persistUndoLog (L), mutateData (M), commitLog (C), persistData (P)

Undo logging steps are ordered to ensure failure atomicity
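To make the role of step ordering concrete, here is a minimal sketch of undo logging with a simulated failure and recovery. It assumes the usual undo-logging discipline (the log entry persists before the data is mutated, and the log is committed only after the data has persisted) and collapses mutateData and persistData into one step that persists immediately. The dictionary-based PM and the helper names are hypothetical and are not taken from the slides.

```python
def undo_log_steps(updates):
    """Return the ordered undo-logging steps for `updates`."""
    steps = []
    for key, value in updates.items():
        steps.append(("L", key, None))   # persist the old value in the undo log
        steps.append(("M", key, value))  # mutate the data (persists at once here)
    steps.append(("C", None, None))      # commit: invalidate the log
    return steps

def apply_step(pm, step):
    """Apply one step to the simulated persistent memory `pm`."""
    kind, key, value = step
    if kind == "L":
        pm["log"][key] = pm["data"][key]
    elif kind == "M":
        pm["data"][key] = value
    else:
        pm["log"].clear()

def recover(pm):
    """Roll back every update whose undo-log entry survived the failure."""
    for key, old in pm["log"].items():
        pm["data"][key] = old
    pm["log"].clear()

def crash_then_recover(updates, crash_after):
    """Run the first `crash_after` steps, fail, then recover."""
    pm = {"data": {"x": 0, "y": 0}, "log": {}}
    for step in undo_log_steps(updates)[:crash_after]:
        apply_step(pm, step)
    recover(pm)
    return pm["data"]

for n in range(6):
    print(n, crash_then_recover({"x": 1, "y": 2}, n))
```

Wherever the failure lands in this model, recovery sees either the initial values (x = 0, y = 0) or both updates (x = 1, y = 2), never a mix.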

slide-28
SLIDE 28

Hardware imposes stricter constraints

atomic_begin()
x = 1;
y = 2;
atomic_end()

SFENCE ordering:

Log(Lx,x)
CLWB(Lx)
SFENCE
Store(x,1)
Log(Ly,y)
CLWB(Ly)
SFENCE
Store(y,2)

Ideal ordering: Log(Lx,x) and CLWB(Lx) are ordered before Store(x,1); Log(Ly,y) and CLWB(Ly) are ordered before Store(y,2); the updates to x and y are not ordered with each other

slide-30
SLIDE 30

StrandWeaver: Hardware Strand Persistency Model

Hardware ISA: ISA primitives PersistBarrier, NewStrand, JoinStrand

Compiler: logging implementations that map to hardware primitives

High-level languages: failure atomicity for language-level persistency models

slide-34
SLIDE 34

StrandWeaver enables persist concurrency

  • Provides primitives to express precise persist order

Example code:

Persist A
PersistBarrier
Persist B
NewStrand
Persist C
JoinStrand
Persist D

PersistBarrier: orders persists within a thread
NewStrand: initiates a new stream of persists (a strand)
JoinStrand: merges prior initiated strands

Strand 0: Persist A, Persist B. Strand 1: Persist C. Persist D follows JoinStrand.
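The three primitives can be read as rules for building a persist-order DAG. The sketch below is one interpretation of the descriptions on this slide, not the hardware definition: PersistBarrier orders earlier persists on the current strand before later ones, NewStrand drops ordering against earlier strands, and JoinStrand orders every earlier persist before every later one. `persist_order` is a hypothetical helper, and the returned edge set includes transitive edges.

```python
def persist_order(program):
    """Return the (earlier, later) persist-order edges implied by `program`."""
    edges = set()
    seen = []    # every persist issued so far, on any strand
    after = []   # persists that later persists on this strand must follow
    epoch = []   # persists on this strand since the last PersistBarrier
    joined = []  # persists ordered before everything after the last JoinStrand
    for op in program:
        if op == "PersistBarrier":
            after, epoch = after + epoch, []
        elif op == "NewStrand":
            # A new strand keeps only the ordering imposed by an earlier join.
            after, epoch = list(joined), []
        elif op == "JoinStrand":
            joined = list(seen)
            after, epoch = list(seen), []
        else:  # "Persist X"
            name = op.split()[1]
            edges.update((p, name) for p in after)
            seen.append(name)
            epoch.append(name)
    return edges

program = ["Persist A", "PersistBarrier", "Persist B",
           "NewStrand", "Persist C", "JoinStrand", "Persist D"]
print(sorted(persist_order(program)))
# [('A', 'B'), ('A', 'D'), ('B', 'D'), ('C', 'D')]
```

Under this reading, C carries no ordering against A or B, which is the persist concurrency the slide title refers to.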

slide-37
SLIDE 37

StrandWeaver architecture

Components: CPU, Load-Store Queue, Persist Queue, Strand Buffer Unit (strand buffers SB0, SB1, …, SBn), L1 Cache

Persist queue

  • Manages ongoing StrandWeaver primitives
  • Orders CLWBs separated by JoinStrand

Strand Buffer Unit

  • Issues CLWBs and flushes dirty cache lines
  • Ensures CLWBs on different strands are concurrent
  • Monitors coherence requests for inter-thread order

slide-38
SLIDE 38

Running example

Example code: CLWB(A); NewStrand; CLWB(B); JoinStrand; CLWB(C)

Persist queue entries (with buffer index): CLWB(A) in SB0; NewStrand; CLWB(B) in SB1; JoinStrand; CLWB(C)

Steps shown across the animation:

  • CLWB(A) is placed in strand buffer SB0
  • After NewStrand, CLWB(B) is placed in strand buffer SB1
  • CLWBs A and B flush data concurrently
  • JoinStrand stalls until prior CLWBs complete
  • Ack. for CLWBs A and B
  • CLWB(C) moves to the Strand Buffer Unit once JoinStrand completes
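The timing behaviour of the running example can be approximated with a toy schedule. This is a sketch under stated assumptions, not the StrandWeaver microarchitecture: every CLWB takes a fixed `flush_latency`, each NewStrand moves to the next strand buffer without any limit on the number of buffers, and a CLWB after JoinStrand stays in the current buffer. `schedule` and its parameters are hypothetical.

```python
def schedule(program, flush_latency=100):
    """Return {line: (buffer, issue_time, done_time)} for each CLWB in `program`."""
    times = {}
    buffer = 0        # strand buffer receiving the current strand's CLWBs
    ready = 0         # earliest time the next CLWB on this strand may issue
    join_time = 0     # completion time of the most recent JoinStrand
    strand_done = []  # completion times on the current strand
    all_done = []     # completion times on every strand
    for op in program:
        if op == "PersistBarrier":
            ready = max(strand_done + [ready])
        elif op == "NewStrand":
            buffer, ready, strand_done = buffer + 1, join_time, []
        elif op == "JoinStrand":
            # Stall until every earlier CLWB has completed.
            join_time = max(all_done + [join_time])
            ready, strand_done = join_time, []
        else:  # "CLWB(X)"
            line = op[len("CLWB("):-1]
            done = ready + flush_latency
            times[line] = (buffer, ready, done)
            strand_done.append(done)
            all_done.append(done)
    return times

example = ["CLWB(A)", "NewStrand", "CLWB(B)", "JoinStrand", "CLWB(C)"]
for line, (buf, issue, done) in schedule(example).items():
    print(f"CLWB({line}): SB{buf}, issue={issue}, done={done}")
```

With the default latency, A and B both complete at time 100 in separate buffers, and C issues only at time 100 after the join.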

slide-47
SLIDE 47

StrandWeaver: From ISA to high-level language

Hardware ISA: ISA primitives PersistBarrier, NewStrand, JoinStrand

Compiler: logging implementations that map to hardware primitives

High-level languages: failure atomicity for language-level persistency models

slide-50
SLIDE 50

Logging using StrandWeaver primitives

atomic_begin()
x = 1;
y = 2;
atomic_end()

Strand 0:

Log(Lx,x)
CLWB(Lx)
PersistBarrier
Store(x,1)

NewStrand

Strand 1:

Log(Ly,y)
CLWB(Ly)
PersistBarrier
Store(y,2)

JoinStrand

Resulting order: Log(Lx,x) and CLWB(Lx) before Store(x,1); Log(Ly,y) and CLWB(Ly) before Store(y,2); the two strands persist concurrently
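The mapping from a failure-atomic region to StrandWeaver primitives follows a regular pattern, sketched below as a small code generator. It mirrors only the sequence shown on this slide: one strand per update, a PersistBarrier between the log flush and the store, and a JoinStrand at the end. It is an illustration, not the authors' compiler pass, and `strand_undo_log` is a hypothetical name.

```python
def strand_undo_log(updates):
    """Emit the primitive sequence for a list of (variable, value) updates."""
    ops = []
    for i, (var, value) in enumerate(updates):
        if i > 0:
            ops.append("NewStrand")  # each later update gets its own strand
        ops += [
            f"Log(L{var},{var})",    # record the old value in the undo log
            f"CLWB(L{var})",         # flush the log entry
            "PersistBarrier",        # log persists before the in-place update
            f"Store({var},{value})",
        ]
    ops.append("JoinStrand")         # merge all strands at the end of the region
    return ops

for op in strand_undo_log([("x", 1), ("y", 2)]):
    print(op)
```

For the two updates on this slide the generator reproduces the sequence above: two PersistBarriers, one NewStrand, and a closing JoinStrand.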

slide-54
SLIDE 54

High-level language implementations

ATLAS [Chakrabarti14]

  • Failure-atomic outermost critical sections

Coupled-SFR [Gogte18]

  • Failure-atomic synchronization-free regions

Decoupled-SFR [Gogte18]

  • Failure-atomic synchronization-free regions

Example code:

L1.lock();
x -= 100;
y += 100;
L2.lock();
a -= 100;
b += 100;
L2.unlock();
L1.unlock();

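The difference between the two region granularities can be shown on the lock example from this slide. The sketch below assumes that a synchronization-free region is the run of statements between consecutive lock or unlock operations, and that an outermost critical section spans from the first lock acquired to the matching final unlock. Both helpers are hypothetical and ignore everything else these models specify.

```python
def sfr_regions(stmts):
    """Split `stmts` into synchronization-free regions at lock/unlock calls."""
    regions, current = [], []
    for s in stmts:
        if s.endswith((".lock()", ".unlock()")):
            if current:
                regions.append(current)
            current = []
        else:
            current.append(s)
    if current:
        regions.append(current)
    return regions

def outermost_critical_sections(stmts):
    """Group statements by the outermost critical section that encloses them."""
    regions, current, depth = [], [], 0
    for s in stmts:
        if s.endswith(".lock()"):
            depth += 1
        elif s.endswith(".unlock()"):
            depth -= 1
            if depth == 0:  # the outermost lock was just released
                regions.append(current)
                current = []
        elif depth > 0:
            current.append(s)
    return regions

code = ["L1.lock()", "x -= 100", "y += 100", "L2.lock()",
        "a -= 100", "b += 100", "L2.unlock()", "L1.unlock()"]
print(outermost_critical_sections(code))
# [['x -= 100', 'y += 100', 'a -= 100', 'b += 100']]
print(sfr_regions(code))
# [['x -= 100', 'y += 100'], ['a -= 100', 'b += 100']]
```

Under these assumptions the outermost-critical-section view yields one failure-atomic region covering all four updates, while the SFR view yields two smaller ones.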
slide-55
SLIDE 55

Methodology

  • gem5 simulator
  • Micro-benchmarks:

– Queue: insert/delete entries in a queue
– Hashmap: update values in persistent hash table
– Array swaps: random swaps of array elements
– RBTree: insert/delete entries in red-black tree
– TPCC: new order transaction from TPCC

  • Benchmarks:

– N-Store [Arulraj15]: persistent KV-Store benchmark

slide-56
SLIDE 56

Performance comparison with Intel x86

[Figure: bar chart of speedup (axis 0.5 to 2.5) for Queue, Hashmap, Array Swap, RB-Tree, TPCC, N-Store, and Mean; series: Intel x86, HOPS, StrandWeaver, Non-atomic. Chart callouts across the builds: 1.5x, 1.9x, 1.2x, 4%]

  • StrandWeaver achieves avg. speedup of 1.5x compared to the baseline
  • StrandWeaver achieves avg. speedup of 1.2x over HOPS
  • StrandWeaver performance is within 4% of non-atomic design

slide-59
SLIDE 59

Conclusion

  • Strand persistency to precisely order persists
  • Three primitives: PersistBarrier, NewStrand and JoinStrand

– Work together to relax ordering constraints in undo logging

  • Evaluation using language-level persistency models
  • Performance improvement of 1.45x average over Intel x86

