Efficient Architectural Support for Persistent Memory, Vijay Nagarajan (PowerPoint PPT presentation)


SLIDE 1

Efficient Architectural Support for Persistent Memory

Vijay Nagarajan

SLIDE 2

People

Marcelo Cintra (Intel), Arpit Joshi (Edinburgh), Stratis Viglas (Google)


SLIDE 6

Emerging System

[Diagram: Core → Cache → NVM → Secondary Storage (hardware controlled), alongside Core → Cache → DRAM → Secondary Storage (software controlled).]

Need Efficient Persistency Primitives!

SLIDE 7

Outline

  • Emergence of Persistent Memory
  • Ordering Persist Operations [MICRO ’15]
  • Atomic Durability [HPCA ’17]
  • Atomic (durability + visibility): Durable HTM
  • Conclusions


SLIDE 8

Linked List - Naïve

[Diagram: linked list HEAD → Node 1 → Node 2, shown both in the Cache and in NVM, with the new Node being inserted.]

Pseudo-code

  • 1. Create Node
  • 2. Update Node Pointer
  • 3. Update Head Pointer


SLIDE 10

Linked List - Naïve

System Crash!

Reordering of writes to NVM renders data inconsistent.

SLIDE 11

Linked List - Failsafe

[Diagram: linked list HEAD → Node 1 → Node 2 in the Cache and in NVM.]

Pseudo-code

  • 1. Create Node
  • 2. Update Node Pointer
  • 3. Persist Barrier
  • 4. Update Head Pointer
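The role of the persist barrier can be sketched with a toy model (all names here are hypothetical; this simulates the slide's failure mode rather than real hardware): stores are buffered in a volatile cache and may reach NVM in any order, so a crash can persist the head pointer before the node unless a barrier drains the pending stores first.

```python
# Toy model of write-back reordering. Stores are buffered in a volatile
# cache; at a crash, any subset of still-pending stores may have reached
# NVM. A persist barrier drains every pending store before continuing.
import itertools

def crash_states(program):
    """Enumerate the NVM states observable if a crash hits mid-program."""
    nvm, pending, states = {}, [], set()
    for op in program:
        if op[0] == "st":                 # ("st", addr, value)
            pending.append((op[1], op[2]))
        elif op[0] == "barrier":          # persist barrier: drain pending
            nvm.update(pending)
            pending = []
        # a crash here persists an arbitrary subset of the pending stores
        for k in range(len(pending) + 1):
            for subset in itertools.combinations(pending, k):
                state = dict(nvm)
                state.update(subset)
                states.add(frozenset(state.items()))
    return states

def inconsistent(state):
    s = dict(state)
    # HEAD points at the new node, but the node itself never persisted
    return s.get("head") == "node" and "node" not in s

naive    = [("st", "node", "v"), ("st", "head", "node")]
failsafe = [("st", "node", "v"), ("barrier",), ("st", "head", "node")]
```

Enumerating crash states shows the naïve program can leave HEAD dangling in NVM, while the barrier version never can.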


SLIDE 13

Strict Barrier*

Instruction stream: St a; St b; St c; St a; Persist Barrier; St d; St e; St d; Persist Barrier; St p; St q; St d; …

[Diagram: visibility vs. persistence timelines. Epoch 1 (a, b, c), Epoch 2 (d, e), and Epoch 3 (p, q, d) become visible and persist strictly in order, with execution stalled at each barrier until the preceding epoch has persisted.]

* Pelley et al., "Memory Persistency", in ISCA-2014.

SLIDE 19

Strict Barrier

Persist operations happen in the critical path of execution.

SLIDE 20

Lazy Barrier (LB)*

  • Durability lags visibility
  • Buffered barrier semantics allow performing persist operations out of the critical path

* Pelley et al., "Memory Persistency", in ISCA-2014. Condit et al., "Better I/O through byte-addressable, persistent memory", in SOSP-2009.

SLIDE 21

Lazy Barrier (LB)

Significant performance improvement over the strict barrier.
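A back-of-the-envelope model of why buffering helps (the latencies below are invented purely for illustration): a strict barrier charges the epoch's persist latency to the executing thread, while a lazy barrier only closes the epoch and lets the persists drain in the background.

```python
# Toy cost model (hypothetical latencies): a strict persist barrier stalls
# until the current epoch is persisted; a lazy (buffered) barrier does not.
FLUSH = 100   # assumed cycles to persist one epoch
STORE = 1     # assumed cycles per store

def critical_path(program, lazy):
    """Cycles spent on the thread's critical path for a list of ops."""
    cycles = 0
    for op in program:
        if op == "st":
            cycles += STORE
        elif op == "barrier" and not lazy:
            cycles += FLUSH   # strict: the persist is in the critical path
        # lazy: the epoch is buffered and persisted off the critical path
    return cycles

prog = ["st", "st", "st", "barrier", "st", "st", "barrier", "st"]
```

With two barriers, the strict model pays both epoch flushes inline, while the lazy model's critical path is just the six stores.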

SLIDE 22

Conflicts: Lazy Barrier (LB)

[Diagram: buffered epochs on visibility and persistence timelines; d is a conflicting request, triggered here by a cache line eviction.]

* Pelley et al., "Memory Persistency", in ISCA-2014. Condit et al., "Better I/O through byte-addressable, persistent memory", in SOSP-2009.



SLIDE 27

Conflicts: Lazy Barrier (LB)

Conflicts bring persist operations back in the critical path (here, an intra-thread conflict).

SLIDE 28

Inter-thread Conflict

[Diagram: threads T0 and T1, each divided into epochs (E00, E10, E11), with reads and writes to A, B, E, F, P, Q, X, Y, Z laid out on visibility and persistence timelines; a read in one thread of data written in another thread's unpersisted epoch creates an inter-thread conflict.]


SLIDE 32

Two Ideas

  • Proactive flushing (PF): predict when a cache block's value is final and flush it
  • Inter-thread dependence tracking (IDT): track inter-thread dependencies, enforce them lazily
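Proactive flushing can be sketched with a deliberately simple heuristic (this heuristic is my assumption for illustration, not the paper's actual predictor): treat a dirty block's value as final once the next store in the epoch targets a different block, and flush it at that point rather than waiting for the epoch to end.

```python
# Proactive flushing, toy heuristic: when the store stream moves on to a
# different block, predict the previous block's value is final and flush
# it immediately, so a later conflict finds it already persisted.

def proactive_flush(stores):
    """stores: sequence of block addresses written in one epoch.
    Returns (blocks flushed early, block still dirty at epoch end)."""
    flushed, dirty = [], None
    for addr in stores:
        if dirty is not None and dirty != addr:
            flushed.append(dirty)     # predicted final: flush proactively
        dirty = addr                  # most recent block stays buffered
    return flushed, dirty
```

For the store stream a, a, b, c the heuristic flushes a and b early and leaves only c dirty at the epoch's end; a mispredicted flush only costs an extra write-back, not correctness.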

SLIDE 33

Evaluation

Persist Barrier Designs

  • LB: lazy barrier
  • LB+IDT: lazy barrier with inter-thread dependence tracking (IDT)
  • LB+PF: lazy barrier with proactive flush (PF)
  • LB++: lazy barrier with both IDT and PF

System Configuration

  • We evaluate the proposed designs using GEM5 full-system simulation mode
  • 32-core CMP with 32x1MB LLC cache banks and 4 memory controllers


SLIDE 35

Transaction Throughput

[Chart, higher is better: 15% improvement highlighted.]

SLIDE 36

Transaction Throughput

[Chart, higher is better: 22% improvement highlighted.]

SLIDE 37

Outline

  • Emergence of Persistent Memory
  • Ordering Persist Operations
  • Atomic Durability
  • Atomic (durability + visibility): Durable HTM
  • Conclusions


SLIDE 38

Atomic Durability

  • All or nothing persists: think transactions (ACID)

Atomic_Begin; A = 1; B = 1; Atomic_End

Starting from the initial state, the only acceptable final states are both updates persisted (A = 1, B = 1) or neither; partial outcomes (only A = 1, or only B = 1) violate atomicity.
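The all-or-nothing requirement can be illustrated with a minimal undo-logging sketch (a Python toy model with names of my choosing; it assumes each old value is durable in the log before the matching in-place write, which is exactly the ordering real hardware must enforce):

```python
# Undo logging, toy model: persist the old value to a log before each
# in-place NVM write; on a crash before commit, roll back from the log.

def atomic_update(nvm, updates):
    log = []
    for addr, new in updates.items():
        log.append((addr, nvm.get(addr)))  # old value logged first
        nvm[addr] = new                    # then data written in place
    return log                             # discarded on commit

def recover(nvm, log):
    """Roll back an uncommitted transaction from its undo log."""
    for addr, old in reversed(log):
        if old is None:
            nvm.pop(addr, None)            # address did not exist before
        else:
            nvm[addr] = old

nvm = {"A": 0, "B": 0}
log = atomic_update(nvm, {"A": 1, "B": 1})
recover(nvm, log)   # crash before commit: back to the initial state
```

After recovery the state is exactly the initial one; had the transaction committed (log discarded), both A and B would hold 1. No interleaving exposes a partial state.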

SLIDE 39

Mechanisms

Write-Ahead Logging:

  • REDO logging ➡ requires read redirection
  • UNDO logging ➡ requires fine-grained log → data ordering


SLIDE 44

ATOM

  • Create Undo Log: on a store, the cache writes the old value to the log
  • Flush Undo Log: enforce log → data ordering

[Diagram: Core, Cache, and NVM with separate Data and Log regions; the store A = 1 creates the log entry L(A) = 0, and "Log Done" marks the log entry persisted before the data write.]

Posted Log: Offload log → data ordering to the memory controller!
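How a memory controller could enforce the posted log → data ordering can be sketched as follows (a toy model under my own assumptions, not ATOM's actual microarchitecture): the core posts both writes and moves on, and the controller holds each data write until the matching log entry is durable.

```python
# Posted-log ordering, toy model: the memory controller issues a data
# write to NVM only after the log entry for the same address is durable.
from collections import deque

def mem_ctrl(requests):
    """requests: sequence of ("log" | "data", addr) posted by the core.
    Returns the order in which writes actually reach NVM."""
    issued, held, logged = [], deque(), set()
    for kind, addr in requests:
        if kind == "log":
            issued.append(("log", addr))
            logged.add(addr)
            # release any data write that was waiting on this log entry
            while held and held[0] in logged:
                issued.append(("data", held.popleft()))
        else:
            if addr in logged:             # log already durable: safe
                issued.append(("data", addr))
            else:                          # hold until the log arrives
                held.append(addr)
    return issued
```

Even if the core posts the data write first, the controller reorders so the log entry always reaches NVM ahead of the data, keeping undo recovery possible.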

SLIDE 45

Baseline Implementation

[Timeline across SQ, Cache, Mem Ctrl, and Memory: for a store ST(A), the log write L(A) travels all the way to memory before the store completes, so logging sits inside the store completion time.]

SLIDE 46

ATOM Posted Log

[Timeline: the log write L(A) is posted to the memory controller, which takes over the log → data ordering, shortening store completion time.]

SLIDE 47

ATOM Source Log

[Timeline: the log entry is created at the source using the read-exclusive (RDx) response for A, so only the log write travels to memory.]

Log at the source: reduce data movement

SLIDE 48

Evaluation

System Configuration

  • We evaluate the proposed designs using GEM5 full-system simulation mode
  • 32-core CMP with 32x1MB LLC cache banks and 4 memory controllers

Atomic Durability Designs

  • BASE: baseline hardware undo log implementation
  • ATOM: posted log writes to the memory controller
  • ATOM-OPT: posted log writes with source logging
  • NON-ATOMIC: no logging (upper bound on performance)

SLIDE 51

Transaction Throughput

[Chart: 27% improvement, within 11% of the upper bound.]

SLIDE 52

Ordering + Atomicity

  • Checkpointing applications (at epoch granularity)
  • 1. Divide application into Epochs
  • 2. Epochs persist atomically (logging)
  • 3. Epochs persist in order (persist barrier)
  • In effect: Bulk Strict Persistency (BSP), i.e. strict persistency provided in bulk mode*

* Ceze et al., "BulkSC: Bulk enforcement of sequential consistency", in ISCA-2007.

SLIDE 58

Execution Time

[Chart: execution-time overheads of 59%, 17%, 15%, and 20% highlighted across the designs (strict barrier through +posted log); overall, checkpointing at 32% overhead.]

SLIDE 59

Outline

  • Emergence of Persistent Memory
  • Ordering Persist Operations [MICRO ’15]
  • Atomic Durability [HPCA ’17]
  • Atomic (durability + visibility): Durable HTM
  • Conclusions


SLIDE 60

DHTM

  • Built over HTM (RTM)
  • In-place cache updates to avoid read redirection
  • HW redo logging for durability
  • Data can be flushed to memory out of the critical path
  • Simple predictor for the last write, to minimise log writes
  • Leverage the logging infrastructure to permit L1 overflows
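DHTM's hybrid versioning can be sketched with a toy model (hypothetical names, not the actual hardware): transactional stores update the cache in place, so reads need no redirection, while the new values accumulate in a persistent redo log; the transaction is durable once a commit record persists, after which the log can be replayed into NVM data lazily, off the critical path.

```python
# Hybrid versioning, toy model: in-place cache updates for visibility,
# a redo log for durability, and a commit record to decide atomicity.

def txn(cache, redo_log, writes):
    for addr, val in writes.items():
        cache[addr] = val              # in-place update: no read redirection
        redo_log.append((addr, val))   # HW redo logging for durability
    redo_log.append(("COMMIT", None))  # txn is durable once this persists

def replay(nvm, redo_log):
    """Recovery / lazy write-back: apply only committed log entries."""
    if ("COMMIT", None) not in redo_log:
        return                         # uncommitted txn: log is discarded
    for addr, val in redo_log:
        if addr != "COMMIT":
            nvm[addr] = val
```

Replay after a crash installs the new values only if the commit record made it to the log; an uncommitted log is simply dropped, leaving NVM data untouched.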

SLIDE 66

Transaction Throughput

[Chart: throughput normalized from 1 to 2 for btree, hash, queue, rbtree, sdg, sps, and gmean under ATOM, sdTM, HTM-undo, and DHTM; gains of 35%, 15%, 45%, and 17% highlighted across builds, with ~60% overhead over volatile.]

SLIDE 67

Conclusion

  • Persistent memory needs primitives
  • Fast (lazy) persist barrier
  • Proactive flushing (PF); multithreaded awareness (IDT)
  • 20% better than the state-of-the-art lazy barrier
  • Fast undo logging
  • Offload logging to the memory controller (fine-grained ordering, reduced data movement)
  • Checkpointing at only 32% overhead with respect to volatile
  • Fast ACID transactions via DHTM
  • Hybrid versioning: in-place in cache; HW-supported redo logging at memory
  • Leverage logging to support L1 overflows (~60% overhead with respect to volatile)

SLIDE 68

Open Questions

  • Semantics of persist barriers?
  • Should every visibility barrier behave as a persist barrier too?
  • Should every persist barrier behave like a visibility barrier?
  • To buffer or not?
  • Performance impact ~30%; extra hardware
  • Programmability implications of buffering?