Encrypted Non-volatile Main Memory Systems (Yu Hua, Huazhong University of Science and Technology)



SLIDE 1

Encrypted Non-volatile Main Memory Systems

Yu Hua, Huazhong University of Science and Technology, https://csyhua.github.io/

SLIDE 2

Non-volatile Memory (NVM)

  • Non-volatile memory is expected to replace or complement DRAM in the memory hierarchy

✓ Non-volatility, low power, high density, large capacity
✗ Limited write cell endurance

                    PCM          ReRAM         DRAM
Read (ns)           20-70        10-50         10
Write (ns)          150-220      30-100        10
Non-volatility      √            √             ×
Standby power       ~0           ~0            High
Endurance           10^7~10^9    10^8~10^12    10^15
Density (Gb/cm^2)   13.5         24.5          9.1

  • K. Suzuki and S. Swanson. “A Survey of Trends in Non-Volatile Memory Technologies: 2000-2014”, IMW 2015


SLIDE 3

NVM Security


Traditional DRAM: volatile

– If a DRAM DIMM is removed from a computer

  • Data are quickly lost

NVM: non-volatile

– If an NVM DIMM is removed

  • Data are still retained in NVM

– An attacker can directly read the data

  • Insecure
SLIDE 4

Two General Attacks on NVM

  • Attacks:

– Stolen NVMM
– Bus snooping

  • Memory encryption is important for NVM
  • Encrypt data on the CPU side, not inside the memory

Two options: direct encryption (AES) and counter mode encryption (OTP)

SLIDE 5


Encryption Increases Bit Writes to NVM

Diffusion property of encryption

– Changing one bit of the original data modifies about half of the bits in the encrypted data


Old data:  00000000…0000000000  → Encryption →  01011010…0010110100
New data:  10000000…0000000000  → Encryption →  10101100…0100101001
           (1 of 512 bits modified)             (256 of 512 bits modified)

Memory encryption causes ~50% bit flips on each write
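The diffusion effect is easy to reproduce. Below is a minimal sketch, with SHA-512 standing in for a real 512-bit block encryption (the function names and the hash choice are illustrative assumptions, not the scheme used in the talk):

```python
import hashlib

def toy_encrypt(block):
    # Stand-in for a 512-bit block encryption: SHA-512 exhibits the
    # same diffusion behavior (any input change scrambles the output).
    return hashlib.sha512(block).digest()

def bit_diff(a, b):
    # Count differing bits between two equal-length byte strings.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

old = bytes(64)                  # 512 zero bits
new = bytes([0x80]) + bytes(63)  # the same line with one bit flipped

assert bit_diff(old, new) == 1   # 1 plaintext bit changed
d = bit_diff(toy_encrypt(old), toy_encrypt(new))
assert 180 < d < 332             # roughly half of the 512 ciphertext bits flip
```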

SLIDE 6

Observation

  • A large number of entire-line duplicates exist, varying from 18% to 98% across workloads
  • On average, 58% duplicate lines vs. 16% zero lines


[Figures: duplication ratios for SPEC CPU2006 and PARSEC 2.1 workloads]

SLIDE 7

Motivation

Eliminating duplicate lines by performing deduplication at the cache-line level

– Improves secure NVM endurance

  • Removes duplicate writes

– Improves system performance

  • Avoids the high write latency of duplicate writes
  • Reduces the wait time of read and non-duplicate write requests


SLIDE 8

Challenges

  • How to perform in-line deduplication in NVMM without degrading system performance

– Existing memory deduplication is performed out of line

  • Duplicates are first written into memory and then eliminated
  • Fails to reduce writes

– Existing in-line deduplication incurs high latency

  • Uses cryptographic hash functions, e.g., SHA-1 and MD5
  • > 300ns computation latency, close to the NVM write latency

  • How to integrate deduplication with NVM encryption while delivering good performance

– Both are executed serially on the critical path of memory writes
– Both produce metadata overheads


SLIDE 9

DeWrite

  • Light-weight deduplication leveraging asymmetric NVM reads and writes

– Eliminates a write at the cost of a read latency
– Write latency is much higher than read latency (3~8×)

  • Efficient synergy of deduplication and encryption via parallelism and metadata colocation

– Opportunistically performs deduplication and encryption in parallel
– Co-locates their metadata storage to save space


[Figure: Hardware architecture. Between the last-level cache and the encrypted NVMM, the memory controller integrates a metadata cache, an AES-ctr engine, and the dedup logic; metadata uses direct encryption in a metadata storage region, while non-duplicate data lines use counter mode encryption (OTP).]

SLIDE 10

Memory Encryption for Security

Counter mode encryption

– Hides the decryption latency
– Generates a One Time Pad (OTP) using a per-line counter

  • Counters are buffered in an on-chip counter cache

[Figure: (a) traditional encryption serializes the memory access and the decryption; (b) counter mode encryption overlaps OTP generation with the memory access, reducing latency.]

The AES-ctr engine takes the line address, the counter, and the key to generate the OTP; encryption XORs the plaintext with the OTP, and decryption XORs the ciphertext with the same OTP to recover the plaintext.
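As a concrete sketch of this flow, the snippet below derives a per-line pad from (key, line address, counter) and XORs it with the data. SHA-256 stands in for the AES engine so the example stays self-contained, and all names (`KEY`, `otp`, `xor`) are illustrative assumptions:

```python
import hashlib

KEY = b"example-key"  # hypothetical per-system secret key

def otp(line_addr, counter, length=64):
    # Pad generator for one 64B line. A real controller would compute
    # AES(key, line_addr || counter); a keyed SHA-256 stands in here
    # so the sketch needs only the standard library.
    seed = KEY + line_addr.to_bytes(8, "little") + counter.to_bytes(8, "little")
    pad = b""
    block = 0
    while len(pad) < length:
        pad += hashlib.sha256(seed + block.to_bytes(4, "little")).digest()
        block += 1
    return pad[:length]

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# Encryption: ciphertext = plaintext XOR OTP(addr, counter).
# Decryption reuses the same pad, so pad generation can overlap the
# memory access that fetches the ciphertext.
plaintext = b"A" * 64
ciphertext = xor(plaintext, otp(0x1000, 7))
assert xor(ciphertext, otp(0x1000, 7)) == plaintext
```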


SLIDE 11

Prediction-based Parallelism


The direct way (serial): for a write request, detect duplication first; if the line is duplicate, cancel the write; otherwise encrypt the data and write it to NVM.

  • Inefficient for applications where most lines are non-duplicate: serial execution latency

The parallel way: detect duplication and encrypt the data in parallel; if the line is duplicate, discard the ciphertext; otherwise write it to NVM.

  • Inefficient for applications where most lines are duplicate: unnecessary encryption
SLIDE 12

Prediction-based Parallelism

  • How to know whether a cache line is duplicate beforehand?
  • Observation: the duplication state of most memory writes is the same as that of their previous ones

– Rationale: a duplicate (or non-duplicate) data region is usually larger than a cache line


[Figure: line B is written right after a duplicate write A, so B is predicted to be duplicate (1: duplicate, 0: non-duplicate).]

SLIDE 13

Prediction-based Parallelism

(Same observation as on the previous slide.) [Figure: here B follows a non-duplicate write, so B is predicted to be non-duplicate.]

SLIDE 14

Prediction-based Parallelism

(Same observation as on the previous slide.)

  • Solution: a simple yet effective prediction scheme

– Exploit the duplication states of the most recent memory writes, kept in a history window, to predict the state of a new write
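A minimal sketch of such a predictor is below; the window size and the majority-vote policy are illustrative assumptions, since the slides only state that the duplication states of the most recent writes are exploited:

```python
from collections import deque

class DupPredictor:
    """Predict a write's duplication state from recent history.

    Window size and majority-vote policy are illustrative choices,
    not necessarily the exact scheme used in DeWrite."""

    def __init__(self, window=4):
        self.history = deque(maxlen=window)

    def predict(self):
        # Majority vote over the history window; with no history,
        # default to predicting non-duplicate.
        if not self.history:
            return False
        return sum(self.history) * 2 >= len(self.history)

    def update(self, was_duplicate):
        # Record the actual duplication state once the write is checked.
        self.history.append(was_duplicate)

p = DupPredictor()
for state in (True, True, True, True):
    p.update(state)
assert p.predict() is True  # recent writes were duplicates, so predict duplicate
```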

SLIDE 15

Light-weight Deduplication for NVMM

  • Compute a light-weight hash (CRC-32) of a cache line, instead of a cryptographic hash
  • If the hash matches that of an existing line, read the line and compare the data byte by byte to confirm the duplicate (tQ: hash query time)
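The lookup-then-verify flow can be sketched as follows. This is a simplified software model of what the slides describe as hardware logic in the memory controller; the table layout is an assumption:

```python
import zlib

class DedupTable:
    """In-line deduplication sketch: a light-weight CRC-32 lookup
    followed by a byte-by-byte compare to rule out hash collisions."""

    def __init__(self):
        self.by_hash = {}  # crc32 value -> stored 64B line

    def write(self, line):
        # Returns True if the write can be eliminated as a duplicate.
        h = zlib.crc32(line)
        stored = self.by_hash.get(h)
        if stored is not None and stored == line:  # byte-wise verification
            return True   # duplicate: only a mapping update, no NVM write
        self.by_hash[h] = line
        return False      # non-duplicate: the line is written to NVM

table = DedupTable()
assert table.write(b"x" * 64) is False  # first copy is written
assert table.write(b"x" * 64) is True   # second copy is eliminated
```

The byte-wise compare costs one NVM read, which the read/write asymmetry makes cheaper than the duplicate write it eliminates.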

SLIDE 16

Evaluation

Benchmarks

– 12 benchmarks from SPEC CPU2006: single-threaded
– 8 benchmarks from PARSEC 2.1: multi-threaded


SLIDE 17

NVM Endurance

DeWrite reduces writes to secure NVM by 54% on average


SLIDE 18

Write Speedup


DeWrite speeds up NVM writes by 4.2X on average

SLIDE 19

Read Speedup


DeWrite speeds up NVM reads by 3.1X on average

SLIDE 20

Persistence Issue

The non-volatility of NVM enables data to be persistently stored in NVM
Data may be incorrectly persisted due to crash inconsistency

– Modern processors and caches usually reorder memory writes
– Volatile caches cause partial updates

[Figure: volatile caches sit above the non-volatile NVM, connected by a 64-bit bus.]


SLIDE 21

Consistency Guarantee for Persistence

Durable transaction: a commonly used solution

– NV-Heaps (ASPLOS'11), Mnemosyne (ASPLOS'11), DCT (ASPLOS'16), DudeTM (ASPLOS'17), NVML (Intel)
– Enables a group of memory updates to be performed atomically

Enforce write ordering

– Cache-line flush and memory barrier instructions

Avoid partial updates

– Logging

TX_BEGIN
  do some computation;
  // Prepare stage: back up the data in the log
  write undo log; flush log; memory_barrier();
  // Mutate stage: update the data in place
  write data; flush data; memory_barrier();
  // Commit stage: invalidate the log
  log->valid = false; flush log->valid; memory_barrier();
TX_END
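The three stages can be simulated in a few lines. The dict standing in for NVM and the `crash_at` hook are illustrative assumptions for the sketch, not a real persistence API:

```python
# A toy simulation of the undo-log durable transaction above.
# Flushes and barriers are implicit in the assignment order, and
# `crash_at` models a failure between stages.

nvm = {"data": b"old", "log": None, "log_valid": False}

def tx_update(new_value, crash_at=None):
    # Prepare stage: back up the old data in the undo log.
    nvm["log"] = nvm["data"]
    nvm["log_valid"] = True
    if crash_at == "mutate":
        return  # crash before the in-place update
    # Mutate stage: update the data in place.
    nvm["data"] = new_value
    if crash_at == "commit":
        return  # crash before the log is invalidated
    # Commit stage: invalidate the log.
    nvm["log_valid"] = False

def recover():
    # A valid log means the mutate stage may be partial: roll back.
    if nvm["log_valid"]:
        nvm["data"] = nvm["log"]
        nvm["log_valid"] = False

tx_update(b"new", crash_at="commit")  # fail between mutate and commit
recover()
assert nvm["data"] == b"old"          # rolled back to a consistent state
tx_update(b"new")
assert nvm["data"] == b"new"          # committed normally
```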


SLIDE 22

The Gap between Persistence and Security

Ensuring both security and persistence

– Simply combining existing persistence schemes with memory encryption is inefficient
– Each write in a secure NVM has to persist two items

  • The data itself and its counter

Crash inconsistency

– Cache-line flush instructions cannot operate on the counter cache
– Memory barrier instructions fail to ensure the ordering of counter writes

Performance degradation

– Doubled write requests


SLIDE 23

SecPM: a Secure and Persistent Memory System

Performs only slight modifications to the memory controller and is transparent to programmers

– Programs running on an unencrypted NVM can be directly executed on a secure NVM with SecPM


Consistency guarantee

– A counter cache write-through (CWT) scheme

Performance improvement

– A locality-aware counter write reduction (CWR) scheme

Asynchronous DRAM refresh (ADR): cache lines reaching the write queue can be considered durable.

[Figure: SecPM architecture. Between the last-level cache and the encrypted NVM, the memory controller integrates an AES-ctr engine, a counter cache, and the write queue; a counter-derived OTP is XORed with the plaintext to produce the ciphertext, and the counters are also stored in the encrypted NVM.]

SLIDE 24

Counter Cache Write-through (CWT) Scheme

CWT ensures the crash consistency of both data and counters

– Appends the counter of the data to the write queue while encrypting the data
– Ensures the counter is durable before the data flush completes

[Figure: timeline of a cache-line flush Flu(A): the memory controller reads the counter Ac, increments it (Ac++), encrypts A, appends Ac and then A to the write queue, acknowledges the flush (Ack(A)), and returns (Ret(A)).]
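In software-model form, the ordering looks like this; the queue, counter table, and function names are illustrative, and ADR durability is modeled simply as reaching the write queue:

```python
# Sketch of the CWT ordering on a cache-line flush. Entries reaching
# the write queue count as durable (the ADR domain), so appending the
# counter entry before the data entry guarantees the counter is
# durable before the data flush is acknowledged.

write_queue = []   # entries here are considered durable (ADR)
counters = {}      # write-through counter cache: addr -> counter

def flush_line(addr, plaintext, encrypt):
    counters[addr] = counters.get(addr, 0) + 1       # Read(Ac), Ac++
    ciphertext = encrypt(plaintext, counters[addr])  # Enc(A)
    write_queue.append(("counter", addr, counters[addr]))  # App(Ac) first
    write_queue.append(("data", addr, ciphertext))         # App(A) second
    return "ack"                                     # Ack(A) back to the CPU

flush_line(0x40, b"hello", lambda p, c: bytes(b ^ (c & 0xFF) for b in p))
assert write_queue[0][0] == "counter"  # counter entry precedes the data entry
assert write_queue[1][0] == "data"
```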


SLIDE 25

Durable Transaction in SecPM

Stage     Log content   Log counter   Data content   Data counter   Recoverable?
Prepare   Wrong         Wrong         Correct        Correct        Yes
Mutate    Correct       Correct       Wrong          Wrong          Yes
Commit    Correct       Correct       Correct        Correct        Yes

TX_BEGIN
  do some computation;
  // Prepare stage: back up the data in the log
  write undo log; flush log; memory_barrier();
  // Mutate stage: update the data in place
  write data; flush data; memory_barrier();
  // Commit stage: invalidate the log
  log->valid = false; flush log->valid; memory_barrier();
TX_END

At least one of the log and the data is correct in whichever stage a system failure occurs, so the system can recover to a consistent state in SecPM.


SLIDE 26

Counter Write Reduction (CWR) Scheme

Leveraging the spatial locality of counter storage and of log and data writes

– The spatial locality of counter storage

  • The counters of all memory lines in a page are stored in one memory line
  • Each memory line is encrypted by the major counter concatenated with its minor counter

[Figure: a 64B counter memory line holds one 64-bit major counter (M) and 64 minor counters m1…m64 of 7 bits each.]
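The split-counter layout can be sketched with plain bit manipulation; the function names are illustrative, while the 7-bit minor / 64-bit major widths follow the slide:

```python
def pack_counter_line(major, minors):
    # One 64B counter line: a 64-bit major counter plus 64 minor
    # counters of 7 bits each (64 + 64*7 = 512 bits).
    assert major < 2**64 and len(minors) == 64
    word = major
    for i, m in enumerate(minors):
        assert m < 2**7
        word |= m << (64 + 7 * i)
    return word

def line_counter(word, i):
    # The counter that encrypts line i of the page: the major counter
    # concatenated with the line's minor counter.
    major = word & (2**64 - 1)
    minor = (word >> (64 + 7 * i)) & 0x7F
    return (major << 7) | minor

line = pack_counter_line(5, [3] + [0] * 63)
assert line_counter(line, 0) == (5 << 7) | 3  # major || minor for line 0
```

Because a whole page shares one counter line, many counter updates for nearby writes land in the same 64B line, which is what CWR exploits.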


SLIDE 27

Counter Write Reduction (CWR) Scheme

Leveraging the spatial locality of counter storage and of log and data writes

– The spatial locality of counter storage (as on the previous slide)
– The spatial locality of log and data writes

  • A log is stored in a contiguous region
  • Programs usually allocate a contiguous memory region for a transaction


SLIDE 28

Counter Write Reduction (CWR) Scheme

An illustration of the write queue when writing a log

– The counters Ac, Bc, Cc, and Dc are written into the same memory line
– Each later counter cache line contains the updated contents of the earlier ones (Ac ⊆ Bc ⊆ Cc ⊆ Dc)
– They are evicted from the write-through counter cache

[Figure: the write queue (each cell is a cache line to be written into NVM) holds the log lines A, B, C, D interleaved with their counter lines Ac, Bc, Cc, Dc; each successive counter line updates one more minor counter (m1', then m2', m3', m4') while sharing the major counter M and the rest.]


SLIDE 29

Counter Write Reduction (CWR) Scheme

When a new cache line arrives, remove the existing queued cache line with the same physical address

– Without causing any loss of data

[Figure: the write queue (each cell is a cache line to be written into NVM) initially holds Ac and A.]

SLIDE 30

Counter Write Reduction (CWR) Scheme

(Same rule as above.) [Figure: counter line Bc arrives at the queue holding Ac and A.]

SLIDE 31

Counter Write Reduction (CWR) Scheme

(Same rule.) [Figure: the stale Ac is removed; the queue holds A, Bc, and the new log line B.]

SLIDE 32

Counter Write Reduction (CWR) Scheme

(Same rule.) [Figure: Cc arrives at the queue holding A, Bc, B.]

SLIDE 33

Counter Write Reduction (CWR) Scheme

(Same rule.) [Figure: the stale Bc is removed; the queue holds A, B, Cc, C.]

SLIDE 34

Counter Write Reduction (CWR) Scheme

(Same rule.) [Figure: Dc arrives at the queue holding A, B, Cc, C.]

SLIDE 35

Counter Write Reduction (CWR) Scheme

When a new cache line arrives, remove the existing queued cache line with the same physical address

– Without causing any loss of data
– A flag distinguishes whether a cache line comes from the CPU caches or the counter cache (1: from CPU caches; 0: from the counter cache)

[Figure: the final queue holds A, B, C, Dc, D; the lines from the CPU caches carry flag 1.]
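This coalescing rule can be sketched as a small write-queue model; the class and flag names are illustrative assumptions, and only counter-cache lines are superseded, since a newer counter line contains all updates of the older one:

```python
FROM_CPU, FROM_COUNTER_CACHE = 1, 0

class WriteQueue:
    """CWR coalescing sketch: a newly arriving counter line supersedes
    an older queued counter line with the same physical address, while
    lines from the CPU caches (flag 1) are never coalesced."""

    def __init__(self):
        self.entries = []  # (addr, payload, flag)

    def insert(self, addr, payload, flag):
        if flag == FROM_COUNTER_CACHE:
            # Drop the stale counter line with the same address;
            # the new line already contains its updates.
            self.entries = [e for e in self.entries
                            if not (e[0] == addr and e[2] == FROM_COUNTER_CACHE)]
        self.entries.append((addr, payload, flag))

q = WriteQueue()
q.insert(0x100, "Ac", FROM_COUNTER_CACHE)
q.insert(0x140, "A", FROM_CPU)
q.insert(0x100, "Bc", FROM_COUNTER_CACHE)  # supersedes Ac
assert [e[1] for e in q.entries] == ["A", "Bc"]
```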


SLIDE 36

Counter Write Reduction (CWR) Scheme

(Same rule as on the previous slide.) [Figure: with CWR the write queue holds A, B, C, Dc, D; without CWR it additionally holds Ac, Bc, Cc.]


SLIDE 37

Performance Evaluation

Model NVM using gem5 and NVMain

CPU and caches: X86-64 CPU at 2 GHz; 32KB L1 data & instruction caches; 2MB L2 cache; 8MB shared L3 cache
Memory: PCM, 16GB capacity; read/write latency: 150/450ns; encryption/decryption latency: 40ns; counter cache: 1MB, 10ns latency

Storage benchmarks

– A hash table based key-value store
– A B-tree based key-value store


SLIDE 38

The Number of NVM Write Requests

[Figures: hash table based KV store (left); B-tree based KV store (right)]

  • Compared with SecPM w/o CWR, SecPM significantly reduces NVM writes
  • Compared with InsecPM, SecPM causes only 13%, 5%, and 2% more writes when the request size is 256B, 1KB, and 4KB, respectively


SLIDE 39

Transaction Throughput

  • Compared with SecPM w/o CWR, SecPM significantly increases the throughput, by 1.4~2.1 times
  • Compared with InsecPM, SecPM incurs a small throughput reduction, due to the extra NVM writes and the latency overhead of data encryption

[Figures: hash table based KV store (left); B-tree based KV store (right)]


SLIDE 40

Conclusion

Both security and persistence of NVM are important

DeWrite is a line-level write reduction technique that enhances the endurance and performance of secure NVM

– Light-weight deduplication leveraging read/write asymmetry
– Efficient synergy of deduplication and encryption via parallelism and metadata colocation

SecPM bridges the gap between security and persistence

– Guarantees consistency via a counter cache write-through (CWT) scheme
– Improves performance via a locality-aware counter write reduction (CWR) scheme


SLIDE 41

Thanks! Q&A
