Encrypted Non-volatile Main Memory Systems
Yu Hua Huazhong University of Science and Technology https://csyhua.github.io/
Non-volatile Memory (NVM)
Non-volatile memory is expected to replace or complement DRAM in the memory hierarchy
√ Non-volatility, low power, high density, large capacity
× Limited write cell endurance
                  PCM         ReRAM        DRAM
Read (ns)         20-70       10-50        10
Write (ns)        150-220     30-100       10
Non-volatility    √           √            ×
Standby power     ~0          ~0           High
Endurance         10^7~10^9   10^8~10^12   10^15
Density (Gb/cm²)  13.5        24.5         9.1
NVM Security
Traditional DRAM: volatile
– If a DRAM DIMM is removed from a computer, its data is quickly lost
NVM: non-volatile
– If an NVM DIMM is removed, the data is retained
– An attacker can directly read the data
Two general attacks on NVM
– Stolen NVMM
– Bus snooping
Countermeasures: direct encryption (AES) and counter-mode encryption (OTP)
Encryption Increases Bit Writes to NVM
– Changing one bit of the original data modifies, on average, half of the bits in the encrypted data
Old data: 00000000…0000000000 → Encryption → 01011010…0010110100
New data: 10000000…0000000000 → Encryption → 10101100…0100101001
(1 of 512 bits modified in the plaintext; 256 of 512 bits modified in the ciphertext)
Memory encryption causes 50% bit flips on each write
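The ~50% figure follows from the avalanche effect of block ciphers. A minimal sketch, using SHA-256 from the Python standard library as a stand-in for the AES engine (an assumption; the systems above use AES):

```python
import hashlib

def encrypt_line(line: bytes) -> bytes:
    # Stand-in "encryption" of a 512-bit (64B) memory line: two SHA-256
    # digests emulate the avalanche behavior of a real block cipher.
    return hashlib.sha256(b"0" + line).digest() + hashlib.sha256(b"1" + line).digest()

def diff_bits(a: bytes, b: bytes) -> int:
    # Number of bit positions in which a and b differ.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

old = bytes(64)                        # all-zero 512-bit line
new = bytes([0x80]) + bytes(63)        # the same line with one bit flipped
print(diff_bits(old, new))             # 1 bit differs in the plaintext
flips = diff_bits(encrypt_line(old), encrypt_line(new))
print(flips)                           # roughly 256 of 512 bits differ
```

Because each encrypted write flips about half the cells, eliminating a duplicate write saves far more cell wear than the same write would cost on an unencrypted NVM.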
Observation
[Figure: duplication of memory writes measured across SPEC CPU2006 and PARSEC 2.1 benchmarks]
Motivation
– Improve secure NVM endurance
– Improve system performance
– Reduce duplicate write requests
Challenges
Lightweight deduplication without decreasing system performance
– Existing memory deduplication is performed out of line
– Existing in-line deduplication incurs high latency
Synergizing deduplication and encryption while delivering good performance
– Both are executed serially in the critical path of memory writes
– Both produce metadata overheads
DeWrite
Lightweight deduplication leveraging asymmetric NVM reads and writes
– Eliminate a write at the cost of a read latency
– Write latency is much higher than read latency (3~8×)
Efficient synergization of deduplication and encryption via parallelism and metadata colocation
– Opportunistically perform deduplication and encryption in parallel
– Co-locate their metadata storage to save space
Hardware Architecture
[Figure: the last-level cache feeds the memory controller, which contains a metadata cache, an AES-ctr engine, and the dedup logic; metadata is protected with direct encryption and kept in a metadata storage area, while non-duplicate data is protected with counter-mode encryption (CME/OTP) and written to the encrypted NVMM]
Memory Encryption for Security
Counter-mode encryption
– Hide the decryption latency
– Generate a One Time Pad (OTP) using a per-line counter
[Figure: in traditional encryption, decryption begins only after the memory access returns the ciphertext; in counter-mode encryption, the AES-ctr engine computes the OTP from the line address, the per-line counter, and the key in parallel with the memory access, reducing latency. Both encryption and decryption are an XOR with the OTP: Ciphertext = Plaintext ⊕ OTP.]
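Counter-mode encryption can be sketched as follows; SHA-256 stands in for the AES engine, and the key, address, and field widths are illustrative assumptions:

```python
import hashlib

def make_otp(key: bytes, line_addr: int, counter: int, length: int = 64) -> bytes:
    # The OTP depends only on (key, line address, counter), so it can be
    # computed while the memory access is still in flight.
    pad = b""
    block = 0
    while len(pad) < length:
        seed = (key + line_addr.to_bytes(8, "little")
                    + counter.to_bytes(8, "little") + block.to_bytes(4, "little"))
        pad += hashlib.sha256(seed).digest()
        block += 1
    return pad[:length]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

key = b"\x2a" * 16
plaintext = b"64-byte memory line".ljust(64, b".")
pad = make_otp(key, line_addr=0x1F40, counter=7)
ciphertext = xor(plaintext, pad)           # encryption
assert xor(ciphertext, pad) == plaintext   # decryption is the same XOR
```

Incrementing the counter on every write keeps the pad unique per write, which is why the counter itself must be tracked per line (and, as the SecPM part shows, must be persisted consistently).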
Prediction-based Parallelism
The direct (serial) way: a write request → detect duplication → if duplicate, cancel the write; otherwise, encrypt the data and write it to NVM
– Inefficient for applications where most lines are non-duplicate
The parallel way: a write request → detect duplication and encrypt the data in parallel → if duplicate, discard the ciphertext; otherwise, write it to NVM
– Inefficient for applications where most lines are duplicate
Prediction-based Parallelism
Predict that the duplication states of new writes are the same as those of their previous ones
– Rationale: the size of duplicate (non-duplicate) data is usually larger than a cache line
– Exploit the duplication states of the most recent memory writes: a history window records one bit per write (1: duplicate, 0: non-duplicate), and a new write is predicted to follow the recent history (after a run of duplicates, B is predicted duplicate and the serial path is used; after a run of non-duplicates, B is predicted non-duplicate and the parallel path is used)
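The predictor can be sketched as below; the window size and the majority vote are illustrative assumptions, since the slides specify only that a new write follows the recent duplication history:

```python
from collections import deque

class DupPredictor:
    def __init__(self, window: int = 4):
        # 1: duplicate, 0: non-duplicate, for the most recent writes.
        self.history = deque(maxlen=window)

    def predict_duplicate(self) -> bool:
        # Majority vote over the history window; an empty window defaults
        # to non-duplicate (take the parallel dedup+encrypt path).
        return bool(self.history) and 2 * sum(self.history) >= len(self.history)

    def record(self, was_duplicate: bool) -> None:
        self.history.append(int(was_duplicate))

p = DupPredictor()
for _ in range(4):
    p.record(True)                 # a run of duplicate writes
print(p.predict_duplicate())       # → True: use the serial (dedup-first) path
for _ in range(4):
    p.record(False)                # the run ends
print(p.predict_duplicate())       # → False: use the parallel path
```

A misprediction only costs wasted work (an unnecessary encryption, or a serialized dedup check), never correctness, which is what makes this opportunistic parallelism safe.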
Light-weight Deduplication for NVMM
Use a light-weight hash function instead of the cryptographic hash
– On a hash match, read the stored line and compare data byte by byte to rule out collisions (tQ: hash query time)
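A sketch of the dedup check; CRC32 is an illustrative choice of light-weight hash (the slides do not name one), and the byte-by-byte compare guards against hash collisions:

```python
import zlib

class DedupTable:
    def __init__(self):
        self.lines_by_hash = {}            # light-weight hash -> stored lines

    def check_and_insert(self, line: bytes) -> bool:
        """Return True if the line is a duplicate (the write can be
        eliminated), False if it is new and has been recorded."""
        h = zlib.crc32(line)               # tQ: light-weight hash query
        for stored in self.lines_by_hash.get(h, []):
            if stored == line:             # byte-by-byte verification
                return True
        self.lines_by_hash.setdefault(h, []).append(line)
        return False

table = DedupTable()
line = b"\x42" * 64
print(table.check_and_insert(line))    # → False: first write, stored
print(table.check_and_insert(line))    # → True: duplicate write eliminated
```

The byte-by-byte verification is what allows the weak hash: the hash only narrows candidates, and the extra read is cheap because NVM reads are 3~8× faster than writes.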
Evaluation
Benchmarks
– 12 benchmarks from SPEC CPU2006: single-threaded
– 8 benchmarks from PARSEC 2.1: multi-threaded
NVM Endurance
DeWrite reduces 54% writes to secure NVM on average
Write Speedup
DeWrite speeds up NVM writes by 4.2X on average
Read Speedup
DeWrite speeds up NVM reads by 3.1X on average
Persistence Issue
– Modern processors and caches usually reorder memory writes
– Volatile caches cause partial updates
[Figure: volatile CPU caches sit above the non-volatile NVM, connected by a 64-bit bus]
Consistency Guarantee for Persistence
Durable transaction: a commonly used solution
– NV-Heaps (ASPLOS’11), Mnemosyne (ASPLOS’11), DCT (ASPLOS’16), DudeTM (ASPLOS’17), NVML (Intel)
– Enable a group of memory updates to be performed in an atomic manner
Enforce write ordering
– Cache line flush and memory barrier instructions
Avoid partial update
– Logging
TX_BEGIN
  do some computation;
  // Prepare stage: backing up the data in log
  write undo log; flush log; memory_barrier();
  // Mutate stage: updating the data in place
  write data; flush data; memory_barrier();
  // Commit stage: invalidating the log
  log->valid = false; flush log->valid; memory_barrier();
TX_END
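The undo-logging transaction can be sketched in runnable form; flushes and barriers are modeled as no-ops on ordinary memory, and all names are illustrative:

```python
class UndoLogTx:
    def __init__(self, memory: dict):
        self.memory = memory
        self.log = {}          # address -> old value (the undo log)
        self.valid = False

    def write(self, addr, value):
        if addr not in self.log:
            self.log[addr] = self.memory.get(addr)   # Prepare: back up old data
            # flush log; memory_barrier()
            self.valid = True
        self.memory[addr] = value                    # Mutate: update in place
        # flush data; memory_barrier()

    def commit(self):
        self.valid = False                           # Commit: invalidate the log
        self.log.clear()

    def recover(self):
        if self.valid:                               # crash before commit:
            for addr, old in self.log.items():       # roll back from the undo log
                if old is None:
                    self.memory.pop(addr, None)
                else:
                    self.memory[addr] = old
            self.valid = False

mem = {"x": 1}
tx = UndoLogTx(mem)
tx.write("x", 2)
tx.recover()               # simulate a crash before commit
print(mem["x"])            # → 1: the old value is restored
```

The write ordering enforced by the flushes and barriers is exactly what guarantees the log is durable before the in-place update, so recovery always finds a usable old value.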
The Gap between Persistence and Security
Ensuring both security and persistence is hard
– Simply combining existing persistence schemes with memory encryption is inefficient
– Each write in the secure NVM has to persist two items: the data and its counter
Crash inconsistency
– Cache-line flush instructions cannot operate on the counter cache
– Memory barrier instructions fail to ensure the ordering of counter writes
Performance degradation
– Doubled write requests
SecPM: a Secure and Persistent Memory System
Performs only slight modifications to the memory controller and is transparent to programmers
– Programs running on an un-encrypted NVM can be directly executed on a secure NVM with SecPM
Consistency guarantee
– A counter cache write-through (CWT) scheme
Performance improvement
– A locality-aware counter write reduction (CWR) scheme
Asynchronous DRAM refresh (ADR): cache lines reaching the write queue can be considered durable.
[Figure: the last-level cache feeds the memory controller, which contains an AES-ctr engine, a counter cache, and the write queue; the plaintext is XORed with the OTP to produce the ciphertext, and counters and encrypted data are written to the encrypted NVM]
Counter Cache Write-through (CWT) Scheme
CWT ensures the crash consistency of both data and counter
– Append the counter of the data into the write queue while encrypting the data
– Ensure the counter is durable before the data flush completes
[Figure: on a flush Flu(A), the memory controller reads the counter Ac, increments it, encrypts A, appends Ac and then A to the write queue, acknowledges the flush, and returns]
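The CWT ordering can be sketched as follows (names are illustrative): the counter line is appended to the write queue before the data line, so by ADR it is durable before the flush of A completes:

```python
counters = {"A": 6}                   # per-line counters (illustrative)
write_queue = []                      # entries here are durable by ADR

def flush_line(line: str, data: str) -> None:
    counters[line] += 1                                # Read(Ac), Ac++
    ciphertext = f"Enc({data}#{counters[line]})"       # Enc(A) with the new counter
    write_queue.append(("ctr", line, counters[line]))  # App(Ac): counter first
    write_queue.append(("data", line, ciphertext))     # App(A): then the data
    # Ack(A)/Ret(A): the flush completes only after both entries are queued.

flush_line("A", "payload")
print([kind for kind, _, _ in write_queue])   # → ['ctr', 'data']
```

Writing the counter through to the queue on every data flush is what makes an unmodified `flush + barrier` program crash-consistent on encrypted NVM, at the cost of the extra counter writes that CWR then removes.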
Durable Transaction in SecPM
Stage     Log content   Log counter   Data content   Data counter   Recoverable?
Prepare   Wrong         Wrong         Correct        Correct        Yes
Mutate    Correct       Correct      Wrong          Wrong          Yes
Commit    Correct       Correct      Correct        Correct        Yes

At least one of the log and the data is correct in whichever stage a system failure occurs
The system can be recoverable in a consistent state in SecPM
Counter Write Reduction (CWR) Scheme
Leverage the spatial locality of counter storage and of log and data writes
– The spatial locality of counter storage: the counters of 64 contiguous memory lines are packed into one 64B counter line, which holds a shared 64-bit major counter M and 64 per-line 7-bit minor counters m1 … m64; each line is encrypted with M concatenated with its minor counter
– The spatial locality of log and data writes: the log and data writes of a transaction target contiguous memory lines, so their counters fall into a few counter lines
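The layout works out exactly because 64 + 64 × 7 = 512 bits = 64B. A minimal sketch of the per-line encryption counter (field widths from the slide; names are illustrative):

```python
MAJOR_BITS, MINOR_BITS, LINES_PER_COUNTER_LINE = 64, 7, 64
# One 64B counter line: a shared major counter plus 64 per-line minor counters.
assert MAJOR_BITS + LINES_PER_COUNTER_LINE * MINOR_BITS == 512

def line_counter(major: int, minors: list, i: int) -> int:
    # The encryption counter of memory line i is the shared major counter
    # concatenated with that line's own minor counter.
    return (major << MINOR_BITS) | minors[i]

minors = [0] * LINES_PER_COUNTER_LINE
minors[3] = 5
print(line_counter(major=2, minors=minors, i=3))   # → 261 (2 << 7 | 5)
```

Because 64 neighboring lines share one counter line, a burst of writes with spatial locality keeps dirtying the same counter line, which is the redundancy CWR exploits.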
Counter Write Reduction (CWR) Scheme
An illustration of the write queue when writing a log
– The counters Ac, Bc, Cc, and Dc are written into the same memory line
– The later counter cache lines contain the updated contents of the earlier ones
[Figure: the write queue (each cell is a cache line to be written into NVM) holds the log contents A, B, C, D interleaved with their counter lines Ac, Bc, Cc, Dc; Ac carries m1', Bc carries m1' m2', Cc carries m1' m2' m3', and Dc carries m1' m2' m3' m4', each alongside the unchanged minor counters and the major counter M]
Counter Write Reduction (CWR) Scheme
When a new cache line arrives, remove the existing cache line with the same physical address from the write queue
– Without causing any loss of data: the newer counter line already contains all updates of the removed one
– Use a flag to distinguish whether a cache line is from the CPU caches (1) or from the counter cache (0)
[Figure: as Bc, Cc, and Dc arrive, the stale counter lines Ac, Bc, and Cc are removed in turn; with CWR the queue holds A, B, C, D and only the final counter line Dc, whereas without CWR it also holds Ac, Bc, and Cc]
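The coalescing step can be sketched as below (the queue representation and flag handling are illustrative):

```python
class CWRQueue:
    """Sketch of CWR coalescing: a newly arriving counter-cache line removes
    the stale queued line with the same physical address; data lines from the
    CPU caches (flag 1) are never coalesced."""
    def __init__(self):
        self.entries = []                  # (addr, from_cpu_flag, payload)

    def append(self, addr: int, from_cpu: bool, payload: str) -> None:
        if not from_cpu:                   # counter line (flag 0)
            self.entries = [e for e in self.entries
                            if e[0] != addr or e[1]]   # drop the stale counter line
        self.entries.append((addr, from_cpu, payload))

q = CWRQueue()
q.append(0x10, True, "A")                  # data lines come from the CPU caches
q.append(0x99, False, "Ac: m1'")           # all counters share counter line 0x99
q.append(0x11, True, "B")
q.append(0x99, False, "Bc: m1' m2'")       # removes the stale Ac line
q.append(0x12, True, "C")
q.append(0x99, False, "Cc: m1' m2' m3'")   # removes the stale Bc line
print([p for _, _, p in q.entries])
# → ['A', 'B', 'C', "Cc: m1' m2' m3'"]
```

Coalescing is safe only for counter-cache lines, because each newer counter line is a superset of the removed one's updates; CPU-cache lines carry distinct data and must all be written.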
Performance Evaluation
CPU and caches: x86-64 CPU at 2 GHz; 32KB L1 data & instruction caches; 2MB L2 cache; 8MB shared L3 cache
Memory: PCM, 16GB capacity; read/write latency 150/450 ns; encryption/decryption latency 40 ns; counter cache 1MB with 10 ns latency
Storage benchmarks
– A hash table based key-value store – A B-tree based key-value store
The Number of NVM Write Requests
[Figure: NVM write requests of the hash table based KV store and the B-tree based KV store when the request size is 256B, 1KB, and 4KB, respectively]
Transaction Throughput
[Figure: transaction throughput of the hash table based KV store and the B-tree based KV store; SecPM improves throughput by up to ∼2.1× by reducing counter writes and the latency overhead of data encryption]
Conclusion
Both security and persistence of NVM are important
DeWrite is a line-level write reduction technique to enhance the endurance & performance
– Lightweight deduplication leveraging read/write asymmetry – Efficient synergization of deduplication and encryption via parallelism and metadata colocation
SecPM bridges the gap between security and persistence
– Guarantee consistency via a counter cache write-through (CWT) scheme – Improve performance via a locality-aware counter write reduction (CWR) scheme