Logging in Persistent Memory: to Cache, or Not to Cache?
Mengjie Li, Matheus Ogleari, Jishen Zhao
Persistent Memory
STT-RAM, PCM, ReRAM, NVDIMM, battery-backed DRAM, etc.
These nonvolatile devices retain data in a consistent state across a power loss.
- Memory: CPU to DRAM, accessed with load/store, not persistent
- Storage: CPU to disk/flash, accessed with fopen, fread, fwrite, …, persistent
- Persistent memory: CPU to NVRAM, accessed with load/store, persistent
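The difference between the storage interface and the persistent-memory interface can be sketched as follows. This is only an illustration on an ordinary file: `mmap` on a regular file stands in for NVRAM, `msync` stands in for a persist barrier, and the function names (`save_with_file_api`, `save_with_store`, `load_with_file_api`) are made up for this sketch. Real persistent memory would rely on cache-line flushes or non-temporal stores plus a fence instead.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Storage interface: the value reaches the device through fopen/fwrite. */
int save_with_file_api(const char *path, int value) {
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    size_t n = fwrite(&value, sizeof value, 1, f);
    if (fclose(f) != 0 || n != 1) return -1;
    return 0;
}

/* Persistent-memory interface (emulated): map the region, then update it
 * with an ordinary store instruction. */
int save_with_store(const char *path, int value) {
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) return -1;
    if (ftruncate(fd, sizeof(int)) != 0) { close(fd); return -1; }
    int *p = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED) return -1;
    *p = value;                       /* plain store, no read/write call */
    msync(p, sizeof(int), MS_SYNC);   /* stand-in for a persist barrier */
    munmap(p, sizeof(int));
    return 0;
}

/* Read the value back through the file API to check either path. */
int load_with_file_api(const char *path, int *out) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    size_t n = fread(out, sizeof *out, 1, f);
    fclose(f);
    return n == 1 ? 0 : -1;
}
```

In the emulated persistent-memory path the update itself is a single store; the system calls only set up and tear down the mapping.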
Logging in Persistent Memory
Update persistent memory with transactions

Tx_begin
    do some reads
    do some computation
    Rlog( addr(C), new_val(C) )
    memory_barrier
    write C
Tx_commit

Micro-ops: store C'1, store C'2, ...

[Figure: two cores with private L1 caches share an LLC in front of NVRAM, which holds the data (Root, A, B, C, D) and the log. Over time, the log entry Log_C' is written, the memory barrier follows, and then C is updated to C'.]
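The log-then-barrier-then-write ordering of the transaction above can be sketched in C. This is a minimal illustration, not the authors' implementation: `struct log_entry`, `struct redo_log`, and `tx_write` are hypothetical names, and `_mm_sfence` only orders the stores; a real persistent-memory system would also have to make sure the log entry has actually left the caches.

```c
#include <immintrin.h>   /* _mm_sfence */
#include <stddef.h>

/* Hypothetical redo-log entry: where to write and what to write. */
struct log_entry {
    int *addr;
    int  new_val;
};

/* Hypothetical append-only log; in a real system it would live in NVRAM. */
struct redo_log {
    struct log_entry *entries;
    size_t            capacity;
    size_t            used;
};

/* Log the new value, order the log write before the data write, then
 * update the data in place. Returns 0 on success, -1 if the log is full. */
int tx_write(struct redo_log *log, int *addr, int new_val) {
    if (log->used == log->capacity) return -1;
    struct log_entry *e = &log->entries[log->used++];
    e->addr = addr;          /* Rlog( addr(C), new_val(C) ) */
    e->new_val = new_val;
    _mm_sfence();            /* memory_barrier */
    *addr = new_val;         /* write C */
    return 0;
}
```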
To cache, or not to cache? That is the question.
[Mengjie Li+, Memsys 2017]
Experimental Setup
- Desktop – Dell OptiPlex 7040 Tower
- CPU – 4-core 3.4GHz Intel Core-i7
- Cache – 8 MB last-level cache
- Measurement Tools – Perf & rdtsc
- Micro-benchmarks – each run 20 times; we report the average performance, excluding initialization time
- Various working set sizes
- Various transaction sizes and write intensity
- Various data structures: hashtable, rbtree, array, …
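The measurement procedure (20 runs, averaged, initialization excluded) could look roughly like this. It is a sketch only: `average_cycles`, `sample_init`, and `sample_body` are hypothetical names, `__rdtsc` is the GCC/Clang intrinsic from `<x86intrin.h>`, and the authors' actual harness is not shown in the slides.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc (GCC/Clang on x86) */

#define RUNS 20

/* Average cycle count of body() over RUNS runs. init() is called before
 * each run but outside the timed region. */
double average_cycles(void (*init)(void), void (*body)(void)) {
    uint64_t total = 0;
    for (int r = 0; r < RUNS; ++r) {
        if (init) init();
        uint64_t start = __rdtsc();
        body();
        uint64_t end = __rdtsc();
        total += end - start;
    }
    return (double)total / RUNS;
}

/* Tiny stand-in workload so the harness can be tried out. */
int data[1024];
void sample_init(void) { for (int i = 0; i < 1024; ++i) data[i] = 0; }
void sample_body(void) { for (int i = 0; i < 1024; ++i) data[i] = i; }
```

A call such as `average_cycles(sample_init, sample_body)` returns the mean cycle count of the timed region only.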
Microbenchmarks Example
//initialization
Create an array of strings

//Uncacheable log
for (i = 0; i < array_size; ++i) {
    value = random_string;
    key = i;
    // Log updates
    // Intrinsic functions to invoke movnti
    _mm_stream_si32(&log[2 * i], key);
    _mm_stream_si32(&log[2 * i + 1], value);
    asm volatile ("sfence");
    array[i] = value;
}

//Cacheable log
for (i = 0; i < array_size; ++i) {
    value = random_string;
    key = i;
    // Log updates
    log[2 * i] = key;
    log[2 * i + 1] = value;
    asm volatile ("sfence");
    array[i] = value;
}
Issue with Cacheable log
Cache pollution
[Figure: two cores, each with L1i and L1d caches, share the last-level cache; the memory bus connects to DRAM and NVM. Log entries occupy lines in the caches.]
LLC Miss Rate and Execution Time
[Figure: LLC miss rate (50% to 90%) and execution time (0.0 to 1.4 million cycles) for the uncacheable and cacheable logs.]
How about uncacheable log performance?
How do we make the log uncacheable?
Example: x86 processors provide uncacheable write instructions (movnti, movntq, etc.).
These instructions can be invoked by:
- Inline assembly (__asm__())
- Intrinsic functions (_mm_stream_si32)
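Both ways of issuing a non-temporal store can be put side by side. This is a small sketch for GCC or Clang on x86: the wrapper names `nt_store_intrinsic`, `nt_store_asm`, and `nt_fence` are hypothetical, while `_mm_stream_si32` and `_mm_sfence` are the standard SSE2/SSE intrinsics.

```c
#include <emmintrin.h>   /* _mm_stream_si32 (SSE2), _mm_sfence */

/* Non-temporal store through the compiler intrinsic. */
void nt_store_intrinsic(int *p, int v) {
    _mm_stream_si32(p, v);
}

/* The same store through GCC/Clang inline assembly (AT&T syntax). */
void nt_store_asm(int *p, int v) {
    __asm__ volatile("movnti %1, %0" : "=m"(*p) : "r"(v));
}

/* Order earlier non-temporal stores before any later stores. */
void nt_fence(void) {
    _mm_sfence();
}
```

The intrinsic is usually the more portable choice across compilers, while inline assembly gives direct control over the emitted instruction.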
Write Combining Buffer (WCB)
WCB capacity: 4-6 cache lines
[Figure: each core has an L1 cache and a WCB above the shared last-level cache; log writes leave through the WCB and travel over the memory bus to DRAM and NVM.]
Issues with Uncacheable Log
- Existing uncacheable writing schemes are sub-optimal
  - Partial writes in WCB
  - Overhead of uncacheable write instructions
  - Limited WCB size
Partial Writes in WCB
- Full write: 64B move from the WCB to memory in 1 bus clock
- Partial write: fewer than 64B move from the WCB to memory, also in 1 bus clock
Partial writes are inefficient because they underutilize the memory bus bandwidth.
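One way to avoid partial writes is to gather small log words in a staging buffer and stream them only once a whole 64-byte line is ready. The sketch below illustrates that idea under stated assumptions; it is not a technique taken from the slides. `struct line_buffer`, `stream_full_line`, and `log_append` are hypothetical names, the log position must be 64-byte aligned, and words still sitting in the staging buffer are not yet in the log.

```c
#include <emmintrin.h>   /* _mm_stream_si128, _mm_loadu_si128, _mm_sfence */
#include <stdint.h>

#define CACHE_LINE 64

/* Stream one full 64-byte line with four 16-byte non-temporal stores.
 * dst must be 16-byte aligned (64-byte aligned to cover exactly one line). */
void stream_full_line(void *dst, const void *src) {
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (int i = 0; i < CACHE_LINE / 16; ++i)
        _mm_stream_si128(d + i, _mm_loadu_si128(s + i));
}

/* Hypothetical staging buffer that gathers 4-byte log words. */
struct line_buffer {
    int32_t words[CACHE_LINE / 4];
    int     count;
};

/* Add one word. When the line is full, stream it to *log_pos, fence, and
 * advance *log_pos by one line. Returns 1 if a line was written, else 0. */
int log_append(struct line_buffer *b, int32_t word, char **log_pos) {
    b->words[b->count++] = word;
    if (b->count < CACHE_LINE / 4) return 0;
    stream_full_line(*log_pos, b->words);
    _mm_sfence();
    *log_pos += CACHE_LINE;
    b->count = 0;
    return 1;
}
```

The trade-off is that a transaction commit would have to write out a partially filled line anyway, so the benefit depends on how many log words each transaction produces.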
Execution Time vs. Transaction Size
[Figure: normalized execution time (0% to 100%) of the uncacheable and cacheable logs for partial writes and full writes; the bars are annotated with 1.28E09 cycles and 1.15E08 cycles.]
- Partial writes: string size 4B, 2097152 iterations, 8MB total data
- Full writes: string size 64B, 131072 iterations, 8MB total data
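The iteration counts follow from keeping the total data fixed at 8MB while the string size changes. The helper below spells out that arithmetic; `iterations_for` and `TOTAL_BYTES` are names invented for this sketch, and 8MB is taken to mean 8 × 1024 × 1024 bytes.

```c
#include <stddef.h>

#define TOTAL_BYTES (8u * 1024u * 1024u)   /* 8MB of data per experiment */

/* Number of loop iterations needed to write TOTAL_BYTES in strings of
 * string_size bytes each. */
size_t iterations_for(size_t string_size) {
    return TOTAL_BYTES / string_size;
}
```

For example, `iterations_for(4)` gives 2097152 and `iterations_for(64)` gives 131072, matching the two configurations listed above.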
Overhead of Uncacheable Write Instructions
//Uncacheable log
for (i = 0; i < array_size; ++i) {
    value = random_string;
    key = i;
    // Log updates
    // Intrinsic functions to invoke movnti
    _mm_stream_si32(&log[2 * i], key);
    _mm_stream_si32(&log[2 * i + 1], value);
    asm volatile ("sfence");
    array[i] = value;
}

//Cacheable log
for (i = 0; i < array_size; ++i) {
    value = random_string;
    key = i;
    // Log updates
    log[2 * i] = key;
    log[2 * i + 1] = value;
    asm volatile ("sfence");
    array[i] = value;
}
Overhead of Uncacheable Write Instructions
void _mm_stream_si32(int *p, int a)
asm("movnti %1, %0" : "=m"(*p) : "r"(v)); // int *p, int v;
Writing data whose type is not integer adds type-casting overhead.
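The type-casting overhead can be seen when a non-integer value has to go through `_mm_stream_si32`. The sketch below logs a `float` into an `int` log slot; `log_float` and `read_logged_float` are hypothetical helpers, and `memcpy` is used for the bit-for-bit conversion to stay within C's aliasing rules.

```c
#include <emmintrin.h>   /* _mm_stream_si32 */
#include <string.h>      /* memcpy */

/* Log a float into an int-typed log slot. The value must first be
 * reinterpreted as an int because _mm_stream_si32 only accepts int. */
void log_float(int *slot, float v) {
    int bits;
    memcpy(&bits, &v, sizeof bits);   /* extra step for non-integer data */
    _mm_stream_si32(slot, bits);
}

/* Recover the float from the log slot. */
float read_logged_float(const int *slot) {
    float v;
    memcpy(&v, slot, sizeof v);
    return v;
}
```

Compilers often turn the `memcpy` into a register move, but wider or structured types need several such conversions and several 4-byte stores.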
Issues with Limited WCB Size
[Figure: log updates among transactions issued by the program pass through the WCB and over the bus to NVRAM.]
Inefficiencies of Uncacheable Log
[Figure: execution time (billion cycles, 0.0 to 3.5) of the uncacheable and cacheable logs, and the speedup (1.0 to 1.6), versus string size (4 to 256 bytes). Two regions are marked: "Partial writes and sfence" and "WCB size limit".]

String size (Bytes)    Iterations
4                      2097152
8                      1048576
16                     524288
32                     262144
64                     131072
128                    65536
256                    32768
Summary
- Tradeoff between cacheable and uncacheable log
  - Issues with cacheable log – cache contamination
  - Issues with uncacheable log – sub-optimal design in
    - Uncacheable write instructions and programming interface
    - Hardware components, e.g., write-combining buffer design and the way it is used
- More results
  - Sensitivity study on read/write ratio in transactions
  - Sensitivity study on transaction size
  - Other data structures: hash table, rbtree, b+tree, etc.