
SLIDE 1

Logging in Persistent Memory: to Cache, or Not to Cache?

Mengjie Li, Matheus Ogleari, Jishen Zhao

SLIDE 2

Persistent Memory

STT-RAM, PCM, ReRAM, NVDIMM, Battery-backed DRAM, etc.

These nonvolatile devices retain data in a consistent state in case of power loss.

Memory (CPU, DRAM) – load/store; not persistent
Storage (Disk/Flash) – fopen, fread, fwrite, …; persistent
Persistent memory (CPU, NVRAM) – load/store; persistent

SLIDE 3

Logging in Persistent Memory

Update persistent memory with transactions

    Tx_begin
      do some reads
      do some computation
      Rlog ( addr(C), new_val(C) )
      memory_barrier
      write C
    Tx_commit

Micro-ops over time: store C'1, store C'2, ...

[Diagram: cores with private L1 caches share the LLC above NVRAM, which holds the data structure (Root, A, B, C, D) and the Log. The log entry Log_C' is written before the memory barrier, after which C is updated to C'.]
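The transaction pseudocode on this slide can be sketched in C roughly as follows. This is a minimal illustration under assumptions, not the authors' implementation: the names `log_entry`, `tx_log`, `tx_begin`, `tx_write`, `tx_commit`, and `tx_replay` are hypothetical, the log is a small fixed-size array, and the C11 `atomic_thread_fence` only stands in for the slide's memory_barrier. On real persistent memory the log entry would also have to leave the caches (via cache-line flushes or non-temporal stores) before the data write.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_CAPACITY 64            /* hypothetical fixed log size */

/* One log record: the target address and the new value (as in Rlog). */
struct log_entry {
    uint64_t *addr;
    uint64_t  new_val;
};

struct log_entry tx_log[LOG_CAPACITY];
size_t tx_log_len;

/* Tx_begin: start with an empty log. */
void tx_begin(void)
{
    tx_log_len = 0;
}

/* Log (addr, new_val) first, then the barrier, then the in-place write. */
void tx_write(uint64_t *addr, uint64_t new_val)
{
    assert(tx_log_len < LOG_CAPACITY);
    tx_log[tx_log_len].addr = addr;
    tx_log[tx_log_len].new_val = new_val;
    tx_log_len++;

    /* Stand-in for memory_barrier: keep the log write ahead of the data write. */
    atomic_thread_fence(memory_order_seq_cst);

    *addr = new_val;
}

/* Recovery path: re-apply every logged value to its address. */
void tx_replay(void)
{
    for (size_t i = 0; i < tx_log_len; i++)
        *tx_log[i].addr = tx_log[i].new_val;
}

/* Tx_commit: fence, then discard the log; returns the number of entries. */
size_t tx_commit(void)
{
    atomic_thread_fence(memory_order_seq_cst);
    size_t n = tx_log_len;
    tx_log_len = 0;
    return n;
}
```

A caller would wrap each update as `tx_begin(); tx_write(&c, new_c); tx_commit();`, and call `tx_replay()` during recovery if a crash left an uncommitted log behind.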

SLIDE 4

To cache, or not to cache? That is the question.

[Mengjie Li+, Memsys 2017]

SLIDE 5

Experimental Setup

Desktop – Dell OptiPlex 7040 Tower
CPU – 4-core 3.4GHz Intel Core-i7
Cache – 8 MB last-level cache
Measurement Tools – Perf & rdtsc
Micro-benchmarks – run 20 times and report the average performance, excluding initialization time

Various working set sizes
Various transaction sizes and write intensity
Various data structures: hashtable, rbtree, array, …
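One way the measurement methodology above could be scripted is to time only the benchmark body with the timestamp counter, repeat it 20 times, and average. The sketch below is an assumption about how such a harness might look, not the harness used for these results: `read_cycles`, `average_cycles`, `RUNS`, and the `demo_*` workload are hypothetical names. `__rdtsc()` is the GCC/Clang intrinsic for rdtsc on x86; the `clock()` fallback exists only so the sketch compiles on other targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
static uint64_t read_cycles(void) { return __rdtsc(); }
#else
/* Fallback for non-x86 targets: processor time, not cycles. */
static uint64_t read_cycles(void) { return (uint64_t)clock(); }
#endif

#define RUNS 20                     /* repetitions, as in the setup above */

typedef void (*bench_fn)(void *ctx);

/* Runs init (untimed) and body (timed) RUNS times; returns the mean. */
double average_cycles(bench_fn init, bench_fn body, void *ctx)
{
    assert(body != NULL);
    uint64_t total = 0;
    for (int run = 0; run < RUNS; run++) {
        if (init != NULL)
            init(ctx);              /* initialization is excluded */
        uint64_t start = read_cycles();
        body(ctx);
        total += read_cycles() - start;
    }
    return (double)total / RUNS;
}

/* Hypothetical workload that just counts how often each phase ran. */
struct counters { int inits; int bodies; };
void demo_init(void *ctx) { ((struct counters *)ctx)->inits++; }
void demo_body(void *ctx) { ((struct counters *)ctx)->bodies++; }
```

In a real run, `demo_body` would be replaced by one of the logging loops and `demo_init` by the code that builds the array of strings.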

SLIDE 6

Microbenchmarks Example

    // Initialization: create an array of strings

    // Uncacheable log
    for (i = 0; i < array_size; ++i) {
        value = random_string;
        key = i;
        // Log updates
        // Intrinsic functions to invoke movnti
        _mm_stream_si32(&log[2 * i], key);
        _mm_stream_si32(&log[2 * i + 1], value);
        asm volatile ("sfence");
        array[i] = value;
    }

    // Cacheable log
    for (i = 0; i < array_size; ++i) {
        value = random_string;
        key = i;
        // Log updates
        log[2 * i] = key;
        log[2 * i + 1] = value;
        asm volatile ("sfence");
        array[i] = value;
    }

SLIDE 7

Issue with Cacheable log

[Diagram: cores with private L1i and L1d caches share the last-level cache, which connects over the memory bus to DRAM and NVM. Log entries occupy lines in the last-level cache.]

Cache pollution

SLIDE 8

LLC Miss Rate and Execution Time

[Chart: LLC miss rate (axis from 50% to 90%) and execution time in million cycles (axis from 0.0 to 1.4) for the uncacheable and cacheable logs.]

SLIDE 9

How about uncacheable log performance?

SLIDE 10

How do we make log uncacheable?

Example: x86 processors provide uncacheable write instructions (movnti, movntq, etc.). These instructions can be invoked by

Inline assembly (__asm__())
Intrinsic functions (_mm_stream_si32)
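The two invocation routes listed above can be compared side by side. This is a sketch assuming GCC or Clang on an x86 target with SSE2; the wrapper names `nt_store_intrinsic`, `nt_store_asm`, `store_fence`, and `log_pair` are hypothetical. The plain-store fallback is there only so the sketch builds on other targets and does not bypass the cache.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h>              /* _mm_stream_si32, _mm_sfence */

/* Intrinsic route: the compiler emits movnti for us. */
void nt_store_intrinsic(int *p, int v)
{
    _mm_stream_si32(p, v);
}

/* Inline-assembly route (GCC/Clang extended asm, AT&T operand order). */
void nt_store_asm(int *p, int v)
{
    __asm__ volatile("movnti %1, %0" : "=m"(*p) : "r"(v));
}

/* sfence orders earlier non-temporal stores before later stores. */
void store_fence(void)
{
    _mm_sfence();
}
#else
/* Fallback for targets without SSE2: ordinary cacheable stores. */
void nt_store_intrinsic(int *p, int v) { *p = v; }
void nt_store_asm(int *p, int v) { *p = v; }
void store_fence(void) { }
#endif

/* Append one (key, value) pair to the log at slot i, then fence. */
void log_pair(int *log_buf, int i, int key, int value)
{
    assert(log_buf != NULL && i >= 0);
    nt_store_intrinsic(&log_buf[2 * i], key);
    nt_store_asm(&log_buf[2 * i + 1], value);
    store_fence();
}
```

Both wrappers should compile to the same movnti instruction on x86; the intrinsic is usually the more portable choice across compilers, while inline assembly gives direct control over the emitted instruction.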

SLIDE 11

WCB

Write Combining Buffer (WCB)

4-6 cache lines

[Diagram: each core has an L1 cache and a WCB. Log writes go through the WCB, bypassing the L1 and last-level caches, and reach DRAM and NVM over the memory bus.]

SLIDE 12

Issues with Uncacheable Log

Existing uncacheable writing schemes are sub-optimal
Partial writes in WCB
Overhead of uncacheable write instructions
Limited WCB size

SLIDE 13

Partial Writes in WCB

Full write: the WCB sends a full 64B line to memory in 1 bus clock
Partial write: the WCB sends less than 64B to memory, also in 1 bus clock

Partial writes are inefficient because they underutilize the memory bus bandwidth.
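The difference between the two patterns can be sketched in code. In the sketch below, which assumes GCC or Clang on x86 with SSE2, `write_line_partial` fences after every 4-byte non-temporal store, so each buffer flush carries far less than a full line, whereas `write_line_full` fills one 64-byte-aligned line and fences once, which gives the WCB a chance to combine the stores into a single full-line write. The function names and the `LINE_BYTES` constant are hypothetical, and whether the hardware actually combines the stores depends on the processor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 64
#define WORDS_PER_LINE (LINE_BYTES / sizeof(int32_t))

#if defined(__SSE2__)
#include <emmintrin.h>
static void nt_store(int32_t *p, int32_t v) { _mm_stream_si32((int *)p, v); }
static void store_fence(void) { _mm_sfence(); }
#else
/* Fallback for targets without SSE2: ordinary stores, no fence. */
static void nt_store(int32_t *p, int32_t v) { *p = v; }
static void store_fence(void) { }
#endif

/* Partial-write pattern: a fence after every 4-byte store. */
void write_line_partial(int32_t *line, const int32_t *src)
{
    assert(line != NULL && src != NULL);
    for (size_t w = 0; w < WORDS_PER_LINE; w++) {
        nt_store(&line[w], src[w]);
        store_fence();
    }
}

/* Full-write pattern: fill the whole aligned line, then fence once. */
void write_line_full(int32_t *line, const int32_t *src)
{
    assert(line != NULL && src != NULL);
    assert(((uintptr_t)line % LINE_BYTES) == 0);
    for (size_t w = 0; w < WORDS_PER_LINE; w++)
        nt_store(&line[w], src[w]);
    store_fence();
}
```

Both functions leave the same bytes in memory; only the number of fences, and therefore the likely size of each write leaving the WCB, differs.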

SLIDE 14

Execution Time vs. Transaction Size

[Chart: execution time (0% to 100%) of the uncacheable and cacheable logs with partial writes and with full writes. Labeled execution times: 1.28E09 cycles and 1.15E08 cycles.]

Partial writes: String Size – 4B, Iterations – 2097152, Total Data – 8MB
Full writes: String Size – 64B, Iterations – 131072, Total Data – 8MB

SLIDE 15

Overhead of Uncacheable Write Instructions

    // Uncacheable log
    for (i = 0; i < array_size; ++i) {
        value = random_string;
        key = i;
        // Log updates
        // Intrinsic functions to invoke movnti
        _mm_stream_si32(&log[2 * i], key);
        _mm_stream_si32(&log[2 * i + 1], value);
        asm volatile ("sfence");
        array[i] = value;
    }

    // Cacheable log
    for (i = 0; i < array_size; ++i) {
        value = random_string;
        key = i;
        // Log updates
        log[2 * i] = key;
        log[2 * i + 1] = value;
        asm volatile ("sfence");
        array[i] = value;
    }

SLIDE 16

Overhead of Uncacheable Write Instructions

    void _mm_stream_si32 (int *p, int a)
    asm("movnti %1, %0" : "=m" (*p) : "r" (v)); // int *p, int v;

More overhead for type casting if the data written is not an integer

SLIDE 17

Issues with Limited WCB Size

[Diagram: log updates from the transactions issued by the program pass through the WCB to the NVRAM bus.]

SLIDE 18

Inefficiencies of Uncacheable Log

[Chart: execution time in billion cycles (axis from 0.0 to 3.5) of the uncacheable and cacheable logs, and speedup (axis from 1.0 to 1.6), versus string size from 4 to 256 bytes. Annotations: "Partial writes and sfence" and "WCB size limit".]

String size (Bytes)    Iterations
4                      2097152
8                      1048576
16                     524288
32                     262144
64                     131072
128                    65536
256                    32768

SLIDE 19

Summary

Tradeoff between cacheable and uncacheable log
Issues with cacheable log – cache contamination
Issues with uncacheable log – sub-optimal design in
Uncacheable write instructions and programming interface
Hardware components, e.g., write-combining buffer design and the way it is used

More results
Sensitivity study on read/write ratio in transactions
Sensitivity study on transaction size
Other data structures: hash table, rbtree, b+tree, etc.

SLIDE 20