Logging in Persistent Memory: to Cache, or Not to Cache?

slide-1
SLIDE 1

Logging in Persistent Memory: to Cache, or Not to Cache?

Mengjie Li, Matheus Ogleari, Jishen Zhao

slide-2
SLIDE 2

Persistent Memory

STT-RAM, PCM, ReRAM, NVDIMM, battery-backed DRAM, etc.

These nonvolatile devices retain data in a consistent state in case of power loss.

  • Memory (CPU + DRAM): accessed with load/store; not persistent
  • Storage (disk/flash): accessed with fopen, fread, fwrite, …; persistent
  • Persistent memory (CPU + NVRAM): accessed with load/store; persistent

slide-3
SLIDE 3

Logging in Persistent Memory

Update persistent memory with transactions:

    Tx_begin
        do some reads
        do some computation
        Rlog( addr(C), new_val(C) )
        memory_barrier
        write C
    Tx_commit

Micro-ops: store C'1, store C'2, ...

[Figure: two cores, each with an L1 cache, share the LLC above NVRAM. NVRAM holds the data (Root, A, B, C, D) and the log. Over time, the log entry Log_C' is written first; after the memory barrier, C is updated to C'.]
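The transaction above can be sketched in C as follows. This is only an illustration under stated assumptions: the `log_entry` and `redo_log` types, `LOG_CAPACITY`, and the `tx_update` and `replay` helpers are hypothetical names, the log lives in ordinary memory rather than NVRAM, and `_mm_sfence()` stands in for the slide's memory barrier. It orders the log stores before the data store but does not by itself flush cache lines to persistent memory.

```c
#include <assert.h>
#include <emmintrin.h>  /* _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical log entry: the address to update and its new value. */
struct log_entry {
    uint64_t *addr;
    uint64_t  new_val;
};

/* Hypothetical fixed-size redo log; a real one would reside in NVRAM. */
#define LOG_CAPACITY 16
struct redo_log {
    struct log_entry entries[LOG_CAPACITY];
    size_t           count;
};

/* Update one word transactionally: log first, barrier, then write. */
static void tx_update(struct redo_log *log, uint64_t *target, uint64_t new_val)
{
    assert(log->count < LOG_CAPACITY);
    struct log_entry *e = &log->entries[log->count];
    e->addr = target;       /* Rlog(addr(C), new_val(C)) */
    e->new_val = new_val;
    log->count++;
    _mm_sfence();           /* memory_barrier: log stores before data store */
    *target = new_val;      /* write C */
}

/* Recovery sketch: reapply every logged value in order. */
static void replay(const struct redo_log *log)
{
    for (size_t i = 0; i < log->count; ++i)
        *log->entries[i].addr = log->entries[i].new_val;
}
```

Calling `replay` after a simulated crash (for example, after overwriting the target with a stale value) restores the logged value, which is the property the log is meant to provide.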

slide-4
SLIDE 4

To cache, or not to cache? That is the question.

[Mengjie Li+, Memsys 2017]

slide-5
SLIDE 5

Experimental Setup

  • Desktop – Dell OptiPlex 7040 Tower
  • CPU – 4-core 3.4GHz Intel Core-i7
  • Cache – 8 MB last-level cache
  • Measurement Tools – Perf & rdtsc
  • Micro-benchmarks – run 20 times and report the average performance without initialization time

  • Various working set sizes
  • Various transaction sizes and write intensity
  • Various data structures: hashtable, rbtree, array, …
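A measurement loop in the spirit of this setup might look like the sketch below. It is not the authors' harness: `average_cycles`, `demo_init`, `demo_work`, and `DEMO_LEN` are hypothetical names, and the counter read by `__rdtsc()` ticks at a fixed reference rate, so the result only approximates core cycles. Initialization runs before each timed region and is excluded from the total.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <x86intrin.h>  /* __rdtsc */

#define RUNS 20         /* the setup reports the average of 20 runs */
#define DEMO_LEN 1024   /* hypothetical working-set size, in ints */

/* Average timestamp-counter ticks of work(arg) over RUNS runs.
 * init(arg) runs before each timed region and is not counted. */
static uint64_t average_cycles(void (*init)(void *), void (*work)(void *),
                               void *arg)
{
    assert(init && work);
    uint64_t total = 0;
    for (int r = 0; r < RUNS; ++r) {
        init(arg);                      /* untimed initialization */
        uint64_t start = __rdtsc();
        work(arg);                      /* timed region */
        uint64_t end = __rdtsc();
        total += end - start;
    }
    return total / RUNS;
}

/* Hypothetical workload: zero the array, then fill it with indices. */
static void demo_init(void *arg)
{
    memset(arg, 0, DEMO_LEN * sizeof(int));
}

static void demo_work(void *arg)
{
    int *a = arg;
    for (int i = 0; i < DEMO_LEN; ++i)
        a[i] = i;
}
```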


slide-6
SLIDE 6

Microbenchmarks Example

    // Initialization: create an array of strings

    // Uncacheable log
    for (i = 0; i < array_size; ++i) {
        value = random_string;
        key = i;
        // Log updates
        // Intrinsic functions to invoke movnti
        _mm_stream_si32(&log[2 * i], key);
        _mm_stream_si32(&log[2 * i + 1], value);
        asm volatile ("sfence");
        array[i] = value;
    }

    // Cacheable log
    for (i = 0; i < array_size; ++i) {
        value = random_string;
        key = i;
        // Log updates
        log[2 * i] = key;
        log[2 * i + 1] = value;
        asm volatile ("sfence");
        array[i] = value;
    }
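For readers who want to compile the two loops, here is a minimal self-contained version. It is a sketch under simplifying assumptions: values are plain `int`s rather than strings, the function names `update_uncacheable` and `update_cacheable` are hypothetical, and `_mm_sfence()` replaces the inline `sfence`.

```c
#include <assert.h>
#include <emmintrin.h>  /* _mm_stream_si32 (movnti), _mm_sfence */
#include <stddef.h>

/* Uncacheable log: key/value pairs bypass the caches via movnti. */
static void update_uncacheable(int *array, int *log, const int *values,
                               size_t n)
{
    assert(array && log && values);
    for (size_t i = 0; i < n; ++i) {
        _mm_stream_si32(&log[2 * i], (int)i);         /* key */
        _mm_stream_si32(&log[2 * i + 1], values[i]);  /* value */
        _mm_sfence();                 /* log entry ordered before the data */
        array[i] = values[i];
    }
}

/* Cacheable log: the same entries written with ordinary stores. */
static void update_cacheable(int *array, int *log, const int *values,
                             size_t n)
{
    assert(array && log && values);
    for (size_t i = 0; i < n; ++i) {
        log[2 * i] = (int)i;          /* key */
        log[2 * i + 1] = values[i];   /* value */
        _mm_sfence();
        array[i] = values[i];
    }
}
```

Both functions leave identical contents in `array` and `log`; they differ only in whether the log stores travel through the cache hierarchy.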

slide-7
SLIDE 7

Issue with Cacheable Log

[Figure: two cores, each with L1i and L1d caches, share the last-level cache, which connects over the memory bus to DRAM and NVM. Log entries occupy lines in the last-level cache.]

Cache pollution

slide-8
SLIDE 8

LLC Miss Rate and Execution Time

[Figure: LLC miss rate (axis from 50% to 90%) and execution time (million cycles, axis from 0.0 to 1.4) for the uncacheable and cacheable logs.]

slide-9
SLIDE 9

How about uncacheable log performance?

slide-10
SLIDE 10

How do we make log uncacheable?

Example: x86 processors provide uncacheable (non-temporal) write instructions (movnti, movntq, etc.). These instructions can be invoked by

  • Inline assembly (__asm__())
  • Intrinsic functions (_mm_stream_si32)
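The two invocation routes can be compared side by side in a short sketch. The wrapper names `stream_store_asm` and `stream_store_intrinsic` are hypothetical; the inline-assembly form assumes GCC or Clang extended asm with AT&T operand order.

```c
#include <assert.h>
#include <emmintrin.h>  /* _mm_stream_si32, _mm_sfence */

/* Route 1: inline assembly that issues movnti directly. */
static void stream_store_asm(int *p, int v)
{
    assert(p);
    __asm__ volatile("movnti %1, %0" : "=m"(*p) : "r"(v));
}

/* Route 2: the compiler intrinsic for the same instruction. */
static void stream_store_intrinsic(int *p, int v)
{
    assert(p);
    _mm_stream_si32(p, v);
}
```

Either call should be followed by `_mm_sfence()` when later stores must be ordered after the non-temporal one.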


slide-11
SLIDE 11

Write Combining Buffer (WCB)

[Figure: each core has an L1 cache and a WCB of 4-6 cache lines. Uncacheable log writes go from the WCB to the memory bus, bypassing the last-level cache, on their way to DRAM and NVM.]

slide-12
SLIDE 12

Issues with Uncacheable Log

  • Existing uncacheable writing schemes are sub-optimal
      • Partial writes in WCB
      • Overhead of uncacheable write instructions
      • Limited WCB size

slide-13
SLIDE 13

Partial Writes in WCB

[Figure: the WCB sends a full write (64B) or a partial write (< 64B) to memory; each takes 1 bus clock.]

Partial writes are inefficient because they underutilize the memory bus bandwidth.
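To put a number on the underutilization, the sketch below computes the share of a 64B transfer that carries data. It assumes, as the slide indicates, that a partial write and a full write each occupy one bus clock; `bus_utilization_percent` and `CACHE_LINE_BYTES` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 64  /* size of a full WCB write */

/* Percentage of one 64B bus transfer that carries useful data. */
static double bus_utilization_percent(size_t write_bytes)
{
    assert(write_bytes > 0 && write_bytes <= CACHE_LINE_BYTES);
    return 100.0 * (double)write_bytes / CACHE_LINE_BYTES;
}
```

Under this assumption a 4B partial write uses 6.25% of the transfer, while a 64B full write uses all of it.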

slide-14
SLIDE 14

Execution Time vs. Transaction Size

[Figure: execution time of the uncacheable and cacheable logs (axis from 0% to 100%) for partial writes and full writes. Annotated execution times: 1.28E09 cycles and 1.15E08 cycles.]

  • Partial writes: string size – 4B, iterations – 2097152, total data – 8MB
  • Full writes: string size – 64B, iterations – 131072, total data – 8MB
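The iteration counts follow from holding the total data at 8MB and dividing by the string size. A small sketch of that arithmetic, with the hypothetical names `TOTAL_DATA_BYTES` and `iterations_for`:

```c
#include <assert.h>
#include <stddef.h>

#define TOTAL_DATA_BYTES ((size_t)8 * 1024 * 1024)  /* 8MB per experiment */

/* Number of loop iterations needed to write total_bytes in
 * string_bytes-sized pieces. */
static size_t iterations_for(size_t total_bytes, size_t string_bytes)
{
    assert(string_bytes > 0 && total_bytes % string_bytes == 0);
    return total_bytes / string_bytes;
}
```

With 4B strings this gives 2097152 iterations, and with 64B strings 131072, matching the two configurations listed.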

slide-15
SLIDE 15

Overhead of Uncacheable Write Instructions

    // Uncacheable log
    for (i = 0; i < array_size; ++i) {
        value = random_string;
        key = i;
        // Log updates
        // Intrinsic functions to invoke movnti
        _mm_stream_si32(&log[2 * i], key);
        _mm_stream_si32(&log[2 * i + 1], value);
        asm volatile ("sfence");
        array[i] = value;
    }

    // Cacheable log
    for (i = 0; i < array_size; ++i) {
        value = random_string;
        key = i;
        // Log updates
        log[2 * i] = key;
        log[2 * i + 1] = value;
        asm volatile ("sfence");
        array[i] = value;
    }

slide-16
SLIDE 16

Overhead of Uncacheable Write Instructions

    void _mm_stream_si32(int *p, int a)

    asm("movnti %1, %0" : "=m" (*p) : "r" (v)); // int *p, int v;

More overhead is needed for type casting if the data written is not an integer.
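The casting overhead shows up as soon as the payload is not an `int`. The sketch below streams a byte buffer, such as a string, in 4-byte pieces; `stream_bytes` is a hypothetical helper, and it assumes the length is a multiple of 4 and the destination is 4-byte aligned. Each piece has to be copied into an `int` before `_mm_stream_si32` can take it.

```c
#include <assert.h>
#include <emmintrin.h>  /* _mm_stream_si32, _mm_sfence */
#include <stddef.h>
#include <string.h>

/* Stream len bytes from src to dst using 32-bit non-temporal stores.
 * Assumes len is a multiple of sizeof(int) and dst is int-aligned. */
static void stream_bytes(int *dst, const void *src, size_t len)
{
    assert(len % sizeof(int) == 0);
    const unsigned char *in = (const unsigned char *)src;
    for (size_t i = 0; i < len / sizeof(int); ++i) {
        int word;
        /* Extra step: reinterpret 4 payload bytes as an int. */
        memcpy(&word, in + i * sizeof(int), sizeof(int));
        _mm_stream_si32(&dst[i], word);
    }
    _mm_sfence();  /* make the streamed stores globally visible in order */
}
```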

slide-17
SLIDE 17

Issues with Limited WCB Size

[Figure: log updates from the transactions issued by the program pass through the WCB onto the NVRAM bus.]

slide-18
SLIDE 18

Inefficiencies of Uncacheable Log

[Figure: execution time (billion cycles, axis from 0.0 to 3.5) of the uncacheable and cacheable logs, and speedup (axis from 1.0 to 1.6), versus string size (4 to 256 bytes). Annotations: "Partial writes and sfence" and "WCB size limit".]

String size (Bytes)    Iterations
4                      2097152
8                      1048576
16                     524288
32                     262144
64                     131072
128                    65536
256                    32768

slide-19
SLIDE 19

Summary

  • Tradeoff between cacheable and uncacheable log
      • Issues with cacheable log – cache contamination
      • Issues with uncacheable log – sub-optimal design in
          • Uncacheable write instructions and programming interface
          • Hardware components, e.g., write-combining buffer design and the way it is used
  • More results
      • Sensitivity study on read/write ratio in transactions
      • Sensitivity study on transaction size
      • Other data structures: hash table, rbtree, b+tree, etc.

slide-20
SLIDE 20

Logging in Persistent Memory: to Cache, or Not to Cache?

Mengjie Li, Matheus Ogleari, Jishen Zhao