Persistent Memory Architecture Research at UCSC – Workload Characterization and Hardware Support for Persistence (PowerPoint Presentation)

SLIDE 1

Persistent Memory Architecture Research at UCSC – Workload Characterization and Hardware Support for Persistence

Jishen Zhao jishen.zhao@ucsc.edu Computer Engineering UC Santa Cruz July 12, 2016

SLIDE 2

What is persistent memory?

  • Persistent memory = memory + storage, built on NVRAM

SLIDE 3

NVRAM is here… (2016)

STT-RAM, PCM, ReRAM, NVDIMM, 3D XPoint, etc.
SLIDE 4

Design Opportunities with NVRAM

  • Allow in-memory data structures to become permanent immediately
  • Demonstrated 32x speedup compared with using storage devices

[Condit+ SOSP’09, Volos+ ASPLOS’11, Coburn+ ASPLOS’11, Venkataraman+ FAST’11]

[Figure: in a conventional system, the CPU accesses DRAM memory via loads/stores (not persistent) and disk/flash storage via fopen()/fread()/fwrite() (persistent); with persistent memory, the CPU accesses NVRAM directly via loads/stores, and the data is persistent.]

SLIDE 5

Executing Applications in Persistent Memory

Jeff Moyer, “Persistent memory in Linux,” SNIA NVM Summit, 2016.

  • open()
  • mmap()

SLIDE 6

Our research – At the software/hardware boundary

  • Workload characterization
    • Exploring persistent memory use cases
    • Identifying system bottlenecks
    • Implications to software/hardware design
  • System software
    • Efficient fault tolerance and data persistence mechanisms
  • Hardware
    • Developing storage accelerators
    • Redefining the boundary between software and hardware

[Figure: system stack – Applications; System Software (VM, file system, database system); ISA; CPU with DRAM, NVRAM, and SSD/HDD.]

SLIDE 7

Workload Characterization from a Hardware Perspective

  • Motivation
    • Persistent memory is managed by both hardware and software
    • Most prior work only profiles software statistics, e.g., system throughput
  • Objectives
    • Help system designers better understand performance bottlenecks
    • Help application designers better utilize persistent memory hardware
  • Approach
    • Profile hardware and software counter statistics
    • Instrument application and system software to obtain insights at the micro-architecture level

SLIDE 8

Hardware and software configurations

  • CPU: Intel Xeon E5-2620 v3
  • Memory: 12GB of pmem + 4GB of main memory, partitioned on DRAM (memmap)
  • Operating system: Linux 4.4.0 kernel
  • Profiling tools
    • Linux perf: collects software and hardware counter statistics
    • Intel Pin 3.0 instrumentation tool with in-house Pintools
  • File systems evaluated
    • Ext4: journaling of metadata, running on a RAM disk
    • Ext4-DAX: journaling of metadata, bypassing the page cache with DAX
    • NOVA: a nonvolatile-memory-accelerated log-structured file system [Li+ FAST’16]

SLIDE 9

About DAX

  • What is DAX?
    • “Direct Access”
    • Enables efficient Linux support for persistent memory
    • Allows file system requests to bypass the page cache allocated in DRAM and directly access NVRAM via loads and stores
  • How does Ext4-DAX work?
    • DAX maps storage components directly into userspace
    • * True DAX is not supported in Linux yet – accesses still go through DRAM, i.e., pages are swapped directly between DRAM main memory and NVRAM storage
  • Examples of file systems with DAX capability
    • Ext4-DAX, XFS-DAX, Btrfs-DAX → Fedora
    • Intel PMFS
    • NOVA

SLIDE 10

Current workloads

  • Filebench (a widely used benchmark suite for evaluating file system performance)
    • Fileserver, Webproxy, Webserver, Varmail
  • FFSB (Flexible Filesystem Benchmark)
    • Configurable read/write ratio and number of threads
  • Bonnie
    • Measures file system performance by invoking putc() and getc()
  • File compression/decompression: tar/untar, zip/unzip
  • TPC-C running with MySQL
    • A database online transaction processing workload
    • Write intensive, with 63.7% writes
  • In-house micro-benchmarks
  • * Applications are compiled with static linking and stored in the NVRAM (pmem) region

SLIDE 11

Workload throughput

[Charts: (1) Filebench throughput (operations per second, 14,000–21,000 range) for Fileserver, Webproxy, Webserver, and Varmail under ext4, ext4-DAX, and NOVA; (2) tar/untar execution time in nanoseconds under the three file systems; (3) TPC-C transactions per ten seconds under the three file systems.]

SLIDE 12

Correlation between system performance and hardware behavior

[Chart: correlation coefficients between workload performance and dTLB miss rate, iTLB miss rate, LLC load miss rate, LLC store miss rate, and page fault rate, for Fileserver, Webproxy, Webserver, Varmail, Zip, Unzip, and FFSB. These metrics are highly correlated with performance (standard error within 8%).]

SLIDE 13

Throughput vs. write intensity

[Charts: (1) FFSB throughput (transactions per second) under ext4, ext4-DAX, and NOVA as the read/write mix varies from R=100%/W=0% through R=90%/W=10%, R=80%/W=20%, R=70%/W=30%, R=60%/W=40%, to R=0%/W=100%; (2) Bonnie (read:write = 1:1) normalized throughput for putc(), block writes, block create/change/rewrite, getc(), efficient block reads, and effective random seek rate under ext4-DAX, ext4, and NOVA.]

SLIDE 14

The impact of workload locality

  • NVRAM devices may or may not have an on-chip buffer

[Charts: transactions per second under ext4, ext4-DAX, and NOVA as (1) the buffer hit rate in a revised NVRAM model varies from 50% to 90%, compared against DRAM and a classic NVM model, and (2) the buffer size varies from 4KB down to 2KB, 1KB, 512B, and 256B, compared against 4KB DRAM.]

SLIDE 15

Our research – At the software/hardware boundary

  • Workload characterization
    • Exploring persistent memory use cases
    • Identifying system bottlenecks
    • Implications to software/hardware design
  • System software
    • Efficient fault tolerance and data persistence mechanisms
  • Hardware
    • Developing storage accelerators
    • Redefining the boundary between software and hardware

[Figure: system stack – Applications; System Software (VM, file system, database system); ISA; CPU with DRAM, NVRAM, and SSD/HDD.]

SLIDE 16

Logging Acceleration (executive summary)

  • Problem
    • Traditional software-based logging imposes substantial overhead in persistent memory
    • Even with undo or redo logging alone, let alone undo+redo logging as used in many modern database systems
    • Changes in the software interface add burden on programmers
  • Solution
    • Hardware-based logging accelerators
    • Leverage existing hardware information (otherwise largely wasted)
  • Results
    • 3.3x performance improvement
    • Simplified software interface
    • Low hardware overhead

SLIDE 17

Logging (Journaling) in Persistent Memory (Maintaining Atomicity)

[Figure: a tree (Root, A, B, C, D) in NVRAM is updated atomically: new versions C’ and D’ are first written to a log, a memory barrier is issued, and only then is the tree updated in place; each log write is the size of one store.]

SLIDE 18

Performance overhead of software logging

Zhao+, “Kiln: Closing the performance gap between systems with and without persistence support,” MICRO 2013.

SLIDE 19

Software interface of software logging

  • Memory barriers, strict ordering constraints, and cache flushing are all needed to ensure data persistence

SLIDE 20

Our software interface

  • With hardware support for logging, the memory barriers, strict ordering constraints, and cache flushing previously needed to ensure data persistence are no longer exposed to the programmer

SLIDE 21

How does it work?

  • Writes to persistent memory automatically trigger a write to the log – a software-allocated circular buffer
  • Log information includes TxID, address, undo cache line value, and redo cache line value
  • The cache hit/miss handling process is leveraged to update the log
  • Log updates get buffered in the processor

[Figure: a processor with cores, L1 caches, a last-level cache, cache controllers, and memory controllers attached to DRAM and NVRAM; a cache-line-sized FIFO log buffer in the volatile processor feeds the circular log in nonvolatile NVRAM, bypassing the caches. On an L1 cache hit, the old and new line values (A1/A’1, A2/A’2) are captured into the log buffer along with the TxID and address, and drained to the log on Tx_commit.]

L1 cache hit – we get everything needed for the undo+redo log

SLIDE 22

How does it work?

  • Writes to persistent memory automatically trigger a write to the log – a software-allocated circular buffer
  • Log information includes TxID, address, undo cache line value, and redo cache line value
  • The cache hit/miss handling process is leveraged to update the log
  • Log updates get buffered in the processor

[Figure: same structure as the previous slide, but for an L1 cache miss: the write-allocate fetches the line from a lower-level cache, at which point the old and new values (A1/A’1, A2/A’2) are captured into the log buffer along with the TxID and address, bypassing the caches on the way to the circular log in NVRAM, and drained on Tx_commit.]

L1 cache miss – we get everything needed during the “write-allocate”

SLIDE 23

Force cache writeback when necessary

  • CPU caches need to be flushed when
    • A log entry is about to be overwritten by new log updates
    • But the associated data still remains in the CPU caches

[Figure: circular log buffer with head and tail pointers.]

SLIDE 24

Results

  • Evaluated with the McSimA+ simulator running
    • Persistent memory micro-benchmarks
    • A real workload – a persistent version of memcached
  • System throughput improved by 1.45x–1.60x on average
  • Memcached throughput improved by 3.3x
  • Memory traffic reduced by 2.36x–3.12x
  • Dynamic memory energy improved by 1.53x–1.72x
  • Hardware overhead
    • 17 bytes of flip-flops
    • 1 bit of cache tag information per cache line
    • Multiplexers

SLIDE 25

Summary

  • Workload characterization
    • Exploring persistent memory use cases
    • Identifying system bottlenecks
    • Implications to software/hardware design
  • System software
    • Efficient fault tolerance and data persistence mechanisms
  • Hardware
    • Developing storage accelerators
    • Redefining the boundary between software and hardware

[Figure: system stack – Applications; System Software (VM, file system, database system); ISA; CPU with DRAM, NVRAM, and SSD/HDD.]

✓ ✓

SLIDE 26

UCSC STABLE (SysTem and Architecture lab on scalaBility, reLiability, and Energy-efficiency)

Email: Jishen.zhao@ucsc.edu
https://users.soe.ucsc.edu/~jzhao/

SLIDE 27

Persistent Memory Architecture Research at UCSC – Workload Characterization and Hardware Support for Persistence

Jishen Zhao jishen.zhao@ucsc.edu Computer Engineering UC Santa Cruz July 12, 2016