Persistent Memory Architecture Research at UCSC: Workload Characterization and Hardware Support for Persistence
Jishen Zhao (jishen.zhao@ucsc.edu)
Computer Engineering, UC Santa Cruz
July 12, 2016

What is persistent memory?
• Persistent memory: NVRAM
[Figure: NVRAM shown as persistent memory, alongside memory and storage]

NVRAM is here …
• STT-RAM, PCM, ReRAM, NVDIMM, 3D XPoint, etc.
[Figure: NVRAM, 2016]

Design Opportunities with NVRAM
[Figure: two access paths. Memory: the CPU uses loads/stores to reach DRAM (not persistent) and NVRAM (persistent memory). Storage: the CPU uses fopen(), fread(), fwrite(), … to reach disk/flash (persistent).]
• Allow in-memory data structures to become permanent immediately
• Demonstrated 32x speedup compared with using storage devices [Condit+ SOSP’09, Volos+ ASPLOS’11, Coburn+ ASPLOS’11, Venkataraman+ FAST’11]

Executing Applications in Persistent Memory
[Figure: an application reaches persistent memory through open() and mmap()]
Source: Jeff Moyer, “Persistent memory in Linux,” SNIA NVM Summit, 2016.
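The open()/mmap() path named on this slide can be sketched roughly as follows. This is a minimal illustration rather than code from the talk: the function name `pm_store_message` is hypothetical, and it runs against any regular file. On a DAX-mounted file system the same calls would map NVRAM directly, but that setup is assumed, not shown.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map `len` bytes of the file at `path`, write `msg`
 * at offset 0 with ordinary CPU stores, and ask the kernel to make the
 * stores durable. Returns 0 on success, -1 on error. */
int pm_store_message(const char *path, const char *msg, size_t len)
{
    assert(strlen(msg) < len);

    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    char *base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (base == MAP_FAILED)
        return -1;

    strcpy(base, msg);                  /* plain stores, no write() call */
    int rc = msync(base, len, MS_SYNC); /* flush the mapped range */
    munmap(base, len);
    return rc;
}
```

After the initial open() and mmap(), every update is a load or store rather than a system call, which is the property the slide highlights.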

Our research: at the software/hardware boundary
• Workload characterization
  • Exploring persistent memory use cases
  • Identifying system bottlenecks
  • Implications to software/hardware design
• System software
  • Efficient fault tolerance and data persistence mechanisms
• Hardware
  • Developing storage accelerators
  • Redefining the boundary between software and hardware
[Figure: system stack, top to bottom: Applications; System Software (VM, File System, Database System); ISA; CPU; DRAM and NVRAM; SSD/HDD]

Workload Characterization from a hardware perspective
• Motivation
  • Persistent memory is managed by both hardware and software
  • Most prior work profiles only software statistics, e.g., system throughput
• Objectives
  • Help system designers better understand performance bottlenecks
  • Help application designers better utilize persistent memory hardware
• Approach
  • Profile hardware and software counter statistics
  • Instrument application and system software to obtain insights at the micro-architecture level

Hardware and software configurations
• CPU: Intel Xeon CPU E5-2620 v3
• Memory: 12GB of pmem + 4GB of main memory, partitioned on DRAM (memmap)
• Operating system: Linux 4.4.0 kernel
• Profiling tools
  • Linux Perf: collecting software and hardware counter statistics
  • Intel Pin 3.0 instrumentation tool with in-house Pintools
• File systems evaluated
  • Ext4: journaling of metadata, running on RAMDisk
  • Ext4-DAX: journaling of metadata, bypassing the page cache with DAX
  • NOVA: non-volatile memory accelerated log-structured file system [Xu+ FAST’16]

About DAX
• What is DAX?
  • “Direct Access”
  • Enables efficient Linux support for persistent memory
  • Allows file system requests to bypass the page cache allocated in DRAM and access NVRAM directly via loads and stores
• How does Ext4-DAX work?
  • DAX maps storage components directly into userspace
  • * True DAX is not supported in Linux yet: accesses still go through DRAM, i.e., pages are swapped directly between DRAM main memory and NVRAM storage
• Examples of file systems with DAX capability
  • Ext4-DAX, XFS-DAX, Btrfs-DAX → Fedora
  • Intel PMFS
  • NOVA

Current workloads
• Filebench (a widely used benchmark suite designed for evaluating file system performance)
  • Fileserver, Webproxy, Webserver, Varmail
• FFSB (Flexible Filesystem Benchmark)
  • Can configure read/write ratio and number of threads
• Bonnie
  • Measures file system performance by invoking putc() and getc()
• File compression/decompression: tar/untar, zip/unzip
• TPC-C running with MySQL
  • A database online transaction processing workload
  • Write intensive, with 63.7% of writes
• In-house micro-benchmarks
• * Applications are compiled with static linking and stored in the NVRAM (pmem) region

Workload throughput (operations per second)
[Figure: Filebench throughput in operations per second for ext4, ext4-DAX, and NOVA on Fileserver, Webproxy, Webserver, and Varmail]
[Figure: TPC-C transactions per ten seconds, and TAR/UNTAR execution time in nanoseconds, for NOVA, ext4-DAX, and ext4]

Correlation between system performance and hardware behavior
[Figure: correlation coefficient between system performance and each hardware metric (dTLB miss rate, iTLB miss rate, LLC load miss rate, LLC store miss rate, page fault rate) for Fileserver, Webproxy, Webserver, Varmail, Zip, Unzip, and FFSB]
• Highly correlated (standard error within 8%)
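The slide does not say how the correlation coefficient was computed. Assuming it is the usual Pearson coefficient between a throughput series and a hardware-counter series (for example, dTLB miss rate sampled over the same runs), a minimal sketch might look like this. The function names are illustrative, and the small Newton square root is there only so the sketch builds without linking the math library.

```c
#include <assert.h>
#include <stddef.h>

/* Newton's method square root, used to avoid a libm dependency. */
static double sqrt_newton(double v)
{
    if (v <= 0.0)
        return 0.0;
    double g = v > 1.0 ? v : 1.0;
    for (int i = 0; i < 100; i++)
        g = 0.5 * (g + v / g);
    return g;
}

/* Pearson correlation between two equally long sample series, e.g.
 * x = throughput per run and y = a miss rate per run (assumed inputs). */
double pearson(const double *x, const double *y, size_t n)
{
    assert(n >= 2);

    double mx = 0.0, my = 0.0;
    for (size_t i = 0; i < n; i++) {
        mx += x[i];
        my += y[i];
    }
    mx /= (double)n;
    my /= (double)n;

    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (size_t i = 0; i < n; i++) {
        double dx = x[i] - mx, dy = y[i] - my;
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    if (sxx == 0.0 || syy == 0.0)
        return 0.0; /* undefined for a constant series; 0 by convention here */
    return sxy / sqrt_newton(sxx * syy);
}
```

A value near +1 or -1 would indicate that the counter tracks throughput closely across runs; a value near 0 would indicate little linear relationship.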

Throughput vs. Write Intensity
[Figure: FFSB throughput (transactions/s) for ext4, ext4-DAX, and NOVA at read/write mixes R=100%/W=0%, R=90%/W=10%, R=80%/W=20%, R=70%/W=30%, R=60%/W=40%, and R=0%/W=100%]
[Figure: Bonnie (read:write = 1:1) normalized throughput for ext4-DAX, ext4, and NOVA across the Bonnie metrics: putc() throughput, block writes, block create/change/rewrite, getc() throughput, efficient block reads, and effective random seek rate]

The impact of workload locality
• NVRAM devices may or may not have an on-chip buffer
[Figure: transactions per second for ext4, ext4-DAX, and NOVA under the DRAM model, the classic NVM model, and the revised NVRAM model with buffer hit rates of 50%, 60%, 70%, 80%, and 90%]
[Figure: transactions per second for ext4, ext4-DAX, and NOVA under DRAM and under the revised NVRAM model with buffer sizes of 4KB, 2KB, 1KB, 512B, and 256B]

Our research: at the software/hardware boundary
• Workload characterization
  • Exploring persistent memory use cases
  • Identifying system bottlenecks
  • Implications to software/hardware design
• System software
  • Efficient fault tolerance and data persistence mechanisms
• Hardware
  • Developing storage accelerators
  • Redefining the boundary between software and hardware
[Figure: system stack, top to bottom: Applications; System Software (VM, File System, Database System); ISA; CPU; DRAM and NVRAM; SSD/HDD]

Logging Acceleration (executive summary)
• Problem
  • Traditional software-based logging imposes substantial overhead in persistent memory
    • Even with undo-only or redo-only logging
    • Let alone undo+redo logging, as used in many modern database systems
  • Changes in the software interface add burden on programmers
• Solution
  • Hardware-based logging accelerators
  • Leverage existing hardware information (otherwise largely wasted)
• Results
  • 3.3X performance improvement
  • Simplified software interface
  • Low hardware overhead

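Logging (Journaling) in Persistent Memory (Maintaining Atomicity)
[Figure: a tree in NVRAM (Root, A, B, C, D) shown in three stages; the new values C’ and D’ are first written to a log, followed by a barrier, and then applied to the tree in place. Annotation: size of one store.]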

Performance overhead of software logging
[Figure: performance overhead of software logging]
Source: Zhao+, “Kiln: Closing the performance gap between systems with and without persistence support,” MICRO 2013.

Software interface of software logging
• Memory barriers, strict ordering constraints, and cache flushing are all needed to ensure data persistence
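To make the cost of that interface concrete, here is a minimal sketch of a software undo-logged store. It is not the code from the talk. The names `persist`, `undo_entry`, `logged_store`, and `recover` are hypothetical, and `persist` is only a placeholder: it issues a compiler/CPU fence, whereas a real persistent memory system would also flush the affected cache lines before the fence.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder for "flush the cache lines covering [p, p+n), then fence".
 * Only the fence is modeled here; the flush is assumed. */
static void persist(const void *p, size_t n)
{
    (void)p;
    (void)n;
    atomic_thread_fence(memory_order_seq_cst);
}

typedef struct {
    uint64_t *addr; /* location being updated */
    uint64_t old;   /* undo value */
    int valid;      /* 1 while the update is in flight */
} undo_entry;

/* One logged store costs four ordered persist steps. */
void logged_store(undo_entry *log, uint64_t *addr, uint64_t val)
{
    log->addr = addr;
    log->old = *addr;
    persist(log, sizeof *log);               /* 1. undo record reaches the log */

    log->valid = 1;
    persist(&log->valid, sizeof log->valid); /* 2. record is marked live */

    *addr = val;
    persist(addr, sizeof *addr);             /* 3. data is updated in place */

    log->valid = 0;
    persist(&log->valid, sizeof log->valid); /* 4. commit: record retired */
}

/* After a crash, roll back any update that did not commit. */
void recover(undo_entry *log)
{
    if (log->valid) {
        *log->addr = log->old;
        persist(log->addr, sizeof *log->addr);
        log->valid = 0;
        persist(&log->valid, sizeof log->valid);
    }
}
```

Each step must complete before the next begins, which is where the barriers and ordering constraints on the slide come from.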

Our software interface
• Memory barriers, strict ordering constraints, and cache flushing all needed for ensuring data persistence
Hardware support for

How does it work? L1 cache hit: we get all that is needed for the undo+redo log
• Writes to persistent memory automatically trigger a write to the log, a software-allocated circular buffer
• Log information includes TxID, address, undo cache line value, and redo cache line value
• Leveraging the cache hit/miss handling process to update the log
• Log updates get buffered in the processor
[Figure (b): the store hits in the L1 cache, which holds the old value A1; the old value A1 and the new value A’1 enter the log buffer (FIFO) in the volatile processor, bypass the caches, and pass through the memory controllers to the circular log buffer in nonvolatile NVRAM. Each log entry holds TxID, addr(A), and two cache-line-sized values. Tx_commit is issued by the core.]

How does it work? L1 cache miss: we get all that is needed during “write-allocate”
• Writes to persistent memory automatically trigger a write to the log, a software-allocated circular buffer
• Log information includes TxID, address, undo cache line value, and redo cache line value
• Leveraging the cache hit/miss handling process to update the log
• Log updates get buffered in the processor
[Figure (c): the store misses in the L1 cache; write-allocate fetches the old value A1 on a hit in a lower-level cache; the old value A1 and the new value A’1 enter the log buffer (FIFO), bypass the caches, and pass through the memory controllers to the circular log buffer in nonvolatile NVRAM. Each log entry holds TxID, addr(A), and two cache-line-sized values. Tx_commit is issued by the core.]
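The log entry contents listed above can be modeled in software as follows. This is a rough behavioral sketch under stated assumptions, not the hardware design: the 64-byte line size, the 8-slot FIFO depth, and all identifiers (`log_entry`, `log_fifo`, `on_persistent_store`) are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE 64      /* assumed cache line size in bytes */
#define FIFO_SLOTS 8 /* assumed depth of the on-chip log buffer */

/* One log record, with the fields named on the slide. */
typedef struct {
    uint32_t tx_id;     /* TxID */
    uintptr_t addr;     /* address of the cache line */
    uint8_t undo[LINE]; /* line contents before the store */
    uint8_t redo[LINE]; /* line contents after the store */
} log_entry;

typedef struct {
    log_entry slot[FIFO_SLOTS];
    unsigned count;
} log_fifo;

/* Model a store of n bytes at offset off within a cached line. On a hit the
 * old line is already in L1; on a miss, write-allocate brings it in first.
 * Either way the old and new line contents are both available, so one
 * record captures undo and redo together.
 * Returns 0 if the record was buffered, -1 if the FIFO is full. */
int on_persistent_store(log_fifo *f, uint32_t tx_id, uint8_t *line,
                        unsigned off, const void *src, unsigned n)
{
    assert(off + n <= LINE);
    if (f->count == FIFO_SLOTS)
        return -1; /* real hardware would drain records to the NVRAM log */

    log_entry *e = &f->slot[f->count++];
    e->tx_id = tx_id;
    e->addr = (uintptr_t)line;
    memcpy(e->undo, line, LINE); /* old value, taken from the cache */
    memcpy(line + off, src, n);  /* the store itself */
    memcpy(e->redo, line, LINE); /* new value */
    return 0;
}
```

The point of the sketch is that software issues only the store; the record is produced as a side effect of normal cache handling.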

Force cache writeback when necessary
• Need to flush CPU caches when
  • A log entry is about to be overwritten by new log updates
  • But the associated data still remains in the CPU caches
[Figure: circular log buffer with head and tail pointers]
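The wrap-around condition can be sketched with a small circular log model. This is an assumption-laden illustration: the 4-slot log, the per-slot `data_cached` flag, and the function names are hypothetical, and the model only counts forced writebacks instead of performing them.

```c
#include <assert.h>
#include <stdint.h>

#define LOG_SLOTS 4 /* assumed log capacity, kept tiny for illustration */

typedef struct {
    uintptr_t addr[LOG_SLOTS];  /* data address each log entry protects */
    int data_cached[LOG_SLOTS]; /* 1 if that data is still only in the caches */
    unsigned next;              /* slot the next entry will occupy */
    unsigned used;              /* number of live entries */
    unsigned forced;            /* forced writebacks so far */
} circ_log;

/* Record that the line at addr was written back by normal cache eviction. */
void circ_log_note_writeback(circ_log *l, uintptr_t addr)
{
    for (unsigned i = 0; i < LOG_SLOTS; i++)
        if (l->addr[i] == addr)
            l->data_cached[i] = 0;
}

/* Append an entry. When the log is full, the next slot holds the oldest
 * entry; if its data has not reached NVRAM yet, it must be written back
 * before the entry can be safely overwritten. */
void circ_log_append(circ_log *l, uintptr_t addr)
{
    if (l->used == LOG_SLOTS) {
        unsigned victim = l->next;
        if (l->data_cached[victim]) {
            /* real hardware would write back the line at addr[victim] here */
            l->forced++;
        }
        l->used--;
    }
    l->addr[l->next] = addr;
    l->data_cached[l->next] = 1;
    l->next = (l->next + 1) % LOG_SLOTS;
    l->used++;
}
```

If normal evictions keep pace with logging, the oldest entry's data is usually already in NVRAM and no forced writeback occurs.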

Results
• McSimA+ simulator running
  • Persistent memory micro-benchmarks
  • A real workload: a persistent version of memcached
• System throughput improved by 1.45x~1.60x on average
• Memcached throughput improved by 3.3x
• Memory traffic reduced by 2.36x~3.12x
• Dynamic memory energy improved by 1.53x~1.72x
• Hardware overhead
  • 17 bytes of flip-flops
  • 1-bit cache tag information per cache line
  • Multiplexers
