SLIDE 1

Using NVDIMM under KVM

Applications of persistent memory in virtualization

Stefan Hajnoczi <stefanha@redhat.com> FOSDEM 2017

SLIDE 2

About me

  • QEMU contributor since 2010
  • Focus on storage, tracing, performance
  • Work in Red Hat’s virtualization team
  • Reviewer of NVDIMM emulation patches in QEMU

SLIDE 3

NVDIMM-N hardware

[Diagram: DRAM, memory controller, NAND flash]

It’s DDR4 RAM with one key feature: it saves data to flash in the event of a power failure.

Details are in the JEDEC JESD245 & JESD248 standards.

SLIDE 4

Not to be confused with NVMe

              NVDIMM       NVMe
Form factor   DIMM         PCIe
Device type   Memory       Block
Capacity      10s of GB    1s of TB
Latency       10s of ns    10s of µs

Both are non-volatile but otherwise totally different device types.

Image credit: CC BY-SA 4.0, Dsimic via Wikimedia Commons

SLIDE 5

Use cases for NVDIMM

Really fast writes are particularly interesting for:

  • In-memory databases: get persistence for free*!
  • Databases: transaction logs
  • File & storage systems: frequently updated metadata

* Need to follow the programming model (explained later)

SLIDE 6

Managing data on NVDIMMs

[Diagram: Region, Namespace, GPT partition table, File system]

  • Multiple NVDIMMs can be interleaved in a region
  • Regions are carved up into namespaces
  • Standard GPT/file system/etc. stack inside namespaces
  • Data is identified by filename or device path

SLIDE 7

Bypassing the I/O stack

[Diagram: Application; open(2), mmap(2), read(2), write(2); File system; Block layer; Load/store instructions]

  • I/O bypasses the kernel when accessing an mmap of pmem via a DAX device
  • The Linux kernel has DAX support
  • DAX means the page cache is bypassed
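
A minimal sketch of the two access paths on this slide, written in Python for brevity. It runs against an ordinary temporary file, so the page cache is still involved; the assumption is that the same mmap calls on a file in a DAX-mounted file system would map the persistent memory directly. All function names here are hypothetical helpers, not part of any pmem library.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def make_scratch_file(size=PAGE):
    """Create a zero-filled temporary file standing in for a pmem-backed file."""
    fd, path = tempfile.mkstemp()
    os.ftruncate(fd, size)
    os.close(fd)
    return path


def write_via_syscall(path, offset, data):
    """Traditional path: the write goes through the file system and block layer."""
    fd = os.open(path, os.O_RDWR)
    try:
        os.pwrite(fd, data, offset)
        os.fsync(fd)  # ask the kernel to push the data to stable storage
    finally:
        os.close(fd)


def write_via_mmap(path, offset, data):
    """Mapped path: after mmap(2), an update is a plain store into memory."""
    fd = os.open(path, os.O_RDWR)
    try:
        with mmap.mmap(fd, 0, flags=mmap.MAP_SHARED,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE) as mm:
            mm[offset:offset + len(data)] = data  # store
            mm.flush()  # msync(2); a stand-in for the flush that pmem requires
    finally:
        os.close(fd)


def read_via_mmap(path, offset, length):
    """Mapped path: a read is a plain load from memory."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with mmap.mmap(fd, 0, flags=mmap.MAP_SHARED, prot=mmap.PROT_READ) as mm:
            return bytes(mm[offset:offset + length])  # load
    finally:
        os.close(fd)
```

Both write helpers leave the same bytes in the file; the difference is that the mapped path issues no system call per access once the mapping exists.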

SLIDE 8

Programming model

Modes of operation:

1) Persistent memory: byte-addressable
2) Block window: block I/O

Described in the pmem.io specifications

[Diagram labels: 512 bytes, cache line]

SLIDE 9

Persistent memory mode

  • Load: use regular load instructions
  • Store: flush the cache line after the store, or use a non-temporal store
  • Error handling: Machine Check Exception on read, but hard to handle in applications
  • Robustness: map only the data you need to protect against stray writes, or use Memory Protection Keys
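
The store and robustness points can be sketched as follows, again in Python against an ordinary file. The assumptions: `msync` (via `mmap.flush`) stands in for the cache line flush, and a small window of the file stands in for "only the data you need". `map_window` and `store_and_flush` are hypothetical helper names. Memory Protection Keys are not shown.

```python
import mmap
import os
import tempfile

# Mapping offsets must be a multiple of this value.
GRANULE = mmap.ALLOCATIONGRANULARITY


def map_window(fd, index, writable):
    """Map only the GRANULE-sized window at `index`, not the whole file."""
    prot = mmap.PROT_READ | (mmap.PROT_WRITE if writable else 0)
    return mmap.mmap(fd, GRANULE, flags=mmap.MAP_SHARED, prot=prot,
                     offset=index * GRANULE)


def store_and_flush(window, offset, data):
    """Store, then flush, so the update is not left sitting in a volatile cache."""
    window[offset:offset + len(data)] = data
    window.flush(0, GRANULE)  # msync(2) over the window
```

A stray write can only land inside the window that is mapped, and a window mapped without `PROT_WRITE` rejects writes entirely (Python raises `TypeError`; a C program would take a fault instead).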

SLIDE 10

Block window mode

Block device semantics:

  • Sector-based I/O
  • Immediate error notification
  • Data not exposed to stray memory writes

But:

  • No DAX, traditional read(2)/write(2) only
  • Hard to virtualize efficiently, not yet implemented in QEMU

SLIDE 11

ndctl utility and NVM Library

ndctl utility manages NVDIMMs, regions, and namespaces

https://github.com/pmem/ndctl

NVM Library APIs offer:

  • Low-level access to pmem
  • Higher-level data structures and memory allocators

http://pmem.io/nvml/

SLIDE 12

NVDIMM pass-through in QEMU

  • Pass-through of entire namespace (files too in the future)
  • Label area is emulated, guest cannot alter host label area
  • Guest directly accesses host pmem: no vmexits!

[Diagram: Host: physical NVDIMM, namespace0.0, /dev/dax. Guest: virtual NVDIMM, namespace0.0, ext4, file /db/tx-log.dat]
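
A hedged sketch of what a QEMU invocation with a virtual NVDIMM might look like, built as an argument list in Python. The option names follow QEMU's docs/nvdimm.txt; verify them against the copy shipped with your QEMU version. The backing path, sizes, object IDs, and the `nvdimm_args` helper are assumptions for illustration only.

```python
def nvdimm_args(mem_path, nvdimm_size, ram_size="4G", slots=2, maxmem="32G"):
    """Build a QEMU argument list that exposes `mem_path` as a guest NVDIMM.

    slots and maxmem must leave room for the NVDIMM device in addition to
    normal RAM. All default values here are hypothetical.
    """
    backend = (
        "memory-backend-file,id=mem1,share=on,"
        f"mem-path={mem_path},size={nvdimm_size}"
    )
    return [
        "qemu-system-x86_64",
        "-machine", "pc,nvdimm",
        "-m", f"{ram_size},slots={slots},maxmem={maxmem}",
        "-object", backend,                          # host backing store
        "-device", "nvdimm,id=nvdimm1,memdev=mem1",  # guest-visible NVDIMM
    ]


if __name__ == "__main__":
    # /dev/dax0.0 is a hypothetical host device path.
    print(" ".join(nvdimm_args("/dev/dax0.0", "16G")))
```

The `share=on` setting matters here: without it, guest stores would go to a private copy rather than to the host backing store.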

SLIDE 13

Fake NVDIMM in QEMU

[Diagram: Guest #1 (QEMU #1) and Guest #2 (QEMU #2) sharing the host file /big-data]

  • Non-DAX host files as guest NVDIMMs (Careful: stores are not persistent!)
  • Example: two guests sharing read-only access to a host file
  • Bypasses guest page cache if DAX is enabled inside guest
  • Avoids copy-in and reduces overall memory footprint

SLIDE 14

Future QEMU use cases

QEMU maintains frequently updated metadata:

  • Allocation maps and refcounts in disk image files
  • Dirty bitmap for incremental disk backup

NVDIMM could be used to speed up these features.

Requires extensions to disk image formats to split frequently used metadata into a separate DAX file.

SLIDE 15

Thank you

  • Application developers → NVM Library: http://pmem.io/nvml/
  • High-level overview → SNIA NVM Programming Model (NPM) 1.1: https://goo.gl/d4YHPl
  • Low-level details → NVDIMM specifications: http://pmem.io/documents/
  • QEMU command-line syntax → docs/nvdimm.txt
  • My blog → http://blog.vmsplice.net/
  • IRC → stefanha on Freenode & OFTC

Status February 2017:

  • QEMU 2.6+
  • Linux 4.1+
  • libvirt

SLIDE 16

Special thanks to...

Jeff Moyer, Dan Williams, Haozhong Zhang, Guangrong Xiao, Ross Zwisler

...for feedback and discussion

SLIDE 17

Backup slides

SLIDE 18

Persistence domains

  • A regular store instruction is not enough to make data persistent!
  • Data must reach the hardware-dependent “persistence domain”
  • On Intel that means CLFLUSHOPT + SFENCE on platforms with the ADR feature

[Illustration captions: “Score if ball lands in goal” / “Score if ball lands anywhere on opposing side!”]

SLIDE 19

Block Translation Table

  • Provides atomic sector I/O
  • Prevents the torn write problem if a power failure occurs during a sector write operation
  • Optional layer on top of pmem or blk mode
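
The idea behind atomic sector I/O can be illustrated with a toy model in Python. This is a simplified sketch of write-to-a-free-block-then-remap indirection, not the on-media Block Translation Table layout from the specifications. `TinyBTT`, `PowerFailure`, and the `fail_after` parameter are hypothetical names used only to simulate a power loss.

```python
SECTOR = 512


class PowerFailure(Exception):
    """Raised by the simulation to cut a sector write short."""


class TinyBTT:
    def __init__(self, nsectors):
        # One more physical block than logical sectors: the spare is the free block.
        self.blocks = [bytearray(SECTOR) for _ in range(nsectors + 1)]
        self.map = list(range(nsectors))  # logical sector -> physical block
        self.free = nsectors              # index of the current free block

    def read(self, lba):
        return bytes(self.blocks[self.map[lba]])

    def write(self, lba, data, fail_after=None):
        if len(data) != SECTOR:
            raise ValueError("data must be exactly one sector")
        target = self.blocks[self.free]
        # Step 1: fill the free block. The block the map points at is untouched,
        # so a crash here cannot tear the sector that readers see.
        n = SECTOR if fail_after is None else fail_after
        target[:n] = data[:n]
        if fail_after is not None:
            raise PowerFailure(f"power lost after {n} bytes")
        # Step 2: publish with a single map update and recycle the old block.
        self.map[lba], self.free = self.free, self.map[lba]
```

A reader sees either the complete old sector or the complete new one, because the only step that changes what a logical sector points at is the single map update.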

SLIDE 20

Hardware availability

  • No widely available hardware on the market (Feb 2017)
  • Intel, Micron, and HPE have announced products