NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System



Slide 1

NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System

Jian Andiry Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, Tamires Brito Da Silva, Andy Rudoff (Intel), Steven Swanson

Non-Volatile Systems Laboratory Department of Computer Science and Engineering University of California, San Diego
Slide 2

Non-volatile Memory and DAX

  • Non-volatile main memory (NVMM)

– PCM, STT-RAM, ReRAM, 3D XPoint technology
– Resides on the memory bus with a load/store interface

[Diagram: applications access DRAM and NVMM directly via load/store instructions; HDD/SSD storage is reached through the file system]

Slide 3

Non-volatile Memory and DAX

  • Non-volatile main memory (NVMM)

– PCM, STT-RAM, ReRAM, 3D XPoint technology
– Resides on the memory bus with a load/store interface

  • Direct Access (DAX)

– DAX file I/O bypasses the page cache
– DAX-mmap() maps NVMM pages into the application’s address space directly, bypassing the file system
– The “killer app” for NVMM

[Diagram: a conventional mmap() of HDD/SSD-backed files goes through a copy in the DRAM page cache, while DAX-mmap() maps NVMM pages directly]
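The contrast between copying file I/O and DAX-mmap() can be approximated on any POSIX system: stores through a shared file mapping reach the file with no read()/write() call on the data path. A minimal Python sketch, using an ordinary file as a stand-in for an NVMM-backed, DAX-mapped one:

```python
import mmap
import os
import tempfile

def mmap_write(path, offset, payload):
    """Store bytes into a file through a memory mapping,
    with no read()/write() syscalls on the data path."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as m:
            # Plain stores into the mapping update the file contents.
            m[offset:offset + len(payload)] = payload

# Create a 4 KiB file and write into it via the mapping.
fd, path = tempfile.mkstemp()
os.pwrite(fd, b"\0" * 4096, 0)
os.close(fd)
mmap_write(path, 100, b"hello")
with open(path, "rb") as f:
    data = f.read()
os.remove(path)
```

With true DAX the stores would hit persistent memory directly; here the kernel page cache still sits in between, which is exactly the copy DAX eliminates.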

Slide 4

Application expectations on NVMM File System

POSIX I/O · Atomicity · Fault Tolerance · Speed · Direct Access (DAX)

Slide 5

ext4, xfs, BtrFS, F2FS: POSIX I/O ✔ · Atomicity ✔ · Fault Tolerance ✔ · Speed ❌ · Direct Access (DAX) ❌

Slide 6

PMFS, ext4-DAX, xfs-DAX: POSIX I/O ✔ · Atomicity ❌ · Fault Tolerance ❌ · Speed ✔ · Direct Access (DAX) ✔

Slide 7

Strata (SOSP ’17): POSIX I/O ✔ · Atomicity ✔ · Fault Tolerance ❌ · Speed ✔ · Direct Access (DAX) ❌

Slide 8

NOVA (FAST ’16): POSIX I/O ✔ · Atomicity ✔ · Fault Tolerance ❌ · Speed ✔ · Direct Access (DAX) ✔

Slide 9

NOVA-Fortis: POSIX I/O ✔ · Atomicity ✔ · Fault Tolerance ✔ · Speed ✔ · Direct Access (DAX) ✔

Slide 10

Challenges


Slide 11

NOVA: Log-structured FS for NVMM

  • Per-inode logging

– High concurrency
– Parallel recovery

  • High scalability

– Per-core allocator, journal and inode table

  • Atomicity

– Logging for single-inode updates
– Journaling for updates across logs
– Copy-on-write for file data

[Diagram: per-inode logging; each inode has a log with head and tail pointers, and log entries reference data pages]

Jian Xu and Steven Swanson, NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories, FAST ’16.
Slide 12

Snapshot

Slide 13

Snapshot support

  • Snapshots are essential for file system backup
  • Widely used in enterprise file systems

– ZFS, Btrfs, WAFL

  • Snapshots are not available in existing DAX file systems
Slide 14

Snapshot for normal file I/O

[Diagram: snapshots for normal file I/O. The file log records write(0, 4K) entries; each take_snapshot() starts a new epoch. Data pages written before a snapshot are preserved as snapshot data, pages overwritten within the current epoch are reclaimed, and recover_snapshot(1) restores the file contents as of snapshot 1.]
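The epoch mechanism above can be modeled in a few lines. This toy Python model (names and structure are illustrative, not NOVA's actual code) keeps one log per file, tags each write entry with the epoch it occurs in, and reclaims an old page only when no snapshot references it:

```python
class SnapshotFile:
    """Toy model of NOVA-Fortis epoch-based snapshots for one file."""
    def __init__(self):
        self.epoch = 0     # current epoch; bumped by take_snapshot()
        self.log = []      # (epoch, data) write entries, oldest first

    def take_snapshot(self):
        snap_id = self.epoch
        self.epoch += 1
        return snap_id

    def write(self, data):
        # Reclaim the previous version only if it was written in the
        # current epoch, i.e. no snapshot depends on it.
        if self.log and self.log[-1][0] == self.epoch:
            self.log.pop()
        self.log.append((self.epoch, data))

    def read(self):
        return self.log[-1][1]

    def read_snapshot(self, snap_id):
        # Newest version written in an epoch the snapshot covers.
        for epoch, data in reversed(self.log):
            if epoch <= snap_id:
                return data
        return None

f = SnapshotFile()
f.write(b"v0")
s0 = f.take_snapshot()   # snapshot 0 captures v0
f.write(b"v1")
f.write(b"v1b")          # v1 reclaimed: same epoch, no snapshot needs it
s1 = f.take_snapshot()   # snapshot 1 captures v1b
f.write(b"v2")
```

The key invariant is visible in write(): a page is reclaimed only when its epoch matches the current one, which is exactly why data belonging to an earlier snapshot survives later overwrites.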
Slide 15

Memory Ordering With DAX-mmap()

D = 42; Fence(); V = True;

  • Recovery invariant: if V == True, then D is valid

D    V      Valid?
?    False  ✓
42   False  ✓
42   True   ✓
?    True   ✗

Slide 16

Memory Ordering With DAX-mmap()

D = 42; Fence(); V = True;

  • Recovery invariant: if V == True, then D is valid
  • D and V live in two pages of a mmap()’d region.

[Diagram: D lives on page 1 and V on page 3 of a DAX-mmap()’d region of NVMM]

Slide 17
DAX Snapshot: Idea

  • Set pages read-only, then copy-on-write

[Diagram: file data mapped via DAX-mmap() is marked read-only (RO); applications keep running with no file system intervention until they write to a protected page]
Slide 18
DAX Snapshot: Incorrect implementation

  • Application invariant: if V is True, then D is valid
  • Naïve approach: protect and copy pages one at a time

[Trace: snapshot_begin(); set_read_only(page_d); copy_on_write(page_d); set_read_only(page_v); snapshot_end(). The snapshot copies page_d while D is still uninitialized; the application then executes D = 42; V = True before page_v is protected, so the snapshot records D = ?, V = True, violating the invariant.]
Slide 19
DAX Snapshot: Correct implementation

  • Delay CoW page fault completion until all pages are read-only

[Trace: snapshot_begin(); set_read_only(page_d); set_read_only(page_v); snapshot_end(); only then do the pending copy_on_write(page_d) and copy_on_write(page_v) complete. A write that faults while the snapshot is being taken blocks until every page is protected, so the snapshot records a consistent pair, D = ?, V = False (or D = 42, V = True), and never D = ? with V = True.]
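The two orderings can be checked with a small deterministic simulation. This Python sketch (hypothetical names; a toy stand-in for the kernel's page-fault machinery) runs the same application writes against both snapshot schemes and captures what each snapshot records:

```python
def run_snapshot(delay_faults):
    """Simulate one interleaving of snapshot-taking vs. application writes.

    delay_faults=False models the incorrect scheme (pages are protected
    and CoW'd one at a time, so application stores slip in between);
    delay_faults=True models NOVA-Fortis, which write-protects every
    page before any CoW fault is allowed to complete.
    """
    pages = {"D": None, "V": False}   # live NVMM contents (None = uninitialized)
    snap = {}                         # what the snapshot captures
    ro = set()                        # pages currently write-protected

    def protect(p):
        ro.add(p)

    def app_store(p, val):
        if p in ro:
            snap.setdefault(p, pages[p])  # CoW fault: copy old contents
            ro.discard(p)                 # page becomes writable again
        pages[p] = val

    if delay_faults:
        # Correct: all pages protected first; stores fault afterwards.
        protect("D"); protect("V")
        app_store("D", 42); app_store("V", True)
    else:
        # Incorrect: the app's stores land between the two protections.
        protect("D")
        app_store("D", 42); app_store("V", True)
        protect("V")
    # Pages never CoW-faulted are captured as-is at snapshot end.
    for p in pages:
        snap.setdefault(p, pages[p])
    return snap

bad = run_snapshot(delay_faults=False)
good = run_snapshot(delay_faults=True)
```

The naive run captures V = True with D still uninitialized (the corrupt case from the slide); the delayed-fault run captures the pre-write pair, which satisfies the recovery invariant.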
Slide 20

Performance impact of snapshots

  • Normal execution vs. taking snapshots every 10s

– Negligible performance loss through read()/write()
– Average performance loss of 3.7% through mmap()

[Chart: throughput with and without snapshots, normalized, for Filebench (read/write) and WHISPER (DAX-mmap())]
Slide 21

Protecting Metadata and Data

Slide 22

NVMM Failure Modes

  • Detectable errors

– Media errors detected by the NVMM controller
– Raise a Machine Check Exception (MCE)

  • Undetectable errors

– Media errors not detected by the NVMM controller
– Software scribbles of NVMM data

[Diagram: on a read that hits a media error, the NVMM controller detects an uncorrectable error and raises an exception; software receives an MCE]

Slide 23

NVMM Failure Modes

  • Detectable errors

– Media errors detected by the NVMM controller
– Raise a Machine Check Exception (MCE)

  • Undetectable errors

– Media errors not detected by the NVMM controller
– Software scribbles of NVMM data

[Diagram: on a read that hits an undetected media error, the NVMM controller sees no error and software consumes corrupted data]

Slide 24

NVMM Failure Modes

  • Detectable errors

– Media errors detected by the NVMM controller
– Raise a Machine Check Exception (MCE)

  • Undetectable errors

– Media errors not detected by the NVMM controller
– Software scribbles of NVMM data

[Diagram: buggy code scribbles NVMM on a write; the NVMM controller simply updates the ECC, so the scribble goes undetected]

Slide 25

NOVA-Fortis Metadata Protection

  • Detection

– CRC32 checksums in all structures
– Use memcpy_mcsafe() to catch MCEs

  • Correction

– Replicate all metadata: inodes, logs, superblock, etc.
– Tick-tock: persist the primary before updating the replica

[Diagram: replicated inode and log; every structure carries a checksum, and the primary (inode, log) is persisted before its replica (inode', log') is updated]
Slide 26

NOVA-Fortis Data Protection

  • Metadata

– CRC32 + replication for all structures

  • Data

– RAID-4 style parity
– Replicated checksums

[Diagram: a 4 KB block is divided into 8 stripes S0..S7, with parity P = S0 ⊕ … ⊕ S7 and per-stripe checksums Ci = CRC32C(Si); the checksums are replicated alongside the replicated inode and log]
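The stripe/parity layout can be exercised directly: corrupt one stripe, detect it through its checksum, and rebuild it by XORing the parity with the surviving stripes. An illustrative Python sketch (zlib.crc32 stands in for CRC32C; 8 stripes of 512 B model one 4 KB block):

```python
import zlib
from functools import reduce

STRIPES = 8  # one 4 KB block = 8 x 512 B stripes

def parity(stripes):
    """RAID-4 style parity: byte-wise XOR of all stripes."""
    return bytes(reduce(lambda a, b: [x ^ y for x, y in zip(a, b)], stripes))

def make_block(stripes):
    assert len(stripes) == STRIPES
    return {"stripes": list(stripes),
            "csums": [zlib.crc32(s) for s in stripes],  # stand-in for CRC32C
            "parity": parity(stripes)}

def read_stripe(block, i):
    """Verify a stripe's checksum; on mismatch, rebuild it from the
    parity and the other (good) stripes."""
    s = block["stripes"][i]
    if zlib.crc32(s) == block["csums"][i]:
        return s
    others = [block["stripes"][j] for j in range(STRIPES) if j != i]
    rebuilt = parity(others + [block["parity"]])
    if zlib.crc32(rebuilt) != block["csums"][i]:
        raise IOError("uncorrectable: more than one bad stripe")
    block["stripes"][i] = rebuilt       # write the repaired stripe back
    return rebuilt

# Corrupt one stripe (media error or scribble) and read it back.
data = [bytes([i]) * 512 for i in range(STRIPES)]
blk = make_block(data)
blk["stripes"][3] = b"\xff" * 512
```

As in RAID-4, one bad stripe per block is correctable; the checksums are what tell the reader *which* stripe is bad, since parity alone cannot localize the error.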
Slide 27

File data protection with DAX-mmap

  • Stores to mmap()’d pages are invisible to the file system
  • The file system therefore cannot protect mmap()’d data
  • NOVA-Fortis’ data protection contract:

NOVA-Fortis protects pages from media errors and scribbles iff they are not mmap()’d for writing.

Slide 28

File data protection with DAX-mmap

  • NOVA-Fortis logs mmap() operations

[Diagram: an mmap log entry is appended to the file log; data accessed through read/write remains protected, while pages mapped into user space and accessed via load/store are unprotected]
Slide 29

File data protection with DAX-mmap

  • On munmap() and during recovery, NOVA-Fortis restores protection

[Diagram: after munmap(), protection is restored for the formerly mapped pages]
Slide 30

File data protection with DAX-mmap

  • On munmap() and during recovery, NOVA-Fortis restores protection

[Diagram: after a system failure, recovery uses the logged mmap() entries to restore protection for pages that were mapped at the time of the crash]
Slide 31

Performance

Slide 32

Latency breakdown

[Chart: latency breakdown (µs) for Create, Append 4KB, Overwrite 4KB/512B, Read 4KB/16KB, split into VFS, inode allocation, journaling, memcpy_mcsafe, memcpy_nocache, log append, freeing old data, entry checksum calculation/verification, inode/log replication, and data checksum/parity updates]
Slide 33

Latency breakdown

[Chart: the same latency breakdown, with the metadata protection components (entry checksums, inode and log replication) highlighted]
Slide 34

Latency breakdown

[Chart: the same latency breakdown, with both the metadata protection and the data protection (data checksums, parity) components highlighted]
Slide 35

Application performance

[Chart: application throughput (Fileserver, Varmail, MongoDB, SQLite, TPCC, Average), normalized, for ext4-DAX, Btrfs, NOVA, NOVA-Fortis with metadata protection (w/ MP), and with metadata + data protection (w/ MP+DP)]

Slide 36

Conclusion

  • Fault tolerance is critical for file systems, but existing DAX file systems don’t provide it
  • We identify the new challenges that NVMM file system fault tolerance poses
  • NOVA-Fortis provides fault tolerance with high performance

– 1.5x faster on average than DAX-aware file systems without reliability features
– 3x faster on average than other reliable file systems

Slide 37

Give it a try: https://github.com/NVSL/linux-nova

Slide 38

Thanks!

Slide 39

Backup slides

Slide 40

Hybrid DRAM/NVMM system

  • Non-volatile main memory (NVMM)

– PCM, STT-RAM, ReRAM, 3D XPoint technology

  • File system for NVMM

[Diagram: host CPU with DRAM and NVMM; the NVMM file system manages the NVMM directly]
Slide 41

Disk-based file systems are inadequate for NVMM

  • Ext4, xfs, Btrfs, F2FS, NILFS2
  • Built for hard disks and SSDs

– Software overhead is high
– CPU may reorder writes to NVMM
– NVMM has different atomicity guarantees

  • Cannot exploit NVMM performance
  • Performance optimization compromises consistency on system failure [1]

[1] Pillai et al., All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications, OSDI ’14.

Atomicity              Ext4-wb  Ext4-order  Ext4-dataj  Btrfs  xfs
1-Sector overwrite     ✓        ✓           ✓           ✓      ✓
1-Sector append        ✗        ✓           ✓           ✓      ✓
1-Block overwrite      ✗        ✗           ✓           ✓      ✗
1-Block append         ✗        ✓           ✓           ✓      ✓
N-Block write/append   ✗        ✗           ✗           ✗      ✗
N-Block prefix/append  ✗        ✓           ✓           ✓      ✓
Slide 42

NVMM file systems are not strongly consistent

  • BPFS, PMFS, Ext4-DAX, SCMFS, Aerie
  • None of them provide strong metadata and data consistency
File system  Metadata atomicity  Data atomicity  Mmap atomicity [1]
BPFS         Yes                 Yes [2]         No
PMFS         Yes                 No              No
Ext4-DAX     Yes                 No              No
SCMFS        No                  No              No
Aerie        Yes                 No              No
NOVA         Yes                 Yes             Yes

[1] Each msync() commits updates atomically.
[2] In BPFS, write times are not updated atomically with respect to the write itself.
Slide 43

Why LFS?

  • Log-structuring provides cheaper atomicity than journaling and shadow paging
  • NVMM supports fast, highly concurrent random accesses

– Using multiple logs does not negatively impact performance
– The log does not need to be contiguous

  • Rethink and redesign log-structuring entirely
Slide 44

Atomicity

  • Log-structuring for single-log updates

– Write, msync, chmod, etc.
– Strictly commit the log entry to NVMM before updating the log tail

  • Lightweight journaling for updates across logs

– Unlink, rename, etc.
– Journal log tails instead of metadata

  • Copy-on-write for file data

– Log only contains metadata
– Log is short

[Diagram: file and directory logs with tail pointers; during a cross-log update the journal records the old file tail and directory tail]
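The first bullet, committing the log entry before moving the tail, is what makes a single-log update atomic: a crash between the two steps leaves a half-appended entry beyond the tail, and recovery simply ignores it. A toy Python model (illustrative names; a flag simulates a power failure):

```python
class InodeLog:
    """Toy model of NOVA's commit rule: a log entry is persisted
    *before* the tail pointer moves, so a crash mid-append leaves
    the log valid and the half-written entry invisible."""
    def __init__(self):
        self.entries = []   # persistent log area
        self.tail = 0       # persistent tail pointer (atomic 8-byte store)

    def append(self, entry, crash_before_tail_update=False):
        self.entries.append(entry)   # step 1: write + flush the entry
        if crash_before_tail_update:
            return                   # simulated power failure
        self.tail += 1               # step 2: atomically advance the tail

    def recover(self):
        # Only entries at or below the committed tail are visible.
        self.entries = self.entries[:self.tail]
        return list(self.entries)

log = InodeLog()
log.append("write A")
log.append("write B", crash_before_tail_update=True)  # crash here
visible = log.recover()
```

Because the tail update is a single aligned 8-byte store, real hardware makes it atomic with respect to power failure; that is the whole reason the two-step protocol needs no journal for single-log operations.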

Slide 45

Atomicity

  • Log-structuring for single-log updates

– Write, msync, chmod, etc.
– Strictly commit the log entry to NVMM before updating the log tail

  • Lightweight journaling for updates across logs

– Unlink, rename, etc.
– Journal log tails instead of metadata

  • Copy-on-write for file data

– Log only contains metadata
– Log is short

[Diagram: copy-on-write of file data pages; new data pages are written and committed via the log, and the superseded pages are freed]

Slide 46

Performance

  • Per-inode logging allows for high concurrency
  • Split data structures between DRAM and NVMM

– Persistent log is simple and efficient
– Volatile tree structure has no consistency overhead

[Diagram: file and directory logs in NVMM with tail pointers and data pages]

Slide 47

Performance

  • Per-inode logging allows for high concurrency
  • Split data structures between DRAM and NVMM

– Persistent log is simple and efficient
– Volatile tree structure has no consistency overhead

[Diagram: a volatile radix tree in DRAM maps file offsets 0..3 to entries in the persistent file log in NVMM]

Slide 48

NOVA layout

  • Put the allocator in DRAM
  • High scalability

– Per-CPU NVMM free list, journal and inode table
– Concurrent transactions and allocation/deallocation

[Diagram: each CPU has its own journal, inode table, and free list; the super block and recovery inode live in NVMM, and each inode has a log with head and tail pointers]
Slide 49

Fast garbage collection

  • Log is a linked list
  • Log only contains metadata
  • Fast GC deletes dead log pages from the linked list
  • No copying

[Diagram: a log page whose entries are all invalid is unlinked from the list between head and tail]
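Fast GC amounts to plain linked-list surgery: a page whose entries are all invalid is unlinked by a single pointer update, with nothing copied. An illustrative Python model (not NOVA's actual structures; the head page is left in place so head/tail pointers stay valid):

```python
class LogPage:
    def __init__(self, entries):
        self.entries = entries   # each entry is (valid?, payload)
        self.next = None

def fast_gc(head):
    """Unlink log pages whose entries are all invalid. No data is
    copied; each unlink is one next-pointer store (8-byte atomic
    on real hardware). The head page itself is never removed."""
    page = head
    while page.next:
        nxt = page.next
        if all(not valid for valid, _ in nxt.entries):
            page.next = nxt.next      # unlink the dead page
        else:
            page = nxt
    return head

def live_entries(head):
    out, page = [], head
    while page:
        out += [payload for valid, payload in page.entries if valid]
        page = page.next
    return out

# Build a 3-page log whose middle page is entirely dead.
a = LogPage([(True, "e1"), (False, "e2")])
b = LogPage([(False, "e3"), (False, "e4")])
c = LogPage([(True, "e5")])
a.next, b.next = b, c
fast_gc(a)
```

Thorough GC (next slide) handles the complementary case, where valid entries are scattered thinly across many pages and copying into a compact new log is worth it.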
Slide 50

Thorough garbage collection

  • Starts if valid log entries < 50% of the log length
  • Format a new log and atomically replace the old one
  • Only copy metadata

[Diagram: valid log entries are copied into a compact new log that atomically replaces the old one]
Slide 51

Recovery

  • Rebuild DRAM structures

– Allocator
– Lazy rebuild: postpone the inode radix tree rebuild

  • Accelerates recovery
  • Reduces DRAM consumption
  • Normal shutdown recovery:

– Store the allocator in the recovery inode
– No log scanning

  • Failure recovery:

– Log is short
– Parallel scan
– Failure recovery bandwidth: > 400 GB/s

[Diagram: per-CPU recovery threads rebuild the journals, inode tables, and free lists; the recovery inode in NVMM stores allocator state across normal shutdowns]
Slide 52

Snapshot for normal file I/O

[Diagram: the file log tags each write entry with an epoch ID ([0, 1), [1, 2), …); snapshot entries mark the epoch boundaries. Deleting snapshot 0 allows the data pages referenced only by snapshot 0 to be reclaimed.]
Slide 53

Corrupt Snapshots with DAX-mmap()

  • Recovery invariant: if V == True, then D is valid

– Incorrect: naïvely mark pages read-only one at a time

[Trace: page D is protected and copied while D is still uninitialized; the application then writes D = 5 and V = True before page V is protected, so the snapshot captures D = ?, V = True. Corrupt.]
Slide 54

Consistent Snapshots with DAX-mmap()

  • Recovery invariant: if V == True, then D is valid

– Correct: delay CoW page fault completion until all pages are read-only

[Trace: both pages are protected before any CoW fault completes; the write D = 5 waits in the page-fault handler, so the snapshot captures D = ?, V = False. Consistent.]
Slide 55

Snapshot-related latency

[Chart: latency (µs) of snapshot creation, snapshot deletion, and 4 KB CoW page faults, broken into snapshot manifest init/combine, radix tree, locking, superblock sync, marking pages read-only, mapping changes, and memcpy_nocache]
Slide 56

Defense Against Scribbles

  • Tolerating larger scribbles

– Allocate replicas far from one another
– NOVA metadata can tolerate scribbles of 100s of MB

  • Preventing scribbles

– Mark all NVMM as read-only
– Disable CPU write protection while accessing NVMM
– Exposes all kernel data to bugs in a very small section of NOVA code

Slide 57

NVMM Failure Modes: Media Failures

  • Media errors

– Detectable & correctable
– Detectable & uncorrectable
– Undetectable

  • Software scribbles

– Kernel bugs or NOVA’s own bugs
– Transparent to hardware

[Diagram: the NVMM controller detects and corrects a media error on a read; software consumes good data]
Slide 58

NVMM Failure Modes: Media Failures

  • Media errors

– Detectable & correctable
– Detectable & uncorrectable
– Undetectable

  • Software scribbles

– Kernel bugs or NOVA’s own bugs
– Transparent to hardware

[Diagram: the controller detects an uncorrectable media error on a read and raises an exception; software receives an MCE with a poison radius (PR), e.g. 512 bytes]
Slide 59

NVMM Failure Modes: Media Failures

  • Media errors

– Detectable & correctable
– Detectable & uncorrectable
– Undetectable

  • Software scribbles

– Kernel bugs or NOVA’s own bugs
– Transparent to hardware

[Diagram: the controller sees no error on a read of corrupted data; software consumes corrupted data]
Slide 60

NVMM Failure Modes: Scribbles

  • Media errors

– Detectable & correctable
– Detectable & uncorrectable
– Undetectable

  • Software “scribbles”

– Kernel bugs or NOVA bugs
– NVMM file systems are highly vulnerable

[Diagram: buggy code scribbles NVMM on a write; the controller updates the ECC, so the corruption is not detected]
Slide 61

NVMM Failure Modes: Scribbles

  • Media errors

– Detectable & correctable
– Detectable & uncorrectable
– Undetectable

  • Software “scribbles”

– Kernel bugs or NOVA bugs
– NVMM file systems are highly vulnerable

[Diagram: on a later read, the controller sees no error and software consumes the scribbled data]
Slide 62

File operation latency

[Chart: file operation latency (µs) for Create, Append 4KB, Overwrite 4KB/512B, Read 4KB across xfs-DAX, PMFS, ext4-DAX, ext4-dataj, Btrfs, NOVA, and NOVA-Fortis variants (w/ MP, w/ MP+WP, w/ MP+DP, w/ MP+DP+WP, relaxed mode)]
Slide 63

Random R/W bandwidth on NVDIMM-N

[Chart: random 4 KB read and write bandwidth (GB/s) on NVDIMM-N vs. thread count (1 to 16) for xfs-DAX, PMFS, ext4-DAX, ext4-dataj, Btrfs, NOVA, and NOVA-Fortis variants (w/ MP, w/ MP+DP, relaxed mode)]
Slide 64

Scribble size and metadata bytes at risk

[Chart: metadata pages at risk vs. scribble size (1 B to 256 MB), worst and average case, for no replication, simple replication, two-way replication, and dead-zone replication]
Slide 65

Storage overhead

File data: 82.4%
Primary inode: 0.1%
Primary log: 2.0%
Replica inode: 0.1%
Replica log: 2.0%
File checksums: 1.6%
File parity: 11.1%
Unused: 0.8%
Slide 66

Latency breakdown

[Chart: latency breakdown (µs) for Create, Append 4KB, Overwrite 4KB/512B, Read 4KB/16KB, now including a write protection component alongside the VFS, journaling, memcpy, checksum, replication, and parity components]
Slide 67

Latency breakdown

[Chart: the same latency breakdown, with the metadata protection components highlighted]
Slide 68

Latency breakdown

[Chart: the same latency breakdown, with the metadata and data protection components highlighted]
Slide 69

Latency breakdown

[Chart: the same latency breakdown, with the metadata protection, data protection, and scribble prevention components highlighted]
Slide 70

Application performance on NOVA-Fortis

[Chart: application throughput on NVDIMM-N (Fileserver, Varmail, Webproxy, Webserver, RocksDB, MongoDB, Exim, SQLite, TPCC, Average), normalized, for xfs-DAX, PMFS, ext4-DAX, ext4-dataj, Btrfs, NOVA, and NOVA-Fortis variants (w/ MP, w/ MP+DP, relaxed mode); absolute ops/s shown for selected bars: 495k, 610k, 553k, 692k, 27k, 73k, 30k, 126k, 45k]
Slide 71

Application performance on NOVA-Fortis

[Chart: the same application workloads on emulated PCM, with throughput normalized to NVDIMM-N, for the same set of file systems and NOVA-Fortis variants]