
NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System

NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System. Jian Andiry Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, Tamires Brito Da Silva, Andy Rudoff (Intel), Steven Swanson. Non-Volatile Systems Laboratory


  1. NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System
     Jian Andiry Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, Tamires Brito Da Silva, Andy Rudoff (Intel), Steven Swanson
     Non-Volatile Systems Laboratory, Department of Computer Science and Engineering, University of California, San Diego

  2. Non-volatile Memory and DAX
     • Non-volatile main memory (NVMM)
       – PCM, STT-RAM, ReRAM, 3D XPoint technology
       – Resides on the memory bus, load/store interface
     [Figure: Application, load/store to DRAM and NVMM, File system, HDD / SSD]

  3. Non-volatile Memory and DAX
     • Non-volatile main memory (NVMM)
       – PCM, STT-RAM, ReRAM, 3D XPoint technology
       – Resides on the memory bus, load/store interface
     • Direct Access (DAX)
       – DAX file I/O bypasses the page cache
       – DAX-mmap() maps NVMM pages directly into the application address space and bypasses the file system copy
       – “Killer app”
     [Figure: Application, mmap() and DAX-mmap(), DRAM, NVMM, HDD / SSD]

  4. Application expectations on NVMM File System
     • POSIX I/O
     • Atomicity
     • Fault Tolerance
     • Direct Access (DAX)
     • Speed

  5. ext4, xfs, BtrFS, F2FS: ✔ ✔ ❌ ❌ ❌
     [Figure: marks against POSIX I/O, Atomicity, Fault Tolerance, Direct Access, Speed]

  6. PMFS, ext4-DAX, xfs-DAX: ✔✔ ✔ ❌ ❌

  7. Strata (SOSP ’17): ✔ ✔ ✔ ❌ ❌

  8. NOVA (FAST ’16): ✔ ✔ ❌ ✔✔

  9. NOVA-Fortis: ✔ ✔✔ ✔ ✔

  10. Challenges

  11. NOVA: Log-structured FS for NVMM
      • Per-inode logging
        – High concurrency
        – Parallel recovery
      • High scalability
        – Per-core allocator, journal and inode table
      • Atomicity
        – Logging for single inode update
        – Journaling for update across logs
        – Copy-on-Write for file data
      [Figure: per-inode logging; each inode keeps Head and Tail pointers into its inode log, and log entries point to data pages]
      Jian Xu and Steven Swanson, NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories, FAST ’16.
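The "logging for single inode update" bullet can be illustrated with a toy model. This is a sketch only, not NOVA code: the class `InodeLog`, its methods, and the `crash_before_commit` flag are hypothetical names, and it assumes that an entry becomes visible through a single atomic update of the log tail, as the NOVA paper describes.

```python
class InodeLog:
    """Toy model of one inode's log: entries plus a tail pointer."""

    def __init__(self):
        self.slots = []  # log space in NVMM; may hold uncommitted entries
        self.tail = 0    # everything before tail is committed

    def append(self, entry, crash_before_commit=False):
        # Space past the tail holds nothing committed, so it can be reused.
        del self.slots[self.tail:]
        self.slots.append(entry)      # step 1: write the entry past the tail
        if crash_before_commit:
            return                    # simulated crash: the tail never moved
        self.tail = len(self.slots)   # step 2: one tail update commits the entry

    def committed(self):
        """What recovery would see: only the entries before the tail."""
        return self.slots[:self.tail]
```

An entry written before a crash but never committed simply disappears from the recovered view, which is the atomicity property the bullet refers to.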

  12. Snapshot

  13. Snapshot support
      • Snapshot is essential for file system backup
      • Widely used in enterprise file systems
        – ZFS, Btrfs, WAFL
      • Snapshot is not available with DAX file systems

  14. Snapshot for normal file I/O
      Operations: write(0, 4K); take_snapshot(); write(0, 4K); write(0, 4K); take_snapshot(); write(0, 4K); recover_snapshot(1);
      [Figure: the current snapshot ID advances 0, 1, 2; the file log holds write entries tagged with a snapshot ID (0 Page 0, 1 Page 0, 1 Page 0, 2 Page 0), each pointing to data; legend: file write entry, data in snapshot, reclaimed data, current data]
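The figure's snapshot-tagged log entries can be modeled in a few lines. This is a sketch under assumptions, not NOVA-Fortis code: the class `SnapshotLog` and its methods are hypothetical, and it assumes that `take_snapshot()` freezes the current ID before advancing it, and that data overwritten within the same snapshot ID can be reclaimed because no snapshot can see it.

```python
class SnapshotLog:
    """Toy file log whose write entries carry the snapshot ID current at write time."""

    def __init__(self):
        self.current = 0     # current snapshot ID
        self.entries = []    # append-only list of (snapshot_id, page, data)
        self.reclaimed = []  # data that no snapshot and no current view needs

    def write(self, page, data):
        # Find the newest entry for this page, if any.
        for snap_id, p, old in reversed(self.entries):
            if p == page:
                if snap_id == self.current:
                    # Overwritten inside the same snapshot ID: nobody can see it.
                    self.reclaimed.append(old)
                break
        self.entries.append((self.current, page, data))

    def take_snapshot(self):
        taken = self.current
        self.current += 1
        return taken

    def read(self, page, snapshot=None):
        """Read the current data, or the data as of a given snapshot ID."""
        limit = self.current if snapshot is None else snapshot
        for snap_id, p, data in reversed(self.entries):
            if p == page and snap_id <= limit:
                return data
        return None
```

Replaying the slide's operation sequence on this model yields entries tagged 0, 1, 1, 2 for page 0, with the first write under ID 1 being the only reclaimed data.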

  15. Memory Ordering With DAX-mmap()
      Code: D = 42; Fence(); V = True;
      D     V      Valid
      ?     False  ✓
      42    False  ✓
      42    True   ✓
      ?     True   ✗
      • Recovery invariant: if V == True, then D is valid

  16. Memory Ordering With DAX-mmap()
      Code: D = 42; Fence(); V = True;
      • Recovery invariant: if V == True, then D is valid
      • D and V live in two pages of a mmap()’d region.
      [Figure: the application reaches D and V through DAX-mmap(); NVMM Page 1 and Page 3]
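The recovery invariant can be checked by enumerating what NVMM might hold after a crash. The following is a small model, not a description of real hardware behavior: the function names are hypothetical, and it assumes that with the fence the two stores persist in program order, while without it either store may reach NVMM first.

```python
from itertools import combinations

UNSET = "?"  # D has not been written yet


def invariant_holds(d, v):
    """Recovery invariant from the slide: if V is True, then D is valid."""
    return (not v) or d != UNSET


def crash_states(fenced):
    """(D, V) pairs NVMM may hold if 'D = 42; V = True;' crashes at any point."""
    stores = [("D", 42), ("V", True)]
    if fenced:
        # The fence orders the stores: only program-order prefixes can persist.
        persisted_sets = [stores[:n] for n in range(len(stores) + 1)]
    else:
        # No ordering: any subset of the stores may have persisted.
        persisted_sets = [list(c) for n in range(len(stores) + 1)
                          for c in combinations(stores, n)]
    states = []
    for persisted in persisted_sets:
        state = {"D": UNSET, "V": False}
        state.update(persisted)
        states.append((state["D"], state["V"]))
    return states
```

With the fence, the model produces only the three valid rows of the table; without it, the fourth row (D unset, V True) becomes reachable.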

  17. DAX Snapshot: Idea
      • Set pages read-only, then copy-on-write
      [Figure: applications access DAX-mmap()’d file data with no file system intervention; the file system marks the file data read-only (RO)]

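  18. DAX Snapshot: Incorrect implementation
      • Application invariant: if V is True, then D is valid
      Timeline (application thread and NOVA snapshot):
        Application: D = ?; V = False;        application values: D = ?, V = F
        NOVA: snapshot_begin();
        NOVA: set_read_only(page_d);          snapshot values: D = ?
        Application: D = 42; (page fault)
        NOVA: copy_on_write(page_d);          application values: D = 42, V = F
        Application: V = True;                application values: D = 42, V = T
        NOVA: set_read_only(page_v);          snapshot values: D = ?, V = T
        NOVA: snapshot_end();                 snapshot values: D = ?, V = T
      The snapshot ends with V = True but D unset, which violates the invariant.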

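  19. DAX Snapshot: Correct implementation
      • Delay completion of CoW page faults until all pages are read-only
      Timeline (application thread and NOVA snapshot):
        Application: D = ?; V = False;        application values: D = ?, V = F
        NOVA: snapshot_begin();
        NOVA: set_read_only(page_d);          snapshot values: D = ?
        Application: D = 42; (page fault, held)
        NOVA: set_read_only(page_v);          snapshot values: D = ?, V = F
        NOVA: snapshot_end();                 snapshot values: D = ?, V = F
        NOVA: copy_on_write(page_d);          application values: D = 42, V = F
        Application: V = True;
        NOVA: copy_on_write(page_v);          application values: D = 42, V = T
      The snapshot ends with V = False, which satisfies the invariant.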
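The difference between the two interleavings can be reproduced with a single-threaded simulation. This is an illustrative sketch, not the NOVA-Fortis implementation: `run_snapshot` and its flag are hypothetical, the application is given a chance to run after each `set_read_only` step, and a page's snapshot value is assumed to be whatever the page held when it was marked read-only.

```python
UNSET = "?"  # D has not been written yet


def invariant_holds(state):
    """Application invariant: if V is True, then D is valid."""
    return (not state["page_v"]) or state["page_d"] != UNSET


def run_snapshot(delay_cow_faults):
    live = {"page_d": UNSET, "page_v": False}    # what the application sees
    stores = [("page_d", 42), ("page_v", True)]  # D = 42; V = True;
    snapshot, read_only = {}, set()
    in_progress = True

    def application_runs():
        # Execute pending stores until one has to wait on a held page fault.
        while stores:
            page, value = stores[0]
            if page in read_only and delay_cow_faults and in_progress:
                return  # the CoW fault is held until the snapshot ends
            live[page] = value  # direct store, or a CoW that completes at once
            stores.pop(0)

    for page in ("page_d", "page_v"):
        snapshot[page] = live[page]  # set_read_only freezes the page contents
        read_only.add(page)
        application_runs()           # the application races with the snapshot
    in_progress = False              # snapshot_end(): release held faults
    application_runs()
    return snapshot, live
```

Calling `run_snapshot(False)` yields the inconsistent snapshot from the incorrect slide, and `run_snapshot(True)` yields a snapshot that satisfies the invariant while the application still ends with D = 42 and V = True.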

  20. Performance impact of snapshots
      • Normal execution vs. taking snapshots every 10s
        – Negligible performance loss through read()/write()
        – Average performance loss 3.7% through mmap()
      [Figure: normalized performance (0 to 1.2) without and with snapshots, for Filebench (read/write) and WHISPER (DAX-mmap())]

  21. Protecting Metadata and Data

  22. NVMM Failure Modes
      • Detectable errors
        – Media errors detected by NVMM controller
        – Raises Machine Check Exception (MCE)
      • Undetectable errors
        – Media errors not detected by NVMM controller
        – Software scribbles
      [Figure, read path: NVMM data has a media error; the NVMM controller detects the uncorrectable error and raises an exception; software receives the MCE]

  23. NVMM Failure Modes
      • Detectable errors
        – Media errors detected by NVMM controller
        – Raises Machine Check Exception (MCE)
      • Undetectable errors
        – Media errors not detected by NVMM controller
        – Software scribbles
      [Figure, read path: NVMM data has a media error; the NVMM controller sees no error; software consumes corrupted data]

  24. NVMM Failure Modes
      • Detectable errors
        – Media errors detected by NVMM controller
        – Raises Machine Check Exception (MCE)
      • Undetectable errors
        – Media errors not detected by NVMM controller
        – Software scribbles
      [Figure, write path: buggy code scribbles on NVMM; the NVMM controller updates ECC; NVMM data holds a scribble error]

  25. NOVA-Fortis Metadata Protection
      • Detection
        – CRC32 checksums in all structures
        – Use memcpy_mcsafe() to catch MCEs
      • Correction
        – Replicate all metadata: inodes, logs, superblock, etc.
        – Tick-tock: persist primary before updating replica
      [Figure: inode (Head, Tail, csum) and its replica inode’; log (ent1 c1 … entN cN) and its replica log’; Data 1, Data 2]
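The detection and correction bullets combine into a simple pattern: checksum each copy, update the primary before the replica, and repair whichever copy fails its checksum. The sketch below only illustrates that pattern. `seal`, `check`, and `ReplicatedMetadata` are hypothetical names, plain Python attributes stand in for persistent memory, and `zlib.crc32` stands in for the file system's checksum routine.

```python
import zlib


def seal(payload: bytes) -> bytes:
    """Append a CRC32 checksum to a metadata payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "little")


def check(record: bytes):
    """Return the payload if its checksum matches, else None."""
    payload, stored = record[:-4], record[-4:]
    if zlib.crc32(payload).to_bytes(4, "little") == stored:
        return payload
    return None


class ReplicatedMetadata:
    def __init__(self, payload: bytes):
        self.primary = seal(payload)
        self.replica = seal(payload)

    def update(self, payload: bytes):
        self.primary = seal(payload)  # tick: persist the primary first
        self.replica = seal(payload)  # tock: then bring the replica up to date

    def read(self) -> bytes:
        good = check(self.primary)
        if good is not None:
            if check(self.replica) != good:
                self.replica = self.primary  # repair a stale or corrupt replica
            return good
        good = check(self.replica)
        if good is None:
            raise IOError("both metadata copies are corrupt")
        self.primary = self.replica          # repair the primary from the replica
        return good
```

Because the primary is always written first, a valid primary is never older than the replica, so the read path can prefer it whenever its checksum matches.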

  26. NOVA-Fortis Data Protection
      • Metadata
        – CRC32 + replication for all structures
      • Data
        – RAID-4 style parity
        – Replicated checksums
      1 block (8 stripes): $S_0, S_1, \dots, S_7$ plus parity $P$
      $P = \bigoplus_{i=0}^{7} S_i$
      $C_i = \mathrm{CRC32C}(S_i)$ (replicated)
      [Figure: inode and log with their replicas inode’ and log’; Data 1, Data 2]
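The parity and checksum formulas on this slide can be exercised with a short sketch. It is illustrative only: `protect` and `repair` are hypothetical names, the block is split into 8 equal stripes as on the slide, and `zlib.crc32` (plain CRC-32) is used as a stand-in because CRC32C is not in the Python standard library.

```python
import zlib
from functools import reduce

STRIPES = 8  # one block is split into 8 stripes, as on the slide


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def protect(block: bytes):
    """Split a block into stripes and compute parity P and checksums C_i."""
    size = len(block) // STRIPES
    stripes = [block[i * size:(i + 1) * size] for i in range(STRIPES)]
    parity = reduce(xor, stripes)                 # P = S_0 xor ... xor S_7
    checksums = [zlib.crc32(s) for s in stripes]  # stand-in for CRC32C(S_i)
    return stripes, parity, checksums


def repair(stripes, parity, checksums) -> bytes:
    """Find a stripe whose checksum fails and rebuild it from parity."""
    bad = [i for i, s in enumerate(stripes) if zlib.crc32(s) != checksums[i]]
    if len(bad) > 1:
        raise IOError("more than one corrupt stripe: parity cannot recover")
    if bad:
        i = bad[0]
        others = [s for j, s in enumerate(stripes) if j != i]
        # XOR of the parity with every other stripe gives the missing stripe.
        stripes[i] = reduce(xor, others, parity)
    return b"".join(stripes)
```

The checksums locate the damaged stripe and the parity rebuilds it, which is why a single parity stripe per block is enough for one corrupt stripe but not for two.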

  27. File data protection with DAX-mmap
      • Stores are invisible to the file systems
      • The file systems cannot protect mmap’ed data
      • NOVA-Fortis’ data protection contract: NOVA-Fortis protects pages from media errors and scribbles iff they are not mmap()’d for writing.

  28. File data protection with DAX-mmap
      • NOVA-Fortis logs mmap() operations
      [Figure: user-space applications issue load/store to mmap()’d file data, which is unprotected; read/write goes through kernel-space NOVA-Fortis to protected file data; the file log on the NVDIMMs holds an mmap log entry]

  29. File data protection with DAX-mmap
      • On munmap and during recovery, NOVA-Fortis restores protection
      [Figure: the application calls munmap(); NOVA-Fortis restores protection on the file data in the NVDIMMs]

  30. File data protection with DAX-mmap
      • On munmap and during recovery, NOVA-Fortis restores protection
      [Figure: after a system failure and recovery, NOVA-Fortis restores protection on the file data using the file log]
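The contract from these slides (pages are protected iff they are not mmap()'d for writing) can be modeled as below. This is a sketch under stated assumptions, not NOVA-Fortis code: `ProtectedFile` and all of its methods are hypothetical, the munmap log entry is an invention of this sketch to let recovery find mappings that were still open, and `zlib.crc32` stands in for the real checksums and parity.

```python
import zlib


class ProtectedFile:
    def __init__(self, pages):
        self.pages = list(pages)                       # file data, one bytes per page
        self.csums = [zlib.crc32(p) for p in self.pages]
        self.log = []                                  # stand-in for the file log
        self.writable_mapped = set()                   # volatile state, lost on a crash

    def mmap_for_write(self, page_no):
        self.log.append(("mmap", page_no))             # log the mapping first
        self.writable_mapped.add(page_no)              # page is now unprotected

    def store(self, page_no, data):
        """An application store through DAX-mmap: the file system never sees it."""
        assert page_no in self.writable_mapped
        self.pages[page_no] = data

    def _restore_protection(self, page_no):
        self.csums[page_no] = zlib.crc32(self.pages[page_no])
        self.log.append(("munmap", page_no))           # assumption of this sketch
        self.writable_mapped.discard(page_no)

    def munmap(self, page_no):
        self._restore_protection(page_no)

    def recover(self):
        """After a crash: replay the log and re-protect pages left mapped."""
        self.writable_mapped.clear()
        still_mapped = set()
        for op, page_no in self.log:
            if op == "mmap":
                still_mapped.add(page_no)
            else:
                still_mapped.discard(page_no)
        for page_no in sorted(still_mapped):
            self._restore_protection(page_no)

    def verify(self, page_no):
        return zlib.crc32(self.pages[page_no]) == self.csums[page_no]
```

While a page is mapped for writing its stored checksum goes stale, so the sketch recomputes it at munmap time or, if the system failed first, during recovery.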

  31. Performance

  32. Latency breakdown
      [Figure: stacked latency bars, 0 to 6 microseconds, for Create, Append 4KB, Overwrite 4KB, Overwrite 512B, Read 4KB, Read 16KB; components: VFS, alloc inode, journaling, memcpy_mcsafe, memcpy_nocache, append entry, free old data, calculate entry csum, verify entry csum, replicate inode, replicate log, verify data csum, update data csum, update data parity]

  33. Latency breakdown
      [Figure: the latency breakdown chart (Create, Append 4KB, Overwrite 4KB, Overwrite 512B, Read 4KB, Read 16KB; 0 to 6 microseconds) with the Metadata Protection components marked]

  34. Latency breakdown
      [Figure: the latency breakdown chart (Create, Append 4KB, Overwrite 4KB, Overwrite 512B, Read 4KB, Read 16KB; 0 to 6 microseconds) with the Metadata Protection and Data Protection components marked]

  35. Application performance
      [Figure: normalized throughput (0 to 1.2) on Fileserver, Varmail, MongoDB, SQLite, TPCC, and Average, for ext4-DAX, Btrfs, NOVA, NOVA w/ MP, and NOVA w/ MP+DP]

  36. Conclusion
      • Fault tolerance is critical for file systems, but existing DAX file systems don’t provide it
      • We identify new challenges that NVMM file system fault tolerance poses
      • NOVA-Fortis provides fault tolerance with high performance
        – Outperforms DAX-aware file systems without reliability features by 1.5x on average
        – Outperforms other reliable file systems by 3x on average

  37. Give it a try: https://github.com/NVSL/linux-nova

  38. Thanks!

  39. Backup slides

  40. Hybrid DRAM/NVMM system
      • Non-volatile main memory (NVMM)
        – PCM, STT-RAM, ReRAM, 3D XPoint technology
      • File system for NVMM
      [Figure: host CPU running the NVMM FS, attached to DRAM and NVMM]
