SLIDE 1

HMVFS: A Hybrid Memory Versioning File System

Shengan Zheng, Linpeng Huang, Hao Liu, Linzhu Wu, Jin Zha

Department of Computer Science and Engineering, Shanghai Jiao Tong University

SLIDE 2

Outline

  • Introduction
  • Design
  • Implementation
  • Evaluation
  • Conclusion
SLIDE 3

Introduction

  • Emerging Non-Volatile Memory (NVM)
    • Persistent like disk
    • Byte-addressable like DRAM
  • Current file systems for NVM
    • PMFS, SCMFS, BPFS
    • Non-versioning: unable to recover old data
  • Hardware and software errors, large datasets, and long execution times
    • A fault tolerance mechanism is needed
  • Current versioning file systems
    • BTRFS, NILFS2
    • Not optimized for NVM
SLIDE 4

Design Goals

  • Strong consistency
    • A Stratified File System Tree (SFST) represents a snapshot of the whole file system
    • Atomic snapshotting is ensured
  • Fast recovery
    • Almost no redo or undo overhead during recovery
  • High performance
    • Utilize the byte-addressability of NVM to update tree metadata at byte granularity
    • Log-structured updates to files balance wear across the NVM
    • Avoid write amplification
  • User friendly
    • Snapshots are created automatically and transparently
SLIDE 5

Overview

  • HMVFS is an NVM-friendly log-structured versioning file system
  • Space-efficient file system snapshotting
  • HMVFS decouples tree metadata from tree data
  • High performance with consistency guarantees
  • POSIX compliant
SLIDE 6

Outline

  • Introduction
  • Design
  • Implementation
  • Evaluation
  • Conclusion
SLIDE 7

On-Memory Layout

  • NVM:
    • Sequential write zone
      • File metadata and data
      • Tree data
    • Random write zone
      • File system metadata
      • Tree metadata
  • DRAM: caches and journal

[Figure: on-memory layout. NVM holds two superblocks, an auxiliary information zone (Block Information Table, Segment Information Table) updated with random writes, and a main area holding the SFST (checkpoint blocks, node blocks, data blocks) updated with sequential writes. DRAM holds the NAT cache, the Segment Information Table journal, and the Checkpoint Information Tree (CIT).]

SLIDE 8

Index Structure in traditional log-structured file systems

  • Update propagation problem: rewriting one data block forces rewrites of every index block that points to it (the "wandering tree" problem)

[Figure: traditional LFS index structure. An inode block holds inline data or direct pointers plus single-, double-, and triple-indirect pointers to direct and indirect node blocks, which in turn point to data blocks. Updating one data block marks its direct node, every indirect node above it, and the inode itself as updated blocks.]

SLIDE 9

Index Structure without write amplification

  • Node Address Table

[Figure: index structure with a Node Address Table. The inode and node blocks reference children by node ID, and the NAT maps each node ID to its current address (e.g., n-1 → 0x38, n → 0x42, n+1 → 0x24). Updating a data block now only updates its direct node and that node's NAT entry.]

SLIDE 10

Index Structure for versioning

  • The Node Address Table gains a version dimension: each snapshot keeps its own node-ID-to-address mapping

[Figure: Node Address Table with a version dimension. Each version (Version 1, Version 2, Version 3) has its own address column for the same node IDs; updating a node in a new version changes only that version's column entry, while unchanged entries keep the older addresses.]
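To make the versioned lookup concrete, here is a minimal sketch. The flat per-version arrays and the fall-back-to-older-version rule are illustrative assumptions, not the paper's on-NVM format (the real NAT is stored as tree blocks, see the next slide):

    #include <stdint.h>
    #include <stdio.h>

    #define MAX_NODES 16

    /* Toy model: one NAT column per version, mapping node ID -> block
     * address; 0 means "not rewritten in this version". */
    struct nat_version {
        uint64_t addr[MAX_NODES];
    };

    /* Resolve a node ID in a given version: entries not rewritten in a
     * newer version are inherited (shared) from older versions. */
    uint64_t nat_lookup(const struct nat_version *vers, int version, int nid)
    {
        for (int v = version; v >= 0; v--)
            if (vers[v].addr[nid] != 0)
                return vers[v].addr[nid];
        return 0; /* node does not exist in this snapshot */
    }

    int main(void)
    {
        struct nat_version vers[3] = {{{0}}};

        vers[0].addr[5] = 0x38;  /* version 1: node 5 at 0x38 */
        vers[1].addr[5] = 0x42;  /* version 2: node 5 rewritten at 0x42 */
                                 /* version 3: node 5 unchanged, shared */

        printf("v1: %#llx\n", (unsigned long long)nat_lookup(vers, 0, 5));
        printf("v3: %#llx\n", (unsigned long long)nat_lookup(vers, 2, 5));
        return 0;
    }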

SLIDE 11

How to store different trees space-efficiently

  • Node Address Tree (NAT)
    • A four-level B-tree that stores the multi-version Node Address Table space-efficiently
    • Adopts the idea of the CoW friendly B-tree
    • NAT leaves contain NodeID-address pairs
    • The other tree blocks in the NAT contain pointers to lower-level blocks

[Figure: the Node Address Tree as a CoW friendly B-tree (NAT root, NAT internal blocks, NAT leaves pointing to inodes and node blocks). A new snapshot copies only the blocks on the path to the changed nodes (Q, D', F'), sharing everything else; per-block reference counts (e.g., P,1 A,1 B,1 C,2 D,1 E,2 F,1) track sharing between the original and new trees.]
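A sketch of the CoW friendly B-tree update. The names (bnode, cow_clone, cow_update_path) and the in-memory toy model are assumptions for illustration, not the on-NVM block format:

    #include <stdlib.h>

    #define FANOUT 4

    /* Toy B-tree node: children are shared between snapshots and
     * tracked by a per-node reference count. */
    struct bnode {
        int refcount;
        struct bnode *child[FANOUT];
    };

    /* Clone one node for the new snapshot; shared children just gain
     * a reference instead of being deep-copied. */
    struct bnode *cow_clone(const struct bnode *old)
    {
        struct bnode *n = calloc(1, sizeof(*n));
        n->refcount = 1;
        for (int i = 0; i < FANOUT; i++) {
            n->child[i] = old->child[i];
            if (n->child[i])
                n->child[i]->refcount++;
        }
        return n;
    }

    /* Path copy: clone only the nodes on the root-to-leaf path given
     * by `slots`; every off-path subtree stays shared. */
    struct bnode *cow_update_path(struct bnode *root, const int *slots, int depth)
    {
        struct bnode *new_root = cow_clone(root);
        struct bnode *cur = new_root;
        for (int d = 0; d < depth; d++) {
            struct bnode *shared = cur->child[slots[d]];
            struct bnode *priv = cow_clone(shared);
            shared->refcount--;        /* new parent stops sharing it */
            cur->child[slots[d]] = priv;
            cur = priv;
        }
        return new_root;               /* root of the new snapshot */
    }

    int main(void)
    {
        struct bnode *leaf = calloc(1, sizeof(*leaf));
        struct bnode *root = calloc(1, sizeof(*root));
        leaf->refcount = root->refcount = 1;
        root->child[0] = leaf;

        int path[] = { 0 };
        struct bnode *snap2 = cow_update_path(root, path, 1);
        (void)snap2;  /* leaf's clone is private; off-path blocks stay shared */
        return 0;
    }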

SLIDE 12

Stratified File System Tree (SFST)

  • Four different categories of blocks:
    • Checkpoint layer
    • Node Address Tree (NAT) layer
    • Node layer
    • Data layer
  • All blocks of an SFST are stored in the main area with log-structured writes
    • Balances wear across the NVM media
  • Each SFST represents a valid snapshot of the file system
    • Snapshots share overlapped blocks to achieve space efficiency

[Figure: two SFSTs (original snapshot and new snapshot), each rooted at its own checkpoint (CP) block, sharing overlapped blocks across the checkpoint, NAT, node, and data layers.]

SLIDE 13

Stratified File System Tree (SFST)

  • The metadata of the SFST
    • Stored in the auxiliary information zone
    • Updated with random writes
  • Segment Information Table (SIT)
    • Contains the status information of every segment
  • Block Information Table (BIT)
    • Keeps the information of every block
    • Updated precisely at byte granularity
    • Contains:
      • Start and end version numbers
      • Block type
      • Node ID
      • Reference count

[Figure: the same two-snapshot SFST diagram as Slide 12.]
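A sketch of one BIT entry following the fields listed above; the names and field widths are assumptions, not the authors' exact on-NVM layout:

    #include <stdint.h>

    enum blk_type { BT_CHECKPOINT, BT_NAT_INTERNAL, BT_NAT_LEAF,
                    BT_INODE, BT_DIRECT, BT_INDIRECT, BT_DATA };

    /* One Block Information Table entry per block in the main area. */
    struct bit_entry {
        uint32_t start_version;  /* first version the block is valid in */
        uint32_t end_version;    /* last version the block is valid in */
        uint8_t  block_type;     /* enum blk_type */
        uint32_t node_id;        /* locates the parent block (Slide 16) */
        uint32_t refcount;       /* parents currently linking to it */
    };

    /* Byte-addressable NVM allows updating a single field in place,
     * e.g. bumping the reference count without rewriting the entry. */
    void bit_ref(struct bit_entry *e) { e->refcount++; }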

SLIDE 14

Garbage Collection in HMVFS

  • Move all the valid blocks in the victim segment to the current segment
  • When finished, update the SIT and create a snapshot
  • Must handle the block sharing problem (see the figure and sketch below)

[Figure: block sharing across versions. A NAT block and Node Block 2 in Segment A are referenced by Versions 1 through 4 (Node Block 1 only by Version 1); moving them to Segment B requires remapping every version that shares them.]
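A sketch of segment cleaning under block sharing. The names (blk, append_current, gc_segment) and the toy arrays are illustrative assumptions; the real remapping goes through the NAT and BIT:

    #include <stdio.h>

    #define SEG_BLOCKS 4

    /* Toy model: refcount > 0 marks a block still referenced by at
     * least one snapshot. */
    struct blk { int refcount; int data; };

    static struct blk *current_seg[SEG_BLOCKS];
    static int current_top;

    struct blk *append_current(struct blk *b)
    {
        current_seg[current_top] = b;  /* log-structured: append only */
        return current_seg[current_top++];
    }

    /* Move every valid block out of the victim segment; afterwards the
     * system updates the SIT and creates a snapshot so the cleaning
     * commits atomically. */
    void gc_segment(struct blk *victim[SEG_BLOCKS])
    {
        for (int i = 0; i < SEG_BLOCKS; i++) {
            if (!victim[i] || victim[i]->refcount == 0)
                continue;              /* dead block: nothing to move */
            struct blk *copy = append_current(victim[i]);
            /* Block sharing: every snapshot that references victim[i]
             * must be remapped to `copy` via its NAT/BIT entries. */
            victim[i] = NULL;
            (void)copy;
        }
    }

    int main(void)
    {
        struct blk shared = { .refcount = 4 };  /* live in versions 1-4 */
        struct blk dead   = { .refcount = 0 };
        struct blk *victim[SEG_BLOCKS] = { &shared, &dead, NULL, NULL };

        gc_segment(victim);
        printf("moved %d block(s)\n", current_top);  /* prints 1 */
        return 0;
    }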

SLIDE 15

Outline

  • Introduction
  • Design
  • Implementation
  • Evaluation
  • Conclusion
SLIDE 16

Block Information Table (BIT)

  • Block sharing problem
    • When a new child block is written to the main area, the corresponding pointer in the parent block must be updated
  • Node ID and block type
    • Together used to locate the parent node:

    Type of the block   Type of the parent   Node ID
    Checkpoint          N/A                  N/A
    NAT internal        NAT internal         Index code in NAT
    NAT leaf            NAT internal         Index code in NAT
    Inode               NAT leaf             Node ID
    Indirect            NAT leaf             Node ID
    Direct              NAT leaf             Node ID
    Data                Inode or direct      Node ID of parent node

SLIDE 17

Block Information Table (BIT)

  • Start and end version numbers
    • The first and last versions in which the block is valid
    • Operations like write and delete set these two fields to the current version number
  • Reference count
    • The number of parent nodes that link to the block
    • Updated with lazy reference counting: only file-level and snapshot-level operations update the reference count
    • If the count reaches zero, the block becomes garbage (see the sketch below)
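A sketch of the zero-means-garbage rule, with toy names (shared_blk, ref_get, ref_put) rather than HMVFS's real API:

    #include <stdio.h>
    #include <stdint.h>

    struct shared_blk { uint32_t refcount; };

    /* Only file-level and snapshot-level operations call these; blocks
     * inside an untouched shared subtree keep their counts unchanged
     * (the "lazy" part of lazy reference counting). */
    void ref_get(struct shared_blk *b) { b->refcount++; }

    int ref_put(struct shared_blk *b)
    {
        return --b->refcount == 0;   /* 1: block has become garbage */
    }

    int main(void)
    {
        struct shared_blk blk = { 1 };          /* linked by snapshot 1 */
        ref_get(&blk);                          /* snapshot 2 shares it */
        printf("garbage after v1 delete: %d\n", ref_put(&blk)); /* 0 */
        printf("garbage after v2 delete: %d\n", ref_put(&blk)); /* 1 */
        return 0;
    }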
SLIDE 18

Snapshot Creation

  • Strong consistency is guaranteed
    • Flush dirty NAT entries from DRAM to form a new Node Address Tree
    • Follows a bottom-up procedure
    • Status information is stored in the checkpoint block
    • Snapshots are space-efficient
  • The atomicity of snapshot creation is ensured
    • An atomic update to a pointer in the superblock announces the validity of the new snapshot
    • A crash during snapshot creation is recovered by undo or redo, depending on that validity

[Figure: snapshot creation. Dirty NAT entries are flushed bottom-up into new NAT blocks above the node and data layers, a new CP block is written for the new snapshot, and finally the superblock pointer is switched to the new checkpoint.]
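A sketch of the commit step, assuming x86-style persistence primitives (clflush plus fences); the actual implementation may use different instructions or kernel helpers, and all new blocks below the superblock pointer are assumed already written and flushed:

    #include <stdint.h>

    /* Flush one cache line so the store reaches NVM (x86; a real
     * system might prefer clflushopt/clwb plus sfence). */
    static inline void persist(const void *addr)
    {
        __builtin_ia32_clflush(addr);
    }

    /* Bottom-up flushing has already persisted the new data, node,
     * NAT, and checkpoint blocks. The snapshot becomes valid only when
     * this single aligned 8-byte pointer lands in NVM; the hardware
     * writes it atomically, so a crash leaves either the old or the
     * new snapshot pointer, never a mix. */
    void commit_snapshot(volatile uint64_t *sb_cp_ptr, uint64_t new_cp_addr)
    {
        __sync_synchronize();         /* order earlier flushes first */
        *sb_cp_ptr = new_cp_addr;     /* atomic announce */
        persist((const void *)sb_cp_ptr);
        __sync_synchronize();
    }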

SLIDE 19

Snapshot Deletion

  • Deletion starts from the checkpoint block
    • The checkpoint cache is stored in DRAM
  • Follows a top-down procedure to decrease reference counts
    • Consistency is ensured by journaling
  • Garbage collection is invoked afterwards
    • Many reference counts have dropped to zero

[Figure: NAT reference counts before and after deleting a snapshot. Counts on blocks shared with surviving snapshots are decremented (C,2 → C,1; E,2 → E,1), while counts on blocks unique to the deleted snapshot (P, D, F) drop to zero.]
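A sketch of the top-down walk with toy in-memory nodes; in the real system the decrements are journaled and reclamation is deferred to garbage collection:

    #include <stdlib.h>

    #define FANOUT 4

    struct tnode {
        int refcount;
        struct tnode *child[FANOUT];
    };

    /* Decrement the node's count; descend only if it hit zero, since a
     * still-shared subtree belongs to other snapshots and must keep
     * all of its internal counts intact. */
    void drop_subtree(struct tnode *n)
    {
        if (!n || --n->refcount > 0)
            return;                   /* still shared: stop here */
        for (int i = 0; i < FANOUT; i++)
            drop_subtree(n->child[i]);
        /* this block is now garbage; GC reclaims its space later */
    }

    int main(void)
    {
        struct tnode *shared = calloc(1, sizeof(*shared));
        struct tnode *cp = calloc(1, sizeof(*cp));
        shared->refcount = 2;         /* shared with another snapshot */
        cp->refcount = 1;             /* unique to the deleted snapshot */
        cp->child[0] = shared;

        drop_subtree(cp);             /* cp -> 0; shared -> 1, kept */
        free(cp);
        free(shared);
        return 0;
    }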

SLIDE 20

Crash Recovery

  • Mount the last completed snapshot as writable
    • No additional recovery overhead
  • Mount older snapshots as read-only
    • Locate the checkpoint block of the snapshot
    • Retrieve files via its SFST

[Figure: the superblock points to the chain of checkpoint blocks, one per snapshot, each leading to the NAT root of that snapshot's SFST.]
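A sketch of locating a snapshot at mount time, assuming a simple checkpoint array in the superblock (the real system keeps a checkpoint chain plus the DRAM-side CIT):

    #include <stdint.h>
    #include <stddef.h>

    struct checkpoint {
        uint32_t version;
        uint64_t nat_root_addr;   /* root of this snapshot's SFST */
    };

    struct superblock {
        uint64_t latest_cp;       /* atomic commit target (Slide 18) */
        struct checkpoint cps[64];
        int ncps;
    };

    /* The SFST reachable from a checkpoint is a complete, already
     * consistent image, so mounting it needs no redo/undo work. */
    const struct checkpoint *find_snapshot(const struct superblock *sb,
                                           uint32_t version)
    {
        for (int i = 0; i < sb->ncps; i++)
            if (sb->cps[i].version == version)
                return &sb->cps[i];
        return NULL;              /* no such snapshot */
    }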

SLIDE 21

Outline

  • Introduction
  • Design
  • Implementation
  • Evaluation
  • Conclusion
SLIDE 22

Evaluation

  • Experimental setup
    • A commodity server with 64 Intel Xeon 2GHz processors and 512GB DRAM
    • Performance comparison with PMFS, EXT4, BTRFS, NILFS2
  • Postmark results
    • Varying read bias (percentage of reads)

[Charts: Postmark transaction performance (runtime in seconds) and snapshotting efficiency (sec⁻¹) as the percentage of reads varies from 0% to 100%. Transaction performance compares HMVFS, BTRFS, NILFS2, EXT4, and PMFS; snapshotting efficiency compares HMVFS, BTRFS, and NILFS2. Callout: 2.7x and 2.3x (HMVFS's snapshotting efficiency over BTRFS and NILFS2).]

SLIDE 23

Evaluation

  • Filebench results
    • Fileserver workload
    • Varying numbers of files

[Charts: Fileserver throughput (ops/sec ×1000) and snapshotting efficiency (sec⁻¹) for 2k, 4k, 8k, and 16k files. Throughput compares HMVFS, BTRFS, NILFS2, EXT4, and PMFS; snapshotting efficiency compares HMVFS, BTRFS, and NILFS2. Callout: 9.7x and 6.6x (HMVFS's snapshotting efficiency over BTRFS and NILFS2).]

SLIDE 24

Evaluation

  • Filebench results
    • Varmail workload
    • Varying directory depths

[Charts: Varmail throughput (ops/sec ×1000) and snapshotting efficiency (sec⁻¹) for directory depths of 0.7, 1.2, 1.4, and 2.1. Throughput compares HMVFS, BTRFS, NILFS2, EXT4, and PMFS; snapshotting efficiency compares HMVFS, BTRFS, and NILFS2. Callout: 8.7x and 2.5x (HMVFS's snapshotting efficiency over BTRFS and NILFS2).]

SLIDE 25

Outline

  • Introduction
  • Design
  • Implementation
  • Evaluation
  • Conclusion
SLIDE 26

Conclusion

  • HMVFS is the first file system to solve the consistency problem for NVM-based in-memory file systems using snapshotting
  • The metadata of the Stratified File System Tree (SFST) is decoupled from its data and is updated at byte granularity
  • HMVFS stores snapshots space-efficiently through shared blocks in the SFST and handles both the write amplification problem and the block sharing problem
  • HMVFS exploits the structural benefit of the CoW friendly B-tree and the byte-addressability of NVM to take frequent snapshots automatically
  • HMVFS outperforms traditional versioning file systems in snapshotting and overall performance while providing a strong consistency guarantee and having little impact on foreground operations

SLIDE 27
  • Q & A
  • Thank you