SLIDE 1

Write Optimization of Log-structured Flash File System for Parallel I/O on Manycore Servers

Chang-Gyu Lee, Hyunki Byun, Sunghyun Noh, Hyeongu Kang, Youngjae Kim
Department of Computer Science and Engineering, Sogang University, Seoul, Republic of Korea

SYSTOR '19

SLIDE 2

Data-Intensive Applications

– Massive data explosion in recent years, and it is expected to keep growing

  – 2007: 281 EB, 2010: 1.2 ZB, 2013: 4.4 ZB, 2020: ~44 ZB

– Database applications

[Figure: growing capacity demands on storage and memory]

SLIDE 3

Manycore CPU and NVMe SSD

[Figure: on a manycore server, parallel writes flow through the OS file system (F2FS) to a high-performance NVMe SSD]

SLIDE 4

What are Parallel Writes?

– Shared File Writes (DWOM from FxMark[ATC’16])

  – Multiple processes write to private regions of a single shared file.

– Private File Writes with fsync (DWSL from FxMark[ATC’16])

  – Multiple processes write private files, then call fsync.

[Figure: N processes issue direct I/O writes to one shared file (DWOM) vs. N processes that write and fsync private files (DWSL)]

* FxMark[ATC’16]: Min et al., "Understanding Manycore Scalability of File Systems", USENIX ATC 2016
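To make the two patterns concrete, the sketch below shows what one worker of each kind might do per iteration; the file descriptors, buffer size, and per-rank offset scheme are illustrative assumptions, not FxMark's actual code.

```c
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

#define BLK 4096

/* DWOM-style worker: each process overwrites its own 4 KB region of one
 * shared file, so writes never overlap. */
void dwom_worker(int shared_fd, int rank) {
    char buf[BLK] = {0};
    pwrite(shared_fd, buf, BLK, (off_t)rank * BLK);
}

/* DWSL-style worker: each process writes a private file and forces it
 * to storage with fsync. */
void dwsl_worker(const char *private_path) {
    char buf[BLK] = {0};
    int fd = open(private_path, O_CREAT | O_WRONLY, 0644);
    if (fd < 0) return;
    write(fd, buf, BLK);
    fsync(fd);   /* the expensive part: data flush, node flush, possibly checkpointing */
    close(fd);
}
```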

SLIDE 5

Preliminary Results


[Figure: throughput (K IOPS) vs. number of cores (1 to 120) for the DWOM and DWSL workloads]

– In the DWOM workload, the performance does not scale at all.
– In the DWSL workload, the performance does not scale beyond 42 cores.

SLIDE 6

Contents

– Introduction and Motivation
– Background: F2FS
– Research Problems

  – Parallel writes do not scale with the number of cores on manycore servers.

– Approaches

  – Applying Range-Locking
  – NVM Node Logging for file and file system metadata
  – Pin-Point Update to completely eliminate checkpointing

– Evaluation Results
– Conclusion

SLIDE 7

F2FS: Flash-Friendly File System

– F2FS is a log-structured file system designed for NAND flash SSDs.
– F2FS employs two types of logs to benefit from Flash's parallelism and to ease garbage collection.

  – Data log for directory entries and user data
  – Node log for inodes and indirect nodes

– The Node Address Table (NAT) translates a node id (NID) to a block address.

  – In memory, the block address in a NAT entry is updated whenever the corresponding node block is flushed to the node log.
  – The entire NAT is flushed to the storage device during checkpointing.

[Figure: F2FS on-disk layout: CP, NAT, SIT, and SSA in the filesystem metadata area (random writes), followed by the main log area holding the node and data logs (sequential writes)]
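As a mental model, the NAT can be pictured as a NID-indexed array from node id to block address; the sketch below uses illustrative names, not F2FS's real struct layout.

```c
#include <stdint.h>

typedef uint32_t nid_t;       /* node id */
typedef uint32_t blkaddr_t;   /* block address on the device */

/* One NAT entry: where the node block identified by nid currently lives. */
struct nat_entry {
    nid_t     nid;
    blkaddr_t blk_addr;
};

/* In-memory lookup: when a node block is flushed to a new location in
 * the node log, its entry is updated here; in baseline F2FS the whole
 * table is persisted only at checkpoint time. */
blkaddr_t nat_lookup(const struct nat_entry *nat, nid_t nid) {
    return nat[nid].blk_addr;
}
```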

SLIDE 8

Problem(1): Serialized Shared File Writes

– Single file write

[Figure: processes A, B, and C write to one file; the inode lock is granted to one process while the others are blocked]
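A minimal userspace model of the bottleneck, assuming one mutex per inode guards the whole write path; the struct, names, and stub are illustrative, not F2FS's actual code.

```c
#include <pthread.h>
#include <stddef.h>
#include <sys/types.h>

struct model_inode {
    pthread_mutex_t i_mutex;   /* one lock for the whole file */
    /* ... */
};

/* Stub standing in for the actual block writes. */
static ssize_t do_blocks_write(struct model_inode *inode,
                               const void *buf, size_t len, off_t off) {
    (void)inode; (void)buf; (void)off;
    return (ssize_t)len;
}

/* All writers to the same file queue on i_mutex, even when their
 * ranges do not overlap. */
ssize_t serialized_write(struct model_inode *inode,
                         const void *buf, size_t len, off_t off) {
    pthread_mutex_lock(&inode->i_mutex);
    ssize_t ret = do_blocks_write(inode, buf, len, off);
    pthread_mutex_unlock(&inode->i_mutex);
    return ret;
}
```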

SLIDE 9

Problem(2): fsync Processing in F2FS

[Figure: fsync in baseline F2FS: ❶ dirty data blocks are flushed from DRAM to the data log on the SSD, ❷ the updated inode is flushed to the node log, and the in-DRAM NAT entry for its node id is updated to the new block address]
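Read as code, the flow above looks roughly like the following sketch; the helpers are stubs standing in for the real log appends and NAT update.

```c
#include <stdint.h>

typedef uint32_t nid_t;
typedef uint32_t blkaddr_t;

struct model_inode { nid_t nid; /* ... dirty data and node state ... */ };

/* Stubs for the two log appends (each a 4 KB block I/O to the SSD in
 * baseline F2FS) and the in-DRAM NAT update; return values illustrative. */
static blkaddr_t append_to_data_log(struct model_inode *i) { (void)i; return 100; } /* step 1 */
static blkaddr_t append_to_node_log(struct model_inode *i) { (void)i; return 200; } /* step 2 */
static void nat_set_in_dram(nid_t nid, blkaddr_t blk) { (void)nid; (void)blk; }

/* fsync in baseline F2FS per this slide: flush data, flush the node
 * (a full block even for a tiny inode change), then update the NAT
 * entry in DRAM only; it is persisted later, at checkpoint time. */
void fsync_model(struct model_inode *inode) {
    append_to_data_log(inode);
    blkaddr_t node_blk = append_to_node_log(inode);
    nat_set_in_dram(inode->nid, node_blk);
}
```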

SLIDE 10

Problem(3): I/O Blocking during Checkpointing

[Figure: the same fsync flow as on the previous slide; in addition, F2FS triggers checkpointing periodically (every 60 seconds)]

SLIDE 11

Problem(3): I/O Blocking during Checkpointing

[Figure: during checkpointing (❸), the entire in-DRAM NAT is flushed to the NAT area on the SSD]

SLIDE 12

Problem(3): I/O Blocking during Checkpointing

[Figure: while checkpointing flushes the NAT, incoming I/O requests from the user level are blocked at the filesystem level]

SLIDE 13

Summary

– We identified the causes of bottlenecks in F2FS for parallel writes as follows.

  • 1. Serialization of parallel writes on a single file
  • 2. High latency of fsync system call
  • 3. I/O blocking by checkpointing of F2FS

SLIDE 14

Approach(1): Range Locking

– In F2FS, parallel writes to a single file are serialized by the inode mutex lock.

[Figure: with a range lock, processes A, B, and C each acquire a lock on their own range (with per-range reference counts); a writer blocks only when ranges overlap]

We employed a range-based lock to allow parallel writes on a single file.
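A minimal sketch of a file-range lock, assuming a list of held ranges guarded by a short-lived mutex; the paper's actual structure (the figure suggests per-range reference counts) may differ.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/types.h>

struct range { off_t start, end; struct range *next; };

struct range_lock {
    pthread_mutex_t guard;     /* protects the list, held only briefly */
    pthread_cond_t  released;
    struct range   *held;      /* currently locked ranges */
};

/* Half-open interval overlap test over the held list. */
static bool overlaps(const struct range *r, off_t s, off_t e) {
    for (; r; r = r->next)
        if (s < r->end && r->start < e) return true;
    return false;
}

void range_lock(struct range_lock *rl, off_t start, off_t end) {
    pthread_mutex_lock(&rl->guard);
    while (overlaps(rl->held, start, end))      /* wait only on a true overlap */
        pthread_cond_wait(&rl->released, &rl->guard);
    struct range *r = malloc(sizeof *r);
    r->start = start; r->end = end; r->next = rl->held;
    rl->held = r;
    pthread_mutex_unlock(&rl->guard);
}

void range_unlock(struct range_lock *rl, off_t start, off_t end) {
    pthread_mutex_lock(&rl->guard);
    for (struct range **p = &rl->held; *p; p = &(*p)->next)
        if ((*p)->start == start && (*p)->end == end) {
            struct range *dead = *p;
            *p = dead->next;
            free(dead);
            break;
        }
    pthread_cond_broadcast(&rl->released);
    pthread_mutex_unlock(&rl->guard);
}
```

With a lock like this in the write path, DWOM-style writers to disjoint 4 KB regions never wait on each other; only truly overlapping writes serialize.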

SLIDE 15

Approach(2): High Latency of fsync Processing

– When fsync is called, F2FS has to flush both data and metadata.

  – Even if only a small portion of the metadata has changed, a whole block must be flushed.
  – The latency of fsync is therefore dominated by block I/O latency.

[Figure: flushing the inode from DRAM to the SSD: slow block I/O and write amplification]

To mitigate the high latency of fsync, we propose NVM Node Logging and a fine-grained inode.

[Figure: flushing the inode from DRAM to NVM: better latency and byte-addressability]
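The sketch below illustrates why byte-addressability matters: a sub-block inode record can be persisted with cache-line flushes rather than a 4 KB block write. It assumes x86 flush intrinsics (clwb would be preferable where available) and an illustrative log layout.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <emmintrin.h>   /* _mm_clflush, _mm_sfence */

#define CACHELINE 64

/* Flush every cache line covering [addr, addr + len), then fence. */
static void persist(const void *addr, size_t len) {
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    for (; p < (uintptr_t)addr + len; p += CACHELINE)
        _mm_clflush((const void *)p);
    _mm_sfence();
}

/* Append a ~0.4 KB fine-grained inode record to a node log on NVM:
 * byte-granular stores, no 4 KB write amplification. */
void nvm_node_log_append(uint8_t *nvm_log, size_t *tail,
                         const void *inode_rec, size_t rec_len) {
    memcpy(nvm_log + *tail, inode_rec, rec_len);
    persist(nvm_log + *tail, rec_len);
    *tail += rec_len;
}
```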

SLIDE 16

Approach(2): Node Logging on NVM

[Figure: with NVM Node Logging, ❶ data blocks are still flushed to the data log on the SSD, while ❷ node blocks are appended to a node log kept on byte-addressable NVM]

SLIDE 17

Approach(3): Fine-grained inode Structure


[Figure: the inode in baseline F2FS occupies a full 4 KB block, holding the address array plus node ids for direct, indirect, and double-indirect nodes; the fine-grained inode keeps only the addresses and nid in roughly 0.4 KB]
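Schematically, the two layouts contrast as below; the baseline pointer counts follow F2FS's on-disk inode, while the metadata size and the fine-grained layout are simplified illustrations of the ~0.4 KB design.

```c
#include <stdint.h>

/* Baseline F2FS: the inode fills a whole 4 KB block, dominated by the
 * block-address array (metadata size illustrative). */
struct baseline_inode_4k {
    uint8_t  metadata[360];   /* mode, size, times, ... (abridged) */
    uint32_t i_addr[923];     /* direct block pointers */
    uint32_t i_nid[5];        /* direct/indirect/double-indirect node ids */
};  /* ~4 KB: any change forces a full-block flush */

/* Fine-grained inode (illustrative): only the hot metadata and a small
 * address array, so one fsync persists ~0.4 KB over NVM. */
struct fine_grained_inode {
    uint8_t  metadata[360];
    uint32_t i_addr[10];      /* small pointer set; the rest stays in node blocks */
};  /* ~0.4 KB */
```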

SLIDE 18

Approach(4): Pin-Point NAT Update

– Frequent fsync calls trigger checkpointing in F2FS.
– However, F2FS blocks all incoming I/O requests during checkpointing.

To eliminate checkpointing, we propose Pin-Point NAT Update.

SLIDE 19

Approach(4): Pin-Point NAT Update

[Figure: on fsync, ❶ data is flushed to the data log on the SSD, ❷ the node is appended to the node log on NVM, and ❸ the corresponding NAT entry is updated in place on NVM]

In Pin-Point NAT Update, we update only the modified NAT entry directly in NVM when fsync is called. Therefore, checkpointing is not necessary to persist the entire NAT.
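Assuming the NAT resides on NVM as in the figure, the fsync path can persist just the one modified entry with a cache-line flush; the layout and helper names are illustrative.

```c
#include <stdint.h>
#include <emmintrin.h>   /* _mm_clflush, _mm_sfence */

typedef uint32_t nid_t;
typedef uint32_t blkaddr_t;
struct nat_entry { nid_t nid; blkaddr_t blk_addr; };

/* Pin-Point NAT Update: persist only the entry for this node, in place
 * on NVM, instead of flushing the whole table at a checkpoint. */
void pinpoint_nat_update(struct nat_entry *nvm_nat, nid_t nid, blkaddr_t new_blk) {
    nvm_nat[nid].blk_addr = new_blk;   /* small in-place store */
    _mm_clflush(&nvm_nat[nid]);        /* flush the line holding the entry */
    _mm_sfence();                      /* ordered before fsync returns */
}
```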

SLIDE 20

Approach(4): Pin-Point NAT Update

[Figure: the same flow as on the previous slide; fsync completes without any checkpoint, since the NAT is kept persistent on NVM]

SLIDE 21

Evaluation Setup

– Microbenchmark (FxMark)

  – DWOM: Shared File Write
  – DWSL: Private File Write with fsync

– Test-bed: IBM x3950 X6 Manycore Server

  CPU: Intel Xeon E7-8870 v2 2.3 GHz, 8 CPU nodes (15 cores per node), 120 cores total
  RAM: 740 GB
  SSD: Intel SSD 750 Series 400 GB (NVMe), Read 2200 MB/s, Write 900 MB/s
  NVM: 32 GB, emulated as a PMEM device on RAM
  OS: Linux kernel 4.14.11

* FxMark[ATC’16]: Min et al., "Understanding Manycore Scalability of File Systems", USENIX ATC 2016

SLIDE 22

Shared File Write (DWOM Workload)

[Figure: DWOM throughput (K IOPS) vs. number of cores (1 to 120) for baseline, range lock, node logging, and integrated; annotated speedups of x15 and x6.8 over the baseline]

  • The baseline and node logging lines overlap.
  • Node Logging does not help here because the DWOM workload issues no fsync calls.

SLIDE 23

Frequent fsync (DWSL Workload)

[Figure: DWSL throughput (K IOPS) vs. number of cores (1 to 120) for baseline, range lock, node logging, and integrated; annotated speedup of x1.6]

SLIDE 24

Conclusion

– We identified performance bottlenecks of F2FS for parallel writes.

  • 1. Serialization of shared file writes on a single file
  • 2. High latency of fsync operations in F2FS
  • 3. High I/O blocking times during checkpointing

– To solve these problems, we proposed

  • 1. File-level Range Lock to allow parallel writes on a shared file
  • 2. NVM Node Logging to provide lower latency for updating file and file system metadata
  • 3. Pin-Point NAT Update to eliminate the I/O blocking times of checkpointing

SLIDE 25

Q&A


Thank you!

– Contact: Chang-Gyu Lee (changgyu@sogang.ac.kr)
  Department of Computer Science and Engineering, Sogang University, Seoul, Republic of Korea