SLIDE 1

Understanding Manycore Scalability of File Systems

Changwoo Min, Sanidhya Kashyap, Steffen Maass, Woonhak Kang, and Taesoo Kim

SLIDE 2

Application must parallelize I/O operations

  • Death of single core CPU scaling

– CPU clock frequency: 3 ~ 3.8 GHz
– # of physical cores: up to 24 (Xeon E7 v4)

  • From mechanical HDD to flash SSD

– IOPS of a commodity SSD: 900K
– Non-volatile memory (e.g., 3D XPoint): 1,000x ↑

But file systems become a scalability bottleneck

SLIDE 3

Problem: lack of understanding of internal scalability behavior

[Figure: Exim mail server on RAMDISK, messages/sec vs. #core for btrfs, ext4, F2FS, and XFS]

Embarrassingly parallel application!
Three scalability behaviors observed:
  • 1. Saturated
  • 2. Collapsed
  • 3. Never scale
  • Intel 80-core machine: 8-socket, 10-core Xeon E7-8870
  • RAM: 512GB, 1TB SSD, 7200 RPM HDD
SLIDE 4

Even on a slower storage medium, the file system becomes a bottleneck

[Figure: Exim email server at 80 cores, messages/sec on RAMDISK, SSD, and HDD for btrfs, ext4, F2FS, and XFS]

SLIDE 5

Outline

  • Background
  • FxMark design

– A file system benchmark suite for manycore scalability

  • Analysis of five Linux file systems
  • Pilot solution
  • Related work
  • Summary
SLIDE 6

Research questions

  • What file system operations are not scalable?
  • Why are they not scalable?
  • Is it a problem of implementation or design?
SLIDE 7

Technical challenges

  • Applications are usually stuck with a few bottlenecks

→ cannot see the next level of bottlenecks before resolving them
→ difficult to understand overall scalability behavior

  • How to systematically stress file systems to understand their scalability behavior?

SLIDE 9

FxMark: evaluate & analyze manycore scalability of file systems

  • FxMark: 19 micro-benchmarks and 3 applications
  • File systems: tmpfs (memory FS), ext4 (journaling FS, J/NJ), XFS (journaling FS), btrfs (CoW FS), F2FS (log FS)
  • Storage medium: SSD
  • # core: 1, 2, 4, 10, 20, 30, 40, 50, 60, 70, 80
  • >4,700 experiments in total

SLIDE 10

Microbenchmark: unveil hidden scalability bottlenecks

  • Data block read, exercised at three sharing levels: Low, Medium, and High (a minimal sketch follows below)

[Diagram: sharing level (Low / Medium / High) vs. what is shared (file, block, process), with the read operations (R) issued by each core]
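A minimal sketch (not FxMark's actual code; paths and helper structure are illustrative): the Low level reads a block of a per-core private file, Medium reads a per-core block of one shared file, and High has every core read the same block of the shared file.

    /* Illustrative data-block-read worker; not FxMark's code. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define BLOCK_SIZE 4096
    enum { LOW, MEDIUM, HIGH };                 /* sharing level */

    /* One worker per core; each pread() of one block counts as one operation. */
    static void reader(int core, int level, long iters)
    {
        char path[64], buf[BLOCK_SIZE];
        off_t off = 0;

        if (level == LOW) {                     /* private file, private block     */
            snprintf(path, sizeof(path), "/mnt/test/private.%d", core);
        } else {
            snprintf(path, sizeof(path), "/mnt/test/shared");
            if (level == MEDIUM)                /* shared file, per-core block     */
                off = (off_t)core * BLOCK_SIZE;
            /* HIGH: shared file and the same block (offset 0) for every core     */
        }

        int fd = open(path, O_RDONLY);
        for (long i = 0; i < iters; i++)
            pread(fd, buf, BLOCK_SIZE, off);
        close(fd);
    }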

SLIDE 11

Stress different components with various sharing levels

SLIDE 12

Evaluation

  • Data block read, Low sharing level

[Figure: M ops/sec vs. #core; legend: btrfs, ext4, ext4NJ, F2FS, tmpfs, XFS]

Linear scalability

SLIDE 13

Outline

  • Background
  • FxMark design
  • Analysis of five Linux file systems

– What are scalability bottlenecks?

  • Pilot solution
  • Related work
  • Summary
SLIDE 14

Summary of results: file systems are not scalable

[Figure: throughput vs. #core (10-80) for all 19 microbenchmarks (DRBL, DRBM, DRBH, DWOL, DWOM, DWAL, DWTL, DWSL, MRPL, MRPM, MRPH, MRDL, MRDM, MWCL, MWCM, MWUL, MWUM, MWRL, MWRM, plus DRBL/DRBM/DWOL/DWOM with O_DIRECT) and 3 applications (Exim in messages/sec, RocksDB in ops/sec, DBENCH in GB/sec); legend: btrfs, ext4, ext4NJ, F2FS, tmpfs, XFS]


SLIDE 18

Data block read

[Figure: M ops/sec vs. #core for DRBL (Low), DRBM (Medium), and DRBH (High)]

  • Low: all file systems linearly scale
  • Medium: XFS shows performance collapse
  • High: all file systems show performance collapse

SLIDE 19

Page cache is maintained for efficient access of file data

  • 1. read a file block
  • 2. look up the page cache
  • 3. cache miss
  • 4. read a page from disk
  • 5. copy page

[Diagram: reader, OS kernel page cache, and disk]

SLIDE 20

Page cache hit

  • 1. read a file block
  • 2. look up the page cache
  • 3. cache hit
  • 4. copy page

[Diagram: reader, OS kernel page cache, and disk]
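A condensed pseudocode sketch of the read path on the two slides above; the function and type names are illustrative, not the actual Linux kernel API.

    /* Pseudocode sketch of the read path (illustrative names only). */
    struct page *read_file_block(struct file *file, long index)
    {
        struct page *page;

        page = page_cache_lookup(file, index);      /* 2. look up the page cache  */
        if (!page) {                                /* 3. cache miss              */
            page = alloc_page();
            read_page_from_disk(file, index, page); /* 4. read the page from disk */
            page_cache_insert(file, index, page);
        }                                           /* 3'. cache hit skips I/O    */
        return page;                                /* caller copies the page out */
    }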

SLIDE 21

Page cache can be evicted to secure free memory

[Diagram: OS kernel page cache and disk]

SLIDE 22

… only when not being accessed

  • 1. read a file block
  • 4. copy page

Reference counting is used to track the # of tasks accessing a page:

    access_a_page(...) {
        atomic_inc(&page->_count);
        ...
        atomic_dec(&page->_count);
    }

[Diagram: reader, OS kernel page cache, and disk]
SLIDE 24

Reference counting becomes a scalability bottleneck

    access_a_page(...) {
        atomic_inc(&page->_count);
        ...
        atomic_dec(&page->_count);
    }

[Figure: DRBH, M ops/sec vs. #core; the refcounting code is annotated with 100 CPI vs. 20 CPI (cycles per instruction)]

High contention on a page reference counter → huge memory stall

Many more: directory entry cache, XFS inode, etc.
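The CPI gap above comes from cache-line bouncing on the shared counter. A self-contained user-space analogue (not the kernel code; thread and iteration counts are arbitrary) contrasts one contended atomic counter with padded per-thread counters:

    /* Contended vs. core-local atomic increments; build: gcc -O2 -pthread refcount.c */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NTHREADS 8
    #define ITERS    10000000L

    static atomic_long shared_count;                      /* one hot cache line, like page->_count */
    static struct { atomic_long v; char pad[56]; } local_count[NTHREADS];  /* padded per thread   */

    static void *bump(void *arg)
    {
        long id = (long)arg;
        for (long i = 0; i < ITERS; i++) {
            atomic_fetch_add(&shared_count, 1);           /* contended: bounces between cores */
            atomic_fetch_add(&local_count[id].v, 1);      /* uncontended: stays core-local    */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, bump, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("shared = %ld, local[0] = %ld\n",
               (long)atomic_load(&shared_count), (long)atomic_load(&local_count[0].v));
        return 0;
    }

Profiling the two increments separately (e.g., with perf) attributes far more cycles per instruction to the contended one, which is the effect the slide quantifies.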

SLIDE 25

Lessons learned

High locality can cause performance collapse

Cache hits should be scalable
→ When cache hits are dominant, the scalability of the cache-hit path matters.

SLIDE 26

Data block overwrite

[Figure: M ops/sec vs. #core for DWOL (Low) and DWOM (Medium)]

  • Low: all file systems degrade gradually
  • Medium: ext4, F2FS, and btrfs show performance collapse

SLIDE 27

Btrfs is a copy-on-write (CoW) file system

  • Directs a write to a block to a new copy of the block
    → Never overwrites the block in place
    → Maintains multiple versions of a file system image

[Diagram: file system tree at time T and, after a write (W), at time T+1]

SLIDE 28

CoW triggers disk block allocation for every write

[Diagram: a write (W) at time T triggers block allocation, producing the tree at time T+1]

→ Disk block allocation becomes a bottleneck
→ Similarly for ext4 journaling and F2FS checkpointing
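A pseudocode sketch of that behavior (illustrative names, not btrfs internals): even a logically in-place overwrite goes through the block allocator, so every writer contends on allocation and on updating the CoW index.

    /* Pseudocode: copy-on-write overwrite of one block (illustrative only). */
    void cow_overwrite(struct inode *inode, long blkno, const void *data)
    {
        long old_blk = lookup_block(inode, blkno);    /* current on-disk location       */
        long new_blk = allocate_block();              /* every write hits the allocator */

        write_block(new_blk, data);                   /* write the new copy             */
        update_block_pointer(inode, blkno, new_blk);  /* the index is CoW-updated too   */
        release_block(old_blk);                       /* old version reclaimed later    */
    }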

SLIDE 29

Lessons learned

Overwriting could be as expensive as appending
→ Critical for log-structured FS (F2FS) and CoW FS (btrfs)

Consistency guarantee mechanisms should be scalable
→ Scalable journaling
→ Scalable CoW index structure
→ Parallel log-structured writing

SLIDE 31

Entire file is locked regardless of update range

  • All tested file systems hold an inode mutex for write operations
    – Range-based locking is not implemented

    ***_file_write_iter(...) {
        mutex_lock(&inode->i_mutex);
        ...
        mutex_unlock(&inode->i_mutex);
    }

SLIDE 32

Lessons learned

A file cannot be concurrently updated
– Critical for VMs and DBMSs, which manage large files

Need to consider techniques used in parallel file systems
→ E.g., range-based locking (a sketch follows below)
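A pseudocode sketch of range-based locking (hypothetical API; none of the tested file systems implement it): a writer locks only the byte range it touches, so non-overlapping writes to the same file can proceed in parallel.

    /* Pseudocode: range-based write locking (hypothetical names). */
    ssize_t range_locked_write(struct inode *inode, loff_t pos, size_t len,
                               const char *buf)
    {
        struct file_range_lock rl;
        ssize_t ret;

        lock_file_range(inode, &rl, pos, pos + len);   /* lock only [pos, pos + len)         */
        ret = do_write(inode, pos, len, buf);          /* overlapping ranges still serialize */
        unlock_file_range(inode, &rl);
        return ret;
    }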

SLIDE 34

Summary of findings

  • High locality can cause performance collapse
  • Overwriting could be as expensive as appending
  • A file cannot be concurrently updated
  • All directory operations are sequential
  • Renaming is system-wide sequential
  • Metadata changes are not scalable
  • Non-scalability often means wasting CPU cycles
  • Scalability is not portable

See our paper

Many of them are unexpected and counter-intuitive
→ Contention at the file system level to maintain data dependencies

SLIDE 35

Outline

  • Background
  • FxMark design
  • Analysis of five Linux file systems
  • Pilot solution

– If we remove contention in a file system, is it then scalable?

  • Related work
  • Summary
SLIDE 37

RocksDB on a 60-partitioned RAMDISK scales better

[Figure: ops/sec vs. #core, a single-partitioned RAMDISK vs. a 60-partitioned RAMDISK; legend: btrfs, ext4, F2FS, tmpfs, XFS; annotation: 2.1x]

** Tested workload: DB_BENCH overwrite **

Reduced contention on file systems helps improve performance and scalability

SLIDE 39

But partitioning makes performance worse on HDD

[Figure: ops/sec vs. #core (up to 20 cores), a single-partitioned HDD vs. a 60-partitioned HDD; legend: btrfs, ext4, F2FS, XFS; annotation: 2.7x]

** Tested workload: DB_BENCH overwrite **

Reduced spatial locality degrades performance
→ Medium-specific characteristics (e.g., spatial locality) should be considered

SLIDE 40

Related work

  • Scaling operating systems
    – Mostly use a memory file system to factor out the effect of I/O operations
  • Scaling file systems
    – Scalable file system journaling
      • ScaleFS [MIT:MSThesis'14]
      • SpanFS [ATC'15]
    – Parallel log-structured writing on NVRAM
      • NOVA [FAST'16]
SLIDE 41

Summary

  • Comprehensive analysis of the manycore scalability of five widely-used file systems using FxMark
  • Manycore scalability should be of utmost importance in file system design
  • New challenges in scalable file system design
    – Minimizing contention, scalable consistency guarantees, spatial locality, etc.
  • FxMark is open source
    – https://github.com/sslab-gatech/fxmark