Understanding Manycore Scalability of File Systems
Changwoo Min, Sanidhya Kashyap, Steffen Maass, Woonhak Kang, and Taesoo Kim
Applications must parallelize I/O operations
Death of single-core CPU scaling
– CPU clock frequency: 3 ~ 3.8 GHz
– # of physical cores: up to 24 (Xeon E7 v4)
– IOPS of a commodity SSD: 900K
– Non-volatile memory (e.g., 3D XPoint): 1,000x ↑
[Figure: Exim mail server on RAMDISK; messages/sec (0k to 14k) vs. #core (10 to 80); btrfs, ext4, F2FS, XFS]
Embarrassingly parallel application!
[Figure: Exim email server at 80 cores; messages/sec (0k to 12k) for btrfs, ext4, F2FS, XFS on RAMDISK, SSD, HDD]
– A file system benchmark suite for manycore scalability
[Diagram labels: Memory FS, J/NJ, Journaling FS, CoW FS, Log FS, SSD]
[Diagram: sharing level (Low, Medium, High); legend: File, Block, Process, Operation]
[Figure: M ops/sec (0 to 250) vs. #core (10 to 80); legend: btrfs, ext4, ext4NJ, F2FS, tmpfs, XFS]
Low: linear scalability
– What are scalability bottlenecks?
[Figure: benchmark results, one panel per benchmark, M ops/sec vs. #core (10 to 80). Panels: DRBL, DRBM, DRBH, DWOL, DWOM, DWAL, DWTL, DWSL, MRPL, MRPM, MRPH, MRDL, MRDM, MWCL, MWCM, MWUL, MWUM, MWRL, MWRM, DRBL:O_DIRECT, DRBM:O_DIRECT, DWOL:O_DIRECT, DWOM:O_DIRECT. Application panels: Exim (messages/sec), RocksDB, DBENCH (GB/sec). Legend: btrfs, ext4, ext4NJ, F2FS, tmpfs, XFS]
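The panel names above (DRBL, DWOM, MRPH, and so on) appear to follow a four-letter pattern. The slides do not spell it out, so the mapping below is an assumption based on the FxMark naming convention: the first letter is the target (D = data, M = metadata), the second is the mode (R = read, W = write), the third is an operation letter, and the fourth is the sharing level (L, M, H for Low, Medium, High). A minimal C sketch of a decoder under that assumption follows; `struct bench_info` and `decode_bench_name` are hypothetical names, and the operation letter is deliberately left undecoded.

```c
#include <assert.h>
#include <string.h>

/* Decoded fields of a benchmark name such as "DRBL" (assumed convention). */
struct bench_info {
    const char *target;   /* "data" or "metadata" */
    const char *mode;     /* "read" or "write" */
    char op;              /* operation letter, left undecoded in this sketch */
    const char *sharing;  /* "low", "medium", or "high" */
};

/* Returns 0 on success, -1 if the name does not fit the assumed pattern.
 * Only the first four characters are inspected, so a suffix such as
 * ":O_DIRECT" is ignored. */
static int decode_bench_name(const char *name, struct bench_info *out)
{
    if (name == NULL || out == NULL || strlen(name) < 4)
        return -1;

    switch (name[0]) {
    case 'D': out->target = "data"; break;
    case 'M': out->target = "metadata"; break;
    default:  return -1;
    }

    switch (name[1]) {
    case 'R': out->mode = "read"; break;
    case 'W': out->mode = "write"; break;
    default:  return -1;
    }

    out->op = name[2];

    switch (name[3]) {
    case 'L': out->sharing = "low"; break;
    case 'M': out->sharing = "medium"; break;
    case 'H': out->sharing = "high"; break;
    default:  return -1;
    }
    return 0;
}
```

Under this reading, `decode_bench_name("DRBH", &info)` would describe a data read at high sharing level, while application panel names such as Exim do not match the pattern and are rejected.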
[Figure: DRBL, DRBM, and DRBH panels; M ops/sec vs. #core (10 to 80); callout: XFS; diagram of concurrent readers (R)]
[Diagram: Disk, Page cache (from disk), OS Kernel]
access_a_page(...) {
    atomic_inc(&page->_count);
    ...
    atomic_dec(&page->_count);
}
Reference counting is used to track # of accessing tasks
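The pattern on this slide can be sketched in user-space C11 to make the shared counter explicit. This is only an illustration, not kernel code: `struct page` here is a hypothetical stand-in that models nothing but the counter, and `page_get` and `page_put` are made-up helper names. The point it shows is that every task accessing the same page performs an atomic update on the same memory location.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the kernel's struct page; only the
 * reference counter from the slide is modeled. */
struct page {
    atomic_int _count;   /* number of tasks currently accessing the page */
};

/* Take a reference before touching the page. */
static void page_get(struct page *page)
{
    atomic_fetch_add(&page->_count, 1);
}

/* Drop a reference; returns the number of references that remain. */
static int page_put(struct page *page)
{
    int old = atomic_fetch_sub(&page->_count, 1);
    assert(old > 0);   /* a put without a matching get is a bug */
    return old - 1;
}

/* Mirrors the shape of access_a_page() on the slide. Returns the
 * number of users observed while the reference is held. */
static int access_a_page(struct page *page)
{
    page_get(page);
    /* ... the actual page access would go here ... */
    int users = atomic_load(&page->_count);
    page_put(page);
    return users;
}
```

When many cores call `access_a_page` on one shared page, each call makes two atomic updates to the same counter, so the cores contend on a single memory location even though the access itself is read-only.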
[Figure: DRBH; M ops/sec (1 to 10) vs. #core (10 to 80)]
Annotations on the reference-counting code above: 100 CPI (cycles-per-instruction) and 20 CPI (cycles-per-instruction)
[Figure: DWOL and DWOM panels; M ops/sec vs. #core (10 to 80); callout: ext4, F2FS, btrfs; diagram of concurrent writers (W)]
[Diagram: writers (W) at Time T and Time T+1, with block allocation at each step]
– Range-based locking is not implemented
***_file_write_iter(...) {
    mutex_lock(&inode->i_mutex);
    ...
    mutex_unlock(&inode->i_mutex);
}
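To illustrate why the missing range-based locking matters, the sketch below contrasts the two conflict rules. It is a toy model under stated assumptions, not how any of the file systems above implement locking: `struct byte_range` and both helper functions are hypothetical names, and ranges are half-open byte intervals.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical half-open byte range [start, end) written by one task. */
struct byte_range {
    size_t start;
    size_t end;
};

/* With a single per-inode mutex, any two writes to the same file
 * conflict, regardless of which bytes they touch. */
static bool conflicts_with_inode_mutex(struct byte_range a, struct byte_range b)
{
    (void)a;
    (void)b;
    return true;
}

/* With range-based locking, only overlapping ranges conflict. */
static bool conflicts_with_range_lock(struct byte_range a, struct byte_range b)
{
    return a.start < b.end && b.start < a.end;
}
```

For example, two writers updating bytes 0 to 4095 and 4096 to 8191 of one shared file would serialize under the inode mutex, but would not conflict under the range rule.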
– If we remove contention in a file system,
[Figure: two panels vs. #core (10 to 80), y-axis ticks 100 to 700; btrfs, ext4, F2FS, tmpfs, XFS; callout: tmpfs]
** Tested workload: DB_BENCH overwrite **
[Figure: two panels vs. #core (5 to 20), y-axis ticks 100 to 500; btrfs, ext4, F2FS, XFS; callout: F2FS]
** Tested workload: DB_BENCH overwrite **
– Mostly use a memory file system to factor out the effect of I/O
– Scalable file system journaling
– Parallel log-structured writing on NVRAM
– Minimizing contention, scalable consistency guarantee, spatial
– https://github.com/sslab-gatech/fxmark