BFO: Batch-File Operations on Massive Files for Consistent Performance Improvement
Yang Yang, Qiang Cao, Hong Jiang, Li Yang, Jie Yao, Yuanyuan Dong, Puyuan Yang
Huazhong University of Science and Technology, University of Texas at Arlington, Alibaba Group
Outline: Background, BFO Design, Evaluation, Conclusion
Batch-file Operations
Accessing a batch of files
Many applications need batch-file operations
  Backup applications
  File-level data replication and archiving
  Big data analytics systems
  Social media and online shopping websites
Traditional access approaches access files one by one
  Called the single-file access pattern
  Inefficient for small files
Small files in file systems
  Desktop file systems: more than 80% of accesses are to files smaller than 32B
  Cloud and HPC clusters: 25%~40% of files are smaller than 4KB
Single-file access pattern for small files
  Accessing metadata
  Fetching file data, and so on
I/O operations dominate batch-file access
  Metadata access contributes 40% of the time for accessing a small file on disk
  Random data I/Os
Read performance
Execution time (s) of reading file sets with different file sizes, in random (R) and sequential (S) order:

File size   HDD_R    HDD_S   SSD_R   SSD_S
4KB         9704     167.9   227.8   87.3
16KB        2226.5   53.7    106.3   37.1
64KB        551.3    31.1    30.5    21.1
256KB       177.6    29.4    20.2    14.2
1MB         65.1     28.9    14.9    9.7
4MB         37.1     28.2    10.4    8.5

Large performance gap between random and sequential access, especially for small files: 57.8X on HDD and 2.6X on SSD for the 4KB file set
Large performance gap among different file sizes
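The 57.8X and 2.6X callouts can be reproduced from the labelled execution times. The check below is a minimal sketch, assuming the first six values of each chart belong to the random-order series (R) and the next six to the sequential-order series (S); the variable and function names are illustrative only.

```python
# Execution times (s) read off the slide labels, for the 4KB to 4MB file sets.
sizes = ["4KB", "16KB", "64KB", "256KB", "1MB", "4MB"]
hdd_r = [9704, 2226.5, 551.3, 177.6, 65.1, 37.1]
hdd_s = [167.9, 53.7, 31.1, 29.4, 28.9, 28.2]
ssd_r = [227.8, 106.3, 30.5, 20.2, 14.9, 10.4]
ssd_s = [87.3, 37.1, 21.1, 14.2, 9.7, 8.5]


def gaps(random_times, sequential_times):
    """Random/sequential slowdown per file size, rounded to one decimal."""
    return [round(r / s, 1) for r, s in zip(random_times, sequential_times)]


hdd_gap = dict(zip(sizes, gaps(hdd_r, hdd_s)))
ssd_gap = dict(zip(sizes, gaps(ssd_r, ssd_s)))
print(hdd_gap["4KB"], ssd_gap["4KB"])  # 57.8 2.6
```

The same dictionaries show the gap shrinking as files grow, which is the second observation on these slides.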
Write performance
Execution time (s) of writing file sets with different file sizes, in random (R) and sequential (S) order:

File size   HDD_R    HDD_S   SSD_R   SSD_S
4KB         5138     88.7    92.4    58.8
16KB        930      43.5    37.8    22.2
64KB        225.7    37      20.8    12.5
256KB       146.5    36.1    16.5    11.6
1MB         68.6     35.3    12.9    11.3
4MB         56.1     35.9    12.4    11
Observation: the single-file access approach is very inefficient
Application-level optimization (FastCopy)
  Multi-threading, large buffer
Prefetching mechanism (DiskSeen, ATC’07)
  Depends on future access behaviors
Block-level I/O scheduler (split-level I/O scheduling, SOSP’15)
  Serializes the file accesses
Packing metadata and data together (CFFS, FAST’16)
  Requires redesigning the file system
File access behaviors
  Reading a file set with three representative file systems
  Writing a file set with three representative file systems
  [Figures not preserved in the transcript]
Data access behaviors (excluding the metadata)
  Expected access order versus actual access order (alphabetic) for Files A to E
  [Diagram: the App's accesses to Files A to E mapped onto the disk blocks; the two sequences shown are A B C D E and A C E D B]
  [Figure: logical block address (×10^6) over time (s)]
Outline: Background, BFO Design (BFOr, BFOw), Evaluation, Conclusion
Two-phase read
  Objective: read the metadata and the file data separately
  Phase 1: scanning the inodes
  Phase 2: fetching all files’ data
Layout-aware scheduler
[Diagram labels: 2MB, 128MB data group]
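The two phases can be illustrated in user space. This is only an analogy, assuming plain POSIX-style calls: BFO is prototyped inside ext4 and scans the inodes directly, whereas the hypothetical `two_phase_read` helper below uses `os.stat` for the metadata pass and ordinary reads for the data pass.

```python
import os


def two_phase_read(paths):
    """Read a batch of files in two passes instead of stat+read per file."""
    # Phase 1: touch only the metadata of every file (no data I/O yet).
    sizes = {p: os.stat(p).st_size for p in paths}

    # Phase 2: fetch the data of every file.
    data = {}
    for p in paths:
        with open(p, "rb") as f:
            data[p] = f.read(sizes[p])
    return data
```

Separating the passes keeps the metadata accesses together and the data accesses together, rather than alternating between the two areas for every file.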
Two-phase read: layout-aware scheduler
  Extracting the addresses from the inodes
  Sorting the addresses of all files
  Issuing read I/O in the order of the list
Order_node fields: Inode (2 bytes), Start-point (8 bytes), Length (4 bytes), Num (4 bytes)
Example Order_node: Inode -> File A, Start-point -> 3000#, Length -> 8192 bytes, Num -> 0
[Diagram: the Order list of Order_nodes for Files A to E next to the disk blocks, which are shown in the order A C E D B]
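The Order_node record and the sorting step can be sketched as follows. The field widths come from the slide; the byte order, the inode numbers, and every start point except File A's 3000 are assumptions made for illustration, and the meaning of Num is not stated on the slide, so the sketch only carries it along.

```python
import struct
from typing import NamedTuple


class OrderNode(NamedTuple):
    inode: int        # 2 bytes on the slide
    start_point: int  # 8 bytes: starting block address
    length: int       # 4 bytes: length in bytes
    num: int          # 4 bytes: meaning not stated on the slide


# Unpadded layout: 2 + 8 + 4 + 4 = 18 bytes. Little-endian is an assumption.
ORDER_NODE_FMT = "<HQII"


def pack_node(node):
    """Serialize one Order_node into its 18-byte form."""
    return struct.pack(ORDER_NODE_FMT, *node)


def build_order_list(nodes):
    """Sort nodes by start address so reads are issued in layout order."""
    return sorted(nodes, key=lambda n: n.start_point)


# Hypothetical batch: inode numbers 1..5 stand for Files A..E.
names = {1: "A", 2: "B", 3: "C", 4: "D", 5: "E"}
nodes = [
    OrderNode(1, 3000, 8192, 0),  # File A, as on the slide
    OrderNode(2, 3400, 8192, 0),
    OrderNode(3, 3100, 8192, 0),
    OrderNode(4, 3300, 8192, 0),
    OrderNode(5, 3200, 8192, 0),
]
order = [names[n.inode] for n in build_order_list(nodes)]
print(order)  # ['A', 'C', 'E', 'D', 'B']
```

With these made-up addresses, the alphabetic request order A B C D E becomes A C E D B, matching the disk-block order in the diagram.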
Two-phase write
Phase 1: creating a global file to store all data at once
  Creating the G inode for the global file
  Creating the Order_list to record the order of the written files
Phase 2: creating the inodes for all files
  Extracting the address from the G inode
  Creating all inodes from the address information and the Order_list
  Current_FileAddr = Previous_FileAddr + FileLength
Light-weight consistency strategy
[Diagram: disk blocks holding the global file ABCDE with its G inode; later builds add the inodes of Files A, B, C, D, E]
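The address rule Current_FileAddr = Previous_FileAddr + FileLength can be worked through with a short sketch. It assumes that FileLength refers to the previous file's length and that addresses are byte offsets; `assign_addresses` and the sample sizes are hypothetical.

```python
def assign_addresses(global_start, order_list):
    """Derive each file's (start address, length) from the global file.

    order_list holds (name, length) pairs in the order the files were
    written into the global file.
    """
    layout = {}
    addr = global_start
    for name, length in order_list:
        layout[name] = (addr, length)
        # Current_FileAddr = Previous_FileAddr + FileLength
        addr += length
    return layout


layout = assign_addresses(0, [("A", 4096), ("B", 8192), ("C", 4096)])
print(layout)  # {'A': (0, 4096), 'B': (4096, 8192), 'C': (12288, 4096)}
```

Because the data is already laid out contiguously, Phase 2 only needs the G inode's address and the Order_list to place every per-file inode.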
Two-phase write: light-weight consistency strategy
  Writing the Order_list into the journal files as an atomic operation
  Recreating all inodes from the Order_list and the G inode
[Diagram: disk blocks holding the global file ABCDE, the G inode, and the inodes of Files A, B, C, D, E]
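The idea of journaling only the Order_list and rebuilding the inodes from it can be mimicked in user space. This is a sketch under stated assumptions, not the ext4 journal path BFO uses: the JSON record, the temporary-file-plus-rename trick, and both function names are illustrative.

```python
import json
import os


def journal_order_list(journal_path, global_start, order_list):
    """Persist the order list so the journal is either fully old or fully new."""
    tmp = journal_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"global_start": global_start, "order_list": order_list}, f)
        f.flush()
        os.fsync(f.fileno())
    # Rename is atomic on POSIX when both paths are on the same file system.
    os.replace(tmp, journal_path)


def recover_layout(journal_path):
    """Rebuild every file's (address, length) from the journaled order list."""
    with open(journal_path) as f:
        record = json.load(f)
    layout = {}
    addr = record["global_start"]
    for name, length in record["order_list"]:
        layout[name] = (addr, length)
        addr += length
    return layout
```

The journal record stays small because it stores one entry per file rather than a copy of each inode, which is what makes the strategy light-weight in this sketch.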
Prototyped BFO on ext4
Intel Xeon E5 2620 @ 2.40GHz and 16GB RAM
Storage devices
  RAID0 with 5 Western Digital 7200RPM 4TB SAS HDDs
  A Western Digital 4TB SAS HDD
  A 480GB SAMSUNG 750 EVO SSD
File sets
  File sets created by Filebench
    4GB of data with different file sizes (i.e., from 4KB to 4MB)
  Linux-kernel source code
[Figure: read execution time (s) versus file size in different file sets (4KB to 4MB), three panels; series Read_R, BFOr_R, Read_S, BFOr_S; labelled bar value 9704; callouts 42.1X, 22.4X, 81.4%]
[Figure: the same read charts with callouts 1.6X, 2X, 1.8X]
[Figure: write execution time (s) versus file size in different file sets (4KB to 4MB); series RAID_RW, RAID_SW, RAID_BFOw, HDD_RW, HDD_SW, HDD_BFOw, SSD_RW, SSD_SW, SSD_BFOw; callouts 71.8X, 111.4X, 2.9X]
[Figure not preserved; callout 46.6%]
We experimentally investigate the root cause of the inefficiency of the
  Seeking back and forth between the metadata area and the data area
  Accessing all files in random order
We present BFO, for batch-file access, with optimized batch-file read
  Two-phase access
  Layout-aware scheduler
  Light-weight consistency strategy
BFO improves the access performance consistently, and removes a