Performance Improvement of Btrfs
Miao Xie <miaox@cn.fujitsu.com> Li Zefan <lizf@cn.fujitsu.com>
Agenda
Comparison between Btrfs and Ext3/4
Issue analysis (issues we have investigated)
  Small file sequential read
  Large file random write (Direct I/O and fsync)
  File creation/deletion
Hardware
Software
72 file I/O cases, mixing the following conditions:
File creation/deletion
* Block size (bs): read or write BYTES bytes at a time.
[Figure: IO speed (Unit: Kb/s) of EXT3, EXT4 and BTRFS. Groups: Direct I/O and General Read, each with 1 thread and 8 threads, bs = 1K and bs = 4K.]
Write (fsync): write data to the file and call fsync after every 100 requests
[Figure: IO speed (Unit: Kb/s) of EXT3, EXT4 and BTRFS. Groups: Direct I/O and Write (fsync), each with 1 thread and 8 threads, bs = 1K and bs = 4K.]
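The Write (fsync) pattern described above (an fsync after every 100 write requests) could be reproduced roughly as follows. This is only a minimal sketch: the benchmark tool used for the slides is not named, so the function name, the random block-aligned offsets and the fill byte are all assumptions made for illustration.

```python
import os
import random


def random_write_with_fsync(path, file_size, block_size, requests,
                            fsync_every=100, seed=0):
    """Write `requests` blocks at random block-aligned offsets and call
    fsync after every `fsync_every` writes. Returns the number of fsyncs.

    Hypothetical helper: the access pattern is an assumption, only the
    "fsync every 100 requests" rule comes from the slides.
    """
    rng = random.Random(seed)
    block = b"\xab" * block_size
    blocks_in_file = file_size // block_size
    fsyncs = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.ftruncate(fd, file_size)  # size the file before writing into it
        for i in range(1, requests + 1):
            offset = rng.randrange(blocks_in_file) * block_size
            os.pwrite(fd, block, offset)
            if i % fsync_every == 0:
                os.fsync(fd)  # force data and metadata to disk
                fsyncs += 1
    finally:
        os.close(fd)
    return fsyncs
```

Timing a call such as `random_write_with_fsync("/mnt/test/file", 1 << 30, 4096, 10000)` on each filesystem would give a rough throughput figure; the path is a placeholder.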
[Figure: IO speed (Unit: Mb/s) of EXT3, EXT4 and BTRFS. Groups: Direct I/O and General Read, each with 1 thread and 8 threads, bs = 1K and bs = 4K.]
[Figure: IO speed (Unit: Kb/s) of EXT3, EXT4 and BTRFS. Groups: Direct I/O and Write (fsync), each with 1 thread and 8 threads, bs = 1K and bs = 4K.]
[Figure: IO speed (Unit: Mb/s) of EXT3, EXT4 and BTRFS. Groups: Direct I/O and General Read, each with 1 thread and 8 threads, bs = 4K and bs = 32K.]
[Figure: IO speed (Unit: Mb/s) of EXT3, EXT4 and BTRFS. Groups: Direct I/O and General Read, each with 1 thread and 8 threads, bs = 4K and bs = 32K.]
[Figure (1/2): IO speed (Unit: Mb/s) of EXT3, EXT4 and BTRFS. Groups: Direct I/O and Write (fsync), each with 1 thread and 8 threads, bs = 4K and bs = 32K.]
[Figure (2/2): IO speed (Unit: Mb/s) of EXT3, EXT4 and BTRFS. Group: General Write, with 1 thread and 8 threads, bs = 4K and bs = 32K.]
[Figure (1/2): IO speed (Unit: Mb/s) of EXT3, EXT4 and BTRFS. Group: Direct I/O, with 1 thread and 8 threads, bs = 4K and bs = 32K.]
[Figure (2/2): IO speed (Unit: Mb/s) of EXT3, EXT4 and BTRFS. Groups: Write (fsync) and General Write, each with 1 thread and 8 threads, bs = 4K and bs = 32K.]
Create/delete lots of empty files to measure the speed of file creation and deletion.
[Figure: file creation and deletion speed (Unit: files/sec) of Ext3, Ext4 and Btrfs.]
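The creation/deletion test described above could be approximated with a sketch like the one below. The function name, file naming scheme and timing method are assumptions; the slides only state that lots of empty files are created and deleted and that the result is reported in files/sec.

```python
import os
import time


def measure_create_delete(directory, count):
    """Create `count` empty files in `directory`, then delete them.

    Returns (creations per second, deletions per second).
    Hypothetical helper, not the tool used for the slides.
    """
    names = [os.path.join(directory, f"f{i:08d}") for i in range(count)]

    start = time.perf_counter()
    for name in names:
        # O_EXCL makes sure every call really creates a new file
        os.close(os.open(name, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644))
    create_secs = max(time.perf_counter() - start, 1e-9)

    start = time.perf_counter()
    for name in names:
        os.unlink(name)
    delete_secs = max(time.perf_counter() - start, 1e-9)

    return count / create_secs, count / delete_secs
```

Running it against a directory on each filesystem under test gives comparable files/sec numbers, although page cache state and mount options will influence the result.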
Small file random read (not inline file)
Small file sequential read
Small file random/sequential write
Large file random write (Direct I/O and fsync)
Large file random write (general write, bs = 4Kb)
File creation and deletion
Issue analysis: Small file sequential read
Reasons
Metadata fragmentation -> file extent reading latency -> delay of file data reading
… fragment, and the readahead function can’t work well. So …
[Diagram: fs/file tree and its layout on disk.]
Run the small file sequential read test after defragmentation
[Figure: IO speed (Unit: Mb/s), No Defrag vs. After Defrag. Groups: Direct I/O and General Read, each with 1 thread and 8 threads, bs = 1K and bs = 4K.]
Pre-allocation for the b+ tree: introduce free space clusters for each node in the tree, so that contiguous free space can be allocated from the parent node’s cluster and sibling leaves are stored close together.
(The patch for this solution is still under test and has not been posted.)
[Diagram: fs/file tree and its layout on disk, divided into clusters.]
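The free space cluster idea can be illustrated with a toy allocator. This is a sketch under stated assumptions, not the unposted patch: the class name, the fixed cluster size and the linear block numbering are all hypothetical. Each parent node reserves a contiguous run of blocks, and its leaves are carved out of that run, so siblings stay adjacent even when allocations for different parents interleave.

```python
class ClusterAllocator:
    """Toy allocator: one contiguous free space cluster per parent node."""

    def __init__(self, cluster_size):
        self.cluster_size = cluster_size
        self.next_free = 0   # next unreserved block on the toy "disk"
        self.clusters = {}   # parent id -> [next block, end block)

    def alloc_leaf(self, parent):
        """Return a block for a new leaf under `parent`."""
        cur = self.clusters.get(parent)
        if cur is None or cur[0] == cur[1]:
            # No cluster yet, or the cluster is used up: reserve a new one.
            start = self.next_free
            self.next_free += self.cluster_size
            cur = [start, start + self.cluster_size]
            self.clusters[parent] = cur
        block = cur[0]
        cur[0] += 1
        return block


alloc = ClusterAllocator(cluster_size=4)
placement = {"A": [], "B": []}
# Leaves of two parents are allocated alternately.
for parent in ["A", "B", "A", "B", "A", "B"]:
    placement[parent].append(alloc.alloc_leaf(parent))
print(placement)
```

Without per-parent clusters, the same interleaved sequence would scatter each parent's leaves across alternating blocks, which is the kind of layout that hurts readahead.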
Improvement result
[Figure: IO speed (Unit: Mb/s) of EXT3, EXT4, BTRFS and BTRFS + Patch. Groups: Direct I/O and General Read, each with 1 thread and 8 threads, bs = 1K and bs = 4K.]
Further Improvement
Introduce auto defragmentation for metadata
Apply the new metadata readahead API written by Arne
Issue analysis: Large file random write (Direct I/O and fsync)
metadata.
Purpose: Reduce the metadata write requests when fsyncs and O_SYNC writes happen.
Implementation: Copy the changed items into a special tree (the log tree, one per fs/file tree), and then write that tree to disk. After a crash, Btrfs recovers the fs/file tree from that tree.
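The log tree mechanism described above can be modelled in a few lines. This is a simplified sketch, not Btrfs code: the class and method names are hypothetical, trees are plain dictionaries, and deletions are not modelled. It shows that an fsync writes only the changed items, and that recovery replays the log on top of the last committed tree.

```python
class LoggedTree:
    """Toy model of an fs/file tree with a per-tree log."""

    def __init__(self):
        self.tree = {}      # in-memory fs/file tree: key -> item
        self.on_disk = {}   # last fully committed copy of the tree
        self.log = {}       # on-disk log tree
        self.dirty = set()  # keys changed since the last fsync

    def update(self, key, item):
        self.tree[key] = item
        self.dirty.add(key)

    def fsync(self):
        """Copy only the changed items into the log; return how many."""
        for key in self.dirty:
            self.log[key] = self.tree[key]
        written = len(self.dirty)
        self.dirty.clear()
        return written

    def commit(self):
        """Full transaction commit: write the whole tree, drop the log."""
        self.on_disk = dict(self.tree)
        self.log.clear()
        self.dirty.clear()

    def recover(self):
        """After a crash: committed tree plus replayed log items."""
        recovered = dict(self.on_disk)
        recovered.update(self.log)
        return recovered
```

In this model an update that was never followed by fsync or commit is lost on recovery, which matches the usual contract of fsync.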
The tree log records lots of unchanged metadata (e.g. csums, file extents)
[Diagram: the application changes one extent of a file with extents 1 to N. Only the checksum of the changed extent is updated in the csum tree, but all the csum data of the file (Extent1 Csum to ExtentN Csum) is copied into the log tree and written to disk.]
Reason verification
Run the large file random write test with the tree log function disabled (mount with -o notreelog)
[Figure: IO speed (Unit: Mb/s), BTRFS vs. BTRFS (no treelog). Groups: Direct I/O and Write (fsync), each with 1 thread and 8 threads, bs = 4K and bs = 32K.]
Don’t log unchanged metadata: introduce a sub-transaction id to filter out the unchanged metadata (v2.6.41)
[Diagram: the application changes one extent of a file with extents 1 to N. The checksum of the changed extent is updated in the csum tree, and only the changed csum data of the file is copied into the log tree and written to disk.]
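One way to picture the sub-transaction id filter is the sketch below. It is an assumption-laden illustration, not the actual patch: the class name, the per-item stamp and the rule "log only items stamped after the last logged sub-transaction" are hypothetical choices that merely fit the description on the slide.

```python
class CsumLogger:
    """Toy model: each csum item remembers the sub-transaction that
    last changed it, and fsync logs only items changed since the
    previous fsync."""

    def __init__(self, extents):
        self.sub_transid = 1   # id of the current sub-transaction
        self.last_logged = 0   # highest sub-transaction already logged
        # extent index -> (checksum, sub-transaction id of last change)
        self.csums = {i: (0, 0) for i in range(extents)}

    def write_extent(self, index, checksum):
        self.csums[index] = (checksum, self.sub_transid)

    def fsync(self, filter_unchanged=True):
        """Return the extent indexes whose csums go into the log tree."""
        if filter_unchanged:
            logged = [i for i, (_, stamp) in self.csums.items()
                      if stamp > self.last_logged]
        else:
            logged = list(self.csums)  # old behaviour: log every csum
        self.last_logged = self.sub_transid
        self.sub_transid += 1          # each fsync starts a new sub-transaction
        return logged
```

With a 1000-extent file and a single changed extent, the filtered fsync logs one csum item, while the unfiltered one logs all 1000.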
Improvement result
[Figure: IO speed (Unit: Mb/s) of EXT3, EXT4, BTRFS and BTRFS + Patch. Groups: Direct I/O and Write (fsync), each with 1 thread and 8 threads, bs = 4K and bs = 32K.]
Issue analysis: File creation/deletion
Reasons
Btrfs does more metadata insertion and deletion.
When updating an inode, Btrfs must search the b+ tree to find the place where the inode is stored. (Time complexity: O(log(n)), whereas Ext3/4 is O(1).)
Searching nodes/leaves in the rb-tree takes a lot of time.

Metadata inserted or deleted per operation:
File creation
  Btrfs: inode, name back reference, ACL, directory item, directory name index
  Ext4 (not sure): inode, ACL, directory entry
File deletion
  Btrfs: inode, inode back reference, ACL, directory item, directory name index, logged directory item, logged directory name index
  Ext4 (not sure): inode, ACL, directory entry
Solution
Batch operation: insert/delete a batch of directory name indexes at once (v2.6.40)
Delayed operation: delay updating the inode information in the b+ tree (v2.6.40)
Use a radix tree instead of the rb-tree (v2.6.37)
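The effect of the batch and delayed operations can be sketched by counting tree searches. Everything here is hypothetical (class names, key layout, and the simplification that one batch costs one search because consecutive directory name indexes land next to each other); it is meant only to show why deferring and grouping updates reduces b+ tree work.

```python
class CountingTree:
    """Dictionary standing in for a b+ tree; counts searches only."""

    def __init__(self):
        self.items = {}
        self.searches = 0

    def insert(self, key, item):
        self.searches += 1  # one tree search per single insert/update
        self.items[key] = item

    def insert_batch(self, items):
        # Assumption: the keys are adjacent, so one search finds the slot.
        self.searches += 1
        self.items.update(items)


class DelayedDir:
    """Pending changes of one directory, flushed to the tree later."""

    def __init__(self, tree, dir_ino):
        self.tree = tree
        self.dir_ino = dir_ino
        self.pending_inode = None
        self.pending_index = {}

    def add_entry(self, index, name, inode_item):
        self.pending_index[("dir_index", self.dir_ino, index)] = name
        self.pending_inode = inode_item  # later updates replace earlier ones

    def flush(self):
        if self.pending_index:
            self.tree.insert_batch(self.pending_index)
            self.pending_index = {}
        if self.pending_inode is not None:
            self.tree.insert(("inode", self.dir_ino), self.pending_inode)
            self.pending_inode = None


# Immediate updates: every file creation touches the tree twice.
immediate = CountingTree()
for i in range(100):
    immediate.insert(("dir_index", 256, i), f"file{i}")
    immediate.insert(("inode", 256), {"size": i + 1})

# Delayed and batched updates: the tree is touched only at flush time.
delayed = CountingTree()
pending = DelayedDir(delayed, 256)
for i in range(100):
    pending.add_entry(i, f"file{i}", {"size": i + 1})
pending.flush()
print(immediate.searches, delayed.searches)
```

Both trees end up with identical contents; only the number of searches differs.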
Improvement result
[Figure: file creation and deletion speed (Unit: files/sec) of Ext3, Ext4, Btrfs and Btrfs + Patch, measured by creating/deleting lots of empty files.]