A study of practical deduplication
Dutch T. Meyer, University of British Columbia (Microsoft Research intern)
William Bolosky, Microsoft Research
(intermission)
Data collected per file: MD5(name), metadata, MD5(data).
Scans ran once per week for 4 weeks: ~875 file systems, ~40 TB, ~200M files.
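The per-file record described above could be gathered roughly as follows. This is a minimal sketch, not the scanner used in the study: the function names `scan_record` and `scan_tree`, the choice of size and modification time as the "metadata", and hashing only the base file name are all assumptions made for illustration.

```python
import hashlib
import os


def scan_record(path: str) -> dict:
    """Build one record: hashed name, some metadata, hashed contents.

    Assumption: "metadata" here means size and mtime only; the slides
    do not say which fields were actually recorded.
    """
    st = os.stat(path)
    content_hash = hashlib.md5()
    with open(path, "rb") as f:
        # Read in 1 MiB blocks so large files do not have to fit in memory.
        for block in iter(lambda: f.read(1 << 20), b""):
            content_hash.update(block)
    name = os.path.basename(path).encode("utf-8")
    return {
        "name_md5": hashlib.md5(name).hexdigest(),
        "size": st.st_size,
        "mtime": int(st.st_mtime),
        "data_md5": content_hash.hexdigest(),
    }


def scan_tree(root: str) -> list:
    """Walk a directory tree and return one record per regular file."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            records.append(scan_record(os.path.join(dirpath, filename)))
    return records
```

Hashing the name keeps file names private while still letting identical names be matched across scans; hashing the data is what later makes duplicate detection possible.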
[Figure: histogram by file size (bytes), power-of-two bins from 8 to 128M; y-axis 0% to 14%; series: 2009, 2004, 2000]
(intermission)
Whole-file deduplication
[Figure: two identical bit strings, 01101010….. ….110010101]
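Whole-file deduplication can be sketched as: hash each file's full contents and store the bytes only the first time a given hash appears. The function name `whole_file_dedup` and the in-memory `dict` of file contents are hypothetical, and MD5 is used only because the slides mention it for the scan.

```python
import hashlib


def whole_file_dedup(files: dict) -> tuple:
    """files maps a file name to its contents (bytes).

    Returns (total_bytes, stored_bytes), where stored_bytes counts each
    distinct file content exactly once.
    """
    seen = set()
    total = 0
    stored = 0
    for data in files.values():
        total += len(data)
        digest = hashlib.md5(data).digest()
        if digest not in seen:
            seen.add(digest)
            stored += len(data)
    return total, stored


if __name__ == "__main__":
    example = {"a": b"x" * 100, "b": b"x" * 100, "c": b"y" * 50}
    print(whole_file_dedup(example))  # (250, 150)
```

Files "a" and "b" are byte-for-byte identical, so only one copy is stored. A file that differs by even one byte gets a different hash and is stored in full.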
Fixed-chunk deduplication
[Figure: two identical bit strings, 01101010….. ….110010101]
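Fixed-chunk deduplication applies the same idea at a finer grain: cut every file at fixed offsets and store each distinct chunk once. A minimal sketch, assuming an 8 KB chunk size and in-memory inputs; `fixed_chunks` and `chunk_dedup` are illustrative names.

```python
import hashlib


def fixed_chunks(data: bytes, chunk_size: int = 8192) -> list:
    """Split data at fixed offsets; the last chunk may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def chunk_dedup(files: list, chunk_size: int = 8192) -> tuple:
    """files is a list of bytes objects.

    Returns (total_bytes, stored_bytes), storing each distinct chunk once.
    """
    seen = set()
    total = 0
    stored = 0
    for data in files:
        for chunk in fixed_chunks(data, chunk_size):
            total += len(chunk)
            digest = hashlib.md5(chunk).digest()
            if digest not in seen:
                seen.add(digest)
                stored += len(chunk)
    return total, stored


if __name__ == "__main__":
    file_a = b"A" * 8192 + b"B" * 8192
    file_b = b"A" * 8192 + b"C" * 8192  # shares only its first chunk
    print(chunk_dedup([file_a, file_b]))  # (32768, 24576)
```

Unlike the whole-file approach, the two files above share a chunk even though they differ as wholes. The weakness is alignment: inserting a single byte near the start of a file shifts every later chunk boundary, so previously shared chunks stop matching.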
Rabin fingerprinting
[Figure: bit strings 01101010….. ….110010101 with an extra bit 1 inserted, and a window sliding over the bits]
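Content-defined chunking, the idea behind the Rabin fingerprinting approach, places chunk boundaries wherever a hash of the last few bytes matches a pattern, so boundaries move with the content rather than with file offsets. The sketch below is an illustration only: it substitutes a simple polynomial rolling hash for a true Rabin fingerprint, and the window size, minimum and maximum chunk sizes, and all function names are assumptions.

```python
import hashlib
import random

WINDOW = 48                            # bytes covered by the rolling hash (assumed)
BASE = 257
MOD = (1 << 61) - 1
POW_OUT = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window


def cdc_chunks(data: bytes, avg_size: int = 8192,
               min_size: int = 2048, max_size: int = 65536) -> list:
    """Cut a chunk where the rolling hash's low bits are all ones.

    avg_size must be a power of two; it sets the boundary probability
    to about 1/avg_size per byte once min_size has been reached.
    """
    mask = avg_size - 1
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        if i - start >= WINDOW:
            # Drop the oldest byte so h covers only the last WINDOW bytes.
            h = (h - data[i - WINDOW] * POW_OUT) % MOD
        h = (h * BASE + byte) % MOD
        length = i - start + 1
        if (length >= min_size and (h & mask) == mask) or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fixed_chunks(data: bytes, chunk_size: int = 8192) -> list:
    """Fixed-offset chunking, included only for comparison."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def stored_bytes(files: list, chunker) -> int:
    """Bytes kept after storing each distinct chunk once."""
    seen = set()
    stored = 0
    for data in files:
        for chunk in chunker(data):
            digest = hashlib.md5(chunk).digest()
            if digest not in seen:
                seen.add(digest)
                stored += len(chunk)
    return stored


def demo() -> None:
    rng = random.Random(0)
    original = bytes(rng.getrandbits(8) for _ in range(200_000))
    shifted = b"\x01" + original       # one byte inserted at the front
    print("fixed:", stored_bytes([original, shifted], fixed_chunks))
    print("content-defined:", stored_bytes([original, shifted], cdc_chunks))


if __name__ == "__main__":
    demo()
```

In the demo, the one-byte insertion misaligns every fixed chunk, while the content-defined boundaries are expected to line up again shortly after the inserted byte, so most chunks of the shifted file are already stored. The price is one hash update per byte, which fits the "more CPU, more complexity" cost the comparison attributes to this approach.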
Algorithm            Parameters           Cost                               Deduplication effectiveness
Whole-file           -                    Low                                Lowest
Fixed chunk          Chunk size           Seeks, CPU, complexity             Middle
Rabin fingerprints   Average chunk size   Seeks, more CPU, more complexity   Highest
[Figure: space deduplicated (0% to 100%) versus chunk size (64K, 32K, 16K, 8K); series: Whole File, Fixed-Chunk, Rabin]
[Figure: deduplicated space (0% to 90%) for Whole File, Whole File + Sparse, and 8K Rabin]
[Figure: space deduplicated (0% to 100%) versus deduplication domain size in file systems (1, 2, 4, ... 512, Whole Set); series: Whole File, 64KB Fixed, 8KB Fixed, 64KB Rabin, 8KB Rabin]
[Figure: percentage of total bytes by containing file size (bytes), power-of-2 bins from 1K to 256G; y-axis 0% to 12%; series: 2000, 2004, 2009]
[Figure: stacked bars of file extensions for 2000, 2004, and 2009; axis 0% to 60%; extensions labeled: dll, pdb, exe, pst, pch, mp3, lib, chm, cab, vhd, wma, iso, and ø (no extension)]
Extension   % of duplicate space   Mean file size (bytes)   % of total space
dll         20%                    521K                     10%
lib         11%                    1080K                    7%
pdb         11%                    2M                       7%
<none>      7%                     277K                     13%
exe         6%                     572K                     4%
cab         4%                     4M                       2%
msp         3%                     15M                      2%
msi         3%                     5M                       1%
iso         2%                     436M                     2%
<a guid>    1%                     604K                     <1%
[Figure: percentage of difference vs. whole file + sparse (0% to 70%) for 8K Fixed and 8K Rabin, by file extension; extensions labeled: vhd, pch, lib, dll, pdb, wma, iso, pst, ø, avhd, mp3, wim]