A Study of Practical Deduplication

Dutch T. Meyer, University of British Columbia / Microsoft Research Intern
William Bolosky, Microsoft Research


SLIDE 1

A study of practical deduplication

Dutch T. Meyer University of British Columbia Microsoft Research Intern William Bolosky Microsoft Research


SLIDE 3

Why study deduplication?

$0.046 per GB; 9 ms per seek.

SLIDE 4

When do we exploit duplicates? It Depends.

  • How much can you get back from deduping?
  • How does fragmenting files affect performance?
  • How often will you access the data?
SLIDE 5

Outline

  • Intro
  • Methodology
  • “There’s more here than dedup” teaser

(intermission)

  • Deduplication Background
  • Deduplication Analysis
  • Conclusion
SLIDE 6

Methodology

Per-file record: MD5(name), metadata, MD5(data). Collected once per week for 4 weeks: ~875 file systems, ~40 TB, ~200M files.
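This collection step can be sketched in Python, assuming a simple walk that hashes names and contents with MD5 as the slide describes; the function name, record fields, and block size are illustrative guesses, not the study's actual scanner.

```python
import hashlib
import os

def scan_file_system(root):
    """Emit one record per file: MD5 of the file name, basic metadata,
    and MD5 of the file contents (mirroring the slide's record format).
    Illustrative sketch only; the field names are assumptions."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
                data_md5 = hashlib.md5()
                with open(path, "rb") as f:
                    # hash contents incrementally, 1 MB at a time
                    for block in iter(lambda: f.read(1 << 20), b""):
                        data_md5.update(block)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            records.append({
                "name_md5": hashlib.md5(name.encode()).hexdigest(),
                "size": st.st_size,
                "mtime": st.st_mtime,
                "data_md5": data_md5.hexdigest(),
            })
    return records
```

Hashing names rather than storing them lets duplicate files be detected across machines without collecting users' actual file names.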

SLIDE 7

There’s more here than dedup!

  • We update and extend file system metadata findings from 2000 and 2004
  • File system complexity is growing
  • Read the paper to answer questions like: Are my files bigger now than they used to be?

SLIDE 8

Teaser: Histogram of file size

[Histogram: percentage of files (0-14%) by file size in power-of-two bins from 8 bytes to 128 MB, for 2000, 2004, and 2009. Annotation: 4K since 1981!]

SLIDE 9

There’s more here than dedup!

How fragmented are my files?

SLIDE 10

Teaser: Layout and Organization

  • High linearity: only 4% of files fragmented in practice
    – Most Windows machines defrag weekly
  • One quarter of fragmented files have at least 170 fragments

SLIDE 11

Intermission

  • Intro
  • Methodology
  • “There’s more here than dedup” teaser

(intermission)

  • Deduplication Background
  • Deduplication Analysis
  • Conclusion
SLIDE 12

Dedup Background

Whole-file deduplication

[Diagram: files foo and bar contain identical contents (01101010… …110010101); identical whole files are detected and stored once.]
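The scheme on this slide can be sketched in a few lines of Python: hash each file's entire contents and keep one physical copy per distinct hash. `whole_file_dedup` and its in-memory store are illustrative assumptions, not the system studied.

```python
import hashlib

def whole_file_dedup(files):
    """Whole-file deduplication: store one physical copy per distinct
    content hash. `files` maps file name -> bytes. Returns the store
    plus logical and physical byte counts. Illustrative sketch."""
    store = {}              # content hash -> the single stored copy
    total = stored = 0
    for _name, data in files.items():
        digest = hashlib.sha1(data).hexdigest()
        total += len(data)
        if digest not in store:
            store[digest] = data
            stored += len(data)
    return store, total, stored

# foo and bar are byte-for-byte identical, so one copy is stored:
files = {"foo": b"0110101011" * 100, "bar": b"0110101011" * 100}
store, total, stored = whole_file_dedup(files)
# 2000 logical bytes, 1000 physical bytes, 1 stored copy
```

Whole-file hashing is the cheapest variant: one hash per file, no chunk index, but it misses files that are merely similar.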

SLIDE 13

Dedup Background

Fixed-chunk deduplication

[Diagram: foo and bar are split at fixed offsets; chunks with identical content (01101010…) are stored once even though the files' tails differ (…110010101 vs. …1100101011).]
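Fixed-chunk dedup can be sketched similarly: split each file at fixed offsets, store each distinct chunk once, and keep a per-file recipe of chunk hashes. The names and the tiny 8-byte chunk size are illustrative (real systems use 8K-64K chunks, as on the later slides).

```python
import hashlib

def fixed_chunk_dedup(files, chunk_size=8):
    """Fixed-chunk deduplication: each file becomes an ordered list of
    chunk hashes (its recipe); each distinct chunk is stored once."""
    chunk_store = {}   # chunk hash -> chunk bytes
    recipes = {}       # file name -> ordered list of chunk hashes
    for name, data in files.items():
        hashes = []
        for off in range(0, len(data), chunk_size):
            chunk = data[off:off + chunk_size]
            digest = hashlib.sha1(chunk).hexdigest()
            chunk_store.setdefault(digest, chunk)
            hashes.append(digest)
        recipes[name] = hashes
    return recipes, chunk_store

# foo and bar share a prefix but end differently; the shared leading
# chunks deduplicate even though the whole files are not identical.
files = {"foo": b"01101010" * 4 + b"X" * 8,
         "bar": b"01101010" * 4 + b"Y" * 8}
recipes, chunks = fixed_chunk_dedup(files, chunk_size=8)
# only 3 distinct chunks are stored for 80 logical bytes
```

The recipe lets each file be reconstructed by concatenating its chunks in order, which is where the seek cost on the comparison slide comes from.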

SLIDE 14

Dedup Background

Rabin fingerprinting

[Diagram: chunk boundaries for foo and bar are chosen from a rolling Rabin fingerprint of the content itself, so matching regions produce matching chunks even when data has shifted.]
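The idea can be sketched with a rolling hash: slide a window over the data and cut a chunk wherever the low bits of the window's fingerprint match a fixed pattern, so boundaries depend only on local content. This sketch substitutes a simple polynomial rolling hash for a true Rabin fingerprint (which works over GF(2)); `window`, `mask`, and the constants are illustrative.

```python
def cdc_chunks(data, window=16, mask=0x3F):
    """Content-defined chunking: cut wherever the rolling hash of the
    last `window` bytes has its low 6 bits all set (expected chunk
    size around 64 bytes). Not a true Rabin fingerprint."""
    BASE, MOD = 257, (1 << 31) - 1
    pow_w = pow(BASE, window, MOD)       # BASE^window, for removals
    chunks, start, h = [], 0, 0
    for i in range(len(data)):
        h = (h * BASE + data[i]) % MOD   # push data[i] into the hash
        if i >= window:
            h = (h - data[i - window] * pow_w) % MOD  # drop old byte
        # cut only once a full window lies inside the current chunk
        if i - start + 1 >= window and (h & mask) == mask:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])      # trailing partial chunk
    return chunks

# Because boundaries come from content, not offsets, inserting bytes
# at the front tends to shift only the earliest chunks; later
# boundaries realign, which fixed-offset chunking cannot do.
original = bytes(range(256)) * 8
shifted = b"inserted prefix" + original
```

This boundary realignment is why Rabin-style chunking sits at the "highest effectiveness" end of the comparison on the next slide.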

SLIDE 15

The Deduplication Space

Algorithm           Parameters          Cost                              Deduplication effectiveness
Whole-file          (none)              Low                               Lowest
Fixed chunk         Chunk size          Seeks, CPU, complexity            Middle
Rabin fingerprints  Average chunk size  Seeks, more CPU, more complexity  Highest

SLIDE 16

What is the relative deduplication rate of the algorithms?

SLIDE 17

Dedup by method and chunk size

[Chart: space deduplicated (0-100%) by chunk size (64K, 32K, 16K, 8K) for whole file, fixed-chunk, and Rabin.]

SLIDE 18

What if I was doing full weekly backups?

SLIDE 19

Backup dedup over 4 weeks

[Chart: deduplicated space (0-90%) for full weekly backups over 4 weeks: whole file, whole file + sparse, and 8K Rabin.]

SLIDE 20

How does the number of filesystems influence deduplication?

SLIDE 21

Dedup by filesystem count

[Chart: space deduplicated (0-100%) vs. deduplication domain size (1-512 file systems, plus the whole set) for whole file, 8K and 64K fixed, and 8K and 64K Rabin.]
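The trend in this chart (bigger deduplication domains find more cross-machine duplicates) can be reproduced on toy data: pool chunk lists from several hypothetical "file systems" that share a common OS image and measure the fraction of bytes saved. `dedup_ratio` and the sample data below are illustrative, not the study's workload.

```python
import hashlib

def dedup_ratio(filesystems):
    """Fraction of logical bytes saved when the given file systems
    share one deduplication domain. Each file system is a list of
    chunk byte-strings; duplicates count across the whole pool."""
    seen = set()
    total = stored = 0
    for fs in filesystems:
        for chunk in fs:
            digest = hashlib.sha1(chunk).hexdigest()
            total += len(chunk)
            if digest not in seen:
                seen.add(digest)
                stored += len(chunk)
    return 1 - stored / total

# Every toy file system carries the same "OS image" plus unique data,
# so savings rise as more file systems join the domain.
os_image = [b"kernel", b"libc", b"shell"]
fss = [os_image + [f"user-data-{i}".encode()] for i in range(8)]
ratios = [dedup_ratio(fss[:n]) for n in (1, 2, 4, 8)]
# ratios increase monotonically with domain size
```

The savings flatten once the shared data is fully deduplicated, matching the diminishing returns visible in the chart.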

SLIDE 22

So what is filling up all this space?

SLIDE 23

Bytes by containing file size

[Chart: percentage of total bytes (0-12%) by containing file size in power-of-two bins from 1K to 256G, for 2000, 2004, and 2009.]

SLIDE 24

What types of files take up disk space?

SLIDE 25

Disk consumption by file type

[Chart: share of disk space (0-60%) consumed by the top file types, ranked by year:
2000: dll, pdb, exe, pst, pch, mp3, lib, chm, cab, ø
2004: dll, vhd, pdb, exe, wma, lib, cab, pst, mp3, ø
2009: ø, dll, lib, vhd, pdb, exe, pch, cab, wma, iso
(ø = no extension)]


SLIDE 27

Which of these types deduplicate well?

SLIDE 28

Whole-file duplicates

Extension  % of Duplicate Space  Mean File Size (bytes)  % of Total Space
dll        20%                   521K                    10%
lib        11%                   1080K                   7%
pdb        11%                   2M                      7%
<none>     7%                    277K                    13%
exe        6%                    572K                    4%
cab        4%                    4M                      2%
msp        3%                    15M                     2%
msi        3%                    5M                      1%
iso        2%                    436M                    2%
<a guid>   1%                    604K                    <1%

SLIDE 29

What files make up the 20% difference between whole-file dedup plus sparse file support and more aggressive deduplication?

SLIDE 30

Where does fine granularity help?

[Chart: percentage of the difference vs. whole file + sparse (0-70%) by file type, for 8K fixed and 8K Rabin; leading types include vhd, avhd, pch, pdb, lib, dll, obj, wma, iso, pst, mp3, wim, and ø.]

SLIDE 31

Last plea to read the whole paper

  • ~4x more results in the paper!
  • Real-world file system analysis is hard
    – Eight machine-months of query processing
    – Requires careful simplifying assumptions
    – Requires heavy optimization

SLIDE 32

Conclusion

  • The benefit of fine-grained dedup is < 20%
    – Potentially just a fraction of that
  • Fragmentation is a manageable problem
  • Read the paper for more metadata results

We’re releasing this dataset