A Study of Practical Deduplication

Dutch T. Meyer, University of British Columbia / Microsoft Research Intern
William Bolosky, Microsoft Research


SLIDE 1

A study of practical deduplication

Dutch T. Meyer University of British Columbia Microsoft Research Intern William Bolosky Microsoft Research


SLIDE 3

Why study deduplication?

$0.046 per GB; 9 ms per seek.

SLIDE 4

When do we exploit duplicates? It Depends.

  • How much can you get back from deduping?
  • How does fragmenting files affect performance?
  • How often will you access the data?
SLIDE 5

Outline

  • Intro
  • Methodology
  • “There’s more here than dedup” teaser

(intermission)

  • Deduplication Background
  • Deduplication Analysis
  • Conclusion
SLIDE 6

Methodology

Per-file record: MD5(name), metadata, MD5(data). Collected once per week for 4 weeks: ~875 file systems, ~40 TB, ~200M files.
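This collection step can be sketched in Python, assuming a simple walk that hashes names and contents with MD5 as the slide describes; the function name, record fields, and block size are illustrative guesses, not the study's actual scanner.

```python
import hashlib
import os

def scan_file_system(root):
    """Emit one record per file: MD5 of the file name, basic metadata,
    and MD5 of the file contents (mirroring the slide's record format).
    Illustrative sketch only; the field names are assumptions."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
                data_md5 = hashlib.md5()
                with open(path, "rb") as f:
                    # hash contents incrementally, 1 MB at a time
                    for block in iter(lambda: f.read(1 << 20), b""):
                        data_md5.update(block)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            records.append({
                "name_md5": hashlib.md5(name.encode()).hexdigest(),
                "size": st.st_size,
                "mtime": st.st_mtime,
                "data_md5": data_md5.hexdigest(),
            })
    return records
```

Hashing names rather than storing them lets duplicate files be detected across machines without collecting users' actual file names.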

SLIDE 7

There’s more here than dedup!

  • We update and extend file system metadata findings from 2000 and 2004
  • File system complexity is growing
  • Read the paper to answer questions like: Are my files bigger now than they used to be?

SLIDE 8

Teaser: Histogram of file size

[Histogram: percentage of files (0-14%) by file size in power-of-two bins from 8 bytes to 128 MB, for 2000, 2004, and 2009. Annotation: 4K since 1981!]

SLIDE 9

There’s more here than dedup!

How fragmented are my files?

SLIDE 10

Teaser: Layout and Organization

  • High linearity: only 4% of files fragmented in practice
    – Most Windows machines defrag weekly
  • One quarter of fragmented files have at least 170 fragments

SLIDE 11

Intermission

  • Intro
  • Methodology
  • “There’s more here than dedup” teaser

(intermission)

  • Deduplication Background
  • Deduplication Analysis
  • Conclusion
SLIDE 12

Dedup Background

Whole-file deduplication

[Diagram: files foo and bar contain identical contents (01101010… …110010101); identical whole files are detected and stored once.]
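The scheme on this slide can be sketched in a few lines of Python: hash each file's entire contents and keep one physical copy per distinct hash. `whole_file_dedup` and its in-memory store are illustrative assumptions, not the system studied.

```python
import hashlib

def whole_file_dedup(files):
    """Whole-file deduplication: store one physical copy per distinct
    content hash. `files` maps file name -> bytes. Returns the store
    plus logical and physical byte counts. Illustrative sketch."""
    store = {}              # content hash -> the single stored copy
    total = stored = 0
    for _name, data in files.items():
        digest = hashlib.sha1(data).hexdigest()
        total += len(data)
        if digest not in store:
            store[digest] = data
            stored += len(data)
    return store, total, stored

# foo and bar are byte-for-byte identical, so one copy is stored:
files = {"foo": b"0110101011" * 100, "bar": b"0110101011" * 100}
store, total, stored = whole_file_dedup(files)
# 2000 logical bytes, 1000 physical bytes, 1 stored copy
```

Whole-file hashing is the cheapest variant: one hash per file, no chunk index, but it misses files that are merely similar.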

SLIDE 13

Dedup Background

Fixed-chunk deduplication

[Diagram: foo and bar are split at fixed offsets; chunks with identical content (01101010…) are stored once even though the files' tails differ (…110010101 vs. …1100101011).]
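Fixed-chunk dedup can be sketched similarly: split each file at fixed offsets, store each distinct chunk once, and keep a per-file recipe of chunk hashes. The names and the tiny 8-byte chunk size are illustrative (real systems use 8K-64K chunks, as on the later slides).

```python
import hashlib

def fixed_chunk_dedup(files, chunk_size=8):
    """Fixed-chunk deduplication: each file becomes an ordered list of
    chunk hashes (its recipe); each distinct chunk is stored once."""
    chunk_store = {}   # chunk hash -> chunk bytes
    recipes = {}       # file name -> ordered list of chunk hashes
    for name, data in files.items():
        hashes = []
        for off in range(0, len(data), chunk_size):
            chunk = data[off:off + chunk_size]
            digest = hashlib.sha1(chunk).hexdigest()
            chunk_store.setdefault(digest, chunk)
            hashes.append(digest)
        recipes[name] = hashes
    return recipes, chunk_store

# foo and bar share a prefix but end differently; the shared leading
# chunks deduplicate even though the whole files are not identical.
files = {"foo": b"01101010" * 4 + b"X" * 8,
         "bar": b"01101010" * 4 + b"Y" * 8}
recipes, chunks = fixed_chunk_dedup(files, chunk_size=8)
# only 3 distinct chunks are stored for 80 logical bytes
```

The recipe lets each file be reconstructed by concatenating its chunks in order, which is where the seek cost on the comparison slide comes from.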

SLIDE 14

Dedup Background

Rabin fingerprinting

[Diagram: chunk boundaries for foo and bar are chosen from a rolling Rabin fingerprint of the content itself, so matching regions produce matching chunks even when data has shifted.]
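The idea can be sketched with a rolling hash: slide a window over the data and cut a chunk wherever the low bits of the window's fingerprint match a fixed pattern, so boundaries depend only on local content. This sketch substitutes a simple polynomial rolling hash for a true Rabin fingerprint (which works over GF(2)); `window`, `mask`, and the constants are illustrative.

```python
def cdc_chunks(data, window=16, mask=0x3F):
    """Content-defined chunking: cut wherever the rolling hash of the
    last `window` bytes has its low 6 bits all set (expected chunk
    size around 64 bytes). Not a true Rabin fingerprint."""
    BASE, MOD = 257, (1 << 31) - 1
    pow_w = pow(BASE, window, MOD)       # BASE^window, for removals
    chunks, start, h = [], 0, 0
    for i in range(len(data)):
        h = (h * BASE + data[i]) % MOD   # push data[i] into the hash
        if i >= window:
            h = (h - data[i - window] * pow_w) % MOD  # drop old byte
        # cut only once a full window lies inside the current chunk
        if i - start + 1 >= window and (h & mask) == mask:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])      # trailing partial chunk
    return chunks

# Because boundaries come from content, not offsets, inserting bytes
# at the front tends to shift only the earliest chunks; later
# boundaries realign, which fixed-offset chunking cannot do.
original = bytes(range(256)) * 8
shifted = b"inserted prefix" + original
```

This boundary realignment is why Rabin-style chunking sits at the "highest effectiveness" end of the comparison on the next slide.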

SLIDE 15

The Deduplication Space

Algorithm           Parameters          Cost                              Deduplication effectiveness
Whole-file          (none)              Low                               Lowest
Fixed chunk         Chunk size          Seeks, CPU, complexity            Middle
Rabin fingerprints  Average chunk size  Seeks, more CPU, more complexity  Highest

SLIDE 16

What is the relative deduplication rate of the algorithms?

SLIDE 17

Dedup by method and chunk size

[Chart: space deduplicated (0-100%) by chunk size (64K, 32K, 16K, 8K) for whole file, fixed-chunk, and Rabin.]

SLIDE 18

What if I was doing full weekly backups?

SLIDE 19

Backup dedup over 4 weeks

[Chart: deduplicated space (0-90%) for full weekly backups over 4 weeks: whole file, whole file + sparse, and 8K Rabin.]

SLIDE 20

How does the number of filesystems influence deduplication?

SLIDE 21

Dedup by filesystem count

[Chart: space deduplicated (0-100%) vs. deduplication domain size (1-512 file systems, plus the whole set) for whole file, 8K and 64K fixed, and 8K and 64K Rabin.]
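The trend in this chart (bigger deduplication domains find more cross-machine duplicates) can be reproduced on toy data: pool chunk lists from several hypothetical "file systems" that share a common OS image and measure the fraction of bytes saved. `dedup_ratio` and the sample data below are illustrative, not the study's workload.

```python
import hashlib

def dedup_ratio(filesystems):
    """Fraction of logical bytes saved when the given file systems
    share one deduplication domain. Each file system is a list of
    chunk byte-strings; duplicates count across the whole pool."""
    seen = set()
    total = stored = 0
    for fs in filesystems:
        for chunk in fs:
            digest = hashlib.sha1(chunk).hexdigest()
            total += len(chunk)
            if digest not in seen:
                seen.add(digest)
                stored += len(chunk)
    return 1 - stored / total

# Every toy file system carries the same "OS image" plus unique data,
# so savings rise as more file systems join the domain.
os_image = [b"kernel", b"libc", b"shell"]
fss = [os_image + [f"user-data-{i}".encode()] for i in range(8)]
ratios = [dedup_ratio(fss[:n]) for n in (1, 2, 4, 8)]
# ratios increase monotonically with domain size
```

The savings flatten once the shared data is fully deduplicated, matching the diminishing returns visible in the chart.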

SLIDE 22

So what is filling up all this space?

SLIDE 23

Bytes by containing file size

[Chart: percentage of total bytes (0-12%) by containing file size in power-of-two bins from 1K to 256G, for 2000, 2004, and 2009.]

SLIDE 24

What types of files take up disk space?

SLIDE 25

Disk consumption by file type

[Chart: share of disk space (0-60%) consumed by the top file types, ranked by year:
2000: dll, pdb, exe, pst, pch, mp3, lib, chm, cab, ø
2004: dll, vhd, pdb, exe, wma, lib, cab, pst, mp3, ø
2009: ø, dll, lib, vhd, pdb, exe, pch, cab, wma, iso
(ø = no extension)]


SLIDE 27

Which of these types deduplicate well?

SLIDE 28

Whole-file duplicates

Extension  % of Duplicate Space  Mean File Size (bytes)  % of Total Space
dll        20%                   521K                    10%
lib        11%                   1080K                   7%
pdb        11%                   2M                      7%
<none>     7%                    277K                    13%
exe        6%                    572K                    4%
cab        4%                    4M                      2%
msp        3%                    15M                     2%
msi        3%                    5M                      1%
iso        2%                    436M                    2%
<a guid>   1%                    604K                    <1%

SLIDE 29

What files make up the 20% difference between whole-file dedup plus sparse file support and more aggressive deduplication?

SLIDE 30

Where does fine granularity help?

[Chart: percentage of the difference vs. whole file + sparse (0-70%) by file type, for 8K fixed and 8K Rabin; leading types include vhd, avhd, pch, pdb, lib, dll, obj, wma, iso, pst, mp3, wim, and ø.]

SLIDE 31

Last plea to read the whole paper

  • ~4x more results in the paper!
  • Real-world file system analysis is hard
    – Eight machine-months of query processing
    – Requires careful simplifying assumptions
    – Requires heavy optimization

SLIDE 32

Conclusion

  • The benefit of fine-grained dedup is < 20%
    – Potentially just a fraction of that
  • Fragmentation is a manageable problem
  • Read the paper for more metadata results

We’re releasing this dataset