Lazy Exact Deduplication


SLIDE 1

Lazy Exact Deduplication

Jingwei Ma, Rebecca J. Stones, Yuxiang Ma, Jingui Wang, Junjie Ren, Gang Wang, Xiaoguang Liu

College of Computer and Control Engineering, Nankai University, China.

5 May 2016


SLIDE 7

Lazy exact deduplication

• Lead author: Jingwei Ma, PhD student at Nankai University (supervisor: Prof. Gang Wang).
• He couldn’t get a USA visa in time ⇒ I will present this work.
• Credit where credit is due: Jingwei Ma did the lion’s share of this work (development, implementation, experimentation, etc.).
• Lazy deduplication: ‘lazy’ in the sense that we postpone disk lookups until we can do them as a batch. (Lazy is exact.)


SLIDE 12

Deduplication: What usually happens...

• We have a large amount of data, with lots of duplicate data (e.g. weekly backups).
• We read through the data, and if we see something we’ve seen before, we replace it with an index entry (saving disk space).
• The data is broken up into chunks (Rabin hash).
• The chunks are fingerprinted (SHA-1): same fingerprint ⇒ duplicate chunk.
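The chunk-and-fingerprint idea above can be sketched in a few lines. This is a simplification, not the paper’s implementation: it uses fixed-size chunks in place of Rabin content-defined chunking, and a Python dict in place of the on-disk fingerprint index; the function names are ours.

```python
import hashlib

def deduplicate(data, chunk_size=4096):
    # Fingerprint-based dedup: each distinct chunk is stored once; the
    # recipe lists fingerprints in order so the input can be rebuilt.
    # (Real systems use a Rabin rolling hash for content-defined chunk
    # boundaries; fixed-size chunks keep this sketch short.)
    store = {}    # fingerprint -> chunk bytes (stands in for the disk)
    recipe = []   # ordered fingerprints describing the original data
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        fp = hashlib.sha1(chunk).digest()
        if fp not in store:          # same fingerprint => duplicate chunk
            store[fp] = chunk
        recipe.append(fp)
    return store, recipe

def restore(store, recipe):
    # Rebuild the original data from the deduplicated store.
    return b"".join(store[fp] for fp in recipe)
```

With highly repetitive input, the store holds far fewer chunks than the recipe references, which is where the space saving comes from.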


SLIDE 16

Deduplication: What usually happens...

• Disk bottleneck: Most fingerprints are stored on disk ⇒ lots of disk reads (“have I seen this before?”) ⇒ slow.
• Caching and prefetching reduce the disk bottleneck problem:

[Diagram: the first time we see fingerprints fA, fB, ..., every lookup misses the cache and goes to disk. The second time we see fA, fB, ..., the first cache miss triggers prefetching, and the remaining lookups are cache hits.]
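The cache-plus-prefetching behaviour in the diagram can be sketched as follows. The class name, the segment layout, and the unbounded cache are our illustrative assumptions, not the paper’s data structures; a real index would also bound the cache and evict.

```python
class FingerprintIndex:
    # "Disk" = a list of segments, each an ordered run of fingerprints
    # laid down when their chunks were first stored. On a cache miss
    # that is found on disk, we prefetch the whole segment into the
    # cache: backups repeat long runs, so the next incoming
    # fingerprints are probably that segment's neighbours.
    def __init__(self, disk_segments):
        self.segments = disk_segments
        self.cache = set()
        self.disk_reads = 0   # each cache miss costs one disk lookup

    def is_duplicate(self, fp):
        if fp in self.cache:                  # cache hit: no disk access
            return True
        self.disk_reads += 1                  # miss: search the disk
        for segment in self.segments:
            if fp in segment:
                self.cache.update(segment)    # prefetch the neighbours
                return True
        return False                          # unique fingerprint
```

On a repeated run of fingerprints, only the first lookup touches the disk; the prefetched neighbours turn the rest into cache hits, which is exactly the effect the slide describes.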

SLIDE 17

Lazy deduplication...

[Diagram: the baseline pipeline; incoming fingerprints fA, fB, fC, fD are looked up in the cache and on disk.]


SLIDE 22

Lazy deduplication...

• Bloom filter: identifies many uniques (not all). [Commonly used.]
• Buffer: stores fingerprints in hash buckets; searched later on disk (“lazy”). When the buffer is full, whole buckets are searched in one go (fingerprints are stored on disk in hash buckets).
• Post-lookup: searching the cache after buffering (maybe multiple times).
• Pre-lookup: searching the cache before buffering [not shown].
• Prefetching: bidirectional; triggers post-lookup.

[Diagram: incoming fingerprints fA, fB, fC, fD pass through the Bloom filter, the cache, and the buffer; full buffer buckets are searched on disk, prefetching feeds the cache, and post-lookup rechecks buffered fingerprints.]
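A toy sketch of the lazy pipeline, under loudly stated assumptions: the Bloom filter is a 64-bit mask with two toy hash functions, the on-disk buckets are Python sets, and “one disk read per bucket” is only counted, not performed. Pre-lookup, post-lookup, and prefetching are omitted to keep the sketch short; the class and method names (LazyDeduper, flush) are ours, not the paper’s.

```python
import hashlib

class LazyDeduper:
    # Lazy pipeline sketch: the Bloom filter screens out many uniques;
    # survivors are buffered into hash buckets; when the buffer fills,
    # each bucket is checked against its on-disk bucket in one batch
    # (one disk read per bucket instead of one per fingerprint).
    NUM_BUCKETS = 16

    def __init__(self, buffer_limit=8):
        self.bits = 0                         # toy 64-bit Bloom filter
        self.disk = [set() for _ in range(self.NUM_BUCKETS)]   # "on disk"
        self.buffer = [[] for _ in range(self.NUM_BUCKETS)]
        self.buffered = 0
        self.buffer_limit = buffer_limit
        self.disk_reads = 0
        self.duplicates = 0

    def _bucket(self, fp):
        return fp[0] % self.NUM_BUCKETS

    def _bloom_positions(self, fp):
        return fp[0] % 64, fp[1] % 64         # two toy hash functions

    def add(self, chunk):
        fp = hashlib.sha1(chunk).digest()
        p, q = self._bloom_positions(fp)
        if not ((self.bits >> p) & 1 and (self.bits >> q) & 1):
            # Bloom filter says "definitely new": store it, no buffering.
            self.bits |= (1 << p) | (1 << q)
            self.disk[self._bucket(fp)].add(fp)
            return
        # Possible duplicate: buffer it and search later, as a batch.
        self.buffer[self._bucket(fp)].append(fp)
        self.buffered += 1
        if self.buffered >= self.buffer_limit:
            self.flush()

    def flush(self):
        for b, pending in enumerate(self.buffer):
            if not pending:
                continue
            self.disk_reads += 1              # whole bucket, one disk read
            for fp in pending:
                if fp in self.disk[b]:
                    self.duplicates += 1      # exact: verified on "disk"
                else:
                    self.disk[b].add(fp)      # Bloom false positive
            self.buffer[b] = []
        self.buffered = 0
```

Because every buffered fingerprint is eventually verified against its on-disk bucket, the result stays exact; the laziness only defers when the disk is consulted.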


SLIDE 29

Prefetching...

• Ordinarily, we prefetch the subsequent on-disk fingerprints after a duplicate is found on disk; these will probably be the next incoming fingerprints.
• But this doesn’t work with the lazy method (where fingerprints are buffered).
• To overcome this obstacle, each buffered fingerprint is given a rank, used to determine the on-disk search range, and a buffer cycle, indicating where duplicates might be on disk.

It looks like this:

[Diagram: incoming fingerprints are assigned ranks 1 to 8; an on-disk lookup searches a range of about 2048 fingerprints around rank r; fingerprints are marked as incoming unique, on-disk unique, or buffered/on-disk match.]


SLIDE 31

Experimental results...

(See our paper for the details and further experiments.)

The time it takes to deduplicate a dataset (on SSD):

           Vm (220GB)   Src (343GB)   FSLHomes (3.58TB)
  eager    282 sec.     476 sec.      5824 sec.
  lazy     151 sec.     226 sec.      3939 sec.

(eager = the non-lazy [exact] way, i.e., no buffering before accessing the disk)

Conclusion: Lazy is faster.

SLIDE 32

On-disk lookups...

Disk access time (sec.) on SSD:

                        Vm             Src           FSLHomes
                    eager  lazy    eager  lazy     eager   lazy
  on-disk lookup      176    20      325    45      4598   1639
  prefetching          46    60       52    68       298    655
  other                59    71       99   113       928   1645
  total disk access   222    80      377   113      4896   2294
  total dedup.        282   151      476   226      5824   3939

Conclusion: Lazy reduces the disk bottleneck.

SLIDE 33

Throughput...

[Plots: throughput (MB/sec.) vs. data size (GB) for Src, eager vs. lazy. On SSD, the labeled values are 656 (lazy) and 397 (eager) MB/sec.; on HDD, 151 (lazy) and 69 (eager) MB/sec.]

Conclusion: Lazy has better throughput on both SSD and HDD, but more so on the slower HDD.
