Lazy Exact Deduplication
Jingwei Ma, Rebecca J. Stones, Yuxiang Ma, Jingui Wang, Junjie Ren, Gang Wang, Xiaoguang Liu
College of Computer and Control Engineering, Nankai University, China
5 May 2016
Lazy exact deduplication

Lead author: Jingwei Ma, PhD student at Nankai University (supervisor: Prof. Gang Wang). He couldn't get a USA visa in time ⇒ I will present this work. Credit where credit is due: Jingwei Ma did the lion's share of this work (development, implementation, experimentation, etc.).

Lazy deduplication: 'lazy' in the sense that we postpone disk lookups until we can do them as a batch. (Lazy deduplication is still exact.)
Deduplication: What usually happens...

We have a large amount of data with lots of duplicates (e.g. weekly backups). We read through the data, and if we see something we've seen before, we replace it with an index entry (saving disk space). The data is broken up into chunks (Rabin hash), and the chunks are fingerprinted (SHA-1): same fingerprint ⇒ duplicate chunk.
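The chunk-and-fingerprint step can be sketched as follows. This is a toy illustration, not the paper's implementation: a simple rolling-style hash stands in for the Rabin fingerprint, and the window size and boundary mask are made-up values.

```python
import hashlib

# Toy content-defined chunking: boundaries fall where a cheap rolling-style
# hash hits a chosen bit pattern, so boundaries depend on content, not offsets.
WINDOW = 16              # minimum chunk length (illustrative)
MASK = (1 << 11) - 1     # boundary condition; controls average chunk size

def chunk(data: bytes):
    """Yield content-defined chunks of `data`."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF   # stand-in for a Rabin hash
        if i - start + 1 >= WINDOW and (h & MASK) == MASK:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]                   # final partial chunk

def fingerprint(chunk_bytes: bytes) -> str:
    """SHA-1 fingerprint: same fingerprint => duplicate chunk."""
    return hashlib.sha1(chunk_bytes).hexdigest()

# Repetitive input should dedupe well: store only first-seen fingerprints.
data = b"some example data " * 1000
seen, unique_bytes = set(), 0
for c in chunk(data):
    fp = fingerprint(c)
    if fp not in seen:
        seen.add(fp)
        unique_bytes += len(c)
print(len(data), unique_bytes)
```

Note that the chunker always yields contiguous slices, so concatenating the chunks reproduces the input exactly; only the fingerprint index decides what is actually stored.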
Deduplication: What usually happens...

Disk bottleneck: most fingerprints are stored on disk ⇒ lots of disk reads ("have I seen this before?") ⇒ slow. Caching and prefetching reduce the disk bottleneck:

[Figure: The first time we see fingerprints fA, fB, fC, fD, each lookup is a cache miss and goes to disk. The second time, the cache miss on fA triggers prefetching of fA, fB, fC, fD from disk into the cache, so the subsequent lookups are cache hits.]
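The cache-plus-prefetch behaviour in the figure can be sketched like this. The container layout, cache size, and class names are assumptions for illustration: on a disk hit, we prefetch the rest of that fingerprint's container, since those fingerprints will probably arrive next.

```python
from collections import OrderedDict

class FingerprintIndex:
    """Hypothetical fingerprint index with an LRU cache and prefetching."""

    def __init__(self, containers, cache_size=1024):
        self.containers = containers          # container id -> list of fps
        self.on_disk = {fp: cid for cid, fps in containers.items() for fp in fps}
        self.cache = OrderedDict()            # fp -> container id (LRU order)
        self.cache_size = cache_size
        self.disk_reads = 0

    def lookup(self, fp):
        """Return True if fp is a duplicate (seen before)."""
        if fp in self.cache:
            self.cache.move_to_end(fp)        # cache hit: no disk access
            return True
        self.disk_reads += 1                  # cache miss: search the disk
        cid = self.on_disk.get(fp)
        if cid is None:
            return False                      # unique fingerprint
        for neighbour in self.containers[cid]:
            self.cache[neighbour] = cid       # prefetch the whole container
            self.cache.move_to_end(neighbour)
        while len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)    # evict least recently used
        return True

# "The second time we see fA, fB, fC, fD": only fA touches the disk.
idx = FingerprintIndex({0: ["fA", "fB", "fC", "fD"]})
hits = [idx.lookup(fp) for fp in ["fA", "fB", "fC", "fD"]]
print(hits, idx.disk_reads)   # -> [True, True, True, True] 1
```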
Lazy deduplication...

Bloom filter: identifies many unique fingerprints (but not all). [Commonly used.]
Buffer: stores fingerprints in hash buckets; they are searched on disk later ("lazy"). When the buffer is full, whole buckets are searched in one go (fingerprints are stored on-disk in hash buckets).
Post-lookup: searching the cache after buffering (maybe multiple times).
Pre-lookup: searching the cache before buffering [not shown].
Prefetching: bidirectional; triggers a post-lookup.

[Figure: incoming fingerprints fA, fB, fC, fD pass through the Bloom filter and cache into the buffer, and are later searched on disk as a batch; prefetching from disk fills the cache and triggers post-lookups.]
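A minimal sketch of the Bloom-filter-plus-buffer pipeline, assuming made-up bucket counts, filter size, and buffer limits (the paper's actual parameters and data structures differ). The key idea shown: a fingerprint that fails the Bloom filter is definitely unique and skips the disk entirely; everything else is buffered by bucket and looked up lazily, one disk search per bucket per batch.

```python
import hashlib

NUM_BUCKETS = 4
BLOOM_BITS = 1 << 16

def bucket_of(fp: str) -> int:
    """Deterministic bucket assignment; on-disk fps use the same buckets."""
    return hashlib.sha1(fp.encode()).digest()[0] % NUM_BUCKETS

class LazyDeduper:
    def __init__(self, on_disk_buckets, buffer_limit=64):
        self.disk = on_disk_buckets               # bucket id -> set of fps
        self.bloom = bytearray(BLOOM_BITS // 8)
        for fps in on_disk_buckets.values():      # filter covers on-disk fps
            for fp in fps:
                self._set_bit(fp)
        self.buffer = {b: [] for b in range(NUM_BUCKETS)}
        self.buffered = 0
        self.buffer_limit = buffer_limit
        self.bucket_searches = 0                  # proxy for disk accesses
        self.duplicates = []

    def _bit_pos(self, fp):
        h = hashlib.sha1(fp.encode()).digest()
        return int.from_bytes(h[:4], "big") % BLOOM_BITS

    def _set_bit(self, fp):
        pos = self._bit_pos(fp)
        self.bloom[pos // 8] |= 1 << (pos % 8)

    def _maybe_seen(self, fp):
        pos = self._bit_pos(fp)
        return bool(self.bloom[pos // 8] & (1 << (pos % 8)))

    def add(self, fp):
        if not self._maybe_seen(fp):
            self._set_bit(fp)       # definitely unique: no disk lookup at all
            return
        self.buffer[bucket_of(fp)].append(fp)     # postpone the disk lookup
        self.buffered += 1
        if self.buffered >= self.buffer_limit:
            self.flush()

    def flush(self):
        """Lazily search whole buckets in one go: one search per bucket."""
        for b, fps in self.buffer.items():
            if fps:
                self.bucket_searches += 1
                on_disk = self.disk.get(b, set())
                self.duplicates += [fp for fp in fps if fp in on_disk]
                self.buffer[b] = []
        self.buffered = 0

# fA and fB are already on disk; fC is new.
disk = {b: set() for b in range(NUM_BUCKETS)}
for fp in ("fA", "fB"):
    disk[bucket_of(fp)].add(fp)
d = LazyDeduper(disk)
for fp in ("fA", "fB", "fC"):
    d.add(fp)
d.flush()
print(d.duplicates, d.bucket_searches)
```

The payoff is that many buffered fingerprints share each bucket search, instead of each incoming fingerprint costing its own disk read.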
Prefetching...

Ordinarily, we prefetch the subsequent on-disk fingerprints after a duplicate is found on disk; these will probably be the next incoming fingerprints. But this doesn't work with the lazy method, where fingerprints are buffered before lookup. To overcome this obstacle, each buffered fingerprint is given a rank, used to determine the on-disk search range, and a buffer cycle, indicating where duplicates might be on-disk. It looks like this:
[Figure: buffered fingerprints with ranks r = 1, ..., 8 aligned against the fingerprints stored on disk; an on-disk lookup covers a range of 2048 fingerprints, positioned using the rank r. Legend: incoming unique, on-disk unique, buffered / on-disk match.]
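Roughly, a rank lets a single match anchor the search range for a whole batch. The sketch below is an illustrative guess at the arithmetic, not the paper's exact policy: if the fingerprint with rank r in a batch matches on-disk position p, its batch-mates likely sit near positions starting around p - (r - 1), so we fetch a fixed-size window from there (2048 fingerprints, the range shown in the figure).

```python
RANGE = 2048   # on-disk fingerprints fetched per lookup (from the figure)

def prefetch_window(match_pos, rank, disk_len, range_size=RANGE):
    """On-disk index window to read when the rank-`rank` fingerprint of a
    batch matches the fingerprint stored at index `match_pos`.
    (Illustrative alignment; the paper's policy may differ.)"""
    start = max(0, match_pos - (rank - 1))   # align window with batch start
    end = min(disk_len, start + range_size)  # clamp to the end of the disk
    return start, end

# e.g. the 3rd fingerprint of a batch matches disk index 1000:
print(prefetch_window(1000, 3, 10_000))   # -> (998, 3046)
```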
Experimental results...

(See our paper for the details and further experiments.) The time it takes to deduplicate a dataset (on SSD):

              Vm (220GB)   Src (343GB)   FSLHomes (3.58TB)
  eager way   282 sec.     476 sec.      5824 sec.
  lazy way    151 sec.     226 sec.      3939 sec.

(eager = the non-lazy [exact] way, i.e., no buffering before accessing the disk)

Conclusion: Lazy is faster.
On-disk lookups...

Disk access time (sec.) on SSD:

                         Vm              Src             FSLHomes
                     eager   lazy    eager   lazy    eager    lazy
  on-disk lookup      176     20      325     45     4598     1639
  prefetching          46     60       52     68      298      655
  other                59     71       99    113      928     1645
  total disk access   222     80      377    113     4896     2294
  total dedup.        282    151      476    226     5824     3939

Conclusion: Lazy reduces the disk bottleneck.
Throughput...

[Figure: throughput (MB/sec.) vs. data size (20-420 GB) for the Src dataset, eager vs. lazy. On SSD, lazy reaches around 656 MB/sec. against around 397 MB/sec. for eager; on HDD, lazy reaches around 151 MB/sec. against around 69 MB/sec. for eager.]