Deduplication CSCI 333 Spring 2019 Logistics Lab 2a/b Final - PowerPoint PPT Presentation

Deduplication CSCI 333 Spring 2019

Logistics • Lab 2a/b • Final Project • Final Exam • Grades 2

Last Class • BetrFS [FAST ’15] – Linux file system using Bε-trees • Metadata Bε-tree: path -> struct stat • Data in Bε-tree: path|{block#} -> 4KiB block – Schema maps VFS operations to efficient Bε-tree operations • Upserts, range queries – Next iteration [FAST ’16]: fixed slowest operations • Rangecast delete messages • “Zones” • Late-binding journal 3

This Class • Introduction to Deduplication – Big picture idea – Design choices and tradeoffs – Open questions • Slides from Gala Yadgar & Geoff Kuenning, presented at Dagstuhl • I’ve added new slides (slides without borders) for extra context 4

Deduplication Geoff Kuenning Gala Yadgar

Sources of Duplicates • Different people store the same files – Shared documents, code development – Popular photos, videos, etc. • May also share blocks – Attachments – Configuration files – Company logo and other headers → Deduplication! 6

Deduplication • Dedup(e) is one form of compression • High-level goal: identify duplicate objects and eliminate redundant copies – How should we define a duplicate object? – What makes a copy “redundant”? • Answers are application-dependent and some of the more interesting research questions! 7

857 Desktops at Microsoft • Source: D. Meyer, W. Bolosky. A Study of Practical Deduplication. FAST 2011 8

“Naïve” Deduplication • For each new file: compare each block to all existing blocks – If new, write block and add pointer – If duplicate, add pointer to existing copy • [Figure: File1, File2, File3] • Are we done? 9
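The naïve procedure on this slide can be sketched in a few lines. This is only an illustration: the names `naive_dedup_write` and `store` are made up for the example, and an in-memory list stands in for the disk.

```python
def naive_dedup_write(blocks, store):
    """Write one file's blocks; return its pointers (indices into store)."""
    pointers = []
    for block in blocks:
        # Compare the new block to ALL existing blocks (the expensive part).
        for i, existing in enumerate(store):
            if existing == block:
                pointers.append(i)           # duplicate: point to existing copy
                break
        else:
            store.append(block)              # new: write block
            pointers.append(len(store) - 1)  # ... and add pointer
    return pointers


store = []
file1 = naive_dedup_write([b"aa", b"bb", b"cc"], store)
file2 = naive_dedup_write([b"bb", b"dd"], store)
```

Every write scans the whole store and compares full block contents, so the cost grows with the amount of data already stored. That is why the answer to "Are we done?" is no.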

Identifying Duplicates • It’s unreasonable to “Compare each block to all existing blocks” → Fingerprints – Cryptographic hash of block content – Low collision probability • [Figure: RAM] 10

Dedup Fingerprints • Goal: uniquely identify an object’s contents • How big should a fingerprint be? – Ideally, large enough that the probability of a collision is lower than the probability of a hardware error • MD5: 16-byte hash • SHA-1: 20-byte hash • Technique: system stores a map (index) between each object’s fingerprint and each object’s location – Compare a new object’s fingerprint against all existing fingerprints, looking for a match – Scales with number of unique objects, not size of objects 11
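A minimal sketch of the fingerprint map described above, assuming SHA-1 fingerprints and an in-memory dict as the index. The class name `FingerprintIndex` and the list used as a chunk store are hypothetical stand-ins, not part of the slides.

```python
import hashlib


class FingerprintIndex:
    """Maps each object's fingerprint to the object's location."""

    def __init__(self):
        self.index = {}   # 20-byte SHA-1 fingerprint -> location
        self.chunks = []  # stand-in for the on-disk store; location = list index

    def put(self, block: bytes) -> int:
        """Store block if its fingerprint is new; return its location."""
        fp = hashlib.sha1(block).digest()
        loc = self.index.get(fp)
        if loc is None:               # fingerprint not seen: unique object
            loc = len(self.chunks)
            self.chunks.append(block)
            self.index[fp] = loc
        return loc


idx = FingerprintIndex()
a = idx.put(b"x" * 4096)
b = idx.put(b"y" * 4096)
c = idx.put(b"x" * 4096)  # same content as the first block
```

The lookup touches only the 20-byte fingerprint, never the 4 KiB contents, so the index grows with the number of unique objects rather than with their size.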

Identifying Duplicates • It’s unreasonable to “Compare each block to all existing blocks” → Fingerprints – Cryptographic hash of block content – Low collision probability • It’s also unreasonable to compare to all fingerprints… → Fingerprint cache • [Figure: RAM] 12

Fingerprint Lookup • How should we store the fingerprints? • Every unique block is a miss → miss rate ≥ 40% • One solution: Bloom filter lookup • [Figure: Bloom filter in RAM showing two inserts, a negative lookup, and a false-positive lookup] • Challenge: 2% false positive rate → 1TB for 4PB of data 13
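The "1TB for 4PB" figure can be sanity-checked with the standard Bloom filter sizing formula, bits per key = -ln(p) / (ln 2)^2. The sketch below assumes 4 KiB blocks that are all unique (an assumption, not stated on the slide) and includes a small illustrative filter. `BloomFilter` and its double-hashing scheme are choices made for this example.

```python
import hashlib
import math


def bloom_bits_per_key(fp_rate: float) -> float:
    """Bits per key for an optimally sized Bloom filter."""
    return -math.log(fp_rate) / (math.log(2) ** 2)


class BloomFilter:
    def __init__(self, n_keys: int, fp_rate: float):
        self.m = max(1, math.ceil(n_keys * bloom_bits_per_key(fp_rate)))
        self.k = max(1, round(self.m / n_keys * math.log(2)))  # hash count
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key: bytes):
        # Derive k positions from one digest via double hashing.
        d = hashlib.sha256(key).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        # False means definitely absent; True may be a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


# Sizing check: 4 PiB of data in 4 KiB blocks, 2% false positive rate.
n_blocks = (4 * 2**50) // 4096
filter_bytes = n_blocks * bloom_bits_per_key(0.02) / 8

bf = BloomFilter(n_keys=1000, fp_rate=0.02)
keys = [str(i).encode() for i in range(1000)]
for key in keys:
    bf.add(key)
```

At a 2% false positive rate the formula gives a little over 8 bits, so about one byte, per fingerprint. With roughly 10^12 blocks that is on the order of a terabyte of filter, which matches the slide.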

How To Implement a Cache? • (Bloom) Filters help us determine if a fingerprint exists – We still need to do an I/O to find the mapping • Locality in fingerprints? – If we sort our index by fingerprint: cryptographic hash destroys all notions of locality – What if we grouped fingerprints by temporal locality of writes? 14
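One way to picture the last question is the sketch below. It assumes fingerprints are stored in fixed-size "containers" in the order they were written, and that a miss in RAM costs one index I/O and then pulls the whole container's fingerprints into the cache. `LocalityCache`, the container size, and the LRU policy are hypothetical choices for illustration, not a description of any particular system.

```python
import hashlib
from collections import OrderedDict


class LocalityCache:
    def __init__(self, container_size=4, max_containers=2):
        self.container_size = container_size
        self.max_containers = max_containers
        self.containers = [[]]       # "on disk": fingerprints in write order
        self.disk_index = {}         # "on disk": fingerprint -> container id
        self.cached = OrderedDict()  # in RAM: container id -> fingerprint set
        self.index_ios = 0           # number of on-disk index lookups

    def _load(self, cid):
        """Bring one container's fingerprints into RAM, evicting the oldest."""
        self.cached[cid] = set(self.containers[cid])
        self.cached.move_to_end(cid)
        while len(self.cached) > self.max_containers:
            self.cached.popitem(last=False)

    def is_duplicate(self, block: bytes) -> bool:
        fp = hashlib.sha1(block).digest()
        for fps in self.cached.values():
            if fp in fps:
                return True          # answered from RAM, no I/O
        self.index_ios += 1          # miss: consult the on-disk index
        cid = self.disk_index.get(fp)
        if cid is not None:
            self._load(cid)          # prefetch its write-time neighbors
            return True
        if len(self.containers[-1]) >= self.container_size:
            self.containers.append([])
        self.containers[-1].append(fp)
        self.disk_index[fp] = len(self.containers) - 1
        return False


cache = LocalityCache()
blocks = [bytes([i]) * 8 for i in range(8)]
first_pass = [cache.is_duplicate(b) for b in blocks]
ios_first = cache.index_ios
cache.index_ios = 0
second_pass = [cache.is_duplicate(b) for b in blocks]
ios_second = cache.index_ios
```

When the same stream is written a second time in the same order, one index lookup per container is enough, because each lookup prefetches the fingerprints that were written alongside it.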

Reading and Restoring • [Figure: File1, File2, File3] • How long does it take to read File1? • How long does it take to read File3? • Challenge: when is it better to store the duplicates? 15

Write Path • [Figure: File3, lookup, fingerprint index, file recipe, chunk store] • Surprise: many writes become faster! 16

Read Path • [Figure: File3, lookup, fingerprint index, file recipe, chunk store] 17

Delete Path • Challenge: storing reference counts – Physically separate from the chunks • [Figure: File3, lookup, fingerprint index, file recipe, chunk store, reference counters: 1 1 2 1 2 1 2] 18
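The write, read, and delete paths from the last three slides can be sketched together. This is a toy in-memory model with invented names (`DedupStore`, `chunk_writes`); it uses fixed-size chunks and keeps the reference counters in their own table, separate from the chunks.

```python
import hashlib


class DedupStore:
    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.fp_index = {}    # fingerprint -> chunk id
        self.chunks = {}      # chunk store: chunk id -> bytes
        self.refcounts = {}   # chunk id -> reference count (kept separately)
        self.recipes = {}     # file name -> file recipe (list of chunk ids)
        self.next_id = 0
        self.chunk_writes = 0

    def write(self, name: str, data: bytes) -> None:
        if name in self.recipes:
            self.delete(name)             # overwrite: drop the old recipe
        recipe = []
        for off in range(0, len(data), self.chunk_size):
            chunk = data[off:off + self.chunk_size]
            fp = hashlib.sha1(chunk).digest()
            cid = self.fp_index.get(fp)
            if cid is None:               # unique chunk: the only real write
                cid = self.next_id
                self.next_id += 1
                self.chunks[cid] = chunk
                self.fp_index[fp] = cid
                self.refcounts[cid] = 0
                self.chunk_writes += 1
            self.refcounts[cid] += 1      # duplicate: metadata update only
            recipe.append(cid)
        self.recipes[name] = recipe

    def read(self, name: str) -> bytes:
        return b"".join(self.chunks[cid] for cid in self.recipes[name])

    def delete(self, name: str) -> None:
        for cid in self.recipes.pop(name):
            self.refcounts[cid] -= 1
            if self.refcounts[cid] == 0:  # last reference: reclaim the chunk
                chunk = self.chunks.pop(cid)
                del self.refcounts[cid]
                del self.fp_index[hashlib.sha1(chunk).digest()]


s = DedupStore(chunk_size=2)
s.write("File1", b"aabbcc")
s.write("File3", b"bbccdd")
```

In this model a duplicate chunk costs a lookup and a counter increment instead of a data write, which is the "surprise" on the write-path slide. A read follows the recipe and may jump around the chunk store, and a delete can free a chunk only when its counter reaches zero.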

Chunking • Chunking: splitting files into blocks • Fixed-size chunks: usually aligned to device blocks • What is the best chunk size? • [Figure: File1, File2] 19

Updates and Versions • Best case (File1 → File1a): aabbccdd → aAbbccdd • Worst case (File1 → File1b): aabbccdd → aAabbccdd • Ideally… 20
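The best and worst cases can be reproduced with fixed-size chunks of two characters. The helper names below are made up for this illustration.

```python
def fixed_chunks(data: str, size: int):
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def shared_chunks(old: str, new: str, size: int) -> int:
    """Number of distinct chunks the two versions have in common."""
    return len(set(fixed_chunks(old, size)) & set(fixed_chunks(new, size)))


original = "aabbccdd"
overwrite = "aAbbccdd"    # best case: one character replaced in place
insertion = "aAabbccdd"   # worst case: one character inserted

best = shared_chunks(original, overwrite, 2)
worst = shared_chunks(original, insertion, 2)
```

The in-place overwrite leaves three of the four chunks intact. The insertion shifts every later boundary by one character, so no chunk of the new version matches an old one.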

Variable-Size Chunks • Basic idea: chunk boundary is triggered by a random string • For example: 010 • aa010bb010cc010dd → aAa010bb010cc010dd • Triggers should be: – Not too short/long – Not too popular (000000…) – Easy to identify 21
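The example on this slide can be run directly. The sketch below ends a chunk right after each occurrence of the trigger string; `trigger_chunks` is a name chosen for this illustration.

```python
def trigger_chunks(data: str, trigger: str = "010"):
    """Split data so that every chunk ends just after a trigger string."""
    chunks, start = [], 0
    pos = data.find(trigger)
    while pos != -1:
        end = pos + len(trigger)
        chunks.append(data[start:end])
        start = end
        pos = data.find(trigger, start)
    if start < len(data):
        chunks.append(data[start:])  # trailing data with no trigger
    return chunks


before = trigger_chunks("aa010bb010cc010dd")
after = trigger_chunks("aAa010bb010cc010dd")
unchanged = [c for c in after if c in before]
```

Because boundaries follow the content rather than fixed offsets, the inserted character changes only the first chunk, and the remaining chunks still deduplicate against the old version.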

Identifying Chunk Boundaries • 48-byte triggers (empirically, this works) • Define a set of possible triggers → K highest bits of the hash are == 0 → Rabin fingerprints do this efficiently → “systems” solutions for corner cases • [Figure: a window slides over the bit stream …0101100100110011101001…; with K=5, the fingerprint whose highest bits are 00000 marks a boundary] • Challenge: parallelize this process 22

Rabin Fingerprints • “The polynomial representation of the data modulo a predetermined irreducible polynomial” [LBFS, SOSP ’01] • What/why Rabin fingerprints? – Calculates a rolling hash – “Slide the window” in a constant number of operations (intuition: we “add” a new byte and “subtract” an old byte to slide the window by one) – Define a “chunk” once our window’s hash matches our target value (i.e., we hit a trigger) 23
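The "add a byte, subtract a byte" intuition can be shown with a simpler stand-in. The sketch below is a Rabin-Karp style polynomial rolling hash over the integers modulo a prime, not a true Rabin fingerprint (which works with polynomials over GF(2)). `BASE`, `MOD`, and the function names are arbitrary choices for the example.

```python
BASE = 257
MOD = (1 << 61) - 1  # a large prime modulus, chosen arbitrarily


def window_hash(window: bytes) -> int:
    """Hash one window from scratch: O(len(window)) work."""
    h = 0
    for byte in window:
        h = (h * BASE + byte) % MOD
    return h


def rolling_hashes(data: bytes, w: int):
    """Yield (end_offset, hash) for every w-byte window, O(1) per slide."""
    if len(data) < w:
        return
    top = pow(BASE, w - 1, MOD)              # weight of the oldest byte
    h = window_hash(data[:w])
    yield w, h
    for i in range(w, len(data)):
        h = (h - data[i - w] * top) % MOD    # "subtract" the old byte
        h = (h * BASE + data[i]) % MOD       # "add" the new byte
        yield i + 1, h


data = bytes(range(256)) * 2
rolled = list(rolling_hashes(data, 48))
```

Each slide costs a constant number of arithmetic operations regardless of the window size, yet yields the same value as rehashing the full 48-byte window.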

Defining chunk boundaries • Tradeoff between small and large chunks? – Finer granularity of sharing vs. metadata overhead • With process just described, how might we: – Produce a very small chunk? – Produce a very large chunk? • How might we modify our chunking algorithm to give us “reasonable” chunk sizes? – To avoid small chunks: don’t consider boundaries until minimum size threshold – To avoid large chunks: as soon as we reach a maximum threshold, insert a chunk boundary 24
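A sketch of a chunker with both thresholds follows. It declares a boundary when the K highest bits of a 32-bit rolling hash over the last 48 bytes are zero, ignores boundaries before `min_size`, and forces one at `max_size`. The multiplier, the 32-bit hash width, and the default sizes are assumptions made for the example, and the hash is again a simple polynomial stand-in rather than a real Rabin fingerprint.

```python
import random

BITS = 32
MASK = (1 << BITS) - 1
BASE = 2654435761  # arbitrary odd multiplier (assumption for this sketch)


def cdc_chunks(data: bytes, window=48, k=5, min_size=64, max_size=1024):
    """Content-defined chunking with minimum and maximum chunk sizes."""
    top = pow(BASE, window - 1, 1 << BITS)  # weight of the oldest byte
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) & MASK  # drop the oldest byte
        h = (h * BASE + byte) & MASK                 # take in the new byte
        size = i + 1 - start
        # Trigger: the K highest bits of the window hash are all zero.
        at_trigger = i + 1 >= window and (h >> (BITS - k)) == 0
        if (at_trigger and size >= min_size) or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])                  # final partial chunk
    return chunks


rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(20000))
chunks = cdc_chunks(data)
```

With K = 5, a trigger is expected roughly once every 32 positions on random input, so the minimum threshold does most of the shaping here and the maximum is a rarely used safety net.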

Distributed Storage • Increase storage capacity and performance with multiple storage servers – Each server is a separate machine (CPU, RAM, HDD/SSD) – Data access is distributed between servers • Scalability: increase capacity with data growth • Load balancing: independent of workload • Failure handling: network, nodes and devices always fail 25

Distributed Deduplication • [Figure: File1, File2, File3] • Where/when should we look for duplicates? • Where should we store each file? 26

Challenges (aka Summary) • Size of fingerprint dictionary • Approximate membership query structures (AMQ) • Parallelizing chunking • Bidirectional indexing of chunks → Wonderful theory problems! 27

Next Class? • Specific dedup system(s) (4) • MapReduce (+ write-optimized) (2) • Google File System (1) • RAID (3) 28

Final Project Discussion • Get with your group • Find another group • Pitch your project / show them your proposal – React/revise 29
