SLIDE 1

Deduplication

CSCI 333 Spring 2019

SLIDE 2

Logistics

  • Lab 2a/b
  • Final Project
  • Final Exam
  • Grades

SLIDE 3

Last Class

  • BetrFS [FAST ‘15]

– Linux file system using Bε-trees

  • Metadata Bε-tree: path -> struct stat
  • Data Bε-tree: path|{block#} -> 4KiB block

– Schema maps VFS operations to efficient Bε-tree operations

  • Upserts, range queries

– Next iteration [FAST ‘16]: fixed the slowest operations

  • Rangecast delete messages
  • “Zones”
  • Late-binding journal

SLIDE 4

This Class

  • Introduction to Deduplication

– Big picture idea
– Design choices and tradeoffs
– Open questions

  • Slides from Gala Yadgar & Geoff Kuenning, presented at Dagstuhl
  • I’ve added new slides (slides without borders) for extra context

SLIDE 5

Deduplication

Geoff Kuenning, Gala Yadgar

SLIDE 6

Sources of Duplicates

  • Different people store the same files

– Shared documents, code development
– Popular photos, videos, etc.

  • May also share blocks

– Attachments
– Configuration files
– Company logo and other headers

→ Deduplication!

SLIDE 7

Deduplication

  • Dedup(e) is one form of compression
  • High-level goal: identify duplicate objects and eliminate redundant copies

– How should we define a duplicate object?
– What makes a copy “redundant”?

  • The answers are application-dependent, and they are some of the more interesting research questions!

SLIDE 8

857 Desktops at Microsoft

  • D. Meyer, W. Bolosky. A Study of Practical Deduplication. FAST 2011

SLIDE 9

“Naïve” Deduplication

For each new file:
    Compare each block to all existing blocks
    If new, write the block and add a pointer
    If duplicate, add a pointer to the existing copy
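A minimal sketch of this naive scheme (Python for illustration; all names are hypothetical). Every new block is compared byte-for-byte against every stored block, which is exactly the cost the next slides attack:

BLOCK_SIZE = 4096    # 4 KiB blocks

stored_blocks = []   # unique block contents
file_recipes = {}    # file name -> list of indexes into stored_blocks

def write_file(name, data):
    recipe = []
    for off in range(0, len(data), BLOCK_SIZE):
        block = data[off:off + BLOCK_SIZE]
        for i, existing in enumerate(stored_blocks):   # O(#blocks) scan
            if existing == block:
                recipe.append(i)                 # duplicate: add a pointer
                break
        else:
            stored_blocks.append(block)          # new: write the block
            recipe.append(len(stored_blocks) - 1)
    file_recipes[name] = recipe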

[Diagram: File1, File2, and File3 with duplicate blocks shared across files]

Are we done?

SLIDE 10

Identifying Duplicates

  • It’s unreasonable to “compare each block to all existing blocks”

→ Fingerprints

– Cryptographic hash of block content
– Low collision probability

SLIDE 11

Dedup Fingerprints

  • Goal: uniquely identify an object’s contents
  • How big should a fingerprint be?

– Ideally, large enough that the probability of a collision is lower than the probability of a hardware error

  • MD5: 16-byte hash
  • SHA-1: 20-byte hash
  • Technique: the system stores a map (index) between each object’s fingerprint and that object’s location (see the sketch below)

– Compare a new object’s fingerprint against all existing fingerprints, looking for a match
– Scales with the number of unique objects, not the size of the objects
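A minimal sketch of such an index as a plain dictionary (helper names are hypothetical). As a back-of-the-envelope check: with a b-bit hash and n unique objects, the birthday bound puts the collision probability near n²/2^(b+1), which for SHA-1 (b = 160) stays far below hardware error rates at any realistic n:

import hashlib

index = {}   # fingerprint (bytes) -> location of the stored object

def dedup_store(obj, store_object):
    fp = hashlib.sha1(obj).digest()   # 20-byte fingerprint
    loc = index.get(fp)               # match against existing fingerprints
    if loc is None:
        loc = store_object(obj)       # genuinely new: write it
        index[fp] = loc
    return loc                        # duplicates share one stored copy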

SLIDE 12

Identifying Duplicates

  • It’s unreasonable to “compare each block to all existing blocks”

→ Fingerprints

– Cryptographic hash of block content
– Low collision probability

  • It’s also unreasonable to compare to all fingerprints…

→ Fingerprint cache

SLIDE 13

Fingerprint Lookup

  • How should we store the fingerprints?
  • Every unique block is a miss → miss rate ≥ 40%
  • One solution: Bloom filter (sketched below)
  • Challenge: 2% false positive rate → 1TB for 4PB of data
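A minimal Bloom filter sketch (sizes and hash count are illustrative, not the 2%-false-positive configuration above):

import hashlib

class BloomFilter:
    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, fp):
        # derive k bit positions by salting the fingerprint
        for i in range(self.num_hashes):
            h = hashlib.sha1(bytes([i]) + fp).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def insert(self, fp):
        for pos in self._positions(fp):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def may_contain(self, fp):
        # False means definitely absent; True may be a false positive,
        # so a hit still requires consulting the real index
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(fp))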

[Diagram: Bloom filter in RAM, showing inserts, a negative lookup, and a false-positive lookup]

SLIDE 14

How To Implement a Cache?

  • (Bloom) Filters help us determine if a fingerprint exists

– We still need to do an I/O to find the mapping

  • Locality in fingerprints?

– If we sort our index by fingerprint: the cryptographic hash destroys all notions of locality
– What if we grouped fingerprints by the temporal locality of writes? (see the sketch below)
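One sketch of the temporal-locality idea (an assumed container layout, in the spirit of the Data Domain design of Zhu et al., FAST ’08; all names hypothetical): fingerprints are grouped into containers in write order, and a hit on any one fingerprint prefetches its whole container into the RAM cache:

containers = []   # on-disk groups of fingerprints, in write order
cache = {}        # RAM cache: fingerprint -> location

def record_write(fp, loc, container_size=1024):
    if not containers or len(containers[-1]) >= container_size:
        containers.append({})        # start a new container
    containers[-1][fp] = loc

def cached_lookup(fp):
    if fp in cache:
        return cache[fp]             # no I/O needed
    for group in containers:         # stands in for one on-disk index probe
        if fp in group:
            cache.update(group)      # prefetch neighbors written together
            return group[fp]
    return None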

SLIDE 15

Reading and Restoring

  • How long does it take to read File1?
  • How long does it take to read File3?
  • Challenge: when is it better to store the duplicates?

[Diagram: blocks of File1, File2, and File3, some shared and some duplicated]

SLIDE 16

Write Path

[Diagram: write path from File3 through the file recipe and a fingerprint index lookup to the chunk store. Surprise: many writes become faster!]
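A sketch of this write path (continuing the dictionary-based index; structure names are hypothetical): the file becomes a recipe of fingerprints, and only chunks missing from the chunk store are actually written, which is why duplicate-heavy writes get faster:

import hashlib

chunk_store = {}   # fingerprint -> chunk bytes
recipes = {}       # file name -> ordered list of fingerprints

def dedup_write(name, chunks):
    recipe = []
    for chunk in chunks:
        fp = hashlib.sha1(chunk).digest()
        if fp not in chunk_store:    # fingerprint index miss: new data
            chunk_store[fp] = chunk  # the only real data write
        recipe.append(fp)            # duplicates cost no data I/O
    recipes[name] = recipe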

SLIDE 17

Read Path

[Diagram: read path from File3’s file recipe to chunk store lookups]

SLIDE 18

Delete Path

[Diagram: delete path through the file recipe, fingerprint index, and chunk store, with a reference counter per chunk]

  • Challenge: storing reference counts

– Physically separate from the chunks (a minimal sketch follows)
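A matching delete sketch (continuing the write-path structures above, and assuming dedup_write also bumps a counter per recipe entry): the counters live in their own structure, apart from the chunks themselves:

from collections import Counter

refcounts = Counter()   # fingerprint -> number of recipe references

def dedup_delete(name):
    for fp in recipes.pop(name):
        refcounts[fp] -= 1
        if refcounts[fp] == 0:     # last reference is gone,
            del chunk_store[fp]    # so the chunk can be reclaimed
            del refcounts[fp]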

SLIDE 19

Chunking

  • Chunking: splitting files into blocks
  • Fixed-size chunks: usually aligned to device blocks
  • What is the best chunk size?

[Diagram: fixed-size chunking of File1 and File2 at two different chunk sizes]

SLIDE 20

Updates and Versions

  • Best case: aabbccdd → aAbbccdd (a block changes in place)

  • Worst case: aabbccdd → aAabbccdd (an insertion shifts every later block)

[Diagram: File1 and its versions File1a (best case) and File1b (worst case); ideally File1b would still share all unchanged blocks]

SLIDE 21

Variable-Size Chunks

  • Basic idea: a chunk boundary is triggered by a random string
  • For example, with trigger 010:

aa010bb010cc010dd → aAa010bb010cc010dd

(after the insertion, only the first chunk changes)

  • Triggers should be:

– Not too short/long
– Not too popular (000000…)
– Easy to identify

SLIDE 22

Identifying Chunk Boundaries

  • 48-byte triggers (empirically, this works)
  • Define a set of possible triggers

→ K highest bits of the hash are == 0
→ Rabin fingerprints compute this efficiently
→ “systems” solutions for corner cases

  • Challenge: parallelize this process

[Diagram: sliding a window along a bit stream; with K=5, a window whose fingerprint’s five highest bits are all 0 (e.g., 0000000101) marks a boundary]

SLIDE 23

Rabin Fingerprints

  • “The polynomial representation of the data modulo a predetermined irreducible polynomial” [LBFS, SOSP ’01]
  • What/why Rabin fingerprints?

– Calculates a rolling hash
– “Slides the window” in a constant number of operations (intuition: we “add” a new byte and “subtract” an old byte to slide the window by one)
– Defines a “chunk” once our window’s hash matches our target value (i.e., we hit a trigger)
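A minimal content-defined chunker sketch: a simplified polynomial rolling hash stands in for a true Rabin fingerprint (which works modulo an irreducible polynomial over GF(2)). Constants are illustrative; the previous slide tests the K highest bits, while this sketch masks low bits for simplicity, and the min/max size checks anticipate the next slide:

WINDOW = 48                         # window size from the previous slide
PRIME = 1_000_000_007               # modulus for the rolling hash
BASE = 257
MASK = (1 << 13) - 1                # trigger: 13 low hash bits all zero
MIN_CHUNK, MAX_CHUNK = 2048, 65536  # size thresholds (next slide)

def chunk(data):
    chunks, start, h = [], 0, 0
    top = pow(BASE, WINDOW - 1, PRIME)   # weight of the departing byte
    for i, byte in enumerate(data):
        if i - start >= WINDOW:
            h = (h - data[i - WINDOW] * top) % PRIME  # "subtract" old byte
        h = (h * BASE + byte) % PRIME                 # "add" new byte
        size = i - start + 1
        if ((h & MASK) == 0 and size >= MIN_CHUNK) or size >= MAX_CHUNK:
            chunks.append(data[start:i + 1])   # window hash hit a trigger
            start, h = i + 1, 0                # restart in the next chunk
    if start < len(data):
        chunks.append(data[start:])            # trailing partial chunk
    return chunks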

SLIDE 24

Defining Chunk Boundaries

  • Tradeoff between small and large chunks?

– Finer granularity of sharing vs. metadata overhead

  • With the process just described, how might we:

– Produce a very small chunk?
– Produce a very large chunk?

  • How might we modify our chunking algorithm to give us “reasonable” chunk sizes?

– To avoid small chunks: don’t consider boundaries until a minimum size threshold
– To avoid large chunks: as soon as we reach a maximum threshold, insert a chunk boundary

SLIDE 25

Distributed Storage

Increase storage capacity and performance with multiple storage servers

  • Each server is a separate machine (CPU, RAM, HDD/SSD)
  • Data access is distributed between servers

✓ Scalability: increase capacity with data growth
✓ Load balancing: independent of workload
✓ Failure handling: network, nodes, and devices always fail

SLIDE 26

Distributed Deduplication

  • Where/when should we look for duplicates?
  • Where should we store each file?

[Diagram: File1, File2, and File3 distributed across multiple storage servers]

SLIDE 27

Challenges (aka Summary)

→ Wonderful theory problems!

[Collage of challenges: approximate membership query (AMQ) structures; chunk-boundary bit streams; parallelizing chunking; size of the fingerprint dictionary; per-chunk reference counters; bidirectional indexing of chunks]

SLIDE 28

Next Class?

  • Specific dedup system(s) (4)
  • MapReduce (+ write-optimized) (2)
  • Google File System (1)
  • RAID (3)

SLIDE 29

Final Project Discussion

  • Get with your group
  • Find another group
  • Pitch your project / show them your proposal

– React/revise
