Efficient Locally Trackable Deduplication in Replicated Systems


SLIDE 1

Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa

technology from seed

Efficient Locally Trackable Deduplication in Replicated Systems

João Barreto and Paulo Ferreira Distributed Systems Group INESC-ID/Technical University Lisbon, Portugal

www.gsd.inesc-id.pt

SLIDE 2

Bandwidth remains scarce

02/12/2009 Middleware 2009

SLIDE 3

Collaboration through data replication

Site S, versions(S): File A v4, v3, v2; File B v4, v3; File C v8, v7
Site R, versions(R): File A v3, v2; File B v3; File C v8, v9

Distributed users share objects A, B and C.
At each moment, S stores versions(S) and R stores versions(R).

SLIDE 4

The bottleneck: synchronization

Site S synchronizes to Site R:

1. Determine which versions to transfer

Site S, versions(S): File A v4, v3, v2; File B v4, v3; File C v8, v7
Site R, versions(R): File A v3, v2; File B v3; File C v8, v9

SLIDE 5

The bottleneck: synchronization

2. Transfer versions

Site S: versions(S)
Site R: versions(R)

SLIDE 6

Deduplication: exploiting redundancy

Site S: versions(S)
+ References to chunks in versions(R)
Site R: versions(R)

How to determine which chunks are redundant?

SLIDE 7

Locally trackable vs untrackable redundancy

Site S: versions(S)
Site R: versions(R)

Locally trackable redundancy: the chunk to transfer exists in some version that is in both versions(S) and versions(R).

Locally untrackable redundancy: otherwise.
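
The distinction above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each site maps a version label to the set of chunks stored in that version; the function name `classify` and this data layout are hypothetical and do not come from the talk.

```python
def classify(chunk, versions_s, versions_r):
    """Classify a chunk that site S is about to send to site R.

    versions_s / versions_r map a version label to the set of chunks
    held in that version (an assumed layout, for illustration only).
    """
    # Versions known to both sites: versions(S) ∩ versions(R).
    common = versions_s.keys() & versions_r.keys()
    if any(chunk in versions_s[v] for v in common):
        # S can find the duplicate by looking only at its own store.
        return "locally trackable"
    if any(chunk in chunks for chunks in versions_r.values()):
        # R holds the chunk, but only in versions S does not have.
        return "locally untrackable"
    return "not redundant"
```

In this sketch S needs the contents of versions(R) only to tell the second and third cases apart, which is exactly the information S does not have locally.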

SLIDE 8

Existing approaches: compare-by-hash

Site S: versions(S)
Site R: versions(R)

SLIDE 9

Existing approaches: advantages and shortcomings

Compare-by-hash

  • Advantage: detects both locally trackable and untrackable redundancy
  • Advantage: detects redundancy across any versions and/or objects
  • Shortcoming: additional round-trip
  • Shortcoming: limited precision

– smaller chunks may not compensate for hash-exchange and hash-lookup overheads
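
The extra round-trip can be seen in a minimal compare-by-hash sketch. It assumes fixed-size chunks and SHA-1 digests for brevity (systems such as LBFS use content-defined chunk boundaries instead), and all function names are hypothetical.

```python
import hashlib

def chunks(data, size=4):
    """Split data into fixed-size chunks (an assumption for brevity)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def digest(chunk):
    return hashlib.sha1(chunk).hexdigest()

def sender_offer(data, size=4):
    """First exchange, S to R: send only the hashes of the chunks."""
    return [digest(c) for c in chunks(data, size)]

def receiver_missing(offered, store):
    """First exchange, R to S: reply with the hashes R does not hold."""
    return {h for h in offered if h not in store}

def sender_payload(data, missing, size=4):
    """Second exchange, S to R: send data only for the missing chunks."""
    return {digest(c): c for c in chunks(data, size) if digest(c) in missing}
```

Every chunk costs one hash on the wire and one lookup at R, so shrinking the chunk size raises both overheads, which is the precision limit noted above.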

SLIDE 10

Existing approaches: delta encoding

Site S: versions(S)
Site R: versions(R)

Calculate deltas from the most recent versions in versions(R) to each version to transfer, using local, high-precision algorithms.
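
A minimal delta-encoding sketch follows. It uses Python's standard `difflib.SequenceMatcher` as a stand-in for a real delta tool such as xdelta, so it illustrates the idea rather than any particular tool's algorithm; `make_delta` and `apply_delta` are hypothetical names.

```python
from difflib import SequenceMatcher

def make_delta(base, target):
    """Encode target as copy/insert operations against base."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Bytes the receiver already has: send a reference into base.
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            # New bytes: send them literally.
            ops.append(("insert", target[j1:j2]))
        # "delete" needs no operation: those base bytes are skipped.
    return ops

def apply_delta(base, ops):
    """Rebuild the target at the receiver from base and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += base[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)
```

The delta is computed entirely at the sender, with no hash exchange, but only against one base version that both sites are known to hold.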

SLIDE 11

Existing approaches: advantages and shortcomings

Compare-by-hash

  • Advantage: detects both locally trackable and untrackable redundancy
  • Advantage: detects redundancy across any versions and/or objects
  • Shortcoming: additional round-trip
  • Shortcoming: limited precision

– smaller chunks may not compensate for hash-exchange and hash-lookup overheads

Delta encoding

  • Shortcoming: only detects locally trackable redundancy
  • Shortcoming: limited to pairs of versions
  • Advantage: high-precision local redundancy detection
  • Advantage: redundancy detection can occur ahead of transfer time
  • Advantage: simple protocol

Can we devise a solution that borrows the advantages from both approaches?

SLIDE 12

Our contribution: redFS

Site S: versions(S)
Site R: versions(R)

  • 0. Use a local high-precision compare-by-hash algorithm to pre-compute local redundancy relations
  • 2. Determine C = versions(S) ∩ versions(R)
  • 3. For each chunk to transfer, if the chunk is also in some version in C, simply send a reference to that version

(Diagram: versions labelled [S:3] to [S:6] and [R:3] to [R:5].)
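
Steps 0, 2 and 3 above can be sketched as follows. This is an illustration under stated assumptions, not the redFS implementation: chunks are fixed-size, versions are byte strings keyed by a hypothetical label, and S is assumed to already know the identifiers of the versions R holds (how that knowledge is obtained is not shown here).

```python
import hashlib

def chunk_hashes(data, size=4):
    """Return (hash, chunk) pairs for fixed-size chunks (an assumption)."""
    parts = [data[i:i + size] for i in range(0, len(data), size)]
    return [(hashlib.sha1(p).hexdigest(), p) for p in parts]

def build_index(local_versions, size=4):
    """Step 0: locally pre-compute where each chunk occurs."""
    index = {}
    for vid, data in local_versions.items():
        for n, (h, _) in enumerate(chunk_hashes(data, size)):
            index.setdefault(h, []).append((vid, n * size))
    return index

def encode_transfer(to_send, local_versions, remote_version_ids, size=4):
    index = build_index(local_versions, size)
    # Step 2: C = versions(S) ∩ versions(R), compared by identifier only.
    common = local_versions.keys() & remote_version_ids
    message = []
    # Step 3: replace chunks found in some version of C by a reference.
    for h, chunk in chunk_hashes(local_versions[to_send], size):
        ref = next(((v, off) for v, off in index[h] if v in common), None)
        if ref:
            message.append(("ref", ref[0], ref[1], len(chunk)))
        else:
            message.append(("data", chunk))
    return message

def decode_transfer(message, remote_versions):
    """At R: resolve references against versions R already stores."""
    out = bytearray()
    for item in message:
        if item[0] == "ref":
            _, vid, off, length = item
            out += remote_versions[vid][off:off + length]
        else:
            out += item[1]
    return bytes(out)
```

In this sketch hashes are used only inside S to build the index; no hash crosses the network, so the transfer needs no extra round-trip for hash exchange.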

SLIDE 13

What does redFS achieve?

Compare-by-hash

  • Advantage: detects both locally trackable and untrackable redundancy
  • Advantage: detects redundancy across any versions and/or objects
  • Shortcoming: additional round-trip
  • Shortcoming: limited precision

– smaller chunks may not compensate for hash-exchange and hash-lookup overheads

Delta encoding

  • Shortcoming: only detects locally trackable redundancy
  • Shortcoming: limited to pairs of versions
  • Advantage: high-precision local redundancy detection
  • Advantage: redundancy detection can occur ahead of transfer time
  • Advantage: simple protocol

SLIDE 15

What does redFS achieve?

Detectable forms of redundancy:

  • Delta encoding: locally trackable redundancy across pairs of consecutive versions of the same object
  • redFS: any locally trackable redundancy (≈ any redundancy)
  • Compare-by-hash: any redundancy

Protocol and precision:

  • Delta encoding, redFS: simpler protocol, able to detect finer-grained redundancy
  • Compare-by-hash: more complicated protocol, limited precision

SLIDE 16

Evaluation: methodology

  • We evaluated different solutions from every approach:

– redFS (full implementation)
– LBFS, rsync, TAPER (compare-by-hash)
– xdelta, svn (delta encoding)

  • Two distributed sites, networks with different bandwidths (3 Mbps to 100 Mbps)

  • Real workloads

– Single-writer scenarios
– Multi-writer scenarios

SLIDE 17

Evaluation: single-writer multi-reader scenarios

Site S (Writer): v0 of a set of files (e.g. gcc 3.3.1 source code), then v1 of a set of files (e.g. gcc 3.4.1 source code)
Site R (Reader): v0 of a set of files (e.g. gcc 3.3.1 source code); receives v1 (e.g. gcc 3.4.1 source code) from S with deduplication

Same methodology and workloads as in recent compare-by-hash papers (e.g. LBFS [SOSP'01], TAPER [FAST'05]).

All redundancy is locally trackable!

SLIDE 18

Evaluation: transferred volumes in single-writer multi-reader scenarios

  • redFS transfers fewer bytes than delta encoding (except for particularly suited workloads such as this one)
  • redFS transfers fewer bytes than all compare-by-hash solutions (or, in a few exceptions, a comparable amount)

SLIDE 19

Evaluation: performance in single-writer multi-reader scenarios

(Chart legend: redFS, LBFS, Plain, svn (delta encoding), rsync)

  • Best performance with low bandwidth (due to high deduplication efficiency)
  • Best performance with high bandwidth (due to protocol simplicity)

SLIDE 20

Evaluation: multi-writer scenario

Site S (Writer): v0 of working set, then vS of working set
Site R (Writer): v0 of working set, then vR of working set
(deduplication between the two sites)

Workloads from collaborative activity between groups of teachers and students during 1-semester courses.

Locally untrackable redundancy can now occur!

SLIDE 21

Evaluation: multi-writer scenario

Q: How much locally untrackable redundancy?
A: Only 1% to 4% of all redundancy generated over more than 3 months was locally untrackable

Advantages of redFS persist even in real scenarios where locally untrackable redundancy can occur (both in transferred volume and performance)

SLIDE 22

Conclusions

  • Compare-by-hash trades simplicity and efficiency for the ability to detect any redundancy

  • In relevant scenarios, most redundancy is locally trackable

– Including the same scenarios that most compare-by-hash papers use to motivate their work

  • By exclusively detecting locally trackable redundancy we can outperform compare-by-hash

– One forgotten example: delta encoding

  • Our contribution: the redFS algorithm

– Fully implemented as an open-source distributed file system
– Evaluation shows it outperforms both compare-by-hash and delta encoding

SLIDE 23

Thank you.

Questions?

www.gsd.inesc-id.pt/~jpbarreto

SLIDE 24

OK, but at what storage cost?

(Chart legend: plain storage; deduplicated storage with 128-byte chunks; deduplicated storage with 2 KB chunks; deduplicated storage with 8 KB chunks)