

SLIDE 1

Cumulus: Filesystem Backup to the Cloud

7th USENIX Conference on File and Storage Technologies (FAST ’09)

Michael Vrable, Stefan Savage, Geoffrey M. Voelker
University of California, San Diego

February 26, 2009

Vrable, Savage, Voelker (UCSD) Cumulus: Filesystem Backup to the Cloud February 26, 2009 1 / 19

SLIDE 2

Introduction

◮ Cloud computing is an important emerging area, with a spectrum of implementations
  ◮ “Thick” cloud: purchase a complete, integrated service from a provider
    ◮ Potentially greater efficiencies
    ◮ Easier to set up
  ◮ “Thin” cloud: the customer builds an application on more generic services
    ◮ More choices among service providers
    ◮ Easier to migrate between providers
    ◮ Potentially lower costs
◮ The thin cloud offers some advantages, particularly for applications such as backup
◮ How well can we do with such a simple interface?


SLIDE 5

Cumulus: Background and Requirements

◮ Network backup: functionality
  ◮ Implement backup over a network to provide easy off-site storage
  ◮ Store snapshots of file data at multiple points in time
  ◮ Allow recovery of selected files or an entire snapshot
◮ System requirements
  ◮ Build on a thin-cloud model: simple storage interface only
  ◮ Storage layer need only support put/get of blobs of data, list, and delete
  ◮ Implies that application logic must be built into the client
  ◮ Focus is on cloud storage, but the target could be an FTP server, a friend’s computer, a P2P network, . . .
◮ Goals
  ◮ Minimize resource requirements (storage, network)
  ◮ Minimize ongoing monetary costs
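The put/get/list/delete interface above is small enough to sketch directly. The class below is an illustrative in-memory stand-in for a thin-cloud provider, not Cumulus’s actual storage backend; the method names simply mirror the four operations listed:

```python
class BlobStore:
    """Minimal 'thin cloud' storage interface: put/get/list/delete.

    In-memory stand-in for a real provider such as Amazon S3.  All
    backup logic (snapshots, aggregation, cleaning) must live in the
    client, because this is all the server offers.
    """
    def __init__(self):
        self._blobs = {}

    def put(self, name, data):
        # Store an opaque blob under a name; the server never
        # interprets its contents.
        self._blobs[name] = bytes(data)

    def get(self, name):
        return self._blobs[name]

    def list(self):
        return sorted(self._blobs)

    def delete(self, name):
        del self._blobs[name]

store = BlobStore()
store.put("snapshot-root", b"root metadata")
assert store.list() == ["snapshot-root"]
```

Because nothing richer than this is assumed, any backend exposing these four operations (S3, an FTP server, a friend’s machine) can serve as the storage layer.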


SLIDE 9

Cumulus Backup Format

[Diagram: Monday and Tuesday snapshot roots point to metadata entries (mbox, paper, photos/A, photos/B; updated mbox′, paper′), which point to data blocks (paper1, mbox1, photoA, photoB, paper2, mbox2); blocks are labeled as Monday-only, Tuesday-only, or shared]

◮ Stores filesystem snapshots at multiple points in time
◮ Data blocks are shared within and between snapshots
◮ Minimizes the storage and upload bandwidth needed
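The block sharing between the Monday and Tuesday snapshots in the diagram can be modeled in a few lines. This is an illustrative sketch, assuming content-hash block naming; it is not the exact on-wire format Cumulus uses:

```python
import hashlib

def block_id(data):
    # Name blocks by content hash so identical blocks are stored
    # once and shared within and between snapshots.
    return hashlib.sha1(data).hexdigest()

def take_snapshot(files, stored_blocks):
    """Record one snapshot; returns a root mapping file -> block ids.

    stored_blocks maps block id -> data and is shared across
    snapshots, so only genuinely new blocks are uploaded.
    """
    root = {}
    for name, data in files.items():
        bid = block_id(data)
        stored_blocks.setdefault(bid, data)   # upload only if new
        root[name] = [bid]
    return root

blocks = {}
monday = take_snapshot({"mbox": b"mail v1", "paper": b"draft v1"}, blocks)
tuesday = take_snapshot({"mbox": b"mail v2", "paper": b"draft v1"}, blocks)
# paper is unchanged, so Tuesday's snapshot shares Monday's block:
assert tuesday["paper"] == monday["paper"]
assert len(blocks) == 3   # mail v1, mail v2, one shared draft
```

Both snapshot roots remain fully restorable, yet the unchanged `paper` block was neither re-uploaded nor stored twice.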

SLIDE 10

Aggregation: Minimizing Per-Block Costs

[Diagram: the Tuesday snapshot’s blocks (paper2, mbox2, mbox′, paper′) packed together into segments]

◮ May have per-file costs in addition to per-byte costs
  ◮ Protocol overhead: slower backups from more transactions
  ◮ Per-file overhead at the storage server
  ◮ May be exposed as a monetary cost by the provider
◮ Cumulus reduces these costs by aggregating blocks into segments before storage
◮ Aggregation follows from our constraints, but may not be needed in other systems
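Aggregating small blocks into segments before upload can be sketched as a greedy packer. This is illustrative Python, not the Cumulus code; the 1 MB default simply matches one of the segment sizes studied in the evaluation:

```python
def pack_segments(blocks, segment_size=1 << 20):
    """Greedily pack variable-size blocks into segments near a target
    size, so a single upload (one per-operation charge, one protocol
    round trip) covers many blocks."""
    segments, current, used = [], [], 0
    for block in blocks:
        if current and used + len(block) > segment_size:
            segments.append(b"".join(current))   # seal current segment
            current, used = [], 0
        current.append(block)
        used += len(block)
    if current:
        segments.append(b"".join(current))
    return segments

# 40 blocks of 100 kB each pack ten-per-segment at a 1 MB target:
segs = pack_segments([b"x" * 100_000] * 40, segment_size=1_000_000)
assert len(segs) == 4
```

With per-operation pricing, uploading 4 segments instead of 40 blocks cuts the operation charge tenfold while transferring the same bytes.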


SLIDE 14

Aggregation Challenges: Internal Fragmentation

[Diagram: over Days 1–3, segments gradually accumulate dead blocks as files change; on Day 4, new data is written and surviving live blocks are repacked into fresh segments]

◮ Wasted space within segments is reclaimed by segment cleaning
◮ Tradeoff: space vs. upload bandwidth
◮ Contribution: show how to tune segment size and the threshold for cleaning
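The cleaning decision described above amounts to a utilization check per segment. This is an illustrative sketch; the 0.5 default threshold is just an example inside the range the later evaluation explores:

```python
def segments_to_clean(segment_usage, threshold=0.5):
    """Select segments whose utilization (live bytes / total bytes)
    has fallen below the threshold; their live blocks get repacked
    into new segments and the old segments are deleted.

    A low threshold cleans rarely (saves upload bandwidth, wastes
    storage); a high threshold cleans aggressively (saves storage,
    costs bandwidth and churns blocks shared by old snapshots).
    """
    return [name for name, (live, total) in segment_usage.items()
            if total and live / total < threshold]

usage = {"seg1": (900, 1000),   # mostly live: keep as-is
         "seg2": (300, 1000),   # mostly dead: repack
         "seg3": (0, 1000)}     # fully dead: reclaim
assert segments_to_clean(usage, threshold=0.5) == ["seg2", "seg3"]
```

The space-vs-bandwidth tradeoff on the slide is exactly the choice of this threshold.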

SLIDE 15

Cumulus Implementation

◮ Implemented as ≈ 4000 lines of C++ and Python
◮ Execution packages new data into segments and uploads them to the storage server
◮ Client tracks some data locally (not essential for restores):
  ◮ Block hash database
  ◮ Previous snapshot metadata (to detect changed files)
◮ Other features:
  ◮ Compression/encryption
  ◮ Sub-file incremental updates
◮ More details in the paper
◮ In real use: I have been using it for over 18 months
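Detecting changed files against the previous snapshot’s metadata can be sketched as a size-and-mtime comparison. The exact signature Cumulus compares is an assumption here; this illustrative version only shows the shape of the check (the real client additionally consults its block hash database for sub-file updates):

```python
import os

def changed_files(paths, prev_meta):
    """Return the files whose (size, mtime) signature differs from the
    previous snapshot's metadata, plus the new metadata map.

    Only changed files need to be re-read and re-uploaded; unchanged
    ones simply reuse their existing block references.
    """
    new_meta, changed = {}, []
    for path in paths:
        st = os.stat(path)
        sig = (st.st_size, st.st_mtime)
        new_meta[path] = sig
        if prev_meta.get(path) != sig:
            changed.append(path)
    return changed, new_meta
```

On the first run every file is “changed”; on subsequent runs with the saved metadata, an unchanged tree produces an empty list, which is what makes daily backups cheap.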

SLIDE 16

Evaluation

Key questions:

◮ What is the resource (network, storage) overhead imposed by the restricted storage interface?
◮ How do these overheads translate into monetary terms?
◮ How can aggregation and cleaning be tuned to minimize the cost?
◮ How does the prototype perform?

SLIDE 17

Evaluation Traces

                     Fileserver    User
Duration (days)      157           223
Entries              26,673,083    122,007
Files                24,344,167    116,426

File sizes
  Median             0.996 KB      4.4 KB
  Average            153 KB        21.4 KB
  Maximum            54.1 GB       169 MB
  Total              3.47 TB       2.37 GB

Update rates
  New data/day       9.50 GB       10.3 MB
  Changed data/day   805 MB        29.9 MB
  Total data/day     10.3 GB       40.2 MB


SLIDE 19

Backup Simulation

◮ Compare against optimal backup performance:
  ◮ All unique data must be stored at the server
  ◮ All new data must be transferred over the network
◮ In simulation, compare Cumulus against these baseline values
◮ Consider the effect of aggregation and cleaning parameters
◮ For simplicity, ignore compression and metadata
  ◮ Effects discussed in the paper
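The overhead metric used in the following plots compares actual resource use against this optimal baseline. A minimal sketch (illustrative Python; the example numbers are made up, not from the traces):

```python
def overhead_vs_optimal(actual_bytes, optimal_bytes):
    """Percent overhead relative to the optimal baseline.

    For transfer, optimal_bytes is all new data; for storage, it is
    all unique data.  Anything beyond that (duplicated blocks,
    repacked segments from cleaning) counts as overhead.
    """
    return 100.0 * (actual_bytes - optimal_bytes) / optimal_bytes

# e.g. uploading 45 MB/day when only 40 MB/day is new data:
assert overhead_vs_optimal(45_000_000, 40_000_000) == 12.5
```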

SLIDE 20

Is Cleaning Necessary?

[Plot: storage utilization (0.5–1.0) vs. time (0–200 days), with and without cleaning]

◮ Without segment cleaning, storage utilization steadily decreases
◮ Weekly cleaning keeps overhead within a narrow range
◮ Exact overhead depends on cleaning parameters

SLIDE 21

How Much Data is Transferred?

[Plot: overhead vs. optimal (%) and raw size (MB/day) as a function of cleaning threshold, for segment sizes of 128 kB, 512 kB, 1 MB, 4 MB, and 16 MB]

◮ Aggressive cleaning and large segments increase overhead


SLIDE 23

What is the Storage Overhead?

[Plot: overhead vs. optimal (%) and raw size (GB) as a function of cleaning threshold, for segment sizes of 128 kB, 512 kB, 1 MB, 4 MB, and 16 MB]

◮ Large segments increase overhead
◮ Too little cleaning leads to large overheads
◮ Aggressive cleaning leads to churn and storage overhead when keeping multiple snapshots


SLIDE 25

Estimating Ongoing Backup Costs

◮ How do storage and upload translate into a total cost for implementing backup?
◮ Amazon S3 prices:
  Storage:   $0.15 per GB · month
  Upload:    $0.10 per GB
  Operation: $0.01 per 1000 uploads
◮ Effects of varying costs discussed in the paper
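At these prices the monthly cost of a backup workload is simple arithmetic. The rates below are the S3 prices quoted on the slide; the example workload numbers are made up for illustration:

```python
def monthly_cost(stored_gb, uploaded_gb, upload_ops,
                 storage_rate=0.15, transfer_rate=0.10, op_rate=0.01):
    """Monthly cost at the quoted Amazon S3 prices:
    $0.15 per GB-month stored, $0.10 per GB uploaded,
    $0.01 per 1000 upload operations."""
    return (stored_gb * storage_rate
            + uploaded_gb * transfer_rate
            + upload_ops / 1000 * op_rate)

# e.g. 3 GB stored, 1.2 GB uploaded, 30 daily backups of 20 segments:
cost = monthly_cost(stored_gb=3.0, uploaded_gb=1.2, upload_ops=600)
assert abs(cost - 0.576) < 1e-9   # $0.45 + $0.12 + $0.006
```

Note how the per-operation term is tiny once blocks are aggregated into segments, while the storage term dominates, matching the simulation summary later.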

SLIDE 26

What Settings Minimize Total Cost?

[Plot: cost increase vs. optimal (%) and cost ($/month) as a function of cleaning threshold, for segment sizes of 128 kB, 512 kB, 1 MB, 4 MB, and 16 MB]

◮ Aggressive cleaning and large segments increase overhead
◮ Total cost includes a per-segment charge, so an intermediate segment size is best
◮ A cleaning threshold of 0.4–0.6 and a segment size of 0.5–1 MB work well


SLIDE 28

Simulation Summary

◮ Storage cost dominates (> 75% in this trace)
◮ Cost is not overly sensitive to aggregation and cleaning settings
◮ Cost is within 5–10% of the best we could expect
◮ Implications for integrated backup?

SLIDE 29

Prototype Evaluation

◮ Tested the full prototype using backups from two months of the user trace
  ◮ Snapshots were stored properly and could be restored
  ◮ Ongoing costs come out to $0.24/month for around 2 GB of data
◮ Compared with two existing tools for Amazon S3
  ◮ Brackup and JungleDisk: two other tools capable of filesystem backup to S3
  ◮ Their monthly costs are 19–200% more
  ◮ But those systems are designed for more than just backup, or are not explicitly tuned for cost
◮ What about the thick cloud?
  ◮ Mozy: an integrated online backup solution
  ◮ $5/month for “unlimited” backups
  ◮ $0.50/GB/month for businesses

SLIDE 30

Summary

◮ Cumulus is a cost-effective tool for backup to network storage
◮ We show how system parameters can be tuned to minimize total cost
◮ Shows that a specialized server is not necessary for implementing low-overhead backup
◮ Users can choose from a variety of storage providers based on cost or other factors

SLIDE 31

Questions?

Cumulus is available at http://sysnet.ucsd.edu/projects/cumulus/

SLIDE 32

Deduplication

◮ The Cumulus implementation does perform coarse-grained data deduplication
  ◮ Recognizes duplicate data at the file or 1 MB block level
  ◮ Block boundaries for deduplication are fixed
◮ Deduplication is only for a single client, not across clients
  ◮ Server-side support could enable deduplication across clients
  ◮ But it doesn’t work well with aggregation into segments
  ◮ It slightly reduces the privacy of backups
  ◮ It complicates accounting
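Fixed-boundary block deduplication can be sketched as follows. This is illustrative Python, not the Cumulus implementation; it shows why repeated 1 MB-aligned blocks are stored once, while (as the code comment notes) data shifted by an insertion would not match:

```python
import hashlib

def dedup_blocks(data, seen=None, block_size=1 << 20):
    """Split data at fixed 1 MB boundaries and hash each block; a
    block whose hash was already seen is referenced, not re-stored.

    Fixed boundaries are simple and cheap, but an insertion early in
    a file shifts every later boundary, so shifted duplicates are
    missed (unlike content-defined chunking).
    """
    if seen is None:
        seen = set()
    stored = 0
    for off in range(0, len(data), block_size):
        digest = hashlib.sha1(data[off:off + block_size]).digest()
        if digest not in seen:
            seen.add(digest)
            stored += 1
    return stored, seen

seen = set()
stored, seen = dedup_blocks(b"A" * (3 << 20), seen=seen)  # 3 identical blocks
assert stored == 1   # only one copy of the repeated 1 MB block is kept
```

Because `seen` here belongs to one client, this also illustrates the single-client limitation on the slide: two clients with identical data would each store their own copy.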