Fast and Secure Laptop Backups with Encrypted De-duplication - PowerPoint PPT Presentation




SLIDE 1

The University of Edinburgh

Le Zhang

<zhang.le@ed.ac.uk>

Paul Anderson

<dcspaul@ed.ac.uk>

LISA 2010

Fast and Secure Laptop Backups

with Encrypted De-duplication

SLIDE 2

Laptop Backup Options

External Hard Drive

No offsite storage? What if I have a break-in? Or there is a fire? I need a very large capacity to handle archival storage as well ...

SLIDE 3

Laptop Backup Options

Recordable CD/DVD

I have to make multiple copies if I want offsite storage ... DVDs are only small - I can only back up subsets of files ...

SLIDE 4

Laptop Backup Options

Cloud Storage

Broadband upload speeds are slow - 30 DAYS to upload 300 GB to cloud storage is typical ... Often there is a transfer cost as well as a storage cost ...
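The "30 days" figure can be sanity-checked with simple arithmetic. The ~1 Mbit/s uplink below is an assumed, typical-for-the-era residential upload speed, not a number from the slides:

```python
# Rough check: how long does 300 GB take over a ~1 Mbit/s uplink?
size_bits = 300e9 * 8          # 300 GB expressed in bits
uplink_bps = 1e6               # assumed 1 Mbit/s upload speed
days = size_bits / uplink_bps / 86400
print(round(days, 1))          # about 27.8 days, i.e. roughly a month
```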

SLIDE 5

Laptop Backup Options

External Hard Drive Recordable CD/DVD Cloud Storage

SLIDE 6

What do people do?

[Pie chart - 11%, 5%, 16%, 33%, 25%, 11%: store no vital data / regular full backups / partial backups / keep copy on University machine / don't do backups / don't use laptop]

When people bother keeping backups, they are mostly ad-hoc - and usually only involve hand-selected subsets

SLIDE 7

What kind of data?

[Pie chart - 29%, 8%, 63%: user files / applications / system files]

Perhaps a lot of the system files and application files (at least) are common?

From our sample of academic Mac laptop users

SLIDE 8

Shared Data

It seems like there is a good deal of duplication among the system and application files, and this increases with the number of machines. But it is interesting that a good many files are not common! So is it a good idea not to back up these categories?

[Plots: SYS Storage Saving and APP Storage Saving vs number of machines added - actual storage vs saved storage (TB)]

SLIDE 9

Shared Data

Obviously, there is less sharing among the user data

  • but the overall saving is still significant

And we might expect a higher degree of sharing among the user data for different communities - for example, common music files would make a big difference ...

[Plots: USR Storage Saving and Overall Storage Saving vs number of machines added - actual storage vs saved storage (TB)]

SLIDE 10

Deduplication

“Deduplication” is becoming very popular for saving space when storing multiple copies of the same file:

  • A “hash” (digital signature) is generated from the contents of the file
  • Two files with the same content will have the same hash
  • Two files with different contents have a very high chance of having different hashes
  • Use the hash as the name of the stored file
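The hash-as-name scheme above can be sketched in a few lines. This is a minimal illustration, not the prototype's code; the `store_file` name and the dict-backed store are invented here:

```python
import hashlib

def store_file(blob: bytes, storage: dict) -> str:
    """Store a blob under the hex digest of its contents: identical
    contents collide on purpose, so duplicates are stored only once."""
    name = hashlib.sha256(blob).hexdigest()
    storage[name] = blob      # a repeat write of the same content is a no-op
    return name

storage = {}
a = store_file(b"same bytes", storage)
b = store_file(b"same bytes", storage)
assert a == b and len(storage) == 1   # two writes, one stored object
```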

SLIDE 11

Block sizes

[Plots: file size distribution (log10 domain); (a) data duplication rate vs block size, (b) actual storage needed vs block size, (c) number of backup objects (all objects vs stored objects) vs block size - block sizes 128K, 256K, 512K, 1024K, and whole file]

Deduplicating at the block level is more efficient than the file level. What is an appropriate block size?
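Block-level deduplication can be sketched as follows - a toy illustration assuming fixed-size chunking (the `chunk` and `dedup_ratio` names are invented here):

```python
import hashlib

BLOCK_SIZE = 128 * 1024   # 128 KB, the smallest block size compared above

def chunk(data: bytes, size: int = BLOCK_SIZE) -> list:
    """Split data into fixed-size blocks (the last one may be short)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def dedup_ratio(files) -> float:
    """Fraction of blocks, across all files, that are duplicates."""
    total, unique = 0, set()
    for data in files:
        for block in chunk(data):
            total += 1
            unique.add(hashlib.sha256(block).digest())
    return 1 - len(unique) / total if total else 0.0
```

For example, two laptops each holding the same 300 KB file contribute six blocks in total but only two unique ones, giving a duplication rate of 2/3.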

SLIDE 12

Deduplication problems?

Most de-duplication systems work at the storage level. This has two problems in our application ...

  • If the data is encrypted “at source” (with different keys) then the deduplication is defeated (the cipher text will be different)
  • The full data still has to be transmitted to the “server” - and this transmission time is a more significant problem than the storage!

SLIDE 13

Convergent Encryption

“Convergent Encryption” neatly solves the first problem ...

  • Files are encrypted using the hash of the data as the key
  • Files containing the same data will encrypt to the same cipher text and hence deduplication continues to work
  • File owners will have the key (because they originally had the data) and will be able to decrypt the data - others won’t
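A minimal sketch of convergent encryption: the key is the hash of the plaintext, so identical files yield identical ciphertexts. The SHA-256 counter-mode keystream below is a toy stand-in for a real cipher such as AES-CTR and must not be used for actual security:

```python
import hashlib

def convergent_key(plaintext: bytes) -> bytes:
    """The encryption key is simply the hash of the content."""
    return hashlib.sha256(plaintext).digest()

def _keystream(key: bytes, n: int) -> bytes:
    """Toy SHA-256 counter-mode keystream -- a stand-in for a real cipher."""
    out, counter = bytearray(), 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])

def encrypt(plaintext: bytes) -> tuple:
    key = convergent_key(plaintext)
    ct = bytes(p ^ k for p, k in zip(plaintext, _keystream(key, len(plaintext))))
    return ct, key

def decrypt(ciphertext: bytes, key: bytes) -> bytes:
    return bytes(c ^ k for c, k in zip(ciphertext, _keystream(key, len(ciphertext))))

# Identical plaintexts produce identical ciphertexts, so the server can
# still deduplicate -- yet only holders of the key can decrypt.
c1, k1 = encrypt(b"shared system file")
c2, k2 = encrypt(b"shared system file")
assert c1 == c2 and decrypt(c1, k1) == b"shared system file"
```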

SLIDE 14

Managing keys

Each (unique) file now has a separate key which we need to manage:

  • Our solution creates a “data object” for each directory which contains the keys for the children, as well as their metadata
  • The directory object is then encoded and stored in the same way as a normal file
  • The user only has to record the key for the root object
  • Entire duplicate subtrees can be detected
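The directory-object scheme can be sketched as follows. This is an invented illustration, not the prototype's format: `put`/`put_dir` are hypothetical names, JSON stands in for the real encoding, and the encryption step is elided (a stored object would really hold ciphertext):

```python
import hashlib
import json

store = {}   # storage name -> stored object (encryption elided here)

def put(blob: bytes) -> tuple:
    """Store a blob; return (storage name, convergent key).
    Naming by hash-of-key keeps the key unrecoverable from the name."""
    key = hashlib.sha256(blob).digest()
    name = hashlib.sha256(key).hexdigest()
    store[name] = blob
    return name, key

def put_dir(tree: dict) -> tuple:
    """Store a directory as a 'data object' listing each child's storage
    name and key; recording only the root key then unlocks the whole tree."""
    entries = {}
    for child, node in sorted(tree.items()):
        name, key = put_dir(node) if isinstance(node, dict) else put(node)
        entries[child] = {"object": name, "key": key.hex()}
    return put(json.dumps(entries, sort_keys=True).encode())

root_name, root_key = put_dir({"etc": {"hosts": b"127.0.0.1 localhost"}})
# Identical subtrees serialize identically, so entire duplicate
# directories collapse to a single stored object.
```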

SLIDE 15

Avoiding Transmission

To avoid transmitting data which already exists on the server, we need to do the deduplication on the source system. Many services (e.g. Amazon) don’t provide the necessary interfaces for the client to communicate directly. There are several approaches to this, depending on the specific application ...

  • A private server
  • A local “caching” server for a remote cloud service
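The source-side check amounts to a "which of these hashes do you already hold?" round trip before any data is sent - a sketch, with `blocks_to_upload` and the set-membership query standing in for an RPC to a private or caching server:

```python
def blocks_to_upload(client_hashes, server_has) -> list:
    """Upload only the block hashes the server reports as missing;
    `server_has` stands in for a hash-check query on the server."""
    return [h for h in client_hashes if h not in server_has]

server_blocks = {"aaa", "bbb"}          # hashes already held by the server
assert blocks_to_upload(["aaa", "ccc"], server_blocks) == ["ccc"]
```

Only the missing block ("ccc") crosses the slow uplink; the duplicate costs one short message instead of a full transfer.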
SLIDE 16

A Prototype

[Architecture diagram: on the client, FS Events from the local disk feed a Backup Manager, which applies optional data compression and symmetric encryption (key generated from block content) and places encrypted blocks on an Upload Queue; upload threads send them to the Backup Server; a local Meta DB tracks the list of files to back up, metadata updates, and backup status]

A Mac OS X client. A local (departmental, home) server which performs hash checking, authentication and high-speed caching before forwarding unique blocks to the cloud.
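The client pipeline can be sketched as a queue fed by a backup routine and drained by an upload worker - a simplified, invented stand-in for the prototype's Backup Manager and Upload Queue (encryption elided; the dict plays the server):

```python
import hashlib
import queue
import threading
import zlib

upload_q = queue.Queue()
server = {}   # storage name -> payload; stands in for the backup server

def backup_block(block: bytes) -> None:
    """Compress a changed block and enqueue it under a name derived
    from its content key (the encryption step is elided for brevity)."""
    key = hashlib.sha256(block).digest()        # convergent key
    name = hashlib.sha256(key).hexdigest()      # storage name
    upload_q.put((name, zlib.compress(block)))

def uploader() -> None:
    """Drain the queue; a duplicate name costs no extra storage."""
    while True:
        item = upload_q.get()
        if item is None:
            break
        name, payload = item
        server.setdefault(name, payload)

t = threading.Thread(target=uploader)
t.start()
backup_block(b"changed file contents")
backup_block(b"changed file contents")   # duplicate: stored only once
upload_q.put(None)                       # sentinel to stop the worker
t.join()
assert len(server) == 1
```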

SLIDE 17

Where next?

Performance depends heavily on the characteristics of the data itself, and the underlying network/storage (e.g. latency)

  • We would like to study this more

We would like to develop a production-quality client, and investigate a possible service in a datacentre

  • we are looking for possible funding/partners
SLIDE 18

The University of Edinburgh

Le Zhang

<zhang.le@ed.ac.uk>

Paul Anderson

<dcspaul@ed.ac.uk>

LISA 2010

Fast and Secure Laptop Backups

with Encrypted De-duplication