The University of Edinburgh
Fast and Secure Laptop Backups with Encrypted De-duplication
Le Zhang <zhang.le@ed.ac.uk>
Paul Anderson <dcspaul@ed.ac.uk>
LISA 2010
Laptop Backup Options
External Hard Drive
No offsite storage? What if I have a break-in? Or there is a fire? I need a very large capacity to handle archival storage as well ...
Laptop Backup Options
Recordable CD/DVD
I have to make multiple copies if I want offsite storage ... DVDs are only small - I can only back up subsets of files ...
Laptop Backup Options
Cloud Storage
Broadband upload speeds are slow - 30 DAYS to upload 300GB to cloud storage is typical ... Often, there is a transfer cost as well as a storage cost ...
Laptop Backup Options
External Hard Drive
Recordable CD/DVD
Cloud Storage
What do people do?
[Pie chart of survey responses: Store no vital data, Regular full backups, Partial backups, Keep copy on University machine, Don't do backups, Don't use laptop - 11%, 5%, 16%, 33%, 25%, 11%]
When people bother keeping backups, they are mostly ad-hoc - and usually only involve hand-selected subsets
What kind of data?
[Pie chart of data breakdown: User files, Applications, System files - 29%, 8%, 63%]
Perhaps a lot of the system files and application files (at least) are common?
From our sample of academic Mac laptop users
Shared Data
It seems like there is a good deal of duplication among the system and application files, and this increases with the number of machines. But it is interesting that a good many files are not common! So is it a good idea not to back up these categories?
[Plots: SYS storage saving and APP storage saving vs number of machines added - actual storage (TB) and saved storage (TB)]
Shared Data
Obviously, there is less sharing among the user data - but the overall saving is still significant. And we might expect a higher degree of sharing among the user data for different communities - for example, common music files would make a big difference ...
[Plots: USR storage saving and overall storage saving vs number of machines added - actual storage (TB) and saved storage (TB)]
Deduplication
“Deduplication” is becoming very popular for saving space when storing multiple copies of the same file. A “hash” (digital signature) is generated from the contents of the file. Two files with the same content will have the same hash. Two files with different contents have a very high chance of having different hashes. Use the hash as the name of the stored file.
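As a rough illustration of the idea, here is a minimal sketch of hash-addressed storage in Python; the store path and the choice of SHA-256 are assumptions for the example, not details taken from the prototype.

    import hashlib
    import os

    STORE = "/var/backup/objects"   # hypothetical object store location

    def store_file(path):
        """Store a file under the hash of its contents.

        Two files with identical contents map to the same object name,
        so the second copy costs no extra space.
        """
        with open(path, "rb") as f:
            data = f.read()
        name = hashlib.sha256(data).hexdigest()   # hash acts as the object name
        dest = os.path.join(STORE, name)
        if not os.path.exists(dest):              # already stored: deduplicated
            with open(dest, "wb") as out:
                out.write(data)
        return name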
Block sizes
[Plot: file size distribution (in log10 domain), frequency vs file size from 10 bytes to 10GB]
[Plots vs block size (128K, 256K, 512K, 1024K, whole file): a. data duplication rate (%); b. actual storage needed (TB); c. number of backup objects (millions), all objects vs stored objects]
Deduplicating at the block level is more efficient than at the file level. What is an appropriate block size?
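The prototype's exact chunking scheme isn't shown here; the following is a minimal sketch of fixed-size block deduplication, assuming a 128K block size purely for illustration. A smaller block finds more duplicate data but produces more objects to track - the trade-off the plots above explore.

    import hashlib

    BLOCK_SIZE = 128 * 1024   # assumed block size; the right value is an empirical trade-off

    def file_blocks(path):
        """Split a file into fixed-size blocks and hash each one.

        Returns a list of (hash, block) pairs; the ordered list of hashes
        is enough to reconstruct the file from a block store.
        """
        blocks = []
        with open(path, "rb") as f:
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                blocks.append((hashlib.sha256(block).hexdigest(), block))
        return blocks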
Deduplication problems?
Most de-duplication systems work at the storage level. This has two problems in our application: (1) if the data is encrypted “at source” (with different keys) then the deduplication is defeated (the cipher text will be different); (2) the full data still has to be transmitted to the “server” - and this time is a more significant problem than the storage!
Convergent Encryption
“Convergent Encryption” neatly solves the first problem ... Files are encrypted using the hash of the data as the key. Files containing the same data will encrypt to the same cypher text, and hence deduplication continues to work. File owners will have the key (because they originally had the data) and will be able to decrypt the data - others won't.
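A minimal sketch of the convergent encryption idea, using the `cryptography` package and assuming a deterministic IV derived from the content key; the prototype's actual construction may differ.

    import hashlib
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    def convergent_encrypt(plaintext: bytes):
        """Encrypt data with a key derived from its own content.

        Returns (object_id, key, ciphertext). Identical plaintexts always
        yield identical ciphertexts, so deduplication still works.
        """
        key = hashlib.sha256(plaintext).digest()      # content-derived key
        iv = hashlib.sha256(key).digest()[:16]        # deterministic IV (assumption)
        enc = Cipher(algorithms.AES(key), modes.CTR(iv)).encryptor()
        ciphertext = enc.update(plaintext) + enc.finalize()
        object_id = hashlib.sha256(ciphertext).hexdigest()  # store under this name
        return object_id, key, ciphertext

    def convergent_decrypt(key: bytes, ciphertext: bytes) -> bytes:
        iv = hashlib.sha256(key).digest()[:16]
        dec = Cipher(algorithms.AES(key), modes.CTR(iv)).decryptor()
        return dec.update(ciphertext) + dec.finalize()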
Managing keys
Each (unique) file now has a separate key which we need to manage. Our solution creates a “data object” for each directory which contains the keys for the children, as well as their metadata. The directory object is then encoded and stored in the same way as a normal file. The user only has to record the key for the root object.
Entire duplicate subtrees can be detected
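For illustration, a directory object might look like the sketch below; the field names and JSON encoding are assumptions, not the prototype's actual on-disk format.

    import json

    def make_directory_object(children):
        """Build a directory "data object" listing each child's key and metadata.

        `children` maps a name to a dict such as
        {"type": "file", "key": "<hex key>", "object_id": "<hash>", "size": 123, "mtime": 0}.
        The encoded object is then convergently encrypted and stored exactly like an
        ordinary file, so identical subtrees deduplicate to a single object and the
        user only needs to remember the root key.
        """
        return json.dumps({"entries": children}, sort_keys=True).encode()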
Avoiding Transmission
To avoid transmitting data which already exists on the server, we need to do the deduplication on the source system.
Many services (e.g. Amazon) don't provide the necessary interfaces for the client to communicate directly. There are several approaches to this, depending on the specific application, as sketched below ...
- A private server
- A local “caching” server for a remote cloud service
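As an illustration of source-side deduplication against a private server, the client could ask which block hashes the server already holds before uploading; the endpoint names and wire format below are hypothetical, not an interface described in the talk.

    import requests   # hypothetical HTTP interface to a private backup server

    SERVER = "https://backup.example.org"   # placeholder URL

    def upload_missing_blocks(blocks):
        """Send only the blocks the server does not already have.

        `blocks` is a list of (hash, data) pairs from the chunking step.
        """
        hashes = [h for h, _ in blocks]
        # Ask the server which hashes it is missing (hypothetical endpoint).
        missing = set(requests.post(f"{SERVER}/have", json=hashes).json()["missing"])
        for h, data in blocks:
            if h in missing:
                requests.put(f"{SERVER}/block/{h}", data=data)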
A Prototype
[Architecture diagram: FS events from the local disk drive a Backup Manager, which applies optional data compression and symmetric encryption with a key generated from block content; the list of files to back up and backup status updates are tracked in a local meta DB, and upload threads send encrypted blocks and metadata updates from the upload queue to the backup server]
A Mac OS X client. A local (departmental, home) server which performs hash checking, authentication and high-speed caching before forwarding unique blocks to the cloud.
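The caching server's behaviour could be sketched roughly as below; the class shape and the cloud `put` call are assumptions for illustration, not the prototype's code.

    class CachingServer:
        """Local server that deduplicates and caches blocks before the cloud.

        A block is forwarded to the remote cloud store only the first time
        its hash is seen, so the slow upstream link carries unique data only.
        """
        def __init__(self, cloud):
            self.cloud = cloud        # object with a put(block_hash, data) method (assumed)
            self.seen = set()         # hashes already cached/forwarded
            self.cache = {}           # local high-speed cache of block data

        def receive_block(self, block_hash, data):
            if block_hash in self.seen:
                return                 # duplicate: nothing to store or forward
            self.seen.add(block_hash)
            self.cache[block_hash] = data
            self.cloud.put(block_hash, data)   # forward the unique block upstream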
Where next?
Performance depends heavily on the characteristics of the data itself and the underlying network/storage (e.g. latency)
- We would like to study this more
We would like to develop a production-quality client, and investigate a possible service in a datacentre
- we are looking for possible funding/partners