Design & Implementation of a Portable File Synchronisation Mechanism for a Cloud Storage Environment

SLIDE 1

National Technical University of Athens

Design & Implementation of a Portable File Synchronisation Mechanism for a Cloud Storage Environment

Supervisor

  • Prof. Nektarios Koziris

Assistant Supervisor

  • Dr. Vangelis Koukis

Candidate

  • Vasilis Gerakaris

2/9/2015

SLIDE 2

Table of Contents

  • Introduction
  • Design & Implementation
      • Syncing algorithm
      • Core Classes / API
  • Optimisations
      • Request Queuing
      • Directory Monitoring
      • Local Block Storage
      • Local deduplication - FUSE
  • Comparison with existing software
  • Future Work

2 of 21 A Portable File Synchronisation Mechanism for a Cloud Storage Environment . . . . . .

SLIDE 4

Introduction

(i) - The problem

File Synchronisation: The process of updating files in two or more different locations, following certain rules.

Why is it needed?

  • Copying files between different computers
  • Backups

Important Qualities

✓ Needs to detect & handle update conflicts/renames/deletions
✓ Needs to be reliable (no errors)
✓ Needs to be efficient

SLIDE 5

Introduction

(i) - The problem (cont)

File Synchronisation: The process of updating files in two or more different locations, following certain rules.

Software designed for that purpose already exists, namely:

  • rsync
  • ownCloud
  • Dropbox
  • Google Drive

We focus on a more specific aspect of the problem.

SLIDE 6

Large Similar Files

(i) - Definition

What are they? Files that satisfy the following two requirements:

  • Are large in size (several GBs)
  • Have a lot of their data in common

Examples: VM images, VM snapshots

Why are they important? Many VMs are being used on cloud service providers (Amazon AWS, ~okeanos, etc.) and there should be a way to efficiently synchronise their images and snapshots.

SLIDE 7

Large Similar Files

(ii) - Definition (cont)

[Diagram: User A and User B interacting with the Object Storage Service (Files, Custom image file, Image File, Snapshot File) and the Compute Service (VMs). Arrow labels: Upload, Register, Clone, Connect, Snapshot, Store.]

SLIDE 8

Large Similar Files

(iii) - Definition (cont)

[Figure: three successive snapshots: Snapshot t0, Snapshot t1, Snapshot t2]

We can use these similarities to optimise the synchronisation!

SLIDE 10

Syncing algorithm

(i) - Modification detection

Modification detection: Comparison of hash digests

✓ Reliable
✗ Very slow, especially on large files

Faster alternative: Use last modification time as an indicator.

Why we need history data: Need to know what to do in the following cases:

  • File exists on both locations and is different
  • File exists on A but not on B (or vice-versa)

SLIDE 11

Syncing algorithm

(ii) - Initial algorithm

Time T1            Time T2            Change
Does not Exist     Exists             Created
Exists             Does not Exist     Deleted
Exists (ETag = J)  Exists (ETag = J)  No Change
Exists (ETag = J)  Exists (ETag = K)  Modified

(a) File change detection between two points in time

File replica A       File replica B       Action
No Change            No Change            No Action
Created (ETag = J)   Created (ETag = J)   No Action
Created (ETag = J)   Created (ETag = K)   Merge∗
Deleted              Deleted              No Action
Deleted              No Change            Delete B
Modified             No Change            Update B
Modified (ETag = J)  Modified (ETag = K)  Merge∗

(b) Syncing actions based on file states
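
The two tables can be read as a pair of lookup functions. The following is a minimal sketch, not the framework's code: the names detect_change and sync_action are illustrative, a missing file is represented by None, and the rules of table (b) are assumed to be symmetric in A and B (the slide lists one direction only). Combinations that the table does not list are returned as "Unspecified" rather than guessed.

```python
from typing import Optional


def detect_change(etag_t1: Optional[str], etag_t2: Optional[str]) -> str:
    """Table (a): classify one replica between two points in time.

    None stands for "does not exist" (an assumption of this sketch).
    """
    if etag_t1 is None and etag_t2 is None:
        return "No Change"  # not in the table; nothing existed at either time
    if etag_t1 is None:
        return "Created"
    if etag_t2 is None:
        return "Deleted"
    return "No Change" if etag_t1 == etag_t2 else "Modified"


def sync_action(change_a: str, etag_a: Optional[str],
                change_b: str, etag_b: Optional[str]) -> str:
    """Table (b): pick a syncing action from the two replica states."""
    if change_a == change_b and change_a in ("No Change", "Deleted"):
        return "No Action"
    if change_a == change_b and change_a in ("Created", "Modified"):
        # Same ETag on both sides means the same content. The table lists this
        # row only for "Created"; applying it to "Modified" is an assumption.
        return "No Action" if etag_a == etag_b else "Merge"
    if change_b == "No Change":
        return {"Deleted": "Delete B", "Modified": "Update B"}.get(change_a, "Unspecified")
    if change_a == "No Change":
        # Mirror image of the rows above (assumed symmetric).
        return {"Deleted": "Delete A", "Modified": "Update A"}.get(change_b, "Unspecified")
    return "Unspecified"
```

For example, a file modified on A and untouched on B gives sync_action("Modified", "K", "No Change", "J") == "Update B".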

SLIDE 12

Syncing algorithm

(iii) - What we propose

Limitations

✗ Can't detect renames (or worse, renames & modifications)

Our solution for syncing with a central metadata server

  • Store the metadata of all files, as they were during the last successful sync, in a local state database (StateDB).
  • Reconcile local directory replicas (Local) and remote server replicas (Remote) in three steps:
      1. Detect updates from Local Directory
      2. Detect updates from StateDB
      3. Detect updates from Remote Directory
  • Conflicting updates are resolved first-come, first-served (FCFS), with conflicting copies being renamed.

SLIDE 13

3-step synchronisation

(i) - Updates from Local Directory

[Flowchart: classification of each file in the Local Directory against the StateDB and Remote]

Decisions:

  • phash exists in StateDB?
  • Local modtime == StateDB modtime?
  • inode exists in StateDB?
  • File exists on Remote? (asked on two branches)
  • StateDB ETag == Remote ETag?

Outcomes: No local change, Local modified (two branches), Renamed, New local file, Conflict (two branches)
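
The extracted flowchart no longer shows which branch leads to which outcome. The sketch below is one plausible wiring of the decisions and outcomes named on this slide; the branch order, the Entry type and the function name classify_local are assumptions, not the author's confirmed logic.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Entry:
    """Reduced stand-in for FileStat: only the fields these checks need."""
    phash: int
    inode: int
    modtime: int
    etag: Optional[str] = None


def classify_local(local: Entry,
                   state: Dict[int, Entry],
                   remote: Dict[int, Entry]) -> str:
    """Classify one local file; state and remote are indexed by phash."""
    known = state.get(local.phash)
    on_remote = remote.get(local.phash)
    if known is not None:
        if local.modtime == known.modtime:
            return "No local change"
        # Local copy changed. It is safe to push only if the remote copy is
        # absent or still the version recorded at the last sync (assumed wiring).
        if on_remote is None or on_remote.etag == known.etag:
            return "Local modified"
        return "Conflict"
    # Unknown path: a known inode suggests the file was renamed.
    if any(entry.inode == local.inode for entry in state.values()):
        return "Renamed"
    return "Conflict" if on_remote is not None else "New local file"
```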

SLIDE 14

3-step synchronisation

(ii) - Updates from StateDB

[Flowchart: classification of each StateDB entry by whether the file still exists on Local and Remote]

First decision: File exists on local/remote?

  • Local exists, Remote exists
  • Local doesn't exist, Remote exists
  • Local doesn't exist, Remote doesn't exist
  • Local exists, Remote doesn't exist

Further decisions:

  • inode exists in StateDB?
  • Remote ETag == StateDB ETag?
  • Local modtime == StateDB modtime?

Outcomes: No change, Deleted, Renamed / Deleted, Remote modified, Local modified

SLIDE 15

3-step synchronisation

(iii) - Updates from Remote Directory

[Flowchart: classification of each file in the Remote Directory against the StateDB]

Decisions:

  • phash exists in StateDB?
  • Remote ETag == StateDB ETag?
  • Local modtime == StateDB modtime?

Outcomes: New remote file, No remote changes, Remote modified, Conflict
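
With three decisions and four outcomes, this step can be sketched as a short chain of checks. The order of the branches below follows the order in which the decisions and outcomes appear on the slide, which is an assumption; StateRow and classify_remote are illustrative names.

```python
from typing import NamedTuple, Optional


class StateRow(NamedTuple):
    """What the StateDB remembers about a path from the last successful sync."""
    etag: str
    modtime: int


def classify_remote(remote_etag: str,
                    state: Optional[StateRow],
                    local_modtime: Optional[int]) -> str:
    """Classify one remote object; state is None if its phash is not in the StateDB."""
    if state is None:
        return "New remote file"
    if remote_etag == state.etag:
        return "No remote changes"
    # The remote copy changed. If the local copy is untouched since the last
    # sync it can simply be replaced; otherwise both sides changed.
    if local_modtime == state.modtime:
        return "Remote modified"
    return "Conflict"
```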

SLIDE 16

Core Classes / API

What we have done:

  • Built a cross-platform framework in Python that can be used to synchronise files with any cloud storage service, as long as some API functions are implemented.
  • Created abstract classes for representations of files, filesystem directories and cloud storage services.
  • Implemented a class that uses the Synnefo (Pithos) API as an example.
  • Created a proof-of-concept application that syncs a local directory with the Pithos+ service offered by ~okeanos.

SLIDE 17

Core Classes / API

(i) - FileStat

FileStat
  phash: int
  path: str
  inode: int
  modtime: int
  type: int
  etag: str

The core class used in this framework to represent file objects.

  • phash: The (integer) hash digest of the relative path string. It is used for fast indexing in the StateDB. Assumed unique for each file path.
  • etag: The ETag (SHA-256 digest) of the file. Assumed unique for each file version.
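
A record with these six fields could look like the sketch below. The field names and the use of SHA-256 for the ETag come from the slide; everything else is an assumption made for illustration: deriving phash from the first 8 bytes of a SHA-256 of the path, encoding type as 0 (file) or 1 (directory), allowing a missing etag, and the helper names path_hash, file_etag and fstat_from_path.

```python
import hashlib
import os
from dataclasses import dataclass
from typing import Optional


def path_hash(rel_path: str) -> int:
    """Stable integer hash of a relative path (assumed scheme, see lead-in)."""
    digest = hashlib.sha256(rel_path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def file_etag(abs_path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 hex digest of the file contents, read in chunks."""
    sha = hashlib.sha256()
    with open(abs_path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest()


@dataclass(frozen=True)
class FileStat:
    phash: int
    path: str
    inode: int
    modtime: int
    type: int                   # 0 = regular file, 1 = directory (assumed encoding)
    etag: Optional[str] = None  # hashing is slow, so it is optional in this sketch


def fstat_from_path(sync_dir: str, rel_path: str, with_etag: bool = False) -> FileStat:
    """Build a FileStat for sync_dir/rel_path from os.stat()."""
    abs_path = os.path.join(sync_dir, rel_path)
    info = os.stat(abs_path)
    is_dir = os.path.isdir(abs_path)
    etag = file_etag(abs_path) if with_etag and not is_dir else None
    return FileStat(path_hash(rel_path), rel_path, info.st_ino,
                    int(info.st_mtime), 1 if is_dir else 0, etag)
```

Python's built-in hash() is avoided for phash here because string hashes are randomised per process and would not survive a restart.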

SLIDE 18

Core Classes / API

(ii) - LocalDirectory

LocalDirectory
  sync_dir: str
  + get_all_objects_fstat()
  + get_modified_objects_fstat()
  + get_file_fstat(str path)

  • get_all_objects_fstat: Returns all local files' metadata as FileStat objects.
  • get_modified_objects_fstat: Returns file metadata only for the files that were modified since the last sync.
  • get_file_fstat: Returns the FileStat object for the file path if it exists, else returns None.
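
The three methods can be sketched with a plain directory walk. This is an illustration under stated assumptions, not the framework's implementation: a reduced Stat tuple stands in for FileStat, only regular files are reported, and the "last sync" state is passed in as a hypothetical dict mapping relative paths to their modtime at the last sync.

```python
import os
from typing import Dict, List, NamedTuple, Optional


class Stat(NamedTuple):
    """Reduced stand-in for FileStat."""
    path: str
    inode: int
    modtime: int


class LocalDirectory:
    def __init__(self, sync_dir: str, last_modtimes: Optional[Dict[str, int]] = None):
        self.sync_dir = sync_dir
        # Hypothetical view of the StateDB: relative path -> modtime at last sync.
        self.last_modtimes = dict(last_modtimes or {})

    def get_file_fstat(self, path: str) -> Optional[Stat]:
        abs_path = os.path.join(self.sync_dir, path)
        if not os.path.exists(abs_path):
            return None
        info = os.stat(abs_path)
        return Stat(path, info.st_ino, int(info.st_mtime))

    def get_all_objects_fstat(self) -> List[Stat]:
        found = []
        for dirpath, _dirnames, filenames in os.walk(self.sync_dir):
            for name in filenames:
                rel = os.path.relpath(os.path.join(dirpath, name), self.sync_dir)
                stat = self.get_file_fstat(rel)
                if stat is not None:  # the file may vanish during the walk
                    found.append(stat)
        return found

    def get_modified_objects_fstat(self) -> List[Stat]:
        # A file counts as modified if it is new or its modtime differs
        # from the one recorded at the last sync.
        return [s for s in self.get_all_objects_fstat()
                if self.last_modtimes.get(s.path) != s.modtime]
```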

SLIDE 19

Core Classes / API

(iii) - CloudClient

CloudClient
  + get_object_fstat(str path)
  + get_all_objects_fstat()
  + download_object(str path, file fd)
  + upload_object(str rel_path, str sync_dir)
  + update_object(str rel_path, str sync_dir, str etag)
  + delete_object(str path)
  + rename_object(str old_path, str new_path)

PithosClient
  pithos: PithosClient
  + init(str auth_URL, str auth_token, str ca_certs_path)
  - _modtime_from_remote(dict remote_obj)
  - _is_directory_from_remote(dict remote_obj)
  - _etag_from_remote(dict remote_obj)
  - _fstat_from_metadata(dict obj_metadata, str path)

Closely resembles the OpenStack API (used by Synnefo as well).

To properly handle race conditions:

  • upload_object() is used for new files
  • update_object() is used for existing files
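
The slide does not say how the two calls guard against races. One common reading is that upload_object() must fail if the object already exists and update_object() must fail if the remote ETag no longer matches the one the client last saw. The sketch below illustrates that reading with an in-memory stand-in; InMemoryClient, the simplified signatures (bytes instead of rel_path and sync_dir) and the exception types are all assumptions.

```python
import hashlib
from abc import ABC, abstractmethod
from typing import Dict, Optional


class CloudClient(ABC):
    """Reduced abstract interface: only the two write calls discussed here."""

    @abstractmethod
    def upload_object(self, path: str, data: bytes) -> str:
        """Create a new object and return its ETag."""

    @abstractmethod
    def update_object(self, path: str, data: bytes, etag: str) -> str:
        """Replace an existing object whose current ETag is `etag`."""


class InMemoryClient(CloudClient):
    """Toy backend that keeps objects in a dict."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def _etag(self, path: str) -> Optional[str]:
        data = self._objects.get(path)
        return None if data is None else hashlib.sha256(data).hexdigest()

    def upload_object(self, path: str, data: bytes) -> str:
        if path in self._objects:
            # Someone else created the object first: do not overwrite it.
            raise FileExistsError(path)
        self._objects[path] = data
        return hashlib.sha256(data).hexdigest()

    def update_object(self, path: str, data: bytes, etag: str) -> str:
        if self._etag(path) != etag:
            # The remote version changed (or vanished) since the last sync.
            raise RuntimeError("remote object changed since last sync: " + path)
        self._objects[path] = data
        return hashlib.sha256(data).hexdigest()
```

A client that receives either error would re-run change detection for that path instead of overwriting the other party's version.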

SLIDE 21

Optimisation: Request Queuing

(i) - Description

Multiple requests are slow!

✓ Batch them wherever possible (get_all_objects_fstat())
✓ Use threads and queues to send requests without waiting for others to complete.
✗ Need to wait for completion of all threads at a step of the sync algorithm before proceeding to the next.

  • Can be further optimised using a locking mechanism
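
The pattern described above (a queue of requests, several worker threads, and a wait at the end of each sync step) can be sketched with the standard library. The function name run_step and the representation of a request as a zero-argument callable are assumptions; error handling is left out for brevity.

```python
import queue
import threading
from typing import Callable, List, Sequence, Tuple


def run_step(requests: Sequence[Callable[[], object]], num_threads: int = 4) -> List[object]:
    """Run all requests of one sync step on worker threads and wait for them."""
    tasks: "queue.Queue[Tuple[int, Callable[[], object]]]" = queue.Queue()
    for index, request in enumerate(requests):
        tasks.put((index, request))

    results: List[Tuple[int, object]] = []
    results_lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                index, request = tasks.get_nowait()
            except queue.Empty:
                return  # nothing left to do for this step
            outcome = request()  # e.g. one upload or download
            with results_lock:
                results.append((index, outcome))

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # barrier: the next sync step starts only after this

    # Return outcomes in the order the requests were submitted.
    return [outcome for _, outcome in sorted(results, key=lambda pair: pair[0])]
```

Calling run_step once per step of the sync algorithm reproduces the wait-for-all behaviour listed as a drawback above.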

SLIDE 22

Optimisation: Request Queuing

(ii) - Benchmark results

# of threads  (sequential)  1      2      4      8      12     16     20     24     28     32
time (s)      92.55         91.51  48.33  33.42  29.79  29.80  30.85  30.79  30.95  30.68  31.23
speedup (%)   N/A           1.51   47.78  63.89  67.81  67.80  66.67  66.73  66.56  66.85  66.25

(a) Upload speedup by queuing, relative to # of threads

File Size                   150 B  150 KB  1.5 MB
Sequential upload time (s)  92.55  153.32  636.48
4 threads upload time (s)   33.82  68.12   569.43
speedup (%)                 63.46  55.57   10.54

(b) Upload speedup, relative to file size (4 threads)

Considerable speedup for smaller files, but less effective when network gets close to maximum throughput.

SLIDE 23

Optimisation: Directory Monitoring

(i) - Description

Checking all files for changes is slow!

  • ~1000 files/s on an SSD
  • 1M files ⇒ 16.7 minutes!

✓ Operating Systems can have modification information available - directory monitoring mechanisms (inotify, FSEvents, kqueue, etc.)

We use the watchdog Python module to access those utilities, extending the LocalDirectory class to support the feature.

  • Constantly runs in the background (daemon). Offline changes/crashes/reboots are handled by performing a full local directory scan on start-up.

SLIDE 24

Optimisation: Directory Monitoring

(ii) - Benchmark results

Setup: Directory with 1M files, modify some of them and measure time of update detection.

# files modified  ?        10      100     1000    10000   100000  1000000  default
time (s)          1.06E-5  0.004   0.038   0.339   1.618   12.907  90.003   108.110
speedup (%)       100.000  99.996  99.965  99.687  98.503  92.825  16.749   N/A

✓ Significant speedup when a small number of files has changed (most common scenario)
✓ Small speedup even in the cases where many files have changed

Graphical representation of the results on the next slide (Note: Lin-Log scale)

SLIDE 25

Optimisation: Directory Monitoring

(iii) - Graphical representation of results

SLIDE 26

Optimisation: Local Block Storage

(i) - Description

Downloading whole large files for small changes is slow!

Implement delta-sync:

✓ Keep a local copy of all files' blocks
✓ Detect what parts of files have been changed
✓ Download only the missing blocks and create the file
✗ Needs extra storage space to store all the different blocks

  • Extend the CloudClient class to handle downloads using blocks.
  • Use hierarchical structure to improve block lookup speed.
  • Save local modified blocks after uploads/updates.

SLIDE 27

Optimisation: Local Block Storage

(ii) - Sync Process

[Sequence diagram. Participants: (local) block_directory, local_client, remote_server]

  1. get_hash_list(path) → hash_list
  2. find_missing_blocks(hash_list) → missing_blocks
  3. download_blocks(missing_blocks) → blocks
  4. reconstruct_file_from_blocks()
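
The four calls can be tied together in a small, self-contained sketch. It is an illustration only: the remote server is replaced by a dict from block hash to bytes, SHA-256 block hashes and the two-level directory fan-out are assumptions, and the block size is 4 bytes so that the example stays readable (the benchmark in this deck uses 4 MiB blocks).

```python
import hashlib
import os
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # bytes; demo value only


def block_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> List[bytes]:
    return [data[i:i + size] for i in range(0, len(data), size)]


class BlockDirectory:
    """Local block store, one file per block, named by its hash."""

    def __init__(self, root: str) -> None:
        self.root = root

    def _path(self, digest: str) -> str:
        # Hierarchical layout (ab/cd/<digest>) keeps each directory small.
        return os.path.join(self.root, digest[:2], digest[2:4], digest)

    def put(self, data: bytes) -> str:
        digest = block_hash(data)
        path = self._path(digest)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as handle:
            handle.write(data)
        return digest

    def get(self, digest: str) -> bytes:
        with open(self._path(digest), "rb") as handle:
            return handle.read()

    def find_missing_blocks(self, hash_list: List[str]) -> List[str]:
        missing, seen = [], set()
        for digest in hash_list:
            if digest not in seen and not os.path.exists(self._path(digest)):
                missing.append(digest)
            seen.add(digest)
        return missing


def sync_download(remote_blocks: Dict[str, bytes], hash_list: List[str],
                  store: BlockDirectory) -> Tuple[bytes, List[str]]:
    """Fetch only the missing blocks, then rebuild the file from the store."""
    missing = store.find_missing_blocks(hash_list)
    for digest in missing:
        data = remote_blocks[digest]  # stands in for download_blocks()
        if block_hash(data) != digest:
            raise ValueError("corrupt block " + digest)
        store.put(data)
    return b"".join(store.get(d) for d in hash_list), missing
```

If the store already holds the blocks of b"aaaabbbbcccc" and the remote version is b"aaaaXXXXcccc", only the block for b"XXXX" is transferred.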

SLIDE 28

Optimisation: Local Block Storage

(iii) - Benchmark results

Setup: Create and upload a 40 MiB (41,943,040 B) file (exactly 10 blocks of 4 MiB), modify some blocks, manually re-upload to server, measure download times.

# of modified blocks  0     1     2     3     4     5      6      7      8      9      10
time (s)              0.37  2.59  4.49  6.44  8.98  10.12  12.23  13.60  15.65  17.59  19.61
speedup (%)           98.1  86.8  77.1  67.2  54.2  48.4   37.7   30.7   20.2   10.3   N/A

✓ Linear correlation
✓ Significant improvement for large similar files, since very few blocks need to be downloaded each time

\[ \text{Performance gain (\%)} = \left( 1 - \frac{\#\text{ of new blocks}}{\text{Total } \#\text{ of blocks}} \right) \times 100 \]
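
As a quick check of the formula against the measurements, the snippet below compares the predicted gain with the gain computed from the download times on this slide. It assumes that the eleven timings correspond to 0 through 10 modified blocks, with the last one (all 10 blocks) as the full-download baseline; the function names are illustrative.

```python
def predicted_gain(new_blocks: int, total_blocks: int) -> float:
    """Performance gain (%) according to the formula on this slide."""
    return (1 - new_blocks / total_blocks) * 100


def measured_gain(time_s: float, full_download_s: float) -> float:
    """Gain (%) of one measured download relative to the full download."""
    return (1 - time_s / full_download_s) * 100


# Download times (s) from the benchmark table, for 0..10 modified blocks.
times = [0.37, 2.59, 4.49, 6.44, 8.98, 10.12, 12.23, 13.60, 15.65, 17.59, 19.61]

for blocks, seconds in enumerate(times):
    print(blocks,
          round(predicted_gain(blocks, 10), 1),
          round(measured_gain(seconds, times[-1]), 1))
```

For one modified block the formula predicts a 90 % gain, while the measured value is 86.8 %.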

SLIDE 29

Optimisation: Local Block Storage

(iv) - Graphical representation

[Plot: Download time (s) against Blocks modified (0 to 10)]

SLIDE 30

Local Deduplication - FUSE

(i) - Description

Storing so many large files is expensive!

✓ Those files have the majority of their blocks in common
✓ We only need to store each block once, in the block directory
✓ Need to control the FS, so we can "virtually" create the files

Solution: Filesystem in Userspace (FUSE) mechanism.

SLIDE 31

Local Deduplication - FUSE

(ii) - Design

  • Modify fstat(), open(), read(), write() system calls to use the blocks a file consists of.
  • "Write once, Read many, Update never" practice
  • Copy-on-Write (CoW) strategy, to preserve possibly shared blocks when changes are made

Effectively implements deduplication on the local file system. Storage space reduction of approximately:

\[ \text{block\_size} \times \sum_{i=1}^{n} \left[ (\#\text{ of files sharing block } i) - 1 \right] \]

Also offers other benefits (cheap file copies, immediate modification detection)
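
The storage estimate can be evaluated directly from the block lists of the stored files. A minimal sketch, assuming each file is described by the list of its block hashes; dedup_saving and the example file names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def dedup_saving(files: Dict[str, List[str]], block_size: int) -> int:
    """Bytes saved by storing every distinct block once.

    Implements block_size * sum over blocks of (files sharing the block - 1).
    """
    sharing: Counter = Counter()
    for blocks in files.values():
        # Count each file once per distinct block, as in the formula.
        sharing.update(set(blocks))
    return block_size * sum(count - 1 for count in sharing.values())


# Three snapshots of three blocks each; letters stand for block hashes.
snapshots = {
    "snapshot_t0": ["a", "b", "c"],
    "snapshot_t1": ["a", "b", "d"],
    "snapshot_t2": ["a", "e", "d"],
}
saved = dedup_saving(snapshots, block_size=4 * 2**20)
```

Here block a is shared by three files, b and d by two each, so four of the nine block copies need not be stored: 16 MiB saved with 4 MiB blocks.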

SLIDE 33

Comparison with existing software

rsync

✓ Rolling hash algorithm performs exceptionally well at detecting modified parts
✓ Does not need files to be aligned to blocks
✓ One round-trip, works well on high latency connections
✓ Free & Open source software
✗ Not automated
✗ Needs third-party applications to handle synchronisation
✗ No directory monitoring
✗ Uses MD5 for checksum comparison (potentially unsafe - collisions can be computed)
✗ No local file deduplication

SLIDE 34

Comparison with existing software

ownCloud

✓ Most famous open source synchronisation software suite
✓ Cross-platform
✓ Directory monitoring
✗ No delta-sync - Transfer whole files
✗ Full local directory scan every few minutes
✗ No local file deduplication
✗ Silently ignores files containing special characters which are not allowed in Windows ('|', ':', '>', '<' and '?')

SLIDE 35

Comparison with existing software

Dropbox

✓ Uses librsync - rolling checksum algorithm benefits
✓ Remote deduplication with blocks of 4 MiB - Fast uploads of similar files
✓ Benchmarks indicated the existence of a local block cache - fast downloads if blocks are cached
✓ Directory Monitoring
✓ Streaming Sync for multiple clients (Prefetching blocks)
✗ Commercial, closed source software
✗ Cannot be deployed on personal cloud storage infrastructures or other cloud storage services
✗ No local file deduplication

SLIDE 36

Comparison with existing software

Google Drive

✓ Directory monitoring
✗ No delta-sync - Transfer whole files
✗ Commercial, closed source software
✗ Cannot be deployed on personal cloud storage infrastructures or other cloud storage services
✗ No local file deduplication

SLIDE 38

Future Work

Peer-to-Peer L2 block exchange

Idea: LAN transfers are faster than over the WAN. Request missing resources from the LAN, before asking the server.

  • Have clients monitor a Link Layer (L2) broadcast address for requests.
  • Send missing block requests to the network and wait for responses.
  • When asked, clients check their respective block directories and respond with block availability.
  • Only request blocks not found in the block directory or the local network from the remote server.
  • ALWAYS verify blocks downloaded from LAN - Avoid corruption or compromise from blocks sent by malicious users.
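
The last two points can be sketched independently of the transport. In the sketch below the LAN and the server are represented by two hypothetical callables, ask_lan and ask_server, and blocks are assumed to be identified by their SHA-256 digest; the L2 broadcast itself is not modelled.

```python
import hashlib
from typing import Callable, Dict, Iterable, Optional


def is_valid(digest: str, data: Optional[bytes]) -> bool:
    """A block is accepted only if its content hashes to the requested digest."""
    return data is not None and hashlib.sha256(data).hexdigest() == digest


def fetch_blocks(missing: Iterable[str],
                 ask_lan: Callable[[str], Optional[bytes]],
                 ask_server: Callable[[str], bytes]) -> Dict[str, bytes]:
    """Try the LAN first; fall back to the server for absent or bad blocks."""
    blocks: Dict[str, bytes] = {}
    for digest in missing:
        data = ask_lan(digest)  # may return None or tampered data
        if not is_valid(digest, data):
            data = ask_server(digest)
            if not is_valid(digest, data):
                raise ValueError("server returned a corrupt block: " + digest)
        blocks[digest] = data
    return blocks
```

Because every block is checked against the digest that was requested, a malicious peer can at worst cause an extra request to the server.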

SLIDE 39

Q & A

Any Questions?

SLIDE 40

The End.

Thank you for your time!
