SS-CDC: A Two-stage Parallel Content-Defined Chunking Method for Data Deduplication. A PowerPoint presentation by Fan Ni, Xing Lin, and Song Jiang, Cluster and Internet Computing Laboratory, Wayne State University.


SLIDE 1

Wayne State University Cluster and Internet Computing Laboratory

SS-CDC: A Two-stage Parallel Content-Defined Chunking Method for Data Deduplication

Fan Ni Xing Lin Song Jiang

SLIDE 2

Data is Growing Rapidly

▪ Most of the data needs to be safely stored.
▪ Efficient data storage and management have become a big challenge.


From storagenewsletter.com

SLIDE 3

The Opportunity: Data Duplication is Common

▪ Sources of duplicate data:

– The same files are stored by multiple users in the cloud.
– Continuous updating of files generates multiple versions.
– Use of checkpointing and repeated data archiving.

▪ Significant data duplication has been observed.

– For backup storage workloads

  • Over 90% are duplicate data.

– For primary storage workloads

  • About 50% are duplicate data.


SLIDE 4

The Deduplication Technique can Help


[Figure: logical vs. physical view of File1 and File2. The two files have identical contents, so SHA1(File1) == SHA1(File2). When duplication is detected by fingerprinting, only one physical copy is stored.]

▪ Benefits

– Storage space – I/O bandwidth – Network traffic

▪ An important feature in commercial storage systems

– NetApp ONTAP system – Dell-EMC Data Domain system

▪ The data deduplication technique is critical.

– How to deduplicate more data? – How to deduplicate faster?
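The fingerprint check on this slide can be sketched in a few lines of Python. This is an illustrative sketch only (the class and method names are hypothetical); the slide does not prescribe an implementation, and real systems fingerprint at the chunk level rather than whole files:

```python
# Illustrative sketch of fingerprint-based deduplication (hypothetical names).
import hashlib

class DedupStore:
    def __init__(self):
        self.blobs = {}   # fingerprint -> data: one physical copy per content
        self.files = {}   # logical file name -> fingerprint

    def put(self, name, data):
        fp = hashlib.sha1(data).hexdigest()   # SHA1(File), as on the slide
        if fp not in self.blobs:              # duplicate detected by fingerprint
            self.blobs[fp] = data             # store only one physical copy
        self.files[name] = fp

store = DedupStore()
store.put("File1", b"identical contents")
store.put("File2", b"identical contents")
# SHA1(File1) == SHA1(File2), so only one physical blob is kept
```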

SLIDE 5

Deduplicate at Smaller Chunks … for a Higher Deduplication Ratio

[Figure: files are chunked and fingerprinted; duplicate chunks are then removed.]

▪ Two potentially major sources of cost in deduplication:

– Chunking – Fingerprinting

▪ Can chunking be very fast?

SLIDE 6

Fixed-Size Chunking (FSC)


File A: HOWAREYOU?OK?REALLY?YES?NO
File B: HOWAREYOU?OK?REALLY?YES?NO

[Figure: both files are cut into equal fixed-size chunks at the same offsets, so all chunks match.]

▪ FSC: partitions files (or data streams) into fixed-size chunks.

– Very fast!

▪ But the dedup ratio can be significantly compromised.

– The boundary-shift problem.
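A minimal sketch of FSC and the boundary-shift problem, using 4-byte chunks for illustration (real systems use chunks of several kilobytes):

```python
# Fixed-size chunking: cut every `size` bytes regardless of content.
def fsc(data, size=4):
    return [data[i:i + size] for i in range(0, len(data), size)]

file_a = b"HOWAREYOU?OK?REALLY?YES?NO"
file_b = b"H" + file_a   # one byte inserted at the front

# Every boundary in file_b is shifted by one byte, so the two files
# share no chunks even though they are nearly identical.
shared = set(fsc(file_a)) & set(fsc(file_b))
```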

SLIDE 7

Fixed-Size Chunking (FSC)


▪ FSC: partitions files (or data streams) into fixed-size chunks.

– Very fast!

▪ But the dedup ratio can be significantly compromised.

– The boundary-shift problem.

File A: HOWAREYOU?OK?REALLY?YES?NO
File B: H + HOWAREYOU?OK?REALLY?YES?NO

[Figure: inserting a single byte 'H' at the front of File B shifts every fixed-size chunk boundary, so none of File B's chunks match File A's.]

SLIDE 8

Content-Defined Chunking (CDC)


File A: HOWAREYOU?OK?REALLY?YES?NO
File B: H + HOWAREYOU?OK?REALLY?YES?NO

[Figure: with CDC, boundaries are placed at content markers, so File B's chunks realign with File A's after the inserted byte.]

▪ CDC: determines chunk boundaries according to contents (a predefined special marker).

– Variable chunk size.
– Addresses the boundary-shift problem.
– However, it can be very expensive.

Assume the special marker is '?'.

In practice, the marker is determined by applying a hash function to a sliding window of bytes, e.g., hash("YOU?") == predefined-value ➔ even more expensive (likely more than half of the total dedup cost!)
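The window-hash idea can be sketched as follows. The hash function and parameters here are illustrative, not the Rabin function used in the talk, and real implementations use a rolling hash so each step costs O(1) instead of rehashing the whole window:

```python
# Illustrative CDC: declare a chunk boundary wherever the hash of the
# last W bytes matches a target bit pattern, so chunk sizes vary with content.
W = 4        # window size (illustrative)
MASK = 0x0F  # expected chunk size around 16 bytes (illustrative)

def window_hash(win):
    h = 0
    for byte in win:
        h = (h * 31 + byte) & 0xFFFFFFFF
    return h

def cdc(data):
    chunks, start = [], 0
    for i in range(W, len(data) + 1):
        if window_hash(data[i - W:i]) & MASK == 0:  # marker found
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])                 # tail chunk
    return chunks
```

Because a boundary depends only on the window's contents, an insertion early in the stream disturbs only nearby boundaries; later markers reappear at shifted positions, which is what defeats the boundary-shift problem.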

SLIDE 9

Parallelizing CDC Chunking Operations


[Figure: a file to be chunked.]

SLIDE 10

Parallelizing CDC Chunking Operations


Parallelize its chunking:

[Figure: the file is divided into four equal segments, processed in parallel by threads p0, p1, p2, and p3.]

SLIDE 11

Parallelizing CDC Chunking Operations


Parallelize its chunking:

[Figure: the file is divided into four equal segments, processed in parallel by threads p0, p1, p2, and p3.]

However, the parallelized chunking can compromise the deduplication ratio.
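The naive parallel scheme can be sketched as below. A toy chunker that ends a chunk after each '?' (the marker assumed earlier in the talk) stands in for real CDC; the segment size and names are illustrative. The point is that segment edges become artificial chunk boundaries, so the parallel result can differ from sequential chunking:

```python
from concurrent.futures import ThreadPoolExecutor

SEG = 8  # segment size in bytes (illustrative)

def cdc_marker(data):
    # Toy CDC: end a chunk after every '?' marker.
    chunks, start = [], 0
    for i, b in enumerate(data):
        if b == ord('?'):
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def parallel_chunk(data):
    segments = [data[i:i + SEG] for i in range(0, len(data), SEG)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        per_segment = pool.map(cdc_marker, segments)
    # Segment edges become chunk boundaries even where no marker exists.
    return [c for chunks in per_segment for c in chunks]

data = b"HOWAREYOU?OK?REALLY?YES?NO"
```

On this input, sequential chunking yields 5 chunks but the parallel version yields more, because chunks spanning a segment edge get cut in two; those extra cuts are what reduce the deduplication ratio.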

SLIDE 12

Compromised Deduplication Ratio


Deduplication ratio = data size before dedup / data size after dedup.

[Figure: deduplication ratios of sequential CDC versus the parallelized chunking; higher is better.]
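The metric is simple enough to state as code; a trivial sketch to make it concrete, e.g., 100 GB of logical data stored in 10 GB of physical space gives a ratio of 10:

```python
def dedup_ratio(size_before, size_after):
    # Higher is better: more logical data per byte of physical storage.
    return size_before / size_after

# dedup_ratio(100, 10) -> 10.0
```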

SLIDE 13

Chunks can be Different!


The rule of forming chunks:

– A chunk usually spans two adjacent markers.
– But a chunk must be neither too small (≥ minimum chunk size) nor too large (≤ maximum chunk size).
– Enforcing these limits is inherently a sequential process.

The parallel chunking:

– Artificially introduces a set of markers (the segment boundaries).
– These marker positions change with data insertion/deletion.
– This partially brings back the boundary-shift problem.

[Figure: the min/max chunk-size window applied around a marker.]

SLIDE 14

The Goal of this Research


To design a parallel chunking technique that …

– does not compromise the deduplication ratio at all, and
– achieves superlinear speedup of chunking operations.

SLIDE 15

Approach of the Proposed SS-CDC Chunking


Two-stage chunking:

– Stage 1: produce all markers in parallel over a segmented file.

  • A thread works on 16 consecutive segments at a time.
  • AVX-512 SIMD instructions process the 16 segments in parallel on one core.
  • The markers are recorded in a bit vector.

[Figure: one thread applying SIMD lanes across 16 segments of the file.]
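Stage 1 can be sketched scalar-style as below. The window hash, sizes, and names are illustrative stand-ins; the actual implementation evaluates the Rabin rolling hash for 16 segments at once using AVX-512. The key property is that the output is only a bit vector of marker positions, with no chunking decisions taken yet:

```python
W = 4        # window size (illustrative)
MASK = 0x0F  # marker condition (illustrative)

def window_hash(win):
    h = 0
    for byte in win:
        h = (h * 31 + byte) & 0xFFFFFFFF
    return h

def stage1_markers(data):
    # One bit per byte position; a set bit means a window ending here
    # is a marker. No min/max chunk-size logic is applied in this stage.
    bits = bytearray((len(data) + 7) // 8)
    for i in range(W, len(data) + 1):
        if window_hash(data[i - W:i]) & MASK == 0:
            pos = i - 1
            bits[pos >> 3] |= 1 << (pos & 7)
    return bits
```

Because a marker depends only on the window's contents, the bit vector does not depend on where the segments are cut, which is what lets stage 1 run fully in parallel without affecting the final result.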

SLIDE 16

The Approach of the Proposed SS-CDC Chunking


Two-stage chunking:

– Stage 2: sequentially determine the chunks from the marker bit vector.

  • Takes the minimum and maximum chunk sizes into account.
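Stage 2 can be sketched as below; the min/max values and the exact enforcement policy here are illustrative (the talk uses a 2KB minimum and 64KB maximum). It walks the bit vector sequentially, skipping markers that would create an undersized chunk and forcing a cut when the maximum is reached, so the decisions match what a sequential chunker would make. It is cheap because it reads only bits, not the data:

```python
MIN_SIZE, MAX_SIZE = 4, 12   # illustrative; the talk uses 2KB / 64KB

def stage2_chunks(bits, length):
    # Return chunk-end offsets derived from the marker bit vector.
    boundaries, last = [], 0
    for pos in range(length):
        size = pos - last + 1
        marker = bits[pos >> 3] >> (pos & 7) & 1
        if (marker and size >= MIN_SIZE) or size >= MAX_SIZE:
            boundaries.append(pos + 1)   # chunk ends after byte `pos`
            last = pos + 1
    if last < length:
        boundaries.append(length)        # tail chunk
    return boundaries

# Example: markers at positions 2, 5, and 20 in a 25-byte input.
bits = bytearray(4)
for p in (2, 5, 20):
    bits[p >> 3] |= 1 << (p & 7)
# The marker at 2 is rejected (chunk would be 3 bytes < MIN_SIZE), the one
# at 5 is accepted, and MAX_SIZE forces a cut at offset 18.
```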
SLIDE 17

Advantages of SS-CDC


▪ It does not lose any deduplication ratio.

– The second stage is sequential.
– It generates exactly the same set of chunks as sequential chunking.

▪ It potentially achieves superlinear speedup.

– Stage 1 accounts for about 98% of the chunking time.
– Stage 1 is parallelized both across and within cores.
– With optimization, Stage 2 accounts for less than 2% of the chunking time.

SLIDE 18

Experiment Setup


▪ The hardware

– Dell-EMC PowerEdge T440 server with two Intel Xeon 3.6GHz CPUs.
– Each CPU has 4 cores and a 16MB LLC.
– 256GB DDR4 memory.

▪ The Software

– Ubuntu 18.04 OS.
– The rolling window function is Rabin.
– Minimum/average/maximum chunk sizes are 2KB/16KB/64KB, respectively.

SLIDE 19

The Datasets


Name       Description
Cassandra  Docker images of Apache Cassandra, an open-source storage system
Redis      Docker images of the Redis key-value store database
Debian     Docker images of the Debian Linux distribution (since Ver. 7.11)
Linux-src  Uncompressed Linux source code (v3.0 ~ v4.9) downloaded from the Linux Kernel Archives website
Neo4j      Docker images of the Neo4j graph database
Wordpress  Docker images of the WordPress rich content management system
Nodejs     Docker images of JavaScript-based runtime environment packages

SLIDE 20

Single-thread/core Chunking Throughput

[Figure: SS-CDC achieves a consistent speedup of about 3.3X across datasets.]


SLIDE 21

Multi-thread/core Chunking Throughput

The chunking speedups are superlinear and scale well.

SLIDE 22

Existing Parallel CDC Deduplication Ratio Reduction

▪ Compared to SS-CDC, the reduction can be up to 43%.
▪ Using smaller segments leads to a higher reduction.

[Figure: dedup ratio reduction (%) of existing parallel CDC relative to SS-CDC for each dataset (Cassandra, Redis, Debian, Linux-src, Neo4j, Wordpress, Nodejs), with 512KB, 1MB, and 2MB segments.]

SLIDE 23

Conclusions

▪ SS-CDC is a parallel CDC technique that has

– high chunking speed.
– zero deduplication-ratio loss.

▪ SS-CDC is optimized for SIMD platforms.

– Similar two-stage chunking techniques can be applied on other platforms, such as GPUs.
