RapidCDC: Leveraging Duplicate Locality to Accelerate Chunking in CDC-based Deduplication

SLIDE 1

Wayne State University Cluster and Internet Computing Laboratory

RapidCDC: Leveraging Duplicate Locality to Accelerate Chunking in CDC-based Deduplication

Fan Ni* (fan@netapp.com), ATG, NetApp
Song Jiang (song.jiang@uta.edu), UT Arlington

* The work was done when he was a Ph.D. student at UT Arlington

SLIDE 2

Data is Growing Rapidly

§ Much of the data needs to be stored for preservation and processing.
§ Efficient data storage and management have become a big challenge.

From storagenewsletter.com

SLIDE 3

The Opportunity: Data Duplication is Common

§ Sources of duplicate data:

– The same files are stored in the cloud by multiple users.
– Files are continuously updated, generating multiple versions.
– Use of checkpointing and repeated data archiving.

§ Significant data duplication has been observed for both backup and primary storage workloads.

SLIDE 4

The Deduplication Technique can Help

[Figure: logical view versus physical view of File1 and File2. When duplication is detected using fingerprinting (the SHA1 values of two chunks are equal), only one copy is stored.]

§ Benefits

– Storage space
– I/O bandwidth
– Network traffic

§ An important feature in commercial storage systems.

– NetApp ONTAP system
– Dell-EMC Data Domain system

§ Two critical issues:

– How to deduplicate more data?
– How to deduplicate faster?
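The fingerprint-based detection described on this slide can be illustrated with a minimal sketch. The `DedupStore` class and its method names are hypothetical and not taken from the talk; the sketch only assumes that chunks are identified by their SHA1 digests and that a chunk with a known digest is not stored again.

```python
import hashlib


class DedupStore:
    """Toy deduplicating store (hypothetical, for illustration only)."""

    def __init__(self):
        self.chunks = {}  # physical view: fingerprint -> chunk bytes
        self.files = {}   # logical view: file name -> list of fingerprints

    def write(self, name, chunks):
        recipe = []
        for chunk in chunks:
            fp = hashlib.sha1(chunk).hexdigest()
            # A chunk whose fingerprint is already known is not stored again.
            self.chunks.setdefault(fp, chunk)
            recipe.append(fp)
        self.files[name] = recipe

    def read(self, name):
        # Rebuild the logical file from its list of fingerprints.
        return b"".join(self.chunks[fp] for fp in self.files[name])


store = DedupStore()
store.write("File1", [b"aaaa", b"bbbb"])
store.write("File2", [b"aaaa", b"cccc"])
print(len(store.chunks))  # number of chunks physically stored
```

Here the two files reference four chunks logically, but the shared chunk is kept only once in the physical view.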

SLIDE 5

Deduplicate at Smaller Chunks … for higher deduplication ratio

[Figure: two stages, chunking and fingerprinting, followed by removing duplicate chunks]

§ Two potentially major sources of cost in deduplication:

– Chunking
– Fingerprinting

§ Can chunking be very fast?

SLIDE 6

Fixed-Size Chunking (FSC)

File A: HOWAREYOU?OK?REALLY?YES?NO
File B: HOWAREYOU?OK?REALLY?YES?NO

§ FSC: partition files (or data streams) into equal, fixed-size chunks.

– Very fast!

§ But the deduplication ratio can be significantly compromised.

– The boundary-shift problem.

SLIDE 7

Fixed-Size Chunking (FSC)

§ FSC: partition files (or data streams) into equal, fixed-size chunks.

– Very fast!

§ But the deduplication ratio can be significantly compromised.

– The boundary-shift problem.

File A: HOWAREYOU?OK?REALLY?YES?NO
File B: HOWAREYOU?OK?REALLY?YES?NO, with an extra byte 'H' inserted
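The boundary-shift problem can be reproduced with a small sketch. The chunk size of 4 bytes and the position of the inserted 'H' (assumed here to be the front of File B) are illustrative choices, not details given on the slide; `fixed_size_chunks` is a hypothetical helper.

```python
def fixed_size_chunks(data: bytes, size: int) -> list[bytes]:
    """Split data into consecutive chunks of `size` bytes (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


file_a = b"HOWAREYOU?OK?REALLY?YES?NO"
file_b = b"H" + file_a  # assumption: the extra byte is inserted at the front

chunks_a = fixed_size_chunks(file_a, 4)
chunks_b = fixed_size_chunks(file_b, 4)

# Every boundary in File B is shifted by one byte, so no chunk lines up.
shared = set(chunks_a) & set(chunks_b)
print(len(chunks_a), len(chunks_b), len(shared))
```

Although the two files differ by a single byte, every fixed-size chunk of File B differs from every chunk of File A, so nothing is deduplicated.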

SLIDE 8

Content-Defined Chunking (CDC)

File A: HOWAREYOU?OK?REALLY?YES?NO
File B: HOWAREYOU?OK?REALLY?YES?NO, with an extra byte 'H' inserted

§ CDC: determines chunk boundaries according to the content (a predefined special marker).

– Variable chunk size.
– Addresses the boundary-shift problem.

§ Assume the special marker is ‘?’.
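A minimal sketch of marker-based chunking with ‘?’ as the marker follows. The function name `marker_chunks` is hypothetical, and the inserted 'H' is again assumed to sit at the front of File B.

```python
def marker_chunks(data: bytes, marker: int = ord("?")) -> list[bytes]:
    """Cut a chunk right after every occurrence of the marker byte."""
    chunks, start = [], 0
    for i, byte in enumerate(data):
        if byte == marker:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes without a marker
    return chunks


file_a = b"HOWAREYOU?OK?REALLY?YES?NO"
file_b = b"H" + file_a  # assumption: the extra byte is inserted at the front

chunks_a = marker_chunks(file_a)
chunks_b = marker_chunks(file_b)
shared = set(chunks_a) & set(chunks_b)
print(chunks_b)
print(len(shared), "of", len(chunks_a), "chunks are shared")
```

Only the chunk containing the inserted byte changes; the remaining chunks keep their content and can still be deduplicated.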

SLIDE 9

The Advantage of CDC

§ Real-world datasets include two weeks of Google News, Linux kernels, and various Docker images.
§ CDC’s deduplication ratio is much higher than that of FSC.

§ However, CDC can be very expensive.

[Figure: deduplication ratio (axis from 4 to 40) of CDC and FSC on Google-news, Linux-tar, Cassandra, Redis, Debian-docker, Neo4j, Wordpress, and Nodejs]

SLIDE 10

CDC can be Too Expensive!

File A: HOWAREYOU?OK?REALLY?YES?NO
File B: HOWAREYOU?OK?REALLY?YES?NO, with an extra byte 'H' inserted

Assume the special marker is ‘?’.

§ The markers for identifying chunk boundaries must

– be evenly spaced out, with a controllable distance in between.

§ In practice, the marker is determined by applying a hash function to a window of bytes.

– E.g., hash(“YOU?”) == pre-defined-value

§ The window rolls forward byte by byte, and the hash is recomputed at every position.
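The rolling-window test can be sketched as below. This is an illustration under stated assumptions, not the hash used in the talk: it uses a simple polynomial rolling hash, and the window length, base, modulus, and mask are arbitrary example values. The cost the slide points at is visible in the loop: the hash is updated and tested once per input byte.

```python
import random

WINDOW = 16          # window length in bytes (assumed)
BASE = 257           # polynomial base (assumed)
MOD = (1 << 61) - 1  # large prime modulus (assumed)
MASK = 0xFF          # cut when the low 8 bits of the hash are zero (assumed)


def rolling_cdc(data: bytes) -> list[bytes]:
    """Cut a chunk wherever the hash of the last WINDOW bytes matches the mask."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the oldest byte in the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i - start >= WINDOW:
            # Window is full: drop the oldest byte before adding the new one.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        if i - start + 1 >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0  # restart the window in the next chunk
    if start < len(data):
        chunks.append(data[start:])
    return chunks


rng = random.Random(7)
data = bytes(rng.randrange(256) for _ in range(20000))
chunks_a = rolling_cdc(data)
chunks_b = rolling_cdc(b"H" + data)  # same content with one byte inserted
shared = set(chunks_a) & set(chunks_b)
print(len(chunks_a), len(chunks_b), len(shared))
```

On random data the mask is matched at roughly one position in 256, which is how the expected distance between markers is controlled.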

SLIDE 11

CDC Chunking Becomes a Bottleneck

§ Chunking takes more than 60% of the CPU time.
§ I/O bandwidth is not fully utilized.
§ The bottleneck shifts from the disk to the CPU.

[Figure: breakdown of CPU time (%) into chunking and fingerprinting for Linux-tar (dr=4.06), Redis (dr=7.17), and Neo4j (dr=19.04)]

SLIDE 12

CDC Chunking Becomes a Bottleneck

§ Chunking takes more than 60% of the CPU time.
§ I/O bandwidth is not fully utilized.
§ The bottleneck shifts from the disk to the CPU.

[Figure: breakdown of CPU time (%) into chunking and fingerprinting, and breakdown of I/O time (%) into I/O busy and I/O idle, for Linux-tar (dr=4.06), Redis (dr=7.17), and Neo4j (dr=19.04)]

SLIDE 14

Efforts on Acceleration of CDC Chunking

§ Make hashing faster

– Example functions: SimpleByte, Gear, and AE
– More likely to generate small chunks
  • increasing the size of the metadata cached in memory for performance

§ Use GPUs or multiple cores to parallelize the chunking process

– Extra hardware cost
– Substantial effort to deploy
– The speedup is bounded by hardware parallelism.

§ Significant software/hardware effort, but limited performance return

SLIDE 15

We propose RapidCDC, which …

§ is still sequential and doesn’t require additional cores/threads.
§ makes the hashing speed almost irrelevant.
§ accelerates CDC chunking, often by 10-30 times.
§ has the same deduplication ratio as regular CDC methods.
§ can be adopted in an existing CDC deduplication system by adding 100-200 LOC in a few functions.

SLIDE 16

The Path to the Breakthrough

Unique Chunks in the Disk

SLIDE 17

The Path to the Breakthrough

Fingerprint Matched!

SLIDE 18

The Path to the Breakthrough

[Figure labels: Fingerprint matched! 15KB, 15KB. Confirm it!]

SLIDE 19

The Path to the Breakthrough

[Figure labels: Fingerprint matched! 15KB, 15KB. Chunk sizes: 10KB, 9KB, 20KB, 12KB, 12KB, 7KB.]

SLIDE 20

The Path to the Breakthrough

[Figure labels: Fingerprint matched! 16KB.]

SLIDE 21

The Path to the Breakthrough

[Figure labels: Fingerprint matched! (2 times). Chunk sizes: 7KB, 16KB. Position marker: P.]

SLIDE 22

The Path to the Breakthrough

[Figure labels: Fingerprint matched! (3 times). Chunk sizes: 20KB, 16KB, 7KB. Position markers: P, P.]

SLIDE 23

The Path to the Breakthrough

[Figure labels: Fingerprint matched! (5 times). Chunk sizes: 16KB, 7KB, 20KB. Position markers: P, P, P, P.]

This almost always happens!

SLIDE 24

Duplicate Locality

§ Duplicate locality: if two chunks are duplicates, the chunks that follow them (in their respective files or data streams) are likely duplicates of each other.
§ Duplicate chunks tend to stay together.

[Figure (Debian): percentage of chunks (%) versus number of files (10, 20, 40, 80, 90), for all duplicate chunks and for duplicate chunks immediately following another duplicate chunk]
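The two quantities plotted for duplicate locality can be computed from chunk fingerprints as in the sketch below. The function `duplicate_locality` and its input format (one list of fingerprints per file, processed in order) are assumptions made for illustration; the talk does not describe how the measurement was implemented.

```python
def duplicate_locality(files: list[list[str]]) -> tuple[float, float]:
    """Return (fraction of duplicate chunks,
               fraction of duplicate chunks that directly follow a duplicate)."""
    seen = set()
    total = dup = dup_after_dup = 0
    for fingerprints in files:
        prev_was_dup = False  # locality is tracked within one file
        for fp in fingerprints:
            is_dup = fp in seen
            seen.add(fp)
            total += 1
            if is_dup:
                dup += 1
                if prev_was_dup:
                    dup_after_dup += 1
            prev_was_dup = is_dup
    if total == 0:
        return 0.0, 0.0
    return dup / total, dup_after_dup / total


# Two toy "files"; the second repeats most chunks of the first.
print(duplicate_locality([["a", "b", "c", "d"], ["a", "b", "x", "d"]]))
```

When the two returned fractions are close to each other, most duplicate chunks directly follow another duplicate, which is the property RapidCDC relies on.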

SLIDE 26

RapidCDC: Using Next Chunk in History as a Hint

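[Figure: File A is divided into chunks A1, A2, A3, A4, … at offsets P0, P1, P2, P3, P4, …, and its history is recorded as <FP1, s2>, <FP2, s3>, <FP3, s4>, <FP4, …>. File B is divided into chunks B1, B2, B3, B4, …. When FP(B1) == FP(A1), the following boundaries in File B are obtained by adding the recorded sizes s2, s3, and s4 to the current offset in the file.]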

§ History recording: whenever a chunk is detected, its size is attached to the fingerprint of its previous chunk.
§ Hint-assisted chunking: whenever a duplicate is detected, use the recorded size as a hint for the next chunk boundary.

§ Regular CDC is used for chunking until a duplicate chunk (e.g., B1) is found.
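The interplay of history recording and hint-assisted chunking can be sketched as follows. This is a simplified illustration, not the authors' implementation: it reuses the toy ‘?’ marker instead of a rolling hash, keeps a single hinted size per fingerprint, accepts a hint when the hinted position passes the boundary test or is the end of the data, and counts tested positions as a stand-in for chunking cost. All function names are hypothetical.

```python
import hashlib


def is_boundary(data: bytes, end: int) -> bool:
    """Toy boundary test: a chunk may end right after a '?' byte."""
    return data[end - 1] == ord("?")


def scan_for_boundary(data: bytes, start: int) -> int:
    """Regular CDC: test every position after `start` until a boundary is found."""
    for end in range(start + 1, len(data) + 1):
        if is_boundary(data, end):
            return end
    return len(data)


def rapid_chunk(data: bytes, history: dict[str, int]) -> tuple[list[bytes], int]:
    """Chunk `data`; `history` maps a fingerprint to the size of the chunk after it."""
    chunks, tested = [], 0
    start, prev_fp, hint = 0, None, None
    while start < len(data):
        end = None
        if hint is not None:
            cand = start + hint
            tested += 1  # one test at the hinted position
            if cand == len(data) or (cand < len(data) and is_boundary(data, cand)):
                end = cand  # hint accepted: skip the byte-by-byte scan
        if end is None:
            end = scan_for_boundary(data, start)
            tested += end - start
        chunk = data[start:end]
        fp = hashlib.sha1(chunk).hexdigest()
        if prev_fp is not None:
            history[prev_fp] = len(chunk)  # history recording
        hint = history.get(fp)  # set only if this fingerprint was seen before
        chunks.append(chunk)
        prev_fp, start = fp, end
    return chunks, tested


history = {}
file_a = b"HOWAREYOU?OK?REALLY?YES?NO"
chunks_a, tested_a = rapid_chunk(file_a, history)
chunks_b, tested_b = rapid_chunk(b"H" + file_a, history)
print(tested_a, tested_b)
```

In this toy run File B is scanned byte by byte only until its first duplicate chunk is found; after that each boundary costs a single test, so fewer positions are tested than File B has bytes.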

SLIDE 27

More Design Considerations …

§ A chunk may have been followed by chunks of different sizes

– Maintain a size list

§ Validation of hinted next-chunk boundaries

– Four alternative criteria with different efficiency and confidence levels

Ø FF (fast-forwarding only)
Ø FF+RWT (rolling window test)
Ø FF+MT (marker test)
Ø FF+RWT+FPT (fingerprint test)

§ Please refer to the paper for details.
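One possible reading of the size list combined with the strictest criterion (FF+RWT+FPT) is sketched below. Everything here is an assumption made for illustration, since the slide defers the details to the paper: the helper names, the toy boundary test passed in as `is_boundary`, the set `known_fps` of fingerprints already stored, and the choice to move an accepted size to the front of the list.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    return hashlib.sha1(chunk).hexdigest()


def pick_hinted_end(data, start, size_list, is_boundary, known_fps):
    """Try each hinted size in order; return a validated end offset or None."""
    for idx, size in enumerate(size_list):
        end = start + size
        if end > len(data):
            continue
        if not is_boundary(data, end):  # boundary test at the hinted position
            continue
        if fingerprint(data[start:end]) not in known_fps:  # fingerprint test
            continue
        # Assumed policy: keep the most recently confirmed size first.
        size_list.insert(0, size_list.pop(idx))
        return end
    return None  # no hint validated: fall back to regular CDC


data = b"OK?REALLY?YES?"
ends_with_marker = lambda d, e: d[e - 1] == ord("?")  # toy boundary test
known = {fingerprint(b"REALLY?")}
sizes = [4, 7]  # sizes previously seen after the chunk b"OK?"
print(pick_hinted_end(data, 3, sizes, ends_with_marker, known), sizes)
```

Dropping the fingerprint test from this sketch gives a cheaper but less certain check, which is the kind of trade-off between efficiency and confidence that the four criteria span.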

SLIDE 28

Evaluation of RapidCDC

§ Prototype: based on a rolling-window-based CDC system.

– Using Rabin/Gear as the rolling function for rolling-window computation.
– Using SHA1 to calculate fingerprints.

§ Three disks with different speeds are tested.

– SATA hard disk: 138 MB/s and 150 MB/s for sequential read/write.
– SATA SSD: 520 MB/s and 550 MB/s for sequential read/write.
– NVMe SSD: 1.2 GB/s and 2.4 GB/s for sequential read/write.
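For readers unfamiliar with Gear, the sketch below shows the general shape of a Gear-style rolling function: one shift and one table lookup per byte. The table contents, the seed, and the 64-bit width are assumptions for illustration and are not taken from the prototype.

```python
import random

_rng = random.Random(2019)  # arbitrary seed (assumed)
GEAR = [_rng.getrandbits(64) for _ in range(256)]  # one random value per byte
MASK64 = (1 << 64) - 1


def gear_update(h: int, byte: int) -> int:
    """Advance the hash by one byte: shift left, add the table entry."""
    return ((h << 1) + GEAR[byte]) & MASK64


def gear_hash(data: bytes) -> int:
    h = 0
    for b in data:
        h = gear_update(h, b)
    return h


window = bytes(range(64))
# With a 64-bit hash, bytes older than the last 64 have been shifted out.
print(gear_hash(b"any prefix" + window) == gear_hash(window))
```

Because each update shifts the hash by one bit, only the most recent 64 bytes influence a 64-bit value, so the window is implicit and no byte has to be subtracted explicitly.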

SLIDE 29

Synthetic Datasets: Insert/Delete

§ Chunking speedup correlates with the deduplication ratio.
§ The deduplication ratio is little affected (except under one very aggressive validation criterion).

[Figure: two panels plotted against the number of modifications (1000, 2000, 5000, 10000, 20000). Deduplication ratio: Regular, FF+RWT+FPT, FF+RWT, FF+MT, FF. Speedup: FF+RWT+FPT, FF+RWT, FF+MT, FF.]

SLIDE 30

Real-world Datasets: Chunking Speed

§ Chunking speedup approaches the deduplication ratio.
§ Negligible deduplication ratio reductions (if any).

33X faster!

[Figure: two panels for Debian, Neo4j, Wordpress, and Nodejs. Speedup: FF+RWT+FPT, FF+RWT, FF+MT, FF. Deduplication ratio: Regular, FF+RWT+FPT, FF+RWT, FF+MT, FF.]

SLIDE 31

Conclusions

§ RapidCDC represents a disruptively new approach to improving CDC chunking speed.
§ It increases chunking speed by up to 33X without loss of deduplication ratio.
§ Its adoption in an existing CDC deduplication system does not require any major change to the system's current operation flow.
§ Its implementation in an existing CDC deduplication system requires minimal code changes (100-200 lines of C code in our prototype).

§ A prototype implementation is available at https://github.com/moking/rapidcdc