Adaptive Look-Ahead Window Assisted Chunk Caching



SLIDE 1

ALACC: Accelerating Restore Performance of Data Deduplication Systems Using Adaptive Look-Ahead Window Assisted Chunk Caching

Zhichao Cao, Hao Wen, Fenggang Wu and David H.C. Du University of Minnesota, Twin Cities 02/15/2018

SLIDE 2

Center for Research in

Intelligent Storage

Agenda

  • Deduplication Process
  • Restore Process with Different Caching Schemes
    – Container/chunk based caching
    – Forward Assembly
  • Objective and Challenges
  • Proposed Approach
    – Look-ahead window assisted chunk based caching (all fixed)
    – Adaptive Look-ahead Chunk-based Caching (ALACC)
  • Evaluations
  • Conclusions and Future Work

SLIDE 4

Deduplication Process [1]

[Figure: deduplication data path. A Sliding Window over the Byte Stream produces chunks; each chunk is looked up in the Indexing Table (here chunk 14: Not Found). Other elements: Recipe, Container Buffer, Container Storage.]

Recipe Entry:

  • 1. Chunk ID
  • 2. Chunk Size
  • 3. Container Address
  • 4. Offset in the container
  • 5. Other meta information

[1] Zhu B, Li K, Patterson R H. Avoiding the Disk Bottleneck in the Data Domain Deduplication File System[C]//FAST. 2008, 8: 1-14.
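
The recipe entry fields and the index lookup can be sketched in a few lines. This is a minimal illustration, not the Data Domain implementation: it assumes chunks are already cut (the sliding-window chunking step is omitted), uses SHA-1 as the chunk ID, and the names `RecipeEntry`, `deduplicate`, and `CONTAINER_CAPACITY` are hypothetical.

```python
import hashlib
from dataclasses import dataclass

CONTAINER_CAPACITY = 4  # chunks per container (toy value, an assumption)


@dataclass(frozen=True)
class RecipeEntry:
    chunk_id: str        # fingerprint of the chunk
    chunk_size: int      # size in bytes
    container_addr: int  # index of the container holding the chunk
    offset: int          # byte offset of the chunk inside that container


def deduplicate(chunks):
    """Return (recipe, containers) for an iterable of bytes chunks."""
    index = {}           # indexing table: chunk_id -> (container_addr, offset)
    containers = []      # container storage: sealed containers
    buffer, buffer_bytes = [], 0   # container buffer being filled
    recipe = []
    for data in chunks:
        cid = hashlib.sha1(data).hexdigest()
        if cid not in index:                       # "Not Found": unique chunk
            index[cid] = (len(containers), buffer_bytes)
            buffer.append(data)
            buffer_bytes += len(data)
            if len(buffer) == CONTAINER_CAPACITY:  # seal the full container
                containers.append(buffer)
                buffer, buffer_bytes = [], 0
        addr, off = index[cid]                     # duplicates reuse the stored copy
        recipe.append(RecipeEntry(cid, len(data), addr, off))
    if buffer:                                     # flush the last partial container
        containers.append(buffer)
    return recipe, containers
```

Every chunk, unique or duplicate, gets a recipe entry; only unique chunks consume container space.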

SLIDE 5

Deduplication Process [1]

[Figure: the same deduplication data path. Here the lookup of chunk 5 in the Indexing Table returns Exists, so the chunk is a duplicate. Other elements: Byte Stream, Sliding Window, Recipe, Container Buffer, Container Storage.]

[1] Zhu B, Li K, Patterson R H. Avoiding the Disk Bottleneck in the Data Domain Deduplication File System[C]//FAST. 2008, 8: 1-14.

SLIDE 7

Why Is Improving Restore Performance Important?

  • Because of severe data fragmentation and the size mismatch between the requested data and the I/O unit, restore performance is much lower than directly reading out data that is not deduplicated.
  • CPU and memory resources are limited.

[Figure: a requested chunk sequence and the Container Storage that holds those chunks.]

SLIDE 8

Restore Process with Container-based Caching

[Figure: restore data path with a container cache. Elements: Recipe, Container Cache, Assembling Buffer, Restored Data Storage, Container Storage, Restore Direction.]
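
A minimal sketch of container-based caching with LRU replacement, counting container reads. It assumes the recipe is given as (chunk, container) pairs and that the cache size is expressed in whole containers; the function name is hypothetical.

```python
from collections import OrderedDict


def restore_container_lru(recipe, cache_containers):
    """Count container reads when whole containers are cached with LRU.

    recipe: list of (chunk_id, container_id) pairs in restore order.
    cache_containers: number of containers the cache can hold.
    """
    cache = OrderedDict()  # container_id -> None, least recently used first
    reads = 0
    for _chunk, container in recipe:
        if container in cache:
            cache.move_to_end(container)   # hit: mark as most recently used
        else:
            reads += 1                     # miss: read the container from storage
            cache[container] = None
            if len(cache) > cache_containers:
                cache.popitem(last=False)  # evict the least recently used container
    return reads


recipe = [(1, "A"), (2, "B"), (3, "A"), (4, "C"), (5, "B"), (6, "A")]
print(restore_container_lru(recipe, 2))  # 5
print(restore_container_lru(recipe, 3))  # 3
```

Management is cheap because there is one cache entry per container, but every cached container also holds the chunks that will never be used.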

SLIDE 9

Restore Process with Chunk-based Caching

[Figure: restore data path with a chunk cache. Elements: Recipe, Chunk Cache, Assembling Buffer, Container Read Buffer, Restored Data Storage, Container Storage.]
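
The chunk-based variant can be sketched the same way. This version assumes that every chunk of a read-in container is inserted into an LRU chunk cache and that the cache size is counted in chunks; both are simplifying assumptions, and the function name is hypothetical.

```python
from collections import OrderedDict


def restore_chunk_lru(recipe, containers, cache_chunks):
    """Count container reads with an LRU cache that holds individual chunks.

    recipe: list of chunk ids in restore order.
    containers: dict container_id -> list of chunk ids stored in it.
    cache_chunks: number of chunks the cache can hold (at least 1).
    """
    location = {c: cid for cid, chunks in containers.items() for c in chunks}
    cache = OrderedDict()  # chunk_id -> None, least recently used first
    reads = 0
    for chunk in recipe:
        if chunk in cache:
            cache.move_to_end(chunk)       # hit
            continue
        reads += 1                         # miss: read the chunk's container
        for c in containers[location[chunk]]:
            cache[c] = None
            cache.move_to_end(c)
        cache.move_to_end(chunk)           # the requested chunk is the most recent
        while len(cache) > cache_chunks:
            cache.popitem(last=False)      # evict least recently used chunks
    return reads
```

Compared with container-based caching, the cache tracks one entry per chunk, which is where the higher management overhead comes from.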

SLIDE 10

Container-based Caching vs. Chunk-based Caching

Container-based Caching:

  • Less operating and management overhead
  • Relatively higher cache miss ratio, especially when the caching space is limited

Chunk-based Caching:

  • 1. Higher cache hit ratio
  • 2. Even higher if a look-ahead window is applied
  • Higher operating and management overhead

SLIDE 11

Container-based Caching vs. Chunk-based Caching

[Figure, two panels, each comparing Container_LRU and Chunk_LRU as the Total Cache Size varies. Panel 1: Container-reads per 100MB Restored. Panel 2: Computing Time (seconds/GB).]

SLIDE 12

Forward Assembly Scheme [1]

[Figure: restore data path with forward assembly. Elements: Recipe, Look-Ahead Window, Forward Assembling Area (FAA), Container Read Buffer, Container Storage, Restored Data Storage.]

[1] Lillibridge M, Eshghi K, Bhagwat D. Improving restore speed for backup systems that use inline chunk-based deduplication[C]//FAST. 2013: 183-198.
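
A rough sketch of forward assembly, assuming fixed-size chunks so that the FAA can be counted in chunk slots rather than bytes. Each container read fills every still-empty FAA slot that needs a chunk from that container. The function name and arguments are hypothetical.

```python
def restore_forward_assembly(recipe, location, faa_chunks):
    """Restore with a forward assembly area; return (container_reads, restored).

    recipe: list of chunk ids in restore order.
    location: dict chunk_id -> container_id.
    faa_chunks: number of chunk slots in the FAA.
    """
    reads = 0
    restored = []
    for start in range(0, len(recipe), faa_chunks):
        window = recipe[start:start + faa_chunks]   # recipe range covered by the FAA
        filled = [False] * len(window)
        for i, chunk in enumerate(window):
            if filled[i]:
                continue
            container = location[chunk]
            reads += 1                              # read this container once
            for j in range(i, len(window)):         # fill every slot it can serve
                if not filled[j] and location[window[j]] == container:
                    filled[j] = True
        restored.extend(window)                     # flush the assembled FAA
    return reads, restored
```

No cache is kept between FAA ranges, so a chunk that is re-used beyond the FAA range costs another container read.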

SLIDE 13

Chunk-based Caching vs. Forward Assembly

Forward Assembly:

  • 1. Highly efficient when most chunks from the same container are used within the FAA range
  • 2. Low operating and management overhead
  • Workload sensitive; requires good workload locality

Chunk-based Caching:

  • 1. When chunks are re-used at a relatively long distance (larger than the FAA), caching is more effective
  • Higher operating and management overhead

SLIDE 15

Objective and Challenges

  • Objective:
    – Forward assembly + chunk-based caching + LAW (limited memory space)
  • Challenges:
    – When the total memory available for restore is limited and fixed, it is unclear how to use these schemes efficiently.
    – How to make better trade-offs that achieve fewer container-reads while limiting the computing overhead, including the LAW, caching, and forward assembly overhead.
    – Making the design adapt to a changing workload is very challenging.

SLIDE 17

Look-ahead Window Assisted Chunk Cache

[Figure: recipe divided into Restored, FAA Covered Range, Look-Ahead Window (LAW) Covered Range, and Unknown; the LAW provides Information for Chunk Cache. Other elements: FAA (FAB1, FAB2), Chunk Cache, Container Read Buffer, Container Storage, Restored Data Storage.]

What’s the caching policy?

SLIDE 18

Chunks in the Read-in Container

  • P-chunk: Probably used chunk (chunk 10 in the figure)
  • U-chunk: Unused chunk (chunk 17 in the figure)
  • F-chunk: Future used chunk (chunks 12 and 13 in the figure)

[Figure: the Chunk Cache is split into an F-cache and a P-cache. Other elements: recipe regions (Restored, FAA Covered Range, Look-Ahead Window Covered Range, Unknown), Information for Chunk Cache, FAA (FAB1, FAB2), Container Read Buffer, Container Storage, Restored Data Storage.]
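
The slide names the three chunk classes but does not spell out the classification rule. The sketch below assumes one plausible rule: a chunk referenced in the LAW beyond the FAA-covered range is an F-chunk, a chunk referenced only inside the FAA-covered range is a P-chunk, and a chunk referenced nowhere in the LAW is a U-chunk. The function name and the sample data are hypothetical.

```python
def classify_chunks(container_chunks, faa_range, law_beyond_faa):
    """Label each chunk of a read-in container as "F", "P", or "U".

    faa_range: chunk ids in the FAA-covered part of the recipe.
    law_beyond_faa: chunk ids in the LAW-covered part after the FAA.
    The rule itself is an assumption, not taken from the slides.
    """
    inside = set(faa_range)
    beyond = set(law_beyond_faa)
    labels = {}
    for chunk in container_chunks:
        if chunk in beyond:
            labels[chunk] = "F"   # will be needed again later in the LAW
        elif chunk in inside:
            labels[chunk] = "P"   # used now, future use unknown
        else:
            labels[chunk] = "U"   # not referenced anywhere in the LAW
    return labels


print(classify_chunks([17, 12, 13, 10], [2, 5, 10, 14], [15, 7, 22, 12, 13]))
# {17: 'U', 12: 'F', 13: 'F', 10: 'P'}
```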

SLIDE 19

Caching Priority of F-cache

F-chunks being used in the near future have higher priority than F-chunks being used in the far future.

[Figure: the F-cache ordered from a High priority end to a Low priority end, shown with the same recipe regions, FAA (FAB1, FAB2), Container Read Buffer, Container Storage, and Restored Data Storage as on the previous slide.]
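
The F-cache ordering can be sketched by ranking cached chunks by the position of their next use in the LAW. This is an illustrative sketch, not the authors' data structure; the function name is hypothetical and ties are broken by chunk id only to keep the output deterministic.

```python
def f_cache_priority(f_cache, law):
    """Order F-chunks from the high priority end to the low priority end.

    f_cache: set of cached chunk ids.
    law: upcoming chunk ids covered by the look-ahead window, nearest first.
    """
    next_use = {}
    for pos, chunk in enumerate(law):
        next_use.setdefault(chunk, pos)   # keep only the nearest use
    # Chunks not found in the LAW sort last (position len(law)).
    return sorted(f_cache, key=lambda c: (next_use.get(c, len(law)), c))


order = f_cache_priority({12, 13, 22}, [22, 18, 12, 23, 13, 22])
print(order)      # [22, 12, 13]
print(order[-1])  # 13 is at the low priority end and would be evicted first
```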

SLIDE 20

Caching Priority of P-cache

The P-cache uses an LRU-based caching policy.

[Figure: the P-cache, next to the F-cache, ordered from a High priority end to a Low priority end, shown with the same recipe regions, FAA (FAB1, FAB2), Container Read Buffer, Container Storage, and Restored Data Storage as on the previous slides.]

SLIDE 21

Good Enough?

  • What’s the memory space ratio between forward assembly and chunk cache?
  • What’s the size of LAW?
    – Too large: the computing overhead is large and the extra information in the LAW is wasted
    – Too small: it becomes forward assembly + LRU cache
  • What if the workload locality changes?

SLIDE 22

Large FAA Size (Small Cache)

If most of the chunks (e.g., duplicated chunks) from the same container can be covered by the FAA, a large FAA size is preferred.

[Figure: recipe regions (Restored, FAA Covered Range, Look-Ahead Window Covered Range, Unknown) with a large FAA (FAB1, FAB2, FAB3) and a small Chunk Cache; the Cache Range is marked. Other elements: Container Read Buffer, Container Storage, Restored Data Storage.]

SLIDE 23

Large Chunk Cache Size (Small FAA)

If the reuse distance of most chunks is relatively large, caching chunks is preferred.

[Figure: recipe regions (Look-Ahead Window Covered Range, Unknown) with a small FAA (FAB3) and a large Chunk Cache; the FAA Range is marked. Other elements: Information for Chunk Cache, Container Read Buffer, Container Storage, Restored Data Storage.]

SLIDE 24

LAW Size

[Figure: the same recipe shown with two different LAW sizes; regions: Restored, FAA Covered Range, Look-Ahead Window Covered Range, Unknown.]

  • With a larger LAW, more F-chunks can potentially be identified and cached. However, once the cache is full of F-chunks, further increasing the LAW size cannot improve the cache efficiency; it only brings in more unnecessary overhead.
  • With a smaller LAW, lower computing overhead can be achieved, but the cache will store more P-chunks, which potentially reduces the caching efficiency.
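
The diminishing return of a larger LAW can be illustrated with a toy calculation. The sketch assumes the LAW starts at the same recipe position as the FAA and extends past it, and that an F-chunk is a container chunk referenced beyond the FAA-covered range; both function names are hypothetical.

```python
def f_chunks_found(container_chunks, upcoming, faa_size, law_size):
    """Chunks of a read-in container that a LAW of `law_size` entries marks as F-chunks.

    upcoming: chunk ids still to be restored, nearest first; the first
    `faa_size` entries are the FAA-covered range.
    """
    beyond_faa = set(upcoming[faa_size:law_size])
    return [c for c in container_chunks if c in beyond_faa]


def useful_f_chunks(container_chunks, upcoming, faa_size, law_size, cache_slots):
    """Number of identified F-chunks that actually fit into the cache."""
    found = f_chunks_found(container_chunks, upcoming, faa_size, law_size)
    return min(len(found), cache_slots)


container = [1, 2, 3, 4]
upcoming = [9, 8, 1, 7, 2, 6, 3, 5, 4]
for law in (4, 6, 9):
    print(law, useful_f_chunks(container, upcoming, 2, law, cache_slots=2))
# 4 1
# 6 2
# 9 2
```

Growing the LAW from 6 to 9 entries identifies more F-chunks but caches no more of them, while the scan itself becomes longer.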

SLIDE 25

The LAW Size Influence

SLIDE 26

ALACC

[Flowchart: adjustment of the FAA, chunk cache, and LAW sizes. Recoverable boxes and labels:]

  • Increase FAA (conditions joined by OR):
    – The re-use distances of most duplicated data chunks are within the FAA range
    – The data chunks in the first FAB are identified mostly as unique data chunks and these chunks are stored in the same or close containers
  • Shrink LAW (conditions joined by OR):
    – P-chunk number is very small
    – F-chunk added during the restore cycle is very large
  • Other actions: Adjust LAW, Shrink Cache; Enlarge Cache, Shrink FAA; Adjust LAW
  • Branch label: Not satisfied

SLIDE 28

Experiment Setup

  • Five Caching Designs:
    – LRU-based container caching (Container_LRU)
    – LRU-based chunk caching (Chunk_LRU)
    – Forward assembly (FAA)
    – Optimal configuration with fixed forward assembly and chunk-based caching (Fix_Opt)
    – ALACC
  • Four Traces:
    – 2 FSL traces from FSL /home directory snapshots of the year 2014 [1]
    – 2 EMC weekly full-backup traces [2]

[1] http://tracer.filesystems.org/
[2] Nohhyun Park and David J Lilja. Characterizing datasets for data deduplication in backup applications. In Workload Characterization (IISWC), 2010 IEEE International Symposium on, pages 1–10. IEEE, 2010.

SLIDE 29

How We Get the Fix_Opt

The Fix_Opt configuration of each trace (FAA/chunk cache/LAW size, in container units)

[Figure: Restore Throughput (MB/S) as a function of FAA Size and LAW Size, with throughput bands from 12-13 to 17-18.]

We run all possible configurations for each trace to discover the optimal throughput. Note that finding the optimal Fix_Opt configuration takes tens of experiments, which is almost impossible to carry out in a real-world production scenario.

Chunk cache size = total memory size – FAA size
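
The exhaustive search behind Fix_Opt can be sketched as a grid search over FAA and LAW sizes, with the chunk cache taking whatever memory the FAA leaves. `measure_throughput` is a hypothetical callback standing in for one full restore experiment; the toy function in the usage example is made up and is not measured data.

```python
from itertools import product


def find_fix_opt(total_memory, law_sizes, measure_throughput):
    """Try every (FAA, LAW) pair and return (throughput, faa, cache, law) of the best.

    total_memory: memory for FAA plus chunk cache, in container units.
    law_sizes: iterable of candidate LAW sizes, in container units.
    measure_throughput(faa, cache, law): runs one restore and returns MB/s.
    """
    best = None
    for faa, law in product(range(1, total_memory), list(law_sizes)):
        cache = total_memory - faa          # chunk cache size = total - FAA size
        throughput = measure_throughput(faa, cache, law)
        if best is None or throughput > best[0]:
            best = (throughput, faa, cache, law)
    return best


# Toy stand-in for a real restore run, peaking at FAA=4, LAW=56.
def toy(faa, cache, law):
    return -((faa - 4) ** 2) - (law - 56) ** 2


print(find_fix_opt(16, range(16, 97, 8), toy))  # (0, 4, 12, 56)
```

Each call to the callback is a complete restore of the trace, which is why this search is only practical offline.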

SLIDE 30

Restore Throughput

Restore Throughput (MB/S): original data stream size divided by the total restore time.

[Figure, two panels (FSL_1, FSL_2): Restore Throughput (MB/S) versus Version Number (1 to 5) for Container_LRU, Chunk_LRU, FAA, Fix_Opt, and ALACC. Configurations annotated on the panels: (4/12/56), (6/10/72). Other annotations: >100%, >10%, <15%.]

  • 1. ACS stands for Average Chunk Size
  • 2. DR stands for the Deduplication Ratio
  • 3. CFL stands for the Chunk Fragmentation Level

SLIDE 31

Speed Factor

Speed Factor (MB/container-read): the mean data size restored per container read

[Figure, two panels (FSL_1, FSL_2): Speed Factor versus Version Number (1 to 5) for Container_LRU, Chunk_LRU, FAA, Fix_Opt, and ALACC. Configurations annotated on the panels: (4/12/56), (6/10/72).]

SLIDE 32

Computing Cost Factor

Computing Cost Factor (second/GB): the time spent on computing operations (subtracting the storage I/O time from the restore time) per GB data restored

[Figure, two panels (FSL_1, FSL_2): Computing Cost Factor versus Version Number (1 to 5) for Container_LRU, Chunk_LRU, FAA, Fix_Opt, and ALACC. Configurations annotated on the panels: (4/12/56), (6/10/72).]
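
The three evaluation metrics (restore throughput, speed factor, computing cost factor) reduce to simple ratios. The sketch below assumes sizes are given in MB with 1 GB = 1024 MB; the function and parameter names are hypothetical and the sample numbers are made up.

```python
def restore_metrics(stream_mb, restore_seconds, io_seconds, container_reads):
    """Return (throughput MB/s, speed factor MB/container-read, computing cost s/GB)."""
    throughput = stream_mb / restore_seconds        # original stream size / total restore time
    speed_factor = stream_mb / container_reads      # mean data restored per container read
    computing_seconds = restore_seconds - io_seconds  # restore time minus storage I/O time
    computing_cost = computing_seconds / (stream_mb / 1024)
    return throughput, speed_factor, computing_cost


print(restore_metrics(2048, 100, 60, 1024))  # (20.48, 2.0, 20.0)
```

A higher speed factor means fewer container reads for the same data, while a lower computing cost factor means less CPU time per GB restored.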

SLIDE 34

Conclusions and Future Work

  • Studied the effectiveness and the efficiency of different caching mechanisms.
  • Designed an adaptive algorithm called ALACC, which adjusts the sizes of the FAA, chunk cache, and LAW as the workload changes.
  • In our future work, duplicated data chunk rewriting and a multi-threaded implementation will be investigated and integrated with ALACC to further improve the restore performance.

SLIDE 36

Look-ahead Window Assisted Chunk Cache

[Figure: same diagram as on slide 17. Recipe regions: Restored, FAA Covered Range, Look-Ahead Window Covered Range, Unknown. Other elements: Information for Chunk Cache, FAA (FAB1, FAB2), Chunk Cache, Container Read Buffer, Container Storage, Restored Data Storage.]

SLIDE 37

[Figure, two panels (EMC_1, EMC_2): Speed Factor versus Version Number (1 to 5) for Container_LRU, Chunk_LRU, FAA, Fix_Opt, and ALACC.]

SLIDE 38

[Figure, two panels (EMC_1, EMC_2): Computing Cost Factor versus Version Number (1 to 5) for Container_LRU, Chunk_LRU, FAA, Fix_Opt, and ALACC.]

SLIDE 39

[Figure, two panels (EMC_1, EMC_2): Restore Throughput (MB/S) versus Version Number (1 to 5) for Container_LRU, Chunk_LRU, FAA, Fix_Opt, and ALACC.]
