ALACC: Accelerating Restore Performance of Data Deduplication Systems Using Adaptive Look-Ahead Window Assisted Chunk Caching
Zhichao Cao, Hao Wen, Fenggang Wu and David H.C. Du
University of Minnesota, Twin Cities
02/15/2018
Agenda
Center for Research in
Intelligent Storage
- Deduplication Process
- Restore Process with Different Caching Schemes
– Container/chunk based caching
– Forward Assembly
- Objective and Challenges
- Proposed Approach
– Look-ahead window assisted chunk based caching (all fixed)
– Adaptive Look-ahead Chunk-based Caching (ALACC)
- Evaluations
- Conclusions and Future Work
Deduplication Process [1]
[Figure: deduplication data path. Labels: Byte Stream, Sliding Window, Indexing Table (lookup result: Not Found), Container Buffer, Container Storage, Recipe.]
Recipe Entry:
- 1. Chunk ID
- 2. Chunk Size
- 3. Container Address
- 4. Offset in the container
- 5. Other meta information
[1] Zhu B, Li K, Patterson R H. Avoiding the Disk Bottleneck in the Data Domain Deduplication File System[C]//Fast. 2008, 8: 1-14.
[Figure: the same deduplication data path when the Indexing Table lookup succeeds (lookup result: Exists).]
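The deduplication write path in the figures can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `DedupStore`, `RecipeEntry` and `CONTAINER_CAPACITY` are hypothetical, chunks are assumed to be already cut by the sliding window, SHA-1 is an assumed fingerprint function, and the offset is a chunk index rather than a byte offset.

```python
import hashlib
from dataclasses import dataclass, field

CONTAINER_CAPACITY = 4  # chunks per container (hypothetical, tiny for illustration)


@dataclass
class RecipeEntry:
    chunk_id: str    # fingerprint of the chunk
    size: int        # chunk size in bytes
    container: int   # container address
    offset: int      # position of the chunk inside the container


@dataclass
class DedupStore:
    index: dict = field(default_factory=dict)        # chunk_id -> (container, offset)
    containers: list = field(default_factory=list)   # sealed containers
    buffer: list = field(default_factory=list)       # open container buffer

    def _seal(self):
        # Move the full container buffer to container storage.
        if self.buffer:
            self.containers.append(self.buffer)
            self.buffer = []

    def write(self, chunks):
        """Deduplicate a sequence of chunks and return the recipe."""
        recipe = []
        for data in chunks:
            cid = hashlib.sha1(data).hexdigest()
            if cid not in self.index:
                # "Not Found": a unique chunk goes into the container buffer.
                if len(self.buffer) == CONTAINER_CAPACITY:
                    self._seal()
                self.index[cid] = (len(self.containers), len(self.buffer))
                self.buffer.append(data)
            # "Exists" (or just stored): the recipe only references the chunk.
            container, offset = self.index[cid]
            recipe.append(RecipeEntry(cid, len(data), container, offset))
        return recipe
```

Writing the chunks `a b a c d e b` with this sketch stores five unique chunks, and the recipe entries for the repeated `a` and `b` point back to container 0.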
Why Is Improving Restore Performance Important?
- Because of serious data fragmentation and the size mismatch between the requested data and the I/O unit, restore performance is much lower than directly reading data that is not deduplicated.
- CPU and memory resources are limited.
[Figure: a requested chunk sequence and the containers in Container Storage that hold those chunks.]
Restore Process with Container-based Caching
[Figure: restore data path. Labels: Recipe, Restore Direction, Container Storage, Container Cache, Assembling Buffer, Restored Data Storage.]
Restore Process with Chunk-based Caching
[Figure: restore data path. Labels: Recipe, Container Storage, Container Read Buffer, Chunk Cache, Assembling Buffer, Restored Data Storage.]
Container-based Caching vs. Chunk-based Caching
Container-based Caching
- Less operating and management overhead
- Relatively higher cache miss ratio, especially when the caching space is limited
Chunk-based Caching
- 1. Higher cache hit ratio
- 2. Even much higher if a look-ahead window is applied
- Higher operating and management overhead
[Figure: two plots comparing Container_LRU and Chunk_LRU as the Total Cache Size varies: Container-reads per 100MB Restored, and Computing Time (seconds/GB).]
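To make the container-read comparison concrete, the two LRU schemes can be simulated on a toy recipe. This is a sketch under stated assumptions, not the evaluated implementation: the function names are hypothetical, `location` maps each chunk to its container, and the chunk cache is assumed to admit every chunk of a container that is read in.

```python
from collections import OrderedDict


def container_lru_reads(recipe, location, capacity_containers):
    """Count container reads when whole containers are cached with LRU."""
    cache = OrderedDict()  # container id -> True, oldest first
    reads = 0
    for chunk in recipe:
        container = location[chunk]
        if container in cache:
            cache.move_to_end(container)       # hit: refresh recency
        else:
            reads += 1                         # miss: read the container
            cache[container] = True
            if len(cache) > capacity_containers:
                cache.popitem(last=False)      # evict least recently used
    return reads


def chunk_lru_reads(recipe, location, capacity_chunks):
    """Count container reads when individual chunks are cached with LRU."""
    members = {}
    for chunk, container in location.items():
        members.setdefault(container, []).append(chunk)
    cache = OrderedDict()  # chunk id -> True, oldest first
    reads = 0
    for chunk in recipe:
        if chunk not in cache:
            reads += 1
            # Assumption: all chunks of the read-in container enter the cache.
            for sibling in members[location[chunk]]:
                cache[sibling] = True
        cache.move_to_end(chunk)               # the requested chunk is most recent
        while len(cache) > capacity_chunks:
            cache.popitem(last=False)
    return reads


location = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2, "f": 2}
recipe = list("acaceace")
print(container_lru_reads(recipe, location, 2))  # 6
print(chunk_lru_reads(recipe, location, 4))      # 3
```

With the same memory (two containers of two chunks each, or four chunks), the chunk cache keeps the three hot chunks and lets the unused siblings age out, while the container cache keeps evicting a container that is needed again.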
Forward Assembly Scheme [1]
[Figure: restore data path. Labels: Recipe, Look-Ahead Window, Container Storage, Container Read Buffer, Forward Assembling Area (FAA), Restored Data Storage.]
[1] Lillibridge M, Eshghi K, Bhagwat D. Improving restore speed for backup systems that use inline chunk-based deduplication[C]//FAST. 2013: 183-198.
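The forward assembly idea can be sketched as below: the recipe tells the restorer where each chunk belongs in the FAA, so one container read fills every slot in the current FAA range that needs a chunk from that container. The function name, the slot-based FAA (counted in chunks rather than bytes) and the `read_container` callback are assumptions made for illustration.

```python
def forward_assembly(recipe, location, faa_slots, read_container):
    """Restore `recipe` with a forward assembly area of `faa_slots` chunk slots.

    location: chunk id -> container id
    read_container: container id -> {chunk id: data} (hypothetical I/O callback)
    Returns (restored chunk data in recipe order, number of container reads).
    """
    restored, reads = [], 0
    for start in range(0, len(recipe), faa_slots):
        window = recipe[start:start + faa_slots]   # recipe range covered by the FAA
        faa = [None] * len(window)
        for i, chunk in enumerate(window):
            if faa[i] is not None:
                continue                           # filled by an earlier read
            data = read_container(location[chunk])
            reads += 1
            # Copy every chunk of this container that the FAA range still needs.
            for j in range(i, len(window)):
                if faa[j] is None and window[j] in data:
                    faa[j] = data[window[j]]
        restored.extend(faa)                       # flush the assembled FAA
    return restored, reads


containers = {0: {"a": b"A", "b": b"B"}, 1: {"c": b"C", "d": b"D"}}
location = {"a": 0, "b": 0, "c": 1, "d": 1}
recipe = ["a", "c", "b", "a", "d", "c"]
print(forward_assembly(recipe, location, 3, containers.__getitem__)[1])  # 4
print(forward_assembly(recipe, location, 6, containers.__getitem__)[1])  # 2
```

In this toy run, doubling the FAA from three to six slots halves the container reads, because each container is then read once for the whole recipe.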
Chunk-based Caching vs. Forward Assembly
Forward Assembly
- 1. Highly efficient when chunks from the same container are mostly used within the FAA range
- 2. Low operating and management overhead
- Workload sensitive; requires good workload locality
Chunk-based Caching
- 1. When chunks are re-used over a relatively long distance (larger than the FAA), caching is more effective
- Higher operating and management overhead
Objective and Challenges
- Objective:
– Forward assembly + chunk-based caching + look-ahead window (LAW), under limited memory space
- Challenges:
– When the total memory available for restore is limited and fixed, it is unclear how to use these schemes efficiently.
– How to make better trade-offs that achieve fewer container-reads while limiting the computing overhead, including the LAW, caching and forward assembly overhead.
– How to make the design adapt to a changing workload.
Look-ahead Window Assisted Chunk Cache
[Figure: recipe ranges labelled Restored, FAA Covered Range, Look-Ahead Window (LAW) Covered Range, Information for Chunk Cache and Unknown; memory areas labelled FAA (FAB1, FAB2), Chunk Cache and Container Read Buffer; Container Storage and Restored Data Storage.]
What’s the caching policy?
Chunks in the Read-in Container
- F-chunk: Future used chunk (chunks 12 and 13 in the figure)
- P-chunk: Probably used chunk (chunk 10 in the figure)
- U-chunk: Unused chunk (chunk 17 in the figure)
[Figure: the read-in container (17, 12, 13, 10) in the Container Read Buffer; the Chunk Cache is split into an F-cache and a P-cache; recipe ranges labelled Restored, FAA Covered Range, Look-Ahead Window Covered Range, Information for Chunk Cache and Unknown.]
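One way to read the three chunk types is sketched below. The exact rules are an assumption inferred from the figure, not stated on the slide: a chunk is treated as an F-chunk if it appears in the part of the LAW beyond the FAA, as a P-chunk if it does not but has appeared in the already restored part of the recipe, and as a U-chunk otherwise. The function and parameter names are hypothetical, and the sample values only loosely mirror the figure.

```python
def classify_chunks(container_chunks, restored, law_beyond_faa):
    """Label each chunk of a read-in container as 'F', 'P' or 'U'.

    restored: chunk ids already restored (assumed basis for P-chunks)
    law_beyond_faa: chunk ids in the LAW range beyond the FAA (assumed basis for F-chunks)
    """
    future = set(law_beyond_faa)
    past = set(restored)
    labels = {}
    for chunk in container_chunks:
        if chunk in future:
            labels[chunk] = "F"   # known to be used soon
        elif chunk in past:
            labels[chunk] = "P"   # used before, so probably used again
        else:
            labels[chunk] = "U"   # no evidence of use
    return labels


labels = classify_chunks(
    container_chunks=[17, 12, 13, 10],
    restored=[2, 5, 10, 14, 10, 15, 7, 22],
    law_beyond_faa=[23, 12, 13, 32, 23, 28, 6],
)
print(labels)  # {17: 'U', 12: 'F', 13: 'F', 10: 'P'}
```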
Caching Priority of F-cache
F-chunks that will be used in the near future have higher priority than F-chunks that will be used in the far future.
[Figure: the F-cache inside the Chunk Cache, ordered from a high priority end to a low priority end.]
Caching Priority of P-cache
The P-cache uses an LRU-based caching policy.
[Figure: the P-cache next to the F-cache inside the Chunk Cache, ordered from a high priority end to a low priority end.]
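The two priority rules can be combined in a small cache sketch. The class and method names are hypothetical, and one design choice here is an assumption that the slides do not state: when the cache is full, P-chunks are evicted before F-chunks. Within each part, the sketch follows the slides: the F-cache drops the chunk whose next use is farthest away, and the P-cache is LRU.

```python
from collections import OrderedDict


class ChunkCache:
    """Chunk cache split into an F-cache and a P-cache (illustrative sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.f_cache = {}             # chunk id -> next-use position in the recipe
        self.p_cache = OrderedDict()  # chunk id -> True, least recently used first

    def __len__(self):
        return len(self.f_cache) + len(self.p_cache)

    def add_f(self, chunk, next_use):
        """Cache a chunk known (from the LAW) to be used at position next_use."""
        self.p_cache.pop(chunk, None)  # an F-chunk is no longer merely "probable"
        self.f_cache[chunk] = next_use
        self._evict()

    def add_p(self, chunk):
        """Cache a chunk that is only probably used again."""
        if chunk in self.f_cache:
            return
        self.p_cache[chunk] = True
        self.p_cache.move_to_end(chunk)
        self._evict()

    def _evict(self):
        while len(self) > self.capacity:
            if self.p_cache:
                # Assumption: P-chunks go first, in LRU order.
                self.p_cache.popitem(last=False)
            else:
                # Low priority end of the F-cache: farthest next use.
                victim = max(self.f_cache, key=self.f_cache.get)
                del self.f_cache[victim]


cache = ChunkCache(capacity=3)
cache.add_p(10)
cache.add_p(21)
cache.add_f(12, next_use=40)
cache.add_f(13, next_use=45)   # evicts P-chunk 10
cache.add_f(32, next_use=50)   # evicts P-chunk 21
cache.add_f(28, next_use=30)   # evicts F-chunk 32 (farthest next use)
print(sorted(cache.f_cache))   # [12, 13, 28]
```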
Good Enough?
- What is the memory space ratio between forward assembly and chunk cache?
- What is the size of the LAW?
– Too large: the computing overhead is large and the extra information in the LAW is wasted
– Too small: the scheme becomes forward assembly + LRU cache
- What if the workload locality changes?
Large FAA Size (Small Cache)
If most of the chunks (e.g., duplicated chunks) from the same container can be covered by the FAA, a large FAA size is preferred.
[Figure: memory split into a large FAA (FAB1, FAB2, FAB3) and a small Chunk Cache; the recipe shows the Restored range, the FAA Covered Range, the Cache Range within the Look-Ahead Window Covered Range, and the Unknown range.]
Large Chunk Cache Size (Small FAA)
If the reuse distance of most chunks is relatively large, caching chunks is preferred.
[Figure: memory split into a small FAA (FAB3) and a large Chunk Cache; the recipe shows the FAA Range, the Look-Ahead Window Covered Range with Information for Chunk Cache, and the Unknown range.]
LAW Size
[Figure: the same recipe shown twice with different Look-Ahead Window Covered Ranges, each next to the Restored range, the FAA Covered Range and the Unknown range.]
With a larger LAW, more F-chunks can potentially be identified and cached. However, once the cache is full of F-chunks, further increasing the LAW size cannot improve cache efficiency; it only adds unnecessary overhead. With a smaller LAW, the computing overhead is lower, but the cache stores more P-chunks, which potentially reduces caching efficiency.
The LAW Size Influence
ALACC
[Flowchart: decisions "Increase FAA" and "Shrink LAW", each with two conditions joined by OR and a "Not satisfied" branch; actions "Shrink Cache", "Enlarge Cache", "Shrink FAA" and "Adjust LAW".]
Conditions for Increase FAA:
- The re-use distances of most duplicated data chunks are within the FAA range, OR
- The data chunks in the first FAB are mostly identified as unique data chunks, and these chunks are stored in the same or close containers
Conditions for Shrink LAW:
- The P-chunk number is very small, OR
- The number of F-chunks added during the restore cycle is very large
Experiment Setup
- Five Caching Designs:
– LRU-based container caching (Container_LRU)
– LRU-based chunk caching (Chunk_LRU)
– Forward assembly (FAA)
– Optimal configuration with fixed forward assembly and chunk-based caching (Fix_Opt)
– ALACC
- Four Traces:
– 2 FSL traces from FSL /home directory snapshots of the year 2014 [1]
– 2 EMC weekly full-backup traces [2]
[1] http://tracer.filesystems.org/
[2] Nohhyun Park and David J. Lilja. Characterizing datasets for data deduplication in backup applications. In Workload Characterization (IISWC), 2010 IEEE International Symposium on, pages 1–10. IEEE, 2010.
How We Get the Fix_Opt
The Fix_Opt configuration of each trace (FAA/chunk cache/LAW size, in container units)
[Figure: Restore Throughput (MB/S) plotted against FAA Size and LAW Size, with throughput bands from 12-13 to 17-18.]
We run all possible configurations for each trace to discover the optimal throughput. Note that tens of experiments are needed to find the optimal configuration of Fix_Opt, which is almost impossible to carry out in a real-world production scenario.
Chunk cache size = total memory size – FAA size
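The exhaustive search behind Fix_Opt can be sketched as a plain grid search. The function name and the `measure_throughput` callback are hypothetical; in practice the callback would be one full restore experiment per configuration, which is why the search is impractical in production. Sizes are in container units, and the chunk cache size follows the relation stated above.

```python
def find_fix_opt(total_memory, law_sizes, measure_throughput):
    """Return the (FAA, chunk cache, LAW) sizes with the highest throughput.

    total_memory: memory shared by FAA and chunk cache, in containers
    law_sizes: candidate LAW sizes, in containers
    measure_throughput: callable(faa, cache, law) -> MB/s (hypothetical experiment)
    """
    best_config, best_throughput = None, float("-inf")
    for faa in range(1, total_memory):     # keep at least one container for the cache
        cache = total_memory - faa         # chunk cache size = total memory - FAA size
        for law in law_sizes:
            throughput = measure_throughput(faa, cache, law)
            if throughput > best_throughput:
                best_config, best_throughput = (faa, cache, law), throughput
    return best_config


# Stand-in for real restore runs: a made-up surface peaking at FAA=4, LAW=56.
def fake_run(faa, cache, law):
    return 18.0 - (faa - 4) ** 2 - ((law - 56) / 8) ** 2


print(find_fix_opt(16, range(16, 97, 8), fake_run))  # (4, 12, 56)
```

Even this toy search needs 15 × 11 = 165 simulated runs, which illustrates why a fixed optimal configuration is costly to find for every trace.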
Restore Throughput
[Figure: Restore Throughput (MB/S) versus Version Number (1 to 5) for FSL_1 and FSL_2, comparing Container_LRU, Chunk_LRU, FAA, Fix_Opt and ALACC. Chart annotations: (4/12/56), (6/10/72); >100%, >10%, <15%.]
Restore Throughput (MB/S): original data stream size divided by the total restore time.
- 1. ACS stands for the Average Chunk Size.
- 2. DR stands for the Deduplication Ratio.
- 3. CFL stands for the Chunk Fragmentation Level.
Speed Factor
[Figure: Speed Factor versus Version Number (1 to 5) for FSL_1 and FSL_2, comparing Container_LRU, Chunk_LRU, FAA, Fix_Opt and ALACC. Chart annotations: (4/12/56), (6/10/72).]
Speed Factor (MB/container-read): the mean data size restored per container read.
Computing Cost Factor
[Figure: Computing Cost Factor versus Version Number (1 to 5) for FSL_1 and FSL_2, comparing Container_LRU, Chunk_LRU, FAA, Fix_Opt and ALACC. Chart annotations: (4/12/56), (6/10/72).]
Computing Cost Factor (second/GB): the time spent on computing operations (the restore time minus the storage I/O time) per GB of data restored.
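The three evaluation metrics follow directly from their definitions on the slides. The helper names below are hypothetical, and treating MB and GB as binary units (2^20 and 2^30 bytes) is an assumption; the sample numbers are made up for illustration and are not measurements.

```python
MB = 2 ** 20  # assumption: binary megabyte
GB = 2 ** 30  # assumption: binary gigabyte


def restore_throughput(stream_bytes, restore_seconds):
    """Original data stream size divided by the total restore time (MB/s)."""
    return stream_bytes / MB / restore_seconds


def speed_factor(stream_bytes, container_reads):
    """Mean data size restored per container read (MB/container-read)."""
    return stream_bytes / MB / container_reads


def computing_cost_factor(stream_bytes, restore_seconds, io_seconds):
    """Computing time (restore time minus storage I/O time) per GB restored."""
    return (restore_seconds - io_seconds) / (stream_bytes / GB)


# Made-up run: 2 GB restored in 100 s, 80 s of which is storage I/O, 1024 container reads.
size = 2 * GB
print(restore_throughput(size, 100))         # 20.48
print(speed_factor(size, 1024))              # 2.0
print(computing_cost_factor(size, 100, 80))  # 10.0
```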
Conclusions and Future Work
- Studied the effectiveness and the efficiency of different caching mechanisms.
- Designed an adaptive algorithm called ALACC, which adaptively adjusts the sizes of the FAA, chunk cache and LAW as the workload changes.
- In future work, duplicated data chunk rewriting and a multi-threaded implementation will be investigated and integrated with ALACC to further improve restore performance.
[Figure: Speed Factor versus Version Number (1 to 5) for EMC_1 and EMC_2, comparing Container_LRU, Chunk_LRU, FAA, Fix_Opt and ALACC.]
[Figure: Computing Cost Factor versus Version Number (1 to 5) for EMC_1 and EMC_2, comparing Container_LRU, Chunk_LRU, FAA, Fix_Opt and ALACC.]
[Figure: Restore Throughput (MB/S) versus Version Number (1 to 5) for EMC_1 and EMC_2, comparing Container_LRU, Chunk_LRU, FAA, Fix_Opt and ALACC.]