SLIDE 1

Austere Flash Caching with Deduplication and Compression

Qiuping Wang*, Jinhong Li*, Wen Xia#, Erik Kruus^, Biplob Debnath^, Patrick P. C. Lee*

*The Chinese University of Hong Kong (CUHK) #Harbin Institute of Technology, Shenzhen ^NEC Labs

SLIDE 2

Flash Caching

• Flash-based solid-state drives (SSDs)

  ✓ Faster than hard disk drives (HDDs)
  ✓ Better reliability
  ✗ Limited capacity and endurance

• Flash caching

  • Accelerate HDD storage by caching frequently accessed blocks in flash

SLIDE 3

Deduplication and Compression

• Reduce storage and I/O overheads

• Deduplication (coarse-grained)

  • In units of chunks (fixed- or variable-size)
  • Compute fingerprint (e.g., SHA-1) from chunk content
  • Reference logical chunks with identical FPs to a single physical copy (see the sketch after this list)

• Compression (fine-grained)

  • In units of bytes
  • Transform chunks into fewer bytes
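To make the deduplication step concrete, below is a minimal sketch (not the AustereCache code) of fingerprint-based duplicate detection on fixed-size chunks. std::hash stands in for a cryptographic fingerprint such as SHA-1, the 32 KiB chunk size matches the example used later in this talk, and all names are illustrative; compression is omitted.

#include <cstdint>
#include <string>
#include <unordered_map>

// Stand-in types; a real cache would use a 20-byte SHA-1 digest as the FP.
using Fingerprint  = std::size_t;
using CacheAddress = std::uint64_t;

constexpr std::size_t kChunkSize = 32 * 1024;   // fixed-size chunks (32 KiB)

// Compute the chunk fingerprint from its content (std::hash stands in for SHA-1).
Fingerprint fingerprint(const std::string& chunk) {
  return std::hash<std::string>{}(chunk);
}

// FP-index: maps a fingerprint to the single physical copy in the cache.
std::unordered_map<Fingerprint, CacheAddress> fp_to_ca;

// Write path: identical chunks (same FP) are referenced to one physical copy.
CacheAddress dedup_write(const std::string& chunk, CacheAddress next_free_ca) {
  Fingerprint fp = fingerprint(chunk);
  auto it = fp_to_ca.find(fp);
  if (it != fp_to_ca.end())
    return it->second;                 // duplicate: reuse the existing copy
  fp_to_ca.emplace(fp, next_free_ca);  // new content: store one physical copy
  return next_free_ca;
}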

SLIDE 4

Deduplicated and Compressed Flash Cache

• LBA: chunk address in the HDD; FP: chunk fingerprint

• CA: chunk address in the flash cache (after deduplication and compression); see the sketch below


[Architecture diagram: read/write requests are chunked into fixed-size chunks, then deduplicated and compressed; the SSD stores the resulting variable-size compressed chunks, the HDD holds the primary storage, and RAM keeps the LBA-index (LBA → FP), the FP-index (FP → CA, length), and a dirty list.]
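As a rough illustration of the in-memory indexes this design implies (a sketch only, not the actual data structures; hash-map overheads are ignored and all names are invented), the two mappings might look like this:

#include <array>
#include <cstdint>
#include <unordered_map>

using Lba = std::uint64_t;                 // chunk address in HDD (8 B)
using Ca  = std::uint64_t;                 // chunk address in flash (8 B)
using Fp  = std::array<std::uint8_t, 20>;  // SHA-1 fingerprint (20 B)

// Hash functor so Fp can key an unordered_map (illustrative only).
struct FpHash {
  std::size_t operator()(const Fp& fp) const {
    std::size_t h = 14695981039346656037ull;             // FNV-1a
    for (auto b : fp) { h ^= b; h *= 1099511628211ull; }
    return h;
  }
};

struct FpEntry { Ca ca; std::uint32_t length; };          // CA (8 B) + length (4 B)

std::unordered_map<Lba, Fp> lba_index;                    // LBA-index: LBA -> FP
std::unordered_map<Fp, FpEntry, FpHash> fp_index;         // FP-index: FP -> (CA, length)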

SLIDE 5

Memory Amplification for Indexing

• Example: 512-GiB flash cache with a 4-TiB HDD working set

• Conventional flash cache

  • Memory overhead: 256 MiB

• Deduplicated and compressed flash cache

  • LBA-index: 3.5 GiB
  • FP-index: 512 MiB
  • Memory amplification: 16x
  • Can be even higher in practice (see the worked arithmetic below)


Per-entry formats: conventional: LBA (8 B) → CA (8 B); LBA-index: LBA (8 B) → FP (20 B); FP-index: FP (20 B) → CA (8 B) + length (4 B)
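The 16x figure follows from straightforward arithmetic, assuming the 32 KiB chunk size used in the examples later in this talk (a back-of-the-envelope check, not a quote from the paper):

\[
\begin{aligned}
\text{Conventional:}\quad & \frac{512\ \mathrm{GiB}}{32\ \mathrm{KiB}} \times (8+8)\ \mathrm{B} = 16\,\mathrm{Mi} \times 16\ \mathrm{B} = 256\ \mathrm{MiB}\\
\text{LBA-index:}\quad & \frac{4\ \mathrm{TiB}}{32\ \mathrm{KiB}} \times (8+20)\ \mathrm{B} = 128\,\mathrm{Mi} \times 28\ \mathrm{B} = 3.5\ \mathrm{GiB}\\
\text{FP-index:}\quad & \frac{512\ \mathrm{GiB}}{32\ \mathrm{KiB}} \times (20+8+4)\ \mathrm{B} = 16\,\mathrm{Mi} \times 32\ \mathrm{B} = 512\ \mathrm{MiB}\\
\text{Amplification:}\quad & \frac{3.5\ \mathrm{GiB} + 512\ \mathrm{MiB}}{256\ \mathrm{MiB}} = 16\times
\end{aligned}
\]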

SLIDE 6

Related Work

• Nitro [Li et al., ATC’14]

  • First work to study deduplication and compression in flash caching
  • Manage compressed data in Write-Evict Units (WEUs)

• CacheDedup [Li et al., FAST’16]

  • Propose dedup-aware algorithms for flash caching to improve hit ratios


They both suffer from memory amplification!

SLIDE 7

Our Contribution


• AustereCache: a deduplicated and compressed flash cache with austere, memory-efficient management

  • Bucketization
    • No overhead for address mappings
    • Hash chunks to storage locations
  • Fixed-size compressed data management
    • No tracking of compressed chunk lengths in memory
  • Bucket-based cache replacement
    • Cache replacement performed per bucket
    • Count-Min Sketch [Cormode 2005] for low-memory reference counting

• Extensive trace-driven evaluation and prototype experiments

SLIDE 8

Bucketization

• Main idea

  • Use hashing to partition the index and cache space
    • (RAM) LBA-index and FP-index
    • (SSD) metadata region and data region
  • Store partial keys (hash prefixes) in memory
    • Memory savings

• Layout

  • Hash entries into equal-sized buckets
  • Each bucket has a fixed number of slots


[Figure: bucket layout, with hashed entries mapped to fixed-size slots.]

SLIDE 9

(RAM) LBA-index and FP-index

• (RAM) LBA-index and FP-index

  • Locate buckets with hash suffixes
  • Match slots with hash prefixes (see the sketch below)
  • Each slot in the FP-index corresponds to a storage location in flash


[Figure: buckets of slots; each LBA-index slot holds an LBA-hash prefix, an FP-hash, and a flag, while each FP-index slot holds an FP-hash prefix and a flag.]
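The following is a minimal sketch of the prefix/suffix lookup, under assumed parameters (not the paper's actual bucket or slot counts): the low-order bits of a key's hash select the bucket, only the high-order bits (the prefix) are kept in the in-memory slot, and a prefix match must still be validated against the full FP or LBA stored in the on-flash metadata region (next slide).

#include <cstdint>

// Illustrative sizes; AustereCache's actual bucket and slot counts differ.
constexpr std::uint32_t kBucketBits     = 16;              // 2^16 buckets
constexpr std::uint32_t kNumBuckets     = 1u << kBucketBits;
constexpr int           kSlotsPerBucket = 32;

struct Slot {
  std::uint16_t key_prefix = 0;   // partial key (hash prefix) kept in RAM
  bool          valid      = false;
};

struct Bucket {
  Slot slots[kSlotsPerBucket];
};

// Locate the bucket with the hash suffix (low-order bits).
std::uint32_t bucket_of(std::uint64_t key_hash) {
  return static_cast<std::uint32_t>(key_hash) & (kNumBuckets - 1);
}

// Keep only the hash prefix (high-order bits) in the in-memory slot.
std::uint16_t prefix_of(std::uint64_t key_hash) {
  return static_cast<std::uint16_t>(key_hash >> 48);
}

// A prefix match is only a candidate: the full key on flash resolves collisions.
int find_candidate_slot(const Bucket& b, std::uint64_t key_hash) {
  const std::uint16_t prefix = prefix_of(key_hash);
  for (int i = 0; i < kSlotsPerBucket; ++i)
    if (b.slots[i].valid && b.slots[i].key_prefix == prefix)
      return i;                   // read the on-flash metadata to verify
  return -1;                      // definite miss
}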

SLIDE 10

(SSD) Metadata and Data Regions

• (SSD) Metadata region and data region

  • Each slot has the full FP and a list of full LBAs in the metadata region
    • For validation against prefix collisions
  • Cached chunks in the data region


[Figure: on-flash metadata region (per slot: full FP and list of LBAs) and data region (cached chunks), both organized as buckets of slots.]

SLIDE 11

Fixed-size Compressed Data Management

• Main idea

  • Slice and pad a compressed chunk into fixed-size subchunks (see the sketch below)

• Advantages

  • Compatible with bucketization
    • Store each subchunk in one slot
  • Allows per-chunk management for cache replacement


Example: a 32 KiB chunk compresses to 20 KiB and is then sliced and padded into 8 KiB subchunks.
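A minimal sketch of slicing and padding, assuming 8 KiB subchunks as in the example above (the compression call itself is omitted and the function name is invented):

#include <cstddef>
#include <string>
#include <vector>

constexpr std::size_t kSubchunkSize = 8 * 1024;   // fixed-size subchunks (8 KiB)

// Slice a compressed chunk into fixed-size subchunks and zero-pad the last
// one, so that every subchunk occupies exactly one slot in the data region.
std::vector<std::string> slice_and_pad(const std::string& compressed) {
  std::vector<std::string> subchunks;
  for (std::size_t off = 0; off < compressed.size(); off += kSubchunkSize) {
    std::string sub = compressed.substr(off, kSubchunkSize);
    sub.resize(kSubchunkSize, '\0');              // pads only the tail subchunk
    subchunks.push_back(std::move(sub));
  }
  return subchunks;
}

// Example: a 20 KiB compressed chunk yields three 8 KiB subchunks (24 KiB
// stored); the exact compressed length lives in the on-flash metadata, so no
// per-chunk length needs to be tracked in memory.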

SLIDE 12

Fixed-size Compressed Data Management

• Layout

  • One chunk occupies multiple consecutive slots
  • No additional memory for compressed length


[Figure: an in-RAM FP-index slot (FP-hash prefix, flag) maps to consecutive on-SSD slots; the metadata region records the full FP, the list of LBAs, and the compressed length, while the data region holds the chunk's subchunks.]

SLIDE 13

Bucket-based Cache Replacement

• Main idea

  • Cache replacement in each bucket independently
  • Eliminate priority-based structures for cache decisions


[Figure: slots in an LBA-index bucket ordered from old to recent; FP-index slots carry reference counters (e.g., 2 and 3) derived from the referencing LBAs.]

  • Combine recency and deduplication
    • LBA-index: least-recently-used policy
    • FP-index: least-referenced policy
    • Weighted reference counting based on the recency of referencing LBAs (see the sketch below)
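A minimal sketch of per-bucket eviction on the FP-index side (simplified: the real policy weights each reference by the recency of the referencing LBAs, and the counts would come from the Count-Min Sketch described on the next slide):

#include <cstdint>
#include <vector>

struct FpSlot {
  std::uint32_t ref_count = 0;   // (weighted) reference count of this chunk
  bool          valid     = false;
};

// Choose a victim inside one bucket only: no global LRU list or priority
// queue is needed, because replacement never looks beyond the bucket.
int pick_victim(const std::vector<FpSlot>& bucket) {
  int victim = -1;
  std::uint32_t min_refs = UINT32_MAX;
  for (int i = 0; i < static_cast<int>(bucket.size()); ++i) {
    if (!bucket[i].valid)
      return i;                              // free slot: nothing to evict
    if (bucket[i].ref_count < min_refs) {
      min_refs = bucket[i].ref_count;
      victim   = i;
    }
  }
  return victim;                             // least-referenced slot in the bucket
}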

SLIDE 14

Sketch-based Reference Counting

• High memory overhead for complete reference counting

  • One counter for every FP-hash

• Count-Min Sketch [Cormode 2005]

  • Fixed memory usage with provable error bounds (see the sketch below)


[Figure: an h × w array of counters; an update to an FP-hash increments one counter per row (+1), and count = the minimum counter indexed by (i, H_i(FP-hash)) over all rows i.]
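For illustration, a compact Count-Min Sketch along the lines described above (dimensions and hash mixing are simplified; see Cormode 2005 for the error bounds):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// h rows of w counters; each update increments one counter per row, and a
// query returns the row-wise minimum, which overestimates the true count by
// a provably bounded amount with high probability.
class CountMinSketch {
 public:
  CountMinSketch(std::size_t h, std::size_t w)
      : w_(w), rows_(h, std::vector<std::uint32_t>(w, 0)) {}

  void add(std::uint64_t fp_hash) {
    for (std::size_t i = 0; i < rows_.size(); ++i)
      ++rows_[i][index(fp_hash, i)];
  }

  std::uint32_t count(std::uint64_t fp_hash) const {
    std::uint32_t c = UINT32_MAX;
    for (std::size_t i = 0; i < rows_.size(); ++i)
      c = std::min(c, rows_[i][index(fp_hash, i)]);
    return c;
  }

 private:
  // Cheap per-row mixing as a stand-in for h independent hash functions.
  std::size_t index(std::uint64_t key, std::size_t row) const {
    std::uint64_t x = key + 0x9e3779b97f4a7c15ull * (row + 1);
    x ^= x >> 33; x *= 0xff51afd7ed558ccdull; x ^= x >> 33;
    return static_cast<std::size_t>(x % w_);
  }

  std::size_t w_;
  std::vector<std::vector<std::uint32_t>> rows_;
};

// Usage: CountMinSketch sketch(4, 1 << 16); sketch.add(fp_hash); sketch.count(fp_hash);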

SLIDE 15

Evaluation

• Implement AustereCache as a user-space block device

  • ~4.5K lines of C++ code in Linux

• Traces

  • FIU traces: WebVM, Homes, Mail
  • Synthetic traces: varying I/O dedup ratio and write-to-read ratio
    • I/O dedup ratio: fraction of duplicate written chunks among all written chunks

• Schemes

  • AustereCache: AC-D, AC-DC
  • CacheDedup: CD-LRU-D, CD-ARC-D, CD-ARC-DC

SLIDE 16

Memory Overhead


• AC-D uses 69.9-94.9% and 70.4-94.7% less memory than CD-LRU-D and CD-ARC-D, respectively, across all traces

• AC-DC uses 87.0-97.0% less memory than CD-ARC-DC

[Figure: memory usage (MiB, log scale) vs. cache capacity (12.5-100%) for (a) WebVM, (b) Homes, and (c) Mail, comparing AC-D, AC-DC, CD-LRU-D, CD-ARC-D, and CD-ARC-DC.]

SLIDE 17

Read Hit Ratios


• AC-D has up to 39.2% higher read hit ratio than CD-LRU-D, and a similar read hit ratio to CD-ARC-D

• AC-DC has up to 30.7% higher read hit ratio than CD-ARC-DC

[Figure: read hit ratio (%) vs. cache capacity (12.5-100%) for (a) WebVM, (b) Homes, and (c) Mail, comparing AC-D, AC-DC, CD-LRU-D, CD-ARC-D, and CD-ARC-DC.]

SLIDE 18

Write Reduction Ratios


• AC-D is comparable to CD-LRU-D and CD-ARC-D

• AC-DC is slightly lower (by 7.7-14.5%) than CD-ARC-DC

  • Due to padding in compressed data management

[Figure: write reduction ratio (%) vs. cache capacity (12.5-100%) for (a) WebVM, (b) Homes, and (c) Mail, comparing AC-D, AC-DC, CD-LRU-D, CD-ARC-D, and CD-ARC-DC.]

SLIDE 19

Throughput


• AC-DC has the highest throughput

  • Due to high write reduction ratio and high read hit ratio

• AC-D has slightly lower throughput than CD-ARC-D

  • AC-D needs to access the metadata region during indexing

[Figure: throughput (MiB/s) of AC-D, AC-DC, CD-LRU-D, CD-ARC-D, and CD-ARC-DC: (a) vs. I/O dedup ratio (write-to-read ratio 7:3); (b) vs. write-to-read ratio (I/O dedup ratio 50%).]

SLIDE 20

CPU Overhead and Multi-threading


• Latency (32 KiB chunk write)

  • HDD (5,997 µs) and SSD (85 µs)
  • AustereCache (31.2 µs) (fingerprinting 15.5 µs)
  • Latency hidden via multi-threading

• Multi-threading (write-to-read ratio 7:3)

  • 50% I/O dedup ratio: 2.08X
  • 80% I/O dedup ratio: 2.51X
  • Higher I/O dedup ratio implies less I/O to flash → more computation savings via multi-threading

[Figures: (left) latency breakdown (µs) across fingerprinting, compression, lookup, update, SSD, and HDD; (right) throughput (MiB/s) vs. number of threads (1-8) at 50% and 80% dedup.]

SLIDE 21

Conclusion


• AustereCache: memory efficiency in deduplicated and compressed flash caching via

  • Bucketization
  • Fixed-size compressed data management
  • Bucket-based cache replacement

• Source code: http://adslab.cse.cuhk.edu.hk/software/austerecache

SLIDE 22

Thank You! Q & A
