SLIDE 1 Emulating Goliath Storage Systems with David
Nitin Agrawal, NEC Labs
Leo Arulraj, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, ADSL Lab, UW Madison
SLIDE 2
The Storage Researchers’ Dilemma
Innovate: Create the future of storage
Measure: Quantify improvement obtained
Dilemma: How to measure future of storage with devices from present?
SLIDE 3
David: A Storage Emulator
Large, fast, multiple disks using small, slow, single device
Huge Disks: ~1 TB disk using 80 GB disk
Multiple Disks: RAID of multiple disks using RAM
SLIDE 4
Key Idea behind David
Store metadata, throw away data (and generate fake data)
Why is this OK?
- Benchmarks measure performance
- Many benchmarks don’t care about file content
- Some expect valid but not exact content
SLIDE 5
Outline Intro Overview Design Results Conclusion
SLIDE 6 Overview of how David works
[Diagram: Benchmark in userspace; Filesystem and DAVID (Pseudo Block Device Driver) in kernelspace; DAVID contains the Storage Model and Backing Store]
SLIDE 7
Illustrative Benchmark
- Create a file
- Write a block of data
- Close the file
- Open file in read mode
- Read back the data
- Close the file
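The six steps of the illustrative benchmark can be sketched in a few lines. This is a minimal sketch, not the benchmark used in the talk: the function name and the temporary directory are assumptions made for illustration, while the file name a.txt and the 4096-byte block size follow the fopen/fwrite calls shown on the later slides.

```python
import os
import tempfile

BLOCK_SIZE = 4096  # one block, matching the fwrite/fread size on later slides


def run_illustrative_benchmark(directory):
    """Create a file, write one block, close it, reopen it, read the block back."""
    path = os.path.join(directory, "a.txt")
    buffer = b"x" * BLOCK_SIZE

    with open(path, "wb") as f:      # create the file
        f.write(buffer)              # write a block of data
    # leaving the with-block closes the file

    with open(path, "rb") as f:      # open the file in read mode
        data = f.read(BLOCK_SIZE)    # read back the data
    # leaving the with-block closes the file again

    return len(data)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(run_illustrative_benchmark(d))
```

Note that the sketch only looks at how many bytes came back, not at their content, which is the kind of benchmark that tolerates discarded data.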
SLIDE 8 Benchmark Filesystem
F = fopen("a.txt", "w");
Allocate Inode in block 100
Storage Model Backing Store
How does David handle metadata write?
SLIDE 9 Benchmark Filesystem
100 Inode block LBA : 100
Storage Model Backing Store
How does David handle metadata write?
SLIDE 10 Benchmark Filesystem
100 100
Storage Model Backing Store
How does David handle metadata write?
SLIDE 11 Benchmark Filesystem
100 1
Model calculates response time for write to LBA 100
Metadata block at LBA 100 is remapped to LBA 1
Storage Model Backing Store
Remap Table
100 1
How does David handle metadata write?
SLIDE 12 Benchmark Filesystem
100 1 Response to FS after 6 ms
Storage Model Backing Store
Remap Table
100 1
How does David handle metadata write?
SLIDE 13 Benchmark Filesystem
fwrite(buffer, 4096,1,F); 800 Data block LBA : 800
Storage Model
1
Backing Store
Remap Table
100 1
How does David handle data write?
SLIDE 14 Benchmark Filesystem
800 800
Storage Model
1
Backing Store
Remap Table
100 1
How does David handle data write?
SLIDE 15
Model calculates response time for write to LBA 800
Data block at LBA 800 is THROWN AWAY
800
Benchmark Filesystem
Storage Model Backing Store
1 Remap Table
100 1
How does David handle data write?
SLIDE 16
Response to FS after 8 ms
800
Space Savings: 50%
Benchmark Filesystem
Storage Model Backing Store
1 Remap Table
100 1
How does David handle data write?
SLIDE 17 Benchmark Filesystem
fclose(F); F = fopen("a.txt", "r");
Storage Model Backing Store
1 Remap Table
100 1
How does David handle metadata read?
SLIDE 18 Benchmark Filesystem
100 Inode block LBA : 100
Storage Model Backing Store
1 Remap Table
100 1
How does David handle metadata read?
SLIDE 19 Benchmark Filesystem
100 100
Storage Model Backing Store
1 Remap Table
100 1
How does David handle metadata read?
SLIDE 20 1
Model calculates response time for read to LBA 100
Block at LBA 1 is read and returned
100 1
Benchmark Filesystem
Storage Model Backing Store
Remap Table
100 1
How does David handle metadata read?
SLIDE 21 Benchmark Filesystem
100 1 Response to FS after 3 ms 100 1
Storage Model Backing Store
Remap Table
100 1
How does David handle metadata read?
SLIDE 22 Benchmark Filesystem
fread(buffer, 4096,1,F); 800 Data block LBA : 800
Storage Model
1
Backing Store
Remap Table
100 1
How does David handle data read?
SLIDE 23 Benchmark Filesystem
800 800
Storage Model Backing Store
1 Remap Table
100 1
How does David handle data read?
SLIDE 24
Model calculates response time for read to LBA 800
Data block at LBA 800 is filled with fake content
800 800
Benchmark Filesystem
Storage Model Backing Store
1 Remap Table
100 1
How does David handle data read?
SLIDE 25 Benchmark Filesystem
Response to FS after 8 ms 800
Storage Model Backing Store
1 Remap Table
100 1
How does David handle data read?
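The four cases walked through on slides 8 to 25 (metadata write, data write, metadata read, data read) can be summarized in one toy model. This is a sketch under stated assumptions, not David's implementation: the class ToyDavid, its method names, and the two callbacks (a block classifier and a storage model that returns a latency in milliseconds) are hypothetical. The LBAs and response times in the usage example are the ones shown on the slides.

```python
import os

BLOCK_SIZE = 4096


class ToyDavid:
    """Toy block-level emulator: keeps metadata, drops data, fakes data on reads."""

    def __init__(self, is_metadata, model_latency_ms):
        self.is_metadata = is_metadata            # hypothetical classifier: lba -> bool
        self.model_latency_ms = model_latency_ms  # hypothetical model: (op, lba) -> ms
        self.remap_table = {}                     # emulated LBA -> backing-store LBA
        self.backing_store = {}                   # backing-store LBA -> block contents
        self.next_free = 1                        # next unused backing-store LBA
        self.written_lbas = set()                 # every emulated LBA written so far

    def write(self, lba, block):
        latency = self.model_latency_ms("write", lba)
        self.written_lbas.add(lba)
        if self.is_metadata(lba):
            # Metadata is kept: remap it to a compact slot in the backing store.
            if lba not in self.remap_table:
                self.remap_table[lba] = self.next_free
                self.next_free += 1
            self.backing_store[self.remap_table[lba]] = block
        # Data blocks are thrown away: nothing is stored for them.
        return latency

    def read(self, lba):
        latency = self.model_latency_ms("read", lba)
        if lba in self.remap_table:
            block = self.backing_store[self.remap_table[lba]]
        else:
            block = os.urandom(BLOCK_SIZE)        # fake content for squashed data
        return block, latency

    def space_savings(self):
        """Fraction of written blocks that occupy no space in the backing store."""
        if not self.written_lbas:
            return 0.0
        return 1.0 - len(self.backing_store) / len(self.written_lbas)


if __name__ == "__main__":
    latencies = {("write", 100): 6, ("write", 800): 8,
                 ("read", 100): 3, ("read", 800): 8}
    david = ToyDavid(is_metadata=lambda lba: lba == 100,
                     model_latency_ms=lambda op, lba: latencies[(op, lba)])
    print(david.write(100, b"i" * BLOCK_SIZE))   # inode block, remapped to LBA 1
    print(david.write(800, b"d" * BLOCK_SIZE))   # data block, thrown away
    print(david.remap_table, david.space_savings())
```

In the real system the response is also delayed by the modeled time before it is returned to the filesystem; the sketch only returns the number.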
SLIDE 26
Outline Intro Overview Design Results Conclusion
SLIDE 27
Design Goals for David
- Accurate: Emulated disk should perform similarly to the real disk
- Scalable: Should be able to emulate large disks
- Lightweight: Emulation overhead should not affect accuracy
- Flexible: Should be able to emulate a variety of storage disks
- Adoptable: Easy to install and use for benchmarking
SLIDE 28 Components within David
Storage Model Block Classifier
Metadata Remapper Data Squasher Data Generator
Backing Store
SLIDE 29 Block Classification
Data or Metadata? Distinguish data blocks from metadata blocks to throw away data blocks
Why difficult? David is a block-level emulator
Two Approaches
- Implicit Block Classification (David automatically infers block classification)
- Explicit Block Classification (Operating System passes down block classification)
SLIDE 30 Implicit Block Classification
Parse metadata writes using filesystem knowledge to infer data blocks
Implementation for ext3
- Identify inode blocks using ext3 block layout
- Parse inode blocks to infer direct/indirect blocks
- Parse direct/indirect blocks to infer data blocks
Problem: Delay in classification
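The three inference steps listed for ext3 can be illustrated with a toy classifier. This is a sketch only: ToyInode and ImplicitClassifier are hypothetical names, the payloads are assumed to be already parsed, and nothing here reflects the real ext3 on-disk format. It also shows where the delay comes from: a block stays unclassified until the metadata that points to it has been seen.

```python
from dataclasses import dataclass, field


@dataclass
class ToyInode:
    direct: list = field(default_factory=list)    # LBAs of data blocks
    indirect: list = field(default_factory=list)  # LBAs of indirect (pointer) blocks


class ImplicitClassifier:
    def __init__(self, inode_table_lbas):
        self.inode_table_lbas = set(inode_table_lbas)  # known from the block layout
        self.indirect_lbas = set()                     # inferred from inodes
        self.data_lbas = set()                         # inferred from inodes/indirects

    def observe_write(self, lba, payload):
        """Update inferred classifications from one (already parsed) write."""
        if lba in self.inode_table_lbas:
            for inode in payload:                      # payload: list of ToyInode
                self.data_lbas.update(inode.direct)
                self.indirect_lbas.update(inode.indirect)
        elif lba in self.indirect_lbas:
            self.data_lbas.update(payload)             # payload: list of data LBAs

    def classify(self, lba):
        if lba in self.inode_table_lbas or lba in self.indirect_lbas:
            return "metadata"
        if lba in self.data_lbas:
            return "data"
        return "unclassified"                          # must wait for more metadata


if __name__ == "__main__":
    c = ImplicitClassifier(inode_table_lbas=[100])
    print(c.classify(800))                             # not yet known
    c.observe_write(100, [ToyInode(direct=[800], indirect=[900])])
    print(c.classify(800), c.classify(900))
```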
SLIDE 31
Ext3 Ordered Journaling Mode (without David)
Journal Disk
M D
SLIDE 32
Ext3 Ordered Journaling Mode (with David)
Journal Disk Unclassified Block Store
SLIDE 33
Memory Pressure in Unclassified Block Store
Too many unclassified blocks exhaust memory
Technique: Journal Snooping
Parse metadata writes to journal to infer classification much earlier than usual
SLIDE 34 Effect of Journal Snooping
[Plot: Memory Used (MB) over Time (seconds); series: Without Journal Snooping, With Journal Snooping; annotation: Out of Memory]
SLIDE 35 Block Classification
Data or Metadata? Distinguish data blocks from metadata blocks to throw away data blocks
Why difficult? David is a block-level emulator
Two Approaches
- Implicit Block Classification (David automatically infers block classification)
- Explicit Block Classification (Operating System passes down block classification)
SLIDE 36 Explicit Block Classification
Capture page pointers to data blocks in the write system call and pass classification information to David
[Diagram: Benchmark Application, FileSystem; Data Blocks and Metadata Blocks passed to David]
SLIDE 37 Block Classification Summary
Implicit Block Classification      | Explicit Block Classification
No change to filesystem, benchmark | Minimal change to
Requires filesystem knowledge      | Works for all filesystems
Results with ext3                  | Results with btrfs
SLIDE 38 Components within David
Storage Model Block Classifier
Metadata Remapper Data Squasher Data Generator
Backing Store
SLIDE 39 David’s Storage Model
[Diagram: Actual System (Benchmark, Filesystem, Disk) beside Emulated System (Benchmark, Filesystem, David); David's Storage Model includes the I/O request queue]
SLIDE 40 I/O Queue Model
Merge sequential I/O requests
When I/O queue is empty
- Wait for 3 ms anticipating merges
When I/O queue is full
- Process is made to sleep and wait
- Process is woken up once empty slots open up
- Process is given a bonus for the wait period
I/O queue modeling critical for accuracy
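The first item of the queue model, merging sequential I/O requests, can be sketched as follows. This is an illustration under assumptions, not David's queue code: the request tuple layout (op, start_lba, num_blocks) and the function name are hypothetical, and the sketch does not model the 3 ms anticipatory wait or the sleep and wake-up behavior when the queue is full.

```python
def merge_sequential(requests):
    """Merge neighboring requests that are contiguous on disk and of the same type.

    Each request is (op, start_lba, num_blocks). The list is assumed to be in
    arrival order, and only adjacent entries in that order are merged.
    """
    merged = []
    for op, start, count in requests:
        if merged:
            last_op, last_start, last_count = merged[-1]
            if op == last_op and start == last_start + last_count:
                # The new request begins where the previous one ends: extend it.
                merged[-1] = (last_op, last_start, last_count + count)
                continue
        merged.append((op, start, count))
    return merged


if __name__ == "__main__":
    queue = [("w", 100, 8), ("w", 108, 8), ("r", 116, 8), ("w", 200, 4), ("w", 204, 4)]
    print(merge_sequential(queue))
```

Merging changes how many requests the disk model sees, which is one reason the queue has to be modeled for the emulated timings to match a real system.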
SLIDE 41 Disk Model
Simple in-kernel disk model
- Based on Ruemmler and Wilkes disk model
- Current models: 80 GB and 1 TB Hitachi Deskstar
- Focus of our work is not disk modeling
(more accurate models are possible)
Disk model parameters
Rotational speed, head seek profile, etc.
Head position, on-disk cache state, etc.
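A disk model of this kind combines fixed parameters (rotational speed, seek profile) with mutable state (head position). The sketch below is a deliberately simplified illustration, not David's in-kernel model: the class ToyDisk is hypothetical, every numeric value is a placeholder rather than a measured parameter of the Hitachi drives, the two-regime seek curve (square root for short seeks, linear for long ones) is used only as a common simplification, and the on-disk cache is not modeled at all.

```python
import math
from dataclasses import dataclass


@dataclass
class ToyDisk:
    # All numeric values are hypothetical placeholders, not measured drive data.
    rpm: float = 7200.0               # rotational speed
    short_seek_limit: int = 400       # cylinders; boundary between seek regimes
    transfer_mb_per_s: float = 60.0   # sustained media transfer rate
    head_cylinder: int = 0            # state: current head position

    def seek_ms(self, distance):
        """Seek profile: square-root curve for short seeks, linear for long ones."""
        if distance == 0:
            return 0.0
        if distance < self.short_seek_limit:
            return 1.0 + 0.2 * math.sqrt(distance)
        return 5.0 + 0.0001 * distance

    def service_ms(self, cylinder, num_bytes):
        """Return the modeled response time and move the head to `cylinder`."""
        seek = self.seek_ms(abs(cylinder - self.head_cylinder))
        rotation = 0.5 * 60_000.0 / self.rpm       # average: half a revolution
        transfer = num_bytes / (self.transfer_mb_per_s * 1e6) * 1000.0
        self.head_cylinder = cylinder              # update state for the next request
        return seek + rotation + transfer


if __name__ == "__main__":
    disk = ToyDisk()
    print(disk.service_ms(0, 4096))        # no seek needed
    print(disk.service_ms(50_000, 4096))   # long seek
```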
SLIDE 42
David’s Storage Model Accuracy
Reasonable accuracy across many workloads
Many more results in paper
SLIDE 43 Components within David
Storage Model Block Classifier
Metadata Remapper Data Squasher Data Generator
Backing Store
SLIDE 44 Backing Store
Any physical storage can be used
- Must be large enough to hold all metadata blocks
- Must be fast enough to match emulated disk
Two implementations
- Memory as backing store
- Compressed disk as backing store
Storage space for metadata blocks
SLIDE 45 Metadata Remapper
Remaps metadata blocks into compressed form
Inode Data Inode Data Inode Data Inode Inode Inode
Emulated Disk Compressed Disk (better performance)
SLIDE 46 Components within David
Storage Model Block Classifier
Metadata Remapper Data Squasher Data Generator
Backing Store
SLIDE 47
Data Squasher and Generator
Data Squasher
- Throws away writes to data blocks
Data Generator
- Generates content for reads of data blocks (currently generates random content)
SLIDE 48
Outline Intro Overview Design Results Conclusion
SLIDE 49 Experiments
Emulation accuracy
Test emulation accuracy across benchmarks
Emulation scalability
Test space savings for large device emulation
Multiple disk emulation
Test accuracy of multiple device emulation
SLIDE 50
Emulation Accuracy Experiment
Experimental details
- Emulated ~1 TB disk with 80 GB disk
- Ran a variety of benchmarks
- Validated by using a real 1 TB disk
SLIDE 51 Emulation Accuracy Results
(Ext3 with Implicit Block Classification)
[Bar chart: Runtime (seconds), Real vs. Emulated]
SLIDE 52 Emulation Accuracy Results
(Btrfs with Explicit Block Classification)
[Bar chart: Runtime (seconds), Real vs. Emulated]
SLIDE 53
Emulation Scale Experiment
Experimental details
- Emulated ~1 TB disk using an 80 GB disk
- Created filesystem images using Impressions
- Validated by using a real disk
SLIDE 54
Emulation Scale: Accuracy
SLIDE 55
Emulation Scale: Space Savings
SLIDE 56
Multiple Disks Experiment
Experimental details
- Emulated multiple disks using RAM
- Measured micro-benchmark performance on RAID-1
- Validated our results against real disks
SLIDE 57 Simple RAID-1 Emulation
[Bar chart: Random Read or Write Performance; Runtime (seconds) for R/3, W/3, R/2, W/2, R/1, W/1; Original vs. David]
SLIDE 58
Outline Intro Overview Design Results Conclusion
SLIDE 59
Conclusion
David: Emulate large devices with limited means
Key idea: Throw away data
Results: Accurate emulation of large and multiple disks
Future: Emulating storage cluster with few machines
SLIDE 60
Thank You
www.cs.wisc.edu/adsl
SLIDE 61
Questions?
SLIDE 62 Measuring Innovation
Thorough measurement is Hard and Costly
Time, Money, Effort needed to measure performance on a variety
Tiny benchmarks are easy to run
SLIDE 63 Implicit Block Classification
Unclassified block store
- Unclassifiable blocks are temporarily stored in the Unclassified Block Store, which is in RAM
- Journal checkpoint frequency determines the delay in classification
- Upon classification, data blocks are squashed and metadata blocks are persisted
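The behavior of the unclassified block store can be sketched as a small in-memory holding area. This is a sketch under assumptions, not David's code: the class and method names are hypothetical, and the classification decision is passed in from outside. It makes the memory-pressure point concrete: memory use equals the size of the blocks still waiting, so classifying earlier (as journal snooping does) keeps the store small.

```python
class UnclassifiedBlockStore:
    """Holds blocks in RAM until they can be classified as data or metadata."""

    def __init__(self):
        self.pending = {}     # lba -> block contents, waiting for classification
        self.persisted = {}   # metadata blocks handed on to the backing store
        self.squashed = 0     # number of data blocks thrown away

    def hold(self, lba, block):
        self.pending[lba] = block

    def classify(self, lba, kind):
        """Resolve one pending block; `kind` is "data" or "metadata"."""
        block = self.pending.pop(lba, None)
        if block is None:
            return                       # nothing was waiting at this LBA
        if kind == "metadata":
            self.persisted[lba] = block  # metadata blocks are persisted
        else:
            self.squashed += 1           # data blocks are squashed

    def memory_bytes(self):
        return sum(len(b) for b in self.pending.values())


if __name__ == "__main__":
    store = UnclassifiedBlockStore()
    store.hold(100, b"i" * 4096)
    store.hold(800, b"d" * 4096)
    print(store.memory_bytes())
    store.classify(800, "data")
    store.classify(100, "metadata")
    print(store.memory_bytes(), store.squashed, sorted(store.persisted))
```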