Generating Realistic Impressions for File-System Benchmarking
Nitin Agrawal, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
“For better or for worse, benchmarks shape a field”
David Patterson
Inputs to file-system benchmarking
Application
- Input: Benchmark workload (Postmark, FileBench, Fstress, Bonnie, IOZone, TPCC, etc.)
File System
- Input: File-system image (FS logical organization, disk layout): anything goes!
- Input: In-memory state (cold cache/warm cache)
Storage device
FS images in past: use what is convenient
- Randomly generated files of several MB (FAST 08)
- 1000 files in 10 dirs w/ random data (SOSP 03)
- 188GB and 129GB volumes in Engg dept (OSDI 99)
- 5-deep tree, 5 subdirs, 10 8KB files in each (FAST 04)
- 10702 files from /usr/local, size 354MB (SOSP 01)
- Typical desktop file system w/ no description (SOSP 05)
- 1641 files, 109 dirs, 13.4 MB total size (OSDI 02)
Performance of find operation
[Figure: relative time taken by find, varying disk layout & cache state and file-system logical organization]
Problem scope
- Characteristics of file-system images have a strong impact on performance
- We need to incorporate representative file-system images in benchmarking & design
How to create representative file-system images?
Requirements for creating FS images
- Access to data on file systems and disk layout
– Properties of file-system metadata [Satyanarayan81, Mullender84, Irlam93, Sienknecht94, Douceur99, Agrawal07]
– Disk fragmentation [Smith97]
– More such studies in future?
- A technique to create file-system images that is
– Representative: given a set of input distributions
– Controllable: supply additional user constraints
– Reproducible: control & report internal parameters
– Easy to use: for widespread adoption and consensus
Introducing Impressions
- Powerful statistical framework to generate file-system images
– Takes properties of file-system attributes as input
– Works out underlying statistical details of the image
– Mounted on a disk partition for real benchmarking
– Satisfies the four design goals
- Applying Impressions gives useful insights
– What is the impact on performance and storage size?
– How does an application behave on a real FS image?
Outline
- Introduction
- Generating realistic file-system images
- Applying Impressions: Desktop search
- Conclusion
Overview of Impressions
Impressions
Properties of file-system metadata
“Five-year study of file-system metadata” [FAST07]
(Agrawal, Bolosky, Douceur, Lorch)
Used as exemplar for metadata properties in Impressions
Features of Impressions
- Modes of operation for different usages
– Basic mode: choose default settings for parameters
– Advanced mode: several individually tunable knobs
- Thorough statistical machinery ensures accuracy
– Uses parameterized curve fits
– Allows arbitrary user constraints
– Built-in statistical tests for goodness-of-fit
- Generates namespace, metadata, file content, and disk fragmentation using above techniques
Creating valid metadata
- Creating file-system namespace
– Uses Generative Model proposed earlier [FAST 07]
– Explains the process of directory tree creation
– Accurately regenerates distribution of directory size and of directory depth
Creating namespace
Directory tree: Monte Carlo run
Incorporates dirs by depth and dirs by subdir count
Probability of parent selection ∝ Count(subdirs) + 2
[Figure: Directories by Subdirectory Count: cumulative % of directories vs. count of subdirectories; Dataset (D) vs. Generated (G)]
[Figure: Directories by Namespace Depth: fraction of directories vs. namespace depth (bin size 1); Dataset (D) vs. Generated (G)]
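The parent-selection rule on this slide can be illustrated with a small Monte Carlo sketch. It assumes that each new directory picks its parent among the existing directories with weight Count(subdirs) + 2; the function names and the seed handling are illustrative and not taken from Impressions itself.

```python
import random


def generate_directory_tree(num_dirs, seed=0):
    """Return a list where parent[i] is the parent of directory i (0 is the root)."""
    rng = random.Random(seed)
    parent = [None]      # the root has no parent
    subdir_count = [0]   # current number of subdirectories of each directory
    for new_dir in range(1, num_dirs):
        # Each existing directory is weighted by Count(subdirs) + 2.
        weights = [count + 2 for count in subdir_count]
        chosen = rng.choices(range(new_dir), weights=weights, k=1)[0]
        parent.append(chosen)
        subdir_count[chosen] += 1
        subdir_count.append(0)
    return parent


def directory_depths(parent):
    """Depth of every directory; parents always have a smaller index than children."""
    depth = [0] * len(parent)
    for i in range(1, len(parent)):
        depth[i] = depth[parent[i]] + 1
    return depth


tree = generate_directory_tree(1000)
depths = directory_depths(tree)
print("max depth:", max(depths))
```

Histograms of `depths` and of the per-directory subdirectory counts are the quantities one would compare against the dataset curves in the two figures.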
Creating valid metadata
- Creating file-system namespace
- Creating files: stepwise process
– File size, file extension, file depth, parent directory
– Uses statistical models & analytical approximations
Example: creating realistic file sizes
- Pure lognormal distribution is no longer a good fit
- Hybrid model: lognormal body, Pareto tail
– Fits observed data more accurately
– Used to recreate file sizes in Impressions
[Figure: File Sizes: contribution to used space vs. containing file size (bytes, log scale); Lognormal vs. Hybrid fit]
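A minimal sketch of sampling from a hybrid model with a lognormal body and a Pareto tail follows. All parameter values (`mu`, `sigma`, `tail_prob`, `tail_start`, `alpha`) are placeholders chosen for illustration, not the fitted values used by Impressions, and the simple mixture shown here is an assumption about how the two pieces are combined.

```python
import random


def sample_file_size(rng, mu=9.0, sigma=2.5, tail_prob=0.01,
                     tail_start=512 * 2**20, alpha=1.5):
    """Draw one file size in bytes from a lognormal body with a Pareto tail."""
    if rng.random() < tail_prob:
        # Pareto tail: paretovariate() returns values >= 1, so sizes start at tail_start.
        return int(tail_start * rng.paretovariate(alpha))
    while True:
        # Lognormal body, resampled on the rare draw that lands in the tail region.
        size = rng.lognormvariate(mu, sigma)
        if size < tail_start:
            return int(size)


rng = random.Random(42)
sizes = [sample_file_size(rng) for _ in range(10000)]
print("largest file (bytes):", max(sizes))
```

The tail contributes few files but, being heavy, can account for a large share of the bytes, which is the effect a pure lognormal fit misses.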
Creating files
File size model: lognormal body, Pareto tail; captures bimodal curve
[Figure: Files by Containing Bytes: fraction of bytes vs. file size (bytes, log scale); Dataset (D) vs. Generated (G)]
Creating files
File extensions: percentile values; top 20 extensions account for 50% of files and bytes
[Figure: Top Extensions by Count: fraction of files, Desired vs. Generated; cpp, dll, exe, gif, h, htm, jpg, null, txt, others]
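Assigning extensions from percentile values amounts to weighted sampling. The sketch below assumes a table of per-extension file fractions; the numbers in `EXTENSION_SHARE` are made up for illustration and are not the values from the metadata study.

```python
import random
from collections import Counter

# Hypothetical fraction of files per extension ("null" = no extension).
EXTENSION_SHARE = {
    "cpp": 0.03, "dll": 0.07, "exe": 0.03, "gif": 0.07, "h": 0.05,
    "htm": 0.06, "jpg": 0.04, "null": 0.10, "txt": 0.04, "others": 0.51,
}


def sample_extensions(num_files, share, seed=0):
    """Pick an extension for each file according to the desired shares."""
    rng = random.Random(seed)
    names = list(share)
    weights = [share[name] for name in names]
    return rng.choices(names, weights=weights, k=num_files)


extensions = sample_extensions(50000, EXTENSION_SHARE)
generated = Counter(extensions)
for name, desired in EXTENSION_SHARE.items():
    print(f"{name:>6}: desired {desired:.2f}, generated {generated[name] / 50000:.2f}")
```

The printed pairs correspond to the Desired and Generated bars in the figure.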
Creating files
File depth: Poisson; multiplicative model along with bytes by depth
[Figure: Files by Namespace Depth: fraction of files vs. namespace depth (bin size 1); Dataset (D) vs. Generated (G)]
[Figure: Bytes by Namespace Depth: mean bytes per file (log scale) vs. namespace depth (bin size 1); Dataset (D) vs. Generated (G)]
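One way to read the multiplicative model for file depth is as a Poisson weight per depth multiplied by a factor that favors depths whose mean bytes per file is close to the file's size. The sketch below follows that reading; the affinity factor, the value of `lam`, and the `MEAN_BYTES` table are all assumptions made for illustration, not the formulation used by Impressions.

```python
import math
import random


def poisson_pmf(k, lam):
    """Probability of k under a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def choose_depth(file_size, mean_bytes_by_depth, rng, lam=6.5):
    """Pick a namespace depth for a file of the given size."""
    depths = sorted(mean_bytes_by_depth)
    weights = []
    for d in depths:
        # Hypothetical affinity: larger when file_size is close (in log2 terms)
        # to the mean bytes per file at this depth.
        gap = abs(math.log2(file_size + 1) - math.log2(mean_bytes_by_depth[d]))
        weights.append(poisson_pmf(d, lam) * (1.0 / (1.0 + gap)))
    return rng.choices(depths, weights=weights, k=1)[0]


# Hypothetical bytes-by-depth table for depths 0..16.
MEAN_BYTES = {d: (768 * 1024 if d < 4 else 64 * 1024) for d in range(17)}
rng = random.Random(7)
depth = choose_depth(20 * 1024, MEAN_BYTES, rng)
print("chosen depth:", depth)
```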
Creating files
Parent dir: inverse polynomial; satisfies distribution of dirs with file count
[Figure: Files by Namespace Depth (with Special Directories): fraction of files vs. namespace depth (bin size 1); Dataset (D) vs. Generated (G)]
Resolving arbitrary constraints
Constraint: given the count of files & the size distribution, ensure the sum of file sizes matches a desired total file-system size
[Figure: fraction of files vs. file size (bytes, log scale); Original (O) vs. Constrained (C) vs. contrived for sum]
Accurate both for the sum and the distribution
Resolving arbitrary constraints
- Arbitrarily specified on file system parameters
- Variant of NP-complete “Subset Sum Problem”
– Approximation algorithm based solution (in paper)
– Oversampling to get additional sample values
– Local improvement to iteratively converge to the desired sum by identifying best-fit in current sample
- While constraints are satisfied, constrained distribution also retains original characteristics
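The oversampling and local-improvement idea can be sketched as follows. This is a simplified illustration, not the approximation algorithm from the paper: it draws the sample plus an extra pool, then repeatedly swaps one sample value for the pool value that best closes the gap to the target sum. The function name, the tolerance `tol`, and the oversampling factor are assumptions.

```python
import random


def constrain_sum(draw, n, target, tol=0.05, oversample=10, max_rounds=1000, seed=0):
    """Return n values drawn via draw(rng) whose sum is within tol * target of target."""
    rng = random.Random(seed)
    sample = [draw(rng) for _ in range(n)]
    pool = [draw(rng) for _ in range(n * oversample)]   # oversampled extra values
    total = sum(sample)
    for _ in range(max_rounds):
        if abs(total - target) <= tol * target or not pool:
            break
        i = rng.randrange(n)
        # Value that would hit the target exactly if it replaced sample[i].
        want = sample[i] + (target - total)
        j = min(range(len(pool)), key=lambda k: abs(pool[k] - want))
        new_total = total - sample[i] + pool[j]
        if abs(new_total - target) < abs(total - target):
            # Keep the swap only if it moves the sum closer to the target.
            sample[i], pool[j] = pool[j], sample[i]
            total = new_total
    return sample, total


def draw_size(rng):
    # Hypothetical size distribution used only for this example.
    return rng.lognormvariate(9.0, 1.0)


sizes, total = constrain_sum(draw_size, n=1000, target=1.5e7)
print("relative error:", abs(total - 1.5e7) / 1.5e7)
```

Because every replacement value comes from the same distribution, the constrained sample stays close to the original shape while its sum converges on the target.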
Interpolation and extrapolation
- Why don’t we just use available data values?
– Limited to empirical data in input dataset
– “What-if” analysis beyond available dataset
– Efficient to maintain compact curve fits and use interpolation/extrapolation instead of all data
- Technique: Piecewise interpolation
Piecewise Interpolation
[Figure: fraction of bytes vs. file size (2 bytes to 128G) for 10 GB, 50 GB, and 100 GB file systems; segment 19 marked on each curve]
[Figure: Interpolation: Seg 19: segment value vs. file-system size (GB)]
Interpolation technique & accuracy
- Each distribution broken down into segments
- Data points within a segment used for curve fit
- Combine segment interpolations for new curve
[Figure: File Size Interpolation (75 GB): fraction of bytes vs. file size; Real (R) vs. Interpolated (I)]
[Figure: File Size Extrapolation (125 GB): fraction of bytes vs. file size; Real (R) vs. Extrapolated (E)]
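The segment-by-segment procedure can be sketched as below. A straight line between neighboring file-system sizes stands in for the per-segment curve fit, which is an assumption; the `KNOWN` distributions are invented three-segment examples, not measured data.

```python
def interpolate_segment(xs, ys, x):
    """Piecewise-linear value at x; outside the range, extend the nearest line."""
    if x <= xs[0]:
        i = 0
    elif x >= xs[-1]:
        i = len(xs) - 2
    else:
        i = max(k for k in range(len(xs) - 1) if xs[k] <= x)
    slope = (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
    return ys[i] + slope * (x - xs[i])


def interpolate_distribution(known, fs_size):
    """Build a distribution for fs_size from per-segment fits, then renormalize."""
    sizes = sorted(known)
    num_segments = len(known[sizes[0]])
    raw = []
    for seg in range(num_segments):
        values = [known[s][seg] for s in sizes]
        raw.append(max(0.0, interpolate_segment(sizes, values, fs_size)))
    total = sum(raw)
    return [value / total for value in raw]


# Hypothetical fraction-of-bytes distributions (3 segments) at 10, 50, and 100 GB.
KNOWN = {10: [0.5, 0.3, 0.2], 50: [0.4, 0.3, 0.3], 100: [0.3, 0.3, 0.4]}
print([round(v, 3) for v in interpolate_distribution(KNOWN, 75)])    # interpolation
print([round(v, 3) for v in interpolate_distribution(KNOWN, 125)])   # extrapolation
```

Keeping only the per-segment fits is what makes the representation compact compared with storing every empirical data point.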
File content
- Files having natural language content
– Word-popularity model (heavy tailed)
– Word-length frequency model (for the long tail)
- Content for other files (mp3, gif, mpeg, etc.)
– Impressions generates valid header/footer
– Uses third-party libraries and software
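A rough sketch of the two text models: popular words are drawn with a heavy-tailed (Zipf-like, 1/rank) weighting, and rare words are synthesized from a word-length distribution. The word list, the length weights, and `tail_prob` are illustrative assumptions, not the models shipped with Impressions.

```python
import random
import string
from collections import Counter

COMMON_WORDS = ["the", "of", "and", "to", "a", "in", "is", "that"]   # hypothetical
WORD_LENGTH_WEIGHTS = {3: 0.2, 4: 0.25, 5: 0.2, 6: 0.15, 7: 0.1, 8: 0.1}  # hypothetical


def generate_text(num_words, tail_prob=0.2, seed=0):
    """Generate space-separated words from the popularity and word-length models."""
    rng = random.Random(seed)
    zipf_weights = [1.0 / (rank + 1) for rank in range(len(COMMON_WORDS))]
    lengths = list(WORD_LENGTH_WEIGHTS)
    length_weights = list(WORD_LENGTH_WEIGHTS.values())
    words = []
    for _ in range(num_words):
        if rng.random() < tail_prob:
            # Long tail: synthesize a rare word with a sampled length.
            n = rng.choices(lengths, weights=length_weights, k=1)[0]
            words.append("".join(rng.choice(string.ascii_lowercase) for _ in range(n)))
        else:
            # Popular words: heavy-tailed choice by rank.
            words.append(rng.choices(COMMON_WORDS, weights=zipf_weights, k=1)[0])
    return " ".join(words)


text = generate_text(500)
print(text[:80])
```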
Disk layout and fragmentation
- Simplistic technique
– Layout Score for degree of fragmentation [Smith97]
– Pairs of file create/delete operations till desired layout score is achieved
- In future more nuanced ways are possible
– Out-of-order file writes, writes with long delays
– Access to file-system specific interfaces: FIBMAP in Linux, XFS_IOC_GETBMAP for XFS
– Perhaps a tool complementary to Impressions
File 1: 1 non-contiguous block (out of 8); File Layout Score = 7/8
File 2: all blocks contiguous; File Layout Score = 1 (6/6)
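The two scores on this slide can be reproduced with a short function. It follows the slide's arithmetic (contiguous blocks divided by total blocks, with the first block counted as contiguous); the exact definition in [Smith97] may treat the first block differently, and the block addresses below are invented.

```python
def layout_score(block_addresses):
    """Fraction of a file's blocks that directly follow the previous block on disk."""
    n = len(block_addresses)
    if n == 0:
        return 1.0
    # A block is non-contiguous if it does not sit right after its predecessor.
    breaks = sum(1 for prev, cur in zip(block_addresses, block_addresses[1:])
                 if cur != prev + 1)
    return (n - breaks) / n


file1 = [10, 11, 12, 13, 20, 21, 22, 23]   # one jump, after the fourth block
file2 = [30, 31, 32, 33, 34, 35]           # fully contiguous
print(layout_score(file1))   # 0.875, i.e. 7/8
print(layout_score(file2))   # 1.0, i.e. 6/6
```

A fragmentation tool along the lines described on the previous slide would issue create/delete pairs and recompute such a score until it reaches the desired value.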
Outline
- Introduction
- Generating realistic file-system images
- Applying Impressions: Desktop search
- Conclusion
Applying Impressions
- Case study: desktop search
– Google Desktop for Linux (GDL) and Beagle
– Metrics of interest: size of index, time to build initial search index
– Identifying application bugs and policies: GDL doesn’t index content beyond 10-deep
- Computing realistic rules of thumb
– Overhead of metadata replication?
Impact of file content
- File content has a significant effect: around 300% increase in index size for both GDL & Beagle
- Understanding design: GDL index smaller than Beagle for text files, larger for binary files
[Figure: Index Size Comparison: index size / FS size for Beagle and GDL; Text (1 Word), Text (Model), Binary]
Impact of metadata and content
Developer aid: understanding impact of different file system content & different index schemes
[Figure: Beagle: Index Size: relative index size for Original, TextCache, DisDir, DisFilter; Default, Text, Image, Binary]
Impact of metadata and content
Reproducing identical file-system image to compare other apps or ones developed later
[Figure: Beagle: Index Size: relative index size for Original, TextCache, DisDir, DisFilter; Default, Text, Image, Binary; Beagle shown alongside a future app]
Conclusion
- Impressions framework for realistic FS images
– Representative, controllable, reproducible, easy to use
– Includes almost all file system params of interest
- Extensible architecture
– Plug in new statistical constructs, new models for metadata and content generation
- Powerful utility for file-system benchmarking
– To be contributed publicly (coming soon): http://www.cs.wisc.edu/adsl/Software/Impressions
Nitin Agrawal
http://www.cs.wisc.edu/~nitina
Impressions download (coming soon)
http://www.cs.wisc.edu/adsl/Software/Impressions