Generating Realistic Impressions for File-System Benchmarking - PowerPoint PPT Presentation



SLIDE 1

Generating Realistic Impressions for File-System Benchmarking

Nitin Agrawal

Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau

SLIDE 2

“For better or for worse, benchmarks shape a field”

David Patterson

SLIDE 3

Inputs to file-system benchmarking


[Diagram: Application → File System (FS logical organization, disk layout) → Storage device]

Input: Benchmark workload - Postmark, FileBench, Fstress, Bonnie, IOZone, TPCC, etc.
Input: File-system image - anything goes!
Input: In-memory state - cold cache / warm cache

SLIDE 4

FS images in past: use what is convenient

– Randomly generated files of several MB (FAST 08)
– 1000 files in 10 dirs w/ random data (SOSP 03)
– 188GB and 129GB volumes in Engg dept (OSDI 99)
– 5-deep tree, 5 subdirs, 10 8KB files in each (FAST 04)
– 10702 files from /usr/local, size 354MB (SOSP 01)
– Typical desktop file system w/ no description (SOSP 05)
– 1641 files, 109 dirs, 13.4 MB total size (OSDI 02)

SLIDE 5

Performance of find operation

[Figure: relative time taken by the find operation, varying disk layout & cache state and file-system logical organization]
SLIDE 6

Problem scope

Characteristics of file-system images have a strong impact on performance. We need to incorporate representative file-system images in benchmarking & design.


How to create representative file-system images?

SLIDE 7

Requirements for creating FS images

  • Access to data on file systems and disk layout

– Properties of file-system metadata [Satyanarayanan81, Mullender84, Irlam93, Sienknecht94, Douceur99, Agrawal07]
– Disk fragmentation [Smith97]
– More such studies in future?

  • A technique to create file-system images that is

– Representative: given a set of input distributions
– Controllable: supply additional user constraints
– Reproducible: control & report internal parameters
– Easy to use: for widespread adoption and consensus

SLIDE 8

Introducing Impressions

  • Powerful statistical framework to generate file-system images

– Takes properties of file-system attributes as input
– Works out underlying statistical details of the image
– Mounted on a disk partition for real benchmarking
– Satisfies the four design goals

  • Applying Impressions gives useful insights

– What is the impact on performance and storage size?
– How does an application behave on a real FS image?

SLIDE 9

Outline

  • Introduction
  • Generating realistic file-system images
  • Applying Impressions: Desktop search
  • Conclusion

SLIDE 10

Overview of Impressions

[Figure: Impressions overview diagram]

SLIDE 11

Properties of file-system metadata

“Five-year study of file-system metadata” [FAST07]

(Agrawal, Bolosky, Douceur, Lorch)

Used as exemplar for metadata properties in Impressions

SLIDE 12

Features of Impressions

  • Modes of operation for different usages

– Basic mode: choose default settings for parameters
– Advanced mode: several individually tunable knobs

  • Thorough statistical machinery ensures accuracy

– Uses parameterized curve fits
– Allows arbitrary user constraints
– Built-in statistical tests for goodness-of-fit

  • Generates namespace, metadata, file content, and disk fragmentation using above techniques
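The built-in goodness-of-fit testing can be pictured with a small sketch. The two-sample Kolmogorov-Smirnov statistic below is a stand-in (the slide does not say which tests Impressions uses), comparing a generated sample against an observed one; the lognormal parameters are invented for illustration.

```python
import bisect
import random

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest vertical
    gap between the two empirical CDFs, evaluated at every data point."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(xs, x):
        # Fraction of values in the sorted list xs that are <= x.
        return bisect.bisect_right(xs, x) / len(xs)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)

random.seed(0)
# Pretend these are observed vs. Impressions-generated file sizes.
observed = [random.lognormvariate(9.5, 2.0) for _ in range(2000)]
generated = [random.lognormvariate(9.5, 2.0) for _ in range(2000)]
d = ks_statistic(observed, generated)
print(f"KS statistic D = {d:.3f}")  # small D suggests a good fit
```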

SLIDE 13

Creating valid metadata

  • Creating file-system namespace

– Uses Generative Model proposed earlier [FAST 07]
– Explains the process of directory tree creation
– Accurately regenerates distribution of directory size and of directory depth

SLIDE 14

Creating namespace


Directory tree: Monte Carlo runs incorporate dirs by depth and dirs by subdirectory count

Probability of parent selection ∝ Count(subdirs) + 2

[Figure: Directories by Subdirectory Count - cumulative % of directories vs. count of subdirectories, dataset (D) vs. generated (G)]

[Figure: Directories by Namespace Depth - fraction of directories vs. namespace depth (bin size 1), dataset (D) vs. generated (G)]
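The generative model on this slide can be sketched in a few lines. Only the parent-selection rule (weight proportional to subdirectory count plus two) comes from the slide; the tree size and seed below are illustrative.

```python
import random

def generate_tree(num_dirs, seed=42):
    """Sketch of the generative namespace model: each new directory
    chooses its parent with probability proportional to the parent's
    current subdirectory count plus two, then attaches one level below."""
    rng = random.Random(seed)
    subdir_count = [0]  # subdir_count[i] = number of children of dir i
    depth = [0]         # depth[i] = namespace depth of dir i (root = 0)
    for _ in range(num_dirs - 1):
        weights = [c + 2 for c in subdir_count]
        parent = rng.choices(range(len(subdir_count)), weights=weights)[0]
        subdir_count[parent] += 1
        subdir_count.append(0)
        depth.append(depth[parent] + 1)
    return subdir_count, depth

counts, depths = generate_tree(1000)
print("max namespace depth:", max(depths))
print("dirs with no subdirs:", sum(1 for c in counts if c == 0))
```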

SLIDE 15

Creating valid metadata

  • Creating file-system namespace
  • Creating files: stepwise process

– File size, file extension, file depth, parent directory
– Uses statistical models & analytical approximations

SLIDE 16

Example: creating realistic file sizes

  • Pure lognormal distribution no longer good fit
  • Hybrid model: lognormal body, Pareto tail

– Fits observed data more accurately, used to recreate file sizes in Impressions

[Figure: contribution to used space vs. containing file size (bytes, log scale); lognormal vs. hybrid fits to observed file sizes]
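A hedged sketch of sampling from such a hybrid model: with small probability a size comes from a Pareto tail, otherwise from a lognormal body. The mixture weight and all distribution parameters below are invented for illustration, not the paper's fitted values.

```python
import random

def sample_file_size(rng, p_tail=0.01, mu=8.5, sigma=2.5,
                     xm=512 * 1024 ** 2, alpha=1.1):
    """Hybrid file-size model: lognormal body, Pareto tail.
    p_tail, mu, sigma, xm, and alpha are all illustrative values,
    not the fitted parameters from the paper."""
    if rng.random() < p_tail:
        # Pareto tail via inverse-CDF sampling; minimum tail size xm.
        return int(xm / (1.0 - rng.random()) ** (1.0 / alpha))
    return max(1, int(rng.lognormvariate(mu, sigma)))

rng = random.Random(1)
sizes = [sample_file_size(rng) for _ in range(10000)]
sizes.sort()
print("median file size (bytes):", sizes[len(sizes) // 2])
print("largest file size (bytes):", sizes[-1])
```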

SLIDE 17

Creating files

File Size Model: lognormal body, Pareto tail; captures the bimodal curve

[Figure: Files by Containing Bytes - fraction of bytes vs. file size (bytes, log scale), dataset (D) vs. generated (G)]

SLIDE 18

Creating files

File Extensions: percentile values; top 20 extensions account for 50% of files and bytes

[Figure: Top extensions by count (cpp, dll, exe, gif, h, htm, jpg, null, txt, others) - desired vs. generated fraction of files]

SLIDE 19

Creating files

File Depth: Poisson; multiplicative model along with bytes by depth

[Figure: Files by Namespace Depth - fraction of files vs. namespace depth (bin size 1), dataset (D) vs. generated (G)]

[Figure: Bytes by Namespace Depth - mean bytes per file (log scale) vs. namespace depth, dataset (D) vs. generated (G)]

SLIDE 20

Creating files

Parent Directory: inverse polynomial; satisfies distribution of directories by file count

[Figure: Files by Namespace Depth (with special directories) - fraction of files vs. namespace depth (bin size 1), dataset (D) vs. generated (G)]

SLIDE 21

Resolving arbitrary constraints

SLIDE 22

Resolving arbitrary constraints

Constraint: given a count of files & a size distribution, ensure the sum of file sizes matches a desired total file-system size

Accurate both for the sum and the distribution

[Figure: fraction of files vs. file size (bytes, log scale), original (O) vs. constrained (C) distributions; legend: Original, Constrained, Contrived for sum]

SLIDE 23

Resolving arbitrary constraints

  • Constraints arbitrarily specified on file-system parameters
  • Variant of the NP-complete “Subset Sum Problem”

– Approximation-algorithm-based solution (in paper)
– Oversampling to get additional sample values
– Local improvement to iteratively converge to the desired sum by identifying the best fit in the current sample

  • While constraints are satisfied, the constrained distribution also retains its original characteristics
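The oversampling-plus-local-improvement idea can be sketched as a toy version that accepts any swap moving the sum toward the target; the paper's actual approximation algorithm and best-fit selection are more careful, and the size sampler and target below are made up.

```python
import random

def fit_sizes_to_target(rng, draw, count, target, oversample=10, iters=50000):
    """Toy constraint resolution: draw `count` sizes plus an oversampled
    spare pool, then repeatedly swap a chosen size with a spare whenever
    the swap moves the running sum closer to `target`."""
    chosen = [draw(rng) for _ in range(count)]
    spares = [draw(rng) for _ in range(count * oversample)]
    total = sum(chosen)
    for _ in range(iters):
        i = rng.randrange(count)
        j = rng.randrange(len(spares))
        new_total = total - chosen[i] + spares[j]
        if abs(new_total - target) < abs(total - target):
            chosen[i], spares[j] = spares[j], chosen[i]
            total = new_total
    return chosen, total

rng = random.Random(7)
draw = lambda r: max(1, int(r.lognormvariate(8.5, 2.5)))  # made-up size model
target = 3 * 10 ** 8  # desired total file-system size in bytes (illustrative)
sizes, total = fit_sizes_to_target(rng, draw, 1000, target)
print(f"target {target}, achieved {total}")
```

Because swapped-out values return to the spare pool, the chosen set stays a subset of draws from the original distribution, which is why the constrained distribution keeps its shape.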

SLIDE 24

Interpolation and extrapolation

  • Why don’t we just use available data values?

– Limited to empirical data in input dataset
– “What-if” analysis beyond available dataset
– Efficient to maintain compact curve fits and use interpolation/extrapolation instead of all data

  • Technique: Piecewise interpolation

SLIDE 25

Interpolation technique & accuracy

Piecewise Interpolation

  • Each distribution broken down into segments
  • Data points within a segment used for curve fit
  • Combine segment interpolations for new curve

[Figure: fraction of bytes vs. file size for 10, 50, and 100 GB images, with segment 19 highlighted; segment value vs. file-system size (GB) used to interpolate segment 19]

[Figure: Files by containing bytes, real vs. generated - interpolation at 75 GB and extrapolation at 125 GB]
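A minimal sketch of per-segment interpolation, assuming each segment's value varies roughly linearly with file-system size. The segment-19 values below echo the figure's axes but are illustrative, not measured.

```python
def interpolate_segment(known, target_size):
    """Linearly interpolate (or extrapolate) one distribution segment's
    value at a new file-system size, from values observed at a few sizes."""
    sizes = sorted(known)
    lo = max((s for s in sizes if s <= target_size), default=sizes[0])
    hi = min((s for s in sizes if s >= target_size), default=sizes[-1])
    if lo == hi:
        if target_size in known:
            return known[target_size]   # exact hit on an observed size
        lo, hi = sizes[0], sizes[-1]    # outside the range: extrapolate
    slope = (known[hi] - known[lo]) / (hi - lo)
    return known[lo] + slope * (target_size - lo)

# Segment 19 of the bytes-by-file-size distribution, observed at
# 10/50/100 GB images (values illustrative, echoing the figure's axes).
segment_19 = {10: 0.02, 50: 0.04, 100: 0.06}
print("interpolated at 75 GB: ", interpolate_segment(segment_19, 75))
print("extrapolated at 125 GB:", interpolate_segment(segment_19, 125))
```

Repeating this for every segment and stitching the results together yields the new curve for the target file-system size.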

SLIDE 26

File content

  • Files having natural-language content

– Word-popularity model (heavy tailed)
– Word-length frequency model (for the long tail)

  • Content for other files (mp3, gif, mpeg, etc.)

– Impressions generates valid header/footer
– Uses third-party libraries and software
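The heavy-tailed word-popularity model can be approximated with a Zipf sampler. The tiny vocabulary and exponent below are placeholders; Impressions pairs popularity with a word-length frequency model for rare words, which this sketch omits.

```python
import random

def make_word_sampler(vocab, s=1.0, seed=3):
    """Heavy-tailed word-popularity sketch: the word of rank r gets
    Zipf weight 1/(r+1)**s, so a few words dominate the text."""
    rng = random.Random(seed)
    weights = [1.0 / (r + 1) ** s for r in range(len(vocab))]
    return lambda n: rng.choices(vocab, weights=weights, k=n)

# Placeholder vocabulary, ordered by assumed popularity rank.
vocab = ["the", "of", "and", "to", "in", "file", "system", "data", "disk", "index"]
sample = make_word_sampler(vocab)
words = sample(1000)
print("count('the'):  ", words.count("the"))
print("count('index'):", words.count("index"))
```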

SLIDE 27

Disk layout and fragmentation

SLIDE 28

Disk layout and fragmentation

  • Simplistic technique

– Layout score for degree of fragmentation [Smith97]
– Pairs of file create/delete operations until the desired layout score is achieved

  • In future, more nuanced ways are possible

– Out-of-order file writes, writes with long delays
– Access to file-system-specific interfaces

  • FIBMAP in Linux, XFS_IOC_GETBMAP for XFS

– Perhaps a tool complementary to Impressions


Example: a file with 1 non-contiguous block out of 8 has layout score 7/8; a file with all 6 blocks contiguous has layout score 1 (6/6).
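The layout score from the example can be computed directly: the fraction of a file's blocks allocated contiguously, with the first block counted as contiguous, which matches the 7/8 and 6/6 figures on the slide.

```python
def layout_score(blocks):
    """Layout score [Smith97]: fraction of a file's blocks that are
    allocated contiguously; the first block always counts as contiguous."""
    if not blocks:
        return 1.0
    contiguous = 1 + sum(
        1 for prev, cur in zip(blocks, blocks[1:]) if cur == prev + 1
    )
    return contiguous / len(blocks)

print(layout_score([10, 11, 12, 13, 14, 15]))             # all 6 contiguous -> 1.0
print(layout_score([10, 11, 12, 13, 99, 100, 101, 102]))  # one gap in 8 -> 0.875
```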

SLIDE 29

Outline

  • Introduction
  • Generating realistic file-system images
  • Applying Impressions: Desktop search
  • Conclusion

SLIDE 30

Applying Impressions

  • Case study: desktop search

– Google Desktop for Linux (GDL) and Beagle
– Metrics of interest: size of index, time to build initial search index
– Identifying application bugs and policies: GDL doesn’t index content beyond 10-deep
– Computing realistic rules of thumb: overhead of metadata replication?

SLIDE 31

Impact of file content

File content has a significant effect: around 300% increase in index size for both GDL & Beagle. Understanding design: GDL index is smaller than Beagle’s for text files, larger for binary files.

[Figure: index size / FS size for Beagle and GDL, with text (1 word), text (model), and binary content]

SLIDE 32

Impact of metadata and content

Developer aid: understanding the impact of different file-system content & different indexing schemes

[Figure: Beagle relative index size under Original, TextCache, DisDir, and DisFilter settings, for default, text, image, and binary content]

SLIDE 33

Impact of metadata and content

Reproducing an identical file-system image to compare other applications, including ones developed later

[Figure: Beagle relative index size under Original, TextCache, DisDir, and DisFilter settings; the same image reused to compare a future app against Beagle]

SLIDE 34

Conclusion

  • Impressions framework for realistic FS images

– Representative, controllable, reproducible, easy to use
– Includes almost all file-system parameters of interest

  • Extensible architecture

– Plug in new statistical constructs, new models for metadata and content generation

  • Powerful utility for file-system benchmarking

– To be contributed publicly (coming soon): http://www.cs.wisc.edu/adsl/Software/Impressions

SLIDE 35

Nitin Agrawal

http://www.cs.wisc.edu/~nitina

Impressions download (coming soon)

http://www.cs.wisc.edu/adsl/Software/Impressions


Questions?
