Generating Realistic Impressions for File-System Benchmarking - PowerPoint PPT Presentation



SLIDE 1

Generating Realistic Impressions for File-System Benchmarking

Nitin Agrawal

Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau

SLIDE 2

“For better or for worse, benchmarks shape a field”

David Patterson

SLIDE 3

Inputs to file-system benchmarking


[Diagram: Application → File System (FS logical organization, disk layout) → Storage device]

Input: Benchmark workload - Postmark, FileBench, Fstress, Bonnie, IOZone, TPCC, etc.
Input: File-system image - anything goes!
Input: In-memory state - cold cache / warm cache

SLIDE 4

FS images in past: use what is convenient

– Randomly generated files of several MB (FAST 08)
– 1000 files in 10 dirs w/ random data (SOSP 03)
– 188GB and 129GB volumes in Engg dept (OSDI 99)
– 5-deep tree, 5 subdirs, 10 8KB files in each (FAST 04)
– 10702 files from /usr/local, size 354MB (SOSP 01)
– Typical desktop file system w/ no description (SOSP 05)
– 1641 files, 109 dirs, 13.4 MB total size (OSDI 02)

SLIDE 5

Performance of find operation

[Figure: relative time taken by the find operation, varying disk layout & cache state and file-system logical organization]
SLIDE 6

Problem scope

Characteristics of file-system images have a strong impact on performance. We need to incorporate representative file-system images in benchmarking & design.


How to create representative file-system images?

SLIDE 7

Requirements for creating FS images

  • Access to data on file systems and disk layout

– Properties of file-system metadata [Satyanarayanan81, Mullender84, Irlam93, Sienknecht94, Douceur99, Agrawal07]
– Disk fragmentation [Smith97]
– More such studies in future?

  • A technique to create file-system images that is

– Representative: given a set of input distributions
– Controllable: supply additional user constraints
– Reproducible: control & report internal parameters
– Easy to use: for widespread adoption and consensus

SLIDE 8

Introducing Impressions

  • Powerful statistical framework to generate file-system images

– Takes properties of file-system attributes as input
– Works out underlying statistical details of the image
– Mounted on a disk partition for real benchmarking
– Satisfies the four design goals

  • Applying Impressions gives useful insights

– What is the impact on performance and storage size?
– How does an application behave on a real FS image?

SLIDE 9

Outline

  • Introduction
  • Generating realistic file-system images
  • Applying Impressions: Desktop search
  • Conclusion

SLIDE 10

Overview of Impressions

[Figure: Impressions overview diagram]

SLIDE 11

Properties of file-system metadata

“Five-year study of file-system metadata” [FAST07]

(Agrawal, Bolosky, Douceur, Lorch)

Used as exemplar for metadata properties in Impressions

SLIDE 12

Features of Impressions

  • Modes of operation for different usages

– Basic mode: choose default settings for parameters
– Advanced mode: several individually tunable knobs

  • Thorough statistical machinery ensures accuracy

– Uses parameterized curve fits
– Allows arbitrary user constraints
– Built-in statistical tests for goodness-of-fit

  • Generates namespace, metadata, file content, and disk fragmentation using above techniques
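The built-in goodness-of-fit testing can be pictured with a small sketch. The two-sample Kolmogorov-Smirnov statistic below is a stand-in (the slide does not say which tests Impressions uses), comparing a generated sample against an observed one; the lognormal parameters are invented for illustration.

```python
import bisect
import random

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest vertical
    gap between the two empirical CDFs, evaluated at every data point."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(xs, x):
        # Fraction of values in the sorted list xs that are <= x.
        return bisect.bisect_right(xs, x) / len(xs)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)

random.seed(0)
# Pretend these are observed vs. Impressions-generated file sizes.
observed = [random.lognormvariate(9.5, 2.0) for _ in range(2000)]
generated = [random.lognormvariate(9.5, 2.0) for _ in range(2000)]
d = ks_statistic(observed, generated)
print(f"KS statistic D = {d:.3f}")  # small D suggests a good fit
```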

SLIDE 13

Creating valid metadata

  • Creating file-system namespace

– Uses Generative Model proposed earlier [FAST 07]
– Explains the process of directory tree creation
– Accurately regenerates distribution of directory size and of directory depth

SLIDE 14

Creating namespace


Directory tree: Monte Carlo runs incorporate dirs by depth and dirs by subdirectory count

Probability of parent selection ∝ Count(subdirs) + 2

[Figure: Directories by Subdirectory Count - cumulative % of directories vs. count of subdirectories, dataset (D) vs. generated (G)]

[Figure: Directories by Namespace Depth - fraction of directories vs. namespace depth (bin size 1), dataset (D) vs. generated (G)]
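The generative model on this slide can be sketched in a few lines. Only the parent-selection rule (weight proportional to subdirectory count plus two) comes from the slide; the tree size and seed below are illustrative.

```python
import random

def generate_tree(num_dirs, seed=42):
    """Sketch of the generative namespace model: each new directory
    chooses its parent with probability proportional to the parent's
    current subdirectory count plus two, then attaches one level below."""
    rng = random.Random(seed)
    subdir_count = [0]  # subdir_count[i] = number of children of dir i
    depth = [0]         # depth[i] = namespace depth of dir i (root = 0)
    for _ in range(num_dirs - 1):
        weights = [c + 2 for c in subdir_count]
        parent = rng.choices(range(len(subdir_count)), weights=weights)[0]
        subdir_count[parent] += 1
        subdir_count.append(0)
        depth.append(depth[parent] + 1)
    return subdir_count, depth

counts, depths = generate_tree(1000)
print("max namespace depth:", max(depths))
print("dirs with no subdirs:", sum(1 for c in counts if c == 0))
```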

SLIDE 15

Creating valid metadata

  • Creating file-system namespace
  • Creating files: stepwise process

– File size, file extension, file depth, parent directory
– Uses statistical models & analytical approximations

SLIDE 16

Example: creating realistic file sizes

  • Pure lognormal distribution no longer good fit
  • Hybrid model: lognormal body, Pareto tail

– Fits observed data more accurately, used to recreate file sizes in Impressions

[Figure: contribution to used space vs. containing file size (bytes, log scale); lognormal vs. hybrid fits to observed file sizes]
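A hedged sketch of sampling from such a hybrid model: with small probability a size comes from a Pareto tail, otherwise from a lognormal body. The mixture weight and all distribution parameters below are invented for illustration, not the paper's fitted values.

```python
import random

def sample_file_size(rng, p_tail=0.01, mu=8.5, sigma=2.5,
                     xm=512 * 1024 ** 2, alpha=1.1):
    """Hybrid file-size model: lognormal body, Pareto tail.
    p_tail, mu, sigma, xm, and alpha are all illustrative values,
    not the fitted parameters from the paper."""
    if rng.random() < p_tail:
        # Pareto tail via inverse-CDF sampling; minimum tail size xm.
        return int(xm / (1.0 - rng.random()) ** (1.0 / alpha))
    return max(1, int(rng.lognormvariate(mu, sigma)))

rng = random.Random(1)
sizes = [sample_file_size(rng) for _ in range(10000)]
sizes.sort()
print("median file size (bytes):", sizes[len(sizes) // 2])
print("largest file size (bytes):", sizes[-1])
```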

SLIDE 17

Creating files

File Size Model: lognormal body, Pareto tail; captures the bimodal curve

[Figure: Files by Containing Bytes - fraction of bytes vs. file size (bytes, log scale), dataset (D) vs. generated (G)]

SLIDE 18

Creating files

File Extensions: percentile values; top 20 extensions account for 50% of files and bytes

[Figure: Top extensions by count (cpp, dll, exe, gif, h, htm, jpg, null, txt, others) - desired vs. generated fraction of files]

SLIDE 19

Creating files

File Depth: Poisson; multiplicative model along with bytes by depth

[Figure: Files by Namespace Depth - fraction of files vs. namespace depth (bin size 1), dataset (D) vs. generated (G)]

[Figure: Bytes by Namespace Depth - mean bytes per file (log scale) vs. namespace depth, dataset (D) vs. generated (G)]

SLIDE 20

Creating files

Parent Directory: inverse polynomial; satisfies distribution of directories by file count

[Figure: Files by Namespace Depth (with special directories) - fraction of files vs. namespace depth (bin size 1), dataset (D) vs. generated (G)]

SLIDE 21

Resolving arbitrary constraints

SLIDE 22

Resolving arbitrary constraints

Constraint: given a count of files & a size distribution, ensure the sum of file sizes matches a desired total file-system size

Accurate both for the sum and the distribution

[Figure: fraction of files vs. file size (bytes, log scale), original (O) vs. constrained (C) distributions; legend: Original, Constrained, Contrived for sum]

SLIDE 23

Resolving arbitrary constraints

  • Constraints arbitrarily specified on file-system parameters
  • Variant of the NP-complete “Subset Sum Problem”

– Approximation-algorithm-based solution (in paper)
– Oversampling to get additional sample values
– Local improvement to iteratively converge to the desired sum by identifying the best fit in the current sample

  • While constraints are satisfied, the constrained distribution also retains its original characteristics
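The oversampling-plus-local-improvement idea can be sketched as a toy version that accepts any swap moving the sum toward the target; the paper's actual approximation algorithm and best-fit selection are more careful, and the size sampler and target below are made up.

```python
import random

def fit_sizes_to_target(rng, draw, count, target, oversample=10, iters=50000):
    """Toy constraint resolution: draw `count` sizes plus an oversampled
    spare pool, then repeatedly swap a chosen size with a spare whenever
    the swap moves the running sum closer to `target`."""
    chosen = [draw(rng) for _ in range(count)]
    spares = [draw(rng) for _ in range(count * oversample)]
    total = sum(chosen)
    for _ in range(iters):
        i = rng.randrange(count)
        j = rng.randrange(len(spares))
        new_total = total - chosen[i] + spares[j]
        if abs(new_total - target) < abs(total - target):
            chosen[i], spares[j] = spares[j], chosen[i]
            total = new_total
    return chosen, total

rng = random.Random(7)
draw = lambda r: max(1, int(r.lognormvariate(8.5, 2.5)))  # made-up size model
target = 3 * 10 ** 8  # desired total file-system size in bytes (illustrative)
sizes, total = fit_sizes_to_target(rng, draw, 1000, target)
print(f"target {target}, achieved {total}")
```

Because swapped-out values return to the spare pool, the chosen set stays a subset of draws from the original distribution, which is why the constrained distribution keeps its shape.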

SLIDE 24

Interpolation and extrapolation

  • Why don’t we just use available data values?

– Limited to empirical data in input dataset
– “What-if” analysis beyond available dataset
– Efficient to maintain compact curve fits and use interpolation/extrapolation instead of all data

  • Technique: Piecewise interpolation

SLIDE 25

Interpolation technique & accuracy

Piecewise Interpolation

  • Each distribution broken down into segments
  • Data points within a segment used for curve fit
  • Combine segment interpolations for new curve

[Figure: fraction of bytes vs. file size for 10, 50, and 100 GB images, with segment 19 highlighted; segment value vs. file-system size (GB) used to interpolate segment 19]

[Figure: Files by containing bytes, real vs. generated - interpolation at 75 GB and extrapolation at 125 GB]
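A minimal sketch of per-segment interpolation, assuming each segment's value varies roughly linearly with file-system size. The segment-19 values below echo the figure's axes but are illustrative, not measured.

```python
def interpolate_segment(known, target_size):
    """Linearly interpolate (or extrapolate) one distribution segment's
    value at a new file-system size, from values observed at a few sizes."""
    sizes = sorted(known)
    lo = max((s for s in sizes if s <= target_size), default=sizes[0])
    hi = min((s for s in sizes if s >= target_size), default=sizes[-1])
    if lo == hi:
        if target_size in known:
            return known[target_size]   # exact hit on an observed size
        lo, hi = sizes[0], sizes[-1]    # outside the range: extrapolate
    slope = (known[hi] - known[lo]) / (hi - lo)
    return known[lo] + slope * (target_size - lo)

# Segment 19 of the bytes-by-file-size distribution, observed at
# 10/50/100 GB images (values illustrative, echoing the figure's axes).
segment_19 = {10: 0.02, 50: 0.04, 100: 0.06}
print("interpolated at 75 GB: ", interpolate_segment(segment_19, 75))
print("extrapolated at 125 GB:", interpolate_segment(segment_19, 125))
```

Repeating this for every segment and stitching the results together yields the new curve for the target file-system size.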

SLIDE 26

File content

  • Files having natural-language content

– Word-popularity model (heavy tailed)
– Word-length frequency model (for the long tail)

  • Content for other files (mp3, gif, mpeg, etc.)

– Impressions generates valid header/footer
– Uses third-party libraries and software
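The heavy-tailed word-popularity model can be approximated with a Zipf sampler. The tiny vocabulary and exponent below are placeholders; Impressions pairs popularity with a word-length frequency model for rare words, which this sketch omits.

```python
import random

def make_word_sampler(vocab, s=1.0, seed=3):
    """Heavy-tailed word-popularity sketch: the word of rank r gets
    Zipf weight 1/(r+1)**s, so a few words dominate the text."""
    rng = random.Random(seed)
    weights = [1.0 / (r + 1) ** s for r in range(len(vocab))]
    return lambda n: rng.choices(vocab, weights=weights, k=n)

# Placeholder vocabulary, ordered by assumed popularity rank.
vocab = ["the", "of", "and", "to", "in", "file", "system", "data", "disk", "index"]
sample = make_word_sampler(vocab)
words = sample(1000)
print("count('the'):  ", words.count("the"))
print("count('index'):", words.count("index"))
```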

SLIDE 27

Disk layout and fragmentation

SLIDE 28

Disk layout and fragmentation

  • Simplistic technique

– Layout score for degree of fragmentation [Smith97]
– Pairs of file create/delete operations until the desired layout score is achieved

  • In future, more nuanced ways are possible

– Out-of-order file writes, writes with long delays
– Access to file-system-specific interfaces

  • FIBMAP in Linux, XFS_IOC_GETBMAP for XFS

– Perhaps a tool complementary to Impressions


Example: a file with 1 non-contiguous block out of 8 has layout score 7/8; a file with all 6 blocks contiguous has layout score 1 (6/6).
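The layout score from the example can be computed directly: the fraction of a file's blocks allocated contiguously, with the first block counted as contiguous, which matches the 7/8 and 6/6 figures on the slide.

```python
def layout_score(blocks):
    """Layout score [Smith97]: fraction of a file's blocks that are
    allocated contiguously; the first block always counts as contiguous."""
    if not blocks:
        return 1.0
    contiguous = 1 + sum(
        1 for prev, cur in zip(blocks, blocks[1:]) if cur == prev + 1
    )
    return contiguous / len(blocks)

print(layout_score([10, 11, 12, 13, 14, 15]))             # all 6 contiguous -> 1.0
print(layout_score([10, 11, 12, 13, 99, 100, 101, 102]))  # one gap in 8 -> 0.875
```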

SLIDE 29

Outline

  • Introduction
  • Generating realistic file-system images
  • Applying Impressions: Desktop search
  • Conclusion

SLIDE 30

Applying Impressions

  • Case study: desktop search

– Google Desktop for Linux (GDL) and Beagle
– Metrics of interest: size of index, time to build initial search index
– Identifying application bugs and policies: GDL doesn’t index content beyond 10-deep
– Computing realistic rules of thumb: overhead of metadata replication?

SLIDE 31

Impact of file content

File content has a significant effect: around 300% increase in index size for both GDL & Beagle. Understanding design: GDL index is smaller than Beagle’s for text files, larger for binary files.

[Figure: index size / FS size for Beagle and GDL, with text (1 word), text (model), and binary content]

SLIDE 32

Impact of metadata and content

Developer aid: understanding the impact of different file-system content & different indexing schemes

[Figure: Beagle relative index size under Original, TextCache, DisDir, and DisFilter settings, for default, text, image, and binary content]

SLIDE 33

Impact of metadata and content

Reproducing an identical file-system image to compare other applications, including ones developed later

[Figure: Beagle relative index size under Original, TextCache, DisDir, and DisFilter settings; the same image reused to compare a future app against Beagle]

SLIDE 34

Conclusion

  • Impressions framework for realistic FS images

– Representative, controllable, reproducible, easy to use
– Includes almost all file-system parameters of interest

  • Extensible architecture

– Plug in new statistical constructs, new models for metadata and content generation

  • Powerful utility for file-system benchmarking

– To be contributed publicly (coming soon): http://www.cs.wisc.edu/adsl/Software/Impressions

SLIDE 35

Nitin Agrawal

http://www.cs.wisc.edu/~nitina

Impressions download (coming soon)

http://www.cs.wisc.edu/adsl/Software/Impressions


Questions?
