Robust Benchmarking for Archival Storage Tiers PDSW 2011, Lee et al. - PowerPoint PPT Presentation


SLIDE 1

DongJin Lee¹, Michael O'Sullivan¹, Cameron Walker¹, Monique MacKenzie²

¹The University of Auckland, New Zealand

²The University of St Andrews, United Kingdom

Robust Benchmarking for Archival Storage Tiers –PDSW 2011–


SLIDE 2

Motivation

Storage Tiers

Organizations use 'tiered' storage systems
Low overall cost, high capacity and high performance
Increasing number of read/write requests in recent years
Studies on how to efficiently utilize and build better storage tiers




SLIDE 5

Background and Introduction 1

Storage Design (Our research group)

Build an optimized storage system (designing better nodes)
Based on tier requirements, e.g., cost ($), capacity (TB), performance (MB/s) and power (W)
Based on architecture, e.g., file system
Based on components, e.g., disks, RAID, motherboard types, network types (commodity types)
Need to accurately measure MB/s using a 'typical archival workload'

Archival workload

Important in designing/modeling the archival storage system to meet the expected performance, e.g., how much MB/s gain do we observe when adding a certain number of disks?

Would different workloads give different results?




SLIDE 8

Background and Introduction 2

Workload: access pattern

What kind of workloads do archival tiers store/receive?
What is the typical case? (needed to design the system)
For the archival tier: data migration and data retrieval

Workload: file size

Typical files experienced by the archival tier
Characterize and model the file sizes
Generate a typical archival workload

Observation

Observe empirical file size distributions from the HPC sites [a]
Develop models for file sizes with variations

[a] S. Dayal. Characterizing HEC storage systems at rest. Technical Report CMU-PDL-08-109, Carnegie Mellon University Parallel Data Lab, 2008.



SLIDE 11

Background and Introduction 3

Traditional workload

Example tools: IOmeter, IOzone, Filebench, SPC-1
Limited distribution-based workloads and limited file testing
No archival-distribution workload

Archival workload

HSM write: batch file selection and migration (seq-write)
HSM read: retrieval file access from multiple disks/nodes (rand-read)
'Active' performance; no temporal access patterns (see Discussion)
Capacity utilization (total volume %) with distributions

Archival workload

Apply the archival file size distribution in a benchmark tool
Measure the performance, e.g., archival vs. non-archival, archival vs. traditional fixed file sizes



SLIDE 13

Observed file sizes

Empirical file size distribution from HPC

Archive: arsc-nanu1, arsc-seau2, arsc-seau1, pnnl-nwfs

5.3M–13.7M files, 69TB–305TB volume

Non-archive: lanl-scratch1, pnnl-home, pdl1, pdl2

1.5M–11.3M files, 1.2TB–9.2TB volume

[Figure: file size CDF and CCDF (2KB–32GB). Archive: arsc-nanu1 E[X]=14.8MB, arsc-seau2 E[X]=30.2MB, arsc-seau1 E[X]=43.8MB, pnnl-nwfs E[X]=27.9MB. Non-archive: lanl-scratch1 E[X]=8.9MB, pnnl-home E[X]=0.7MB, pdl1 E[X]=0.6MB, pdl2 E[X]=0.3MB]

Non-archive: 61% <8KB and 81% <32KB (avg. 700KB)
Archive: 28% <8KB and 36% <32KB (avg. 29.2MB)
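These summary statistics can be reproduced for any mounted tree. A minimal Python sketch, assuming a hypothetical mount point stands in for the HPC snapshot traces (the paper's numbers come from Dayal's published snapshots, not a live scan):

    import os
    import numpy as np

    def scan_file_sizes(root):
        """Collect file sizes (bytes) under root, for an empirical CDF."""
        sizes = []
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                try:
                    sizes.append(os.path.getsize(os.path.join(dirpath, name)))
                except OSError:
                    pass  # skip files that vanished or are unreadable
        return np.array(sizes)

    sizes = scan_file_sizes("/archive")  # hypothetical mount point
    print(f"avg {sizes.mean() / 2**20:.1f}MB, "
          f"<8KB: {(sizes < 8 * 2**10).mean():.0%}, "
          f"<32KB: {(sizes < 32 * 2**10).mean():.0%}")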




SLIDE 16

Fitting file size distribution 1

Gamma and Gen. Gamma distribution

f(x; θ, k, p) = (p/θ^k) x^(k−1) e^(−(x/θ)^p) / Γ(k/p),  for x ≥ 0 and θ, k, p > 0

Using gnls to find the scale (θ) and shape (k, p) parameters

Robustness of the fit

We want to account for possible variability in the dataset
Envelopes: capture the risks/errors around the typical file size distribution of the dataset
Confidence intervals: a lower bound and an upper bound, i.e., more large files or more small files

CI Bootstrapping

Bootstrap N CDFs F_i^B(x), each with parameters (θ_i^B, k_i^B, p_i^B), i = 1, ..., N
Sort the F_i^B(x) to find percentiles, i.e., 95th and 99th
Identify the lower bound at α/2 and the upper bound at 1 − α/2
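A minimal sketch of this percentile bootstrap, using scipy's gengamma in place of the authors' gnls fit (scipy's shape parameters map to the density above as a = k/p, c = p, scale = θ; the resample count and grid are illustrative):

    import numpy as np
    from scipy import stats

    def bootstrap_cdf_envelope(sizes, grid, n_boot=200, alpha=0.05, seed=None):
        """Percentile-bootstrap envelope around a fitted Gen. Gamma CDF.

        Returns the alpha/2 (lower) and 1 - alpha/2 (upper) percentiles of
        the N bootstrapped CDFs F_i^B(x), evaluated at each grid point.
        """
        rng = np.random.default_rng(seed)
        cdfs = np.empty((n_boot, len(grid)))
        for i in range(n_boot):
            resample = rng.choice(sizes, size=len(sizes), replace=True)
            a, c, loc, scale = stats.gengamma.fit(resample, floc=0)
            cdfs[i] = stats.gengamma.cdf(grid, a, c, loc=loc, scale=scale)
        lower = np.percentile(cdfs, 100 * alpha / 2, axis=0)
        upper = np.percentile(cdfs, 100 * (1 - alpha / 2), axis=0)
        return lower, upper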

SLIDE 17

Fitting file size distribution 2

Gamma and Gen. Gamma distribution

[Figure: file size CDF and CCDF with fitted Gamma and Gen. Gamma distributions and their 95%/99% confidence intervals]

Gamma: good fit at the body of the CDF, poor fit at the tail
Gen. Gamma: good fit at the body, good fit at the tail
Both distribution functions produced poor CIs, e.g., large probabilities of files >64MB: lower bound E[X]=1.7GB, upper bound E[X]=3.8MB


SLIDE 18

Fitting file size distribution 3

Spline distribution

[Figure: file size CDF and CCDF with the fitted Spline distribution and its 95%/99% confidence intervals]

A set of piecewise polynomials joined at 'knot' points makes up the overall function
We made sure to use a monotonically non-decreasing function
Using gnls to find the best coefficients for each piece
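A minimal stand-in for this step, using PCHIP interpolation rather than the authors' gnls coefficient fit; PCHIP preserves the monotonicity of the empirical CDF, so the result is guaranteed non-decreasing:

    import numpy as np
    from scipy.interpolate import PchipInterpolator

    def monotone_cdf_spline(sizes):
        """Monotone piecewise-cubic spline through the empirical file-size CDF."""
        sizes = np.sort(np.asarray(sizes))
        sizes = sizes[sizes > 0]                 # log scale below needs x > 0
        x = np.unique(sizes)
        ecdf = np.searchsorted(sizes, x, side="right") / len(sizes)
        # knots at the observed sizes on a log axis, as in the CDF plots
        return PchipInterpolator(np.log(x), ecdf)

    # F = monotone_cdf_spline(sizes); F(np.log(4 * 2**10)) approximates Pr(X <= 4KB)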



SLIDE 20

Generating a typical workload

Fileset

Convert the CDF to a PDF, using either 1) file counts or 2) volume
From the CDF F(x) = Pr(X ≤ x), recover point masses: Pr(X = x_1) = F(x_1), and Pr(X = x_i) = F(x_i) − F(x_(i−1)) for i ≥ 2, e.g., Pr(X = 4KB) = F(4KB) − F(2KB)
Produce 3 filesets (file size PDFs: lower-bound, median and upper-bound)
e.g., a fileset with C files (e.g., 50k), or a fileset with volume V (e.g., 2.4TB)

Example (FFSB tool)

size_weight 2KB 15322
size_weight 4KB 8609
size_weight 8KB 7132
...
size_weight 1GB 382
size_weight 2GB 176
size_weight 4GB 665
...
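A sketch of how such a profile can be generated from a fitted CDF using the point-mass conversion above. The bin list is an illustrative subset (a real profile enumerates every size step), and the counts are scaled to the target file count:

    import numpy as np

    # (FFSB size label, size in bytes) -- illustrative subset of the bins
    BINS = [("2KB", 2**11), ("4KB", 2**12), ("8KB", 2**13),
            ("1GB", 2**30), ("2GB", 2**31), ("4GB", 2**32)]

    def size_weights(cdf, bins=BINS, n_files=50_000):
        """Turn a fitted CDF F(x) = Pr(X <= x) into FFSB 'size_weight' lines.

        Pr(X = x_1) = F(x_1); Pr(X = x_i) = F(x_i) - F(x_{i-1}) for i >= 2.
        """
        F = np.array([cdf(size) for _, size in bins])
        probs = np.diff(np.concatenate(([0.0], F)))
        counts = np.round(probs / probs.sum() * n_files).astype(int)
        return [f"size_weight {label} {count}"
                for (label, _), count in zip(bins, counts)]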


SLIDE 21

Example of a fileset size



SLIDE 23

Performance Comparisons 1

Benchmarking

Archival vs. non-archival (empirical/model distributions)
Archival vs. fixed file sizes (e.g., 128KB, 1MB, 4MB)
Consistent filesets with increasing storage capacity utilization

Test setup

Intel Xeon 5630 CPU (2.53GHz), 18GB RAM, Intel X58/5520 chipset
12TB: 6×2TB WDC WD20EAR disks, LSI 2108 RAID controller (512MB)
RAID 0, write-through mode, 8K directIO
Filesystems: local ext4, and Ceph using btrfs and ext4
Ceph: 2 machines, one client (workload generator), one CMDS/CMON/COSD
Bonded 4×Gb/s Intel Ethernet NICs (iperf measurement: 3.4Gb/s)




SLIDE 26

Performance Comparisons 2

Step procedure

1. Filesets: 1%, 5%, 20% and 40% capacity utilizations
2. Sequential-write the entire fileset
3. Random-read from that fileset (128, 256 and 512 threads), min. 30 minutes
4. Repeat: recreate the partition and drop all caches between the steps (see the sketch below)
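A hypothetical outline of this loop; the device path, mount point and FFSB profile names are assumptions, not the authors' actual scripts (Linux and root privileges are required for mkfs and cache dropping):

    import subprocess

    FILESETS = ["120GB", "600GB", "2.4TB", "4.8TB"]  # 1%, 5%, 20%, 40% of 12TB
    READ_THREADS = [128, 256, 512]

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    def drop_caches():
        """Flush dirty pages, then drop page/dentry/inode caches."""
        run(["sync"])
        with open("/proc/sys/vm/drop_caches", "w") as f:
            f.write("3\n")

    for volume in FILESETS:
        run(["mkfs.ext4", "-F", "/dev/sdb1"])          # recreate the partition
        run(["mount", "/dev/sdb1", "/mnt/bench"])
        run(["ffsb", f"profiles/seqwrite-{volume}"])   # sequential-write the fileset
        for threads in READ_THREADS:
            drop_caches()
            run(["ffsb", f"profiles/randread-{volume}-{threads}t"])  # min. 30m runs
        run(["umount", "/mnt/bench"])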

Overall observations amongst setups

Sequential-write: 450–500MB/s on local ext4, 70–80MB/s on Ceph
No obvious performance differences across the writes, nor across the random-read thread counts

Random-read

Archival vs. non-archival: large performance difference
For example, at the 5% fileset (600GB): archival 39.5MB/s vs. non-archival 27.3MB/s (a 31% difference)



SLIDE 28

Result 1 (ext4)

Capacity      | Empirical archival distributions                          | Generalized Gamma       | Spline
utilization   | arsc-nanu1  arsc-seau2  arsc-seau1  pnnl-nwfs   avg.      | median   lower   upper  | median   lower   upper
E[X]          | 14.8MB      30.2MB      43.8MB      27.9MB      29.2MB    | 24.5MB   1.7GB   3.8MB  | 25.8MB   28.7MB  8.1MB
120GB (1%)    | 55.4        58.3        69.8        58.7        60.6      | 61.5     51.3    47.2   | 66.1     60.1    59.1
600GB (5%)    | 42.3        35.9        43.6        36.2        39.5      | 41.9     4.8     34.7   | 41.7     39.8    39.9
2.4TB (20%)   | 35.9        32.9        41.3        31.2        35.3      | 32.7     2.7     36.0   | 34.3     38.6    34.7
4.8TB (40%)   | 31.1        37.6        36.8        29.7        33.8      | 33.8     2.0     36.0   | 35.5     33.2    31.9

Table: Random-read MB/s of empirical archival distributions and fitted models

[Figure: % difference of the fitted models (Generalized Gamma and Spline; median/lower/upper) from the empirical average, at 120GB (1%), 600GB (5%), 2.4TB (20%) and 4.8TB (40%)]

Increasing capacity utilization decreases performance
The median filesets generally tracked the empirical archival distributions closely
Gen. Gamma's lower-bound performance deteriorates (see the sketch below)
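The difference plot can be reproduced from the table; a small sketch, assuming the plotted metric is the absolute % difference of each fitted model from the empirical average:

    import numpy as np

    # Random-read MB/s from the table above: empirical average vs. model medians
    utilization = ["120GB (1%)", "600GB (5%)", "2.4TB (20%)", "4.8TB (40%)"]
    empirical_avg = np.array([60.6, 39.5, 35.3, 33.8])
    models = {
        "Gen. Gamma median": np.array([61.5, 41.9, 32.7, 33.8]),
        "Spline median":     np.array([66.1, 41.7, 34.3, 35.5]),
    }

    for name, mbps in models.items():
        diff = np.abs(mbps - empirical_avg) / empirical_avg * 100
        print(name, dict(zip(utilization, diff.round(1))))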



SLIDE 30

Result 2 (ext4)

Cap.   | 128/256KB   1MB      2/4MB      32MB     64MB     2/4GB
util.  | (50/50%)    (100%)   (50/50%)   (100%)   (100%)   (50/50%)
1%     | 14.8        35.5     46.6       52.2     56.6     92.0
5%     | 12.7        22.9     34.3       45.6     47.5     19.2
20%    | 4.1         21.1     30.3       39.7     45.0     17.4
40%    | 3.2         24.8     22.2       38.0     39.7     11.7

Table: Random-read MB/s of fixed file size models

[Figure: % difference of the fixed file size models from the empirical average, at 120GB (1%), 600GB (5%), 2.4TB (20%) and 4.8TB (40%)]

Fixed file sizes are a poor representation (large % differences)
The closest is the 32MB fixed file size
Coincidental: other large file sizes (e.g., 64MB, 2/4GB) give different MB/s


SLIDE 31

Result 3 (Ceph)

[Figure: % difference from the empirical average on Ceph, for the fitted models (Generalized Gamma and Spline; median/lower/upper) and the fixed file size models, at ext4 120GB (1%), btrfs 120GB (1%), ext4 600GB (5%) and btrfs 600GB (5%); one configuration is N/A]

Similar results to local ext4
No obvious trend amongst the fixed file sizes, i.e., 2/4MB, 32MB and 64MB files



SLIDE 33

Summary

Result summary

Archival distributions are distinct and produce different performance results; we use this workload to design the archival storage system
Different disks/filesystems behave differently for a particular file size
Workloads were run for a long period and with a large volume
Upper- and lower-bound performance did not differ much:

  • small files do not 'show well'; need to test with much smaller filesets
  • possible to cut off at a certain file size, e.g., 64MB, and ignore the rest

Conclusion

Distribution-based file size benchmarking for archival storage
Robust envelopes considered for the observed empirical archives
Workloads generated, benchmarks run, performance measured
Accurate performance representation


SLIDE 34

Discussion

Assumptions

Usage 'time of day' (peak vs. off-peak periods)
Dynamic reads and writes, actual access patterns
Locality of the files and de-duplication


SLIDE 35

Thank you for attending

Thanks

Anonymous feedback from the reviewers

Q&A

dongjin.lee@auckland.ac.nz
michael.osullivan@auckland.ac.nz
cameron.walker@auckland.ac.nz
monique@mcs.st-and.ac.uk
http://twiki.esc.auckland.ac.nz/twiki/bin/view/NDSG/WebHome


SLIDE 36

Additional (Fileset % capacity utilization)

Fixed fileset volume vs. % fileset volume (capacity utilization), single disk vs. multiple disks:

Fixed volume: 10% of a 2TB disk gives a 200GB fileset; the same 200GB fileset over 10×2TB disks gives each disk only a 20GB workload (less per-disk workload).
% fileset volume: 10% of a 2TB disk gives a 200GB fileset; 10% of 10×2TB disks gives a 2TB fileset, so each disk receives a 200GB workload (similar per-disk workload).
