Architecting a 30 PB all-flash file system - PowerPoint PPT Presentation

Kirill Lozinskiy, Glenn K. Lockwood et al. - NERSC @ Berkeley Lab (LBNL)


SLIDE 1

Kirill Lozinskiy
Glenn K. Lockwood et al.

Architecting a 30 PB all-flash file system

SLIDE 2

NERSC @ Berkeley Lab (LBNL)

  • NERSC is the mission HPC computing center for the DOE Office of Science
  • HPC and data systems for the broad Office of Science community
  • 7,000 Users, 870 Projects, 700 Codes
  • >2,000 publications per year
  • 2015 Nobel prize in physics supported by NERSC systems and data archive
  • Diverse workload type and size: Biology, Environment, Materials, Chemistry, Geophysics, Nuclear Physics, Fusion Energy, Plasma Physics, Computing Research
  • New experimental and AI-driven workloads

Simulations at scale; Experimental & Observational Data Analysis at Scale

Photo Credit: CAMERA

SLIDE 3

NERSC's 2020 System, Perlmutter

  • Designed for both large scale simulation and data analysis from experimental facilities
  • Overall 3x to 4x capability of Cori
  • Includes both NVIDIA GPU-accelerated and AMD CPU-only nodes
  • Slingshot Interconnect
  • Single Tier, All-Flash Lustre scratch filesystem

SLIDE 4

NERSC-9's All-Flash Architecture

Diagram: CPU + GPU Nodes; All-Flash Lustre Storage; Logins, DTNs, Workflows. 4.0 TB/s to Lustre; >10 TB/s overall; Terabits/sec to Community File System and off platform.

Fast across many dimensions

  • 30 PB usable capacity
  • ≥ 4 TB/s sustained bandwidth
  • ≥ 7,000,000 IOPS
  • ≥ 3,200,000 file creates/sec

Integrated network, separate groups

  • Storage/logins remain up when compute is down
  • No LNET routers between compute and storage

SLIDE 5

Myth: only DOE can afford all-flash

Actuals from Fontana & Decad, Adv. Phys. 2018

SLIDE 6

Myth: only DOE can afford all-flash

Variables in the capacity-sizing model:

  • Minimum capacity of Perlmutter scratch
  • Sustained System Improvement (SSI): 3x - 4x output capacity over Cori
  • Data management policy: a measure of time between purge cycles, or the time after which files are eligible for purging
  • Desired capacity to be reclaimed per purge
  • Reference system capacity change
  • Change in time
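Read together, the legend above suggests a sizing relation of roughly the following form. This is a reconstruction from the listed variables, not an equation copied from the slide, and the symbol names are ours:

```latex
% Minimum scratch capacity, reconstructed from the variable legend above (symbols are ours):
%   SSI                 - sustained system improvement over the reference system (3x - 4x)
%   \Delta C / \Delta t - capacity growth rate observed on the reference (Cori) file system
%   t_purge             - time between purge cycles / age at which files become purge-eligible
%   f_reclaim           - fraction of total capacity each purge aims to reclaim
C_{\min} \;\geq\; \mathrm{SSI} \times \frac{\Delta C_{\mathrm{ref}}}{\Delta t} \times \frac{t_{\mathrm{purge}}}{f_{\mathrm{reclaim}}}
```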

SLIDE 7

Myth: only DOE can afford all-flash

Figure 1 - Distribution of daily growth of Cori's scratch. Minimum Perlmutter capacity is between 22 PB and 30 PB.


  • Mean daily growth projected for Perlmutter at 133 TB/day
  • Data retention policy for Perlmutter is atime > 28 days; OK to purge after that time
  • Each purge aims to remove or migrate 50% of the total capacity
  • Anticipated 3x to 4x sustained system improvement
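A quick numeric check of the sizing relation sketched on the previous slide, using only the figures above. Treating the quoted 133 TB/day as the reference growth rate that the 3x - 4x SSI multiplies is our assumption, but under it the arithmetic reproduces the 22 PB to 30 PB range stated on this slide:

```python
# Back-of-envelope check of the 22-30 PB capacity bound (a sketch, not the
# authors' exact model). Assumption: the 3x-4x SSI multiplies the quoted
# 133 TB/day growth rate; all other inputs are taken from the slide.
growth_tb_per_day = 133    # mean daily capacity growth (TB/day)
purge_age_days = 28        # files with atime > 28 days are purge-eligible
reclaim_fraction = 0.5     # each purge aims to remove/migrate 50% of capacity

for ssi in (3, 4):         # anticipated sustained system improvement
    min_capacity_pb = ssi * growth_tb_per_day * purge_age_days / reclaim_fraction / 1000
    print(f"SSI {ssi}x -> minimum capacity ~{min_capacity_pb:.1f} PB")
# Prints roughly 22.3 PB and 29.8 PB, consistent with the 22-30 PB range above.
```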

SLIDE 8

Myth: Need high-endurance drives for HPC


Variables in the endurance model:

  • Drive Writes Per Day (DWPD) required for Perlmutter
  • Sustained System Improvement (SSI): 3x - 4x output capacity over Cori
  • File System Writes Per Day (FSWPD) of Cori's total write volume
  • Perlmutter parity blocks and data blocks
  • Write Amplification Factor (WAF)
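Taken together, the legend implies an endurance model along these lines; again a reconstruction, with symbol names of our choosing:

```latex
% Drive writes per day required on Perlmutter, reconstructed from the legend above:
%   SSI    - sustained system improvement over Cori (3x - 4x)
%   FSWPD  - file-system writes per day, measured from Cori's total write volume
%   D, P   - data and parity blocks per parity group (e.g. 8+2 or 10+2)
%   WAF    - write amplification factor measured on the drives
\mathrm{DWPD} \;=\; \mathrm{SSI} \times \mathrm{FSWPD} \times \frac{D + P}{D} \times \mathrm{WAF}
```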

SLIDE 9

Myth: Need high-endurance drives for HPC. Measurements from Cori's burst buffer after 3.4 years in service:

  • WAF bottom quartile: 2.68
  • WAF upper quartile: 3.17

SLIDE 10

Myth: Need high-endurance drives for HPC

  • SSI: 3x – 4x
  • Mean FSWPD on 30.5 PB file system: 0.024
  • D+P = 8+2 or 10+2
  • WAF = 2.68 – 3.17
  • DWPD needed: 0.23 – 0.38
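Plugging the bullet values into the endurance relation reconstructed two slides back reproduces the quoted range, taking the gentlest and harshest combinations of the inputs as the bounds:

```python
# Endurance check (a sketch): DWPD = SSI * FSWPD * (D+P)/D * WAF,
# evaluated at the gentlest and harshest combinations of the ranges above.
fswpd = 0.024                                  # mean file-system writes/day on the 30.5 PB system

low = 3 * fswpd * (10 + 2) / 10 * 2.68         # SSI 3x, 10+2 parity, WAF 2.68
high = 4 * fswpd * (8 + 2) / 8 * 3.17          # SSI 4x, 8+2 parity, WAF 3.17

print(f"DWPD needed: {low:.2f} - {high:.2f}")  # ~0.23 - 0.38
```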

SLIDE 11

Myth: Lustre is terrible


Metric           | New all-flash FS #1           | New all-flash FS #2             | Lustre*
Read bandwidth   | 1 TB/s                        | 4 TB/s                          | 4 TB/s
Write bandwidth  | 0.75 TB/s                     | 1.5 TB/s                        | 4 TB/s
Read IOPS        | 15 MIOPS                      | 300 MIOPS                       | 7 MIOPS
Write IOPS       | 3 MIOPS                       | 30 MIOPS                        | 7 MIOPS
Usable capacity  | 40 PB                         | 30 PB                           | 30 PB
Maturity         | GA < 2 years                  | GA < 2 years                    | GA > 16 years
Openness         | Open protocols, closed source | Closed protocols, closed source | GPL

* All Lustre numbers are lower bounds. Other numbers derived from reference architectures. ALL NUMBERS ARE SPECULATIVE.

SLIDE 12

Metadata Configuration

SLIDE 13

Metadata Configuration

Figure 4 - Probability distribution of file size and file mass on Cori's file system in January 2019. 95% of the files comprise only 5% of the capacity used. MDT capacity for a new system is a function of the expected file size distribution.

  • Average file size alone is not enough because HPC file size distribution skews towards small files
  • Small changes to the mean file size could represent a significant change to where the optimal DOM size threshold should be

SLIDE 14

Metadata Configuration

Figure 5 - Probability distribution of inode sizes on Cori's file system in January 2019. MDT Capacity Required for Inodes.

  • Lustre reserves 4 KiB of MDT capacity per inode
  • BUT directories with millions of files are significantly larger
  • Most extreme case is 1 GiB in size for 8 million child inodes
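For scale, that extreme case works out to roughly 134 bytes of directory data per child entry (about 128 bytes if "8 million" is read as 8 x 2^20):

```latex
% Average directory-entry footprint in the most extreme observed case:
\frac{1\,\mathrm{GiB}}{8 \times 10^{6}\ \text{child inodes}} \;\approx\; 134\ \text{bytes per entry}
```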

SLIDE 15

Metadata Configuration

Figure 6 - Required MDT capacity as a function of DOM threshold. The shaded area is bounded by the minimum and maximum estimated requirements dictated by the DOM component and the inode capacity component of MDT capacity.

  • At a very small DOM threshold, the large number of small files does not consume much MDT space
  • At a very large DOM threshold, the great majority of files are stored entirely within the MDT, and only a small number of very large files dictates a higher MDT capacity
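A minimal sketch of the sizing logic behind Figure 6, assuming the simple model that each file consumes a fixed 4 KiB inode on the MDT plus a Data-on-MDT component of min(file size, DOM threshold). The file-size histogram below is an illustrative stand-in, not Cori's measured distribution:

```python
# Sketch of required MDT capacity as a function of the DOM threshold.
# Model (an assumption matching the qualitative argument above):
#   MDT bytes ~= n_files * 4 KiB (inode reservation)
#              + sum over files of min(file_size, dom_threshold)  (DOM data)
KIB = 1024
INODE_BYTES = 4 * KIB  # Lustre reserves 4 KiB of MDT capacity per inode

# (file_size_bytes, file_count) buckets -- hypothetical, skewed toward small
# files the way HPC file systems tend to be; NOT Cori's real distribution.
size_histogram = [
    (4 * KIB, 800_000_000),
    (64 * KIB, 150_000_000),
    (1024 * KIB, 40_000_000),
    (1024 * 1024 * KIB, 2_000_000),   # 1 GiB files
]

def mdt_bytes_required(dom_threshold: int) -> int:
    """Estimate MDT capacity (bytes) needed for a given DOM threshold (bytes)."""
    total = 0
    for size, count in size_histogram:
        dom_component = min(size, dom_threshold)  # file data kept on the MDT
        total += count * (INODE_BYTES + dom_component)
    return total

for threshold_kib in (0, 64, 1024):
    tib = mdt_bytes_required(threshold_kib * KIB) / KIB**4
    print(f"DOM threshold {threshold_kib:>5} KiB -> ~{tib:,.1f} TiB of MDT")
```

Sweeping the threshold over the real measured distribution is what produces the curve in Figure 6: small thresholds are dominated by the inode term, large thresholds by the data-on-MDT term.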


Figure annotations: minimum DOM size 64 KiB; area of interest.

SLIDE 16

Thank You

(and we're hiring)