Architecting a 30 PB all-flash file system - PowerPoint PPT Presentation

Kirill Lozinskiy, Glenn K. Lockwood et al. - NERSC @ Berkeley Lab (LBNL)


SLIDE 1

Kirill Lozinskiy
Glenn K. Lockwood et al.

Architecting a 30 PB all-flash file system

SLIDE 2

NERSC @ Berkeley Lab (LBNL)

  • NERSC is the mission HPC computing center for the DOE Office of Science
  • HPC and data systems for the broad Office of Science community
  • 7,000 Users, 870 Projects, 700 Codes
  • >2,000 publications per year
  • 2015 Nobel prize in physics supported by NERSC systems and data archive
  • Diverse workload type and size: Biology, Environment, Materials, Chemistry, Geophysics, Nuclear Physics, Fusion Energy, Plasma Physics, Computing Research
  • New experimental and AI-driven workloads

Simulations at scale; Experimental & Observational Data Analysis at Scale

Photo Credit: CAMERA

SLIDE 3

NERSC's 2020 System, Perlmutter

  • Designed for both large scale simulation and data analysis from experimental facilities
  • Overall 3x to 4x capability of Cori
  • Includes both NVIDIA GPU-accelerated and AMD CPU-only nodes
  • Slingshot Interconnect
  • Single Tier, All-Flash Lustre scratch filesystem

SLIDE 4

NERSC-9's All-Flash Architecture

Diagram: CPU + GPU Nodes; All-Flash Lustre Storage; Logins, DTNs, Workflows. 4.0 TB/s to Lustre; >10 TB/s overall; Terabits/sec to Community File System and off platform.

Fast across many dimensions

  • 30 PB usable capacity
  • ≥ 4 TB/s sustained bandwidth
  • ≥ 7,000,000 IOPS
  • ≥ 3,200,000 file creates/sec

Integrated network, separate groups

  • Storage/logins remain up when compute is down
  • No LNET routers between compute and storage

SLIDE 5

Myth: only DOE can afford all-flash

Actuals from Fontana & Decad, Adv. Phys. 2018

SLIDE 6

Myth: only DOE can afford all-flash

Variables in the capacity-sizing model:

  • Minimum capacity of Perlmutter scratch
  • Sustained System Improvement (SSI): 3x - 4x output capacity over Cori
  • Data management policy: a measure of time between purge cycles, or the time after which files are eligible for purging
  • Desired capacity to be reclaimed per purge
  • Reference system capacity change
  • Change in time
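Read together, the legend above suggests a sizing relation of roughly the following form. This is a reconstruction from the listed variables, not an equation copied from the slide, and the symbol names are ours:

```latex
% Minimum scratch capacity, reconstructed from the variable legend above (symbols are ours):
%   SSI                 - sustained system improvement over the reference system (3x - 4x)
%   \Delta C / \Delta t - capacity growth rate observed on the reference (Cori) file system
%   t_purge             - time between purge cycles / age at which files become purge-eligible
%   f_reclaim           - fraction of total capacity each purge aims to reclaim
C_{\min} \;\geq\; \mathrm{SSI} \times \frac{\Delta C_{\mathrm{ref}}}{\Delta t} \times \frac{t_{\mathrm{purge}}}{f_{\mathrm{reclaim}}}
```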

SLIDE 7

Myth: only DOE can afford all-flash

Figure 1 - Distribution of daily growth of Cori's scratch. Minimum Perlmutter capacity is between 22 PB and 30 PB.


  • Mean daily growth projected for Perlmutter at 133 TB/day
  • Data retention policy for Perlmutter is atime > 28 days; OK to purge after that time
  • Each purge aims to remove or migrate 50% of the total capacity
  • Anticipated 3x to 4x sustained system improvement
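A quick numeric check of the sizing relation sketched on the previous slide, using only the figures above. Treating the quoted 133 TB/day as the reference growth rate that the 3x - 4x SSI multiplies is our assumption, but under it the arithmetic reproduces the 22 PB to 30 PB range stated on this slide:

```python
# Back-of-envelope check of the 22-30 PB capacity bound (a sketch, not the
# authors' exact model). Assumption: the 3x-4x SSI multiplies the quoted
# 133 TB/day growth rate; all other inputs are taken from the slide.
growth_tb_per_day = 133    # mean daily capacity growth (TB/day)
purge_age_days = 28        # files with atime > 28 days are purge-eligible
reclaim_fraction = 0.5     # each purge aims to remove/migrate 50% of capacity

for ssi in (3, 4):         # anticipated sustained system improvement
    min_capacity_pb = ssi * growth_tb_per_day * purge_age_days / reclaim_fraction / 1000
    print(f"SSI {ssi}x -> minimum capacity ~{min_capacity_pb:.1f} PB")
# Prints roughly 22.3 PB and 29.8 PB, consistent with the 22-30 PB range above.
```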

SLIDE 8

Myth: Need high-endurance drives for HPC


Variables in the endurance model:

  • Drive Writes Per Day (DWPD) required for Perlmutter
  • Sustained System Improvement (SSI): 3x - 4x output capacity over Cori
  • File System Writes Per Day (FSWPD) of Cori's total write volume
  • Perlmutter parity blocks and data blocks
  • Write Amplification Factor (WAF)
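Taken together, the legend implies an endurance model along these lines; again a reconstruction, with symbol names of our choosing:

```latex
% Drive writes per day required on Perlmutter, reconstructed from the legend above:
%   SSI    - sustained system improvement over Cori (3x - 4x)
%   FSWPD  - file-system writes per day, measured from Cori's total write volume
%   D, P   - data and parity blocks per parity group (e.g. 8+2 or 10+2)
%   WAF    - write amplification factor measured on the drives
\mathrm{DWPD} \;=\; \mathrm{SSI} \times \mathrm{FSWPD} \times \frac{D + P}{D} \times \mathrm{WAF}
```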

SLIDE 9

Myth: Need high-endurance drives for HPC. Measurements from Cori's burst buffer after 3.4 years in service:

  • WAF bottom quartile: 2.68
  • WAF upper quartile: 3.17

SLIDE 10

Myth: Need high-endurance drives for HPC

  • SSI: 3x – 4x
  • Mean FSWPD on 30.5 PB file system: 0.024
  • D+P = 8+2 or 10+2
  • WAF = 2.68 – 3.17
  • DWPD needed: 0.23 – 0.38
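Plugging the bullet values into the endurance relation reconstructed two slides back reproduces the quoted range, taking the gentlest and harshest combinations of the inputs as the bounds:

```python
# Endurance check (a sketch): DWPD = SSI * FSWPD * (D+P)/D * WAF,
# evaluated at the gentlest and harshest combinations of the ranges above.
fswpd = 0.024                                  # mean file-system writes/day on the 30.5 PB system

low = 3 * fswpd * (10 + 2) / 10 * 2.68         # SSI 3x, 10+2 parity, WAF 2.68
high = 4 * fswpd * (8 + 2) / 8 * 3.17          # SSI 4x, 8+2 parity, WAF 3.17

print(f"DWPD needed: {low:.2f} - {high:.2f}")  # ~0.23 - 0.38
```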

SLIDE 11

Myth: Lustre is terrible


Metric           | New all-flash FS #1           | New all-flash FS #2             | Lustre*
Read bandwidth   | 1 TB/s                        | 4 TB/s                          | 4 TB/s
Write bandwidth  | 0.75 TB/s                     | 1.5 TB/s                        | 4 TB/s
Read IOPS        | 15 MIOPS                      | 300 MIOPS                       | 7 MIOPS
Write IOPS       | 3 MIOPS                       | 30 MIOPS                        | 7 MIOPS
Usable capacity  | 40 PB                         | 30 PB                           | 30 PB
Maturity         | GA < 2 years                  | GA < 2 years                    | GA > 16 years
Openness         | Open protocols, closed source | Closed protocols, closed source | GPL

* All Lustre numbers are lower bounds. Other numbers derived from reference architectures. ALL NUMBERS ARE SPECULATIVE.

SLIDE 12

Metadata Configuration

SLIDE 13

Metadata Configuration

Figure 4 - Probability distribution of file size and file mass on Cori's file system in January 2019. 95% of the files comprise only 5% of the capacity used. MDT capacity for a new system is a function of the expected file size distribution.

  • Average file size alone is not enough because HPC file size distribution skews towards small files
  • Small changes to the mean file size could represent a significant change to where the optimal DOM size threshold should be

SLIDE 14

Metadata Configuration

Figure 5 - Probability distribution of inode sizes on Cori's file system in January 2019. MDT Capacity Required for Inodes.

  • Lustre reserves 4 KiB of MDT capacity per inode
  • BUT directories with millions of files are significantly larger
  • Most extreme case is 1 GiB in size for 8 million child inodes
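For scale, that extreme case works out to roughly 134 bytes of directory data per child entry (about 128 bytes if "8 million" is read as 8 x 2^20):

```latex
% Average directory-entry footprint in the most extreme observed case:
\frac{1\,\mathrm{GiB}}{8 \times 10^{6}\ \text{child inodes}} \;\approx\; 134\ \text{bytes per entry}
```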

SLIDE 15

Metadata Configuration

Figure 6 - Required MDT capacity as a function of DOM threshold. The shaded area is bounded by the minimum and maximum estimated requirements dictated by the DOM component and the inode capacity component of MDT capacity.

  • At a very small DOM threshold, the large number of small files does not consume much MDT space
  • At a very large DOM threshold, the great majority of files are stored entirely within the MDT, and only a small number of very large files dictates a higher MDT capacity
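A minimal sketch of the sizing logic behind Figure 6, assuming the simple model that each file consumes a fixed 4 KiB inode on the MDT plus a Data-on-MDT component of min(file size, DOM threshold). The file-size histogram below is an illustrative stand-in, not Cori's measured distribution:

```python
# Sketch of required MDT capacity as a function of the DOM threshold.
# Model (an assumption matching the qualitative argument above):
#   MDT bytes ~= n_files * 4 KiB (inode reservation)
#              + sum over files of min(file_size, dom_threshold)  (DOM data)
KIB = 1024
INODE_BYTES = 4 * KIB  # Lustre reserves 4 KiB of MDT capacity per inode

# (file_size_bytes, file_count) buckets -- hypothetical, skewed toward small
# files the way HPC file systems tend to be; NOT Cori's real distribution.
size_histogram = [
    (4 * KIB, 800_000_000),
    (64 * KIB, 150_000_000),
    (1024 * KIB, 40_000_000),
    (1024 * 1024 * KIB, 2_000_000),   # 1 GiB files
]

def mdt_bytes_required(dom_threshold: int) -> int:
    """Estimate MDT capacity (bytes) needed for a given DOM threshold (bytes)."""
    total = 0
    for size, count in size_histogram:
        dom_component = min(size, dom_threshold)  # file data kept on the MDT
        total += count * (INODE_BYTES + dom_component)
    return total

for threshold_kib in (0, 64, 1024):
    tib = mdt_bytes_required(threshold_kib * KIB) / KIB**4
    print(f"DOM threshold {threshold_kib:>5} KiB -> ~{tib:,.1f} TiB of MDT")
```

Sweeping the threshold over the real measured distribution is what produces the curve in Figure 6: small thresholds are dominated by the inode term, large thresholds by the data-on-MDT term.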


Figure annotations: minimum DOM size 64 KiB; area of interest.

SLIDE 16

Thank You

(and we're hiring)