Storage Lessons from HPC: Extreme Scale Computing Driving Economical Storage Solutions into Your IT Environments

SLIDE 1

Storage Lessons from HPC: Extreme Scale Computing Driving Economical Storage Solutions into Your IT Environments Gary Grider HPC Division Leader, LANL/US DOE Mar 2017

LA-UR-16-26379

SLIDE 2

Los Alamos

SLIDE 3

Eight Decades of Production Weapons Computing to Keep the Nation Safe

CM-2, IBM Stretch, CDC, Cray 1, Cray X/Y, Maniac, CM-5, SGI Blue Mountain, DEC/HP Q, IBM Cell Roadrunner, Cray XE Cielo, Cray Intel KNL Trinity, D-Wave Ising, Crossroads

SLIDE 4

LANL HPC History Project (50k artifacts), joint work with the U Minnesota Babbage Institute

SLIDE 5

Some Products You May Not Realize Were Either Funded or Heavily Influenced by DOE HPC

DataWarp

SLIDE 6

Ceph begins

SLIDE 7

The Promise of Parallel POSIX Metadata Service Circa 2001

SLIDE 8

And more

IBM Photostore Hydra – the first Storage Area Network Quantum Key Distribution Products

SLIDE 9

Extreme HPC Background

SLIDE 10

Simple View of our Computing Environment

SLIDE 11

Current Largest Machine: Trinity

  • Haswell and KNL
  • 20,000 Nodes
  • Few Million Cores
  • 2 PByte DRAM
  • 4 PByte NAND Burst Buffer ~4 TByte/sec
  • 100 PByte Scratch PMR Disk File System ~1.2 TByte/sec
  • 30 PByte/year Sitewide SMR Disk Campaign Store ~1 GByte/sec/PByte (30 GByte/sec currently)
  • 60 PByte Sitewide Parallel Tape Archive ~3 GByte/sec

SLIDE 12

A not-so-simple picture of our environment

  • 30-60 MW
  • Single machines in the 10k-node range and > 18 MW
  • Single jobs that run across 1M cores for months
  • Soccer fields of gear in 3 buildings
  • 20 semis of gear this summer alone

[Photo: pipes for Trinity cooling]

SLIDE 13

HPC Storage Area Network circa 2011; today the high end is a few TB/sec

Current Storage Area Network is a few TBytes/sec, mostly IB, some 40/100GbE

SLIDE 14

HPC IO Patterns

  • Million files inserted into a single directory at the same time
  • Millions of writers into the same file at the same time (sketched in code after this list)
  • Jobs from 1 core to N-Million cores
  • Files from 0 bytes to N-PBytes
  • Workflows from hours to a year (yes, a year on a million cores using a PB of DRAM)
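To make the "millions of writers into the same file" pattern concrete, here is a minimal single-process stand-in for the N-to-1 strided write: each rank owns a non-overlapping block of one shared file, addressed at rank*blocksize. The file name, rank count, and block size are illustrative only; at scale this is done with MPI-IO across separate processes rather than a loop.

```c
/* Toy N-to-1 strided write: each "rank" pwrite()s its own block of a shared
 * file, so no two writers overlap.  Plain POSIX stand-in for what MPI-IO
 * does at scale; sizes and names are illustrative. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NRANKS    4            /* millions, on a real machine */
#define BLOCKSIZE (1 << 20)    /* 1 MiB per rank */

int main(void) {
    int fd = open("shared.dat", O_CREAT | O_WRONLY, 0644);
    if (fd < 0) { perror("open"); return 1; }
    char *buf = malloc(BLOCKSIZE);
    for (int rank = 0; rank < NRANKS; rank++) {  /* separate processes in reality */
        memset(buf, 'A' + rank, BLOCKSIZE);
        off_t off = (off_t)rank * BLOCKSIZE;     /* rank-strided, non-overlapping */
        if (pwrite(fd, buf, BLOCKSIZE, off) != BLOCKSIZE) perror("pwrite");
    }
    free(buf);
    close(fd);
    return 0;
}
```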

SLIDE 15

Because Non-Compute Costs Are Rising as a Share of TCO, Workflows Are Necessary to Specify

SLIDE 16

Workflow Taxonomy from APEX Procurement: A Simulation Pipeline

SLIDE 17

Enough with the HPC background. How about some modern storage economics?
SLIDE 18

Economics have shaped our world: the beginning of storage layer proliferation (2009)

  • Economic modeling for a large burst of data from memory shows bandwidth/capacity better matched for solid-state storage near the compute nodes

[Chart: projected hardware/media costs, 2012-2025 (3 mem/mo, 10% FS), broken out by new servers, new disk, new cartridges, new drives, new robots]

  • Economic modeling for archive shows bandwidth/capacity better matched for disk
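The shape of both modeling results can be shown with a toy calculation: a tier must be sized to satisfy both a bandwidth target and a capacity target, so the cheaper medium depends on which constraint binds. All device capacities, bandwidths, and prices below are hypothetical placeholders, not the inputs to the actual LANL models.

```c
/* Toy media-sizing model: a tier must satisfy BOTH a capacity target and a
 * bandwidth target, so the device count is the max of the two requirements.
 * All device characteristics and prices are hypothetical placeholders. */
#include <stdio.h>

typedef struct {
    const char *name;
    double gb;     /* capacity per device, GB */
    double mbs;    /* sustained bandwidth per device, MB/s */
    double usd;    /* unit cost, USD (hypothetical) */
} media;

static double cost(media m, double need_gb, double need_mbs) {
    double by_cap = need_gb  / m.gb;
    double by_bw  = need_mbs / m.mbs;
    double n = by_cap > by_bw ? by_cap : by_bw;  /* satisfy both constraints */
    return n * m.usd;
}

int main(void) {
    media disk  = { "disk",  8000.0,  150.0, 250.0 };
    media flash = { "flash", 2000.0, 1500.0, 600.0 };

    /* Burst tier: absorb a 2 PB memory image in ~10 min (bandwidth-bound). */
    double burst_gb = 2e6, burst_mbs = 2e9 / 600.0;
    /* Archive tier: hold 60 PB, drain at ~3 GB/s (capacity-bound). */
    double arch_gb = 60e6, arch_mbs = 3000.0;

    printf("burst:   disk $%.0f  flash $%.0f\n",
           cost(disk, burst_gb, burst_mbs), cost(flash, burst_gb, burst_mbs));
    printf("archive: disk $%.0f  flash $%.0f\n",
           cost(disk, arch_gb, arch_mbs), cost(flash, arch_gb, arch_mbs));
    return 0;
}
```

With these made-up numbers, flash wins the bandwidth-bound burst tier and disk wins the capacity-bound archive, which is the pattern the modeling showed.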

SLIDE 19

What are all these storage layers? Why do we need all these storage layers?

HPC before Trinity: Memory → Parallel File System → Archive
HPC after Trinity: Memory → Burst Buffer → Parallel File System → Campaign Storage → Archive

  • Memory (DRAM): 1-2 PB/sec; residence hours; overwritten continuously
  • Burst Buffer: 4-6 TB/sec; residence hours; overwritten in hours
  • Parallel File System (Lustre): 1-2 TB/sec; residence days/weeks; flushed in weeks
  • Campaign Storage: 100-300 GB/sec; residence months to a year; flushed in months to a year
  • Archive (HPSS parallel tape): 10s of GB/sec; residence forever

  • Why BB: Economics (disk bw/iops too expensive)
  • Why Campaign: Economics (PFS RAID too expensive, PFS solution too rich in function, PFS metadata not scalable enough, PFS designed for scratch use not years of residency, Archive BW too expensive/difficult, Archive metadata too slow)

SLIDE 20

The Hoopla Parade circa 2014

DataWarp

SLIDE 21

Isn't that too many layers just for storage?

  • If the Burst Buffer does its job very well (and indications are capacity of in-system NV will grow radically) and Campaign Storage works out well (leveraging cloud), do we need a parallel file system anymore, or an archive?
  • Maybe just a bw/iops tier and a capacity tier.
  • Too soon to say; it seems feasible longer term.

Today: Memory → Burst Buffer → Parallel File System (PFS) → Campaign Storage → Archive
Possible future: Memory → IOPS/BW Tier → Capacity Tier

Diagram courtesy of John Bent, EMC

Factoids (times are changing!)

  • LANL HPSS = 53 PB and 543 M files
  • Trinity: 2 PB memory, 4 PB flash (11% of HPSS), and 80 PB PFS (150% of HPSS)
  • Crossroads may have 5-10 PB memory, 40 PB solid state (100% of HPSS)

We would never have contemplated more in-system storage than our archive a few years ago.

SLIDE 22

I doubt this movement to solid state for BW/IOPS (hot/warm) and SMR/HAMR/etc. capacity-oriented disk (cool/cold) is unique to HPC.

Ok – we need a capacity tier: Campaign Storage

  • Billions of files per directory
  • Trillions of files total
  • Files from 1 byte to 100 PB
  • Multiple writers into one file

What now?

SLIDE 23

Won’t cloud technology provide the capacity solution?

  • Erasure to utilize low-cost hardware
  • Object to enable massive scale
  • Simple-minded interface: get, put, delete
  • Problem solved -- NOT
  • Works great for apps that are newly written to use this interface
  • Doesn't work well for people; people need folders and rename and … (a toy illustration of the rename problem follows this list)
  • Doesn't work for the $trillions of apps out there that expect some modest name space capability (parts of POSIX)
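To see why a bare get/put/delete key space is hard on people, consider directory rename. A toy sketch follows, with an invented key layout rather than any particular object store's API:

```c
/* Why "get/put/delete" alone is hard on people: over a flat key space,
 * renaming a "folder" means rewriting every key underneath it.  Toy
 * illustration; the key layout and helper names are invented. */
#include <stdio.h>
#include <string.h>

#define NKEYS 4
static char keys[NKEYS][128] = {
    "proj/run1/input.dat",
    "proj/run1/out/step0001",
    "proj/run1/out/step0002",
    "proj/run2/input.dat",
};

/* "Rename a directory" = rewrite the prefix of every matching key:
 * O(files under the directory), vs O(1) for a POSIX rename(2). */
static int rename_dir(const char *oldp, const char *newp) {
    int rewritten = 0;
    size_t n = strlen(oldp);
    for (int i = 0; i < NKEYS; i++) {
        if (strncmp(keys[i], oldp, n) == 0) {
            char tmp[128];
            snprintf(tmp, sizeof tmp, "%s%s", newp, keys[i] + n);
            strcpy(keys[i], tmp);  /* in a real store: copy object, delete old */
            rewritten++;
        }
    }
    return rewritten;
}

int main(void) {
    int n = rename_dir("proj/run1/", "proj/run1-final/");
    printf("rewrote %d keys\n", n);
    for (int i = 0; i < NKEYS; i++) puts(keys[i]);
    return 0;
}
```

A POSIX rename of a directory is one metadata operation; over a flat key space its cost is proportional to everything underneath it.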

SLIDE 24

Enter MarFS: The Sea of Data

SLIDE 25

How about a Scalable Near-POSIX Name Space over Cloud style Object Erasure?

  • Best of both worlds
  • Object systems provide massive scaling and efficient erasure techniques
  • Friendly to applications, not to people; people need a name space
  • Huge economic appeal (erasure enables use of inexpensive storage)
  • POSIX name space is powerful but has issues scaling
  • The challenges
  • Mismatch of POSIX and Object metadata, security, read/write semantics, efficient object/file sizes
  • No update in place with Objects
  • How do we scale a POSIX name space to trillions of files/directories?
SLIDE 26

Won’t someone else do it, PLEASE?

  • There is evidence others see the need, but no magic bullets yet (partial list):
  • Cleversafe/Scality/EMC ViPR/Ceph/Swift etc. are attempting multi-personality data lakes over erasure objects; all are young and assume update in place for POSIX
  • GlusterFS is probably the closest thing to MarFS. Gluster is aimed more at the enterprise and midrange HPC, less at extreme HPC. GlusterFS is one way to unify file and object systems; MarFS is another, aiming at different uses
  • Ceph moved away from file-system-like access at the time of this analysis
  • General Atomics Nirvana and Storage Resource Broker/iRODS are optimized for WAN and HSM metadata rates. There are some capabilities for putting POSIX files over objects, but these methods are largely via NFS or other methods that try to mimic full file system semantics including update in place. They are not designed for massive parallelism in a single file, etc.
  • EMC Maginatics, but it is in its infancy and isn't a full solution to our problem yet
  • Camlistore appears to be targeted at personal storage
  • Bridgestore is a POSIX name space over objects, but they put their metadata in a flat space, so rename of a directory is painful
  • Avere over objects is focused on NFS, so N-to-1 is a non-starter
  • HPSS or SamQFS or a classic HSM? Metadata rate designs are way too low
  • HDFS metadata doesn't scale well
SLIDE 27

MarFS

What it is

  • 100-1000 GB/sec, Exabytes, billions of files in a directory, trillions of files total
  • Near-POSIX global scalable name space over many POSIX and non-POSIX data repositories (scalable object systems - CDMI, S3, etc.)
  • (Scality, EMC ECS, all the way to simple erasure over ZFS's)
  • It is a small amount of code (C/C++/scripts)
  • A small Linux FUSE
  • A pretty small parallel batch copy/sync/compare utility
  • A moderate-sized library that both FUSE and the batch utilities call
  • Data movement scales just like many scalable object systems
  • Metadata scales like NxM POSIX name spaces, both across the tree and within a single directory
  • It is friendly to object systems by
  • Spreading very large files across many objects (offset arithmetic sketched after this list)
  • Packing many small files into one large data object

What it isn't

  • No update in place! It's not a pure file system; overwrites are fine, but no seeking and writing.
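As noted above, here is a sketch of the offset arithmetic implied by spreading a large file across fixed-size objects (chunksize=256M in the xattrs on the slides that follow). Names and layout are illustrative, not MarFS's actual code:

```c
/* Offset arithmetic for "spreading very large files across many objects"
 * with a fixed chunksize.  Illustrative only. */
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

#define CHUNKSIZE (256ULL * 1024 * 1024)   /* chunksize=256M from the xattrs */

int main(void) {
    uint64_t logical = 5ULL * 1024 * 1024 * 1024 + 4096;  /* read near 5 GiB */
    uint64_t obj_index  = logical / CHUNKSIZE;   /* which object: Obj00N.<index> */
    uint64_t obj_offset = logical % CHUNKSIZE;   /* byte offset inside it */
    printf("logical %" PRIu64 " -> object %" PRIu64 ", offset %" PRIu64 "\n",
           logical, obj_index, obj_offset);
    /* For a packed file the mapping inverts: many small files share one
     * object, and each file's xattr records its own objoffs within it. */
    return 0;
}
```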

SLIDE 28

MarFS Scaling

Striping across 1 to X object repos. Scaling test on our retired Cielo machine: 835M file inserts/sec; stat of a single file < 1 millisecond; > 1 trillion files in the same directory.

SLIDE 29

Simple MarFS Deployment

[Diagram: users do data movement on File Transfer Agents (FTAs). Metadata servers are GPFS servers (NSD) on dual-copy RAIDed enterprise-class HDD or SSD; data repos are object metadata/data servers. Batch and interactive FTAs each have the enterprise file systems and MarFS mounted; interactive and batch FTAs are separated for object security and performance reasons.]

SLIDE 30

MarFS Internals Overview: Uni-File

[Diagram: /MarFS top-level namespace aggregation over metadata trees /GPFS-MarFS-md1 … /GPFS-MarFS-mdN, each holding dirs (Dir1.1, Dir2.1) and a trashdir. A uni-file carries attrs (uid, gid, mode, size, dates, etc.) and xattrs: objid repo=1, id=Obj001, objoffs=0, chunksize=256M, Objtype=Uni, NumObj=1, etc. Data lives as object Obj001 in one of Object System 1…X.]

SLIDE 31

MarFS Internals Overview: Multi-File (striped object systems)

[Diagram: same namespace layout. A multi-file's xattrs read: objid repo=S, id=Obj002., objoffs=0, chunksize=256M, ObjType=Multi, NumObj=2, etc. Data is striped as objects Obj002.1 and Obj002.2 across object systems.]

SLIDE 32

MarFS Internals Overview: Packed-File

[Diagram: same namespace layout. A packed file's xattrs read: objid repo=1, id=Obj003, objoffs=4096, chunksize=256M, Objtype=Packed, NumObj=1, Obj=4 of 5, etc. Several small files share the one data object Obj003.]
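Pulling the three layouts together, the xattrs on slides 30-32 suggest a per-file record roughly like the struct below. The struct and demo values are reconstructed from the slide text; the real MarFS encoding may differ.

```c
/* Per-file xattr fields shown on slides 30-32 (uni, multi, packed),
 * reconstructed as a struct.  Illustrative, not MarFS's on-disk encoding. */
#include <stdio.h>

typedef enum { OBJ_UNI, OBJ_MULTI, OBJ_PACKED } objtype;

typedef struct {
    int       repo;       /* which data repo holds the object(s)            */
    char      id[32];     /* base object name, e.g. "Obj003"                */
    long long objoffs;    /* packed files: this file's offset in the object */
    long long chunksize;  /* multi files: bytes per object, e.g. 256M       */
    objtype   type;
    int       numobj;     /* how many objects back this file                */
} marfs_objid;

int main(void) {
    /* The packed-file example from slide 32: 4096 bytes into shared Obj003. */
    marfs_objid x = { 1, "Obj003", 4096, 256LL << 20, OBJ_PACKED, 1 };
    printf("repo=%d id=%s objoffs=%lld chunksize=%lld numobj=%d\n",
           x.repo, x.id, x.objoffs, x.chunksize, x.numobj);
    return 0;
}
```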

SLIDE 33

Pftool – parallel copy/rsync/compare/list tool

  • Walks the tree in parallel; copy/rsync/compare in parallel
  • Parallel readdirs, stats, and copy/rsync/compares
  • Dynamic load balancing (a minimal sketch follows the diagram below)
  • Restartability for large trees or even very large files
  • Repackage: breaks up big files, coalesces small files
  • To/from NFS/POSIX/parallel FS/MarFS

[Diagram: pftool architecture. A load balancer/scheduler/reporter feeds workers through work queues: a dirs queue (readdir), a stat queue (stat), a copy/rsync/compare queue, and a done queue.]
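A minimal sketch of the dynamic load-balancing idea: worker threads pull directories from a shared queue and push subdirectories back, so the tree walk self-balances. Real pftool adds the stat and copy/rsync/compare queues, restartability, and process-based workers; this toy only counts directory entries, and its structure and names are illustrative, not pftool's code.

```c
/* Self-balancing parallel tree walk via a shared work queue. */
#include <dirent.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct node { char path[4096]; struct node *next; } node;

static node *head;               /* the shared "dirs queue" */
static int   active;             /* workers currently expanding a directory */
static long  entries;            /* total entries seen (the "reporter") */
static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

static void push(const char *p) {          /* caller holds mu */
    node *n = malloc(sizeof *n);
    snprintf(n->path, sizeof n->path, "%s", p);
    n->next = head; head = n;
}

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&mu);
        while (!head && active > 0)        /* idle: others may still add work */
            pthread_cond_wait(&cv, &mu);
        if (!head) {                       /* queue drained, nobody working */
            pthread_mutex_unlock(&mu);
            pthread_cond_broadcast(&cv);   /* release remaining sleepers */
            return NULL;
        }
        node *n = head; head = n->next;
        active++;
        pthread_mutex_unlock(&mu);

        DIR *d = opendir(n->path);
        struct dirent *e;
        while (d && (e = readdir(d))) {
            if (!strcmp(e->d_name, ".") || !strcmp(e->d_name, "..")) continue;
            pthread_mutex_lock(&mu);
            entries++;
            if (e->d_type == DT_DIR) {     /* d_type: common Linux extension */
                char sub[4352];            /* push subdirs back for any worker */
                snprintf(sub, sizeof sub, "%s/%s", n->path, e->d_name);
                push(sub);
                pthread_cond_signal(&cv);
            }
            pthread_mutex_unlock(&mu);
        }
        if (d) closedir(d);
        free(n);

        pthread_mutex_lock(&mu);
        active--;
        if (!head && active == 0)          /* all done: release everyone */
            pthread_cond_broadcast(&cv);
        pthread_mutex_unlock(&mu);
    }
}

int main(int argc, char **argv) {
    push(argc > 1 ? argv[1] : ".");
    pthread_t t[8];
    for (int i = 0; i < 8; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 8; i++) pthread_join(t[i], NULL);
    printf("saw %ld entries\n", entries);
    return 0;
}
```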

SLIDE 34

How does it fit into our environment in FY16?

SLIDE 35

Open Source, BSD License. Partners Welcome.

https://github.com/mar-file-system/marfs
https://github.com/pftool/pftool

Thank You For Your Attention