SLIDE 1

FILE SYSTEMS FOR HPC

A Data Centre View

Professor Mark Parsons EPCC Director Associate Dean for e-Research The University of Edinburgh

15th May 2017 Dagstuhl Seminar 17202

SLIDE 2

Advanced Computing Facility

  • The ‘ACF’
  • Opened 2005
  • Purpose-built, secure, world-class data centre
  • Houses wide variety of leading-edge systems
  • Major expansion in 2013
  • 7.5 MW, 850 m² plant room, 550 m² machine room
  • Next … the Exascale


Advanced Computing Facility – HPC and Big Data

SLIDE 3

Principal services

  • Houses a variety of leading-edge systems and infrastructures
  • UK national services
  • ARCHER – 118,080 cores (Cray XC30)
  • DiRAC – 98,304 cores (IBM BlueGene/Q)
  • UK RDF (25 PB disk / 50 PB tape)
  • Local services
  • Cirrus – industry and MSc machine
  • ULTRA – SGI UV2000
  • ECDF – Dell and IBM clusters for University researchers
  • FARR – system for the Farr Institute and NHS Scotland


  • The ARCHER national service:
  • Funded by EPSRC and NERC
  • Service opened in 2013
  • 5,053 users since opening
  • 3,494 users in the past 12 months
  • ARCHER 2 procurement starting
SLIDE 4

Data centre file systems in 2017

  • Complexity has greatly increased in the past decade
  • Most HPC systems have:
  • Multiple storage systems
  • Multiple file systems per storage system
  • Filesystems are predominantly:
  • Via directly attached storage
  • GPFS (IBM Spectrum Scale)
  • Lustre (versions 2.6 – 2.8 are most common)
  • Resiliency
  • Storage platforms generally use some form of RAID
  • Isn’t good enough for “golden” data
  • A lot of tape is still used
  • Generally LTO-6, LTO-7 or IBM Enterprise formats
  • UPS focussed on keeping file systems up while shutting down smoothly
  • In 2017 compute is robust – storage is not


SLIDE 5

Systems grow rapidly … it gets complex very quickly


April 2016 – our new system "Cirrus" is installed

SLIDE 6

Schematic layout April 2016


110 TB LFS and 800 TB LFS

SLIDE 7

Schematic layout September 2016


800 TB LFS

SLIDE 8

Schematic layout March 2017 - compute


From 5,184 cores to 13,248 cores

SLIDE 9

Schematic layout March 2017 - storage


1.9 PB WOS

SLIDE 10

General data centre I/O challenges

  • Many application codes do not use parallel I/O (a minimal parallel I/O sketch follows this list)
  • Most users still have a simple POSIX FS view of the world
  • Even when they do use parallel I/O we find libraries fighting against FS optimisations
  • Buying storage is terribly complicated and confusing
  • Start/end of job read/write performance wastes investment
  • Performance degrades …
  • Some examples
  • Genome processing
  • ~400 TB moves through storage every week
  • One step creates many small files – real Lustre challenges
  • HSM solutions not up to the job
  • Users not thinking about underlying constraints
  • A user on the national service created 240+ million files in a single directory last year
  • Issues exist with Lustre and GPFS
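To make the first bullet concrete, here is a minimal sketch of what parallel I/O can look like in place of the file-per-process pattern that produces millions of small files: every rank writes its block into one shared file using MPI-IO collective calls. This is an illustrative sketch only, written with mpi4py; the file name, block size and layout are invented for the example and are not from the talk.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank produces a small block of results (sizes invented for the sketch)
block = np.full(1024, rank, dtype=np.float64)

# One shared file for all ranks instead of one small file per rank
fh = MPI.File.Open(comm, "results.dat",
                   MPI.MODE_CREATE | MPI.MODE_WRONLY)

# Collective write: every rank lands at its own offset in the shared file,
# so the file system sees one large, well-formed stream of I/O
offset = rank * block.nbytes
fh.Write_at_all(offset, block)

fh.Close()
```

Launched under mpirun, this produces a single contiguous file regardless of the process count, rather than one tiny file per rank.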


SLIDE 11

Performance and benchmarking

  • A real challenge is managing the difference in performance between the day you buy storage and benchmark it and 6 months later
  • We see enormous differences in file system performance
  • Write can be 3-4X slower
  • Read can be 2-3X slower
  • Significant degradation in performance
  • Very difficult to predict performance
  • IOR and IOZone are commonly used, but neither predicts performance once the file system has significant amounts of real user data stored on it
  • We need new parallel I/O benchmarks urgently for procurement purposes (a toy single-client probe is sketched after this list)
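For context, the sketch below is a deliberately naive, single-client bandwidth probe in the spirit of (but far simpler than) IOR or IOZone. The path, block size and total volume are invented for illustration; as the bullets note, a fresh streaming test like this says little about how an aged file system full of real user data will behave.

```python
import os
import time

PATH = "/lustre/scratch/ior_toy.dat"   # hypothetical test location
BLOCK = 8 * 1024 * 1024                # 8 MiB per operation
COUNT = 128                            # ~1 GiB total, tiny by HPC standards

buf = os.urandom(BLOCK)

# Streaming write, forced out to the file system before timing stops
t0 = time.time()
with open(PATH, "wb") as f:
    for _ in range(COUNT):
        f.write(buf)
    f.flush()
    os.fsync(f.fileno())
write_mbps = BLOCK * COUNT / (time.time() - t0) / 1e6

# Streaming read back; this will largely hit the page cache,
# one of many reasons toy numbers overstate real performance
t0 = time.time()
with open(PATH, "rb") as f:
    while f.read(BLOCK):
        pass
read_mbps = BLOCK * COUNT / (time.time() - t0) / 1e6

print(f"write {write_mbps:.0f} MB/s, read {read_mbps:.0f} MB/s")
os.remove(PATH)
```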


SLIDE 12

Performance and user configuration

  • “Setting striping to 1 has reduced total read time for his 36,000 small files from 2 hours to 6 minutes” – comment on resolution of an ARCHER helpdesk query
  • User was performing I/O on 36,000 separate files of ~300 KB with 10,000 processes
  • Had set parallel striping to the maximum possible (48 OSTs / -1), assuming this would give the best performance
  • Overhead of querying every OST for every file dominated the access time
  • Moral: more stripes does not mean better performance
  • But how do users learn non-intuitive configurations? (a sketch of the fix follows the credit below)


(Thanks to David Henty for this slide)
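For readers unfamiliar with Lustre, this is roughly what the resolution looks like. `lfs setstripe` and `lfs getstripe` are standard Lustre commands; the directory path is hypothetical, and driving them from Python is purely for the sketch.

```python
import pathlib
import subprocess

# Hypothetical directory that will hold many ~300 KB files
outdir = pathlib.Path("/lustre/work/small_files")
outdir.mkdir(parents=True, exist_ok=True)

# Stripe count 1: each new file lives on a single OST, so reads no longer
# pay the cost of querying every OST for every small file
subprocess.run(["lfs", "setstripe", "-c", "1", str(outdir)], check=True)

# Show the layout that new files created in this directory will inherit
subprocess.run(["lfs", "getstripe", str(outdir)], check=True)
```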

SLIDE 13

A new hierarchy

  • Next-generation NVRAM technologies will profoundly change memory and storage hierarchies
  • HPC systems and Data Intensive systems will merge
  • Profound changes are coming to ALL Data Centres
  • … but in HPC we need to develop software – OS and application – to support their use (a usage sketch follows the diagram summary below)


Diagram: “Memory & Storage Latency Gaps” – today’s HPC hierarchy (CPU registers, cache, DRAM, spinning disk, tape) alongside the future hierarchy, where NVRAM, SSD and MAID disk tiers sit between DRAM and tape, with approximate latency multipliers marked at each step.
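As one illustration of the application-level support the last bullet calls for, the sketch below memory-maps a file on a hypothetical DAX-mounted persistent-memory file system, so byte-addressable NVRAM is accessed with ordinary loads and stores rather than through the block I/O stack. The path and sizes are invented, and this is only one possible usage model, not the NEXTGenIO design.

```python
import mmap
import os

# Hypothetical file on a DAX-mounted persistent-memory file system
PMEM_FILE = "/mnt/pmem0/checkpoint.bin"
SIZE = 64 * 1024 * 1024   # 64 MiB region

fd = os.open(PMEM_FILE, os.O_CREAT | os.O_RDWR)
os.ftruncate(fd, SIZE)

# Map the region into the address space: application data structures can
# then be read and written in place, at memory rather than disk latency
with mmap.mmap(fd, SIZE) as region:
    region[0:16] = b"checkpoint-0001\n"   # 16-byte marker, invented payload
    region.flush()                        # make the dirty range persistent

os.close(fd)
```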

SLIDE 14

The future - I/O is the Exascale challenge

  • Parallelism beyond 100 million threads demands a new approach to I/O
  • Today’s Petascale systems struggle with I/O
  • Inter-processor communication limits performance
  • Reading and writing data to parallel filesystems is a major bottleneck
  • New technologies are needed
  • To improve inter-processor communication
  • To help us rethink data movement and processing on capability systems
  • Truly parallel file systems with reproducible performance are required
  • Current technologies simply will not scale
  • Large jobs will spend hours reading initial data and writing results


SLIDE 15

Project Objectives

  • Develop a new server architecture using next-generation processor and memory advances
  • New Fujitsu server motherboard
  • Built around Intel Xeon and 3D XPoint memory technologies
  • Investigate the best ways of utilising these technologies in HPC
  • Develop the systemware to support their use at the Exascale
  • Model three different I/O workloads and use this understanding in a co-design process
  • Representative of real HPC centre workloads
  • Predict performance of changes to I/O infrastructure and workloads
