nci.org.au
@NCInews
Providing Australian researchers with world-class computing services
THETA 2015
‘Really Big Data’ Building a HPC-ready Storage Platform for Research Datasets
Daniel Rodwell
Manager, Data Storage Services
May 2015
2
Agenda
3
4
NCI – an overview
– comprehensive, vertically-integrated research service
– providing national access on priority and merit
– driven by research objectives
– supported by partner organisations including ANU, CSIRO, the Bureau of Meteorology and Geoscience Australia, with research access funded by the Australian Research Council.
5
Where are we located?
6
Research Communities
Research focus areas
– Climate Science and Earth System Science
– Astronomy (optical and theoretical)
– Geosciences: Geophysics, Earth Observation
– Biosciences & Bioinformatics
– Computational Sciences
– Social Sciences
– Growing emphasis on data-intensive computation
7
Who uses NCI?
Astrophysics, Biology, Climate & Weather, Oceanography, Particle Physics, Fluid Dynamics, Materials Science, Chemistry, Photonics, Mathematics, Image Processing, Geophysics, Engineering, Remote Sensing, Bioinformatics, Environmental Science, Geospatial, Hydrology, Data Mining
8
What do they use it for?
Earth Sciences, Physical Sciences, Chemical Sciences, Engineering, Biological Sciences, Technology, Mathematical Sciences, Information and Computing Sciences, Environmental Sciences, Medical and Health Sciences, Economics, Agricultural and Veterinary Sciences
9
Research Highlights
The greatest map ever made
Led by Nobel Laureate Professor Brian Schmidt, Australian astronomers are using NCI to carry out the most detailed optical survey yet of the southern sky. The project involves processing and storing many terabytes of optical telescope images, and has led to the discovery of the oldest known star.

Predicting the unpredictable
Australia's weather and future climate are predicted using the ACCESS model (developed by BoM, CSIRO and ARCCSS), operating on time spans ranging from hours and days to centuries. Collaborating with NCI and Fujitsu, and using NCI as its research system, BoM is increasing the scalability of ACCESS to deliver more accurate predictions of extreme weather.

Unlocking the Landsat Archive
NCI is enabling researchers at Geoscience Australia to 'unlock' decades of Landsat earth observation satellite images of Australia dating back to 1979. A one-petabyte data cube has been generated by processing and analysing hundreds of thousands of images, yielding important insights for water and land management decision making and policy, with benefits for the environment and agriculture.
10
‘Raijin’ – 1.2 PetaFLOP Fujitsu Primergy Cluster
11
Raijin – Petascale Supercomputer
Raijin, a Fujitsu Primergy cluster (June 2013):
– 57,472 cores (Intel Xeon Sandy Bridge, 2.6 GHz) in 3592 compute nodes;
– approx. 160 TB of main memory;
– approx. 10 PB of usable fast filesystem (for short-term scratch space);
– 24th fastest in the world on debut (November 2012); first petaflop system in Australia;
– ~52 km of InfiniBand cabling;
– water cooling.
12
Tenjin – High Performance Cloud
Tenjin, a Dell C8000 high-performance cloud:
– Intel Xeon processors (2.6 GHz), 100 nodes;
– designed to deliver the performance needed for "big data" research.
13
30PB High Performance Storage
14
Storage Overview
– Raijin Lustre – HPC filesystems: includes /short, /home, /apps, /images, /system
– Gdata1 – Persistent Data: /g/data1
– Gdata2 – Persistent Data: /g/data2
– Massdata – Archive Data: migrating CXFS/DMF, 1PB cache, 6PB x2 LTO-5 dual-site tape
– OpenStack – Persistent Data: Ceph, 1.1PB over 2 systems
15
Systems Overview
[Architecture diagram: Raijin compute, Raijin (HPC) login + data mover nodes, and VMware connect over the Raijin 56Gb FDR IB fabric to Raijin's HPC filesystems (/short 7.6PB; /home, /system, /images, /apps). The NCI global persistent /g/data filesystems (/g/data1 7.4PB, /g/data2 6.75PB, /g/data3 8.0PB due Q2 2015) are served over 10 GigE and 56Gb FDR IB fabrics to Raijin, the OpenStack cloud, and NCI data services. Massdata archival data sits behind a 1.0PB cache with 12.3PB of LTO-5 tape; an HSM tape system (TS1140/50, 18.2PB x2 raw) connects via 10 GigE. External access runs via AARNET + Internet through Aspera + GridFTP, with a link to the Huxley DC.]
16
What do we store?
How big? Very:
– the average data collection is 50-100+ terabytes
– larger data collections are multiple petabytes in size
– individual files can exceed 2TB or be as small as a few KB
– individual datasets consist of tens of millions of files
– the next generation is likely to be 6-10x larger.
– High-value, cross-institutional collaborative scientific research collections.
– Nationally significant data collections, such as the Australian Community Climate and Earth System Simulator (ACCESS) models and the AR5 collection, at 1.5PB-2.6PB per collection.
https://www.rdsi.edu.au/collections-stored
17
How is it used?
HPC (Raijin):
– Native Lustre mounts for gdata storage on all 3592 compute nodes (57,472 Xeon cores), 56Gbit per node.
– Additional login nodes + management nodes, also on 56Gbit FDR IB.
– The scheduler runs jobs as resources become available (semi-predictable, but runs 24/7).
– A single job may be 10,000+ cores reading (or creating) a dataset.
Cloud and NFS:
– NFS over 10Gbit Ethernet (40GE NFS, Q3 2015).
– Unpredictable when load will ramp.
– Typically many small I/O patterns.
Data transfer:
– Dedicated datamover nodes connected via 10GE externally and 56Gbit InfiniBand internally.
– Dedicated datamover systems such as Aspera, GridFTP and Long Distance IB, connected via 10GE, 40Gb IB and optical circuits.
– Data access may be sustained for days or weeks: continual streaming read/write access.
Example: inbound transfers sustained at 8Gbit/sec for 24hrs+, with 53,787 of 56,992 cores in use (94.37% utilisation).
18
How is it used?
Performance (gdata1, HPC User Application)
Peak 54GB/sec read sustained for 1.5 hrs. Average 27GB/sec sustained for 6 hours
Availability (Quarterly, 2014-2015)
Gdata1 + Gdata2 filesystems
Gdata1 long-term availability of 99.23% (475 days, excluding maintenance, to 20 Apr 2015), with maintenance events announced with 3+ days notice, plus quarterly maintenance.
19
How is it used?
Metadata Performance (gdata1), example applications
Peak ~3.5 million getattrs/sec; average 700,000+ getattrs/sec sustained for 1.5 hours; getxattrs at 500K/sec.
20
High Performance Persistent Data Store
21
Requirements
– 8PB by mid-2015, with the ability to grow to 10PB+; additional capacity required for expansion of existing and new data collections.
– High-performance, high-capacity storage capable of supporting an HPC-connected workload, with high availability.
– Persistent storage for active projects and reference datasets, with 'backup' or HSM capability.
– Capable of supporting an intense metadata workload of 4 million+ operations per second.
– Modular design that can be scaled out as required for future growth.
– 120+ GB/sec read performance and 80+ GB/sec write, for both stream and IOPS workloads (see the sizing sketch below).
– Available across all NCI systems (Cloud, VMware, HPC) using native mounts and 10/40Gbit NFS.
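The sizing falls out of two divisions: capacity target over per-block capacity, and bandwidth target over per-block bandwidth. A minimal sketch of that arithmetic, using this slide's targets and the rounded per-building-block figures (0.5PB, 9GB/sec) from the design slides later in the deck; the script itself is illustrative only.

```python
import math

# Targets from this slide.
capacity_target_pb = 8.0    # PB by mid 2015
read_target_gbs = 120.0     # GB/sec aggregate read

# Per-building-block figures from the gdata3 design slides.
block_capacity_pb = 0.5
block_read_gbs = 9.0

blocks_for_capacity = math.ceil(capacity_target_pb / block_capacity_pb)
blocks_for_read = math.ceil(read_target_gbs / block_read_gbs)

print(f"capacity needs {blocks_for_capacity} blocks, "
      f"read bandwidth needs {blocks_for_read} blocks "
      f"-> deploy {max(blocks_for_capacity, blocks_for_read)}")
# -> capacity needs 16 blocks, read bandwidth needs 14 blocks -> deploy 16
```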
22
Lustre @ NCI
– Lustre is a high-performance parallel distributed filesystem, typically used for large-scale compute clusters.
– Highly scalable for very large and fast filesystems.
– The most widely used filesystem in the top 100 fastest supercomputers worldwide, including Titan (#2) and Sequoia (#3, LLNL: 55PB of Lustre on Netapp E5500, 1TB/sec).
– Used at NCI for Raijin's HPC filesystems and /g/data1, /g/data2, /g/data3.
– Can be used with common enterprise-type server and storage hardware – but will have poor performance and reliability if not correctly specified.
23
How Lustre works
[Diagram: compute nodes running the Lustre client connect over the HPC FDR InfiniBand fabric, through LNET routers, to a storage FDR InfiniBand fabric. Object Storage Server (OSS) HA pairs serve Object Storage Targets (OSTs); a MetaData Server (MDS) HA pair serves the MetaData Target (MDT). Files are striped across OSTs (here with stripe count = 4). NFS/SMB servers and VMs hosting data catalogues and services also mount the filesystem.]
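To make the striping concrete, here is a minimal sketch of the placement arithmetic for a striped file, assuming the stripe count of 4 from the diagram and a 1 MiB stripe size (a common Lustre default; not NCI-specific code):

```python
STRIPE_SIZE = 1 << 20   # 1 MiB stripe size (assumed default)
STRIPE_COUNT = 4        # file striped over 4 OSTs, as in the diagram

def locate(offset: int):
    """Map a byte offset in the file to (ost_index, offset_in_object)."""
    stripe_no = offset // STRIPE_SIZE      # which 1 MiB stripe of the file
    ost_index = stripe_no % STRIPE_COUNT   # OSTs are used round-robin
    obj_offset = (stripe_no // STRIPE_COUNT) * STRIPE_SIZE + offset % STRIPE_SIZE
    return ost_index, obj_offset

for off in (0, 1 << 20, 5 << 20):
    print(f"file offset {off:>8} -> OST {locate(off)[0]}, "
          f"object offset {locate(off)[1]}")
# offset 0 -> OST 0; offset 1 MiB -> OST 1; offset 5 MiB -> OST 1, 1 MiB in
```

Sequential 1 MiB reads thus fan out across all four OSTs, which is why aggregate bandwidth scales with stripe count.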
24
Metadata
– MDT capacity and performance are typically determined for the whole filesystem at initial build – consider the overall capacity of the filesystem in the initial specification.
– Need performance, lots of it: filesystem performance is heavily dependent on the MDS and MDT, and poor metadata performance impacts the entire filesystem.
– Slow filesystem = slow jobs = wasted HPC compute hours.
– Must consider MDT controller + disk IOPS, and MDS cores + RAM.
– Random 4K IO workload.
[Diagram: MDS HA pair – MetaData Server (MDS) and MetaData Target (MDT).]
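The getattr-heavy load described above is easy to reproduce: every stat() of a file is one getattr against the MDS. A small illustrative sketch (the path is an example only):

```python
import os
import time

def stat_storm(root: str = "/tmp") -> None:
    """Walk a tree and lstat() every entry, reporting the getattr rate."""
    count, start = 0, time.time()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames + dirnames:
            try:
                os.lstat(os.path.join(dirpath, name))  # one getattr each
                count += 1
            except OSError:
                pass  # entry vanished mid-walk; harmless for a benchmark
    elapsed = time.time() - start
    print(f"{count} getattrs in {elapsed:.2f}s "
          f"({count / max(elapsed, 1e-9):,.0f}/sec)")

stat_storm()
```

A few thousand HPC processes each running loops like this is how the multi-million-getattr/sec peaks on the earlier slide arise.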
25
Netapp EF-550
– 450,000 IOPS sustained; 900,000 peak.
– 24x 800GB SAS SSDs (mixed-use SLC).
– Dual controllers.
– 21kg, 2RU; low power and thermal loads.
– August 2014 evaluation testing with a single host, using 6 of 8 available 10GE ports.
EF550 – All Flash Array
26
Design – gdata3 Metadata
Gdata3 Metadata Building Blocks
Netapp EF550 All-Flash block storage array, with 4x MDS-MDT 40Gbit InfiniBand interconnects:
– 24x 800GB SAS SSDs (SLC mixed use)
– Dual 40Gbit IB controllers
– 2x 10-disk RAID 10 pools, LVM'd together, 4 spares
– 1 preferred pool per controller
– ~1 billion inode capacity (as formatted for the MDT; see the sizing sketch below)
Servers:
– 2x servers as a high-availability pair
– 1RU HP DL360 Gen9s, each with CPUs at up to 3.6GHz Turbo Boost
[Diagram: MDS 1 and MDS 2 connect to the MDT array controllers (Ctrl A, Ctrl B) over 40Gbit IB.]
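The ~1 billion inode figure is consistent with simple capacity arithmetic. A sketch (not vendor math) using the numbers above:

```python
# 24x 800GB SSDs: two 10-disk RAID10 pools joined with LVM, 4 hot spares.
ssd_gb = 800
pool_disks = 10     # per RAID10 pool
pools = 2

usable_gb = pools * (pool_disks // 2) * ssd_gb  # RAID10 halves capacity
inodes = 1_000_000_000

print(f"usable MDT capacity: {usable_gb / 1000:.1f} TB")      # -> 8.0 TB
print(f"bytes available per inode: {usable_gb * 1e9 / inodes:,.0f}")
# -> ~8,000 bytes per inode, a comfortable budget given that each
#    inode plus its metadata typically consumes a few KB on an MDT.
```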
27
Design – gdata Metadata comparison
Gdata1 + Gdata2 Shared MDT Array
Gdata3 MDT Array
28
Object Storage (Capacity Scale out)
– OST performance is typically determined at initial build by the choice of disk array technology (choose carefully if adding incrementally over multiple years).
– Performance of all OSTs (and OSSes) in the filesystem should be very similar.
– Mixed OST sizes and/or performance will result in hotspotting and inconsistent read/write performance as files are striped across OSTs or allocated in a round-robin / stride.
– Capacity scales out as you add more building blocks, as does performance*.
– Design the building block for your workload – controller-to-disk-to-IOPS ratios need to be considered.
– Mixed 1MB stream and random 4K IO workload: Lustre uses 1MB transfers, so optimise the RAID config for a 1MB stripe size (see the alignment sketch after this list).
[Diagram: Object Storage Servers (OSS) in an HA pair, serving Object Storage Targets (OSTs).]
*interconnect fabric must scale to accommodate bandwidth of additional OSSes
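The 1MB alignment point reduces to picking a per-disk segment size such that one Lustre RPC exactly fills a full RAID stripe, avoiding read-modify-write. A minimal sketch, assuming the 8+2 layout discussed later in the deck:

```python
LUSTRE_RPC = 1 << 20   # 1 MiB transfer size used by Lustre
DATA_DISKS = 8         # data disks in an 8+2 RAID6 / DDP stripe

segment = LUSTRE_RPC // DATA_DISKS
assert segment * DATA_DISKS == LUSTRE_RPC  # one RPC == one full stripe

print(f"segment size per disk: {segment // 1024} KiB")  # -> 128 KiB
```

With 128 KiB segments, every 1 MiB write lands as a full-stripe write and parity can be computed without first reading old data.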
29
Netapp E5660
– Latest-generation E-Series.
– 1st Lustre deployment on the E5600 series worldwide.
– Multi-core optimised controllers.
– 12,000 MB/sec read performance (raw).
– 180x 4TB NL-SAS 7.2K HDDs (NCI config).
– Dual controllers.
– 1x E5660 60-disk controller shelf + 2x DE6600 60-disk expansion shelves.
30
Design – gdata3 Object Storage
Gdata3 Object Storage Building Blocks
Netapp E5660 block storage array, with 8x OSS-OST 12Gbit SAS interconnects:
– 180x 4TB NL-SAS, 7.2K
– Dual 12G SAS controllers
– 2x 90-disk DDPs
– 8 volume slices per 90-disk DDP
Servers:
– 2x servers as a high-availability pair
– 1RU Fujitsu RX2530 M1s, each with CPUs at up to 3.4GHz Turbo Boost
[Diagram: OSS 1 and OSS 2 connect via SAS (Ctrl A x2, Ctrl B x2) to the OSTs.]
31
Design – gdata3 Object Storage
Gdata3 Object Storage Building Blocks
[Photo: 1x building block]
32
Design – gdata3 Object Storage
Gdata3 Object Storage Building Blocks
[Photos: front view with bezel removed; front view with Tray 1, Drawer 5 open]
33
Design – gdata3 Object Storage
Gdata3 Object Storage Building Blocks
[Photo: front of rack, panel attached at the RU0 position]
34
Design – gdata3 Object Storage
Gdata3 Object Storage Building Blocks
[Photo: rear of rack]
35
Design – gdata3 Object Storage
Gdata3 Object Storage Building Blocks
[Diagram: 180x 4TB NL-SAS drives arranged as two 90-disk DDPs, one preferred per controller (A and B). Each 90-disk DDP presents 8x 30TB volume slices, and each 30TB slice is one OST; 4x OSTs from each DDP map to OSS A and 4x to OSS B. SAS array-host connections link the array to the OSS high-availability pair, which uplinks to the FDR IB fabrics.]
36
Building block capacity = 16x 30TB OSTs = 480TB, plus 6 drives of DDP spare capacity per 90-disk pool (see the arithmetic below).
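The capacity figures check out directly. A sketch of the building-block arithmetic from this slide:

```python
disk_tb = 4
disks = 180
ddps = 2
slices_per_ddp = 8
slice_tb = 30

raw_tb = disks * disk_tb                      # 720 TB raw
usable_tb = ddps * slices_per_ddp * slice_tb  # 480 TB provisioned as OSTs

print(f"raw: {raw_tb} TB, provisioned as OSTs: {usable_tb} TB")
print(f"overhead (DDP parity + 6 spares per 90-disk pool): "
      f"{raw_tb - usable_tb} TB ({(raw_tb - usable_tb) / raw_tb:.0%})")
# -> raw: 720 TB, provisioned as OSTs: 480 TB; overhead: 240 TB (33%)
```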
37
Object Storage Performance
– As disk sizes increase, RAID rebuild times become very long: rebuilding a 4TB drive can take a day or more in a RAID6 set under normal workload conditions (see the estimate below).
– Volume (LUN -> OST) performance is degraded while this occurs.
– Risk of losing a second disk in the RAID6 pool during a rebuild (typically 8+2 R6 is used).
– If a pool enters a no-redundancy state (i.e. loss of 2 drives in a RAID6 pool), HPC operations are suspended while the rebuild occurs, due to risk.
– DDP: highly distributed parity; many drives are involved in a rebuild, and sparing capacity is built into the pool.
– DDP rebuilds are in the 1-3 hour range (and can be faster).
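A back-of-envelope estimate behind those rebuild numbers. The rebuild rates are assumptions for illustration: a busy array often sustains only tens of MB/sec of rebuild throughput onto a single replacement disk, while DDP spreads the rebuild across many drives.

```python
disk_bytes = 4 * 10**12   # 4 TB NL-SAS drive
rates = {
    "RAID6 (single spare disk)": 50 * 10**6,   # assumed ~50 MB/s under load
    "DDP (many drives rebuilding)": 500 * 10**6,  # assumed ~10x parallelism
}

for name, rate in rates.items():
    hours = disk_bytes / rate / 3600
    print(f"{name}: ~{hours:.1f} hours")
# -> RAID6: ~22 hours; DDP: ~2 hours, within the 1-3 hour range quoted.
```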
38
Object Storage Performance
– But… is there a free lunch? DDP can traditionally perform slightly lower under peak stream workload conditions.
– Need to evaluate the impact of rebuild time versus slightly lower stream performance.
– Interim benchmarks for the E5600 controllers look very promising. 1x building block = 180 disks, 2x 90-disk DDPs, 8x slices per DDP, with a fully balanced SAS/multipath/controller config:
6.26 GB/sec write test
9.19 GB/sec read test
(10x 1MB block-size streams per slice, driven by the OSS HA pair)
39
Scale Out
[Diagram: 1x MDS HA pair plus 16x OSS HA pair building blocks, each contributing 9GB/sec+ and 0.5PB.]
16x building blocks = 8PB, 144GB/sec+ (checked below).
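A one-line check that the aggregate follows from the per-block figures in the diagram:

```python
blocks = 16
per_block_pb, per_block_gbs = 0.5, 9.0

print(f"capacity: {blocks * per_block_pb:.1f} PB, "
      f"read bandwidth: {blocks * per_block_gbs:.0f}+ GB/sec")
# -> capacity: 8.0 PB, read bandwidth: 144+ GB/sec
```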
40
Next Steps
– IOR benchmark against the built Lustre filesystem; will require 200-300 clients to fully exercise the filesystem (see the sketch below).
– Expectations of 140GB/sec read, 90GB/sec write (sequential aggregate).
– LNET routers will ultimately cap performance (10GB/sec each, 14x).
– Full production service for Q3 2015.
– Lustre HSM-DMF capability in Q3 2015.
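A hedged sketch of driving such an IOR run from Python. The client count, task count, and filesystem path are assumptions; the IOR flags used (-a api, -b block size, -t transfer size, -F file-per-process, -w write, -r read, -o output file) are standard IOR options, not an NCI-specific recipe.

```python
import subprocess

CLIENTS = 256                   # 200-300 clients per the slide
TASKS_PER_CLIENT = 4
TESTDIR = "/g/data3/benchmark"  # example path

cmd = [
    "mpirun", "-np", str(CLIENTS * TASKS_PER_CLIENT),
    "ior",
    "-a", "POSIX",   # POSIX API against the Lustre mount
    "-b", "4g",      # per-task block size
    "-t", "1m",      # 1 MiB transfers, matching Lustre RPC size
    "-F",            # file per process
    "-w", "-r",      # write phase, then read phase
    "-o", f"{TESTDIR}/ior.dat",
]
print(" ".join(cmd))                   # inspect before launching
# subprocess.run(cmd, check=True)      # uncomment to launch for real

# Sanity cap: 14 LNET routers x 10 GB/sec each = 140 GB/sec ceiling,
# matching the 140 GB/sec read expectation above.
print(f"LNET ceiling: {14 * 10} GB/sec")
```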
41
NCI Contacts
General enquiries: +61 2 6125 9800
Media enquiries: +61 2 6125 4389
Help desk: help@nci.org.au
Address: NCI, Building 143, Ward Road, The Australian National University, Canberra ACT 0200
43
Lustre HSM
– /g/data1 – 7.4PB capacity
– /g/data2 – 6.75PB capacity
– Approx 300-400M inodes per /g/dataN
– Combined: 14.1PB, 800M+ inodes (possibly 1B inodes?)
Why traditional backup fails:
– A traditional 'backup' is not viable – at what interval? A deep traversal of the directory structures?
– Data changes between the start and end of a backup event.
– Calculating the difference between backup events takes days/weeks.
– Backup impacts filesystem performance, particularly the metadata load on the MDS.
The changelog approach (sketched below):
– The Lustre MDS knows which files are being accessed and altered; activity is logged in a 'changelog'.
– No need for deep traversal if you know what is being altered.
– The 'backup' is always occurring: a light, persistent load rather than periodic intense loads.
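A minimal sketch of consuming the changelog instead of crawling the namespace. It assumes a changelog reader has been registered on the MDT (via lctl changelog_register) and that the standard `lfs changelog` command is on PATH; the MDT name is an example, and real deployments (such as Robinhood, used here) do this far more robustly.

```python
import subprocess

MDT = "gdata3-MDT0000"  # example MDT device name

def follow_changelog(mdt: str):
    """Yield (record_number, operation_type, rest) per changelog record."""
    proc = subprocess.Popen(
        ["lfs", "changelog", mdt],
        stdout=subprocess.PIPE, text=True,
    )
    for line in proc.stdout:
        fields = line.split()
        if len(fields) >= 2:
            # fields[0] is the record number, fields[1] the operation
            # type (e.g. 01CREAT, 02MKDIR); the remainder varies by type.
            yield fields[0], fields[1], fields[2:]

for recno, optype, rest in follow_changelog(MDT):
    # Feed creates/modifies into the backup queue: no deep traversal needed.
    print(recno, optype)
```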
44
Design – Diagram Fabric Layout
45
Design – HSM Configuration
– Essentially create a backup, rather than migrating tiers.
– All Lustre objects to be dual-stated – i.e. existing both on Lustre disk and HSM tape.
– Backend tape to be dual-site – i.e. copied to primary and secondary tape libraries.
Components:
– Lustre v2.5 front end
– Robinhood Policy Engine (2.5.3)
– SGI DMF Copytool v1.0
– SGI DMF 6.2 tape back end (+ ISSP 3.2 / CXFS 7.2)
– Spectra Logic T950 tape library
– IBM 3592 tape system, TS1140 drives, JC media
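To illustrate the dual-stated flow, a sketch using Lustre 2.5's HSM commands (lfs hsm_archive and lfs hsm_state are real Lustre 2.5+ commands; the archive id and file path are examples). In the actual deployment, Robinhood and the DMF copytool drive this automatically according to policy.

```python
import subprocess

ARCHIVE_ID = "1"                   # example: maps to the DMF tape back end
path = "/g/data3/project/file.nc"  # example file

# Request a tape copy; the file remains on Lustre disk -> dual-stated.
subprocess.run(["lfs", "hsm_archive", "--archive", ARCHIVE_ID, path],
               check=True)

# Inspect the HSM state: "exists archived" means both disk and tape
# copies are present, i.e. the 'backup' is in place.
subprocess.run(["lfs", "hsm_state", path], check=True)
```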