DVS, GPFS and External Lustre at NERSC: How It's Working on Hopper
Tina Butler, Rei Chi Lee, Gregory Butler
CUG 2011, 05/25/11
NERSC is the Primary Computing Center for DOE Office of Science
- NERSC serves a large population
– Approximately 3,000 users, 400 projects, 500 codes
- Focus on "unique" resources
– Expert consulting and other services
– High-end computing & storage systems
- NERSC is known for:
– Excellent services & diverse workload
[Pie chart: 2010 allocations by science area: Physics, Math + CS, Astrophysics, Chemistry, Climate, Combustion, Fusion, Lattice Gauge, Life Sciences, Materials, Other]
NERSC Systems
Large-Scale Computing Systems
- Franklin (NERSC-5): Cray XT4
– 9,532 compute nodes; 38,128 cores
– ~25 Tflop/s on applications; 356 Tflop/s peak
- Hopper (NERSC-6): Cray XE6
– Phase 1: Cray XT5, 668 nodes, 5,344 cores
– Phase 2: Cray XE6, 6,384 nodes, 153,216 cores; 1.28 Pflop/s peak

HPSS Archival Storage
- 40 PB capacity
- 4 tape libraries
- 150 TB disk cache

NERSC Global Filesystem (NGF), using IBM's GPFS
- 1.5 PB capacity
- 10 GB/s of bandwidth

Clusters (140 Tflop/s total)
- Carver: IBM iDataPlex cluster
- PDSF (HEP/NP): ~1K-core throughput cluster
- Magellan: cloud testbed, IBM iDataPlex cluster
- GenePool (JGI): ~5K-core throughput cluster

Analytics
- Euclid: 512 GB shared memory
- Dirac: GPU testbed (48 nodes)
Lots of users, multiple systems, lots of data
- By the end of the 1990s it was becoming increasingly clear that data management was a huge issue.
- Users were generating larger and larger data sets and copying their data to multiple systems for pre- and post-processing.
- The result was wasted time and wasted space.
- NERSC needed to help users be more productive.
Global Unified Parallel Filesystem
- In 2001 NERSC began the GUPFS project:
– High performance
– High reliability
– Highly scalable
– Center-wide shared namespace
- Assess emerging storage, fabric, and filesystem technology
- Deploy across all production systems
NERSC Global Filesystem (NGF)
- First production in 2005 using GPFS
– Multi-cluster support
– Shared namespace
– Separate data and metadata partitions
– Shared lock manager
– Filesystems served over Fibre Channel and Ethernet
– Partitioned server space through private NSDs
NERSC Global Filesystem (NGF)
[Diagram: NGF architecture. NGF servers and NGF disk sit on the NGF SAN; Carver/Magellan, Euclid, PDSF/Planck, and Dirac attach through private NSD servers (pNSDs) over the Ethernet network, while Franklin (with its own SAN and disk), PDSF, Hopper, the Hopper external login nodes, and the Hopper external filesystem attach over Fibre Channel and InfiniBand.]
NGF Configuration
- NSD servers are commodity hardware
– 28 core servers
– 26 private NSD servers: 8 for Hopper, 14 for Carver, 8 for Planck (PDSF)
- Storage is heterogeneous
– DDN 9900 for data LUNs
– HDS 2300 for data and metadata LUNs
– Have also used Engenio and Sun
- Fabric is heterogeneous
– FC-8 and 10 GbE for data transport
– Ethernet for control/metadata traffic
NGF Filesystems
- Collaborative: /project
– 873 TB, ~12 GB/s, served over FC-8
– 4 DDN 9900s
- Scratch: /global/scratch
– 873 TB, ~12 GB/s, served over FC-8
– 4 DDN 9900s
- User homes: /global/u1, /global/u2
– 40 TB, ~3-5 GB/s, served over Ethernet
– HDS 2300
- Common area: /global/common, syscommon
– ~5 TB, ~3-5 GB/s, served over Ethernet
– HDS 2300
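Not from the original slides: a minimal sketch of how capacity figures like those above could be checked programmatically on any of these mount points, using POSIX statvfs(). The /project path is taken from the list; everything else is generic C.

```c
/* Minimal sketch: report capacity and free space of a mount point via
 * POSIX statvfs(). The path comes from the filesystem list above. */
#include <stdio.h>
#include <sys/statvfs.h>

int main(void)
{
    const char *path = "/project";
    struct statvfs vfs;
    if (statvfs(path, &vfs) != 0) { perror(path); return 1; }

    double tb = 1e12;  /* report in TB to match the slide's units */
    double total = (double)vfs.f_blocks * vfs.f_frsize / tb;
    double avail = (double)vfs.f_bavail * vfs.f_frsize / tb;
    printf("%s: %.1f TB total, %.1f TB available\n", path, total, avail);
    return 0;
}
```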
NGF /project

[Diagram: /project, 870 TB at ~12 GB/s, with a 730 TB increase planned for July 2011. Served by GPFS servers over Fibre Channel links (20x2xFC4, 8x4xFC8, 12x4xFC8, 2x2xFC4, 4x2xFC4) and InfiniBand to Euclid, Dirac, Franklin (via DVS over SeaStar), Hopper (via DVS over Gemini), and to PDSF/Planck and Carver/Magellan through pNSDs; the SGNs and DTNs also mount it.]
NGF global scratch

[Diagram: /global/scratch, 870 TB at ~12 GB/s, with no increase planned. Served over Fibre Channel links (8x4xFC8, 12x4xFC8, 2x2xFC4, 4x2xFC4) and InfiniBand to Euclid, Dirac, Franklin, Hopper (via DVS over Gemini), and to PDSF/Planck and Carver/Magellan through pNSDs; the SGNs and DTNs also mount it.]
NGF global homes

[Diagram: /global/homes, 40 TB, with a 40 TB increase planned for July 2011. Served by GPFS servers over Ethernet (4x1-10 Gb) and Fibre Channel to Euclid, Carver, Magellan, Dirac, Franklin, Hopper, PDSF/Planck, and the SGNs and DTNs.]
Hopper Configuration

[Diagram: Hopper configuration. The main system sits on a QDR InfiniBand switch fabric with DVS, DVS/DSL, LNET, and MOM service nodes and 12 external login servers. External Lustre storage: 52 OSSes and 2 MDS (plus 2 spares) behind an FC switch fabric, fronting 26 arrays of 12 LUNs each. GPFS storage and metadata on LSI 3992 arrays in RAID 1+0, fronted by 8 pNSD servers. 4 esDM servers connect to HPSS over the NERSC 10 GbE LAN.]
DVS on Hopper
- 16 DVS servers for NGF filesystems (a node-side mount check is sketched after this list)
– IB-connected to private NSD servers
– GPFS remote cluster serving compute and MOM nodes
– 2 DVS nodes dedicated to MOMs
– Cluster-parallel mode
- 32 DVS DSL servers on repurposed compute nodes
– Load-balanced for shared root
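Not part of the original slides: a minimal sketch of how one might list the DVS-projected filesystems from a node by scanning /proc/mounts. That DVS mounts appear with filesystem type "dvs" is an assumption about the Cray client; the code itself is plain POSIX C.

```c
/* Minimal sketch: list mount points whose filesystem type is "dvs".
 * Assumes DVS-projected filesystems appear in /proc/mounts with
 * fstype "dvs", as on Cray compute nodes; adjust if yours differ. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/mounts", "r");
    if (!f) { perror("/proc/mounts"); return 1; }

    char dev[256], mnt[256], type[64];
    /* /proc/mounts lines: device mountpoint fstype options dump pass */
    while (fscanf(f, "%255s %255s %63s %*[^\n]", dev, mnt, type) == 3) {
        if (strcmp(type, "dvs") == 0)
            printf("DVS mount: %s on %s\n", dev, mnt);
    }
    fclose(f);
    return 0;
}
```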
[Chart: pNSD servers to /global/scratch (idle). Aggregate write and read bandwidth in MB/s (y-axis 9,000 to 11,000) versus number of I/O processes (1, 2, 4, 8, 16) at block sizes of 1 MB, 4 MB, 8 MB, and 16 MB.]
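A rough sketch of the kind of measurement behind these charts: each I/O process streams fixed-size blocks to its own file and reports MB/s. This is not NERSC's actual benchmark; the target path, block count, and default block size are placeholder values.

```c
/* Sketch of a streaming-write bandwidth test like the ones charted
 * above: write `nblocks` blocks of `blksz` bytes, report MB/s.
 * Path and sizes are placeholders, not the NERSC test parameters. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    size_t blksz   = (argc > 1) ? strtoul(argv[1], NULL, 10) : 4UL << 20;
    size_t nblocks = 1024;                    /* 4 GB total at 4 MB blocks */
    char path[64];
    snprintf(path, sizeof path, "/global/scratch/testfile.%d", getpid());

    char *buf = malloc(blksz);
    if (!buf) { perror("malloc"); return 1; }
    memset(buf, 0xab, blksz);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror(path); return 1; }

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (size_t i = 0; i < nblocks; i++)
        if (write(fd, buf, blksz) != (ssize_t)blksz) { perror("write"); return 1; }
    fsync(fd);                                /* include flush time */
    gettimeofday(&t1, NULL);
    close(fd);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    double mb   = blksz * nblocks / 1048576.0;
    printf("wrote %.1f MB in %.2f s: %.1f MB/s\n", mb, secs, mb / secs);
    free(buf);
    return 0;
}
```

Running several copies concurrently, one per I/O process, and summing the per-process rates approximates the aggregate numbers plotted above.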
[Chart: pNSD servers to /global/scratch (busy). Write and read bandwidth in MB/s (y-axis 1,000 to 9,000) versus number of I/O processes (1, 2, 4, 8, 16) at block sizes of 1 MB, 4 MB, 8 MB, and 16 MB.]
[Chart: DVS servers to /global/scratch (idle). Write and read bandwidth in MB/s (y-axis 9,200 to 11,400) versus number of I/O processes (1, 2, 4, 8) at a 4 MB block size.]
[Chart: Hopper compute nodes to /global/scratch (idle). Write and read bandwidth in MB/s (y-axis 2,000 to 12,000) versus number of I/O processes (24 to 3,072) on packed nodes.]
[Chart: Hopper compute nodes to /global/scratch (busy). Write and read bandwidth in MB/s (y-axis 1,000 to 8,000) versus number of I/O processes (24 to 3,072) on packed nodes.]
Hopper Filesystems
- External Lustre
– 2 local scratch filesystems
– 2+ PB of user storage
– 70 GB/s aggregate bandwidth
- External nodes
– 26 LSI 7900s
– 52 OSSes with 6 OSTs per OSS (striping across these OSTs is sketched after this list)
– 4 MDS with failover
- 56 LNET routers
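Since aggregate Lustre bandwidth depends heavily on how files are striped across the 312 OSTs (52 OSSes with 6 OSTs each), here is a sketch of requesting wide striping from application code. It assumes liblustreapi (present on Lustre clients; link with -llustreapi) and its classic llapi_file_create() call; the path and stripe parameters are illustrative, not NERSC defaults.

```c
/* Sketch: create a file striped across many OSTs before a large write.
 * Assumes liblustreapi is installed; path and stripe settings are
 * placeholders, not NERSC's configuration. */
#include <stdio.h>
#include <lustre/lustreapi.h>

int main(void)
{
    const char *path = "/scratch/checkpoint.dat";   /* hypothetical path */
    unsigned long long stripe_size = 4ULL << 20;    /* 4 MB per stripe */
    int stripe_offset  = -1;                        /* let Lustre pick the start OST */
    int stripe_count   = 64;                        /* stripe over 64 OSTs */
    int stripe_pattern = 0;                         /* default RAID0 layout */

    int rc = llapi_file_create(path, stripe_size, stripe_offset,
                               stripe_count, stripe_pattern);
    if (rc) {
        fprintf(stderr, "llapi_file_create failed: %d\n", rc);
        return 1;
    }
    printf("created %s with %d stripes of %llu bytes\n",
           path, stripe_count, stripe_size);
    return 0;
}
```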
[Chart: IOR with 2,880 MPI tasks, MPI-IO shared file, aggregate bandwidth. Write and read in MB/s (y-axis 10,000 to 60,000) versus block size (10,000; 1,000,000; 1,048,576 bytes).]
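The MPI-IO runs above write a single shared file from all tasks. Below is a minimal sketch of that access pattern, far simpler than IOR itself: each rank writes one disjoint block with a collective call. The file name and block size are placeholders.

```c
/* Sketch of IOR-style shared-file I/O: every rank writes one disjoint
 * block of a single file with collective MPI-IO. File name and block
 * size are placeholders. Build with the MPI compiler wrapper. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const MPI_Offset blksz = 1 << 20;           /* 1 MB per rank */
    char *buf = malloc(blksz);
    memset(buf, rank & 0xff, blksz);

    MPI_File fh;
    int rc = MPI_File_open(MPI_COMM_WORLD, "ior_shared.dat",
                           MPI_MODE_CREATE | MPI_MODE_WRONLY,
                           MPI_INFO_NULL, &fh);
    if (rc != MPI_SUCCESS) MPI_Abort(MPI_COMM_WORLD, 1);

    /* Each rank writes at its own offset; the _all variant is collective,
     * letting the MPI-IO layer aggregate requests across ranks. */
    MPI_File_write_at_all(fh, rank * blksz, buf, blksz, MPI_BYTE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    free(buf);
    MPI_Finalize();
    return 0;
}
```

File-per-processor mode (next chart) instead opens one file per rank, avoiding shared-file contention, which is consistent with the higher aggregate rates shown there.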
[Chart: IOR with 2,880 MPI tasks, file per processor, aggregate bandwidth. Write and read in MB/s (y-axis 64,000 to 73,000) versus block size (10,000; 1,000,000; 1,048,576 bytes).]
[Chart: Hopper compute nodes to /scratch (Lustre). Write and read bandwidth in MB/s (y-axis 5,000 to 40,000) versus number of I/O processes (24 to 3,072) on packed nodes.]
Conclusions
- The mix of dedicated external Lustre and shared NGF filesystems works well for user workflows, with mostly good performance.
- Shared-file I/O is an issue for both Lustre- and DVS-served filesystems.
- Cray and NERSC are working together on DVS improvements.