SLIDE 1
DVS, GPFS and External Lustre at NERSC – How It’s Working on Hopper

Tina Butler, Rei Chi Lee, Gregory Butler 05/25/11 CUG 2011

SLIDE 2

NERSC is the Primary Computing Center for DOE Office of Science

  • NERSC serves a large population

– Approximately 3,000 users, 400 projects, 500 codes

  • Focus on “unique” resources

– Expert consulting and other services
– High-end computing & storage systems

  • NERSC is known for:

– Excellent services & diverse workload

[Pie chart: 2010 allocations by science area – Physics, Math + CS, Astrophysics, Chemistry, Climate, Combustion, Fusion, Lattice Gauge, Life Sciences, Materials, Other]

SLIDE 3

NERSC Systems

Large-Scale Computing Systems

Franklin (NERSC-5): Cray XT4

  • 9,532 compute nodes; 38,128 cores
  • ~25 Tflop/s on applications; 356 Tflop/s peak

Hopper (NERSC-6): Cray XE6

  • Phase 1: Cray XT5, 668 nodes, 5,344 cores
  • Phase 2: Cray XE6, 6,384 nodes, 153,216 cores; 1.28 Pflop/s peak

HPSS Archival Storage

  • 40 PB capacity
  • 4 Tape libraries
  • 150 TB disk cache

NERSC Global Filesystem (NGF) Uses IBM’s GPFS

  • 1.5 PB capacity
  • 10 GB/s of bandwidth

Clusters (140 Tflop/s total)

Carver

  • IBM iDataplex cluster

PDSF (HEP/NP)

  • ~1K core throughput cluster

Magellan Cloud testbed

  • IBM iDataplex cluster

GenePool (JGI)

  • ~5K core throughput cluster

Analytics

  • Euclid (512 GB shared memory)
  • Dirac GPU testbed (48 nodes)

SLIDE 4

Lots of users, multiple systems, lots of data

  • At the end of the '90s it was becoming increasingly clear that data management was a huge issue.

  • Users were generating larger and larger data sets and copying their data to multiple systems for pre- and post-processing.

  • Wasted time and wasted space
  • Needed to help users be more productive

SLIDE 5

Global Unified Parallel Filesystem

  • In 2001 NERSC began the GUPFS project.

– High performance
– High reliability
– Highly scalable
– Center-wide shared namespace

  • Assess emerging storage, fabric and filesystem technology

  • Deploy across all production systems

SLIDE 6

NERSC Global Filesystem (NGF)

  • First production in 2005 using GPFS

– Multi-cluster support (see the sketch below)
– Shared namespace
– Separate data and metadata partitions
– Shared lock manager
– Filesystems served over Fibre Channel and Ethernet
– Partitioned server space through private NSDs
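
GPFS multi-cluster support is what lets each client system mount NGF filesystems owned by a central cluster. The sketch below is illustrative only, not NERSC's actual configuration: cluster names, contact nodes, key file paths, and device names are made-up placeholders, and in practice administrators exchange the keys out of band.

```python
# Illustrative GPFS multi-cluster setup (NOT NERSC's real configuration).
# Cluster names, contact nodes, key paths, and device names are placeholders.
import subprocess

def run(cmd):
    """Echo and execute a GPFS admin command."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# On the client cluster: generate its authentication key pair.
run(["mmauth", "genkey", "new"])

# On the owning (NGF) cluster: trust the client cluster and grant it access
# to the filesystem backing /project.
run(["mmauth", "add", "client.example.gov", "-k", "/tmp/client_id_rsa.pub"])
run(["mmauth", "grant", "client.example.gov", "-f", "project"])

# Back on the client cluster: describe the remote NGF cluster, map its
# filesystem to a local device name, and mount it at /project.
run(["mmremotecluster", "add", "ngf.example.gov",
     "-n", "ngfserver1,ngfserver2", "-k", "/tmp/ngf_id_rsa.pub"])
run(["mmremotefs", "add", "project", "-f", "project",
     "-C", "ngf.example.gov", "-T", "/project"])
run(["mmmount", "project"])
```

The point of the sketch is the division of labor: the owning cluster authorizes and grants, the client cluster defines the remote cluster and filesystem and mounts it into the shared namespace.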

SLIDE 7

NERSC Global Filesystem (NGF)

[Diagram: NGF servers, NGF disk, and the NGF SAN on an Ethernet network, with private NSD servers (pNSDs) and IB links connecting Carver/Magellan, Euclid, PDSF/Planck, and Dirac; Franklin with its own SAN and disk; Hopper with its external login nodes and the Hopper external filesystem]

SLIDE 8

NGF Configuration

  • NSD servers are commodity hardware

– 28 core servers
– 26 private NSD servers

  • 8 for Hopper; 14 for Carver; 8 for Planck (PDSF)
  • Storage is heterogeneous

– DDN 9900 for data LUNs
– HDS 2300 for data and metadata LUNs
– Have also used Engenio and Sun

  • Fabric is heterogeneous

– FC-8 and 10 GbE for data transport
– Ethernet for control/metadata traffic

SLIDE 9

NGF Filesystems

  • Collaborative - /project

– 873 TB, ~12 GB/s, served over FC-8
– 4 DDN 9900s

  • Scratch - /global/scratch

– 873 TB, ~12 GB/s, served over FC-8
– 4 DDN 9900s

  • User homes – /global/u1, /global/u2

– 40 TB, ~3-5 GB/s, served over Ethernet
– HDS 2300

  • Common area - /global/common, syscommon

– ~5 TB, ~3-5 GB/s, served over Ethernet
– HDS 2300

SLIDE 10

NGF /project

[Diagram: /project – 870 TB (~12 GB/s), 730 TB increase planned for July 2011; served by GPFS servers, SGNs, and DTNs over FC (20x2xFC4, 8x4xFC8, 12x4xFC8, 4x2xFC4, 2x2xFC4) and IB, reaching Euclid, Dirac, Franklin (SeaStar), Hopper DVS (Gemini), and PDSF/Planck and Carver/Magellan through pNSDs]

SLIDE 11

NGF global scratch

[Diagram: /global/scratch – 870 TB (~12 GB/s), no increase planned; served by SGNs and DTNs over FC (8x4xFC8, 12x4xFC8, 4x2xFC4, 2x2xFC4) and IB, reaching Euclid, Dirac, Franklin, Hopper DVS (Gemini), and PDSF/Planck and Carver/Magellan through pNSDs]

SLIDE 12

NGF global homes

[Diagram: /global/homes – 40 TB, 40 TB increase planned for July 2011; served by GPFS servers over Ethernet (4x1-10Gb) and FC to Euclid, Carver, Magellan, Dirac, Franklin, Hopper, PDSF/Planck, SGNs, and DTNs]

SLIDE 13

Hopper Configuration

[Diagram: Hopper configuration – 52 OSSes behind an FC switch fabric fronting 26 groups of 12 LUNs; 2 MDS plus 2 spares with LSI 3992 metadata storage (RAID 1+0); 8 pNSD servers, GPFS storage and metadata, 4 esDM servers, and the NERSC 10 GbE LAN to HPSS; the main system QDR switch fabric; 12 external login servers; DVS, NGF, DVS/DSL, LNET, and MOM nodes]

SLIDE 14

DVS on Hopper

  • 16 DVS servers for NGF filesystems

– IB-connected to private NSD servers
– GPFS remote cluster serving compute and MOM nodes
– 2 DVS nodes dedicated to MOMs
– Cluster parallel mode

  • 32 DVS DSL servers on repurposed compute nodes

– Load-balanced for shared root
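
On the compute nodes these projections appear as DVS mounts. The helper below is a hypothetical illustration (not from the slides) of listing DVS-projected filesystems and their server lists by scanning /proc/mounts; it assumes the server list is carried in a nodename= mount option, which may differ between DVS versions.

```python
# Hypothetical helper: report DVS-projected filesystems on a Cray node.
# Assumes DVS mounts expose their servers via a "nodename=" mount option.
def dvs_mounts(path="/proc/mounts"):
    mounts = []
    with open(path) as f:
        for line in f:
            device, mountpoint, fstype, options = line.split()[:4]
            if fstype != "dvs":
                continue
            servers = []
            for opt in options.split(","):
                if opt.startswith("nodename="):
                    servers = opt[len("nodename="):].split(":")
            mounts.append((mountpoint, servers))
    return mounts

if __name__ == "__main__":
    for mountpoint, servers in dvs_mounts():
        print(f"{mountpoint}: {len(servers)} DVS server(s) -> {servers}")
```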

SLIDE 15

[Chart: pNSD servers to /global/scratch (idle) – write and read bandwidth in MB/s vs. number of I/O processes (1, 2, 4, 8, 16) at 1, 4, 8, and 16 MB block sizes]
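
For readers who want to reproduce this kind of measurement on their own filesystem, the following is a deliberately simple, single-process sketch of timing streaming writes at the same block sizes. It is not the benchmark used for these plots, and the target path is a placeholder.

```python
# Minimal, hypothetical bandwidth probe (not the benchmark behind these plots).
import os, time

TARGET = "/global/scratch/bandwidth_test.dat"   # placeholder path
TOTAL_BYTES = 8 * 1024**3                       # 8 GiB per trial

for block_mb in (1, 4, 8, 16):
    block = b"\0" * (block_mb * 1024**2)
    start = time.time()
    with open(TARGET, "wb", buffering=0) as f:  # unbuffered binary writes
        written = 0
        while written < TOTAL_BYTES:
            f.write(block)
            written += len(block)
        os.fsync(f.fileno())                    # flush data before timing stops
    elapsed = time.time() - start
    print(f"{block_mb:2d} MB blocks: {TOTAL_BYTES / elapsed / 1024**2:.0f} MB/s write")
```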

SLIDE 16

[Chart: pNSD servers to /global/scratch (busy) – write and read bandwidth in MB/s vs. number of I/O processes (1, 2, 4, 8, 16) at 1, 4, 8, and 16 MB block sizes]

SLIDE 17

[Chart: DVS servers to /global/scratch (idle) – write and read bandwidth in MB/s vs. number of I/O processes (1, 2, 4, 8) at a 4 MB block size]

SLIDE 18

[Chart: Hopper compute nodes to /global/scratch (idle) – write and read bandwidth in MB/s vs. number of I/O processes on packed nodes (24-3072)]

SLIDE 19

[Chart: Hopper compute nodes to /global/scratch (busy) – write and read bandwidth in MB/s vs. number of I/O processes on packed nodes (24-3072)]

SLIDE 20

Hopper Filesystems

  • External Lustre

– 2 local scratch filesystems
– 2+ PB of user storage
– 70 GB/s aggregate bandwidth

  • External nodes

– 26 LSI 7900
– 52 OSSes with 6 OSTs per OSS
– 4 MDS with failover

  • 56 LNET routers
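
With 52 OSSes and 6 OSTs per OSS, per-file striping is what lets a single large file tap a meaningful fraction of the aggregate 70 GB/s. The snippet below is a generic illustration using the standard Lustre lfs commands; the directory path is a placeholder, not an actual Hopper path.

```python
# Generic Lustre striping example (directory path is a placeholder).
import os
import subprocess

target_dir = "/scratch/username/wide_striped"   # hypothetical scratch directory
os.makedirs(target_dir, exist_ok=True)

# New files created under target_dir will be striped across all available
# OSTs (-c -1), spreading a large file over many OSSes.
subprocess.run(["lfs", "setstripe", "-c", "-1", target_dir], check=True)

# Show the striping layout that new files will inherit.
subprocess.run(["lfs", "getstripe", target_dir], check=True)
```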

SLIDE 21

[Chart: IOR, 2,880 MPI tasks, MPI-IO shared file – aggregate write and read bandwidth in MB/s vs. block size]

SLIDE 22

[Chart: IOR, 2,880 MPI tasks, file per process – aggregate write and read bandwidth in MB/s vs. block size]
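
The gap between the two IOR modes above comes down to whether all ranks share one file or each rank writes its own. The mpi4py sketch below is not the authors' benchmark; the file names and 4 MiB block size are illustrative, and it only shows the two access patterns side by side.

```python
# Illustration of the two IOR-style access patterns (not the actual benchmark).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

block = 4 * 1024 * 1024                       # 4 MiB per rank, like one transfer
data = np.full(block, rank % 256, dtype=np.uint8)

# Shared-file mode (IOR MPI-IO aggregate): all ranks write disjoint offsets of
# a single file, so the filesystem must coordinate access to one object.
fh = MPI.File.Open(comm, "shared.dat", MPI.MODE_WRONLY | MPI.MODE_CREATE)
fh.Write_at_all(rank * block, data)
fh.Close()

# File-per-process mode (IOR -F): each rank writes a private file, avoiding
# shared-file coordination at the cost of many more files.
fh = MPI.File.Open(MPI.COMM_SELF, f"fpp_{rank:05d}.dat",
                   MPI.MODE_WRONLY | MPI.MODE_CREATE)
fh.Write_at_all(0, data)
fh.Close()
```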

SLIDE 23

[Chart: Hopper compute nodes to /scratch (Lustre) – write and read bandwidth in MB/s vs. number of I/O processes on packed nodes (24-3072)]

SLIDE 24

Conclusions

  • The mix of dedicated external Lustre and shared NGF filesystems works well for user workflows, with mostly good performance.

  • Shared-file I/O is an issue for both Lustre and DVS-served filesystems.

  • Cray and NERSC are working together on DVS and shared-file I/O issues through the Center of Excellence.

SLIDE 25

Acknowledgments

This work was supported by the Director, Office of Science, Division of Mathematical, Information, and Computational Sciences of the U.S. Department of Energy under contract number DE-AC02-05CH11231. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy.
