 
DVS, GPFS and External Lustre at NERSC – How It’s Working on Hopper
Tina Butler, Rei Chi Lee, Gregory Butler
05/25/11, CUG 2011
NERSC is the Primary Computing Center for DOE Office of Science
• NERSC serves a large population
  – Approximately 3000 users, 400 projects, 500 codes
• Focus on “unique” resources
  – Expert consulting and other services
  – High end computing & storage systems
• NERSC is known for:
  – Excellent services & diverse workload
[Pie chart: 2010 allocations by science area: Physics, Math + CS, Astrophysics, Chemistry, Climate, Combustion, Fusion, Lattice Gauge, Life Sciences, Materials, Other]
NERSC Systems
Large-Scale Computing Systems
Franklin (NERSC-5): Cray XT4
• 9,532 compute nodes; 38,128 cores
• ~25 Tflop/s on applications; 356 Tflop/s peak
Hopper (NERSC-6): Cray XE6
• Phase 1: Cray XT5, 668 nodes, 5344 cores
• Phase 2: Cray XE6, 6384 nodes, 153216 cores; 1.28 Pflop/s peak
Clusters (140 Tflops total)
Carver
• IBM iDataplex cluster
PDSF (HEP/NP)
• ~1K core throughput cluster
Magellan Cloud testbed
• IBM iDataplex cluster
GenePool (JGI)
• ~5K core throughput cluster
NERSC Global Filesystem (NGF)
• Uses IBM’s GPFS
• 1.5 PB capacity
• 10 GB/s of bandwidth
HPSS Archival Storage
• 40 PB capacity
• 4 Tape libraries
• 150 TB disk cache
Analytics
• Euclid (512 GB shared memory)
• Dirac GPU testbed (48 nodes)
Lots of users, multiple systems, lots of data
• At the end of the 90’s it was becoming increasingly clear that data management was a huge issue.
• Users were generating larger and larger data sets and copying their data to multiple systems for pre- and post-processing.
• Wasted time and wasted space
• Needed to help users be more productive
Global Unified Parallel Filesystem
• In 2001 NERSC began the GUPFS project.
  – High performance
  – High reliability
  – Highly scalable
  – Center-wide shared namespace
• Assess emerging storage, fabric and filesystem technology
• Deploy across all production systems
NERSC Global Filesystem (NGF)
• First production in 2005 using GPFS
  – Multi-cluster support
  – Shared namespace
  – Separate data and metadata partitions
  – Shared lock manager
  – Filesystems served over Fibre Channel and Ethernet
  – Partitioned server space through private NSDs
NERSC Global Filesystem (NGF)
[Diagram of NGF connectivity. Labels: Ethernet network, NGF servers, NGF SAN, NGF disk, Franklin SAN, Franklin disk, pNSD servers, IB links, Franklin, Hopper, Hopper external login, Hopper external filesystem, Carver/Magellan, Dirac, Euclid, PDSF/Planck]
NGF Configuration
• NSD servers are commodity
  – 28 core servers
  – 26 private NSD servers
    • 8 for hopper; 14 for carver; 8 for planck (PDSF)
• Storage is heterogeneous
  – DDN 9900 for data LUNs
  – HDS 2300 for data and metadata LUNs
  – Have also used Engenio and Sun
• Fabric is heterogeneous
  – FC-8 and 10 GbE for data transport
  – Ethernet for control/metadata traffic
NGF Filesystems
• Collaborative - /project
  – 873 TB, ~12 GB/s, served over FC-8
  – 4 DDN 9900s
• Scratch - /global/scratch
  – 873 TB, ~12 GB/s, served over FC-8
  – 4 DDN 9900s
• User homes - /global/u1, /global/u2
  – 40 TB, ~3-5 GB/s, served over Ethernet
  – HDS 2300
• Common area - /global/common, syscommon
  – ~5 TB, ~3-5 GB/s, served over Ethernet
  – HDS 2300
NGF /project
• /project: 870 TB (~12 GB/s); 730 TB increase, July 2011
[Diagram of /project connectivity. Systems: Franklin (Sea *, DVS), Hopper (Gemini, DVS), DTNs, Euclid, Dirac, SGNs, PDSF, Planck, Carver, Magellan. Components: GPFS server, pNSD servers, IB. Link labels: FC (8x4xFC8), FC (20x2xFC4), FC (2x2xFC4), FC (12x4xFC8), FC (4x2xFC4)]
NGF global scratch
• /global/scratch: 870 TB (~12 GB/s); no increase planned
[Diagram of /global/scratch connectivity. Systems: Franklin, Hopper (Gemini, DVS), DTNs, Euclid, Dirac, SGNs, PDSF, Planck, Carver, Magellan. Components: pNSD servers, IB. Link labels: FC (8x4xFC8), FC (2x2xFC4), FC (12x4xFC8), FC (4x2xFC4)]
NGF global homes
• /global/homes: 40 TB; 40 TB increase, July 2011
[Diagram of /global/homes connectivity. Systems: Franklin, Hopper, DTNs, Euclid, Dirac, SGNs, Carver, Magellan, PDSF, Planck. Components: GPFS server, FC, Ethernet (4x1-10Gb)]
Hopper Configuration
[Diagram of the Hopper configuration. Main system labels: DVS, DVS/DSL, LNET, MOM. NGF side: pNSD servers, GPFS main storage, GPFS metadata. External side: NERSC 10GbE LAN to HPSS, 12 external login servers, 4 esDM servers, 2 MDS, 2 spare, LSI 3992 with RAID 1+0, QDR switch fabric, 52 OSS, FC switch fabric, and a row of storage units labeled 12 LUNs each]
DVS on Hopper
• 16 DVS servers for NGF filesystems
  – IB connected to private NSD servers
  – GPFS remote cluster serving compute and MOM nodes
  – 2 DVS nodes dedicated to MOMs
  – Cluster parallel
• 32 DVS DSL servers on repurposed compute nodes
  – Load-balanced for shared root
pNSD servers to /global/scratch (idle)
[Chart: write and read bandwidth (MB/s, axis from 9000 to 11000) versus number of I/O processes (1, 2, 4, 8, 16) at block sizes of 1 MB, 4 MB, 8 MB and 16 MB]
pNSD servers to /global/scratch (busy)
[Chart: write and read bandwidth (MB/s, axis from 0 to 9000) versus number of I/O processes (1, 2, 4, 8, 16) at block sizes of 1 MB, 4 MB, 8 MB and 16 MB]
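The two charts above sweep the number of I/O processes at each block size. A minimal sketch of that style of measurement follows, assuming concurrent writer threads stand in for I/O processes and a temporary local directory stands in for /global/scratch. The slides do not name the tool used for these runs, so this is not the authors' benchmark; the function names `write_stream` and `measure_write_mb_per_s` are hypothetical, and the sweep is deliberately smaller than the one on the slides.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def write_stream(path, block_size, blocks):
    """Write `blocks` blocks of `block_size` bytes to `path`; return bytes written."""
    buf = b"\0" * block_size
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())  # push data to storage before the timer stops
    return block_size * blocks


def measure_write_mb_per_s(directory, n_writers, block_size, blocks_per_writer):
    """Aggregate write rate in MB/s (1 MB = 2**20 bytes) for concurrent writers.

    Assumption: one thread per writer, one file per writer. A real run on a
    parallel filesystem would use separate processes on separate nodes.
    """
    paths = [os.path.join(directory, f"stream_{i}.dat") for i in range(n_writers)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_writers) as pool:
        total_bytes = sum(
            pool.map(lambda p: write_stream(p, block_size, blocks_per_writer), paths)
        )
    elapsed = time.perf_counter() - start
    for p in paths:
        os.remove(p)  # clean up so repeated runs start from an empty directory
    return total_bytes / elapsed / 2**20


if __name__ == "__main__":
    # Small illustrative sweep over writer counts and block sizes.
    with tempfile.TemporaryDirectory() as scratch:
        for block_mb in (1, 4):
            for writers in (1, 2, 4):
                rate = measure_write_mb_per_s(scratch, writers, block_mb * 2**20, 2)
                print(f"{block_mb} MB blocks, {writers} writers: {rate:.0f} MB/s")
```

Numbers from a sketch like this reflect the local disk and page cache of the machine it runs on, so they are not comparable to the figures on the slides.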
DVS servers to /global/scratch (idle)
[Chart: write and read bandwidth (MB/s, axis from 9200 to 11400) versus number of I/O processes (1, 2, 4, 8) at a block size of 4 MB]
Hopper compute nodes to /global/scratch (idle)
[Chart: write and read bandwidth (MB/s, axis from 0 to 12000) versus number of I/O processes on packed nodes (24, 48, 96, 192, 384, 768, 3072)]
Hopper compute nodes to /global/scratch (busy)
[Chart: write and read bandwidth (MB/s, axis from 0 to 8000) versus number of I/O processes on packed nodes (24, 48, 96, 192, 384, 768, 1536, 3072)]
Hopper Filesystems
• External Lustre
  – 2 local scratch filesystems
  – 2+ PBs user storage
  – Aggregate 70 GB/s
• External nodes
  – 26 LSI 7900
  – 52 OSSes with 6 OSTs per OSS
  – 4 MDS with failover
• 56 LNET routers
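The counts on this slide imply a few derived figures. The arithmetic below is a back-of-the-envelope sketch using only the numbers stated above; the per-server and per-target rates assume the 70 GB/s aggregate is spread evenly, which the slides do not state.

```python
# Figures taken from the "Hopper Filesystems" slide.
OSS_COUNT = 52
OSTS_PER_OSS = 6
CONTROLLER_COUNT = 26       # LSI 7900 units
AGGREGATE_GB_PER_S = 70.0   # aggregate across both scratch filesystems

# Total object storage targets across the external Lustre servers.
total_osts = OSS_COUNT * OSTS_PER_OSS

# OSS nodes attached per LSI 7900 unit.
oss_per_controller = OSS_COUNT / CONTROLLER_COUNT

# Assumption: bandwidth is divided evenly across servers and targets.
gb_per_s_per_oss = AGGREGATE_GB_PER_S / OSS_COUNT
gb_per_s_per_ost = AGGREGATE_GB_PER_S / total_osts

print(f"total OSTs: {total_osts}")
print(f"OSS per LSI 7900: {oss_per_controller:.0f}")
print(f"per-OSS share: {gb_per_s_per_oss:.2f} GB/s")
print(f"per-OST share: {gb_per_s_per_ost:.3f} GB/s")
```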
IOR 2880 MPI Tasks MPI-IO Aggregate
[Chart: write and read bandwidth (MB/s, axis from 0 to 60000) versus block size (10000, 1000000, 1048576)]
IOR 2880 MPI Tasks File Per Processor -- Aggregate
[Chart: write and read bandwidth (MB/s, axis from 64000 to 73000) versus block size (10000, 1000000, 1048576)]
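The two IOR runs differ in where each task's data lands: in the MPI-IO run all 2880 tasks write into one shared file, while in the file-per-processor run each task owns its own file. The sketch below illustrates the offset arithmetic, assuming the segmented shared-file layout in which each segment holds one block per task. The function names are hypothetical, the 1 MiB alignment unit is an assumption chosen for illustration rather than a value from the slides, and no claim is made that alignment explains the measured results.

```python
def shared_file_offset(rank, segment, block_size, num_tasks):
    """Byte offset of a task's block in a single shared file.

    Assumed layout: each segment contains one block per task, in rank order.
    """
    return (segment * num_tasks + rank) * block_size


def file_per_process_offset(segment, block_size):
    """Byte offset of a block when each task writes its own file."""
    return segment * block_size


NUM_TASKS = 2880            # task count from the slides
ALIGNMENT = 2**20           # assumed 1 MiB unit, not taken from the slides

for block_size in (10000, 1000000, 1048576):   # block sizes from the charts
    last = shared_file_offset(NUM_TASKS - 1, 0, block_size, NUM_TASKS)
    aligned = block_size % ALIGNMENT == 0
    print(
        f"block size {block_size}: last task starts at byte {last}, "
        f"blocks 1 MiB aligned: {aligned}"
    )
```

In the shared-file case, neighboring tasks write to adjacent byte ranges of the same file, so their blocks can meet inside one alignment unit when the block size is not a multiple of it. In the file-per-processor case every task starts at offset 0 of its own file.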
Hopper compute nodes to /scratch (lustre)
[Chart: write and read bandwidth (MB/s, axis from 0 to 40000) versus number of I/O processes on packed nodes (24, 48, 96, 192, 384, 768, 1536, 3072)]
Conclusions
• The mix of dedicated external Lustre and shared NGF filesystems works well for user workflows with mostly good performance.
• Shared file I/O is an issue for both Lustre and DVS-served filesystems.
• Cray and NERSC working together on DVS and shared file I/O issues through Center of Excellence.
Acknowledgments
This work was supported by the Director, Office of Science, Division of Mathematical, Information, and Computational Sciences of the U.S. Department of Energy under contract number DE-AC02-05CH11231. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy.