SLIDE 1 Application Performance under Different XT Operating Systems
Courtenay T. Vaughan, John P. Van Dyke, and Suzanne M. Kelly Sandia National Laboratories Cray User Group May 2008
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
SLIDE 2 Background
- Cray XT3 series ran the Catamount OS
– Lightweight kernel based on a kernel developed at Sandia
- With the XT4, Cray is moving to Compute Node Linux (CNL)
– Tuned Linux kernel
– Added support for quad-core processors
SLIDE 3 Catamount N-Way (CNW)
- Developed as risk mitigation for ORNL with funding from the DOE Office of Science
– Jaguar being upgraded to quad-core processors
- Designed to support N cores per processor
– Not just 4 cores per processor
– Able to run on nodes with 1 or 2 cores per processor without recompiling
– Able to run on a mixture of nodes
SLIDE 4 Comparison of CNL and CNW
- CNL based on the Linux kernel
– Linux supports multiple users, processes, and services
– Undesirable features configured “off” when the kernel was built
– Tuned to minimize interrupts
- CNW designed as a limited-function kernel
– Device drivers only for console output and communication with the SeaStar NIC
– No virtual memory or unnecessary features
– Each node supports exactly one user running one application on 1 to N cores
SLIDE 5 Tests on pre-upgrade Jaguar
- Conducted last summer
- Jaguar was a mix of XT3 and XT4 dual-core nodes
- Specific problem sizes for each code
- Results from 3 codes
– Gyrokinetic Toroidal Code (GTC)
- 3-D PIC code for magnetic confinement fusion
– Parallel Ocean Program (POP)
– VH1
- a multidimensional ideal compressible hydrodynamics code
SLIDE 6
Jaguar Results
Application  Configuration         CNL 2.0.03+ (sec)  CNW 2.0.05+ (sec)  Improvement
GTC          1024 core XT3          595.6              584.0              2.0%
GTC          4096 core XT3          614.6              593.8              3.5%
POP          4800 core XT3           90.6               77.6             16.8%
POP          20000 core XT3/XT4      98.8               75.2             31.4%
VH1          1024 core XT3           22.7               20.9              8.6%
VH1          4096 core XT3          137.1              117.4             16.8%
VH1          20000 core XT3/XT4     786.5              778.9              1.0%
VH1          20000 core XT3/XT4    1186.0              981.7             20.8%
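The Improvement figures throughout these slides appear to be the CNW speedup expressed relative to the CNW runtime, i.e. (CNL − CNW) / CNW. A quick sketch of that calculation (an illustrative helper, not code from the paper), checked against two rows of the Jaguar table:

```python
def improvement(cnl_sec: float, cnw_sec: float) -> float:
    """Percent improvement of CNW over CNL, relative to the CNW runtime."""
    return (cnl_sec - cnw_sec) / cnw_sec * 100.0

# GTC at 1024 cores: CNL 595.6 s vs. CNW 584.0 s
print(round(improvement(595.6, 584.0), 1))  # 2.0
# POP at 20000 cores: CNL 98.8 s vs. CNW 75.2 s
print(round(improvement(98.8, 75.2), 1))    # 31.4
```

Both values reproduce the slide's Improvement column to the precision shown.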
SLIDE 7 Red Storm results
- Both OSes based on version 2.0.44
- Machine configured with 12960 nodes (25920 cores)
– Ran with the Moab scheduler for CNW
- resulted in some bad job layouts
– Ran with interactive nodes for CNL
- Codes run:
– CTH
– PARTISN
- time-dependent neutron transport code
SLIDE 8 CTH 7.1 - Shaped Charge (90 x 216 x 90/proc)
[Figure: time/timestep (sec) vs. number of processors (1–8192) for CNW and CNL]
SLIDE 9 Partisn - sn timing - 24 x 24 x 24/proc
[Figure: time (sec) vs. number of processors (1–8192) for CNW and CNL]
SLIDE 10 HPCC
- Series of 7 benchmarks in one package; we generally use 5 of them:
– PTRANS - matrix transposition
– HPL - Linpack direct dense system solve
– STREAMS - memory bandwidth
– Random Access - global random memory access
– FFT - large 1-D FFT
- Code is C with libraries
- HPL not used for these runs
- Optimized Random Access and FFT
- Version 1.2
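For context on the STREAMS numbers that follow: STREAMS reports sustained memory bandwidth from simple vector kernels, counting bytes read and written per element. A minimal Python sketch of the triad kernel and its bandwidth accounting (illustrative only; the real benchmark is compiled C and far faster than this interpreted version):

```python
import time

def stream_triad(n=1_000_000, q=3.0):
    # STREAM-style triad kernel: a[i] = b[i] + q * c[i]
    b = [1.0] * n
    c = [2.0] * n
    start = time.perf_counter()
    a = [bi + q * ci for bi, ci in zip(b, c)]
    elapsed = time.perf_counter() - start
    # The triad reads b and c and writes a: three 8-byte values per element.
    gb_moved = 3 * n * 8 / 1e9
    return a, gb_moved / elapsed

a, gb_per_sec = stream_triad()
print(f"triad bandwidth estimate: {gb_per_sec:.2f} GB/s")
```

The GB/s figures in the tables below are this kind of bytes-moved-over-time measurement, summed across all participating cores.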
SLIDE 11
HPCC on 16384 cores
Benchmark      Units   CNL      CNW      CNW/CNL
PTRANS         GB/s      598.7    894.1  1.49
STREAMS        GB/s    24721    36499    1.48
Random Access  GUP/s      12.7     23.4  1.85
FFT            GFLOPS   1963.8   2272.2  1.16
SLIDE 12 Quad-Core System
- Machine with 4 quad-core Budapest nodes
- Running version 2.0.44
- PGI 6.2.5 compiler
- Run with the Lustre filesystem
- Ran baseline HPCC version 1.0
SLIDE 13
HPCC on 16 cores (4 nodes)
Benchmark      Units   CNL       CNW       CNW/CNL
PTRANS         GB/s     1.612     2.792    1.73
HPL            GFLOPS  66.55     68.02     1.02
STREAMS        GB/s    31.98     35.13     1.10
Random Access  GUP/s    0.01717   0.03502  2.04
FFT            GFLOPS   3.331     3.518    1.06
SLIDE 14
HPCC on 4 cores (4 nodes)
Benchmark      Units   CNL       CNW       CNW/CNL
PTRANS         GB/s     0.576     1.606    2.83
HPL            GFLOPS  17.88     17.90     1.00
STREAMS        GB/s    25.21     25.84     1.02
Random Access  GUP/s    0.06445   0.11823  1.83
FFT            GFLOPS   1.609     1.646    1.02
SLIDE 15
HPCC on 4 cores (2 nodes)
Benchmark      Units   CNL        CNW        CNW/CNL
PTRANS         GB/s     0.488      1.551     3.18
HPL            GFLOPS  17.78      18.03      1.01
STREAMS        GB/s    16.45      18.03      1.10
Random Access  GUP/s    0.006105   0.011476  1.88
FFT            GFLOPS   1.337      1.360     1.02
SLIDE 16
HPCC on 4 cores (1 node)
Benchmark      Units   CNL        CNW        CNW/CNL
PTRANS         GB/s     0.287      1.244     4.33
HPL            GFLOPS  17.59      17.72      1.01
STREAMS        GB/s     7.85       9.95      1.27
Random Access  GUP/s    0.005984   0.011476  1.92
FFT            GFLOPS   0.902      0.959     1.06
SLIDE 17 Additional Codes
– LSMS – electronic structure
– S3D – combustion modeling
– PRONTO – structural analysis
– SAGE – hydrodynamics
– SPPM – 3-D gas dynamics
– UMT – unstructured mesh radiation transport
SLIDE 18 Performance on 16 cores (4 nodes)
Application  CNL seconds  CNW seconds  Improvement
CTH          1513.1       1298.1       16.6%
GTC           664.9        670.6       –
LSMS          290.1        276.7        4.84%
PARTISN       499.3        491.3        1.62%
POP           153.8        151.9        1.22%
PRONTO        241.5        222.0        8.78%
S3D          1949.1       1948.9        0.01%
SAGE          267.8        234.9       14.0%
SPPM          847.8        845.0        0.33%
UMT           502.7        472.3        6.44%
SLIDE 19
Performance on 4 cores (4 nodes)
Application  CNL seconds  CNW seconds  Improvement
CTH           861.4        816.7       5.47%
GTC           583.1        577.7       0.93%
LSMS         1160.6       1105.6       4.97%
PARTISN       175.1        165.5       5.75%
POP           428.0        425.5       0.61%
PRONTO        175.8        164.2       7.06%
S3D          1327.8       1282.5       3.53%
SAGE          170.0        158.9       6.94%
SPPM          294.6        293.1       0.51%
UMT          1768.8       1701.0       3.99%
SLIDE 20
Performance on 4 cores (2 nodes)
Application  CNL seconds  CNW seconds  Improvement
CTH           949.7        877.8       8.19%
GTC           592.9        589.5       0.58%
LSMS         1177.3       1118.6       5.25%
PARTISN       245.5        234.4       4.77%
POP           440.1        435.7       1.01%
PRONTO        186.8        175.0       6.74%
S3D          1482.2       1439.7       2.95%
SAGE          179.9        165.3       8.85%
SPPM          297.3        295.2       0.71%
UMT          1816.2       1760.4       3.17%
SLIDE 21 Performance on 4 cores (1 node)
Application  CNL seconds  CNW seconds  Improvement
CTH          1219.5       1037.8       17.51%
GTC           622.8        622.4        0.06%
LSMS         1208.1       1144.6        5.55%
PARTISN       447.1        441.9        1.16%
POP           467.3        464.3        0.66%
PRONTO        209.1        195.1        7.18%
S3D          1937.3       1940.4        –
SAGE          223.4        190.2       17.47%
SPPM          301.1        297.8        1.11%
UMT          1944.6       1827.6        6.40%
SLIDE 22 Summary
- We developed a version of Catamount for quad-core processors and beyond
- Most applications at scale on dual-core systems run better with CNW than with CNL
– The difference grows with larger numbers of cores
- On our 4-node quad-core system, most applications perform somewhat better with CNW
– Different applications react differently
- A large-scale test with quad-core processors is needed to see whether the effects are cumulative