Application performance on the UK's new HECToR service


  1. Application performance on the UK's new HECToR service. Fiona Reid 1,2, Mike Ashworth 1,3, Thomas Edwards 2, Alan Gray 1,2, Joachim Hein 1,2, Alan Simpson 1,2, Peter Knight 4, Kevin Stratford 1,2, Michele Weiland 1,2. 1 HPCx Consortium; 2 EPCC, The University of Edinburgh; 3 STFC Daresbury Laboratory; 4 EURATOM/UKAEA Fusion Association. CUG, May 5-8th 2008

  2. Acknowledgements • STFC: Roderick Johnstone • Colin Roach, UKAEA, and Bill Dorland, University of Maryland, for assistance porting GS2 to HECToR and supplying the NEV02 benchmark • Jim Phillips and Gengbin Zheng, UIUC, for their assistance installing NAMD on HECToR

  3. Overview • System Introductions • Synthetic Benchmark Results • Application Benchmark Results • Conclusions

  4. Systems for comparison • HPCx (Phase 3): 160 IBM e-Server p575 nodes – SMP cluster, 16 Power5 1.5 GHz cores per node – 32 GB of RAM per node (2 GB per core) – IBM HPS interconnect (aka Federation) – 12.9 TFLOP/s Linpack, No. 101 on the Top500 • HECToR (Phase 1): Cray XT4 – MPP, 5664 nodes, 2 Opteron 2.8 GHz cores per node – 6 GB of RAM per node (3 GB per core) – Cray SeaStar2 torus network – 54.6 TFLOP/s Linpack, No. 17 on the Top500 • Also included in some plots: – HECToR Test and Development System (TDS) – Cray XT4, 64 nodes: 2.6 GHz dual core, 4 GB RAM per node

  5. System Comparison (cont)

                     HPCx                      HECToR
    Chip             IBM Power5 (dual core)    AMD Opteron (dual core)
    Clock            1.5 GHz                   2.8 GHz
    FPUs             2 FMA                     1 Multiply, 1 Add
    Peak Perf/core   6.0 GFlop/s               5.6 GFlop/s
    Cores            2560                      11328
    Peak Perf        15.4 TFLOP/s              63.4 TFLOP/s
    Linpack          12.9 TFLOP/s              54.6 TFLOP/s
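A quick sanity check of the peak figures in the table, using the clock rates and core counts given there together with 4 flops/cycle for the Power5's two FMA units and 2 flops/cycle for the Opteron's multiply and add pipes:

$$P_{\text{peak}} = N_{\text{cores}} \times f_{\text{clock}} \times \text{flops per cycle}$$
$$\text{HPCx: } 2560 \times 1.5\ \text{GHz} \times 4 \approx 15.4\ \text{TFLOP/s} \qquad \text{HECToR: } 11328 \times 2.8\ \text{GHz} \times 2 \approx 63.4\ \text{TFLOP/s}$$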

  6. Synthetic Benchmarks • Memory Bandwidth – Streams • MPI Bandwidth – Intel MPI Benchmarks – PingPing

  7. Memory bandwidth - Streams [Chart: Streams bandwidth (load+store, MB/s, log scale) against array size from 10^3 to 10^9 bytes, for HECToR and the TDS with 1 and 2 cores per node, and for HPCx with 8 and 16 cores per node.]

  8. Memory bandwidth - Streams • The cache hierarchy is clearly visible • HECToR is better at L1 and slightly better on main memory – HPCx has the advantage for intermediate array sizes • Underpopulating nodes (1 core per chip) gives improvements on both systems – memory bandwidth cannot sustain 2 cores per chip – the effect is worse on HECToR than on HPCx, especially on main memory – of course, 1 core per chip means double the resources for the same number of tasks • The TDS has a lower clock rate than HECToR, but higher bandwidth from main memory – the 4 = 2+2 GB RAM per node on the TDS is symmetric, so memory interleaving is possible – the 6 = 4+2 GB RAM per node on HECToR only allows partial interleaving
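For context, the sketch below shows a minimal STREAM-style triad kernel in C, i.e. the kind of loop the Streams benchmark times. The array size, timing method and scalar value are illustrative assumptions, not the configuration used for the HECToR/HPCx measurements.

    /* Minimal STREAM-style triad sketch: a[i] = b[i] + scalar * c[i].
     * Array size and timing are illustrative, not the benchmark settings
     * used for the HECToR/HPCx runs. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)   /* ~16 million doubles per array, well beyond cache */

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c = malloc(N * sizeof(double));
        if (!a || !b || !c) return 1;
        for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];              /* triad kernel */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        double bytes = 3.0 * N * sizeof(double);   /* 2 loads + 1 store per element */
        printf("triad bandwidth %.1f MB/s (check %.1f)\n",
               bytes / secs / 1e6, a[N / 2]);      /* read a[] so the loop is not optimised away */

        free(a); free(b); free(c);
        return 0;
    }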

  9. MPI bandwidth - PingPing [Chart: Intel MPI Multi-PingPing benchmark, bandwidth per task (MB/s, log scale) against message size from 1 byte to 10^7 bytes, for HECToR with 2 cores per node and HPCx with 16 cores per node; HECToR sustains roughly 720 MB/s per task against roughly 140 MB/s on HPCx.] • HPCx reaches its saturation point earlier – HECToR may scale better • On both systems the latency (measured via IMB PingPong) is ~5.5 µs • AlltoAll: HPCx has the advantage for small (<100 byte) messages, HECToR outperforms HPCx for larger messages
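A PingPing exchange differs from PingPong in that both ranks send at the same time, so the link is loaded in both directions. The C/MPI sketch below illustrates that pattern; the message size, repetition count and reported metric are illustrative assumptions, not the Intel MPI Benchmarks implementation.

    /* PingPing-style sketch: ranks 0 and 1 send to each other simultaneously.
     * Message size and repetition count are illustrative assumptions. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int nbytes = 1 << 20;   /* 1 MB messages */
        const int reps   = 100;

        if (rank < 2) {               /* only ranks 0 and 1 take part */
            char *sbuf = malloc(nbytes), *rbuf = malloc(nbytes);
            memset(sbuf, 0, nbytes);
            int peer = 1 - rank;

            double t0 = MPI_Wtime();
            for (int i = 0; i < reps; i++) {
                MPI_Request req;
                /* both ranks post a non-blocking send, then a blocking receive */
                MPI_Isend(sbuf, nbytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req);
                MPI_Recv(rbuf, nbytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Wait(&req, MPI_STATUS_IGNORE);
            }
            double t = MPI_Wtime() - t0;

            if (rank == 0)
                printf("per-task bandwidth %.1f MB/s\n",
                       (double)nbytes * reps / t / 1e6);
            free(sbuf); free(rbuf);
        }
        MPI_Finalize();
        return 0;
    }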

  10. Applications • Fluid Dynamics – PDNS3D – Ludwig • Fusion – Centori – GS2 • Ocean Modelling – POLCOMS • Molecular Dynamics – DL_POLY – LAMMPS – GROMACS (see paper) – NAMD – AMBER (see paper)

  11. Fluid Dynamics: PDNS3D (PCHAN) • Finite difference code for turbulent flow – shock/boundary-layer interaction (SBLI) • Simulates the flow of fluids to study turbulence • T3 benchmark – uses a 360x360x360 grid • Developed by Neil Sandham, University of Southampton
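As a reminder of what a finite-difference kernel looks like on a grid of this size, the C sketch below applies a generic fourth-order central difference for du/dx over a 360x360x360 array. The order of the scheme, the data layout and the boundary handling are illustrative assumptions, not the actual PDNS3D/SBLI discretisation.

    /* Generic 4th-order central difference for du/dx on a 360^3 grid.
     * Illustration only; not the PDNS3D/SBLI scheme. Boundary points
     * (which need one-sided formulas) are left untouched. */
    #include <stdlib.h>

    #define NX 360
    #define NY 360
    #define NZ 360
    #define IDX(i, j, k) ((size_t)(i) + NX * ((size_t)(j) + NY * (size_t)(k)))

    static void ddx(const double *u, double *dudx, double dx)
    {
        for (int k = 0; k < NZ; k++)
            for (int j = 0; j < NY; j++)
                for (int i = 2; i < NX - 2; i++)
                    /* du/dx ~ (-u[i+2] + 8 u[i+1] - 8 u[i-1] + u[i-2]) / (12 dx) */
                    dudx[IDX(i, j, k)] =
                        (-u[IDX(i + 2, j, k)] + 8.0 * u[IDX(i + 1, j, k)]
                         - 8.0 * u[IDX(i - 1, j, k)] + u[IDX(i - 2, j, k)])
                        / (12.0 * dx);
    }

    int main(void)
    {
        size_t n = (size_t)NX * NY * NZ;
        double *u = calloc(n, sizeof(double));        /* ~370 MB per array */
        double *dudx = calloc(n, sizeof(double));
        if (!u || !dudx) return 1;
        ddx(u, dudx, 1.0 / (NX - 1));
        free(u); free(dudx);
        return 0;
    }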

  12. PDNS3D – compilation optimization [Bar chart: PDNS3D T3 benchmark runtime (s, axis 1050 to 1500) on 64 HECToR cores for a range of PGI compilation flag combinations, including -O0 through -O4, -fast and -Mipa=fast variants.]

  13. PDNS3D – system comparison [Chart: PDNS3D T3 benchmark, Time * Cores (s) against core count (10 to 10000, log scale) for HECToR and HPCx Phase 3.]

  14. PDNS3D Memory Bandwidth sensitivity [Two charts: PDNS3D T3 benchmark, Time * Cores (s) against core count; left panel (HPCx): HPCx Phase 3 and HPCx Phase 3 with 8 cores per node; right panel (HECToR): HECToR, HECToR with 1 core per node, HECToR TDS, and HECToR TDS with 1 core per node.] • Underpopulating nodes gives a huge improvement (in terms of performance per core) on HECToR and a slight improvement on HPCx • The TDS outperforms HECToR • c.f. the Streams results

  15. PDNS3D – Optimised version • New optimised version less sensitive to memory bandwidth [Chart: PDNS3D T3 benchmark, Time * Cores (s) against core count (10 to 10000) for HECToR, HPCx Phase 3, HECToR (Opt), HPCx (Opt), and HECToR (Opt, 1 core per node).]

  16. PDNS3D – Optimised version • PathScale gives a further 10-15% improvement [Chart: as on the previous slide, with an additional curve for HECToR (Opt, PathScale).]

  17. Fluid dynamics - Ludwig • Ludwig – a Lattice Boltzmann code for solving the incompressible Navier-Stokes equations – used to study complex fluids – the code uses a regular domain decomposition with local boundary exchanges between the subdomains – two problems considered, one with a binary fluid mixture, the other with shear flow
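The local boundary exchange mentioned above is the usual halo-exchange pattern for a regular domain decomposition. The C/MPI sketch below shows the idea in one dimension with a one-cell halo; the function name, halo width and non-periodic 1-D layout are illustrative assumptions, not Ludwig's actual communication code.

    /* 1-D halo exchange sketch for a regular domain decomposition.
     * local[] holds nlocal interior cells in 1..nlocal plus halo cells
     * at local[0] and local[nlocal+1]. Illustration only. */
    #include <mpi.h>

    static void exchange_halos(double *local, int nlocal, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
        int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

        /* send rightmost interior cell right, receive left halo from the left */
        MPI_Sendrecv(&local[nlocal], 1, MPI_DOUBLE, right, 0,
                     &local[0],      1, MPI_DOUBLE, left,  0,
                     comm, MPI_STATUS_IGNORE);
        /* send leftmost interior cell left, receive right halo from the right */
        MPI_Sendrecv(&local[1],          1, MPI_DOUBLE, left,  1,
                     &local[nlocal + 1], 1, MPI_DOUBLE, right, 1,
                     comm, MPI_STATUS_IGNORE);
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        enum { NLOCAL = 8 };
        double local[NLOCAL + 2] = { 0.0 };
        for (int i = 1; i <= NLOCAL; i++) local[i] = (double)rank;

        exchange_halos(local, NLOCAL, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }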

  18. Ludwig, 256x512x256 lattice [chart]

  19. Fusion • Centori – simulates the fluid flow inside a tokamak reactor; developed by UKAEA Fusion in collaboration with EPCC • GS2 – gyrokinetic simulations of low-frequency turbulence in tokamaks; developed by Bill Dorland et al. [Image: ITER tokamak reactor (www.iter.org)]

  20. CENTORI [Chart: Centori, 128x128x128 problem, Time * Cores (s) against core count (1 to 10000) for HECToR, HPCx, and the TDS.]

  21. GS2 [Chart: GS2 NEV02 benchmark, Time * Cores (s) against core count (10 to 10000) for HECToR, HPCx, and the TDS, with HECToR and the TDS also shown at 1 core per node.]

  22. Ocean Modelling: POLCOMS • Proudman Oceanographic Laboratory Coastal Ocean Modelling System (POLCOMS) – simulation of the marine environment – applications include coastal engineering, offshore industries, fisheries management, marine pollution monitoring, weather forecasting and climate research – uses a 3-dimensional hydrodynamic model

  23. Ocean Modelling: POLCOMS [chart]

  24. Molecular dynamics • DL_POLY – general purpose molecular dynamics package which can be used to simulate systems with very large numbers of atoms • LAMMPS – classical molecular dynamics code which can simulate a wide range of materials • NAMD – classical molecular dynamics code designed for high-performance simulation of large biomolecular systems • AMBER (protein dihydrofolate reductase) – general purpose biomolecular simulation package • GROMACS – general purpose MD package specialising in biochemical systems, e.g. proteins, lipids etc.

  25. DL_POLY – system comparison [Chart: DL_POLY 3.08, Gramicidin A with water solvating (792960 atoms), timestep time * Cores (s) against core count (10 to 1000) for HPCx Phase 3 and HECToR.]

  26. DL_POLY – system comparison [Chart: the same Gramicidin A benchmark on HECToR, fully populated and with 1 core per node.]

  27. LAMMPS [Chart: LAMMPS Rhodopsin benchmark, 4096000 atoms, Loop Time * Cores (s) against core count (10 to 10000) for HPCx Phase 3, HPCx Phase 3 with SMT, and HECToR.]

  28. LAMMPS [Chart: LAMMPS Rhodopsin benchmark, 4096000 atoms, on HECToR: HECToR, HECToR with 1 core per node, HECToR TDS, and HECToR TDS with 1 core per node.]

  29. LAMMPS • On HECToR we can run a problem with 500 million particles • On HPCx the limit is ~100 million particles – fewer cores available – less memory per core
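A rough back-of-envelope check (not a figure from the paper), using the node counts and memory per node quoted on slide 4:

$$\text{HECToR: } 5664 \times 6\ \text{GB} \approx 34\ \text{TB} \qquad \text{HPCx: } 160 \times 32\ \text{GB} \approx 5\ \text{TB}$$

The roughly 6-7x difference in total memory is consistent with HECToR holding a LAMMPS problem around five times larger.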
