SLIDE 1

Application performance on the UK's new HECToR service

Fiona Reid [1,2], Mike Ashworth [1,3], Thomas Edwards [2], Alan Gray [1,2], Joachim Hein [1,2], Alan Simpson [1,2], Peter Knight [4], Kevin Stratford [1,2], Michele Weiland [1,2]

[1] HPCx Consortium  [2] EPCC, The University of Edinburgh  [3] STFC Daresbury Laboratory  [4] EURATOM/UKAEA Fusion Association

CUG May 5-8th 2008

SLIDE 2

Acknowledgements

  • STFC: Roderick Johnstone
  • Colin Roach, UKAEA, and Bill Dorland, University of Maryland, for assistance porting GS2 to HECToR and supplying the NEV02 benchmark
  • Jim Phillips and Gengbin Zheng, UIUC, for their assistance installing NAMD on HECToR

SLIDE 3

Overview

  • System Introductions
  • Synthetic Benchmark Results
  • Application Benchmark Results
  • Conclusions

SLIDE 4

Systems for comparison

  • HPCx (Phase 3): 160 IBM e-Server p575 nodes
    – SMP cluster, 16 Power5 1.5 GHz cores per node
    – 32 GB of RAM per node (2 GB per core)
    – IBM HPS interconnect (aka Federation)
    – 12.9 TFLOP/s Linpack, no. 101 on Top500
  • HECToR (Phase 1): Cray XT4
    – MPP, 5664 nodes, 2 Opteron 2.8 GHz cores per node
    – 6 GB of RAM per node (3 GB per core)
    – Cray SeaStar2 torus network
    – 54.6 TFLOP/s Linpack, no. 17 on Top500
  • Also included in some plots:
    – HECToR Test and Development System (TDS)
    – Cray XT4, 64 nodes: 2.6 GHz dual core, 4 GB RAM per node

SLIDE 5

System Comparison (cont)

                    HECToR                      HPCx
Chip                AMD Opteron (dual core)     IBM Power5 (dual core)
Clock               2.8 GHz                     1.5 GHz
Cores               11328                       2560
FPUs                1 M, 1 A                    2 FMA
Peak Perf/core      5.6 GFlop/s                 6.0 GFlop/s
Peak Perf           63.4 TFLOP/s                15.4 TFLOP/s
Linpack             54.6 TFLOP/s                12.9 TFLOP/s
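
The per-core and total peak figures above follow directly from the clock rate and the floating-point operations each core can complete per cycle (reading "1 M, 1 A" as one multiply plus one add, and each FMA as two flops). A quick check of that arithmetic, for orientation only:

```latex
\begin{aligned}
\text{HECToR:}\quad & 2\ \text{flops/cycle} \times 2.8\ \text{GHz} = 5.6\ \text{GFlop/s per core}, &
  5.6\ \text{GFlop/s} \times 11328\ \text{cores} &\approx 63.4\ \text{TFLOP/s}\\
\text{HPCx:}\quad   & 2\ \text{FMA} \times 2\ \text{flops} \times 1.5\ \text{GHz} = 6.0\ \text{GFlop/s per core}, &
  6.0\ \text{GFlop/s} \times 2560\ \text{cores} &\approx 15.4\ \text{TFLOP/s}
\end{aligned}
```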

SLIDE 6

Synthetic Benchmarks

  • Memory Bandwidth
    – Streams
  • MPI Bandwidth
    – Intel MPI Benchmarks – PingPing

SLIDE 7

Memory bandwidth - Streams

[Figure: Streams bandwidth (load+store, MB/s) vs. array size (bytes, ~1 KB to ~1 GB); curves for TDS (2 and 1 cores per node), HECToR (2 and 1 cores per node) and HPCx (16 and 8 cores per node)]

SLIDE 8

Memory bandwidth - Streams

  • Cache levels are clearly visible
  • HECToR is better at L1, and slightly better on main memory
    – HPCx has the advantage for intermediate array sizes
  • Underpopulating nodes (1 core per chip) gives improvements on both systems
    – memory bandwidth cannot sustain 2 cores per chip
    – HECToR is worse than HPCx, especially on main memory
    – of course, 1 core per chip means double the resource for the same number of tasks
  • TDS has a lower clock rate than HECToR, but higher bandwidth from main memory!
    – 4 GB (2+2) of RAM on TDS is symmetric, so interleaving is possible
    – 6 GB (4+2) of RAM on HECToR only allows partial interleaving
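
For readers unfamiliar with this kind of measurement, the sketch below times a simple copy loop over a range of array sizes and reports load+store bandwidth, in the spirit of the Streams curves above. It is an illustrative stand-in, not the Streams benchmark actually used for these results; the size range, repeat count and timing method are arbitrary choices.

```c
/* Minimal STREAM-style copy bandwidth sketch (illustrative only; the
 * HECToR/HPCx numbers in this talk come from the actual Streams benchmark). */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1.0e-9 * ts.tv_nsec;
}

int main(void)
{
    double sink = 0.0;
    /* Sweep array sizes from ~1 KB to ~64 MB, roughly the plotted range. */
    for (size_t n = 128; n <= (size_t)1 << 23; n *= 2) {
        double *a = malloc(n * sizeof *a);
        double *b = malloc(n * sizeof *b);
        if (!a || !b) return 1;
        for (size_t i = 0; i < n; i++) { a[i] = 1.0; b[i] = 2.0; }

        const int reps = 50;
        double t0 = now();
        for (int r = 0; r < reps; r++)
            for (size_t i = 0; i < n; i++)
                a[i] = b[i];              /* one load + one store per element */
        double secs = now() - t0;

        sink += a[n - 1];                 /* keep the copy from being optimised away */
        double mbytes = 2.0 * n * sizeof(double) * reps / 1.0e6;
        printf("%12zu bytes  %10.1f MB/s (load+store)\n",
               n * sizeof(double), mbytes / secs);
        free(a); free(b);
    }
    return (int)(sink < 0.0);             /* use sink so it is not dead code */
}
```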

SLIDE 9

MPI bandwidth - PingPing

[Figure: Intel MPI Multi-PingPing benchmark, bandwidth per task (MB/s) vs. message size (bytes); HECToR (2 cores per node) vs. HPCx (16 cores per node), with plateau bandwidths annotated at 140 MB/s and 720 MB/s]

  • HPCx reaches its saturation point earlier – HECToR may scale better
  • On both systems the latency (via IMB PingPong) is ~5.5 µs
  • AlltoAll – HPCx has the advantage for small (<100 byte) messages, HECToR outperforms HPCx for larger messages
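
In the PingPing pattern both ranks send to each other at the same time, so the quoted bandwidth per task reflects simultaneous bidirectional traffic. The sketch below shows that pattern for a pair of MPI tasks; it is only an illustration of the measurement idea, not the Intel MPI Benchmarks source, and the message sizes and repeat count are arbitrary.

```c
/* Minimal PingPing-style bandwidth sketch between ranks 0 and 1
 * (illustrative; the results in this talk use the Intel MPI Benchmarks). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 MPI tasks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    const int reps = 1000;
    for (int bytes = 1; bytes <= (1 << 22); bytes *= 2) {
        char *sbuf = malloc(bytes), *rbuf = malloc(bytes);
        int partner = 1 - rank;            /* 0 <-> 1 */

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int r = 0; r < reps; r++) {
            MPI_Request req;
            /* Both ranks send simultaneously, then receive: the PingPing pattern. */
            MPI_Isend(sbuf, bytes, MPI_CHAR, partner, 0, MPI_COMM_WORLD, &req);
            MPI_Recv(rbuf, bytes, MPI_CHAR, partner, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
        double secs = MPI_Wtime() - t0;

        if (rank == 0)
            printf("%8d bytes  %10.2f MB/s per task\n",
                   bytes, bytes * (double)reps / secs / 1.0e6);
        free(sbuf); free(rbuf);
    }
    MPI_Finalize();
    return 0;
}
```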

SLIDE 10

Applications

  • Fluid Dynamics
    – PDNS3D
    – Ludwig
  • Fusion
    – Centori
    – GS2
  • Ocean Modelling
    – POLCOMS
  • Molecular Dynamics
    – DL_POLY
    – LAMMPS
    – GROMACS (see paper)
    – NAMD
    – AMBER (see paper)

SLIDE 11

Fluid Dynamics: PDNS3D (PCHAN)

  • Finite difference code for turbulent flow
    – shock/boundary layer interaction (SBLI)
  • Simulates the flow of fluids to study turbulence
  • T3 benchmark – involves a 360x360x360 grid
  • Developed by Neil Sandham, University of Southampton
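
To give a feel for why a finite-difference code on a 360x360x360 grid tends to be limited by memory bandwidth (as the results later in this talk suggest), here is a toy central-difference kernel on a grid of that size. It is only a sketch of the access pattern, not PDNS3D's actual SBLI numerics.

```c
/* Illustrative central-difference kernel on a 360^3 grid; not the actual
 * PDNS3D/PCHAN scheme, just a sketch of the memory access pattern of a
 * bandwidth-bound finite-difference code. */
#include <stdio.h>
#include <stdlib.h>

#define N 360
#define IDX(i, j, k) (((size_t)(i) * N + (j)) * N + (k))

int main(void)
{
    double *u  = malloc((size_t)N * N * N * sizeof *u);
    double *du = malloc((size_t)N * N * N * sizeof *du);
    if (!u || !du) return 1;

    for (size_t n = 0; n < (size_t)N * N * N; n++) u[n] = (double)n;

    const double h = 1.0 / (N - 1);

    /* Each interior point reads six neighbours plus itself and writes one
     * value: few flops per byte moved, hence memory-bandwidth bound. */
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            for (int k = 1; k < N - 1; k++)
                du[IDX(i, j, k)] =
                    (u[IDX(i + 1, j, k)] + u[IDX(i - 1, j, k)] +
                     u[IDX(i, j + 1, k)] + u[IDX(i, j - 1, k)] +
                     u[IDX(i, j, k + 1)] + u[IDX(i, j, k - 1)] -
                     6.0 * u[IDX(i, j, k)]) / (h * h);

    printf("sample value: %g\n", du[IDX(N / 2, N / 2, N / 2)]);
    free(u); free(du);
    return 0;
}
```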

SLIDE 12

PDNS3D – compilation optimisation

[Figure: PDNS3D T3 benchmark on 64 cores of HECToR, PGI compilation flag comparison; run time (s), roughly 1050–1500 s, for the flags -O, -O1, -O2, -O3, -O4, -fast, -fast -O3, -fast -O4, -fast -O3 -Mipa=fast,inline and -fast -O4 -Mipa=fast,inline]

SLIDE 13

PDNS3D – system comparison

[Figure: PDNS3D T3 benchmark system comparison, Time * Cores (s) vs. cores; HECToR vs. HPCx Phase 3]

SLIDE 14

PDNS3D Memory Bandwidth sensitivity

[Figures: PDNS3D T3 benchmark, Time * Cores (s) vs. cores; left: HECToR and TDS, fully populated vs. 1 core per node; right: HPCx Phase 3, fully populated vs. 8 cores per node]

  • Underpopulating nodes gives a huge improvement (in terms of performance per core) on HECToR, and a slight improvement on HPCx
  • TDS outperforms HECToR – cf. the Streams results

SLIDE 15

PDNS3D – Optimised version

  • New optimised version less sensitive to memory bandwidth

[Figure: PDNS3D T3 benchmark system comparison, Time * Cores (s) vs. cores; HECToR, HPCx Phase 3, HECToR (Opt), HPCx (Opt), HECToR (Opt, 1 core per node)]

SLIDE 16

PDNS3D – Optimised version

  • PathScale gives a further 10-15% improvement

[Figure: PDNS3D T3 benchmark system comparison, Time * Cores (s) vs. cores; HECToR, HPCx Phase 3, HECToR (Opt), HPCx (Opt), HECToR (Opt, 1 core per node), HECToR (Opt, PathScale)]

SLIDE 17

Fluid dynamics - Ludwig

  • Ludwig
    – Lattice Boltzmann code for solving the incompressible Navier-Stokes equations
    – used to study complex fluids
    – the code uses a regular domain decomposition with local boundary exchanges between the subdomains
    – two problems considered: one with a binary fluid mixture, the other with shear flow
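
As an illustration of the local boundary exchange that a regular domain decomposition implies, the sketch below exchanges halo planes between neighbouring MPI tasks for a one-dimensional decomposition of the lattice. The decomposition, per-task lattice size and data layout are assumptions made for the illustration; this is not Ludwig's actual communication code.

```c
/* Illustrative halo exchange for a 1D domain decomposition; not Ludwig's
 * code, just the boundary-exchange pattern described above. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NLOCAL 64          /* lattice planes owned by each task (illustrative) */
#define PLANE  (512 * 256) /* sites per plane for a 256x512x256-style lattice  */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Local slab plus one halo plane at each end. */
    double *f = calloc((size_t)(NLOCAL + 2) * PLANE, sizeof *f);

    int below = (rank - 1 + size) % size;   /* periodic neighbours */
    int above = (rank + 1) % size;

    /* Send the first owned plane down and receive the upper halo,
     * then send the last owned plane up and receive the lower halo. */
    MPI_Sendrecv(&f[1 * PLANE],            PLANE, MPI_DOUBLE, below, 0,
                 &f[(NLOCAL + 1) * PLANE], PLANE, MPI_DOUBLE, above, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&f[NLOCAL * PLANE],       PLANE, MPI_DOUBLE, above, 1,
                 &f[0],                    PLANE, MPI_DOUBLE, below, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (rank == 0) printf("halo exchange complete on %d tasks\n", size);
    free(f);
    MPI_Finalize();
    return 0;
}
```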

SLIDE 18

Ludwig 256x512x256 lattice

SLIDE 19

Fusion

  • Centori
    – simulates the fluid flow inside a tokamak reactor; developed by UKAEA Fusion in collaboration with EPCC
  • GS2
    – gyrokinetic simulations of low-frequency turbulence in tokamaks; developed by Bill Dorland et al.

[Image: ITER tokamak reactor (www.iter.org)]

SLIDE 20

CENTORI

[Figure: Centori, 128x128x128 problem, system comparison; Time * Cores (s) vs. cores for HECToR, HPCx and TDS]

SLIDE 21

GS2

[Figure: GS2 NEV02 benchmark system comparison, Time * Cores (s) vs. cores; HECToR, HPCx, HECToR (1 core per node), TDS, TDS (1 core per node)]

SLIDE 22

Ocean Modelling: POLCOMS

  • Proudman Oceanographic Laboratory Coastal Ocean Modelling System (POLCOMS)
    – simulation of the marine environment
    – applications include coastal engineering, offshore industries, fisheries management, marine pollution monitoring, weather forecasting and climate research
    – uses a 3-dimensional hydrodynamic model

SLIDE 23

Ocean Modelling: POLCOMS

SLIDE 24

Molecular dynamics

  • DL_POLY
    – general purpose molecular dynamics package which can be used to simulate systems with very large numbers of atoms
  • LAMMPS
    – classical molecular dynamics code which can simulate a wide range of materials
  • NAMD
    – classical molecular dynamics code designed for high-performance simulation of large biomolecular systems
  • AMBER
    – general purpose biomolecular simulation package
  • GROMACS
    – general purpose MD package specialising in biochemical systems, e.g. proteins, lipids etc.

[Image: Protein Dihydrofolate Reductase]
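
For readers outside the field, the sketch below shows the basic timestep loop that classical MD codes such as those above build on: compute forces, then advance positions and velocities with a velocity-Verlet integrator. The particle count, force law and timestep are arbitrary toy choices; none of this is code from DL_POLY, LAMMPS, NAMD, AMBER or GROMACS.

```c
/* Generic velocity-Verlet MD step for a handful of particles; a toy
 * illustration of what the packages above do at scale. */
#include <stdio.h>

#define NPART 4
#define NSTEP 100

typedef struct { double x[3], v[3], f[3]; } particle;

/* Pairwise toy repulsive force, purely illustrative (unit masses assumed). */
static void compute_forces(particle p[])
{
    for (int i = 0; i < NPART; i++)
        for (int d = 0; d < 3; d++) p[i].f[d] = 0.0;

    for (int i = 0; i < NPART; i++)
        for (int j = i + 1; j < NPART; j++) {
            double r[3], r2 = 0.0;
            for (int d = 0; d < 3; d++) {
                r[d] = p[i].x[d] - p[j].x[d];
                r2 += r[d] * r[d];
            }
            double s = 1.0 / (r2 * r2);
            for (int d = 0; d < 3; d++) {
                p[i].f[d] += s * r[d];
                p[j].f[d] -= s * r[d];
            }
        }
}

int main(void)
{
    particle p[NPART] = {0};
    for (int i = 0; i < NPART; i++) p[i].x[0] = i + 1.0;   /* spread along x */

    const double dt = 1.0e-3;
    compute_forces(p);
    for (int step = 0; step < NSTEP; step++) {
        /* Velocity Verlet: half-kick, drift, recompute forces, half-kick. */
        for (int i = 0; i < NPART; i++)
            for (int d = 0; d < 3; d++) {
                p[i].v[d] += 0.5 * dt * p[i].f[d];
                p[i].x[d] += dt * p[i].v[d];
            }
        compute_forces(p);
        for (int i = 0; i < NPART; i++)
            for (int d = 0; d < 3; d++)
                p[i].v[d] += 0.5 * dt * p[i].f[d];
    }
    printf("particle 0 position: %g %g %g\n", p[0].x[0], p[0].x[1], p[0].x[2]);
    return 0;
}
```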

SLIDE 25

DL_POLY – system comparison

[Figure: DL_POLY 3.08, gramicidin A with water solvating (792,960 atoms), system comparison; timestep time * Cores (s) vs. cores for HPCx Phase 3 and HECToR]

SLIDE 26

DL_POLY – system comparison

[Figure: DL_POLY 3.08, gramicidin A with water solvating (792,960 atoms) on HECToR; timestep time * Cores (s) vs. cores, fully populated vs. 1 core per node]

SLIDE 27

LAMMPS

[Figure: LAMMPS Rhodopsin benchmark (4,096,000 atoms), system comparison; Loop Time * Cores (s) vs. cores for HPCx Phase 3, HPCx Phase 3 with SMT, and HECToR]

SLIDE 28

LAMMPS

[Figure: LAMMPS Rhodopsin benchmark (4,096,000 atoms) on HECToR; Loop Time * Cores (s) vs. cores for HECToR, HECToR (1 core per node), TDS and TDS (1 core per node)]

SLIDE 29

LAMMPS

  • On HECToR we can run a problem with 500 million particles
  • On HPCx the limit is ~100 million particles
    – fewer cores available
    – less memory per core

SLIDE 30

NAMD

[Figure: NAMD benchmarks system comparison, Time * Cores (s) vs. cores; F1-ATPase (327K atoms) on HECToR and HPCx; ApoA1 (92K atoms) on HECToR, TDS, HPCx and HECToR (1 core per node)]

SLIDE 31

Conclusions

  • On a core-by-core basis, there is not much difference between HECToR and HPCx in terms of application performance
    – but HECToR has many more cores and a more scalable interconnect
  • Scaling is better at high core counts on HECToR
    – HECToR can also run much bigger problems, e.g. LAMMPS
  • Memory bandwidth cannot sustain fully populated nodes on either system
    – a general problem for HPC systems these days
    – this is seen in the performance of memory-bandwidth-sensitive applications
    – the problem is worse for HECToR, especially with the current non-symmetric memory setup