SLIDE 1

Application performance on the UK's new HECToR service

Fiona Reid [1,2], Mike Ashworth [1,3], Thomas Edwards [2], Alan Gray [1,2], Joachim Hein [1,2], Alan Simpson [1,2], Peter Knight [4], Kevin Stratford [1,2], Michele Weiland [1,2]

[1] HPCx Consortium  [2] EPCC, The University of Edinburgh  [3] STFC Daresbury Laboratory  [4] EURATOM/UKAEA Fusion Association

CUG May 5-8th 2008

SLIDE 2

Acknowledgements

  • STFC: Roderick Johnstone
  • Colin Roach, UKAEA, and Bill Dorland, University of Maryland, for assistance porting GS2 to HECToR and supplying the NEV02 benchmark
  • Jim Phillips and Gengbin Zheng, UIUC, for their assistance installing NAMD on HECToR

SLIDE 3

Overview

  • System Introductions
  • Synthetic Benchmark Results
  • Application Benchmark Results
  • Conclusions

SLIDE 4

Systems for comparison

  • HPCx (Phase 3): 160 IBM e-Server p575 nodes
    – SMP cluster, 16 Power5 1.5 GHz cores per node
    – 32 GB of RAM per node (2 GB per core)
    – IBM HPS interconnect (aka Federation)
    – 12.9 TFLOP/s Linpack, no. 101 on Top500
  • HECToR (Phase 1): Cray XT4
    – MPP, 5664 nodes, 2 Opteron 2.8 GHz cores per node
    – 6 GB of RAM per node (3 GB per core)
    – Cray SeaStar2 torus network
    – 54.6 TFLOP/s Linpack, no. 17 on Top500
  • Also included in some plots:
    – HECToR Test and Development System (TDS)
    – Cray XT4, 64 nodes: 2.6 GHz dual core, 4 GB RAM per node

SLIDE 5

System Comparison (cont)

                    HECToR                      HPCx
Chip                AMD Opteron (dual core)     IBM Power5 (dual core)
Clock               2.8 GHz                     1.5 GHz
Cores               11328                       2560
FPUs                1 M, 1 A                    2 FMA
Peak Perf/core      5.6 GFlop/s                 6.0 GFlop/s
Peak Perf           63.4 TFLOP/s                15.4 TFLOP/s
Linpack             54.6 TFLOP/s                12.9 TFLOP/s
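
The per-core and total peak figures above follow directly from the clock rate and the floating-point operations each core can complete per cycle (reading "1 M, 1 A" as one multiply plus one add, and each FMA as two flops). A quick check of that arithmetic, for orientation only:

```latex
\begin{aligned}
\text{HECToR:}\quad & 2\ \text{flops/cycle} \times 2.8\ \text{GHz} = 5.6\ \text{GFlop/s per core}, &
  5.6\ \text{GFlop/s} \times 11328\ \text{cores} &\approx 63.4\ \text{TFLOP/s}\\
\text{HPCx:}\quad   & 2\ \text{FMA} \times 2\ \text{flops} \times 1.5\ \text{GHz} = 6.0\ \text{GFlop/s per core}, &
  6.0\ \text{GFlop/s} \times 2560\ \text{cores} &\approx 15.4\ \text{TFLOP/s}
\end{aligned}
```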

SLIDE 6

Synthetic Benchmarks

  • Memory Bandwidth
    – Streams
  • MPI Bandwidth
    – Intel MPI Benchmarks – PingPing

SLIDE 7

Memory bandwidth - Streams

[Figure: Streams bandwidth (load+store, MB/s) vs. array size (bytes, ~1 KB to ~1 GB); curves for TDS (2 and 1 cores per node), HECToR (2 and 1 cores per node) and HPCx (16 and 8 cores per node)]

SLIDE 8

Memory bandwidth - Streams

  • Cache levels are clearly visible
  • HECToR is better at L1, and slightly better on main memory
    – HPCx has the advantage for intermediate array sizes
  • Underpopulating nodes (1 core per chip) gives improvements on both systems
    – memory bandwidth cannot sustain 2 cores per chip
    – HECToR is worse than HPCx, especially on main memory
    – of course, 1 core per chip means double the resource for the same number of tasks
  • TDS has a lower clock rate than HECToR, but higher bandwidth from main memory!
    – 4 GB (2+2) of RAM on TDS is symmetric, so interleaving is possible
    – 6 GB (4+2) of RAM on HECToR only allows partial interleaving
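
For readers unfamiliar with this kind of measurement, the sketch below times a simple copy loop over a range of array sizes and reports load+store bandwidth, in the spirit of the Streams curves above. It is an illustrative stand-in, not the Streams benchmark actually used for these results; the size range, repeat count and timing method are arbitrary choices.

```c
/* Minimal STREAM-style copy bandwidth sketch (illustrative only; the
 * HECToR/HPCx numbers in this talk come from the actual Streams benchmark). */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1.0e-9 * ts.tv_nsec;
}

int main(void)
{
    double sink = 0.0;
    /* Sweep array sizes from ~1 KB to ~64 MB, roughly the plotted range. */
    for (size_t n = 128; n <= (size_t)1 << 23; n *= 2) {
        double *a = malloc(n * sizeof *a);
        double *b = malloc(n * sizeof *b);
        if (!a || !b) return 1;
        for (size_t i = 0; i < n; i++) { a[i] = 1.0; b[i] = 2.0; }

        const int reps = 50;
        double t0 = now();
        for (int r = 0; r < reps; r++)
            for (size_t i = 0; i < n; i++)
                a[i] = b[i];              /* one load + one store per element */
        double secs = now() - t0;

        sink += a[n - 1];                 /* keep the copy from being optimised away */
        double mbytes = 2.0 * n * sizeof(double) * reps / 1.0e6;
        printf("%12zu bytes  %10.1f MB/s (load+store)\n",
               n * sizeof(double), mbytes / secs);
        free(a); free(b);
    }
    return (int)(sink < 0.0);             /* use sink so it is not dead code */
}
```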

SLIDE 9

MPI bandwidth - PingPing

[Figure: Intel MPI Multi-PingPing benchmark, bandwidth per task (MB/s) vs. message size (bytes); HECToR (2 cores per node) vs. HPCx (16 cores per node), with plateau bandwidths annotated at 140 MB/s and 720 MB/s]

  • HPCx reaches its saturation point earlier – HECToR may scale better
  • On both systems the latency (via IMB PingPong) is ~5.5 µs
  • AlltoAll – HPCx has the advantage for small (<100 byte) messages, HECToR outperforms HPCx for larger messages
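
In the PingPing pattern both ranks send to each other at the same time, so the quoted bandwidth per task reflects simultaneous bidirectional traffic. The sketch below shows that pattern for a pair of MPI tasks; it is only an illustration of the measurement idea, not the Intel MPI Benchmarks source, and the message sizes and repeat count are arbitrary.

```c
/* Minimal PingPing-style bandwidth sketch between ranks 0 and 1
 * (illustrative; the results in this talk use the Intel MPI Benchmarks). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 MPI tasks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    const int reps = 1000;
    for (int bytes = 1; bytes <= (1 << 22); bytes *= 2) {
        char *sbuf = malloc(bytes), *rbuf = malloc(bytes);
        int partner = 1 - rank;            /* 0 <-> 1 */

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int r = 0; r < reps; r++) {
            MPI_Request req;
            /* Both ranks send simultaneously, then receive: the PingPing pattern. */
            MPI_Isend(sbuf, bytes, MPI_CHAR, partner, 0, MPI_COMM_WORLD, &req);
            MPI_Recv(rbuf, bytes, MPI_CHAR, partner, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
        double secs = MPI_Wtime() - t0;

        if (rank == 0)
            printf("%8d bytes  %10.2f MB/s per task\n",
                   bytes, bytes * (double)reps / secs / 1.0e6);
        free(sbuf); free(rbuf);
    }
    MPI_Finalize();
    return 0;
}
```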

SLIDE 10

Applications

  • Fluid Dynamics
    – PDNS3D
    – Ludwig
  • Fusion
    – Centori
    – GS2
  • Ocean Modelling
    – POLCOMS
  • Molecular Dynamics
    – DL_POLY
    – LAMMPS
    – GROMACS (see paper)
    – NAMD
    – AMBER (see paper)

SLIDE 11

Fluid Dynamics: PDNS3D (PCHAN)

  • Finite difference code for turbulent flow
    – shock/boundary layer interaction (SBLI)
  • Simulates the flow of fluids to study turbulence
  • T3 benchmark – involves a 360x360x360 grid
  • Developed by Neil Sandham, University of Southampton
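
To give a feel for why a finite-difference code on a 360x360x360 grid tends to be limited by memory bandwidth (as the results later in this talk suggest), here is a toy central-difference kernel on a grid of that size. It is only a sketch of the access pattern, not PDNS3D's actual SBLI numerics.

```c
/* Illustrative central-difference kernel on a 360^3 grid; not the actual
 * PDNS3D/PCHAN scheme, just a sketch of the memory access pattern of a
 * bandwidth-bound finite-difference code. */
#include <stdio.h>
#include <stdlib.h>

#define N 360
#define IDX(i, j, k) (((size_t)(i) * N + (j)) * N + (k))

int main(void)
{
    double *u  = malloc((size_t)N * N * N * sizeof *u);
    double *du = malloc((size_t)N * N * N * sizeof *du);
    if (!u || !du) return 1;

    for (size_t n = 0; n < (size_t)N * N * N; n++) u[n] = (double)n;

    const double h = 1.0 / (N - 1);

    /* Each interior point reads six neighbours plus itself and writes one
     * value: few flops per byte moved, hence memory-bandwidth bound. */
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            for (int k = 1; k < N - 1; k++)
                du[IDX(i, j, k)] =
                    (u[IDX(i + 1, j, k)] + u[IDX(i - 1, j, k)] +
                     u[IDX(i, j + 1, k)] + u[IDX(i, j - 1, k)] +
                     u[IDX(i, j, k + 1)] + u[IDX(i, j, k - 1)] -
                     6.0 * u[IDX(i, j, k)]) / (h * h);

    printf("sample value: %g\n", du[IDX(N / 2, N / 2, N / 2)]);
    free(u); free(du);
    return 0;
}
```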

SLIDE 12

PDNS3D – compilation optimisation

[Figure: PDNS3D T3 benchmark on 64 cores of HECToR, PGI compilation flag comparison; run time (s), roughly 1050–1500 s, for the flags -O, -O1, -O2, -O3, -O4, -fast, -fast -O3, -fast -O4, -fast -O3 -Mipa=fast,inline and -fast -O4 -Mipa=fast,inline]

SLIDE 13

PDNS3D – system comparison

[Figure: PDNS3D T3 benchmark system comparison, Time * Cores (s) vs. cores; HECToR vs. HPCx Phase 3]

SLIDE 14

PDNS3D Memory Bandwidth sensitivity

[Figures: PDNS3D T3 benchmark, Time * Cores (s) vs. cores; left: HECToR and TDS, fully populated vs. 1 core per node; right: HPCx Phase 3, fully populated vs. 8 cores per node]

  • Underpopulating nodes gives a huge improvement (in terms of performance per core) on HECToR, and a slight improvement on HPCx
  • TDS outperforms HECToR – cf. the Streams results

SLIDE 15

PDNS3D – Optimised version

  • New optimised version less sensitive to memory bandwidth

[Figure: PDNS3D T3 benchmark system comparison, Time * Cores (s) vs. cores; HECToR, HPCx Phase 3, HECToR (Opt), HPCx (Opt), HECToR (Opt, 1 core per node)]

SLIDE 16

PDNS3D – Optimised version

  • PathScale gives a further 10-15% improvement

[Figure: PDNS3D T3 benchmark system comparison, Time * Cores (s) vs. cores; HECToR, HPCx Phase 3, HECToR (Opt), HPCx (Opt), HECToR (Opt, 1 core per node), HECToR (Opt, PathScale)]

SLIDE 17

Fluid dynamics - Ludwig

  • Ludwig
    – Lattice Boltzmann code for solving the incompressible Navier-Stokes equations
    – used to study complex fluids
    – the code uses a regular domain decomposition with local boundary exchanges between the subdomains
    – two problems considered: one with a binary fluid mixture, the other with shear flow
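
As an illustration of the local boundary exchange that a regular domain decomposition implies, the sketch below exchanges halo planes between neighbouring MPI tasks for a one-dimensional decomposition of the lattice. The decomposition, per-task lattice size and data layout are assumptions made for the illustration; this is not Ludwig's actual communication code.

```c
/* Illustrative halo exchange for a 1D domain decomposition; not Ludwig's
 * code, just the boundary-exchange pattern described above. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NLOCAL 64          /* lattice planes owned by each task (illustrative) */
#define PLANE  (512 * 256) /* sites per plane for a 256x512x256-style lattice  */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Local slab plus one halo plane at each end. */
    double *f = calloc((size_t)(NLOCAL + 2) * PLANE, sizeof *f);

    int below = (rank - 1 + size) % size;   /* periodic neighbours */
    int above = (rank + 1) % size;

    /* Send the first owned plane down and receive the upper halo,
     * then send the last owned plane up and receive the lower halo. */
    MPI_Sendrecv(&f[1 * PLANE],            PLANE, MPI_DOUBLE, below, 0,
                 &f[(NLOCAL + 1) * PLANE], PLANE, MPI_DOUBLE, above, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&f[NLOCAL * PLANE],       PLANE, MPI_DOUBLE, above, 1,
                 &f[0],                    PLANE, MPI_DOUBLE, below, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (rank == 0) printf("halo exchange complete on %d tasks\n", size);
    free(f);
    MPI_Finalize();
    return 0;
}
```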

SLIDE 18

Ludwig 256x512x256 lattice

SLIDE 19

Fusion

  • Centori
    – simulates the fluid flow inside a tokamak reactor; developed by UKAEA Fusion in collaboration with EPCC
  • GS2
    – gyrokinetic simulations of low-frequency turbulence in tokamaks; developed by Bill Dorland et al.

[Image: ITER tokamak reactor (www.iter.org)]

SLIDE 20

CENTORI

[Figure: Centori, 128x128x128 problem, system comparison; Time * Cores (s) vs. cores for HECToR, HPCx and TDS]

SLIDE 21

GS2

[Figure: GS2 NEV02 benchmark system comparison, Time * Cores (s) vs. cores; HECToR, HPCx, HECToR (1 core per node), TDS, TDS (1 core per node)]

SLIDE 22

Ocean Modelling: POLCOMS

  • Proudman Oceanographic Laboratory Coastal Ocean Modelling System (POLCOMS)
    – simulation of the marine environment
    – applications include coastal engineering, offshore industries, fisheries management, marine pollution monitoring, weather forecasting and climate research
    – uses a 3-dimensional hydrodynamic model

SLIDE 23

Ocean Modelling: POLCOMS

SLIDE 24

Molecular dynamics

  • DL_POLY
    – general purpose molecular dynamics package which can be used to simulate systems with very large numbers of atoms
  • LAMMPS
    – classical molecular dynamics code which can simulate a wide range of materials
  • NAMD
    – classical molecular dynamics code designed for high-performance simulation of large biomolecular systems
  • AMBER
    – general purpose biomolecular simulation package
  • GROMACS
    – general purpose MD package specialising in biochemical systems, e.g. proteins, lipids etc.

[Image: Protein Dihydrofolate Reductase]
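
For readers outside the field, the sketch below shows the basic timestep loop that classical MD codes such as those above build on: compute forces, then advance positions and velocities with a velocity-Verlet integrator. The particle count, force law and timestep are arbitrary toy choices; none of this is code from DL_POLY, LAMMPS, NAMD, AMBER or GROMACS.

```c
/* Generic velocity-Verlet MD step for a handful of particles; a toy
 * illustration of what the packages above do at scale. */
#include <stdio.h>

#define NPART 4
#define NSTEP 100

typedef struct { double x[3], v[3], f[3]; } particle;

/* Pairwise toy repulsive force, purely illustrative (unit masses assumed). */
static void compute_forces(particle p[])
{
    for (int i = 0; i < NPART; i++)
        for (int d = 0; d < 3; d++) p[i].f[d] = 0.0;

    for (int i = 0; i < NPART; i++)
        for (int j = i + 1; j < NPART; j++) {
            double r[3], r2 = 0.0;
            for (int d = 0; d < 3; d++) {
                r[d] = p[i].x[d] - p[j].x[d];
                r2 += r[d] * r[d];
            }
            double s = 1.0 / (r2 * r2);
            for (int d = 0; d < 3; d++) {
                p[i].f[d] += s * r[d];
                p[j].f[d] -= s * r[d];
            }
        }
}

int main(void)
{
    particle p[NPART] = {0};
    for (int i = 0; i < NPART; i++) p[i].x[0] = i + 1.0;   /* spread along x */

    const double dt = 1.0e-3;
    compute_forces(p);
    for (int step = 0; step < NSTEP; step++) {
        /* Velocity Verlet: half-kick, drift, recompute forces, half-kick. */
        for (int i = 0; i < NPART; i++)
            for (int d = 0; d < 3; d++) {
                p[i].v[d] += 0.5 * dt * p[i].f[d];
                p[i].x[d] += dt * p[i].v[d];
            }
        compute_forces(p);
        for (int i = 0; i < NPART; i++)
            for (int d = 0; d < 3; d++)
                p[i].v[d] += 0.5 * dt * p[i].f[d];
    }
    printf("particle 0 position: %g %g %g\n", p[0].x[0], p[0].x[1], p[0].x[2]);
    return 0;
}
```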

SLIDE 25

DL_POLY – system comparison

[Figure: DL_POLY 3.08, gramicidin A with water solvating (792,960 atoms), system comparison; timestep time * Cores (s) vs. cores for HPCx Phase 3 and HECToR]

SLIDE 26

DL_POLY – system comparison

[Figure: DL_POLY 3.08, gramicidin A with water solvating (792,960 atoms) on HECToR; timestep time * Cores (s) vs. cores, fully populated vs. 1 core per node]

SLIDE 27

LAMMPS

[Figure: LAMMPS Rhodopsin benchmark (4,096,000 atoms), system comparison; Loop Time * Cores (s) vs. cores for HPCx Phase 3, HPCx Phase 3 with SMT, and HECToR]

SLIDE 28

LAMMPS

[Figure: LAMMPS Rhodopsin benchmark (4,096,000 atoms) on HECToR; Loop Time * Cores (s) vs. cores for HECToR, HECToR (1 core per node), TDS and TDS (1 core per node)]

SLIDE 29

LAMMPS

  • On HECToR we can run a problem with 500 million particles
  • On HPCx the limit is ~100 million particles
    – fewer cores available
    – less memory per core

SLIDE 30

NAMD

[Figure: NAMD benchmarks system comparison, Time * Cores (s) vs. cores; F1-ATPase (327K atoms) on HECToR and HPCx; ApoA1 (92K atoms) on HECToR, TDS, HPCx and HECToR (1 core per node)]

SLIDE 31

Conclusions

  • On a core-by-core basis, there is not much difference between HECToR and HPCx in terms of application performance
    – but HECToR has many more cores and a more scalable interconnect
  • Scaling is better at high core counts on HECToR
    – HECToR can also run much bigger problems, e.g. LAMMPS
  • Memory bandwidth cannot sustain fully populated nodes on either system
    – a general problem for HPC systems these days
    – this is seen in the performance of memory-bandwidth-sensitive applications
    – the problem is worse for HECToR, especially with the current non-symmetric memory setup