

SLIDE 1

HPCx: A New Resource for UK Computational Science

Mike Ashworth, Ian J. Bush, Martyn F. Guest, Martin Plummer and Andrew G. Sunderland
CLRC Daresbury Laboratory, UK (m.f.guest@dl.ac.uk)

and Stephen Booth, David S. Henty, Lorna Smith and Kevin Stratford
EPCC, University of Edinburgh, UK

http://www.hpcx.ac.uk/

HPCS 2003, Sherbrooke, 14th May 2003

SLIDE 2

Outline

  • HPCx Overview
    – HPCx Consortium
    – HPCx Technology - Phases 1, 2 and 3 (2002-2007)

  • Performance Overview of Strategic Applications:
    – Computational Materials
    – Molecular Simulation
    – Molecular Electronic Structure
    – Atomic and Molecular Physics
    – Computational Engineering
    – Environmental Science

  • Evaluation across a range of Current High-End Systems:
    – IBM SP/p690, SGI Origin 3800/R14k-500, HP/Compaq AlphaServer SC ES45/1000 and Cray T3E/1200E

  • Summary

Applications, and not H/W, driven

SLIDE 3

HPCx Project overview

  • A joint venture between the Edinburgh Parallel Computing Centre (EPCC) at the University of Edinburgh and the Daresbury Laboratory of the Central Laboratory of the Research Councils (CLRC)
  • Project funded to £53M (~$120M) by UK Government
  • Established to operate and support the principal academic and research computing service for the UK
  • Principal objective: to provide a Capability Computing service to run scientific applications that could not be run on any other available computing platform
  • Six-year project with defined performance requirements at year 0, year 2 and year 4, so as to match Moore's Law
  • IBM chosen as the technology partner, with the POWER4-based p690 platform and the "best available interconnect"

SLIDE 4

Consortium partners

  • EPCC (University of Edinburgh)
    – established in 1991 as the University's interdisciplinary focus for high-performance computing and commercial exploitation arm
    – has hosted specialised HPC services for the UK's QCD community since 1989; 5 Tflop/s QCDOC system due 2003 in a project with Columbia, IBM and Brookhaven National Laboratory
    – operated and supported UK national services on Cray T3D and T3E systems from 1994 until 2002

  • CLRC (Daresbury Laboratory)
    – HPC service provider to the UK academic community for > 25 years
    – research, development & support centre for leading-edge academic engineering and physical science simulation codes
    – distributed computing support centre for COTS processor & network technologies, evaluating scalability and performance
    – UK Grid support centre

SLIDE 5

HPCx Technology Phase 1

Phase 1 (Dec. 2002): 3 TFlop/s Rmax Linpack

  – 40 Regatta-H SMP compute systems (1.28 TB memory in total)
    • 32 x 1.3 GHz processors, 32 GB memory; 4 x 8-way LPARs each
    • 160 x 8-way compute nodes in total
  – 2 Regatta-H I/O systems
    • 16 x 1.3 GHz processors (Regatta-HPC), 4 GPFS LPARs
    • 2 HSM/backup LPARs, 18 TB EXP500 fibre-channel global filesystem
  – Switch interconnect
    • existing SP Switch2 with "Colony" PCI adapters in all LPARs (20 µs latency, 350 MB/s bandwidth)
    • each compute node has two connections into the switch fabric (dual plane)
  – Ranked #9 in the TOP500 list (November 2002)

SLIDE 6

HPCx Technology Phases 2 & 3

Phase 2 (2004): 6 TFlop/s Rmax Linpack

  – >40 Regatta-H+ compute systems
    • 32 x 1.8 GHz processors, 32 GB memory, full SMP mode (no LPARs)
  – 3 Regatta-H I/O systems (double the capability of Phase 1)
  – "Federation" switch fabric
    • bandwidth quadrupled, ~5-10 microsecond latency, connects directly to the GX bus

Phase 3 (2006): 12 TFlop/s Rmax Linpack

  – >40 Regatta-H+ compute systems
    • unchanged from Phase 2
  – >40 additional Regatta-H+ compute systems
    • doubling the existing configuration
  – 4 Regatta I/O systems (double the capability of Phase 2)

Open to alternative technology solutions (IPF, BlueGene/L, ...)

SLIDE 7

HPCx - Phase 1 Technology at Daresbury

(Photographs of the installation at Daresbury, July 2002 and November 2002)

SLIDE 8

IBM p-series 690 Turbo: Multi-Chip Module (MCM)

(Schematic of the MCM: four POWER4 chips, each with a shared L2 cache, joined by a distributed switch; L3 caches and memory controllers lead to two associated memory slots, with four GX Bus links)

  • Four POWER4 chips (8 processors) on an MCM, with two associated memory slots
  • 4 GX Bus links for external connections
  • L3 cache shared across all processors

SLIDE 9

Serial Benchmark Summary

(Bar chart: performance relative to the SGI Origin 3800/R12k-400 for the MATRIX-97 and chemistry kernels, DLPOLY, GAMESS-UK and an overall figure, comparing Intel Tiger Madison 1.2 GHz, Cray T3E/1200E, SGI Origin 3800/R12k-400, HP/Compaq ES45 1 GHz and IBM SP/p690 1.3 GHz. Annotations on the chart: 7x Cray T3E, 3x SGI Origin 3800.)

SLIDE 10

SPEC CPU2000: SPECfp vs SPECfp_rate (32 CPUs)

(Bar chart of SPECfp and SPECfp_rate values relative to the IBM 690 Turbo 1.3 GHz for: IBM 690 Turbo 1.3 GHz, HP Superdome/PA8700-750, HP Superdome/PA8600-552, SGI Origin3800/R14k-600, SGI Origin3800/R14k-500, SGI Origin3800/R12k-400, Compaq Alpha GS320/1000 and Compaq Alpha GS320/731)

SLIDE 11

Interconnect Benchmark - EFF_BW

(Bar chart of effective bandwidth in MBytes/sec at 16 CPUs across commodity clusters and high-end systems: Fast Ethernet, SCALI and Myrinet 2k clusters (PIII/800 with LAM and MPICH, AMD K7/1200 and K7/1000 MP, Itanium/800, P4/2000 Xeon), QSNet AlphaEV67 and AlphaServer SC ES45/1000 systems, Cray T3E/1200E, IBM Regatta-H, IBM SP/Regatta-H, SP/NH2-375 and SP/WH2-375 configurations, and the SGI Origin 3800/R14k-500)
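EFF_BW is an effective-bandwidth measurement. As a rough illustration of the kind of measurement involved (not the actual EFF_BW benchmark, whose message sizes and communication pattern differ), a minimal MPI sketch that times paired exchanges and sums the per-rank bandwidths might look like this; MSG_SIZE and NREPS are arbitrary assumptions:

```c
/* Minimal pairwise-exchange bandwidth probe (illustrative only; the real
 * EFF_BW benchmark uses its own message sizes and communication pattern). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_SIZE (1 << 20)   /* 1 MB per message - assumption */
#define NREPS    100

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Pair ranks (0<->1, 2<->3, ...); assumes an even process count. */
    int partner = (rank % 2 == 0) ? rank + 1 : rank - 1;

    char *sendbuf = malloc(MSG_SIZE);
    char *recvbuf = malloc(MSG_SIZE);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < NREPS; i++) {
        MPI_Sendrecv(sendbuf, MSG_SIZE, MPI_CHAR, partner, 0,
                     recvbuf, MSG_SIZE, MPI_CHAR, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    double local_bw = (double)MSG_SIZE * NREPS / (MPI_Wtime() - t0) / 1e6;

    double total_bw;
    MPI_Reduce(&local_bw, &total_bw, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("Aggregate exchange bandwidth: %.1f MB/s over %d ranks\n",
               total_bw, size);

    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}
```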

SLIDE 12

Capability Benchmarking and Application Tuning
HPCx Terascale Applications Team

  • Materials Science

– CASTEP, AIMPRO & CRYSTAL

  • Molecular Simulation

– DL-POLY & NAMD

  • Atomic & Molecular Physics

– PFARM and H2MOL

  • Molecular Electronic Structure

– GAMESS-UK & NWChem

  • Computational Engineering

– PDNS3D

  • Environmental Science

– POLCOMS

SLIDE 13

Systems Used In Performance Analysis

  • IBM Systems
    – IBM SP/Regatta-H (1024 procs, 8-way LPARs) - HPCx system at DL
    – Regatta-H (32-way) and Regatta-HPC (16-way) (Montpellier)
    – SP/Regatta-H (8-way LPARs, 1.3 GHz) at ORNL

  • HP/Compaq AlphaServer SC
    – 4-way ES40/667 (APAC) and 833 MHz SMP nodes
    – TCS1 system at PSC: 750 4-way ES45 nodes - 3,000 EV68 1 GHz CPUs, with 4 GB memory per node
    – Quadrics "fat tree" interconnect (5 µs latency, 250+ MB/s bandwidth)

  • SGI Origin 3800
    – SARA (1000 CPUs) - NUMAlink - with R14k/500 and R12k/400 CPUs
    – CSAR (512 CPUs) - NUMAlink - R12k/400

  • Cray T3E/1200E
    – CSAR (788 CPUs)

SLIDE 14

Materials Science

AIMPRO

(Ab Initio Modelling PROgram) Patrick Briddon et al, Newcastle University http://aimpro.ncl.ac.uk/

CASTEP

CAmbridge Serial Total Energy Package http://www.cse.clrc.ac.uk/cmg/NETWORKS/UKCP/

CRYSTAL

Properties of crystalline systems; periodic HF or DFT (Kohn-Sham Hamiltonian), with various hybrid approximations. http://www.cse.clrc.ac.uk/cmg/CRYSTAL/

SLIDE 15

The AIMPRO benchmark

216 atoms: C impurity in a Si lattice; 5,180 basis functions; limited by the ScaLAPACK routine PDSYEVX

(Plot: performance (10000/time) against number of processors, 32-256, for the SGI Origin 3800/R12k-400 and IBM SP/p690; annotated ratios of x4.3, x2.3 and x1.6)

SLIDE 16

Scalability of Numerical Algorithms I.

Real symmetric eigenvalue problems on the SGI Origin 3800/R12k-400 ("green")

(Two plots of time (sec) against number of processors: a Fock matrix with N = 1152 on 2-32 processors, comparing PeIGS 2.1, PeIGS 2.1 on the Cray T3E/1200, PeIGS 3.0, PDSYEV (ScaLAPACK 1.5) and PDSYEVD (ScaLAPACK 1.7); and a Fock matrix with N = 3888 on 16-512 processors, comparing PeIGS 2.1, PeIGS 3.0, PDSYEV, PDSYEVD and a BFG-Jacobi solver (DL))

SLIDE 17

Scalability of Numerical Algorithms II.

Real symmetric eigenvalue problems on the IBM SP/p690 and SGI Origin 3800/R12k

(Two plots of time (secs) against number of processors: N = 3,888 on 32-512 processors, comparing PeIGS 3.0, PDSYEV and PDSYEVD on the Origin 3800 and IBM SP/p690; and N = 9,000 on 16-256 processors of the IBM SP/p690, comparing PDSYEV and PDSYEVD)
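PDSYEV and PDSYEVD above are ScaLAPACK's distributed-memory drivers for the real symmetric eigenproblem; PDSYEVD is the divide-and-conquer variant. Purely for reference, a serial call to the corresponding LAPACK divide-and-conquer driver through the LAPACKE C interface is sketched below (matrix size and contents are arbitrary; the parallel routines additionally need a BLACS process grid and distributed array descriptors, which are not shown):

```c
/* Serial analogue of PDSYEVD: LAPACK's divide-and-conquer symmetric
 * eigensolver, called through the LAPACKE C interface. Illustrative only. */
#include <stdio.h>
#include <lapacke.h>

int main(void)
{
    const lapack_int n = 4;
    /* Small symmetric test matrix (row-major, upper triangle referenced). */
    double a[16] = { 4.0, 1.0, 2.0, 0.0,
                     1.0, 3.0, 0.0, 1.0,
                     2.0, 0.0, 5.0, 2.0,
                     0.0, 1.0, 2.0, 6.0 };
    double w[4];   /* eigenvalues, ascending */

    /* 'V' = compute eigenvectors (returned in a), 'U' = use upper triangle. */
    lapack_int info = LAPACKE_dsyevd(LAPACK_ROW_MAJOR, 'V', 'U', n, a, n, w);
    if (info != 0) {
        fprintf(stderr, "dsyevd failed, info = %d\n", (int)info);
        return 1;
    }
    for (int i = 0; i < n; i++)
        printf("lambda[%d] = %f\n", i, w[i]);
    return 0;
}
```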

SLIDE 18

Direct minimisation of the total energy (avoiding diagonalisation)

The single-particle wavefunctions are expanded in plane waves:

\psi_j(\mathbf{k}, \mathbf{r}) = \sum_{|\mathbf{k}+\mathbf{G}|^2 < E_{\mathrm{cut}}} C_{j,\mathbf{k}+\mathbf{G}} \, e^{i(\mathbf{k}+\mathbf{G}) \cdot \mathbf{r}}

  • Pseudopotentials must be used to keep the number of plane

waves manageable

  • Large number of basis functions, N ~ 10^6 (especially for heavy atoms).

The plane wave expansion means that the bulk of the computation comprises large 3D Fast Fourier Transforms (FFTs) between real and momentum space.

  • These are distributed across the processors in various ways.
  • The actual FFT routines are optimized for the cache size of the

processor.
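One common way of distributing such FFTs (not necessarily CASTEP's own scheme) is a slab decomposition: each rank transforms the dimensions it holds locally, and the grid is then redistributed so the remaining dimension becomes local. The sketch below shows only that redistribution step, built on MPI_Alltoallv, the collective identified as a bottleneck on the next slide; the grid size, real-valued data and packing layout are illustrative assumptions:

```c
/* Sketch of the data redistribution ("transpose") step behind a slab-
 * decomposed parallel 3D FFT, using MPI_Alltoallv. Grid size and the
 * simple row-major packing are illustrative assumptions only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n = 64;                /* global grid edge; assume n % nprocs == 0 */
    const int nloc = n / nprocs;     /* planes owned by this rank */

    /* local[i][j]: plane i (global plane rank*nloc + i), column j */
    double *local   = malloc((size_t)nloc * n * sizeof(double));
    double *sendbuf = malloc((size_t)nloc * n * sizeof(double));
    double *recvbuf = malloc((size_t)nloc * n * sizeof(double));
    for (int i = 0; i < nloc * n; i++)
        local[i] = rank;             /* dummy data */

    /* Pack the block destined for each rank contiguously: rank p gets
     * columns [p*nloc, (p+1)*nloc) of every locally held plane. */
    int *counts = malloc(nprocs * sizeof(int));
    int *displs = malloc(nprocs * sizeof(int));
    for (int p = 0; p < nprocs; p++) {
        counts[p] = nloc * nloc;
        displs[p] = p * nloc * nloc;
        for (int i = 0; i < nloc; i++)
            for (int j = 0; j < nloc; j++)
                sendbuf[displs[p] + i * nloc + j] = local[i * n + p * nloc + j];
    }

    /* After this call each rank holds every plane for its own nloc columns,
     * so 1D FFTs along the remaining dimension become purely local work. */
    MPI_Alltoallv(sendbuf, counts, displs, MPI_DOUBLE,
                  recvbuf, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);

    if (rank == 0)
        printf("transpose step complete on %d ranks\n", nprocs);

    free(local); free(sendbuf); free(recvbuf); free(counts); free(displs);
    MPI_Finalize();
    return 0;
}
```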

Materials Simulation. Plane Wave Methods: CASTEP

SLIDE 19

CASTEP 4.2 - kG Parallel Benchmark

TiN: a 33-atom slab of TiN, 8 k-points, single energy calculation
  • 88,000 plane waves
  • 3D FFT grid: 108 x 36 x 36
  • Vanderbilt pseudopotential

(Plot: performance relative to the Cray T3E/1200E against number of CPUs, 32-128, for the IBM SP/p690, AlphaServer SC ES45/1000 and SGI Origin 3800/R14k-500)

Bottleneck: the data transformation associated with the 3D FFT and MPI_AlltoAllV

SLIDE 20

CASTEP TiN Benchmark, 32 k-points

(Plot: performance (10000/time) against number of processors, 32-128, for the IBM SP/p690 and Cray T3E/1200E; annotated ratios of x4.7, x5.0 and x5.2)

  • Applying a dense Monkhorst-Pack (MP) mesh to the 8 k-point test, leading to 32 k-points
  • Respectable performance
  • Large numbers of k-points are typical for calculations of metals

SLIDE 21

Scalability of CRYSTAL for Crystalline Crambin

(Plot: speed-up against number of processors, 256-1024, versus linear, on the IBM SP/p690 (HPCx) for three basis sets: 6-31G** (12,354 GTOs), 6-31G (7,194 GTOs) and STO-3G (3,948 GTOs); annotated T1024 times of 716, 668 and 245 seconds)

  • A faster, more stable version of the parallel Jacobi diagonalizer replaces ScaLAPACK
  • The structure of crambin is derived from XRD data at 0.52 Å (1,284 atoms)

SLIDE 22

Molecular Simulation

DL_POLY

  • W. Smith and T.R. Forester, CLRC Daresbury Laboratory

General purpose molecular dynamics simulation package http://www.cse.clrc.ac.uk/msi/software/DL_POLY/

NAMD

Theoretical and Computational Biophysics Group, NIH

  • Parallel, object-oriented molecular dynamics code
  • High-performance simulation of large biomolecular systems
  • Scales to hundreds of processors on high-end parallel platforms

http://www.ks.uiuc.edu/Research/namd/

SLIDE 23

DL_POLY V2: Replicated Data

  • Ionic simulations - Bench 4: NaCl; 27,000 ions, Ewald sum, 75 time steps, cutoff = 24 Å
  • Macromolecular simulations - Bench 7: Gramicidin in water; rigid bonds and SHAKE, 12,390 atoms, 500 time steps

(Two plots of performance relative to the Cray T3E/1200E against number of CPUs (16-128 and 16-64), comparing the SGI Origin 3800/R14k-500, Compaq AlphaServer SC ES45/1000 and IBM SP/p690 on each benchmark)

SLIDE 24

Alternative FFT algorithm to reduce communication costs (sketched below):
  – the 3D FFT is performed as a series of 1D FFTs, each involving communications only between blocks in a given column
  – more data is transferred, but in far fewer messages
  – rather than all-to-all, the communications are column-wise only
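A minimal sketch of how column-restricted communication can be set up, assuming the processes are viewed as a logical 2D grid and split into per-column communicators with MPI_Comm_split (an illustrative pattern, not DL_POLY's actual implementation):

```c
/* Sketch: split MPI_COMM_WORLD into per-column communicators so that the
 * 1D FFT stages only ever communicate within a column of the process grid.
 * The grid dimensions below are illustrative assumptions. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int ncols = 4;                 /* assume nprocs % ncols == 0 */
    int col = rank % ncols;              /* which column this rank sits in */
    int row = rank / ncols;

    /* All ranks with the same colour (column index) share a communicator;
     * collectives on col_comm involve only that column's blocks. */
    MPI_Comm col_comm;
    MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);

    int col_rank, col_size;
    MPI_Comm_rank(col_comm, &col_rank);
    MPI_Comm_size(col_comm, &col_size);
    printf("world rank %d -> column %d, rank %d of %d in that column\n",
           rank, col, col_rank, col_size);

    MPI_Comm_free(&col_comm);
    MPI_Finalize();
    return 0;
}
```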

DL_POLY3 Coulomb Energy Performance

  • Distributed data
  • SPME, with the revised FFT scheme
  • 216,000 ions, 200 time steps, cutoff = 12 Å

(Plot: performance relative to the Cray T3E/1200E against number of CPUs, 32-256, for the IBM SP/p690, AlphaServer SC ES45/1000 and SGI Origin 3800/R14k-500)

SLIDE 25

DL_POLY3 Macromolecular Simulations

Gramicidin in water; rigid bonds + SHAKE: 792,960 ions, 50 time steps

(Two plots against number of CPUs, 32-256: measured time in seconds for the SGI Origin 3800/R14k-500, AlphaServer SC ES45/1000 and IBM SP/p690, and performance relative to the SGI Origin 3800/R14k-500 for the IBM SP/p690 and AlphaServer SC ES45/1000)

SLIDE 26

Molecular Simulation - NAMD Scaling (NAMD_2.5b2)

(Plot: speedup against number of CPUs, 128-512, versus linear, for the IBM SP/Regatta-H and Compaq AlphaServer ES45/1000)

  • Standard NAMD ApoA-I benchmark: a system comprising 92,442 atoms, with a 12 Å cutoff and PME every 4 time steps
  • Scalability improves with larger simulations - a speedup of 778 on 1024 CPUs of TCS-1 in a 327K-particle simulation of F1-ATPase

http://www.ks.uiuc.edu/Research/namd/

SLIDE 27

Ab Initio Molecular Electronic Structure

GAMESS-UK - DFT Calculations:
  – Global Array (GA) tools from PNNL
  – parallel eigensolvers (PeIGS)

Benchmark: Valinomycin (DFT HCTH), basis DZVP2_A2 (DGauss), 1620 GTOs

(Plots of elapsed time (seconds) and speedup against number of CPUs, 32-128, for the SGI Origin 3800/R14k-500, IBM SP/p690 and AlphaServer ES45/1000)

Bottlenecks: LAPI + matrix diagonalisation

SLIDE 28

Atomic and Molecular Physics

PFARM

Queen’s University Belfast, CLRC Daresbury Laboratory

R-matrix formalism to treat applications such as the description of the edge region in tokamak plasmas (fusion power research) and the interpretation of astrophysical spectra

H2MOL

Queen’s University Belfast

Solves the time-dependent Schrödinger equation to calculate energy distributions for laser-driven dissociative ionization of the H2 molecule.

SLIDE 29

A&M Physics: Electron-Atom Collisions

  • R-matrix theory - efficient methods for investigating

electron-atom and electron-molecule collisions.

  • Calculation involves integration of up to 10^3 coupled channels, i.e. 2nd-order linear differential equations.

  • External Region Calculation Timings:
    – data from internal-region calculations (read from disk)
    – two-stage approach: diagonalisation [PeIGS (20%, 4K)] and functional task parallelisation (80%), BLAS3 dominated
    – systolic processor pipeline approach
    – coarse-grained parallelism ensures scalable performance
    – asynchronous communications minimise communication costs

  • Benchmark Example for PFARM application
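As a generic illustration of the systolic-pipeline pattern with asynchronous communication (not PFARM's actual code; block size, neighbour choice and the stand-in compute loop are assumptions):

```c
/* Generic systolic-pipeline step: receive a block from the previous rank,
 * work on the current block, send it downstream, all overlapped with
 * non-blocking MPI. Illustrative only; not taken from PFARM. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCK 4096   /* doubles per pipeline block - assumption */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int prev = (rank == 0)          ? MPI_PROC_NULL : rank - 1;
    int next = (rank == nprocs - 1) ? MPI_PROC_NULL : rank + 1;

    double *incoming = calloc(BLOCK, sizeof(double));
    double *current  = calloc(BLOCK, sizeof(double));

    const int nsteps = 8;
    for (int step = 0; step < nsteps; step++) {
        MPI_Request reqs[2];

        /* Post the receive for the next block before doing any work. */
        MPI_Irecv(incoming, BLOCK, MPI_DOUBLE, prev, step, MPI_COMM_WORLD, &reqs[0]);

        /* "Compute" on the block we already hold (stand-in for BLAS3 work). */
        for (int i = 0; i < BLOCK; i++)
            current[i] += 1.0;

        /* Pass the processed block downstream while the receive completes. */
        MPI_Isend(current, BLOCK, MPI_DOUBLE, next, step, MPI_COMM_WORLD, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        /* The newly received block becomes the next work item. */
        double *tmp = current; current = incoming; incoming = tmp;
    }

    if (rank == 0)
        printf("pipeline of %d ranks completed %d steps\n", nprocs, nsteps);

    free(incoming); free(current);
    MPI_Finalize();
    return 0;
}
```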
SLIDE 30

External Region Calculation Timings

(Left: PFARM performance ratio vs the Cray T3E/1200E against number of CPUs, 16-128, for the SGI Origin 3800/R12k-400 and IBM SP/p690. Right: breakdown of elapsed time (seconds) on the IBM SP/p690 into propagation and diagonalisation time, 16-128 CPUs.)

Bottleneck: matrix diagonalisation

SLIDE 31

H2MOL - Performance on the IBM SP/p690

(Plot: elapsed time (seconds) against number of CPUs, 100-600, for global Zpts = 120, 240, 360 and 480)

  • Solves the time-dependent Schrödinger equation to calculate energy distributions for laser-driven dissociative ionization of the H2 molecule.

  • Cylindrical computational grid of φ, ρ and Z

co-ordinates. Z points are distributed over processors arranged logically in a triangular grid.

  • Most time spent calculating 5-point finite

difference schemes and in ZGEMM. MPI collectives relatively expensive for extremely large processor grids.

  • Main optimisations: improving ZGEMM

performance for small matrix sizes and using asynchronous message passing.

  • Improved scalability for larger grid sizes.
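For reference, a single small complex matrix multiply of the kind ZGEMM performs looks like the sketch below, here through the CBLAS interface; the matrix dimension is arbitrary, and none of the small-matrix tuning or asynchronous messaging described above is shown:

```c
/* Minimal complex matrix multiply C = alpha*A*B + beta*C via CBLAS ZGEMM,
 * the kind of small dense multiply H2MOL spends much of its time in.
 * Sizes and data are illustrative only. */
#include <stdio.h>
#include <complex.h>
#include <cblas.h>

int main(void)
{
    const int n = 4;                      /* small matrix dimension - assumption */
    double complex A[16], B[16], C[16];
    for (int i = 0; i < n * n; i++) {
        A[i] = 1.0 + 0.5 * I;
        B[i] = 0.5 - 0.25 * I;
        C[i] = 0.0;
    }
    const double complex alpha = 1.0, beta = 0.0;

    /* Row-major, no transposes: C(n x n) = A(n x n) * B(n x n). */
    cblas_zgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, &alpha, A, n, B, n, &beta, C, n);

    printf("C[0][0] = %f + %fi\n", creal(C[0]), cimag(C[0]));
    return 0;
}
```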
SLIDE 32

Computational Engineering

UK Turbulence Consortium

Led by Prof. Neil Sandham, University of Southampton

  • Focus on compute-intensive methods (Direct Numerical

Simulation, Large Eddy Simulation, etc) for the simulation of turbulent flows

  • Shock/boundary-layer interaction modelling - critical for accurate aerodynamic design but still poorly understood

http://www.afm.ses.soton.ac.uk/

SLIDE 33

Direct Numerical Simulation: 360^3 benchmark

(Plot: performance (million iteration points/sec) against number of processors, 128-1024, for the IBM SP/p690 (ORNL), Cray T3E/1200E and IBM SP/p690 (HPCx), with results scaled from 128 CPUs where indicated)

SLIDE 34

Environmental Science

Proudman Oceanographic Laboratory Coastal Ocean Modelling System (POLCOMS) http://www.pol.ac.uk/home/research/polcoms/

Multidisciplinary Studies in coastal/shelf environments

SLIDE 35

POLCOMS Structure

3D baroclinic hydrodynamic coastal-ocean model coupled to ERSEM biology

(Schematic of inputs and applications: tidal forcing, meteorological forcing, UKMO ocean-model forcing; UKMO operational forecasting, climatology and extreme statistics, fish larvae modelling, sediment transport and resuspension, contaminant modelling)

Realistic physical forcing to interact with, and transport, environmental parameters

SLIDE 36

POLCOMS resolution benchmark

  • Standard workhorse model is a 12 km resolution grid covering

the whole of the north-west European shelf (198 x 224 x 34)

  • Want to maintain accuracy in the presence of eddies, fronts,

steep topography, thermoclines etc.

  • Scientific requirement heading to 1 km shelf-wide resolution
  • Design set of benchmarks

– 12 km ( 200 x 200 x 34) – 6 km ( 400 x 400 x 34) – 3 km ( 800 x 800 x 34) – 2 km (1200 x 1200 x 34) – 1 km (2400 x 2400 x 34)

  • Fixed number of timesteps => decreasing run length
  • Short run, subtract start-up and shut-down times
  • Performance metric is gridpoints * timesteps / time
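As a worked illustration of that metric for the 2 km grid (all timing numbers below are placeholders, not measurements):

```c
/* Performance metric used for the POLCOMS benchmark:
 * (grid points * timesteps) / wall-clock time, with start-up and
 * shut-down subtracted. All timing numbers below are placeholders. */
#include <stdio.h>

int main(void)
{
    const double nx = 1200, ny = 1200, nz = 34;   /* 2 km benchmark grid */
    const double timesteps = 100.0;               /* placeholder step count */
    const double t_total = 250.0;                 /* placeholder, seconds */
    const double t_startup_shutdown = 10.0;       /* placeholder, seconds */

    double metric = nx * ny * nz * timesteps / (t_total - t_startup_shutdown);
    printf("%.1f M grid-point-timesteps/sec\n", metric / 1e6);
    return 0;
}
```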
SLIDE 37

POLCOMS resolution benchmark: IBM SP/p690

(Plot: performance (M grid-point-timesteps/sec) against number of processors, 128-1024, for the 12 km, 6 km, 3 km, 2 km and 1 km grids, with the ideal scaling line)

Scientific requirement heading to 1 km shelf-wide resolution

SLIDE 38

POLCOMS 2 km benchmark: all systems

(Plot: performance (M grid-point-timesteps/sec) against number of processors, 128-1024, for the IBM p690, Cray T3E and Origin 3800, with the ideal IBM scaling line)

SLIDE 39

HPCx Terascale Applications Team: Strategy for Capability Computing

  • 1. Performance Attributes of Key Applications

– Trouble-shooting with Vampir & Paraver

  • 2. Scalability of Numerical Algorithms

– Parallel eigensolvers

  • 3. Optimisation of Communication Collectives

– MPI_ALLTOALLV and CASTEP

  • 4. Memory-driven Approaches

– in-core SCF & DFT, direct minimisation & CRYSTAL

  • 5. Terascaling Applications

– NWChem, NAMD ...

  • 6. Migration from replicated to distributed data

– DL_POLY-3

  • 7. Scientific drivers amenable to Capability Computing

– Enhanced Sampling Methods, Replica Methods
– Efficient Serial Execution

SLIDE 40

Summary

  • UK has a New Facility for Capability Computing: HPCx

– 66% Technology, 33% Support

  • Key Strategic Applications Areas

– Materials Science, Molecular Simulation, Molecular Electronic Structure, A&M Physics, Computational Engineering, Environmental Science

  • HPCx Terascale Applications Team

– Strategy for Capability Computing

  • Range of Performance Results
    – size matters!
    – limited scalability for applications:
      • with global communications (CASTEP)
      • featuring linear algebra routines with extensive communication requirements (AIMPRO)
    – linear scaling to 1024 processors for nearest-neighbour CFD codes (PDNS3D, POLCOMS)

SLIDE 41

Acknowledgements

  • HPCx Terascaling Team
    – Mike Ashworth, Ian Bush, Martyn Guest, David Henty, Martin Plummer, Lorna Smith, Kevin Stratford, Andrew Sunderland

  • IBM Technical Support
    – Luigi Brochard et al.

  • NCSA
    – Rick Kufrin (NAMD)

  • CSAR Computing Service
    – Cray T3E 'turing', Origin 3800 R12k-400 'green'

  • ORNL
    – IBM Regatta 'cheetah'

  • SARA
    – Origin 3800 R14k-500 'teras'

  • PSC
    – AlphaServer SC ES45-1000