Exploring Extreme Scalability in Scientific Applications

SLIDE 1

6th May 2008 CUG 2008 Helsinki

Computational Science & Engineering

Exploring Extreme Scalability in Scientific Applications

Mike Ashworth, Ian Bush, Charles Moulinec, Ilian Todorov

Computational Science & Engineering, STFC Daresbury Laboratory
m.ashworth@dl.ac.uk
http://www.cse.scitech.ac.uk/

SLIDE 2

Outline

  • Why?
  • How?
  • What?
  • Where?
SLIDE 3

Outline

  • Why explore extreme scalability?
  • How are we doing this?
  • What have we found so far?
  • Where are we going next?
SLIDE 4

UK National Services

[Timeline 1998-2013 of UK national HPC services and their technology upgrades: EPCC and CSAR (Cray T3D, T3E, SGI Origin, Altix), HPCx (IBM p690, p690+, p5-575, p5+), HECToR (Cray XT4, XT4 quad-core) and a planned "Child of HECToR" (?).]

SLIDE 5

HPC Strategy in the UK

The HPC Strategy Committee:

“… the UK should aim to achieve sustained Petascale performance as early as possible across a broad field of scientific applications, permitting the UK to remain internationally competitive in an increasingly broad set of high-end computing grand challenge problems.”

… from A Strategic Framework for High-End Computing

SLIDE 6

What will a Petascale system look like?

Current indicators:

  • TOP500 #1: LLNL Blue Gene/L, 0.478 Pflop/s
    – 212,992 processors, dual-core nodes
  • TACC Ranger, Sun Constellation Cluster, 0.504 Pflop/s peak
    – 62,976 processors, 4× quad-core nodes
  • ORNL current upgrade to Cray XT4, 0.250 Pflop/s
    – 45,016 processors, quad-core nodes
  • Japanese Petascale project
    – smaller number of O(100) Gflop/s vector processors

Most likely solution is O(100,000) processors using multi-core components

SLIDE 7

Challenges at the Petascale

Scientific:

  • What new science can you do with 1000 Tflop/s?
  • Larger problems, multi-scale, multi-disciplinary

Technical:

  • How will existing codes scale to 10,000 or 100,000 processors?

    – scaling of time with processor count, of time with problem size, and of memory with problem size

  • Data management, incl. pre- and post-processing
  • Visualisation
  • Fault tolerance
SLIDE 8

Daresbury Petascale project

  • Scaling analysis of current codes
  • Performance analysis on O(10,000) procs
  • Forward-look prediction to O(100,000) procs
  • Optimisation of current algorithms
  • Development of new algorithms
  • Evaluation of alternative programming models

SLIDE 9

Machines

SLIDE 10

Machines

  • Cray XT4 HECToR – dual-core 2.8 GHz Opteron, 11,328 cores
  • IBM p5-575 HPCx – dual-core 1.7 GHz POWER5, HPS interconnect, 2,560 cores
  • Cray XT3 palu (CSCS) – dual-core 2.6 GHz Opteron, 3,328 cores
  • IBM BlueGene/L jubl – dual-core 700 MHz PowerPC, 16,384 cores

“Application Performance on the UK’s New HECToR Service”, Fiona Reid et al., CUG 2008, Wednesday pm

SLIDE 11

CCLRC Daresbury Laboratory

Home of HPCx – 2560-CPU IBM POWER5

SLIDE 12

Applications

SLIDE 13

Applications

  • PDNS3D/SBLI – direct numerical simulation (DNS) of turbulent flow
  • Code_Saturne – unstructured finite-volume CFD code
  • POLCOMS – coastal-ocean finite-difference code
  • DL_POLY3 – molecular dynamics code
  • CRYSTAL – first-principles periodic quantum chemistry code

SLIDE 14

What is a processor?

A processor by any other name … an application’s view: a processor is what it has always been …

  – a short name for Central Processing Unit
  – something that runs a single instruction stream
  – something that runs an MPI task
  – something that runs a bunch of threads (OpenMP)

SLIDE 15

PDNS3D / SBLI

SLIDE 16

DNS results of near-wall turbulent flow

SLIDE 17

3D grid partitioning with halo cells

  – calculation cost scales as n³
  – communication cost scales as n²
  – strong scaling: increasing P means decreasing n, so comms will dominate
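To make the surface-to-volume argument concrete, here is a minimal strong-scaling cost-model sketch; the function and its constants t_calc and t_comm are illustrative assumptions, not measurements from SBLI or any machine above:

```python
# Minimal strong-scaling cost model for a halo-exchange code on an
# N^3 grid split across P processors. Constants are placeholders.

def cost_per_step(N, P, t_calc=1e-8, t_comm=2e-8, halo=1):
    """Estimated seconds per timestep: (compute, communicate)."""
    n = N / P ** (1 / 3)               # local subdomain edge length
    calc = t_calc * n ** 3             # work ~ local volume, n^3
    comm = t_comm * 6 * halo * n ** 2  # halo exchange ~ surface, n^2
    return calc, comm

for P in (1024, 2048, 4096, 8192):
    calc, comm = cost_per_step(600, P)
    print(f"P={P:5d}  comms fraction = {comm / (calc + comm):5.1%}")
```

As P grows at fixed N, n shrinks and the n² term gains on the n³ term, which is exactly the rising communications fraction measured with CrayPAT two slides on.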

SLIDE 18

SBLI on Cray XT4

[Chart: performance (Mgrid-points × iterations/sec) vs. number of processors (1024-8192) for 600×600×600, 480×480×480 and 360×360×360 grids.]

Turbulent channel flow benchmark; larger problems scale better.
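The chart’s vertical axis is easy to recompute from a run log; a short sketch, where every number is hypothetical rather than a measured SBLI result:

```python
# Compute the chart's metric (Mgrid-points * iterations / second) and a
# relative parallel efficiency. All timings below are invented examples.

def mpts_iters_per_sec(nx, ny, nz, iters, wall_seconds):
    return nx * ny * nz * iters / wall_seconds / 1e6

base = mpts_iters_per_sec(600, 600, 600, 100, 90.0)  # hypothetical, 1024 procs
wide = mpts_iters_per_sec(600, 600, 600, 100, 14.0)  # hypothetical, 8192 procs

# Efficiency of the 8x larger run relative to an ideal 8x speed-up:
print(f"relative efficiency: {wide / (8 * base):.0%}")
```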

SLIDE 19

[Chart: communications time (%) measured with CrayPAT vs. number of processors (1024-6144) for the 360×360×360, 480×480×480 and 600×600×600 grids.]

SLIDE 20

Code_Saturne

SLIDE 21

Code_Saturne performance

[Chart: performance (arbitrary units) vs. number of processors (2048-8192) for 78 million and 120 million cell meshes.]

SLIDE 22

Code_Saturne:

  • Unstructured CFD code from EDF
  • Run with a structured mesh for an LES simulation of turbulent channel flow
  • METIS or Scotch used to partition the grid
  • Linear scaling performance to 8192 processors (no I/O)
  • Efficient parallel I/O is essential for this code
  • Memory for partitioning is an issue with very large meshes
    – need to move to a parallel partitioner
    – will the mesh quality then be maintained?
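To make the partitioning step concrete, here is a minimal sketch using the pymetis bindings to METIS (chosen for brevity; Code_Saturne itself calls METIS or Scotch natively, and the six-cell adjacency below is a toy stand-in for a real mesh):

```python
# Partition a tiny mesh-connectivity graph with METIS via pymetis.
import pymetis

# Cell i is a neighbour of every cell listed in adjacency[i].
adjacency = [[1, 3], [0, 2, 4], [1, 5], [0, 4], [1, 3, 5], [2, 4]]

edge_cut, parts = pymetis.part_graph(2, adjacency=adjacency)
print(f"edges cut: {edge_cut}, cell -> partition map: {parts}")

# The memory problem on the slide: this whole graph must fit on one
# node, which fails for 100M+ cell meshes - hence the move towards a
# parallel partitioner (e.g. ParMETIS or PT-Scotch).
```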

SLIDE 23

POLCOMS

SLIDE 24

High-Resolution Coastal Ocean Modelling

POLCOMS is the finest-resolution model to date to simulate the circulation, temperature and salinity of the Northwest European continental shelf, important for understanding the transport of nutrients, pollutants and dissolved carbon around shelf seas. We have worked with POL on coupling with ERSEM, WAM, CICE, data assimilation and optimisation for HPC platforms.

[Figure: volume transport, Jul-Sep mean.]

Advective controls on primary production in the stratified western Irish Sea: an eddy-resolving model study, J. T. Holt, R. Proctor, J. C. Blackford, J. I. Allen, M. Ashworth, Journal of Geophysical Research, 109, C05024, 2004.

SLIDE 25

Coupled Marine Ecosystem Model

[Diagram: coupled marine ecosystem model – a physical model, pelagic ecosystem model and benthic model linked by C, N, P and Si exchanges, forced by wind stress, heat flux, irradiation and cloud cover, with sediments, river inputs and an open boundary.]

SLIDE 26

POLCOMS HRCS performance

[Chart: POLCOMS HRCS physics-only performance (model days/day) vs. number of processors (256-1536) on Cray XT4 HECToR, Cray XT3 palu and IBM p5-575 HPCx.]

SLIDE 27

POLCOMS:

  • Structured-grid finite-difference code from POL
  • Sophisticated advection scheme to represent fronts, eddies etc. in the shelf seas
  • Halo-based partitioning, complicated by the land/sea issue (illustrated below)
  • Performance dependent on partitioning
  • Known issue with communications imbalance – new version under test
  • Efficient parallel I/O is essential for this code
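To see why the land/sea issue complicates halo-based partitioning: a regular block partition gives every processor equal area, but only wet (sea) points do work. A toy sketch with an invented diagonal coastline, not the real HRCS bathymetry:

```python
# Load imbalance of a regular block partition over a land/sea mask.
import numpy as np

ny, nx = 512, 512
yy, xx = np.mgrid[0:ny, 0:nx]
wet = (xx + yy) > nx          # toy coastline: lower-left half is land

py, px = 8, 8                 # 8 x 8 process grid of regular blocks
loads = [wet[j * ny // py:(j + 1) * ny // py,
             i * nx // px:(i + 1) * nx // px].sum()
         for j in range(py) for i in range(px)]

# The busiest subdomain sets the pace for everyone.
mean = sum(loads) / len(loads)
print(f"max/mean load imbalance: {max(loads) / mean:.2f}x")
```

With half the domain on land, the all-sea blocks carry roughly twice the mean load, which is why performance depends so strongly on the choice of partitioning.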

SLIDE 28

DL_POLY

SLIDE 29

Migration from Replicated to Distributed Data – DL_POLY3: Coulomb Energy Evaluation

Conventional routines (e.g. FFTW) assume plane or column distributions. A global transpose of the data is required to complete the 3D FFT, and additional costs are incurred re-organising the data from the natural block domain decomposition. An alternative FFT algorithm has been designed to reduce communication costs:

  – the 3D FFT is done as a series of 1D FFTs, each involving communications only between blocks in a given column
  – the data distribution matches that used for the rest of the DL_POLY energy routines
  – more data is transferred, but in far fewer messages
  – rather than all-to-all, the communications are column-wise only (see the sparse comms structure in the figure)

[Figure: plane vs. block data distributions.]
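The property those bullets rely on can be checked serially in a few lines: a 3D FFT factorises exactly into three sweeps of 1D FFTs, one per axis, so a parallel version needs communication along only one direction of the process grid per sweep. A numpy illustration (of the mathematics only, not of the DL_POLY implementation):

```python
# A 3D FFT is three sweeps of 1D FFTs, one along each axis. In the
# block-decomposed parallel version, each sweep couples only the
# blocks lying in one column of the process grid - hence column-wise
# rather than all-to-all communication.
import numpy as np

rng = np.random.default_rng(1)
a = rng.random((8, 8, 8)) + 1j * rng.random((8, 8, 8))

out = np.fft.fft(a, axis=0)      # 1D FFTs along x ...
out = np.fft.fft(out, axis=1)    # ... then along y ...
out = np.fft.fft(out, axis=2)    # ... then along z

assert np.allclose(out, np.fft.fftn(a))  # matches the direct 3D FFT
```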

SLIDE 30

BlueGene/L times

[Chart: seconds per evaluation vs. number of processors (4096-16384), broken down into MD total, Ewald k-space, link cells, other, van der Waals and Ewald real-space.]

14.6 million particle Gd2Zr2O7 system

SLIDE 31

Cray XT4 & BGL performance

[Chart: performance (arbitrary units) vs. number of processors (4096-16384) on Cray XT4 HECToR and IBM BlueGene/L jubl.]

SLIDE 32

Scaling analysis BGL

[Chart: speed-up vs. number of processors (4096-16384) on BlueGene/L for van der Waals, Ewald real-space, link cells, other, Ewald k-space and MD total, against ideal scaling.]

SLIDE 33

Scaling analysis XT4

[Chart: speed-up vs. number of processors (2048-8192) on the Cray XT4 for van der Waals, Ewald real-space, link cells, other, Ewald k-space and MD total, against ideal scaling.]

SLIDE 34

DL_POLY:

  • Excellent scaling with more than ~1000 particles per processor
  • Scalability limited by the long-range forces
  • Can use force-shifted Coulomb electrostatics (sketched below)
  • Fast multipole electrostatics for even larger systems
  • I/O is a major bottleneck
    – efficient parallel I/O is essential for this code
    – plus tools to handle & visualize large output datasets
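For reference, force-shifted Coulomb electrostatics truncates the 1/r interaction at a cutoff r_c while keeping both the energy and the force continuous there, so no poorly-scaling k-space (reciprocal) sum is needed. A minimal sketch of the standard form, which may differ in convention from DL_POLY’s implementation:

```python
# Force-shifted Coulomb: U and -dU/dr both vanish at the cutoff r_c,
# making the interaction strictly short-ranged. Unit charges, Gaussian
# units; illustrative form only.

def u_fs(r, r_c):
    """Force-shifted Coulomb potential energy."""
    return 0.0 if r >= r_c else 1.0 / r - 1.0 / r_c + (r - r_c) / r_c ** 2

def f_fs(r, r_c):
    """Force magnitude -dU/dr; continuous (zero) at r = r_c."""
    return 0.0 if r >= r_c else 1.0 / r ** 2 - 1.0 / r_c ** 2

r_c = 12.0
print(u_fs(r_c - 1e-9, r_c), f_fs(r_c - 1e-9, r_c))  # both ~0 at the cutoff
```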

“The Need for Parallel I/O in Classical Molecular Dynamics”, Ilian Todorov, CUG 2008, Tuesday am
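On that I/O point: the usual remedy is collective MPI-IO, with every rank writing its own contiguous slice of one shared file instead of funnelling everything through rank 0. A minimal mpi4py sketch (Python and the file name are illustrative choices here; the production codes are Fortran):

```python
# Collective MPI-IO: each rank writes its block of one shared file at a
# rank-dependent offset. Run with e.g.: mpiexec -n 4 python write_demo.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local = np.full(1000, rank, dtype=np.float64)   # this rank's data

fh = MPI.File.Open(comm, "traj.bin", MPI.MODE_CREATE | MPI.MODE_WRONLY)
fh.Write_at_all(rank * local.nbytes, local)     # collective write
fh.Close()
```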

SLIDE 35

CRYSTAL

SLIDE 36

Crystal

  • Electronic structure and related properties of periodic systems
  • All-electron, local Gaussian basis set, DFT and Hartree-Fock
  • Under continuous development since 1974
  • Distributed to over 500 sites worldwide
  • Developed jointly by Daresbury and the University of Turin

SLIDE 37

Crambin Results – Electrostatic Potential

Charge density isosurface coloured according to the electrostatic potential. Useful to determine possible chemically active groups.

SLIDE 38

SCF cycle scaling

[Chart: SCF performance (arbitrary units) vs. number of processors (1024-4096) on Cray XT4 HECToR and IBM p5-575 HPCx, against ideal scaling.]

1737 atoms, 23,268 basis functions

SLIDE 39

SCF breakdown

[Chart: percentage of SCF execution time vs. number of processors (1024-4096) for integral evaluation and diagonalization on HPCx and HECToR.]

SLIDE 40

CRYSTAL

The SCF cycle is dominated by two parts:

  • Integral evaluation for the Kohn-Sham matrix
    – time scales linearly
    – difficult to distribute, so poor scaling in memory
  • Dense linear algebra (diagonalization)
    – standard libraries (e.g. ScaLAPACK divide & conquer)
    – communications-heavy, so poor scaling

The run starts with integral evaluation dominating; for larger systems and larger numbers of processors the diagonalization dominates (see the cost sketch below). Will need to look at diagonalization-less methods.

“Investigating the Performance of Parallel Eigensolvers on High-end Systems”, Andy Sunderland, CUG 2008, Wednesday pm
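Why the diagonalization eventually wins can be seen from a toy cost model: integral time grows roughly linearly with system size N and parallelises well, while dense diagonalization grows as N³ and is communication-bound. Every constant and exponent below is invented for illustration, not fitted to CRYSTAL data:

```python
# Toy SCF cost model: t_int ~ N / P (parallelises well);
# t_diag ~ N^3 / sqrt(P) (communication-limited eigensolver).
# a, b and the P exponents are illustrative assumptions only.

def scf_times(N, P, a=1.0, b=1e-10):
    t_int = a * N / P
    t_diag = b * N ** 3 / P ** 0.5
    return t_int, t_diag

for N, P in [(10_000, 1024), (30_000, 1024), (30_000, 4096)]:
    t_int, t_diag = scf_times(N, P)
    dom = "diagonalization" if t_diag > t_int else "integrals"
    print(f"N={N:6d} P={P:5d}  int={t_int:6.1f}s  diag={t_diag:6.1f}s  -> {dom}")
```

Increasing either N or P pushes the balance towards the diagonalization, matching the SCF breakdown on the previous slide and motivating diagonalization-free methods.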

SLIDE 41

Applications conclusions

We have looked at five codes up to 16384 procs

– Mainly to 8192 on Cray XT4, also BlueGene/L and /P

Most codes scale well to O(10,000) procs:

– Need large problem sizes – Need efficient parallel I/O (in progress) – Need diagonalization-less methods for quantum chemistry – Need parallel partitioning for unstructured mesh codes

Prospects look good to exploit higher numbers

– Scaling isn’t everything; we also need to look at efficiencies, especially for quad-core, multi-core and beyond
– Fortran+MPI works just fine (so far!)

SLIDE 42

ORNL Scaling Workshop, July 2007

Several speakers concluded that:

  • The MPI send-receive model may hit limitations at very high processor numbers
  • Hybrid programming, e.g. MPI/OpenMP, may help – only one MPI task per multi-core node, especially for collectives – and also saves memory
  • Single-sided messaging may be needed, and the PGAS languages (e.g. Co-Array Fortran, UPC) may be a good high-level interface

However, there are as yet few cases of demonstrated performance advantages over vanilla MPI

“Migrating a Scientific Application from MPI to Co-Arrays”, John Ashby, CUG 2008, Thursday am

SLIDE 43

Conclusions

  • Petascale computing will soon be available in the UK
  • Largely achieved by massive increases in the number of processors
  • Systems will be based on multi-core nodes
  • We need to look now at scalability and other issues on O(10,000-100,000) processors
  • We may need to look at alternatives/additions to the existing programming model (serial language + MPI)

SLIDE 44

New Opportunities

Computational Science is evolving very rapidly. Hardware is moving rapidly towards the Petascale:

  – extreme scalability is required, to 10k-100k processors
  – clusters of multi-core SMP nodes

Scientific demands are also changing

  – multi-scale
  – multi-disciplinary

We need to deliver on the evolving aspirations of the community across a broad spectrum of scientific and engineering disciplines

SLIDE 45

The Hartree Centre

  • Strategic science themes incl. energy, biomedicine, environment, functional materials
  • 10,000 sq ft machine room
  • 10 MW power
  • £10M systems / two-year cycle

The Hartree Centre will be a new kind of Computational Sciences institute for the UK that will:

– stimulate a step change in modelling capabilities for strategic science themes: grand challenge projects – multi-disciplinary, multi-scale, effective and efficient simulation

– have at its heart the collaborative development, support and exploitation of scientific applications software – this is the key to real scientific and economic impact and will be Hartree’s essential driver.

April 2010

SLIDE 46

Mike Ashworth

If you have been … thank you for listening

http://www.cse.scitech.ac.uk/
