SLIDE 1

Application Acceleration on Current and Future Cray Platforms

Alice Koniges, NERSC, Berkeley Lab; David Eder, Lawrence Livermore National Laboratory (speakers)
Robert Preissl, Jihan Kim (NERSC LBL), Aaron Fisher, Nathan Masters, Velimir Mlaker (LLNL), Stephan Ethier, Weixing Wang (PPPL), Martin Head-Gordon (UC Berkeley), Nathan Wichmann (CRAY Inc.)
CRAY User Group Meeting, May 2010

SLIDE 2

Various means of application speedup are described for 3 different codes

  • GTS – magnetic fusion particle-in-cell code
    – Already optimized and hybrid (MPI + OpenMP)
    – Consider advanced hybrid techniques to overlap communication and computation
  • QChem – computational chemistry
    – Optimization for GPUs and accelerators
  • ALE-AMR – hydro/materials/radiation
    – Multiphysics code with MPI-everywhere model
    – Library speedup
    – Is the code appropriate for hybrid?
    – Experiences with automatic parallelization tools

SLIDE 3

GTS is a massively parallel magnetic fusion application

  • Gyrokinetic Tokamak Simulation (GTS) code
  • Global 3D Particle-In-Cell (PIC) code to study microturbulence and transport in magnetically confined fusion plasmas of tokamaks
  • Microturbulence: a very complex, nonlinear phenomenon; key in determining instabilities of magnetically confined plasmas
  • GTS: highly optimized Fortran90 (+C) code
  • Massively parallel hybrid parallelization (MPI+OpenMP): tested on today's largest computers (Earth Simulator, IBM BG/L, Cray XT)

SLIDE 4

PIC: follow trajectories of charged particles in electromagnetic fields

  • Scatter: computation of the charge density at each grid point arising from neighboring particles
  • Poisson's equation for computing the field potential (solved on a 2D poloidal plane)
  • Gather: calculate the force on each particle from the electric potential
  • Push: move particles in time according to the equations of motion
  • Repeat
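The cycle above can be written as a short time loop. Below is a minimal, self-contained toy version in C (1D electrostatic, nearest-grid-point deposition, periodic boundaries); all sizes, normalizations, and the simplified 1D field solve are illustrative assumptions and not the GTS implementation:

    /* Toy 1D electrostatic PIC cycle illustrating the scatter / field-solve /
     * gather / push steps listed above.  Illustrative only, not GTS.        */
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    #define NG 64        /* grid points            */
    #define NP 4096      /* particles              */
    #define DT 0.1       /* time step              */
    #define L  64.0      /* periodic domain length */

    int main(void)
    {
        static double x[NP], v[NP], rho[NG], efield[NG];

        for (int i = 0; i < NP; ++i) {               /* initial conditions */
            x[i] = L * rand() / (double)RAND_MAX;
            v[i] = (i % 2 ? 1.0 : -1.0) * 0.1;
        }

        for (int step = 0; step < 100; ++step) {
            /* Scatter: deposit particle charge onto the nearest grid point */
            for (int g = 0; g < NG; ++g) rho[g] = 0.0;
            for (int i = 0; i < NP; ++i)
                rho[(int)(x[i] * NG / L) % NG] += 1.0 / NP;

            /* Field solve: crude 1D integration standing in for the
             * 2D poloidal-plane Poisson solve used in GTS              */
            double e = 0.0;
            for (int g = 0; g < NG; ++g) {
                e += rho[g] - 1.0 / NG;
                efield[g] = e;
            }

            /* Gather + push: interpolate the field to each particle and
             * advance it according to the equations of motion           */
            for (int i = 0; i < NP; ++i) {
                int g = (int)(x[i] * NG / L) % NG;
                v[i] += DT * efield[g];
                x[i] = fmod(x[i] + DT * v[i] + L, L);
            }
        }
        printf("done: %d particles on %d grid points\n", NP, NG);
        return 0;
    }

(In GTS a fifth step, shifting particles between toroidal domains, follows each push; it is discussed on the next slides.)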
SLIDE 5

The Parallel Model of GTS has three independent levels

  • One-dimensional (1D) domain decomposition in the toroidal direction. The 5th PIC step shifts particles between toroidal domains (MPI; limited to 128 planes); particles can shift to adjacent or even to more distant toroidal domains
  • Divide particles between MPI processes within a toroidal domain: each process keeps a copy of the local grid, requiring the processes within a domain to sum their contributions to the total grid charge density
  • OpenMP compiler directives applied to heavily used loop regions exploit shared-memory capabilities

[Figure: toroidal domains with particles divided among processes P0, P1, P2]
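As an illustration of the first two levels, here is a minimal C sketch of splitting MPI_COMM_WORLD into per-plane communicators and summing the per-process charge deposition within each plane; the domain assignment (rank % NDOMAINS), grid size, and placeholder deposition are illustrative assumptions, not the GTS setup:

    /* Split MPI_COMM_WORLD into toroidal-domain communicators so that the
     * processes sharing one poloidal plane can sum their grid charge density. */
    #include <mpi.h>

    #define NDOMAINS 128      /* toroidal planes (limited to 128 in GTS)   */
    #define NGRID    4096     /* local poloidal-grid points (illustrative) */

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Level 1: 1D toroidal decomposition -> which plane this process owns */
        int domain = rank % NDOMAINS;

        /* Level 2: all processes of one plane share a communicator and
         * reduce their partial charge deposition onto the common grid.       */
        MPI_Comm plane_comm;
        MPI_Comm_split(MPI_COMM_WORLD, domain, rank, &plane_comm);

        static double rho_local[NGRID], rho_total[NGRID];
        rho_local[0] = rank;                      /* placeholder deposition   */
        MPI_Allreduce(rho_local, rho_total, NGRID, MPI_DOUBLE, MPI_SUM,
                      plane_comm);

        /* Level 3 (not shown): OpenMP directives on the heavy particle loops */

        MPI_Comm_free(&plane_comm);
        MPI_Finalize();
        return 0;
    }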

SLIDE 6

Two different hybrid models in GTS: using traditional OpenMP worksharing constructs and OpenMP tasks

OpenMP tasks enable us to overlap MPI communication with independent computation, so the overall runtime can be reduced by the cost of the MPI communication.
SLIDE 7

Overlapping communication with computation in the GTS shift routine, exploiting data-independent code sections

Work on the particle array (packing for sending, reordering, adding after sending) can be overlapped with data-independent MPI communication using OpenMP tasks.

[Figure: GTS shift routine with the data-independent code sections highlighted]

SLIDE 8

Reducing the limitations of single-threaded execution (MPI communication) can be achieved with OpenMP tasks

Overlapping MPI_Allreduce with particle work

Overlap: the master thread (in an !$omp master region) encounters tasking statements and creates work for the thread team for deferred execution, while the MPI_Allreduce call is executed immediately. The MPI implementation has to support at least MPI_THREAD_FUNNELED. Subdividing the tasks into smaller chunks allows better load balancing and scalability among threads.
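A minimal C sketch of this pattern, assuming an MPI library that provides MPI_THREAD_FUNNELED; the particle array, chunking, and placeholder work are illustrative and do not reproduce the actual GTS shifter code:

    /* Overlap a blocking MPI_Allreduce (issued by the master thread) with
     * independent particle work expressed as OpenMP tasks.  Sketch only.  */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    #define NP     100000
    #define NCHUNK 16            /* subdivide the work for load balancing */

    static double particles[NP];

    int main(int argc, char **argv)
    {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = rank + 1.0, global = 0.0;

        #pragma omp parallel
        {
            #pragma omp master
            {
                /* Deferred tasks for the data-independent particle work */
                for (int c = 0; c < NCHUNK; ++c) {
                    #pragma omp task firstprivate(c)
                    {
                        int lo = c * (NP / NCHUNK), hi = lo + NP / NCHUNK;
                        for (int i = lo; i < hi; ++i)
                            particles[i] += 1.0;       /* placeholder work */
                    }
                }
                /* Only the master thread touches MPI (MPI_THREAD_FUNNELED);
                 * the other threads execute the queued tasks meanwhile.    */
                MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                              MPI_COMM_WORLD);
            }
            /* implicit barrier: all tasks have finished at this point */
        }

        if (rank == 0) printf("global sum = %f\n", global);
        MPI_Finalize();
        return 0;
    }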

SLIDE 9

Further communication overlaps can be achieved with OpenMP tasks by exploiting data-independent code regions

[Figure panels: overlapping particle reordering; overlapping the remaining MPI_Sendrecv]

The reordering of the remaining particles, the adding of sent particles into the array, and the sending or receiving of shifted particles can be executed independently.
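The same pattern applies to the point-to-point exchange: the master thread performs the MPI_Sendrecv while a task handles the local reordering. A compact C sketch, again assuming MPI_THREAD_FUNNELED; the buffer names, sizes, and the single compaction pass that stands in for the real reordering are illustrative:

    /* Overlap the MPI_Sendrecv of shifted particles (master thread) with
     * reordering of the particles that stay local (OpenMP task).  Sketch. */
    #include <mpi.h>
    #include <omp.h>

    #define NSHIFT 1024
    #define NLOCAL 8192

    void shift_overlap(MPI_Comm toroidal_comm, int right, int left,
                       double *sendbuf, double *recvbuf, double *local)
    {
        #pragma omp parallel
        {
            #pragma omp master
            {
                #pragma omp task                 /* independent local work */
                {
                    /* one compaction pass as a placeholder for reordering */
                    for (int i = 1; i < NLOCAL; ++i)
                        if (local[i] < local[i - 1]) {
                            double t = local[i];
                            local[i] = local[i - 1];
                            local[i - 1] = t;
                        }
                }
                /* master exchanges shifted particles with its neighbours */
                MPI_Sendrecv(sendbuf, NSHIFT, MPI_DOUBLE, right, 0,
                             recvbuf, NSHIFT, MPI_DOUBLE, left,  0,
                             toroidal_comm, MPI_STATUS_IGNORE);
            }
        }   /* implicit barrier: reordering task has completed here */
        /* the received particles can now be appended to the local array */
    }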

SLIDE 10

OpenMP tasking version outperforms the original shifter, especially in larger poloidal domains

[Figure: wall-time (sec) breakdown of the GTS shift routine, tasking vs. original version, for two run configurations]

Performance breakdown of the GTS shifter routine using 4 OpenMP threads per MPI process, with varying domain decomposition and particles per cell, on Franklin (Cray XT4). MPI communication in the shift phase uses a toroidal MPI communicator (constant size 128). Note, however, the performance differences between the 256 MPI-process run and the 2048 MPI-process run. The speed-up is expected to be higher on larger GTS runs with hundreds of thousands of CPUs, since MPI communication is more expensive there.

SLIDE 11

Early experiments that overlap communication with communication are promising for future HPC systems

[Figure: time (sec) of the tasking vs. original versions for 256, 512, and 1024 MPI processes with varying numbers of OpenMP threads per MPI process]

  • Overlapping MPI communication with other consecutive, data-independent MPI communication
  • Here: iterative execution of two consecutive MPI_Allreduce calls with small and larger messages on Hopper (Cray XT5); see the sketch after this list
  • GTS shifter or pusher routines have such consecutive MPI communication
  • Overlapping the MPI_Allreduce with larger messages (~1K bytes) pays off when the ratio of threads/sockets per node is reasonable
  • Future HPC systems are expected to have many communication channels per node
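A minimal C sketch of such communication/communication overlap, under the assumption that the MPI library supports MPI_THREAD_MULTIPLE (a stronger requirement than the MPI_THREAD_FUNNELED used earlier); the message sizes and duplicated communicators are illustrative:

    /* Issue two independent, consecutive MPI_Allreduce operations from two
     * different OpenMP threads so they can proceed concurrently.  Each one
     * uses its own communicator; requires MPI_THREAD_MULTIPLE.  Sketch.    */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    #define NSMALL 8
    #define NLARGE 128               /* ~1 KB of doubles */

    int main(int argc, char **argv)
    {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (provided < MPI_THREAD_MULTIPLE) {
            if (rank == 0) printf("MPI_THREAD_MULTIPLE not available\n");
            MPI_Finalize();
            return 1;
        }

        /* Separate communicators so the concurrent collectives can never
         * be matched against each other.                                  */
        MPI_Comm comm_small, comm_large;
        MPI_Comm_dup(MPI_COMM_WORLD, &comm_small);
        MPI_Comm_dup(MPI_COMM_WORLD, &comm_large);

        double in_s[NSMALL] = {0}, out_s[NSMALL];
        double in_l[NLARGE] = {0}, out_l[NLARGE];
        in_s[0] = in_l[0] = rank;

        #pragma omp parallel num_threads(2)
        {
            if (omp_get_thread_num() == 0)
                MPI_Allreduce(in_s, out_s, NSMALL, MPI_DOUBLE, MPI_SUM,
                              comm_small);
            else
                MPI_Allreduce(in_l, out_l, NLARGE, MPI_DOUBLE, MPI_SUM,
                              comm_large);
        }

        if (rank == 0) printf("sums: %f %f\n", out_s[0], out_l[0]);
        MPI_Comm_free(&comm_small);
        MPI_Comm_free(&comm_large);
        MPI_Finalize();
        return 0;
    }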
SLIDE 12

Reducing the overhead of single-threaded execution is essential for massively parallel (hybrid) codes

  • The overhead of MPI communication increases when scaling applications to large numbers of MPI processes (collective MPI communication)
  • Adding OpenMP compiler directives to heavily used loops can exploit the shared-memory capabilities
  • Overlapping MPI communication with independent computation via the new OpenMP tasking model makes use of otherwise idle cores
  • Overlapping MPI communication with independent, consecutive MPI communication might be another way to reduce MPI overhead, especially on future HPC systems with many communication channels per node

SLIDE 13

Q-Chem: Computational chemistry can accurately model molecular structures

  • Q-Chem: used to model carbon capture (i.e. the reactivity of CO2 with other materials)
  • Quantum calculations: accurately predict molecular equilibrium structures (used as input to classical molecular dynamics/Monte Carlo simulations)
  • RI-MP2: resolution-of-the-identity second-order Moller-Plesset perturbation theory
    – Treat the correlation energy with 2nd-order Moller-Plesset theory
    – Utilize auxiliary basis sets to approximate atomic orbital densities
    – Strengths: no self-interaction problem (a weakness of DFT), recovers 80-90% of the correlation energy
    – Weakness: fifth-order computational dependence on system size (expensive)
  • Goal: accelerate the RI-MP2 method in Q-Chem
  • Q-Chem RI-MP2 requirements: quadratic memory, cubic storage, quartic I/O, quintic computation

SLIDE 14

Dominant computational steps are fifth-order RI-MP2 routines

  • The RI-MP2 routine is largely divided into seven major steps
  • Test input molecules: glycine-n
  • As the system size increases, step 4 becomes the dominant wall time (e.g. for glycine-16, 83% of the total wall time is spent in step 4)
  • Reason: step 4 contains three quintic computation routines (BLAS3 matrix multiplications) and a quartic I/O read
  • Goal: optimize step 4

[Figure: wall time in seconds per step, measured on the Greta cluster (M. Head-Gordon): AMD quad-core Opterons]

SLIDE 15

The GPU and the CPU are significantly different

  • GPU: graphics processing unit
  • GPU: more transistors devoted to data computation (CPU: cache, loop control)
  • Interest in high-performance computing
  • Use CUDA (Compute Unified Device Architecture): parallel architecture developed by NVIDIA
  • Step 4: CUDA matrix-matrix multiplications (~75 GFLOPS on Tesla, ~225 GFLOPS on Fermi for double precision)
  • Concurrently execute CPU and GPU routines
SLIDE 16

Step 4 CPU algorithm:      T_tot ≈ T_load + T_mm1 + T_mm2 + T_mm3 + T_rest
Step 4 CPU+GPU algorithm:  T_tot ≈ max(T_load, T_mm1) + T_mm3 + max(T_mm2, T_copy) + T_rest

CPU and GPU can work together to produce a fast algorithm
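A minimal host-side C sketch of this kind of CPU/GPU overlap using a CUDA stream and cuBLAS: the GPU runs a DGEMM asynchronously while the CPU loads the next data block, in the spirit of the max(T_load, T_mm1) term above. The matrix sizes, the placeholder disk-load routine, and the single-GEMM structure are illustrative assumptions, not the actual Q-Chem step-4 kernels:

    /* Overlap a host-side data load with a cuBLAS DGEMM on the GPU. Sketch. */
    #include <cuda_runtime.h>
    #include <cublas_v2.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 2048                        /* illustrative matrix dimension */

    /* placeholder for the quartic I/O read performed on the CPU */
    static void load_next_block_from_disk(double *buf, size_t n)
    {
        for (size_t i = 0; i < n; ++i) buf[i] = (double)i;
    }

    int main(void)
    {
        size_t bytes = (size_t)N * N * sizeof(double);
        double *hA = (double *)malloc(bytes), *hNext = (double *)malloc(bytes);
        double *dA, *dB, *dC;
        cudaMalloc((void **)&dA, bytes);
        cudaMalloc((void **)&dB, bytes);
        cudaMalloc((void **)&dC, bytes);

        load_next_block_from_disk(hA, (size_t)N * N);
        cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hA, bytes, cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        cublasSetStream(handle, stream);

        const double alpha = 1.0, beta = 0.0;
        /* launch the matrix-matrix multiply asynchronously on the GPU */
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                    &alpha, dA, N, dB, N, &beta, dC, N);

        /* while the GPU computes, the CPU loads the next block from disk */
        load_next_block_from_disk(hNext, (size_t)N * N);

        cudaStreamSynchronize(stream);     /* wait for the GEMM to finish */
        printf("GEMM and load overlapped\n");

        cublasDestroy(handle);
        cudaStreamDestroy(stream);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        free(hA); free(hNext);
        return 0;
    }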

SLIDE 17

I/O bottleneck is a concern for the accelerated RI-MP2 code

  • Tesla/Turing (TnT): NERSC GPU testbed
    – Sun SunFire x4600 servers
    – AMD quad-core processors ("Shanghai"), 4 NVIDIA FX-5800 Quadro GPUs (4 GB memory)
    – CUDA 2.3, gcc 4.4.2, ACML 4.3.0
  • Franklin: NERSC
    – Cray XT4 system
    – 2.3 GHz AMD Opteron quad-core

RI-MP2 wall time (seconds):

  Franklin                 4945
  TnT (CPU)                6542
  TnT (GPU)                1405
  TnT (GPU, better I/O)    600 to 800 (?)

4.7x improvement, more so for better I/O systems

SLIDE 18

ALE-AMR: hydro/materials/radiation code with MPI parallelization

  • Multi-physics code using operator splitting
    – ALE – Arbitrary Lagrangian Eulerian
    – AMR – Adaptive Mesh Refinement
    – Material interface reconstruction
    – Anisotropic stress tensor
    – Material strength/failure with history
    – Thermal conduction, radiation diffusion
    – Laser ray trace and ion deposition
  • Code used to model targets at various high-energy experimental facilities, including the National Ignition Facility (the world's largest laser)

SLIDE 19

Simulations can include hot radiating plasmas and cold fragmenting solids

SLIDE 20

ALE-AMR diffusion-based models use solvers for an implicit solve

  • Energy transport in NIF ALE-AMR is based on diffusion approximations

[Equations on slide: diffusion equation; heat conduction; radiation diffusion]
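Both models are instances of the generic diffusion form (a standard relation, stated here for reference rather than reproduced from the slide), with u the temperature or radiation energy density, D the corresponding conduction or diffusion coefficient, and S a source term:

    \frac{\partial u}{\partial t} = \nabla \cdot ( D \, \nabla u ) + S

An implicit (e.g. backward-Euler) time discretization of this equation produces the large sparse linear systems that the solver libraries discussed on the following slides must handle.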

SLIDE 21

The AMR code uses finite elements and a solver for diffusion

[Figure: level representation vs. composite representation; special transition elements and basis functions]

SLIDE 22

Some 3D simulations have unreasonable performance

  # of CPUs utilized   27x27x27 AMR mesh runtime (s)   81x81x81 AMR mesh runtime (s)
  1                    21                               73
  2                    15                              420
  4                     9                              816
  8                     7                              960

  • Simulated a point explosion on a 2-level AMR grid

SLIDE 23

Open|SpeedShop used to understand the performance degradation

  • Instruments an executable to collect sample data on code execution paths
  • Provides insight into how the code is executing
  • The hot call path feature is very useful for getting a sense of the code bottlenecks

SLIDE 24

The hot call path feature shows that the bottleneck for the degraded performance is in HYPRE

SLIDE 25

Whereas the bottleneck for normal performance is in the Jacobian computation

SLIDE 26

Sparsity plots of the HYPRE matrix and preconditioner shed light on the issue

SLIDE 27

HYPRE/Euclid Defaults are Unsuitable for Our AMR Grids

  • Too much non-zero fill is being allowed
  • Fortunately, changing the fill behavior for Euclid is easy
  • Added "level 0" to the Euclid parameters file (see the sketch below)
  • This parameter turns off fill entirely
  • May not be optimal, but should fix the degraded performance issue
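For illustration, the same ILU(k) fill level can also be set programmatically through HYPRE's C interface; this is a hedged sketch of that route (the change described above was made via the Euclid parameters file), assuming a HYPRE version that provides HYPRE_EuclidSetLevel:

    /* Create a Euclid (parallel ILU) preconditioner with fill turned off,
     * i.e. ILU(0), mirroring the "level 0" parameter-file change.  Sketch. */
    #include <mpi.h>
    #include <HYPRE_parcsr_ls.h>

    void setup_euclid_ilu0(MPI_Comm comm, HYPRE_Solver *precond)
    {
        HYPRE_EuclidCreate(comm, precond);
        HYPRE_EuclidSetLevel(*precond, 0);  /* ILU(0): no extra non-zero fill */
        /* the preconditioner is then attached to the Krylov solver, e.g.
           HYPRE_ParCSRPCGSetPrecond(solver, HYPRE_EuclidSolve,
                                     HYPRE_EuclidSetup, *precond);            */
    }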

SLIDE 28

Euclid Parameter Change Leads to Much Improved Behavior

  • Re-ran the point explosion simulation with Euclid fill turned off

  # of CPUs utilized   Runtime with default Euclid fill 'level 1' (s)   Runtime with updated Euclid fill 'level 0' (s)
  1                     73                                                67
  2                    420                                                43
  4                    816                                                28
  8                    960                                                23

SLIDE 29

Useful to study the effect of the same number of MPI tasks on different numbers of cores

[Figure: memory and runtime for a fixed number of MPI tasks run on 16, 32, and 64 cores]

  • Speedup even with the additional inter-node communication shows promise for hybrid
    – For the 32- and 64-core cases, the idle cores could be used by adding OpenMP to give additional speedup

SLIDE 30

ALE-AMR has good potential for hybrid, including in the SAMRAI library

  • SAMRAI provides patch-level parallelism
  • The physics steps loop inside a patch
  • OpenMP compilers can parallelize C-style loops but not the template iterators used by SAMRAI
  • However, autoPar in the ROSE compiler has the potential to deal with C++ constructs
  • In some cases, modifying the index space can reduce the number of synchronizing barriers

SLIDE 31

We have investigated using autoPar to parallelize code with OpenMP

  • ROSE's automatic parallelization tool autoPar
    – translates source code to use OpenMP pragmas
    – used project-wide in place of the compiler

incr.C (input):

    int i, j;
    int a[100][100];
    for (i = 0; i < 100; i++) {
      for (j = 0; j < 100; j++) {
        a[i][j] = a[i][j] + 1;
      }
    }

rose_incr.C (autoPar output):

    #include "omp.h"
    int i, j;
    int a[100UL][100UL];
    #pragma omp parallel for private (i,j)
    for (i = 0; i <= 100 - 1; i++) {
      #pragma omp parallel for private (j)
      for (j = 0; j <= 100 - 1; j++) {
        a[i][j] = a[i][j] + 1;
      }
    }

Makefile change (.mk) so that the translate/compile/link steps produce the .exe through autoPar:

    ...
    #CXX = g++
    CXX = autoPar
    ...

SLIDE 32

The complexity of the code affects the ability of an auto tool to be effective

  • ALE-AMR uses SAMRAI for AMR, load balancing, and MPI
  • Size suggests improvement opportunities in either codebase
  • Start with autoPar on SAMRAI: general-use code, possibly more standardized
  • Next try ALE-AMR: less code, and better known
SLIDE 33

Summary

  • Application speedup for both current and future architectures is complicated
  • Overlapping communication and computation in GTS shows promise (advanced hybrid)
  • GPUs allow fast matrix-matrix multiplication in Q-Chem (next-generation architectures)
  • Profiling (Open|SpeedShop) of ALE-AMR is critical; extreme care must be used in library usage
  • Analysis is necessary to add hybrid to existing codes
  • Auto-parallelization tools show promise