Debugging Large Scale and Hybrid Parallel Code

SLIDE 1

www.allinea.com

Debugging Large Scale and Hybrid Parallel Code

Mark O'Connor mark@allinea.com Lead Developer

SLIDE 2

Didn't ask for parallelism

SLIDE 3

... but got parallelism

SLIDE 4

Interesting times ...

A recent history of parallel computing

– Increasing core counts
– Increasing cores per node
– Established programming models

Near-100% of HPC software using MPI and/or OpenMP
A close match of software to hardware - good portability

– The challenge of application scalability remains!

Times change ...

– GPUs entering HPC – power, performance, …
– Massive multi-core clusters – with many GPUs
– The challenge of hybrid software is here!

SLIDE 5

Exploiting technology

“Software has become the #1 roadblock … Many applications will need a major redesign” - IDC HPC Update, June 2010

– Most ISV codes do not scale
– High programming costs are delaying GPU usage

Development tools are a vital part of the solution

SLIDE 6

Allinea Software

UK based HPC tools company since 2001

– Allinea DDT – the scalable parallel debugger
– Allinea OPT – the optimization tool for MPI and non-MPI
– Allinea DDTLite – the parallel debugging plugin for Microsoft Visual Studio

Large European and US customer base

– Ease of use – means tools get used
– Users debugging regularly at all scales – at 1 or 100,000 cores
– World's only Petascale debugger!

SLIDE 7

Some Clients and Partners

Academic

– Over 200 universities

Major research centres

– ANL, EPCC, IDRIS, Juelich, NERSC, ORNL,

Aviation and Defence

– Airbus, AWE, Dassault, DLR, EADS, ...

Energy

– CEA, CGG Veritas, IFP, Total, ...

EDA

– Cadence, Intel, Synopsys, ...

Climate and Weather

– UK Met Office, Meteo France, NOAA ...

SLIDE 8

Background

Debugging: Good (aka A Necessary Evil)

– Reproducing and fixing software problems

Complexity of scaling and GPU architecture will introduce bugs

– Debuggers interactively examine processes and data

Fastest way to debug – with less chance of introducing more bugs

– Bugs at scale need a debugger at scale

… until recently debuggers limited to ~4,000-8,000 cores

– Bugs on GPUs need a debugger for GPUs

… until recently GPU software couldn't be debugged

– Allinea DDT is the first graphical debugger to do both

SLIDE 9

DDT in a nutshell

Scalar features

– Advanced C++ and STL
– Fortran 90, 95 and 2003: modules, allocatable data, pointers, derived types
– Memory debugging

Multithreading & OpenMP features

– Step, breakpoint etc. one or all threads

MPI features

– Easy to manage groups
– Control processes by groups
– Compare data
– Visualize message queues

SLIDE 10

Memory Debugging

Find memory leaks
Or stop on read/write beyond end of array

SLIDE 11

Scale Matters

Not just a rich man's problem

– Used to be exclusive to big labs
– Everyone is joining the fun
– If you can't debug problems at scale, you can't fix them

Historic debugger limits

– Linear or worse performance in #cores
– Maximum size limited by patience (or desperation)

[Figure: Systems in Top 500 by year (June & November lists), 2006 to 2010; series: 8k - 32k cores, 32k+ cores]

SLIDE 12

DDT: Petascale Debugging

DDT delivers petascale debugging today

– Collaborations with ORNL on Jaguar Cray XT and CEA
– Tree architecture – logarithmic performance
– Now faster at 220,000 than previously at 1,000 cores
– ~1/10th of a second to step and gather all stacks at 220,000 cores
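
The "tree architecture – logarithmic performance" point can be illustrated with a small sketch. It is not DDT's implementation: the fan-out of 32 and the function names are assumptions made for illustration only. It shows why gathering results through a tree takes a number of levels that grows with the logarithm of the process count, where a flat gather has one front end talking to every process.

```python
from typing import Callable, Iterable, List, Tuple


def tree_depth(num_processes: int, fan_out: int) -> int:
    """Levels a fan_out-ary tree needs to reach num_processes leaves."""
    if num_processes < 1 or fan_out < 2:
        raise ValueError("need num_processes >= 1 and fan_out >= 2")
    depth, reach = 0, 1
    while reach < num_processes:
        reach *= fan_out
        depth += 1
    return depth


def tree_gather(values: Iterable[int], fan_out: int,
                combine: Callable[[List[int]], int]) -> Tuple[int, int]:
    """Combine values level by level; return (result, levels used)."""
    level = list(values)
    levels = 0
    while len(level) > 1:
        # Each parent combines the results of up to fan_out children.
        level = [combine(level[i:i + fan_out])
                 for i in range(0, len(level), fan_out)]
        levels += 1
    return level[0], levels


if __name__ == "__main__":
    for n in (1_000, 220_000):
        print(n, "processes ->", tree_depth(n, fan_out=32), "levels")
```

With the assumed fan-out of 32, 1,000 processes need 2 levels and 220,000 need 4, so a 220-fold increase in processes only doubles the depth.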

[Figure: DDT 3.0 Performance Figures, Jaguar XT5; time (seconds) against MPI processes; series: All Step, All Breakpoint]

SLIDE 13

Scalable Process Control

Parallel Stack View

– Finds rogue processes faster
– Identifies classes of process behaviour
– Allows rapid grouping of processes

Control Processes by Groups

– Set breakpoints, step, play, stop etc. using user-defined groups
– Mutates to scalable groups view
– Compact group representations
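
The idea behind a parallel stack view can be sketched as merging every process's call stack into one tree and recording which ranks sit at each node. This is only an illustration under assumed names: the functions and the frame name `exchange_halo` are hypothetical, and nothing here reflects how DDT stores or gathers stacks.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Path = Tuple[str, ...]


def merge_stacks(stacks: Dict[int, List[str]]) -> Dict[Path, List[int]]:
    """Map each call-path prefix to the ranks whose stack passes through it.

    stacks maps rank -> frame names, outermost frame first.
    """
    tree: Dict[Path, List[int]] = defaultdict(list)
    for rank in sorted(stacks):
        frames = stacks[rank]
        for depth in range(1, len(frames) + 1):
            tree[tuple(frames[:depth])].append(rank)
    return dict(tree)


def leaf_groups(stacks: Dict[int, List[str]]) -> Dict[Path, List[int]]:
    """Group ranks that share exactly the same stack."""
    groups: Dict[Path, List[int]] = defaultdict(list)
    for rank in sorted(stacks):
        groups[tuple(stacks[rank])].append(rank)
    return dict(groups)


def smallest_group(stacks: Dict[int, List[str]]) -> Tuple[Path, List[int]]:
    """The least populated stack is a natural first suspect."""
    return min(leaf_groups(stacks).items(), key=lambda item: len(item[1]))


if __name__ == "__main__":
    common = ["main", "solve", "MPI_Allreduce"]
    stacks = {0: common, 1: common, 2: common,
              3: ["main", "solve", "exchange_halo", "MPI_Recv"]}
    path, ranks = smallest_group(stacks)
    print(" > ".join(path), "ranks", ranks)
```

In the example, three ranks wait in a collective while rank 3 is still in a receive, so the odd one out is visible immediately, and each distinct stack is also a ready-made process group.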

SLIDE 14

Presenting Data, Usefully

Gather from every node

– Potentially costly – if all data different
– Easy if data mostly same
– New ideas

Aggregated statistics
Probabilistic algorithms
Optimize performance – even in pathological case
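
A minimal sketch of why "mostly same" data is cheap to present: identical values collapse into one entry, and the ranks holding each value compress into ranges. The function names are hypothetical and this is an illustration of the idea, not DDT's algorithm. In the pathological case where every value differs, the summary is as large as the input, which is where aggregated statistics help.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

Range = Tuple[int, int]


def rank_ranges(ranks: Iterable[int]) -> List[Range]:
    """Compress ranks into inclusive (first, last) ranges."""
    ranges: List[Range] = []
    for r in sorted(ranks):
        if ranges and r == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], r)  # extend the current run
        else:
            ranges.append((r, r))  # start a new run
    return ranges


def summarize(values_by_rank: Dict[int, Hashable]) -> Dict[Hashable, List[Range]]:
    """Map each distinct value to the rank ranges that hold it."""
    by_value: Dict[Hashable, List[int]] = {}
    for rank, value in values_by_rank.items():
        by_value.setdefault(value, []).append(rank)
    return {value: rank_ranges(ranks) for value, ranks in by_value.items()}


def statistics(values_by_rank: Dict[int, float]) -> Dict[str, float]:
    """Fixed-size summary that stays small even if all values differ."""
    values = list(values_by_rank.values())
    return {"min": min(values), "max": max(values),
            "mean": sum(values) / len(values)}


if __name__ == "__main__":
    values = {rank: 100 for rank in range(1000)}
    values[417] = 99  # one process disagrees
    print(summarize(values))
    print(statistics(values))
```

Here 1,000 values reduce to two entries, and the single disagreeing rank stands out.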

Watch this space!

– With a fast and scalable architecture, new things become possible

SLIDE 15

The new hotness

Hybrids are today's hottest topic

– Technology is moving quickly – compilers, SDKs, hardware
– NVIDIA CUDA leads in tool support

Many lines of code need rewriting for GPUs

– Memory hierarchy
– Explicit data transfer between host and accelerator
– Unusual execution model

Kernels, thread blocks, warps, synchronization points

– Massively fine-grained parallel model
– It works: 1 billion keys / second on a single GTX480

Inevitable that we need to debug!

SLIDE 16

Debugging Options

Old world “printf”

– NVIDIA SDK now allows this (new) – but has limitations

Fake it – run the kernel on the host x86_64 processor

– Languages often support targeting host CPU instead of GPU
– Different numeric precision – different answer?
– Different scheduling – different answer?
– A reasonable option for some bugs
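
The precision concern can be made concrete with a small sketch. It assumes the difference in question is single versus double precision and emulates single precision on the host by rounding after every operation; real host/GPU differences depend on the hardware and compiler and may be smaller or of a different kind. The function names are hypothetical.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(step: float, count: int, single_precision: bool) -> float:
    """Add step to a running total count times in the chosen precision."""
    if single_precision:
        step = to_float32(step)
    total = 0.0
    for _ in range(count):
        total += step
        if single_precision:
            # Emulate a float32 accumulator by rounding after each add.
            total = to_float32(total)
    return total


if __name__ == "__main__":
    single = accumulate(0.1, 100_000, single_precision=True)
    double = accumulate(0.1, 100_000, single_precision=False)
    print("single:", single)
    print("double:", double)
    print("difference:", abs(single - double))
```

The same loop gives visibly different totals in the two precisions, so a result that matches on the host does not prove the GPU run is correct, and a mismatch does not prove a logic bug.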

Or run on the GPU with Allinea DDT...

SLIDE 17

Introducing DDT for CUDA

The first graphical debugger for NVIDIA CUDA

– Simple and easy to use
– As easy as debugging ordinary code

All the commands you'd expect

– Breakpoints
– Stepping warps
– Viewing data and thread stacks

Plus more advanced features

– CUDA memcheck – memory debugging for CUDA

More to come!

SLIDE 18

CUDA Threads in DDT

Run the code

– Browse source
– Set breakpoints
– Stop at a line of CUDA code
– Stops once for each scheduled collection of blocks

Select a CUDA thread

– Examine variables and shared memory
– Step a warp

SLIDE 19

Easy to understand scale

View all threads in parallel stack view

– At one glance, see all GPU and CPU threads together
– Links with thread selection
– Pick a tree node to select one of the CUDA threads at that location

Full MPI support

– See GPU and CPU threads from multiple nodes

SLIDE 20

Some Common Problems

Incorrect logic (if-statements, calculations)

– Loop iteration to GPU thread analogy - threads identified by grid and block indexes
– Solution: Select a thread and step with DDT; look at the local state and shared data

Cherry-pick important threads: start, end, a few interior points
Kernel bounds – getting the right grids and blocks

– Incorrect kernel thread boundaries can lead to incomplete results or crashing of the kernel
– Solution: Bugs will often trigger “CUDA memcheck” errors - run with DDT and CUDA memory debugging enabled
– Solution: Use DDT's advanced multi-dimensional array viewer to look at data and find the missing indexes
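
Both kernel-bounds failure modes can be reproduced with a host-side emulation of a one-dimensional launch. This is a sketch in Python, not CUDA code: the names are hypothetical, and the index arithmetic mirrors the usual `blockIdx.x * blockDim.x + threadIdx.x` mapping from loop iteration to thread. Too few blocks leave elements unprocessed (incomplete results); enough blocks without a bounds check produce indexes past the end of the array (the kind of access a memory checker reports).

```python
from typing import List, Tuple


def blocks_needed(n: int, threads_per_block: int) -> int:
    """Ceiling division: enough blocks to cover all n elements."""
    return (n + threads_per_block - 1) // threads_per_block


def run_kernel(n: int, num_blocks: int,
               threads_per_block: int) -> Tuple[List[int], List[int]]:
    """Emulate a 1D launch in which thread i marks element i as done.

    Returns (missing, out_of_range): element indexes no thread touched,
    and thread indexes that fall past the end of the array.
    """
    done = [False] * n
    out_of_range: List[int] = []
    for block_idx in range(num_blocks):
        for thread_idx in range(threads_per_block):
            i = block_idx * threads_per_block + thread_idx
            if i < n:
                done[i] = True
            else:
                # Without an `if (i < n)` guard in the kernel, this
                # thread would read or write beyond the array.
                out_of_range.append(i)
    missing = [i for i, ok in enumerate(done) if not ok]
    return missing, out_of_range


if __name__ == "__main__":
    n, tpb = 1000, 256
    missing, _ = run_kernel(n, n // tpb, tpb)  # truncating division
    print("too few blocks: first missing index", missing[0],
          "count", len(missing))
    _, extra = run_kernel(n, blocks_needed(n, tpb), tpb)
    print("rounded up: threads needing a guard", len(extra))
```

For 1,000 elements and 256 threads per block, truncating division launches 3 blocks and leaves indexes 768 to 999 untouched, while rounding up launches 4 blocks and creates 24 threads that must be guarded.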

SLIDE 21

Current Limitations

SDK 3.0 was a big leap forward
SDK and driver limitations

– Only one GPU can be debugged per O/S (per physical node)
– Cannot currently read launch failure codes (without breaking your code)
– Only one warp can be stepped per GPU at any time
– Cannot debug GPU part of (attach to) an already running job

Strong partnership with NVIDIA, CAPS and others is helping to extend capabilities

– SDK 3.1 is much better for general computation
– SDK 3.2 adds debug support for multiple GPUs per node

SLIDE 22

Petascale Debugging: Solved. GPU Debugging: Works. Any Questions?