slide-1
SLIDE 1

www.allinea.com

Debugging Large Scale and Hybrid Parallel Code

Mark O'Connor
mark@allinea.com
Lead Developer

slide-2
SLIDE 2

Didn't ask for parallelism

slide-3
SLIDE 3

... but got parallelism

slide-4
SLIDE 4

Interesting times ...

  • A recent history of parallel computing

– Increasing core counts
– Increasing cores per node
– Established programming models

  • Near-100% of HPC software uses MPI and/or OpenMP
  • A close match of software to hardware - good portability

– The challenge of application scalability remains!

  • Times change ...

– GPUs entering HPC: power, performance, …
– Massive multi-core clusters with many GPUs
– The challenge of hybrid software is here!

slide-5
SLIDE 5

Exploiting technology

  • “Software has become the #1 roadblock … Many applications will need a major redesign” - IDC HPC Update, June 2010

– Most ISV codes do not scale
– High programming costs are delaying GPU usage

  • Development tools are a vital part of the solution

slide-6
SLIDE 6

Allinea Software

  • UK based HPC tools company since 2001

– Allinea DDT: the scalable parallel debugger
– Allinea OPT: the optimization tool for MPI and non-MPI
– Allinea DDTLite: the parallel debugging plugin for Microsoft Visual Studio

  • Large European and US customer base

– Ease of use means tools get used
– Users debugging regularly at all scales, at 1 or 100,000 cores
– World's only Petascale debugger!

slide-7
SLIDE 7

Some Clients and Partners

  • Academic

– Over 200 universities

  • Major research centres

– ANL, EPCC, IDRIS, Juelich, NERSC, ORNL,

  • Aviation and Defence

– Airbus, AWE, Dassault, DLR, EADS, ...

  • Energy

– CEA, CGG Veritas, IFP, Total, ...

  • EDA

– Cadence, Intel, Synopsys, ...

  • Climate and Weather

– UK Met Office, Meteo France, NOAA ...

slide-8
SLIDE 8

Background

  • Debugging: Good (aka A Necessary Evil)

– Reproducing and fixing software problems

  • Complexity of scaling and GPU architecture will introduce bugs

– Debuggers interactively examine processes and data

  • Fastest way to debug – with less chance of introducing more bugs

– Bugs at scale need a debugger at scale

  • … until recently debuggers limited to ~4,000-8,000 cores

– Bugs on GPUs need a debugger for GPUs

  • … until recently GPU software couldn't be debugged

– Allinea DDT is the first graphical debugger to do both

slide-9
SLIDE 9

DDT in a nutshell

  • Scalar features

– Advanced C++ and STL
– Fortran 90, 95 and 2003: modules, allocatable data, pointers, derived types
– Memory debugging

  • Multithreading & OpenMP features

– Step, breakpoint etc. on one or all threads

  • MPI features

– Easy to manage groups
– Control processes by groups
– Compare data
– Visualize message queues

slide-10
SLIDE 10

Memory Debugging

  • Find memory leaks
  • Or stop on read/write beyond end of array
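
As a rough illustration of the two checks named above, here is a toy tracker in Python. It is not how DDT works internally: `ToyHeap`, its guard pattern and its method names are all hypothetical. Each allocation is padded with guard bytes so a write beyond the end of the array can be noticed, and any block that is never freed is reported as a leak.

```python
GUARD = b"\xde\xad\xbe\xef"  # hypothetical guard pattern placed after each block


class ToyHeap:
    """Toy allocator that tracks live blocks and guards their upper bound."""

    def __init__(self):
        # handle -> (buffer including guard bytes, payload size, label)
        self._blocks = {}
        self._next = 0

    def alloc(self, size, label):
        handle = self._next
        self._next += 1
        self._blocks[handle] = (bytearray(size) + bytearray(GUARD), size, label)
        return handle

    def write(self, handle, index, value):
        # Deliberately unchecked against the payload size, like a raw C write.
        self._blocks[handle][0][index] = value

    def free(self, handle):
        del self._blocks[handle]

    def overruns(self):
        """Labels of live blocks whose guard bytes were overwritten."""
        return [label for buf, size, label in self._blocks.values()
                if bytes(buf[size:]) != GUARD]

    def leaks(self):
        """Labels of blocks that were allocated but never freed."""
        return [label for _, _, label in self._blocks.values()]


heap = ToyHeap()
a = heap.alloc(8, "a")
b = heap.alloc(8, "b")
heap.write(a, 8, 0xFF)   # one element past the end of an 8-byte array
heap.free(b)
print(heap.overruns())   # blocks with a damaged guard
print(heap.leaks())      # blocks still allocated at exit
```

A guard pattern only detects the overrun after the fact, when the guard is checked; stopping at the exact faulting read or write needs a different mechanism.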
slide-11
SLIDE 11

Scale Matters

  • Not just a rich man's problem

– Used to be exclusive to big labs
– Everyone is joining the fun
– If you can't debug problems at scale, you can't fix them

  • Historic debugger limits

– Linear or worse performance in #cores
– Maximum size limited by patience (or desperation)

[Chart: Systems in Top 500 by year (June & November lists, 2006 to 2010); series: 8k - 32k cores, 32k+ cores]

slide-12
SLIDE 12

DDT: Petascale Debugging

  • DDT delivers petascale debugging today

– Collaborations with ORNL on Jaguar Cray XT and CEA
– Tree architecture: logarithmic performance
– Now faster at 220,000 cores than previously at 1,000 cores
– ~1/10th of a second to step and gather all stacks at 220,000 cores

[Chart: DDT 3.0 Performance Figures, Jaguar XT5; x-axis: MPI Processes (up to ~200,000); y-axis: Time (Seconds); series: All Step, All Breakpoint]
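
Why a tree architecture gives logarithmic behaviour can be shown with a little arithmetic. The Python sketch below is illustrative only: the fan-out of 32 is an assumption (the slides do not state DDT's tree shape), and `tree_depth` and `merge_up` are hypothetical helper names. It counts how many levels a result has to climb when every node combines the answers of at most `fanout` children.

```python
def tree_depth(processes, fanout):
    """Number of tree levels needed to reach `processes` leaves."""
    depth, reach = 0, 1
    while reach < processes:
        reach *= fanout
        depth += 1
    return depth


def merge_up(values, fanout, combine):
    """Combine per-process values level by level.

    Returns (combined result, number of levels used). At each level every
    node handles at most `fanout` inputs, instead of one front end
    handling all of them.
    """
    levels = 0
    while len(values) > 1:
        values = [combine(values[i:i + fanout])
                  for i in range(0, len(values), fanout)]
        levels += 1
    return values[0], levels


FANOUT = 32  # assumed fan-out, for illustration only
for n in (1_000, 220_000):
    print(n, tree_depth(n, FANOUT))

# Gathering the maximum of one value per process through the tree:
print(merge_up(list(range(220_000)), FANOUT, max))
```

With this assumed fan-out, growing from 1,000 to 220,000 processes adds only two levels to the tree, whereas a flat gather would have the front end process 220 times as many messages.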

slide-13
SLIDE 13

Scalable Process Control

  • Parallel Stack View

– Finds rogue processes faster
– Identifies classes of process behaviour
– Allows rapid grouping of processes

  • Control Processes by Groups

– Set breakpoints, step, play, stop etc. using user-defined groups
– Mutates to scalable groups view
– Compact group representations
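
The idea behind a parallel stack view can be sketched as merging every process's call stack into one prefix tree, so that processes doing the same thing collapse into a single branch and a rogue process stands out. The Python below is a minimal sketch under that assumption, not DDT's implementation; `merge_stacks`, `render` and the frame name `update_halo` are hypothetical.

```python
def merge_stacks(stacks):
    """Merge per-process call stacks into a prefix tree.

    `stacks` maps rank -> list of frame names, outermost frame first.
    Each tree node is {"ranks": [...], "children": {frame: node}}.
    """
    root = {"ranks": [], "children": {}}
    for rank in sorted(stacks):
        node = root
        node["ranks"].append(rank)
        for frame in stacks[rank]:
            node = node["children"].setdefault(
                frame, {"ranks": [], "children": {}})
            node["ranks"].append(rank)
    return root


def render(node, depth=0, lines=None):
    """Flatten the tree into indented 'frame (process count)' lines."""
    lines = [] if lines is None else lines
    for frame, child in node["children"].items():
        lines.append("  " * depth + f"{frame} ({len(child['ranks'])})")
        render(child, depth + 1, lines)
    return lines


# Eight processes wait in a barrier, except rank 5, which is elsewhere.
stacks = {rank: ["main", "solve", "MPI_Barrier"] for rank in range(8)}
stacks[5] = ["main", "solve", "update_halo"]

tree = merge_stacks(stacks)
print("\n".join(render(tree)))
```

Each node's `ranks` list is also a ready-made process group: selecting the `update_halo` branch yields exactly the processes that behave differently.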

slide-14
SLIDE 14

Presenting Data, Usefully

  • Gather from every node

– Potentially costly if all data is different
– Easy if data is mostly the same
– New ideas

  • Aggregated statistics
  • Probabilistic algorithms
  • Optimize performance, even in the pathological case

  • Watch this space!

– With a fast and scalable architecture, new things become possible
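
To make the "easy if data is mostly the same" case concrete, here is a small Python sketch of aggregating one variable across many processes. It is an illustration only; `compact_ranks` and `summarize` are hypothetical names and say nothing about DDT's internals. Identical values are grouped, each group is written as a compact rank range, and a few summary statistics are computed alongside.

```python
def compact_ranks(ranks):
    """Render rank numbers as ranges, e.g. [0, 1, 2, 5] -> '0-2,5'."""
    ranks = sorted(ranks)
    parts, start = [], 0
    for i in range(1, len(ranks) + 1):
        # Close the current run at the end of the list or at a gap.
        if i == len(ranks) or ranks[i] != ranks[i - 1] + 1:
            lo, hi = ranks[start], ranks[i - 1]
            parts.append(str(lo) if lo == hi else f"{lo}-{hi}")
            start = i
    return ",".join(parts)


def summarize(values_by_rank):
    """Group ranks by value and compute min/max/mean over all ranks."""
    groups = {}
    for rank, value in values_by_rank.items():
        groups.setdefault(value, []).append(rank)
    values = list(values_by_rank.values())
    return {
        "distinct": {v: compact_ranks(r) for v, r in groups.items()},
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }


# 1,000 processes agree on an iteration count, except one straggler.
values = {rank: 100 for rank in range(1000)}
values[417] = 99
print(summarize(values))
```

When every process holds a different value the `distinct` table grows with the process count, which is the costly case the slide warns about; the summary statistics stay small either way.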

slide-15
SLIDE 15

The new hotness

  • Hybrids are today's hottest topic

– Technology is moving quickly: compilers, SDKs, hardware
– NVIDIA CUDA leads in tool support

  • Many lines of code need rewriting for GPUs

– Memory hierarchy
– Explicit data transfer between host and accelerator
– Unusual execution model

  • Kernels, thread blocks, warps, synchronization points

– Massively fine-grained parallel model
– It works: 1 billion keys / second on a single GTX480

  • Inevitable that we need to debug!
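
The execution model mentioned above (kernels, thread blocks, warps) can be pictured as a fixed decomposition of a flat thread number. The Python sketch below emulates that arithmetic on the host for a one-dimensional launch; `locate` is a hypothetical helper, and the only hardware fact assumed is the warp width of 32 threads used by NVIDIA GPUs.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def locate(global_index, block_size):
    """Map a flat thread index to its place in the hierarchy.

    Returns (block, thread_in_block, warp_in_block, lane_in_warp)
    for a one-dimensional launch with `block_size` threads per block.
    """
    block, thread = divmod(global_index, block_size)
    warp, lane = divmod(thread, WARP_SIZE)
    return block, thread, warp, lane


# Where does thread 1000 live when blocks hold 256 threads?
print(locate(1000, 256))
```

Threads that share a block and warp number execute together, which is why a debugger steps a whole warp rather than a single thread.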
slide-16
SLIDE 16

Debugging Options

  • Old world “printf”

– NVIDIA SDK now allows this (new), but it has limitations

  • Fake it – run the kernel on the host x86_64 processor

– Languages often support targeting host CPU instead of GPU
– Different numeric precision: different answer?
– Different scheduling: different answer?
– A reasonable option for some bugs

  • Or run on the GPU with Allinea DDT...
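
The two "different answer?" questions above are easy to demonstrate with ordinary floating-point arithmetic. The Python sketch below is a host-only illustration, not GPU code; `to_f32` and `accumulate` are hypothetical helpers that emulate single-precision rounding with the standard `struct` module. It shows the same sum computed at two precisions, and the same three numbers added in two different orders.

```python
import struct


def to_f32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(values, rnd=lambda x: x):
    """Sum left to right, applying `rnd` to inputs and after each addition."""
    total = 0.0
    for v in values:
        total = rnd(total + rnd(v))
    return total


values = [0.1] * 10
double_sum = accumulate(values)           # 64-bit accumulation
single_sum = accumulate(values, to_f32)   # emulated 32-bit accumulation
print(double_sum, single_sum)

# Same precision, different evaluation order: addition is not associative.
left_first = (0.1 + 0.2) + 0.3
right_first = 0.1 + (0.2 + 0.3)
print(left_first, right_first, left_first == right_first)
```

Differences of this size are usually harmless, but they mean a host run that "passes" does not prove the GPU run computes the same bits, and a bug that depends on scheduling order may not reproduce on the host at all.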
slide-17
SLIDE 17

Introducing DDT for CUDA

  • The first graphical debugger for NVIDIA CUDA

– Simple and easy to use
– As easy as debugging ordinary code

  • All the commands you'd expect

– Breakpoints
– Stepping warps
– Viewing data and thread stacks

  • Plus more advanced features

– CUDA memcheck: memory debugging for CUDA

  • More to come!
slide-18
SLIDE 18

CUDA Threads in DDT

  • Run the code

– Browse source
– Set breakpoints
– Stop at a line of CUDA code
– Stops once for each scheduled collection of blocks

  • Select a CUDA thread

– Examine variables and shared memory
– Step a warp

slide-19
SLIDE 19

Easy to understand scale

  • View all threads in parallel stack view

– At one glance, see all GPU and CPU threads together
– Links with thread selection
– Pick a tree node to select one of the CUDA threads at that location

  • Full MPI support

– See GPU and CPU threads from multiple nodes

slide-20
SLIDE 20

Some Common Problems

  • Incorrect logic (if-statements, calculations)

– Loop iteration to GPU thread analogy: threads are identified by grid and block indexes
– Solution: select a thread and step with DDT; look at the local state and shared data

  • Cherry-pick important threads: start, end, a few interior points
  • Kernel bounds – getting the right grids and blocks

– Incorrect kernel thread boundaries can lead to incomplete results or a kernel crash
– Solution: bugs will often trigger “CUDA memcheck” errors; run with DDT and CUDA memory debugging enabled
– Solution: use DDT's advanced multi-dimensional array viewer to look at data and find the missing indexes
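
Both kernel-bounds failure modes can be reproduced with a host-side emulation of a one-dimensional launch. The Python sketch below is an illustration under stated assumptions, not CUDA code: `launch` is a hypothetical function that loops over the blocks and threads a real launch would create, using the usual `blockIdx.x * blockDim.x + threadIdx.x` index calculation, and records what a real kernel would silently get wrong.

```python
def launch(n, block_size, grid_size, guard=True):
    """Emulate a 1-D kernel that doubles each of n elements.

    Returns (output, out_of_bounds). `out_of_bounds` lists the flat thread
    indexes that would have touched memory outside the array when the
    kernel has no bounds guard.
    """
    data = list(range(n))
    out = [None] * n
    out_of_bounds = []
    for block in range(grid_size):
        for thread in range(block_size):
            # Equivalent of blockIdx.x * blockDim.x + threadIdx.x
            i = block * block_size + thread
            if i >= n:
                if not guard:
                    out_of_bounds.append(i)  # on a GPU: an invalid access
                continue
            out[i] = 2 * data[i]
    return out, out_of_bounds


n, block_size = 1000, 256
too_few = n // block_size                    # rounds down: 3 blocks
enough = (n + block_size - 1) // block_size  # rounds up: 4 blocks

out, oob = launch(n, block_size, too_few)
print(out.count(None), "elements never computed")

out, oob = launch(n, block_size, enough, guard=False)
print(len(oob), "threads ran past the end of the array")

out, oob = launch(n, block_size, enough)
print(out.count(None), len(oob))
```

The first case is the "incomplete results" bug, visible as a block of untouched values at the end of the array; the second is the kind of access that memory checking reports. Looking at the first thread, the last thread and the threads on either side of index `n` is usually enough to tell the two apart.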

slide-21
SLIDE 21

Current Limitations

  • SDK 3.0 was a big leap forward
  • SDK and driver limitations

– Only one GPU can be debugged per O/S (per physical node)
– Cannot currently read launch failure codes (without breaking your code)
– Only one warp can be stepped per GPU at any time
– Cannot debug the GPU part of (attach to) an already running job

  • Strong partnership with NVIDIA, CAPS and others is helping to extend capabilities

– SDK 3.1 is much better for general computation
– SDK 3.2 adds debug support for multiple GPUs per node

slide-22
SLIDE 22

Petascale Debugging: Solved. GPU Debugging: Works. Any Questions?