Debugging Large Scale and Hybrid Parallel Code



  1. Debugging Large Scale and Hybrid Parallel Code. Mark O'Connor, mark@allinea.com, Lead Developer. www.allinea.com

  2. Didn't ask for parallelism

  3. ... but got parallelism

  4. Interesting times ... • A recent history of parallel computing – Increasing core counts – Increasing cores per node – Established programming models • Near-100% of HPC software using MPI and/or OpenMP • A close match of software to hardware - good portability – The challenge of application scalability remains! • Times change ... – GPUs entering HPC – power, performance, … – Massive multi-core clusters – with many GPUs – The challenge of hybrid software is here!

  5. Exploiting technology • “Software has become the #1 roadblock … Many applications will need a major redesign” - IDC HPC Update, June 2010 – Most ISV codes do not scale – High programming costs are delaying GPU usage • Development tools are a vital part of the solution

  6. Allinea Software • UK based HPC tools company since 2001 – Allinea DDT – the scalable parallel debugger – Allinea OPT – the optimization tool for MPI and non-MPI – Allinea DDTLite – the parallel debugging plugin for Microsoft Visual Studio • Large European and US customer base – Ease of use – means tools get used – Users debugging regularly at all scales – at 1 or 100,000 cores – World's only Petascale debugger!

  7. Some Clients and Partners • Academic – Over 200 universities • Major research centres – ANL, EPCC, IDRIS, Juelich, NERSC, ORNL, • Aviation and Defence – Airbus, AWE, Dassault, DLR, EADS, ... • Energy – CEA, CGG Veritas, IFP, Total, ... • EDA – Cadence, Intel, Synopsys, ... • Climate and Weather – UK Met Office, Meteo France, NOAA ...

  8. Background • Debugging: Good (aka A Necessary Evil) – Reproducing and fixing software problems • Complexity of scaling and GPU architecture will introduce bugs – Debuggers interactively examine processes and data • Fastest way to debug – with less chance of introducing more bugs – Bugs at scale need a debugger at scale • … until recently debuggers limited to ~4,000-8,000 cores – Bugs on GPUs need a debugger for GPUs • … until recently GPU software couldn't be debugged – Allinea DDT is the first graphical debugger to do both

  9. DDT in a nutshell • Scalar features – Advanced C++ and STL – Fortran 90, 95 and 2003: modules, allocatable data, pointers, derived types – Memory debugging • Multithreading & OpenMP features – Step, breakpoint etc. one or all threads • MPI features – Easy to manage groups – Control processes by groups – Compare data – Visualize message queues

  10. Memory Debugging • Find memory leaks • Or stop on read/write beyond end of array

  11. Scale Matters • Not just a rich man's problem – Used to be exclusive to big labs – Everyone is joining the fun – If you can't debug problems at scale, you can't fix them • Historic debugger limits – Linear or worse performance in #cores – Maximum size limited by patience (or desperation) [Chart: Systems in Top 500 by year (June & November lists, 2006 to 2010); series: 8k - 32k cores, 32k+ cores]

  12. DDT: Petascale Debugging [Chart: DDT 3.0 Performance Figures, Jaguar XT5; Time (Seconds) against MPI Processes, 0 to 200,000; series: All Step, All Breakpoint] • DDT delivers petascale debugging today – Collaborations with ORNL on Jaguar Cray XT and CEA – Tree architecture – logarithmic performance – Now faster at 220,000 than previously at 1,000 cores – ~1/10th of a second to step and gather all stacks at 220,000 cores
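
Why a tree architecture gives logarithmic behaviour can be sketched with a little arithmetic. The slides do not state DDT's actual topology or fan-out, so the fan-out of 32 used below is purely an assumption for illustration, and `tree_depth` is a hypothetical helper rather than anything from DDT.

```python
def tree_depth(n_procs: int, fanout: int) -> int:
    """Levels needed for a tree with the given fan-out to reach n_procs leaves.

    A command sent from the root (or results gathered back to it) crosses
    this many hops, instead of the front end talking to every process in turn.
    """
    if n_procs < 1 or fanout < 2:
        raise ValueError("need n_procs >= 1 and fanout >= 2")
    depth = 0
    reach = 1  # number of leaves reachable with the current depth
    while reach < n_procs:
        reach *= fanout
        depth += 1
    return depth


if __name__ == "__main__":
    # Fan-out of 32 is an assumed value, not a figure from the slides.
    for n in (1_000, 8_000, 220_000):
        print(f"{n:>7} processes -> {tree_depth(n, 32)} levels")
```

Under this assumed fan-out, going from 1,000 to 220,000 processes adds only two levels to the tree, whereas a flat, one-connection-per-process design would do 220 times as much work at the root.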

  13. Scalable Process Control • Parallel Stack View – Finds rogue processes faster – Identifies classes of process behaviour – Allows rapid grouping of processes • Control Processes by Groups – Set breakpoints, step, play, stop etc. using user-defined groups – Mutates to scalable groups view – Compact group representations
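
The idea behind finding rogue processes from their stacks can be illustrated by grouping ranks that share an identical call stack. This is a minimal sketch, not DDT's implementation: the function names, the stack contents and the "smallest group is the suspect" heuristic are all assumptions made for the example.

```python
from collections import defaultdict


def group_by_stack(stacks):
    """Map each distinct call stack (outermost frame first) to the ranks in it."""
    groups = defaultdict(list)
    for rank in sorted(stacks):
        groups[tuple(stacks[rank])].append(rank)
    return dict(groups)


def smallest_group(groups):
    """Return (stack, ranks) for the least populated group, a likely rogue."""
    return min(groups.items(), key=lambda item: len(item[1]))


if __name__ == "__main__":
    # Hypothetical stacks: three ranks wait in a collective, one never got there.
    stacks = {
        0: ["main", "solve", "MPI_Allreduce"],
        1: ["main", "solve", "MPI_Allreduce"],
        2: ["main", "solve", "MPI_Allreduce"],
        3: ["main", "solve", "compute_flux"],
    }
    groups = group_by_stack(stacks)
    for stack, ranks in groups.items():
        print(" > ".join(stack), "ranks", ranks)
    print("suspect:", smallest_group(groups))
```

Each distinct stack becomes one class of process behaviour, and the resulting rank lists are natural candidates for user-defined groups.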

  14. Presenting Data, Usefully • Gather from every node – Potentially costly – if all data different – Easy if data mostly same – New ideas • Aggregated statistics • Probabilistic algorithms optimize performance – even in pathological case • Watch this space! – With a fast and scalable architecture, new things become possible
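
A small sketch of what aggregated statistics over one variable might look like when most ranks hold the same value. The helper names, the output layout and the sample values are assumptions for illustration; the slides do not describe how DDT computes or displays these summaries.

```python
def compress_ranks(ranks):
    """Render a list of ranks compactly, e.g. [0, 1, 2, 5] -> '0-2,5'."""
    ranks = sorted(ranks)
    if not ranks:
        return ""
    spans = []
    start = prev = ranks[0]
    for r in ranks[1:]:
        if r == prev + 1:
            prev = r
            continue
        spans.append((start, prev))
        start = prev = r
    spans.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in spans)


def summarize(values):
    """Summarize one variable across ranks: min, max, mean and distinct values."""
    by_value = {}
    for rank in sorted(values):
        by_value.setdefault(values[rank], []).append(rank)
    data = list(values.values())
    return {
        "min": min(data),
        "max": max(data),
        "mean": sum(data) / len(data),
        "distinct": {v: compress_ranks(r) for v, r in by_value.items()},
    }


if __name__ == "__main__":
    # Hypothetical loop counter: rank 3 has run ahead of the others.
    print(summarize({0: 10, 1: 10, 2: 10, 3: 99}))
```

When the data is mostly the same, the summary stays small no matter how many ranks contribute, which is the cheap case the slide refers to.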

  15. The new hotness • Hybrids are today's hottest topic – Technology is moving quickly – compilers, SDKs, hardware – NVIDIA CUDA leads in tool support • Many lines of code need rewriting for GPUs – Memory hierarchy – Explicit data transfer between host and accelerator – Unusual execution model • Kernels, thread blocks, warps, synchronization points – Massively fine-grained parallel model – It works: 1 billion keys / second on a single GTX480 • Inevitable that we need to debug!

  16. Debugging Options • Old world “printf” – NVIDIA SDK now allows this (new) – but has limitations • Fake it – run the kernel on the host x86_64 processor – Languages often support targeting host CPU instead of GPU – Different numeric precision – different answer? – Different scheduling – different answer? – A reasonable option for some bugs • Or run on the GPU with Allinea DDT...

  17. Introducing DDT for CUDA • The first graphical debugger for NVIDIA CUDA – Simple and easy to use – As easy as debugging ordinary code • All the commands you'd expect – Breakpoints – Stepping warps – Viewing data and thread stacks • Plus more advanced features – CUDA memcheck – memory debugging for CUDA • More to come!

  18. CUDA Threads in DDT • Run the code – Browse source – Set breakpoints – Stop at a line of CUDA code – Stops once for each scheduled collection of blocks • Select a CUDA thread – Examine variables and shared memory – Step a warp

  19. Easy to understand scale • View all threads in parallel stack view – At one glance, see all GPU and CPU threads together – Links with thread selection – Pick a tree node to select one of the CUDA threads at that location • Full MPI support – See GPU and CPU threads from multiple nodes

  20. Some Common Problems • Incorrect logic (if-statements, calculations) – Loop iteration to GPU thread analogy - threads identified by grid and block indexes – Solution: Select a thread and step with DDT; look at the local state and shared data • Cherry-pick important threads: start, end, a few interior points • Kernel bounds – getting the right grids and blocks – Incorrect kernel thread boundaries can lead to incomplete results or crashing of the kernel – Solution: Bugs will often trigger “CUDA memcheck” errors - run with DDT and CUDA memory debugging enabled – Solution: Use DDT's advanced multi-dimensional array viewer to look at data and find the missing indexes
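
The kernel bounds problem comes down to index arithmetic, which can be emulated on the host. The sketch below is plain Python rather than CUDA, and the problem size (1,000 elements) and block size (256 threads) are assumed values; it only shows how the loop-iteration-to-thread mapping goes wrong when the block count is rounded down, and why a bounds guard is needed when it is rounded up.

```python
def blocks_needed(n: int, threads_per_block: int) -> int:
    """Round up so that every one of the n elements gets a thread."""
    return (n + threads_per_block - 1) // threads_per_block


def emulate_launch(n: int, n_blocks: int, threads_per_block: int):
    """Emulate a 1-D launch: return (indices touched, threads past the end).

    Each emulated thread computes i = block * threads_per_block + thread,
    the same mapping a 1-D kernel derives from its block and thread indexes.
    """
    touched = set()
    past_end = 0
    for block in range(n_blocks):
        for thread in range(threads_per_block):
            i = block * threads_per_block + thread
            if i < n:          # the bounds guard a real kernel needs
                touched.add(i)
            else:
                past_end += 1  # without the guard these would write out of bounds
    return touched, past_end


if __name__ == "__main__":
    n, tpb = 1000, 256  # assumed sizes for illustration
    good, extra = emulate_launch(n, blocks_needed(n, tpb), tpb)
    print(len(good), "elements covered,", extra, "threads need the guard")

    bad, _ = emulate_launch(n, n // tpb, tpb)  # rounding down: the classic bug
    missing = sorted(set(range(n)) - bad)
    print("missing indexes:", missing[0], "to", missing[-1])
```

Rounding down silently drops the tail of the array, which is the kind of incomplete result that shows up as a block of untouched indexes when the data is inspected; rounding up without the guard produces out-of-range accesses of the kind a memory checker reports.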

  21. Current Limitations • SDK 3.0 was a big leap forward • SDK and driver limitations – Only one GPU can be debugged per O/S (per physical node) – Cannot currently read launch failure codes (without breaking your code) – Only one warp can be stepped per GPU at any time – Cannot debug GPU part of (attach to) an already running job • Strong partnership with NVIDIA, CAPS and others is helping to extend capabilities – SDK 3.1 is much better for general computation – SDK 3.2 adds debug support for multiple GPUs per node

  22. Petascale Debugging: Solved. GPU Debugging: Works. Any Questions?
