Enhanced Memory Debugging of MPI-parallel Applications in Open MPI



  1. Enhanced Memory Debugging of MPI-parallel Applications in Open MPI
     4th Parallel Tools Workshop 2010
     Shiqing Fan, HLRS, High Performance Computing Center, University of Stuttgart, Germany

  2. Introduction: Open MPI 1/3
     • A new MPI implementation, written from scratch
       – without the cruft of the previous implementations it merges: PACX-MPI, LAM/MPI, LA-MPI and FT-MPI
       – design started in early 2004
     • Project goals
       – Full, fast and extensible MPI-2 implementation
       – Thread safety
       – Prevent the "forking problem"
       – Combine the best ideas and technologies of the predecessor projects
     • Open-source license based on the BSD license

  3. Introduction: Open MPI 2/3
     • Current status
       – Stable version v1.2.6 (April 2008)
       – Release v1.3 is coming very soon
     • 14 members, 6 contributors
       – 4 US DOE labs
       – 8 universities
       – 7 vendors
       – 1 individual

  4. Introduction: Open MPI 3/3
     • Open MPI consists of three sub-packages
       – Open MPI: the MPI layer itself
       – Open RTE: Open Run-Time Environment
       – Open PAL: Open Portable Access Layer
       – all layered on top of the operating system
     • Modular Component Architecture (MCA)
       – Dynamically loads available modules, like plug-ins, and checks for hardware
       – Selects the best plug-in and unloads the others (e.g. if the hardware is not available)
       – Fast indirect calls into each plug-in
     [Figure: the user application calls the MPI API; the MCA hosts frameworks such as BTL, with components like OpenIB, TCP, Myrinet and shared memory (SM)]
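     The frameworks and components available in a given installation can be listed with Open MPI's standard ompi_info tool; the component list and version strings below are illustrative, not taken from the slides:

         $ ompi_info | grep "MCA btl"
                  MCA btl: openib (MCA v1.0, API v1.0.1, Component v1.2.6)
                  MCA btl: tcp (MCA v1.0, API v1.0.1, Component v1.2.6)
                  MCA btl: sm (MCA v1.0, API v1.0.1, Component v1.2.6)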

  5. Introduction: Valgrind 1/2
     • An open-source debugging and profiling tool
     • For x86/Linux, AMD64/Linux, PPC32/Linux and PPC64/Linux
     • Works with any dynamically or statically linked application
     • Memcheck: a heavyweight memory checker
       – Runs the program on a synthetic CPU, identical to a real CPU, but storing extra information about memory:
         Valid-value bits (V-bits) for each bit: does it hold a valid value?
         Address bits (A-bits) for each byte: is it possible to read/write that location?
       – All reads and writes of memory are checked
       – Calls to malloc/new/free/delete are intercepted

  6. Introduction: Valgrind 2/2
     • Use of uninitialized memory
       – the error is reported only when the uninitialized value is actually used, e.g.:

           int c[2];
           int i = c[0];   /* OK */
           if (i == 0)     /* Memcheck: use of uninitialized value! */

     • Use of free'd memory
     • Mismatched use of malloc/new with free/delete
     • Memory leaks
     • Overlapping src and dst blocks in memcpy(), strcpy(), strncpy(), strcat(), strncat()
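     To make these report types concrete, here is a small self-contained program (illustrative, not from the slides) that triggers three of them when run under Valgrind:

         /* memcheck_demo.c: illustrative only.
            Compile: gcc -g memcheck_demo.c -o memcheck_demo
            Run:     valgrind ./memcheck_demo                 */
         #include <stdio.h>
         #include <stdlib.h>

         int main(void)
         {
             int *a = malloc(2 * sizeof(int));   /* never freed: memory leak      */
             if (a[0] == 0)                      /* use of an uninitialized value */
                 printf("a[0] looks zero\n");

             int *b = malloc(sizeof(int));
             free(b);
             *b = 42;                            /* write to free'd memory        */

             return 0;
         }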

  7. Valgrind – MPI Example 1/2
     • Open MPI readily supports running applications under Valgrind:

         mpirun -np 2 valgrind ./mpi_murks
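     Open MPI also ships a Valgrind suppression file that silences known false positives coming from the library's own internals; the exact installation path below is an assumption:

         mpirun -np 2 valgrind \
             --suppressions=<ompi-prefix>/share/openmpi/openmpi-valgrind.supp \
             ./mpi_murks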

  8. Valgrind – MPI Example 2/2
     • With Valgrind, mpirun -np 2 valgrind ./mpi_murks reports (PID 11278):

         ==11278== Invalid read of size 1
         ==11278==    at 0x4002321E: memcpy (../../memcheck/mac_replace_strmem.c:256)
         ==11278==    by 0x80690F6: MPID_SHMEM_Eagerb_send_short (mpich/../shmemshort.c:70)
         ... 2 lines of calls to MPICH functions deleted ...
         ==11278==    by 0x80492BA: MPI_Send (/usr/src/mpich/src/pt2pt/send.c:91)
         ==11278==    by 0x8048F28: main (mpi_murks.c:44)
         ==11278== Address 0x4158B0EF is 3 bytes after a block of size 40 alloc'd
         ==11278==    at 0x4002BBCE: malloc (../../coregrind/vg_replace_malloc.c:160)
         ==11278==    by 0x8048EB0: main (mpi_murks.c:39)

       – a buffer overrun by 4 bytes in MPI_Send

         ==11278== Conditional jump or move depends on uninitialised value(s)
         ==11278==    at 0x402985C4: _IO_vfprintf_internal (in /lib/libc-2.3.2.so)
         ==11278==    by 0x402A15BD: _IO_printf (in /lib/libc-2.3.2.so)
         ==11278==    by 0x8048F44: main (mpi_murks.c:46)

       – printing of an uninitialized variable

     • What it cannot find:
       – when run with 1 process: one pending Recv (caught by Marmot)
       – when run with >2 processes: unmatched Sends (caught by Marmot)

  9. Design and Implementation 1/3
     • Memchecker: a new concept that uses Valgrind's API internally in Open MPI to reveal bugs
       – in the application
       – in Open MPI itself
     • Implemented as a generic memchecker interface in the MCA
       – implemented in the Open PAL layer
       – configure option: --enable-memchecker
       – optionally pass an installed Valgrind: --with-valgrind=/path/to/valgrind
     • Then simply run the usual command, e.g.:

         mpirun -np 2 valgrind ./my_mpi

     [Figure: the memchecker framework lives in the Open PAL layer, below Open RTE and Open MPI and above the operating system, alongside the other MCA frameworks; its components are valgrind and solaris_rtc (currently no API is implemented in rtc)]
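     A typical build sequence might look as follows (a sketch; the slides only name the two configure options, the prefix is an assumption):

         # illustrative build of Open MPI with the memchecker framework enabled
         ./configure --prefix=$HOME/ompi-memchecker \
                     --enable-memchecker \
                     --with-valgrind=/path/to/valgrind
         make all install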

  10. Design and Implementation 2/3
     • Detect the application's memory violations of the MPI standard
       – the application's usage of undefined data
       – the application's memory accesses that are forbidden by MPI semantics
     • Detect non-blocking / one-sided communication buffer errors
       – hooks in the BTL layer cover both kinds of communication
       – memory accessibility is set independent of MPI operations, i.e. only for the fragment currently being sent or received (see the sketch below)
       – derived datatypes are handled
     • MPI object checking
       – checks the definedness of MPI objects passed to the MPI API: MPI_Status, MPI_Comm, MPI_Request and MPI_Datatype
       – can be disabled for better performance
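     A minimal sketch of the mechanism (illustrative; the function names are invented, not Open MPI's internal API): fragment accessibility is toggled with Valgrind's standard Memcheck client requests from <valgrind/memcheck.h>:

         /* Illustrative sketch, not Open MPI's real memchecker component.
            VALGRIND_MAKE_MEM_* are standard Memcheck client requests.     */
         #include <stddef.h>
         #include <valgrind/memcheck.h>

         /* called when a fragment of a non-blocking send/recv buffer is handed
            to the BTL: the application must not touch it until completion */
         static void frag_protect(void *addr, size_t len)
         {
             VALGRIND_MAKE_MEM_NOACCESS(addr, len);   /* clear A-bits */
         }

         /* called when the fragment completes (e.g. inside MPI_Wait): hand the
            buffer back to the application, with defined contents after a recv */
         static void frag_release(void *addr, size_t len)
         {
             VALGRIND_MAKE_MEM_DEFINED(addr, len);    /* set A- and V-bits */
         }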

  11. Design and Implementation 3/3
     • Non-blocking send/receive buffer error checking
     [Figure: on MPI_Isend (Proc0) and MPI_Irecv (Proc1), the buffer is split into fragments 0..n in the MPI layer; as each fragment passes down through the PML (P2P management layer), BML (BTL management layer) and BTL (byte transfer layer), it is marked not accessible (unaddressable); on MPI_Wait the fragments become accessible again]

  12. Detectable Bug Classes 1/3
     • A non-blocking buffer is accessed or modified before the operation has finished:

         MPI_Isend (buffer, SIZE, MPI_INT, ..., &request);
         buffer[1] = 4711;            /* modifies the buffer too early */
         MPI_Wait (&request, &status);

     • The standard does not (yet) even allow read access:

         MPI_Isend (buffer, SIZE, MPI_INT, ..., &request);
         result[1] = buffer[1];       /* read access, also forbidden */
         MPI_Wait (&request, &status);

     • Side note: MPI-1, p. 30, rationale for the restrictive access rules: it "allows better performance on some systems"
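     A self-contained program (illustrative, not from the slides) exhibiting the first bug class; under a memchecker-enabled Open MPI and Valgrind, the early write to buffer[1] would be flagged:

         /* isend_early_write.c: illustrative demo of the bug class above.
            mpicc isend_early_write.c -o demo && mpirun -np 2 valgrind ./demo */
         #include <mpi.h>
         #define SIZE 64

         int main(int argc, char **argv)
         {
             int rank, buffer[SIZE] = {0};
             MPI_Request request;
             MPI_Status  status;

             MPI_Init(&argc, &argv);
             MPI_Comm_rank(MPI_COMM_WORLD, &rank);

             if (rank == 0) {
                 MPI_Isend(buffer, SIZE, MPI_INT, 1, 0, MPI_COMM_WORLD, &request);
                 buffer[1] = 4711;            /* BUG: buffer still owned by MPI */
                 MPI_Wait(&request, &status);
             } else if (rank == 1) {
                 MPI_Recv(buffer, SIZE, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
             }

             MPI_Finalize();
             return 0;
         }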

  13. Detectable Bug Classes 2/3
     • Access to a buffer that is under the control of MPI:

         MPI_Irecv (buffer, SIZE, MPI_CHAR, ..., &request);
         buffer[1] = 4711;            /* the receive may overwrite this */
         MPI_Wait (&request, &status);

       – side note: CRC-based methods do not reliably catch these cases
     • Memory outside the receive buffer is overwritten:

         buffer = malloc (SIZE * sizeof(char));
         memset (buffer, 0, SIZE * sizeof(char));
         MPI_Recv (buffer, SIZE+1, MPI_CHAR, ..., &status);

       – side note: MPI-1, p. 21, rationale on overflow situations: "no memory that is outside the receive buffer will ever be overwritten"
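     A runnable version of the overflow case (illustrative, not from the slides); Memcheck flags the one-byte heap overrun on the receiving rank, much like the mpi_murks report on the earlier slide:

         /* recv_overflow.c: illustrative demo of the overflow bug class.
            mpicc recv_overflow.c -o demo && mpirun -np 2 valgrind ./demo */
         #include <mpi.h>
         #include <stdlib.h>
         #include <string.h>
         #define SIZE 40

         int main(int argc, char **argv)
         {
             int rank;
             MPI_Status status;

             MPI_Init(&argc, &argv);
             MPI_Comm_rank(MPI_COMM_WORLD, &rank);

             if (rank == 0) {
                 char msg[SIZE + 1];
                 memset(msg, 'x', sizeof(msg));
                 MPI_Send(msg, SIZE + 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
             } else if (rank == 1) {
                 char *buffer = malloc(SIZE * sizeof(char));  /* SIZE bytes only   */
                 MPI_Recv(buffer, SIZE + 1, MPI_CHAR, 0, 0,   /* BUG: one too many */
                          MPI_COMM_WORLD, &status);
                 free(buffer);
             }

             MPI_Finalize();
             return 0;
         }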

  14. Detectable Bug Classes 3/3
     • Usage of undefined memory passed back from Open MPI:

         MPI_Wait (&request, &status);
         if (status.MPI_ERROR != MPI_SUCCESS)   /* reads an undefined field */

       – side note: this field should remain undefined
         MPI-1, p. 22: not needed for calls that return only one status
         MPI-2, p. 24: clarification of status in single-completion calls
     • Write to a buffer before an accumulate has finished:

         MPI_Accumulate (A, NROWS*NCOLS, MPI_INT, 1, 0, 1, xpose, MPI_SUM, win);
         A[0][1] = 4711;                        /* win not yet synchronized */
         MPI_Win_fence (0, win);
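     For contrast, a conforming pattern (a sketch, not from the slides): in single-completion calls the error comes back as the return code, while status.MPI_ERROR is only set by multi-completion calls such as MPI_Waitall:

         /* status_error.c: sketch of conforming error handling (illustrative).
            mpicc status_error.c -o demo && mpirun -np 2 ./demo */
         #include <mpi.h>

         int main(int argc, char **argv)
         {
             int rank, buf = 0;
             MPI_Request request = MPI_REQUEST_NULL;
             MPI_Status  status;

             MPI_Init(&argc, &argv);
             MPI_Comm_rank(MPI_COMM_WORLD, &rank);
             /* make MPI calls return error codes instead of aborting */
             MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

             if (rank == 0)
                 MPI_Isend(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &request);
             else if (rank == 1)
                 MPI_Irecv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);

             /* single completion: check the return code, not status.MPI_ERROR */
             if (MPI_Wait(&request, &status) != MPI_SUCCESS) {
                 /* handle the error; status.MPI_ERROR stays undefined here */
             }

             MPI_Finalize();
             return 0;
         }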

  15. Performance 1/2
     • Benchmark
       – Intel MPI Benchmark (IMB)
     • Environment
       – D-Grid cluster at HLRS
       – dual-processor Intel Woodcrest nodes
       – InfiniBand DDR network with the OpenFabrics stack
     • Test cases
       – plain Open MPI
       – with the memchecker component, but without MPI object checking

  16. Performance 2/2
     • Intel MPI Benchmark, bi-directional Get test
     • Two nodes, TCP connections over the IP-over-InfiniBand interface
     • Run with and without Valgrind

  17. Valgrind (Memcheck) Extension 1/2
     • New client requests for:
       – watching memory read operations
       – watching memory write operations
       – initiating callback functions on memory reads/writes
       – making memory readable and/or writable
     • Uses a fast ordered-set algorithm
     • Byte-wise memory checking
     • Handles memory with mixed registered and unregistered blocks

  18. Valgrind (Memcheck) Extension 2/2
     • VALGRIND_REG_USER_MEM_WATCH (addr, len, op, cb, info)
     • VALGRIND_UNREG_USER_MEM_WATCH (addr, len)
     • The watch "op" can be:
       – WATCH_MEM_READ, WATCH_MEM_WRITE or WATCH_MEM_RW
     [Figure: sequence diagram between the user application and Valgrind; the application allocates and registers memory, and when it later reads the watched region, Valgrind invokes the registered read callback]
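     How these requests might be used (a sketch only: the slides name the requests and their (addr, len, op, cb, info) arguments, but not the callback signature or the header, so both are assumptions here):

         /* Assumes an extended <valgrind/memcheck.h> that defines the new
            client requests; the callback signature is an ASSUMPTION.     */
         #include <valgrind/memcheck.h>
         #include <stdio.h>
         #include <stdlib.h>

         /* hypothetical callback, invoked by Valgrind on a watched read */
         static void on_read(void *addr, unsigned long len, void *info)
         {
             fprintf(stderr, "watched read of %lu bytes at %p (%s)\n",
                     len, addr, (const char *) info);
         }

         int main(void)
         {
             char *buf = malloc(64);

             /* watch all read accesses to buf and route them to on_read */
             VALGRIND_REG_USER_MEM_WATCH(buf, 64, WATCH_MEM_READ,
                                         on_read, (void *) "demo buffer");

             volatile char c = buf[0];   /* would trigger on_read when run
                                            under the extended Valgrind    */
             (void) c;

             VALGRIND_UNREG_USER_MEM_WATCH(buf, 64);
             free(buf);
             return 0;
         }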

  19. Thank you very much!
