Slide 1
Enhanced Memory debugging of MPI-parallel Applications in Open MPI

4th Parallel tools workshop 2010

Shiqing Fan HLRS, High Performance Computing Center University of Stuttgart, Germany

Slide 2

Introduction: Open MPI 1/3

  • A new MPI implementation from scratch
  • Without the cruft of previous implementations
  • Design started in early 2004
  • Project goals
    – Full, fast & extensible MPI-2 implementation
    – Thread safety
    – Prevent the “forking problem”
    – Combine the best ideas and technologies
  • Open source license based on the BSD license

(Merged predecessor projects: PACX-MPI, LAM/MPI, LA-MPI, FT-MPI)

Slide 3

Introduction: Open MPI 2/3

  • Current status
    – Stable version v1.2.6 (April 2008)
    – Release v1.3 coming very soon
  • 14 members, 6 contributors
    – 4 US DOE labs
    – 8 universities
    – 7 vendors
    – 1 individual

Slide 4

Introduction: Open MPI 3/3

  • Open MPI consists of three sub-packages
    – Open MPI (the MPI layer)
    – Open RTE – the Open RunTime Environment
    – Open PAL – the Open Portable Access Layer
    (stacked in that order on top of the operating system)
  • Modular Component Architecture (MCA)
    – Dynamically loads available modules like plug-ins and checks for hardware
    – Selects the best plug-in and unloads the others (e.g. if the hardware is not available)
    – Fast indirect calls into each plug-in

(Diagram: the user application calls the MPI API; the MCA manages frameworks such as BTL, whose components include SM, OpenIB, Myrinet and TCP)

Slide 5

Introduction: Valgrind 1/2

  • An open-source debugging & profiling tool
  • For x86/Linux, AMD64/Linux, PPC32/Linux and PPC64/Linux
  • Works with any dynamically or statically linked application
  • Memcheck – a heavyweight memory checker
  • Runs the program on a synthetic CPU
    – Identical to a real CPU, but additionally stores metadata about memory
  • Valid-value bits (V-bits) for each bit
    – Whether it holds a valid value or not
  • Address bits (A-bits) for each byte
    – Whether it is possible to read/write that location
  • All reads and writes of memory are checked
  • Calls to malloc/new/free/delete are intercepted
Slide 6

Introduction: Valgrind 2/2

  • Use of uninitialized memory
    – The error is only reported when the uninitialized value is actually used, e.g.:

      int c[2];
      int i = c[0];   /* OK – merely copying an undefined value */
      if (i == 0)     /* Memcheck: use of uninitialized value! */

  • Use of free’d memory
  • Mismatched use of malloc/new with free/delete
  • Memory leaks
  • Overlap of src and dst blocks
    – memcpy(), strcpy(), strncpy(), strcat(), strncat()

Slide 7

Valgrind – MPI Example 1/2

  • Open MPI readily supports executing applications under Valgrind:

    mpirun -np 2 valgrind ./mpi_murks

Slide 8

Valgrind – MPI Example 2/2

  • Output of mpirun -np 2 valgrind ./mpi_murks:

==11278== Invalid read of size 1
==11278==    at 0x4002321E: memcpy (../../memcheck/mac_replace_strmem.c:256)
==11278==    by 0x80690F6: MPID_SHMEM_Eagerb_send_short (mpich/../shmemshort.c:70)
.. 2 lines of calls to MPIch-functions deleted ...
==11278==    by 0x80492BA: MPI_Send (/usr/src/mpich/src/pt2pt/send.c:91)
==11278==    by 0x8048F28: main (mpi_murks.c:44)
==11278==  Address 0x4158B0EF is 3 bytes after a block of size 40 alloc'd
==11278==    at 0x4002BBCE: malloc (../../coregrind/vg_replace_malloc.c:160)
==11278==    by 0x8048EB0: main (mpi_murks.c:39)
....

(Slide callouts: “==11278==” is the PID; the report above is a buffer overrun by 4 bytes in MPI_Send; the report below is the printing of an uninitialized variable)

  • It cannot find:
    – When run with 1 process: one pending Recv (found by Marmot)
    – When run with >2 processes: unmatched Sends (found by Marmot)

==11278== Conditional jump or move depends on uninitialised value(s)
==11278==    at 0x402985C4: _IO_vfprintf_internal (in /lib/libc-2.3.2.so)
==11278==    by 0x402A15BD: _IO_printf (in /lib/libc-2.3.2.so)
==11278==    by 0x8048F44: main (mpi_murks.c:46)

Slide 9

Design and implementation 1/3

  • Memchecker: a new concept that uses Valgrind’s API internally in Open MPI to reveal bugs
    – In the application
    – In Open MPI itself
  • Implemented as a generic memchecker interface in the MCA
    – Implemented in the Open PAL layer
    – Configure option: --enable-memchecker
    – Optionally point to an installed Valgrind: --with-valgrind=/path/to/valgrind
  • Then simply run the usual command, e.g.:
    – mpirun -np 2 valgrind ./my_mpi

(Diagram: memchecker is one framework in Open PAL, below Open RTE and Open MPI; its components are valgrind and solaris_rtc*)

*currently no API implemented in rtc.
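Put together, the configure options above might be used as follows (a sketch; the prefix path is a placeholder, and the debug build is an assumption commonly paired with memchecker support rather than something stated on the slide):

```shell
# Build Open MPI with the memchecker framework enabled
./configure --prefix=$HOME/openmpi-memchecker \
            --enable-debug \
            --enable-memchecker \
            --with-valgrind=/path/to/valgrind   # optional: non-default Valgrind install
make && make install

# Then run the MPI application under Valgrind as usual
mpirun -np 2 valgrind ./my_mpi
```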

Slide 10

Design and implementation 2/3

  • Detect the application’s memory violations of the MPI standard
    – The application’s usage of undefined data
    – The application’s memory accesses that violate MPI semantics
  • Detect non-blocking/one-sided communication buffer errors
    – Implemented in BTL-layer functions for both kinds of communication
    – Sets memory accessibility independently of the MPI operation, i.e. only for the fragment currently being sent/received
    – Handles derived datatypes
  • MPI object checking
    – Checks the definedness of MPI objects passed to the MPI API
    – MPI_Status, MPI_Comm, MPI_Request and MPI_Datatype
    – Can be disabled for better performance

Slide 11

Design and implementation 3/3

  • Non-blocking send/receive buffer error checking

(Diagram: the layer stack MPI layer → PML, the P2P Management Layer → BML, the BTL Management Layer → BTL, the Byte Transfer Layer; the part of the user buffer belonging to a fragment Fragn is marked not accessible while the fragment is in flight)

(Diagram: Proc0 calls MPI_Isend, Proc1 calls MPI_Irecv; the buffer fragments Frag0, Frag1, …, Fragn remain inaccessible/unaddressable until the matching MPI_Wait completes)

Slide 12

Detectable bug-classes 1/3

  • Non-blocking buffer accessed/modified before the operation has finished:

    MPI_Isend (buffer, SIZE, MPI_INT, …, &request);
    buffer[1] = 4711;                 /* error: buffer still owned by MPI */
    MPI_Wait (&request, &status);

  • The standard does not (yet) allow read access either:

    MPI_Isend (buffer, SIZE, MPI_INT, …, &request);
    result[1] = buffer[1];            /* error: read from a buffer owned by MPI */
    MPI_Wait (&request, &status);

  • Side note:
    – MPI-1, p30, rationale for the restrictive access rules: “allows better performance on some systems”.
Slide 13

Detectable bug-classes 2/3

  • Access to a buffer under the control of MPI:

    MPI_Irecv (buffer, SIZE, MPI_CHAR, …, &request);
    buffer[1] = 4711;                 /* error: receive buffer owned by MPI */
    MPI_Wait (&request, &status);

  • Side note: CRC-based methods do not reliably catch these cases.
  • Memory outside the receive buffer is overwritten:

    buffer = malloc (SIZE * sizeof(char));
    memset (buffer, 0, SIZE * sizeof(char));
    MPI_Recv (buffer, SIZE+1, MPI_CHAR, …, &status);   /* overruns buffer by one element */

  • Side note: MPI-1, p21, rationale on overflow situations: “no memory that is outside the receive buffer will ever be overwritten.”

Slide 14

Detectable bug-classes 3/3

  • Usage of undefined memory passed back from Open MPI:

    MPI_Wait (&request, &status);
    if (status.MPI_ERROR != MPI_SUCCESS)   /* error: this field is undefined here */

  • Side note: this field should remain undefined.
    – MPI-1, p22 (not needed for calls that return only one status)
    – MPI-2, p24 (clarification of status in single-completion calls)
  • Write to a buffer before an accumulate has finished:

    MPI_Accumulate (A, NROWS*NCOLS, MPI_INT, 1, 0, 1, xpose, MPI_SUM, win);
    A[0][1] = 4711;                        /* error: buffer still in use by MPI */
    MPI_Win_fence (0, win);

Slide 15

Performance 1/2

  • Benchmarks
    – Intel MPI Benchmark
  • Environment
    – D-Grid cluster at HLRS
    – Dual-processor Intel Woodcrest nodes
    – InfiniBand DDR network with the OpenFabrics stack
  • Test cases
    – Plain Open MPI
    – With the memchecker component, but without MPI object checking

Slide 16

Performance 2/2

  • Intel MPI Benchmark, bi-directional Get test
  • 2 nodes, TCP connections over the IP-over-InfiniBand (IPoIB) interface
  • Run with and without Valgrind
Slide 17

Valgrind (Memcheck) Extension 1/2

  • New client requests for:
    – Watching memory read operations
    – Watching memory write operations
    – Triggering callback functions on memory reads/writes
    – Making memory readable and/or writable
  • Uses a fast ordered-set algorithm
  • Byte-wise memory checking
  • Handles memory with a mix of registered and unregistered blocks
Slide 18

Valgrind (Memcheck) Extension 2/2

  • VALGRIND_REG_USER_MEM_WATCH (addr, len, op, cb, info)
  • VALGRIND_UNREG_USER_MEM_WATCH (addr, len)
  • The watch “op” can be:
    – WATCH_MEM_READ, WATCH_MEM_WRITE or WATCH_MEM_RW

(Diagram: the user app allocates and reads memory as usual; under the extended Valgrind, a read of a watched region additionally triggers the registered Read_cb)

Slide 19

Thank you very much !