Slide 1
Enhanced Memory debugging of MPI-parallel Applications in Open MPI

4th Parallel tools workshop 2010

Shiqing Fan HLRS, High Performance Computing Center University of Stuttgart, Germany

Slide 2

Introduction: Open MPI 1/3

  • A new MPI implementation from scratch
  • Without the cruft of previous implementations
  • Design started in early 2004
  • Project goals
    – Full, fast & extensible MPI-2 implementation
    – Thread safety
    – Prevent the “forking problem”
    – Combine the best ideas and technologies
  • Open source license based on the BSD license

(Merged predecessor projects: PACX-MPI, LAM/MPI, LA-MPI, FT-MPI)

Slide 3

Introduction: Open MPI 2/3

  • Current status
    – Stable version v1.2.6 (April 2008)
    – Release v1.3 coming very soon
  • 14 members, 6 contributors
    – 4 US DOE labs
    – 8 universities
    – 7 vendors
    – 1 individual

Slide 4

Introduction: Open MPI 3/3

  • Open MPI consists of three sub-packages
    – Open MPI (the MPI layer)
    – Open RTE – the Open RunTime Environment
    – Open PAL – the Open Portable Access Layer
    (stacked in that order on top of the operating system)
  • Modular Component Architecture (MCA)
    – Dynamically loads available modules like plug-ins and checks for hardware
    – Selects the best plug-in and unloads the others (e.g. if the hardware is not available)
    – Fast indirect calls into each plug-in

(Diagram: the user application calls the MPI API; the MCA manages frameworks such as BTL, whose components include SM, OpenIB, Myrinet and TCP)

Slide 5

Introduction: Valgrind 1/2

  • An open-source debugging & profiling tool
  • For x86/Linux, AMD64/Linux, PPC32/Linux and PPC64/Linux
  • Works with any dynamically or statically linked application
  • Memcheck – a heavyweight memory checker
  • Runs the program on a synthetic CPU
    – Identical to a real CPU, but additionally stores metadata about memory
  • Valid-value bits (V-bits) for each bit
    – Whether it holds a valid value or not
  • Address bits (A-bits) for each byte
    – Whether it is possible to read/write that location
  • All reads and writes of memory are checked
  • Calls to malloc/new/free/delete are intercepted
Slide 6

Introduction: Valgrind 2/2

  • Use of uninitialized memory
    – The error is only reported when the uninitialized value is actually used, e.g.:

      int c[2];
      int i = c[0];   /* OK – merely copying an undefined value */
      if (i == 0)     /* Memcheck: use of uninitialized value! */

  • Use of free’d memory
  • Mismatched use of malloc/new with free/delete
  • Memory leaks
  • Overlap of src and dst blocks
    – memcpy(), strcpy(), strncpy(), strcat(), strncat()

Slide 7

Valgrind – MPI Example 1/2

  • Open MPI readily supports executing applications under Valgrind:

    mpirun -np 2 valgrind ./mpi_murks

Slide 8

Valgrind – MPI Example 2/2

  • Output of mpirun -np 2 valgrind ./mpi_murks:

==11278== Invalid read of size 1
==11278==    at 0x4002321E: memcpy (../../memcheck/mac_replace_strmem.c:256)
==11278==    by 0x80690F6: MPID_SHMEM_Eagerb_send_short (mpich/../shmemshort.c:70)
.. 2 lines of calls to MPIch-functions deleted ...
==11278==    by 0x80492BA: MPI_Send (/usr/src/mpich/src/pt2pt/send.c:91)
==11278==    by 0x8048F28: main (mpi_murks.c:44)
==11278==  Address 0x4158B0EF is 3 bytes after a block of size 40 alloc'd
==11278==    at 0x4002BBCE: malloc (../../coregrind/vg_replace_malloc.c:160)
==11278==    by 0x8048EB0: main (mpi_murks.c:39)
....

(Slide callouts: “==11278==” is the PID; the report above is a buffer overrun by 4 bytes in MPI_Send; the report below is the printing of an uninitialized variable)

  • It cannot find:
    – When run with 1 process: one pending Recv (found by Marmot)
    – When run with >2 processes: unmatched Sends (found by Marmot)

==11278== Conditional jump or move depends on uninitialised value(s)
==11278==    at 0x402985C4: _IO_vfprintf_internal (in /lib/libc-2.3.2.so)
==11278==    by 0x402A15BD: _IO_printf (in /lib/libc-2.3.2.so)
==11278==    by 0x8048F44: main (mpi_murks.c:46)

Slide 9

Design and implementation 1/3

  • Memchecker: a new concept that uses Valgrind’s API internally in Open MPI to reveal bugs
    – In the application
    – In Open MPI itself
  • Implemented as a generic memchecker interface in the MCA
    – Implemented in the Open PAL layer
    – Configure option: --enable-memchecker
    – Optionally point to an installed Valgrind: --with-valgrind=/path/to/valgrind
  • Then simply run the usual command, e.g.:
    – mpirun -np 2 valgrind ./my_mpi

(Diagram: memchecker is one framework in Open PAL, below Open RTE and Open MPI; its components are valgrind and solaris_rtc*)

*currently no API implemented in rtc.
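Put together, the configure options above might be used as follows (a sketch; the prefix path is a placeholder, and the debug build is an assumption commonly paired with memchecker support rather than something stated on the slide):

```shell
# Build Open MPI with the memchecker framework enabled
./configure --prefix=$HOME/openmpi-memchecker \
            --enable-debug \
            --enable-memchecker \
            --with-valgrind=/path/to/valgrind   # optional: non-default Valgrind install
make && make install

# Then run the MPI application under Valgrind as usual
mpirun -np 2 valgrind ./my_mpi
```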

Slide 10

Design and implementation 2/3

  • Detect the application’s memory violations of the MPI standard
    – The application’s usage of undefined data
    – The application’s memory accesses that violate MPI semantics
  • Detect non-blocking/one-sided communication buffer errors
    – Implemented in BTL-layer functions for both kinds of communication
    – Sets memory accessibility independently of the MPI operation, i.e. only for the fragment currently being sent/received
    – Handles derived datatypes
  • MPI object checking
    – Checks the definedness of MPI objects passed to the MPI API
    – MPI_Status, MPI_Comm, MPI_Request and MPI_Datatype
    – Can be disabled for better performance

Slide 11

Design and implementation 3/3

  • Non-blocking send/receive buffer error checking

(Diagram: the layer stack MPI layer → PML, the P2P Management Layer → BML, the BTL Management Layer → BTL, the Byte Transfer Layer; the part of the user buffer belonging to a fragment Fragn is marked not accessible while the fragment is in flight)

(Diagram: Proc0 calls MPI_Isend, Proc1 calls MPI_Irecv; the buffer fragments Frag0, Frag1, …, Fragn remain inaccessible/unaddressable until the matching MPI_Wait completes)

Slide 12

Detectable bug-classes 1/3

  • Non-blocking buffer accessed/modified before the operation has finished:

    MPI_Isend (buffer, SIZE, MPI_INT, …, &request);
    buffer[1] = 4711;                 /* error: buffer still owned by MPI */
    MPI_Wait (&request, &status);

  • The standard does not (yet) allow read access either:

    MPI_Isend (buffer, SIZE, MPI_INT, …, &request);
    result[1] = buffer[1];            /* error: read from a buffer owned by MPI */
    MPI_Wait (&request, &status);

  • Side note:
    – MPI-1, p30, rationale for the restrictive access rules: “allows better performance on some systems”.
Slide 13

Detectable bug-classes 2/3

  • Access to a buffer under the control of MPI:

    MPI_Irecv (buffer, SIZE, MPI_CHAR, …, &request);
    buffer[1] = 4711;                 /* error: receive buffer owned by MPI */
    MPI_Wait (&request, &status);

  • Side note: CRC-based methods do not reliably catch these cases.
  • Memory outside the receive buffer is overwritten:

    buffer = malloc (SIZE * sizeof(char));
    memset (buffer, 0, SIZE * sizeof(char));
    MPI_Recv (buffer, SIZE+1, MPI_CHAR, …, &status);   /* overruns buffer by one element */

  • Side note: MPI-1, p21, rationale on overflow situations: “no memory that is outside the receive buffer will ever be overwritten.”

Slide 14

Detectable bug-classes 3/3

  • Usage of undefined memory passed back from Open MPI:

    MPI_Wait (&request, &status);
    if (status.MPI_ERROR != MPI_SUCCESS)   /* error: this field is undefined here */

  • Side note: this field should remain undefined.
    – MPI-1, p22 (not needed for calls that return only one status)
    – MPI-2, p24 (clarification of status in single-completion calls)
  • Write to a buffer before an accumulate has finished:

    MPI_Accumulate (A, NROWS*NCOLS, MPI_INT, 1, 0, 1, xpose, MPI_SUM, win);
    A[0][1] = 4711;                        /* error: buffer still in use by MPI */
    MPI_Win_fence (0, win);

Slide 15

Performance 1/2

  • Benchmarks
    – Intel MPI Benchmark
  • Environment
    – D-Grid cluster at HLRS
    – Dual-processor Intel Woodcrest nodes
    – InfiniBand DDR network with the OpenFabrics stack
  • Test cases
    – Plain Open MPI
    – With the memchecker component, but without MPI object checking

Slide 16

Performance 2/2

  • Intel MPI Benchmark, bi-directional Get test
  • 2 nodes, TCP connections over the IP-over-InfiniBand (IPoIB) interface
  • Run with and without Valgrind
Slide 17

Valgrind (Memcheck) Extension 1/2

  • New client requests for:
    – Watching memory read operations
    – Watching memory write operations
    – Triggering callback functions on memory reads/writes
    – Making memory readable and/or writable
  • Uses a fast ordered-set algorithm
  • Byte-wise memory checking
  • Handles memory with a mix of registered and unregistered blocks
Slide 18

Valgrind (Memcheck) Extension 2/2

  • VALGRIND_REG_USER_MEM_WATCH (addr, len, op, cb, info)
  • VALGRIND_UNREG_USER_MEM_WATCH (addr, len)
  • The watch “op” can be:
    – WATCH_MEM_READ, WATCH_MEM_WRITE or WATCH_MEM_RW

(Diagram: the user app allocates and reads memory as usual; under the extended Valgrind, a read of a watched region additionally triggers the registered Read_cb)

Slide 19

Thank you very much !