SLIDE 1

MC-Checker: Detecting Memory Consistency Errors in MPI One-Sided Applications

Zhezhe Chen [1], James Dinan [2], Zhen Tang [3], Pavan Balaji [4], Hua Zhong [3], Jun Wei [3], Tao Huang [3], and Feng Qin [5]

  • 1. Twitter Inc.
  • 2. Intel Corporation
  • 3. Chinese Academy of Sciences
  • 4. Argonne National Laboratory
  • 5. The Ohio State University

SLIDE 2

MPI One-Sided Communication

  • Remote Memory Access (RMA) extends MPI with one-sided communication
    • Allows one process to specify both sender and receiver communication parameters
    • Facilitates the coding of partitioned global address space (PGAS) data models
  • Dinan et al. [1] ported the Global Arrays runtime system, ARMCI, to MPI RMA
    • NWChem is a user of MPI RMA, which we use to evaluate our tool
  • We focus on MPI-2 RMA, which is compatible with MPI-3 (future work)

Figure credit: Advanced MPI Tutorial, P. Balaji, J. Dinan, T. Hoefler, R. Thakur, SC '13
[1] Supporting the Global Arrays PGAS Model Using MPI One-Sided Communication, J. Dinan, P. Balaji, S. Krishnamoorthy, V. Tipparaju. IPDPS 2012

[Figure: four processes, each with a private memory region and a public memory region; the public regions together form the global address space accessible through RMA.]
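For readers new to the interface, a minimal sketch of one-sided communication with active-target (fence) synchronization; this example is ours, not from the slides, and assumes at least two ranks:

    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, buf = 0, val = 42;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Expose one int per process as an RMA window. */
        MPI_Win_create(&buf, sizeof(int), sizeof(int), MPI_INFO_NULL,
                       MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);      /* open the epoch */
        if (rank == 0)
            /* One-sided: rank 0 alone specifies the whole transfer,
             * writing val into rank 1's window at displacement 0. */
            MPI_Put(&val, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Win_fence(0, win);      /* close the epoch; the put is complete */

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }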

SLIDE 3

MPI RMA Challenges

  • To ensure portable, well-defined behavior, programs must follow these rules:
    1. Operations must be synchronized using, e.g., lock/unlock or fence
    2. Communication operations are nonblocking; local buffers cannot be accessed until put/get/accumulate operations have completed
    3. Concurrent, conflicting operations are erroneous
    4. Local load/store updates conflict with remote accesses

  • The MPI-2 model is referred to as the "separate" memory model in MPI-3
  • The MPI-3 "unified" model relaxes some rules, so we are solving the harder problem

[Figure: separate memory model, with a public copy and a private copy of each window location; conflicting load/store combinations from the same source within the same epoch, or from different sources, are marked erroneous (X).]

SLIDE 4

A Bug Example Within an Epoch


1. MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
2. MPI_Get(&out, 1, MPI_INT, 0, 0, 1, MPI_INT, win);
3. if (out % 2 == 0)   /* bug: load of out while the get is still pending */
4.   out++;            /* bug: store to out while the get is still pending */
5. …
6. MPI_Win_unlock(0, win);
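A hedged fix (ours, not from the slides): close the epoch before touching out, since MPI_Win_unlock completes the pending MPI_Get:

    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
    MPI_Get(&out, 1, MPI_INT, 0, 0, 1, MPI_INT, win);
    MPI_Win_unlock(0, win);   /* the epoch ends; the get is now complete */
    if (out % 2 == 0)         /* safe: out is accessed only after unlock */
        out++;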
SLIDE 5

A Bug Example Across Processes


P0 (Origin Process)         P1 (Target Process)        P2 (Origin Process)
MPI_Barrier                 MPI_Barrier                MPI_Barrier
MPI_Win_lock(SHARED, P1)    …                          MPI_Win_lock(SHARED, P1)
MPI_Put(X)                  … (window location X)      MPI_Put(X)
MPI_Win_unlock(P1)          …                          MPI_Win_unlock(P1)
MPI_Barrier                 MPI_Barrier                MPI_Barrier

Both origin processes put to the same window location X on P1 inside concurrent shared-lock epochs, so the two puts conflict.
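A runnable sketch of this race (ours; run with at least 3 processes):

    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, x = 0;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Every rank exposes one int; only P1's window is targeted. */
        MPI_Win_create(&x, sizeof(int), sizeof(int), MPI_INFO_NULL,
                       MPI_COMM_WORLD, &win);

        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 0 || rank == 2) {
            MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
            /* Bug: both origins put to the same location inside
             * concurrent shared-lock epochs. */
            MPI_Put(&rank, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
            MPI_Win_unlock(1, win);
        }
        MPI_Barrier(MPI_COMM_WORLD);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }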

SLIDE 6

Previous Works


  • Bug detection for MPI one-sided programs
    • e.g., Marmot, [Pervez-EuroPVM/MPI'06], and Scalasca
    • Detect parameter errors, deadlocks, and performance bottlenecks
  • Shared-memory data race detection
    • e.g., Locksmith, Pacer, Eraser, and RaceTrack
    • Detect data races for shared-memory programs
    • Fine-grain analysis is not feasible for MPI programs
  • Need new techniques for bug detection in one-sided communication models

SLIDE 7

MC-Checker Highlights


  • MC-Checker is a new tool that detects memory consistency errors in MPI one-sided applications
    • First comprehensive approach to memory consistency errors in MPI one-sided communication
    • Incurs relatively low overhead (45.2% on average)
    • Requires no modification of source code
  • Data access DAG analysis technique
    • Applicable to a variety of one-sided communication models
    • Identifies bugs based on concurrency of accesses
    • Finds errors that did happen and that could have happened

SLIDE 8

Outline

  1. Motivation
  2. Bug Examples
  3. Main Idea
  4. Design and Implementation
  5. Evaluation
  6. Conclusion

SLIDE 9

MC-Checker Main Idea


  • Record the one-sided operations and local memory accesses, then check them against compatibility tables for memory consistency errors
  • Check bugs within an epoch:
    • Identify each epoch region (see the sketch below)
    • Check operations within the epoch against the compatibility table
  • Check bugs across processes:
    • Identify concurrent regions by matching synchronization calls
    • Check operations in the concurrent regions against the compatibility table
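A sketch of the epoch-identification step (ours; the event and function names are hypothetical, not MC-Checker's actual data structures): scan a per-process trace and delimit each epoch by its lock/unlock or fence pair.

    /* Hypothetical trace events recorded by the profiler. */
    typedef enum { EV_LOCK, EV_UNLOCK, EV_FENCE,
                   EV_PUT, EV_GET, EV_ACC, EV_LOAD, EV_STORE } EventKind;

    typedef struct { EventKind kind; /* plus target rank, address range */ } Event;

    /* Returns the index one past the end of the epoch opened at index i. */
    static int epoch_end(const Event *trace, int n, int i) {
        EventKind open = trace[i].kind;            /* EV_LOCK or EV_FENCE */
        for (int j = i + 1; j < n; j++) {
            if (open == EV_LOCK  && trace[j].kind == EV_UNLOCK) return j + 1;
            if (open == EV_FENCE && trace[j].kind == EV_FENCE)  return j + 1;
        }
        return n;   /* unterminated epoch: itself a synchronization error */
    }

Every pair of overlapping accesses inside the returned region is then checked against the within-epoch compatibility table shown later.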

SLIDE 10

Design of MC-Checker


[Diagram: MC-Checker architecture. Online profiling: the ST-Analyzer identifies relevant load/store accesses in the MPI application, and the Profiler records traces. Offline analysis: the DN-Analyzer checks the traces against the compatibility table (CP-Table) and produces a bug report.]

SLIDE 11

ST-Analyzer: Identify Relevant Memory Accesses


  • Profiling every memory load/store is very heavyweight
  • Perform static analysis to identify relevant memory accesses (see the example below):
    • Mark the variables and pointers that belong to the window buffers and to the buffers accessed by one-sided operations
    • Propagate the markers using pointer alias analysis
    • Propagate the markers across function calls involving pointers and references
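A hypothetical fragment (ours) showing why the markers must be propagated: the store in helper() goes through an alias of the window buffer and must still be instrumented.

    #include <mpi.h>

    static void helper(int *q) {
        q[0] = 5;   /* instrumented: this store may hit the window buffer */
    }

    void example(MPI_Comm comm) {
        int winBuf[4] = {0};
        MPI_Win win;

        /* winBuf backs an MPI window, so it is marked as relevant. */
        MPI_Win_create(winBuf, sizeof winBuf, sizeof(int), MPI_INFO_NULL,
                       comm, &win);

        int *p = &winBuf[2];   /* alias analysis propagates the marker to p */
        helper(p);             /* the marker follows the pointer argument   */

        MPI_Win_free(&win);
    }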

SLIDE 12

Profiler: Profiling Runtime Events

The Profiler intercepts and records events from the MPI application:
  • Datatype manipulation routines: MPI_Type_contiguous(), MPI_Type_struct(), …
  • MPI one-sided relevant routines: MPI_Win_create(), MPI_Win_fence(), MPI_Put(), …
  • Memory accesses to relevant variables: winBuf[2] = 5, tmp = winBuf[3], …
  • General synchronization routines: MPI_Barrier(), MPI_Bcast(), …
  • MPI basic support routines: MPI_Comm_rank(), …

SLIDE 13

DN-Analyzer: Memory Consistency


  • Memory consistency errors occur when conflicting operations are potentially concurrent during program execution
    • Conflicting operations: e.g., overlapping MPI_Put and MPI_Put
    • Happen concurrently: the operations are not ordered
  • a →hb b means a happens before b
    • Ordered by barrier, send/recv, etc.
  • a →co b means the memory effects of a are visible before b
    • Memory updates are synchronized by unlock, fence, etc.

SLIDE 14

DN-Analyzer: DAG Analysis Technique


  • Capture the dynamic execution and convert it to a data access DAG
    • Edges capture ordering and concurrency of accesses
  • Identifies logical concurrency: bugs that happened and that could have happened
  • General analysis technique for one-sided and PGAS models

[Figure: (A) per-process traces for P0, P1, and P2, consisting of barriers, shared-lock epochs containing Put(P1, X) and Get(P1, X), and local stores to X; (B) the resulting data access DAG, whose edges (a, b, c, d, e) capture the ordering between operations.]
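A sketch of the core concurrency query over the DAG (ours; the node layout is hypothetical): two operations are logically concurrent iff neither reaches the other along happens-before edges, which a depth-first search can decide.

    /* Hypothetical DAG node: one RMA or load/store event plus hb-edges. */
    typedef struct Node {
        struct Node **succ;   /* happens-before successors */
        int nsucc;
        int visited;          /* scratch flag; clear before each query */
    } Node;

    /* Is there a happens-before path from a to b? */
    static int reaches(Node *a, Node *b) {
        if (a == b) return 1;
        if (a->visited) return 0;
        a->visited = 1;
        for (int i = 0; i < a->nsucc; i++)
            if (reaches(a->succ[i], b)) return 1;
        return 0;
    }

    /* a and b are potentially concurrent iff neither happens before the
     * other; conflicting + concurrent accesses are reported as bugs.
     * (Clear all visited flags before each concurrent() query.) */
    static int concurrent(Node *a, Node *b) {
        return !reaches(a, b) && !reaches(b, a);
    }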

SLIDE 15

DN-Analyzer: Within an Epoch


2nd \ 1st | Load | Store | Get  | Put/Acc
----------+------+-------+------+--------
Load      | BOTH | BOTH  | NOVL | BOTH
Store     | BOTH | BOTH  | NOVL | NOVL
Get       | BOTH | BOTH  | NOVL | NOVL
Put/Acc   | BOTH | BOTH  | NOVL | BOTH

(rows: the later access in the epoch; columns: the earlier access)

Epoch region (lines 1-6):

1. MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
2. MPI_Get(&out, 1, MPI_INT, 0, 0, 1, MPI_INT, win);
3. if (out % 2 == 0)   /* bug (overlapping): load of out while the get is pending */
4.   out++;            /* bug (overlapping): store to out while the get is pending */
5. …
6. MPI_Win_unlock(0, win);

The get on line 2 is the earlier access to out; the load (line 3) and store (line 4) overlap it and fall in the NOVL cells of the Get column above, so both are flagged as bugs.

SLIDE 16

DN-Analyzer: Across Processes


  • Compatibility matrix of RMA operations
    • BOTH: overlapping and non-overlapping combinations of the given operations are permitted
    • NOVL: only non-overlapping combinations are permitted
    • X: the combination is erroneous

Load Store Get Put Acc Load BOTH BOTH BOTH NOVL NOVL Store BOTH BOTH NOVL X X Get BOTH NOVL BOTH NOVL NOVL Put NOVL X NOVL NOVL NOVL Acc NOVL X NOVL NOVL BOTH

SLIDE 17

DN-Analyzer: Across Processes


[Figure: matching synchronization calls across processes. Barriers on P0, P1, and P2 are matched to delimit concurrent regions. In the first region, P0's Put(P1, X) and P2's Put(P1, X) conflict; in the second, P0's Get(P1, X) and P1's local store(X) conflict. Both pairs are flagged as bugs.]

SLIDE 18

Outline

  1. Motivation
  2. Bug Examples
  3. Main Idea
  4. Design and Implementation
  5. Evaluation
  6. Conclusion

SLIDE 19

Evaluation Methodology


  • Hardware
    • Glenn cluster at the Ohio Supercomputer Center
    • 658 compute nodes
    • 2.5 GHz quad-core Opteron CPU per node
    • 24 GB RAM and 393 GB local disk per node
  • Software
    • Compiler: modified LLVM to annotate load/store operations of interest
    • OS: Linux 2.6.18
    • MPI library: MPICH2
  • Evaluation
    • Effectiveness: 3 real-world and 2 injected bug cases
    • Overhead: 5 benchmarks

SLIDE 20

Bug Cases


MPI Application | Bug ID  | Bug Location     | Mode
----------------+---------+------------------+--------
emulate         | 04/2011 | within an epoch  | passive
BT-broadcast    | 06/2004 | within an epoch  | active
lockopts        | r10308  | across processes | passive
pingpong-inj    | 3.0.3   | across processes | passive
jacobi-inj      | 09/2008 | across processes | active

  • 3 real-world and 2 injected bug cases from 5 MPI applications

SLIDE 21

Effectiveness


MPI App      | Bug ID  | Detected? | Root Cause? | Error Location   | Conflicting Operations | Failure Symptom  | # of Processes
-------------+---------+-----------+-------------+------------------+------------------------+------------------+---------------
emulate      | 04/2011 | Yes       | Yes         | within an epoch  | get and load/store     | incorrect result | 2
BT-broadcast | 06/2004 | Yes       | Yes         | within an epoch  | get and load           | program hang     | 2
lockopts     | r10308  | Yes       | Yes         | across processes | put/get and load/store | incorrect result | 64
pingpong-inj | 3.0.3   | Yes       | Yes         | across processes | put and put            | incorrect result | 64
jacobi-inj   | 09/2008 | Yes       | Yes         | across processes | put and get            | incorrect result | 64

  • MC-Checker detected and pinpointed the root cause of all 5 bug cases

SLIDE 22

Runtime Overhead


  • Runtime overhead is low, ranging from 24.6% to 71.1%, with an average of 45.2%

[Chart: normalized execution time, native vs. MC-Checker, for Lennard-Jones, SCF, boltzmann, SKaMPI, and LU.]

SLIDE 23

Scalability of Overheads


  • The runtime overhead decreases from 147.2% to 37.1% as the number of processes increases from 8 to 128

[Chart: native execution time (sec) and MC-Checker percent overhead for 8, 16, 32, 64, and 128 MPI processes; the relative overhead falls as the process count grows.]

SLIDE 24

Conclusion


  • MC-Checker
    • Detects memory consistency errors in MPI one-sided applications
    • Detects and locates the root causes of the bugs
    • Incurs low runtime overhead
  • Happens-before analysis identifies concurrency bugs
  • Tools that enable debugging of one-sided applications are important in helping users overcome their complexity

SLIDE 25


Thanks!
