Scalable Critical Path Analysis for Hybrid MPI-CUDA Applications


SLIDE 1

Center for Information Services and High Performance Computing (ZIH)

Scalable Critical Path Analysis for Hybrid MPI-CUDA Applications

The Fourth International Workshop on Accelerators and Hybrid Exascale Systems, May 19th 2014

Felix Schmitt, Robert Dietrich, Guido Juckeland

SLIDE 2

Outline

1. Motivation
2. CUDA Dependency Patterns
3. MPI-CUDA Critical Path Analysis
4. Use Cases
5. Outlook and Conclusion

SLIDE 3

Motivation CUDA Dependency Patterns MPI-CUDA Critical Path Analysis Use Cases Outlook and Conclusion

SLIDE 4

Motivation

• CUDA established for using general-purpose graphics-processing units in HPC [1]
• Increasing complexity of hybrid HPC programs requires sophisticated performance-analysis tools
• Problem: no current tool for automated analysis of execution dependencies in MPI-CUDA programs
  • Scalasca: scalable MPI critical-path analysis
  • HPCToolkit: MPI-CUDA profiling, no intra-device dependencies
  • NVIDIA Visual Profiler: CUDA optimization guidance, no MPI

SLIDE 5

Goals

• Guide the developer to optimization targets in hybrid MPI-CUDA programs
• Scalable critical-path analysis based on trace files
• Analyze host/device and device/device dependencies and inefficiencies
• Visualize analysis results in Vampir
• Order activities by their potential optimization influence

SLIDE 6

Preliminaries: Wait-State Analysis

• Event stream: a stream of ordered events, e.g. an MPI process or a CUDA stream
• Wait state: a time period during which an event stream is blocked [2]; the result of inter-stream dependencies and load imbalances
• Blame (HPCToolkit) or cost of idleness (Scalasca): attributed to the cause of a wait state

[Figure: Examples of MPI wait states on three processes: (A) late receiver, (B) late sender, (C) load imbalance at MPI_Barrier.]
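
To make the wait-state and blame terminology concrete, here is a minimal sketch (not from the slides) that derives the waiting time for the late-sender case above and attributes it as blame to the sender; the Region record and the timestamps are illustrative assumptions, not the trace format used by the tool.

```cpp
// Late-sender example: the receiver enters MPI_Recv before the matching
// MPI_Send starts, so the receiver accumulates a wait state whose blame is
// attributed to the sender's preceding work. Timestamps are illustrative.
#include <algorithm>
#include <cstdio>

struct Region {
    const char* name;
    double enter, leave;  // timestamps in seconds
};

int main() {
    Region recv = {"MPI_Recv", 1.0, 4.2};  // posted early
    Region send = {"MPI_Send", 4.0, 4.2};  // reached late

    // The receiver is blocked until the matching send begins.
    double waitState = std::max(0.0, send.enter - recv.enter);

    // Blame / cost of idleness: charged to the cause of the wait state,
    // here the sender's computation that delayed MPI_Send.
    std::printf("wait state on the receiver: %.2f s\n", waitState);
    std::printf("blame attributed to the sender: %.2f s\n", waitState);
    return 0;
}
```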

SLIDE 7

Preliminaries: Critical Path

• Event dependency graph (EDG): a directed acyclic graph
• Nodes are the events of the parallel event streams
• Edges model the happens-before relationship and are weighted with the duration between events [3]

[Figure: EDG for a simple MPI example (MPI_Init, MPI_Send/Recv, MPI_Finalize) on two processes, with enter (E) and leave (L) events and wait states marked between tstart and tend.]
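
As an illustration of the definitions above, a compact C++ sketch of an event dependency graph; the data layout (EventNode, Edge) is an assumption chosen for clarity, not the tool's internal representation.

```cpp
// Sketch of an event dependency graph (EDG): a DAG whose nodes are
// enter/leave events and whose edges are weighted with the duration
// between happens-before-ordered events. Illustrative layout only.
#include <cstddef>
#include <string>
#include <vector>

struct EventNode {
    std::string region;   // e.g. "MPI_Send", "Kernel A"
    bool isEnter;         // enter or leave event
    int stream;           // MPI rank or CUDA stream id
    double timestamp;
    bool isWaitState;     // wait states are excluded from the critical path
};

struct Edge {
    std::size_t from, to; // indices into the node vector
    double weight;        // timestamp difference between the two events
};

struct EventDependencyGraph {
    std::vector<EventNode> nodes;
    std::vector<Edge> edges;

    // Intra-stream ordering and inter-stream dependencies (e.g. a send enter
    // happening before the matching receive leave) both become edges.
    void addEdge(std::size_t from, std::size_t to) {
        edges.push_back({from, to, nodes[to].timestamp - nodes[from].timestamp});
    }
};

int main() {
    EventDependencyGraph g;
    g.nodes = {{"MPI_Send", true, 0, 1.0, false},   // process 0: enter MPI_Send
               {"MPI_Send", false, 0, 1.5, false},  // process 0: leave MPI_Send
               {"MPI_Recv", false, 1, 1.6, false}}; // process 1: leave MPI_Recv
    g.addEdge(0, 1);  // intra-stream edge, weight 0.5
    g.addEdge(1, 2);  // inter-stream edge: send leave happens before recv leave
    return 0;
}
```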

SLIDE 8

Preliminaries: Critical Path (2)

• Critical path [4]: the longest path in an EDG that contains no wait states
• Optimizing activities on this path can reduce the execution time
• Optimizing other activities cannot (directly)

[Figure: The same EDG with the critical path highlighted; optimizing an activity that is not on the critical path only increases a wait state, while optimizing a critical activity reduces one.]
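
Because an EDG is a DAG, the critical path can be sketched as a longest-path computation that skips wait states. The serial C++ function below is a simplified stand-in under that assumption; it is not the parallel analysis the tool actually performs.

```cpp
// Sketch: the critical path is the longest path through the EDG that avoids
// wait states. Assumes node indices follow a topological (timestamp) order.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Node { double timestamp; bool isWaitState; };
struct Edge { int from, to; double weight; };

std::vector<int> criticalPath(const std::vector<Node>& nodes, std::vector<Edge> edges) {
    std::sort(edges.begin(), edges.end(),
              [](const Edge& a, const Edge& b) { return a.from < b.from; });
    std::vector<double> dist(nodes.size(), 0.0);
    std::vector<int> pred(nodes.size(), -1);
    for (const Edge& e : edges) {
        if (nodes[e.to].isWaitState) continue;   // wait states are never critical
        if (dist[e.from] + e.weight > dist[e.to]) {
            dist[e.to] = dist[e.from] + e.weight;
            pred[e.to] = e.from;
        }
    }
    // Trace back from the node reached with the largest accumulated duration.
    int cur = int(std::max_element(dist.begin(), dist.end()) - dist.begin());
    std::vector<int> path;
    for (; cur != -1; cur = pred[cur]) path.push_back(cur);
    std::reverse(path.begin(), path.end());
    return path;
}

int main() {
    // Tiny example: node 2 is a wait state and is therefore skipped.
    std::vector<Node> nodes = {{0.0, false}, {2.0, false}, {2.0, true}, {3.0, false}};
    std::vector<Edge> edges = {{0, 1, 2.0}, {0, 2, 0.5}, {1, 3, 1.0}, {2, 3, 1.0}};
    for (int v : criticalPath(nodes, edges)) std::printf("node %d\n", v);
    return 0;
}
```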

SLIDE 9

CUDA Wait-State Analysis

• Create a dependency/wait-state model for CUDA
• Two activity kinds: host (API calls) and device (kernels, memcpys)
• New categorization of CUDA inefficiency patterns:
  • Blocking synchronization
  • Non-blocking synchronization
  • Late synchronization
  • Inter-stream dependencies
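
To make the blocking-synchronization pattern concrete, here is a small self-contained CUDA example: the host launches a kernel asynchronously and then blocks in a stream synchronization, so the synchronization interval becomes host waiting time caused by the kernel. It uses the CUDA runtime API (the traces in the talk show driver-API calls such as cuStreamSync), and the kernel and sizes are made up for illustration.

```cpp
// Blocking synchronization: the host blocks in cudaStreamSynchronize until the
// kernel on the referenced stream has finished, so the synchronization interval
// is host waiting time caused by the kernel. Illustrative example only.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busyKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < 10000; ++k) x = x * 1.000001f + 0.5f;  // artificial work
        data[i] = x;
    }
}

int main() {
    const int n = 1 << 20;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Asynchronous launch: the host returns immediately.
    busyKernel<<<(n + 255) / 256, 256, 0, stream>>>(d, n);

    // Blocking synchronization: in the wait-state model, this call becomes a
    // wait state on the host stream and the kernel receives the blame.
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d);
    std::printf("done\n");
    return 0;
}
```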

SLIDE 10

Rule-Based Pattern Detection

[Figure: BlameKernelRule applied to two event streams. Event stream 1 issues cuLaunch(A) and then blocks in cuStreamSync referencing event stream 2, which runs Kernel A. Steps: (1) apply the rule to the synchronization's exit node; (2) find the kernel exit on the referenced event stream within the synchronization duration; (3a) create a dependency edge; (3b) mark cuStreamSync as a blocking wait state.]

BlameKernelRule: identifies blocking synchronization that is delayed by device activities.
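
Below is a hedged sketch of what a rule like BlameKernelRule could look like, reconstructed from the steps in the figure; the Activity and DependencyEdge structures and the blame formula are assumptions, not the tool's actual interface.

```cpp
// Sketch of a BlameKernelRule-style check, following the steps in the figure:
// (1) the rule is applied at the leave node of a synchronization call,
// (2) a kernel leave on the referenced stream is searched within the sync
//     interval, (3a) a dependency edge is added, (3b) the synchronization is
//     marked as a blocking wait state and the kernel is blamed.
#include <vector>

struct Activity {
    int stream;            // CUDA stream id (or host API stream)
    double enter, leave;
    bool isKernel;
    bool isWaitState = false;
    double blame = 0.0;
};

struct DependencyEdge { Activity* from; Activity* to; };

// sync: a cuStreamSync-like host activity that references 'refStream'.
bool applyBlameKernelRule(Activity& sync, int refStream,
                          std::vector<Activity>& deviceActivities,
                          std::vector<DependencyEdge>& edges) {
    for (Activity& k : deviceActivities) {
        // (2) kernel on the referenced stream finishing inside the sync interval
        if (k.isKernel && k.stream == refStream &&
            k.leave > sync.enter && k.leave <= sync.leave) {
            edges.push_back({&k, &sync});     // (3a) dependency edge kernel -> sync
            sync.isWaitState = true;          // (3b) blocking wait state
            k.blame += k.leave - sync.enter;  // blame the delaying kernel
            return true;
        }
    }
    return false;
}
```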

SLIDE 11

Motivation CUDA Dependency Patterns MPI-CUDA Critical Path Analysis Use Cases Outlook and Conclusion

SLIDE 12

Critical Sub-Paths

• Combine MPI and CUDA critical-path analysis
• MPI critical path detected using Scalasca's parallel reverse replay [5]
• The global CUDA critical path is dominated by the MPI critical path
✦ Determine critical sub-paths to efficiently and concurrently compute CUDA critical paths using OpenMP (see the sketch after the figure below)

[Figure: Event streams of an MPI-CUDA run. The MPI operations on the MPI critical path (MPI_Send, MPI_Recv, MPI_Barrier) bound the critical sub-paths, which contain the kernel launches, cuStreamSync calls, and kernels in between.]
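
The sketch below illustrates the idea of analyzing critical sub-paths concurrently with OpenMP: because each sub-path is bounded by consecutive events of the MPI critical path, the sections are independent and can be processed in parallel. The types and the per-section analysis function are placeholders, not the tool's real code.

```cpp
// Sketch: the sections between consecutive events of the MPI critical path
// ("critical sub-paths") are independent, so their CUDA critical paths can be
// computed concurrently with OpenMP. Types and analysis are placeholders.
#include <omp.h>
#include <vector>

struct SubPathSection { double start, end; int process; };
struct SubPathResult  { double criticalTime; };

// Placeholder for the per-section CUDA dependency / critical-path analysis.
SubPathResult analyzeCudaSubPath(const SubPathSection& s) {
    return {s.end - s.start};
}

std::vector<SubPathResult> analyzeAll(const std::vector<SubPathSection>& sections) {
    std::vector<SubPathResult> results(sections.size());
    // The sections do not overlap, so each iteration is independent.
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(sections.size()); ++i) {
        results[i] = analyzeCudaSubPath(sections[i]);
    }
    return results;
}
```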

SLIDE 13

Visualization in Vampir

Vampir and VampirServer enable scalable visualization of hybrid applications, including timelines, profiles, message and data transfers, and performance counters.

SLIDE 14

Visualization in Vampir (2)

(A) Counter Overlay: blocking memory copy (implicit synchronization)
(B) Counter Timeline: the synchronized kernel is attributed blame
(C) Counter Timeline: blocking synchronization is marked as waiting time

SLIDE 15

Activity Optimization Order

Goal: rank activity types by their potential influence
• Create an optimization order for activity types by adding:
  • the normalized fraction of the total critical-path duration (direct runtime impact)
  • the normalized fraction of the total blame (load-balancing impact)
✦ The highest-rated activities are the best optimization candidates
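
A small sketch of the rating described above: each activity type's rating is the sum of its normalized critical-path fraction and its normalized blame fraction, and the types are sorted by that rating. The field names are assumptions; the example values are taken from the Jacobi table on a later slide.

```cpp
// Rank activity types by rating = (critical-path share) + (blame share).
// Percentages are shares of the total critical-path time and the total blame.
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

struct ActivityType {
    std::string name;
    double criticalPathPct;  // share of total critical-path duration [%]
    double blamePct;         // share of total blame [%]
};

int main() {
    // Example values from the Jacobi use case (90% of the work offloaded).
    std::vector<ActivityType> types = {
        {"jacobi_kernel", 40.69, 35.34},
        {"cuMemcpyDtoH_v2", 30.10, 5.60},
        {"MPI_Barrier", 0.0, 35.62},
        {"copy_kernel", 5.04, 9.59},
    };

    // rating = normalized critical-path fraction + normalized blame fraction
    auto rating = [](const ActivityType& t) {
        return t.criticalPathPct / 100.0 + t.blamePct / 100.0;
    };
    std::sort(types.begin(), types.end(),
              [&](const ActivityType& a, const ActivityType& b) {
                  return rating(a) > rating(b);
              });

    for (const ActivityType& t : types)
        std::printf("%-16s rating %.4f\n", t.name.c_str(), rating(t));
    return 0;
}
```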

SLIDE 16

Motivation CUDA Dependency Patterns MPI-CUDA Critical Path Analysis Use Cases Outlook and Conclusion

SLIDE 17

Correctness: Jacobi Method

An MPI+CUDA application (two processes, one CUDA stream each) that executes two kernels in each iteration.

[Figure: Section of a trace in Vampir showing the two kernels, jacobi_kernel and copy_kernel, with 10% versus 90% of the work offloaded to the GPU.]
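
For orientation, a hedged sketch of what one such iteration could look like on a single MPI process: launch jacobi_kernel and copy_kernel on one stream, block in a stream synchronization, then exchange a halo row over MPI. Kernel bodies, the halo exchange, and all names except jacobi_kernel and copy_kernel are assumptions, not the benchmark's actual code.

```cpp
// Hedged sketch of one Jacobi iteration on one MPI process: two kernels on a
// single CUDA stream, a blocking stream synchronization, then a halo exchange
// over MPI. Kernel bodies and the exchange are simplified assumptions.
#include <cuda_runtime.h>
#include <mpi.h>

__global__ void jacobi_kernel(const float* in, float* out, int nx, int ny) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i > 0 && i < nx - 1 && j > 0 && j < ny - 1)
        out[j * nx + i] = 0.25f * (in[j * nx + i - 1] + in[j * nx + i + 1] +
                                   in[(j - 1) * nx + i] + in[(j + 1) * nx + i]);
}

__global__ void copy_kernel(const float* src, float* dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];
}

void jacobiIteration(float* d_in, float* d_out, float* h_halo,
                     int nx, int ny, cudaStream_t stream, MPI_Comm comm) {
    dim3 block(16, 16), grid((nx + 15) / 16, (ny + 15) / 16);
    jacobi_kernel<<<grid, block, 0, stream>>>(d_in, d_out, nx, ny);
    copy_kernel<<<(nx * ny + 255) / 256, 256, 0, stream>>>(d_out, d_in, nx * ny);

    // Blocking synchronization: at a high offloading ratio the host spends most
    // of its time waiting here, which is why the kernels become critical.
    cudaStreamSynchronize(stream);

    // Simplified halo exchange with the neighboring rank (two processes).
    int rank = 0, size = 1;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int neighbor = (rank + 1) % size;
    cudaMemcpy(h_halo, d_in, nx * sizeof(float), cudaMemcpyDeviceToHost);
    MPI_Sendrecv_replace(h_halo, nx, MPI_FLOAT, neighbor, 0, neighbor, 0,
                         comm, MPI_STATUS_IGNORE);
    cudaMemcpy(d_in, h_halo, nx * sizeof(float), cudaMemcpyHostToDevice);
}
```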

SLIDE 18

Correctness: Jacobi Method (2)

Analysis result in Vampir's performance radar (timeline overlay): CUDA kernels become critical activities (red) for a high GPU offloading ratio due to blocking synchronization.

SLIDE 19

Correctness: Jacobi Method (3)

Activity (all instances)    Critical Path [%]    Blame [%]    Rating
jacobi_kernel               40.69                35.34        0.7603
cuMemcpyDtoH_v2             30.10                 5.60        0.3570
MPI_Barrier                 ~0                   35.62        0.3562
copy_kernel                  5.04                 9.59        0.1463
MPI_Allreduce               ~0                   12.78        0.1278
cuMemcpyHtoD_v2             10.15                 0.00        0.1015

Activity optimization order for 90% of the work offloaded to the GPU.

SLIDE 20

Scalability: HPL CUDA

[Figure: Execution time [s] of the HPL CUDA benchmark and of the analysis tool for 2, 4, 8, 16, and 32 MPI processes.]

Scalability of the HPL CUDA version and of the analysis: combining the MPI parallel replay with the CUDA dependency analysis still scales with the MPI operations of the input trace.

1 MPI process per node, NVIDIA K20X GPUs

SLIDE 21

Motivation CUDA Dependency Patterns MPI-CUDA Critical Path Analysis Use Cases Outlook and Conclusion

SLIDE 22

Conclusion

Contributions:
• Comprehensive dependency model for CUDA activities
• Scalable tool for critical-path analysis of MPI-CUDA traces
• Identifies waiting time and the causing activities
• Visualization of all metrics in Vampir
• Generates a list of optimization targets, ordered by potential influence

SLIDE 23

Future Work

• Extend support to applications including OpenMP, CUDA, and MPI (prototype available)
• Evaluate the use of hardware performance counters during optimization guidance
  ✦ Which activities are easier to optimize?
• General CPU functions are missing in this implementation (added in the prototype)

Thank you for your attention! Questions?

SLIDE 24

References

[1] Wu-chun Feng and Kirk W. Cameron. The Green500 List - November 2013. http://www.green500.org/lists/green201311, November 2013.

[2] Wagner Meira, Thomas J. LeBlanc, and Virgilio A. F. Almeida. Using cause-effect analysis to understand the performance of distributed programs. In Proceedings of the SIGMETRICS Symposium on Parallel and Distributed Tools (SPDT '98), pages 101–111, New York, NY, USA, 1998. ACM.

[3] Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Commun. ACM, 21(7):558–565, July 1978.

[4] C.-Q. Yang and B. P. Miller. Critical path analysis for the execution of parallel and distributed programs. In Proceedings of the 8th International Conference on Distributed Computing Systems, pages 366–373, 1988.

[5] David Bohme, Felix Wolf, and Markus Geimer. Characterizing Load and Communication Imbalance in Large-Scale Parallel Applications. In Parallel and Distributed Processing Symposium Workshops and PhD Forum (IPDPSW), 2012 IEEE 26th International, pages 2538–2541, 2012.