Incremental Migration of C and Fortran Applications to GPGPU using HMPP

Peppher 2011


SLIDE 1 — Incremental Migration of C and Fortran Applications to GPGPU using HMPP (Peppher 2011)

SLIDE 2 — Introduction
• Many applications can benefit from GPU computing
  • Linear algebra, signal processing
  • Bioinformatics, molecular dynamics
  • Magnetic resonance imaging, tomography
  • Reverse time migration, electrostatics
• Porting legacy codes to GPU computing is a major challenge
  • It can be very expensive
  • Porting risks must be minimized
  • It should be based on a future-proof approach
  • It requires application and performance programmers to cooperate
• A good methodology is paramount to reduce porting cost
• HMPP provides an efficient solution


SLIDE 3 — What is HMPP? (Hybrid Manycore Parallel Programming)
• A directive-based multi-language programming environment
  • Helps keep software independent from hardware targets
  • Provides an incremental path to exploit GPUs in legacy applications
  • Avoids exit cost; can be a future-proof solution
• HMPP provides
  • Code generators from C and Fortran to GPU (CUDA or OpenCL)
  • A compiler driver that handles all low-level details of GPU compilers
  • A runtime to allocate and manage GPU resources
• HMPP is a source-to-source compiler
  • CPU code does not require a compiler change
  • Complements existing parallel APIs (OpenMP or MPI)


SLIDE 4 — HMPP Main Design Considerations
• Focus on the main bottleneck
  • Communication between GPUs and CPUs
• Allow incremental development
  • Up to full access to the hardware features
• Work with other parallel APIs (e.g. OpenMP, MPI)
  • Orchestrate CPU and GPU computations
• Consider multiple languages
  • Avoid asking users to learn a new language
• Consider resource management
  • Generate robust software
• Exploit vendor tools/compilers
  • Do not replace them, complement them


SLIDE 5 — How Does HMPP Differ from CUDA or OpenCL?
• The HMPP parallel programming model is parallel-loop centric
• The CUDA and OpenCL parallel programming models are thread centric


HMPP version (parallel-loop centric):

  void saxpy(int n, float alpha, float *x, float *y){
    #pragma hmppcg parallel
    for(int i = 0; i<n; ++i)
      y[i] = alpha*x[i] + y[i];
  }

CUDA version (thread centric):

  __global__ void saxpy_cuda(int n, float alpha, float *x, float *y) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if(i<n) y[i] = alpha*x[i]+y[i];
  }

  int nblocks = (n + 255) / 256;
  saxpy_cuda<<<nblocks, 256>>>(n, 2.0, x, y);


SLIDE 6 — HMPP Codelets and Regions
• A codelet is a pure function that can be remotely executed on a GPU
• Regions are a shortcut for writing codelets


Codelet:

  #pragma hmpp myfunc codelet, …
  void saxpy(int n, float alpha, float x[n], float y[n]) {
    #pragma hmppcg parallel
    for(int i = 0; i<n; ++i)
      y[i] = alpha*x[i] + y[i];
  }

Region:

  #pragma hmpp myreg region, …
  {
    for(int i = 0; i<n; ++i)
      y[i] = alpha*x[i] + y[i];
  }


SLIDE 7 — Codelet Target Clause
• The target clause specifies what GPU code to generate
  • GPU can be CUDA or OpenCL
• The implementation chosen at runtime can be different
  • The runtime selects among the available hardware and code


  #pragma hmpp myLabel codelet, target=[GPU], args[C].io=out
  void myFunc( int n, int A[n], int B[n], int C[n]){
    ...
  }

  #pragma hmpp myLabel codelet, target=CUDA    /* NVIDIA GPUs only */
  #pragma hmpp myLabel codelet, target=OpenCL  /* NVIDIA & AMD GPUs, AMD CPUs */


SLIDE 8 — HMPP Codelet Arguments
• The arguments of a codelet are also allocated in GPU device memory
  • They must exist on both sides to allow backup (CPU fallback) execution
  • There is no hardware mechanism to ensure consistency
  • Their size must be known to perform the data transfers
• Transfer directions are defined by the io clause (in Fortran, use intent instead)
  • in (default): read only in the codelet
  • out: completely defined, no read before a write
  • inout: read and written
  • Using an inappropriate inout generates extra PCI bus traffic


  #pragma hmpp myLabel codelet, args[B].io=out, args[C].io=inout
  void myFunc( int n, int A[n], int B[n], int C[n]){
    for( int i=0 ; i<n ; ++i){
      B[i] = A[i] * A[i];
      C[i] = C[i] * A[i];
    }
  }


SLIDE 9 — Running a Codelet or Section on a GPU - 1
• The callsite directive specifies the use of a codelet at a given point in the application
• The callsite directive performs a Remote Procedure Call onto the GPU


  #pragma hmpp call1 codelet, target=CUDA
  #pragma hmpp call2 codelet, target=OpenCL
  void myFunc(int n, int A[n], int B[n]){
    int i;
    for (i=0 ; i<n ; ++i)
      B[i] = A[i] + 1;
  }

  int main(void) {
    int X[10000], Y[10000], Z[10000];
    …
    #pragma hmpp call1 callsite, …
    myFunc(10000, X, Y);
    ...
    #pragma hmpp call2 callsite, …
    myFunc(1000, Y, Z);
    …
  }


SLIDE 10 — Running a Codelet or Section on a GPU - 2
• By default, a CALLSITE directive implements the whole Remote Procedure Call (RPC) sequence
• An RPC sequence consists of 5 steps:
  (1) Allocate the GPU and the memory
  (2) Transfer the input data: CPU => GPU
  (3) Compute
  (4) Transfer the output data: GPU => CPU
  (5) Release the GPU and the memory


[Diagram: the five RPC steps — (1) Allocate GPU, (2) Transfer IN data, (3) GPU Compute, (4) Transfer OUT data, (5) Release GPU — with a CPU Compute path as fallback.]
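To make the implicit sequence concrete, here is a minimal sketch (hypothetical codelet name sq; the codelet and callsite syntax follows the examples elsewhere in this deck) where a single bare CALLSITE triggers all five steps:

  #pragma hmpp sq codelet, target=CUDA, args[B].io=out
  void sq(int n, float A[n], float B[n]){
    #pragma hmppcg parallel
    for (int i = 0; i < n; ++i)
      B[i] = A[i] * A[i];   /* step (3): the compute runs on the GPU */
  }

  int main(void){
    float X[10000], Y[10000];
    /* ... initialize X ... */
    /* This single callsite implies: (1) allocate the GPU and memory,
       (2) transfer X: CPU => GPU, (3) compute, (4) transfer Y: GPU => CPU,
       (5) release the GPU and the memory. */
    #pragma hmpp sq callsite
    sq(10000, X, Y);
    return 0;
  }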


SLIDE 11 — Tuning Hybrid Codes
• Tuning hybrid code consists in
  • Reducing the penalty when allocating and releasing GPUs
  • Reducing data transfer time
  • Optimizing the performance of the GPU kernels
  • Using CPU cores in parallel with the GPU
• HMPP provides a set of directives to address these optimizations
• The objective is to get efficient CPU and GPU computations


SLIDE 12 — Reducing Data Transfers between CPUs and GPUs
• Hybrid code performance is very sensitive to the amount of CPU-GPU data transfers
  • The PCIe bus is a serious bottleneck (< 10 GB/s vs ~150 GB/s of GPU memory bandwidth)
• Various techniques help:
  • Reduce data transfer occurrences
  • Share data on the GPU between codelets
  • Map codelet arguments to the same GPU space (see the sketch after this list)
  • Perform partial data transfers
• Warning: dealing with two address spaces may introduce inconsistencies
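A sketch of mapping two codelet arguments to the same GPU space. Caveat: the group and codelet syntax follows slide 37, but the map directive itself is recalled from HMPP documentation of the period — its exact spelling is an assumption, not shown in this deck:

  #pragma hmpp <grp> group, target=CUDA

  #pragma hmpp <grp> step1 codelet, args[b].io=out
  void step1(int n, float a[n], float b[n]) { /* ... produce b ... */ }

  #pragma hmpp <grp> step2 codelet, args[c].io=out
  void step2(int n, float b[n], float c[n]) { /* ... consume b ... */ }

  /* Assumed syntax: step1's argument b and step2's argument b share the
     same GPU buffer, so the intermediate array never crosses the PCI bus. */
  #pragma hmpp <grp> map, args[step1::b;step2::b]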


SLIDE 13 — Tuning GPU Kernels
• GPU kernel tuning consists in setting up parallel loops suited to GPU architectures
• Multiple issues must be addressed (see the sketch after this list):
  • Memory accesses
  • Thread grid tuning
  • Register usage tuning
  • Shared memory usage
  • Removing control flow divergence
• In many cases, the CPU code structure conflicts with an efficient GPU code structure
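For the memory-access point, a minimal sketch (hypothetical kernel, not from the deck): on the GPU, the contiguous array dimension should be traversed by the innermost parallel loop so that consecutive threads touch consecutive memory (coalescing). The nested hmppcg parallel annotations follow the deck's loop-centric style:

  #pragma hmpp scale codelet, target=CUDA, args[out].io=out
  void scale(int m, int n, float in[m][n], float out[m][n], float s){
    #pragma hmppcg parallel
    for (int i = 0; i < m; ++i) {     /* rows: outer parallel dimension */
      #pragma hmppcg parallel
      for (int j = 0; j < n; ++j) {   /* j is contiguous in C: coalesced */
        out[i][j] = s * in[i][j];
      }
    }
  }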


SLIDE 14 — Methodology to Port Applications
• Prerequisites
  • Understand your performance goal
    • Memory bandwidth needs are a good potential performance indicator
  • Know your hotspots
    • Beware of Amdahl's law (see the note after this list)
  • Ensure you know how to validate the output of your application
    • Rounding may differ on GPUs
• Determine whether your goal can be achieved
  • How many CPUs and GPUs are necessary?
  • Are there similar existing codes for GPUs (in CUDA, OpenCL or HMPP)?
• Define an incremental approach
  • Check the results at each step
• Use a two-phase approach
  • Phase 1: application programmers validate the computed results
  • Phase 2: performance programmers focus on GPU code tuning and data transfer reduction
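A one-line reminder of Amdahl's law (standard result, added here, not from the deck): if a fraction p of the runtime is accelerated by a factor s, the overall speedup is

  $$ S = \frac{1}{(1-p) + p/s} \;<\; \frac{1}{1-p} $$

For example, when kernels cover only p = 0.7 of the execution time, as in the SPH code of slide 29, the overall speedup is bounded by 1/0.3 ≈ 3.3x, no matter how fast the GPU kernels run.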


SLIDE 15 — Methodology to Port Applications

A corporate project, involving the Purchasing Department, Scientists, and the IT Department.

• Define your parallel project (hours to days)
  • Understand your performance goal (analysis, definition and achievement)
  • Know your hotspots (analysis, code reorganization, hotspot selection)
  • Establish a validation process
  • Set up a continuous integration process with the validation
• Phase 1 — Port your application (days to weeks): hotspots and parallelization
  • Exhibit the application's SIMT parallelism
  • Push application hotspots onto the GPU
  • Validate CPU-GPU execution
  • Milestone: GPGPU operational application with known potential
• Phase 2 — Optimize your GPGPU application (weeks to months): tuning
  • Exploit CPU and GPU together
  • Reduce CPU-GPU data transfers
  • Optimize GPU kernel execution
  • Provide feedback to application programmers for improving algorithms and data structures
  • Consider multiple GPUs
  • Optimize CPU code

SLIDE 16 — Methodology Overview

[Flowchart: Phase 1 (domain field) — identify hotspots; if a hotspot is not parallel, pick new hotspots or reconsider the algorithms; if it is compute-intensive enough, construct the codelets, then compile, run and check the results (supported by a pre-analysis tool and the HMPP Wizard & Feedback). Milestone: GPGPU operational application with known potential. Phase 2 (computer science field) — if the code is not appropriate to the GPU, rewrite it; compile, run, check the results and profile (HMPP Performance Analyzer, HMPP post-analysis tool). If allocation dominates, use allocate/release directives; if communication dominates, optimize the data transfers; if compute dominates, optimize the codelet code. Iterate until peak performance is achieved.]

SLIDE 17 — Focus on Hotspots

Profile your CPU application and build a coherent kernel set.

SLIDE 18 — Build Your GPU Computation with HMPP Directives (1)

Construct your GPU group of codelets.

SLIDE 19 — Build Your GPU Computation with HMPP Directives (2)

… and use the codelets in the application (a sketch follows).
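A minimal sketch of slides 18-19 combined (hypothetical codelet names; the <group> label syntax follows slide 37, and prefixing callsites with the same label is an assumption):

  #pragma hmpp <simu> group, target=CUDA

  #pragma hmpp <simu> stepA codelet, args[b].io=out
  void stepA(int n, float a[n], float b[n]){
    #pragma hmppcg parallel
    for (int i = 0; i < n; ++i) b[i] = 2.0f * a[i];
  }

  #pragma hmpp <simu> stepB codelet, args[c].io=out
  void stepB(int n, float b[n], float c[n]){
    #pragma hmppcg parallel
    for (int i = 0; i < n; ++i) c[i] = b[i] + 1.0f;
  }

  int main(void){
    float X[10000], Y[10000], Z[10000];
    /* ... initialize X ... */
    /* Each callsite performs an RPC onto the GPU (see slide 10). */
    #pragma hmpp <simu> stepA callsite
    stepA(10000, X, Y);
    #pragma hmpp <simu> stepB callsite
    stepB(10000, Y, Z);
    return 0;
  }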

SLIDE 20 — Tune Your Kernels for GPUs with CAPS HMPP Wizard (1/2)

Analyze your memory access pattern; use HMPPCG directives to make your kernel GPU friendly.

SLIDE 21 — Tune Your Kernels for GPUs with CAPS HMPP Wizard (2/2)

SLIDE 22 — CAPS Tools to Port Your Applications – Phase 2

SLIDE 23 — Optimizing Tools

SLIDE 24 — Analyze the GPU Code Porting Efficiency

Get a precise view of HMPP element behavior and statistics on GPU operations.

SLIDE 25 — Tune the GPU Execution Integration in Your Application with HMPP Directives

Optimize out transfers from kernel calls; optimize the GPU allocation and operate data prefetching.

SLIDE 26 — Analyze and Profile Kernel Execution on the GPU with HMPP Performance Analyzer

Get precise and specific information about kernel behavior; explore and exploit the GPU power at best from the C source level.

SLIDE 27 — Optimize the GPU Kernel Code Generation with HMPPCG Directives

Control loop transformations using directives; control the loop distribution over the GPU (grid generation). A sketch follows.
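A hedged sketch: only hmppcg parallel appears elsewhere in this deck; the grid blocksize and unroll directives below are recalled from HMPPCG documentation of the period, and their exact spelling is an assumption:

  #pragma hmpp kern codelet, target=CUDA, args[b].io=out
  void kern(int m, int n, float a[m][n], float b[m][n]){
    /* assumed directive: choose the thread-grid block shape */
    #pragma hmppcg grid blocksize 32 X 4
    #pragma hmppcg parallel
    for (int i = 0; i < m; ++i) {
      /* assumed directive: unroll the inner loop by a factor of 4 */
      #pragma hmppcg unroll 4
      for (int j = 0; j < n; ++j) {
        b[i][j] = 2.0f * a[i][j];
      }
    }
  }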

SLIDE 28 — Examples of Ported Applications

SLIDE 29 — Examples of Ported Applications – 1
• Smoothed particle hydrodynamics
  • Effort: 2 man-months
  • Size: 22 kLoC of F90 (SP or DP, MPI)
  • GPU C1060 improvement: x2 over serial code on Nehalem (x1.1 in DP)
  • Main difficulty: kernels limited to 70% of the execution time
• 3D Poisson equation, conjugate gradient
  • Effort: 2 man-months
  • Size: 2 kLoC of F90 (DP)
  • CPU improvement: x2
  • GPU C1060 improvement: x5 over serial code on Nehalem
  • Main porting operation: highly optimizing the kernels
  • Main difficulty: none

The performance-over-resource ratio is the important information here.

SLIDE 30 — Examples of Ported Applications - 2
• Electron propagation solver
  • Effort: 2 man-months
  • Size: 10 kLoC of F95 (DP, MPI)
  • CPU improvement: x1.3
  • GPU C1060 improvement: x1.15 over 4-thread code on Nehalem
  • Main porting operation: solver algorithm modifications
  • Main difficulty: small matrices, many data transfers
• 3D combustion code
  • Effort: 2 man-months
  • Size: ~100 kLoC of F90 (DP)
  • GPU C1060 improvement: ~x1 (data transfer limited) over serial code on Nehalem; C2050: x1.3
  • Main difficulty: the execution profile shows few hotspots (70%)
  • Next: the code/algorithm is being reviewed according to current results


SLIDE 31 — Examples of Ported Applications - 3
• Euler equations
  • Effort: <1 man-month
  • Size: ~40 kLoC of F90 (DP)
  • CPU improvement: x3 over the original code
  • GPU C1060 improvement: x3 over serial code on Nehalem
  • Main porting operation: specializing the code for the main execution configuration
  • Main difficulty: reorganizing computational kernels (CPU development legacy)
• Tsunami/flood simulation
  • Effort: 0.5 man-month
  • Size: ~4 kLoC (DP, MPI)
  • GPU C1060 improvement: x1.28 over serial code on Nehalem (kernel speedups of x30 and x18)
  • Next: expose more parallelism and reduce data transfers (high performance potential)


SLIDE 32 — Examples of Ported Applications - 5
• Weather models (GTC 2010 talk, M. Govett, NOAA)
  • Effort: 1 man-month (part of the code already ported)
  • GPU C1060 improvement: x10 over the serial code on Nehalem
  • Main porting operation: reduction of CPU-GPU transfers
  • Main difficulty: GPU memory size is the limiting factor


SLIDE 33 — Computer Vision & Medical: MultiView Stereo

• Resource spent: 1 man-month
• Size: ~1 kLoC of C99 (DP)
• HMPP basic version (1 hour)
  • GPU C2050 improvement: x30
  • Main porting operation: adding 4 directives
• HMPP fine-tuned version (2 weeks)
  • GPU C2050 improvement: x500
  • Main porting operation: rethinking the algorithm

SLIDE 34 — Conclusion
• Heterogeneous architectures are becoming ubiquitous
  • In HPC centers, but not only there
  • Tremendous opportunities, but not always easy to seize
  • CPUs and GPUs have to be used simultaneously
  • Legacy codes still need to be ported
• An efficient methodology is required
  • Tools supporting the methodology are needed and must provide a set of consistent views
  • The legacy coding style is not helping
  • Parallelism highlighted for GPUs is useful for future manycores
• HMPP-based programming
  • Helps implement incremental strategies
  • Is being complemented by a set of tools
  • Is engaged in an open-standard path with Pathscale


SLIDE 35
SLIDE 36 — Reducing Data Transfer Occurrences
• Preload data before the codelet call
• Load data as soon as possible


  int main(int argc, char **argv) {
    #pragma hmpp sgemm allocate, args[vin1;vin2;vout].size={size,size}
    . . .
    /* Preload data */
    #pragma hmpp sgemm advancedload, args[vin1;m;n;k;alpha;beta]
    for( j = 0 ; j < 2 ; j++ ) {
      /* Avoid reloading data */
      #pragma hmpp sgemm callsite &
      #pragma hmpp sgemm args[m;n;k;alpha;beta;vin1].advancedload=true
      sgemm( size, size, size, alpha, vin1, vin2, beta, vout );
      . . .
    }
    . . .
    #pragma hmpp sgemm release
  }


SLIDE 37 — Sharing Data Between Codelets with Resident Data
• Share data between codelets of the same group
• Keep data on the HWA (hardware accelerator) between two codelet calls
• Avoid useless data transfers


  #pragma hmpp <process> group, target=CUDA
  #pragma hmpp <process> resident
  float initValue = 1.5f, offset[9];
  …
  #pragma hmpp <process> reset1 codelet, args[t].io=out
  void reset(float t[M][N]){
    int i,j;
    for (i = 0; i < M; i += 1) {
      for (j = 0; j < N; j += 1) {
        t[i][j] = initValue + offset[(i+j)%9];
      }
    }
  }

  #pragma hmpp <process> process codelet, args[a].io=inout
  void process(real a[M][N], real b[M][N]){
    int i,j;
    for (i = 0; i < M; i += 1) {
      for (j = 0; j < N; j += 1) {
        a[i][j] = cos(a[i][j]) + cos(b[i][j]) - initValue;
      }
    }
  }
