KernelGen – A prototype of auto-parallelizing Fortran/C compiler for NVIDIA GPUs


SLIDE 1

Programming weather, climate, and earth-system models on heterogeneous multi-core platforms, National Center for Atmospheric Research, Boulder, Colorado, September 12-13, 2012

KernelGen – A prototype of auto-parallelizing Fortran/C compiler for NVIDIA GPUs

Dmitry Mikushin (1,3), Nikolay Likhogrud (2,3), Hou Yunqing (4), Sergey Kovylov (5)

(1) Institute of Computational Science, University of Lugano; (2) Lomonosov Moscow State University; (3) Applied Parallel Computing LLC; (4) Nanyang Technological University; (5) NVIDIA

Dmitry Mikushin et al. (USI/ICS) KernelGen prototype compiler September 12, 2013 1 / 23

SLIDE 2

KernelGen research project

Goals:
- Preserve the original application source code; keep all GPU-specific machinery in the background
- Minimize manual work on model-specific code ⇒ develop a compiler toolchain usable with many models

Rationale:
- Good old programming languages can still be usable if accurate code analysis and parallelization methods exist
- OpenACC is too restrictive for complex applications and needs more flexibility
- The GPU tends to become the central processing unit in the near future, which contradicts the OpenACC paradigm
- NWP is a perfect testbed for novel accelerator programming models


SLIDE 4

WRF specifics

- Sets of multiple numerical blocks to switch between, depending on model purpose ⇒ no need to compile all code for the GPU at once; JIT-compile only the parts actually used
- Complex build system: most code is compiled into static libraries, and many potential GPU kernels have external dependencies ⇒ needs a modified linker to resolve kernel dependencies at link time

SLIDE 5

Project Team

- University of Lugano, Institute of Computational Science
- Lomonosov Moscow State University, Faculty of Computational Mathematics and Cybernetics
- Applied Parallel Computing LLC

With technical support of many communities: AsFermi, OpenMPI and others


SLIDE 7

Project state in September 2011 (v0.1)

Results:
- Could successfully generate CUDA and OpenCL kernels out of parallel loops in Fortran, with many limitations
- Automatic handling of host-device data transfers, with all process data kept on the host
- Better language support than F2C-ACC, but still many issues

Implementation:
- Pretty-printed AST – to mark up and transform code into host and device parts
- No reliable data-dependency analysis in loops
- LLVM + C backend – to convert Fortran to C and chain to the CUDA compiler


SLIDE 9

Project state in September 2012 (v0.2 nvptx)

Results:
- Can analyze arbitrary loops in C/C++/Fortran for parallelism and generate CUDA kernels
- Better quality of parallelism detection than OpenACC from PGI
- Automatic handling of host-device data transfers, with all process data kept on the device
- Full compatibility with the conventional GCC compiler and linker

Implementation:
- DragonEgg – to emit LLVM IR from C/C++/Fortran
- LLVM loop-extractor pass – to detect loops at compile time
- Modified LLVM Polly – to perform loop analysis at runtime
- LLVM NVPTX backend – to emit PTX ISA directly from LLVM IR
- Modified GCC compiler and custom LTO wrapper – to support calling external functions in loops and to link code from static libraries


SLIDE 11

KernelGen user interface design

- KernelGen is based on GCC and is fully compatible with it
- The executable binary preserves the host-only version, which is used by default; the GPU version is activated on request
- Execution mode is controlled by the kernelgen_runmode environment variable: 0 – run the original CPU binary, 1 – run the GPU version

$ NETCDF=/opt/kernelgen ./configure
Please select from among the following supported platforms
...
27. Linux x86_64, kernelgen-gfortran compiler for CUDA (serial)
28. Linux x86_64, kernelgen-gfortran compiler for CUDA (smpar)
29. Linux x86_64, kernelgen-gfortran compiler for CUDA (dmpar)
30. Linux x86_64, kernelgen-gfortran compiler for CUDA (dm+sm)
Enter selection [1-38] : 27
...
$ ./compile em_real
...
$ cd test/em_real/
$ kernelgen_runmode=1 ./real.exe
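The runmode switch above can be sketched in C. kernelgen_runmode is the real environment variable from the slide, but select_runmode is a hypothetical helper for illustration, not part of KernelGen's actual runtime:

```c
#include <stdlib.h>

/* Parse the kernelgen_runmode value: 0 = original CPU binary (the
 * default when the variable is unset), 1 = GPU version. This helper
 * is illustrative only, not KernelGen's API. */
int select_runmode(const char *value)
{
    if (value == NULL)
        return 0;                  /* unset: host-only path by default */
    return atoi(value) == 1 ? 1 : 0;
}
```

A launcher would then dispatch on `select_runmode(getenv("kernelgen_runmode"))`, which is how the `kernelgen_runmode=1 ./real.exe` invocation above activates the GPU path while the plain binary keeps running on the host.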


SLIDE 12

OpenACC: no external calls

OpenACC compilers do not allow calls into different compilation units:

sincos.f90

!$acc parallel
do k = 1, nz
  do j = 1, ny
    do i = 1, nx
      xy(i, j, k) = sincos_ijk(x(i, j, k), y(i, j, k))
    enddo
  enddo
enddo
!$acc end parallel

function.f90

sincos_ijk = sin(x) + cos(y)

$ pgfortran -fast -Mnomain -Minfo=accel -ta=nvidia,time -Mcuda=keepgpu,keepbin,keepptx,ptxinfo -c ../sincos.f90 -o sincos.o
PGF90-W-0155-Accelerator region ignored; see -Minfo messages (../sincos.f90: 33)
sincos:
  33, Accelerator region ignored
  36, Accelerator restriction: function/procedure calls are not supported
  37, Accelerator restriction: unsupported call to sincos_ijk
0 inform, 1 warnings, 0 severes, 0 fatal for sincos

SLIDE 13

KernelGen: external calls

Dependency resolution during linking + kernel generation at runtime ⇒ support for external calls defined in other objects or static libraries:

!$acc parallel
do k = 1, nz
  do j = 1, ny
    do i = 1, nx
      xy(i, j, k) = sincos_ijk(x(i, j, k), y(i, j, k))
    enddo
  enddo
enddo
!$acc end parallel

sincos_ijk = sin(x) + cos(y)

Result:

Launching kernel __kernelgen_sincos__loop_3
blockDim = { 32, 16, 1 }
gridDim = { 16, 32, 63 }
Finishing kernel __kernelgen_sincos__loop_3
__kernelgen_sincos__loop_3 time = 0.00536099 sec

SLIDE 14

OpenACC: no pointers tracking

In Fortran, allocatable arrays carry their dimensions. This is not the case in C:

sincos.c

void sincos(int nx, int ny, int nz, float *x, float *y, float *xy)
{
  #pragma acc parallel
  for (int k = 0; k < nz; k++)
    for (int j = 0; j < ny; j++)
      for (int i = 0; i < nx; i++)
      {
        int idx = i + nx * j + nx * ny * k;
        xy[idx] = sin(x[idx]) + cos(y[idx]);
      }
  ...
}

$ pgcc -fast -Minfo=accel -ta=nvidia,time -Mcuda=keepgpu,keepbin,keepptx,ptxinfo -c ../sincos.c -o sincos.o
PGC-W-0155-Compiler failed to translate accelerator region (see -Minfo messages):
Could not find allocated-variable index for symbol (../sincos.c: 27)
sincos:
  27, Accelerator kernel generated
  28, Complex loop carried dependence of *(y) prevents parallelization
      Complex loop carried dependence of *(x) prevents parallelization
      Complex loop carried dependence of *(xy) prevents parallelization
  ...
  30, Accelerator restriction: size of the GPU copy of xy is unknown
  ...

SLIDE 15

KernelGen: smart pointers tracking

Pointer alias analysis is performed at runtime, assisted by address substitution.

sincos.c

void sincos(int nx, int ny, int nz, float *x, float *y, float *xy)
{
  #pragma acc parallel
  for (int k = 0; k < nz; k++)
    for (int j = 0; j < ny; j++)
      for (int i = 0; i < nx; i++)
      {
        int idx = i + nx * j + nx * ny * k;
        xy[idx] = sin(x[idx]) + cos(y[idx]);
      }
  ...
}

Result:

Launching kernel __kernelgen_sincos_loop_8.preheader
blockDim = { 32, 16, 1 }
gridDim = { 16, 32, 63 }
Finishing kernel __kernelgen_sincos_loop_8.preheader
__kernelgen_sincos_loop_8.preheader time = 0.00528601 sec
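The runtime address substitution can be imagined as a small registry that maps host address ranges to their device copies. Everything here (mapping_t, kg_map, kg_substitute, the fixed-size table) is a hypothetical sketch of the idea, not KernelGen's implementation:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical registry of host->device address ranges. When a kernel
 * argument falls inside a tracked host range, the runtime substitutes
 * the device address at the same byte offset; untracked pointers pass
 * through unchanged. */
typedef struct { uintptr_t host; size_t size; uintptr_t device; } mapping_t;

static mapping_t table[16];
static int ntable = 0;

/* Record that [host, host+size) is mirrored at device. */
void kg_map(void *host, size_t size, void *device)
{
    mapping_t m = { (uintptr_t)host, size, (uintptr_t)device };
    table[ntable++] = m;
}

/* Return the device address for a host pointer, or the pointer itself
 * if it is not tracked. */
void *kg_substitute(void *p)
{
    uintptr_t a = (uintptr_t)p;
    for (int i = 0; i < ntable; i++)
        if (a >= table[i].host && a < table[i].host + table[i].size)
            return (void *)(table[i].device + (a - table[i].host));
    return p;
}
```

Because the lookup happens at launch time, the kernel never needs to know the allocation sizes statically, which is exactly what defeats the OpenACC compiler on the previous slide.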


SLIDE 16

KernelGen: can parallelize while loops

Thanks to the nature of LLVM and Polly, KernelGen can parallelize while loops that are semantically equivalent to for loops (OpenACC can't):

i = 1
do while (i .le. nx)
  j = 1
  do while (j .le. nz)
    k = 1
    do while (k .le. ny)
      C(i, j) = C(i, j) + A(i, k) * B(k, j)
      k = k + 1
    enddo
    j = j + 1
  enddo
  i = i + 1
enddo

Launching kernel __kernelgen_matmul__loop_9
blockDim = { 32, 32, 1 }
gridDim = { 2, 16, 1 }
Finishing kernel __kernelgen_matmul__loop_9
__kernelgen_matmul__loop_9 time = 0.00953514 sec
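The while nest above is semantically a triple for loop, which is what lets a polyhedral analyzer treat it as a regular loop nest. In C terms (matmul_while and matmul_for are illustrative helpers, not KernelGen code):

```c
#define N 4

/* Matrix multiply written with while loops, mirroring the Fortran slide. */
void matmul_while(const float A[N][N], const float B[N][N], float C[N][N])
{
    int i = 0;
    while (i < N) {
        int j = 0;
        while (j < N) {
            int k = 0;
            while (k < N) {
                C[i][j] += A[i][k] * B[k][j];
                k++;
            }
            j++;
        }
        i++;
    }
}

/* The canonical for-loop form the while version is equivalent to; an
 * analyzer like Polly can recover this shape from the while nest. */
void matmul_for(const float A[N][N], const float B[N][N], float C[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}
```

Both functions produce identical results for any input, which is the equivalence the parallelizer has to prove before generating a kernel.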


SLIDE 17

Benchmarking: sincos

xy[i,j,k] := sin(x[i,j,k]) + cos(y[i,j,k])

Kernel execution time, ms (less is better):
  KernelGen/Fermi: 5.49
  PGI/Fermi: 4.5

PGI 12.6; Fermi – Tesla C2050

SLIDE 18

Benchmarking: matmul

PGI is currently faster because of partial reduction in registers:

Kernel execution time, ms (less is better):
  KernelGen/Fermi: 6.01
  PGI/Fermi: 1.01
  KernelGen/Kepler: 9.41
  PGI/Kepler: 0.95

PGI 12.6; Fermi – Tesla C2050; Kepler – GTX 680M

SLIDE 19

Benchmarking: jacobi

On finite-difference patterns KernelGen's performance is better:

Kernel execution time, ms (less is better):
  Compute kernel – KernelGen/Fermi: 19.09, PGI/Fermi: 28.35, KernelGen/Kepler: 20.42, PGI/Kepler: 23.43
  Data copy kernel – KernelGen/Fermi: 11.36, PGI/Fermi: 9.54, KernelGen/Kepler: 9.44, PGI/Kepler: 8.5

PGI 12.6; Fermi – Tesla C2050; Kepler – GTX 680M

SLIDE 20

KernelGen concepts

Main GPU and peripheral host system:
- Initially port to the GPU as much parallel code as possible, without human decision
- Fall back to the CPU version in case of calls to host-only functions (I/O, syscalls, ...), non-parallel loops, or inefficient parallel code
- Perform transparent host-device data sharing on demand, keeping all data on the device by default rather than on the host
- Use GCC frontends to support major programming languages (Fortran, C/C++, Ada, etc.)
- Unify all languages into a common intermediate representation
- Extract potentially parallel loops into kernels at compile time, but decide the execution mode taking runtime information into account (JIT)
- Adjust the kernel execution mode using dynamically collected statistics or profile files from previous runs
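The CPU-fallback policy can be sketched as a dispatch: try the GPU path, and run the host version when the loop needs host-only functionality. All names here (kernel_t, launch, the stub functions) are hypothetical illustrations of the concept, not KernelGen's runtime:

```c
/* Hypothetical kernel descriptor: a GPU attempt that may refuse to run
 * (e.g. the loop body calls I/O or another host-only function) and a
 * CPU fallback that always works. */
typedef struct {
    const char *name;
    int (*gpu)(void);    /* returns 0 on success, nonzero to refuse */
    void (*cpu)(void);   /* original host-side loop, always available */
} kernel_t;

/* Stubs standing in for generated code paths. */
static int gpu_refuses(void) { return 1; }  /* e.g. loop does printf() */
static int gpu_accepts(void) { return 0; }
static void cpu_version(void) { /* run the original host loop */ }

/* Run on the GPU when possible, otherwise fall back transparently.
 * Returns 1 if the GPU path ran, 0 if the CPU fallback ran. */
int launch(kernel_t *k)
{
    if (k->gpu && k->gpu() == 0)
        return 1;
    k->cpu();
    return 0;
}
```

The key point of the concept is that the preserved host-only binary makes this fallback always safe: no loop is ever lost by being un-portable.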



SLIDE 22

LLVM for Fortran & GPU in a nutshell

LLVM – a universal system for program analysis, transformation, and optimization, built around a RISC-like intermediate representation in SSA form (LLVM IR).

Pipeline: frontends (clang, GHC, ...) → LLVM IR → analysis, optimization, and transformation passes → backends (x86, ARM, PTX, ...)

SLIDE 23

LLVM for Fortran & GPU in a nutshell

Consider the following kernel written in Fortran:

subroutine sum_kernel(a, b, c, length)
  implicit none
  integer :: length
  real, dimension(length) :: a, b, c
  integer :: idx, threadIdx_x
  idx = threadIdx_x() + 1
  c(idx) = a(idx) + b(idx)
end subroutine sum_kernel
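In C terms, the kernel above performs one element of a vector sum per GPU thread. A host-side sketch where each loop iteration stands in for one thread (sum_kernel_c, sum_launch, and the explicit tid argument are illustrative, not generated code):

```c
/* One "thread" of the Fortran sum kernel: tid plays the role of
 * threadIdx_x(). Fortran computes idx = threadIdx_x() + 1 because its
 * arrays are 1-based; in 0-based C the same element is simply tid. */
void sum_kernel_c(const float *a, const float *b, float *c, int tid)
{
    c[tid] = a[tid] + b[tid];
}

/* Emulate a 1-D thread grid of n threads with a serial loop. */
void sum_launch(const float *a, const float *b, float *c, int n)
{
    for (int tid = 0; tid < n; tid++)   /* each iteration = one GPU thread */
        sum_kernel_c(a, b, c, tid);
}
```

Keeping the per-thread body and the launch separate mirrors how the real kernel is compiled once and instantiated across the grid.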


SLIDE 24

LLVM for Fortran & GPU in a nutshell

With the help of GCC and DragonEgg it can be translated into LLVM IR:

$ kernelgen-dragonegg kernel.f90 | opt -O3 -S -o kernel.ll

target datalayout = "e-p:64:64:64-S128-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f16:16:16-f32:32:32-f64:64:64-f128:128:128-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-unknown-linux-gnu"

define void @sum_kernel_([0 x float]* noalias nocapture %a, [0 x float]* noalias nocapture %b, [0 x float]* noalias nocapture %c, i32* noalias nocapture %length) nounwind uwtable {
entry:
  %0 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() nounwind
  %1 = add i32 %0, 1
  %2 = sext i32 %1 to i64
  %3 = add i64 %2, -1
  %4 = getelementptr [0 x float]* %a, i64 0, i64 %3
  %5 = load float* %4, align 4
  %6 = getelementptr [0 x float]* %b, i64 0, i64 %3
  %7 = load float* %6, align 4
  %8 = fadd float %5, %7
  %9 = getelementptr [0 x float]* %c, i64 0, i64 %3
  store float %8, float* %9, align 4
  ret void
}

declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()

SLIDE 25

LLVM for Fortran & GPU in a nutshell

PTX GPU assembly can be emitted from LLVM IR with the help of the NVPTX backend:

$ llc -march="nvptx64" -mcpu="sm_30" kernel.ll -o kernel.ptx

.func sum_kernel_(
  .param .b64 sum_kernel__param_0,
  .param .b64 sum_kernel__param_1,
  .param .b64 sum_kernel__param_2,
  .param .b64 sum_kernel__param_3)
{
  .reg .pred %p<396>;
  .reg .s16  %rc<396>;
  .reg .s16  %rs<396>;
  .reg .s32  %r<396>;
  .reg .s64  %rl<396>;
  .reg .f32  %f<396>;
  .reg .f64  %fl<396>;

  mov.u32      %r0, %tid.x;
  add.s32      %r0, %r0, 1;
  cvt.s64.s32  %rl0, %r0;
  add.s64      %rl0, %rl0, -1;
  shl.b64      %rl0, %rl0, 2;
  ld.param.u64 %rl1, [sum_kernel__param_0];
  add.s64      %rl1, %rl1, %rl0;
  ld.param.u64 %rl2, [sum_kernel__param_1];
  add.s64      %rl2, %rl2, %rl0;
  ld.f32       %f0, [%rl2];
  ld.f32       %f1, [%rl1];
  add.f32      %f0, %f1, %f0;
  ld.param.u64 %rl1, [sum_kernel__param_2];
  add.s64      %rl0, %rl1, %rl0;
  st.f32       [%rl0], %f0;
  ret;
}
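The address arithmetic in the generated code above (add 1 for Fortran's 1-based index, subtract 1 back to a 0-based element index, then shift left by 2 to scale by sizeof(float)) can be checked with a small sketch; element_byte_offset is an illustrative helper, not emitted code:

```c
/* Byte offset of the element touched by thread tid, following the
 * generated instruction sequence step by step. */
long element_byte_offset(int tid)
{
    long fortran_idx = (long)tid + 1;   /* add.s32 %r0, %r0, 1 + sign extend */
    long zero_based  = fortran_idx - 1; /* add.s64 %rl0, %rl0, -1 */
    return zero_based << 2;             /* shl.b64 %rl0, %rl0, 2  (x4 bytes) */
}
```

The +1/-1 pair survives because the frontend emits the 1-based Fortran index first and the optimizer leaves the round trip to the address computation; the net effect is simply tid * 4 added to each parameter's base address.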

SLIDE 26

http://kernelgen.org/testit/

Please help us improve the quality and usefulness of KernelGen. The code is open source and can easily be compiled into a binary package.

SLIDE 27

Technical plan for Stage 3 (Fall 2012)

Compiler core improvements (by priority):
1. Get rid of code inlining before applying loop analysis with Polly
2. Fix crashes of kernels using CUDA math functions on Kepler
3. Solve problems with compilation of big kernels using ptxas
4. Rewrite the GPU-CPU data sharing model more efficiently
5. Replace host-assisted loop-kernel launching with the Kepler K20's dynamic parallelism
6. Enable Polly tiling with support for shared memory, loop interchange, and Kepler's warp shuffle

Improve usability: create an Ubuntu PPA repository shipping KernelGen compiler binaries

Testing: NPB, polybench, COSMO radiation, WRF

SLIDE 28

Download link for this presentation:

http://kernelgen.org/ncar2012/

Project mailing list:

kernelgen-devel@lists.hpcforge.org

Thank you!
