GPULib: GPU Acceleration of Scientific Applications in (Very) High-Level Languages
Peter Messmer, messmer@txcorp.com
Tech-X Corporation, 5621 Arapahoe Ave., Boulder, CO 80303, www.txcorp.com
Paul J. Mullowney, Dan Karipides, Keegan Amyx, Nate Sizemore, Brian Granger, Mike Galloy, David Fillmore
This work is supported by NASA SBIR Phase-II Grant #NNG06CA13C
Los Alamos Computer Science Symposium, October 14-15, 2008, Santa Fe, NM

Who are we? What is Tech-X?
• Connecting Physics and HPC
• Boulder, CO
• ~55 employees, 45 PhD
• Physics, CS, Math

And who is paying for that?

The year is 2005...
• A NASA mission is facing a data analysis problem
• IDL (Interactive Data Language, by ITT VIS) is the tool of choice for data analysis
• “People are starved for cycles”
[Figure labels: 1 min, 5 hrs]

Scientists like to develop in very high-level languages
• Here “VHLL”: IDL (Interactive Data Language), MATLAB, Python
• Want to spend their time doing research, not code development
• Sociology: Communities “lock in” on languages
  – Solar physics, hyper-spectral imaging: IDL
  – Neurobiology, financial modelling: MATLAB
• Languages offer large collections of domain-relevant algorithms
• Increasing data volumes: Analysis has to scale as well
• => Conventional cluster computing too cumbersome
  – Not always access to a cluster
  – No desire to write MPI code
  – “Can’t you give me something I can plug into my computer and it makes things 10x faster?”
• Accelerator hardware (focus here on GPUs)
• CUDA is a great architecture, but still requires understanding of the hardware
Goal of the project: Provide acceleration without turning scientists into hardware experts

GPULib design goals: Get speedup from accelerator in a transparent way
• Accelerators directly usable from within VHLL
  – Users chose the high-level languages for a reason!
  – Many 4th-generation languages are vector oriented -> Beneficial to GPU
• Intuitive for users
  – Use host language features to make use of accelerators intuitive
• Code has to remain portable
  – Key!
  – Provide emulation, but do not incur overhead
• Take advantage of accelerator
  – Obtain as high a performance as possible
  – Less than peak is acceptable
• Provide as many operations as possible on accelerator to reduce data motion
• Take advantage of available libraries
  – cuBLAS, cuFFT
• Be abstract enough to enable porting to other accelerators
Messmer, Mullowney, Granger, “GPULib: GPU computing in High-Level Languages”, Computing in Science and Engineering, 10(5), 80, 2008.

GPULib layered architecture is easily extensible
[Diagram: software layers, top to bottom]
• IDL, MATLAB or Python, Java
• GPULib wrappers (language specific, includes software emulator)
• GPULib functions: Vector Arithmetic, Data Manipulation, Complex Operations
• NVIDIA functions: CUDA Runtime API, cuFFT, cuBLAS
• GPU

GPULib: One way to simplify GPU development
• GPULib provides a large set of vector operations
  – Data transfer GPU/CPU, memory management
  – Arithmetic, transcendental, logical functions
  – Data parallel primitives (prefix-sum)
  – Array operations (reshaping, interpolation, range selection, type casting)
  – NVIDIA’s cuBLAS, cuFFT
• Data objects on GPU represented as structure on CPU
  – Contains size information, dimensionality and pointer to GPU memory
• Library can be run without a GPU (software emulation)
• Download from http://gpulib.txcorp.com (free for non-commercial use)
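The CPU-side structure described above (size information, dimensionality, pointer to GPU memory) might be pictured as in the following Python sketch. The class name `GpuArrayHandle`, its field names, and the integer standing in for a device pointer are assumptions made for illustration; they are not GPULib's actual data layout.

```python
from dataclasses import dataclass
from math import prod
from typing import Tuple


@dataclass(frozen=True)
class GpuArrayHandle:
    """Hypothetical CPU-side descriptor for an array held in GPU memory."""
    shape: Tuple[int, ...]  # extent of each dimension
    itemsize: int           # bytes per element (e.g. 4 for single precision)
    device_ptr: int         # opaque GPU address (a stand-in value in this sketch)

    @property
    def ndim(self) -> int:
        """Dimensionality of the array."""
        return len(self.shape)

    @property
    def n_elements(self) -> int:
        """Total number of elements."""
        return prod(self.shape)

    @property
    def nbytes(self) -> int:
        """Size of the device allocation in bytes."""
        return self.n_elements * self.itemsize


# The host only carries this small descriptor; the data itself stays on the GPU.
handle = GpuArrayHandle(shape=(256, 256), itemsize=4, device_ptr=0x1000)
print(handle.ndim, handle.n_elements, handle.nbytes)
```

Keeping only a small descriptor on the host is what lets a chain of GPU operations pass results to each other without copying the data back to the CPU in between.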

A GPULib example in IDL
IDL> gpuPutArr, x, x_gpu      ; copy x from the CPU to x_gpu on the GPU
IDL> gpuSin, x_gpu, y_gpu     ; compute sin() on the GPU, result in y_gpu
IDL> gpuGetArr, y_gpu, y      ; copy y_gpu from the GPU back to y on the CPU
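The same put / compute / get pattern can be mimicked in plain Python, which also hints at how a software emulator can stand in for the hardware. This is only a sketch: the functions `gpu_put_arr`, `gpu_sin` and `gpu_get_arr` are hypothetical stand-ins named after the IDL calls, and the "device memory" is an ordinary dictionary on the host, not a real GPU binding.

```python
import math
from typing import Dict, List

# Emulated device memory: handle id -> array contents (assumption: no real GPU).
_device: Dict[int, List[float]] = {}
_next_id = 0


def _alloc(values: List[float]) -> int:
    """Store values in emulated device memory and return a new handle."""
    global _next_id
    _next_id += 1
    _device[_next_id] = values
    return _next_id


def gpu_put_arr(host: List[float]) -> int:
    """Copy a host array to the emulated device."""
    return _alloc(list(host))


def gpu_sin(x_gpu: int) -> int:
    """Elementwise sin() on the device; the result stays on the device."""
    return _alloc([math.sin(v) for v in _device[x_gpu]])


def gpu_get_arr(y_gpu: int) -> List[float]:
    """Copy a device array back to the host."""
    return list(_device[y_gpu])


x = [0.0, math.pi / 2, math.pi]
x_gpu = gpu_put_arr(x)   # CPU -> GPU
y_gpu = gpu_sin(x_gpu)   # computed on the GPU
y = gpu_get_arr(y_gpu)   # GPU -> CPU
print(y)
```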

Can you get all the performance with a vector library?
“Scientists want the control to increase performance as necessary but won’t sacrifice everything to performance”
Basili et al., “Understanding the High-Performance Computing Community”, IEEE Computer, July ’08.
• Vector operations with higher compute density (affine transform of arguments)
  – z = a x + b y + c
  – z = a exp(b y + c) + d
• Domain-specific algorithms
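To illustrate what "higher compute density" means for the affine forms above, the following Python sketch compares a chain of elementary vector operations, each making its own pass over the data and producing a temporary, with a single fused pass. The function names are hypothetical and the lists stand in for GPU arrays; this is not GPULib code.

```python
import math
from typing import List

Vector = List[float]


def scale(a: float, x: Vector) -> Vector:
    return [a * v for v in x]


def add(x: Vector, y: Vector) -> Vector:
    return [u + v for u, v in zip(x, y)]


def add_scalar(x: Vector, c: float) -> Vector:
    return [v + c for v in x]


def affine_chained(a: float, x: Vector, b: float, y: Vector, c: float) -> Vector:
    """z = a x + b y + c as four separate vector operations with temporaries."""
    return add_scalar(add(scale(a, x), scale(b, y)), c)


def affine_fused(a: float, x: Vector, b: float, y: Vector, c: float) -> Vector:
    """z = a x + b y + c in one pass: more arithmetic per element touched."""
    return [a * u + b * v + c for u, v in zip(x, y)]


def exp_affine_fused(a: float, b: float, y: Vector, c: float, d: float) -> Vector:
    """z = a exp(b y + c) + d in one pass."""
    return [a * math.exp(b * v + c) + d for v in y]


x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
print(affine_chained(2.0, x, 3.0, y, 1.0))
print(affine_fused(2.0, x, 3.0, y, 1.0))
```

Both variants compute the same values; the fused form reads each input once and writes the output once, which is the property that matters when memory traffic dominates.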

How to get performance?
• Kernels are very fast, GPU<->CPU data transfer is slow
[Figure: panels “Single invocation”, “Kernel only”, “10 invocations”; curves for lgamma(x), sin(x), ax+by+c, exp(x), x+y; horizontal axis: vector length]
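Why repeated invocations help can be seen with a toy cost model, sketched below in Python. It assumes the data is uploaded and downloaded once while the kernel is invoked several times on the device; every number in it is a made-up placeholder chosen for illustration, not a measurement from this work.

```python
def time_per_invocation(
    n_elements: int,
    n_invocations: int,
    transfer_s_per_elem: float,
    kernel_s_per_elem: float,
    launch_overhead_s: float,
) -> float:
    """Average wall time per kernel invocation under a simple linear model."""
    # One upload plus one download, paid once regardless of invocation count.
    transfer = 2 * n_elements * transfer_s_per_elem
    # Each invocation pays a fixed launch cost plus per-element compute.
    compute = n_invocations * (launch_overhead_s + n_elements * kernel_s_per_elem)
    return (transfer + compute) / n_invocations


# Hypothetical parameters (placeholders, not measured values).
params = dict(
    n_elements=1_000_000,
    transfer_s_per_elem=1e-9,
    kernel_s_per_elem=1e-10,
    launch_overhead_s=1e-5,
)

single = time_per_invocation(n_invocations=1, **params)
ten = time_per_invocation(n_invocations=10, **params)
print(f"single invocation: {single:.3e} s, ten invocations: {ten:.3e} s each")
```

In this model the transfer term shrinks as 1/n with the number of invocations, so the per-invocation time approaches the kernel-only time as more work is done per transfer.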

Example: Image Deconvolution
• Image is convolved with detector point-spread function:
  $I_{\mathrm{obs}}(x, y) = \iint I_{\mathrm{true}}(x - u, y - v)\, P(u, v)\, du\, dv$
• Clean image by (complex) division in Fourier space:
  $I_{\mathrm{true}}(x, y) = \mathrm{FFT}^{-1}\left( \mathrm{FFT}(I_{\mathrm{obs}}) / \mathrm{FFT}(P) \right)$
• Large computational load per CPU-GPU data transfer
• Real world problem
• Speedup ranging from 5x to 28x for image sizes from 256x256 to 3k x 3k
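The division in Fourier space can be tried out on a small case. The Python sketch below is a deliberately simplified illustration: it works on a 1D periodic signal rather than a 2D image, uses a naive O(n²) DFT from the standard library instead of cuFFT, and assumes the transform of the point-spread function has no zeros and the data is noise free. The signal and PSF values are invented for the example.

```python
import cmath
from typing import List, Sequence


def dft(x: Sequence[complex], inverse: bool = False) -> List[complex]:
    """Naive discrete Fourier transform (forward, or inverse with 1/n scaling)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[m] * cmath.exp(sign * 2j * cmath.pi * m * k / n) for m in range(n))
        out.append(s / n if inverse else s)
    return out


def circular_convolve(signal: Sequence[float], psf: Sequence[float]) -> List[float]:
    """I_obs(x) = sum_u I_true(x - u) P(u), with periodic boundaries."""
    n = len(signal)
    return [sum(signal[(i - u) % n] * psf[u] for u in range(n)) for i in range(n)]


def deconvolve(observed: Sequence[float], psf: Sequence[float]) -> List[float]:
    """I_true = FFT^-1( FFT(I_obs) / FFT(P) ); assumes FFT(P) is nonzero everywhere."""
    f_obs = dft(observed)
    f_psf = dft(psf)
    ratio = [o / p for o, p in zip(f_obs, f_psf)]  # complex division per frequency
    return [v.real for v in dft(ratio, inverse=True)]


true_signal = [1.0, 4.0, 2.0, 0.0]   # invented test signal
psf = [0.6, 0.2, 0.0, 0.2]           # invented symmetric blur kernel
observed = circular_convolve(true_signal, psf)
recovered = deconvolve(observed, psf)
print(observed)
print(recovered)
```

With real, noisy data the plain division amplifies noise wherever FFT(P) is small, so practical pipelines usually add some form of regularization; that is outside the scope of this sketch.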

What happened next?
• People downloaded GPULib with interests in
  – Medical imaging
  – Image rectification
  – Remote sensing
  – Signal processing
  – Wildlife tracking
  – and many more …
• Customers and evaluations include
  – NASA
  – US AFRL
  – Rutherford Appleton Lab
  – Leiden University, NL
  – Laboratory for Atmospheric and Space Physics (LASP)
  – Many universities …

GPULib example 1: Image processing
Principal Component Analysis (PCA)
Δt = 3 s
Data courtesy of Dr. Mort Canty, FZ Juelich, Germany

GPULib example 3: Simulation
Neutron scattering experiment
Use simulation written in IDL to compute location of scattering maxima (Bragg peaks)
Data courtesy of Dr. Matthias Gutmann, Rutherford Appleton Research Lab, UK

Where we would like to go...
• More specialized kernels
  – Collaborate with users to get their performance tuned
  – GPULib enables iterative approach to GPUs/accelerators
• Performance promising enough that library could act as abstraction for accelerators for “conventional” HPC applications
  – Unify C/Fortran interface
• Develop HPC-relevant kernels
  – Ghost cell exchanges
  – Particle-push kernels
• Target different accelerators
  – Portable code for accelerators

Conclusions
• GPULib offers a large set of vector operations on GPU
• Enables users to take advantage of accelerators from within their favourite languages
• One example of accelerator interface that requires no hardware knowledge
• Scientists do not lock in on a particular hardware
• We are happy to collaborate on getting your analysis accelerated on GPUs
