SLIDE 1

Bringing Next Generation C++ to GPUs

Michael Haidl¹, Michel Steuwer², Lars Klein¹ and Sergei Gorlatch¹

¹ University of Muenster, Germany   ² University of Edinburgh, UK

SLIDE 2

The Problem: Dot Product

std::vector<int> a(N), b(N), tmp(N);
std::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
               std::multiplies<int>());
auto result = std::accumulate(tmp.begin(), tmp.end(), 0,
                              std::plus<int>());

  • The STL is the C++ programmer's Swiss Army knife
  • STL containers, iterators and algorithms introduce a high level of abstraction
  • Since C++17 it is also parallel (see the sketch below)
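For illustration, a minimal sketch of the same dot product with the C++17 parallel algorithms, assuming a standard library that ships <execution>; std::transform_reduce fuses both steps and may execute in parallel:

#include <execution>  // std::execution::par
#include <numeric>    // std::transform_reduce
#include <vector>

int dot_par(const std::vector<int>& a, const std::vector<int>& b) {
  // Element-wise multiply and sum in a single, potentially parallel pass.
  return std::transform_reduce(std::execution::par,
                               a.begin(), a.end(), b.begin(), 0);
}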


SLIDE 3

The Problem: Dot Product

std::vector<int> a(N), b(N), tmp(N);
std::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
               std::multiplies<int>());
auto result = std::accumulate(tmp.begin(), tmp.end(), 0,
                              std::plus<int>());

template<typename T>
auto mult(const std::vector<T>& a, const std::vector<T>& b) {
  std::vector<T> tmp(a.size());
  std::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
                 std::multiplies<T>());
  return tmp;
}

template<typename T>
auto sum(const std::vector<T>& a) {
  return std::accumulate(a.begin(), a.end(), T(), std::plus<T>());
}

(mult defined in f1.hpp, sum in f2.hpp)


SLIDE 4

The Problem: Dot Product

std::vector<int> a(N), b(N);
auto result = sum(mult(a, b));

template<typename T>
auto mult(const std::vector<T>& a, const std::vector<T>& b) {
  std::vector<T> tmp(a.size());
  std::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
                 std::multiplies<T>());
  return tmp;
}

template<typename T>
auto sum(const std::vector<T>& a) {
  return std::accumulate(a.begin(), a.end(), T(), std::plus<T>());
}

(mult defined in f1.hpp, sum in f2.hpp)


SLIDE 5

The Problem: Dot Product

std::vector<int> a(N), b(N);
auto result = sum(mult(a, b));

Performance:

  • vectors of size 25600000
  • Clang/LLVM 5.0.0svn -O3 optimized

[Bar chart: runtime in ms for transform/accumulate, transform/accumulate* and inner_product; * LLVM patched with extended D17386 (loop fusion)]
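For reference, the inner_product variant in the chart above fuses both loops into a single pass; a minimal sketch:

#include <numeric>  // std::inner_product
#include <vector>

int dot(const std::vector<int>& a, const std::vector<int>& b) {
  // One pass over the data, no temporary vector tmp.
  return std::inner_product(a.begin(), a.end(), b.begin(), 0);
}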

SLIDE 6

The Problem: Dot Product on GPUs

thrust::device_vector<int> a(N), b(N), tmp(N);
thrust::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
                  thrust::multiplies<int>());
auto result = thrust::reduce(tmp.begin(), tmp.end(), 0,
                             thrust::plus<int>());

  • Highly tuned STL-like library for GPU programming
  • Thrust offers containers, iterators and algorithms
  • Based on CUDA


SLIDE 7

The Problem: Dot Product on GPUs

thrust::device_vector<int> a(N), b(N), tmp(N);
thrust::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
                  thrust::multiplies<int>());
auto result = thrust::reduce(tmp.begin(), tmp.end(), 0,
                             thrust::plus<int>());

  • Highly tuned STL-like library for GPU programming
  • Thrust offers containers, iterators and algorithms
  • Based on CUDA

Same Experiment:

  • nvcc -O3 (from CUDA 8.0)

[Bar chart: runtime in ms for transform/reduce and inner_product]
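The inner_product bar in the chart above corresponds to Thrust's fused variant; a minimal sketch, assuming Thrust's inner_product header:

#include <thrust/device_vector.h>
#include <thrust/inner_product.h>

int dot_gpu(const thrust::device_vector<int>& a,
            const thrust::device_vector<int>& b) {
  // A single fused kernel instead of transform followed by reduce.
  return thrust::inner_product(a.begin(), a.end(), b.begin(), 0);
}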

SLIDE 8

The Next Generation: Ranges for the STL

  • range-v3 prototype implementation by E. Niebler
  • Proposed as N4560 for the C++ Standard

std::vector<int> a(N), b(N);
auto mult = [](auto tpl) { return get<0>(tpl) * get<1>(tpl); };
auto result = accumulate(view::transform(view::zip(a, b), mult), 0);

SLIDE 9

The Next Generation: Ranges for the STL

std::vector<int> a(N), b(N);
auto mult = [](auto tpl) { return get<0>(tpl) * get<1>(tpl); };
auto result = accumulate(view::transform(view::zip(a, b), mult), 0);

Performance?

  • Clang/LLVM 5.0.0svn -O3 optimized

[Bar chart: runtime in ms for transform/accumulate and inner_product]

SLIDE 10

The Next Generation: Ranges for the STL

std::vector<int> a(N), b(N);
auto mult = [](auto tpl) { return get<0>(tpl) * get<1>(tpl); };
auto result = accumulate(view::transform(view::zip(a, b), mult), 0);

  • Views describe lazy, non-mutating operations on ranges
  • Evaluation happens inside an algorithm (e.g., accumulate)
  • Fusion is guaranteed by the implementation (sketched below)
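Roughly, the fused evaluation of the pipeline above corresponds to a single loop; an illustrative sketch, not the actual generated code:

int result = 0;
for (std::size_t i = 0; i < a.size(); ++i) {
  // zip and transform are evaluated lazily inside accumulate:
  // one pass over a and b, no temporary container.
  result += a[i] * b[i];
}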


SLIDE 11

Ranges for GPUs

  • Extended range-v3 with GPU-enabled container and algorithms
  • Original code of range-v3 remains unmodified

std::vector<int> a(N), b(N);
auto mult = [](auto tpl) { return get<0>(tpl) * get<1>(tpl); };
auto ga = gpu::copy(a);
auto gb = gpu::copy(b);
auto result = gpu::reduce(view::transform(view::zip(ga, gb), mult), 0);

SLIDE 12

Programming Accelerators with C++ (PACXX)

[Architecture diagram: C++ source (e.g., a std::fill implementation) is compiled in the offline stage by the PACXX Offline Compiler (Clang frontend, LLVM, libc++) into an executable with embedded LLVM IR; in the online stage the PACXX Runtime's LLVM-based online compiler lowers the IR either through the OpenCL backend to SPIR (OpenCL Runtime: AMD GPU, Intel MIC) or through the NVPTX/CUDA backend to PTX (CUDA Runtime: Nvidia GPU).]

  • Based entirely on LLVM / Clang
  • Supports C++14 for GPU Programming
  • Just-In-Time Compilation of LLVM IR for target accelerators


SLIDE 13

Multi-Stage Programming

template <typename InRng, typename T, typename Fun>
auto reduce(InRng&& in, T init, Fun&& fun) {
  // 1. preparation of kernel call
  ...
  // 2. create GPU kernel
  auto kernel = pacxx::kernel(
      [fun](auto&& in, auto&& out, int size, auto init) {
        // 2a. stage elements per thread
        auto ept = stage([&] { return size / get_block_size(0); });
        // 2b. start reduction computation
        auto sum = init;
        for (int x = 0; x < ept; ++x) {
          sum = fun(sum, *(in + gid));
          gid += glbSize;
        }
        // 2c. perform reduction in shared memory
        ...
        // 2d. write result back
        if (lid == 0) *(out + bid) = shared[0];
      },
      blocks, threads);
  // 3. execute kernel
  kernel(in, out, distance(in), init);
  // 4. finish reduction on the CPU
  return std::accumulate(out, init, fun);
}

SLIDE 14

MSP Integration into PACXX

[Architecture diagram, extended with MSP: the PACXX Offline Compiler (Clang frontend, LLVM, libc++) embeds both an MSP IR and a Kernel IR into the executable; at run time the MSP Engine and the PACXX Runtime's LLVM-based online compiler lower the Kernel IR through the OpenCL backend to SPIR (OpenCL Runtime: AMD GPU, Intel MIC) or through the NVPTX/CUDA backend to PTX (CUDA Runtime: Nvidia GPU).]

  • The MSP Engine JIT compiles the MSP IR,
  • evaluates stage prior to a kernel launch, and
  • replaces the calls to stage in the kernel's IR with the results.
  • This enables more optimizations (e.g., loop unrolling) in the online stage, as illustrated below.
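As an illustration of the last bullet, consider the staged loop from the reduce kernel on the previous slide; the concrete numbers below are made up for illustration and do not come from the slides:

// Before MSP: the trip count is only known at run time.
auto ept = stage([&] { return size / get_block_size(0); });
for (int x = 0; x < ept; ++x) { /* ... */ }

// After MSP, e.g., with size == (1 << 20) and get_block_size(0) == 1024:
// the stage call has been replaced by the evaluated constant, so the
// online compiler sees a fixed trip count and can fully unroll the loop.
for (int x = 0; x < 1024; ++x) { /* ... */ }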


SLIDE 15

Performance Impact of MSP

gpu::reduce on Nvidia K20c

[Line plot: speedup over the non-MSP version vs. input size (2^15 to 2^25) for Dot, Sum, Dot +MS and Sum +MS; y-axis from 0.9 to 1.4]

Up to 35% better performance compared to the non-MSP version


SLIDE 16

Just-In-Time Compilation Overhead

Comparing MSP in PACXX with Nvidia’s nvrtc library

[Bar chart: compilation time in ms for the Dot and Sum kernels with CUDA 7.5, CUDA 8.0 RC and PACXX]

10 to 20 times faster because no front-end actions are performed at run time.


SLIDE 17

Benchmarks

range-v3 + PACXX vs. Nvidia’s Thrust

[Bar chart: speedup of PACXX relative to Thrust — vadd: +8.37%, saxpy: +1.61%, sum: +82.32%, dot: +73.31%, Monte Carlo: +33.6%, Mandelbrot: −3.13%, Voronoi: −7.38%]

  • Evaluated on a Nvidia K20c GPU
  • 11 different input sizes
  • 1000 runs for each benchmark

Competitive performance with a composable GPU programming API


SLIDE 18

Going Native: Work in Progress

[Architecture diagram, extended with the native CPU backend: in addition to the OpenCL backend (SPIR) and the CUDA backend (PTX), the Kernel IR is lowered for CPUs using the Whole Function Vectorizer (WFV) [1] and MCJIT, targeting Intel, AMD, IBM and other CPUs.]

  • PACXX is extended by a native CPU backend
  • The Kernel IR is modified to be runnable on a CPU
  • Kernels are vectorized by the Whole Function Vectorizer (WFV) [1]
  • MCJIT compiles the kernels and TBB executes them in parallel (see the sketch below)
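A minimal sketch of the execution model in the last bullet, assuming a hypothetical function-pointer type for a JIT-compiled kernel (this is not PACXX's actual interface):

#include <tbb/parallel_for.h>

// Hypothetical: a JIT-compiled kernel that executes one work-group.
using kernel_fn = void (*)(int block_id);

void run_kernel_on_cpu(kernel_fn kernel, int num_blocks) {
  // TBB distributes the work-groups across the available CPU cores.
  tbb::parallel_for(0, num_blocks, [&](int block_id) {
    kernel(block_id);
  });
}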

[1] Karrenberg, Ralf, and Sebastian Hack. "Whole-Function Vectorization." CGO'11, pp. 141–150.

SLIDE 19

Benchmarks

range-v3 + PACXX vs. OpenCL on x86_64

[Bar chart: speedup of PACXX relative to Intel's OpenCL — vadd: −1.45%, saxpy: −13.94%, sum: −12.64%, dot: −20.02%, Mandelbrot: −36.48%]

  • Running on 2x Intel Xeon E5-2620 CPUs
  • Intel’s auto-vectorizer optimizes the OpenCL C code


SLIDE 20

Benchmarks

range-v3 + PACXX vs. OpenCL on x86_64

[Bar chart: speedup of PACXX relative to AMD's OpenCL for vadd, saxpy, sum, dot and Mandelbrot; visible labels: 58.12%, 58.45%, 153.82%]

  • Running on 2x Intel Xeon E5-2620 CPUs
  • The AMD OpenCL SDK has no auto-vectorizer
  • Barriers are very expensive in AMD's OpenCL implementation (speedup up to 126x for sum)


SLIDE 21

Benchmarks

range-v3 + PACXX vs. OpenMP on IBM Power8

[Bar chart: speedup of PACXX relative to IBM's OpenMP — vadd: +0.45%, saxpy: +35.09%, sum: −20.05%, dot: −42.95%, Mandelbrot: +17.97%]

  • Running on a PowerNV 8247-42L with 4x IBM Power8e CPUs
  • No OpenCL implementation from IBM available
  • Loops parallelized with #pragma omp parallel for simd
  • Compiled with XL C++ 13.1.5


SLIDE 22

Questions?
