Bringing Next Generation C++ to GPUs
Michael Haidl¹, Michel Steuwer², Lars Klein¹ and Sergei Gorlatch¹
¹University of Muenster, Germany   ²University of Edinburgh, UK
Dot product in standard C++: std::transform multiplies the vectors elementwise into the temporary tmp, and std::accumulate sums it up.

std::vector<int> a(N), b(N), tmp(N);
std::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
               std::multiplies<int>());
auto result = std::accumulate(tmp.begin(), tmp.end(), 0,
                              std::plus<int>());
The same computation factored into reusable, generic building blocks:

template<typename T>
auto mult(const std::vector<T>& a, const std::vector<T>& b) {
  std::vector<T> tmp(a.size());
  std::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
                 std::multiplies<T>());
  return tmp;
}

template<typename T>
auto sum(const std::vector<T>& a) {
  return std::accumulate(a.begin(), a.end(), T(), std::plus<T>());
}
The composition reads naturally, but mult still materializes the temporary vector:

std::vector<int> a(N), b(N);
auto result = sum(mult(a, b));
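What one would like the compiler to produce instead is the single fused loop a programmer would write by hand; a minimal sketch of the fused form (one pass, no temporary):

// fused dot product: multiply and accumulate in a single pass
int result = 0;
for (std::size_t i = 0; i < a.size(); ++i)
  result += a[i] * b[i];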
[Plot: Runtime (ms, 20–40) comparing transform/accumulate, transform/accumulate*, and inner_product. * LLVM patched with extended D17386 (loop fusion).]
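The fused baseline in the plot is std::inner_product, which performs the multiplication and the accumulation in one pass without a temporary; a minimal sketch:

#include <numeric>
#include <vector>

std::vector<int> a(N), b(N);
// multiply-accumulate in a single pass: no temporary vector
auto result = std::inner_product(a.begin(), a.end(), b.begin(), 0);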
The same pattern on the GPU with Thrust, including the temporary device vector:

thrust::device_vector<int> a(N), b(N), tmp(N);
thrust::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
                  thrust::multiplies<int>());
auto result = thrust::reduce(tmp.begin(), tmp.end(), 0,
                             thrust::plus<int>());
[Plot: Runtime (ms, 10–30) comparing Thrust transform/reduce with inner_product.]
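Thrust's fused counterpart is thrust::inner_product, which folds the multiplication into the reduction and avoids the temporary device vector; a minimal sketch:

#include <thrust/device_vector.h>
#include <thrust/inner_product.h>

thrust::device_vector<int> a(N), b(N);
// fused multiply and reduce: no tmp
auto result = thrust::inner_product(a.begin(), a.end(), b.begin(), 0);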
With ranges, views compose lazily: view::zip and view::transform build no intermediate vector, and accumulate consumes the fused range in one pass.

std::vector<int> a(N), b(N);
auto mult = [](auto tpl) { return get<0>(tpl) * get<1>(tpl); };
auto result = accumulate(view::transform(view::zip(a, b), mult), 0);
[Plot: Runtime (ms, 20–40) comparing the range-based transform/accumulate with inner_product.]
The same range-based style on the GPU: gpu::copy transfers the vectors to the device, and gpu::reduce evaluates the lazy range expression on the GPU.

std::vector<int> a(N), b(N);
auto mult = [](auto tpl) { return get<0>(tpl) * get<1>(tpl); };
auto ga = gpu::copy(a);
auto gb = gpu::copy(b);
auto result = gpu::reduce(view::transform(view::zip(ga, gb), mult), 0);
[Diagram: PACXX compilation flow. Offline stage: ordinary C++ source is compiled against LLVM libc++ by the Clang frontend of the PACXX Offline Compiler into an executable that carries LLVM IR. Online stage: the LLVM-based online compiler in the PACXX Runtime lowers this IR either through the OpenCL backend (LLVM IR to SPIR) to the OpenCL Runtime (AMD GPU, Intel MIC) or through the CUDA backend (NVPTX, PTX) to the CUDA Runtime (Nvidia GPU). The pictured input is a libc++-style std::fill, underscoring that unmodified standard C++ is what gets compiled:]

#include <algorithm>
#include <vector>
#include <iostream>

template<class ForwardIt, class T>
void fill(ForwardIt first, ForwardIt last, const T& value) {
  for (; first != last; ++first)
    *first = value;
}
A sketch of how a generic reduce is implemented in PACXX; stage marks an expression that multi-staging evaluates at kernel-launch time:

template <typename InRng, typename T, typename Fun>
auto reduce(InRng&& in, T init, Fun&& fun) {
  // 1. preparation of kernel call ...
  // 2. create GPU kernel
  auto kernel = pacxx::kernel(
      [fun](auto&& in, auto&& out, int size, auto init) {
        // 2a. stage elements per thread
        auto ept = stage([&] { return size / get_block_size(0); });
        // 2b. start reduction computation
        auto sum = init;
        for (int x = 0; x < ept; ++x) {
          sum = fun(sum, *(in + gid)); // gid/glbSize: global id and size
          gid += glbSize;
        }
        // 2c. perform reduction in shared memory ...
        // 2d. write result back
        if (lid == 0) *(out + bid) = shared[0];
      },
      blocks, threads);
  // 3. execute kernel
  kernel(in, out, distance(in), init);
  // 4. finish reduction on the CPU over the per-block partial results
  return std::accumulate(out, out + blocks, init, fun);
}
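Step 2c is elided above; at that point a standard shared-memory tree reduction is the usual choice. A sketch of that pattern (not PACXX's exact code; shared, lid, sum, and fun come from the surrounding kernel, and barrier() is an assumed block-wide synchronization primitive):

// tree reduction over the block's partial sums
// (block size assumed to be a power of two)
shared[lid] = sum;
barrier();
for (int offset = get_block_size(0) / 2; offset > 0; offset /= 2) {
  if (lid < offset)
    shared[lid] = fun(shared[lid], shared[lid + offset]);
  barrier();
}
// shared[0] now holds this block's result, written back in step 2d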
[Diagram: PACXX compilation flow extended with multi-stage programming. The offline compiler (Clang frontend, LLVM libc++) now embeds two IR modules into the executable: MSP IR for the staged expressions and KERNEL IR for the GPU code. At run time the MSP Engine evaluates the MSP IR and specializes the KERNEL IR with the resulting values before the PACXX Runtime lowers it through the OpenCL backend (LLVM IR to SPIR) or the CUDA backend (NVPTX, PTX) to the OpenCL Runtime (AMD GPU, Intel MIC) or the CUDA Runtime (Nvidia GPU).]
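The payoff of this specialization is analogous to what a compile-time constant does for an ordinary C++ loop; a minimal illustration of the idea (an analogy in plain C++, not the PACXX API):

// With EPT fixed at compile time the compiler can fully unroll the loop;
// the MSP Engine achieves a similar effect at kernel-launch time by
// embedding the staged value into the KERNEL IR before JIT compilation.
template <int EPT, typename T, typename Fun>
T reduce_fixed_trip_count(const T* in, T init, Fun fun) {
  T sum = init;
  for (int x = 0; x < EPT; ++x)
    sum = fun(sum, in[x]);
  return sum;
}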
[Plot: Speedup (0.9–1.4) over input sizes 2^15 to 2^25 for Dot and Sum, each with and without multi-staging (Dot +MS, Sum +MS).]
[Plot: Compilation time (ms, 50–450) for Dot and Sum under CUDA 7.5, CUDA 8.0 RC, and PACXX.]
[Plot: PACXX performance relative to Thrust (speedup 0.5–2) for vadd, saxpy, sum, dot, Monte Carlo, Mandelbrot, and Voronoi: +8.37%, +1.61%, +82.32%, +73.31%, +33.6%, −3.13%, −7.38%, respectively.]
[Diagram: PACXX compilation flow extended with a native CPU backend: besides the OpenCL backend (LLVM IR to SPIR; AMD GPU, Intel MIC) and the CUDA backend (NVPTX, PTX; Nvidia GPU), the PACXX Runtime JIT-compiles the KERNEL IR for CPUs (Intel / AMD / IBM ...) using Whole-Function Vectorization (WFV [1]) and MCJIT.]

[1] Karrenberg, Ralf, and Sebastian Hack. "Whole-Function Vectorization." CGO'11, pp. 141–150.
[Plot: PACXX performance relative to OpenCL on Intel CPUs (speedup 0.5–2) for vadd, saxpy, sum, dot, and Mandelbrot: −1.45%, −13.94%, −12.64%, −20.02%, −36.48%, respectively.]
[Plot: PACXX performance relative to OpenCL on AMD CPUs (speedup 1–3) for vadd, saxpy, sum, dot, and Mandelbrot; labeled gains include +58.12%, +58.45%, and +153.82%.]
[Plot: PACXX performance relative to OpenMP on IBM CPUs (speedup 0.5–2) for vadd, saxpy, sum, dot, and Mandelbrot: +0.45%, +35.09%, −20.05%, −42.95%, +17.97%, respectively.]