SLIDE 1

Bringing Next Generation C++ to GPUs

Michael Haidl¹, Michel Steuwer², Lars Klein¹ and Sergei Gorlatch¹

¹ University of Muenster, Germany   ² University of Edinburgh, UK

SLIDE 2

The Problem: Dot Product

std::vector<int> a(N), b(N), tmp(N);
std::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
               std::multiplies<int>());
auto result = std::accumulate(tmp.begin(), tmp.end(), 0,
                              std::plus<int>());

  • The STL is the C++ programmer's Swiss Army knife
  • STL containers, iterators and algorithms introduce a high level of abstraction
  • Since C++17 it is also parallel (see the sketch below)
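For illustration, a minimal sketch of the same dot product with the C++17 parallel algorithms, assuming a standard library that ships <execution>; std::transform_reduce fuses both steps and may execute in parallel:

#include <execution>  // std::execution::par
#include <numeric>    // std::transform_reduce
#include <vector>

int dot_par(const std::vector<int>& a, const std::vector<int>& b) {
  // Element-wise multiply and sum in a single, potentially parallel pass.
  return std::transform_reduce(std::execution::par,
                               a.begin(), a.end(), b.begin(), 0);
}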


SLIDE 3

The Problem: Dot Product

std::vector<int> a(N), b(N), tmp(N);
std::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
               std::multiplies<int>());
auto result = std::accumulate(tmp.begin(), tmp.end(), 0,
                              std::plus<int>());

template<typename T>
auto mult(const std::vector<T>& a, const std::vector<T>& b) {
  std::vector<T> tmp(a.size());
  std::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
                 std::multiplies<T>());
  return tmp;
}

template<typename T>
auto sum(const std::vector<T>& a) {
  return std::accumulate(a.begin(), a.end(), T(), std::plus<T>());
}

(mult defined in f1.hpp, sum in f2.hpp)


SLIDE 4

The Problem: Dot Product

std::vector<int> a(N), b(N);
auto result = sum(mult(a, b));

template<typename T>
auto mult(const std::vector<T>& a, const std::vector<T>& b) {
  std::vector<T> tmp(a.size());
  std::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
                 std::multiplies<T>());
  return tmp;
}

template<typename T>
auto sum(const std::vector<T>& a) {
  return std::accumulate(a.begin(), a.end(), T(), std::plus<T>());
}

(mult defined in f1.hpp, sum in f2.hpp)


SLIDE 5

The Problem: Dot Product

std::vector<int> a(N), b(N);
auto result = sum(mult(a, b));

Performance:

  • vectors of size 25600000
  • Clang/LLVM 5.0.0svn -O3 optimized

[Bar chart: runtime in ms for transform/accumulate, transform/accumulate* and inner_product; * LLVM patched with extended D17386 (loop fusion)]
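For reference, the inner_product variant in the chart above fuses both loops into a single pass; a minimal sketch:

#include <numeric>  // std::inner_product
#include <vector>

int dot(const std::vector<int>& a, const std::vector<int>& b) {
  // One pass over the data, no temporary vector tmp.
  return std::inner_product(a.begin(), a.end(), b.begin(), 0);
}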

SLIDE 6

The Problem: Dot Product on GPUs

thrust::device_vector<int> a(N), b(N), tmp(N);
thrust::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
                  thrust::multiplies<int>());
auto result = thrust::reduce(tmp.begin(), tmp.end(), 0,
                             thrust::plus<int>());

  • Highly tuned STL-like library for GPU programming
  • Thrust offers containers, iterators and algorithms
  • Based on CUDA


SLIDE 7

The Problem: Dot Product on GPUs

thrust::device_vector<int> a(N), b(N), tmp(N);
thrust::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
                  thrust::multiplies<int>());
auto result = thrust::reduce(tmp.begin(), tmp.end(), 0,
                             thrust::plus<int>());

  • Highly tuned STL-like library for GPU programming
  • Thrust offers containers, iterators and algorithms
  • Based on CUDA

Same Experiment:

  • nvcc -O3 (from CUDA 8.0)

[Bar chart: runtime in ms for transform/reduce and inner_product]
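The inner_product bar in the chart above corresponds to Thrust's fused variant; a minimal sketch, assuming Thrust's inner_product header:

#include <thrust/device_vector.h>
#include <thrust/inner_product.h>

int dot_gpu(const thrust::device_vector<int>& a,
            const thrust::device_vector<int>& b) {
  // A single fused kernel instead of transform followed by reduce.
  return thrust::inner_product(a.begin(), a.end(), b.begin(), 0);
}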

SLIDE 8

The Next Generation: Ranges for the STL

  • range-v3 prototype implementation by E. Niebler
  • Proposed as N4560 for the C++ Standard

std::vector<int> a(N), b(N);
auto mult = [](auto tpl) { return get<0>(tpl) * get<1>(tpl); };
auto result = accumulate(view::transform(view::zip(a, b), mult), 0);

SLIDE 9

The Next Generation: Ranges for the STL

std::vector<int> a(N), b(N);
auto mult = [](auto tpl) { return get<0>(tpl) * get<1>(tpl); };
auto result = accumulate(view::transform(view::zip(a, b), mult), 0);

Performance?

  • Clang/LLVM 5.0.0svn -O3 optimized

[Bar chart: runtime in ms for transform/accumulate and inner_product]

SLIDE 10

The Next Generation: Ranges for the STL

std::vector<int> a(N), b(N);
auto mult = [](auto tpl) { return get<0>(tpl) * get<1>(tpl); };
auto result = accumulate(view::transform(view::zip(a, b), mult), 0);

  • Views describe lazy, non-mutating operations on ranges
  • Evaluation happens inside an algorithm (e.g., accumulate)
  • Fusion is guaranteed by the implementation (sketched below)
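Roughly, the fused evaluation of the pipeline above corresponds to a single loop; an illustrative sketch, not the actual generated code:

int result = 0;
for (std::size_t i = 0; i < a.size(); ++i) {
  // zip and transform are evaluated lazily inside accumulate:
  // one pass over a and b, no temporary container.
  result += a[i] * b[i];
}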


SLIDE 11

Ranges for GPUs

  • Extended range-v3 with GPU-enabled container and algorithms
  • Original code of range-v3 remains unmodified

std::vector<int> a(N), b(N);
auto mult = [](auto tpl) { return get<0>(tpl) * get<1>(tpl); };
auto ga = gpu::copy(a);
auto gb = gpu::copy(b);
auto result = gpu::reduce(view::transform(view::zip(ga, gb), mult), 0);

SLIDE 12

Programming Accelerators with C++ (PACXX)

[Architecture diagram: C++ source (e.g., a std::fill implementation) is compiled in the offline stage by the PACXX Offline Compiler (Clang frontend, LLVM, libc++) into an executable with embedded LLVM IR; in the online stage the PACXX Runtime's LLVM-based online compiler lowers the IR either through the OpenCL backend to SPIR (OpenCL Runtime: AMD GPU, Intel MIC) or through the NVPTX/CUDA backend to PTX (CUDA Runtime: Nvidia GPU).]

  • Based entirely on LLVM / Clang
  • Supports C++14 for GPU Programming
  • Just-In-Time Compilation of LLVM IR for target accelerators


SLIDE 13

Multi-Stage Programming

template <typename InRng, typename T, typename Fun>
auto reduce(InRng&& in, T init, Fun&& fun) {
  // 1. preparation of kernel call
  ...
  // 2. create GPU kernel
  auto kernel = pacxx::kernel(
      [fun](auto&& in, auto&& out, int size, auto init) {
        // 2a. stage elements per thread
        auto ept = stage([&] { return size / get_block_size(0); });
        // 2b. start reduction computation
        auto sum = init;
        for (int x = 0; x < ept; ++x) {
          sum = fun(sum, *(in + gid));
          gid += glbSize;
        }
        // 2c. perform reduction in shared memory
        ...
        // 2d. write result back
        if (lid == 0) *(out + bid) = shared[0];
      },
      blocks, threads);
  // 3. execute kernel
  kernel(in, out, distance(in), init);
  // 4. finish reduction on the CPU
  return std::accumulate(out, init, fun);
}

SLIDE 14

MSP Integration into PACXX

[Architecture diagram, extended with MSP: the PACXX Offline Compiler (Clang frontend, LLVM, libc++) embeds both an MSP IR and a Kernel IR into the executable; at run time the MSP Engine and the PACXX Runtime's LLVM-based online compiler lower the Kernel IR through the OpenCL backend to SPIR (OpenCL Runtime: AMD GPU, Intel MIC) or through the NVPTX/CUDA backend to PTX (CUDA Runtime: Nvidia GPU).]

  • The MSP Engine JIT compiles the MSP IR,
  • evaluates stage prior to a kernel launch, and
  • replaces the calls to stage in the kernel's IR with the results.
  • This enables more optimizations (e.g., loop unrolling) in the online stage, as illustrated below.
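As an illustration of the last bullet, consider the staged loop from the reduce kernel on the previous slide; the concrete numbers below are made up for illustration and do not come from the slides:

// Before MSP: the trip count is only known at run time.
auto ept = stage([&] { return size / get_block_size(0); });
for (int x = 0; x < ept; ++x) { /* ... */ }

// After MSP, e.g., with size == (1 << 20) and get_block_size(0) == 1024:
// the stage call has been replaced by the evaluated constant, so the
// online compiler sees a fixed trip count and can fully unroll the loop.
for (int x = 0; x < 1024; ++x) { /* ... */ }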


SLIDE 15

Performance Impact of MSP

gpu::reduce on Nvidia K20c

[Line plot: speedup over the non-MSP version vs. input size (2^15 to 2^25) for Dot, Sum, Dot +MS and Sum +MS; y-axis from 0.9 to 1.4]

Up to 35% better performance compared to the non-MSP version


SLIDE 16

Just-In-Time Compilation Overhead

Comparing MSP in PACXX with Nvidia’s nvrtc library

[Bar chart: compilation time in ms for the Dot and Sum kernels with CUDA 7.5, CUDA 8.0 RC and PACXX]

10 to 20 times faster because no front-end actions are performed at run time.


SLIDE 17

Benchmarks

range-v3 + PACXX vs. Nvidia’s Thrust

[Bar chart: speedup of PACXX relative to Thrust — vadd: +8.37%, saxpy: +1.61%, sum: +82.32%, dot: +73.31%, Monte Carlo: +33.6%, Mandelbrot: −3.13%, Voronoi: −7.38%]

  • Evaluated on a Nvidia K20c GPU
  • 11 different input sizes
  • 1000 runs for each benchmark

Competitive performance with a composable GPU programming API


SLIDE 18

Going Native: Work in Progress

[Architecture diagram, extended with the native CPU backend: in addition to the OpenCL backend (SPIR) and the CUDA backend (PTX), the Kernel IR is lowered for CPUs using the Whole Function Vectorizer (WFV) [1] and MCJIT, targeting Intel, AMD, IBM and other CPUs.]

  • PACXX is extended by a native CPU backend
  • The Kernel IR is modified to be runnable on a CPU
  • Kernels are vectorized by the Whole Function Vectorizer (WFV) [1]
  • MCJIT compiles the kernels and TBB executes them in parallel (see the sketch below)
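A minimal sketch of the execution model in the last bullet, assuming a hypothetical function-pointer type for a JIT-compiled kernel (this is not PACXX's actual interface):

#include <tbb/parallel_for.h>

// Hypothetical: a JIT-compiled kernel that executes one work-group.
using kernel_fn = void (*)(int block_id);

void run_kernel_on_cpu(kernel_fn kernel, int num_blocks) {
  // TBB distributes the work-groups across the available CPU cores.
  tbb::parallel_for(0, num_blocks, [&](int block_id) {
    kernel(block_id);
  });
}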

[1] Karrenberg, Ralf, and Sebastian Hack. "Whole-Function Vectorization." CGO'11, pp. 141–150.

SLIDE 19

Benchmarks

range-v3 + PACXX vs. OpenCL on x86_64

[Bar chart: speedup of PACXX relative to Intel's OpenCL — vadd: −1.45%, saxpy: −13.94%, sum: −12.64%, dot: −20.02%, Mandelbrot: −36.48%]

  • Running on 2x Intel Xeon E5-2620 CPUs
  • Intel’s auto-vectorizer optimizes the OpenCL C code


SLIDE 20

Benchmarks

range-v3 + PACXX vs. OpenCL on x86_64

[Bar chart: speedup of PACXX relative to AMD's OpenCL for vadd, saxpy, sum, dot and Mandelbrot; visible labels: 58.12%, 58.45%, 153.82%]

  • Running on 2x Intel Xeon E5-2620 CPUs
  • The AMD OpenCL SDK has no auto-vectorizer
  • Barriers are very expensive in AMD's OpenCL implementation (speedup up to 126x for sum)


SLIDE 21

Benchmarks

range-v3 + PACXX vs. OpenMP on IBM Power8

[Bar chart: speedup of PACXX relative to IBM's OpenMP — vadd: +0.45%, saxpy: +35.09%, sum: −20.05%, dot: −42.95%, Mandelbrot: +17.97%]

  • Running on a PowerNV 8247-42L with 4x IBM Power8e CPUs
  • No OpenCL implementation from IBM available
  • Loops parallelized with #pragma omp parallel for simd
  • Compiled with XL C++ 13.1.5


SLIDE 22

Questions?
