Exploiting automatic vectorization to employ SPMD on SIMD registers
Stefan Sprenger (sprengsz@informatik.hu-berlin.de) Steffen Zeuch (steffen.zeuch@dfki.de) Ulf Leser (leser@informatik.hu-berlin.de)
HardBD & Active ’18 April 16, 2018
Sprenger, Zeuch, Leser: Exploiting automatic vectorization to employ SPMD on SIMD registers
registers
Bits), AVX-512 (512 Bits)
data elements fit into one register, e.g., eight 32-bit ints fit into one 256-bit register
compiler-based automatic vectorization
Figure: element-wise SIMD addition. Input A [3 4 2 8] + Input B [6 2 3 1] = Result [9 6 5 9]
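The register arithmetic and the element-wise addition above can be sketched in plain C++. This is a minimal illustration and not code from the talk: `lanes_per_register` and `add_lanes` are hypothetical helper names, and whether the loop is actually compiled to a single SIMD add depends on the compiler and its optimization flags.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of elements (lanes) that fit into one
// SIMD register. Both widths are given in bits.
inline std::size_t lanes_per_register(std::size_t register_bits,
                                      std::size_t element_bits) {
    assert(element_bits != 0 && register_bits % element_bits == 0);
    return register_bits / element_bits;
}

// Hypothetical helper: element-wise addition over N lanes. The loop has
// a fixed trip count and no dependencies between iterations, which is
// the shape of loop an auto-vectorizing compiler may turn into one
// SIMD add instruction.
template <std::size_t N>
std::array<int, N> add_lanes(const std::array<int, N>& a,
                             const std::array<int, N>& b) {
    std::array<int, N> result{};
    for (std::size_t i = 0; i < N; ++i) {
        result[i] = a[i] + b[i];
    }
    return result;
}
```

Calling `add_lanes` with the inputs from the figure, {3, 4, 2, 8} and {6, 2, 3, 1}, yields {9, 6, 5, 9}, and `lanes_per_register(256, 32)` yields 8.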
void square(const int *a, int *b, int n) {
    for (int i = 0; i < n; ++i) {
        b[i] = a[i] * a[i];  // square each element independently
    }
}
Figure: SPMD execution. The same program (square) is replicated across three processors; each processor runs it on its own input data.
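The single-program, multiple-data idea can be sketched as follows. This is a sequential simulation under stated assumptions, not the authors' code: `square_instance` and `square_spmd` are hypothetical names, and the instances run one after another here, whereas on real hardware they would run on separate processors or, with ispc, in the lanes of a SIMD register.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The "single program": every instance executes this same function but
// touches a different slice of the data, selected by its instance index.
void square_instance(const std::vector<int>& in, std::vector<int>& out,
                     std::size_t instance, std::size_t instance_count) {
    for (std::size_t i = instance; i < in.size(); i += instance_count) {
        out[i] = in[i] * in[i];
    }
}

// Sequential stand-in for launching all instances (an assumption made
// for illustration; a real SPMD runtime would run them concurrently).
std::vector<int> square_spmd(const std::vector<int>& in,
                             std::size_t instance_count) {
    assert(instance_count > 0);
    std::vector<int> out(in.size());
    for (std::size_t instance = 0; instance < instance_count; ++instance) {
        square_instance(in, out, instance, instance_count);
    }
    return out;
}
```

Because no instance reads or writes an element owned by another instance, the result does not depend on the order in which the instances run.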
support automatic vectorization
for loops with SIMD instructions
set architectures
code manually tuned by (experienced) developers
Figure taken from: Pohl et al.: “An Evaluation of Current SIMD Programming Models for C++” (WPMVP, 2016)
hardware architectures or data types
modern CPUs
that facilitate writing high-performance SPMD programs
Figure: ispc build workflow.
C/C++ code (for example, a main program that includes the header "ispcscan.h") is compiled to object files: $ g++ -c -o …
ispc code (for example, a scan(data, results, lower, upper) kernel) is compiled to object files and a header: $ ispc -o … -h …
Link and create executable.
rate, 256-bit wide SIMD registers, AVX) and 32 GB of main memory
Figure: Throughput (GB/sec) vs. query selectivity (0% to 100%) for six column scan variants: Scalar (Branches), Scalar (Branch-Free), ispc (Branches), ispc (Branch-Free), Intrinsics (Branches), Intrinsics (Branch-Free).
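The "Branches" and "Branch-Free" labels refer to how the range predicate of the scan is evaluated. A minimal scalar sketch of both variants follows. It rests on assumptions that the slides do not confirm: the result buffer is taken to hold the indices of matching elements, and the function names are hypothetical. "Branch-free" here describes the source code only; the instructions a compiler emits may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Branching variant: the write happens only if the predicate holds, so
// the loop body contains a data-dependent branch.
std::size_t scan_branching(const std::vector<int>& data,
                           std::vector<std::size_t>& results,
                           int lower, int upper) {
    assert(results.size() >= data.size());
    std::size_t count = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (data[i] >= lower && data[i] <= upper) {
            results[count++] = i;
        }
    }
    return count;
}

// Branch-free variant: the index is always written, and the output
// cursor advances by 0 or 1 depending on the predicate. A non-matching
// index is overwritten by the next iteration.
std::size_t scan_branch_free(const std::vector<int>& data,
                             std::vector<std::size_t>& results,
                             int lower, int upper) {
    assert(results.size() >= data.size());
    std::size_t count = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        results[count] = i;
        count += static_cast<std::size_t>((data[i] >= lower) &
                                          (data[i] <= upper));
    }
    return count;
}
```

Both functions return the number of matches and fill the first entries of `results` with the same indices. The branching variant does work proportional to the number of matches but is exposed to branch mispredictions, while the branch-free variant does the same work for every element regardless of selectivity.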
Figure: Throughput (GB/sec) vs. query selectivity (0% to 100%) for the branching variants: Scalar (Branches), ispc (Branches), Intrinsics (Branches). Annotated speedups: 3.82X, 6.89X, 1.80X.
Figure: Throughput (GB/sec) vs. query selectivity (0% to 100%) for the branch-free variants: Scalar (Branch-Free), ispc (Branch-Free), Intrinsics (Branch-Free). Annotated speedups: 1.48X, 2.16X, 1.46X.
Figure: Speedup over scalar execution by key size (8, 16, 32, and 64 bits), with branches and branch-free.
Figure: Speedup over scalar execution by key type (unsigned int32, signed int32, float), with branches and branch-free.
Figure: Lines of code for the Scalar, ispc, and Intrinsics implementations, with branches and branch-free.
hyperthreading, AVX-512) and compare performance to modern GPUs
OpenCL, CilkPlus, and OpenMP
SIMD Intrinsics
branching variants of an SPMD-based column scan with a scalar implementation and manually tuned Intrinsics code
scalar implementations; however, manually tuned Intrinsics code is still slightly more efficient
Figure: Performance vs. convenience for Intrinsics, Automatic Vectorization, and SPMD on SIMD.