Exploiting automatic vectorization to employ SPMD on SIMD registers
Stefan Sprenger (sprengsz@informatik.hu-berlin.de)
Steffen Zeuch (steffen.zeuch@dfki.de)
Ulf Leser (leser@informatik.hu-berlin.de)
HardBD & Active ’18, April 16, 2018
Agenda
• SIMD and SPMD
• Automatic Vectorization vs. Intrinsics
• Intel SPMD Program Compiler
• Case Study: Column Scan
Sprenger, Zeuch, Leser: Exploiting automatic vectorization to employ SPMD on SIMD registers
Single Instruction Multiple Data (SIMD)
[Figure: element-wise SIMD addition, Input A (3 4 2 8) + Input B (6 2 3 1) = Result (9 6 5 9)]
• Process multiple data elements with one instruction
• Modern CPUs offer dedicated instructions executed on extra-wide registers
• Different instruction set architectures, e.g., SSE (128 bits), AVX (256 bits), AVX-512 (512 bits)
• Degree of parallelism of a SIMD instruction depends on how many data elements fit into one register, e.g., eight 32-bit ints fit into one 256-bit register
• Developers can use SIMD instructions through intrinsics or rely on compiler-based automatic vectorization
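The relationship between register width, element size, and degree of parallelism can be sketched in a few lines of C++. This is only an illustrative model: the helper names `lanes` and `add_lanes` are made up for this example, and the addition is written as a scalar loop that stands in for what a single SIMD instruction would do on real hardware.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of elements of type T that fit into a register of the given width.
// (Hypothetical helper, not part of any SIMD library.)
template <typename T>
constexpr std::size_t lanes(std::size_t register_bits) {
    return register_bits / (sizeof(T) * 8);
}

// The example from the slide: eight 32-bit ints fit into one 256-bit register.
static_assert(lanes<std::int32_t>(256) == 8, "8 x 32 bits = 256 bits");

// Scalar model of a 4-lane SIMD addition. SIMD hardware would perform
// all four additions with one instruction instead of a loop.
std::array<int, 4> add_lanes(const std::array<int, 4>& a,
                             const std::array<int, 4>& b) {
    std::array<int, 4> result{};
    for (std::size_t lane = 0; lane < result.size(); ++lane) {
        result[lane] = a[lane] + b[lane];
    }
    return result;
}
```

Under this model, narrower elements mean more lanes per register, which is why the achievable parallelism depends on the data type as well as on the instruction set.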
Single Program Multiple Data (SPMD)
A single program that appears to be serial is deployed onto multiple independent processing units (processors). The program instances are concurrently executed on different subsets of the data.

    void square(int a[], int b[], int n) {
      for (int i=0; i<n; ++i) {
        b[i] = a[i] * a[i];
      }
    }

[Figure: the program and the input data are deployed onto three processors, each of which runs its own instance of square()]
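As a rough illustration of the SPMD idea, the sketch below runs the same serial-looking `square` routine once per program instance, each time on a different contiguous subset of the input. The driver `square_spmd` and its chunking scheme are assumptions made for this example, and the instances run one after another here, whereas a real SPMD system would execute them concurrently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// The "single program": it looks serial and only touches [begin, end).
void square(const std::vector<int>& a, std::vector<int>& b,
            std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) {
        b[i] = a[i] * a[i];
    }
}

// Hypothetical driver: gives every program instance its own contiguous
// chunk of the data. Instances are simulated sequentially in this sketch.
std::vector<int> square_spmd(const std::vector<int>& a, std::size_t instances) {
    assert(instances > 0);
    std::vector<int> b(a.size());
    const std::size_t chunk = (a.size() + instances - 1) / instances;
    for (std::size_t id = 0; id < instances; ++id) {
        const std::size_t begin = std::min(id * chunk, a.size());
        const std::size_t end = std::min(begin + chunk, a.size());
        square(a, b, begin, end);  // same program, different data subset
    }
    return b;
}
```

Because no instance reads or writes another instance's chunk, the instances are independent, which is what allows them to run in parallel.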
Automatic Vectorization
• Recent versions of compilers support automatic vectorization
• For instance, they accelerate scalar for loops with SIMD instructions
• Works only for simple algorithms
• Lacks support for recent instruction set architectures
• Cannot compete with intrinsics code manually tuned by (experienced) developers
[Figure taken from: Pohl et al.: “An Evaluation of Current SIMD Programming Models for C++” (WPMVP, 2016)]
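To make "simple algorithms" more concrete, the sketch below contrasts two loops. The first has independent iterations and is the kind of scalar loop an auto-vectorizer can usually map onto SIMD lanes; the second carries a dependency from one iteration to the next, which makes a straightforward lane-wise mapping impossible. Both functions are illustrative examples and do not come from the talk, and whether a given compiler actually vectorizes either loop depends on the compiler, its version, and the optimization flags.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Independent iterations: c[i] depends only on a[i] and b[i],
// so several iterations can be computed at once.
std::vector<int> add_arrays(const std::vector<int>& a,
                            const std::vector<int>& b) {
    assert(a.size() == b.size());
    std::vector<int> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        c[i] = a[i] + b[i];
    }
    return c;
}

// Loop-carried dependency: every iteration needs the running total
// produced by the previous one.
std::vector<int> prefix_sum(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        running += in[i];
        out[i] = running;
    }
    return out;
}
```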
Limitations of SIMD Intrinsics

    // Broadcast 32-bit floating-point value a to all elements of dst.
    __m256 _mm256_set1_ps (float a);

• Require low-level hardware knowledge
• Specific to the underlying instruction set architecture, e.g., AVX
• Specific to the processed data type, e.g., float
• Result in hard-to-maintain code when supporting different hardware architectures or data types
• Limited forward compatibility with newer instruction sets
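The data-type and instruction-set specificity can be seen by writing the same broadcast for two element types. In the sketch below, the float version uses `_mm256_set1_ps` with the `__m256` register type, while the 32-bit integer version needs the different intrinsic `_mm256_set1_epi32` and the different register type `__m256i`. The wrapper functions `broadcast_float` and `broadcast_int32` are names invented for this example, and the `#if defined(__AVX__)` guard with a scalar fallback is an assumption added so that the code also builds when AVX is not enabled.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Broadcast a float to eight lanes (AVX: __m256 / _mm256_set1_ps).
std::array<float, 8> broadcast_float(float v) {
    std::array<float, 8> out{};
#if defined(__AVX__)
    __m256 reg = _mm256_set1_ps(v);
    _mm256_storeu_ps(out.data(), reg);
#else
    out.fill(v);  // scalar fallback when AVX is not enabled at compile time
#endif
    return out;
}

// Same operation for 32-bit integers: a different register type
// (__m256i) and a different intrinsic (_mm256_set1_epi32) are required.
std::array<std::int32_t, 8> broadcast_int32(std::int32_t v) {
    std::array<std::int32_t, 8> out{};
#if defined(__AVX__)
    __m256i reg = _mm256_set1_epi32(v);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out.data()), reg);
#else
    out.fill(v);  // scalar fallback when AVX is not enabled at compile time
#endif
    return out;
}
```

Supporting another register width or key type would mean yet another pair of functions, which is the maintenance burden the slide refers to.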
Intel SPMD Program Compiler (ispc)
• Deploys the SPMD execution model on the SIMD registers of modern CPUs
• Program instances are mapped onto SIMD lanes
• Extension of the C programming language with a few new features that facilitate writing high-performance SPMD programs
• Programs compiled with ispc can be directly called from C/C++
• Supports current CPU and instruction set architectures
  • x86, x86-64, Xeon Phi, ARM
  • SSE 2/4, AVX, AVX2, AVX512, NEON, …
• Allows multi-threading to be used in addition to SIMD parallelism
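The mapping of program instances onto SIMD lanes can be pictured with a small conceptual model in plain C++. This is not ispc syntax and not a description of the code ispc generates: the gang size of eight (eight 32-bit lanes in a 256-bit register), the name `square_gang`, and the `active` flag that stands in for a per-lane execution mask are all assumptions of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed gang size: eight 32-bit lanes in one 256-bit register.
constexpr std::size_t kGangSize = 8;

// Conceptual model: each lane acts as one program instance. In every
// step the gang handles kGangSize consecutive elements, and lane k
// works on element base + k.
std::vector<int> square_gang(const std::vector<int>& a) {
    std::vector<int> b(a.size());
    for (std::size_t base = 0; base < a.size(); base += kGangSize) {
        for (std::size_t lane = 0; lane < kGangSize; ++lane) {
            const std::size_t i = base + lane;
            // Lanes that would run past the end of the data stay idle.
            const bool active = i < a.size();
            if (active) {
                b[i] = a[i] * a[i];
            }
        }
    }
    return b;
}
```

In this picture the inner loop corresponds to what the hardware does in one vector step, so the programmer only writes the per-element logic.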
Integrating ispc into your C/C++ project
[Figure: build workflow. C/C++ code (e.g., a main() that includes the header "ispcscan.h") is compiled with `$ g++ -c -o …`; ispc code (e.g., a scan(data, results, lower, upper) kernel) is compiled with `$ ispc -o … -h …`; the resulting object files are linked to create the executable.]
Experimental Setup
• Scalar, Intrinsics-based, and ispc-based column scan
• Branching and branch-free scan variants
• 1 GB of synthetic keys generated with std::rand()
• Synthetic range scans of varying selectivity
  • lower bound: random, existing key
  • upper bound: lower bound + selectivity * domain
• Server machine equipped with Intel Xeon E5-2620 (2 GHz clock rate, 256-bit wide SIMD registers, AVX) and 32 GB of main memory
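The difference between the branching and branch-free scan variants can be sketched in scalar C++ as follows. This is a minimal illustration and not the code used in the experiments: the function names, the choice to store matching positions in `results`, and the returned match count are assumptions. The helper `upper_bound_for` follows the formula from the setup (lower bound + selectivity * domain), where interpreting `domain` as the size of the key domain and clamping to the largest 32-bit key are further assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Branching variant: a position is stored only if both comparisons hold.
std::size_t scan_branching(const std::vector<std::uint32_t>& data,
                           std::uint32_t lower, std::uint32_t upper,
                           std::vector<std::size_t>& results) {
    results.resize(data.size());
    std::size_t count = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (data[i] >= lower) {
            if (data[i] <= upper) {
                results[count++] = i;
            }
        }
    }
    results.resize(count);
    return count;
}

// Branch-free variant: the position is always written, and the output
// cursor advances by the predicate result (0 or 1).
std::size_t scan_branch_free(const std::vector<std::uint32_t>& data,
                             std::uint32_t lower, std::uint32_t upper,
                             std::vector<std::size_t>& results) {
    results.resize(data.size());
    std::size_t count = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        results[count] = i;  // safe: count <= i at this point
        count += static_cast<std::size_t>((data[i] >= lower) &
                                          (data[i] <= upper));
    }
    results.resize(count);
    return count;
}

// upper bound = lower bound + selectivity * domain
// (clamped to the key range; the clamp is an assumption of this sketch).
std::uint32_t upper_bound_for(std::uint32_t lower, double selectivity,
                              std::uint32_t domain) {
    const double upper = lower + selectivity * domain;
    const double max_key = std::numeric_limits<std::uint32_t>::max();
    return static_cast<std::uint32_t>(std::min(upper, max_key));
}
```

The branch-free variant trades extra writes for a predictable instruction stream, so its cost does not depend on how often the predicate holds, whereas the branching variant is sensitive to branch mispredictions at intermediate selectivities.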
ispc vs. Intrinsics vs. Scalar (4-byte unsigned int keys)
[Figure: throughput (GB/sec, axis from 0 to 7) over query selectivity (0% to 100%) for six scan variants: Intrinsics (Branch-Free), Intrinsics (Branches), ispc (Branch-Free), ispc (Branches), Scalar (Branch-Free), Scalar (Branches)]
ispc vs. Intrinsics vs. Scalar (4-byte unsigned int keys)
[Figure: throughput (GB/sec, axis from 0 to 7) over query selectivity (0% to 100%) for the branching variants. Annotated average speedups: Intrinsics (Branches) over Scalar (Branches) 6.89X, Intrinsics (Branches) over ispc (Branches) 1.80X, ispc (Branches) over Scalar (Branches) 3.82X]
ispc vs. Intrinsics vs. Scalar (4-byte unsigned int keys)
[Figure: throughput (GB/sec, axis from 0 to 7) over query selectivity (0% to 100%) for the branch-free variants. Annotated speedups: Intrinsics (Branch-Free) over Scalar (Branch-Free) 2.16X, Intrinsics (Branch-Free) over ispc (Branch-Free) 1.46X, ispc (Branch-Free) over Scalar (Branch-Free) 1.48X]
Impact of Key Size on Performance of ispc-based scan
[Figure: speedup over scalar execution (axis from 0 to 7) by key size (8, 16, 32, and 64 bits) for the variants With Branches and Branch-Free]
Impact of Key Type on Performance of ispc-based scan
[Figure: speedup over scalar execution (axis from 0 to 4) by key type (unsigned int32, signed int32, float) for the variants With Branches and Branch-Free]
Code Complexity
[Figure: lines of code (axis from 0 to 50) for the Scalar, ispc, and Intrinsics implementations, each With Branches and Branch-Free]
Next Steps
• Investigate more complex database algorithms, e.g., joins, hashing, or bloom filters
• Run experiments on many-core CPUs (70+ cores, 4-way hyperthreading, AVX-512) and compare performance to modern GPUs
• Compare to other approaches to automatic vectorization, e.g., OpenCL, CilkPlus, and OpenMP
Summary
• ispc overcomes the limitations of SIMD Intrinsics
• We compared branch-free and branching variants of an SPMD-based column scan with a scalar implementation and manually tuned Intrinsics code
• ispc achieves notable speedups over scalar implementations; however, manually tuned Intrinsics code is still slightly more efficient
[Figure: Performance vs. Convenience diagram positioning Intrinsics, SPMD on SIMD, and Automatic Vectorization]