
Exploiting automatic vectorization to employ SPMD on SIMD registers



  1. Exploiting automatic vectorization to employ SPMD on SIMD registers
     Stefan Sprenger (sprengsz@informatik.hu-berlin.de), Steffen Zeuch (steffen.zeuch@dfki.de), Ulf Leser (leser@informatik.hu-berlin.de)
     HardBD & Active ’18, April 16, 2018

  2. Agenda
     • SIMD and SPMD
     • Automatic Vectorization vs. Intrinsics
     • Intel SPMD Program Compiler
     • Case Study: Column Scan
     Sprenger, Zeuch, Leser: Exploiting automatic vectorization to employ SPMD on SIMD registers


  4. Single Instruction Multiple Data (SIMD)
     [Figure: element-wise addition of two four-element registers, Input A (3 4 2 8) + Input B (6 2 3 1) = Result (9 6 5 9)]
     • Process multiple data elements with one instruction
     • Modern CPUs offer dedicated instructions executed on extra-wide registers
     • Different instruction set architectures, e.g., SSE (128 bits), AVX (256 bits), AVX-512 (512 bits)
     • Degree of parallelism of a SIMD instruction depends on how many data elements fit into one register, e.g., eight 32-bit ints fit into one 256-bit register
     • Developers can use SIMD instructions through intrinsics or rely on compiler-based automatic vectorization
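The degree-of-parallelism rule on this slide is plain integer division. A minimal sketch follows; the helper name `lanes_per_register` is ours and does not appear in the slides.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the slides): how many elements of
// `element_bits` bits fit into one SIMD register of `register_bits` bits.
inline std::size_t lanes_per_register(std::size_t register_bits,
                                      std::size_t element_bits) {
    // The register width must be a whole multiple of the element width.
    assert(element_bits > 0 && register_bits % element_bits == 0);
    return register_bits / element_bits;
}
```

For 32-bit integers this gives 4 lanes for SSE (128 / 32), 8 lanes for AVX (256 / 32), and 16 lanes for AVX-512 (512 / 32).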

  5. Single Program Multiple Data (SPMD)
     A single program that appears to be serial is deployed onto multiple independent processing units (processors). The program instances are concurrently executed on different subsets of the data.

         void square(int a[], int b[], int n) {
             for (int i = 0; i < n; ++i) {
                 b[i] = a[i] * a[i];
             }
         }

     [Figure: the program above and the input data are distributed to three processors, each of which runs its own copy of square() on a subset of the data]
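The SPMD idea can be sketched in plain C++ by giving every program instance the same function and a different slice of the data. This is an illustration under assumptions, not code from the talk: the names `square_chunk` and `square_spmd` are ours, and the instances are invoked one after another only to keep the sketch portable, whereas real SPMD instances run concurrently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The "single program": squares the elements of one contiguous chunk.
// Every program instance runs this same function on its own subset.
void square_chunk(const int* a, int* b, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) {
        b[i] = a[i] * a[i];
    }
}

// Hypothetical driver: splits the input into `instances` contiguous chunks
// and hands one chunk to each program instance (sequentially in this sketch).
std::vector<int> square_spmd(const std::vector<int>& a, std::size_t instances) {
    assert(instances > 0);
    const std::size_t n = a.size();
    std::vector<int> b(n);
    for (std::size_t id = 0; id < instances; ++id) {
        const std::size_t begin = n * id / instances;
        const std::size_t end = n * (id + 1) / instances;
        square_chunk(a.data(), b.data(), begin, end);
    }
    return b;
}
```

Because the chunks do not overlap, the instances need no synchronization, which is what makes the model easy to map onto independent processors.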


  7. Automatic Vectorization
     • Recent compiler versions support automatic vectorization
     • For instance, they accelerate scalar for loops with SIMD instructions
     • Works only for simple algorithms
     • Lacks support for recent instruction set architectures
     • Cannot compete with intrinsics code manually tuned by (experienced) developers
     Figure taken from: Pohl et al.: “An Evaluation of Current SIMD Programming Models for C++” (WPMVP, 2016)
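To make "works only for simple algorithms" concrete, the sketch below contrasts a loop with independent iterations against a loop with a loop-carried dependency. Both functions are our own illustrations; whether a given compiler actually emits SIMD instructions for the first one depends on the compiler, the optimization flags, and the target architecture.

```cpp
#include <cassert>
#include <cstddef>

// Independent iterations: a typical candidate for automatic vectorization,
// because out[i] depends only on a[i] and b[i].
void add_arrays(const int* a, const int* b, int* out, std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr && out != nullptr));
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

// Loop-carried dependency: iteration i needs the result of iteration i - 1,
// so a straightforward lane-wise mapping does not apply.
void prefix_sum(const int* a, int* out, std::size_t n) {
    assert(n == 0 || (a != nullptr && out != nullptr));
    int running = 0;
    for (std::size_t i = 0; i < n; ++i) {
        running += a[i];
        out[i] = running;
    }
}
```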

  8. Limitations of SIMD Intrinsics

         // Broadcast 32-bit floating-point value a to all elements of dst.
         __m256 _mm256_set1_ps (float a);

     • Require low-level hardware knowledge
     • Specific to the underlying instruction set architecture, e.g., AVX
     • Specific to the processed data type, e.g., float
     • Result in hard-to-maintain code when supporting different hardware architectures or data types
     • Limited forward compatibility with newer instruction sets
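The maintenance cost named above shows up as soon as a second data type or a machine without AVX has to be supported. The sketch below is an assumption-laden illustration, not the authors' code: `broadcast8` is a hypothetical wrapper, and the `__AVX__` guard with a scalar fallback is one common way to keep such code compiling everywhere.

```cpp
#include <array>
#include <cassert>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Broadcasts one float into eight lanes. The AVX path is tied to a 256-bit
// register and to the float type; the fallback is the portable equivalent.
std::array<float, 8> broadcast8(float a) {
    std::array<float, 8> out{};
#if defined(__AVX__)
    const __m256 v = _mm256_set1_ps(a);  // AVX only, float only
    _mm256_storeu_ps(out.data(), v);     // unaligned store of 8 floats
#else
    out.fill(a);                         // scalar fallback without AVX
#endif
    return out;
}

// A second data type needs a different intrinsic and a different register type.
std::array<int, 8> broadcast8(int a) {
    std::array<int, 8> out{};
#if defined(__AVX__)
    const __m256i v = _mm256_set1_epi32(a);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out.data()), v);
#else
    out.fill(a);
#endif
    return out;
}
```

Every additional type or register width multiplies the number of such variants, which is the problem ispc is meant to address.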


  10. Intel SPMD Program Compiler (ispc)
      • Deploys the SPMD execution model on the SIMD registers of modern CPUs
      • Program instances are mapped onto SIMD lanes
      • Extension of the C programming language with a few new features that facilitate writing high-performance SPMD programs
      • Programs compiled with ispc can be called directly from C/C++
      • Supports current CPU and instruction set architectures
        • x86, x86-64, Xeon Phi, ARM
        • SSE 2/4, AVX, AVX2, AVX512, NEON, …
      • Allows multi-threading to be used in addition to SIMD parallelism
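How program instances map onto SIMD lanes can be emulated in plain C++. The code below is not ispc code and not from the talk; it only mimics the model, with `program_index` and `program_count` standing in for the ispc built-ins `programIndex` and `programCount`, and with the inner loop standing in for lanes that real hardware would execute in lockstep.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain C++ emulation: a "gang" of program_count program instances advances
// through the array together, one instance per SIMD lane.
std::vector<int> square_gang(const std::vector<int>& a,
                             std::size_t program_count) {
    assert(program_count > 0);
    std::vector<int> b(a.size());
    for (std::size_t base = 0; base < a.size(); base += program_count) {
        // Each pass of this inner loop corresponds to one lane of the gang.
        for (std::size_t program_index = 0; program_index < program_count;
             ++program_index) {
            const std::size_t i = base + program_index;
            if (i < a.size()) {  // lanes past the end of the data are masked off
                b[i] = a[i] * a[i];
            }
        }
    }
    return b;
}
```

With `program_count` set to 8, ten input elements take two gang steps, and six of the eight lanes are inactive in the second step.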

  11. Integrating ispc into your C/C++ project
      [Figure: build pipeline. C/C++ code, which includes the ispc-generated header via #include "ispcscan.h", is compiled with "$ g++ -c -o …". ispc code, such as the scan(data, results, lower, upper) kernel, is compiled with "$ ispc -o … -h …". Both steps produce object files, which are then linked to create the executable.]
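From the C++ side, a kernel compiled with ispc is called like an ordinary function declared in the generated header. The sketch below compiles without ispc by replacing that header with a stand-in: ispc-generated headers place exported functions in `namespace ispc` when included from C++, but the `scan` signature, the 0/1 result encoding, and the wrapper `run_scan` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for the ispc-generated header ("ispcscan.h" on the slide).
// The signature is an assumption, and the body is a scalar placeholder
// for the compiled kernel.
namespace ispc {
inline void scan(const std::int32_t* data, std::int32_t* results,
                 std::int32_t n, std::int32_t lower, std::int32_t upper) {
    for (std::int32_t i = 0; i < n; ++i) {
        results[i] = (data[i] >= lower && data[i] <= upper) ? 1 : 0;
    }
}
}  // namespace ispc

// C++ side: owns the buffers and calls the kernel like any other function.
std::vector<std::int32_t> run_scan(const std::vector<std::int32_t>& data,
                                   std::int32_t lower, std::int32_t upper) {
    assert(lower <= upper);
    std::vector<std::int32_t> results(data.size());
    ispc::scan(data.data(), results.data(),
               static_cast<std::int32_t>(data.size()), lower, upper);
    return results;
}
```

In a real build, the stand-in namespace would be replaced by including the generated header, and the object file produced by ispc would be linked in.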


  13. Experimental Setup
      • Scalar, Intrinsics-based, and ispc-based column scan
      • Branching and branch-free scan variants
      • 1 GB of synthetic keys generated with std::rand()
      • Synthetic range scans of varying selectivity
        • lower bound: random, existing key
        • upper bound: lower bound + selectivity * domain
      • Server machine equipped with Intel Xeon E5-2620 (2 GHz clock rate, 256-bit wide SIMD registers, AVX) and 32 GB of main memory
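The workload description can be turned into a small generator. This is a sketch under stated assumptions rather than the authors' benchmark code: the names `make_keys`, `make_query`, and `RangeQuery` are ours, the seed and key count are arbitrary, and the key domain is assumed to be [0, RAND_MAX], the value range of std::rand().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

struct RangeQuery {
    std::uint32_t lower;
    std::uint32_t upper;
};

// Synthetic keys drawn with std::rand(), as stated on the slide.
std::vector<std::uint32_t> make_keys(std::size_t count, unsigned seed) {
    std::srand(seed);
    std::vector<std::uint32_t> keys(count);
    for (auto& k : keys) {
        k = static_cast<std::uint32_t>(std::rand());
    }
    return keys;
}

// lower bound: random existing key; upper bound: lower + selectivity * domain.
// Assumption: the domain is [0, RAND_MAX].
RangeQuery make_query(const std::vector<std::uint32_t>& keys,
                      double selectivity) {
    assert(!keys.empty() && selectivity >= 0.0 && selectivity <= 1.0);
    const std::size_t pick = static_cast<std::size_t>(std::rand()) % keys.size();
    const std::uint32_t lower = keys[pick];
    const double domain = static_cast<double>(RAND_MAX);
    const std::uint64_t upper =
        lower + static_cast<std::uint64_t>(selectivity * domain);
    return {lower, static_cast<std::uint32_t>(upper)};
}
```

Because the lower bound is an existing key, every query matches at least one key, and the upper bound can exceed the largest key when the lower bound is drawn near the top of the domain.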

  14. ispc vs. Intrinsics vs. Scalar (4-byte unsigned int keys)
      [Figure: throughput (GB/sec, axis from 0 to 7) over query selectivity (0% to 100%) for Intrinsics (Branch-Free), Intrinsics (Branches), ispc (Branch-Free), ispc (Branches), Scalar (Branch-Free), and Scalar (Branches)]
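The "Branches" and "Branch-Free" variants in the chart legend differ in how a qualifying key is recorded. The scalar sketch below shows one common formulation of each; it is an assumption about the general technique, not the code measured in the talk, and the function names are ours.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Branching variant: a conditional decides whether a position is stored.
std::size_t scan_branching(const std::vector<std::uint32_t>& keys,
                           std::uint32_t lower, std::uint32_t upper,
                           std::vector<std::uint32_t>& positions) {
    assert(positions.size() >= keys.size());
    std::size_t count = 0;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        if (keys[i] >= lower && keys[i] <= upper) {
            positions[count++] = static_cast<std::uint32_t>(i);
        }
    }
    return count;
}

// Branch-free variant: always writes the position, then advances the output
// cursor by the 0/1 value of the predicate instead of jumping.
std::size_t scan_branch_free(const std::vector<std::uint32_t>& keys,
                             std::uint32_t lower, std::uint32_t upper,
                             std::vector<std::uint32_t>& positions) {
    assert(positions.size() >= keys.size());
    std::size_t count = 0;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        positions[count] = static_cast<std::uint32_t>(i);
        count += static_cast<std::size_t>((keys[i] >= lower) &
                                          (keys[i] <= upper));
    }
    return count;
}
```

The branch-free form performs a store for every key, so its cost is largely independent of selectivity, whereas the cost of the branching form depends on how predictable the comparison is.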

  15. ispc vs. Intrinsics vs. Scalar (4-byte unsigned int keys)
      [Figure: throughput (GB/sec, axis from 0 to 7) over query selectivity (0% to 100%) for the branching variants. Annotated average speedups: 6.89X for Intrinsics (Branches) over Scalar (Branches), 1.80X for Intrinsics (Branches) over ispc (Branches), and 3.82X for ispc (Branches) over Scalar (Branches)]

  16. ispc vs. Intrinsics vs. Scalar (4-byte unsigned int keys)
      [Figure: throughput (GB/sec, axis from 0 to 7) over query selectivity (0% to 100%) for the branch-free variants. Annotated speedups: 2.16X for Intrinsics (Branch-Free) over Scalar (Branch-Free), 1.46X for Intrinsics (Branch-Free) over ispc (Branch-Free), and 1.48X for ispc (Branch-Free) over Scalar (Branch-Free)]

  17. Impact of Key Size on Performance of ispc-based scan
      [Figure: speedup over scalar execution (axis from 0 to 7) by key size (8 bits, 16 bits, 32 bits, 64 bits), with and without branches]

  18. Impact of Key Type on Performance of ispc-based scan
      [Figure: speedup over scalar execution (axis from 0 to 4) by key type (unsigned int32, signed int32, float), with and without branches]

  19. Code Complexity
      [Figure: lines of code (axis from 0 to 50) for the Scalar, ispc, and Intrinsics implementations, with and without branches]

  20. Next Steps
      • Investigate more complex database algorithms, e.g., joins, hashing, or bloom filters
      • Run experiments on many-core CPUs (70+ cores, 4-way hyperthreading, AVX-512) and compare performance to modern GPUs
      • Compare to other approaches to automatic vectorization, e.g., OpenCL, CilkPlus, and OpenMP

  21. Summary
      • ispc overcomes the limitations of SIMD Intrinsics
      • We compared branch-free and branching variants of an SPMD-based column scan with a scalar implementation and manually tuned Intrinsics code
      • ispc achieves notable speedups over scalar implementations; however, manually tuned Intrinsics code is still slightly more efficient
      [Figure: Intrinsics, SPMD on SIMD, and Automatic Vectorization placed on two axes, Performance and Convenience]
