Exploiting automatic vectorization to employ SPMD on SIMD registers

Exploiting automatic vectorization to employ SPMD on SIMD registers Stefan Sprenger (sprengsz@informatik.hu-berlin.de) Steffen Zeuch (steffen.zeuch@dfki.de) Ulf Leser (leser@informatik.hu-berlin.de) HardBD & Active ’18 April 16, 2018

Agenda
• SIMD and SPMD
• Automatic Vectorization vs. Intrinsics
• Intel SPMD Program Compiler
• Case Study: Column Scan
Sprenger, Zeuch, Leser: Exploiting automatic vectorization to employ SPMD on SIMD registers

Single Instruction Multiple Data (SIMD)
Input A:  3 4 2 8
Input B:  6 2 3 1
Result:   9 6 5 9  (Input A + Input B, element by element)
• Process multiple data elements with one instruction
• Modern CPUs offer dedicated instructions that execute on extra-wide registers
• Different instruction set architectures, e.g., SSE (128 bits), AVX (256 bits), AVX-512 (512 bits)
• The degree of parallelism of a SIMD instruction depends on how many data elements fit into one register, e.g., eight 32-bit ints fit into one 256-bit register
• Developers can use SIMD instructions through intrinsics or rely on compiler-based automatic vectorization
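The lane arithmetic on this slide can be checked with a few lines of portable C++. This is only a scalar sketch of what one SIMD add computes, not actual SIMD code; the helper names `simd_lanes` and `add_lanes` are made up for illustration.

```cpp
#include <array>
#include <cstddef>

// Number of elements (lanes) that fit into one SIMD register.
// Hypothetical helper, not part of any SIMD API.
constexpr std::size_t simd_lanes(std::size_t register_bits, std::size_t element_bits) {
    return register_bits / element_bits;
}

// Scalar reference for what a single 4-lane SIMD add computes:
// lane i of the result holds a[i] + b[i].
std::array<int, 4> add_lanes(const std::array<int, 4>& a, const std::array<int, 4>& b) {
    std::array<int, 4> result{};
    for (std::size_t lane = 0; lane < result.size(); ++lane) {
        result[lane] = a[lane] + b[lane];
    }
    return result;
}
```

With the slide's inputs (3 4 2 8 and 6 2 3 1), `add_lanes` yields 9 6 5 9, and `simd_lanes(256, 32)` gives the eight 32-bit lanes mentioned above.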

Single Program Multiple Data (SPMD)
A single program that appears to be serial is deployed onto multiple independent processing units (processors). The program instances are executed concurrently on different subsets of the input data.
Program (every processor runs the same code):
    void square(int a[], int b[], int n) {
        for (int i = 0; i < n; ++i) {
            b[i] = a[i] * a[i];
        }
    }
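To make the execution model concrete, here is a minimal C++ sketch that simulates SPMD for the slide's square example. It assumes a strided split of the data and runs the instances one after another; the function names and the `program_index` / `program_count` parameters are illustrative choices, not code from the talk.

```cpp
#include <cstddef>
#include <vector>

// One program instance: it runs the same code as every other instance,
// but only touches the elements assigned to it (a strided subset here).
void square_instance(const std::vector<int>& a, std::vector<int>& b,
                     std::size_t program_index, std::size_t program_count) {
    for (std::size_t i = program_index; i < a.size(); i += program_count) {
        b[i] = a[i] * a[i];
    }
}

// Runs all instances. They execute sequentially here for simplicity;
// in a real SPMD system they would execute concurrently.
std::vector<int> square_spmd(const std::vector<int>& a, std::size_t program_count) {
    std::vector<int> b(a.size(), 0);
    for (std::size_t p = 0; p < program_count; ++p) {
        square_instance(a, b, p, program_count);
    }
    return b;
}
```

Because each instance writes a disjoint subset of `b`, the instances need no synchronization, which is what allows them to be mapped onto independent processors or SIMD lanes.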

Automatic Vectorization
• Recent versions of compilers support automatic vectorization
• For instance, they accelerate scalar for loops with SIMD instructions
• Works only for simple algorithms
• Lacks support for recent instruction set architectures
• Cannot compete with intrinsics code manually tuned by (experienced) developers
Figure taken from: Pohl et al.: “An Evaluation of Current SIMD Programming Models for C++” (WPMVP, 2016)
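As a rough illustration of "simple" versus less simple loops, the sketch below contrasts a loop with independent iterations with one that carries a dependency between iterations. Both functions are hypothetical examples, and whether a given compiler actually vectorizes either of them depends on the compiler, its version, and the flags used.

```cpp
#include <cstddef>
#include <vector>

// Independent iterations: a typical candidate for automatic vectorization.
void scale_add(const std::vector<float>& x, std::vector<float>& y, float factor) {
    const std::size_t n = x.size() < y.size() ? x.size() : y.size();
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = factor * x[i] + y[i];
    }
}

// Loop-carried dependency: iteration i needs the result of iteration i - 1,
// which makes the loop much harder to vectorize automatically.
void prefix_sum(std::vector<int>& v) {
    for (std::size_t i = 1; i < v.size(); ++i) {
        v[i] += v[i - 1];
    }
}
```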

Limitations of SIMD Intrinsics
    // Broadcast 32-bit floating-point value a to all elements of dst.
    __m256 _mm256_set1_ps (float a);
• Require low-level hardware knowledge
• Specific to the underlying instruction set architecture, e.g., AVX
• Specific to the processed data type, e.g., float
• Result in hard-to-maintain code when supporting different hardware architectures or data types
• Limited forward compatibility
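The following sketch shows how the same logical operation, a broadcast, needs a different intrinsic and a different register type for each data type. It assumes SSE2 (128-bit registers) instead of the AVX intrinsic shown on the slide so that it builds on x86-64 without extra flags, and it falls back to plain scalar code elsewhere; the wrapper names `broadcast_float` and `broadcast_int` are made up for illustration.

```cpp
#include <array>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics, 128-bit registers
#endif

// Broadcast a float to four 32-bit lanes.
std::array<float, 4> broadcast_float(float value) {
    std::array<float, 4> out{};
#if defined(__SSE2__)
    __m128 reg = _mm_set1_ps(value);      // float-specific intrinsic and type
    _mm_storeu_ps(out.data(), reg);
#else
    out.fill(value);                      // scalar fallback on other targets
#endif
    return out;
}

// Broadcast an int to four 32-bit lanes: same idea, different intrinsic,
// different register type, different store.
std::array<int, 4> broadcast_int(int value) {
    std::array<int, 4> out{};
#if defined(__SSE2__)
    __m128i reg = _mm_set1_epi32(value);  // integer-specific intrinsic and type
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()), reg);
#else
    out.fill(value);                      // scalar fallback on other targets
#endif
    return out;
}
```

Moving this code to 256-bit AVX registers would mean switching to `__m256`, `_mm256_set1_ps`, and their integer counterparts, which is the maintenance burden the bullets above describe.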

Intel SPMD Program Compiler (ispc)
• Deploys the SPMD execution model on the SIMD registers of modern CPUs
• Program instances are mapped onto SIMD lanes
• Extension of the C programming language with a few new features that facilitate writing high-performance SPMD programs
• Programs compiled with ispc can be called directly from C/C++
• Supports current CPU and instruction set architectures
    • x86, x86-64, Xeon Phi, ARM
    • SSE 2/4, AVX, AVX2, AVX512, NEON, …
• Allows multi-threading to be used in addition to SIMD parallelism

Integrating ispc into your C/C++ project
C/C++ code:
    #include <iostream>
    #include "ispcscan.h"
    int main(int argc, char **argv) {
        return 0;
    }
ispc code (fragment as shown on the slide):
    void scan(int[] data, int[] results, int lower, int upper) {
        for (i = 0; i < n; ++i) {
            if (data[i] >= lower)
                if (data[i] <= upper)
                    …
Build steps:
• Compile the C/C++ code into object files: $ g++ -c -o …
• Compile the ispc code into object files: $ ispc -o … -h …
• Link the object files and create the executable

Experimental Setup
• Scalar, Intrinsics-based, and ispc-based column scan
• Branching and branch-free scan variants
• 1 GB of synthetic keys generated with std::rand()
• Synthetic range scans of varying selectivity
    • lower bound: random, existing key
    • upper bound: lower bound + selectivity * domain
• Server machine equipped with Intel Xeon E5-2620 (2 GHz clock rate, 256-bit wide SIMD registers, AVX) and 32 GB of main memory
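The slides do not show the scan implementations themselves, so the following is only a plain scalar C++ sketch of what a branching and a branch-free range scan commonly look like. The function names, the choice to return matching positions, and the 32-bit unsigned key type are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Branching variant: a data-dependent branch decides whether to emit a position.
std::vector<std::uint32_t> scan_branching(const std::vector<std::uint32_t>& keys,
                                          std::uint32_t lower, std::uint32_t upper) {
    std::vector<std::uint32_t> positions;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        if (keys[i] >= lower && keys[i] <= upper) {
            positions.push_back(static_cast<std::uint32_t>(i));
        }
    }
    return positions;
}

// Branch-free variant: always write the position, but advance the output
// cursor only when the key matches (the comparison result is 0 or 1).
std::vector<std::uint32_t> scan_branch_free(const std::vector<std::uint32_t>& keys,
                                            std::uint32_t lower, std::uint32_t upper) {
    std::vector<std::uint32_t> positions(keys.size());
    std::size_t count = 0;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        positions[count] = static_cast<std::uint32_t>(i);
        count += static_cast<std::size_t>((keys[i] >= lower) & (keys[i] <= upper));
    }
    positions.resize(count);
    return positions;
}
```

The branch-free form trades a possibly wasted store per key for the absence of hard-to-predict branches, which is why its cost tends to depend less on query selectivity.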

ispc vs. Intrinsics vs. Scalar (4-byte unsigned int keys)
[Figure: throughput (GB/sec) over query selectivity (0% to 100%) for six scan variants: Intrinsics (Branch-Free), Intrinsics (Branches), ispc (Branch-Free), ispc (Branches), Scalar (Branch-Free), Scalar (Branches)]

ispc vs. Intrinsics vs. Scalar (4-byte unsigned int keys)
[Figure: throughput (GB/sec) over query selectivity (0% to 100%) for the branching variants only. Annotated speedups, as positioned in the plot: Intrinsics (Branches) over Scalar (Branches) 6.89X on average; Intrinsics (Branches) over ispc (Branches) 1.80X on average; ispc (Branches) over Scalar (Branches) 3.82X on average]

ispc vs. Intrinsics vs. Scalar (4-byte unsigned int keys)
[Figure: throughput (GB/sec) over query selectivity (0% to 100%) for the branch-free variants only. Annotated speedups, as positioned in the plot: Intrinsics (Branch-Free) over Scalar (Branch-Free) 2.16X; Intrinsics (Branch-Free) over ispc (Branch-Free) 1.46X; ispc (Branch-Free) over Scalar (Branch-Free) 1.48X]

Impact of Key Size on Performance of ispc-based scan
[Figure: speedup over scalar execution by key size (8, 16, 32, and 64 bits), with branches and branch-free]

Impact of Key Type on Performance of ispc-based scan
[Figure: speedup over scalar execution by key type (unsigned int32, signed int32, float), with branches and branch-free]

Code Complexity
[Figure: lines of code for the Scalar, ispc, and Intrinsics implementations, with branches and branch-free]

Next Steps
• Investigate more complex database algorithms, e.g., joins, hashing, or Bloom filters
• Run experiments on many-core CPUs (70+ cores, 4-way hyperthreading, AVX-512) and compare performance to modern GPUs
• Compare to other approaches to automatic vectorization, e.g., OpenCL, CilkPlus, and OpenMP

Summary
• ispc overcomes the limitations of SIMD Intrinsics
• We compared branch-free and branching variants of an SPMD-based column scan with a scalar implementation and manually tuned Intrinsics code
• ispc achieves notable speedups over scalar implementations; however, manually tuned Intrinsics code is still slightly more efficient
[Figure: Intrinsics, SPMD on SIMD, and Automatic Vectorization positioned along Performance and Convenience axes]
