Exploiting automatic vectorization to employ SPMD on SIMD registers Stefan Sprenger (sprengsz@informatik.hu-berlin.de) Steffen Zeuch (steffen.zeuch@dfki.de) Ulf Leser (leser@informatik.hu-berlin.de) HardBD & Active 18 April 16, 2018


slide-1
SLIDE 1

Exploiting automatic vectorization to employ SPMD on SIMD registers

Stefan Sprenger (sprengsz@informatik.hu-berlin.de) Steffen Zeuch (steffen.zeuch@dfki.de) Ulf Leser (leser@informatik.hu-berlin.de)

HardBD & Active ’18 April 16, 2018

slide-2
SLIDE 2

Sprenger, Zeuch, Leser: Exploiting automatic vectorization to employ SPMD on SIMD registers

Agenda

  • SIMD and SPMD
  • Automatic Vectorization vs. Intrinsics
  • Intel SPMD Program Compiler
  • Case Study: Column Scan
slide-4
SLIDE 4

Single Instruction Multiple Data (SIMD)

  • Process multiple data elements with one instruction
  • Modern CPUs offer dedicated instructions executed on extra-wide registers
  • Different instruction set architectures, e.g., SSE (128 bits), AVX (256 bits), AVX-512 (512 bits)
  • Degree of parallelism of a SIMD instruction depends on how many data elements fit into one register, e.g., eight 32-bit ints fit into one 256-bit register
  • Developers can use SIMD instructions through intrinsics or rely on compiler-based automatic vectorization

Example (element-wise addition of four values with one instruction):

  Input A:  3 4 2 8
  Input B:  6 2 3 1
  Result:   9 6 5 9   (A + B)
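The addition in the example can be written as a plain scalar loop, which is the kind of code automatic vectorization targets. The following is a minimal C++ sketch; the helper name `add_elementwise` is hypothetical, and whether a compiler actually emits SIMD instructions for it depends on the compiler, its flags, and the target CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the slides): element-wise addition.
// Every iteration is independent of the others, so a vectorizing compiler
// may process several elements with one SIMD add.
std::vector<int> add_elementwise(const std::vector<int>& a,
                                 const std::vector<int>& b) {
    assert(a.size() == b.size());  // both inputs must have the same length
    std::vector<int> result(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        result[i] = a[i] + b[i];
    }
    return result;
}
```

Calling `add_elementwise({3, 4, 2, 8}, {6, 2, 3, 1})` reproduces the result row of the example.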

slide-5
SLIDE 5

Single Program Multiple Data (SPMD)

A single program that appears to be serial is deployed onto multiple independent processing units (processors). The program instances are concurrently executed on different subsets of the data.

void square(int a[], int b[], int n) {
  for (int i = 0; i < n; ++i) {
    b[i] = a[i] * a[i];
  }
}

[Figure: the program and its input data; a copy of square() runs on each of three processors.]
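To make the idea of program instances working on different subsets of the data concrete, here is a small C++ sketch. The names `square_instance` and `square_spmd` and the contiguous chunking scheme are assumptions for illustration; the slides do not prescribe how the data is partitioned. The instances run one after another here, although they are independent and could run concurrently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One program instance: the same code as the serial square(), restricted to
// the half-open index range [begin, end) assigned to this instance.
void square_instance(const std::vector<int>& a, std::vector<int>& b,
                     std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) {
        b[i] = a[i] * a[i];
    }
}

// Hypothetical driver: splits the input into contiguous chunks, one per
// instance. The instances do not share any output elements, so they could be
// handed to separate processors; this sketch simply runs them in sequence.
void square_spmd(const std::vector<int>& a, std::vector<int>& b,
                 std::size_t instances) {
    assert(instances > 0 && a.size() == b.size());
    const std::size_t chunk = (a.size() + instances - 1) / instances;  // ceiling
    for (std::size_t p = 0; p < instances; ++p) {
        const std::size_t begin = std::min(p * chunk, a.size());
        const std::size_t end = std::min(begin + chunk, a.size());
        square_instance(a, b, begin, end);
    }
}
```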

slide-6
SLIDE 6

Agenda

  • SIMD and SPMD
  • Automatic Vectorization vs. Intrinsics
  • Intel SPMD Program Compiler
  • Case Study: Column Scan
slide-7
SLIDE 7

Automatic Vectorization

  • Recent versions of compilers support automatic vectorization
  • For instance, they accelerate scalar for loops with SIMD instructions
  • Works only for simple algorithms
  • Lacks support for recent instruction set architectures
  • Cannot compete with intrinsics code manually tuned by (experienced) developers

Figure taken from: Pohl et al.: “An Evaluation of Current SIMD Programming Models for C++” (WPMVP, 2016)
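As an illustration of what "simple" means here, the sketch below contrasts two loops. The function names are hypothetical, and the behavior described in the comments is what auto-vectorizers typically do, not a guarantee for any particular compiler or version.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Independent iterations: out[i] depends only on in[i]. This is the simple
// loop shape that auto-vectorizers usually handle.
void scale(const std::vector<float>& in, std::vector<float>& out, float factor) {
    assert(in.size() == out.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = in[i] * factor;
    }
}

// Loop-carried dependency: each output needs the running total of all
// previous elements, so iterations cannot simply be placed side by side in
// SIMD lanes. Loops like this are typically left scalar.
void prefix_sum(const std::vector<int>& in, std::vector<int>& out) {
    assert(in.size() == out.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        running += in[i];
        out[i] = running;
    }
}
```

Compilers can report which loops they vectorized, for example with `-fopt-info-vec` in GCC or `-Rpass=loop-vectorize` in Clang.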

slide-8
SLIDE 8

Limitations of SIMD Intrinsics

  • Require low-level hardware knowledge
  • Specific to the underlying instruction set architecture, e.g., AVX
  • Specific to the processed data type, e.g., float
  • Result in hard-to-maintain code when supporting different hardware architectures or data types
  • Forward compatibility

// Broadcast 32-bit floating-point value a to all elements of dst.
__m256 _mm256_set1_ps (float a);
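The following sketch shows how these limitations surface in code. It uses `_mm256_set1_ps` from the slide together with the AVX load, multiply, and store intrinsics; the function name `scale_in_place` and the scalar fallback are assumptions for illustration. Note that the vector path is tied to both AVX and `float`.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical helper: multiplies the first n elements of data by factor.
void scale_in_place(float* data, std::size_t n, float factor) {
    assert(data != nullptr || n == 0);
    std::size_t i = 0;
#if defined(__AVX__)
    // AVX-only path: 8 floats per 256-bit register. Another instruction set
    // (SSE, AVX-512, NEON) or another data type needs different vector types
    // and different intrinsics.
    const __m256 f = _mm256_set1_ps(factor);  // broadcast factor to all lanes
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(data + i);  // unaligned load of 8 floats
        v = _mm256_mul_ps(v, f);
        _mm256_storeu_ps(data + i, v);
    }
#endif
    // Scalar loop: handles the remainder, or everything when AVX is not enabled.
    for (; i < n; ++i) {
        data[i] *= factor;
    }
}
```

The AVX path is only compiled when the compiler targets AVX (for example with `-mavx`); otherwise the function falls back to the scalar loop.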

slide-9
SLIDE 9

Agenda

  • SIMD and SPMD
  • Automatic Vectorization vs. Intrinsics
  • Intel SPMD Program Compiler
  • Case Study: Column Scan
slide-10
SLIDE 10

Intel SPMD Program Compiler (ispc)

  • Deploys the SPMD execution model on the SIMD registers of modern CPUs
  • Program instances are mapped onto SIMD lanes
  • Extension of the C programming language with a few new features that facilitate writing high-performance SPMD programs
  • Programs compiled with ispc can be called directly from C/C++
  • Supports current CPU and instruction set architectures
      • x86, x86-64, Xeon Phi, ARM
      • SSE 2/4, AVX, AVX2, AVX512, NEON, …
  • Allows using multi-threading in addition to SIMD parallelism
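The mapping of program instances onto SIMD lanes can be pictured with a plain C++ emulation. This is a conceptual sketch only: it is neither ispc code nor a description of how ispc is implemented, and the gang size of 8 and the name `square_gang` are assumptions. In ispc itself, the built-in variables `programCount` and `programIndex` play the roles of `kGangSize` and `lane`.

```cpp
#include <cassert>
#include <cstddef>

// Assumption for illustration: a gang of 8 program instances, as would fit
// 32-bit values into a 256-bit register.
constexpr std::size_t kGangSize = 8;

// Emulates how SPMD program instances line up with SIMD lanes: the gang
// advances through the array kGangSize elements at a time, and the instance
// in lane `lane` handles element base + lane. A real SIMD implementation
// would execute the inner loop as a single vector operation.
void square_gang(const int* a, int* b, std::size_t n) {
    assert((a != nullptr && b != nullptr) || n == 0);
    for (std::size_t base = 0; base < n; base += kGangSize) {
        for (std::size_t lane = 0; lane < kGangSize; ++lane) {
            const std::size_t i = base + lane;
            if (i < n) {  // instances past the end of the array stay inactive
                b[i] = a[i] * a[i];
            }
        }
    }
}
```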
slide-11
SLIDE 11

Integrating ispc into your C/C++ project

[Figure: build workflow. Code fragments shown in the figure: square(), scan(), determine_foo(), and a main() that includes "ispcscan.h".]

  • C/C++ code to object files: $ g++ -c -o …
  • ispc code to object files: $ ispc -o … -h …
  • Link and create executable

slide-12
SLIDE 12

Agenda

  • SIMD and SPMD
  • Automatic Vectorization vs. Intrinsics
  • Intel SPMD Program Compiler
  • Case Study: Column Scan
slide-13
SLIDE 13

Experimental Setup

  • Scalar, Intrinsics-based, and ispc-based column scan
  • Branching and branch-free scan variants
  • 1 GB of synthetic keys generated with std::rand()
  • Synthetic range scans of varying selectivity
      • lower bound: random, existing key
      • upper bound: lower bound + selectivity * domain
  • Server machine equipped with Intel Xeon E5-2620 (2 GHz clock rate, 256-bit wide SIMD registers, AVX) and 32 GB of main memory
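The difference between the branching and branch-free scan variants can be sketched in scalar C++ as follows. This is not the authors' implementation: the function names, the use of 32-bit unsigned keys, and the output format (positions of qualifying keys) are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Branching variant: each bound is tested with a conditional jump.
// Returns the number of keys in [lower, upper]; their positions are written
// to the front of `positions`.
std::size_t scan_branching(const std::vector<std::uint32_t>& keys,
                           std::uint32_t lower, std::uint32_t upper,
                           std::vector<std::size_t>& positions) {
    assert(positions.size() >= keys.size());
    std::size_t count = 0;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        if (keys[i] >= lower) {
            if (keys[i] <= upper) {
                positions[count++] = i;
            }
        }
    }
    return count;
}

// Branch-free variant: always write the current position, then advance the
// output cursor by 0 or 1 depending on the predicate result. A non-matching
// position is overwritten by the next write.
std::size_t scan_branch_free(const std::vector<std::uint32_t>& keys,
                             std::uint32_t lower, std::uint32_t upper,
                             std::vector<std::size_t>& positions) {
    assert(positions.size() >= keys.size());
    std::size_t count = 0;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        positions[count] = i;
        count += static_cast<std::size_t>((keys[i] >= lower) & (keys[i] <= upper));
    }
    return count;
}
```

The branching variant does less work per non-matching key but depends on branch prediction, which is why the experiments vary query selectivity.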

slide-14
SLIDE 14

ispc vs. Intrinsics vs. Scalar (4-byte unsigned int keys)

[Figure: throughput (GB/sec, axis up to 7) over query selectivity (0% to 100%) for six scan variants: Scalar (Branches), Scalar (Branch-Free), ispc (Branches), ispc (Branch-Free), Intrinsics (Branches), Intrinsics (Branch-Free).]

slide-15
SLIDE 15

ispc vs. Intrinsics vs. Scalar (4-byte unsigned int keys)

[Figure: throughput (GB/sec, axis up to 7) over query selectivity (0% to 100%) for the branching variants: Scalar (Branches), ispc (Branches), Intrinsics (Branches).]

Annotations in the figure:

  • 3.82X speedup on average
  • 6.89X speedup on average
  • 1.80X speedup on average
slide-16
SLIDE 16

ispc vs. Intrinsics vs. Scalar (4-byte unsigned int keys)

[Figure: throughput (GB/sec, axis up to 7) over query selectivity (0% to 100%) for the branch-free variants: Scalar (Branch-Free), ispc (Branch-Free), Intrinsics (Branch-Free).]

Annotations in the figure:

  • 1.48X speedup
  • 2.16X speedup
  • 1.46X speedup

slide-17
SLIDE 17

Impact of Key Size on Performance of ispc-based scan

[Figure: speedup over scalar execution (axis up to 7) for key sizes of 8, 16, 32, and 64 bits; series: With Branches, Branch-Free.]
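One reason key size matters is the number of keys that fit into a single register. The sketch below is arithmetic on register and key widths only, not a reproduction of the measured speedups in the figure; the helper name `lanes` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Number of keys that fit into one SIMD register, i.e. the theoretical degree
// of parallelism of a single instruction. Measured speedups also depend on
// memory bandwidth, the instruction set, and the generated code.
std::size_t lanes(std::size_t register_bits, std::size_t key_bits) {
    assert(key_bits != 0);
    return register_bits / key_bits;
}
```

For the 256-bit registers named in the experimental setup, `lanes(256, 32)` yields the eight 32-bit values mentioned earlier in the talk, while 8-bit keys give 32 lanes and 64-bit keys give 4.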

slide-18
SLIDE 18

Impact of Key Type on Performance of ispc-based scan

[Figure: speedup over scalar execution (axis up to 4) for the key types unsigned int32, signed int32, and float; series: With Branches, Branch-Free.]

slide-19
SLIDE 19

Code Complexity

[Figure: lines of code (axis up to 50) for the Scalar, ispc, and Intrinsics implementations; series: With Branches, Branch-Free.]

slide-20
SLIDE 20

Next Steps

  • Investigate more complex database algorithms, e.g., joins, hashing, or bloom filters
  • Run experiments on many-core CPUs (70+ cores, 4-way hyperthreading, AVX-512) and compare performance to modern GPUs
  • Compare to other approaches to automatic vectorization, e.g., OpenCL, CilkPlus, and OpenMP

slide-21
SLIDE 21

Summary

  • ispc overcomes the limitations of SIMD Intrinsics
  • We compared branch-free and branching variants of an SPMD-based column scan with a scalar implementation and manually tuned Intrinsics code
  • ispc achieves notable speedups over scalar implementations; however, manually tuned Intrinsics code is still slightly more efficient

[Figure: Intrinsics, Automatic Vectorization, and SPMD on SIMD positioned along the axes Performance and Convenience.]