
ADVANCED DATABASE SYSTEMS: Vectorized Execution (@Andy_Pavlo) - PowerPoint PPT Presentation



  1. Lecture #20: ADVANCED DATABASE SYSTEMS, Vectorized Execution. @Andy_Pavlo // 15-721 // Spring 2019

  2. CMU 15-721 (Spring 2019). Outline: Background; Hardware; Vectorized Algorithms (Columbia)

  3. VECTORIZATION: The process of converting an algorithm's scalar implementation, which processes a single pair of operands at a time, to a vector implementation, which performs one operation on multiple pairs of operands at once.

  4. WHY THIS MATTERS: Say we can parallelize our algorithm over 32 cores, and each core has 4-wide SIMD registers. Potential speed-up: 32× × 4× = 128×.

  5. MULTI-CORE CPUS: Use a small number of high-powered cores.
     → Intel Xeon Skylake / Kaby Lake
     → High power consumption and area per core.
     Massively superscalar and aggressive out-of-order execution:
     → Instructions are issued from a sequential stream.
     → Check for dependencies between instructions.
     → Process multiple instructions per clock cycle.

  6. MANY INTEGRATED CORES (MIC): Use a larger number of low-powered cores.
     → Intel Xeon Phi
     → Low power consumption and area per core.
     → Expanded SIMD instructions with larger register sizes.
     Knights Ferry (Columbia paper):
     → Non-superscalar and in-order execution.
     → Cores = Intel P54C (aka Pentium from the 1990s).
     Knights Landing (since 2016):
     → Superscalar and out-of-order execution.
     → Cores = Silvermont (aka Atom).

  8. SINGLE INSTRUCTION, MULTIPLE DATA: A class of CPU instructions that allow the processor to perform the same operation on multiple data points simultaneously. All major ISAs have microarchitectural support for SIMD operations:
     → x86: MMX, SSE, SSE2, SSE3, SSE4, AVX, AVX2, AVX-512
     → PowerPC: AltiVec
     → ARM: NEON

  9–13. SIMD EXAMPLE (animation frames): Compute X + Y = Z elementwise over n-element arrays, i.e., z_i = x_i + y_i for each i.
     SISD: the scalar loop performs one addition per iteration:
       for (i = 0; i < n; i++) {
         Z[i] = X[i] + Y[i];
       }
     SIMD: with 128-bit SIMD registers, four 32-bit values of X and four of Y are loaded per step, a single vector add produces four sums, and four elements of Z are stored at once.

  14. STREAMING SIMD EXTENSIONS (SSE): SSE is a collection of SIMD instructions that target special 128-bit SIMD registers. These registers can be packed with four 32-bit scalars, after which an operation can be performed on all four elements simultaneously. First introduced by Intel in 1999.

  15. SIMD INSTRUCTIONS (1):
     Data Movement
     → Moving data in and out of vector registers.
     Arithmetic Operations
     → Apply an operation on multiple data items (e.g., 2 doubles, 4 floats, 16 bytes).
     → Example: ADD, SUB, MUL, DIV, SQRT, MAX, MIN
     Logical Instructions
     → Logical operations on multiple data items.
     → Example: AND, OR, XOR, ANDN, ANDPS, ANDNPS

  16. SIMD INSTRUCTIONS (2):
     Comparison Instructions
     → Comparing multiple data items (==, <, <=, >, >=, !=).
     Shuffle Instructions
     → Move data between SIMD registers.
     Miscellaneous
     → Conversion: transform data between x86 and SIMD registers.
     → Cache Control: move data directly from SIMD registers to memory (bypassing the CPU cache).

  17. INTEL SIMD EXTENSIONS:
     Year  Extension  Width     Integers  Single-P  Double-P
     1997  MMX        64 bits   ✔
     1999  SSE        128 bits  ✔         ✔ (×4)
     2001  SSE2       128 bits  ✔         ✔         ✔ (×2)
     2004  SSE3       128 bits  ✔         ✔         ✔
     2006  SSSE3      128 bits  ✔         ✔         ✔
     2006  SSE4.1     128 bits  ✔         ✔         ✔
     2008  SSE4.2     128 bits  ✔         ✔         ✔
     2011  AVX        256 bits  ✔         ✔ (×8)    ✔ (×4)
     2013  AVX2       256 bits  ✔         ✔         ✔
     2017  AVX-512    512 bits  ✔         ✔ (×16)   ✔ (×8)
     Source: James Reinders

  18–19. VECTORIZATION:
     Choice #1: Automatic Vectorization
     Choice #2: Compiler Hints
     Choice #3: Explicit Vectorization
     Going from Choice #1 to Choice #3 trades ease of use for programmer control.
     Source: James Reinders

  20. AUTOMATIC VECTORIZATION: The compiler can identify when instructions inside of a loop can be rewritten as a vectorized operation. Works for simple loops only and is rare in database operators. Requires hardware support for SIMD instructions.

  21–23. AUTOMATIC VECTORIZATION: This loop is not legal to automatically vectorize:
       void add(int *X,
                int *Y,
                int *Z) {
         for (int i = 0; i < MAX; i++) {
           Z[i] = X[i] + Y[i];
         }
       }
     X, Y, and Z might point to the same address (e.g., aliasing such that *Z = *X + 1), and the code is written such that the additions are described as being done sequentially, so the compiler cannot prove the transformation is safe.

  24. COMPILER HINTS: Provide the compiler with additional information about the code to let it know that it is safe to vectorize. Two approaches:
     → Give explicit information about memory locations.
     → Tell the compiler to ignore vector dependencies.

  25. COMPILER HINTS: The restrict keyword (standard in C99; most C++ compilers accept __restrict) tells the compiler that the arrays are distinct locations in memory.
       void add(int * restrict X,
                int * restrict Y,
                int * restrict Z) {
         for (int i = 0; i < MAX; i++) {
           Z[i] = X[i] + Y[i];
         }
       }

  26. COMPILER HINTS: This pragma tells the compiler to ignore loop dependencies for the vectors. It's up to you to make sure that this is correct.
       void add(int *X,
                int *Y,
                int *Z) {
         #pragma ivdep
         for (int i = 0; i < MAX; i++) {
           Z[i] = X[i] + Y[i];
         }
       }

  27. EXPLICIT VECTORIZATION: Use CPU intrinsics to manually marshal data between SIMD registers and execute vectorized instructions. Potentially not portable.

  28. EXPLICIT VECTORIZATION: Store the vectors in 128-bit SIMD registers, then invoke the intrinsic to add the vectors together and write them to the output location.
       void add(int *X,
                int *Y,
                int *Z) {
         __m128i *vecX = (__m128i*)X;
         __m128i *vecY = (__m128i*)Y;
         __m128i *vecZ = (__m128i*)Z;
         for (int i = 0; i < MAX/4; i++) {
           _mm_store_si128(vecZ++, _mm_add_epi32(*vecX++, *vecY++));
         }
       }

  29. VECTORIZATION DIRECTION:
     Approach #1: Horizontal
     → Perform the operation on all elements together within a single vector (e.g., a horizontal SIMD add over [0, 1, 2, 3] yields 6).
     Approach #2: Vertical
     → Perform the operation in an elementwise manner on elements of each vector (e.g., a vertical SIMD add of [0, 1, 2, 3] and [1, 1, 1, 1] yields [1, 2, 3, 4]).

  30. EXPLICIT VECTORIZATION:
     Linear Access Operators
     → Predicate evaluation
     → Compression
     Ad-hoc Vectorization
     → Sorting
     → Merging
     Composable Operations
     → Multi-way trees
     → Bucketized hash tables
     Source: Orestis Polychroniou

  31. VECTORIZED DBMS ALGORITHMS: Principles for efficient vectorization by using fundamental vector operations to construct more advanced functionality.
     → Favor vertical vectorization by processing different input data per lane.
     → Maximize lane utilization by executing different things per lane subset.
     Reference: Rethinking SIMD Vectorization for In-Memory Databases, SIGMOD 2015.

  32. FUNDAMENTAL OPERATIONS: Selective Load, Selective Store, Selective Gather, Selective Scatter.
