
FUSED TABLE SCANS: COMBINING AVX-512 AND JIT

Markus Dreseler, Jan Kossmann, Johannes Frohnhofen, Matthias Uflacker, Hasso Plattner
Joint Workshop of HardBD & Active @ ICDE, Paris, 16th of April 2018


  1. FUSED TABLE SCANS: COMBINING AVX-512 AND JIT Markus Dreseler, Jan Kossmann, Johannes Frohnhofen, Matthias Uflacker, Hasso Plattner Joint Workshop of HardBD & Active @ ICDE Paris, 16th of April 2018

  2. FUSED TABLE SCANS - COMBINING AVX-512 AND JIT
     ▸ AVX-512: Intel's newest instruction set for SIMD operations
     ▸ Just-In-Time compilation: Creating binary code at program runtime
     ▸ Efficient (multi-predicate) sequential scans are a necessity for relational database systems
     ▸ Secondary indexes can speed up such operations
       ▸ Drawbacks: memory consumption and maintenance cost
     ▸ Contribution: Combine the above techniques to accelerate table scans

  3. FUSED TABLE SCANS - COMBINING AVX-512 AND JIT
     ▸ Optimizations of sequential scans can be grouped into two categories
     ▸ Block-at-a-time: Evaluate multiple values (SIMD) of a column at a time
       ▸ Store results in position list
       ▸ Materialization between operators
       [Figure: columns A, B, C]
     ▸ Data-centric compilation: Generate (JIT) a tight, optimized loop to process one tuple at a time
       ▸ No utilization of SIMD until now
       ▸ Suboptimal interplay with some hardware optimizations
       [Figure: columns A, B, C]
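The two categories on this slide can be contrasted in a few lines of scalar C++. This is only a sketch under stated assumptions: `PosList`, `scan_column`, `refine`, and `count_fused` are hypothetical names that do not come from the slides, and the block-at-a-time variant is shown without SIMD to keep the data flow visible.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using PosList = std::vector<std::size_t>;  // hypothetical position-list type

// Block-at-a-time, step 1: scan one whole column and materialize the
// positions of all matching values.
PosList scan_column(const std::vector<std::uint32_t>& col, std::uint32_t value) {
  PosList out;
  for (std::size_t i = 0; i < col.size(); ++i) {
    if (col[i] == value) out.push_back(i);
  }
  return out;
}

// Block-at-a-time, step 2: the next operator re-checks only the positions
// that survived step 1 and materializes a new position list.
PosList refine(const PosList& in, const std::vector<std::uint32_t>& col,
               std::uint32_t value) {
  PosList out;
  for (std::size_t pos : in) {
    if (col[pos] == value) out.push_back(pos);
  }
  return out;
}

// Data-centric: one tight loop evaluates both predicates per tuple and
// never materializes an intermediate position list.
std::size_t count_fused(const std::vector<std::uint32_t>& a, std::uint32_t va,
                        const std::vector<std::uint32_t>& b, std::uint32_t vb) {
  assert(a.size() == b.size());
  std::size_t n = 0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    if (a[i] == va && b[i] == vb) ++n;
  }
  return n;
}
```

Both variants return the same rows. The difference is where the intermediate result lives: in a materialized list between operators, or nowhere at all.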

  4. WHY SHOULD WE COMBINE DATA-CENTRIC OPERATION & SIMD?
     ▸ Assumptions: Data resides in-memory in column-major format with fixed-size values
     ▸ Example query: SELECT COUNT(*) FROM tbl WHERE a = 5 AND b = 2
     ▸ Data-centric code for this query could look similar to:

         int total_results = 0;
         for (pos_t i = 0; i < col_a.size(); ++i) {
           if (col_a[i] == 5 && col_b[i] == 2) {
             ++total_results;
           }
         }

  5. WHY SHOULD WE COMBINE DATA-CENTRIC OPERATION & SIMD?

         int total_results = 0;
         for (pos_t i = 0; i < col_a.size(); ++i) {
           if (col_a[i] == 5 && col_b[i] == 2) {
             ++total_results;
           }
         }

  6. WHY SHOULD WE COMBINE DATA-CENTRIC OPERATION & SIMD?
     ▸ Experiment: Does single-value-at-a-time evaluation fully utilize the available memory bandwidth?
     ▸ Reduce the number of CPU operations, but still load all data from memory
     [Figure: x-axis: 4-byte values skipped per scanned item]
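One plausible reading of this experiment is a scan that compares only every few values while the hardware still fetches every cache line. The sketch below follows that reading. The helper name `count_with_skip` is hypothetical, and the 64-byte cache line size is an assumption that the slides do not state.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: compare only every (skip + 1)-th value.
// Assuming 4-byte values and 64-byte cache lines, any skip of up to 15
// still touches every cache line, so the memory traffic stays the same
// while the number of comparisons drops.
std::size_t count_with_skip(const std::vector<std::uint32_t>& col,
                            std::uint32_t value, std::size_t skip) {
  std::size_t matches = 0;
  for (std::size_t i = 0; i < col.size(); i += skip + 1) {
    if (col[i] == value) ++matches;
  }
  return matches;
}
```

If the runtime barely changes as `skip` grows, the scan is limited by memory rather than by the comparisons. If it drops noticeably, the single-value loop was not saturating the memory bandwidth.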

  7. IMPLEMENTATION
     ▸ Utilizing the new instruction set AVX-512
       ▸ Wider (doubled) register sizes
       ▸ New instructions offer efficiency advantages
       ▸ We built equivalent functions using AVX2 (up to 32 lines)
     ▸ Basic idea
       ▸ Keep data in the AVX-registers during whole scan

  8. [Figure: fused scan example with 128-bit registers (__m128i) holding four uint32_t values each]
     Column A: 2 5 4 5 | 6 1 5 7 | 6 8 5 3 | 5 9 9 5
     ▸ First iteration
       ▸ _mm_loadu_si128 loads the block 2 5 4 5 into an __m128i
       ▸ _mm_cmpeq_epi32_mask against the first search value (5 5 5 5) yields the __mmask8 0 1 0 1
       ▸ _mm_mask_compress_epi32 on the indexes of the current block (0 1 2 3) yields the position list 0 0 1 3
     ▸ Second iteration: block 6 1 5 7, mask 0 0 1 0, indexes 4 5 6 7
       ▸ _mm_permutex2var_epi32: 0 1 3 0
       ▸ _mm_mask_compress_epi32: 0 1 3 6
     ▸ Third iteration: block 6 8 5 3, mask 0 0 1 0, indexes 8 9 10 11
       ▸ _mm_permutex2var_epi32: 1 3 6 0
       ▸ _mm_mask_compress_epi32: 1 3 6 10 (matching positions in column a)
     Column B: 5 2 3 1 8 7 3 3 4 5 6 7 2 9 3 2
     ▸ _mm_i32gather_epi32 with the position list 1 3 6 10 loads 2 1 3 6
     ▸ _mm_mask_cmpeq_epi32_mask against 2 2 2 2 yields 1 0 0 0
     ▸ _mm_mask_compress_epi32 yields 0 0 0 1
     ▸ Final result: row 1 (the second entry) matches both conditions

  9. [Figure: animation step of the example on slide 8. The second and third iterations over Column A extend the position list via _mm_permutex2var_epi32 and _mm_mask_compress_epi32 until it holds the matching positions 1 3 6 10.]

  10. [Figure: animation step of the example on slide 8. The matching positions 1 3 6 10 are used to gather 2 1 3 6 from Column B (_mm_i32gather_epi32); _mm_mask_cmpeq_epi32_mask against 2 2 2 2 and _mm_mask_compress_epi32 leave row 1 (the second entry) as the final result matching both conditions.]
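The data flow of the figure can be traced with a scalar emulation that runs on any machine. This is a sketch, not the authors' AVX-512 code: `Reg`, `cmpeq_mask`, `check_second_column`, and `fused_scan` are hypothetical names, each helper only imitates the lane-wise effect of the intrinsic named in its comment, and a `std::vector` stands in for the register that holds the position list.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kLanes = 4;  // 4 x uint32_t = one 128-bit register
using Reg = std::array<std::uint32_t, kLanes>;

// Stand-in for _mm_cmpeq_epi32_mask: bit i is set if lane i equals value.
std::uint8_t cmpeq_mask(const Reg& r, std::uint32_t value) {
  std::uint8_t mask = 0;
  for (std::size_t i = 0; i < kLanes; ++i) {
    if (r[i] == value) mask = static_cast<std::uint8_t>(mask | (1u << i));
  }
  return mask;
}

// Second predicate on up to kLanes candidate rows: gather from column b
// (stand-in for _mm_i32gather_epi32), compare, and keep matching positions
// (stand-in for the masked compare followed by a compress).
void check_second_column(const std::vector<std::uint32_t>& positions,
                         std::size_t count, const std::vector<std::uint32_t>& b,
                         std::uint32_t vb, std::vector<std::uint32_t>& result) {
  Reg gathered{};
  for (std::size_t i = 0; i < count; ++i) gathered[i] = b[positions[i]];
  const std::uint8_t mask = cmpeq_mask(gathered, vb);
  for (std::size_t i = 0; i < count; ++i) {
    if ((mask >> i) & 1u) result.push_back(positions[i]);
  }
}

// Scan column a block by block; whenever the position list is full,
// evaluate the second predicate on those rows only.
std::vector<std::uint32_t> fused_scan(const std::vector<std::uint32_t>& a,
                                      std::uint32_t va,
                                      const std::vector<std::uint32_t>& b,
                                      std::uint32_t vb) {
  assert(a.size() == b.size() && a.size() % kLanes == 0);
  std::vector<std::uint32_t> result;
  std::vector<std::uint32_t> pending;  // position list, full at kLanes entries
  for (std::size_t base = 0; base < a.size(); base += kLanes) {
    Reg block;
    for (std::size_t i = 0; i < kLanes; ++i) block[i] = a[base + i];  // load
    const std::uint8_t mask = cmpeq_mask(block, va);
    for (std::size_t i = 0; i < kLanes; ++i) {  // append matching indexes
      if ((mask >> i) & 1u) pending.push_back(static_cast<std::uint32_t>(base + i));
    }
    while (pending.size() >= kLanes) {
      check_second_column(pending, kLanes, b, vb, result);
      pending.erase(pending.begin(),
                    pending.begin() + static_cast<std::ptrdiff_t>(kLanes));
    }
  }
  if (!pending.empty()) {  // flush a partially filled position list
    check_second_column(pending, pending.size(), b, vb, result);
  }
  return result;
}
```

On the columns from the figure with search values 5 and 2, the position list fills up with 1 3 6 10 after the third block and yields row 1, as on the slide. The last block adds rows 12 and 15, which the figure does not show.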

  11. IMPLEMENTATION - ACHIEVEMENTS
     ▸ Fully utilize the CPU's computation power
       ▸ More comparisons/cycle by using SIMD instructions
     ▸ Avoid useless prefetches, only load necessary data of the second column
       ▸ Increased memory bus efficiency
     ▸ Fewer and cheaper branch mispredictions
       ▸ Reduced number of conditions in code
     ▸ Reduced memory transfers
       ▸ Intermediary results are kept in AVX registers and are not materialized

  12. JIT - RUNTIME CODE GENERATION
     [Figure: two query plans over Table A and Table B, side by side: a regular query plan with separate selection operators (σ) below the join (⨝), and a plan with Fused Table Scan]

  13. JIT - RUNTIME CODE GENERATION
     ▸ Problem: Some parameters are only known at runtime
       ▸ Size of scanned values
       ▸ Exact data types
         ▸ Signed & unsigned 1, 2, 4, or 8 byte int plus float & double
       ▸ Type of comparison operators: !=, ==, <, >, <=, >=
     ▸ Larger number of possible code paths
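To see why the number of code paths grows, consider what an ahead-of-time compiled scan needs instead of a JIT: one specialized loop per type and operator, selected at runtime. The following is a sketch under assumptions. The names `Op`, `count_matches`, and `code_paths` are hypothetical, and the path count assumes every predicate independently picks one of the 10 listed types and 6 listed operators.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Op { Eq, Ne, Lt, Gt, Le, Ge };

// The operator is a template parameter, so each instantiation contains a
// single fixed comparison after constant folding.
template <typename T, Op op>
bool compare(T lhs, T rhs) {
  switch (op) {
    case Op::Eq: return lhs == rhs;
    case Op::Ne: return lhs != rhs;
    case Op::Lt: return lhs < rhs;
    case Op::Gt: return lhs > rhs;
    case Op::Le: return lhs <= rhs;
    case Op::Ge: return lhs >= rhs;
  }
  return false;
}

template <typename T, Op op>
std::size_t count_matches_static(const std::vector<T>& col, T value) {
  std::size_t n = 0;
  for (T v : col) {
    if (compare<T, op>(v, value)) ++n;
  }
  return n;
}

// Runtime dispatch: one switch arm, and one compiled loop, per operator.
template <typename T>
std::size_t count_matches(const std::vector<T>& col, Op op, T value) {
  switch (op) {
    case Op::Eq: return count_matches_static<T, Op::Eq>(col, value);
    case Op::Ne: return count_matches_static<T, Op::Ne>(col, value);
    case Op::Lt: return count_matches_static<T, Op::Lt>(col, value);
    case Op::Gt: return count_matches_static<T, Op::Gt>(col, value);
    case Op::Le: return count_matches_static<T, Op::Le>(col, value);
    case Op::Ge: return count_matches_static<T, Op::Ge>(col, value);
  }
  return 0;
}

// 8 integer types (signed and unsigned, 1/2/4/8 bytes) plus float and double.
constexpr std::size_t kTypes = 10;
constexpr std::size_t kOps = 6;

// Specialized loops needed for a fused chain of `predicates` predicates.
constexpr std::size_t code_paths(std::size_t predicates) {
  std::size_t paths = 1;
  for (std::size_t i = 0; i < predicates; ++i) paths *= kTypes * kOps;
  return paths;
}
```

Under these assumptions a single predicate already needs 60 variants and a fused pair needs 3600. That is the motivation for generating only the one variant a query actually uses.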

  14. JIT - RUNTIME CODE GENERATION
     ▸ The query optimizer identifies fusable operator chains
     ▸ Parameters are determined by the translator during runtime
     ▸ Result: Specialized, monolithic function
       ▸ Cached for efficiency
     [Figure: query pipeline: SQL String, SQL Parser, Abstract Syntax Tree, SQL Translator, Logical Query Plan (LQP), Optimizer, Logical Query Plan (LQP), LQP Translator, Physical Query Plan (PQP), Executor; side path: Predicates, JIT Compiler, Binary]
     https://github.com/hyrise/hyrise/
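The idea of a specialized, cached function can be illustrated by generating source text for a predicate chain and caching it by key. This sketch makes no claim about how Hyrise implements it: `Predicate`, `generate_source`, and `CodeCache` are hypothetical, and a real JIT would hand the generated code to a compiler backend and cache the resulting binary rather than a string.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical description of one predicate, filled in at runtime.
struct Predicate {
  std::string column;  // e.g. "col_a"
  std::string type;    // e.g. "uint32_t"
  std::string op;      // e.g. "=="
  std::string value;   // literal as text, e.g. "5"
};

// Emit the source of one specialized, monolithic scan function in which
// all types, operators, and literals are fixed.
std::string generate_source(const std::vector<Predicate>& preds) {
  std::string params;
  std::string condition;
  for (const Predicate& p : preds) {
    params += "const " + p.type + "* " + p.column + ", ";
    if (!condition.empty()) condition += " && ";
    condition += p.column + "[i] " + p.op + " " + p.value;
  }
  if (condition.empty()) condition = "true";
  return "size_t fused_scan(" + params + "size_t n) {\n"
         "  size_t total = 0;\n"
         "  for (size_t i = 0; i < n; ++i) {\n"
         "    if (" + condition + ") ++total;\n"
         "  }\n"
         "  return total;\n"
         "}\n";
}

class CodeCache {
 public:
  // Return the code for this predicate chain, generating it on first use.
  const std::string& get(const std::vector<Predicate>& preds) {
    std::string key;
    for (const Predicate& p : preds) {
      key += p.column + '|' + p.type + '|' + p.op + '|' + p.value + ';';
    }
    auto it = cache_.find(key);
    if (it == cache_.end()) {
      it = cache_.emplace(key, generate_source(preds)).first;
      ++misses_;
    }
    return it->second;
  }
  std::size_t misses() const { return misses_; }

 private:
  std::unordered_map<std::string, std::string> cache_;
  std::size_t misses_ = 0;
};
```

Requesting the same chain twice generates code only once. Whether literals belong in the cache key, as they do here, is a design choice: leaving them out increases reuse but gives up constant folding of the search values.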

  15. EVALUATION
     ▸ On current Skylake system
       ▸ Intel Xeon Platinum @ 2.5 - 3.8 GHz with 2 TB of PC4-2666 main memory
     ▸ Evaluated dimensions during experiments
       ▸ Table Size
       ▸ Selectivity
       ▸ Implementations / Instruction sets
         ▸ SISD, AVX2, and AVX-512, automatic compiler vectorization
         ▸ AVX-Register width: 128, 256, and 512 bit
       ▸ Number of Predicates

  16. EVALUATION - PERFORMANCE RELATIVE TO SISD IMPLEMENTATION

  17. EVALUATION - INSTRUCTION SETS & REGISTER WIDTH
     [Figure: table with 32M rows; x-axis: Matching Rows (%)]

  18. EVALUATION - NUMBER OF PREDICATES
     [Figure: table with 32M rows]

  19. SUMMARY & CONCLUSION
     ▸ Branch mispredictions and useless prefetches are a huge cost factor in multi-predicate scans
     ▸ Doubling the register size does not (yet) double the performance
     ▸ Bringing together AVX-512 with Just-In-Time compilation
       ▸ Use new AVX-512 instructions to efficiently load and remove tuples from AVX-registers without leaving SIMD mode
       ▸ Performance was at least doubled in 80% of test cases
     ▸ Future Work: Impact of other encoding methods
