Conflict Detection-based Run-Length Encoding: AVX-512 CD Instruction Set in Action
Annett Ungethüm, Johannes Pietrzyk, Patrick Damme, Dirk Habich, Wolfgang Lehner
HardBD & Active'18 Workshop in Paris, France, on April 16, 2018
Challenges
2
Challenges for Data Processing Nowadays
Application Side
▪ Multiple application areas
▪ Rapidly growing data volumes to be processed efficiently
System Side
▪ Growing main memory to store the data
▪ Increasingly fast processors for processing the data
[Figure labels: DATA, DB, IR, ML, …]
Problem
▪ Main bottleneck for efficient data processing nowadays: the growing gap between processor speed and main memory bandwidth
Solution
▪ Lightweight compression: access fewer bytes for the same logical information
- Reduced transfer times
- Better cache utilization
- Fewer TLB misses
3
Lightweight Compression Techniques
Technique = abstract idea of how compression works
▪ Run-Length Encoding (RLE): replace each run by its value and length
- 1200 1200 1200 1200 300 300 → (1200, 4) (300, 2)
▪ Differential Coding (DELTA): replace each data element by its difference to its predecessor
- 1000 1100 1150 1350 1355 → 1000 100 50 200 5
▪ Frame-of-Reference (FOR): replace each data element by its difference to a reference value
- 1200 1100 1000 1050 → 200 100 0 50 (reference value 1000)
▪ Dictionary Coding (DICT): replace each data element by its 0-based key in a dictionary
- 1200 1000 1050 1050 → 1 0 2 2 (dictionary: 1000 → 0, 1200 → 1, 1050 → 2)
▪ Null Suppression (NS): eliminate leading zeroes in the binary representation
- 00...001011 → 1011
Vectorization is crucial from a performance perspective
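The five techniques can be illustrated with a few lines of scalar Python. This is only a sketch of the abstract ideas using the numbers from the slide; the function names, the explicit `reference` argument, and the explicit `dictionary` list are assumptions made for illustration and are not taken from the presentation.

```python
def rle(values):
    """Run-length encoding: replace each run by a (value, length) pair."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((v, 1))               # start a new run
    return runs


def delta(values):
    """Differential coding: first value as is, then differences to the predecessor."""
    return [v - (values[i - 1] if i else 0) for i, v in enumerate(values)]


def frame_of_reference(values, reference):
    """FOR: replace each value by its difference to a fixed reference value."""
    return [v - reference for v in values]


def dictionary_encode(values, dictionary):
    """DICT: replace each value by its 0-based position in the dictionary."""
    keys = {value: key for key, value in enumerate(dictionary)}
    return [keys[v] for v in values]


def null_suppress(value):
    """NS: binary representation without leading zeroes (as a bit string)."""
    return format(value, "b")


print(rle([1200, 1200, 1200, 1200, 300, 300]))
print(delta([1000, 1100, 1150, 1350, 1355]))
print(frame_of_reference([1200, 1100, 1000, 1050], reference=1000))
print(dictionary_encode([1200, 1000, 1050, 1050], dictionary=[1000, 1200, 1050]))
print(null_suppress(0b1011))
```

A real implementation would additionally define a physical layout for the compressed output; the sketch only shows the logical transformation.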
4
Vectorization using SIMD
Single Instruction Multiple Data (SIMD)
▪ same instruction on multiple data points simultaneously
Development of Intel’s SIMD Extension
▪ Trend to larger vector registers
- 128-bit (SSE)
- 256-bit (AVX and AVX2)
- 512-bit (AVX-512)
▪ Trend to more instructions (counted using the specification)
5
Vectorization and Lightweight Data Compression
Most algorithms have been proposed for 128-bit SIMD registers
▪ Processing 4 elements (32-bit integers) at once
Example Run-Length Encoding
▪ View subsequent occurrences of the same value as a run
▪ Each run is representable by its value and length → just two integers
RLE-SIMD
▪ Uses SIMD instructions to parallelize comparisons
6
RLE-SIMD: Compression
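The figure for this slide is not part of the transcript. As a rough illustration of how comparisons can be parallelized for RLE, the following scalar Python sketch emulates one plausible compare-based scheme: broadcast the current run value, load `width` elements, compare all lanes at once, and count the matching prefix. The function name, the `width` parameter, and the exact loop structure are assumptions for illustration, not the authors' code.

```python
def rle_simd_emulated(data, width=4):
    """Scalar emulation of a compare-based vectorized RLE compression."""
    out = []
    pos = 0
    n = len(data)
    while pos < n:
        value = data[pos]            # value of the run that starts here
        length = 0
        while pos < n:
            lanes = data[pos:pos + width]          # one (unaligned) vector load
            equal = [x == value for x in lanes]    # broadcast + vector compare
            matched = 0
            for flag in equal:                     # count the matching prefix
                if not flag:
                    break
                matched += 1
            length += matched
            pos += matched
            if matched < len(lanes):               # mismatch inside the vector: run ends
                break
        out.append((value, length))
    return out


print(rle_simd_emulated([1200, 1200, 1200, 1200, 300, 300]))
```

Note that whenever a run ends inside a vector, the remaining lanes of that load are discarded and loaded again for the next run.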
7
Evaluation using Different Vector Sizes
Compression Speed
▪ Measured in million integers per second (mis)
Speedup
▪ Compared to the 128-bit baseline
[Figure labels: well-performing area, non-well-performing area]
8
Non-Well Performing Area
Reasons
▪ For large run lengths, the number of loaded integers approaches 100% of the input, i.e., every value is processed only once.
▪ For sequences with short runs, RLE vectorization needs a significantly higher number of load operations, because values are loaded and processed repeatedly.
▪ This redundant processing increases dramatically with increasing vector width.
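The effect can be made concrete with a small model. The sketch below counts how many integers a compare-based vectorized RLE loads relative to the input size, under the simplifying assumption that every vector load starts at the first element that has not been encoded yet. The function name and the test data are hypothetical and chosen only for illustration.

```python
def loaded_fraction(data, width):
    """Loaded integers divided by input size for a compare-based vectorized RLE model."""
    loaded = 0
    pos = 0
    n = len(data)
    while pos < n:
        lanes = data[pos:pos + width]   # one vector load
        loaded += len(lanes)
        matched = 0
        for x in lanes:                 # lanes equal to the first lane extend the run
            if x != lanes[0]:
                break
            matched += 1
        pos += matched                  # the remaining lanes are loaded again later
    return loaded / n


long_runs = [v for v in range(8) for _ in range(128)]   # average run length 128
short_runs = list(range(1024))                          # average run length 1

for width in (4, 8, 16):   # 128-, 256-, and 512-bit registers with 32-bit integers
    print(width, loaded_fraction(long_runs, width), loaded_fraction(short_runs, width))
```

In this model, long runs are loaded exactly once regardless of the vector width, while for runs of length one the loaded volume grows roughly in proportion to the number of lanes.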
9
SIMD – New Instruction Sets
10
Conflict Detection using AVX512CD
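Input register (read direction: right to left)
… A A C B A
Vector position
… b4 b3 b2 b1 b0
Output register of _mm512_conflict_epi32(...)
▪ Each element is compared with all previous elements; the result is a bitmask per element
▪ b0, b1, b2: no equal previous elements → bitmasks are zero
▪ b3 (A): previous elements C B A compare as ≠ ≠ = → bitmask …0001
▪ b4 (A): previous elements A C B A compare as = ≠ ≠ = → bitmask …1001
▪ The remaining upper bits are filled with 0's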
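The semantics of the conflict detection step can be emulated in scalar Python: for the element at position i, bit j (with j < i) of its result is set when the element at position j holds the same value. The function below is an emulation written for this purpose, not a binding to the intrinsic; it uses letters instead of 32-bit integers for readability.

```python
def conflict_epi32(lanes):
    """Emulate the per-lane result of a conflict detection on a list of lanes."""
    result = []
    for i, x in enumerate(lanes):
        mask = 0
        for j in range(i):            # only previous (lower) positions are compared
            if lanes[j] == x:
                mask |= 1 << j        # bit j marks an equal previous element
        result.append(mask)
    return result


lanes = ["A", "B", "C", "A", "A"]     # positions b0 .. b4 of the slide example
for position, mask in enumerate(conflict_epi32(lanes)):
    print(f"b{position}: {mask:05b}")
```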
11
Step 1: Run Detection
Input: A A B A
▪ Conflict detection, resulting bitmasks: …000 …001 …000 …011
▪ Count leading zeros
▪ Are leading zeros descending? If not: new run (two new runs in this example)
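One way to read the "descending" check is the following: inside a single run, the element at position i conflicts with its direct predecessor, so its bitmask has bit i-1 as the highest set bit and therefore 32 - i leading zeros. Any position where the leading-zero count deviates from this descending sequence starts a new run. The sketch below emulates this reading in scalar Python; the helper names and the treatment of position 0 as a run start are assumptions, not details taken from the slides.

```python
def conflict_epi32(lanes):
    """Bit j of result[i] is set when lanes[j] == lanes[i] for j < i."""
    return [sum(1 << j for j in range(i) if lanes[j] == x)
            for i, x in enumerate(lanes)]


def lzcnt32(x):
    """Number of leading zero bits of a 32-bit value (32 for x == 0)."""
    return 32 - x.bit_length()


def detect_new_runs(lanes):
    """True at every position that starts a new run within the vector."""
    leading_zeros = [lzcnt32(m) for m in conflict_epi32(lanes)]
    expected = [32 - i for i in range(len(lanes))]    # 32, 31, 30, ... inside one run
    # Assumption: position 0 is reported as a run start; in a full implementation
    # it would have to be compared with the last value of the previous vector.
    return [i == 0 or leading_zeros[i] != expected[i] for i in range(len(lanes))]


print(detect_new_runs(["A", "A", "B", "A"]))
```

For the input A A B A, the leading-zero counts are 32, 31, 32, 30, which deviate from 32, 31, 30, 29 at the third and fourth position, matching the two new runs marked on the slide.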
12
Step 2: Run Length Detection
Input: A A B A
Resulting bitmasks: …000 …001 …000 …011 (two new runs)
00000000 00000000 00000000 00000001
→ sllv_epi32
10000000 00000000 00000000 00000000
→ andnot_epi32
01111111 11111111 11111111 11111111
→ lzcnt_epi32 + 1
2
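The slide does not state the operands of the three instructions, so the following is only one reading that reproduces the bit patterns shown: the conflict bitmask of the last element of a run (at position i) is shifted left by 32 - i so that the bit of its direct predecessor lands in the most significant bit, the result is inverted, and the number of leading zeros then equals the number of consecutive equal predecessors. Adding 1 for the element itself gives the run length. The shift amount and all helper names are assumptions.

```python
MASK32 = 0xFFFFFFFF


def lzcnt32(x):
    """Number of leading zero bits of a 32-bit value (32 for x == 0)."""
    return 32 - x.bit_length()


def sllv32(x, count):
    """Variable logical left shift on 32 bits; counts above 31 yield 0."""
    return (x << count) & MASK32 if count < 32 else 0


def run_length_ending_at(masks, i):
    """Length of the run whose last element sits at position i of the vector."""
    shifted = sllv32(masks[i], 32 - i)   # predecessor bit i-1 moves to bit 31
    inverted = ~shifted & MASK32         # and-not against an all-ones value
    return lzcnt32(inverted) + 1         # consecutive equal predecessors + the element itself


masks = [0b000, 0b001, 0b000, 0b011]     # conflict bitmasks for A A B A
print(run_length_ending_at(masks, 1))    # run "A A"
print(run_length_ending_at(masks, 2))    # run "B"
print(run_length_ending_at(masks, 3))    # run "A"
```

This sketch only covers runs that lie completely inside one vector; runs that continue across vector boundaries would need to be merged separately.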
13
Step 3: Storing
[Figure labels: A A B, 2 X 1, Store]
Scatter (RLE512CD)
- Classical storage layout: (run value, run length)-pair
- Independent of vector width
Continuous (RLE512CDAligned)
- Vector wise
- Run lengths and run values go to different memory locations
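The difference between the two output layouts can be sketched as follows. The code only shows the resulting memory layouts, not the scatter or vector-wise store instructions themselves; the function names are hypothetical.

```python
def store_pairs(runs):
    """Classical layout: (run value, run length) pairs interleaved in one buffer."""
    out = []
    for value, length in runs:
        out.extend((value, length))
    return out


def store_split(runs):
    """Split layout: run values and run lengths in two separate buffers."""
    values = [value for value, _ in runs]
    lengths = [length for _, length in runs]
    return values, lengths


runs = [("A", 2), ("B", 1)]
print(store_pairs(runs))
print(store_split(runs))
```

With the split layout, values and lengths can each be written as contiguous vectors, at the cost of departing from the classical pair format.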
14
Evaluation – Load Instructions
15
Evaluation – Vector Instructions
16
Evaluation
Runtime Comparison
▪ Intel Xeon Phi Knights Landing Processor
▪ RLE512CD (Aligned) outperforms the state of the art for small average run lengths
17
Evaluation
Runtime Comparison
▪ Intel Xeon 6130 Processor
▪ Similar results
18
Summary
Development of Intel’s SIMD Extension
▪ Trend to larger vector registers
- 128-bit (SSE)
- 256-bit (AVX and AVX2)
- 512-bit (AVX-512)
▪ Trend to more instructions
Run Length Encoding
▪ Proposed novel implementation using AVX512-CD functionality
Robustness vs. Maximal Performance