Conflict Detection-based Run-Length Encoding – AVX-512 CD Instruction Set in Action




SLIDE 1

Conflict Detection-based Run-Length Encoding – AVX-512 CD Instruction Set in Action

Annett Ungethüm, Johannes Pietrzyk, Patrick Damme, Dirk Habich, Wolfgang Lehner

HardBD & Active'18 Workshop in Paris, France on April 16, 2018

SLIDE 2

Challenges for Data Processing Nowadays

Application Side

▪ Multiple application areas (DATA, IR, ML, DB, …)
▪ Rapidly growing data volumes to be processed efficiently

System Side

▪ Growing main memory to store the data
▪ Increasingly fast processors for processing the data

Problem

▪ Main bottleneck for efficient data processing nowadays
▪ Growing gap between processor speed and main memory bandwidth

Solution

▪ Lightweight Compression: access fewer bytes for the same logical information

  • Reduced transfer times
  • Better cache utilization
  • Fewer TLB misses

SLIDE 3

Lightweight Compression Techniques

Technique = abstract idea of how compression works

▪ Run-Length Encoding (RLE): replace each run by its value and length

  • Example: 1200 1200 1200 1200 300 300 → (1200, 4) (300, 2)

▪ Differential Coding (DELTA): replace each data element by the difference to its predecessor

  • Example: 1000 1100 1150 1350 1355 → 1000 100 50 200 5

▪ Frame-of-Reference (FOR): replace each data element by the difference to a reference value

  • Example: 1200 1100 1000 1050 → 200 100 0 50 (reference value 1000)

▪ Dictionary Coding (DICT): replace each data element by its 0-based key in a dictionary

  • Example: 1200 1000 1050 1050 → 0 1 2 2

▪ Null Suppression (NS): eliminate leading zeroes in the binary representation

  • Example: 00...001011 → 1011

Vectorization is crucial from a performance perspective
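The four value-level techniques on this slide can be sketched in a few lines of scalar Python. This is only an illustration of the abstract ideas, not the vectorized implementations discussed later; the function names are hypothetical, and the dictionary sketch assumes keys are handed out in order of first occurrence.

```python
def rle(values):
    """Run-length encoding: one (value, run length) pair per run."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((v, 1))               # start a new run
    return runs


def delta(values):
    """Differential coding: first element as is, then differences to the predecessor."""
    return [v - (values[i - 1] if i else 0) for i, v in enumerate(values)]


def frame_of_reference(values, reference):
    """Frame-of-reference: difference of every element to one reference value."""
    return [v - reference for v in values]


def dictionary(values):
    """Dictionary coding with 0-based keys (assumption: keys in order of first occurrence)."""
    keys = {}
    encoded = [keys.setdefault(v, len(keys)) for v in values]
    return encoded, list(keys)


print(rle([1200, 1200, 1200, 1200, 300, 300]))
print(delta([1000, 1100, 1150, 1350, 1355]))
print(frame_of_reference([1200, 1100, 1000, 1050], 1000))
print(dictionary([1200, 1000, 1050, 1050]))
```

Null suppression is left out because it changes the physical bit width rather than the values themselves.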

SLIDE 4

Vectorization using SIMD

Single Instruction Multiple Data (SIMD)

▪ same instruction on multiple data points simultaneously

Development of Intel’s SIMD Extension

▪ Trend to larger vector registers

  • 128-bit (SSE)
  • 256-bit (AVX and AVX2)
  • 512-bit (AVX-512)

▪ Trend to more instructions (counted using the specification)

SLIDE 5

Vectorization and Lightweight Data Compression

Most algorithms have been proposed for 128-bit SIMD registers

▪ Processing 4 elements (32-bit integers) at once

Example Run-Length Encoding

▪ View subsequent occurrences of the same value as a run
▪ Each run representable by its value and length → just two integers

RLE-SIMD

▪ Uses SIMD instructions to parallelize comparisons
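A scalar emulation can make the RLE-SIMD idea concrete. The sketch below assumes the commonly described scheme: the current run value is compared against a whole vector of following elements, and the number of leading matches tells how far the run extends. The function name and the `lanes` parameter are hypothetical; `lanes=4` corresponds to 32-bit integers in a 128-bit register.

```python
def rle_simd_emulated(values, lanes=4):
    """Emulate vector-wise RLE compression; returns (value, run length) pairs."""
    runs = []
    i, n = 0, len(values)
    while i < n:
        run_value = values[i]      # in real SIMD code: broadcast to all lanes
        length = 0
        while i < n:
            block = values[i:i + lanes]              # stands in for one vector load
            equal = [x == run_value for x in block]  # stands in for one vector comparison
            # number of leading lanes that still belong to the run
            matches = equal.index(False) if False in equal else len(block)
            length += matches
            i += matches
            if matches < len(block):   # a mismatch inside the vector ends the run
                break
        runs.append((run_value, length))
    return runs


print(rle_simd_emulated([7, 7, 7, 7, 7, 3, 3, 9], lanes=4))
```

Note that after a mismatch the next load starts at the mismatching element, so elements behind it are loaded again.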


SLIDE 6

RLE-SIMD: Compression

SLIDE 7

Evaluation using Different Vector Sizes

Compression Speed

▪ Measured in million integers per second (mis)

Speedup

▪ Compared to baseline of 128-bit

Figure regions: well-performing area, non-well-performing area

SLIDE 8

Non-Well Performing Area

Reasons

▪ For large run lengths, the number of loaded integers approaches 100% of the input, i.e., every value is processed only once.
▪ RLE vectorization uses a significantly higher number of load operations for sequences with short runs.
▪ The redundant processing increases dramatically with increasing vector widths.
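A small back-of-the-envelope model illustrates these three points. It assumes, as a simplification not taken from the slides, that all runs have the same length, that every vector load fetches `lanes` integers, and that each run costs one load per full vector plus one load that hits the run boundary. The function name is hypothetical.

```python
def loaded_per_input(run_length, lanes):
    """Loaded integers per input integer under the simplified model above."""
    loads_per_run = run_length // lanes + 1   # full vectors plus the boundary load
    return loads_per_run * lanes / run_length


# 4, 8 and 16 lanes correspond to 32-bit integers in 128-, 256- and 512-bit registers.
for lanes in (4, 8, 16):
    ratios = [round(loaded_per_input(r, lanes), 2) for r in (1, 4, 64, 1024)]
    print(lanes, ratios)
```

Under this model, runs of length 1 cause every integer to be loaded `lanes` times, so the redundancy grows with the vector width, while for long runs the ratio tends towards 1.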

SLIDE 9

SIMD – New Instruction Sets

SLIDE 10

Conflict Detection using AVX512CD

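Vector position: … 4 3 2 1 0

Input register: … A A C B A (read direction: starting at vector position 0)

Output register: … b4 b3 b2 b1 b0 (one bitmask per element, computed by _mm512_conflict_epi32(...))

▪ b0, b1, b2: no equal previous elements → bitmasks are zero
▪ b3: previous elements C B A compared with A give ≠ ≠ =, so the bit for vector position 0 is set; the rest is filled with 0’s
▪ b4: previous elements A C B A compared with A give = ≠ ≠ =, so the bits for vector positions 3 and 0 are set; the rest is filled with 0’s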
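The per-element behaviour of the conflict detection can be emulated in scalar code. The sketch below follows the documented semantics of the instruction (each element is compared with all elements at lower vector positions); the function name is chosen for readability and the list index stands for the vector position, so index 0 is the rightmost element of the figure.

```python
def conflict_epi32(lanes):
    """Scalar emulation: bit j of result[i] is set when j < i and lanes[j] == lanes[i]."""
    result = []
    for i, value in enumerate(lanes):
        mask = 0
        for j in range(i):            # only previous elements are compared
            if lanes[j] == value:
                mask |= 1 << j
        result.append(mask)
    return result


# Slide example, listed from vector position 0 upwards: A B C A A
masks = conflict_epi32(["A", "B", "C", "A", "A"])
print([format(m, "05b") for m in masks])
```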

SLIDE 11

Step 1: Run Detection

Input: A A B A

▪ Conflict detection
▪ Resulting bitmasks: …000 …001 …000 …011
▪ Count leading zeros
▪ Are leading zeros descending? Where they are not, a new run starts (in the example: at B and at the last A)
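One way to read the "descending leading zeros" test: if the element at vector position i equals its direct predecessor, bit i-1 is the highest set bit of its conflict bitmask, so the 32-bit leading-zero count is exactly 32 - i. Any other count marks a new run. The sketch below is an interpretation of the slide under that assumption, with hypothetical helper names; position 0 is left out because its status depends on the previous vector.

```python
def lzcnt32(x):
    """Leading zeros of a 32-bit value (32 for x == 0)."""
    return 32 - x.bit_length()


def conflict_epi32(lanes):
    """Scalar emulation: bit j of result[i] is set when j < i and lanes[j] == lanes[i]."""
    return [sum(1 << j for j in range(i) if lanes[j] == lanes[i])
            for i in range(len(lanes))]


def run_starts(lanes):
    """Vector positions >= 1 at which a new run begins."""
    leading_zeros = [lzcnt32(m) for m in conflict_epi32(lanes)]
    # Within a run the counts go 31, 30, 29, ...; a break in that pattern is a new run.
    return [i for i in range(1, len(lanes)) if leading_zeros[i] != 32 - i]


print(run_starts(["A", "A", "B", "A"]))
```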

SLIDE 12

Step 2: Run Length Detection

Input: A A B A

Resulting bitmasks: …000 …001 …000 …011 (new runs at B and at the last A)

Example for the bitmask …001:

▪ 00000000 00000000 00000000 00000001
▪ sllv_epi32 → 10000000 00000000 00000000 00000000
▪ andnot_epi32 → 01111111 11111111 11111111 11111111
▪ lzcnt_epi32 + 1 → 2
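The bit patterns on this slide can be reproduced with scalar stand-ins for the three operations. The sketch assumes that the bitmask at vector position i is shifted left by 32 - i, so that the bit of the direct predecessor lands in the most significant position; that shift count and the all-ones operand of the and-not are assumptions, since the slide only names the operations. Counting the leading ones (leading zeros after inversion) then gives the number of directly preceding equal elements, and adding 1 gives the run length up to that position.

```python
MASK32 = 0xFFFFFFFF


def lzcnt32(x):
    """Leading zeros of a 32-bit value (32 for x == 0)."""
    return 32 - x.bit_length()


def sllv32(x, count):
    """Logical left shift of a 32-bit value; counts above 31 give 0."""
    return (x << count) & MASK32 if count < 32 else 0


def andnot32(a, b):
    """(NOT a) AND b on 32-bit values."""
    return ~a & b & MASK32


def length_so_far(conflict_mask, position):
    """Length of the current run up to and including the given vector position."""
    shifted = sllv32(conflict_mask, 32 - position)   # assumed shift count
    inverted = andnot32(shifted, MASK32)             # assumed all-ones operand
    return lzcnt32(inverted) + 1


# Bitmasks for the input A A B A, listed from vector position 0 upwards.
print([length_so_far(m, i) for i, m in enumerate([0b000, 0b001, 0b000, 0b011])])
```

The length of a complete run is then the value at the last position before the next run starts.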

SLIDE 13

Step 3: Storing

Figure labels: A A B, 2 X 1

Store

Scatter (RLE512CD)

  • Classical storage layout: (run value, run length)-pair
  • Independent of vector width

Continuous (RLE512CDAligned)

  • Vector wise
  • Run length and run value to different memory locations
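The difference between the two output layouts can be shown on plain lists. This sketch only illustrates where values and lengths end up; it does not model scatter instructions, vector-wise stores or alignment, and the function names are hypothetical.

```python
def store_pairs(runs):
    """Classical layout: value and length of each run stored next to each other."""
    out = []
    for value, length in runs:
        out.extend((value, length))
    return out


def store_split(runs):
    """Split layout: run values and run lengths go to different memory locations."""
    values = [value for value, _ in runs]
    lengths = [length for _, length in runs]
    return values, lengths


runs = [("A", 2), ("B", 1)]
print(store_pairs(runs))
print(store_split(runs))
```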

SLIDE 14

Evaluation – Load Instructions

SLIDE 15

Evaluation – Vector Instructions

SLIDE 16

Evaluation

Runtime Comparison

▪ Intel Xeon Phi Knights Landing Processor
▪ RLE512CD (Aligned) outperforms the state of the art for small average run lengths

SLIDE 17

Evaluation

Runtime Comparison

▪ Intel Xeon 6130 Processor
▪ Similar results

SLIDE 18

Summary

Development of Intel’s SIMD Extension

▪ Trend to larger vector registers

  • 128-bit (SSE)
  • 256-bit (AVX and AVX2)
  • 512-bit (AVX-512)

▪ Trend to more instructions

Run Length Encoding

▪ Proposed novel implementation using AVX512-CD functionality

Robustness vs. Maximal Performance

SLIDE 19

Conflict Detection-based Run-Length Encoding – AVX-512 CD Instruction Set in Action

Annett Ungethüm, Johannes Pietrzyk, Patrick Damme, Dirk Habich, Wolfgang Lehner

HardBD & Active'18 Workshop in Paris, France on April 16, 2018