
Conflict Detection-based Run-Length Encoding – AVX-512 CD Instruction Set in Action Annett Ungethüm, Johannes Pietrzyk, Patrick Damme, Dirk Habich, Wolfgang Lehner HardBD & Active'18 Workshop in Paris, France on April 16, 2018

Challenges for Data Processing Nowadays ▪ Application side: multiple application areas (DB, IR, ML, …); rapidly growing data volumes to be processed efficiently ▪ System side: increasingly fast processors for processing the data; growing main memory to store the data ▪ Problem: growing gap between processor speed and main memory bandwidth, the main bottleneck for efficient data processing nowadays ▪ Solution: Lightweight Compression, i.e. access fewer bytes for the same logical information - Reduced transfer times - Better cache utilization - Fewer TLB misses 2

Lightweight Compression Techniques ▪ Technique = abstract idea of how compression works ▪ RLE (Run Length Encoding): replace run by value & length ▪ DELTA (Differential Coding): replace data elem. by difference to predecessor ▪ FOR (Frame-of-Reference): replace data elem. by difference to reference value ▪ DICT (Dictionary Coding): replace data elem. by 0-based key in dictionary ▪ NS (Null Suppression): eliminate leading zeroes in binary representation [Figure: example input and output values for each technique] ▪ Vectorization is crucial from performance perspective 3
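The four techniques other than RLE can be sketched in a few lines each. This is a minimal illustration, not the presenters' implementation: the function names, the choice of the minimum as the FOR reference value, the sorted dictionary order, and keeping the first DELTA element as the difference to 0 are all assumptions made for the example.

```python
def delta_encode(xs):
    # DELTA: replace each element by its difference to the predecessor.
    # Assumption: the first element is stored as the difference to 0.
    return [x - p for p, x in zip([0] + xs[:-1], xs)]


def for_encode(xs):
    # FOR: replace each element by its difference to a reference value.
    # Assumption: the reference value is the minimum of the input.
    ref = min(xs)
    return ref, [x - ref for x in xs]


def dict_encode(xs):
    # DICT: replace each element by its 0-based key in a dictionary.
    # Assumption: the dictionary holds the distinct values in sorted order.
    dictionary = sorted(set(xs))
    key_of = {v: k for k, v in enumerate(dictionary)}
    return dictionary, [key_of[x] for x in xs]


def ns_encode(x):
    # NS: drop the leading zeroes of the binary representation.
    return format(x, "b")


data = [1200, 1200, 1200, 1000, 1000, 1100]
print(delta_encode(data))   # [1200, 0, 0, -200, 0, 100]
print(for_encode(data))     # (1000, [200, 200, 200, 0, 0, 100])
print(dict_encode(data))    # ([1000, 1100, 1200], [2, 2, 2, 0, 0, 1])
print(ns_encode(11))        # 1011
```

All four reduce the magnitude of the stored integers, which is what lets a null-suppression step pack them into fewer bits afterwards.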

Vectorization using SIMD Single Instruction Multiple Data (SIMD) ▪ same instruction on multiple data points simultaneously Development of Intel’s SIMD Extension ▪ Trend to larger vector registers - 128-bit (SSE) - 256-bit (AVX and AVX2) - 512-bit (AVX-512) ▪ Trend to more instructions (counted using specification) 4

Vectorization and Lightweight Data Compression Most algorithms have been proposed for 128-bit SIMD registers ▪ Processing 4 elements (32-bit integers) at once Example Run-Length Encoding ▪ View subsequent occurrences of the same value as a run ▪ Each run representable by its value and length → just two integers RLE-SIMD ▪ Uses SIMD instructions to parallelize comparisons 5
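The "just two integers per run" representation can be illustrated with a scalar round trip. This is a plain reference sketch with hypothetical function names; it does not use SIMD and is not the RLE-SIMD algorithm itself.

```python
def rle_compress(data):
    # Emit a flat list [value0, length0, value1, length1, ...].
    out = []
    i = 0
    while i < len(data):
        value = data[i]
        length = 1
        # Extend the run while the next element repeats the run value.
        while i + length < len(data) and data[i + length] == value:
            length += 1
        out.extend([value, length])
        i += length
    return out


def rle_decompress(flat):
    # Expand each (value, length) pair back into a run.
    out = []
    for k in range(0, len(flat), 2):
        out.extend([flat[k]] * flat[k + 1])
    return out


data = [5, 5, 5, 5, 7, 7, 9]
packed = rle_compress(data)
print(packed)                          # [5, 4, 7, 2, 9, 1]
print(rle_decompress(packed) == data)  # True
```

Seven input integers become six here; the saving grows with the average run length, and input with runs of length 1 doubles in size.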

RLE-SIMD: Compression 6

Evaluation using Different Vector Sizes ▪ Compression Speed: measured in million integers per second (mis) ▪ Speedup: compared to baseline of 128-bit [Figure: plots annotated with a non-well-performing area and a well-performing area] 7

Non-Well Performing Area Reasons ▪ For large run lengths, the number of loaded integers approaches more or less 100%, i.e. every value is only processed once. ▪ RLE vectorization uses a significantly higher number of load operations for sequences with short runs. ▪ The redundant processing dramatically increases with increasing vector widths. 8
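The effect of redundant loads can be made concrete with a scalar model that counts how many integers are loaded. The model is an assumption about how a comparison-based vectorized RLE proceeds (load `width` elements, compare them against the current run value, advance by the number of leading matches, and start the next run at the first mismatch); the function name and the model itself are illustrative and not taken from the slides.

```python
def rle_vector_model(data, width):
    # Returns (runs, loaded): the detected runs and the number of integers
    # that were loaded, counting every element of every simulated vector load.
    runs = []
    loaded = 0
    i = 0
    n = len(data)
    while i < n:
        value = data[i]
        length = 0
        while i < n:
            chunk = data[i:i + width]   # one simulated vector load
            loaded += len(chunk)
            matches = 0
            for x in chunk:             # simulated lane-wise comparison
                if x != value:
                    break
                matches += 1
            length += matches
            i += matches
            if matches < len(chunk):    # run ended inside this vector
                break
        runs.append((value, length))
    return runs, loaded


long_runs = [7] * 64
short_runs = [k % 2 for k in range(64)]   # every run has length 1

for width in (4, 16):
    _, loaded_long = rle_vector_model(long_runs, width)
    _, loaded_short = rle_vector_model(short_runs, width)
    print(width, loaded_long / 64, loaded_short / 64)
```

Under this model, the single long run loads each of the 64 integers exactly once at either width, while the input with runs of length 1 loads 250 integers at width 4 and 904 at width 16, so the redundancy grows with the vector width.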

SIMD – New Instruction Sets 9

Conflict Detection using AVX512CD ▪ _mm512_conflict_epi32(...) ▪ Read direction: right to left (vector positions … 4 3 2 1 0) ▪ Input register: … A A C B A ▪ Output register: … b4 b3 b2 b1 b0 ▪ No equal previous elements → bitmasks are zero ▪ b4: previous elements A C B A compare as = ≠ ≠ =, giving 1 0 0 1 (remaining bits filled with 0’s) ▪ b3: previous elements C B A compare as ≠ ≠ =, giving 0 0 1 (remaining bits filled with 0’s) 10
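The per-lane result of the conflict instruction can be emulated in scalar code: lane i receives a bitmask in which bit j is set when lane j, with j < i, holds the same value. The sketch below is a Python emulation of that semantics for a list of lanes (index 0 is vector position 0); the function name is hypothetical and the code does not call the intrinsic.

```python
def conflict_emulated(lanes):
    # For each lane i, build a bitmask of all earlier lanes j < i
    # that hold the same value as lane i.
    out = []
    for i, x in enumerate(lanes):
        mask = 0
        for j in range(i):
            if lanes[j] == x:
                mask |= 1 << j
        out.append(mask)
    return out


# Slide example, written from position 0 upwards: A, B, C, A, A
masks = conflict_emulated(["A", "B", "C", "A", "A"])
print([format(m, "04b") for m in masks])
# ['0000', '0000', '0000', '0001', '1001']
```

Positions 3 and 4 reproduce the bitmasks b3 = …001 and b4 = …1001 from the slide; positions 0 to 2 have no equal previous element and stay zero.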

Step 1: Run Detection ▪ Input: A B A A ▪ Conflict detection, resulting bitmasks: …011 …000 …001 …000 ▪ Count leading zeros ▪ Are leading zeros descending? [Figure: two positions marked "New run"] 11
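One way to read this step is sketched below, as an interpretation rather than the presenters' exact procedure. Lane i continues the run of lane i-1 exactly when bit i-1 is set in its conflict bitmask, which is the highest bit that can be set, so the leading-zero count is then 32 - i. Along an unbroken run the counts therefore descend by one per lane, and any other count marks a new run. Treating position 0 as a run start is an assumption, since its predecessor lives in the previous vector. All function names are hypothetical.

```python
def conflict_emulated(lanes):
    # Scalar emulation of the conflict bitmasks (bit j set if lane j == lane i, j < i).
    return [sum(1 << j for j in range(i) if lanes[j] == x)
            for i, x in enumerate(lanes)]


def lzcnt32(x):
    # Number of leading zero bits in a 32-bit value; 32 for x == 0.
    return 32 - x.bit_length()


def run_starts(lanes):
    masks = conflict_emulated(lanes)
    starts = [True]  # assumption: position 0 opens a run in this sketch
    for i in range(1, len(lanes)):
        # Continuation of the previous lane's run <=> lzcnt == 32 - i.
        starts.append(lzcnt32(masks[i]) != 32 - i)
    return starts


# Slide example, written from position 0 upwards: A, A, B, A
lanes = ["A", "A", "B", "A"]
print([lzcnt32(m) for m in conflict_emulated(lanes)])  # [32, 31, 32, 30]
print(run_starts(lanes))                               # [True, False, True, True]
```

Positions 2 and 3 come out as new runs, which is consistent with the two "New run" marks on the slide under this reading.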

Step 2: Run Length Detection ▪ Input: A B A A with bitmasks …011 …000 …001 …000 and two new runs marked ▪ sllv_epi32: 00000000 00000000 00000000 00000001 → 10000000 00000000 00000000 00000000 ▪ andnot_epi32: → 01111111 11111111 11111111 11111111 ▪ lzcnt_epi32 + 1 → 2 12
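The bit patterns shown on this slide can be traced for a single 32-bit lane. The sketch below only reproduces the arithmetic of the displayed values: the shift count of 31 and the all-ones second operand of the and-not are assumptions chosen because they yield exactly the patterns on the slide, and how the real implementation derives those operands per lane is not recoverable from the transcript. The helper names are hypothetical scalar stand-ins for one lane of the corresponding instructions.

```python
MASK32 = 0xFFFFFFFF


def sllv_lane(value, count):
    # Variable left shift of one 32-bit lane; counts above 31 give 0.
    return (value << count) & MASK32 if count < 32 else 0


def andnot_lane(a, b):
    # (NOT a) AND b on one 32-bit lane.
    return (~a) & b & MASK32


def lzcnt_lane(x):
    # Number of leading zero bits in a 32-bit value; 32 for x == 0.
    return 32 - x.bit_length()


shifted = sllv_lane(0x00000001, 31)       # assumption: shift count 31
inverted = andnot_lane(shifted, MASK32)   # assumption: second operand all ones
result = lzcnt_lane(inverted) + 1

print(format(shifted, "032b"))    # 10000000000000000000000000000000
print(format(inverted, "032b"))   # 01111111111111111111111111111111
print(result)                     # 2
```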

Step 3: Storing ▪ Store variants: Scatter (RLE512CD) and Continuous (RLE512CDAligned) • Classical storage layout: (run value, run length)-pair • Vector wise • Independent of vector width • Run length and run value to different memory locations 13
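The two output layouts mentioned on the slide, interleaved (run value, run length) pairs versus values and lengths in different memory locations, can be contrasted with a small sketch. The function names are hypothetical, and the sketch only shows the resulting layouts, not the scatter or vector-wise store instructions that produce them.

```python
def store_pairs(runs):
    # Classical layout: value and length of each run sit next to each other.
    flat = []
    for value, length in runs:
        flat.append(value)
        flat.append(length)
    return flat


def store_split(runs):
    # Split layout: run values and run lengths go to separate arrays.
    values = [value for value, _ in runs]
    lengths = [length for _, length in runs]
    return values, lengths


runs = [(10, 2), (20, 1), (10, 1)]
print(store_pairs(runs))   # [10, 2, 20, 1, 10, 1]
print(store_split(runs))   # ([10, 20, 10], [2, 1, 1])
```

The pair layout matches what a scalar RLE decoder expects, whereas the split layout lets a whole vector of values or lengths be written with one contiguous store.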

Evaluation – Load Instructions 14

Evaluation – Vector Instructions 15

Evaluation Runtime Comparison ▪ Intel Xeon Phi Knights Landing Processor ▪ RLE512CD (Aligned) outperforms state-of-the-art for small average run lengths 16

Evaluation Runtime Comparison ▪ Intel Xeon 6130 Processor ▪ Similar results 17

Summary Development of Intel’s SIMD Extension ▪ Trend to larger vector registers - 128-bit (SSE) - 256-bit (AVX and AVX2) - 512-bit (AVX-512) ▪ Trend to more instructions Robustness vs. Maximal Performance Run Length Encoding ▪ Proposed novel implementation using AVX512-CD functionality 18

