Compression Algorithms for GPUs, Annie Yang and Martin Burtscher (PowerPoint presentation)

SLIDE 1

Synthesizing Effective Data Compression Algorithms for GPUs

Annie Yang and Martin Burtscher*

Department of Computer Science

SLIDE 2

Highlights

  • MPC compression algorithm
  • Brand-new lossless compression algorithm for single- and double-precision floating-point data

  • Systematically derived to work well on GPUs
  • MPC features
  • Compression ratio is similar to best CPU algorithms
  • Throughput is much higher
  • Requires little internal state (no tables or dictionaries)

Synthesizing Effective Data Compression Algorithms for GPUs 2

SLIDE 3

Introduction

  • High-Performance Computing Systems
  • Depend increasingly on accelerators
  • Process large amounts of floating-point (FP) data
  • Moving this data is often the performance bottleneck
  • Data compression
  • Can increase transfer throughput
  • Can reduce storage requirement
  • But only if effective, fast (real-time), and lossless

[Figure: floating-point format fields: sign (S), exponent, mantissa]

SLIDE 4

Problem Statement

  • Existing FP compression algorithms for GPUs
  • Fast but compress poorly
  • Existing FP compression algorithms for CPUs
  • Compress much better but are slow
  • Parallel codes run serial algorithms on multiple chunks
  • Too much state per thread for a GPU implementation
  • Best serial algos may not be scalably parallelizable
  • Do effective FP compression algos for GPUs exist?
  • And if so, how can we create such an algorithm?

SLIDE 5

Our Approach

  • Need a brand-new massively-parallel algorithm
  • Study existing FP compression algorithms
  • Break them down into constituent parts
  • Only keep GPU-friendly parts
  • Generalize them as much as possible
  • Resulted in 24 algorithmic components
  • CUDA implementation: each component takes a sequence of values as input and outputs the transformed sequence
  • Components operate on the integer representation of the data

[Image credit: Charles Trevelyan for http://plus.maths.org/]

SLIDE 6

Our Approach (cont.)

  • Automatically synthesize compression algorithms by chaining components
  • Use exhaustive search to find the best four-component chains
  • Synthesize decompressor
  • Employ inverse components
  • Perform the opposite transformation on the data
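The chained search can be illustrated with a toy serial sketch in Python. This is not the paper's CUDA pipeline: the two stages and the ZE size proxy below are simplified stand-ins for the actual 24 components, invented here for illustration only.

```python
from itertools import product

def ident(seq):
    """Identity stage (like NUL)."""
    return list(seq)

def lnv1s(seq):
    """Delta stage: subtract the previous value (like LNV1s)."""
    return [v - (seq[i - 1] if i else 0) for i, v in enumerate(seq)]

def ze_size(seq):
    """Size proxy for a final ZE stage: non-zero count plus bitmap words."""
    return sum(1 for v in seq if v) + (len(seq) + 31) // 32

def search_best_chain(stages, data, length=3):
    """Exhaustively try every chain of `length` stages and keep the
    one whose ZE-encoded output is smallest."""
    best_size, best_chain = float('inf'), None
    for chain in product(stages, repeat=length):
        seq = list(data)
        for stage in chain:
            seq = stage(seq)
        size = ze_size(seq)
        if size < best_size:
            best_size, best_chain = size, chain
    return best_chain, best_size
```

On a constant input such as [10, 10, 10, 10] the search settles on a chain containing exactly one delta stage, which zeroes out all but the first value before the ZE proxy is applied.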

SLIDE 7

Mutator Components

  • Mutators computationally transform each value
  • Do not use information about any other value
  • NUL outputs the input block (identity)
  • INV flips all the bits
  • │, called cut, is a singleton pseudo-component that converts a block of words into a block of bytes
  • Merely a type cast, i.e., no computation or data copying
  • Byte granularity can be better for compression
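A serial Python sketch of the mutators (the real components are CUDA kernels operating on machine words; the list representation and the little-endian byte order of the cut are assumptions of this sketch):

```python
def nul(block):
    """NUL: identity -- output the input block unchanged."""
    return list(block)

def inv(block, bits=32):
    """INV: flip all bits of every word."""
    mask = (1 << bits) - 1
    return [v ^ mask for v in block]

def cut(block, bits=32):
    """Cut (|): reinterpret a block of words as a block of bytes.
    In CUDA this is a pure type cast with no data movement; here the
    bytes are materialized explicitly, least significant byte first."""
    return [(v >> (8 * i)) & 0xFF for v in block for i in range(bits // 8)]
```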

SLIDE 8

Shuffler Components

  • Shufflers reorder whole values or bits of values
  • Do not perform any computation
  • Each thread block operates on a chunk of values
  • BIT emits the most significant bits of all values, followed by the second most significant bits, etc.
  • DIMn groups values by dimension n
  • Tested n = 2, 3, 4, 5, 8, 16, and 32
  • For example, DIM2 turns the sequence A, B, C, D, E, F into A, C, E, B, D, F
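The two shufflers admit a compact serial sketch (Python, illustrative only; the real components are parallel CUDA kernels working on per-thread-block chunks):

```python
def dim(block, n):
    """DIMn: group values by dimension -- element i joins group i mod n.
    DIM2 turns A, B, C, D, E, F into A, C, E, B, D, F."""
    return [block[i] for d in range(n) for i in range(d, len(block), n)]

def bit(block, bits=32):
    """BIT: emit the most significant bits of all values, then the
    second most significant bits, etc., repacked into words."""
    stream = [(v >> b) & 1 for b in range(bits - 1, -1, -1) for v in block]
    return [int(''.join(map(str, stream[i:i + bits])), 2)
            for i in range(0, len(stream), bits)]
```

Note that neither function computes on the values; both only reorder bits or whole values, which is what makes these components cheap on a GPU.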

SLIDE 9

Predictor Components

  • Predictors guess values based on previous values and compute residuals (true minus guessed value)
  • Residuals tend to cluster around zero, making them easier to compress than the original sequence
  • Each thread block operates on a chunk of values
  • LNVns subtracts the nth prior value from the current value
  • Tested n = 1, 2, 3, 5, 6, and 8
  • LNVnx XORs the current value with the nth prior value
  • Tested n = 1, 2, 3, 5, 6, and 8
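A serial Python sketch of the two predictor families (the CUDA versions operate per chunk; treating out-of-range prior values as 0 is an assumption of this sketch):

```python
def lnv_s(block, n):
    """LNVns: residual = current value minus the nth prior value
    (values before the start are taken as 0)."""
    return [v - (block[i - n] if i >= n else 0) for i, v in enumerate(block)]

def lnv_x(block, n):
    """LNVnx: residual = current value XOR the nth prior value."""
    return [v ^ (block[i - n] if i >= n else 0) for i, v in enumerate(block)]
```

On slowly varying data, e.g. [100, 101, 103, 104], LNV1s yields [100, 1, 2, 1]: residuals clustered near zero, exactly what the reducers exploit.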

SLIDE 10

Reducer Components

  • Reducers eliminate redundancies in value sequence
  • All other components cannot change the length of the sequence, i.e., only reducers can compress it
  • Each thread block operates on a chunk of values
  • ZE emits a bitmap of the zeros followed by the non-zero values
  • Effective if the input sequence contains many zeros
  • RLE performs run-length encoding, i.e., replaces repeating values by a count and a single value
  • Effective if the input contains many repeating values
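Serial Python sketches of the two reducers (illustrative; the bitmap and run lists would be bit-packed in a real encoder):

```python
def ze(block):
    """ZE: a bitmap marking the non-zero positions, followed by the
    non-zero values themselves."""
    return [1 if v else 0 for v in block], [v for v in block if v]

def rle(block):
    """RLE: replace each run of repeating values by a (count, value) pair."""
    runs = []
    for v in block:
        if runs and runs[-1][1] == v:
            runs[-1][0] += 1          # extend the current run
        else:
            runs.append([1, v])       # start a new run
    return [(c, v) for c, v in runs]
```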

SLIDE 11

Algorithm Synthesis

  • Determine best four-stage algorithms with a cut
  • Exhaustive search of all possible 138,240 combinations
  • 13 double-precision data sets (19 – 277 MB)
  • Observational data, simulation results, MPI messages
  • Single-precision data derived from double-precision data
  • Create general GPU-friendly compression algorithm
  • Analyze best algorithm for each data set and precision
  • Find commonalities and generalize into one algorithm

SLIDE 12

Best of 138,240 Algorithms

data set      double precision         single precision
msg_bt        LNV1s BIT LNV1s ZE |     DIM5 ZE LNV6x | ZE
msg_lu        LNV5s | DIM8 BIT RLE     LNV5s LNV5s LNV5x | ZE
msg_sp        DIM3 LNV5x BIT ZE |      DIM3 LNV5x BIT ZE |
msg_sppm      DIM5 LNV6x ZE | ZE       RLE DIM5 LNV6s ZE |
msg_sweep3d   LNV1s DIM32 | DIM8 RLE   LNV1s DIM32 | DIM4 RLE
num_brain     LNV1s BIT LNV1s ZE |     LNV1s BIT LNV1s ZE |
num_comet     LNV1s BIT LNV1s ZE |     LNV1s | DIM4 BIT RLE
num_control   LNV1s BIT LNV1s ZE |     LNV1s BIT LNV1s ZE |
num_plasma    LNV2s LNV2s LNV2x | ZE   LNV2s LNV2s LNV2x | ZE
bs_error      LNV1x ZE LNV1s ZE |      LNV6s BIT LNV1s ZE |
bs_info       LNV2s | DIM8 BIT RLE     LNV8s DIM2 | DIM4 RLE
bs_spitzer    ZE BIT LNV1s ZE |        ZE BIT LNV1s ZE |
bs_temp       LNV8s BIT LNV1s ZE |     BIT LNV1x DIM32 | RLE
overall best  LNV6s BIT LNV1s ZE |     LNV6s BIT LNV1s ZE |

SLIDE 13

Analysis of Reducers

  • Double-precision results only
  • Single-precision results similar
  • ZE or RLE required at end
  • Not counting the cut (encoder)
  • ZE dominates
  • Many 0s but not in a row
  • First three stages
  • Contain almost no reducers
  • Transformations are key to making the reducer effective
  • Chaining whole compression algorithms may be futile

data set      double precision
msg_bt        LNV1s BIT LNV1s ZE |
msg_lu        LNV5s | DIM8 BIT RLE
msg_sp        DIM3 LNV5x BIT ZE |
msg_sppm      DIM5 LNV6x ZE | ZE
msg_sweep3d   LNV1s DIM32 | DIM8 RLE
num_brain     LNV1s BIT LNV1s ZE |
num_comet     LNV1s BIT LNV1s ZE |
num_control   LNV1s BIT LNV1s ZE |
num_plasma    LNV2s LNV2s LNV2x | ZE
bs_error      LNV1x ZE LNV1s ZE |
bs_info       LNV2s | DIM8 BIT RLE
bs_spitzer    ZE BIT LNV1s ZE |
bs_temp       LNV8s BIT LNV1s ZE |
overall best  LNV6s BIT LNV1s ZE |

SLIDE 14

Analysis of Mutators

  • NUL and INV never used
  • No need to invert bits
  • Fewer stages perform worse
  • Cut often at end (not used)
  • Word granularity suffices
  • Easier/faster to implement
  • DIM8 right after cut
  • DIM4 with single precision
  • Used to separate the byte positions of each word
  • Synthesis yielded an unforeseen use of the DIM component


SLIDE 15

Analysis of Shufflers

  • Shufflers are important
  • Almost always included
  • BIT used very frequently
  • FP bit positions correlate more strongly than the values themselves
  • DIM has two uses
  • Separates bytes (see before) when placed right after the cut
  • Separates values of multi-dim data sets (intended use) in early stages


SLIDE 16

Analysis of Predictors

  • Predictors very important
  • (Data model)
  • Used in every case
  • Often 2 predictors used
  • LNVns dominates LNVnx
  • Arithmetic (sub) difference superior to bit-wise (xor) difference in the residual
  • Dimension n separates values of multi-dim data sets (in 1st stage)


SLIDE 17

Analysis of Overall Best Algorithm

  • Same algo for SP and DP
  • Few components mismatch
  • But LNV6s dim is off
  • Most frequent pattern
  • LNV*s BIT LNV1s ZE
  • Star denotes dimensionality
  • Why 6 in starred position?
  • Not used in individual algos
  • 6 is the least common multiple of 1, 2, and 3
  • Did not test n > 8


SLIDE 18

MPC: Generalization of Overall Best

  • MPC algorithm
  • Massively Parallel Compression
  • Uses generalized pattern
  • “LNVds BIT LNV1s ZE” where d is the data set dimensionality
  • Matches the best algorithm on several DP and SP data sets
  • Performs even better when the true dimensionality is used


SLIDE 19

Evaluation Methodology

  • System
  • Dual 10-core Xeon E5-2680 v2 CPU
  • K40 GPU with 15 SMs (2880 cores)
  • 13 DP and 13 SP real-world data sets
  • Same as before
  • Compression algorithms
  • CPU: bzip2, gzip, lzop, and pFPC
  • GPU: GFC and MPC (our algorithm)

SLIDE 20

Compression Ratio (Double Precision)

  • MPC delivers record compression on 5 data sets
  • In spite of “GPU-friendly components” constraint
  • MPC outperformed by bzip2 and pFPC on average
  • Due to msg_sppm and num_plasma
  • MPC superior to GFC (only other GPU compressor)

data set      bzip2 --best  gzip --best  lzop -9  pFPC -1M  GFC    MPC
HarMean       1.321         1.239        1.158    1.365     1.179  1.248
msg_bt        1.088         1.130        1.052    1.250     1.122  1.207
msg_lu        1.018         1.055        1.000    1.137     1.148  1.212
msg_sp        1.055         1.107        1.003    1.238     1.202  1.208
msg_sppm      6.933         7.431        6.780    4.710     3.506  2.999
msg_sweep3d   1.294         1.092        1.017    1.888     1.217  1.287
num_brain     1.043         1.064        1.000    1.148     1.090  1.182
num_comet     1.173         1.162        1.082    1.151     1.110  1.267
num_control   1.029         1.058        1.017    1.038     1.013  1.106
num_plasma    5.789         1.608        1.503    7.042     1.125  1.164
bs_error      1.339         1.448        1.273    1.542     1.233  1.180
bs_info       1.217         1.154        1.096    1.215     1.141  1.214
bs_spitzer    1.752         1.231        1.142    1.022     1.022  1.184
bs_temp       1.024         1.036        1.000    0.997     1.037  1.101

SLIDE 21

Compression Ratio (Single Precision)

  • MPC delivers record compression on 8 data sets
  • In spite of “GPU-friendly components” constraint
  • MPC is outperformed by bzip2 on average
  • Due to num_plasma
  • MPC is “superior” to GFC and pFPC
  • They do not support single-precision data, MPC does

data set      bzip2 --best  gzip --best  lzop -9  MPC
HarMean       1.398         1.267        1.153    1.350
msg_bt        1.129         1.179        1.075    1.336
msg_lu        1.041         1.086        1.000    1.440
msg_sp        1.141         1.200        1.083    1.385
msg_sppm      8.741         9.605        8.634    3.813
msg_sweep3d   2.355         1.151        1.033    1.534
num_brain     1.113         1.128        1.003    1.344
num_comet     1.117         1.151        1.086    1.178
num_control   1.043         1.080        1.016    1.122
num_plasma    8.652         1.383        1.223    1.345
bs_error      1.338         1.466        1.246    1.298
bs_info       1.327         1.200        1.129    1.436
bs_spitzer    1.394         1.188        1.077    1.047
bs_temp       1.049         1.079        1.000    1.114

SLIDE 22

Throughput (Gigabytes per Second)

  • MPC outperforms all CPU compressors
  • Including pFPC running on two 10-core CPUs by 7.5x
  • MPC slower than GFC but mostly faster than PCIe
  • MPC uses slow O(n log n) prefix scan implementation

Throughput (GB/s):

              double precision    single precision
              compr.   decom.     compr.   decom.
bzip2 --best  0.01     0.02       0.01     0.02
gzip --best   0.02     0.15       0.03     0.15
lzop -9       0.01     1.93       0.01     1.43
pFPC -1M      1.43     1.05       n/a      n/a
GFC           32.28    31.47      n/a      n/a
MPC           10.78    7.91       5.81     4.23

Throughput relative to MPC:

              double precision    single precision
              compr.   decom.     compr.   decom.
bzip2 --best  0.1%     0.3%       0.1%     0.6%
gzip --best   0.2%     1.9%       0.4%     3.5%
lzop -9       0.1%     24.4%      0.2%     33.9%
pFPC -1M      13.2%    13.3%      n/a      n/a
GFC           299.4%   398.0%     n/a      n/a
MPC           100.0%   100.0%     100.0%   100.0%

SLIDE 23

Summary

  • Goal of research
  • Create an effective algorithm for FP data compression that is suitable for massively-parallel GPUs
  • Approach
  • Extracted 24 GPU-friendly components and evaluated 138,240 combinations to find the best 4-stage algorithms

  • Generalized findings to derive MPC algorithm
  • Result
  • Brand new compression algorithm for SP and DP data
  • Compresses about as well as CPU algos but much faster

SLIDE 24

Future Work and Acknowledgments

  • Future work
  • Faster implementation, more components, longer chains, and other inputs, data types, and constraints

  • Acknowledgments
  • National Science Foundation
  • NVIDIA Corporation
  • Texas Advanced Computing Center
  • Contact information
  • burtscher@txstate.edu

SLIDE 25

Number of Stages

  • 3 stages reach about 95% of the compression ratio

SLIDE 26

Single- vs Double-Precision Algorithms

data set      double precision         single precision
msg_bt        LNV1s BIT LNV1s ZE |     DIM5 ZE LNV6x | ZE
msg_lu        LNV5s | DIM8 BIT RLE     LNV5s LNV5s LNV5x | ZE
msg_sp        DIM3 LNV5x BIT ZE |      DIM3 LNV5x BIT ZE |
msg_sppm      DIM5 LNV6x ZE | ZE       RLE DIM5 LNV6s ZE |
msg_sweep3d   LNV1s DIM32 | DIM8 RLE   LNV1s DIM32 | DIM4 RLE
num_brain     LNV1s BIT LNV1s ZE |     LNV1s BIT LNV1s ZE |
num_comet     LNV1s BIT LNV1s ZE |     LNV1s | DIM4 BIT RLE
num_control   LNV1s BIT LNV1s ZE |     LNV1s BIT LNV1s ZE |
num_plasma    LNV2s LNV2s LNV2x | ZE   LNV2s LNV2s LNV2x | ZE
bs_error      LNV1x ZE LNV1s ZE |      LNV6s BIT LNV1s ZE |
bs_info       LNV2s | DIM8 BIT RLE     LNV8s DIM2 | DIM4 RLE
bs_spitzer    ZE BIT LNV1s ZE |        ZE BIT LNV1s ZE |
bs_temp       LNV8s BIT LNV1s ZE |     BIT LNV1x DIM32 | RLE
overall best  LNV6s BIT LNV1s ZE |     LNV6s BIT LNV1s ZE |

SLIDE 27

MPC Operation

  • What does “LNVds BIT LNV1s ZE” do?
  • LNVds predicts each value using a similar value to obtain a residual sequence with many small values
  • Similar value = most recent prior value from the same dimension
  • BIT groups residuals by bit position
  • All LSBs, then all second LSBs, etc.
  • LNV1s turns identical consecutive words into zeros
  • ZE eliminates these zero words
  • GPU friendly
  • All four components are massively parallel
  • Can be implemented with prefix scans or simpler
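The four stages can be composed end-to-end in a toy serial Python model. This is not the CUDA implementation (no chunking, no prefix scans), and wrap-around modular subtraction is an assumption of this sketch:

```python
def mpc_sketch(values, d, bits=32):
    """Toy serial model of the MPC pattern "LNVds BIT LNV1s ZE"
    on word-sized integers."""
    mask = (1 << bits) - 1
    # LNVds: residual against the value d positions back (same dimension)
    r = [(v - (values[i - d] if i >= d else 0)) & mask
         for i, v in enumerate(values)]
    # BIT: group residual bits by bit position, repack into words
    stream = [(v >> b) & 1 for b in range(bits - 1, -1, -1) for v in r]
    words = [int(''.join(map(str, stream[i:i + bits])), 2)
             for i in range(0, len(stream), bits)]
    # LNV1s: identical consecutive words become zeros
    w = [(v - (words[i - 1] if i else 0)) & mask for i, v in enumerate(words)]
    # ZE: bitmap of the non-zero words plus the non-zero words themselves
    return [1 if v else 0 for v in w], [v for v in w if v]
```

On a constant 4-bit sequence such as [7, 7, 7, 7] the chain shrinks four words down to a single non-zero word plus a 4-bit bitmap.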
