1. Synthesizing Effective Data Compression Algorithms for GPUs
   Annie Yang and Martin Burtscher*
   Department of Computer Science

2. Highlights
   - MPC compression algorithm
     - Brand-new lossless compression algorithm for single- and double-precision floating-point data
     - Systematically derived to work well on GPUs
   - MPC features
     - Compression ratio is similar to that of the best CPU algorithms
     - Throughput is much higher
     - Requires little internal state (no tables or dictionaries)

3. Introduction
   - High-performance computing systems
     - Depend increasingly on accelerators
     - Process large amounts of floating-point (FP) data (sign | exponent | mantissa)
     - Moving this data is often the performance bottleneck
   - Data compression
     - Can increase transfer throughput
     - Can reduce storage requirements
     - But only if it is effective, fast (real-time), and lossless

4. Problem Statement
   - Existing FP compression algorithms for GPUs
     - Fast but compress poorly
   - Existing FP compression algorithms for CPUs
     - Compress much better but are slow
     - Parallel codes run serial algorithms on multiple chunks
     - Too much state per thread for a GPU implementation
     - The best serial algorithms may not be scalably parallelizable
   - Do effective FP compression algorithms for GPUs exist?
     - And if so, how can we create such an algorithm?

5. Our Approach
   - Need a brand-new massively parallel algorithm
   - Study existing FP compression algorithms
     - Break them down into their constituent parts
     - Keep only the GPU-friendly parts
     - Generalize them as much as possible
     - This resulted in a set of algorithmic components
   - CUDA implementation: each component takes a sequence of values as input and outputs a transformed sequence
   - Components operate on the integer representation of the data
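The slides state that the components operate on the integer representation of the data. A minimal CPU-side Python sketch of that reinterpretation (the function names are mine, not from the slides; a CUDA kernel would get this for free by casting the buffer pointer):

```python
import struct

def double_to_word(x):
    """Reinterpret a double as its 64-bit integer bit pattern (sign|exponent|mantissa)."""
    return struct.unpack('<Q', struct.pack('<d', x))[0]

def word_to_double(w):
    """Inverse reinterpretation: 64-bit integer pattern back to a double."""
    return struct.unpack('<d', struct.pack('<Q', w))[0]
```

Working on the bit pattern rather than the numeric value is what keeps every component lossless: no rounding can occur because no floating-point arithmetic is performed.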

6. Our Approach (cont.)
   - Automatically synthesize compression algorithms by chaining components
     - Use exhaustive search to find the best four-component chains
   - Synthesize the decompressor
     - Employ inverse components
     - Perform the opposite transformation on the data
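The chaining and inversion idea can be sketched on a toy scale. The two components below are illustrative stand-ins written by me (their names echo components introduced later in the deck, but this is a CPU-side sketch, not the CUDA implementation); the key point is that the decompressor applies the inverse components in reverse order:

```python
MASK = 0xFFFFFFFF  # assume 32-bit words, arithmetic modulo 2^32

def lnv1s(seq):        # forward: subtract the previous value from each value
    return [(v - (seq[i - 1] if i else 0)) & MASK for i, v in enumerate(seq)]

def lnv1s_inv(res):    # inverse: running prefix sum
    out = []
    for r in res:
        out.append((r + (out[-1] if out else 0)) & MASK)
    return out

def dim2(seq):         # forward: A,B,C,D,E,F -> A,C,E,B,D,F
    return seq[0::2] + seq[1::2]

def dim2_inv(seq):     # inverse: interleave the two halves back
    h = (len(seq) + 1) // 2
    out = [0] * len(seq)
    out[0::2], out[1::2] = seq[:h], seq[h:]
    return out

CHAIN = [(lnv1s, lnv1s_inv), (dim2, dim2_inv)]

def compress(seq):
    for fwd, _ in CHAIN:
        seq = fwd(seq)
    return seq

def decompress(seq):
    for _, bwd in reversed(CHAIN):
        seq = bwd(seq)
    return seq
```

Because every component has an exact inverse, the decompressor for any synthesized chain can itself be synthesized mechanically.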

7. Mutator Components
   - Mutators computationally transform each value
     - Do not use information about any other value
   - NUL outputs the input block (identity)
   - INV flips all the bits
   - |, called cut, is a singleton pseudo-component that converts a block of words into a block of bytes
     - Merely a type cast, i.e., no computation or data copying
     - Byte granularity can be better for compression
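A minimal sketch of the three mutators, assuming 32-bit words and little-endian byte order for the cut (both assumptions are mine; in the CUDA implementation the cut is a pointer cast, while here the bytes must be materialized):

```python
def nul(words):
    """NUL: identity, outputs the input block unchanged."""
    return list(words)

def inv(words):
    """INV: flip all 32 bits of every word."""
    return [w ^ 0xFFFFFFFF for w in words]

def cut(words):
    """| (cut): view each 32-bit word as 4 bytes (little-endian assumed)."""
    out = []
    for w in words:
        out += [(w >> (8 * i)) & 0xFF for i in range(4)]
    return out
```

NUL matters for the synthesis search: it lets a four-stage chain behave like a shorter one.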

8. Shuffler Components
   - Shufflers reorder whole values or bits of values
     - Do not perform any computation
     - Each thread block operates on a chunk of values
   - BIT emits the most significant bits of all values, followed by the second most significant bits, etc.
   - DIMn groups values by dimension n
     - Tested n = 2, 3, 4, 5, 8, 16, and 32
     - For example, DIM2 turns the sequence A, B, C, D, E, F into A, C, E, B, D, F
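The two shufflers can be sketched as below; how BIT repacks the collected bits back into words is not specified on the slide, so the MSB-first repacking here is my assumption:

```python
def dim(words, n):
    """DIMn: group values by position modulo n.
    E.g. A,B,C,D,E,F with n=2 -> A,C,E,B,D,F."""
    return [w for r in range(n) for w in words[r::n]]

def bit(words, width=32):
    """BIT: emit the most significant bit of every word, then the
    second most significant bit of every word, etc., repacked
    MSB-first into words of the same width (repacking assumed)."""
    bits = [(w >> (width - 1 - b)) & 1 for b in range(width) for w in words]
    out = []
    for i in range(0, len(bits), width):
        w = 0
        for b in bits[i:i + width]:
            w = (w << 1) | b
        out.append(w)
    return out
```

Both are pure permutations, which is why the deck notes that shufflers perform no computation: they only move correlated bits or values next to each other.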

9. Predictor Components
   - Predictors guess values based on previous values and compute residuals (true value minus guessed value)
     - Residuals tend to cluster around zero, making them easier to compress than the original sequence
     - Each thread block operates on a chunk of values
   - LNVns subtracts the n-th prior value from the current value
     - Tested n = 1, 2, 3, 5, 6, and 8
   - LNVnx XORs the current value with the n-th prior value
     - Tested n = 1, 2, 3, 5, 6, and 8
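A sketch of both predictor families, assuming 32-bit integer representations and modulo-2^32 arithmetic so that the subtraction is exactly invertible (the function names are mine):

```python
MASK = 0xFFFFFFFF  # 32-bit words assumed

def lnv_s(words, n):
    """LNVns: residual = value minus the n-th prior value (first n values kept)."""
    return [(w - (words[i - n] if i >= n else 0)) & MASK
            for i, w in enumerate(words)]

def lnv_s_inv(res, n):
    """Inverse of LNVns: rebuild each value from the already-decoded prior."""
    out = []
    for i, r in enumerate(res):
        out.append((r + (out[i - n] if i >= n else 0)) & MASK)
    return out

def lnv_x(words, n):
    """LNVnx: residual = value XOR the n-th prior value."""
    return [w ^ (words[i - n] if i >= n else 0) for i, w in enumerate(words)]
```

When neighboring values are similar, both residual streams are dominated by values near zero (subtraction) or with many zero bits (XOR), which is exactly what the reducers exploit.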

10. Reducer Components
   - Reducers eliminate redundancies in the value sequence
     - All other components cannot change the length of the sequence, i.e., only reducers can compress it
     - Each thread block operates on a chunk of values
   - ZE emits a bitmap of the 0s followed by the non-zero values
     - Effective if the input sequence contains many zeros
   - RLE performs run-length encoding, i.e., replaces repeating values by a count and a single value
     - Effective if the input contains many repeating values
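The two reducers can be sketched as follows. I assume ZE packs one flag bit per value into 32-bit bitmap words, and that RLE emits flat (count, value) pairs; the slide does not fix either output layout:

```python
def ze(words):
    """ZE: packed bitmap marking non-zero positions, then the non-zero values."""
    bitmap = []
    for i, v in enumerate(words):
        if i % 32 == 0:
            bitmap.append(0)          # start a new 32-bit bitmap word
        if v:
            bitmap[-1] |= 1 << (i % 32)
    return bitmap + [v for v in words if v]

def rle(words):
    """RLE: replace each run of repeating values by (count, value)."""
    runs = []
    for w in words:
        if runs and runs[-1][1] == w:
            runs[-1][0] += 1          # extend the current run
        else:
            runs.append([1, w])       # start a new run
    return [x for pair in runs for x in pair]
```

ZE shrinks the sequence whenever more than about 1/32 of the values are zero, even when the zeros are scattered; RLE only wins when equal values sit in runs, which matches the deck's later analysis of when each reducer is chosen.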

11. Algorithm Synthesis
   - Determine the best four-stage algorithms with a cut
     - Exhaustive search of all 138,240 possible combinations
   - 13 double-precision data sets (19 - 277 MB)
     - Observational data, simulation results, MPI messages
     - Single-precision data derived from the double-precision data
   - Create a general GPU-friendly compression algorithm
     - Analyze the best algorithm for each data set and precision
     - Find commonalities and generalize them into one algorithm
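The exhaustive search can be sketched on a toy scale. The three components and the name best_chain below are hypothetical stand-ins I chose for illustration, and output length in words stands in for compressed size; the real search ranks 138,240 four-stage chains by actual compression ratio:

```python
from itertools import product

def nul(s):
    return list(s)                                   # identity stage

def delta(s):                                        # LNV1s-style predictor
    return [(v - (s[i - 1] if i else 0)) & 0xFFFFFFFF for i, v in enumerate(s)]

def ze(s):                                           # ZE-style reducer
    bitmap = []
    for i, v in enumerate(s):
        if i % 32 == 0:
            bitmap.append(0)
        if v:
            bitmap[-1] |= 1 << (i % 32)
    return bitmap + [v for v in s if v]

COMPONENTS = {'NUL': nul, 'LNV1s': delta, 'ZE': ze}

def best_chain(data, stages=2):
    """Try every chain of the given length; keep the shortest output."""
    best = None
    for names in product(COMPONENTS, repeat=stages):
        out = data
        for name in names:
            out = COMPONENTS[name](out)
        if best is None or len(out) < len(best[1]):
            best = (names, out)
    return best[0]
```

On a constant sequence the search correctly discovers that a predictor must run before the reducer: delta turns the repeats into zeros, which ZE then eliminates.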

12. Best of 138,240 Algorithms

    data set       double precision          single precision
    msg_bt         LNV1s BIT LNV1s ZE |      DIM5 ZE LNV6x | ZE
    msg_lu         LNV5s | DIM8 BIT RLE      LNV5s LNV5s LNV5x | ZE
    msg_sp         DIM3 LNV5x BIT ZE |       DIM3 LNV5x BIT ZE |
    msg_sppm       DIM5 LNV6x ZE | ZE        RLE DIM5 LNV6s ZE |
    msg_sweep3d    LNV1s DIM32 | DIM8 RLE    LNV1s DIM32 | DIM4 RLE
    num_brain      LNV1s BIT LNV1s ZE |      LNV1s BIT LNV1s ZE |
    num_comet      LNV1s BIT LNV1s ZE |      LNV1s | DIM4 BIT RLE
    num_control    LNV1s BIT LNV1s ZE |      LNV1s BIT LNV1s ZE |
    num_plasma     LNV2s LNV2s LNV2x | ZE    LNV2s LNV2s LNV2x | ZE
    obs_error      LNV1x ZE LNV1s ZE |       LNV6s BIT LNV1s ZE |
    obs_info       LNV2s | DIM8 BIT RLE      LNV8s DIM2 | DIM4 RLE
    obs_spitzer    ZE BIT LNV1s ZE |         ZE BIT LNV1s ZE |
    obs_temp       LNV8s BIT LNV1s ZE |      BIT LNV1x DIM32 | RLE
    overall best   LNV6s BIT LNV1s ZE |      LNV6s BIT LNV1s ZE |

13. Analysis of Reducers
    - Double-precision results shown (see the table on slide 12); single-precision results are similar
    - A ZE or RLE is required at the end of the encoder (not counting the cut)
    - ZE dominates
      - The transformed data contain many 0s, but not in a row
    - The first three stages contain almost no reducers
      - The transformations are key to making the reducer effective
      - Chaining whole compression algorithms may be futile

14. Analysis of Mutators
    - NUL and INV are never used
      - No need to invert the bits
      - Fewer stages perform worse
    - The cut is often at the end (i.e., not used)
      - Word granularity suffices
      - Easier/faster to implement
    - DIM8 appears right after the cut (DIM4 with single precision)
      - Used to separate the byte positions of each word
    - Synthesis yielded an unforeseen use of the DIM component

15. Analysis of Shufflers
    - Shufflers are important
      - Almost always included
    - BIT is used very frequently
      - FP bit positions correlate more strongly than whole values
    - DIM has two uses
      - Separating bytes (see before): right after the cut
      - Separating the values of multi-dimensional data sets (the intended use): in early stages

16. Analysis of Predictors
    - Predictors are very important (the data model)
      - Used in every case
      - Often two predictors are used
    - LNVns dominates LNVnx
      - The arithmetic (subtraction) difference is superior to the bit-wise (XOR) difference in the residual
    - Dimension n
      - Separates the values of multi-dimensional data sets (in the 1st stage)
