Compression Algorithms for GPUs, Annie Yang and Martin Burtscher (PowerPoint presentation)

SLIDE 1

Synthesizing Effective Data Compression Algorithms for GPUs

Annie Yang and Martin Burtscher*

Department of Computer Science

SLIDE 2

Highlights

  • MPC compression algorithm
  • Brand-new lossless compression algorithm for single- and double-precision floating-point data

  • Systematically derived to work well on GPUs
  • MPC features
  • Compression ratio is similar to best CPU algorithms
  • Throughput is much higher
  • Requires little internal state (no tables or dictionaries)

Synthesizing Effective Data Compression Algorithms for GPUs 2

SLIDE 3

Introduction

  • High-Performance Computing Systems
  • Depend increasingly on accelerators
  • Process large amounts of floating-point (FP) data
  • Moving this data is often the performance bottleneck
  • Data compression
  • Can increase transfer throughput
  • Can reduce storage requirement
  • But only if effective, fast (real-time), and lossless

[Figure: floating-point format fields: sign (S), exponent, mantissa]

SLIDE 4

Problem Statement

  • Existing FP compression algorithms for GPUs
  • Fast but compress poorly
  • Existing FP compression algorithms for CPUs
  • Compress much better but are slow
  • Parallel codes run serial algorithms on multiple chunks
  • Too much state per thread for a GPU implementation
  • Best serial algos may not be scalably parallelizable
  • Do effective FP compression algos for GPUs exist?
  • And if so, how can we create such an algorithm?

SLIDE 5

Our Approach

  • Need a brand-new massively-parallel algorithm
  • Study existing FP compression algorithms
  • Break them down into constituent parts
  • Only keep GPU-friendly parts
  • Generalize them as much as possible
  • Resulted in 24 algorithmic components
  • CUDA implementation: each component takes a sequence of values as input and outputs the transformed sequence
  • Components operate on the integer representation of the data

[Image credit: Charles Trevelyan for http://plus.maths.org/]

SLIDE 6

Our Approach (cont.)

  • Automatically synthesize compression algorithms by chaining components
  • Use exhaustive search to find the best four-component chains
  • Synthesize decompressor
  • Employ inverse components
  • Perform the opposite transformation on the data
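The chained search can be illustrated with a toy serial sketch in Python. This is not the paper's CUDA pipeline: the two stages and the ZE size proxy below are simplified stand-ins for the actual 24 components, invented here for illustration only.

```python
from itertools import product

def ident(seq):
    """Identity stage (like NUL)."""
    return list(seq)

def lnv1s(seq):
    """Delta stage: subtract the previous value (like LNV1s)."""
    return [v - (seq[i - 1] if i else 0) for i, v in enumerate(seq)]

def ze_size(seq):
    """Size proxy for a final ZE stage: non-zero count plus bitmap words."""
    return sum(1 for v in seq if v) + (len(seq) + 31) // 32

def search_best_chain(stages, data, length=3):
    """Exhaustively try every chain of `length` stages and keep the
    one whose ZE-encoded output is smallest."""
    best_size, best_chain = float('inf'), None
    for chain in product(stages, repeat=length):
        seq = list(data)
        for stage in chain:
            seq = stage(seq)
        size = ze_size(seq)
        if size < best_size:
            best_size, best_chain = size, chain
    return best_chain, best_size
```

On a constant input such as [10, 10, 10, 10] the search settles on a chain containing exactly one delta stage, which zeroes out all but the first value before the ZE proxy is applied.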

SLIDE 7

Mutator Components

  • Mutators computationally transform each value
  • Do not use information about any other value
  • NUL outputs the input block (identity)
  • INV flips all the bits
  • │, called cut, is a singleton pseudo-component that converts a block of words into a block of bytes
  • Merely a type cast, i.e., no computation or data copying
  • Byte granularity can be better for compression
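A serial Python sketch of the mutators (the real components are CUDA kernels operating on machine words; the list representation and the little-endian byte order of the cut are assumptions of this sketch):

```python
def nul(block):
    """NUL: identity -- output the input block unchanged."""
    return list(block)

def inv(block, bits=32):
    """INV: flip all bits of every word."""
    mask = (1 << bits) - 1
    return [v ^ mask for v in block]

def cut(block, bits=32):
    """Cut (|): reinterpret a block of words as a block of bytes.
    In CUDA this is a pure type cast with no data movement; here the
    bytes are materialized explicitly, least significant byte first."""
    return [(v >> (8 * i)) & 0xFF for v in block for i in range(bits // 8)]
```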

SLIDE 8

Shuffler Components

  • Shufflers reorder whole values or bits of values
  • Do not perform any computation
  • Each thread block operates on a chunk of values
  • BIT emits the most significant bits of all values, followed by the second most significant bits, etc.
  • DIMn groups values by dimension n
  • Tested n = 2, 3, 4, 5, 8, 16, and 32
  • For example, DIM2 turns the sequence A, B, C, D, E, F into A, C, E, B, D, F
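The two shufflers admit a compact serial sketch (Python, illustrative only; the real components are parallel CUDA kernels working on per-thread-block chunks):

```python
def dim(block, n):
    """DIMn: group values by dimension -- element i joins group i mod n.
    DIM2 turns A, B, C, D, E, F into A, C, E, B, D, F."""
    return [block[i] for d in range(n) for i in range(d, len(block), n)]

def bit(block, bits=32):
    """BIT: emit the most significant bits of all values, then the
    second most significant bits, etc., repacked into words."""
    stream = [(v >> b) & 1 for b in range(bits - 1, -1, -1) for v in block]
    return [int(''.join(map(str, stream[i:i + bits])), 2)
            for i in range(0, len(stream), bits)]
```

Note that neither function computes on the values; both only reorder bits or whole values, which is what makes these components cheap on a GPU.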

SLIDE 9

Predictor Components

  • Predictors guess values based on previous values and compute residuals (true minus guessed value)
  • Residuals tend to cluster around zero, making them easier to compress than the original sequence
  • Each thread block operates on a chunk of values
  • LNVns subtracts the nth prior value from the current value
  • Tested n = 1, 2, 3, 5, 6, and 8
  • LNVnx XORs the current value with the nth prior value
  • Tested n = 1, 2, 3, 5, 6, and 8
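A serial Python sketch of the two predictor families (the CUDA versions operate per chunk; treating out-of-range prior values as 0 is an assumption of this sketch):

```python
def lnv_s(block, n):
    """LNVns: residual = current value minus the nth prior value
    (values before the start are taken as 0)."""
    return [v - (block[i - n] if i >= n else 0) for i, v in enumerate(block)]

def lnv_x(block, n):
    """LNVnx: residual = current value XOR the nth prior value."""
    return [v ^ (block[i - n] if i >= n else 0) for i, v in enumerate(block)]
```

On slowly varying data, e.g. [100, 101, 103, 104], LNV1s yields [100, 1, 2, 1]: residuals clustered near zero, exactly what the reducers exploit.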

SLIDE 10

Reducer Components

  • Reducers eliminate redundancies in value sequence
  • All other components cannot change the length of the sequence, i.e., only reducers can compress it
  • Each thread block operates on a chunk of values
  • ZE emits a bitmap of the zeros followed by the non-zero values
  • Effective if the input sequence contains many zeros
  • RLE performs run-length encoding, i.e., replaces repeating values by a count and a single value
  • Effective if the input contains many repeating values
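Serial Python sketches of the two reducers (illustrative; the bitmap and run lists would be bit-packed in a real encoder):

```python
def ze(block):
    """ZE: a bitmap marking the non-zero positions, followed by the
    non-zero values themselves."""
    return [1 if v else 0 for v in block], [v for v in block if v]

def rle(block):
    """RLE: replace each run of repeating values by a (count, value) pair."""
    runs = []
    for v in block:
        if runs and runs[-1][1] == v:
            runs[-1][0] += 1          # extend the current run
        else:
            runs.append([1, v])       # start a new run
    return [(c, v) for c, v in runs]
```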

SLIDE 11

Algorithm Synthesis

  • Determine best four-stage algorithms with a cut
  • Exhaustive search of all possible 138,240 combinations
  • 13 double-precision data sets (19 – 277 MB)
  • Observational data, simulation results, MPI messages
  • Single-precision data derived from double-precision data
  • Create general GPU-friendly compression algorithm
  • Analyze best algorithm for each data set and precision
  • Find commonalities and generalize into one algorithm

SLIDE 12

Best of 138,240 Algorithms

data set      double precision         single precision
msg_bt        LNV1s BIT LNV1s ZE |     DIM5 ZE LNV6x | ZE
msg_lu        LNV5s | DIM8 BIT RLE     LNV5s LNV5s LNV5x | ZE
msg_sp        DIM3 LNV5x BIT ZE |      DIM3 LNV5x BIT ZE |
msg_sppm      DIM5 LNV6x ZE | ZE       RLE DIM5 LNV6s ZE |
msg_sweep3d   LNV1s DIM32 | DIM8 RLE   LNV1s DIM32 | DIM4 RLE
num_brain     LNV1s BIT LNV1s ZE |     LNV1s BIT LNV1s ZE |
num_comet     LNV1s BIT LNV1s ZE |     LNV1s | DIM4 BIT RLE
num_control   LNV1s BIT LNV1s ZE |     LNV1s BIT LNV1s ZE |
num_plasma    LNV2s LNV2s LNV2x | ZE   LNV2s LNV2s LNV2x | ZE
bs_error      LNV1x ZE LNV1s ZE |      LNV6s BIT LNV1s ZE |
bs_info       LNV2s | DIM8 BIT RLE     LNV8s DIM2 | DIM4 RLE
bs_spitzer    ZE BIT LNV1s ZE |        ZE BIT LNV1s ZE |
bs_temp       LNV8s BIT LNV1s ZE |     BIT LNV1x DIM32 | RLE
overall best  LNV6s BIT LNV1s ZE |     LNV6s BIT LNV1s ZE |

SLIDE 13

Analysis of Reducers

  • Double-precision results only
  • Single-precision results similar
  • ZE or RLE required at end
  • Not counting the cut (encoder)
  • ZE dominates
  • Many 0s but not in a row
  • First three stages
  • Contain almost no reducers
  • Transformations are key to making the reducer effective
  • Chaining whole compression algorithms may be futile

data set      double precision
msg_bt        LNV1s BIT LNV1s ZE |
msg_lu        LNV5s | DIM8 BIT RLE
msg_sp        DIM3 LNV5x BIT ZE |
msg_sppm      DIM5 LNV6x ZE | ZE
msg_sweep3d   LNV1s DIM32 | DIM8 RLE
num_brain     LNV1s BIT LNV1s ZE |
num_comet     LNV1s BIT LNV1s ZE |
num_control   LNV1s BIT LNV1s ZE |
num_plasma    LNV2s LNV2s LNV2x | ZE
bs_error      LNV1x ZE LNV1s ZE |
bs_info       LNV2s | DIM8 BIT RLE
bs_spitzer    ZE BIT LNV1s ZE |
bs_temp       LNV8s BIT LNV1s ZE |
overall best  LNV6s BIT LNV1s ZE |

SLIDE 14

Analysis of Mutators

  • NUL and INV never used
  • No need to invert bits
  • Fewer stages perform worse
  • Cut often at end (not used)
  • Word granularity suffices
  • Easier/faster to implement
  • DIM8 right after cut
  • DIM4 with single precision
  • Used to separate the byte positions of each word
  • Synthesis yielded an unforeseen use of the DIM component


SLIDE 15

Analysis of Shufflers

  • Shufflers are important
  • Almost always included
  • BIT used very frequently
  • FP bit positions correlate more strongly than the values themselves
  • DIM has two uses
  • Separates bytes (see before) when placed right after the cut
  • Separates values of multi-dim data sets (intended use) in early stages


SLIDE 16

Analysis of Predictors

  • Predictors very important
  • (Data model)
  • Used in every case
  • Often 2 predictors used
  • LNVns dominates LNVnx
  • Arithmetic (sub) difference superior to bit-wise (xor) difference in the residual
  • Dimension n separates values of multi-dim data sets (in 1st stage)


SLIDE 17

Analysis of Overall Best Algorithm

  • Same algo for SP and DP
  • Few components mismatch
  • But LNV6s dim is off
  • Most frequent pattern
  • LNV*s BIT LNV1s ZE
  • Star denotes dimensionality
  • Why 6 in starred position?
  • Not used in individual algos
  • 6 is the least common multiple of 1, 2, and 3
  • Did not test n > 8


SLIDE 18

MPC: Generalization of Overall Best

  • MPC algorithm
  • Massively Parallel Compression
  • Uses generalized pattern
  • “LNVds BIT LNV1s ZE” where d is the data set dimensionality
  • Matches the best algorithm on several DP and SP data sets
  • Performs even better when the true dimensionality is used


SLIDE 19

Evaluation Methodology

  • System
  • Dual 10-core Xeon E5-2680 v2 CPU
  • K40 GPU with 15 SMs (2880 cores)
  • 13 DP and 13 SP real-world data sets
  • Same as before
  • Compression algorithms
  • CPU: bzip2, gzip, lzop, and pFPC
  • GPU: GFC and MPC (our algorithm)

SLIDE 20

Compression Ratio (Double Precision)

  • MPC delivers record compression on 5 data sets
  • In spite of “GPU-friendly components” constraint
  • MPC outperformed by bzip2 and pFPC on average
  • Due to msg_sppm and num_plasma
  • MPC superior to GFC (only other GPU compressor)

data set      bzip2 --best  gzip --best  lzop -9  pFPC -1M  GFC    MPC
HarMean       1.321         1.239        1.158    1.365     1.179  1.248
msg_bt        1.088         1.130        1.052    1.250     1.122  1.207
msg_lu        1.018         1.055        1.000    1.137     1.148  1.212
msg_sp        1.055         1.107        1.003    1.238     1.202  1.208
msg_sppm      6.933         7.431        6.780    4.710     3.506  2.999
msg_sweep3d   1.294         1.092        1.017    1.888     1.217  1.287
num_brain     1.043         1.064        1.000    1.148     1.090  1.182
num_comet     1.173         1.162        1.082    1.151     1.110  1.267
num_control   1.029         1.058        1.017    1.038     1.013  1.106
num_plasma    5.789         1.608        1.503    7.042     1.125  1.164
bs_error      1.339         1.448        1.273    1.542     1.233  1.180
bs_info       1.217         1.154        1.096    1.215     1.141  1.214
bs_spitzer    1.752         1.231        1.142    1.022     1.022  1.184
bs_temp       1.024         1.036        1.000    0.997     1.037  1.101

SLIDE 21

Compression Ratio (Single Precision)

  • MPC delivers record compression on 8 data sets
  • In spite of “GPU-friendly components” constraint
  • MPC is outperformed by bzip2 on average
  • Due to num_plasma
  • MPC is “superior” to GFC and pFPC
  • They do not support single-precision data, MPC does

data set      bzip2 --best  gzip --best  lzop -9  MPC
HarMean       1.398         1.267        1.153    1.350
msg_bt        1.129         1.179        1.075    1.336
msg_lu        1.041         1.086        1.000    1.440
msg_sp        1.141         1.200        1.083    1.385
msg_sppm      8.741         9.605        8.634    3.813
msg_sweep3d   2.355         1.151        1.033    1.534
num_brain     1.113         1.128        1.003    1.344
num_comet     1.117         1.151        1.086    1.178
num_control   1.043         1.080        1.016    1.122
num_plasma    8.652         1.383        1.223    1.345
bs_error      1.338         1.466        1.246    1.298
bs_info       1.327         1.200        1.129    1.436
bs_spitzer    1.394         1.188        1.077    1.047
bs_temp       1.049         1.079        1.000    1.114

SLIDE 22

Throughput (Gigabytes per Second)

  • MPC outperforms all CPU compressors
  • Including pFPC running on two 10-core CPUs by 7.5x
  • MPC slower than GFC but mostly faster than PCIe
  • MPC uses slow O(n log n) prefix scan implementation

Throughput (GB/s):

              double precision    single precision
              compr.   decom.     compr.   decom.
bzip2 --best  0.01     0.02       0.01     0.02
gzip --best   0.02     0.15       0.03     0.15
lzop -9       0.01     1.93       0.01     1.43
pFPC -1M      1.43     1.05       n/a      n/a
GFC           32.28    31.47      n/a      n/a
MPC           10.78    7.91       5.81     4.23

Throughput relative to MPC:

              double precision    single precision
              compr.   decom.     compr.   decom.
bzip2 --best  0.1%     0.3%       0.1%     0.6%
gzip --best   0.2%     1.9%       0.4%     3.5%
lzop -9       0.1%     24.4%      0.2%     33.9%
pFPC -1M      13.2%    13.3%      n/a      n/a
GFC           299.4%   398.0%     n/a      n/a
MPC           100.0%   100.0%     100.0%   100.0%

SLIDE 23

Summary

  • Goal of research
  • Create an effective algorithm for FP data compression that is suitable for massively-parallel GPUs
  • Approach
  • Extracted 24 GPU-friendly components and evaluated 138,240 combinations to find the best 4-stage algorithms

  • Generalized findings to derive MPC algorithm
  • Result
  • Brand new compression algorithm for SP and DP data
  • Compresses about as well as CPU algos but much faster

SLIDE 24

Future Work and Acknowledgments

  • Future work
  • Faster implementation, more components, longer chains, and other inputs, data types, and constraints

  • Acknowledgments
  • National Science Foundation
  • NVIDIA Corporation
  • Texas Advanced Computing Center
  • Contact information
  • burtscher@txstate.edu

SLIDE 25

Number of Stages

  • 3 stages reach about 95% of the compression ratio

SLIDE 26

Single- vs Double-Precision Algorithms

data set      double precision         single precision
msg_bt        LNV1s BIT LNV1s ZE |     DIM5 ZE LNV6x | ZE
msg_lu        LNV5s | DIM8 BIT RLE     LNV5s LNV5s LNV5x | ZE
msg_sp        DIM3 LNV5x BIT ZE |      DIM3 LNV5x BIT ZE |
msg_sppm      DIM5 LNV6x ZE | ZE       RLE DIM5 LNV6s ZE |
msg_sweep3d   LNV1s DIM32 | DIM8 RLE   LNV1s DIM32 | DIM4 RLE
num_brain     LNV1s BIT LNV1s ZE |     LNV1s BIT LNV1s ZE |
num_comet     LNV1s BIT LNV1s ZE |     LNV1s | DIM4 BIT RLE
num_control   LNV1s BIT LNV1s ZE |     LNV1s BIT LNV1s ZE |
num_plasma    LNV2s LNV2s LNV2x | ZE   LNV2s LNV2s LNV2x | ZE
bs_error      LNV1x ZE LNV1s ZE |      LNV6s BIT LNV1s ZE |
bs_info       LNV2s | DIM8 BIT RLE     LNV8s DIM2 | DIM4 RLE
bs_spitzer    ZE BIT LNV1s ZE |        ZE BIT LNV1s ZE |
bs_temp       LNV8s BIT LNV1s ZE |     BIT LNV1x DIM32 | RLE
overall best  LNV6s BIT LNV1s ZE |     LNV6s BIT LNV1s ZE |

SLIDE 27

MPC Operation

  • What does “LNVds BIT LNV1s ZE” do?
  • LNVds predicts each value using a similar value to obtain a residual sequence with many small values
  • Similar value = most recent prior value from the same dimension
  • BIT groups residuals by bit position
  • All LSBs, then all second LSBs, etc.
  • LNV1s turns identical consecutive words into zeros
  • ZE eliminates these zero words
  • GPU friendly
  • All four components are massively parallel
  • Can be implemented with prefix scans or simpler
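The four stages can be composed end-to-end in a toy serial Python model. This is not the CUDA implementation (no chunking, no prefix scans), and wrap-around modular subtraction is an assumption of this sketch:

```python
def mpc_sketch(values, d, bits=32):
    """Toy serial model of the MPC pattern "LNVds BIT LNV1s ZE"
    on word-sized integers."""
    mask = (1 << bits) - 1
    # LNVds: residual against the value d positions back (same dimension)
    r = [(v - (values[i - d] if i >= d else 0)) & mask
         for i, v in enumerate(values)]
    # BIT: group residual bits by bit position, repack into words
    stream = [(v >> b) & 1 for b in range(bits - 1, -1, -1) for v in r]
    words = [int(''.join(map(str, stream[i:i + bits])), 2)
             for i in range(0, len(stream), bits)]
    # LNV1s: identical consecutive words become zeros
    w = [(v - (words[i - 1] if i else 0)) & mask for i, v in enumerate(words)]
    # ZE: bitmap of the non-zero words plus the non-zero words themselves
    return [1 if v else 0 for v in w], [v for v in w if v]
```

On a constant 4-bit sequence such as [7, 7, 7, 7] the chain shrinks four words down to a single non-zero word plus a 4-bit bitmap.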
