SLIDE 1

Huffman Coding with Gap Arrays for GPU Acceleration

Naoya Yamamoto, Koji Nakano, Yasuaki Ito, Daisuke Takafuji (Hiroshima University); Akihiko Kasagi, Tsuguchika Tabaru (Fujitsu Laboratories)

ICPP 2020

SLIDE 2

Huffman coding

  • Lossless data compression scheme
  • Used in many data compression formats: gzip, zip, png, jpg, etc.
  • Uses a codebook: a mapping of fixed-length (usually 8-bit) symbols into variable-length codewords.
  • Entropy coding: symbols that appear more frequently are assigned codewords with fewer bits.
  • Prefix code: no codeword is a prefix of any other codeword.
  • Huffman encoding converts each symbol to the corresponding codeword: parallel encoding is easy.
  • Huffman decoding reads the codeword sequence from the beginning, (1) identifying each codeword and (2) converting it into the corresponding symbol (a small sequential sketch follows the example below).
  • Parallel Huffman decoding is hard:
  • The codeword sequence has no separator to identify codewords.
  • It is not possible to start decoding from the middle of the codeword sequence.
  • Parallel divide-and-conquer approaches that decode every equal-sized partitioned segment independently do not decode correctly: a codeword may be incomplete and split across two segments.

Example codebook and encoding:

  symbol     A    B    C    D     E
  codeword   00   01   10   110   111

  symbol sequence:    A B D E A B D C B C B D C E
  codeword sequence:  000111011100011101001100111010111
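To make the encode/decode asymmetry above concrete, here is a minimal sequential sketch using the example codebook (host-side C++, compilable as CUDA host code). The function names and the string-of-'0'/'1' bit representation are illustrative assumptions, not the paper's implementation.

```cuda
#include <cstdio>
#include <map>
#include <string>

// Example codebook from the slide: A=00, B=01, C=10, D=110, E=111.
static const std::map<char, std::string> kCodebook = {
    {'A', "00"}, {'B', "01"}, {'C', "10"}, {'D', "110"}, {'E', "111"}};

// Encoding: replace every symbol by its codeword (each symbol can be handled
// independently, which is why parallel encoding is easy).
std::string encode(const std::string& symbols) {
  std::string bits;
  for (char s : symbols) bits += kCodebook.at(s);
  return bits;
}

// Decoding: the codeword sequence has no separators, so codewords must be
// identified one after another starting from the beginning.
std::string decode(const std::string& bits) {
  std::string symbols, cur;
  for (char b : bits) {
    cur += b;
    for (const auto& kv : kCodebook)   // prefix property: at most one codeword matches
      if (kv.second == cur) { symbols += kv.first; cur.clear(); break; }
  }
  return symbols;
}

int main() {
  std::string bits = encode("ABDEABDCBCBDCE");
  std::printf("%s\n", bits.c_str());           // 000111011100011101001100111010111
  std::printf("%s\n", decode(bits).c_str());   // ABDEABDCBCBDCE
  return 0;
}
```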

SLIDE 3

Parallel GPU decoding by self-synchronization

  • Self-synchronization of Huffman decoding [3]
  • Decoding started from a middle bit will eventually synchronize with the correct decoding.
  • Decoding is correct after synchronization.
  • The expected length for self-synchronization is 73 [16].
  • Decoding may never synchronize in the worst case.
  • Parallel GPU decoding by self-synchronization [29,30]
  • The codeword sequence is partitioned into equal-sized segments.
  • Each thread is assigned to a segment and starts decoding from it (see the sketch after the example below).
  • It continues decoding the following segments until it finds a synchronization point.
  • Drawbacks
  • Every segment is decoded two or more times.
  • In the worst case, thread 0 must decode all segments.


Example: decoding 000111011100011101001100111010111 from the beginning gives A B D E A B D C B C B D C E; decoding from the 8th bit gives D A E B A D B D C E, which is wrong until the two decodings reach a common codeword boundary (the synchronization point).

[3] T. Ferguson and J. Rabinowitz. 1984. Self-synchronizing Huffman codes. IEEE Trans. on Information Theory 30, 4 (July 1984), 687–693.
[16] S. T. Klein and Y. Wiseman. 2003. Parallel Huffman Decoding with Applications to JPEG Files. Comput. J. 46, 5 (Jan. 2003), 487–497.
[29] André Weissenberger. 2018. CUHD - A Massively Parallel Huffman Decoder. https://github.com/weissenberger/gpuhd.
[30] André Weissenberger and Bertil Schmidt. 2018. Massively Parallel Huffman Decoding on GPUs. In Proc. of International Conference on Parallel Processing, 1–10.

Figure: the codeword sequence is partitioned into segments 0–4; threads 0, 1, and 2 start decoding at segment boundaries and continue past their own segment until they reach a synchronization point.
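A small host-side sketch of the self-synchronization effect on the running example: it decodes the same bit string from bit 0 and from bit 8 and reports the first codeword boundary the two decodings share (bit 23 for this input). The helper name and the string-based bit representation are illustrative assumptions.

```cuda
#include <cstdio>
#include <set>
#include <string>

// Record the bit position at which every identified codeword ends when decoding
// starts at bit `start` (codebook: A=00, B=01, C=10, D=110, E=111).
std::set<size_t> codewordEnds(const std::string& bits, size_t start) {
  const std::string cw[] = {"00", "01", "10", "110", "111"};
  std::set<size_t> ends;
  size_t pos = start;
  while (pos < bits.size()) {
    bool matched = false;
    for (const std::string& c : cw)
      if (bits.compare(pos, c.size(), c) == 0) {
        pos += c.size(); ends.insert(pos); matched = true; break;
      }
    if (!matched) break;               // incomplete codeword at the end of the stream
  }
  return ends;
}

int main() {
  std::string bits = "000111011100011101001100111010111";
  std::set<size_t> correct = codewordEnds(bits, 0);   // A B D E A B D C B C B D C E
  std::set<size_t> shifted = codewordEnds(bits, 8);   // D A E B A D B D C E (wrong start)
  for (size_t p : shifted)                            // first boundary both decodings share
    if (correct.count(p)) { std::printf("synchronization point: bit %zu\n", p); break; }
  return 0;
}
```

Each GPU thread in the self-synchronization approach [29,30] does essentially this: it decodes past the end of its own segment until it reaches a boundary that the correct decoding also uses.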

SLIDE 4

Our contribution

  • First contribution: Present a gap array, a new data structure for accelerating parallel decoding
  • A gap value records the bit position of the first complete codeword in each segment.
  • Gap values are computed and attached to the codeword sequence when encoding is performed.
  • The gap array is very small: an array of 4-bit values, one per segment (a small sketch of deriving and packing the gap values follows the example below).
  • The size overhead is less than 1.5% for 256-bit segments.
  • The time overhead for GPU encoding is less than 20%.
  • Gap arrays accelerate GPU decoding: 1.67x − 6450x faster.
  • Second contribution: Develop several acceleration techniques for Huffman encoding/decoding
1. Single Kernel Soft Synchronization (SKSS) technique [9]
  • Only one kernel call is performed.
  • Kernel call and global memory access overhead can be reduced.
2. Wordwise global memory access
  • Four 8-bit symbols (32 bits) are read/written by one instruction.
3. Compact codebook: a new data structure for Huffman codebooks
  • The codebook size can be 64 Kbytes: too large to store in GPU shared memory.
  • The size is reduced to less than 3 Kbytes: small enough to store in GPU shared memory.

  • Experimental results for a data set of 10 files
  • Our GPU encoding/decoding is 2.87x−7.70x and 1.26x−2.63x faster than previously presented GPU implementations.
  • If a gap array is available, our GPU decoding is 1.67x−6450x faster.


Example (8-bit segments): codeword sequence 00011101 | 11000111 | 01001100 | 111010111 for the symbol sequence A B D E A B D C B C B D C E; gap array 2 1 1 (the first complete codeword of segments 1, 2, and 3 begins 2, 1, and 1 bits after the segment start, respectively); the segments are decoded in parallel.

[9] Shunji Funasaka, Koji Nakano, and Yasuaki Ito. 2017. Single Kernel Soft Synchronization Technique for Task Arrays on CUDA-enabled GPUs, with Applications. In Proc. of International Symposium on Networking and Computing, 11–20.
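A minimal host-side sketch of how the gap values of this example could be derived from the prefix-sums of the codeword lengths and packed as 4-bit values. The function name, the packing order, and the assumption that codewords are at most 16 bits long (so every gap fits in 4 bits) are illustrative, not taken from the paper.

```cuda
#include <cstdint>
#include <cstdio>
#include <vector>

// Derive the gap of each segment from the prefix-sums of codeword lengths and
// pack two 4-bit gaps per byte. Segment 0 needs no gap: it always starts at a
// codeword boundary. Assumes codewords are at most 16 bits long.
std::vector<uint8_t> buildGapArray(const std::vector<int>& prefixSums, int segmentBits) {
  int totalBits = prefixSums.back();
  int numSegments = totalBits / segmentBits;        // last segment absorbs the remainder
  std::vector<uint8_t> packed((numSegments + 1) / 2, 0);
  size_t j = 0;
  for (int i = 1; i < numSegments; ++i) {
    int segStart = i * segmentBits;
    while (prefixSums[j] < segStart) ++j;           // first codeword boundary >= segStart
    uint8_t gap = static_cast<uint8_t>(prefixSums[j] - segStart);   // fits in 4 bits
    packed[(i - 1) / 2] |= ((i - 1) % 2 == 0) ? gap : (gap << 4);
  }
  return packed;
}

int main() {
  // Prefix-sums of the codeword lengths for the running example (see slide 5).
  std::vector<int> ps = {2, 4, 7, 10, 12, 14, 17, 19, 21, 23, 25, 28, 30, 33};
  std::vector<uint8_t> g = buildGapArray(ps, 8);    // 8-bit segments as in the figure
  std::printf("gaps: %d %d %d\n", g[0] & 0xF, g[0] >> 4, g[1] & 0xF);   // 2 1 1
  return 0;
}
```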

SLIDE 5

GPU Huffman encoding with a gap array

  • Naive parallel GPU encoding
  • Kernel 1: the prefix-sums of the codeword lengths (in bits) are computed.
  • The bit position of the codeword corresponding to each symbol can be determined from the prefix-sums.
  • Kernel 2: the codeword corresponding to each symbol is written.
  • Gap arrays can be written if necessary.
  • Both Kernels 1 and 2 perform global memory access (a much-simplified sketch of this two-kernel approach follows the example below).
  • GPU encoding by the Single Kernel Soft Synchronization (SKSS) technique
  • Only one kernel call is performed.
  • Global memory access is reduced.
  • The symbol sequence is partitioned into equal-sized segments.
  • Each CUDA block i (this number is assigned by a global counter) works on encoding segment i.
  • The prefix-sums for each segment i are computed by looking back at the previous CUDA blocks.

Example:

  symbol sequence:    A  B  D  E  A  B  D  C  B  C  B  D  C  E
  codeword bits:      2  2  3  3  2  2  3  2  2  2  2  3  2  3
  prefix-sums:        2  4  7  10 12 14 17 19 21 23 25 28 30 33
  codeword sequence:  00011101 11000111 01001100 111010111
  gap array:          2 1 1

In the SKSS version, CUDA blocks 0–3 each encode one segment of the symbol sequence and obtain their prefix-sums by looking back at previous blocks.
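The following is a much-simplified CUDA sketch of the naive two-kernel encoder (codeword-length kernel, prefix-sums, codeword-writing kernel). It omits codebook generation, gap-array output, and the single-kernel SKSS look-back; the names, the constant-memory codebook layout, and the bit-by-bit atomicOr writes are illustrative assumptions, not the paper's implementation.

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>

__constant__ uint32_t d_code[256];   // codeword bits, right-aligned, indexed by symbol
__constant__ uint8_t  d_len[256];    // codeword length in bits

// Kernel 1: one thread per symbol records its codeword length.
__global__ void codewordLengths(const uint8_t* sym, int n, uint32_t* len) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) len[i] = d_len[sym[i]];
}

// Kernel 2: given the exclusive prefix-sums of the lengths (= bit position of
// every codeword), each thread ORs its codeword into the output, MSB-first
// within each 32-bit word.
__global__ void writeCodewords(const uint8_t* sym, const uint32_t* bitPos, int n,
                               uint32_t* out) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  uint32_t code = d_code[sym[i]];
  uint8_t  len  = d_len[sym[i]];
  for (int b = 0; b < len; ++b) {                        // bit-by-bit for clarity, not speed
    uint32_t p = bitPos[i] + b;
    if ((code >> (len - 1 - b)) & 1u)
      atomicOr(&out[p / 32], 1u << (31 - p % 32));
  }
}

int main() {
  // Codebook: A=00, B=01, C=10, D=110, E=111 (indexed by ASCII value).
  uint32_t code[256] = {}; uint8_t len[256] = {};
  code['A'] = 0b00;  len['A'] = 2;  code['B'] = 0b01;  len['B'] = 2;
  code['C'] = 0b10;  len['C'] = 2;  code['D'] = 0b110; len['D'] = 3;
  code['E'] = 0b111; len['E'] = 3;
  cudaMemcpyToSymbol(d_code, code, sizeof(code));
  cudaMemcpyToSymbol(d_len, len, sizeof(len));

  const char* s = "ABDEABDCBCBDCE";
  int n = 14, threads = 256, blocks = (n + threads - 1) / threads;
  thrust::device_vector<uint8_t> sym((const uint8_t*)s, (const uint8_t*)s + n);
  thrust::device_vector<uint32_t> lens(n), pos(n);
  codewordLengths<<<blocks, threads>>>(thrust::raw_pointer_cast(sym.data()), n,
                                       thrust::raw_pointer_cast(lens.data()));
  thrust::exclusive_scan(lens.begin(), lens.end(), pos.begin());   // bit positions
  thrust::device_vector<uint32_t> out(2, 0);                       // 33 bits -> 2 words
  writeCodewords<<<blocks, threads>>>(thrust::raw_pointer_cast(sym.data()),
                                      thrust::raw_pointer_cast(pos.data()), n,
                                      thrust::raw_pointer_cast(out.data()));
  uint32_t h[2];
  cudaMemcpy(h, thrust::raw_pointer_cast(out.data()), sizeof(h), cudaMemcpyDeviceToHost);
  std::printf("%08x %08x\n", h[0], h[1]);                          // packed codeword sequence
  return 0;
}
```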

SLIDE 6

GPU Huffman decoding with a gap array

  • SKSS technique:
  • The codeword sequence is partitioned into equal-sized segments, and the gap value of each segment is available.
  • Each CUDA block i (this number is assigned by a global counter) works on decoding segment i.
  • Since the gap value is available, each CUDA block can start decoding from the first complete codeword (a simplified decoding kernel sketch follows the figure below).
  • Similarly to GPU Huffman encoding, the prefix-sums of the number of symbols in each segment are computed by the SKSS.
  • From the prefix-sums, each CUDA block can determine the position in the symbol sequence where it writes the decoded symbols.
  • Compact codebook:
  • A 64-Kbyte codebook is separated into several small codebooks.
  • Primary codebook: stores codewords with no more than 11 bits.
  • Secondary codebooks: store codewords with more than 11 bits.
  • The total size is less than 3 Kbytes.
  • Wordwise memory access:
  • 4 symbols are written as one 32-bit word.
  • Global memory access throughput can be improved.


Example: codeword sequence 00011101 | 11000111 | 01001100 | 111010111 with gap array 2 1 1; CUDA blocks 0–3 decode segments 0–3 in parallel into the symbol sequence A B D E A B D C B C B D C E, using the primary and secondary codebooks held in shared memory.
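A much-simplified decoding kernel sketch along the lines described above: one thread (rather than one CUDA block) per segment, a toy 3-bit primary table for the slide-2 codebook instead of the paper's 11-bit primary/secondary compact codebook, and gap values packed two per byte as in the gap-array sketch earlier. Output offsets are assumed to be precomputed (in the real scheme they come from the SKSS prefix-sums of per-segment symbol counts) and 4-byte aligned so the wordwise stores are legal; all names are hypothetical.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Toy primary table for the example codebook, indexed by the next 3 bits:
// 000,001 -> A(2)   010,011 -> B(2)   100,101 -> C(2)   110 -> D(3)   111 -> E(3)
__constant__ uint8_t d_sym[8];    // decoded symbol for each 3-bit index
__constant__ uint8_t d_bits[8];   // codeword length consumed for that index

// Read k bits (MSB-first within each byte) starting at bit position `pos`.
// Assumes the stream buffer is padded so reading a few bits past the end is safe.
__device__ __forceinline__ uint32_t readBits(const uint8_t* s, uint32_t pos, int k) {
  uint32_t v = 0;
  for (int i = 0; i < k; ++i)
    v = (v << 1) | ((s[(pos + i) / 8] >> (7 - (pos + i) % 8)) & 1u);
  return v;
}

__global__ void decodeSegments(const uint8_t* stream, uint32_t totalBits,
                               const uint8_t* gaps,        // packed 4-bit gap array
                               const uint32_t* outOffset,  // first output index per segment
                               uint32_t segBits, uint32_t numSegments, uint8_t* out) {
  uint32_t seg = blockIdx.x * blockDim.x + threadIdx.x;
  if (seg >= numSegments) return;
  uint32_t gap = 0;                  // segment 0 always starts at a codeword boundary
  if (seg > 0) {
    uint8_t byte = gaps[(seg - 1) / 2];
    gap = ((seg - 1) % 2 == 0) ? (byte & 0xF) : (byte >> 4);
  }
  uint32_t pos = seg * segBits + gap;                       // first complete codeword
  uint32_t end = (seg + 1 == numSegments) ? totalBits : (seg + 1) * segBits;
  uint32_t o = outOffset[seg], word = 0;
  int packed = 0;
  while (pos < end) {                // decode every codeword that starts in this segment
    uint32_t idx = readBits(stream, pos, 3);
    word |= (uint32_t)d_sym[idx] << (8 * packed);           // little-endian packing
    pos += d_bits[idx];
    if (++packed == 4) {                                    // wordwise write: 4 symbols at once
      *reinterpret_cast<uint32_t*>(&out[o]) = word;         // assumes o is a multiple of 4
      o += 4; packed = 0; word = 0;
    }
  }
  for (int i = 0; i < packed; ++i)                          // leftover symbols, bytewise
    out[o++] = (word >> (8 * i)) & 0xFF;
}
```

A production kernel would additionally have to handle output ranges that do not start on word boundaries and would keep the compact codebook in shared memory, as the slide notes.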

SLIDE 7

Experimental results: Data set of 10 files


NOGAP: original Huffman code with no gap array. Compression ratio = compressed size / uncompressed size.

GAP: Huffman code with gap arrays for 256-bit segments.

malicious: a text whose codeword sequence never self-synchronizes until the end; the figure shows two threads decoding the same repeating bit pattern 010010010… to different symbol sequences (… C B A C B A … vs. … B A C B A C …) without ever reaching a common codeword boundary.

Compression ratio

file       type      contents                                   size (Mbyte)   NOGAP     GAP       GAP overhead
bible      text      Collection of sacred texts or scriptures       4.047      54.82%    55.67%    +0.86%
enwiki     xml       Wikipedia dump file                         1095.488      68.30%    69.37%    +1.07%
mozilla    exe       Tarred executables of Mozilla                 51.220      78.05%    79.27%    +1.22%
mr         image     Medical magnetic resonance image               9.971      46.37%    47.10%    +0.72%
nci        database  Chemical database of structures               33.553      30.47%    30.95%    +0.48%
prime      text      50th Mersenne prime                           23.714      44.12%    44.80%    +0.69%
sao        bin       The SAO star catalog                           7.252      94.37%    95.85%    +1.47%
webster    html      The 1913 Webster Unabridged Dictionary        41.459      62.54%    63.52%    +0.98%
linux      src       Linux kernel 5.2.4                           871.352      70.23%    71.32%    +1.10%
malicious  text      Never self-synchronizes until the end       1073.742      25.00%    25.39%    +0.39%

Size overhead of GAP over NOGAP: +0.39% − +1.47%

SLIDE 8

Experimental results: GPU Huffman encoding

[4] Antonio Fuentes-Alventosa, Juan Gómez-Luna, José M. González-Linares, and Nicolás Guil. 2014. CUVLE: Variable-Length Encoding on CUDA. In Proc. of Conference on Design and Architectures for Signal and Image Processing, 1–6.
[25] Habibelahi Rahmani, Cihan Topal, and Cuneyt Akinlar. 2014. A parallel Huffman coder on the CUDA architecture. In Proc. of IEEE Visual Communications and Image Processing Conference, 311–314.

Running time: Nvidia Tesla V100

Our encoding is shown without and with gap arrays; the speedups are for our encoding with no gap array over NAIVE [25] and CUVLE [4], and the last column is the time overhead of writing gap arrays.

file       NAIVE [25]  CUVLE [4]   Ours, no gap  Speedup vs NAIVE  Speedup vs CUVLE  Ours, gap arrays  Gap array overhead
bible      0.747 ms    0.180 ms    0.0605 ms     12.35x            2.98x             0.0716 ms         +18.35%
enwiki     70.8 ms     37.7 ms     6.53 ms       10.84x            5.77x             7.05 ms           +7.96%
mozilla    4.55 ms     1.97 ms     0.451 ms      10.09x            4.37x             0.495 ms          +9.76%
mr         1.11 ms     0.407 ms    0.119 ms      9.33x             3.42x             0.134 ms          +12.61%
nci        2.00 ms     1.31 ms     0.339 ms      5.90x             3.86x             0.365 ms          +7.67%
prime      1.52 ms     0.926 ms    0.175 ms      8.69x             5.29x             0.193 ms          +10.29%
sao        1.21 ms     0.307 ms    0.107 ms      11.31x            2.87x             0.123 ms          +14.95%
webster    3.27 ms     1.62 ms     0.303 ms      10.79x            5.35x             0.332 ms          +9.57%
linux      55.0 ms     30.0 ms     5.59 ms       9.84x             5.37x             6.05 ms           +8.23%
malicious  36.0 ms     36.9 ms     4.79 ms       7.52x             7.70x             4.98 ms           +3.97%

Summary: our encoding with no gap array is 5.90x − 12.35x faster than NAIVE [25] and 2.87x − 7.70x faster than CUVLE [4]; writing gap arrays adds +3.97% − +18.35% to the encoding time.

SLIDE 9

Experimental results: GPU Huffman decoding

[29] André Weissenberger. 2018. CUHD - A Massively Parallel Huffman Decoder. https://github.com/weissenberger/gpuhd.
[30] André Weissenberger and Bertil Schmidt. 2018. Massively Parallel Huffman Decoding on GPUs. In Proc. of International Conference on Parallel Processing, 1–10.

Running time: Nvidia Tesla V100

CUHD [29,30] and our decoding with no gap array decode by self-synchronization; our decoding with gap arrays uses the gap values instead.

file       CUHD [29,30]  Ours, no gap  Speedup vs CUHD  Ours, gap arrays  Speedup vs CUHD  Speedup vs ours, no gap
bible      0.331 ms      0.205 ms      1.61x            0.0682 ms         4.85x            3.01x
enwiki     40.3 ms       22.3 ms       1.81x            10.5 ms           3.84x            2.12x
mozilla    3.67 ms       2.74 ms       1.34x            0.674 ms          5.45x            4.07x
mr         0.64 ms       0.461 ms      1.39x            0.261 ms          2.45x            1.77x
nci        1.90 ms       0.923 ms      2.06x            0.552 ms          3.44x            1.67x
prime      1.67 ms       0.636 ms      2.63x            0.280 ms          5.96x            2.27x
sao        0.472 ms      0.278 ms      1.70x            0.120 ms          3.93x            2.32x
webster    1.76 ms       0.906 ms      1.94x            0.488 ms          3.61x            1.86x
linux      34.6 ms       21.3 ms       1.62x            9.04 ms           3.83x            2.36x
malicious  106000 ms     60000 ms      1.77x            9.30 ms           11400x           6450x

Summary: our decoding with no gap array (wordwise global memory access + compact codebook) is 1.34x − 2.63x faster than CUHD [29,30]; our decoding with gap arrays is 2.45x − 5.96x faster than CUHD for the 9 ordinary files (11400x for malicious), and 1.67x − 4.07x faster than our own decoding with no gap array (6450x for malicious).

SLIDE 10

Huffman coding with gap arrays: CPU vs. GPU


Figure: CPU encoding/decoding handles the codeword sequence with no gap array entirely in CPU memory; GPU encoding/decoding transfers the data between CPU memory and GPU global memory, and the codeword sequence carries a gap array (2 1 1 in the example).

The times include all necessary operations:

  • Computing symbol frequencies by histogramming
  • Codebook generation
  • Data transfer between CPU and GPU

CPU: Intel Xeon Silver 4112 (2.60 GHz), GPU: Nvidia Tesla V100

            Huffman encoding                   Huffman decoding
file        CPU        GPU        Speedup     CPU        GPU         Speedup
bible       47.0 ms    1.20 ms    39.2x       25.9 ms    0.598 ms    43.3x
enwiki      3500 ms    158 ms     22.2x       5930 ms    159 ms      37.2x
mozilla     313 ms     8.67 ms    36.1x       308 ms     7.95 ms     38.7x
mr          67.0 ms    2.05 ms    32.7x       52.9 ms    1.50 ms     35.2x
nci         177 ms     5.50 ms    32.2x       170 ms     4.48 ms     37.9x
prime       80.0 ms    4.27 ms    18.7x       160 ms     3.06 ms     52.2x
sao         75.2 ms    3.15 ms    23.9x       49.3 ms    1.28 ms     38.4x
webster     174 ms     7.31 ms    23.8x       248 ms     5.94 ms     41.7x
linux       3130 ms    128 ms     24.5x       4890 ms    128 ms      38.3x
malicious   2250 ms    117 ms     19.2x       4500 ms    119 ms      37.8x

Running time, CPU with no gap array vs. GPU with gap arrays — encoding: 18.7x − 39.2x, decoding: 35.2x − 52.3x.

SLIDE 11

Conclusion

  • We have presented the gap array, a new data structure for accelerating Huffman decoding on GPUs.
  • We have also presented several acceleration techniques for Huffman encoding/decoding on GPUs.
  • The size overhead of gap arrays is small: +0.39% − +1.47%.
  • The time overhead of gap arrays in GPU Huffman encoding is small: +3.97% − +18.35%.
  • GPU Huffman decoding is much faster if gap arrays are available:
  • 9 files: 1.67x − 4.07x
  • malicious file: 6450x
  • Including all operations for Huffman encoding/decoding and CPU-GPU data transfer, the GPU accelerates Huffman encoding/decoding over the CPU:
  • Encoding: 18.7x − 39.2x
  • Decoding: 35.2x − 52.3x
  • Gap arrays should be attached whenever Huffman encoding/decoding is performed on GPUs.
