A GPU Implementation of Belief Propagation Decoder for Polar Codes

A GPU Implementation of Belief Propagation Decoder for Polar Codes Bharath Kumar Reddy L. and Nitin Chandrachoodan Indian Institute of Technology Madras Nov 6, 2012 Bharath et al. (IIT Madras) GPU Polar Codes Nov 6, 2012 1 / 29

Outline
1. Polar Codes and Decoding Algorithms
2. Parallel Implementation
3. Summary

Polar Codes
- Capacity-achieving codes for symmetric binary-input discrete memoryless channels (B-DMC) [1]
- Capacity is achieved under Successive Cancellation (SC) decoding for very large code lengths ($2^{20}$ bits or more)
- Objective: to implement a fast decoder for polar codes
Figure: Channel capacity polarization as a function of channel instance.
[1] E. Arıkan, "Channel Polarization: A Method for Constructing Capacity-Achieving Codes for Symmetric Binary-Input Memoryless Channels", IEEE Trans. Info. Theory, 2009.

Decoding algorithms
Successive Cancellation (SC) decoder
- Serial: bit-by-bit decoding
- Complexity $O(N \log N)$
- Poor parallelism
- Good performance only for very large block lengths ($> 2^{20}$)
Belief Propagation (BP)
- Generic algorithm based on message passing
- Performs well at practical block lengths (100-1000 bits)
- Many stages can be implemented in parallel, as there is no interdependence among the bits
- Iterative: may require more iterations to converge

GPUs
Graphics Processing Unit
- Many-core processors: an array of multithreaded Streaming Multiprocessors (SMs)
- Multiple levels of memory: registers < shared memory < global memory
- Synchronization among SMs is possible only via global memory
- Good for applying the same computation to a large set of data
Our Specification
- NVIDIA GTX 560 Ti: 384 cores clocked at 1.66 GHz
- Fermi architecture: max of 1536 concurrent threads per SM, max of 1024 threads per block, max of 8 blocks per SM

Assumptions
- A large number of codewords is available to be decoded
- Calculations assume likelihood ratios are available as floating-point numbers
- Rate 1/2 coding
- An encoder structure based on the recursive definition

Encoding Graph
$c = uG$, where the generator matrix $G = F^{\otimes n}$ is the $n$-th Kronecker power of
$$F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$$
Figure: Polar code encoder for length 8. Inputs $u_0, \dots, u_7$ are on one side; the outputs are listed in the order $c_0, c_4, c_2, c_6, c_1, c_5, c_3, c_7$.
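The relation $c = uG$ can be illustrated with a short sketch. This is not the authors' code: the function names `kron`, `generator` and `polar_encode` are made up for illustration, arithmetic is over GF(2), and the output is in natural order (no bit-reversal permutation is applied).

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


def generator(n):
    """Return G = F^{(x)n}, the n-th Kronecker power of F = [[1, 0], [1, 1]]."""
    F = [[1, 0], [1, 1]]
    G = [[1]]
    for _ in range(n):
        G = kron(G, F)
    return G


def polar_encode(u):
    """Compute c = u * G over GF(2) with XOR butterflies instead of a matrix product.

    len(u) must be a power of two. Each pass combines pairs of positions that
    are `step` apart, which is the recursive structure of the Kronecker power.
    """
    c = list(u)
    N = len(c)
    assert N > 0 and N & (N - 1) == 0, "length must be a power of two"
    step = 1
    while step < N:
        for start in range(0, N, 2 * step):
            for i in range(start, start + step):
                c[i] ^= c[i + step]
        step *= 2
    return c
```

The butterfly form needs about $(N/2) \log_2 N$ XORs rather than the $N^2$ operations of an explicit matrix product, and the two agree on any input of length $2^n$.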

Encoder
Figure: An alternate way of representing the encoder. Inputs $u_0, \dots, u_7$ and outputs $c_0, \dots, c_7$ are connected through three SHUFFLE blocks.

Encoder
Unit of repetition. This unit is repeated $\log_2 N$ times in each iteration.
Figure: the unit of repetition, drawn with SHUFFLE blocks and the labels $X_{even}$, $X_{odd}$ and $L(X)$.

Overview
Figure: Decoder overview. Blocks shown: Bit Reversal, SHUFFLE, UPDATE, RE-SHUFFLE, Bit Reversal, Error Count. The SHUFFLE / UPDATE / RE-SHUFFLE group works on the L and R messages, is repeated $\log_2 N$ times within each iteration, and is marked as compute intensive.
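A minimal sketch of a bit-reversal permutation, the operation named in the overview. The helper names `bit_reverse` and `bit_reversal_permute` are hypothetical, and how the authors implement this step on the GPU is not stated in the slides.

```python
def bit_reverse(i, n):
    """Reverse the n-bit binary representation of index i."""
    r = 0
    for _ in range(n):
        r = (r << 1) | (i & 1)  # shift the lowest bit of i into r
        i >>= 1
    return r


def bit_reversal_permute(x):
    """Return x reordered so that position i holds x[bit_reverse(i)]."""
    N = len(x)
    n = N.bit_length() - 1
    assert N == 1 << n, "length must be a power of two"
    return [x[bit_reverse(i, n)] for i in range(N)]
```

For $N = 8$ the index order is 0, 4, 2, 6, 1, 5, 3, 7, which matches the order of the $c$ labels in the length-8 encoder figure. The permutation is its own inverse, so applying it before and after the iteration loop restores the original ordering.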

Identifying parallelism
Thread Level Parallelism
- Decoding a codeword using inherent parallelism
- The $i$-th thread updates the $i$-th and $(i + N/2)$-th nodes
- To decode an $N$-length codeword, $N/2$ threads are utilized
Figure: SHUFFLE block with the $i$-th and $(i + N/2)$-th nodes marked.
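The thread-to-node assignment can be written down in a few lines. This is an illustrative sketch only: `nodes_for_thread` is a made-up name, and the plain Python loop stands in for what would be a thread index inside a GPU kernel.

```python
def nodes_for_thread(i, N):
    """Return the pair of node indices handled by thread i for a length-N codeword."""
    assert 0 <= i < N // 2, "only N/2 threads are launched per codeword"
    return (i, i + N // 2)


def all_assignments(N):
    """List the node pairs of all N/2 threads, in thread order."""
    return [nodes_for_thread(i, N) for i in range(N // 2)]
```

With $N = 8$, threads 0 to 3 get the pairs (0, 4), (1, 5), (2, 6) and (3, 7), so every node is touched by exactly one thread and no two threads write the same location within a stage.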

Belief update
Messages
- Likelihood ratios as basis for messages
- $R_i$: left-to-right (frozen bits)
- $L_i$: right-to-left (from channel)
Figure: one processing element with a $\oplus$ node, inputs $L_1, L_2, R_1, R_2$ and outputs $LL_1, LL_2, RR_1, RR_2$.
Sum-product equations
$$RR_1 = \frac{1 + R_1 R_2 L_2}{R_1 + R_2 L_2}, \qquad RR_2 = R_2 \cdot \frac{1 + R_1 L_1}{R_1 + L_1}$$
$$LL_1 = \frac{1 + L_1 L_2 R_2}{L_1 + L_2 R_2}, \qquad LL_2 = L_2 \cdot \frac{1 + L_1 R_1}{L_1 + R_1}$$
LR or LLR?
- Avoid Jacobian computation (or approximation)
- Floating-point multiplication is not expensive on a GPU
- LLR is less susceptible to dynamic range problems
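The four sum-product equations share one building block, $g(a, b) = (1 + ab)/(a + b)$. The sketch below evaluates them for a single processing element in the likelihood-ratio domain. The names `g` and `update_element` are assumptions for illustration, and the code assumes finite, strictly positive likelihood ratios; a message that is exactly 0 or infinite would need special handling that the slides do not describe.

```python
import math


def g(a, b):
    """LR-domain combine (1 + a*b) / (a + b).

    In the log domain this is ln((1 + e^(x+y)) / (e^x + e^y)), the term that
    would otherwise need a Jacobian logarithm or an approximation.
    """
    return (1.0 + a * b) / (a + b)


def update_element(L1, L2, R1, R2):
    """Return (LL1, LL2, RR1, RR2) for one processing element."""
    RR1 = g(R1, R2 * L2)
    RR2 = R2 * g(R1, L1)
    LL1 = g(L1, L2 * R2)
    LL2 = L2 * g(L1, R1)
    return LL1, LL2, RR1, RR2


def g_llr(x, y):
    """Same combine written for log-likelihood ratios, for comparison."""
    return 2.0 * math.atanh(math.tanh(x / 2.0) * math.tanh(y / 2.0))
```

A likelihood ratio of 1 carries no information: with $R_1 = R_2 = 1$ the element returns $LL_2 = L_2$ and $RR_1 = RR_2 = 1$. Products of many ratios can overflow or underflow in single precision, which is the dynamic range concern listed on the slide.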

Memory management
Shared Memory
- On-chip memory
- Very low access latency compared to global memory
- Limited: 48 KB per SM
- All computations are done in shared memory
- Bank conflicts are avoided
Table: Speed-up using shared memory against global memory (time in ms)
Length   Global memory   Shared memory   Speed-up
256      74.17           7.41            10
512      101.37          8.94            11
1024     234.66          20.5            12
2048     825.96          60.98           14

Identifying parallelism
Block Level Parallelism
- Decoding as many codewords as can fit in shared memory
Table: Number of blocks launched with varying code lengths
Length (N)   Shared mem/codeword   # blocks (≤ 8)           # simultaneous codewords
256          2 KB                  1536/128 = 8 (12 > 8)    24
512          4 KB                  1536/256 = 6             12
1024         8 KB                  1536/512 = 3             6
2048         16 KB                 1536/1024 = 1            3
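The table entries can be reproduced with a small calculation. This is one reading of the table, not a formula given in the slides: it assumes 8 bytes of shared memory per code bit (for example one L and one R value stored as 4-byte floats), N/2 threads per block, and the per-SM limits quoted for the Fermi GPU. All names are hypothetical.

```python
MAX_THREADS_PER_SM = 1536         # from the GPU specification slide
MAX_BLOCKS_PER_SM = 8             # from the GPU specification slide
SHARED_MEM_PER_SM = 48 * 1024     # bytes, from the shared memory slide
BYTES_PER_NODE = 8                # assumption: two 4-byte floats per code bit


def shared_bytes_per_codeword(N):
    """Shared memory needed to hold the messages of one length-N codeword."""
    return N * BYTES_PER_NODE


def blocks_per_sm(N):
    """Blocks of N/2 threads that fit in one SM, capped at the hardware limit."""
    threads_per_block = N // 2
    return min(MAX_BLOCKS_PER_SM, MAX_THREADS_PER_SM // threads_per_block)


def simultaneous_codewords(N):
    """Codewords whose messages fit in one SM's shared memory at once."""
    return SHARED_MEM_PER_SM // shared_bytes_per_codeword(N)
```

Under these assumptions, N = 256 is limited by the 8-block cap (12 blocks would fit by thread count alone), while the longer codes are limited by the 1536-thread budget.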

Memory Management
Registers
- Fastest form of storage on the GPU
- Limited (32K) per SM
- More registers per thread means fewer concurrent threads
- For the Fermi architecture, if a thread uses 20 or fewer registers, then all threads are active
Table: Number of registers used
Length   # reg/thread   # active threads
256      22             1408 (91.66%)
512      22             1280 (83.33%)
1024     22             1024 (66.67%)
2048     22             1024 (66.67%)
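The active-thread counts in the table are consistent with a simple register budget. The sketch below is an assumption about how those numbers arise, not something the slides spell out: it divides the 32K registers of an SM by the registers per thread, caps the result at 1536 threads, and rounds down to whole blocks of N/2 threads. It ignores register allocation granularity and the 8-blocks-per-SM limit. The name `active_threads` is made up.

```python
REGISTERS_PER_SM = 32 * 1024   # "32K" registers per SM
MAX_THREADS_PER_SM = 1536      # Fermi limit quoted in the slides


def active_threads(N, regs_per_thread):
    """Threads that can be resident on one SM when each block has N/2 threads."""
    threads_per_block = N // 2
    by_registers = REGISTERS_PER_SM // regs_per_thread
    limit = min(by_registers, MAX_THREADS_PER_SM)
    return (limit // threads_per_block) * threads_per_block


def occupancy(N, regs_per_thread):
    """Active threads as a fraction of the per-SM maximum."""
    return active_threads(N, regs_per_thread) / MAX_THREADS_PER_SM
```

At 22 registers per thread the budget is 1489 threads, which falls below the 1536-thread maximum. At 20 registers or fewer the register budget no longer binds, which agrees with the slide's statement for Fermi.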

Memory Management
Fast math operations, Intrinsics and Instruction Optimizations
- Functions replaced by their intrinsics
- Registers used per thread: 22
- Registers used per thread after these optimizations: 19
Table: Speed-up using optimizations (for 35 iterations)
Length   Throughput (Mbps)   Speedup
256      17.57               1.1
512      8.71                1.2
1024     3.55                1.5
2048     1.23                -

FER vs iterations
Figure: FER for code length 1024. Frame error rate (log scale, $10^{0}$ down to $10^{-6}$) versus Eb/N0 (dB, 0 to 3.5), with one curve each for 10, 15, 20, 25, 30, 35 and 100 iterations.

Results
Optimizations done
- Right choice of decoder architecture for thread level parallelism
- Shared memory usage tuned for block level parallelism
- Reducing register count using approximate fast math operations
Table: Throughput (Mbps) against number of iterations
Length   10      15      20      25      30      35
256      57.20   38.82   30.34   24.42   20.32   17.57
512      29.08   19.98   15.01   12.08   10.15   8.71
1024     11.85   8.06    6.04    4.923   4.13    3.55
2048     4.089   2.79    2.11    1.71    1.43    1.23
