SLIDE 1

A GPU Implementation of Belief Propagation Decoder for Polar Codes

Bharath Kumar Reddy L. and Nitin Chandrachoodan

Indian Institute of Technology Madras

Nov 6, 2012

Bharath et al. (IIT Madras) GPU Polar Codes Nov 6, 2012 1 / 29

SLIDE 2

Outline

1. Polar Codes and Decoding Algorithms
2. Parallel Implementation
3. Summary

SLIDE 3

Topic

1. Polar Codes and Decoding Algorithms
2. Parallel Implementation
3. Summary

SLIDE 4

Polar Codes

Capacity-achieving codes for symmetric binary-input discrete memoryless channels (B-DMC)¹
Capacity is achieved under Successive Cancellation (SC) decoding for very large code lengths (2^20 or more bits)
Objective: to implement a fast decoder for polar codes

Channel capacity polarization as a function of channel instance.

¹ E. Arıkan, “Channel Polarization: A Method for Constructing Capacity-Achieving Codes for Symmetric Binary-Input Memoryless Channels”, IEEE Trans. Inf. Theory, 2009

SLIDE 5

Decoding algorithms

Successive Cancellation (SC) decoder

Serial, bit-by-bit decoding
Complexity O(N log N)
Poor parallelism
Good performance only for very large block lengths (> 2^20)

SLIDE 6

Decoding algorithms


Belief Propagation (BP)

Generic algorithm based on message passing
Performs well at practical block lengths (100–1000 bits)
Many stages can be implemented in parallel, as there is no interdependence among the bits
Iterative: may require many iterations to converge

SLIDE 7

GPUs

Graphic Processing Unit

Many-core processors: an array of multithreaded Streaming Multiprocessors (SMs)
Multiple levels of memory: registers < shared memory < global memory
Synchronization among SMs is possible only via global memory
Good for applying the same computation to a large set of data

SLIDE 8

GPUs


Our Specification

NVIDIA GTX 560 Ti: 384 cores clocked at 1.66 GHz
Fermi architecture: max 1536 concurrent threads per SM, max 1024 threads per block, max 8 blocks per SM

SLIDE 9

Assumptions


A large number of codewords is available to be decoded
Calculations assume likelihood ratios are available as floating-point numbers
Rate-1/2 coding
An encoder structure based on the recursive definition

SLIDE 10

Encoding Graph

c = uG, where G, the generator matrix, is F^⊗n, the n-th Kronecker power of F = [1 0; 1 1]


Figure: Polar code encoder for length 8
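A minimal sketch of this construction (using the plain G = F^⊗n form written on this slide; Arıkan's original definition also composes a bit-reversal permutation into G, which is omitted here):

```python
import numpy as np

# Kernel matrix F of the polar code construction.
F = np.array([[1, 0],
              [1, 1]], dtype=np.uint8)

def polar_generator(n):
    """G = F^{(x)n}: the n-th Kronecker power of F, over GF(2)."""
    G = np.array([[1]], dtype=np.uint8)
    for _ in range(n):
        G = np.kron(G, F)
    return G % 2

def encode(u, G):
    """c = uG over GF(2)."""
    return u.dot(G) % 2

# Length-8 example matching the figure: n = log2(8) = 3.
G = polar_generator(3)
u = np.array([1, 0, 1, 1, 0, 0, 0, 1], dtype=np.uint8)
c = encode(u, G)
```

The input vector `u` above is an arbitrary example, not taken from the slides.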

SLIDE 11

Encoder

Figure: An alternate way of representing the encoder as a cascade of SHUFFLE stages

SLIDE 12

Encoder

Unit of repetition: SHUFFLE(X) splits X into X_even and X_odd. This unit is repeated log2 N times in each iteration.
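A sketch of what this unit computes, assuming SHUFFLE(X) is the even/odd split suggested by the X_even and X_odd labels in the figure (`shuffle` is a hypothetical helper name, not code from the talk):

```python
def shuffle(x):
    """One SHUFFLE stage: even-indexed entries first, then odd-indexed."""
    return x[0::2] + x[1::2]

x = list(range(8))
s = shuffle(x)  # [0, 2, 4, 6, 1, 3, 5, 7]
```

Applied log2 N times (here, 3 times on 8 elements) the permutation returns to the original order, which is what lets a single stage be reused across all log2 N iterations.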

SLIDE 13

Topic

1. Polar Codes and Decoding Algorithms
2. Parallel Implementation
3. Summary

SLIDE 14

Overview

Pipeline: Bit Reversal → (SHUFFLE → UPDATE → RE-SHUFFLE) × log2 N iterations → Bit Reversal, producing the L and R messages and an error count. UPDATE is the compute-intensive stage.
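The bit-reversal stages at the two ends of this pipeline can be sketched as follows (a hypothetical helper, assuming the standard bit-reversal permutation on 2^n indices):

```python
def bit_reverse(x):
    """Permute a length-2^n list so element i moves to the index whose
    n-bit binary representation is the reverse of i's."""
    n = (len(x) - 1).bit_length()   # log2 of the length
    out = [None] * len(x)
    for i, v in enumerate(x):
        r = int(format(i, f"0{n}b")[::-1], 2)
        out[r] = v
    return out

data = list(range(8))
rev = bit_reverse(data)  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Bit reversal is an involution (applying it twice is the identity), so the same code serves as both the first and the last stage of the pipeline.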

SLIDE 15

Identifying parallelism

Thread Level Parallelism

Decoding a codeword using its inherent parallelism
The i-th thread updates the i-th and (i + N/2)-th nodes
To decode an N-length codeword, N/2 threads are utilized

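A sketch of this index mapping (hypothetical helper names; the actual kernel is not shown in the slides). Each of the N/2 threads owns one node pair, so together they touch every node exactly once per stage:

```python
N = 8  # example code length

def nodes_for_thread(i, N):
    """Thread i updates the node pair (i, i + N/2)."""
    return (i, i + N // 2)

pairs = [nodes_for_thread(i, N) for i in range(N // 2)]
# pairs -> [(0, 4), (1, 5), (2, 6), (3, 7)]
```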

SLIDE 16

Belief update

Figure: A 2x2 processing element with inputs (L1, L2, R1, R2) and outputs (LL1, LL2, RR1, RR2)

RR1 = (1 + R1·R2·L2) / (R1 + R2·L2)
RR2 = R2 · (1 + R1·L1) / (R1 + L1)
LL1 = (1 + L1·L2·R2) / (L1 + L2·R2)
LL2 = L2 · (1 + L1·R1) / (L1 + R1)

Messages

Likelihood ratios as the basis for messages
R_i: left-to-right messages (from the frozen bits)
L_i: right-to-left messages (from the channel)
Sum-product equations
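A direct transcription of the four update equations on this slide into a helper (a sketch only; `update_pair` is a hypothetical name, and any clipping or normalization the real kernel may perform is omitted):

```python
def update_pair(L1, L2, R1, R2):
    """One 2x2 processing element: sum-product belief update in
    likelihood-ratio form."""
    RR1 = (1 + R1 * R2 * L2) / (R1 + R2 * L2)
    RR2 = R2 * (1 + R1 * L1) / (R1 + L1)
    LL1 = (1 + L1 * L2 * R2) / (L1 + L2 * R2)
    LL2 = L2 * (1 + L1 * R1) / (L1 + R1)
    return LL1, LL2, RR1, RR2

# Sanity check: uninformative inputs (LR = 1 everywhere) stay uninformative.
out = update_pair(1.0, 1.0, 1.0, 1.0)  # (1.0, 1.0, 1.0, 1.0)
```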

SLIDE 17

Belief update


LR or LLR?

Avoids the Jacobian-logarithm computation (or its approximation)
Floating-point multiplication is not expensive on the GPU
LLR is less susceptible to dynamic-range problems

SLIDE 18

Memory management

Shared Memory

On-chip memory
Very low access latency compared to global memory
Limited: 48 KB per SM
All computations are done in shared memory
Bank conflicts are avoided

SLIDE 19

Memory management


Table: Speed-up using shared memory vs. global memory (time in ms)

Length   Global memory   Shared memory   Speed-up
256      74.17           7.41            10
512      101.37          8.94            11
1024     234.66          20.5            12
2048     825.96          60.98           14

SLIDE 20

Identifying parallelism

Block Level Parallelism

Decoding as many codewords as that could fit in shared memory

Table: # blocks launched with varying code lengths

Length (N)   Shared mem/codeword   # blocks (≤ 8)          # simultaneous codewords
256          2 KB                  1536/128 = 8 (12 > 8)   24
512          4 KB                  1536/256 = 6            12
1024         8 KB                  1536/512 = 3            6
2048         16 KB                 1536/1024 = 1           3
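The arithmetic behind this table can be sketched as follows. It assumes each codeword keeps two length-N float arrays (L and R) in shared memory, i.e. 8N bytes per codeword; this is an inference from the 2 KB figure for N = 256, not stated explicitly on the slide:

```python
def occupancy(N, max_threads=1536, max_blocks=8, shared_kb=48):
    """Blocks per SM are limited by the 1536-concurrent-thread and
    8-blocks-per-SM caps; simultaneous codewords by 48 KB shared memory."""
    threads_per_block = N // 2                 # one thread per node pair
    blocks = min(max_threads // threads_per_block, max_blocks)
    mem_per_codeword_kb = 8 * N // 1024        # two float arrays of N entries
    codewords = shared_kb // mem_per_codeword_kb
    return blocks, codewords

for N in (256, 512, 1024, 2048):
    print(N, occupancy(N))
```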

SLIDE 21

Memory Management

Registers

Fastest form of storage on the GPU
Limited: 32K registers per SM
More registers per thread means fewer concurrent threads
On the Fermi architecture, if a thread uses 20 or fewer registers, all threads are active

SLIDE 22

Memory Management


Table: Number of registers used

Length   # reg/thread   # active threads
256      22             1408 (91.66%)
512      22             1280 (83.33%)
1024     22             1024 (66.67%)
2048     22             1024 (66.67%)

SLIDE 23

Memory Management

Fast math operations, Intrinsics and Instruction Optimizations

Functions replaced by their intrinsics
Registers used per thread: 22
Registers used per thread after these optimizations: 19

SLIDE 24

Memory Management


Table: Speed-up using optimizations (for 35 iterations)

Length   Throughput (Mbps)   Speedup
256      17.57               1.1
512      8.71                1.2
1024     3.55                1.5
2048     1.23

SLIDE 25

FER vs iterations

Figure: FER vs Eb/N0 (dB) for code length 1024, with 10, 15, 20, 25, 30, 35, and 100 iterations
SLIDE 26

Results

Optimizations done

Right choice of decoder architecture for thread-level parallelism
Shared memory usage tuned for block-level parallelism
Register count reduced using approximate fast-math operations

SLIDE 27

Results


Table: Throughput (Mbps) vs number of iterations

Length   10      15      20      25      30      35
256      57.20   38.82   30.34   24.42   20.32   17.57
512      29.08   19.98   15.01   12.08   10.15   8.71
1024     11.85   8.06    6.04    4.923   4.13    3.55
2048     4.089   2.79    2.11    1.71    1.43    1.23

SLIDE 28

Topic

1. Polar Codes and Decoding Algorithms
2. Parallel Implementation
3. Summary

SLIDE 29

Summary

We have described a parallel implementation of a decoder for polar codes on a GPU
With the right choice of architecture, a single stage can be reused
We have also optimized the usage of registers and shared memory to achieve good throughput
The resulting decoder is much faster than a CPU decoder and scales with the available cores, provided enough codewords are available for decoding

SLIDE 30

Future Work

Working with larger block lengths, where codewords spill out of shared memory
Comparing with LLRs
Optimizing the BP update equations

SLIDE 31

Acknowledgment

We would like to thank Dr. Andrew Thangaraj, IIT Madras, for his valuable suggestions and insights during the course of this work

SLIDE 32

Thank you!

SLIDE 33

Results

Table: Speedup for a 1024-length codeword (35 iterations)

             CPU                     CUDA
Platform     Single Intel core       GTX 560 Ti
Time         0.06 sec/codeword       0.038 sec/288 codewords
Throughput   8.33 Kbps²              3.55 Mbps
Speedup                              436x

² The CPU code was executed on a single core with general compiler optimizations (the O3 flag was used)

SLIDE 34

Results

Figure: FER vs Eb/N0 (dB) for code length 256, with 10, 15, 20, 25, 30, 35, and 100 iterations
SLIDE 35

Results

Figure: FER vs Eb/N0 (dB) for code length 512, with 10, 15, 20, 25, 30, 35, and 100 iterations
SLIDE 36

Results

Figure: FER vs Eb/N0 (dB) for code length 2048, with 10, 15, 20, 25, 30, 35, and 100 iterations