An Efficient GPU-based LDPC Decoder for Long Codewords

SLIDE 1

18.10.11 Åbo Akademi University - Turku, Finland 1

An Efficient GPU-based LDPC Decoder for Long Codewords

Stefan Grönroos Kristian Nybom Jerker Björkqvist

SLIDE 2

Background

Working on a software real-time DVB-T2 implementation for general-purpose computers
The DVB-T2, DVB-C2, and DVB-S2 standards use LDPC codes as part of their FEC scheme
Very long codewords: 16200 or 64800 bits
One of the most complex operations in the signal processing chain
DVB-T2 requires up to ~61 Mbps decoder throughput
Our CPU implementation is not even close to real-time capable
Thus we turned to GPUs
More specifically, NVIDIA's CUDA framework

SLIDE 3

LDPC Decoding

An LDPC code can be described by:
– H matrix
– Corresponding bipartite graph

n-bit codeword:
– k data bits
– (n - k) parity bits

[Figure: example parity-check matrix H and its bipartite graph, with n variable nodes and (n - k) check nodes]
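
To make the H-matrix description concrete, here is a minimal sketch of the parity check that defines a valid codeword: each row of H corresponds to one check node, and every check must see an even number of ones. The matrix `H_EXAMPLE` and the function names are hypothetical illustrations. They are not the matrix from the slide and are nowhere near DVB-T2 size.

```python
# Hypothetical 3 x 6 parity-check matrix (n = 6, n - k = 3); illustration only.
H_EXAMPLE = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]


def syndrome(H, word):
    """Return one parity bit per check node: (H * word) mod 2."""
    return [sum(h * b for h, b in zip(row, word)) % 2 for row in H]


def is_codeword(H, word):
    """A word is a valid codeword when every check is satisfied."""
    return not any(syndrome(H, word))


def edges(H):
    """List the (check, variable) pairs, i.e. the edges of the bipartite graph."""
    return [(c, v) for c, row in enumerate(H) for v, h in enumerate(row) if h]
```

Each 1 in H becomes one edge of the graph, so this toy matrix has 9 edges. A single flipped bit shows up as one or more unsatisfied checks in the syndrome.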

SLIDE 4

Iterative message passing

Each edge in the graph holds a message between check and variable nodes
Check node update
Variable node update

[Figure: the check node update and the variable node update, each illustrated on the bipartite graph of n variable nodes and (n - k) check nodes]
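
The update formulas on this slide did not survive, so the following is not necessarily the authors' rule. It is a sketch of one widely used variant, min-sum decoding on log-likelihood ratios (LLRs), included only to show the shape of the two updates. The function names and the choice of min-sum are assumptions.

```python
def check_node_update(incoming):
    """Min-sum check node update (assumed variant, not taken from the slides).

    incoming: LLR messages arriving from the variable nodes, one per edge
    (at least two edges). Each outgoing message is computed from all the
    *other* incoming messages: product of their signs times the smallest
    magnitude.
    """
    outgoing = []
    for i in range(len(incoming)):
        others = incoming[:i] + incoming[i + 1:]
        sign = -1 if sum(m < 0 for m in others) % 2 else 1
        outgoing.append(sign * min(abs(m) for m in others))
    return outgoing


def variable_node_update(channel_llr, incoming):
    """Variable node update: channel LLR plus all *other* check messages."""
    total = channel_llr + sum(incoming)
    return [total - m for m in incoming]
```

In both functions the message sent along an edge excludes what arrived on that same edge, which is what makes the scheme a message-passing algorithm rather than a plain sum.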

SLIDE 5

Hardware Setup

NVIDIA GeForce GTX 570
Based on the NVIDIA Fermi architecture
15 Streaming Multiprocessors (SMs)
32 cores per SM
Thread warp: group of 32 consecutive threads
The same instruction is run for a half-warp (16 threads) at a time on 16 cores of an SM

Source: NVIDIA

SLIDE 6

GPU Memory Accesses

Access to the large global memory is very slow on the GPU
Global memory accesses are processed per warp (32 threads)
If the threads of a warp access 32 aligned, consecutive 32-bit words, we get full memory coalescing
Only one memory request for 128 bytes is made, and the memory bus is fully utilized
Very low bus utilization if memory accesses are scattered within a warp!
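
The effect can be illustrated with a deliberately simplified model: count how many aligned 128-byte segments the 32 addresses of a warp fall into. This sketch ignores caching and other hardware details, and the 1024-byte stride in the scattered case is an arbitrary value chosen for illustration.

```python
SEGMENT_BYTES = 128  # size of one memory request, as stated on the slide
WARP_SIZE = 32


def segments_touched(byte_addresses):
    """Count the distinct aligned 128-byte segments a warp's accesses fall into."""
    return len({addr // SEGMENT_BYTES for addr in byte_addresses})


# Coalesced: thread t reads the 4-byte word at byte offset 4 * t.
coalesced = [4 * t for t in range(WARP_SIZE)]

# Scattered: thread t reads a 4-byte word 1024 bytes away from its neighbour.
scattered = [1024 * t for t in range(WARP_SIZE)]

print(segments_touched(coalesced), segments_touched(scattered))
```

In this model the coalesced pattern needs one segment and the scattered pattern needs 32, so the scattered warp uses only 4 of every 128 bytes it causes to be fetched.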

SLIDE 7

Decoder memory accesses

If we decode one codeword at a time:
Either the check node update or the variable node update has scattered memory accesses
Solution: decode several codewords in parallel
Efficient memory accesses
Increases parallelism

[Figure: bipartite graph of n variable nodes and (n - k) check nodes, shown for each of the two updates]
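
One way to see why one of the two updates is always scattered: suppose, purely as an assumption for illustration, that the edge messages of a single codeword are stored in check-node order. Each check node then reads a contiguous run of messages, but the edges of a variable node are spread across the array. The matrix `H_SMALL` below is hypothetical.

```python
# Hypothetical small parity-check matrix; rows are check nodes, columns are
# variable nodes.
H_SMALL = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]


def edge_positions(H):
    """Number the edges in check-node (row-major) order and group the
    resulting storage positions by check node and by variable node."""
    by_check = {c: [] for c in range(len(H))}
    by_variable = {v: [] for v in range(len(H[0]))}
    position = 0
    for c, row in enumerate(H):
        for v, h in enumerate(row):
            if h:
                by_check[c].append(position)
                by_variable[v].append(position)
                position += 1
    return by_check, by_variable


by_check, by_variable = edge_positions(H_SMALL)
print(by_check[1])     # contiguous positions for a check node
print(by_variable[0])  # positions far apart for a variable node
```

Storing the messages in variable-node order instead would simply move the scattering to the check node update, which is why the slide proposes batching codewords rather than reordering edges.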

SLIDE 8

Our LDPC Decoder approach

Two main kernels (functions), iterated alternately:
Check node update
Variable node update
8-bit fixed-point representation for messages
Messages for the same edge for all codewords are stored consecutively in memory
We decode 128 codewords in parallel
Each thread updates the outgoing messages from one check/variable node for 4 different codewords
A warp processes the same updates for all 128 codewords (32 threads x 4 codewords)
Result: 128-byte message reads/writes to global memory
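
The layout described above can be sketched as an index calculation. The slide states only the properties (one byte per message, messages of one edge stored consecutively across the 128 codewords, 4 codewords per thread). The exact formula, the function names, and the signed 8-bit range used for saturation are assumptions made for this illustration.

```python
CODEWORDS = 128           # codewords decoded in parallel (from the slide)
CODEWORDS_PER_THREAD = 4  # from the slide
WARP_SIZE = 32
SEGMENT_BYTES = 128


def message_address(edge, codeword):
    """Byte offset of the 8-bit message for (edge, codeword): the messages of
    all codewords for one edge sit next to each other."""
    return edge * CODEWORDS + codeword


def thread_addresses(edge, thread):
    """The 4 consecutive bytes that one thread of a warp touches for an edge."""
    first = thread * CODEWORDS_PER_THREAD
    return [message_address(edge, first + i) for i in range(CODEWORDS_PER_THREAD)]


def warp_addresses(edge):
    """All byte offsets the 32 threads of a warp touch for one edge."""
    return [a for t in range(WARP_SIZE) for a in thread_addresses(edge, t)]


def saturate_int8(value):
    """Clamp a message to the signed 8-bit range (assumed fixed-point format)."""
    return max(-128, min(127, value))
```

With this indexing, each thread reads 4 adjacent bytes (one 32-bit word) and the warp as a whole covers exactly one aligned 128-byte segment per edge, regardless of which edge the graph structure points it to.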

SLIDE 9

Performance

Good memory access patterns
Solution is now instruction bound
No shared ("scratchpad") memory used, just the 48 KB L1 cache
Allows a larger number of active threads
Throughput:
Codeword length: 64800 bits
Code rate ½ (32400 information bits, 32400 parity bits)

Iterations    Throughput
20            163 Mbps
30            112 Mbps
50            69 Mbps
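
As a rough back-of-the-envelope check, the measured figures can be compared with the ~61 Mbps DVB-T2 requirement quoted earlier. The slides do not say whether throughput counts all codeword bits or only information bits. This sketch assumes codeword bits, so the batch times are illustrative rather than measured.

```python
CODEWORD_BITS = 64800
BATCH = 128                 # codewords decoded in parallel (from the slides)
REQUIRED_MBPS = 61          # approximate DVB-T2 requirement from the slides
MEASURED_MBPS = {20: 163, 30: 112, 50: 69}  # iterations -> throughput


def batch_time_ms(mbps):
    """Time to decode one batch of 128 codewords at the given throughput,
    assuming throughput is measured in codeword bits per second."""
    return BATCH * CODEWORD_BITS / (mbps * 1e6) * 1e3


def headroom(mbps):
    """Ratio of measured throughput to the ~61 Mbps requirement."""
    return mbps / REQUIRED_MBPS


for iterations, mbps in MEASURED_MBPS.items():
    print(iterations, round(batch_time_ms(mbps), 1), round(headroom(mbps), 2))
```

All three settings exceed the requirement. Under the stated assumption, a batch takes roughly 51 ms at 20 iterations and about 120 ms at 50 iterations, where the margin over 61 Mbps shrinks to about 13%.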

SLIDE 10

Conclusions

Real-time LDPC decoding for DVB-T2, DVB-S2, and DVB-C2 is possible on a modern GPU
Some capacity is left on the GPU for other complex tasks, such as a QAM constellation demapper
Future work

SLIDE 11

Thank you for listening! Questions?