
SLIDE 1

Low-latency software LDPC decoders for x86 multi-core devices

Bertrand LE GAL and Christophe JEGO
firstname.lastname@ims-bordeaux.fr

IMS laboratory, CNRS UMR 5218, Digital Circuits and Systems team
Bordeaux-INP, University of Bordeaux, France

IEEE International Workshop on Signal Processing Systems (SIPS)
October 3rd, 2017, Lorient, France

SLIDE 2

Historically, software decoders were limited to…

[Figure: hardware decoder structures — an unrolled datapath with channel SRAMs and processing elements, and a SIMD matrix of processing units with local register files, memory units (LLR Ti), a NISC controller, and Π/Π⁻¹ interleavers]

  • Validating and comparing error-correction code families,
  • Benchmarking decoding algorithms or code construction techniques,
  • Parameter optimization,
  • Estimating hardware decoder performance before development.

SLIDE 3

Currently, they can fulfill other real-time performance requirements

  • They provide design-time and runtime flexibility,
  • Software decoders are at least as fast as many hardware circuits,
  • They are currently compatible with some industrial use cases: throughputs are higher than 1 Gbps on multi-core or many-core devices.

However:

  • Processing latencies of hundreds of µs or ms are too high,
  • Consecutive frame configurations can differ (N, rate), which rules out exploiting inter-frame parallelism [1].

[1] OpenAirInterface: 5G software alliance for democratising wireless innovation


SLIDE 6

The processing performance of GPU & CPU devices

Multicore device (e.g. an INTEL Core-i7)

  • One chip hierarchically composed of physical processor cores (4), each with SIMD units,
  • In one clock cycle, a SIMD instruction can perform 32 computations on 8-bit fixed-point data ⇒ 32 8-bit operations,
  • In one clock cycle, a (superscalar) physical core can issue up to 6 SIMD instructions ⇒ 192 8-bit operations,
  • In one clock cycle, a Core-i7 processor can execute 4 cores × 6 SIMD instructions ⇒ 768 8-bit operations.

GPU device (e.g. an NVIDIA Titan GPU)

  • One chip hierarchically composed of stream processors (14) and cores (2688); each stream processor controls a set of cores (192),
  • In one clock cycle, 2688 floating-point operations can be executed,
  • However, many more computations are required to hide processing and memory-access latencies.

With a 1 to 3 GHz clock frequency, both device families deliver (theoretically) high processing performance.

[Photos: INTEL Core-i7 processor, NVIDIA Tegra K1 GPU]
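The Core-i7 peak rate above is just the product of the three slide figures; spelling the arithmetic out (numbers taken from the slide, not measured):

```cpp
// Peak 8-bit operations per clock cycle implied by the slide's Core-i7 figures.
constexpr int kSimdLanes     = 32;  // 8-bit lanes per SIMD instruction
constexpr int kInstrPerCycle = 6;   // SIMD instructions issued per superscalar core
constexpr int kCores         = 4;   // physical cores on the chip

constexpr int kOpsPerCore = kSimdLanes * kInstrPerCycle;  // per-core 8-bit ops/cycle
constexpr int kOpsPerChip = kOpsPerCore * kCores;         // whole-chip 8-bit ops/cycle
```

At 3 GHz this corresponds to a theoretical peak in the trillions of 8-bit operations per second, which is why the decoder design focuses on keeping the SIMD lanes filled.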

SLIDE 7

The structure of standardized LDPC codes

๏ Standardized H matrices have a Quasi-Cyclic (QC) structure:

➡ Compressed matrix definition, ➡ Z expansion factor, ➡ Shifting coefficients.

๏ This QC structure of the H matrix:

➡ Reduces the H memory footprint, ➡ Limits data dependencies during decoding, making parallel computing easy.

๏ From a hardware point of view, the Z factor « enforces »:

➡ Z processing units, ➡ Z memory banks, ➡ One or two Z × Z data interleavers.

[Figure: WiMAX 576 × 288 LDPC code, Z = 24 — H matrix reconstructed from Z × Z shifted identity matrices]
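To make the compressed definition concrete, here is a minimal sketch (not from the talk; the base-matrix values are made up) of how the base matrix, the Z expansion factor and the shift coefficients reconstruct the full binary H matrix:

```cpp
#include <vector>

// Expand a compressed QC-LDPC base matrix into the full binary H matrix.
// base[r][c] = -1 encodes an all-zero Z x Z block; any other value is the
// cyclic shift applied to a Z x Z identity matrix.
std::vector<std::vector<int>> expand_qc(const std::vector<std::vector<int>>& base,
                                        int Z) {
    const int rows = static_cast<int>(base.size());
    const int cols = static_cast<int>(base[0].size());
    std::vector<std::vector<int>> H(rows * Z, std::vector<int>(cols * Z, 0));
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c) {
            const int shift = base[r][c];
            if (shift < 0) continue;            // all-zero block
            for (int i = 0; i < Z; ++i)         // shifted identity block
                H[r * Z + i][c * Z + (i + shift) % Z] = 1;
        }
    return H;
}
```

Only the small base matrix and its shift values need to be stored, which is exactly why the QC structure "reduces the H memory footprint".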

SLIDE 8

The structure of standardized LDPC codes

[Figure: Z-parallel decoder architecture — memory units (LLR Ti), processing units with register files, an FSM controller, a system interface with control signals and IO status, and Π/Π⁻¹ interleavers]

๏ From a hardware point of view, the Z factor « enforces »:

➡ Z processing units, ➡ Z memory banks, ➡ One or two Z × Z data interleavers.

Hardware design of a Z-wide decoder structure is possible even for Z = {7, 13, 420}.

SLIDE 9

Parallelization of the LDPC decoding process (1/3)

[Figure: Tanner graphs (CNs C0–C3, VNs V0–V7) illustrating each parallelization strategy]

Parallelization of CN kernels (intra-frame)

  • Parallelism is limited by the CN degrees,
  • Horizontal SIMD processing (poor efficiency),
  • Requires unaligned memory accesses to the VNs.

Parallelization across CN kernels

  • As in hardware architectures (Q CNs of the same degree),
  • Unaligned memory accesses to the VNs,
  • Needs matrix reordering (not always possible: unstructured codes).

Parallelization across frames

  • Very regular computations (including memory accesses),
  • Not evaluated in hardware architectures (high latency),
  • Requires reordering at the beginning of the decoding.

SLIDE 10

Parallelization of the LDPC decoding process (2/3)

[Figure: Tanner graphs (CNs C0–C3, VNs V0–V7) for several frames decoded in parallel]

Parallelization of CN kernels

  • Parallelism is limited by the CN degrees,
  • Horizontal SIMD processing (poor efficiency),
  • Requires unaligned memory accesses to the VNs.

Parallelization across CN kernels

  • As in hardware architectures (Q CNs of the same degree),
  • Unaligned memory accesses to the VNs,
  • Needs matrix reordering (not always possible: unstructured codes).

SIMD parallelization across frames [1] (inter-frame)

  • Regular computations,
  • High memory footprint at runtime (buffering),
  • High decoding latency (≈100 µs).

[1] B. Le Gal and C. Jego, "High-throughput multi-core LDPC decoders based on x86 processor," IEEE TPDS, 2016

SLIDE 11

Parallelization of the LDPC decoding process (3/3)

[Figure: Tanner graphs highlighting Z CNs processed in parallel]

Parallelization of CN kernels

  • Parallelism is limited by the CN degrees,
  • Horizontal SIMD processing (poor efficiency),
  • Requires unaligned memory accesses to the VNs.

Parallelization across CN kernels (intra-frame)

  • Low latency (as in hardware architectures),
  • Should be quite efficient (when Z > SIMD width),
  • Irregular accesses to the VNs ⇒ performance penalties,
  • Limited to QC LDPC codes.

Parallelization across frames

  • Very regular computations (including memory accesses),
  • Not evaluated in hardware architectures (high latency),
  • Requires reordering at the beginning of the decoding.

SLIDE 12

The implementation concerns (1/2)

The first implementation issue comes from the CN processing: the Core-i7 instruction set is not devoted to LDPC decoding. CN processing efficiency was therefore improved by software-oriented algorithmic transformations.

The CN kernel from [1] was reused because of its good x86 SIMD performance:

  • 169 instructions for one CN computation,
  • 47 processor clock cycles (IPC ≈ 4),
  • i.e. 47 clock cycles for 32 parallel CNs.

CN      # of     Kernel throughput
degree  instr.   (all)      (latest)
 6       169      47 cc.     66 cc.
 7       194      52 cc.     73 cc.
 8       218      58 cc.     79 cc.
 9       242      64 cc.     84 cc.
10       269      71 cc.     89 cc.
12       320      84 cc.    103 cc.
19       517     131 cc.    150 cc.
20       548     139 cc.    156 cc.
32       858     217 cc.    242 cc.

[1] B. Le Gal and C. Jego, "High-throughput multi-core LDPC decoders based on x86 processor," IEEE TPDS, 2016

∏_{i=1}^{d_c} sign(m(v_i, c)) = sign( ∏_{i=1}^{d_c} m(v_i, c) )


The product of the message signs was replaced by the sign of the message product.
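A scalar sketch (assuming 8-bit two's-complement LLR messages, as in the talk's fixed-point decoders) of why this transformation helps: the parity of the sign bits, obtainable with plain XORs, equals the sign of the product, so no chain of sign multiplications is needed.

```cpp
#include <cstdint>
#include <vector>

// Sign of the product of all CN input messages: the left-hand side of the
// equation above, computed literally.
int sign_of_product(const std::vector<int8_t>& m) {
    int s = 1;
    for (int8_t v : m) s *= (v < 0) ? -1 : 1;
    return s;
}

// Right-hand-side form: XOR-accumulate the messages; the MSB of the result
// is the parity of the sign bits, i.e. the sign of the product (zero LLRs
// are treated as positive in both forms).
int sign_via_xor(const std::vector<int8_t>& m) {
    uint8_t acc = 0;
    for (int8_t v : m) acc ^= static_cast<uint8_t>(v);
    return (acc & 0x80) ? -1 : +1;
}
```

The XOR form is SIMD-friendly on x86: one packed XOR handles the sign tracking of 32 lanes at once, which fits the instruction budget of the kernel above.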

SLIDE 15

The implementation concerns (2/2)

The second implementation issue comes from the VN memory accesses, which are irregular:

  • Gather-loading a VN set into a SIMD register costs between 24 and 42 cycles each,
  • This is not usable when the CN processing itself requires only 47 clock cycles.

[Figure: VN set (Z = 64) — a rotation with Rid = 0 needs a single register operation, while Rid = 16 needs several]

For one CN with dc = 6, memory accesses would represent up to 86% of the execution time.

SLIDE 16

The implementation concerns (2/2)

The proposed solution consists in designing an efficient software SIMD shift register, tailored to the QC H matrix, to access the VN elements.

[Figure: VN set (Z = 64) — the rotated block is built from two register loads combined with FF/00 AND masks and an OR merge, a handful of register operations]

In the best case, this complex memory access can be executed in 5 clock cycles.
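A scalar emulation (not the talk's actual intrinsics) of the shift-register idea: the rotated VN block is assembled from two contiguous copies, the software analogue of the two masked SIMD loads OR-ed together on the slide, instead of Z independent gather accesses.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Rotate a Z-byte VN block left by s positions (0 <= s < Z) using two bulk
// copies -- the scalar analogue of two masked vector loads merged with OR.
std::vector<uint8_t> rotate_vn(const std::vector<uint8_t>& vn, size_t s) {
    const size_t Z = vn.size();
    std::vector<uint8_t> out(Z);
    std::memcpy(out.data(), vn.data() + s, Z - s);    // tail -> "register 1"
    std::memcpy(out.data() + (Z - s), vn.data(), s);  // head -> "register 2"
    return out;
}
```

Because the shift s comes from the QC matrix and is known when the decoder source is generated, the SIMD version can hard-code the masks and offsets for each access.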

SLIDE 17

Theoretical performance limits

๏ The performance of intra-frame parallelized decoders is expected to be irregular:

๏ Processing efficiency

➡ The usage rate of the SIMD units depends on the Z value:

  • Z = 32 ⇒ 100%, Z = 33 ⇒ 51%,
  • Z = 24 ⇒ 75%, Z = 42 ⇒ 65%.

๏ VN access efficiency

➡ The number of « complex » memory accesses depends on the Z value:

  • Z < 32 ⇒ 100%, Z = 96 ⇒ 33%,
  • Z = 320 ⇒ 10%, Z = 42 ⇒ 50%.

To sum up: Z should be high enough to reduce the VN access penalties, and it should be a multiple of 32 for SIMD efficiency.
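The SIMD usage rates quoted above follow from a simple formula: Z check nodes need ceil(Z / W) passes of W lanes each (W = 32 for 8-bit data on AVX2), and only the last pass may be partially filled. A small helper reproduces the slide's numbers:

```cpp
// SIMD usage rate (in %) when Z check nodes are processed in passes of W
// lanes: efficiency = Z / (W * ceil(Z / W)).
int simd_efficiency_percent(int Z, int W) {
    const int passes = (Z + W - 1) / W;  // ceil(Z / W)
    return 100 * Z / (W * passes);
}
```

For example, Z = 33 needs two 32-lane passes to process 33 check nodes, so only 33 of 64 lanes do useful work.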


SLIDE 19

Intra-frame parallelized decoder generation framework

[Diagram: parameters (Z) and a generic SIMD LDPC decoder source code feed a source-code generator, which produces the LDPC decoder (CN processing kernels + a QC-dedicated VN access layer); combined with an x86/ARM vectorization library, a C++ compiler (e.g. clang) turns this into optimized decoder implementations]

SLIDE 20

Evaluation of the proposed LDPC decoder implementations

๏ Two x86 multi-core platforms

➡ A laptop computer (Core-i7):

  • 2 cores @ 3.0 GHz,
  • 4 MB of L3 cache, [10~15] Watts,

➡ A high-end Xeon server:

  • 2 × CPUs, 12 cores each @ 2.5 GHz,
  • 30 MB of L3 cache, [10~240] Watts.

๏ LDPC decoder implementations

➡ LLVM & Clang++ version 4.0, ➡ « -Ofast -march=native -mtune=native », ➡ Thread library from the C++11 standard.

๏ Measurement setup

➡ The complete digital communication chain is executed (to avoid best-case evaluation). ➡ Data copies are included in the throughput and latency measurements.

SLIDE 21

Performance comparisons of multicore LDPC decoders

Performance of the proposed intra-frame decoders versus the best-performing x86 inter-frame decoders:

LDPC code              Inter-frame (F=32)    Intra-frame (F=1)     Improvement
                       T (Mbps)   L (µs)     T (Mbps)   L (µs)     T         L
802.11e  576 × 288        262        70         102       5.6      −61 %   12.5 ×
802.11e  2304 × 1152      245       300         202       8.6      −18 %   34.9 ×
802.11ad 672 × 336        247       155         153       7.8      −38 %   19.9 ×
802.11ad 672 × 252        221       173         167       7.2      −24 %   24.0 ×
802.11ad 672 × 168        230       166         154       7.7      −33 %   21.6 ×
802.11ad 672 × 126        238       161         155       7.7      −35 %   20.9 ×

INTEL Core-i7 i7-5650U (2 physical cores sharing 4 MB of L3 cache memory) @ about 3.0 GHz

SLIDE 22

Comparison of the memory footprint of the decoders

LDPC code              Inter-frame (F=32)    Intra-frame (F=1)     Improvement
                       Static    Runtime     Static    Runtime     Static  Runtime  Overall
802.11e  576 × 288      13745     76800       18971      3008      +38 %   −96 %    −76 %
802.11e  2304 × 1152    24913    307200       36776      9664      +48 %   −97 %    −86 %
802.11ad 672 × 336      16910     91392       19692      3616      +16 %   −96 %    −78 %
802.11ad 672 × 252      16540     88704       20217      4320      +22 %   −95 %    −77 %
802.11ad 672 × 168      19110     96768       23711      3936      +24 %   −96 %    −76 %
802.11ad 672 × 126      16597     81984       19370      4064      +17 %   −95 %    −76 %

The program (static) memory footprint is higher because the LDPC decoder is flattened; the inter-frame decoders have a huge memory footprint at runtime.

SLIDE 23

Scalability of the intra-frame LDPC decoder implementations

LDPC code              1 core               2 cores              Improvement
                       T (Mbps)   L (µs)    T (Mbps)   L (µs)    T        L
802.11e  576 × 288        102       5.6        221       5.2     2.2 ×    8 %
802.11e  2304 × 1152      202       8.6        385      11.9     1.9 ×   28 %
802.11ad 672 × 336        153       7.8        317       7.6     2.1 ×    3 %
802.11ad 672 × 252        167       7.2        294       8.2     1.8 ×   12 %
802.11ad 672 × 168        154       7.7        312       7.7     2.0 ×    0 %
802.11ad 672 × 126        155       7.7        321       7.5     2.1 ×    3 %

LDPC code              1 core      24 cores    Improvement
                       T (Mbps)    T (Mbps)    T         L
802.11e  576 × 288         83        1755      20.1 ×   1.0 ×
802.11e  2304 × 1152      255        5501      20.6 ×   1.0 ×
802.11ad 672 × 336        137        3189      22.3 ×   1.0 ×
802.11ad 672 × 252        129        3000      22.3 ×   1.0 ×
802.11ad 672 × 168        150        3010      19.1 ×   1.0 ×
802.11ad 672 × 126        137        3245      22.7 ×   1.0 ×

INTEL Core-i7 i7-5650U (2 physical cores sharing 4 MB of L3 cache memory) @ about 3.0 GHz
INTEL Xeon E5-2670 (2 × 12 physical cores sharing 30 MB of SmartCache memory) @ 2.50 GHz

With 2 processor cores, the throughput is multiplied by about 2.

SLIDE 24

Scalability of the intra-frame LDPC decoder implementations

On the 24-core Xeon server (measurements in the previous slide), using 24 processor cores increases the decoding throughput by about 20 ×, while the latency stays unchanged (1.0 ×).

SLIDE 25

Performance positioning versus GPU-based LDPC decoders

On a GPU device, the processing of a single LDPC frame is not (that) fast, even with a large number of processor cores: high throughput is obtained only for large workloads!

2304 × 1152 code, GPU implementations versus the proposed intra-frame decoder (202 Mbps, 8.6 µs):

Work     T (Mbps)   L (µs)    Improvement (T)   Improvement (L)
[1]          1       3600        316 ×              419 ×
[2, 3]      26      26500          8 ×            3 081 ×
[4]         40      14760          5 ×            1 716 ×
[5]         12        N/A         17 ×               N/A
[6]         31        414          7 ×               48 ×
[6]         55        472          4 ×               55 ×
[6]        100        670          2 ×               78 ×

[1] Memory access optimized implementation of cyclic and Quasi-Cyclic LDPC codes on a GPGPU
[2] A massively parallel implementation of QC-LDPC decoder on GPU
[3] GPU accelerated scalable parallel decoding of LDPC codes
[4] A scalable LDPC decoder on GPU
[5] Parallel LDPC decoder implementation on GPU based on unbalanced memory coalescing
[6] High Throughput Low Latency LDPC Decoding on GPU for SDR Systems

GPU implementations versus an INTEL Core-i7 i7-5650U (1 physical core) — 10 layered iterations

SLIDE 26

Conclusion & Future works

SLIDE 27

Conclusion on the power efficiency of software ECC decoders

๏ Currently

  • High throughput and low latency are possible with software decoder implementations,
  • Throughput in the range [100, 200] Mbps,
  • Latency lower than 10 µs,
  • Under industrial evaluation for 5G equipment.

๏ Improvement is still possible

  • Decoding 2, 4, or 8 frames in parallel should improve the decoding throughput with a limited latency increase.

๏ Evaluation of the decoders on other platforms

  • ARM for power efficiency (not so trivial),
  • Future INTEL Xeon processors (AVX-512).