Low-latency software LDPC decoders for x86 multi-core devices - PowerPoint PPT Presentation

Low-latency software LDPC decoders for x86 multi-core devices Bertrand LE GAL and Christophe JEGO IMS laboratory, CNRS UMR 5218 Digital Circuits and Systems team Bordeaux-INP, University of Bordeaux   France IEEE International Workshop on Signal Processing Systems (SIPS)   October 3 rd , 2017   Lorient, France firstname.lastname@ims-bordeaux.fr

∏ / ∏ Historically, software decoders were limited to… Benchmarking of decoding algorithms Validate and compare error correction or code construction techniques code families C(1) C(2) C(m) V(1) V(2) V(3) V(n-1) V(n) Parameter optimization Estimation of hardware decoder performances before development P.Qc P.Qi 2P.Qi 2P.Qi Soft datapath Channel k information SRAM 1 P.Qi LLR Ti RAM LLR IO status bits P P Channel PEs PEs System interface SRAM 2 Channel RAM LLR buf. Channel P P SRAM 3 control Unrolled ROM PEs Global signals MU MU MU MU ALU Frozen Channel memory (LLR Ti) (LLR Ti) (LLR Ti) (LLR Ti) SRAM 4 banks xor NISC 4.P control xor SIMD controller signals -1 matrix Hard RAM S3 RAM S1 datapath P RAM S4 RAM S2 PU PU PU control Processing units with signals their own Reg. Reg. Reg. local registers file file file 2 B. Le Gal IEEE International Workshop on Signal Processing Systems (SIPS) October 3, 2017

Currently they can fulfill others realtime performance requirements Software decoders are at least as Provide design and fast as many hardware circuits runtime flexibilities Throughputs are higher than 1 Gbps on multi-core or many-core devices. Currently, compatible with some industrial Processing latencies from hundreds use cases. of us or ms are too high. Consecutive frame configurations can be different (N, rate) discarding inter-frame parallelism exploitation [1]. [1] OpenAirInterface 5G software alliance for democratising wireless innovation 3 B. Le Gal IEEE International Workshop on Signal Processing Systems (SIPS) October 3, 2017

The processing performance of GPU & CPU devices NVIDIA Tegra K1 GPU INTEL Core-i7 processor Multicore device (e.g. INTEL Core-i7 ) GPU devices (e.g NVIDIA Titan GPU ) One chip composed hierarchically of physical processor One chip composed hierarchically of stream processors cores ( 4 ) and SIMD unit ( 1 ). ( 14 ) and cores ( 2688 ). Each stream processor controls a set of cores ( 192 ). During 1 clock cycle, a SIMD instr. can perform 32 computations on 8-bits fixed point data => 32 8b-oper. During 1 clock cycle 2688 floating point operations can be executed. During 1 clock cycle, a physical processor (superscalar) However, more computations are required to hide can perform up to 6 SIMD instr => 192 8b-oper. processing and memory access latencies. During 1 clock cycle, a Core-i7 processor can execute 4 cores x 6 SIMD instr => 768 8b-oper. With 1 to 3 GHz clock frequency, it delivers (theoretically) a high processing performance. 4 B. Le Gal IEEE International Workshop on Signal Processing Systems (SIPS) October 3, 2017

The structure of standardized LDPC code WIMAX 576 × 288 LDPC code, Z = 24 ๏ Standardized H matrix have a Quasi-Cyclic structure, ➡ Compressed matrix definition, ➡ Z expansion factor, ➡ Shifting coefficients, ๏ This QC structure of H matrix Z × Z shifted   ID matrix ➡ Reduces the H memory footprint, Reconstructed H matrix ➡ Limits the data dependency during the decoding making parallel computing easy, ๏ From an hardware point of view, Z factor « enforce », ➡ Z processing units, ➡ Z memory banks, ➡ One or two Z × Z data interleavers. 5 B. Le Gal IEEE International Workshop on Signal Processing Systems (SIPS) October 3, 2017

The standardized LDPC codes structure k information LLR Ti ๏ Standardized H matrix have a IO status bits System interface Quasi-Cyclic structure, ➡ Compressed matrix definition, control signals MU MU MU MU (LLR Ti) (LLR Ti) (LLR Ti) (LLR Ti) ➡ Z expansion factor, ➡ Shifting coefficients, FSM control controller signals -1 ∏ / ∏ ๏ This QC structure of H matrix PU PU PU PU ➡ Reduces the H memory footprint, control signals Reg. Reg. Reg. Reg. ➡ Limits the data dependency during the file file file file decoding making parallel computing easy, Z elements ๏ From a hardware point of view, Z factor « enforce » the design, Hardware design of a Z decoder ➡ Z processing units, structure is possible even for   ➡ Z memory banks, Z = {7, 13, 420} ➡ One or two Z × Z data interleavers. 6 B. Le Gal IEEE International Workshop on Signal Processing Systems (SIPS) October 3, 2017

Parallelization of the LDPC decoding process (1/3) V 0 V 1 V 2 V 3 V 4 V 5 V 6 V 7 Parallelization of CN kernels (intra-frame) - Parallelization is limited due to CN degrees, - Horizontal SIMD processing (bad efficiency), - Necessitate unaligned memory accesses to VNs. C 0 C 1 C 2 C 3 V 0 V 1 V 2 V 3 V 4 V 5 V 6 V 7 Parallelization across CN kernels - Like in hardware architectures (Q CN of same deg.), - Unaligned memory accesses to VNs, - Need matrix reordering (not always possible: unstructured). C 0 C 1 C 2 C 3 V 0 V 1 V 2 V 3 V 4 V 5 V 6 V 7 V 0 V 1 V 2 V 3 V 4 V 5 V 6 V 7 Parallelization across frames V 0 V 1 V 2 V 3 V 4 V 5 V 6 V 7 V 0 V 1 V 2 V 3 V 4 V 5 V 6 V 7 - Very regular computation processing (inc. memory), - Not evaluated in hardware architectures (high latency), - Necessitate reordering at the beginning of the C 0 C 1 C 2 C 3 C 0 C 1 C 2 C 3 decoding. C 0 C 1 C 2 C 3 C 0 C 1 C 2 C 3 7 B. Le Gal IEEE International Workshop on Signal Processing Systems (SIPS) October 3, 2017

Parallelization of the LDPC decoding process (2/3) V 0 V 1 V 2 V 3 V 4 V 5 V 6 V 7 Parallelization of CN kernels - Parallelization is limited due to CN degrees, - Horizontal SIMD processing (bad efficiency), - Necessitate unaligned memory accesses to VNs. C 0 C 1 C 2 C 3 V 0 V 1 V 2 V 3 V 4 V 5 V 6 V 7 Parallelization across CN kernels - Like in hardware architectures (Q CN of same deg.), - Unaligned memory accesses to VNs, - Need matrix reordering (not always possible: unstructured). C 0 C 1 C 2 C 3 V 0 V 1 V 2 V 3 V 4 V 5 V 6 V 7 V 0 V 1 V 2 V 3 V 4 V 5 V 6 V 7 V 0 V 1 V 2 V 3 V 4 V 5 V 6 V 7 V 0 V 1 V 2 V 3 V 4 V 5 V 6 V 7 SIMD Parallelization across frames [1] (inter-frame) - Regular computation processing, - High-memory footprint at runtime (buffering), - Decoding processing latency is high (100 us). C 0 C 1 C 2 C 3 C 0 C 1 C 2 C 3 C 0 C 1 C 2 C 3 C 0 C 1 C 2 C 3 [1] High-throughput multi-core LDPC decoders based on x86 processor, B. Le Gal and C. Jego, IEEE TPDS, 2016 8 B. Le Gal IEEE International Workshop on Signal Processing Systems (SIPS) October 3, 2017

Parallelization of the LDPC decoding process (3/3) V 0 V 1 V 2 V 3 V 4 V 5 V 6 V 7 Parallelization of CN kernels - Parallelization is limited due to CN degrees, - Horizontal SIMD processing (bad efficiency), - Necessitate unaligned memory accesses to VNs. C 0 C 1 C 2 C 3 Parallelization across CN kernels (intra-frame) V 0 V 1 V 2 V 3 V 4 V 5 V 6 V 7 - Low latency (like in hardware architectures) - Should be quite efficient (when Z > SIMD width), - Irregular accesses to VNs => performance penalties, - Limited to QC LDPC codes. C 0 C 1 C 2 C 3 Z CNs V 0 V 1 V 2 V 3 V 4 V 5 V 6 V 7 V 0 V 1 V 2 V 3 V 4 V 5 V 6 V 7 Parallelization across frames V 0 V 1 V 2 V 3 V 4 V 5 V 6 V 7 V 0 V 1 V 2 V 3 V 4 V 5 V 6 V 7 - Very regular computation processing (inc. memory), - Not evaluated in hardware architectures (high latency), - Necessitate reordering at the beginning of the C 0 C 1 C 2 C 3 C 0 C 1 C 2 C 3 decoding. C 0 C 1 C 2 C 3 C 0 C 1 C 2 C 3 9 B. Le Gal IEEE International Workshop on Signal Processing Systems (SIPS) October 3, 2017

Low-latency software LDPC decoders for x86 multi-core devices - PowerPoint PPT Presentation

Low-latency software LDPC decoders for x86 multi-core devices Bertrand LE GAL and Christophe JEGO IMS laboratory, CNRS UMR 5218 Digital Circuits and Systems team Bordeaux-INP, University of Bordeaux France IEEE International Workshop on

Design of Energy-Efficient LDPC Codes and Decoders Elsa Dupraz 16/04/2019 Section 1:

Combinational Logic Building Blocks Chapter 6 Combinational Logic Introduction Decoders Basic

Anytime Reliability of Systematic LDPC Motivation Convolutional Codes LDPC Convolutional Codes

Asynchronous I/O Stack: A Low-latency Kernel I/O Stack for Ultra-Low Latency SSDs Jinkyu Jeong

Welcome Welcome Core: Core A Regional Destination Core: Core UL Core: Core Downtown

Learned Scheduling of LDPC Decoders Based on Multi-armed Bandits Salman Habib, Allison Beemer,

x86-32 and x86-64 Assembly (Part 2) (I know Kung-Fu !) Emmanuel Fleury

x86 Introduction Philipp Koehn 25 October 2019 Philipp Koehn Computer Systems Fundamentals: x86

An Efficient GPU-based An Efficient GPU-based LDPC Decoder for Long LDPC Decoder for Long

A Reaction Attack on the QC-LDPC McEliece Cryptosystem Tomas Fabsic 1 , Viliam Hromada 1 , Paul

Design and Analysis of LDPC for MIMO-OFDM Guosen Yue NEC Labs Research Princeton, NJ Joint work

- tunnel-effect ( "micro-convergence" ) for SC-LDPC [ 1 ] [ 1 ] Schmalen, ten Brink,

Multicore Implementation of LDPC Decoders based on ADMM Algorithm Imen DEBBABI 1 , Nadia KHOUJA 1

Multicore Implementation of LDPC Decoders based on ADMM Algorithm Imen DEBBABI 1 , Nadia KHOUJA 1

COLOR CODE DECODERS FROM TORIC CODE DECODERS Aleksander Kubica work w/ N. Delfosse

Unit 7 Fundamental Digital Building Blocks: Decoders & Multiplexers 7.2 CHECKERS / DECODERS

Neural Machine Translation Gongbo Tang 8 October 2018 Outline Neural Machine Translation 1

Sequential circuits If the same input may produce different output signal, we have a sequential

Lecture no: 7 Overview Block codes Convolution codes Fading channel and

Adaptive Multi-pass Decoder for Neural Machine Translation EMNLP 2018

Linear-Time Erasure List-Decoding of Expander Codes Noga Ron-Zewi (University of Haifa) Mary

Figerprinting digital documents survey Gbor Tardos Rnyi Institute & Central European

Lecture 2 Point-to-Point Communications 1 I-Hsiang Wang ihwang@ntu.edu.tw

Dynamical systems Expanding maps on the circle. Coding Jana Rodriguez Hertz ICTP 2018 coding

Low-latency software LDPC decoders for x86 multi-core devices - PowerPoint PPT Presentation

Low-latency software LDPC decoders for x86 multi-core devices Bertrand LE GAL and Christophe JEGO IMS laboratory, CNRS UMR 5218 Digital Circuits and Systems team Bordeaux-INP, University of Bordeaux France IEEE International Workshop on

Design of Energy-Efficient LDPC Codes and Decoders Elsa Dupraz 16/04/2019 Section 1:

Combinational Logic Building Blocks Chapter 6 Combinational Logic Introduction Decoders Basic

Anytime Reliability of Systematic LDPC Motivation Convolutional Codes LDPC Convolutional Codes

Asynchronous I/O Stack: A Low-latency Kernel I/O Stack for Ultra-Low Latency SSDs Jinkyu Jeong

Welcome Welcome Core: Core A Regional Destination Core: Core UL Core: Core Downtown

Learned Scheduling of LDPC Decoders Based on Multi-armed Bandits Salman Habib, Allison Beemer,

x86-32 and x86-64 Assembly (Part 2) (I know Kung-Fu !) Emmanuel Fleury

x86 Introduction Philipp Koehn 25 October 2019 Philipp Koehn Computer Systems Fundamentals: x86

An Efficient GPU-based An Efficient GPU-based LDPC Decoder for Long LDPC Decoder for Long

A Reaction Attack on the QC-LDPC McEliece Cryptosystem Tomas Fabsic 1 , Viliam Hromada 1 , Paul

Design and Analysis of LDPC for MIMO-OFDM Guosen Yue NEC Labs Research Princeton, NJ Joint work

- tunnel-effect ( &quot;micro-convergence&quot; ) for SC-LDPC [ 1 ] [ 1 ] Schmalen, ten Brink,

Multicore Implementation of LDPC Decoders based on ADMM Algorithm Imen DEBBABI 1 , Nadia KHOUJA 1

Multicore Implementation of LDPC Decoders based on ADMM Algorithm Imen DEBBABI 1 , Nadia KHOUJA 1

COLOR CODE DECODERS FROM TORIC CODE DECODERS Aleksander Kubica work w/ N. Delfosse

Unit 7 Fundamental Digital Building Blocks: Decoders &amp; Multiplexers 7.2 CHECKERS / DECODERS

Neural Machine Translation Gongbo Tang 8 October 2018 Outline Neural Machine Translation 1

Sequential circuits If the same input may produce different output signal, we have a sequential

Lecture no: 7 Overview Block codes Convolution codes Fading channel and

Adaptive Multi-pass Decoder for Neural Machine Translation EMNLP 2018

Linear-Time Erasure List-Decoding of Expander Codes Noga Ron-Zewi (University of Haifa) Mary

Figerprinting digital documents survey Gbor Tardos Rnyi Institute &amp; Central European

Lecture 2 Point-to-Point Communications 1 I-Hsiang Wang ihwang@ntu.edu.tw

Dynamical systems Expanding maps on the circle. Coding Jana Rodriguez Hertz ICTP 2018 coding

- tunnel-effect ( "micro-convergence" ) for SC-LDPC [ 1 ] [ 1 ] Schmalen, ten Brink,

Unit 7 Fundamental Digital Building Blocks: Decoders & Multiplexers 7.2 CHECKERS / DECODERS

Figerprinting digital documents survey Gbor Tardos Rnyi Institute & Central European