SLIDE 1

An Efficient, Portable and Generic Library for Successive Cancellation Decoding of Polar Codes

Adrien Cassagne (1,2), Bertrand Le Gal (1), Camille Leroux (1), Olivier Aumage (2), Denis Barthou (2)

(1) IMS, Univ. Bordeaux, INP, France
(2) Inria / Labri, Univ. Bordeaux, INP, France

LCPC, September 2015

  • A. Cassagne, B. Le Gal, C. Leroux, O. Aumage, D. Barthou

IMS, Inria / Labri, Univ. Bordeaux, INP P-EDGE: Polar ECC Decoder Generation Environment 1 / 20

SLIDE 2

Context: Error Correction Codes (ECC)

  • Algorithms that enable reliable delivery of digital data
  • Redundancy for error correction
  • Usually implemented in hardware
  • Growing interest in software implementations:
      • End-user utilization (low-power processors)
      • Less expensive production than dedicated hardware chips
      • Algorithm validation (typically Monte-Carlo HPC simulations)
  • Software implementations require performance
  • Focus on the decoder (the most time-consuming part)

[Figure: The communication chain. Source → Channel Encoder (transmitter) → Communication Channel → Decoder (receiver) → Sink]

SLIDE 3

Polar Codes as a New Class of ECC

  • Explored for upcoming 5G mobile phones (Huawei [1])
  • Redundancy: bits added at fixed positions (their value is always 0)

[Figure: Example of a Polar Code (number of information bits K = 4, frame size N = 8). The K information bits and the frozen bits form the N-bit input of the encoding process.]

Rate R = K/N: ratio of information bits to frame size (here R = 4/8 = 1/2)
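As a minimal sketch of the redundancy scheme described on this slide, the C++ below places K information bits among N positions, leaves the frozen positions at 0, and applies the usual polar transform over GF(2) (without bit-reversal permutation). The function names and the frozen-bit layout used in the usage note are hypothetical: the slide's figure did not survive, so the exact positions are an assumption. The rate is computed as K/N, in line with the "Rate (R = K / N)" axis used later in the deck.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Build the N-bit encoder input: information bits go to the non-frozen
// positions, frozen positions keep the value 0.
// frozen[i] == true marks position i as frozen (hypothetical layout).
std::vector<int> insert_frozen_bits(const std::vector<int>& info,
                                    const std::vector<bool>& frozen)
{
    std::vector<int> u(frozen.size(), 0);
    std::size_t k = 0;
    for (std::size_t i = 0; i < frozen.size(); ++i) {
        if (!frozen[i]) {
            assert(k < info.size());
            u[i] = info[k++];
        }
    }
    assert(k == info.size());  // exactly K information positions
    return u;
}

// Polar transform over GF(2), computed with log2(N) butterfly stages.
// N is assumed to be a power of two.
std::vector<int> polar_encode(std::vector<int> u)
{
    const std::size_t N = u.size();
    for (std::size_t half = 1; half < N; half *= 2)
        for (std::size_t base = 0; base < N; base += 2 * half)
            for (std::size_t i = 0; i < half; ++i)
                u[base + i] ^= u[base + half + i];
    return u;
}

// Code rate: information bits over frame size.
double code_rate(std::size_t K, std::size_t N)
{
    return static_cast<double>(K) / static_cast<double>(N);
}
```

With the assumed layout `{frozen, frozen, frozen, info, frozen, info, info, info}` and K = 4, `insert_frozen_bits` yields an 8-bit vector whose frozen positions are 0, and `code_rate(4, 8)` gives 0.5.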

[1] http://www.huawei.com/minisite/has2015/img/5g_radio_whitepaper.pdf

SLIDE 4

The Successive Cancellation (SC) Algorithm

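  • Depth-first binary tree traversal/search algorithm
  • 3 key functions:

$\lambda_c = f(\lambda_a, \lambda_b) = \operatorname{sign}(\lambda_a \cdot \lambda_b) \cdot \min(|\lambda_a|, |\lambda_b|)$
$\lambda_c = g(\lambda_a, \lambda_b, s) = (1 - 2s)\,\lambda_a + \lambda_b$
$(s_c, s_d) = h(s_a, s_b) = (s_a \oplus s_b, s_b)$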
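The three key functions translate almost directly into scalar C++. The sketch below assumes LLRs stored as `float` and partial sums stored as `int` values in {0, 1}; these types are an illustrative choice, not something the slide specifies.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <utility>

// f: sign(la * lb) * min(|la|, |lb|)
float f(float la, float lb)
{
    const float magnitude = std::min(std::fabs(la), std::fabs(lb));
    return (std::signbit(la) != std::signbit(lb)) ? -magnitude : magnitude;
}

// g: (1 - 2s) * la + lb, with s in {0, 1}
float g(float la, float lb, int s)
{
    assert(s == 0 || s == 1);
    return (s ? -la : la) + lb;
}

// h: (sa xor sb, sb)
std::pair<int, int> h(int sa, int sb)
{
    return {sa ^ sb, sb};
}
```

For instance, `f(2.0f, -3.0f)` returns `-2.0f`, `g(2.0f, 3.0f, 1)` returns `1.0f`, and `h(1, 1)` returns the pair `(0, 1)`.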

[Figure: Per-node downward and upward computations (f, g and h) between depth d and depth d+1]

[Figure: Data layout representation, by tree depth]

SLIDE 5

Polar Decoding Tree

[Figure: Polar decoding tree with f(), g() and h() at each node. Caption: Position of the frozen and information bits in the tree]

  • Same specialized tree for each frame
  • Frames are independent
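The depth-first traversal of the decoding tree can be sketched as a plain recursive function, before any flattening or pruning. This is an illustrative reference version, not the generated P-EDGE code: the function name `sc_decode_node`, the use of `std::vector`, and the convention that a negative LLR decodes to bit 1 are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

static float f(float la, float lb)
{
    const float m = std::min(std::fabs(la), std::fabs(lb));
    return (std::signbit(la) != std::signbit(lb)) ? -m : m;
}

static float g(float la, float lb, int s) { return (s ? -la : la) + lb; }

// Decode the sub-tree covering leaves [first, first + llr.size()).
// Writes the decoded bits into u and returns the node's partial sums.
std::vector<int> sc_decode_node(const std::vector<float>& llr,
                                const std::vector<bool>& frozen,
                                std::size_t first,
                                std::vector<int>& u)
{
    const std::size_t n = llr.size();
    if (n == 1) {
        // Leaf: a frozen bit is known to be 0, otherwise take a hard decision.
        u[first] = frozen[first] ? 0 : (llr[0] < 0.0f ? 1 : 0);
        return {u[first]};
    }

    const std::size_t half = n / 2;
    std::vector<float> child(half);

    // Downward, left: f on the two halves of the node.
    for (std::size_t i = 0; i < half; ++i)
        child[i] = f(llr[i], llr[i + half]);
    const std::vector<int> s_left = sc_decode_node(child, frozen, first, u);

    // Downward, right: g uses the partial sums of the left child.
    for (std::size_t i = 0; i < half; ++i)
        child[i] = g(llr[i], llr[i + half], s_left[i]);
    const std::vector<int> s_right =
        sc_decode_node(child, frozen, first + half, u);

    // Upward: h combines the partial sums of both children.
    std::vector<int> s(n);
    for (std::size_t i = 0; i < half; ++i) {
        s[i] = s_left[i] ^ s_right[i];
        s[i + half] = s_right[i];
    }
    return s;
}
```

Because the frozen-bit positions are fixed for a given code, the sequence of calls made by this recursion is identical for every frame, which is what makes it possible to specialize the tree once and reuse it.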

SLIDE 6

A Wide Optimization Space

  • Simplification of the computations (tree pruning, rewriting rules)
  • Vectorization of the node functions (f, g, h)
  • Optimization of the decoder binary size
  • Implementation of low-level kernels for various instruction sets (SSE, AVX, NEON, etc.)

SLIDE 7

Example: Application of the Rewriting Rules

[Figure: Rewriting rules applied to an N = 8 and K = 4 frame. Node labels: Rate 0, Rate 1, Rep, SPC; functions: f(), gr(), xor(); pruned sub-trees are marked as cut branches.]

  • Rewriting rules are applied recursively
  • Repeated application of these rules leads to a simplified tree

SLIDE 8

Rewriting Rules and Tree Pruning

[Figure: Sub-tree rewriting rules and tree pruning for processing specialization.
 Node types: Rate 0, Rate 1, Rep (Repetition), SPC (Single Parity Check), any node, leaf.
 Rules shown: Standard case; Rate 0, left only; Rate 1, left only; Repetition, left only; Rate 0; Rate 1; Repetition; SPC; Level 2+ SPC (level >= 2); Rate 0, leaf children; Rate 1, leaf children; Rep., leaf children.
 Functions used: f(), g(), g0(), g1(), gr(), xor(), xor0(), rep(), spc(), h().]
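A rough sketch of how a sub-tree could be matched against the four special node types is given below. It relies on the definitions commonly used in the polar decoding literature (Rate 0: all bits frozen; Rate 1: no bit frozen; Rep: only the last bit is an information bit; SPC: only the first bit is frozen), which the slide names but does not spell out. The names `NodeType`, `classify` and `count_pruned_nodes`, as well as the order in which the patterns are tested, are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class NodeType { Rate0, Rate1, Rep, Spc, Standard };

// Classify the sub-tree covering leaves [first, first + n) from the
// frozen-bit pattern alone.
NodeType classify(const std::vector<bool>& frozen,
                  std::size_t first, std::size_t n)
{
    assert(n >= 1 && first + n <= frozen.size());
    std::size_t n_frozen = 0;
    for (std::size_t i = first; i < first + n; ++i)
        n_frozen += frozen[i] ? 1 : 0;

    if (n_frozen == n) return NodeType::Rate0;
    if (n_frozen == 0) return NodeType::Rate1;
    if (n_frozen == n - 1 && !frozen[first + n - 1]) return NodeType::Rep;
    if (n_frozen == 1 && frozen[first]) return NodeType::Spc;
    return NodeType::Standard;
}

// Number of nodes left once every specialized sub-tree is replaced by a
// single node (tree pruning), applied recursively from the root.
std::size_t count_pruned_nodes(const std::vector<bool>& frozen,
                               std::size_t first, std::size_t n)
{
    if (classify(frozen, first, n) != NodeType::Standard)
        return 1;  // the whole sub-tree is handled by one specialized node
    return 1 + count_pruned_nodes(frozen, first, n / 2)
             + count_pruned_nodes(frozen, first + n / 2, n / 2);
}
```

With an assumed N = 8, K = 4 pattern `{frozen, frozen, frozen, info, frozen, info, info, info}`, the left half is classified as Rep and the right half as SPC, so the 15-node full tree shrinks to 3 nodes.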

SLIDE 9

Vectorization

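  • Intra frame SIMD strategy: the SIMD lanes process n data of the same frame
  • Inter frame SIMD strategy: the SIMD lanes process the same datum of n different frames (frame 0, frame 1, frame 2, frame 3), stored interleaved

[Figure: f applied lane by lane over n data (intra frame) and over n frames (inter frame)]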
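The difference between the two strategies is mostly a matter of data layout, which the scalar sketch below tries to make explicit. No real SIMD intrinsics are used: the inner loops are simply the ones a vector unit would execute in parallel. The interleaved layout (element i of frame k stored at index `i * n_frames + k`) and all function names are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

static float f(float la, float lb)
{
    const float m = std::min(std::fabs(la), std::fabs(lb));
    return (std::signbit(la) != std::signbit(lb)) ? -m : m;
}

// Intra frame: the parallel dimension is the n elements of ONE frame.
std::vector<float> f_intra(const std::vector<float>& la,
                           const std::vector<float>& lb)
{
    assert(la.size() == lb.size());
    std::vector<float> lc(la.size());
    for (std::size_t i = 0; i < la.size(); ++i)
        lc[i] = f(la[i], lb[i]);
    return lc;
}

// Reorder frames[k][i] so that element i of frame k sits at i * n_frames + k.
std::vector<float> interleave(const std::vector<std::vector<float>>& frames)
{
    const std::size_t n_frames = frames.size();
    const std::size_t n = frames.at(0).size();
    std::vector<float> out(n * n_frames);
    for (std::size_t k = 0; k < n_frames; ++k)
        for (std::size_t i = 0; i < n; ++i)
            out[i * n_frames + k] = frames[k][i];
    return out;
}

// Inter frame: the parallel dimension is the n_frames copies of the SAME
// element, so the inner loop stays full even when n is very small.
std::vector<float> f_inter(const std::vector<float>& la,
                           const std::vector<float>& lb,
                           std::size_t n, std::size_t n_frames)
{
    assert(la.size() == n * n_frames && lb.size() == n * n_frames);
    std::vector<float> lc(n * n_frames);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n_frames; ++k)
            lc[i * n_frames + k] = f(la[i * n_frames + k], lb[i * n_frames + k]);
    return lc;
}
```

Both versions compute the same values; the inter frame variant trades the cost of interleaving (and the latency of waiting for several frames) for inner loops whose length does not shrink near the leaves of the tree.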

SLIDE 10

P-EDGE: a Dedicated Framework for Polar Codes

Features

1. Code generation
  • Flattening recursive calls
  • Rewriting rules
  • Generation of templated C++

2. C++ specialization
  • Loop unrolling
  • Data types
  • SIMD
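To illustrate what C++ specialization can bring, here is a minimal sketch of a node kernel whose data type and element count are template parameters. With the trip count known at compile time, the compiler is free to unroll or vectorize the loop for each instantiation. The names `f_scalar` and `f_kernel` are hypothetical and are not taken from the P-EDGE sources; a real fixed-point kernel would also saturate, which is left out here.

```cpp
#include <cassert>
#include <cstdint>

// Scalar f for any signed arithmetic type R (float, int8_t, ...).
// Note: no saturation is performed for the most negative integer value.
template <typename R>
inline R f_scalar(R la, R lb)
{
    const R abs_a = static_cast<R>(la < 0 ? -la : la);
    const R abs_b = static_cast<R>(lb < 0 ? -lb : lb);
    const R magnitude = abs_a < abs_b ? abs_a : abs_b;
    return static_cast<R>(((la < 0) != (lb < 0)) ? -magnitude : magnitude);
}

// Node kernel specialized on the data type and on the number of elements.
template <typename R, int N_ELMTS>
struct f_kernel {
    static void apply(const R* la, const R* lb, R* lc)
    {
        // N_ELMTS is a compile-time constant: the loop can be fully unrolled.
        for (int i = 0; i < N_ELMTS; ++i)
            lc[i] = f_scalar<R>(la[i], lb[i]);
    }
};
```

The same source then serves a 32-bit floating-point decoder (`f_kernel<float, 4>`) and an 8-bit fixed-point one (`f_kernel<std::int8_t, 16>`), which is one way to read the "Data types" item above.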

SLIDE 11

P-EDGE: Code Generation Example

[Figure: Simplified decoding tree for N = 8 and K = 4. Node labels: Rate 0, Rate 1, Rep, SPC; functions: f(), gr(), xor(); pruned sub-trees are marked as cut branches.]

Generated code for N = 8 and K = 4:

    void Generated_SC_decoder_N8_K4::decode()
    {
        //  ------------- template args -------------   - std args -
        //  -types-  --funcs--  ----offsets----  -size-   --buffs--
        f  <   R,    F, FI,     0, 4, 8,         4>::apply(l   );
        rep<B, R,    H, HI,     8, 0,            4>::apply(l, s);
        gr <B, R,    G, GI,     0, 4, 0, 8,      4>::apply(l, s);
        spc<B, R,    H, HI,     8, 4,            4>::apply(l, s);
        xo <B,       X, XI,     0, 4, 0,         4>::apply(   s);
    }

SLIDE 12

Reducing the L1I Cache Occupancy

  • Flattening may generate large binaries
  • Binary size grows with the frame size
  • Performance slows down when the binary exceeds the L1I cache
  • Moving offsets from template arguments to function arguments
      • Helps the compiler factorize many function calls
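The effect of moving offsets out of the template arguments can be sketched with two variants of the same kernel. In the first, every distinct offset tuple creates a new instantiation, hence new machine code; in the second, one instantiation per element count is shared by all call sites. The names `f_static` and `f_dynamic`, and the reading of the three offsets as two inputs followed by one output inside a single buffer, are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

static float f_scalar(float la, float lb)
{
    const float m = std::min(std::fabs(la), std::fabs(lb));
    return (std::signbit(la) != std::signbit(lb)) ? -m : m;
}

// Variant 1: offsets are template arguments.
// One instantiation (and one copy of the code) per offset tuple.
template <std::size_t OFF_A, std::size_t OFF_B, std::size_t OFF_C,
          std::size_t N_ELMTS>
struct f_static {
    static void apply(std::vector<float>& l)
    {
        for (std::size_t i = 0; i < N_ELMTS; ++i)
            l[OFF_C + i] = f_scalar(l[OFF_A + i], l[OFF_B + i]);
    }
};

// Variant 2: offsets are function arguments.
// One instantiation per N_ELMTS, reused for every position in the tree.
template <std::size_t N_ELMTS>
struct f_dynamic {
    static void apply(std::vector<float>& l, std::size_t off_a,
                      std::size_t off_b, std::size_t off_c)
    {
        for (std::size_t i = 0; i < N_ELMTS; ++i)
            l[off_c + i] = f_scalar(l[off_a + i], l[off_b + i]);
    }
};
```

Both variants compute the same result; the second gives up compile-time constant addresses in exchange for far fewer distinct functions in the binary, which matters once the flattened decoder no longer fits in the L1I cache.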

SLIDE 13

Sub-tree Folding Technique

[Figure: Full decoding tree representation (N = 128, K = 64). Nodes N0 to N42, each annotated with its functions (f(), g(), g0(), gr(), xo(), xo0(), rep(), spc(), h()). Legend: Default, Rate 0 left, Rate 0, Rate 1, Rep left, Rep, SPC 2.]

SLIDE 14

Sub-tree Folding Technique

Enabling compression

[Figure: Folded decoding tree (N = 128, K = 64), in which identical sub-trees are shared. Legend: Default, Rate 0 left, Rate 0, Rate 1, Rep left, Rep, SPC 2.]

  • A single occurrence of a given sub-tree traversal is generated and reused wherever needed
  • Compression ratio on the example shown: 1.48
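One simple way to picture sub-tree folding is to give every sub-tree a key and to generate code only for keys not seen before. The sketch below uses the frozen-bit pattern of the sub-tree as that key, on the assumption that two sub-trees with the same size and the same pattern need the same traversal once offsets are passed as function arguments. It works on the full, unpruned tree, so its ratios are not comparable with the 1.48 reported on the slide; `collect` and `folding_ratio` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Visit the sub-tree covering leaves [first, first + n) of the pattern
// ('1' = frozen, '0' = information), counting nodes and recording keys.
void collect(const std::string& frozen, std::size_t first, std::size_t n,
             std::set<std::string>& seen, std::size_t& total)
{
    ++total;
    seen.insert(frozen.substr(first, n));  // key: pattern of this sub-tree
    if (n > 1) {
        collect(frozen, first, n / 2, seen, total);
        collect(frozen, first + n / 2, n / 2, seen, total);
    }
}

// Ratio between the number of nodes in the tree and the number of
// distinct sub-trees that would actually have to be generated.
double folding_ratio(const std::string& frozen)
{
    assert(!frozen.empty());
    std::set<std::string> seen;
    std::size_t total = 0;
    collect(frozen, 0, frozen.size(), seen, total);
    return static_cast<double>(total) / static_cast<double>(seen.size());
}
```

For the pattern `"11101000"`, the full tree has 15 nodes but only 8 distinct sub-tree patterns, so `folding_ratio` returns 1.875.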

SLIDE 15

Binary Sizes Comparison

P-EDGE generated decoder binary sizes depending on the frame size (R = 1/2)

[Figure: two panels, "Without compression" and "With compression" (sub-tree folding enabled). x-axis: codeword size (n = log2(N)), from 6 to 16; y-axis: decoder binary size (KB). Curves: 32-bit Inter-SIMD, 32-bit Intra-SIMD, 8-bit Intra-SIMD; reference line: L1I size.]

  • De-templatization: 10-fold binary size reduction
  • Tree folding: 5-fold binary size reduction (for N = 2^16)

SLIDE 16

Performance results

            P-Edge Intel-based     P-Edge ARM-based       McGill impl. [2]
CPU         Intel Xeon E31225      ARM Cortex-A15         Intel Core i7-2600
            3.10 GHz               MPCore 2.32 GHz        3.40 GHz
Cache       L1I/L1D 32 KB          L1I/L1D 32 KB          L1I/L1D 32 KB
            L2 256 KB              L2 1024 KB             L2 256 KB
            L3 6 MB                No L3                  L3 8 MB
Compiler    GNU g++ 4.8            GNU g++ 4.8            GNU g++ 4.8

Performance evaluation platforms.

Compiler flags: -std=c++11 -Ofast -funroll-loops

[2] G. Sarkis, P. Giard, C. Thibeault, and W.J. Gross. Autogenerating software polar decoders. In Signal and Information Processing (GlobalSIP), 2014 IEEE Global Conference on, pages 6–10, Dec 2014.

SLIDE 17

Intra-SIMD Comparison with the State of the Art

Intra frame vectorization (32-bit, float)

[Figure: two panels, coded throughput (Mbit/s) versus frame size (n = log2(N), from 6 to 16). Left: Intel Xeon CPU E31225 (AVX1 SIMD), curves "P-Edge, rate 5/6" and "McGill, rate 5/6". Right: Nvidia Jetson TK1 CPU A15 (NEON SIMD), curve "P-Edge, rate 5/6".]

  • Cross marks: Sarkis et al.
  • P-Edge up to 25% better

SLIDE 18

Inter-SIMD Comparison with the State of the Art

Inter frame vectorization (8-bit, char)

[Figure: two panels, coded throughput (Mbit/s) versus frame size (n = log2(N), from 6 to 16). Left: Intel Xeon CPU E31225 (SSE4.1 SIMD), curves "P-Edge, rate 5/6" and "Handw., rate 5/6". Right: Nvidia Jetson TK1 CPU A15 (NEON SIMD).]

  • Triangle marks: former "handwritten" implementation [3]
  • P-Edge up to 25% better

[3] B. Le Gal, C. Leroux, and C. Jego. Multi-Gb/s software decoding of polar codes. IEEE Transactions on Signal Processing, 63(2):349–359, Jan 2015.

SLIDE 19

Exploring Optimization Impacts

Impact of the different optimizations on the throughput

[Figure: Throughput depending on the different optimizations for N = 2048. Two panels, coded throughput (Mbit/s) versus rate (R = K / N). Left: 32-bit intra frame SIMD (AVX1), annotated values 412, 304, 380, 511, 521, 492. Right: 8-bit inter frame SIMD (SSE4.1), annotated values 1861, 1377, 1540, 1808, 1804, 1770. Legend: spc2, spc4, spc6, rep, cut1, cut0, ref.]

SLIDE 20

Conclusion and Future Works

Conclusion:
  • P-EDGE: a Polar ECC decoder exploration framework
  • Clear separation of concerns
      • Rewriting rules engine
      • Low-level building blocks (f, g, h, rep, spc, ...)
  • Large optimization exploration
  • Outperforms state-of-the-art decoders
  • Performance portability

Future works:
  • In-depth performance analysis
      • Performance model (Roofline, Execution-Cache-Memory (ECM))
  • Reduce memory footprint
  • Explore other Polar code decoder variants
