New Circuit Minimization Techniques for Smaller and Faster AES SBoxes

SLIDE 1

Ericsson Internal | 2018-02-21

New Circuit Minimization Techniques for Smaller and Faster AES SBoxes

Alexander Maximov and Patrik Ekdahl Ericsson Research

Patrik Ekdahl Ericsson Research 2019-08-26

SLIDE 2

Preliminaries

AES Round Function

[Figure: AES round function. Blocks: Plaintext, Roundkey 1, Mux, Registers, SubBytes, ShiftRows, MixColumns, Mux, Roundkey n, Ciphertext; the data paths are 128 bits wide.]

  • SubBytes is the only non-linear part
  • 16 8x8 SBoxes needed for a full implementation
  • Forward only or combined SBox
  • In ASICs:
      • Look-up table
      • Gate implementation

What to remember:

  • New improved methods for circuit minimization.
  • New SBox architecture which improves the critical path.

SLIDE 3

Preliminaries

Basic flow of AES SBox

[Figure: Input U → Inversion in GF(2^8) → Affine transformation (linear part xM, constant +b) → Output R.]

Direct implementation of inversion over the Rijndael field is very complex.
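The flow above (inversion, then affine transformation) can be sketched in a few lines of Python. This is a minimal software reference, not the gate-level circuit discussed in the talk; the helper names `gf_mul`, `gf_inv`, `rotl8` and `sbox` are chosen here for illustration, and the inversion is done naively as a 254th power.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo the Rijndael polynomial x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Inversion as a^254 by square-and-multiply; 0 maps to 0 (AES convention)."""
    result, base, exp = 1, a, 254
    while exp:
        if exp & 1:
            result = gf_mul(result, base)
        base = gf_mul(base, base)
        exp >>= 1
    return result


def rotl8(x: int, k: int) -> int:
    """Rotate an 8-bit value left by k positions."""
    return ((x << k) | (x >> (8 - k))) & 0xFF


def sbox(u: int) -> int:
    """Forward AES SBox: inversion followed by the affine transformation."""
    x = gf_inv(u)
    linear = x ^ rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 3) ^ rotl8(x, 4)  # x * M
    return linear ^ 0x63                                                # + b
```

For example, `sbox(0x53)` evaluates to `0xED`, the value found in the standard AES SBox table.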

SLIDE 4

Previous work (low area)

Rijmen [Rij00] proposed (based on Itoh and Tsujii [IT88]) to use a composite field and do the inversion in GF(2^4) instead.

[Figure: Input U (8 bits) → base conversion matrix X → composite-field inversion on 4-bit signals, with squaring and the inner inversion over GF(2^4) → base back-conversion matrix X^-1 → affine transformation (xM, +b) → Output R (8 bits).]

  • Satoh et al [SMT01] reduced inversion to GF(2^2).
  • Canright [Can05] investigated the importance of subfield representation.

SLIDE 5

Previous work (low depth)

[Figure: Input U (8 bits) → base conversion matrix X → ( )^17 → 4-bit inversion ( )^-1 → base back-conversion matrix X^-1 → affine transformation (xM, +b) → Output R (8 bits).]

Boyar, Peralta et al ([BP10a, BP10b, BP12, BFP18]) used a normal base $A = a_0 Y + a_1 Y^{16}$ and $A^{-1} = (A \cdot A^{16})^{-1} \cdot A^{16}$ (also based on Itoh and Tsujii [IT88]) to derive another implementation.

Several papers followed:

  • Nogami et al [NNT+10], looking at mixed bases.
  • Ueno et al [UHS+15], looking at redundant bases.
  • Reyhani et al [RMTA18a,b], improving the Boyar-Peralta (BP) search algorithm.
  • Li et al [LSL+19], incorporating depth into the BP algorithm.
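The identity $A^{-1} = (A \cdot A^{16})^{-1} \cdot A^{16}$ can be checked numerically. The sketch below assumes the standard polynomial representation of the Rijndael field rather than the normal base used in the cited papers, so it only illustrates the algebra: $A^{17}$ always lands in the subfield GF(2^4), where inversion is a 14th power. The function names are illustrative.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_pow(a: int, e: int) -> int:
    """Raise a to the power e by square-and-multiply."""
    result = 1
    while e:
        if e & 1:
            result = gf_mul(result, a)
        a = gf_mul(a, a)
        e >>= 1
    return result


def norm(a: int) -> int:
    """A * A^16 = A^17, an element of the subfield GF(2^4)."""
    return gf_mul(a, gf_pow(a, 16))


def inv_via_subfield(a: int) -> int:
    """A^-1 = (A * A^16)^-1 * A^16, with the inner inversion done in GF(2^4)."""
    a16 = gf_pow(a, 16)
    n = gf_mul(a, a16)
    n_inv = gf_pow(n, 14)  # n^15 = 1 for nonzero n in GF(2^4), so n^-1 = n^14
    return gf_mul(n_inv, a16)
```

Only the 4-bit quantity `n` is actually inverted; everything else is a multiplication or a linear map (raising to the 16th power), which is what makes this structure attractive for low depth.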

SLIDE 6

Previous work (low depth)

Collect all linear terms and push into two matrices.

SLIDE 7

Architectural starting point [BP12]

[Figure: Input U (8 bits) → Top linear → 22-bit Q → Mul-Sum → 4-bit X → Inverse GF(2^4) → 4-bit Y → 2 x Mul → 18-bit N → Bottom linear → Output R (8 bits).]

  • Top linear: base conversion and generation of linear parts of multiplications.
  • Bottom linear: base back-conversion and the affine transformation of the AES SBox.

Basic problem statement:

Given a binary matrix $M_{m \times n}$ and the maximum allowed depth maxD, find the circuit of depth D ≤ maxD with the minimum number of 2-input XOR gates such that it computes $Y = M \cdot X$.

Example:

$$
\begin{aligned}
y_0 &= x_0 + x_2 + x_3 + x_4 \\
y_1 &= x_1 + x_2 + x_4 \\
y_2 &= x_0 + x_1 + x_3 + x_4
\end{aligned}
\qquad
M = \begin{pmatrix}
1 & 0 & 1 & 1 & 1 \\
0 & 1 & 1 & 0 & 1 \\
1 & 1 & 0 & 1 & 1
\end{pmatrix}
$$

Additional Input Requirement (AIR)

  • Input signals may arrive with different delays $d_i$.

Additional Output Requirement (AOR)

  • Output signals may need to be ready earlier, $e_i \le \mathrm{maxD}$.
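To make the problem statement concrete, the sketch below simulates one hand-written XOR circuit for the 3x5 example and checks it against input delays (AIR) and per-output deadlines (AOR). The circuit, the names `simulate` and `meets_requirements`, and the delay values are illustrative assumptions; the circuit is not claimed to be minimal. Linear combinations are stored as bitmasks, with bit i standing for $x_i$.

```python
# Rows of M as bitmasks: bit i set means x_i is part of the sum.
TARGETS = {"y0": 0b11101, "y1": 0b10110, "y2": 0b11011}

# A hypothetical 6-XOR circuit, as (output, input_a, input_b) triples.
CIRCUIT = [
    ("a", "x2", "x4"),
    ("y1", "a", "x1"),
    ("b", "x0", "x3"),
    ("y0", "a", "b"),
    ("c", "y0", "y1"),  # x2 and x4 cancel here: c = x0 + x1 + x3
    ("y2", "c", "x4"),
]


def simulate(gates, n_inputs, input_delay=None):
    """Return (value, depth) dictionaries for every signal in the circuit."""
    input_delay = input_delay or [0] * n_inputs
    value = {f"x{i}": 1 << i for i in range(n_inputs)}
    depth = {f"x{i}": input_delay[i] for i in range(n_inputs)}
    for out, a, b in gates:
        value[out] = value[a] ^ value[b]
        depth[out] = max(depth[a], depth[b]) + 1
    return value, depth


def meets_requirements(gates, n_inputs, targets, deadline, input_delay=None):
    """True if every target is computed correctly and no later than its deadline."""
    value, depth = simulate(gates, n_inputs, input_delay)
    return all(
        value[name] == row and depth[name] <= deadline[name]
        for name, row in targets.items()
    )


# Gate count of the naive circuit that sums each row independently.
naive_xors = sum(bin(row).count("1") - 1 for row in TARGETS.values())
```

Here the naive circuit needs 8 XOR gates while the shared circuit needs 6, at the price of depth 4 on `y2`; tightening the deadline of `y2` to 3 makes `meets_requirements` reject it, which is the area versus depth trade-off the problem statement describes.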

SLIDE 8

Our contributions

  • New techniques for minimizing the Top and Bottom matrices (area with delay constraints).
  • Introduced a probabilistic heuristic approach to the cancellation-free algorithm by Paar [Paa97].
  • New cancellation-allowed exhaustive search algorithm, based on the BP-algorithm [BP10a].
  • Floating Multiplexers for the combined SBox.
  • New generalization of the BP-algorithm, allowing other types of gates.
  • New metrics, with lots of speed up tricks for the distance function.
  • Stack algorithm with a search tree.
  • New architecture that removes the Bottom matrix and reduces the overall depth.
  • New circuit for the inverse operation.
  • Additional Transformation Matrices.

SLIDE 9

Combined SBox with multiplexers

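[Figure: Input U feeds the Top Forward and Top Inverse blocks; a Mux selects between them; the result passes through the Common part into the Bottom Forward and Bottom Inverse blocks; a second Mux selects Output R.]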

SLIDE 10

Combined SBox with multiplexers

Example:

[Figure: Input X feeds Y^F and Y^I; a Mux selects the output Y.]

$$Y^F = X_0 + X_1, \qquad Y^I = X_0 + X_2$$

$$Y = \mathrm{MUX}(\mathit{select},\; X_0 + X_1,\; X_0 + X_2)$$

Replace with:

$$Y = \mathrm{MUX}(\mathit{select},\; X_1,\; X_2) + X_0$$

Generally:

$$Y = A + \mathrm{MUX}(\mathit{select}, B, C) \;\rightarrow\; Y = A + \Delta + \mathrm{MUX}(\mathit{select}, B + \Delta, C + \Delta)$$
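The floating-multiplexer rewrite is easy to confirm by brute force over all bit values. In this small check, `mux` is a hypothetical helper; which input the select line picks when it is 1 is an assumption that does not affect the identity.

```python
from itertools import product


def mux(select: int, b: int, c: int) -> int:
    """2-to-1 multiplexer: returns b when select is 1, otherwise c."""
    return b if select else c


def general_identity_holds() -> bool:
    """A + MUX(s, B, C) == A + D + MUX(s, B + D, C + D) for all bit values."""
    return all(
        a ^ mux(s, b, c) == a ^ d ^ mux(s, b ^ d, c ^ d)
        for s, a, b, c, d in product((0, 1), repeat=5)
    )


def slide_example_holds() -> bool:
    """MUX(s, X0 + X1, X0 + X2) == MUX(s, X1, X2) + X0 for all bit values."""
    return all(
        mux(s, x0 ^ x1, x0 ^ x2) == mux(s, x1, x2) ^ x0
        for s, x0, x1, x2 in product((0, 1), repeat=4)
    )
```

In the example, the rewritten form uses one XOR and one MUX instead of two XORs and one MUX, because the shared term X0 is moved past the multiplexer.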

SLIDE 11

Boyar-Peralta algorithm [BP10a]

  • Notion of a “point”. In the original algorithm, this is a linear combination of input signals. Set of gates used: G = {XOR}.
  • Base set of known points S.
  • Set of target points T, the rows $y_i$ of M.
  • Metric using a distance function $\delta_i(S, y_i)$.
  • Set of candidates C.

Steps:

  • Try all base pairs $s_i, s_j$ in $S_r$ and form a candidate $c = g(s_i, s_j)$, in this case $c = s_i + s_j$.
  • Calculate the new distance vector $\Delta$ based on $S_r \cup \{c\}$.
  • Save the candidate $c$ that gives the lowest distance: $S_{r+1} = S_r \cup \{c\}$.
  • Repeat until the distance vector is all-zero.

Example:

$$
\begin{pmatrix}
1 & 0 & 1 & 1 & 1 \\
0 & 1 & 1 & 0 & 1 \\
1 & 1 & 0 & 1 & 1
\end{pmatrix}
=
\begin{pmatrix} y_0 \\ y_1 \\ y_2 \end{pmatrix}
$$

$$S_0 = \{x_0, x_1, \dots, x_4\} = \big((1,0,0,0,0),\ (0,1,0,0,0),\ \dots,\ (0,0,0,0,1)\big)$$

$$\Delta = (\delta_0, \delta_1, \dots, \delta_{m-1})$$
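The steps above can be sketched as a small greedy loop. This is a simplified illustration under stated assumptions, not the authors' implementation: the distance is computed by brute force, depth constraints are ignored, and ties are broken by preferring the distance vector with the largest sum of squares (an illustrative choice). Points are bitmasks over the inputs, and `distance` and `bp_greedy` are hypothetical names.

```python
from itertools import combinations


def distance(base, target):
    """Fewest XOR gates needed to build `target` from points in `base` (brute force)."""
    for k in range(1, len(base) + 1):
        for combo in combinations(base, k):
            acc = 0
            for point in combo:
                acc ^= point
            if acc == target:
                return k - 1
    raise ValueError("target is not reachable from the base")


def bp_greedy(targets, n_inputs):
    """Return a list of XOR gates (new_point, a, b) that produces every target."""
    base = [1 << i for i in range(n_inputs)]  # S_0: the unit vectors x_0 .. x_{n-1}
    gates = []
    while any(distance(base, t) > 0 for t in targets):
        best = None
        for a, b in combinations(base, 2):
            cand = a ^ b
            if cand in base:
                continue
            dists = [distance(base + [cand], t) for t in targets]
            # Lowest total distance first; ties go to the larger sum of squares.
            key = (sum(dists), -sum(d * d for d in dists))
            if best is None or key < best[0]:
                best = (key, cand, a, b)
        _, cand, a, b = best
        base.append(cand)
        gates.append((cand, a, b))
    return gates


# The 3x5 example: y0 = x0+x2+x3+x4, y1 = x1+x2+x4, y2 = x0+x1+x3+x4.
EXAMPLE_TARGETS = [0b11101, 0b10110, 0b11011]
example_gates = bp_greedy(EXAMPLE_TARGETS, 5)
```

Each accepted candidate lowers the total distance by at least one, so the loop never needs more gates than the naive row-by-row circuit. The brute-force distance is only practical for small matrices, which is presumably why the talk emphasizes speed-ups for the distance function.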

SLIDE 12

BP for Linear Circuits with Floating Multiplexers

  • Include MUX, NMUX in the set of gates.
  • A point is now a tuple $p = (F, I)$, where F and I are linear combinations of input signals. It translates into $\mathrm{MUX}(ZF,\ F \cdot X,\ I \cdot X)$.
  • Input points $X_k = (2^k, 2^k)$, $k = 0, \dots, n-1$.
  • Target points $Y_k = (Y_k^F, Y_k^I)$, $k = 0, \dots, m-1$.
  • Improved metrics and new algorithm (with lots of speed up) to calculate $\delta_i(S, y_i \mid \mathrm{maxD})$.
  • We keep track of AIR and AOR at each stage.
  • For the full affine transformation, define the point as $p = (f, F, i, I)$ → $\mathrm{MUX}(ZF,\ F \cdot X + f,\ I \cdot X + i)$.

The six gates: MUX(v,w), MUX(w,v), NMUX(v,w), NMUX(w,v), XOR(v,w), XNOR(v,w).
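One way to read the point notation is sketched below. It assumes, as an interpretation rather than something stated on the slide, that ZF = 1 selects the forward combination, that MUX(v, w) takes its forward part from v and its inverse part from w, and that NMUX and XNOR complement both constants. Under those assumptions the point arithmetic can be checked against signal-level evaluation; all function names are illustrative.

```python
from itertools import product


def parity(v: int) -> int:
    return bin(v).count("1") & 1


def evaluate(point, x: int, zf: int) -> int:
    """Signal value of point (f, F, i, I): F.X + f if zf is 1, else I.X + i."""
    f, F, i, I = point
    return (parity(F & x) ^ f) if zf else (parity(I & x) ^ i)


def XOR(v, w):
    return (v[0] ^ w[0], v[1] ^ w[1], v[2] ^ w[2], v[3] ^ w[3])


def XNOR(v, w):
    return (v[0] ^ w[0] ^ 1, v[1] ^ w[1], v[2] ^ w[2] ^ 1, v[3] ^ w[3])


def MUX(v, w):
    return (v[0], v[1], w[2], w[3])  # forward part from v, inverse part from w


def NMUX(v, w):
    return (v[0] ^ 1, v[1], w[2] ^ 1, w[3])


def point_algebra_is_consistent(n_inputs: int = 2) -> bool:
    """Compare the point-level gates with signal-level gates on every input."""
    vectors = range(1 << n_inputs)
    points = list(product((0, 1), vectors, (0, 1), vectors))
    for v, w in product(points, repeat=2):
        for x, zf in product(vectors, (0, 1)):
            a, b = evaluate(v, x, zf), evaluate(w, x, zf)
            selected = a if zf else b
            if evaluate(XOR(v, w), x, zf) != a ^ b:
                return False
            if evaluate(XNOR(v, w), x, zf) != a ^ b ^ 1:
                return False
            if evaluate(MUX(v, w), x, zf) != selected:
                return False
            if evaluate(NMUX(v, w), x, zf) != selected ^ 1:
                return False
    return True


def input_point(k: int):
    """X_k = (2^k, 2^k) with zero constants."""
    return (0, 1 << k, 0, 1 << k)
```

With this representation the earlier example (forward output X0 + X1, inverse output X0 + X2) is the target point `(0, 0b011, 0, 0b101)`, and it is reached with two gates as `XOR(input_point(0), MUX(input_point(1), input_point(2)))`.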

SLIDE 13

BP for any Nonlinear Circuit

  • Allow all kinds of gates in G (XOR, AND, MUX, … 2-input, 3-input…).
  • A point is now the truth table of a Boolean function.
  • Combine points using truth tables and gate functionality.
  • Target points are the truth table for every output signal of the nonlinear block.
  • Applicable to circuits of maximum 4-5 input signals, and the number of output signals is not limited.
  • Used to derive a smaller inversion circuit over GF(2^4).
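A truth-table point for a 4-input block fits in a 16-bit integer, and gates then act bitwise on whole tables. The sketch below shows one expansion step under that representation; the gate set and the names `input_table` and `one_gate_candidates` are illustrative assumptions, and the search strategy itself (metric, tree, pruning) is not modelled.

```python
from itertools import combinations

N_INPUTS = 4
ROWS = 1 << N_INPUTS      # 16 input assignments
MASK = (1 << ROWS) - 1    # truth tables are 16-bit integers


def input_table(k: int) -> int:
    """Truth table of input X_k: bit j holds the value of X_k on assignment j."""
    return sum(((j >> k) & 1) << j for j in range(ROWS))


TWO_INPUT_GATES = {
    "XOR": lambda a, b: a ^ b,
    "XNOR": lambda a, b: ~(a ^ b) & MASK,
    "AND": lambda a, b: a & b,
    "NAND": lambda a, b: ~(a & b) & MASK,
    "OR": lambda a, b: a | b,
    "NOR": lambda a, b: ~(a | b) & MASK,
}


def mux(s: int, a: int, b: int) -> int:
    """3-input MUX on truth tables: a where s is 1, b where s is 0."""
    return (s & a) | (~s & b & MASK)


def one_gate_candidates(points):
    """Map each truth table reachable with one 2-input gate to an expression."""
    found = {}
    for (name_a, tab_a), (name_b, tab_b) in combinations(points.items(), 2):
        for gate, fn in TWO_INPUT_GATES.items():
            found.setdefault(fn(tab_a, tab_b), f"{gate}({name_a}, {name_b})")
    return found


known = {f"X{k}": input_table(k) for k in range(N_INPUTS)}
candidates = one_gate_candidates(known)
```

A target such as the table `0x5F5F` is then found directly in `candidates` as `NAND(X0, X2)`. Because tables double in size with every added input, this representation only stays cheap for the 4-5 input blocks mentioned above.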

SLIDE 14

Search Tree

[Figure: search tree growing from S_r through S_{r+1}, S_{r+2}, S_{r+3} down to the leaves S_{r+TD}; 20-50 children, ~400 total children.]

  • Try to keep leaves from as many different branches as possible

SLIDE 15

Search Tree

[Figure: the same search tree, with the tree depth TD marked between S_r and the leaves S_{r+TD}.]

SLIDE 16

New architecture for lower depths

[Figure: two architectures. Architecture A: 8-bit input U → Top linear → 18-bit Q → Mul-Sum → 4-bit X → Inverse GF(2^4) → 4-bit Y → 2xMul → 18-bit N → Bottom linear → 8-bit output R. Architecture D: 8-bit input U → Top linear → 18-bit Q and 32-bit L; Q → Mul-Sum → 4-bit X → Inverse GF(2^4) → 4-bit Y; Y and L → 32 NAND2 + 8 XOR4 → 8-bit output R.]

The Bottom matrix only depends on the multiplication of the 4-bit signal Y with some linear combination of the input signal U:

$$R = Y_0 \cdot M_0 \cdot U + \dots + Y_3 \cdot M_3 \cdot U$$

where $M_i$ is an 8x8 matrix to be scalar multiplied by the $Y_i$ bit.

  • Calculate the $M_i \cdot U$ terms in parallel in the Top matrix.
  • Assembling requires 56 gates (32 NAND, 24 XOR).
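The assembly step can be sketched as follows. The matrices here are random placeholders (the real $M_i$ are not given on the slide), and the function names are illustrative. The sketch shows why NAND gates suffice: each output bit is the XOR of four product terms, and complementing an even number of terms leaves an XOR unchanged, so 32 NAND2 gates feeding 8 XOR4 gates compute the same result as AND gates would.

```python
import random


def mat_vec(rows, u: int) -> int:
    """Multiply an 8x8 binary matrix (list of 8 row bitmasks) by the 8-bit vector u."""
    out = 0
    for j, row in enumerate(rows):
        out |= (bin(row & u).count("1") & 1) << j
    return out


def assemble_reference(y: int, terms) -> int:
    """R = Y_0 * L_0 + ... + Y_3 * L_3, where L_i = M_i * U is an 8-bit vector."""
    r = 0
    for i in range(4):
        if (y >> i) & 1:
            r ^= terms[i]
    return r


def assemble_nand(y: int, terms) -> int:
    """Same result using 32 NAND2 gates and 8 four-input XORs."""
    r = 0
    for j in range(8):
        bit = 0
        for i in range(4):
            bit ^= 1 ^ (((y >> i) & 1) & ((terms[i] >> j) & 1))  # NAND2
        r |= bit << j  # the four complements cancel in the XOR4
    return r


def architectures_agree(seed: int = 0) -> bool:
    """Check both assemblies on hypothetical random matrices M_0 .. M_3."""
    rng = random.Random(seed)
    matrices = [[rng.getrandbits(8) for _ in range(8)] for _ in range(4)]
    for u in range(256):
        terms = [mat_vec(m, u) for m in matrices]  # the 32-bit signal L
        for y in range(16):
            if assemble_nand(y, terms) != assemble_reference(y, terms):
                return False
    return True
```

Counting an XOR4 as three 2-input XORs gives the 24 XORs quoted above, for 56 gates in total.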

SLIDE 17

New circuit for the inversion in GF(2^4)

$$
\begin{aligned}
Y_0 &= X_1 X_2 X_3 + X_0 X_2 + X_1 X_2 + X_2 + X_3 \\
Y_1 &= X_0 X_2 X_3 + X_0 X_2 + X_1 X_2 + X_1 X_3 + X_3 \\
Y_2 &= X_0 X_1 X_3 + X_0 X_2 + X_0 X_3 + X_0 + X_1 \\
Y_3 &= X_0 X_1 X_2 + X_0 X_2 + X_0 X_3 + X_1 X_3 + X_1
\end{aligned}
$$

  • In [BP12] they found a circuit of 17 gates and depth 4 (with base gates {AND, XOR}).
  • By applying the BP-algorithm for general non-linear circuits, we managed to achieve 9 gates and depth 3:

      T0 = NAND(X0, X2)     T3 = MUX(X1, X2, 1)     Y1 = MUX(T2, X3, T3)
      T1 = NOR(X1, X3)      T4 = MUX(X3, X0, 1)     Y2 = MUX(X0, T2, X1)
      T2 = XNOR(T0, T1)     Y0 = MUX(X2, T2, X3)    Y3 = MUX(T2, X1, T4)

  • We also found a small conventional (no MUXes) circuit of 15 gates and depth 3.
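The 9-gate circuit can be compared against the four output equations on all 16 inputs. The sketch assumes the argument order MUX(select, value if select is 1, value if select is 0); the slide does not spell this out, but it is the reading under which the circuit and the equations agree. The names `circuit` and `anf` are illustrative.

```python
from itertools import product


def mux(select: int, a: int, b: int) -> int:
    """Assumed convention: returns a when select is 1, otherwise b."""
    return a if select else b


def circuit(x0: int, x1: int, x2: int, x3: int):
    """The 9-gate, depth-3 inversion circuit."""
    t0 = 1 ^ (x0 & x2)   # NAND
    t1 = 1 ^ (x1 | x3)   # NOR
    t2 = 1 ^ t0 ^ t1     # XNOR
    t3 = mux(x1, x2, 1)
    t4 = mux(x3, x0, 1)
    y0 = mux(x2, t2, x3)
    y1 = mux(t2, x3, t3)
    y2 = mux(x0, t2, x1)
    y3 = mux(t2, x1, t4)
    return y0, y1, y2, y3


def anf(x0: int, x1: int, x2: int, x3: int):
    """The four output equations, with + as XOR and products as AND."""
    y0 = (x1 & x2 & x3) ^ (x0 & x2) ^ (x1 & x2) ^ x2 ^ x3
    y1 = (x0 & x2 & x3) ^ (x0 & x2) ^ (x1 & x2) ^ (x1 & x3) ^ x3
    y2 = (x0 & x1 & x3) ^ (x0 & x2) ^ (x0 & x3) ^ x0 ^ x1
    y3 = (x0 & x1 & x2) ^ (x0 & x2) ^ (x0 & x3) ^ (x1 & x3) ^ x1
    return y0, y1, y2, y3


def circuit_matches_equations() -> bool:
    return all(circuit(*x) == anf(*x) for x in product((0, 1), repeat=4))
```

The longest path runs through T0 or T1, then T2, then an output MUX, which gives the stated depth of 3.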

SLIDE 18

Additional Transformation Matrices

Excluding the final constant from the affine transformation, we can write the SBox as:

$$\mathrm{SBox}(x) = x^{-1} \cdot A_{8 \times 8}$$

In any field of characteristic 2, squaring, square root, and multiplication by a constant are linear functions. Thus, for any choice of $\alpha = 1 \dots 255$ and $\beta = 0 \dots 7$ we have:

$$Z(x) = \left(\alpha \cdot x^{2^{\beta}}\right)^{-1}, \qquad \mathrm{SBox}(x) = \sqrt[2^{\beta}]{\alpha \cdot Z(x)} \cdot A_{8 \times 8}$$

The linear map $x \mapsto \alpha \cdot x^{2^{\beta}}$ is absorbed into the Top matrix; the root, the multiplication by $\alpha$ and $A_{8 \times 8}$ are absorbed into the Bottom matrix.

  • For Forward (Inverse) we have 2040 choices. Tried all!
  • For Combined we have 2040^2 = 4,161,600 choices. Based on the heuristic algorithm, we selected candidates to run the full generic floating multiplexer algorithm.

A similar approach was independently proposed in [UHNA19], but they only considered multiplication.
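The identity behind the additional matrices, namely that the $2^{\beta}$-th root of $\alpha \cdot Z(x)$ equals $x^{-1}$, can be checked numerically. The sketch assumes the standard polynomial representation of the Rijndael field and only samples a few values of α to keep the run short; the function names and the sampled values are illustrative.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_pow(a: int, e: int) -> int:
    """Raise a to the power e by square-and-multiply."""
    result = 1
    while e:
        if e & 1:
            result = gf_mul(result, a)
        a = gf_mul(a, a)
        e >>= 1
    return result


# Inversion table; 0 maps to 0 as in AES.
INV = [gf_pow(a, 254) for a in range(256)]


def frobenius(x: int, beta: int) -> int:
    """x^(2^beta): repeated squaring, a linear map over GF(2)."""
    return gf_pow(x, 1 << beta)


def root(x: int, beta: int) -> int:
    """2^beta-th root, computed as x^(2^(8 - beta)) because x^256 = x."""
    return gf_pow(x, 1 << ((8 - beta) % 8))


def identity_holds(alphas=(1, 2, 0x53)) -> bool:
    """Check root(alpha * Z(x)) == x^-1 with Z(x) = (alpha * x^(2^beta))^-1."""
    for alpha in alphas:
        for beta in range(8):
            for x in range(256):
                z = INV[gf_mul(alpha, frobenius(x, beta))]
                if root(gf_mul(alpha, z), beta) != INV[x]:
                    return False
    return True


n_choices = 255 * 8  # one (alpha, beta) pair per transformation
```

Each (α, β) pair changes the Top and Bottom matrices without touching the nonlinear middle part, which is why all 2040 variants can be fed to the linear minimization algorithms.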

SLIDE 19

Forward SBox Results

[Figure: area (um^2) and clock period (ps) for the forward SBox. Series: Our - fast, Our - tradeoff, Our - bonus, Reyhani - fast, Reyhani - light, Ueno'15, Ueno'19, Boyar - small.]

SLIDE 20

Combined SBox Results

[Figure: area (um^2) and clock period (ps) for the combined SBox. Series: Our - fast, Our - tradeoff, Our - bonus, Reyhani, Ueno'19, Canright.]

SLIDE 21

AES MixColumns results

Alexander also applied the algorithms to the AES MixColumns circuits.

Previous results (XORs):

  • 103: Jean et al, CHES 2017
  • 97: Kranz et al, ToSC 2017
  • 95: Banik et al, ePrint Archive Report 2019/856
  • 94: Tan and Peyrin, ePrint Archive Report 2019/847

Alexander’s result: 92 (depth 6), Alexander Maximov, ePrint Archive Report 2019/833

SLIDE 22

www.ericsson.com

Thank you.