How Fast Can Higher-Order Masking Be in Software? Dahmun Goudarzi


slide-1
SLIDE 1

How Fast Can Higher-Order Masking Be in Software?

Dahmun Goudarzi and Matthieu Rivain

EUROCRYPT 2017, Paris

slide-2
SLIDE 2

1 Introduction
2 Field Multiplications
3 Non-Linear Operations
4 Generic Polynomial Methods
5 Polynomial Methods for AES
6 The Bitslice Strategy

2/32

slide-6
SLIDE 6

Higher-Order Masking

$x = x_1 + x_2 + \cdots + x_d$

Linear operations: $O(d)$
Non-linear operations: $O(d^2)$

Challenge for blockciphers: S-boxes

3/32
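
The sharing above can be sketched in a few lines. This is only an illustration, assuming that `+` denotes XOR on n-bit values (addition in GF(2^n)); the helper names `share`, `unshare` and `masked_xor` are hypothetical and not taken from the talk.

```python
import secrets
from functools import reduce


def share(x: int, d: int, n: int = 8) -> list[int]:
    """Split an n-bit value x into d shares whose XOR equals x."""
    shares = [secrets.randbits(n) for _ in range(d - 1)]
    # The last share is chosen so that all shares XOR back to x.
    last = reduce(lambda a, b: a ^ b, shares, x)
    return shares + [last]


def unshare(shares: list[int]) -> int:
    """Recombine shares by XORing them together."""
    return reduce(lambda a, b: a ^ b, shares, 0)


def masked_xor(a: list[int], b: list[int]) -> list[int]:
    """A linear operation is applied share by share: d XORs, hence O(d)."""
    return [ai ^ bi for ai, bi in zip(a, b)]


a = share(0x53, d=4)
b = share(0xCA, d=4)
assert unshare(masked_xor(a, b)) == 0x53 ^ 0xCA
```

A linear operation never mixes shares of different indices, which is why its cost grows only linearly in d.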

slide-7
SLIDE 7

Ishai-Sahai-Wagner Multiplication

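$$\sum_i c_i = \Big(\sum_i a_i\Big) \times \Big(\sum_i b_i\Big) = \sum_{i,j} a_i \times b_j$$

$$
\begin{pmatrix}
a_1 b_1 & a_1 b_2 & \cdots & a_1 b_d \\
 & a_2 b_2 & \cdots & \vdots \\
 & & \ddots & \vdots \\
 & & & a_d b_d
\end{pmatrix}
+
\begin{pmatrix}
 & & & \\
a_2 b_1 & & & \\
\vdots & \ddots & & \\
a_d b_1 & a_d b_2 & \cdots &
\end{pmatrix}
+
\begin{pmatrix}
 & r_{1,2} & \cdots & r_{1,d} \\
r_{1,2} & & \ddots & \vdots \\
\vdots & \ddots & & r_{d,d-1} \\
r_{1,d} & \cdots & r_{d,d-1} &
\end{pmatrix}
$$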

4/32
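
A minimal sketch of the ISW multiplication follows. It assumes GF(2^4) with the reduction polynomial x^4 + x + 1, chosen here only for illustration; `gf16_mul` and `isw_mul` are hypothetical names. Each cross product a_i × b_j is blinded by a fresh random r_{i,j} that appears in two output shares and therefore cancels in the sum.

```python
import secrets
from functools import reduce


def gf16_mul(a: int, b: int) -> int:
    """Multiply in GF(2^4) modulo x^4 + x + 1 (illustrative choice)."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:      # degree 4 reached: reduce
            a ^= 0x13
    return r


def isw_mul(a: list[int], b: list[int]) -> list[int]:
    """ISW-style multiplication of two d-sharings, O(d^2) field mults."""
    d = len(a)
    c = [gf16_mul(a[i], b[i]) for i in range(d)]
    for i in range(d):
        for j in range(i + 1, d):
            r = secrets.randbits(4)                       # r_{i,j}
            c[i] ^= r
            # r_{j,i} = (r_{i,j} + a_i b_j) + a_j b_i
            c[j] ^= (r ^ gf16_mul(a[i], b[j])) ^ gf16_mul(a[j], b[i])
    return c


def xor_all(shares: list[int]) -> int:
    return reduce(lambda u, v: u ^ v, shares, 0)


a = [3, 9, 12]    # shares of 3 ^ 9 ^ 12 = 6
b = [1, 7, 13]    # shares of 1 ^ 7 ^ 13 = 11
assert xor_all(isw_mul(a, b)) == gf16_mul(6, 11)
```

The double loop visits every pair (i, j), which is where the quadratic cost of non-linear operations comes from.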

slide-10
SLIDE 10

The Polynomial Methods

S-box seen as a polynomial over $\mathrm{GF}(2^n)$

$S(x) = \sum_{i=0}^{2^n - 1} a_i x^i$

Generic Methods

$S(x) = \sum_i (p_i \star q_i)(x)$

◮ CRV decomposition, $\star = \times$ (CHES 2014)
◮ Algebraic decomposition, $\star = \circ$ (CRYPTO 2015)

AES Specific Methods

$S_{\mathrm{AES}}(x) = \mathrm{Aff}(x^{254})$

◮ RP multiplication chain (CHES 2010)
◮ KHL multiplication chain (CHES 2011)

5/32
slide-11
SLIDE 11

Our results

Optimized implementations of state-of-the-art higher-order masking techniques

Bottom-up approach:
◮ base field multiplication
◮ ISW/CPRR
◮ polynomial methods

Finely tuned ARM assembly (parallelization)

Alternative strategy: bitslice method (new AES and PRESENT speed records)

6/32

slide-12
SLIDE 12

ARM

32-bit architecture with 16 registers (13 user-accessible registers)

Barrel shifter: shifts and rotates virtually free

Example: x-times and add on GF(2)[x] in 1 cycle

EOR $acc, $var, $acc, LSL #1

7/32
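
To see why a fused shift-and-XOR helps, here is a sketch of a binary (shift-and-add) multiplication in which each loop iteration performs exactly the update `acc = var ^ (acc << 1)` of the EOR instruction above. The function names, the AES polynomial 0x11B and the separate reduction step are assumptions made for the example; this is not the authors' "bin mult" code.

```python
def clmul(a: int, b: int, n: int) -> int:
    """Carry-less product of two n-bit polynomials over GF(2), unreduced."""
    acc = 0
    for i in reversed(range(n)):            # scan b from its top bit down
        var = a if (b >> i) & 1 else 0
        acc = var ^ (acc << 1)              # mirrors: EOR acc, var, acc, LSL #1
    return acc


def reduce_poly(p: int, modulus: int, n: int) -> int:
    """Reduce a polynomial of degree <= 2n - 2 modulo a degree-n modulus."""
    for i in reversed(range(n, 2 * n - 1)):
        if (p >> i) & 1:
            p ^= modulus << (i - n)
    return p


def gf_mul_bin(a: int, b: int, n: int = 8, modulus: int = 0x11B) -> int:
    """Multiplication in GF(2^n); defaults to the AES field (assumption)."""
    return reduce_poly(clmul(a, b, n), modulus, n)


assert gf_mul_bin(0x57, 0x83) == 0xC1
```

The loop runs n times, which is consistent with a cycle count that grows linearly in n for the binary multiplications.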

slide-14
SLIDE 14

Field Multiplication

Goal: efficient implementation of multiplication over $\mathrm{GF}(2^n)$

Fastest method: precomputed look-up table

Limitation: constrained memory on embedded systems

n            4          5       6       7        8        9         10
Table size   0.25 KiB   1 KiB   4 KiB   16 KiB   64 KiB   512 KiB   2048 KiB

9/32
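
The table sizes above can be reproduced with a short calculation. The sketch assumes a full table with one entry per pair (a, b), that is 2^(2n) entries, each stored on one byte for n ≤ 8 and on two bytes for n = 9, 10; the storage width is an assumption that happens to match the listed figures.

```python
def full_table_kib(n: int) -> float:
    """Size in KiB of a full multiplication table over GF(2^n)."""
    entries = 2 ** (2 * n)             # one entry per operand pair
    bytes_per_entry = (n + 7) // 8     # assumption: whole bytes per entry
    return entries * bytes_per_entry / 1024


for n in range(4, 11):
    print(n, full_table_kib(n))
```

The size quadruples with each extra bit, which is why the full table is only practical for small n on a memory-constrained device.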

slide-16
SLIDE 16

Field Multiplication

Method        Clock cycles   Registers   Code size (bytes)
bin mult v1   10n + 3        5           52
bin mult v2   7n + 3         5           $2^{n-1} + 48$
exp-log v1    18             5           $2^{n+1} + 48$
exp-log v2    16             5           $3 \cdot 2^n + 40$
kara.         19             6           $3 \cdot 2^n + 42$
half-tab      10             5           $2^{3n/2 + 1} + 24$
full-tab      4              5           $2^{2n} + 12$

$a \times b = (a_h x^{n/2} + a_\ell) \times (b_h x^{n/2} + b_\ell)$

Karatsuba: $a \times b = T_1[a_h \mid b_h] + T_2[a_\ell \mid b_\ell] + T_3[a_h + a_\ell \mid b_h + b_\ell]$

10/32

slide-17
SLIDE 17

Field Multiplication

$a \times b = (a_h x^{n/2} + a_\ell) \times (b_h x^{n/2} + b_\ell)$

Half table: $a \times b = T_1[a_h \mid a_\ell \mid b_h] + T_2[a_h \mid a_\ell \mid b_\ell]$

10/32
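
The half-table idea can be sketched as follows, assuming n = 8 and the AES polynomial 0x11B (both are choices made for the example). T1 stores a × (b_h x^{n/2}) and T2 stores a × b_ℓ, each indexed by the full operand a concatenated with one half of b; the contents of the tables are my reading of the formula above, not code from the talk.

```python
N = 8                 # field size in bits (assumption for the example)
H = N // 2            # half width
MOD = 0x11B           # AES reduction polynomial (assumption)


def gf_mul(a: int, b: int) -> int:
    """Reference shift-and-add multiplication in GF(2^N)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> N:
            a ^= MOD
    return r


# Index layout: (a << H) | half_of_b, so each table has 2^(3N/2) entries.
T1 = [gf_mul(a, bh << H) for a in range(1 << N) for bh in range(1 << H)]
T2 = [gf_mul(a, bl) for a in range(1 << N) for bl in range(1 << H)]


def gf_mul_half_table(a: int, b: int) -> int:
    """Two lookups and one XOR replace the whole multiplication."""
    bh, bl = b >> H, b & ((1 << H) - 1)
    return T1[(a << H) | bh] ^ T2[(a << H) | bl]


assert gf_mul_half_table(0x57, 0x83) == 0xC1
```

Two tables of 2^(3n/2) entries give 2^(3n/2 + 1) bytes for n = 8, which matches the dominant term of the half-tab code size.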

slide-18
SLIDE 18

Field Multiplication

Method        Clock cycles   Registers   Code size (n = 4)
bin mult v1   10n + 3        5           52 B
bin mult v2   7n + 3         5           56 B
exp-log v1    18             5           80 B
exp-log v2    16             5           88 B
kara.         19             6           90 B
half-tab      10             5           152 B
full-tab      4              5           268 B

For n = 4: full table
◮ Fastest multiplication: 4 clock cycles
◮ Low code size: 268 B

10/32

slide-19
SLIDE 19

Field Multiplication

Method        Clock cycles   Registers   Code size (n = 8)
bin mult v1   10n + 3        5           52 B
bin mult v2   7n + 3         5           176 B
exp-log v1    18             5           560 B
exp-log v2    16             5           808 B
kara.         19             6           810 B
half-tab      10             5           8216 B
full-tab      4              5           64 KiB

For n = 8: exp-log or half-tab
◮ trade-off between clock cycles and code size

10/32
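
For comparison with the table-based variants, here is a minimal exp-log multiplication for n = 8. It assumes the AES polynomial 0x11B and the generator 0x03, and it handles zero operands with a branch for readability; a side-channel-hardened version would avoid that branch, so this sketch is not meant to mirror the "exp-log v1/v2" routines measured above.

```python
N, MOD, GEN = 8, 0x11B, 0x03     # field, modulus and generator (assumptions)


def gf_mul_ref(a: int, b: int) -> int:
    """Reference shift-and-add multiplication used to build the tables."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> N:
            a ^= MOD
    return r


# EXP is doubled in length so that LOG[a] + LOG[b] never needs a "mod 255".
EXP = [1] * 510
for i in range(1, 510):
    EXP[i] = gf_mul_ref(EXP[i - 1], GEN)

LOG = [0] * 256
for i in range(255):
    LOG[EXP[i]] = i


def gf_mul_exp_log(a: int, b: int) -> int:
    """a * b = g^(log a + log b) for non-zero a and b."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


assert gf_mul_exp_log(0x57, 0x83) == 0xC1
```

The tables hold a few hundred bytes instead of several kilobytes, at the price of more lookups per multiplication.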

slide-21
SLIDE 21

Quadratic Operations

ISW
◮ Secure GF-mult of 2 operands
◮ Might need refreshing (see paper for details)

CPRR
◮ Evaluation of quadratic functions in 1 operand
◮ Similar to ISW, with GF-mults replaced by lookup tables
◮ Twice as much randomness

12/32
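
A sketch of a CPRR-style evaluation of a quadratic function h on a sharing is given below. It follows my reading of the scheme (two fresh randoms per pair of shares, and a correction by h(0) when d is even) and should be checked against the paper before any real use. The field GF(2^4) with polynomial x^4 + x + 1 and the function h(x) = x^3 are assumptions for the example.

```python
import secrets
from functools import reduce


def gf16_mul(a: int, b: int) -> int:
    """Multiply in GF(2^4) modulo x^4 + x + 1 (illustrative choice)."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13
    return r


def cube(x: int) -> int:
    """h(x) = x^3, a function of algebraic degree 2 over GF(2^4)."""
    return gf16_mul(x, gf16_mul(x, x))


def cprr_eval(h, x: list[int], n: int = 4) -> list[int]:
    """Evaluate a quadratic function h on shares x; h would be a lookup table."""
    d = len(x)
    y = [h(xi) for xi in x]
    for i in range(d):
        for j in range(i + 1, d):
            r = secrets.randbits(n)      # output mask r_{i,j}
            rp = secrets.randbits(n)     # input mask r'_{i,j}
            y[i] ^= r
            y[j] ^= (r ^ h(x[i] ^ rp) ^ h(x[j] ^ rp)
                     ^ h(x[i] ^ rp ^ x[j]) ^ h(rp))
    if d % 2 == 0:
        y[0] ^= h(0)                     # correction term, zero when h(0) = 0
    return y


def xor_all(shares: list[int]) -> int:
    return reduce(lambda u, v: u ^ v, shares, 0)


x = [4, 11, 2]                           # shares of 4 ^ 11 ^ 2 = 13
assert xor_all(cprr_eval(cube, x)) == cube(13)
```

Each pair of shares consumes two random values instead of one, which is the "twice" in the randomness comparison with ISW.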

slide-22
SLIDE 22

Performance Comparison

[Figure: clock cycles of ISW-FT, ISW-HT, ISW-EL and CPRR for d = 3, 5, 10]

ISW < CPRR when the table is too large

Asymptotic comparison: 1 CPRR ≈ 1.16 ISW-FT, 0.88 ISW-HT, 0.75 ISW-EL

13/32

slide-23
SLIDE 23

Parallelization

32-bit register filled with only n-bit elements

Perform several ISW/CPRR in parallel:
◮ n = 4 ⇒ 8 elements/register
◮ n = 8 ⇒ 4 elements/register

Consequence:
◮ Parallel: load, store, xor, loops
◮ Sequential: GF mult, CPRR lookups

14/32
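
The packing can be illustrated with plain integers standing in for 32-bit registers. The helpers `pack` and `unpack` are hypothetical names; the point is that one XOR on packed words performs 32/n element-wise XORs at once, whereas table lookups still have to extract each element.

```python
def pack(elems: list[int], n: int) -> int:
    """Pack 32 // n elements of n bits each into one 32-bit word."""
    assert len(elems) == 32 // n
    word = 0
    for k, e in enumerate(elems):
        word |= (e & ((1 << n) - 1)) << (k * n)
    return word


def unpack(word: int, n: int) -> list[int]:
    """Inverse of pack: split a 32-bit word into 32 // n elements."""
    mask = (1 << n) - 1
    return [(word >> (k * n)) & mask for k in range(32 // n)]


a = [1, 2, 3, 4, 5, 6, 7, 8]          # eight GF(16) elements
b = [15, 14, 13, 12, 11, 10, 9, 8]
packed_xor = pack(a, 4) ^ pack(b, 4)   # a single word-level XOR
assert unpack(packed_xor, 4) == [x ^ y for x, y in zip(a, b)]
```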

slide-24
SLIDE 24

Performance Gain of Parallelization

[Figure, n = 8 (4 elements): clock cycles of ISW-HT, ISW-EL and CPRR, sequential vs. parallel, for d = 3, 5, 10]

◮ Asympt. ratio: CPRR 54%.

[Figure, n = 4 (8 elements): clock cycles of ISW-FT and CPRR, sequential vs. parallel, for d = 3, 5, 10]

◮ Asympt. ratio: ISW 42%.

15/32

slide-29
SLIDE 29

Polynomial Decomposition

$S(x) = \sum_i q_i(x) \star p_i(x)$

◮ $q_i$: random linear combinations from a basis $B$
◮ find $p_i$ by solving a linear system

CRV vs AD:
◮ CRV [CRV14]: $\star$ = GF-multiplication ⇒ ISW multiplication
◮ AD [CPRR15]: $\star$ = composition ⇒ CPRR evaluation

17/32

slide-30
SLIDE 30

CRV Improvement

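Use CPRR for the basis computation

Example for n = 8:

CRV (5 ISW)
◮ $x^3 = x \cdot x^2$
◮ $x^7 = x \cdot (x^3)^2$
◮ $x^{29} = x \cdot (x^7)^4$
◮ $x^{87} = x^3 \cdot x^{29}$
◮ $x^{251} = (x^6)^{16} \cdot (x^{87})^{128}$

This paper (6 CPRR)
◮ $x^3 = x^3$
◮ $x^9 = (x^3)^3$
◮ $x^5 = x^5$
◮ $x^{25} = (x^5)^5$
◮ $x^{125} = (x^{25})^5$
◮ $x^{115} = (x^{125})^5$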

18/32
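
The exponents of the CPRR-based chain can be checked with a little modular arithmetic, since exponents in GF(2^8) live modulo 255. The sketch also checks that x ↦ x^3 and x ↦ x^5 have exponents of Hamming weight 2, which is what makes each step a quadratic function suitable for one CPRR evaluation. The helper names are hypothetical.

```python
ORDER = 2 ** 8 - 1     # multiplicative order of GF(2^8)*: exponents are mod 255


def is_quadratic_power(e: int) -> bool:
    """x -> x^e has algebraic degree 2 when e has binary Hamming weight 2."""
    return bin(e).count("1") == 2


def compose(e: int, k: int) -> int:
    """Exponent of (x^e)^k, reduced modulo the group order."""
    return (e * k) % ORDER


e3 = 3
e9 = compose(e3, 3)        # (x^3)^3
e5 = 5
e25 = compose(e5, 5)       # (x^5)^5
e125 = compose(e25, 5)     # (x^25)^5
e115 = compose(e125, 5)    # (x^125)^5, since 625 mod 255 = 115

assert is_quadratic_power(3) and is_quadratic_power(5)
assert [e9, e25, e125, e115] == [9, 25, 125, 115]
```

Counting the cubings and fifth powers in the chain gives the six CPRR evaluations stated above.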

slide-31
SLIDE 31

Implementation Results

[Figure, n = 4 (8 s-boxes in parallel): clock cycles (×10) of Alg. dec. and CRV-FT for d = 3, 5, 10]

[Figure, n = 8 (4 s-boxes in parallel): clock cycles (×10^2) of Alg. dec., CRV-HT and CRV-EL for d = 3, 5, 10]

19/32

slide-33
SLIDE 33

Polynomial Methods for AES

Based on the specific algebraic structure of the AES:

$S(x) = \mathrm{Aff}(x^{254})$

RP10 method: 4 ISW mult
◮ Security flaw due to refreshing
◮ Patch [CPRR13]: 1 CPRR + 3 ISW
◮ Improvement [GPS14]: 3 CPRR + 1 ISW

KHL11 method: 5 ISW mult on GF(16)
◮ Patch [this paper]: 1 CPRR + 4 ISW

21/32
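
As an unmasked illustration of a four-multiplication chain for x^254, the sketch below uses the sequence x^2, x^3, x^12, x^15, x^240, x^252, x^254, which is how I recall the RP10 chain; treat the exact ordering as an assumption. Squarings are linear over GF(2) and thus cheap to mask, so only the four `gf_mul` calls would become ISW (or CPRR) operations. The affine map of the AES S-box is omitted.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiplication in the AES field GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> 8:
            a ^= 0x11B
    return r


def pow2k(x: int, k: int) -> int:
    """Raise x to the power 2^k by k squarings (a linear map over GF(2))."""
    for _ in range(k):
        x = gf_mul(x, x)
    return x


def pow254(x: int) -> int:
    """x^254 with four non-linear multiplications."""
    x2 = pow2k(x, 1)
    x3 = gf_mul(x, x2)          # multiplication 1
    x12 = pow2k(x3, 2)
    x15 = gf_mul(x3, x12)       # multiplication 2
    x240 = pow2k(x15, 4)
    x252 = gf_mul(x240, x12)    # multiplication 3
    return gf_mul(x252, x2)     # multiplication 4


assert pow254(0x53) == 0xCA     # 0x53 and 0xCA are inverses in the AES field
```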

slide-34
SLIDE 34

Implementation Results

[Figure, 16 s-boxes in parallel: clock cycles (×10^3) of KHL, RP-HT and RP-EL for d = 3, 5, 10]

KHL is faster than RP-∗: smaller elements ⇒ higher parallelization degree

22/32

slide-36
SLIDE 36

Bitslice for the AES

S-box seen as a Boolean circuit

[Figure: Boolean circuit on the input bits x_1, x_2, ..., x_n, with XOR and AND gates mapped to CPU XOR and CPU AND instructions on the registers X_1, X_2, ..., X_n]

16 S-boxes in parallel

24/32
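
The bitslice representation can be sketched as a bit-matrix transposition: word i collects bit i of every s-box input, so that one CPU instruction evaluates the same gate for all 16 s-boxes. The helper names and the bit ordering are assumptions made for the example.

```python
def bitslice(values: list[int], n: int = 8) -> list[int]:
    """Return n words; bit k of word i is bit i of values[k]."""
    words = [0] * n
    for k, v in enumerate(values):
        for i in range(n):
            words[i] |= ((v >> i) & 1) << k
    return words


def unbitslice(words: list[int], count: int) -> list[int]:
    """Inverse transposition: rebuild `count` values from the bit-sliced words."""
    return [
        sum(((w >> k) & 1) << i for i, w in enumerate(words))
        for k in range(count)
    ]


inputs = list(range(0x10, 0x20))        # 16 s-box inputs of 8 bits each
X = bitslice(inputs)

# One XOR on 16-bit words evaluates the gate "bit0 XOR bit1" for all 16 inputs.
gate = X[0] ^ X[1]
expected = [((v >> 0) ^ (v >> 1)) & 1 for v in inputs]
assert [(gate >> k) & 1 for k in range(16)] == expected
assert unbitslice(X, 16) == inputs
```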

slide-37
SLIDE 37

Application for AES S-boxes

Circuit for the AES S-box [BMP13]
◮ 83 XOR gates
◮ 32 AND gates

Bitslice (16 s-boxes)
◮ 83 XOR instructions
◮ 32 AND instructions

Masking at the order d:
◮ 83 × d XOR instructions
◮ 32 ISW-AND

25/32

slide-38
SLIDE 38

Improvement

2 16-bit ISW-AND ⇒ 1 32-bit ISW-AND

Goal: group the AND gates in pairs

Validation on the BMP circuit

16 s-boxes = 16 ISW-AND ⇒ 1 ISW-AND per s-box

26/32
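
The pairing of AND gates can be sketched as follows: two independent 16-bit bitslice ANDs are packed into the two halves of 32-bit words and processed by a single ISW-AND. The functions `share`, `unshare`, `pack16` and `isw_and` are hypothetical names; the ISW structure is the same as for field multiplication, with the bitwise AND playing the role of the multiplication.

```python
import secrets


def share(x: int, d: int) -> list[int]:
    """Boolean sharing of a 32-bit word into d shares."""
    s = [secrets.randbits(32) for _ in range(d - 1)]
    last = x
    for v in s:
        last ^= v
    return s + [last]


def unshare(s: list[int]) -> int:
    x = 0
    for v in s:
        x ^= v
    return x


def pack16(hi: int, lo: int) -> int:
    """Place two 16-bit bitslice words in one 32-bit word."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def isw_and(a: list[int], b: list[int]) -> list[int]:
    """ISW scheme where the multiplication is a bitwise AND of 32-bit words."""
    d = len(a)
    c = [a[i] & b[i] for i in range(d)]
    for i in range(d):
        for j in range(i + 1, d):
            r = secrets.randbits(32)
            c[i] ^= r
            c[j] ^= (r ^ (a[i] & b[j])) ^ (a[j] & b[i])
    return c


u1, v1 = 0xBEEF, 0x1234      # operands of the first AND gate (16 s-boxes)
u2, v2 = 0xCAFE, 0x0FF0      # operands of the second AND gate
out = unshare(isw_and(share(pack16(u1, u2), 4), share(pack16(v1, v2), 4)))
assert out >> 16 == u1 & v1
assert out & 0xFFFF == u2 & v2
```

Pairing halves the number of ISW-AND calls, which is how 32 AND gates per circuit become 16 calls for 16 s-boxes.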

slide-39
SLIDE 39

Performance Comparison of ISW

[Figure: clock cycles for d = 3, 5, 10 of ISW-AND (32 parallel ANDs), ISW-FT (8 parallel GF(16)-mult) and ISW-HT (4 parallel GF(256)-mult)]

27/32

slide-40
SLIDE 40

Performance for the AES S-box

[Figure, 16 S-boxes in parallel: clock cycles (×10^2) of RP-HT, KHL and bitslice for d = 3, 5, 10]

◮ RP-HT: 1 ISW-HT/CPRR per s-box
◮ KHL: 0.83 ISW-FT/CPRR per s-box
◮ Bitslice: 1 ISW-AND per s-box

28/32

slide-41
SLIDE 41

AES vs Generic

[Figure, 16 S-boxes in parallel: clock cycles (×10^2) of Alg. dec., KHL and bitslice for d = 3, 5, 10]

◮ KHL 3.1× faster than AD (for n = 8)
◮ Bitslice 2.3× faster than KHL

29/32

slide-42
SLIDE 42

Timing for AES and PRESENT Block Ciphers

                   d = 2     d = 3     d = 4     d = 5     d = 10
Bitslice AES       0.89 ms   1.39 ms   1.99 ms   2.7 ms    8.01 ms
Bitslice PRESENT   0.62 ms   0.96 ms   1.35 ms   1.82 ms   5.13 ms

Clock frequency: 60 MHz

30/32
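
To compare these timings with the cycle counts used elsewhere in the talk, they can be converted at the stated 60 MHz clock. The dictionary layout and the helper `cycles` are only illustrative; the figures are those of the table above.

```python
FREQ_HZ = 60_000_000      # clock frequency stated on the slide

timings_ms = {
    "Bitslice AES": {2: 0.89, 3: 1.39, 4: 1.99, 5: 2.7, 10: 8.01},
    "Bitslice PRESENT": {2: 0.62, 3: 0.96, 4: 1.35, 5: 1.82, 10: 5.13},
}


def cycles(ms: float) -> int:
    """Convert a duration in milliseconds to clock cycles at FREQ_HZ."""
    return round(ms * FREQ_HZ / 1000)


for cipher, per_order in timings_ms.items():
    for d, ms in per_order.items():
        print(f"{cipher}, d = {d}: about {cycles(ms):,} cycles")
```

For instance, 0.89 ms at 60 MHz corresponds to about 53,400 clock cycles.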

slide-50
SLIDE 50

Conclusion

Case study on ARM: barrel shifter and 32-bit registers

Selection of best field multiplication algorithms:
◮ New proposed method: half-table
◮ For n = 4, full table (4 clock cycles and 268 B)
◮ For n = 8, trade-off between exp-log and half-table

Optimization of non-linear operations:
◮ CPRR > ISW when the table is too large
◮ Smaller elements ⇒ higher parallelization degree

Generic polynomial methods:
◮ New optimal parameters for CRV with CPRR evaluations
◮ Depending on n, trade-off between AD and CRV

Polynomial methods for AES:
◮ KHL > RP thanks to its higher parallelization degree

Pushing the parallelization to the optimum: bitslice strategy
◮ Reordering of the Boolean circuit for optimal use of registers
◮ Better than any polynomial method for AES and PRESENT

Can we use bitslice for generic methods? Yes, GR16 [CHES 2016]

31/32

slide-51
SLIDE 51

Questions?

32/32