Hardware Implementation of Block Cipher: Case Study Using AES



slide-1
SLIDE 1

Tohoku University Rei Ueno

Hardware Implementation of Block Cipher: Case Study Using AES

slide-2
SLIDE 2

Acknowledgments

Naofumi Homma, Tohoku Univ.
Takafumi Aoki, Tohoku Univ.
Sumio Morioka, Interstellar Technologies, Inc.
Noriyuki Miura, Kobe Univ.
Kohei Matsuda, Kobe Univ.
Makoto Nagata, Kobe Univ.
Shivam Bhasin, NTU
Yves Mathieu, Telecom ParisTech
Tarik Graba, Telecom ParisTech
Jean-Luc Danger, Telecom ParisTech

slide-3
SLIDE 3

This talk

• Given a symmetric-key cipher, how does a hardware designer implement and optimize it?
  • For practical applications:
    • With higher efficiency, unified encryption/decryption, on-the-fly key scheduling, and no block-wise pipelining
  • Case study using AES!
• Disclaimer
  • Some modern lightweight ciphers are already optimized, and they avoid some of the concerns that arise in implementing AES
  • But I still believe that optimization of AES implementations can be fed back into cipher design

slide-4
SLIDE 4

Hardware architectures of block cipher

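[Figure: design space of block-cipher hardware, plotting area against time for one block encryption. Serialized, round-based, and unrolled architectures are marked; arrows are labeled "Resource sharing" and "Datapath replication".]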

slide-5
SLIDE 5

Hardware architectures of block cipher

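[Figure: the same area versus time-for-one-block-encryption plot, with byte-serial, round-based, and unrolled architectures; arrows are labeled "Resource sharing" and "Datapath replication". Additional callouts: "Efficient hardware", "Pipelining", "Datapath optimization".]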
slide-6
SLIDE 6

For practical hardware implementation

• Block-chaining modes have been widely deployed
  • CBC, CMAC, CCM, ...
• (Un)parallelizability: an issue for block-wise pipelining
  • AES hardware achieves 53 Gbps, but only for parallelizable modes [Mathew+, JSSC 2011]
  • Higher throughput ≠ lower latency
• Both encryption and decryption operations are needed
• Importance of on-the-fly key scheduling
  • Off-the-fly key scheduling requires additional memory to store the expanded keys
  • The latency of calculating round keys is non-negligible if AES is used with key-tweakable modes
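Why block-wise pipelining does not help a chaining mode can be seen with a small timing model. This is only a sketch: the pipeline depth, clock period, and block count below are hypothetical numbers chosen for illustration, not figures from any of the cited designs.

```python
def pipelined_time_ns(num_blocks, stages, clock_ns, chained):
    """Time to process num_blocks on a block-wise pipelined core.

    stages  -- pipeline depth (clock cycles for one block to pass through)
    chained -- True for CBC-like modes, where block i needs the result of block i-1
    """
    if chained:
        # Each block must leave the pipeline before the next one can enter.
        return num_blocks * stages * clock_ns
    # Independent blocks enter back to back, one per clock cycle.
    return (stages + num_blocks - 1) * clock_ns


def throughput_gbps(num_blocks, time_ns, block_bits=128):
    """Bits per nanosecond, which is numerically the same as Gbps."""
    return num_blocks * block_bits / time_ns


# Hypothetical core: 10 pipeline stages, 1 ns clock, 1000 blocks.
STAGES, CLOCK_NS, BLOCKS = 10, 1.0, 1000
t_par = pipelined_time_ns(BLOCKS, STAGES, CLOCK_NS, chained=False)
t_cbc = pipelined_time_ns(BLOCKS, STAGES, CLOCK_NS, chained=True)

print("parallelizable mode:", throughput_gbps(BLOCKS, t_par), "Gbps")
print("chained mode:       ", throughput_gbps(BLOCKS, t_cbc), "Gbps")
```

In this model the single-block latency is the same in both cases (`stages * clock_ns`); only the parallelizable mode can hide it, which is the sense in which higher throughput does not imply lower latency.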

slide-7
SLIDE 7

Outline

• Introduction
• Related works
• Optimized architecture
• Optimization of linear functions over tower-field
• Performance evaluation
• Concluding remarks

slide-8
SLIDE 8

Conventional architecture 1/2 [Lutz+, CHES 2002]

• Encryption and decryption datapaths with additional selectors
  • The overhead of the selectors used for unification is nontrivial
  • False paths appear

www.chesworkshop.org/ches2002/presentations/Lutz.pdf

slide-9
SLIDE 9

Conventional architecture 2/2 [Satoh+, AC 2001]

• Unify each pair of an operation and its inverse
  • RoundKey requires InvMixColumns
  • Some MUXes in the unified operations
  • Long critical path

slide-10
SLIDE 10

Tower-field implementation

• Inversion should be performed over a tower field
  • Tower-field inversion is more efficient than direct mapping (e.g., table lookup)
• Two types of tower-field implementation
  • Type-I: only inversion is performed over the tower field
  • Type-II: all operations are performed over the tower field

          Inversion (S-box)   MixColumns / InvMixColumns
Type-I    Good                Good
Type-II   Better              Bad
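The idea behind tower-field inversion can be sketched in a few lines: an 8-bit inversion is reduced to one 4-bit inversion plus a handful of 4-bit multiplications. The field polynomials here are assumptions made for the sketch (x^4 + x + 1 for GF(2^4), and the first lambda found for which x^2 + x + lambda is irreducible); the deck does not state which tower field it uses, and the isomorphism to the AES field is omitted.

```python
POLY4 = 0b10011  # x^4 + x + 1, irreducible over GF(2) (assumed choice)


def gf16_mul(a, b):
    """Multiply two GF(2^4) elements modulo POLY4."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= POLY4
    return r


def gf16_inv(a):
    """Invert in GF(2^4) by exhaustive search (0 maps to 0)."""
    return next((x for x in range(1, 16) if gf16_mul(a, x) == 1), 0)


# Choose lambda so that x^2 + x + lambda has no root in GF(2^4).
LAMBDA = next(lam for lam in range(1, 16)
              if all(gf16_mul(x, x) ^ x ^ lam for x in range(16)))


def tower_mul(a, b):
    """Multiply in GF((2^4)^2); a byte holds (a1, a0) for the element a1*x + a0."""
    a1, a0, b1, b0 = a >> 4, a & 0xF, b >> 4, b & 0xF
    hi = gf16_mul(a1, b1) ^ gf16_mul(a1, b0) ^ gf16_mul(a0, b1)
    lo = gf16_mul(a0, b0) ^ gf16_mul(gf16_mul(a1, b1), LAMBDA)
    return (hi << 4) | lo


def tower_inv(a):
    """Invert in GF((2^4)^2) using a single GF(2^4) inversion."""
    a1, a0 = a >> 4, a & 0xF
    # Norm of a: a1^2 * lambda + a1 * a0 + a0^2, an element of GF(2^4).
    delta = gf16_mul(gf16_mul(a1, a1), LAMBDA) ^ gf16_mul(a1, a0) ^ gf16_mul(a0, a0)
    d_inv = gf16_inv(delta)
    return (gf16_mul(a1, d_inv) << 4) | gf16_mul(a0 ^ a1, d_inv)


print(all(tower_mul(a, tower_inv(a)) == 1 for a in range(1, 256)))
```

The final line checks that every nonzero element times its computed inverse gives 1 in this tower field.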

slide-11
SLIDE 11

Outline

• Introduction
• Related works
• Optimized architecture
• Optimization of linear functions over tower-field
• Performance evaluation
• Concluding remarks

slide-12
SLIDE 12

Overall architecture

• Round-based architecture
• On-the-fly key scheduler

[Figure: block diagram with a round function part and a key scheduling part. Labels: Plaintext/Ciphertext, Ciphertext/Plaintext, Initial key.]

slide-13
SLIDE 13

Round function part

• Compress the encryption and decryption datapaths by register retiming and operation reordering
  • Unify the inversion circuits of encryption and decryption
    • Without any additional selectors (i.e., no overhead)
  • Merge linear operations to reduce gate count and critical delay
    • Affine/InvAffine and MixColumns/InvMixColumns
    • At most one linear operation per round
• Type-II tower-field implementation
  • Isomorphic mappings are performed at data I/O
  • Lower-area tower-field (Inv)Affine and (Inv)MixColumns

slide-14
SLIDE 14

Register retiming and operation reordering

[Figure: encryption and decryption flows, each shown in its original and proposed form.]

slide-15
SLIDE 15

Key tricks (of decryption)

[Figure: original decryption flow from ciphertext to plaintext, with data registers.]
  Pre-round op.: AddRoundKey
  Round op.:     InvSubBytes, InvShiftRows, AddRoundKey, InvMixColumns
  Final op.:     InvSubBytes, InvShiftRows, AddRoundKey

slide-16
SLIDE 16

Key tricks (of decryption)

• Decompose InvSubBytes into InvAffine and Inversion
• Apply register retiming so that inversion is performed first in the round operations

[Figure: decryption flow from ciphertext to plaintext after the decomposition, with stages pre-round op., round op., and final op. Blocks shown: AddRoundKey, InvAffine, Inversion, InvShiftRows, InvMixColumns, and data registers.]
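The decomposition used here can be checked directly on the standard AES S-box: SubBytes is Inversion followed by Affine, so InvSubBytes is InvAffine followed by Inversion. The sketch below works in the ordinary AES polynomial basis (not the tower field used in the proposed design), and the function names are illustrative.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r


def inversion(a):
    """Multiplicative inverse computed as a^254 (0 maps to 0)."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r


def rotl8(x, n):
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def affine(x):
    """AES affine transformation (linear part plus constant 0x63)."""
    return x ^ rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 3) ^ rotl8(x, 4) ^ 0x63


def inv_affine(y):
    """Inverse of the AES affine transformation."""
    return rotl8(y, 1) ^ rotl8(y, 3) ^ rotl8(y, 6) ^ 0x05


def sub_byte(x):
    """SubBytes on one byte: Inversion, then Affine."""
    return affine(inversion(x))


def inv_sub_byte(y):
    """InvSubBytes on one byte: InvAffine, then Inversion."""
    return inversion(inv_affine(y))


print(hex(sub_byte(0x53)), hex(inv_sub_byte(sub_byte(0x53))))
```

Because both directions contain the same `inversion` block, a single inversion circuit can serve encryption and decryption once the affine parts are moved out of the way.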

slide-17
SLIDE 17

Key tricks (of decryption)

• Merge linear operations into a Unified Affine⁻¹
  • InvAffine and InvMixColumns
• Use a distinct AddRoundKey to avoid additional selectors or an InvMixColumns for the round key

[Figure: decryption flow from ciphertext to plaintext after merging, with stages pre-round op., round op., and final op. Blocks shown: AddRoundKey, InvAffine, Inversion, InvShiftRows, Unified Affine⁻¹, and data registers.]
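Merging works because both operations are linear over GF(2): applying one after the other is the same as applying a single bit matrix. A minimal sketch, assuming the standard AES polynomial basis and ignoring the affine constant and the round key (which the datapath handles separately); the helper names are hypothetical and the exact ordering in the proposed datapath is not reproduced here.

```python
import random


def gf_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r


def rotl8(x, n):
    return ((x << n) | (x >> (8 - n))) & 0xFF


def inv_affine_linear(y):
    """Linear part of InvAffine (the constant 0x05 is left out)."""
    return rotl8(y, 1) ^ rotl8(y, 3) ^ rotl8(y, 6)


INV_MIX = [[0x0E, 0x0B, 0x0D, 0x09],
           [0x09, 0x0E, 0x0B, 0x0D],
           [0x0D, 0x09, 0x0E, 0x0B],
           [0x0B, 0x0D, 0x09, 0x0E]]


def two_step(word):
    """InvAffine (linear part) on each byte, then InvMixColumns on the column."""
    col = [inv_affine_linear((word >> (8 * i)) & 0xFF) for i in range(4)]
    out = [0, 0, 0, 0]
    for i in range(4):
        for j in range(4):
            out[i] ^= gf_mul(INV_MIX[i][j], col[j])
    return sum(b << (8 * i) for i, b in enumerate(out))


# Column k of the merged 32x32 matrix is the image of the k-th unit vector.
MERGED = [two_step(1 << k) for k in range(32)]


def merged(word):
    """Apply the single merged matrix: XOR the columns selected by the input bits."""
    out = 0
    for k in range(32):
        if (word >> k) & 1:
            out ^= MERGED[k]
    return out


def check_merge(trials=1000, seed=0):
    rng = random.Random(seed)
    return all(merged(w) == two_step(w)
               for w in (rng.getrandbits(32) for _ in range(trials)))


print(check_merge())
```

In hardware the merged matrix becomes one XOR network, so the round sees one linear operation instead of two in series.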

slide-18
SLIDE 18

Resulting datapath

[Figure: resulting round datapath, with the following callouts.]
• Unified inversion without selector
• Disable inactive path
• At most one linear operation per round
• Only one 4:1 selector

slide-19
SLIDE 19

Overall architecture

• Round-based architecture
• On-the-fly key scheduler

[Figure: block diagram with a round function part and a key scheduling part. Labels: Plaintext/Ciphertext, Ciphertext/Plaintext, Initial key.]

slide-20
SLIDE 20

Key scheduling part

• The round key generator is dominant
  • Unify the encryption and decryption datapaths
  • Keep the critical delay shorter than that of the round function part by NOT unifying some XOR gates

[Figure: key scheduling datapath, distinguishing the XOR gates that are not unified from the unified components.]

slide-21
SLIDE 21

Outline

• Introduction
• Related works
• Optimized architecture
• Optimization of linear functions over tower-field
• Performance evaluation
• Concluding remarks

slide-22
SLIDE 22

Coming back to round function part

• Major components
  • Inversion
  • Linear operations
  • Bit-parallel XOR
  • Selectors
  • (Inv)ShiftRows
• Performance depends on how the inversion and the linear operations are constructed
  • Inversion: use an existing state-of-the-art design
  • Linear operations: depend on the XOR matrices

slide-23
SLIDE 23

Multiplicative-offset

• Increase the variety of possible XOR-matrix constructions
  • To find optimal XOR matrices with lower Hamming weight (HW)
• Multiply the intermediate value d_{i,j}^{(r)} by an offset value c and store c·d_{i,j}^{(r)} in the register
  • Multiplication by a fixed value is an XOR-matrix operation
  • c is taken from GF(2^8) excluding 0

Original encryption flow (simplified):
  Pre-round:  Plaintext → Iso. mapping → d_{i,j}^{(1)}
  Round:      d_{i,j}^{(r)} → Inversion, Unified Affine → d_{i,j}^{(r+1)}
  Post-round: d_{i,j}^{(11)} → Iso. mapping⁻¹ → Ciphertext
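That multiplication by a fixed value is an XOR-matrix operation can be made concrete: the matrix columns are the products of c with the basis elements, and its Hamming weight counts the ones. The sketch below uses the standard AES polynomial basis as an assumption; the deck works in a tower field, so its actual matrices and weights will differ.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r


def mul_matrix(c):
    """Rows of the 8x8 GF(2) matrix for x -> c*x; bit j of row i is entry (i, j)."""
    cols = [gf_mul(c, 1 << j) for j in range(8)]  # images of the basis 1, x, ..., x^7
    return [sum(((cols[j] >> i) & 1) << j for j in range(8)) for i in range(8)]


def apply_matrix(rows, x):
    """Each output bit is the XOR (parity) of the input bits selected by its row."""
    return sum((bin(rows[i] & x).count("1") & 1) << i for i in range(8))


def hamming_weight(rows):
    """Number of ones in the matrix."""
    return sum(bin(r).count("1") for r in rows)


# Spread of Hamming weights over all nonzero offsets c.
weights = {c: hamming_weight(mul_matrix(c)) for c in range(1, 256)}
print(min(weights.values()), max(weights.values()))
```

Since the weight changes with c, composing a linear operation with a well-chosen multiplication gives more candidate matrices to search for a low-weight one.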

slide-24
SLIDE 24

Multiplicative-offset

Proposed encryption flow (simplified):
  Pre-round:  Plaintext → Iso. mapping, Multiply c → c·d_{i,j}^{(1)}
  Round:      c·d_{i,j}^{(r)} → Inversion, Multiply c^2, Unified Affine → c·d_{i,j}^{(r+1)}
  Post-round: c·d_{i,j}^{(11)} → Multiply c^{-1}, Iso. mapping⁻¹ → Ciphertext
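One way to see why a multiplication by c^2 appears in the round: if the register holds c·d, inverting it gives c^{-1}·d^{-1}, and multiplying by c^2 restores the offset form c·d^{-1}. The check below is a sketch in the standard AES polynomial basis with illustrative function names; it verifies only this field identity, not the deck's actual datapath.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r


def inversion(a):
    """Multiplicative inverse computed as a^254 (0 maps to 0)."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r


# Precompute inverses once so the exhaustive check below stays fast.
INV = [inversion(a) for a in range(256)]


def invert_with_offset(c, stored):
    """Invert the stored value c*d, then multiply by c^2."""
    return gf_mul(gf_mul(c, c), INV[stored])


def identity_holds():
    """Check (c*d)^-1 * c^2 == c * d^-1 for every nonzero c and every d."""
    return all(invert_with_offset(c, gf_mul(c, d)) == gf_mul(c, INV[d])
               for c in range(1, 256) for d in range(256))


print(identity_holds())
```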

slide-25
SLIDE 25

Multiplicative-offset

Encryption flow with merged operations (simplified):
  Pre-round:  Plaintext → Merged mapping → c·d_{i,j}^{(1)}
  Round:      c·d_{i,j}^{(r)} → Inversion, Merged Unified Affine → c·d_{i,j}^{(r+1)}
  Post-round: c·d_{i,j}^{(11)} → Merged mapping⁻¹ → Ciphertext

Reduces the HW of the XOR matrices for linear operations by 10%

slide-26
SLIDE 26

Performance comparison

• Synthesized the proposed and conventional architectures
  • Logic synthesis: Design Compiler
  • Technology: Nangate 45-nm Open Cell Library
• 51–57% higher efficiency than the conventional ones
  • Multiplicative-offset (MO) improves efficiency by 7–9%

                   Area (GE)   Latency (ns)   Throughput (Gbps)   Efficiency (Kbps/GE)
Satoh et al.       16,628.67   24.97          5.64                339.10
Lutz et al.        28,301.33   16.20          7.90                279.18
Liu et al.         15,335.67   29.70          4.74                309.13
Mathew et al.      21,429.33   30.80          4.57                213.33
This work w/o MO   18,013.00   16.28          8.65                480.49
This work w/ MO    17,368.67   15.84          8.89                511.78
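The efficiency column is throughput divided by area. A short sketch that recomputes it from the table's own area and throughput values (small differences from the listed figures come from the rounding of the throughput column):

```python
# name: (area in GE, throughput in Gbps, listed efficiency in Kbps/GE), from the table
DESIGNS = {
    "Satoh et al.":     (16628.67, 5.64, 339.10),
    "Lutz et al.":      (28301.33, 7.90, 279.18),
    "Liu et al.":       (15335.67, 4.74, 309.13),
    "Mathew et al.":    (21429.33, 4.57, 213.33),
    "This work w/o MO": (18013.00, 8.65, 480.49),
    "This work w/ MO":  (17368.67, 8.89, 511.78),
}


def efficiency_kbps_per_ge(area_ge, throughput_gbps):
    """Throughput per gate equivalent; 1 Gbps = 1e6 Kbps."""
    return throughput_gbps * 1e6 / area_ge


for name, (area, thru, listed) in DESIGNS.items():
    computed = efficiency_kbps_per_ge(area, thru)
    print(f"{name:18s} computed {computed:7.2f}  listed {listed:7.2f}")
```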

slide-27
SLIDE 27

Evaluation of power/energy consumption

• Gate-level timing simulation with back-annotation to estimate power consumption
  • Taking glitch effects into account
• Our architecture achieved the lowest power/energy
  • MO achieves a further reduction of 7–24%

Power consumption and power-latency (PL) product at encryption

                   Power (uW) @ 100 MHz   PL product
Satoh et al.       902                    22,523
Lutz et al.        735                    11,907
Liu et al.         1,010                  29,997
Mathew et al.      1,390                  42,812
This work w/o MO   569                    9,263
This work w/ MO    465                    7,366
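The PL product is the power multiplied by the latency from the performance comparison table. The sketch below recomputes it from the two tables' values and also derives the relative saving from MO; the dictionary layout is just for illustration.

```python
# name: (power in uW at 100 MHz, latency in ns, listed PL product)
ROWS = {
    "Satoh et al.":     (902, 24.97, 22523),
    "Lutz et al.":      (735, 16.20, 11907),
    "Liu et al.":       (1010, 29.70, 29997),
    "Mathew et al.":    (1390, 30.80, 42812),
    "This work w/o MO": (569, 16.28, 9263),
    "This work w/ MO":  (465, 15.84, 7366),
}


def pl_product(power_uw, latency_ns):
    """Power-latency product: power times the time for one block."""
    return power_uw * latency_ns


for name, (power, latency, listed) in ROWS.items():
    print(f"{name:18s} computed {pl_product(power, latency):9.1f}  listed {listed}")

# Relative reduction obtained by the multiplicative offset (MO).
p_without, _, pl_without = ROWS["This work w/o MO"]
p_with, _, pl_with = ROWS["This work w/ MO"]
power_saving = 1 - p_with / p_without
pl_saving = 1 - pl_with / pl_without
print(f"MO saving: power {power_saving:.1%}, PL product {pl_saving:.1%}")
```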

slide-28
SLIDE 28

Encryption-only architecture

• Designed encryption-only hardware based on our philosophy
  • Compared with a representative open-source IP (SASEBO IP) and a state-of-the-art one [ARITH 2016]
• Our architecture is 58–64% more efficient
  • Also advantageous in power/energy consumption

                      Area (GE)   Latency (ns)   Thru (Gbps)   Thru/GE (Kbps/GE)   Power (uW)   PL product
SASEBO IP   Table     23,085.00   11.64          12.00         519.66              352          4,097
SASEBO IP   Comp      11,431.67   23.04          6.06          530.16              513          11,820
ARITH 2016  Type-I    12,108.33   23.87          5.90          487.16              655          14,266
ARITH 2016  Type-II   13,249.33   21.78          6.46          487.92              755          18,022
This work             12,127.00   13.97          10.08         831.10              279          3,898

slide-29
SLIDE 29

Take-away messages

• Round-based implementations of block ciphers may be essential for evaluating their performance
  • We should be conscious of modes of operation, applications, etc.
  • Optimizing the round datapath is valuable and essential
• Feedback to block cipher design?
  • MDS matrices optimized with respect to cryptanalysis ≠ optimized for implementation (area and latency)
    • But they can still be optimized for implementation at implementation time
  • An inversion-based 8-bit S-box leaves a lot of room for architectural/design optimization

slide-30
SLIDE 30

References

• R. Ueno et al., “A High Throughput/Gate AES Hardware Architecture by Compressing Encryption and Decryption Datapaths – Toward Efficient CBC-Mode Implementation,” CHES 2016.
• R. Ueno et al., “High Throughput/Gate AES Hardware Architectures Based on Datapath Compression,” IEEE Trans. Comput., 2019. (Early Access)