SLIDE 1
Hardware Implementation of Block Cipher: Case Study Using AES
Rei Ueno, Tohoku University
Acknowledgments: Naofumi Homma (Tohoku Univ.), Takafumi Aoki (Tohoku Univ.), Sumio Morioka (Interstellar Technologies, Inc.), Noriyuki Miura (Kobe Univ.), Kohei
SLIDE 2
SLIDE 3
This talk
■ Given a symmetric-key cipher, how does a hardware designer implement and optimize it?
□ For practical applications:
- With higher efficiency, unified encryption/decryption, on-the-fly key scheduling, and no block-wise pipelining
□ Case study using AES!
■ Disclaimer
□ Some modern lightweight ciphers are already optimized, and they avoid some of the concerns in implementing AES
□ But I still believe that optimizations of AES implementations can feed back into cipher design
SLIDE 4
Hardware architectures of block cipher
[Figure: design space plotted as area versus time for one block encryption. Design points: serialized, round-based, unrolled. Arrows: "Resource sharing" and "Datapath replication".]
SLIDE 5
Hardware architectures of block cipher
[Figure: the same area versus time plot. Design points: byte-serial, round-based, unrolled. Arrows: "Resource sharing" and "Datapath replication". Additional labels: "Efficient hardware", "Pipelining", "Datapath optimization".]
SLIDE 6
For practical hardware implementation
■ Block-chaining modes have been widely deployed
□ CBC, CMAC, CCM, ...
■ (Un)parallelizability: an issue for block-wise pipelining
□ AES hardware achieves 53 Gbps, but only in parallelizable modes [Mathew+ JSSC2011]
□ Higher throughput ≠ lower latency
■ Both encryption and decryption operations
■ Importance of on-the-fly key scheduling
□ Off-the-fly (precomputed) key scheduling requires additional memory to store the expanded keys
□ Latency for calculating round keys is non-negligible if we use AES with key-tweakable modes
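The chaining dependency behind this slide can be sketched in a few lines of Python. This is only an illustration: `toy_block_encrypt` is a hypothetical stand-in built from SHA-256 (not AES, not invertible, not secure), and the function names are assumptions. It shows that each CBC block call needs the previous ciphertext block, whereas CTR block calls are independent of one another.

```python
import hashlib

BLOCK = 16  # bytes per block, as in AES


def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    """Hypothetical stand-in for one block-cipher call (NOT AES, NOT secure)."""
    return hashlib.sha256(key + block).digest()[:BLOCK]


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def cbc_encrypt(key: bytes, iv: bytes, blocks: list) -> list:
    out, prev = [], iv
    for p in blocks:
        # The input of this call depends on the previous ciphertext block,
        # so the calls cannot be overlapped in a block-wise pipeline.
        prev = toy_block_encrypt(key, xor(p, prev))
        out.append(prev)
    return out


def ctr_encrypt(key: bytes, nonce: bytes, blocks: list) -> list:
    # Each keystream block depends only on the counter value,
    # so all calls are independent and can be pipelined.
    return [xor(p, toy_block_encrypt(key, nonce + i.to_bytes(8, "big")))
            for i, p in enumerate(blocks)]
```

Changing the first plaintext block changes every CBC ciphertext block but only the first CTR block, which is why per-block latency, not pipelined throughput, limits chaining modes.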
SLIDE 7
Outline
■ Introduction
■ Related works
■ Optimized architecture
■ Optimization of linear functions over tower-field
■ Performance evaluation
■ Concluding remarks
SLIDE 8
Conventional architecture 1/2 [Lutz+, CHES 2002]
■ Enc and Dec datapaths with additional selectors
□ Overhead of selectors for unification is nontrivial
□ False paths appear
www.chesworkshop.org/ches2002/presentations/Lutz.pdf
SLIDE 9
Conventional architecture 2/2 [Satoh+, AC 2001]
■ Unify each pair of an operation and its inverse
□ RoundKey requires InvMixColumns
□ Some MUXs in unified operations
□ Long critical path
SLIDE 10
Tower-field implementation
■ Inversion should be performed over a tower field
□ Tower-field inversion is more efficient than direct mapping (e.g., table lookup)
■ Two types of tower-field implementation
□ Type-I: only inversion is performed over the tower field
□ Type-II: all operations are performed over the tower field

          Inversion (S-box)   MixColumns / InvMixColumns
Type-I    Good                Good
Type-II   Better              Bad
SLIDE 11
Outline
■ Introduction
■ Related works
■ Optimized architecture
■ Optimization of linear functions over tower-field
■ Performance evaluation
■ Concluding remarks
SLIDE 12
Overall architecture
■ Round-based architecture
■ On-the-fly key scheduler
[Figure: block diagram with a round function part and a key scheduling part. I/O labels: Ciphertext/Plaintext, Plaintext/Ciphertext, Initial key.]
SLIDE 13
Round function part
■ Compress encryption and decryption datapaths by register retiming and operation reordering
□ Unify inversion circuits in encryption and decryption
- Without any additional selectors (i.e., overheads)
□ Merge linear operations to reduce gates and critical delay
- Affine/InvAffine and MixColumns/InvMixColumns
- At most one linear operation per round
■ Type-II tower-field implementation
□ Isomorphic mappings are performed at data I/O
□ Lower-area tower-field (Inv)Affine and (Inv)MixColumns
SLIDE 14
Register retiming and operation reordering
[Figure: encryption and decryption datapaths, each shown as original versus proposed.]
SLIDE 15
Key tricks (of decryption)
[Figure: decryption flow from ciphertext to plaintext. Blocks in order: AddRoundKey, InvSubBytes, InvShiftRows, AddRoundKey, InvMixColumns, InvSubBytes, InvShiftRows, AddRoundKey. Stage labels: Pre-round op., Round op., Final op. Data registers separate the stages.]
SLIDE 16
Key tricks (of decryption)
■ Decompose InvSubBytes into InvAffine and Inversion
■ Apply register retiming so that inversion is performed first in the round operation
[Figure: decryption flow from ciphertext to plaintext after decomposition and retiming. Blocks: AddRoundKey, InvShiftRows, AddRoundKey, InvMixColumns, InvShiftRows, AddRoundKey, plus separate InvAffine and Inversion blocks. Stage labels: Pre-round op., Round op., Final op., with data registers between them.]
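The decomposition on this slide can be checked with a small Python sketch. It assumes the standard AES polynomial basis (modulus 0x11B), not the tower-field representation used in the talk, and the function names are illustrative. SubBytes is Affine after Inversion, and InvSubBytes is Inversion after InvAffine.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1 (0x11B)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def gf_inv(a: int) -> int:
    """Inverse computed as a^254; this maps 0 to 0, as AES requires."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r


def rotl8(x: int, n: int) -> int:
    return ((x << n) | (x >> (8 - n))) & 0xFF


def affine(x: int) -> int:
    """AES affine transformation (linear part plus constant 0x63)."""
    return x ^ rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 3) ^ rotl8(x, 4) ^ 0x63


def inv_affine(y: int) -> int:
    """Inverse of the AES affine transformation."""
    return rotl8(y, 1) ^ rotl8(y, 3) ^ rotl8(y, 6) ^ 0x05


def sub_byte(x: int) -> int:
    return affine(gf_inv(x))        # SubBytes = Affine after Inversion


def inv_sub_byte(y: int) -> int:
    return gf_inv(inv_affine(y))    # InvSubBytes = Inversion after InvAffine
```

Because both directions share the same `gf_inv`, only the affine parts differ, which is what makes a single unified inversion circuit possible.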
SLIDE 17
Key tricks (of decryption)
■ Merge linear operations into Unified affine⁻¹
□ InvAffine and InvMixColumns
■ Distinct AddRoundKey to avoid additional selectors or InvMixColumns for RoundKey
[Figure: decryption flow from ciphertext to plaintext with the merged operation. Blocks: AddRoundKey, InvShiftRows, AddRoundKey, Unified affine⁻¹, InvShiftRows, AddRoundKey, InvAffine, Inversion, Inversion. Stage labels: Pre-round op., Round op., Final op., with data registers between them.]
SLIDE 18
Resulting datapath
[Figure: resulting datapath. Annotations: unified inversion without selector; disable inactive path; at most one linear operation per round; only one 4:1 selector.]
SLIDE 19
Overall architecture
■ Round-based architecture
■ On-the-fly key scheduler
[Figure: block diagram with a round function part and a key scheduling part. I/O labels: Ciphertext/Plaintext, Plaintext/Ciphertext, Initial key.]
SLIDE 20
Key scheduling part
■ Round key generator is dominant
□ Unify encryption and decryption datapaths
□ Make the critical delay shorter than that of the round function part by NOT unifying some XOR gates
[Figure: key scheduling datapath, distinguishing the unified components from the XOR gates that are not unified.]
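As a rough illustration of on-the-fly key scheduling in both directions, the following Python sketch steps the standard AES-128 key expansion forward (encryption) and backward (decryption) one round key at a time. It is plain software in the AES polynomial basis, not the unified tower-field circuit of the talk, and the function names are assumptions.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo the AES polynomial 0x11B."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def _sbox_entry(x: int) -> int:
    inv = 1
    for _ in range(254):            # x^254 is the inverse (and maps 0 to 0)
        inv = gf_mul(inv, x)
    rot = lambda v, n: ((v << n) | (v >> (8 - n))) & 0xFF
    return inv ^ rot(inv, 1) ^ rot(inv, 2) ^ rot(inv, 3) ^ rot(inv, 4) ^ 0x63


SBOX = [_sbox_entry(x) for x in range(256)]


def _g(word: list, rcon: int) -> list:
    """SubWord(RotWord(word)) with the round constant added to the first byte."""
    t = [SBOX[b] for b in word[1:] + word[:1]]
    t[0] ^= rcon
    return t


def _xor(a: list, b: list) -> list:
    return [x ^ y for x, y in zip(a, b)]


def next_round_key(rk: list, rcon: int) -> list:
    """Encryption direction: derive round key r+1 from round key r (16 bytes)."""
    w = [list(rk[4 * i:4 * i + 4]) for i in range(4)]
    n0 = _xor(w[0], _g(w[3], rcon))
    n1 = _xor(w[1], n0)
    n2 = _xor(w[2], n1)
    n3 = _xor(w[3], n2)
    return n0 + n1 + n2 + n3


def prev_round_key(rk: list, rcon: int) -> list:
    """Decryption direction: recover round key r from round key r+1."""
    n = [list(rk[4 * i:4 * i + 4]) for i in range(4)]
    w3 = _xor(n[3], n[2])           # undo the chained XORs, last word first
    w2 = _xor(n[2], n[1])
    w1 = _xor(n[1], n[0])
    w0 = _xor(n[0], _g(w3, rcon))
    return w0 + w1 + w2 + w3


def round_constants() -> list:
    """Rcon for rounds 1..10: successive doublings of 0x01 in GF(2^8)."""
    rc, out = 1, []
    for _ in range(10):
        out.append(rc)
        rc = gf_mul(rc, 2)
    return out
```

Only one 128-bit round key is held at a time, so no memory for the expanded key is needed; decryption starts from the last round key and walks backward.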
SLIDE 21
Outline
■ Introduction
■ Related works
■ Optimized architecture
■ Optimization of linear functions over tower-field
■ Performance evaluation
■ Concluding remarks
SLIDE 22
Coming back to round function part
■ Major components
□ Inversion
□ Linear operations
□ Bit-parallel XOR
□ Selectors
□ (Inv)ShiftRows
■ Performance depends on the construction of the inversion and the linear operations
□ Inversion: use a state-of-the-art adoptable one
□ Linear operations: depend on XOR matrices
SLIDE 23
Multiplicative-offset
■ Increase the variety of constructions of XOR matrices
□ To find optimal XOR matrices with lower Hamming weights (HWs)
■ Multiply the intermediate value $d_{i,j}^{(r)}$ by an offset value $c$ and store $c\,d_{i,j}^{(r)}$ in the register
□ Multiplication by a fixed value is an XOR matrix operation
□ $c$ is taken from $GF(2^8)$ excluding 0
[Figure: original encryption flow (simplified). Stages: Pre-round, Round, Post-round. Blocks: Iso. mapping, Inversion, Unified Affine, Iso. mapping⁻¹. Register values: $d_{i,j}^{(1)}$, $d_{i,j}^{(r)}$, $d_{i,j}^{(r+1)}$, $d_{i,j}^{(11)}$. Input: Plaintext. Output: Ciphertext.]
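The statement that multiplication by a fixed value is an XOR matrix operation can be made concrete with a short Python sketch. It assumes the AES polynomial basis (modulus 0x11B) for illustration, whereas the talk works in a tower field; the helper names are hypothetical. The Hamming weight of the matrix counts its 1 entries, a rough proxy for XOR-gate cost.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo the AES polynomial 0x11B."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def mul_matrix(c: int) -> list:
    """8x8 matrix over GF(2) for x -> c*x; row i is a bitmask over input bits."""
    cols = [gf_mul(c, 1 << j) for j in range(8)]   # image of each basis element
    return [sum(((cols[j] >> i) & 1) << j for j in range(8)) for i in range(8)]


def apply_matrix(rows: list, x: int) -> int:
    """Output bit i is the XOR (parity) of the input bits selected by row i."""
    return sum((bin(rows[i] & x).count("1") & 1) << i for i in range(8))


def hamming_weight(rows: list) -> int:
    """Number of 1 entries in the matrix."""
    return sum(bin(r).count("1") for r in rows)
```

Since the product of two such matrices is again an XOR matrix, a constant multiplication can be folded into a neighboring linear operation, and different choices of c give different merged matrices to search over.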
SLIDE 24
Multiplicative-offset
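[Figure: proposed encryption flow (simplified). Stages: Pre-round, Round, Post-round. Blocks: Iso. mapping, Multiply $c$, Inversion, Unified Affine, Multiply $c^2$, Multiply $c^{-1}$, Iso. mapping⁻¹. Register values: $c\,d_{i,j}^{(1)}$, $c\,d_{i,j}^{(r)}$, $c\,d_{i,j}^{(r+1)}$, $c\,d_{i,j}^{(11)}$. Input: Plaintext. Output: Ciphertext.]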
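One way to read the "Multiply c²" block: inverting the stored value c·d yields c⁻¹·d⁻¹, so a multiplication by c² restores the offset form c·d⁻¹. The Python sketch below checks this identity. It assumes the AES polynomial basis rather than the tower field of the talk, and `offset_inversion` is an illustrative name.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo the AES polynomial 0x11B."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def gf_inv(a: int) -> int:
    """Inverse computed as a^254; maps 0 to 0."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r


def offset_inversion(cd: int, c: int) -> int:
    """Take the stored value c*d and return c*d^-1.

    (c*d)^-1 = c^-1 * d^-1, so multiplying by c^2 restores the offset c.
    """
    return gf_mul(gf_mul(c, c), gf_inv(cd))
```

Both constant multiplications are linear, so in hardware they can be merged into the adjacent linear operations instead of being separate blocks.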
SLIDE 25
Multiplicative-offset
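[Figure: encryption flow with merged operations, captioned "Original encryption flow (simplified)" on the slide. Stages: Pre-round, Round, Post-round. Blocks: Merged mapping, Inversion, Merged Unified Affine, Merged mapping⁻¹. Register values: $c\,d_{i,j}^{(1)}$, $c\,d_{i,j}^{(r)}$, $c\,d_{i,j}^{(r+1)}$, $c\,d_{i,j}^{(11)}$. Input: Plaintext. Output: Ciphertext.]
■ Reduces the HW of the XOR matrices for linear operations by 10%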
SLIDE 26
Performance comparison
■ Synthesized the proposed and conventional architectures
□ Logic synthesis: Design Compiler
□ Technology: Nangate 45-nm Open Cell Library
■ 51 to 57% higher efficiency than conventional ones
□ Multiplicative-offset (MO) improves efficiency by 7 to 9%

                    Area (GE)    Latency (ns)   Throughput (Gbps)   Efficiency (Kbps/GE)
Satoh et al.        16,628.67    24.97          5.64                339.10
Lutz et al.         28,301.33    16.20          7.90                279.18
Liu et al.          15,335.67    29.70          4.74                309.13
Mathew et al.       21,429.33    30.80          4.57                213.33
This work w/o MO    18,013.00    16.28          8.65                480.49
This work w/ MO     17,368.67    15.84          8.89                511.78
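The efficiency column appears to be throughput divided by area. A small Python check, with the table values copied as listed and an illustrative function name, reproduces it to within rounding:

```python
# (area in GE, throughput in Gbps, listed efficiency in Kbps/GE)
TABLE = {
    "Satoh et al.":     (16628.67, 5.64, 339.10),
    "Lutz et al.":      (28301.33, 7.90, 279.18),
    "Liu et al.":       (15335.67, 4.74, 309.13),
    "Mathew et al.":    (21429.33, 4.57, 213.33),
    "This work w/o MO": (18013.00, 8.65, 480.49),
    "This work w/ MO":  (17368.67, 8.89, 511.78),
}


def efficiency_kbps_per_ge(area_ge: float, throughput_gbps: float) -> float:
    """Throughput per gate: 1 Gbps = 1e6 Kbps."""
    return throughput_gbps * 1e6 / area_ge


for name, (area, thru, listed) in TABLE.items():
    computed = efficiency_kbps_per_ge(area, thru)
    print(f"{name:18s} computed {computed:7.2f}  listed {listed:7.2f}")
```

The small residual differences are consistent with the throughput column being rounded to two decimals.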
SLIDE 27
Evaluation of power/energy consumption
■ Gate-level timing simulation with back-annotation for estimating power consumption
□ Taking glitch effects into account
■ Our architecture achieved the lowest power/energy
□ MO achieves a further reduction of 7 to 24%

Power consumption and power-latency (PL) product at encryption
                    Power [uW] @ 100 MHz   PL product
Satoh et al.        902                    22,523
Lutz et al.         735                    11,907
Liu et al.          1,010                  29,997
Mathew et al.       1,390                  42,812
This work w/o MO    569                    9,263
This work w/ MO     465                    7,366
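The PL product matches power multiplied by the latency from the performance-comparison table. The Python sketch below recomputes it; the values are copied as listed, and reading the unit as uW·ns is an assumption, since the slide gives none.

```python
# (power in uW at 100 MHz, latency in ns, listed PL product)
ROWS = {
    "Satoh et al.":     (902, 24.97, 22523),
    "Lutz et al.":      (735, 16.20, 11907),
    "Liu et al.":       (1010, 29.70, 29997),
    "Mathew et al.":    (1390, 30.80, 42812),
    "This work w/o MO": (569, 16.28, 9263),
    "This work w/ MO":  (465, 15.84, 7366),
}


def pl_product(power_uw: float, latency_ns: float) -> int:
    """Power-latency product, rounded to the nearest integer."""
    return round(power_uw * latency_ns)


for name, (power, latency, listed) in ROWS.items():
    print(f"{name:18s} computed {pl_product(power, latency):6d}  listed {listed:6d}")
```

A lower PL product means less energy spent per encrypted block, so it rewards designs that are both low-power and low-latency.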
SLIDE 28
Encryption only architecture
■ Designed encryption-only hardware based on our philosophy
□ Compared with a representative open-source IP (SASEBO IP) and a state-of-the-art one [ARITH 2016]
■ Our architecture is 58 to 64% more efficient
□ Also advantageous in power/energy consumption

                        Area (GE)    Latency (ns)   Thru (Gbps)   Thru/GE   Power (uW)   PL product
SASEBO IP (Table)       23,085.00    11.64          12.00         519.66    352          4,097
SASEBO IP (Comp)        11,431.67    23.04          6.06          530.16    513          11,820
ARITH 2016 (Type-I)     12,108.33    23.87          5.90          487.16    655          14,266
ARITH 2016 (Type-II)    13,249.33    21.78          6.46          487.92    755          18,022
This work               12,127.00    13.97          10.08         831.10    279          3,898
SLIDE 29
Take-away messages
■ Round-based implementation of block ciphers may be essential for evaluating their performance
□ We should be conscious of modes of operation, applications, etc.
□ Optimizing the round datapath is valuable and essential
■ Feedback to block cipher design?
□ MDS matrices optimized for cryptanalysis ≠ matrices optimized for implementation (area and latency)
- But they can be optimized for implementation at implementation time
□ An inversion-based 8-bit S-box leaves much room for architectural/design optimization
SLIDE 30
References
■ R. Ueno et al., “A High Throughput/Gate AES Hardware Architecture by Compressing Encryption and Decryption Datapaths: Toward Efficient CBC-Mode Implementation,” CHES 2016.
■ R. Ueno et al., “High Throughput/Gate AES Hardware Architectures Based on Datapath Compression,” IEEE Trans. Comput., 2019. (Early Access)