AES and other secret key implementations Ingrid Verbauwhede - - PDF document


Ingrid Verbauwhede, K.U.Leuven - COSIC 1

KUL - COSIC ECRYPT Summer School - 1 Albena, May 2011

AES and other secret key implementations

Ingrid Verbauwhede ingrid.verbauwhede-at-esat.kuleuven.be K.U.Leuven, ESAT- SCD - COSIC Computer Security and Industrial Cryptography Acknowledgements: Current and former Ph.D. students at UCLA and K.U.Leuven


Outline & Goal

  • Crypto engineering for secret key algorithms

    – Design parameters
    – DES
    – Modes of operation
    – AES
    – Light weight crypto


Design Parameters

Embedded security: Area, delay, power, energy


Crypto engineering everywhere

  • Continuum between software and hardware
    – ASIC (microcode)
    – FPGA
    – fully programmable processor

Everything is always connected everywhere


Embedded Security

NEED BOTH

  • Efficient, light-weight implementation
    – Within power, area, timing budgets
    – Public key: 1024 bits RSA on 8 bit μC and 100 μW
    – Public key on a passive RFID tag
  • Trustworthy implementation
    – Resistant to attacks
    – Active attacks: probing, power glitches, JTAG scan chain
    – Passive attacks: side channel attacks, including power, timing and electromagnetic leaks


Cost definition

  • Area
  • Time
  • Power, Energy
  • Physical Security
  • NRE (Non Recurring Engineering) cost

Design parameters

  • Speed or throughput:
    – HW: Gbits/sec or Mbits/sec/slice
    – SW: Cycles/byte, independent of clock frequency
  • Area:
    – HW: mm² (gate or transistor count)
    – SW: memory footprint
  • Power or energy consumption:
    – Power (Watts) for cooling or transmission (RFID)
    – Energy (Joule): battery operated devices
  • Security: difficult to measure, but we want it
    – Entropy, leakage functions?
    – Measurements until disclosure?


Throughput: Real-time

  • Extremely high throughput (Radar or fiber optics)
  • One operator (= hardware unit, e.g. adder, shifter, register) for each operation (= algorithmic, e.g. addition, multiplication, delay)

clock frequency = sample frequency

  • Most designs: time multiplexing

clock frequency ≠ sample frequency
clock frequency / sample frequency = number of clock cycles available for the job
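The cycle budget under time multiplexing is a single division. A minimal sketch, where the 100 MHz clock and the rate of one million blocks per second are hypothetical numbers chosen only for illustration:

```python
def cycles_per_sample(clock_hz: float, sample_hz: float) -> float:
    """Clock cycles available to process one sample (or one data block)."""
    return clock_hz / sample_hz


# Hypothetical numbers: a 100 MHz clock and one million blocks per second.
budget = cycles_per_sample(100e6, 1e6)
print(f"{budget:.0f} clock cycles available per block")
```

A budget of 1 corresponds to the real-time case above (one operator per operation); a larger budget allows hardware units to be shared over several clock cycles.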


SW: cycles per byte

  • “independent” of clock frequency or machine
  • Size of packet matters
  • “match” of algorithm to architecture

[Figure: cycles/byte versus message size (8, 64, 4096 bytes). Source: http://bench.cr.yp.to/results-sha3.html]


Power density problem

  • Intel S. Borkar power density problem

[Author: S. Borkar, Intel]

Cooling!!


Low Energy: battery capacity

  • Rabaey slide battery capacity

One AAA battery: 1300 to 5000 Joule


Power and Energy are not the same!

  • Power = P = I x V (current x voltage) (= Watt)
    – instantaneous
    – Typically checked for cooling or for peak performance
  • Energy = Power x execution time (= Joule)
    – Battery content is expressed in Joules
    – Gives an idea of how many Joules it takes to get the job done

Low power processor ≠ low energy solution!
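The distinction can be made concrete with Energy = Power x execution time. A small sketch; the two processors and their numbers are hypothetical and only illustrate the point:

```python
def energy_joule(power_watt: float, seconds: float) -> float:
    """Energy = power x execution time."""
    return power_watt * seconds


# Hypothetical job on two hypothetical processors.
low_power_energy = energy_joule(0.1, 10.0)  # 100 mW, but needs 10 s
fast_energy = energy_joule(1.0, 0.1)        # 1 W, but done in 0.1 s

print(low_power_energy, fast_energy)
```

In this made-up case the processor that draws ten times less power consumes ten times more energy for the same job, because it runs a hundred times longer.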


Heat and parallelism

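Power (heat) of a single core with memory M, processor P and switched capacitance C:

$P_{mono} = C V^2 f$ (Watt)

Four cores, each with memory M/4, processor P/4 and capacitance C/4, each running at f/4:

$4 \cdot \frac{C}{4} \cdot V^2 \cdot \frac{f}{4} = \frac{P_{mono}}{4}$

but since $f \sim V$, the power can be even $P_{mono}/4^3$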

Reduce power = reduce WASTE !! TREND: MULTI-CORE!!
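The scaling argument can be checked numerically. This is an idealized sketch: it assumes the supply voltage can be lowered in proportion to the frequency (f ~ V), which real technologies only allow over a limited range, and the normalized values C = V = f = 1 are placeholders:

```python
def dynamic_power(c: float, v: float, f: float) -> float:
    """Dynamic power P = C * V^2 * f."""
    return c * v ** 2 * f


def multicore_power(c: float, v: float, f: float, n: int, scale_voltage: bool) -> float:
    """n cores, each with capacitance c/n and clock f/n (same total throughput)."""
    v_core = v / n if scale_voltage else v  # idealized: V scales with f
    return n * dynamic_power(c / n, v_core, f / n)


p_mono = dynamic_power(1.0, 1.0, 1.0)
p_quad = multicore_power(1.0, 1.0, 1.0, 4, scale_voltage=False)
p_quad_scaled = multicore_power(1.0, 1.0, 1.0, 4, scale_voltage=True)

print(p_mono / p_quad)         # ratio without voltage scaling
print(p_mono / p_quad_scaled)  # ratio with idealized voltage scaling
```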


Intermezzo: standard cell based design


Logic Design Activities

  • Logic and FSM synthesis
    – State minimization, coding
    – Multilevel Logic Optimisation
  • Technology Mapping
    – Functions to library cells
    – Minimal Area for given delay
  • Timing Verification
    – Estimate wiring load C
    – Critical logic path
  • Layout
    – P&R
    – C extraction from wiring ...

[Figure: design flow from VHDL through logic synthesis (Synopsys) to layout. Annotations: area (#literals) versus delay (logic depth), cells such as aoi and ff, extraction feeding timing verification, timing closure]


Standard Cell Layout

  • Std. Cell
  • Std. Cell Place & Route (RT-Module)

[Figure labels: routing channel, cell row]

(Courtesy : Tanner Tools)


Standard Cell Zoom In

[Figure: layout of a standard cell, with vdd and vss rails]


Module Generation

  • For data-path operators: structure is in bit-slices
  • Computer generated layout as function of wordlength
  • Compact, predictable IP

[Figure labels: power, instruction, clock, data]


Standard Cell and Module

Courtesy: J. Van Campenhout RUG

[Figure labels: datapath, standard cell, random logic]


Start with easy one: Block cipher - DES


Symmetric key: DES

  • DES = Data Encryption Standard
  • FIPS Standard 46 effective in July 1977: US government standard for sensitive but unclassified data
  • Re-affirmed in 1983, 1988, 1993, 1999 (FIPS 46-3)
  • July 26, 2004: withdrawal of FIPS 46-3 proposed (withdrawn May 19, 2005): use TDEA or AES
  • TDEA = Triple DES encryption algorithm – NIST SP 800-67

[Figure: DES block with 64-bit plaintext (Pi) in, 64-bit ciphertext (Ci) out and a 64-bit key input (key = 56 bits + 8 parity bits)]


TDEA

  • Triple DES Encryption Algorithm, NIST Spec. Pub. 800-67 (May 2004)
  • Three Key options:
    – K1, K2, K3 different
    – K1=K3, K2 different
    – K1=K2=K3, backward compatible with single DES
  • two-key triple DES: until 2009
  • three-key triple DES: until 2030

[Figure: 64-bit plaintext (Pi) passes through DES with K1, DES-1 with K2 and DES with K3 to give the 64-bit ciphertext (Ci): $C_i = DES_{K3}(DES^{-1}_{K2}(DES_{K1}(P_i)))$]
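The encrypt-decrypt-encrypt composition and its backward compatibility can be sketched with any invertible block cipher. The code below is a sketch only: `toy_enc`/`toy_dec` are a hypothetical, insecure 64-bit toy cipher standing in for DES so that the example stays short.

```python
MASK = (1 << 64) - 1


def rotl(x: int, r: int) -> int:
    return ((x << r) | (x >> (64 - r))) & MASK


def rotr(x: int, r: int) -> int:
    return ((x >> r) | (x << (64 - r))) & MASK


def toy_enc(k: int, x: int) -> int:
    """Toy stand-in for DES encryption (NOT DES, not secure)."""
    return (rotl(x ^ k, 13) + k) & MASK


def toy_dec(k: int, y: int) -> int:
    """Inverse of toy_enc."""
    return rotr((y - k) & MASK, 13) ^ k


def tdea_encrypt(k1: int, k2: int, k3: int, p: int) -> int:
    """Encrypt with K1, decrypt with K2, encrypt with K3."""
    return toy_enc(k3, toy_dec(k2, toy_enc(k1, p)))


def tdea_decrypt(k1: int, k2: int, k3: int, c: int) -> int:
    return toy_dec(k1, toy_enc(k2, toy_dec(k3, c)))


p = 0x0123456789ABCDEF
c = tdea_encrypt(11, 22, 33, p)
print(tdea_decrypt(11, 22, 33, c) == p)
# With K1 = K2 = K3 the first two stages cancel: single encryption remains.
print(tdea_encrypt(7, 7, 7, p) == toy_enc(7, p))
```

The middle stage being a decryption is what makes the K1=K2=K3 option interoperate with a single-cipher peer.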


DES = Feistel cipher

Encryption round i: $L_i = R_{i-1}$, $R_i = L_{i-1} \oplus f(R_{i-1}, K_i)$

Decryption round i: $R_{i-1} = L_i$, $L_{i-1} = R_i \oplus f(L_i, K_i)$

  • DES has 16 rounds + initial and final permutation
  • Basic cipher structure is Feistel cipher
    – other examples of Feistel: FEAL, Kasumi (IDEA is not Feistel: it uses the related Lai-Massey structure)
  • Hardware: encryption = decryption! (different for AES)
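The "encryption = decryption" property can be sketched in a few lines. This is a generic Feistel network on two 32-bit halves; the round function `f` and the round keys are hypothetical toy values, not the DES f function:

```python
def f(r: int, k: int) -> int:
    """Toy round function; it does not need to be invertible."""
    return ((r * 0x9E3779B1) ^ k ^ (r >> 7)) & 0xFFFFFFFF


def feistel(left: int, right: int, round_keys: list) -> tuple:
    """Run all rounds, then swap the halves so the same routine also decrypts."""
    for k in round_keys:
        left, right = right, left ^ f(right, k)
    return right, left


keys = [0x0F1E2D3C, 0x4B5A6978, 0x8796A5B4, 0xC3D2E1F0]  # hypothetical round keys
plain = (0x01234567, 0x89ABCDEF)

cipher = feistel(plain[0], plain[1], keys)
# Decryption: the same datapath, with the round keys in reverse order.
recovered = feistel(cipher[0], cipher[1], list(reversed(keys)))
print(recovered == plain)
```

Only the key order changes between the two directions, which is why a Feistel cipher needs no inverse of `f` in hardware.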


DES- f function

  • Expansion E: 32b-to-48b permutation (wiring & bit duplication) of Ri-1
  • EXOR with the 48-bit round key Ki
  • Input of S-boxes: 8x6b
  • Si = 6b-to-4b non-linear substitution (ROM or logic based look-up table), S1 to S8
  • Output of S-boxes: 8x4b
  • Permutation P: 32b-to-32b permutation (wiring), giving the 32-bit f(Ri-1, Ki)
  • Because of Feistel: no need for f^-1 (different for AES)


DES Key schedule

  • Initial key K (64 bits)
  • PC1: permute and drop 8 bits (64 to 56 bits, split over registers C and D)
  • C&D: rotate left 1 or 2 bits each round; DECRYPTION: rotate right
  • PC2: permute and select 48 output bits (56 to 48 bits): round key Ki
  • C&D left/right shift registers: encryption & decryption HW
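The rotate-left/rotate-right behaviour of the C and D registers can be sketched as follows. The per-round shift amounts are the standard DES schedule; PC1 and PC2 are omitted, and the starting register values are arbitrary placeholders:

```python
# DES left-shift schedule for rounds 1..16 (the shifts add up to 28).
SHIFTS = [1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1]
MASK28 = (1 << 28) - 1


def rotl28(x: int, n: int) -> int:
    return ((x << n) | (x >> (28 - n))) & MASK28


def rotr28(x: int, n: int) -> int:
    return ((x >> n) | (x << (28 - n))) & MASK28


def cd_states_encrypt(c: int, d: int) -> list:
    """(C, D) register contents feeding PC2 in rounds 1..16."""
    states = []
    for s in SHIFTS:
        c, d = rotl28(c, s), rotl28(d, s)
        states.append((c, d))
    return states


def cd_states_decrypt(c: int, d: int) -> list:
    """Same starting registers, rotating right: states for rounds 16..1."""
    states = [(c, d)]  # after 28 left shifts (C16, D16) equals (C0, D0)
    for s in reversed(SHIFTS[1:]):
        c, d = rotr28(c, s), rotr28(d, s)
        states.append((c, d))
    return states


c0, d0 = 0x0F0F0F0, 0x5555AAA  # placeholder 28-bit register contents
enc = cd_states_encrypt(c0, d0)
dec = cd_states_decrypt(c0, d0)
print(dec == list(reversed(enc)))
```

Because the left shifts sum to the register width, decryption can start from the same loaded key and generate the round keys in reverse order on the fly.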


Key Schedule

Two options:

  • On the “fly” = just in time processing: the key schedule runs next to the block cipher (BC). Typical for hardware
  • Pre-compute and store: the key schedule fills a memory that the block cipher (BC) reads. Typical for software


Key schedule on the fly

  • The cost of fast key context switching:
  • Example for IPSEC router
    – one 128 bit key = 1408 bits round keys (10 rounds + initial key)
    – half of internet packets are only 64 bytes in length (512 bits)

[Figure: context bandwidth (Gbps) versus record size (bytes, 10 to 10^5) for ARC4, AES and 3DES, with data at 1 Gbps. Source: J. Goodman]

BANDWIDTH PROBLEM!
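The bandwidth problem follows from simple arithmetic. A rough sketch, assuming the worst case in which every packet belongs to a different session, so that a full set of pre-computed round keys must be loaded per packet:

```python
ROUND_KEY_BITS = 128 * 11  # initial key + 10 rounds = 1408 bits


def context_bandwidth_gbps(data_rate_gbps: float, packet_bits: int,
                           context_bits: int = ROUND_KEY_BITS) -> float:
    """Bandwidth needed to reload the key context once per packet."""
    return data_rate_gbps * context_bits / packet_bits


# 64-byte packets (512 bits) arriving at 1 Gbps.
print(context_bandwidth_gbps(1.0, 512))
```

Under this assumption the stored round keys need more bandwidth than the data itself, whereas an on-the-fly key schedule only has to load the 128-bit key.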


Modes of operation


Design method

  • Advice: include modes of operation into hardware IP module or co-processor:
    – increases the complexity somewhat: more control or instructions are needed
    + CLEAN security partitioning
    + reduces communication overhead and traffic


Modes of operation: ECB

  • ECB = Electronic code book
  • cipher blocks are independent, thus insertion or deletion of blocks can go undetected
  • block cipher does not hide data patterns
  • BC = block cipher (e.g. 3DES or AES)

[Figure: sender computes ciphertext $C = BC_K(M)$ from message M; receiver recovers plaintext $M = BC^{-1}_K(C)$]
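Why ECB does not hide data patterns can be seen with any deterministic block cipher. In this sketch `toy_bc` is a hypothetical, insecure 64-bit permutation used as a stand-in for 3DES or AES:

```python
MASK = (1 << 64) - 1


def toy_bc(key: int, block: int) -> int:
    """Toy 64-bit block cipher stand-in (a bijection, NOT secure)."""
    x = (block ^ key) & MASK
    x = (x * 0x9E3779B97F4A7C15) & MASK  # odd multiplier: invertible mod 2^64
    return x ^ (x >> 29)                 # xorshift: invertible


def ecb_encrypt(key: int, blocks: list) -> list:
    """Each block is encrypted independently of all others."""
    return [toy_bc(key, b) for b in blocks]


plaintext = [0x1111, 0x2222, 0x1111]  # first and third block are equal
ciphertext = ecb_encrypt(0xC0FFEE, plaintext)
print(ciphertext[0] == ciphertext[2])  # the repetition is visible
```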


Modes of operation: CBC

  • CBC = Cipher block chaining
  • error in Ci: propagation over 2 blocks (Ri and Ri+1)
  • loss of block synchronization = fatal
  • error in Pi: propagation for the remaining blocks
  • mostly used with encryption-only for MAC generation

[Figure: CBC encryption $C_i = BC_K(P_i \oplus C_{i-1})$; CBC decryption $R_i = BC^{-1}_K(C_i) \oplus C_{i-1}$]
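The chaining and the two-block error propagation can be sketched as follows. `bc` and `bc_inv` are a hypothetical, insecure toy cipher pair standing in for the real block cipher:

```python
MASK = (1 << 64) - 1


def rotl(x: int, r: int) -> int:
    return ((x << r) | (x >> (64 - r))) & MASK


def rotr(x: int, r: int) -> int:
    return ((x >> r) | (x << (64 - r))) & MASK


def bc(k: int, x: int) -> int:
    """Toy block cipher stand-in (NOT secure)."""
    return (rotl(x ^ k, 13) + k) & MASK


def bc_inv(k: int, y: int) -> int:
    return rotr((y - k) & MASK, 13) ^ k


def cbc_encrypt(k: int, iv: int, plain: list) -> list:
    out, prev = [], iv
    for p in plain:
        prev = bc(k, p ^ prev)  # feedback: block i needs C(i-1) first
        out.append(prev)
    return out


def cbc_decrypt(k: int, iv: int, cipher: list) -> list:
    out, prev = [], iv
    for c in cipher:
        out.append(bc_inv(k, c) ^ prev)
        prev = c
    return out


plain = [10, 20, 30, 40]
cipher = cbc_encrypt(99, 7, plain)
corrupted = list(cipher)
corrupted[1] ^= 1  # single bit error in C1
recovered = cbc_decrypt(99, 7, corrupted)
print([r == p for r, p in zip(recovered, plain)])  # only blocks 1 and 2 are hit
```

Note that the decryption loop has no feedback through `bc_inv`, whereas encryption must finish block i-1 before block i can start.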


Modes of operation: CBC-MAC

  • Cipher block chaining – Message Authentication Code
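  • Initialization Vector: $IV = C_{-1}$
  • Feedback inhibits pipelining

[Figure: sender and receiver both compute $C_i = BC_K(P_i \oplus C_{i-1})$, so only the encryption direction of BC is used]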


Pipelining?

  • Due to feedback pipeline remains empty
  • Worse for triple DES

[Figure: CBC with single DES: $P_i \oplus C_{i-1}$ must pass through 16 rounds before the next block can start. CBC with triple DES: 48 rounds per block: 16 DES(K1), 16 DES-1(K2), 16 DES(K3)]


Modes of operation: counter

  • Converts block cipher into stream cipher
  • no feedback: pipelining is possible
  • crucial to choose non-repeating counter functions, e.g. LFSR
  • crucial to choose counter IV’s that are UNIQUE

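[Figure: sender computes $y_i = BC_K(cntr_i)$ and $C_i = P_i \oplus y_i$; receiver computes the same $y_i = BC_K(cntr_i)$ and $P_i = C_i \oplus y_i$]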
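Counter mode only ever uses the block cipher in the encryption direction, and the same routine encrypts and decrypts. A sketch with a hypothetical, insecure toy cipher and a plain incrementing counter (the slide mentions an LFSR as another non-repeating counter function):

```python
MASK = (1 << 64) - 1


def toy_bc(key: int, block: int) -> int:
    """Toy 64-bit block cipher stand-in (a bijection, NOT secure)."""
    x = (block ^ key) & MASK
    x = (x * 0x9E3779B97F4A7C15) & MASK
    return x ^ (x >> 29)


def ctr_crypt(key: int, nonce: int, blocks: list) -> list:
    """Encrypt or decrypt: XOR each block with BC(counter).

    The keystream blocks do not depend on each other, so they can be
    computed in a pipeline or in parallel.
    """
    return [b ^ toy_bc(key, (nonce + i) & MASK) for i, b in enumerate(blocks)]


plain = [5, 5, 5]  # identical plaintext blocks
cipher = ctr_crypt(0xC0FFEE, 1000, plain)
print(ctr_crypt(0xC0FFEE, 1000, cipher) == plain)
print(cipher[0] != cipher[1])  # unlike ECB, the repetition is hidden
```

Reusing the same key and starting counter for two messages would repeat the keystream, which is why the counter values must be unique.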


Modes of operation: OCB

  • Offset code book – proprietary
  • (used to be) popular because option in 802.11 WLAN
  • can be pipelined
  • need encryption & decryption logic (problem for AES)

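[Figure: sender computes $C_i = BC_K(P_i \oplus Z_i) \oplus Z_i$; receiver computes $P_i = BC^{-1}_K(C_i \oplus Z_i) \oplus Z_i$; both sides need extra operations for the checksum]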


CCM (Counter + CBC MAC) mode

  • Encryption & MAC creation (802.11 WLAN)

  • MIC = Message Integrity Check is same as MAC

[Figure: CCM processing of an 802.11 frame. The clear text frame (FC, Dur, A1, A2, A3, A4, SC, QC, PC, Data) is run through a chain of AES_E(K) blocks in CBC-MAC, starting from an IV block (Flag, Nonce, Dlen) and with 0 padded fields, to produce the MIC. Data and MIC are encrypted with AES_E(K) applied to counter preloads Pl(0), Pl(1), Pl(2), ..., Pl(C) (Flag, Nonce, Cnt). The transmitted encrypted frame carries the same header fields, the encrypted Data and MIC, and the FCS]


Block cipher modes of operation

  • Conclusion: most practical applications ONLY use encryption. Important for:
    – AES because encryption is more efficient than decryption (non-Feistel)
    – area constraint applications (e.g. IEEE 802.11)

Mode      Sender  Receiver  Pipelining  Notes
ECB       Enc     Dec       Yes         not used
CBC       Enc     Dec       No
CBC-MAC   Enc     Enc       No          Message authentication
CFB       Enc     Enc       No
OFB       Enc     Enc       No
CTR       Enc     Enc       Yes
CCM       2Enc    2Enc      Only CTR    Privacy & MAC
OCB       Enc     Dec       Yes         Privacy & MAC


Block cipher AES

Hardware


AES: Byte substitution

  • Byte substitution: each byte individual
  • 16 identical Sboxes
  • Area - time trade-off: HW multiplexing
  • 32 for Rijndael

[Figure: the 4x4 state array a0 ... a15 is mapped byte by byte to b0 ... b15; each S-box computes $b_i$ from $a_i$ as an inversion in $GF(2^8)$ followed by the block labeled "Permute" (the affine transformation)]
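The two stages of the S-box can be written out directly. This is a slow reference sketch (inversion computed as the 254th power) meant to show the structure, not an implementation recommendation:

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def gf_inv(a: int) -> int:
    """a^254 equals a^-1 for a != 0; 0 is mapped to 0."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r


def rotl8(x: int, n: int) -> int:
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sbox(a: int) -> int:
    """AES S-box: inversion in GF(2^8), then the affine transformation."""
    x = gf_inv(a)
    return x ^ rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 3) ^ rotl8(x, 4) ^ 0x63


print(hex(sbox(0x00)), hex(sbox(0x01)))
```

In hardware the same function is either stored as a 256-entry table or built from the inversion and affine logic, which is the area-time trade-off discussed on the following slides.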


AES: Shiftrow

  • Shiftrow: circularly rotate each row of state array
  • Easy wiring

Before:              After Shiftrow:
a0  a4  a8   a12     a0   a4   a8   a12
a1  a5  a9   a13     a5   a9   a13  a1
a2  a6  a10  a14     a10  a14  a2   a6
a3  a7  a11  a15     a15  a3   a7   a11


AES: mix column

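  • matrix multiplication of state array columns
    – multiply with constant entries

[Figure: each column $(a_i, a_{i+1}, a_{i+2}, a_{i+3})$ of the state array a0 ... a15 is mapped to the column $(b_i, b_{i+1}, b_{i+2}, b_{i+3})$ of b0 ... b15]

$$\begin{pmatrix} b_i \\ b_{i+1} \\ b_{i+2} \\ b_{i+3} \end{pmatrix} = \begin{pmatrix} 2 & 3 & 1 & 1 \\ 1 & 2 & 3 & 1 \\ 1 & 1 & 2 & 3 \\ 3 & 1 & 1 & 2 \end{pmatrix} \begin{pmatrix} a_i \\ a_{i+1} \\ a_{i+2} \\ a_{i+3} \end{pmatrix}$$

For example $b_i = 2 \times a_i + 3 \times a_{i+1} + a_{i+2} + a_{i+3}$.

Multiplication by 2 of a byte $(a_7 a_6 a_5 a_4 a_3 a_2 a_1 a_0)$:

$(b_7 b_6 b_5 b_4 b_3 b_2 b_1 b_0) = (a_6\ a_5\ a_4\ a_3\ a_2\ a_1\ a_0\ 0) \oplus (0\ 0\ 0\ a_7\ a_7\ 0\ a_7\ a_7)$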


Mix column - encryption

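[Figure: datapath for one input byte B (8 bits). B x 1 is B itself; B x 2 is a left shift by 1 (<< 1) with G(x) = 00011011 added when the carry is 1; B x 3 is (B x 2) + B. The column (a, b, c, d) is multiplied by the matrix
02 03 01 01
01 02 03 01
01 01 02 03
03 01 01 02
using these multipliers and EXOR trees]

Mix Column operation is a $GF(2^8)$ linear transform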
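The shift-and-conditional-add multiplier and the column multiplication can be sketched as follows; the function names `xtime` and `mix_column` are chosen here for illustration:

```python
def xtime(b: int) -> int:
    """Multiply by 2 in GF(2^8): shift left, add 00011011 on carry."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B  # drops the carry bit and adds 0x1B
    return b


def mul3(b: int) -> int:
    """Multiply by 3 = (2 x b) + b."""
    return xtime(b) ^ b


def mix_column(col: list) -> list:
    """Multiply one state column by the matrix [2 3 1 1; 1 2 3 1; 1 1 2 3; 3 1 1 2]."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ mul3(a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ mul3(a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ mul3(a3),
        mul3(a0) ^ a1 ^ a2 ^ xtime(a3),
    ]


print([hex(b) for b in mix_column([0xDB, 0x13, 0x53, 0x45])])
```

Only shifts and EXORs are needed, which is why the encryption MixColumn is cheap in hardware.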


Key schedule

  • Unit is 32 bit words, W[i] = 32 bit = 1 column
  • 4 different operations
    – Input key: W[0] ... W[Nk-1]
    – W[i] = W[i-Nk] ^ W[i-1]
    – W[i] = W[i-Nk] ^ ByteSub(RotByte(W[i-1])) ^ Rcon[i/Nk] (when i is a multiple of Nk)
    – W[i] = W[i-Nk] ^ ByteSub(W[i-1]) (256b keys only, when i mod Nk = 4)
  • One round key is four W[i]’s

[Figure: round keys 1 to 10 for a 128b key, 1 to 12 for a 192b key, 1 to 14 for a 256b key]


Key schedule

  • Encryption key
    – HW: on the fly: round key[i] = function (round key[i-1])
    – SW: pre-compute and store in context (176, 208, 240 bytes)
  • Decryption key
    – encryption key in reverse order
    – BUT need final round words to start

128 bit encrypt (from initial round words to final round words):
…
W[16] = A1 = f(D0)^A0
W[17] = B1 = f(D0)^A0^B0
W[18] = C1 = f(D0)^A0^B0^C0
W[19] = D1 = f(D0)^A0^B0^C0^D0
…

128 bit decrypt (from final round words to initial round words):
…
W[15] = D0 = C1^D1
W[14] = C0 = B1^C1
W[13] = B0 = A1^B1
W[12] = A0 = f(C1^D1)^A1
…
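Both directions of the 128-bit on-the-fly key schedule can be sketched as below. The S-box is recomputed from its definition to keep the example self-contained; the key is the example key from FIPS-197, and the function names are chosen here for illustration:

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def sbox(a: int) -> int:
    """AES S-box: inversion (a^254) followed by the affine transformation."""
    x = 1
    for _ in range(254):
        x = gf_mul(x, a)
    y = x
    for n in (1, 2, 3, 4):
        y ^= ((x << n) | (x >> (8 - n))) & 0xFF
    return y ^ 0x63


SBOX = [sbox(a) for a in range(256)]
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]


def f(word: int, rcon: int) -> int:
    """ByteSub(RotByte(word)) ^ Rcon on a 32-bit word."""
    w = ((word << 8) | (word >> 24)) & 0xFFFFFFFF
    s = 0
    for shift in (24, 16, 8, 0):
        s |= SBOX[(w >> shift) & 0xFF] << shift
    return s ^ (rcon << 24)


def next_round_key(rk: list, rcon: int) -> list:
    a0, b0, c0, d0 = rk
    a1 = f(d0, rcon) ^ a0
    b1 = a1 ^ b0
    c1 = b1 ^ c0
    return [a1, b1, c1, c1 ^ d0]


def prev_round_key(rk: list, rcon: int) -> list:
    a1, b1, c1, d1 = rk
    d0 = c1 ^ d1
    return [f(d0, rcon) ^ a1, a1 ^ b1, b1 ^ c1, d0]


key = [0x2B7E1516, 0x28AED2A6, 0xABF71588, 0x09CF4F3C]
rk = key
for rcon in RCON:               # encryption direction
    rk = next_round_key(rk, rcon)
final_key = rk
for rcon in reversed(RCON):     # decryption direction, starting from the final words
    rk = prev_round_key(rk, rcon)
print(rk == key)
```

A decryption-capable core therefore either stores the final round key alongside the cipher key or runs the forward schedule once before the first decryption.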


Combined AES architecture

  • AES is not Feistel
  • Every operation has its inverse

Encryption (Input Data to Output Data):
  KeyAdd (K0)
  per round: ByteSub, ShiftRow, MixColumn, KeyAdd (Ki)
  final round: ByteSub, ShiftRow, KeyAdd (KNr)

Decryption (Input Data to Output Data), every step inverted and executed in reverse order:
  KeyAdd-1 (KNr), ShiftRow-1, ByteSub-1
  per round: KeyAdd-1 (Ki), MixColumn-1, ShiftRow-1, ByteSub-1
  KeyAdd-1 (K0)


Decryption datapath

[Figure: four versions of the decryption datapath, derived step by step from the inverted encryption datapath (MixColumn-1, ShiftRow-1, ByteSub-1, KeyAdd-1 with round keys Ki, K0, KNr)]

  • Key addition is its own inverse: KeyAdd-1 becomes KeyAdd
  • Switch ShiftRow-1 and ByteSub-1
  • Flip upside down
  • Reorganize

Resulting decryption datapath (Input Data to Output Data):
  KeyAdd (KNr)
  per round: ShiftRow-1, ByteSub-1, KeyAdd (Ki), MixColumn-1
  final round: ShiftRow-1, ByteSub-1, KeyAdd (K0)


Combined architecture

Encryption (Input Data to Output Data):
  KeyAdd (K0)
  per round: ByteSub, ShiftRow, MixColumn, KeyAdd (Ki)
  final round: ByteSub, ShiftRow, KeyAdd (KNr)

Decryption (Input Data to Output Data):
  KeyAdd (KNr)
  per round: ShiftRow-1, ByteSub-1, KeyAdd (Ki), MixColumn-1
  final round: ShiftRow-1, ByteSub-1, KeyAdd (K0)

  • Does not completely follow the Rijndael proposal (which suggests switching KeyAdd and MixColumn), because that requires an InvMixColumn to be applied to the round key.


Combined architecture

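  • Provides encryption & decryption
  • Key addition is its own inverse

[Figure: combined datapath from Input Data to Output Data with shared ByteSub, ShiftRow, MixColumn and KeyAdd blocks; multiplexers controlled by enc, first, enc·first and regSel select the order of the blocks for encryption or decryption]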


Sub modules

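ByteSub: [Figure: In passes through perm-1, $GF(2^8)^{-1}$ and perm; multiplexers controlled by enc bypass perm-1 or perm, so the inversion is shared between encryption and decryption]

ShiftRow: [Figure: Out[3:0] = In[3:0] and Out[11:8] = In[11:8]; multiplexers controlled by enc select between In[4,7,6,5] and In[6,5,4,7] for Out[7:4], and between In[12,15,14,13] and In[14,13,12,15] for Out[15:12]]

MixColumn:

$$Out_{col} = \begin{pmatrix} 2 & 3 & 1 & 1 \\ 1 & 2 & 3 & 1 \\ 1 & 1 & 2 & 3 \\ 3 & 1 & 1 & 2 \end{pmatrix} \cdot In_{col} + \overline{enc} \cdot \begin{pmatrix} c & 8 & c & 8 \\ 8 & c & 8 & c \\ c & 8 & c & 8 \\ 8 & c & 8 & c \end{pmatrix} \cdot In_{col}$$

The extra term is only needed for decryption: reason that decryption is slower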


Sbox optimization

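  • $GF(2^8)^{-1}$ requires large look-up tables
  • Map to isomorphic fields, $GF((2^4)^2)$ or $GF(((2^2)^2)^2)$, and invert there
  • smaller but slower!

[Figure: the input is mapped by A from $GF(2^8)$ to two $GF(2^4)$ halves ah and al; the inversion is built from squaring ()^2, a multiplication with the constant p0 and an inversion ()^-1 in $GF(2^4)$; A^-1 maps the result back to $GF(2^8)$]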


Sbox experiment

  • 0.18 μm CMOS, Synopsys experiment
  • size of 1 Sbox, push for area or for speed

[Figure: area (gates, 200 to 2000) versus latency (ns, 2 to 8) for the direct implementation in $GF(2^8)$ and the Wolkerstorfer implementation in $GF((2^4)^2)$]


Compact SBOX

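  • $GF(((2^2)^2)^2)$ instead of $GF(2^8)$
  • Reduces the gate count to only 280 gates!

[size depends on the choice of ]

[Figure: the 8-bit input is mapped from $GF(2^8)$ to $GF(((2^2)^2)^2)$ and split into 4-bit halves; squaring (^2) and inversion (^-1) operate on 4-bit values; the 8-bit result is mapped back from $GF(((2^2)^2)^2)$ to $GF(2^8)$]

[Mentens et al., RSA 2005]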


Ballpark numbers

  • 1 gate = 2-input NAND gate = 4 transistors
  • Sbox size:
    – $GF(2^8)^{-1}$ = 650 to 700 gates
    – $GF((2^4)^2)^{-1}$ = 400 gates ([Wol02])
    – $GF(((2^2)^2)^2)^{-1}$ = 280 gates ([Sat01][Men05])
    – but 50 to 100% slower
  • AES core encryption only: 20K to 25K gates
    – 128 bit data, 128 bit key
    – key schedule on the fly
    – 1 clock cycle per round
  • AES core for encryption and decryption: 40K gates
    – 128 bit data, 128 bit key
    – precompute and store round keys: 128x11 bits SRAM
    – 1 clock cycle per round
    – savings in combining logic, losses in multiplexers!


Extensions to Rijndael

  • Original Rijndael submission
    – 128, 192 or 256 bits data (limited to 128 bit in AES standard)
    – 128, 192 or 256 bits key (unchanged in AES standard)
  • Tricky part in key schedule: 2 loops in parallel

[Figure: data path with block size m (128, 192 or 256): plaintext, keyadd, N-1 rounds of substitution, shiftrow, mixcolumn, keyadd, then a final round of substitution, shiftrow, keyadd, ciphertext. In parallel the keyschedule expands the key of size k (128, 192 or 256) over L rounds of k bits and delivers a roundkey of m bits to each keyadd]


Key schedule for Rijndael

[Figure: assembling round keys from the iterated key over successive clock cycles, for data=128, key=192 (cycles 1 to 3) and for data=192, key=128 (cycles 1 and 2), using the keysched1 and keysched2 units]


Key schedule for Rijndael

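  • Key schedule on the fly: one clock cycle per round

[Figure: key schedule block with seed (k bits), keysched1, keysched2, iteratedkey register, roundkey assembly and controller (parameters m, k), delivering an m-bit roundkey to the data encrypt block, which turns P into C over N rounds]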


  • Rijndael
  • Enc + Dec
  • 0.18um CMOS
  • Standard cells


  • AES, 2nd generation
  • Regular & WDDL based implementation
  • Standard cells
  • 0.18 um CMOS

SW implementations

Illustrate with DES

  • Option 1: operation by operation
  • Option 2: table look-up
  • Option 3: bit-slices


SW: DES- f function

  • Expansion E: 32b-to-48b perm/expansion
  • Combined with permutation P: one Look-Up-Table
  • Wordwise EXOR: efficient
  • Sbox: Look-up Table

[Figure: f(Ri-1, Ki): 32-bit Ri-1, expansion E to 48 bits, EXOR with the 48-bit Ki, S-boxes S1 to S8, permutation P, 32-bit output]

  • DES: made for HW, inefficient in SW

Software SBOX


SBOX in SW

Christof Paar, MEAD Course „Cryptographic Engineering“, Sept. 2009


Bit Slicing: Alternative Data Representation

  • Introduced by Biham, 1997
  • each register contains 1 bit of, e.g., 32 blocks
  • pipelining (=parallelization) of n encryptions
  • CPU can be viewed as 16/32/64 one-bit parallel processors
  • CPU acts as SIMD (Single-instruction multiple-data) processor

[Figure: blocks 1 to n of 64 bits each are stored in 64 registers of n bits (n = 16, 32, 64): the first register holds bit 1 of block 1 ... bit 1 of block n, the second holds bit 2 of block 1 ... bit 2 of block n, up to bit 64]

Encryption with Bit Slicing (1): Permutations

Bit permutation is realized by re-ordering of registers (in practice: re-ordering of pointers)

[Figure: example of a 64-64 bit permutation: the registers holding bit 1, bit 2, ..., bit 64 of blocks 1 to n are re-ordered]

Christof Paar, MEAD Course „Cryptographic Engineering“, Sept. 2009


Encryption with Bit Slicing (2): S-Box

[Figure: S-box with inputs a, b, c, d, e, f and outputs O1, O2, O3, O4]

Each output can be expressed as Boolean function (i.e., function on bits)

O1 = f1(a, b, c, d, e, f)
…
O4 = f4(a, b, c, d, e, f)

On average, each S-box requires about 100 gates.


Encryption with Bit Slicing (2): S-Box (continued)

[Figure: 48 registers holding bit 1 ... bit 48 of blocks 1 to n feed the S-box with inputs a, b, c, d, e, f and outputs O1, O2, O3, O4]

Ex: Reg 8 AND Reg 17 OR Reg 44 NOR …

Idea: Compute S-box Si for many blocks in parallel! i.e., realize functions

O1 = f1(a, b, c, d, e, f)
…
O4 = f4(a, b, c, d, e, f)

for 64 (or 32 or 16) blocks of data in parallel!
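The data representation can be sketched in a few lines. The gate network below is a hypothetical 3-input Boolean function, not a DES S-box output, and Python integers stand in for the n-bit CPU registers:

```python
def to_bitslices(blocks: list, width: int) -> list:
    """slices[j] collects bit j of every block: bit i of slices[j] is bit j of blocks[i]."""
    slices = [0] * width
    for i, block in enumerate(blocks):
        for j in range(width):
            slices[j] |= ((block >> j) & 1) << i
    return slices


def from_bitslices(slices: list, n_blocks: int) -> list:
    """Inverse conversion, back to one integer per block."""
    blocks = [0] * n_blocks
    for j, s in enumerate(slices):
        for i in range(n_blocks):
            blocks[i] |= ((s >> i) & 1) << j
    return blocks


def toy_gates(a: int, b: int, c: int) -> int:
    """Hypothetical Boolean function o = (a AND b) XOR c."""
    return (a & b) ^ c


blocks = [0b101, 0b011, 0b110, 0b111]  # four 3-bit "blocks"
regs = to_bitslices(blocks, 3)

# One pass of word-wide AND/XOR evaluates the function for all blocks at once.
parallel = toy_gates(regs[0], regs[1], regs[2])

# Reference: the same function evaluated block by block.
serial = [toy_gates(b & 1, (b >> 1) & 1, (b >> 2) & 1) for b in blocks]

# A bit permutation is just a re-ordering of the register list.
permuted = from_bitslices([regs[2], regs[0], regs[1]], len(blocks))
print(parallel, serial, permuted)
```

The conversion to and from the bit-sliced representation is the overhead that the instruction counts on the performance slide include as "conversion of I/O data".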


DES Bit Slicing: Performance on 64 bit CPU

  • 8 S-Boxes, 64 blocks parallel: 100 x 8 = 800 instructions
  • total (incl. load/store & conversion of I/O data): 19000 instr. / 64 blocks ≈ 300 instr. / block ≈ 4-5 instr. / bit
  • Throughput on 300 MHz Alpha
    – bit sliced: 137 Mbit/sec
    – optimized non-bit sliced: 46 Mbit/sec
  • ONLY COMPATIBLE WITH PIPELINE TYPE MODES OF OPERATION!!

Further reading: [Biham 97]


Throughput – Energy numbers

AES 128bit key, 128bit data   Throughput      Power    Figure of Merit (Gb/s/W)
0.18μm CMOS                   3.84 Gbits/sec  350 mW   11 (1/1)
FPGA [1]                      1.32 Gbit/sec   490 mW   2.7 (1/4)
ASM StrongARM [2]             31 Mbit/sec     240 mW   0.13 (1/85)
Asm Pentium III [3]           648 Mbits/sec   41.4 W   0.015 (1/800)
C Emb. Sparc [4]              133 Kbits/sec   120 mW   0.0011 (1/10.000)
Java [5] Emb. Sparc           450 bits/sec    120 mW   0.0000037 (1/3.000.000)

[1] Amphion CS5230 on Virtex2 + Xilinx Virtex2 Power Estimator
[2] Dag Arne Osvik: 544 cycles AES – ECB on StrongArm SA-1110
[3] Helger Lipmaa PIII assembly handcoded + Intel Pentium III (1.13 GHz) Datasheet
[4] gcc, 1 mW/MHz @ 120 MHz Sparc – assumes 0.25 μm CMOS
[5] Java on KVM (Sun J2ME, non-JIT) on 1 mW/MHz @ 120 MHz Sparc – assumes 0.25 μm CMOS
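The figure of merit is throughput divided by power, and its inverse is the energy spent per bit. The sketch below recomputes both from the throughput and power values quoted on this slide; the pairing of values to platforms is a reconstruction from the flattened slide, cross-checked against the quoted figures of merit:

```python
# (platform, throughput in bit/s, power in W), values as quoted on the slide
ROWS = [
    ("0.18 um CMOS", 3.84e9, 0.350),
    ("FPGA", 1.32e9, 0.490),
    ("ASM StrongARM", 31e6, 0.240),
    ("Asm Pentium III", 648e6, 41.4),
    ("C, emb. Sparc", 133e3, 0.120),
    ("Java, emb. Sparc", 450.0, 0.120),
]


def figure_of_merit(bits_per_sec: float, watt: float) -> float:
    """Throughput per unit power, in Gb/s/W."""
    return bits_per_sec / 1e9 / watt


def energy_per_bit(bits_per_sec: float, watt: float) -> float:
    """Joule spent per encrypted bit."""
    return watt / bits_per_sec


for name, bps, watt in ROWS:
    print(f"{name:18s} {figure_of_merit(bps, watt):12.7f} Gb/s/W "
          f"{energy_per_bit(bps, watt):.3e} J/bit")
```

The two embedded Sparc rows draw the least power of all platforms yet spend by far the most energy per bit, which is the power-versus-energy point made earlier in the deck.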


Conclusions

  • Energy is not the same as power
  • Feistel = hardware friendly!
  • Modes of operation inhibit pipelining, parallelism
    – in SW bitslicing = pipelining
  • Need for light weight (ultra small)
  • Need for high throughput (ultra fast)
  • Need for low power/low energy