Lightweight Implementations of SHA-3 Candidates on FPGAs

SLIDE 1

Introduction Methodology Implementations Results

Lightweight Implementations of SHA-3 Candidates on FPGAs

Jens-Peter Kaps, Panasayya Yalla, Kishore Kumar Surapathi, Bilal Habib, Susheel Vadlamudi, Smriti Gurung, John Pham

Cryptographic Engineering Research Group (CERG), http://cryptography.gmu.edu
Department of ECE, Volgenau School of Engineering, George Mason University, Fairfax, VA, USA

12th International Conference on Cryptology in India (Indocrypt 2011)

Indocrypt 2011 J.-P. Kaps, Smriti Gurung, et al. Lightweight Implementations of SHA-3 on FPGAs 1 / 27

SLIDE 2

Outline

1. Introduction
2. Methodology
3. Implementations
4. Results

SLIDE 3

Hash Function Competition

A hash algorithm reads an arbitrary-length message and produces a fixed-length bit string called the hash value or message digest.

Main applications: digital signatures, Message Authentication Codes (MAC), Universally Unique IDentifiers (UUID/GUID), password tables, and many more.

NIST competition for a new secure hash algorithm, SHA-3:
  • Announced in Nov 2007, 64 entries submitted.
  • 14 selected for Round 2.
  • Currently in Round 3 → 5 finalists.

NIST's selection criteria: security, HW/SW speed, scalability.

Motivation: Analyze the performance of the candidates in a constrained FPGA environment ⇒ determine scalability on FPGAs.

SLIDE 5

Previous Work on SHA-3 Candidates

Several Throughput/Area-optimized implementations on FPGAs have been published:
  • Gaj et al. [CHES 2010]
  • Matsuo et al. [SHA-3 conference 2010]
  • Baldwin et al. [SHA-3 conference 2010]

Only two specifically target low-area implementations of SHA-3 finalists:
  • Kerckhof et al. [HASH 2011]
  • Jungk et al. [Reconfig 2011]

Problem: Rating algorithm performance is difficult when implementations
  • are on different devices,
  • are made with different implementation goals and features,
  • vary in both area and throughput, and
  • support different I/O interface widths.

SLIDE 7

Our Goal:

  • Comprehensive set of lightweight implementations of all Round 2 SHA-3 candidates (except SIMD) and all SHA-3 finalists.
  • All optimized for the same target → maximum Throughput-to-Area ratio for a given area budget.
  • All use the same standardized interface.
  • Implemented on different FPGA families for fair comparison with other reported results.

Target details:
  • Xilinx Spartan 3, a low-cost FPGA family
  • Budget: 400-600 slices, 1 Block RAM (BRAM)
  • 256-bit digest versions only

SLIDE 8

Assumptions

Implementing for minimum area alone can lead to unrealistic run-times.
⇒ Target: Achieve the maximum Throughput/Area ratio for a given area budget.

Realistic scenario:
  • System on Chip: only a certain area is available.
  • Standalone: a smaller chip means lower cost, but the limit is the smallest chip available, e.g. 768 slices on the smallest Spartan 3 FPGA.

This makes a fair comparison of lightweight implementations possible.

SLIDE 9

Interface and Protocol

Based on the interface and I/O protocol from Gaj et al. [CHES 2010].

  • msg_len_ap, seq_len_ap (after padding) in 32-bit words.
  • msg_len_bp, seq_len_bp (before padding) in bits.
  • w = 16 bits.

\[
\mathrm{msg\_len\_bp} = \sum_{i=0}^{n-2} \mathrm{seq\_len\_ap}_i \cdot 32 + \mathrm{seq\_len\_bp}_{n-1}
\]

\[
\mathrm{msg\_len\_ap} = \sum_{i=0}^{n-1} \mathrm{seq\_len\_ap}_i \cdot 32
\]

[Figure: a) SHA interface: SHA core with w-bit din and dout, handshake signals src_ready, src_read, dst_ready, dst_write, plus clk and rst. b) SHA protocol: w-bit word stream carrying msg_len_ap, msg_len_bp, and the message, or segments seq_0 ... seq_(n-1), each preceded by its seq_len_ap (the last one also by seq_len_bp).]
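
The two length formulas on this slide can be checked with a few lines of Python. This is a minimal sketch that follows the formulas as written on the slide; the function names and the example segment sizes are illustrative assumptions, not part of the original interface definition.

```python
WORD_BITS = 32  # seq_len_ap is counted in 32-bit words


def msg_len_bp(seq_len_ap, seq_len_bp_last):
    """Message length before padding, in bits.

    seq_len_ap      -- per-segment lengths after padding, in 32-bit words
    seq_len_bp_last -- length of the last segment before padding, in bits
    """
    if not seq_len_ap:
        raise ValueError("at least one segment is required")
    # All segments except the last are full, so only the last one
    # contributes its before-padding length.
    return sum(seq_len_ap[:-1]) * WORD_BITS + seq_len_bp_last


def msg_len_ap(seq_len_ap):
    """Sum over all segments of seq_len_ap_i * 32, as written on the slide."""
    return sum(seq_len_ap) * WORD_BITS


if __name__ == "__main__":
    segments = [16, 16, 16]  # hypothetical: three 512-bit segments
    print(msg_len_bp(segments, 100))  # 2 full segments + 100 message bits
    print(msg_len_ap(segments))
```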

SLIDE 10

BLAKE-256 Algorithm

[Figure: BLAKE-256 block diagram: initialization from IV/H, 14 rounds of eight G functions on a 512-bit state with permutations P1 and P2, and a 256-bit hash H; detail of the G function with 32-bit inputs A, B, C, D, constant/message input CM, and rotations by 16, 12, 8, and 7.]

Key Features
  • Salt value: a user-dependent constant of 128 bits, all set to 0.
  • 8 G functions: XOR, addition, shifting (rotation).
  • P1, P2: permutations.
  • BLAKE scales very well: it can be folded up to 4 times vertically and 4 times horizontally.
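
As a software reference for the G function named above, the following is a minimal sketch based on the BLAKE specification as I understand it (right rotations by 16, 12, 8, and 7 on 32-bit words). The parameters `mc0` and `mc1` stand for the already selected "message word XOR constant" inputs (CM in the figure); selecting them through the round permutation is left out, and none of this is the authors' hardware description.

```python
MASK32 = 0xFFFFFFFF


def rotr32(x, n):
    """Rotate a 32-bit word right by n bits (0 < n < 32)."""
    return ((x >> n) | (x << (32 - n))) & MASK32


def blake256_g(a, b, c, d, mc0, mc1):
    """One G function on four 32-bit state words.

    mc0, mc1 are the two (message XOR constant) words for this G call;
    how they are chosen per round is not modelled here.
    """
    a = (a + b + mc0) & MASK32
    d = rotr32(d ^ a, 16)
    c = (c + d) & MASK32
    b = rotr32(b ^ c, 12)
    a = (a + b + mc1) & MASK32
    d = rotr32(d ^ a, 8)
    c = (c + d) & MASK32
    b = rotr32(b ^ c, 7)
    return a, b, c, d


if __name__ == "__main__":
    print([hex(w) for w in blake256_g(1, 2, 3, 4, 5, 6)])
```

The two halves of the function are identical apart from the rotation amounts and the CM input, which is what makes a "half G" datapath that is used twice per G call plausible.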

SLIDE 11

BLAKE-256 Implementation

[Figure: BLAKE-256 datapath: BRAM (Port-A, Port-B) connected to din and dout, four DRAM blocks, registers REG_A, REG_B1, REG_B2, REG_C, REG_1, REG_2, and rotators R1 to R4 on 32-bit buses.]

Implementation
  • Salt: BRAM. State: DRAM (distributed RAM).
  • Quasi-pipelined half G function.
  • Registers: reduce the critical path.
  • Permutation causes a large controller with 210 addresses.
  • BRAM contains constants, message, IV, and intermediate hash.
  • Scalability: unfolding leads to worse TP/A.
  • Improvement: rescheduling of G results in 290 clock cycles per block versus 350.

SLIDE 12

Grøstl Algorithm

[Figure: Grøstl block diagram: permutations P and Q, each with 10 rounds of Add (Addp, Addq), S-Box, Sft Row, and Mix on a 512-bit state, combining message block M and H(i-1) into H(i); 256-bit hash H.]

Key Features
  • Based on an AES-like architecture: S-box, shift rows, mix columns.
  • Grøstl scales well, like AES: it can be folded up to 8 times vertically.
  • Small storage requirements.
  • Uses many narrow memory accesses in parallel (8 per column).

SLIDE 13

Grøstl Implementation

[Figure: Grøstl datapath: BRAM (Port-A, Port-B) connected to din and dout, four groups of 4xDRAM with 8-bit outputs feeding four S-boxes, Add Constant, GFMul, and pipeline registers on a 32-bit bus.]

Implementation
  • State of P and Q: DRAM.
  • Shift Rows: realized by how the data is accessed from DRAM.
  • Mix Column: GF multiplier (half multiplier).
  • Finalization takes as many clock cycles as 1 block.
  • BRAM stores only the intermediate hash and IV.
  • One new column every 3 clock cycles, P and Q interleaved.
  • Scalability: reducing the number of clock cycles per column by adding S-boxes and/or GF multipliers.
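
To make the "GF multiplier" in the Mix Column step concrete, here is a minimal software sketch of multiplication in GF(2^8), assuming the AES reduction polynomial x^8 + x^4 + x^3 + x + 1 (0x11B), which Grøstl shares with AES as far as I know. It does not model the "half multiplier" structure of the hardware, and the function names are illustrative.

```python
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1 (assumed field polynomial)


def xtime(a):
    """Multiply a field element by x (i.e. by 0x02) and reduce."""
    a <<= 1
    if a & 0x100:
        a ^= AES_POLY
    return a & 0xFF


def gf_mul(a, b):
    """Shift-and-add multiplication of two bytes in GF(2^8)."""
    result = 0
    while b:
        if b & 1:
            result ^= a  # addition in GF(2^8) is XOR
        a = xtime(a)
        b >>= 1
    return result


if __name__ == "__main__":
    print(hex(gf_mul(0x57, 0x83)))
```

Multiplications by small constants reduce to a few `xtime` steps and XORs, which is why a narrow multiplier reused over several clock cycles can trade throughput for area.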

SLIDE 14

JH Algorithm

[Figure: JH block diagram: 1024-bit state starting from S0, 512-bit message block M combined with the state before and after E8; E8 consists of Group, 42 rounds of R8 (S-box, L, P) with round constants derived from C0 through R6 (S-box, L, P), and De-group; 256-bit hash H.]

Key Features
  • Grouping: reordering of the 1024-bit state.
  • S-box: permutation.
  • Linear transformation: rotation and XOR.
  • De-grouping: inverse of grouping.
  • Permutation, grouping, and de-grouping make scaling difficult.
  • Folding increases size.

SLIDE 15

JH Implementation

[Figure: JH datapath: BRAM (Port-A, Port-B) connected to din and dout, Group and De-group blocks, S-box, L, and P stages, a DRAM feeding R6, and pipeline registers on a 32-bit bus.]

Implementation
  • State: BRAM.
  • R8 function: implementing 8x2 S-boxes for R8 (S0 and S1).
  • R6 function: 2 S-boxes for R6 (S0).
  • 32-bit datapath to maximize use of the Block RAM.
  • On-the-fly generation of round constants.
  • Scalability: a 64-bit datapath is only viable without Block RAM.
  • Improvement: group can be performed on M and de-group only on H, see Kerckhof et al. [ECRYPT II 2011].

SLIDE 16

Keccak Algorithm

[Figure: Keccak block diagram: 1088-bit message block M padded with 512 zeros and XORed into the 1600-bit state, 24 rounds of θ, ρ & π, χ, ι with rotate constants and round constants; 256-bit hash H.]

Key Features
  • θ: simple XOR.
  • ρ, π: rotation and reordering.
      operate on columns
  • χ: logical operation on rows.
  • Dependency on previous states prevents folding.
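
The "logical operation on rows" of χ can be sketched in a few lines. This follows the χ step as defined in the Keccak specification, applied to one row of five 64-bit lanes; the function name and the list-based row layout are illustrative assumptions.

```python
MASK64 = (1 << 64) - 1


def chi_row(row):
    """Apply the Keccak chi step to one row of five 64-bit lanes.

    Each output lane is the input lane XORed with
    (NOT next lane) AND (lane after next), indices taken modulo 5.
    """
    if len(row) != 5:
        raise ValueError("a row has exactly five lanes")
    return [
        row[x] ^ ((~row[(x + 1) % 5] & MASK64) & row[(x + 2) % 5])
        for x in range(5)
    ]


if __name__ == "__main__":
    print(chi_row([0, 0, 1, 0, 0]))
```

Every output lane depends on three input lanes of the same row, so a narrow datapath has to keep the whole row available before it can overwrite any of it.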

SLIDE 17

Keccak Implementation

[Figure: Keccak datapath: BRAM (Port-A, Port-B) connected to din and dout, a Rho&Pi block containing the rotator, a Chi block, registers Reg-A, Reg-B, Reg-C, Reg-V, and round-constant input rc_a on a 64-bit internal path with 32-bit memory ports.]

Implementation
  • Round constants and state: BRAM.
  • Quasi-pipelined θ, ρ, and π.
  • Fixed rotations turn into a variable rotator for small datapaths.
  • ρ & π contains the rotator.
  • Scalability: a 64-bit datapath is only viable without Block RAM.
  • Adding 2 more 64-bit registers saves approx. 700 clock cycles.
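
The remark that fixed rotations turn into a variable rotator on a small datapath can be illustrated in software. The sketch below rotates a 64-bit lane that is stored as two 32-bit halves, with the rotation amount passed as a run-time value; it is an illustration under that assumption, not a model of the authors' rotator.

```python
MASK32 = 0xFFFFFFFF
MASK64 = (1 << 64) - 1


def rotl64(x, n):
    """Reference: rotate a 64-bit word left by n bits."""
    n %= 64
    return ((x << n) | (x >> (64 - n))) & MASK64


def rotl64_halves(hi, lo, n):
    """Rotate the lane (hi:lo) left by n bits using only 32-bit words."""
    n %= 64
    if n >= 32:
        # Rotating by 32 is just a swap of the two halves.
        hi, lo = lo, hi
        n -= 32
    if n == 0:
        return hi, lo
    new_hi = ((hi << n) | (lo >> (32 - n))) & MASK32
    new_lo = ((lo << n) | (hi >> (32 - n))) & MASK32
    return new_hi, new_lo


if __name__ == "__main__":
    lane = 0x0123456789ABCDEF
    hi, lo = rotl64_halves(lane >> 32, lane & MASK32, 44)
    print(hex((hi << 32) | lo), hex(rotl64(lane, 44)))
```

In a fully unrolled 64-bit design each lane's rotation is fixed wiring; once the same 32-bit hardware serves all lanes in turn, the amount becomes an input, hence the rotator.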

SLIDE 18

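Skein Algorithm

[Figure: Skein block diagram: Threefish with 512-bit state, Keygen driven by a 128-bit Tweak, 18 groups of 4 rounds (plus one final key addition, 18x+1), each round made of MIX functions (64-bit addition, rotation <<<R, XOR) and permutation P; message block M, IV, and 256-bit hash H.]

Key Features
  • MIX function: addition, rotation, and XOR.
  • Tweak constant: key generation for each block.
  • 64-bit adders lead to a long delay.
  • Algorithm cannot be folded.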
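
The MIX function listed above is small enough to write out. This is a minimal sketch of MIX on two 64-bit words as defined for Threefish; the rotation amount `r` is passed in as a parameter because the per-round rotation constants are not reproduced here.

```python
MASK64 = (1 << 64) - 1


def rotl64(x, n):
    """Rotate a 64-bit word left by n bits."""
    n %= 64
    return ((x << n) | (x >> (64 - n))) & MASK64


def mix(x0, x1, r):
    """Threefish MIX: one addition, one rotation, one XOR."""
    y0 = (x0 + x1) & MASK64   # 64-bit modular addition
    y1 = rotl64(x1, r) ^ y0   # rotate, then XOR with the sum
    return y0, y1


if __name__ == "__main__":
    print(mix(1, 1, 1))
```

The XOR needs the result of the full 64-bit addition, so the carry chain of the adder sits directly on the path through MIX.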

SLIDE 19

Skein Implementation

[Figure: Skein datapath: BRAM (Port-A, Port-B) connected to din and dout, Tweak input, registers reg-1, reg-2, and Reg, a 32-bit adder, and the barrel shifter <<<R on a 64-bit path split into 32-bit halves.]

Implementation
  • State: BRAM.
  • Key generation and MIX: 32-bit adder.
  • With the 32-bit adder, the critical path runs through the barrel shifter.
  • The barrel shifter is the single largest block in the design (192 slices).
  • Finalization takes as many clock cycles as hashing 1 block.
  • Scalability: running Keygen and MIX in parallel.
  • Improvement: adding registers to cut down the critical path delay.
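
Using a 32-bit adder for Skein's 64-bit additions means every addition takes two passes with a carry in between. The sketch below shows that idea in software; it is an assumption about how such a split works in general, not a description of the authors' control logic.

```python
MASK32 = 0xFFFFFFFF
MASK64 = (1 << 64) - 1


def add64_with_32bit_adder(a, b):
    """Add two 64-bit words modulo 2^64 using two 32-bit additions."""
    # Pass 1: low halves, keep the carry-out.
    lo = (a & MASK32) + (b & MASK32)
    carry = lo >> 32
    # Pass 2: high halves plus the carry from pass 1.
    hi = (a >> 32) + (b >> 32) + carry
    return ((hi & MASK32) << 32) | (lo & MASK32)


if __name__ == "__main__":
    x, y = 0xFFFFFFFFFFFFFFFF, 0x0000000000000001
    print(hex(add64_with_32bit_adder(x, y)))
```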

SLIDE 20

Throughput versus Area on Spartan-3

[Figure: scatter plot of throughput (Mbps) versus area (slices, 400 to 600) on Spartan-3 for Shabal, BLAKE-32, Grøstl-0, SHAvite-3, BMW, ECHO, Hamsi, Luffa, Fugue, JH, CubeHash, Keccak, Skein, BLAKE-256, Grøstl, and JH42; markers distinguish Round 2, Round 2 & 3, and Round 3 versions.]

SLIDE 21

Ranking by Throughput over Area on Spartan-3

[Figure: bar chart of TP/Area (Mbps/slice, 0.0 to 1.0) for long and short messages, in the order Shabal, BLAKE-32, BLAKE-256, Grøstl-0, Grøstl, SHAvite-3, BMW, ECHO, Luffa, Fugue, Hamsi, JH42, JH, Keccak, CubeHash, Skein.]

Algorithms with finalization rounds perform worse for short messages.

SLIDE 22

Implementation Summary of Finalists for Long Messages

Clock cycles to hash N blocks: clk = st + (l + p) · N + end
Throughput: b / ((l + p) · T)

Algorithm   Block size b (bits)   Clock cycles to hash N blocks    Throughput
BLAKE-256   512                   2 + (32 + 318) · N + 17          512 / (350 · T)
Grøstl      512                   2 + (32 + 515) · N + 532         512 / (547 · T)
JH42        512                   35 + (32 + 1813) · N − 15        512 / (1845 · T)
Keccak      1088                  2 + (68 + 3696) · N + 17         1088 / (3764 · T)
Skein       512                   5 + (32 + 2407) · N + 2423       512 / (2439 · T)

  • st: clock cycles for computing initial steps before processing.
  • l + p: loading and processing cycles per block.
  • end: clock cycles needed for finalization and output of the hash.
  • st and end are ignored for long messages, as their influence goes toward zero.
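
The cycle-count and throughput formulas can be evaluated directly. The sketch below uses the cycle counts from this slide and takes T to be the maximum delay in ns reported for Spartan-3 in the results; the class and field names are illustrative, and the per-message throughput for small N is my reading of how the short-message numbers are derived (N = 1).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CycleModel:
    block_bits: int  # b
    st: int          # initial steps before processing
    load: int        # l, loading cycles per block
    proc: int        # p, processing cycles per block
    end: int         # finalization and output

    def cycles(self, n_blocks: int) -> int:
        """clk = st + (l + p) * N + end"""
        return self.st + (self.load + self.proc) * n_blocks + self.end

    def throughput_long_mbps(self, t_ns: float) -> float:
        """b / ((l + p) * T); bits per ns scaled to Mbps."""
        return self.block_bits / ((self.load + self.proc) * t_ns) * 1000.0

    def throughput_mbps(self, n_blocks: int, t_ns: float) -> float:
        """Throughput for a message of N blocks, st and end included."""
        bits = self.block_bits * n_blocks
        return bits / (self.cycles(n_blocks) * t_ns) * 1000.0


# (cycle model, maximum delay T in ns on Spartan-3)
FINALISTS = {
    "BLAKE-256": (CycleModel(512, 2, 32, 318, 17), 8.42),
    "Grøstl": (CycleModel(512, 2, 32, 515, 532), 6.95),
    "JH42": (CycleModel(512, 35, 32, 1813, -15), 9.74),
    "Keccak": (CycleModel(1088, 2, 68, 3696, 17), 8.30),
    "Skein": (CycleModel(512, 5, 32, 2407, 2423), 10.68),
}

if __name__ == "__main__":
    for name, (model, t_ns) in FINALISTS.items():
        long_tp = model.throughput_long_mbps(t_ns)
        short_tp = model.throughput_mbps(1, t_ns)
        print(f"{name:10s} long {long_tp:6.1f} Mbps  short {short_tp:6.1f} Mbps")
```

The large `end` terms of Grøstl and Skein roughly double the cycle count for a one-block message, which is why their short-message throughput is about half the long-message value.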

SLIDE 23

Implementation Results of Finalists on Xilinx Spartan-3

                                                     Long message                Short message
Algorithm   Area       Block   Maximum delay T   Throughput   TP/Area        Throughput   TP/Area
            (slices)   RAMs    (ns)              (Mbps)       (Mbps/slice)   (Mbps)       (Mbps/slice)
BLAKE-256   545        1       8.42              173.8        0.32           164.8        0.302
Grøstl      537        1       6.95              134.6        0.25           68.1         0.127
JH42        428        1       9.74              28.5         0.07           28.2         0.066
Keccak      582        1       8.30              34.8         0.06           34.7         0.060
Skein       491        1       10.68             19.7         0.04           9.9          0.020

Short message: includes the clock cycles associated with initialization, loading, processing, and finalization.
  • Number of message blocks N = 1 after padding for a short message.
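
The TP/Area columns and the resulting ranking follow from the area and throughput columns. A minimal sketch using the Spartan-3 numbers from the table above; the variable and function names and the budget check against the 400-600 slice target are illustrative.

```python
# name: (area in slices, long-message Mbps, short-message Mbps)
RESULTS = {
    "BLAKE-256": (545, 173.8, 164.8),
    "Grøstl": (537, 134.6, 68.1),
    "JH42": (428, 28.5, 28.2),
    "Keccak": (582, 34.8, 34.7),
    "Skein": (491, 19.7, 9.9),
}

BUDGET_SLICES = (400, 600)  # area budget stated for the Spartan-3 target


def tp_per_area(area_slices, throughput_mbps):
    """Throughput-to-area ratio in Mbps per slice."""
    return throughput_mbps / area_slices


def ranking(column):
    """Names sorted by TP/Area, best first; column 1 = long, 2 = short."""
    return sorted(
        RESULTS,
        key=lambda name: tp_per_area(RESULTS[name][0], RESULTS[name][column]),
        reverse=True,
    )


if __name__ == "__main__":
    for name, (area, long_tp, short_tp) in RESULTS.items():
        in_budget = BUDGET_SLICES[0] <= area <= BUDGET_SLICES[1]
        print(f"{name:10s} {tp_per_area(area, long_tp):.2f} "
              f"{tp_per_area(area, short_tp):.3f} in budget: {in_budget}")
    print(ranking(1))
```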

SLIDE 24

Throughput over Area of 5 Finalists

[Figure: two bar charts of TP/Area (Mbps/slice) for long and short messages, one for Xilinx Virtex 5 and one for Xilinx Virtex 6, both in the order BLAKE-256, Grøstl, Keccak, JH42, Skein.]

No difference in ranking on the two devices. Keccak better than JH here, compared to Spartan-3.

SLIDE 25

Throughput over Area of 5 Finalists

[Figure: two bar charts of TP/Area for long and short messages. Xilinx Spartan 6 (Mbps/slice), in the order BLAKE-256, Grøstl, Keccak, JH42, Skein. Altera Cyclone II (Mbps/LE), in the order Grøstl, BLAKE-256, Keccak, JH42, Skein.]

Grøstl better than BLAKE on Cyclone II. Small changes in ranking depending on device.

SLIDE 26

Comparison with [Kerckhof] Results (Virtex 6)

[Figure: scatter plot of throughput (Mbps) versus area (slices) on Virtex 6 for BLAKE-256, Grøstl, JH42, Keccak, and Skein, comparing our results with those of Kerckhof et al., marked (K); the range of our results is highlighted.]

SLIDE 27

Comparison with [Jungk] Results (Virtex 5)

[Figure: scatter plot of throughput (Mbps) versus area (slices) on Virtex 5 for BLAKE-256, Grøstl, JH42, Keccak, and Skein, comparing our results with those of Jungk et al., marked (J); the range of our results is highlighted.]

SLIDE 28

Announcement

All the above data will shortly be available in the ATHENa database.

http://cryptography.gmu.edu/athenadb/

Source codes will be available on the ATHENa webpage at the end of December.

http://cryptography.gmu.edu/athena/
Follow the “GMU Source Codes” link.

Thanks

This work has been supported in part by NIST through the Recovery Act Measurement Science and Engineering Research Grant Project under contract no. 60NANB10D004.

SLIDE 29

Thanks for your attention.
