Toward Fair and Comprehensive Benchmarking of CAESAR Candidates in Hardware: Standard API, High-Speed Implementations in VHDL/Verilog, and Benchmarking Using FPGAs

slide-1
SLIDE 1

Toward Fair and Comprehensive Benchmarking of CAESAR Candidates in Hardware: Standard API, High-Speed Implementations in VHDL/Verilog, and Benchmarking Using FPGAs

Ekawat Homsirikamol, William Diehl, Ahmed Ferozpuri, Farnoud Farahmand, Michael X. Lyons, Panasayya Yalla, and Kris Gaj
George Mason University, USA

Based on work partially supported by the National Science Foundation under Grant No. 1314540

slide-2
SLIDE 2

GMU Benchmarking Team

“Ice” Homsirikamol, Will Diehl, Ahmed Ferozpuri, Farnoud Farahmand, Mike X. Lyons, Panasayya Yalla

slide-3
SLIDE 3

Evaluation Criteria in Cryptographic Contests

  • Security
  • Software Efficiency (µProcessors, µControllers)
  • Hardware Efficiency (FPGAs, ASICs)
  • Simplicity
  • Flexibility
  • Licensing

slide-4
SLIDE 4

Hardware Benchmarking in Previous Contests

  • AES (1999-2000): 5 final candidates
  • eSTREAM (2007-2008): 8 Phase-3 candidates
  • SHA-3 (2010-2012): 14 Round 2 Candidates + 5 Final Candidates
  • CAESAR (2016): 29 Round 2 Candidates

slide-5
SLIDE 5

New in CAESAR

  • 1) standard hardware Application Programming Interface (API)
  • 2) comprehensive Implementer’s Guide and Development Package, including VHDL and Python code common for all candidates
  • 3) the design teams have been asked to submit their own Verilog/VHDL code
slide-6
SLIDE 6

CAESAR Hardware API

slide-7
SLIDE 7

CAESAR Hardware API

Specifies:

  • Minimum Compliance Criteria
  • Interface
  • Communication Protocol
  • Timing Characteristics

Assures:

  • Compatibility
  • Fairness

slide-8
SLIDE 8

CAESAR Hardware API - Timeline

  • July 2015, CryptArchi, Leuven, GMU API v1.0
  • Sep. 2015, DIAC, Singapore, GMU API v1.1
  • Dec. 2015, ReConFig, Cancun, GMU API v1.2
  • Feb. 16, 2016, proposed CAESAR API v1.0
  • Mar. 22, 2016, CAESAR Committee considers adoption
  • May 7, 2016, official adoption by the CAESAR Committee
  • May 12, 2016, final version of CAESAR API v1.0
  • June 30, 2016, deadline for VHDL/Verilog Code
  • August 12, 2016, last submission of the code

slide-9
SLIDE 9

CAESAR API v1.0 vs. GMU API v1.2 (Feb. 16, 2016)

  • Functional Changes
    § Supporting both high-speed and lightweight implementations
    § Supporting both single-pass and two-pass algorithms
    § Moving the buffering of decrypted data to an external unit, common for all candidates
    § No passing of Npub and AD to the output
    § Specifying the maximum size of AD/message/ciphertext explicitly
    § Requiring full support for key scheduling
  • Editorial Changes
    § Adding Minimum Compliance Criteria & Timing Characteristics
    § Separating from the Implementer’s Guide
slide-10
SLIDE 10

Advantages of CAESAR API v1.0 vs. GMU API v1.2

  • Simplified:
    § code development
    § definitions of timing parameters for decryption
    § resource utilization characterization
    § benchmarking
  • Aimed to:
    § speed up coding
    § encourage more design teams to get involved

slide-11
SLIDE 11

Limitations of the CAESAR API v1.0

Interface:

  • No parallel loading of AD and Message (used by Keyak)

Protocol:

  • No support for intermediate tags (used by variants of ELmD, POET, TriviA-ck, and COLM)
  • No protocol support for a second pass without storing intermediate results (or the entire input) inside of the authenticated cipher core

slide-12
SLIDE 12

CAESAR Implementer’s Guide & Development Package

slide-13
SLIDE 13

Top-level block diagram of a High-Speed architecture

Diagram: the top-level AEAD unit is composed of the PreProcessor, the CipherCore (CipherCore Datapath and CipherCore Controller), the PostProcessor, and the CMD FIFO. Signals are marked as either Required or Optional.

  • External ports: pdi_data, pdi_valid, pdi_ready (width w); sdi_data, sdi_valid, sdi_ready (width sw); do_data, do_valid, do_ready (width w)
  • Key signals: key (KEY_SIZE), key_valid, key_ready, key_update
  • Block data input: bdi (DBLK_SIZE), bdi_valid, bdi_ready, bdi_type, bdi_eot, bdi_eoi, bdi_partial, bdi_size (LBS_BYTES+1), bdi_valid_bytes (DBLK_SIZE/8), bdi_pad_loc (DBLK_SIZE/8), decrypt
  • Block data output: bdo (DBLK_SIZE), bdo_valid, bdo_ready, bdo_size (LBS_BYTES+1), msg_auth_valid, msg_auth_done
  • CMD FIFO: din, din_valid, din_ready, dout, dout_valid, dout_ready; cmd, cmd_valid, cmd_ready
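
The paired `*_valid` / `*_ready` signals in the diagram suggest a conventional valid/ready handshake, in which a word is transferred only in a clock cycle where both signals are high. The Python sketch below models that behavior under this assumption. The function `simulate_handshake` and the ready pattern are hypothetical names used for illustration; they are not part of the CAESAR API or the Development Package.

```python
from typing import List, Sequence, Tuple


def simulate_handshake(words: Sequence[int],
                       ready_pattern: Sequence[bool]) -> Tuple[List[int], int]:
    """Cycle-by-cycle model of a valid/ready transfer (illustrative only).

    The source keeps `valid` high while it still has words to send.
    The sink's `ready` follows `ready_pattern`, repeated cyclically.
    Returns the received words and the number of clock cycles used.
    """
    if not any(ready_pattern):
        raise ValueError("sink is never ready; the transfer would not finish")

    received: List[int] = []
    index = 0   # next word the source will present
    cycle = 0   # clock cycle counter
    while index < len(words):
        valid = True
        ready = ready_pattern[cycle % len(ready_pattern)]
        if valid and ready:          # a word moves only when both are high
            received.append(words[index])
            index += 1
        cycle += 1
    return received, cycle


if __name__ == "__main__":
    # Sink ready every other cycle: three words need five cycles.
    print(simulate_handshake([0xA, 0xB, 0xC], [True, False]))
```

A sink that is always ready accepts one word per cycle, so back-pressure from the sink (not the source) determines the transfer time in this model.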

slide-14
SLIDE 14

Development Package (May 12, 2016 - present)

  • a. VHDL code of a generic PreProcessor, PostProcessor, and CMD FIFO, common for all Round 2 Candidates (src_rtl)
  • b. Universal testbench common for all Round 2 candidates (AEAD_TB)
  • c. Python app used to automatically generate test vectors (aeadtvgen)
  • d. Six reference high-speed implementations of Dummy authenticated ciphers (dummyN)
slide-15
SLIDE 15

The API Compliant Code Development

Diagram elements (design flow):

  • Specification
  • Reference C Code
  • Test Vectors
  • Development Package: src_rtl, aeadtvgen, AEAD_TB, dummyN
  • Formulas for the Execution Time & Throughput
  • Manual Design
  • HDL Code
  • Functional Verification (Pass/Fail)
  • FPGA Tools
  • Post Place & Route Results (Resource Utilization, Max. Clock Frequency)

slide-16
SLIDE 16

Overview of Submitted Designs

slide-17
SLIDE 17

Submitters

  • 1. CCRG NTU (Nanyang Technological University) Singapore – ACORN, AEGIS, JAMBU, & MORUS
  • 2. CLOC-SILC Team, Japan – CLOC & SILC
  • 3. Ketje-Keyak Team – Ketje & Keyak
  • 4. Lab Hubert Curien, St. Etienne, France – ELmD & TriviA-ck
  • 5. Axel Y. Poschmann and Marc Stöttinger – Deoxys & Joltik
  • 6. NEC Japan – AES-OTR
  • 7. IAIK TU Graz, Austria – Ascon
  • 8. DS Radboud University Nijmegen, Netherlands – HS1-SIV
  • 9. IIS ETH Zurich, Switzerland – NORX
  • 10. Pi-Cipher Team – Pi-Cipher
  • 11. EmSec RUB, Germany – POET
  • 12. CG UCL, INRIA – SCREAM
  • 13. Shanghai Jiao Tong University, China – SHELL

Total: 19 Candidate Families

slide-18
SLIDE 18

Submitters - GMU Benchmarking Team

“Ice” Homsirikamol AES-GCM, AEZ, Ascon, Deoxys, HS1-SIV, ICEPOLE, Joltik, NORX, OCB, PAEQ, Pi-Cipher, STRIBOB Will Diehl Ahmed Ferozpuri PRIMATEs-GIBBON & HANUMAN, PAEQ Farnoud Farahmand AES-COPA CLOC Mike X. Lyons TriviA-ck Minalpher OMD POET SCREAM

Total: 19 Candidate Families + AES-GCM

slide-19
SLIDE 19

Variant vs. Architecture

  • Variants: for the same input, Variant 1 produces output_1 and Variant 2 produces output_2, where output_2 ≠ output_1
  • Architectures: for the same input, Arch 1 and Arch 2 produce the same output

Typically different throughput, area
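
The distinction can be illustrated with a small Python sketch. Everything in it is a toy assumption: `toy_round` is a made-up round function, not taken from any CAESAR candidate, a "variant" is modeled as a different number of rounds, and an "architecture" as a different number of rounds computed per clock cycle (1 for a basic iterative design, 2 for a two-times unrolled one).

```python
from typing import Tuple

MASK32 = 0xFFFFFFFF


def toy_round(state: int, round_index: int) -> int:
    """Hypothetical 32-bit round function used only for illustration."""
    return (3 * state + round_index + 1) & MASK32


def run(state: int, rounds: int, rounds_per_cycle: int) -> Tuple[int, int]:
    """Apply `rounds` rounds, `rounds_per_cycle` of them per clock cycle.

    Returns (output, clock_cycles). The output depends only on the
    variant (number of rounds); the cycle count depends on the architecture.
    """
    cycles = 0
    done = 0
    while done < rounds:
        for _ in range(min(rounds_per_cycle, rounds - done)):
            state = toy_round(state, done)
            done += 1
        cycles += 1
    return state, cycles


if __name__ == "__main__":
    # Two architectures of the same variant (8 rounds).
    iterative = run(1, rounds=8, rounds_per_cycle=1)
    unrolled = run(1, rounds=8, rounds_per_cycle=2)
    print(iterative[0] == unrolled[0], iterative[1], unrolled[1])

    # A second variant (12 rounds) yields a different output for the same input.
    print(run(1, rounds=12, rounds_per_cycle=1)[0] != iterative[0])
```

In hardware, the unrolled architecture would finish in fewer clock cycles at the cost of more logic per cycle, which is why architectures of one variant typically differ in throughput and area but never in their output.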

slide-20
SLIDE 20

Round 2 Statistics

  • 43 hardware design packages
  • 75 variant-architecture pairs
  • Covering the majority of primary variants of 28 out of 29 Round 2 Candidate Families (all except Tiaoxin)

  • High-speed implementation of AES-GCM (baseline)

The biggest and the earliest hardware benchmarking effort in the history of cryptographic competitions

slide-21
SLIDE 21

Summary of Submitted Designs

  • 2 Compliant Designs + 1 Non-Compliant Design (1): TriviA-ck
  • 2 Compliant Designs (3): Ascon, CLOC, Minalpher
  • 1 Compliant Design + 1 Non-Compliant Design (8): Deoxys, ELmD, HS1-SIV, Joltik, NORX, Pi-Cipher, POET, SCREAM
  • 1 Compliant Design (17): ACORN, AEGIS, AES-COPA, AES-JAMBU, AES-OTR, AEZ, ICEPOLE, Ketje, Keyak, MORUS, OCB, OMD, PAEQ, PRIMATEs-GIBBON, HANUMAN, SHELL, SILC, STRIBOB
  • No Designs (1): Tiaoxin

slide-22
SLIDE 22

Non-Compliant Designs

Columns: Algorithm (Target); Hardware designers; No decryption; Full-block width interface; No support for CAESAR API Protocol; Wrapper required

  • Deoxys & Joltik (ASIC); Axel Y. Poschmann & Marc Stöttinger; X X X
  • POET (ASIC, FPGA); Amir Moradi; X X
  • SCREAM (ASIC, FPGA); Lubos Gaspar & Stephanie Kerckhof; X X
  • NORX (ASIC); Michael Muehlberghuber; X X X

slide-23
SLIDE 23

Partial Compliance

Keyak (by the Ketje-Keyak Team)

  • Compliance criteria:
    § supported maximum size for AD should be 2^32-1 bytes
  • Implementation:
    § supported maximum size for AD is 24 bytes
    § In the Motorist mode, metadata (AD) is input together with the plaintext and possibly in input blocks after it
  • Feature unique for Keyak
  • No plug-in replacement for AES-GCM
slide-24
SLIDE 24

Architectures

  • Majority of algorithms have designs based on Basic Iterative Architecture
    § One round per clock cycle
    § Straightforward
    § Easy to describe in VHDL/Verilog
    § Best or close to best throughput/area
    § Hard to optimize
  • Other Architectures:
    § Lightweight: ACORN
    § Folded: HS1-SIV, Pi-Cipher
    § Unrolled (extra): Ascon, SCREAM
    § With Speculative Precomputation: Deoxys
slide-25
SLIDE 25

25

Key sizes

  • Majority of implemented ciphers support 128-bit keys only

Exceptions: § AES-JAMBU, Ketje: 96 § AEZ: 384 § PRIMATEs: 80 & 120 § STRIBOB: 192 § Joltik: 64 & 128 § Pi-Cipher: 96, 128, 256 § Deoxys, NORX: 128 & 256 Possible allowed key ranges: |K| ≥ 96 |K| ≥ 120

  • covers all families
  • excludes variants with

64 and 80-bit keys

  • covers all families except AES-JAMBU and Ketje
  • covers stronger variants of PRIMATEs
  • excludes lightweight variants
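
The effect of the two candidate thresholds can be checked mechanically from the key sizes listed above. The sketch below is only an illustration: the dictionary holds just the exceptions named on this slide (all other implemented ciphers use 128-bit keys and pass either threshold), and `split_by_min_key` is a hypothetical helper name.

```python
from typing import Dict, List, Tuple

# Key sizes (in bits) of the exceptions listed on the slide.
KEY_SIZES: Dict[str, List[int]] = {
    "AES-JAMBU": [96],
    "Ketje": [96],
    "AEZ": [384],
    "PRIMATEs": [80, 120],
    "STRIBOB": [192],
    "Joltik": [64, 128],
    "Pi-Cipher": [96, 128, 256],
    "Deoxys": [128, 256],
    "NORX": [128, 256],
}


def split_by_min_key(min_bits: int) -> Tuple[List[str], List[int]]:
    """Return (families left with no allowed key size, excluded key sizes)."""
    excluded_families = sorted(
        name for name, sizes in KEY_SIZES.items()
        if all(size < min_bits for size in sizes)
    )
    excluded_sizes = sorted(
        {size for sizes in KEY_SIZES.values() for size in sizes if size < min_bits}
    )
    return excluded_families, excluded_sizes


if __name__ == "__main__":
    for threshold in (96, 120):
        print(threshold, split_by_min_key(threshold))
```

With a threshold of 96 bits no family is dropped and only the 64-bit and 80-bit variants fall out; with 120 bits, AES-JAMBU and Ketje lose their only listed key size, in line with the slide.
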
slide-26
SLIDE 26

PDI & DO Ports Width, w

  • The CAESAR API Minimum Compliance Criteria allow
    § High-speed: 32 ≤ w ≤ 256
    § Lightweight: w = 8, 16, 32
  • Majority of the API compliant implementations support w=32 or 64 only

Exceptions:

  § ACORN: 8 & 32
  § PRIMATEs: 40
  § HS1-SIV: 128
  § NORX, Pi-Cipher: 128 & 256
  § AEGIS, ICEPOLE, MORUS: 256
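
As a minimal sketch, the width rule quoted above can be written as a single predicate. It encodes only the two ranges stated on this slide; any further constraints in the full Minimum Compliance Criteria are not modeled, and `width_is_allowed` is a hypothetical name.

```python
def width_is_allowed(w: int, lightweight: bool = False) -> bool:
    """Check a PDI/DO port width against the ranges stated on the slide."""
    if lightweight:
        return w in (8, 16, 32)      # Lightweight: w = 8, 16, 32
    return 32 <= w <= 256            # High-speed: 32 <= w <= 256


if __name__ == "__main__":
    # Widths mentioned on the slide, checked as high-speed designs.
    for name, w in [("PRIMATEs", 40), ("HS1-SIV", 128), ("AEGIS", 256)]:
        print(name, w, width_is_allowed(w))
    # ACORN's 8-bit width fits only the lightweight range.
    print("ACORN", 8, width_is_allowed(8), width_is_allowed(8, lightweight=True))
```

Note that w = 32 is the only width that satisfies both ranges.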

slide-27
SLIDE 27

Benchmarking Methodology

slide-28
SLIDE 28

FPGA Families & Devices Used for Benchmarking

High-Performance FPGA Families used for benchmarking of All Round 2 Candidates & AES-GCM:

  • Xilinx Virtex-6: xc6vlx240tff1156-3
  • Xilinx Virtex-7: xc7vx485tffg1761-3
  • Altera Stratix IV: ep4se530h35c2
  • Altera Stratix V: 5sgxea7k2f40c1

Low-Cost FPGA Families used for benchmarking of 10 Candidates with the Smallest Area in High-Performance Benchmarking:

  • Xilinx Spartan-6: xc6slx16csg324-3
  • Xilinx Artix-7: xc7a100tcsg324-3
  • Altera Cyclone IV: EP4CE22F17C6
  • Altera Cyclone V: 5CEBA4F23C7

slide-29
SLIDE 29

RTL Benchmarking

Diagram elements (benchmarking flow):

  • HDL Code
  • FPGA Tools
  • Automated Optimization
  • Post Place & Route Results (Resource Utilization, Max. Clock Frequency)
  • Replication Script
  • Optimal Options of Tools (for the best Throughput/Area)

slide-30
SLIDE 30

FPGA Tools (1)

For Benchmarking Targeting Xilinx FPGAs (other than Virtex 7):

  • Target FPGAs: Virtex-6, Spartan 6, Artix 7
  • Synthesis Tool: Xilinx XST 14.7
  • Implementation Tool: Xilinx ISE 14.7
  • Automated Optimization: ATHENa

For Benchmarking Targeting Altera FPGAs:

  • Target FPGAs: Stratix IV, Stratix V, Cyclone IV, Cyclone V
  • Synthesis Tool: Quartus Prime 16.0.0
  • Implementation Tool: Quartus Prime 16.0.0
  • Automated Optimization: ATHENa

slide-31
SLIDE 31

FPGA Tools (2)

For Benchmarking Targeting Xilinx Virtex 7 FPGAs:

  • Target FPGAs: Virtex-7
  • Synthesis Tool: Xilinx Vivado 2015.1
  • Implementation Tool: Xilinx Vivado 2015.1
  • Automated Optimization: 25 Default Strategies of Vivado

slide-32
SLIDE 32

Results

slide-33
SLIDE 33

Virtex-6

slide-34
SLIDE 34

Results for Virtex 6 – Throughput vs. Area, Linear Scale

slide-35
SLIDE 35

Results for Virtex 6 – Throughput vs. Area, Logarithmic Scale

Plot legend:

  • E – Throughput for Encryption
  • D – Throughput for Decryption
  • A – Throughput for Authentication Only
  • Default: Throughputs the same for all 3 operations

slide-36
SLIDE 36

Relative Throughput/Area in Virtex 6 vs. AES-GCM

Throughput/Area of AES-GCM = 1.020 (Mbit/s)/LUTs

  • E – Throughput/Area for Encryption
  • D – Throughput/Area for Decryption
  • A – Throughput/Area for Authentication Only
  • Default: Throughput/Area the same for all 3 operations

slide-37
SLIDE 37

Relative Throughput in Virtex 6

Ratio of a given Cipher Throughput/Throughput of AES-GCM

Throughput of AES-GCM = 3239 Mbit/s

  • E – Throughput for Encryption
  • D – Throughput for Decryption
  • A – Throughput for Authentication Only
  • Default: Throughput the same for all 3 operations

slide-38
SLIDE 38

Relative Area (#LUTs) in Virtex 6

Ratio of a given Cipher Area/Area of AES-GCM

Area of AES-GCM = 3175 LUTs
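
The three relative metrics used on the preceding result slides are plain ratios against the AES-GCM baseline. The following Python sketch reproduces the arithmetic using the Virtex-6 baseline figures quoted in the slides (3239 Mbit/s and 3175 LUTs); the function name `relative_metrics` and the example cipher figures are hypothetical.

```python
from typing import Dict

# AES-GCM baseline on Virtex-6, as quoted in the slides.
AES_GCM_THROUGHPUT_MBPS = 3239.0
AES_GCM_AREA_LUTS = 3175.0


def relative_metrics(throughput_mbps: float, area_luts: float) -> Dict[str, float]:
    """Return throughput, area, and throughput/area relative to AES-GCM."""
    baseline_ratio = AES_GCM_THROUGHPUT_MBPS / AES_GCM_AREA_LUTS
    return {
        "throughput": throughput_mbps / AES_GCM_THROUGHPUT_MBPS,
        "area": area_luts / AES_GCM_AREA_LUTS,
        "throughput_per_area": (throughput_mbps / area_luts) / baseline_ratio,
    }


if __name__ == "__main__":
    # Baseline throughput/area in (Mbit/s)/LUTs, rounded to three decimals.
    print(round(AES_GCM_THROUGHPUT_MBPS / AES_GCM_AREA_LUTS, 3))
    # Hypothetical cipher: twice the throughput at the same area.
    print(relative_metrics(6478.0, 3175.0))
```

The baseline ratio 3239 / 3175 rounds to 1.020 (Mbit/s)/LUTs, which agrees with the throughput/area figure given for AES-GCM.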

slide-39
SLIDE 39

ATHENa Database of Results
slide-40
SLIDE 40

ATHENa Database of Results

  • Available at http://cryptography.gmu.edu/athena
  • Developed by John Pham, a Master’s-level student of Jens-Peter Kaps, as a part of the SHA-3 Hardware Benchmarking project, 2010-2012 (sponsored by NIST)
  • In June 2015 extended to support Authenticated Ciphers

slide-41
SLIDE 41

One Stop Website

https://cryptography.gmu.edu/athena/index.php?id=CAESAR
OR
https://cryptography.gmu.edu/athena and click on Download

  • VHDL/Verilog Code of CAESAR Candidates: Summary I
  • VHDL/Verilog Code of CAESAR Candidates: Summary II
  • ATHENa Database of Results: Rankings View
  • ATHENa Database of Results: Table View
  • Benchmarking of Round 2 CAESAR Candidates in Hardware: Methodology, Designs & Results
  • GMU Implementations of Authenticated Ciphers and Their Building Blocks
  • CAESAR Hardware API v1.0
slide-42
SLIDE 42

Round 3 Benchmarking Goals & Timeline

slide-43
SLIDE 43

Round 3 Candidates Outperforming AES-GCM

High-Speed Implementations (4 FPGA families), Alphabetical Order

Throughput/Area:

  • 1. ACORN
  • 2. AEGIS
  • 3. Ascon
  • 4. Ketje
  • 5. Keyak
  • 6. MORUS
  • 7. NORX

Throughput:

  • 1. ACORN
  • 2. AEGIS
  • 3. Ascon
  • 4. Ketje
  • 5. Keyak
  • 6. MORUS
  • 7. NORX

slide-44
SLIDE 44

R3 Candidates – Relative Throughput/Area - Virtex 6

Throughput/Area of AES-GCM = 1.020 (Mbit/s)/LUTs

  • A – Throughput/Area for Authentication Only
  • Default: Throughput/Area the same for all 3 operations

slide-45
SLIDE 45

R3 Candidates – Relative Throughput - Virtex 6

Throughput of AES-GCM = 3239 Mbit/s

  • A – Throughput for Authentication Only
  • Default: Throughput the same for all 3 operations

slide-46
SLIDE 46

Round 3 Benchmarking Goals

  • I. Lightweight Implementations, benchmarked for area, throughput/area, power, energy/bit
    § 1. ACORN
    § 2. Ascon
    § 3. CLOC (TWINE-80, AES-128)
    § 4. JAMBU (SIMON, AES)
    § 5. Ketje
    § 6. SILC (PRESENT-80, LED-80, AES-128)
    § 7. Others (AES-OTR, COLM, Deoxys, Keyak, MORUS)?
  • II. Natural resistance to side-channel attacks & the cost of countermeasures
    § Possibly a subject of the next DPA Contest?

slide-47
SLIDE 47

Round 3 Benchmarking Goals

  • III. ASIC Benchmarking
    § High-speed implementations
    § Lightweight implementations
    § Implementations of two-pass algorithms (effect of external memory)
    § Side-channel resistance
  • IV. High-speed architectures supporting multiple messages processed in parallel
    § Multi-message pipelining
    § Extensions to API required
slide-48
SLIDE 48

Round 3 Benchmarking Goals

  • V. Investigating Throughput vs. Area Trade-offs (flexibility, wide range of applications)
    § Possible Architectures: folded, unrolled, with inner-round pipelining, etc.

slide-49
SLIDE 49

Round 3 Benchmarking Goals

  • VI. Experimental Setups
    § power/energy measurements
    § communication & control overhead of a hardware accelerator
    § operating system overhead
    § CAESAR API validation taking into account the most popular Bus Interfaces, such as AXI4 and PCIe
  • VII. Extensions Common for all Authenticated Ciphers
    § buffering of decrypted data before authentication
    § merging Npub, AD, Ciphertext, and Tag after decryption
    § word width conversion (for communication between implementations with different PDI/SDI/DO widths)

slide-50
SLIDE 50

Round 3 Benchmarking Timeline

  • Requests for changes in the CAESAR API: October 31, 2016
  • Round 3 VHDL/Verilog: At least two months before the announcement of finalists
  • Independent Benchmarking Efforts (ASIC, Side-channel, etc.): Early declarations and guidelines for designers strongly encouraged

slide-51
SLIDE 51

Conclusions

  • The biggest and the earliest hardware benchmarking effort in the history of cryptographic competitions
    § 14 hardware designer groups
    § 28 candidate families
    § 75 variant-architecture pairs
  • Key new features:
    § Standard API
    § Implementer’s Guide and Development Package
    § Algorithm designers requested to submit HDL code (possibly designed by other teams)
  • Modest but noticeable influence on the Round 3 selection

slide-52
SLIDE 52

Possible Improvements

  • Faster adoption of the submitted proposals (e.g., API) by the CAESAR Committee
  • More realistic and relaxed deadlines
  • Clear indication of the influence of hardware benchmarking on the final decision
  • Avoiding mixed signals:
    § “reference” hardware implementation
    § advancing candidates without VHDL/Verilog code
  • Early collaborations
  • More groups involved in various benchmarking efforts (lightweight, ASIC, side-channel)
  • Incentives: publication venues, grants, PhD/MS theses

slide-53
SLIDE 53

Thank you!

Questions? Comments? Suggestions?

ATHENa: http://cryptography.gmu.edu/athena
CERG: http://cryptography.gmu.edu