SW/HW Codesign of the Post-Quantum Cryptography Algorithm NTRUEncrypt Using HLS and RTL Design Methodologies
Farnoud Farahmand, Duc Tri Nguyen, Viet B. Dang*, Ahmed Ferozpuri and Kris Gaj
George Mason University
Post-Quantum Cryptography (PQC)
- Ongoing NIST PQC standardization process
- A total of 69 submissions in Round 1, of which 26 qualified for Round 2
Challenges
- Mathematical complexity
- Large amount of manpower
- New types of basic operations
- Constant-time implementations
- Need for new SCA (Side-Channel Attack) countermeasures against power and electromagnetic analysis
Risks of Early Hardware Implementations
- GMU implementation of DAGS developed in Fall 2017-Spring 2018.
- Preliminary results presented at the Code-Based Cryptography (CBC) workshop in April 2018.
- Attack against DAGS announced on May 16, 2018.
- DAGS did not qualify for Round 2.
Software/Hardware Codesign
Diagram labels: Most time-critical operation; Software; RTL or HLS-generated Hardware
SW/HW Codesign for PQC: Advantages
- Focus on a few (typically 1-3) major operations, known to be easily parallelizable
  - much shorter development time (at least by a factor of 10)
  - guaranteed substantial speed-up
- Insight regarding performance of future instruction set extensions of modern microprocessors
- Possibility of implementing multiple candidates by the same research group, eliminating the influence of different
  - design skills
  - operation subset (e.g., including or excluding key generation)
  - interface & protocol
  - optimization target
  - platform
Two Major Types of Platforms
FPGA Fabric & Hard-core Processors
Examples:
- Xilinx Zynq 7000 System on Chip (SoC)
- Xilinx Zynq UltraScale+ MPSoC
- Intel Arria 10 SoC FPGAs
- Intel Stratix 10 SoC FPGAs
Diagram labels: Processor w/ Memory & I/O; FPGA Fabric
FPGA Fabric, including Soft-core Processors
Examples: Xilinx Virtex UltraScale+ FPGAs and Intel Stratix 10 FPGAs, including soft-core processors such as
- Xilinx MicroBlaze
- Intel Nios II
- RISC-V, originally UC Berkeley
Diagram labels: FPGA Fabric; Soft-core Processor
Selected Platform
- FPGA Family: Xilinx Zynq UltraScale+ MPSoC
- Device: XCZU9EG-2FFVB1156E
- Prototyping Board: ZCU102 Evaluation Kit from Xilinx
- Processing System: Quad-core ARM Cortex-A53 Application Processing Unit, running at the frequency of 1.2 GHz (only one core used for benchmarking)
- Programmable Logic: Configurable Logic Blocks (CLB), Block RAMs, DSP units
Experimental Setup
Block diagram components: Zynq Processing System, AXI DMA, Input FIFO, Hardware Accelerator, Output FIFO, AXI Timer, Clocking wizard
Interfaces: AXI Lite, AXI Full, AXI Stream, FIFO, IRQ
Clock signals: Main Clock, UUT_clk, clk, rd_clk, wr_clk
Selected Algorithm
- NTRUEncrypt is one of the most well-known PQC algorithms that has withstood cryptanalysis.
- The speed of NTRUEncrypt in software, especially on embedded software platforms, is limited by the long execution time of polynomial multiplication.
- We implement two variants of the NIST Round 1 PQC candidate NTRUEncrypt, ntru-pke-443 and ntru-pke-743, in bare-metal mode.
- Polynomial multiplication is implemented in the Programmable Logic (PL) of Zynq using two approaches: RTL and HLS.
Accelerator Design
Target: Minimum Execution Time
Register-Transfer Level methodology with VHDL
- Block diagram of the Datapath and Algorithmic State Machine (ASM) chart of the Controller
High Level Synthesis methodology with C
- Goal: The same or comparable number of clock cycles as in the Register-Transfer Level (manual) implementation in VHDL
- Attempt 1: Reference implementation based on the grade school algorithm for multiplication (a.k.a. schoolbook, paper-and-pencil, etc.)
- Attempt 2: Optimized implementation based on rotation
- Multiple attempts at optimization using Vivado HLS directives (pragmas) and minor code changes
- Outcome 1: Tens of thousands of clock cycles, compared to the expected n=743 clock cycles
- Solution: Rewriting the code in C in such a way as to match the block diagram used to generate VHDL code
- Outcome 2: Expected functionality, with an execution time of around n clock cycles
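The two multiplication strategies named above can be compared with a small software model. The sketch below is illustrative only: it is written in Python, not in the C or VHDL used for the accelerator, and it assumes the ring Z_q[x]/(x^n - 1) with coefficients stored lowest degree first. The function names and the value q = 2048 are assumptions, not taken from the slides. The rotation-based version runs n rotate-and-accumulate steps; a design that updates all n coefficients in parallel in each step would then need on the order of n clock cycles, which matches the expected count quoted above.

```python
def schoolbook_mul(a, b, q):
    """Multiply a and b in Z_q[x]/(x^n - 1), one coefficient pair at a time."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            # x^(i+j) wraps around because x^n = 1 in this ring
            k = (i + j) % n
            c[k] = (c[k] + a[i] * b[j]) % q
    return c


def rotation_mul(a, b, q):
    """Same product, organized as n rotate-and-accumulate steps.

    Step j adds b[j] times a copy of `a` rotated by j positions.
    """
    n = len(a)
    c = [0] * n
    rotated = list(a)  # holds x^j * a at the start of step j
    for j in range(n):
        if b[j] != 0:
            # All n coefficients are updated together in this step
            c = [(ck + b[j] * rk) % q for ck, rk in zip(c, rotated)]
        # Multiplying by x is a cyclic shift by one position
        rotated = [rotated[-1]] + rotated[:-1]
    return c


# Multiplying by x (b = [0, 1, 0]) rotates the coefficients: [3, 1, 2]
print(rotation_mul([1, 2, 3], [0, 1, 0], 2048))
```

Both functions return the same result for any inputs of equal length, including polynomials b with coefficients in {-1, 0, 1}; only the loop structure differs.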
Speed-up achieved for Polynomial Multiplication
Bar chart (axis from 20 to 140). Categories: ntru-pke-443 ENC, ntru-pke-443 DEC, ntru-pke-743 ENC, ntru-pke-743 DEC. Series: RTL, HLS.
Data labels in extracted order: 89.1, 82.8, 128.5, 119.8, 81.9, 76.1, 106.8, 99.6
Total Speed-up achieved for entire ENC/DEC
Bar chart (axis from 1 to 8). Categories: ntru-pke-443 ENC, ntru-pke-443 DEC, ntru-pke-743 ENC, ntru-pke-743 DEC. Series: RTL, HLS.
Data labels in extracted order: 2.4, 4, 3.9, 6.8, 2.3, 4, 3.9, 6.8
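The gap between the speed-up of the accelerated operation and the total speed-up can be reasoned about with Amdahl's law. The slides do not state this model, so the sketch below is an assumption: it ignores any overhead not already included in the reported numbers, and it treats 89.1 and 2.4 (the first data label of each chart) as belonging to the same configuration, which the extracted text does not confirm. The function names are hypothetical.

```python
def total_speedup(fraction_accelerated, kernel_speedup):
    """Amdahl's law: overall speed-up when only part of the run time is accelerated."""
    remaining = (1.0 - fraction_accelerated) + fraction_accelerated / kernel_speedup
    return 1.0 / remaining


def implied_fraction(total, kernel_speedup):
    """Invert Amdahl's law: share of the original run time spent in the accelerated part."""
    return (1.0 - 1.0 / total) / (1.0 - 1.0 / kernel_speedup)


# Assumed pairing of chart labels: kernel speed-up 89.1, total speed-up 2.4
f = implied_fraction(2.4, 89.1)
print(round(f, 2))  # 0.59
```

Under these assumptions, roughly 59% of the software-only run time would have been spent in polynomial multiplication, and the remaining unaccelerated 41% caps the total speed-up at about 2.4 however fast the accelerator becomes.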
Resource Utilization
Bar chart (axis from 20,000 to 100,000). Categories: RTL ntru-pke-443, HLS ntru-pke-443, RTL ntru-pke-743, HLS ntru-pke-743. Series: LUTs, FFs, Slices, BRAMs.
Data labels in extracted order: 44,257, 51,953, 76,972, 95,329; 29,655, 49,293, 49,674, 82,221; 7,802, 9,413, 11,425, 16,686; 1, 1, 1, 1
Q&A