High-speed Instruction-set Coprocessor for Lattice-based Key Encapsulation Mechanisms: Saber in Hardware

  1. High-speed Instruction-set Coprocessor for Lattice-based Key Encapsulation Mechanisms: Saber in Hardware
  Sujoy Sinha Roy and Andrea Basso, CHES 2020

  2. Motivation
  Saber is (now) a round 3 finalist in the NIST PQC standardization process. NIST [MAA+20] reported that “SABER is one of the most promising KEM schemes to be considered for standardization at the end of the third round.”
  Saber’s unique design choices:
  • Different implementation approaches from other lattice-based protocols
  • Non-NTT-based polynomial multipliers

  3. The Saber protocol [DKSRV18]
  Key Generation:
  • seed_A ← random()
  • A = gen(seed_A)
  • s ← small_vec()
  • b = ⌊ (p/q) · Aᵀ·s ⌉
  • Public key: (seed_A, b); secret key: s
  Encryption of a message m:
  • A = gen(seed_A)
  • s′ ← small_vec()
  • b′ = ⌊ (p/q) · A·s′ ⌉
  • c_m = ⌊ (T/p) · bᵀ·s′ + (T/2) · m ⌉
  • Ciphertext: (b′, c_m)
  Decryption:
  • v = b′ᵀ·s
  • m = ⌊ (2/p) · (v − (p/T) · c_m) ⌉
  Key Encapsulation Mechanism: Saber.KEM is obtained via the Fujisaki-Okamoto (FO) transform. Implementation-wise, the FO transform consists mainly of SHA/SHAKE calls.
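
  To make the dataflow above concrete, the following is a minimal Python sketch of the simplified (IND-CPA) Saber PKE as written on this slide. It is an illustration under several assumptions: schoolbook negacyclic multiplication, the rank-3 Saber parameters (N = 256, q = 2^13, p = 2^10, T = 2^4), secrets drawn uniformly from [−4, 4] instead of a centered binomial distribution, and Python's random module in place of SHAKE-based expansion. It shows the arithmetic only; it is not the specified, constant-time scheme.

    import random

    N = 256                     # ring Z_q[x] / (x^N + 1)
    L = 3                       # module rank for Saber
    q, p, T = 2**13, 2**10, 2**4
    ETA = 4                     # secret coefficients in [-ETA, ETA]

    def poly_mul(a, b, mod):
        """Schoolbook negacyclic multiplication in Z_mod[x]/(x^N + 1)."""
        acc = [0] * N
        for i in range(N):
            for j in range(N):
                k = i + j
                if k < N:
                    acc[k] = (acc[k] + a[i] * b[j]) % mod
                else:                      # x^N = -1, hence the sign flip
                    acc[k - N] = (acc[k - N] - a[i] * b[j]) % mod
        return acc

    def matvec(A, s, mod, transpose=False):
        """Matrix-vector product over the polynomial ring."""
        out = []
        for i in range(L):
            acc = [0] * N
            for j in range(L):
                prod = poly_mul(A[j][i] if transpose else A[i][j], s[j], mod)
                acc = [(x + y) % mod for x, y in zip(acc, prod)]
            out.append(acc)
        return out

    def rnd(x, frm, to):
        """Round one coefficient from modulus frm to modulus to: floor(x*to/frm + 1/2)."""
        return ((x * to + frm // 2) // frm) % to

    def gen(seed):
        rng = random.Random(seed)
        return [[[rng.randrange(q) for _ in range(N)] for _ in range(L)] for _ in range(L)]

    def small_vec():
        return [[random.randint(-ETA, ETA) for _ in range(N)] for _ in range(L)]

    def keygen():
        seed_A = random.getrandbits(256)
        A, s = gen(seed_A), small_vec()
        b = [[rnd(c, q, p) for c in poly] for poly in matvec(A, s, q, transpose=True)]
        return (seed_A, b), s                     # public key, secret key

    def encrypt(pk, m):                           # m is a list of N message bits
        seed_A, b = pk
        A, sp = gen(seed_A), small_vec()
        bp = [[rnd(c, q, p) for c in poly] for poly in matvec(A, sp, q)]
        v = [0] * N                               # v = b^T * s'  (mod p)
        for bi, si in zip(b, sp):
            v = [(x + y) % p for x, y in zip(v, poly_mul(bi, si, p))]
        cm = [rnd((vi + (p // 2) * mi) % p, p, T) for vi, mi in zip(v, m)]
        return bp, cm

    def decrypt(sk, ct):
        bp, cm = ct
        v = [0] * N                               # v = b'^T * s  (mod p)
        for bi, si in zip(bp, sk):
            v = [(x + y) % p for x, y in zip(v, poly_mul(bi, si, p))]
        return [rnd((vi - (p // T) * ci) % p, p, 2) for vi, ci in zip(v, cm)]

    # Round trip succeeds except with Saber's (negligible) decryption-failure probability.
    pk, sk = keygen()
    msg = [random.randint(0, 1) for _ in range(N)]
    assert decrypt(sk, encrypt(pk, msg)) == msg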

  4. Performance bottlenecks
  The majority of the computation involves:
  1. SHA/SHAKE
     – 70–80% of the computation in software
     – Keccak is very fast in hardware
     – High-speed implementation by the Keccak team
     – The SHA(KE) calls in Saber are serialized −→ one Keccak core suffices
  2. Polynomial multiplication
     – The main focus of this work

  5. Polynomial multiplication in Saber
  The main characteristics:
  • Module-LWR
     – Different module ranks for the different security levels
     – All polynomials have degree 255
  • Small secrets
     – Secret polynomial coefficients in [−3, 3], [−4, 4] or [−5, 5]
  • Power-of-two moduli
     – Multiplication modulo 2^13 or 2^10
     – Free modular reduction (see the one-line check below)
     – No NTT
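
  A one-line check of the "free modular reduction" bullet: reducing modulo a power of two just keeps the low bits, so 13-bit registers give the reduction for free (Python sketch, for illustration only).

    # Reduction modulo 2^13 is a bit mask: keep the 13 least significant bits.
    q = 2**13
    x = 123456789
    assert x % q == x & (q - 1)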

  6. Our polynomial multiplication approach: the alternatives to the NTT
  The Number Theoretic Transform (NTT) requires the modulus to be prime, which Saber’s power-of-two moduli are not.
  • In software: improved Toom-Cook multiplication ([BMKV20], also at CHES 2020)
  • In hardware: Toom-Cook and Karatsuba are less convenient because they are recursive, whereas hardware allows high parallelism and ad-hoc solutions ⇒ schoolbook algorithm

  7. The schoolbook algorithm
  Algorithm (schoolbook multiplication; a Python transcription follows below):
      acc(x) ← 0
      for i = 0; i < 256; i++ do
          for j = 0; j < 256; j++ do
              acc[j] = acc[j] + b[j] · a[i]
          b = b · x mod ⟨x^256 + 1⟩        (negacyclic shift)
      return acc
  Advantages:
  • Simple implementation
  • High flexibility
  • Great performance
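
  The following is a direct Python transcription of the pseudocode above (a sketch for clarity, not the hardware itself). The operand b is shifted negacyclically after every outer iteration, so each pass of the inner loop is a plain coefficient-wise multiply-accumulate, which is what the 256 parallel MAC units of the next slides can compute in a single cycle.

    # Schoolbook negacyclic multiplication in Z_mod[x]/(x^n + 1):
    # accumulate a[i] * (b(x) * x^i) for i = 0 .. n-1.
    def schoolbook_negacyclic(a, b, n=256, mod=2**13):
        acc = [0] * n
        b = list(b)                                  # working copy that gets shifted
        for i in range(n):
            for j in range(n):                       # done by the 256 MACs in one cycle
                acc[j] = (acc[j] + b[j] * a[i]) % mod
            b = [-b[-1] % mod] + b[:-1]              # b = b * x mod (x^n + 1)
        return acc

    # Tiny usage example with n = 4:
    # (1 + x)(1 + x^3) = 1 + x + x^3 + x^4 = x + x^3 mod (x^4 + 1)
    assert schoolbook_negacyclic([1, 1, 0, 0], [1, 0, 0, 1], n=4) == [0, 1, 0, 1]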

  8. Multiply and ACcumulate (MAC) units
  How to compute the coefficient-wise multiply-accumulate operations:
  • Small secrets −→ the multiplication reduces to bit-shifts and additions
  • Power-of-two moduli −→ no modular reduction logic
  ⇓
  A MAC unit requires little area (50 LUTs), so we use 256 MACs in parallel.
  [Diagram: one MAC unit with inputs s[i] and a[j], accumulating into acc[i].]
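
  A sketch of one MAC step, to illustrate the "bit-shift and add" bullet (one possible decomposition, assumed for illustration; the paper's exact circuit may differ). The secret coefficient lies in [−5, 5], so its magnitude fits in three bits, the product is at most a few shifted copies of the operand plus a conditional negation, and the power-of-two modulus makes the final reduction a simple truncation.

    # One MAC step: acc + a * s (mod 2^13), with s a small signed secret coefficient.
    def mac_step(acc, a, s, mod=2**13):
        mag = abs(s)                       # 0 .. 5 across the Saber variants
        prod = 0
        if mag & 1: prod += a              # weight-1 bit of |s|
        if mag & 2: prod += a << 1         # weight-2 bit of |s|
        if mag & 4: prod += a << 2         # weight-4 bit of |s|
        if s < 0:
            prod = -prod                   # conditional negation for the sign
        return (acc + prod) % mod          # free reduction: keep the low 13 bits

    assert mac_step(100, 7, -3) == (100 - 21) % 2**13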

  9. The polynomial multiplier
  [Block diagram: a BRAM feeds the polynomial buffer; a coefficient selector picks one 4-bit coefficient of the small secret polynomial at a time; 256 MAC units operate in parallel on the buffer contents and accumulate into the polynomial accumulator.]
  Performance: a full polynomial multiplication can be computed in 256 cycles!

  10. The full architecture
  An instruction-set coprocessor architecture.
  [Block diagram: the building blocks (Polynomial Vector-Vector Multiplier; SHA3-256/SHA3-512/SHAKE128; Binomial Sampler; AddPack; AddRound; Verify; CMOV; CopyWords; …) are connected through a communication bus, managed by the Controller and Bus Manager, to the Data Memory (Block RAM), the Program Memory, and the data input/output.]
  Advantages:
  • Modularity ⇒ generic framework ⇒ other protocols
  • Programmability
  Disadvantages:
  • No parallelism

  11. Design extendability
  Unified architecture for all parameter sets:
  • LightSaber
  • Saber
  • FireSaber
  Performance/area trade-offs (see the sketch below):
  • 512 multipliers
  • ∼20% improvement in speed
  [Diagram: the MAC unit extended from one to two multipliers, processing s[i] · a[j] and s[i−1] · a[j+1] in the same cycle.]
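
  One plausible reading of the 512-multiplier trade-off (an assumption for illustration; the paper's exact schedule may differ): with twice the multipliers the datapath can consume two broadcast coefficients per step, roughly halving the multiplication latency, while the overall speed-up stays around 20% because Keccak and the other operations are unchanged. A sketch of such a schedule:

    # Hypothetical two-coefficients-per-step variant of the schoolbook loop:
    # each outer step accumulates a[i] * b and a[i+1] * (b * x) at once.
    def schoolbook_two_per_step(a, b, n=256, mod=2**13):
        acc = [0] * n
        b = list(b)
        for i in range(0, n, 2):
            b_next = [-b[-1] % mod] + b[:-1]             # b * x mod (x^n + 1)
            for j in range(n):                           # 2 * 256 = 512 multipliers
                acc[j] = (acc[j] + b[j] * a[i] + b_next[j] * a[i + 1]) % mod
            b = [-b_next[-1] % mod] + b_next[:-1]        # advance b by x^2 in total
        return acc

    # Same n = 4 example as before: (1 + x)(1 + x^3) = x + x^3 mod (x^4 + 1)
    assert schoolbook_two_per_step([1, 1, 0, 0], [1, 0, 0, 1], n=4) == [0, 1, 0, 1]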

  12. Performance results
  Running on an UltraScale+ XCZU9EG-2FFVB1156 FPGA:

  |              | Key Generation | Encapsulation | Decapsulation |
  | Total cycles | 5,453          | 6,618         | 8,034         |
  | Total time   | 21.8 µs        | 26.5 µs       | 32.1 µs       |
  | Throughput   | 45,872 op/s    | 37,776 op/s   | 31,118 op/s   |

  [Chart: the cycle counts break down into polynomial multiplication, Keccak computations and other operations.]
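
  A quick sanity check of these numbers, assuming the 250 MHz clock reported in the comparison table later in the deck: time = cycles / f_clk and throughput = 1 / time.

    # Cycles -> time -> throughput at a 250 MHz clock (frequency taken from the
    # comparison slide; this recomputation is only a cross-check of the table).
    f_clk = 250e6
    for name, cycles in [("KeyGen", 5453), ("Encaps", 6618), ("Decaps", 8034)]:
        t = cycles / f_clk
        print(f"{name}: {t * 1e6:.1f} us, {1 / t:,.0f} op/s")
    # KeyGen: 21.8 us, 45,846 op/s   (the slide's 45,872 op/s equals 1 / 21.8 us)
    # Encaps: 26.5 us, 37,776 op/s
    # Decaps: 32.1 us, 31,118 op/s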

  13. Area results
  Running on an UltraScale+ XCZU9EG-2FFVB1156 FPGA:

  |                 | LUTs   | Flip-flops | DSPs | BRAM tiles |
  | Total           | 23,686 | 9,805      | 0    | 2          |
  | Utilization (%) | 8.6%   | 1.8%       | 0%   | 0.2%       |

  It is possible to fit 11 coprocessors on the device, achieving a throughput of 504k / 416k / 342k op/s (key generation / encapsulation / decapsulation).

  14. Comparisons to other work
  Times are in µs.

  | Implementation     | Platform     | Key  | Encaps | Decaps | Freq. (MHz) | LUT  | FF   | DSP | BRAM |
  | Kyber [DFA+20]     | Virtex-7     | -    | 17.1   | 23.3   | 245         | 14k  | 11k  | 8   | 14   |
  | NewHope [ZYC+20]   | Artix-7      | 40   | 62.5   | 24     | 200         | 6.8k | 4.4k | 2   | 8    |
  | FrodoKEM [HOKG18]  | Artix-7      | 45K  | 45K    | 47K    | 167         | 7.7K | 3.5K | 1   | 24   |
  | SIKE [MLRB20]      | Virtex-7*    | 8K   | 14K    | 15K    | 142         | 21K  | 14K  | 162 | 38   |
  | Saber [BMTK+20]    | Artix-7*     | 3K   | 4K     | 3K     | 125         | 7.4K | 7.3K | 28  | 2    |
  | Saber [DFAG19]     | UltraScale+* | -    | 60     | 65     | 322         | 13K  | 12K  | 256 | 4    |
  | Saber [this work]  | UltraScale+  | 21.8 | 26.5   | 32.1   | 250         | 24K  | 10K  | 0   | 2    |

  *: HW/SW codesign

  15. Future work
  Other protocols:
  • Kyber and other lattice-based schemes
  • Signature schemes?
  Lightweight implementation:
  • Fewer multipliers
  Side-channel resistance:
  • Masked implementation
  • Handling of small coefficients
