High-speed Instruction-set Coprocessor for Lattice-based Key Encapsulation Mechanisms: Saber in Hardware

  1. High-speed Instruction-set Coprocessor for Lattice-based Key Encapsulation Mechanisms: Saber in Hardware
  Sujoy Sinha Roy and Andrea Basso, CHES 2020

  2. Motivation
  Saber is (now) a round 3 finalist in the NIST PQC standardization process. NIST [MAA+20] reported that “SABER is one of the most promising KEM schemes to be considered for standardization at the end of the third round.”
  Saber’s unique design choices:
  • Different implementation approaches from other lattice-based protocols
  • Non-NTT-based polynomial multipliers

  3. The Saber protocol [DKSRV18]
  Key Generation:
  • seed_A ← random()
  • A = gen(seed_A)
  • s ← small_vec()
  • b = ⌊ (p/q) · Aᵀ·s ⌉
  • Public key: (seed_A, b); secret key: s
  Encryption of a message m:
  • A = gen(seed_A)
  • s′ ← small_vec()
  • b′ = ⌊ (p/q) · A·s′ ⌉
  • c_m = ⌊ (T/p) · bᵀ·s′ + (T/2) · m ⌉
  • Ciphertext: (b′, c_m)
  Decryption:
  • v = b′ᵀ·s
  • m = ⌊ (2/p) · (v − (p/T) · c_m) ⌉
  Key Encapsulation Mechanism: Saber.KEM is obtained via the Fujisaki-Okamoto (FO) transform. Implementation-wise, the FO transform consists mainly of SHA/SHAKE calls.
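
  To make the dataflow above concrete, the following is a minimal Python sketch of the simplified (IND-CPA) Saber PKE as written on this slide. It is an illustration under several assumptions: schoolbook negacyclic multiplication, the rank-3 Saber parameters (N = 256, q = 2^13, p = 2^10, T = 2^4), secrets drawn uniformly from [−4, 4] instead of a centered binomial distribution, and Python's random module in place of SHAKE-based expansion. It shows the arithmetic only; it is not the specified, constant-time scheme.

    import random

    N = 256                     # ring Z_q[x] / (x^N + 1)
    L = 3                       # module rank for Saber
    q, p, T = 2**13, 2**10, 2**4
    ETA = 4                     # secret coefficients in [-ETA, ETA]

    def poly_mul(a, b, mod):
        """Schoolbook negacyclic multiplication in Z_mod[x]/(x^N + 1)."""
        acc = [0] * N
        for i in range(N):
            for j in range(N):
                k = i + j
                if k < N:
                    acc[k] = (acc[k] + a[i] * b[j]) % mod
                else:                      # x^N = -1, hence the sign flip
                    acc[k - N] = (acc[k - N] - a[i] * b[j]) % mod
        return acc

    def matvec(A, s, mod, transpose=False):
        """Matrix-vector product over the polynomial ring."""
        out = []
        for i in range(L):
            acc = [0] * N
            for j in range(L):
                prod = poly_mul(A[j][i] if transpose else A[i][j], s[j], mod)
                acc = [(x + y) % mod for x, y in zip(acc, prod)]
            out.append(acc)
        return out

    def rnd(x, frm, to):
        """Round one coefficient from modulus frm to modulus to: floor(x*to/frm + 1/2)."""
        return ((x * to + frm // 2) // frm) % to

    def gen(seed):
        rng = random.Random(seed)
        return [[[rng.randrange(q) for _ in range(N)] for _ in range(L)] for _ in range(L)]

    def small_vec():
        return [[random.randint(-ETA, ETA) for _ in range(N)] for _ in range(L)]

    def keygen():
        seed_A = random.getrandbits(256)
        A, s = gen(seed_A), small_vec()
        b = [[rnd(c, q, p) for c in poly] for poly in matvec(A, s, q, transpose=True)]
        return (seed_A, b), s                     # public key, secret key

    def encrypt(pk, m):                           # m is a list of N message bits
        seed_A, b = pk
        A, sp = gen(seed_A), small_vec()
        bp = [[rnd(c, q, p) for c in poly] for poly in matvec(A, sp, q)]
        v = [0] * N                               # v = b^T * s'  (mod p)
        for bi, si in zip(b, sp):
            v = [(x + y) % p for x, y in zip(v, poly_mul(bi, si, p))]
        cm = [rnd((vi + (p // 2) * mi) % p, p, T) for vi, mi in zip(v, m)]
        return bp, cm

    def decrypt(sk, ct):
        bp, cm = ct
        v = [0] * N                               # v = b'^T * s  (mod p)
        for bi, si in zip(bp, sk):
            v = [(x + y) % p for x, y in zip(v, poly_mul(bi, si, p))]
        return [rnd((vi - (p // T) * ci) % p, p, 2) for vi, ci in zip(v, cm)]

    # Round trip succeeds except with Saber's (negligible) decryption-failure probability.
    pk, sk = keygen()
    msg = [random.randint(0, 1) for _ in range(N)]
    assert decrypt(sk, encrypt(pk, msg)) == msg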

  4. Performance bottlenecks
  The majority of the computation involves:
  1. SHA/SHAKE
     – 70–80% of the computation in software
     – Keccak is very fast in hardware
     – High-speed implementation by the Keccak team
     – The SHA(KE) calls in Saber are serialized −→ one Keccak core suffices
  2. Polynomial multiplication
     – The main focus of this work

  5. Polynomial multiplication in Saber
  The main characteristics:
  • Module-LWR
     – Different module ranks for the different security levels
     – All polynomials have degree 255
  • Small secrets
     – Secret polynomial coefficients in [−3, 3], [−4, 4] or [−5, 5]
  • Power-of-two moduli
     – Multiplication modulo 2^13 or 2^10
     – Free modular reduction (see the one-line check below)
     – No NTT
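
  A one-line check of the "free modular reduction" bullet: reducing modulo a power of two just keeps the low bits, so 13-bit registers give the reduction for free (Python sketch, for illustration only).

    # Reduction modulo 2^13 is a bit mask: keep the 13 least significant bits.
    q = 2**13
    x = 123456789
    assert x % q == x & (q - 1)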

  6. Our polynomial multiplication approach: the alternatives to the NTT
  The Number Theoretic Transform (NTT) requires the modulus to be prime, which Saber’s power-of-two moduli are not.
  • In software: improved Toom-Cook multiplication ([BMKV20], also at CHES 2020)
  • In hardware: Toom-Cook and Karatsuba are less convenient because they are recursive, whereas hardware allows high parallelism and ad-hoc solutions ⇒ schoolbook algorithm

  7. The schoolbook algorithm
  Algorithm (schoolbook multiplication; a Python transcription follows below):
      acc(x) ← 0
      for i = 0; i < 256; i++ do
          for j = 0; j < 256; j++ do
              acc[j] = acc[j] + b[j] · a[i]
          b = b · x mod ⟨x^256 + 1⟩        (negacyclic shift)
      return acc
  Advantages:
  • Simple implementation
  • High flexibility
  • Great performance
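
  The following is a direct Python transcription of the pseudocode above (a sketch for clarity, not the hardware itself). The operand b is shifted negacyclically after every outer iteration, so each pass of the inner loop is a plain coefficient-wise multiply-accumulate, which is what the 256 parallel MAC units of the next slides can compute in a single cycle.

    # Schoolbook negacyclic multiplication in Z_mod[x]/(x^n + 1):
    # accumulate a[i] * (b(x) * x^i) for i = 0 .. n-1.
    def schoolbook_negacyclic(a, b, n=256, mod=2**13):
        acc = [0] * n
        b = list(b)                                  # working copy that gets shifted
        for i in range(n):
            for j in range(n):                       # done by the 256 MACs in one cycle
                acc[j] = (acc[j] + b[j] * a[i]) % mod
            b = [-b[-1] % mod] + b[:-1]              # b = b * x mod (x^n + 1)
        return acc

    # Tiny usage example with n = 4:
    # (1 + x)(1 + x^3) = 1 + x + x^3 + x^4 = x + x^3 mod (x^4 + 1)
    assert schoolbook_negacyclic([1, 1, 0, 0], [1, 0, 0, 1], n=4) == [0, 1, 0, 1]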

  8. Multiply and ACcumulate (MAC) units
  How to compute the coefficient-wise multiply-accumulate operations:
  • Small secrets −→ the multiplication reduces to bit-shifts and additions
  • Power-of-two moduli −→ no modular reduction logic
  ⇓
  A MAC unit requires little area (50 LUTs), so we use 256 MACs in parallel.
  [Diagram: one MAC unit with inputs s[i] and a[j], accumulating into acc[i].]
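
  A sketch of one MAC step, to illustrate the "bit-shift and add" bullet (one possible decomposition, assumed for illustration; the paper's exact circuit may differ). The secret coefficient lies in [−5, 5], so its magnitude fits in three bits, the product is at most a few shifted copies of the operand plus a conditional negation, and the power-of-two modulus makes the final reduction a simple truncation.

    # One MAC step: acc + a * s (mod 2^13), with s a small signed secret coefficient.
    def mac_step(acc, a, s, mod=2**13):
        mag = abs(s)                       # 0 .. 5 across the Saber variants
        prod = 0
        if mag & 1: prod += a              # weight-1 bit of |s|
        if mag & 2: prod += a << 1         # weight-2 bit of |s|
        if mag & 4: prod += a << 2         # weight-4 bit of |s|
        if s < 0:
            prod = -prod                   # conditional negation for the sign
        return (acc + prod) % mod          # free reduction: keep the low 13 bits

    assert mac_step(100, 7, -3) == (100 - 21) % 2**13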

  9. The polynomial multiplier
  [Block diagram: a BRAM feeds the polynomial buffer; a coefficient selector picks one 4-bit coefficient of the small secret polynomial at a time; 256 MAC units operate in parallel on the buffer contents and accumulate into the polynomial accumulator.]
  Performance: a full polynomial multiplication can be computed in 256 cycles!

  10. The full architecture
  An instruction-set coprocessor architecture.
  [Block diagram: the building blocks (Polynomial Vector-Vector Multiplier; SHA3-256/SHA3-512/SHAKE128; Binomial Sampler; AddPack; AddRound; Verify; CMOV; CopyWords; …) are connected through a communication bus, managed by the Controller and Bus Manager, to the Data Memory (Block RAM), the Program Memory, and the data input/output.]
  Advantages:
  • Modularity ⇒ generic framework ⇒ other protocols
  • Programmability
  Disadvantages:
  • No parallelism

  11. Design extendability
  Unified architecture for all parameter sets:
  • LightSaber
  • Saber
  • FireSaber
  Performance/area trade-offs (see the sketch below):
  • 512 multipliers
  • ∼20% improvement in speed
  [Diagram: the MAC unit extended from one to two multipliers, processing s[i] · a[j] and s[i−1] · a[j+1] in the same cycle.]
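
  One plausible reading of the 512-multiplier trade-off (an assumption for illustration; the paper's exact schedule may differ): with twice the multipliers the datapath can consume two broadcast coefficients per step, roughly halving the multiplication latency, while the overall speed-up stays around 20% because Keccak and the other operations are unchanged. A sketch of such a schedule:

    # Hypothetical two-coefficients-per-step variant of the schoolbook loop:
    # each outer step accumulates a[i] * b and a[i+1] * (b * x) at once.
    def schoolbook_two_per_step(a, b, n=256, mod=2**13):
        acc = [0] * n
        b = list(b)
        for i in range(0, n, 2):
            b_next = [-b[-1] % mod] + b[:-1]             # b * x mod (x^n + 1)
            for j in range(n):                           # 2 * 256 = 512 multipliers
                acc[j] = (acc[j] + b[j] * a[i] + b_next[j] * a[i + 1]) % mod
            b = [-b_next[-1] % mod] + b_next[:-1]        # advance b by x^2 in total
        return acc

    # Same n = 4 example as before: (1 + x)(1 + x^3) = x + x^3 mod (x^4 + 1)
    assert schoolbook_two_per_step([1, 1, 0, 0], [1, 0, 0, 1], n=4) == [0, 1, 0, 1]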

  12. Performance results
  Running on an UltraScale+ XCZU9EG-2FFVB1156 FPGA:

  |              | Key Generation | Encapsulation | Decapsulation |
  | Total cycles | 5,453          | 6,618         | 8,034         |
  | Total time   | 21.8 µs        | 26.5 µs       | 32.1 µs       |
  | Throughput   | 45,872 op/s    | 37,776 op/s   | 31,118 op/s   |

  [Chart: the cycle counts break down into polynomial multiplication, Keccak computations and other operations.]
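
  A quick sanity check of these numbers, assuming the 250 MHz clock reported in the comparison table later in the deck: time = cycles / f_clk and throughput = 1 / time.

    # Cycles -> time -> throughput at a 250 MHz clock (frequency taken from the
    # comparison slide; this recomputation is only a cross-check of the table).
    f_clk = 250e6
    for name, cycles in [("KeyGen", 5453), ("Encaps", 6618), ("Decaps", 8034)]:
        t = cycles / f_clk
        print(f"{name}: {t * 1e6:.1f} us, {1 / t:,.0f} op/s")
    # KeyGen: 21.8 us, 45,846 op/s   (the slide's 45,872 op/s equals 1 / 21.8 us)
    # Encaps: 26.5 us, 37,776 op/s
    # Decaps: 32.1 us, 31,118 op/s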

  13. Area results
  Running on an UltraScale+ XCZU9EG-2FFVB1156 FPGA:

  |                 | LUTs   | Flip-flops | DSPs | BRAM tiles |
  | Total           | 23,686 | 9,805      | 0    | 2          |
  | Utilization (%) | 8.6%   | 1.8%       | 0%   | 0.2%       |

  It is possible to fit 11 coprocessors on the device, achieving a throughput of 504k / 416k / 342k op/s (key generation / encapsulation / decapsulation).

  14. Comparisons to other work
  Times are in µs.

  | Implementation     | Platform     | Key  | Encaps | Decaps | Freq. (MHz) | LUT  | FF   | DSP | BRAM |
  | Kyber [DFA+20]     | Virtex-7     | -    | 17.1   | 23.3   | 245         | 14k  | 11k  | 8   | 14   |
  | NewHope [ZYC+20]   | Artix-7      | 40   | 62.5   | 24     | 200         | 6.8k | 4.4k | 2   | 8    |
  | FrodoKEM [HOKG18]  | Artix-7      | 45K  | 45K    | 47K    | 167         | 7.7K | 3.5K | 1   | 24   |
  | SIKE [MLRB20]      | Virtex-7*    | 8K   | 14K    | 15K    | 142         | 21K  | 14K  | 162 | 38   |
  | Saber [BMTK+20]    | Artix-7*     | 3K   | 4K     | 3K     | 125         | 7.4K | 7.3K | 28  | 2    |
  | Saber [DFAG19]     | UltraScale+* | -    | 60     | 65     | 322         | 13K  | 12K  | 256 | 4    |
  | Saber [this work]  | UltraScale+  | 21.8 | 26.5   | 32.1   | 250         | 24K  | 10K  | 0   | 2    |

  *: HW/SW codesign

  15. Future work
  Other protocols:
  • Kyber and other lattice-based schemes
  • Signature schemes?
  Lightweight implementation:
  • Fewer multipliers
  Side-channel resistance:
  • Masked implementation
  • Handling of small coefficients
