Parameterized Hardware Accelerators for Lattice-Based Cryptography

Parameterized Hardware Accelerators for Lattice-Based Cryptography and Their Application to the HW/SW Co-Design of qTESLA. Wen Wang, Shanquan Tian, Bernhard Jungk, Nina Bindel, Patrick Longa, and Jakub Szefer. CHES 2020, September 14, 2020.


SLIDE 1

Parameterized Hardware Accelerators for Lattice-Based Cryptography and Their Application to the HW/SW Co-Design of qTESLA

Wen Wang, Shanquan Tian, Bernhard Jungk, Nina Bindel, Patrick Longa, and Jakub Szefer

CHES 2020, September 14, 2020

SLIDE 2

Outline

  • Yet another hardware design for a lattice-based scheme?
  • qTESLA
  • Hardware blocks
  • Binary-search CDT sampler
  • NTT-based polynomial multiplier
  • Software-hardware co-design on RISC-V
  • Evaluation

SLIDE 3

Yet another hardware design for a lattice-based scheme?

SLIDE 4

Existing lattice-based hardware designs

Existing designs | Accelerated | Security parameters | Hardware architecture | Lattice-based scheme | Standard IO
-----------------------------------------------------------------------------------------------------------------
Building blocks | Partly | Fixed | Fixed | Specific scheme | N/A
Full hardware design | Fully | Fixed | Fixed | Specific scheme | N/A
Software-hardware co-design | Partly | Fixed | Flexible | Specific scheme | N/A

SLIDE 8

Our new lattice-based hardware design

Existing designs | Accelerated | Security parameters | Hardware architecture | Lattice-based scheme | Standard IO
-----------------------------------------------------------------------------------------------------------------
Building blocks | Partly | Fixed | Fixed | Specific scheme | N/A
Full hardware design | Fully | Fixed | Fixed | Specific scheme | N/A
Software-hardware co-design | Partly | Fixed | Fixed | Specific scheme | N/A
Our new design | Fully | Flexible | Tunable | Universal applicability | Portable

SLIDE 9

Our new lattice-based hardware design

✓ Full acceleration
✓ Flexible security parameters
✓ Tunable hardware architecture
✓ Universal applicability to lattice-based schemes
✓ Portable among different platforms

[Diagram: three accelerators, each with a config. block, attached to a 32/64-bit AMBA bus]

SLIDE 10

qTESLA

SLIDE 15

qTESLA

  • liboqs library
  • Reference C implementation
  • Round 2 submission in PQ standardization
  • BouncyCastle library

See qtesla.org

✓ Secure against classical and quantum adversaries
✓ Implementation security
✓ Simple arithmetic operations
✓ Provably secure parameters

Parameter set | Public key size (B) | Signature size (B)
-----------------------------------------------------
qTESLA-p-I | 14,880 | 2,592
qTESLA-p-III | 38,432 | 5,664

SLIDE 29

qTESLA's sign and verify

Signature generation
Input: sk, m

  • Sample random polynomial y
  • Hash c(sk, y, m)
  • Check to ensure acceptance during verify (✓ / ✗)
  • Compute potential signature z = y + sc
  • Check to ensure security (✓ / ✗)

Output: signature z, c

Signature verification
Input: pk, z, c, m

  • Hash c′(pk, z, c, m)
  • Check c′ = c? (✓ / ✗)
  • Check security property (✓ / ✗)

Output: ✓ or ✗

Simple operations:

  • Sampling
  • Hashing
  • Comparison
  • Multiplication and addition
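The step z = y + sc listed above can be sketched in a few lines. This is only an illustration under assumptions the slide does not state: polynomials are taken in Z_q[x]/(x^n + 1), c is sparse with coefficients +1 or -1 given as positions and signs, and the toy values of n, q, s, y and c are hypothetical. The function names are made up for this sketch.

```python
def sparse_mul(s, pos_list, sign_list, q):
    """Multiply dense polynomial s by sparse c in Z_q[x]/(x^n + 1).

    c = sum(sign_list[k] * x**pos_list[k]), with signs +1 or -1
    (an assumed representation, chosen for illustration).
    """
    n = len(s)
    out = [0] * n
    for pos, sign in zip(pos_list, sign_list):
        for i, coeff in enumerate(s):
            k = i + pos
            term = sign * coeff
            if k >= n:           # x^n = -1: wrap around with a sign flip
                k -= n
                term = -term
            out[k] = (out[k] + term) % q
    return out


def potential_signature(y, s, pos_list, sign_list, q):
    """Compute z = y + s*c coefficient-wise modulo q."""
    sc = sparse_mul(s, pos_list, sign_list, q)
    return [(yi + sci) % q for yi, sci in zip(y, sc)]


# Toy parameters (hypothetical, far smaller than any qTESLA parameter set).
q, n = 17, 4
s = [1, 2, 0, 16]            # 16 represents -1 mod 17
y = [3, 0, 5, 1]
c_pos, c_sign = [0, 3], [1, -1]   # c = 1 - x^3

z = potential_signature(y, s, c_pos, c_sign, q)
print(z)
```

Because c has only a few nonzero coefficients, the work is proportional to the number of nonzero positions times n, which is one plausible reason a dedicated sparse multiplier is cheaper than a general one.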

SLIDE 30

Hardware blocks for lattice-based schemes

SLIDE 32

Lattice-based hardware blocks

  • A unified hardware core for both SHAKE-128/256 and cSHAKE-128/256
  • A novel, parameterized binary-search CDT sampler in hardware
  • A novel, fully pipelined NTT-based polynomial multiplier
  • A parameterized sparse polynomial multiplier
  • A lightweight Hmax-Sum module

qTESLA operations: key generation, signing, verification

Respective subroutines (% of runtime):

  • Gauss sampler (4.5%)
  • Hash function (39.4%)
  • Poly. multiplication (27.9%)
  • Sparse poly. multiplication (6.3%)

SLIDE 34

A new binary-search CDT sampler

Input: a random number $x$ of precision $\beta$ generated by a PRNG
Output: a signed Gaussian sample $s$

  • Pre-computed CDT table of depth $t$ and width $\beta$
  • Split the CDT into two power-of-two parts, with the ending index of the first part and the starting index of the second part as: $end_1 = 2^{\lceil \log_2 t \rceil - 1} - 1$, $start_2 = t - 2^{\lceil \log_2 t \rceil - 1}$

If $x < \mathrm{CDT}[end_1 + 1]$ then $min = 0$, $max = end_1 + 1$
Else $min = start_2$, $max = t$
While $min + 1 \neq max$ do
    if $x < \mathrm{CDT}[(min + max)/2]$ then $max = (min + max)/2$
    else $min = (min + max)/2$
Return $s = \mathrm{MSB}(x)\ ?\ -min : min$
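The algorithm on this slide can be sketched in software as follows. This is a minimal illustration, not the hardware design: the table `TOY_CDT` is a hypothetical increasing table with first entry 0 (not a real Gaussian CDT), `BETA` is a toy precision, and masking off the sign bit before the comparison is an assumption of this sketch (the slide compares x directly). The function names are illustrative.

```python
def cdt_binary_search(x, cdt):
    """Return lo with cdt[lo] <= x < cdt[lo + 1], treating cdt[len(cdt)] as infinity.

    Requires len(cdt) >= 2 and cdt[0] == 0. The table is split into two
    overlapping power-of-two parts, so the loop always runs the same
    number of iterations regardless of x.
    """
    t = len(cdt)
    half = 1 << ((t - 1).bit_length() - 1)   # 2^(ceil(log2 t) - 1)
    end1, start2 = half - 1, t - half
    if x < cdt[end1 + 1]:
        lo, hi = 0, end1 + 1
    else:
        lo, hi = start2, t
    while lo + 1 != hi:
        mid = (lo + hi) // 2
        if x < cdt[mid]:
            hi = mid
        else:
            lo = mid
    return lo


def cdt_sample(x, cdt, beta):
    """Map a beta-bit random integer x to a signed sample.

    Assumption of this sketch: the MSB of x is the sign and the remaining
    beta - 1 bits are compared against the table.
    """
    sign = (x >> (beta - 1)) & 1
    magnitude_bits = x & ((1 << (beta - 1)) - 1)
    idx = cdt_binary_search(magnitude_bits, cdt)
    return -idx if sign else idx


# Hypothetical toy table (depth t = 6) for 7-bit comparison values.
TOY_CDT = [0, 50, 90, 112, 122, 126]
BETA = 8

print([cdt_sample(x, TOY_CDT, BETA) for x in (48, 72, 127, 200, 255)])
```

Splitting into two equally sized power-of-two ranges is what makes the iteration count independent of the input, which fits the constant-time property claimed for the sampler.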

SLIDE 36

A new binary-search CDT sampler

[Block diagram of the BinSearch module. Labels: size-β Comparator, cur, min, CDT Table (β×d) in BRAM/ROM]

SLIDE 37

A new binary-search CDT sampler

[Block diagram of the GaussSampler. Labels: cSHAKE, PRNG, IO FSM, I/O FSM, Control Logic, SW-HW Bridge, BinSearch, size-β Comparator, cur, min, CDT Table (β×d) in BRAM/ROM]

SLIDE 38

A new binary-search CDT sampler

[Block diagram of the GaussSampler, as on slide 37]

✓ Parameterized: standard deviation σ, targeted precision β, tail-cut τ, batch size
✓ Wide applicability (small σ)
✓ Constant-time
✓ Lightweight
✓ Standard interface

SLIDE 42

Performance

Design | Batch size | Device | Total cycles | PRNG cycles | Slices | LUTs | FFs | RAMs | Fmax (MHz)
----------------------------------------------------------------------------------------------------
Ours (qTESLA-p-I) | 512 | Artix-7 | 19,046 | 18,948 | 113 | 278 | 295 | 2.5 | 131
Ours (qTESLA-p-I) | 1024 | Artix-7 | 18,451 | 18,370 | 118 | 279 | 296 | 2.5 | 134
Ours (qTESLA-p-III) | 512 | Artix-7 | 83,040 | 82,952 | 217 | 485 | 487 | 4.5 | 128
Ours (qTESLA-p-III) | 1024 | Artix-7 | 81,904 | 81,860 | 191 | 450 | 487 | 4.5 | 123
Ours (qTESLA-p-III) | 2048 | Artix-7 | 81,335 | 81,314 | 191 | 470 | 490 | 4.5 | 123
Ours | 512 | Artix-7 | 9,506 | 9,474 | 114 | 268 | 283 | 2.5 | 101
[HKR+18] | 512 | Virtex-7 | 2,560 | (w/o PRNG) | 15 | 53 | 17 | 1 | 193
[TWS19] | 512 | Artix-7 | 50,700 | (w/o PRNG) | − | 893 | 796 | 3 | 113

[HKR+18]: Howe, James, et al. "On practical discrete Gaussian samplers for lattice-based cryptography."
[TWS19]: Tian, Shanquan, et al. "Merge-exchange sort based discrete Gaussian sampler with fixed memory access pattern."

✓ Parameterized
✓ Lightweight
✓ Computation cycles perfectly hidden by PRNG
✓ Cryptographically strong cSHAKE
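The claim that the computation cycles are hidden by the PRNG can be checked against the table with a little arithmetic. The sketch below copies the total and PRNG cycle counts of the qTESLA rows and reports the PRNG share; reading the difference as the non-overlapped remainder is an interpretation made here, not a statement from the slide. The names `rows`, `prng_share` and `non_prng_cycles` are illustrative.

```python
# (total cycles, PRNG cycles) for the qTESLA rows of the table above.
rows = {
    ("qTESLA-p-I", 512): (19046, 18948),
    ("qTESLA-p-I", 1024): (18451, 18370),
    ("qTESLA-p-III", 512): (83040, 82952),
    ("qTESLA-p-III", 1024): (81904, 81860),
    ("qTESLA-p-III", 2048): (81335, 81314),
}


def prng_share(total_cycles, prng_cycles):
    """Fraction of the total cycle count attributed to the PRNG."""
    return prng_cycles / total_cycles


def non_prng_cycles(total_cycles, prng_cycles):
    """Cycles in the total that are not PRNG cycles."""
    return total_cycles - prng_cycles


for (params, batch), (total, prng) in rows.items():
    print(f"{params:13s} batch={batch:5d} "
          f"PRNG share={prng_share(total, prng):.4f} "
          f"other cycles={non_prng_cycles(total, prng)}")
```

In every listed configuration the PRNG accounts for more than 99% of the total, which is consistent with the search running in the shadow of random-number generation.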

SLIDE 43

Performance

Related work:

✗ Hard to port to other platforms/applications
No support for standard interface
Sequential search and PRNG, no synchronization between modules
✗ Less cryptographically secure PRNG

SLIDE 45

NTT-based polynomial multiplier

Unified NTT algorithm
  • Forward NTT: Cooley-Tukey (CT) butterfly
  • Inverse NTT: CT butterfly
✓ One hardware module
✗ Extra computations: pre-scaling, bit-reversal, post-scaling

Separated NTT algorithm
  • Forward NTT: CT butterfly
  • Inverse NTT: Gentleman-Sande (GS) butterfly
✓ No extra computations
✗ Two hardware modules

CT-GS NTT algorithm (our approach)
  • Forward NTT and inverse NTT: CT-GS butterfly
✓ No extra computations
✓ One hardware module

SLIDE 47

Memory efficient CT-GS NTT algorithm

Input: $a = \sum_{i=0}^{n-1} a_i x^i \in R_q$, with $a_i \in \mathbb{Z}_q$; pre-computed twiddle factors $W$
Output: $\mathrm{NTT}(a)$ or $\mathrm{NTT}^{-1}(a) \in R_q$

Where two alternatives are joined by "or", the first applies to the forward NTT and the second to the inverse NTT.

$m_0 = n/2$ or $1$; $m_1 = 1/2$ or $2$; $n_0 = 1$ or $0$; $n_1 = n$ or $n/2$
$k = 0$, $j = 0$
For $m = m_0$; $n_0 < m < n_1$; $m = m \cdot m_1$ do
    For $s = 0$; $s < n/2$; $s = j + m/2$ do
        $w = W[k]$
        For $j = s$; $j < s + m/2$; $j = j + 1$ do
            $(t_1, u_1) = (a[j], a[j + m])$; $(t_2, u_2) = (a[j + m \cdot m_1], a[j + m + m \cdot m_1])$
            $r_1 = w \cdot u_1$ or $w \cdot (t_1 - u_1)$; $r_2 = w \cdot u_2$ or $w \cdot (t_2 - u_2)$
            $a[j] = t_1 + r_1$ or $t_1 + u_1$; $a[j + m] = t_1 - r_1$ or $r_1$
            $a[j + m \cdot m_1] = t_2 + r_2$ or $t_2 + u_2$; $a[j + m + m \cdot m_1] = t_2 - r_2$ or $r_2$
            $mem[j] = (a[j], a[j + m \cdot m_1])$; $mem[j + m \cdot m_1] = (a[j + m], a[j + m + m \cdot m_1])$
        $k = k + 1$
// repeat for the last NTT round
Return $a$

  • Unified CT-GS algorithm → One hardware module
  • Efficient memory access scheme → Memory efficiency
  • Fully pipelined architecture

SLIDE 49

NTT-based polynomial multiplier

✓ Parameterized: polynomial length n, modulus q
✓ Wide applicability (q ≡ 1 mod 2n)
✓ Fully pipelined
✓ Constant-time
✓ Standard interface

[Block diagram of the PolyMul module. Labels: SW-HW Bridge, Controller, CT-GS Butterfly Unit, memories mem_x, mem_y, mem_zeta, mem_zetainv (RAM/ROM); inputs x and y, output x·y]
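To make the role of the condition q ≡ 1 mod 2n concrete, here is a small software sketch of NTT-based multiplication in Z_q[x]/(x^n + 1). It is a textbook formulation (Cooley-Tukey butterflies forward, Gentleman-Sande butterflies inverse, twiddle factors stored in bit-reversed order), not the memory-efficient CT-GS algorithm of the slides. It assumes q is prime and n is a power of two; all function names and the toy parameters are hypothetical. It needs Python 3.8 or later for `pow(x, -1, q)`.

```python
def bit_reverse(k, bits):
    """Reverse the lowest `bits` bits of k."""
    return int(format(k, f"0{bits}b")[::-1], 2)


def find_psi(n, q):
    """Find psi with psi^n = -1 mod q (a primitive 2n-th root of unity)."""
    if (q - 1) % (2 * n) != 0:
        raise ValueError("need q = 1 (mod 2n)")
    for g in range(2, q):
        if pow(g, n, q) == q - 1:
            return g
    raise ValueError("no primitive 2n-th root of unity found")


def twiddles(n, q):
    """Powers of psi and psi^-1 in bit-reversed order."""
    psi = find_psi(n, q)
    psi_inv = pow(psi, -1, q)
    bits = n.bit_length() - 1
    fwd = [pow(psi, bit_reverse(k, bits), q) for k in range(n)]
    inv = [pow(psi_inv, bit_reverse(k, bits), q) for k in range(n)]
    return fwd, inv


def ntt(a, fwd, q):
    """Forward negacyclic NTT with Cooley-Tukey butterflies."""
    a, n = list(a), len(a)
    t, m = n, 1
    while m < n:
        t //= 2
        for i in range(m):
            w = fwd[m + i]
            for j in range(2 * i * t, 2 * i * t + t):
                u, v = a[j], a[j + t] * w % q
                a[j], a[j + t] = (u + v) % q, (u - v) % q
        m *= 2
    return a


def intt(a, inv, q):
    """Inverse negacyclic NTT with Gentleman-Sande butterflies."""
    a, n = list(a), len(a)
    t, m = 1, n
    while m > 1:
        h = m // 2
        for i in range(h):
            w = inv[h + i]
            for j in range(2 * i * t, 2 * i * t + t):
                u, v = a[j], a[j + t]
                a[j], a[j + t] = (u + v) % q, (u - v) * w % q
        t *= 2
        m = h
    n_inv = pow(n, -1, q)
    return [x * n_inv % q for x in a]


def poly_mul(x, y, q):
    """Multiply x and y in Z_q[x]/(x^n + 1) via two forward and one inverse NTT."""
    fwd, inv = twiddles(len(x), q)
    xh, yh = ntt(x, fwd, q), ntt(y, fwd, q)
    return intt([xi * yi % q for xi, yi in zip(xh, yh)], inv, q)


def schoolbook(x, y, q):
    """Reference O(n^2) negacyclic multiplication, used only for checking."""
    n = len(x)
    out = [0] * n
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            k = i + j
            if k < n:
                out[k] = (out[k] + xi * yj) % q
            else:
                out[k - n] = (out[k - n] - xi * yj) % q
    return out


# Toy parameters: n = 8, q = 17, and 17 = 1 (mod 16).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [8, 7, 6, 5, 4, 3, 2, 1]
print(poly_mul(x, y, 17) == schoolbook(x, y, 17))
```

With this ordering of butterflies no separate bit-reversal or pre/post-scaling pass is needed, which is the "no extra computations" property the slides attribute to the separated and CT-GS approaches.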

SLIDE 54

Performance

Design | Parameters (n, q) | Tunable (n, q) | Platform | Cycles | Slices | LUTs | FFs | DSPs | Fmax (MHz)
------------------------------------------------------------------------------------------------------------
Ours | (1024, 343576577) | ✓, ✓ | Artix-7 | 11,455 | 582 | 1977 | 991 | 6 | 124
Ours | (2048, 856145921) | ✓, ✓ | Artix-7 | 24,785 | 555 | 1981 | 1021 | 8 | 124
Ours | (1024, 12289) | ✓, ✓ | Artix-7 | 11,455 | 271 | 944 | 467 | 3 | 141
[KLC+17] | (1024, 12289) | ✓, − | Artix-7 | 5,494 | − | 2832 | 1381 | 8 | 150
Ours | (2048, 65537) | ✓, ✓ | Spartan-6 | 24,785 | 543 | 1601 | 368 | 8 | 90
[DB16] | (2048, 65537) | −, − | Spartan-6 | 25,654 | 269 | − | − | 9 | 207

✓ Parameterized
✓ Achieves theoretical cycles limit [formula not recoverable from the extracted text]
✓ Good time-area product
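The "good time-area product" claim can be illustrated with the two Artix-7 rows that share the parameters (1024, 12289). The slide does not define the metric, so the sketch below assumes one: latency at the reported Fmax multiplied by the LUT count. It also assumes the [KLC+17] row uses the same (n, q) as the row above it, and it ignores slices, FFs and DSPs. The names are illustrative.

```python
# Cycle counts, Fmax and LUTs copied from the table; metric definition is assumed.
designs = {
    "Ours (1024, 12289)": {"cycles": 11455, "fmax_mhz": 141, "luts": 944},
    "[KLC+17]": {"cycles": 5494, "fmax_mhz": 150, "luts": 2832},
}


def latency_us(design):
    """Latency in microseconds, assuming the design runs at its reported Fmax."""
    return design["cycles"] / design["fmax_mhz"]


def time_area(design):
    """Assumed time-area product: latency in microseconds times LUT count."""
    return latency_us(design) * design["luts"]


for name, design in designs.items():
    print(f"{name:20s} latency={latency_us(design):6.1f} us "
          f"time-area={time_area(design):9.0f} LUT*us")
```

Under this assumed metric the design from the talk needs more time per multiplication but about a third of the LUTs, so its product comes out lower. A different area measure could change the comparison.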

SLIDE 55

Software-hardware co-design of qTESLA on RISC-V

SLIDE 56

Software-hardware co-design on RISC-V

[Block diagram of the Murax SoC. Labels: VexRiscv (RV32IM), On-Chip RAM, APB Bridge, APB Decoder, APB, HW Accelerator, UART, JTAG]

✓ Standard ISA
✓ Standard interface
✓ Lightweight
✓ Good performance
✓ Fully open-source

SLIDE 57

Experimental setup for qTESLA on FPGA

[Setup diagram. Labels: Artix-7 AC701 FPGA, Program FPGA, Load software, Data exchange]

  • Artix-7 FPGA board from Xilinx (recommended by NIST)

SLIDE 58

Performance of core functions in qTESLA

[Chart; its data is not present in the extracted text]

SLIDE 61

Performance of qTESLA: key generation

[Chart; its data is not present in the extracted text]

SLIDE 63

Performance of qTESLA: sign

[Chart; its data is not present in the extracted text]

SLIDE 64

Performance of qTESLA: verify

[Chart; its data is not present in the extracted text]

SLIDE 65

Comparison with related work

[Comparison table; only its citation labels survive in the extracted text: [KRS+19] and [BUC19]]

[KRS+19]: Kannwischer, Matthias, et al. "pqm4: testing and benchmarking NIST PQC on ARM Cortex-M4."
[BUC19]: Banerjee, Utsav, et al. "A configurable crypto-processor for post-quantum lattice-based protocols."

SLIDE 68

Summary

  • Design and implement various hardware accelerators for lattice-based schemes:
      • SHAKE-128/256 and cSHAKE-128/256
      • Binary-search CDT sampler
      • NTT-based polynomial multiplier
      • Sparse polynomial multiplier
      • Hmax-Sum module
  • Prototype of the full qTESLA scheme as a software-hardware co-design on a RISC-V platform.
  • Fully open-source code: https://caslab.csl.yale.edu/code/qtesla-hw-sw-platform/

SLIDE 69

Thank you!