slide-1
SLIDE 1

DATA FLOW ORIENTED HARDWARE DESIGN OF RNS-BASED POLYNOMIAL MULTIPLICATION FOR SHE ACCELERATION

Joël Cathébras, Alexandre Carbon, Peter Milder, Renaud Sirdey, Nicolas Ventroux

Conference on Cryptographic Hardware and Embedded Systems 2018 | Amsterdam, The Netherlands | 09-10-18

slide-2
SLIDE 2

IMPLEMENTATION PROBLEMATIC FOR RLWE-BASED LEVELED-FHE SCHEMES

  • Handling polynomials of $R = \mathbb{Z}[X]/(F(X))$ and $R_q = R/qR$:
    • Modulus $q$ ~ several hundred bits
    • $\deg(F)$ ~ several thousand

[Diagram: security and multiplicative depth both impact these parameter sizes.]

slide-3
SLIDE 3

IMPLEMENTATION PROBLEMATIC FOR RLWE-BASED LEVELED-FHE SCHEMES

  • Residue Number System: $q = \prod_{i=1}^{k} q_i$

$a \times b = r \bmod q \;\Leftrightarrow\; a_{q_i} \times b_{q_i} = r_{q_i} \bmod q_i$ for $i = 1, \dots, k$
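The equivalence between one product modulo $q$ and $k$ independent products modulo the $q_i$ can be checked on a toy case. The sketch below is only an illustration: the moduli 13, 17, 19 and the helper names `to_rns` and `from_rns` are assumptions chosen for readability, not values or functions from the presentation. Recombination uses the Chinese Remainder Theorem.

```python
from math import prod


def to_rns(x, moduli):
    """Split x into its residues modulo each q_i."""
    return [x % q_i for q_i in moduli]


def from_rns(residues, moduli):
    """Recombine residues with the Chinese Remainder Theorem."""
    q = prod(moduli)
    x = 0
    for r_i, q_i in zip(residues, moduli):
        m_i = q // q_i                       # product of the other moduli
        x += r_i * m_i * pow(m_i, -1, q_i)   # pow(..., -1, q_i) needs Python 3.8+
    return x % q


# Toy pairwise-coprime moduli (illustrative only, far smaller than real parameters).
moduli = [13, 17, 19]
q = prod(moduli)

a, b = 1234, 3210
# One small multiplication per RNS channel, each independent of the others.
r_rns = [
    (a_i * b_i) % q_i
    for a_i, b_i, q_i in zip(to_rns(a, moduli), to_rns(b, moduli), moduli)
]
r = from_rns(r_rns, moduli)
print(r == (a * b) % q)
```

Each channel only ever handles values smaller than its own $q_i$, which is what makes word-sized hardware arithmetic possible.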

slide-4
SLIDE 4

IMPLEMENTATION PROBLEMATIC FOR RLWE-BASED LEVELED-FHE SCHEMES

  • Bajard et al. in 2016, further simplified by Halevi et al. in 2018:
    • RNS-compatible $\mathrm{FV.Dec}_{RNS}$ and $\mathrm{FV.Mult\&Relin}_{RNS}$.
    • New $\mathbf{rlk}_{RNS}$: pair of $k \times k$ matrices with elements in $R_{q_i}$ for $i$ in $1, \dots, k$.
    • Performance bottleneck: Residue Polynomial Multiplication (products in the $R_{q_i}$'s)

slide-5
SLIDE 5

IMPLEMENTATION PROBLEMATIC FOR RLWE-BASED LEVELED-FHE SCHEMES

  • Negative Wrapped Convolution over $R_{q_i} = \mathbb{Z}_{q_i}[X]/(F(X))$:
    • No polynomial modular reduction.
    • Restricts the choice of $F(X)$ to $X^n + 1$ with $n$ a power of 2.
    • Restricts the choice of $q_i$: $q_i \equiv 1 \bmod 2n$.
    • $2nk$ precomputed values: $(\psi_i^j)_{0 \le j < 2n}$, where $\psi_i$ is an $n$-th primitive root of $-1$ in $\mathbb{Z}_{q_i}^*$.
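A negative wrapped convolution can be sketched in a few lines to show why no separate reduction modulo $X^n + 1$ is needed. This is a minimal illustration under stated assumptions: the toy parameters $n = 8$, $q_i = 17$, $\psi_i = 3$ are mine, the transform is a naive quadratic-time NTT rather than a fast one, and the function names are hypothetical.

```python
def ntt(vec, omega, q):
    """Naive O(n^2) number-theoretic transform, enough for a small check."""
    n = len(vec)
    return [
        sum(vec[j] * pow(omega, j * k, q) for j in range(n)) % q
        for k in range(n)
    ]


def nwc(a, b, psi, q):
    """Product of a and b in Z_q[X]/(X^n + 1) via negative wrapped convolution."""
    n = len(a)
    omega = psi * psi % q
    # Pre-scale the inputs by the powers of psi.
    a_t = [a[j] * pow(psi, j, q) % q for j in range(n)]
    b_t = [b[j] * pow(psi, j, q) % q for j in range(n)]
    # Pointwise product in the transform domain.
    prod_hat = [x * y % q for x, y in zip(ntt(a_t, omega, q), ntt(b_t, omega, q))]
    # Inverse transform, then post-scale by n^-1 and the inverse powers of psi.
    c_t = ntt(prod_hat, pow(omega, -1, q), q)
    n_inv = pow(n, -1, q)
    return [c_t[j] * n_inv * pow(psi, -j, q) % q for j in range(n)]


def schoolbook_negacyclic(a, b, q):
    """Reference product with explicit reduction modulo X^n + 1."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] = (c[i + j] + a[i] * b[j]) % q
            else:
                c[i + j - n] = (c[i + j - n] - a[i] * b[j]) % q
    return c


# Toy parameters: 17 = 1 mod 16, and 3^8 = -1 mod 17.
a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [8, 7, 6, 5, 4, 3, 2, 1]
c = nwc(a, b, psi=3, q=17)
print(c == schoolbook_negacyclic(a, b, 17))
```

The pre- and post-scaling by powers of $\psi_i$ are what turn the cyclic convolution computed by the NTT into a negacyclic one.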

slide-6
SLIDE 6

RELATED WORKS (HARDWARE ACCELERATION)

  • Migliore et al. 2018: Karatsuba rather than NWC (no RNS)
    • Finer choice of $F(X)$ allowing batching of binary messages.
    • Asymptotic complexity in $O(n^{1.585})$ vs. $O(n \log n)$: turning point at ($n = 6144$, $\log_2 q = 512$).

Not sufficient to target large multiplicative depth.

  • Öztürk et al. 2015: RNS and NTT approach for LTV scheme (no NWC)
    • Memory-access iterative NTT.
    • External pre-computation of NTT twiddle factors.

Uses communication bandwidth for non-payload data.

  • Cousins et al. 2017: RNS and NTT approach for LTV scheme
    • Dataflow oriented pipelined NTT.
    • Local storage of all twiddle factors at compile time.

Storage cost in $O(kn)$, dependent on the RNS basis size.

  • Sinha Roy et al. 2015: RNS and NTT (no NWC) approach for RLWE-based scheme
    • Memory-access iterative NTT.
    • Local storage of a subset of the twiddle factors, and on-the-fly computation of the others.

Better storage in $O(k \log n)$, but still dependent on the RNS basis size.

Dataflow oriented NWC with on-the-fly computation of twiddle factors

slide-7
SLIDE 7

NWC ARCHITECTURE PRINCIPLE

One NWC over $R_q$ ⟺ $O(k)$ smaller NWCs over the $R_{q_i}$'s: $C_i = \mathrm{NWC}_i(A_i, B_i)$

  • Architecture principle:

[Diagram: the data flow takes $A_i$ and $B_i$ through the VEC PW MM, VEC NTT, PW MM, NTT and PW MM blocks to produce $C_i$. The twiddle flow comes from the GEN TW, GEN PCTW and GEN ITW generators, which receive $q_i$, $v_i$, the seeds $\psi_i^1, \dots, \psi_i^w$ and $n_i^{-1}$, and deliver $(q_i, v_i, \Psi_i)$, $(q_i, v_i, \Omega_i)$, $(q_i, v_i)$, $(q_i, v_i, \Omega_i^{-1})$ and $(q_i, v_i, n_i^{-1} \cdot \Psi_i^{-1})$.]

  • Required values for $\mathrm{NWC}_i$:
    • $\psi_i$: an $n$-th primitive root of $-1$ over $\mathbb{Z}_{q_i}^*$ ⇒ $\omega_i = \psi_i^2 \bmod q_i$ is an $n$-th primitive root of 1 over $\mathbb{Z}_{q_i}^*$
    • $\Omega_i \subset \Psi_i$ and $\Omega_i^{-1} \subset \Psi_i^{-1}$

slide-8
SLIDE 8

NWC ARCHITECTURE PRINCIPLE

$O(w)$ seeds ≪ $O(n)$ twiddles.

Generation of $\Psi_i = (\psi_i^j)_{j=0}^{n-1}$: one set every $T = n/w$ cycles.

slide-9
SLIDE 9

NWC ARCHITECTURE PRINCIPLE

Computation of $\Psi_i^{-1} = (\psi_i^{-j})_{j=0}^{n-1}$:

$\Psi_i^{-1} = \mathrm{Reorder}(q_i - \Psi_i)$, since $q_i - \psi_i^j = \psi_i^{-(n-j)} \bmod q_i$

slide-10
SLIDE 10

NWC ARCHITECTURE PRINCIPLE

Scale $\Psi_i^{-1}$ by $n_i^{-1}$ (with $n_i^{-1} = n^{-1} \bmod q_i$)

slide-12
SLIDE 12

AUTOMATIC GENERATION OF MULTI FIELD NTT DESIGN (1)

  • SPIRAL tool: DFT hardware generator.
  • Design space exploration.

slide-13
SLIDE 13

AUTOMATIC GENERATION OF MULTI FIELD NTT DESIGN (1)

  • Complex arithmetic ⇒ $\mathbb{Z}_{q_i}$ modular arithmetic.

$q_i$ ⟵ NFLlib prime selection. Barrett modular reduction ($v_i = \lfloor 2^{2(s+2)} / q_i \rfloor \bmod 2^{s+2}$).
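As a rough illustration of the Barrett reduction named on this slide, the sketch below uses the textbook variant that keeps the full precomputed constant $\lfloor 2^{2W}/q \rfloor$ for a word size $W$. It is not NFLlib's exact algorithm (which stores only the low word of the constant), and the modulus 998244353 with $W = 32$ is an assumed example, picked because it is a 30-bit prime congruent to 1 modulo $2^{23}$.

```python
def barrett_setup(q, word_bits):
    """Precompute the Barrett constant floor(2^(2*word_bits) / q)."""
    return (1 << (2 * word_bits)) // q


def barrett_mulmod(a, b, q, v, word_bits):
    """Return a*b mod q for a, b < q < 2^word_bits, without dividing by q."""
    x = a * b
    # Estimate the quotient x // q with a multiplication and a shift.
    t = (x * v) >> (2 * word_bits)
    r = x - t * q
    # The estimate never exceeds the true quotient, so a few subtractions fix r.
    while r >= q:
        r -= q
    return r


# Assumed example modulus: a 30-bit prime, used with 32-bit words.
q = 998244353
word_bits = 32
v = barrett_setup(q, word_bits)
print(barrett_mulmod(123456789, 987654321, q, v, word_bits) == (123456789 * 987654321) % q)
```

In hardware the division is replaced by two multiplications and a shift, which maps naturally onto DSP slices.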

slide-14
SLIDE 14

AUTOMATIC GENERATION OF MULTI FIELD NTT DESIGN (1)

  • Modifying twiddle factor handling.

Example of NTT data path ($r = 2$, $n = 16$, $w = 4$):

[Diagram: an Init Perm block followed by four stages (Stage 0 to Stage 3) of NTT 2 blocks separated by Perm blocks; the blocks receive $q_i$ or $(q_i, v_i)$. Twiddle sets shown on the stages: $(\omega_i^0, \omega_i^2, \omega_i^4, \omega_i^6)$ and $(\omega_i^1, \omega_i^3, \omega_i^5, \omega_i^7)$; $(\omega_i^0, \omega_i^4)$ and $(\omega_i^2, \omega_i^6)$; $\omega_i^4$.]

Characteristics:

  • $L = \log_r n$ stages.
  • $w$ words per cycle.
  • One transform every $T = n/w$ cycles.

slide-15
SLIDE 15

AUTOMATIC GENERATION OF MULTI FIELD NTT DESIGN (1)

[Diagram: the same NTT data path, with the per-stage twiddles ($\omega_i^0, \omega_i^2, \omega_i^4, \omega_i^6$; $\omega_i^1, \omega_i^3, \omega_i^5, \omega_i^7$; $\omega_i^0, \omega_i^4$; $\omega_i^2, \omega_i^6$; $\omega_i^4$) and $(q_i, v_i)$ grouped into a Twiddle Bank (TWB) made of the memories (1,0), (2,0), (2,1), (3,0) and (3,1).]

Twiddle Bank (TWB):

  • RNS channel specific
  • Reprogrammable

slide-16
SLIDE 16

AUTOMATIC GENERATION OF MULTI FIELD NTT DESIGN (2)

  • Cyclic access and reprogramming of TWB

[Diagram: the NTT DP (Init Perm, then the Perm and NTT 2 stages) with ports next_in, next_out, data_in and data_out ($w$ words in, $w$ words out), connected to the twiddle banks TWB 1 … TWB $G$ through read addresses, write addresses, write enables and the twiddle flow.]

$G = \lceil \mathrm{Lat}_{DP} / T \rceil + 1$

$d$: way index (in $0, \dots, \frac{w}{2} - 1$)
$l$: stage index

slide-17
SLIDE 17

AUTOMATIC GENERATION OF MULTI FIELD NTT DESIGN (2)

[Diagram build: adds the CTRL block (next_[0:$L$] signals) and the INTERCONNECT DP block.]

slide-18
SLIDE 18

AUTOMATIC GENERATION OF MULTI FIELD NTT DESIGN (2)

[Diagram build: adds the INTERCONNECT PRG and GA blocks.]

slide-19
SLIDE 19

AUTOMATIC GENERATION OF MULTI FIELD NTT DESIGN (2)

  • Reprogramming a TWB:

[Diagram build: adds the PRG block, with a next_prg input and a twiddles input of $w/2$ words. Inside TWB $g$, each (stage, way) pair $(l, d)$ has a register reg$(l, d)$ and a memory mem$(l, d)$, with signals prg_tw_*, tw_(l, d), we_(l, d), rd_addr_l and wr_addr_(l, d).]

slide-20
SLIDE 20

AUTOMATIC GENERATION OF MULTI FIELD NTT DESIGN (2)

  • Example of reprogram counters ($r = 2$, $n = 16$, $w = 4$):
    • The twiddles arrive at $w/2$ words per cycle on prg_tw_0 and prg_tw_1, carrying the sequences $(\omega_i^0, \omega_i^2, \omega_i^4, \omega_i^6)$ and $(\omega_i^1, \omega_i^3, \omega_i^5, \omega_i^7)$.
    • Counter for mem(3,1): offset 0, step 1, index 1 ⇒ $\omega_i^1, \omega_i^3, \omega_i^5, \omega_i^7$
    • Counter for mem(2,1): offset 1, step 2, index 0 ⇒ $\omega_i^2, \omega_i^6$
    • Select from the flow ⇒ update we_(l, d) and wr_addr_(l, d)

slide-21
SLIDE 21

RPM CHARACTERIZATION PROOF-OF-CONCEPT INTEGRATION (1)

  • Preliminary integration:
    • Alpha-Data ADM-PCIE 7v3.
    • Xilinx Virtex 7: XC7VX690T-2-FFG1157C.
    • PCIe Gen3, 8 lanes.
    • Vivado 2016.3: placed and routed.

Parameters: $n = 2^{12}$, $\log q_i = 30$, $w = 2$; $f_{RPM} = 200$ MHz. PCIe test OK.

[Diagram: the RPM inside a WRAP (AXI + FIFOs), connected through BCHI, DMA 0, DMA 1, DMA 2 and DS to PCIe 3 x8.]

Resource utilization (two charts):

              LUT      LUTRAM   FF       BRAM     DSP
First chart   12.5%    8.3%     7.7%     14.1%    14.4%
Second chart  6.4%     3.1%     4.6%     10.4%    1.3%

slide-22
SLIDE 22

RPM CHARACTERIZATION PROOF-OF-CONCEPT INTEGRATION (1)

Most constraining resources for the RPM:

  • BRAM slices
  • DSP slices
  • PCIe bandwidth

How does the RPM scale in the SHE context?

slide-23
SLIDE 23

RPM CHARACTERIZATION PROJECTIONS (1)

  • Impact of the polynomial degree $n$ ($w = 2$ and $\log_2 q_i = 30$):

[Plots for the Xilinx Virtex 7 XC7VX690T-2-FFG1157C: DSP, BRAM and required bandwidth ($f = 200$ MHz), each against the resource limitation (FPGA / PCIe Gen3 x8).]

Slight increase in DSP utilization.
BRAM is restrictive for $n > 2^{15}$ ([58-65]% for NTT permutations).
Required bandwidth is achievable.

slide-24
SLIDE 24

RPM CHARACTERIZATION PROJECTIONS (2)

  • Impact of the streaming width $w$ ($n = 2^{14}$ and $\log_2 q_i = 30$):

[Plots for the Xilinx Virtex 7 XC7VX690T-2-FFG1157C: DSP, BRAM and required bandwidth ($f = 200$ MHz), each against the resource limitation (FPGA / PCIe Gen3 x8).]

Large increase in DSP utilization.
Increase in BRAM utilization.
Required bandwidth is prohibitive.

slide-25
SLIDE 25

RPM CHARACTERIZATION PROJECTIONS (3)

  • Impact of the RNS prime size $\log_2 q_i$ ($n = 2^{14}$ and $w = 2$):

[Plots for the Xilinx Virtex 7 XC7VX690T-2-FFG1157C: DSP, BRAM and required bandwidth ($f = 200$ MHz), each against the resource limitation (FPGA / PCIe Gen3 x8).]

Balanced impact on DSP and BRAM utilization.
Required bandwidth may become restrictive.

slide-26
SLIDE 26

PERFORMANCE PROJECTIONS: FV-RNS APPLICATION

  • Raw performances:
    • Required bandwidth: $\sim f_{RPM} \cdot 3w \log_2 q_i$
    • RPM / s: $f_{RPM} / (n/w)$
  • Performance projection @200MHz, with respect to timing from [HPS18] ($\lambda > 128$).
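The two raw-performance expressions can be evaluated for the proof-of-concept parameters quoted earlier in the deck (200 MHz, $n = 2^{12}$, $w = 2$, 30-bit primes). The function names are hypothetical, and reading the factor 3 as two operand streams plus one result stream is an interpretation, not something the slide states.

```python
def required_bandwidth_bits_per_s(f_hz, w, log2_qi):
    """Approximate I/O bandwidth: f * 3w * log2(q_i) bits per second."""
    return f_hz * 3 * w * log2_qi


def rpm_per_second(f_hz, n, w):
    """Residue polynomial multiplications per second: f / (n / w)."""
    return f_hz / (n / w)


f_hz = 200_000_000  # 200 MHz
print(required_bandwidth_bits_per_s(f_hz, w=2, log2_qi=30) / 1e9, "Gbit/s")
print(rpm_per_second(f_hz, n=4096, w=2), "RPM/s")
```

Both figures scale linearly with $w$, which is consistent with the deck's observation that widening the stream raises throughput and bandwidth demand together.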

slide-27
SLIDE 27

PERFORMANCE PROJECTIONS: FV-RNS APPLICATION

Scalability w.r.t. multiplicative depth:

  • Speedup (su) is scalable.
  • Realistic bandwidth usage.
  • Timing after RPM speedup:
    • Basis ext. & Scaling: [77-86] %
    • RPMs: [9-16] %
    • RPM vs. NTT implementation?

slide-28
SLIDE 28

PERFORMANCE PROJECTIONS: FV-RNS APPLICATION

Increasing parallelism:

  • Greatly improves speedup.
  • Bandwidth and DSPs may quickly become restrictive.
slide-29
SLIDE 29

PERFORMANCE PROJECTIONS: FV-RNS APPLICATION

Increasing prime size:

  • Slightly improves speedup.
  • Balanced cost on DSP and BRAM usage.
  • Bandwidth may be restrictive.
slide-30
SLIDE 30

CONCLUSION & PERSPECTIVES

  • Hardware implementation for SHE should be flexible:
    • Refinement of parameter range still in progress.
    • Multiplicative depth has significant impact on both $n$ and $\log_2 q$.
  • Our response:
    • Dataflow RNS-based NWC with on-the-fly generation of twiddles.
    • Exploiting DSP knowledge on DFT implementation.
    • Minimize the impact of $\log_2 q$ on hardware design.
  • Research perspectives:
    • NTT vs. RPM?
    • Proper system integration
    • Design space exploration with SPIRAL
  • Application perspectives:
    • Hybrid architecture for SHE acceleration

slide-31
SLIDE 31

Centre de Saclay Nano-Innov PC 172 - 91191 Gif sur Yvette Cedex


Thanks! Questions?

slide-32
SLIDE 32

INTRODUCTION: HOMOMORPHIC ENCRYPTION

Mid-term evaluation | Joël Cathébras

Homomorphic encryption has to be secure … and correct!

$m \in \mathcal{M}$ message space, $m_e \in \mathcal{H}$ cleartext space, $c \in \mathcal{C}$ ciphertext space.

$m$ → Encode → $m_e$ → Encrypt → $c$, and $c$ → Decrypt → $m_e$ → Decode → $m$.

$c_1 +_{\mathcal{C}} c_2 = c_{sum}$ and $c_1 \times_{\mathcal{C}} c_2 = c_{mul}$

  • Decryption function is a homomorphism:

With $c_1$, $c_2$ two ciphertexts such that $c_1 = \mathrm{Enc}(m_1)$ and $c_2 = \mathrm{Enc}(m_2)$: $m_1 \circ m_2 \Leftrightarrow c_1 \circledcirc c_2$, that is $\mathrm{Dec}(c_1) \circ \mathrm{Dec}(c_2) = \mathrm{Dec}(c_1 \circledcirc c_2)$.

  • Semantic security: noise in ciphertexts

Error distribution $\chi_{err}$, usually $\chi_{err} = \mathcal{N}(0, \sigma^2)$; decrypting $c$ yields $m_e$ together with an error term $err$.
slide-33
SLIDE 33

MODULAR ARITHMETIC

  • Modular Addition:
  • Modular Subtraction:
  • Modular Multiplication (NFLlib):
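The formulas for these three operations did not survive in the transcript. As a hedged stand-in, the sketch below shows the usual conditional-correction form of modular addition and subtraction for operands already reduced modulo $q$; it is a common hardware-friendly formulation and an assumption on my part, not necessarily what the slide displayed.

```python
def mod_add(a, b, q):
    """(a + b) mod q for 0 <= a, b < q, with one conditional subtraction."""
    s = a + b
    return s - q if s >= q else s


def mod_sub(a, b, q):
    """(a - b) mod q for 0 <= a, b < q, with one conditional addition."""
    d = a - b
    return d + q if d < 0 else d


print(mod_add(16, 5, 17), mod_sub(3, 5, 17))
```

Neither operation needs a multiplier or a division, only an adder, a comparator and a multiplexer.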
slide-34
SLIDE 34

GENERATION OF TWIDDLE FACTORS (1)

  • Problematic of twiddle generation:
    • Data dependencies.
    • Modular multiplication latency.
    • Required throughput: one set every $T = n/w$ cycles.
  • Example of recurrence relation:
    • $\psi^{2k} = \psi^k \cdot \psi^k$ and $\psi^{2k+1} = \psi^k \cdot \psi^{k+1}$
    • Intermediate storage in $O(n/4)$
    • Compute "at the earliest"
  • Example of $\Psi$ generation ($n = 32$, $w = 2$):

[Diagram: schedule of the twiddle computations over time, with inputs $A_0$ and $A_1$ read from local storage, results $R_0 = A_0 \cdot A_0$ and $R_1 = A_0 \cdot A_1$, and the modular multiplication latency $\mathrm{Lat}_{MM}$.]
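The recurrence can be exercised in software to see the data dependencies it creates: every new pair of powers needs two earlier ones. This sketch ignores multiplier latency and scheduling, which are the actual hardware concerns, and uses assumed toy values (base 5, modulus 193, 32 powers) and a hypothetical function name.

```python
def powers_by_recurrence(psi, n, q):
    """Return [psi^0, ..., psi^(n-1)] mod q using only the doubling recurrence.

    psi^(2k)   = psi^k * psi^k
    psi^(2k+1) = psi^k * psi^(k+1)
    Assumes n is even and n >= 2.
    """
    p = [1, psi % q] + [0] * (n - 2)
    for k in range(1, n // 2):
        # Both operands have an index below 2k + 1, so they are already known.
        p[2 * k] = p[k] * p[k] % q
        p[2 * k + 1] = p[k] * p[k + 1] % q
    return p


# Toy values for illustration only.
table = powers_by_recurrence(5, 32, 193)
print(table == [pow(5, k, 193) for k in range(32)])
```

Each product depends on results from roughly half as far back in the sequence, which is why some intermediate storage is needed before those values can be discarded.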

slide-35
SLIDE 35

GENERATION OF TWIDDLE FACTORS (2)

  • Data flow twiddle generation:

[Diagram: generator fed with $q_i$, $v_i$ and the seeds $\psi_i^1, \dots, \psi_i^w$; it contains INTERCONNECT IN, the generation handlers GH 1 … GH $H$, CTRL COMPUTE, the MMB and INTERCONNECT OUT, followed by the buffers BUF 1 … BUF $H$ and CTRL SORT (num, valid); ports next_in, next_out and twiddles.]

Sequential access to the MMB ($w$ MMs) with cyclic priority order.

$H = \lceil \mathrm{Lat}_{GEN} / T \rceil + 1$ ($H = 3$ when $T \gg \mathrm{Lat}_{MM}$)

  • Minimize Generation Handler local storage:
    • $\mathrm{bunch}_d = (\psi^{d+1}, \psi^{d+2}, \dots, \psi^{d+w})$ and twiddle set ≈ $(\mathrm{bunch}_d)_{d=0}^{T-1}$
    • $\mathrm{bunch}_{d_{next}} = f_j \cdot \mathrm{bunch}_{d_{last}}$ with $f_j = \psi^{jw}$ ($d_{next} = j + d_{last}$)
    • $j$ is upper bounded by design parameters only

⇒ Local storage independent of $n$!
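A software model of the bunch-by-bunch generation makes the storage claim concrete: only the last $j$ bunches and the single factor $f_j$ are kept, whatever $n$ is. The indexing below (bunch $t$ holds $\psi^{tw+1}, \dots, \psi^{tw+w}$), the way the first $j$ bunches are seeded, and all names and toy values are assumptions made for this sketch.

```python
from collections import deque


def twiddle_bunches(psi, n, w, q, j=1):
    """Yield n // w bunches of w consecutive powers of psi, starting at psi^1.

    Each new bunch is the bunch produced j steps earlier, multiplied
    elementwise by f_j = psi^(j*w). Only j bunches are stored at any time.
    """
    total = n // w
    f_j = pow(psi, j * w, q)
    # Seed window: the first j bunches, computed directly (assumption).
    window = deque(
        [pow(psi, t * w + e, q) for e in range(1, w + 1)] for t in range(j)
    )
    for _ in range(total):
        bunch = window.popleft()
        yield bunch
        window.append([f_j * x % q for x in bunch])


# Toy values for illustration only.
flat = [x for bunch in twiddle_bunches(5, 32, 2, 193, j=3) for x in bunch]
print(flat == [pow(5, e, 193) for e in range(1, 33)])
```

A larger $j$ gives each multiplication more cycles to complete before its result is needed, at the cost of a proportionally larger window.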