SLIDE 1

Conference on Cryptographic Hardware and Embedded Systems 2018 | Amsterdam, The Netherlands | 09-10-18

Joël Cathébras, Alexandre Carbon, Renaud Sirdey, Nicolas Ventroux

DATA FLOW ORIENTED HARDWARE DESIGN OF RNS-BASED POLYNOMIAL MULTIPLICATION FOR SHE ACCELERATION

Peter Milder

SLIDE 2

IMPLEMENTATION PROBLEMATIC FOR RLWE-BASED LEVELED-FHE SCHEMES

Handling polynomials of 𝑺 = ℤ[𝑌]/(𝐺(𝑌)) and 𝑺_𝑟 = 𝑺/𝑟𝑺:
Modulus 𝑟 ~ several hundred bits
deg(𝐺) ~ several thousand

Security and multiplicative depth impact both sizes.

SLIDE 3

IMPLEMENTATION PROBLEMATIC FOR RLWE-BASED LEVELED-FHE SCHEMES

Residue Number System: 𝑟 = ∏_{𝑗=1}^{𝑙} 𝑟𝑗

𝑏 × 𝑐 = 𝑠 modulo 𝑟 ⇔ 𝑏_{𝑟𝑗} × 𝑐_{𝑟𝑗} = 𝑠_{𝑟𝑗} modulo 𝑟𝑗, for each 𝑗 in 1, …, 𝑙
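To make the Residue Number System equivalence on this slide concrete, here is a minimal Python sketch on plain integers rather than polynomials. The three small moduli and the helper names `to_rns` and `from_rns` are illustrative assumptions, not parameters from the talk.

```python
from math import prod

def to_rns(value, moduli):
    """Split an integer into its residues modulo each r_j."""
    return [value % m for m in moduli]

def from_rns(residues, moduli):
    """Rebuild the integer modulo r = prod(r_j) with the Chinese Remainder Theorem."""
    r = prod(moduli)
    total = 0
    for res, m in zip(residues, moduli):
        partial = r // m                      # product of all other moduli
        total += res * partial * pow(partial, -1, m)
    return total % r

# Hypothetical pairwise-coprime basis (real r_j would be much larger).
moduli = [97, 193, 257]
r = prod(moduli)

b, c = 123456, 654321
# One independent small product per RNS channel.
s_rns = [(x * y) % m for x, y, m in zip(to_rns(b, moduli), to_rns(c, moduli), moduli)]
s = from_rns(s_rns, moduli)
```

Each channel only ever handles values smaller than its own 𝑟𝑗, which is what lets a wide product modulo 𝑟 be replaced by 𝑙 narrow, independent ones.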

SLIDE 4

IMPLEMENTATION PROBLEMATIC FOR RLWE-BASED LEVELED-FHE SCHEMES

Bajard et al. in 2016, further simplified by Halevi et al. in 2018:
RNS-compatible FV.Dec_{𝑆𝑂𝑇} and FV.Mult&Relin_{𝑆𝑂𝑇}.
New 𝒔𝒎𝒍_{𝑆𝑂𝑇}: pair of 𝑙 × 𝑙 matrices with elements in 𝑆_{𝑟𝑗} for 𝑗 in 1, …, 𝑙.
Performance bottleneck: Residue Polynomial Multiplication (RPM), i.e. products in the 𝑆_{𝑟𝑗}'s.

SLIDE 5

IMPLEMENTATION PROBLEMATIC FOR RLWE-BASED LEVELED-FHE SCHEMES

Negative Wrapped Convolution (NWC) over 𝑆_{𝑟𝑗} = ℤ_{𝑟𝑗}[𝑌]/(𝐺(𝑌)):
No polynomial modular reduction.
Restricts the choice of 𝐺: 𝐺(𝑌) = 𝑌^𝑜 + 1 with 𝑜 a power of 2.
Restricts the choice of 𝑟𝑗: 𝑟𝑗 ≡ 1 mod 2𝑜.
2𝑜𝑙 precomputed values: (𝜔𝑗^𝑘)_{0≤𝑘<2𝑜}, where 𝜔𝑗 is an 𝑜-th primitive root of −1 in ℤ*_{𝑟𝑗}.
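The NWC described above can be sketched in a few lines of Python. This is only an illustration under assumed toy parameters (𝑜 = 8, 𝑟𝑗 = 17, 𝜔𝑗 = 3, chosen so that 17 ≡ 1 mod 16 and 3^8 ≡ −1 mod 17); the transform is written as a naive quadratic sum for readability, not as the fast streaming NTT of the talk, and all function names are hypothetical.

```python
def ntt(vec, root, r):
    """Naive number-theoretic transform: out[k] = sum_i vec[i] * root^(i*k) mod r."""
    o = len(vec)
    return [sum(vec[i] * pow(root, i * k, r) for i in range(o)) % r for k in range(o)]

def nwc(b, c, omega, r):
    """Product of b and c modulo (Y^o + 1, r), omega being an o-th primitive root of -1."""
    o = len(b)
    root = omega * omega % r                  # o-th primitive root of 1
    inv_o = pow(o, -1, r)
    inv_omega = pow(omega, -1, r)
    inv_root = pow(root, -1, r)
    # Pre-multiply by the powers of omega, then transform.
    b_hat = ntt([x * pow(omega, i, r) % r for i, x in enumerate(b)], root, r)
    c_hat = ntt([x * pow(omega, i, r) % r for i, x in enumerate(c)], root, r)
    # Point-wise product in the transformed domain.
    d_hat = [x * y % r for x, y in zip(b_hat, c_hat)]
    # Inverse transform, then scale by o^-1 and the inverse powers of omega.
    d_tw = ntt(d_hat, inv_root, r)
    return [x * inv_o * pow(inv_omega, i, r) % r for i, x in enumerate(d_tw)]

def schoolbook_negacyclic(b, c, r):
    """Reference product modulo (Y^o + 1, r): wrapped terms change sign."""
    o = len(b)
    d = [0] * o
    for i in range(o):
        for j in range(o):
            k = i + j
            if k < o:
                d[k] = (d[k] + b[i] * c[j]) % r
            else:
                d[k - o] = (d[k - o] - b[i] * c[j]) % r
    return d

r, omega = 17, 3                              # toy values, assumptions for this sketch
b = [1, 2, 3, 4, 5, 6, 7, 8]
c = [8, 7, 6, 5, 4, 3, 2, 1]
d = nwc(b, c, omega, r)
```

The pre- and post-multiplication by powers of 𝜔𝑗 is what absorbs the reduction by 𝑌^𝑜 + 1, so no explicit polynomial reduction step is needed.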


SLIDE 6

RELATED WORKS (HARDWARE ACCELERATION)

Migliore et al. 2018: Karatsuba rather than NWC (no RNS)
Finer choice of 𝐺(𝑌) allowing batching of binary messages.
Asymptotic complexity in O(𝑜^{1.585}) vs. O(𝑜 log 𝑜): turning point (𝑜 = 6144, log2 𝑟 = 512).
⇒ Not sufficient to target large multiplicative depth.

Öztürk et al. 2015: RNS and NTT approach for LTV scheme (no NWC)
Memory-access iterative NTT.
External pre-computation of NTT twiddle factors.
⇒ Uses communication bandwidth for non-payload data.

Cousins et al. 2017: RNS and NTT approach for LTV scheme
Dataflow-oriented pipelined NTT.
Local storage of all twiddle factors at compile time.
⇒ Storage cost in O(𝑙𝑜), dependent on RNS basis size.

Sinha Roy et al. 2015: RNS and NTT (no NWC) approach for RLWE-based scheme
Memory-access iterative NTT.
Local storage of a subset of the twiddle factors, and on-the-fly computation of the others.
⇒ Better storage in O(𝑙 log 𝑜), but still dependent on RNS basis size.

Our approach: dataflow-oriented NWC with on-the-fly computation of twiddle factors.

SLIDE 7

NWC ARCHITECTURE PRINCIPLE

One NWC over 𝑺 ⟺ O(𝑙) smaller NWCs over the 𝑺_{𝑟𝑗}'s: 𝐷𝑗 = NWC𝑗(𝐵𝑗, 𝐶𝑗)

Required values for NWC𝑗:
𝜔𝑗: an 𝑜-th primitive root of −1 over ℤ*_{𝑟𝑗} ⇒ 𝜕𝑗 = 𝜔𝑗^2 mod 𝑟𝑗 is an 𝑜-th primitive root of 1 over ℤ*_{𝑟𝑗}
Ω𝑗 ⊂ Ψ𝑗 and Ω𝑗^{−1} ⊂ Ψ𝑗^{−1} (since 𝜕𝑗 = 𝜔𝑗^2 mod 𝑟𝑗)

Architecture principle:
[Figure: NWC block diagram. Twiddle flow blocks: GEN ITW, GEN PCTW, GEN TW. Data flow blocks: VEC PW MM, VEC NTT, PW MM, NTT, PW MM. Inputs: 𝑟𝑗, 𝑤𝑗, 𝜔𝑗^1, …, 𝜔𝑗^𝑥, 𝑜𝑗^{−1}, 𝐵𝑗, 𝐶𝑗. Output: 𝐷𝑗. Twiddle streams shown: (𝑟𝑗, 𝑤𝑗, Ψ𝑗), (𝑟𝑗, 𝑤𝑗, Ω𝑗), (𝑟𝑗, 𝑤𝑗), (𝑟𝑗, 𝑤𝑗, Ω𝑗^{−1}), (𝑟𝑗, 𝑤𝑗, 𝑜𝑗^{−1} ⋅ Ψ𝑗^{−1}).]

SLIDE 8

NWC ARCHITECTURE PRINCIPLE

Generation of Ψ𝑗 = (𝜔𝑗^𝑘)_{𝑘=0}^{𝑜−1}: O(𝑥) seeds ≪ O(𝑜) twiddles.
One set every 𝑈 = 𝑜/𝑥 cycles.

SLIDE 9

NWC ARCHITECTURE PRINCIPLE

Computation of Ψ𝑗^{−1} = (𝜔𝑗^{−𝑘})_{𝑘=0}^{𝑜−1}:
Ψ𝑗^{−1} = Reorder(𝑟𝑗 − Ψ𝑗), since 𝑟𝑗 − 𝜔𝑗^𝑘 = 𝜔𝑗^{−(𝑜−𝑘)} mod 𝑟𝑗
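The negate-and-reorder identity can be checked numerically. The sketch below assumes toy values (𝑜 = 8, 𝑟𝑗 = 17, 𝜔𝑗 = 3) and a hypothetical helper `psi_inverse_from_psi`; how the actual Reorder block treats the index 𝑘 = 0 is not stated on the slide, so that case is handled here by reusing 𝜔𝑗^0 = 1 directly.

```python
def psi(omega, r, o):
    """Powers omega^k for k = 0 .. o-1."""
    return [pow(omega, k, r) for k in range(o)]

def psi_inverse_from_psi(psi_vals, r):
    """Derive omega^(-k), k = 0 .. o-1, from the forward powers without any inversion."""
    o = len(psi_vals)
    # Because omega^o = -1, negating omega^k gives omega^(-(o-k)).
    negated = [r - v for v in psi_vals]
    # omega^(-k) for k = 1 .. o-1 is therefore negated[o-k]; omega^0 is 1 (assumed edge case).
    return [psi_vals[0]] + [negated[o - k] for k in range(1, o)]

r, omega, o = 17, 3, 8                        # toy values, assumptions for this sketch
forward = psi(omega, r, o)
inverse = psi_inverse_from_psi(forward, r)
```

Only subtractions and a reindexing are involved, which is presumably why the inverse set can be derived from the forward stream instead of being generated separately.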

SLIDE 10

NWC ARCHITECTURE PRINCIPLE

Scale Ψ𝑗^{−1} by 𝑜𝑗^{−1}, where 𝑜𝑗^{−1} = 𝑜^{−1} mod 𝑟𝑗.

SLIDE 12

AUTOMATIC GENERATION OF MULTI FIELD NTT DESIGN (1)

SPIRAL tool: DFT hardware generator.
Design space exploration.

SLIDE 13

AUTOMATIC GENERATION OF MULTI FIELD NTT DESIGN (1)

Complex arithmetic ⇒ ℤ_{𝑟𝑗} modular arithmetic.
𝑟𝑗 ⟵ NFLlib prime selection.
Barrett modular reduction, with 𝑤𝑗 = ⌊2^{2(𝑡+2)} / 𝑟𝑗⌋ mod 2^{𝑡+2}.

SLIDE 14

AUTOMATIC GENERATION OF MULTI FIELD NTT DESIGN (1)

Modifying twiddle factor handling.

Example of NTT data path (𝑠 = 2, 𝑜 = 16, 𝑥 = 4):
[Figure: streaming data path made of an Init Perm block followed by stages 0 to 3 of Perm blocks and NTT 2 butterflies, parameterised by 𝑟𝑗 and 𝑤𝑗. Twiddle sets shown: (𝜕𝑗^0, 𝜕𝑗^2, 𝜕𝑗^4, 𝜕𝑗^6), (𝜕𝑗^1, 𝜕𝑗^3, 𝜕𝑗^5, 𝜕𝑗^7), (𝜕𝑗^0, 𝜕𝑗^4), (𝜕𝑗^2, 𝜕𝑗^6) and (𝜕𝑗^4).]

Characteristics:
𝑀 = log_𝑠 𝑜 stages.
𝑥 words per cycle.
One transform every 𝑈 = 𝑜/𝑥 cycles.
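The two characteristics can be evaluated directly for a given configuration. The following Python helper is a hypothetical convenience (its name and argument order are assumptions); it simply computes 𝑀 = log_𝑠 𝑜 and 𝑈 = 𝑜/𝑥 and rejects sizes that do not divide evenly.

```python
def streaming_ntt_characteristics(o, x, s=2):
    """Return (number of stages M, cycles per transform U) for size o, width x, radix s."""
    stages, size = 0, 1
    while size < o:
        size *= s
        stages += 1
    if size != o:
        raise ValueError("o must be a power of the radix s")
    if o % x != 0:
        raise ValueError("the streaming width x must divide o")
    return stages, o // x

# Slide example: radix 2, 16 points, 4 words per cycle.
example = streaming_ntt_characteristics(16, 4, 2)
# Proof-of-concept size used later in the talk: o = 2^12, x = 2.
poc = streaming_ntt_characteristics(2 ** 12, 2, 2)
```

For the slide example this gives four stages and one transform every four cycles, matching stages 0 to 3 in the figure.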

SLIDE 15

AUTOMATIC GENERATION OF MULTI FIELD NTT DESIGN (1)

RNS channel specific values are reprogrammable: 𝑟𝑗, 𝑤𝑗 and the twiddle sets (𝜕𝑗^0, 𝜕𝑗^2, 𝜕𝑗^4, 𝜕𝑗^6), (𝜕𝑗^1, 𝜕𝑗^3, 𝜕𝑗^5, 𝜕𝑗^7), (𝜕𝑗^0, 𝜕𝑗^4), (𝜕𝑗^2, 𝜕𝑗^6) and (𝜕𝑗^4).
Twiddle Bank (TWB): memories (1,0), (2,0), (2,1), (3,0), (3,1).

SLIDE 16

AUTOMATIC GENERATION OF MULTI FIELD NTT DESIGN (2)

[Figure: NTT data path (NTT DP: Init Perm, then Perm and NTT 2 blocks) with twiddle banks TWB 1 … TWB 𝐻. Ports: next_in, next_out, data_in (𝑥 words), data_out (𝑥 words). TWB signals: read addresses, write addresses, write enables, twiddle flow.]

Cyclic access and reprogramming of TWB.
𝐻 = Lat_{𝐸𝑄} / 𝑈 + 1
𝑢: way index (in 0, …, 𝑥/2 − 1)
𝑚: stage index

SLIDE 17

AUTOMATIC GENERATION OF MULTI FIELD NTT DESIGN (2)

[Figure, continued: adds the CTRL block (signals next_[0:𝑀]) and the INTERCONNECT DP block.]

SLIDE 18

AUTOMATIC GENERATION OF MULTI FIELD NTT DESIGN (2)

[Figure, continued: adds the INTERCONNECT PRG and GA blocks.]

SLIDE 19

AUTOMATIC GENERATION OF MULTI FIELD NTT DESIGN (2)

Reprogramming a TWB:
[Figure, continued: adds the PRG port with next_prg and twiddles (𝑥/2 words). Inside TWB 𝑕, each pair (𝑚, 𝑢) has a register reg(𝑚, 𝑢) and a memory mem(𝑚, 𝑢), with signals prg_tw_*, tw_(𝑚, 𝑢), we_(𝑚, 𝑢), rd_addr_𝑚 and wr_addr_(𝑚, 𝑢).]

SLIDE 20

AUTOMATIC GENERATION OF MULTI FIELD NTT DESIGN (2)

Example of reprogram counters (𝑠 = 2, 𝑜 = 16, 𝑥 = 4):
twiddles ⇒ 𝑥/2 words per cycle:
prg_tw_0: 𝜕𝑗^0, 𝜕𝑗^2, 𝜕𝑗^4, 𝜕𝑗^6
prg_tw_1: 𝜕𝑗^1, 𝜕𝑗^3, 𝜕𝑗^5, 𝜕𝑗^7
Select from the flow ⇒ update we_(𝑚, 𝑢) and wr_addr_(𝑚, 𝑢):
Counter for mem(3,1): offset 0, step 1, index 1 ⇒ 𝜕𝑗^1, 𝜕𝑗^3, 𝜕𝑗^5, 𝜕𝑗^7
Counter for mem(2,1): offset 1, step 2, index 0 ⇒ 𝜕𝑗^2, 𝜕𝑗^6
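One way to read the reprogram counters quoted on this slide is as a strided selection from the incoming twiddle flow: `index` picks the programming way (prg_tw_0 or prg_tw_1), `offset` is the first element kept, and `step` is the stride. The Python sketch below encodes that reading; it is an interpretation of the slide, not the actual RTL, and it manipulates exponents of 𝜕𝑗 rather than field elements.

```python
def select_from_flow(flows, offset, step, index):
    """Keep every `step`-th twiddle of programming way `index`, starting at `offset`."""
    return flows[index][offset::step]

# Exponents of the twiddles carried by each programming way (x/2 = 2 ways).
flows = [
    [0, 2, 4, 6],   # prg_tw_0
    [1, 3, 5, 7],   # prg_tw_1
]

mem_3_1 = select_from_flow(flows, offset=0, step=1, index=1)
mem_2_1 = select_from_flow(flows, offset=1, step=2, index=0)
```

Under this reading mem(3,1) receives the exponents 1, 3, 5, 7 and mem(2,1) receives 2, 6, which agrees with the twiddle sets listed for those memories.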

SLIDE 21

RPM CHARACTERIZATION PROOF-OF-CONCEPT INTEGRATION (1)

[Figure: integration block diagram with RPM, WRAP, AXI + FIFOs, BCHI, DMA 0, DMA 1, DMA 2, DS and PCIe 3 x8.]

Preliminary integration:
Alpha-Data ADM-PCIE 7v3.
Xilinx Virtex 7: XC7VX690T-2-FFG1157C.
PCIe Gen3, 8 lanes.
Vivado 2016.3: placed and routed.

𝑜 = 2^12, log 𝑟𝑗 = 30, 𝑥 = 2
𝑔_{𝑆𝑄𝑁} = 200 MHz
PCIe test: OK!

Resource utilization (two charts):
          LUT     LUTRAM  FF      BRAM    DSP
Chart 1   12.5%   8.3%    7.7%    14.1%   14.4%
Chart 2   6.4%    3.1%    4.6%    10.4%   1.3%

SLIDE 22

RPM CHARACTERIZATION PROOF-OF-CONCEPT INTEGRATION (1)

Most constraining resources for the RPM:
BRAM slices
DSP slices
PCIe bandwidth

How does the RPM scale in the SHE context?

SLIDE 23

RPM CHARACTERIZATION PROJECTIONS (1)

Impact of the polynomial degree 𝑜 (𝑥 = 2 and log2 𝑟𝑗 = 30):

[Figure: three plots (DSP, BRAM, required bandwidth at 𝑔 = 200 MHz) for the Xilinx Virtex 7 XC7VX690T-2-FFG1157C, with the resource limits (FPGA / PCIe Gen3 x8) marked.]

Slight increase in DSP utilization.
BRAM is restrictive for 𝑜 > 2^15 ([58-65]% for NTT permutations).
Required bandwidth is achievable.

SLIDE 24

RPM CHARACTERIZATION PROJECTIONS (2)

Impact of the streaming width 𝑥 (𝑜 = 2^14 and log2 𝑟𝑗 = 30):

[Figure: three plots (DSP, BRAM, required bandwidth at 𝑔 = 200 MHz) for the Xilinx Virtex 7 XC7VX690T-2-FFG1157C, with the resource limits (FPGA / PCIe Gen3 x8) marked.]

Great increase in DSP utilization.
Increase of BRAM utilization.
Required bandwidth is prohibitive.

SLIDE 25

RPM CHARACTERIZATION PROJECTIONS (3)

Impact of the RNS prime size log2 𝑟𝑗 (𝑜 = 2^14 and 𝑥 = 2):

[Figure: three plots (DSP, BRAM, required bandwidth at 𝑔 = 200 MHz) for the Xilinx Virtex 7 XC7VX690T-2-FFG1157C, with the resource limits (FPGA / PCIe Gen3 x8) marked.]

Balanced impact on DSP and BRAM utilization.
Required bandwidth may become restrictive.

SLIDE 26

PERFORMANCE PROJECTIONS: FV-RNS APPLICATION

Performance projection @200MHz:
With respect to timing from [HPS18] (𝜇 > 128).

Raw performances:
Required bandwidth: ~ 𝑔_{𝑆𝑄𝑁} ⋅ 3𝑥 ⋅ log2 𝑟𝑗
RPM / s: 𝑔_{𝑆𝑄𝑁} / (𝑜/𝑥)
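As a quick numerical check of the two raw-performance expressions, the Python sketch below plugs in the proof-of-concept values quoted earlier in the talk (200 MHz, 𝑥 = 2, log2 𝑟𝑗 = 30, 𝑜 = 2^12). The function names are hypothetical, and reading the factor 3 as two operand streams plus one result stream is an interpretation rather than something the slide states.

```python
def required_bandwidth_bits_per_s(f_hz, x, log2_r):
    """Approximate I/O bandwidth: 3 streams of x words of log2_r bits per cycle."""
    return f_hz * 3 * x * log2_r

def rpm_per_second(f_hz, o, x):
    """One residue polynomial multiplication completes every o/x cycles."""
    return f_hz / (o / x)

f_hz = 200_000_000
bandwidth_bits = required_bandwidth_bits_per_s(f_hz, x=2, log2_r=30)
bandwidth_gbytes = bandwidth_bits / 8 / 1e9
throughput = rpm_per_second(f_hz, o=2 ** 12, x=2)
```

With these inputs the bandwidth comes to 36 Gbit/s (4.5 GB/s) and the throughput to about 97,656 RPM per second. Doubling 𝑥 doubles both figures, which is the bandwidth pressure that the streaming-width projection points to.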

SLIDE 27

PERFORMANCE PROJECTIONS: FV-RNS APPLICATION

Scalability w.r.t. multiplicative depth:
Speedup (su) is scalable.
Realistic bandwidth usage.
Timing after RPM speedup:
Basis ext. & Scaling: [77-86] %
RPMs: [9-16] %
RPM vs. NTT implementation?

SLIDE 28

PERFORMANCE PROJECTIONS: FV-RNS APPLICATION

Increasing parallelism:
Greatly improves speedup.
Bandwidth and DSPs may quickly become restrictive.

SLIDE 29

PERFORMANCE PROJECTIONS: FV-RNS APPLICATION

Increasing prime size:
Slightly improves speedup.
Balanced cost on DSP and BRAM usage.
Bandwidth may be restrictive.

SLIDE 30

CONCLUSION & PERSPECTIVES

Hardware implementation for SHE should be flexible:
Refinement of parameter range still in progress.
Multiplicative depth has a significant impact on both 𝑜 and log2 𝑟.

Our response:
Dataflow RNS-based NWC with on-the-fly generation of twiddles.
Exploiting DSP knowledge on DFT implementation.
Minimizing the impact of log2 𝑟 on hardware design.

Research perspectives:
NTT vs. RPM?
Proper system integration
Design space exploration with SPIRAL

Application perspectives:
Hybrid architecture for SHE acceleration

SLIDE 31

Centre de Saclay Nano-Innov PC 172 - 91191 Gif sur Yvette Cedex

Thanks! Questions?

SLIDE 32

INTRODUCTION: HOMOMORPHIC ENCRYPTION

𝑛 ∈ ℳ: message space; 𝑛𝑓 ∈ ℋ: cleartext space; 𝑑 ∈ 𝒟: ciphertext space.
𝑛 → Encode → 𝑛𝑓 → Encrypt → 𝑑, and 𝑑 → Decrypt → 𝑛𝑓 → Decode → 𝑛.

Decryption function is a homomorphism:
𝑑1, 𝑑2 two ciphertexts such that 𝑑1 = Enc(𝑛1) and 𝑑2 = Enc(𝑛2)
𝑛1 ∘ 𝑛2 ⟺ 𝑑1 ⊚ 𝑑2, i.e. Dec(𝑑1) ∘ Dec(𝑑2) = Dec(𝑑1 ⊚ 𝑑2)
𝑑1 +_𝒟 𝑑2 = 𝑑sum and 𝑑1 ×_𝒟 𝑑2 = 𝑑mul

Semantic security: noise in ciphertexts
Error distribution 𝜓_{𝑓𝑠𝑠}; usually 𝜓_{𝑓𝑠𝑠} = 𝑂(0, 𝜏²).

Homomorphic encryption has to be secure … and correct!

SLIDE 33

MODULAR ARITHMETIC

Modular Addition:
Modular Subtraction:
Modular Multiplication (NFLlib):
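As a rough illustration of the three operators listed above, here is a Python sketch using a textbook Barrett reduction. It is not NFLlib's exact variant: the constant here is ⌊2^{2k}/r⌋ for a k-bit modulus, which is an assumption made for simplicity, and the modulus 12289 is only a convenient small prime.

```python
def mod_add(a, b, r):
    """a + b mod r for 0 <= a, b < r, with one conditional subtraction."""
    s = a + b
    return s - r if s >= r else s

def mod_sub(a, b, r):
    """a - b mod r for 0 <= a, b < r, with one conditional addition."""
    d = a - b
    return d + r if d < 0 else d

def barrett_constant(r):
    """Precomputed constant floor(2^(2k) / r), with k the bit length of r."""
    k = r.bit_length()
    return k, (1 << (2 * k)) // r

def mod_mul_barrett(a, b, r, k, w):
    """a * b mod r for 0 <= a, b < r, using shifts and multiplications only."""
    p = a * b                      # p < r^2 < 2^(2k)
    q = (p * w) >> (2 * k)         # quotient estimate, at most 1 below floor(p / r)
    t = p - q * r                  # hence 0 <= t < 2r
    return t - r if t >= r else t  # a single correction is enough

r = 12289                          # illustrative modulus, not a value from the talk
k, w = barrett_constant(r)
product = mod_mul_barrett(12345 % r, 6789, r, k, w)
```

The point of the precomputed constant is that the division by 𝑟𝑗 disappears from the data path, which is why it travels alongside 𝑟𝑗 as a per-channel parameter.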

SLIDE 34

GENERATION OF TWIDDLE FACTORS (1)

Problematic of twiddle generation:
Data dependencies.
Modular multiplication latency.
Required throughput: one set every 𝑈 = 𝑜/𝑥 cycles.

Example of recurrence relation:
𝜔^{2𝑙} = 𝜔^𝑙 ⋅ 𝜔^𝑙 and 𝜔^{2𝑙+1} = 𝜔^𝑙 ⋅ 𝜔^{𝑙+1}
Intermediate storage in O(𝑜/4)
Compute “at the earliest”

Example of Ψ generation (𝑜 = 32, 𝑥 = 2):
[Figure: schedule of the generated power indices 1 to 32, starting from inputs 1 and 2, with products 𝑆0 = 𝐵0 ⋅ 𝐵0 and 𝑆1 = 𝐵0 ⋅ 𝐵1 taken from local storage and the modular multiplication latency Lat_{𝑁𝑁} marked.]
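The recurrence on this slide can be exercised in software to see that every power is reachable from the first two. The Python sketch below is sequential and ignores multiplier latency and scheduling, which are the actual hardware concerns; the base 3 and modulus 97 are arbitrary illustrative values and `twiddle_powers` is a hypothetical name.

```python
def twiddle_powers(omega, r, count):
    """Return [omega^1, ..., omega^count] mod r using only the doubling recurrence."""
    pw = [None] * (count + 1)      # pw[k] holds omega^k; index 0 is unused
    pw[1] = omega % r
    if count >= 2:
        pw[2] = pw[1] * pw[1] % r  # second seed, as in the slide's inputs 1 and 2
    for l in range(1, count // 2 + 1):
        if 2 * l <= count:
            pw[2 * l] = pw[l] * pw[l] % r          # omega^(2l) = omega^l * omega^l
        if 2 * l + 1 <= count:
            pw[2 * l + 1] = pw[l] * pw[l + 1] % r  # omega^(2l+1) = omega^l * omega^(l+1)
    return pw[1:]

powers = twiddle_powers(3, 97, 32)
```

Each new power depends on values with roughly half its index, so the dependency chain is only logarithmic in depth. That property is what makes it possible to keep several multipliers busy despite their latency.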

SLIDE 35

GENERATION OF TWIDDLE FACTORS (2)

Data flow twiddle generation:
[Figure: block diagram with generation handlers GH 1 … GH 𝐼, INTERCONNECT IN, INTERCONNECT OUT, MMB, CTRL COMPUTE, buffers BUF 1 … BUF 𝐼 and CTRL SORT. Signals shown: 𝑟𝑗, 𝑤𝑗, 𝜔𝑗^1, …, 𝜔𝑗^𝑥, next_in, twiddles, num valid, next_out.]
Sequential access to MMB (𝑥 MMs) with cyclic priority order.
𝐼 = Lat_{𝐻𝐹𝑂} / 𝑈 + 1 (𝐼 = 3 when 𝑈 ≫ Lat_{𝑁𝑁})
Minimize Generation Handler local storage: