SLIDE 1

High-Performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation using CUDA

Ahmad Al Badawi, ahmad@u.nus.edu, National University of Singapore (NUS), Sept 10th 2018, CHES 2018

SLIDE 2

FHE: The holy grail of cryptography

  • FHE enables computing on encrypted data without decryption [GB2009]
  • Challenge: requires enormous computation

[Figure: the client (encryption/decryption) sends $\mathrm{Enc}(x)$ to the cloud server, which performs homomorphic evaluation of $f$ and returns $\mathrm{Enc}(f(x))$.]

Ahmad Al Badawi - ahmad@u.nus.edu 2

SLIDE 3

How is the problem being tackled?

  • Algorithmic methods:
    – New FHE schemes
    – Plaintext packing (1D, 2D, ...)
    – Encoding schemes
    – Approximated computing
    – Squashing the target function
    – DAG optimizations for the target circuit

  • Acceleration methods:
    – Speed up FHE basic primitives (KeyGen, Enc, Dec, Add, Mul)
    – Modular algorithms
    – Parallel implementations
    – Hardware implementations: GPUs, FPGAs and probably ASICs

SLIDE 4

Our Contributions

  • 1. Implementation of FV RNS on GPUs
  • 2. Introducing a set of CUDA optimizations
  • 3. Benchmarking with state-of-the-art implementations

SLIDE 5

Why GPUs for FHE?

  • GPU +
    – Naturally available
    – Many computing cores
    – Developer friendly (compared with FPGAs and ASICs)

  • FHE +
    – Huge level of parallelism

"If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?" Seymour Cray (1925-1996)

SLIDE 6

Textbook FV

  • Basic mathematical structure is $R = \mathbb{Z}[x]/(x^n + 1)$
    – Plaintext space: $R_t = \mathbb{Z}_t[x]/(x^n + 1)$
    – Ciphertext space: $R_q = \mathbb{Z}_q[x]/(x^n + 1)$
  • Public key: $pk_0, pk_1 \in R_q$
  • Secret key: $sk \in R_q$
  • $c = \mathrm{Enc}(m)$: $\left( \left[ \left\lfloor \frac{q}{t} \right\rfloor m + pk_0 u + e_0 \right]_q, \left[ pk_1 u + e_1 \right]_q \right)$
  • $m = \mathrm{Dec}(c)$: $\left[ \left\lfloor \frac{t}{q} \left[ c_0 + c_1 sk \right]_q \right\rceil \right]_t$
  • $c_+ = \mathrm{Add}(c_0, c_1)$: $\left( [c_{00} + c_{10}]_q, [c_{01} + c_{11}]_q \right)$
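
The Enc, Dec and Add formulas on this slide can be tried out on toy parameters. The following is a minimal sketch, not the GPU implementation: the parameters `N`, `Q`, `T` are illustrative and insecure, the secret key, `u` and the errors are drawn as ternary polynomials (an assumption), key generation uses the usual textbook convention $pk_0 = -(a \cdot sk + e)$, $pk_1 = a$ (not shown on the slide), and polynomial products use plain schoolbook multiplication.

```python
import random

N, Q, T = 16, 1 << 20, 16      # assumed toy parameters, far too small to be secure
DELTA = Q // T                 # floor(q / t)

def polymul(a, b, mod):
    """Schoolbook product in Z_mod[x]/(x^N + 1)."""
    res = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < N:
                res[k] = (res[k] + ai * bj) % mod
            else:                              # x^N = -1: wrap with a sign flip
                res[k - N] = (res[k - N] - ai * bj) % mod
    return res

def polyadd(a, b, mod):
    return [(x + y) % mod for x, y in zip(a, b)]

def small(rng):
    """Ternary polynomial (assumed distribution for sk, u and the errors)."""
    return [rng.choice((-1, 0, 1)) for _ in range(N)]

def keygen(rng):
    sk = small(rng)
    a = [rng.randrange(Q) for _ in range(N)]
    e = small(rng)
    pk0 = [(-x) % Q for x in polyadd(polymul(a, sk, Q), e, Q)]
    return sk, (pk0, a)

def encrypt(pk, m, rng):
    u, e0, e1 = small(rng), small(rng), small(rng)
    scaled = [DELTA * x for x in m]
    c0 = polyadd(polyadd(scaled, polymul(pk[0], u, Q), Q), e0, Q)
    c1 = polyadd(polymul(pk[1], u, Q), e1, Q)
    return c0, c1

def decrypt(sk, c):
    v = polyadd(c[0], polymul(c[1], sk, Q), Q)
    # round(t/q * v) mod t, computed with integers only
    return [((2 * T * x + Q) // (2 * Q)) % T for x in v]

def add(ca, cb):
    return polyadd(ca[0], cb[0], Q), polyadd(ca[1], cb[1], Q)
```

With these sizes the noise stays far below $q/(2t)$, so decryption recovers the message exactly for any choice of the random polynomials.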
SLIDE 7

Textbook FV (cont.)

  • $c_\times = \mathrm{Mul}(c_0, c_1, evk)$:
  • 1. Tensor product:

    $v_0 = \left[ \left\lfloor \frac{t}{q} c_{00} c_{10} \right\rceil \right]_q$, $v_2 = \left[ \left\lfloor \frac{t}{q} c_{01} c_{11} \right\rceil \right]_q$

    $v_1 = \left[ \left\lfloor \frac{t}{q} (c_{00} c_{11} + c_{01} c_{10}) \right\rceil \right]_q$

  • 2. Base decomposition:

    $v_2 = \sum_{i=0}^{\ell} v_2^{(i)} w^i$, $\ell = \lfloor \log_w q \rfloor$

  • 3. Relinearization:

    $c_\times = \left( \left[ v_j + \sum_{i=0}^{\ell} evk_{ij} \cdot v_2^{(i)} \right]_q \right)$, $j \in \{0, 1\}$
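
The base decomposition step can be illustrated on plain integer coefficients. This is a small sketch under the textbook assumption that coefficients lie in $[0, q)$ and are split into $\ell + 1$ digits in base $w$, with $\ell = \lfloor \log_w q \rfloor$; the function names are hypothetical.

```python
def base_decompose(v, w, q):
    """Split each coefficient of v (values in [0, q)) into base-w digits,
    least significant digit first. Returns ell + 1 digit polynomials."""
    ell = 0
    while w ** (ell + 1) <= q:      # ell = floor(log_w q)
        ell += 1
    digits = []
    coeffs = list(v)
    for _ in range(ell + 1):
        digits.append([c % w for c in coeffs])
        coeffs = [c // w for c in coeffs]
    return digits

def recompose(digits, w, q):
    """Inverse of base_decompose: sum_i digits[i] * w^i (mod q)."""
    n = len(digits[0])
    return [sum(d[j] * w ** i for i, d in enumerate(digits)) % q for j in range(n)]
```

Each digit is smaller than $w$, which is what keeps the noise added by the evaluation key small during relinearization.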

SLIDE 8

Implementation Requirements

  • Polynomial arithmetic in cyclotomic rings
  • Large polynomial degree (a few thousand)
    – Power-of-2 cyclotomic
    – Addition/Subtraction: $\mathcal{O}(n)$
    – Multiplication: $\mathcal{O}(n \log n)$
  • Large coefficients $\in \mathbb{Z}_q$ (a few hundred bits)
    – Modular algorithms (RNS)
  • Extra non-trivial operations:
    – Scale-and-round $\left\lfloor \frac{t}{q} x \right\rceil$
    – Base decomposition
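
For reference, scale-and-round on a single multi-precision integer needs no floating point. The sketch below assumes ties are rounded up and uses Python's arbitrary-precision integers; it only shows what the operation computes, not the RNS version that avoids multi-precision arithmetic. The helper `reference` exists only to cross-check the result with exact rationals.

```python
from fractions import Fraction
from math import floor

def scale_and_round(x, t, q):
    """round(t * x / q) with halves rounded up, using integer arithmetic only."""
    return (2 * t * x + q) // (2 * q)

def reference(x, t, q):
    """Same value computed with exact rational arithmetic, for checking."""
    return floor(Fraction(t * x, q) + Fraction(1, 2))
```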

SLIDE 9

Polynomial Arithmetic

[Figure: a polynomial with $n$ coefficients in $\mathbb{Z}_q$ is split by RNS/CRT into $k$ residue polynomials modulo $p_0, p_1, \ldots, p_{k-1}$; each residue coefficient is a $\log_2 p$-bit number.]

  • $k = \log_2 q / \log_2 p$
  • CRT: $q = \prod_{i=0}^{k-1} p_i$, where each $p_i$ is a prime
  • Addition/Subtraction: component-wise add/sub modulo $p_i$
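
The RNS layout can be pictured as a $k \times n$ matrix of small residues. The following sketch uses three tiny primes chosen only for illustration (the talk uses word-size primes) and shows that adding or subtracting row by row gives the same result as working modulo $q$ directly; the function names are hypothetical.

```python
from math import prod

PRIMES = [97, 101, 103]          # assumed toy moduli p_0, p_1, p_2
Q = prod(PRIMES)                 # q = p_0 * p_1 * p_2

def to_rns(poly, primes=PRIMES):
    """One row of residues per prime: row i holds the coefficients mod p_i."""
    return [[c % p for c in poly] for p in primes]

def rns_add(x, y, primes=PRIMES):
    return [[(a + b) % p for a, b in zip(rx, ry)]
            for rx, ry, p in zip(x, y, primes)]

def rns_sub(x, y, primes=PRIMES):
    return [[(a - b) % p for a, b in zip(rx, ry)]
            for rx, ry, p in zip(x, y, primes)]
```

Every row is independent of the others, which is the property that maps well onto many GPU threads.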

SLIDE 10

Polynomial Arithmetic (cont.)

Addition/Subtraction/Multiplication: component-wise add/sub/mul modulo $p_i$

[Figure: the RNS/CRT matrix of $k$ residue polynomials (modulo $p_0, p_1, \ldots, p_{k-1}$), each with $n$ 32-bit coefficients, is mapped row by row into the transform domain by the NTT (DGT) and back by $\mathrm{NTT}^{-1}$; $k = \log_2 q / \log_2 p$.]

SLIDE 11

DFT, NTT, DWT, DGT...?

DFT
  • Pros: well-established; several efficient libraries to use
  • Cons: floating-point errors increase as $n$ and the $p_i$'s increase; reducing precision (smaller $p_i$'s) => longer RNS matrix => more DFTs

NTT
  • Pros: exact
  • Cons: transform length $2n$

DWT
  • Pros: exact; transform length $n$
  • Cons: only power-of-2 cyclotomics

DGT
  • Pros: exact; transform length $n/2$; 50% less interaction with memory
  • Cons: only power-of-2 cyclotomics; Gaussian arithmetic (larger number of multiplications, ~30%-40%)

  • We use DGT in our implementation
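
The shorter transform lengths come from weighting the inputs: multiplying coefficient $i$ by $\psi^i$, where $\psi$ is a primitive $2n$-th root of unity, turns a length-$n$ cyclic convolution into a product modulo $x^n + 1$, so no zero padding to $2n$ is needed. The sketch below shows this weighted-transform idea (the DWT row, not the DGT itself) with assumed toy values and a naive quadratic transform for readability.

```python
P, N, PSI = 17, 8, 3             # assumed toy values: PSI has order 2N modulo P
OMEGA = PSI * PSI % P            # primitive N-th root of unity

def ntt(a, root):
    """Naive length-N transform modulo P (quadratic time, for clarity only)."""
    return [sum(a[i] * pow(root, i * j, P) for i in range(N)) % P
            for j in range(N)]

def negacyclic_mul(a, b):
    """Product in Z_P[x]/(x^N + 1) using length-N weighted transforms."""
    aw = [a[i] * pow(PSI, i, P) % P for i in range(N)]     # weight the inputs
    bw = [b[i] * pow(PSI, i, P) % P for i in range(N)]
    prod_hat = [x * y % P for x, y in zip(ntt(aw, OMEGA), ntt(bw, OMEGA))]
    cw = ntt(prod_hat, pow(OMEGA, -1, P))                  # inverse transform
    n_inv, psi_inv = pow(N, -1, P), pow(PSI, -1, P)
    return [cw[i] * n_inv * pow(psi_inv, i, P) % P for i in range(N)]

def schoolbook(a, b):
    """Reference product in Z_P[x]/(x^N + 1)."""
    res = [0] * N
    for i in range(N):
        for j in range(N):
            sign = 1 if i + j < N else -1                  # x^N = -1
            res[(i + j) % N] = (res[(i + j) % N] + sign * a[i] * b[j]) % P
    return res
```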
SLIDE 12

Efficient DGT/NTT/DWT on GPU?

$A = \mathrm{NTT}(a)$ s.t. $A_j = \sum_{i=0}^{n-1} a_i w^{ji} \bmod q$

$a = \mathrm{NTT}^{-1}(A)$ s.t. $a_i = n^{-1} \sum_{j=0}^{n-1} A_j w^{-ij} \bmod q$

  • Better to store $w^{ji}$ in a lookup table (LUT).
    – The LUT can be stored in GPU texture memory (which is limited on GPUs)
    – DWT LUTs are $\mathcal{O}(n)$
    – DGT LUTs are $\mathcal{O}(n/2)$
  • Compute in $GF(p')$ or in $GF(p_i)$?
    – We found it is better to do it in $GF(p_i)$.
    – Why? (see next)
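
The role of the lookup table can be seen in a direct transcription of the two sums. This is a host-side sketch with assumed toy values (a small prime and a primitive $n$-th root of unity `W`), not a CUDA kernel: since the powers of `W` repeat with period $n$, a table of $n$ entries covers every twiddle factor, which is the $\mathcal{O}(n)$ table size mentioned for the DWT.

```python
P, N, W = 17, 8, 9      # assumed toy values: W is a primitive N-th root of unity mod P

# Twiddle-factor lookup table: LUT[e] = W^e mod P for e = 0..N-1.
LUT = [pow(W, e, P) for e in range(N)]

def ntt(a):
    """A_j = sum_i a_i * W^(j*i) mod P, with exponents reduced mod N."""
    return [sum(a[i] * LUT[(i * j) % N] for i in range(N)) % P
            for j in range(N)]

def intt(A):
    """a_i = N^-1 * sum_j A_j * W^(-i*j) mod P, reusing the same table."""
    n_inv = pow(N, -1, P)
    return [n_inv * sum(A[j] * LUT[(-i * j) % N] for j in range(N)) % P
            for i in range(N)]
```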

SLIDE 13

Compute in $GF(p')$ or in $GF(p_i)$?

[Figure: the RNS matrix of $k$ residue polynomials (modulo $p_0, p_1, \ldots, p_{k-1}$), each with $n$ coefficients; $k = \log_2 q / \log_2 p$.]

$GF(p')$

  • $p'$: 64-bit prime (should fit in one word)
  • $p \le \sqrt{p' / (2n)}$ (one multiplication)

    n         2^12   2^13   2^14   2^15   2^16
    log2 p    26     25     25     24     24

  • Longer RNS matrix => more NTTs
  • Size doubles (32-bit => 64-bit)
  • Supports a limited number of operations in the NTT domain

$GF(p_i)$

  • $p_i$: word-size prime (can be 64-bit)
  • Shorter RNS matrix => fewer NTTs
  • No size doubling
  • Supports an unlimited number of operations in the NTT domain
SLIDE 14

But, is NTT/DWT/DGT performance-critical?

[Figure: three pie charts (Toy, Medium, and Large settings), each splitting the run time into NTT, RNS Base Extension, RNS Scaling, and Others.]

Breakdown of homomorphic multiplication (AND) in the BFV FHE scheme

Halevi, Shai, Yuriy Polyakov, and Victor Shoup. "An Improved RNS Variant of the BFV Homomorphic Encryption Scheme." (2018).

SLIDE 15

Computing CRT on GPU?

  • $\mathrm{CRT}(a, \{p_i\})$: $(a_0, \ldots, a_{k-1})$, where $a_i = a \bmod p_i$
  • $\mathrm{CRT}^{-1}(a_0, \ldots, a_{k-1}) = a$ s.t.

    $a = \sum_{i=0}^{k-1} \frac{q}{p_i} \left( \left( \frac{q}{p_i} \right)^{-1} a_i \bmod p_i \right) \pmod{q}$, where $q = \prod_{i=0}^{k-1} p_i$

  • At least two methods:
    – Classic algorithm
    – Garner's algorithm

                         Classic          Garner's
    LUT                  $k^2$            $k(k-1)/2$
    Thread divergence    Non-tractable    Nil

  • Is CRT critical to performance?
    – No!
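
Both reconstruction methods fit in a few lines when arbitrary-precision integers are available. The sketch below is illustrative Python rather than GPU code, with hypothetical function names: `crt_classic` follows the $\mathrm{CRT}^{-1}$ sum above, while `crt_garner` first computes mixed-radix digits, using one inverse $p_j^{-1} \bmod p_i$ for every pair $j < i$, which is where the $k(k-1)/2$ table entries come from.

```python
from math import prod

def crt_classic(residues, primes):
    """a = sum_i (q/p_i) * ((q/p_i)^-1 * a_i mod p_i)  (mod q)."""
    q = prod(primes)
    total = 0
    for a_i, p_i in zip(residues, primes):
        q_i = q // p_i
        total += q_i * ((pow(q_i, -1, p_i) * a_i) % p_i)
    return total % q

def crt_garner(residues, primes):
    """Garner's algorithm: a = d_0 + d_1*p_0 + d_2*p_0*p_1 + ..."""
    digits = []
    for i, p_i in enumerate(primes):
        d = residues[i]
        for j in range(i):                     # uses p_j^-1 mod p_i for j < i
            d = ((d - digits[j]) * pow(primes[j], -1, p_i)) % p_i
        digits.append(d)
    value, radix = 0, 1
    for d, p in zip(digits, primes):
        value += d * radix
        radix *= p
    return value
```

Garner's variant needs no final reduction modulo $q$, since the mixed-radix value is already below $q$.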

SLIDE 16

RNS tools

  • Useful to:
    – Remain in RNS representation
    – Avoid costly multi-precision arithmetic
  • Two basic operations:
    – Scale-and-round
    – Base decomposition
  • Adopted from the BEHZ2016* scheme
  • Are RNS tools critical to performance?
    – Extremely critical

* Bajard, Jean-Claude, et al. "A full RNS variant of FV like somewhat homomorphic encryption schemes." International Conference on Selected Areas in Cryptography. Springer, Cham, 2016.

SLIDE 17

FV_RNS Homomorphic Multiplication

SLIDE 18

Benchmarking Results

[Figure: four bar charts of run time (ms) comparing GPU-FV, SEAL and NFLlib-FV at the settings (11,62), (12,186), (13,372) and (14,744). Panels: Key Generation, Enc, Dec, HomoMul + Relinearization.]

SLIDE 19

Which FV RNS variant to Implement?

  • Two RNS variants of FV
    – BEHZ
    – HPS
  • Answer can be found in:
    – Al Badawi, Ahmad, et al. "Implementation and Performance Evaluation of RNS Variants of the BFV Homomorphic Encryption Scheme." IACR Cryptology ePrint Archive 2018 (2018): 589.

SLIDE 20


Thank You

Questions? Ahmad Al Badawi ahmad@u.nus.edu