High-Performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation using CUDA
Ahmad Al Badawi (ahmad@u.nus.edu)
National University of Singapore (NUS)
Sept 10th, 2018, CHES 2018
FHE: The holy grail of Cryptography
- FHE enables computing on encrypted data without decryption [GB2009]
- Challenge: requires enormous computation
[Figure: the client performs encryption/decryption and sends $Enc(x)$ to the cloud server; the server performs the homomorphic evaluation of $f$ and returns $Enc(f(x))$.]
How is the problem being tackled?
- Algorithmic methods:
  - New FHE schemes
  - Plaintext packing (1D, 2D, …)
  - Encoding schemes
  - Approximated computing
  - Squashing the target function
  - DAG optimizations for the target circuit
- Acceleration methods:
  - Speeding up the basic FHE primitives (KeyGen, Enc, Dec, Add, Mul)
  - Modular algorithms
  - Parallel implementations
  - Hardware implementations: GPUs, FPGAs, and probably ASICs
Our Contributions
- 1. Implementation of FV RNS on GPUs
- 2. Introducing a set of CUDA optimizations
- 3. Benchmarking with state-of-the-art implementations
Why GPUs for FHE?
- GPU +
  - Naturally available
  - Many computing cores
  - Developer friendly (compared with FPGA, ASIC)
- FHE +
  - Huge level of parallelism
"If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?"
Seymour Cray (1925-1996)
Textbook FV
- Basic mathematical structure is $R = \mathbb{Z}[x]/(x^n + 1)$
  - Plaintext space: $R_t = \mathbb{Z}_t[x]/(x^n + 1)$
  - Ciphertext space: $R_q = \mathbb{Z}_q[x]/(x^n + 1)$
- Public key: $pk_0, pk_1 \in R_q$
- Secret key: $sk \in R_q$
- $c = Enc(m)$: $\left( \left[ \lfloor q/t \rfloor m + pk_0 u + e_0 \right]_q, \left[ pk_1 u + e_1 \right]_q \right)$
- $m = Dec(c)$: $\left[ \left\lfloor \frac{t}{q} \left[ c_0 + c_1 sk \right]_q \right\rceil \right]_t$
- $c_+ = Add(c_0, c_1)$: $\left( [c_{00} + c_{10}]_q, [c_{01} + c_{11}]_q \right)$
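The Enc, Dec, and Add formulas above can be sketched in a few lines of Python. This is a toy illustration only: the parameters `N`, `Q`, `T`, the ternary noise distribution, the key-generation step, and all function names are assumptions chosen for readability, not the presentation's GPU implementation, and the parameters offer no security.

```python
import random

N, Q, T = 8, 1 << 20, 16      # toy ring degree n, ciphertext modulus q, plaintext modulus t
DELTA = Q // T                # floor(q / t)

def polymul(a, b, mod):
    """Schoolbook product in Z_mod[x]/(x^N + 1); x^N wraps around as -1."""
    res = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j < N:
                res[i + j] += ai * bj
            else:
                res[i + j - N] -= ai * bj
    return [c % mod for c in res]

def polyadd(a, b, mod):
    """Coefficient-wise sum modulo mod."""
    return [(x + y) % mod for x, y in zip(a, b)]

def small():
    """Polynomial with coefficients drawn from {-1, 0, 1} (assumed noise/secret distribution)."""
    return [random.choice((-1, 0, 1)) for _ in range(N)]

def keygen():
    """Assumed textbook key generation: pk0 = -(a*sk + e) mod q, pk1 = a."""
    sk = small()
    a = [random.randrange(Q) for _ in range(N)]
    e = small()
    pk0 = [(-c) % Q for c in polyadd(polymul(a, sk, Q), e, Q)]
    return sk, (pk0, a)

def encrypt(pk, m):
    """c = ([floor(q/t)*m + pk0*u + e0]_q, [pk1*u + e1]_q)."""
    u, e0, e1 = small(), small(), small()
    scaled = [DELTA * x for x in m]
    c0 = polyadd(polyadd(scaled, polymul(pk[0], u, Q), Q), e0, Q)
    c1 = polyadd(polymul(pk[1], u, Q), e1, Q)
    return c0, c1

def decrypt(sk, c):
    """m = [round(t/q * [c0 + c1*sk]_q)]_t, using exact integer rounding."""
    v = polyadd(c[0], polymul(c[1], sk, Q), Q)
    return [((2 * T * x + Q) // (2 * Q)) % T for x in v]

def add(ca, cb):
    """Homomorphic addition: component-wise sum of the two ciphertexts."""
    return polyadd(ca[0], cb[0], Q), polyadd(ca[1], cb[1], Q)
```

With these toy parameters the noise stays far below $q/(2t)$, so `decrypt(sk, add(encrypt(pk, m1), encrypt(pk, m2)))` returns the coefficient-wise sum of `m1` and `m2` modulo `T`.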
Textbook FV (cont.)
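- $c_\times = Mul(c_0, c_1, evk)$:
- 1. Tensor product:
  $v_0 = \left[ \left\lfloor \frac{t}{q} c_{00} c_{10} \right\rceil \right]_q$, $v_1 = \left[ \left\lfloor \frac{t}{q} (c_{00} c_{11} + c_{01} c_{10}) \right\rceil \right]_q$, $v_2 = \left[ \left\lfloor \frac{t}{q} c_{01} c_{11} \right\rceil \right]_q$
- 2. Base decomposition:
  $v_2 = \sum_{i=0}^{\ell} v_2^{(i)} w^i$, $\ell = \lfloor \log_w q \rfloor$
- 3. Relinearization:
  $c_\times = \left( \left[ v_j + \sum_{i=0}^{\ell} evk_{ij} \cdot v_2^{(i)} \right]_q \right)$, $j \in \{0, 1\}$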
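The base-decomposition step can be illustrated with a minimal Python sketch. It assumes the usual choice $\ell = \lfloor \log_w q \rfloor$ and coefficients already reduced to $[0, q)$; the function names `base_decompose` and `recompose` are hypothetical and not taken from the presentation.

```python
def base_decompose(v, w, q):
    """Split each coefficient of v (assumed in [0, q)) into base-w digits.

    Returns digit polynomials v^(0), ..., v^(ell) with v = sum_i v^(i) * w^i.
    """
    ell = 0
    while w ** (ell + 1) <= q:      # ell = floor(log_w q)
        ell += 1
    digits = []
    rest = list(v)
    for _ in range(ell + 1):
        digits.append([c % w for c in rest])   # current least-significant digit
        rest = [c // w for c in rest]          # shift right by one digit
    return digits

def recompose(digits, w):
    """Inverse of base_decompose: sum_i v^(i) * w^i, coefficient by coefficient."""
    n = len(digits[0])
    return [sum(d[j] * w ** i for i, d in enumerate(digits)) for j in range(n)]
```

Each digit polynomial has coefficients below `w`, which is what keeps the noise added during relinearization small.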
Implementation Requirements
- Polynomial arithmetic in cyclotomic rings
- Large polynomial degree (a few thousand)
  - Power-of-2 cyclotomic
  - Addition/Subtraction: $\mathcal{O}(n)$
  - Multiplication: $\mathcal{O}(n \log n)$
- Large coefficients $\in \mathbb{Z}_q$ (a few hundred bits)
  - Modular algorithms (RNS)
- Extra non-trivial operations:
  - Scale-and-round $\left\lfloor \frac{t}{q} x \right\rceil$
  - Base decomposition
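To see why scale-and-round is non-trivial for coefficients of a few hundred bits, here is a small Python sketch that computes $\lfloor \frac{t}{q} x \rceil$ exactly with big integers. The function names are hypothetical, and this multi-precision version is exactly the kind of arithmetic an RNS implementation tries to avoid; it only serves as a reference for the operation.

```python
def scale_and_round(x, t, q):
    """Return round(t * x / q) exactly, using integers only (ties round up).

    floor(t*x/q + 1/2) is rewritten as (2*t*x + q) // (2*q) to avoid floats,
    which cannot represent coefficients of a few hundred bits.
    """
    return (2 * t * x + q) // (2 * q)

def scale_and_round_poly(poly, t, q):
    """Apply scale-and-round to every coefficient of a polynomial."""
    return [scale_and_round(c, t, q) for c in poly]
```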
Polynomial Arithmetic
$k = \frac{\log_2 q}{\log_2 p}$
CRT: $q = \prod_{i=0}^{k-1} p_i$, where each $p_i$ is a prime
[Figure: RNS/CRT splits each $\log_2 q$-bit coefficient into $k$ residues modulo $p_0, p_1, \dots, p_{k-1}$.]
Addition/Subtraction: component-wise add/sub modulo $p_i$
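As a rough illustration of component-wise RNS arithmetic, the Python sketch below maps integers to their residues and adds or subtracts them residue by residue. The three moduli are small Mersenne primes picked purely for the example, and the function names are hypothetical.

```python
from math import prod

PRIMES = [8191, 131071, 524287]   # example moduli p_0, p_1, p_2 (Mersenne primes)
Q = prod(PRIMES)                  # q = p_0 * p_1 * p_2

def to_rns(a):
    """Map an integer a to its residues (a mod p_0, ..., a mod p_{k-1})."""
    return [a % p for p in PRIMES]

def rns_add(x, y):
    """Component-wise addition modulo each p_i."""
    return [(xi + yi) % p for xi, yi, p in zip(x, y, PRIMES)]

def rns_sub(x, y):
    """Component-wise subtraction modulo each p_i."""
    return [(xi - yi) % p for xi, yi, p in zip(x, y, PRIMES)]
```

Because every $p_i$ divides $q$, the residues of $(a + b) \bmod q$ equal the component-wise sums, so no carry ever crosses between residues and each one can be handled by an independent thread.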
Polynomial Arithmetic (cont.)
Addition/Subtraction/Multiplication: component-wise add/sub/mul modulo $p_i$
[Figure: RNS/CRT maps a polynomial to $k = \frac{\log_2 q}{\log_2 p}$ residue polynomials modulo $p_0, p_1, \dots, p_{k-1}$ with 32-bit coefficients; each residue polynomial is transformed with the NTT (DGT) and mapped back with $NTT^{-1}$.]
DFT, NTT, DWT, DGT…?
DFT
- Pros: well-established; several efficient libraries to use
- Cons: floating-point errors increase as $n$ and the $p_i$'s increase; reducing precision (smaller $p_i$'s) => longer RNS matrix => more DFTs
NTT
- Pros: exact
- Cons: transform length ($2n$)
DWT
- Pros: exact; transform length ($n$)
- Cons: only power-of-2 cyclotomics
DGT
- Pros: exact; transform length ($n/2$); 50% less interaction with memory
- Cons: only power-of-2 cyclotomics; Gaussian arithmetic (larger number of multiplications, ~30%-40%)
- We use DGT in our implementation
Efficient DGT/NTT/DWT on GPU?
$A = NTT(a)$ s.t. $A_i = \sum_{j=0}^{n-1} a_j \omega^{ij} \bmod p$
$a = NTT^{-1}(A)$ s.t. $a_i = n^{-1} \sum_{j=0}^{n-1} A_j \omega^{-ij} \bmod p$
- Better to store $\omega^{ij}$ in a lookup table (LUT).
  - LUT can be stored in GPU texture memory (which is limited on GPU)
  - DWT LUTs are $\mathcal{O}(n)$
  - DGT LUTs are $\mathcal{O}(n/2)$
- Compute in $GF(P)$ or in $GF(p^2)$?
  - We found it is better to do it in $GF(p^2)$.
  - Why? (see next)
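The forward and inverse transform definitions on this slide, together with the lookup-table idea, can be sketched directly in Python. This is the plain $\mathcal{O}(n^2)$ definition over $GF(p)$ with assumed toy values ($p = 17$, $n = 8$, root of unity 2), not the DGT over $GF(p^2)$ that the presentation adopts, and not a fast butterfly implementation. It needs Python 3.8+ for `pow` with a negative exponent.

```python
P_MOD, N, OMEGA = 17, 8, 2     # toy values: 2 has multiplicative order 8 modulo 17

# Lookup tables holding the powers of omega and of omega^-1
LUT = [pow(OMEGA, e, P_MOD) for e in range(N)]
LUT_INV = [pow(OMEGA, -e, P_MOD) for e in range(N)]

def ntt(a, lut=LUT):
    """A_i = sum_j a_j * omega^(i*j) mod p, reading omega^(i*j) from the table."""
    return [sum(a[j] * lut[(i * j) % N] for j in range(N)) % P_MOD
            for i in range(N)]

def intt(A):
    """a_i = n^-1 * sum_j A_j * omega^(-i*j) mod p."""
    n_inv = pow(N, -1, P_MOD)
    return [(n_inv * x) % P_MOD for x in ntt(A, LUT_INV)]
```

The exponent `i * j` is reduced modulo `N` because $\omega^n = 1$, which is why a table of only `N` entries is enough.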
Compute in $GF(P)$ or in $GF(p^2)$?
[Figure: RNS matrix of a polynomial with $k = \frac{\log_2 q}{\log_2 p}$ residue columns $p_0, p_1, \dots, p_{k-1}$.]
$GF(P)$
- $P$: 64-bit prime (should fit in one word)
- $p \le \sqrt{\frac{P}{2n}}$ (one multiplication)
  $n$:         $2^{12}$  $2^{13}$  $2^{14}$  $2^{15}$  $2^{16}$
  $\log_2 p$:  26        25        25        24        24
- Longer RNS matrix => more NTTs
- Size doubles (32-bit => 64-bit)
- Supports a limited number of operations in the NTT domain
$GF(p^2)$
- $p$: word-size prime (can be 64-bit)
- Shorter RNS matrix => fewer NTTs
- No size doubling
- Supports an unlimited number of operations in the NTT domain
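The table of RNS prime sizes for the $GF(P)$ option can be reproduced with a short Python check. Two assumptions are made here and are not stated in the slides: the bound is read as $p \le \sqrt{P/(2n)}$, and $P$ is taken to be the 64-bit prime $2^{64} - 2^{32} + 1$. The function name is hypothetical.

```python
from math import isqrt

BIG_P = 2 ** 64 - 2 ** 32 + 1   # assumed 64-bit NTT prime (not given in the slides)

def max_rns_prime_bits(log_n, big_p=BIG_P):
    """Bit length of the largest p with p <= sqrt(P / (2n)), where n = 2^log_n."""
    n = 1 << log_n
    return isqrt(big_p // (2 * n)).bit_length()
```

Under these assumptions the function returns 26, 25, 25, 24, 24 for $n = 2^{12}, \dots, 2^{16}$, matching the slide's table: the larger the ring degree, the smaller each RNS prime must be, and the longer the RNS matrix becomes.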
But, is NTT/DWT/DGT performance-critical?
[Figure: three pie charts (Toy Settings, Medium Settings, Large Settings), each split into NTT, RNS Base Extension, RNS Scaling, and Others.]
Breakdown of homomorphic multiplication (AND) in the BFV FHE scheme
Halevi, Shai, Yuriy Polyakov, and Victor Shoup. "An Improved RNS Variant of the BFV Homomorphic Encryption Scheme." (2018).
Computing CRT on GPU?
- $CRT(a, \{p_i\})$: $(a_0, \dots, a_{k-1})$, where $a_i = a \bmod p_i$
- $CRT^{-1}(a_0, \dots, a_{k-1}) = a$ s.t.
  $a = \sum_{i=0}^{k-1} \frac{q}{p_i} \cdot \left( \left( \frac{q}{p_i} \right)^{-1} a_i \pmod{p_i} \right) \pmod{q}$, where $q = \prod_{i=0}^{k-1} p_i$
- At least two methods:
  - Classic algorithm
  - Garner's algorithm

                     Classic         Garner's
  LUT                $k^2$           $k(k-1)/2$
  Thread divergence  Non-tractable   Nil

- Is CRT critical to performance?
  - No!
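The two reconstruction methods named on this slide can be compared with a minimal Python sketch. The function names are hypothetical, the inverses are recomputed on the fly instead of being read from a lookup table, and nothing here reflects how the presentation's CUDA kernels are organized. It needs Python 3.8+ for modular inverses via `pow`.

```python
from math import prod

def crt_classic(residues, primes):
    """a = sum_i (q/p_i) * (((q/p_i)^-1 * a_i) mod p_i)  (mod q)."""
    q = prod(primes)
    total = 0
    for a_i, p_i in zip(residues, primes):
        q_i = q // p_i
        total += q_i * ((pow(q_i, -1, p_i) * a_i) % p_i)
    return total % q

def crt_garner(residues, primes):
    """Garner's mixed-radix reconstruction; only needs p_j^-1 mod p_i for j < i."""
    digits = []
    for i, p_i in enumerate(primes):
        d = residues[i]
        for j in range(i):
            d = ((d - digits[j]) * pow(primes[j], -1, p_i)) % p_i
        digits.append(d)
    # a = d_0 + d_1*p_0 + d_2*p_0*p_1 + ...
    a, radix = 0, 1
    for d, p in zip(digits, primes):
        a += d * radix
        radix *= p
    return a
```

Garner's variant uses one inverse per pair $j < i$, that is $k(k-1)/2$ values, and never reduces modulo the large $q$; the classic formula needs the large values $q/p_i$ and a final reduction modulo $q$.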
RNS tools
- Useful to:
  - Remain in RNS representation
  - Avoid costly multi-precision arithmetic
- Two basic operations:
  - Scale-and-round
  - Base decomposition
- Adopted from the BEHZ scheme (BEHZ2016*)
- Are RNS tools critical to performance?
  - Extremely critical
* Bajard, Jean-Claude, et al. "A full RNS variant of FV like somewhat homomorphic encryption schemes." International Conference on Selected Areas in Cryptography. Springer, Cham, 2016.
FV_RNS Homomorphic Multiplication
Benchmarking Results
[Figure: four bar charts (Key Generation, Enc, Dec, HomoMul + Relinearization) comparing GPU-FV, SEAL, and NFLlib-FV; x-axis: parameter sets (11,62), (12,186), (13,372), (14,744); y-axis: Time (ms).]
Which FV RNS variant to Implement?
- Two RNS variants of FV
  - BEHZ
  - HPS
- Answer can be found in:
  - Al Badawi, Ahmad, et al. "Implementation and Performance Evaluation of RNS Variants of the BFV Homomorphic Encryption Scheme." IACR Cryptology ePrint Archive 2018 (2018): 589.