High-Performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation using CUDA
Ahmad Al Badawi, ahmad@u.nus.edu
National University of Singapore (NUS)
Sept 10th, 2018 – CHES 2018

FHE – The holy grail of Cryptography
• FHE enables computing on encrypted data without decryption [GB2009]
• Challenge: requires enormous computation
• Diagram: the client performs encryption/decryption and sends $\mathrm{Enc}(x)$ to the cloud server; the server performs the homomorphic evaluation of $f$ and returns $\mathrm{Enc}(f(x))$

How is the problem being tackled?
• Algorithmic methods:
– New FHE schemes
– Plaintext packing (1D, 2D, …)
– Encoding schemes
– Approximated computing
– Squashing the target function
– DAG optimizations for the target circuit
• Acceleration methods:
– Speed up FHE basic primitives (KeyGen, Enc, Dec, Add, Mul)
– Modular algorithms
– Parallel implementations
– Hardware implementations: GPUs, FPGAs and probably ASICs

Our Contributions
1. Implementation of FV RNS on GPUs
2. Introducing a set of CUDA optimizations
3. Benchmarking against state-of-the-art implementations

Why GPUs for FHE?
• GPU:
– Naturally available
– Many computing cores
– Developer friendly (compared with FPGAs and ASICs)
• FHE:
– Huge level of parallelism
• "If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?" Seymour Cray (1925-1996)

Textbook FV
• Basic mathematical structure is $R = \mathbb{Z}[x]/(x^n + 1)$
– Plaintext space: $R_t = \mathbb{Z}_t[x]/(x^n + 1)$
– Ciphertext space: $R_q = \mathbb{Z}_q[x]/(x^n + 1)$
• Public key: $pk_0, pk_1 \in R_q$
• Secret key: $sk \in R_q$
• $c = \mathrm{Enc}(m)$: $\left( \left[ \lfloor q/t \rfloor\, m + pk_0 u + e_0 \right]_q,\ \left[ pk_1 u + e_1 \right]_q \right)$
• $m = \mathrm{Dec}(c)$: $\left[ \left\lfloor \frac{t}{q} \left[ c_0 + c_1\, sk \right]_q \right\rceil \right]_t$
• $c_+ = \mathrm{Add}(c_0, c_1)$: $\left( [c_{00} + c_{10}]_q,\ [c_{01} + c_{11}]_q \right)$
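The Enc, Dec and Add definitions above can be exercised with a toy Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the slides: the parameters (`N = 16`, `Q = 2^20`, `T = 16`), the ternary noise sampler `small`, and the key generation (`pk_0 = -(a * sk + e)`, `pk_1 = a`, the usual FV construction). The parameters are far too small to be secure, and plain Python does not reflect the GPU implementation.

```python
import random

N, Q, T = 16, 1 << 20, 16   # toy degree, ciphertext modulus, plaintext modulus (assumed)
DELTA = Q // T              # scaling factor floor(q/t)

def poly_mul(a, b, mod):
    """Schoolbook product in Z_mod[x]/(x^N + 1): x^N wraps around to -1."""
    c = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < N:
                c[k] = (c[k] + ai * bj) % mod
            else:
                c[k - N] = (c[k - N] - ai * bj) % mod
    return c

def poly_add(a, b, mod):
    return [(x + y) % mod for x, y in zip(a, b)]

def small(rng):
    """Hypothetical noise sampler: coefficients drawn from {-1, 0, 1}."""
    return [rng.choice((-1, 0, 1)) for _ in range(N)]

def keygen(rng):
    sk = small(rng)
    a = [rng.randrange(Q) for _ in range(N)]
    e = small(rng)
    pk0 = [(-x) % Q for x in poly_add(poly_mul(a, sk, Q), e, Q)]
    return sk, (pk0, a)

def encrypt(pk, m, rng):
    u, e0, e1 = small(rng), small(rng), small(rng)
    scaled = [DELTA * x for x in m]
    c0 = poly_add(poly_add(scaled, poly_mul(pk[0], u, Q), Q), e0, Q)
    c1 = poly_add(poly_mul(pk[1], u, Q), e1, Q)
    return c0, c1

def decrypt(sk, c):
    x = poly_add(c[0], poly_mul(c[1], sk, Q), Q)
    centered = [v - Q if v > Q // 2 else v for v in x]       # representative in (-q/2, q/2]
    return [((2 * T * v + Q) // (2 * Q)) % T for v in centered]  # round(t*v/q) mod t

def add(ca, cb):
    return poly_add(ca[0], cb[0], Q), poly_add(ca[1], cb[1], Q)
```

With these sizes the noise term stays far below `DELTA / 2`, so decryption recovers the message exactly, and adding two ciphertexts adds the plaintexts modulo `T`.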

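Textbook FV (cont.)
• $c_\times = \mathrm{Mul}(c_0, c_1, evk)$:
1. Tensor product: $v_0 = \left[ \left\lfloor \frac{t}{q} c_{00} c_{10} \right\rceil \right]_q$, $v_1 = \left[ \left\lfloor \frac{t}{q} (c_{00} c_{11} + c_{01} c_{10}) \right\rceil \right]_q$, $v_2 = \left[ \left\lfloor \frac{t}{q} c_{01} c_{11} \right\rceil \right]_q$
2. Base decomposition: $v_2 = \sum_{i=0}^{l} v_2^{(i)} w^i$, $l = \lfloor \log_w q \rfloor$
3. Relinearization: $c_\times = \left( \left[ v_j + \sum_{i=0}^{l} evk_{ij} \cdot v_2^{(i)} \right]_q \right)$, $j \in \{0, 1\}$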
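The base decomposition step can be sketched in Python as follows. This is a minimal illustration that assumes coefficients are stored as integers in `[0, q)` and takes `l = floor(log_w q)`; the function names and the example base `w = 32` are hypothetical.

```python
def digit_count_minus_one(q, w):
    """Return l = floor(log_w q), using integer arithmetic only."""
    l = 0
    while w ** (l + 1) <= q:
        l += 1
    return l

def base_decompose(poly, w, q):
    """Split every coefficient in [0, q) into l + 1 base-w digits.

    Returns the digit polynomials v^(0), ..., v^(l), least significant first."""
    l = digit_count_minus_one(q, w)
    digits, rest = [], list(poly)
    for _ in range(l + 1):
        digits.append([c % w for c in rest])
        rest = [c // w for c in rest]
    return digits

def recompose(digits, w, q):
    """Inverse operation: sum_i v^(i) * w^i, reduced modulo q."""
    n = len(digits[0])
    return [sum(d[j] * w ** i for i, d in enumerate(digits)) % q for j in range(n)]
```

Every digit polynomial has coefficients below `w`, which is why relinearization multiplies the evaluation key by these digits rather than by the full-size `v_2`.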

Implementation Requirements
• Polynomial arithmetic in cyclotomic rings
• Large polynomial degree (a few thousand)
– Power-of-2 cyclotomic
– Addition/Subtraction: $O(n)$
– Multiplication: $O(n \log n)$
• Large coefficients $\in \mathbb{Z}_q$ (a few hundred bits)
– Modular algorithms (RNS)
• Extra non-trivial operations:
– Scaling-and-round $\left\lfloor \frac{t}{q} x \right\rceil$
– Base decomposition

Polynomial Arithmetic
• CRT: $q = \prod_{i=0}^{k-1} p_i$, where $p_i$ is a prime
• RNS/CRT matrix: $k$ rows (one per prime $p_0, p_1, \dots, p_{k-1}$) by $n$ columns (one per coefficient), with $k = \log_2 q / \log_2 p$; each entry is a $\log_2 p$-bit number
• Addition/Subtraction: component-wise add/sub modulo $p_i$
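A minimal Python sketch of the RNS matrix and its component-wise addition and subtraction follows. The three small primes in the usage note are illustrative only; real parameter sets use word-size primes and thousands of coefficients.

```python
from math import prod

def to_rns(coeffs, primes):
    """Build the k x n RNS matrix: row i holds every coefficient mod primes[i]."""
    return [[c % p for c in coeffs] for p in primes]

def rns_add(A, B, primes):
    """Component-wise addition, each row reduced by its own prime."""
    return [[(x + y) % p for x, y in zip(ra, rb)] for ra, rb, p in zip(A, B, primes)]

def rns_sub(A, B, primes):
    """Component-wise subtraction, each row reduced by its own prime."""
    return [[(x - y) % p for x, y in zip(ra, rb)] for ra, rb, p in zip(A, B, primes)]
```

For example, with `primes = [97, 101, 103]` and `q = prod(primes)`, adding two RNS matrices gives the same matrix as adding the coefficients modulo `q` first and then calling `to_rns`. No carries cross between rows, which is what makes the layout suit one GPU thread per matrix entry.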

Polynomial Arithmetic (cont.)
• RNS/CRT matrix: $k$ rows ($p_0, p_1, \dots, p_{k-1}$) by $n$ columns, with $k = \log_2 q / \log_2 p$
• NTT (DGT) maps each row to the transform domain, where each entry is a 32-bit NTT number; NTT$^{-1}$ maps it back
• Addition/Subtraction/Multiplication: component-wise add/sub/mul modulo $p_i$
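To illustrate why multiplication becomes component-wise after the transform, here is a small Python sketch for a single RNS row. It uses a weighted (negacyclic) NTT of length `N`, which is a stand-in chosen for brevity and not the DGT used in the talk. The toy prime `P = 257`, the degree `N = 8` and the naive O(N^2) transform are assumptions for illustration.

```python
P, N = 257, 8                          # toy prime with 2N dividing P - 1, toy degree
PSI = pow(3, (P - 1) // (2 * N), P)    # primitive 2N-th root of unity (3 generates Z_257^*)
OMEGA = PSI * PSI % P                  # primitive N-th root of unity

def forward(a):
    """Weight a_i by PSI^i, then apply a length-N NTT (naive form)."""
    s = [a[i] * pow(PSI, i, P) % P for i in range(N)]
    return [sum(s[i] * pow(OMEGA, i * j, P) for i in range(N)) % P for j in range(N)]

def inverse(A):
    """Inverse NTT, then remove the PSI^i weights."""
    n_inv = pow(N, -1, P)
    s = [n_inv * sum(A[j] * pow(OMEGA, -i * j, P) for j in range(N)) % P for i in range(N)]
    return [s[i] * pow(PSI, -i, P) % P for i in range(N)]

def negacyclic_mul(a, b):
    """Product in Z_P[x]/(x^N + 1) via component-wise multiplication."""
    A, B = forward(a), forward(b)
    return inverse([x * y % P for x, y in zip(A, B)])

def schoolbook(a, b):
    """Reference O(N^2) product in Z_P[x]/(x^N + 1)."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            k = i + j
            if k < N:
                c[k] = (c[k] + a[i] * b[j]) % P
            else:
                c[k - N] = (c[k - N] - a[i] * b[j]) % P
    return c
```

The weights `PSI^i` turn the cyclic convolution computed by the NTT into the negacyclic one required by the ring modulus `x^N + 1`. The modular inverses via `pow(x, -1, m)` need Python 3.8 or later.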

DFT, NTT, DWT, DGT…?
• DFT
– Pros: well-established; several efficient libraries to use
– Cons: floating-point errors increase as $n$ and the $p_i$'s increase; reduced precision (smaller $p_i$'s) => longer RNS matrix => more DFTs
• NTT
– Pros: exact
– Cons: transform length $2n$
• DWT
– Pros: exact; transform length $n$
– Cons: only power-of-2 cyclotomics
• DGT
– Pros: exact; transform length $n/2$; 50% less interaction with memory
– Cons: only power-of-2 cyclotomics; Gaussian arithmetic (larger number of multiplications, ~30% - 40%)
• We use DGT in our implementation

Efficient DGT/NTT/DWT on GPU?
• $A = \mathrm{NTT}(a)$ s.t. $A_j = \sum_{i=0}^{n-1} a_i w^{ij} \bmod q$
• $a = \mathrm{NTT}^{-1}(A)$ s.t. $a_i = n^{-1} \sum_{j=0}^{n-1} A_j w^{-ij} \bmod q$
• Better to store $w^{ij}$ in a lookup table (LUT).
– The LUT can be stored in GPU texture memory (which is limited on GPU)
– DWT LUTs are $O(n)$
– DGT LUTs are $O(n/2)$
• Compute in $GF(p)$ or in $GF(p_i)$?
– We found it is better to do it in $GF(p_i)$.
– Why? (see next)
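The transform pair above, with the powers of $w$ kept in a lookup table, can be sketched in Python as follows. The prime `P = 257`, the length `N = 8`, and the naive O(N^2) evaluation are assumptions for illustration; a GPU kernel would use a butterfly network and read the table from texture or constant memory.

```python
P, N = 257, 8                            # toy modulus and transform length (assumed)
W = pow(3, (P - 1) // N, P)              # primitive N-th root of unity modulo P
LUT = [pow(W, k, P) for k in range(N)]   # N entries: w^0, ..., w^(N-1)
N_INV = pow(N, -1, P)                    # n^{-1} modulo P

def ntt(a):
    """A_j = sum_i a_i * w^(i*j) mod P, with w^(i*j) read from the table."""
    return [sum(a[i] * LUT[(i * j) % N] for i in range(N)) % P for j in range(N)]

def intt(A):
    """a_i = n^{-1} * sum_j A_j * w^(-i*j) mod P, reusing the same table."""
    return [N_INV * sum(A[j] * LUT[(-i * j) % N] for j in range(N)) % P for i in range(N)]
```

Because $w^n = 1$, every exponent can be reduced modulo `N`, so a table of `N` entries serves both directions.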

Compute in $GF(p)$ or in $GF(p_i)$?
• RNS/CRT matrix: $k$ rows ($p_0, p_1, \dots, p_{k-1}$) by $n$ columns, with $k = \log_2 q / \log_2 p$
• $GF(p)$
– $p$: 64-bit prime (should fit in one word)
– $p_i \le \sqrt{p / (2n)}$ (one multiplication):
  $n$: $2^{12}$, $2^{13}$, $2^{14}$, $2^{15}$, $2^{16}$
  $\log_2 p_i$: 26, 25, 25, 24, 24
– Longer RNS matrix => more NTTs
– Size doubles (32-bit => 64-bit)
– Supports a limited number of operations in the NTT domain
• $GF(p_i)$
– $p_i$: word-size prime (can be 64-bit)
– Shorter RNS matrix => fewer NTTs
– No size doubling
– Supports an unlimited number of operations in the NTT domain
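The prime sizes listed for the $GF(p)$ option can be checked with a short Python computation. Two assumptions are made here that the slides do not state explicitly: the bound is read as $p_i \le \sqrt{p/(2n)}$, and $p$ is taken to be the 64-bit prime $2^{64} - 2^{32} + 1$ as an illustrative choice.

```python
from math import isqrt

P64 = 2**64 - 2**32 + 1   # assumed 64-bit NTT prime (illustrative choice)

def max_rns_prime_bits(n, p=P64):
    """Bit length of the largest integer p_i satisfying p_i <= sqrt(p / (2n))."""
    return isqrt(p // (2 * n)).bit_length()

# Map log2(n) to the maximum RNS prime size in bits.
table = {log_n: max_rns_prime_bits(1 << log_n) for log_n in range(12, 17)}
```

Under these assumptions `table` gives 26, 25, 25, 24 and 24 bits for $n = 2^{12}, \dots, 2^{16}$, matching the row above. Primes this small mean more RNS rows for the same $q$, which is the "longer RNS matrix" cost.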

But, is NTT/DWT/DGT performance-critical?
[Figure: breakdown of homomorphic multiplication (AND) in the BFV FHE scheme, one chart each for toy, medium and large settings; categories: NTT, RNS base extension, RNS scaling, others.]
Source: Halevi, Shai, Yuriy Polyakov, and Victor Shoup. "An Improved RNS Variant of the BFV Homomorphic Encryption Scheme." (2018).

Computing CRT on GPU?
• $\mathrm{CRT}(a, \{p_i\})$: $(a_0, \dots, a_{k-1})$ with $a_i = a \bmod p_i$
• $\mathrm{CRT}^{-1}(a_0, \dots, a_{k-1}) = a$ s.t. $a = \sum_{i=0}^{k-1} \frac{q}{p_i} \left( \left( \frac{q}{p_i} \right)^{-1} a_i \pmod{p_i} \right) \pmod{q}$, where $q = \prod_{i=0}^{k-1} p_i$
• At least two methods:
– Classic algorithm: LUT size $k^2$; thread divergence: non-tractable
– Garner's algorithm: LUT size $k(k-1)/2$; thread divergence: nil
• Is CRT critical to performance?
– No!
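Both reconstruction methods can be sketched in a few lines of Python. This is a reference model only: the function names are hypothetical, and the lookup table in `garner_lut` holds the $k(k-1)/2$ inverses $p_i^{-1} \bmod p_j$ for $i < j$, which is the usual formulation of Garner's algorithm rather than a detail confirmed by the slides.

```python
from math import prod

def crt_classic(residues, primes):
    """a = sum_i (q/p_i) * ((q/p_i)^{-1} * a_i mod p_i) mod q."""
    q = prod(primes)
    total = 0
    for a_i, p_i in zip(residues, primes):
        q_i = q // p_i
        total += q_i * ((pow(q_i, -1, p_i) * a_i) % p_i)
    return total % q

def garner_lut(primes):
    """Precompute p_i^{-1} mod p_j for all i < j: k(k-1)/2 entries."""
    k = len(primes)
    return {(i, j): pow(primes[i], -1, primes[j]) for j in range(k) for i in range(j)}

def crt_garner(residues, primes):
    """Garner's algorithm: compute mixed-radix digits, then evaluate them."""
    inv = garner_lut(primes)
    x = []
    for j, (a_j, p_j) in enumerate(zip(residues, primes)):
        d = a_j
        for i in range(j):
            d = (d - x[i]) * inv[(i, j)] % p_j   # only word-size arithmetic mod p_j
        x.append(d)
    result, radix = 0, 1
    for d, p in zip(x, primes):                  # a = x_0 + x_1*p_0 + x_2*p_0*p_1 + ...
        result += d * radix
        radix *= p
    return result
```

The classic form multiplies by the multi-precision constants $q/p_i$ and reduces modulo $q$, whereas Garner's digits need only arithmetic modulo each $p_j$ until the final evaluation.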

RNS tools
• Useful to:
– Remain in RNS representation
– Avoid costly multi-precision arithmetic
• Two basic operations:
– Scale-and-round
– Base decomposition
• Adopted from the BEHZ2016 scheme (*)
• Are RNS tools critical to performance?
– Extremely critical
* Bajard, Jean-Claude, et al. "A full RNS variant of FV like somewhat homomorphic encryption schemes." International Conference on Selected Areas in Cryptography. Springer, Cham, 2016.

FV_RNS Homomorphic Multiplication

Benchmarking Results
[Figure: four charts (Key Generation, Enc, Dec, HomoMul + Relinearization) comparing GPU-FV, SEAL and NFLlib-FV; y-axis: time (ms); x-axis: parameter settings (11,62), (12,186), (13,372), (14,744).]

Which FV RNS variant to implement?
• Two RNS variants of FV:
– BEHZ
– HPS
• The answer can be found in:
– Al Badawi, Ahmad, et al. "Implementation and Performance Evaluation of RNS Variants of the BFV Homomorphic Encryption Scheme." IACR Cryptology ePrint Archive 2018 (2018): 589.

Thank You
Questions?
