
Engineering Lattice-based Cryptography, by Sujoy Sinha Roy - PowerPoint PPT Presentation

Solving a system of linear equations: given a system of linear equations with unknown s, Gaussian elimination solves for s when the number of equations m is at least n.


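  1. Observation: during the NTT loop with m = 2, one butterfly takes three steps: (1) read {A[0], A[1]} [1 cycle]; (2) Butterfly(A[0], A[1]) [1 cycle]; (3) write {A[0], A[1]} [1 cycle]. (Figure: memory words {A[0], A[1]}, {A[2], A[3]}, {A[4], A[5]} feeding a butterfly unit with twiddle factor ω.)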

  2. Observation: the problem happens next, when m = 4. The butterflies are then Butterfly(A[0], A[2]), Butterfly(A[4], A[6]), …, so the two operands of a butterfly sit in different memory words ({A[0], A[1]} and {A[2], A[3]}).

  3. Solution: process 4 coefficients together. (Figure: memory words {A[0], A[1]}, {A[2], A[3]}, {A[4], A[5]} feeding the butterfly unit with twiddle factor ω.)

  4. Solution: process 4 coefficients together. Read the two words {A[0], A[1]} and {A[2], A[3]}. Cost so far: 2R.

  5. Solution: process 4 coefficients together. Compute two butterflies on A[0], A[1], A[2], A[3]. Cost so far: 2R + 2C.

  6. Solution: process 4 coefficients together. Write the results back regrouped as the words {A[0], A[2]} and {A[1], A[3]}. Total cost: 2R + 2C + 2W.
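The effect of slides 1 to 6 on memory traffic can be illustrated with a small counting model. This is only a sketch under assumptions that the slides do not spell out: every memory word holds two coefficients, one word read or write counts as one access, and only the first stage (m = 2) finds both butterfly operands in the same word. The function name `word_accesses` is hypothetical.

```python
def word_accesses(n: int, process_four_together: bool) -> int:
    """Count word reads plus word writes for an n-point in-place NTT.

    Assumed model: two coefficients per memory word, n a power of two.
    """
    stages = n.bit_length() - 1   # log2(n) butterfly stages
    butterflies = n // 2          # butterflies per stage
    total = 0
    for stage in range(stages):
        operands_share_word = (stage == 0)   # the m = 2 stage
        if operands_share_word or process_four_together:
            # 2 reads + 2 writes serve 2 butterflies: 2 accesses per butterfly
            total += 2 * butterflies
        else:
            # each butterfly reads 2 words and writes 2 words
            total += 4 * butterflies
    return total
```

In this model the later stages need half as many accesses once four coefficients are processed together, which is the kind of saving the "memory access reduction" optimization on the Results slides refers to.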

  7. Results. Before: O(n) + O(n) + O(n log n) + O(n)

  8. Results. Before: O(n) + O(n) + O(n log n) + O(n). Optimization 1: zero-cost prescaling. Now: O(n) + O(n) + O(n log n) + O(n)

  9. Results. Before: O(n) + O(n) + O(n log n) + O(n). Optimization 2: memory access reduction. Now: O(n) + O(n) + ½ O(n log n) + O(n)

  10. Architecture of NTT-based polynomial multiplier

  11. Instruction-set lattice-based public-key encryption processor
      Instruction set: 1. LOAD, 2. ENCODE-LOAD, 3. GAUSSIAN-LOAD, 4. FFT, 5. INV-FFT, 6. ADD, 7. CMULT, 8. REARRANGE, 9. READ
      Throughput: 50,000 encryptions/sec, 100,000 decryptions/sec
      Area: 1349 LUT, 860 FF, 1 DSPMULT, 2 BRAM18 (polynomial degree 256)
      Publication: CHES 2014

  12. Instruction-set ring-LWE cryptoprocessor
      CHES 2014: throughput 50,000 encryptions/sec and 100,000 decryptions/sec; area 1.3K LUT, 860 FF, 1 DSPMULT, 2 BRAM18
      CHES 2012: throughput below 40,000 encryptions/sec and below 80,000 decryptions/sec; area 18349 LUT, 5644 FF
      Figure label: ECC

  13. Ring-LWE encryption: follow-up works. Software implementations:
      • On 32-bit ARM: R. de Clercq, S. Sinha Roy, F. Vercauteren, I. Verbauwhede, "Efficient software implementation of ring-LWE encryption", DATE 2015. Encryption: 121,166 cycles. Decryption: 43,324 cycles. Orders of magnitude faster than ECC.
      • On 8-bit AVR: Z. Liu, H. Seo, S. Sinha Roy, J. Großschädl, H. Kim, I. Verbauwhede, "Efficient Ring-LWE Encryption on 8-Bit AVR Processors", CHES 2015. Encryption: 671,628 cycles. Decryption: 275,646 cycles.

  14. Ring-LWE encryption: follow-up works. Side-channel security, masking scheme: O. Reparaz, S. Sinha Roy, F. Vercauteren, I. Verbauwhede, "A masked ring-LWE implementation", CHES 2015.

  15. Hardware accelerators for Homomorphic Computation

  16. Homomorphic computation. Interesting applications:
      • Machine learning on encrypted data
      • Prediction from consumption data in smart electricity meters
      • Health-care applications
      • Encrypted web-search engine

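  17. Lattice-based Homomorphic encryption: 1. Encrypt, 2. Decrypt. Public keys p_0, p_1; private key s. (Figure: encryption turns data into the ciphertext (ct_0, ct_1) using p_0 and p_1; decryption turns (ct_0, ct_1) back into data using s.)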

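  18. Lattice-based Homomorphic encryption: 1. Encrypt locally, 2. Process on cloud, 3. Decrypt locally. Public keys p_0, p_1; private key s. (Figure: data is encrypted into (ct_0, ct_1) using p_0 and p_1; the cloud processes the ciphertext into (ct_0*, ct_1*), a step that is evaluated many times; decryption with s yields data*.)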

  19. Homomorphic Multiplication: (ct_{A,0}, ct_{A,1}) × (ct_{B,0}, ct_{B,1}) → (ct_{C,0}, ct_{C,1}). Uses multiple computation blocks:
      • Lift
      • Polynomial multiplication
      • Scale

  20. Homomorphic Multiplication. How complex? (ct_{A,0}, ct_{A,1}) × (ct_{B,0}, ct_{B,1}) → (ct_{C,0}, ct_{C,1}). Two challenges: coefficient size and polynomial length.
      • Low-complexity applications: polynomials have 4,000 coefficients; coefficients are ~180 bits wide
      • Medium-complexity applications: polynomials have 32,000 coefficients; coefficients are ~1200 bits wide
      • Lattice-based key exchange schemes: polynomials with 256 or 512 coefficients; coefficient size ~10 bits

  21. Hardware accelerators for Homomorphic Computation → Arithmetic of large coefficients

  22. Application of Residue Number System
      • We need to compute arithmetic modulo q
      • Let q = ∏ q_i where the q_i are pairwise coprime
      • Then we can work with the Residue Number System (RNS)
      (Figure: arithmetic mod q is replaced by RNS arithmetic, that is, independent arithmetic mod q_0, mod q_1, …, mod q_L; the Chinese Remainder Theorem (CRT) then recombines the residues into the result mod q.)
      Advantages: small coefficients; parallel computation

  23. Example: polynomial multiplication. Let q = q_0 ∙ q_1 where q_0 and q_1 are of equal bit-length.
      Input: a(x), b(x) mod q
      Overhead 1: splitting into residues
      Compute a(x) * b(x) mod q_0 and a(x) * b(x) mod q_1
      Overhead 2: Chinese Remainder Theorem reconstruction from residues
      Output: a(x) * b(x) mod q
      Advantages: parallel multiplications; smaller ALU width due to smaller coefficient size
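The flow on slide 23 (split into residues, multiply independently, reconstruct with the CRT) can be sketched in a few lines. This is a minimal illustration, not the hardware design from the talk: the helper names `poly_mul_mod`, `crt` and `rns_poly_mul` are hypothetical, the moduli in the usage note are toy values, and the product is a plain polynomial product without reduction modulo a ring polynomial.

```python
from math import prod

def poly_mul_mod(a, b, m):
    """Schoolbook product of two coefficient lists, coefficients reduced mod m."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] = (out[i + j] + ai * bj) % m
    return out

def crt(residues, moduli):
    """Combine residues r_i mod q_i (pairwise coprime q_i) into a value mod prod(q_i)."""
    q = prod(moduli)
    x = 0
    for r, qi in zip(residues, moduli):
        others = q // qi
        x += r * others * pow(others, -1, qi)   # modular inverse, Python 3.8+
    return x % q

def rns_poly_mul(a, b, moduli):
    """Multiply a(x) * b(x) mod q = prod(moduli) via one small product per modulus."""
    # Overhead 1: split into residues, then multiply independently per modulus.
    partial = [
        poly_mul_mod([c % qi for c in a], [c % qi for c in b], qi)
        for qi in moduli
    ]
    # Overhead 2: CRT reconstruction, coefficient by coefficient.
    return [crt(column, moduli) for column in zip(*partial)]
```

For example, with the toy moduli 97 and 101, `rns_poly_mul(a, b, [97, 101])` agrees with `poly_mul_mod(a, b, 97 * 101)`. Each per-modulus product only ever handles values below its own q_i, which is where the smaller ALU width comes from.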

  24. On Hardware: parallel processing using multiple Residue Polynomial Arithmetic Units (RPAUs), RPAU 0 … RPAU L, each with its own memory file and core. The number of RPAUs is a design parameter.

  25. Hardware accelerators for Homomorphic Computation → Arithmetic of large coefficients → Arithmetic of large polynomials

  26. Polynomial multiplication: multiple butterfly cores. Single-core NTT is too slow! Design challenges:
      • Long routing
      • Memory access conflicts
      (Figure: several BRAMs connected to the butterfly cores.)

  27. Memory access parallelism. (Figure: the coefficient memory is split into a lower BRAM, #0 to #1023, served by NTT Core 1, and an upper BRAM, #1024 to #2047, served by NTT Core 2; shown for the NTT stages m = 2, 4, 8, …, 2048, 4096.) COSIC - KU Leuven

  28. Block Level Pipelining (figure label: Lift)
      • Separate building blocks for a block-level pipeline
      • Realize a resource-shared architecture
      • Reduces the area requirement
      • Increases the computation time

  29. Execution Units
      • Two parallel cores (Core 0, Core 1) for Lift and Scale
      • Seven Residue Polynomial Arithmetic Units (RPAU 0 ... RPAU 6), each with a memory file and cores Core 0 and Core 1
      Parameters: ciphertext polynomial degree 4096; ciphertext coefficient size 180 bits

  30. Arm + FPGA Implementation. Zynq UltraScale+ MPSoC ZCU102. (Figure: four Arm cores, Arm 0 to Arm 3, with cache and memory controller; on the FPGA side, Coprocessor 0 and Coprocessor 1, each behind an AXI interface, plus a DMA.) Source code public on GitHub.

  31. Performance of High-Level Operations

      Operation                            Speed (cycles)   Speed (msec)
      Add in HW                                    31,339          0.026
      Multiply in HW                            5,349,567          4.458
      Send two ciphertexts to HW                  434,013          0.362
      Receive result ciphertext from HW           216,697          0.180

      Measurements are in cycles of the CPU clocked at 1200 MHz. The coprocessor is clocked at 200 MHz.
      400 homomorphic multiplications per sec (2 cores). Faster than Tesla K80 GPU.
      Publication: HPCA 2019

  32. Resource Utilization (number of used instances, with % utilization in parentheses)

                                           LUTs            REGs           BRAMs       DSPs
      Two Coprocessors & Interface         133,692 (49%)   60,312 (11%)   815 (89%)   416 (16%)
      A Single Coprocessor & Interface     63,522 (23%)    25,622 (5%)    388 (43%)   208 (8%)

  33. Conclusions so far
      • Ring-LWE is efficient in hardware and software
      • But, there are security concerns due to special structure

      [ a0  -a3  -a2  -a1 ]   [ s0 ]   [ e0 ]   [ b0 ]
      [ a1   a0  -a3  -a2 ] * [ s1 ] + [ e1 ] ≈ [ b1 ]   (mod q)
      [ a2   a1   a0  -a3 ]   [ s2 ]   [ e2 ]   [ b2 ]
      [ a3   a2   a1   a0 ]   [ s3 ]   [ e3 ]   [ b3 ]

      Special structure in matrix
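The special structure in the matrix on slide 33 is that of multiplication by a(x) modulo x^n + 1: each column is the previous one rotated down, with the wrapped entries negated. The sketch below builds that matrix for a small example and checks it against a direct polynomial product. The function names and the toy values are illustrative assumptions, not taken from the talk.

```python
def negacyclic_matrix(a):
    """Matrix whose product with s gives the coefficients of a(x)*s(x) mod x^n + 1."""
    n = len(a)
    # Entry (i, j) is a[i-j] on and below the diagonal, and -a[n+i-j] above it.
    return [[a[i - j] if i >= j else -a[n + i - j] for j in range(n)]
            for i in range(n)]

def mat_vec_mod(matrix, vec, q):
    """Matrix-vector product with every entry reduced mod q."""
    return [sum(m * v for m, v in zip(row, vec)) % q for row in matrix]

def negacyclic_mul(a, s, q):
    """Direct product a(x)*s(x) mod (x^n + 1, q), using x^n = -1 for wrap-around."""
    n = len(a)
    out = [0] * n
    for i, ai in enumerate(a):
        for j, sj in enumerate(s):
            if i + j < n:
                out[i + j] += ai * sj
            else:
                out[i + j - n] -= ai * sj
    return [c % q for c in out]
```

With `a = [1, 2, 3, 4]`, the first row of `negacyclic_matrix(a)` is `[1, -4, -3, -2]`, matching the pattern a0, -a3, -a2, -a1 on the slide. A single polynomial therefore determines the whole n-by-n matrix, which is both the source of the efficiency and the structure the security concern refers to.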

  34. Interpolating LWE and ring-LWE: Module LWE

      [ a0  -a3  -a2  -a1   a8  -a11 -a10 -a9  ]   [ s0 ]   [ e0 ]   [ b0 ]
      [ a1   a0  -a3  -a2   a9   a8  -a11 -a10 ]   [ s1 ]   [ e1 ]   [ b1 ]
      [ a2   a1   a0  -a3   a10  a9   a8  -a11 ]   [ s2 ]   [ e2 ]   [ b2 ]
      [ a3   a2   a1   a0   a11  a10  a9   a8  ] * [ s3 ] + [ e3 ] ≈ [ b3 ]
      [ a4  -a7  -a6  -a5   a12 -a15 -a14 -a13 ]   [ s4 ]   [ e4 ]   [ b4 ]
      [ a5   a4  -a7  -a6   a13  a12 -a15 -a14 ]   [ s5 ]   [ e5 ]   [ b5 ]
      [ a6   a5   a4  -a7   a14  a13  a12 -a15 ]   [ s6 ]   [ e6 ]   [ b6 ]
      [ a7   a6   a5   a4   a15  a14  a13  a12 ]   [ s7 ]   [ e7 ]   [ b7 ]

      Equivalently, in polynomial form:

      [ a_{0,0}(x)  a_{0,1}(x) ] * [ s_0(x) ] + [ e_0(x) ] ≈ [ b_0(x) ]   (mod q) (mod x^4 + 1)
      [ a_{1,0}(x)  a_{1,1}(x) ]   [ s_1(x) ]   [ e_1(x) ]   [ b_1(x) ]

  35. Saber: Module-LWR based key exchange, CPA-secure encryption and CCA-secure KEM. A lattice-based candidate for NIST standardization; moved to the second round! Jointly designed by the EE and Math teams!

  36. SABER: flexibility and efficiency
      • Saber uses the module-LWR problem
      • Polynomials always have 256 coefficients [efficient polynomial arithmetic]
      • Flexibility: the matrix dimension is parameterizable
        ➢ 2-by-2 for 115-bit post-quantum security: LightSABER
        ➢ 3-by-3 for 180-bit post-quantum security: SABER
        ➢ 4-by-4 for 245-bit post-quantum security: FireSABER

  37. SABER: Parameter set

              [ a_{0,0}(x)    ...  a_{0,k-1}(x)   ]   [ s_0(x)     ]   [ b_0(x)     ]
      (p/q) * [ ...                ...            ] * [ ...        ] ≈ [ ...        ]   (mod x^256 + 1)
              [ a_{k-1,0}(x)  ...  a_{k-1,k-1}(x) ]   [ s_{k-1}(x) ]   [ b_{k-1}(x) ]

      • Polynomials of fixed size: 256 coefficients
      • Flexible dimension k = 2, 3 or 4
      • How to choose p and q?

  38. Learning with rounding (LWR). A problem with rounding, where p < q: for inputs uniform in [0, q-1], a prime q introduces rounding bias.
      - Cannot use prime q
      - Hence, no NTT-based fast polynomial multiplication
      + No modular reduction
      + Easy rounding
      → We need to use a generic polynomial multiplication algorithm
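The rounding-bias point on slide 38 can be checked numerically. The sketch below rounds every x in [0, q-1] from modulus q down to modulus p and counts how many inputs land on each output. The values q = 2^13, p = 2^10 and the prime 7681 are illustrative choices made here, not parameters quoted on the slides, and the function names are hypothetical.

```python
from collections import Counter

def round_generic(x, q, p):
    """Round (p/q) * x to the nearest integer (halves round up), reduced mod p."""
    return ((2 * x * p + q) // (2 * q)) % p

def round_pow2(x, log_q, log_p):
    """Same rounding when q = 2**log_q and p = 2**log_p (log_p < log_q): add, mask, shift."""
    shift = log_q - log_p
    q_mask = (1 << log_q) - 1
    return ((x + (1 << (shift - 1))) & q_mask) >> shift

def bucket_sizes(q, p):
    """How many inputs x in [0, q-1] round to each output value."""
    return Counter(round_generic(x, q, p) for x in range(q))
```

With q = 8192 and p = 1024, every output value receives the same number of inputs (8), so a uniform input stays uniform after rounding. With the prime q = 7681 the bucket sizes are not all equal, which is the bias. For powers of two the rounding also needs no division and no modular reduction beyond a mask and a shift.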

  39. Next best polynomial multiplication algorithms
      • Karatsuba multiplication: O(n^(log2 3))
      (Figure: the 256-coefficient operands A(x) and B(x) are each split into two halves of 128 coefficients.)
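Karatsuba's split into halves, as pictured on slide 39, can be sketched as follows. This is a textbook-style illustration rather than the implementation used in the talk; it assumes both operands have the same power-of-two length, and the names `schoolbook`, `karatsuba` and `threshold` are chosen here for the example.

```python
def schoolbook(a, b):
    """Plain O(n^2) polynomial product of two coefficient lists."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def karatsuba(a, b, threshold=16):
    """Polynomial product using three half-size products per level.

    Assumes len(a) == len(b) and that the length is a power of two.
    """
    n = len(a)
    if n <= threshold:
        return schoolbook(a, b)
    h = n // 2
    a0, a1 = a[:h], a[h:]
    b0, b1 = b[:h], b[h:]
    low = karatsuba(a0, b0, threshold)     # A0 * B0
    high = karatsuba(a1, b1, threshold)    # A1 * B1
    mid = karatsuba([x + y for x, y in zip(a0, a1)],
                    [x + y for x, y in zip(b0, b1)], threshold)
    out = [0] * (2 * n - 1)
    for i, c in enumerate(low):
        out[i] += c
        out[i + h] -= c        # subtract A0*B0 from the middle term
    for i, c in enumerate(high):
        out[i + 2 * h] += c
        out[i + h] -= c        # subtract A1*B1 from the middle term
    for i, c in enumerate(mid):
        out[i + h] += c        # (A0 + A1) * (B0 + B1)
    return out
```

Each level replaces four half-size products with three, which gives the O(n^(log2 3)) bound quoted on the slide.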

  40. Next best polynomial multiplication algorithms
      • Toom-Cook multiplication
      (Figure: the 256-coefficient operands A(x) and B(x) are each split into four parts of 64 coefficients.)
      Toom-Cook 4 Way needs 7 multiplications. Karatsuba would need 9 multiplications.

  41. Toom-Cook 4 Way: step-by-step: splitting. Each 256-coefficient operand, A(x) and B(x), is split into 4 polynomials of 64 coefficients. Take y = x^64:
      A(y) = A_3 y^3 + A_2 y^2 + A_1 y + A_0
      B(y) = B_3 y^3 + B_2 y^2 + B_1 y + B_0

  42. Toom-Cook 4 Way: step-by-step: evaluation Linear operations + Seven multiplications are computed

  43. Toom-Cook 4 Way: step-by-step: interpolation. Linear operations. This number has a role to play.
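The three steps on slides 41 to 43 (splitting, evaluation, interpolation) can be put together in a compact sketch. Several things here are assumptions made for the illustration and are not taken from the slides: the evaluation points 0, 1, -1, 2, -2, 3 and infinity, the use of exact fractions for interpolation, and all function names. Operand lengths are assumed equal and divisible by 4.

```python
from fractions import Fraction

POINTS = [0, 1, -1, 2, -2, 3]   # assumed choice; infinity is handled separately

def schoolbook(a, b):
    """Plain O(n^2) polynomial product of two coefficient lists."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def evaluate(parts, t):
    """Evaluate A(y) = sum(parts[k] * y^k) at y = t, coefficient-wise."""
    return [sum(p[i] * t ** k for k, p in enumerate(parts))
            for i in range(len(parts[0]))]

def interpolation_matrix():
    """Exact inverse of the 7x7 evaluation matrix (Gauss-Jordan over fractions)."""
    n = 7
    rows = [[Fraction(t) ** k for k in range(n)] for t in POINTS]
    rows.append([Fraction(0)] * 6 + [Fraction(1)])   # infinity: leading coefficient
    aug = [row + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(rows)]
    for col in range(n):
        piv = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        pivot = aug[col][col]
        aug[col] = [v / pivot for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def toom4(a, b):
    """Product of two equal-length polynomials using 7 quarter-size products."""
    n = len(a)
    h = n // 4
    a_parts = [a[k * h:(k + 1) * h] for k in range(4)]   # splitting, y = x^h
    b_parts = [b[k * h:(k + 1) * h] for k in range(4)]
    # Evaluation: linear operations, then seven multiplications.
    products = [schoolbook(evaluate(a_parts, t), evaluate(b_parts, t))
                for t in POINTS]
    products.append(schoolbook(a_parts[3], b_parts[3]))  # point at infinity
    # Interpolation: linear operations recover the 7 coefficients of C(y).
    inv = interpolation_matrix()
    out = [0] * (2 * n - 1)
    for k in range(7):
        for i in range(2 * h - 1):
            c = sum(inv[k][j] * products[j][i] for j in range(7))
            out[k * h + i] += int(c)   # c is an exact integer for integer inputs
    return out
```

The exact fractions make the divisions in the interpolation step trivially correct here. How those divisions behave when coefficients live modulo a power of two is a separate question that this sketch does not address.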

  44. Advanced Vector Extensions (AVX) Vectorized instructions for 16-bit operands

  45. DSP instructions ARM Cortex-M4 • Popular 32-bit microcontroller • Has DSP instructions for half-word operations

  46. AVX + microcontroller with DSP. Keep coefficients at most 16 bits wide to use:
      ➢ _epi16( ) AVX intrinsics in high-end platforms
      ➢ DSP instructions in low-end microcontrollers
      Options for q: 2^16, 2^15, 2^14, 2^13, … etc.
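Why a power-of-two q of at most 16 bits fits 16-bit lanes can be seen from a short model of wrap-around arithmetic. The sketch below emulates 16-bit lanes in plain Python by masking, in the way an intrinsic such as `_mm256_mullo_epi16` keeps only the low 16 bits of each product. It is an illustration of the arithmetic only, not vectorized code, and the function names are hypothetical.

```python
MASK16 = 0xFFFF

def mul16(a, b):
    """Low 16 bits of a product, as a 16-bit lane would keep them."""
    return (a * b) & MASK16

def inner_product_mod_pow2(xs, ys, log_q):
    """Inner product of xs and ys modulo 2**log_q (log_q <= 16).

    Only 16-bit wrap-around arithmetic is used; a single mask at the end
    performs the reduction mod q.
    """
    acc = 0
    for x, y in zip(xs, ys):
        acc = (acc + mul16(x, y)) & MASK16
    return acc & ((1 << log_q) - 1)
```

Because 2^log_q divides 2^16, discarding the high bits at every step never changes the result modulo q. For example, `inner_product_mod_pow2([8191, 4097, 1234], [8000, 77, 8191], 13)` equals the exact inner product reduced modulo 8192.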
