Engineering Lattice-based Cryptography, Sujoy Sinha Roy (PowerPoint presentation)
Solving system of linear equations
- System of linear equations A·s = b with unknown vector s
- Gaussian elimination recovers s when the number of equations m ≥ n
System of linear equations with errors
- Search Learning with Errors (LWE) problem: given (A, b = A·s + e mod q), it is computationally infeasible to recover (s, e)
- Decisional LWE problem: given (A, b), the pair is hard to distinguish from uniformly random
Learning with rounding (LWR)
- LWR replaces the explicit error term of LWE by deterministic rounding: b = round((p/q)·A·s) with p < q
- Search and decisional variants are defined analogously to LWE
LWE Diffie-Hellman key-exchange
Public: a uniformly random matrix A
- Alice: secret s, e; publishes b = A·s + e
- Bob: secret s', e'; publishes b' = Aᵀ·s' + e'
- Noisy shared secret: s'ᵀ·b ≈ sᵀ·b', since both sides compute approximately s'ᵀ·A·s up to the small error terms
Building blocks:
- Error sampling
- Matrix-vector multiplication
Sampling from a known discrete distribution
Knuth-Yao random walk
- Example: let the sample space be S = {0, 1, 2} with probabilities written in binary:
  - p0 = 0.01110
  - p1 = 0.01101
  - p2 = 0.00101
- Probability matrix Pmat: column j holds the j-th bits of p0, p1, p2
Knuth-Yao random walk: example
The binary (DDG) tree corresponding to Pmat is traversed one random bit per level; internal nodes continue the walk, terminal nodes return a sample value.
On-the-fly Probability Tree Generation
Counter-based algorithm
- Construction of the i-th level during sampling: a counter d tracks the distance from the rightmost internal node
- When d < 0 for the first time, the visited node is a terminal node
- We need counters for d and for the row number
Tree Traversal: Simple Algorithm
Hardware datapath: a ROW counter, a COLUMN counter, and a DISTANCE counter scan the stored probability bits (e.g. 0 1 1 1 0 0 1 1 0 1 0 0 1 0 1), driven by RandomBit(), with an "is value < 0?" check detecting terminal nodes.
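The counter-based walk above can be sketched in Python (an illustrative model, not the hardware implementation; `pmat` holds the binary probability expansions from the earlier example):

```python
import random

def knuth_yao_sample(pmat, rand_bit=lambda: random.getrandbits(1)):
    """Counter-based Knuth-Yao random walk.

    pmat[row][col] is the col-th bit after the binary point of the
    probability of outcome `row`. The distance counter d replaces an
    explicit DDG tree: each random bit descends one level, and scanning
    a column from the bottom row upwards subtracts off that level's
    terminal nodes; the first time d goes below 0, the walk has landed
    on a terminal node and `row` is the sample.
    """
    d = 0
    for col in range(len(pmat[0])):
        d = 2 * d + rand_bit()                  # descend one tree level
        for row in range(len(pmat) - 1, -1, -1):
            d -= pmat[row][col]                 # pass the terminal nodes
            if d < 0:
                return row
    raise ValueError("probability bits exhausted (expansion too short)")
```

Feeding all 32 possible 5-bit strings reproduces the example distribution p0 = 14/32, p1 = 13/32, p2 = 5/32 exactly.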
Memory optimization: Probability Matrix
Probability Matrix: Column-wise Optimization
- Observation: the difference in length between most consecutive columns is 1
- One-step differences in column length can be stored as one bit per column: 1 for increment, 0 for no increment
- Easy to implement in both HW and SW
Results: hardware uses about 1% of the smallest FPGA at ~1.5 cycles per sample (avg.); software uses 0.7 KB of memory at ~28 cycles per sample (avg.)
Variation in execution time
The EM field intensity over time reveals data-dependent timing.
- Ideal world: hard to recover s from b = A·s + e
- With a timing leak → easy to recover s from b = A·s + e
Constant-time Discrete Gaussian sampling
Knuth-Yao random walk: as a mapping
Random string → sample value, e.g. 001 → 0, 010 → 1, 011 → 2, …: random bits r0, r1, …, rn-1 map to sample bits s0, …, sm-1.
Mapping → Boolean equations
s0 = f0(r0, r1, …, rn-1)
…
sm-1 = fm-1(r0, r1, …, rn-1)
These need to be evaluated in constant time.
Constant-time evaluation
E.g. s0 = r0 ∧ (r1 ∨ r2): a short-circuit evaluation outputs 0 as soon as r0 = 0 → non-constant time.
Bit-slicing gives
- constant-time execution, and
- performance.
On a 64-bit computer architecture, with 64-bit random integers R0, R1, and R2, S0 = R0 ∧ (R1 ∨ R2) computes 64 output bits in parallel, in constant time.
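A minimal Python sketch of the bit-sliced evaluation (Python integers stand in for 64-bit registers; the mask emulates the word width):

```python
MASK64 = (1 << 64) - 1  # emulate a 64-bit register

def bitsliced_s0(R0, R1, R2):
    """Evaluate s0 = r0 AND (r1 OR r2) for 64 independent samples at once.

    Bit i of each register holds bit r0/r1/r2 of sample i. Plain bitwise
    operators process every lane unconditionally, so the running time is
    independent of the secret bit values: no short-circuiting, no branches.
    """
    return R0 & (R1 | R2) & MASK64
```

Lane 0 of `bitsliced_s0(0b10, 0b11, 0b00)` is 0 because its r0 bit is 0; lane 1 is 1.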
Constant-time Gaussian Sampling
With convolution + Boolean decomposition + bit-slicing, the EM field intensity over time no longer depends on the secret.
Follow-up → the constant-time Falcon signature becomes 33% slower.
Building blocks:
- Error sampling
- Matrix-vector multiplication: computing A·s has complexity O(n²)
Performance of FRODO KEM
- FRODO KEM uses matrix dimension
  - n = 640 for 100-bit quantum security
  - n = 976 for 150-bit quantum security
- FRODO KEM (n = 640) on ARM Cortex-M4, in cycles: key gen 81 M, encapsulation 86 M, decapsulation 87 M
- Roughly 3.3 s at 24 MHz
- Slow due to expensive matrix-vector multiplications
Standard LWE
b ≈ A·s + e (mod q), where A is a uniformly random n×n matrix (entries a0,0 … a3,3 for n = 4), s = (s0, …, s3), e = (e0, …, e3), b = (b0, …, b3).
Special LWE
The matrix is derived from a single uniformly random vector (a0, a1, a2, a3) as a negacyclic matrix:

    a0 -a3 -a2 -a1
    a1  a0 -a3 -a2
    a2  a1  a0 -a3
    a3  a2  a1  a0

so that b ≈ A·s + e (mod q) as before.

Special LWE is known as Ring-LWE, with
a(x) = a0 + a1·x + a2·x² + a3·x³
s(x) = s0 + s1·x + s2·x² + s3·x³
e(x) = e0 + e1·x + e2·x² + e3·x³
b(x) = b0 + b1·x + b2·x² + b3·x³
and a(x)·s(x) + e(x) ≈ b(x) (mod q) (mod x⁴ + 1).
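The equivalence between the negacyclic matrix product and polynomial multiplication mod x⁴ + 1 can be checked with a small schoolbook routine (an illustrative sketch, not an optimized implementation):

```python
def negacyclic_mul(a, s, q):
    """Schoolbook product a(x)*s(x) mod (x^n + 1, q).

    Coefficient wrap-around picks up a minus sign because x^n ≡ -1,
    which matches the -a3, -a2, -a1 pattern of the negacyclic matrix.
    """
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                c[k] = (c[k] + a[i] * s[j]) % q
            else:                       # x^(i+j) ≡ -x^(i+j-n)
                c[k - n] = (c[k - n] - a[i] * s[j]) % q
    return c
```

Multiplying by x, for instance, rotates the coefficients and negates the one that wraps around.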
Efficient Polynomial multiplication
Polynomial multiplication: methods
- Schoolbook multiplication: complexity O(n²)
- Karatsuba multiplication: complexity O(n^1.585)
- FFT-based multiplication: complexity O(n log n)
FFT = Fast Fourier Transform
Polynomial multiplication: FFT-based
1. Zero-pad a(x) = an-1·x^(n-1) + … + a1·x + a0 to length 2n and apply a 2n-point FFT → (A0, A1, …, A2n-1).
2. Likewise for b(x) = bn-1·x^(n-1) + … + b1·x + b0 → (B0, B1, …, B2n-1).
3. Multiply point-wise, Ci = Ai × Bi, and apply a 2n-point inverse FFT to (C0, …, C2n-1) → coefficients c0, c1, …, c2n-1 of c(x) = a(x)·b(x).
FFT to NTT
- FFT involves real numbers
- The Number Theoretic Transform (NTT)
  - is a generalization of the FFT
  - uses only integer arithmetic modulo q
- Requirements: n = 2^k and a prime q with q ≡ 1 mod n
- With the 'negative-wrapped convolution' (which requires q ≡ 1 mod 2n), an n-point NTT suffices instead of a 2n-point NTT
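A compact Python model of the negative-wrapped convolution (a sketch: n = 4, q = 17 ≡ 1 mod 2n, and ψ = 2, a primitive 2n-th root of unity mod 17, are toy parameters for illustration, not a deployed parameter set):

```python
def ntt_negacyclic_mul(a, b, q, psi):
    """c = a*b mod (x^n + 1, q) via an n-point NTT.

    Assumes n = 2^k and a prime q with q ≡ 1 (mod 2n), so a primitive
    2n-th root of unity psi exists; omega = psi^2 is then a primitive
    n-th root. Pre-scaling by powers of psi turns the negacyclic product
    into a cyclic one (negative-wrapped convolution), so no zero-padding
    to length 2n is needed.
    """
    n = len(a)
    omega = psi * psi % q

    def ntt(f, w):                      # recursive radix-2 transform
        if len(f) == 1:
            return f
        even = ntt(f[0::2], w * w % q)
        odd = ntt(f[1::2], w * w % q)
        h = len(f) // 2
        out = [0] * len(f)
        t = 1
        for i in range(h):
            out[i] = (even[i] + t * odd[i]) % q
            out[i + h] = (even[i] - t * odd[i]) % q
            t = t * w % q
        return out

    a = [x * pow(psi, i, q) % q for i, x in enumerate(a)]
    b = [x * pow(psi, i, q) % q for i, x in enumerate(b)]
    C = [x * y % q for x, y in zip(ntt(a, omega), ntt(b, omega))]
    c = ntt(C, pow(omega, q - 2, q))    # inverse transform uses omega^-1
    n_inv = pow(n, q - 2, q)
    psi_inv = pow(psi, q - 2, q)
    return [x * n_inv * pow(psi_inv, i, q) % q for i, x in enumerate(c)]
```

The result agrees with the negacyclic schoolbook product; multiplying by x, for example, rotates with a sign flip.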
Special optimization: scaling overhead
Pipeline: a(x), b(x) → O(n) pre-scaling to a'(ω2n·x), b'(ω2n·x) → n-NTT → point-wise multiplication → n-NTT⁻¹ → O(n) post-scaling of c'(ω2n·x) → c(x).
Proposed optimization: zero-cost prescaling (CHES 2014) removes the O(n) pre-scaling step.
Memory
Coefficients A[0], A[1], A[2], …, A[n-1] live in memory. Simplified NTT loops:

for(m=2; m<=n; m=2*m) {
    for(j=0; j<=m/2-1; j++) {
        for(k=0; k<n; k=k+m) {
            index = f(m, j, k);
            Butterfly(A[index], A[index+m/2]);
        }
    }
}

- The NTT starts with m=2: Butterfly(A[0], A[1]), Butterfly(A[2], A[3]), …, Butterfly(A[n-2], A[n-1])
- Next, m increments to m=4: Butterfly(A[0], A[2]), Butterfly(A[4], A[6]), …, then Butterfly(A[1], A[3]), Butterfly(A[5], A[7]), …
Memory access problem
Memory is sequential. Each butterfly reads A[indx] and A[indx+m/2], computes A[indx] ± ω·A[indx+m/2] in the butterfly ALU, and writes both results back:
1R + 1R + 1C + 1W + 1W (two reads, one compute, two writes) per butterfly.
- If we have small coefficients, we can pack two coefficients in a single memory word: {A[0], A[1]}, {A[2], A[3]}, {A[4], A[5]}, …
- Observation, during NTT loop m = 2:
  1. Read {A[0], A[1]} [1 cycle]
  2. Butterfly(A[0], A[1]) [1 cycle]
  3. Write {A[0], A[1]} [1 cycle]
- The problem happens next, when m = 4: Butterfly(A[0], A[2]), Butterfly(A[4], A[6]), … pair coefficients that sit in different words.
- Solution: process 4 coefficients together
  1. Read {A[0], A[1]} and {A[2], A[3]} [2R]
  2. Compute Butterfly(A[0], A[2]) and Butterfly(A[1], A[3]) [2C]
  3. Write the results back re-paired as {A[0], A[2]} and {A[1], A[3]} [2W]
- Cost: 2R + 2C + 2W for two butterflies, halving the number of memory accesses
Results
- Before: O(n) + O(n) + O(n log n) + O(n)
- Optimization 1 (zero-cost prescaling) removes one O(n) scaling term
- Optimization 2 (memory access reduction) halves the NTT term to ½·O(n log n)
Architecture of NTT-based polynomial multiplier
Lattice-based public-key encryption processor with a custom instruction set:
1. LOAD
2. ENCODE-LOAD
3. GAUSSIAN-LOAD
4. FFT
5. INV-FFT
6. ADD
7. CMULT
8. REARRANGE
9. READ

Throughput: 50,000 encryptions/sec, 100,000 decryptions/sec
Area: 1,349 LUTs, 860 FFs, 1 DSP multiplier, 2 BRAM18s (polynomial degree 256)
Publication: CHES 2014
Instruction-set ring-LWE cryptoprocessor vs. ECC
- Ring-LWE (CHES 2014): 50,000 encryptions/sec, 100,000 decryptions/sec; area 1.3K LUTs, 860 FFs, 1 DSP multiplier, 2 BRAM18s
- ECC (CHES 2012): under 40,000 encryptions/sec and under 80,000 decryptions/sec; area 18,349 LUTs, 5,644 FFs
Ring-LWE encryption: follow-up works

Software implementations:
- On 32-bit ARM: R. de Clercq, S. Sinha Roy, F. Vercauteren, I. Verbauwhede, "Efficient software implementation of ring-LWE encryption", DATE 2015: encryption 121,166 cycles, decryption 43,324 cycles
- On 8-bit AVR: Z. Liu, H. Seo, S. Sinha Roy, J. Großschädl, H. Kim, I. Verbauwhede, "Efficient Ring-LWE Encryption on 8-Bit AVR Processors", CHES 2015: encryption 671,628 cycles, decryption 275,646 cycles

Orders of magnitude faster than ECC.
Side-channel security: masking scheme
- O. Reparaz, S. Sinha Roy, F. Vercauteren, I. Verbauwhede, "A masked ring-LWE implementation", CHES 2015
Hardware accelerators for Homomorphic Computation
Homomorphic computation
Interesting applications:
- Machine learning on encrypted data
- Prediction from consumption data in smart electricity meters
- Health-care applications
- Encrypted web-search engine
Lattice-based homomorphic encryption
1. Encrypt locally with the public keys: data → ciphertext (ct1, ct2)
2. Process on the cloud: (ct1, ct2) → (ct1*, ct2*); this homomorphic evaluation is performed many times
3. Decrypt locally with the private key: (ct1*, ct2*) → processed data
Homomorphic multiplication
Multiplying ciphertexts (ctB,1, ctB,2) and (ctC,1, ctC,2) into (ctD,1, ctD,2) uses multiple computation blocks:
- Lift
- Polynomial multiplication
- Scale

Homomorphic multiplication: how complex?
- Lattice-based key-exchange schemes: polynomials with 256 or 512 coefficients, coefficient size ~10 bits
- Low-complexity homomorphic applications: polynomials with 4,000 coefficients, ~180-bit coefficients
- Medium-complexity homomorphic applications: polynomials with 32,000 coefficients, ~1,200-bit coefficients
Two challenges
- Coefficient size
- Polynomial length
Hardware accelerators for Homomorphic Computation → arithmetic of large coefficients
Application of Residue Number System
- We need to compute arithmetic modulo q
- Let q = ∏ qi where the qi are coprime
- Then we can work in the Residue Number System (RNS): arithmetic mod q splits into parallel arithmetic mod q0, q1, …, qL, and the Chinese Remainder Theorem (CRT) recombines the residues into the result mod q
- Small coefficients
- Parallel computation
Example: polynomial multiplication
Let q = q0·q1 where q0 and q1 have equal bit-length.
Input: a(x), b(x) mod q. Compute a(x)·b(x) mod q0 and a(x)·b(x) mod q1 in parallel, then recombine with the Chinese Remainder Theorem into the output a(x)·b(x) mod q.
Advantages:
- parallel multiplications
- smaller ALU width due to the smaller coefficient size
Overhead 1: splitting into residues. Overhead 2: reconstruction from residues.
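The two overheads can be sketched per coefficient in Python (q0 = 13 and q1 = 17 are hypothetical toy moduli; `pow(x, -1, m)` needs Python 3.8+):

```python
def to_rns(x, moduli):
    """Overhead 1: split x mod q into residues mod each coprime q_i."""
    return [x % qi for qi in moduli]

def from_rns(residues, moduli):
    """Overhead 2: CRT reconstruction of the unique x mod q = prod(q_i)."""
    q = 1
    for qi in moduli:
        q *= qi
    x = 0
    for r, qi in zip(residues, moduli):
        mi = q // qi                      # product of the other moduli
        x += r * mi * pow(mi, -1, qi)     # ≡ r mod qi, ≡ 0 mod the others
    return x % q
```

In between, each multiplication runs independently in its small ring: multiply the residue vectors component-wise, then reconstruct once at the end.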
In hardware: parallel processing using multiple Residue Polynomial Arithmetic Units (RPAU 0 … RPAU L), each with its own core and memory file. The number of RPAUs is a design parameter.
Hardware accelerators for Homomorphic Computation → arithmetic of large coefficients → arithmetic of large polynomials
Polynomial multiplication: multiple butterfly cores
A single-core NTT is too slow. Design challenges:
- long routing
- memory access conflicts
Memory access parallelism: the coefficients are split across BRAMs into a lower half (#0 … #1023) and an upper half (#1024 … #2047) served by NTT Core 1 and NTT Core 2; in the early stages (m = 2, 4, 8, …) the butterflies stay within a half, while in the final stages (m = 2048, m = 4096) they span the lower and upper halves.
Block-level pipelining
- Separate building blocks (Lift, polynomial arithmetic, Scale) form a block-level pipeline
- Realizes a resource-shared architecture
- Reduces the area requirement
- Increases the computation time

Execution units
- Two parallel cores for Lift and Scale
- Seven Residue Polynomial Arithmetic Units (RPAU 0 … RPAU 6), each with two cores and a memory file
Parameters:
- ciphertext polynomial degree 4096
- ciphertext coefficient size 180 bits
Arm + FPGA Implementation
Platform: Zynq UltraScale+ MPSoC ZCU102
- Processing system: four Arm cores with cache and memory controller
- FPGA: Coprocessor 0 and Coprocessor 1, attached via AXI interfaces and DMA
Source code is public on GitHub.
Performance of High-Level Operations
Operation | Speed (cycles) | Time (ms)
Add in HW | 31,339 | 0.026
Multiply in HW | 5,349,567 | 4.458
Send two ciphertexts to HW | 434,013 | 0.362
Receive result ciphertext from HW | 216,697 | 0.180

Measurements are in cycles of the CPU clocked at 1200 MHz; the coprocessor is clocked at 200 MHz.
Publication: HPCA 2019. 400 homomorphic multiplications per second (2 cores), faster than a Tesla K80 GPU.
Resource Utilization
- Two coprocessors & interface: 133,692 LUTs (49%), 60,312 REGs (11%), 815 BRAMs (89%), 416 DSPs (16%)
- A single coprocessor & interface: 63,522 LUTs (23%), 25,622 REGs (5%), 388 BRAMs (43%), 208 DSPs (8%)
Conclusions so far
- Ring-LWE is efficient in hardware and software
- But there are security concerns due to the special structure: the matrix in b ≈ A·s + e (mod q) is built negacyclically from a single random vector
Interpolating LWE and Ring-LWE: Module-LWE
The 8×8 matrix consists of 2×2 blocks, each a 4×4 negacyclic matrix built from a0…a3, a4…a7, a8…a11, and a12…a15, acting on s = (s0, …, s7) with error e = (e0, …, e7) and output b = (b0, …, b7). Equivalently, with polynomials:

    a0,0(x) a0,1(x)     s0(x)     e0(x)     b0(x)
    a1,0(x) a1,1(x)  *  s1(x)  +  e1(x)  ≈  b1(x)

(mod q) (mod x⁴ + 1)
Saber: Module-LWR-based key exchange, CPA-secure encryption, and CCA-secure KEM
A lattice-based candidate for NIST standardization, moved to the second round! Jointly designed by the EE and Math teams!
SABER: flexibility and efficiency
- Saber uses the module-LWR problem
- Polynomials always have 256 coefficients [efficient polynomial arithmetic]
- Flexibility: the matrix dimension is parameterizable
  - 2-by-2 for 115-bit post-quantum security (LightSaber)
  - 3-by-3 for 180-bit post-quantum security (Saber)
  - 4-by-4 for 245-bit post-quantum security (FireSaber)
SABER: Parameter set
The k×k matrix of polynomials (a0,0(x) … ak-1,k-1(x)) times the vector (s0(x), …, sk-1(x)) is rounded from mod q down to mod p to give (b0(x), …, bk-1(x)) (mod x²⁵⁶ + 1).
- Polynomials of fixed size, 256 coefficients
- Flexible dimension k = 2, 3, or 4
- How to choose p and q?
Learning with rounding (LWR), where p < q
A problem with rounding: for A·s uniform in [0, q-1], a prime q introduces a rounding bias.
- So Saber cannot use a prime q, and hence cannot use NTT-based fast polynomial multiplication
- A power-of-two q gives no modular reduction and easy rounding
→ We need to use a generic polynomial multiplication algorithm
Next-best polynomial multiplication algorithms
- Karatsuba multiplication, O(n^log₂3): split the 256-coefficient operands A(x) and B(x) into two 128-coefficient halves each
- Toom-Cook multiplication: split the 256-coefficient operands A(x) and B(x) into four 64-coefficient parts each
- Toom-Cook 4-way needs 7 multiplications where Karatsuba would need 9
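For reference, one recursive level of Karatsuba in Python (a generic sketch for equal power-of-two lengths; reduction mod q is omitted for clarity):

```python
def karatsuba(a, b):
    """Multiply polynomials a and b (equal power-of-two length n).

    Three half-size products (low, high, middle) replace the four of
    the naive split, giving the O(n^log2(3)) recursion.
    """
    n = len(a)
    if n == 1:
        return [a[0] * b[0], 0]            # padded to length 2n for uniformity
    h = n // 2
    p0 = karatsuba(a[:h], b[:h])           # low * low
    p2 = karatsuba(a[h:], b[h:])           # high * high
    pm = karatsuba([x + y for x, y in zip(a[:h], a[h:])],
                   [x + y for x, y in zip(b[:h], b[h:])])
    c = [0] * (2 * n)
    for i in range(n):
        c[i] += p0[i]
        c[i + n] += p2[i]
        c[i + h] += pm[i] - p0[i] - p2[i]  # middle term (a0+a1)(b0+b1)-p0-p2
    return c
```

E.g. (1 + 2x)(3 + 4x) = 3 + 10x + 8x², using three scalar multiplications instead of four.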
Toom-Cook 4-way, step by step

Splitting: split each 256-coefficient operand into 4 polynomials. Take y = x⁶⁴:
A(y) = A3·y³ + A2·y² + A1·y + A0
B(y) = B3·y³ + B2·y² + B1·y + B0

Evaluation: linear operations; then the seven multiplications are computed.

Interpolation: linear operations, including a division by the constant 24. This number has a role to play.
Choosing q for fast arithmetic:
- Advanced Vector Extensions (AVX): vectorized instructions for 16-bit operands
- ARM Cortex-M4: a popular 32-bit microcontroller with DSP instructions for half-word operations
Keep coefficients at or below 16 bits to use
- _epi16() AVX intrinsics on high-end platforms
- DSP instructions on low-end microcontrollers
Options for q: 2¹⁶, 2¹⁵, 2¹⁴, 2¹³, etc.
Division by 24 in Toom-Cook interpolation: w5 = (w5 - 8·w3)/24
- 24 = 8 · 3
- We are working in Rq where q = 2^i
- 3 has an inverse mod q, e.g. 3⁻¹ mod 2¹⁵ = 10923, so division by 3 is the same as multiplying by 3⁻¹ mod q
- But 8 does not have an inverse mod q
- Only option: do an actual division by 8
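The two-part division can be sketched in Python (assuming q = 2¹³ as in Saber; exact divisibility by 8 holds for the genuine Toom-Cook intermediates and is asserted here):

```python
Q = 1 << 13                  # Saber's q = 2^13
INV3 = pow(3, -1, Q)         # 3 is odd, hence invertible mod 2^13 (Python 3.8+)

def div24(x):
    """Divide by 24 = 8 * 3 in Z_q with q = 2^13.

    The factor 3 is handled by modular inversion; 8 = 2^3 has no inverse
    modulo a power of two, so it must be an exact integer shift, which
    Toom-Cook interpolation guarantees for this intermediate value.
    """
    assert x % 8 == 0, "Toom-Cook intermediates are divisible by 8"
    return (x >> 3) * INV3 % Q
```

The shift is the "actual division" by 8; only the odd factor is folded into modular arithmetic.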
Working with q = 2¹⁵ on a 16-bit computer:
- Example, integer division by 8 = 2³: 15+3 bits of the intermediate value are useful, which does not fit in a 16-bit word
- Difficult to implement: requires careful arithmetic over two words
- Slower arithmetic

Working with q = 2¹³ on a 16-bit computer:
- Example, integer division by 8 = 2³: 13+3 bits are useful, which fits in a 16-bit word
- Easy to implement
- Less complicated arithmetic

Saber parameters:
- Polynomial length n = 256
- q = 2¹³
- p = 2¹⁰
Polynomial multiplication in Saber: a hybrid of
- Toom-Cook 4-way: 256-coefficient operands split into four 64-coefficient parts
- Karatsuba: 64-coefficient parts split down to 16-coefficient pieces
- Schoolbook multiplications of the 16-coefficient pieces (using AVX/DSP instructions)
Schoolbook multiplication using AVX
Consider one polynomial multiplication:
(… + a3·x³ + a2·x² + a1·x + a0) · (… + b3·x³ + b2·x² + b1·x + b0)
= a0b0 + (a0b1 + a1b0)·x + (a0b2 + a1b1 + a2b0)·x² + …

Now consider 16 such polynomial multiplications a_j(x)·b_j(x), j = 0 … 15, on a 16× vectorized processor. Pack coefficient i of all 16 polynomials into one vector:
A0 = (a0_0, a0_1, a0_2, …, a0_15), B0 = (b0_0, b0_1, b0_2, …, b0_15), …
Then AVX_MUL(A0, B0) = (a0_0·b0_0, a0_1·b0_1, …, a0_15·b0_15) computes all 16 products in parallel (this assumes all coefficients are available).
Polynomial multiplication (continued)
After the Toom-Cook and Karatsuba layers, the 16-coefficient polynomials are laid out one after another: (a15, …, a1, a0), (a15_1, …, a1_1, a0_1), (a15_2, …, a0_2), … and likewise for b. We can't multiply them in this layout using AVX: the coefficient subscripts within each vector need to be the same.
Buckets for 16-coefficient polynomials
- Bucket for A: each row contains one 16-coefficient polynomial; there are 16 rows, so 256 coefficients per bucket. Likewise a bucket for B.
- Multiplications are not performed immediately: the buckets are filled first, row by row, with (a0(x), b0(x)), (a1(x), b1(x)), …, (a15(x), b15(x)).
- Transpose each bucket, so that a row of the transposed bucket holds coefficient i of all 16 polynomials, e.g. (a0, a0_1, a0_2, …, a0_15), and likewise for B
- Transposing with AVX is an interesting optimization problem in itself
- Now multiply the vectors using AVX instructions
- This yields C, a vector structure of length 32, containing 32×16 coefficients
- Finally, transpose C to get the 16 result polynomials
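A scalar Python model of the transposed-bucket multiplication (each row of `A`/`B` plays the role of one vector register; the innermost loop is what a single AVX multiply-accumulate does in one step):

```python
def batched_schoolbook(A, B):
    """Multiply many polynomial pairs lane-wise from transposed buckets.

    A[i][lane] is coefficient i of polynomial `lane`; a row therefore
    models one vector register holding the same-index coefficient of
    every polynomial, so each (i, k) pair costs one vector mul-acc.
    """
    n, lanes = len(A), len(A[0])
    C = [[0] * lanes for _ in range(2 * n - 1)]
    for i in range(n):
        for k in range(len(B)):
            for lane in range(lanes):    # one SIMD instruction in hardware
                C[i + k][lane] += A[i][lane] * B[k][lane]
    return C
```

With two lanes, lane 0 computing (1 + 2x)(5 + 6x) and lane 1 computing 3 · 7, both products come out of the same pass.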
Cortex-M4: STM32F4-Discovery board by STMicroelectronics
- DSP instructions
- 14 registers fully available
- 192 KB of RAM
Schoolbook multiplication using DSP instructions
Multiplication between half-words:
SMLA(B/T)(B/T) ra, rb, rc, rd := ra ← rb[0/1] * rc[0/1] + rd
A 4-coefficient schoolbook block takes 16 such instructions.

Schoolbook multiplication using the SMLADX instruction
DSP instruction for cross multiplication of half-words:
SMLADX ra, rb, rc, rd := ra ← rb[0] * rc[1] + rb[1] * rc[0] + rd
- Each SMLADX replaces two SMLA instructions: instruction count 2 → 1
- Total instruction count 12, a 25% reduction
- Pack non-adjacent coefficients in a spare register using PKHBT and apply SMLADX again
- Total instruction count 11; for the 16×16 schoolbook multiplication, a 37.5% reduction overall
SABER: High-level optimization
The k×k matrix of polynomials times the secret vector is rounded from mod q to mod p (mod x²⁵⁶ + 1); the matrices and vectors are generated by expanding a random seed using an XOF.
Matrix and vector generation
- The public matrix A and the secret vectors s and s' require a large number of random bytes (3,744 bytes for the entries a00 … a22 of A)
- Option 1: use a single SHAKE-128 call on the random seed and generate the randomness sequentially
- Option 2: use multiple SHAKE-128 instances (seed1, seed2, …) and generate the randomness in parallel; this is faster, and Kyber uses 4× parallel processing
Saber uses option 1:
- Simpler
- Friendly to microcontrollers
- In hardware, a single Keccak core already consumes roughly 50% of the area of an LPR encryption scheme
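Option 1 maps directly onto Python's hashlib (an illustrative sketch; the seed value and the byte layout of A are placeholders, not Saber's actual wire format):

```python
import hashlib

def expand_seed(seed: bytes, nbytes: int) -> bytes:
    """Single-SHAKE sequential expansion (option 1): one XOF call
    yields the whole pseudorandom byte stream for the matrix A."""
    return hashlib.shake_128(seed).digest(nbytes)

buf = expand_seed(b"example-seed", 3744)   # 3,744 bytes, parsed into a00 ... a22
```

Because SHAKE-128 is an XOF, a longer output stream extends a shorter one, which is what makes sequential squeezing possible.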
Matrix and vector generation: memory efficient
- 'Just in time' approach (e.g. on a Cortex-M0 with 8 KB RAM): Keccak-absorb() the random seed once, then Keccak-squeeze() the entries a00, a01, … of A 280 bytes at a time, multiplying each polynomial with s as it is produced
- Has a small book-keeping overhead
Matrix and vector multiplication
- The order of A differs between KeyGen and Enc: with A ← XOF(seedA), KeyGen samples s ← XOF(seeds) and computes A*s, while encryption samples s' ← XOF(seeds') and computes AT*s'
- A is generated 'one polynomial at a time'; row-major generation has the smaller memory requirement
- Enc is costlier than KeyGen → use row-major for Enc and column-major for KeyGen; the round-2 specification of Saber does this
Saber: results
- Secret key size: 1,344 bytes
- Public key size: 992 bytes
- Ciphertext size: 1,088 bytes

Performance on Intel Haswell (cycles), AVX2 implementation: key generation 104 K, encapsulation 122 K, decapsulation 120 K

Performance on ARM Cortex-M (CHES 2018, cycles):
- Cortex-M4 (DSP): key generation 1.1 M, encapsulation 1.5 M, decapsulation 1.6 M
- Cortex-M0: key generation 4.7 M, encapsulation 6.3 M, decapsulation 7.5 M
- Stack usage: 8 KB and 6 KB respectively
Work in progress: hardware implementation of Saber
Thank you