Montgomery Multiplication Using Vector Instructions
Joppe W. Bos, Peter L. Montgomery, Daniel Shumow, and Gregory M. Zaverucha
SAC 2013

Motivation
• E.g. ECDSA, ECDH: point arithmetic on $E(\mathbb{F}_p)$
• E.g. DH, DSA, RSA: arithmetic in $\mathbb{F}_p$ or $\mathbb{Z}/N\mathbb{Z}$
• Underlying operation: Montgomery multiplication
• ECC often uses primes of a special form: NIST curves, curve25519
• Useful for pairings

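Modular Multiplication
Compute $D = B \times C \pmod{N}$:
• $S = B \times C$
• Write $S = r \times N + D$ such that $0 \le D < N$
Cost: one multiplication + one division with remainder
Montgomery (Math. Comp. 1985) observed that we can avoid the expensive division when $N$ is odd:
$\frac{B}{2} \bmod N = \begin{cases} \frac{B}{2} & \text{if } B \text{ is even} \\ \frac{B+N}{2} & \text{if } B \text{ is odd} \end{cases}$
The same idea works one 32-bit word at a time. Precompute $\nu = -N^{-1} \bmod 2^{32}$; then
$B + N \times \left(B \times (-N^{-1}) \bmod 2^{32}\right) \equiv 0 \pmod{2^{32}}$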
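The division-free step above can be checked numerically. The following is a minimal Python sketch, not code from the talk: the function names `halve_mod` and `mont_reduce_word` are hypothetical, the word size of 32 bits follows the slide, and Python 3.8+ is assumed for `pow(N, -1, m)`.

```python
W = 32                 # word size used on the slide
R = 1 << W             # 2^32

def halve_mod(B, N):
    """B/2 mod N for odd N and 0 <= B < N, using only an add and a shift."""
    return B // 2 if B % 2 == 0 else (B + N) // 2

def mont_reduce_word(B, N):
    """Return a value congruent to B * 2^-32 (mod N) without dividing by N."""
    nu = (-pow(N, -1, R)) % R      # nu = -N^-1 mod 2^32, precomputed per modulus
    r = (B * nu) % R               # depends only on the low 32 bits of B
    t = B + r * N                  # the low 32 bits of t are zero by construction
    assert t % R == 0
    return t >> W                  # exact division by 2^32 is a shift

if __name__ == "__main__":
    N = (1 << 64) - 59             # any odd modulus works for this check
    B = 0x123456789ABCDEF0123
    t = mont_reduce_word(B, N)
    print((t << W) % N == B % N)   # True: t is B * 2^-32 modulo N
```

The returned value is only congruent to $B \cdot 2^{-32}$; it is not necessarily fully reduced below $N$.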

Interleaved Montgomery Multiplication
Input: $B = \sum_{j=0}^{o-1} b_j 2^{32j}$, $C$, $N$, $\nu = -N^{-1} \bmod 2^{32}$
Output: $D = BC2^{-32o} \bmod N$
  $D = 0$
  for $j = 0$ to $o - 1$ do
    $D = D + b_j C$    [(1 × o) limbs]
    $r = \nu D \bmod 2^{32}$    [(1 × 1) limb]
    $D = (D + rN)/2^{32}$    [(1 × o) limbs]
  if $D \ge N$ then $D = D - N$
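The loop above translates almost line by line into Python, whose built-in big integers stand in for the multi-limb values. This is an illustrative sketch under stated assumptions: the function name `montmul_interleaved` is hypothetical, inputs satisfy $0 \le B, C < N < 2^{32o}$ with $N$ odd, and Python 3.8+ is assumed for the modular inverse.

```python
W = 32
MASK = (1 << W) - 1

def montmul_interleaved(B, C, N, o):
    """Return B * C * 2^(-32*o) mod N for odd N and 0 <= B, C < N < 2^(32*o)."""
    nu = (-pow(N, -1, 1 << W)) & MASK    # nu = -N^-1 mod 2^32
    D = 0
    for j in range(o):
        b_j = (B >> (W * j)) & MASK      # j-th 32-bit limb of B
        D += b_j * C                     # (1 x o) limb multiplication
        r = (nu * (D & MASK)) & MASK     # (1 x 1) limb multiplication
        D = (D + r * N) >> W             # (1 x o) limb multiplication, exact shift
    if D >= N:                           # D < 2N holds, so one subtraction suffices
        D -= N
    return D

if __name__ == "__main__":
    o = 4
    N = (1 << 127) - 1
    B = 0x0123456789ABCDEF0123456789ABCDEF
    C = 0x0FEDCBA9876543210FEDCBA987654321
    expected = B * C * pow(1 << (W * o), -1, N) % N
    print(montmul_interleaved(B, C, N, o) == expected)   # True
```

Note that the second and third steps of each iteration depend on the first, which is the data dependency the next slide removes.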

Interleaved Montgomery Multiplication
Input: $B = \sum_{j=0}^{o-1} b_j 2^{32j}$, $C$, $N$, $\nu = -N^{-1} \bmod 2^{32}$
Output: $D = BC2^{-32o} \bmod N$
  $D = 0$
  for $j = 0$ to $o - 1$ do
    $r = (d_0 + b_j c_0)\nu \bmod 2^{32}$    [2 × (1 × 1) limb; replaces $D = D + b_j C$ followed by $r = \nu D \bmod 2^{32}$]
    $D = (D + b_j C + rN)/2^{32}$    [2 × (1 × o) limbs; replaces $D = (D + rN)/2^{32}$]
  if $D \ge N$ then $D = D - N$
Here $d_0$ and $c_0$ are the least significant limbs of $D$ and $C$.
At the cost of one extra (1 × 1) limb multiplication, the two (1 × o) limb multiplications become independent.
Idea: flip the sign of $\nu$, i.e. use $\nu = +N^{-1} \bmod 2^{32}$, so that $rN$ is subtracted from $D + b_j C$ instead of added.
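A minimal Python sketch of the restructured loop follows, keeping $\nu = -N^{-1} \bmod 2^{32}$ and the addition of $rN$. The function name `montmul_independent` and the temporaries `prod_bc` and `prod_rn` are hypothetical; the same input assumptions apply as for the basic algorithm ($N$ odd, $0 \le B, C < N < 2^{32o}$).

```python
W = 32
MASK = (1 << W) - 1

def montmul_independent(B, C, N, o):
    """Interleaved Montgomery multiplication with r computed from low limbs only."""
    nu = (-pow(N, -1, 1 << W)) & MASK        # nu = -N^-1 mod 2^32
    c0 = C & MASK                            # least significant limb of C
    D = 0
    for j in range(o):
        b_j = (B >> (W * j)) & MASK
        d0 = D & MASK                        # least significant limb of D
        r = ((d0 + b_j * c0) * nu) & MASK    # two (1 x 1) limb multiplications
        prod_bc = b_j * C                    # (1 x o) limbs, does not need r ...
        prod_rn = r * N                      # ... and this one does not need prod_bc
        D = (D + prod_bc + prod_rn) >> W     # low 32 bits cancel, shift is exact
    if D >= N:
        D -= N
    return D

if __name__ == "__main__":
    o = 4
    N = (1 << 127) - 1
    B = 0x0123456789ABCDEF0123456789ABCDEF
    C = 0x0FEDCBA9876543210FEDCBA987654321
    expected = B * C * pow(1 << (W * o), -1, N) % N
    print(montmul_independent(B, C, N, o) == expected)   # True
```

Because `prod_bc` and `prod_rn` have no dependency on each other, they are the natural candidates for the two lanes of a 2-way vector multiplier.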

2-way SIMD Interleaved Montgomery Multiplication
$D = \sum_j e_j 2^{32j} - \sum_j f_j 2^{32j}$
Non-SIMD part:
$r = (\nu c_0 b_k + \nu(e_0 - f_0)) \bmod 2^{32} = (\nu c_0 b_k + \nu d_0) \bmod 2^{32} = (d_0 + b_k c_0)\nu \bmod 2^{32}$
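The difference representation can be emulated with two Python integers, one per SIMD lane. This is a sketch of the idea rather than the authors' vectorized code: the names `montmul_2way`, `E` and `F` are assumptions chosen to match the limbs $e_j$ and $f_j$, and $\nu = +N^{-1} \bmod 2^{32}$ is used so that the lane holding $rN$ is subtracted at the end.

```python
W = 32
MASK = (1 << W) - 1

def montmul_2way(B, C, N, o):
    """Emulate the two lanes: E accumulates b_k*C, F accumulates r*N, D = E - F."""
    nu = pow(N, -1, 1 << W)                  # sign flipped: nu = +N^-1 mod 2^32
    c0 = C & MASK
    E = F = 0
    for k in range(o):
        b_k = (B >> (W * k)) & MASK
        e0, f0 = E & MASK, F & MASK
        # Non-SIMD part: r = (nu*c0*b_k + nu*(e0 - f0)) mod 2^32
        r = (nu * c0 * b_k + nu * (e0 - f0)) & MASK
        # SIMD part: both updates are independent and have equal low 32 bits,
        # so the two truncating shifts together divide E - F exactly by 2^32.
        E = (E + b_k * C) >> W               # lane 0
        F = (F + r * N) >> W                 # lane 1
    D = E - F                                # lies strictly between -N and N
    if D < 0:
        D += N
    return D

if __name__ == "__main__":
    o = 4
    N = (1 << 127) - 1
    B = 0x0123456789ABCDEF0123456789ABCDEF
    C = 0x0FEDCBA9876543210FEDCBA987654321
    expected = B * C * pow(1 << (W * o), -1, N) % N
    print(montmul_2way(B, C, N, o) == expected)   # True
```

Keeping both accumulators nonnegative avoids signed arithmetic inside the loop; the sign is resolved once, in the final correction.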

Expected Performance Speedup
| Method | Long muls | Short muls |
| --- | --- | --- |
| Sequential Montgomery multiplication | $2o^2$ | $o$ |
| 2-way SIMD Montgomery multiplication | $o^2$ | $2o$ |
Based on the number of multiplications only, we expect:
• 32-bit 2-way SIMD to be at most 2x as fast as 32-bit sequential
• 32-bit 2-way SIMD to be approximately 2x as slow as 64-bit sequential
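To make the two expectations concrete, the counts can be evaluated for a 2048-bit modulus. The sketch below only plugs numbers into the formulas from the table; the function name `mul_counts` is hypothetical, and the 64-bit sequential figure assumes the same $2o^2$ formula with half as many limbs.

```python
def mul_counts(bits, word):
    """Return (long, short) multiplication counts per Montgomery multiplication."""
    o = bits // word                       # number of limbs
    return {
        "sequential": (2 * o * o, o),      # 2*o^2 long, o short
        "simd_2way": (o * o, 2 * o),       # o^2 long (2-way instructions), 2*o short
    }

if __name__ == "__main__":
    seq32 = mul_counts(2048, 32)["sequential"]
    simd32 = mul_counts(2048, 32)["simd_2way"]
    seq64 = mul_counts(2048, 64)["sequential"]
    print(seq32, simd32, seq64)            # (8192, 64) (4096, 128) (2048, 32)
    print(seq32[0] / simd32[0])            # 2.0: SIMD at most 2x faster than 32-bit sequential
    print(simd32[0] / seq64[0])            # 2.0: SIMD about 2x slower than 64-bit sequential
```

These ratios count multiplications only and ignore additions, carries, and data movement between lanes, which is why the slide states them as bounds and approximations.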

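Performance Results – x86
| Platform | RSA-2048 | Classic | SIMD | Ratio |
| --- | --- | --- | --- | --- |
| Intel Xeon E3-1230 (3.2 GHz), PC | enc | 181,412 | 414,787 | 0.44 |
| Intel Xeon E3-1230 (3.2 GHz), PC | dec | 4,928,633 | 12,211,700 | 0.40 |
| Intel Atom Z2760 (1.8 GHz), tablet | enc | 2,583,643 | 1,601,878 | 1.61 |
| Intel Atom Z2760 (1.8 GHz), tablet | dec | 80,204,317 | 52,000,367 | 1.54 |
Ratio is Classic divided by SIMD, so a value above 1 means the SIMD version is faster.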

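Performance Results - ARM
| Platform | RSA-2048 | Classic | SIMD | Ratio |
| --- | --- | --- | --- | --- |
| Dell XPS 10 tablet, Snapdragon S4 (1.8 GHz) | enc | 1,087,318 | 710,910 | 1.53 |
| Dell XPS 10 tablet, Snapdragon S4 (1.8 GHz) | dec | 34,769,147 | 21,478,047 | 1.62 |
| NVIDIA Tegra 4 (1.9 GHz), dev board, Cortex-A15 | enc | 725,336 | 712,542 | 1.02 |
| NVIDIA Tegra 4 (1.9 GHz), dev board, Cortex-A15 | dec | 23,177,617 | 22,812,040 | 1.02 |
| NVIDIA Tegra 3 T30 (1.4 GHz), dev board, Cortex-A9 | enc | 872,468 | 1,358,955 | 0.64 |
| NVIDIA Tegra 3 T30 (1.4 GHz), dev board, Cortex-A9 | dec | 27,547,434 | 47,205,919 | 0.58 |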

Performance Results
Compare to results from eBACS (ECRYPT Benchmarking of Cryptographic Systems) and OpenSSL.
| Platform | RSA-2048 | Classic | OpenSSL |
| --- | --- | --- | --- |
| Snapdragon S4 (1.8 GHz) vs Snapdragon S3 (1.78 GHz) | enc | 1,087,318 | 609,593 |
| Snapdragon S4 (1.8 GHz) vs Snapdragon S3 (1.78 GHz) | dec | 34,769,147 | 39,746,105 |
| Intel Atom Z2760 (1.8 GHz), tablet | enc | 2,583,643 | 2,323,800 |
| Intel Atom Z2760 (1.8 GHz), tablet | dec | 80,204,317 | 75,871,800 |

Can we do (asymptotically) better?
What about faster multiplication methods (Karatsuba)?
• Incompatible with interleaved Montgomery multiplication
• Possible gain ([A]) on 32-bit platforms for 1024-bit Montgomery multiplication
Following the analysis from [A] (one level of Karatsuba) for 32-bit platforms:
• Sequential Karatsuba montmul versus sequential interleaved montmul: sequential Karatsuba reduces muls by 1.14x and adds by 1.18x
• Sequential Karatsuba montmul versus SIMD interleaved montmul: SIMD interleaved reduces muls by 1.70x and adds by 1.67x
[A] J. Großschädl, R. M. Avanzi, E. Savas, and S. Tillich. Energy-efficient software implementation of long integer modular arithmetic. CHES 2005

Can we do (asymptotically) better?
What about SIMD Karatsuba montmul versus SIMD interleaved montmul?
• SIMD Karatsuba, but how to calculate SIMD reduction?
• This approach is used in GMP
• GMP is not a crypto lib
| Platform | GMP RSA-2048 enc | SIMD RSA-2048 enc | GMP RSA-2048 dec | SIMD RSA-2048 dec |
| --- | --- | --- | --- | --- |
| Atom Z2760 | 2,184,436 | 1,601,878 | 37,070,875 | 52,000,367 |
| Intel Xeon E3-1230 (32-bit mode) | 695,861 | 414,787 | 11,929,868 | 12,211,700 |
Modular Squaring
• Time(Montgomery squaring) ≈ 0.80 × Time(Montgomery multiplication) [A]
• SIMD Montgomery squaring?
• We didn't use this optimization
[A] J. Großschädl, R. M. Avanzi, E. Savas, and S. Tillich. Energy-efficient software implementation of long integer modular arithmetic. CHES 2005

Conclusions
• Current vector instructions can be used to enhance the performance of Montgomery multiplication on modern embedded devices. Examples: 32-bit x86 (SSE) and ARM (NEON) platforms
• Faster RSA-2048 on some tablets: performance on ARM differs significantly
• If future instruction sets support 64 × 64 → 128-bit 2-way SIMD multipliers: enhance interleaved Montgomery multiplication performance
Future work
• Investigate SIMD Karatsuba + SIMD (?) Montgomery reduction
• Investigate SIMD Montgomery squaring
