Implementation of RSA 2048 on GPUs
Marcelo E. Kaihara, EPFL – LACAL

SLIDE 1

Implementation of RSA 2048 on GPUs

Marcelo E. Kaihara EPFL – LACAL

  • Nov. 4, 2010
SLIDE 2

Motivation

NIST Recommendations for Key Management (SP 800-57)
NIST DRAFT Recommendation for the Transitioning of Cryptographic Algorithms and Key Sizes (SP 800-131)

RSA 1024: deprecated from January 1, 2011

RSA 2048: about 8× the computational effort

SLIDE 3

Objective

Use GPUs as cryptographic accelerators to offload work from the CPU.

  • Low latency
  • Generic implementation
  • Server application
  • Parallel implementation
  • OpenCL
  • Speed

SLIDE 4

RSA 2048 Decryption

z = c^d mod m, with m = p · q and e · d ≡ 1 (mod φ(m))

Precomputed values (p, q, dP, dQ, qInv):

  dP = e^(−1) mod (p − 1)
  dQ = e^(−1) mod (q − 1)
  qInv = q^(−1) mod p

Decryption with the Chinese Remainder Theorem:

  z1 = c^dP mod p
  z2 = c^dQ mod q
  h = qInv · (z1 − z2) mod p
  z = z2 + h · q

Decryption reduces to two modular exponentiations with 1024-bit moduli (32 limbs of 32 bits): B = 2^32, s = 32.

SLIDE 5

General overview

Ordinary representation ↔ Montgomery representation

Sequential multiplications are performed in Montgomery representation: operands u, v are mapped to images ũ, ṽ, and the Montgomery product z̃ = ũ * ṽ corresponds to z = u · v.

Montgomery Multiplication

SLIDE 6

Ordinary representation ↔ Montgomery representation

Montgomery radix: R = B^s, with gcd(R, m) = 1.

  ũ = u · R mod m
  ṽ = v · R mod m
  ũ * ṽ = ũ · ṽ · R^(−1) mod m = (u · v) · R mod m

The two representations are isomorphic.

Montgomery Multiplication

SLIDE 7

Definition:

  ũ * ṽ = ũ · ṽ · R^(−1) mod m,   ũ, ṽ ∈ Z/mZ

where m is a large odd integer, gcd(m, B) = 1, and usually R = B^s.

Montgomery Multiplication

SLIDE 8

Sequential Computation on CPU

Algorithm: z̃ = ũ * ṽ = ũ · ṽ · R^(−1) mod m

  z̃ := 0;
  for (i := 0; i < s; i++) {
      z̃ := z̃ + ũ_i · ṽ;
      q := (z̃ · (−m^(−1))) mod B;
      z̃ := (z̃ + q · m) div B;
  }
  if z̃ ≥ m then z̃ := z̃ − m;

The division by B is exact:

  (z̃ + q · m) mod B = (z̃ + ((z̃ · (−m^(−1))) mod B) · m) mod B
                    = (z̃ − z̃ · ((m^(−1) · m) mod B)) mod B
                    = (z̃ − z̃) mod B = 0

SLIDE 9

Sequential Computation on CPU (cont.)

In the step z̃ := z̃ + ũ_i · ṽ, the addend is bounded by ũ_i · ṽ < (B − 1) · m, since ũ_i ≤ B − 1 and ṽ < m.

SLIDE 10

Sequential Computation on CPU (cont.)

The quotient digit q := (z̃ · (−m^(−1))) mod B makes z̃ + q · m divisible by B, so z̃ := (z̃ + q · m) div B is an exact one-limb shift.

SLIDE 11

Sequential Computation on CPU (cont.)

Loop invariant: z̃ < 2m. If z̃ < 2m at the start of an iteration, then after the update

  z̃ ← (z̃ + ũ_i · ṽ + q · m) / B < (2m + (B − 1) · m + (B − 1) · m) / B = 2m,

so the invariant is maintained, and the single conditional subtraction at the end yields z̃ < m.

SLIDE 12

Fermi architecture

Specifications:

  • 3 billion transistors
  • 16 Streaming Multiprocessors (SM)
  • 6 × 64-bit memory partitions
  • Up to 6 GB GDDR5 in total, with ECC
  • GigaThread global scheduler
  • Shared L2 cache (768 KB)

Source: NVIDIA’s next Generation CUDATM Compute Architecture: Fermi

SLIDE 13

Fermi architecture

Streaming Multiprocessor:
  • 32 CUDA cores (16 × 32 = 512 in total)
  • Dual warp scheduler
  • 16 LD/ST units
  • 4 Special Function Units (SFU)
  • 64 KB of configurable shared memory and L1 cache (48 KB / 16 KB)

CUDA core:
  • Pipelined ALU and FPU
  • ALU supports 32-bit integers
  • Single-precision FPU (512 FMA ops / clock)
  • 1K 32-bit registers per core

Source: NVIDIA’s next Generation CUDATM Compute Architecture: Fermi

SLIDE 14

Representation of Integers

[Figure: an integer x = (x_31 … x_2 x_1) split into limbs, shown in two layouts.
Sequential version: one thread processes all limbs of x (high latency, suited to cryptanalysis workloads).
Parallel version: the limbs of x are distributed across threads (low latency, suited to cryptography).]

SLIDE 15

Representation of Integers

To avoid barriers (memory fences), try to fit an entire operand within a block of 32 threads (a warp). Data coherence is maintained within a warp. Each thread operates on one limb in radix B = 2^32.

Possible representations:
  • Avizienis representation (signed-digit)
  • Residue Number System
  • Carry-save

[Figure: limbs x_1 … x_31 with an associated carry vector c_1 … c_31.]

SLIDE 16

Montgomery Multiplication

Algorithm: Z = A * B = ũ · ṽ · R^(−1) mod m

  A := ũ; B := ṽ; M := m; T := 0;
  for (i := 0; i < s; i++) {
      T := T + b_i · A;
      q_M := (t_0 · (−m^(−1))) mod B;
      T := (T + q_M · M) div B;
  }
  if T ≥ M then Z := T − M; else Z := T;

[Figure: operands A = (a_31 … a_1), B = (b_31 … b_1), M = (m_31 … m_1) and accumulator T = (t_31 … t_1), one limb per thread. Each partial product a_j · b_i is split into a low half L(a_j b_i) and a high half H(a_j b_i), accumulated into T together with a carry vector (c_31 … c_1).]

SLIDE 17

Montgomery Multiplication (cont.)

[Figure: the quotient digit q_M is derived from the least-significant limb of the accumulator T; the rows L(a_j b_i) and H(a_j b_i) are accumulated with the carry vector (c_31 … c_1).]

SLIDE 18

Montgomery Multiplication (cont.)

[Figure: the products m_j · q_M are likewise split into low halves L(m_j q_M) and high halves H(m_j q_M) and added to T, one limb per thread.]

SLIDE 19

Montgomery Multiplication (cont.)

[Figure: accumulation of the L(m_j q_M) and H(m_j q_M) rows into T continues; carries collect in (c_31 … c_1).]

SLIDE 20

Montgomery Multiplication (cont.)

[Figure: the high halves H(m_j q_M) are added to T.]

SLIDE 21

Montgomery Multiplication (cont.)

[Figure: the remaining H(m_j q_M) terms and carries are folded into T.]

SLIDE 22

Montgomery Multiplication (cont.)

[Figure: the division by B in T := (T + q_M · M) div B is a one-limb shift, realized by adjusting each thread's limb index.]

SLIDE 23

Carry propagation

[Figure: limbs t_1 … t_8 with carries c_1 … c_7 rippling between adjacent positions.]

A full ripple requires 31 iterations of additions with carry. The probability of a new carry appearing after one iteration is very low, so a one-time carry propagation plus a logarithmic-time verification suffices.

SLIDE 24

Avoiding carry propagation

[Figure: carries are parked next to the limbs instead of being propagated immediately; the check is performed at the marked position.]

The probability of a new carry after one iteration is very low: perform a one-time carry propagation, then verify. If a carry is detected, continue the propagation; the remaining carries can be checked in logarithmic time.

SLIDE 25

Speeding up further

[Figure: operands A, B, modulus M, accumulator T = (t_31 … t_1) and a separate carry accumulator C = (c_31 … c_1), updated limb-parallel without propagation.]

Use the logarithmic-time check only after the exponentiation; if a carry is detected at the end, call the slow exponentiation. Prepare two exponentiation algorithms:
1) Fast: accumulate carries, check once after the exponentiation.
2) Slow: check after every modular multiplication.

SLIDE 26

Operand Scaling Techniques

Scaling the modulus:

  M̃ = M · ((−m^(−1)) mod B)

The scaled modulus has least-significant limb m̃_0 = B − 1 (i.e. −1 mod B), which simplifies quotient determination to

  q_M := t_0;

M̃ is one limb longer (m̃_32); the extra limb is truncated at the end.

[Figure: operands A = (a_31 … a_1), B = (b_31 … b_1), scaled modulus M̃ = (m̃_32 m̃_31 … m̃_1 m̃_0) with m̃_0 = B − 1, and accumulator T = (t_31 … t_1) with carry vector (c_31 … c_1).]
slide-27
SLIDE 27

27

Operand Scaling Techniques

Scaling the modulus M̃ = M · ((−m^(−1)) mod B) simplifies quotient determination to q_M := t_0.

Drawbacks: the limb t_i must be recorded each iteration, and the intermediate result is only bounded by 3M. A similar situation arises if the multiplicand A is scaled up by the radix B.

SLIDE 28

Performance Evaluation

Evaluation Platform

  • Linux x86_64, OpenCL 1.0, CUDA
  • Nvidia GeForce GTX 465: 11 SMs with 32 cores each = 352 cores
    (5 SMs are disabled to increase production yield)
  • Device clock frequency 1.2 GHz
    (lower than the GTX 480 @ 1.4 GHz)
  • Measurements include I/O (CRT is performed on the GPU)

SLIDE 29

Performance Evaluation

[Chart: decryptions/sec vs. number of messages submitted; CPU baseline: AMD Opteron™ 1381 Quad-Core @ 2.6 GHz.]

* R. Szerwinski and T. Guneysu, “Exploiting the Power of GPUs for Asymmetric Cryptography”, CHES 2008

SLIDE 30

Performance Evaluation

Implementation               Ops/sec             Delay [ms]
OpenSSL Normal (1)           946.9               –
Crypto++ (1)                 1'566.28 (scaled)   –
OpenSSL GMP (1)              2'738.9             –
GPU 8800 GTS (CIOS) (3)      104.3               55'184
GPU 8800 GTS (RNS) (3)       57.9                849
GPU GTX465 (2)               2'232.43            39.4

(1) Evaluated on an AMD Opteron™ 1381 Quad-Core @ 2.6 GHz on Linux x86_64
(2) Current implementation on an Nvidia GTX465 (11 SMs, 352 CUDA cores in total) @ 1.2 GHz
(3) R. Szerwinski and T. Guneysu, “Exploiting the Power of GPUs for Asymmetric Cryptography”, CHES 2008; implemented on an Nvidia 8800 GTS (112 CUDA cores @ 1.5 GHz)

SLIDE 31

The Fermi architecture supports addition with carry (add.cc) via inline assembly in CUDA. Carry generation by comparison:

  b[local_id] += a[local_id];
  c[local_id] += (b[local_id] < a[local_id]);

Further optimizations (in progress)

Code: carry generation, exponentiation algorithm.

  • Currently using left-to-right binary exponentiation
  • Windowed exponentiation: 25% speed-up
  • Randomization of the carry-check points as a countermeasure

SLIDE 32

Summary

I presented an implementation of RSA 2048 on GPUs that takes advantage of data coherence inside the warp. The current implementation is competitive with CPU implementations and suitable for server applications requiring low latency. The use of GPUs as cryptographic accelerators seems to have a promising future.