
Implementation of RSA 2048 on GPUs, Marcelo E. Kaihara, EPFL LACAL



  1. Implementation of RSA 2048 on GPUs Marcelo E. Kaihara EPFL – LACAL Nov. 4, 2010

  2. Motivation. NIST Recommendations for Key Management (SP 800-57). NIST draft Recommendation for the Transitioning of Cryptographic Algorithms and Key Sizes (SP 800-131). RSA 1024: deprecated from January 1, 2011. RSA 2048: 8x computational effort.

  3. Objective: use GPUs as cryptographic accelerators to offload work from the CPU. Low latency; parallel implementation; generic implementation; OpenCL; server application; speed.

  4. RSA 2048 Decryption. Setup: $m = p \cdot q$, $e \cdot d \equiv 1 \pmod{\varphi(m)}$. Decryption: $z = c^{d} \bmod m$.
     Precomputed values $(p, q, dP, dQ, qInv)$: $dP = e^{-1} \bmod (p-1)$, $dQ = e^{-1} \bmod (q-1)$, $qInv = q^{-1} \bmod p$.
     Chinese Remainder Theorem: $z_1 = c^{dP} \bmod p$ and $z_2 = c^{dQ} \bmod q$ (Mod Exp with 1024-bit moduli, i.e. 32 limbs of 32 bits: $s = 32$, $B = 2^{32}$), then $h = qInv \cdot (z_1 - z_2) \bmod p$ and $z = z_2 + h \cdot q$.
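The CRT recombination on slide 4 can be checked with a small sketch. The function name `crt_decrypt` and the toy primes are assumptions made for illustration only; real RSA 2048 uses 1024-bit primes, and this is not the GPU implementation from the talk.

```python
def crt_decrypt(c, p, q, dP, dQ, qInv):
    """RSA decryption with the Chinese Remainder Theorem (slide 4 notation)."""
    z1 = pow(c, dP, p)              # z1 = c^dP mod p
    z2 = pow(c, dQ, q)              # z2 = c^dQ mod q
    h = (qInv * (z1 - z2)) % p      # h = qInv * (z1 - z2) mod p
    return z2 + h * q               # z = z2 + h * q


# Toy parameters (hypothetical, far too small for real use).
p, q, e = 61, 53, 17
m = p * q
d = pow(e, -1, (p - 1) * (q - 1))   # e * d = 1 mod phi(m)
dP = pow(e, -1, p - 1)
dQ = pow(e, -1, q - 1)
qInv = pow(q, -1, p)

message = 65
c = pow(message, e, m)
# The CRT result agrees with the direct computation c^d mod m.
assert crt_decrypt(c, p, q, dP, dQ, qInv) == pow(c, d, m) == message
```

The two half-size exponentiations are independent of each other, which is what makes them candidates for parallel execution.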

  5. Montgomery Multiplication. General overview. Ordinary representation: $u, v \mapsto z = u \times v$. Montgomery representation: $\tilde u, \tilde v \mapsto \tilde z = \tilde u * \tilde v$. Sequential multiplications are performed in Montgomery representation.

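  6. Montgomery Multiplication. Montgomery radix: $R = B^{s} > m$, $\gcd(R, m) = 1$. The ordinary representation $(\mathbb{Z}/m\mathbb{Z}, +, \times)$ and the Montgomery representation $(\mathbb{Z}/m\mathbb{Z}, +, *)$ are isomorphic via $\tilde u = u \cdot R \bmod m$, $\tilde v = v \cdot R \bmod m$. Products: $u \times v = (u \cdot v) \bmod m$ in the ordinary representation and $\tilde u * \tilde v = \tilde u \cdot \tilde v \cdot R^{-1} \bmod m$ in the Montgomery representation.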
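The isomorphism on slide 6 can be illustrated with Python big integers. The helper names `to_mont`, `from_mont` and `mont_mul`, and the modulus value, are assumptions chosen for this sketch; the product is computed straight from the definition, not with the word-level algorithm of the later slides.

```python
B = 2**32
s = 32
R = B**s                      # Montgomery radix R = B^s

m = 2**1024 - 105             # hypothetical odd modulus with m < R (not an RSA prime)
R_inv = pow(R, -1, m)         # exists because gcd(R, m) = 1


def to_mont(u):
    """Ordinary -> Montgomery representation: u~ = u * R mod m."""
    return (u * R) % m


def from_mont(u_t):
    """Montgomery -> ordinary representation: u = u~ * R^-1 mod m."""
    return (u_t * R_inv) % m


def mont_mul(u_t, v_t):
    """Montgomery product by definition: u~ * v~ = u~ v~ R^-1 mod m."""
    return (u_t * v_t * R_inv) % m


u, v = 0x1234567890ABCDEF, 3**600
# Addition and multiplication commute with the change of representation.
assert from_mont(to_mont(u)) == u
assert from_mont((to_mont(u) + to_mont(v)) % m) == (u + v) % m
assert from_mont(mont_mul(to_mont(u), to_mont(v))) == (u * v) % m
```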

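  7. Montgomery Multiplication. Definition: let $m$ be a large odd integer with $\gcd(m, B) = 1$, and let $\tilde u, \tilde v \in \mathbb{Z}/m\mathbb{Z}$. Then $\tilde u * \tilde v = \tilde u \cdot \tilde v \cdot R^{-1} \bmod m$, where $R > m$ (usually $R = B^{s}$).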
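Because the Montgomery product stays inside the Montgomery representation, a whole modular exponentiation can run there and convert only at the start and at the end. The following is a minimal sketch under that reading; `mont_pow` is a hypothetical name, the left-to-right square-and-multiply schedule is an assumption (the slides do not state which exponentiation method was used), and the product is again taken from the definition.

```python
def mont_pow(c, d, m, R):
    """Compute c^d mod m with every intermediate kept in Montgomery form."""
    R_inv = pow(R, -1, m)

    def mul(x_t, y_t):
        # Montgomery product by definition: x~ * y~ = x~ y~ R^-1 mod m
        return (x_t * y_t * R_inv) % m

    c_t = (c * R) % m          # enter Montgomery representation
    z_t = R % m                # Montgomery form of 1
    for bit in bin(d)[2:]:     # exponent bits, most significant first
        z_t = mul(z_t, z_t)    # square
        if bit == "1":
            z_t = mul(z_t, c_t)  # multiply
    return mul(z_t, 1)         # leave Montgomery representation


m = 2**1023 + 1155             # hypothetical odd 1024-bit modulus
R = 2**1024                    # R = B^s with B = 2^32, s = 32
assert mont_pow(0xC0FFEE, 65537, m, R) == pow(0xC0FFEE, 65537, m)
```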

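  8. Sequential Computation on CPU. Goal: $\tilde u * \tilde v = \tilde u \cdot \tilde v \cdot R^{-1} \bmod m$ (inputs $\tilde u$, $\tilde v$, $m$; output $\tilde z$).
     Algorithm: $\tilde z := 0$; for ($i = 0$; $i \le s-1$; $i{+}{+}$) { $\tilde z := \tilde z + \tilde u_i \cdot \tilde v$; $q_M := (-\tilde z_0 \cdot m_0^{-1}) \bmod B$; $\tilde z := (\tilde z + q_M \cdot m) \,\mathrm{div}\, B$; } if $\tilde z \ge m$ then $\tilde z := \tilde z - m$.
     The division by $B$ is exact: $(\tilde z_0 + (-\tilde z_0 \cdot m_0^{-1} \bmod B) \cdot m_0) \bmod B = (\tilde z_0 + (-\tilde z_0 \cdot m_0^{-1} \cdot m_0 \bmod B)) \bmod B = (\tilde z_0 - \tilde z_0) \bmod B = 0$.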
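The sequential loop of slide 8 translates almost line by line into Python. This is a sketch that relies on Python's arbitrary-precision integers for $\tilde z$; the function name `mont_mul_words` and the test modulus are assumptions, and a CPU implementation would work limb by limb instead.

```python
def mont_mul_words(u_t, v_t, m, s=32, w=32):
    """Word-serial Montgomery product u~ v~ B^-s mod m, requires u~, v~ < m < B^s."""
    B = 1 << w
    m0_inv_neg = (-pow(m, -1, B)) % B        # -m_0^-1 mod B (m is odd)
    z = 0
    for i in range(s):
        u_i = (u_t >> (w * i)) & (B - 1)     # i-th limb of u~
        z += u_i * v_t                       # z := z + u_i * v~
        q_M = ((z & (B - 1)) * m0_inv_neg) % B   # q_M := -z_0 * m_0^-1 mod B
        z = (z + q_M * m) >> w               # exact division by B
    if z >= m:                               # single conditional subtraction
        z -= m
    return z


m = 2**1024 - 105                            # hypothetical odd modulus
R = 2**1024
u_t, v_t = 3**640 % m, 5**440 % m
assert mont_mul_words(u_t, v_t, m) == (u_t * v_t * pow(R, -1, m)) % m
```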

  9. Sequential Computation on CPU. Same algorithm as on slide 8, annotated with the bound for the first iteration ($i = 0$): after $\tilde z := \tilde z + \tilde u_i \cdot \tilde v$ we have $\tilde z \le m \cdot (B-1)$, and after $\tilde z := (\tilde z + q_M \cdot m) \,\mathrm{div}\, B$ we have $\tilde z \le \frac{m \cdot (B-1) + m \cdot (B-1)}{B} = \frac{2 \cdot m \cdot (B-1)}{B} < 2m$.

  10. Sequential Computation on CPU. The algorithm of slide 8 for $\tilde u * \tilde v = \tilde u \cdot \tilde v \cdot R^{-1} \bmod m$, shown again without annotations.

  11. Sequential Computation on CPU. Same algorithm as on slide 8, annotated with the bound for the iterations $i \ge 1$: on entry $0 \le \tilde z < 2 \cdot m$; after $\tilde z := \tilde z + \tilde u_i \cdot \tilde v$ we have $\tilde z < 2 \cdot m + m \cdot (B-1)$, and after $\tilde z := (\tilde z + q_M \cdot m) \,\mathrm{div}\, B$ we have $\tilde z < \frac{2 \cdot m + m \cdot (B-1) + m \cdot (B-1)}{B} = \frac{m \cdot (2 + B - 1 + B - 1)}{B} = 2 \cdot m$.
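The two facts argued on slides 8 to 11 (the division by $B$ is exact, and $\tilde z < 2m$ after every iteration) can be checked exhaustively for a small radix. The sketch below assumes a toy radix $B = 2^4$ with $s = 2$ limbs so that every pair of operands can be tried; `check_bounds` is a hypothetical helper, and the check complements the derivation rather than replacing it.

```python
def check_bounds(m, s=2, w=4):
    """Check exact divisibility and z < 2m for all operands u~, v~ < m."""
    B = 1 << w
    assert m % 2 == 1 and m < B**s
    m0_inv_neg = (-pow(m, -1, B)) % B
    for u in range(m):
        for v in range(m):
            z = 0
            for i in range(s):
                u_i = (u >> (w * i)) & (B - 1)
                z += u_i * v
                q_M = ((z % B) * m0_inv_neg) % B
                assert (z + q_M * m) % B == 0     # division by B is exact
                z = (z + q_M * m) // B
                assert 0 <= z < 2 * m             # one subtraction suffices
    return True


# A few odd moduli below B^s = 256.
for modulus in (3, 77, 201, 255):
    assert check_bounds(modulus)
```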

  12. Fermi architecture. Specifications: 3 billion transistors; 16 Streaming Multiprocessors (SM); 6 x 64-bit memory partitions; up to 6 GB of GDDR5 in total, with ECC; GigaThread global scheduler; shared L2 cache (768 KB). Source: NVIDIA's Next Generation CUDA™ Compute Architecture: Fermi.

  13. Fermi architecture. Streaming Multiprocessor: 32 CUDA cores (16 x 32 = 512 in total); dual warp scheduler; 16 LD/ST units; 4 Special Function Units (SFU); 64 KB of configurable shared memory and L1 cache (48 KB/16 KB). CUDA core: pipelined ALU and FPU; the ALU supports 32-bit integers; the FPU is single precision (512 FMA ops/clock); 1K 32-bit registers per core. Source: NVIDIA's Next Generation CUDA™ Compute Architecture: Fermi.

  14. Representation of Integers. Parallel version: low latency; cryptography. Sequential version: high latency; cryptanalysis. [Figure: layout of the limbs $x_0, x_1, x_2, \dots, x_{31}$ in the parallel version and in the sequential version.]

  15. Representation of Integers. To avoid barriers (memory fences), try to fit the entire operand within a block of 32 threads (a warp). Data coherence is maintained within a warp. Each thread operates on one limb in radix $B = 2^{32}$. Possible representations: Avizienis representation (signed-digit); Residue Number System; carry-save. [Figure: carries $c_0, c_1, c_2, \dots, c_{31}$ alongside limbs $x_0, x_1, x_2, \dots, x_{31}$.]
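To make the carry-save option on slide 15 concrete, here is a minimal sketch in which each of 32 "lanes" holds one 32-bit limb plus a small pending carry, so an addition never ripples through all limbs. The Python loop only emulates what 32 threads of a warp would do in lock step; all names (`to_limbs`, `cs_add`, `value`) and the exact carry convention are assumptions, not details taken from the slides.

```python
W = 32                    # limb width in bits, radix B = 2^32
MASK = (1 << W) - 1
N = 32                    # limbs per operand, one per thread of a warp


def to_limbs(x, n=N):
    """Split a non-negative integer into n limbs, least significant first."""
    return [(x >> (W * j)) & MASK for j in range(n)]


def value(limbs, carries):
    """Integer held by limbs plus pending carries (carries[j] weighs B^(j+1))."""
    total = sum(l << (W * j) for j, l in enumerate(limbs))
    return total + sum(c << (W * (j + 1)) for j, c in enumerate(carries))


def cs_add(x, cx, y):
    """Add limb vector y to the carry-save pair (x, cx) in one lane-wise step."""
    n = len(x)
    out, cout = [0] * n, [0] * n
    for j in range(n):                      # each iteration models one thread
        carry_in = cx[j - 1] if j > 0 else 0
        t = x[j] + y[j] + carry_in          # uses only the neighbour's old carry
        out[j] = t & MASK
        cout[j] = t >> W
    cout[n - 1] += cx[n - 1]                # keep the carry out of the top limb
    return out, cout


terms = [2**1024 - 1, 2**1000 + 12345, 2**1024 - 2**32]
x, cx = to_limbs(0), [0] * N
for term in terms:
    x, cx = cs_add(x, cx, to_limbs(term))
# Carries are resolved only once, when the final value is needed.
assert value(x, cx) == sum(terms)
```

Each lane reads only its neighbour's carry from the previous step, which is why no barrier across the whole operand is needed between additions.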

  16. Montgomery Multiplication. Goal: $\tilde u * \tilde v = \tilde u \cdot \tilde v \cdot R^{-1} \bmod m$.
     Algorithm: $A := \tilde u$; $B := \tilde v$; $M := m$; $T := 0$; for ($i = 0$; $i \le s-1$; $i{+}{+}$) { $T := T + b_i \cdot A$; $q_M := t_0 \cdot (-m_0^{-1}) \bmod B$; $T := (T + q_M \cdot M) \,\mathrm{div}\, B$; } if $T \ge M$ then $Z := T - M$; else $Z := T$.
     [Figure: operands laid out one limb per thread, $M = (m_{31}, \dots, m_2, m_1, m_0)$, $A = (a_{31}, \dots, a_2, a_1, a_0)$, $B = (b_{31}, \dots, b_2, b_1, b_0)$; high and low words $H(a_j b_0)$ and $L(a_j b_0)$ of the partial products; carries $c_j$; accumulator $T = (t_{31}, \dots, t_2, t_1, t_0)$.]

  17. Montgomery Multiplication. Same algorithm and operand layout as on slide 16. [Figure: $q_M$ is derived from the lowest limb $t_0$ of $T$.]

  18. Montgomery Multiplication. Same algorithm and operand layout as on slide 16. [Figure: high and low words $H(m_j q_M)$ and $L(m_j q_M)$ of the products $m_j \cdot q_M$, used in the step $T := (T + q_M \cdot M) \,\mathrm{div}\, B$.]

  19. Montgomery Multiplication. Same algorithm and operand layout as on slide 16. [Figure: the words $H(m_j q_M)$ and $L(m_j q_M)$ are accumulated into $T$ together with the carries $c_j$.]

  20. Montgomery Multiplication. Same algorithm and operand layout as on slide 16. [Figure: further step of the same accumulation into $T$.]
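The limb-per-thread scheme of slides 16 to 20 can be emulated on the CPU to see how the pieces fit together. The sketch below is one possible reading of the figures, not the author's kernel: each lane adds the low word of its own products and the high word of its neighbour's products, keeps a small pending carry instead of rippling it, and the division by $B$ becomes a shift of limbs by one lane. The function name `mont_mul_lanes`, the extra top lane, and the carry convention are all assumptions.

```python
W = 32
RADIX = 1 << W            # radix B = 2^32 (named RADIX to avoid clashing with operand B)
MASK = RADIX - 1


def limbs(x, n):
    """Split x into n limbs of W bits, least significant first."""
    return [(x >> (W * j)) & MASK for j in range(n)]


def mont_mul_lanes(u_t, v_t, m, s=32):
    """Lane-wise Montgomery product u~ v~ B^-s mod m, requires u~, v~ < m < B^s."""
    n = s + 1                                 # extra lane for the top column
    a, b, mm = limbs(u_t, n), limbs(v_t, s), limbs(m, n)
    m0_inv_neg = (-pow(mm[0], -1, RADIX)) % RADIX
    t = [0] * n                               # one 32-bit limb of T per lane
    c = [0] * n                               # pending carry per lane (same weight as t[j])

    for i in range(s):
        # Lane 0 derives q_M from its own column and broadcasts it.
        q = (((t[0] + c[0] + a[0] * b[i]) & MASK) * m0_inv_neg) & MASK
        col = []
        for j in range(n):                    # each iteration models one thread
            sj = t[j] + c[j] + ((a[j] * b[i]) & MASK) + ((mm[j] * q) & MASK)
            if j > 0:                         # high words come from the lane below
                sj += ((a[j - 1] * b[i]) >> W) + ((mm[j - 1] * q) >> W)
            col.append(sj)
        # col[0] is a multiple of B, so dividing by B is a shift by one lane:
        # lane j takes the low word of lane j+1 and keeps its own carry.
        t = [col[j + 1] & MASK for j in range(n - 1)] + [0]
        c = [col[j] >> W for j in range(n)]

    T = sum((t[j] + c[j]) << (W * j) for j in range(n))   # resolve carries once
    return T - m if T >= m else T


m = 2**1024 - 105                             # hypothetical odd modulus
u_t, v_t = 3**640 % m, 5**440 % m
assert mont_mul_lanes(u_t, v_t, m) == (u_t * v_t * pow(2**1024, -1, m)) % m
```

In this emulation a lane only ever needs data from its immediate neighbours plus the broadcast value $q_M$, which matches the goal on slide 15 of keeping an operand inside one warp.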
