Implementation of RSA 2048 on GPUs
Marcelo E. Kaihara, EPFL – LACAL

SLIDE 1

Implementation of RSA 2048 on GPUs

Marcelo E. Kaihara EPFL – LACAL

  • Nov. 4, 2010
SLIDE 2

Motivation

NIST Recommendations for Key Management (SP 800-57)
NIST DRAFT Recommendation for the Transitioning of Cryptographic Algorithms and Key Sizes (SP 800-131)

RSA 1024: deprecated from January 1, 2011

RSA 2048: about 8× the computational effort

SLIDE 3

Objective

Use GPUs as cryptographic accelerators to offload work from the CPU.

  • Low latency
  • Generic implementation
  • Server application
  • Parallel implementation
  • OpenCL
  • Speed

SLIDE 4

RSA 2048 Decryption

z = c^d mod m, with m = p · q and e · d ≡ 1 (mod φ(m))

Precomputed values (p, q, dP, dQ, qInv):

  dP = e^(−1) mod (p − 1)
  dQ = e^(−1) mod (q − 1)
  qInv = q^(−1) mod p

Decryption with the Chinese Remainder Theorem:

  z1 = c^dP mod p
  z2 = c^dQ mod q
  h = qInv · (z1 − z2) mod p
  z = z2 + h · q

Decryption reduces to two modular exponentiations with 1024-bit moduli (32 limbs of 32 bits): B = 2^32, s = 32.

SLIDE 5

General overview

Ordinary representation ↔ Montgomery representation

Sequential multiplications are performed in Montgomery representation: operands u, v are mapped to images ũ, ṽ, and the Montgomery product z̃ = ũ * ṽ corresponds to z = u · v.

Montgomery Multiplication

SLIDE 6

Ordinary representation ↔ Montgomery representation

Montgomery radix: R = B^s, with gcd(R, m) = 1.

  ũ = u · R mod m
  ṽ = v · R mod m
  ũ * ṽ = ũ · ṽ · R^(−1) mod m = (u · v) · R mod m

The two representations are isomorphic.

Montgomery Multiplication

SLIDE 7

Definition:

  ũ * ṽ = ũ · ṽ · R^(−1) mod m,   ũ, ṽ ∈ Z/mZ

where m is a large odd integer, gcd(m, B) = 1, and usually R = B^s.

Montgomery Multiplication

SLIDE 8

Sequential Computation on CPU

Algorithm: z̃ = ũ * ṽ = ũ · ṽ · R^(−1) mod m

  z̃ := 0;
  for (i := 0; i < s; i++) {
      z̃ := z̃ + ũ_i · ṽ;
      q := (z̃ · (−m^(−1))) mod B;
      z̃ := (z̃ + q · m) div B;
  }
  if z̃ ≥ m then z̃ := z̃ − m;

The division by B is exact:

  (z̃ + q · m) mod B = (z̃ + ((z̃ · (−m^(−1))) mod B) · m) mod B
                    = (z̃ − z̃ · ((m^(−1) · m) mod B)) mod B
                    = (z̃ − z̃) mod B = 0

SLIDE 9

Sequential Computation on CPU (cont.)

In the step z̃ := z̃ + ũ_i · ṽ, the addend is bounded by ũ_i · ṽ < (B − 1) · m, since ũ_i ≤ B − 1 and ṽ < m.

SLIDE 10

Sequential Computation on CPU (cont.)

The quotient digit q := (z̃ · (−m^(−1))) mod B makes z̃ + q · m divisible by B, so z̃ := (z̃ + q · m) div B is an exact one-limb shift.

SLIDE 11

Sequential Computation on CPU (cont.)

Loop invariant: z̃ < 2m. If z̃ < 2m at the start of an iteration, then after the update

  z̃ ← (z̃ + ũ_i · ṽ + q · m) / B < (2m + (B − 1) · m + (B − 1) · m) / B = 2m,

so the invariant is maintained, and the single conditional subtraction at the end yields z̃ < m.

SLIDE 12

Fermi architecture

Specifications:

  • 3 billion transistors
  • 16 Streaming Multiprocessors (SM)
  • 6 × 64-bit memory partitions
  • Up to 6 GB GDDR5 in total, with ECC
  • GigaThread global scheduler
  • Shared L2 cache (768 KB)

Source: NVIDIA’s next Generation CUDATM Compute Architecture: Fermi

SLIDE 13

Fermi architecture

Streaming Multiprocessor:
  • 32 CUDA cores (16 × 32 = 512 in total)
  • Dual warp scheduler
  • 16 LD/ST units
  • 4 Special Function Units (SFU)
  • 64 KB of configurable shared memory and L1 cache (48 KB / 16 KB)

CUDA core:
  • Pipelined ALU and FPU
  • ALU supports 32-bit integers
  • Single-precision FPU (512 FMA ops / clock)
  • 1K 32-bit registers per core

Source: NVIDIA’s next Generation CUDATM Compute Architecture: Fermi

SLIDE 14

Representation of Integers

[Figure: an integer x = (x_31 … x_2 x_1) split into limbs, shown in two layouts.
Sequential version: one thread processes all limbs of x (high latency, suited to cryptanalysis workloads).
Parallel version: the limbs of x are distributed across threads (low latency, suited to cryptography).]

SLIDE 15

Representation of Integers

To avoid barriers (memory fences), try to fit an entire operand within a block of 32 threads (a warp). Data coherence is maintained within a warp. Each thread operates on one limb in radix B = 2^32.

Possible representations:
  • Avizienis representation (signed-digit)
  • Residue Number System
  • Carry-save

[Figure: limbs x_1 … x_31 with an associated carry vector c_1 … c_31.]

SLIDE 16

Montgomery Multiplication

Algorithm: Z = A * B = ũ · ṽ · R^(−1) mod m

  A := ũ; B := ṽ; M := m; T := 0;
  for (i := 0; i < s; i++) {
      T := T + b_i · A;
      q_M := (t_0 · (−m^(−1))) mod B;
      T := (T + q_M · M) div B;
  }
  if T ≥ M then Z := T − M; else Z := T;

[Figure: operands A = (a_31 … a_1), B = (b_31 … b_1), M = (m_31 … m_1) and accumulator T = (t_31 … t_1), one limb per thread. Each partial product a_j · b_i is split into a low half L(a_j b_i) and a high half H(a_j b_i), accumulated into T together with a carry vector (c_31 … c_1).]

SLIDE 17

Montgomery Multiplication (cont.)

[Figure: the quotient digit q_M is derived from the least-significant limb of the accumulator T; the rows L(a_j b_i) and H(a_j b_i) are accumulated with the carry vector (c_31 … c_1).]

SLIDE 18

Montgomery Multiplication (cont.)

[Figure: the products m_j · q_M are likewise split into low halves L(m_j q_M) and high halves H(m_j q_M) and added to T, one limb per thread.]

SLIDE 19

Montgomery Multiplication (cont.)

[Figure: accumulation of the L(m_j q_M) and H(m_j q_M) rows into T continues; carries collect in (c_31 … c_1).]

SLIDE 20

Montgomery Multiplication (cont.)

[Figure: the high halves H(m_j q_M) are added to T.]

SLIDE 21

Montgomery Multiplication (cont.)

[Figure: the remaining H(m_j q_M) terms and carries are folded into T.]

SLIDE 22

Montgomery Multiplication (cont.)

[Figure: the division by B in T := (T + q_M · M) div B is a one-limb shift, realized by adjusting each thread's limb index.]

SLIDE 23

Carry propagation

[Figure: limbs t_1 … t_8 with carries c_1 … c_7 rippling between adjacent positions.]

A full ripple requires 31 iterations of additions with carry. The probability of a new carry appearing after one iteration is very low, so a one-time carry propagation plus a logarithmic-time verification suffices.

SLIDE 24

Avoiding carry propagation

[Figure: carries are parked next to the limbs instead of being propagated immediately; the check is performed at the marked position.]

The probability of a new carry after one iteration is very low: perform a one-time carry propagation, then verify. If a carry is detected, continue the propagation; the remaining carries can be checked in logarithmic time.

SLIDE 25

Speeding up further

[Figure: operands A, B, modulus M, accumulator T = (t_31 … t_1) and a separate carry accumulator C = (c_31 … c_1), updated limb-parallel without propagation.]

Use the logarithmic-time check only after the exponentiation; if a carry is detected at the end, call the slow exponentiation. Prepare two exponentiation algorithms:
1) Fast: accumulate carries, check once after the exponentiation.
2) Slow: check after every modular multiplication.

SLIDE 26

Operand Scaling Techniques

Scaling the modulus:

  M̃ = M · ((−m^(−1)) mod B)

The scaled modulus has least-significant limb m̃_0 = B − 1 (i.e. −1 mod B), which simplifies quotient determination to

  q_M := t_0;

M̃ is one limb longer (m̃_32); the extra limb is truncated at the end.

[Figure: operands A = (a_31 … a_1), B = (b_31 … b_1), scaled modulus M̃ = (m̃_32 m̃_31 … m̃_1 m̃_0) with m̃_0 = B − 1, and accumulator T = (t_31 … t_1) with carry vector (c_31 … c_1).]
slide-27
SLIDE 27

27

Operand Scaling Techniques

Scaling the modulus M̃ = M · ((−m^(−1)) mod B) simplifies quotient determination to q_M := t_0.

Drawbacks: the limb t_i must be recorded each iteration, and the intermediate result is only bounded by 3M. A similar situation arises if the multiplicand A is scaled up by the radix B.

SLIDE 28

Performance Evaluation

Evaluation Platform

  • Linux x86_64, OpenCL 1.0, CUDA
  • Nvidia GeForce GTX 465: 11 SMs with 32 cores each = 352 cores
    (5 SMs are disabled to increase production yield)
  • Device clock frequency 1.2 GHz
    (lower than the GTX 480 @ 1.4 GHz)
  • Measurements include I/O (CRT is performed on the GPU)

SLIDE 29

Performance Evaluation

[Chart: decryptions/sec vs. number of messages submitted; CPU baseline: AMD Opteron™ 1381 Quad-Core @ 2.6 GHz.]

* R. Szerwinski and T. Guneysu, “Exploiting the Power of GPUs for Asymmetric Cryptography”, CHES 2008

SLIDE 30

Performance Evaluation

Implementation               Ops/sec             Delay [ms]
OpenSSL Normal (1)           946.9               –
Crypto++ (1)                 1'566.28 (scaled)   –
OpenSSL GMP (1)              2'738.9             –
GPU 8800 GTS (CIOS) (3)      104.3               55'184
GPU 8800 GTS (RNS) (3)       57.9                849
GPU GTX465 (2)               2'232.43            39.4

(1) Evaluated on an AMD Opteron™ 1381 Quad-Core @ 2.6 GHz on Linux x86_64
(2) Current implementation on an Nvidia GTX465 (11 SMs, 352 CUDA cores in total) @ 1.2 GHz
(3) R. Szerwinski and T. Guneysu, “Exploiting the Power of GPUs for Asymmetric Cryptography”, CHES 2008; implemented on an Nvidia 8800 GTS (112 CUDA cores @ 1.5 GHz)

SLIDE 31

The Fermi architecture supports addition with carry (add.cc) via inline assembly in CUDA. Carry generation by comparison:

  b[local_id] += a[local_id];
  c[local_id] += (b[local_id] < a[local_id]);

Further optimizations (in progress)

Code: carry generation, exponentiation algorithm.

  • Currently using left-to-right binary exponentiation
  • Windowed exponentiation: 25% speed-up
  • Randomization of the carry-check points as a countermeasure

SLIDE 32

Summary

I presented an implementation of RSA 2048 on GPUs that takes advantage of data coherence inside the warp. The current implementation is competitive with CPU implementations and suitable for server applications requiring low latency. The use of GPUs as cryptographic accelerators seems to have a promising future.