Implementation of RSA 2048 on GPUs
Marcelo E. Kaihara EPFL – LACAL
Nov. 4, 2010
Motivation
NIST Recommendations for Key Management (SP 800-57)
NIST DRAFT recommendation for the Transitioning of Cryptographic Algorithms and Key Sizes (SP
[Slide residue: symbols d, dP, dQ (RSA private exponent and CRT exponents), B = 2^32, s]
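Ordinary Representation <-> Montgomery Representation
$u,\ v \;\longleftrightarrow\; \tilde{u},\ \tilde{v}$
$(u \cdot v) \bmod m \;\longleftrightarrow\; \tilde{z} = \tilde{u} * \tilde{v} = (\tilde{u} \cdot \tilde{v} \cdot R^{-1}) \bmod m$
$\tilde{v} = (v \cdot R) \bmod m, \quad R = B^{s} > m, \quad \gcd(R, m) = 1$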
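The mapping between the two representations can be checked numerically. The following is a minimal sketch, assuming toy parameters (a 32-bit word, s = 4 words, and an arbitrary odd modulus chosen only for illustration); the helper names `to_mont`, `from_mont` and `mont_mul` are hypothetical and do not come from the slides.

```python
# Toy parameters (assumptions for illustration, not from the slides).
B = 1 << 32                      # word radix
s = 4                            # number of words
R = B ** s                       # R = B^s
m = (1 << 127) + 0x12345677      # arbitrary odd modulus, so gcd(R, m) = 1 and m < R


def to_mont(v):
    """Ordinary -> Montgomery representation: v~ = v * R mod m."""
    return (v * R) % m


def from_mont(v_t):
    """Montgomery -> ordinary representation: v = v~ * R^-1 mod m."""
    return (v_t * pow(R, -1, m)) % m


def mont_mul(u_t, v_t):
    """Montgomery product: u~ * v~ = u~ v~ R^-1 mod m (reference version)."""
    return (u_t * v_t * pow(R, -1, m)) % m


u, v = 123456789, 987654321
z_t = mont_mul(to_mont(u), to_mont(v))
# The Montgomery product of the images is the image of the ordinary product.
assert z_t == to_mont((u * v) % m)
assert from_mont(z_t) == (u * v) % m
```

This reference version uses Python's modular inverse directly; it only illustrates the algebra and is not how the product would be computed on a GPU.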
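Algorithm
Input: $\tilde{u},\ \tilde{v},\ m$    Output: $\tilde{z} = \tilde{u} * \tilde{v}$
$\tilde{z} := 0;$
for $(i = 0;\ i < s;\ i{+}{+})$ {
  $\tilde{z} := \tilde{z} + \tilde{u}_i \cdot \tilde{v};$
  $q_M := (-\tilde{z} \cdot m^{-1}) \bmod B;$
  $\tilde{z} := (\tilde{z} + q_M \cdot m) \operatorname{div} B;$
}
if $\tilde{z} \ge m$ then $\tilde{z} := \tilde{z} - m;$

Choice of $q_M$:
$(\tilde{z} + ((-\tilde{z} \cdot m^{-1}) \bmod B) \cdot m) \bmod B$
$= (\tilde{z} - (\tilde{z} \cdot m^{-1} \cdot m \bmod B)) \bmod B$
$= (\tilde{z} - \tilde{z}) \bmod B = 0$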
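The word-serial loop above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: a word width `w` of 32 bits, operands already below m, and a hypothetical function name `mont_mul_words`. Python's big integers stand in for the multi-word arithmetic.

```python
def mont_mul_words(u_t, v_t, m, s, w=32):
    """Word-serial Montgomery product u~ v~ B^-s mod m, with B = 2^w.

    Assumes m is odd, m < B^s, and 0 <= u_t, v_t < m.
    """
    B = 1 << w
    m_inv_neg = (-pow(m, -1, B)) % B        # -m^-1 mod B, precomputed once
    z = 0
    for i in range(s):
        u_i = (u_t >> (w * i)) & (B - 1)    # i-th word of u~
        z += u_i * v_t                      # z := z + u_i * v~
        q = (z * m_inv_neg) % B             # q_M := (-z * m^-1) mod B
        z = (z + q * m) // B                # exact: z + q_M * m is a multiple of B
    if z >= m:                              # single conditional subtraction
        z -= m
    return z


# Usage with an arbitrary odd 128-bit modulus (illustrative values only).
m = (1 << 127) + 0x12345677
a, b = pow(3, 70, m), pow(5, 50, m)
R = 1 << 128
assert mont_mul_words(a, b, m, 4) == (a * b * pow(R, -1, m)) % m
```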
Bound on the intermediate value: if $\tilde{z} < 2m$ at the start of iteration $i$, then with $\tilde{u}_i \le B - 1$, $\tilde{v} < m$ and $q_M \le B - 1$,
$\tilde{z} + \tilde{u}_i \cdot \tilde{v} + q_M \cdot m < 2m + (B-1)m + (B-1)m = m(2 + B - 1 + B - 1) = 2Bm$
so after the division by $B$ the new value again satisfies $\tilde{z} < 2m$, and a single conditional subtraction of $m$ after the loop is enough.
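The bound can be checked exhaustively for a small modulus. The sketch below assumes toy parameters (4-bit words, s = 2, m = 239) chosen only so that every operand pair can be enumerated quickly; `max_intermediate` is a hypothetical helper name.

```python
def max_intermediate(m, s, w):
    """Largest value z reaches after any loop iteration, over all u, v < m."""
    B = 1 << w
    m_inv_neg = (-pow(m, -1, B)) % B
    worst = 0
    for u in range(m):
        for v in range(m):
            z = 0
            for i in range(s):
                u_i = (u >> (w * i)) & (B - 1)
                z += u_i * v
                q = (z * m_inv_neg) % B
                z = (z + q * m) // B
                worst = max(worst, z)
    return worst


# Toy case: B = 16, R = B^2 = 256, m = 239 (odd and below R).
assert max_intermediate(239, 2, 4) < 2 * 239
```

An exhaustive check like this only confirms the bound for one small modulus; the inequality itself is what covers the general case.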
Specifications:
3 billion transistors
16 Streaming Multiprocessors (SM)
6 x 64-bit memory partitions
Up to 6GB total GDDR5 with ECC
GigaThread global scheduler
Shared L2 Cache (768KB)
Source: NVIDIA's Next Generation CUDA™ Compute Architecture: Fermi
Streaming Multiprocessor
  32 CUDA Cores (16 x 32 = 512)
  Dual warp scheduler
  16 LD/ST Units
  4 Special Function Units (SFU)
  64KB of configurable Shared Memory and L1 Cache (48KB/16KB)
CUDA Core
  Pipelined ALU and FPU
  ALU supports 32-bit int
  FPU single precision (512 FMA ops / clock)
  1K 32-bit registers per core
[Figure: multi-word operands laid out as words x_31, ..., x_2, x_1, x_0]
[Figure: words x_31, ..., x_2, x_1, x_0 with carries c_31, ..., c_2, c_1, c_0]
[Figure: operands A, B, M as words a_31 ... a_0, b_31 ... b_0, m_31 ... m_0; accumulator T as words t_31 ... t_0 with carries c_31 ... c_0; per-word partial products L(a b) and H(a b), presumably the low and high words of each product]

Algorithm
$A := \tilde{u};\ B := \tilde{v};\ M := m;\ T := 0;$
for $(i = 0;\ i < s;\ i{+}{+})$ {
  $T := T + b_i \cdot A;$
  $q_M := (t_0 \cdot (-m^{-1})) \bmod B;$
  $T := (T + q_M \cdot M) \operatorname{div} B;$
}
if $T \ge M$ then $Z := T - M;$ else $Z := T;$
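The word-level data flow suggested by the figure can be simulated in Python. This is only a sketch under assumptions: L() and H() are taken to be the low and high 32-bit halves of a 64-bit product, each index j of the inner loops plays the role of one thread, and carries are resolved by a simple sequential pass (`propagate`), which is not necessarily how the GPU implementation handles them. All function names are hypothetical.

```python
W = 32
MASK = (1 << W) - 1


def L(x):
    """Low word of a double-word product (assumed meaning of L)."""
    return x & MASK


def H(x):
    """High word of a double-word product (assumed meaning of H)."""
    return x >> W


def to_words(x, n):
    return [(x >> (W * j)) & MASK for j in range(n)]


def from_words(words):
    return sum(word << (W * j) for j, word in enumerate(words))


def propagate(t):
    """Resolve pending carries so every word of t fits in W bits."""
    carry = 0
    for j in range(len(t)):
        total = t[j] + carry
        t[j] = total & MASK
        carry = total >> W


def mont_mul_limbs(a, b, m, s):
    """Montgomery product a * b * 2^(-W*s) mod m on word arrays (a, b < m)."""
    A, Bw, M = to_words(a, s), to_words(b, s), to_words(m, s)
    m_inv_neg = (-pow(m, -1, 1 << W)) & MASK
    T = [0] * (s + 2)                   # two spare words for the running sum
    for i in range(s):
        for j in range(s):              # "thread" j handles word j
            p = A[j] * Bw[i]
            T[j] += L(p)
            T[j + 1] += H(p)
        propagate(T)
        q = (T[0] * m_inv_neg) & MASK   # q_M from the lowest word t_0
        for j in range(s):
            p = M[j] * q
            T[j] += L(p)
            T[j + 1] += H(p)
        propagate(T)
        T = T[1:] + [0]                 # t_0 is now zero: div B is an index shift
    t = from_words(T)
    return t - m if t >= m else t
```

With s = 32 and 32-bit words this layout covers 1024-bit operands, which is the operand size that arises when RSA 2048 is computed with the CRT.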
[Figure sequence (slides 17 to 22), repeating the algorithm above frame by frame: the frames show q_M, the partial products L(m q_M) and H(m q_M) alongside H(a b), the accumulator words t_31 ... t_0, the carries c_31 ... c_0, and a final shift of the t word indices]
[Figure (slides 23 and 24): accumulator words t_8, ..., t_1, t_0 and carries c_7, ..., c_1, c_0]
[Figure: operands A, B, M as words a_31 ... a_0, b_31 ... b_0, m_31 ... m_0; accumulator T as words t_31 ... t_0; carries C as c_31 ... c_0; register acc]
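[Figure: operands A, B and the scaled modulus, with words a_31 ... a_0, b_31 ... b_0, m_32 ... m_0, accumulator words t_31 ... t_0 and carries c_31 ... c_0; label: Truncate]
$\tilde{M} = ((-m^{-1}) \bmod B) \cdot M$, so that the least significant word is $m_0 \equiv -1 \pmod{B}$ and $q_M := t_0;$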
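The effect of scaling the modulus can be illustrated as follows. This is a sketch assuming that the scaled modulus is M~ = ((-m^-1) mod B) * M as reconstructed from the slide; the function names are hypothetical, and only the congruence of the result modulo the original modulus is checked, not the size of the result.

```python
W = 32
B = 1 << W


def scale_modulus(m):
    """Return M~ = ((-m^-1) mod B) * m, whose lowest word is B - 1 (i.e. -1 mod B)."""
    k = (-pow(m, -1, B)) % B
    return k * m


def mont_mul_scaled(u, v, m_scaled, n):
    """n-word Montgomery loop modulo M~; here q_M is simply the lowest word t_0."""
    z = 0
    for i in range(n):
        u_i = (u >> (W * i)) & (B - 1)
        z += u_i * v
        q = z % B                       # q_M := t_0, no multiplication needed
        z = (z + q * m_scaled) // B     # exact because M~ = -1 (mod B)
    return z                            # congruent to u * v * B^-n modulo M~


m = (1 << 127) + 0x12345677             # arbitrary odd modulus (illustrative)
m_scaled = scale_modulus(m)
assert m_scaled % B == B - 1

u, v = pow(3, 70, m), pow(5, 50, m)
n = 5                                   # M~ has one word more than m
r = mont_mul_scaled(u, v, m_scaled, n)
assert r % m == (u * v * pow(B ** n, -1, m)) % m
```

Since M~ is a multiple of m, any result that is correct modulo M~ is also correct modulo m; a final reduction modulo m is still needed to obtain the canonical residue.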
(5 SMs are disabled to increase production yield)
(lower clock frequency than the GTX480 @ 1.4GHz)
(CRT is performed on the GPU)
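For context, RSA decryption with the CRT replaces one exponentiation modulo n by two half-size exponentiations modulo p and q. The sketch below uses Garner's recombination and textbook-sized toy primes; it is an illustration of the general technique, not the slides' GPU code, and the function name is hypothetical.

```python
def rsa_crt_decrypt(c, p, q, d):
    """RSA decryption via CRT (Garner recombination). Toy sketch, no padding."""
    dP = d % (p - 1)                 # CRT exponent for p
    dQ = d % (q - 1)                 # CRT exponent for q
    q_inv = pow(q, -1, p)            # q^-1 mod p
    mP = pow(c % p, dP, p)           # half-size exponentiation mod p
    mQ = pow(c % q, dQ, q)           # half-size exponentiation mod q
    h = (q_inv * (mP - mQ)) % p
    return mQ + h * q


# Toy key (far too small for real use).
p, q, e = 61, 53, 17
n = p * q
d = pow(e, -1, (p - 1) * (q - 1))
msg = 65
assert rsa_crt_decrypt(pow(msg, e, n), p, q, d) == msg
```

For a 2048-bit modulus the two exponentiations work on 1024-bit operands, that is 32 words of 32 bits each.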
[Plot: decryptions/sec versus messages submitted; label: AMD Opteron™ 1381 Quad-Core @ 2.6GHz]

Implementation               Ops/sec
OpenSSL Normal (1)           946.9
Crypto++ (1)                 1’566.28 (scaled)
OpenSSL GMP (1)              2’738.9
GPU 8800 GTS (CIOS) (3)      104.3
GPU 8800 GTS (RNS) (3)       57.9
GPU GTX465 (2)               2’232.43

Delay [ms]: 849 39.4

(1) Evaluated on AMD Opteron™ 1381 Quad-Core @ 2.6GHz on Linux x86_64
(2) Current implementation on Nvidia GTX465 (11 SMs, total of 352 CUDA Cores) @ 1.2GHz
(3) R. Szerwinski and T. Guneysu, “Exploiting the Power of GPUs for Asymmetric Cryptography”, CHES 2008. Implemented on Nvidia 8800 GTS (total of 112 CUDA Cores @ 1.5GHz)
Fermi architecture supports addition with carry (add.cc)
Inline assembly using CUDA
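What a chain of carry-propagating additions computes can be modelled in Python. This is a sketch of the arithmetic only, assuming little-endian lists of 32-bit words; it does not reproduce the inline PTX used on the GPU, and `add_limbs` is a hypothetical name.

```python
MASK = 0xFFFFFFFF


def add_limbs(x, y):
    """Add two equal-length little-endian lists of 32-bit words.

    Returns (sum_words, carry_out). Each step mirrors one add-with-carry:
    the carry produced by word j is consumed by word j + 1.
    """
    out, carry = [], 0
    for xi, yi in zip(x, y):
        total = xi + yi + carry
        out.append(total & MASK)     # low 32 bits stay in this word
        carry = total >> 32          # carry flag for the next word
    return out, carry


# 0x00000000_FFFFFFFF + 1 = 0x00000001_00000000
assert add_limbs([0xFFFFFFFF, 0], [1, 0]) == ([0, 1], 0)
```

Without a hardware carry flag, each step would need an extra comparison to detect overflow, which is why native add-with-carry support matters for multi-word arithmetic.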
Currently using left-to-right binary exponentiation
With windowed exponentiation: 25% speed-up
Randomization of the checking point for carries as a countermeasure
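The two exponentiation strategies mentioned above can be compared in a short sketch. The slides do not say which window variant is meant, so a fixed window of k = 4 bits is assumed here; both function names are hypothetical, and plain modular multiplication stands in for Montgomery multiplication.

```python
def pow_binary_ltr(base, exp, mod):
    """Left-to-right binary exponentiation: one squaring per bit,
    plus one multiplication for every set bit."""
    result = 1
    for bit in bin(exp)[2:]:
        result = (result * result) % mod
        if bit == "1":
            result = (result * base) % mod
    return result


def pow_fixed_window(base, exp, mod, k=4):
    """Fixed-window exponentiation: k squarings per window,
    plus at most one table multiplication per window."""
    table = [1] * (1 << k)                  # table[i] = base^i mod mod
    for i in range(1, 1 << k):
        table[i] = (table[i - 1] * base) % mod
    n_windows = (max(exp.bit_length(), 1) + k - 1) // k
    result = 1
    for wi in reversed(range(n_windows)):
        for _ in range(k):
            result = (result * result) % mod
        digit = (exp >> (wi * k)) & ((1 << k) - 1)
        if digit:
            result = (result * table[digit]) % mod
    return result


assert pow_binary_ltr(7, 65537, 1000003) == pow(7, 65537, 1000003)
assert pow_fixed_window(7, 65537, 1000003) == pow(7, 65537, 1000003)
```

The number of squarings is essentially the same in both; the window method saves multiplications, roughly one per k bits instead of one per two bits on average for a random exponent. The sketch does not measure the 25% figure quoted above.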