Pollard Rho on the PlayStation 3
RAIM’09 October 27th 2009 LIP ENS Lyon
Joppe W. Bos (1), Marcelo E. Kaihara (1), Peter L. Montgomery (2)
(1) EPFL IC LACAL, CH-1015 Lausanne, Switzerland
(2) Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA
Widely standardized:
  Standard for Efficient Cryptography 2, SEC2 (112-521 bit)
  Wireless Transport Layer Security Specification (112-224 bit)
  Digital Signature Standard, FIPS 186-3, NIST (192-521 bit)
Security relies on the hardness of solving the Elliptic Curve Discrete Logarithm Problem (ECDLP).
Are the standardized key sizes secure? What is the practical cost of solving the ECDLP?
Evaluate the cost of solving the ECDLP for small key sizes: the 112-bit ECC standard
Use a broadly available platform, the PlayStation 3:
  Low price
  Hybrid multi-core architecture
Implement Pollard rho on the Cell architecture:
  Design SIMD arithmetic algorithms
  Optimize modular arithmetic for the 112-bit prime
Largest solved instance: 109-bit prime field (2002)
It took “$10^4$ computers (mostly PCs) running 24 hours a day for 549 days”.
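The most efficient algorithm in the literature (for generic curves). The underlying idea of this method is to search for two distinct pairs $(c_i, d_i)$ and $(c_j, d_j)$ such that
$$c_i \cdot P + d_i \cdot Q = c_j \cdot P + d_j \cdot Q.$$
For $Q = k \cdot P$ with $P$ of order $n$ and $d_i \not\equiv d_j \pmod{n}$, this gives $k \equiv (c_i - c_j)(d_j - d_i)^{-1} \pmod{n}$.
J.M. Pollard. Monte Carlo methods for index computation (mod p). Mathematics of Computation, 32:918-924, 1978.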
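As a minimal sketch of how a collision between two pairs yields the logarithm, the snippet below solves $c_i P + d_i Q = c_j P + d_j Q$ for $k$ where $Q = k \cdot P$ and $P$ has order $n$. The function name `log_from_collision` and the toy group (the integers modulo $n$ with $P = 1$) are assumptions made for illustration only; it needs Python 3.8 or later for `pow(x, -1, n)`.

```python
def log_from_collision(ci, di, cj, dj, n):
    """Solve ci*P + di*Q = cj*P + dj*Q for k, where Q = k*P and P has order n."""
    denom = (dj - di) % n
    if denom == 0:
        return None  # useless collision: the walk has to be restarted
    return (ci - cj) * pow(denom, -1, n) % n

# Toy check in the additive group Z_n with P = 1 and Q = k,
# so that c*P + d*Q is simply (c + d*k) mod n.
n, k = 101, 37
ci, di = 5, 9
dj = 20
cj = (ci + di * k - dj * k) % n   # forces a collision with (ci, di)
recovered = log_from_collision(ci, di, cj, dj, n)
```

A collision with $d_i \equiv d_j$ carries no information about $k$, which is why the sketch returns `None` in that case.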
Parallel version: distinguish points and send them to a central server
P.C. van Oorschot and M.J. Wiener, [1999].
Mark points with a certain property, e.g., $X_i = (x_i, y_i)$ is a distinguished point (DPT) if $2^{24} \mid x_i$.
Communicate them to a central DB to check for collisions.
Leads to a linear speed-up in the number of processors.
[Figure: three independent walks $X_{i-1} \to X_i \to X_{i+1} \to X_{i+2}$, $X'_{i-1} \to X'_i \to X'_{i+1} \to X'_{i+2}$ and $X''_{i-1} \to X''_i \to X''_{i+1} \to X''_{i+2}$, reporting to a central DB.]
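The distinguished-point idea can be sketched as follows. The test "$2^{24}$ divides $x_i$" is taken from the slide; the class `CentralDB` and its `report` method are hypothetical stand-ins for the central server, not the authors' implementation.

```python
DP_BITS = 24  # a point is distinguished when 2^24 divides its x-coordinate

def is_distinguished(x, bits=DP_BITS):
    """True when the low `bits` bits of x are all zero."""
    return x & ((1 << bits) - 1) == 0

class CentralDB:
    """Hypothetical central database: distinguished point -> (c, d) that reached it."""

    def __init__(self):
        self.seen = {}

    def report(self, point, c, d):
        """Store the point; return the earlier (c, d) if a different pair already reached it."""
        prev = self.seen.setdefault(point, (c, d))
        return prev if prev != (c, d) else None
```

Each processor only reports the rare distinguished points, so communication stays small while any two walks that merge are still detected at their next distinguished point.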
r-adding walks, E. Teske, [2001].
Divide $\langle P \rangle$ into $r$ different partitions with $h : \langle P \rangle \to [0, r-1]$.
For each partition $j$: $R_j = c_j \cdot P + d_j \cdot Q$.
$$X_{i+1} = f(X_i) = X_i + R_{h(X_i)}$$
$r \ge 16$ partitions $\approx$ random mapping.
Use the least significant 4 bits to determine the next partition.
[Figure: $X_{i-1} \to X_i = (x_i, y_i) \to X_{i+1}$, where $X_{i+1}$ is one of $X_i + R_0, X_i + R_1, \dots, X_i + R_{15}$.]
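A toy sketch of an r-adding walk, assuming the small textbook curve $y^2 = x^3 + 2x + 2$ over $\mathbb{F}_{17}$ with $P = (5, 1)$ of prime order 19. The curve, the function names, the choice of the x-coordinate's low 4 bits for $h$, and the dictionary that stores every visited point are all illustrative assumptions; a real run would use distinguished points instead of storing the whole walk.

```python
import random

# Toy parameters (assumed for illustration): y^2 = x^3 + 2x + 2 over F_17.
p, a = 17, 2
P, n = (5, 1), 19          # P generates a group of prime order 19

def add(A, B):
    """Affine point addition; None is the point at infinity."""
    if A is None:
        return B
    if B is None:
        return A
    (x1, y1), (x2, y2) = A, B
    if x1 == x2 and (y1 + y2) % p == 0:
        return None
    if A == B:
        mu = (3 * x1 * x1 + a) * pow(2 * y1 % p, -1, p) % p
    else:
        mu = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p
    x3 = (mu * mu - x1 - x2) % p
    return (x3, (mu * (x1 - x3) - y1) % p)

def mul(k, A):
    """Double-and-add scalar multiplication."""
    R = None
    while k:
        if k & 1:
            R = add(R, A)
        A = add(A, A)
        k >>= 1
    return R

def pollard_rho(Q, r=16, seed=1):
    """Find k with Q = k*P using the walk X_{i+1} = X_i + R_{h(X_i)}."""
    rng = random.Random(seed)
    while True:
        # Precompute R_j = c_j*P + d_j*Q for each of the r partitions.
        table = []
        for _ in range(r):
            cj, dj = rng.randrange(n), rng.randrange(n)
            table.append((cj, dj, add(mul(cj, P), mul(dj, Q))))
        c, d = rng.randrange(n), rng.randrange(n)
        X = add(mul(c, P), mul(d, Q))
        seen = {}
        while X not in seen:
            seen[X] = (c, d)
            j = 0 if X is None else X[0] % r      # h(X): low 4 bits of x when r = 16
            cj, dj, Rj = table[j]
            X, c, d = add(X, Rj), (c + cj) % n, (d + dj) % n
        c0, d0 = seen[X]
        if (d0 - d) % n:                           # otherwise retry with a new table
            return (c - c0) * pow((d0 - d) % n, -1, n) % n
```

For example, `pollard_rho(mul(7, P))` recovers the scalar 7 on this toy curve.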
Simultaneous Inversion, trade inversions for multiplications
P.L. Montgomery, [1987].
Suitable for cryptanalytic purposes
Trade M modular inversions for 3(M-1) modular multiplications and 1 modular inversion
[Figure: independent walks $X_{i-1} \to X_i \to X_{i+1} \to X_{i+2}$, $X'_{i-1} \to \dots \to X'_{i+2}$ and $X''_{i-1} \to \dots \to X''_{i+2}$.]
Apply to independent walks
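Montgomery's simultaneous inversion can be sketched in a few lines: $M$ inverses are obtained from one inversion plus roughly $3(M-1)$ multiplications. The function name `batch_inverse` is an assumption for illustration, and all inputs are assumed to be nonzero modulo the prime `p`.

```python
def batch_inverse(values, p):
    """Invert every element of `values` modulo p using a single modular inversion."""
    m = len(values)
    # prefix[i] = values[0] * ... * values[i-1] mod p
    prefix = [1] * (m + 1)
    for i, v in enumerate(values):
        prefix[i + 1] = prefix[i] * v % p
    inv = pow(prefix[m], -1, p)          # the only inversion
    out = [0] * m
    for i in range(m - 1, -1, -1):
        out[i] = inv * prefix[i] % p     # inverse of values[i]
        inv = inv * values[i] % p        # now the inverse of prefix[i]
    return out
```

With independent walks, `values` would hold the denominators of the point additions of all walks in one step, so the cost of the inversion is shared among them.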
Negation Map (not used)
M.J. Wiener and R.J. Zuccherato, [1998].
Computation of the negative is cheap: for $P = (x, y)$, $-P = (x, -y)$.
Given an equivalence relation $\sim$ on $\langle P \rangle$ (here $P \sim -P$), iterate over the set of equivalence classes $\langle P \rangle / \sim$.
Reduce search space by a factor of 2.
[Figure: walk over the classes $\{(x_{i-1}, y_{i-1}), (x_{i-1}, -y_{i-1})\} \to \{(x_i, y_i), (x_i, -y_i)\} \to \{(x_{i+1}, y_{i+1}), (x_{i+1}, -y_{i+1})\}$, with steps labelled $R_{i-1}$ and $R_i$.]
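A minimal sketch of how a walk can iterate over classes $\{X, -X\}$: each point is replaced by a canonical representative of its class. Picking the representative with the smaller y-coordinate is a common convention assumed here for illustration; the slides do not say which rule would be used.

```python
def canonical(point, p):
    """Representative of the class {X, -X} over F_p: the point with the smaller y."""
    x, y = point
    return (x, min(y, (p - y) % p))
```

Both `(5, 1)` and `(5, 16)` on a curve over $\mathbb{F}_{17}$ map to the same representative, so the walk effectively runs on half as many elements.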
1 “Power Processor Element” (PPE)
8 “Synergistic Processing Elements” (SPEs) (6 available to the user in the PS3 under Linux)
Characteristics of the SPEs:
  Synergistic Processing Unit (SPU)
  Access to 128 registers of 128 bits
  SIMD operations
  Dual pipeline (odd and even)
  In-order processor
  256 KB of fast local memory (Local Store, LS)
The executable and all data should fit in the LS (256 KB).
No “smart” dynamic branch prediction. Instead, “prepare-to-branch” instructions redirect instruction prefetch to branch targets.
16 x 16 → 32 bit multipliers (4-SIMD)
One odd and one even instruction can be dispatched per clock cycle.
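Using Montgomery’s simultaneous inversion and running $M$ curves in parallel.
$P, Q \in E(\mathbb{F}_p) \setminus \{\mathcal{O}\}$, $P = (x_1, y_1)$ and $Q = (x_2, y_2)$. If $P \ne -Q$ then $P + Q = (x_3, y_3)$ with
$$x_3 = \mu^2 - x_1 - x_2, \qquad y_3 = \mu (x_1 - x_3) - y_1, \qquad \mu = \begin{cases} \dfrac{y_2 - y_1}{x_2 - x_1} & \text{if } P \ne Q \\[2mm] \dfrac{3 x_1^2 + a}{2 y_1} & \text{if } P = Q \end{cases}$$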
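The affine addition formulas combine naturally with simultaneous inversion: the $M$ denominators $x_2 - x_1$ are inverted together. The sketch below assumes the generic case only (no point at infinity and distinct x-coordinates in every pair); the name `batch_add` and the toy curve used in the usage note are assumptions for illustration.

```python
def batch_add(pairs, p):
    """Add M pairs of affine points (A_k, B_k) over F_p with one modular inversion.

    Assumes the generic case: A_k != +-B_k and no point at infinity.
    """
    dens = [(B[0] - A[0]) % p for A, B in pairs]
    # Montgomery's trick: prefix products, one inversion, then unwind.
    prefix = [1]
    for d in dens:
        prefix.append(prefix[-1] * d % p)
    inv = pow(prefix[-1], -1, p)
    out = [None] * len(pairs)
    for k in range(len(pairs) - 1, -1, -1):
        d_inv = inv * prefix[k] % p          # 1 / (x2 - x1) for pair k
        inv = inv * dens[k] % p
        (x1, y1), (x2, y2) = pairs[k]
        mu = (y2 - y1) * d_inv % p
        x3 = (mu * mu - x1 - x2) % p
        out[k] = (x3, (mu * (x1 - x3) - y1) % p)
    return out
```

On the toy curve $y^2 = x^3 + 2x + 2$ over $\mathbb{F}_{17}$, `batch_add([((5, 1), (6, 3)), ((6, 3), (10, 6))], 17)` computes both sums with a single call to `pow(..., -1, p)`.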
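Four integers $A$, $B$, $C$, $D$ are represented in radix $2^{16}$:
$$A = \sum_i a_i \cdot 2^{16 \cdot i}, \qquad B = \sum_i b_i \cdot 2^{16 \cdot i}, \qquad C = \sum_i c_i \cdot 2^{16 \cdot i}, \qquad D = \sum_i d_i \cdot 2^{16 \cdot i}$$
[Figure: 4-SIMD register layout; each register holds the limbs $a_i$, $b_i$, $c_i$, $d_i$ side by side, with the parts labelled 16-bit high and 16-bit low.]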
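To illustrate the radix-$2^{16}$ representation, the sketch below splits an integer into 16-bit limbs and multiplies two such integers using only 16 x 16 → 32 bit partial products. It handles one integer at a time; on the SPE the same steps would run on four integers in lockstep. All function names are assumptions for illustration, and this is a plain schoolbook method rather than the authors' scheduling.

```python
LIMB_BITS = 16
MASK = (1 << LIMB_BITS) - 1

def to_limbs(value, m):
    """Split value into m limbs of 16 bits, least significant first."""
    return [(value >> (LIMB_BITS * i)) & MASK for i in range(m)]

def from_limbs(limbs):
    """Recombine limbs: sum of limbs[i] * 2^(16*i)."""
    return sum(v << (LIMB_BITS * i) for i, v in enumerate(limbs))

def schoolbook_mul(a_limbs, b_limbs):
    """Multiply two limb vectors using 16 x 16 -> 32 bit partial products."""
    acc = [0] * (len(a_limbs) + len(b_limbs))
    for i, ai in enumerate(a_limbs):
        carry = 0
        for j, bj in enumerate(b_limbs):
            t = acc[i + j] + ai * bj + carry   # at most 2^32 - 1, fits in 32 bits
            acc[i + j] = t & MASK
            carry = t >> LIMB_BITS
        acc[i + len(b_limbs)] = carry
    return acc
```

The intermediate `t` never exceeds $2^{32} - 1$, which is why 16-bit limbs suit a 16 x 16 → 32 bit multiplier.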
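Use the modulus $\tilde{p} = 2^{128} - 3$.
$$R : \mathbb{Z}/2^{256}\mathbb{Z} \to \mathbb{Z}/2^{256}\mathbb{Z}, \qquad x \mapsto (x \bmod 2^{128}) + 3 \cdot \lfloor x / 2^{128} \rfloor$$
$$x = x_H \cdot 2^{128} + x_L \equiv x_L + 3 \cdot x_H = R(x) \pmod{\tilde{p}}$$
[Figure: $x = (x_h, x_l) \to x' = x_l + 3 \cdot x_h = (x'_h, x'_l) \to x'' = x'_l + 3 \cdot x'_h = (v_h, v_l)$ with $v_h \in \{0, 1\}$; $v_h = 0$ with overwhelming probability.]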
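A small sketch of the reduction map $R$ for the modulus $2^{128} - 3$: since $2^{128} \equiv 3$, the high half of a 256-bit value can be folded into the low half after multiplying by 3. The names `R`, `P_TILDE` and `MASK128` are assumptions for illustration.

```python
import random

P_TILDE = 2**128 - 3
MASK128 = 2**128 - 1

def R(x):
    """Fold the high 128 bits into the low 128 bits, using 2^128 = 3 mod P_TILDE."""
    return (x & MASK128) + 3 * (x >> 128)

x = random.getrandbits(256)
v = R(R(x))   # two folds bring a 256-bit value down to about 128 bits
```

After two applications the result is below $2^{128} + 9$, so its high part is 0 or 1, and it is 1 only when the low half after the first fold falls within 9 of $2^{128}$.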
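Modular inversion with state $[A_1, B_1, A_2, B_2]$, started from $p$, $x$ and $1$, keeping
$$A_1 \times 1 \equiv B_1 \times x \pmod{p}, \qquad A_2 \times 1 \equiv B_2 \times x \pmod{p}.$$
Update steps:
$$[A_1, B_1, A_2, B_2] \leftarrow [A_1 - A_2,\ B_1 - B_2,\ A_2,\ B_2]$$
$$[A_1, B_1, A_2, B_2] \leftarrow [A_1,\ B_1,\ A_2 - A_1,\ B_2 - B_1]$$
$$[A_1, B_1, A_2, B_2] \leftarrow [A_1 \gg t_1,\ B_1 \ll t_2,\ A_2 \gg t_2,\ B_2 \ll t_1]$$
The loop works towards $\gcd(A_1, A_2)$ and yields $z = x^{-1} \cdot 2^{k} \bmod p$; $r = 2^{32}$.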
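The result of the form $x^{-1} \cdot 2^k \bmod p$ is characteristic of shift-and-subtract inversion. As a hedged illustration, the sketch below implements the first phase of Kaliski's well-known almost Montgomery inverse, which is a related standard technique and not necessarily the exact variant on the slide. It assumes an odd prime `p` and `0 < x < p`; the name `almost_inverse` is an assumption.

```python
def almost_inverse(x, p):
    """Return (z, k) with z = x^(-1) * 2^k mod p, using only shifts, adds and subtracts."""
    u, v, r, s, k = p, x, 0, 1, 0
    while v > 0:
        if u & 1 == 0:
            u >>= 1
            s <<= 1
        elif v & 1 == 0:
            v >>= 1
            r <<= 1
        elif u > v:
            u = (u - v) >> 1
            r += s
            s <<= 1
        else:
            v = (v - u) >> 1
            s += r
            r <<= 1
        k += 1
    if r >= p:
        r -= p
    return p - r, k
```

The extra factor $2^k$ is removed afterwards, for example by a few Montgomery-style halvings, which keeps the main loop free of multiplications and divisions.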
Operation          #cycles per operation   #operations per iteration   #cycles per iteration
Mod Mul                      53                       6                        318
Mod Sub                       5                       6                         30
Partial Mon Red              24                       1                         24
Mod Inv                    4941                     1/400                       12
Misc.                        69                       1                         69
Total                                                                          453
[1 SPU, 4-SIMD @ 3.2 GHz]
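The totals in the table can be re-derived with a few lines of arithmetic. The throughput estimate assumes that the 453-cycle figure is already amortized over the four SIMD lanes and that all 6 user-accessible SPEs run the same code; both are assumptions made for this sketch.

```python
# (cycles per operation, operations per iteration), copied from the table
costs = {
    "Mod Mul":         (53,   6),
    "Mod Sub":         (5,    6),
    "Partial Mon Red": (24,   1),
    "Mod Inv":         (4941, 1 / 400),   # one inversion shared by 400 walks
    "Misc.":           (69,   1),
}
cycles_per_iteration = sum(cycles * count for cycles, count in costs.values())

CLOCK_HZ = 3.2e9   # SPU clock
SPES = 6           # SPEs available to the user under Linux
iterations_per_second = SPES * CLOCK_HZ / cycles_per_iteration
```

Under these assumptions the estimate comes out at roughly $4.2 \cdot 10^7$ iterations per second for one PS3.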
[1] T. Güneysu, C. Paar, and J. Pelzl. Special-purpose hardware for solving the elliptic curve discrete logarithm problem. ACM Transactions on Reconfigurable Technology and Systems, 1(2):1-21, 2008.
[2] S. Kumar, C. Paar, J. Pelzl, G. Pfeiffer, and M. Schimmler. Breaking ciphers with COPACOBANA: a cost-optimized parallel code breaker. In CHES 2006, vol. 4249 of LNCS, pages 101-118, 2006.
                            96 bits       128 bits
COPACOBANA (XC3S1000)       4.0·10^7      2.1·10^7
+ Moore’s law               7.9·10^7      4.2·10^7
+ Negation map              1.1·10^8      5.9·10^7

PS3: 4.2·10^7
33 PS3: 1.4·10^9
$P = (188281465057972534892223778713752,\ 3419875491033170827167861896082688)$
$Q = (1415926535897932384626433832795028,\ 3846759606494706724286139623885544)$
$n = 4451685225093714776491891542548933$
$Q = k \cdot P$
The point $P$ is given in the standard. The x-coordinate of $Q$ was chosen as $\lfloor (\pi - 3) \cdot 10^{34} \rfloor$.
January 13, 2009 – July 8, 2009 (not run continuously). If run continuously, using the latest version of our code, the same calculation would have taken 3.5 months.
$k = 312521636014772477161767351856699$
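The reported solution can be checked by computing $k \cdot P$ directly. The sketch below assumes the secp112r1 parameters from SEC2, namely $p = (2^{128} - 3)/76439$ and $a = -3$, which are not stated on the slides; the helper names `add` and `mul` are likewise illustrative, and Python 3.8 or later is needed for `pow(x, -1, p)`.

```python
# Assumed secp112r1 parameters (from SEC2, not from the slides)
p = (2**128 - 3) // 76439
a = p - 3

# Values from the slides
P = (188281465057972534892223778713752, 3419875491033170827167861896082688)
Q = (1415926535897932384626433832795028, 3846759606494706724286139623885544)
k = 312521636014772477161767351856699

def add(A, B):
    """Affine point addition; None is the point at infinity."""
    if A is None:
        return B
    if B is None:
        return A
    (x1, y1), (x2, y2) = A, B
    if x1 == x2 and (y1 + y2) % p == 0:
        return None
    if A == B:
        mu = (3 * x1 * x1 + a) * pow(2 * y1 % p, -1, p) % p
    else:
        mu = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p
    x3 = (mu * mu - x1 - x2) % p
    return (x3, (mu * (x1 - x3) - y1) % p)

def mul(scalar, A):
    """Double-and-add scalar multiplication."""
    result = None
    while scalar:
        if scalar & 1:
            result = add(result, A)
        A = add(A, A)
        scalar >>= 1
    return result

solution_is_valid = mul(k, P) == Q
```

Checking a claimed logarithm takes only a few hundred point operations, in contrast to the months of computation needed to find it.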
Expected # iterations:
[Photos: machines connected to the cluster; machine used for programming purposes.]