Pollard Rho on the PlayStation 3. Joppe W. Bos (1), Marcelo E. Kaihara (1), Peter L. Montgomery (2). (1) EPFL IC LACAL, CH-1015 Lausanne, Switzerland. (2) Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA. RAIM’09, October 27th 2009, LIP ENS Lyon.


SLIDE 1

Pollard Rho on the PlayStation 3

RAIM’09, October 27th 2009, LIP ENS Lyon

Joppe W. Bos (1), Marcelo E. Kaihara (1), Peter L. Montgomery (2)

(1) EPFL IC LACAL, CH-1015 Lausanne, Switzerland
(2) Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA

SLIDE 2

Motivation

Elliptic Curve Cryptography (ECC):

  • Widely standardized:
      • Standards for Efficient Cryptography 2, SEC 2 (112-521 bit)
      • Wireless Transport Layer Security Specification (112-224 bit)
      • Digital Signature Standard, FIPS 186-3, NIST (192-521 bit)
  • Security relies on the hardness of solving the Elliptic Curve Discrete Logarithm Problem (ECDLP)
  • Are the standardized key sizes secure?
  • What is the practical cost of solving the ECDLP?

SLIDE 3

Objective

  • Evaluate the cost of solving the ECDLP for small key sizes: the 112-bit ECC standard
  • Use a broadly available platform, the PlayStation 3:
      • Low price
      • Hybrid multi-core architecture
  • Implement Pollard rho on the Cell architecture
  • Design SIMD arithmetic algorithms
  • Optimize modular arithmetic for the 112-bit prime

SLIDE 4

ECDLP

Settings:

  • $E$ is an elliptic curve over $\mathbb{F}_p$, with $p$ an odd prime
  • $P \in E(\mathbb{F}_p)$ is a point of order $n$
  • $Q = k \cdot P \in \langle P \rangle$

Problem:

  • Given $E$, $p$, $n$, $P$ and $Q$, what is $k = \log_P Q$?

Largest solved instance: 109-bit prime field (2002)

It took “$10^4$ computers (mostly PCs) running 24 hours a day for 549 days”.

SLIDE 5

Solving the ECDLP

Pollard rho:

  • The most efficient algorithm in the literature (for generic curves).
  • The underlying idea of this method is to search for two distinct pairs $(c_i, d_i), (c_j, d_j) \in \mathbb{Z}/n\mathbb{Z} \times \mathbb{Z}/n\mathbb{Z}$ such that

$$c_i \cdot P + d_i \cdot Q = c_j \cdot P + d_j \cdot Q$$

$$(c_i - c_j) \cdot P = (d_j - d_i) \cdot Q = (d_j - d_i) \cdot k \cdot P$$

$$k \equiv (c_i - c_j) \cdot (d_j - d_i)^{-1} \bmod n$$

  • J.M. Pollard. Monte Carlo methods for index computation (mod p). Mathematics of Computation, 32:918-924, 1978.
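The last congruence can be checked mechanically. The following is a minimal sketch, assuming the order n is prime (so every non-zero difference of the d values is invertible); the function name `recover_k` and the toy numbers are illustrative and not taken from the talk. The toy check replaces the curve by the additive group Z/nZ, where the "point" c·P + d·Q is simply (c + d·k) mod n.

```python
def recover_k(ci, di, cj, dj, n):
    """Solve (ci - cj) = (dj - di) * k (mod n) for k; n is the prime order of P."""
    denom = (dj - di) % n
    if denom == 0:
        raise ValueError("useless collision: d_i == d_j (mod n)")
    return (ci - cj) * pow(denom, -1, n) % n


# Toy check (hypothetical values): force c_i + d_i*k == c_j + d_j*k (mod n).
n, k = 1009, 123
ci, di, dj = 17, 5, 42
cj = (ci + (di - dj) * k) % n
assert recover_k(ci, di, cj, dj, n) == k
```

The `ValueError` branch corresponds to a collision with equal d values, which carries no information about k; a real search would simply continue.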

SLIDE 6

Pollard Rho

  • “Walk” through the set $\langle P \rangle$: $X_i = c_i \cdot P + d_i \cdot Q$
  • Iteration function $f : \langle P \rangle \to \langle P \rangle$, with $X_{i+1} = f(X_i)$, $i \ge 0$
  • This sequence eventually collides
  • Expected number of iterations: $\sqrt{\pi \cdot |\langle P \rangle| / 2}$

SLIDE 7

Optimization I

Parallel version: distinguish points and send them to a central server

P.C. van Oorschot and M.J. Wiener, [1999].

  • Mark points with a certain property, e.g. for $X_i = (x_i, y_i)$ the distinguished point (DPT) property $2^{24} \mid x_i$
  • Communicate them to a central DB to check for collisions
  • Leads to a linear speed-up in the number of processors.

[Figure: three independent walks $X_{i-1}, X_i, X_{i+1}, X_{i+2}$; $X'_{i-1}, X'_i, X'_{i+1}, X'_{i+2}$; and $X''_{i-1}, X''_i, X''_{i+1}, X''_{i+2}$, all reporting to the central DB]
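The distinguished-point idea can be sketched in a few lines. This is an illustration only: the class `CollisionDB` and its `report` method are hypothetical names, and an in-memory dictionary stands in for the central database mentioned on the slide.

```python
DP_BITS = 24  # the slide's example property: 2**24 divides x_i


def is_distinguished(x, bits=DP_BITS):
    """True when the low `bits` bits of the x-coordinate are all zero."""
    return (x & ((1 << bits) - 1)) == 0


class CollisionDB:
    """Hypothetical central store: distinguished point -> (c, d) that reached it."""

    def __init__(self):
        self.seen = {}

    def report(self, point, c, d):
        """Return the earlier (c, d) if `point` was already reported, else None."""
        prev = self.seen.get(point)
        if prev is None:
            self.seen[point] = (c, d)
            return None
        return prev if prev != (c, d) else None


db = CollisionDB()
first = db.report((3 << 24, 7), 11, 22)    # first walk reaches the point
second = db.report((3 << 24, 7), 33, 44)   # second walk reaches the same point
```

Each walker only sends a point when `is_distinguished` holds, so on average one point in 2^24 iterations is communicated; two walks that ever meet follow the same path afterwards and therefore report the same distinguished point.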

SLIDE 8

Optimization II

r-adding walks, E. Teske, [2001].

  • Divide $\langle P \rangle$ into $r$ different partitions, $r \ge 16$
  • Partition mapping: $h : \langle P \rangle \to [0, r-1]$
  • For each partition, a random point: $R_j = c_j \cdot P + d_j \cdot Q$
  • Iteration: $X_{i+1} = f(X_i) = X_i + R_{h(X_i)}$
  • Use the least significant 4 bits to determine the next partition

[Figure: from $X_i = (x_i, y_i)$, reached from $X_{i-1}$, the walk continues to $X_{i+1}$, which is one of $X_i + R_0$, $X_i + R_1$, ..., $X_i + R_{15}$]
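The r-adding walk can be tried end to end on a toy curve. The sketch below makes several assumptions that are not in the talk: the textbook curve y^2 = x^3 + 2x + 2 over F_17 with base point (5, 1) of prime order 19, a plain dictionary of visited points instead of distinguished points, and Python's `random.Random` for the random steps. All function names are illustrative.

```python
import random

# Toy parameters (assumed for illustration, not the 112-bit target curve).
p, a, n = 17, 2, 19
P = (5, 1)
O = None  # point at infinity


def add(A, B):
    """Affine point addition, including doubling and the point at infinity."""
    if A is O:
        return B
    if B is O:
        return A
    (x1, y1), (x2, y2) = A, B
    if x1 == x2 and (y1 + y2) % p == 0:
        return O
    if A == B:
        mu = (3 * x1 * x1 + a) * pow(2 * y1 % p, -1, p) % p
    else:
        mu = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p
    x3 = (mu * mu - x1 - x2) % p
    return (x3, (mu * (x1 - x3) - y1) % p)


def mul(k, A):
    """Double-and-add scalar multiplication."""
    result = O
    while k:
        if k & 1:
            result = add(result, A)
        A, k = add(A, A), k >> 1
    return result


def h(X, r=16):
    """Partition number: least significant 4 bits of x when r = 16."""
    return 0 if X is O else X[0] % r


def rho(Q, rng, r=16):
    """Return k with Q = k*P, using an r-adding walk and a table of visited points."""
    while True:
        steps = []  # one random R_j = c_j*P + d_j*Q per partition
        for _ in range(r):
            cj, dj = rng.randrange(n), rng.randrange(n)
            steps.append((cj, dj, add(mul(cj, P), mul(dj, Q))))
        c, d = rng.randrange(n), rng.randrange(n)
        X = add(mul(c, P), mul(d, Q))
        seen = {}
        while X not in seen:
            seen[X] = (c, d)
            cj, dj, Rj = steps[h(X, r)]
            X, c, d = add(X, Rj), (c + cj) % n, (d + dj) % n
        ci, di = seen[X]
        if (d - di) % n:  # useful collision
            return (ci - c) * pow((d - di) % n, -1, n) % n
        # otherwise d_i == d_j (mod n): retry with fresh random steps


k_secret = 7
Q = mul(k_secret, P)
k_found = rho(Q, random.Random(1))
```

On a group of 19 elements the walk repeats almost immediately; the structure of the iteration (partition lookup, one point addition, two additions modulo n) is the same at any size.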

SLIDE 9

Optimization III

Simultaneous inversion: trade inversions for multiplications

P.L. Montgomery, [1987].

  • Suitable for cryptanalytic purposes
  • Trade M modular inversions for 3(M-1) modular multiplications and 1 modular inversion
  • Affine Weierstrass representation
  • Apply to independent walks

[Figure: three independent walks $X_{i-1}, X_i, X_{i+1}, X_{i+2}$; $X'_{i-1}, X'_i, X'_{i+1}, X'_{i+2}$; and $X''_{i-1}, X''_i, X''_{i+1}, X''_{i+2}$, advanced together]
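The simultaneous-inversion trick itself is short. A minimal sketch, assuming a prime modulus and non-zero inputs; the name `batch_inverse` is illustrative.

```python
def batch_inverse(values, p):
    """Invert every element of `values` modulo the prime p with one modular inversion."""
    m = len(values)
    prefix = [1] * (m + 1)
    for i, v in enumerate(values):          # running products v_0 * ... * v_i
        prefix[i + 1] = prefix[i] * v % p
    inv = pow(prefix[m], -1, p)             # the only inversion
    out = [0] * m
    for i in range(m - 1, -1, -1):          # peel one factor off per step
        out[i] = inv * prefix[i] % p        # (v_0..v_i)^-1 * (v_0..v_{i-1}) = v_i^-1
        inv = inv * values[i] % p           # now (v_0..v_{i-1})^-1
    return out


inverses = batch_inverse([2, 3, 5, 1008], 1009)
```

Apart from the trivial multiplications by 1 at the two ends, the loops perform about 3(M-1) multiplications, which is the trade quoted on the slide.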

SLIDE 10

Optimization IV

Negation Map (not used)

M.J. Wiener and R.J. Zuccherato, [1998].

  • Computation of the negative is cheap: for $P = (x, y)$, $-P = (x, -y)$
  • Given an equivalence relation ~ on $\langle P \rangle$ (with $P \sim -P$), iterate over the set of equivalence classes $\langle P \rangle / \sim$
  • Reduce the search space by a factor of 2

[Figure: a walk over the classes $\{(x_{i-1}, y_{i-1}), (x_{i-1}, -y_{i-1})\}$, $\{(x_i, y_i), (x_i, -y_i)\}$, $\{(x_{i+1}, y_{i+1}), (x_{i+1}, -y_{i+1})\}$, with steps $R_{i-1}$ and $R_i$]

SLIDE 11

The PlayStation 3

The Cell contains

  • 1 “Power Processor Element” (PPE)
  • 8 “Synergistic Processing Elements” (SPEs) (6 available to the user in the PS3 under Linux)

Characteristics of the SPEs:

  • Synergistic Processing Unit (SPU)
  • Access to 128 registers of 128 bits
  • SIMD operations
  • Dual pipeline (odd and even)
  • In-order processor
  • 256 KB of fast local memory (Local Store)

SLIDE 12

Programming Constraints

Memory

  • The executable and all data should fit in the LS (256 KB).

Branches

  • No “smart” dynamic branch prediction.
  • Instead, “prepare-to-branch” instructions redirect instruction prefetch to branch targets.

Instruction set limitations

  • 16 x 16 → 32 bit multipliers (4-SIMD)

Dual pipeline

  • One odd and one even instruction can be dispatched per clock cycle.

SLIDE 13

Arithmetic

Using the affine Weierstrass representation.

Let $P, Q \in E(\mathbb{F}_p) \setminus \{O\}$ with $P = (x_1, y_1)$ and $Q = (x_2, y_2)$. If $P \ne -Q$ then $P + Q = (x_3, y_3)$ with

$$x_3 = \mu^2 - x_1 - x_2, \qquad y_3 = \mu (x_1 - x_3) - y_1$$

$$\mu = \begin{cases} \dfrac{y_2 - y_1}{x_2 - x_1} & \text{if } P \ne Q \\[2ex] \dfrac{3 x_1^2 + a}{2 y_1} & \text{if } P = Q \end{cases}$$

Using Montgomery’s simultaneous inversion and running M curves in parallel, one addition costs:

  • 6 modular multiplications
  • 6 modular subtractions
  • $\frac{1}{M}$ modular inversions
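The operation count follows from combining the addition formulas with a shared inversion. The sketch below assumes the generic case only (the two x-coordinates differ in every pair, so no doubling and no point at infinity); the name `batch_add` and the toy curve over F_17 used in the example are illustrative.

```python
def batch_add(As, Bs, p):
    """Compute A_i + B_i over F_p for all i, sharing one modular inversion."""
    dx = [(x2 - x1) % p for (x1, _), (x2, _) in zip(As, Bs)]
    prefix = [1]
    for v in dx:                                   # 1 multiplication per pair
        prefix.append(prefix[-1] * v % p)
    inv = pow(prefix[-1], -1, p)                   # single inversion for the batch
    out = [None] * len(dx)
    for i in range(len(dx) - 1, -1, -1):
        inv_dx = inv * prefix[i] % p               # 1 / (x2 - x1) for pair i
        inv = inv * dx[i] % p
        (x1, y1), (x2, y2) = As[i], Bs[i]
        mu = (y2 - y1) * inv_dx % p
        x3 = (mu * mu - x1 - x2) % p
        out[i] = (x3, (mu * (x1 - x3) - y1) % p)
    return out


# Toy example on y^2 = x^3 + 2x + 2 over F_17: P + 2P and P + 3P for P = (5, 1).
sums = batch_add([(5, 1), (5, 1)], [(6, 3), (10, 6)], 17)
```

Per pair the loops use six multiplications (one for the running product, two to peel off the inverse, then $\mu$, $\mu^2$ and $\mu (x_1 - x_3)$) and six subtractions, while the single `pow(..., -1, p)` is shared by all M pairs. If any two x-coordinates coincide the product is zero and `pow` raises `ValueError`, so a real implementation has to treat that case separately.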

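SLIDE 14

Integer Representation

Integers A, B, C, D are represented in radix $2^{16}$:

$$A = \sum_{i=0}^{m-1} a_i \cdot 2^{16 \cdot i}, \quad B = \sum_{i=0}^{m-1} b_i \cdot 2^{16 \cdot i}, \quad C = \sum_{i=0}^{m-1} c_i \cdot 2^{16 \cdot i}, \quad D = \sum_{i=0}^{m-1} d_i \cdot 2^{16 \cdot i}$$

4-SIMD layout: vector V[i] holds the i-th limbs of all four integers, one per 32-bit word (each word has a 16-bit high and a 16-bit low half):

V[0]   = [ a_0     | b_0     | c_0     | d_0     ]
V[i]   = [ a_i     | b_i     | c_i     | d_i     ]
V[m-1] = [ a_{m-1} | b_{m-1} | c_{m-1} | d_{m-1} ]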
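The radix-$2^{16}$ layout can be mimicked with plain Python integers. This is only a model of the data layout, not SPU code; the helper names and the default of m = 8 limbs (enough for 128-bit values) are assumptions.

```python
LIMB_BITS = 16


def to_limbs(value, m):
    """Split a non-negative integer into m limbs of 16 bits, least significant first."""
    mask = (1 << LIMB_BITS) - 1
    return [(value >> (LIMB_BITS * i)) & mask for i in range(m)]


def from_limbs(limbs):
    """Inverse of to_limbs: sum of limb_i * 2^(16*i)."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def pack4(A, B, C, D, m=8):
    """V[i] = (a_i, b_i, c_i, d_i): the i-th limb of each integer side by side."""
    return list(zip(to_limbs(A, m), to_limbs(B, m), to_limbs(C, m), to_limbs(D, m)))


V = pack4(1, 2, 3, 0x10000)
```

With this interleaving, one 4-way SIMD instruction applied to V[i] performs the same limb operation for four independent integers at once.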

SLIDE 15

Modular Reduction

The 112-bit prime $p$ in the target curve $E(\mathbb{F}_p)$ is

$$p = \mathrm{DB7C2ABF62E35E668076BEAD208B}_{16}$$

SLIDE 16

Modular Reduction

The 112-bit prime $p$ in the target curve $E(\mathbb{F}_p)$ is

$$p = \mathrm{DB7C2ABF62E35E668076BEAD208B}_{16} = \frac{2^{128} - 3}{11 \cdot 6949}$$

SLIDE 17

Modular Reduction

The 112-bit prime $p$ in the target curve $E(\mathbb{F}_p)$ is

$$p = \mathrm{DB7C2ABF62E35E668076BEAD208B}_{16} = \frac{2^{128} - 3}{11 \cdot 6949}$$

Perform calculations using a redundant representation, modulo

$$\tilde{p} = 11 \cdot 6949 \cdot p = 2^{128} - 3$$

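SLIDE 18

Fast reduction

Use the modulus $\tilde{p} = 2^{128} - 3 = 11 \cdot 6949 \cdot p$.

$$x = x_H \cdot 2^{128} + x_L \equiv x_L + 3 \cdot x_H = R(x) \pmod{\tilde{p}}$$

$$R : \mathbb{Z}/2^{256}\mathbb{Z} \to \mathbb{Z}/2^{256}\mathbb{Z}, \qquad x \mapsto (x \bmod 2^{128}) + 3 \cdot \left\lfloor \frac{x}{2^{128}} \right\rfloor$$

[Figure: $x$ is split at $2^{128}$ into $x_h$ and $x_l$, and $3 \times x_h$ is added to $x_l$ to give $x'$; $x'$ is split into $x'_h$ and $x'_l$, and $3 \times x'_h$ is added to $x'_l$ to give $v$. The result $v$ has $v_h \in \{0, 1\}$, and $v_h = 0$ with overwhelming probability.]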
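The map R is easy to model with Python integers. A minimal sketch; the constant names are illustrative, and the sample input is arbitrary.

```python
P = 0xDB7C2ABF62E35E668076BEAD208B     # the 112-bit prime from the slides
P_TILDE = 2**128 - 3                    # redundant modulus
MASK128 = 2**128 - 1


def R(x):
    """One reduction step: x = xH*2^128 + xL  ->  xL + 3*xH."""
    return (x & MASK128) + 3 * (x >> 128)


x = (2**200 + 12345) * 987654321        # some value below 2^256
once, twice = R(x), R(R(x))
```

Because $2^{128} \equiv 3 \pmod{\tilde{p}}$, each application of `R` keeps the residue modulo `P_TILDE` unchanged while shrinking the value; two applications bring any input below $2^{256}$ down to roughly 128 bits.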

SLIDE 19

Fast Modular Multiplication

Proposition: For independent random 128-bit non-negative integers x and y there is overwhelming probability that

$$0 \le R(R(x \cdot y)) < \tilde{p}$$

Counter-examples are easy to construct; the bound that always holds is

$$0 \le R(R(x)) < 2^{128} + 6$$

During the whole run there was not a single faulty reduction.
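Both statements on this slide can be exercised numerically. The sketch below redefines R so that it is self-contained; the function name `mulmod_redundant`, the random seed and the hand-made counter-example are illustrative choices, not taken from the talk.

```python
import random

P = 0xDB7C2ABF62E35E668076BEAD208B
P_TILDE = 2**128 - 3


def R(x):
    """x = xH*2^128 + xL  ->  xL + 3*xH (congruent to x modulo P_TILDE)."""
    return (x & (2**128 - 1)) + 3 * (x >> 128)


def mulmod_redundant(x, y):
    """Multiply two residues kept modulo P_TILDE; the result is congruent to x*y mod P."""
    return R(R(x * y))


rng = random.Random(2009)
results = []
for _ in range(1000):
    x, y = rng.getrandbits(128), rng.getrandbits(128)
    results.append((x, y, mulmod_redundant(x, y)))

# A hand-made input whose result is not below P_TILDE:
counter_example = mulmod_redundant(2**128 - 1, 1)
```

The result is always congruent to x·y modulo p because p divides $\tilde{p}$; the only thing that can fail, and only for specially chosen inputs such as the one above, is the size bound `< P_TILDE`.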

SLIDE 20

Distinguished Point Property

  • Need to uniquely determine the partition number and the DPT property during the r-adding walk.
  • For $P = (x, y)$ the coordinate is only known in the redundant representation, $x : 0 \le x < \tilde{p}$.
  • Partial Montgomery reduction in order to reduce modulo $p$: $x' = x \cdot 2^{-16} \bmod p$
  • Check the least significant 24 bits of $x$ in partial Montgomery representation.
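One way to picture the partial Montgomery reduction is a single Montgomery step with radix $2^{16}$. This is a sketch under assumptions: the slides do not show how the final unique representative is obtained, so a plain `% P` stands in for that step, and the function names are hypothetical.

```python
P = 0xDB7C2ABF62E35E668076BEAD208B
NEG_P_INV = (-pow(P, -1, 2**16)) % 2**16      # -P^(-1) mod 2^16


def partial_montgomery(x):
    """Return x * 2^(-16) mod P as the unique representative in [0, P)."""
    m = ((x & 0xFFFF) * NEG_P_INV) & 0xFFFF   # makes x + m*P divisible by 2^16
    t = (x + m * P) >> 16
    return t % P                              # stand-in for the final canonicalisation


def partition_and_dp(x, dp_bits=24):
    """Partition number (low 4 bits) and distinguished-point flag, both from x'."""
    xr = partial_montgomery(x)
    return xr & 0xF, (xr & ((1 << dp_bits) - 1)) == 0


x = 0x1234567890ABCDEF1234567890ABCDEF % P
same = partial_montgomery(x) == partial_montgomery(x + 5 * P)
```

The point of the step is that different redundant representatives of the same residue (here `x` and `x + 5 * P`, both below $\tilde{p}$) yield the same value, so every walker derives the same partition number and the same distinguished-point decision.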

SLIDE 21

Modular Inversion

Based on the Extended Binary GCD algorithm:

  • Compute $\gcd(A_1, A_2)$ on the state $[A_1, B_1, A_2, B_2]$, starting from $[p, 0, x, 1]$, with
    $A_1 \times 1 \equiv B_1 \times x \bmod p$ and $A_2 \times 1 \equiv B_2 \times x \bmod p$
  • Update steps:
      $[A_1, B_1, A_2, B_2] \leftarrow [A_1 - A_2, B_1 - B_2, A_2, B_2]$
      $[A_1, B_1, A_2, B_2] \leftarrow [A_1, B_1, A_2 - A_1, B_2 - B_1]$
      $[A_1, B_1, A_2, B_2] \leftarrow [A_1 \gg t_1, B_1 \ll t_2, A_2 \gg t_2, B_2 \ll t_1]$
  • Result: the almost Montgomery inverse $z = x^{-1} \cdot 2^k \bmod p$
  • Obtain $x^{-1} \bmod p$ from the almost Montgomery inverse ($r = 2^{32}$)

SIMD operations: branches significantly reduced
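For reference, the textbook (Kaliski-style) almost Montgomery inverse is sketched below. It is a swapped-in scalar variant, not the branch-reduced SIMD version of the talk: it shifts by one bit per step instead of by $t_1$ or $t_2$ bits, and it keeps the B values non-negative by adding instead of subtracting. The function names are illustrative.

```python
def almost_montgomery_inverse(x, p):
    """Return (z, k) with z = x^(-1) * 2^k mod p, for an odd prime p and 0 < x < p."""
    u, v, r, s, k = p, x, 0, 1, 0
    while v > 0:
        if u % 2 == 0:
            u, s = u >> 1, s << 1
        elif v % 2 == 0:
            v, r = v >> 1, r << 1
        elif u > v:
            u, r, s = (u - v) >> 1, r + s, s << 1
        else:
            v, s, r = (v - u) >> 1, s + r, r << 1
        k += 1
    if r >= p:
        r -= p
    return p - r, k


def inverse(x, p):
    """Remove the factor 2^k to obtain the ordinary inverse."""
    z, k = almost_montgomery_inverse(x, p)
    return z * pow(2, -k, p) % p


p = 0xDB7C2ABF62E35E668076BEAD208B
z, k = almost_montgomery_inverse(3, 7)
```

The loop only needs shifts, additions, subtractions and comparisons, which is what makes the binary GCD family attractive on hardware without a fast divider.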

SLIDE 22

Modular Inversion

SLIDE 23

Performance Results

[ 1 SPU, 4-SIMD @ 3.2 GHz ]

Operation          #cycles per operation   #operations per iteration   #cycles per iteration
Mod Mul                     53                       6                        318
Mod Sub                      5                       6                         30
Partial Mon Red             24                       1                         24
Mod Inv                   4941                     1/400                       12
Misc.                       69                       1                         69
Total                                                                         453

It works on > 0.5M curves in parallel.

Hence, our cluster of 214 PS3s computes $9.1 \cdot 10^9 \approx 2^{33}$ iterations per sec.
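The totals can be re-derived from the table. The sketch below assumes that the 453 cycles are the amortised cost of one iteration on one SPU, that 6 SPUs per PS3 are used, and that the 1/400 in the Mod Inv row means 400 walks share one inversion on each SPU; the variable names are illustrative.

```python
from fractions import Fraction

# (cycles per operation, operations per iteration), as listed in the table
costs = {
    "Mod Mul":         (53,   Fraction(6)),
    "Mod Sub":         (5,    Fraction(6)),
    "Partial Mon Red": (24,   Fraction(1)),
    "Mod Inv":         (4941, Fraction(1, 400)),
    "Misc.":           (69,   Fraction(1)),
}
cycles_per_iter = round(sum(c * n for c, n in costs.values()))

CLOCK_HZ, SPUS_PER_PS3, PS3S, WALKS_PER_SPU = 3.2e9, 6, 214, 400   # assumptions, see above
per_ps3 = CLOCK_HZ / cycles_per_iter * SPUS_PER_PS3      # iterations per second per PS3
cluster = per_ps3 * PS3S                                 # iterations per second, 214 PS3s
curves_in_parallel = PS3S * SPUS_PER_PS3 * WALKS_PER_SPU

print(cycles_per_iter, f"{per_ps3:.2e}", f"{cluster:.2e}", curves_in_parallel)
```

Under these assumptions the numbers line up with the slides: 453 cycles per iteration, about $4.2 \cdot 10^7$ iterations per second per PS3, about $9.1 \cdot 10^9$ for the cluster, and 513600 (more than 0.5M) walks in flight.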

SLIDE 24

Performance Comparison

XC3S1000 FPGAs [1]

  • FPGA results of EC over 96 and 128-bit generic prime fields for COPACOBANA [2]
  • Can host up to 120 FPGAs (Cost: 10’000 USD)

PlayStation 3 (our implementation)

  • Targeted at the 112-bit prime curve.
  • Use 128-bit multiplication + fast reduction modulo $\tilde{p}$
  • For 10’000 USD: 33 PS3s

[1] T. Güneysu, C. Paar, and J. Pelzl. Special-purpose hardware for solving the elliptic curve discrete logarithm problem. ACM Transactions on Reconfigurable Technology and Systems, 1(2):1-21, 2008.
[2] S. Kumar, C. Paar, J. Pelzl, G. Pfeiffer, and M. Schimmler. Breaking ciphers with COPACOBANA: a cost-optimized parallel code breaker. In CHES 2006, vol. 4249 of LNCS, pages 101-118, 2006.

SLIDE 25

Comparison

Table: Iterations per second

                          96 bits          128 bits
COPACOBANA (XC3S1000)     4.0 · 10^7       2.1 · 10^7
  + Moore’s law           7.9 · 10^7       4.2 · 10^7
  + Negation map          1.1 · 10^8       5.9 · 10^7
PS3                              4.2 · 10^7
33 PS3                           1.4 · 10^9

  • 33 PS3 / COPACOBANA (96 bits): 12.4 times faster
  • 33 PS3 / COPACOBANA (128 bits): 23.8 times faster

Note: numbers without using the 33 dual-threaded PPEs!

SLIDE 26

The 112-bit Solution

The point $P$ of order $n$ is given in the standard. The $x$-coordinate of $Q$ was chosen as $\lfloor (\pi - 3) \cdot 10^{34} \rfloor$.

P = (188281465057972534892223778713752, 3419875491033170827167861896082688)
Q = (1415926535897932384626433832795028, 3846759606494706724286139623885544)
n = 4451685225093714776491891542548933

Q = 312521636014772477161767351856699 · P

Expected # iterations: $\sqrt{\pi \cdot n / 2} \approx 8.4 \cdot 10^{16}$

January 13, 2009 – July 8, 2009 (not run continuously). If run continuously, using the latest version of our code, the same calculation would have taken 3.5 months.
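The expected effort and the running time can be cross-checked with a few lines of arithmetic. The per-PS3 throughput used below is an assumption derived from the performance table (453 cycles per iteration, 6 SPUs at 3.2 GHz), as is the use of a 365-day year.

```python
import math

n = 4451685225093714776491891542548933
expected_iterations = math.sqrt(math.pi * n / 2)

ITER_PER_SEC_PER_PS3 = 3.2e9 / 453 * 6     # assumption: 453 cycles/iteration, 6 SPUs
SECONDS_PER_YEAR = 365 * 24 * 3600
ps3_years = expected_iterations / ITER_PER_SEC_PER_PS3 / SECONDS_PER_YEAR
months_on_214 = ps3_years / 214 * 12

print(f"{expected_iterations:.2e} iterations, {ps3_years:.1f} PS3 years, "
      f"{months_on_214:.1f} months on 214 PS3s")
```

Under these assumptions the result is close to $8.4 \cdot 10^{16}$ iterations, a little under 63 PS3 years, and about 3.5 months on 214 consoles, in line with the figures quoted in the talk.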

SLIDE 27

Conclusions

  • We have measured the hardness of solving the ECDLP on a 112-bit prime field. It requires 62.6 PS3 years to solve.
  • We have presented modular arithmetic algorithms using SIMD instructions, optimized for the 112-bit prime.
  • Set a new record for solving the ECDLP.
  • Do not use the standardized elliptic curve over 112-bit prime fields!

SLIDE 28

The PS3 cluster at LACAL

  • Cluster room: 190 PS3s
  • PlayLaB: 6 x 4 PS3s

(connected to the cluster)

  • Offices: 5 PS3s

(for programming purposes)

  • Total: 219 PS3s