CUDA based implementation of parallelized Pollard's Rho algorithm for ECDLP
M. Chinnici (a), S. Cuomo (b), M. Laporta (c), B. Pennacchio (d), A. Pizzirani (e), S. Migliori (f)
(a, d) ENEA-FIM-INFOPPQ, Casaccia Research Center, Via Anguillarese 301, 00123 S. Maria di Galeria, Italy
(b, c, d, e) Università Federico II, Dipartimento di Matematica e Applicazioni “R. Caccioppoli”, Via Cinthia, 80136 Napoli, Italy
(f) ENEA-FIM, ENEA-Sede, Lungotevere Thaon di Revel n. 76, 00196 Roma, Italy
Pollard’s rho algorithm
The best general-purpose algorithm for solving instances of the ECDLP is Pollard’s rho algorithm. The algorithm, proposed by Pollard, uses an iterating function f:〈P〉→〈P〉 to build a walk in the subgroup 〈P〉 (generated by the point P) of the group of points of the elliptic curve. For the ECDLP, the starting point of the algorithm is a linear combination of P and Q (mP ⊕ nQ), and the function is iterated until a point A = aP ⊕ bQ belonging to the walk is generated a second time as A’ = a’P ⊕ b’Q = A, producing a “collision”.
If a “good” collision is found (one in which b’ − b is invertible modulo |〈P〉|), then from aP ⊕ bQ = a’P ⊕ b’Q we can compute the value k used to compute Q = kP, namely k = (a − a’)(b’ − b)^(−1) modulo |〈P〉|. Pollard [3] showed that if this walk is “random enough”, the algorithm has an expected running time of (π|〈P〉|/2)^(1/2). Further optimizations of the algorithm have been proposed by Teske [4,5], who modified the iterating function; by van Oorschot and Wiener [6], who showed that the algorithm can be efficiently parallelized on R processors, obtaining a speedup of R; and by Floyd [7], whose cycle-finding method removes the need to store all points to check for collisions. In the parallel version, collisions are instead searched for in a subset of the points of the walk (the distinguished points).
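The collision search described above can be sketched in Python on a toy curve. This is only an illustrative, sequential sketch, not the CUDA implementation: the curve y^2 = x^3 + 2x + 2 over F_17, the generator P = (5, 1) of order 19, the partition rule (x mod 3) and all function names are assumptions chosen here to keep the example small. It uses Pollard's original 3-way iterating function and Floyd's "tortoise and hare" comparison.

```python
# Toy parameters (assumed for illustration): y^2 = x^3 + 2x + 2 over F_17,
# P = (5, 1) generates a group of prime order N = 19.
PRIME, A_COEF, N = 17, 2, 19
P = (5, 1)

def add(P1, P2):
    """Affine point addition; None stands for the point at infinity."""
    if P1 is None:
        return P2
    if P2 is None:
        return P1
    (x1, y1), (x2, y2) = P1, P2
    if x1 == x2 and (y1 + y2) % PRIME == 0:
        return None
    if P1 == P2:
        lam = (3 * x1 * x1 + A_COEF) * pow(2 * y1 % PRIME, -1, PRIME) % PRIME
    else:
        lam = (y2 - y1) * pow((x2 - x1) % PRIME, -1, PRIME) % PRIME
    x3 = (lam * lam - x1 - x2) % PRIME
    return (x3, (lam * (x1 - x3) - y1) % PRIME)

def mul(k, pt):
    """Scalar multiple k*pt by double-and-add."""
    acc = None
    while k > 0:
        if k & 1:
            acc = add(acc, pt)
        pt = add(pt, pt)
        k >>= 1
    return acc

def step(A, a, b, Q):
    """One iteration of the walk, keeping the invariant A = aP + bQ."""
    part = 0 if A is None else A[0] % 3      # hypothetical partition rule
    if part == 0:
        return add(A, P), (a + 1) % N, b     # A <- A + P
    if part == 1:
        return add(A, A), 2 * a % N, 2 * b % N   # A <- 2A
    return add(A, Q), a, (b + 1) % N         # A <- A + Q

def rho(Q, a0=1, b0=1):
    """Return k with Q = kP, or None if the collision is not a 'good' one."""
    start = (add(mul(a0, P), mul(b0, Q)), a0, b0)
    tortoise = hare = start
    while True:
        tortoise = step(*tortoise, Q)        # one step
        hare = step(*step(*hare, Q), Q)      # two steps
        if tortoise[0] == hare[0]:
            break
    _, a, b = tortoise
    _, a2, b2 = hare
    if (b2 - b) % N == 0:
        return None                          # retry with another (a0, b0)
    return (a - a2) * pow((b2 - b) % N, -1, N) % N

Q = mul(7, P)
k = rho(Q)
assert k is not None and mul(k, P) == Q
```

A `None` result means the collision carried no information about k; in that case the walk is simply restarted from a different pair (a0, b0).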
Setting
Elliptic curves are geometric objects that also have the “dual nature” of algebraic objects. The set of their points, together with a so-called “point at infinity”, can be given a group structure. This means that the points of this set, together with a well-defined operation (usually called sum, and denoted ⊕), have some interesting properties:
- operation is associative;
- existence of an identity (the point at infinity);
- existence of inverses;
- operation is commutative.
Elliptic curves keep their group structure regardless of the ground field, so we can consider groups of points of elliptic curves defined over the complex numbers, the reals, the rationals and finite fields. The group of points of an elliptic curve defined over a finite field was proposed in the mid 1980s (independently) by Koblitz [1] and Miller [2] as the basis for a cryptosystem. Embedding a message (in some way) into a point P of a curve and choosing an integer k as key, we can compute a multiple kP of this point simply by repeated addition of P, computing 2P = P ⊕ P, 3P = P ⊕ P ⊕ P, …, kP = P ⊕ P ⊕ … ⊕ P.
The multiple Q = kP of the point P is considered the enciphered message. The security of cryptosystems based on elliptic curves relies on the difficulty of inverting this process: given Q, known to be a multiple of a point P, it is really hard to compute the value k such that Q = kP. This problem is called the ECDLP.
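To make the group operation and the multiple kP concrete, here is a minimal Python sketch. The toy curve y^2 = x^3 + 2x + 2 over F_17 with P = (5, 1), a point of order 19, and all function names are assumptions chosen for illustration; real cryptosystems use fields of hundreds of bits. The sketch shows repeated addition as described in the text, the faster double-and-add method, and an exhaustive search for k that is feasible only because the group is tiny.

```python
# Toy parameters (assumed for illustration): y^2 = x^3 + 2x + 2 over F_17.
p, a = 17, 2
P = (5, 1)          # generates a group of order 19
INF = None          # the point at infinity, identity of the group

def add(P1, P2):
    """Sum of two points in affine coordinates."""
    if P1 is INF:
        return P2
    if P2 is INF:
        return P1
    (x1, y1), (x2, y2) = P1, P2
    if x1 == x2 and (y1 + y2) % p == 0:
        return INF                                   # P2 is the inverse of P1
    if P1 == P2:
        lam = (3 * x1 * x1 + a) * pow(2 * y1 % p, -1, p) % p   # tangent slope
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p        # chord slope
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)

def mul_repeated(k, pt):
    """kP by repeated addition of P, as in the text."""
    acc = INF
    for _ in range(k):
        acc = add(acc, pt)
    return acc

def mul(k, pt):
    """kP by double-and-add, using a number of additions logarithmic in k."""
    acc = INF
    while k > 0:
        if k & 1:
            acc = add(acc, pt)
        pt = add(pt, pt)
        k >>= 1
    return acc

def brute_force_dlog(Q, pt, order):
    """Solve Q = kP by trying every k; only possible for toy groups."""
    acc = INF
    for k in range(order):
        if acc == Q:
            return k
        acc = add(acc, pt)
    return None

Q = mul(7, P)
assert mul_repeated(7, P) == Q
assert brute_force_dlog(Q, P, 19) == 7
```

Computing Q from k takes a handful of additions, while recovering k from Q by exhaustive search takes a number of additions proportional to the group order; this asymmetry is what the ECDLP relies on.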
Introduction
The recent introduction by NVidia of the CUDA (Compute Unified Device Architecture) libraries for HPC (High Performance Computing) on GPUs (Graphics Processing Units) has started a trend of using video cards to solve many computationally hard problems in different areas such as (among others) fluid dynamics, molecular dynamics, computer vision and astrophysics. Another area of interest where HPC is really useful is cryptanalysis. In this paper we show how the CUDA libraries (and hardware) can be used in cryptography as a cryptanalytic tool.
The growth of data communications has made data cryptography a real necessity. Sometimes private key cryptosystems are enough; more often, public key cryptosystems are needed for communications over insecure channels. Cryptosystems based on elliptic curves offer both schemes with a relatively low communication overhead. In elliptic curve cryptography, security rests on the presumed intractability of the DLP (Discrete Logarithm Problem) in the group of points of an elliptic curve. So testing the resistance of the ECDLP (Elliptic Curve Discrete Logarithm Problem) means testing the security of these cryptosystems. Various methods (more or less efficient) for solving instances of the DLP are known in the literature, some with deterministic running time, like Shanks’ “Baby step-Giant step”, others with probabilistic running time but a better trade-off between space and time, like Pollard’s rho method.
We describe an implementation of the parallelized Pollard’s rho attack on the ECDLP, built using recent results on the optimization of Pollard’s rho method and some choices made ad hoc for CUDA.
CUDA
CUDA is a computing architecture developed by NVidia [8] to use the graphics processing unit as a general-purpose parallel processor. CUDA-enabled hardware is programmed mainly through C for CUDA, an extension of the C language that gives the user access to the CUDA capabilities of the device. Although C is the principal language for CUDA hardware, third-party wrappers are available for Python, Fortran, Java and MatLab. Currently, as reported by NVidia, there are millions of CUDA-capable GPUs, and this diffusion is mainly due to the price of the hardware, which ranges from low prices for hardware with limited computing capabilities to thousands of euros for dedicated hardware with 4 teraflops of computing power (Tesla series). Advantages offered by CUDA are:
- Scattered reads – code can read from arbitrary addresses in memory.
- Shared memory – CUDA exposes a fast shared memory region (16 KB in size) that can be shared amongst threads. This can be used as a user-managed cache, enabling higher bandwidth than is possible using texture lookups.
- Faster downloads and readbacks to and from the GPU.
- Full support for integer and bitwise operations, including integer texture lookups.
Some limitations of CUDA-enabled hardware are:
- No support for recursive functions on the device.
- Division and inversion are computationally expensive operations.
- Threads using device memory should access it in a coalesced pattern, so data in device memory must be laid out ad hoc.
Fig. 1: An example of an elliptic curve over the reals.
Fig. 2: An example of a walk in Pollard’s rho algorithm, with a collision on a2, giving the walk its typical shape, similar to the Greek letter rho.
Application and first results
Our CUDA-based implementation of the parallelized version of Pollard’s rho algorithm works as follows:
1. The host computes the starting points and the points needed for the iterating function.
2. The starting points are copied from main memory to device memory; the points of the iterating function and the curve data are copied from main memory to the constant memory of the video card.
3. The host starts 256 threads on the GPU to compute new points.
4. The GPU computes new points using the iterating function and checks whether the newly generated points are distinguished points.
5. If a new distinguished point is found, it is reported to the host.
6. The host stores the distinguished points in a hash table and checks for collisions.
A test on a preliminary version of our application, performing 4096 iterations with 256 threads (generating a total of 1048576 points), showed a speed of more than 7200000 points/sec (the test took 0.145 seconds to complete).
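The host-side bookkeeping of steps 5 and 6 can be sketched in Python as follows. This is a hedged illustration, not the authors' code: the function names, the choice of testing the 30 least significant bits of the x-coordinate, and the toy values (group order 19, k = 7) are assumptions. The last lines only recompute the throughput figures quoted above.

```python
DP_BITS = 30  # number of zero bits required; which bits are tested is assumed

def is_distinguished(x, bits=DP_BITS):
    """True if the low `bits` bits of the x-coordinate are all zero."""
    return x & ((1 << bits) - 1) == 0

def report(table, point, a, b, n):
    """Record a distinguished point reported as point = aP + bQ.

    Returns k (with Q = kP) when the point was already stored with
    different coefficients, otherwise None. `n` is the order of P.
    """
    seen = table.get(point)
    if seen is None:
        table[point] = (a, b)       # first visit: just remember it
        return None
    a2, b2 = seen
    if (b - b2) % n == 0:
        return None                 # collision carries no information on k
    # a + b*k = a2 + b2*k (mod n)  =>  k = (a2 - a) / (b - b2) (mod n)
    return (a2 - a) * pow((b - b2) % n, -1, n) % n

# Toy usage: two walks reach the same point with different coefficients.
table = {}
first = report(table, (13, 7), 9, 8, 19)     # stored, nothing to solve yet
second = report(table, (13, 7), 16, 7, 19)   # collision: solves for k
assert first is None and second == 7

# Throughput of the preliminary test reported in the text.
points = 4096 * 256                  # iterations per thread * threads
rate = points / 0.145                # points per second
```

A Python dictionary stands in for the hash table; keying it by the point makes the collision check a single lookup per reported point.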
References
1. N. Koblitz. Elliptic curve cryptosystems. Mathematics of Computation, 48:203–209, 1987.
2. V. Miller. Use of elliptic curves in cryptography. Advances in Cryptology – CRYPTO ’85 (LNCS 218), 417–426, 1986.
3. J. Pollard. Monte Carlo methods for index computation (mod p). Mathematics of Computation, 32:918–924, 1978.
4. E. Teske. Speeding up Pollard’s rho method for computing discrete logarithms. Algorithmic Number Theory – ANTS-III (LNCS 1423), 541–554, 1998.
5. E. Teske. On random walks for Pollard’s rho method. Mathematics of Computation, 70:809–825, 2001.
6. P. van Oorschot and M. Wiener. Parallel collision search with cryptanalytic applications. Journal of Cryptology, 12:1–28, 1999.
7. D. E. Knuth. The Art of Computer Programming, vol. II: Seminumerical Algorithms, Addison-Wesley, exercises 6 and 7, page 7. Knuth (p. 4) credits Floyd for the algorithm called “Tortoise and hare”, without citation.
8. http://www.nvidia.com/object/cuda_home.html
| Problems | Solutions |
|---|---|
| Inefficient use of division and inversion for modular arithmetic. | No division is used for modular addition, subtraction and multiplication. Multiplication uses the Montgomery algorithm. |
| Affine coordinates need the computation of an inverse for the sum and the doubling of points. | The Jacobian coordinate system is used for the starting points of the algorithm, with a good trade-off between performance and memory occupation. |
| The points used to generate the walks of Pollard’s rho algorithm need to be accessed by all threads. | The points that define the iterating function are stored in constant memory in affine coordinates to reduce occupation. |
| Different coordinate systems are used for the starting points of the iterating function (Jacobian) and for the points needed to generate the iteration (affine). | The mixed addition formula for Jacobian-affine coordinates is used. |
| The original Pollard’s rho iterating function divides the subgroup generated by P into 3 subsets and does not perform really well. | Teske showed that there is a performance increase when the subgroup generated by P is split into a larger number of subsets. The use of affine coordinates for the points of the iterating function allows us to use more than 64 subsets. |
| Too much space is required to store all the points generated for curves over finite fields of large characteristic. | Only “distinguished points”, having 30 bits of the x-coordinate all zero, will be stored. |

Table showing problems encountered during application development
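The Montgomery multiplication mentioned in the table can be illustrated with a short Python sketch. It is not taken from the CUDA code: the function names, the modulus and the choice R = 2^bits are assumptions made here. The point it shows is that, once operands are in Montgomery form, a modular product needs only multiplications, masks and shifts, with no division by the modulus.

```python
def mont_setup(m, bits):
    """Precompute m' = -m^(-1) mod R, with R = 2**bits, m odd and m < R."""
    assert m % 2 == 1 and m < (1 << bits)
    return (-pow(m, -1, 1 << bits)) % (1 << bits)

def redc(t, m, bits, m_prime):
    """Montgomery reduction: returns t * R^(-1) mod m for 0 <= t < m * R."""
    mask = (1 << bits) - 1
    u = ((t & mask) * m_prime) & mask     # u = t * m' mod R (mask, no division)
    r = (t + u * m) >> bits               # exact division by R is a shift
    return r - m if r >= m else r         # single conditional subtraction

def to_mont(x, m, bits):
    """Convert x to Montgomery form x * R mod m (done once, e.g. on the host)."""
    return (x << bits) % m

def mont_mul(x_bar, y_bar, m, bits, m_prime):
    """Product of two values in Montgomery form, result in Montgomery form."""
    return redc(x_bar * y_bar, m, bits, m_prime)

def from_mont(x_bar, m, bits, m_prime):
    """Convert back from Montgomery form to the ordinary representative."""
    return redc(x_bar, m, bits, m_prime)

# Usage with an assumed 61-bit prime modulus and 64-bit R.
m, bits = 2**61 - 1, 64
m_prime = mont_setup(m, bits)
x, y = 123456789, 987654321
z_bar = mont_mul(to_mont(x, m, bits), to_mont(y, m, bits), m, bits, m_prime)
assert from_mont(z_bar, m, bits, m_prime) == (x * y) % m
```

The only reductions modulo m happen in `to_mont`, which is why converting the inputs once and keeping all intermediate values in Montgomery form pays off over a long walk.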