

CUDA based implementation of parallelized Pollard's Rho algorithm for ECDLP

M. Chinnici (a), S. Cuomo (b), M. Laporta (c), B. Pennacchio (d), A. Pizzirani (e), S. Migliori (f)

(a, d) ENEA-FIM-INFOPPQ, Casaccia Research Center, Via Anguillarese 301, 00123 S. Maria di Galeria, Italy
(b, c, d, e) Università Federico II, Dipartimento di Matematica e Applicazioni "R. Caccioppoli", Via Cinthia, 80136 Napoli, Italy
(f) ENEA-FIM, ENEA-Sede, Lungotevere Thaon di Revel n. 76, 00196 Roma, Italy

Pollard’s rho algorithm

The best general-purpose algorithm for solving instances of the ECDLP is Pollard's rho algorithm. The algorithm, proposed by Pollard, uses an iteration function f: ⟨P⟩ → ⟨P⟩ to build a walk in the subgroup ⟨P⟩ (generated by the point P) of the group of points of the elliptic curve. For the ECDLP, the starting point of the algorithm is a linear combination of P and Q (mP ⊕ nQ), and the function is iterated until a point A = aP ⊕ bQ belonging to the walk is generated a second time as A' = a'P ⊕ b'Q, producing a "collision".

If a "good" collision is found, then from aP ⊕ bQ = a'P ⊕ b'Q we can compute the value k used to compute Q = kP. Substituting Q = kP gives a − a' ≡ k(b' − b) mod |⟨P⟩|, so k ≡ (a − a')(b' − b)^(-1) mod |⟨P⟩|; a collision is "good" when b' − b is invertible modulo |⟨P⟩|. Pollard [3] showed that if the walk is "random enough", the algorithm has an expected running time of (π|⟨P⟩|/2)^(1/2). Further optimizations of the algorithm have been proposed by Teske [4,5], who modified the iterating function; by van Oorschot and Wiener [6], who showed that the algorithm can be efficiently parallelized on R processors, obtaining a speedup of R; and by Floyd [7], who showed that it is not necessary to store all points to check for collisions. In the parallel version, collisions are searched for in a subset of the points of the walk (distinguished points).
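As a minimal sketch of the collision search described above, the following Python code runs Pollard's rho with Floyd's cycle detection on a toy curve. The curve y^2 = x^3 + 2x + 2 over F_17, the base point (5, 1) of order 19, the three-way partition by x mod 3 and all function names are assumptions chosen for illustration; they are not taken from the implementation described in this poster.

```python
# Toy parameters (assumed for illustration only):
# curve y^2 = x^3 + 2x + 2 over F_17, base point P = (5, 1) of prime order 19.
FIELD_P, CURVE_A, ORDER = 17, 2, 19
BASE = (5, 1)


def ec_add(p1, p2):
    """Affine point addition; None represents the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % FIELD_P == 0:
        return None  # p2 is the inverse of p1
    if p1 == p2:
        lam = (3 * x1 * x1 + CURVE_A) * pow((2 * y1) % FIELD_P, -1, FIELD_P)
    else:
        lam = (y2 - y1) * pow((x2 - x1) % FIELD_P, -1, FIELD_P)
    lam %= FIELD_P
    x3 = (lam * lam - x1 - x2) % FIELD_P
    y3 = (lam * (x1 - x3) - y1) % FIELD_P
    return (x3, y3)


def ec_mul(k, point):
    """Scalar multiple k*point by double-and-add."""
    result, addend = None, point
    while k > 0:
        if k & 1:
            result = ec_add(result, addend)
        addend = ec_add(addend, addend)
        k >>= 1
    return result


def step(state, q):
    """One iteration of f on (X, a, b), keeping the invariant X = a*P + b*Q."""
    x_pt, a, b = state
    part = 0 if x_pt is None else x_pt[0] % 3
    if part == 0:
        return ec_add(x_pt, BASE), (a + 1) % ORDER, b
    if part == 1:
        return ec_add(x_pt, x_pt), (2 * a) % ORDER, (2 * b) % ORDER
    return ec_add(x_pt, q), a, (b + 1) % ORDER


def pollard_rho(q):
    """Return k with q = k*BASE, or None if no good collision was found."""
    for a0 in range(1, ORDER):
        for b0 in range(1, ORDER):
            start = (ec_add(ec_mul(a0, BASE), ec_mul(b0, q)), a0, b0)
            tortoise = step(start, q)
            hare = step(step(start, q), q)
            while tortoise[0] != hare[0]:
                tortoise = step(tortoise, q)
                hare = step(step(hare, q), q)
            _, a, b = tortoise
            _, a2, b2 = hare
            db = (b2 - b) % ORDER
            if db != 0:  # "good" collision: b' - b is invertible mod ORDER
                return (a - a2) * pow(db, -1, ORDER) % ORDER
    return None
```

With Q = 7P, calling `pollard_rho(ec_mul(7, BASE))` recovers k = 7. A bad collision (b' ≡ b) simply triggers a restart from the next starting pair.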

Setting

Elliptic curves are geometric objects that also have the "dual nature" of algebraic objects. The set of their points, together with a so-called "point at infinity", can be given a group structure. This means that the points of this set, together with a well-defined operation (usually called sum and denoted ⊕), have some interesting properties:

the operation is associative;
an identity exists (the point at infinity);
inverses exist;
the operation is commutative.

Elliptic curves keep their group structure regardless of the ground field, so we can consider groups of points of elliptic curves defined over the complex numbers, the reals, the rationals and finite fields. The group of points of an elliptic curve defined over a finite field was proposed in the mid-1980s, independently by Koblitz [1] and Miller [2], as the basis for a cryptosystem. Embedding a message (in some way) into a point P of a curve and choosing an integer k as the key, we can compute the multiple kP of this point simply by repeated addition of P:

2P = P ⊕ P, 3P = P ⊕ P ⊕ P, …, kP = P ⊕ P ⊕ ... ⊕ P.

The multiple Q = kP of the point P is considered the enciphered message. The security of cryptosystems based on elliptic curves relies on the difficulty of inverting this process: given Q, known to be a multiple of a point P, it is very hard to compute the value k such that Q = kP. This problem is called the ECDLP.
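The computation of kP can be sketched in Python on a toy curve. The curve y^2 = x^3 + 2x + 2 over F_17 with P = (5, 1) and the function names are assumptions made for illustration. The first function is the repeated addition described above; the second uses double-and-add, a standard faster alternative that the text does not mention.

```python
# Toy parameters (assumed): y^2 = x^3 + 2x + 2 over F_17, P = (5, 1), order 19.
FIELD_P, CURVE_A = 17, 2
BASE = (5, 1)


def ec_add(p1, p2):
    """Affine point addition; None represents the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % FIELD_P == 0:
        return None  # P + (-P) = point at infinity
    if p1 == p2:
        lam = (3 * x1 * x1 + CURVE_A) * pow((2 * y1) % FIELD_P, -1, FIELD_P)
    else:
        lam = (y2 - y1) * pow((x2 - x1) % FIELD_P, -1, FIELD_P)
    lam %= FIELD_P
    x3 = (lam * lam - x1 - x2) % FIELD_P
    y3 = (lam * (x1 - x3) - y1) % FIELD_P
    return (x3, y3)


def mul_repeated(k, point):
    """kP as P + P + ... + P (k terms): k group operations."""
    result = None
    for _ in range(k):
        result = ec_add(result, point)
    return result


def mul_double_and_add(k, point):
    """kP by scanning the bits of k: about log2(k) doublings."""
    result, addend = None, point
    while k > 0:
        if k & 1:
            result = ec_add(result, addend)
        addend = ec_add(addend, addend)
        k >>= 1
    return result
```

Both functions return the same point for every k. Computing Q from k is cheap, while recovering k from Q is the ECDLP.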

Introduction

The recent introduction by NVIDIA of the CUDA (Compute Unified Device Architecture) libraries for HPC (High Performance Computing) on GPUs (Graphics Processing Units) has started a trend of using video cards to solve computationally hard problems in many areas, including fluid dynamics, molecular dynamics, computer vision and astrophysics. Another area where HPC is very useful is cryptanalysis. In this paper we show how CUDA libraries (and hardware) can be used in cryptography as a cryptanalytic tool.

The growth of data communications has made data cryptography a real necessity. Sometimes private-key cryptosystems are enough; more often, public-key cryptosystems are needed for communication over insecure channels. Cryptosystems based on elliptic curves offer both schemes with a relatively low communication overhead. In elliptic curve cryptography, security rests on the presumed intractability of the DLP (Discrete Logarithm Problem) in the group of points of an elliptic curve. So testing the resistance of the ECDLP (Elliptic Curve Discrete Logarithm Problem) means testing the security of these cryptosystems.

Various methods (more or less efficient) for solving instances of the DLP are known in the literature, some with deterministic running time, like Shanks' "Baby step-Giant step", others with probabilistic running time but a better trade-off between space and time, like Pollard's rho method. We describe an implementation of the parallelized Pollard's rho attack on the ECDLP, built using recent results on the optimization of Pollard's rho method and some ad hoc choices for CUDA.

CUDA

CUDA is a computing architecture developed by NVIDIA [8] to use the graphics processing unit as a general-purpose parallel processor. CUDA-enabled hardware is programmed mainly through C for CUDA, an extension of the C language that gives the user access to the CUDA capabilities of the device. Although C is the principal language for programming CUDA hardware, third-party wrappers are available for Python, Fortran, Java and MATLAB. Currently, as reported by NVIDIA, there are millions of CUDA-capable GPUs. This diffusion is mainly due to the price of the hardware, which ranges from low prices for hardware with limited computing capabilities to thousands of euros for dedicated hardware with 4 teraflops of computing power (Tesla series).

Advantages offered by CUDA are:

Scattered reads: code can read from arbitrary addresses in memory.
Shared memory: CUDA exposes a fast shared memory region (16KB in size) that can be shared amongst threads. This can be used as a user-managed cache, enabling higher bandwidth than is possible using texture lookups.
Faster downloads and readbacks to and from the GPU.
Full support for integer and bitwise operations, including integer texture lookups.

Some limitations of CUDA-enabled hardware are:

No support for recursive functions on the device.
Division and inversion are computationally expensive operations.
Threads using device memory should access it in a pattern that allows the accesses to be coalesced, so data in device memory must be laid out ad hoc.

Fig. 1: An example of an elliptic curve over the reals.

Fig. 2: An example of a walk in Pollard's rho algorithm, with a collision at a2, giving the walk its typical shape, similar to the Greek letter rho.

Application and first results

Our CUDA-based implementation of the parallelized version of Pollard's rho algorithm works as follows:

1. The host computes the starting points and the points needed by the iterating function.
2. The starting points are copied from main memory to device memory; the points of the iterating function and the curve data are copied from main memory to the constant memory of the video card.
3. The host starts 256 threads on the GPU to compute new points.
4. The GPU computes new points using the iterating function and checks whether the newly generated points are distinguished points.
5. If a new distinguished point is found, it is reported to the host.
6. The host stores the distinguished points in a hash table and checks for collisions.

A test of a preliminary version of our application, performing 4096 iterations with 256 threads (generating a total of 1048576 points), showed a speed of more than 7200000 points/sec (the test took 0.145 seconds to complete).
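The host side of steps 5 and 6 can be sketched in Python as follows. This is a hypothetical illustration, not the code of the application: the names `is_distinguished` and `DistinguishedPointTable`, the choice of testing the low-order bits of the x-coordinate, and the toy subgroup order 19 used in the usage note are all assumptions. Each reported point is assumed to arrive with the coefficients (a, b) such that the point equals aP ⊕ bQ.

```python
def is_distinguished(point, zero_bits=30):
    """True if the low `zero_bits` bits of the x-coordinate are all zero.

    Assumption: the low-order bits are tested; the choice of bits is not
    specified in the text.
    """
    return point[0] & ((1 << zero_bits) - 1) == 0


class DistinguishedPointTable:
    """Host-side hash table of reported distinguished points (sketch)."""

    def __init__(self, order):
        self.order = order  # |<P>|, assumed prime so inverses exist
        self.seen = {}      # point -> (a, b) with point = a*P + b*Q

    def report(self, point, a, b):
        """Store a reported point; return k on a good collision, else None."""
        if point not in self.seen:
            self.seen[point] = (a, b)
            return None
        a2, b2 = self.seen[point]
        db = (b2 - b) % self.order
        if db == 0:
            return None  # same coefficients: the collision gives no information
        # a*P + b*Q = a2*P + b2*Q with Q = k*P  =>  k = (a - a2) / (b2 - b)
        return (a - a2) * pow(db, -1, self.order) % self.order
```

For example, in a subgroup of order 19 with Q = 7P, the point 8P can be written both as 1P ⊕ 1Q and as 9P ⊕ 8Q. Reporting it twice with those coefficients, `table.report((13, 7), 1, 1)` followed by `table.report((13, 7), 9, 8)`, makes the second call return 7.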

References

1. N. KOBLITZ. Elliptic curve cryptosystems. Mathematics of Computation, 48:203–209, 1987.

2. V. MILLER. Use of elliptic curves in cryptography. Advances in Cryptology – CRYPTO '85 (LNCS 218) [483], 417–426, 1986.

3. J. POLLARD. Monte Carlo methods for index computation (mod p). Mathematics of Computation, 32:918–924, 1978.

4. E. TESKE. Speeding up Pollard's rho method for computing discrete logarithms. Algorithmic Number Theory – ANTS-III (LNCS 1423) [82], 541–554, 1998.

5. E. TESKE. On random walks for Pollard's rho method. Mathematics of Computation, 70:809–825, 2001.

6. P. VAN OORSCHOT AND M. WIENER. Parallel collision search with cryptanalytic applications. Journal of Cryptology, 12:1–28, 1999.

7. D.E. KNUTH. The Art of Computer Programming, vol. II: Seminumerical Algorithms, Addison-Wesley, exercises 6 and 7, page 7. Knuth (p. 4) credits Floyd for the algorithm called "Tortoise and hare", without citation.

8. http://www.nvidia.com/object/cuda_home.html

Details and description of Application

Problem: Inefficient use of division and inversion for modular arithmetic.
Solution: No division is used for modular addition, difference and multiplication. Multiplication uses the Montgomery algorithm.

Problem: Affine coordinates need the computation of an inverse for the sum and doubling of points.
Solution: The Jacobian coordinate system is used for the starting points of the algorithm, with a good trade-off between performance and memory occupation.

Problem: Points used to generate the walks of Pollard's rho algorithm need to be accessed by all threads.
Solution: Points used to generate the function are stored in constant memory in affine coordinates to reduce occupation.

Problem: Different coordinate systems for the starting points of the iterating function (Jacobian) and the points needed to generate the iteration (affine).
Solution: Use of the mixed addition formula for Jacobian-affine coordinates.

Problem: The original Pollard's rho iterating function divides the subgroup generated by P into 3 subsets and does not perform very well. Teske showed that performance increases when the subgroup generated by P is split into a larger number of subsets.
Solution: The use of affine coordinates for the points of the iterating function allows us to use more than 64 subsets.

Problem: Too much space is required to store all points generated for curves over finite fields of large characteristic.
Solution: Only "distinguished points", having 30 bits of the x-coordinate all zero, will be stored.

Table showing problems encountered during application development.
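To illustrate the first entry of the table, here is a minimal Python sketch of Montgomery multiplication, which replaces the division by the modulus with shifts and masks by R = 2^bits. The function names, the word size of 32 bits and the modulus in the usage note are assumptions for illustration, and this is not the CUDA code of the application. The only `%` operations are in the setup and in the one-time conversion into Montgomery form.

```python
def montgomery_setup(modulus, bits):
    """Precompute n' with modulus * n' = -1 (mod R), R = 2**bits.

    Requires an odd modulus smaller than R.
    """
    r = 1 << bits
    n_prime = (-pow(modulus, -1, r)) % r
    return n_prime


def redc(t, modulus, bits, n_prime):
    """Montgomery reduction: t * R^-1 mod modulus, for 0 <= t < modulus * R."""
    mask = (1 << bits) - 1
    m = ((t & mask) * n_prime) & mask  # m = t * n' mod R (mask, no division)
    u = (t + m * modulus) >> bits      # exact division by R (shift)
    return u - modulus if u >= modulus else u


def to_montgomery(a, modulus, bits):
    """Convert a to Montgomery form a * R mod modulus (done once per input)."""
    return (a << bits) % modulus


def mont_mul(a_bar, b_bar, modulus, bits, n_prime):
    """Product of two values in Montgomery form, result in Montgomery form."""
    return redc(a_bar * b_bar, modulus, bits, n_prime)
```

A typical use converts the operands once, performs many `mont_mul` calls, and converts the result back with a final `redc`. For instance, with modulus 2**31 - 1 and bits = 32, `redc(mont_mul(to_montgomery(a, ...), to_montgomery(b, ...), ...), ...)` equals `a * b % modulus`.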


Fig.3 Benefits of multicluster queue.


ENEA-FIM, C.R. Portici


References http://www.cresco.enea.it/LA1/cresco_ sp12_graf3d/
