
Solving Discrete Logarithms in Smooth-Order Groups with CUDA — Ryan Henry and Ian Goldberg (PowerPoint presentation transcript)



  1. Solving Discrete Logarithms in Smooth-Order Groups with CUDA Ryan Henry Ian Goldberg

  2. Definition: Let G be a cyclic group of order q and let g ∈ G be a generator. Given α ∈ G, the discrete logarithm (DL) problem is to find x ∈ Z_q such that g^x = α.

  3. Definition: Let G be a cyclic group of order q and let g ∈ G be a generator. Given α ∈ G, the discrete logarithm (DL) problem is to find x ∈ Z_q such that g^x = α.
     Why do we care?
     ◮ Computing DLs is apparently difficult for classical computers
     ◮ The inverse problem (modular exponentiation) is easy
     ◮ Many cryptographic protocols exploit this asymmetry
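The asymmetry in the last bullet can be made concrete with a toy sketch (Python here for brevity, and the parameters — the group Z_23^* with generator g = 5 — are hypothetical choices of mine, not from the talk): the "easy" direction is one fast exponentiation, while the naive attack on the "hard" direction must scan the group.

```python
def brute_force_dl(g, alpha, q, p):
    """Exhaustive-search DL: find x in Z_q with g^x = alpha (mod p), in O(q) steps."""
    acc = 1
    for x in range(q):
        if acc == alpha:
            return x
        acc = (acc * g) % p
    return None

# Easy direction: modular exponentiation via square-and-multiply (built-in pow).
alpha = pow(5, 13, 23)            # g = 5 generates Z_23^*, which has order q = 22
x = brute_force_dl(5, alpha, 22, 23)
```

Real groups have q around 2^160 or more, so the O(q) scan is hopeless; the rest of the talk is about doing much better than this when the group order is smooth.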

  4. Definition: An integer n is called B-smooth if each of its prime factors is bounded above by B. A smooth-order group is just a group whose order is B-smooth for some "suitably small" value of B.

  5. Definition: An integer n is called B-smooth if each of its prime factors is bounded above by B. A smooth-order group is just a group whose order is B-smooth for some "suitably small" value of B.
     Why do we care?
     ◮ If ϕ(N) is B-smooth, then Z*_N has smooth order
     ◮ Many DL-based cryptographic protocols work in Z*_N
     ◮ Pollard's rho algorithm (plus Pohlig–Hellman) solves DLs in time proportional to the smoothness of the group order
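The definition translates directly into a trial-division test; this minimal Python sketch (with toy numbers of my own choosing, not the talk's) shows B-smoothness in code:

```python
def is_b_smooth(n, B):
    """True iff every prime factor of n is at most B (naive trial division)."""
    d = 2
    while d * d <= n:
        while n % d == 0:
            if d > B:
                return False
            n //= d
        d += 1
    return n <= B   # any leftover n > 1 is itself a prime factor

# phi(N) = (p-1)(q-1) for N = pq; here both p-1 and q-1 are 11-smooth,
# so Z*_N has 11-smooth order.
p_minus_1 = 2**4 * 3 * 7      # 336
q_minus_1 = 2 * 5 * 11        # 110
smooth = is_b_smooth(p_minus_1 * q_minus_1, 11)
```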

  6. Definition: The Compute Unified Device Architecture (CUDA) is Nvidia's parallel computing architecture. It enables developers to use CUDA-enabled Nvidia GPUs for general-purpose computing.

  7. Definition: The Compute Unified Device Architecture (CUDA) is Nvidia's parallel computing architecture. It enables developers to use CUDA-enabled Nvidia GPUs for general-purpose computing.
     Why do we care?
     ◮ Nvidia GPUs are widely deployed and offer a better price-to-GFLOPS ratio than CPUs
     ◮ Modern GPUs have many cores and support highly parallel computation
     ◮ Pollard's rho algorithm is extremely parallelizable

  8.–12. In this presentation, we...
     ◮ describe Pollard's rho algorithm and its parallel variant
     ◮ discuss CUDA and GPGPU computing on Nvidia GPUs
     ◮ present our implementation of modular multiplication and parallel rho in CUDA and analyze its performance
     ◮ point out a simple attack on Boudot's zero-knowledge range proofs
     ◮ construct and analyze trapdoor discrete logarithm groups

  13. Part I: Pollard’s rho

  14. Pollard's rho algorithm (1/4) Problem: Given g, h ∈ G, compute the discrete logarithm x ∈ Z_n of h with respect to g.

  15. Pollard's rho algorithm (1/4) Problem: Given g, h ∈ G, compute the discrete logarithm x ∈ Z_n of h with respect to g.
     Key observation:
     ◮ Consider elements g^a h^b ∈ G and search for collisions
     ◮ Since g^{a_1} h^{b_1} = g^{a_2} h^{b_2} ⇒ g^{a_1 − a_2} = h^{b_2 − b_1}, we have a_1 − a_2 ≡ x(b_2 − b_1) (mod n) ⇒ x ≡ (a_1 − a_2)(b_2 − b_1)^{−1} (mod n)
     ◮ Birthday paradox: about √(πn/2) selections should suffice ⇒ expected runtime and storage in Θ(√n)
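The collision arithmetic on this slide can be checked numerically. In the sketch below everything is a hypothetical toy instance of mine — the order-53 subgroup of Z_107^* generated by 4, secret exponent 17 — and the collision is forced by hand rather than found by random sampling, just to exercise the recovery formula:

```python
p, n = 107, 53          # subgroup of Z_107^* with prime order n = 53
g = 4                   # generator of that order-53 subgroup
x = 17                  # the secret discrete log
h = pow(g, x, p)

# Pick (a1, b1, b2) freely, then force g^a1 h^b1 = g^a2 h^b2 for the demo:
a1, b1, b2 = 40, 10, 12
a2 = (a1 + x * (b1 - b2)) % n
collide = pow(g, a1, p) * pow(h, b1, p) % p == pow(g, a2, p) * pow(h, b2, p) % p

# Recover x exactly as on the slide: x = (a1 - a2)(b2 - b1)^{-1} mod n.
recovered = (a1 - a2) * pow(b2 - b1, -1, n) % n
```

Because n is prime, (b2 − b1)^{−1} mod n exists whenever b1 ≠ b2, which is why working in a prime-order subgroup keeps the recovery step clean.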

  16. Pollard's rho algorithm (2/4) Problem: Given g, h ∈ G, compute the discrete logarithm x ∈ Z_n of h with respect to g.
     Pollard's idea:
     ◮ Walk through G using an iteration function f : G → G with f(g^{a_i} h^{b_i}) = g^{a_{i+1}} h^{b_{i+1}}
     ◮ Collisions ⇒ cycles, which are cheap to detect
     ◮ If the iteration function behaves "randomly enough", then expected runtime is in Θ(√n) and storage is in Θ(1)
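Putting the last two slides together, here is a minimal serial Pollard rho in Python — a sketch with toy parameters of mine, not the talk's CUDA code. The classic three-way partition of the group serves as the iteration function, and Floyd's tortoise-and-hare finds the cycle in Θ(1) storage:

```python
import random

def pollard_rho_dl(g, h, p, n, max_restarts=50, seed=1):
    """Solve g^x = h (mod p) in a subgroup of prime order n; None on failure."""
    rng = random.Random(seed)

    def step(x, a, b):
        # Partition the group three ways; each class fixes the next move.
        s = x % 3
        if s == 0:
            return (x * g) % p, (a + 1) % n, b             # multiply by g
        if s == 1:
            return (x * x) % p, (2 * a) % n, (2 * b) % n   # square
        return (x * h) % p, a, (b + 1) % n                 # multiply by h

    for _ in range(max_restarts):
        a, b = rng.randrange(n), rng.randrange(n)
        x = pow(g, a, p) * pow(h, b, p) % p
        X, A, B = x, a, b
        while True:   # Floyd cycle-finding: tortoise 1 step, hare 2 steps
            x, a, b = step(x, a, b)
            X, A, B = step(*step(X, A, B))
            if x == X:
                break
        if (B - b) % n != 0:
            # g^a h^b = g^A h^B  =>  x = (a - A)(B - b)^{-1} mod n
            return (a - A) * pow(B - b, -1, n) % n
        # degenerate collision (matching b's); restart from a fresh point
    return None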

  17. Pollard's rho algorithm (3/4) [Figure: the walk g^{a_0}h^{b_0}, g^{a_1}h^{b_1}, g^{a_2}h^{b_2}, ... eventually enters a cycle, tracing the shape of the Greek letter ρ.]

  18. Pollard's rho algorithm (3/4) [Same figure, animated: successive frames highlight the tail and the cycle of the ρ-shaped walk.]

  19. Pollard's rho algorithm (4/4) Problem: Given g, h ∈ G, compute the discrete logarithm x ∈ Z_n of h with respect to g.
     van Oorschot and Wiener's idea:
     ◮ Define a distinguished point (DP) as any point with some cheap-to-detect property (e.g., m trailing zeros)
     ◮ Run Ψ client threads in parallel, each reporting DPs to a central server that checks for collisions
     ◮ Expected runtime is in Θ(√n / Ψ)
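The distinguished-point machinery can be simulated in a few lines. In this Python sketch everything is a hypothetical stand-in: the Ψ client threads become a round-robin loop over walk states, "m trailing zeros" becomes divisibility by 8, and the toy group is the one used above rather than anything from the talk:

```python
import random

def parallel_rho_dl(g, h, p, n, walks=8, max_rounds=100000, seed=2):
    """van Oorschot-Wiener sketch: several walks report DPs to one table."""
    rng = random.Random(seed)

    def step(x, a, b):
        s = x % 3                                    # same 3-way iteration
        if s == 0:
            return (x * g) % p, (a + 1) % n, b
        if s == 1:
            return (x * x) % p, (2 * a) % n, (2 * b) % n
        return (x * h) % p, a, (b + 1) % n

    def fresh():                                     # random starting point
        a, b = rng.randrange(n), rng.randrange(n)
        return pow(g, a, p) * pow(h, b, p) % p, a, b

    table = {}                       # "server": distinguished point -> (a, b)
    state = [fresh() for _ in range(walks)]
    for _ in range(max_rounds):
        for i, (x, a, b) in enumerate(state):
            x, a, b = step(x, a, b)
            if x % 8 == 0:           # distinguished point: report to server
                if x in table and (table[x][1] - b) % n != 0:
                    A, B = table[x]  # cross-walk collision: solve for x
                    return (a - A) * pow(B - b, -1, n) % n
                table[x] = (a, b)
                x, a, b = fresh()    # restart this walk (a common variant)
            state[i] = (x, a, b)
    return None
```

Only DPs ever travel to the server, so communication is a tiny fraction of the work — this is what makes the GPU version embarrassingly parallel.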

  20. Part II: GPUs and CUDA

  21. SMPs and CUDA cores
     ◮ A GPU has several streaming multiprocessors (SMPs)
     ◮ Our Tesla M2050 cards each have 14 SMPs
     ◮ SIMD architecture
     [Figure: Fermi-architecture SMP block diagram — instruction cache, two warp schedulers with dispatch units, a 2^15 × 32-bit register file, CUDA cores, load/store units, SFUs, interconnect network, 64 KB shared memory / L1 cache, and uniform cache.]

  22. SMPs and CUDA cores
     [Same diagram, zoomed into a single CUDA core: dispatch port, operand collector, FPU and INT units, and result queue.]

  23. CUDA memory hierarchy
     ◮ The developer manages memory explicitly
     ◮ 1 clock pulse for shared memory and L1 cache
     ◮ ≈ 300 clock pulses for L2 cache and local RAM
     ◮ Many more clock pulses for system RAM
     [Figure: per-thread memory hierarchy — shared memory / L1 cache, then L2 cache, then local RAM.]

  24. Tesla M2050
     Nvidia Tesla M2050 GPU cards (price: 1,299.00 USD):
     ◮ Based on the Fermi architecture
     ◮ 14 SMPs × 32 cores/SMP = 448 cores (each running at 1.55 GHz)
     ◮ 2^15 × 32-bit registers per SMP
     ◮ Configurable: 64 KB shared memory / L1 cache
     ◮ 3 GB of GDDR5 local RAM
     Our experiments used a host PC with:
     ◮ Intel Xeon E5620 quad core (2.4 GHz)
     ◮ 2 × 4 GB of DDR3-1333 RAM
     ◮ 2 × Tesla M2050 GPU cards

  25. Part III: Implementation

  26. CUDA modular multiplication (1/2)
     ◮ Iteration function for Pollard rho:
         f(x) = g·x   if 0 ≤ x < q/3
                x^2   if q/3 ≤ x < 2q/3
                h·x   if 2q/3 ≤ x < q
     ◮ Need fast, multiprecision modular multiplication to solve DLs in Z*_N
     ◮ We used Koç et al.'s CIOS algorithm for Montgomery multiplication
     ◮ Low auxiliary storage ⇒ lots of threads
     ◮ We do one thread per multiplication
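For reference, here is a word-level Python model of the CIOS (coarsely integrated operand scanning) Montgomery multiplication the slide cites. The 32-bit limb arrays stand in for the registers a CUDA kernel would use; the modulus and operands in the usage note are toy choices of mine, and only the algorithm itself is from Koç et al.:

```python
W = 1 << 32   # 32-bit machine words, as on the GPU

def cios_montmul(a_int, b_int, n_int, s, n0inv):
    """CIOS Montgomery product a*b*R^{-1} mod n, R = 2^(32*s), inputs < n."""
    a = [(a_int >> (32 * i)) & (W - 1) for i in range(s)]
    b = [(b_int >> (32 * i)) & (W - 1) for i in range(s)]
    nw = [(n_int >> (32 * i)) & (W - 1) for i in range(s)]
    t = [0] * (s + 2)
    for i in range(s):
        c = 0                                   # accumulate a * b[i] into t
        for j in range(s):
            cs = t[j] + a[j] * b[i] + c
            t[j], c = cs & (W - 1), cs >> 32
        cs = t[s] + c
        t[s], t[s + 1] = cs & (W - 1), cs >> 32
        m = (t[0] * n0inv) & (W - 1)            # makes t divisible by 2^32
        c = (t[0] + m * nw[0]) >> 32            # low word becomes zero
        for j in range(1, s):                   # reduce and shift one word
            cs = t[j] + m * nw[j] + c
            t[j - 1], c = cs & (W - 1), cs >> 32
        cs = t[s] + c
        t[s - 1] = cs & (W - 1)
        t[s] = t[s + 1] + (cs >> 32)
    r = sum(t[i] << (32 * i) for i in range(s + 1))
    return r - n_int if r >= n_int else r       # one final subtraction suffices
```

Here n0inv is −n^{−1} mod 2^32 (n must be odd), computable as `(-pow(n, -1, 1 << 32)) % (1 << 32)`; the low auxiliary storage of CIOS (s + 2 words) is exactly what lets the GPU run one thread per multiplication.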

  27. CUDA modular multiplication (2/2)
     Table: k-bit modular multiplications per second and (amortized) time per k-bit modular multiplication on a single Tesla M2050.

     Bit length of modulus | Time per trial ± std dev | Amortized time per modmult | Modmults per second
     192                   | 30.538 s ± 4 ms          | 1.19 ns                    | ≈ 840,336,000
     256                   | 50.916 s ± 5 ms          | 1.98 ns                    | ≈ 505,050,000
     512                   | 186.969 s ± 4 ms         | 7.30 ns                    | ≈ 136,986,000
     768                   | 492.6 s ± 200 ms         | 19.24 ns                   | ≈ 51,975,000
     1024                  | 2304.5 s ± 300 ms        | 90.02 ns                   | ≈ 11,108,000

     ◮ Larger k ⇒ each multiplication takes longer ⇒ can compute fewer multiplications in parallel

  28. CUDA Pollard rho (1/2)
     Goal: Compute discrete logarithms modulo k_N-bit RSA numbers N = pq with 2^{k_B}-smooth totient.
     Our implementation:
     ◮ Optimized for k_N = 1536 and k_B ≈ 55
     ◮ Assumes that the factorizations of p − 1 and q − 1 are known
     ◮ Uses the Pohlig–Hellman approach to decompose the problem into k_B-bit subproblems
     ◮ Distinguished points: at least 10 trailing zeros in the binary (Montgomery) representation
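The Pohlig–Hellman decomposition the implementation relies on can be sketched in a few lines of Python. The parameters below are toy choices of mine (p = 31, whose group order 30 = 2·3·5 is smooth, with primitive root 3), and the brute-force inner solver stands in for the rho kernels that the real implementation dispatches to the GPU:

```python
def pohlig_hellman(g, h, p, factors):
    """Solve g^x = h (mod p), where ord(g) = product of the given prime powers."""
    n = 1
    for q, e in factors:
        n *= q**e
    residues, moduli = [], []
    for q, e in factors:
        pe = q**e
        # Project the instance into the subgroup of order q^e.
        g_i = pow(g, n // pe, p)
        h_i = pow(h, n // pe, p)
        # Tiny subproblem: brute force here (Pollard rho in the real thing).
        x_i = next(x for x in range(pe) if pow(g_i, x, p) == h_i)
        residues.append(x_i)
        moduli.append(pe)
    # Chinese remainder theorem recombines x mod each prime power into x mod n.
    x = 0
    for r, m in zip(residues, moduli):
        M = n // m
        x = (x + r * M * pow(M, -1, m)) % n
    return x
```

Each subproblem costs about √(q^e) group operations, so the total cost tracks the largest prime factor of the order — exactly why smoothness (k_B ≈ 55 bits here) is what makes the 1536-bit moduli attackable.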
