SLIDE 1

Solving Discrete Logarithms in Smooth-Order Groups with CUDA

Ryan Henry Ian Goldberg

SLIDE 2

Definition

Let G be a cyclic group of order q and let g ∈ G be a generator. Given α ∈ G, the discrete logarithm (DL) problem is to find x ∈ Z_q such that g^x = α.
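To make the definition concrete, here is a small illustrative Python sketch (not part of the talk) that solves a toy DL instance by exhaustive search; the group Z_23^* and the generator 5 are arbitrary choices of ours.

```python
# Illustrative only: brute-force discrete log in Z_p^* for a tiny prime p.
# Real cryptographic groups are far too large for exhaustive search.
def brute_force_dl(g, alpha, p):
    """Return x with g^x = alpha (mod p), trying x = 0, 1, 2, ..."""
    acc = 1
    for x in range(p - 1):          # the group order is p - 1 here
        if acc == alpha:
            return x
        acc = (acc * g) % p
    raise ValueError("alpha not in the subgroup generated by g")

# 5 generates Z_23^* (order 22); recover the exponent 7.
print(brute_force_dl(5, pow(5, 7, 23), 23))  # -> 7
```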


SLIDE 3

Definition

Let G be a cyclic group of order q and let g ∈ G be a generator. Given α ∈ G, the discrete logarithm (DL) problem is to find x ∈ Z_q such that g^x = α.

Why do we care?

◮ Computing DLs is apparently difficult for classical computers
◮ The inverse problem (modular exponentiation) is easy
◮ Many cryptographic protocols exploit this asymmetry

SLIDE 4

Definition

An integer n is called B-smooth if each of its prime factors is bounded above by B. A smooth-order group is just a group whose order is B-smooth for some “suitably small” value of B.
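B-smoothness can be checked by simple trial division; a minimal Python sketch (illustrative, with made-up inputs):

```python
# Minimal sketch: test B-smoothness of n by trial division.
def is_b_smooth(n, B):
    """True iff every prime factor of n is at most B."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            if d > B:
                return False
            while n % d == 0:
                n //= d
        d += 1
    return n <= B                 # what remains is 1 or a single prime

print(is_b_smooth(2**4 * 3**2 * 7, 7))  # -> True  (factors 2, 3, 7)
print(is_b_smooth(2 * 101, 100))        # -> False (101 > 100)
```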


SLIDE 5

Definition

An integer n is called B-smooth if each of its prime factors is bounded above by B. A smooth-order group is just a group whose order is B-smooth for some “suitably small” value of B.

Why do we care?

◮ If ϕ(N) is B-smooth, then Z*_N has smooth order
◮ Many DL-based cryptographic protocols work in Z*_N
◮ Pollard’s rho algorithm (plus Pohlig–Hellman) solves DLs in time proportional to the smoothness of the group order


SLIDE 6

Definition

The Compute Unified Device Architecture (CUDA) is Nvidia’s parallel computing architecture. It enables developers to use CUDA-enabled Nvidia GPUs for general-purpose computing.


SLIDE 7

Definition

The Compute Unified Device Architecture (CUDA) is Nvidia’s parallel computing architecture. It enables developers to use CUDA-enabled Nvidia GPUs for general-purpose computing.

Why do we care?

◮ Nvidia GPUs are widely deployed, and offer a better price-to-GFLOP ratio than CPUs
◮ Modern GPUs have many cores and support highly parallel computation
◮ Pollard’s rho algorithm is extremely parallelizable

SLIDE 8

In this presentation, we...

◮ describe Pollard’s rho algorithm and its parallel variant

SLIDE 9

In this presentation, we...

◮ describe Pollard’s rho algorithm and its parallel variant
◮ discuss CUDA and GPGPU computing on Nvidia GPUs

SLIDE 10

In this presentation, we...

◮ describe Pollard’s rho algorithm and its parallel variant
◮ discuss CUDA and GPGPU computing on Nvidia GPUs
◮ present our implementation of modular multiplication and parallel rho in CUDA and analyze its performance

SLIDE 11

In this presentation, we...

◮ describe Pollard’s rho algorithm and its parallel variant
◮ discuss CUDA and GPGPU computing on Nvidia GPUs
◮ present our implementation of modular multiplication and parallel rho in CUDA and analyze its performance
◮ point out a simple attack on Boudot’s zero-knowledge range proofs

SLIDE 12

In this presentation, we...

◮ describe Pollard’s rho algorithm and its parallel variant
◮ discuss CUDA and GPGPU computing on Nvidia GPUs
◮ present our implementation of modular multiplication and parallel rho in CUDA and analyze its performance
◮ point out a simple attack on Boudot’s zero-knowledge range proofs
◮ construct and analyze trapdoor discrete logarithm groups

SLIDE 13

Part I: Pollard’s rho

SLIDE 14

Pollard’s rho algorithm (1/4)

Problem

Given g, h ∈ G, compute the discrete logarithm x ∈ Z_n of h with respect to g.


SLIDE 15

Pollard’s rho algorithm (1/4)

Problem

Given g, h ∈ G, compute the discrete logarithm x ∈ Z_n of h with respect to g.

Key observation:

◮ Consider elements g^a h^b ∈ G and search for collisions
◮ Since g^(a1) h^(b1) = g^(a2) h^(b2) ⇒ g^(a1−a2) = h^(b2−b1), we have a1 − a2 ≡ x(b2 − b1) (mod n) ⇒ x ≡ (a1 − a2)(b2 − b1)^(−1) (mod n)
◮ Birthday paradox: about √(πn/2) selections should suffice ⇒ expected runtime and storage in Θ(√n)
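The collision idea above can be prototyped directly: store random points g^a h^b in a table until two coincide. An illustrative Python sketch (the order-11 subgroup of Z_23^* is our own toy choice); note it uses Θ(√n) storage, which is exactly what Pollard’s refinement on the next slides removes.

```python
import random

# Birthday-paradox DL: sample points g^a * h^b until two collide.
def collision_dl(g, h, p, q):
    """Find x with g^x = h (mod p), where q is the prime order of g."""
    seen = {}                                  # group element -> (a, b)
    while True:
        a, b = random.randrange(q), random.randrange(q)
        pt = (pow(g, a, p) * pow(h, b, p)) % p
        if pt in seen and (seen[pt][1] - b) % q != 0:
            a2, b2 = seen[pt]
            # g^a h^b = g^a2 h^b2  =>  x = (a2 - a)(b - b2)^(-1) mod q
            return (a2 - a) * pow(b - b2, -1, q) % q
        seen[pt] = (a, b)

# g = 2 has prime order q = 11 in Z_23^*; recover the exponent 7.
print(collision_dl(2, pow(2, 7, 23), 23, 11))  # -> 7
```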


SLIDE 16

Pollard’s rho algorithm (2/4)

Problem

Given g, h ∈ G, compute the discrete logarithm x ∈ Z_n of h with respect to g.

Pollard’s idea:

◮ Walk through G using an iteration function f : G → G, f(g^(a_i) h^(b_i)) = g^(a_(i+1)) h^(b_(i+1))
◮ Collisions ⇒ cycles, which are cheap to detect
◮ If the iteration function behaves “randomly enough”, then expected runtime is in Θ(√n) and storage is in Θ(1)
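A serial rho with Floyd cycle-finding can be sketched as below. This is an illustrative Python version with toy parameters of our choosing (p = 2027, q = 1013); the crude 3-way partition mirrors the style of iteration function shown later for the CUDA implementation.

```python
import random

def pollard_rho_dl(g, h, p, q):
    """Pollard's rho for g^x = h (mod p), q = prime order of g.
    Floyd cycle-finding: expected O(sqrt(q)) time, O(1) storage."""
    def step(x, a, b):
        s = x % 3                        # crude 3-way partition of the group
        if s == 0:
            return (x * g) % p, (a + 1) % q, b
        if s == 1:
            return (x * x) % p, (2 * a) % q, (2 * b) % q
        return (x * h) % p, a, (b + 1) % q

    while True:                          # retry on degenerate collisions
        a = random.randrange(q)
        b = random.randrange(q)
        x = (pow(g, a, p) * pow(h, b, p)) % p
        X, A, B = x, a, b                # the hare moves twice per step
        for _ in range(4 * q):           # safety bound for toy sizes
            x, a, b = step(x, a, b)
            X, A, B = step(*step(X, A, B))
            if x == X:
                if (B - b) % q != 0:
                    # g^a h^b = g^A h^B => x = (a - A)(B - b)^(-1) mod q
                    return (a - A) * pow(B - b, -1, q) % q
                break                    # useless collision; restart

# Toy instance: p = 2027 = 2*1013 + 1, and g = 4 has prime order 1013.
p, q = 2027, 1013
print(pollard_rho_dl(4, pow(4, 777, p), p, q))  # -> 777
```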


SLIDE 17

Pollard’s rho algorithm (3/4)

[Figure: a rho-shaped walk — a tail g^(a_0) h^(b_0), g^(a_1) h^(b_1), g^(a_2) h^(b_2), … leading into a cycle g^(a_i) h^(b_i), g^(a_(i+1)) h^(b_(i+1)), …, g^(a_j) h^(b_j) = g^(a_i) h^(b_i)]



SLIDE 19

Pollard’s rho algorithm (4/4)

Problem

Given g, h ∈ G, compute the discrete logarithm x ∈ Z_n of h with respect to g.

van Oorschot and Wiener’s idea:

◮ Define a distinguished point (DP) as any point with some cheap-to-detect property (e.g., m trailing zeros)
◮ Run Ψ client threads in parallel, each reporting DPs to a central server that checks for collisions
◮ Expected runtime is in Θ(√n / Ψ)
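The van Oorschot–Wiener scheme can be simulated serially: several walks share one iteration function and report distinguished points to a common table standing in for the central server. An illustrative Python sketch with toy parameters of our own (a real deployment would run the walks as actual parallel threads, and stuck walks are re-randomized so no walker loops forever in a DP-free cycle):

```python
import random

def dp_rho_dl(g, h, p, q, walkers=8, dp_bits=2):
    """van Oorschot-Wiener parallel rho, simulated serially.
    Points whose value has dp_bits trailing zero bits are 'distinguished'
    and reported to a shared table; a collision between reports gives the DL."""
    def step(x, a, b):
        s = x % 3
        if s == 0:
            return (x * g) % p, (a + 1) % q, b
        if s == 1:
            return (x * x) % p, (2 * a) % q, (2 * b) % q
        return (x * h) % p, a, (b + 1) % q

    def fresh():
        a, b = random.randrange(q), random.randrange(q)
        return [(pow(g, a, p) * pow(h, b, p)) % p, a, b, 0]

    table = {}                                # DP value -> (a, b)
    walks = [fresh() for _ in range(walkers)]
    limit = 20 * (1 << dp_bits)               # give up on DP-free cycles
    while True:
        for i, w in enumerate(walks):         # round-robin "parallel" steps
            x, a, b = step(w[0], w[1], w[2])
            w[:] = [x, a, b, w[3] + 1]
            if x % (1 << dp_bits) == 0:       # distinguished point
                if x in table and (table[x][1] - b) % q != 0:
                    A, B = table[x]
                    return (a - A) * pow(B - b, -1, q) % q
                table[x] = (a, b)
                w[3] = 0
            elif w[3] > limit:
                walks[i] = fresh()            # re-randomize a stuck walk

p, q = 2027, 1013                             # p = 2q + 1, q prime
print(dp_rho_dl(4, pow(4, 321, p), p, q))     # -> 321
```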
SLIDE 20

Part II: GPUs and CUDA

SLIDE 21

SMPs and CUDA cores

Fermi architecture

◮ The GPU has several streaming multiprocessors (SMPs)
◮ Our Tesla M2050 cards each have 14 SMPs
◮ SIMD architecture

[Figure: Fermi SMP block diagram — instruction cache; two warp schedulers, each with a dispatch unit; a 2^15 × 32-bit register file; 32 CUDA cores; 16 load/store units; 4 SFUs; interconnect network; 64 KB shared memory / L1 cache; uniform cache]


SLIDE 22

SMPs and CUDA cores

Fermi architecture

◮ The GPU has several streaming multiprocessors (SMPs)
◮ Our Tesla M2050 cards each have 14 SMPs
◮ SIMD architecture

[Figure: as on the previous slide, zoomed in on a single CUDA core — dispatch port, operand collector, FP unit and INT unit, result queue]


SLIDE 23

CUDA memory hierarchy

[Figure: memory hierarchy — thread → shared memory / L1 cache → L2 cache → local RAM]

◮ The developer manages memory explicitly
◮ 1 clock pulse for shared memory and L1 cache
◮ ≈ 300 clock pulses for local RAM
◮ Many more clock pulses for system RAM


SLIDE 24

Tesla M2050

Nvidia Tesla M2050 GPU cards:

◮ Based on the Fermi architecture
◮ 14 SMPs × 32 cores/SMP = 448 cores (each running at 1.55 GHz)
◮ 2^15 × 32-bit registers per SMP
◮ Configurable 64 KB shared memory / L1 cache
◮ 3 GB of GDDR5 local RAM
◮ Price: 1,299.00 USD

Our experiments used a host PC with:

◮ Intel Xeon E5620 quad core (2.4 GHz)
◮ 2 × 4 GB of DDR3-1333 RAM
◮ 2 × Tesla M2050 GPU cards

SLIDE 25

Part III: Implementation

SLIDE 26

CUDA modular multiplication (1/2)

◮ Iteration function for Pollard rho:

  f(x) = g·x  if 0 ≤ x < q/3
         x^2  if q/3 ≤ x < 2q/3
         h·x  if 2q/3 ≤ x < q

◮ Need fast, multiprecision modular multiplication to solve DLs in Z*_N
◮ We used Koç et al.’s CIOS algorithm for Montgomery multiplication
◮ Low auxiliary storage ⇒ lots of threads
◮ We do one thread per multiplication
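The GPU kernels use Koç et al.’s word-interleaved CIOS variant; as a simplified illustration of the underlying arithmetic only, here is plain Montgomery reduction (REDC) in Python — a sketch of the idea, not the CUDA code:

```python
def montgomery_setup(n, w=32):
    """Precompute constants for odd modulus n, with R = 2^k a multiple
    of the word size w (as on the GPU)."""
    k = ((n.bit_length() + w - 1) // w) * w
    R = 1 << k
    n_inv = pow(-n, -1, R)               # n' with n * n' = -1 (mod R)
    return R, k, n_inv

def redc(T, n, R, k, n_inv):
    """Montgomery reduction: T * R^(-1) mod n, for 0 <= T < n*R."""
    m = ((T & (R - 1)) * n_inv) & (R - 1)
    t = (T + m * n) >> k                 # exact division by R
    return t - n if t >= n else t

def mont_mul(a, b, n, R, k, n_inv):
    """Multiply values held in Montgomery form (x * R mod n)."""
    return redc(a * b, n, R, k, n_inv)

# Check 17 * 19 mod 101 via Montgomery form.
n = 101
R, k, n_inv = montgomery_setup(n)
aM, bM = (17 * R) % n, (19 * R) % n
out = redc(mont_mul(aM, bM, n, R, k, n_inv), n, R, k, n_inv)
print(out)  # -> 20, i.e. 17 * 19 mod 101
```

CIOS interleaves the multiplication and reduction word by word to keep the working set small, which is what makes the one-thread-per-multiplication layout above feasible.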


SLIDE 27

CUDA modular multiplication (2/2)

Table: k-bit modular multiplications per second and (amortized) time per k-bit modular multiplication on a single Tesla M2050.

Bit length of modulus | Time per trial ± std dev | Amortized time per modmult | Modmults per second
192  | 30.538 s ± 4 ms   | 1.19 ns  | ≈ 840,336,000
256  | 50.916 s ± 5 ms   | 1.98 ns  | ≈ 505,050,000
512  | 186.969 s ± 4 ms  | 7.30 ns  | ≈ 136,986,000
768  | 492.6 s ± 200 ms  | 19.24 ns | ≈ 51,975,000
1024 | 2304.5 s ± 300 ms | 90.02 ns | ≈ 11,108,000

Larger k ⇒ each multiplication takes longer ⇒ fewer multiplications can be computed in parallel


SLIDE 28

CUDA Pollard rho (1/2)

Goal

Compute discrete logarithms modulo k_N-bit RSA numbers N = pq with 2^(k_B)-smooth totient.

Our implementation:

◮ Optimized for k_N = 1536 and k_B ≈ 55
◮ Assumes that the factorizations of p − 1 and q − 1 are known
◮ Uses the Pohlig–Hellman approach to decompose the problem into k_B-bit subproblems
◮ Distinguished points: at least 10 trailing zeros in binary (Montgomery) representation
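The Pohlig–Hellman decomposition can be sketched in a few lines of Python. The toy instance below is ours (p = 8101, whose totient 8100 = 2^2 · 3^4 · 5^2 is 5-smooth; base 6 and exponent 7531 are arbitrary), and brute force stands in for rho on the small subproblems:

```python
from math import prod

def dlog_bruteforce(g, h, p, order):
    """Exhaustive-search DL in a small subgroup (fine for smooth factors)."""
    acc = 1
    for x in range(order):
        if acc == h:
            return x
        acc = (acc * g) % p
    raise ValueError("no discrete log found")

def pohlig_hellman(g, h, p, factors):
    """Solve g^x = h (mod p) when the order of g divides
    n = prod(q**e for q, e in factors), with all primes q small."""
    n = prod(q ** e for q, e in factors)
    residues, moduli = [], []
    for q, e in factors:
        qe = q ** e
        gi = pow(g, n // qe, p)     # project into subgroup of order | q^e
        hi = pow(h, n // qe, p)
        residues.append(dlog_bruteforce(gi, hi, p, qe))
        moduli.append(qe)
    x = 0                           # CRT-combine the partial logs
    for r, m in zip(residues, moduli):
        M = n // m
        x = (x + r * M * pow(M, -1, m)) % n
    return x

# Toy instance: p = 8101 is prime and p - 1 = 2^2 * 3^4 * 5^2 is 5-smooth.
p, factors = 8101, [(2, 2), (3, 4), (5, 2)]
h = pow(6, 7531, p)
x = pohlig_hellman(6, h, p, factors)
print(pow(6, x, p) == h)  # -> True
```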


SLIDE 29

CUDA Pollard rho (2/2)

[Figure: log-log plot of compute time (s, 10 to 1000) versus B (smoothness of totient, 2^50 to 2^58), with fitted curve ≈ 5.78×10^−5 · B^0.38]

◮ Expected cost per B-smooth DL is in Θ(√B)
◮ Each card solves 768 / lg B such DLs ⇒ runtime in Θ(√B / lg B)
◮ B ≈ 2^54 ⇒ runtime roughly proportional to B^0.39

SLIDE 30

Part IV: Implications

SLIDE 31

Implications

What are the implications for existing DL-based cryptosystems?

In most cases, there are no real implications.


SLIDE 32

Implications

What are the implications for existing DL-based cryptosystems?

In most cases, there are no real implications.

So why am I speaking at SHARCS?

◮ Cost estimates for cryptographically interesting computations are useful
◮ Construct trapdoor discrete logarithm groups
◮ Potential attacks on some zero-knowledge proofs
◮ Menezes: duplicate-signature key selection (DSKS) attacks on RSA


SLIDE 33

Attack on zero-knowledge “range proofs”

Problem

For a fixed generator g ∈ G and commitment C = g^x, prove (in zero-knowledge, with knowledge of x) that a ≤ x ≤ b.


SLIDE 34

Attack on zero-knowledge “range proofs”

Problem

For a fixed generator g ∈ G and commitment C = g^x, prove (in zero-knowledge, with knowledge of x) that a ≤ x ≤ b.

Lagrange’s four-square theorem: An integer x ∈ Z is nonnegative if and only if it can be expressed as the sum of (at most) four integer squares.

◮ Idea: Compute C_a = C/g^a = g^(x−a) and C_b = g^b/C = g^(b−x), then prove that C_a and C_b each commit to a sum of four squares.
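A brute-force four-square decomposition is enough to see the mechanics. This is an illustrative Python sketch with made-up values (the range [10, 100] and the committed value 42 are ours; efficient provers would use a randomized decomposition algorithm such as Rabin–Shallit instead):

```python
from math import isqrt

def four_squares(n):
    """Brute-force a representation n = w^2 + x^2 + y^2 + z^2.
    Lagrange guarantees one exists for every integer n >= 0."""
    for w in range(isqrt(n) + 1):
        for x in range(isqrt(n - w * w) + 1):
            for y in range(isqrt(n - w * w - x * x) + 1):
                z2 = n - w * w - x * x - y * y
                z = isqrt(z2)
                if z * z == z2:
                    return w, x, y, z

# The prover decomposes x - a and b - x; both are nonnegative
# exactly when a <= x <= b, so both have four-square witnesses.
x, a, b = 42, 10, 100
for v in (x - a, b - x):
    sq = four_squares(v)
    assert sum(s * s for s in sq) == v
print("both witnesses found")
```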


SLIDE 35

Attack on zero-knowledge “range proofs”

Problem

For a fixed generator g ∈ G and commitment C = g^x, prove (in zero-knowledge, with knowledge of x) that a ≤ x ≤ b.

Lagrange’s four-square theorem: An integer x ∈ Z is nonnegative if and only if it can be expressed as the sum of (at most) four integer squares.

◮ Idea: Compute C_a = C/g^a = g^(x−a) and C_b = g^b/C = g^(b−x), then prove that C_a and C_b each commit to a sum of four squares.
◮ Soundness relies on the order of G being hidden, which it usually is not!
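To see the soundness failure concretely: in a group of public order n, exponents wrap around mod n, so a committed value below the lower bound still yields a “nonnegative” difference. A toy Python demonstration (all parameters ours; the brute-force four-square helper from the previous sketch is repeated here for self-containment):

```python
from math import isqrt

def four_squares(n):
    # brute-force n = w^2 + x^2 + y^2 + z^2 (Lagrange guarantees existence)
    for w in range(isqrt(n) + 1):
        for x in range(isqrt(n - w * w) + 1):
            for y in range(isqrt(n - w * w - x * x) + 1):
                z2 = n - w * w - x * x - y * y
                z = isqrt(z2)
                if z * z == z2:
                    return w, x, y, z

# Toy group with PUBLIC order: Z_p^* for p = 2027, so exponents only
# matter mod n = p - 1. A cheating prover whose x is BELOW the claimed
# lower bound a can decompose (x - a) mod n instead.
p, n, g = 2027, 2026, 2
a_bound, x = 100, 5                      # x is far below the range [100, ...]
C = pow(g, x, p)                         # commitment to x
Ca = (C * pow(g, -a_bound, p)) % p       # supposedly commits to x - a >= 0
w = (x - a_bound) % n                    # wraps to n - 95: "nonnegative"!
sq = four_squares(w)
# The four-square "witness" verifies even though x < a_bound:
print(pow(g, sum(s * s for s in sq), p) == Ca)  # -> True
```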


SLIDE 36

Attack on zero-knowledge “range proofs”

Problem

For a fixed generator g ∈ G and commitment C = g^x, prove (in zero-knowledge, with knowledge of x) that a ≤ x ≤ b.

Lagrange’s four-square theorem: An integer x ∈ Z is nonnegative if and only if it can be expressed as the sum of (at most) four integer squares.

◮ Idea: Compute C_a = C/g^a = g^(x−a) and C_b = g^b/C = g^(b−x), then prove that C_a and C_b each commit to a sum of four squares.
◮ Soundness relies on the order of G being hidden, which it usually is not!
◮ Move the proof into Z*_N for an RSA number N = pq (whose factorization is kept secret from the prover)


SLIDE 37

Trapdoor discrete logarithm groups (1/3)

Idea

Work modulo an RSA modulus N = pq such that p − 1 and q − 1 are B-smooth.

◮ Public key: N
◮ Private key: p, q and the factorizations of p − 1 and q − 1
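Key generation for such a trapdoor group can be sketched as follows — an illustrative Python toy at sizes we chose ourselves (32-bit primes, 50-smooth), whereas the talk targets 1536-bit N and B ≈ 2^55:

```python
import random

def is_prime(n):
    """Deterministic Miller-Rabin for n < 3.3e14 (fixed witness set)."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13, 17):
        if n % p == 0:
            return n == p
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for a in (2, 3, 5, 7, 11, 13, 17):
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False
    return True

def smooth_prime(bits, B, rng):
    """Sample a prime p of about `bits` bits with B-smooth p - 1."""
    small = [q for q in range(3, B + 1) if is_prime(q)]
    while True:
        m = 2                            # keep p - 1 even
        while m.bit_length() < bits:
            m *= rng.choice(small)
        if is_prime(m + 1):
            return m + 1

rng = random.Random(7)                   # fixed seed: reproducible demo
p = smooth_prime(32, 50, rng)
q = smooth_prime(32, 50, rng)
N = p * q    # public trapdoor-DL modulus; the private key is (p, q)
             # together with the stored factorizations of p-1 and q-1
```

Building p − 1 as a product of random primes ≤ B makes it smooth by construction; only the primality of p itself has to be tested.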

SLIDE 38

Trapdoor discrete logarithm groups (1/3)

Idea

Work modulo an RSA modulus N = pq such that p − 1 and q − 1 are B-smooth.

◮ Public key: N
◮ Private key: p, q and the factorizations of p − 1 and q − 1

Trapdoor DL cost

◮ With the trapdoor key: DL computation takes Θ((lg N / lg B) · √B) highly parallelizable work
◮ Let µ1 be the number of (lg N / 2)-bit modular multiplications computable per core-second; then the trapdoor DL runtime is ≈ (lg N / lg B) · c · √B / (Ψ · µ1) seconds, for some constant c


SLIDE 39

Trapdoor discrete logarithm groups (2/3)

Idea

Work modulo an RSA modulus N = pq such that p − 1 and q − 1 are B-smooth.

◮ Public key: N
◮ Private key: p, q and the factorizations of p − 1 and q − 1

Non-trapdoor DL cost (1/2)

◮ Without the trapdoor key: the best approach seems to be factoring N to recover the private key!
◮ Pollard’s p − 1 algorithm: factors B-smooth numbers with O(B) work
◮ The p − 1 attack is inherently serial! Parallelism won’t help much!
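Pollard’s p − 1 method is short enough to show in full. An illustrative Python sketch with a toy modulus of our own choosing (1013 − 1 = 2^2 · 11 · 23 is 23-smooth, while 2027 − 1 = 2 · 1013 is not):

```python
from math import gcd

def pollard_p_minus_1(N, B):
    """Pollard's p-1: recovers a prime factor p of N when p - 1 is
    B-smooth. The exponentiation chain is inherently serial."""
    a = 2
    for j in range(2, B + 1):
        a = pow(a, j, N)            # a = 2^(2*3*...*j) mod N
        d = gcd(a - 1, N)
        if 1 < d < N:
            return d
    return None

# 1013 * 2027: only the factor with 23-smooth p - 1 is found.
print(pollard_p_minus_1(1013 * 2027, 23))  # -> 1013
```

Each step depends on the previous value of a, which is why throwing Ψ cores at a single p − 1 computation buys almost nothing.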


SLIDE 40

Trapdoor discrete logarithm groups (2/3)

Idea

Work modulo an RSA modulus N = pq such that p − 1 and q − 1 are B-smooth.

◮ Public key: N
◮ Private key: p, q and the factorizations of p − 1 and q − 1

Non-trapdoor DL cost (2/2)

◮ ECM, QS, et al.: highly parallelizable and subexponential cost, but the cost scales with lg N instead of B
◮ For 1536-bit RSA moduli, the crossover point occurs when Ψ · B ≈ 2^85
◮ Need Ψ ≫ 2^30 cores to compute non-trapdoor DLs faster with other algorithms


SLIDE 41

Trapdoor discrete logarithm groups (3/3)

Idea

Work modulo an RSA modulus N = pq such that p − 1 and q − 1 are B-smooth.

◮ Public key: N
◮ Private key: p, q and the factorizations of p − 1 and q − 1

Practical security analysis

B ≈ 2^55 ⇒
  • > 1700 years for a non-trapdoor DL
  • < 2 minutes for a trapdoor DL

◮ These are wall-clock times!

SLIDE 42

Part V: Conclusion

SLIDE 43

Summary

◮ Used CUDA to solve DLs in smooth-order groups
◮ Up to about 2^58-smooth 1536-bit RSA numbers in under 5 minutes on 2 × Tesla M2050
◮ > 100 million 768-bit modular multiplications per second
◮ > 1.7 billion 192-bit modular multiplications per second
◮ Extrapolating: a 2^80-smooth DL should be feasible in ≈ 23 hours on the same Tesla cards (with a bit more system RAM)
◮ Constructed and analyzed trapdoor discrete logarithm groups
◮ Proposed a simple attack on (naive implementations of) Boudot’s zero-knowledge range proofs

SLIDE 44

All of our code is free and open source: http://crysp.uwaterloo.ca/software/