Counting points as a video game. D. J. Bernstein, University of Illinois at Chicago


slide-1
SLIDE 1

Counting points as a video game

D. J. Bernstein
University of Illinois at Chicago

Want efficient computation of secure twist-secure genus-2 $C$ with very small coefficients for fastest known Diffie–Hellman. Can’t do that with CM.

This talk focuses on algorithms; does not report any computations. Need results today? Ask Gaudry.

But first an advertisement ...

slide-2
SLIDE 2

1985 H. Lange–Ruppert “Complete systems of addition laws on abelian varieties”: $A(k)$ has a complete system of addition laws, degree $\le (3, 3)$. Symmetry $\Rightarrow$ degree $\le (2, 2)$.

“The proof is nonconstructive ... To determine explicitly a complete system of addition laws requires tedious computations already in the easiest case of an elliptic curve in Weierstrass normal form.”

slide-3
SLIDE 3

1985 Lange–Ruppert: Explicit complete system of 3 addition laws for short Weierstrass curves. Reduce formulas to 53 monomials by introducing extra variables $x_i y_j + x_j y_i$, $x_i y_j - x_j y_i$. I won’t copy the formulas here.

1987 Lange–Ruppert “Addition laws on elliptic curves in arbitrary characteristics”: Explicit complete system of 3 addition laws for long Weierstrass curves.

slide-4
SLIDE 4
slide-5
SLIDE 5

1995 Bosma–Lenstra: Explicit complete system of 2 addition laws for long Weierstrass curves: explicit polynomials $X_3, Y_3, Z_3, X_3', Y_3', Z_3' \in \mathbf{Z}[a_1, a_2, a_3, a_4, a_6, X_1, Y_1, Z_1, X_2, Y_2, Z_2]$.

slide-6
SLIDE 6

My previous slide in this talk: Bosma–Lenstra $Y_3', Z_3'$. Not human-comprehensible.

slide-7
SLIDE 7

Actually, slide shows $\mathrm{Publish}(Y_3'), \mathrm{Publish}(Z_3')$, where Publish introduces typos.

slide-8
SLIDE 8

What this means: For all fields $k$, all $\mathbf{P}^2$ Weierstrass curves $E/k : Y^2 Z + a_1 XYZ + a_3 YZ^2 = X^3 + a_2 X^2 Z + a_4 XZ^2 + a_6 Z^3$, all $P_1 = (X_1 : Y_1 : Z_1) \in E(k)$, all $P_2 = (X_2 : Y_2 : Z_2) \in E(k)$:
$(X_3 : Y_3 : Z_3)$ is $P_1 + P_2$ or $(0 : 0 : 0)$;
$(X_3' : Y_3' : Z_3')$ is $P_1 + P_2$ or $(0 : 0 : 0)$;
at most one of these is $(0 : 0 : 0)$.

slide-9
SLIDE 9

2009.11 Bernstein–T. Lange, eprint.iacr.org/2009/580: For all fields $k$ with $2 \ne 0$, all $\mathbf{P}^1 \times \mathbf{P}^1$ Edwards curves $E/k : X^2 T^2 + Y^2 Z^2 = Z^2 T^2 + d X^2 Y^2$, all $P_1, P_2 \in E(k)$, $P_1 = ((X_1 : Z_1), (Y_1 : T_1))$, $P_2 = ((X_2 : Z_2), (Y_2 : T_2))$:
$(X_3 : Z_3)$ is $x(P_1 + P_2)$ or $(0 : 0)$;
$(X_3' : Z_3')$ is $x(P_1 + P_2)$ or $(0 : 0)$;
$(Y_3 : T_3)$ is $y(P_1 + P_2)$ or $(0 : 0)$;
$(Y_3' : T_3')$ is $y(P_1 + P_2)$ or $(0 : 0)$;
at most one of these is $(0 : 0)$.

slide-10
SLIDE 10

$X_3 = X_1 Y_2 Z_2 T_1 + X_2 Y_1 Z_1 T_2$,
$Z_3 = Z_1 Z_2 T_1 T_2 + d X_1 X_2 Y_1 Y_2$,
$Y_3 = Y_1 Y_2 Z_1 Z_2 - X_1 X_2 T_1 T_2$,
$T_3 = Z_1 Z_2 T_1 T_2 - d X_1 X_2 Y_1 Y_2$,
$X_3' = X_1 Y_1 Z_2 T_2 + X_2 Y_2 Z_1 T_1$,
$Z_3' = X_1 X_2 T_1 T_2 + Y_1 Y_2 Z_1 Z_2$,
$Y_3' = X_1 Y_1 Z_2 T_2 - X_2 Y_2 Z_1 T_1$,
$T_3' = X_1 Y_2 Z_2 T_1 - X_2 Y_1 Z_1 T_2$.

Much, much, much simpler than Lange–Ruppert, Bosma–Lenstra. Also much easier to prove. Also useful for computations. Geometrically, all elliptic curves. (Handle 2 = 0 separately.)

slide-11
SLIDE 11
slide-12
SLIDE 12
slide-13
SLIDE 13
slide-14
SLIDE 14

History of these addition laws:

1761 Euler, 1866 Gauss: Beautiful addition law for $x^2 + y^2 = 1 - x^2 y^2$, the “lemniscatic elliptic curve.” $(x_1, y_1) + (x_2, y_2) = (x_3, y_3)$ with
$x_3 = \frac{x_1 y_2 + x_2 y_1}{1 - x_1 x_2 y_1 y_2}$, $y_3 = \frac{y_1 y_2 - x_1 x_2}{1 + x_1 x_2 y_1 y_2}$.

1986 Chudnovsky–Chudnovsky factorization-speed study begins with $G_a$, $G_m$, $T_2$, lemniscate; but focuses on curve families.

slide-15
SLIDE 15

2007 Edwards: Obtain all elliptic curves over $\mathbf{Q}$ by generalizing to curve $x^2 + y^2 = 1 + d x^2 y^2$. $(x_1, y_1) + (x_2, y_2) = (x_3, y_3)$ with
$x_3 = \frac{x_1 y_2 + x_2 y_1}{1 + d x_1 x_2 y_1 y_2}$, $y_3 = \frac{y_1 y_2 - x_1 x_2}{1 - d x_1 x_2 y_1 y_2}$.

Edwards actually used $d = c^4$. Scaling: $x^2 + y^2 = c^2 (1 + x^2 y^2)$. But $x^2 + y^2 = 1 + d x^2 y^2$ lowers $j$ degree; includes lemniscate; simplifies degeneration to clock.
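A minimal sketch of the affine Edwards addition law over a small prime field. The prime 13 and d = 2 are assumptions for illustration; d is a non-square mod 13, the case in which the 2007 Bernstein–Lange analysis shows the denominators never vanish. Function names are hypothetical.

```python
# Sketch only: P_MOD and D are illustrative choices, not values from the talk.
P_MOD = 13   # small odd prime
D = 2        # non-square mod 13 (the squares are 1, 3, 4, 9, 10, 12)


def on_curve(x, y, p=P_MOD, d=D):
    """Check x^2 + y^2 = 1 + d x^2 y^2 in F_p."""
    return (x*x + y*y - 1 - d*x*x*y*y) % p == 0


def edwards_add(P, Q, p=P_MOD, d=D):
    """Affine Edwards addition; division is multiplication by a modular inverse."""
    x1, y1 = P
    x2, y2 = Q
    t = d * x1 * x2 * y1 * y2 % p
    # pow(a, -1, p) needs Python 3.8+ and raises ValueError if a is 0 mod p
    x3 = (x1*y2 + x2*y1) * pow((1 + t) % p, -1, p) % p
    y3 = (y1*y2 - x1*x2) * pow((1 - t) % p, -1, p) % p
    return (x3, y3)


def affine_points(p=P_MOD, d=D):
    """All affine points of the curve over F_p."""
    return [(x, y) for x in range(p) for y in range(p) if on_curve(x, y, p, d)]
```

The neutral element is (0, 1) and the negative of (x, y) is (−x, y), so a quick sanity check is that `edwards_add(P, (0, 1))` returns `P` for every point.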

slide-16
SLIDE 16

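Embed $E$ into $\mathbf{P}^1 \times \mathbf{P}^1$, as recommended by Edwards.
$\left(\infty, \pm\frac{1}{\sqrt{d}}\right), \left(\pm\frac{1}{\sqrt{d}}, \infty\right) \in E(k(\sqrt{d}))$.

Edwards commented that the addition law works for
$(x_1, y_1) + \left(\frac{1}{\sqrt{d}}, \infty\right) = \left(\frac{1}{y_1 \sqrt{d}}, -\frac{1}{x_1 \sqrt{d}}\right)$.

Can easily use this to obtain a dual addition law:
$x_3 = \frac{x_1 y_1 + x_2 y_2}{x_1 x_2 + y_1 y_2}$, $y_3 = \frac{x_1 y_1 - x_2 y_2}{x_1 y_2 - x_2 y_1}$.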

slide-17
SLIDE 17

Here’s how:

$$
\begin{aligned}
(x_1, y_1) + (x_2, y_2)
&= \left((x_1, y_1) + \left(\tfrac{1}{\sqrt{d}}, \infty\right)\right) + (x_2, y_2) - \left(\tfrac{1}{\sqrt{d}}, \infty\right) \\
&= \left(\frac{1}{y_1 \sqrt{d}}, -\frac{1}{x_1 \sqrt{d}}\right) + (x_2, y_2) - \left(\tfrac{1}{\sqrt{d}}, \infty\right) \\
&= \left(
  \frac{\dfrac{y_2}{y_1 \sqrt{d}} - \dfrac{x_2}{x_1 \sqrt{d}}}{1 - \dfrac{d x_2 y_2}{d x_1 y_1}},\
  \frac{-\dfrac{y_2}{x_1 \sqrt{d}} - \dfrac{x_2}{y_1 \sqrt{d}}}{1 + \dfrac{d x_2 y_2}{d x_1 y_1}}
\right) - \left(\tfrac{1}{\sqrt{d}}, \infty\right) \\
&= \left(
  \frac{x_1 y_2 - x_2 y_1}{\sqrt{d}\,(x_1 y_1 - x_2 y_2)},\
  \frac{-y_1 y_2 - x_1 x_2}{\sqrt{d}\,(x_1 y_1 + x_2 y_2)}
\right) - \left(\tfrac{1}{\sqrt{d}}, \infty\right) \\
&= \left(
  \frac{x_1 y_1 + x_2 y_2}{x_1 x_2 + y_1 y_2},\
  \frac{x_1 y_1 - x_2 y_2}{x_1 y_2 - x_2 y_1}
\right).
\end{aligned}
$$
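The dual addition law can be compared numerically against the original Edwards law. A small sketch, assuming the illustrative field F_13 and d = 2 (neither is from the talk; function names are hypothetical): for every pair of affine points where both dual denominators are nonzero, the two laws should give the same sum.

```python
# Sketch only: P_MOD and D are illustrative choices, not values from the talk.
P_MOD = 13
D = 2        # non-square mod 13, so the original law has no exceptional cases


def on_curve(x, y, p=P_MOD, d=D):
    return (x*x + y*y - 1 - d*x*x*y*y) % p == 0


def original_add(P, Q, p=P_MOD, d=D):
    """Edwards addition law."""
    x1, y1 = P
    x2, y2 = Q
    t = d * x1 * x2 * y1 * y2 % p
    x3 = (x1*y2 + x2*y1) * pow((1 + t) % p, -1, p) % p
    y3 = (y1*y2 - x1*x2) * pow((1 - t) % p, -1, p) % p
    return (x3, y3)


def dual_add(P, Q, p=P_MOD):
    """Dual addition law; returns None when a denominator vanishes."""
    x1, y1 = P
    x2, y2 = Q
    den_x = (x1*x2 + y1*y2) % p
    den_y = (x1*y2 - x2*y1) % p
    if den_x == 0 or den_y == 0:
        return None
    x3 = (x1*y1 + x2*y2) * pow(den_x, -1, p) % p
    y3 = (x1*y1 - x2*y2) * pow(den_y, -1, p) % p
    return (x3, y3)


def compare_laws(p=P_MOD, d=D):
    """Return (pairs where the dual law is defined, pairs where the laws disagree)."""
    pts = [(x, y) for x in range(p) for y in range(p) if on_curve(x, y, p, d)]
    checked = mismatches = 0
    for P in pts:
        for Q in pts:
            s = dual_add(P, Q, p)
            if s is None:
                continue
            checked += 1
            if s != original_add(P, Q, p, d):
                mismatches += 1
    return checked, mismatches
```

Note that `dual_add(P, P)` always returns `None`: the denominator $x_1 y_2 - x_2 y_1$ vanishes for doublings, which is exactly where the original law is needed.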

slide-22
SLIDE 22

2007 Bernstein–Lange: Edwards addition law gives speed records for ECM, ECC, etc.

2008 Hisil–Wong–Carter–Dawson: First publication of dual addition law; new speed records. (Completely different derivation.)

2009.11 Bernstein–Lange: Addition law and dual form a complete system. Elementary, computational proof, giving elementary, computational definition of the group $E(k)$ using these formulas.

slide-23
SLIDE 23

1987 Lenstra “Elliptic curves and number-theoretic algorithms”: Use Lange–Ruppert complete system of addition laws to give computational definition of the Weierstrass group $E(R)$ for more general rings $R$.

Define $\mathbf{P}^2(R) = \{(X : Y : Z) : X, Y, Z \in R;\ XR + YR + ZR = R\}$ where $(X : Y : Z)$ is the module $\{(\lambda X, \lambda Y, \lambda Z) : \lambda \in R\}$.

Define $E(R) = \{(X : Y : Z) \in \mathbf{P}^2(R) : Y^2 Z = X^3 + a_4 X Z^2 + a_6 Z^3\}$.
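For the concrete ring R = Z/nZ (an assumption made here for illustration; the slide allows more general rings), the condition XR + YR + ZR = R is the same as gcd(X, Y, Z, n) = 1, so membership in P^2(R) and E(R) can be tested directly. Function names are hypothetical.

```python
from functools import reduce
from math import gcd


def is_primitive(X, Y, Z, n):
    """True when X*R + Y*R + Z*R = R for R = Z/nZ, i.e. gcd(X, Y, Z, n) = 1."""
    return reduce(gcd, (X, Y, Z, n)) == 1


def in_E(X, Y, Z, a4, a6, n):
    """True when (X : Y : Z) is in P^2(Z/nZ) and satisfies
    Y^2 Z = X^3 + a4 X Z^2 + a6 Z^3 modulo n."""
    if not is_primitive(X, Y, Z, n):
        return False
    return (Y*Y*Z - (X**3 + a4*X*Z*Z + a6*Z**3)) % n == 0
```

With n = 35 and a4 = a6 = 1, the triple (0 : 1 : 15) passes `in_E`: it reduces to the point at infinity modulo 5 and to the finite point (0 : 1 : 1) modulo 7. Points like this have no affine description, which is why the definition works with projective triples and modules.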

slide-24
SLIDE 24

To define (and compute) sum $(X_1 : Y_1 : Z_1) + (X_2 : Y_2 : Z_2)$: Consider (and compute) Lange–Ruppert $(X_3 : Y_3 : Z_3)$, $(X_3' : Y_3' : Z_3')$, $(X_3'' : Y_3'' : Z_3'')$.

Add these $R$-modules:
$\{(\lambda X_3, \lambda Y_3, \lambda Z_3) + (\lambda' X_3', \lambda' Y_3', \lambda' Z_3') + (\lambda'' X_3'', \lambda'' Y_3'', \lambda'' Z_3'') : \lambda, \lambda', \lambda'' \in R\}$.

Allow any ring $R$ having trivial class group. Then this sum of modules can be expressed as $(X : Y : Z)$.

slide-25
SLIDE 25

Counting points: Schoof etc.

Input: prime $\ell$; abelian variety $A/\mathbf{F}_q$, usually Jac(genus-$g$ curve).

Write down generic point $P \in A$ with $\ell P = 0$. Specifically: express $\ell P = 0$ as system of equations on coordinates of $P$; extend $\mathbf{F}_q$ to ring $R = \mathbf{F}_q[\text{coords}]/\text{equations}$; note that $\ell P = 0$ in $A(R)$.

Genus 1: $\#R \approx q^{\ell^2}$. Genus 2: $\#R \approx q^{\ell^4}$. Much larger computations.

slide-26
SLIDE 26

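True often enough to be useful:

Genus 1: Unique linear equation $\varphi^2(P) - s_1 \varphi(P) + qP = 0$ for $q$th-power $\varphi : A(R) \to A(R)$ with $s_1 \in \{0, 1, \ldots, \ell - 1\}$. Then $1 - s_1 + q - \#A(\mathbf{F}_q) \in \ell\mathbf{Z}$.

Genus 2: Unique linear equation $\varphi^4(P) - s_1 \varphi^3(P) + s_2 \varphi^2(P) - q s_1 \varphi(P) + q^2 P = 0$ with $s_1, s_2 \in \{0, 1, \ldots, \ell - 1\}$. Then $1 - s_1 + s_2 - q s_1 + q^2 - \#A(\mathbf{F}_q) \in \ell\mathbf{Z}$.

Try many $\ell$; deduce $\#A(\mathbf{F}_q)$. Silly name: “$\ell$-adic method.”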
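The last step, "try many ℓ; deduce #A(F_q)", is a Chinese-remainder computation. The genus-1 sketch below illustrates only that step, under loud assumptions: the residues s1 mod ℓ are faked from a brute-force point count (a stand-in for the real torsion computation), and the curve y^2 = x^3 + 2x + 3 over F_101 is a hypothetical example. It uses the Hasse bound |s1| ≤ 2√q to pin down s1 once the product of the primes exceeds 4√q.

```python
def count_points(p, a, b):
    """Brute-force #E(F_p) for y^2 = x^3 + a*x + b, counting the point at infinity."""
    roots = [0] * p                      # roots[v] = number of y with y^2 = v
    for y in range(p):
        roots[y * y % p] += 1
    return 1 + sum(roots[(x**3 + a*x + b) % p] for x in range(p))


def small_primes():
    """Yield 2, 3, 5, 7, ... by trial division (fine for a toy example)."""
    n = 2
    while True:
        if all(n % r for r in range(2, int(n**0.5) + 1)):
            yield n
        n += 1


def recover_order(p, residue_of_s1):
    """Combine s1 mod l over small primes l by CRT, then apply |s1| <= 2*sqrt(p).

    residue_of_s1(l) must return s1 mod l; in a real Schoof-style computation
    it would come from the linear equation on the l-torsion.
    """
    s, M = 0, 1
    for l in small_primes():
        r = residue_of_s1(l)
        # lift s (mod M) to the unique value mod M*l that is r mod l
        s += M * ((r - s) * pow(M, -1, l) % l)
        M *= l
        if M * M > 16 * p:               # M > 4*sqrt(p): s1 is now determined
            break
    if s > M // 2:                       # choose the representative nearest 0
        s -= M
    return p + 1 - s                     # #E(F_p) = 1 - s1 + p
```

For p = 101 the loop stops after ℓ = 2, 3, 5, 7, since 210 > 4√101 ≈ 40.2.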

slide-27
SLIDE 27

Which coords to choose for $A = \mathrm{Jac}(C)$ when $C$ has genus 2?

2000 Gaudry–Harley, 2004 Gaudry–Schost, 2009 Gaudry–Schost: Use Mumford coordinates for $A$, and write $P = P_1 - P_2$ with $P_i = (x_i, y_i) \in C \to A$.

$R = \mathbf{F}_q[x_1, y_1, x_2, y_2] / (\,(x_1, y_1) \in C;\ (x_2, y_2) \in C;\ \ell(x_1, y_1) = \ell(x_2, y_2)\,)$.

slide-28
SLIDE 28

$\ell(x_1, y_1) = \ell(x_2, y_2)$ gives two equations in $x_1, x_2$ of degree $\ell^{2+o(1)}$. Eliminate $x_2$, obtaining equation in $x_1$.

Elimination time $(\ell^6 \lg q)^{1+o(1)}$ using fast-arithmetic techniques. Equation in $x_1$: degree $\ell^{4+o(1)}$. Computing $\varphi(P)$ etc.: time $(\ell^4 (\lg q)^2)^{1+o(1)}$.

Total time $(\lg q)^{8+o(1)}$ to handle all $\ell \le (\lg q)^{1+o(1)}$.

2004 Gaudry–Schost: symmetrize; constant-factor speedup.
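The total can be recovered by summing the per-ℓ costs. A short worked version, under the simplifying assumption (not spelled out in the slide) that each term is bounded by the largest one, with $L = (\lg q)^{1+o(1)}$ the bound on ℓ:

```latex
\begin{align*}
\sum_{\ell \le L} (\ell^6 \lg q)^{1+o(1)}
  &\le L \cdot (L^6 \lg q)^{1+o(1)} = (\lg q)^{8+o(1)}
  && \text{(eliminations)} \\
\sum_{\ell \le L} (\ell^4 (\lg q)^2)^{1+o(1)}
  &\le L \cdot (L^4 (\lg q)^2)^{1+o(1)} = (\lg q)^{7+o(1)}
  && \text{(computing } \varphi(P) \text{ etc.)}
\end{align*}
```

So in this cost model the eliminations account for the exponent 8, and the Frobenius computations sit one power of $\lg q$ below.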

slide-29
SLIDE 29

2000 Gaudry–Harley et al. don’t actually use $A(R)$. They map $R$ to a field, allegedly saving time.

slide-30
SLIDE 30

But factorization is slow! Latest factorization algorithm, 2008 Kedlaya–Umans, takes time $(\lg q)^{7+o(1)}$ to factor the $x_1$ equation. Sum over $\ell$: $(\lg q)^{8+o(1)}$.

Closer analysis of $o(1)$ shows that factorization still loses time here, except for “free” factors.

slide-31
SLIDE 31

Can save time in genus 1 by building a smaller $R$ that defines a $\varphi$-stable subgroup of $\ell$-torsion. (1991 Elkies; 1992 Atkin)

Fastest such techniques reported for genus 2: time $\ell^{12+o(1)}$. Use for $\ell \le (\lg q)^{2/3+o(1)}$. Asymptotic speedup $1 + o(1)$.

Also “kangaroos”/“cockroaches”: Asymptotic speedup $1 + o(1)$.

Also #$A$ mod 22 etc.: Asymptotic speedup $1 + o(1)$.

slide-32
SLIDE 32

Video games

Millions of people buy PCs to “play video games”: i.e., to participate in applied physics simulations, often highly networked.

Society adjusts ultracomputer to improve these simulations.

Algorithm designers obtain much better results by paying attention to the ultracomputer architecture!

Most important fact: #ALUs $\in \Theta(\text{\#bits of RAM})$.

slide-33
SLIDE 33

My university has just spent <$20000 on a cluster for me. 8 CPUs; 15 GTX 295 cards. Each GTX 295 has 2 “GPUs.”

slide-34
SLIDE 34

Each GPU has 30 cores.

slide-35
SLIDE 35

Each core has 8 ALUs and <100KB of fast RAM. Each ALU performs a 32-bit operation every cycle @1.242GHz.
slide-36
SLIDE 36

I also have accounts on several “TeraGrid” clusters.

Right now I’m using 448 GPUs; 13440 cores; 107520 32-bit ALUs.
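The counts follow from the per-GPU figures quoted for the GTX 295. A quick arithmetic sketch; the peak-throughput line assumes, beyond what the slide states, that all 448 GPUs run at the 1.242GHz clock of the local cards.

```python
GPUS = 448
CORES_PER_GPU = 30        # per the GTX 295 figures quoted in the talk
ALUS_PER_CORE = 8
CLOCK_HZ = 1.242e9        # assumption: same clock on every GPU in use

cores = GPUS * CORES_PER_GPU              # 13440
alus = cores * ALUS_PER_CORE              # 107520
# one 32-bit operation per ALU per cycle, so this is an upper bound
peak_ops_per_second = alus * CLOCK_HZ     # about 1.34e14
```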

slide-37
SLIDE 37

GPU cores can communicate through slow “global” RAM. ≤3 bits per ALU per cycle.

slide-38
SLIDE 38

Cluster nodes can communicate through a slow network. ≤0.003 bits per ALU per cycle.

slide-39
SLIDE 39

Algorithm-analysis students are taught to count algorithm “operations.” RAM access: 1 operation.

Resulting algorithms are poorly optimized for the real world. Gap grows with cluster size.

slide-40
SLIDE 40

Much better model developed 30 years ago: Computation is carried out on a 2-dimensional circuit. Measure circuit area, time.

e.g. 1981 Brent–Kung: multiply $n$-bit integers in time $n^{0.5+o(1)}$ using circuit area $n^{1+o(1)}$.

Scalability in this model is fairly close to scalability of real-world computations.
slide-41
SLIDE 41

Many other “buildable” models. Time to sort $n$ small integers on machine of size $n^{1+o(1)}$:
$n^{2.0+o(1)}$: 1-tape Turing machine.
$n^{1.5+o(1)}$: 2-dimensional RAM.
$n^{1.0+o(1)}$: pipelined RAM.
$n^{0.5+o(1)}$: 2-dimensional circuit.

Why does anyone say that sorting time is $n^{1+o(1)}$? Why choose third machine? Silly! Once $n$ is large enough, fourth machine is better.

slide-42
SLIDE 42

Let’s see what this means for genus-2 point-counting.

Machine cost: $(\lg q)^{5+o(1)}$. $(\lg q)^{5+o(1)}$ ALUs.

slide-43
SLIDE 43

Multiplying two univariate polynomials of degree $(\lg q)^{2+o(1)}$: $(\lg q)^{3+o(1)}$ ALUs; time $(\lg q)^{1.5+o(1)}$.

slide-44
SLIDE 44

$(\lg q)^{4+o(1)}$ resultants: $(\lg q)^{5+o(1)}$ ALUs; time $(\lg q)^{3.5+o(1)}$.

slide-45
SLIDE 45

Multiplying mod $x_1$ equation: $(\lg q)^{5+o(1)}$ ALUs; time $(\lg q)^{2.5+o(1)}$.

slide-46
SLIDE 46

Computing $\varphi(P)$ etc.: $(\lg q)^{5+o(1)}$ ALUs; time $(\lg q)^{3.5+o(1)}$.

Total time $(\lg q)^{4.5+o(1)}$.

slide-47
SLIDE 47

In oversimplified RAM model, $\lg q$ exponent was dominated solely by the resultants. No longer true here.

slide-48
SLIDE 48

Most important computation: big multiplication on GPUs, and then on network of GPUs.

First steps: 2009 Emeliyanenko, “Efficient multiplication of polynomials on graphics hardware.” Uses algorithm ideas developed for FFT on tape, ≈ computation on disk, etc.

Recent tool development: 2010 Bernstein–Chen–Cheng–Lange–Niederhagen–Schwabe–Yang, “Usable assembly language for GPUs: a success story.”