SLIDE 1 Counting points as a video game
University of Illinois at Chicago

Want efficient computation of secure, twist-secure genus-2 curves with very small coefficients, for the fastest known Diffie–Hellman. Can't do that with CM. This talk focuses on algorithms; it does not report any computations. Need results today? Ask Gaudry. But first an advertisement ...
SLIDE 2
1985 H. Lange–Ruppert, "Complete systems of addition laws on abelian varieties": A(k) has a complete system of addition laws of degree ≤ (3, 3). Symmetry ⇒ degree ≤ (2, 2). "The proof is nonconstructive ... To determine explicitly a complete system of addition laws requires tedious computations already in the easiest case of an elliptic curve in Weierstrass normal form."
SLIDE 3
1985 Lange–Ruppert: Explicit complete system for short Weierstrass curves. Reduce the formulas to 53 monomials by introducing extra variables x_i y_j + x_j y_i and x_i y_j - x_j y_i. I won't copy the formulas here. 1987 Lange–Ruppert, "Addition laws on elliptic curves in arbitrary characteristics": Explicit complete system for long Weierstrass curves.
SLIDE 4
SLIDE 5
1995 Bosma–Lenstra: Explicit complete system for long Weierstrass curves: explicit polynomials X3, Y3, Z3, X3', Y3', Z3' ∈ Z[a1, a2, a3, a4, a6, X1, Y1, Z1, X2, Y2, Z2].
SLIDE 6
(Same slide again, adding:) My previous slide in this talk: the Bosma–Lenstra Y3', Z3'. Not human-comprehensible.
SLIDE 7
(Same slide again, adding:) Actually, the slide shows Publish(Y3'), Publish(Z3'), where Publish introduces typos.
SLIDE 8
What this means: For all fields k, all P^2 Weierstrass curves E/k : Y^2 Z + a1 XYZ + a3 YZ^2 = X^3 + a2 X^2 Z + a4 XZ^2 + a6 Z^3, all P1 = (X1 : Y1 : Z1) ∈ E(k), all P2 = (X2 : Y2 : Z2) ∈ E(k):
(X3 : Y3 : Z3) is P1 + P2 or (0 : 0 : 0);
(X3' : Y3' : Z3') is P1 + P2 or (0 : 0 : 0);
at most one of these is (0 : 0 : 0).
SLIDE 9
2009.11 Bernstein–T. Lange, eprint.iacr.org/2009/580: For all fields k with 2 ≠ 0, all P1 × P1 Edwards curves E/k : X^2 T^2 + Y^2 Z^2 = Z^2 T^2 + d X^2 Y^2, all P1, P2 ∈ E(k) with P1 = ((X1 : Z1), (Y1 : T1)), P2 = ((X2 : Z2), (Y2 : T2)):
(X3 : Z3) is x(P1 + P2) or (0 : 0);
(X3' : Z3') is x(P1 + P2) or (0 : 0);
(Y3 : T3) is y(P1 + P2) or (0 : 0);
(Y3' : T3') is y(P1 + P2) or (0 : 0);
at most one of these is (0 : 0).
SLIDE 10
X3 = X1 Y2 Z2 T1 + X2 Y1 Z1 T2,
Z3 = Z1 Z2 T1 T2 + d X1 X2 Y1 Y2,
Y3 = Y1 Y2 Z1 Z2 - X1 X2 T1 T2,
T3 = Z1 Z2 T1 T2 - d X1 X2 Y1 Y2,
X3' = X1 Y1 Z2 T2 + X2 Y2 Z1 T1,
Z3' = X1 X2 T1 T2 + Y1 Y2 Z1 Z2,
Y3' = X1 Y1 Z2 T2 - X2 Y2 Z1 T1,
T3' = X1 Y2 Z2 T1 - X2 Y1 Z1 T2.
Much, much, much simpler than Lange–Ruppert and Bosma–Lenstra. Also much easier to prove. Also useful for computations. Geometrically, all elliptic curves. (Handle 2 = 0 separately.)
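These formulas are concrete enough to test mechanically. Below is a small independent sanity check (mine, not part of the talk): over F_13 with the non-square d = 2, both laws are applied to every pair of affine points; for each coordinate at least one law yields a genuine fraction, and whenever both do, they agree projectively.

```python
p, d = 13, 2
assert all(pow(a, 2, p) != d for a in range(p))   # d is a non-square mod p

def law(P, Q):
    # primary law; a point ((X:Z),(Y:T)) is stored as (X, Z, Y, T)
    (X1, Z1, Y1, T1), (X2, Z2, Y2, T2) = P, Q
    return ((X1*Y2*Z2*T1 + X2*Y1*Z1*T2) % p,
            (Z1*Z2*T1*T2 + d*X1*X2*Y1*Y2) % p,
            (Y1*Y2*Z1*Z2 - X1*X2*T1*T2) % p,
            (Z1*Z2*T1*T2 - d*X1*X2*Y1*Y2) % p)

def dual(P, Q):
    # dual law
    (X1, Z1, Y1, T1), (X2, Z2, Y2, T2) = P, Q
    return ((X1*Y1*Z2*T2 + X2*Y2*Z1*T1) % p,
            (X1*X2*T1*T2 + Y1*Y2*Z1*Z2) % p,
            (X1*Y1*Z2*T2 - X2*Y2*Z1*T1) % p,
            (X1*Y2*Z2*T1 - X2*Y1*Z1*T2) % p)

# affine points of x^2 + y^2 = 1 + d x^2 y^2, embedded as ((x:1),(y:1))
pts = [(x, 1, y, 1) for x in range(p) for y in range(p)
       if (x*x + y*y - 1 - d*x*x*y*y) % p == 0]
assert pts

for P in pts:
    for Q in pts:
        A, B = law(P, Q), dual(P, Q)
        # for each coordinate, at least one law gives a genuine fraction
        assert A[0:2] != (0, 0) or B[0:2] != (0, 0)
        assert A[2:4] != (0, 0) or B[2:4] != (0, 0)
        # and whenever both are genuine, they agree projectively
        if A[0:2] != (0, 0) and B[0:2] != (0, 0):
            assert (A[0]*B[1] - B[0]*A[1]) % p == 0
        if A[2:4] != (0, 0) and B[2:4] != (0, 0):
            assert (A[2]*B[3] - B[2]*A[3]) % p == 0
```

Note how doubling a point sends the dual law's y-fraction to (0 : 0), while the primary law stays defined: the two laws cover each other's failures.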
SLIDE 11
SLIDE 12
SLIDE 13
SLIDE 14
History of these addition laws: 1761 Euler, 1866 Gauss: beautiful addition law for x^2 + y^2 = 1 - x^2 y^2, the "lemniscatic elliptic curve":
(x1, y1) + (x2, y2) = (x3, y3) with
x3 = (x1 y2 + x2 y1)/(1 - x1 x2 y1 y2),
y3 = (y1 y2 - x1 x2)/(1 + x1 x2 y1 y2).
1986 Chudnovsky–Chudnovsky factorization-speed study begins with Ga, Gm, T2, lemniscate; but focuses on curve families.
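As a quick independent check (not in the talk), the Euler/Gauss law really is a closed, associative operation on the lemniscatic curve's points over a small field where -1 is a non-square:

```python
p = 7   # -1 is a non-square mod 7, so the denominators below never vanish

def add(P, Q):
    # Euler/Gauss addition on x^2 + y^2 = 1 - x^2 y^2 over F_p
    (x1, y1), (x2, y2) = P, Q
    x3 = (x1*y2 + x2*y1) * pow(1 - x1*x2*y1*y2, -1, p) % p
    y3 = (y1*y2 - x1*x2) * pow(1 + x1*x2*y1*y2, -1, p) % p
    return (x3, y3)

on_curve = lambda x, y: (x*x + y*y - 1 + x*x*y*y) % p == 0
pts = [(x, y) for x in range(p) for y in range(p) if on_curve(x, y)]
assert len(pts) == 8                      # a group of order 8

for P in pts:
    for Q in pts:
        assert on_curve(*add(P, Q))       # closure
        for R in pts:
            assert add(add(P, Q), R) == add(P, add(Q, R))   # associativity
```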
SLIDE 15
2007 Edwards: Obtain all elliptic curves over Q by generalizing to the curve x^2 + y^2 = 1 + d x^2 y^2:
(x1, y1) + (x2, y2) = (x3, y3) with
x3 = (x1 y2 + x2 y1)/(1 + d x1 x2 y1 y2),
y3 = (y1 y2 - x1 x2)/(1 - d x1 x2 y1 y2).
Edwards actually used d = c^4, with the scaling x^2 + y^2 = c^2(1 + x^2 y^2). But x^2 + y^2 = 1 + d x^2 y^2 lowers the degree of j; includes the lemniscate; simplifies the degeneration to the clock.
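The "degeneration to clock" remark can be made concrete (my illustration, not from the talk): setting d = 0 turns the curve into the circle x^2 + y^2 = 1 and the addition law into the angle-addition formulas.

```python
import math

# With d = 0 the Edwards denominators 1 +- d x1 x2 y1 y2 become 1,
# and the sum of the points at angles a and b is the point at angle a + b.
def clock_add(P, Q):
    (x1, y1), (x2, y2) = P, Q
    return (x1*y2 + x2*y1, y1*y2 - x1*x2)

for a, b in [(0.3, 1.1), (2.0, -0.7), (5.0, 4.2)]:
    P = (math.sin(a), math.cos(a))
    Q = (math.sin(b), math.cos(b))
    x3, y3 = clock_add(P, Q)
    assert abs(x3 - math.sin(a + b)) < 1e-12
    assert abs(y3 - math.cos(a + b)) < 1e-12
```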
SLIDE 16
Embed E into P1 × P1, as recommended by Edwards. The extra points (±1/√d, ∞), (∞, ±1/√d) ∈ E(k(√d)). Edwards commented that the addition law works for
(x1, y1) + (1/√d, ∞) = (1/(y1√d), -1/(x1√d)).
Can easily use this to obtain a dual addition law:
x3 = (x1 y1 + x2 y2)/(x1 x2 + y1 y2),
y3 = (x1 y1 - x2 y2)/(x1 y2 - x2 y1).
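That the dual law computes the same sum as the original law can be double-checked symbolically (my verification, not part of the talk): the cross-multiplied difference between the two x3 formulas is an explicit combination of the two curve equations, hence vanishes on the curve.

```python
from sympy import symbols, expand

x1, y1, x2, y2, d = symbols('x1 y1 x2 y2 d')
c1 = x1**2 + y1**2 - 1 - d*x1**2*y1**2   # vanishes when (x1, y1) is on the curve
c2 = x2**2 + y2**2 - 1 - d*x2**2*y2**2   # vanishes when (x2, y2) is on the curve

# cross-multiplied difference of the two x3 formulas:
# (x1 y1 + x2 y2)/(x1 x2 + y1 y2)  vs  (x1 y2 + x2 y1)/(1 + d x1 x2 y1 y2)
S = (x1*y1 + x2*y2) * (1 + d*x1*x2*y1*y2)
T = (x1*y2 + x2*y1) * (x1*x2 + y1*y2)

# the difference is exactly -x1*y1*c2 - x2*y2*c1
assert expand(S - T + x1*y1*c2 + x2*y2*c1) == 0
```

The analogous identity for y3 can be checked the same way.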
SLIDE 17
Here's how:
(x1, y1) + (x2, y2)
= [(x1, y1) + (1/√d, ∞)] + [(x2, y2) - (1/√d, ∞)]

SLIDE 18 (continuing the chain)
= (1/(y1√d), -1/(x1√d)) + (x2, y2) - (1/√d, ∞)

SLIDE 19 (continuing)
= ( (y2/(y1√d) - x2/(x1√d)) / (1 - d x2 y2/(d x1 y1)),
    (-y2/(x1√d) - x2/(y1√d)) / (1 + d x2 y2/(d x1 y1)) ) - (1/√d, ∞)

SLIDE 20 (continuing)
= ( (x1 y2 - x2 y1) / ((x1 y1 - x2 y2)√d),
    -(y1 y2 + x1 x2) / ((x1 y1 + x2 y2)√d) ) - (1/√d, ∞)

SLIDE 21 (conclusion)
= ( (x1 y1 + x2 y2)/(x1 x2 + y1 y2), (x1 y1 - x2 y2)/(x1 y2 - x2 y1) ).
SLIDE 22
2007 Bernstein–Lange: Edwards addition law gives speed records for ECM, ECC, etc. 2008 Hisil–Wong–Carter–Dawson: First publication of dual addition law; new speed records. (Completely different derivation.) 2009.11 Bernstein–Lange: Addition law and dual form a complete system. Elementary, computational proof, giving elementary, computational definition of the group ❊(❦) using these formulas.
SLIDE 23
1987 Lenstra, "Elliptic curves and number-theoretic algorithms": Use the Lange–Ruppert complete system of addition laws to give a computational definition of the Weierstrass group E(R) for more general rings R. Define
P^2(R) = {(X : Y : Z) : X, Y, Z ∈ R; XR + YR + ZR = R},
where (X : Y : Z) is the module {(λX, λY, λZ) : λ ∈ R}. Define
E(R) = {(X : Y : Z) ∈ P^2(R) : Y^2 Z = X^3 + a4 XZ^2 + a6 Z^3}.
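For R = Z/n the primitivity condition XR + YR + ZR = R is just a gcd condition. A tiny illustration (mine, with n = 15 as an arbitrary example modulus):

```python
from math import gcd

# For R = Z/n, the ideal XR + YR + ZR equals R exactly when
# gcd(X, Y, Z, n) = 1; such triples (X : Y : Z) are the points of P^2(Z/n).
n = 15

def in_P2(X, Y, Z):
    return gcd(gcd(gcd(X, Y), Z), n) == 1

assert in_P2(3, 5, 1)        # gcd(3, 5, 1, 15) = 1: a valid point
assert in_P2(3, 5, 0)        # zero coordinates are fine if the gcd is 1
assert not in_P2(3, 0, 6)    # gcd 3 divides 15: not primitive
assert not in_P2(5, 10, 0)   # gcd 5 divides 15: not primitive
```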
SLIDE 24
To define (and compute) the sum (X1 : Y1 : Z1) + (X2 : Y2 : Z2): Consider (and compute) the Lange–Ruppert (X3 : Y3 : Z3), (X3' : Y3' : Z3'), (X3'' : Y3'' : Z3''). Add these R-modules:
{(λX3, λY3, λZ3) + (λ'X3', λ'Y3', λ'Z3') + (λ''X3'', λ''Y3'', λ''Z3'') : λ, λ', λ'' ∈ R}.
Allow any ring R having trivial class group. Then this sum of modules can be expressed as (X : Y : Z).
SLIDE 25
Counting points: Schoof etc. Input: a prime ℓ; an abelian variety A/Fq, usually Jac(genus-g curve). Write down a generic point P ∈ A with ℓP = 0. Specifically: express ℓP = 0 as a system of equations; extend Fq to the ring R = Fq[coords]/(equations); note that ℓP = 0 in A(R). Genus 1: #R ≈ q^(ℓ^2). Genus 2: #R ≈ q^(ℓ^4). Much larger computations.
SLIDE 26
True often enough to be useful: Genus 1: unique linear equation φ^2(P) - s1 φ(P) + qP = 0 for the qth-power Frobenius φ : A(R) → A(R), with s1 ∈ {0, 1, ..., ℓ-1}. Then 1 - s1 + q - #A(Fq) ∈ ℓZ. Genus 2: unique linear equation φ^4(P) - s1 φ^3(P) + s2 φ^2(P) - q s1 φ(P) + q^2 P = 0 with s1, s2 ∈ {0, 1, ..., ℓ-1}. Then 1 - s1 + s2 - q s1 + q^2 - #A(Fq) ∈ ℓZ. Try many ℓ; deduce #A(Fq). Silly name: "ℓ-adic method."
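Here is a toy genus-1 version (mine, not from the talk) of the "try many ℓ; deduce" step: brute-force counts modulo small primes stand in for the per-ℓ computation, and CRT plus the Hasse bound |#E - (q+1)| ≤ 2√q recover #E(Fq) exactly. The curve y^2 = x^3 + x + 1 and q = 1009 are arbitrary choices.

```python
from math import isqrt

q = 1009                        # arbitrary small prime field

def legendre(a):
    a %= q
    if a == 0:
        return 0
    return 1 if pow(a, (q - 1)//2, q) == 1 else -1

# true order of E: y^2 = x^3 + x + 1 over F_q, by brute force
# (1 point at infinity; each x contributes 1 + legendre(f(x)) points)
N = 1 + sum(1 + legendre(x**3 + x + 1) for x in range(q))

# residues of #E modulo small primes l: this is what the per-l step delivers
residues = {l: N % l for l in (2, 3, 5, 7)}

# 2*3*5*7 = 210 exceeds the Hasse-interval width ~ 4*sqrt(q) ~ 127, so
# scanning the interval for the matching residue class recovers #E uniquely
lo = q + 1 - 2*isqrt(q) - 1
hi = q + 1 + 2*isqrt(q) + 1
candidates = [m for m in range(lo, hi + 1)
              if all(m % l == r for l, r in residues.items())]
assert candidates == [N]
print("#E(F_%d) = %d" % (q, N))
```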
SLIDE 27
Which coordinates to choose for A = Jac(C) when C has genus 2? 2000 Gaudry–Harley, 2004 Gaudry–Schost, 2009 Gaudry–Schost: Use Mumford coordinates for A, and write P = P1 - P2 with Pi = (xi, yi) ∈ C → A. R = Fq[x1, y1, x2, y2]/( (x1, y1) ∈ C; (x2, y2) ∈ C; ℓ(x1, y1) = ℓ(x2, y2) ).
SLIDE 28
ℓ(x1, y1) = ℓ(x2, y2) gives two equations in x1, x2 of degree ℓ^(2+o(1)). Eliminate x2, obtaining an equation in x1. Elimination time (ℓ^6 lg q)^(1+o(1)) using fast-arithmetic techniques. The equation in x1 has degree ℓ^(4+o(1)). Computing φ(P) etc.: time (ℓ^4 (lg q)^2)^(1+o(1)). Total time (lg q)^(8+o(1)) to handle all ℓ ≤ (lg q)^(1+o(1)). 2004 Gaudry–Schost: symmetrize; constant-factor speedup.
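The elimination step can be sketched with a toy example (mine; the actual degree-ℓ^2 equations are far larger): sympy's resultant eliminates x2 from two relations, and the degrees roughly multiply, mirroring the jump from degree-ℓ^2 inputs to a degree-ℓ^(4+o(1)) equation in x1.

```python
from sympy import symbols, resultant, Poly

x1, x2 = symbols('x1 x2')
f = x1**2 + x2**2 - 5   # toy stand-ins for the two equations in x1, x2
g = x1*x2 - 2

# eliminate x2: the resultant vanishes exactly at the x1-values where
# f and g have a common root in x2
r = Poly(resultant(f, g, x2), x1)
assert r.degree() == 4                     # degrees multiply: 2 * 2
assert r.eval(1) == 0 and r.eval(2) == 0   # from common roots (1, 2), (2, 1)
```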
SLIDE 29
2000 Gaudry–Harley et al. don't actually use A(R). They map R to a field, allegedly saving time.

SLIDE 30 (continuing)
But factorization is slow! The latest factorization algorithm, 2008 Kedlaya–Umans, takes time (lg q)^(7+o(1)) to factor the x1 equation. Summed over ℓ: (lg q)^(8+o(1)). Closer analysis of the o(1) shows that factorization still loses time here, except for "free" factors.
SLIDE 31
Can save time in genus 1 by building a smaller R that defines a φ-stable subgroup of the ℓ-torsion. (1991 Elkies; 1992 Atkin.) Fastest such techniques reported for genus 2: time ℓ^(12+o(1)). Use for ℓ ≤ (lg q)^(2/3+o(1)): asymptotic speedup 1 + o(1). Also "kangaroos"/"cockroaches": asymptotic speedup 1 + o(1). Also #A mod 2^2 etc.: asymptotic speedup 1 + o(1).
SLIDE 32
Video games. Millions of people buy PCs to "play video games": i.e., to participate in applied physics simulations. Society adjusts the ultracomputer to improve these simulations. Algorithm designers obtain much better results by paying attention to the ultracomputer architecture! Most important fact: #ALUs ∈ Θ(#bits of RAM).
SLIDE 33
My university has just spent <$20000 on a cluster for me. 8 CPUs; 15 GTX 295 cards. Each GTX 295 has 2 "GPUs."

SLIDE 34 (continuing)
Each GPU has 30 cores.

SLIDE 35 (continuing)
Each core has 8 ALUs and <100KB of fast RAM. Each ALU performs a 32-bit operation every cycle @ 1.242GHz.

SLIDE 36 (continuing)
I also have accounts on several "TeraGrid" clusters. Right now I'm using 448 GPUs; 13440 cores; 107520 32-bit ALUs.
SLIDE 37
GPU cores can communicate through slow "global" RAM: ≤ 3 bits per ALU per cycle.

SLIDE 38 (continuing)
Cluster nodes can communicate through a slow network: ≤ 0.003 bits per ALU per cycle.

SLIDE 39 (continuing)
Algorithm-analysis students are taught to count algorithm "operations," with a RAM access costing 1 operation. The resulting algorithms are poorly optimized for the real world. The gap grows with cluster size.
SLIDE 40
A much better model was developed 30 years ago: computation is carried out on a 2-dimensional circuit. Measure circuit area and time. E.g., 1981 Brent–Kung: multiply n-bit integers in time n^(0.5+o(1)) using circuit area n^(1+o(1)). Scalability in this model is fairly close to the scalability of real-world computations.
SLIDE 41
Many other "buildable" models. Time to sort n small integers on a machine of size n^(1+o(1)):
n^(2.0+o(1)): 1-tape Turing machine.
n^(1.5+o(1)): 2-dimensional RAM.
n^(1.0+o(1)): pipelined RAM.
n^(0.5+o(1)): 2-dimensional circuit.
Why does anyone say that sorting time is n^(1+o(1))? Why choose the third machine? Silly! Once n is large enough, the fourth machine is better.
SLIDE 42
Let's see what this means for genus-2 point counting. Machine cost: (lg q)^(5+o(1)); that is, (lg q)^(5+o(1)) ALUs.

SLIDE 43 (continuing)
Multiplying two univariate polynomials of degree (lg q)^(2+o(1)): (lg q)^(3+o(1)) ALUs; time (lg q)^(1.5+o(1)).

SLIDE 44 (continuing)
(lg q)^(4+o(1)) resultants: (lg q)^(5+o(1)) ALUs; time (lg q)^(3.5+o(1)).

SLIDE 45
Multiplying mod the x1 equation: (lg q)^(5+o(1)) ALUs; time (lg q)^(2.5+o(1)).

SLIDE 46 (continuing)
Computing φ(P) etc.: (lg q)^(5+o(1)) ALUs; time (lg q)^(3.5+o(1)). Total time (lg q)^(4.5+o(1)).

SLIDE 47 (continuing)
In the oversimplified RAM model, the lg q exponent was dominated solely by the resultants. No longer true here.
SLIDE 48
Most important computation: big multiplication on GPUs, and then on a network of GPUs. First steps: 2009 Emeliyanenko, "Efficient multiplication of polynomials on graphics hardware," uses algorithm ideas developed for FFT on tape, ≈ computation on disk, etc. Recent tool development: 2010 Bernstein–Chen–Cheng–Lange–Niederhagen–Schwabe–Yang, "Usable assembly language for GPUs: a success story."