SLIDE 1 Efficient arithmetic
in large characteristic
University of Illinois at Chicago
SLIDE 2 Fix a field and an elliptic curve.
e.g. NIST P-224: the elliptic curve
$y^2 = x^3 - 3x + a_6$ over $\mathbf{Z}/p$.
Here
$p = 2^{224} - 2^{96} + 1$
and
$a_6 = 18958286285566608000408668544493926415504680968679321075787234672564$.
e.g. NIST P-256: the elliptic curve
$y^2 = x^3 - 3x + \cdots$ over $\mathbf{Z}/p$, where
$p = 2^{256} - 2^{224} + 2^{192} + 2^{96} - 1$.
e.g. Curve25519: the elliptic curve
$y^2 = x^3 + 486662x^2 + x$ over $\mathbf{Z}/p$
where
$p = 2^{255} - 19$.
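As a quick sanity check on the three field sizes, the following Python sketch writes them out with explicit exponents and runs a base-2 Fermat test on each. The dictionary name `PRIMES` is purely illustrative, and a Fermat test is only a necessary condition for primality, not a proof.

```python
# Field sizes for the three example curves, with explicit exponents.
PRIMES = {
    "NIST P-224": 2**224 - 2**96 + 1,
    "NIST P-256": 2**256 - 2**224 + 2**192 + 2**96 - 1,
    "Curve25519": 2**255 - 19,
}

for name, p in PRIMES.items():
    # pow(2, p - 1, p) == 1 is necessary (not sufficient) for p to be prime.
    print(name, p.bit_length(), pow(2, p - 1, p) == 1)
```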
SLIDE 3
“Elliptic-curve scalar multiplication”: Given $(x, y)$ on curve,
and given integer $n \ge 0$,
compute $n$th multiple of $(x, y)$
in the elliptic-curve group.
This is the bottleneck in elliptic-curve Diffie-Hellman.
The big question: How quickly can we do this?
Many variations of problem: e.g. $m, n, P, Q \mapsto mP + nQ$,
critical for elliptic-curve signatures.
SLIDE 4 Review of addition chains
Typical recursive formulas:
$2P = P + P$. $3P = 2P + P$.
$4P = 2P + 2P$. $5P = 3P + 2P$.
$6P = 3P + 3P$. $7P = 5P + 2P$.
$2nP = 7P + (2n - 7)P$ if $4 \le n < 8$.
$(2n+1)P = 2nP + P$ if $4 \le n < 8$.
$(4n+1)P = 4nP + P$ if $4 \le n < 8$.
$(4n+3)P = 4nP + 3P$ if $4 \le n < 8$.
$2nP = nP + nP$ if $8 \le n$.
$(8n+1)P = 8nP + P$ if $4 \le n$.
$(8n+3)P = 8nP + 3P$ if $4 \le n$.
$(8n+5)P = 8nP + 5P$ if $4 \le n$.
$(8n+7)P = 8nP + 7P$ if $4 \le n$.
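The recursion can be sketched in Python over any group given by an `add` function. This is a minimal illustration, not the talk's software: the names `chain_multiple` and `multiple` are hypothetical, the even case for $8 \le 2n < 16$ is read as $7P + (2n-7)P$, and every doubling is counted as one addition.

```python
def chain_multiple(n, P, add):
    """Return (nP, number of group additions, doublings included) for n >= 1."""
    cache = {1: P}
    count = 0
    small = {2: (1, 1), 3: (2, 1), 4: (2, 2), 5: (3, 2), 6: (3, 3), 7: (5, 2)}

    def multiple(k):
        nonlocal count
        if k in cache:
            return cache[k]
        if k in small:
            a, b = small[k]
        elif k % 2 == 0:
            # 8..14 are built from 7P and an odd multiple; larger evens double.
            a, b = (7, k - 7) if k < 16 else (k // 2, k // 2)
        elif k < 16:
            a, b = k - 1, 1          # (2n+1)P = 2nP + P
        elif k < 32:
            a, b = k - k % 4, k % 4  # (4n+r)P = 4nP + rP, r in {1, 3}
        else:
            a, b = k - k % 8, k % 8  # (8n+r)P = 8nP + rP, r in {1, 3, 5, 7}
        result = add(multiple(a), multiple(b))
        count += 1
        cache[k] = result
        return result

    return multiple(n), count

# Integers under addition stand in for the elliptic-curve group.
print(chain_multiple(7, 1, lambda a, b: a + b))
```

Plugging in an elliptic-curve addition for `add` gives scalar multiplication; the integer stand-in only makes the chain easy to check.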
SLIDE 5 This addition chain (“length-3 sliding windows”) uses
$\approx \lg n$ doublings and $\approx 0.25 \lg n$ more additions
to compute $nP$ for average $n$.
e.g. $\approx 320$ additions for average $n \in \{0, 1, \ldots, 2^{256} - 1\}$.
Some easy improvements from fast negation on elliptic curves: $(16n - 7)P = 16nP - 7P$, etc.
Also use endomorphisms for “Koblitz curves,” “GLV curves.”
More complicated methods replace 0.25 by $\approx 1/\lg \lg n$.
SLIDE 6 Explicit doubling formulas
On curve $y^2 = x^3 - 3x + a_6$:
$2(x, y) = (x'', y'')$ where
$\lambda = (3x^2 - 3)/2y$,
$x'' = \lambda^2 - 2x$,
$y'' = \lambda(x - x'') - y$.
7 subs etc., 2 squarings, 1 more mult, 1 division.
How do we divide efficiently in a finite field?
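A minimal Python sketch of the affine doubling formulas, for illustration only. The toy parameters `p = 103` and `a6 = 5` are assumptions chosen so that a point can be found by brute force; they are far too small for cryptography. The division uses `pow(..., -1, p)`, which needs Python 3.8 or later.

```python
def double_affine(x, y, p):
    """Double the affine point (x, y) on y^2 = x^3 - 3x + a6 over Z/p (needs y != 0)."""
    lam = (3 * x * x - 3) * pow(2 * y, -1, p) % p  # the one division
    x2 = (lam * lam - 2 * x) % p
    y2 = (lam * (x - x2) - y) % p
    return x2, y2

# Toy curve (assumed parameters, for illustration only).
p, a6 = 103, 5
# Brute-force search for a curve point with y != 0.
x, y = next((x, y) for x in range(p) for y in range(1, p)
            if (y * y - (x**3 - 3 * x + a6)) % p == 0)
print((x, y), "->", double_affine(x, y, p))
```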
SLIDE 7 $f/g = f g^{p-2}$ in prime field $\mathbf{Z}/p$.
Can compute $g^{p-2}$ with $\approx \lg p$ squarings and $\approx (\lg p)/\lg \lg p$ more mults.
e.g. $p = 2^{224} - 2^{96} + 1$:
223 squarings, 11 more mults.
More generally, $f/g = f g^{q-2}$
in any field of size $q$.
There are faster division methods (e.g. “Euclid”; beware timing attacks!); smaller “I/M ratio.” Special methods for some fields.
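Division by exponentiation is a one-liner in Python. This sketch only shows the identity $f/g = f g^{p-2}$; the built-in `pow` chooses its own exponentiation strategy, so it does not reproduce the 223-squaring, 11-mult chain quoted above. The function name `divide` and the sample values are illustrative.

```python
p = 2**224 - 2**96 + 1  # the P-224 prime

def divide(f, g, p):
    """Return f/g in Z/p as f * g^(p-2); assumes p is prime and g is nonzero mod p."""
    return f * pow(g, p - 2, p) % p

f, g = 1234567, 7654321
q = divide(f, g, p)
print(q * g % p == f)
```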
SLIDE 8
Speedup: delay divisions
Division costs many mults even with fastest division methods. Save time by delaying divisions.
Naive division-delay method: Store field elements as fractions until end of computation. Divide once before output.
Mult fractions with 2 field mults.
Divide fractions with 2 field mults.
Add fractions with 3 field mults.
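The naive division-delay method can be sketched as arithmetic on (numerator, denominator) pairs. The function names and the choice of field are assumptions for illustration; the comments mark where the 2, 2, and 3 field mults come from.

```python
P = 2**255 - 19  # illustrative choice; any prime field works

def frac_mul(a, b):
    (an, ad), (bn, bd) = a, b
    return (an * bn % P, ad * bd % P)              # 2 field mults

def frac_div(a, b):
    (an, ad), (bn, bd) = a, b
    return (an * bd % P, ad * bn % P)              # 2 field mults

def frac_add(a, b):
    (an, ad), (bn, bd) = a, b
    return ((an * bd + ad * bn) % P, ad * bd % P)  # 3 field mults

def frac_value(a):
    n, d = a
    return n * pow(d, P - 2, P) % P                # the single final division

# (3/4 + 5/6) * (7/8) / (9/10), with only one division at the end.
r = frac_div(frac_mul(frac_add((3, 4), (5, 6)), (7, 8)), (9, 10))
print(frac_value(r))
```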
SLIDE 9 Speedup: unify denominators
For elliptic-curve doubling, have denominator $2y$ in $\lambda = (3x^2 - 3)/2y$;
denominator $(2y)^2$ in $x'' = \lambda^2 - 2x$;
denominator $(2y)^3$ in $y'' = \lambda(x - x'') - y$.
Subsequent computations will perform separate computations on the denominators $(2y)^2$, $(2y)^3$ of $x''$, $y''$.
Save time by manipulating denominators together.
SLIDE 10 “Jacobian coordinates”: Store $(x, y, z)$ to represent
elliptic-curve point $(x/z^2, y/z^3)$.
$2(x/z^2, y/z^3) = (x'', y'')$ where
$\lambda = (3(x/z^2)^2 - 3)/2(y/z^3) = \mu/2yz$ with $\mu = 3x^2 - 3z^4$;
$x'' = \lambda^2 - 2(x/z^2) = (\mu^2 - 8xy^2)/(2yz)^2$;
$y'' = \lambda((x/z^2) - x'') - (y/z^3) = (\mu(12xy^2 - \mu^2) - 8y^4)/(2yz)^3$.
SLIDE 11 $2(x/z^2, y/z^3) = (x_2/z_2^2, y_2/z_2^3)$
where
$z_2 = 2yz$, $\mu = 3x^2 - 3z^4$, $x_2 = \mu^2 - 8xy^2$, $y_2 = \mu(4xy^2 - x_2) - 8y^4$.
Easily compute with 6 squarings, 3 more mults:
$x^2$, $z^2$, $z^4$, $y^2$, $y^4$, $yz$, $xy^2$, $\mu^2$, $\mu(4xy^2 - x_2)$.
Also some subs, doublings, etc.
Use fast field arithmetic: e.g., can delay carries and reductions in computing $y_2$.
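A Python sketch of this Jacobian doubling, with the 6 squarings and 3 mults marked in comments. The function names are illustrative, multiplications by the small constants 2, 3, 4, 8 are treated as additions, and `to_affine` is a helper added here only so the result can be checked.

```python
def double_jacobian(x, y, z, p):
    """Double (x/z^2, y/z^3) on y^2 = x^3 - 3x + a6; returns (x2, y2, z2)."""
    xx = x * x % p                             # squaring 1: x^2
    zz = z * z % p                             # squaring 2: z^2
    zzzz = zz * zz % p                         # squaring 3: z^4
    yy = y * y % p                             # squaring 4: y^2
    yyyy = yy * yy % p                         # squaring 5: y^4
    yz = y * z % p                             # mult 1
    xyy = x * yy % p                           # mult 2
    mu = (3 * xx - 3 * zzzz) % p
    x2 = (mu * mu - 8 * xyy) % p               # squaring 6: mu^2
    y2 = (mu * (4 * xyy - x2) - 8 * yyyy) % p  # mult 3
    z2 = 2 * yz % p
    return x2, y2, z2

def to_affine(x, y, z, p):
    """Convert Jacobian (x, y, z) to affine (x/z^2, y/z^3) with one inversion."""
    zi = pow(z, p - 2, p)
    return x * zi * zi % p, y * zi**3 % p
```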
SLIDE 12 Speedup: difference of squares
Can compute $3x^2 - 3z^4$ as $3(x - z^2)(x + z^2)$.
Replace 3 squarings by 1 mult, 1 squaring.
Revised total: 4 squarings, 4 more mults.
Note: $3x^2 - 3z^4$ came from $3x^2 - 3$,
derivative of $x^3 - 3x + a_6$.
Wouldn’t have same speedup for, e.g., $x^3 - 5x + a_6$.
SLIDE 13 Speedup: $f^2$, $g^2$, $2fg$
After computing $f^2$ and $g^2$
can compute $2fg$ as $(f + g)^2 - f^2 - g^2$.
In particular: After computing $y^2$ and $z^2$
can compute $2yz$ as $(y + z)^2 - y^2 - z^2$.
Replace 1 mult with 1 squaring.
Revised total: 5 squarings, 3 more mults.
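Combining the difference-of-squares trick with the $(y+z)^2$ trick gives the 5-squaring, 3-mult doubling. The following Python function is a sketch under the same conventions as the formulas above (Jacobian input on $y^2 = x^3 - 3x + a_6$); the name `double_jacobian_fast` is hypothetical.

```python
def double_jacobian_fast(x, y, z, p):
    """Double (x/z^2, y/z^3) using 5 squarings and 3 mults; returns (x2, y2, z2)."""
    zz = z * z % p                             # squaring 1: z^2
    yy = y * y % p                             # squaring 2: y^2
    mu = 3 * (x - zz) * (x + zz) % p           # mult 1: 3x^2 - 3z^4 as a difference of squares
    z2 = ((y + z) ** 2 - yy - zz) % p          # squaring 3: 2yz = (y+z)^2 - y^2 - z^2
    xyy = x * yy % p                           # mult 2
    yyyy = yy * yy % p                         # squaring 4: y^4
    x2 = (mu * mu - 8 * xyy) % p               # squaring 5: mu^2
    y2 = (mu * (4 * xyy - x2) - 8 * yyyy) % p  # mult 3
    return x2, y2, z2
```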
SLIDE 14
Explicit addition formulas
Similar speedups in formulas for adding distinct points. 5 squarings, 11 more mults. Again some opportunities to delay carries, etc.
SLIDE 15
Speedup: cache results
In adding $(x_1/z_1^2, y_1/z_1^3)$ to $(x_2/z_2^2, y_2/z_2^3)$,
compute many intermediates, including $z_1^2$, $z_1^3$.
Often add same point again to a different point; can reuse $z_1^2$, $z_1^3$.
“Chudnovsky coordinates.”
SLIDE 16 Speedup: delay fewer divisions?
Faster divisions sometimes justify delaying fewer divisions.
e.g. Do we really need fractions for $P, 3P, 5P, 7P$?
Can convert $P, 3P, 5P, 7P$ out of Jacobian coordinates
with one division, several mults.
Then save mults in every addition of $P, 3P, 5P, 7P$.
“Mixed coordinates.” Sometimes worthwhile, depending on division speed.
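The talk does not say how the one division is shared among the four points. One common approach, assumed here, is simultaneous inversion (often called Montgomery's trick): invert the product of all the $z$-coordinates once, then peel off individual inverses with mults. The name `batch_inverse` and the sample values are illustrative.

```python
def batch_inverse(values, p):
    """Invert every element of `values` mod prime p using a single inversion."""
    prefix = [1]
    for v in values:
        prefix.append(prefix[-1] * v % p)  # running products
    inv = pow(prefix[-1], p - 2, p)        # the one division
    result = [0] * len(values)
    for i in range(len(values) - 1, -1, -1):
        result[i] = inv * prefix[i] % p    # inverse of values[i]
        inv = inv * values[i] % p          # now the inverse of the product of values[:i]
    return result

p = 2**255 - 19
zs = [3, 5, 7, 11]  # stand-ins for the z-coordinates of P, 3P, 5P, 7P
print([z * zi % p for z, zi in zip(zs, batch_inverse(zs, p))])
```

Each point then needs a few more mults to form $x/z^2$ and $y/z^3$ from the inverse of its $z$.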
SLIDE 17 Montgomery coordinates
On elliptic curves with “Montgomery form” $y^2 = x^3 + a_2 x^2 + x$,
preferably with small $(a_2 - 2)/4$:
$n(x_1, \ldots) = (x_n/z_n, \ldots)$ where
$z_1 = 1$;
$x_{2m} = (x_m^2 - z_m^2)^2$;
$z_{2m} = 4 x_m z_m (x_m^2 + a_2 x_m z_m + z_m^2)$;
$x_{2m+1} = 4(x_m x_{m+1} - z_m z_{m+1})^2$;
$z_{2m+1} = 4(x_m z_{m+1} - z_m x_{m+1})^2 x_1$.
Can also figure out $y$,
or use cryptographic protocols that ignore $y$.
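These recurrences can be turned into a ladder that tracks the pair (index $m$, index $m+1$) while reading the bits of $n$ from the top. The Python sketch below uses the Curve25519 parameters as an example and applies the formulas literally, without the operation-saving rearrangements a fast implementation would use. It branches on the bits of `n`, so it is not constant-time; the name `ladder` and the convention of returning 0 when $z_n = 0$ are assumptions.

```python
P = 2**255 - 19
A2 = 486662

def ladder(n, x1):
    """Return x_n / z_n, the x-coordinate of n * (x1, ...), for n >= 0."""
    xm, zm = 1, 0          # index m = 0: the point at infinity
    xm1, zm1 = x1 % P, 1   # index m + 1 = 1
    for bit in bin(n)[2:]:
        # Index 2m + 1 from indices m and m + 1 (their difference is x1).
        xa = 4 * (xm * xm1 - zm * zm1) ** 2 % P
        za = 4 * (xm * zm1 - zm * xm1) ** 2 * x1 % P
        # Double index m if the bit is 0, index m + 1 if the bit is 1.
        xd, zd = (xm, zm) if bit == "0" else (xm1, zm1)
        x2 = (xd * xd - zd * zd) ** 2 % P
        z2 = 4 * xd * zd * (xd * xd + A2 * xd * zd + zd * zd) % P
        if bit == "0":
            xm, zm, xm1, zm1 = x2, z2, xa, za  # now (2m, 2m + 1)
        else:
            xm, zm, xm1, zm1 = xa, za, x2, z2  # now (2m + 1, 2m + 2)
    return xm * pow(zm, P - 2, P) % P          # one division at the end

# Diffie-Hellman style check: both orders give the same shared x-coordinate.
a, b = 123456789, 987654321
print(ladder(a, ladder(b, 9)) == ladder(b, ladder(a, 9)))
```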
SLIDE 18
[Diagram: computing $x_{2m}$, $z_{2m}$, $x_{2m+1}$, $z_{2m+1}$ from $x_m$, $z_m$, $x_{m+1}$, $z_{m+1}$.]
SLIDE 19
Assuming $(a_2 - 2)/4$ small,
main operations are 4 squarings, 5 more mults for each bit of $n$.
Compare to Jacobian coordinates: each bit of $n$ has
5 squarings, 3 more mults, and on occasion 5 more squarings, 11 more mults.
Montgomery form is better if $n$ is not gigantic.
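A rough cost model makes the comparison concrete. This sketch assumes that "on occasion" means the 0.25 additions per bit of length-3 sliding windows, and it treats the relative costs `S` and `M` of a squaring and a mult as parameters (both 1 by default); the function names are illustrative.

```python
def montgomery_cost(bits, S=1.0, M=1.0):
    """Montgomery ladder: 4 squarings + 5 mults per bit."""
    return bits * (4 * S + 5 * M)

def jacobian_cost(bits, S=1.0, M=1.0, add_rate=0.25):
    """Jacobian: a doubling (5S + 3M) per bit, plus an addition (5S + 11M) on some bits."""
    return bits * ((5 * S + 3 * M) + add_rate * (5 * S + 11 * M))

print(montgomery_cost(256), jacobian_cost(256))
```

With `S == M` the two costs meet at `add_rate = 1/16`. If the addition rate behaves like $1/\lg \lg n$, that corresponds to $\lg n$ of roughly 65536, which is one way to read "gigantic."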
SLIDE 20 What are today’s speed records?
Let’s focus on Pentium M.
Each Pentium M cycle does $\le 1$ floating-point operation:
fp add or fp sub or fp mult.
Current scalar-multiplication software for
$y^2 = x^3 + 486662x^2 + x \pmod{2^{255} - 19}$:
640838 Pentium M cycles.
589825 fp ops; $\approx 0.92$ per cycle.
Understand cycle counts fairly well by simply counting fp ops.
SLIDE 21 Main loop: 545700 fp ops. 2140 times 255 iterations.
Reciprocal: 43821 fp ops.
$41148 = 254 \cdot 162$ for 254 squares;
$2673 = 11 \cdot 243$ for 11 more mults.
Additional work: 304 fp ops.
Inside one main-loop iteration:
$80 = 8 \cdot 10$ for 8 adds/subs;
55 for mult by 121665;
$648 = 4 \cdot 162$ for 4 squarings;
$1215 = 5 \cdot 243$ for 5 more mults;
142 for $b\,x[1] + (1 - b)\,x[0]$ etc.
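The tally can be re-added in a few lines of Python. All numbers are taken from the talk; only the variable names are invented.

```python
ADD, SQUARE, MULT = 10, 162, 243      # fp ops per field add/sub, squaring, mult

per_iteration = 8 * ADD + 55 + 4 * SQUARE + 5 * MULT + 142
main_loop = 255 * per_iteration
reciprocal = 254 * SQUARE + 11 * MULT
total = main_loop + reciprocal + 304  # 304 fp ops of additional work

# fp ops per Pentium M cycle, given 640838 cycles in total.
print(per_iteration, main_loop, reciprocal, total, round(total / 640838, 2))
```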
SLIDE 22 An integer mod $2^{255} - 19$ is
represented in radix $2^{25.5}$ as a sum of 10 fp numbers in specified ranges.
Add/sub: 10 fp adds/subs. Delay reductions and carries!
Mult: poly mult using $10^2$ fp mults, $9^2$ fp adds; reduce using 9 fp mults, 9 fp adds; carry 11 times, each 4 fp adds;
$2 \cdot 10^2 + 4 \cdot 10 + 3$ fp ops.
Squaring: first do 9 fp doublings; then eliminate $9^2 + 9$ fp ops;
$10^2 + 6 \cdot 10 + 2$ fp ops.
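The structure of the multiplication can be sketched with exact Python integers standing in for the fp numbers. This is an illustration under stated assumptions, not the talk's code: limb $i$ is taken to be a multiple of $2^{\lceil 25.5 i \rceil}$, the carry step is omitted, and the names `EXP`, `to_limbs`, and `mul` are hypothetical.

```python
P = 2**255 - 19
# Limb i is an integer multiple of 2^ceil(25.5 i): exponents 0, 26, 51, 77, ...
EXP = [-(-51 * i // 2) for i in range(11)]  # EXP[10] == 255

def to_limbs(n):
    """Split 0 <= n < 2^255 into 10 scaled limbs whose sum is n."""
    return [((n >> EXP[i]) & ((1 << (EXP[i + 1] - EXP[i])) - 1)) << EXP[i]
            for i in range(10)]

def mul(a, b):
    """Multiply two limb vectors mod 2^255 - 19 (carries omitted)."""
    c = [0] * 19
    for i in range(10):
        for j in range(10):
            c[i + j] += a[i] * b[j]  # 10^2 products summed into 19 outputs: 9^2 real adds
    # 2^255 = 19 mod p, so fold c[10..18] into c[0..8]: 9 mults, 9 adds.
    return [c[k] + 19 * (c[k + 10] >> 255) for k in range(9)] + [c[9]]

a, b = 2**254 + 123456789, 3**150
print(sum(mul(to_limbs(a), to_limbs(b))) % P == a * b % P)
```

Each product `a[i] * b[j]` with `i + j >= 10` is an exact multiple of $2^{255}$, which is why the shift in the reduction loses nothing.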
SLIDE 23
Course advertisement “High-speed cryptography” at the Fields Institute, 36 hours, starting 23 Oct, ending 17 Nov. What are the state-of-the-art cryptographic functions for sharing secrets, expanding keys, authenticating data, signing data? How fast are these functions in software for typical CPUs? What’s known about security? How were the functions chosen? cr.yp.to/highspeed.html