slide-1
SLIDE 1

New speed records for point multiplication

  • D. J. Bernstein

Thanks to: University of Illinois at Chicago; NSF CCR–9983950; Alfred P. Sloan Foundation

640838 Pentium M cycles to compute a 32-byte secret shared by Dan and Tanja, given Dan’s 32-byte secret key and Tanja’s 32-byte public key.

All known attacks: $2^{128}$ cycles.

This is the new speed record for high-security Diffie-Hellman.

Encrypt and authenticate messages using hash of shared secret as key. Diffie-Hellman is the bottleneck if total message length is short.

slide-3
SLIDE 3

640838 Pentium M (695) cycles to compute $x$-coordinate of $n$th multiple of $(x, \ldots)$ on Curve25519, given $x \in \{0, 1, \ldots, 2^{256} - 1\}$ and $n \in 2^{254} + 8\{0, 1, \ldots, 2^{251} - 1\}$.

Curve25519 is the elliptic curve $y^2 = x^3 + 486662x^2 + x$ mod the prime $2^{255} - 19$.

624786 Athlon (622) cycles; 832457 Pentium III (686) cycles; 957904 Pentium 4 (f12) cycles.

I anticipate similar cycle counts for UltraSPARC, PowerPC, etc.

slide-5
SLIDE 5

Immune to timing attacks, including cache-timing attacks, including hyperthreading attacks. No data-dependent branches; no data-dependent indexing.

Software is in public domain. 16 kilobytes when compiled. cr.yp.to/ecdh.html

No known patent problems.

For comparison, Brown et al.: much smaller prime, $2^{192} - 2^{64} - 1$; 780000 PII cycles; given; no timing-attack protection.

slide-7
SLIDE 7

Where are the cycles going?

Focus today on Pentium M. Fastest arithmetic on Pentium M uses floating-point operations: fp adds, fp subs, fp mults. Each Pentium M cycle does at most 1 fp op.

Point multiplication: 640838 cycles. 589825 fp ops; about 0.92 per cycle.

Understand cycle counts fairly well by simply counting fp ops.

slide-9
SLIDE 9

Avoiding all time variability to stop timing attacks:

  • 1. For $b \in \{0, 1\}$, compute $x[b]$ as $b\,x[1] + (1 - b)\,x[0]$ or similar. Avoids data-dependent indexing. Costs 36210 fp ops (6%).

  • 2. Compute final reciprocal by Fermat, not extended Euclid. Avoids data-dependent branching.

  • 3. Don’t branch for remainders. Allow non-least remainders. No cost: this saves time!
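The arithmetic selection in item 1 can be sketched as follows. This is only an illustration: the function name `select` and the list-of-limbs representation are assumptions, not taken from the talk's software, and Python itself gives no timing guarantees.

```python
# Hypothetical sketch of branch-free, index-free selection.
# `select`, `x0`, `x1` are illustrative names, not from the original software.

def select(b, x0, x1):
    """Return x1 if b == 1, else x0, for b in {0, 1}.

    Every limb of both inputs is read and combined as b*x1 + (1-b)*x0,
    so neither a branch nor an array index depends on b.
    """
    return [b * v1 + (1 - b) * v0 for v0, v1 in zip(x0, x1)]


x0 = [1.0, 2.0, 3.0]
x1 = [10.0, 20.0, 30.0]
print(select(0, x0, x1))
print(select(1, x0, x1))
```

The price is that both candidates are always touched, which is where the extra fp ops quoted in item 1 come from.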

slide-11
SLIDE 11

Main loop: 545700 fp ops (92.5%). 2140 times 255 iterations.

Reciprocal: 43821 fp ops (7.4%).

  • $41148 = 254 \cdot 162$ for 254 squarings;
  • $2673 = 11 \cdot 243$ for 11 more mults.

Additional work: 304 fp ops.

Inside one main-loop iteration:

  • $80 = 8 \cdot 10$ for 8 adds/subs;
  • 55 for mult by 121665;
  • $648 = 4 \cdot 162$ for 4 squarings;
  • $1215 = 5 \cdot 243$ for 5 more mults;
  • 142 for $b\,x[1] + (1 - b)\,x[0]$ etc.

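A Fermat reciprocal with exactly 254 squarings and 11 mults can be sketched as below. The addition chain shown is one well-known chain for the exponent $2^{255} - 21$; it is an assumption that the talk's software uses this particular chain, and the helper names are illustrative. The sketch uses Python integers rather than fp limbs, and the last lines redo the fp-op accounting with the per-operation costs quoted above.

```python
P = 2**255 - 19


def reciprocal(z):
    """Return (z^(P-2) mod P, squarings used, mults used)."""
    count = {"sq": 0, "mul": 0}

    def sq(a, times=1):
        for _ in range(times):
            a = a * a % P
            count["sq"] += 1
        return a

    def mul(a, b):
        count["mul"] += 1
        return a * b % P

    z2 = sq(z)                        # z^2
    z9 = mul(sq(z2, 2), z)            # z^9
    z11 = mul(z9, z2)                 # z^11
    t5 = mul(sq(z11), z9)             # z^(2^5 - 1)
    t10 = mul(sq(t5, 5), t5)          # z^(2^10 - 1)
    t20 = mul(sq(t10, 10), t10)       # z^(2^20 - 1)
    t40 = mul(sq(t20, 20), t20)       # z^(2^40 - 1)
    t50 = mul(sq(t40, 10), t10)       # z^(2^50 - 1)
    t100 = mul(sq(t50, 50), t50)      # z^(2^100 - 1)
    t200 = mul(sq(t100, 100), t100)   # z^(2^200 - 1)
    t250 = mul(sq(t200, 50), t50)     # z^(2^250 - 1)
    r = mul(sq(t250, 5), z11)         # z^(2^255 - 21) = z^(P - 2)
    return r, count["sq"], count["mul"]


# fp-op accounting with the costs quoted in the talk:
# 162 per squaring, 243 per mult, 2140 per main-loop iteration, 304 extra.
SQUARE_OPS, MULT_OPS = 162, 243
_, squarings, mults = reciprocal(2)
reciprocal_ops = squarings * SQUARE_OPS + mults * MULT_OPS
total_ops = 255 * 2140 + reciprocal_ops + 304
print(squarings, mults, reciprocal_ops, total_ops)
```

The control flow is the same for every input, which is the point of preferring Fermat over extended Euclid here.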
slide-13
SLIDE 13

An integer mod $2^{255} - 19$ is represented in radix $2^{25.5}$ as a sum of 10 fp numbers in specified ranges.

Add/sub: 10 fp adds/subs. Delay reductions and carries!

Mult: poly mult using $10^2$ fp mults, $9^2$ fp adds; reduce using 9 fp mults, 9 fp adds; carry 11 times, each 4 fp adds; overall $2 \cdot 10^2 + 4 \cdot 10 + 3$ fp ops.

Squaring: start with 9 fp doublings; then eliminate $9^2 + 9$ fp ops; overall $1 \cdot 10^2 + 6 \cdot 10 + 2$ fp ops.

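The radix-$2^{25.5}$ layout and the mult count can be sketched with Python integers standing in for the fp numbers. The assumptions here are loud ones: limb $i$ is taken to be a multiple of $2^{\lceil 25.5 i \rceil}$, the names `split` and `mul` are illustrative, and the carry step is left out, so this shows only the "poly mult" and "reduce" counts.

```python
P = 2**255 - 19

# Assumed limb exponents ceil(25.5 * i): 0, 26, 51, 77, ..., 230, then 255.
E = [(51 * i + 1) // 2 for i in range(11)]


def split(n):
    """Split 0 <= n < 2^255 into 10 limbs; limb i keeps bits E[i]..E[i+1]-1 in place."""
    return [((n >> E[i]) % (1 << (E[i + 1] - E[i]))) << E[i] for i in range(10)]


def mul(a, b):
    """Multiply two 10-limb values mod P; return (limbs, mults, adds), no carries."""
    mults = adds = 0
    c = [None] * 19
    for i in range(10):
        for j in range(10):
            t = a[i] * b[j]
            mults += 1
            if c[i + j] is None:
                c[i + j] = t
            else:
                c[i + j] += t
                adds += 1
    for k in range(10, 19):
        # 2^255 = 19 (mod P): scale the high coefficient by 19 * 2^-255.
        # The shift is exact because each high product is a multiple of 2^255.
        c[k - 10] += 19 * (c[k] >> 255)
        mults += 1
        adds += 1
    return c[:10], mults, adds


a = 2**254 + 12345
b = 2**200 + 67890
limbs, mults, adds = mul(split(a), split(b))
print(mults, adds, sum(limbs) % P == a * b % P)
```

The counts come out as $10^2 + 9$ mults and $9^2 + 9$ adds, matching the figures before carrying.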
slide-15
SLIDE 15

How was the prime chosen?

Use prime close to power of 2 to save time in field operations. Also reduces NFS exponent, so would need larger prime for traditional discrete-log systems; but doesn’t seem to affect ECDL.

Use prime not far below $2^{32k}$ to avoid wasting bandwidth.

Comfortable security, $k = 8$: $2^{253} + 39$, $2^{253} + 51$, $2^{254} + 79$, $2^{255} - 31$, $2^{255} - 19$, $2^{255} + 95$.

slide-17
SLIDE 17

Bender, Castagnoli, CRYPTO ’89: “$2^{127} + 24933$ is prime. … For this curve which is convenient in computer arithmetic we also give …”

I use the prime $2^{255} - 19$, convenient for the same reasons.

No trouble from “shift and add” patent 5159632 filed 1991.09.17.

slide-19
SLIDE 19

How was the curve chosen?

Use Montgomery shape $y^2 = x^3 + Ax^2 + x$ to save time in curve operations and to avoid square roots.

Choose $(A - 2)/4$ as small integer to save time in curve operations.

Montgomery’s recursion: $x_1 = x$; $z_1 = 1$;

$$
\begin{aligned}
x_{2n} &= (x_n^2 - z_n^2)^2; \\
z_{2n} &= 4 x_n z_n (x_n^2 + A x_n z_n + z_n^2); \\
x_{2n+1} &= 4 (x_n x_{n+1} - z_n z_{n+1})^2; \\
z_{2n+1} &= 4 (x_n z_{n+1} - z_n x_{n+1})^2 x;
\end{aligned}
$$

then $n(x, \ldots) = (x_n / z_n, \ldots)$.

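Montgomery's recursion translates almost line by line into a ladder. The sketch below assumes Python big integers instead of the fp representation, uses illustrative names (`double`, `diff_add`, `ladder`), and branches on the bits of $n$, so unlike the software described in the talk it is not protected against timing attacks.

```python
P = 2**255 - 19
A = 486662


def double(xn, zn):
    """(x_n, z_n) -> (x_2n, z_2n)."""
    x2 = (xn * xn - zn * zn) ** 2 % P
    z2 = 4 * xn * zn * (xn * xn + A * xn * zn + zn * zn) % P
    return x2, z2


def diff_add(xn, zn, xm, zm, x1):
    """(x_n, z_n), (x_{n+1}, z_{n+1}) -> (x_{2n+1}, z_{2n+1}), given x_1 = x, z_1 = 1."""
    x3 = 4 * (xn * xm - zn * zm) ** 2 % P
    z3 = 4 * (xn * zm - zn * xm) ** 2 * x1 % P
    return x3, z3


def ladder(n, x):
    """x-coordinate of the n-th multiple of (x, ...); infinity is output as 0."""
    x %= P
    xn, zn = 1, 0   # 0th multiple: the point at infinity
    xm, zm = x, 1   # 1st multiple
    for i in reversed(range(n.bit_length())):
        xs, zs = diff_add(xn, zn, xm, zm, x)       # index 2k+1
        if (n >> i) & 1:
            xn, zn, xm, zm = xs, zs, *double(xm, zm)   # (2k+1, 2k+2)
        else:
            xn, zn, xm, zm = *double(xn, zn), xs, zs   # (2k, 2k+1)
    return xn * pow(zn, P - 2, P) % P   # x_n / z_n, with 1/0 taken as 0


dan = 2**254 + 8 * 12345      # illustrative secret keys
tanja = 2**254 + 8 * 67890
shared_1 = ladder(dan, ladder(tanja, 9))
shared_2 = ladder(tanja, ladder(dan, 9))
print(shared_1 == shared_2)
```

Both parties arrive at the same $x$-coordinate without ever handling a $y$-coordinate or a square root.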
slide-21
SLIDE 21

[Diagram: data flow of one step of Montgomery’s recursion, from $x_n, z_n, x_{n+1}, z_{n+1}$ to $x_{2n}, z_{2n}, x_{2n+1}, z_{2n+1}$, through adds, subs, squarings, mults, and a mult by $(A - 2)/4$.]

slide-23
SLIDE 23

Reject $A$ unless curve and twist orders are $\{4 \cdot \text{prime}, 8 \cdot \text{prime}\}$.

Montgomery shape forces 4; characteristic in $4\mathbf{Z} + 1$ forces $\{4, 8\}$.

For $A = 486662$: Curve has order 8 times prime $p_1 = 2^{252} + \cdots$.

The twist has order 4 times prime $p_2 = 2^{253} - \cdots$.

slide-25
SLIDE 25

For $A = 358990$: One prime is $2^{252} - \cdots$, so user’s secret key $n \in 2^{254} + 8\{0, 1, \ldots, 2^{251} - 1\}$ could be 8 times that prime.

Extremely unlikely, but annoys implementors, so reject this $A$.

slide-27
SLIDE 27

Note on comparing curves and comparing coordinate systems: Count fp ops, not field ops! Otherwise you make bad choices.

Reality: mult by small constant is as expensive as several adds.

Reality: square-to-multiply ratio is 2/3 for this field, not 4/5.

Reality: $a^2 + b^2 + c^2$ is faster than $(a^2, b^2, c^2)$.

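The first two points can be checked against the fp-op costs quoted earlier in the talk (243 per mult, 162 per squaring, 10 per add, 55 per mult by the constant 121665). A short worked calculation:

$$
\frac{162}{243} = \frac{2}{3}, \qquad \frac{55}{10} = 5.5,
$$

so a squaring costs two thirds of a mult in this field, and the mult by 121665 costs as much as 5.5 adds.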
slide-29
SLIDE 29

How was the key range chosen?

Public key for secret key $n$ is $x$-coordinate of $n$th multiple of standard base point $(9, \ldots)$.

Base-point order is $p_1 \approx 2^{252}$, so uniform random $n/8$ in $2^{251} + \{0, 1, 2, \ldots, 2^{251} - 1\}$ produces almost exactly uniform random public key from among $2^{251}$ possibilities.

The addition of $2^{251}$ avoids $\infty$ and avoids timing attacks.
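Generating a secret key in this range is a matter of fixing a few bits. The sketch below shows two equivalent ways, assuming a 32-byte little-endian encoding of $n$; the function names and the byte order are assumptions made for illustration, not details given in the talk.

```python
import secrets


def random_secret_key():
    """Uniform random n in 2^254 + 8*{0, 1, ..., 2^251 - 1}."""
    m = secrets.randbelow(2**251)
    return 2**254 + 8 * m


def clamp(k: bytes) -> int:
    """Map any 32 bytes (assumed little-endian) into the same set by bit masking."""
    b = bytearray(k)
    b[0] &= 248    # clear the low 3 bits: n is a multiple of 8
    b[31] &= 127   # clear bit 255
    b[31] |= 64    # set bit 254
    return int.from_bytes(b, "little")


n = random_secret_key()
print(n % 8 == 0, 2**254 <= n < 2**255)
print(clamp(secrets.token_bytes(32)) % 8 == 0)
```

Because the top bit position is fixed, every key has the same length, so a ladder over the bits of $n$ always runs the same number of iterations.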

slide-31
SLIDE 31

Miller, CRYPTO ’85: “For the key exchange … only the $x$-coordinate needs to be transmitted. The formulas for multiples of a point cited in the first section make it clear that the $x$-coordinate of a multiple depends only on the $x$-coordinate of the original point.”

This is the compression method I use.

No trouble from “point compression” patent 6141420 filed 1994.07.29.

slide-33
SLIDE 33

Insert factor of 8 into $n$ in case $(x, \ldots)$ is not actually in this group of order $p_1$.

Three possibilities for $8(x, \ldots)$:

  • $\infty$, output as 0;
  • or a nontrivial point in the desired prime group;
  • or a nontrivial point in the twist prime group.

Don’t spend time “validating” $x$, i.e., checking it’s in desired group.

slide-35
SLIDE 35

Even if attacker were given same $n$ times point on twist, would still need to break hash-Diffie-Hellman for product of these two prime groups.

For uniform random exponent, provably requires breaking at least one of the prime groups. Curve and twist both seem secure.

No known way to exploit limited exponent range. Often used in Diffie-Hellman for multiplicative group.

slide-37
SLIDE 37

Bernstein, sci.crypt, 2001.11.09: “You can happily skip both the transmission and the square root. In fact, if both the curve and its twist have nearly prime order, then you can even skip square testing.”

I use a curve of this type.

No trouble from rumored new “public-key validation” patent filed 2003.

slide-39
SLIDE 39

How was the software built?

Common phenomenon: Write fp op sequence in C. Feed it to C compiler to produce machine language. Observe that cycles/fp ops is much larger than 1: sometimes 5 or more!

Have faith. Don’t accept 1.1.

Understand and eliminate non-fp-op cycles. (I have more work to do here for Athlon et al. Expect speedups.)

slide-41
SLIDE 41

Some important delays:

  • 3-cycle “load” latency, copying data from “cache” to “register” for arithmetic. Only 8 registers.
  • 3-cycle fp add latency.
  • 5-cycle fp mult latency.

An op waits if its inputs aren’t ready. CPU has some ability to reorder ops, but uses greedy algorithm; suboptimal.

slide-43
SLIDE 43

Can’t rely on C compiler to sensibly permute fp ops.

Sometimes a chain of three fp adds is a sequence of exact fp adds, best done in a different order.

But sometimes an fp add is a non-associative deliberately rounded fp add!

The C language has no way to express this distinction.
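A deliberately rounded fp add can be illustrated with the common trick of adding and then subtracting a large constant. This is a sketch under stated assumptions: it uses IEEE double precision (Python floats) with round-to-nearest, whereas the talk concerns Pentium M floating-point code, and the function name and constant are chosen here for illustration.

```python
def round_to_multiple(x, k):
    """Round x to the nearest multiple of 2^k, assuming |x| < 2^(51 + k).

    c = 1.5 * 2^(52 + k) pushes x + c into a range where doubles are spaced
    2^k apart, so the addition itself performs the rounding; subtracting c
    afterwards is exact.
    """
    c = 3.0 * 2.0 ** (51 + k)
    return (x + c) - c


x = 123456789012.0
c = 3.0 * 2.0 ** (51 + 26)

rounded = round_to_multiple(x, 26)   # (x + c) - c: a multiple of 2^26
reassociated = x + (c - c)           # "same" sum regrouped: just x again

print(rounded, reassociated, rounded == reassociated)
```

A compiler that regroups the sum as if addition were associative silently removes the rounding, which is why the order of such adds has to be under the programmer's control.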

slide-45
SLIDE 45

Curve25519 implementation is actually in qhasm, new programming language for high-speed computations.

Language allows declaration and propagation of fp ranges; guided register allocation; etc.

Lets me write desired code with much less human time than traditional asm and C compiler.

Have also used for fast AES, fast Poly1305, fast Salsa20, etc.

slide-47
SLIDE 47

What’s next?

Culmination of extensive work on eliminating field mults for genus-2 hyperelliptic curves: 25 mults per bit. Gaudry, eprint.iacr.org/2005/314

Half-size prime: e.g., $2^{127} - 1$.

Select curve to make some mults easier, like choosing $A$. Should count fp ops instead.

Prediction: this will beat genus 1.