slide-1
SLIDE 1

High-speed elliptic-curve cryptography

D. J. Bernstein

Thanks to: University of Illinois at Chicago; NSF CCR–9983950; Alfred P. Sloan Foundation

Define $p = 2^{255} - 19$; prime.

Define $A = 358990$.

Define Curve: $\mathbf{Z} \to \{0, 1, \ldots, p - 1\} \cup \{\infty\}$ by $n \mapsto x$ coordinate of $n$th multiple of $(2, \ldots)$ on the elliptic curve $y^2 = x^3 + Ax^2 + x$ over $\mathbf{F}_p$.

Main topic of this talk: Compute $n, \mathrm{Curve}(m) \mapsto \mathrm{Curve}(mn)$ in very few CPU cycles. In particular, use floating point for fast arithmetic mod $p$.

slide-3
SLIDE 3

Why cryptographers care

Each user has secret key $n$, public key $\mathrm{Curve}(n)$. Users with secret keys $m, n$ exchange $\mathrm{Curve}(m), \mathrm{Curve}(n)$ through an authenticated channel; compute $\mathrm{Curve}(mn)$; hash it; use hash as shared secret to encrypt and authenticate messages.

Curve speed is important when number of messages is small.
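The exchange can be sketched in a few lines. This is only an illustration under stated assumptions: modular exponentiation $2^n \bmod p$ stands in for Curve so that the sketch runs without any curve code, SHA-256 is an arbitrary choice of hash (the talk does not name one), and the function names are hypothetical.

```python
import hashlib
import secrets

P = 2**255 - 19  # the prime from the talk

def public_key(n):
    """Stand-in for Curve(n): here 2^n mod p (an assumption, not the curve)."""
    return pow(2, n, P)

def shared_secret(my_secret, their_public):
    """Stand-in for Curve(mn), followed by hashing."""
    shared = pow(their_public, my_secret, P)
    return hashlib.sha256(shared.to_bytes(32, "little")).digest()

m = secrets.randbelow(2**255)  # first user's secret key
n = secrets.randbelow(2**255)  # second user's secret key

# Each side combines its own secret with the other side's public key.
key_m = shared_secret(m, public_key(n))
key_n = shared_secret(n, public_key(m))
assert key_m == key_n
```

Both sides do one public-key computation and one shared-secret computation, which is why the cost of the Curve function dominates when only a few messages follow.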

slide-5
SLIDE 5

Analogous system using $2^n \bmod p$: 1976 Diffie Hellman.

Using elliptic curves to avoid index-calculus attacks: 1986 Miller, 1987 Koblitz.

Using $y^2 = x^3 + Ax^2 + x$ for speed: 1987 Montgomery (for ECM).

High precision from fp sums: 1968 Veltkamp, 1971 Dekker.

Speedups: 1999–2005 Bernstein.

slide-7
SLIDE 7

Understanding CPU design

Computers are designed for music, movies, Photoshop, Doom 3, etc. Heavy use of fp arithmetic, i.e., approximate real arithmetic.

Example: Athlon, every cycle, does one add and one multiply of high-precision fp numbers.

Programmer paying attention to these CPU features can use them for cryptography.

slide-9
SLIDE 9

A 53-bit fp number is a real number $2^e f$ with $e, f \in \mathbf{Z}$ and $|f| \le 2^{53}$.

Round each real number $z$ to closest 53-bit fp number, $\mathrm{fp}_{53}\,z$. Round halves to even.

Examples: $\mathrm{fp}_{53}(8675309) = 8675309$; $\mathrm{fp}_{53}(2^{127} + 8675309) = 2^{127}$; $\mathrm{fp}_{53}(2^{127} - 8675309) = 2^{127}$.
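The rounding examples can be checked directly, assuming Python floats are IEEE 754 doubles (true on common platforms), which carry 53-bit significands and round to nearest with ties to even. The helper name `fp53` is chosen here to mirror the slide; unlike the slide's definition, real doubles also limit the exponent, which does not matter for these values.

```python
def fp53(n):
    """Round an integer to the closest 53-bit fp number (a Python float)."""
    return float(n)

# The three examples from the slide.
assert fp53(8675309) == 8675309
assert fp53(2**127 + 8675309) == 2.0**127
assert fp53(2**127 - 8675309) == 2.0**127

# Halves round to even: 2^53 + 1 lies exactly between 2^53 and 2^53 + 2.
assert fp53(2**53 + 1) == 2.0**53
assert fp53(2**53 + 3) == 2.0**53 + 4
```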
slide-11
SLIDE 11

Typical CPU: UltraSPARC III.

Every cycle, UltraSPARC III can do one fp multiplication $f, g \mapsto \mathrm{fp}_{53}(fg)$ and one fp addition $f, g \mapsto \mathrm{fp}_{53}(f + g)$, subject to limits on the exponent $e$.

“4-cycle fp-operation latency”: Results available after 4 cycles.

Can substitute subtraction for addition. I’ll count subtractions as additions.

slide-13
SLIDE 13

Some variation among CPUs.

PowerPC RS64 IV: One addition or one multiplication or one “fused” $f, g, h \mapsto \mathrm{fp}_{53}(fg + h)$. Results available after 4 cycles.

Athlon: $\mathrm{fp}_{64}$ instead of $\mathrm{fp}_{53}$; one multiplication and one addition. Results available after 4 cycles.

I’ll focus on UltraSPARC III. Not the most important CPU, but it’s a good warmup.

slide-15
SLIDE 15

Exact dot products

If $a, b \in \{-2^{20}, \ldots, 0, 1, \ldots, 2^{20}\}$ then $ab$ is a 53-bit fp number so $ab = \mathrm{fp}_{53}(ab)$.

If $a, b, c, d \in \{-2^{20}, \ldots, 2^{20}\}$ then $ab$, $cd$, $ab + cd$ are 53-bit fp numbers so $ab = \mathrm{fp}_{53}(ab)$, $cd = \mathrm{fp}_{53}(cd)$, $ab + cd = \mathrm{fp}_{53}(ab + cd)$.

UltraSPARC III computes $a, b, c, d \mapsto ab + cd$ with two fp mults, one fp add.
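A quick experiment illustrates the exactness claim, assuming Python floats are IEEE 754 doubles. The function name `dot2` and the variable names are illustrative; the random sampling is only a spot check, not a proof.

```python
import random

BOUND = 2**20

def dot2(a, b, c, d):
    """Compute a*b + c*d with two fp multiplications and one fp addition."""
    return float(a) * float(b) + float(c) * float(d)

random.seed(1)
for _ in range(1000):
    a, b, c, d = (random.randint(-BOUND, BOUND) for _ in range(4))
    # Comparing a float with an int is exact in Python, so this checks
    # that no rounding error occurred.
    assert dot2(a, b, c, d) == a * b + c * d
```

The bound matters: each product is at most $2^{40}$ in size and the sum at most $2^{41}$, far below the $2^{53}$ limit where integers stop being exactly representable.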

slide-17
SLIDE 17

Bit extraction

Define $\alpha_i = 3 \cdot 2^{i+51}$, $\mathrm{top}_i f = \mathrm{fp}_{53}(\mathrm{fp}_{53}(f + \alpha_i) - \alpha_i)$, $\mathrm{bottom}_i f = \mathrm{fp}_{53}(f - \mathrm{top}_i f)$.

If $f$ is a 53-bit fp number and $|f| \le 2^{i+51}$ then $\mathrm{top}_i f \in 2^i \mathbf{Z}$; $|\mathrm{bottom}_i f| \le 2^{i-1}$; and $f = \mathrm{top}_i f + \mathrm{bottom}_i f$.
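The trick works because adding a constant of size $3 \cdot 2^{i+51}$ pushes the sum into a range where consecutive fp numbers are $2^i$ apart, so the rounding itself discards the low bits. A minimal sketch, assuming Python floats are IEEE 754 doubles evaluated without extended precision; the function names follow the slide, and the sample value is arbitrary.

```python
def alpha(i):
    return 3.0 * 2.0 ** (i + 51)

def top(i, f):
    a = alpha(i)
    return (f + a) - a      # two fp additions; the rounding happens in f + a

def bottom(i, f):
    return f - top(i, f)    # one more fp addition, exact

f = 123456789.0
t, b = top(10, f), bottom(10, f)
assert t % 2.0**10 == 0.0   # top is a multiple of 2^10
assert abs(b) <= 2.0**9     # bottom is at most 2^9 in size
assert t + b == f           # nothing is lost
```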
slide-19
SLIDE 19

Big integers as fp sums

Every integer mod $2^{255} - 19$ can be written as a sum

$u_0 + u_{22} + u_{43} + u_{64} + u_{85} + u_{107} + u_{128} + u_{149} + u_{170} + u_{192} + u_{213} + u_{234}$

where $u_i / 2^i \in \{-2^{22}, \ldots, 2^{22}\}$.

Indices $i$ are $\lceil 255j/12 \rceil$ for $j \in \{0, 1, \ldots, 11\}$.

Representation is not unique; it’s not the input/output format. Uniqueness would cost cycles!
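One way to produce such a sum is to cut the integer at the twelve bit positions. The sketch below is an illustration, not the talk's conversion code: it picks the nonnegative representative with digits in $[0, 2^{22})$, which is just one of the many representations the slide allows, and the helper names are hypothetical.

```python
# ceil(255 j / 12) for j = 0..11, computed with integer arithmetic.
INDICES = [-(-255 * j // 12) for j in range(12)]
assert INDICES == [0, 22, 43, 64, 85, 107, 128, 149, 170, 192, 213, 234]

def to_limbs(u):
    """Split 0 <= u < 2^255 into 12 floats; limb j is a multiple of 2^INDICES[j]."""
    bounds = INDICES + [255]
    limbs = []
    for lo, hi in zip(bounds, bounds[1:]):
        digit = (u >> lo) & ((1 << (hi - lo)) - 1)   # 21 or 22 bits
        limbs.append(float(digit << lo))             # exact: few significant bits
    return limbs

def from_limbs(limbs):
    """Recover the integer value of a list of integer-valued floats."""
    return sum(int(x) for x in limbs)

u = 2**255 - 20
assert from_limbs(to_limbs(u)) == u
```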

slide-21
SLIDE 21

Assume $u = \sum_i u_i$ as above, and similarly $v = \sum_i v_i$. Then $uv = r_0 + r_{22} + \cdots + r_{468}$ where

$r_0 = u_0 v_0$,
$r_{22} = u_0 v_{22} + u_{22} v_0$,
$r_{43} = u_0 v_{43} + u_{22} v_{22} + u_{43} v_0$,

etc. Each $r_i$ is a 53-bit fp number.

Given $u_i$’s and $v_i$’s, can compute $r_i$’s using 144 fp mults, 121 fp adds.

slide-23
SLIDE 23

Furthermore, modulo $2^{255} - 19$, the product $uv = r_0 + r_{22} + \cdots + r_{468}$ satisfies $uv \equiv s_0 + s_{22} + \cdots + s_{234}$ where

$s_0 = r_0 + 19 \cdot 2^{-255} r_{255}$,
$s_{22} = r_{22} + 19 \cdot 2^{-255} r_{277}$,

etc. Each $s_i$ is a 53-bit fp number. Example: $s_0$ is an integer; $|s_0| \le 381 \cdot 2^{44}$.

Computing $s_i$’s from $r_i$’s takes 11 fp mults, 11 fp adds.

Structure: $(\mathbf{Z}[t] \cap \overline{\mathbf{Z}}[2^{255/12} t]) / (2^{255} t^{12} - 19) \to \mathbf{Z} / (2^{255} - 19)$.
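The product and reduction steps can be tried out with ordinary Python floats and compared against exact integer arithmetic. This is a sketch under assumptions: inputs are split into nonnegative limbs with a hypothetical `to_limbs` helper, floats are IEEE 754 doubles, and the loop accumulates from zero rather than counting operations the way the slides do.

```python
P = 2**255 - 19
INDICES = [-(-255 * j // 12) for j in range(12)]   # 0, 22, 43, ..., 234

def to_limbs(u):
    """Split 0 <= u < 2^255 into 12 exactly representable floats."""
    bounds = INDICES + [255]
    return [float(((u >> lo) & ((1 << (hi - lo)) - 1)) << lo)
            for lo, hi in zip(bounds, bounds[1:])]

def multiply(u, v):
    """Schoolbook product: 23 sums of limb products, all exact in fp."""
    r = [0.0] * 23
    for j in range(12):
        for k in range(12):
            r[j + k] += u[j] * v[k]
    return r

def reduce_mod_p(r):
    """Fold the top 11 sums back in, using 2^255 = 19 modulo p."""
    scale = 19.0 * 2.0 ** -255
    return [r[j] + scale * r[j + 12] for j in range(11)] + [r[11]]

def value(limbs):
    return sum(int(x) for x in limbs)

a, b = 3**160 % P, 5**109 % P
r = multiply(to_limbs(a), to_limbs(b))
s = reduce_mod_p(r)
assert value(r) == a * b            # the product is exact before reduction
assert value(s) % P == a * b % P    # and still correct modulo p afterwards
```

No rounding error appears because every partial sum is a multiple of a power of two and small enough to fit in 53 bits, which is the point of the bounds on the limbs.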
slide-25
SLIDE 25

Carries

“Carry from $s_0$ to $s_{22}$”: replace $s_0$ and $s_{22}$ by $\mathrm{bottom}_{22}\, s_0$ and $s_{22} + \mathrm{top}_{22}\, s_0$. This takes 4 fp adds, and guarantees $|s_0| \le 2^{21}$.

Series of 13 carries puts all $s_i$’s in range for subsequent products: from $s_{192}$ to $s_{213}$ to $s_{234}$ to $s_{255}$; then from $s_0$ to $s_{22}$ to $s_{43}$ to $\cdots$ to $s_{192}$ to $s_{213}$. This takes 52 fp adds.
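A single carry step can be sketched as follows, assuming Python floats are IEEE 754 doubles. Only one carry is shown; the full series of 13, including the wrap-around through index 255, is not reproduced here. The dictionary layout and the sample values are illustrative.

```python
def top(i, f):
    a = 3.0 * 2.0 ** (i + 51)
    return (f + a) - a

def carry(limbs, lo, hi):
    """Carry from limbs[lo] to limbs[hi] in place, using 4 fp additions."""
    t = top(hi, limbs[lo])        # 2 adds: the part of limbs[lo] in 2^hi Z
    limbs[lo] = limbs[lo] - t     # 1 add: what remains is bottom
    limbs[hi] = limbs[hi] + t     # 1 add: the carried part moves up

# Sample values: a large limb at index 0 and a small one at index 22.
s = {0: float(381 * 2**44 - 12345), 22: float(5 * 2**22)}
before = int(s[0]) + int(s[22])
carry(s, 0, 22)
assert int(s[0]) + int(s[22]) == before   # the represented value is unchanged
assert abs(s[0]) <= 2.0**21               # the low limb is back in range
assert s[22] % 2.0**22 == 0.0             # the high limb is still a multiple of 2^22
```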

slide-27
SLIDE 27

Total 155 mults, 184 adds to multiply modulo $2^{255} - 19$ in this representation.

$\ge 184$ UltraSPARC III cycles. $= 184$ cycles? Two obstacles: fp-operation latency; “load/store” latency imposed by limited number of “registers.”

Schedule instructions carefully to bring cycles down to 184.

slide-29
SLIDE 29

Have developed qhasm, new programming language for high-speed computations. Includes range verification, guided register allocation, et al.

Lets me write desired code with much less human time than traditional asm, C compiler, etc.

Have also used for fast AES, fast Poly1305, fast Salsa20, etc.; see, e.g., http://cr.yp.to/mac/poly1305_athlon.s.

slide-31
SLIDE 31

Speedup: Squarings

Often know in advance that $u = v$.

$u_0 v_{64} + u_{22} v_{43} + u_{43} v_{22} + u_{64} v_0$ is more efficiently computed as $2(u_0 u_{64} + u_{22} u_{43})$.

Even better: First compute $2u_0, 2u_{22}, \ldots, 2u_{234}$ and then compute $(2u_0) u_{64} + (2u_{22}) u_{43}$ etc.

130 fp adds instead of 184. Makes carry time even more visible.

slide-33
SLIDE 33

Speedup: Karatsuba’s method

Say
$U_0 = u_0 + u_{22} t + \cdots + u_{107} t^5$,
$U_1 = u_{128} + u_{149} t + \cdots + u_{234} t^5$,
$V_0 = v_0 + \cdots$,
$V_1 = v_{128} + \cdots$.

Original, 184 adds: Product is $U_0 V_0 + (U_0 V_1 + U_1 V_0) t^6 + U_1 V_1 t^{12}$.

Karatsuba, 182 adds: $((U_0 + U_1)(V_0 + V_1) - U_0 V_0 - U_1 V_1) t^6 + U_0 V_0 + U_1 V_1 t^{12}$.

Improved Karatsuba, 177 adds: $(U_0 + U_1)(V_0 + V_1) t^6 + (U_0 V_0 - U_1 V_1 t^6)(1 - t^6)$.
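The three product formulas are algebraically identical, which is easy to confirm on small integer polynomials. The sketch below checks the identities only; it uses plain Python lists of coefficients and does not model the fp operation counts. All helper names are illustrative.

```python
from itertools import zip_longest

def padd(a, b):
    return [x + y for x, y in zip_longest(a, b, fillvalue=0)]

def pneg(a):
    return [-x for x in a]

def pmul(a, b):
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def shift(a, k):
    """Multiply a polynomial by t^k."""
    return [0] * k + list(a)

def schoolbook(U0, U1, V0, V1):
    mid = padd(pmul(U0, V1), pmul(U1, V0))
    return padd(padd(pmul(U0, V0), shift(mid, 6)), shift(pmul(U1, V1), 12))

def karatsuba(U0, U1, V0, V1):
    lo, hi = pmul(U0, V0), pmul(U1, V1)
    mid = padd(pmul(padd(U0, U1), padd(V0, V1)), pneg(padd(lo, hi)))
    return padd(padd(lo, shift(mid, 6)), shift(hi, 12))

def improved(U0, U1, V0, V1):
    lo, hi = pmul(U0, V0), pmul(U1, V1)
    w = padd(lo, pneg(shift(hi, 6)))     # U0 V0 - U1 V1 t^6
    w = padd(w, pneg(shift(w, 6)))       # multiply by (1 - t^6)
    return padd(shift(pmul(padd(U0, U1), padd(V0, V1)), 6), w)

U0, U1 = [1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]
V0, V1 = [-3, 1, 4, -1, 5, 9], [2, -6, 5, 3, -5, 8]
expected = schoolbook(U0, U1, V0, V1)
assert karatsuba(U0, U1, V0, V1) == expected
assert improved(U0, U1, V0, V1) == expected
```

The improved form saves work because the subtraction of $U_0 V_0$ and $U_1 V_1$ from the middle term is shared with building the outer terms, via the single multiplication by $(1 - t^6)$.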
SLIDE 35

The Curve function

Overall strategy to compute $n, \mathrm{Curve}(m) \mapsto \mathrm{Curve}(mn)$, using arithmetic mod $p = 2^{255} - 19$:

For various integers $k$, find $x_k, z_k$ such that $\mathrm{Curve}(km) \equiv x_k / z_k \pmod p$, i.e., $z_k \mathrm{Curve}(km) \equiv x_k \pmod p$.

e.g. $x_1 = \mathrm{Curve}(m)$, $z_1 = 1$, assuming $\mathrm{Curve}(m) \ne \infty$. Can easily restrict $n, \mathrm{Curve}(m)$ to ensure that $\infty$ never appears.

slide-37
SLIDE 37

We’ll see how to compute $x_k, z_k \mapsto x_{2k}, z_{2k}$; and $x_k, z_k, x_{k+1}, z_{k+1}, \mathrm{Curve}(m) \mapsto x_{2k+1}, z_{2k+1}$.

Combine to compute $x_k, z_k, x_{k+1}, z_{k+1}, b, \mathrm{Curve}(m) \mapsto x_j, z_j, x_{j+1}, z_{j+1}$ where $k = \lfloor j/2 \rfloor$, $b = j \bmod 2$.

Conditional branches and input-dependent load addresses can leak via timing. Replace with arithmetic: e.g., $(1 - b) x_{2k} + (b) x_{2k+1}$.
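The arithmetic replacement for a branch looks like this. It is a sketch of the formula only: Python integers are not constant-time, so this shows the shape of the computation rather than a side-channel-safe implementation, and the name `select` is hypothetical.

```python
def select(b, x0, x1):
    """Return x0 when b == 0 and x1 when b == 1, with no branch on b."""
    return (1 - b) * x0 + b * x1

# The same multiplications and additions run for either value of b.
assert select(0, 10, 20) == 10
assert select(1, 10, 20) == 20
```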
slide-39
SLIDE 39

Eventually reach $k = n$. Divide $x_n$ by $z_n$ modulo $p$ to obtain $\mathrm{Curve}(mn)$.

Simple division method: Fermat! $x_n / z_n \equiv x_n z_n^{p-2} \pmod p$.

Euclid-type division methods are faster but have input-dependent timings.

Finally convert from floating-point representation to byte-string output format.
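Fermat division is a single modular exponentiation with a fixed exponent, so the sequence of operations does not depend on the value being inverted. A minimal sketch using Python's built-in three-argument `pow`; the function name `divide` is illustrative, and Python's big-integer arithmetic is not itself constant-time.

```python
P = 2**255 - 19

def divide(x, z):
    """Compute x / z modulo p as x * z^(p-2), valid for z not divisible by p."""
    return x * pow(z, P - 2, P) % P

z = 358990
assert z * divide(1, z) % P == 1   # divide(1, z) is the inverse of z
assert divide(6, 3) == 2
```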

slide-41
SLIDE 41

From $k$ to $2k$

In $\mathbf{Z}/p$: $x_{2k} = (x_k^2 - z_k^2)^2$, $z_{2k} = 4 x_k z_k (x_k^2 + A x_k z_k + z_k^2)$.

Compute as follows:
$(x_k - z_k)^2$;
$(x_k + z_k)^2$;
$x_{2k} = (x_k - z_k)^2 (x_k + z_k)^2$;
$4 x_k z_k = (x_k + z_k)^2 - (x_k - z_k)^2$;
$(A - 2) x_k z_k = 89747 \cdot 4 x_k z_k$;
$z_{2k} = 4 x_k z_k ((x_k + z_k)^2 + (A - 2) x_k z_k)$.
slide-43
SLIDE 43

From $k, k + 1$ to $2k + 1$

$x_{2k+1} = 4 (x_k x_{k+1} - z_k z_{k+1})^2$, $z_{2k+1} = 4 (x_k z_{k+1} - z_k x_{k+1})^2\, \mathrm{Curve}(m)$.

Compute as follows:
$(x_k - z_k)(x_{k+1} + z_{k+1})$;
$(x_k + z_k)(x_{k+1} - z_{k+1})$;
$2 (x_k x_{k+1} - z_k z_{k+1})$ = sum;
$2 (x_k z_{k+1} - z_k x_{k+1})$ = difference;
$x_{2k+1} = (2 (x_k x_{k+1} - z_k z_{k+1}))^2$;
$(2 (x_k z_{k+1} - z_k x_{k+1}))^2$;
$z_{2k+1} = (\cdots)\, \mathrm{Curve}(m)$.
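The doubling and addition steps combine into a ladder over the bits of the scalar. The sketch below uses Python big integers rather than the floating-point representation from the talk, so it illustrates the formulas and not the speed techniques. The function names, the starting state (the point at infinity as $(1, 0)$), and the choice to compute all three candidate results before selecting are assumptions made for this illustration.

```python
P = 2**255 - 19
A = 358990
A24 = (A - 2) // 4   # 89747, the constant on the doubling slide

def double(x, z):
    s = (x + z) ** 2 % P
    d = (x - z) ** 2 % P
    xz4 = (s - d) % P                       # 4 x z
    return s * d % P, xz4 * (s + A24 * xz4) % P

def add(x, z, x1, z1, base):
    """From k and k+1 to 2k+1; base is the x coordinate of the starting point."""
    a = (x - z) * (x1 + z1) % P
    b = (x + z) * (x1 - z1) % P
    return (a + b) ** 2 % P, (a - b) ** 2 * base % P

def curve_mul(n, base):
    """x coordinate of the nth multiple of a point with x coordinate base."""
    x, z, x1, z1 = 1, 0, base % P, 1        # k = 0 and k + 1 = 1
    for i in reversed(range(n.bit_length())):
        bit = (n >> i) & 1                  # new index is 2k + bit
        xd, zd = double(x, z)               # index 2k
        xa, za = add(x, z, x1, z1, base)    # index 2k + 1
        xe, ze = double(x1, z1)             # index 2k + 2
        # Arithmetic selection instead of a branch on the secret bit.
        x, z = (1 - bit) * xd + bit * xa, (1 - bit) * zd + bit * za
        x1, z1 = (1 - bit) * xa + bit * xe, (1 - bit) * za + bit * ze
    return x * pow(z, P - 2, P) % P         # Fermat division

m, n = 123456789, 987654321
assert curve_mul(1, 2) == 2
assert curve_mul(m, curve_mul(n, 2)) == curve_mul(n, curve_mul(m, 2))
```

The last assertion is the property the key exchange relies on: both users arrive at the same value regardless of the order in which the two secret scalars are applied.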
slide-45
SLIDE 45

Total time

Slightly over 1600 fp adds (520 from carries) for each bit of $n$.

Total for 256-bit $n$: 413000 fp adds; plus 50000 fp adds for final division.

Aiming for 500000 cycles. Still have to finish software.

Should end up even faster than my NIST P-224 software, despite 14% more bits!