Recent Advances in Parallel Implementations of Scalar Multiplication - - PowerPoint PPT Presentation

recent advances in parallel implementations of scalar
SMART_READER_LITE
LIVE PREVIEW

Recent Advances in Parallel Implementations of Scalar Multiplication - - PowerPoint PPT Presentation

Recent Advances in Parallel Implementations of Scalar Multiplication over Binary Elliptic Curves C. Negre and J.M. Robert april 8, 2015 1 / 39 Outline Overview of elliptic curve cryptography 1 Implementation of F 2 m arithmetic 2 Elliptic


slide-1
SLIDE 1

Recent Advances in Parallel Implementations of Scalar Multiplication over Binary Elliptic Curves

  • C. Negre and J.M. Robert

april 8, 2015

1 / 39

slide-2
SLIDE 2

Outline

1

Overview of elliptic curve cryptography

2

Implementation of F2m arithmetic

3

Elliptic curve arithmetic

4

Scalar multiplication

2 / 39

slide-3
SLIDE 3

Outline

1

Overview of elliptic curve cryptography

2

Implementation of F2m arithmetic

3

Elliptic curve arithmetic

4

Scalar multiplication

3 / 39

slide-4
SLIDE 4

Diffie-Hellmann key exchange

Alice and Bob agree on a group (G, +, O) and a generating point of the group P.

Bob Alice

4 / 39

slide-5
SLIDE 5

Diffie-Hellmann key exchange

Alice and Bob agree on a group (G, +, O) and a generating point of the group P.

Bob Alice

a ← random() b ← random() 4 / 39

slide-6
SLIDE 6

Diffie-Hellmann key exchange

Alice and Bob agree on a group (G, +, O) and a generating point of the group P.

Bob Alice

a ← random() Computes A = a · P Computes B = b · P b ← random() 4 / 39

slide-7
SLIDE 7

Diffie-Hellmann key exchange

Alice and Bob agree on a group (G, +, O) and a generating point of the group P.

Bob Alice

sends B sends A

Computes A = a · P a ← random() b ← random() Computes B = b · P 4 / 39

slide-8
SLIDE 8

Diffie-Hellmann key exchange

Alice and Bob agree on a group (G, +, O) and a generating point of the group P.

Bob Alice

sends B sends A

Computes A = a · P a ← random() Computes B = b · P b ← random() Computes K = b · A Computes K = a · B

Shared secret key K = a · b · P

4 / 39

slide-9
SLIDE 9

Diffie-Hellmann key exchange

Alice and Bob agree on a group (G, +, O) and a generating point of the group P.

Bob Alice

sends B sends A

Computes A = a · P a ← random() Computes B = b · P b ← random() Computes K = b · A Computes K = a · B

Shared secret key K = a · b · P

The main operation is the scalar multiplication a · P.

4 / 39

slide-10
SLIDE 10

Group law for an elliptic curve y 2 = x3 − 2x + 1

x

P = (xP, yP) Q = (xQ, yQ)

5 / 39

slide-11
SLIDE 11

Group law for an elliptic curve y 2 = x3 − 2x + 1

x

P = (xP, yP) Q = (xQ, yQ) R = P + Q

Addition (chord): xR = λ − xP − xQ yR = yP − λ(xR − xP) with λ = yP−yQ

xP−xQ

5 / 39

slide-12
SLIDE 12

Group law for an elliptic curve y 2 = x3 − 2x + 1

x

P = (xP, yP) Q = (xQ, yQ) R = P + Q

Addition (chord): xR = λ − xP − xQ yR = yP − λ(xR − xP) with λ = yP−yQ

xP−xQ

x

P = (xP, yP) R = 2P

Doubling (tangent): xR = λ − 2xP yR = yP − λ(xR − xP) with λ = 3x2

P+a

2yP

5 / 39

slide-13
SLIDE 13

Scalar multiplication : k · P

x

P 2P

Scalar multiplication: 7P 2 · P 3P = (2P) + P 6P = 2 · (3P) 7P = (6P) + P

6 / 39

slide-14
SLIDE 14

Scalar multiplication : k · P

x

P 2P 3P

Scalar multiplication: 7P 2 · P 3P = (2P) + P 6P = 2 · (3P) 7P = (6P) + P

6 / 39

slide-15
SLIDE 15

Scalar multiplication : k · P

x

P 3P 6P

Scalar multiplication: 7P 2 · P 3P = (2P) + P 6P = 2 · (3P) 7P = (6P) + P

6 / 39

slide-16
SLIDE 16

Scalar multiplication : k · P

x

P 6P 7P

Scalar multiplication: 7P 2 · P 3P = (2P) + P 6P = 2 · (3P) 7P = (6P) + P

6 / 39

slide-17
SLIDE 17

Hierarchy of operations

ECDSA (sign) Diffie-Hellman (key exchange) Double-and-add Halve-and-add Point doubling Point addition Point halving Field addition Field multiplication Field inversion Quadratic solver

← Field operation ← Curve operation ← Scalar multiplication ← Protocols

7 / 39

slide-18
SLIDE 18

The considered elliptic curves E(F2m)

Binary field: F2 = Z/2Z. Extended binary field: F2m = F2[t]/(f (t)) where f (t) is irreducible. For A = m−1

i=0 aiti and B = m−1 i=0 biti in F2m

addition : A + B =

m−1

  • i=0

(ai + bi) · ti, multiplication : A × B = A · B mod f (t). Binary elliptic curve: the set of points P = (x, y) ∈ F2

2m satisfying

E : y2 + xy = x3 + ax2 + b, a, b ∈ F2m.

8 / 39

slide-19
SLIDE 19

Curve and field implemented

NIST curve B233: defined over F2[t]/(t233 + t74 + 1) with equation E : y2 + xy = x3 + x2 + b where

b =0x066647ede6c332c7f 8c0923bb58213b333b20e9ce4281fe115f 7d8f 90ad, N =6901746346790563787434755862277025555839812737345013555379383634485463.

GHS curve E(F22·127): defined over the field F22·127 constructed as F2127 = F[t]/(t127 + t63 + 1) F22·127 = F2127[u]/(u2 + u + 1) with curve equation E : y2 + xy = x3 + ux2 + b √ b = 0xE2DA921E91E38DD1 and admitting an endomorphism.

9 / 39

slide-20
SLIDE 20

Outline

1

Overview of elliptic curve cryptography

2

Implementation of F2m arithmetic

3

Elliptic curve arithmetic

4

Scalar multiplication

10 / 39

slide-21
SLIDE 21

F2m arithmetic over Intel Cores

Intel Core i3,i5 and i7 offer: Logical instructions XOR, AND over 128 and 256 bits. PCLMUL instruction computing the product of two degree 64 binary polynomials. PSHUFB a byte shuffling instructions . Shifting instruction (vector 64 bit shifts and full 128 bit shifts).

11 / 39

slide-22
SLIDE 22

F2m arithmetic over Intel Cores

Intel Core i3,i5 and i7 offer: Logical instructions XOR, AND over 128 and 256 bits. PCLMUL instruction computing the product of two degree 64 binary polynomials. PSHUFB a byte shuffling instructions . Shifting instruction (vector 64 bit shifts and full 128 bit shifts). We will see how to implement arithmetic over F2233:

1 Polynomial multiplication with PCLMUL. 2 Polynomial squaring with PSHUFB. 3 Reduction with shift, 128-bit XOR and AND. 4 Look up table for quadratic-solver. 11 / 39

slide-23
SLIDE 23

Multiplication in F2233 with Karatsuba

Karatsuba formula

For A(x) = Ah + tm/2Al and B(x) = Bh + tm/2Bl A × B = AhBhtm + ((Ah + Al)(Bh + Bl) + AhBh + AlBl)tm/2 + AlBl

12 / 39

slide-24
SLIDE 24

Multiplication in F2233 with Karatsuba

Karatsuba formula

For A(x) = Ah + tm/2Al and B(x) = Bh + tm/2Bl A × B = AhBhtm + ((Ah + Al)(Bh + Bl) + AhBh + AlBl)tm/2 + AlBl Two recursions for degree m = 233:

PCLMUL PCLMUL PCLMUL PCLMUL PCLMUL PCLMUL PCLMUL PCLMUL PCLMUL

128 bits B[3] B[0] B[1] B[2] C[2] C[3] C[6] C[5] C[0] C[1] C[7] A[2] A[3] A[0] A[1]

×

C[4]

12 / 39

slide-25
SLIDE 25

Squaring with PSHUFB

Let a and b be two 128-bits data = 16 bytes. The PSHUFB instruction permute the bytes of a as specified by b b =

14 15 12 13 10 11 8 9 6 7 4 5 2 3 1

a =

a[15] a[14] a[13] a[12] a[11] a[10] a[9] a[8] a[7] a[6] a[5] a[4] a[3] a[2] a[1] a[0]

PSHUFB(b, a) outputs c =

a[14] a[15] a[12] a[13] a[10] a[11] a[8] a[9] a[6] a[7] a[4] a[5] a[2] a[3] a[0] a[1]

In other words c[i] = a[b[i]]

13 / 39

slide-26
SLIDE 26

Squaring with PSHUFB

Squaring a polynomial b(t) = m−1

i=0 biti ∈ F2[t]:

b(t)2 =

m−1

  • i=0

bit2i. Aranha et al. 2010: Use PSHUFB for simultaneous look-up table:

◮ We store in a[j] the squaring of j (seen as an element of F2[t])

a[j] = j2 for j = 0, . . . , 16.

◮ PSHUFB(b,a) computes

c[i] = a[b[i]] = (b[i])2.

Squaring 128 bits = 2 PSHUFB + 1 Masking + 3 shifts.

14 / 39

slide-27
SLIDE 27

Square root

We express the square root of A(t) = m−1

i=0 aiti as:

(A(t))1/2 = ( even degree

  • m−1

2

  • i=0

a2it2i +

  • dd degree
  • (

m−1 2

  • i=0

a2i+1t2i+1))1/2 = m−1

2

i=0 a2iti

  • + √x

m−1

2

i=0 a2i+1ti

  • Masking: We separate A as Aodd and Aeven.

PSHUB: We suppress zeros in Aodd and Aeven. Shift and XOR: we multiply Aodd by √x = and XOR it to Aeven.

15 / 39

slide-28
SLIDE 28

Reduction modulo f (t) = t233 + t74 + 1

0 · · · · · · 0 c464 · · · c384 c383 · · · · · · c256 c255 · · · · · · c128 c127 · · · · · · c1c0

16 / 39

slide-29
SLIDE 29

Reduction modulo f (t) = t233 + t74 + 1

0 · · · · · · 0 c464 · · · c384 c383 · · · · · · c256 c255 · · · · · · c128 c127 · · · · · · c1c0

t384 = t225 + t151

0 · · · · · · 0 c464 · · · c384 0 · · · · · · 0 c464 · · · c384

16 / 39

slide-30
SLIDE 30

Reduction modulo f (t) = t233 + t74 + 1

c383 · · · · · · c256 c255 · · · · · · c128 c127 · · · · · · c1c0 0 · · · · · · 0 c464 · · · c384 0 · · · · · · 0 c464 · · · c384 cr,383 · · · · · · cr,256 cr,255 · · · · · · cr,128 cr,127 · · · · · · cr,1cr,0

16 / 39

slide-31
SLIDE 31

Reduction modulo f (t) = t233 + t74 + 1

c383 · · · · · · c256 c255 · · · · · · c128 c127 · · · · · · c1c0 0 · · · · · · 0 c464 · · · c384 0 · · · · · · 0 c464 · · · c384 cr,383 · · · · · · cr,256 cr,255 · · · · · · cr,128 cr,127 · · · · · · cr,1cr,0

t256 = t97 + t23

cr,383 · · · · · · cr,256 cr,383 · · · · · · cr,256

16 / 39

slide-32
SLIDE 32

Reduction modulo f (t) = t233 + t74 + 1

c383 · · · · · · c256 c255 · · · · · · c128 c127 · · · · · · c1c0 0 · · · · · · 0 c464 · · · c384 0 · · · · · · 0 c464 · · · c384 cr,255 · · · · · · cr,128 cr,127 · · · · · · cr,1cr,0 cr,383 · · · · · · cr,256 cr,383 · · · · · · cr,256 cr,255..cr,233 · · · cr,128 cr,127 · · · · · · cr,1cr,0

16 / 39

slide-33
SLIDE 33

Reduction modulo f (t) = t233 + t74 + 1

c383 · · · · · · c256 c255 · · · · · · c128 c127 · · · · · · c1c0 0 · · · · · · 0 c464 · · · c384 0 · · · · · · 0 c464 · · · c384 cr,255 · · · · · · cr,128 cr,127 · · · · · · cr,1cr,0 cr,383 · · · · · · cr,256 cr,383 · · · · · · cr,256 cr,255..cr,233 · · · cr,128 cr,127 · · · · · · cr,1cr,0

t233 = t74 + 1

cr,255..cr,233 cr,255..cr,233

16 / 39

slide-34
SLIDE 34

Reduction modulo f (t) = t233 + t74 + 1

c383 · · · · · · c256 c255 · · · · · · c128 c127 · · · · · · c1c0 0 · · · · · · 0 c464 · · · c384 0 · · · · · · 0 c464 · · · c384 cr,255 · · · · · · cr,128 cr,127 · · · · · · cr,1cr,0 cr,383 · · · · · · cr,256 cr,383 · · · · · · cr,256 0 · · · · · · 0 · · · cr,128 cr,127 · · · · · · cr,1cr,0 cr,255..cr,233 cr,255..cr,233 0 · · · · · · 0 cr,233 · · · cr,127 · · · · · · cr,1cr,0

16 / 39

slide-35
SLIDE 35

Lazy Reduction for F2233

Usual reduction

a232 a128 a127 255 a0

A

A1 A0

b232 b128 b127 255 b0

B

B1 B0

c464 c383 c127 c255 511 c256 c128 c0

Cr1 Cr0

cr127 511 cr128 cr0 cr232

C1 C0 C2 C3

C = A · B Cr = C mod f

Reduction of C3, C2 and C1.

17 / 39

slide-36
SLIDE 36

Lazy Reduction for F2233

Usual reduction

a232 a128 a127 255 a0

A

A1 A0

b232 b128 b127 255 b0

B

B1 B0

c464 c383 c127 c255 511 c256 c128 c0

Cr1 Cr0

cr127 511 cr128 cr0 cr232

C1 C0 C2 C3

C = A · B Cr = C mod f

Reduction of C3, C2 and C1.

Lazy reduction

Cr1 Cr0

cr127 511 cr128 cr0

C1 C0 C2 C3

C = A · B Cr = C mod f

Reduction of C3 and C2 only.

A

A1 A0

b128 b127 255 b0

B

B1 B0

0; c510 c384 c383 c127 c255 511 c256 c128 c0 a128 a127 255 a0 cr255 b255 a255

17 / 39

slide-37
SLIDE 37

Quadratic solver

The solution of an equation in y y2 + y = c with c ∈ F2m s.t. Tr(c) = 0 (where Tr(c) =

m−1

  • i=0

c2i) is given by the Half-Trace (if m is odd): y = HT(c) =

m−1 2

  • i=0

c22i Indeed, we have HT(c)2 + HT(c) =

  • dd power
  • m−1

2

  • i=0

c22i+1 + even power

m−1 2

  • i=0

c22i = c2m + Tr(c)

18 / 39

slide-38
SLIDE 38

Quadratic solver: implementation

We precompute tab[i][u3u2u1u0] = HT(t4i(u3t3 + u2t2 + u1t + u0)) for all 0 ≤ i < m/4. We use the linearity: HT(c1 + c2) = HT(c2) + HT(c1)

19 / 39

slide-39
SLIDE 39

Quadratic solver: implementation

We precompute tab[i][u3u2u1u0] = HT(t4i(u3t3 + u2t2 + u1t + u0)) for all 0 ≤ i < m/4. We use the linearity: HT(c1 + c2) = HT(c2) + HT(c1) We do m/4 look-up table and add:

c3c2c1c0 c7c6c5c4 c11c10c9c8 · · · · · · 19 / 39

slide-40
SLIDE 40

Quadratic solver: implementation

We precompute tab[i][u3u2u1u0] = HT(t4i(u3t3 + u2t2 + u1t + u0)) for all 0 ≤ i < m/4. We use the linearity: HT(c1 + c2) = HT(c2) + HT(c1) We do m/4 look-up table and add:

c3c2c1c0 c7c6c5c4 c11c10c9c8 · · · · · · tab[0][c3c2c1c0] 19 / 39

slide-41
SLIDE 41

Quadratic solver: implementation

We precompute tab[i][u3u2u1u0] = HT(t4i(u3t3 + u2t2 + u1t + u0)) for all 0 ≤ i < m/4. We use the linearity: HT(c1 + c2) = HT(c2) + HT(c1) We do m/4 look-up table and add:

c3c2c1c0 c7c6c5c4 c11c10c9c8 · · · · · · tab[0][c3c2c1c0] tab[1][c7c6c5c4]

+

19 / 39

slide-42
SLIDE 42

Quadratic solver: implementation

We precompute tab[i][u3u2u1u0] = HT(t4i(u3t3 + u2t2 + u1t + u0)) for all 0 ≤ i < m/4. We use the linearity: HT(c1 + c2) = HT(c2) + HT(c1) We do m/4 look-up table and add:

c3c2c1c0 c7c6c5c4 c11c10c9c8 · · · · · · tab[0][c3c2c1c0] tab[1][c7c6c5c4]

+

tab[2][c11c10c9c8]

+ + · · ·

19 / 39

slide-43
SLIDE 43

Timings Core i7 SB (ratios vs multiplication)

20 / 39

slide-44
SLIDE 44

Outline

1

Overview of elliptic curve cryptography

2

Implementation of F2m arithmetic

3

Elliptic curve arithmetic

4

Scalar multiplication

21 / 39

slide-45
SLIDE 45

Elliptic curves over binary fields

The formulas for point doubling and point addition are : x3 = λ2 + λ + x1 + x2 + a y3 = (x1 + x3)λ + x3 + y1 with    λ = y1+y2

x1+x2 if P1 = P2

λ = y1

x1 + x1 if P1 = P2

22 / 39

slide-46
SLIDE 46

Elliptic curves over binary fields

The formulas for point doubling and point addition are : x3 = λ2 + λ + x1 + x2 + a y3 = (x1 + x3)λ + x3 + y1 with    λ = y1+y2

x1+x2 if P1 = P2

λ = y1

x1 + x1 if P1 = P2

A point doubling requires: 1 Inv, 2 Mul, 1 Squ and 8 Add ; A point addition requires: 1 Inv, 2 Mul, 1 Squ and 9 Add;

22 / 39

slide-47
SLIDE 47

Elliptic curves over binary fields

The formulas for point doubling and point addition are : x3 = λ2 + λ + x1 + x2 + a y3 = (x1 + x3)λ + x3 + y1 with    λ = y1+y2

x1+x2 if P1 = P2

λ = y1

x1 + x1 if P1 = P2

A point doubling requires: 1 Inv, 2 Mul, 1 Squ and 8 Add ; A point addition requires: 1 Inv, 2 Mul, 1 Squ and 9 Add; Opposite of P = (x, y) is −P = (x, x + y)

22 / 39

slide-48
SLIDE 48

Projective coordinates

Lopez-Dahab projective coordinates. P = (XP : YP : ZP) with

  • xP = XP

ZP

yP = YP

Z 2

P 23 / 39

slide-49
SLIDE 49

Projective coordinates

Lopez-Dahab projective coordinates. P = (XP : YP : ZP) with

  • xP = XP

ZP

yP = YP

Z 2

P

The coordinates (X2P : Y2P : Z2P) of 2P are computed as: X2P = X 4

P + b · Z 4 P

Y2P = bZ 4

P · Z2P + X2P · (aZ2P + Y 2 P + bZ 4 P)

Z2P = X 2

P · Z 2 P 23 / 39

slide-50
SLIDE 50

Projective coordinates

Lopez-Dahab projective coordinates. P = (XP : YP : ZP) with

  • xP = XP

ZP

yP = YP

Z 2

P

The coordinates (X2P : Y2P : Z2P) of 2P are computed as: X2P = X 4

P + b · Z 4 P

Y2P = bZ 4

P · Z2P + X2P · (aZ2P + Y 2 P + bZ 4 P)

Z2P = X 2

P · Z 2 P

Other interesting coordinates systems:

◮ Kim-Kim (2006) (X, Y , Z, T) ∼

= (X/Z, Y /T),

◮ Lambda-projective coordinates (Oliveira et al. 2013)

(X, L, Z) ∼ = (X/Z, L/Z(X/Z) + (X/Z)2).

23 / 39

slide-51
SLIDE 51

Point halving

Point doubling Q = 2P: Q = (u, v) from P = (x, y) λ = x + y/x, (1) u = λ2 + λ + a, (2) v = x2 + u(λ + 1). (3)

24 / 39

slide-52
SLIDE 52

Point halving

Point doubling Q = 2P: Q = (u, v) from P = (x, y) λ = x + y/x, (1) u = λ2 + λ + a, (2) v = x2 + u(λ + 1). (3) Point halving P = 1

2Q : we compute x, y in terms of u, v as follows:

◮ we solve (2): λ2 + λ = u + a → λ (Quadratic solver); ◮ we solve (3): x2 = v + u(λ + 1) → x (Square root); ◮ we get y with (1): y = λx + x2. 24 / 39

slide-53
SLIDE 53

Point halving

Point doubling Q = 2P: Q = (u, v) from P = (x, y) λ = x + y/x, (1) u = λ2 + λ + a, (2) v = x2 + u(λ + 1). (3) Point halving P = 1

2Q : we compute x, y in terms of u, v as follows:

◮ we solve (2): λ2 + λ = u + a → λ (Quadratic solver); ◮ we solve (3): x2 = v + u(λ + 1) → x (Square root); ◮ we get y with (1): y = λx + x2.

A halving requires: 1Quadratic Solver, 1Square Root, 2M and 1S.

24 / 39

slide-54
SLIDE 54

Cost of point operations

Doubling Halving Mixed Doubling addition + mixed add. Affine 2M + 1S + 1I QS+SR+S+2M 2M + 1S + 1I 4M + 2S + 2I Lopez-Dahab 4M + 4S

  • 9M + 5S

13M +10S Kim-Kim 4M + 5S

  • 8M + 5S

12M +10S Lambda Proj. 4M + 4S

  • 8M + 2S

10M +6S

25 / 39

slide-55
SLIDE 55

Outline

1

Overview of elliptic curve cryptography

2

Implementation of F2m arithmetic

3

Elliptic curve arithmetic

4

Scalar multiplication

26 / 39

slide-56
SLIDE 56

The Scalar Multiplication Algorithm : Double-and-add

Let k = (kℓ−1, ..., k1, k0)2 ∈ N, P ∈ E(F2m)

k · P = (ℓ−1

i=0 2iki) · P

= (k0 + 2(k1 + 2(k2 + 2(. . . + 2kℓ−1) . . .)))) · P = (k0 · P + 2(k1 · P + 2(k2 · P + 2(. . . + 2(kℓ−2 · P + 2(kℓ−1 · P)) . . .))))

27 / 39

slide-57
SLIDE 57

The Scalar Multiplication Algorithm : Double-and-add

Let k = (kℓ−1, ..., k1, k0)2 ∈ N, P ∈ E(F2m)

k · P = (ℓ−1

i=0 2iki) · P

= (k0 + 2(k1 + 2(k2 + 2(. . . + 2kℓ−1) . . .)))) · P = (k0 · P + 2(k1 · P + 2(k2 · P + 2(. . . + 2(kℓ−2 · P + 2(kℓ−1 · P)) . . .))))

This can be performed with the following algorithm :

1: Q ← O 2: for i = ℓ − 1 down to 0 do 3:

Q ← 2 · Q

4:

if ki = 1 then

5:

Q ← Q + P

6:

end if

7: end for 8: return Q

27 / 39

slide-58
SLIDE 58

Improvements of Double-and-add with NAF and NAFw.

NAF makes k sparser: let k ∈ [0, 2ℓ[ satisfying k = 2i − 1, then (k)2 = 111...1 i times and we can write : (k)NAF = 100...00 − 1

  • i+1 digits

. This representation decreases the average number of non-zero digits from ℓ/2 to ℓ/3.

28 / 39

slide-59
SLIDE 59

Improvements of Double-and-add with NAF and NAFw.

NAF makes k sparser: let k ∈ [0, 2ℓ[ satisfying k = 2i − 1, then (k)2 = 111...1 i times and we can write : (k)NAF = 100...00 − 1

  • i+1 digits

. This representation decreases the average number of non-zero digits from ℓ/2 to ℓ/3. NAFw further reduces the average number of non-zero digits down to ℓ/(w + 1) using larger digits : {−2w−1 + 1, ..., −5, −3, −1, 0, 1, 3, 5, ..., 2w−1 − 1}. Complexity of scalar multiplication over E(F2m) :

  • nb. of doublings
  • nb. of additions

Double-and-add ℓ ℓ/2 NAF Double-and-add ℓ ℓ/3 NAFw Double-and-add ℓ ℓ/(w + 1) + 2w−2

28 / 39

slide-60
SLIDE 60

(Double,Halve)-and-add parallel scalar multiplication

Recode with NAFw k =

  • i=0

k′

i 2i Double-and-add R ← O //Precomputation: Pi = iP for i ∈ {1, 3, . . . , 2w−1 − 1} for i = ℓ down to 0 do R ← 2 · R if k′

i = 0 then

R ← R + sign(ki)P|ki | endif endfor return(R)

29 / 39

slide-61
SLIDE 61

(Double,Halve)-and-add parallel scalar multiplication

. Recode k =

  • i=0

k′′

i 2−i

  • Double-and-add

R ← O //Precomputation: Pi = iP for i ∈ {1, 3, . . . , 2w−1 − 1} for i = ℓ down to 0 do R ← 2 · R if k′

i = 0 then

R ← R + sign(ki)P|ki | endif endfor return(R) Halve-and-add S ← P //Initialization: Ri = O for i ∈ {1, 3, . . . , 2w−1 − 1} for i = 0 to ℓ do S ← 1

2 · S

if k′

i = 0 then

R|k′

i | ← R|k′ i | + sign(ki)S

endif endfor R ←

i=1,3,...,2w−1−1 i · Ri

return(R)

29 / 39

slide-62
SLIDE 62

(Double,Halve)-and-add parallel scalar multiplication

. Recode k =

ℓ−s

  • i=0

k′

i 2i

+

s

  • i=0

k′′

i 2−i

  • Double-and-add

R ← O //Precomputation: Pi = iP for i ∈ {1, 3, . . . , 2w−1 − 1} for i = ℓ − s down to 0 do R ← 2 · R if k′

i = 0 then

R ← R + sign(ki)P|ki | endif endfor return(R) Halve-and-add S ← P //Initialization: Ri = O for i ∈ {1, 3, . . . , 2w−1 − 1} for i = 0 to ℓ − s do S ← 1

2 · S

if k′

i = 0 then

R|k′

i | ← R|k′ i | + sign(ki)S

endif endfor R ←

i=1,3,...,2w−1−1 i · Ri

return(R)

. Add the two points

29 / 39

slide-63
SLIDE 63

Simple power analysis on scalar multiplication

For a scalar k = (kℓ, . . . , k0)2 and a point P ∈ E(Fq)

Double-and-add R ← O for i = ℓ − 1 to 0 do R ← 2 · R if ki = 1 then R ← R + P endif endfor return(R) Double-and-add-always R ← O for i = ℓ − 1 to 0 do R ← 2 · R if ki = 1 then R ← R + P else R′ ← R′ + P endif endfor return(R) Montgomery-ladder R ← O R′ ← P for i = ℓ − 1 to 0 do if ki = 1 then R ← R + R′ R′ ← 2 · R′ else R′ ← R + R′ R ← 2 · R endif endfor return(R)

↓ ↓

30 / 39

slide-64
SLIDE 64

Parallelization of the Montgomery ladder (Robert 2013)

Recode and split k =

ℓ−s

  • i=0

k′

i 2i

+

s

  • i=0

k′′

i 2−i

Montgomery-ladder R ← O R′ ← P for i = ℓ − s to 0 do if k′

i = 1 then

R ← R + R′ R′ ← 2 · R′ else R′ ← R + R′ R ← 2 · R endif endfor return(R) Montgomery-halving T ← O T ′ ← 2P for i = s to 0 do if k′′

i = 0 then

S ← T/2, T ← S, T ′ ← T ′ − S else S ← T ′/2, T ′ ← S, T ← T − S endif endfor return(T)

↓ ↓ add the two results

31 / 39

slide-65
SLIDE 65

Parallelization of the Montgomery ladder (Oliveira (2014))

Recode and split k =

ℓ−s

  • i=0

k′

i 2i

+

s

  • i=0

k′′

i 2−i

Montgomery-ladder R ← O R′ ← P for i = ℓ − s to 0 do if k′

i = 1 then

R ← R + R′ R′ ← 2 · R′ else R′ ← R + R′ R ← 2 · R endif endfor return(R) Montgomery-halving T ← O T ′ ← 2P precompute Si = 1

2i P for i = 1, . . . , s

for i = 0 to s do if k′′

i = 0 then

T ← T + Si else T ′ ← T ′ + Si endif endfor return(T)

↓ ↓ add the two results

32 / 39

slide-66
SLIDE 66

Parallelization with endomorphism

Curve endomorphism (curve and group): φ: E(Fq) → E(Fq) (x, y) → (φx(x, y), φy(x, y)) and there exists γφ ∈ [0, ord(P)] such shat φ(P) = γφ · P.

33 / 39

slide-67
SLIDE 67

Parallelization with endomorphism

Curve endomorphism (curve and group): φ: E(Fq) → E(Fq) (x, y) → (φx(x, y), φy(x, y)) and there exists γφ ∈ [0, ord(P)] such shat φ(P) = γφ · P. Four thread parallelization

Splitting k = k0 +γφ k1 with log2(k0), log2(k1) ∼ = ℓ/2 k0 =

ℓ/2−s

  • i=0

k′

0,i2i

  • +

s

  • i=0

k′′

0,i2−i

  • k1 =

ℓ/2−s

  • i=0

k′

1,i2i

  • +

s

  • i=0

k′′

1,i2−i

  • Double-and-add

Halve-and-add Double-and-add Halve-and-add

P0 Q0 P1 Q1 P0 + Q0 + φ(P1 + Q1)

33 / 39

slide-68
SLIDE 68

Timings: two threads (NIST B233 curve, Core i7 SB)

103 clock-cycles Robert Oliveira Doubling

  • Montg.

Halving

  • Montg.

Parallel

  • Montg.

Montgomery-ladder

  • 10%
  • 25%

50 100 150 200 700 750

103 clock-cycles

NAFw Double,halve-and-add and parallel

  • 23%

Double

  • and-add

Halve

  • and-add

Parallel

50 100 150 200

34 / 39

slide-69
SLIDE 69

Timings: with endomorphism (E(F22·127), Core i7 HW)

Oliveira et al. (2014) SAC and JCEN.

103 clock-cycles Doubling

  • Montg.

Halving

  • Montg.

Parallel

  • Montg.

(4 threads)

Montgomery-ladder

  • 35%

50 100

103 clock-cycles

NAFw Double,halve-and-add and parallel with GLV

  • 35%

Double

  • and-add

Halve

  • and-add

Parallel (2 threads)

50 100

35 / 39

slide-70
SLIDE 70

Conclusion

New instructions on recent Intel Cores provide significant speed-up:

◮ PCLMUL, ◮ PSHUFB, ◮ Halving.

New speed records:

◮ A scalar multiplication over E(F2m) requires 27 000 clock-cycles (Core

i7 Haswell).

◮ Better than scalar multiplication over E(Fp) (∼

= 100 000 clock-cycles).

For further parallelization the limitation might come from: recoding, thread management, final additions.

36 / 39

slide-71
SLIDE 71

Thank you!

37 / 39

slide-72
SLIDE 72

References

Parallel Montgomery (Oliveira): Thomaz Oliveira, Diego F. Aranha, Julio L´

  • pez Hernandez, Francisco Rodr´

ıguez-Henr´ ıquez. Fast Point Multiplication Algorithms for Binary Elliptic Curves with and without

  • Precomputation. Selected Areas in Cryptography 2014: 324-344.

Parallel Montgomery (Robert): C. Negre and J.-M. Robert. New Parallel Approaches for Scalar Multiplication in Elliptic Curve over Fields of Small Characteristic. IEEE TC, to be published. Lambda Projective coordinates: Thomaz Oliveira, Julio L´

  • pez, Diego
  • F. Aranha, Francisco Rodr´

ıguez-Henr´ ıquez. Two is the fastest prime: lambda coordinates for binary elliptic curves. J. Cryptographic Engineering 4(1): 3-17 (2014). Implemenation of the squaring: Diego F. Aranha, Julio L´

  • pez, Darrel
  • Hankerson. Efficient Software Implementation of Binary Field

Arithmetic Using Vector Instruction Sets. LATINCRYPT 2010: 144-161.

38 / 39

slide-73
SLIDE 73

Inversion

Little Fermat theorem in F2m gives a2m = a ⇒

  • a2m−1−12

= a−1 if a = 0.

39 / 39

slide-74
SLIDE 74

Inversion

Little Fermat theorem in F2m gives a2m = a ⇒

  • a2m−1−12

= a−1 if a = 0. Chaining: from a2u−1 and a2v−1 we get a2u+v−1:

  • a2u−12v

× a2v−1 = a2u+v−2v+2v−1 = a2u+v−1.

39 / 39

slide-75
SLIDE 75

Inversion

Little Fermat theorem in F2m gives a2m = a ⇒

  • a2m−1−12

= a−1 if a = 0. Chaining: from a2u−1 and a2v−1 we get a2u+v−1:

  • a2u−12v

× a2v−1 = a2u+v−2v+2v−1 = a2u+v−1. To invert a ∈ F2233 we choose an addition chain to get a2232−1 1 → 2 → 3 → 6 → 7 → 14 → 28 → 29 → 58 → 116 → 232

1 → a21−1 = a 2 = 1 + 1 → (a21−1)21 × a21−1 = a22−1 3 = 2 + 1 → (a22−1)21 × a21−1 = a23−1 6 = 3 + 3 →

  • 223−123

× a23−1 = a26−1 7 = 6 + 1 →

  • 226−121

× a21−1 = a27−1 . . . . . . 116 = 58 + 58 →

  • 2258−1258

× a258−1 = a2116−1 232 = 116 + 116 →

  • 22116−12116

× a2116−1 = a2232−1 and a−1 =

  • a2232−12

39 / 39