
Efficient Binary Field Arithmetic and Applications to Curve-based Cryptography

Diego F. Aranha

Department of Computer Science, University of Brasília

CHES 2012 Tutorial

Diego F. Aranha Efficient Binary Field Arithmetic

Part I: RELIC


Numbers

RELIC is an Efficient LIbrary for Cryptography (http://code.google.com/p/relic-toolkit):
  • Research framework
  • Licensed as free software (LGPL)
  • 11 source code releases
  • 78,000 lines of code
  • 1300 visitors from 74 countries
  • 1500 downloads

Introduction

Limitations of other libraries:
  • Restricted portability
  • Uninteresting licensing model
  • Emphasis on standards and commercial algorithms
Why a new cryptographic library?
  • Organization oriented for portability
  • Complete control of licensing model
  • Code sharing and reproducibility of results
  • Focus on research

Organization

Basic organization:
  • Meta-library
  • Compile-time configuration
  • Inspired by the GNU Multiple Precision Arithmetic Library (GMP)

[Figure: library layers, including the arithmetic backend and the protocols.]

Breakdown

Arithmetic backend:
  • Architecture-dependent
  • Rigid interface with upper layers
  • Generic modules available in C and with GMP support
  • 21 functions for multiple precision integer arithmetic, 26 functions for binary fields, 32 functions for prime fields
Why this organization?
  • It is currently possible to obtain competitive timings with the same library on an 8-bit processor with 4 KB of RAM and on an 8-core Intel desktop processor.

Breakdown

Binary field arithmetic:
  • Field size specified at compile time
  • 3 different strategies for squaring, 5 for multiplication, 2 for square root extraction, 2 for half-trace and 6 for inversion
  • Modular reduction by trinomials and pentanomials
Binary curve arithmetic:
  • Supersingular, Koblitz and ordinary (standardized or not)
  • Affine, projective and mixed coordinate systems
  • 4 different strategies for random point scalar multiplication, 6 for fixed point and 4 for multiple point
  • Symmetric pairings over genus-1 or genus-2 curves

Breakdown

Miscellaneous:
  • Support for words of 8, 16, 32 and 64 bits
  • Static, stack, automatic and dynamic memory allocators
  • Helper macros for testing and benchmarking
  • Support for debugging, profiling, tracing and multithreading
  • Abundant Doxygen documentation
  • Deactivation of modules and automatic elimination of algorithms to reduce code size
  • Standard PRNG with configurable seed source
  • Support for FreeBSD, Linux, Mac OS X, Windows
  • Management of configuration and build system with CMake
  • Open collaboration with academia and industry

Part II: Binary fields


Introduction

A finite field $\mathbb{F}_{p^m}$ consists of all polynomials with coefficients in $\mathbb{Z}_p$, for a prime $p$, modulo an irreducible degree-$m$ polynomial $f(z)$. The prime $p$ is the characteristic of the field and $m$ is the extension degree. A binary field $\mathbb{F}_{2^m}$ is the special case $p = 2$ and is formed by polynomials with binary coefficients.

Introduction

Example: Field $\mathbb{F}_{2^8}$
Irreducible polynomial: $f(z) = z^8 + z^4 + z^3 + z + 1$ = 1 0001 1011
Representation:
  $a(z) = z^7 + z^3 + 1$ = 1000 1001 = 0x89
  $b(z) = z^6 + z^5 + z^2$ = 0110 0100 = 0x64
Addition: $a(z) + b(z) = z^7 + z^6 + z^5 + z^3 + z^2 + 1$ = 1110 1101 = 0xED
Note: $a(z) + a(z) = 2 \cdot a(z) = 0, \ \forall a \in \mathbb{F}_{2^m}$

Introduction

Example: Field $\mathbb{F}_{2^8}$
Irreducible polynomial: $f(z) = z^8 + z^4 + z^3 + z + 1$ = 1 0001 1011
Representation:
  $a(z) = z^7 + z^3 + 1$ = 1000 1001 = 0x89
  $b(z) = z^6 + z^5 + z^2$ = 0110 0100 = 0x64
Multiplication: $a(z) \times b(z) = z^{13} + z^{12} + z^8 + z^6 + z^2 \bmod f(z) = z^7 + z^5 + z^4 + z^3 + 1$ = 1011 1001 = 0xB9
Multiplication by z: $z \times b(z) = z^7 + z^6 + z^3$ = 1100 1000 = 0xC8 = b ≪ 1
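The $\mathbb{F}_{2^8}$ example can be checked with a few lines of portable C. This is a minimal sketch, not RELIC code: the names `gf256_add`, `gf256_mulz` and `gf256_mul` are hypothetical, and the field polynomial is hard-coded as an assumption matching the example.

```c
#include <stdint.h>

/* f(z) = z^8 + z^4 + z^3 + z + 1 without its leading term: r(z) = 0x1B. */
#define GF256_R 0x1B

/* Addition: coefficients are added modulo 2, i.e. XORed. */
uint8_t gf256_add(uint8_t a, uint8_t b)
{
    return (uint8_t)(a ^ b);
}

/* Multiplication by z: shift left, then add r(z) if a z^8 term appeared. */
uint8_t gf256_mulz(uint8_t a)
{
    uint8_t carry = (uint8_t)(a >> 7);
    return (uint8_t)((a << 1) ^ (carry ? GF256_R : 0));
}

/* Shift-and-add multiplication, scanning b from its lowest bit. */
uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    uint8_t c = 0;
    while (b != 0) {
        if (b & 1)
            c = gf256_add(c, a);   /* accumulate a(z) * z^i */
        a = gf256_mulz(a);         /* a(z) <- a(z) * z mod f(z) */
        b >>= 1;
    }
    return c;
}
```

Calling `gf256_add(0x89, 0x64)`, `gf256_mulz(0x64)` and `gf256_mul(0x89, 0x64)` reproduces the three operations of the example.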


Introduction

Binary fields ($\mathbb{F}_{2^m}$) are omnipresent in Cryptography:
  • Efficient Curve-based Cryptography (ECC, PBC)
  • Post-quantum Cryptography
  • Block ciphers
Many algorithms/optimizations already described in the literature:
  • Is it possible to unify the fastest ones in a simple formulation?
  • Can such a formulation reflect the state-of-the-art and provide new ideas?

Objective

Contributions
  • Formulation of state-of-the-art binary field arithmetic using vector instructions
  • New strategy for the implementation of multiplication
  • Time-memory trade-offs to compensate for native multiplier
  • Experimental results

Arsenal

Intel Core architecture:
  • 128-bit Streaming SIMD Extensions instructions (65/45 nm)
  • Super shuffle engine introduced in 45 nm series
  • Carry-less multiplier introduced in Nehalem family
  • 256-bit Advanced Vector Extensions instructions (32 nm)

Relevant vector instructions:

Instruction      | Description               | Cost   | Mnemonic
MOVDQA           | Memory load/store         | 3/2    | ←
PSLLQ, PSRLQ     | 64-bit bitwise shifts     | 1      | ≪∤8, ≫∤8
PXOR, PAND, POR  | Bitwise XOR, AND, OR      | 1      | ⊕, ∧, ∨
PUNPCKLBW/HBW    | Byte interleaving         | 3      | interlo/hi
PSLLDQ, PSRLDQ   | 128-bit bytewise shift    | 2 (1)  | ≪8, ≫8
PSHUFB           | Byte shuffling            | 3 (1)  | shuffle, lookup
PALIGNR          | Memory alignment          | 2 (1)  | ⊳
PCLMULQDQ        | Carry-less multiplication | 10 (8) | ⊗

New SSSE3 instructions

PSHUFB instruction (_mm_shuffle_epi8):
Real power: we can implement any function in parallel.
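To make the "any function in parallel" remark concrete, the following sketch models the byte-shuffle semantics in portable C: each index byte selects one of 16 table bytes, or zero when its top bit is set. The name `pshufb_model` and the `POPCOUNT4` table are illustrative assumptions, not part of the slides. A 16-entry table thus evaluates an arbitrary 4-bit function on 16 inputs at once.

```c
#include <stdint.h>

/* Portable model of a 128-bit byte shuffle: out[i] is table[idx[i] & 0x0F],
 * or 0 when bit 7 of idx[i] is set. */
void pshufb_model(uint8_t out[16], const uint8_t table[16], const uint8_t idx[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = (idx[i] & 0x80) ? 0 : table[idx[i] & 0x0F];
}

/* Example table (assumption for illustration): number of set bits of a nibble. */
const uint8_t POPCOUNT4[16] = {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};
```

With the real instruction, the loop above is a single `_mm_shuffle_epi8(table, idx)` call.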


New SSSE3 instructions

Example: Bit manipulation


New SSSE3 instructions

PALIGNR instruction (_mm_alignr_epi8):

Binary field F2m

Irreducible polynomial: $f(z)$ (trinomial or pentanomial).
Polynomial basis: $a(z) \in \mathbb{F}_{2^m} = \sum_{i=0}^{m-1} a_i z^i$.
Software representation: vector of $n = \lceil m/64 \rceil$ words (even).
[Figure: graphical representation of a field element as a vector of words.]

Data types

#if WORD == 8
typedef uint8_t dig_t;
#elif WORD == 16
typedef uint16_t dig_t;
#elif WORD == 32
typedef uint32_t dig_t;
#elif WORD == 64
typedef uint64_t dig_t;
#endif
typedef __m128i vec_t;

Useful macros

#define LOAD     _mm_load_si128
#define STORE    _mm_store_si128
#define PSHUFB   _mm_shuffle_epi8
#define XOR      _mm_xor_si128
#define AND      _mm_and_si128
#define SHL      _mm_slli_epi64
#define SHR      _mm_srli_epi64
#define SHL8     _mm_slli_si128
#define SHR8     _mm_srli_si128
#define UNPACKLO _mm_unpacklo_epi8
#define UNPACKHI _mm_unpackhi_epi8
#define CLMUL    _mm_clmulepi64_si128

Proposed representation

To employ 4-bit granular arithmetic, convert to split form:
$a_L = \sum_{0 \le i < m,\ 0 \le i \bmod 8 \le 3} a_i z^i, \qquad a_H = \sum_{0 \le i < m,\ 4 \le i \bmod 8 \le 7} a_i z^{i-4}$
[Figure: vector $A_i$ split into $A_L$ and $A_H$.]

Proposed representation

Easy to convert to split form:
  $A_L = A_i \wedge$ 0x0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F
  $A_H = (A_i \wedge$ 0xF0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F0$) \gg 4$
Easy to convert back: $a(z) = a_H(z) z^4 + a_L(z)$.
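The conversion to and from split form can be sketched on a single 64-bit digit. The helper names `split64` and `join64` are hypothetical; a vector version would apply the same masks to 128-bit registers.

```c
#include <stdint.h>

#define LOW_NIBBLES 0x0F0F0F0F0F0F0F0FULL

/* Split one 64-bit digit: lo keeps bits with (i mod 8) in 0..3,
 * hi keeps bits with (i mod 8) in 4..7, moved down by 4 positions. */
void split64(uint64_t a, uint64_t *lo, uint64_t *hi)
{
    *lo = a & LOW_NIBBLES;
    *hi = (a >> 4) & LOW_NIBBLES;
}

/* Convert back: a(z) = aH(z) z^4 + aL(z). */
uint64_t join64(uint64_t lo, uint64_t hi)
{
    return (hi << 4) ^ lo;
}
```

Shifting before masking gives the same result as masking with 0xF0...F0 and then shifting, and needs only one mask constant.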

Addition/subtraction in F2m

$c(z) = a(z) + b(z) = \sum_{i=0}^{m-1} (a_i \oplus b_i) z^i$

[Figure: the digits of A and B are XORed pairwise to produce the digits of C.]

Guidelines:
  • Use XOR instruction with largest operand size.
  • Verify impact of higher throughput.

Addition/subtraction in F2m

void fb_addn_low(dig_t *c, dig_t *a, dig_t *b) {
    int i;
    /* Process two 64-bit digits (one 128-bit vector) per iteration. */
    for (i = 0; i < FB_DIGS; i += 2, c += 2, a += 2, b += 2) {
        vec_t t0 = LOAD((vec_t *)a);
        vec_t t1 = LOAD((vec_t *)b);
        t0 = XOR(t0, t1);
        STORE((vec_t *)c, t0);
    }
}

Squaring in F2m

$a(z) = \sum_{i=0}^{m-1} a_i z^i = a_{m-1} z^{m-1} + \cdots + a_2 z^2 + a_1 z + a_0$
$a(z)^2 = \sum_{i=0}^{m-1} a_i z^{2i} = a_{m-1} z^{2m-2} + \cdots + a_2 z^4 + a_1 z^2 + a_0$
Example:
  $a(z) = (a_{m-1}, a_{m-2}, \ldots, a_2, a_1, a_0)$
  $a(z)^2 = (a_{m-1}, 0, a_{m-2}, 0, \ldots, 0, a_2, 0, a_1, 0, a_0)$

Squaring in F2m

Since squaring is a linear operation: $a(z)^2 = a_H(z)^2 \cdot z^8 + a_L(z)^2$.
We can compute $a_L(z)^2$ and $a_H(z)^2$ with a lookup table.
For $u = (u_3, u_2, u_1, u_0)$, use $table(u) = (0, u_3, 0, u_2, 0, u_1, 0, u_0)$.
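The table-based squaring can be sketched without vector instructions by expanding one byte at a time. This is an illustrative scalar version under the assumption of a single 64-bit input digit; `SQR4` and `sqr64` are hypothetical names, and the result is left unreduced.

```c
#include <stdint.h>

/* table(u) = (0,u3,0,u2,0,u1,0,u0): the square of every 4-bit polynomial. */
static const uint8_t SQR4[16] = {
    0x00, 0x01, 0x04, 0x05, 0x10, 0x11, 0x14, 0x15,
    0x40, 0x41, 0x44, 0x45, 0x50, 0x51, 0x54, 0x55
};

/* Unreduced squaring of a 64-bit polynomial into 128 bits (hi:lo). */
void sqr64(uint64_t a, uint64_t *hi, uint64_t *lo)
{
    uint64_t r[2] = {0, 0};
    for (int i = 0; i < 8; i++) {
        uint8_t byte = (uint8_t)(a >> (8 * i));
        /* aL^2 fills the low byte, aH^2 * z^8 the byte above it. */
        uint64_t sq = (uint64_t)SQR4[byte & 0x0F]
                    | ((uint64_t)SQR4[byte >> 4] << 8);
        r[i / 4] |= sq << (16 * (i % 4));
    }
    *lo = r[0];
    *hi = r[1];
}
```

Each input byte expands into 16 output bits, which is the role played by the byte interleaving step in the vector formulation.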

Proposed squaring in F2m

[Figure: $A_i$ is split into $A_L$ and $A_H$; both are expanded by parallel lookups in the table of squares (00000000, 00000001, 00000100, 00000101, 00010000, 00010001, ..., 01010101); interlo and interhi then produce $T_{2i}$ and $T_{2i+1}$.]

$a(z)^2 = a_L(z)^2 + a_H(z)^2 \cdot z^8$.

Squaring in F2m

Algorithm 1 Proposed optimization for squaring in $\mathbb{F}_{2^m}$.

Input: a(z) = a[0..n − 1].
Output: c(z) = c[0..n − 1] = a(z)^2 mod f(z).

⋄ Store in table the squares u(z)^2 of all 4-bit polynomials u(z).
table ← (0x5554515045444140, 0x1514111005040100)
mask ← (0x0F0F0F0F0F0F0F0F, 0x0F0F0F0F0F0F0F0F)
for i ← 0 to n/2 − 1 do
    a0 ← load(a[2i])
    ⋄ Convert to split representation.
    aL ← a0 ∧ mask
    aH ← a0 ≫∤8 4, aH ← aH ∧ mask
    ⋄ Perform parallel table lookups.
    aL ← lookup(table, aL), aH ← lookup(table, aH)
    ⋄ Simulate addition with 8-bit offset.
    t[2i] ← interlo(aL, aH)
    t[2i + 1] ← interhi(aL, aH)
end for
return c = t mod f(z)

Squaring in F2m

void fb_sqrm_low(dig_t *c, dig_t *a) {
    vec_t t0, t1, t2, t3, mask;
    /* Table with the squares of all 4-bit polynomials, one byte per entry. */
    t0 = _mm_set_epi32(0x55545150, 0x45444140, 0x15141110, 0x05040100);
    mask = _mm_set_epi32(0x0F0F0F0F, 0x0F0F0F0F, 0x0F0F0F0F, 0x0F0F0F0F);
    for (int i = 0; i < FB_DIGS; i += 2) {
        t1 = LOAD((vec_t *)(a + i));
        /* Low nibbles: mask, then parallel table lookup. */
        t2 = AND(t1, mask);
        t2 = PSHUFB(t0, t2);
        /* High nibbles: shift, mask, then parallel table lookup. */
        t3 = SHR(t1, 4);
        t3 = AND(t3, mask);
        t3 = PSHUFB(t0, t3);
        /* Interleave expanded bytes to rebuild the double-length result. */
        t1 = UNPACKLO(t2, t3);
        t2 = UNPACKHI(t2, t3);
        STORE((vec_t *)(c + 2*i), t1);
        STORE((vec_t *)(c + 2*(i + 1)), t2);
    }
    REDUCE(c, c);
}

Square root extraction in F2m

Algorithm by Fong et al.:
$\sqrt{a} = a^{2^{m-1}} = \left(\sum_{i=0}^{m-1} a_i z^i\right)^{2^{m-1}} = \sum_{i=0}^{m-1} a_i \left(z^{2^{m-1}}\right)^i = \sum_{i\ \mathrm{even}} a_i z^{i/2} + \sqrt{z} \sum_{i\ \mathrm{odd}} a_i z^{(i-1)/2} = a_{even} + \sqrt{z} \cdot a_{odd}$

Since square-root is also a linear operation:
$\sqrt{a(z)} = \sqrt{a_H(z) z^4 + a_L(z)} = \sqrt{a_H(z)}\, z^2 + \sqrt{a_L(z)} = \sqrt{z} \cdot (a_{L_{odd}}(z) + a_{H_{odd}}(z) z^2) + a_{L_{even}}(z) + a_{H_{even}}(z) z^2$

Note: Multiplication by $\sqrt{z}$ ideally requires shifted additions only. If not possible, precompute product by $\sqrt{z}$.

Proposed square root in F2m

[Figure: $A_i$ is split into $A_L$ and $A_H$; parallel lookups in two tables (table: 00000000, 00000001, 00110011, ...; table · z²: 00000000, 00000100, 11001100, ...) followed by a shuffle separate the even and odd parts $A_{even}$ and $A_{odd}$.]

$\sqrt{a(z)} = \sqrt{z} \cdot (a_{L_{odd}}(z) + a_{H_{odd}}(z) z^2) + a_{L_{even}}(z) + a_{H_{even}}(z) z^2$

Multiplication in F2m

Three strategies:
  1. López-Dahab comb method
  2. Shuffle-based multiplication
  3. Native multiplication

López-Dahab multiplication in $\mathbb{F}_{2^m}$

We can compute u · b(z) using shifts and additions. If a(z) is divided into 4-bit polynomials u, the product a(z) · b(z) is obtained by accumulating the shifted products u · b(z).
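The windowing idea can be illustrated on a single pair of 64-bit digits. The sketch below is a simplified left-to-right variant, not the multi-digit comb of the slides: it precomputes u(z) · b(z) for every 4-bit u and then scans a(z) four bits at a time. The function name `clmul64_w4` is hypothetical, and the table lookups depend on the operand, so this version is not meant to be constant-time.

```c
#include <stdint.h>

/* Carry-less 64x64 -> 128-bit polynomial product using a 4-bit window. */
void clmul64_w4(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t tlo[16], thi[16];

    /* T(u) = u(z) * b(z) for all u(z) of degree < 4 (up to 67 bits). */
    tlo[0] = 0;
    thi[0] = 0;
    for (int u = 1; u < 16; u++) {
        if (u & 1) {                 /* T(u) = T(u - 1) + b(z) */
            tlo[u] = tlo[u - 1] ^ b;
            thi[u] = thi[u - 1];
        } else {                     /* T(u) = T(u / 2) * z */
            tlo[u] = tlo[u / 2] << 1;
            thi[u] = (thi[u / 2] << 1) | (tlo[u / 2] >> 63);
        }
    }

    uint64_t h = 0, l = 0;
    for (int k = 60; k >= 0; k -= 4) {
        /* Multiply the accumulator by z^4, then add T(u) for this window. */
        h = (h << 4) | (l >> 60);
        l <<= 4;
        unsigned u = (unsigned)(a >> k) & 0xF;
        l ^= tlo[u];
        h ^= thi[u];
    }
    *hi = h;
    *lo = l;
}
```

The comb organization reorders the same work so that one shift of the accumulator is shared by all digits of a(z).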

López-Dahab multiplication in $\mathbb{F}_{2^m}$

If the multiplier is represented in split form:
$a(z) \cdot b(z) = b(z) \cdot (a_H(z) z^4 + a_L(z)) = b(z) z^4 a_H(z) + b(z) a_L(z)$
This is a well-known technique for removing expensive 4-bit shifts!
Note: The core operation is accumulating u × dense b(z).

López-Dahab multiplication in $\mathbb{F}_{2^m}$

Algorithm 2 LD multiplication implemented with n 128-bit registers.

Input: a(z) = a[0..n − 1], b(z) = b[0..n − 1].
Output: c(z) = c[0..n − 1].
Note: m_i denotes the vector of n/2 128-bit registers (r_{i−1+n/2}, ..., r_i).

Compute T0(u) = u(z) · b(z), T1(u) = u(z) · (b(z) z^4) for all u(z) of degree < 4.
(r_{n−1}, ..., r_0) ← 0
for k ← 56 downto 0 by 8 do
    for j ← 1 to n − 1 by 2 do
        Let u = (u3, u2, u1, u0), where u_t is bit (k + t) of a[j].
        Let v = (v3, v2, v1, v0), where v_t is bit (k + t + 4) of a[j].
        m_{(j−1)/2} ← m_{(j−1)/2} ⊕ T0(u), m_{(j−1)/2} ← m_{(j−1)/2} ⊕ T1(v)
    end for
    (r_{n−1}, ..., r_0) ← (r_{n−1}, ..., r_0) ⊳ 8
end for
for k ← 56 downto 0 by 8 do
    for j ← 0 to n − 2 by 2 do
        Let u = (u3, u2, u1, u0), where u_t is bit (k + t) of a[j].
        Let v = (v3, v2, v1, v0), where v_t is bit (k + t + 4) of a[j].
        m_{j/2} ← m_{j/2} ⊕ T0(u), m_{j/2} ← m_{j/2} ⊕ T1(v)
    end for
    if k > 0 then (r_{n−1}, ..., r_0) ← (r_{n−1}, ..., r_0) ⊳ 8
end for
return c = (r_{n−1}, ..., r_0) mod f(z)

Shuffle-based multiplication in F2m

If both multiplicand and multiplier are represented in split form:
$a(z) \cdot b(z) = (b_H(z) z^4 + b_L(z)) \cdot (a_H(z) z^4 + a_L(z))$
Using the Karatsuba formula, we can reduce it to 3 multiplications:
$a(z) \cdot b(z) = a_H b_H z^8 + [(a_H + a_L)(b_H + b_L) + a_H b_H + a_L b_L] z^4 + a_L b_L$
Note: The core operation is accumulating u × sparse $b_{L,H}(z)$.

[Figure: a 4-bit polynomial multiplied in parallel by all digits of B.]

Shuffle-based multiplication in F2m

Algorithm 3 Multiplication in split form.

Input: Operands a, b in split representation.
Output: Result a · b stored in registers (r_{n−1}, ..., r_0).

⋄ table stores all products of 4-bit × 4-bit polynomials.
(r_{n−1}, ..., r_0) ← 0
for k ← 56 downto 0 by 8 do
    for j ← 1 to n − 1 by 2 do
        Let u = (u3, u2, u1, u0), where u_t is bit (k + t) of a[j].
        for i ← 0 to n/2 − 1 do r_i ← r_i ⊕ shuffle(table[u], b[i])
    end for
    (r_{n−1}, ..., r_0) ← (r_{n−1}, ..., r_0) ⊳ 8
end for
for k ← 56 downto 0 by 8 do
    for j ← 0 to n − 2 by 2 do
        Let u = (u3, u2, u1, u0), where u_t is bit (k + t) of a[j].
        for i ← 0 to n/2 − 1 do r_i ← r_i ⊕ shuffle(table[u], b[i])
    end for
    if k > 0 then (r_{n−1}, ..., r_0) ← (r_{n−1}, ..., r_0) ⊳ 8
end for

Native multiplication

Two organizations:
  1. 128-bit granularity:
     • Reduce number of required registers
     • Use Karatsuba for each 128 × 128-bit multiplication
     • Use maximum number of Karatsuba levels for n/2 digits
  2. 64-bit granularity:
     • Interleave storage to reduce additions and number of registers
     • Use the formula with the lower number of multiplications

Native multiplication

Algorithm 4 Proposed implementation for multiplication in $\mathbb{F}_{2^{283}}$.

Input: a(z) = a[0..4], b(z) = b[0..4].
Output: c(z) = c[0..4] = a(z) · b(z).
Note: Pairs a_i, b_i, c_i, m_i of 64-bit words represent vector registers.

for i ← 0 to 4 do c_i ← (a[i], b[i])
c5 ← c0 ⊕ c1, c6 ← c0 ⊕ c2, c7 ← c2 ⊕ c4, c8 ← c3 ⊕ c4
c9 ← c3 ⊕ c6, c10 ← c1 ⊕ c7, c11 ← c5 ⊕ c8, c12 ← c2 ⊕ c11
for i ← 0 to 12 do m_i ← c_i[0] ⊗ c_i[1]
c0 ← m0, c8 ← m4
c1 ← c0 ⊕ m1, c2 ← c1 ⊕ m6
c1 ← c1 ⊕ m5, c2 ← c2 ⊕ m2
c7 ← c8 ⊕ m3, c6 ← c7 ⊕ m7
c7 ← c7 ⊕ m8, c6 ← c6 ⊕ m2
c5 ← m11 ⊕ m12, c3 ← c5 ⊕ m9
c3 ← c3 ⊕ c0 ⊕ c10
c4 ← c1 ⊕ c7 ⊕ m9 ⊕ m10 ⊕ m12
c5 ← c5 ⊕ c2 ⊕ c8 ⊕ m10
c9 ← c7 ≪8 64
(c7, c5, c3, c1) ← (c7, c5, c3, c1) ⊳ 8
c0 ← c0 ⊕ c1, c1 ← c2 ⊕ c3, c2 ← c4 ⊕ c5
c3 ← c6 ⊕ c7, c4 ← c8 ⊕ c9
return c = (c4, c3, c2, c1, c0) mod f(z)

Comparison

López-Dahab multiplication:
  • Explores highest-granularity XOR operation
  • Consumes memory space proportional to field size
Shuffle-based multiplication:
  • Relies on sparser core operation
  • Consumes constant memory space (apart from Karatsuba)
  • Depends on constants stored in memory
Native multiplication:
  • Faster and with constant memory consumption.
  • No widespread support.

Modular reduction

We need to reduce squaring/multiplication results in $\mathbb{F}_{2^m}$ modulo $f(z) = z^m + r(z)$. We can process multiple bits at a time based on the observation:
$c(z) = c_{2m-2} z^{2m-2} + \cdots + c_m z^m + c_{m-1} z^{m-1} + \cdots + c_1 z + c_0 \equiv (c_{2m-2} z^{m-2} + \cdots + c_m)\, r(z) + c_{m-1} z^{m-1} + \cdots + c_1 z + c_0 \pmod{f(z)}$
Important: We need to multiply by $r(z)$.
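The folding step can be tried on a small field. The sketch below assumes $f(z) = z^8 + z^4 + z^3 + z + 1$, so $r(z)$ = 0x1B, and reduces an unreduced product of degree at most 14; `clmul_small` and `reduce_gf256` are hypothetical helper names.

```c
#include <stdint.h>

/* Carry-less product of two small polynomials (bit i = coefficient of z^i). */
uint32_t clmul_small(uint32_t a, uint32_t b)
{
    uint32_t c = 0;
    for (; b != 0; b >>= 1, a <<= 1)
        if (b & 1)
            c ^= a;
    return c;
}

/* Reduce c(z) of degree <= 14 modulo f(z) = z^8 + r(z), r(z) = 0x1B. */
uint8_t reduce_gf256(uint16_t c)
{
    uint32_t t = c;
    /* Fold the part of degree >= 8: hi(z) * z^8 = hi(z) * r(z) mod f(z).
     * Each pass lowers the degree of the high part by 4, so it terminates. */
    while (t >> 8)
        t = (t & 0xFF) ^ clmul_small(t >> 8, 0x1B);
    return (uint8_t)t;
}
```

A second pass is needed here because $r(z)$ has degree 4; sparse $f(z)$ with low-degree $r(z)$ keeps the number of passes small.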

Modular reduction

Requires heavy shifting, so split representation does not help.
Guidelines:
  • If f(z) is a trinomial, implement with vector digits
  • If f(z) is a pentanomial, process pairs of digits in parallel or 64-bit mode
  • Write f(z) in a format that minimizes shifting: $f(z) = z^{283} + z^{12} + z^7 + z^5 + 1 = z^{283} + (z^7 + 1)(z^5 + 1)$
  • Accumulate writes into registers before writing to memory
  • Reduce squaring/multiplication results in registers

Modular reduction (64-bit mode)

Algorithm 5 Fast modular reduction by $f(z) = z^{1223} + z^{255} + 1$.

Input: c(z) = c[0..2n − 1].
Output: c(z) mod f(z) = c[0..n − 1].

for i ← 2n − 1 downto n do
    t ← c[i]
    c[i − 15] ← c[i − 15] ⊕ (t ≫ 8)
    c[i − 16] ← c[i − 16] ⊕ (t ≪ 56)
    c[i − 19] ← c[i − 19] ⊕ (t ≫ 7)
    c[i − 20] ← c[i − 20] ⊕ (t ≪ 57)
end for
t ← c[19] ≫ 7, c[0] ← c[0] ⊕ t, t ← t ≪ 7
c[3] ← c[3] ⊕ (t ≪ 56)
c[4] ← c[4] ⊕ (t ≫ 8)
c[19] ← (c[19] ⊕ t) ∧ 0x7F
return c
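A direct C transcription of the 64-bit reduction could look as follows. It is a sketch under the assumption of 64-bit digits with n = 20; the name `reduce1223` and the macro `DIGS_1223` are hypothetical. Two details differ from the pseudocode without changing the result: the processed high digits are cleared, and the final step shifts by 63 and 1 instead of shifting t back by 7 first.

```c
#include <stdint.h>

#define DIGS_1223 20   /* n = ceil(1223 / 64) digits */

/* In-place reduction of c[0..2n-1] modulo z^1223 + z^255 + 1. */
void reduce1223(uint64_t c[2 * DIGS_1223])
{
    for (int i = 2 * DIGS_1223 - 1; i >= DIGS_1223; i--) {
        uint64_t t = c[i];
        /* z^(64i) = z^(64i - 968) + z^(64i - 1223), with
         * 968 = 15 * 64 + 8 and 1223 = 19 * 64 + 7. */
        c[i - 15] ^= t >> 8;
        c[i - 16] ^= t << 56;
        c[i - 19] ^= t >> 7;
        c[i - 20] ^= t << 57;
        c[i] = 0;
    }
    /* Fold the bits of c[19] at positions >= 1223. */
    uint64_t t = c[19] >> 7;
    c[0] ^= t;
    c[3] ^= t << 63;   /* z^255 lies at bit 63 of c[3] */
    c[4] ^= t >> 1;
    c[19] &= 0x7F;
}
```

For instance, an input holding only $z^{2444}$ reduces to $z^{1221} + z^{508} + z^{253}$.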

Modular reduction (128-bit mode)

Algorithm 6 Proposed fast reduction by $f(z) = z^{1223} + z^{255} + 1$.

Input: t(z) = t[0..n − 1] (vector of 128-bit elements).
Output: c(z) mod f(z) = c[0..n − 1].
Note: The accumulate function R(r3, r2, r1, r0, t) executes:
    s ← t ≫∤8 7, r3 ← t ≪∤8 57
    r3 ← r3 ⊕ (s ≪8 64)
    r2 ← r2 ⊕ (s ≫8 64)
    r1 ← r1 ⊕ (t ≪8 56)
    r0 ← r0 ⊕ (t ≫8 72)

r0, r1, r2, r3 ← 0
for i ← 19 downto 15 by 4 do
    R(r3, r2, r1, r0, t[i]), t[i − 7] ← t[i − 7] ⊕ r0
    R(r0, r3, r2, r1, t[i − 1]), t[i − 8] ← t[i − 8] ⊕ r1
    R(r1, r0, r3, r2, t[i − 2]), t[i − 9] ← t[i − 9] ⊕ r2
    R(r2, r1, r0, r3, t[i − 3]), t[i − 10] ← t[i − 10] ⊕ r3
end for
R(r3, r2, r1, r0, t[11]), t[4] ← t[4] ⊕ r0
R(r0, r3, r2, r1, t[10]), t[3] ← t[3] ⊕ r1
t[2] ← t[2] ⊕ r2, t[1] ← t[1] ⊕ r3, t[0] ← t[0] ⊕ r0
r0 ← m[9] ≫8 64, r0 ← r0 ≫∤8 7, t[0] ← t[0] ⊕ r0
r1 ← r0 ≪8 64, r1 ← r1 ≪∤8 63, t[1] ← t[1] ⊕ r1
r1 ← r0 ≫∤8 1, t[2] ← t[2] ⊕ r1
for i ← 0 to 9 do c[2i] ← store(t[i])
c[19] ← c[19] ∧ 0x7F
return c

Modular reduction (128-bit mode)

Algorithm 7 Proposed fast reduction by $f(z) = z^{283} + (z^7 + 1)(z^5 + 1)$.

Input: Double-precision polynomial stored into 128-bit registers c = (c4, c3, c2, c1, c0).
Output: Field element c mod f(z) stored into 128-bit registers (c2, c1, c0).

t2 ← c2, t0 ← (c3, c2) ⊲ 64, t1 ← (c4, c3) ⊲ 64
c4 ← c4 ≫∤8 27, c3 ← c3 ≫∤8 27, c3 ← c3 ⊕ (t1 ≪∤8 37)
c2 ← c2 ≫∤8 27, c2 ← c2 ⊕ (t0 ≪∤8 37)
t0 ← (c4, c3) ⊲ 120, c4 ← c4 ⊕ (t0 ≫∤8 1)
t1 ← (c3, c2) ⊲ 64, c3 ← c3 ⊕ (c3 ≪∤8 7) ⊕ (t1 ≫∤8 57)
t0 ← c2 ≪8 64, c2 ← c2 ⊕ (c2 ≪∤8 7) ⊕ (t0 ≫∤8 57)
t0 ← (c4, c3) ⊲ 120, c4 ← c4 ⊕ (t0 ≫∤8 3)
t1 ← (c3, c2) ⊲ 64, c3 ← c3 ⊕ (c3 ≪∤8 5) ⊕ (t1 ≫∤8 59)
t0 ← c2 ≪8 64, c2 ← c2 ⊕ (c2 ≪∤8 5) ⊕ (t0 ≫∤8 59)
c0 ← c0 ⊕ c2, c1 ← c1 ⊕ c3, c2 ← t2 ⊕ c4
t0 ← c4 ≫∤8 27
t1 ← t0 ⊕ (t0 ≪∤8 5)
t0 ← t1 ⊕ (t1 ≪∤8 7)
c0 ← c0 ⊕ t0, c2 ← c2 ∧ (0x0000000000000000, 0x0000000007FFFFFF)
return c = (c2, c1, c0)

Half-trace

We want to compute $H(c) = \sum_{i=0}^{(m-1)/2} c^{2^{2i}}$.
Important: For even i, $H(z^i) = H(z^{i/2}) + z^{i/2} + Tr(z^i)$.

Algorithm 8 Solve $x^2 + x = c$

Input: $c = \sum_{i=0}^{m-1} c_i z^i \in \mathbb{F}_{2^m}$ where m is odd and Tr(c) = 0
Output: solution s of $x^2 + x = c$.

Compute $H(l_0 z^{8i+1} + l_1 z^{8i+3} + l_2 z^{8i+5} + l_3 z^{8i+7})$ for $0 \le i \le \lfloor (m-3)/8 \rfloor$ and $l_j \in \mathbb{F}_2$.
s ← 0
for i = (m − 1)/2 downto 1 do
    if c_{2i} = 1 then
        c ← c + z^i, s ← s + z^i
    end if
end for
return $s + \sum_{i \in I} c_{8i+1} H(z^{8i+1}) + c_{8i+3} H(z^{8i+3}) + c_{8i+5} H(z^{8i+5}) + c_{8i+7} H(z^{8i+7})$

Multi-squaring [Bos et al.]

Precompute a table T of $16 \lceil m/4 \rceil$ field elements such that
$T[j, i_0 + 2i_1 + 4i_2 + 8i_3] = (i_0 z^{4j} + i_1 z^{4j+1} + i_2 z^{4j+2} + i_3 z^{4j+3})^{2^k}$
Then we can compute $a^{2^k}$ as:
$a^{2^k} = \sum_{j=0}^{\lceil m/4 \rceil} T[j, \lfloor a/2^{4j} \rfloor \bmod 2^4]$.

Inversion

Guidelines:
  • If memory is not available, implement the Extended Euclidean Algorithm in 64-bit mode.
  • If memory is available, implement Itoh-Tsujii with precomputed $2^i$ powers: $a^{-1} = (a^{2^{m-1}-1})^2$

Inversion

Algorithm 9 Inversion in $\mathbb{F}_{2^m}$ using EEA.

Input: $a = \sum_{i=0}^{m-1} a_i z^i \in \mathbb{F}_{2^m}$, a ≠ 0
Output: $a^{-1} \bmod f$.

u ← a, v ← f
g1 ← 1, g2 ← 0
while u ≠ 1 do
    j ← deg(u) − deg(v)
    if j < 0 then
        u ↔ v, g1 ↔ g2, j ← −j
    end if
    u ← u + z^j v
    g1 ← g1 + z^j g2
end while
return g1
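The Euclidean inversion loop is short enough to try on $\mathbb{F}_{2^8}$. The sketch below assumes $f(z) = z^8 + z^4 + z^3 + z + 1$ (0x11B) and polynomials packed into machine words; `poly_deg` and `gf256_inv` are hypothetical names, and the data-dependent loop makes this version unsuitable where constant-time execution is required.

```c
#include <assert.h>
#include <stdint.h>

/* Degree of a polynomial stored in a word (-1 for the zero polynomial). */
int poly_deg(uint32_t p)
{
    int d = -1;
    while (p != 0) {
        p >>= 1;
        d++;
    }
    return d;
}

/* Inverse of a nonzero a in F_{2^8} with f(z) = 0x11B. */
uint8_t gf256_inv(uint8_t a)
{
    assert(a != 0);                  /* zero has no inverse */
    uint32_t u = a, v = 0x11B, g1 = 1, g2 = 0;
    while (u != 1) {
        int j = poly_deg(u) - poly_deg(v);
        if (j < 0) {                 /* keep deg(u) >= deg(v) */
            uint32_t t;
            t = u;  u = v;   v = t;
            t = g1; g1 = g2; g2 = t;
            j = -j;
        }
        u ^= v << j;                 /* u  <- u  + z^j * v  */
        g1 ^= g2 << j;               /* g1 <- g1 + z^j * g2 */
    }
    return (uint8_t)g1;
}
```

For example, `gf256_inv(0x02)` yields 0x8D, the inverse of z in this field.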

Implementation

Material:
  • GCC 4.1.2 (fastest SSE intrinsics, GCC 4.5.0 is good again)
  • RELIC cryptographic library (http://code.google.com/p/relic-toolkit/)
  • Intel Core 2 65 nm and 45 nm processors
  • OpenMP constructs for parallelism
Parameters:
  • 16 different binary fields ranging from 113 to 1223 bits
  • Choices of square-root friendly and standard f(z)
Comparison:
  • Only vector implementations (mpFq, Beuchat et al. 2009)
  • Only in entry-level Intel Core 2 65 nm

Experimental results – Squaring

[Plot: cycles on Intel Core 2 65 nm versus field size (200 to 1200 bits), comparing related work with this work.]

Experimental results – Square-root with friendly f (z)

[Plot: cycles on Intel Core 2 65 nm versus field size (200 to 1200 bits), comparing related work with this work.]

Experimental results – Square-root with standard f (z)

[Plot: cycles on Intel Core 2 65 nm versus field size (200 to 1200 bits), comparing related work with this work.]

Experimental results – López-Dahab multiplication

[Plot: cycles on Intel Core 2 65 nm versus field size (200 to 1200 bits), comparing related work with this work (López-Dahab).]

Experimental results – Shuffle-based multiplication

[Plot: cycles on Intel Core 2 45 nm versus field size (200 to 1200 bits), comparing this work (López-Dahab) with this work (Shuffling).]

Note: The native multiplier on newer machines is twice as fast as LD.

Observations

Squaring and square-root are:
  • Efficiently formulated with M/S ratio up to 34
  • Faster when shuffling throughput is higher
  • Heavily dependent on the choice of f(z)
Shuffle-based multiplication:
  • Has a bottleneck with constants stored in memory
  • Requires faster table addressing scheme
  • Is only 50%-90% slower than López-Dahab!
Other operations:
  • Restore the ratio to native multiplication (H ≈ M, I ≈ 25M).

Part III: Applications


Introduction

Elliptic Curve Cryptography (ECC):
  • Underlying problem harder than integer factoring (RSA)
  • Same security level with smaller parameters
  • Efficiency in storage and execution time
Pairing-Based Cryptography (PBC):
  • Initially destructive
  • Allows innovative protocols
  • Makes curve-based cryptography more flexible

Introduction

Point multiplication is the most expensive operation in Elliptic Curve Cryptography. Pairing computation is the most expensive operation in Pairing-Based Cryptography. Parallelism is being increasingly introduced in modern architectures.


Objective

Explore two types of parallelism in software to reduce computation latency:
  • Vector instructions
  • Multiprocessing
Applications: desktop-class computers, real-time services, embedded devices.
Contributions
  • State-of-the-art timings for ECC and PBC
  • Parallelization of Miller's Algorithm with load balancing
  • Experimental results

Elliptic curves

[Figure: Elliptic curve arithmetic. (a) Point addition R = P + Q; (b) point doubling R = 2P. Picture: Hankerson et al. 2003]

Binary elliptic curves

A binary elliptic curve is the set of solutions $(x, y) \in \mathbb{F}_{2^m} \times \mathbb{F}_{2^m}$ satisfying the equation $y^2 + xy = x^3 + ax^2 + b$, where $a, b \in \mathbb{F}_{2^m}$ with $b \ne 0$, together with a point at infinity ∞. When $a \in \{0, 1\}$ and $b = 1$, the curve is called a Koblitz curve.

Elliptic curves

The set of points $\{(x, y) \in E(\mathbb{F}_{2^m})\} \cup \{\infty\}$ under the addition operation + (chord-and-tangent rule) forms an additive group.

Given an elliptic point P and an integer k, the operation kP, called scalar multiplication, is defined by $kP = P + P + \ldots + P$ (k times).

This is the fundamental operation employed by protocols based on elliptic curves.
Important: Underlying problem: ECDLP: Recover k from P, kP.

Elliptic curve arithmetic

Algorithm 10 Double-and-add scalar multiplication

Input: $k = \sum_{i=0}^{t-1} k_i 2^i$, $P \in E(\mathbb{F}_{2^m})$ of order r.
Output: kP.

Q ← ∞
for i = t − 1 downto 0 do
    Q ← 2Q
    if k_i = 1 then
        Q ← Q + P
    end if
end for
return Q

Elliptic curve arithmetic

Algorithm 11 Left-to-right wNAF scalar multiplication

Input: w, k ∈ Z, $P \in E(\mathbb{F}_{2^m})$ of order r.
Output: kP.

Obtain the representation $NAF_w(k) = \sum_{i=0}^{t-1} k_i 2^i$
Compute $P_i = iP$ for $i \in \{1, 3, \ldots, 2^{w-1} - 1\}$
Q ← ∞
for i = t − 1 downto 0 do
    Q ← 2Q
    if k_i > 0 then
        Q ← Q + P_{k_i}
    else if k_i < 0 then
        Q ← Q − P_{−k_i}
    end if
end for
return Q
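The recoding step that produces the width-w NAF digits can be sketched as follows, using the usual signed-residue rule: an odd k contributes the digit k mods 2^w, which is then subtracted so the next w − 1 digits are zero. The names `wnaf_recode` and `wnaf_value` are hypothetical, and the bounds on w and k are assumptions chosen so that digits fit in `int8_t` and no overflow occurs.

```c
#include <assert.h>
#include <stdint.h>

/* Writes the width-w NAF digits of k, least significant first, and returns
 * how many were written. The digits array must hold at least 65 entries. */
int wnaf_recode(int8_t *digits, uint64_t k, int w)
{
    assert(w >= 2 && w <= 8);
    assert(k < (1ULL << 63));
    int len = 0;
    while (k != 0) {
        int d = 0;
        if (k & 1) {
            d = (int)(k & ((1u << w) - 1));   /* k mod 2^w */
            if (d >= (1 << (w - 1)))
                d -= (1 << w);                /* signed residue "mods" */
            if (d < 0)
                k += (uint64_t)(-d);
            else
                k -= (uint64_t)d;
        }
        digits[len++] = (int8_t)d;
        k >>= 1;
    }
    return len;
}

/* Evaluates sum(digits[i] * 2^i), to check a recoding. */
int64_t wnaf_value(const int8_t *digits, int len)
{
    int64_t v = 0;
    for (int i = len - 1; i >= 0; i--)
        v = 2 * v + digits[i];
    return v;
}
```

Every nonzero digit is odd and smaller than $2^{w-1}$ in absolute value, so only the odd multiples of P need to be precomputed.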

Elliptic curve arithmetic

Algorithm 12 Left-to-right multiple scalar multiplication

Input: w, k ∈ Z, l ∈ Z, $P, Q \in E(\mathbb{F}_{2^m})$ of order r.
Output: kP + lQ.

Obtain the representations $NAF_w(k) = \sum_{i=0}^{t-1} k_i 2^i$, $NAF_w(l) = \sum_{i=0}^{t-1} l_i 2^i$
Compute $P_i = iP$, $Q_i = iQ$ for $i \in \{1, 3, \ldots, 2^{w-1} - 1\}$
R ← ∞
for i = t − 1 downto 0 do
    R ← 2R
    if k_i > 0 then
        R ← R + P_{k_i}
    else if k_i < 0 then
        R ← R − P_{−k_i}
    end if
    if l_i > 0 then
        R ← R + Q_{l_i}
    else if l_i < 0 then
        R ← R − Q_{−l_i}
    end if
end for
return R

Important: With endomorphism ψ, compute $kP = k_1 P + k_2 \psi(P)$.

Elliptic curve arithmetic

Koblitz curves have the Frobenius automorphism τ on $E(\mathbb{F}_{2^m})$ given by $\tau(x, y) = (x^2, y^2)$. If it is possible to recode k in another basis related to τ, point doublings can be replaced by applications of τ.
Important: Many other approaches for scalar multiplication and other scenarios (fixed, multiple).

Elliptic curve arithmetic

Algorithm 13 Left-to-right τ-and-add scalar multiplication

Input: w, k ∈ Z, $P \in E(\mathbb{F}_{2^m})$ of order r.
Output: kP.

Obtain the representation $TNAF(k) = \sum_{i=0}^{t-1} u_i \tau^i$
Q ← ∞
for i = t − 1 downto 0 do
    Q ← τQ
    if u_i = 1 then
        Q ← Q + P
    else if u_i = −1 then
        Q ← Q − P
    end if
end for
return Q

Elliptic curve arithmetic

Algorithm 14 Left-to-right wτNAF scalar multiplication

Input: w, k ∈ Z, $P \in E(\mathbb{F}_{2^m})$ of order r.
Output: kP.

Obtain the representation $TNAF_w(k) = \sum_{i=0}^{t-1} u_i \tau^i$
Compute $P_u = \alpha_u P$ for $u \in \{1, 3, 5, \ldots, 2^{w-1} - 1\}$, where $\alpha_i = i \bmod \tau^w$
Q ← ∞
for i = t − 1 downto 0 do
    Q ← τQ
    if u_i = α_j, for some j then
        Q ← Q + P_j
    else if u_i = −α_j, for some j then
        Q ← Q − P_j
    end if
end for
return Q

Elliptic curve arithmetic

Algorithm 15 Constant-time point multiplication.

Input: $k = \sum_{i=0}^{t-1} k_i 2^i \in \mathbb{Z}$, $P = (x, y) \in E(\mathbb{F}_{2^m})$, b-coefficient.
Output: $kP \in E(\mathbb{F}_{2^m})$.

x1 ← x, z1 ← 1, z2 ← x^2, x2 ← z2^2 + b
for i ← t − 2 downto 0 do
    r1 ← x1 z2, r2 ← x2 z1, r3 ← r1 + r2, r4 ← r1 r2
    if k_i ≠ 0 then
        z1 ← r3^2, r1 ← x z1, x1 ← r1 + r4, r1 ← z2^2, r2 ← x2^2
        z2 ← r1 r2, x2 ← r2^2, r1 ← r1^2, r2 ← b r1, x2 ← x2 + r2
    else
        z2 ← r3^2, r1 ← x z2, x2 ← r1 + r4, r1 ← z1^2, r2 ← x1^2
        z1 ← r1 r2, x1 ← r2^2, r1 ← r1^2, r2 ← b r1, x1 ← x1 + r2
    end if
end for
return Q = (x3, y3) from (x1/z1, x2/z2)

Experimental results – Elliptic curve arithmetic

Table: Timings given in 10^3 cycles for side-channel resistant scalar multiplication.

Curve                            | 10^3 cycles
This work:
CURVE2251 - Core 2               | 594
CURVE2251 - CLMUL                | 282
CURVE2251 - CLMUL + AVX          | 225
Related work:
BBE (Bernstein) - Core 2         | 314
eBACS (mpFq) - Core 2            | 855
4-GLV-GLS TED (Longa) - Core i7  | 137

Elliptic curve arithmetic

Briefly recall that Koblitz curves have the Frobenius automorphism τ on $E(\mathbb{F}_{2^m})$ given by $\tau(x, y) = (x^2, y^2)$. Computing fixed powers $2^k$ in constant time (independent of k) provides an endomorphism in the context of the GLV method.

Let map $\psi \equiv \tau^{\lfloor m/2 \rfloor}$, giving $kP = k_1 P + 2^{\lfloor m/2 \rfloor} k_2 P = k_1 P + k_2 \psi(P)$. Interleaving saves $\lfloor m/2 \rfloor$ applications of the Frobenius.

In general, exploiting the $\lfloor m/s \rfloor$-th power of τ is the analogue of an s-dimensional GLV decomposition and saves $(s - 1) \lfloor m/s \rfloor$ Frobenius.

Note: Tables can be reused for fast Itoh-Tsujii inversion.

Algorithm 16 Interleaved width-w τNAF scalar multiplication.

Input: k ∈ Z, $P \in E(\mathbb{F}_{2^m})$, integer s denoting the interleaving factor.
Output: $kP \in E(\mathbb{F}_{2^m})$.

Compute width-w $\tau$-NAF$(k) = \sum_{i=0}^{l-1} u_i \tau^i$
Compute $P_{0,u} = \alpha_u P$, for $u \in \{1, 3, 5, \ldots, 2^{w-1} - 1\}$
for i ← 1 to (s − 1) do Compute $P_{i,u} = \tau^{\lfloor m/s \rfloor} P_{i-1,u}$
Q ← ∞
for i ← l − 1 downto s⌊m/s⌋ do
    Q ← τQ
    if u_i ≠ 0 then
        Let u be such that α_u = u_i or α_{−u} = −u_i
        if u_i > 0 then Q ← Q + P_{0,u}; else Q ← Q − P_{0,u}
    end if
end for
for i ← (⌊m/s⌋ − 1) downto 0 do
    Q ← τQ
    for j ← 0 to (s − 1) do
        if u_{i+j⌊m/s⌋} ≠ 0 then
            Let u be such that α_u = u_{i+j⌊m/s⌋} or α_{−u} = −u_{i+j⌊m/s⌋}
            if u_i > 0 then Q ← Q + P_{j,u}; else Q ← Q − P_{j,u}
        end if
    end for
end for
return Q = (x, y)

Elliptic curve arithmetic

Table: Timings given in 10^3 cycles for unprotected scalar multiplication.

Curve                                   | 10^3 cycles
This work:
NISTK283 - CLMUL (w = 5, s = 2)         | 128
NISTK283 - CLMUL + AVX (w = 5, s = 2)   | 99
Related work:
4-GLV-GLS TED (Longa) - Sandy Bridge    | 91

Bilinear pairings

Let $G_1 = \langle P \rangle$ and $G_2 = \langle Q \rangle$ be additive groups and $G_T$ be a multiplicative group such that $|G_1| = |G_2| = |G_T| = n$ for a prime $n$. An efficiently-computable map $e : G_1 \times G_2 \to G_T$ is an admissible bilinear map if the following properties are satisfied:

1 Bilinearity: given $(V, W) \in G_1 \times G_2$ and $a, b \in \mathbb{Z}_q^*$:
  $e(aV, bW) = e(V, W)^{ab} = e(abV, W) = e(V, abW)$.

2 Non-degeneracy: $e(P, Q) \neq 1_{G_T}$, where $1_{G_T}$ is the identity of the group $G_T$.


slide-39
SLIDE 39

Bilinear pairings

[Picture: Avanzi, Cesena 2009]

Bilinear pairings

If G1 = G2, the pairing is symmetric.

[Picture: Avanzi, Cesena 2009]

slide-40
SLIDE 40

Example of protocol

Non-interactive ID-based key distribution protocol [SOK 2000]:
A trusted authority generates master key $s$;
A user $i$ receives $ID_i$, $P_i = h(ID_i)$, $S_i = sP_i$;
Users $A$ and $B$ can derive the same key $e(S_A, P_B) = e(S_B, P_A) = e(P_A, P_B)^s$.

Important: the underlying problem is the BCDHP: compute $e(P, Q)^{abc}$ from $P, aP, bP, cP, Q, aQ, bQ, cQ$.
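The key agreement above can be traced with a toy stand-in for the pairing. The sketch below is an illustration only, under loud assumptions: "points" are stored as their discrete logarithms modulo a small prime `N`, and the map `pairing(a, b)` returns `G^(a*b)` in a small multiplicative group. This is bilinear but offers no security whatsoever. The parameters, the hash-to-point rule and the master key value are all hypothetical.

```python
import hashlib

# Toy parameters (assumption): subgroup of prime order N in Z_P^*, generated by G.
P_MOD = 2039        # prime, P_MOD = 2 * N + 1
N = 1019            # prime group order
G = 4               # element of order N modulo P_MOD


def pairing(a, b):
    """Toy symmetric 'pairing': points are stored as their discrete logs."""
    return pow(G, (a * b) % N, P_MOD)


def hash_to_point(identity):
    """Hypothetical h(ID): map an identity string to a nonzero element of Z_N."""
    digest = hashlib.sha256(identity.encode()).digest()
    return int.from_bytes(digest, "big") % (N - 1) + 1


master_key = 123                       # s, known only to the trusted authority
p_a, p_b = hash_to_point("alice"), hash_to_point("bob")
s_a = (master_key * p_a) % N           # S_A = s * P_A
s_b = (master_key * p_b) % N           # S_B = s * P_B

key_a = pairing(s_a, p_b)              # computed by A from its private S_A
key_b = pairing(s_b, p_a)              # computed by B from its private S_B
print(key_a == key_b)
```

Both users arrive at $e(P_A, P_B)^s$ without exchanging messages, which is the non-interactive property of the protocol.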


Pairing computation

Let $P, Q$ be $r$-torsion points. The pairing $e(P, Q)$ is defined by the evaluation of $f_{r,P}$ at a divisor related to $Q$.
[Miller 1986] constructed $f_{r,P}$ in stages combining Miller functions evaluated at divisors.
[Barreto et al. 2002] showed how to evaluate $f_{r,P}$ at $Q$ using the final exponentiation employed by the Tate pairing.


slide-41
SLIDE 41

Pairing computation

Let $g_{U,V}$ be the line equation through points $U, V \in E(\mathbb{F}_{q^k})$ and $g_U$ the shorthand for $g_{U,-U}$. For any integers $a$ and $b$, we have:

1 $f_{a+b,P}(D) = f_{a,P}(D) \cdot f_{b,P}(D) \cdot \frac{g_{aP,bP}(D)}{g_{(a+b)P}(D)}$;

2 $f_{2a,P}(D) = f_{a,P}(D)^2 \cdot \frac{g_{aP,aP}(D)}{g_{2aP}(D)}$;

3 $f_{a+1,P}(D) = f_{a,P}(D) \cdot \frac{g_{aP,P}(D)}{g_{(a+1)P}(D)}$.


Pairing computation

Algorithm 17 Miller's Algorithm [Miller 1986, Barreto et al. 2002].

Input: $r = \sum_{i=0}^{\log_2 r} r_i 2^i$, $P$, $Q$.
Output: $e_r(P, Q)$.

1: $T \leftarrow P$
2: $f \leftarrow 1$
3: $r \leftarrow r - 1$
4: for $i = \lfloor \log_2(r) \rfloor - 1$ downto $0$ do
5:     $f \leftarrow f^2 \cdot l_{T,T}(Q)$
6:     $T \leftarrow 2T$
7:     if $r_i = 1$ then
8:         $f \leftarrow f \cdot l_{T,P}(Q)$
9:         $T \leftarrow T + P$
10:    end if
11: end for
12: return $f^{(q^k-1)/r}$


slide-42
SLIDE 42

Related work

Scalable approaches: [Mitsunari 2009] and [Beuchat et al. 2009] precompute pairs $(T_i, \text{part of } l_{T_i,T_i}(Q))$ in the symmetric case and divide loop iterations among processors.
Problem: High storage costs (large precomputation).


New approach

Property of Miller functions:
$f_{a \cdot b,P}(D) = f_{b,P}(D)^a \cdot f_{a,bP}(D)$

We can write $r = 2^w r_1 + r_0$ and compute $f_{r,P}(D)$:
$f_{r,P}(D) = f_{2^w r_1 + r_0,P}(D) = f_{r_1,P}(D)^{2^w} \cdot f_{2^w,r_1P}(D) \cdot f_{r_0,P}(D) \cdot \frac{g_{(2^w r_1)P,r_0P}(D)}{g_{rP}(D)}$.

If $r$ has low Hamming weight, $w$ can be chosen so that $r_0$ is small.
For many processors, we can:
Apply the formula recursively: write $r$ as $r = 2^{w_i} r_i + \cdots + 2^{w_2} r_2 + 2^{w_1} r_1 + r_0$.
If $P$ is fixed (private key), $r_iP$ can also be precomputed.
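The expression for $f_{r,P}(D)$ follows in two steps from identities already stated on the slides. The short derivation below spells them out; the labels "sum rule" (for $f_{a+b,P}$) and "product rule" (for $f_{a \cdot b,P}$) are names used here only for convenience.

```latex
\begin{align*}
f_{2^w r_1 + r_0,\,P}(D)
  &= f_{2^w r_1,\,P}(D)\cdot f_{r_0,\,P}(D)\cdot
     \frac{g_{(2^w r_1)P,\;r_0 P}(D)}{g_{rP}(D)}
     && \text{(sum rule with } a = 2^w r_1,\ b = r_0\text{)}\\
f_{2^w r_1,\,P}(D)
  &= f_{r_1,\,P}(D)^{2^w}\cdot f_{2^w,\;r_1 P}(D)
     && \text{(product rule with } a = 2^w,\ b = r_1\text{)}
\end{align*}
```

Substituting the second line into the first gives the formula for $f_{r,P}(D)$. Once $r_1P$ is available, the factors $f_{r_1,P}$ and $f_{2^w,r_1P}$ no longer depend on each other, which appears to be what allows them to be assigned to different processors.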


slide-43
SLIDE 43

Load balancing

Problem: We must determine an optimal partition $w_i$.
Let $c_1(1)$ be the cost of a serial loop and $c_\pi(i)$ the cost of a parallel loop for processor $1 \le i \le \pi$. We can count the operations executed by each processor and solve the system $c_\pi(1) = c_\pi(i)$ to obtain $w_i$.
The speedup is:
$s(\pi) = \frac{c_1(1) + \mathit{exp}}{c_\pi(1) + \mathit{par} + \mathit{exp}}$,
where $\mathit{par}$ is the cost of parallelization and $\mathit{exp}$ is the cost of the final exponentiation.


Symmetric case – Elliptic curves

A pairing-friendly supersingular binary elliptic curve is the set of solutions $(x, y) \in \mathbb{F}_{2^m} \times \mathbb{F}_{2^m}$ satisfying the equation $y^2 + y = x^3 + x + b$, where $b \in \{0, 1\}$, and a point at infinity $\infty$. The order of this curve is $N = 2^m + 1 \pm 2^{\frac{m+1}{2}}$ and the embedding degree is $k = 4$ (the least integer such that $N$ divides $2^{km} - 1$).
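The claim about the embedding degree is easy to check numerically for a few odd values of $m$. The sketch below is a plain arithmetic check, not curve code: the function names are hypothetical, and the sample exponents (including 1223, used here only as a large odd value) are arbitrary choices.

```python
def curve_orders(m):
    """Both candidate orders N = 2^m + 1 +/- 2^((m+1)/2) for odd m."""
    assert m % 2 == 1
    root = 2 ** ((m + 1) // 2)
    return 2**m + 1 + root, 2**m + 1 - root


def embedding_degree(order, m, limit=8):
    """Least k such that order divides 2^(k*m) - 1, searched up to limit."""
    for k in range(1, limit + 1):
        if (pow(2, k * m, order) - 1) % order == 0:
            return k
    return None


for m in (3, 5, 7, 1223):
    n_plus, n_minus = curve_orders(m)
    print(m, embedding_degree(n_plus, m), embedding_degree(n_minus, m))
```

For every sampled $m$ and both signs, the search stops at $k = 4$, in agreement with the statement above.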


slide-44
SLIDE 44

Symmetric case – Pairing definition

Choosing $T = 2^m - N$ and a prime $r$ dividing $N$, [Barreto et al. 2004] defined the reduced $\eta_T$ pairing:

$\eta_T : E(\mathbb{F}_{2^m})[r] \times E(\mathbb{F}_{2^m})[r] \to \mathbb{F}^*_{2^{4m}}$

$\eta_T(P, Q) = f_{T',P'}(\psi(Q))^{\frac{2^{4m}-1}{N}}$,

where $T' = \pm T$ and $P' = \pm P$. The function $f$ is a Miller function and $\psi$ is the distortion map $\psi(x, y) = (x^2 + s, y + sx + t)$.


Symmetric case – Pairing algorithm

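Algorithm 18 $\eta_T$ pairing [Barreto et al. 2004], [Beuchat et al. 2008].

Input: $P = (x_P, y_P)$, $Q = (x_Q, y_Q) \in E(\mathbb{F}_{2^m})[r]$.
Output: $\eta_T(P, Q) \in \mathbb{F}^*_{2^{4m}}$.

1: $y_P \leftarrow y_P + 1 - \delta$
2: $u \leftarrow x_P + \alpha$, $v \leftarrow x_Q + \alpha$
3: $g_0 \leftarrow u \cdot v + y_P + y_Q + \beta$
4: $g_1 \leftarrow u + x_Q$, $g_2 \leftarrow v + x_P^2$
5: $G \leftarrow g_0 + g_1 s + t$
6: $L \leftarrow (g_0 + g_2) + (g_1 + 1)s + t$
7: $F \leftarrow L \cdot G$
8: for $i \leftarrow 1$ to $\frac{m-1}{2}$ do
9:     $x_P \leftarrow \sqrt{x_P}$, $y_P \leftarrow \sqrt{y_P}$, $x_Q \leftarrow x_Q^2$, $y_Q \leftarrow y_Q^2$
10:    $u \leftarrow x_P + \alpha$, $v \leftarrow x_Q + \alpha$
11:    $g_0 \leftarrow u \cdot v + y_P + y_Q + \beta$
12:    $g_1 \leftarrow u + x_Q$
13:    $G \leftarrow g_0 + g_1 s + t$
14:    $F \leftarrow F \cdot G$
15: end for
16: return $F^{(2^{2m}-1)(2^m+1 \pm 2^{\frac{m+1}{2}})}$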
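The exponent in the last step of the algorithm is the quotient $(2^{4m}-1)/N$ from the pairing definition, written in factored form. A short check, assuming $N = 2^m + 1 \pm 2^{\frac{m+1}{2}}$ as given for the curve order:

```latex
\begin{align*}
\left(2^m + 1 + 2^{\frac{m+1}{2}}\right)\left(2^m + 1 - 2^{\frac{m+1}{2}}\right)
  &= (2^m + 1)^2 - 2^{m+1} = 2^{2m} + 1,\\
\frac{2^{4m} - 1}{N}
  &= \frac{(2^{2m} - 1)(2^{2m} + 1)}{N}
   = (2^{2m} - 1)\left(2^m + 1 \mp 2^{\frac{m+1}{2}}\right).
\end{align*}
```

Read this way, the sign inside the exponent is the opposite of the sign chosen in $N$.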

slide-45
SLIDE 45

Symmetric case – Parallel pairing

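Algorithm 19 Proposed parallel $\eta_T$ pairing.

Input: $P = (x_P, y_P)$, $Q = (x_Q, y_Q) \in E(\mathbb{F}_{2^m})[r]$.
Output: $\eta_T(P, Q) \in \mathbb{F}^*_{2^{4m}}$.

1: parallel section(processor $i$)
2:     if $i = 1$ then Initialize $F_1$ as in lines 1-7 of the previous algorithm;
3:     else $F_i \leftarrow 1$
4:     $x_{P_i} \leftarrow (x_P)^{1/2^{w_i}}$, $y_{P_i} \leftarrow (y_P)^{1/2^{w_i}}$, $x_{Q_i} \leftarrow (x_Q)^{2^{w_i}}$, $y_{Q_i} \leftarrow (y_Q)^{2^{w_i}}$
5:     for $j \leftarrow w_i$ to $w_{i+1} - 1$ do
6:         $x_{P_i} \leftarrow \sqrt{x_{P_i}}$, $y_{P_i} \leftarrow \sqrt{y_{P_i}}$, $x_{Q_i} \leftarrow x_{Q_i}^2$, $y_{Q_i} \leftarrow y_{Q_i}^2$
7:         $u_i \leftarrow x_{P_i} + \alpha$, $v_i \leftarrow x_{Q_i} + \alpha$
8:         $g_{0i} \leftarrow u_i \cdot v_i + y_{P_i} + y_{Q_i} + \beta$
9:         $g_{1i} \leftarrow u_i + x_{Q_i}$
10:        $G_i \leftarrow g_{0i} + g_{1i} s + t$
11:        $F_i \leftarrow F_i \cdot G_i$
12:    end for
13:    $F \leftarrow \prod_{i=1}^{\pi} F_i$
14: end parallel
15: return $F^M$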

Note: If memory is available, use partitions with the same size.
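The splitting idea in the parallel algorithm can be sketched without any pairing code: each worker first jumps to the start of its chunk by repeated squaring, then accumulates a partial product, and the partial products are multiplied at the end. The sketch below is a toy under explicit assumptions: a prime field replaces the binary field, `factor` is a hypothetical stand-in for the per-iteration value $G$, only one state variable is tracked, and Python threads are used purely to show the structure (they do not give a real speedup for this kind of work).

```python
from concurrent.futures import ThreadPoolExecutor

MODULUS = 2**61 - 1  # toy prime field (assumption), standing in for the binary field


def factor(x):
    """Hypothetical stand-in for the per-iteration value G."""
    return (x * x + 3 * x + 1) % MODULUS


def serial_product(x, iterations):
    """Serial loop: square the state, then multiply the accumulator."""
    f = 1
    for _ in range(iterations):
        x = (x * x) % MODULUS
        f = (f * factor(x)) % MODULUS
    return f


def partial_product(args):
    """Work of one processor on iterations start .. end-1."""
    x, start, end = args
    for _ in range(start):             # jump to the chunk start: x^(2^start)
        x = (x * x) % MODULUS
    f = 1
    for _ in range(start, end):
        x = (x * x) % MODULUS
        f = (f * factor(x)) % MODULUS
    return f


def parallel_product(x, bounds):
    """bounds = [w_1, ..., w_{pi+1}] with w_1 = 0; chunk i covers w_i .. w_{i+1}-1."""
    chunks = [(x, bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        parts = list(pool.map(partial_product, chunks))
    f = 1
    for part in parts:
        f = (f * part) % MODULUS
    return f


x0, iterations = 123456789, 40
bounds = [0, 10, 20, 30, 40]           # equal-size partition for 4 workers
print(parallel_product(x0, bounds) == serial_product(x0, iterations))
```

In this toy the jump to the chunk start costs as many squarings as the chunk offset, so later workers do more work under an equal-size partition. That extra cost is the reason a balanced partition $w_i$ has to be solved for when the jump is not free.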


Experimental results – Speedup (45nm)

[Plot: speedup versus number of processors; series: Beuchat et al. 2009 and Aranha et al. 2010]

slide-46
SLIDE 46

Experimental results – Latency (45nm)

[Plot: latency (millions of cycles) versus number of threads]

Number of threads        1        2        4        8
Beuchat et al. 2009      23.03    13.14    9.08     8.93
Aranha et al. 2010       17.40    9.34     5.08     3.02

Conclusions

New formulation and implementation of binary field arithmetic:
Follows the trend of faster shuffle instructions
Improves results from related work by 8%-84%
Induces a new implementation strategy for multiplication
Still requires architectural features to be optimal
May be cheaper to support than a full native multiplier

Timings for non-batched arithmetic on binary elliptic curves:
Provide a new speed record for side-channel resistant scalar multiplication on binary curves
Improve results for kP on eBACS by at least 27%-30%


slide-47
SLIDE 47

Conclusions

New state-of-the-art for parallel implementation of pairings:
No significant storage costs, smaller precomputation;
In comparison with our serial implementation, speedups of 46%, 70% and 83% with 2, 4 and 8 cores;
In comparison with previous state-of-the-art, improvements in latency of 24%, 29%, 44% and 66% with 1, 2, 4 and 8 cores.

Parallelization scales:
In the covered case, point doublings and extension field squarings are efficient;
Our finite field implementation makes these exceptionally fast.


Detailed results

Table: Timings are reported in millions of cycles.

                                     Number of threads
Platform 1 – Intel Core 2 65nm       1        2        4        8*
Hankerson et al. – latency           39       –        –        –
Beuchat et al. – latency             26.86    16.13    10.13    –
Beuchat et al. – speedup             1        1.67     2.65     –
This work – latency                  18.76    10.08    5.72     3.55
This work – speedup                  1        1.86     3.28     5.28
Improvement                          30.2%    32.9%    39.9%    –

Platform 2 – Intel Core 2 45nm       1        2        4        8
Beuchat et al. – latency             23.03    13.14    9.08     8.93
Beuchat et al. – speedup             1        1.77     2.54     2.58
This work – latency                  17.40    9.34     5.08     3.02
This work – speedup                  1        1.86     3.42     5.76
Improvement                          24.4%    28.9%    44.0%    66.2%

Platform 3 – Intel Core i7 32nm      1        2        4        8*
This work – latency                  6.46     3.37     1.79     1.03
This work – speedup                  1.00     1.92     3.60     6.24
Improvement                          62.3%    63.9%    64.8%    65.9%
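The speedup and improvement rows can be recomputed from the latency rows. The sketch below does this for Platform 2, using only the latencies listed in the table. The function names are hypothetical, and the definitions used (speedup as the ratio to the single-thread latency, improvement as the relative latency reduction against Beuchat et al.) are assumptions that happen to reproduce the tabulated figures up to rounding.

```python
# Platform 2 latencies from the table, in millions of cycles.
THREADS = (1, 2, 4, 8)
BEUCHAT = (23.03, 13.14, 9.08, 8.93)
THIS_WORK = (17.40, 9.34, 5.08, 3.02)


def speedups(latencies):
    """Speedup of each configuration relative to the single-thread latency."""
    return [latencies[0] / value for value in latencies]


def improvements(reference, ours):
    """Relative latency reduction, in percent, against a reference."""
    return [100.0 * (ref - new) / ref for ref, new in zip(reference, ours)]


for n, s, imp in zip(THREADS, speedups(THIS_WORK), improvements(BEUCHAT, THIS_WORK)):
    print(f"{n} threads: speedup {s:.2f}, improvement {imp:.1f}%")
```

Small differences in the last digit are expected, since the published latencies are themselves rounded.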


slide-48
SLIDE 48

Sources

[1] D. F. Aranha, J. López, D. Hankerson, “High-speed parallel software implementation of the ηT pairing”, In CT-RSA 2010, Springer LNCS 5985, pp. 89–105, San Francisco, USA, 2010.

[2] D. F. Aranha, J. López, D. Hankerson, “Efficient Software Implementation of Binary Field Arithmetic Using Vector Instruction Sets”, In LATINCRYPT 2010, Springer LNCS 6212, pp. 144–161, Puebla, Mexico, 2010.

[3] J. Taverne, A. Faz-Hernández, D. F. Aranha, F. Rodríguez-Henríquez, D. Hankerson, J. López, “Software implementation of binary elliptic curves: impact of the carry-less multiplier on scalar multiplication”, In CHES 2011, Springer LNCS 6917, pp. 108–123, Nara, Japan, 2011.

[4] J. Taverne, A. Faz-Hernández, D. F. Aranha, F. Rodríguez-Henríquez, D. Hankerson, J. López, “Speeding scalar multiplication over binary elliptic curves using the new carry-less multiplication instruction”, Journal of Cryptographic Engineering, Vol. 1, Number 3, pp. 187–199, 2011.

[5] D. F. Aranha, E. Knapp, A. Menezes, F. Rodríguez-Henríquez, “Parallelizing the Weil and Tate pairings”, In IMACC 2011, Springer LNCS 7089, pp. 275–295, Oxford, UK, 2011.

[6] D. F. Aranha, A. Faz-Hernández, J. López, F. Rodríguez-Henríquez, “Faster Implementation of Scalar Multiplication on Koblitz Curves”, In LATINCRYPT 2012, Springer LNCS 7533, pp. 177–193, Santiago, Chile, 2012.
