slide-1
SLIDE 1

FourQ on FPGA:

New Hardware Speed Records for Elliptic Curve Cryptography over Large Prime Characteristic Fields

K. Järvinen¹, A. Miele², R. Azarderakhsh³, and P. Longa⁴

¹ Aalto University, ² Intel Corporation, ³ Rochester Institute of Technology, ⁴ Microsoft Research

Contact: kimmo.jarvinen@aalto.fi, plonga@microsoft.com

CHES 2016, Santa Barbara, CA, USA, August 17–19, 2016

slide-2
SLIDE 2

FourQ on FPGA CHES 2016 2/17

Introduction

FourQ:

◮ FourQ is a high-performance elliptic curve with very good software performance (2–3× faster than Curve25519)
◮ FourQ has been shown to offer the fastest scalar multiplications on a wide range of software platforms:
    ◮ On several 32-bit ARM microarchitectures (SAC 2016)
    ◮ On several 64-bit Intel/AMD processors, low- and high-end (ASIACRYPT 2015)
◮ FourQ employs four-dimensional scalar decompositions and requires extensive precomputation, complex control, etc.
    ⇒ It is not clear how well it suits hardware implementation

slide-3
SLIDE 3

Introduction

Contributions:

◮ The first FPGA-based implementations of FourQ
◮ FourQ offers 2–2.5× faster performance than Curve25519
◮ Speed-area tradeoff is the primary optimization goal
◮ Protected against timing and SPA attacks
◮ We present three implementations: single-core, multi-core, and a Montgomery ladder variant

slide-4
SLIDE 4

FourQ

Costello, Longa, ASIACRYPT’15

$E/\mathbb{F}_{p^2}: -x^2 + y^2 = 1 + d x^2 y^2$

◮ Twisted Edwards curve with $\#E(\mathbb{F}_{p^2}) = 392 \cdot \xi$, where $\xi$ is a 246-bit prime
◮ Defined over $\mathbb{F}_{p^2}$ with the Mersenne prime $p = 2^{127} - 1$
◮ Complete addition formulas over extended twisted Edwards coordinates (Hisil et al., ASIACRYPT’08)

slide-5
SLIDE 5

FourQ

◮ Two efficiently computable endomorphisms $\psi$ and $\phi$
◮ Four-dimensional decomposition of the 256-bit scalar $m$ into $(a_1, a_2, a_3, a_4)$ with $a_i \in [0, 2^{64})$ such that
    $[m]P = [a_1]P + [a_2]\psi(P) + [a_3]\phi(P) + [a_4]\psi(\phi(P))$

slide-6
SLIDE 6

Scalar Multiplication

Input: point $P$, integer $m \in [0, 2^{256})$
Output: $[m]P$

Decompose and recode $m$
Precompute lookup table $T$
$Q \leftarrow T[v_{64}]$
for $i = 63$ to $0$ do
    $Q \leftarrow [2]Q$
    $Q \leftarrow Q + m_i T[v_i]$

slide-7
SLIDE 7

Scalar Multiplication

Scalar decomposition and recoding

◮ Decompose $m$ into a multi-scalar $(a_1, a_2, a_3, a_4)$
◮ Sign-align so that $a_1[j] \in \{\pm 1\}$ and $a_i[j] \in \{0, a_1[j]\}$ for $2 \le i \le 4$
◮ Recode to signs $m_i \in \{-1, 1\}$ and values $v_i \in [0, 7]$ (point index)

slide-8
SLIDE 8

Scalar Multiplication

Precomputation

◮ Precompute 8 points: $T[u] = P + [u_0]\phi(P) + [u_1]\psi(P) + [u_2]\psi(\phi(P))$ for $u = (u_2, u_1, u_0) \in [0, 7]$
◮ Store them with 5 coordinates $(X + Y, Y - X, 2Z, 2dT, -2dT)$
    ⇒ $+T[u]: (X + Y, Y - X, 2Z, 2dT)$
    ⇒ $-T[u]: (Y - X, X + Y, 2Z, -2dT)$
◮ Cost: 68M + 27S (multiplications and squarings) and several additions

slide-9
SLIDE 9

Scalar Multiplication

Main for-loop

◮ Fully regular and constant-time
◮ Only 64 double-and-adds
◮ Doubling: $(X, Y, Z, T_a, T_b) \leftarrow (X, Y, Z)$
◮ Addition: $(X, Y, Z, T_a, T_b) \leftarrow (X, Y, Z, T_a, T_b) \times (X + Y, Y - X, 2Z, 2dT)$
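The recoding, table lookup, and main loop described on these slides can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: integers modulo a hypothetical modulus `N` stand in for curve points, the four base values stand in for P and its endomorphism images, the helper names `recode` and `multi_scalar_mul` are invented here, and `a1` is assumed to be odd. Python integers are not constant-time, so only the regular control flow is modelled.

```python
DIGITS = 65  # digits indexed 0..64, so the main loop runs 64 times


def recode(a1, a2, a3, a4):
    """Return signs m_i in {-1, 1} and table indices v_i in [0, 7]."""
    assert a1 % 2 == 1, "sketch assumes an odd a1"
    # Signs come from a1: digit i is 2*bit(i+1) - 1, and the top digit is +1.
    signs = [2 * ((a1 >> (i + 1)) & 1) - 1 for i in range(DIGITS - 1)] + [1]
    values = [0] * DIGITS
    for k, a in enumerate((a2, a3, a4)):
        for i in range(DIGITS):
            bit = a & 1
            digit = signs[i] * bit       # 0, or the same sign as a1's digit
            a = (a >> 1) - (digit >> 1)  # digit >> 1 is -1 for -1, otherwise 0
            values[i] |= bit << k        # bit k of v_i records |digit|
        assert a == 0, "scalar did not fit in the digit budget"
    return signs, values


def multi_scalar_mul(scalars, bases, n):
    """Compute a1*B1 + a2*B2 + a3*B3 + a4*B4 mod n with a regular loop."""
    b1, b2, b3, b4 = bases
    signs, values = recode(*scalars)
    # Eight-entry table: T[u] = B1 + u0*B2 + u1*B3 + u2*B4.
    table = [
        (b1 + (u & 1) * b2 + ((u >> 1) & 1) * b3 + ((u >> 2) & 1) * b4) % n
        for u in range(8)
    ]
    q = table[values[DIGITS - 1]]
    for i in range(DIGITS - 2, -1, -1):
        q = (2 * q) % n                            # doubling
        q = (q + signs[i] * table[values[i]]) % n  # signed table addition
    return q


N = (1 << 61) - 1  # hypothetical toy group order
SCALARS = (0x9E3779B97F4A7C15, 0xC2B2AE3D27D4EB4F,
           0x165667B19E3779F9, 0xFFFFFFFFFFFFFFFF)
BASES = (11, 222, 3333, 44444)
EXPECTED = sum(a * b for a, b in zip(SCALARS, BASES)) % N
```

Comparing `multi_scalar_mul(SCALARS, BASES, N)` with `EXPECTED` checks that the 64 doublings and 64 signed table additions reproduce the plain multi-scalar sum.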

slide-10
SLIDE 10

General Architecture

Scalar Decomposition and Recoding Unit

◮ Decomposes and recodes the scalar
◮ Mainly multiplications with constants

Field Arithmetic Unit (“the core”)

◮ Precomputation and the main for-loop
◮ Highly optimized for $\mathbb{F}_p$ with the Mersenne prime
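One reason a Mersenne prime suits hardware is that reduction modulo $p = 2^{127} - 1$ needs no division: because $2^{127} \equiv 1 \pmod p$, the high half of a product can simply be added to the low half. The following Python sketch illustrates that idea only; the function names are hypothetical and it does not model the actual datapath.

```python
P = (1 << 127) - 1  # Mersenne prime p = 2^127 - 1


def reduce_mod_p(x):
    """Reduce 0 <= x < 2^254 modulo p using only masks, shifts, and additions."""
    x = (x & P) + (x >> 127)   # fold: 2^127 is congruent to 1 modulo p
    x = (x & P) + (x >> 127)   # second fold absorbs the possible carry bit
    return 0 if x == P else x  # p itself represents zero


def fp_mul(a, b):
    """Multiply two field elements below 2^127 and reduce the product."""
    return reduce_mod_p(a * b)
```

Two folds suffice because the first fold leaves at most one bit above position 126.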

slide-11
SLIDE 11

Scalar Unit

◮ Decomposition is computed with a truncated multiplier (mainly multiplications with constants)
◮ The main component is a 17 × 264-bit row multiplier built from 11 DSPs
◮ Recoding consists of bit manipulations and 64-bit additions
◮ The unit outputs $(m_0, v_0)$ first, but scalar multiplication begins with $(m_{64}, v_{64})$ ⇒ the digits are stored in a LIFO buffer

[Figure: scalar unit block diagram with an FSM, the 17 × 264-bit multiplier, and an adder]

slide-12
SLIDE 12

Field Arithmetic Unit

[Figure: block diagram of the field arithmetic unit with interface logic (di, do, commands, responses), control, datapath, and dual-port RAM]

◮ Dual-port RAM: 256 × 127-bit RAM (128 $\mathbb{F}_{p^2}$ elements), 4 BRAMs
◮ Datapath: 127-bit, optimized for $p = 2^{127} - 1$
◮ Control: FSM + program ROM (6 BRAMs)

slide-16
SLIDE 16

Field Arithmetic Unit: Datapath

[Figure: datapath diagram with two highlighted paths. The multiplier path is built around a pipelined 64 × 64-bit multiplier; the adder path contains an adder and an adder/subtractor. Operands a and b and result r are 127 bits wide.]

slide-19
SLIDE 19

Example: Multiplication in $\mathbb{F}_{p^2}$

3 multiplications, 2 additions and 3 subtractions in $\mathbb{F}_p$:

$a \times b = (a_0, a_1) \times (b_0, b_1) = (a_0 \cdot b_0 - a_1 \cdot b_1,\ (a_0 + a_1) \cdot (b_0 + b_1) - a_0 \cdot b_0 - a_1 \cdot b_1)$
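The operation count can be checked with a short Python sketch of this Karatsuba-style formula. It assumes, as the real part of the formula implies, that elements are pairs $(a_0, a_1)$ representing $a_0 + a_1 i$ with $i^2 = -1$; the function names are illustrative and the hardware scheduling is not modelled.

```python
P = (1 << 127) - 1  # p = 2^127 - 1


def fp2_mul(a, b):
    """Multiply in F_{p^2} with 3 multiplications, 2 additions, 3 subtractions."""
    a0, a1 = a
    b0, b1 = b
    s_a = (a0 + a1) % P        # addition 1
    s_b = (b0 + b1) % P        # addition 2
    t0 = (a0 * b0) % P         # multiplication 1
    t1 = (a1 * b1) % P         # multiplication 2
    t2 = (s_a * s_b) % P       # multiplication 3
    real = (t0 - t1) % P       # subtraction 1
    imag = (t2 - t0 - t1) % P  # subtractions 2 and 3
    return real, imag


def fp2_mul_schoolbook(a, b):
    """Reference version with 4 multiplications, used only for comparison."""
    a0, a1 = a
    b0, b1 = b
    return (a0 * b0 - a1 * b1) % P, (a0 * b1 + a1 * b0) % P
```

The saving is one $\mathbb{F}_p$ multiplication per $\mathbb{F}_{p^2}$ multiplication, paid for with extra additions and subtractions.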

[Animation, slides 20–65: cycle-by-cycle schedule of this multiplication over clock cycles 1–45, drawn on four rows (dual-port RAM, input registers, multiplier pipeline, adders) with event markers R, W, clr, +, %, and ↓. From cycle 22 onwards a second multiplication, counted in parentheses as (1)–(24), overlaps the first.]
slide-66
SLIDE 66

Latencies

Field operations

Operation         in $\mathbb{F}_p$    in $\mathbb{F}_{p^2}$
Addition          6 (2) clocks         8 (4) clocks
Multiplication    20 (7) clocks        38/45 (31/21) clocks
Squaring          20 (7) clocks        28 (16) clocks
Inversion         2760 clocks          2817 clocks

In practice, almost all additions are performed in parallel with multiplications.

slide-67
SLIDE 67

Latencies

Operations for scalar multiplication

Operation                              Latency
Precomputation                         4185 clocks
Scalar decomposition and recoding      1984 (0) clocks
Double-and-add (64 times)              354 clocks per iteration
Affine conversion                      2869 clocks
Scalar multiplication (total)          29739 clocks
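As a rough cross-check of these figures, the itemised latencies can be added up in Python. The sketch assumes that the 354 clocks apply to each of the 64 loop iterations and that scalar decomposition, listed as 1984 (0) clocks, adds nothing to the total; both are readings of the slide, not statements from it.

```python
PRECOMPUTATION = 4185
DOUBLE_AND_ADD = 354   # assumed to be the cost of one loop iteration
ITERATIONS = 64
AFFINE_CONVERSION = 2869
TOTAL = 29739          # total reported for one scalar multiplication

main_loop = ITERATIONS * DOUBLE_AND_ADD
itemised = PRECOMPUTATION + main_loop + AFFINE_CONVERSION
unlisted = TOTAL - itemised  # cycles not itemised on the slide

print(f"main loop: {main_loop} clocks ({100 * main_loop / TOTAL:.1f} % of total)")
print(f"itemised sum: {itemised} clocks, not itemised: {unlisted} clocks")
```

Under these assumptions the main loop accounts for roughly three quarters of the total, with precomputation and the final inversion-dominated affine conversion sharing most of the rest.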

slide-68
SLIDE 68

Multi-Core Architecture

[Figure: multi-core architecture. A single scalar unit and a read/write control block serve N cores (Core 1 to Core N), each with its own LIFO buffer (LIFO 1 to LIFO N); the external interface carries di, do, commands, and responses.]

slide-69
SLIDE 69

Area Results on Zynq-7020

Single-Core Architecture

Resource   Available   Total used        Scalar unit
LUTs       53,200      4,217 (7.9 %)     6.3 %
Regs       106,400     4,413 (4.1 %)     2.6 %
Slices     13,300      1,691 (12.7 %)    9.2 %
BRAMs      140         10 (7.1 %)        0.0 %
DSPs       220         27 (12.3 %)       5.0 %

All percentages are shares of the resources available on the device.

slide-71
SLIDE 71

Area Results on Zynq-7020

Multi-Core Architecture (N = 11)

Resource   Available   Total used         Scalar unit
LUTs       53,200      13,595 (25.6 %)    6.4 %
Regs       106,400     20,924 (19.7 %)    2.8 %
Slices     13,300      5,697 (42.8 %)     9.0 %
BRAMs      140         110 (78.6 %)       0.0 %
DSPs       220         187 (85.0 %)       5.0 %

All percentages are shares of the resources available on the device.

slide-73
SLIDE 73

Performance Results on Zynq-7020

Implemented in VHDL for Xilinx Zynq-7020 with Vivado 2015.4

◮ One scalar multiplication takes 29,739 clock cycles
◮ Single-core: 190 MHz ⇒ 157 µs or 6,389 ops/s
◮ Multi-core: 175 MHz (×11) ⇒ 170 µs or 64,730 ops/s
◮ Point validation (124 clocks), cofactor killing (1760 clocks)

slide-74
SLIDE 74

Performance Results on Zynq-7020

Variant using Montgomery ladder

◮ No scalar unit (saves 11 DSPs), no precomputations, simpler control, etc.
◮ 522 slices, 7 BRAMs, 16 DSPs
◮ 58,967 clocks at 190 MHz ⇒ 310 µs or 3,222 ops/s
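The latency and throughput figures on these slides follow directly from the cycle counts and clock frequencies. The Python sketch below reproduces the conversion; the function names are illustrative, and the multi-core throughput assumes all 11 cores run scalar multiplications back to back.

```python
def latency_us(cycles, clock_mhz):
    """Latency of one operation in microseconds."""
    return cycles / clock_mhz


def throughput_ops_per_s(cycles, clock_mhz, cores=1):
    """Operations per second, assuming every core is kept fully busy."""
    return cores * clock_mhz * 1e6 / cycles


# (clock cycles, clock frequency in MHz, number of cores) from the slides
CONFIGS = {
    "single-core": (29739, 190, 1),
    "multi-core": (29739, 175, 11),
    "Montgomery ladder": (58967, 190, 1),
}

for name, (cycles, mhz, cores) in CONFIGS.items():
    print(f"{name}: {latency_us(cycles, mhz):.0f} us, "
          f"{throughput_ops_per_s(cycles, mhz, cores):.0f} ops/s")
```

Note that the multi-core design trades a slightly lower clock (175 MHz instead of 190 MHz) for an elevenfold increase in parallel operations.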

slide-75
SLIDE 75

Comparison

◮ Many implementations exist for ECC over prime fields
◮ Comparison is extremely difficult because of different FPGAs, different optimization goals, etc.
◮ The best match is Sasdrich & Güneysu’s Curve25519 design, since both designs target the Xilinx Zynq-7020
◮ See the paper for further comparisons

slide-76
SLIDE 76

FourQ vs. Curve25519

Single-Core Architectures

Metric                Our FourQ         Sasdrich & Güneysu’s Curve25519   Ratio
Slices (of 13,300)    1691 (12.7 %)     1029 (7.7 %)
BRAMs (of 140)        10 (7.1 %)        2 (1.4 %)
DSPs (of 220)         27 (12.3 %)       20 (9.1 %)
Throughput (ops/s)    6389              2519                              2.54×
Tput/DSP              236.6             126.0                             1.88×

slide-77
SLIDE 77

FourQ vs. Curve25519

Montgomery Ladder

Metric                Our FourQ         Sasdrich & Güneysu’s Curve25519   Ratio
Slices (of 13,300)    565 (4.2 %)       1029 (7.7 %)
BRAMs (of 140)        7 (5.0 %)         2 (1.4 %)
DSPs (of 220)         16 (7.3 %)        20 (9.1 %)
Throughput (ops/s)    3222              2519                              1.28×
Tput/DSP              201.4             126.0                             1.60×

slide-78
SLIDE 78

FourQ vs. Curve25519

Multi-Core Architectures (N = 11)

Metric                Our FourQ         Sasdrich & Güneysu’s Curve25519   Ratio
Slices (of 13,300)    5697 (42.8 %)     11277 (84.8 %)
BRAMs (of 140)        110 (78.6 %)      22 (10.0 %)
DSPs (of 220)         187 (85.0 %)      220
Throughput (ops/s)    64730             32304                             2.00×
Tput/DSP              346.2             146.8                             2.36×
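The ratio columns in these comparisons can be recomputed from the throughput and DSP counts. The Python sketch below does so with the numbers from the slides; the dictionary layout and the function name `speedups` are illustrative choices, and recomputed values can differ from the slides in the last digit because the slides may have used unrounded throughputs.

```python
# (throughput in ops/s, DSP blocks) as listed in the comparison slides
DESIGNS = {
    "single-core": {"fourq": (6389, 27), "curve25519": (2519, 20)},
    "montgomery": {"fourq": (3222, 16), "curve25519": (2519, 20)},
    "multi-core": {"fourq": (64730, 187), "curve25519": (32304, 220)},
}


def speedups(entry):
    """Return (throughput ratio, throughput-per-DSP ratio) of FourQ over Curve25519."""
    (tput_f, dsp_f), (tput_c, dsp_c) = entry["fourq"], entry["curve25519"]
    return tput_f / tput_c, (tput_f / dsp_f) / (tput_c / dsp_c)


for name, entry in DESIGNS.items():
    tput_ratio, per_dsp_ratio = speedups(entry)
    print(f"{name}: throughput x{tput_ratio:.2f}, per DSP x{per_dsp_ratio:.2f}")
```

Normalising by DSP count is one of several possible area metrics; slices or BRAMs would give different ratios, since the FourQ designs use more BRAMs than the Curve25519 design.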

slide-79
SLIDE 79

Conclusions

◮ We showed that FourQ is also very efficient on FPGAs
◮ FourQ is significantly more efficient in terms of speed-area ratio than the closest counterpart

slide-80
SLIDE 80

Future Work

◮ Low-latency implementation
◮ Better side-channel protection, e.g., against DPA and advanced horizontal attacks

slide-81
SLIDE 81

Thank you! Questions?