SLIDE 1

High-performance Elliptic Curve Cryptography by Using the CIOS Method for Modular Multiplication

Amine Mrabet, Nadia El-Mrabet, Ronan Lashermes, Jean-Baptiste Rigaud, Belgacem Bouallegue, Sihem Mesnager and Mohsen Machhout September 2016

Efficient MMM for ECC Mrabet et al. September 2016 1/37

SLIDE 2

Arithmetic Our architecture Results

Introduction

Public-key cryptography is still costly in computing resources. Elliptic Curve Cryptography offers a better cost/security trade-off than RSA. The cost can be reduced further with better hardware architectures.

SLIDE 3

1. Arithmetic: ECC, Montgomery Modular Multiplication
2. Our architecture: Basics, PEs, Scheduling, Resources
3. Results: Results, Conclusion

SLIDE 5

Elliptic Curve Cryptography (ECC)

Why? Elliptic curves let us define groups in which the Discrete Logarithm Problem is hard. In the general case, known attacks are far less efficient than those against RSA.

How? (simplified) Let $p > 3$ be a large prime. $E(\mathbb{F}_p)$ is the (short Weierstrass) elliptic curve $E(\mathbb{F}_p) : y^2 = x^3 + ax + b$, where $x, y, a, b \in \mathbb{F}_p$ and $4a^3 + 27b^2 \neq 0$.
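As a minimal sketch of the curve definition, the snippet below checks the non-singularity condition and tests whether a point satisfies the short Weierstrass equation. The function names and the toy parameters (p = 17, a = 0, b = 7) are assumptions made for illustration; they are far too small for real use and do not come from the slides.

```python
def is_nonsingular(a: int, b: int, p: int) -> bool:
    """True when 4a^3 + 27b^2 != 0 (mod p), i.e. the curve is non-singular."""
    return (4 * pow(a, 3, p) + 27 * pow(b, 2, p)) % p != 0


def on_curve(x: int, y: int, a: int, b: int, p: int) -> bool:
    """True when (x, y) satisfies y^2 = x^3 + ax + b over F_p."""
    return (y * y - (x * x * x + a * x + b)) % p == 0


# Toy parameters (hypothetical, illustration only).
P, A, B = 17, 0, 7

curve_ok = is_nonsingular(A, B, P)      # the curve y^2 = x^3 + 7 over F_17
point_ok = on_curve(1, 5, A, B, P)      # 5^2 = 25 = 8 and 1^3 + 7 = 8 (mod 17)
```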

SLIDE 7

EC Group

The points $(x, y)$ on the curve, together with the point at infinity $0_\infty$ (the neutral element for addition), form an abelian group.

Jacobian coordinates: the triple $(x : y : z)$ maps to the affine point $(x/z^2, y/z^3)$ if $z \neq 0$. If $z = 0$, it is $0_\infty$. The curve equation becomes $y^2 = x^3 + axz^4 + bz^6$.
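The Jacobian mapping can be sketched as follows. This is an illustrative helper pair, not code from the presentation: the names `to_jacobian` and `to_affine` are hypothetical, `None` is an assumed stand-in for the point at infinity, and the toy prime 17 is chosen only to keep the numbers readable.

```python
def to_jacobian(x, y, z, p):
    """Lift affine (x, y) to the Jacobian triple (x*z^2 : y*z^3 : z), z != 0."""
    return (x * z * z % p, y * z * z * z % p, z % p)


def to_affine(X, Y, Z, p):
    """Map (X : Y : Z) to (X/Z^2, Y/Z^3); None stands for the point at infinity."""
    if Z % p == 0:
        return None
    z_inv = pow(Z, -1, p)          # modular inverse (Python 3.8+)
    z_inv2 = z_inv * z_inv % p
    return (X * z_inv2 % p, Y * z_inv2 * z_inv % p)


p = 17
triple = to_jacobian(1, 5, 3, p)   # one of many representatives of (1, 5)
back = to_affine(*triple, p)
```

Every nonzero z gives a different triple for the same affine point, which is what lets the group formulas postpone the costly field inversion until the very end.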

SLIDE 9

Operations in Jacobian coordinates ($a = 0$, points $\neq 0_\infty$)

Doubling (7S + 5M + 13A): $T(X_T : Y_T : Z_T) = 2 \cdot Q(X_Q : Y_Q : Z_Q)$.

$X_T = 9X_Q^4 - 8X_Q Y_Q^2$,

$Y_T = 3X_Q^2 (4X_Q Y_Q^2 - X_T) - 8Y_Q^4$,

$Z_T = 2Y_Q Z_Q$.

Addition (4S + 14M + 6A): $R = T + Q$, where the formulas take $Q$ with $Z_Q = 1$.

$X_R = (2Y_Q Z_T^3 - 2Y_T)^2 - 4(X_Q Z_T^2 - X_T)^3 - 8(X_Q Z_T^2 - X_T)^2 X_T$,

$Y_R = (2Y_Q Z_T^3 - 2Y_T)(4X_T (X_Q Z_T^2 - X_T)^2 - X_R) - 8Y_T (X_Q Z_T^2 - X_T)^3$,

$Z_R = 2Z_T (X_Q Z_T^2 - X_T)$.
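A small sketch can make the doubling and addition formulas concrete. It uses the standard a = 0 Jacobian formulas, in which the doubling contains the term 4·X_Q·Y_Q² and the addition contains the squared factor (X_Q·Z_T² − X_T)², with Q given in affine form. The toy curve y² = x³ + 7 over F_17 and all function names are assumptions for illustration only.

```python
P = 17  # toy prime (hypothetical); curve y^2 = x^3 + 7, so a = 0


def double(Q):
    """Jacobian doubling T = 2Q for a = 0."""
    X, Y, Z = Q
    XT = (9 * X**4 - 8 * X * Y**2) % P
    YT = (3 * X**2 * (4 * X * Y**2 - XT) - 8 * Y**4) % P
    ZT = (2 * Y * Z) % P
    return (XT, YT, ZT)


def add_mixed(T, Q):
    """R = T + Q with T Jacobian and Q affine (Z_Q = 1), T != +-Q."""
    XT, YT, ZT = T
    xQ, yQ = Q
    h = (xQ * ZT**2 - XT) % P            # X_Q Z_T^2 - X_T
    r = (2 * yQ * ZT**3 - 2 * YT) % P    # 2 Y_Q Z_T^3 - 2 Y_T
    XR = (r * r - 4 * h**3 - 8 * h * h * XT) % P
    YR = (r * (4 * XT * h * h - XR) - 8 * YT * h**3) % P
    ZR = (2 * ZT * h) % P
    return (XR, YR, ZR)


def to_affine(point):
    """Convert a Jacobian triple with Z != 0 back to affine coordinates."""
    X, Y, Z = point
    z_inv = pow(Z, -1, P)
    return (X * z_inv * z_inv % P, Y * z_inv**3 % P)


Q = (1, 5)                       # affine point on the toy curve
two_Q = double((1, 5, 1))        # Jacobian 2Q
three_Q = add_mixed(two_Q, Q)    # Jacobian 3Q = 2Q + Q
```

Converting `two_Q` and `three_Q` back to affine coordinates gives the same points as the textbook affine chord-and-tangent rules, without any inversion inside the group operations themselves.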

SLIDE 12

Montgomery Modular Multiplication (MMM)

MMM: Montgomery Modular Multiplication provides an efficient way to multiply modulo $p$ (the Montgomery product is noted $\cdot$): there is no division by $p$.

Residue: let $a, b, R \in \mathbb{F}_p$, where $R$ is Montgomery's residue. $a' = aR \bmod p$ is said to be $a$ in Montgomery's form. The Montgomery product is $a \cdot b = abR^{-1} \bmod p$; as a consequence, $a' \cdot b' = aR \, bR \, R^{-1} \bmod p = abR \bmod p = (ab)'$.

Conversion: field values are converted to Montgomery's form at the beginning of the computation and back to normal form at the end.
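The identity $a' \cdot b' = (ab)'$ can be checked numerically with a short sketch. The values p = 97 and R = 128 are assumptions picked for readability, and `mont_mul` here is only a reference definition built on an ordinary modular reduction; it does not show how a division-free MMM is computed.

```python
# Toy parameters, chosen for illustration only.
p = 97                  # small prime modulus
R = 128                 # assumed choice: a power of two larger than p, coprime to p
R_INV = pow(R, -1, p)   # R^-1 mod p (Python 3.8+)


def to_mont(a):
    """a -> a' = a*R mod p."""
    return a * R % p


def from_mont(a_m):
    """a' -> a = a'*R^-1 mod p."""
    return a_m * R_INV % p


def mont_mul(x, y):
    """Reference Montgomery product x*y*R^-1 mod p."""
    return x * y * R_INV % p


a, b = 42, 77
product_m = mont_mul(to_mont(a), to_mont(b))   # result stays in Montgomery form
```

Because the product of two Montgomery-form values is again in Montgomery form, a long chain of multiplications only pays for the conversion twice: once on the way in and once on the way out.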

SLIDE 13

How to compute MMM?

Koç’s multiword CIOS algorithm
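The algorithm figure is not preserved in this transcript. As a rough sketch, the commonly published form of the Coarsely Integrated Operand Scanning (CIOS) method can be written as below; it may differ in detail from the slide. Operands are split into s words of w bits, R = 2^(s·w), and Python integers stand in for hardware words. The helper names (`to_words`, `from_words`, `cios`) are hypothetical.

```python
def to_words(x, s, w):
    """Split x into s little-endian words of w bits."""
    mask = (1 << w) - 1
    return [(x >> (w * i)) & mask for i in range(s)]


def from_words(words, w):
    """Reassemble little-endian w-bit words into an integer."""
    return sum(word << (w * i) for i, word in enumerate(words))


def cios(a, b, n, s, w):
    """Return a*b*R^-1 mod n with R = 2^(s*w), for odd n < R and a, b < n."""
    mask = (1 << w) - 1
    A, B, N = (to_words(v, s, w) for v in (a, b, n))
    n0_inv = (-pow(N[0], -1, 1 << w)) & mask    # -n^-1 mod 2^w, precomputed
    t = [0] * (s + 2)
    for i in range(s):
        # Multiplication step: t += A * B[i]
        carry = 0
        for j in range(s):
            acc = t[j] + A[j] * B[i] + carry
            t[j], carry = acc & mask, acc >> w
        acc = t[s] + carry
        t[s], t[s + 1] = acc & mask, acc >> w
        # Reduction step: add m*N so the low word becomes zero, then shift one word
        m = (t[0] * n0_inv) & mask
        carry = (t[0] + m * N[0]) >> w
        for j in range(1, s):
            acc = t[j] + m * N[j] + carry
            t[j - 1], carry = acc & mask, acc >> w
        acc = t[s] + carry
        t[s - 1] = acc & mask
        t[s] = t[s + 1] + (acc >> w)
    result = from_words(t[: s + 1], w)
    return result - n if result >= n else result   # final conditional subtraction


small = cios(5, 7, 13, 1, 4)   # 5 * 7 * 16^-1 mod 13
```

The multiplication and reduction steps are interleaved inside one outer loop, so the intermediate value never needs more than s + 2 words, and the only "division" is a word shift.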

SLIDE 14

CIOS details

SLIDE 15

Benefits

Low memory footprint (apart from a few precomputed values: p′, R...), easy to change p and the operand sizes, neat structure without divisions, easy to implement in hardware.

SLIDE 17

Basics

Here, each operation takes 1 unit of time. Let's compute r = a · b + b + c.

Sequential

Time | · | + | Operations
1    | x |   | t1 = a · b
2    |   | x | t2 = b + c
3    |   | x | r = t1 + t2

Parallel

Time | · | + | Operations
1    | x | x | t1 = a · b, t2 = b + c
2    |   | x | r = t1 + t2

SLIDE 19

Basics - 2

Here, each operation takes 1 unit of time. Let's compute r = a · b + b + c.

Atomic

Latency | Throughput | · | + | Operations
2       | 0.5        | 1 | 2 | r = a · b + b + c

Pipelined

Latency | Throughput | · | + | Operations
2 + ε   | 0.5        | 1 | 1 | 1: t1 = a · b, t2 = b + c; 2: r = t1 + t2
2 + ε   | 1          | 1 | 2 | 1: t1 = a · b, t2 = b + c; 2: r = t1 + t2

The choice of operations and how they are chained together is called scheduling.
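The throughput figures in the table can be reproduced with a deliberately simplified resource model. It is an assumption of this sketch, not something stated in the slides: a pipelined design can start a new computation every `interval` time units, where the interval is set by the busiest class of operator, while an atomic unit is blocked for its whole latency. The small ε overhead is ignored and the function names are hypothetical.

```python
from math import ceil


def pipelined_throughput(muls_needed, adds_needed, multipliers, adders):
    """Results per time unit when successive computations may overlap.

    The start-to-start interval is bounded by the most heavily shared
    resource: each operator performs one operation per time unit.
    """
    interval = max(ceil(muls_needed / multipliers), ceil(adds_needed / adders))
    return 1 / interval


def atomic_throughput(latency):
    """An atomic unit accepts new inputs only after the previous result is out."""
    return 1 / latency


# r = a*b + b + c needs 1 multiplication and 2 additions.
one_adder = pipelined_throughput(1, 2, multipliers=1, adders=1)
two_adders = pipelined_throughput(1, 2, multipliers=1, adders=2)
atomic = atomic_throughput(2)
```

With a single adder, both pipeline stages compete for it, so a result leaves only every second time unit; a second adder removes that conflict.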

SLIDE 20

Systolic arrays

A systolic array is an architecture that is both parallel and pipelined. To build one, we have to identify small Processing Elements (PEs) that contain no control-flow logic.

SLIDE 21

Where is Waldo the PE?

SLIDE 22

α

SLIDE 23

αf

SLIDE 24

β

SLIDE 25

γ

SLIDE 26

γf

SLIDE 27

S=8, Time=1

SLIDE 28

S=8, Time=2

SLIDE 29

S=8, Time=3

SLIDE 30

S=8, Time=4

SLIDE 31

S=8, Time=10

SLIDE 32

S=8, Time=10

SLIDE 33

S=8, Time=13

SLIDE 34

S=8, All

SLIDE 35

Alpha

SLIDE 36

Gamma

SLIDE 37

Resources

Our architecture requires: 3 α, 3 γ, 1 β, 1 αf, 1 γf.

SLIDE 38

Regrouping

SLIDE 39

Block diagram

SLIDE 40

MMM architecture variants

CIOS (bits per word)        | s=8 | s=16 | s=32 | s=64
K=256                       | 32  | 16   | 8    | 4
K=512                       | 64  | 32   | 16   | 8
K=1024                      | 128 | 64   | 32   | 16
K=2048                      | 256 | 128  | 64   | 32
Clock cycles = 3 × (s + nb) | 33  | 66   | 132  | 264
Number of cells             | 6+3 | 12+3 | 24+3 | 48+3
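The bits-per-word entries in this table are consistent with splitting a K-bit operand into s equal words. That reading is an inference from the numbers, not a statement from the slides, and the function name below is hypothetical. A minimal sketch that regenerates those four rows:

```python
def bits_per_word(k_bits, s_words):
    """Word width when a k-bit operand is split into s equal words."""
    if k_bits % s_words:
        raise ValueError("operand size must be a multiple of the word count")
    return k_bits // s_words


WORD_COUNTS = (8, 16, 32, 64)
OPERAND_SIZES = (256, 512, 1024, 2048)

# One row per operand size K, one column per word count s.
table = {k: [bits_per_word(k, s) for s in WORD_COUNTS] for k in OPERAND_SIZES}
```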

SLIDE 41

ECC results (Artix-7)

Design      | Slice | DSPs | BRAM | Freq | Slice FF | Slice LUT
NW-8 (256)  | 3745  | 33   | 12   | 98   | 8281     | 9722
NW-16 (256) | 3770  | 34   | 12   | 130  | 8313     | 9255
NW-8 (512)  | 7066  | 92   | 23   | 59   | 16500    | 20394
NW-16 (512) | 7116  | 60   | 23   | 74   | 16501    | 19199

SLIDE 42

Conclusion

Very efficient Montgomery Modular Multiplication with low latency. It gives mixed results for a straightforward ECC implementation. Yet improvements are still possible: we should not wait for one MMM to finish completely before starting the next. This should be particularly interesting for latency and throughput.

SLIDE 43

Thank you!

Any questions?

SLIDE 44

ECC results

... et al.     | Curve    | Device    | Lut   | Reg   | Size (DSP)        | Freq.
Bajard 2014    | 256 any  | Kintex-7  | 4250  | 3532  | 1630 slices (46)  | 281
Bajard 2014    | 521 any  | Kintex-7  | 7067  | 5882  | 2565 slices (91)  | 266
Bajard classic | 256 any  | –         | 7482  | 4605  | – slices (46)     | –
Guillermin     | 256 any  | Stratix-2 | –     | –     | 9177 ALM (96)     | 157
Guillermin     | 512 any  | Stratix-2 | –     | –     | 17017 ALM (244)   | 145
Güneysu        | 256 NIST | Virtex-4  | –     | –     | 1715 slices (32)  | 490
Yuan Ma        | 256 any  | Virtex-4  | 5740  | 4876  | 4655 slices (37)  | 250
Yuan Ma        | 256 any  | Virtex-5  | 4177  | 4792  | 1725 slices (37)  | 291
McIvor         | 256 any  | Virtex-II | –     | –     | 15755 slice       | 39
Us NW-8        | 256 any  | Artix-7   | 9722  | 8281  | 3745 slices (33)  | 98
Us NW-8        | 512 any  | Artix-7   | 20394 | 16500 | 7066 slices (92)  | 59
Us NW-16       | 256 any  | Artix-7   | 9255  | 8313  | 3770 slices (34)  | 130
Us NW-16       | 512 any  | Artix-7   | 19199 | 16501 | 7116 slices (60)  | 74
