Junfeng Fan, ESAT/COSIC. ECC implementation methods, Multi-core

ECC implementation methods

Multi-core systems

Coarse-Grained Parallelism (CGP)

Fine-Grained Parallelism (FGP)

Two-Dimensional Parallelism (TDP)

Results

Conclusions

SLIDE 3

Q = k ⋅ P: NAF, window method

P1 + P2, 2 ⋅ P2: projective coordinates, weighted projective coordinates

a + b mod p, a ⋅ b mod p, a⁻¹ mod p: Montgomery/Barrett reduction, Itoh-Tsujii inversion

Fast multiplier, systolic array, super-scalar coprocessor
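To illustrate the NAF recoding named on this slide, here is a minimal Python sketch. The function name `naf` and the least-significant-first digit order are choices made for this example and are not taken from the slides.

```python
def naf(k):
    """Return the non-adjacent form of k >= 0, least significant digit first.

    Every digit is -1, 0 or +1 and no two neighbouring digits are both nonzero.
    """
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)  # +1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d           # k is now divisible by 4, so the next digit is 0
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


if __name__ == "__main__":
    print(naf(7))  # 7 = 8 - 1
```

In a scalar multiplication Q = k ⋅ P, a digit of -1 becomes a point subtraction, which costs about the same as an addition because negating a point only negates its y-coordinate.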

SLIDE 4

[Mentens ]

SLIDE 5

Multi-Core Systems: Cell Processor, AMD quad core, Intel quad core, ARM quad core, Atmel Diopsis. You name it…

SLIDE 6

Advantages: powerful platform, lower clock frequency, energy efficient

Challenges: task partitioning, communication between cores, concurrency management

SLIDE 7

SLIDE 8

SLIDE 9

Singlecore:
t1 = X2 ⋅ X2
t1 = 3t1
t2 = Z2 ⋅ Z2
t2 = t2 ⋅ t2
t2 = at2 + t1
t1 = 2Y2
Z2 = Z2 ⋅ t1
t3 = t1 ⋅ t1
t5 = t2 ⋅ t2
t4 = X2 ⋅ t3
t1 = t1 ⋅ t3
X2 = 2t4
t1 = t1 ⋅ Y2
X2 = t5 − X2
t3 = t4 − X2
Y2 = t2 ⋅ t3 − t1

Multicore (core1, core2, core3):
… … …
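The operation sequence on this slide can be checked in software. It appears to be the usual point doubling in weighted projective (Jacobian) coordinates, where x = X/Z² and y = Y/Z³. The sketch below follows the listed steps one by one and compares the result with the textbook affine doubling formula. The function names and the toy curve y² = x³ + 2x + 3 over GF(97) are assumptions made for the example only.

```python
def jacobian_double(X, Y, Z, a, p):
    """Double the point (X, Y, Z) on y^2 = x^3 + a*x + b over GF(p).

    The statements mirror the single-core sequence on the slide.
    """
    t1 = X * X % p
    t1 = 3 * t1 % p
    t2 = Z * Z % p
    t2 = t2 * t2 % p
    t2 = (a * t2 + t1) % p      # 3*X^2 + a*Z^4
    t1 = 2 * Y % p
    Z = Z * t1 % p              # new Z = 2*Y*Z
    t3 = t1 * t1 % p            # 4*Y^2
    t5 = t2 * t2 % p
    t4 = X * t3 % p             # 4*X*Y^2
    t1 = t1 * t3 % p            # 8*Y^3
    X = 2 * t4 % p
    t1 = t1 * Y % p             # 8*Y^4
    X = (t5 - X) % p            # new X
    t3 = (t4 - X) % p
    Y = (t2 * t3 - t1) % p      # new Y
    return X, Y, Z


def to_affine(X, Y, Z, p):
    """Convert Jacobian (X, Y, Z) to affine (x, y); needs Python 3.8+ for pow(Z, -1, p)."""
    zi = pow(Z, -1, p)
    return X * zi * zi % p, Y * zi * zi * zi % p


def affine_double(x, y, a, p):
    """Reference doubling with the affine chord-and-tangent formula."""
    lam = (3 * x * x + a) * pow(2 * y, -1, p) % p
    x3 = (lam * lam - 2 * x) % p
    return x3, (lam * (x - x3) - y) % p


if __name__ == "__main__":
    p, a = 97, 2                # toy parameters, assumed for illustration
    print(to_affine(*jacobian_double(3, 6, 1, a, p), p))
    print(affine_double(3, 6, a, p))
```

The sequence contains several multiplications that do not depend on each other, for example t3 = t1 ⋅ t1 and t5 = t2 ⋅ t2, which is what makes a distribution over several cores possible.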

SLIDE 10

Singlecore:
t1 = X2 ⋅ X2
t1 = 3t1
t2 = Z2 ⋅ Z2
t2 = t2 ⋅ t2
t2 = at2 + t1
t1 = 2Y2
Z2 = Z2 ⋅ t1
t3 = t1 ⋅ t1
…

Multicore: core1, core2, core3

How to efficiently perform modular multiplications with a multicore system?

SLIDE 11

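Mont(X, Y, M) = X ⋅ Y ⋅ R⁻¹ mod M

[Figure: Montgomery multiplication data flow. Inputs X, Y; intermediate values C_1, T = C_0 ⋅ M’ mod R, Z_1; modulus M; output X ⋅ Y ⋅ R⁻¹ mod M]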
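A minimal Python sketch of Mont(X, Y, M) may help to read the diagram. It assumes the customary definitions that the slide does not spell out: R is a power of two larger than M, M is odd, and M' = −M⁻¹ mod R. The function and parameter names (`mont`, `r_bits`) are chosen for this example.

```python
def mont(X, Y, M, r_bits):
    """Return X * Y * R^-1 mod M with R = 2**r_bits (M odd, M < R, X and Y < M)."""
    R = 1 << r_bits
    M_prime = (-pow(M, -1, R)) % R   # assumed definition: M' = -M^-1 mod R
    C = X * Y
    T = (C % R) * M_prime % R        # T = C_0 * M' mod R, C_0 is the low half of C
    Z = (C + T * M) >> r_bits        # C + T*M is a multiple of R, so the shift is exact
    return Z - M if Z >= M else Z    # single conditional subtraction


if __name__ == "__main__":
    M, r_bits = 97, 8
    R = 1 << r_bits
    a, b = 45, 76
    # Move into the Montgomery domain, multiply, and move back.
    a_m, b_m = a * R % M, b * R % M
    prod_m = mont(a_m, b_m, M, r_bits)
    print(mont(prod_m, 1, M, r_bits), a * b % M)
```

The point of the construction is that the only divisions are by R, which are shifts and masks in binary hardware, so no trial division by M is needed.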

SLIDE 12

512-bit MMM on a 16-bit core

[Figure: word-level Montgomery modular multiplication. Operand words x31 … x0 and modulus words m31 … m0 are combined with T into carry words c32 … c0 and result words z32 … z0, using mod and div steps]

SLIDE 13

[Figure: one iteration of the word-level MMM. Each word position j multiplies xj by yi and mj by T and adds the products to the carry words c0 … c32]
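The iteration shown in this figure can be sketched as a word-serial Montgomery multiplication. The code below is a common textbook formulation with 16-bit words, not necessarily the exact schedule used in the talk; the names `mont_mul_words`, `W` and `m_inv` are assumptions made for the example.

```python
W = 16                      # word size in bits, as on a 16-bit core
MASK = (1 << W) - 1


def to_words(v, n):
    """Split v into n words of W bits, least significant word first."""
    return [(v >> (W * j)) & MASK for j in range(n)]


def mont_mul_words(x, y, m, n=32):
    """Return x * y * 2^(-W*n) mod m for odd m < 2^(W*n) and x, y < m."""
    xw, yw, mw = to_words(x, n), to_words(y, n), to_words(m, n)
    m_inv = (-pow(m, -1, 1 << W)) & MASK   # -m^-1 mod 2^W (Python 3.8+)
    c = [0] * (n + 1)                      # accumulator words c0 .. cn
    for i in range(n):
        yi = yw[i]
        # T makes the lowest word of c + x*yi + m*T equal to zero.
        t = ((c[0] + xw[0] * yi) * m_inv) & MASK
        carry = 0
        for j in range(n):
            s = c[j] + xw[j] * yi + mw[j] * t + carry
            if j > 0:
                c[j - 1] = s & MASK        # store one word lower: division by 2^W
            carry = s >> W                 # "div" step, the stored word is the "mod" step
        s = c[n] + carry
        c[n - 1] = s & MASK
        c[n] = s >> W
    z = sum(w << (W * j) for j, w in enumerate(c))
    return z - m if z >= m else z


if __name__ == "__main__":
    m = (1 << 512) - 569               # any odd 512-bit modulus works for the demo
    x, y = pow(7, 12345, m), pow(11, 54321, m)
    print(mont_mul_words(x, y, m) == x * y * pow(1 << 512, -1, m) % m)
```

With n = 32 words of 16 bits the operands are 512 bits wide, which matches the case on the previous slide. The inner loop is a single carry chain, and that chain is what a multi-core version has to cut.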

SLIDE 14

[Figure: MMM iterations on core1, core2, core3, core4. Each row holds T and the carry words c0, c1, … c31, c32; the mod and div steps are applied in each iteration]

Note:

1. Carry is used in the local core.

SLIDE 15

[Figure: the carry words split across four cores. core1 holds c0 … c7, core2 holds c8 … c15, core3 holds c16 … c23, core4 holds c24 … c31, c32. Carry_7, Carry_15 and Carry_23 mark the core boundaries; the mod and div steps are applied in each iteration]

Carry is not propagated!

Note:

1. Carry is used in the local core.
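One way to read "Carry is not propagated" is sketched below as a sequential Python simulation: the word chain is cut into one block per core, and the carry out of each block is kept by the same core instead of rippling into the neighbouring block. Because every iteration also shifts the accumulator down by one word, that carry ends up at the top word of the block that produced it. This is an interpretation of the slide under those assumptions, not the author's exact schedule; the names `mont_mul_split`, `blk` and `nxt` are hypothetical.

```python
W = 16
MASK = (1 << W) - 1


def words(v, n):
    """Split v into n words of W bits, least significant word first."""
    return [(v >> (W * j)) & MASK for j in range(n)]


def mont_mul_split(x, y, m, n=32, cores=4):
    """Return x * y * 2^(-W*n) mod m; n must be divisible by cores, m odd, x, y < m."""
    blk = n // cores
    xw, yw, mw = words(x, n), words(y, n), words(m, n)
    m_inv = (-pow(m, -1, 1 << W)) & MASK
    c = [0] * n                          # entries may temporarily exceed one word
    for i in range(n):
        yi = yw[i]
        t = ((c[0] + xw[0] * yi) * m_inv) & MASK
        nxt = [0] * n
        for k in range(cores):           # each pass models the work of one core
            carry = 0                    # the chain restarts at every block boundary
            for j in range(k * blk, (k + 1) * blk):
                s = c[j] + xw[j] * yi + mw[j] * t + carry
                if j > 0:
                    # The first word of a block lands in the left neighbour's block:
                    # this is the only data word passed between cores.
                    nxt[j - 1] += s & MASK
                carry = s >> W
            # Block carry-out (Carry_7, Carry_15, ...) stays in the same core.
            nxt[(k + 1) * blk - 1] += carry
        c = nxt
    z = sum(w << (W * j) for j, w in enumerate(c))   # resolves the deferred carries
    return z - m if z >= m else z


if __name__ == "__main__":
    m = (1 << 512) - 569
    x, y = pow(7, 12345, m), pow(11, 54321, m)
    print(mont_mul_split(x, y, m) == x * y * pow(1 << 512, -1, m) % m)
```

The blocks inside one iteration have no data dependency on each other, so the four passes of the `k` loop could run concurrently; only one word per core boundary moves between neighbours in each iteration.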

SLIDE 16

Multicore system (core1, core2, core3, core3)

…

t1 = 3t1, t2 = t2 ⋅ t2

…

Z2 = Z2 ⋅ t1, t3 = t1 ⋅ t1, t5 = t2 ⋅ t2

…

SLIDE 17

[Figure: allocation of MMMs to the cores. Panels: "Four Vertical Parallelism", "Two MMMs in parallel", "Three MMMs in parallel". Boxes are labelled core, MMM and unused]

SLIDE 18

[Figure: performance [msec] by strategy for parallelism, split into Inversion and PA/PD chain. Bar values in order for case I, case II, case III, case IV: 13.4, 10.2, 9.9, 14.5]

Case I: only CGP; case II: TDP with up to three-way CGP; case III: TDP with up to two-way CGP; case IV: only FGP.

bit ECC on the prototype processor with four 32-bit cores

SLIDE 19

Conclusions

We describe a parallel computing method for ECC.

By using two-dimensional parallelism, it is % times faster than using only coarse-grained parallelism.

Applicable to other PKCs.

Future work

Apply this method to off-the-shelf multi-core processors.

Improve the performance further with algorithm-architecture co-design methods.

SLIDE 20