Junfeng Fan ESAT/COSIC ECC implementation methods Multi-core - - PowerPoint PPT Presentation



SLIDE 1

Junfeng Fan ESAT/COSIC

SLIDE 2

  • ECC implementation methods
  • Multi-core systems
  • Coarse-Grained Parallelism (CGP)
  • Fine-Grained Parallelism (FGP)
  • Two Dimensional Parallelism (TDP)
  • Results
  • Conclusions

SLIDE 3

  • Q = k ⋅ P: NAF, window method
  • P1 + P2, 2 ⋅ P2: projective coordinates, weighted projective coordinates
  • a + b mod p, a * b mod p, a^{-1} mod p: Montgomery/Barrett reduction, Itoh-Tsujii inversion
  • Fast multiplier, systolic array, super-scalar coprocessor
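As an illustration of the top level (Q = k ⋅ P with NAF recoding), here is a minimal Python sketch. The function names `naf`, `scalar_mult` and `int_mult` are hypothetical, and plain integers stand in for curve points so that the example stays self-contained; a real implementation would pass the curve's point addition and doubling instead.

```python
def naf(k):
    """Non-adjacent form of k >= 0, least significant digit first.

    Digits are in {-1, 0, 1} and no two neighbouring digits are both non-zero.
    """
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)  # +1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


def scalar_mult(k, P, add, double, neg, zero):
    """Left-to-right signed-digit double-and-add over any additive group."""
    Q = zero
    for d in reversed(naf(k)):
        Q = double(Q)
        if d == 1:
            Q = add(Q, P)
        elif d == -1:
            Q = add(Q, neg(P))
    return Q


def int_mult(k, P):
    """Demo only: integers under addition play the role of curve points."""
    return scalar_mult(
        k, P,
        add=lambda a, b: a + b,
        double=lambda a: 2 * a,
        neg=lambda a: -a,
        zero=0,
    )
```

Because NAF digits are never adjacent, roughly one digit in three is non-zero on average, so fewer point additions are needed than with plain binary double-and-add.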

SLIDE 4

[Mentens ]

SLIDE 5

Multi-Core Systems

  • Cell Processor
  • AMD quad core
  • Intel quad core
  • ARM quad core
  • Atmel Diopsis
  • You name it…

SLIDE 6

Advantages

  • Powerful platform
  • Lower clock frequency
  • Energy efficient

Challenges

  • Task partitioning
  • Communication between cores
  • Concurrency management

SLIDE 7

SLIDE 8

SLIDE 9

Singlecore:

t1 = X2 ⋅ X2
t1 = 3t1
t2 = Z2 ⋅ Z2
t2 = t2 ⋅ t2
t2 = a ⋅ t2 + t1
t1 = 2Y2
Z2 = Z2 ⋅ t1
t3 = t1 ⋅ t1
t5 = t2 ⋅ t2
t4 = X2 ⋅ t3
t1 = t1 ⋅ t3
X2 = 2t4
t1 = t1 ⋅ Y2
X2 = t5 − X2
t3 = t4 − X2
Y2 = t2 ⋅ t3 − t1

Multicore (core1, core2, core3): …
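The single-core sequence matches point doubling in Jacobian coordinates on y² = x³ + ax + b. The sketch below transcribes it one operation per line and checks it against the affine doubling formula. The toy curve (p = 97, a = 2, b = 3) and the helper names are assumptions chosen only for the check, and `pow(v, -1, p)` needs Python 3.8 or later.

```python
def double_jacobian(X2, Y2, Z2, a, p):
    """Point doubling, following the operation order listed on the slide."""
    t1 = X2 * X2 % p
    t1 = 3 * t1 % p
    t2 = Z2 * Z2 % p
    t2 = t2 * t2 % p
    t2 = (a * t2 + t1) % p      # 3*X^2 + a*Z^4
    t1 = 2 * Y2 % p
    Z2 = Z2 * t1 % p            # new Z = 2*Y*Z
    t3 = t1 * t1 % p            # 4*Y^2
    t5 = t2 * t2 % p
    t4 = X2 * t3 % p            # 4*X*Y^2
    t1 = t1 * t3 % p            # 8*Y^3
    X2 = 2 * t4 % p
    t1 = t1 * Y2 % p            # 8*Y^4
    X2 = (t5 - X2) % p          # new X
    t3 = (t4 - X2) % p
    Y2 = (t2 * t3 - t1) % p     # new Y
    return X2, Y2, Z2


def to_affine(X, Y, Z, p):
    """Convert Jacobian (X, Y, Z) to affine (X/Z^2, Y/Z^3)."""
    zi = pow(Z, -1, p)
    return X * zi * zi % p, Y * zi * zi * zi % p


def double_affine(x, y, a, p):
    """Reference doubling in affine coordinates (requires y != 0)."""
    lam = (3 * x * x + a) * pow(2 * y, -1, p) % p
    x3 = (lam * lam - 2 * x) % p
    y3 = (lam * (x - x3) - y) % p
    return x3, y3


# Toy parameters (assumed for illustration): (3, 6) lies on y^2 = x^3 + 2x + 3 mod 97.
P_MOD, A_COEF = 97, 2
POINT = (3, 6)
```

Of the sixteen lines, nine are modular multiplications (including the product by a); these are the operations that the later slides distribute over the cores.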

SLIDE 10

Singlecore:

t1 = X2 ⋅ X2
t1 = 3t1
t2 = Z2 ⋅ Z2
t2 = t2 ⋅ t2
t2 = a ⋅ t2 + t1
t1 = 2Y2
Z2 = Z2 ⋅ t1
t3 = t1 ⋅ t1
…

Multicore: core1, core2, core3

How to efficiently perform modular multiplications with a multicore system?

SLIDE 11

Mont(X, Y, M) = X ⋅ Y ⋅ R^{-1} mod M

  • X × Y = C = (C_1, C_0)
  • T = C_0 ⋅ M' mod R
  • T × M = Z = (Z_1, Z_0)
  • C + Z, divided by R: X ⋅ Y ⋅ R^{-1} mod M
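The data flow on this slide can be sketched in a few lines of Python. This is the textbook Montgomery product, assuming R = 2^n_bits, M odd with M < R, X, Y < M, and M' = −M^{-1} mod R; the function names are hypothetical.

```python
def montgomery_setup(M, n_bits):
    """Return (R, M') with R = 2^n_bits and M' = -M^{-1} mod R (M must be odd)."""
    R = 1 << n_bits
    M_prime = (-pow(M, -1, R)) % R
    return R, M_prime


def mont(X, Y, M, n_bits):
    """Mont(X, Y, M) = X * Y * R^{-1} mod M, for X, Y < M < R."""
    R, M_prime = montgomery_setup(M, n_bits)
    C = X * Y                      # full product (C_1, C_0)
    C_0 = C & (R - 1)              # low half: C mod R
    T = (C_0 * M_prime) & (R - 1)  # T = C_0 * M' mod R
    Z = T * M                      # (Z_1, Z_0); Z_0 cancels C_0 modulo R
    result = (C + Z) >> n_bits     # exact division by R
    if result >= M:                # result < 2M, so one subtraction suffices
        result -= M
    return result
```

To multiply ordinary residues, operands are first mapped to Montgomery form (x ⋅ R mod M), multiplied with `mont`, and mapped back with `mont(v, 1, M, n_bits)`.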
SLIDE 12

512-bit MMM on a 16-bit core, one iteration for the word yi:

  • x31 x30 … x1 x0, multiplied by yi
  • c32 c31 c30 … c1 c0
  • T, obtained with a mod step
  • m31 m30 … m1 m0, multiplied by T
  • z32 z31 z30 … z1 z0
  • c31 c30 c29 … c0, with c32 on top, after the mod and div steps
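For a 512-bit operand on a 16-bit core there are 32 words per operand and 32 iterations, one per word yi. The following is a sketch of a standard word-serial (interleaved) Montgomery multiplication with those parameters; it may differ in detail from the algorithm on the slide, and the names `to_words`, `from_words` and `mmm_word_serial` are hypothetical.

```python
W = 16  # word size in bits
S = 32  # words per operand: 32 * 16 = 512 bits


def to_words(v, s=S, w=W):
    """Split v into s words of w bits, least significant first."""
    return [(v >> (w * j)) & ((1 << w) - 1) for j in range(s)]


def from_words(words, w=W):
    """Inverse of to_words."""
    return sum(word << (w * j) for j, word in enumerate(words))


def mmm_word_serial(x, y, m, s=S, w=W):
    """x * y * 2^(-w*s) mod m, for odd m and x, y < m < 2^(w*s)."""
    mask = (1 << w) - 1
    m_prime = (-pow(m, -1, 1 << w)) % (1 << w)  # -m^{-1} mod 2^w
    xw, yw, mw = to_words(x, s, w), to_words(y, s, w), to_words(m, s, w)
    c = [0] * (s + 1)  # c0 ... c_s
    for i in range(s):
        yi = yw[i]
        # T makes the lowest word of c + x*yi + m*T equal to zero.
        t = ((c[0] + xw[0] * yi) * m_prime) & mask
        acc = c[0] + xw[0] * yi + mw[0] * t
        carry = acc >> w  # low word is zero and is dropped (division by 2^w)
        for j in range(1, s):
            acc = c[j] + xw[j] * yi + mw[j] * t + carry
            c[j - 1] = acc & mask  # store one position lower
            carry = acc >> w
        acc = c[s] + carry
        c[s - 1] = acc & mask
        c[s] = acc >> w
    result = from_words(c, w)
    if result >= m:
        result -= m
    return result
```

The inner loop over j is a single carry chain; that chain is what a fine-grained scheme has to cut when the words are spread over several cores.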

SLIDE 13

[Figure: word-level data flow of one iteration. Each position j = 0 … 31 forms the products xj × yi and mj × T and adds them into the running words, giving c0, c1, …, c31, c32.]

SLIDE 14

[Figure: core1, core2, core3, core4, each shown with a row T, c0, c1, …, c31, c32. Each iteration has a mod step and a div step.]

Note:

  • 1. Carry is used in the local core.

SLIDE 15

[Figure: the words are split across four cores. Each iteration has a mod step and a div step.]

  • core1: T, c0 c1 … c7
  • core2: c8 c9 … c15
  • core3: c16 c17 … c23
  • core4: c24 c25 … c31 c32

Carry_7, Carry_15, Carry_23: carry is not propagated!

Note:

  • 1. Carry is used in the local core.
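One way to see why the carries at the core boundaries need not be propagated is the simulation below. It is a sketch under assumptions, not necessarily the scheme used on the slide: each core keeps its eight words in one local accumulator that is allowed to grow a few bits beyond its segment, and the only inter-core traffic per iteration is T and the lowest word handed to the lower neighbour. The name `mmm_fine_grained` and the returned `max_carry` diagnostic are hypothetical.

```python
def mmm_fine_grained(x, y, m, s=32, w=16, cores=4):
    """Simulate x * y * 2^(-w*s) mod m with the words split over `cores` cores.

    Returns (result, max_carry), where max_carry is the largest value ever
    held above a core's own segment (the carry that is never sent onward).
    """
    assert s % cores == 0
    per = s // cores                 # words per core
    B = 1 << w
    seg_bits = w * per
    seg_mask = (1 << seg_bits) - 1
    top_shift = w * (per - 1)        # position of a core's top word
    m_prime = (-pow(m, -1, B)) % B

    xs = [(x >> (seg_bits * p)) & seg_mask for p in range(cores)]
    ms = [(m >> (seg_bits * p)) & seg_mask for p in range(cores)]
    acc = [0] * cores                # one local accumulator per core
    max_carry = 0
    for i in range(s):
        yi = (y >> (w * i)) & (B - 1)
        # core1 derives T from its own lowest word and broadcasts it.
        t = ((acc[0] + xs[0] * yi) * m_prime) % B
        # Every core updates its segment independently.
        acc = [acc[p] + xs[p] * yi + ms[p] * t for p in range(cores)]
        lo = [a % B for a in acc]    # lowest word of each segment
        hi = [a // B for a in acc]   # remainder, local carry included
        # Divide by 2^w: each core takes its upper neighbour's lowest word
        # as its new top word; lo[0] is zero by the choice of T.
        acc = [hi[p] + ((lo[p + 1] << top_shift) if p + 1 < cores else 0)
               for p in range(cores)]
        max_carry = max(max_carry, max(a >> seg_bits for a in acc))
    result = sum(a << (seg_bits * p) for p, a in enumerate(acc))
    if result >= m:
        result -= m
    return result, max_carry
```

In this model the shift by one word in every iteration pulls the local carry back into the core's own top word, so it stays bounded by a couple of bits and never has to ripple through the neighbouring cores.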

SLIDE 16

Multicore system (core1, core2, core3, core4)

…
t1 = 3t1, t2 = t2 ⋅ t2
…
Z2 = Z2 ⋅ t1, t3 = t1 ⋅ t1, t5 = t2 ⋅ t2
…

SLIDE 17

[Figure: ways to map modular multiplications (MMMs) onto the four cores. Cores not assigned to an MMM are marked "unused".]

  • (a) Four Vertical Parallelism
  • (b) Three MMMs in parallel
  • (c) Two MMMs in parallel
  • (d) Single MMM
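A hypothetical rule for choosing among these configurations is to split the cores evenly over the modular multiplications that are ready to run at the same time. The function below is an illustrative assumption only; the slides do not state how the configuration is selected.

```python
def allocate_cores(ready_mmms, total_cores=4):
    """Split total_cores evenly over the MMMs that can start now.

    Returns (cores_per_mmm, mmms_started, unused_cores).
    """
    if ready_mmms <= 0:
        return 0, 0, total_cores
    started = min(ready_mmms, total_cores)   # at most one MMM per core
    cores_per_mmm = total_cores // started   # fine-grained width of each MMM
    unused = total_cores - cores_per_mmm * started
    return cores_per_mmm, started, unused
```

With four cores this gives four cores on a single MMM, two cores each for two MMMs, and one core each for three MMMs with one core left unused.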

SLIDE 18

[Figure: Performance [msec] by strategy for parallelism, case I to case IV. Legend: Inversion, PA/PD chain. Values shown: 13.4, 10.2, 9.9, 14.5.]

Case I: only CGP; case II: TDP with up to three-way CGP; case III: TDP with up to two-way CGP; case IV: only FGP. …-bit ECC on the prototype processor with four 32-bit cores.

SLIDE 19

Conclusions

We describe a parallel computing method for ECC.

By using two-dimensional parallelism, it is …% faster than using only coarse-grained parallelism.

Applicable to other PKCs.

Future work

Apply this method to off-the-shelf multi-core processors.

Improve the performance further with algorithm-architecture co-design methods.

SLIDE 20