Junfeng Fan ESAT/COSIC ECC implementation methods Multi-core - - PowerPoint PPT Presentation
Junfeng Fan, ESAT/COSIC
Outline:
- ECC implementation methods
- Multi-core systems
- Coarse-Grained Parallelism (CGP)
- Fine-Grained Parallelism (FGP)
- Two Dimensional Parallelism (TDP)
- Results
- Conclusions
- Scalar multiplication: Q = k ⋅ P (NAF, window method)
- Point addition and doubling: P1 + P2, 2 ⋅ P2 (projective coordinates, weighted projective coordinates)
- Modular arithmetic: a + b mod p, a ⋅ b mod p, a^-1 mod p (Montgomery/Barrett reduction, Itoh-Tsujii inversion)
- Hardware: fast multiplier, systolic array, super-scalar coprocessor [Mentens ]
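As a small illustration of the top layer (Q = k ⋅ P with NAF), the sketch below recodes a scalar into non-adjacent form. The function name `naf` and the least-significant-digit-first output order are choices made here for illustration, not taken from the slides.

```python
def naf(k):
    """Return the non-adjacent form of k >= 0, least significant digit first."""
    digits = []
    while k > 0:
        if k & 1:
            # d is +1 or -1, chosen so that k - d is divisible by 4;
            # this forces the next digit to be 0.
            d = 2 - (k % 4)
            k -= d
        else:
            d = 0
        digits.append(d)
        k //= 2
    return digits


print(naf(7))  # [-1, 0, 0, 1], i.e. 7 = 8 - 1
```

Each nonzero digit costs one point addition or subtraction in a double-and-add loop, so fewer nonzero digits mean fewer P1 + P2 operations.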
Multi-Core Systems: Cell Processor, AMD quad core, Intel quad core, ARM quad core, Atmel Diopsis. You name it...
Advantages:
- Powerful platform
- Lower clock frequency
- Energy efficient
Challenges:
- Task partitioning
- Communication between cores
- Concurrency management
Singlecore vs. multicore (core1, core2, core3) execution of one operation sequence:
t1 = X2 ⋅ X2
t1 = 3t1
t2 = Z2 ⋅ Z2
t2 = t2 ⋅ t2
t2 = at2 + t1
t1 = 2Y2
Z2 = Z2 ⋅ t1
t3 = t1 ⋅ t1
t5 = t2 ⋅ t2
t4 = X2 ⋅ t3
t1 = t1 ⋅ t3
X2 = 2t4
t1 = t1 ⋅ Y2
X2 = t5 − X2
t3 = t4 − X2
Y2 = t2 ⋅ t3 − t1
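The operation sequence on this slide appears to be the usual point doubling in weighted projective (Jacobian) coordinates. The sketch below runs it sequentially, as a single core would, and checks the result on a toy curve. The function names, the helper `to_affine`, and the curve y^2 = x^3 + 2x + 3 over p = 97 are assumptions chosen for illustration.

```python
def double_point(X2, Y2, Z2, a, p):
    """Run the slide's operation sequence modulo p, in the slide's order."""
    t1 = X2 * X2 % p
    t1 = 3 * t1 % p
    t2 = Z2 * Z2 % p
    t2 = t2 * t2 % p
    t2 = (a * t2 + t1) % p
    t1 = 2 * Y2 % p
    Z2 = Z2 * t1 % p
    t3 = t1 * t1 % p
    t5 = t2 * t2 % p
    t4 = X2 * t3 % p
    t1 = t1 * t3 % p
    X2 = 2 * t4 % p
    t1 = t1 * Y2 % p
    X2 = (t5 - X2) % p
    t3 = (t4 - X2) % p
    Y2 = (t2 * t3 - t1) % p
    return X2, Y2, Z2


def to_affine(P, p):
    """Convert (X, Y, Z) to affine (X / Z^2, Y / Z^3); needs Python 3.8+."""
    X, Y, Z = P
    zi = pow(Z, -1, p)
    return X * zi * zi % p, Y * zi * zi * zi % p


# Toy curve y^2 = x^3 + 2x + 3 over GF(97); (3, 6) lies on it.
print(to_affine(double_point(3, 6, 1, 2, 97), 97))  # (80, 10)
```

Several lines have no data dependency on each other (for example `Z2 * t1`, `t1 * t1` and `t2 * t2`), which is what makes it possible to spread them over core1, core2 and core3.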
How to efficiently perform modular multiplications with a multicore system?
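Mont(X, Y, M) = X ⋅ Y ⋅ R^-1 mod M
Dataflow:
- C = X ⋅ Y, split into C_0 (low part) and C_1 (high part)
- T = C_0 ⋅ M' mod R
- Z = C + T ⋅ M, split into Z_0 (low part) and Z_1 (high part)
- Z_1 ≡ X ⋅ Y ⋅ R^-1 mod M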
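The dataflow above can be sketched with Python integers. This is a minimal illustration, assuming R = 2^n_bits with M odd and M < R, and M' = −M^-1 mod R (the usual definition, not spelled out on the slide). The function name `mont` and its parameter `n_bits` are hypothetical.

```python
def mont(X, Y, M, n_bits):
    """Return X*Y*R^-1 mod M for odd M < R, with R = 2**n_bits and X, Y < M."""
    R = 1 << n_bits
    M_prime = (-pow(M, -1, R)) % R   # assumed: M' = -M^-1 mod R (Python 3.8+)
    C = X * Y
    C0 = C % R                       # low part C_0
    T = (C0 * M_prime) % R
    Z = C + T * M                    # the low part Z_0 is zero by construction
    assert Z % R == 0
    Z1 = Z >> n_bits                 # high part Z_1, smaller than 2*M
    return Z1 - M if Z1 >= M else Z1  # one conditional subtraction


print(mont(5, 7, 97, 8) == (5 * 7 * pow(256, -1, 97)) % 97)  # True
```

The only reductions are modulo R and a division by R, both of which are cheap for a power of two; no division by M is needed.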
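512-bit MMM on a 16-bit core: the operands are split into 32 words of 16 bits. In each iteration i:
- X = (x31 x30 … x1 x0) is multiplied by the word yi and accumulated into (c32 c31 … c1 c0)
- T is derived from c0 (mod)
- T ⋅ (m31 m30 … m1 m0) is added, giving (z32 z31 … z1 z0)
- the result is shifted down by one word (div) and becomes the next (c31 c30 … c0)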
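A word-serial version of the same multiplication can be sketched as follows. This is one possible arrangement under stated assumptions, not necessarily the loop used in the talk: the two accumulations are fused into one pass per iteration, T is computed from c0 + x0 ⋅ yi, and m' = −m0^-1 mod 2^16. The names `to_words` and `mmm_words` are hypothetical.

```python
W = 16          # word size of the core
N = 32          # 512 bits / 16 bits
r = 1 << W


def to_words(v):
    """Split v into N words of W bits, least significant word first."""
    return [(v >> (W * j)) & (r - 1) for j in range(N)]


def mmm_words(X, Y, M):
    """Return X*Y*2^-512 mod M for odd M < 2^512 and X, Y < M."""
    x, y, m = to_words(X), to_words(Y), to_words(M)
    m_prime = (-pow(m[0], -1, r)) % r        # assumed: m' = -m0^-1 mod 2^16
    c = [0] * (N + 1)                        # c0 ... c32
    for i in range(N):
        # T makes the lowest word of C + X*yi + T*M equal to zero.
        T = ((c[0] + x[0] * y[i]) * m_prime) % r
        carry = 0
        z = []
        for j in range(N):
            s = c[j] + x[j] * y[i] + T * m[j] + carry
            z.append(s % r)                  # word:  s mod 2^16
            carry = s // r                   # carry: s div 2^16
        s = c[N] + carry
        z += [s % r, s // r]                 # z32 plus one overflow bit
        c = z[1:]                            # drop z0 (= 0): divide by 2^16
    result = sum(w << (W * j) for j, w in enumerate(c))
    return result - M if result >= M else result
```

The inner loop has a carry chain that runs through all 32 words, which is the dependency a multicore split has to deal with.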
[Figure: word-level datapath. Position j multiplies xj by yi and mj by T and adds cj, for j = 0 … 31; the outputs are c0 … c31 and c32.]
[Figure: core1, core2, core3 and core4, each working on T, c0, c1, …, c31, c32. Each iteration has one mod step and one div step.]
Note:
- 1. Carry is used in the local core.
[Figure: the words are split over four cores. core1 holds T and c0 … c7, core2 holds c8 … c15, core3 holds c16 … c23, core4 holds c24 … c31 and c32. Carry_7, Carry_15 and Carry_23 sit at the core boundaries. Each iteration has one mod step and one div step.]
Carry is not propagated!
Note:
- 1. Carry is used in the local core.
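One way to read "carry is not propagated" is that each core keeps the carry out of its top word and folds it back into that same word in the next iteration, after the one-word shift has lowered its weight. The sketch below simulates that reading sequentially. It is an interpretation under stated assumptions, not the authors' confirmed scheme; `mmm_partitioned`, `pend` and the other names are hypothetical.

```python
W, N, CORES = 16, 32, 4
SEG = N // CORES            # 8 words per core
r = 1 << W


def words(v):
    """Split v into N words of W bits, least significant word first."""
    return [(v >> (W * j)) & (r - 1) for j in range(N)]


def mmm_partitioned(X, Y, M):
    """Return X*Y*2^-512 mod M (odd M < 2^512, X, Y < M), with local carries."""
    x, y, m = words(X), words(Y), words(M)
    m_prime = (-pow(m[0], -1, r)) % r
    c = [0] * N
    pend = [0] * CORES      # pend[k]: saved carry with the weight of core k's top word
    for i in range(N):
        # Assumed: T is computed where c0 lives and then sent to every core.
        T = ((c[0] + x[0] * y[i]) * m_prime) % r
        z = [0] * N
        new_pend = [0] * CORES
        for k in range(CORES):          # the four bodies do not depend on each other
            lo, hi = k * SEG, k * SEG + SEG - 1
            carry = 0                   # no carry comes in from the neighbour
            for j in range(lo, hi + 1):
                s = c[j] + x[j] * y[i] + T * m[j] + carry
                if j == hi:
                    s += pend[k]        # carry saved locally in the last iteration
                z[j], carry = s % r, s // r
            new_pend[k] = carry         # kept in core k, not passed on
        # Shift by one word: each core hands its lowest word to its neighbour.
        c = z[1:] + [0]
        pend = new_pend
    total = sum(w << (W * j) for j, w in enumerate(c))
    total += sum(p << (W * (k * SEG + SEG - 1)) for k, p in enumerate(pend))
    return total - M if total >= M else total
```

The saved carries are resolved only once, after the last iteration, so no carry has to cross a core boundary inside the loop; the only per-iteration traffic is T and one word per boundary.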
Multicore system (core1, core2, core3, core4), operations issued side by side:
…
t1 = 3t1 | t2 = t2 ⋅ t2
…
Z2 = Z2 ⋅ t1 | t3 = t1 ⋅ t1 | t5 = t2 ⋅ t2
…
[Figure: assignment of cores to MMMs on a four-core system, labelled "Vertical Parallelism". Panels: Four MMMs in parallel (title partly garbled), Three MMMs in parallel, (c) Two MMMs in parallel, (d) Single MMM. Cores without an MMM are marked "unused".]
[Figure: Performance [msec] versus strategy for parallelism (case I, case II, case III, case IV), with bars split into Inversion and PA/PD chain. Bar labels: 13.4, 10.2, 9.9, 14.5. Legend fragment: ": only CGP;".]