SLIDE 1

Optimizing multiplications with vector instructions

Chitchanok Chuengsatiansup

INRIA and ENS de Lyon

4 June 2018

Chitchanok Chuengsatiansup Optimizing multiplications with vector instructions 1

SLIDE 4

Introduction

Current position:

  • Postdoc (INRIA and ENS de Lyon)
  • Supervisor: Damien Stehlé

Previous position:

  • PhD student at TU/Eindhoven, The Netherlands
  • Cryptographic Implementations group
  • Thesis: “Optimizing Curve-Based Cryptography”
  • Supervisors: Daniel J. Bernstein and Tanja Lange

Experience:

  • Software implementations
  • Optimizing cryptographic software and algorithms

SLIDE 7

Vectorization speedups

Without vector instructions:

    a + b = a + b

With vector instructions:

    a0      a1      a2      a3
    +       +       +       +
    b0      b1      b2      b3
    =       =       =       =
    a0+b0   a1+b1   a2+b2   a3+b3

A single instruction performs n independent operations on aligned inputs.

SLIDE 9

Side-channel attacks

Prevent software side-channel attacks:

  • constant-time
  • no input-dependent branch
  • no input-dependent array index

Constant-time table-lookup:

  • read entire table
  • select via arithmetic
      if c is 1, select tbl[i]
      if c is 0, ignore tbl[i]

    t = (t · (1 − c)) + (tbl[i] · c)
    t = (t ∧ (c − 1)) ∨ (tbl[i] ∧ (−c))
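The two selection formulas on the side-channel slide can be sketched as follows. This is a minimal illustration, assuming 32-bit table entries; the helper names `is_equal`, `ct_lookup` and `ct_lookup_arith` are hypothetical, and Python integers are not constant-time, so the sketch shows only the arithmetic, not a hardened implementation.

```python
MASK = 0xFFFFFFFF  # assumption: emulate 32-bit machine words


def is_equal(i, j):
    """Return 1 if i == j else 0, using only word arithmetic (no branch)."""
    diff = (i ^ j) & MASK
    # (diff | -diff) has its top bit set exactly when diff is nonzero.
    nonzero = ((diff | -diff) & MASK) >> 31
    return 1 - nonzero


def ct_lookup(tbl, secret_index):
    """Read the entire table; keep tbl[secret_index] via bit masks."""
    t = 0
    for i, entry in enumerate(tbl):
        c = is_equal(i, secret_index)
        keep_mask = (c - 1) & MASK   # all ones if c == 0, zero if c == 1
        take_mask = (-c) & MASK      # all ones if c == 1, zero if c == 0
        t = (t & keep_mask) | (entry & take_mask)
    return t


def ct_lookup_arith(tbl, secret_index):
    """Same selection using t = t*(1 - c) + tbl[i]*c."""
    t = 0
    for i, entry in enumerate(tbl):
        c = is_equal(i, secret_index)
        t = t * (1 - c) + entry * c
    return t
```

Every table entry is touched on every call, so the memory access pattern does not depend on the secret index.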

SLIDE 10

Curve41417


SLIDE 12

Design of Curve41417

  • High-security elliptic curve (security level above $2^{200}$)
  • Defined over prime field $\mathbb{F}_p$ where $p = 2^{414} - 17$
  • In Edwards curve form $x^2 + y^2 = 1 + 3617x^2y^2$
  • Large prime-order subgroup (cofactor 8)
  • IEEE P1363 criteria (large embedding degree, etc.)
  • Twist secure, i.e., twist of Curve41417 also secure

SLIDE 14

ECC arithmetic

Mixed-coordinate systems:

  • doubling: projective X, Y, Z
  • addition: extended X, Y, Z, T

(See https://hyperelliptic.org/EFD/)

Scalar multiplication:

  • signed fixed windows of width w = 5
  • precompute 0P, 1P, 2P, ..., 16P
  • also multiply the T coordinate by d = 3617
  • special first doubling
  • compute T only before addition
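The signed fixed-window recoding with w = 5 can be sketched as below. This is an illustrative assumption about the recoding rule, not the talk's implementation: the function name `signed_windows` is hypothetical, digits are chosen in [−15, 16] so that only 0P, ..., 16P need to be precomputed (a negative digit uses the negated point), and the loop length depends on the scalar, so it is not constant-time.

```python
def signed_windows(k, w=5):
    """Recode a non-negative scalar k into signed base-2^w digits.

    Returns digits d_0, d_1, ... (least significant first) with
    k == sum(d_i * 2**(w*i)) and -(2**(w-1)) < d_i <= 2**(w-1).
    """
    base = 1 << w
    half = base >> 1
    digits = []
    while k > 0:
        d = k % base
        if d > half:
            d -= base          # use a negative digit and carry into the next window
        digits.append(d)
        k = (k - d) >> w
    return digits
```

For example, `signed_windows(31)` gives `[-1, 1]`, since 31 = −1 + 1·32.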

SLIDE 15

Point operations

[Figure: dataflow diagrams of the two point operations, built from additions and multiplications. Point doubling takes x1, y1, z1 and produces x3, y3, z3, t3. Point addition takes x1, y1, z1, t1 and x2, y2, z2, d·t2 and produces x3, y3, z3.]

SLIDE 17

ARM Cortex-A8 vector unit

  • 128-bit vector registers
  • Arithmetic and load/store units can operate in parallel
  • Operate in parallel on vectors of four 32-bit integers or two 64-bit integers
  • Each cycle produces:
      four 32-bit integer additions: a0+b0, a1+b1, a2+b2, a3+b3
      or two 64-bit integer additions: c0+d0, c1+d1
      or one multiply-add instruction: a0b0 + c0
    where ai, bi are 32-bit and ci, di are 64-bit integers

SLIDE 18

Redundant representation

  • Use non-integer radix $2^{414/16} = 2^{25.875}$
  • Decompose integer $f$ modulo $2^{414} - 17$ into 16 integer pieces
  • Write $f$ as

    $f_0 + 2^{26}f_1 + 2^{52}f_2 + 2^{78}f_3 + 2^{104}f_4 + 2^{130}f_5 + 2^{156}f_6 + 2^{182}f_7 + 2^{207}f_8 + 2^{233}f_9 + 2^{259}f_{10} + 2^{285}f_{11} + 2^{311}f_{12} + 2^{337}f_{13} + 2^{363}f_{14} + 2^{389}f_{15}$
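The exponents on the slide coincide with ceil(25.875·i) for i = 0, ..., 15, which gives limbs of 26 or 25 bits. A minimal sketch of packing and unpacking, assuming that rule and using hypothetical names `EXP`, `to_limbs` and `from_limbs`:

```python
P = 2**414 - 17

# ceil(25.875 * i) computed exactly as ceil(207 * i / 8); EXP[16] == 414.
EXP = [-(-207 * i // 8) for i in range(17)]
WIDTH = [EXP[i + 1] - EXP[i] for i in range(16)]   # each limb is 26 or 25 bits


def to_limbs(f):
    """Split f (0 <= f < 2^414) into 16 limbs f_0..f_15."""
    return [(f >> EXP[i]) & ((1 << WIDTH[i]) - 1) for i in range(16)]


def from_limbs(limbs):
    """Recombine limbs (possibly oversized) into an integer modulo P."""
    return sum(x << EXP[i] for i, x in enumerate(limbs)) % P
```

Because the representation is redundant, `from_limbs` also accepts limbs that are larger than their nominal width, as they are between carries.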

SLIDE 30

Carries

Goal: Bring each limb down to 26 or 25 bits

Typical carry chain:

    m0 → m1 → m2 → ··· → m14 → m15 → m0 → m1

Increase throughput:

    m0 → m1 → m2  → m3  → m4  → m5  → m6  → m7  → m8 → m9
    m8 → m9 → m10 → m11 → m12 → m13 → m14 → m15 → m0 → m1

Decrease latency:

    m0  → m1  → m2  → m3  → m4  → m5
    m8  → m9  → m10 → m11 → m12 → m13
    m4  → m5  → m6  → m7  → m8  → m9
    m12 → m13 → m14 → m15 → m0  → m1
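The typical chain and the two-chain variant can be compared with a small model. This is a sketch under assumptions: limb positions follow ceil(25.875·i), the carry out of m15 re-enters m0 multiplied by 17 because 2^414 ≡ 17 modulo p, and the function names are hypothetical. In a vectorized implementation the two carries of each step would be one vector operation; here they run one after the other, which is equivalent because they touch disjoint limbs.

```python
P = 2**414 - 17
EXP = [-(-207 * i // 8) for i in range(17)]        # limb positions, EXP[16] == 414
WIDTH = [EXP[i + 1] - EXP[i] for i in range(16)]   # 26 or 25 bits per limb


def value(m):
    """Integer represented by the limbs, modulo P."""
    return sum(x << EXP[i] for i, x in enumerate(m)) % P


def carry(m, i):
    """Carry from limb i into limb i+1 (wrapping m15 -> m0 with factor 17)."""
    j = (i + 1) % 16
    c = m[i] >> WIDTH[i]
    m[i] -= c << WIDTH[i]
    m[j] += 17 * c if j == 0 else c


def carry_sequential(m):
    """m0 -> m1 -> ... -> m15 -> m0 -> m1: 17 dependent steps."""
    m = list(m)
    for i in list(range(16)) + [0]:
        carry(m, i)
    return m


def carry_two_chains(m):
    """Chains m0 -> ... -> m9 and m8 -> ... -> m15 -> m0 -> m1 in lock-step."""
    m = list(m)
    for step in range(9):
        carry(m, step)               # first chain
        carry(m, (8 + step) % 16)    # second chain
    return m
```

Both orders preserve the represented value modulo p; the two-chain order needs 9 dependent steps instead of 17.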

SLIDE 31

Polynomial multiplication

Goal: Compute $P = AB$ given $A = a_0 + a_1t^n$ and $B = b_0 + b_1t^n$

  • Method 1: schoolbook
    $P = a_0b_0 + (a_0b_1 + a_1b_0)t^n + a_1b_1t^{2n}$
  • Method 2: Karatsuba (8n−4 additions)
    $P = a_0b_0 + ((a_0 + a_1)(b_0 + b_1) - a_0b_0 - a_1b_1)t^n + a_1b_1t^{2n}$
  • Method 3: refined Karatsuba (7n−3 additions)
    $P = (a_0b_0 - a_1b_1t^n)(1 - t^n) + (a_0 + a_1)(b_0 + b_1)t^n$

SLIDE 33

Polynomial multiplication mod Q

Goal: Compute $P = AB \bmod Q$ given $A = a_0 + a_1t^n$ and $B = b_0 + b_1t^n$

  • Method 1: schoolbook
    $P = a_0b_0 + (a_0b_1 + a_1b_0)t^n + a_1b_1t^{2n} \bmod Q$
  • Method 2: Karatsuba (8n−4 additions)
    $P = a_0b_0 + ((a_0 + a_1)(b_0 + b_1) - a_0b_0 - a_1b_1)t^n + a_1b_1t^{2n} \bmod Q$
  • Method 3: refined Karatsuba (7n−3 additions)
    $P = (a_0b_0 - a_1b_1t^n)(1 - t^n) + (a_0 + a_1)(b_0 + b_1)t^n \bmod Q$
  • Method 4: reduced refined Karatsuba (6n−2 additions) (new)
    $P = ((a_0b_0 - a_1b_1t^n) \bmod Q)(1 - t^n) + (a_0 + a_1)(b_0 + b_1)t^n \bmod Q$
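The identities for Methods 1, 3 and 4 can be checked numerically. The sketch below represents polynomials as coefficient lists and, purely as an assumption for illustration, takes Q = t^(2n) − c for a small constant c; the slides do not specify Q, and all function names are hypothetical.

```python
def mul(a, b):
    """Schoolbook product of two coefficient lists."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def add(a, b):
    n = max(len(a), len(b))
    return [(a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0) for i in range(n)]


def sub(a, b):
    return add(a, [-x for x in b])


def shift(a, k):
    """Multiply by t^k."""
    return [0] * k + list(a)


def reduce_mod_q(a, n, c):
    """Reduce modulo the assumed modulus Q = t^(2n) - c."""
    out = [0] * (2 * n)
    for i, x in enumerate(a):
        out[i % (2 * n)] += x * c ** (i // (2 * n))
    return out


def schoolbook(a0, a1, b0, b1, n):
    mid = add(mul(a0, b1), mul(a1, b0))
    return add(add(mul(a0, b0), shift(mid, n)), shift(mul(a1, b1), 2 * n))


def refined(a0, a1, b0, b1, n):
    low = sub(mul(a0, b0), shift(mul(a1, b1), n))       # a0b0 - t^n a1b1
    return add(sub(low, shift(low, n)), shift(mul(add(a0, a1), add(b0, b1)), n))


def reduced_refined(a0, a1, b0, b1, n, c):
    low = reduce_mod_q(sub(mul(a0, b0), shift(mul(a1, b1), n)), n, c)  # reduce early
    total = add(sub(low, shift(low, n)), shift(mul(add(a0, a1), add(b0, b1)), n))
    return reduce_mod_q(total, n, c)
```

Reducing the intermediate value early shortens it from 3n−1 to 2n coefficients, so the following multiplication by (1 − t^n) touches fewer coefficients, which is in line with the lower addition count quoted for Method 4.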

SLIDE 34

Reduced refined Karatsuba

    $a_0b_0$, $a_1b_1$
      → subtract, reduce:   $a_0b_0 - t^na_1b_1$
      → subtract:           $(1 - t^n)(a_0b_0 - t^na_1b_1)$
      → add $(a_0 + a_1)(b_0 + b_1)$, reduce

SLIDE 39

Level of Karatsuba

Karatsuba splits 1 (2n × 2n) into 3 (n × n)

  • Zero-level Karatsuba (schoolbook)
    e.g. for 16 limbs: 16 × 16 = 256
  • One-level Karatsuba
    e.g.: 16 × 16 → 3 · (8 × 8) + some additions = 192 + some additions
  • Two-level Karatsuba
    e.g.: 3 · (8 × 8) → 3 · (3 · (4 × 4)) + even more additions = 144 + even more additions

What is the zero-level/one-level cutoff for number of limbs?
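The multiplication counts quoted for each level follow from replacing one product by three half-size products. A small sketch (the function name `base_multiplications` is hypothetical):

```python
def base_multiplications(limbs, levels):
    """Limb multiplications after `levels` Karatsuba splits of a limbs x limbs product."""
    size = limbs
    products = 1
    for _ in range(levels):
        size //= 2      # each split halves the operand size
        products *= 3   # and turns one product into three
    return products * size * size
```

For 16 limbs this gives 256, 192, 144 and 108 multiplications for levels 0 to 3; the additions that grow with each level are not counted here.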

SLIDE 43

GMP’s cutoffs for Karatsuba

[Plot: cycles/byte (axis ticks 20 to 100) against operand size in bits (axis ticks 1024 to 4096)]

  • GMP 6.0.0a library chooses 1248 bits on ARM Cortex-A8
  • We reduce cutoff via improvements to Karatsuba
  • We reduce cutoff via redundant representation

SLIDE 44

Cost comparison (Karatsuba)

    Level     Mult.   64-bit add   32-bit add   Cost
    0-level   256     15           -            256 +  8 +  0 = 264
    1-level   192     59           16           192 + 30 +  4 = 226
    2-level   144     119          40           144 + 60 + 10 = 214
    3-level   108     191          76           108 + 96 + 19 = 223

Note: use multiply-add instructions

Recall:

  • 1 cycle per multiplication
  • 0.5 cycle per 64-bit addition
  • 0.25 cycle per 32-bit addition

SLIDE 45

Cost comparison (refined Karatsuba)

    Level     Mult.   64-bit add   32-bit add   Cost
    0-level   256     15           -            256 +  8 +  0 = 264
    1-level   192     52           16           192 + 26 +  4 = 222
    2-level   144     103          40           144 + 52 + 10 = 206
    3-level   108     166          76           108 + 83 + 19 = 210

Note: use multiply-add instructions

Recall:

  • 1 cycle per multiplication
  • 0.5 cycle per 64-bit addition
  • 0.25 cycle per 32-bit addition

SLIDE 46

Cost comparison (reduced refined Karatsuba)

    Level     Mult.   64-bit add   32-bit add   Cost
    0-level   256     15           -            256 +  8 +  0 = 264
    1-level   192     45           16           192 + 23 +  4 = 219
    2-level   144     96           40           144 + 48 + 10 = 202
    3-level   108     159          76           108 + 80 + 19 = 207

Note: use multiply-add instructions

Recall:

  • 1 cycle per multiplication
  • 0.5 cycle per 64-bit addition
  • 0.25 cycle per 32-bit addition
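The Cost column of the three tables can be reproduced from the per-operation cycle counts. The sketch assumes that each partial cost is rounded up to a whole cycle (this matches every row shown) and that the empty 32-bit cell of the 0-level row means zero additions; the function name `cost` is hypothetical.

```python
def cost(mult, add64, add32):
    """Cycles: 1 per multiplication, 0.5 per 64-bit add, 0.25 per 32-bit add.

    Each partial sum is rounded up to a whole cycle (an assumption that
    reproduces the tabulated values).
    """
    return mult + (add64 + 1) // 2 + (add32 + 3) // 4


KARATSUBA = [(256, 15, 0), (192, 59, 16), (144, 119, 40), (108, 191, 76)]
REFINED = [(256, 15, 0), (192, 52, 16), (144, 103, 40), (108, 166, 76)]
REDUCED_REFINED = [(256, 15, 0), (192, 45, 16), (144, 96, 40), (108, 159, 76)]

for name, rows in [("Karatsuba", KARATSUBA),
                   ("refined", REFINED),
                   ("reduced refined", REDUCED_REFINED)]:
    print(name, [cost(*row) for row in rows])
```

In all three variants the two-level split has the lowest estimated cost, and the reduced refined variant has the lowest of the three.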

SLIDE 47

Performance comparison

    OpenSSL curve   # cycles on i.MX515   # cycles on Sitara
    secp160r1       ≈ 2.1 million         ≈ 2.1 million
    nistp192        ≈ 2.9 million         ≈ 2.8 million
    nistp224        ≈ 4.0 million         ≈ 3.9 million
    nistp256        ≈ 4.0 million         ≈ 3.9 million
    nistp384        ≈ 13.3 million        ≈ 13.2 million
    nistp521        ≈ 29.7 million        ≈ 29.7 million

Curve41417 (security level above $2^{200}$):

  • ≈ 1.6 million cycles on Freescale i.MX515
  • ≈ 1.8 million cycles on TI Sitara

SLIDE 48

NTRU Prime


SLIDE 50

NTRU Prime

High-security prime-degree large-Galois-group inert-modulus ideal-lattice-based cryptography

System parameters $(p, q, t)$:

  • $p$, $q$ are prime
  • $p \ge \max\{2t, 3\}$
  • $q \ge 32t + 1$
  • $x^p - x - 1$ is irreducible in polynomial ring $(\mathbb{Z}/q)[x]$

Fields of the form $(\mathbb{Z}/q)[x]/(x^p - x - 1)$

Abbreviation:

  • ring $\mathbb{Z}[x]/(x^p - x - 1)$ as $R$
  • ring $(\mathbb{Z}/3)[x]/(x^p - x - 1)$ as $R/3$
  • field $(\mathbb{Z}/q)[x]/(x^p - x - 1)$ as $R/q$

SLIDE 51

Streamlined NTRU Prime: private and public key

  • Pick $g \in R$
      $g = g_0 + \cdots + g_{p-1}x^{p-1}$ with $g_i \in \{-1, 0, 1\}$
      $g$ is required to be invertible in $R/3$
  • Pick $f \in R$
      $f = f_0 + \cdots + f_{p-1}x^{p-1}$ with $f_i \in \{-1, 0, 1\}$ and $\sum_i |f_i| = 2t$
      $f$ is nonzero and hence invertible in $R/q$
  • Public key: $h = g/(3f)$ in $R/q$
  • Private keys: $f$ in $R$ and $1/g$ in $R/3$

SLIDE 52

Streamlined NTRU Prime: KEM/DEM

Use Key Encapsulation Mechanism (KEM) combined with Data Encapsulation Mechanism (DEM)

KEM:

  • look up public key $h$
  • pick $r \in R$ (i.e., $r_i \in \{-1, 0, 1\}$, $\sum_i |r_i| = 2t$)
  • compute $hr$ in $R/q$
  • round each coefficient (viewed as an element of $\mathbb{Z} \cap [-(q-1)/2, (q-1)/2]$) to the nearest multiple of 3 to get $c$
  • compute Hash($r$) = ($C|K$)
  • send ($C|c$), use session key $K$ for DEM
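The two arithmetic steps of encapsulation, the product hr in R/q and the rounding to multiples of 3, can be sketched as follows. This is a plain schoolbook illustration with hypothetical function names, not the optimized multiplication discussed later in the talk; the hash step is omitted and the parameters default to p = 761, q = 4591.

```python
P_DEG, Q = 761, 4591


def center(x, q=Q):
    """Representative of x mod q in [-(q-1)/2, (q-1)/2]."""
    x %= q
    return x - q if x > (q - 1) // 2 else x


def ring_mul(a, b, p=P_DEG, q=Q):
    """Product in (Z/q)[x]/(x^p - x - 1); a and b are length-p coefficient lists."""
    prod = [0] * (2 * p - 1)
    for i, ai in enumerate(a):
        if ai:                                   # skip zero coefficients of a sparse input
            for j, bj in enumerate(b):
                prod[i + j] += ai * bj
    for i in range(2 * p - 2, p - 1, -1):        # x^i = x^(i-p) * (x + 1)
        prod[i - p] += prod[i]
        prod[i - p + 1] += prod[i]
    return [center(c, q) for c in prod[:p]]


def round3(x):
    """Nearest multiple of 3 to the integer x."""
    return x - ((x + 1) % 3 - 1)


def encapsulate_core(h, r, p=P_DEG, q=Q):
    """c = Round(h * r): the rounded product sent in the ciphertext."""
    return [round3(c) for c in ring_mul(r, h, p, q)]
```

Since (q − 1)/2 = 2295 is itself a multiple of 3, rounding never leaves the centered coefficient range.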

SLIDE 53

Streamlined NTRU Prime: decapsulation

To decrypt ($C|c$):

  • (reminder: $h = g/(3f)$ in $R/q$)
  • compute $3fc = 3f(hr + m) = gr + 3fm$ in $R/q$, where $m$ is the rounding error, i.e., $c = hr + m$
  • reduce the coefficients modulo 3 to get $a = gr \in R/3$
  • compute $r' = a/g \in R/3$, lift $r'$ to $R$
  • compute Hash($r'$) = ($C'|K'$) and $c'$ as rounding of $hr'$
  • verify that $c' = c$ and $C' = C$

If all verifications are ok, then $K = K'$ is the session key

SLIDE 54

Streamlined NTRU Prime $4591^{761}$

Field $(\mathbb{Z}/4591)[x]/(x^{761} - x - 1)$

Parameters:

  • p = 761
  • q = 4591
  • t = 143

Security: $2^{248}$ (pre-quantum)

  • considered hybrid lattice-reduction and meet-in-the-middle attack

SLIDE 56

Polynomial multiplication

Main bottleneck is polynomial multiplication

Multiplication algorithms considered:

  • Toom (3–6)
  • refined Karatsuba
  • arbitrary degree variant of Karatsuba (3–6)

Best operation count found so far for 768 × 768:

  • 5-level refined Karatsuba up to 128 × 128
  • Toom6: evaluated at 0, ±1, ±2, ±3, ±4, 5, ∞

SLIDE 63

Combination of Toom and Karatsuba

[Figure: decomposition tree with legend Blue = Toom, Red = Karatsuba, Green = Schoolbook. The root 768 splits into six blocks of 128; a 128 block is then halved repeatedly: 64, 32, 16, 8, 4.]

SLIDE 65

Toom: decomposition

Decompose $a(x) = a_0 + a_1x + a_2x^2 + \cdots + a_{767}x^{767}$ into

    $a(x, y) = A_0(x) + A_1(x)y + A_2(x)y^2 + A_3(x)y^3 + A_4(x)y^4 + A_5(x)y^5$

where $y = x^{128}$ and

    $A_0(x) = a_0 + a_1x + a_2x^2 + \cdots + a_{127}x^{127}$
    $A_1(x) = a_{128} + a_{129}x + a_{130}x^2 + \cdots + a_{255}x^{127}$
    $A_2(x) = a_{256} + a_{257}x + a_{258}x^2 + \cdots + a_{383}x^{127}$
    $A_3(x) = a_{384} + a_{385}x + a_{386}x^2 + \cdots + a_{511}x^{127}$
    $A_4(x) = a_{512} + a_{513}x + a_{514}x^2 + \cdots + a_{639}x^{127}$
    $A_5(x) = a_{640} + a_{641}x + a_{642}x^2 + \cdots + a_{767}x^{127}$

Similarly for $b(x)$, then

    $ab = C_0 + C_1y + C_2y^2 + C_3y^3 + C_4y^4 + C_5y^5 + C_6y^6 + C_7y^7 + C_8y^8 + C_9y^9 + C_{10}y^{10}$

SLIDE 66

Toom: evaluation

     0 : $A_0 \cdot B_0$
     1 : $(A_0 + A_1 + A_2 + A_3 + A_4 + A_5) \cdot (B_0 + B_1 + B_2 + B_3 + B_4 + B_5)$
    −1 : $(A_0 - A_1 + A_2 - A_3 + A_4 - A_5) \cdot (B_0 - B_1 + B_2 - B_3 + B_4 - B_5)$
     2 : $(A_0 + 2A_1 + 2^2A_2 + 2^3A_3 + 2^4A_4 + 2^5A_5) \cdot (B_0 + 2B_1 + 2^2B_2 + 2^3B_3 + 2^4B_4 + 2^5B_5)$
    −2 : $(A_0 - 2A_1 + 2^2A_2 - 2^3A_3 + 2^4A_4 - 2^5A_5) \cdot (B_0 - 2B_1 + 2^2B_2 - 2^3B_3 + 2^4B_4 - 2^5B_5)$
     3 : $(A_0 + 3A_1 + 3^2A_2 + 3^3A_3 + 3^4A_4 + 3^5A_5) \cdot (B_0 + 3B_1 + 3^2B_2 + 3^3B_3 + 3^4B_4 + 3^5B_5)$
    −3 : $(A_0 - 3A_1 + 3^2A_2 - 3^3A_3 + 3^4A_4 - 3^5A_5) \cdot (B_0 - 3B_1 + 3^2B_2 - 3^3B_3 + 3^4B_4 - 3^5B_5)$
     4 : $(A_0 + 4A_1 + 4^2A_2 + 4^3A_3 + 4^4A_4 + 4^5A_5) \cdot (B_0 + 4B_1 + 4^2B_2 + 4^3B_3 + 4^4B_4 + 4^5B_5)$
    −4 : $(A_0 - 4A_1 + 4^2A_2 - 4^3A_3 + 4^4A_4 - 4^5A_5) \cdot (B_0 - 4B_1 + 4^2B_2 - 4^3B_3 + 4^4B_4 - 4^5B_5)$
     5 : $(A_0 + 5A_1 + 5^2A_2 + 5^3A_3 + 5^4A_4 + 5^5A_5) \cdot (B_0 + 5B_1 + 5^2B_2 + 5^3B_3 + 5^4B_4 + 5^5B_5)$
     ∞ : $A_5 \cdot B_5$
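The decomposition and evaluation steps can be checked on a small instance. The sketch below splits each input into six blocks, evaluates at the eleven points listed above, and confirms that each pointwise product equals the evaluation of the block product $C_0, \ldots, C_{10}$. It does not perform the interpolation step, and all function names are hypothetical.

```python
POINTS = [0, 1, -1, 2, -2, 3, -3, 4, -4, 5]   # the eleventh point is infinity


def poly_mul(a, b):
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def split(a, parts=6):
    """Blocks A_0..A_5 with a(x) = sum_k A_k(x) * y^k, y = x^(len(a)/parts)."""
    size = len(a) // parts
    return [a[k * size:(k + 1) * size] for k in range(parts)]


def evaluate(blocks, v):
    """sum_k blocks[k] * v^k, still a polynomial in x."""
    out = [0] * len(blocks[0])
    for k, blk in enumerate(blocks):
        for i, coeff in enumerate(blk):
            out[i] += coeff * v ** k
    return out


def product_blocks(A, B):
    """C_k = sum over i + j = k of A_i * B_j, for k = 0..10."""
    m = len(A[0])
    C = [[0] * (2 * m - 1) for _ in range(len(A) + len(B) - 1)]
    for i, Ai in enumerate(A):
        for j, Bj in enumerate(B):
            for idx, coeff in enumerate(poly_mul(Ai, Bj)):
                C[i + j][idx] += coeff
    return C


def check_toom6(a, b):
    """True if all eleven evaluations agree with the block product."""
    A, B = split(a), split(b)
    C = product_blocks(A, B)
    finite_ok = all(poly_mul(evaluate(A, v), evaluate(B, v)) == evaluate(C, v)
                    for v in POINTS)
    return finite_ok and poly_mul(A[5], B[5]) == C[10]
```

Eleven evaluations suffice because the product has eleven unknown blocks, so Toom6 replaces the 36 block products of the schoolbook method by 11.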

SLIDE 72

Karatsuba

    $(F_0 + t^nF_1)(G_0 + t^nG_1) = (1 - t^n)(F_0G_0 - t^nF_1G_1) + t^n(F_0 + F_1)(G_0 + G_1)$

Level 1:

    $F_0 = f_0 + f_1x + f_2x^2 + \cdots + f_{63}x^{63}$;   $F_1 = f_{64} + f_{65}x + f_{66}x^2 + \cdots + f_{127}x^{63}$;
    $G_0 = g_0 + g_1x + g_2x^2 + \cdots + g_{63}x^{63}$;   $G_1 = g_{64} + g_{65}x + g_{66}x^2 + \cdots + g_{127}x^{63}$;
    $fg = (1 - x^{64})(F_0G_0 - x^{64}F_1G_1) + x^{64}(F_0 + F_1)(G_0 + G_1)$

Level 2:

    $F_{00} = f_0 + f_1x + f_2x^2 + \cdots + f_{31}x^{31}$;   $F_{01} = f_{32} + f_{33}x + f_{34}x^2 + \cdots + f_{63}x^{31}$;
    $F_{10} = f_{64} + f_{65}x + f_{66}x^2 + \cdots + f_{95}x^{31}$;   $F_{11} = f_{96} + f_{97}x + f_{98}x^2 + \cdots + f_{127}x^{31}$;

    let $F_2 = F_0 + F_1 = F_{20} + x^{32}F_{21}$
    $F_{20} = (f_0 + f_{64}) + (f_1 + f_{65})x + (f_2 + f_{66})x^2 + \cdots + (f_{31} + f_{95})x^{31}$;
    $F_{21} = (f_{32} + f_{96}) + (f_{33} + f_{97})x + (f_{34} + f_{98})x^2 + \cdots + (f_{63} + f_{127})x^{31}$;

    $F_0G_0 = (1 - x^{32})(F_{00}G_{00} - x^{32}F_{01}G_{01}) + x^{32}(F_{00} + F_{01})(G_{00} + G_{01})$;
    $F_1G_1 = (1 - x^{32})(F_{10}G_{10} - x^{32}F_{11}G_{11}) + x^{32}(F_{10} + F_{11})(G_{10} + G_{11})$;
    $F_2G_2 = (1 - x^{32})(F_{20}G_{20} - x^{32}F_{21}G_{21}) + x^{32}(F_{20} + F_{21})(G_{20} + G_{21})$;

Similarly for level 3, level 4 and level 5

SLIDE 74

Schoolbook

Lowest-level multiplication of 4n × 4n, e.g., $F_{00000}G_{00000}$:

    $h_0 = f_0g_0$
    $h_1 = f_0g_1 + f_1g_0$
    $h_2 = f_0g_2 + f_1g_1 + f_2g_0$
    $h_3 = f_0g_3 + f_1g_2 + f_2g_1 + f_3g_0$
    $h_4 = f_1g_3 + f_2g_2 + f_3g_1$
    $h_5 = f_2g_3 + f_3g_2$
    $h_6 = f_3g_3$

Using 5-level Karatsuba, there are $3^5 = 243$ of 4n × 4n for one 128 × 128
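The count of 243 base products can be confirmed with a recursive sketch of refined Karatsuba that stops at 4-coefficient operands. This is a scalar illustration under the assumption that the input length is 4 times a power of two; the function names and the `stats` counter are hypothetical, and nothing here is vectorized.

```python
def schoolbook(f, g):
    """Product of two coefficient lists (7 output coefficients for 4 x 4)."""
    h = [0] * (len(f) + len(g) - 1)
    for i, x in enumerate(f):
        for j, y in enumerate(g):
            h[i + j] += x * y
    return h


def add(a, b):
    n = max(len(a), len(b))
    return [(a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0) for i in range(n)]


def sub(a, b):
    return add(a, [-x for x in b])


def shift(a, k):
    """Multiply by x^k."""
    return [0] * k + list(a)


def refined_karatsuba(f, g, stats, cutoff=4):
    """(1 - x^h)(F0 G0 - x^h F1 G1) + x^h (F0 + F1)(G0 + G1), applied recursively."""
    n = len(f)
    if n <= cutoff:
        stats["base"] += 1
        return schoolbook(f, g)
    h = n // 2
    f0, f1, g0, g1 = f[:h], f[h:], g[:h], g[h:]
    p0 = refined_karatsuba(f0, g0, stats, cutoff)
    p1 = refined_karatsuba(f1, g1, stats, cutoff)
    pm = refined_karatsuba(add(f0, f1), add(g0, g1), stats, cutoff)
    low = sub(p0, shift(p1, h))
    return add(sub(low, shift(low, h)), shift(pm, h))
```

Calling it on two length-128 inputs with `stats = {"base": 0}` performs five levels of splitting (128, 64, 32, 16, 8, 4) and leaves `stats["base"]` at 243.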

SLIDE 79

Haswell floating-point vector unit

  • 256-bit 4-way vectorization
  • Two vectorized multiply-add units (port 0 and port 1)
      Each cycle produces 8 independent multiply-adds ab + c for 64-bit double-precision inputs a, b, c
  • One vectorized addition unit (port 1)
      Each cycle produces 4 independent additions a + b for 64-bit double-precision inputs a, b

SLIDE 80

Vectorization

Toom & Karatsuba:

  • vectorize inside each limb

Schoolbook:

  • transpose inputs
  • vectorize across independent multiplications

[Figure: layout of the inputs f and g in vector lanes, with lane-wise additions for Toom & Karatsuba and lane-wise multiplications for schoolbook]

SLIDE 81

Performance

Theoretical lower bound:

  • 0.125 cycles per floating-point multiplication
  • 0.250 cycles per floating-point addition and shift
  • permutation fully interleavable

              mul     con mult   add     shift   total
    op.       42768   9700       98548   6385    157401
    cycles    5346    1213       24637   1597    32793

Actual implementation:

  • 46784 cycles
  • possibly due to dependency, latency, scheduling issues
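The cycle row of the table follows from the operation counts and the per-operation throughput. The sketch below assumes that constant multiplications are costed like multiplications (8 per cycle) and shifts like additions (4 per cycle), with each entry rounded up; these assumptions reproduce the tabulated values. The names `OPS`, `PER_CYCLE` and `lower_bound` are hypothetical.

```python
OPS = {"mul": 42768, "con mult": 9700, "add": 98548, "shift": 6385}
PER_CYCLE = {"mul": 8, "con mult": 8, "add": 4, "shift": 4}   # 0.125 and 0.250 cycles per op


def lower_bound(ops=OPS, per_cycle=PER_CYCLE):
    """Cycles per operation class, each rounded up to a whole cycle."""
    return {name: -(-count // per_cycle[name]) for name, count in ops.items()}


cycles = lower_bound()
total_cycles = sum(cycles.values())
measured = 46784
print(cycles, total_cycles, round(measured / total_cycles, 2))
```

The measured 46784 cycles are roughly 1.43 times the 32793-cycle bound.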

SLIDE 82

Current projects

  • PRF from module lattices
  • Module-NTRU in QROM
  • Ring-signature from module lattices
  • Middle product and integer LWE