Fast Integer Multiplication with Sch¨
- nhage-Strassen’s Algorithm
Alexander Kruppa
CACAO team at LORIA, Nancy s´ eminaire Algorithms INRIA Rocquencourt
Fast Integer Multiplication with Sch onhage-Strassens Algorithm - - PowerPoint PPT Presentation
Fast Integer Multiplication with Sch onhage-Strassens Algorithm Alexander Kruppa CACAO team at LORIA, Nancy s eminaire Algorithms INRIA Rocquencourt Contents Contents Contents of this talk: 1. Basics of Integer Multiplication (a)
Alexander Kruppa
CACAO team at LORIA, Nancy s´ eminaire Algorithms INRIA Rocquencourt
Contents
Contents of this talk:
(a) by polynomial multiplication (b) Evaluation/Interpolation (c) Karatsuba’s method (d) Toom-Cook method (e) FFT and weighted transforms (f) Sch¨
Alexander Kruppa 2 INRIA Rocquencourt
Contents
(a) Mersenne Transform (b) CRT Reconstruction (c)
(a) Arithmetic modulo 2n + 1 (b) Cache Locality (c) Fine-Grained Tuning
Alexander Kruppa 3 INRIA Rocquencourt
Integer Multiplication
n−1
2n−1
n−1
n−1
with 0 ≤ ci < β.
Alexander Kruppa 4 INRIA Rocquencourt
by Polynomial Multiplication
n−1
so that a = A(β), likewise for B(x), then
n−1
n−1
so that c = ab = C(β).
we can do much better
Alexander Kruppa 5 INRIA Rocquencourt
Evaluation/Interpolation
uniquely by values at d distinct points
polynomial multiplication to:
Alexander Kruppa 6 INRIA Rocquencourt
Karatsuba’s Method
Alexander Kruppa 7 INRIA Rocquencourt
Toom-Cook Method
1966)
arbitrarily for given n
Alexander Kruppa 8 INRIA Rocquencourt
Evaluation/Interpolation with FFT
ℓ), 0 ≤ j <
evaluation
interpolation
Alexander Kruppa 9 INRIA Rocquencourt
Weighted Transform
ℓ)ℓ = 1 for all j ∈ N, C1(x)xℓ + C0(x) has same DFT
coefficients as C1(x) + C0(x): implicit modulus xℓ − 1 in DFT
convolution
compute
so that C(x) = (A(x)B(x)) mod (xℓ − wℓ)
need 2ℓ-th root of unity in R
Alexander Kruppa 10 INRIA Rocquencourt
Sch¨
multiplication by powers of 2 is fast! (O(n))
next level’s ℓ′, M ′ so that M ′ℓ′ = n
Alexander Kruppa 11 INRIA Rocquencourt
Alexander Kruppa 12 INRIA Rocquencourt
Motivation for Improving SSA
testing, ECPP , polynomial multiplication
algorithm [SSA]: good asymptotic complexity O(n log n log log n), fast in practice for large operands, exact (only integer arithmetic)
, widely deployed
Alexander Kruppa 13 INRIA Rocquencourt
Sch¨
has at most 221 bits
compute ab mod (2N + 1)
theorem: Fourier transform and pointwise multiplication
Alexander Kruppa 14 INRIA Rocquencourt
Sch¨
Then 22n/ℓ has order ℓ
products can use SSA recursively without padding
Z = ⇒ Z2N+1 = ⇒ Z[x] mod (xℓ + 1) = ⇒ Z2n+1[x] mod (xℓ + 1) = ⇒ Z2n+1
❄ ✻
n
small?
✟✟✟✟ ✟ ❍❍❍❍ ❍ ❍ ❍ ❍ ❍ ❍ ✟ ✟ ✟ ✟ ✟
No, recurse
❄
Yes, multiply
Alexander Kruppa 15 INRIA Rocquencourt
High-Level Optimizations
Alexander Kruppa 16 INRIA Rocquencourt
Mersenne Transform
needs ℓ | n to get 2ℓ-th root of unity in Rn
not required at top level
compute c(x) = a(x)b(x) mod (xℓ − 1)
Alexander Kruppa 17 INRIA Rocquencourt
CRT Reconstruction
Convolution
with 2N + 1 > ab
Convolution Convolution CRT
with (2rN + 1)(2sN − 1) > ab
Alexander Kruppa 18 INRIA Rocquencourt
CRT Reconstruction
speed
Alexander Kruppa 19 INRIA Rocquencourt
The
√ 2 Trick
takes only 2 shift, 1 subtraction modulo 2n + 1
have no simple form
Alexander Kruppa 20 INRIA Rocquencourt
Low-Level Optimizations
Alexander Kruppa 21 INRIA Rocquencourt
Arithmetic modulo 2n + 1
full words plus one extra word ≤ 1
c = a[m] + b[m] + mpn_add_n (r, a, b, m); r[m] = (r[0] < c); MPN_DECR_U (r, m + 1, c - r[m]);
Assures r[m] = 0 or 1, c − r[m] > r[0] ⇒ r[m] = 1 so carry propagation must terminate.
non-taken) in carry propagation of MPN_DECR_U
Alexander Kruppa 22 INRIA Rocquencourt
Improving Cache Locality
elements have hundreds or thousands of bytes instead of 16
size quickly, overhead small compared to arithmetic
as many recursion levels, 4 operands usually fit into level 1 cache
❅ ❅ ❘ ❅ ❅ ❘
ai ai+k ai + ai+k wk(ai − ai+k)
Radix-2 butterfly
❅ ❅ ❘ ❅ ❅ ❘
❅ ❅ ❘ ❅ ❅ ❘
❅ ❅ ❘ ❅ ❅ ❘
❅ ❅ ❘ ❅ ❅ ❘
✟✟ ✟ ❍❍ ❍ ✲ ✲ ✲ ✲ ✲ ✲ ✲ ✲
Radix-4 butterfly
Alexander Kruppa 23 INRIA Rocquencourt
Improving Cache Locality
second pass
useful for SSA with large ℓ as well
perform first FFT level (likewise for b)
Alexander Kruppa 24 INRIA Rocquencourt
Fine-Grained Tuning
length grows monotonously with input size
time
input size graphs for different transform lengths intersect multiple times:
0.00013 0.00014 0.00015 0.00016 0.00017 0.00018 0.00019 0.0002 700 750 800 850 900 Seconds Size in 64-bit words K=32 K=64
Alexander Kruppa 25 INRIA Rocquencourt
Fine-Grained Tuning
transform length
CRT
1h up to 1M words) but yields significant speedup
Alexander Kruppa 26 INRIA Rocquencourt
Timings and Comparisons
code is about 40% faster than GMP 4.2.1 and Magma 2.13-6, more than twice as fast as GMP 4.1.4
10 20 30 40 50 60 70 80 1.5e7 1.25e7 1e7 7.5e6 5e6 2.5e6 Seconds Size in 64-bit words GMP 4.1.4 Magma V2.13-6 GMP 4.2.1 new GMP code
Alexander Kruppa 27 INRIA Rocquencourt
Timings and Comparisons
30% faster for some input sizes
integer multiplication, mostly for arithmetic mod 2n − 1 (Lucas- Lehmer test for Mersenne numbers)
Prime95 10x on Pentium 4, 2.5x on Opteron; Glucas 5x on Pentium 4, 2x on Opteron
length
Alexander Kruppa 28 INRIA Rocquencourt
Untested Ideas
interpolation step
urer’s new algorithm has lower theoretical complexity
Alexander Kruppa 29 INRIA Rocquencourt
REFERENCES
[GMP] T. Granlund, The GNU Multiple Precision Arithmetic library, http://gmplib.org/ [SSA] A. Sch¨
Zahlen, Computing 7 (1971) Source tarball with new code available at <http://www.loria.fr/˜kruppaal>
Alexander Kruppa 30 INRIA Rocquencourt