SLIDE 1

Faster arbitrary-precision dot product and matrix multiplication

Fredrik Johansson

Inria Bordeaux

26th IEEE Symposium on Computer Arithmetic (ARITH 26)

Kyoto, Japan, June 10, 2019

SLIDE 2

Arbitrary-precision arithmetic

Precision: p ≥ 2 bits (can be thousands or millions)

◮ Floating-point numbers
  3.14159265358979323846264338328
◮ Ball arithmetic (mid-rad interval arithmetic)
  [3.14159265358979323846264338328 ± 8.65 · 10^-31]

Why?
◮ Computational number theory, computer algebra
◮ Dynamical systems, ill-conditioned problems
◮ Verifying/testing numerical results/methods
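A minimal illustration I am adding (not from the talk) of ball arithmetic with Arb's C API; compile with -larb -lflint:

    #include "arb.h"

    int main(void)
    {
        arb_t x;
        arb_init(x);

        /* Midpoint-radius ball: arb_set_str accepts "m +/- r" syntax. */
        arb_set_str(x, "3.14159265358979323846 +/- 1e-20", 128);

        /* Square it; the radius is propagated automatically. */
        arb_mul(x, x, x, 128);

        arb_printn(x, 30, 0);  /* print a 30-digit enclosure */
        flint_printf("\n");

        arb_clear(x);
        return 0;
    }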

SLIDE 3

This work: faster arithmetic and linear algebra

CPU time (seconds) to multiply two real 1000 × 1000 matrices:

                      p = 53    p = 106   p = 212   p = 848
  BLAS                0.08
  QD                            11        111
  MPFR                36        44        110       293
  Arb* (classical)    19        25        76        258
  Arb* (block)        3.6       5.6       8.2       27

* With ball coefficients. Arb version 2.16 – http://arblib.org

SLIDE 4

Two important requirements

◮ True arbitrary precision; inputs and output can have mixed precision; no restrictions on the exponents

◮ Preserve structure: near-optimal enclosures for each entry, e.g. for a matrix with entries of wildly different magnitudes:

  [1.23·10^100 ± 10^80]      −1.5                1
  [2.34 ± 10^-20]            [3.45 ± 10^-50]     2
  [4.56·10^-100 ± 10^-130]

SLIDE 5

Dot product

∑_{k=1}^N a_k b_k,   a_k, b_k ∈ ℝ or ℂ

Kernel in basecase (N ≈ 10 to 100) algorithms for:
◮ Matrix multiplication
◮ Triangular solving, recursive LU factorization
◮ Polynomial multiplication, division, composition
◮ Power series operations

SLIDE 6

Dot product as an atomic operation

The old way:

    arb_mul(s, a, b, prec);
    for (k = 1; k < N; k++)
        arb_addmul(s, a + k, b + k, prec);

The new way:

    arb_dot(s, NULL, 0, a, 1, b, 1, N, prec);

(More generally, computes s = s_0 + (−1)^c ∑_{k=0}^{N−1} a_{k·astep} b_{k·bstep}; a complete usage example follows below.)

arb_dot – ball arithmetic, real
acb_dot – ball arithmetic, complex
arb_approx_dot – floating-point, real
acb_approx_dot – floating-point, complex
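A self-contained usage sketch (my example, not from the talk; assumes Arb ≥ 2.16):

    #include "arb.h"

    int main(void)
    {
        slong k, N = 100, prec = 256;
        arb_ptr a, b;
        arb_t s;

        a = _arb_vec_init(N);
        b = _arb_vec_init(N);
        arb_init(s);

        for (k = 0; k < N; k++)
        {
            arb_set_ui(a + k, k + 1);           /* a_k = k + 1     */
            arb_ui_div(b + k, 1, a + k, prec);  /* b_k = 1/(k + 1) */
        }

        /* Dot product as one atomic operation: a single final rounding,
           a near-optimal enclosure. The exact answer here is N = 100. */
        arb_dot(s, NULL, 0, a, 1, b, 1, N, prec);
        arb_printn(s, 20, 0);
        flint_printf("\n");

        arb_clear(s);
        _arb_vec_clear(a, N);
        _arb_vec_clear(b, N);
        return 0;
    }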

SLIDE 7

Numerical dot product

Approximate (floating-point) dot product:

  s = ∑_{k=1}^N a_k b_k + ε,   |ε| ≈ 2^-p ∑_{k=1}^N |a_k b_k|

Ball arithmetic dot product:

  [m ± r] ⊇ ∑_{k=1}^N [m_k ± r_k][m′_k ± r′_k]

  m = ∑_{k=1}^N m_k m′_k + ε,   r ≥ |ε| + ∑_{k=1}^N (|m_k| r′_k + |m′_k| r_k + r_k r′_k)
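To make the radius bookkeeping concrete, here is a toy ball dot product over double midpoints and radii following the bound above (my illustration; it omits the rigorous outward rounding Arb applies to every radius operation):

    #include <math.h>

    /* Toy model: [m ± r] ⊇ Σ [mk[k] ± rk[k]] · [mk2[k] ± rk2[k]] */
    void ball_dot(double *m, double *r,
                  const double *mk, const double *rk,
                  const double *mk2, const double *rk2, int n)
    {
        *m = 0.0;
        *r = 0.0;
        for (int k = 0; k < n; k++)
        {
            *m += mk[k] * mk2[k];
            /* |m_k| r′_k + |m′_k| r_k + r_k r′_k */
            *r += fabs(mk[k]) * rk2[k] + fabs(mk2[k]) * rk[k]
                + rk[k] * rk2[k];
        }
        /* A rigorous version must also add |ε|, a bound on the
           floating-point rounding error of the midpoint sum
           (≈ 2^-53 · Σ |mk[k]·mk2[k]| in double precision). */
    }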

SLIDE 8

Representation of numbers in Arb (like MPFR)

Arbitrary-precision floating-point numbers:

  (−1)^sign · 2^exp · ∑_{k=0}^{n−1} b_k 2^{64(k−n)}

Limbs b_k are 64-bit words, normalized: 0 ≤ b_k < 2^64, b_{n−1} ≥ 2^63, b_0 ≠ 0

All core arithmetic operations are implemented using word manipulations and low-level GMP (mpn layer) function calls

Radius: 30-bit unsigned floating-point
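As a mental model (my simplification; Arb's actual arf_struct packs small mantissas and exponents more compactly, with fmpz exponents for the unrestricted range):

    #include <stdint.h>

    /* Simplified model of an Arb/MPFR-style floating-point number:
       value = (-1)^sign · 2^exp · Σ_{k=0}^{n-1} limbs[k] · 2^(64(k-n)) */
    typedef struct
    {
        int sign;         /* 0 or 1 */
        int64_t exp;      /* binary exponent (an fmpz in Arb itself) */
        int64_t n;        /* number of 64-bit limbs */
        uint64_t *limbs;  /* normalized: limbs[n-1] >= 2^63, limbs[0] != 0 */
    } model_float;

    /* A ball pairs such a midpoint with a 30-bit unsigned
       floating-point radius. */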


SLIDES 9-11

Arbitrary-precision multiplication

[Diagram: two normalized mantissas of m and n limbs, each with the top bit set (1.......).]

Exact multiplication: mpn_mul → m + n limbs; the top limb may have its high bit clear (01......).

Rounding to ≤ p bits and bit alignment: the result is renormalized (1.......) with trailing zero bits (....1000).
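A bare-bones GMP illustration of the exact product step (my example; real code must also handle the possible leading zero bit and perform the rounding):

    #include <stdio.h>
    #include <gmp.h>

    int main(void)
    {
        /* Normalized 2-limb mantissas: top bit of the high limb set. */
        mp_limb_t a[2] = { 123, 0x8000000000000001UL };
        mp_limb_t b[2] = { 456, 0xC000000000000000UL };
        mp_limb_t prod[4];

        /* Exact multiplication: m + n = 4 limbs. The top limb may
           have its high bit clear ("01......"), requiring a 1-bit
           shift during renormalization before rounding to p bits. */
        mpn_mul(prod, a, 2, b, 2);

        for (int i = 3; i >= 0; i--)
            printf("%016llx ", (unsigned long long) prod[i]);
        printf("\n");
        return 0;
    }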

SLIDES 12-15

Arbitrary-precision addition

Exponent difference

Align limbs: mpn_lshift etc.

Addition: mpn_add_n, mpn_sub_n, mpn_add_1 etc.

Rounding to ≤ p bits and bit alignment: the result is renormalized (1.......) with trailing zero bits (....1000).
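A sketch of the alignment-and-add steps for an exponent difference of d bits, 0 < d < 64 (my example, using the mpn primitives named above; the rounding from the shifted-out bits is omitted):

    #include <stdio.h>
    #include <gmp.h>

    int main(void)
    {
        mp_limb_t a[2] = { 0, 0x8000000000000000UL };  /* larger exponent */
        mp_limb_t b[2] = { 0, 0xFFFFFFFFFFFFFFFFUL };
        mp_limb_t t[2], r[2], lost, carry;
        unsigned int d = 3;  /* exponent difference */

        /* Align b with a: shift right by d bits; the shifted-out bits
           (returned in the high part of 'lost') feed the rounding. */
        lost = mpn_rshift(t, b, 2, d);

        /* Add the aligned mantissas; carry = 1 means the exponent
           grows by one and the result needs a 1-bit right shift. */
        carry = mpn_add_n(r, a, t, 2);

        printf("carry = %llu, r = %016llx %016llx, lost = %016llx\n",
               (unsigned long long) carry, (unsigned long long) r[1],
               (unsigned long long) r[0], (unsigned long long) lost);
        return 0;
    }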

SLIDE 16

Dot product

First pass: inspect the terms
◮ Count nonzero terms
◮ Bound upper and lower exponents of terms
◮ Detect Inf/NaN/overflow/underflow (fallback code)

Second pass: compute the dot product! (a toy model in C follows below)
◮ Exploit knowledge about exponents
◮ Single temporary memory allocation
◮ Single final rounding and normalization
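A toy model of the two-pass strategy (my illustration: doubles and a single 64-bit accumulator stand in for Arb's multi-limb fixed-point accumulator):

    #include <limits.h>
    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    static double dot_two_pass(const double *a, const double *b, int n)
    {
        int emax = INT_MIN, nonzero = 0;

        /* First pass: count nonzero terms, bound the top exponent. */
        for (int i = 0; i < n; i++) {
            double t = a[i] * b[i];
            if (t != 0.0) {
                int e;
                frexp(t, &e);
                if (e > emax) emax = e;
                nonzero++;
            }
        }
        if (nonzero == 0) return 0.0;

        /* Second pass: fixed-point accumulation with ~log2(N) bits
           of headroom so the carries cannot overflow. */
        int shift = 62 - (int) ceil(log2(nonzero + 1));
        int64_t acc = 0;
        for (int i = 0; i < n; i++)
            acc += (int64_t) ldexp(a[i] * b[i], shift - emax);

        return ldexp((double) acc, emax - shift);
    }

    int main(void)
    {
        double a[4] = { 1.5, -2.25, 1e-3, 4.0 };
        double b[4] = { 2.0, 1.0, 8.0, 0.5 };
        printf("%.17g\n", dot_two_pass(a, b, 4));  /* ≈ 2.758 */
        return 0;
    }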


SLIDES 17-20

Dot product

[Diagram: the N terms, each of p bits, are summed into a 2's complement fixed-point accumulator, with a separate error accumulator alongside; padding regions A and B of ~log2 N bits absorb the carries.]

SLIDE 21

Technical comments

Radius dot products (for ball arithmetic):
◮ Dedicated code using 64-bit accumulator

Special sizes:
◮ Inline ASM instead of GMP function calls for ≤ 2 × 2 limb product, ≤ 3 limb accumulator
◮ Mulder's mulhigh (via MPFR) for 25 to 10000 limbs

Complex numbers:
◮ Essentially done as two length-2N real dot products
◮ Karatsuba-style multiplication (3 instead of 4 real muls) for ≥ 128 limbs (see the sketch below)
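The 3-multiplication complex product (the Gauss/Karatsuba trick), shown here over doubles for clarity; in Arb it is applied to mantissas of ≥ 128 limbs, where each real multiplication is expensive:

    #include <stdio.h>

    /* (a + bi)(c + di) with 3 real multiplications instead of 4. */
    static void mul3(double *re, double *im,
                     double a, double b, double c, double d)
    {
        double t1 = a * c;
        double t2 = b * d;
        double t3 = (a + b) * (c + d);
        *re = t1 - t2;       /* ac - bd */
        *im = t3 - t1 - t2;  /* ad + bc */
    }

    int main(void)
    {
        double re, im;
        mul3(&re, &im, 1.0, 2.0, 3.0, 4.0);  /* (1+2i)(3+4i) = -5+10i */
        printf("%g + %g i\n", re, im);
        return 0;
    }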

SLIDE 22

Dot product performance

[Plot: cycles per term (100-500) vs. bit precision p (64-768) for arb_addmul, mpfr_mul/mpfr_add, arb_dot, and arb_approx_dot.]

SLIDE 23

Dot product performance

[Plot: cycles per term (50-300) vs. bit precision p (64-256) for arb_addmul, mpfr_mul/mpfr_add, arb_dot, arb_approx_dot, and QD at p = 106 and p = 212.]

SLIDE 24

Dot product: polynomial operations speedup in Arb

[Plot: speedup (1-5×) vs. polynomial degree N (1-1000) for mul, mullow, divrem, inv_series, exp_series, sin_cos, compose, revert.]

(Complex coefficients, p = 64 bits)

SLIDE 25

Dot product: matrix operations speedup in Arb

[Plot: speedup (1-5×) vs. matrix size N (1-1000) for mul, solve, inv, det, exp, charpoly.]

(Complex coefficients, p = 64 bits)


SLIDES 26-27

Matrix multiplication (large N)

Same ideas as polynomial multiplication in Arb:

1. [A ± a][B ± b] via three multiplications: AB, |A|b, a(|B| + b) (derivation below)
2. Split + scale matrices into blocks with uniform magnitude
3. Multiply blocks of A, B exactly over Z using FLINT
4. Multiply blocks of |A|, b, a, |B| + b using hardware FP

Where is the gain?
◮ Integers and hardware FP have less overhead
◮ Multimodular/RNS arithmetic (60-bit primes in FLINT)
◮ Strassen O(N^2.81) matrix multiplication in FLINT
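Why three products suffice (a derivation I am adding; |·| and the inequalities are entrywise, with a and b the radius matrices):

  (A + ΔA)(B + ΔB) = AB + A·ΔB + ΔA·B + ΔA·ΔB,   |ΔA| ≤ a, |ΔB| ≤ b

so the total error satisfies

  |(A + ΔA)(B + ΔB) − AB| ≤ |A|b + a|B| + ab = |A|b + a(|B| + b),

i.e. the midpoint product AB plus the two radius products |A|b and a(|B| + b) give a valid enclosure.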

SLIDE 28

Matrix multiplication

Blocks A_s, B_s chosen (using greedy search) so that each row of A_s and column of B_s has a small internal exponent range.

  C = AB,   c_{i,j} = ∑_k a_{i,k} b_{k,j}

[Diagram: A and B with row i and column j highlighted.]

SLIDE 29

Block matrix multiplication

Choose blocks A_s, B_s (indices s ⊆ {1, . . . , N}) so that each row of A_s and column of B_s has a small internal exponent range.

  C ← C + A_s B_s

[Diagram: blocks A_s and B_s with row i and column j highlighted.]

SLIDE 30

Block matrix multiplication, scaled to integers

Scaling is applied internally to each block A_s, B_s so that each row of A_s and column of B_s has a small range of exponent differences.

  C ← C + E_s^{-1} ((E_s A_s)(B_s F_s)) F_s^{-1},   E_s = diag(2^{e_{i,s}}), F_s = diag(2^{f_{j,s}})

[Diagram: row i of A_s scaled by 2^{e_{i,s}}, column j of B_s scaled by 2^{f_{j,s}}.]

SLIDE 31

Uniform and non-uniform matrices

Uniform matrix, N = 1000:

  p      Classical   Block    Number of blocks   Speedup
  53     19 s        3.6 s    1                  5.3
  212    76 s        8.2 s    1                  9.3
  3392   1785 s      115 s    1                  15.5

Pascal matrix, N = 1000 (entries A_{i,j} = π · binomial(i+j, j)):

  p      Classical   Block    Number of blocks   Speedup
  53     12 s        20 s     10                 0.6
  212    43 s        35 s     9                  1.2
  3392   1280 s      226 s    2                  5.7

SLIDE 32

Approximate and certified linear algebra

Three approaches to linear solving Ax = b:
◮ Gaussian elimination in floating-point arithmetic: stable if A is well-conditioned
◮ Gaussian elimination in interval/ball arithmetic: unstable for generic well-conditioned A (lose O(N) digits)
◮ Approx + certification: 3.141 → [3.141 ± 0.001]

Example: Hansen-Smith algorithm (sketched with Arb's matrix API below)

1. Compute R ≈ A^-1 approximately
2. Solve (RA)x = Rb in interval/ball arithmetic
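An illustrative sketch with Arb's matrix API (my code, not from the talk; note that Arb's own arb_mat_solve already preconditions internally, so this only spells out the two steps):

    #include "arb_mat.h"

    /* Certified solve of Ax = b via Hansen-Smith. Returns nonzero
       on success (i.e. the inversion and final solve succeeded). */
    int hansen_smith_solve(arb_mat_t x, const arb_mat_t A,
                           const arb_mat_t b, slong prec)
    {
        slong n = arb_mat_nrows(A);
        arb_mat_t R, RA, Rb;
        int ok;

        arb_mat_init(R, n, n);
        arb_mat_init(RA, n, n);
        arb_mat_init(Rb, n, arb_mat_ncols(b));

        /* 1. R ≈ A^-1: invert, then keep only the midpoints. */
        ok = arb_mat_inv(R, A, prec);
        arb_mat_get_mid(R, R);

        if (ok)
        {
            /* 2. Solve (RA)x = Rb in ball arithmetic; RA ≈ I is
               well-conditioned, so elimination stays stable. */
            arb_mat_mul(RA, R, A, prec);
            arb_mat_mul(Rb, R, b, prec);
            ok = arb_mat_solve(x, RA, Rb, prec);
        }

        arb_mat_clear(R);
        arb_mat_clear(RA);
        arb_mat_clear(Rb);
        return ok;
    }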

SLIDE 33

Linear solving

Solving a dense real linear system Ax = b (N = 1000, p = 212)

[Bar chart, time in seconds (0-60): Eigen/MPFR approx (LU) vs. Arb approx (LU) and Arb certified (Hansen-Smith).]

SLIDE 34

Eigenvalues

Computing all eigenvalues and eigenvectors of a nonsymmetric complex matrix (N = 100, p = 128)

[Bar chart, time in seconds (0-15): Julia approx (QR method) vs. Arb approx (QR method), Arb certified (van der Hoeven-Mourrain), and Arb certified (Rump).]

SLIDE 35

Conclusion

Faster arbitrary-precision arithmetic, linear algebra
◮ Handle the dot product as an atomic operation, used instead of single adds/muls where possible (1-5× speedup)
◮ Accurate and fast large-N matrix multiplication using scaled integer blocks (≈ 10× speedup)
◮ Higher operations reduce well to dot product (small N) and matrix multiplication (large N)

Future work ideas
◮ Correctly rounded dot product, for MPFR (easy)
◮ Horner scheme (in analogy with the dot product)
◮ Better matrix scaling + splitting algorithm
