Faster arbitrary-precision dot product and matrix multiplication


  1. Faster arbitrary-precision dot product and matrix multiplication. Fredrik Johansson, Inria Bordeaux. 26th IEEE Symposium on Computer Arithmetic (ARITH 26), Kyoto, Japan, June 10, 2019.

  2. Arbitrary-precision arithmetic. Precision: p ≥ 2 bits (can be thousands or millions).
     ◮ Floating-point numbers: 3.14159265358979323846264338328
     ◮ Ball arithmetic (mid-rad interval arithmetic): [3.14159265358979323846264338328 ± 8.65·10^−31]
     Why?
     ◮ Computational number theory, computer algebra
     ◮ Dynamical systems, ill-conditioned problems
     ◮ Verifying/testing numerical results and methods

  3. This work: faster arithmetic and linear algebra. CPU time (seconds) to multiply two real 1000 × 1000 matrices:

         Method            p = 53   p = 106   p = 212   p = 848
         BLAS               0.08
         QD                            11       111
         MPFR                 36       44       110       293
         Arb* (classical)     19       25        76       258
         Arb* (block)        3.6      5.6       8.2        27

     * With ball coefficients. Arb version 2.16 – http://arblib.org

  4. Two important requirements:
     ◮ True arbitrary precision; inputs and output can have mixed precision; no restrictions on the exponents
     ◮ Preserve structure: near-optimal enclosures for each entry, e.g.

         [ [1.23·10^100 ± 10^80]       −1.5                0 ]
         [ [2.34 ± 10^−20]             [3.45 ± 10^−50]     1 ]
         [ [4.56·10^−100 ± 10^−130]    0                   2 ]

  5. Dot product:

         sum_{k=1}^{N} a_k b_k,   a_k, b_k ∈ R or C

     Kernel in basecase (N ≈ 10 to 100) algorithms for:
     ◮ Matrix multiplication
     ◮ Triangular solving, recursive LU factorization
     ◮ Polynomial multiplication, division, composition
     ◮ Power series operations

  6. Dot product as an atomic operation. The old way:

         arb_mul(s, a, b, prec);
         for (k = 1; k < N; k++)
             arb_addmul(s, a + k, b + k, prec);

     The new way:

         arb_dot(s, NULL, 0, a, 1, b, 1, N, prec);

     (More generally, computes s = s_0 + (−1)^c sum_{k=0}^{N−1} a_{k·astep} b_{k·bstep}.)
     arb_dot – ball arithmetic, real
     acb_dot – ball arithmetic, complex
     arb_approx_dot – floating-point, real
     acb_approx_dot – floating-point, complex
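The general strided form above can be sketched in plain C. This is a double-precision analogue written for illustration only; the real arb_dot operates on arb_t balls, uses a limb-level accumulator, and performs a single final rounding rather than one per term. The helper name dot_strided is hypothetical.

```c
#include <stddef.h>

/* Double-precision analogue of the general form computed by arb_dot:
   s = s0 + (-1)^subtract * sum_{k=0}^{len-1} a[k*astep] * b[k*bstep].
   (Sketch only; not the actual arb implementation.) */
double dot_strided(double s0, int subtract,
                   const double *a, ptrdiff_t astep,
                   const double *b, ptrdiff_t bstep, size_t len)
{
    double sum = 0.0;
    for (size_t k = 0; k < len; k++)
        sum += a[k * astep] * b[k * bstep];
    return subtract ? s0 - sum : s0 + sum;
}
```

With astep = bstep = 1 and subtract = 0 this reduces to the plain s_0 + sum a_k b_k used by the matrix and polynomial kernels.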

  7. Numerical dot product. Approximate (floating-point) dot product:

         s = sum_{k=1}^{N} a_k b_k + ε,   |ε| ≈ 2^{−p} sum_{k=1}^{N} |a_k b_k|

     Ball arithmetic dot product:

         [m ± r] ⊇ sum_{k=1}^{N} [m_k ± r_k][m′_k ± r′_k]

         m = sum_{k=1}^{N} m_k m′_k + ε,   r ≥ |ε| + sum_{k=1}^{N} (|m_k| r′_k + |m′_k| r_k + r_k r′_k)
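The ball bound above can be demonstrated with a toy mid-rad type over doubles. This is a sketch under stated assumptions: the ball struct and ball_dot function are hypothetical, and the per-step rounding-error term |ε| is bounded crudely by an ulp-sized quantity, unlike Arb's exact limb-level accounting.

```c
#include <math.h>

/* Toy mid-rad ball; a stand-in for arb's mpn midpoint + 30-bit radius. */
typedef struct { double m, r; } ball;

/* Ball dot product following the slide's bound:
   m = sum m_k m'_k + eps,
   r >= |eps| + sum (|m_k| r'_k + |m'_k| r_k + r_k r'_k). */
ball ball_dot(const ball *a, const ball *b, int n)
{
    ball s = {0.0, 0.0};
    for (int k = 0; k < n; k++) {
        double prod = a[k].m * b[k].m;
        s.m += prod;
        s.r += fabs(a[k].m) * b[k].r + fabs(b[k].m) * a[k].r
             + a[k].r * b[k].r;
        /* crude bound on the floating-point error eps of this step */
        s.r += ldexp(fabs(prod) + fabs(s.m), -52);
    }
    return s;
}
```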

  8. Representation of numbers in Arb (like MPFR). Arbitrary-precision floating-point numbers:

         (−1)^sign · 2^exp · sum_{k=0}^{n−1} b_k 2^{64(k−n)}

     Limbs b_k are 64-bit words, normalized: 0 ≤ b_k < 2^64, b_{n−1} ≥ 2^63, b_0 ≠ 0.
     All core arithmetic operations are implemented using word manipulations and low-level GMP (mpn layer) function calls.
     Radius: 30-bit unsigned floating-point.
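The value formula and normalization rules can be checked concretely. A small sketch, assuming hypothetical helper names; real Arb code never converts limbs to a double like this, it is done here only to make the formula executable.

```c
#include <stdint.h>
#include <math.h>

/* Value of an n-limb Arb/MPFR-style mantissa:
   (-1)^sign * 2^exp * sum_{k=0}^{n-1} b_k 2^{64(k-n)}.
   With b_{n-1} >= 2^63, the mantissa sum lies in [1/2, 1). */
double limbs_to_double(int sign, int exp, const uint64_t *b, int n)
{
    double v = 0.0;
    for (int k = 0; k < n; k++)
        v += ldexp((double) b[k], 64 * (k - n));
    return (sign ? -v : v) * ldexp(1.0, exp);
}

/* Normalization from the slide: top limb has its high bit set,
   bottom limb is nonzero (0 <= b_k < 2^64 is implicit in uint64_t). */
int is_normalized(const uint64_t *b, int n)
{
    return n > 0 && (b[n - 1] >> 63) == 1 && b[0] != 0;
}
```

For example, a single limb b_0 = 2^63 with exp = 1 represents 2^1 · 2^63 · 2^−64 = 1.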

  9–11. Arbitrary-precision multiplication (diagram, animation build). Inputs of m limbs and n limbs are multiplied exactly with mpn_mul, giving an m + n limb product; the result is then rounded to ≤ p bits and bit-aligned.

  12–15. Arbitrary-precision addition (diagram, animation build). Compute the exponent difference; align limbs (mpn_lshift etc.); add or subtract (mpn_add_n, mpn_sub_n, mpn_add_1 etc.); round the result to ≤ p bits and bit-align.

  16. Dot product. First pass: inspect the terms.
     ◮ Count nonzero terms
     ◮ Bound upper and lower exponents of the terms
     ◮ Detect Inf/NaN/overflow/underflow (fallback code)
     Second pass: compute the dot product!
     ◮ Exploit knowledge about exponents
     ◮ Single temporary memory allocation
     ◮ Single final rounding and normalization
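The first pass can be sketched in plain C over doubles. This is an illustrative analogue only: scan_terms and scan_t are hypothetical names, and Arb works on exponents of arb_t terms rather than calling frexp on doubles.

```c
#include <math.h>
#include <limits.h>

/* Result of the inspection pass: nonzero-term count, an upper bound on
   the top exponent of |a_k b_k|, and a flag for the fallback path. */
typedef struct { int nonzero; int max_exp; int need_fallback; } scan_t;

scan_t scan_terms(const double *a, const double *b, int n)
{
    scan_t s = {0, INT_MIN, 0};
    for (int k = 0; k < n; k++) {
        double t = a[k] * b[k];
        if (!isfinite(t)) { s.need_fallback = 1; continue; }
        if (t != 0.0) {
            int e;
            frexp(t, &e);          /* |t| in [2^(e-1), 2^e) */
            s.nonzero++;
            if (e > s.max_exp) s.max_exp = e;
        }
    }
    return s;
}
```

The second pass can then size a single accumulator from max_exp and the term count, allocate once, and round once at the end.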

  17–20. Dot product (diagram, animation build). Each of the N terms is computed to p bits plus ~log2 N bits of padding (guard fields A, B), and the terms are summed in a two's-complement accumulator, with a separate error accumulator for the rounding terms.

  21. Technical comments. Radius dot products (for ball arithmetic):
     ◮ Dedicated code using a 64-bit accumulator
     Special sizes:
     ◮ Inline assembly instead of GMP function calls for ≤ 2 × 2 limb products and ≤ 3 limb accumulators
     ◮ Mulders' mulhigh (via MPFR) for 25 to 10000 limbs
     Complex numbers:
     ◮ Essentially done as two length-2N real dot products
     ◮ Karatsuba-style multiplication (3 instead of 4 real multiplications) for ≥ 128 limbs
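The 3-multiplication complex product mentioned in the last bullet is the classical Karatsuba/Gauss trick. A double-precision sketch (the cplx type and mul3 name are illustrative; Arb applies this only at ≥ 128 limbs because the extra additions cost accuracy and time at small precision):

```c
/* (a + bi)(c + di) with three real multiplications instead of four:
   t  = c*(a + b)
   re = t - b*(c + d)  =  ac - bd
   im = t + a*(d - c)  =  ad + bc */
typedef struct { double re, im; } cplx;

cplx mul3(cplx x, cplx y)
{
    double t = y.re * (x.re + x.im);
    cplx z;
    z.re = t - x.im * (y.re + y.im);
    z.im = t + x.re * (y.im - y.re);
    return z;
}
```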

  22. Dot product performance. [Plot: cycles per term vs. bit precision p (64 to 768), comparing arb_addmul, mpfr_mul/mpfr_add, arb_dot, and arb_approx_dot.]

  23. Dot product performance. [Plot: cycles per term vs. bit precision p (64 to 256), adding QD at p = 106 and p = 212 to the previous comparison.]

  24. Dot product: polynomial operations speedup in Arb. [Plot: speedup (up to ~5×) vs. polynomial degree N (1 to 1000) for mul, mullow, divrem, inv_series, exp_series, sin_cos, compose, revert; complex coefficients, p = 64 bits.]

  25. Dot product: matrix operations speedup in Arb. [Plot: speedup (up to ~5×) vs. matrix size N (1 to 1000) for mul, solve, inv, det, exp, charpoly; complex coefficients, p = 64 bits.]

  26–27. Matrix multiplication (large N). Same ideas as polynomial multiplication in Arb:
     1. [A ± a][B ± b] via three multiplications: AB, |A|b, a(|B| + b)
     2. Split + scale matrices into blocks with uniform magnitude
     3. Multiply blocks of A, B exactly over Z using FLINT
     4. Multiply blocks of |A|, b, a, |B| + b using hardware FP
     Where is the gain?
     ◮ Integers and hardware FP have less overhead
     ◮ Multimodular/RNS arithmetic (60-bit primes in FLINT)
     ◮ Strassen O(N^2.81) matrix multiplication in FLINT
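Step 1 is easiest to see on scalars: [A ± a][B ± b] ⊆ [AB ± (|A|b + a(|B| + b))], since for u in [A ± a], v in [B ± b] we have |uv − AB| ≤ |A||v − B| + |u − A||v| ≤ |A|b + a(|B| + b). A scalar sketch with a hypothetical ball type (Arb does this blockwise, with AB over Z via FLINT and the two radius products in hardware FP):

```c
#include <math.h>

typedef struct { double m, r; } ball;

/* Scalar version of the three-product scheme:
   midpoint A*B, radius |A|*b + a*(|B| + b). */
ball ball_mul3(ball x, ball y)
{
    ball z;
    z.m = x.m * y.m;
    z.r = fabs(x.m) * y.r + x.r * (fabs(y.m) + y.r);
    return z;
}
```

The bound is tight: for x = [2 ± 0.5], y = [3 ± 0.25], the extreme point 2.5 · 3.25 = 8.125 equals the midpoint 6 plus the radius 2.125 exactly.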

  28. Matrix multiplication (diagram). Blocks A_s, B_s are chosen (using greedy search) so that each row of A_s and each column of B_s has a small internal exponent range. Classical product: C = AB with c_{i,j} = sum_k a_{i,k} b_{k,j} (row i of A against column j of B).

  29. Block matrix multiplication (diagram). Choose blocks A_s, B_s (indices s ⊆ {1, …, N}) so that each row of A_s and each column of B_s has a small internal exponent range; accumulate C ← C + A_s B_s.

  30. Block matrix multiplication, scaled to integers (diagram). Scaling is applied internally to each block A_s, B_s: row i is scaled by 2^{e_{i,s}} and column j by 2^{f_{j,s}}, so that each row of A_s and column of B_s has a small range of exponent differences. With E_s = diag(2^{e_{i,s}}) and F_s = diag(2^{f_{j,s}}):

         C ← C + E_s^{−1} ((E_s A_s)(B_s F_s)) F_s^{−1}
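For a single entry the scaling is the exact identity 2^{−e} ((2^e a)(b 2^f)) 2^{−f} = ab, since multiplying by powers of two only shifts exponents. A one-entry sketch of the idea (not Arb's block search; the function name is illustrative):

```c
#include <math.h>

/* Entry-level view of C += E^{-1} ((E A)(B F)) F^{-1} with
   E = diag(2^e_i), F = diag(2^f_j): scale, multiply, unscale.
   The exponents are chosen so that 2^e * a and b * 2^f become
   integers of moderate size, suitable for exact multiplication. */
double scaled_prod(double a, double b, int e, int f)
{
    double sa = ldexp(a, e);        /* (E A) entry: row scaled by 2^e   */
    double sb = ldexp(b, f);        /* (B F) entry: column scaled by 2^f */
    return ldexp(sa * sb, -e - f);  /* E^{-1} (..) F^{-1}                */
}
```

For example, a = 0.375 with e = 3 and b = 1.5 with f = 1 become the integers 3 and 3, and unscaling restores 0.375 · 1.5 = 0.5625 exactly.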

  31. Uniform and non-uniform matrices. Uniform matrix, N = 1000:

         p      Classical   Block    Number of blocks   Speedup
         53     19 s        3.6 s    1                  5.3
         212    76 s        8.2 s    1                  9.3
         3392   1785 s      115 s    1                  15.5

     Pascal matrix, N = 1000 (entries A_{i,j} = π · binomial(i + j, j)):

         p      Classical   Block    Number of blocks   Speedup
         53     12 s        20 s     10                 0.6
         212    43 s        35 s     9                  1.2
         3392   1280 s      226 s    2                  5.7

  32. Approximate and certified linear algebra. Three approaches to linear solving Ax = b:
     ◮ Gaussian elimination in floating-point arithmetic: stable if A is well-conditioned
     ◮ Gaussian elimination in interval/ball arithmetic: unstable for generic well-conditioned A (lose O(N) digits)
     ◮ Approx + certification: 3.141 → [3.141 ± 0.001]
     Example: Hansen-Smith algorithm
     1. Compute R ≈ A^{−1} approximately
     2. Solve (RA)x = Rb in interval/ball arithmetic
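The point of the Hansen-Smith preconditioning is that RA is close to the identity, so interval Gaussian elimination on (RA)x = Rb stays stable. A 2 × 2 double-precision sketch of step 1 only (the function name is hypothetical, and the certified ball-arithmetic solve of step 2 is omitted):

```c
/* Hansen-Smith step 1, 2x2 case: form R ~= A^{-1} in floating point,
   then return the preconditioned system RA (close to identity) and Rb.
   Step 2 would solve (RA)x = Rb in ball arithmetic. */
void precondition_2x2(double A[2][2], double b[2],
                      double RA[2][2], double Rb[2])
{
    double det = A[0][0] * A[1][1] - A[0][1] * A[1][0];
    double R[2][2] = {{  A[1][1] / det, -A[0][1] / det },
                      { -A[1][0] / det,  A[0][0] / det }};
    for (int i = 0; i < 2; i++) {
        Rb[i] = R[i][0] * b[0] + R[i][1] * b[1];
        for (int j = 0; j < 2; j++)
            RA[i][j] = R[i][0] * A[0][j] + R[i][1] * A[1][j];
    }
}
```

When R is a good approximate inverse, Rb is already close to the solution and the certified step only needs to enclose a small correction.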

  33. Linear solving. [Bar chart: time (seconds, 0–60) to solve a dense real linear system Ax = b (N = 1000, p = 212), comparing approximate LU solving and certified solving (Hansen-Smith) in Arb against approximate LU solving in Eigen/MPFR.]

  34. Eigenvalues. [Bar chart: time (seconds, 0–15) to compute all eigenvalues and eigenvectors of a nonsymmetric complex matrix (N = 100, p = 128): approximate (QR method) in Julia vs. approximate (van der Hoeven-Mourrain), certified (Rump), and certified (QR method) in Arb.]

  35. Conclusion. Faster arbitrary-precision arithmetic and linear algebra:
     ◮ Handle the dot product as an atomic operation; use it instead of single additions/multiplications where possible (1–5× speedup)
     ◮ Accurate and fast large-N matrix multiplication using scaled integer blocks (≈ 10× speedup)
     ◮ Higher operations reduce well to dot products (small N) and matrix multiplication (large N)
     Future work ideas:
     ◮ Correctly rounded dot product for MPFR (easy)
     ◮ Horner scheme (in analogy with the dot product)
     ◮ Better matrix scaling + splitting algorithm
