Using Mixed Precision in Numerical Computations to Speedup Linear Algebra Solvers



1. Using Mixed Precision in Numerical Computations to Speedup Linear Algebra Solvers. Jack Dongarra, UTK/ORNL/U Manchester; Azzam Haidar, Nvidia; Nick Higham, U of Manchester; Stan Tomov, UTK. Slides can be found: http://bit.ly/icerm-05-2020-dongarra (5/7/20)

2. Background • My interest in mixed precision began with my dissertation: § Improving the Accuracy of Computed Matrix Eigenvalues • Compute the eigenvalues and eigenvectors in low precision, then improve selected values/vectors to higher precision in O(n^2) ops by reusing the matrix decomposition § Extended to singular values, 1983 § Algorithm in TOMS 710, 1992
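A minimal sketch of the general idea, assuming NumPy/SciPy; this is not the TOMS 710 algorithm. Here an eigenpair is computed in FP32 and then polished in FP64 with shifted inverse iteration plus a Rayleigh-quotient update. For simplicity a fresh O(n^3) shifted factorization is formed, whereas the dissertation algorithm reuses the existing decomposition to keep the refinement at O(n^2) ops.

```python
# Illustrative sketch (assuming NumPy/SciPy), not the TOMS 710 algorithm:
# compute an eigenpair in FP32, then polish it in FP64 with shifted inverse
# iteration (one factorization, then O(n^2) work per refinement step).
import numpy as np
from scipy.linalg import eigh, lu_factor, lu_solve

def refine_eigenpair(A64, lam32, v32, steps=3):
    lam = np.float64(lam32)
    v = v32.astype(np.float64)
    lu, piv = lu_factor(A64 - lam * np.eye(A64.shape[0]))  # fixed shift
    for _ in range(steps):
        v = lu_solve((lu, piv), v)      # inverse iteration step, O(n^2)
        v /= np.linalg.norm(v)
        lam = v @ A64 @ v               # Rayleigh quotient update
    return lam, v

# usage on a random symmetric matrix (illustrative test case)
rng = np.random.default_rng(1)
A = rng.standard_normal((300, 300)); A = (A + A.T) / 2
w32, V32 = eigh(A.astype(np.float32))   # low-precision eigendecomposition
lam, v = refine_eigenpair(A, w32[-1], V32[:, -1])
print(np.linalg.norm(A @ v - lam * v))  # residual near FP64 level
```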

3. IBM’s Cell Processor, 2004 • 9 cores § Power PC at 3.2 GHz § 8 SPEs • 204.8 Gflop/s peak for about $600 § The catch is that this is for 32-bit floating point (single precision, SP) § The 64-bit floating-point peak is 14.6 Gflop/s, 14 times slower than SP: a factor of 2 because of DP and 7 because of latency issues. The SPEs were fully IEEE-754 compliant in double precision. In single precision they only implement round-towards-zero, denormalized numbers are flushed to zero, and NaNs are treated like normal numbers.

4. The Mixed Precision Idea Goes Something Like This… • Exploit 32-bit floating point as much as possible § Especially for the bulk of the computation • Correct or update the solution with selective use of 64-bit floating point to provide a refined result • Intuitively: § Compute a 32-bit result, § Calculate a correction to the 32-bit result using selected higher precision, and § Perform the update of the 32-bit result with the correction using the higher precision.

5. Leveraging Mixed Precision on the Cell Processor. Idea: use low precision to compute the expensive flops (the O(n^3) LU factorization) and then iteratively refine the solution (O(n^2) per step) in order to reach FP64 accuracy. Iterative refinement for dense systems, Ax = b, can work this way:

  LU = lu(A)                                      FP32 precision, O(n^3)
  x = U\(L\b)                                     FP32 precision, O(n^2)
  r = b - Ax   (with original A)                  FP64 precision, O(n^2)
  WHILE ||r|| not small enough
    1. find a correction z that satisfies Az = r  FP32 precision, O(n^2)
    2. x = x + z                                  FP64 precision, O(n)
    3. r = b - Ax   (with original A)             FP64 precision, O(n^2)
  END

Wilkinson, Moler, Stewart, and Higham provide error bounds for SP floating-point results when using DP floating point.
Ø It can be shown that using this approach we can compute the solution to 64-bit floating-point accuracy.
Ø A copy of the original matrix is needed to compute the residual r, and the matrix cannot be too badly conditioned.
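A minimal runnable sketch of this refinement loop, assuming NumPy/SciPy (the FP32 LU stands in for the Cell's fast single precision; the function name and test matrix are illustrative, not the library code used on the Cell):

```python
# Mixed-precision iterative refinement: FP32 factorization, FP64 residuals.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A64, b64, tol=1e-12, max_iter=50):
    """Solve Ax = b to ~FP64 accuracy from an FP32 LU factorization."""
    A32 = A64.astype(np.float32)                         # O(n^3) work in FP32
    lu, piv = lu_factor(A32)                             # FP32 LU factorization
    x = lu_solve((lu, piv), b64.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b64 - A64 @ x                                # FP64 residual, original A
        if np.linalg.norm(r) <= tol * np.linalg.norm(b64):
            break
        z = lu_solve((lu, piv), r.astype(np.float32))    # FP32 correction: Az = r
        x = x + z.astype(np.float64)                     # FP64 update
    return x

# usage on a reasonably conditioned random system (illustrative test case)
rng = np.random.default_rng(0)
n = 1000
A = rng.standard_normal((n, n)) + n * np.eye(n)
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))     # near FP64 level
```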

6. IBM Cell 3.2 GHz, Ax = b. [Plot: Gflop/s vs. matrix size (0 to 4500), showing 8 SGEMM (embarrassingly parallel), the SP theoretical peak (204 Gflop/s), and the DP theoretical peak (15 Gflop/s).]

7. IBM Cell 3.2 GHz, Ax = b. [Plot: adds the SP Ax = b performance (0.30 s) and the DP Ax = b performance (3.9 s) to the previous curves.]

8. IBM Cell 3.2 GHz, Ax = b. [Plot: adds the mixed-precision DSGESV performance (0.47 s), an 8.3X speedup over the DP solve (3.9 s).]

9. Intriguing Potential • Exploit lower precision as much as possible § Payoff in performance • Faster floating point • Less data to move • Automatically switch between SP and DP to match the desired accuracy § Compute the solution in SP and then a correction to the solution in DP • Potential for GPUs, FPGAs, and special-purpose processors § Use as little precision as you can get away with and improve the accuracy • Applies to linear systems, eigenvalue problems, and optimization problems where Newton's method is used. [Diagram: x_{i+1} = x_i + z, with the correction z = x_{i+1} - x_i = A\(b - A x_i).]

10. Machine Learning in Computational Science. Many fields are beginning to adopt machine learning to augment modeling and simulation methods: • Climate • Biology • Drug Design • Epidemiology • Materials • Cosmology • High-Energy Physics

11. Deep Learning Needs Small Matrix Operations. Matrix multiply is the time-consuming part: convolution layers and fully connected layers require matrix multiplies. There are many GEMMs on small matrices; they are perfectly parallel and can get by with 16-bit floating point. [Diagram: convolution step → fully connected layer (weights w_ij mapping inputs x_j to outputs y_i) → classification; in this case a 3x3 GEMM.]
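A tiny NumPy illustration of the point above (the 3x3 shapes and values are made up): a small fully connected layer is just a GEMM, and at these magnitudes FP16 is usually accurate enough for deep learning.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3)).astype(np.float16)   # weights w_ij
x = rng.standard_normal((3, 1)).astype(np.float16)   # input activations

y16 = W @ x                                          # FP16 GEMM (3x3 here)
y64 = W.astype(np.float64) @ x.astype(np.float64)    # FP64 reference
print(np.max(np.abs(y16.astype(np.float64) - y64)))  # small (~1e-3) error
```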

12. Nvidia Volta Peak Rates • Four performance levels for the different precisions: • 64-bit floating point (FMA): 7.5 Tflop/s peak • 32-bit floating point (FMA): 15 Tflop/s peak • 16-bit floating point (FMA): 30 Tflop/s peak • 16-bit floating point with Tensor Cores: 120 Tflop/s peak. The Tensor Core is special hardware for mixed-precision matrix multiply on 4x4 matrices.
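A sketch of what a single tensor-core operation computes, emulated in NumPy rather than run on the hardware: D = A*B + C on 4x4 tiles, with FP16 inputs and FP32 accumulation. The code below mimics the semantics, not the performance.

```python
import numpy as np

def tensor_core_mma(A16, B16, C32):
    """Emulate D = A*B + C: FP16 inputs, products accumulated in FP32."""
    return A16.astype(np.float32) @ B16.astype(np.float32) + C32

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)).astype(np.float16)
B = rng.standard_normal((4, 4)).astype(np.float16)
C = np.zeros((4, 4), dtype=np.float32)
D = tensor_core_mma(A, B, C)
print(D.dtype, D.shape)   # float32 (4, 4)
```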

13. [Figure-only slide: no text to transcribe.]

14. Mixed Precision • Today there are many precisions to deal with (IEEE standard) • Note the number range with half precision (16-bit floating point): the largest FP16 number is 65,504, versus O(10^38) for single precision.
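The quoted ranges are easy to check, for example with NumPy:

```python
import numpy as np

print(np.finfo(np.float16).max)   # 65504.0 -- FP16 overflows easily
print(np.finfo(np.float32).max)   # ~3.4e38
print(np.finfo(np.float16).eps)   # ~9.8e-4 -- about 3 decimal digits
```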

15. Leveraging Half Precision in HPC on V100. Study of the matrix-matrix multiplication (GEMM) kernel on an Nvidia V100: • FP64 square dgemm achieves about 6.4 Tflop/s. [Plot: Tflop/s vs. m = n, 2k to 30k.]

16. Leveraging Half Precision in HPC on V100. Same GEMM study: • FP64 square dgemm achieves about 6.4 Tflop/s • FP32 square sgemm achieves about 14 Tflop/s (~2X). [Plot: Tflop/s vs. m = n, 2k to 30k.]

17. Leveraging Half Precision in HPC on V100. Same GEMM study: • FP64 square dgemm achieves about 6.4 Tflop/s • FP32 square sgemm achieves about 14 Tflop/s • FP16 square hgemm achieves about 27 Tflop/s (~4X). [Plot: Tflop/s vs. m = n, 2k to 30k.]

18. Leveraging Half Precision in HPC on V100. Same GEMM study: • FP64 square dgemm achieves about 6.4 Tflop/s • FP32 square sgemm achieves about 14 Tflop/s • FP16 square hgemm achieves about 27 Tflop/s • FP16 tensor-core GEMM reaches about 85 Tflop/s (~12X). [Plot: Tflop/s vs. m = n, 2k to 30k.]

19. Leveraging Half Precision in HPC on V100. [Plot: the same GEMM performance data as the previous slide (FP16 TC ~85, FP16 ~27, FP32 ~14, FP64 ~6.4 Tflop/s), shown without the speedup annotation.]
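One hedged way to reproduce this trend, assuming PyTorch and a CUDA GPU such as a V100 (on Volta, cuBLAS dispatches FP16 matmul to the tensor cores); the matrix size and repetition count below are arbitrary choices, not the values behind the slide's plot:

```python
import time
import torch

def gemm_tflops(n, dtype, reps=10):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    torch.matmul(a, b)                 # warm-up
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(reps):
        torch.matmul(a, b)
    torch.cuda.synchronize()
    return 2 * n**3 * reps / (time.perf_counter() - t0) / 1e12

for dtype in (torch.float64, torch.float32, torch.float16):
    print(dtype, f"{gemm_tflops(8192, dtype):.1f} Tflop/s")
```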

20. Leveraging Half Precision in HPC on V100. Study of the rank-k update used by the LU factorization algorithm on an Nvidia V100. In LU factorization we need matrix multiplication, but the operation is a rank-k update computing the Schur complement. The rank-k GEMM needed by LU does not perform as well as the square GEMM, but is still OK. [Plot: Tflop/s vs. m = n for FP16 TC, FP16, FP32, and FP64 GEMM, square vs. k = 256.]
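A small NumPy sketch of why this kernel differs from the square GEMM: the Schur complement update is A22 ← A22 − A21·A12 with a short inner dimension k, i.e. a rank-k GEMM. The sizes below are illustrative, and the comparison also shows roughly the FP16-level error such a low-precision update would introduce.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 2048, 256                       # trailing-matrix size and block size
A21 = rng.standard_normal((n, k))      # panel columns (L block)
A12 = rng.standard_normal((k, n))      # panel rows (U block)
A22 = rng.standard_normal((n, n))      # trailing matrix

ref = A22 - A21 @ A12                  # FP64 rank-k Schur complement update
upd = A22 - (A21.astype(np.float16).astype(np.float32)
             @ A12.astype(np.float16).astype(np.float32))
print(np.max(np.abs(ref - upd)) / np.max(np.abs(ref)))   # ~FP16-level error
```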

21. Leveraging Half Precision in HPC on V100. Study of the LU factorization algorithm on an Nvidia V100: performance of the LU factorization in different precisions (FP16-TC→64 hgetrf, FP32→64 sgetrf, FP64 dgetrf), with roughly a 3~4X speedup for the FP16-TC variant. [Plot: Tflop/s vs. matrix size, 2k to 40k.] LU factorization is used to solve a linear system Ax = b: factor A so that LUx = b, then solve Ly = b followed by Ux = y. For the LU, half precision is used only in the GEMM; the panel and TRSM are done in SP.
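For reference, a minimal NumPy/SciPy sketch of the solve path described on this slide (scipy.linalg.lu returns A = P L U, so the permutation is applied to b first); this is plain FP64, without the mixed-precision refinement shown earlier:

```python
import numpy as np
from scipy.linalg import lu, solve_triangular

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 1000))
b = rng.standard_normal(1000)

P, L, U = lu(A)                               # A = P L U
y = solve_triangular(L, P.T @ b, lower=True)  # Ly = P^T b
x = solve_triangular(U, y)                    # Ux = y
print(np.linalg.norm(A @ x - b))              # small residual
```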

22. Leveraging Half Precision in HPC on V100. Use mixed-precision algorithms: Ø Achieve higher performance → faster time to solution (benefit from both faster operations and less data movement) Ø Reduce power consumption by decreasing the execution time → energy savings! – Reformulate to find a correction to the solution, rather than the solution itself: Δx rather than x. References: A. Haidar, P. Wu, S. Tomov, J. Dongarra, Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers, ScalA17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (SC17), ACM, Denver, Colorado, November 12-17, 2017. A. Haidar, S. Tomov, J. Dongarra, and N. J. Higham, Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers, SC18, Dallas, IEEE, 2018.
