Experiments with Mixed Precision Algorithms in Linear Algebra



  1. Experiments with Mixed Precision Algorithms in Linear Algebra. Jack Dongarra (UTK/ORNL/U Manchester), Azzam Haidar (Nvidia), Stan Tomov (UTK), Nick Higham (U of Manchester). 8/28/19

  2. Mixed Precision. Today there are many precisions to deal with (IEEE Standard). Note the limited number range of half precision (16-bit floating point): the largest float16 number is 65,504, whereas the largest IEEE single precision number, and likewise Google's TPU bfloat16, is O(10^38).
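As a quick check of those ranges, here is a small NumPy sketch (not from the slides; bfloat16 is not a native NumPy type, so only float16 and float32 are shown):

    import numpy as np

    # Largest representable number and machine epsilon for the two IEEE formats.
    for dtype in (np.float16, np.float32):
        info = np.finfo(dtype)
        print(f"{info.dtype}: max = {info.max:.4g}, eps = {info.eps:.3g}")

    # float16: max = 6.55e+04   (the 65,504 quoted on the slide)
    # float32: max = 3.403e+38  (the O(10^38) range, shared by bfloat16)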

  3. Nvidia Volta Peak Rates. Four performance levels for the different precisions:
  • 64-bit floating point (FMA): 7.5 Tflop/s
  • 32-bit floating point (FMA): 15 Tflop/s
  • 16-bit floating point (FMA): 30 Tflop/s
  • 16-bit floating point with Tensor Cores: 120 Tflop/s
  The numerical characteristics of arithmetic on the Tensor Cores are different: the Tensor Core performance comes from mixed-precision multiplication of 4x4 matrices.

  4. 4x4 matrix multiply: 32-bit floating point accuracy with 16-bit inputs.
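A NumPy simulation of this point (not a tensor-core call, only a software emulation of the rounding behaviour): multiply two 4x4 FP16 matrices, once accumulating in FP16 and once in FP32, and compare both against an FP64 reference. The helper name gemm_fp16_inputs is illustrative.

    import numpy as np

    def gemm_fp16_inputs(A, B, acc):
        """C = A*B with FP16 inputs; every product and partial sum is rounded to `acc`."""
        m, n, k = A.shape[0], B.shape[1], A.shape[1]
        C = np.zeros((m, n), dtype=acc)
        for i in range(m):
            for j in range(n):
                s = acc(0)
                for p in range(k):
                    s = acc(s + acc(A[i, p]) * acc(B[p, j]))
                C[i, j] = s
        return C

    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 4)).astype(np.float16)
    B = rng.standard_normal((4, 4)).astype(np.float16)
    ref = A.astype(np.float64) @ B.astype(np.float64)   # high-precision reference

    for acc in (np.float16, np.float32):
        err = np.max(np.abs(gemm_fp16_inputs(A, B, acc) - ref))
        print(f"accumulate in {acc.__name__}: max error = {err:.2e}")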

  5. Dense Linear Algebra (DLA) is needed in a wide variety of science and engineering applications:
  • Linear systems: solve Ax = b. Computational electromagnetics, materials science, applications using boundary integral equations, airflow past wings, fluid flow around ships and other offshore constructions, and many more.
  • Least squares: find x to minimize ||Ax - b||. Computational statistics (e.g., linear least squares or ordinary least squares), econometrics, control theory, signal processing, curve fitting, and many more.
  • Eigenproblems: solve Ax = λx. Computational chemistry, quantum mechanics, materials science, face recognition, PCA, data mining, marketing, Google PageRank, spectral clustering, vibrational analysis, compression, and many more.
  • SVD: A = UΣV* (Av = σu and A*u = σv). Information retrieval, web search, signal processing, big-data analytics, low-rank matrix approximation, total least squares minimization, pseudo-inverse, and many more.
  • Many variations depending on the structure of A: A can be symmetric, positive definite, tridiagonal, Hessenberg, banded, sparse with dense blocks, etc.
  • DLA is crucial to the development of sparse solvers.
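For reference, each of the four problem classes above is one dense NumPy/SciPy call (illustrative sizes only; this is not part of the slides):

    import numpy as np
    from scipy import linalg

    rng = np.random.default_rng(0)
    A = rng.standard_normal((100, 100))
    b = rng.standard_normal(100)

    x = linalg.solve(A, b)                        # linear system  Ax = b
    x_ls, *_ = linalg.lstsq(rng.standard_normal((200, 100)),
                            rng.standard_normal(200))   # least squares, min ||Ax - b||
    w, V = linalg.eig(A)                          # eigenproblem   Ax = λx
    U, s, Vh = linalg.svd(A)                      # SVD            A = UΣV*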

  6. Leveraging Half Precision in HPC on V100. Study of the matrix-matrix multiplication (GEMM) kernel on the Nvidia V100: dgemm (FP64) achieves about 6.4 Tflop/s. [Plot: GEMM Tflop/s vs. matrix size (2k-30k), FP64 curve.]

  7. Leveraging Half Precision in HPC on V100. Study of the GEMM kernel on the Nvidia V100: dgemm (FP64) achieves about 6.4 Tflop/s; sgemm (FP32) achieves about 14 Tflop/s, roughly 2x. [Plot: GEMM Tflop/s vs. matrix size, FP64 and FP32 curves, ~2X annotation.]
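The curves in these plots are obtained by timing a GEMM and converting to a rate: an n x n product costs about 2*n^3 flops, so rate = 2*n^3 / time. A host-side NumPy sketch of that measurement is below; it only contrasts FP64 and FP32 (NumPy has no tensor-core path), the absolute numbers depend on the CPU BLAS rather than a V100, and the helper name gemm_rate is mine.

    import time
    import numpy as np

    def gemm_rate(n, dtype):
        A = np.random.standard_normal((n, n)).astype(dtype)
        B = np.random.standard_normal((n, n)).astype(dtype)
        t0 = time.perf_counter()
        C = A @ B                                  # the GEMM being timed
        t1 = time.perf_counter()
        return 2.0 * n**3 / (t1 - t0) / 1e9        # Gflop/s

    n = 2000
    print("FP64 GEMM:", gemm_rate(n, np.float64), "Gflop/s")
    print("FP32 GEMM:", gemm_rate(n, np.float32), "Gflop/s   (typically about 2x FP64)")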

  8. Leveraging Half Precision in HPC on V100. Study of the GEMM kernel on the Nvidia V100: dgemm (FP64) achieves about 6.4 Tflop/s; sgemm (FP32) about 14 Tflop/s; hgemm (FP16) about 27 Tflop/s, roughly 4x FP64. [Plot: GEMM Tflop/s vs. matrix size, FP64, FP32, and FP16 curves, ~4X annotation.]

  9. Leveraging Half Precision in HPC on V100. Study of the GEMM kernel on the Nvidia V100: dgemm (FP64) achieves about 6.4 Tflop/s; sgemm (FP32) about 14 Tflop/s; hgemm (FP16) about 27 Tflop/s; FP16 GEMM with Tensor Cores reaches about 85 Tflop/s, roughly 12x FP64. [Plot: GEMM Tflop/s vs. matrix size, FP64, FP32, FP16, and FP16 Tensor Core curves, ~12X annotation.]

  10. Leveraging Half Precision in HPC on V100. [Plot repeated from the previous slide: FP64, FP32, FP16, and FP16 Tensor Core GEMM Tflop/s vs. matrix size, with the rates listed above.]

  11. Leveraging Half Precision in HPC on V100. Study of the rank-k update used by the LU factorization algorithm on the Nvidia V100. LU factorization needs matrix multiplies, but the operation is a rank-k update computing the Schur complement. The rank-k GEMM needed by LU does not perform as well as a square GEMM, but is still OK. [Plot: Tflop/s vs. m = n for square and k = 256 GEMMs in FP16 with Tensor Cores, FP16, FP32, and FP64.]
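The shape of that update is the key point: the trailing matrix A22 is modified by the product of a tall block column and a wide block row, so k (= the panel width nb) is much smaller than m and n. A short NumPy sketch of the update and its flop rate, with illustrative sizes:

    import time
    import numpy as np

    m = n = 4096
    nb = 256                                      # panel width = rank of the update
    A21 = np.random.standard_normal((m, nb))      # block column below the diagonal block
    A12 = np.random.standard_normal((nb, n))      # block row to the right of it
    A22 = np.random.standard_normal((m, n))       # trailing matrix

    t0 = time.perf_counter()
    A22 -= A21 @ A12                              # rank-nb Schur complement update
    t1 = time.perf_counter()
    print(f"rank-{nb} update rate: {2 * m * n * nb / (t1 - t0) / 1e9:.1f} Gflop/s")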

  12. Leveraging Half Precision in HPC on V100: solving a linear system Ax = b. LU factorization is used to solve the linear system: factor A = LU, so that LUx = b; solve Ly = b (forward substitution), then Ux = y (back substitution).
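Spelled out with SciPy (not from the slides): factor once, then do the two triangular solves. SciPy's lu also returns a permutation matrix P with A = PLU, which the slide's sketch omits, so the forward solve uses P^T b.

    import numpy as np
    from scipy.linalg import lu, solve_triangular

    rng = np.random.default_rng(0)
    A = rng.standard_normal((500, 500))
    b = rng.standard_normal(500)

    P, L, U = lu(A)                                   # A = P L U
    y = solve_triangular(L, P.T @ b, lower=True)      # L y = P^T b   (forward substitution)
    x = solve_triangular(U, y)                        # U x = y       (back substitution)
    print("residual:", np.linalg.norm(A @ x - b))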

  13. Leveraging Half Precision in HPC on V100: solving a linear system Ax = b. LU factorization requires O(n^3) operations, and most of them are spent in GEMM. For s = 0, nb, ..., N: 1. factorize the panel; 2. update the trailing matrix. The kernels per step are:
  • Panel factorization
  • TRSM: triangular solve
  • GEMM: matrix multiply
  [Diagram: steps 1-4 of the blocked factorization, each showing the panel and the trailing-matrix update.]
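A minimal, unpivoted blocked LU in NumPy/SciPy that labels where the three kernels appear. This is not the MAGMA code: the function name blocked_lu and the diagonally dominant test matrix are illustrative assumptions, and a real implementation adds partial pivoting.

    import numpy as np
    from scipy.linalg import solve_triangular

    def blocked_lu(A, nb=64):
        """In-place, unpivoted blocked LU: on return A holds L (unit lower) and U."""
        n = A.shape[0]
        for s in range(0, n, nb):
            e = min(s + nb, n)
            # 1. panel factorization: unblocked LU of the nb-wide panel A[s:, s:e]
            for j in range(s, e):
                A[j+1:, j] /= A[j, j]
                A[j+1:, j+1:e] -= np.outer(A[j+1:, j], A[j, j+1:e])
            if e < n:
                # 2. TRSM: unit-lower triangular solve against the block row
                A[s:e, e:] = solve_triangular(A[s:e, s:e], A[s:e, e:],
                                              lower=True, unit_diagonal=True)
                # 3. GEMM: rank-nb Schur complement update of the trailing matrix
                A[e:, e:] -= A[e:, s:e] @ A[s:e, e:]
        return A

    # Quick check on a diagonally dominant matrix (so skipping pivoting is safe).
    n = 512
    A = np.random.standard_normal((n, n)) + n * np.eye(n)
    LU = blocked_lu(A.copy())
    L = np.tril(LU, -1) + np.eye(n)
    U = np.triu(LU)
    print("factorization error:", np.linalg.norm(L @ U - A) / np.linalg.norm(A))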

  14. Leveraging Half Precision in HPC on V100. Study of the LU factorization algorithm on the Nvidia V100. LU factorization is used to solve a linear system Ax = b: A = LU, so LUx = b; solve Ly = b, then Ux = y. [Plot: Tflop/s (0-24) vs. matrix size for FP16 hgetrf with Tensor Cores, FP16 hgetrf, FP32 sgetrf, and FP64 dgetrf, with a ~3-4X annotation.]

  15. Leveraging Half Precision in HPC on V100: solving a linear system Ax = b. For s = 0, nb, ..., N: 1. factorize the panel; 2. update the trailing matrix. The precisions used are:
  • Panel factorization: performed in 32-bit floating point, using MAGMA on the front-end system.
  • TRSM (triangular solve): performed in 32-bit floating point on the V100 (no Tensor Cores).
  • GEMM (matrix multiply): performed in 16-bit floating point on the V100 with Tensor Cores.
  Most of the performance comes from the GEMM in 16-bit floating point.
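As a rough software imitation of this precision split (no GPU or Tensor Cores involved), the sketch below keeps the panel and the TRSM in FP32 and rounds the GEMM operands to FP16 before the trailing update, accumulating the product in FP32. The function name blocked_lu_mixed is mine, and the update omits pivoting like the sketch above.

    import numpy as np
    from scipy.linalg import solve_triangular

    def blocked_lu_mixed(A, nb=64):
        """Unpivoted blocked LU with the slide's precision split, emulated in software."""
        A = A.astype(np.float32)                       # factorization working precision
        n = A.shape[0]
        for s in range(0, n, nb):
            e = min(s + nb, n)
            for j in range(s, e):                      # panel factorization: FP32
                A[j+1:, j] /= A[j, j]
                A[j+1:, j+1:e] -= np.outer(A[j+1:, j], A[j, j+1:e])
            if e < n:
                A[s:e, e:] = solve_triangular(A[s:e, s:e], A[s:e, e:],   # TRSM: FP32
                                              lower=True, unit_diagonal=True)
                # GEMM: operands rounded to FP16, product accumulated in FP32
                L21 = A[e:, s:e].astype(np.float16).astype(np.float32)
                U12 = A[s:e, e:].astype(np.float16).astype(np.float32)
                A[e:, e:] -= L21 @ U12
        return A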

  16. Leveraging Half Precision in HPC on V100: use mixed precision algorithms.
  • Achieve higher performance → faster time to solution.
  • Reduce power consumption by decreasing the execution time → energy savings!
  References:
  A. Haidar, P. Wu, S. Tomov, J. Dongarra, "Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers," SC17, ScalA17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ACM, Denver, Colorado, November 12-17, 2017.
  A. Haidar, S. Tomov, J. Dongarra, and N. J. Higham, "Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers," SC18, Dallas, TX, IEEE, November 2018.

  17. Leveraging Half Precision in HPC on V100. Idea: use low precision to compute the expensive flops (the O(n^3) LU factorization) and then iteratively refine the solution in order to achieve FP64 accuracy. Iterative refinement for dense systems Ax = b can work this way:

    L, U = lu(A)                   lower precision   O(n^3)
    x = U\(L\b)                    lower precision   O(n^2)
    r = b - Ax                     FP64 precision    O(n^2)
    WHILE ||r|| not small enough
      1. find a correction z to adjust x that satisfies Az = r;
         solving Az = r can be done by either:
         • z = U\(L\r)                       (classical iterative refinement)    lower precision   O(n^2)
         • GMRES preconditioned by the LU     (iterative refinement with GMRES)   lower precision   O(n^2)
      2. x = x + z                 FP64 precision    O(n^1)
      3. r = b - Ax                FP64 precision    O(n^2)
    END

  • Wilkinson, Moler, Stewart, and Higham provide error bounds for single precision results when refining in double precision.
  • Carson and Higham showed that the inner problem can be solved with an iterative method without contaminating the solution; it can be shown that with this approach the solution is computed to 64-bit floating point accuracy.
  • The original matrix is needed to compute the residual r, and the matrix cannot be too badly conditioned.
  Reference: E. Carson and N. J. Higham, "Accelerating the Solution of Linear Systems by Iterative Refinement in Three Precisions," SIAM J. Sci. Comput., 40(2), A817-A847.
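A runnable sketch of this loop (not from the slides): NumPy/SciPy have no FP16 LU, so FP32 stands in for the "lower precision" factorization and FP64 is the working precision. The function name ir_solve and the well-conditioned test matrix are illustrative assumptions.

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def ir_solve(A, b, tol=1e-12, maxiter=30):
        lu, piv = lu_factor(A.astype(np.float32))                 # O(n^3), low precision
        x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
        for _ in range(maxiter):
            r = b - A @ x                                         # residual in FP64, O(n^2)
            if np.linalg.norm(r) <= tol * np.linalg.norm(b):
                break
            z = lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)  # correction, O(n^2)
            x = x + z                                             # update in FP64
        return x

    rng = np.random.default_rng(0)
    n = 1000
    A = rng.standard_normal((n, n)) + n * np.eye(n)               # well-conditioned test matrix
    b = rng.standard_normal(n)
    x = ir_solve(A, b)
    print("relative residual:", np.linalg.norm(A @ x - b) / np.linalg.norm(b))

The GMRES-based variant would replace the lu_solve correction step with an inner GMRES solve of Az = r preconditioned by the same low-precision LU factors.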

  18. Improving the Solution
  • z is the correction, i.e. (x_{i+1} - x_i).
  • It is computed in lower precision and then added to the approximate solution in higher precision: x_{i+1} = x_i + z.
  • This can be used in situations like the following...

  19. Recent Results Run at Scale
  • The mixed precision iterative refinement approach solved a matrix of order 10,091,520 on ORNL's Summit system.
    - Summit nodes consist of 2 IBM Power9 processors (22 cores each) plus 6 Nvidia V100 GPUs (84 SMs each).
    - The run used 4,500 nodes of Summit: 2,466,000 cores = 4500 * (22*2 + 84*6).
    - A random matrix with large diagonal elements was used to ensure convergence of the method.
  • Mixed precision HPL achieved 445 Pflop/s, a 2.95x speedup over the FP64 HPL result on the TOP500 (148 Pflop/s).
    - 43 Gflop/s per Watt.
  • Same accuracy as full 64-bit precision.
