

SLIDE 1

8/28/19

Experiments with Mixed Precision Algorithms in Linear Algebra

Jack Dongarra (UTK/ORNL/U Manchester), Azzam Haidar (Nvidia), Stan Tomov (UTK), Nick Higham (U of Manchester)

SLIDE 2

Mixed Precision

  • Today there are many precisions to deal with (IEEE Standard)
  • Note the limited number range of half precision (16-bit floating point): the largest IEEE float16 number is 65,504, whereas Google TPU's bfloat16 keeps the IEEE single-precision range, with a largest number of O(10^38)
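A quick NumPy check of these ranges (an illustrative aside, not from the original slide):

```python
import numpy as np

# Largest finite values for the formats discussed on this slide.
print(np.finfo(np.float16).max)   # 65504.0  -- half precision overflows quickly
print(np.finfo(np.float32).max)   # ~3.4e38  -- bfloat16 shares this exponent range

# Overflow in half precision: 300 * 300 = 90000 > 65504
a = np.float16(300.0)
print(a * a)                          # inf in float16
print(np.float32(a) * np.float32(a))  # 90000.0 in single precision
```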

SLIDE 3

Nvidia Volta Peak Rates

  • Four performance levels for the different precisions:
  • 64-bit floating point (FMA): 7.5 Tflop/s
  • 32-bit floating point (FMA): 15 Tflop/s
  • 16-bit floating point (FMA): 30 Tflop/s
  • 16-bit floating point with Tensor Cores: 120 Tflop/s
  • The numerical characteristics of arithmetic on the Tensor Cores are different

Tensor Core performance comes from mixed-precision multiplication of 4x4 matrices.

SLIDE 4


4x4 matrix multiply: 32-bit floating-point accuracy with 16-bit inputs (A and B are stored in FP16, the products are accumulated in FP32)
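A rough way to emulate that behavior in NumPy (an illustrative sketch of the numerics, not how the hardware is programmed) is to cast the FP16 inputs up to FP32 before multiplying and accumulating:

```python
import numpy as np

rng = np.random.default_rng(0)
# 4x4 operands stored in half precision, as they would be fed to a Tensor Core
A = rng.standard_normal((4, 4)).astype(np.float16)
B = rng.standard_normal((4, 4)).astype(np.float16)
C = np.zeros((4, 4), dtype=np.float32)

# Tensor-Core-style result: FP16 inputs, products and sums carried in FP32
D = A.astype(np.float32) @ B.astype(np.float32) + C

# Double-precision reference; the remaining error comes only from rounding A and B to FP16
D_ref = A.astype(np.float64) @ B.astype(np.float64) + C
print(np.max(np.abs(D - D_ref)))
```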

SLIDE 5

Dense Linear Algebra (DLA) is needed in a wide variety of science and engineering applications:

  • Linear systems: solve Ax = b
    – Computational electromagnetics, material science, applications using boundary integral equations, airflow past wings, fluid flow around ships and other offshore constructions, and many more
  • Least squares: find x to minimize || Ax – b ||
    – Computational statistics (e.g., linear least squares or ordinary least squares), econometrics, control theory, signal processing, curve fitting, and many more
  • Eigenproblems: solve Ax = λx
    – Computational chemistry, quantum mechanics, material science, face recognition, PCA, data mining, marketing, Google PageRank, spectral clustering, vibrational analysis, compression, and many more
  • SVD: A = UΣV* (Av = σu and A*u = σv)
    – Information retrieval, web search, signal processing, big-data analytics, low-rank matrix approximation, total least squares minimization, pseudo-inverse, and many more
  • Many variations depending on the structure of A
    – A can be symmetric, positive definite, tridiagonal, Hessenberg, banded, sparse with dense blocks, etc.
  • DLA is crucial to the development of sparse solvers

SLIDE 6

Leveraging Half Precision in HPC on V100

Matrix-matrix multiplication (GEMM)

  • dgemm achieves about 6.4 Tflop/s

[Figure: FP64 GEMM performance (Tflop/s) vs. matrix size (2k to 30k). Study of the matrix-matrix multiplication kernel on the Nvidia V100.]

SLIDE 7

Leveraging Half Precision in HPC on V100

Matrix-matrix multiplication (GEMM)

  • dgemm achieves about 6.4 Tflop/s
  • sgemm achieves about 14 Tflop/s (~2x over FP64)

[Figure: FP32 and FP64 GEMM performance (Tflop/s) vs. matrix size (2k to 30k). Study of the matrix-matrix multiplication kernel on the Nvidia V100.]

SLIDE 8

Leveraging Half Precision in HPC on V100

Matrix-matrix multiplication (GEMM)

  • dgemm achieves about 6.4 Tflop/s
  • sgemm achieves about 14 Tflop/s
  • hgemm achieves about 27 Tflop/s (~4x over FP64)

[Figure: FP16, FP32, and FP64 GEMM performance (Tflop/s) vs. matrix size (2k to 30k). Study of the matrix-matrix multiplication kernel on the Nvidia V100.]

SLIDE 9

Leveraging Half Precision in HPC on V100

Matrix-matrix multiplication (GEMM)

  • dgemm achieves about 6.4 Tflop/s
  • sgemm achieves about 14 Tflop/s
  • hgemm achieves about 27 Tflop/s
  • Tensor Core gemm reaches about 85 Tflop/s (~12x over FP64)

[Figure: Tensor Core FP16, FP16, FP32, and FP64 GEMM performance (Tflop/s) vs. matrix size (2k to 30k). Study of the matrix-matrix multiplication kernel on the Nvidia V100.]


SLIDE 11

Leveraging Half Precision in HPC on V100

  • In the LU factorization the operation needed is not a square matrix multiply but a rank-k update computing the Schur complement

[Figure: rank-k update performance (Tflop/s) vs. matrix size m = n (2k to 30k) for FP16-TC, FP16, FP32, and FP64, each for the square GEMM and for k = 256, on the Nvidia V100.]

Study of the rank-k update used by the LU factorization algorithm on the Nvidia V100: the rank-k GEMM needed by LU does not perform as well as the square GEMM, but it is still OK.

SLIDE 12

Leveraging Half Precision in HPC on V100: solving a linear system Ax = b

  • LU factorization is used to solve a linear system Ax = b: factor A = LU so that LUx = b, then solve Ly = b followed by Ux = y
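As a concrete illustration of the factor-then-two-triangular-solves structure (a minimal SciPy sketch, not the GPU code used in the talk):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 1000
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

# Factor A = P L U once (O(n^3)), then solve the two triangular systems (O(n^2))
lu, piv = lu_factor(A)
x = lu_solve((lu, piv), b)

print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))  # relative residual
```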

SLIDE 13

Leveraging Half Precision in HPC on V100: solving a linear system Ax = b

Blocked LU factorization with block size nb (a NumPy sketch of this loop follows the list below):

For s = 0, nb, ..., N:
  1. factorize the panel
  2. update the trailing matrix (TRSM + GEMM)

LU factorization requires O(n^3) operations, and most of them are spent in GEMM. The building blocks are:

  • Panel Factorization
  • TRSM - Triangular solve
  • GEMM – Matrix Multiply
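A minimal NumPy/SciPy sketch of this blocked right-looking loop, with pivoting omitted for clarity (an illustration of the structure, not the MAGMA implementation):

```python
import numpy as np
from scipy.linalg import solve_triangular

def blocked_lu(A, nb=128):
    """Right-looking blocked LU (no pivoting, illustrative only)."""
    A = A.copy()
    n = A.shape[0]
    for s in range(0, n, nb):
        e = min(s + nb, n)
        # 1. panel factorization: unblocked LU of the tall panel A[s:, s:e]
        for k in range(s, e):
            A[k+1:, k] /= A[k, k]
            A[k+1:, k+1:e] -= np.outer(A[k+1:, k], A[k, k+1:e])
        if e < n:
            # 2a. TRSM: solve L11 * U12 = A12 for U12
            A[s:e, e:] = solve_triangular(A[s:e, s:e], A[s:e, e:],
                                          lower=True, unit_diagonal=True)
            # 2b. GEMM: Schur complement A22 -= L21 * U12 (where most flops are spent)
            A[e:, e:] -= A[e:, s:e] @ A[s:e, e:]
    return A  # L (unit lower) and U packed in one array

# Quick check: L @ U should reproduce A for a matrix that needs no pivoting
n = 512
M = np.random.default_rng(1).standard_normal((n, n)) + n * np.eye(n)
F = blocked_lu(M)
L = np.tril(F, -1) + np.eye(n)
U = np.triu(F)
print(np.linalg.norm(L @ U - M) / np.linalg.norm(M))
```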
SLIDE 14

Leveraging Half Precision in HPC on V100

  • LU factorization is used to solve a linear system Ax = b: A = LU, LUx = b, then Ly = b followed by Ux = y

[Figure: LU factorization performance (Tflop/s) vs. matrix size (2k to 30k) for Tensor Core FP16 hgetrf, FP16 hgetrf, FP32 sgetrf, and FP64 dgetrf on the Nvidia V100; the Tensor Core FP16 factorization is roughly 3-4x faster than FP64.]

Study of the LU factorization algorithm on the Nvidia V100.

SLIDE 15

Leveraging Half Precision in HPC on V100: solving a linear system Ax = b

Blocked LU factorization with block size nb:

For s = 0, nb, ..., N:
  1. factorize the panel
  2. update the trailing matrix (TRSM + GEMM)

  • Panel factorization performed in 32-bit floating point
  • Done using MAGMA on the front-end system
  • TRSM (triangular solve) performed in 32-bit floating point
  • Done on the V100 (no Tensor Cores)
  • GEMM (matrix multiply) performed in 16-bit floating point
  • Done on the V100 with Tensor Cores

Most of the performance comes from the GEMM in 16-bit floating point (a sketch of this precision split is shown below).
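To make the precision split concrete, here is a variant of the earlier blocked-LU sketch in which only the trailing-matrix GEMM uses FP16 inputs with FP32 accumulation, emulating the Tensor Core path (an illustrative sketch under those assumptions, not the MAGMA implementation; pivoting is again omitted):

```python
import numpy as np
from scipy.linalg import solve_triangular

def blocked_lu_mixed(A, nb=128):
    """Blocked LU where the panel and TRSM stay in FP32 and only the
    Schur-complement GEMM uses FP16 inputs accumulated in FP32,
    mimicking the FP32 / FP32 / FP16-TC split described above."""
    A = A.astype(np.float32).copy()
    n = A.shape[0]
    for s in range(0, n, nb):
        e = min(s + nb, n)
        # Panel factorization in FP32
        for k in range(s, e):
            A[k+1:, k] /= A[k, k]
            A[k+1:, k+1:e] -= np.outer(A[k+1:, k], A[k, k+1:e])
        if e < n:
            # TRSM in FP32
            A[s:e, e:] = solve_triangular(A[s:e, s:e], A[s:e, e:],
                                          lower=True, unit_diagonal=True)
            # GEMM: round operands to FP16, multiply-accumulate in FP32
            L21 = A[e:, s:e].astype(np.float16).astype(np.float32)
            U12 = A[s:e, e:].astype(np.float16).astype(np.float32)
            A[e:, e:] -= L21 @ U12
    return A
```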

SLIDE 16

Use Mixed Precision Algorithms

  • Achieve higher performance → faster time to solution
  • Reduce power consumption by decreasing the execution time → energy savings!

Leveraging Half Precision in HPC on V100

Reference:

  • A. Haidar, P. Wu, S. Tomov, J. Dongarra, "Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers," ScalA17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (SC17), ACM, Denver, Colorado, November 12-17, 2017.
  • A. Haidar, S. Tomov, J. Dongarra, and N. J. Higham, "Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers," SC18, Dallas, TX, IEEE, November 2018.

SLIDE 17

Iterative refinement for dense systems, Ax = b, can work this way:

  L, U = lu(A)                 lower precision   O(n^3)
  x = U \ (L \ b)              lower precision   O(n^2)
  r = b – Ax                   FP64 precision    O(n^2)
  WHILE || r || not small enough
    1. find a correction z to adjust x that satisfies Az = r;
       solving Az = r could be done by either:
         • z = U \ (L \ r)                          (classical iterative refinement)      lower precision   O(n^2)
         • GMRES preconditioned by the LU factors   (iterative refinement using GMRES)    lower precision   O(n^2)
    2. x = x + z               FP64 precision    O(n^1)
    3. r = b – Ax              FP64 precision    O(n^2)
  END

Idea: use low precision to compute the expensive flops (the O(n^3) LU factorization) and then iteratively refine the solution in order to achieve FP64 accuracy (a small NumPy sketch of the classical scheme follows).
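A minimal NumPy/SciPy sketch of the classical scheme above, factoring in FP32 and refining in FP64 (illustrative only; the talk's solvers use FP16/FP16-TC on the GPU):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def ir_solve(A, b, tol=1e-12, max_iter=30):
    """Classical iterative refinement: factor in FP32, refine in FP64."""
    lu, piv = lu_factor(A.astype(np.float32))                 # O(n^3) in low precision
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                                         # residual in FP64, O(n^2)
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        z = lu_solve((lu, piv), r.astype(np.float32))         # correction in low precision
        x = x + z.astype(np.float64)                          # update in FP64
    return x

rng = np.random.default_rng(0)
n = 2000
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)
x = ir_solve(A, b)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))          # FP64-level residual
```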

  • Wilkinson, Moler, Stewart, and Higham provide error bounds for single-precision floating-point results when double-precision floating point is used.
  • It can be shown that, using this approach, we can compute the solution to 64-bit floating-point accuracy.
  • We need the original matrix to compute the residual r, and the matrix cannot be too badly conditioned.

Leveraging Half Precision in HPC on V100

Carson and Higham showed that the inner problem can be solved with an iterative method without contaminating the solution.

  • E. Carson and N. J. Higham, "Accelerating the Solution of Linear Systems by Iterative Refinement in Three Precisions," SIAM J. Sci. Comput., 40(2), A817–A847, 2018.

SLIDE 18

Improving Solution

  • z is the correction, i.e., x_(i+1) – x_i
  • z is computed in lower precision and then added to the approximate solution in higher precision: x_(i+1) = x_i + z
  • Can be used in situations like this: x_i + z → x_(i+1)

SLIDE 19

Recent Results Run at Scale…

  • The mixed precision iterative refinement approach solved a matrix of order 10,091,520 on ORNL's Summit system.
    – Summit nodes are composed of 2 IBM Power9 processors (22 cores each) plus 6 Nvidia V100 GPUs (84 SMs each)
    – The run used 4,500 nodes of Summit: 2,466,000 cores = 4500 * (22*2 + 84*6)
    – Used a random matrix with large diagonal elements to ensure convergence of the method
  • Mixed precision HPL achieved 445 PFLOPS, or 2.95x over the double-precision HPL result on the TOP500 (148 PFLOPS).
    – 43 Gflops/Watt
  • Same accuracy compared to full 64-bit precision.
SLIDE 20

Questions Asked, and Answers

  • Am I guaranteed the stability, accuracy, and convergence properties when using lower precision?
    – Maybe; it depends on the condition of the matrix. The algorithm needs one digit of accuracy in the approximation, and then it will converge to full accuracy.
  • What memory and performance improvements can I expect when using lower precision?
    – The cost is 1.25 times the memory, with a potential factor-of-4 improvement in time to solution.
  • What implementation challenges exist for application and enabling-technology developers?
    – It can be put into applications now.

SLIDE 21

Conclusion:

  • We accelerated the solution of the linear system Ax = b using hardware-accelerated FP16 arithmetic on GPUs.
  • We introduced a framework for exploiting mixed-precision FP16-FP32/FP64 iterative refinement solvers and described the path to high-performance and energy-aware GPU implementations.
  • The ideas can be applied to other one-sided reductions (LU, LL^T, LDL^T, QR) and also to two-sided reductions in the case of eigenvalues/eigenvectors. We are building this into the SLATE LA library (part of ECP).
  • Our technique shows that a number of problems can be accelerated up to 4x by the use of the FP16-TC, or 2x using FP32 arithmetic.
  • We have rigorous error analysis to support everything.
  • It potentially provides an additional benchmark for ML supercomputers, looking at mixed-precision performance.

SLIDE 22

[Figure: solving Ax = b on the Nvidia V100 (CPU: 10-core E5-2650 v3); time (s) vs. average CPU+GPU power (W). FP64 solver dgesv: 5.5 Tflop/s, 14 Gflops/Watt, 2021 Joules.]

Leveraging Half Precision in HPC Power awareness

CPU: Intel Xeon E5-2650 v3 (Haswell), 2 x 10 cores @ 2.30 GHz
GPU: Nvidia Volta V100, 80 multiprocessors x 64 cores @ 1.38 GHz
Power is measured for GPU + CPU + DRAM.

The problem is generated with an arithmetic distribution of the singular values and positive eigenvalues (matrix type (e)): σ_i = 1 − ((i − 1)/(n − 1)) (1 − 1/cond).
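One way to generate such a test matrix (an illustrative sketch; the generator used for the experiments is not shown in the slides) is to place those singular values on the diagonal and apply a random orthogonal similarity transform:

```python
import numpy as np
from scipy.stats import ortho_group

def arithmetic_svd_matrix(n, cond, seed=0):
    """SPD test matrix with sigma_i = 1 - ((i-1)/(n-1)) * (1 - 1/cond)."""
    i = np.arange(1, n + 1)
    sigma = 1.0 - ((i - 1) / (n - 1)) * (1.0 - 1.0 / cond)
    Q = ortho_group.rvs(n, random_state=seed)   # random orthogonal matrix
    return (Q * sigma) @ Q.T                    # Q diag(sigma) Q^T: SPD, cond(A) = cond

A = arithmetic_svd_matrix(500, cond=1e4)
print(np.linalg.cond(A))   # ~1e4
```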

  • Power consumption of the FP64 algorithm to solve Ax = b for a matrix of size 34K: it achieves 5.5 Tflop/s, requires about 2021 Joules, and provides about 14 Gflops/Watt.

SLIDE 23

Leveraging Half Precision in HPC Power awareness

The problem is generated with an arithmetic distribution of the singular values and positive eigenvalues (matrix type (e)): σ_i = 1 − ((i − 1)/(n − 1)) (1 − 1/cond).

  • Power consumption of the FP64 algorithm to solve Ax = b for a matrix of size 34K: it achieves 5.5 Tflop/s, requires about 2021 Joules, and provides about 14 Gflops/Watt.
  • Power consumption of the mixed precision FP32→FP64 algorithm to solve Ax = b for a matrix of size 34K: it achieves 10.7 Tflop/s, requires about 1041 Joules, and provides about 30 Gflops/Watt.

Mixed precision techniques can provide a large gain in energy efficiency

[Figure: solving Ax = b on the Nvidia V100 (CPU: 10-core E5-2650 v3); time (s) vs. average CPU+GPU power (W), with the iterative refinement phase marked. FP64 solver dgesv: 5.5 Tflop/s, 14 Gflops/Watt, 2021 Joules; FP32→FP64 solver dsgesv: 10.7 Tflop/s, 27 Gflops/Watt, 1041 Joules.]

SLIDE 24

Leveraging Half Precision in HPC Power awareness

The problem is generated with an arithmetic distribution of the singular values and positive eigenvalues (matrix type (e)): σ_i = 1 − ((i − 1)/(n − 1)) (1 − 1/cond).

  • Power consumption of the FP64 algorithm to solve Ax = b for a matrix of size 34K: it achieves 5.5 Tflop/s, requires about 2021 Joules, and provides about 14 Gflops/Watt.
  • Power consumption of the mixed precision FP32→FP64 algorithm to solve Ax = b for a matrix of size 34K: it achieves 10.7 Tflop/s, requires about 1041 Joules, and provides about 30 Gflops/Watt.
  • Power consumption of the mixed precision FP16→FP64 algorithm to solve Ax = b for a matrix of size 34K: it achieves 16.8 Tflop/s, requires about 609 Joules, and provides about 48 Gflops/Watt.

Mixed precision techniques can provide a large gain in energy efficiency

[Figure: solving Ax = b on the Nvidia V100 (CPU: 10-core E5-2650 v3); time (s) vs. average CPU+GPU power (W). FP64 solver dgesv: 5.5 Tflop/s, 14 Gflops/Watt, 2021 Joules; FP32→FP64 solver dsgesv: 10.7 Tflop/s, 27 Gflops/Watt, 1041 Joules; FP16→FP64 solver dhgesv: 16.8 Tflop/s, 48 Gflops/Watt, 609 Joules.]

SLIDE 25

Leveraging Half Precision in HPC Power awareness

The problem is generated with an arithmetic distribution of the singular values and positive eigenvalues (matrix type (e)): σ_i = 1 − ((i − 1)/(n − 1)) (1 − 1/cond).

  • Power consumption of the FP64 algorithm to solve Ax = b for a matrix of size 34K: it achieves 5.5 Tflop/s, requires about 2021 Joules, and provides about 14 Gflops/Watt.
  • Power consumption of the mixed precision FP32→FP64 algorithm to solve Ax = b for a matrix of size 34K: it achieves 10.7 Tflop/s, requires about 1041 Joules, and provides about 30 Gflops/Watt.
  • Power consumption of the mixed precision FP16→FP64 algorithm to solve Ax = b for a matrix of size 34K: it achieves 16.8 Tflop/s, requires about 609 Joules, and provides about 48 Gflops/Watt.
  • Power consumption of the mixed precision FP16-TC→FP64 algorithm (using Tensor Cores) to solve Ax = b for a matrix of size 34K: it achieves 24 Tflop/s, requires about 470 Joules, and provides about 74 Gflops/Watt.

Mixed precision techniques can provide a large gain in energy efficiency

[Figure: solving Ax = b on the Nvidia V100 (CPU: 10-core E5-2650 v3); time (s) vs. average CPU+GPU power (W). FP64 solver dgesv: 5.5 Tflop/s, 14 Gflops/Watt, 2021 Joules; FP32→FP64 solver dsgesv: 10.7 Tflop/s, 27 Gflops/Watt, 1041 Joules; FP16→FP64 solver dhgesv: 16.8 Tflop/s, 48 Gflops/Watt, 609 Joules; FP16-TC→FP64 solver dhgesv (TC): 24.0 Tflop/s, 74 Gflops/Watt, 470 Joules.]

The FP16-TC solver reaches 74 Gflops/Watt.

SLIDE 26

Conclusion:

  • We accelerated the solution of the linear system Ax = b using hardware-accelerated FP16 arithmetic on GPUs.
  • We introduced a framework for exploiting mixed-precision FP16-FP32/FP64 iterative refinement solvers and described the path to high-performance and energy-aware GPU implementations.
  • The ideas can be applied to other one-sided reductions (LU, LL^T, LDL^T, QR) and also to two-sided reductions in the case of eigenvalues/eigenvectors. We are building this into the SLATE LA library (part of ECP).
  • Our technique shows that a number of problems can be accelerated up to 4x by the use of the FP16-TC, or 2x using FP32 arithmetic.
  • We studied the energy efficiency of our approach, which showed significant energy savings: 5x energy savings using the FP16-TC compared to the FP64 implementation.
  • We illustrated a technique using the V100 Tensor Cores (FP16-TC) that achieves FP64 accuracy at a highly efficient, accelerated rate of 74 Gflops/Watt and 24 Tflop/s.
  • We have rigorous error analysis to support everything.