

SLIDE 1

8/28/19

Experiments with Mixed Precision Algorithms in Linear Algebra

Jack Dongarra (UTK/ORNL/U Manchester), Azzam Haidar (Nvidia), Stan Tomov (UTK), Nick Higham (U of Manchester)

SLIDE 2

Mixed Precision

  • Today there are many precisions to deal with (IEEE Standard)
  • Note the limited number range of half precision (16-bit floating point): the largest IEEE float16 number is 65,504, whereas Google TPU's bfloat16 keeps the IEEE single-precision range, with a largest number of O(10^38)
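A quick NumPy check of these ranges (an illustrative aside, not from the original slide):

```python
import numpy as np

# Largest finite values for the formats discussed on this slide.
print(np.finfo(np.float16).max)   # 65504.0  -- half precision overflows quickly
print(np.finfo(np.float32).max)   # ~3.4e38  -- bfloat16 shares this exponent range

# Overflow in half precision: 300 * 300 = 90000 > 65504
a = np.float16(300.0)
print(a * a)                          # inf in float16
print(np.float32(a) * np.float32(a))  # 90000.0 in single precision
```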

SLIDE 3

Nvidia Volta Peak Rates

  • Four performance levels for the different precisions:
  • 64-bit floating point (FMA): 7.5 Tflop/s
  • 32-bit floating point (FMA): 15 Tflop/s
  • 16-bit floating point (FMA): 30 Tflop/s
  • 16-bit floating point with Tensor Cores: 120 Tflop/s
  • The numerical characteristics of arithmetic on the Tensor Cores are different

Tensor Core performance comes from mixed-precision multiplication of 4x4 matrices.

SLIDE 4


4x4 matrix multiply: 32-bit floating-point accuracy with 16-bit inputs (A and B are stored in FP16, the products are accumulated in FP32)
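A rough way to emulate that behavior in NumPy (an illustrative sketch of the numerics, not how the hardware is programmed) is to cast the FP16 inputs up to FP32 before multiplying and accumulating:

```python
import numpy as np

rng = np.random.default_rng(0)
# 4x4 operands stored in half precision, as they would be fed to a Tensor Core
A = rng.standard_normal((4, 4)).astype(np.float16)
B = rng.standard_normal((4, 4)).astype(np.float16)
C = np.zeros((4, 4), dtype=np.float32)

# Tensor-Core-style result: FP16 inputs, products and sums carried in FP32
D = A.astype(np.float32) @ B.astype(np.float32) + C

# Double-precision reference; the remaining error comes only from rounding A and B to FP16
D_ref = A.astype(np.float64) @ B.astype(np.float64) + C
print(np.max(np.abs(D - D_ref)))
```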

SLIDE 5

Dense Linear Algebra (DLA) is needed in a wide variety of science and engineering applications:

  • Linear systems: solve Ax = b
    – Computational electromagnetics, material science, applications using boundary integral equations, airflow past wings, fluid flow around ships and other offshore constructions, and many more
  • Least squares: find x to minimize || Ax – b ||
    – Computational statistics (e.g., linear least squares or ordinary least squares), econometrics, control theory, signal processing, curve fitting, and many more
  • Eigenproblems: solve Ax = λx
    – Computational chemistry, quantum mechanics, material science, face recognition, PCA, data mining, marketing, Google PageRank, spectral clustering, vibrational analysis, compression, and many more
  • SVD: A = UΣV* (Av = σu and A*u = σv)
    – Information retrieval, web search, signal processing, big-data analytics, low-rank matrix approximation, total least squares minimization, pseudo-inverse, and many more
  • Many variations depending on the structure of A
    – A can be symmetric, positive definite, tridiagonal, Hessenberg, banded, sparse with dense blocks, etc.
  • DLA is crucial to the development of sparse solvers

SLIDE 6

Leveraging Half Precision in HPC on V100

Matrix-matrix multiplication (GEMM)

  • dgemm achieves about 6.4 Tflop/s

[Figure: FP64 GEMM performance (Tflop/s) vs. matrix size (2k to 30k). Study of the matrix-matrix multiplication kernel on the Nvidia V100.]

SLIDE 7

Leveraging Half Precision in HPC on V100

Matrix-matrix multiplication (GEMM)

  • dgemm achieves about 6.4 Tflop/s
  • sgemm achieves about 14 Tflop/s (~2x over FP64)

[Figure: FP32 and FP64 GEMM performance (Tflop/s) vs. matrix size (2k to 30k). Study of the matrix-matrix multiplication kernel on the Nvidia V100.]

SLIDE 8

Leveraging Half Precision in HPC on V100

Matrix-matrix multiplication (GEMM)

  • dgemm achieves about 6.4 Tflop/s
  • sgemm achieves about 14 Tflop/s
  • hgemm achieves about 27 Tflop/s (~4x over FP64)

[Figure: FP16, FP32, and FP64 GEMM performance (Tflop/s) vs. matrix size (2k to 30k). Study of the matrix-matrix multiplication kernel on the Nvidia V100.]

SLIDE 9

Leveraging Half Precision in HPC on V100

Matrix-matrix multiplication (GEMM)

  • dgemm achieves about 6.4 Tflop/s
  • sgemm achieves about 14 Tflop/s
  • hgemm achieves about 27 Tflop/s
  • Tensor Core gemm reaches about 85 Tflop/s (~12x over FP64)

[Figure: Tensor Core FP16, FP16, FP32, and FP64 GEMM performance (Tflop/s) vs. matrix size (2k to 30k). Study of the matrix-matrix multiplication kernel on the Nvidia V100.]


SLIDE 11

Leveraging Half Precision in HPC on V100

  • In the LU factorization the operation needed is not a square matrix multiply but a rank-k update computing the Schur complement

[Figure: rank-k update performance (Tflop/s) vs. matrix size m = n (2k to 30k) for FP16-TC, FP16, FP32, and FP64, each for the square GEMM and for k = 256, on the Nvidia V100.]

Study of the rank-k update used by the LU factorization algorithm on the Nvidia V100: the rank-k GEMM needed by LU does not perform as well as the square GEMM, but it is still OK.

SLIDE 12

Leveraging Half Precision in HPC on V100: solving a linear system Ax = b

  • LU factorization is used to solve a linear system Ax = b: factor A = LU so that LUx = b, then solve Ly = b followed by Ux = y
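As a concrete illustration of the factor-then-two-triangular-solves structure (a minimal SciPy sketch, not the GPU code used in the talk):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 1000
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

# Factor A = P L U once (O(n^3)), then solve the two triangular systems (O(n^2))
lu, piv = lu_factor(A)
x = lu_solve((lu, piv), b)

print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))  # relative residual
```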

SLIDE 13

Leveraging Half Precision in HPC on V100: solving a linear system Ax = b

Blocked LU factorization with block size nb (a NumPy sketch of this loop follows the list below):

For s = 0, nb, ..., N:
  1. factorize the panel
  2. update the trailing matrix (TRSM + GEMM)

LU factorization requires O(n^3) operations, and most of them are spent in GEMM. The building blocks are:

  • Panel Factorization
  • TRSM - Triangular solve
  • GEMM – Matrix Multiply
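A minimal NumPy/SciPy sketch of this blocked right-looking loop, with pivoting omitted for clarity (an illustration of the structure, not the MAGMA implementation):

```python
import numpy as np
from scipy.linalg import solve_triangular

def blocked_lu(A, nb=128):
    """Right-looking blocked LU (no pivoting, illustrative only)."""
    A = A.copy()
    n = A.shape[0]
    for s in range(0, n, nb):
        e = min(s + nb, n)
        # 1. panel factorization: unblocked LU of the tall panel A[s:, s:e]
        for k in range(s, e):
            A[k+1:, k] /= A[k, k]
            A[k+1:, k+1:e] -= np.outer(A[k+1:, k], A[k, k+1:e])
        if e < n:
            # 2a. TRSM: solve L11 * U12 = A12 for U12
            A[s:e, e:] = solve_triangular(A[s:e, s:e], A[s:e, e:],
                                          lower=True, unit_diagonal=True)
            # 2b. GEMM: Schur complement A22 -= L21 * U12 (where most flops are spent)
            A[e:, e:] -= A[e:, s:e] @ A[s:e, e:]
    return A  # L (unit lower) and U packed in one array

# Quick check: L @ U should reproduce A for a matrix that needs no pivoting
n = 512
M = np.random.default_rng(1).standard_normal((n, n)) + n * np.eye(n)
F = blocked_lu(M)
L = np.tril(F, -1) + np.eye(n)
U = np.triu(F)
print(np.linalg.norm(L @ U - M) / np.linalg.norm(M))
```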
SLIDE 14

Leveraging Half Precision in HPC on V100

  • LU factorization is used to solve a linear system Ax = b: A = LU, LUx = b, then Ly = b followed by Ux = y

[Figure: LU factorization performance (Tflop/s) vs. matrix size (2k to 30k) for Tensor Core FP16 hgetrf, FP16 hgetrf, FP32 sgetrf, and FP64 dgetrf on the Nvidia V100; the Tensor Core FP16 factorization is roughly 3-4x faster than FP64.]

Study of the LU factorization algorithm on the Nvidia V100.

SLIDE 15

Leveraging Half Precision in HPC on V100: solving a linear system Ax = b

Blocked LU factorization with block size nb:

For s = 0, nb, ..., N:
  1. factorize the panel
  2. update the trailing matrix (TRSM + GEMM)

  • Panel factorization performed in 32-bit floating point
  • Done using MAGMA on the front-end system
  • TRSM (triangular solve) performed in 32-bit floating point
  • Done on the V100 (no Tensor Cores)
  • GEMM (matrix multiply) performed in 16-bit floating point
  • Done on the V100 with Tensor Cores

Most of the performance comes from the GEMM in 16-bit floating point (a sketch of this precision split is shown below).
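To make the precision split concrete, here is a variant of the earlier blocked-LU sketch in which only the trailing-matrix GEMM uses FP16 inputs with FP32 accumulation, emulating the Tensor Core path (an illustrative sketch under those assumptions, not the MAGMA implementation; pivoting is again omitted):

```python
import numpy as np
from scipy.linalg import solve_triangular

def blocked_lu_mixed(A, nb=128):
    """Blocked LU where the panel and TRSM stay in FP32 and only the
    Schur-complement GEMM uses FP16 inputs accumulated in FP32,
    mimicking the FP32 / FP32 / FP16-TC split described above."""
    A = A.astype(np.float32).copy()
    n = A.shape[0]
    for s in range(0, n, nb):
        e = min(s + nb, n)
        # Panel factorization in FP32
        for k in range(s, e):
            A[k+1:, k] /= A[k, k]
            A[k+1:, k+1:e] -= np.outer(A[k+1:, k], A[k, k+1:e])
        if e < n:
            # TRSM in FP32
            A[s:e, e:] = solve_triangular(A[s:e, s:e], A[s:e, e:],
                                          lower=True, unit_diagonal=True)
            # GEMM: round operands to FP16, multiply-accumulate in FP32
            L21 = A[e:, s:e].astype(np.float16).astype(np.float32)
            U12 = A[s:e, e:].astype(np.float16).astype(np.float32)
            A[e:, e:] -= L21 @ U12
    return A
```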

SLIDE 16

Use Mixed Precision Algorithms

  • Achieve higher performance → faster time to solution
  • Reduce power consumption by decreasing the execution time → energy savings!

Leveraging Half Precision in HPC on V100

Reference:

  • A. Haidar, P. Wu, S. Tomov, J. Dongarra, "Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers," ScalA17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (SC17), ACM, Denver, Colorado, November 12-17, 2017.
  • A. Haidar, S. Tomov, J. Dongarra, and N. J. Higham, "Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers," SC18, Dallas, TX, IEEE, November 2018.

SLIDE 17

Iterative refinement for dense systems, Ax = b, can work this way:

  L, U = lu(A)                 lower precision   O(n^3)
  x = U \ (L \ b)              lower precision   O(n^2)
  r = b – Ax                   FP64 precision    O(n^2)
  WHILE || r || not small enough
    1. find a correction z to adjust x that satisfies Az = r;
       solving Az = r could be done by either:
         • z = U \ (L \ r)                          (classical iterative refinement)      lower precision   O(n^2)
         • GMRES preconditioned by the LU factors   (iterative refinement using GMRES)    lower precision   O(n^2)
    2. x = x + z               FP64 precision    O(n^1)
    3. r = b – Ax              FP64 precision    O(n^2)
  END

Idea: use low precision to compute the expensive flops (the O(n^3) LU factorization) and then iteratively refine the solution in order to achieve FP64 accuracy (a small NumPy sketch of the classical scheme follows).
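A minimal NumPy/SciPy sketch of the classical scheme above, factoring in FP32 and refining in FP64 (illustrative only; the talk's solvers use FP16/FP16-TC on the GPU):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def ir_solve(A, b, tol=1e-12, max_iter=30):
    """Classical iterative refinement: factor in FP32, refine in FP64."""
    lu, piv = lu_factor(A.astype(np.float32))                 # O(n^3) in low precision
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                                         # residual in FP64, O(n^2)
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        z = lu_solve((lu, piv), r.astype(np.float32))         # correction in low precision
        x = x + z.astype(np.float64)                          # update in FP64
    return x

rng = np.random.default_rng(0)
n = 2000
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)
x = ir_solve(A, b)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))          # FP64-level residual
```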

  • Wilkinson, Moler, Stewart, and Higham provide error bounds for single-precision floating-point results when double-precision floating point is used.
  • It can be shown that, using this approach, we can compute the solution to 64-bit floating-point accuracy.
  • We need the original matrix to compute the residual r, and the matrix cannot be too badly conditioned.

Leveraging Half Precision in HPC on V100

Carson and Higham showed that the inner problem can be solved with an iterative method without contaminating the solution.

  • E. Carson and N. J. Higham, "Accelerating the Solution of Linear Systems by Iterative Refinement in Three Precisions," SIAM J. Sci. Comput., 40(2), A817–A847, 2018.

SLIDE 18

Improving Solution

  • z is the correction, i.e., x_(i+1) – x_i
  • z is computed in lower precision and then added to the approximate solution in higher precision: x_(i+1) = x_i + z
  • Can be used in situations like this: x_i + z → x_(i+1)

SLIDE 19

Recent Results Run at Scale…

  • The mixed precision iterative refinement approach solved a matrix of order 10,091,520 on ORNL's Summit system.
    – Summit nodes are composed of 2 IBM Power9 processors (22 cores each) plus 6 Nvidia V100 GPUs (84 SMs each)
    – The run used 4,500 nodes of Summit: 2,466,000 cores = 4500 * (22*2 + 84*6)
    – Used a random matrix with large diagonal elements to ensure convergence of the method
  • Mixed precision HPL achieved 445 PFLOPS, or 2.95x over the double-precision HPL result on the TOP500 (148 PFLOPS).
    – 43 Gflops/Watt
  • Same accuracy compared to full 64-bit precision.
SLIDE 20

Questions Asked, and Answers

  • Am I guaranteed the stability, accuracy, and convergence properties when using lower precision?
    – Maybe; it depends on the condition of the matrix. The algorithm needs one digit of accuracy in the approximation, and then it will converge to full accuracy.
  • What memory and performance improvements can I expect when using lower precision?
    – The cost is 1.25 times the memory, with a potential factor-of-4 improvement in time to solution.
  • What implementation challenges exist for application and enabling-technology developers?
    – It can be put into applications now.

SLIDE 21

Conclusion:

  • We accelerated the solution of the linear system Ax = b using hardware-accelerated FP16 arithmetic on GPUs.
  • We introduced a framework for exploiting mixed-precision FP16-FP32/FP64 iterative refinement solvers and described the path to high-performance and energy-aware GPU implementations.
  • The ideas can be applied to other one-sided reductions (LU, LL^T, LDL^T, QR) and also to two-sided reductions in the case of eigenvalues/eigenvectors. We are building this into the SLATE LA library (part of ECP).
  • Our technique shows that a number of problems can be accelerated up to 4x by the use of the FP16-TC, or 2x using FP32 arithmetic.
  • We have rigorous error analysis to support everything.
  • It potentially provides an additional benchmark for ML supercomputers, looking at mixed-precision performance.

SLIDE 22

[Figure: solving Ax = b on the Nvidia V100 (CPU: 10-core E5-2650 v3); time (s) vs. average CPU+GPU power (W). FP64 solver dgesv: 5.5 Tflop/s, 14 Gflops/Watt, 2021 Joules.]

Leveraging Half Precision in HPC Power awareness

CPU: Intel Xeon E5-2650 v3 (Haswell), 2 x 10 cores @ 2.30 GHz
GPU: Nvidia Volta V100, 80 multiprocessors x 64 cores @ 1.38 GHz
Power is measured for GPU + CPU + DRAM.

The problem is generated with an arithmetic distribution of the singular values and positive eigenvalues (matrix type (e)): σ_i = 1 − ((i − 1)/(n − 1)) (1 − 1/cond).
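One way to generate such a test matrix (an illustrative sketch; the generator used for the experiments is not shown in the slides) is to place those singular values on the diagonal and apply a random orthogonal similarity transform:

```python
import numpy as np
from scipy.stats import ortho_group

def arithmetic_svd_matrix(n, cond, seed=0):
    """SPD test matrix with sigma_i = 1 - ((i-1)/(n-1)) * (1 - 1/cond)."""
    i = np.arange(1, n + 1)
    sigma = 1.0 - ((i - 1) / (n - 1)) * (1.0 - 1.0 / cond)
    Q = ortho_group.rvs(n, random_state=seed)   # random orthogonal matrix
    return (Q * sigma) @ Q.T                    # Q diag(sigma) Q^T: SPD, cond(A) = cond

A = arithmetic_svd_matrix(500, cond=1e4)
print(np.linalg.cond(A))   # ~1e4
```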

  • Power consumption of the FP64 algorithm to solve Ax = b for a matrix of size 34K: it achieves 5.5 Tflop/s, requires about 2021 Joules, and provides about 14 Gflops/Watt.

SLIDE 23

Leveraging Half Precision in HPC Power awareness

The problem is generated with an arithmetic distribution of the singular values and positive eigenvalues (matrix type (e)): σ_i = 1 − ((i − 1)/(n − 1)) (1 − 1/cond).

  • Power consumption of the FP64 algorithm to solve Ax = b for a matrix of size 34K: it achieves 5.5 Tflop/s, requires about 2021 Joules, and provides about 14 Gflops/Watt.
  • Power consumption of the mixed precision FP32→FP64 algorithm to solve Ax = b for a matrix of size 34K: it achieves 10.7 Tflop/s, requires about 1041 Joules, and provides about 30 Gflops/Watt.

Mixed precision techniques can provide a large gain in energy efficiency

[Figure: solving Ax = b on the Nvidia V100 (CPU: 10-core E5-2650 v3); time (s) vs. average CPU+GPU power (W), with the iterative refinement phase marked. FP64 solver dgesv: 5.5 Tflop/s, 14 Gflops/Watt, 2021 Joules; FP32→FP64 solver dsgesv: 10.7 Tflop/s, 27 Gflops/Watt, 1041 Joules.]

SLIDE 24

Leveraging Half Precision in HPC Power awareness

The problem is generated with an arithmetic distribution of the singular values and positive eigenvalues (matrix type (e)): σ_i = 1 − ((i − 1)/(n − 1)) (1 − 1/cond).

  • Power consumption of the FP64 algorithm to solve Ax = b for a matrix of size 34K: it achieves 5.5 Tflop/s, requires about 2021 Joules, and provides about 14 Gflops/Watt.
  • Power consumption of the mixed precision FP32→FP64 algorithm to solve Ax = b for a matrix of size 34K: it achieves 10.7 Tflop/s, requires about 1041 Joules, and provides about 30 Gflops/Watt.
  • Power consumption of the mixed precision FP16→FP64 algorithm to solve Ax = b for a matrix of size 34K: it achieves 16.8 Tflop/s, requires about 609 Joules, and provides about 48 Gflops/Watt.

Mixed precision techniques can provide a large gain in energy efficiency

[Figure: solving Ax = b on the Nvidia V100 (CPU: 10-core E5-2650 v3); time (s) vs. average CPU+GPU power (W). FP64 solver dgesv: 5.5 Tflop/s, 14 Gflops/Watt, 2021 Joules; FP32→FP64 solver dsgesv: 10.7 Tflop/s, 27 Gflops/Watt, 1041 Joules; FP16→FP64 solver dhgesv: 16.8 Tflop/s, 48 Gflops/Watt, 609 Joules.]

SLIDE 25

Leveraging Half Precision in HPC Power awareness

The problem is generated with an arithmetic distribution of the singular values and positive eigenvalues (matrix type (e)): σ_i = 1 − ((i − 1)/(n − 1)) (1 − 1/cond).

  • Power consumption of the FP64 algorithm to solve Ax = b for a matrix of size 34K: it achieves 5.5 Tflop/s, requires about 2021 Joules, and provides about 14 Gflops/Watt.
  • Power consumption of the mixed precision FP32→FP64 algorithm to solve Ax = b for a matrix of size 34K: it achieves 10.7 Tflop/s, requires about 1041 Joules, and provides about 30 Gflops/Watt.
  • Power consumption of the mixed precision FP16→FP64 algorithm to solve Ax = b for a matrix of size 34K: it achieves 16.8 Tflop/s, requires about 609 Joules, and provides about 48 Gflops/Watt.
  • Power consumption of the mixed precision FP16-TC→FP64 algorithm (using Tensor Cores) to solve Ax = b for a matrix of size 34K: it achieves 24 Tflop/s, requires about 470 Joules, and provides about 74 Gflops/Watt.

Mixed precision techniques can provide a large gain in energy efficiency

[Figure: solving Ax = b on the Nvidia V100 (CPU: 10-core E5-2650 v3); time (s) vs. average CPU+GPU power (W). FP64 solver dgesv: 5.5 Tflop/s, 14 Gflops/Watt, 2021 Joules; FP32→FP64 solver dsgesv: 10.7 Tflop/s, 27 Gflops/Watt, 1041 Joules; FP16→FP64 solver dhgesv: 16.8 Tflop/s, 48 Gflops/Watt, 609 Joules; FP16-TC→FP64 solver dhgesv (TC): 24.0 Tflop/s, 74 Gflops/Watt, 470 Joules.]

The FP16-TC solver reaches 74 Gflops/Watt.

SLIDE 26

Conclusion:

  • We accelerated the solution of the linear system Ax = b using hardware-accelerated FP16 arithmetic on GPUs.
  • We introduced a framework for exploiting mixed-precision FP16-FP32/FP64 iterative refinement solvers and described the path to high-performance and energy-aware GPU implementations.
  • The ideas can be applied to other one-sided reductions (LU, LL^T, LDL^T, QR) and also to two-sided reductions in the case of eigenvalues/eigenvectors. We are building this into the SLATE LA library (part of ECP).
  • Our technique shows that a number of problems can be accelerated up to 4x by the use of the FP16-TC, or 2x using FP32 arithmetic.
  • We studied the energy efficiency of our approach, which showed significant energy savings: 5x energy savings using the FP16-TC compared to the FP64 implementation.
  • We illustrated a technique using the V100 Tensor Cores (FP16-TC) that achieves FP64 accuracy at a highly efficient, accelerated rate of 74 Gflops/Watt and 24 Tflop/s.
  • We have rigorous error analysis to support everything.