Tensor Core Performance and Precision
Josef Schüle, University Kaiserslautern, Germany, josef.schuele@rhrk.uni-kl.de
Why attend this Session?
Figure: Assumed learning curve, deviation from final values. x-axis: learning iterations; y-axis: deviation of weights and biases.
Blue: trend in FP32.
Red: range according to precision loss in FP16.
Green: possible behaviours in FP16 (stagnation, divergence).
Source: NVIDIA
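#include "cublas_v2.h"

cublasHandle_t handle = 0;
cublasStatus_t cublasStat = cublasCreate(&handle);
/* Allow cuBLAS to use Tensor Cores. */
cublasStat = cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);
/* FP16 inputs, FP32 output and FP32 accumulation.
   In the cuBLAS argument order, alpha scales the product and beta scales the existing C. */
cublasStat = cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                          &alpha, B, CUDA_R_16F, ldb,
                                  A, CUDA_R_16F, lda,
                          &beta,  C, CUDA_R_32F, ldc,
                          CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);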
N= 8192 -> 91 Tflops (of 120 Tflops Peak)
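The throughput figure can be related to a run time through the usual operation count for a dense matrix product. The following is a minimal sketch, assuming the common convention of 2·N³ floating-point operations for an N×N product (the slide does not state how the 91 Tflops were derived). The helper names are hypothetical.

```c
/* Hypothetical helper: rate in Tflop/s of an N x N matrix product,
   counting one multiply and one add per inner-product term (2*N^3). */
double gemm_tflops(double n, double seconds) {
    double flop = 2.0 * n * n * n;
    return flop / seconds / 1e12;
}

/* Hypothetical helper: time in milliseconds that one N x N product
   takes at a given rate in Tflop/s, under the same 2*N^3 assumption. */
double gemm_ms_at(double n, double tflops) {
    double flop = 2.0 * n * n * n;
    return flop / (tflops * 1e12) * 1e3;
}
```

Under these assumptions, N = 8192 at 91 Tflops corresponds to roughly 12 ms per product, which is about 76% of the quoted 120 Tflops peak.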
FP16: 1 sign bit, 5 exponent bits, 10 significand bits
FP32: 1 sign bit, 8 exponent bits, 23 significand bits
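The field widths determine the precision and range of each format. The sketch below derives a few characteristic numbers from the exponent and significand widths, assuming the standard IEEE 754 conventions (implicit leading bit, exponent bias of 2^(e-1) - 1). The type and function names are hypothetical.

```c
#include <math.h>

/* Hypothetical description of a binary floating-point format. */
typedef struct { int exp_bits; int frac_bits; } fp_format;

static const fp_format FP16 = {5, 10};
static const fp_format FP32 = {8, 23};

/* Spacing between 1.0 and the next larger representable value. */
double fp_epsilon(fp_format f) {
    return ldexp(1.0, -f.frac_bits);
}

/* Largest finite value: (2 - 2^-frac_bits) * 2^bias. */
double fp_max(fp_format f) {
    int bias = (1 << (f.exp_bits - 1)) - 1;
    return (2.0 - ldexp(1.0, -f.frac_bits)) * ldexp(1.0, bias);
}

/* Decimal digits carried by the significand, counting the implicit bit. */
double fp_decimal_digits(fp_format f) {
    return (f.frac_bits + 1) * log10(2.0);
}
```

For FP16 this gives a spacing of 2^-10 at 1.0, a largest finite value of 65504, and a little over 3 decimal digits; FP32 carries a little over 7 decimal digits.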
Source: NVIDIA
Figure: error (axis 0 to 0.2) versus matrix size (64 to 8192). Series: A, B in [-1, 1]; A in [-1, 1], B in [0, 1]; A in [-1, 1], B in [-4, 4].
Larger values, larger error.
Larger sizes, larger error.
Figure: error in digits (axis 3 to 4.5) versus matrix size (64 to 8192). Series: A, B in [-1, 1]; A in [-1, 1], B in [-4, 4].
0.984… 3.98…
1024 representable values for the interval [1024, 2048] in FP16, i.e. a spacing of 1
Figure: error (axis 0 to 0.05) versus matrix size (64 to 8192). Series: [-1, 1] * [-1, 1]; [-1, 1] * [0, 1]; full binomial, [-1, 1]; 3-term binomial.
Figure: error in digits (axis 3 to 7.5) versus matrix size (64 to 8192). Series: A, B in [-1, 1]; A in [-1, 1], B in [-4, 4]; 3-term binomial.
Figure: error (axis 0 to 2.5E-03) versus matrix size (64 to 8192). Series: binomial; Karatsuba; 3-term scaled binomial; scaled binomial.
Figure: time [ms] (axis 10 to 80) versus matrix size (1024 to 8192). Series: Float32; Mixed; Binomial; Scaled.
Figure: time [ms] (axis 10 to 80) versus matrix size (1024 to 8192). Series: Float32; Binomial; Scaled; WMMA_Binomi.
Figure: time [ms] (axis 50 to 300) versus matrix size (1024 to 8192). Series: Float32; Scaled; WMMA_Kara. 3Mults; WMMA_Kara. 4Mults.