TensorCore and Tensorization
Siyuan Feng, Dec 5, 2019
Contents
TensorCore Introduction
TensorCore Support in TVM
Future Work
TensorCore Introduction
WMMA operations are warp-level: all 32 threads of a warp cooperate to execute each call, e.g. wmma::fill_fragment(Cmat, 0.0f);
// Requires #include <mma.h>; wmma is nvcuda::wmma.
__device__ void tensor_op_16_16_16(float *d, half *a, half *b, float *c) {
    // Create fragments
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> Amat;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> Bmat;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> Cmat;

    // Load fragments
    wmma::load_matrix_sync(Amat, a, 16);
    wmma::load_matrix_sync(Bmat, b, 16);
    wmma::fill_fragment(Cmat, 0.0f);   // (c is unused; the accumulator is zero-initialized instead)

    // Perform MatMul
    wmma::mma_sync(Cmat, Amat, Bmat, Cmat);

    // Store results
    wmma::store_matrix_sync(d, Cmat, 16, wmma::row_major);
}
16x16x16 MatMul
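As a reference point for the TVM section that follows, the same kind of mixed-precision matmul (fp16 inputs, fp32 accumulation) can be declared with TVM's te API. The sketch below is illustrative and not from the slides: the names A/B/C/k and the 1024-sized shapes are assumptions, and the 16x16x16 shape above is the tile that a single wmma operation covers.

import tvm
from tvm import te

# Illustrative shapes; a single wmma op covers one 16x16x16 tile of this matmul.
N, M, L = 1024, 1024, 1024
A = te.placeholder((N, L), name="A", dtype="float16")
B = te.placeholder((L, M), name="B", dtype="float16")
k = te.reduce_axis((0, L), name="k")
C = te.compute(
    (N, M),
    lambda i, j: te.sum(A[i, k].astype("float32") * B[k, j].astype("float32"), axis=k),
    name="C",
)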
TensorCore Support in TVM
TensorCore support in TVM is added in three steps: 1. Memory Scope, 2. Create Schedule, 3. Tensorization.
Step 1: Memory Scope
In a normal GPU schedule, data is staged through the Global, Shared, and Local memory scopes.
For TensorCores, a new fragment scope is added alongside Global, Shared, and Local to hold the wmma fragments.
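A minimal sketch of this step in a TVM schedule, assuming the A/B/C matmul declared earlier; the scope strings are TVM's wmma fragment scopes ("wmma.matrix_a", "wmma.matrix_b", "wmma.accumulator"), while the particular staging choices are illustrative.

s = te.create_schedule(C.op)

# Stage the operands through shared memory, then into TensorCore fragments.
AS = s.cache_read(A, "shared", [C])
BS = s.cache_read(B, "shared", [C])
AF = s.cache_read(AS, "wmma.matrix_a", [C])
BF = s.cache_read(BS, "wmma.matrix_b", [C])
# Accumulate in fragment scope before writing the result back to global memory.
CF = s.cache_write(C, "wmma.accumulator")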
Step 2: Create Schedule
Thread organization: blockDim.x = warp_size = 32; warps are arranged along blockDim.y and blockDim.z (a 2x2 grid of warps per block in the diagram).
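A rough sketch of this thread organization using TVM schedule primitives, continuing from the earlier snippets (s, C); the 32x32 block tile and the 2x2 warp layout are illustrative assumptions.

block_x  = te.thread_axis("blockIdx.x")
block_y  = te.thread_axis("blockIdx.y")
thread_y = te.thread_axis("threadIdx.y")   # warps laid out along blockDim.y
thread_z = te.thread_axis("threadIdx.z")   # ... and along blockDim.z

i, j = s[C].op.axis
bi, ii = s[C].split(i, factor=32)          # 32x32 output tile per block
bj, jj = s[C].split(j, factor=32)
wi, ti = s[C].split(ii, factor=16)         # one 16x16 wmma tile per warp
wj, tj = s[C].split(jj, factor=16)
s[C].reorder(bi, bj, wi, wj, ti, tj)
s[C].bind(bi, block_x)
s[C].bind(bj, block_y)
s[C].bind(wi, thread_y)                    # blockDim.y = 2
s[C].bind(wj, thread_z)                    # blockDim.z = 2
# blockDim.x remains the 32 threads of each warp; they come into play when the
# fragment load/store/mma stages are lowered to wmma instructions.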
Step 3: Tensorization
The innermost 16x16x16 loop nest is replaced by a single TensorCore intrinsic:

for (i, 0, 16) {
  for (j, 0, 16) {
    for (k, 0, 16) {
      C[i*16 + j] = C[i*16 + j] + float32(A[i*16 + k]) * float32(B[k*16 + j])
    }
  }
}

becomes

tvm_mma_sync(C, 0, A, 0, B, 0, C, 0)
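The sketch below shows how this rewrite is expressed: a tensor intrinsic declares both the 16x16x16 computation and how to lower it to the TensorCore intrinsics, and tensorize() substitutes it for the loop nest. It is adapted from the pattern in TVM's TensorCore tutorial; exact intrinsic names, buffer parameters, and the tensorize call site vary with the TVM version and the surrounding schedule, so treat it as illustrative.

def intrin_wmma_gemm():
    n = 16
    A = te.placeholder((n, n), name="A", dtype="float16")
    B = te.placeholder((n, n), name="B", dtype="float16")
    k = te.reduce_axis((0, n), name="k")
    C = te.compute(
        (n, n),
        lambda ii, jj: te.sum(A[ii, k].astype("float32") * B[k, jj].astype("float32"), axis=k),
        name="C",
    )
    # Buffers live in the fragment scopes introduced in the Memory Scope step.
    BA = tvm.tir.decl_buffer(A.shape, A.dtype, scope="wmma.matrix_a", offset_factor=256)
    BB = tvm.tir.decl_buffer(B.shape, B.dtype, scope="wmma.matrix_b", offset_factor=256)
    BC = tvm.tir.decl_buffer(C.shape, C.dtype, scope="wmma.accumulator", offset_factor=256)

    def intrin_func(ins, outs):
        BA, BB = ins
        (BC,) = outs

        def init():
            ib = tvm.tir.ir_builder.create()
            ib.emit(tvm.tir.call_intrin("handle", "tir.tvm_fill_fragment",
                                        BC.data, n, n, n, BC.elem_offset // 256, 0.0))
            return ib.get()

        def update():
            ib = tvm.tir.ir_builder.create()
            ib.emit(tvm.tir.call_intrin("handle", "tir.tvm_mma_sync",
                                        BC.data, BC.elem_offset // 256,
                                        BA.data, BA.elem_offset // 256,
                                        BB.data, BB.elem_offset // 256,
                                        BC.data, BC.elem_offset // 256))
            return ib.get()

        # (whole body, reduction init, reduction update)
        return update(), init(), update()

    return te.decl_tensor_intrin(C.op, intrin_func, binds={A: BA, B: BB, C: BC})

# Applied to the (already tiled) inner 16x16 axes of the fragment compute stage, e.g.
# s[CF].tensorize(tile_i, intrin_wmma_gemm()), where tile_i is a hypothetical 16-wide axis.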
TVM w/ TensorCores vs. TVM w/o TensorCores (speedup, baseline = 1):
  Large MatMul   4.87x
  BatchConv      5.17x
  Small MatMul   5.02x
  BatchMatMul    4.97x
TVM w/ TensorCores vs. cuDNN w/ TensorCores (relative performance, cuDNN = 1):
  Large MatMul   0.76
  BatchConv      0.83
  Small MatMul   1.16
  BatchMatMul    1.44
Comparable on traditional workloads.
About 1.4x over cuDNN on emerging workloads (BERT).
Future Work