TensorCore and Tensorization

Siyuan Feng, Dec 5, 2019



  1. TensorCore and Tensorization. Siyuan Feng, Dec 5, 2019

  2. Contents: 1. TensorCore Introduction; 2. TensorCore Support in TVM; 3. Future Work

  3. Contents (transition to Section 1: TensorCore Introduction)

  4. What are TensorCores?

  5. Warp-Level Operation: a warp of 32 threads executes each wmma operation together, e.g. wmma::fill_fragment(Cmat, 0.0f);

  6. Programming TensorCore: a 16x16x16 MatMul on one warp with the CUDA wmma API:

      // assumes #include <mma.h> and using namespace nvcuda;
      __device__ void tensor_op_16_16_16(float *d, half *a, half *b, float *c) {
        // Create fragments
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> Amat;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> Bmat;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> Cmat;
        // Load fragments
        wmma::load_matrix_sync(Amat, a, 16);
        wmma::load_matrix_sync(Bmat, b, 16);
        // Initialize the accumulator (c is unused in this variant)
        wmma::fill_fragment(Cmat, 0.0f);
        // Perform MatMul
        wmma::mma_sync(Cmat, Amat, Bmat, Cmat);
        // Store results
        wmma::store_matrix_sync(d, Cmat, 16, wmma::mem_row_major);
      }

  7. TensorCore Summary • TensorCores are hardware accelerators • Warp-level operation • New memory scope: fragment

  8. Contents (transition to Section 2: TensorCore Support in TVM)

  9. Steps for TensorCore Support in TVM: Memory Scope -> Create Schedule -> Tensorization

  10. Current Memory Scope (diagram of the existing scope hierarchy)

  11. Special Memory Scope (diagram: the hierarchy extended with new scopes)

  12. Traditional GPU Memory Scope Order: Global -> Shared -> Local, with results written back to Global. An example schedule follows below.
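
For context, a conventional TVM GPU schedule stages data through these scopes explicitly. Below is a minimal sketch using TVM's te API; the matmul shape and stage names are illustrative, not taken from the slides:

    import tvm
    from tvm import te

    n = 1024
    A = te.placeholder((n, n), name="A")
    B = te.placeholder((n, n), name="B")
    k = te.reduce_axis((0, n), name="k")
    C = te.compute((n, n),
                   lambda i, j: te.sum(A[i, k] * B[k, j], axis=k),
                   name="C")

    s = te.create_schedule(C.op)
    # Global -> Shared: stage input tiles into shared memory
    AS = s.cache_read(A, "shared", [C])
    BS = s.cache_read(B, "shared", [C])
    # Shared -> Local: stage per-thread working sets into registers
    AL = s.cache_read(AS, "local", [C])
    BL = s.cache_read(BS, "local", [C])
    # Accumulate in local scope; the final result goes back to Global
    CL = s.cache_write(C, "local")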

  13. Enhanced TensorCore Memory Scope Order: Global -> Shared -> Fragment -> Global; a new fragment scope takes the place of local between shared memory and the TensorCore computation (see the sketch below).
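
In TVM, the fragment scope surfaces as three special memory scopes, wmma.matrix_a, wmma.matrix_b, and wmma.accumulator (the names used in TVM's TensorCore tutorial). A sketch continuing the schedule from slide 12, assuming float16 inputs as TensorCores require:

    # Global -> Shared, as before
    AS = s.cache_read(A, "shared", [C])
    BS = s.cache_read(B, "shared", [C])
    # Shared -> Fragment: the new TensorCore scopes
    AF = s.cache_read(AS, "wmma.matrix_a", [C])
    BF = s.cache_read(BS, "wmma.matrix_b", [C])
    # Accumulate in fragment scope, then write back to Global
    CF = s.cache_write(C, "wmma.accumulator")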

  14. Warp Level Schedule: blockDim.x = warp_size = 32

  15. Warp Level Schedule: the thread block is organized as a grid of warps along blockDim.y and blockDim.z, each warp consisting of blockDim.x = warp_size = 32 threads. A binding sketch follows below.
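
A sketch of how this layout is expressed in a schedule, continuing the earlier one: threadIdx.x always covers exactly warp_size lanes, so every downstream wmma intrinsic is issued by one full warp, while threadIdx.y and threadIdx.z enumerate warps within the block. The stage AS and the split below are illustrative:

    warp_size = 32  # blockDim.x must equal the hardware warp size

    thread_x = te.thread_axis("threadIdx.x")  # the 32 lanes of one warp
    thread_y = te.thread_axis("threadIdx.y")  # warp row within the block
    thread_z = te.thread_axis("threadIdx.z")  # warp column within the block

    # Bind threadIdx.x on a data-movement stage, e.g. the cooperative
    # load of the shared-memory tile AS:
    xo, xi = s[AS].split(s[AS].op.axis[-1], factor=warp_size)
    s[AS].bind(xi, thread_x)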

  16. Tensorization: pattern-match the 16x16x16 MatMul loop nest and replace it with a single warp-level intrinsic call:

      for (i, 0, 16) {
        for (j, 0, 16) {
          for (k, 0, 16) {
            C[i*16 + j] = C[i*16 + j] + (float32(A[i*16 + k]) * float32(B[k*16 + j]))
          }
        }
      }

      =>  tvm_mma_sync(C, 0, A, 0, B, 0, C, 0);
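
Tensorization is driven by a declared tensor intrinsic: TVM matches the loop nest against the intrinsic's compute description and swaps in tvm_mma_sync. A condensed sketch modeled on TVM's TensorCore tutorial; buffer strides and the offset factor of 256 (= 16*16 elements per fragment) are simplified relative to the full tutorial code:

    import tvm
    from tvm import te

    def intrin_wmma_gemm():
        n = 16
        A = te.placeholder((n, n), name="A", dtype="float16")
        B = te.placeholder((n, n), name="B", dtype="float16")
        k = te.reduce_axis((0, n), name="k")
        C = te.compute(
            (n, n),
            lambda i, j: te.sum(A[i, k].astype("float32") *
                                B[k, j].astype("float32"), axis=k),
            name="C")
        # The intrinsic's buffers live in the fragment scopes
        BA = tvm.tir.decl_buffer(A.shape, A.dtype, scope="wmma.matrix_a",
                                 offset_factor=256)
        BB = tvm.tir.decl_buffer(B.shape, B.dtype, scope="wmma.matrix_b",
                                 offset_factor=256)
        BC = tvm.tir.decl_buffer(C.shape, C.dtype, scope="wmma.accumulator",
                                 offset_factor=256)

        def intrin_func(ins, outs):
            ba, bb = ins
            (bc,) = outs

            def init():
                # Zero the accumulator fragment
                ib = tvm.tir.ir_builder.create()
                ib.emit(tvm.tir.call_intrin(
                    "handle", "tir.tvm_fill_fragment",
                    bc.data, n, n, n, bc.elem_offset // 256, 0.0))
                return ib.get()

            def update():
                # One warp-level 16x16x16 multiply-accumulate
                ib = tvm.tir.ir_builder.create()
                ib.emit(tvm.tir.call_intrin(
                    "handle", "tir.tvm_mma_sync",
                    bc.data, bc.elem_offset // 256,
                    ba.data, ba.elem_offset // 256,
                    bb.data, bb.elem_offset // 256,
                    bc.data, bc.elem_offset // 256))
                return ib.get()

            return update(), init(), update()

        return te.decl_tensor_intrin(C.op, intrin_func,
                                     binds={A: BA, B: BB, C: BC})

The schedule then applies it at the fragment-level compute stage, e.g. s[CF].tensorize(inner_i, intrin_wmma_gemm()), which is what rewrites the loop nest above into the single tvm_mma_sync call.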

  17. Performance Improvement over non-TensorCore TVM (speedup, TVM w/o TensorCores = 1):

      Workload       TVM w/ TensorCores
      Large MatMul   5.17x
      BatchConv      5.02x
      Small MatMul   4.97x
      BatchMatMul    4.87x

  18. Performance Comparison vs. CuDNN (CuDNN w/ TensorCores = 1, across Large MatMul, BatchConv, Small MatMul, and BatchMatMul): TVM w/ TensorCores is comparable on traditional workloads; the measured ratios on the chart range from 0.76 to 1.44.

  19. Performance Comparison vs. CuDNN: on emerging workloads such as BERT, TVM w/ TensorCores reaches about 1.4x over CuDNN.

  20. TVM TensorCore Support Summary • Massive speedup over non-TensorCore code • Competitive performance with CuDNN • Based on tensor intrinsics

  21. Contents (transition to Section 3: Future Work)

  22. Future Work: 1. Use TensorCore in TOPI and Relay; 2. Apply TensorCore to popular ML models, such as BERT

  23. Thank you. Siyuan Feng, Dec 5, 2019
