

SLIDE 1

TensorCore and Tensorization

Siyuan Feng, Dec 5, 2019

SLIDE 2

Contents

1. TensorCore Introduction
2. TensorCore Support in TVM
3. Future Work

SLIDE 3

Contents

1. TensorCore Introduction
2. TensorCore Support in TVM
3. Future Work

SLIDE 4

What are TensorCores


SLIDE 5

Warp-Level Operation

wmma::fill_fragment(Cmat, 0.0f);   // issued cooperatively by all 32 threads of one warp

Every wmma operation is warp-level: a single warp (32 threads) executes it together.


SLIDE 6

Programming TensorCore

#include <mma.h>
using namespace nvcuda;

// One warp computes a 16x16x16 matmul: fp16 inputs, fp32 accumulation.
__device__ void tensor_op_16_16_16(float *d, half *a, half *b, float *c) {
    // Create fragments
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> Amat;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> Bmat;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> Cmat;

    // Load fragments
    wmma::load_matrix_sync(Amat, a, 16);
    wmma::load_matrix_sync(Bmat, b, 16);
    wmma::fill_fragment(Cmat, 0.0f);

    // Perform MatMul
    wmma::mma_sync(Cmat, Amat, Bmat, Cmat);

    // Store results
    wmma::store_matrix_sync(d, Cmat, 16, wmma::mem_row_major);
}


16x16x16 MatMul

SLIDE 7

TensorCore Summary

  • TensorCores are hardware accelerators
  • Warp-level operation
  • New memory scope: fragment


SLIDE 8

Contents

1. TensorCore Introduction
2. TensorCore Support in TVM
3. Future Work

SLIDE 9

Steps for TensorCore Support in TVM

1. Memory Scope
2. Create Schedule
3. Tensorization


SLIDE 10

Current Memory Scope

(diagram: TVM's existing memory scopes)

SLIDE 11

Special Memory Scope

(diagram: the scope hierarchy extended with the fragment scopes wmma.matrix_a, wmma.matrix_b, and wmma.accumulator)

SLIDE 12

Traditional GPU Memory Scope Order

Global → Shared → Local → Global


SLIDE 13

Enhanced TensorCore Memory Scope Order

Global → Shared → Fragment → Global (the new fragment scope takes the place local occupies in the traditional path)
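The new scope surfaces in a schedule through cache stages. Below is a minimal sketch, assuming the 2019-era TVM Python API; the stage names (AS, AF, CF, ...) are illustrative, not taken from the deck:

  import tvm

  # 16x16x16 fp16 matmul with fp32 accumulation, the shape TensorCore expects
  n = 16
  A = tvm.placeholder((n, n), name='A', dtype='float16')
  B = tvm.placeholder((n, n), name='B', dtype='float16')
  k = tvm.reduce_axis((0, n), name='k')
  C = tvm.compute((n, n),
                  lambda i, j: tvm.sum(A[i, k].astype('float32') *
                                       B[k, j].astype('float32'), axis=k),
                  name='C')

  s = tvm.create_schedule(C.op)
  # Global -> Shared
  AS = s.cache_read(A, 'shared', [C])
  BS = s.cache_read(B, 'shared', [C])
  # Shared -> Fragment: the new TensorCore memory scopes
  AF = s.cache_read(AS, 'wmma.matrix_a', [C])
  BF = s.cache_read(BS, 'wmma.matrix_b', [C])
  # Accumulate in a fragment, then write the result back to Global
  CF = s.cache_write(C, 'wmma.accumulator')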


SLIDE 14

Warp Level Schedule

blockDim.x = warp_size = 32


SLIDE 15

Warp Level Schedule

blockDim.x = warp_size = 32

(diagram: each block is a 2D grid of warps indexed by blockDim.y and blockDim.z; every warp spans the 32 threads of blockDim.x)
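Concretely, the schedule pins blockDim.x to the warp size so every wmma intrinsic is issued by exactly one full warp, while blockDim.y (and blockDim.z) enumerate warps. A minimal sketch, assuming the 2019-era TVM Python API; the toy compute and split factors are made up for illustration:

  import tvm

  n = 128
  A = tvm.placeholder((n, n), name='A', dtype='float16')
  B = tvm.compute((n, n), lambda i, j: A[i, j] * tvm.const(2, 'float16'),
                  name='B')
  s = tvm.create_schedule(B.op)

  warp_size = 32                              # blockDim.x = one warp
  thread_x = tvm.thread_axis('threadIdx.x')   # the 32 lanes within a warp
  thread_y = tvm.thread_axis('threadIdx.y')   # warps tiled along y
  block_x = tvm.thread_axis('blockIdx.x')

  i, j = s[B].op.axis
  io, ii = s[B].split(i, factor=4)            # 4 warps per block along y
  jo, ji = s[B].split(j, factor=warp_size)
  s[B].bind(io, block_x)
  s[B].bind(ii, thread_y)
  s[B].bind(ji, thread_x)                     # innermost 32 iterations -> lanes

  print(tvm.lower(s, [A, B], simple_mode=True))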


SLIDE 16

Tensorization

Before tensorization:

  for (i, 0, 16) {
    for (j, 0, 16) {
      for (k, 0, 16) {
        C[i*16 + j] = C[i*16 + j] + (float32(A[i*16 + k]) * float32(B[k*16 + j]))
      }
    }
  }

After tensorization:

  tvm_mma_sync(C, 0, A, 0, B, 0, C, 0);
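Tensorization pattern-matches this 16x16x16 loop nest against a declared tensor intrinsic and swaps in the warp-level MMA instruction. A minimal sketch of such a declaration, loosely following the tvm_mma_sync lowering in TVM's TensorCore tutorial and assuming the 2019-era Python API (details such as offset_factor=256, the size of one 16x16 fragment, are illustrative):

  import tvm

  def intrin_wmma_gemm():
      n = 16
      A = tvm.placeholder((n, n), name='A', dtype='float16')
      B = tvm.placeholder((n, n), name='B', dtype='float16')
      k = tvm.reduce_axis((0, n), name='k')
      C = tvm.compute((n, n),
                      lambda i, j: tvm.sum(A[i, k].astype('float32') *
                                           B[k, j].astype('float32'), axis=k),
                      name='C')
      # The buffers live in the fragment scopes; 16x16 = 256 elements each.
      BA = tvm.decl_buffer(A.shape, A.dtype, scope='wmma.matrix_a',
                           offset_factor=256)
      BB = tvm.decl_buffer(B.shape, B.dtype, scope='wmma.matrix_b',
                           offset_factor=256)
      BC = tvm.decl_buffer(C.shape, C.dtype, scope='wmma.accumulator',
                           offset_factor=256)

      def intrin_func(ins, outs):
          ba, bb = ins
          bc, = outs
          ib = tvm.ir_builder.create()
          # One warp-level MMA replaces the whole 16x16x16 loop nest.
          ib.emit(tvm.call_intrin('handle', 'tvm_mma_sync',
                                  bc.data, bc.elem_offset // 256,
                                  ba.data, ba.elem_offset // 256,
                                  bb.data, bb.elem_offset // 256,
                                  bc.data, bc.elem_offset // 256))
          return ib.get()

      return tvm.decl_tensor_intrin(C.op, intrin_func,
                                    binds={A: BA, B: BB, C: BC})

The schedule would then apply it with something like s[CF].tensorize(i_inner, intrin_wmma_gemm()), where CF is the wmma.accumulator stage and i_inner is the first of the three 16-iteration axes; both names here are hypothetical.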


SLIDE 17

Performance Improvements over non-TensorCore

Workload     | TVM w/o TensorCores | TVM w/ TensorCores
Large MatMul | 1.00                | 4.87
BatchConv    | 1.00                | 5.17
Small MatMul | 1.00                | 5.02
BatchMatMul  | 1.00                | 4.97

SLIDE 18

Performance Comparison vs CuDNN

Workload     | CuDNN w/ TensorCores | TVM w/ TensorCores
Large MatMul | 1.00                 | 0.76
BatchConv    | 1.00                 | 0.83
Small MatMul | 1.00                 | 1.16
BatchMatMul  | 1.00                 | 1.44

Comparable on traditional workloads


SLIDE 19

Performance Comparison vs CuDNN

Workload     | CuDNN w/ TensorCores | TVM w/ TensorCores
Large MatMul | 1.00                 | 0.76
BatchConv    | 1.00                 | 0.83
Small MatMul | 1.00                 | 1.16
BatchMatMul  | 1.00                 | 1.44

1.4x on emerging workloads (BERT)

SLIDE 20

TVM TensorCore Support Summary

  • Massive speedup over non-TensorCore
  • Competitive performance with CuDNN
  • Based on tensor intrinsics


SLIDE 21

Contents

1. TensorCore Introduction
2. TensorCore Support in TVM
3. Future Work

SLIDE 22

Future Work

  • Use TensorCore in TOPI and Relay
  • Apply TensorCore to popular ML models, such as BERT


SLIDE 23

Thank you

Siyuan Feng, Dec 5, 2019