TC-CIM: Empowering Tensor Comprehensions for Computing In Memory

SLIDE 1

TC-CIM: Empowering Tensor Comprehensions for Computing In Memory

Andi Drebes¹  Lorenzo Chelini²,³  Oleksandr Zinenko⁴  Albert Cohen⁴  Henk Corporaal²  Tobias Grosser⁵  Kanishkan Vadivel²  Nicolas Vasilache⁴

¹Inria and École Normale Supérieure  ²TU Eindhoven  ³IBM Research Zurich  ⁴Google  ⁵ETH Zurich

01/22/2020

SLIDE 2

Detecting Operations for Accelerators

[Diagram: High-Level Specification → Optimizing Compiler → Optimized Code for Accelerator]

◮ Goal: Reliably detect operations for efficient offloading

Drebes et al. – TC-CIM: Empowering Tensor Comprehensions for Computing In Memory 1 / 16

SLIDE 3

Detecting Operations for Accelerators


◮ Goal: Reliably detect operations for efficient offloading
◮ At which stage?
◮ On which representation?
◮ Create reusable infrastructure

SLIDE 4

Von-Neumann Bottleneck

[Diagram: host CPU connected to main memory]

SLIDE 6

Von-Neumann Bottleneck

[Diagram: host CPU with cache, connected to main memory]

SLIDE 7

Von-Neumann Bottleneck

[Diagram: host CPU with cache and main memory, plus an attached accelerator]

SLIDE 8

Von-Neumann Bottleneck

[Diagram: host CPU with cache and main memory; the accelerator has its own local memory and a grid of processing units (PUs)]

SLIDE 9

Compute In Memory (CIM)

[Diagram: CIM accelerator attached to an ARM host (CPU, L1 cache, main memory). Each CIM tile contains a PCM crossbar with programmable conductances G(i,j), row/column/output buffers, a context register, DMA, sample-and-hold (S&H) units, an ADC, shift-and-add logic, and a control unit. Applying a voltage vector v to the rows yields the column currents I = v·G.]

◮ Interweave computation and storage
◮ Example: memristor-based architecture from the MNEMOSENE project (https://www.mnemosene.eu)

SLIDE 10

Compute In Memory (CIM)


◮ Interweave computation and storage
◮ Example: memristor-based architecture from the MNEMOSENE project (https://www.mnemosene.eu)
◮ High energy efficiency and throughput with fixed functions (e.g., matrix multiplication)
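The crossbar's in-memory matrix-vector product (I = v·G) can be sketched in a few lines of Python. This is an idealized illustration of the operation, not a model of the MNEMOSENE hardware:

```python
# Idealized crossbar sketch: applying voltages v_i to the rows produces, by
# Kirchhoff's current law, a current I_j = sum_i v_i * G[i][j] on each column,
# where G[i][j] is the programmed conductance of cell (i, j).

def crossbar_mvm(v, G):
    """Current vector I = v . G for row voltages v and conductance matrix G."""
    rows, cols = len(G), len(G[0])
    assert len(v) == rows
    return [sum(v[i] * G[i][j] for i in range(rows)) for j in range(cols)]

v = [1.0, 2.0, 3.0]
G = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
print(crossbar_mvm(v, G))  # [4.0, 5.0]
```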

SLIDE 11

Detecting Accelerated Operations for CIM


◮ Goal: Reliably detect operations for efficient offloading
◮ At which stage?
◮ On which representation?
◮ Create reusable infrastructure

SLIDE 12

Detecting Accelerated Operations for CIM

[Diagram: Tensor Comprehensions → Optimizing Compiler → ISO C + optimized library]

◮ Goal: Reliably detect operations for efficient offloading
◮ At which stage?
◮ On which representation?
◮ Create reusable infrastructure

SLIDE 13

Tensor Comprehensions

Math-like notation
◮ Expresses operations on tensors
◮ Only the information needed to define the operation unambiguously
◮ Compiler infers shapes and iteration domains

SLIDE 14

Tensor Comprehensions

Math-like notation
◮ Expresses operations on tensors
◮ Only the information needed to define the operation unambiguously
◮ Compiler infers shapes and iteration domains

Example:

def mv(float(M,K) A, float(K) x) -> (C) {
  C(i) +=! A(i,k) * x(k)
}
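To illustrate what the compiler infers from this comprehension, here is a plain-Python sketch of its semantics; the zero-initializing behaviour of `+=!` follows the TC language, while the code itself is purely illustrative:

```python
# Sketch of the semantics the TC compiler infers for
#   def mv(float(M,K) A, float(K) x) -> (C) { C(i) +=! A(i,k) * x(k) }
# Shapes and iteration domains are deduced from the accesses: i ranges over M,
# k over K, and "+=!" means C is zero-initialized before the reduction.

def mv(A, x):
    M, K = len(A), len(A[0])
    assert len(x) == K
    C = [0.0] * M          # the "!" in "+=!": initialize the output
    for i in range(M):     # inferred domain: 0 <= i < M
        for k in range(K): # inferred domain: 0 <= k < K
            C[i] += A[i][k] * x[k]
    return C

print(mv([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0]))  # [3.0, 7.0]
```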

SLIDE 15

Tensor Comprehensions: Compilation

Tensor Comprehensions pipeline: TC lang → Halide IR → isl schedule tree → polyhedral transformations → isl AST → CUDA backend → CUDA C / PTX

SLIDE 16

Integration of Loop Tactics

Tensor Comprehensions pipeline with Loop Tactics (pattern detection and marking on the schedule tree): TC lang → Halide IR → isl schedule tree → polyhedral transformations + Loop Tactics → isl AST → Tactics backend → ISO C99

SLIDE 17

Matching Example: Matrix Multiplications

SLIDE 18

Matching Example: Matrix Multiplications

[Schedule-tree pattern: arbitrary ancestors → band node → band node → sequence node → two filter nodes, each over a leaf node]

SLIDE 19

Matching Example: Matrix Multiplications


Iteration over i, n and k

SLIDE 20

Matching Example: Matrix Multiplications


Iteration over i, n and k
3D iteration over the input matrices

SLIDE 21

Matching Example: Matrix Multiplications


Iteration over i, n and k
2D initialization of the output matrix
3D iteration over the input matrices
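The pattern on this slide corresponds to the canonical GEMM loop nest: a 2D band initializing the output matrix followed by a 3D band accumulating over the inputs. A minimal Python sketch, illustrative only and using the slide's i, n, k naming:

```python
# Loop-nest view of the matched pattern: a 2D initialization band and a 3D
# update band (the two filters under one sequence in the schedule tree).

def gemm(A, B):
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]   # 2D initialization of the output matrix
    for i in range(M):                  # 3D iteration over the input matrices
        for n in range(N):
            for k in range(K):
                C[i][n] += A[i][k] * B[k][n]
    return C

print(gemm([[1.0, 0.0], [0.0, 1.0]], [[5.0, 6.0], [7.0, 8.0]]))
# [[5.0, 6.0], [7.0, 8.0]]
```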

SLIDE 22

Matching Example: Matrix Multiplications

[Diagram: the schedule-tree pattern (arbitrary ancestors → band → band → sequence → two filters → leaves) is matched, and a mark node carrying GEMM info is attached to the matched subtree]

Iteration over i, n and k
2D initialization of the output matrix
3D iteration over the input matrices

SLIDE 23

Matching Example: Matrix Multiplications

[Diagram: matching attaches a mark node with GEMM info; AST generation then lowers the schedule tree to an isl AST (for nodes and user nodes) with the mark node and GEMM info preserved]

Iteration over i, n and k
2D initialization of the output matrix
3D iteration over the input matrices

SLIDE 24

Matching Example: Matrix Multiplications

[Diagram: matching attaches a mark node with GEMM info; AST generation lowers the schedule tree to an isl AST preserving the mark; printing emits the final code]

Iteration over i, n and k
2D initialization of the output matrix
3D iteration over the input matrices

SLIDE 25

Loop Tactics: Tree Matchers

A tree matcher defines a pattern for a subtree and captures its nodes

schedule_node body;
schedule_node initBody;
schedule_node schedule;

auto matcher =
    band(schedule,
         sequence(filter(initBody, hasGemmInitPattern, leaf()),
                  filter(body, hasGemmPattern, leaf())));

[Diagram: matched subtree: band → sequence → two filters → leaves]
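To make the matcher combinators concrete, here is a toy Python re-implementation of the idea. The real matchers are C++ functions over isl schedule trees; every class and function name below (Node, Capture, filt, ...) is invented for this sketch:

```python
# Toy structural matching on a schedule-tree-like data structure: combinators
# match a subtree's shape and capture the nodes they touch.

class Node:
    def __init__(self, kind, children=(), payload=None):
        self.kind, self.children, self.payload = kind, list(children), payload

class Capture:
    def __init__(self):
        self.node = None

def band(cap, child):
    def m(n):
        if n.kind == "band" and len(n.children) == 1 and child(n.children[0]):
            cap.node = n
            return True
        return False
    return m

def sequence(*filters):
    def m(n):
        return (n.kind == "sequence"
                and len(n.children) == len(filters)
                and all(f(c) for f, c in zip(filters, n.children)))
    return m

def filt(cap, pred, child):  # "filt" to avoid shadowing Python's filter()
    def m(n):
        if (n.kind == "filter" and pred(n)
                and len(n.children) == 1 and child(n.children[0])):
            cap.node = n
            return True
        return False
    return m

def leaf():
    return lambda n: n.kind == "leaf"

# Pattern from the slide: band -> sequence -> two filters -> leaves.
schedule, initBody, body = Capture(), Capture(), Capture()
matcher = band(schedule, sequence(
    filt(initBody, lambda n: n.payload == "init", leaf()),
    filt(body, lambda n: n.payload == "update", leaf())))

tree = Node("band", [Node("sequence", [
    Node("filter", [Node("leaf")], payload="init"),
    Node("filter", [Node("leaf")], payload="update")])])
print(matcher(tree), initBody.node.payload)  # True init
```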

SLIDE 26

Loop Tactics: Access Relation Matchers

Access Relation Matcher: Matches tensor accesses

auto hasGemmPattern = [&](schedule_node node) {
  auto _i = placeholder();
  auto _j = placeholder();
  auto _k = placeholder();
  auto _A = arrayPlaceholder();
  auto _B = arrayPlaceholder();
  auto _C = arrayPlaceholder();
  auto reads  = /* get read accesses */;
  auto writes = /* get write accesses */;
  auto mRead  = allOf(access(_C, _i, _j),
                      access(_A, _i, _k),
                      access(_B, _k, _j));
  auto mWrite = allOf(access(_C, _i, _j));
  return match(reads, mRead).size() == 1 &&
         match(writes, mWrite).size() == 1;
};
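The placeholder mechanism can be sketched in Python: a binding maps each placeholder to a concrete array or index name, must stay consistent across all accesses, and is kept injective so that distinct placeholders bind distinct values. All names below are invented for this illustration; the real matcher operates on isl access relations:

```python
# Toy access-relation matching: find a placeholder binding under which the
# GEMM read pattern {C(i,j), A(i,k), B(k,j)} matches the concrete accesses.

def bind(binding, placeholder, value):
    """Extend the binding if consistent; return the new binding or None."""
    if placeholder in binding:
        return binding if binding[placeholder] == value else None
    if value in binding.values():  # injectivity: distinct placeholders bind distinct values
        return None
    new = dict(binding)
    new[placeholder] = value
    return new

def try_pattern(binding, pattern, access):
    """Match one access pattern against one concrete access."""
    (p_arr, p_idx), (arr, idx) = pattern, access
    if len(idx) != len(p_idx):
        return None
    for ph, val in [(p_arr, arr)] + list(zip(p_idx, idx)):
        if binding is None:
            return None
        binding = bind(binding, ph, val)
    return binding

def match_accesses(accesses, patterns, binding=None):
    """Return a binding under which every pattern matches some access, else None."""
    binding = {} if binding is None else binding
    if not patterns:
        return binding
    for access in accesses:
        b = try_pattern(binding, patterns[0], access)
        if b is not None:
            result = match_accesses(accesses, patterns[1:], b)
            if result is not None:
                return result
    return None

# GEMM: reads {C(i,j), A(i,k), B(k,j)}, writes {C(i,j)}.
reads  = [("C", ("i", "j")), ("A", ("i", "k")), ("B", ("k", "j"))]
writes = [("C", ("i", "j"))]
mRead  = [("_C", ("_i", "_j")), ("_A", ("_i", "_k")), ("_B", ("_k", "_j"))]
mWrite = [("_C", ("_i", "_j"))]

b = match_accesses(reads, mRead)
has_gemm = b is not None and match_accesses(writes, mWrite, b) is not None
print(has_gemm)  # True
```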

SLIDE 27

Loop Tactics: Access Relation Matchers


Additionally match leaf expressions

SLIDE 28

Loop Tactics: Tree Builders

A tree builder generates the new subtree after transformation

auto builder =
    mark([&]() { return marker(); },
         band([&]() { return schedule.getSchedule(); },
              sequence(filter([&]() { return initBody.getFilter(); }),
                       filter([&]() { return body.getFilter(); }))));
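A toy Python sketch of the builder idea follows. The real builders are C++ and produce isl schedule trees; every name and payload below is invented for illustration:

```python
# Toy tree builder: callbacks supply the payloads (as captured by a previous
# match), and the builder assembles the new subtree, here with a mark node on
# top of the band / sequence / filters pattern.

class Node:
    def __init__(self, kind, payload=None, children=()):
        self.kind, self.payload, self.children = kind, payload, list(children)

def mark(payload_fn, child):
    return Node("mark", payload_fn(), [child])

def band(schedule_fn, child):
    return Node("band", schedule_fn(), [child])

def sequence(*children):
    return Node("sequence", None, list(children))

def filt(filter_fn):
    return Node("filter", filter_fn(), [Node("leaf")])

# Hypothetical payloads standing in for the captured schedule/filters:
builder = mark(lambda: "gemm_info",
    band(lambda: "[i, n, k]",
        sequence(filt(lambda: "init stmts"),
                 filt(lambda: "update stmts"))))

print(builder.kind, builder.payload)         # mark gemm_info
print(builder.children[0].children[0].kind)  # sequence
```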

SLIDE 29

Loop Tactics: Tree Builders

[Diagram: matched subtree: band → sequence → two filters → leaves]

SLIDE 30

Loop Tactics: Tree Builders

[Diagram: the builder creates a new subtree: a mark node above the band → sequence → two filters → leaves pattern]

SLIDE 31

Loop Tactics: Tree Builders

[Diagram: the new subtree (with the mark node) replaces the matched subtree in the old schedule tree]

SLIDE 32

Experimental Methodology

Implemented Matchers
◮ Matrix-matrix multiplications
◮ Matrix-vector multiplications

SLIDE 33

Experimental Methodology

Implemented Matchers
◮ Matrix-matrix multiplications
◮ Matrix-vector multiplications

Benchmarks
◮ mm, mv, batchMM, 3mm, 4cmm, mlp3

SLIDE 34

Experimental Methodology

Implemented Matchers
◮ Matrix-matrix multiplications
◮ Matrix-vector multiplications

Benchmarks
◮ mm, mv, batchMM, 3mm, 4cmm, mlp3

Static Impact
◮ Percentage of detected patterns in the code
◮ Test robustness against prior transposition / tiling

SLIDE 35

Experimental Methodology

Implemented Matchers
◮ Matrix-matrix multiplications
◮ Matrix-vector multiplications

Benchmarks
◮ mm, mv, batchMM, 3mm, 4cmm, mlp3

Static Impact
◮ Percentage of detected patterns in the code
◮ Test robustness against prior transposition / tiling

Dynamic Impact
◮ Dynamic instruction count: unoptimized vs. optimized version

SLIDE 36

Detected Patterns in the Code


SLIDE 37

Breakdown of Dynamic Host CPU Instructions

[Chart: dynamic host-CPU instruction counts with offloading, normalized to the instruction count without offloading (host CPU only)]

SLIDE 38

Positioning in the Pipeline

Tensor Comprehensions pipeline with Loop Tactics (pattern detection and marking on the schedule tree): TC lang → Halide IR → isl schedule tree → polyhedral transformations + Loop Tactics → isl AST → Tactics backend → ISO C99

Matching after affine scheduling, without rescheduling:
◮ Leverages enabling transformations (e.g., fusion, tiling)
◮ Initial schedule serves as a canonical form (e.g., permutability, band coalescing)
◮ No feedback for transformations (e.g., no architecture-specific tiling or fusion decisions)
◮ Matcher complexity rises with prior transformations

SLIDE 39

Positioning in the Pipeline

Tensor Comprehensions pipeline with Loop Tactics (pattern detection and marking on the schedule tree): TC lang → Halide IR → isl schedule tree → polyhedral transformations + Loop Tactics → isl AST → Tactics backend → ISO C99

Matching after affine scheduling, without rescheduling:
◮ Leverages enabling transformations (e.g., fusion, tiling)
◮ Initial schedule serves as a canonical form (e.g., permutability, band coalescing)
◮ No feedback for transformations (e.g., no architecture-specific tiling or fusion decisions)
◮ Matcher complexity rises with prior transformations

Matching earlier (at a higher level of abstraction):
◮ More high-level information available to matchers
◮ Simpler matchers & builders
◮ Less or no benefit from affine transformations

SLIDE 40

Summary and Future Work

Summary
◮ TC-CIM: a compilation flow for Compute-In-Memory (CIM) accelerators
◮ Integration of Loop Tactics into Tensor Comprehensions
◮ Reliable detection + significant dynamic impact

SLIDE 41

Summary and Future Work

Summary
◮ TC-CIM: a compilation flow for Compute-In-Memory (CIM) accelerators
◮ Integration of Loop Tactics into Tensor Comprehensions
◮ Reliable detection + significant dynamic impact

Future Work
◮ Explore positioning in the pipeline
◮ More complex matchers: fusion / minimizing data transfers
◮ Matching in MLIR (e.g., raising from lower-level to higher-level dialects)
