SLIDE 1

Strassen’s Algorithm for Tensor Contraction

Jianyu Huang

Joint work with Devin A. Matthews and Robert A. van de Geijn

The University of Texas at Austin

September 18–19, 2017, BLIS Retreat 2017

SLIDE 2

Marry Strassen with Tensor Contraction

M0 := (A00+A11)(B00+B11);
M1 := (A10+A11)B00;
M2 := A00(B01–B11);
M3 := A11(B10–B00);
M4 := (A00+A01)B11;
M5 := (A10–A00)(B00+B01);
M6 := (A01–A11)(B10+B11);

C00 += M0 + M3 – M4 + M6;
C01 += M2 + M4;
C10 += M1 + M3;
C11 += M0 – M1 + M2 + M5;

O(n³) → O(n^2.8). Practical speedup?
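In code, one level of Strassen's recursion can be sketched with NumPy (a minimal reference, assuming even-sized square matrices; the seven half-size products use the classical algorithm):

```python
import numpy as np

def strassen_one_level(A, B):
    """One level of Strassen: 7 half-size multiplies instead of 8."""
    n = A.shape[0] // 2
    A00, A01, A10, A11 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B00, B01, B10, B11 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
    M0 = (A00 + A11) @ (B00 + B11)
    M1 = (A10 + A11) @ B00
    M2 = A00 @ (B01 - B11)
    M3 = A11 @ (B10 - B00)
    M4 = (A00 + A01) @ B11
    M5 = (A10 - A00) @ (B00 + B01)
    M6 = (A01 - A11) @ (B10 + B11)
    C = np.empty_like(A)
    C[:n, :n] = M0 + M3 - M4 + M6   # C00
    C[:n, n:] = M2 + M4             # C01
    C[n:, :n] = M1 + M3             # C10
    C[n:, n:] = M0 - M1 + M2 + M5   # C11
    return C
```

Applying this recursively gives the O(n^2.8) bound; applied once on top of a fast GEMM it already trades one multiply for extra additions, which is where the practical-speedup question comes from.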

SLIDE 3

Outline

  • Background

– High-performance GEMM
– High-performance Strassen
– High-performance Tensor Contraction

  • Strassen’s Algorithm for Tensor Contraction
  • Performance Model
  • Experiments
  • Conclusion

SLIDE 4

High-performance matrix multiplication (GEMM)

SLIDE 5

State-of-the-art GEMM in BLIS

  • BLAS-like Library Instantiation Software (BLIS) is a portable framework for instantiating BLAS-like dense linear algebra libraries.
  • BLIS provides a refactoring of the GotoBLAS algorithm (the best-known approach on CPUs) to implement GEMM.
  • The GEMM implementation in BLIS has six layers of loops. The outer five loops are written in C; the innermost loop (the micro-kernel) is written in assembly for high performance.

Field Van Zee and Robert van de Geijn. “BLIS: A Framework for Rapidly Instantiating BLAS Functionality.” ACM TOMS 41.3 (2015): 14.
Kazushige Goto and Robert van de Geijn. “High-performance Implementation of the Level-3 BLAS.” ACM TOMS 35.1 (2008): 4.
Kazushige Goto and Robert van de Geijn. “Anatomy of High-performance Matrix Multiplication.” ACM TOMS 34.3 (2008): 12.
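The layered loop structure described above can be sketched as follows (a simplified NumPy model: block sizes NC, KC, MC, NR, MR are illustrative stand-ins for the architecture-tuned values BLIS uses, and the packing of A and B into contiguous buffers is elided):

```python
import numpy as np

# Illustrative block sizes; BLIS chooses these per architecture so the
# packed panels fit the caches named in the comments.
NC, KC, MC, NR, MR = 8, 4, 8, 2, 2

def gemm_blis_style(A, B, C):
    """C += A @ B with the five outer loops of the BLIS/GotoBLAS layering."""
    m, k = A.shape
    _, n = B.shape
    for jc in range(0, n, NC):                # 5th loop: column panels of B (L3)
        for pc in range(0, k, KC):            # 4th loop: k-dimension panels
            for ic in range(0, m, MC):        # 3rd loop: row blocks of A (L2)
                for jr in range(jc, min(jc + NC, n), NR):       # 2nd loop
                    for ir in range(ic, min(ic + MC, m), MR):   # 1st loop
                        # innermost: MR x NR micro-kernel (assembly in BLIS)
                        C[ir:ir+MR, jr:jr+NR] += (
                            A[ir:ir+MR, pc:pc+KC] @ B[pc:pc+KC, jr:jr+NR]
                        )
    return C
```

NumPy slicing clamps at matrix edges, which stands in for the fringe-case handling a real implementation does explicitly.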

SLIDE 6

GotoBLAS algorithm for GEMM in BLIS

[Figure: GotoBLAS layering — kC × nC packed panel of B in the L3 cache, mC × kC packed block of A in the L2 cache, mR × kC and kC × nR micro-panels feeding the mR × nR micro-kernel in registers]

*Field G. Van Zee and Tyler M. Smith. “Implementing High-performance Complex Matrix Multiplication via the 3m and 4m Methods.” ACM TOMS, accepted.

SLIDE 7

High-performance Strassen


*Jianyu Huang, Tyler Smith, Greg Henry, and Robert van de Geijn. “Strassen’s Algorithm Reloaded.” In SC’16.

SLIDE 8

Strassen’s Algorithm Reloaded

One-level Strassen:

M0 := (A00+A11)(B00+B11);
M1 := (A10+A11)B00;
M2 := A00(B01–B11);
M3 := A11(B10–B00);
M4 := (A00+A01)B11;
M5 := (A10–A00)(B00+B01);
M6 := (A01–A11)(B10+B11);
C00 += M0 + M3 – M4 + M6;  C01 += M2 + M4;
C10 += M1 + M3;  C11 += M0 – M1 + M2 + M5;

Reordered so each Mi updates C as soon as it is computed:

M0 := (A00+A11)(B00+B11); C00 += M0; C11 += M0;
M1 := (A10+A11)B00; C10 += M1; C11 –= M1;
M2 := A00(B01–B11); C01 += M2; C11 += M2;
M3 := A11(B10–B00); C00 += M3; C10 += M3;
M4 := (A00+A01)B11; C01 += M4; C00 –= M4;
M5 := (A10–A00)(B00+B01); C11 += M5;
M6 := (A01–A11)(B10+B11); C00 += M6;

General operation for one-level Strassen:

M := (X+Y)(V+W); C += M; D += M;
and, with coefficients,
M := (X+δY)(V+εW); C += γ0·M; D += γ1·M;  where γ0, γ1, δ, ε ∈ {−1, 0, 1}.


*Jianyu Huang, Tyler Smith, Greg Henry, and Robert van de Geijn. “Strassen’s Algorithm Reloaded.” In SC’16.
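The general operation can be written down directly as a naive reference (this sketch materializes M, which the fused implementation described in the paper never does):

```python
import numpy as np

def fused_strassen_op(X, Y, V, W, C, D, delta=1, eps=1, g0=1, g1=1):
    """Reference for M := (X + delta*Y)(V + eps*W); C += g0*M; D += g1*M.
    In the high-performance version the additions are folded into packing
    and the two scaled updates into the micro-kernel, so M never exists."""
    M = (X + delta * Y) @ (V + eps * W)
    C += g0 * M
    D += g1 * M
```

Each of the seven Strassen products is one instance of this operation with the appropriate operand sub-blocks and coefficients from {−1, 0, 1}.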

SLIDE 9

M := (X+Y)(V+W); C += M; D += M;


*Jianyu Huang, Tyler Smith, Greg Henry, and Robert van de Geijn. “Strassen’s Algorithm Reloaded.” In SC’16.

SLIDE 10

C += AB; M := (X+Y)(V+W); C += M; D += M;

SLIDE 11

C += AB; M := (X+Y)(V+W); C += M; D += M;

[Figure: Strassen mapped onto the GEMM memory hierarchy — mR × nR micro-tile of C in registers, mC × kC packed block of A in the L2 cache, kC × nC packed panel of B in the L3 cache]

SLIDE 12

High-performance Tensor Contraction

Devin A. Matthews. “High-Performance Tensor Contraction without Transposition.” Accepted in SISC.

SLIDE 13

Matrix vs. Tensor

Matrix Multiplication → BLAS/BLIS!
Tensor Contraction → TBLIS!

Devin A. Matthews. “High-Performance Tensor Contraction without Transposition.” Accepted in SISC.
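To make the matrix-versus-tensor comparison concrete, here is a hedged NumPy sketch (sizes and the index pattern C(abc) += A(dca)·B(db), which appears later in the deck, are used for illustration): a tensor contraction can always be reduced to GEMM by explicit transpose-reshape-multiply-transpose, and that permutation overhead is exactly what TBLIS avoids.

```python
import numpy as np

# Contraction C[a,b,c] += A[d,c,a] * B[d,b], summing over d.
a, b, c, d = 4, 5, 2, 8
rng = np.random.default_rng(0)
A = rng.random((d, c, a))
B = rng.random((d, b))

# Reference: direct contraction.
C_ref = np.einsum('dca,db->abc', A, B)

# Transpose-Transpose-GEMM-Transpose (TTGT) approach: permute A so the
# free indices (a, c) are contiguous, flatten to a matrix, multiply,
# then permute the result back.
A_mat = A.transpose(2, 1, 0).reshape(a * c, d)        # (ac) x d
C_mat = A_mat @ B                                     # (ac) x b
C_ttgt = C_mat.reshape(a, c, b).transpose(0, 2, 1)    # back to (a, b, c)

assert np.allclose(C_ref, C_ttgt)
```

The explicit transposes cost extra memory traffic and workspace; TBLIS instead performs the permutations implicitly inside the GEMM packing and micro-kernel.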

SLIDE 14

C := AB + C

SLIDE 15

Outline

  • Background

– High-performance GEMM
– High-performance Strassen
– High-performance Tensor Contraction

  • Strassen’s Algorithm for Tensor Contraction
  • Performance Model
  • Experiments
  • Conclusion

SLIDE 16

Matrix vs. Tensor

Matrix Multiplication → BLAS/BLIS!
Tensor Contraction → TBLIS!

Devin A. Matthews. “High-Performance Tensor Contraction without Transposition.” Accepted in SISC.

SLIDE 17

Matrix vs. Tensor

Matrix Multiplication → BLAS/BLIS!
Tensor Contraction → TBLIS!

Devin A. Matthews. “High-Performance Tensor Contraction without Transposition.” Accepted in SISC.

SLIDE 18
Tensors As Matrices: Block-Scatter-Matrix View

  • Tensor: 8×2×4, with the “d” dimension stride-1 and the other dimensions at increasing strides (8, 16).

[Figure: the 64 elements (0–63) of the 8×2×4 tensor in memory, axes labeled c, d, a]

Devin A. Matthews. “High-Performance Tensor Contraction without Transposition.” Accepted in SISC.

SLIDE 19
Tensors As Matrices: Block-Scatter-Matrix View

  • Tensor: 8×2×4, with the “d” dimension stride-1 and the other dimensions at increasing strides (8, 16).
  • Matrix: 8×8 view of the same memory. The column “ac” dimension has stride 8×2 = 16; the row “d” dimension is stride-1 (i.e., A is row-major).
  • Scatter vectors (rscatA, cscatA) store the memory offset for each position in the rows and columns.
  • Block-scatter vectors (rbsA, cbsA) store the stride for each block, or zero for irregular blocks, so the micro-kernel can use:
    – vector load/store instructions for stride-1 blocks
    – vector gather/scatter instructions for stride-n blocks

[Figure: the 8×2×4 tensor and its 8×8 scatter-matrix view, with rscatA = (0, 1, …, 7), rbsA = (1, 1), cscatA = (0, 16, 32, 48, 8, 24, 40, 56), cbsA = (16, 16)]

Devin A. Matthews. “High-Performance Tensor Contraction without Transposition.” Accepted in SISC.
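A sketch of how the scatter and block-scatter vectors can be computed for the slide's 8×2×4 example (the helper names `scatter` and `block_scatter` are illustrative, not the TBLIS API):

```python
import numpy as np

def scatter(dims, strides):
    """Memory offset of every position of a merged group of tensor
    indices, listed outermost (slowest-varying) first."""
    offs = np.array([0])
    for dim, stride in zip(dims, strides):
        offs = (offs[:, None] + stride * np.arange(dim)[None, :]).ravel()
    return offs

def block_scatter(scat, b):
    """Per-block stride: the common spacing within each length-b block of
    the scatter vector, or 0 when the block is irregularly spaced."""
    out = []
    for i in range(0, len(scat), b):
        d = np.diff(scat[i:i+b])
        out.append(int(d[0]) if d.size and (d == d[0]).all() else 0)
    return out

# The slide's 8x2x4 tensor: d (dim 8, stride 1) indexes rows of the matrix
# view; the merged "ac" group (c: dim 2, stride 8; a: dim 4, stride 16)
# indexes columns.
rscatA = scatter((8,), (1,))          # 0, 1, ..., 7
cscatA = scatter((2, 4), (8, 16))     # 0, 16, 32, 48, 8, 24, 40, 56
rbsA = block_scatter(rscatA, 4)       # [1, 1]: unit stride -> vector load/store
cbsA = block_scatter(cscatA, 4)       # [16, 16]: fixed stride -> gather/scatter
```

A zero in a block-scatter vector marks a block whose elements are not equally spaced, forcing element-wise access in the packing routine.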

SLIDE 20

Strassen’s Algorithm for Tensor Contraction

[Figure: block-scatter-matrix views for the tensor contraction C(abc) += A(dca) · B(db): A viewed as an (ac) × d matrix, B as d × b, C as (ac) × b, each with its scatter (rscat, cscat) and block-scatter (rbs, cbs) vectors]

C += A × B

Jianyu Huang, Devin A. Matthews, and Robert A. van de Geijn. “Strassen’s Algorithm for Tensor Contraction.” arXiv:1704.03092 (2017).

SLIDE 21

Modifications to GEMM

Example: M0 := (A00+A11)(B00+B11); C00 += M0; C11 += M0;

  • Packing routines:
    – Implicit tensor-to-matrix permutations
    – Addition of submatrices of A and B
  • Micro-kernel:
    – Implicit matrix-to-tensor transformations
    – Scatter update of submatrices of C

No additional workspace for transposition (tensor contraction); no additional workspace for summation (Strassen).
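A naive sketch of a packing routine with both modifications fused: it gathers operands through scatter offsets (the implicit tensor-to-matrix permutation) while forming X + δY (the Strassen summation) in a single pass. Names and signatures are illustrative, not the actual packing interface:

```python
import numpy as np

def pack_add(mem_X, rscat_X, cscat_X, mem_Y, rscat_Y, cscat_Y, delta=1):
    """Pack the submatrix sum X + delta*Y into a contiguous buffer in one
    pass, reading each element from flat tensor memory at the address
    rscat[i] + cscat[j] given by its scatter vectors."""
    packed = np.empty((len(rscat_X), len(cscat_X)))
    for i, (rx, ry) in enumerate(zip(rscat_X, rscat_Y)):
        for j, (cx, cy) in enumerate(zip(cscat_X, cscat_Y)):
            packed[i, j] = mem_X[rx + cx] + delta * mem_Y[ry + cy]
    return packed

# Hypothetical use: 64-element tensor memory, d rows (stride 1) and merged
# "ac" columns (offsets as on the block-scatter slide).
mem = np.arange(64.0)
rscat = list(range(8))
cscat = [0, 16, 32, 48, 8, 24, 40, 56]
panel = pack_add(mem, rscat, cscat, mem, rscat, cscat, delta=-1)  # X - X = 0
```

Because the permutation and the summation happen inside a pass the packing must make anyway, neither costs an extra sweep over memory or a separate workspace.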

SLIDE 22

C += AB; M := (X+Y)(V+W); C += M; D += M;

[Figure: Strassen mapped onto the GEMM memory hierarchy — mR × nR micro-tile of C in registers, mC × kC packed block of A in the L2 cache, kC × nC packed panel of B in the L3 cache]

SLIDE 23

SLIDE 24
Variations on a theme

  • Naïve Strassen
  • AB Strassen
  • ABC Strassen

[Table: trade-offs among the three variants — which operands receive explicit temporaries versus fused treatment]

SLIDE 25

Outline

  • Background

– High-performance GEMM
– High-performance Strassen
– High-performance Tensor Contraction

  • Strassen’s Algorithm for Tensor Contraction
  • Performance Model
  • Experiments
  • Conclusion

SLIDE 26
Performance Model

  • Performance Metric
  • Total Time Breakdown: arithmetic operations + memory operations
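The model splits total time into an arithmetic term and a memory term. A minimal sketch of the arithmetic side and the reporting metric (the per-variant memory-operation counts belong to the full model and are omitted here; constants and function names are illustrative):

```python
def classical_flops(m, n, k):
    """Flop count of a classical m x n x k GEMM."""
    return 2.0 * m * n * k

def strassen_flops(m, n, k, levels=1):
    """Each level of Strassen replaces 8 half-size multiplies with 7: a
    7/8 reduction in multiply flops per level.  The extra submatrix
    additions show up in the model's memory term, not counted here."""
    return (7.0 / 8.0) ** levels * classical_flops(m, n, k)

def effective_gflops(m, n, k, seconds):
    """Effective GFLOPS: classical flop count over measured time, so a
    Strassen implementation can score above the machine's nominal peak."""
    return classical_flops(m, n, k) / seconds / 1e9
```

Reporting effective GFLOPS (rather than actual flops performed) makes classical GEMM and all Strassen variants directly comparable on one axis.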

SLIDE 27

SLIDE 28

SLIDE 29

Outline

  • Background

– High-performance GEMM
– High-performance Strassen
– High-performance Tensor Contraction

  • Strassen’s Algorithm for Tensor Contraction
  • Performance Model
  • Experiments
  • Conclusion

SLIDE 30

Real-world Benchmark

[Equation: notation used to denote the benchmark tensor contractions]

Paul Springer and Paolo Bientinesi. “Design of a High-performance GEMM-like Tensor-Tensor Multiplication.” arXiv:1607.00145 (2016).


Intel Xeon E5-2680 v2 (Ivy Bridge, 10 cores/socket)

SLIDE 31

Outline

  • Background

– High-performance GEMM
– High-performance Strassen
– High-performance Tensor Contraction

  • Strassen’s Algorithm for Tensor Contraction
  • Performance Model
  • Experiments
  • Conclusion

SLIDE 32

Conclusion

  • First work to leverage Strassen’s algorithm for tensor contraction.
  • Fusing matrix summation and permutations with packing and micro-kernel operations inside GEMM.
  • Avoiding explicit transpositions and extra workspace, and reducing the overhead of memory movement.
  • Achieving ~1.3x speedup on single-core and multicore parallel architectures.

SLIDE 33

Acknowledgement

  • NSF grants ACI-1550493, CCF-1714091.
  • A gift from Qualcomm Foundation.
  • Intel Corporation through an Intel Parallel Computing Center (IPCC).
  • Access to the Maverick and Stampede supercomputers administered by TACC.

We thank Martin Schatz for his help with distributed memory implementations, and the rest of the SHPC team (http://shpc.ices.utexas.edu) for their support. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

SLIDE 34

The source code can be downloaded from: https://github.com/flame/tblis-strassen
