
An Input-Adaptive and In-Place Approach to Dense Tensor-Times-Matrix Multiply
Jiajia Li, Casey Battaglino, Ioakeim Perros, Jimeng Sun, Richard Vuduc
Computational Science & Engineering, Georgia Institute of Technology
19 Nov 2015


1. Title: An Input-Adaptive and In-Place Approach to Dense Tensor-Times-Matrix Multiply. Jiajia Li, Casey Battaglino, Ioakeim Perros, Jimeng Sun, Richard Vuduc. Computational Science & Engineering, Georgia Institute of Technology. 19 Nov 2015.

2. The problem: tensor-times-matrix multiply (Ttm), Y = X ×_n U. [Figure: an I × J × K tensor X is multiplied along mode n by an F × I_n matrix U to produce the tensor Y.]

3. The problem: Ttm is commonly implemented by transforming (matricizing) X into X_(n), so that Y_(n) = U X_(n) becomes a Gemm, and then transforming Y_(n) back into the tensor Y. [Figure: X (I × J × K) is unfolded into the I × JK matrix X_(n), multiplied by the F × I matrix U, and the F × JK result Y_(n) is folded back into Y.]

4. The problem: the transformation steps are costly, accounting for about 70% of the running time and 50% of the space. We propose an in-place Ttm algorithm and employ an auto-tuning method to adapt its parameters.

5. Outline: Background; Motivation; InTensLi Framework; Experiments and Analysis; Conclusion.

6. Background: tensors and applications. A tensor is a multi-dimensional array, e.g. X ∈ R^(I×J×K). Special cases: vectors (x) are 1-D tensors and matrices (A) are 2-D tensors. The tensor dimension (N) is also called the mode or order. We focus on dense tensors in this work. Applications: quantum chemistry, quantum physics, signal and image processing, neuroscience, and data analytics. [Figure: a third-order (three-dimensional) I × J × K tensor, indexed by i = 1,...,I, j = 1,...,J, k = 1,...,K.]

7. Background: tensor representations. Sub-tensors include slices (horizontal X(i,:,:), lateral X(:,j,:), frontal X(:,:,k)), fibers (column X(:,j,k), row X(i,:,k), tube X(i,j,:)), and the whole tensor X. [Figure: a 2 × 2 × 2 tensor with entries 1-8 and its mode-2 matricization X_(2), a J × IK = 2 × 4 matrix; tensorization is the inverse operation.] Different representations → different algorithms → different performance.
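
To make matricization and tensorization concrete, here is a minimal NumPy sketch (an illustration, not the authors' code); it uses the convention of moving mode n to the front and flattening the remaining modes in row-major order, so the exact column ordering may differ from the paper's convention.

```python
import numpy as np

def matricize(X, n):
    """Mode-n unfolding: move mode n to the front and flatten the rest."""
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

def tensorize(Xn, shape, n):
    """Inverse of matricize: refold a mode-n unfolding into a tensor of `shape`."""
    rest = [s for i, s in enumerate(shape) if i != n]
    return np.moveaxis(Xn.reshape([shape[n]] + rest), 0, n)

X = np.arange(1, 9).reshape(2, 2, 2)      # a 2 x 2 x 2 tensor with entries 1..8
X2 = matricize(X, 1)                      # its 2 x 4 mode-2 unfolding X_(2)
assert np.array_equal(tensorize(X2, X.shape, 1), X)
```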

8. Background: memory mapping. A tensor is organized logically as a multi-dimensional array but stored physically in linear memory, with a mapping function connecting the two [1]. Row-major order (leading dimension: K) maps the modes in the order K -> J -> I; column-major order (leading dimension: I) maps them I -> J -> K. [Figure: the eight entries of a 2 × 2 × 2 tensor laid out linearly under the two orders.] LDim: leading dimension. [1] Garcia, R. and Lumsdaine, A. MultiArray: a C++ library for generic programming with arrays. Software: Practice and Experience 35 (2004), 159-188.
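
A minimal sketch of such a mapping function (illustrative, not taken from the reference): it converts a multi-index (i, j, k) into a linear offset under the two layouts.

```python
def row_major_offset(idx, dims):
    """Linear offset when the last index varies fastest (C order)."""
    off = 0
    for i, d in zip(idx, dims):
        off = off * d + i
    return off

def col_major_offset(idx, dims):
    """Linear offset when the first index varies fastest (Fortran order)."""
    off = 0
    for i, d in zip(reversed(idx), reversed(dims)):
        off = off * d + i
    return off

# Element (i, j, k) = (0, 1, 1) of a 2 x 2 x 2 tensor:
print(row_major_offset((0, 1, 1), (2, 2, 2)))   # 3
print(col_major_offset((0, 1, 1), (2, 2, 2)))   # 6
```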

9. Background: tensor operations. Matricization, also known as unfolding or flattening. The mode-n product, also known as tensor-times-matrix multiply (Ttm): Y = X ×_n U, computed on the unfolded tensor as Y_(n) = U X_(n). [Figure: Ttm on mode 1 of an I × J × K tensor; X_(1) is I × JK, U is F × I, and Y_(1) = U X_(1) is F × JK.] Other operations include tensor contraction, the Kronecker product, and the matricized tensor times Khatri-Rao product (MTTKRP).

10. Background: Ttm algorithm. The baseline Ttm algorithm in the Tensor Toolbox and the Cyclops Tensor Framework (Ctf): given the input X, matricize it to X_(n), perform the multiplication Y_(n) = U X_(n), and tensorize Y_(n) back into the output Y. Ttm applications: low-rank tensor decomposition, notably Tucker decomposition, e.g. the Tucker-HOOI algorithm, which computes Y = X ×_1 A^(1)T ··· ×_(n-1) A^(n-1)T ×_(n+1) A^(n+1)T ··· ×_N A^(N)T.
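
A minimal NumPy sketch of this out-of-place baseline (matricize, one large Gemm, tensorize); it illustrates the scheme, not the actual Tensor Toolbox or Ctf implementation.

```python
import numpy as np

def ttm_baseline(X, U, n):
    """Out-of-place mode-n TTM, Y = X x_n U, via an explicit unfolding."""
    Xn = np.moveaxis(X, n, 0).reshape(X.shape[n], -1)   # transformation (generally copies X)
    Yn = U @ Xn                                         # one large GEMM: Y_(n) = U X_(n)
    rest = [s for i, s in enumerate(X.shape) if i != n]
    return np.moveaxis(Yn.reshape([U.shape[0]] + rest), 0, n)   # transform back

X = np.random.rand(30, 40, 50)
U = np.random.rand(16, 40)          # F = 16, multiply along mode n = 1
Y = ttm_baseline(X, U, 1)           # Y.shape == (30, 16, 50)
```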

11. Background: main contributions. Proposed an in-place tensor-times-matrix multiply (InTtm) algorithm that avoids physically reorganizing the tensor. Built an input-adaptive framework, InTensLi, that automatically adapts the parameters and generates the code. Achieved 4x and 13x speedups over the state-of-the-art Tensor Toolbox and Ctf tools.

12. Motivation. Observation 1: the transformation is expensive. Notation: the number of words moved (Q), floating-point operations (W), and the last-level cache size (Z). For both general matrix-matrix multiply (Gemm) and Ttm, Q ≥ W/√(8Z) − Z [2]. Supposing Ttm does the same number of flops as Gemm (Ŵ = W), counting the transformation gives the arithmetic-intensity relation Â ≈ A / (1 + A_m), where (1 + A_m) is the penalty. Assuming a cache size Z of 8 MB, the penalty for a 3-D tensor is 33. Conclusion: when Ttm and Gemm do the same number of flops, the arithmetic intensity of Ttm is reduced by a penalty of 33 or more, which grows with the tensor dimension. [2] G. Ballard, E. Carson, J. Demmel, M. Hoemmen, N. Knight, and O. Schwartz. Communication lower bounds and optimal algorithms for numerical linear algebra. Acta Numerica, 23:1-155, 2014.

13. Motivation. Observation 2: the performance of the multiplication in Ttm is far below peak. The Ttm algorithm involves a variety of rectangular problem sizes: the multiplication pairs a tiny matrix U (m = F, e.g. 16) with a short-and-fat matricized tensor X_(n), where k = I_n and n = I_1···I_(n-1) I_(n+1)···I_N. [Figure (a): the shape of Ttm's multiplication; (b): Gemm performance in Intel MKL with 4 threads, as a heat map over log2(n) and log2(k).]
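
To make the shapes concrete, a small illustrative calculation (the tensor sizes below are assumptions, not taken from the paper):

```python
from functools import reduce
from operator import mul

dims = (500, 500, 500)       # a hypothetical I1 x I2 x I3 tensor
n, F = 1, 16                 # multiply along mode n by an F x I_n matrix U

m = F                                                           # tiny:  16
k = dims[n]                                                     # short: 500
cols = reduce(mul, (d for i, d in enumerate(dims) if i != n))   # fat:   250,000
print(f"GEMM shape: ({m} x {k}) * ({k} x {cols})")
```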

14. Motivation. Observation 3: the organization of Ttm is critical to data locality. There are many ways to organize the data accesses. [Figure: multiplying U against slices of X, e.g. the frontal slices X(:,:,k) versus the lateral slices X(:,j,:); one organization touches contiguous memory, the other non-contiguous memory.]

15. Motivation. Observation 3 (cont.): choose the slice representation. Table 1: different representation forms of mode-1 Ttm on an I × J × K tensor.
- Tensor representation: Y = X ×_1 U; BLAS level: none; transformation: none.
- Matrix representation: Y_(1) = U X_(1); BLAS level 3; transformation: yes (full reorganization).
- Fiber representation: y(f,:,k) = X(:,:,k) u(f,:), with loops over k = 1,...,K and f = 1,...,F; BLAS level 2; transformation: no (sub-tensor extraction only).
- Slice representation: Y(:,:,k) = U X(:,:,k), with loops over k = 1,...,K; BLAS level 3; transformation: no.
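
A minimal NumPy sketch of the slice representation for mode 1 (an illustration; which slices are contiguous depends on the memory layout, which the framework takes into account):

```python
import numpy as np

def ttm_mode1_slices(X, U):
    """Mode-1 TTM via slices: Y(:, :, k) = U @ X(:, :, k), one BLAS-3 call per k."""
    I, J, K = X.shape
    Y = np.empty((U.shape[0], J, K))
    for k in range(K):                 # no unfolding copy, only sub-tensor views
        Y[:, :, k] = U @ X[:, :, k]
    return Y

X = np.random.rand(30, 40, 50)
U = np.random.rand(16, 30)
Y = ttm_mode1_slices(X, U)             # Y.shape == (16, 40, 50)
```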

16. InTensLi Framework: Algorithmic Strategy. Layout: 1. Background; 2. Motivation; 3. InTensLi Framework (Algorithmic Strategy, InTensLi Framework); 4. Experiments and Analysis; 5. Conclusion; 6. References.

17. InTensLi Framework: algorithmic strategy. [Figure: the n−1 modes before mode n (I_1 ... I_(n-1)) and the N−n modes after it (I_(n+1) ... I_N) can each be grouped, backward or forward, into a sub-tensor X_sub.] To avoid data copies, the rules are: 1) compress only contiguous dimensions; 2) always include the leading dimension. Lemma: Ttm can be performed on up to max{n−1, N−n} contiguous dimensions without physical reorganization.
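
A small NumPy check of the lemma's no-copy claim (this assumes row-major storage; it is an illustration, not the paper's code): grouping contiguous trailing modes is just a reshape of the existing buffer.

```python
import numpy as np

X = np.random.rand(30, 40, 50)        # a row-major (C-order) I1 x I2 x I3 tensor
X_grouped = X.reshape(30, 40 * 50)    # group the two modes after mode 1 into one dimension
assert X_grouped.base is X            # a view: same storage, no physical reorganization

U = np.random.rand(16, 30)
Y_grouped = U @ X_grouped             # mode-1 TTM becomes a single GEMM on the view
Y = Y_grouped.reshape(16, 40, 50)
```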

18. InTensLi Framework: algorithmic strategy (cont.). To get high Gemm performance: find an appropriate matrix size for the computer architecture, using the auto-tuning method in the InTensLi framework.

19. InTensLi Framework: InTtm algorithm and comparison. InTtm moves roughly the same number of words as Gemm, so its arithmetic intensity is ≈ √(8Z) ≈ A; the traditional Ttm's arithmetic intensity is Â ≈ A / (1 + A_m). InTtm thus eliminates the (1 + A_m) penalty. [The slide lists the in-place tensor-times-matrix multiply (InTtm) algorithm.]
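
A simplified NumPy sketch of an in-place-style mode-n Ttm in this spirit (row-major tensor assumed; the real InTtm additionally partitions the modes into loop and computation groups and tunes that choice, which this sketch fixes for clarity):

```python
import numpy as np
from itertools import product

def inttm_sketch(X, U, n):
    """Loop over the modes before n; each iteration runs one GEMM on a
    contiguous (I_n x I_{n+1}...I_N) sub-matrix viewed without any copy."""
    dims, F = X.shape, U.shape[0]
    tail = dims[n + 1:]
    Y = np.empty(dims[:n] + (F,) + tail)
    for idx in product(*(range(d) for d in dims[:n])):    # the "loop" modes
        Xsub = X[idx].reshape(dims[n], -1)                # a view, no reorganization
        Y[idx] = (U @ Xsub).reshape((F,) + tail)          # Y_sub = U X_sub
    return Y

X = np.random.rand(20, 30, 40, 50)
U = np.random.rand(16, 40)
Y = inttm_sketch(X, U, 2)              # Y.shape == (20, 30, 16, 50)
```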

21. InTensLi Framework. Layout: 1. Background; 2. Motivation; 3. InTensLi Framework (Algorithmic Strategy, InTensLi Framework); 4. Experiments and Analysis; 5. Conclusion; 6. References.

22. InTensLi Framework. Input: tensor features, hardware configuration, and an MM (matrix-multiply) benchmark. Parameter estimation: mode partitioning into M_L and M_C; thread allocation into P_L and P_C. Code generation. [Figure: the InTensLi framework. A parameter estimator takes the input tensor, the mode n, the data layout, hardware parameters (e.g. the maximum number of threads), the MM benchmark, and thresholds, and decides the mode partition (M_L, M_C) and the thread allocation (P_L, P_C); a code generator then emits the InTtm code: nested parallel loops (parfor i1 = 1:I1; parfor i2 = 1:I2; ...) around a matrix multiplication Y_sub = U X_sub or Y_sub = X_sub U', backed by MM libraries such as MKL or BLIS.]
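
As a rough illustration of the mode-partitioning decision (the threshold heuristic below is a hypothetical stand-in, not the paper's actual parameter estimator): keep enough trailing modes inside the Gemm that it is large enough to run efficiently, and leave the rest as parallelizable loop modes.

```python
def partition_modes(dims, n, min_gemm_cols=4096):
    """Split the modes other than n into computation modes M_C (folded into the
    GEMM's column dimension, row-major case) and loop modes M_L (parallel loops).
    The threshold min_gemm_cols is a hypothetical tuning parameter."""
    cols, split = 1, len(dims)
    for m in range(len(dims) - 1, n, -1):   # grow the GEMM from the innermost mode
        cols *= dims[m]
        split = m
        if cols >= min_gemm_cols:
            break
    M_C = list(range(split, len(dims)))     # modes computed inside the GEMM
    M_L = [m for m in range(len(dims)) if m != n and m not in M_C]
    return M_L, M_C

print(partition_modes((100, 100, 100, 100), n=1))   # -> ([0], [2, 3])
```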
