A Sparse Tensor Format and a Benchmark Suite — Jiajia Li, Pacific Northwest National Laboratory — January 25, 2019 @ MIT
  1. A Sparse Tensor Format and a Benchmark Suite. Jiajia Li, Pacific Northwest National Laboratory. January 25, 2019 @ MIT. Figure sources: “A brief survey of tensors” by Berton Earnshaw and NVIDIA Tensor Cores.

  2. HiCOO: Hierarchical Storage of Sparse Tensors. Jiajia Li 1,2, Jimeng Sun 1, Richard Vuduc 1. 1 Georgia Institute of Technology, 2 Pacific Northwest National Laboratory. SUNLAB. Code: https://github.com/hpcgarage/ParTI (v1.0.0). Figure sources: “A brief survey of tensors” by Berton Earnshaw and NVIDIA Tensor Cores.

  3. Challenges. Compactness: a space-efficient data structure. Mode-genericity: efficient traversals of the data structure for computations. The concept of “mode-genericity” is inherited from [Baskaran et al. 2012]: M. Baskaran et al., “Efficient and scalable computations with sparse tensors,” HPEC 2012.

  4. Baseline Sparse Tensor Formats in This Work. COO: coordinate format [Bader et al. 2006]. CSF: compressed sparse fiber, an extension of CSR [Smith et al. 2015]. F-COO: flagged COO format [Liu et al. 2017]. [Figure: (a) COO, (b) CSF, and (c) F-COO layouts of an example third-order tensor with 8 nonzeros. COO is mode-generic; CSF and F-COO are mode-specific and prefer different representations for different modes.]
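The coordinate idea behind COO can be sketched in a few lines of Python (a hypothetical illustration, not the ParTI implementation): every nonzero stores its full (i, j, k) coordinate alongside its value.

```python
import numpy as np

def dense_to_coo(T):
    """List every nonzero of a dense tensor as (coordinate tuple, value)."""
    coords = np.argwhere(T != 0)  # one row of indices per nonzero
    return [(tuple(int(x) for x in c), float(T[tuple(c)])) for c in coords]

T = np.zeros((4, 4, 3))
T[0, 0, 0], T[3, 3, 2] = 1.0, 8.0  # two of the example's nonzeros
print(dense_to_coo(T))             # [((0, 0, 0), 1.0), ((3, 3, 2), 8.0)]
```

Because no ordering or pointer structure is imposed, the same COO list serves computations in any mode, which is exactly the mode-genericity the slide contrasts with CSF and F-COO.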

  5. Mode-Specific Tensor Formats. Three CSF/F-COO representations are required (or preferred) to run the tensor-decomposition kernels in all three modes. [Figure: the example tensor stored as CSF-1, CSF-2, and CSF-3, each oriented for the kernel in mode 1, 2, and 3, respectively; each orientation reorders the (i, j, k, val) entries.]
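A mode-specific format fixes one traversal order, so each mode needs its own copy of the tensor. The re-sorting this implies can be sketched in Python (an illustrative example with made-up nonzeros, not the actual CSF construction):

```python
# Each nonzero is an (i, j, k, val) tuple; sort so the kernel's mode varies slowest.
nnz = [(1, 0, 2, 4.0), (0, 1, 0, 2.0), (0, 0, 0, 1.0), (1, 0, 0, 3.0)]

mode1_order = sorted(nnz, key=lambda t: (t[0], t[1], t[2]))  # for the mode-1 kernel
mode2_order = sorted(nnz, key=lambda t: (t[1], t[0], t[2]))  # for the mode-2 kernel
mode3_order = sorted(nnz, key=lambda t: (t[2], t[0], t[1]))  # for the mode-3 kernel

print(mode1_order[0])  # (0, 0, 0, 1.0)
```

Building and storing all three sorted copies (plus their fiber pointers, in real CSF) is what the mode-generic HiCOO format later avoids.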

  6. Mode-Specific Tensor Formats (cont.). With only the mode-1-oriented CSF representation, performance drops for the kernels in mode 2 and mode 3. [Figure: the mode-1 kernel runs on CSF-1 as before, while the mode-2 and mode-3 kernels are mismatched to its layout.]

  7. Mode Orientation. For the three kernels of tensor decomposition, a mode-1-oriented mode-specific format (CSF/F-COO) is efficient only for the mode-1 kernel and inefficient for the mode-2 and mode-3 kernels, whereas the mode-generic formats, coordinate (COO) and HiCOO, serve all three kernels.

  8. HiCOO Format. Store a sparse tensor in units of small sparse blocks (block size 2×2×2 in the example). HiCOO is an extension of the Compressed Sparse Blocks (CSB) format for sparse matrices [Buluc et al., SPAA 2009]. [Figure: the example tensor's COO listing (i, j, k, val) beside its HiCOO layout, where bptr points to the start of each of the four nonzero blocks B1–B4, (bi, bj, bk) are block indices, and (ei, ej, ek) are element indices within a block.]

  9. HiCOO Format. Store a sparse tensor in units of small sparse blocks, and shorten the bit-length of the element indices: block indices (bi, bj, bk) use 32 bits each, while the per-nonzero element indices (ei, ej, ek) need only 8 bits each. A full index is recovered as i = bi * B + ei, where B is the block size. [Figure: the same COO-to-HiCOO example, annotated with the index bit-widths.]
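The index split on this slide can be checked with a tiny Python sketch (variable names here are illustrative, not from the ParTI code):

```python
B = 2  # block size per mode, matching the slide's 2*2*2 blocks

def split_index(i, B=B):
    """Split a full index into a (block index, element index) pair."""
    return i // B, i % B  # bi is stored once per block; ei needs only 8 bits

bi, ei = split_index(3)
print(bi, ei)            # 1 1
assert 3 == bi * B + ei  # recovers i = bi * B + ei
```

With 8-bit element indices, B can be up to 256 per mode; larger tensors simply get more blocks, which is why the block indices keep their full 32 bits.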

  10. HiCOO Format. Store a sparse tensor in units of small sparse blocks, shorten the bit-length of element indices, and compress the number of block indices: each block's (bi, bj, bk) is stored once for the whole block, with bptr recording where that block's nonzeros begin, instead of once per nonzero. [Figure: the same COO-to-HiCOO example.]

  11. HiCOO Format. Comparing index storage for the two formats: COO indices take nnz * 3 * 32 bits, while HiCOO indices take nnz * 3 * 8 + nnb * (3 * 32 + 32) bits, where nnz is the number of nonzeros, nnb is the number of nonzero blocks, and full indices are recovered as i = bi * B + ei. [Figure: the same COO-to-HiCOO example.]
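Plugging the example's counts (nnz = 8 nonzeros falling into nnb = 4 blocks) into these formulas gives a quick arithmetic check (an illustrative Python sketch):

```python
def coo_index_bits(nnz):
    return nnz * 3 * 32  # three 32-bit indices per nonzero

def hicoo_index_bits(nnz, nnb):
    # three 8-bit element indices per nonzero, plus three 32-bit block
    # indices and one 32-bit bptr entry per nonzero block
    return nnz * 3 * 8 + nnb * (3 * 32 + 32)

print(coo_index_bits(8))       # 768
print(hicoo_index_bits(8, 4))  # 704
```

The saving grows with nonzeros per block: for the same 4 blocks holding 100 nonzeros, COO would need 9600 bits of index storage against HiCOO's 2912.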

  12. HiCOO Format. Store a sparse tensor in units of small sparse blocks, shorten the bit-length of element indices, and compress the number of block indices, for arbitrary-order sparse tensors. For tensors, this reduces storage and memory footprint; for matrices, it yields better data locality. [Figure: the same COO-to-HiCOO example.]

  13. Platform and Dataset. Platform: an Intel Xeon E7-4850 v3 system with 56 physical cores, compiled with icc 18.0.2 and parallelized with OpenMP. Dataset: FROSTT [Smith et al. 2017], HaTen2 [Jeon et al. 2015], and healthcare data [Perros et al. 2017].

  14. Multicore CP-ALS. HiCOO outperforms COO by 6.2× and CSF by 2.1× on average. [Figure: for 3D and 4D tensors (choa, crime, darpa, fb-m, fb-s, nips, nell1, nell2, flickr, enron, deli, deli4d), speedup over CSF plotted against compression ratio relative to CSF, higher is better on both axes; the HiCOO points dominate those of CSF-1 and COO.]

  15. Following Work. HiCOO for other tensor operations and for Tucker decomposition; HiCOO-based MTTKRP/CPD on GPUs and distributed systems.

  16. PASTA: A Parallel Sparse Tensor Algorithm Benchmark Suite. Jiajia Li 1, Yuchen Ma 2, Xiaolong Wu 3, Ang Li 1, Kevin Barker 1. 1 Pacific Northwest National Laboratory, 2 Hangzhou Dianzi University, 3 Virginia Tech. Code: https://gitlab.com/tensorworld/pasta. Figure sources: “A brief survey of tensors” by Berton Earnshaw and NVIDIA Tensor Cores.

  17. PASTA Workloads. Kernels: TEW (element-wise tensor operations), TS (tensor-scalar operations), TTV (tensor-times-vector), TTM (tensor-times-matrix), and MTTKRP (matricized tensor-times-Khatri-Rao product). Data structures/algorithms: COO. Platforms: single-core and multi-core CPUs.

  18. PASTA Workloads (cont.). The benchmark tensors have arbitrary shapes and nonuniform nonzero patterns, over the same kernel and platform matrix.

  19. PASTA Workloads (cont.). Parallelization strategies for the COO kernels: parallelize nonzeros, parallelize nonzeros with atomics, parallelize nonzero fibers, and parallelize nonzero partitions.
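As an illustration of the MTTKRP kernel these workloads benchmark, here is a minimal serial COO-based mode-1 MTTKRP in Python (a sketch with hypothetical names, not the PASTA code; the parallel variants would split the nonzero loop as described above):

```python
import numpy as np

def mttkrp_coo_mode1(inds, vals, B, C, I):
    """Mode-1 MTTKRP on a COO tensor: M[i, :] += val * (B[j, :] * C[k, :])."""
    M = np.zeros((I, B.shape[1]))
    for (i, j, k), v in zip(inds, vals):
        # when nonzeros are parallelized, this row update needs atomics
        M[i, :] += v * B[j, :] * C[k, :]
    return M

inds = [(0, 0, 0), (1, 1, 0)]  # two nonzeros of a tiny 2x2x1 tensor
vals = [2.0, 3.0]
B = np.eye(2)                  # two columns = rank-2 factor matrices
C = np.ones((1, 2))
print(mttkrp_coo_mode1(inds, vals, B, C, I=2))
# [[2. 0.]
#  [0. 3.]]
```

TEW and TS touch each nonzero independently, which is why they parallelize over plain nonzeros, while MTTKRP's scattered row updates force atomics or nonzero partitioning.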

  20. Memory-Bound Workloads.

  21. Following Work. Include HiCOO, CSF, and other formats; support GPUs and, in the longer term, FPGAs.

  22. Other Recent Work. A dynamic sparse tensor structure for tensor contraction • Collaborators: Sriram Krishnamoorthy (PNNL) • Application: quantum chemistry, NWChemEx. Hybrid formats and nonzero partitioning strategies • Collaborators: Israt Nisa (OSU), P. (Saday) Sadayappan (OSU), Sriram Krishnamoorthy (PNNL).

  23. Acknowledgement.
