Sparse Matrix-Matrix Multiplication for Modern Manycore Architectures

Sep 16, 2022

SLIDE 1

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Sparse Matrix-Matrix Multiplication for Modern Manycore Architectures

Mehmet Deveci, Erik Boman, Siva Rajamanickam

SLIDE 2

Problem

▪ SPGEMM: fundamental block for
  ▪ Algebraic multigrid
  ▪ Various graph analytics problems: clustering, betweenness centrality…
▪ Extra irregularity: nnz of C is unknown beforehand

SLIDE 3

Background

▪ Distributed algorithms:
  ▪ 1D: Trilinos
  ▪ 2D: Combinatorial BLAS [Buluç 12]
  ▪ 3D: [Azad 15]
  ▪ Hypergraph-based: [Akbudak 14], [Ballard 16]
▪ Most of the shared-memory algorithms are based on the 1D Gustavson algorithm [Gustavson 78]
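A minimal sketch of the row-wise (1D Gustavson-style) product C = A x B on CSR arrays, added here only to illustrate the algorithm named above. The function name, the CSR argument layout, and the dictionary used as a sparse accumulator are assumptions made for this example; they are not taken from any of the cited libraries.

```python
def spgemm_gustavson(a_ptr, a_idx, a_val, b_ptr, b_idx, b_val):
    """Row-wise C = A * B for CSR inputs (hypothetical helper)."""
    n_rows = len(a_ptr) - 1
    c_ptr, c_idx, c_val = [0], [], []
    for i in range(n_rows):
        acc = {}  # sparse accumulator for row i of C
        for p in range(a_ptr[i], a_ptr[i + 1]):
            k, a_ik = a_idx[p], a_val[p]
            # Row i of C is a linear combination of the rows k of B.
            for q in range(b_ptr[k], b_ptr[k + 1]):
                j = b_idx[q]
                acc[j] = acc.get(j, 0.0) + a_ik * b_val[q]
        for j in sorted(acc):
            c_idx.append(j)
            c_val.append(acc[j])
        c_ptr.append(len(c_idx))  # nnz of C is only known after each row is done
    return c_ptr, c_idx, c_val


# A = [[1, 0], [2, 3]], B = [[0, 4], [5, 0]]
c_ptr, c_idx, c_val = spgemm_gustavson(
    [0, 1, 3], [0, 0, 1], [1.0, 2.0, 3.0],
    [0, 1, 2], [1, 0], [4.0, 5.0],
)
```

Each row of C depends only on one row of A, which is why this formulation parallelizes naturally over rows.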

SLIDE 4

Background

▪ Multi-threaded algorithms:
  ▪ Dense accumulator (with B column partitions) [Patwary 15]
  ▪ Sparse heap accumulators: ViennaCL, CombBLAS
  ▪ Sparse accumulators: MKL
▪ GPUs:
  ▪ CUSP [Dalton 15]: 3D - outer product (O(FLOPS) memory)
  ▪ Hierarchical: cuSPARSE, bhSparse [Liu 14]
▪ Aim: Portable methods for GPUs and massively-threaded architectures using Kokkos
  ▪ C++ templated library
  ▪ Abstracting execution, memory spaces, and data layouts
  ▪ Contact: Carter Edwards hcedwar@sandia.gov

SLIDE 5

Portable SPGEMM Method

▪ 2-phase: symbolic (calculate #nnz), then numeric (actual flops)
  ▪ Over-allocation is expensive, and dynamic growth is not suitable on GPUs. Estimations [Cohen 98] are still not an upper bound.
  ▪ It is common in scientific computing for the multiplication to be repeated for different numeric values with the same symbolic structure
▪ Speed up symbolic with compression:
  ▪ Symbolic phase performs unions on rows, which consist of binary relations
  ▪ Compress the rows of B: O(nnz(B)) using 2 integers.
    ▪ Column Set Index (CSI): the index of the column set
    ▪ Column Set (CS): the bits represent the existence of a column
  ▪ Symbolic complexity: O(FLOPS) -> on average ~O(avgdeg(A) x nnz(B))
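The compression idea above can be sketched as follows. This is an illustration under stated assumptions, not the KokkosKernels implementation: the 32-bit column-set width (`WORD_BITS`), the function names, and the use of Python dictionaries for the unions are all hypothetical choices.

```python
WORD_BITS = 32  # assumed width of one column-set bitmask


def compress_rows(b_ptr, b_idx):
    """Compress each CSR row of B into sorted (CSI, CS) integer pairs."""
    rows = []
    for k in range(len(b_ptr) - 1):
        sets = {}
        for q in range(b_ptr[k], b_ptr[k + 1]):
            col = b_idx[q]
            csi = col // WORD_BITS  # which column set the column falls in
            sets[csi] = sets.get(csi, 0) | (1 << (col % WORD_BITS))
        rows.append(sorted(sets.items()))
    return rows


def symbolic_row_nnz(a_ptr, a_idx, compressed_b):
    """Count nnz per row of C = A * B by OR-ing compressed rows of B."""
    counts = []
    for i in range(len(a_ptr) - 1):
        union = {}
        for p in range(a_ptr[i], a_ptr[i + 1]):
            for csi, cs in compressed_b[a_idx[p]]:
                union[csi] = union.get(csi, 0) | cs  # one OR merges up to 32 columns
        counts.append(sum(bin(cs).count("1") for cs in union.values()))
    return counts


# B row 0 has columns {0, 3, 33}; B row 1 has columns {3, 35}.
compressed = compress_rows([0, 3, 5], [0, 3, 33, 3, 35])
# A row 0 references B rows 0 and 1; A row 1 references B row 1.
row_nnz = symbolic_row_nnz([0, 2, 3], [0, 1, 1], compressed)
```

The symbolic phase touches one (CSI, CS) pair per occupied column set instead of one entry per nonzero, which is where the saving comes from when columns of B cluster into the same sets.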

SLIDE 6

KokkosKernels (KK) - SPGEMM

▪ Each team works on a bunch of rows of C (or A)
  ▪ Team: thread block (GPU), group of hyper-threads in a core (CPU)
▪ Each worker in a team works on consecutive rows of C
  ▪ Worker: warp (GPU), hyper-thread (CPU)
  ▪ More coalesced access on GPUs
  ▪ Better L1-cache usage on CPUs
▪ Each vector lane in a worker works on a different multiplication within a row
  ▪ Vector lane: threads in a warp (GPU), vector units (CPU)
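One way to picture the three-level hierarchy is as an index mapping from rows and multiplications to (team, worker, vector lane). The sketch below is only an illustration: the chunked split of a team's rows among its workers and the round-robin split of multiplications among lanes are assumptions, and the function names are hypothetical rather than Kokkos API calls.

```python
def assign_rows(n_rows, rows_per_team, workers_per_team):
    """Map each row of C to a (team, worker) pair; rows stay consecutive per worker."""
    rows_per_worker = -(-rows_per_team // workers_per_team)  # ceiling division
    assignment = {}
    for row in range(n_rows):
        team = row // rows_per_team
        worker = (row % rows_per_team) // rows_per_worker
        assignment.setdefault((team, worker), []).append(row)
    return assignment


def lane_products(a_row_cols, b_row_cols, lanes):
    """Split the multiplications of one row of C across vector lanes."""
    products = [(k, j) for k in a_row_cols for j in b_row_cols[k]]
    return [products[lane::lanes] for lane in range(lanes)]


rows = assign_rows(n_rows=8, rows_per_team=4, workers_per_team=2)
lanes = lane_products([0, 1], {0: [0, 2], 1: [1]}, lanes=2)
```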

SLIDE 7

KK - SPGEMM

▪ Implemented 4 methods
  ▪ KKMEM: memory efficient
    ▪ Uses sparse hashmap accumulators and memory pools
  ▪ KKSPEED:
    ▪ Dense accumulators on CPU
  ▪ KKMCR
    ▪ Graph coloring variant 1
  ▪ KKMCW
    ▪ Graph coloring variant 2

SLIDE 8

KKMEM

▪ Hierarchical 1D Gustavson algorithm
▪ Features to make it thread scalable
  ▪ 2-level hashmap accumulator:
    ▪ 1st level uses scratch space:
      ▪ GPUs: shared memory
      ▪ Small memory that will fit in L1 cache on CPUs
    ▪ 2nd level goes to global memory
  ▪ Memory pool:
    ▪ Only some of the workers need the 2nd-level hashmap.
    ▪ Request memory from the memory pool.
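A rough sketch of a two-level accumulator for one row of C, assuming a small fixed-capacity first level with linear probing and a second level that is allocated only on overflow. The class name, the probing scheme, and the Python dictionary standing in for pool-backed global memory are assumptions for illustration; the slides do not specify these details.

```python
class TwoLevelAccumulator:
    """Hypothetical accumulator: small first level, lazily allocated second level."""

    def __init__(self, l1_capacity):
        self.l1_keys = [None] * l1_capacity  # stands in for scratch / shared memory
        self.l1_vals = [0.0] * l1_capacity
        self.l2 = None  # stands in for memory requested from a pool on demand

    def add(self, col, value):
        cap = len(self.l1_keys)
        slot = col % cap
        for _ in range(cap):  # linear probing in the first level
            if self.l1_keys[slot] == col:
                self.l1_vals[slot] += value
                return
            if self.l1_keys[slot] is None:
                self.l1_keys[slot] = col
                self.l1_vals[slot] = value
                return
            slot = (slot + 1) % cap
        # First level is full and does not contain col: fall back to level 2.
        if self.l2 is None:
            self.l2 = {}
        self.l2[col] = self.l2.get(col, 0.0) + value

    def items(self):
        out = {k: v for k, v in zip(self.l1_keys, self.l1_vals) if k is not None}
        if self.l2:
            out.update(self.l2)
        return sorted(out.items())


acc = TwoLevelAccumulator(l1_capacity=2)
for col, value in [(5, 1.0), (7, 2.0), (5, 0.5), (9, 4.0), (9, 1.0)]:
    acc.add(col, value)
```

Rows whose nonzeros fit in the first level never touch the second one, which matches the point above that only some workers need to request memory from the pool.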

SLIDE 9

Distance-2 Graph Coloring

▪ Distance-2 coloring on the structure of C in symbolic phase
▪ Dense accumulator per color
▪ Coloring on C is a more restrictive coloring on A
  ▪ It is also a distance-2 coloring on A
    – The rows of A do not share any column (!)
▪ No reuse of rows of B

SLIDE 10

Distance-2 Graph Coloring

▪ Distance-2 coloring on the structure of C in symbolic phase
▪ Dense accumulator per color
▪ Coloring on C is a more restrictive coloring on A
▪ No reuse of rows of B
▪ Improve by using multiple colors at a time = nnz(C) / numcols(C)
  ▪ MCR: Permute rows within multicolors – better reads
  ▪ MCW: Permute rows within single colors – better writes
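The sketch below illustrates why one dense accumulator per color is enough: rows of C that receive the same color share no column, so they can all write into the same dense array without conflicts. The greedy coloring heuristic, the function names, and the list-of-pairs matrix layout are assumptions made for this example; the multicolor variants (MCR, MCW) are not modeled.

```python
def color_rows(c_row_cols, n_cols):
    """Greedy coloring: rows of C that share a column get different colors."""
    col_colors = [set() for _ in range(n_cols)]  # colors already present per column
    colors = []
    for cols in c_row_cols:
        forbidden = set()
        for j in cols:
            forbidden |= col_colors[j]
        color = 0
        while color in forbidden:
            color += 1
        colors.append(color)
        for j in cols:
            col_colors[j].add(color)
    return colors


def numeric_by_color(a_rows, b_rows, c_row_cols, colors, n_cols):
    """Numeric phase with one dense accumulator shared by all rows of a color."""
    result = [None] * len(a_rows)
    for color in sorted(set(colors)):
        dense = [0.0] * n_cols
        members = [i for i, c in enumerate(colors) if c == color]
        for i in members:  # conflict-free: member rows touch disjoint columns
            for k, a_ik in a_rows[i]:
                for j, b_kj in b_rows[k]:
                    dense[j] += a_ik * b_kj
        for i in members:
            result[i] = {j: dense[j] for j in c_row_cols[i]}
    return result


a_rows = [[(0, 1.0)], [(1, 2.0)], [(0, 3.0), (1, 1.0)]]  # (column k, value) per row of A
b_rows = [[(0, 1.0), (1, 1.0)], [(2, 5.0)]]              # (column j, value) per row of B
c_row_cols = [[0, 1], [2], [0, 1, 2]]                    # structure of C from symbolic
colors = color_rows(c_row_cols, n_cols=3)
c_rows = numeric_by_color(a_rows, b_rows, c_row_cols, colors, n_cols=3)
```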

SLIDE 11

Hypergraph Model [Ballard 15]

W_computation = 1 for red vertices, 0 for yellow
W_memory = 0 for red vertices, 1 for yellow

[Figure: hypergraph with vertices for multiplications, input data, and output data]

SLIDE 12

SHMEM Directed HG Model

No owners of the data; data lies in memory (part k+1)
There are no messages exchanged between parts
Instead, incoming/outgoing arrows correspond to reads/writes
Merge nets for data that lives in the same cache line, or range of coalesced accesses
We use the model to evaluate the reads/writes of algorithms

SLIDE 13

Experiments

▪ Experiments on matrices
  ▪ Laplace3D (15M, 109M), Brick (15M, 418M) and Empire (2M, 303M) (internal Sandia application)
  ▪ Multiplications for multigrid solver in the form
    – A_coarse = R_restriction x A_fine x P_prolongation
    – RxA, RAxP, AxP, RxAP
  ▪ Some matrices used in the literature for AxA
▪ Bowman and Hansen clusters
  ▪ Bowman: Intel KNL
    ▪ 68 cores, 1.40 GHz, 4 hyper-threads per core.
    ▪ 16 GB HBW MCDRAM (476.2 GB/s), 96 GB DDR4 (84.3 GB/s)
  ▪ Hansen: NVIDIA Tesla K80
    ▪ CC 3.7 and 11.25 GB memory

SLIDE 14

GPU Gflops for RxAxP

Higher is better.

Matrix   Product  cuSPARSE  KKMEM
Laplace  AxP      0.10      1.49
Laplace  Rx(AP)   0.23      1.46
Laplace  RxA      0.16      0.87
Laplace  RAxP     0.16      0.68
Brick    AxP      0.29      2.23
Brick    Rx(AP)   0.54      2.12
Brick    RxA      0.32      1.78
Brick    RAxP     0.51      0.97
Empire   AxP      0.65      2.38
Empire   Rx(AP)   0.71      1.68
Empire   RxA      1.61      2.06
Empire   RAxP     0.52      0.79

CUSP runs out of memory
Speedups range from 1.28 to 14.83. Average 3.90
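The speedup figures on this slide can be approximately reproduced from the rounded Gflops values shown for cuSPARSE and KKMEM. The sketch below does that arithmetic; treating the reported average as a geometric mean is an inference from how the numbers work out, not something the slide states, and the variable names are hypothetical.

```python
import math

# Rounded Gflops from the slide, ordered Laplace, Brick, Empire x (AxP, Rx(AP), RxA, RAxP).
cusparse = [0.10, 0.23, 0.16, 0.16, 0.29, 0.54, 0.32, 0.51, 0.65, 0.71, 1.61, 0.52]
kkmem = [1.49, 1.46, 0.87, 0.68, 2.23, 2.12, 1.78, 0.97, 2.38, 1.68, 2.06, 0.79]

speedups = [k / c for k, c in zip(kkmem, cusparse)]
lowest = min(speedups)    # Empire RxA
highest = max(speedups)   # Laplace AxP
arithmetic_mean = sum(speedups) / len(speedups)
geometric_mean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
```

With these rounded inputs the minimum matches the reported 1.28, the maximum is about 14.9, and the geometric mean (about 3.9) is much closer to the reported average than the arithmetic mean (about 4.9). Small differences from the slide are expected because the table values are rounded to two decimals.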

SLIDE 15

KNL Experiments

[Figure: GFlops vs. number of threads (1 to 256) on DDR4 and MCDRAM, with and without reuse; legend: KKMEM, MKL]

GFlops data labels recovered from the chart (the values appear to be the KKMEM series):

Threads           1     2     4     8     16    32    64    128   256
No-reuse, DDR4    0.04  0.07  0.14  0.28  0.54  1.03  1.73  1.96  1.81
No-reuse, MCDRAM  0.04  0.07  0.14  0.27  0.54  1.05  2.03  2.87  3.10
Reuse, DDR4       0.06  0.12  0.24  0.48  0.95  1.88  3.52  4.67  4.43
Reuse, MCDRAM     0.06  0.12  0.24  0.47  0.94  1.86  3.65  5.51  6.37

Geometric mean for 13 multiplications. Compared against MKL.
The first MKL run takes 4-5x longer than the subsequent ones; the first run is excluded.
Overall: almost linear scaling up to 64 cores.
MKL is slightly faster up to 64 cores – no performance difference between MCDRAM and DDR4 (!).
KKMEM is 1.17 times faster on 128 threads with MCDRAM.
MKL does not scale on 256 threads.
With reuse, KKMEM is 2.12 - 2.25 times faster on 1-128 threads (3.05 and 4.08 times on 256 threads).
The difference between reuse and no-reuse is high.
Compression reduces the size by 7-20% for RxAxP, while it can reduce it by 87% for UFL matrices.

SLIDE 16

Flop per Double Laplace AxP

[Figure: GFlops (KernelFlops, Numeric Flops) and FLOP/Double for KKMEM, KKSPEED, MCR, and MCW on 64, 128, and 256 threads, DDR4 and MCDRAM. Callout: Dense Accumulator]

Data labels recovered from the chart (identical for DDR4 and MCDRAM):

Threads  KKMEM  KKSPEED  MCR   MCW
64       1.43   0.85     0.45  0.34
128      1.35   0.80     0.44  0.34
256      1.28   0.73     0.43  0.34

SLIDE 17

Laplace AxP MCDRAM

[Figure: GFlops (KernelFlops, Numeric Flops) and FLOP/Double for KKMEM, KKSPEED, MCR, and MCW on 64, 128, and 256 threads; data labels as on the "Flop per Double Laplace AxP" slide]

Has more hashmap operations than Flops

SLIDE 18

Laplace AxP DDR4

[Figure: GFlops (KernelFlops, Numeric Flops) and FLOP/Double for KKMEM, KKSPEED, MCR, and MCW on 64, 128, and 256 threads; data labels as on the "Flop per Double Laplace AxP" slide]

SLIDE 19

Conclusions & Future Work

▪ Portable SPGEMM method with decent performance on various new architectures
▪ Hypergraph model to study the effect of reads/writes on the overall performance
▪ Ongoing:
  ▪ Analyzing flop per read and flop per write, and experimenting with MCDRAM and DDR4.
▪ Future:
  ▪ Fast packing of columns of B for better compression
  ▪ Fast reordering of rows of A for better locality

SLIDE 20

For more information

▪ KokkosKernels:
  ▪ Download through Trilinos: http://trilinos.org
  ▪ Public git repository: http://github.com/trilinos
▪ For more information:
  ▪ mndevec@sandia.gov
▪ Thanks to:
  ▪ NNSA ASC program
  ▪ DOE ASCR SciDAC FASTMath Institute
  ▪ ATDM

SLIDE 21

References

▪ F. G. Gustavson, "Two fast algorithms for sparse matrices: Multiplication and permuted transposition," ACM Transactions on Mathematical Software (TOMS), vol. 4, no. 3, pp. 250-269, 1978.
▪ Buluç, Aydin, and John R. Gilbert. "Parallel sparse matrix-matrix multiplication and indexing: Implementation and experiments." SIAM Journal on Scientific Computing 34.4 (2012): C170-C191.
▪ Azad, Ariful, et al. "Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication." arXiv preprint arXiv:1510.00844 (2015).
▪ Akbudak, Kadir, and Cevdet Aykanat. "Simultaneous Input and Output Matrix Partitioning for Outer-Product-Parallel Sparse Matrix-Matrix Multiplication." SIAM Journal on Scientific Computing 36.5 (2014): C568-C590.
▪ Ballard, Grey, et al. "Brief announcement: Hypergraph partitioning for parallel sparse matrix-matrix multiplication." Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures. ACM, 2015.
▪ Patwary, Md Mostofa Ali, et al. "Parallel efficient sparse matrix-matrix multiplication on multicore platforms." International Conference on High Performance Computing. Springer International Publishing, 2015.
▪ Liu, Weifeng, and Brian Vinter. "An efficient GPU general sparse matrix-matrix multiplication for irregular data." Parallel and Distributed Processing Symposium, 2014 IEEE 28th International. IEEE, 2014.
▪ Dalton, Steven, Luke Olson, and Nathan Bell. "Optimizing sparse matrix-matrix multiplication for the GPU." ACM Transactions on Mathematical Software (TOMS) 41.4 (2015): 25.

SLIDE 22

GPU RxAxP Numeric Flops

          Laplace                     Brick                       Empire
          AxP   Rx(AP) RxA   RAxP    AxP   Rx(AP) RxA   RAxP    AxP   Rx(AP) RxA   RAxP
cuSPARSE  0.18  0.28   0.25  0.21    0.36  0.58   0.50  0.57    0.72  0.76   2.15  0.57
KKMEM     3.36  3.24   1.32  1.14    4.04  5.29   3.45  1.57    3.78  4.00   2.86  1.48
KKSPEED   3.41  3.30   1.60  1.33    4.65  5.24   3.62  1.79    4.28  3.89   2.75  1.64

KKMCR and KKMCW each show only 10 values, so their columns cannot be aligned with the 12 products:
KKMCR: 0.77 1.42 1.11 0.97 1.49 3.06 3.39 1.47 4.99 1.53
KKMCW: 0.71 1.84 1.13 1.09 1.79 3.95 3.75 1.71 5.00 1.68

Coloring-based methods perform far fewer operations.
But their accesses to B (the second matrix) are non-coalesced.
Still, performance is comparable or better when the second matrix has dense rows,
or when KKMEM also suffers from non-coalesced B accesses.

SLIDE 23

GPU Gflops for RxAxP

Higher is better.

CUSP runs out of memory
Speedups range from 1.28 (1.25) to 14.83 (14.93). Average 3.90 (4.06)
Cons:
KKMEM – cost to get memory through uniform pool
KKSPEED – hash operations are done through '%' instead of '&'.

SLIDE 24

GPU AxA Speedup w.r.t. cuSPARSE

[Figure: bar chart of speedup w.r.t. cuSPARSE for KKMEM, KKSPEED, bhSPARSE (GTX), and clSparse (K40, sp); legible matrix names include audi, bump, 2cubes_sphe, cage12, cant, filter3D, m133, and pwtk. The data labels cannot be matched to series and matrices from the extracted text.]

Overall KKMEM speedup: 3.76
KKSPEED - 4.19 (4.14 for KKMEM)
Audi has a very irregular row distribution. Output: 7586 MB
Pool requires 952 MB in symbolic and 308 MB in numeric
Bump – output: 6410 MB; pool: 280 MB and 87 MB

SLIDE 25

KNL Audi AxA

[Figure: GFlops vs. number of threads (1 to 256) on DDR4 and MCDRAM, with and without reuse; legend: KKMEM, MKL]

GFlops data labels recovered from the chart (the values appear to be the KKMEM series):

Threads           1     2     4     8     16    32    64    128   256
No-reuse, DDR4    0.08  0.15  0.30  0.60  1.18  2.35  4.49  6.65  6.72
No-reuse, MCDRAM  0.08  0.15  0.30  0.60  1.18  2.35  4.53  6.81  7.44
Reuse, DDR4       0.09  0.18  0.35  0.70  1.39  2.76  5.32  7.99  8.16
Reuse, MCDRAM     0.09  0.18  0.35  0.70  1.39  2.76  5.33  8.07  8.83

MKL is faster up to 64 cores. Performance is similar on 128 threads, and MKL does not scale on 256 threads.
With reuse: speedups of up to 1.95 to 2.33 (1.20 on 128 threads).
Compression is successful here. Symbolic is 85% faster than numeric.

SLIDE 26

KNL Laplace AxP

[Figure: GFlops vs. number of threads (1 to 256) on DDR4 and MCDRAM, with and without reuse; legend: KKMEM, MKL]

GFlops data labels recovered from the chart (the values appear to be the KKMEM series):

Threads           1     2     4     8     16    32    64    128   256
No-reuse, DDR4    0.03  0.05  0.10  0.20  0.39  0.76  1.44  1.94  1.88
No-reuse, MCDRAM  0.03  0.05  0.10  0.20  0.39  0.77  1.53  2.22  2.36
Reuse, DDR4       0.05  0.10  0.20  0.39  0.77  1.54  3.03  4.38  4.63
Reuse, MCDRAM     0.05  0.10  0.20  0.40  0.78  1.55  3.11  4.74  5.39

MKL is faster up to 64 cores. KKMEM is 10% faster on 128 threads.
MKL does not finish in 1000 seconds on 256 threads.
With reuse: speedups of up to 2.48.
Compression is not successful here (7% reduction).
Symbolic takes the same time as numeric, and is sometimes even more expensive.
Need: reordering/packing of columns to improve compression (SPMV cache locality).

SLIDE 27

KKMEM FLOP/Double vs GFLOPS

[Figure: FLOP/Double and GFlops (KernelFlops, Numeric Flops) of KKMEM for the Laplace and Brick multiplications (RxA, RAxP, RxAP, AxP) on 64, 128, and 256 threads with DDR4 and MCDRAM; no data values survive in the extracted text]
