SLIDE 1

Fast sparse matrix–vector multiplication by partitioning and reordering

Albert-Jan Yzelman, June 2011

SLIDE 2

Outline

1. Bulk Synchronous Parallel
2. Partitioning
3. Sequential SpMV
4. Parallel cache-friendly SpMV
5. Experimental results

SLIDE 3

Chip industry / Markov chain modelling in chemistry

SLIDE 4

Chip industry

SLIDE 5

Link matrix

SLIDE 6

Bulk Synchronous Parallel

1. Bulk Synchronous Parallel
2. Partitioning
3. Sequential SpMV
4. Parallel cache-friendly SpMV
5. Experimental results

SLIDE 7

Parallel bridging models

• Many different architectures
• One parallel model
• Predictable performance
• Easy implementation

SLIDE 8

Parallel bridging models

• Message Passing Interface (MPI)
• Bulk Synchronous Parallel (BSP)

Leslie G. Valiant, A bridging model for parallel computation, Communications of the ACM, Volume 33 (1990), pp. 103–111

SLIDE 9

Bulk Synchronous Parallel

A BSP computer:

[Figure: P processors P, each with local memory M, connected by a communication network.]

• consists of P homogeneous processors, each with local memory;
• executes a Single Program on Multiple Data (SPMD);
• performs no communication during calculation phases (supersteps);
• communicates only during barrier synchronisation.

SLIDE 10

[Figure: BSP execution timeline: Superstep 0, barrier synchronisation, Superstep 1, ...]

SLIDE 11

Bulk Synchronous Parallel

A BSP computer furthermore:

• performs r flops per second per processor;
• takes l time to synchronise;
• has a communication speed of g.

The model thus uses only four parameters: (P, r, l, g).

SLIDE 12

Bulk Synchronous Parallel

Cost model: let w_i^(s) be the work to be done by processor s in superstep i, let r_i^(s) be the amount of data received by processor s between supersteps i and i + 1, and let t_i^(s) be the amount of data transmitted by processor s. Define

    c_i = max{ max_s r_i^(s), max_s t_i^(s) }   and   w_i = max_s w_i^(s).

If T is the number of supersteps, the cost of a BSP algorithm is:

    Σ_{i=0}^{T} w_i + Σ_{i=0}^{T−1} (l + g · c_i)
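To make the formula concrete, here is a small helper that evaluates this cost; an illustrative sketch only (the array layout and names are not from the deck):

    #include <stddef.h>

    /* Evaluate the BSP cost. w[i][s], r[i][s] and t[i][s] hold the work,
     * received and transmitted data volumes of processor s in superstep i;
     * l is the synchronisation time and g the communication gap. Work is
     * summed over all supersteps; l + g*c_i is charged between consecutive
     * supersteps. */
    double bsp_cost(size_t T, size_t P, double **w, double **r, double **t,
                    double l, double g) {
        double cost = 0.0;
        for (size_t i = 0; i < T; i++) {
            double wi = 0.0, ci = 0.0;
            for (size_t s = 0; s < P; s++) {
                if (w[i][s] > wi) wi = w[i][s];
                if (r[i][s] > ci) ci = r[i][s];  /* c_i = max of data in  */
                if (t[i][s] > ci) ci = t[i][s];  /* ... and data out      */
            }
            cost += wi;                          /* computation term      */
            if (i + 1 < T)                       /* no sync after the end */
                cost += l + g * ci;              /* synchronisation term  */
        }
        return cost;
    }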

SLIDE 13

Bulk Synchronous Parallel

A BSP algorithm can:

• ask for some environment variables: bsp_nprocs(), bsp_pid();
• synchronise: bsp_sync();
• perform “direct” remote memory access (DRMA): bsp_put(source, dest, dest_PID), bsp_get(source, source_PID, dest);
• send messages, synchronously (BSMP): bsp_send(data, dest_PID), bsp_qsize(), bsp_move().

As implemented in, e.g., the Oxford BSP library (Bill McColl et al.)

SLIDE 14

Example: sparse matrix, dense vector multiplication

y = Ax: for each nonzero k of A, add x[k.column] · k.value to y[k.row].
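In code, this is the plain triplet SpMV kernel; a minimal sketch with illustrative array names:

    /* y += A*x, with A given as nnz triplets (row[k], col[k], val[k]). */
    void triplet_spmv(int nnz, const int *row, const int *col,
                      const double *val, const double *x, double *y) {
        for (int k = 0; k < nnz; k++)
            y[row[k]] += val[k] * x[col[k]];
    }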

SLIDE 15

Example: sparse matrix, dense vector multiplication

To do this in parallel: Distribute the nonzeroes of A, but also distribute x and y; each processor should have about 1/Pth of the total data (scalability).

SLIDE 16

Example: sparse matrix, dense vector multiplication

Step 1 (fan-out): not all processors have the elements of x they need; they must get the missing items. Only one communication is needed here: x is distributed well.
SLIDE 17

Example: sparse matrix, dense vector multiplication

Step 2 (mv): use the received elements of x for the multiplication.

Step 3 (fan-in): send local results to the correct processors; here, y is distributed cyclically, obviously a bad choice.

SLIDE 18

Example: sparse matrix, dense vector multiplication

The algorithm:

1. For all nonzeroes k of A: if the column of k is not local, request that element of x from the appropriate processor. Synchronise.
2. For all nonzeroes k of A: do the SpMV for k. Send all non-local row sums to the correct processor. Synchronise.
3. Add all incoming row sums to the corresponding y[i].
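The communication pattern can be illustrated with a sequential simulation of the three steps; a sketch only, with assumed cyclic distributions, not the deck's BSP code:

    #include <stdlib.h>

    /* Simulate the fan-out / mv / fan-in SpMV on P virtual processors.
     * Nonzero k lives on processor k % P, x[j] on processor j % P, and
     * y[i] on processor i % P. Returns the communication volume, with the
     * fan-out counted per remote access for simplicity. */
    long bsp_spmv_sim(int P, int m, int nnz, const int *row, const int *col,
                      const double *val, const double *x, double *y) {
        long volume = 0;
        double *part = calloc((size_t)P * m, sizeof(double));
        for (int k = 0; k < nnz; k++) {
            int s = k % P;                       /* owner of this nonzero  */
            if (col[k] % P != s) volume++;       /* step 1: fetch x[j]     */
            part[(size_t)s * m + row[k]] += val[k] * x[col[k]];  /* step 2 */
        }
        for (int s = 0; s < P; s++)              /* step 3: fan-in         */
            for (int i = 0; i < m; i++) {
                double v = part[(size_t)s * m + i];
                if (v != 0.0) {
                    if (i % P != s) volume++;    /* remote row sum         */
                    y[i] += v;
                }
            }
        free(part);
        return volume;
    }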

SLIDE 19

MulticoreBSP

For multicore, the original model (P processors, each with local memory M, connected by a communication network) may no longer apply.

SLIDE 20

MulticoreBSP

The AMD Phenom II 945e processor has uniform memory access, and is modelled well by BSP:

[Figure: four cores, each with a 64kB L1 and a 512kB L2 cache, sharing a 6MB L3 cache behind the system interface.]

SLIDE 21

MulticoreBSP

The Intel Core 2 Q6600 processor has cache-coherent non-uniform memory access (cc-NUMA), and is not modelled well by BSP:

[Figure: four cores, each with a 32kB L1 cache; two 4MB L2 caches, each shared by two cores, behind the system interface.]

Leslie G. Valiant, A bridging model for multi-core computing, Lecture Notes in Computer Science, vol. 5193, Springer (2008), pp. 13–28.

SLIDE 22

MulticoreBSP

New primitive (bsp_direct_get). The full set: a BSP algorithm can

• ask for some environment variables: bsp_nprocs(), bsp_pid();
• synchronise: bsp_sync();
• perform “direct” remote memory access (DRMA): bsp_put(source, dest, dest_PID), bsp_get(source, source_PID, dest), bsp_direct_get(source, source_PID, dest);
• send messages, synchronously (BSMP): bsp_send(data, dest_PID), bsp_qsize(), bsp_move().

SLIDE 23

MulticoreBSP

MulticoreBSP brings BSP programming to shared-memory architectures. Programmed in standard Java (5 and up), it is a fully object-oriented library containing only 10 primitives, 2 purely virtual functions (the parallel part and the sequential part), and 2 interfaces. Data types that can be communicated are defined using an interface. This makes MulticoreBSP:

• transparent and easy to learn,
• predictable in performance,
• robust (no data races, no deadlocks),
• potentially usable for both shared- and distributed-memory systems.

SLIDE 24

MulticoreBSP

Alternative (2-step) SpMV algorithm in MulticoreBSP:

1. For all nonzeroes k of A: if both the row and column of k are local, do the SpMV for k; if the column of k is not local, direct-get that element of x and then do the SpMV for k. Send all non-local row sums to the correct processor. Synchronise.
2. Add all incoming row sums to the corresponding y[i].

SLIDE 25

MulticoreBSP

Software is available at http://www.multicorebsp.com. Yzelman and Bisseling, An Object-Oriented BSP Library for Multicore Programming, Concurrency and Computation: Practice and Experience, 2011 (accepted for publication).

SLIDE 26

Partitioning

1. Bulk Synchronous Parallel
2. Partitioning
3. Sequential SpMV
4. Parallel cache-friendly SpMV
5. Experimental results

SLIDE 27

Automatic nonzero partitioning

What causes the communication?

nonzeroes on the same column distributed to different processors: fan-out communication

SLIDE 28

Automatic nonzero partitioning

What causes the communication?

nonzeroes on the same row distributed to different processors: fan-in communication

SLIDE 29

Automatic nonzero partitioning

Load balancing

For each superstep i, let w̄_i = (1/P) Σ_{s∈[0,P−1]} w_i^(s) be the average workload. The load-balance constraint is:

    max_s | w̄_i − w_i^(s) | ≤ ǫ w̄_i,

where ǫ is the maximum load imbalance parameter.
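A minimal check of this constraint, as an illustrative sketch (names invented here):

    #include <math.h>

    /* Return 1 iff the workloads w[0..P-1] of one superstep satisfy
     * max_s |wbar - w[s]| <= eps * wbar. */
    int balanced(int P, const double *w, double eps) {
        double wbar = 0.0, dev = 0.0;
        for (int s = 0; s < P; s++) wbar += w[s];
        wbar /= P;
        for (int s = 0; s < P; s++)
            if (fabs(wbar - w[s]) > dev) dev = fabs(wbar - w[s]);
        return dev <= eps * wbar;
    }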

SLIDE 30

Automatic nonzero partitioning

“Shared” columns: communication during fan-out

[Figure: column-net hypergraph of an 8×8 example matrix.]

Column-net model; a cut net means a shared column

SLIDE 31

Automatic nonzero partitioning

“Shared” rows: communication during fan-in

[Figure: row-net hypergraph of the same example matrix.]

Row-net model; a cut net means a shared row

SLIDE 32

Automatic nonzero partitioning

Catch communication both ways:

[Figure: fine-grain hypergraph; every nonzero of the example matrix is a vertex.]

Fine-grain model; a cut net means either a shared row or column

SLIDE 33

A cut net n_i means communication. The number of processors involved is λ_i = #{ j : V_j ∩ n_i ≠ ∅ }. So the quantity to minimise is:

    C = Σ_i (λ_i − 1).

Partitioning strategy: model the sparse matrix using a hypergraph, then partition the vertices of that hypergraph in two so that C is minimised under the load-balance constraint.
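For instance, under the column-net model (vertices are rows, nets are columns), C can be computed from a row-to-part assignment as follows; a sketch with invented names, assuming P ≤ 32 so a bitmask per net suffices:

    #include <stdlib.h>

    /* C = sum_j (lambda_j - 1), where lambda_j is the number of distinct
     * parts among the rows that have a nonzero in column j. */
    long cut_metric(int n, int nnz, const int *row, const int *col,
                    const int *part_of_row) {
        long C = 0;
        unsigned *mask = calloc(n, sizeof(unsigned));
        for (int k = 0; k < nnz; k++)
            mask[col[k]] |= 1u << part_of_row[row[k]];
        for (int j = 0; j < n; j++) {
            int lambda = 0;                      /* popcount of mask[j]   */
            for (unsigned b = mask[j]; b != 0; b &= b - 1) lambda++;
            if (lambda > 1) C += lambda - 1;     /* cut net contribution  */
        }
        free(mask);
        return C;
    }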

SLIDE 34

Partitioning strategy: model the sparse matrix using a hypergraph, then partition the vertices of that hypergraph in two.

Kernighan & Lin, An efficient heuristic procedure for partitioning graphs, Bell Systems Technical Journal 49 (1970).
Fiduccia & Mattheyses, A linear-time heuristic for improving network partitions, Proceedings of the 19th IEEE Design Automation Conference (1982).
Çatalyürek & Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Transactions on Parallel and Distributed Systems 10 (1999).
Bisseling & Vastenhouw, A two-dimensional data distribution method for parallel sparse matrix-vector multiplication, SIAM Review 47(1), 2005.

SLIDE 35

Mondriaan partitioning strategy:

• model the sparse matrix using a hypergraph;
• partition the vertices of the hypergraph (in two);
• try both the row-net and column-net models, and choose the best.

SLIDE 36–39

Mondriaan partitioning strategy:

• model the sparse matrix using a hypergraph;
• partition the vertices of the hypergraph (in two);
• recursively keep partitioning the vertex parts.

Brendan Vastenhouw and Rob H. Bisseling, A two-dimensional data distribution method for parallel sparse matrix-vector multiplication, SIAM Review 47(1) (2005), pp. 67–95.

SLIDE 40

Mondriaan:

• model the sparse matrix using a hypergraph;
• partition the vertices of the hypergraph (in two);
• recursively keep partitioning the vertex parts;
• partition the vector elements.


Rob H. Bisseling and Wouter Meesen, Communication balancing in parallel sparse matrix-vector multiplication, Electronic Transactions on Numerical Analysis, Vol. 21 (2005) pp. 47-65

SLIDE 41

Sequential SpMV

1. Bulk Synchronous Parallel
2. Partitioning
3. Sequential SpMV
4. Parallel cache-friendly SpMV
5. Experimental results

SLIDE 42

Realistic cache

A realistic cache has 1 < k < L, combining modulo-mapping and the LRU policy.

[Figure: main memory (RAM) mapped into a cache that is split into subcaches; modulo mapping selects a subcache, and each subcache behaves as an LRU stack.]

SLIDE 43–48

The dense case

Dense matrix–vector multiplication:

    | a00 a01 a02 a03 |   | x0 |   | y0 |
    | a10 a11 a12 a13 | · | x1 | = | y1 |
    | a20 a21 a22 a23 |   | x2 |   | y2 |
    | a30 a31 a32 a33 |   | x3 |   | y3 |

Example with k = L = 3. While processing a00·x0 and a01·x1, the LRU stack evolves as (most recently used on the left):

    x0 ⇒ a00 x0 ⇒ y0 a00 x0 ⇒ x1 y0 a00 x0 ⇒ a01 x1 y0 a00 x0 ⇒ y0 a01 x1 a00 x0

In the dense case, every memory access is predictable.

SLIDE 49

The sparse case

Standard data structure: Compressed Row Storage (CRS)

SLIDE 50–53

The sparse case

Sparse matrix–vector multiplication (SpMV); the LRU stack now evolves as:

    x? ⇒ a0? x? ⇒ y0 a0? x? ⇒ x? y0 a0? x? ⇒ a?? x? y0 a0? x? ⇒ y? a?? x? y? a0? x?

We cannot predict memory accesses in the sparse case! Can we adapt sparse matrix data structures for locality and lower bandwidth?

SLIDE 54

CRS

A =
    | 4 1 3 0 |
    | 0 0 2 3 |
    | 1 0 0 2 |
    | 7 0 1 1 |

Stored as:
    nzs: [4 1 3 2 3 1 2 7 1 1]
    col: [0 1 2 2 3 0 3 0 2 3]
    row: [0 3 5 7 10]

2·nnz + (m + 1) accesses.
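The corresponding kernel, as a minimal sketch:

    /* y += A*x with A in CRS; row[i]..row[i+1] delimit the nonzeros of row i. */
    void crs_spmv(int m, const int *row, const int *col, const double *nzs,
                  const double *x, double *y) {
        for (int i = 0; i < m; i++)
            for (int k = row[i]; k < row[i + 1]; k++)
                y[i] += nzs[k] * x[col[k]];
    }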

SLIDE 55

Incremental CRS

A = (as before). Stored as:
    nzs:           [4 1 3 2 3 1 2 7 1 1]
    col increment: [0 1 1 4 1 1 3 1 2 1]
    row increment: [0 1 1 1]

2·nnz + m accesses. The first increments give the starting position; a column increment that runs past n wraps around (subtract n) and consumes the next row increment.

Note: the access pattern is like plain CRS, but the SpMV requires fewer instructions.

Reference: Joris Koster, Parallel templates for numerical linear algebra, a high-performance computation library (Master's thesis), 2002.
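A matching ICRS kernel sketch (array and function names invented here): the first column increment is an absolute index, later increments are jumps, and a jump past n wraps around and advances the row:

    /* y += A*x with an n-column matrix A in ICRS form. */
    void icrs_spmv(int n, int nnz, const double *nzs,
                   const int *cinc, const int *rinc,
                   const double *x, double *y) {
        int i = rinc[0], j = cinc[0];    /* position of the first nonzero */
        int c = 1, r = 1;
        for (int t = 0; t < nnz; t++) {
            y[i] += nzs[t] * x[j];
            if (t + 1 < nnz) {
                j += cinc[c++];
                while (j >= n) {         /* column overflow: advance row  */
                    j -= n;
                    i += rinc[r++];
                }
            }
        }
    }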

SLIDE 56

Blocked CRS

A = (as before), with dense row-wise blocks: {4, 1, 3} / {2, 3} / {1} / {2} / {7, 0, 1, 1}. Stored as:
    nzs: [4 1 3 2 3 1 2 7 0 1 1]
    blk: [0 3 5 6 7 11]
    col: [0 2 0 3 0]
    row: [0 1 2 4 5]

nnz + (2·nblocks + 1) + (m + 1) accesses. Reference: Pinar and Heath, Improving Performance of Sparse Matrix-Vector Multiplication, 1999.
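A sketch of the matching kernel (names invented here): blk delimits each dense run of consecutive columns, col gives each run's first column, and row delimits the runs belonging to each matrix row:

    /* y += A*x with A stored as dense row-wise blocks (blocked CRS). */
    void bcrs_spmv(int m, const int *row, const int *blk, const int *col,
                   const double *nzs, const double *x, double *y) {
        for (int i = 0; i < m; i++)
            for (int b = row[i]; b < row[i + 1]; b++) {
                int j = col[b];                   /* first column of run  */
                for (int k = blk[b]; k < blk[b + 1]; k++)
                    y[i] += nzs[k] * x[j++];      /* consecutive columns  */
            }
    }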

SLIDE 57

Fractal data structures (triplets)

A =
    | 4 1 0 2 |
    | 0 2 0 3 |
    | 1 0 0 2 |
    | 7 0 1 0 |

Stored as (triplets in Hilbert-curve order):
    nzs: [7 1 4 1 2 2 3 2 1]
    i:   [3 2 0 0 1 0 1 2 3]
    j:   [0 0 0 1 1 3 3 3 2]

3 accesses per nonzero (3·nnz in total). Reference: Haase, Liebmann and Plank, A Hilbert-Order Multiplication Scheme for Unstructured Sparse Matrices, 2005.

SLIDE 58

Zig-zag CRS

Change the order of CRS:

SLIDE 59

Zig-zag CRS

A = (as before). Stored as:
    nzs: [4 1 3 3 2 1 2 1 1 7]
    col: [0 1 2 3 2 0 3 3 2 0]
    row: [0 3 5 7 10]

2·nnz + (m + 1) accesses; odd rows are stored right-to-left. Reference: Yzelman and Bisseling, Cache-oblivious sparse matrix-vector multiplication by using sparse matrix partitioning methods, SISC, 2009.
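Since the reversal is baked into the stored arrays, the SpMV kernel stays the plain CRS loop; converting CRS to zig-zag CRS is a one-pass in-place reorder, sketched here with illustrative names:

    /* Reverse the nonzeros of every odd row in place: CRS -> zig-zag CRS. */
    void crs_to_zigzag(int m, const int *row, int *col, double *nzs) {
        for (int i = 1; i < m; i += 2)
            for (int a = row[i], b = row[i + 1] - 1; a < b; a++, b--) {
                int    tc = col[a]; col[a] = col[b]; col[b] = tc;
                double tv = nzs[a]; nzs[a] = nzs[b]; nzs[b] = tv;
            }
    }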

SLIDE 60

Why not also change the input matrix structure?

• Assume zig-zag CRS ordering (theoretically).
• Allow only row and column permutations.

SLIDE 61

Separated Block Diagonal form

SLIDE 62

Separated Block Diagonal form

[Figure: the matrix in SBD form; the blocks incur no cache misses, 1 cache miss per row, 1 cache miss per row, and 3 cache misses per row, respectively.]

SLIDE 63

Separated Block Diagonal form

[Figure: after further recursion, the blocks incur 0, 1, 3, 1, 7, 1, 3, 1 cache misses per row, respectively.]

SLIDE 64

Separated Block Diagonal form

[Figure: the SBD blocks numbered in the order they are accessed.]

(Upper bound on) the number of cache misses: Σ_i (λ_i − 1)

SLIDE 65

Separated Block Diagonal form

In 1D, row and column permutations bring the original matrix A into Separated Block Diagonal (SBD) form as follows. A is modelled as a hypergraph H = (V, N), with V the set of columns of A and N the set of hyperedges (nets), where each net is a subset of V corresponding to a row of A. A partitioning V1, V2 of V is constructed, and from it three categories of nets:

• N_row: the nets with vertices only in V1;
• N_row^c: the nets with vertices in both V1 and V2 (the cut nets);
• N_row^+: the remaining nets, with vertices only in V2.
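These categories are straightforward to compute; a small classifier given a CRS matrix and a column-to-part assignment (invented names; part 0 = V1, part 1 = V2):

    enum row_cat { ROW_V1_ONLY, ROW_CUT, ROW_V2_ONLY };

    /* Classify each row by which column parts its nonzeros touch. */
    void classify_rows(int m, const int *row, const int *col,
                       const int *part_of_col, enum row_cat *cat) {
        for (int i = 0; i < m; i++) {
            int in1 = 0, in2 = 0;
            for (int k = row[i]; k < row[i + 1]; k++) {
                if (part_of_col[col[k]] == 0) in1 = 1;
                else                          in2 = 1;
            }
            cat[i] = (in1 && in2) ? ROW_CUT
                                  : (in1 ? ROW_V1_ONLY : ROW_V2_ONLY);
        }
    }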

SLIDE 66

Separated Block Diagonal form

[Figure: the matrix in SBD form; columns ordered as (V1, V2), rows ordered as (N_row, N_row^c, N_row^+).]

SLIDE 67–86

Permuting to SBD form

[Figure sequence, one step per slide: input matrix; column partitioning; column permutation; mixed-row detection; row permutation; column subpartitioning; column permutation; row permutation once no mixed rows remain; continued recursively on each part.]

SLIDE 87

Reordering parameters

Taking p = n/S (with S the cache size), the number of cache misses is strictly bounded by Σ_{i: n_i∈N} (λ_i − 1); taking p → ∞ yields a cache-oblivious method with the same bound.

Reference: Yzelman and Bisseling, Cache-oblivious sparse matrix-vector multiplication by using sparse matrix partitioning methods, SIAM Journal on Scientific Computing, 2009.

SLIDE 88

Two-dimensional SBD (doubly separated block diagonal)

[Figure: 1D SBD form versus 2D doubly-SBD form.]

Yzelman and Bisseling, Two-dimensional cache-oblivious sparse matrix–vector multiplication, April 2011 (Revised pre-print); http://www.math.uu.nl/people/yzelman/publications/#pp

SLIDE 89

Two-dimensional SBD (doubly separated block diagonal)

Using a fine-grain model of the input sparse matrix, each individual nonzero corresponds to a vertex, and each row and each column has a corresponding net; partitioning yields the categories N_col, N_col^c, N_col^+ and N_row, N_row^c, N_row^+.

The quantity minimised remains Σ_i (λ_i − 1).

SLIDE 90

Two-dimensional SBD (doubly separated block diagonal)

Zig-zag CRS is not suitable for handling 2D SBD!

SLIDE 91

Two-dimensional SBD

SLIDE 92

Two-dimensional SBD

SLIDE 93

Two-dimensional SBD; block ordering

[Figure: candidate block orderings of the doubly-SBD form, with their cache-miss costs in terms of x- and y-accesses (2x, 2y, 2x + 2y).]

SLIDE 94

Bi-directional Incremental CRS (BICRS)

A = (as before). Stored as:
    nzs:           [3 2 3 1 1 2 1 7 4 1]
    col increment: [2 4 1 4 -1 5 -3 4 4 1]
    row increment: [0 1 2 -1 1 -3]

2·nnz + (number of row jumps + 1) accesses. Increments may be negative, so the nonzeros can be stored in any order (for instance along the blocks of a 2D SBD form); a column increment running past n signals a row jump.
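A BICRS kernel sketch consistent with these arrays (names invented here): increments are signed, and a column position at or past n encodes a row jump:

    /* y += A*x with an n-column matrix A in BICRS; any nonzero order works. */
    void bicrs_spmv(int n, int nnz, const double *nzs,
                    const int *cinc, const int *rinc,
                    const double *x, double *y) {
        int i = rinc[0], j = cinc[0];    /* position of the first nonzero */
        int r = 1;
        for (int t = 0; t < nnz; t++) {
            y[i] += nzs[t] * x[j];
            if (t + 1 < nnz) {
                j += cinc[t + 1];
                if (j >= n) {            /* overflow encodes a row jump   */
                    j -= n;
                    i += rinc[r++];
                }
            }
        }
    }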

SLIDE 95

Parallel cache-friendly SpMV

1. Bulk Synchronous Parallel
2. Partitioning
3. Sequential SpMV
4. Parallel cache-friendly SpMV
5. Experimental results

SLIDE 96

On distributed-memory architectures

Directly use partitioner output:

SLIDE 97

On distributed-memory architectures

Or: use both partitioner and reordering output: partition for p → ∞, but distribute only over the actual number of processors:

SLIDE 98

On shared-memory architectures

Use: global version of the matrix A, stored in BICRS, global input vector x, global output vector y.

SLIDE 99

On shared-memory architectures

Use: global version of the matrix A, stored in BICRS, global input vector x, global output vector y. Multiple threads work simultaneously on contiguous blocks in the BICRS data structure; conflicts only arise on the row-wise separator areas. Use t − 1 synchronisation steps to prevent concurrent writes.

SLIDE 100

Experimental results

1. Bulk Synchronous Parallel
2. Partitioning
3. Sequential SpMV
4. Parallel cache-friendly SpMV
5. Experimental results

SLIDE 101

Chip industry

SLIDE 102

Chip industry – 1D reordering p = 100, ǫ = 0.1

SLIDE 103

Chip industry – 2D reordering p = 100, ǫ = 0.1

SLIDE 104

Link matrix

SLIDE 105

Link matrix – 1D reordering p = 20, ǫ = 0.1

SLIDE 106

Link matrix – 2D reordering p = 20, ǫ = 0.1

SLIDE 107

The memplus matrix – 1D reordering

[Figure panels: p = 1 (original) and p = 2, ǫ = 0.1.]

SLIDE 108

The memplus matrix – 1D reordering

[Figure panels: p = 1 (original) and p = 100, ǫ = 0.1.]

SLIDE 109

The memplus matrix – 2D reordering

[Figure panels: p = 2 and p = 100.]

SLIDE 110

The cage14 matrix

[Figure panels: original; 1D (p = 20, ǫ = 0.1); fine-grain (p = 20, ǫ = 0.1).]

SLIDE 111

Pre-processing and SpMV times

Matrix (parts)        Reordering time    SpMV time, old / 1D / 2D
memplus (p = 50)      4 seconds          0.4 / 0.3 / 0.3 ms
rhpentium (p = 50)    1 minute           0.9 / 0.7 / 0.9 ms
cage14 (p = 10)       30 minutes         111.6 / 130.4 / 130.4 ms
wiki2005 (p = 10)     2 hours            347.4 / 212.5 / 136.7 ms
GL7d18 (p = 10)       2 hours            780.3 / 552.5 / 549.5 ms

Old: SpMV on the original matrix A; 1D: SpMV on the 1D-reordered matrix PAQ; 2D: SpMV on the 2D-reordered matrix PAQ. On the original slide, black indicates use of a regular data structure, green the use of block ordering, and blue the use of the OSKI auto-tuning library.

Results from 2011: reordering on an AMD Opteron 2378, SpMV on an Intel Q6600.

SLIDE 112

Parallel: distributed-memory architectures

Directly use partitioner output:

Matrix               p = 1    p = 4    p = 16   p = 64
cage13               372.2    120.7    37.1     16.1
stanford_berkeley    552.6    169.3    71.2     21.4

Using the BSPonMPI library with the parallel SpMV kernel from BSPedupack; three-superstep algorithm on two nodes of 16 IBM Power6+ processors. Bisseling, van Leeuwen, Çatalyürek, Fagginger Auer, Yzelman, Two-dimensional approach to sparse matrix partitioning, in Combinatorial Scientific Computing, Olaf Schenk and Uwe Naumann (eds.).

SLIDE 113

Parallel: shared-memory architectures

Directly use partitioner output:

Matrix       sequential (unordered)   p = 2    p = 3    p = 4
cage14       232.8                    272.5    249.7    297.1
wiki2005     564.2                    285.3    244.5    255.0

Using the Java MulticoreBSP library; two-superstep algorithm with full synchronisation. Yzelman and Bisseling, An Object-Oriented BSP Library for Multicore Programming, 2011 (pre-print); http://www.math.uu.nl/people/yzelman/publications/#pp

http://www.multicorebsp.com

SLIDE 114

Combined parallel with reordering

Intel Core 2 Q6600:

s3dkt3m2:
t \ p     4     16     32     64
1         17    16     18     17
2         17    16     18     17
4         20    18     22     21

GL7d18:
t \ p     4     16     32     64
1         906   633    492    486
2         718   347    345    285
4         583   491    398    385

(All timings are in milliseconds.)

SLIDE 115

Combined parallel with reordering

AMD Phenom II 945e:

s3dkt3m2:
t \ p     4     16     32     64
1         11    11     14     11
2         8     7      9      8
4         6     7      6      6

GL7d18:
t \ p     4     16     32     64
1         482   373    352    372
2         333   376    236    357
4         250   200    199    237

(All timings are in milliseconds; the two tables correspond to the same matrices as on the previous slide.)

SLIDE 116

Conclusions:

We have introduced a scheme capable of increasing SpMV performance by up to a factor of three, while:

• remaining cache-oblivious,
• keeping open the option of using specialised sparse BLAS libraries, and
• making it easy to introduce (distributed- or shared-memory) parallelism into the SpMV.

For already well-structured matrices our approach does not obtain significant speedups, but it does not decrease performance much either. The two-dimensional method using the Mondriaan scheme performs best, especially when used with BICRS on x86 architectures. On the Power6+, the block data structures are preferred; see:

Yzelman and Bisseling, Two-dimensional cache-oblivious sparse matrix–vector multiplication, April 2011 (revised pre-print), www.math.uu.nl/people/yzelman/publications/#pp

SLIDE 117

Software

The latest version (3.11) of the sparse matrix partitioner Mondriaan natively supports both the 1D and 2D reordering methods described here. It was released in December 2010 and can be found at http://www.math.uu.nl/people/bisseling/Mondriaan. The plain and blocked SpMV kernels used here are freely available as well, at http://www.math.uu.nl/people/yzelman/software.

SLIDE 118

Sequential SpMV times without reordering

            Intel Q6600             AMD 945e
            s3dkt3m2    GL7d18      s3dkt3m2    GL7d18
Triplet     18          1323        12          466
CRS         14          794         12          437
ICRS        13          856         9           610
Hilbert     28          415         16          349
