Sparse Computations and Multi-BSP

Albert-Jan Yzelman
Parallel Computing & Big Data, Huawei Technologies France
October 11, 2016


  1. Sparse Computations and Multi-BSP. Albert-Jan Yzelman, Parallel Computing & Big Data, Huawei Technologies France. October 11, 2016.

  2. BSP

  BSP machine = { sequential processors } + interconnect. The machine is described entirely by (p, g, L):
  - strobing synchronisation,
  - homogeneous processing,
  - uniform full-duplex network.

  3. BSP

  A BSP algorithm uses strobing barriers with full overlap of computation and communication. Its bottlenecks are the h-relation, h = max_s { sent_s, recv_s }, and the work balance.

  L. G. Valiant, A bridging model for parallel computation, CACM, 1990.
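  To make the h-relation concrete: the cost of a communication phase is determined by the most loaded processor, counting the larger of its send and receive volumes. A minimal Python sketch (my own illustration, not part of the deck), with hypothetical inputs sent and recv:

```python
# h-relation of one superstep: the maximum, over all processors s,
# of the larger of the number of words sent and received by s.
def h_relation(sent, recv):
    """sent[s], recv[s]: words sent/received by processor s."""
    return max(max(sent[s], recv[s]) for s in range(len(sent)))

# Example with p = 4: processor 2 dominates with 7 words sent.
print(h_relation(sent=[3, 0, 7, 2], recv=[1, 4, 6, 2]))  # prints 7
```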

  4–5. BSP

  BSP cost:

  T_p = max_s w_s^(0) + L + max{ max_s w_s^(1) + L , max_s h_s^(1)·g + L } + ···

  Separation of computation vs. communication; separation of algorithm vs. hardware.

  L. G. Valiant, A bridging model for parallel computation, CACM, 1990.
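  As a worked example of this cost function, here is a small Python sketch (my own illustration, assuming the full overlap of computation and communication mentioned on slide 3) that sums the per-superstep maxima:

```python
# BSP cost under full overlap: each superstep contributes the maximum
# of its work term (max_s w_s + L) and its communication term
# (max_s h_s * g + L); a superstep with h = 0 costs max_s w_s + L.
def bsp_cost(w, h, g, L):
    """w[k][s]: flops of processor s in superstep k;
       h[k][s]: h-relation contribution of processor s in superstep k."""
    return sum(max(max(w_k) + L, max(h_k) * g + L)
               for w_k, h_k in zip(w, h))

# Example: p = 2 processors, two supersteps, g = 2, L = 50; the first
# superstep is computation-only (h = 0), as in the formula above.
print(bsp_cost(w=[[100, 80], [60, 90]], h=[[0, 0], [10, 8]], g=2, L=50))
# prints 290 = (100 + 50) + max(90 + 50, 10*2 + 50)
```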

  6. Immortal algorithms

  The BSP paradigm allows the design of immortal algorithms: given a problem to compute and a BSP computer (p, g, l), find the BSP algorithm that attains provably minimal cost. Examples: fast Fourier transforms, matrix–matrix multiplication.

  Bisseling and Yzelman, Thinking in Sync: the Bulk-Synchronous Parallel approach to large-scale computing, ACM Computing Reviews Hot Topic ’16. http://www.computingreviews.com/hottopic/hottopic_essay.cfm?htname=BSP

  7–11. BSP sparse matrix–vector multiplication

  Variables A_s, x_s, y_s are local versions of the global variables A, x, y, distributed according to π_A, π_x, π_y.

  1: for all j such that ∃ a_ij ≠ 0 ∈ A_s and π_x(j) ≠ s do
  2:     get x_{π_x(j), j}
  3: sync                      { execute fan-out }
  4: y_s = A_s x_s             { local multiplication stage }
  5: for all i such that ∃ a_ij ∈ A_s and π_y(i) ≠ s do
  6:     send (i, y_{s,i}) to π_y(i)
  7: sync                      { execute fan-in }
  8: for all (i, α) received do
  9:     add α to y_{s,i}

  Rob H. Bisseling, Parallel Scientific Computation, Oxford University Press, 2004.
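  The following Python sketch (a sequential simulation of all p processes at once, not the actual BSP implementation; the dict-based data layout is an assumption for illustration) runs the three stages above and also counts the fan-out and fan-in volumes:

```python
def bsp_spmv(p, nonzeros, pi_A, pi_x, pi_y, x):
    """Simulate the BSP SpMV. nonzeros: list of (i, j, a_ij);
    pi_A[(i, j)] owns nonzero a_ij; pi_x[j], pi_y[i] own x_j, y_i."""
    # Fan-out: each process s obtains the x_j it needs (a 'get' is
    # required whenever pi_x[j] != s; here we simply read x directly).
    x_loc = [{} for _ in range(p)]
    for (i, j, a) in nonzeros:
        x_loc[pi_A[(i, j)]][j] = x[j]
    fan_out = sum(1 for s in range(p) for j in x_loc[s] if pi_x[j] != s)

    # Local multiplication stage: y_s = A_s x_s (partial sums).
    y_loc = [{} for _ in range(p)]
    for (i, j, a) in nonzeros:
        s = pi_A[(i, j)]
        y_loc[s][i] = y_loc[s].get(i, 0.0) + a * x_loc[s][j]

    # Fan-in: send each partial y_{s,i} to pi_y[i] and add on arrival.
    fan_in = sum(1 for s in range(p) for i in y_loc[s] if pi_y[i] != s)
    y = [{} for _ in range(p)]
    for s in range(p):
        for i, alpha in y_loc[s].items():
            y[pi_y[i]][i] = y[pi_y[i]].get(i, 0.0) + alpha
    return y, fan_out, fan_in

# Example: A = [[4, 1], [0, 3]], x = (1, 2), split over p = 2 processes.
nz = [(0, 0, 4.0), (0, 1, 1.0), (1, 1, 3.0)]
y, h_out, h_in = bsp_spmv(2, nz, {(0, 0): 0, (0, 1): 1, (1, 1): 1},
                          {0: 0, 1: 1}, {0: 0, 1: 1}, x=[1.0, 2.0])
print(y, h_out, h_in)  # [{0: 6.0}, {1: 6.0}] 0 1, since A x = (6, 6)
```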

  12–13. BSP sparse matrix–vector multiplication

  Suppose π_A assigns every nonzero a_ij ∈ A to processor π_A(i, j). If
  1. π_y(i) ∈ { s | ∃ a_ij ∈ A, π_A(i, j) = s }, and
  2. π_x(j) ∈ { s | ∃ a_ij ∈ A, π_A(i, j) = s },
  then
  - fan-out communication scatters Σ_j (λ_j^col − 1) elements from x, and
  - fan-in communication gathers Σ_i (λ_i^row − 1) elements from y,
  where λ_i^row = |{ s | ∃ j: a_ij ∈ A_s }| and λ_j^col = |{ s | ∃ i: a_ij ∈ A_s }|.

  Minimising the λ − 1 metric minimises total communication volume.
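  A short Python sketch (my own illustration) of the λ − 1 metric: λ_i^row and λ_j^col count the distinct owners of nonzeroes in row i and column j, and the metric sums their excess over one:

```python
# Communication volume of fan-out plus fan-in via the lambda - 1
# metric: sum_j (lambda_j^col - 1) + sum_i (lambda_i^row - 1).
def lambda_volume(nonzeros, pi_A):
    row_owners, col_owners = {}, {}
    for (i, j) in nonzeros:
        s = pi_A[(i, j)]
        row_owners.setdefault(i, set()).add(s)
        col_owners.setdefault(j, set()).add(s)
    fan_out = sum(len(o) - 1 for o in col_owners.values())
    fan_in = sum(len(o) - 1 for o in row_owners.values())
    return fan_out + fan_in

# Example: row 0 is cut between processes 0 and 1 (lambda_0^row = 2),
# so one element of y must be communicated during the fan-in.
print(lambda_volume([(0, 0), (0, 1), (1, 1)],
                    {(0, 0): 0, (0, 1): 1, (1, 1): 1}))  # prints 1
```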

  14. BSP sparse matrix–vector multiplication

  Partitioning combined with reordering illustrates clear separators:

  [Figure: partitioned matrix with parts labelled 1–4, before and after reordering.]

  Group nonzeroes a_ij for which π_A(i) = π_A(j), permute the rows i with λ_i > 1 in between, and apply recursive bipartitioning.
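  As a toy illustration of the row grouping just described (this is not the actual partitioner, which uses recursive bipartitioning of the nonzeroes), the following Python sketch orders rows for one bipartition: rows owned purely by part 0, then the separator rows with λ_i > 1, then rows owned purely by part 1:

```python
def sbd_row_order(nonzeros, pi_A):
    """Rows grouped as [pure part 0 | separator (lambda_i > 1) | pure part 1]."""
    owners = {}
    for (i, j) in nonzeros:
        owners.setdefault(i, set()).add(pi_A[(i, j)])
    part0 = sorted(i for i, o in owners.items() if o == {0})
    cut = sorted(i for i, o in owners.items() if len(o) > 1)
    part1 = sorted(i for i, o in owners.items() if o == {1})
    return part0 + cut + part1

# Example: row 1 has nonzeroes in both parts, so it is a separator row.
print(sbd_row_order([(0, 0), (1, 0), (1, 2), (2, 2)],
                    {(0, 0): 0, (1, 0): 0, (1, 2): 1, (2, 2): 1}))
# prints [0, 1, 2]: row 0 (part 0), row 1 (separator), row 2 (part 1)
```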

  15. BSP sparse matrix–vector multiplication

  When partitioning in both dimensions:

  [Figure: matrix reordered after partitioning in both dimensions, with separators in both the row and column directions.]

  16–21. BSP sparse matrix–vector multiplication

  Classical worst-case bounds (in flops):

  Block:   2 nz(A)/p · (1 + ε) + (n/p)(√p − 1)(2g + 1) + 2l.
  Row 1D:  2 nz(A)/p · (1 + ε) + g·h_fan-out + l.
  Col 1D:  2 nz(A)/p · (1 + ε) + max_s recv_s^fan-in + g·h_fan-in + l.
  Full 2D: 2 nz(A)/p · (1 + ε) + max_s recv_s^fan-in + g·(h_fan-out + h_fan-in) + 2l.

  Memory overhead (buffers):

  Θ( Σ_i (λ_i^row − 1) + Σ_j (λ_j^col − 1) ) = O( p · Σ_{λ ∈ λ^row ∪ λ^col} 1_{λ > 1} ).

  Depending on the higher-level algorithm, the fan-in latency can be hidden behind other kernels, and the fan-out latency can be hidden as well.
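  To compare the bounds numerically, a small Python sketch (the parameter names are my own) that evaluates each expression in flop-equivalent units:

```python
from math import sqrt

def bound_block(nz, n, p, g, l, eps):
    return 2 * nz / p * (1 + eps) + n / p * (sqrt(p) - 1) * (2 * g + 1) + 2 * l

def bound_row_1d(nz, p, g, l, eps, h_fan_out):
    return 2 * nz / p * (1 + eps) + g * h_fan_out + l

def bound_col_1d(nz, p, g, l, eps, h_fan_in, max_recv):
    return 2 * nz / p * (1 + eps) + max_recv + g * h_fan_in + l

def bound_full_2d(nz, p, g, l, eps, h_fan_out, h_fan_in, max_recv):
    return (2 * nz / p * (1 + eps) + max_recv
            + g * (h_fan_out + h_fan_in) + 2 * l)

# Example figures (made up): nz(A) = 10**7, n = 10**6, p = 64, g = 2,
# l = 10**4, eps = 0.1.
print(bound_block(10**7, 10**6, 64, 2, 10**4, 0.1))
```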

  22–24. Multi-BSP

  Multi-BSP computer = p (subcomputers or processors) + M bytes of local memory + an interconnect, applied recursively over L levels.

  A total of 4L parameters: (p_0, g_0, l_0, M_0, ..., p_{L−1}, g_{L−1}, l_{L−1}, M_{L−1}).

  Advantages: memory-aware, non-uniform!
  Disadvantages: (likely) harder to prove optimality.

  L. G. Valiant, A bridging model for multi-core computing, CACM 2011.

  25. Multi-BSP

  An example with L = 3 quadlets (p, g, l, M):

  C = ( (2, g_0, l_0, M_0), (4, g_1, l_1, M_1), (8, g_2, l_2, M_2) ).

  Each quadlet runs its own BSP SPMD program.
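  A Python sketch (the class name and the concrete g, l, M values below are made up for illustration) of such a Multi-BSP machine description:

```python
from dataclasses import dataclass

@dataclass
class Quadlet:
    p: int    # subcomputers (or processors) at this level
    g: float  # per-word communication cost at this level
    l: float  # synchronisation latency at this level
    M: int    # bytes of memory local to this level

# The L = 3 example above; g, l, M values are placeholders.
C = [Quadlet(2, 1.0, 500.0, 2**30),
     Quadlet(4, 2.0, 2000.0, 2**24),
     Quadlet(8, 5.0, 8000.0, 2**19)]

# Total number of leaf processors: 2 * 4 * 8 = 64.
total = 1
for q in C:
    total *= q.p
print(total)  # prints 64
```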

  26. Multi-BSP SpMV multiplication

  SPMD-style Multi-BSP SpMV multiplication:
  - define process 0 at level −1 as the Multi-BSP root;
  - let process s at level k have parent t at level k − 1;
  - define (A_{−1,0}, x_{−1,0}, y_{−1,0}) = (A, x, y), the original input.
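  A Python sketch (my own illustration, reusing the Quadlet list C from the previous sketch) of this process tree: the root is process 0 at level −1, and every process at level k spawns the p_{k+1} child processes of level k + 1:

```python
def process_tree(machine, level=-1, name=(0,)):
    """Yield (level, name, parent_name) for every Multi-BSP process;
    the root, process 0 at level -1, has no parent."""
    yield level, name, (name[:-1] if level >= 0 else None)
    if level + 1 < len(machine):
        for s in range(machine[level + 1].p):
            yield from process_tree(machine, level + 1, name + (s,))

# Print the root and its two level-0 children for the machine C above.
for lvl, proc, parent in process_tree(C):
    if lvl <= 0:
        print(lvl, proc, parent)
```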
