Direct methods on GPU-based systems: preliminary work towards a functioning code

A. Decollas and F. Lopez. Joint work with IRIT Toulouse, LaBRI / Inria Bordeaux, LIP / Inria Lyon. Sparse Days 2012, Toulouse, June 25th.


  1. Direct methods on GPU-based systems: preliminary work towards a functioning code
     A. Decollas and F. Lopez. Joint work with IRIT Toulouse, LaBRI / Inria Bordeaux, LIP / Inria Lyon.
     Sparse Days 2012, Toulouse, June 25th.

  2. Context of the work

  3. Context
     F. Lopez @ IRIT-Toulouse, A. Decollas @ Inria-Bordeaux
     • Develop dense linear algebra kernels specific to sparse, direct solvers capable of achieving high efficiency on heterogeneous systems equipped with multiple CPUs and GPUs.
     • Evaluate the efficiency of modern runtime systems for heterogeneous and irregular workloads such as multifrontal solvers on homogeneous, multicore architectures.
     These two activities will ultimately be merged into a sparse, direct solver for accelerated multicore systems.

  4. The multifrontal method
     The multifrontal factorization is guided by a graph called the elimination tree:
     • At each node of the tree, k pivots are eliminated.
     • Each node of the tree is associated with a relatively small dense matrix called the frontal matrix (or, simply, the front), which contains the k rows/columns related to the pivots and all the other coefficients concerned by their elimination.
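The elimination tree mentioned above can be computed from the nonzero pattern alone. The following is a minimal sketch, assuming a symmetric matrix, one pivot per node (k = 1) and no supernode amalgamation; it uses the classical ancestor-shortcut construction, not necessarily what the authors' code does. The function name and the input format are illustrative choices.

```python
def elimination_tree(n, lower_entries):
    """Return parent[j] for every column j of a symmetric n x n pattern.

    lower_entries: iterable of (i, j) with i > j, the strictly lower
    triangular nonzero positions. Roots get parent -1.
    """
    # rows[i] lists the columns j < i that hold a nonzero in row i.
    rows = [[] for _ in range(n)]
    for i, j in lower_entries:
        rows[i].append(j)

    parent = [-1] * n
    ancestor = [-1] * n  # shortcut towards the current root of each subtree
    for i in range(n):
        for j in rows[i]:
            # Climb from j towards the root of its subtree, redirecting
            # every visited node to i so later climbs are shorter.
            while j != -1 and j < i:
                nxt = ancestor[j]
                ancestor[j] = i
                if nxt == -1:
                    parent[j] = i  # j was a root: i becomes its parent
                j = nxt
    return parent


# Columns 0 and 1 feed node 2; nodes 2 and 3 feed the root 4.
print(elimination_tree(5, [(2, 0), (2, 1), (4, 2), (4, 3)]))
```

Because a parent always has a larger index than its children in this construction, visiting the nodes in increasing order is one valid bottom-up traversal.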

  5. The multifrontal method
     The tree is traversed in topological order (i.e., bottom-up) and, at each node, two operations are performed:
     • assembly: a set of coefficients from the original matrix associated with the pivots and a number of contribution blocks produced by the treatment of the child nodes are summed to form the frontal matrix
     • factorization: the k pivots are eliminated through a partial factorization of the frontal matrix. As a result we get:
       ◦ k rows/columns of the global factors
       ◦ a contribution block that will be assembled into the parent's front
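The assembly and factorization steps can be illustrated on a small case. This is a sketch under strong simplifying assumptions: the matrix is symmetric positive definite and stored densely, a Cholesky factorization is used, and each front eliminates a single pivot (k = 1). All names are illustrative and none of this is taken from the authors' code.

```python
import math


def multifrontal_cholesky(A):
    """Factor a dense-stored SPD matrix as A = L L^T, one pivot per front."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    pending = [[] for _ in range(n)]  # contribution blocks waiting at each parent

    for j in range(n):  # increasing order visits children before parents
        # Indices of the front: pivot j, original nonzeros below it, and
        # every index carried by the children's contribution blocks.
        idx = {j} | {i for i in range(j + 1, n) if A[i][j] != 0.0}
        for child_idx, _ in pending[j]:
            idx |= set(child_idx)
        idx = sorted(idx)
        pos = {g: p for p, g in enumerate(idx)}
        m = len(idx)

        # Assembly: original coefficients of the pivot row/column ...
        F = [[0.0] * m for _ in range(m)]
        for g in idx:
            F[pos[g]][0] = A[g][j]
            F[0][pos[g]] = A[j][g]
        # ... plus the children's contribution blocks.
        for child_idx, cb in pending[j]:
            for a, ga in enumerate(child_idx):
                for b, gb in enumerate(child_idx):
                    F[pos[ga]][pos[gb]] += cb[a][b]

        # Partial factorization: eliminate the single pivot.
        d = math.sqrt(F[0][0])
        col = [F[p][0] / d for p in range(m)]
        for p, g in enumerate(idx):
            L[g][j] = col[p]  # one column of the global factor

        # The Schur complement is the contribution block for the parent.
        if m > 1:
            cb = [[F[p][q] - col[p] * col[q] for q in range(1, m)]
                  for p in range(1, m)]
            parent = idx[1]  # smallest remaining index is the parent node
            pending[parent].append((idx[1:], cb))
    return L


A = [[4.0, 0.0, 2.0],
     [0.0, 9.0, 3.0],
     [2.0, 3.0, 6.0]]
L = multifrontal_cholesky(A)
```

In this example, fronts 0 and 1 are independent leaves and both send a 1 x 1 contribution block to front 2, which is the kind of tree parallelism a scheduler can exploit.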


  16. CPU-GPU hybrid architectures
      GPUs may be used as powerful accelerators for HPC applications:
      • High computational performance (GPU vs. CPU: 10× faster, memory access 5× faster)
      • Energy efficient
      Despite these capabilities, the use of GPUs is challenging:
      • Complex architectures (GPU vs. CPU: 100× more cores)
      • CPU and GPU programming models are incompatible
      • CPU ↔ GPU transfers are expensive (no shared memory) ⇒ specific algorithms are needed
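The trade-off between faster GPU arithmetic and expensive CPU ↔ GPU transfers can be made concrete with a back-of-the-envelope model. The sketch below is only an illustration: the three rates are hypothetical defaults (the GPU is taken to be 10× faster than the CPU, as quoted on the slide), the front is assumed to be sent and fetched back in full, and kernel launch latency is ignored.

```python
def offload_speedup(m, cpu_flops=1.0e10, gpu_flops=1.0e11, link_bytes_per_s=5.0e9):
    """Estimated speedup of factorizing an m x m front on the GPU rather than the CPU.

    All rates are hypothetical placeholders, not measurements.
    """
    flops = m ** 3 / 3.0        # leading-order cost of a dense Cholesky factorization
    traffic = 2 * 8 * m * m     # send the front and fetch it back, 8-byte entries
    t_cpu = flops / cpu_flops
    t_gpu = flops / gpu_flops + traffic / link_bytes_per_s
    return t_cpu / t_gpu


for m in (20, 200, 2000):
    print(m, round(offload_speedup(m), 2))
```

With these assumed rates the estimate is below 1 for m = 20 (the transfer dominates) and above 5 for m = 2000, which is one way to see why small fronts near the leaves and large fronts near the root call for different treatment.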

  17. CPU-GPU hybrid architectures
      [Figure: elimination tree and heterogeneous platform (CPUs, GPUs, RAM)]
      • An extremely heterogeneous workload
      • A heterogeneous architecture
      • Mapping tasks is challenging

  18. CPU-GPU hybrid architectures
      [Figure: elimination tree and heterogeneous platform (CPUs, GPUs, RAM)]
      One option is to do the mapping by hand (see T. Davis' talk at SIAM PP12). This requires very accurate performance models, which are difficult to achieve.
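To show why a mapping is only as good as its performance model, here is a minimal sketch of a model-driven mapper. It is a simplification, not the approach of the talk or of any particular runtime system: tasks are assumed independent (tree dependencies are ignored), each task goes to the worker with the earliest estimated finish time, and all task names and timings are hypothetical.

```python
def greedy_map(tasks, workers):
    """Map each task to the worker with the earliest estimated finish time.

    tasks:   list of (name, {kind: estimated_seconds}) pairs
    workers: dict mapping worker name to its kind ("cpu" or "gpu")
    Returns the placement and the estimated makespan.
    """
    ready_at = {w: 0.0 for w in workers}
    placement = {}
    for name, estimate in tasks:
        best = min(workers, key=lambda w: ready_at[w] + estimate[workers[w]])
        ready_at[best] += estimate[workers[best]]
        placement[name] = best
    return placement, max(ready_at.values())


workers = {"cpu0": "cpu", "gpu0": "gpu"}
tasks = [
    ("large_front", {"cpu": 10.0, "gpu": 1.0}),    # hypothetical estimates
    ("small_front_a", {"cpu": 1.0, "gpu": 2.0}),
    ("small_front_b", {"cpu": 1.0, "gpu": 2.0}),
]
placement, makespan = greedy_map(tasks, workers)
print(placement, makespan)
```

If the estimates are wrong, for instance if the large front is actually slower on the GPU once transfers are counted, the same procedure produces a poor mapping, which is the difficulty the slide points out.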
