
slide-1
SLIDE 1

Direct methods on GPU-based systems

Preliminary work towards a functioning code

  • A. Decollas and F. Lopez,

Joint work with IRIT Toulouse, LaBRI / Inria Bordeaux, LIP / Inria Lyon

Sparse Days 2012. Toulouse, June 25th

slide-2
SLIDE 2

Context of the work

slide-3
SLIDE 3

Context

  • F. Lopez @ IRIT-Toulouse

Evaluate the efficiency of modern runtime systems on heterogeneous and irregular workloads, such as multifrontal solvers, on homogeneous multicore architectures.

  • A. Decollas @ Inria-Bordeaux

Develop dense linear algebra kernels, specific to sparse direct solvers, capable of achieving high efficiency on heterogeneous systems equipped with multiple CPUs and GPUs.

These two activities will ultimately be merged into a sparse direct solver for accelerated multicore systems.

slide-4
SLIDE 4

The multifrontal method

The multifrontal factorization is guided by a graph called the elimination tree:

  • At each node of the tree, k pivots are eliminated

  • Each node of the tree is associated with a relatively small dense matrix, called frontal matrix (or, simply, front), which contains the k rows/columns related to the pivots and all the other coefficients concerned by their elimination

slide-5
SLIDE 5

The multifrontal method

The tree is traversed in topological order (i.e., bottom-up) and, at each node, two operations are performed:

  • assembly: a set of coefficients from the original matrix associated with the pivots, and a number of contribution blocks produced by the treatment of the child nodes, are summed to form the frontal matrix

  • factorization: the k pivots are eliminated through a partial factorization of the frontal matrix. As a result we get:

  • k rows/columns of the global factors
  • A contribution block that will be assembled into the parent's front
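As a toy illustration of the two operations above, a minimal sketch in Python/NumPy; the node layout, field names, and the use of a Cholesky partial factorization are assumptions for illustration, not the actual solver code:

```python
import numpy as np

def spd(n, rng):
    """Random symmetric positive definite test matrix (illustration only)."""
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)

def process_front(node):
    """Toy multifrontal step: assemble the front from original-matrix
    coefficients plus the children's contribution blocks, then eliminate
    the node's k pivots by a partial (here: Cholesky) factorization."""
    F = np.array(node["coeffs"], dtype=float)
    for child in node["children"]:
        cb, idx = process_front(child)            # post-order (bottom-up)
        F[np.ix_(idx, idx)] += cb                 # assembly: extend-add

    k = node["k"]
    L11 = np.linalg.cholesky(F[:k, :k])           # factor the k pivot rows/cols
    L21 = np.linalg.solve(L11, F[k:, :k].T).T     # off-diagonal factor block
    cb = F[k:, k:] - L21 @ L21.T                  # contribution block (Schur)
    return cb, node.get("parent_idx", [])

# Two-node demo: the leaf's contribution block is assembled into the root.
rng = np.random.default_rng(0)
leaf = {"children": [], "coeffs": spd(3, rng), "k": 1, "parent_idx": [0, 1]}
root = {"children": [leaf], "coeffs": spd(3, rng), "k": 2}
cb_root, _ = process_front(root)   # a 1x1 contribution block remains at the root
```

In a real solver the extend-add uses row/column index lists computed during the analysis phase; here the child's `parent_idx` plays that role.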


slide-16
SLIDE 16

CPU-GPU hybrid architectures

GPUs may be used as powerful accelerators for HPC applications:

  • High computational performance (compared to a CPU: roughly 10× faster compute, 5× faster memory access)
  • Energy efficient

Despite these capabilities, the use of GPUs is challenging:

  • Complex architectures (roughly 100× more cores than a CPU)
  • CPU and GPU programming models are incompatible
  • CPU ↔ GPU transfers are expensive (no shared memory) ⇒ specific algorithms are needed

slide-17
SLIDE 17

CPU-GPU hybrid architectures

Heterogeneous platform

[Diagram: several multicore CPUs and GPUs attached to distinct RAM modules, next to an elimination tree]

  • An extremely heterogeneous workload
  • A heterogeneous architecture
  • ⇒ mapping tasks is challenging

slide-18
SLIDE 18

CPU-GPU hybrid architectures

Heterogeneous platform (diagram as above)

One option is to do the mapping by hand (see T. Davis' talk at SIAM PP12). This requires very accurate performance models, which are difficult to achieve.

slide-19
SLIDE 19

CPU-GPU hybrid architectures

Heterogeneous platform (diagram as above)

Elimination tree → StarPU (scheduler, DSM, drivers for CPU/GPU/SPU): the runtime system sits between the application and the hardware.

Another option is to exploit the features of a modern runtime system capable of handling the scheduling and the data coherency in a dynamic way.

slide-20
SLIDE 20

Runtime systems

Runtime system: an abstraction layer between the application and the machine, with the following features:

  • Automatic detection of task dependencies
  • Dynamic task scheduling on different types of processing units
  • Management of multi-versioned tasks (an implementation for each type of processing unit)
  • Coherency management of manipulated data
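The first feature, automatic dependency detection, can be sketched with a hypothetical toy model (not StarPU's actual API): a task depends on the last writer of any datum it reads or writes, which covers read-after-write and write-after-write hazards; write-after-read is ignored here for brevity.

```python
def build_dag(tasks):
    """tasks: list of (name, reads, writes) with data referred to by label.
    Returns the dependency edges (predecessor, successor)."""
    last_writer = {}
    edges = []
    for name, reads, writes in tasks:
        for d in reads + writes:
            if d in last_writer:              # RAW / WAW hazard
                edges.append((last_writer[d], name))
        for d in writes:
            last_writer[d] = name             # this task is now the last writer
    return edges

# Task names mimic the tile kernels used later in the talk (illustrative only).
tasks = [
    ("potrf(0,0)", [],      ["A00"]),
    ("trsm(1,0)",  ["A00"], ["A10"]),
    ("syrk(1,1)",  ["A10"], ["A11"]),
]
edges = build_dag(tasks)
```

Declaring each task's inputs and outputs is all the application has to do; the runtime infers the DAG from the access pattern, exactly as the bullet above states.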

slide-21
SLIDE 21

Multifrontal QR factorization on multicores
slide-22
SLIDE 22

The multifrontal QR factorization

The multifrontal QR factorization of a sparse matrix A = QR follows the pattern defined by the Cholesky factorization of the associated normal equations ATA = LLT, because of the equivalence of R and LT (up to signs). It shares most of the features of the multifrontal Cholesky algorithm, apart from (most importantly):

  • Frontal matrices are, in general, rectangular (either over- or under-determined)
  • Frontal matrices are fully factorized
  • Contribution blocks are stacked, not summed
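The stated equivalence between R and the Cholesky factor of the normal equations is easy to check numerically; a minimal NumPy sketch (random test matrix assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))      # over-determined, full column rank

R = np.linalg.qr(A, mode="r")        # upper-triangular R from A = QR
L = np.linalg.cholesky(A.T @ A)      # lower-triangular L from A^T A = L L^T

# R coincides with L^T up to the sign of each row of R.
assert np.allclose(np.abs(R), np.abs(L.T))
```

This is why the symbolic analysis (elimination tree, fill-in) for the QR factorization of A can be derived from the Cholesky analysis of ATA.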

slide-23
SLIDE 23

The multifrontal QR factorization: parallelism

Parallelism comes from two sources:

  • Tree: nodes in separate branches can be treated independently
  • Node: large nodes can be treated by multiple processes

In qr_mumps both sources are exploited consistently, by partitioning the frontal matrices and replacing the elimination tree with a DAG.

slide-24
SLIDE 24

The multifrontal QR factorization: StarPU integration

A StarPU task carries:

  • Code for each processing unit (CPU, GPU, SPU)
  • A priority
  • Inputs 1…n and outputs 1…m

  • Depending on the inputs/outputs, StarPU detects the dependencies among tasks
  • Depending on the availability of resources and the data placement, StarPU decides where to run a task

slide-25
SLIDE 25

The multifrontal QR factorization: StarPU integration

The easy way: replace every call operation1(i1, ..., in, o1, ..., om) with call submit_task(operation1, i1, ..., in, o1, ..., om) and let StarPU do all the work.

slide-26
SLIDE 26

The multifrontal QR factorization: StarPU integration

This is functionally correct, but the DAG may have millions of nodes, which makes the scheduling job too complex and memory consuming.

slide-27
SLIDE 27

The multifrontal QR factorization: StarPU integration

Our approach: we give StarPU a limited view of the DAG; this is achieved by defining tasks that submit other tasks.

slide-28
SLIDE 28

The multifrontal QR factorization: StarPU integration

In the DAG we define:

  • activation tasks, i.e., tasks in charge of allocating the memory and preparing the data structures needed for processing a front

slide-29
SLIDE 29

The multifrontal QR factorization: StarPU integration

  • All the activation tasks are submitted at once, with the right dependencies and very low priority. Each of them submits other tasks with higher priority

  • The runtime handles a DAG whose size is proportional only to the number of fronts that are active at a given moment

  • Tree traversal orders can be identified such that the size of this dynamic DAG is as small as possible, but big enough to feed all the threads
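The activation scheme above can be mimicked with a toy priority-queue scheduler (the scheduler and all names are invented for illustration; real inter-task dependencies are not modeled): low-priority activation tasks are all submitted up front, and each one submits its front's fine-grained tasks at high priority, so the pending pool only ever holds the tasks of currently active fronts.

```python
import heapq

class ToyScheduler:
    def __init__(self):
        self.queue = []          # (priority, seq, fn); lower number runs first
        self.seq = 0

    def submit(self, fn, priority):
        heapq.heappush(self.queue, (priority, self.seq, fn))
        self.seq += 1

    def run(self):
        while self.queue:
            _, _, fn = heapq.heappop(self.queue)
            fn()

sched = ToyScheduler()
log = []

def activate(front, n_tasks):
    # Activation task: "allocate" the front, then submit its fine-grained
    # tasks at high priority (0 beats the activations' 10).
    log.append(f"activate {front}")
    for t in range(n_tasks):
        sched.submit(lambda f=front, t=t: log.append(f"work {f}.{t}"), priority=0)

# All activation tasks are submitted at once, with very low priority.
for front in ["leaf1", "leaf2", "root"]:
    sched.submit(lambda f=front: activate(f, 2), priority=10)
sched.run()
```

Because fine-grained tasks outrank pending activations, each front's work drains before the next front is activated, keeping the live DAG small.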


slide-35
SLIDE 35

Experimental setup

  • Platform:
  • 4× AMD hexacore
  • 76 GB of memory (in 4 NUMA modules)
  • GNU 4.4 compilers
  • MKL 10.2
  • Problems: some relatively small matrices from the UF collection

Matrix        m       n        nnz       flops
degme         185501  659415   8127528   591 G
karted        46502   133115   1770349   258 G
flower_7_4    27693   67593    202218    4261 G
EternityII_E  11077   262144   1503732   544 G
cat_ears_4_4  19020   44448    132888    716 G
tp-6          142752  1014301  11537419  255 G

slide-36
SLIDE 36

Experimental results

[Plots: factorization time (sec.) vs. number of cores (1-24) on the 24-core AMD machine, for the degme and karted matrices, comparing qrm, qrm_starpu, and spqr]

slide-37
SLIDE 37

Experimental results

[Plots: factorization time (sec.) vs. number of cores (1-24) on the 24-core AMD machine, for the flower_7_4 and EternityII_E matrices, comparing qrm, qrm_starpu, and spqr]

slide-38
SLIDE 38

Experimental results

[Plots: factorization time (sec.) vs. number of cores (1-24) on the 24-core AMD machine, for the cat_ears_4_4 and tp-6 matrices, comparing qrm, qrm_starpu, and spqr]

slide-39
SLIDE 39

Experimental results

Execution trace for the degme matrix. Two main problems can be identified:

  • Too much time is spent in task submission (in red). The issue is under investigation

  • At the moment, parent-child dependencies are not finely managed, which means that it is not possible to start working on a node until all of its children are completely factorized

slide-40
SLIDE 40

Multifrontal Cholesky: front factorization on CPU-GPU hybrid systems

slide-41
SLIDE 41

From elimination tree to frontal matrix

Bottom-up traversal of the elimination tree. At each vertex (front):

  • Assembly of the contribution blocks from the children
  • Partial factorization of the frontal matrix

[Diagram: elimination tree]


slide-43
SLIDE 43

Tile Cholesky front factorization

  • Derived from: "A class of parallel tiled linear algebra algorithms for multicore architectures", Buttari et al., Parallel Comput., 2009
  • Extension: partial factorization of NPiv variables and computation of the Schur complement
  • Use of a runtime system: StarPU

[Diagram: elimination tree and frontal matrix (zoom), with the NPiv fully-summed variables highlighted]

slide-44
SLIDE 44

Tile Cholesky front factorization, NB|NPiv case

  • NB: tile size
  • NPiv: number of pivots

[Diagram: front partitioned into NB×NB tiles; at step k, the up-to-date data, current panel, trailing submatrix, and symmetric tiles are highlighted, together with the kernels applied]

  • NB: tile size (as above)
  • NPiv: number of pivots (as above)
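A minimal NumPy sketch of this tiled partial factorization for the NB | NPiv case (the function name and layout are assumptions; the kernel calls are labeled with their BLAS/LAPACK analogues). Only the first NPiv columns are factored; the remaining tiles are updated but never factored, so they end up holding the Schur complement, i.e., the contribution block:

```python
import numpy as np

def tile_partial_cholesky(A, NB, NPiv):
    """Right-looking tiled partial Cholesky: factor the first NPiv
    rows/columns in panels of width NB, updating the trailing tiles."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for k in range(0, NPiv, NB):                  # panels of pivots only
        kk = slice(k, k + NB)
        A[kk, kk] = np.linalg.cholesky(A[kk, kk])              # potrf
        for i in range(k + NB, n, NB):
            ii = slice(i, i + NB)
            A[ii, kk] = np.linalg.solve(A[kk, kk], A[ii, kk].T).T  # trsm
        for i in range(k + NB, n, NB):
            ii = slice(i, i + NB)
            A[ii, ii] -= A[ii, kk] @ A[ii, kk].T               # syrk
            for j in range(k + NB, i, NB):
                jj = slice(j, j + NB)
                A[ii, jj] -= A[ii, kk] @ A[jj, kk].T           # gemm
    return A   # lower triangle: factor (first NPiv cols) + Schur complement
```

Each kernel call touches one tile, which is what gives the fine granularity and high concurrency exploited by the runtime.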


slide-48
SLIDE 48

Task flow

  • Fine granularity
  • High concurrency
  • Out-of-order execution

Kernels: potrf, trsm, syrk, gemm

slide-49
SLIDE 49

Multicore architecture

  • 12 cores: dual-socket hexa-core 2.67 GHz Intel Xeon X5650 (Westmere)
  • 12 MB L3 cache
  • 36 GB RAM

slide-50
SLIDE 50

Performance (multicore architecture)

[Plots: performance as a function of the front size N]

slide-51
SLIDE 51

Cholesky front factorization, general case (any NPiv)

[Diagram: same tiled layout as before, for an NPiv that is not a multiple of the tile size NB]


slide-53
SLIDE 53

Cholesky front factorization, general case (any NPiv)

Kernels: potrf, trsm, syrk, gemm, plus the variants potrf_sc and trsm_sc

slide-54
SLIDE 54

Performance (multicore architecture)

[Plots: performance as a function of the front size N]

slide-55
SLIDE 55

CPU-GPU hybrid architecture

  • 12 cores: dual-socket hexa-core 2.67 GHz Intel Xeon X5650 (Westmere)
  • 12 MB L3 cache
  • 36 GB RAM
  • 3 NVIDIA Tesla M2070 (Fermi) GPUs

slide-56
SLIDE 56

Performance (CPU-GPU hybrid architecture) - N=40K


slide-57
SLIDE 57

Execution trace (CPU-GPU hybrid architecture)

[Execution trace; legend: task execution, idle, active waiting]

slide-58
SLIDE 58

Conclusion, perspectives

slide-59
SLIDE 59

Conclusion

  • Preliminary work towards a functioning multifrontal code
  • Not as efficient (for now) as codes designed for a specific architecture
  • Performance portability across architectures
  • Dense kernels to be released in the MAGMA library

slide-60
SLIDE 60

Future work

  • Robust solver (Cholesky assembly step, solution steps)
  • Memory consumption (progressive task activation)
  • Cluster of heterogeneous nodes
  • Investigate explicit data dependencies management (DAGuE)


slide-61
SLIDE 61


Thanks!


slide-62
SLIDE 62

Appendix

slide-63
SLIDE 63

Validation (step 1)

  • Factorization on nelim



slide-65
SLIDE 65

Validation (step 1)

  • Update the trailing submatrix



slide-67
SLIDE 67

Validation (step 2)

  • Finish the factorization of the "nelim" tile


slide-68
SLIDE 68

Validation (step 2)

  • Update the rest of the matrix


slide-69
SLIDE 69

Validation (step 2)

  • Factorize the rest of the matrix, as in the case where NB divides nelim (NB | nelim)
