 
6th AsHES workshop, May 26th, 2016, Chicago, USA
Efficiency of general Krylov methods on GPUs – An experimental study
H. Anzt, M. Kreutzer, M. Koehler, G. Wellein, J. Dongarra
Piotr Luszczek
Solving large sparse linear systems on GPUs
http://blog.heltontool.com/category/tools/
Large variety of iterative methods
• Krylov solvers work well for many problems
• Efficiency depends on problem characteristics
  • eigenvalue distribution
  • diagonal dominance
  • definiteness
• Scenario: Problem characteristics are not known.
• Black-Box S
The Shotgun Approach
Run multiple Krylov solvers simultaneously as a poly-iterative method
• Theoretical benefits
  • benefit from the fastest convergence
  • drop solvers that break down
• Computational benefits
  • Runtime overhead small for solvers with similar structure
  • SpMM replaces SpMV to generate multiple Krylov subspaces
  • Interleaving global communication for low synchronization count
  • Enhanced fault-tolerance
• Limitation: Solvers are required to have similar structure (SpMV/reduction)
Barrett et al.: Algorithmic bombardment for the iterative solution of linear systems: A poly-iterative approach, Journal of Computational and Applied Mathematics 74, 1996.
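The "SpMM replaces SpMV" point can be illustrated with a minimal sketch. It assumes a CSR storage layout and plain Python lists; the function name `csr_spmm` and the 3x3 toy matrix are hypothetical and are not taken from the talk or from MAGMA. Each matrix entry is read once and applied to the direction vectors of all bundled solvers, which is where the saving over separate SpMV calls would come from.

```python
def csr_spmm(indptr, indices, data, X):
    """Multiply a CSR matrix by a block of vectors.

    X holds one row per matrix column and one entry per solver,
    so X[i][j] is component i of the vector belonging to solver j.
    """
    n_rows = len(indptr) - 1
    n_vecs = len(X[0])
    Y = [[0.0] * n_vecs for _ in range(n_rows)]
    for row in range(n_rows):
        for k in range(indptr[row], indptr[row + 1]):
            a = data[k]              # matrix entry is loaded once ...
            x_row = X[indices[k]]
            for j in range(n_vecs):  # ... and reused for every solver
                Y[row][j] += a * x_row[j]
    return Y


# Toy matrix [[4, 1, 0], [0, 3, 2], [1, 0, 5]] in CSR form (illustrative only).
indptr = [0, 2, 4, 6]
indices = [0, 1, 1, 2, 0, 2]
data = [4.0, 1.0, 3.0, 2.0, 1.0, 5.0]

# Column 0: vector of a first solver, column 1: vector of a second solver.
X = [[1.0, 0.0],
     [0.0, 1.0],
     [2.0, 1.0]]

Y = csr_spmm(indptr, indices, data, X)
```

Column `j` of `Y` equals the product of the matrix with the vector of solver `j`, so the result matches two independent SpMV calls while traversing the matrix only once.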
Contribution
• Run different Krylov methods on a large number of test matrices
• Analyze with different target metrics: Convergence, SpMV, Runtime
• Non-symmetric test matrices from University of Florida Matrix Collection
  • 1,000 < n < 5,000,000; nnz < 100,000,000
  • At least one of the considered methods converges within 2n SpMV
  • 94 non-symmetric test matrices in total
[Figure: two histograms of matrix count, one over matrix size (10^3 to 10^6) and one over number of nonzeros (10^4 to 10^7)]
Experiment setup
• libufget
  • C-interface to access matrices at UFMC
  • Max Planck Institute for Dynamics of Complex Technical Systems
• MAGMA
  • Accelerator-focused linear algebra software library
  • Dense and sparse linear algebra routines, solvers, eigensolvers
  • We choose: BiCGSTAB, CGS, QMR, IDR(2), IDR(4), IDR(8)
  • University of Tennessee
• NVIDIA K40 GPU
  • 1,682 GFlop/s (double precision)
  • 12 GB; 288 GB/s (theoretical), 193 GB/s (experimentally)
  • CUDA v. 7.5
• Solver setting
  • Solve: Ax = b for b ≡ 1 starting with x ≡ 0
  • Relative residual stopping criterion: 10^-10 |b|
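The solver setting above can be written down as a small sketch. The slide does not say which norm |b| denotes, so the Euclidean norm used here is an assumption, and the helper names `initial_setup` and `has_converged` are hypothetical. With x ≡ 0 the initial residual r = b - Ax equals b, so the criterion measures the reduction relative to the starting residual.

```python
import math


def norm2(v):
    """Euclidean norm (assumed; the slide only writes |b|)."""
    return math.sqrt(sum(x * x for x in v))


def initial_setup(n):
    """Right-hand side of all ones and zero initial guess."""
    b = [1.0] * n
    x0 = [0.0] * n
    return b, x0


def has_converged(residual, rhs, rel_tol=1e-10):
    """Relative residual stopping criterion: |r| <= rel_tol * |b|."""
    return norm2(residual) <= rel_tol * norm2(rhs)


b, x0 = initial_setup(4)
start = has_converged(b, b)                 # initial residual equals b
almost = has_converged([1e-9] * 4, b)       # still above the threshold
done = has_converged([1e-11] * 4, b)        # below the threshold
```

A solver loop would call `has_converged` once per iteration with the current residual vector and stop as soon as it returns `True`.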
Solver Robustness – The Convergence Metric
[Figure: bar chart of matrix count (0 to 100) for BiCGSTAB, CGS, QMR, IDR(2), IDR(4), IDR(8); each bar is split into "Convergence - fastest solver" and "Convergence - not fastest solver"]
The Shotgun Approach
Run multiple Krylov solvers simultaneously as a poly-iterative method
• Original work (Barrett et al., 1996): poly-iterative solver with BiCGSTAB, QMR, CGS
• IDR(s) structurally different, hard to combine in simultaneous fashion
Solver Orthogonality w.r.t. Problem Suitability
• Which methods to include in Multi-Iterative solver?
http://www.icl.utk.edu/~hanzt/solver_ortho/
The Shotgun Approach
Run multiple Krylov solvers simultaneously as a poly-iterative method
• Original work (Barrett et al., 1996): poly-iterative solver with BiCGSTAB, QMR, CGS
• IDR(s) structurally different, hard to combine in simultaneous fashion
• poly-iterative solver converges in 63 of 94 test cases (67%)
• IDR(2) converges for 60 of 94 test cases (64%)
• IDR(4) converges for 67 of 94 test cases (71%)
• IDR(8) converges for 91 of 94 test cases (96%)
Performance – SpMV count and Runtime
[Figure: stacked bars showing the % of test matrices (0 to 100) won by BiCGSTAB, CGS, QMR, IDR(2), IDR(4), IDR(8), one bar per target metric (SpMV, Runtime)]
• SpMV count indicative for performance when using preconditioners
• IDR(8) wins most cases in SpMV metric
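How such a "share of wins per metric" could be computed is sketched below. The function `win_share`, the record layout, and all numbers are hypothetical placeholders, not measurements from the study. A solver that did not converge on a matrix is stored as `None` and cannot win that matrix.

```python
def win_share(results, metric):
    """Percentage of matrices on which each solver is best in `metric`.

    results: matrix name -> {solver name: {"spmv": ..., "runtime": ...} or None}
    Matrices on which no solver converged are skipped.
    """
    wins = {}
    for per_solver in results.values():
        converged = {s: m[metric] for s, m in per_solver.items() if m is not None}
        if not converged:
            continue
        best = min(converged, key=converged.get)  # smallest value wins
        wins[best] = wins.get(best, 0) + 1
    total = sum(wins.values())
    return {s: 100.0 * c / total for s, c in wins.items()}


# Made-up toy data for two solvers on four matrices (illustrative only).
toy = {
    "m1": {"BiCGSTAB": {"spmv": 120, "runtime": 0.10},
           "IDR(8)": {"spmv": 90, "runtime": 0.14}},
    "m2": {"BiCGSTAB": None,
           "IDR(8)": {"spmv": 300, "runtime": 0.50}},
    "m3": {"BiCGSTAB": {"spmv": 80, "runtime": 0.05},
           "IDR(8)": {"spmv": 85, "runtime": 0.09}},
    "m4": {"BiCGSTAB": {"spmv": 200, "runtime": 0.20},
           "IDR(8)": {"spmv": 150, "runtime": 0.25}},
}

spmv_share = win_share(toy, "spmv")
runtime_share = win_share(toy, "runtime")
```

In this toy data the solver with the fewest SpMVs is not always the fastest in wall-clock time, which is the kind of gap between the two metrics the slide contrasts.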
The Price of Robustness
• IDR(8) solves many systems, but often there is a faster solver
• Normalize execution times for each matrix to fastest solver
[Figure: runtime overhead (log scale, 10^0 to 10^1) per test matrix (0 to 30) for BiCGSTAB, CGS, QMR, IDR2, IDR4, IDR8]
The Price of Robustness
• IDR(8) solves many systems, but often there is a faster solver
• Normalize execution times for each matrix to fastest solver
• Take average over all converging configurations
[Figure: bar chart of runtime relative to fastest method (0 to 2.5) for BiCGSTAB, CGS, QMR, IDR(2), IDR(4), IDR(8)]
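The normalization and averaging described on this slide can be sketched as follows. The function name `relative_runtimes` and the timings are hypothetical, chosen only to make the arithmetic easy to follow. For each matrix, every converging solver's time is divided by the fastest converging time, and the ratios are then averaged per solver over the matrices it converged on.

```python
def relative_runtimes(runtimes):
    """Average runtime of each solver relative to the fastest one per matrix.

    runtimes: matrix name -> {solver name: seconds, or None if not converged}
    """
    sums, counts = {}, {}
    for per_solver in runtimes.values():
        ok = {s: t for s, t in per_solver.items() if t is not None}
        if not ok:
            continue  # no solver converged on this matrix
        fastest = min(ok.values())
        for solver, t in ok.items():
            sums[solver] = sums.get(solver, 0.0) + t / fastest
            counts[solver] = counts.get(solver, 0) + 1
    return {s: sums[s] / counts[s] for s in sums}


# Made-up timings in seconds (illustrative only, not results from the study).
timings = {
    "m1": {"CGS": 1.0, "IDR(8)": 2.0},
    "m2": {"CGS": None, "IDR(8)": 3.0},   # CGS did not converge here
    "m3": {"CGS": 4.0, "IDR(8)": 2.0},
}

avg = relative_runtimes(timings)
```

Here CGS averages (1 + 2) / 2 = 1.5 over the two matrices it solved, and IDR(8) averages (2 + 1 + 1) / 3 over all three. A value of 1.0 would mean the solver was the fastest on every matrix it converged on.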
Summary
• IDR(s) is a very robust solver.
• Robustness increases with shadow space dimension s.
• IDR(8) solves 91 of 94 test problems (96% success).
• For converging combinations, CGS, QMR, or BiCGSTAB are often faster.
• On average, IDR(8) is less than twice slower than the fastest method.
Future work
• Relate solver success to the problem origins.
• Enhance solvers with preconditioning.
• Target other architectures (Xeon Phi, low-power & embedded devices).
The authors would like to acknowledge support from the U.S. Department of Energy, the German Research Foundation (DFG) through the Priority Program 1648, and NVIDIA. The authors would also like to thank Daniel B. Szyld for sharing his knowledge of Krylov methods.