Presentation: Cyclic Reduction Type Poisson and Helmholtz Solvers on - PDF document

See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/341347588 Presentation: Cyclic Reduction Type Poisson and Helmholtz Solvers on a GPU Presentation · February 2014 CITATIONS READS 0 3 3 authors , including: Mirko Myllykoski Jari Toivanen Umeå University University of Jyväskylä 40 PUBLICATIONS 69 CITATIONS 115 PUBLICATIONS 1,892 CITATIONS SEE PROFILE SEE PROFILE Some of the authors of this publication are also working on these related projects: PhD Thesis View project NLAFET View project All content following this page was uploaded by Mirko Myllykoski on 13 May 2020. The user has requested enhancement of the downloaded file.

Cyclic Reduction Type Poisson and Helmholtz Solvers on a GPU Mirko Myllykoski 1 , 2 Tuomo Rossi 3 Jari Toivanen 3 , 4 February 16, 2014 1Department of Mathematical Information Technology, University of Jyv¨ askyl¨ a, P.O. Box 35 (Agora), FI-40014 University of Jyv¨ askyl¨ a, Finland, E-mail: mirko.myllykoski@jyu.fi 2The research of the first author was supported by the Academy of Finland, grant #252549 3Department of Mathematical Information Technology, University of Jyv¨ askyl¨ a 4Department of Aeronautics and Astronautics, Stanford University

Introduction The topic of the presentation are the block cyclic reduction type linear system solvers and how they can be applied to the Poisson and Helmholtz equations on a GPU. The presented implementations are based on a method called radix-q PSCR. A total of three implementations are presented: a simplified radix-2 method, a simplified radix-4 method and a generalized radix-4 method. The presentation will focus on the GPU implementations and the obtained numerical results.

PSCR method PSCR stands for Partial Solution variant of the Cyclic Reduction method. A cyclic reduction type recursive method for block tridiagonal linear systems. A conventional cyclic reduction algorithm operates by halving the size of the linear system at each reduction step. In other words, the algorithm is a radix-2 method. The PSCR method can be formulated in a generalized radix-q framework. The first (radix-2) formulation was presented in 80’s 5 . Further developmental done later in the decade 6 . A more generalized radix-q algorithm was formulated in 1996 7 . 5Vassilevski P. Fast algorithm for solving a linear algebraic problem with separable variables. Comptes Rendus de Academie Bulgare des Sciences 1984; 37:305–308 6Kuznetsov YA. Numerical methods in subspaces. Vychislitel’-nye Processy i Sistemy II 1985; 37:265–350. 7Kuznetsov YA, Rossi T. Fast direct method for solving algebraic systems with separable symmetric band matrices. East-West Journal of Numerical Mathematics 1996; 4:53–68.

PSCR method: radix-2 vs radix-4 Radix-number: 2 ==> ==> ==> = = = = Radix-number: 4 Eliminated row Updated row ==> = =

PSCR method: Kronecker tensor product Let A ∈ K n 1 × m 1 and B ∈ K n 2 × m 2 . The Kronecker matrix tensor product is defined as follows   a 1 , 1 B a 1 , 2 B a 1 , m 1 B . . . a 2 , 1 B a 2 , 2 B a 2 , m 1 B . . .    ∈ K n 1 n 2 × m 1 m 2 . A ⊗ B =  . . .  ... . . .   . . .  a n 1 , 1 B a n 1 , 2 B a n 1 , m 1 B . . . The PSCR method can be applied to the following linear system [ A 1 ⊗ M 2 + M 1 ⊗ A 2 + c ( M 1 ⊗ M 2 )] u = f , where the factor matrices A 1 , M 1 ∈ K n 1 × n 1 and A 2 , M 2 ∈ K n 2 × n 2 are symmetric and tridiagonal.

PSCR method: Partial solution technique The PSCR method uses so-called partial solution technique 8 , 9 . The technique requires an initialization stage comprising from generalized eigenvalue problems. The actual solution process involves the solution of multiple sub-problems of the form � c j ( A 2 + α j M 2 ) − 1 ˆ u = ˆ f . j In total the PSCR method generates approximately O ( n 1 log n 1 ) tridiagonal sub-problems. 8Banegas A. Fast Poisson solvers for problems with sparsity. Mathematics of Computation 1978; 32:441–446 9Kuznetsov YA, Matsokin AM. On partial solution of systems of linear algebraic equations. Soviet Journal of Numerical Analysis and Mathematical Modelling 1978; 4:453–468

Implementation A total of three implementations were implemented: a simplified radix-2 method a simplified radix-4 method and a generalized radix-4 method. The GPU implementations utilize the OpenCL framework. The implementations consist mostly of scalar-vector multiplication and vector-vector additions/summations, which can be implemented trivially. The tridiagonal sub-problems and the large vector summations require some additional attention.

Implementation: Simplified implementations (1/2) The primary purpose of the simplified implementations was to test different ideas related to the formulation and the implementation. The implementations can be applied when the coefficient matrix is of the form A ⊗ I n 2 + I n 1 ⊗ ( D − 2 I n 2 ) , where A = tridiag {− 1 , 2 , − 1 } ∈ R n 1 × n 1 and D = tridiag { t , d , t } ∈ R n 2 × n 2 . In addition, it is assumed that n 1 = 2 k 1 − 1 and n 2 = 2 k 2 − 1 for some positive integers k 1 and k 2 .

Implementation: Simplified implementations (2/2) Utilizes a 3-stage tridiagonal system solver. All stages utilize scalar cyclic reduction and the right hand side is permuted after each recursion step. See earlier work 10 for more details. A slightly more generalized tridiagonal solver would allow that the matrix D can be chosen more freely. Does not require an initialization stage because the eigenvalues and eigenvectors are known in a closed form. 10Myllykoski M, Rossi T, Toivanen J. Fast Poisson solvers for graphics processing units. In Applied Parallel and Scientific Computing, Lecture Notes in Computer Science, Vol. 7782, Manninen P, ¨ Oster P (eds). Springer Berlin Heidelberg: Berlin, Germany, 2013; 265–279

Implementation: Generalized implementation Uses a generalized 4-stage tridiagonal system solver. Very similar to the one used in the simplified implementations but includes an additional stage which utilizes parallel cyclic reduction 11 . Uses the CPU to solve the generalized eigenvalue problems. This is not a major limitation because these methods are usually used in situations in which we must solve a large set of problems with nearly identical coefficient matrices. The matrix M 1 must be positive-definite. Includes a support for complex valued problems. 11Hockney R. W., Jesshope C. R. Parallel Computers: Architecture, Programming and Algorithm. Hilger, 1981

Numerical results The CPU test were carried out using a quad-core Intel Xeon X5550 2.66 GHz processor (only one core used in practice). The GPU test were carried out using a nVidia GeForce GTX580 videocard with 512 cuda cores. The CPU implementations were written using C and fortran. The GPU implementations utilize the OpenCL framework. All test were done using double precision floating point arithmetic.

Numerical results: Model problem A Model problem A: A two (or three) dimensional Poisson problem with homogeneous Dirichlet boundary conditions posed on a rectangle. Discrete form: Solve u ∈ R NN from       D − I u 1 f 1 ...       − I D u 2 f 2       = ,    .   .  ... ... . .       − I . .       − I D u N f N where D = tridiag {− 1 , 4 , − 1 } ∈ R N × N , when f ∈ R NN is given.

Numerical results: Model problem B Model problem B: A two dimensional Helmholtz problem with Perfectly Matched Layer (PML) or Absorbing Boundary Conditions (ABC) posed on a rectangle. Discrete form: Solve u ∈ C NN from [( A 1 − ω 2 M 1 ) ⊗ M 2 + M 1 ⊗ A 2 ] u = f , where ω is the wave number, when f ∈ C NN is given. Here the matrices A 1 , A 2 , M 1 , M 2 ∈ C N × N depend on the chosen boundary conditions 12 . 12See, e.g., Heikkola E., Rossi T., Toivanen J. Fast direct solution of the Helmholtz equation with a perfectly matched layer or an absorbing boundary conditions. International Journal of Numerical Methods is Engineering 2003; 57:2007–2025

Numerical results: Speedup gained when using a GPU (simplified radix-4) 25 Poisson 2d Poisson 3d 20 cpu run time / gpu run time 15 10 5 0 10 100 1000 10000 N^(1/2) or N^(1/3)

Numerical results: Speedup gained when using a GPU (generalized radix-4) 12 Poisson 11 Helmholtz 10 cpu run time / gpu run time 9 8 7 6 5 4 3 2 1 100 1000 10000 N^(1/2)

Numerical results: Other findings Example: 2D Poisson problem, n 1 = n 2 = 1023: Simplified radix-4 CPU: 0.082556s Simplified radix-4 GPU: 0.005389s Generalized radix-4 CPU: 0.196758s Generalized radix-4 GPU: 0.030405s Increased radix-number offers some additional speed-up, especially when the problem size is small 13 . The GPU implementation did not introduce any noticeable additional numerical error. The most important elements effecting the performance were the tridiagonal solver and the large vector summations. 13Myllykoski M, Rossi T, Toivanen J. Fast Poisson solvers for graphics processing units. In Applied Parallel and Scientific Computing, Lecture Notes in Computer Science, Vol. 7782, Manninen P, ¨ Oster P (eds). Springer Berlin Heidelberg: Berlin, Germany, 2013; 265–279

Presentation: Cyclic Reduction Type Poisson and Helmholtz Solvers on - PDF document

See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/341347588 Presentation: Cyclic Reduction Type Poisson and Helmholtz Solvers on a GPU Presentation February 2014 CITATIONS READS 0 3

Poisson Distribution: Review Poisson Over Time Let B 1 Poisson( ) be the number of bikes

Homotopy theory of Segal cyclic operads Philip Hackney, Marcy Robertson, Donald Yau Cyclic

Type Checking Grammar Rule Semantic Rule var-decl id : type-exp Insert (id.name, type-exp .

Randomness in Computing L ECTURE 14 Last time Poisson distribution Poisson approximation

ECEN 5682 Theory and Practice of Error Control Codes Cyclic Codes Peter Mathys University of

A study of cyclic codes BCH and Reed-Solomon code Welington Santos UFPR January 2015 Cyclic

Week 9 Difference Equations Discrete Math April 23, 2020 Marie Demlova: Discrete Math Cyclic

Simulating events: the Poisson process Michel Bierlaire michel.bierlaire@epfl.ch Transport and

Poisson Regression Models for Count Data Outline Review Introduction to Poisson

15. Poisson Processes In Lecture 4, we introduced Poisson arrivals as the limiting behavior of

X-RAY SPECTRAL WORKSHOP 2019 POISSON STATISTICS WITH BACKGROUNDS Background measurement Image

Workshop 10.6a: Poisson regression Murray Logan 12 Sep 2016 Section 1 Poisson regression

Carina Haupt @caha42 carina.haupt@dlr.de German Aerospace Center (DLR) www.helmholtz.de The

Computational complexity of lattice problems and cyclic lattices Lenny Fukshansky Claremont

On the various definitions of cyclic operads Category Theory 2015, Aveiro Pierre-Louis Curien and

Holography via Dynamic Cyclic Spectroscopy Paul Demorest Willem van Straten (NRAO) (Auckland

RADIX SORT Parosh Aziz Abdulla Uppsala University September 21, 2008 Parosh Aziz Abdulla

Game Development Essentials: An Introduction Third Edition Chapter 7 Levels creating the world

The Rendering Equation Direct ( local ) illumination Light directly from light sources

Who are we? ITs that simple FieldEngineer.EU provides services on different We boost

Route Server Automation and ROV Nick Pratley nick@ix.asn.au Life Under Lockdown: how to stop

Lessons Learned Designing an Open Source UMPC Ben Goska and Tim Harder Oregon State University

Parallel and Memory-efficient Preprocessing for Metagenome Assembly Vasudevan Rengasamy Paul

1. Introduction Population projections are perhaps the most widely demanded product of national