Presentation: Cyclic Reduction Type Poisson and Helmholtz Solvers on - - PDF document

presentation cyclic reduction type poisson and helmholtz
SMART_READER_LITE
LIVE PREVIEW

Presentation: Cyclic Reduction Type Poisson and Helmholtz Solvers on - - PDF document

See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/341347588 Presentation: Cyclic Reduction Type Poisson and Helmholtz Solvers on a GPU Presentation February 2014 CITATIONS READS 0 3


slide-1
SLIDE 1

See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/341347588

Presentation: Cyclic Reduction Type Poisson and Helmholtz Solvers on a GPU

Presentation · February 2014

CITATIONS READS

3

3 authors, including: Some of the authors of this publication are also working on these related projects: PhD Thesis View project NLAFET View project Mirko Myllykoski Umeå University

40 PUBLICATIONS 69 CITATIONS

SEE PROFILE

Jari Toivanen University of Jyväskylä

115 PUBLICATIONS 1,892 CITATIONS

SEE PROFILE

All content following this page was uploaded by Mirko Myllykoski on 13 May 2020.

The user has requested enhancement of the downloaded file.

slide-2
SLIDE 2

Cyclic Reduction Type Poisson and Helmholtz Solvers on a GPU

Mirko Myllykoski1,2 Tuomo Rossi3 Jari Toivanen3,4 February 16, 2014

1Department of Mathematical Information Technology, University of Jyv¨ askyl¨ a, P.O. Box 35 (Agora), FI-40014 University of Jyv¨ askyl¨ a, Finland, E-mail: mirko.myllykoski@jyu.fi 2The research of the first author was supported by the Academy of Finland, grant #252549 3Department of Mathematical Information Technology, University of Jyv¨ askyl¨ a 4Department of Aeronautics and Astronautics, Stanford University

slide-3
SLIDE 3

Introduction

The topic of the presentation are the block cyclic reduction type linear system solvers and how they can be applied to the Poisson and Helmholtz equations on a GPU. The presented implementations are based on a method called radix-q PSCR. A total of three implementations are presented: a simplified radix-2 method, a simplified radix-4 method and a generalized radix-4 method. The presentation will focus on the GPU implementations and the obtained numerical results.

slide-4
SLIDE 4

PSCR method

PSCR stands for Partial Solution variant of the Cyclic Reduction method. A cyclic reduction type recursive method for block tridiagonal linear systems. A conventional cyclic reduction algorithm operates by halving the size of the linear system at each reduction step. In other words, the algorithm is a radix-2 method. The PSCR method can be formulated in a generalized radix-q framework. The first (radix-2) formulation was presented in 80’s5. Further developmental done later in the decade6. A more generalized radix-q algorithm was formulated in 19967.

5Vassilevski P. Fast algorithm for solving a linear algebraic problem with separable variables. Comptes Rendus de Academie Bulgare des Sciences 1984; 37:305–308 6Kuznetsov YA. Numerical methods in subspaces. Vychislitel’-nye Processy i Sistemy II 1985; 37:265–350. 7Kuznetsov YA, Rossi T. Fast direct method for solving algebraic systems with separable symmetric band

  • matrices. East-West Journal of Numerical Mathematics 1996; 4:53–68.
slide-5
SLIDE 5

PSCR method: radix-2 vs radix-4

= = =

==> ==>

= =

==>

Radix-number: 2 Radix-number: 4 Eliminated row Updated row

=

==>

slide-6
SLIDE 6

PSCR method: Kronecker tensor product

Let A ∈ Kn1×m1 and B ∈ Kn2×m2. The Kronecker matrix tensor product is defined as follows A ⊗ B =      a1,1B a1,2B . . . a1,m1B a2,1B a2,2B . . . a2,m1B . . . . . . ... . . . an1,1B an1,2B . . . an1,m1B      ∈ Kn1n2×m1m2. The PSCR method can be applied to the following linear system [A1 ⊗ M2 + M1 ⊗ A2 + c(M1 ⊗ M2)]u = f , where the factor matrices A1, M1 ∈ Kn1×n1 and A2, M2 ∈ Kn2×n2 are symmetric and tridiagonal.

slide-7
SLIDE 7

PSCR method: Partial solution technique

The PSCR method uses so-called partial solution technique8,9. The technique requires an initialization stage comprising from generalized eigenvalue problems. The actual solution process involves the solution of multiple sub-problems of the form ˆ u =

  • j

cj(A2 + αjM2)−1ˆ f . In total the PSCR method generates approximately O(n1 log n1) tridiagonal sub-problems.

8Banegas A. Fast Poisson solvers for problems with sparsity. Mathematics of Computation 1978; 32:441–446 9Kuznetsov YA, Matsokin AM. On partial solution of systems of linear algebraic equations. Soviet Journal of Numerical Analysis and Mathematical Modelling 1978; 4:453–468

slide-8
SLIDE 8

Implementation

A total of three implementations were implemented:

a simplified radix-2 method a simplified radix-4 method and a generalized radix-4 method.

The GPU implementations utilize the OpenCL framework. The implementations consist mostly of scalar-vector multiplication and vector-vector additions/summations, which can be implemented trivially. The tridiagonal sub-problems and the large vector summations require some additional attention.

slide-9
SLIDE 9

Implementation: Simplified implementations (1/2)

The primary purpose of the simplified implementations was to test different ideas related to the formulation and the implementation. The implementations can be applied when the coefficient matrix is of the form A ⊗ In2 + In1 ⊗ (D − 2In2), where A = tridiag{−1, 2, −1} ∈ Rn1×n1 and D = tridiag{t, d, t} ∈ Rn2×n2. In addition, it is assumed that n1 = 2k1 − 1 and n2 = 2k2 − 1 for some positive integers k1 and k2.

slide-10
SLIDE 10

Implementation: Simplified implementations (2/2)

Utilizes a 3-stage tridiagonal system solver. All stages utilize scalar cyclic reduction and the right hand side is permuted after each recursion step. See earlier work10 for more details. A slightly more generalized tridiagonal solver would allow that the matrix D can be chosen more freely. Does not require an initialization stage because the eigenvalues and eigenvectors are known in a closed form.

10Myllykoski M, Rossi T, Toivanen J. Fast Poisson solvers for graphics processing units. In Applied Parallel and Scientific Computing, Lecture Notes in Computer Science, Vol. 7782, Manninen P, ¨ Oster P (eds). Springer Berlin Heidelberg: Berlin, Germany, 2013; 265–279

slide-11
SLIDE 11

Implementation: Generalized implementation

Uses a generalized 4-stage tridiagonal system solver. Very similar to the one used in the simplified implementations but includes an additional stage which utilizes parallel cyclic reduction11. Uses the CPU to solve the generalized eigenvalue problems. This is not a major limitation because these methods are usually used in situations in which we must solve a large set of problems with nearly identical coefficient matrices. The matrix M1 must be positive-definite. Includes a support for complex valued problems.

11Hockney R. W., Jesshope C. R. Parallel Computers: Architecture, Programming and Algorithm. Hilger, 1981

slide-12
SLIDE 12

Numerical results

The CPU test were carried out using a quad-core Intel Xeon X5550 2.66 GHz processor (only one core used in practice). The GPU test were carried out using a nVidia GeForce GTX580 videocard with 512 cuda cores. The CPU implementations were written using C and fortran. The GPU implementations utilize the OpenCL framework. All test were done using double precision floating point arithmetic.

slide-13
SLIDE 13

Numerical results: Model problem A

Model problem A: A two (or three) dimensional Poisson problem with homogeneous Dirichlet boundary conditions posed on a rectangle. Discrete form: Solve u ∈ RNN from       D −I −I D ... ... ... −I −I D             u1 u2 . . . uN       =       f1 f2 . . . fN       , where D = tridiag{−1, 4, −1} ∈ RN×N, when f ∈ RNN is given.

slide-14
SLIDE 14

Numerical results: Model problem B

Model problem B: A two dimensional Helmholtz problem with Perfectly Matched Layer (PML) or Absorbing Boundary Conditions (ABC) posed on a rectangle. Discrete form: Solve u ∈ CNN from [(A1 − ω2M1) ⊗ M2 + M1 ⊗ A2]u = f , where ω is the wave number, when f ∈ CNN is given. Here the matrices A1, A2, M1, M2 ∈ CN×N depend on the chosen boundary conditions12.

12See, e.g., Heikkola E., Rossi T., Toivanen J. Fast direct solution of the Helmholtz equation with a perfectly matched layer or an absorbing boundary conditions. International Journal of Numerical Methods is Engineering 2003; 57:2007–2025

slide-15
SLIDE 15

Numerical results: Speedup gained when using a GPU (simplified radix-4)

5 10 15 20 25 10 100 1000 10000 cpu run time / gpu run time N^(1/2) or N^(1/3) Poisson 2d Poisson 3d

slide-16
SLIDE 16

Numerical results: Speedup gained when using a GPU (generalized radix-4)

1 2 3 4 5 6 7 8 9 10 11 12 100 1000 10000 cpu run time / gpu run time N^(1/2) Poisson Helmholtz

slide-17
SLIDE 17

Numerical results: Other findings

Example: 2D Poisson problem, n1 = n2 = 1023:

Simplified radix-4 CPU: 0.082556s Simplified radix-4 GPU: 0.005389s Generalized radix-4 CPU: 0.196758s Generalized radix-4 GPU: 0.030405s

Increased radix-number offers some additional speed-up, especially when the problem size is small13. The GPU implementation did not introduce any noticeable additional numerical error. The most important elements effecting the performance were the tridiagonal solver and the large vector summations.

13Myllykoski M, Rossi T, Toivanen J. Fast Poisson solvers for graphics processing units. In Applied Parallel and Scientific Computing, Lecture Notes in Computer Science, Vol. 7782, Manninen P, ¨ Oster P (eds). Springer Berlin Heidelberg: Berlin, Germany, 2013; 265–279

slide-18
SLIDE 18

Conclusions

The block cyclic reduction algorithm offers a sufficient amount of fine-grained parallelism when combined with scalar cyclic reduction. Increased radix-number offers some additional speed-up, especially when the problem size of small. Thus, it is recommenced to increase the radix when possible. Complex valued problems seem to be particularly suitable for a GPU.

View publication stats View publication stats