Finite Element Multigrid Solvers for PDE Problems on GPUs and GPU Clusters

Robert Strzodka (Integrative Scientific Computing, Max-Planck-Institut Informatik) · Dominik Göddeke (Institute for Applied Mathematics, Technical University of Dortmund)


SLIDE 1

Finite Element Multigrid Solvers for PDE Problems

  • on GPUs and GPU Clusters

Robert Strzodka, Integrative Scientific Computing, Max-Planck-Institut Informatik, www.mpi-inf.mpg.de/~strzodka
Dominik Göddeke, Institute for Applied Mathematics, Technical University of Dortmund, www.mathematik.tu-dortmund.de/~goeddeke

SLIDE 2
  • Structure of Double Lecture (2 × 90 min)

PART 1: Parallelism · Grid Discretization · Multigrid & Smoothers · Mixed-Precision · Data Layout
PART 2: FEM on GPU Clusters · New MPI Application (Rewrite) · Legacy MPI Code (Accelerate)

SLIDE 3
  • Overview

  • Levels of Parallelism
  • Grid Discretizations of PDEs
  • Multigrid and Strong Smoothers
  • Mixed Precision Iterative Refinement
  • Layout of Multi-valued Data

SLIDE 4
  • Parallelism

Sequential execution is an illusory software concept:

All transistors always do something in parallel!

Billions of transistors in modern CPUs (>0.5B) and GPUs (>2B).

Old: implicit parallelism with caches, ILP, speculation

  • diminishing returns, power constraints

New: explicit parallelism on multiple levels

  • much more efficient & natural
SLIDE 5
  • SIMD Parallelism

It is impossible to execute just one instruction (add, nop, nop, nop, …): the hardware always issues a whole SIMD group.

Penalty for ignoring SIMD: scalar code occupies one lane of each SIMD unit and leaves the rest idle, as sketched below.

SLIDE 6
  • Many-Core Parallelism

[Block diagram: Host and Input Assembler feed the Execution Manager; many cores, each with its own cache and local memory, reach Global Memory through load/store units]

Penalty for ignoring many-cores: code confined to a single core leaves all the other cores idle, as sketched below.

SLIDE 7
  • Intra-Node Parallelism (multiple CPUs/GPUs per PC)

Penalty for ignoring intra-node parallelism: only one of the node's processors contributes to the computation.

SLIDE 8
  • Inter-Node Parallelism in a Cluster

Penalty for ignoring inter-node parallelism: the application remains limited to a single node of the cluster.

SLIDE 9
  • Bandwidth in a CPU-GPU System

[Diagram: bandwidths in a CPU-GPU system — 2 TB/s and 200 GB/s on the GPU (on-chip and device memory), 40 GB/s and 20 GB/s on the CPU side, 4 GB/s and 2 GB/s across the PCIe and network links]

SLIDE 10
  • Overview

  • Levels of Parallelism
  • Grid Discretizations of PDEs
  • Multigrid and Strong Smoothers
  • Mixed Precision Iterative Refinement
  • Layout of Multi-valued Data

SLIDE 11

  • Generalized Poisson Problem

−∇·(σ∇u) = f in Ω, with values of u or of the flux σ ∂u/∂ν prescribed on the boundary

SLIDE 12
  • Discretization Grids

[Figure: two structured grid types whose node locations and neighborhoods follow from the grid structure, giving direct access]

SLIDE 13
  • Discretization Grids

[Figure: two further grid types whose node locations must be stored explicitly]

SLIDE 14
  • nD Arrays

[Tensor-product grids stored as nD arrays: direct access, simple index arithmetic, precious data locality]

SLIDE 15
  • Deformation Adaptivity

This grid is a tensor-product! It is easier to accelerate in hardware than resolution-adaptive grids. The anisotropy level determines the optimal solver.
SLIDE 16
  • nD Arrays

[Unstructured data in nD arrays: explicit neighbor storage, local access with some data locality]

SLIDE 17
  • Hash

[Hash-based grid storage: arbitrary global access via perfect hashes]

SLIDE 18
  • Tree

[Tree-based grid storage: global access, supports dynamic changes]

SLIDE 19
  • Structured and Unstructured Sparse MatVec

[reference on sparse MatVec, 2009]
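A sketch of a sparse matrix-vector kernel in the ELL format that the FE-gMG slides use later (the layout and padding convention here are my assumptions): each row holds at most K entries, stored column-major so that consecutive threads read consecutive memory.

```cuda
// ELLPACK SpMV: y = A*x for a matrix with at most K nonzeros per row.
// col/val are n-by-K arrays stored column-major; col < 0 marks padding.
__global__ void spmv_ell(int n, int K, const int* col, const float* val,
                         const float* x, float* y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;
    float sum = 0.0f;
    for (int k = 0; k < K; ++k) {
        int c = col[k * n + row];          // coalesced: thread 'row' reads
        if (c >= 0)                        // element 'row' of column slab k
            sum += val[k * n + row] * x[c];
    }
    y[row] = sum;
}
```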

SLIDE 20
  • Overview

  • Levels of Parallelism
  • Grid Discretizations of PDEs
  • Multigrid and Strong Smoothers
  • Mixed Precision Iterative Refinement
  • Layout of Multi-valued Data

SLIDE 21

  • Generalized Poisson Problem

−∇·(σ∇u) = f in Ω, with values of u or of the flux σ ∂u/∂ν prescribed on the boundary

SLIDE 22
  • Discretization Approach

[The weak form of the PDE is discretized with finite elements, yielding a sparse linear system Ax = b]

SLIDE 23
  • Geometric Multigrid Method

Smoothing damps the high-frequency error on the fine grid; the remaining smooth error is corrected on the coarse grid:

d_k = b − A x_k   (defect, fine grid)
A c_k = d_k   (correction, coarse grid)
x_{k+1} = x_k + c_k   (update, fine grid)
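A host-side sketch of the cycle above (the names Level, smooth, defect, restrict_defect, coarse_solve and prolong_add are hypothetical helpers, not the authors' API):

```cuda
struct Level { int n; float *b, *x, *d; };       // one grid level (assumed)
void smooth(Level&);                             // a few smoother sweeps
void defect(Level&);                             // d = b - A x
void restrict_defect(const Level& fine, Level& coarse); // coarse.b = R * fine.d
void coarse_solve(Level&);                       // recurse or solve exactly
void prolong_add(const Level& coarse, Level& fine);     // fine.x += P * coarse.x

// One two-grid correction step; applied recursively it becomes a V-cycle.
void twogrid(Level& fine, Level& coarse) {
    smooth(fine);                  // pre-smoothing damps high frequencies
    defect(fine);                  // d_k = b - A x_k
    restrict_defect(fine, coarse);
    coarse_solve(coarse);          // A c_k = d_k on the coarse grid
    prolong_add(coarse, fine);     // x_{k+1} = x_k + c_k
    smooth(fine);                  // post-smoothing
}
```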

SLIDE 24
  • Multigrid Transfers: Restriction

Restriction gathers fine-grid values into the coarse array: coarse entry i reads fine neighbors 2i−1, 2i, 2i+1, with indices adjusted at the boundary of the output region.
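A 1D restriction kernel matching the index pattern above (the full-weighting stencil and a fine grid of 2·nc+1 points are my assumptions):

```cuda
// Full weighting: coarse[i] = 0.25*fine[2i-1] + 0.5*fine[2i] + 0.25*fine[2i+1].
__global__ void restrict1d(int nc, const float* fine, float* coarse) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nc) return;
    int f = 2 * i;
    float left = (f > 0) ? fine[f - 1] : 0.0f;   // adjust index at boundary
    coarse[i] = 0.25f * left + 0.5f * fine[f] + 0.25f * fine[f + 1];
}
```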

SLIDE 25
  • Multigrid Transfers: Prolongation

[Prolongation interpolates coarse values back to the fine grid: even fine points copy their coarse parent, odd points average the two neighboring parents]
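The matching 1D prolongation sketch (linear interpolation; again my assumption rather than the slide's exact stencil):

```cuda
// Even fine points copy their coarse parent; odd points average two parents.
__global__ void prolong1d(int nc, const float* coarse, float* fine) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nc) return;
    fine[2 * i] = coarse[i];
    if (i + 1 < nc)
        fine[2 * i + 1] = 0.5f * (coarse[i] + coarse[i + 1]);
}
```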

SLIDE 26
  • Preconditioners

Preconditioned defect correction: x_{k+1} = x_k + ω C⁻¹ (b − A x_k)

The preconditioner C collects parts of A:

  • Jacobi: the diagonal D
  • Gauss-Seidel: D plus the lower-triangular part
  • ADI-TRIDI: the tridiagonal part, coupling each grid line implicitly
  • ADI-TRIGS: the tridiagonal part plus the lower line-to-line coupling, i.e. lines swept in Gauss-Seidel fashion
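Not the authors' code, but a minimal sketch of the simplest case above: one damped-Jacobi defect-correction step for the 5-point Poisson stencil (grid layout and boundary handling are illustrative assumptions).

```cuda
// x_new = x + omega * D^{-1} (b - A x); for the 5-point stencil D = 4*I.
__global__ void jacobi_step(int nx, int ny, float omega,
                            const float* b, const float* x, float* xnew) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < 1 || j < 1 || i >= nx - 1 || j >= ny - 1) return; // fixed boundary
    int id = j * nx + i;
    float Ax = 4.0f * x[id] - x[id-1] - x[id+1] - x[id-nx] - x[id+nx];
    xnew[id] = x[id] + omega * (b[id] - Ax) * 0.25f;
}
```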

SLIDE 27
  • CPU Numerical and Runtime Efficiency

[Chart: numerical and runtime efficiency of the smoothers on the CPU]

SLIDE 28
  • Gauss-Seidel Preconditioner

x_{k+1} = x_k + ω C⁻¹ (b − A x_k), with C_GS = D + L, the diagonal plus the lower-triangular part of A

SLIDE 29
  • Multi-Colored Gauss-Seidel

[Figure: grid nodes partitioned into colors; nodes of one color are mutually independent and can be updated in parallel]
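A two-color (red-black) Gauss-Seidel sketch for the 5-point stencil (my illustration of the multi-coloring idea): nodes of one color depend only on the other color, so each color sweep is fully parallel.

```cuda
__global__ void gs_color(int nx, int ny, int color, const float* b, float* x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < 1 || j < 1 || i >= nx - 1 || j >= ny - 1) return;
    if (((i + j) & 1) != color) return;     // update only this color's nodes
    int id = j * nx + i;
    x[id] = 0.25f * (b[id] + x[id-1] + x[id+1] + x[id-nx] + x[id+nx]);
}
// Host: one red sweep, then one black sweep per Gauss-Seidel iteration:
//   gs_color<<<grid, block>>>(nx, ny, 0, d_b, d_x);
//   gs_color<<<grid, block>>>(nx, ny, 1, d_b, d_x);
```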

SLIDE 30
  • ADI-TRIDI Preconditioner

x_{k+1} = x_k + ω C⁻¹ (b − A x_k), with C_TRIDI the tridiagonal part of A: each grid line is coupled implicitly and solved with a tridiagonal solver; alternating direction implicit (ADI) sweeps alternate between rows and columns

SLIDE 31
  • SIMD Parallelism: Cyclic Reduction

[Diagram: the tridiagonal system shrinks through strides 2, 4, 8 in the reduction phase and is expanded back through strides 8, 4, 2; O(N·log N)]
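For illustration, a parallel cyclic reduction (PCR) variant matching the O(N·log N) work noted on the slide, solving one diagonally dominant system per thread block; this is a generic sketch, not the memory-friendly scheme of the next slide.

```cuda
// Launch: pcr<<<1, n, 4*n*sizeof(float)>>>(n, a, b, c, d, x), n = blockDim.x.
__global__ void pcr(int n, const float* a, const float* b,
                    const float* c, const float* d, float* x) {
    extern __shared__ float s[];
    float *sa = s, *sb = s + n, *sc = s + 2 * n, *sd = s + 3 * n;
    int i = threadIdx.x;
    sa[i] = a[i]; sb[i] = b[i]; sc[i] = c[i]; sd[i] = d[i];
    __syncthreads();
    for (int st = 1; st < n; st *= 2) {      // distance to eliminated neighbors
        bool lo = (i - st >= 0), hi = (i + st < n);
        float k1 = lo ? sa[i] / sb[i - st] : 0.0f;  // eliminate x_{i-st}
        float k2 = hi ? sc[i] / sb[i + st] : 0.0f;  // eliminate x_{i+st}
        float na = lo ? -k1 * sa[i - st] : 0.0f;
        float nc = hi ? -k2 * sc[i + st] : 0.0f;
        float nb = sb[i] - (lo ? k1 * sc[i - st] : 0.0f)
                         - (hi ? k2 * sa[i + st] : 0.0f);
        float nd = sd[i] - (lo ? k1 * sd[i - st] : 0.0f)
                         - (hi ? k2 * sd[i + st] : 0.0f);
        __syncthreads();                     // all reads done before writing
        sa[i] = na; sb[i] = nb; sc[i] = nc; sd[i] = nd;
        __syncthreads();
    }
    x[i] = sd[i] / sb[i];                    // system is now diagonal
}
```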

SLIDE 32
  • Memory Friendly Cyclic Reduction

[Göddeke, Strzodka 2011]

SLIDE 33
  • ADI-TRIGS Preconditioner

x_{k+1} = x_k + ω C⁻¹ (b − A x_k), with C_TRIGS combining the tridiagonal part with the lower line-to-line coupling: lines are solved tridiagonally and swept in Gauss-Seidel order

SLIDE 34
  • Many-Core Parallelism

[Block diagram: Host and Input Assembler feed the Execution Manager; many cores with per-core caches and local memories reach Global Memory through load/store units]

SLIDE 35
  • CPU vs. GPU Numerical Efficiency

[Chart: numerical efficiency of the CPU and GPU smoothers]

SLIDE 36
  • CPU vs. GPU Runtime Efficiency

[Chart: runtime efficiency of the CPU and GPU smoothers]

SLIDE 37
  • Multigrid on Refined Unstructured Grid

FE-gMG: geometric multigrid on finite element grids

Order & Storage (ELL): the numbering of the unknowns and the sparse matrix storage format determine performance

SLIDE 38
  • FE-gMG Results with SPAI Preconditioner

SLIDE 39
  • Overview

  • Levels of Parallelism
  • Grid Discretizations of PDEs
  • Multigrid and Strong Smoothers
  • Mixed Precision Iterative Refinement
  • Layout of Multi-valued Data

SLIDE 40
  • Precision Comparison

[Table: numerical algorithms vs. GPU hardware precision over the years]

SLIDE 41
  • Hardware Precision

float: s23e8 (23-bit mantissa) · double: s52e11 (52-bit mantissa)

[Examples of data and roundoff error: float resolves roughly 7 decimal digits (eps ≈ 1.2·10⁻⁷), double roughly 16 (eps ≈ 2.2·10⁻¹⁶)]

SLIDE 42
  • The Erratic Roundoff Error

[Plot: the roundoff error fluctuates erratically around the precision-dependent accuracy floor]

SLIDE 43
  • Numerical Accuracy

Condition of Ax = b: input perturbations are amplified by the condition number, ‖δx‖/‖x‖ ≤ κ(A) · ‖δb‖/‖b‖

Discretization error: refining the grid reduces the error of the discretized −∇·(σ∇u) = f, but κ(A) grows at the same time, so fine grids exhaust float (s23e8, 23-bit mantissa) long before double (s52e11, 52-bit mantissa)

SLIDE 44
  • Mixed Precision Iterative Refinement

Iterative refinement for Ax = b:

d_k = b − A x_k   (compute in high precision, double s52e11)
A c_k = d_k   (solve approximately in low precision, float s23e8)
x_{k+1} = x_k + c_k   (correct in high precision)
iterate k → k+1 until convergence

The low-precision solver only has to gain a fixed number of digits per iteration; the high-precision defect and correction recover full double accuracy.
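A self-contained sketch of the loop above (my simplification: a diagonal stand-in system, so the slides' inner float multigrid cycle is replaced by an exact float solve):

```cuda
#include <cstdio>
#include <cmath>

int main() {
    const int n = 4;
    double A[n] = {4.0, 3.0, 2.0, 1.0};    // stand-in: diagonal matrix
    double b[n] = {1.0, 1.0, 1.0, 1.0};
    double x[n] = {0.0, 0.0, 0.0, 0.0};
    for (int k = 0; k < 10; ++k) {
        double d[n], res = 0.0;
        for (int i = 0; i < n; ++i) {      // d_k = b - A x_k   (double)
            d[i] = b[i] - A[i] * x[i];
            res += d[i] * d[i];
        }
        if (std::sqrt(res) < 1e-14) break; // converged to double accuracy
        for (int i = 0; i < n; ++i) {
            float c = (float)d[i] / (float)A[i]; // A c_k = d_k   (float solve)
            x[i] += (double)c;             // x_{k+1} = x_k + c_k   (double)
        }
    }
    printf("x[0] = %.16f (exact 0.25)\n", x[0]);
    return 0;
}
```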

SLIDE 45
  • Mixed Precision Multigrid on GPU

[Chart: performance of mixed precision multigrid across GPU generations]

SLIDE 46
  • Overview

  • Levels of Parallelism
  • Grid Discretizations of PDEs
  • Multigrid and Strong Smoothers
  • Mixed Precision Iterative Refinement
  • Layout of Multi-valued Data

SLIDE 47
  • Multi-Valued Data

Multi-valued data is ubiquitous:

[Examples: complex-valued data, coordinate and velocity vectors, multi-component physical quantities]

SLIDE 48
  • AoS and SoA

Array of Structs (AoS): struct S { float a, b; }; S data[N];
Struct of Arrays (SoA): struct S { float a[N], b[N]; }; S data;

SLIDE 49
  • Parallel Access in AoS and SoA

[Diagram: parallel threads access AoS with a stride (uncoalesced), SoA contiguously (coalesced)]
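A CUDA sketch of the two access patterns (my illustration): with AoS, a warp's loads are strided by the struct size; with SoA they are contiguous and coalesce.

```cuda
struct PointAoS { float x, y, z; };                 // array of structs
struct PointsSoA { float *x, *y, *z; };             // struct of arrays

// AoS: thread i touches p[i].x -> stride of 3 floats between threads.
__global__ void scale_aos(int n, float s, PointAoS* p) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { p[i].x *= s; p[i].y *= s; p[i].z *= s; }
}

// SoA: thread i touches p.x[i] -> consecutive threads, consecutive floats.
__global__ void scale_soa(int n, float s, PointsSoA p) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { p.x[i] *= s; p.y[i] *= s; p.z[i] *= s; }
}
```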

SLIDE 50
  • Operating on Container Elements

In-place update of single and indexed structs works naturally with AoS, e.g. s.a++ and data[i].a++.

This is not possible with standard SoA/C++ syntax: there is no addressable struct data[i] whose members can be updated through a single reference.

SLIDE 51
  • Abstraction: AoS + SoA = ASA

Array of Structs (AoS): struct S { float a, b; }; S data[N];

Array of Structs of Arrays (ASA): struct T { float a[W], b[W]; }; T data[N/W]; — small tiles of width W store each member contiguously, combining AoS semantics with SoA memory access

[Strzodka 2011]
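A minimal ASA sketch (the names Tile, Ref and element are hypothetical, not the cited chapter's interface): tiles of SIMD width keep each member contiguous, so kernels get SoA-style coalescing while code keeps struct-style element access.

```cuda
constexpr int W = 32;                         // tile width = warp size
struct Tile { float x[W]; float y[W]; };      // struct of small arrays

struct Ref {                                  // proxy object for "data[k]"
    Tile* t; int i;
    __device__ float& x() const { return t->x[i]; }
    __device__ float& y() const { return t->y[i]; }
};
__device__ Ref element(Tile* tiles, int k) {
    return Ref{ &tiles[k / W], k % W };       // tile index, slot in tile
}

__global__ void scale(int n, float s, Tile* tiles) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= n) return;
    Ref r = element(tiles, k);                // AoS-like syntax...
    r.x() *= s;                               // ...SoA-like coalesced access
    r.y() *= s;
}
```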

SLIDE 52
  • Overview

  • Levels of Parallelism
  • Grid Discretizations of PDEs
  • Multigrid and Strong Smoothers
  • Mixed Precision Iterative Refinement
  • Layout of Multi-valued Data

SLIDE 53

Questions?

Robert Strzodka, Integrative Scientific Computing, Max-Planck-Institut Informatik, www.mpi-inf.mpg.de/~strzodka
Dominik Göddeke, Institute for Applied Mathematics, Technical University of Dortmund, www.mathematik.tu-dortmund.de/~goeddeke