

SLIDE 1

Assembly of Finite Element Methods on Graphics Processors

Cris Cecka, Adrian Lew, Eric Darve

Department of Mechanical Engineering
Institute for Computational and Mathematical Engineering
Stanford University

July 19th, 2010, 9th World Congress on Computational Mechanics


SLIDE 2

GPU Computing

Threads are executed by streaming processors.

  • On-chip registers; off-chip local memory.

Blocks of threads are executed on streaming multiprocessors.

  • On-chip shared memory.

A grid of blocks is executed on the device.

  • Off-chip global memory.

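A minimal CUDA sketch of this three-level hierarchy, purely for illustration (the kernel name and the use of shared memory here are contrived to show all three memory spaces, not taken from the talk; it assumes a block size of 256):

```cuda
// Illustrative only: touches registers, shared memory, and global memory.
__global__ void scaledCopy(int n, float a, const float* x, float* y) {
    __shared__ float tile[256];                     // on-chip shared memory, per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // i and a live in registers
    tile[threadIdx.x] = (i < n) ? x[i] : 0.0f;      // off-chip global memory read
    __syncthreads();                                // threads of one block cooperate
    if (i < n)
        y[i] = a * tile[threadIdx.x];               // off-chip global memory write
}
// Launched as a grid of blocks:
//   scaledCopy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);
```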

SLIDE 3

Why FEM Assembly on the GPU?

Complex, real-time physics is common on the GPU.

  • Gaming and graphics community. Simulation and visualization community. More recently, the HPC community.

Sparse linear algebra is coming of age on the GPU.

  • Extensive research on sparse solvers on the GPU. Extensive research on SpMV.

Non-linear and time-dependent problems require many assembly procedures.

We can assemble, solve, update, and visualize entirely on the GPU.

  • Completely avoids costly transfers with the CPU. Fast (real-time) simulations with visualization.


SLIDE 4

FEM Direct Assembly

The most common FEM assembly procedure:

Compute the element data, one element at a time: for each element e, the element matrix K^e and the element vector f^e.

Accumulate into the global system K u = f, using a local-index to global-index mapping.

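To make the procedure concrete, here is a minimal serial sketch of direct assembly in host C++. The placeholder element subroutine, the dense storage of K, and all names are illustrative assumptions for linear triangles, not the talk's code:

```cuda
#include <vector>

const int NPE = 3;  // nodes per element (linear triangles)

// Placeholder element subroutine: a real one evaluates shape-function
// integrals by quadrature; here we emit a fixed 3x3 block for illustration.
void computeElement(int /*e*/, double Ke[NPE][NPE], double fe[NPE]) {
    for (int a = 0; a < NPE; ++a) {
        fe[a] = 1.0;
        for (int b = 0; b < NPE; ++b) Ke[a][b] = (a == b) ? 2.0 : 1.0;
    }
}

void assemble(int numElements, int numNodes,
              const std::vector<int>& conn,  // conn[NPE*e + a] = global node index
              std::vector<double>& K,        // dense numNodes x numNodes, for clarity
              std::vector<double>& f) {
    double Ke[NPE][NPE], fe[NPE];
    for (int e = 0; e < numElements; ++e) {      // one element at a time
        computeElement(e, Ke, fe);
        for (int a = 0; a < NPE; ++a) {          // local -> global mapping
            int i = conn[NPE * e + a];
            f[i] += fe[a];
            for (int b = 0; b < NPE; ++b)
                K[(size_t)i * numNodes + conn[NPE * e + b]] += Ke[a][b];
        }
    }
}
```

The GPU strategies that follow differ in where the element data K^e, f^e are stored and in which threads perform the accumulation.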

SLIDE 5

Data Flow

[Figure: nodal data feeds the per-element computations (K^e, f^e for each element e), which are accumulated into the FEM system.]


SLIDE 6

GPU FEM Assembly Strategies

Key concerns for GPU algorithms:

Distribute the task into independent blocks of work.

  • No inter-block communication. Minimize redundant computation.

Maximize the flop-to-word ratio.

  • Minimize global memory transactions. Use the exposed memory hierarchy to maximize data reuse.


SLIDE 7

GPU FEM Assembly Strategies

Two key choices:

Store element data in:

  • Global memory: splits computation from assembly; minimal computation.
  • Local memory: fast read/write; no sharing of element data.
  • Shared memory: fast read/write; shared element data; small size.

Threads assemble by:

  • Non-zero (NZ): simple indexing; imbalanced.
  • Row: more balanced; requires lookup tables.
  • Element: minimal computation; SIMD; race conditions.


SLIDE 8

Local-Element - Coloring the Mesh

Assign one thread to one element.

  • Compute the element data. Assemble directly into the global system.

Race conditions are still possible! Partition the elements to resolve them.

  • Transform into a vertex-coloring problem. In general, k-coloring is NP-complete, but we do not need an optimal coloring.

Problems:

  • No sharing of nodal or element data. Little utilization of GPU resources.

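A minimal sketch of the Local-Element strategy under the same illustrative assumptions as the earlier listing (linear triangles, dense global K for brevity, placeholder element subroutine): one kernel launch per color, so that no two elements in flight share a node and the unguarded += accumulations cannot race.

```cuda
// Placeholder element subroutine (a real one evaluates quadrature).
__device__ void computeElement(int /*e*/, double Ke[3][3], double fe[3]) {
    for (int a = 0; a < 3; ++a) {
        fe[a] = 1.0;
        for (int b = 0; b < 3; ++b) Ke[a][b] = (a == b) ? 2.0 : 1.0;
    }
}

// One thread per element; only elements of the current color are passed in.
__global__ void assembleColor(const int* elems, int numInColor, const int* conn,
                              int numNodes, double* K, double* f) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numInColor) return;
    int e = elems[t];
    double Ke[3][3], fe[3];
    computeElement(e, Ke, fe);
    for (int a = 0; a < 3; ++a) {
        int i = conn[3 * e + a];
        f[i] += fe[a];                        // safe: within one color, no two
        for (int b = 0; b < 3; ++b)           // elements share a node
            K[(size_t)i * numNodes + conn[3 * e + b]] += Ke[a][b];
    }
}
// Host side: for each color c with element list elems_c of length n_c,
//   assembleColor<<<(n_c + 255) / 256, 256>>>(elems_c, n_c, conn, numNodes, K, f);
// synchronizing between colors so that writes never race.
```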

SLIDE 9

Global-NZ

Kernel 1 - Assign one thread to one element.

  • Compute the element data. Store the element data in global memory.

Kernel 2 - Assign one thread to one NZ.

  • Assemble from global memory.

Optimizations:

  • Cluster the elements so that they share nodes. Prefetch nodal data into shared memory. Up to almost a 3x speedup.

Problems:

  • Two passes through global memory. Limited by global memory size.

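A two-kernel sketch of Global-NZ under the same illustrative assumptions. The NZ-to-element-data map (nzPtr, nzSrc) is precomputed on the host (see SLIDE 13); the load vector f is handled analogously and omitted here for brevity.

```cuda
// Placeholder element subroutine (a real one evaluates quadrature).
__device__ void computeElement(int /*e*/, double Ke[3][3]) {
    for (int a = 0; a < 3; ++a)
        for (int b = 0; b < 3; ++b) Ke[a][b] = (a == b) ? 2.0 : 1.0;
}

// Kernel 1: one thread per element; stream element data to global memory.
__global__ void elementKernel(int numElements, double* elemK /* numElements*9 */) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= numElements) return;
    double Ke[3][3];
    computeElement(e, Ke);
    for (int a = 0; a < 3; ++a)
        for (int b = 0; b < 3; ++b)
            elemK[9 * e + 3 * a + b] = Ke[a][b];  // a coalesced layout helps here
}

// Kernel 2: one thread per NZ of K; sum every element contribution that
// maps to this NZ. No races: each NZ is owned by exactly one thread.
__global__ void reduceKernel(int numNZ, const int* nzPtr, const int* nzSrc,
                             const double* elemK, double* Kvals) {
    int nz = blockIdx.x * blockDim.x + threadIdx.x;
    if (nz >= numNZ) return;
    double sum = 0.0;
    for (int k = nzPtr[nz]; k < nzPtr[nz + 1]; ++k)
        sum += elemK[nzSrc[k]];     // nzSrc[k] indexes into the element data
    Kvals[nz] = sum;
}
```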

SLIDE 10

Global-NZ Data Flow

The optimized algorithm looks like:

[Figure: per block, a gather of nodal data feeds the element subroutine, which computes each element matrix K^e (thread sync) and performs a coalesced write of the element data to global memory; after a kernel break, a reduction assembles the system of equations.]

SLIDE 11

Shared-NZ

Assign one thread to one element.

  • Compute the element data. Store the element data in shared memory.

Reassign the threads to NZs.

  • Assemble from shared memory.

A set of NZs requires a set of elements.

  • Must compute all "halo" element data.

A set of elements requires a set of nodes.

  • Must gather all "halo" nodal data.

Problems:

  • Shared memory size is very limiting.

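A single-kernel sketch of Shared-NZ, again with illustrative names and with the per-block element and NZ lists assumed precomputed on the host. Phase 1 computes element data (including halo elements) into shared memory; after a synchronization, the same threads are reassigned to non-zeros. Here nzSrc holds block-local slots into the shared element data; f is again omitted.

```cuda
// Placeholder element subroutine (a real one evaluates quadrature).
__device__ void computeElement(int /*e*/, double Ke[3][3]) {
    for (int a = 0; a < 3; ++a)
        for (int b = 0; b < 3; ++b) Ke[a][b] = (a == b) ? 2.0 : 1.0;
}

__global__ void sharedNZ(const int* blockElems, const int* blockElemPtr,
                         const int* blockNZ, const int* blockNZPtr,
                         const int* nzPtr, const int* nzSrc, double* Kvals) {
    extern __shared__ double elemK[];   // element data for this partition
    int begin = blockElemPtr[blockIdx.x];
    int count = blockElemPtr[blockIdx.x + 1] - begin;

    // Phase 1: one thread per element, halo elements included.
    for (int t = threadIdx.x; t < count; t += blockDim.x) {
        double Ke[3][3];
        computeElement(blockElems[begin + t], Ke);
        for (int a = 0; a < 3; ++a)
            for (int b = 0; b < 3; ++b)
                elemK[9 * t + 3 * a + b] = Ke[a][b];
    }
    __syncthreads();                    // element data now visible to the block

    // Phase 2: reassign threads, one per NZ owned by this block.
    int nzBegin = blockNZPtr[blockIdx.x];
    int nzCount = blockNZPtr[blockIdx.x + 1] - nzBegin;
    for (int t = threadIdx.x; t < nzCount; t += blockDim.x) {
        int nz = blockNZ[nzBegin + t];
        double sum = 0.0;
        for (int k = nzPtr[nz]; k < nzPtr[nz + 1]; ++k)
            sum += elemK[nzSrc[k]];     // block-local slot in shared memory
        Kvals[nz] = sum;
    }
}
// Launch with dynamic shared memory sized for the largest partition:
//   sharedNZ<<<numBlocks, 128, 9 * maxElemsPerBlock * sizeof(double)>>>(...);
```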

SLIDE 12

Shared-NZ Data Flow

The optimized algorithm looks like:

[Figure: per block, a scatter of nodal data into shared memory (thread sync) feeds the element subroutine, which computes the element data in shared memory (thread sync); a reduction then assembles the system of equations, with no intermediate trip through global memory.]

SLIDE 13

Scatter and Reduction Arrays

General procedure: build the list of operations to be performed for each partition, then pack them into an array so that the reads are coalesced.

Scatter array: [figure]

Reduction array: [figure]

Very fast. Highly adaptable. Significant setup cost. Significant memory cost.

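A host-side sketch of how such a reduction array might be built, using the same illustrative names as the earlier kernels (the mapping from a (row, col) pair to its NZ index is supplied by the caller, e.g. a CSR lookup; this is a simplification, not the talk's precomputation code):

```cuda
#include <functional>
#include <vector>

// For each NZ of K, record the packed list of slots in elemK that
// accumulate into it; reduceKernel/sharedNZ then read (nzPtr, nzSrc).
void buildReductionArray(int numElements, int numNZ, const std::vector<int>& conn,
                         const std::function<int(int, int)>& nzIndex,
                         std::vector<int>& nzPtr, std::vector<int>& nzSrc) {
    std::vector<std::vector<int>> perNZ(numNZ);
    for (int e = 0; e < numElements; ++e)
        for (int a = 0; a < 3; ++a)
            for (int b = 0; b < 3; ++b) {
                int nz = nzIndex(conn[3 * e + a], conn[3 * e + b]);
                perNZ[nz].push_back(9 * e + 3 * a + b);   // source slot in elemK
            }
    nzPtr.assign(numNZ + 1, 0);
    for (int nz = 0; nz < numNZ; ++nz)
        nzPtr[nz + 1] = nzPtr[nz] + (int)perNZ[nz].size();
    nzSrc.clear();
    for (const auto& list : perNZ)
        nzSrc.insert(nzSrc.end(), list.begin(), list.end());
    // Reordering nzSrc so that consecutive threads read consecutive entries
    // is what makes the device reads coalesced; this setup cost (and the
    // extra index storage) is paid once per mesh.
}
```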

SLIDE 14

Scaling with Element Number


SLIDE 15

Scaling with Element Order


SLIDE 16

Application

GPU non-linear neo-Hookean model.

  • Newton-Raphson update at each step. Assemble, solve, update, and render at each step.
  • 28,796 nodes. 125,127 elements.

Preliminary results:

  • Assembly on the CPU plus the transfer requires ∼0.5 s (optimized).
  • Assembly on the GPU requires ∼0.01 s (optimized).
  • With a GPU sparse solver, currently 5 fps on a GTX 480.
  • Expect 15+ fps after a few more solver optimizations.

  • C. Cecka, A. Lew, E. Darve. Application of Assembly, Solution, and Visualization of Finite Elements on Graphics Processors. GPU Gems. Preprint.


SLIDE 17

Conclusion

  • C. Cecka, A. Lew, E. Darve. Assembly of Finite Element Methods on Graphics Processors. IJNME, 2009.

Created and classified several GPU FEM assembly algorithms.
Identified the optimizations and limitations of each algorithm.
The optimal method depends on the element:

  • Memory requirements of the element kernels. Computational requirements of the element kernels.

Developed the precomputation algorithms and supporting data structures.

  • C. Cecka, A. Lew, E. Darve. Application of Assembly, Solution, and Visualization of Finite Elements on Graphics Processors. GPU Gems. Preprint.

Applied the methods to a high-performance FEM application.
