

SLIDE 1

Assembly of Finite Element Methods on Graphics Processors

Cris Cecka, Adrian Lew, Eric Darve

Department of Mechanical Engineering
Institute for Computational and Mathematical Engineering
Stanford University

July 19th, 2010, 9th World Congress on Computational Mechanics


SLIDE 2

GPU Computing

Threads are executed by streaming processors.

  • On-chip registers; off-chip local memory.

Blocks of threads are executed on streaming multiprocessors.

  • On-chip shared memory.

A grid of blocks is executed on the device.

  • Off-chip global memory.

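A minimal CUDA sketch of this three-level hierarchy, purely for illustration (the kernel name and the use of shared memory here are contrived to show all three memory spaces, not taken from the talk; it assumes a block size of 256):

```cuda
// Illustrative only: touches registers, shared memory, and global memory.
__global__ void scaledCopy(int n, float a, const float* x, float* y) {
    __shared__ float tile[256];                     // on-chip shared memory, per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // i and a live in registers
    tile[threadIdx.x] = (i < n) ? x[i] : 0.0f;      // off-chip global memory read
    __syncthreads();                                // threads of one block cooperate
    if (i < n)
        y[i] = a * tile[threadIdx.x];               // off-chip global memory write
}
// Launched as a grid of blocks:
//   scaledCopy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);
```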

SLIDE 3

Why FEM Assembly on the GPU?

Complex, real-time physics is common on the GPU.

  • Gaming and graphics community. Simulation and visualization community. More recently, the HPC community.

Sparse linear algebra is coming of age on the GPU.

  • Extensive research on sparse solvers on the GPU. Extensive research on SpMV.

Non-linear and time-dependent problems require many assembly procedures.

We can assemble, solve, update, and visualize entirely on the GPU.

  • Completely avoids costly transfers with the CPU. Fast (real-time) simulations with visualization.


SLIDE 4

FEM Direct Assembly

The most common FEM assembly procedure:

Compute the element data, one element at a time: for each element e, the element matrix K^e and the element vector f^e.

Accumulate into the global system K u = f, using a local-index to global-index mapping.

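To make the procedure concrete, here is a minimal serial sketch of direct assembly in host C++. The placeholder element subroutine, the dense storage of K, and all names are illustrative assumptions for linear triangles, not the talk's code:

```cuda
#include <vector>

const int NPE = 3;  // nodes per element (linear triangles)

// Placeholder element subroutine: a real one evaluates shape-function
// integrals by quadrature; here we emit a fixed 3x3 block for illustration.
void computeElement(int /*e*/, double Ke[NPE][NPE], double fe[NPE]) {
    for (int a = 0; a < NPE; ++a) {
        fe[a] = 1.0;
        for (int b = 0; b < NPE; ++b) Ke[a][b] = (a == b) ? 2.0 : 1.0;
    }
}

void assemble(int numElements, int numNodes,
              const std::vector<int>& conn,  // conn[NPE*e + a] = global node index
              std::vector<double>& K,        // dense numNodes x numNodes, for clarity
              std::vector<double>& f) {
    double Ke[NPE][NPE], fe[NPE];
    for (int e = 0; e < numElements; ++e) {      // one element at a time
        computeElement(e, Ke, fe);
        for (int a = 0; a < NPE; ++a) {          // local -> global mapping
            int i = conn[NPE * e + a];
            f[i] += fe[a];
            for (int b = 0; b < NPE; ++b)
                K[(size_t)i * numNodes + conn[NPE * e + b]] += Ke[a][b];
        }
    }
}
```

The GPU strategies that follow differ in where the element data K^e, f^e are stored and in which threads perform the accumulation.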

SLIDE 5

Data Flow

[Figure: nodal data feeds the per-element computations (K^e, f^e for each element e), which are accumulated into the FEM system.]


SLIDE 6

GPU FEM Assembly Strategies

Key concerns for GPU algorithms:

Distribute the task into independent blocks of work.

  • No inter-block communication. Minimize redundant computation.

Maximize the flop-to-word ratio.

  • Minimize global memory transactions. Use the exposed memory hierarchy to maximize data reuse.


SLIDE 7

GPU FEM Assembly Strategies

Two key choices:

Store element data in:

  • Global memory: splits computation from assembly; minimal computation.
  • Local memory: fast read/write; no sharing of element data.
  • Shared memory: fast read/write; shared element data; small size.

Threads assemble by:

  • Non-zero (NZ): simple indexing; imbalanced.
  • Row: more balanced; requires lookup tables.
  • Element: minimal computation; SIMD; race conditions.


SLIDE 8

Local-Element - Coloring the Mesh

Assign one thread to one element.

  • Compute the element data. Assemble directly into the global system.

Race conditions are still possible! Partition the elements to resolve them.

  • Transform into a vertex-coloring problem. In general, k-coloring is NP-complete, but we do not need an optimal coloring.

Problems:

  • No sharing of nodal or element data. Little utilization of GPU resources.

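A minimal sketch of the Local-Element strategy under the same illustrative assumptions as the earlier listing (linear triangles, dense global K for brevity, placeholder element subroutine): one kernel launch per color, so that no two elements in flight share a node and the unguarded += accumulations cannot race.

```cuda
// Placeholder element subroutine (a real one evaluates quadrature).
__device__ void computeElement(int /*e*/, double Ke[3][3], double fe[3]) {
    for (int a = 0; a < 3; ++a) {
        fe[a] = 1.0;
        for (int b = 0; b < 3; ++b) Ke[a][b] = (a == b) ? 2.0 : 1.0;
    }
}

// One thread per element; only elements of the current color are passed in.
__global__ void assembleColor(const int* elems, int numInColor, const int* conn,
                              int numNodes, double* K, double* f) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numInColor) return;
    int e = elems[t];
    double Ke[3][3], fe[3];
    computeElement(e, Ke, fe);
    for (int a = 0; a < 3; ++a) {
        int i = conn[3 * e + a];
        f[i] += fe[a];                        // safe: within one color, no two
        for (int b = 0; b < 3; ++b)           // elements share a node
            K[(size_t)i * numNodes + conn[3 * e + b]] += Ke[a][b];
    }
}
// Host side: for each color c with element list elems_c of length n_c,
//   assembleColor<<<(n_c + 255) / 256, 256>>>(elems_c, n_c, conn, numNodes, K, f);
// synchronizing between colors so that writes never race.
```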

SLIDE 9

Global-NZ

Kernel 1 - Assign one thread to one element.

  • Compute the element data. Store the element data in global memory.

Kernel 2 - Assign one thread to one NZ.

  • Assemble from global memory.

Optimizations:

  • Cluster the elements so that they share nodes. Prefetch nodal data into shared memory. Up to almost a 3x speedup.

Problems:

  • Two passes through global memory. Limited by global memory size.

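A two-kernel sketch of Global-NZ under the same illustrative assumptions. The NZ-to-element-data map (nzPtr, nzSrc) is precomputed on the host (see SLIDE 13); the load vector f is handled analogously and omitted here for brevity.

```cuda
// Placeholder element subroutine (a real one evaluates quadrature).
__device__ void computeElement(int /*e*/, double Ke[3][3]) {
    for (int a = 0; a < 3; ++a)
        for (int b = 0; b < 3; ++b) Ke[a][b] = (a == b) ? 2.0 : 1.0;
}

// Kernel 1: one thread per element; stream element data to global memory.
__global__ void elementKernel(int numElements, double* elemK /* numElements*9 */) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= numElements) return;
    double Ke[3][3];
    computeElement(e, Ke);
    for (int a = 0; a < 3; ++a)
        for (int b = 0; b < 3; ++b)
            elemK[9 * e + 3 * a + b] = Ke[a][b];  // a coalesced layout helps here
}

// Kernel 2: one thread per NZ of K; sum every element contribution that
// maps to this NZ. No races: each NZ is owned by exactly one thread.
__global__ void reduceKernel(int numNZ, const int* nzPtr, const int* nzSrc,
                             const double* elemK, double* Kvals) {
    int nz = blockIdx.x * blockDim.x + threadIdx.x;
    if (nz >= numNZ) return;
    double sum = 0.0;
    for (int k = nzPtr[nz]; k < nzPtr[nz + 1]; ++k)
        sum += elemK[nzSrc[k]];     // nzSrc[k] indexes into the element data
    Kvals[nz] = sum;
}
```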

SLIDE 10

Global-NZ Data Flow

The optimized algorithm looks like:

[Figure: per block, a gather of nodal data feeds the element subroutine, which computes each element matrix K^e (thread sync) and performs a coalesced write of the element data to global memory; after a kernel break, a reduction assembles the system of equations.]

SLIDE 11

Shared-NZ

Assign one thread to one element.

  • Compute the element data. Store the element data in shared memory.

Reassign the threads to NZs.

  • Assemble from shared memory.

A set of NZs requires a set of elements.

  • Must compute all "halo" element data.

A set of elements requires a set of nodes.

  • Must gather all "halo" nodal data.

Problems:

  • Shared memory size is very limiting.

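A single-kernel sketch of Shared-NZ, again with illustrative names and with the per-block element and NZ lists assumed precomputed on the host. Phase 1 computes element data (including halo elements) into shared memory; after a synchronization, the same threads are reassigned to non-zeros. Here nzSrc holds block-local slots into the shared element data; f is again omitted.

```cuda
// Placeholder element subroutine (a real one evaluates quadrature).
__device__ void computeElement(int /*e*/, double Ke[3][3]) {
    for (int a = 0; a < 3; ++a)
        for (int b = 0; b < 3; ++b) Ke[a][b] = (a == b) ? 2.0 : 1.0;
}

__global__ void sharedNZ(const int* blockElems, const int* blockElemPtr,
                         const int* blockNZ, const int* blockNZPtr,
                         const int* nzPtr, const int* nzSrc, double* Kvals) {
    extern __shared__ double elemK[];   // element data for this partition
    int begin = blockElemPtr[blockIdx.x];
    int count = blockElemPtr[blockIdx.x + 1] - begin;

    // Phase 1: one thread per element, halo elements included.
    for (int t = threadIdx.x; t < count; t += blockDim.x) {
        double Ke[3][3];
        computeElement(blockElems[begin + t], Ke);
        for (int a = 0; a < 3; ++a)
            for (int b = 0; b < 3; ++b)
                elemK[9 * t + 3 * a + b] = Ke[a][b];
    }
    __syncthreads();                    // element data now visible to the block

    // Phase 2: reassign threads, one per NZ owned by this block.
    int nzBegin = blockNZPtr[blockIdx.x];
    int nzCount = blockNZPtr[blockIdx.x + 1] - nzBegin;
    for (int t = threadIdx.x; t < nzCount; t += blockDim.x) {
        int nz = blockNZ[nzBegin + t];
        double sum = 0.0;
        for (int k = nzPtr[nz]; k < nzPtr[nz + 1]; ++k)
            sum += elemK[nzSrc[k]];     // block-local slot in shared memory
        Kvals[nz] = sum;
    }
}
// Launch with dynamic shared memory sized for the largest partition:
//   sharedNZ<<<numBlocks, 128, 9 * maxElemsPerBlock * sizeof(double)>>>(...);
```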

SLIDE 12

Shared-NZ Data Flow

The optimized algorithm looks like:

[Figure: per block, a scatter of nodal data into shared memory (thread sync) feeds the element subroutine, which computes the element data in shared memory (thread sync); a reduction then assembles the system of equations, with no intermediate trip through global memory.]

SLIDE 13

Scatter and Reduction Arrays

General procedure: build the list of operations to be performed for each partition, then pack them into an array so that the reads are coalesced.

Scatter array: [figure]

Reduction array: [figure]

Very fast. Highly adaptable. Significant setup cost. Significant memory cost.

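A host-side sketch of how such a reduction array might be built, using the same illustrative names as the earlier kernels (the mapping from a (row, col) pair to its NZ index is supplied by the caller, e.g. a CSR lookup; this is a simplification, not the talk's precomputation code):

```cuda
#include <functional>
#include <vector>

// For each NZ of K, record the packed list of slots in elemK that
// accumulate into it; reduceKernel/sharedNZ then read (nzPtr, nzSrc).
void buildReductionArray(int numElements, int numNZ, const std::vector<int>& conn,
                         const std::function<int(int, int)>& nzIndex,
                         std::vector<int>& nzPtr, std::vector<int>& nzSrc) {
    std::vector<std::vector<int>> perNZ(numNZ);
    for (int e = 0; e < numElements; ++e)
        for (int a = 0; a < 3; ++a)
            for (int b = 0; b < 3; ++b) {
                int nz = nzIndex(conn[3 * e + a], conn[3 * e + b]);
                perNZ[nz].push_back(9 * e + 3 * a + b);   // source slot in elemK
            }
    nzPtr.assign(numNZ + 1, 0);
    for (int nz = 0; nz < numNZ; ++nz)
        nzPtr[nz + 1] = nzPtr[nz] + (int)perNZ[nz].size();
    nzSrc.clear();
    for (const auto& list : perNZ)
        nzSrc.insert(nzSrc.end(), list.begin(), list.end());
    // Reordering nzSrc so that consecutive threads read consecutive entries
    // is what makes the device reads coalesced; this setup cost (and the
    // extra index storage) is paid once per mesh.
}
```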

SLIDE 14

Scaling with Element Number


SLIDE 15

Scaling with Element Order


SLIDE 16

Application

GPU non-linear neo-Hookean model.

  • Newton-Raphson update at each step. Assemble, solve, update, and render at each step.
  • 28,796 nodes. 125,127 elements.

Preliminary results:

  • Assembly on the CPU plus the transfer requires ∼0.5 s (optimized).
  • Assembly on the GPU requires ∼0.01 s (optimized).
  • With a GPU sparse solver, currently 5 fps on a GTX 480.
  • Expect 15+ fps after a few more solver optimizations.

  • C. Cecka, A. Lew, E. Darve. Application of Assembly, Solution, and Visualization of Finite Elements on Graphics Processors. GPU Gems. Preprint.


SLIDE 17

Conclusion

  • C. Cecka, A. Lew, E. Darve. Assembly of Finite Element Methods on Graphics Processors. IJNME, 2009.

Created and classified several GPU FEM assembly algorithms.
Identified the optimizations and limitations of each algorithm.
The optimal method depends on the element:

  • Memory requirements of the element kernels. Computational requirements of the element kernels.

Developed the precomputation algorithms and supporting data structures.

  • C. Cecka, A. Lew, E. Darve. Application of Assembly, Solution, and Visualization of Finite Elements on Graphics Processors. GPU Gems. Preprint.

Applied the methods to a high-performance FEM application.
