Speeding up a Finite Element Computation on GPU
Nelson Inoue

Summary
• Introduction
• Finite element implementation on GPU
• Results
• Conclusions

University and Researchers
• Pontifical Catholic University of Rio de Janeiro (PUC-Rio)
• Group of Technology in Petroleum Engineering (GTEP)
• Research Team
– PhD Sergio Fontoura: Leader Researcher
– PhD Nelson Inoue: Senior Researcher
– PhD Carlos Emmanuel: Researcher
– MSc Guilherme Righetto: Researcher
– MSc Rafael Albuquerque: Researcher

Introduction
• Research & Development (R&D) project with Petrobras
• The project began in 2010
• The subject of the project is Reservoir Geomechanics
• There is great interest in this subject from the oil and gas industry
• The subject has still received little research attention

Introduction
• What is Reservoir Geomechanics?
– The branch of petroleum engineering that studies the coupling between fluid flow and rock deformation (stress analysis)
• Hydromechanical Coupling
– Oil production causes rock deformation
– Rock deformation contributes to oil production

Motivation
• Geomechanical effects during reservoir production
1. Surface subsidence
2. Bedding-parallel slip
3. Fault reactivation
4. Caprock integrity
5. Reservoir compaction

Challenge
• Evaluate geomechanical effects in a real reservoir
• Overcome two major challenges
1. Use a reliable coupling scheme between fluid flow and stress analysis
2. Speed up the stress analysis (Finite Element Method), since the finite element analysis accounts for most of the simulation time

Hydromechanical coupling
• Theoretical Approach
Figure: Coupling program flowchart

Finite Element Method
• Partial differential equations arise in the mathematical modelling of many engineering problems
• An analytical (exact) solution is often very complicated to obtain
• Alternative: numerical solution
– Finite element method, finite difference method, finite volume method, boundary element method, discrete element method, etc.

Finite Element Method
• The finite element method (FEM) is widely applied in stress analysis
• The domain is an assembly of finite elements (FEs)
Figure: Finite element domain (http://www.mscsoftware.com/product/dytran)

CHRONOS: FE Program
• Chronos has been implemented on GPU
– Motivation: to reduce the simulation time in the hydromechanical analysis
– Why use a GPU? Much more processing power: GPU (GeForce GTX Titan, 2880 cores) >> CPU (4-8 cores)
Figure: CETUS computer with 4 GPUs

Motivation
• GPU features (CUDA C Programming Guide):
– Highly parallel, multithreaded, manycore processor
– Tremendous computational horsepower and very high memory bandwidth
Figures: Number of floating-point operations per second; memory bandwidth

Our Implementation
• GPUs have good performance
• We have developed and implemented an optimized, parallel finite element program on GPU
• The finite element code is written in CUDA
• We have implemented on GPU:
– Assembly of the stiffness matrix
– Solution of the system of linear equations
– Evaluation of the strain state
– Evaluation of the stress state

Global Memory Access on GPU
• Getting maximum performance on GPU: coalesced access
– Sequential/aligned access: good
– Strided access: not so good
– Random access: bad
– Memory accesses are fully coalesced as long as all threads in a warp access the same relative address
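The difference between the three access patterns can be illustrated with a rough counting model, written here in Python rather than CUDA so it runs anywhere. This is a sketch under stated assumptions, not a description of any particular GPU: the warp size of 32, the 4-byte word, and the 128-byte memory segment are assumed values, and `segments_touched` is a hypothetical helper that counts how many distinct segments one warp would have to fetch.

```python
# Simplified model of global memory coalescing (assumed parameters, not measured).
WARP_SIZE = 32        # threads per warp (assumption)
WORD_BYTES = 4        # one 32-bit float per thread (assumption)
SEGMENT_BYTES = 128   # size of one memory transaction (assumption)


def segments_touched(indices):
    """Count the distinct memory segments hit by one warp.

    indices[t] is the array element read by thread t. Fewer segments
    means fewer memory transactions, i.e. better coalescing.
    """
    return len({(i * WORD_BYTES) // SEGMENT_BYTES for i in indices})


# Thread t reads element t: sequential and aligned.
sequential = [t for t in range(WARP_SIZE)]
# Thread t reads element 4 * t: strided.
strided = [4 * t for t in range(WARP_SIZE)]
# Thread t reads element 37 * t: a deterministic stand-in for random access.
scattered = [37 * t for t in range(WARP_SIZE)]

for name, pattern in (("sequential", sequential),
                      ("strided", strided),
                      ("scattered", scattered)):
    print(name, segments_touched(pattern))
```

In this model the sequential pattern needs a single segment for the whole warp, the strided pattern needs several, and the scattered pattern needs one segment per thread.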

Development on CPU
• The assembly of the global stiffness matrix in the conventional FEM
– Simple 1D problem: (a) real model, (b) model discretization with nodes 1 to 4, (c) three finite elements 1 to 3
– The continuous model is discretized by elements
– Element stiffness matrix, for elements $e = 1, 2, 3$:
$$ k^{(e)} = \begin{bmatrix} k_{11}^{(e)} & k_{12}^{(e)} \\ k_{21}^{(e)} & k_{22}^{(e)} \end{bmatrix} $$

Development on CPU
• In terms of CPU implementation: for $i = 1$, $i \le numel = 3$
– Evaluate the element stiffness matrix
$$ k^{(i)} = \begin{bmatrix} k_{11}^{(i)} & k_{12}^{(i)} \\ k_{21}^{(i)} & k_{22}^{(i)} \end{bmatrix} $$
– Assemble it into the global stiffness matrix
After $i = 1$:
$$ k_{global} = \begin{bmatrix} k_{11}^{(1)} & k_{12}^{(1)} & 0 & 0 \\ k_{21}^{(1)} & k_{22}^{(1)} & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} $$
After $i = 2$:
$$ k_{global} = \begin{bmatrix} k_{11}^{(1)} & k_{12}^{(1)} & 0 & 0 \\ k_{21}^{(1)} & k_{22}^{(1)} + k_{11}^{(2)} & k_{12}^{(2)} & 0 \\ 0 & k_{21}^{(2)} & k_{22}^{(2)} & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} $$
After $i = 3$:
$$ k_{global} = \begin{bmatrix} k_{11}^{(1)} & k_{12}^{(1)} & 0 & 0 \\ k_{21}^{(1)} & k_{22}^{(1)} + k_{11}^{(2)} & k_{12}^{(2)} & 0 \\ 0 & k_{21}^{(2)} & k_{22}^{(2)} + k_{11}^{(3)} & k_{12}^{(3)} \\ 0 & 0 & k_{21}^{(3)} & k_{22}^{(3)} \end{bmatrix} $$
– The storage in memory (the global matrix is stored row by row):
$i = 1$: $[\, k_{11}^{(1)},\ k_{12}^{(1)},\ 0,\ 0,\ k_{21}^{(1)},\ k_{22}^{(1)},\ 0,\ 0,\ 0,\ 0,\ 0,\ 0,\ 0,\ 0,\ 0,\ 0 \,]$
$i = 2$: $[\, k_{11}^{(1)},\ k_{12}^{(1)},\ 0,\ 0,\ k_{21}^{(1)},\ k_{22}^{(1)} + k_{11}^{(2)},\ k_{12}^{(2)},\ 0,\ 0,\ k_{21}^{(2)},\ k_{22}^{(2)},\ 0,\ 0,\ 0,\ 0,\ 0 \,]$
$i = 3$: $[\, k_{11}^{(1)},\ k_{12}^{(1)},\ 0,\ 0,\ k_{21}^{(1)},\ k_{22}^{(1)} + k_{11}^{(2)},\ k_{12}^{(2)},\ 0,\ 0,\ k_{21}^{(2)},\ k_{22}^{(2)} + k_{11}^{(3)},\ k_{12}^{(3)},\ 0,\ 0,\ k_{21}^{(3)},\ k_{22}^{(3)} \,]$
– Memory access is not coalesced
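The element loop described on this slide can be sketched in a few lines of Python. This is an illustration under assumptions, not the Chronos code: the slides keep the entries of each element matrix symbolic, so the sketch assumes a 1D bar element with matrix `k * [[1, -1], [-1, 1]]`, and the names `element_stiffness` and `assemble_by_element` are hypothetical.

```python
def element_stiffness(k):
    """2x2 stiffness matrix of a 1D two-node element (assumed bar element)."""
    return [[k, -k],
            [-k, k]]


def assemble_by_element(stiffness):
    """Conventional assembly: loop over elements, scatter into the global matrix.

    stiffness[e] is the stiffness of element e, which connects
    node e and node e + 1 (0-based numbering).
    """
    n_nodes = len(stiffness) + 1
    k_global = [[0.0] * n_nodes for _ in range(n_nodes)]
    for e, k in enumerate(stiffness):
        k_e = element_stiffness(k)
        for a in range(2):          # local row
            for b in range(2):      # local column
                # Two neighbouring elements both add to the shared diagonal
                # entry, so the writes depend on the element order.
                k_global[e + a][e + b] += k_e[a][b]
    return k_global


if __name__ == "__main__":
    for row in assemble_by_element([1.0, 2.0, 3.0]):
        print(row)
```

Each pass of the loop touches a 2x2 block scattered across two rows of the global matrix, which is the access pattern the slide describes as not coalesced.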

Development on GPU
• The assembly of the global stiffness matrix on GPU
– Simple 1D problem: real model with nodes 1 to 4 and elements 1 to 3
– The continuous model is discretized by nodes (four finite element nodes)
– Each node gives one row of the global stiffness matrix:
• Node 1: $[k]_{row\,1} = [\, 0,\ k_{11}^{(1)},\ k_{12}^{(1)} \,]$
• Node 2: $[k]_{row\,2} = [\, k_{21}^{(1)},\ k_{22}^{(1)} + k_{11}^{(2)},\ k_{12}^{(2)} \,]$
• Node 3: $[k]_{row\,3} = [\, k_{21}^{(2)},\ k_{22}^{(2)} + k_{11}^{(3)},\ k_{12}^{(3)} \,]$
• Node 4: $[k]_{row\,4} = [\, k_{21}^{(3)},\ k_{22}^{(3)},\ 0 \,]$

Development on GPU
• In terms of GPU implementation: each thread handles one row, and all the threads do the same calculation
– Thread = 1: $[k]_{row\,1} = [\, 0,\ k_{11}^{(1)},\ k_{12}^{(1)} \,]$
– Thread = 2: $[k]_{row\,2} = [\, k_{21}^{(1)},\ k_{22}^{(1)} + k_{11}^{(2)},\ k_{12}^{(2)} \,]$
– Thread = 3: $[k]_{row\,3} = [\, k_{21}^{(2)},\ k_{22}^{(2)} + k_{11}^{(3)},\ k_{12}^{(3)} \,]$
– Column = 1 of the global matrix:
$$ k_{global} = \begin{bmatrix} 0 \\ k_{21}^{(1)} \\ k_{21}^{(2)} \\ k_{21}^{(3)} \end{bmatrix} $$
– The storage in memory (Column = 1, consecutive threads write consecutive entries):
$$ k_{global} = [\, 0,\ k_{21}^{(1)},\ k_{21}^{(2)},\ k_{21}^{(3)} \,] $$
– The memory access is sequential and aligned

Development on GPU
• In terms of GPU implementation: each thread handles one row
– Thread = 1: $[k]_{row\,1} = [\, 0,\ k_{11}^{(1)},\ k_{12}^{(1)} \,]$
– Thread = 2: $[k]_{row\,2} = [\, k_{21}^{(1)},\ k_{22}^{(1)} + k_{11}^{(2)},\ k_{12}^{(2)} \,]$
– Thread = 3: $[k]_{row\,3} = [\, k_{21}^{(2)},\ k_{22}^{(2)} + k_{11}^{(3)},\ k_{12}^{(3)} \,]$
– Columns 1 and 2 of the global matrix:
$$ k_{global} = \begin{bmatrix} 0 & k_{11}^{(1)} \\ k_{21}^{(1)} & k_{22}^{(1)} + k_{11}^{(2)} \\ k_{21}^{(2)} & k_{22}^{(2)} + k_{11}^{(3)} \\ k_{21}^{(3)} & k_{22}^{(3)} \end{bmatrix} $$
– The storage in memory (Column = 2 is stored right after Column = 1):
$$ k_{global} = [\, 0,\ k_{21}^{(1)},\ k_{21}^{(2)},\ k_{21}^{(3)},\ k_{11}^{(1)},\ k_{22}^{(1)} + k_{11}^{(2)},\ k_{22}^{(2)} + k_{11}^{(3)},\ k_{22}^{(3)} \,] $$
– Memory access is coalesced
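The node-by-row scheme can be emulated on the CPU to make the storage layout concrete. The Python sketch below is an illustration under assumptions, not the actual CUDA kernel: the loop body stands in for what one GPU thread would do, the element matrix is assumed to be the 1D bar form `k * [[1, -1], [-1, 1]]`, and `assemble_by_node` is a hypothetical name. Each row holds three entries (lower, diagonal, upper), and the array is stored column by column so that entry `col * n_nodes + row` belongs to thread `row`.

```python
def assemble_by_node(stiffness):
    """Row-wise assembly of the 1D problem into column-major compact storage.

    stiffness[e] is the stiffness of element e, connecting node e and
    node e + 1 (0-based). The result has 3 * n_nodes entries: all lower
    entries first, then all diagonal entries, then all upper entries.
    """
    n_nodes = len(stiffness) + 1
    data = [0.0] * (3 * n_nodes)
    for row in range(n_nodes):      # on the GPU: one thread per row
        # Stiffness of the elements to the left and right of this node
        # (zero where the node sits on the boundary of the mesh).
        k_left = stiffness[row - 1] if row > 0 else 0.0
        k_right = stiffness[row] if row < n_nodes - 1 else 0.0
        entries = (0.0 - k_left,        # k21 of the left element
                   k_left + k_right,    # k22 (left) + k11 (right)
                   0.0 - k_right)       # k12 of the right element
        for col, value in enumerate(entries):
            # Neighbouring rows write neighbouring addresses for a fixed
            # column, so no two threads ever write the same entry.
            data[col * n_nodes + row] = value
    return data


if __name__ == "__main__":
    print(assemble_by_node([1.0, 2.0, 3.0]))
```

Because each row is owned by exactly one thread, this layout also avoids the shared diagonal updates that element-by-element assembly performs.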

Development on GPU
• Solution of the system of linear equations Ax = b
– Direct solver
– Iterative solver
– A = stiffness matrix, x = nodal displacement vector (unknown values) and b = nodal force vector
– A is symmetric and positive-definite
• The Conjugate Gradient method was chosen
– Iterative algorithm
– Parallelizable algorithm on GPU
– The operations of the conjugate gradient algorithm are well suited to GPU implementation
Figure: Conjugate Gradient algorithm
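For reference, a textbook (unpreconditioned) conjugate gradient iteration is sketched below in plain Python. It is not the Chronos solver: the dense matrix storage, the zero initial guess, the tolerance, and the function names are all assumptions made for illustration. Each iteration consists of one matrix-vector product, two dot products, and three vector updates, which are the operations that map naturally onto GPU threads.

```python
def dot(u, v):
    """Dot product (a parallel reduction on a GPU)."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    """Dense matrix-vector product (one row per thread on a GPU)."""
    return [dot(row, x) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite matrix A."""
    x = [0.0] * len(b)          # initial guess x0 = 0 (assumption)
    r = list(b)                 # residual r = b - A x0 = b
    p = list(r)                 # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:             # residual norm small enough
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)         # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old              # makes the next p A-conjugate
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


if __name__ == "__main__":
    A = [[4.0, 1.0],
         [1.0, 3.0]]
    b = [1.0, 2.0]
    print(conjugate_gradient(A, b))
```

The small 2x2 system in the usage example is only a smoke test; its exact solution is x = (1/11, 7/11).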
