Fast, scalable and accurate finite-element based ab initio calculations using mixed precision computing
Vikram Gavini
Department of Mechanical Engineering, Department of Materials Science and Engineering
University of Michigan, Ann Arbor
Collaborators: Sambit Das (U. Mich); Phani Motamarri (U. Mich); Bruno Turcksin (ORNL); Ying Wai Li (ORNL/LANL); Brent Leback (Nvidia)
Funding: DoE-BES, ARO, AFOSR, TRI, XSEDE, NERSC, ALCF, OLCF
SMC 2019

Impact of Density Functional Theory
[Figure: citations to the seminal work of Walter Kohn (1964, 1965). Data compiled from Web of Science.]
12 of the 100 most-cited papers in the scientific literature pertain to DFT! (Nature 514, 550 (2014))

DFT codes
~100 available DFT codes developed since 1980
[Figure: relationship to HPC. Courtesy: Anubhav Jain. Data compiled from Web of Science.]
Key issues:
- Lack of good parallel scalability of existing DFT codes
- Computational complexity of DFT calculations (O(N^3))

Need for large-scale DFT calculations
- Chemical properties of nanoparticles
- Biological systems
- Rocksalt phase formation during lithiation of magnetite (He et al., Nature Comm., 2016)
- Defects in materials
  - Edge dislocation: Iyer et al., J. Mech. Phys. Solids (2015)
  - Screw dislocation: Das & Gavini, J. Mech. Phys. Solids (2017)

Technological challenge of low ductility in Mg
- Magnesium is the lightest structural material with a high strength-to-weight ratio
  - 75% lighter than steel and 30% lighter than aluminum
- Every 10% reduction in the weight of a vehicle will result in a 6-8% increase in fuel efficiency.
  - Important implications for fuel efficiency and for reducing the carbon footprint
- Low ductility is the key issue in the manufacturability of structural components. It is the main limitation in the adoption of Mg and Mg alloys in the automotive and aerospace sectors. (T.M. Pollock, Science 328, 986-987 (2010))
[Figure: current state of the art: hybrid steel and aluminum construction. Courtesy: https://www.audi-technology-portal.de/en/body]
S. Sandlöbes et al., Scientific Reports 7, 10458 (2017).

Technological challenge of low ductility in Mg
4 slip planes in face-centered cubic crystals → higher ductility
[Figure: Mg slip planes labeled Basal, Prism I, Prism II, Pyramidal I, Pyramidal II]
- Dislocations are energetically more favorable to reside on certain slip systems. (Energetics)
- Dislocation glide occurs once the applied shear stress is greater than the Peierls barrier. (Activation barrier)
- The more slip systems on which dislocations can glide easily, the higher the ductility.

Density Functional Theory
- Kohn-Sham eigenvalue problem
- Self-consistent iteration (Kohn-Sham map)
- Orbital occupancy
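For reference, the textbook form of the three items named on this slide can be written as follows. The notation (orbitals \psi_i, eigenvalues \epsilon_i, effective potential V_eff, occupancies f_i, chemical potential \mu, smearing temperature T, the map F) follows common convention and is an assumption, as are the spin-degenerate factor of 2 and the Fermi-Dirac form of the occupancy; the talk may use different symbols.

```latex
% Kohn-Sham eigenvalue problem (atomic units)
\left(-\tfrac{1}{2}\nabla^2 + V_{\mathrm{eff}}[\rho](\mathbf{r})\right)\psi_i(\mathbf{r})
  = \epsilon_i\,\psi_i(\mathbf{r}), \qquad i = 1, 2, \dots, N

% Self-consistent iteration: the density must be a fixed point of the Kohn-Sham map F
\rho(\mathbf{r}) = 2\sum_{i=1}^{N} f_i\,|\psi_i(\mathbf{r})|^2, \qquad
\rho_{\mathrm{out}} = F[\rho_{\mathrm{in}}], \qquad
\rho^{*} = F[\rho^{*}]

% Orbital occupancy (Fermi-Dirac smearing), with \mu fixed by the electron count N_e
f_i = \frac{1}{1 + \exp\!\left(\dfrac{\epsilon_i - \mu}{k_B T}\right)}, \qquad
2\sum_{i=1}^{N} f_i = N_e
```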

DFT – Finite Element discretization
- Use finite-element basis for computing –
- Features of FE basis
  - Systematic convergence
    - Element size
    - Polynomial order
  - Adaptive refinement
  - Complex geometries and boundary conditions
  - Potential for excellent parallel scalability
[Figure: 1D FE shape functions N_1(r), N_2(r), N_3(r) over nodes i = 1, 2, …, shown for two node layouts. By changing the positioning of the nodes, the spatial resolution of the basis can be changed/adapted.]
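The idea that node placement controls the resolution of the basis can be illustrated in one dimension. The following is a minimal sketch, assuming linear (first-order) shape functions; the function name `hat_functions` and the node positions are hypothetical and chosen only for illustration, whereas the talk uses higher-order elements in three dimensions.

```python
import numpy as np

def hat_functions(nodes, r):
    """Evaluate all linear FE shape functions N_i(r) on a 1D mesh.

    nodes: strictly increasing node positions, shape (n,)
    r:     evaluation points inside [nodes[0], nodes[-1]], shape (m,)
    Returns an (n, m) array whose row i holds N_i evaluated at r.
    """
    nodes = np.asarray(nodes, dtype=float)
    r = np.asarray(r, dtype=float)
    # N_i is the piecewise-linear interpolant of the i-th unit vector:
    # it equals 1 at node i and 0 at every other node.
    identity = np.eye(len(nodes))
    return np.array([np.interp(r, nodes, e) for e in identity])

# Hypothetical adapted mesh: nodes are clustered near r = 0, where finer
# resolution is wanted, and coarsen towards r = 1.
nodes = np.array([0.0, 0.05, 0.1, 0.2, 0.4, 0.7, 1.0])
r = np.linspace(0.0, 1.0, 201)
N = hat_functions(nodes, r)

# The shape functions sum to one everywhere (partition of unity).
partition_of_unity = N.sum(axis=0)
```

Moving entries of `nodes` closer together narrows the corresponding hat functions, which is the 1D analogue of refining the mesh locally.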

Higher (polynomial) order FE basis
[Figure panels: I. Cu nanoparticle, 55 atoms; II. Mo periodic supercell w/ vacancy, 53 atoms]
~1000x advantage by using higher-order FE basis!

Spatial adaptivity of the FE basis
(Motamarri et al., J. Comput. Phys. (2013); Motamarri et al., Comput. Phys. Commun. (2019))
- Error analysis
- Optimal FE mesh
DoFs for pyr II dislocation:
System type      Uniform mesh DoFs   Adaptive mesh DoFs
1848 atom Mg     347,206,614         55,112,161
6164 atom Mg     892,047,315         179,034,231

Eigen-space computation: Chebyshev acceleration
(Zhou et al., J. Comput. Phys. 219 (2006); Motamarri et al., J. Comput. Phys. 253, 308-343 (2013))
Kohn-Sham eigenvalue problem: for k = 1, 2, …, N (N ~ 1.1 N_e/2)
Chebyshev filtering:
[Figure: the spectrum divided into a wanted part and an unwanted part]
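A minimal sketch of a Chebyshev filter is given below, assuming the unscaled three-term recurrence and a dense NumPy matrix in place of the FE-discretized Hamiltonian. The function name, the toy spectrum and the bounds `a` and `b` are hypothetical; the cited papers describe the variant actually used.

```python
import numpy as np

def chebyshev_filter(H, X, degree, a, b):
    """Apply T_degree((H - c I) / e) to the columns of X (degree >= 1).

    [a, b] is an interval assumed to enclose the unwanted part of the
    spectrum. It is mapped to [-1, 1], where Chebyshev polynomials stay
    bounded by 1, while they grow rapidly outside that interval.
    """
    e = 0.5 * (b - a)   # half-width of the unwanted interval
    c = 0.5 * (b + a)   # center of the unwanted interval
    Y_prev = X
    Y = (H @ X - c * X) / e                          # T_1 applied to X
    for _ in range(2, degree + 1):
        # Three-term recurrence: T_{k+1}(t) = 2 t T_k(t) - T_{k-1}(t)
        Y_next = 2.0 * (H @ Y - c * Y) / e - Y_prev
        Y_prev, Y = Y, Y_next
    return Y

# Toy diagonal "Hamiltonian": eigenvalues 0.0 and 0.2 play the role of the
# wanted spectrum, everything in [1, 10] is unwanted.
eigs = np.array([0.0, 0.2, 1.0, 4.0, 7.0, 10.0])
H = np.diag(eigs)
x = np.ones((6, 1))
y = chebyshev_filter(H, x, degree=10, a=1.0, b=10.0)

# Per-eigenvector amplification: large for the wanted eigenvalues,
# at most 1 in magnitude for the unwanted ones.
amplification = np.abs(y[:, 0])
```

Because each application only needs products of H with a block of vectors, the filter maps naturally onto the matrix-matrix kernels discussed on the following slides.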

Numerical algorithm
1. Start with an initial guess for the electron density and the initial wavefunctions
2. Compute the discrete Hamiltonian using the input electron density
3. CF: Chebyshev filtering
4. Orthonormalize the CF basis
5. Rayleigh-Ritz procedure:
   - Compute the projected Hamiltonian
   - Diagonalize
   - Subspace rotation
6. Compute the electron density
7. If the input and output electron densities agree to within a tolerance, EXIT; else, compute a new input electron density using a mixing scheme and go to (2).
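The outer loop (steps 2 to 7) is a fixed-point iteration on the electron density. The sketch below shows only that outer structure, assuming simple linear mixing, which is one of several possible mixing schemes and is not necessarily the one used in the talk. `ks_map` is a hypothetical stand-in for steps 2 to 6, and the toy map is invented so that the example runs on its own.

```python
import numpy as np

def scf_linear_mixing(ks_map, rho0, alpha=0.3, tol=1e-10, max_iter=500):
    """Iterate rho_in -> ks_map(rho_in) with linear mixing until converged.

    ks_map: callable standing in for steps 2-6 (build Hamiltonian, filter,
            orthonormalize, Rayleigh-Ritz, compute output density).
    Returns the converged input density and the number of iterations.
    """
    rho_in = np.asarray(rho0, dtype=float)
    for iteration in range(1, max_iter + 1):
        rho_out = ks_map(rho_in)
        if np.linalg.norm(rho_out - rho_in) < tol:      # step 7: converged
            return rho_in, iteration
        rho_in = rho_in + alpha * (rho_out - rho_in)    # step 7: mix
    raise RuntimeError("SCF iteration did not converge")

# Toy map with fixed point rho_star and slope -1.5: taking rho_out directly
# as the next input (alpha = 1) diverges, while alpha = 0.3 converges.
rho_star = np.array([0.5, 0.3, 0.2])

def toy_ks_map(rho):
    return rho_star - 1.5 * (rho - rho_star)

rho, n_iter = scf_linear_mixing(toy_ks_map, np.full(3, 1.0 / 3.0))
```

The toy case shows why mixing is needed at all: the unmixed iteration amplifies the error by 1.5 per step, whereas the mixed iteration damps it by a factor of 0.25.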

Chebyshev Filtering
[Figure: wavefunction data indexed by DoF (DoF 1, DoF 2, …) and grouped by FE cell; the legend defines the number of FE cells]

Chebyshev Filtering
Strided Batched xGEMM

Chebyshev Filtering
- Atomic operations to avoid race conditions in addition
- Assembly across processor boundaries: communication in FP32
- Repeat for
[Figure: cell-level results added back into the global vectors at DoF 1, DoF 2, …]
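The cell-level pattern on the Chebyshev filtering slides (gather per-cell blocks, batched matrix-matrix products, accumulate into shared DoFs) can be mimicked on the CPU with NumPy. This is a sketch under stated assumptions: a 1D mesh of two-node cells, random symmetric cell matrices, batched `np.matmul` standing in for a strided batched GEMM, and `np.add.at` standing in for atomic additions. All names are hypothetical, and the FP32 communication across processor boundaries is not modeled.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_vec = 5, 3
n_dofs = n_cells + 1

# Connectivity of a 1D mesh: cell c touches global DoFs c and c + 1,
# so interior DoFs are shared by two neighboring cells.
conn = np.stack([np.arange(n_cells), np.arange(1, n_cells + 1)], axis=1)

# One small symmetric matrix per cell, and a block of n_vec wavefunctions.
A = rng.standard_normal((n_cells, 2, 2))
H_cells = A + A.transpose(0, 2, 1)
X = rng.standard_normal((n_dofs, n_vec))

def apply_cellwise(H_cells, conn, X):
    """Compute H @ X without ever forming the global matrix H."""
    X_cell = X[conn]                      # gather: (n_cells, 2, n_vec)
    Y_cell = np.matmul(H_cells, X_cell)   # one small GEMM per cell, batched
    Y = np.zeros_like(X)
    # Scatter-add: np.add.at accumulates correctly when a DoF index appears
    # in several cells, the role played by atomic additions on a GPU.
    np.add.at(Y, conn, Y_cell)
    return Y

Y = apply_cellwise(H_cells, conn, X)

# Reference: assemble the global matrix explicitly and multiply.
H_global = np.zeros((n_dofs, n_dofs))
for c in range(n_cells):
    H_global[np.ix_(conn[c], conn[c])] += H_cells[c]
Y_reference = H_global @ X
```

A plain fancy-index assignment such as `Y[conn] += Y_cell` would silently drop contributions to shared DoFs, which is the race condition the atomic operations avoid.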

Performance of Chebyshev filtering (Summit)
Case study: Mg 3x3x3 supercell with a vacancy (1070 electrons).
Fig: 14.7x GPU speedup for Chebyshev filtering. The CPU run used 2 Summit nodes with 42 MPI tasks per node, while the GPU run used 2 Summit nodes with 12 GPUs (3 MPI tasks per GPU).
Fig: Chebyshev filtering throughput on 2 Summit nodes using 12 GPUs (3 MPI tasks per GPU) for various block sizes. The FP64 peak of 2 Summit nodes is 87.6 TFLOPS.

Orthogonalization: Cholesky Gram-Schmidt
- Cholesky factorization of the overlap matrix
- Orthonormal basis construction
Mixed precision computation for Chol-GS
1. Compute the overlap matrix S
2. Cholesky factorization of S in double precision
3. Orthonormal basis construction
Blocked approach to reduce peak memory: copy block to CPU (if computation performed on GPU), MPI_Allreduce, fill the ScaLAPACK parallelized S matrix.
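The three Chol-GS steps can be sketched serially in NumPy. The slide's formulas are not available here, so the precision split below is an assumption: diagonal blocks of S are computed in FP64 and off-diagonal blocks in FP32, with the Cholesky factorization kept in FP64. The function names and block size are hypothetical, and the MPI and ScaLAPACK parts of the blocked approach are omitted.

```python
import numpy as np

def overlap_mixed_precision(X, block):
    """Overlap matrix S = X^T X, built block by block.

    Assumed split: diagonal blocks in FP64, off-diagonal blocks in FP32.
    Only the lower block triangle is computed; the upper one is mirrored.
    """
    n = X.shape[1]
    X32 = X.astype(np.float32)
    S = np.zeros((n, n))
    for i in range(0, n, block):
        for j in range(0, i + 1, block):
            if i == j:
                S[i:i + block, i:i + block] = X[:, i:i + block].T @ X[:, i:i + block]
            else:
                blk = X32[:, i:i + block].T @ X32[:, j:j + block]   # FP32 GEMM
                S[i:i + block, j:j + block] = blk
                S[j:j + block, i:i + block] = blk.T
    return S

def cholesky_gram_schmidt(X, block=4):
    """Return X L^{-T}, where S = L L^T is the Cholesky factorization."""
    S = overlap_mixed_precision(X, block)
    L = np.linalg.cholesky(S)             # double precision
    # Solving L Z = X^T gives Z = L^{-1} X^T, so Z^T = X L^{-T}.
    return np.linalg.solve(L, X.T).T

rng = np.random.default_rng(1)
X = rng.standard_normal((300, 12))        # 300 DoFs, 12 wavefunctions
X_ortho = cholesky_gram_schmidt(X)
orthonormality_error = np.abs(X_ortho.T @ X_ortho - np.eye(12)).max()
```

In this toy case the reduced-precision blocks leave an orthonormality error well above FP64 round-off but still small; whether that is acceptable depends on how the result is used downstream.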

Orthogonalization: Cholesky Gram-Schmidt
NERSC Cori CPU cluster benchmark: performance improvement in CholGS due to the mixed precision algorithm. Case study: Mg10x10x10 (39,990 electrons) and Mo13x13x13 (61,502 electrons).
Summit GPU cluster benchmark: performance improvement in the computation of S due to the mixed precision algorithm. Case study: 61,640 electrons system using 1300 Summit nodes.

Rayleigh-Ritz procedure
- Compute the projected Hamiltonian
- Diagonalization of the projected Hamiltonian
- Subspace rotation step
Mixed precision computation for RR
1. Compute the projected Hamiltonian
[Figure: projected Hamiltonian partitioned into blocks labeled N_oc and N_fr]

Rayleigh-Ritz procedure
2. Diagonalization of the projected Hamiltonian in double precision
3. Subspace rotation step
Summit GPU cluster benchmark: performance improvement due to the mixed precision algorithm. Case study: 61,640 electrons system using 1300 Summit nodes.
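The three Rayleigh-Ritz steps can be sketched in a few lines of NumPy. This sketch runs entirely in double precision and uses a dense random symmetric matrix as a stand-in Hamiltonian; the mixed precision partitioning into N_oc and N_fr blocks is not reproduced, and the names are hypothetical.

```python
import numpy as np

def rayleigh_ritz(H, X):
    """Rayleigh-Ritz on the subspace spanned by the orthonormal columns of X.

    Returns the Ritz values and the rotated basis X @ Q.
    """
    H_proj = X.T @ (H @ X)                  # projected Hamiltonian
    H_proj = 0.5 * (H_proj + H_proj.T)      # remove round-off asymmetry
    ritz_values, Q = np.linalg.eigh(H_proj) # diagonalization
    return ritz_values, X @ Q               # subspace rotation

rng = np.random.default_rng(2)
A = rng.standard_normal((50, 50))
H = 0.5 * (A + A.T)                         # toy symmetric "Hamiltonian"
exact_values, exact_vectors = np.linalg.eigh(H)

# Build an orthonormal basis of the lowest 5-dimensional eigenspace that is
# scrambled by a random orthogonal rotation R.
R, _ = np.linalg.qr(rng.standard_normal((5, 5)))
X = exact_vectors[:, :5] @ R

ritz_values, X_rotated = rayleigh_ritz(H, X)

# Residual of the eigenvalue equation for the rotated basis.
residual = np.abs(H @ X_rotated - X_rotated * ritz_values).max()
```

When the columns of X span an exact invariant subspace, as in this toy case, the Ritz values coincide with the corresponding eigenvalues and the rotated vectors are eigenvectors.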

Comparison with Quantum Espresso (Cori KNL)
(Motamarri et al., Comput. Phys. Commun. (2019))
Monovacancy in HCP Mg – periodic calculation; ONCV pseudopotential
Accuracy for all calculations < 0.1 mHa/atom (~2 meV/atom)
Time per SCF in Node-Hrs for various system sizes (NERSC Cori KNL):
System size                 Q-Espresso (Ecut: 45 Ha)   DFT-FE (h_min: 0.46, p=4)
255 atoms (N_e = 2550)      0.1                        0.3
863 atoms (N_e = 8630)      4.4                        3.3
2047 atoms (N_e = 20470)    123.5                      21.6
3999 atoms (N_e = 39990)    -                          103.4
[Figure: wall-time per SCF iteration (sec) vs. number of electrons for Quantum Espresso and DFT-FE]

Comparison with Quantum Espresso (Cori KNL)
Cu nanoparticles – non-periodic calculation; ONCV pseudopotential
Accuracy for all calculations < 0.1 mHa/atom (~2 meV/atom)
Time per SCF in Node-Hrs for various system sizes (NERSC Cori KNL):
System size                 Q-Espresso (Ecut: 50 Ha)   DFT-FE (h_min: 0.4; p=4)
147 atoms (N_e = 2793)      0.2                        0.3
309 atoms (N_e = 5871)      5.5                        1.7
561 atoms (N_e = 10569)     63.4                       5.3
923 atoms (N_e = 17537)     -                          12.7

Technological challenge of low ductility in Mg
12 slip systems in face-centered cubic crystals → higher ductility
[Figure: Mg slip planes labeled Basal, Prism I, Prism II, Pyramidal I, Pyramidal II]
- Dislocations are energetically more favorable to reside on certain slip systems. (Energetics)
- Dislocation glide occurs once the applied shear stress is greater than the Peierls barrier. (Activation barrier)
- The more slip systems on which dislocations can glide easily, the higher the ductility.

Mg Pyramidal dislocation systems
Pyramidal I and II dislocation systems of various sizes: 728, 1848, 6164 and 10,508 Mg atoms.

Performance Benchmarks – Strong Scaling/time to solution
Mg pyr II screw dislocation – 1,848 atoms (18,480 e-); 55.11 million FE DoFs
[Figure, Theta: ideal and observed relative speedup vs. number of MPI tasks (2048 to 65,536).] Wall-time on 2048 tasks: 1511 sec; wall-time on 65,536 tasks: 104 sec.
[Figure, Summit GPUs: ideal and observed relative speedup vs. number of MPI tasks (1260 to 20,160); 3 MPI tasks per GPU via MPS.] Wall-time on 1260 tasks: 97.6 sec; wall-time on 20,160 tasks: 13.99 sec.
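The quoted wall-times can be turned into observed speedups and parallel efficiencies with a short calculation. The sketch assumes that the 2048 and 65,536 task timings belong to the Theta panel and the 1260 and 20,160 task timings to the Summit panel, as the axis ranges suggest; the function name is hypothetical.

```python
def strong_scaling(tasks_small, time_small, tasks_large, time_large):
    """Return (observed speedup, parallel efficiency) between two runs."""
    speedup = time_small / time_large
    ideal_speedup = tasks_large / tasks_small
    return speedup, speedup / ideal_speedup

# Wall-times in seconds as quoted on the slide.
theta_speedup, theta_efficiency = strong_scaling(2048, 1511.0, 65536, 104.0)
summit_speedup, summit_efficiency = strong_scaling(1260, 97.6, 20160, 13.99)

print(f"Theta:  {theta_speedup:.1f}x on 32x the tasks "
      f"({100 * theta_efficiency:.0f}% efficiency)")
print(f"Summit: {summit_speedup:.1f}x on 16x the tasks "
      f"({100 * summit_efficiency:.0f}% efficiency)")
```

Under that assumption, both machines retain roughly 44-45% parallel efficiency at the largest task counts, relative to the smallest run on each machine.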

Performance Benchmarks – Weak Scaling (Summit)
[Figure: percent weak scaling efficiency vs. number of electrons (2500 to 100,000); top axis: total MPI tasks (3 MPI tasks per GPU; via MPS): 54, 180, 576, 3294, 12744, 38250]
Computational complexity:
- Chebyshev filtering: O(MN)
- Orthonormalization: O(MN^2)
- Rayleigh-Ritz procedure: O(MN^2)
Onset of cubic scaling significantly delayed!
