GPU acceleration of plane-wave codes using SIRIUS library


SLIDE 1

GPU acceleration of plane-wave codes using SIRIUS library

Materials Design Ecosystem at the Exascale: High-Performance and High-Throughput Computing
Anton Kozhevnikov, CSCS
January 29, 2018

SLIDE 2

Introduction

SLIDE 3

Piz Daint: #3 supercomputer in the world

Cray XC50, 5320 nodes
Intel Xeon E5-2690 v3 (12 cores, 2.6 GHz), 64 GB + NVIDIA Tesla P100 16 GB
4.761 Teraflops / node

SLIDE 4

Piz Daint node layout

CPU: ~500 Gigaflops, 64 GB of DDR4 host memory (~60 GB/s)
GPU: ~4.2 Teraflops, 16 GB of high-bandwidth memory (732 GB/s)
CPU-GPU link: PCIe x16, 32 GB/s bidirectional
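As a quick consistency check (not on the slide), the per-node peak quoted on the previous slide is roughly the sum of the two parts:

\[
P_{\text{node}} \approx P_{\text{CPU}} + P_{\text{GPU}} \approx 0.5~\text{TF} + 4.2~\text{TF} \approx 4.7~\text{TF},
\]

in line with the 4.761 Teraflops per node quoted for Piz Daint.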

SLIDES 5-7

Porting codes to GPUs

No magic “silver bullet” exists!

Usual steps in porting codes to GPUs:

▪ clean up and refactor the code
▪ (possibly) change the data layout
▪ fully utilize CPU threads and prepare the code for node-level parallelization
▪ move compute-intensive kernels to GPUs

SLIDES 8-9

Porting codes to GPUs

▪ CUDA (C / C++ / Fortran)
▪ OpenACC
▪ OpenCL
▪ OpenMP 4.0 (see the sketch below)
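As an illustration of the OpenMP 4.0 option listed above (not code from the talk), a minimal sketch of offloading a simple grid operation with target directives; the function and array names are invented for the example:

    // Multiply a wave-function-like array by a local potential defined on the same grid.
    // The loop is offloaded to the accelerator with OpenMP 4.0 "target" directives;
    // the map() clauses copy the arrays between host and device memory.
    void apply_local_potential(int n, const double* v, double* phi)
    {
        #pragma omp target teams distribute parallel for map(to: v[0:n]) map(tofrom: phi[0:n])
        for (int i = 0; i < n; i++) {
            phi[i] *= v[i];
        }
    }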

SLIDES 10-14

Why do we need a separation of concerns?

[Diagram: the supercomputer, the code, and the three groups of people around them: computational scientists, code developers and users.]

SLIDE 15

Electronic-structure codes

SLIDE 16

Electronic-structure codes

Classification by atomic potential treatment and by the basis functions used for the KS states:

▪ Full-potential, periodic Bloch functions (plane-waves or similar): FLEUR, Wien2K, Exciting, Elk
▪ Full-potential, localized orbitals: FHI-aims, FPLO
▪ Pseudo-potential, periodic Bloch functions (plane-waves or similar): VASP, CPMD, Quantum ESPRESSO, Abinit, Qbox
▪ Pseudo-potential, localized orbitals: CP2K, SIESTA, OpenMX

SLIDE 17

Delta DFT codes effort

SLIDES 18-20

Pseudopotential plane-wave method

▪ Unit cell is mapped to a regular grid
▪ All functions are expanded in plane-waves
▪ Atomic potential is replaced by a pseudopotential:

\[
\hat{V}_{\mathrm{PS}} = V_{\mathrm{loc}}(\mathbf{r}) + \sum_{\alpha}\sum_{\xi\xi'} |\beta^{\alpha}_{\xi}\rangle\, D^{\alpha}_{\xi\xi'}\, \langle\beta^{\alpha}_{\xi'}|
\]

Basis functions:

\[
\varphi_{\mathbf{G}+\mathbf{k}}(\mathbf{r}) = \frac{1}{\sqrt{\Omega}}\, e^{i(\mathbf{G}+\mathbf{k})\cdot\mathbf{r}}
\]

Potential and density:

\[
V(\mathbf{r}) = \sum_{\mathbf{G}} V(\mathbf{G})\, e^{i\mathbf{G}\cdot\mathbf{r}}, \qquad
\rho(\mathbf{r}) = \sum_{\mathbf{G}} \rho(\mathbf{G})\, e^{i\mathbf{G}\cdot\mathbf{r}}
\]
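As an illustration of the last formula (not code from the talk), a direct-summation sketch that evaluates V(r) = Σ_G V(G) e^{iG·r} at a set of real-space points; in a production code this transform would of course be done with FFTs:

    #include <array>
    #include <complex>
    #include <cstddef>
    #include <vector>

    using complexd = std::complex<double>;
    using vec3     = std::array<double, 3>;

    // Evaluate V(r) = sum_G V(G) exp(i G.r) at the given real-space points by direct summation.
    // gvec holds the G vectors (Cartesian coordinates) and vg the plane-wave coefficients V(G).
    std::vector<complexd> pw_to_real(const std::vector<vec3>& gvec,
                                     const std::vector<complexd>& vg,
                                     const std::vector<vec3>& rpoints)
    {
        std::vector<complexd> vr(rpoints.size(), complexd(0.0, 0.0));
        for (std::size_t ir = 0; ir < rpoints.size(); ir++) {
            for (std::size_t ig = 0; ig < gvec.size(); ig++) {
                double gr = gvec[ig][0] * rpoints[ir][0] + gvec[ig][1] * rpoints[ir][1] +
                            gvec[ig][2] * rpoints[ir][2];
                vr[ir] += vg[ig] * std::exp(complexd(0.0, gr));   // V(G) e^{i G.r}
            }
        }
        return vr;
    }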

SLIDE 21

Pseudopotential plane-wave method

▪ Approximation to the atomic potential
▪ Core states are excluded
▪ Number of basis functions: ~1000 / atom
▪ Number of valence states: ~0.001 - 0.01 of the total basis size
▪ Efficient iterative subspace diagonalization schemes exist
▪ Atomic forces can be easily computed
▪ Stress tensor can be easily computed

SLIDES 22-24

Full-potential linearized augmented plane-wave method

[Diagram: unit cell partitioned into muffin-tin spheres around atom #1 and atom #2 and the interstitial region.]

▪ Unit cell is partitioned into “muffin-tin” spheres and interstitial region
▪ Inside MT spheres spherical harmonic expansion is used
▪ In the interstitial region functions are expanded in plane-waves

Basis functions:

\[
\varphi_{\mathbf{G}+\mathbf{k}}(\mathbf{r}) =
\begin{cases}
\displaystyle\sum_{\ell m}\sum_{\nu=1}^{O^{\alpha}_{\ell}} A^{\alpha}_{\ell m\nu}(\mathbf{G}+\mathbf{k})\, u^{\alpha}_{\ell\nu}(r)\, Y_{\ell m}(\hat{\mathbf{r}}) & \mathbf{r} \in \mathrm{MT}_{\alpha} \\[1ex]
\dfrac{1}{\sqrt{\Omega}}\, e^{i(\mathbf{G}+\mathbf{k})\cdot\mathbf{r}} & \mathbf{r} \in \mathrm{I}
\end{cases}
\]

Potential and density:

\[
V(\mathbf{r}) =
\begin{cases}
\displaystyle\sum_{\ell m} V^{\alpha}_{\ell m}(r)\, Y_{\ell m}(\hat{\mathbf{r}}) & \mathbf{r} \in \mathrm{MT}_{\alpha} \\[1ex]
\displaystyle\sum_{\mathbf{G}} V(\mathbf{G})\, e^{i\mathbf{G}\cdot\mathbf{r}} & \mathbf{r} \in \mathrm{I}
\end{cases}
\qquad
\rho(\mathbf{r}) =
\begin{cases}
\displaystyle\sum_{\ell m} \rho^{\alpha}_{\ell m}(r)\, Y_{\ell m}(\hat{\mathbf{r}}) & \mathbf{r} \in \mathrm{MT}_{\alpha} \\[1ex]
\displaystyle\sum_{\mathbf{G}} \rho(\mathbf{G})\, e^{i\mathbf{G}\cdot\mathbf{r}} & \mathbf{r} \in \mathrm{I}
\end{cases}
\]
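To make the muffin-tin part of these expansions concrete, a self-contained sketch (not SIRIUS code) that evaluates f(r) = Σ_lm f_lm(r) Y_lm(r̂) at one point inside a sphere; it uses the C++17 std::sph_legendre special function, and the coefficient layout is an arbitrary choice for the example:

    #include <cmath>     // std::sph_legendre (C++17)
    #include <complex>
    #include <vector>

    // Complex spherical harmonic Y_lm(theta, phi); for m < 0 the relation
    // Y_{l,-m} = (-1)^m conj(Y_{l,m}) is used.
    std::complex<double> Ylm(int l, int m, double theta, double phi)
    {
        int am   = std::abs(m);
        double p = std::sph_legendre(l, am, theta);                  // Y_l^{|m|}(theta, phi = 0)
        std::complex<double> y = p * std::exp(std::complex<double>(0.0, am * phi));
        return (m < 0) ? ((am % 2) ? -std::conj(y) : std::conj(y)) : y;
    }

    // Evaluate f(r) = sum_{lm} f_lm(r) Y_lm(r̂) inside a muffin-tin sphere, given the
    // radial coefficients f_lm at the radius |r| stored in (l, m) order, m = -l ... l.
    std::complex<double> mt_value(int lmax, const std::vector<std::complex<double>>& flm,
                                  double theta, double phi)
    {
        std::complex<double> f(0.0, 0.0);
        int idx = 0;
        for (int l = 0; l <= lmax; l++) {
            for (int m = -l; m <= l; m++, idx++) {
                f += flm[idx] * Ylm(l, m, theta, phi);
            }
        }
        return f;
    }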

SLIDE 25

Full-potential linearized augmented plane-wave method

▪ No approximation to atomic potential
▪ Core states are included
▪ Number of basis functions: ~100 / atom
▪ Number of valence states: ~15-20% of the total basis size
▪ Large condition number of the overlap matrix
▪ Full diagonalization of dense matrix is required (iterative subspace diagonalization schemes are not efficient)
▪ Atomic forces can be easily computed
▪ Stress tensor can’t be easily computed (N-point numerical scheme is often required)

SLIDE 26

Common features of the FP-LAPW and PP-PW methods

▪ Definition of the unit cell (atoms, atom types, lattice vectors, symmetry operations, etc.)
▪ Definition of the reciprocal lattice, plane-wave cutoffs, G vectors, G+k vectors (see the sketch below)
▪ Definition of the wave-functions
▪ FFT driver
▪ Generation of the charge density on the regular grid
▪ Generation of the XC-potential
▪ Symmetrization of the density, potential and occupancy matrices
▪ Low-level numerics (spherical harmonics, Bessel functions, Gaunt coefficients, spline interpolation, Wigner D-matrix, linear algebra wrappers, etc.)
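One of these building blocks, the generation of G vectors within a plane-wave cutoff, might look as follows; this is an illustrative sketch, not the SIRIUS implementation, and the search-box bound nmax is left to the caller:

    #include <array>
    #include <cmath>
    #include <vector>

    // Collect all reciprocal-lattice vectors G = i*b1 + j*b2 + k*b3 with |G| <= gmax.
    // b holds the three reciprocal lattice vectors; nmax bounds the integer search box.
    std::vector<std::array<double, 3>>
    generate_gvectors(const std::array<std::array<double, 3>, 3>& b, double gmax, int nmax)
    {
        std::vector<std::array<double, 3>> gvec;
        for (int i = -nmax; i <= nmax; i++) {
            for (int j = -nmax; j <= nmax; j++) {
                for (int k = -nmax; k <= nmax; k++) {
                    std::array<double, 3> g;
                    for (int x = 0; x < 3; x++) {
                        g[x] = i * b[0][x] + j * b[1][x] + k * b[2][x];
                    }
                    if (std::sqrt(g[0] * g[0] + g[1] * g[1] + g[2] * g[2]) <= gmax) {
                        gvec.push_back(g);
                    }
                }
            }
        }
        return gvec;
    }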

SLIDE 27

SIRIUS library

SLIDES 28-29

Motivation for a common domain-specific library

[Diagram, before: Quantum ESPRESSO (inherent PW / PAW implementation) and Exciting / Elk (inherent LAPW implementation) sit directly on top of BLAS, PBLAS, LAPACK, ScaLAPACK and FFT, running on the CPU.]

[Diagram, after: the same codes call the SIRIUS domain-specific library (LAPW / PW / PAW implementation), which uses BLAS, PBLAS, LAPACK, ScaLAPACK, FFT, cuBLAS, MAGMA, PLASMA and cuFFT and runs on the CPU and the GPU.]

Extend the legacy Fortran codes with API calls to a domain-specific library which runs on GPUs and other novel architectures.
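One way such an API boundary can look (a hedged sketch, not the actual SIRIUS API): the C++ library exports C-linkage entry points that a legacy Fortran code can call through ISO_C_BINDING; the function name, argument list and internal context object here are invented for the illustration:

    #include <array>
    #include <vector>

    namespace demo {
    // Illustrative library-side state; a real library would expose a handle to a context object.
    struct Context {
        std::array<std::array<double, 3>, 3> lattice_vectors;
        std::vector<std::array<double, 3>>   atom_positions;
    };
    static Context ctx;
    } // namespace demo

    extern "C" {

    // C-linkage entry point, callable from Fortran via ISO_C_BINDING:
    // copy num_atoms atomic positions passed as a flat 3 x num_atoms array.
    void demo_set_atom_positions(const double* pos, int num_atoms)
    {
        demo::ctx.atom_positions.resize(num_atoms);
        for (int ia = 0; ia < num_atoms; ia++) {
            for (int x = 0; x < 3; x++) {
                demo::ctx.atom_positions[ia][x] = pos[3 * ia + x];
            }
        }
    }

    } // extern "C"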

SLIDE 30

SIRIUS domain-specific library: LAPW / PW / PAW implementation

Where to draw the line?

The self-consistency cycle: effective potential construction, density mixing, density generation and the eigenvalue problem:

\[
\Big(-\tfrac{1}{2}\Delta + v_{\mathrm{eff}}(\mathbf{r})\Big)\, \psi_{j}(\mathbf{r}) = \varepsilon_{j}\, \psi_{j}(\mathbf{r})
\]

\[
\rho(\mathbf{r}) = \alpha\, \rho_{\mathrm{new}}(\mathbf{r}) + (1 - \alpha)\, \rho_{\mathrm{old}}(\mathbf{r})
\]

\[
\rho_{\mathrm{new}}(\mathbf{r}) = \sum_{j} |\psi_{j}(\mathbf{r})|^{2}
\]

\[
v_{\mathrm{eff}}(\mathbf{r}) = \int \frac{\rho(\mathbf{r}')}{|\mathbf{r}' - \mathbf{r}|}\, d\mathbf{r}' + v_{\mathrm{XC}}[\rho](\mathbf{r}) + v_{\mathrm{ext}}(\mathbf{r})
\]

Output: total energy E_tot, atomic forces F_α, stress tensor σ_αβ, charge density ρ(r) and magnetization m(r), wave-functions ψ_j(r) and eigenenergies ε_j.
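A minimal sketch of the loop that ties these pieces together, with the simple linear mixing written out (this is an illustration, not SIRIUS code; the generate_density callback stands in for solving the Kohn-Sham problem and summing |ψ_j|²):

    #include <cmath>
    #include <cstddef>
    #include <functional>
    #include <vector>

    // Fixed-point SCF iteration with linear mixing rho = alpha * rho_new + (1 - alpha) * rho_old.
    // "generate_density" maps an input density to the output density of one SCF step.
    std::vector<double> scf_loop(std::vector<double> rho,
                                 const std::function<std::vector<double>(const std::vector<double>&)>& generate_density,
                                 double alpha, double tol, int max_iter)
    {
        for (int it = 0; it < max_iter; it++) {
            std::vector<double> rho_new = generate_density(rho);   // veff[rho] -> KS states -> rho_new
            double diff = 0.0;
            for (std::size_t i = 0; i < rho.size(); i++) {
                diff  += std::abs(rho_new[i] - rho[i]);
                rho[i] = alpha * rho_new[i] + (1.0 - alpha) * rho[i];
            }
            if (diff < tol) {
                break;                                             // self-consistency reached
            }
        }
        return rho;
    }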

SLIDE 31

SIRIUS library

Full-potential (L)APW+lo:
▪ non-magnetic, collinear and non-collinear magnetic ground states
▪ non-relativistic, ZORA and IORA valence solvers
▪ Dirac solver for core states

Norm-conserving, ultrasoft and PAW pseudopotentials:
▪ non-magnetic, collinear and non-collinear magnetic ground states
▪ spin-orbit correction
▪ atomic forces
▪ stress tensor
▪ Gamma-point case

SLIDE 32

SIRIUS library

SIRIUS is a collection of classes that abstract away the different building blocks of PW and LAPW codes. The class composition hierarchy starts from the most primitive classes (Communicator, mdarray, etc.) and progresses towards several high-level classes (DFT_ground_state, Band, Potential, etc.). The code is written in C++11 with MPI, OpenMP and CUDA programming models.

[Class composition diagram: mdarray, Communicator, splindex, matrix3d, vector3d, Atom, Spline, Periodic_function, K_point, Step_function, Matching_coefficients, Gvec, MPI_grid, dmatrix, FFT3D, BLACS_grid, linalg, Eigensolver, Wave_functions, Atom_type, Radial_grid, Unit_cell, Radial_integrals, Augmentation_operator, Simulation_context, Non_local_operator, Potential, Local_operator, Band, DFT_ground_state, Beta_projectors, Density, K_point_set.]

https://github.com/electronic-structure/SIRIUS
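As a flavour of the primitive end of that hierarchy, a hedged sketch of a small multi-dimensional array wrapper; this is illustrative only and not the actual sddk::mdarray class:

    #include <cassert>
    #include <cstddef>
    #include <vector>

    // Illustrative 2D array with column-major storage, in the spirit of an "mdarray"-style
    // building block (not the real SIRIUS class).
    template <typename T>
    class array2d
    {
      public:
        array2d(int nrow, int ncol)
            : nrow_(nrow), ncol_(ncol), data_(static_cast<std::size_t>(nrow) * ncol)
        {
        }

        T& operator()(int i, int j)
        {
            assert(i >= 0 && i < nrow_ && j >= 0 && j < ncol_);
            return data_[static_cast<std::size_t>(j) * nrow_ + i];   // column-major, as in BLAS/LAPACK
        }

        int nrow() const { return nrow_; }
        int ncol() const { return ncol_; }
        T* data() { return data_.data(); }                           // raw pointer for BLAS calls or GPU copies

      private:
        int nrow_;
        int ncol_;
        std::vector<T> data_;
    };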

SLIDE 33

Doxygen documentation

https://electronic-structure.github.io/SIRIUS-doc/

[Example Doxygen collaboration diagram for sirius::Local_operator, showing its relations to sddk::FFT3D, sddk::Gvec, sddk::mdarray, sddk::Communicator and other SDDK classes.]

SLIDE 34

Development cycle

[Diagram: changes flow between the QEF/q-e master branch and the electronic-structure/q-e fork (master and sirius branches) via pull requests.]

https://github.com/electronic-structure/q-e

SLIDES 35-36

Example of QE/SIRIUS interoperability

Initialization phase (QE and SIRIUS sides):
▪ read input file, read pseudopotentials, create a list of k-points, initialize data structures, communicators, etc.
▪ initialize simulation context
▪ set unit cell parameters (lattice vectors, atom types, atomic positions, etc.), cutoffs and other parameters
▪ set k-points
▪ initialize K_point_set class
▪ initialize Density class
▪ initialize Potential class
▪ initialize DFT_ground_state class
▪ generate initial density
▪ get rho(G) and mag(G)

SCF cycle (QE and SIRIUS sides):
▪ set Veff(G)
▪ generate Veff(r) and Veff(G)
▪ solve band problem and find KS orbitals
▪ get band energies
▪ find band occupancies
▪ set band occupancies
▪ generate unsymmetrized rho(G) and mag(G)
▪ get rho(G) and mag(G)
▪ symmetrize rho(G) and mag(G)
▪ mix rho(G) and mag(G)
▪ generate forces
▪ get forces
▪ generate stress tensor
▪ get stress tensor

SLIDES 37-42

QE: variable-cell relaxation of Si63Ge

Performance benchmark of the QE and SIRIUS-enabled QE codes for the 64-atom unit cell of Si1-xGex. The runs were performed on hybrid nodes with a 12-core Intel Haswell @ 2.5 GHz + NVIDIA Tesla P100 card (GPU), on dual-socket 18-core Intel Broadwell @ 2.1 GHz nodes (CPU) and on nodes with a 64-core Intel Xeon Phi processor @ 1.3 GHz (KNL). The time for the full ‘vc-relax’ calculation is reported.

[Bar chart: time to solution (sec) for 1, 2, 5 and 10 nodes. Data labels: QE (CPU): 232, 406, 988, 1901; QE+SIRIUS (CPU): 252, 426, 960, 1860; QE+SIRIUS (KNL): 298.44, 519.5, 1020, 1800; QE+SIRIUS (GPU): 150, 240, 480, 900.]

SLIDES 43-48

QE: ground state of a Pt cluster in water

Performance benchmark of the QE and SIRIUS-enabled QE codes for the 288-atom unit cell of a Pt cluster embedded in water. The runs were performed on dual-socket 18-core Intel Broadwell @ 2.1 GHz nodes (CPU), on hybrid nodes with a 12-core Intel Haswell @ 2.5 GHz + NVIDIA Tesla P100 card (GPU) and on nodes with a 64-core Intel Xeon Phi processor @ 1.3 GHz (KNL). The ELPA eigenvalue solver was used for the CPU runs. The time for the SCF ground-state calculation is reported.

[Bar chart: time to solution (sec) for 18, 32 and 50 nodes. Data labels: QE (CPU): 364.03, 275.49, 344.18; QE+SIRIUS (CPU): 186.6, 208.35, 305.46; QE+SIRIUS (KNL): 240, 249.88, 330.14; QE+SIRIUS (GPU): 112.34, 125.07, 166.49.]

SLIDES 49-54

Exciting: ground state of a Mn-based MOF (C5H11MnNO6)

Performance benchmark of the Exciting and SIRIUS-enabled Exciting codes for the 96-atom unit cell of a Mn metal-organic framework. The runs were performed on dual-socket 18-core Intel Broadwell @ 2.1 GHz nodes (CPU) and on hybrid nodes with a 12-core Intel Haswell @ 2.5 GHz + NVIDIA Tesla P100 card (GPU).

[Bar chart: time to solution (minutes) for 12, 24, 96 and 216 nodes. Data labels: Exciting (CPU, sequential diagonalization with MKL): 430.8, 378.1; Exciting+SIRIUS (CPU, sequential diagonalization with MKL): 163.8, 184.1; Exciting+SIRIUS (CPU, parallel diagonalization with ELPA): 44.0, 53.1, 147.1, 279.5; Exciting+SIRIUS (GPU, sequential diagonalization with MAGMA): 26.6, 50.8.]

SLIDE 55

Thank you for your attention.