Scalable GW software for excited electrons using OpenAtom



slide-1
SLIDE 1

Scalable GW software for excited electrons using OpenAtom

Kayahan Saritas, Minjung Kim and Sohrab Ismail-Beigi (Yale University)
Kavitha Chandrasekar, Eric Mikida, Eric Bohm and Laxmikant Kale (University of Illinois at Urbana-Champaign)
Glenn Martyna (Pimpernel Science, Software and Information Technology)

slide-2
SLIDE 2

Electronic structure calculations

$$ i\hbar \frac{\partial}{\partial t} |\Psi(t)\rangle = \hat{H} |\Psi(t)\rangle $$

  • Schrödinger equation for a many-body system (shown here in its time-dependent form), with many coordinates $R_i$ and $r_j$
  • Density functional theory (DFT) simplifies this to a one-body problem

Solve for wavefunctions $\psi_i(r)$ and energies $\epsilon_i$

slide-3
SLIDE 3

Comparison of the methods

[Figure: computational cost versus number of atoms (1 to 10,000) for FCI, O(N!); CCSD(T), O(N^7); QMC; GW; HF and DFT, O(N^3); and tight binding, O(N^3). The plot also carries an O(N^{3-4}) scaling label and the annotations "Exact Schrödinger equation", "Chemical accuracy", "Transition states?" and "Relative energies".]
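To make the scaling exponents on this slide concrete, the short sketch below computes how much the cost grows when the atom count is multiplied by a given factor. The function names and the dictionary are illustrative only; the exponents are the ones quoted in the talk (N^3 for HF/DFT and tight binding, N^4 for conventional GW, N^7 for CCSD(T)), and prefactors are ignored.

```python
# Relative cost growth for a method whose cost scales as N**power
# when the number of atoms is multiplied by `factor`.
def cost_growth(factor, power):
    return factor ** power

# Exponents quoted in the talk; prefactors are deliberately ignored.
SCALINGS = {
    "HF/DFT, tight binding (N^3)": 3,
    "conventional GW (N^4)": 4,
    "CCSD(T) (N^7)": 7,
}

def growth_table(factor):
    """Cost multiplier per method for an atom count scaled by `factor`."""
    return {name: cost_growth(factor, power) for name, power in SCALINGS.items()}

for name, growth in growth_table(10).items():
    print(f"{name}: cost grows by a factor of {growth:,}")
```

The gap between the N^3 and N^4 rows is the motivation for the O(N^3) GW method discussed later in the deck.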

slide-4
SLIDE 4

DFT problem with excitations

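[Figure: ground-state energy-level diagram with the filled valence band up to $\epsilon_N$, the empty conduction band starting at $\epsilon_{N+1}$, and the band gap between them]

DFT (Janak's theorem):

$$ E_{gap} = \left.\frac{\partial E}{\partial N}\right|_{N+\delta} - \left.\frac{\partial E}{\partial N}\right|_{N-\delta} = \epsilon_{N+1} - \epsilon_N $$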

slide-5
SLIDE 5

DFT problem with excitations


Band gaps (eV)

Material   DFT   GW        Expt.
Diamond    3.9   5.6*      5.48
Si         0.5   1.3*      1.17
LiCl       6.0   9.1*      9.4
SrTiO3     2.0   3.4-3.8   3.25

Why are band gaps and excitations in a material important?
  • Is the material metallic, semiconducting or insulating?
  • Light-matter interactions in general
  • Many engineering implications: PV, lasers, luminescence, …
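As a quick check of the band-gap table on this slide, the sketch below computes the signed relative deviation of the DFT and GW gaps from experiment. The numbers are copied from the table; SrTiO3 is left out because its GW entry is a range, and the helper name `relative_error` is illustrative.

```python
# Band gaps in eV copied from the slide: (DFT, GW, experiment).
# SrTiO3 is omitted because its GW value is quoted as a range (3.4-3.8).
GAPS_EV = {
    "Diamond": (3.9, 5.6, 5.48),
    "Si": (0.5, 1.3, 1.17),
    "LiCl": (6.0, 9.1, 9.4),
}

def relative_error(value, reference):
    """Signed relative deviation of `value` from `reference`."""
    return (value - reference) / reference

for material, (dft, gw, expt) in GAPS_EV.items():
    print(f"{material:8s} DFT {relative_error(dft, expt):+.0%}   "
          f"GW {relative_error(gw, expt):+.0%}")
```

For all three materials the DFT gap is too small by a large margin, while the GW value lies much closer to experiment.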

slide-6
SLIDE 6

GW method

Challenges
  • Memory intensive
  • Much larger number of conduction bands: huge number of FFTs
  • Large and dense matrix multiplications
  • Unfavorable $O(N^4)$ scaling

Goal
  • Efficient and highly scalable GW software
  • $O(N^3)$ scaling method

slide-7
SLIDE 7

What is expensive in GW?

$$ P(r, r') = -2 \sum_{v}^{\text{filled}} \sum_{c}^{\text{empty}} \frac{\psi_v(r)\,\psi_c(r)\,\psi_v(r')\,\psi_c(r')}{E_c - E_v} $$

  • Lots of FFTs to get the $\psi_i(r)$ functions
  • However, $\epsilon$ can converge using a small r-grid

Operation counts:

  • $\sim (N_v + N_c)\, N_r \ln N_r$
  • $\sim 2 N_r^2 \ln N_r$
  • $\sim N_v N_c N_r^2 = O(N^4)$

* Kim et al., Phys. Rev. B 101, 035139 (2020)
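The quartic cost of the direct sum can be seen by simply counting inner-loop updates. The sketch below is a toy, pure-Python evaluation of the double sum over filled and empty states with real wavefunctions; it assumes the sign convention -2/(E_c - E_v), and the names `direct_polarizability` and `toy_system` and the random test data are hypothetical.

```python
import random

def direct_polarizability(psi_v, psi_c, e_v, e_c):
    """Direct double sum over filled (v) and empty (c) states.

    psi_v[v][r] and psi_c[c][r] are real wavefunctions on n_r grid points.
    Returns P and the number of inner-loop updates performed.
    """
    n_r = len(psi_v[0])
    p = [[0.0] * n_r for _ in range(n_r)]
    updates = 0
    for v, ev in enumerate(e_v):
        for c, ec in enumerate(e_c):
            pref = -2.0 / (ec - ev)
            pair = [psi_v[v][r] * psi_c[c][r] for r in range(n_r)]
            for r in range(n_r):
                for rp in range(n_r):
                    p[r][rp] += pref * pair[r] * pair[rp]
                    updates += 1
    return p, updates

def toy_system(n_r, n_occ, n_unocc, seed=0):
    """Random stand-in data; not a physical system."""
    rng = random.Random(seed)
    psi_v = [[rng.uniform(-1, 1) for _ in range(n_r)] for _ in range(n_occ)]
    psi_c = [[rng.uniform(-1, 1) for _ in range(n_r)] for _ in range(n_unocc)]
    e_v = [-1.0 - 0.1 * i for i in range(n_occ)]
    e_c = [1.0 + 0.1 * i for i in range(n_unocc)]
    return psi_v, psi_c, e_v, e_c

_, small = direct_polarizability(*toy_system(4, 2, 3))
_, large = direct_polarizability(*toy_system(8, 4, 6))
print(small, large, large // small)
```

Doubling the grid size and both state counts multiplies the update count by 2^4, which is the N_v N_c N_r^2 behaviour listed above.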

slide-8
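SLIDE 8

O(N^3) algorithm (CTSP) for P

CTSP: Complex time shredded propagator

$$ P_{r,r'} = -2 \sum_{v}^{N_{occ}} \sum_{c}^{N_{unocc}} \frac{\psi^*_{r,v}\,\psi_{r,c}\,\psi^*_{r',c}\,\psi_{r',v}}{E_c - E_v} $$

(1) Laplace transform:

$$ \frac{1}{E_c - E_v} = \int_0^\infty e^{-(E_c - E_v)\tau}\, d\tau = \int_0^\infty e^{-E_c \tau}\, e^{E_v \tau}\, d\tau = \int_0^\infty f(\tau)\, e^{-\tau}\, d\tau $$

(2) Gauss-Laguerre quadrature:

$$ \int_0^\infty f(\tau)\, e^{-\tau}\, d\tau \approx \sum_{k}^{N_q} \omega_k\, f(\tau_k) $$

Cost: $N_r^2 N_{unocc} N_{occ} \sim N^4$ becomes $N_r^2 N_q (N_{unocc} + N_{occ}) \sim N^3$

$$ X_{r,r'} = \sum_{i}^{N_a} \sum_{j}^{N_b} \frac{A_{r,r'}\, B_{r,r'}}{w + a_i - b_j} $$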
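The Laplace-transform and Gauss-Laguerre steps on this slide can be checked numerically. The sketch below is a pure-Python illustration, not the OpenAtom implementation: it builds the quadrature nodes by bisection on the Laguerre polynomials (using the fact that the roots of L_k interlace those of L_{k-1} and lie below 4k + 2) and approximates 1/(E_c - E_v) for gaps of order one. It assumes energies are expressed in a unit where the gap is close to 1; all function names are hypothetical.

```python
import math

def laguerre(n, x):
    """Laguerre polynomial L_n(x) from the three-term recurrence."""
    prev, cur = 1.0, 1.0 - x
    if n == 0:
        return prev
    for k in range(1, n):
        prev, cur = cur, ((2 * k + 1 - x) * cur - k * prev) / (k + 1)
    return cur

def bisect_root(f, lo, hi):
    """Plain bisection; assumes f changes sign exactly once on [lo, hi]."""
    f_lo = f(lo)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def gauss_laguerre(n):
    """Nodes and weights for the weight function exp(-tau) on [0, inf)."""
    roots = []
    for k in range(1, n + 1):
        # Roots of L_k are bracketed by 0, the roots of L_{k-1}, and 4k + 2.
        edges = [0.0] + roots + [4.0 * k + 2.0]
        roots = [bisect_root(lambda x: laguerre(k, x), a, b)
                 for a, b in zip(edges[:-1], edges[1:])]
    weights = [x / ((n + 1) ** 2 * laguerre(n + 1, x) ** 2) for x in roots]
    return roots, weights

def inverse_gap(gap, n_quad=10):
    """Approximate 1/gap as sum_k w_k f(tau_k), with f(tau) = exp(-(gap - 1) tau)."""
    nodes, weights = gauss_laguerre(n_quad)
    return sum(w * math.exp(-(gap - 1.0) * t) for t, w in zip(nodes, weights))

for gap in (1.0, 1.5, 2.0):
    print(gap, inverse_gap(gap), 1.0 / gap)
```

The quadrature is most accurate when the gap is close to the chosen energy unit, which is one reason to group states into energy windows with their own scale.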

slide-9
SLIDE 9

O(N^3) algorithm (CTSP) for P

$$ P_{r,r'} = -2 \sum_{v}^{N_{occ}} \sum_{c}^{N_{unocc}} \psi^*_{r,v}\,\psi_{r,c}\,\psi^*_{r',c}\,\psi_{r',v} \sum_{k}^{N_q} \omega_k\, f(\tau_k) $$

$$ = -2 \sum_{k}^{N_q} \omega_k \left[ \sum_{v}^{N_{occ}} \psi^*_{r,v}\,\psi_{r',v}\, e^{E_v \tau_k} \right] \left[ \sum_{c}^{N_{unocc}} \psi_{r,c}\,\psi^*_{r',c}\, e^{-E_c \tau_k} \right] $$

Cost: $N_q (N_{unocc} + N_{occ}) N_r^2 \sim N^3$

(3) Energy windows:

$$ P_{r,r'} = \sum_{l}^{N_{vw}} \sum_{m}^{N_{cw}} P^{lm}_{r,r'} $$

[Figure, panels a) and b): energy-window diagram]

slide-10
SLIDE 10

Steps for typical GW calculations

Most expensive
  • Real-space P
  • $O(N^3)$ method

Also expensive: $O(N^4)$

slide-11
SLIDE 11

O(N^3) method for self-energy

$$ \Sigma^{\pm}(\omega)^{dyn}_{r,r'} = \sum_{p,n} \frac{B^{p}_{r,r'}\, \psi_{rn}\, \psi^*_{r'n}}{\omega - E_n \pm \omega_p} $$

$B^{p}_{r,r'}$: residues
$\omega_p$: energies of the poles of $W(r)_{r,r'}$

  • $\omega - E_n \pm \omega_p = 0$ is possible: Gauss-Laguerre quadrature is not applicable
  • A new quadrature was needed and was developed: Hermite-Gauss-Laguerre quadrature

$$ \frac{1}{\omega - E_n \pm \omega_p} = \mathrm{Im} \int_0^\infty d\tau\, e^{-\tau - \tau^2/2}\, e^{i(\omega - E_n \pm \omega_p)\tau} $$

$$ X_{r,r'} = \sum_{i}^{N_a} \sum_{j}^{N_b} \frac{A_{r,r'}\, B_{r,r'}}{w + a_i - b_j} $$

slide-12
SLIDE 12

Results: Energy gap

  • Si crystal (16 atoms)
  • Number of bands: 399
  • $N_{vw} = 1$, $N_{cw} = 4$

  • MgO crystal (16 atoms)
  • Number of bands: 433
  • $N_{vw} = 1$, $N_{cw} = 4$

* Kim et al., Phys. Rev. B 101, 035139 (2020)

slide-13
SLIDE 13

Performance against other codes

  • Si crystal (16 atoms)
  • Number of bands: 399
  • $N_{pw} = 15$, $N_{nw} = 30$

* Kim et al., Comput. Phys. Commun. 244, 427-441 (2019)

http://charm.cs.illinois.edu/OpenAtom/

slide-14
SLIDE 14

OpenAtom GW Parallel Scaling

OpenAtom Team

slide-15
SLIDE 15

GW-BSE Parallelization

Phase                                         Serial     Parallel
1  Compute P in Rspace (N4 and N3 methods)    Complete   Complete
2  FFT P to GSpace                            Complete   Complete
3  Invert epsilon                             Complete   Complete
4  Plasmon pole                               Complete   Future Work
5  COHSEX Self-energy                         Complete   Complete
6  Dynamic Self-energy                        Complete   Future Work

slide-16
SLIDE 16

GW Phase-I P Matrix Computation (N4 and N3 method)

[Figure: Ψ vectors (L occupied, M unoccupied, each of length R) held in a 1D chare array; the R × R P matrix held as 2D tiles in a 2D chare array]

slide-17
SLIDE 17

Parallel Decomposition: Input state vectors

Duplicate occupied and unoccupied states on each node

[Figure: ψ state vectors replicated across nodes]

slide-18
SLIDE 18

Computation of P matrix using N^3 method

  • Outer loops are over windows of occupied and unoccupied states
  • Most expensive computation: the ρ and ρ' matrices

for l = 1:Nvw
    for m = 1:Ncw
        for j = 1:Nquad_lm
            calculate ρ_lmj
            calculate ρ'_lmj
            P[r,r'] += ρ_lmj[r,r'] x ρ'_lmj[r,r']
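The loop nest above can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not the OpenAtom code: it uses real wavefunctions, a fixed two-point Gauss-Laguerre rule for every window pair (the slide allows a pair-dependent Nquad_lm), and a hypothetical energy scale per window pair equal to the midpoint of the smallest and largest gap in that pair. The helper names `chunks`, `windowed_p`, `direct_p` and `max_error` are invented for this example.

```python
import math
import random

# Two-point Gauss-Laguerre rule (textbook values); real runs would use more points.
NODES = (2.0 - math.sqrt(2.0), 2.0 + math.sqrt(2.0))
WEIGHTS = ((2.0 + math.sqrt(2.0)) / 4.0, (2.0 - math.sqrt(2.0)) / 4.0)

def chunks(n_items, n_chunks):
    """Split range(n_items) into contiguous, nearly equal windows."""
    bounds = [round(i * n_items / n_chunks) for i in range(n_chunks + 1)]
    return [range(a, b) for a, b in zip(bounds[:-1], bounds[1:]) if b > a]

def windowed_p(psi_v, psi_c, e_v, e_c, n_vw, n_cw):
    """P accumulated window pair by window pair (energies sorted ascending)."""
    n_r = len(psi_v[0])
    p = [[0.0] * n_r for _ in range(n_r)]
    for vw in chunks(len(e_v), n_vw):                  # for l = 1:Nvw
        for cw in chunks(len(e_c), n_cw):              # for m = 1:Ncw
            gap_min = e_c[cw[0]] - e_v[vw[-1]]
            gap_max = e_c[cw[-1]] - e_v[vw[0]]
            scale = 0.5 * (gap_min + gap_max)          # assumed energy unit for this pair
            for tau, w in zip(NODES, WEIGHTS):         # for j = 1:Nquad_lm
                s = tau / scale
                pref = -2.0 * w * math.exp(tau) / scale
                for r in range(n_r):
                    for rp in range(n_r):
                        rho = sum(psi_v[v][r] * psi_v[v][rp] * math.exp(e_v[v] * s)
                                  for v in vw)
                        rho_p = sum(psi_c[c][r] * psi_c[c][rp] * math.exp(-e_c[c] * s)
                                    for c in cw)
                        p[r][rp] += pref * rho * rho_p
    return p

def direct_p(psi_v, psi_c, e_v, e_c):
    """Reference: the exact double sum over states."""
    n_r = len(psi_v[0])
    return [[sum(-2.0 * psi_v[v][r] * psi_c[c][r] * psi_v[v][rp] * psi_c[c][rp] / (ec - ev)
                 for v, ev in enumerate(e_v) for c, ec in enumerate(e_c))
             for rp in range(n_r)] for r in range(n_r)]

def max_error(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

rng = random.Random(1)
E_V, E_C = [-1.3, -1.0], [1.0, 1.4, 1.8]
PSI_V = [[rng.uniform(-1, 1) for _ in range(4)] for _ in E_V]
PSI_C = [[rng.uniform(-1, 1) for _ in range(4)] for _ in E_C]
EXACT = direct_p(PSI_V, PSI_C, E_V, E_C)
for n_vw, n_cw in ((1, 1), (1, 3), (2, 3)):
    print(n_vw, n_cw, max_error(windowed_p(PSI_V, PSI_C, E_V, E_C, n_vw, n_cw), EXACT))
```

Narrower windows mean a smaller spread of gaps per pair, so fewer quadrature points are needed for a given accuracy; with one state per window the scaled rule is exact.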

slide-19
SLIDE 19

Computation of ρ matrix (using occupied states)

  • State vectors are represented with ψ
      ○ Number of occupied states = L, each state has N elements
      ○ All occupied states can be represented as a matrix ψV[1:L][1:N]

ρ -> Add elements of outer product of ψV[1:L]

for l=1:L
    for r=1:N
        for r'=1:N
            ρ[r,r'] += ψV[l]T[r] x ψV[l][r']

ρ -> Same as ZGEMM of all ψV and all ψVT

ZGEMM(ψVT[1:N][1:L], ψV[1:L][1:N])   (i.e., matrix multiply)

for r=1:N
    for r'=1:N
        for l=1:L
            ρ[r,r'] += ψVT[r][l] x ψV[l][r']

slide-20
SLIDE 20

Computation of ρ' matrix (using unoccupied states)

  • Number of unoccupied states = M, each state has N elements
  • All unoccupied states can be represented as a matrix ψC[1:M][1:N]

ρ' -> Add elements of outer product of ψC[1:M]

for m=1:M
    for r=1:N
        for r'=1:N
            ρ'[r,r'] += ψC[m]T[r] x ψC[m][r']

ρ' -> Same as ZGEMM of all ψC and all ψCT

ZGEMM(ψCT[1:N][1:M], ψC[1:M][1:N])   (i.e., matrix multiply)

for r=1:N
    for r'=1:N
        for m=1:M
            ρ'[r,r'] += ψCT[r][m] x ψC[m][r']

slide-21
SLIDE 21

Computation of P-matrix (tiled) (N^3)

[Figure: data flow for the tiled computation]
  • Occupied states ψV(1:L), stored as L × N, give the N × N ρ matrix (ZGEMM)
  • Unoccupied states ψC(1:M), stored as M × N, give the N × N ρ' matrix (ZGEMM)
  • P matrix (N × N): element-wise multiply of the ρ and ρ' matrices
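The tiled data flow can be mimicked serially to show that each tile of P needs only the matching row and column slices of the state vectors. The sketch below is a plain-Python stand-in for the 2D tile decomposition, with real entries and no quadrature weights; it makes no use of Charm++, and `p_tile`, `p_tiled` and the tile size are hypothetical.

```python
import random

def p_tile(psi_v, psi_c, rows, cols):
    """One tile of P, computed from the `rows` and `cols` slices of each state."""
    tile = []
    for r in rows:
        out_row = []
        for rp in cols:
            rho = sum(s[r] * s[rp] for s in psi_v)    # tile of rho (occupied states)
            rho_p = sum(s[r] * s[rp] for s in psi_c)  # tile of rho' (unoccupied states)
            out_row.append(rho * rho_p)               # element-wise multiply
        tile.append(out_row)
    return tile

def p_tiled(psi_v, psi_c, tile_size):
    """Assemble the full N x N matrix tile by tile; tiles are independent."""
    n = len(psi_v[0])
    p = [[0.0] * n for _ in range(n)]
    starts = range(0, n, tile_size)
    for r0 in starts:
        for c0 in starts:
            rows = range(r0, min(r0 + tile_size, n))
            cols = range(c0, min(c0 + tile_size, n))
            for i, row in zip(rows, p_tile(psi_v, psi_c, rows, cols)):
                p[i][c0:c0 + len(row)] = row
    return p

rng = random.Random(3)
PSI_V = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(2)]  # L = 2, N = 5
PSI_C = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(3)]  # M = 3, N = 5
print(p_tiled(PSI_V, PSI_C, 2) == p_tiled(PSI_V, PSI_C, 5))
```

Since no tile reads another tile's output, the tiles can be distributed across workers, which is the property the 2D decomposition relies on.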

slide-22
SLIDE 22

Performance of N^3 method

  • The N^3 method is an order of magnitude faster than the N^4 method for the 108-atom Si dataset
      ○ 20k × 20k output matrix size
  • Scales well on Intel KNL and Skylake nodes
  • Future work: scaling results for larger datasets

[Figure: two panels of execution time versus node count (8, 16, 32, 64) for the N^4 and N^3 methods, one at 48 cores per node and one at 128 cores per node, captioned "Intel KNL nodes (Stampede2)" and "Intel Skylake nodes (Stampede2)"]

slide-23
SLIDE 23

Questions?