OpenAtom: First Principles GW method for electronic excitation (PowerPoint presentation)



SLIDE 1

OpenAtom: First Principles GW method for electronic excitation

Minjung Kim, Subhasish Mandal, and Sohrab Ismail-Beigi (Yale University); Eric Mikida, Kavitha Chandrasekar, Eric Bohm, Nikhil Jain, and Laxmikant Kale (University of Illinois at Urbana-Champaign); Qi Li and Glenn Martyna (IBM T.J. Watson Research Center)

SLIDE 2

Energy functional E[n] of the electron density n(r). Minimizing over n(r) gives the exact:

  • Ground-state energy E0
  • Ground-state density n(r)

§ Minimum condition is equivalent to the Kohn-Sham equations § LDA/GGA for Exc: good geometries and total energies § Bad band gaps and excitations

Hohenberg & Kohn, Phys. Rev. (1964); Kohn and Sham, Phys. Rev. (1965).

Density Functional Theory (DFT)

SLIDE 3

Energy gaps (eV)

  Material   LDA   Expt. [1]
  Diamond    3.9   5.48
  Si         0.5   1.17
  LiCl       6.0   9.4
  SrTiO3     2.0   3.25

[1] Landolt-Börnstein, vol. III; Baldini & Bosacchi, Phys. Stat. Solidi (1970).

[Figure: solar spectrum]

DFT: problems with excitations

SLIDE 4

Interfacial systems: § Electrons can transfer across the interface § Transfer depends on energy-level alignment across the interface § DFT has errors in band energies § Is any of it real?


DFT: problems with energy alignment

SLIDE 5

G(r,t; r',0): propagation of an added electron or hole from (r',0) to (r,t)

Dyson equation: G⁻¹ = G₀⁻¹ − Σ. DFT corresponds to approximating the self-energy Σ by the exchange-correlation potential Vxc.

One particle Green’s function

SLIDE 6

Quasiparticle gaps (eV)

  Material   LDA   GW        Expt.
  Diamond    3.9   5.6*      5.48
  Si         0.5   1.3*      1.17
  LiCl       6.0   9.1*      9.4
  SrTiO3     2.0   3.4-3.8   3.25

* Hybertsen & Louie, Phys. Rev. B (1986)

Band structure of Cu

Strocov et al., PRL/PRB (1998/2001)

Green’s function successes

SLIDE 7

Zinc oxide nanowire + P3HT polymer § Band alignment for this potential photovoltaic system? § 100s of atoms per unit cell § Not routinely possible with current software

What is a big system for GW?

SLIDE 8

But in practice the GW step is the killer

GW is expensive

Scaling with number of atoms N:
  • DFT: N³
  • GW: N⁴ (gives better bands)
  • BSE: N⁶ (gives optical excitations)

For a nanoscale system with 50-75 atoms (GaN):
  • DFT: 1 cpu-hour
  • GW: 91 cpu-hours
  • BSE: 2 cpu-hours

∴ Focus on GW
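The scaling exponents above translate into a quick back-of-envelope cost projection; a minimal sketch (the baseline cpu-hour figures are the ones quoted on this slide, and the 10× system size is purely illustrative):

```python
# Project cpu-hour costs when the system grows, using the scaling
# exponents above: DFT ~ N^3, GW ~ N^4, BSE ~ N^6.
def projected_cost(base_cpu_hours, exponent, size_factor):
    """Cost after multiplying the number of atoms by size_factor."""
    return base_cpu_hours * size_factor ** exponent

# Baseline: the ~50-75 atom GaN system quoted above.
baseline = {"DFT": (1.0, 3), "GW": (91.0, 4), "BSE": (2.0, 6)}

for method, (cost, p) in baseline.items():
    # At 10x the atoms, GW's 91 cpu-hours become ~9.1e5 cpu-hours.
    print(method, projected_cost(cost, p, 10.0))
```

The N⁴ term dominates long before BSE's N⁶ matters at these sizes, which is why the deck focuses on GW.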

SLIDE 9

Stage 1: Run DFT calculation on the structure → output: eigenvalues E_i and orbitals ψ_i(r)
Stage 2.1: Compute the polarizability matrix P(r,r') = δn(r)/δV(r')
Stage 2.2: Double FFT of rows and columns → P(G,G')
Stage 3: Compute and invert the dielectric screening function: ε = I − √Vcoul · P · √Vcoul → ε⁻¹
Stage 4: "Plasmon-pole" method → dynamic screening ε⁻¹(ω)
Stage 5: Put together E_i, ψ_i(r), and ε⁻¹(ω) → self-energy Σ(ω)

Steps for typical G0W0 calculation
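Stages 2-3 can be sketched in a few lines of dense linear algebra. In this toy version the small symmetric P and the diagonal Coulomb interaction vc are random placeholders, not physical inputs; only the structure (symmetrized ε, then inversion) is meaningful:

```python
import numpy as np

# Toy version of stage 3: form the symmetrized dielectric matrix
# eps = I - sqrt(Vcoul) * P * sqrt(Vcoul) and invert it.
rng = np.random.default_rng(0)
n = 6                                      # toy basis size (stands in for N_G)
P = rng.standard_normal((n, n))
P = -0.5 * (P @ P.T) / n                   # symmetric, negative-definite placeholder
vc = 1.0 / (1.0 + np.arange(n))            # placeholder diagonal Coulomb v(G)

sqrt_vc = np.sqrt(vc)
eps = np.eye(n) - sqrt_vc[:, None] * P * sqrt_vc[None, :]   # stage 3
eps_inv = np.linalg.inv(eps)               # -> inverse dielectric matrix

# Sanity check: eps is well conditioned (I minus a negative-definite term)
assert np.allclose(eps @ eps_inv, np.eye(n))
```

The symmetrized form √Vc·P·√Vc keeps ε Hermitian, which is why the square root of the Coulomb interaction appears on both sides.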

SLIDE 10

What is so expensive in GW?

One key element: the response of electrons to a perturbation. P(r,r') = response of the electron density n(r) at position r to a change of the potential V(r') at position r'.

SLIDE 11

One key element: the response of electrons to a perturbation. The standard perturbation-theory (sum-over-states) expression is

  P(r,r') = −2 Σ_v Σ_c ψ_v*(r) ψ_c(r) ψ_c*(r') ψ_v(r') / (E_c − E_v)

Problems:

  • 1. Must generate "all" empty states (sum over c)
  • 2. Lots of FFTs to get the ψi(r) functions
  • 3. Enormous outer product to form P
  • 4. Dense r grid: P huge in memory

What is so expensive in GW?
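The expense can be seen in a toy implementation of the sum-over-states formula; random orbitals and eigenvalues stand in for DFT output, and the double loop of (v, c) outer products is exactly the O(N⁴) bottleneck:

```python
import numpy as np

# P(r,r') = -2 sum_{v,c} psi_v*(r) psi_c(r) psi_c*(r') psi_v(r') / (E_c - E_v)
# N_v * N_c outer products on an N_r grid -> O(N^4) work, N_r^2 memory.
rng = np.random.default_rng(1)
nr, nv, nc = 50, 4, 12                     # toy r-grid and band counts
psi_v = rng.standard_normal((nv, nr))      # occupied orbitals (placeholders)
psi_c = rng.standard_normal((nc, nr))      # empty orbitals (placeholders)
E_v = -1.0 - rng.random(nv)                # occupied eigenvalues
E_c = 1.0 + rng.random(nc)                 # empty eigenvalues

P = np.zeros((nr, nr))
for v in range(nv):
    for c in range(nc):
        f = psi_v[v] * psi_c[c]            # pair product f_vc(r)
        P -= 2.0 * np.outer(f, f) / (E_c[c] - E_v[v])
```

Every one of the N_v·N_c pair products feeds an N_r×N_r rank-1 update, which is the "enormous outer product" of the bullet list.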

SLIDE 12

Basic computation:

  f_lm = ψ_l × ψ_m     for all occupied l, unoccupied m
  P += f_lm f_lm†      for all f

Parallel decomposition:

  • Ψ vectors (L occupied, M unoccupied states on the r grid): 1D chare array
  • P matrix: 2D chare array of 2D tiles

Computing P in Charm++
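The decomposition above can be mimicked serially, with array slices standing in for chares: each "broadcast" unoccupied state updates every 2D tile of P locally, which is the pattern the next slides animate step by step. Sizes and data are illustrative:

```python
import numpy as np

# Serial mimic of the Charm++ decomposition: occupied states are held by
# every "node", each unoccupied state is broadcast in turn, and each 2D
# tile of P updates itself locally from its slice of the f vectors.
rng = np.random.default_rng(2)
nr, n_occ, n_unocc, tile = 40, 3, 8, 10   # toy sizes; tile divides nr
psi_occ = rng.standard_normal((n_occ, nr))
psi_unocc = rng.standard_normal((n_unocc, nr))

P = np.zeros((nr, nr))
for m in range(n_unocc):                  # step 2: broadcast one unoccupied state
    F = psi_occ * psi_unocc[m]            # f_lm(r) for every occupied l at once
    for i in range(0, nr, tile):          # step 3: each tile updates locally
        for j in range(0, nr, tile):
            P[i:i+tile, j:j+tile] += F[:, i:i+tile].T @ F[:, j:j+tile]

# Same result as one global outer-product accumulation P = sum f f^T
F_all = np.einsum('lr,mr->lmr', psi_occ, psi_unocc).reshape(-1, nr)
assert np.allclose(P, F_all.T @ F_all)
```

Because each tile only ever reads the slice of F it owns, the update parallelizes without any communication beyond the broadcast of ψ_m.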

SLIDE 13

1. Duplicate occupied states on each node


Computing P in Charm++

SLIDE 14


Computing P in Charm++

1. Duplicate occupied states on each node
2. Broadcast an unoccupied state to compute f vectors

SLIDE 15


Computing P in Charm++

1. Duplicate occupied states on each node
2. Broadcast an unoccupied state to compute f vectors
3. Locally update each matrix tile

SLIDE 16

Computing P in Charm++

1. Duplicate occupied states on each node
2. Broadcast an unoccupied state to compute f vectors
3. Locally update each matrix tile
4. Repeat step 2 for the next unoccupied state

SLIDE 17

§ 108-atom bulk Si § 216 occupied states § 1832 unoccupied states § 1 k-point § 32 processes per node § FFT grids (same accuracy): OA 42×42×22, BGW 111×55×55

Supercomputer: Mira (ANL): IBM Blue Gene/Q

Parallel performance: P calculation

SLIDE 18

§ 108-atom bulk Si § 216 occupied states § 1832 unoccupied states § 1 k-point § 32 processes per node § FFT grids (same accuracy): OA 42×42×22, BGW 111×55×55

[Figure: scaling on Blue Waters; time (sec) vs. number of nodes (1-1000); OpenAtom vs. BerkeleyGW 1.2; 32 cores per node]

Supercomputer: Blue Waters (NCSA): Cray XE6

Parallel performance: P calculation

SLIDE 19

* Bruneval and Gonze, PRB 78 (2008); Berger, Reining, Sottile, PRB 82 (2010) * Umari, Stenuit, Baroni, PRB 81 (2010) * Giustino, Cohen, Louie, PRB 81 (2010) * Wilson, Gygi, Galli, PRB 78 (2008); Govoni, Galli, J. Chem. Th. Comp. 11 (2015) * Gao, Xia, Gao, Zhang, Sci. Rep. 6 (2016) † Foerster, Koval, Sanchez-Portal, JCP 135 (2011) † Liu, Kaltak, Klimes, Kresse, PRB 94 (2016)

§ O(N⁴) = N_r² × N_c × N_v

§ Sum-over-states (i.e., the sum over unoccupied bands c) is not to blame: removing unoccupied states is still O(N⁴), just with a lower prefactor* § Working in r-space can reduce this to O(N³) [see also †]

Reducing the scaling: quartic to cubic

SLIDE 20

Quasi-philosophical: all bases are equally good in quantum mechanics, so why is r-space special? Practical: P is separable in r-space. (An observable is diagonal in its best basis.)

Using the Laplace identity

  1/(E_c − E_v) = ∫₀^∞ dτ e^{−(E_c − E_v)τ}

the polarizability factorizes:

  P(s,s') = −2 ∫₀^∞ dτ [ Σ_v ψ_v*(s) ψ_v(s') e^{E_v τ} ] [ Σ_c ψ_c(s) ψ_c*(s') e^{−E_c τ} ]   ← separable

Gauss-Laguerre quadrature,

  ∫₀^∞ dτ e^{−τ} g(τ) ≈ Σ_{l=1}^{N_GL} w_l g(τ_l),

then gives

  P(s,s') = −2 Σ_{l=1}^{N_GL} w_l [ Σ_v ψ_v*(s) ψ_v(s') e^{E_v τ_l} ] [ Σ_c ψ_c(s) ψ_c*(s') e^{−E_c τ_l} ]

Cost: N_r² N_GL (N_v + N_c) ∝ N³, since N_GL is intensive (independent of system size).
What’s special about r-space?
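The Laplace identity plus Gauss-Laguerre quadrature can be checked numerically; `numpy.polynomial.laguerre.laggauss` supplies the nodes and weights (toy energies and grid size, chosen for illustration):

```python
import numpy as np
from numpy.polynomial.laguerre import laggauss

# Gauss-Laguerre check of 1/(E_c - E_v) = integral_0^inf e^{-(E_c-E_v) tau} d tau.
# Factor out the weight: e^{-(Ec-Ev)tau} = e^{-tau} * g(tau) with
# g(tau) = e^{-(Ec-Ev-1)tau}, so the rule for weight e^{-tau} applies.
tau, w = laggauss(12)                      # 12 nodes and weights

def inv_gap(E_c, E_v):
    """Quadrature approximation to 1/(E_c - E_v), valid for E_c > E_v."""
    return float(np.sum(w * np.exp(-(E_c - E_v - 1.0) * tau)))

exact = 1.0 / (2.5 - 0.5)
approx = inv_gap(2.5, 0.5)
assert abs(approx - exact) < 1e-3          # accurate for gaps of order 1
```

A dozen nodes already reproduce 1/(E_c − E_v) to high accuracy when the transition energies are of order one, which is why N_GL stays small and intensive.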

SLIDE 21

[Figure: required N_GL (10-50) vs. Ebw/Eg (100-500); energy-axis diagram of windows {Ev}1 {Ev}2 {Ec}1 {Ec}2 spanning Ev,min ... Ev,max and Ec,min ... Ec,max]

  • Example: 2-by-2 windows

  P(s,s') = Σ_{i=1}^{Nwv} Σ_{j=1}^{Nwc} P^{ij}(s,s')

Nwv: # of windows for Ev; Nwc: # of windows for Ec

§ Save computation: small N_GL for each window pair § Especially for materials with small band gaps

For 2×2 windows: P = P¹¹ + P¹² + P²¹ + P²²

§ N_GL depends on the ratio Ebw/Eg, where Ebw = Ec,max − Ev,min
§ Largest error occurs at E_c − E_v = E_g or E_bw

Windowed cubic Laplace method
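Why windowing reduces N_GL can be demonstrated directly: a Gauss-Laguerre rule scaled to one window's energy range is accurate there, while a single global rule must stretch across the whole Ebw/Eg ratio. The numbers below are illustrative:

```python
import numpy as np
from numpy.polynomial.laguerre import laggauss

def quad_error(y, n_gl, scale=1.0):
    """Error of approximating 1/y with an n_gl-point Gauss-Laguerre rule
    whose tau grid is rescaled to target transition energies y ~ scale."""
    tau, w = laggauss(n_gl)
    approx = np.sum(w * np.exp(-(y / scale - 1.0) * tau)) / scale
    return abs(approx - 1.0 / y)

# One global 8-point rule, scaled for y ~ 1 (the gap), is poor at y = 50:
err_global = quad_error(50.0, 8, scale=1.0)
# A rule scaled to the high-energy window is essentially exact there:
err_window = quad_error(50.0, 8, scale=50.0)
assert err_window < 1e-10 < err_global
```

Restricting each window pair to a narrow range of E_c − E_v is exactly what keeps N_GL small per pair, with the worst case at the window edges E_g and E_bw as noted above.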

SLIDE 22

Estimate the computational costs

Example: 2x2 window

0.5 9 1 9 1.5 ×104

Celab

2

Ecratio

1

Evratio

2.5 1

Real computational costs

9 50 9 100

Csimple

150

Ecratio

1

Evratio

200 1

Estimated computational costs

𝐹),%^_`a = 𝐹)

∗ − 𝐹),R`b

𝐹),R^? − 𝐹)

𝐹(,%^_`a = 𝐹(

∗ − 𝐹(,R`b

𝐹(,R^? − 𝐹(

Computation cost can be estimated with Ebw and Eg:

𝐷 ∝ 6 6 𝐹\]

QR

𝐹[

QR

  • 𝐹(Q

R^? − 𝐹(Q R`b

𝐹(

R^? − 𝐹( R`b 𝑂( − 𝐹)R R^? − 𝐹)R R`b

𝐹)

R^? − 𝐹) R`b 𝑂) G>S R G@S Q
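The simple cost estimate above can be written as a small function; assuming states are spread uniformly in energy, each window pair contributes its N_GL times the energy-span fractions of valence and conduction states. The window boundaries and per-pair N_GL values below are made up for illustration:

```python
# Estimate the windowed-Laplace cost from energy spans alone, assuming a
# uniform density of states: each window pair (i, j) costs
# N_GL^{ij} * (frac_v^i * N_v + frac_c^j * N_c).
def estimated_cost(v_windows, c_windows, n_gl, n_v, n_c):
    """v_windows/c_windows: lists of (E_min, E_max); n_gl[i][j]: grid size."""
    v_span = v_windows[-1][1] - v_windows[0][0]
    c_span = c_windows[-1][1] - c_windows[0][0]
    cost = 0.0
    for i, (v0, v1) in enumerate(v_windows):
        for j, (c0, c1) in enumerate(c_windows):
            frac_v = (v1 - v0) / v_span     # share of valence states
            frac_c = (c1 - c0) / c_span     # share of conduction states
            cost += n_gl[i][j] * (frac_v * n_v + frac_c * n_c)
    return cost

# 2x2 example (illustrative energies in eV and N_GL values): splitting
# into windows with smaller per-pair N_GL beats one big window.
one_window = estimated_cost([(-5, 0)], [(1, 20)], [[30]], 200, 1800)
two_by_two = estimated_cost([(-5, -2), (-2, 0)], [(1, 8), (8, 20)],
                            [[8, 6], [10, 8]], 200, 1800)
assert two_by_two < one_window
```

Scanning the split ratios with such an estimator is what produces the C_simple surface, which can then be compared with the measured C_elab.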

SLIDE 23

Compared to the O(N⁴) method, for bigger systems the cost ratio scales like N_GL (N_v + N_c) / (N_v N_c).

§ Si crystal (16 atoms) § Number of bands: 399 § Nwc = 1, Nwv = 4
§ MgO crystal (16 atoms) § Number of bands: 433 § Nwc = 1, Nwv = 4

Windowed Laplace: example

SLIDE 24
  • 2-atom Si, 8 k-points
  • Yambo N⁴ GW software
  • BG* acceleration

Correct practical comparison:

  • Our N3 method vs. available N4 method with acceleration
  • Crossover is at very few atoms: N3 method already competitive for small systems

* Bruneval & Gonze, PRB 78 (2008)

Do I care in practice?

SLIDE 25

Windowed Laplace method for self-energy

Dynamic GW self-energy:

  Σ(ω)_{n,n'} = Σ_{m,P} C^P_{n,n'} ψ_{n m} ψ*_{n' m} / (ω − E_m + sgn(μ − E_m) ω_P)

  C^P_{n,n'}: residues
  ω_P: energies of the poles of ε⁻¹(ω)

Every term has the form G(y) = 1/y with y = ω − E_m ± ω_P, and y can be either > 0 or < 0.

Windowing the sum:

  Σ(ω) = Σ_{i=1}^{Nw_Ω} Σ_{j=1}^{Nw_E} Σ(ω)^{ij}

where window pair (i,j) collects the terms with

  E_j^{min} ≤ ω − E_m < E_j^{max}   and   Ω_i^{min} ≤ ±ω_P < Ω_i^{max}

Gauss-Laguerre quadrature is not appropriate here (y changes sign across the windows).

SLIDE 26

New quadrature for overlapping windows

New quadrature: grids generated for a weight function x(w), used to represent G(y) = 1/y as an integral over w:

  x(w) = e^{−w}   (standard)        x(w) = e^{−w − w³/3}   (new)

Size of quadrature grid:

  % error   n_q (e^{−w})   n_q (e^{−w − w³/3})
  5         6              1
  1         24             1
  0.1       124            5
  0.01      547            15
  0.001     2216           36

SLIDE 27

Results - G0W0 gap

[Figure: G0W0 gap of Si (eV, 1.35-1.65) vs. ratio of computation to the N⁴ method (0.1-0.5); the Laplace+windowing results converge to the N⁴ value]

§ Si crystal (16 atoms) § Number of bands: 399 § Quadrature sizes: 15 and 30

SLIDE 28

Phase                       Serial     Parallel
1  Compute P in r-space     Complete   Complete
2  FFT P to G-space         Complete   Complete
3  Invert epsilon           Complete   Complete
4  Plasmon pole             Complete   In progress
5  COHSEX self-energy       Complete   Complete
6  Dynamic self-energy      Complete   In progress
7  Coulomb truncation       Future     Future

Aim to release parallel COHSEX version late spring 2018

Where we are with OpenAtom GW

SLIDE 29

§ OpenAtom framework § r-space has many advantages for GW § Charm++ runtime library

  • Reduces parallelization/porting/refactoring headaches
  • Good performance, very good scaling

§ r-space separability leads to N³-scaling GW

  • Straightforward change to sum-over-states methods
  • Crossover with N⁴ at Natoms ~ 5-10

Summary

SLIDE 30

Backup slides

SLIDE 31

G-space:
§ Directly compute P in G-space
§ Many FFTs: N_v N_c
§ Big multiply: N_v N_c N_G² = O(N⁴)

r-space:
§ FFT N_r rows, then FFT N_r columns of P (N_r: # of r-grid points; N_r ≈ 4 N_c)
§ N_v + N_c + 8 N_c FFTs needed
§ Big multiply: N_v N_c N_r² = O(N⁴)

N_v: # occupied states; N_c: # unoccupied states; N_G: # of G vectors

G vs. r-space P calculation
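The FFT-count bookkeeping of this comparison can be written out directly, using the 108-atom Si numbers from the performance slides and the slide's N_r ≈ 4 N_c rule of thumb:

```python
# FFT counts for forming P(G,G') two ways (bookkeeping from this slide).
def g_space_ffts(n_v, n_c):
    # one FFT per (occupied, unoccupied) pair
    return n_v * n_c

def r_space_ffts(n_v, n_c, n_r):
    # one FFT per state, plus N_r row + N_r column FFTs of the finished P
    # (2*N_r ~ 8*N_c when N_r ~ 4*N_c, matching the slide's 8 N_c term)
    return n_v + n_c + 2 * n_r

n_v, n_c = 216, 1832                # 108-atom Si from the performance slides
n_r = 4 * n_c                       # rule of thumb: N_r ~ 4 N_c
print(g_space_ffts(n_v, n_c))       # 395712 pairwise FFTs
print(r_space_ffts(n_v, n_c, n_r))  # 16704 FFTs, roughly 24x fewer
```

Both routes keep the O(N⁴) multiply, so the win of r-space here is in the FFT count; the N³ scaling comes only with the Laplace separability of the previous slides.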

SLIDE 32

Consider two key steps: (a) many FFTs → ψ_i(r); (b) outer product → P.

Typical MPI/OpenMP: working explicitly with the number of processors:
1. Divide the ψ_i(r) among procs
2. Do a pile of FFTs on each proc
3. Divide (r,r') among procs (e.g. ScaLAPACK)
4. Do the outer product

“Physicist” programming

Problems

  • N_i > N_proc and N_i < N_proc need different parallelizations: explicitly different coding
  • A typical programmer does 1 & 2, then 3 & 4; hard to interleave
  • Machines/fashions change: need to recode the parallelization (GPUs, SMPs, few cores, multicores, etc.)

SLIDE 33

[Figure: compute time per operation (ns, 1-5) vs. number of atoms (2-16) for the N⁴ method, L+W 1%, and L+W 10%; Si 16-atom calculation]

§ Number of computations:

  N⁴ method: N_c N_v N_r²
  L+W:       Σ_{ij} N_GL^{ij} (N_v^i + N_c^j) N_r²

§ Comparable prefactor § Speedup already for Natoms ≳ 10

Where is crossover in scaling?