OpenAtom: First Principles GW Method for Electronic Excitation
Minjung Kim, Subhasish Mandal, and Sohrab Ismail-Beigi (Yale University)
Eric Mikida, Kavitha Chandrasekar, Eric Bohm, Nikhil Jain, and Laxmikant Kale (University of Illinois at Urbana-Champaign)
Energy functional E[n] of the electron density n(r). Minimizing over n(r) gives the exact:
- Ground-state energy E0
- Ground-state density n(r)
The minimum condition is equivalent to the Kohn-Sham equations.
§ LDA/GGA for Exc: good geometries and total energies
§ Bad band gaps and excitation energies
Hohenberg & Kohn, Phys. Rev. (1964); Kohn and Sham, Phys. Rev. (1965).
Density Functional Theory (DFT)
Energy gaps (eV):

Material   LDA   Expt. [1]
Diamond    3.9   5.48
Si         0.5   1.17
LiCl       6.0   9.4
SrTiO3     2.0   3.25
[1] Landolt-Börnstein, vol. III; Baldini & Bosacchi, Phys. Stat. Solidi (1970).
(Figure: solar spectrum)
DFT: problems with excitations
Interfacial systems:
§ Electrons can transfer across the interface
§ Transfer depends on energy level alignment across the interface
§ DFT has errors in band energies
§ Is any of it real?
DFT: problems with energy alignment
One-particle Green's function $G(r,t;r',0)$: the amplitude for an electron added at $(r',0)$ to propagate to $(r,t)$.

Dyson equation: $\left[-\tfrac{1}{2}\nabla^2 + V_{ion} + V_H\right]\psi(r) + \int dr'\,\Sigma(r,r';E)\,\psi(r') = E\,\psi(r)$

DFT: $\left[-\tfrac{1}{2}\nabla^2 + V_{ion} + V_H + V_{xc}\right]\psi(r) = E\,\psi(r)$

One particle Green's function
Quasiparticle gaps (eV):

Material   LDA   GW        Expt.
Diamond    3.9   5.6*      5.48
Si         0.5   1.3*      1.17
LiCl       6.0   9.1*      9.4
SrTiO3     2.0   3.4-3.8   3.25
* Hybertsen & Louie, Phys. Rev. B (1986)
Band structure of Cu
Strocov et al., PRL/PRB (1998/2001)
Green’s function successes
Zinc oxide nanowire + P3HT polymer:
§ Band alignment for this potential photovoltaic system?
§ 100s of atoms per unit cell
§ Not routinely possible (with current software)
What is a big system for GW?
But in practice the GW step is the killer
GW is expensive
Scaling with the number of atoms N:
- DFT: N³
- GW: N⁴ (gives better bands)
- BSE: N⁶ (gives optical excitations)

For a nanoscale system with 50-75 atoms (GaN):
- DFT: 1 cpu-hour
- GW: 91 cpu-hours
- BSE: 2 cpu-hours
∴ Focus on GW
Stage 1: run a DFT calculation on the structure → output: $\epsilon_i$ and $\psi_i(r)$
Stage 2.1: compute the polarizability matrix $P(r,r') = \partial n(r)/\partial V(r')$
Stage 2.2: double FFT over rows and columns → $P(G,G')$
Stage 3: compute and invert the dielectric screening function: $\epsilon = I - \sqrt{V_{coul}}\, P\, \sqrt{V_{coul}} \;\rightarrow\; \epsilon^{-1}$
Stage 4: "plasmon-pole" method → dynamic screening $\epsilon^{-1}(\omega)$
Stage 5: put together $\epsilon_i$, $\psi_i(r)$, and $\epsilon^{-1}(\omega)$ → self-energy $\Sigma(E)$
Steps for typical G0W0 calculation
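As an illustration of Stages 2.2-3, here is a minimal numpy sketch (toy sizes and a made-up negative-definite P, not OpenAtom's data) of forming and inverting the symmetrized dielectric matrix:

```python
import numpy as np

nG = 64                                    # number of G vectors (toy size)
rng = np.random.default_rng(0)
P = rng.standard_normal((nG, nG))
P = -(P @ P.T)                             # toy symmetric negative-definite "P(G,G')"
v_coul = 1.0 / (1.0 + np.arange(nG))       # toy stand-in for the 4*pi/|G|^2 diagonal

sqrt_v = np.sqrt(v_coul)
# Stage 3: epsilon = I - sqrt(V_coul) * P * sqrt(V_coul), then invert it
eps = np.eye(nG) - sqrt_v[:, None] * P * sqrt_v[None, :]
eps_inv = np.linalg.inv(eps)               # feeds the plasmon-pole model (Stage 4)
```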
One key element: the response of the electrons to a perturbation

$P(r,r')$ = response of the electron density $n(r)$ at position $r$ to a change of the potential $V(r')$ at position $r'$

Standard perturbation theory expression (sum over occupied $v$ and empty $c$ states):
$P(r,r') = 2\sum_{v,c} \frac{\psi_v^*(r)\psi_c(r)\,\psi_c^*(r')\psi_v(r')}{\epsilon_v - \epsilon_c}$

Problems:
1. Must generate "all" empty states (sum over $c$)
2. Lots of FFTs to get the $\psi_i(r)$ functions
3. Enormous outer product to form $P$
4. Dense $r$ grid: $P$ is huge in memory
What is so expensive in GW?
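To see where the expense comes from, here is a minimal numpy sketch of the sum-over-states formula above (real toy wavefunctions and energies; shapes are assumptions):

```python
import numpy as np

Nr, Nv, Nc = 200, 8, 40                    # toy r-grid and state counts
rng = np.random.default_rng(1)
psi_v = rng.standard_normal((Nv, Nr))      # occupied states psi_v(r)
psi_c = rng.standard_normal((Nc, Nr))      # empty states psi_c(r)
eps_v = -1.0 - rng.random(Nv)              # occupied energies
eps_c = +1.0 + rng.random(Nc)              # empty energies

P = np.zeros((Nr, Nr))
for v in range(Nv):
    for c in range(Nc):
        f = psi_v[v] * psi_c[c]            # f_vc(r) = psi_v(r) psi_c(r)
        P += 2.0 * np.outer(f, f) / (eps_v[v] - eps_c[c])  # enormous outer product
# Nv*Nc outer products, each costing Nr^2 work -> the O(N^4) bottleneck
```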
Basic computation:
f_lm = ψ_l × ψ_m* (for all l, m)
P += f_lm f_lm† (accumulated over all f)

Parallel decomposition:
- Ψ vectors: 1D chare array (L occupied and M unoccupied states)
- P matrix: 2D chare array of 2D tiles

Computing P in Charm++
1. Duplicate the occupied states on each node
2. Broadcast an unoccupied state to compute the f vectors
3. Locally update each matrix tile of P
4. Repeat step 2 for the next unoccupied state

Computing P in Charm++
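A serial numpy sketch of the same loop structure (steps 1-4 above); in OpenAtom these are Charm++ chare arrays, broadcasts, and tile updates, and the sizes here are toys:

```python
import numpy as np

Nr, Nv, Nc, tile = 128, 6, 24, 32          # toy r-grid, state counts, tile size
rng = np.random.default_rng(2)
psi_occ = rng.standard_normal((Nv, Nr))    # step 1: occupied states (duplicated per node)
psi_unocc = rng.standard_normal((Nc, Nr))

P = np.zeros((Nr, Nr))
for c in range(Nc):                        # steps 2 and 4: one unoccupied state at a time
    f = psi_occ * psi_unocc[c]             # f vectors f_vc(r), shape (Nv, Nr)
    for i in range(0, Nr, tile):           # step 3: update each 2D tile of P locally
        for j in range(0, Nr, tile):
            P[i:i+tile, j:j+tile] += f[:, i:i+tile].T @ f[:, j:j+tile]
```

Each tile update touches only the slices of f it owns, which is what lets the P tiles live on different processors.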
Benchmark:
§ 108-atom bulk Si
§ 216 occupied states, 1832 unoccupied states
§ 1 k-point
§ 32 processors per node
§ FFT grids at the same accuracy: OA 42x42x22, BGW 111x55x55

Supercomputers: Mira (ANL), an IBM BlueGene/Q, and Blue Waters (NCSA), a Cray XE6

(Plot: scaling on Blue Waters, time (sec) vs. number of nodes at 32 cores per node; OpenAtom vs. BerkeleyGW 1.2)

Parallel performance: P calculation
* Bruneval and Gonze, PRB 78 (2008); Berger, Reining, Sottile, PRB 82 (2010); Umari, Stenuit, Baroni, PRB 81 (2010); Giustino, Cohen, Louie, PRB 81 (2010); Wilson, Gygi, Galli, PRB 78 (2008); Govoni and Galli, J. Chem. Theory Comput. 11 (2015); Gao, Xia, Gao, Zhang, Sci. Rep. 6 (2016)
† Foerster, Koval, Sanchez-Portal, JCP 135 (2011); Liu, Kaltak, Klimes, and Kresse, PRB 94 (2016)
§ $O(N^4) = N_r^2 \times N_c \times N_v$
§ Sum-over-states (i.e., the sum over unoccupied bands $c$) is not to blame: removing the unoccupied states is still $O(N^4)$, just with a lower prefactor*
§ Working in r-space can reduce the cost to $O(N^3)$ [see also †]
Reducing the scaling: quartic to cubic
Quasi-philosophical: all bases are equally good in quantum mechanics, so why is r-space special? An observable is diagonal in the best basis.
Practical: P is separable in r-space.
$Q(s,s') = -2 \int_0^\infty d\tau \left[\sum_c \psi_c^*(s)\psi_c(s')\, e^{-E_c\tau}\right]\left[\sum_v \psi_v(s)\psi_v^*(s')\, e^{E_v\tau}\right]$ — separable

Laplace identity: $\dfrac{1}{E_c - E_v} = \int_0^\infty d\tau\, e^{-(E_c - E_v)\tau}$

Gauss-Laguerre quadrature: $\int_0^\infty d\tau\, g(\tau)\, e^{-\tau} \approx \sum_{l=1}^{N_{GL}} w_l\, g(\tau_l)$

$Q(s,s') = -2 \sum_{l=1}^{N_{GL}} w_l\, e^{\tau_l} \left[\sum_c \psi_c^*(s)\psi_c(s')\, e^{-E_c\tau_l}\right]\left[\sum_v \psi_v(s)\psi_v^*(s')\, e^{E_v\tau_l}\right]$

Cost: $N_r^2\, N_{GL}\, (N_c + N_v) \propto N^3$, and $N_{GL}$ is intensive (independent of system size)
What’s special about r-space?
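A quick numeric check of the Laplace identity and Gauss-Laguerre rule above, using numpy's laggauss and toy energies:

```python
import numpy as np
from numpy.polynomial.laguerre import laggauss

Ev, Ec = -0.3, 1.9                         # toy energies with Ec > Ev
t, w = laggauss(20)                        # 20-point Gauss-Laguerre nodes/weights
# 1/(Ec-Ev) = int_0^inf exp(-(Ec-Ev) t) dt; pull out exp(-t) as the weight:
approx = np.sum(w * np.exp(-(Ec - Ev - 1.0) * t))
print(approx, 1.0 / (Ec - Ev))             # the two values agree closely
```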
(Plot: required $N_{GL}$ (10-50) grows with the ratio $E_{bw}/E_g$ (100-500))
Split the occupied energies (between $E_{v,min}$ and $E_{v,max}$) and the empty energies (between $E_{c,min}$ and $E_{c,max}$) into windows.

Example: 2 by 2 windows, $\{E_v\}_1, \{E_v\}_2$ and $\{E_c\}_1, \{E_c\}_2$:
$Q(s,s') = \sum_{i=1}^{N_{wc}} \sum_{j=1}^{N_{wv}} Q_{ij}(s,s') = Q_{11} + Q_{12} + Q_{21} + Q_{22}$

$N_{wv}$: # of windows for $E_v$; $N_{wc}$: # of windows for $E_c$
§ Saves computation: a small $N_{GL}$ suffices for each window pair
§ Especially helpful for materials with small band gaps
§ $N_{GL}$ depends on $E_{bw}/E_g$, where $E_{bw} = E_{c,max} - E_{v,min}$
§ The largest quadrature error occurs at $E_c - E_v = E_g$ or $E_{bw}$
Windowed cubic Laplace method
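A toy illustration of why windowing pays off (not the OpenAtom implementation): the worst-case quadrature error depends on how wide the range of $E_c - E_v$ is, so narrow windows get away with a small $N_{GL}$:

```python
import numpy as np
from numpy.polynomial.laguerre import laggauss

def max_rel_error(dE_min, dE_max, n_gl, samples=200):
    """Worst relative error of the n_gl-point Laplace quadrature for 1/dE
    with dE = Ec - Ev ranging over [dE_min, dE_max]."""
    t, w = laggauss(n_gl)
    dE = np.linspace(dE_min, dE_max, samples)
    approx = np.array([np.sum(w * np.exp(-(d - 1.0) * t)) for d in dE])
    return np.max(np.abs(approx * dE - 1.0))

print(max_rel_error(0.1, 10.0, 20))        # wide window: Ebw/Eg = 100, poor accuracy
print(max_rel_error(0.5, 2.0, 20))         # narrow window: Ebw/Eg = 4, much better
```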
Estimate the computational costs. Example: 2x2 windows.

(Surface plot: real computational cost $C_{elab}$ vs. the window split ratios $E_{c,ratio}$ and $E_{v,ratio}$)

Real computational costs

(Surface plot: simple cost estimate $C_{simple}$ vs. $E_{c,ratio}$ and $E_{v,ratio}$)

Estimated computational costs
Window split ratios (the window boundary is at $E^*$):
$E_{c,ratio} = \dfrac{E_c^* - E_{c,min}}{E_{c,max} - E_c^*}, \qquad E_{v,ratio} = \dfrac{E_v^* - E_{v,min}}{E_{v,max} - E_v^*}$

The computation cost can be estimated with $E_{bw}$ and $E_g$:
$D \propto \sum_{i}^{N_{wc}} \sum_{j}^{N_{wv}} \frac{E_{bw}^{ij}}{E_g^{ij}} \cdot \frac{E_{c,i}^{max} - E_{c,i}^{min}}{E_c^{max} - E_c^{min}}\, N_c \cdot \frac{E_{v,j}^{max} - E_{v,j}^{min}}{E_v^{max} - E_v^{min}}\, N_v$
Compared to the O(N⁴) method, the cost ratio above shrinks for bigger systems (roughly in proportion to 1/N_atoms), so the windowed method wins as the system grows.
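As a rough planning aid, a hypothetical helper that compares the two operation counts (formulas taken from the tallies on the backup slides: N⁴ ops = Nc·Nv·Nr², L+W ops = Σij N_GL^ij (Nv_j + Nc_i) Nr²; all inputs here are illustrative):

```python
# Hypothetical cost comparator; window contents and N_GL values are toy inputs.
def speedup(ngl, windows_v, windows_c):
    """ngl[i][j]: quadrature points for c-window i paired with v-window j;
    windows_v / windows_c: number of states in each window.
    Returns (N^4 ops) / (L+W ops); the common Nr^2 factor cancels."""
    Nv, Nc = sum(windows_v), sum(windows_c)
    lw_ops = sum(ngl[i][j] * (nv + nc)
                 for i, nc in enumerate(windows_c)
                 for j, nv in enumerate(windows_v))
    return (Nv * Nc) / lw_ops

# One window each, 216 occupied / 1832 empty states, N_GL = 20:
print(speedup([[20]], windows_v=[216], windows_c=[1832]))  # ~9.7x fewer operations
```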
§ Si crystal (16 atoms): number of bands 399, $N_{wv}=1$, $N_{wc}=4$
§ MgO crystal (16 atoms): number of bands 433, $N_{wv}=1$, $N_{wc}=4$

Windowed Laplace: example
Correct practical comparison:
- 2-atom Si, 8 k-points
- Yambo N⁴ GW software, with BG* acceleration
- Our N³ method vs. the available N⁴ method with acceleration
- Crossover is at very few atoms: the N³ method is already competitive for small systems

* Bruneval & Gonze, PRB 78 (2008)

Do I care in practice?
Windowed Laplace method for self-energy
Dynamic GW self-energy:
$\Sigma^{dyn}_{n,n'}(E) = \sum_{m,p} \frac{C^p_{n,n'}\,\psi_{nm}\,\psi^*_{n'm}}{E - E_m + \mathrm{sgn}(\mu - E_m)\,\omega_p}$

$C^p_{n,n'}$: residues; $\omega_p$: energies of the poles of the screened interaction

The key function is $G(y) = 1/y$ evaluated at $y = E - E_m \pm \omega_p$, and $y$ can be positive or negative.
Windowing again:
$\Sigma(E) = \sum_{i}^{N_{w\omega}} \sum_{j}^{N_{wE}} \Sigma(E)_{ij}$, with windows $E_j^{min} \le E - E_m < E_j^{max}$ and $\Omega_i^{min} \le \pm\omega_p < \Omega_i^{max}$
§ Gauss-Laguerre quadrature is not appropriate here ($y$ changes sign)
§ New quadrature needed for overlapping windows
New quadrature
Write $G(y) = \dfrac{1}{y} = \mathrm{Im} \int_0^\infty dw\, e^{iyw} = \mathrm{Im} \int_0^\infty dw\, \dfrac{e^{iyw}}{x(w)}\, x(w)$

with weight function $x(w) = e^{-w}$ or $x(w) = e^{-w-w^3/3}$, and build Gaussian quadrature nodes and weights for $x(w)$.
Size of quadrature grid:

% error   n_q (e^{-w})   n_q (e^{-w-w^3/3})
5         6              1
1         24             1
0.1       124            5
0.01      547            15
0.001     2216           36
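A sanity check (damped toy form of the identity above; not the production quadrature) that one oscillatory representation of $G(y) = 1/y$ handles both signs of $y$:

```python
import numpy as np
from scipy.integrate import quad

def inv_y(y, damp=0.01):
    """1/y = Im int_0^inf e^{iyw} dw; the damp factor regularizes the tail,
    giving y/(y^2 + damp^2) -> 1/y as damp -> 0."""
    val, _ = quad(lambda w: np.exp(-damp * w), 0, np.inf,
                  weight='sin', wvar=abs(y))
    return np.sign(y) * val                # sin(yw) is odd in y

print(inv_y(2.5), 1 / 2.5)                 # y > 0
print(inv_y(-0.7), -1 / 0.7)               # y < 0: the same representation works
```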
Results - G0W0 gap
(Plot: Si G0W0 Eg (eV) vs. the ratio of computation to the N⁴ method, comparing Laplace+windowing against the N⁴ result)
§ Si crystal (16 atoms)
§ Number of bands: 399
§ Windows: 15 for the poles, 30 for the band energies
Phase   Task                   Serial     Parallel
1       Compute P in r-space   Complete   Complete
2       FFT P to G-space       Complete   Complete
3       Invert epsilon         Complete   Complete
4       Plasmon pole           Complete   In progress
5       COHSEX self-energy     Complete   Complete
6       Dynamic self-energy    Complete   In progress
7       Coulomb truncation     Future     Future
Aim to release the parallel COHSEX version in late spring 2018
Where we are with OpenAtom GW
§ OpenAtom framework
§ r-space has many advantages for GW
§ Charm++ runtime library
- Reduces parallelization/porting/refactoring headaches
- Good performance, very good scaling
§ r-space separability leads to N³-scaling GW
- Straightforward change to sum-over-states methods
- Crossover with the N⁴ method at N_atoms ~ 5-10
Summary
Backup slides
G-space:
§ Directly compute P in G-space
§ Many FFTs: $N_v N_c$
§ Big multiply: $N_v N_c N_G^2 = O(N^4)$
→ $N_v N_c$ FFTs needed + a big $O(N^4)$ matrix multiply

R-space:
§ Big multiply: $N_v N_c N_r^2 = O(N^4)$
§ Then FFT the $N_r$ rows and $N_r$ columns of P ($N_r$: # of r-grid points, $N_r \approx 4 N_c$)
→ $N_v + N_c + 8 N_c$ FFTs needed + a big $O(N^4)$ matrix multiply

$N_v$: # occupied states; $N_c$: # unoccupied states; $N_G$: # of G vectors

G vs. R space P calculation
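A back-of-envelope tally of the two routes (hypothetical helper functions; $N_G$ here is an assumed illustrative value, the other counts come from the benchmark slides):

```python
# Hypothetical helpers tallying the slide's FFT and multiply counts.
def g_space_cost(Nv, Nc, NG):
    return {"ffts": Nv * Nc, "multiply_ops": Nv * Nc * NG**2}

def r_space_cost(Nv, Nc, Nr):
    # Nv + Nc FFTs for the states, then ~8*Nc FFTs for P's rows/columns
    return {"ffts": Nv + Nc + 8 * Nc, "multiply_ops": Nv * Nc * Nr**2}

print(g_space_cost(216, 1832, 4000))       # NG = 4000 is an assumed value
print(r_space_cost(216, 1832, 4 * 1832))   # Nr ~ 4*Nc per the slide
```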
Consider two key steps:
a. Many FFTs → the $\psi_i(r)$
b. Outer product → P

Typical MPI/OpenMP: working explicitly with the number of processors
1. Divide the $\psi_i(r)$ among procs
2. Do a pile of FFTs on each proc
3. Divide (r, r') among procs (e.g. ScaLAPACK)
4. Do the outer product

"Physicist" programming

Problems:
- $N_i > N_{proc}$ and $N_i < N_{proc}$ need different parallelizations: explicitly different coding
- The typical programmer does 1. & 2., then 3. & 4.; hard to interleave
- Machines and fashions change: need to recode the parallelization... (GPUs, SMPs, few cores, multicores, etc.)
(Plot: compute time per operation (ns) vs. number of atoms (2-16), comparing the N⁴ method, L+W 1%, and L+W 10%)

Si 16-atom calculation, number of operations:
- N⁴ method: $N_c N_v N_r^2$
- L+W: $\sum_{ij} N_{GL}^{ij} (N_v^j + N_c^i)\, N_r^2$

§ Comparable prefactor (similar time per operation)
§ Speedup already for small systems, $N_{atoms} \gtrsim 10$