SLIDE 1

Scalable Algorithms for Electronic Structure Calculations on Petascale Computers

François Gygi

University of California, Davis fgygi@ucdavis.edu http://eslab.ucdavis.edu

RANMEP2008 Workshop, NCTS, Taiwan, Jan 6, 2008

Supported by NSF ITR-HECURA-0749217 and DOE-SciDAC

SLIDE 2

Outline

  • First-Principles simulations
  • Eigenvalue problems in electronic structure calculations
  • Localized representations of solutions and the simultaneous diagonalization problem
  • Data compression through simultaneous diagonalization

SLIDE 3

First-Principles Simulations

  • Goal: simulate molecules, solids, and liquids from first principles, without input from experiments
  • The approach: molecular dynamics, an atomic-scale simulation method
    – Compute the trajectories of all atoms
    – Extract statistical information from the trajectories

Atoms move according to Newton's law:

  m_i \ddot{\mathbf{R}}_i = \mathbf{F}_i

SLIDE 4

First-Principles Simulations

  • Why "First-Principles"?
    – Avoid empirical models and adjustable parameters
    – Goal: applications to extreme conditions (e.g. high pressure) where no experimental data is available
    – Use fundamental principles: quantum mechanics
    – Must describe ions and electrons consistently and simultaneously

At each time step:
  1) Compute the electronic structure
  2) Derive interatomic forces
  3) Move atoms

SLIDE 5

First-Principles Simulations

  • Applications
    – Chemistry
    – Nanotechnology
    – Semiconductors
    – Biochemistry
    – High-pressure physics

[Figures: growth of a carbon nanotube on an iron catalyst; biotin on silicon carbide; ice-water interface; silicon quantum dot]

SLIDE 6

First-Principles Simulations

  • The computation of the electronic structure is the most expensive part of the simulation

At each time step:
  1) Compute the electronic structure   (>99% of CPU time)
  2) Derive interatomic forces
  3) Move atoms

SLIDE 7

First-principles simulations require large computing resources

  • Cost of one time step scales as O(n³)
    – n: number of electrons
  • Many time steps are required for long simulations
  • Requires the use of large-scale parallel platforms
    – target: O(10⁴) to O(10⁵) CPUs
  • Focus on scalable algorithms
    – communication cost is the primary concern

SLIDE 8

Using large computers: BlueGene/L

  • 65,536 nodes, 128k CPUs
  • 3D torus network
  • 512 MB/node
  • 367 TFlop peak

[Figure: BlueGene/L packaging hierarchy]
  – Chip (2 processors): 2.8/5.6 GF/s, 4 MB
  – Compute Card (2 chips, 2x1x1): 5.6/11.2 GF/s, 0.5 GB DDR
  – Node Board (32 chips, 4x4x2; 16 Compute Cards): 90/180 GF/s, 8 GB DDR
  – Cabinet (32 Node Boards, 8x8x16): 2.9/5.7 TF/s, 256 GB DDR
  – System (64 cabinets, 64x32x32): 180/360 TF/s, 16 TB DDR
SLIDE 9

Computing the electronic structure

  • Kohn-Sham equations
    – the solutions φ_i represent electronic wavefunctions (one per electron)

  H\varphi_i = \left(-\tfrac{1}{2}\Delta + V(\rho,\mathbf{r})\right)\varphi_i = \varepsilon_i\,\varphi_i, \qquad i = 1,\dots,n

  V(\rho,\mathbf{r}) = V_{\mathrm{ion}}(\mathbf{r}) + \int\frac{\rho(\mathbf{r}')}{|\mathbf{r}-\mathbf{r}'|}\,d\mathbf{r}' + V_{\mathrm{XC}}\bigl(\rho(\mathbf{r}),\nabla\rho(\mathbf{r})\bigr)

  \rho(\mathbf{r}) = \sum_i |\varphi_i(\mathbf{r})|^2

  \int \varphi_i^*(\mathbf{r})\,\varphi_j(\mathbf{r})\,d\mathbf{r} = \delta_{ij}, \qquad \varphi_i \in L^2(\mathbb{R}^3)

SLIDE 10

Computing the electronic structure

  • Solutions are represented as Fourier series:

  \varphi_j(\mathbf{r}) = \sum_{|\mathbf{q}|^2 < E_{\mathrm{cut}}} c_{\mathbf{q},j}\, e^{i\mathbf{q}\cdot\mathbf{r}}

  • A set of solutions is represented by an (orthogonal) m × n matrix Y of complex Fourier coefficients:

  Y_{ij} = c_{\mathbf{q}_i,\,j}

  • Dimensions of Y: 10⁶ × 10⁴
  • Note: typically m/n ~ 100

SLIDE 11

Computing the electronic structure

  • The energy is invariant under unitary transformations of Y:

  E(Y) = \mathrm{tr}\!\left[Y^{T} H Y\right] + F(\rho), \qquad \rho(\mathbf{r}) = \sum_j |\varphi_j(\mathbf{r})|^2

  E(Y) = E(YQ), \qquad Q \ \text{unitary}

SLIDE 12

Electronic structure calculation: (with fixed potential)

  • Invariant subspace computation: find Y such that

  HY = Y\Lambda, \qquad H:\ m \times m, \quad Y:\ m \times n, \quad \Lambda:\ n \times n

    – H is sparse
    – Cost of computing Hx: O(m log m) (involves Fast Fourier Transforms)

SLIDE 13

Electronic structure calculation: (with fixed potential)

  • Iterative methods for invariant subspace computations
    – Variants of Jacobi-Davidson
    – DIIS (a.k.a. Anderson acceleration)
  • Simple, diagonal preconditioning works well
  • Robustness of eigensolvers is key
SLIDE 14

Preconditioned steepest descent

1) Correction step:

  Y := Y + \beta\, K\,(I - YY^{T})\,HY

2) Orthogonalization of Y
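
A minimal NumPy sketch of one such step, assuming a dense Hermitian H, a diagonal preconditioner K stored as a vector, and QR for the orthogonalization; the step length beta and the sign convention folded into K are placeholders (in an actual code, H is applied via FFTs and is never stored densely):

```python
import numpy as np

def sd_step(Y, H, K_diag, beta=0.3):
    """One preconditioned steepest-descent step Y := Y + beta*K*(I - Y Y^H) H Y,
    followed by re-orthogonalization of the columns (here via QR)."""
    HY = H @ Y                                   # apply the Hamiltonian
    R = HY - Y @ (Y.conj().T @ HY)               # residual (I - Y Y^H) H Y
    Y = Y + beta * (K_diag[:, None] * R)         # preconditioned correction
    Q, _ = np.linalg.qr(Y)                       # re-orthogonalize the columns
    return Q
```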

SLIDE 15

Preconditioned DIIS

1) Descent direction:

  \Delta_k = K\,(I - Y_k Y_k^{T})\,H Y_k

2) Update (Anderson acceleration with history length one):

  \theta_k = \frac{\mathrm{tr}\!\left[\Delta_k^{T}(\Delta_k - \Delta_{k-1})\right]}{\|\Delta_k - \Delta_{k-1}\|_F^{2}}, \qquad
  \bar{Y}_k = Y_k + \theta_k\,(Y_{k-1} - Y_k), \qquad
  \bar{\Delta}_k = \Delta_k + \theta_k\,(\Delta_{k-1} - \Delta_k), \qquad
  Y_{k+1} = \bar{Y}_k + \beta\,\bar{\Delta}_k

3) Orthogonalization of Y_{k+1}
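
A hypothetical NumPy sketch of this two-term Anderson update, reusing the preconditioned residual Δ from the steepest-descent sketch above; the sign conventions and the final re-orthogonalization are assumptions, not the actual implementation:

```python
import numpy as np

def anderson_step(Yk, Ykm1, Dk, Dkm1, beta=0.3):
    """One Anderson-accelerated (DIIS) update with history length one.
    Yk, Ykm1: current and previous iterates; Dk, Dkm1: their preconditioned residuals."""
    dD = Dk - Dkm1
    theta = np.real(np.trace(Dk.conj().T @ dD)) / np.linalg.norm(dD, 'fro')**2
    Ybar = Yk + theta * (Ykm1 - Yk)              # mixed iterate
    Dbar = Dk + theta * (Dkm1 - Dk)              # mixed residual
    Ynew = Ybar + beta * Dbar                    # take the accelerated step
    Q, _ = np.linalg.qr(Ynew)                    # re-orthogonalize
    return Q
```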

SLIDE 16

Self-consistent electronic structure computation

  • H depends non-linearly on the solution Y (through ρ)
  • Fixed-point iteration:

    repeat {
      1) compute the charge density   ρ_i = (YY^T)_{ii}
      2) solve                        H(ρ(Y))\,Y = Y\Lambda
    } until converged (i.e. ρ does not change)

  • Convergence can be accelerated using various charge-mixing schemes (e.g. Broyden)
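
A schematic NumPy sketch of this fixed-point loop, assuming a user-supplied build_H(rho) that returns the Hamiltonian for a given density; a dense eigensolver and a simple density-difference test stand in for the iterative eigensolvers and charge mixing described above:

```python
import numpy as np

def scf_loop(build_H, Y0, tol=1e-6, max_iter=100):
    """Self-consistent field fixed-point iteration (schematic)."""
    Y = Y0
    rho_old = np.sum(np.abs(Y)**2, axis=1)        # rho_i = (Y Y^H)_ii
    for it in range(max_iter):
        H = build_H(rho_old)                      # Hamiltonian depends on the density
        _, V = np.linalg.eigh(H)                  # solve H Y = Y Lambda (dense stand-in)
        Y = V[:, :Y.shape[1]]                     # keep the lowest-lying subspace
        rho = np.sum(np.abs(Y)**2, axis=1)
        if np.linalg.norm(rho - rho_old) < tol:   # converged: density no longer changes
            break
        rho_old = rho                             # (charge mixing would be applied here)
    return Y, rho_old
```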

SLIDE 17

Molecular Dynamics: solve the SCF problem at each time step

  • H is time-dependent (depends on the positions of the atoms)

    for each time step t {
      repeat {
        1) compute the charge density   ρ_i = (YY^T)_{ii}
        2) solve                        H(ρ(t),\,t)\,Y(t) = Y(t)\Lambda(t)
      } until converged
      compute forces, move atoms
    }

SLIDE 18

Molecular Dynamics: using previous solutions optimally

  • Computing Y(t)
    – The previous solution Y(t−dt) is "close" to Y(t) and can be used as an initial guess for the iterative calculation of Y(t)

[Figure: successive subspaces Y_{k−1}, Y_k, Y_{k+1}]

SLIDE 19

Molecular Dynamics: using previous solutions optimally

  • Computing Y(t)
    – The previous solution Y(t−dt) is "close" to Y(t) and can be used as an initial guess for the iterative calculation of Y(t)
    – The extrapolated subspace is a better initial guess:

  \bar{Y} = 2\,Y_k - Y_{k-1}

[Figure: successive subspaces Y_{k−1}, Y_k, Y_{k+1} and the extrapolated subspace]

SLIDE 20

Molecular Dynamics: using previous solutions optimally

  • Subspace alignment
    – The eigensolver introduces arbitrary rotations in Y(t)
    – Extrapolation must be preceded by subspace alignment
    – Orthogonal Procrustes problem:

  \min_{Q} \|Y_k - Y_{k-1} Q\|, \qquad Q^{T} Q = I

  \bar{Y} = 2\,Y_k - Y_{k-1} Q

[Figure: successive subspaces Y_{k−1}, Y_k, Y_{k+1} after alignment]

SLIDE 21

Subspace alignment

Orthogonal Procrustes problem: minimize ‖Y_k − Y_{k−1}Q‖

1) Compute the polar decomposition

  A \equiv Y_{k-1}^{T} Y_k = U H, \qquad U \ \text{unitary}, \ H \ \text{hermitian}

2) Rotate Y_{k−1}:

  Y_{k-1} := Y_{k-1}\, U
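
A compact NumPy sketch of the alignment-plus-extrapolation step, assuming the unitary polar factor of A = Y_{k−1}^H Y_k is obtained from an SVD; a scalable code would instead use an inverse-free iteration (next slide):

```python
import numpy as np

def align_and_extrapolate(Yk, Ykm1):
    """Align Y_{k-1} to Y_k (orthogonal Procrustes) and extrapolate the subspace."""
    A = Ykm1.conj().T @ Yk
    W, _, Vh = np.linalg.svd(A)            # A = W * diag(s) * Vh
    U = W @ Vh                             # unitary polar factor of A
    Ykm1_aligned = Ykm1 @ U                # rotate the previous subspace
    Ybar = 2.0 * Yk - Ykm1_aligned         # extrapolated initial guess
    return Ybar
```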

SLIDE 22

Polar decomposition

Polar decomposition A = UH (Higham '86): the iteration

  X_0 = A, \qquad X_{k+1} = \tfrac{1}{2}\left(X_k + X_k^{-*}\right)

converges quadratically to the unitary polar factor U.

Need a better, inverse-free, scalable algorithm.
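
An illustrative NumPy sketch of this Newton iteration for the polar factor; it uses an explicit matrix inverse, which is exactly the operation one would want to avoid in a scalable, inverse-free implementation:

```python
import numpy as np

def polar_factor_newton(A, tol=1e-12, max_iter=50):
    """Newton iteration X_{k+1} = (X_k + X_k^{-H}) / 2 for the unitary polar factor of A."""
    X = np.asarray(A, dtype=complex)
    for _ in range(max_iter):
        X_next = 0.5 * (X + np.linalg.inv(X).conj().T)
        if np.linalg.norm(X_next - X, 'fro') < tol:
            return X_next
        X = X_next
    return X   # approximately unitary: X^H X ≈ I
```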

SLIDE 23

Outline

  • First-Principles simulations
  • Eigenvalue problems in electronic structure calculations
  • Localized representations of solutions and the simultaneous diagonalization problem
  • Data compression through simultaneous diagonalization

SLIDE 24

Localized representations of the invariant subspace

  • Linear combinations of electronic wavefunctions that minimize the spatial spread are called "Maximally Localized Wannier Functions" (MLWFs)
  • MLWFs are used to compute the electronic polarization in crystals
  • Computing MLWFs during a molecular dynamics simulation yields the infrared absorption spectrum

  \sigma_X^2 = \langle \hat{x}^2 \rangle - \langle \hat{x} \rangle^2

  • N. Marzari and D. Vanderbilt, Phys. Rev. B 56, 12847 (1997)
  • R. Resta, Phys. Rev. Lett. 80, 1800 (1998)
SLIDE 25

Spread Functionals

  • Spread of a wavefunction associated with an operator Â:

  \sigma_{\hat{A}}^2(\varphi) = \langle\varphi|\hat{A}^2|\varphi\rangle - \langle\varphi|\hat{A}|\varphi\rangle^2

  • Spread of a set of wavefunctions associated with an operator Â:

  \sigma_{\hat{A}}^2\bigl(\{\varphi_i\}\bigr) = \sum_i \sigma_{\hat{A}}^2(\varphi_i)

SLIDE 26

Spread Functionals

  • The spread is not invariant under orthogonal transformations:

  \psi_i = \sum_j x_{ij}\,\varphi_j, \qquad X \in \mathbb{R}^{n\times n} \ \text{orthogonal}
  \quad\Rightarrow\quad
  \sigma_{\hat{A}}^2\bigl(\{\psi_i\}\bigr) \neq \sigma_{\hat{A}}^2\bigl(\{\varphi_i\}\bigr)

  • There exists a matrix X that minimizes the spread

SLIDE 27

Spread Functionals

  • Let A, B be the n × n matrices with a_{ij} = ⟨i|Â|j⟩ and b_{ij} = ⟨i|Â²|j⟩. Then

  \sum_i \sigma_{\hat{A}}^2(\psi_i) = \mathrm{tr}\!\left(X^{T} B X\right) - \sum_{i=1}^{n} \left(X^{T} A X\right)_{ii}^{2}

  • Minimizing the spread = maximizing

  \sum_{i=1}^{n} \left(X^{T} A X\right)_{ii}^{2}

    = diagonalizing A

SLIDE 28

Spread Functionals

  • Case of multiple operators Â^{(k)}, k = 1,…,m, with matrices A^{(k)}, k = 1,…,m:

  \sigma^2\bigl(\{\psi_i\}\bigr) = \sum_k \sigma_{\hat{A}^{(k)}}^2\bigl(\{\psi_i\}\bigr) = \sum_k \sum_i \sigma_{\hat{A}^{(k)}}^2(\psi_i)

  • Minimizing the spread = maximizing

  \sum_k \sum_{i=1}^{n} \left(X^{T} A^{(k)} X\right)_{ii}^{2}

    = joint approximate diagonalization of the matrices A^{(k)}

SLIDE 29

Spread Functionals

  • Example of multiple operators:

  \hat{A}^{(1)} = \hat{X}: \ (\hat{X}\varphi)(x,y,z) \equiv x\,\varphi(x,y,z), \qquad
  \hat{A}^{(2)} = \hat{Y}: \ (\hat{Y}\varphi)(x,y,z) \equiv y\,\varphi(x,y,z), \qquad
  \hat{A}^{(3)} = \hat{Z}: \ (\hat{Z}\varphi)(x,y,z) \equiv z\,\varphi(x,y,z)

  • The matrices A^{(k)} do not necessarily commute, even if the operators Â^{(k)} do commute

SLIDE 30

Computation of MLWFs by simultaneous diagonalization

  • Jacobi algorithm for simultaneous diagonalization (see the sketch after slide 32):

    repeat
      for each pair (i, j): compute the Jacobi rotation R(i, j)
        A^{(k)} ← R^T A^{(k)} R   ∀k
    until converged

  • A. Bunse-Gerstner, R. Byers, V. Mehrmann, SIAM J. Mat. Anal. Appl. 14, 927 (1993)
  • J. F. Cardoso and A. Souloumiac, SIAM J. Mat. Anal. Appl. 17, 161 (1996)

SLIDE 31

Computation of MLWFs by simultaneous diagonalization

  R(i,j) = \begin{pmatrix} c & s \\ -\bar{s} & c \end{pmatrix}, \qquad c \in \mathbb{R}, \ s \in \mathbb{C}, \ c^2 + |s|^2 = 1

  c = \sqrt{\frac{x + r}{2r}}, \qquad s = \frac{y - iz}{\sqrt{2r(x+r)}}, \qquad r = \sqrt{x^2 + y^2 + z^2}

  • J. F. Cardoso and A. Souloumiac, SIAM J. Mat. Anal. Appl. 17, 161 (1996)
SLIDE 32

Computation of MLWFs by simultaneous diagonalization

  \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \text{eigenvector (largest eigenvalue) of } \
  G = \sum_k \mathrm{Re}\!\left[ h\!\left(A^{(k)}\right)^{H} h\!\left(A^{(k)}\right) \right],
  \qquad
  h(A) = \begin{pmatrix} a_{ii} - a_{jj} \\ a_{ij} + a_{ji} \\ i\,(a_{ji} - a_{ij}) \end{pmatrix}

J. F. Cardoso and A. Souloumiac, SIAM J. Mat. Anal. Appl. 17, 161 (1996)
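
A serial NumPy sketch of the Jacobi-like sweep of slides 30-32, assuming real symmetric input matrices (the complex Hermitian case uses the three-component form above); the sweep limit and convergence test are placeholders, and a real implementation updates only rows/columns i and j and runs in parallel:

```python
import numpy as np

def joint_diagonalize(mats, tol=1e-12, max_sweeps=50):
    """Jacobi-like joint approximate diagonalization of real symmetric matrices
    using Cardoso-Souloumiac 2x2 rotations. Returns an orthogonal X and the
    rotated matrices X^T A^(k) X (approximately diagonal)."""
    A = [np.array(M, dtype=float) for M in mats]
    n = A[0].shape[0]
    X = np.eye(n)
    for _ in range(max_sweeps):
        largest_s = 0.0
        for i in range(n - 1):
            for j in range(i + 1, n):
                # 2x2 Cardoso-Souloumiac subproblem for the pair (i, j)
                G = np.zeros((2, 2))
                for M in A:
                    h = np.array([M[i, i] - M[j, j], M[i, j] + M[j, i]])
                    G += np.outer(h, h)
                _, V = np.linalg.eigh(G)
                x, y = V[:, -1]                     # eigenvector of largest eigenvalue
                if x < 0.0:
                    x, y = -x, -y                   # fix the sign so that x + r > 0
                r = np.hypot(x, y)
                if r < 1e-15:
                    continue                        # this pair is already jointly diagonal
                c = np.sqrt((x + r) / (2.0 * r))
                s = y / np.sqrt(2.0 * r * (x + r))
                largest_s = max(largest_s, abs(s))
                R = np.eye(n)                       # plane rotation in the (i, j) plane
                R[i, i] = R[j, j] = c
                R[i, j], R[j, i] = -s, s
                A = [R.T @ M @ R for M in A]        # apply to all matrices
                X = X @ R                           # accumulate the transformation
        if largest_s < tol:                         # all rotations near the identity
            break
    return X, A
```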

SLIDE 33

Calculation of MLWFs

  • In periodic systems, use the six operators

  \hat{A}^{(1)} = \hat{C}_x \equiv \cos\frac{2\pi x}{L_x}, \quad
  \hat{A}^{(2)} = \hat{S}_x \equiv \sin\frac{2\pi x}{L_x}, \quad
  \hat{A}^{(3)} = \hat{C}_y \equiv \cos\frac{2\pi y}{L_y}, \quad
  \hat{A}^{(4)} = \hat{S}_y \equiv \sin\frac{2\pi y}{L_y}, \quad
  \hat{A}^{(5)} = \hat{C}_z \equiv \cos\frac{2\pi z}{L_z}, \quad
  \hat{A}^{(6)} = \hat{S}_z \equiv \sin\frac{2\pi z}{L_z}

F. Gygi, J.-L. Fattebert, E. Schwegler, Comput. Phys. Comm. 155, 1 (2003)

SLIDE 34

Calculation of MLWFs

  • The spread is minimized by simultaneous diagonalization of the matrices C_x, S_x, C_y, S_y, C_z, S_z
  • Positions of the centers of mass of the localized solutions ("Wannier centers"):

  \theta_i^{x} = \arctan\frac{(S_x)_{ii}}{(C_x)_{ii}}, \qquad
  \tau_i = \frac{1}{2\pi}\begin{pmatrix} L_x\,\theta_i^{x} \\ L_y\,\theta_i^{y} \\ L_z\,\theta_i^{z} \end{pmatrix}

  • Spreads:

  \sigma_{x,i}^{2} = \frac{L_x^{2}}{4\pi^{2}}\left(1 - (C_x)_{ii}^{2} - (S_x)_{ii}^{2}\right)

F. Gygi, J.-L. Fattebert, E. Schwegler, Comput. Phys. Comm. 155, 1 (2003)
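
A small NumPy sketch of this post-processing step, assuming the six matrices have already been rotated into the jointly (approximately) diagonal representation; arctan2 is used to select the correct branch of the arctangent, and summing the per-axis spreads into a total spread is an assumption about the convention:

```python
import numpy as np

def wannier_centers_and_spreads(Cx, Sx, Cy, Sy, Cz, Sz, Lx, Ly, Lz):
    """Wannier centers and spreads from the diagonals of the rotated cos/sin matrices."""
    def axis(C, S, L):
        c, s = np.diag(C), np.diag(S)
        theta = np.arctan2(s, c)                       # angle theta_i for this axis
        centers = L * theta / (2.0 * np.pi)            # Wannier-center coordinate
        spreads = (L**2 / (4.0 * np.pi**2)) * (1.0 - c**2 - s**2)
        return centers, spreads
    cx, s2x = axis(Cx, Sx, Lx)
    cy, s2y = axis(Cy, Sy, Ly)
    cz, s2z = axis(Cz, Sz, Lz)
    centers = np.column_stack([cx, cy, cz])            # one row per localized function
    spreads = s2x + s2y + s2z                          # total spread sigma_i^2
    return centers, spreads
```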

SLIDE 35

Simultaneous diagonalization algorithms

  • The cost of simultaneous diagonalization is O(N³)
  • Conventional parallel implementations do not scale beyond N_CPU ~ N_st/2

SLIDE 36

Parallel Jacobi algorithm

  • Each processor holds a pair of columns
  • Limited to n/2 processors for n columns

[Figure: pairing of 8 columns across 4 processors, e.g. (1,8)(2,7)(3,6)(4,5)]

  • G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed. (1996)

SLIDE 37

One-sided Jacobi algorithms

Conventional Jacobi:

    repeat until converged:  A ← R^T A R,  U ← U R

One-sided Jacobi:

    repeat until converged:  A ← A R,  U ← U R

The one-sided Jacobi algorithm reduces the amount of global communication.

  • P. J. Eberlein, H. Park, J. Parallel and Distrib. Comp. 8 (1990), 358–366

SLIDE 38

Scalability of simultaneous diagonalization

  • One-sided Jacobi simultaneous diagonalization algorithm, tested on:
    – 64-node dual dual-core AMD Opteron/Infinipath cluster (256 CPUs)
    – 1 rack of the ANL BlueGene/L (1024 CPUs)
  • Slight superlinear scaling due to:
    – cache effects
    – size-dependent MPI protocols
    – node mapping

[Figure: speedup t(N_CPU,min)/t(N_CPU) vs. N_CPU for m = 8192, measured and ideal, on BG/L and the AMD Opteron cluster]

SLIDE 39

Large-scale calculations of maximally localized Wannier functions

  • Computation of MLWFs starting from KS eigenvectors
  • Simultaneous diagonalization of 6 (n × n) matrices
    – 64-node (256 CPUs) Opteron platform

  System       n      m        time (s)
  (HfSiO4)32   1216   68723    38
  (H2O)512     4096   534310   440

SLIDE 40

Outline

  • First-Principles simulations
  • Eigenvalue problems in electronic structure calculations
  • Localized representations of solutions and the simultaneous diagonalization problem
  • Data compression through simultaneous diagonalization

SLIDE 41

Finding reduced representations of Kohn-Sham subspaces

  • A KS invariant subspace is represented as an orthogonal (m × n) matrix Y
    – basis set size = m
    – number of solutions = n
  • In general, Y is dense
  • Saving and reloading Y is costly (file size up to 250 GB, grows as O(nm))
  • Data compression is needed for storage of Y

SLIDE 42

Subspace bisection

  • Subspace Bisection
    – divide the simulation domain into two subdomains Ω_L and Ω_R (e.g. "left" and "right")
    – try to find an orthogonal transformation V that makes every function either
      • localized in the left domain Ω_L,
      • localized in the right domain Ω_R, or
      • extended across both domains

SLIDE 43

Subspace Bisection

[Figure: the columns of Y are dense over both Ω_L and Ω_R; after the transformation, the columns of YV are localized in Ω_L, localized in Ω_R, or extended]

SLIDE 44

The CS decomposition

  • A matrix Y having orthogonal columns can be decomposed as

  Y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = \begin{pmatrix} U_1 \Sigma_1 V^{T} \\ U_2 \Sigma_2 V^{T} \end{pmatrix}

  where U_1, U_2, V are orthogonal matrices and Σ_1, Σ_2 carry the diagonal blocks

  C = \mathrm{diag}(c_1,\dots,c_n), \qquad S = \mathrm{diag}(s_1,\dots,s_n), \qquad c_i^2 + s_i^2 = 1

  • Stewart (1982)
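
A sketch of how the transformation V and the coefficients c_i², s_i² can be obtained in practice from an SVD of the Ω_L block of Y; this is equivalent to the CS decomposition when Y has orthonormal columns and the Ω_L block has at least n rows, and the row partition rows_L is an assumed input:

```python
import numpy as np

def cs_bisection_transform(Y, rows_L):
    """CS-decomposition step used for subspace bisection (SVD-based sketch).
    rows_L selects the grid points (rows of Y) belonging to the subdomain Omega_L."""
    Y1 = Y[rows_L, :]                      # block of Y restricted to Omega_L
    _, c, Vh = np.linalg.svd(Y1, full_matrices=False)
    V = Vh.conj().T                        # right factor V (n x n, assuming enough rows)
    YV = Y @ V                             # transformed, still orthonormal, set of columns
    c2 = c**2                              # weight of each column of YV in Omega_L
    s2 = 1.0 - c2                          # remaining weight, in Omega_R
    return YV, c2, s2
```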

SLIDE 45

CS Decomposition

[Figure: Y is dense over Ω_L and Ω_R; the columns of YV have well-defined weights in Ω_L and Ω_R]

SLIDE 46

CS Decomposition

[Figure: columns of YV with their weights in Ω_L and Ω_R]

  c_i^2 = squared norm of column i of YV restricted to Ω_L

  s_i^2 = 1 - c_i^2 = squared norm of column i of YV restricted to Ω_R

SLIDE 47

CS Decomposition

[Figure: classification of the columns of YV]

  c_i^2 < ε  →  localized in Ω_R
  s_i^2 < ε  →  localized in Ω_L
  otherwise  →  extended states

SLIDE 48

Subspace Bisection Algorithm

  1. Choose the acceptable error ε
  2. Perform a CS decomposition of the matrix Y
  3. For each column of YV:
       if c_i² < ε:        state localized in Ω_R (truncate in Ω_L)
       else if s_i² < ε:   state localized in Ω_L (truncate in Ω_R)
       else:               state is extended

Ideal limit: 2-fold data reduction
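
A sketch of this classification step, reusing the hypothetical cs_bisection_transform helper from the CS-decomposition sketch above; the threshold ε and the handling of the index sets are placeholders:

```python
import numpy as np

def bisect_states(Y, rows_L, eps=1e-3):
    """Classify the columns of YV as localized in Omega_L, localized in Omega_R, or extended."""
    YV, c2, s2 = cs_bisection_transform(Y, rows_L)
    loc_R = np.where(c2 < eps)[0]          # negligible weight in Omega_L -> truncate there
    loc_L = np.where(s2 < eps)[0]          # negligible weight in Omega_R -> truncate there
    extended = np.setdiff1d(np.arange(Y.shape[1]), np.concatenate([loc_L, loc_R]))
    return YV, loc_L, loc_R, extended
```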

SLIDE 49

Subspace Bisection Algorithm

  • When does the CS decomposition lead to efficient data compression?
    – in some systems (insulators, where the spectrum of H has a gap) most of the solutions can be localized in Ω_L or Ω_R
    – in other systems (metals, no spectral gap) fewer states can be localized

SLIDE 50

Subspace Bisection Algorithm

[Figure: distribution of the coefficients c_i² when the Y matrix consists of eigenvectors of the Laplacian vs. a random unitary matrix]

  • A. Edelman et al., SIAM J. Sci. Comput. 20, 1094 (1999)

SLIDE 51

Multiple Subspace Bisection

  • The Subspace Bisection algorithm can be applied simultaneously in three directions

SLIDE 52

Multiple Subspace Bisection

  • Each solution is associated with a triplet (c_x², c_y², c_z²)
  • Ideal limit: 8-fold data reduction
  • Simultaneous CS decomposition: requires simultaneous diagonalization of three matrices

SLIDE 53

Applications

  • Bisection statistics with ε = 10⁻³:

                         (H2O)512           (19x0) carbon nanotube (304 atoms)
                         2048 functions     608 functions
  N_1                    48   (2%)          108  (18%)
  N_1/2                  276  (14%)         80   (13%)
  N_1/4                  844  (41%)         183  (30%)
  N_1/8                  880  (43%)         237  (39%)
  Data size reduction    4.03               2.72

SLIDE 54

(H2O)512 states after bisection

[Figure: localized and extended states]

SLIDE 55

CNT(19x0) states after bisection

[Figure: localized and extended states]

SLIDE 56

Distribution of singular values

  • (H2O)512

[Figure: distribution of the triplets (c_x², c_y², c_z²) and of (log c_x²(1−c_x²), log c_y²(1−c_y²), log c_z²(1−c_z²))]

SLIDE 57

Distribution of singular values

  • (19x0) CNT C304

[Figure: distribution of the triplets (c_x², c_y², c_z²) and of (log c_x²(1−c_x²), log c_y²(1−c_y²), log c_z²(1−c_z²))]

SLIDE 58

Distribution of singular values

  • BCC Mo54

[Figure: distribution of the triplets (c_x², c_y², c_z²) and of (log c_x²(1−c_x²), log c_y²(1−c_y²), log c_z²(1−c_z²))]

SLIDE 59

Remaining open issues

  • Uniqueness of the minimum of the spread functionals
  • Exploration of other simultaneous diagonalization methods
  • Recursive implementation of the subspace bisection algorithm

SLIDE 60

Summary

  • Electronic structure calculations involve large invariant subspace computations
  • Extrapolation of invariant subspaces is performed using subspace alignment (Procrustes problem)
  • Localized representations of solutions of the Kohn-Sham equations can be obtained via simultaneous diagonalization
  • A subspace bisection algorithm provides a reduced representation of Kohn-Sham invariant subspaces with controlled accuracy