SLIDE 1

Outline: Motivation, OpenACC compiler directives, The Schrödinger equation, Coupled cluster methods, One- and N-electron expansions, OpenACC implementations, Calculations, Results, Conclusions & outlook, Acknowledgments

S6540 Need for Speed: Accelerating High-Accuracy Quantum Chemistry using OpenACC Directives

Janus J. Eriksen

qLEAP Center for Theoretical Chemistry, Department of Chemistry, Aarhus University, Langelandsgade 140, DK–8000 Aarhus C, Denmark

GPU Technology Conference 2016, San Jose, CA, April 2016

SLIDE 2

Motivation

“Theoretical chemistry could be seen as a bridge from the real physics of the physicists to the real chemistry of the experimental chemists.”
Pekka Pyykkö, Chem. Rev. 112, 1 (2012)

SLIDE 3

Motivation

◮ Domain scientists are forced to care as much about their scientific output as about the implementation of a given method.

◮ Accelerated code should be relatively easy to write (from scratch), extend, and maintain (possibly by others).

◮ Many codes (like ours) are platform-independent, so portability is of key importance.

◮ Any addition of accelerated code must not interfere with the standard compilation process.

Conclusion: OpenACC accelerator directives as an alternative to CUDA C.

SLIDE 8

OpenACC compiler directives

◮ OpenACC-accelerated code is based on the original source.

◮ In turn, this makes the implementation intuitively transparent and thus easier to maintain and extend.

◮ As with OpenMP, OpenACC directives are treated as mere comments by non-accelerating compilers (portability).

◮ The programmer leaves most of the hard work to the compilers (i.e., to their current and future developers).

◮ Accelerated code should preferably be easy and fast to implement (Amdahl’s law).
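
The portability point above can be illustrated with a minimal sketch that is not taken from the talk. The program name, array names, and sizes are all made up for illustration. Built without OpenACC support, the two `!$acc` lines are plain Fortran comments and the loop runs serially on the host; built with an OpenACC-enabled compiler, the same source offloads the loop.

```fortran
program directive_as_comment
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000        ! illustrative problem size
  real(dp) :: x(n), y(n), a
  integer :: i

  a = 2.0_dp
  x = 1.0_dp
  y = 3.0_dp

  ! Without OpenACC support the next directive is an ordinary comment,
  ! so the loop below is compiled as a normal serial loop.
  !$acc parallel loop copyin(x) copy(y)
  do i = 1, n
     y(i) = a * x(i) + y(i)
  end do
  !$acc end parallel loop

  ! Every element should now equal 2*1 + 3 = 5, on host or device.
  if (any(abs(y - 5.0_dp) > 1.0e-12_dp)) error stop "unexpected result"
  print *, "sum(y) =", sum(y)
end program directive_as_comment
```

The built-in check stops the program if the host and accelerated builds ever disagree on the result, which is a cheap way to confirm that adding directives has not changed the numerics.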

SLIDE 13

Quantum chemistry (in 3 slides)

SLIDE 14

The Schrödinger equation

◮ For light elements, the electronic equation of motion is known as the time-independent Schrödinger equation:

$H|\Psi\rangle = E|\Psi\rangle$

where $|\Psi\rangle$ is a multi-electron wave function, and $H(r,R)$ is the electronic Hamiltonian:

$H(r,R) = T_{\text{el}}(r_i) + V_{\text{el-el}}(r_i,r_j) + V_{\text{el-nuc}}(r_i,R_I) + V_{\text{nuc-nuc}}(R_I,R_J)$

◮ Due to the presence of the repulsive $V_{\text{el-el}}(r_i,r_j)$ operator, exact analytical solutions cannot in general be derived.

◮ Thus, approximations have to be invoked.

SLIDE 17

Coupled cluster methods

◮ Distinction between mean-field (Hartree-Fock) and correlated methods ($V_{\text{el-el}}(r_i,r_j)$).

◮ The methods of the coupled cluster hierarchy are predominant.

◮ On moving up through the coupled cluster hierarchy, the accuracy increases, but so does the computational complexity.

◮ Both the number of FLOPs and the required memory increase.

◮ The RI-MP2 and CCSD(T) models are examples of methods at either end of the CC hierarchy (with respect to accuracy and cost).

◮ These models scale as O(N^5) and O(N^7), respectively.
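
To make the scaling statement concrete, here is a short worked example. It assumes N is a generic measure of system size; the doubling scenario is only illustrative and does not come from the talk.

```latex
\[
\frac{(2N)^5}{N^5} = 2^5 = 32,
\qquad
\frac{(2N)^7}{N^7} = 2^7 = 128 .
\]
```

Prefactors aside, doubling the system size therefore raises the formal cost of an O(N^5) model such as RI-MP2 by a factor of 32, and that of an O(N^7) model such as CCSD(T) by a factor of 128.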

SLIDE 22

One- and N-electron expansions

◮ One initially performs a Hartree-Fock calculation, which returns a number of occupied (occ) and virtual (virt) molecular orbitals.

◮ The accuracy and number of these orbitals are governed by the size of the one-electron basis set (cc-pVDZ, cc-pVTZ, cc-pVQZ, etc.).

SLIDE 23

OpenACC implementations

SLIDE 24

Transition from OpenMP to OpenACC

◮ OpenACC adaptations of existing OpenMP pragmas:

!$omp parallel do [...] → !$acc parallel loop [...]

◮ No use of architecture-specific clauses, e.g.:

cache, tile, num_gangs, num_workers, vector_length.

◮ Structured (!$acc data/!$acc end data) and unstructured data regions (!$acc enter/exit data).

◮ Call optimized library routines (CUBLAS/libsci_acc) through:

!$acc host_data use_device (e.g., async BLAS3 routines).

◮ Use of async clauses on parallel and data directives as well as async waits (!$acc wait async(handle)):

  ◮ Synchronize by passing the intrinsic acc_async_sync handle.

◮ Multiple GPUs through a mixture of OpenMP and OpenACC.

◮ In toto: RI-MP2 impl. (∼30 dir.), CCSD(T) impl. (∼100 dir.).
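
As a rough sketch of how the unstructured data regions and async queues listed above fit together (this is not code from the talk), the toy program below queues two kernels on one async queue and waits for them before reading the result. The program name, the array `v`, the queue id `handle`, and the problem size are assumptions chosen for illustration. It uses the plain `!$acc wait(handle)` form of the wait directive.

```fortran
program async_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 4096        ! illustrative problem size
  integer, parameter :: handle = 1      ! hypothetical async queue id
  real(dp), allocatable :: v(:)
  real(dp) :: total
  integer :: i

  allocate(v(n))
  v = 1.0_dp
  total = 0.0_dp

  ! Unstructured data region: v stays on the device until "exit data".
  !$acc enter data copyin(v)

  ! First kernel, queued asynchronously on queue "handle".
  !$acc parallel loop present(v) async(handle)
  do i = 1, n
     v(i) = 2.0_dp * v(i)
  end do
  !$acc end parallel loop

  ! Second kernel on the same queue, so it runs after the first one.
  !$acc parallel loop present(v) copy(total) reduction(+:total) async(handle)
  do i = 1, n
     total = total + v(i)
  end do
  !$acc end parallel loop

  ! Block the host until everything queued on "handle" has finished.
  !$acc wait(handle)
  !$acc exit data delete(v)

  ! With or without an accelerator, the sum should be 2*n.
  if (abs(total - 2.0_dp * real(n, dp)) > 1.0e-9_dp) error stop "unexpected sum"
  print *, "total =", total
  deallocate(v)
end program async_sketch
```

Because both kernels share one queue they execute in order, so the only host-side synchronization needed is the single wait before `total` is read.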

SLIDE 31

RI-MP2 implementation (pseudocode)

    do i = 1,nocc
       do j = i,nocc
          call dgemm(...)   ! g_ij from [C(naux,nvirt,i)]**T and C(naux,nvirt,j)
          eps_ij = e_occ(i) + e_occ(j)
          !$omp parallel do reduction(+:energy)
          do a = 1,nvirt
             eps_ija = eps_ij - e_virt(a)
             do b = a,nvirt
                eps = eps_ija - e_virt(b)
                energy = energy + perm_sym * (g_ij(a,b)**2 + g_ij(b,a)**2 &
                     & - g_ij(a,b)*g_ij(b,a)) / eps
             enddo
          enddo
          !$omp end parallel do
       enddo
    enddo
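
Read directly off the loop above, the quantity accumulated in `energy` can be written as follows. The symbol p stands for the `perm_sym` factor, which the slide does not define and which may depend on the indices. P is an assumed name for the auxiliary index of length `naux`, and the ε symbols denote the entries of `e_occ` and `e_virt`.

```latex
\[
E \;=\; \sum_{i \le j}^{n_{\mathrm{occ}}} \; \sum_{a \le b}^{n_{\mathrm{virt}}}
  p \,
  \frac{ \bigl(g^{ij}_{ab}\bigr)^2 + \bigl(g^{ij}_{ba}\bigr)^2 - g^{ij}_{ab}\, g^{ij}_{ba} }
       { \varepsilon_i + \varepsilon_j - \varepsilon_a - \varepsilon_b },
\qquad
g^{ij}_{ab} \;=\; \sum_{P=1}^{n_{\mathrm{aux}}} C_{P a i}\, C_{P b j} .
\]
```

Each (i,j) pair requires one matrix product of dimensions nvirt × naux × nvirt to form g_ij, which is where the O(N^5) cost of the model comes from and why the dgemm call is the natural first target for offloading.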

SLIDE 32

RI-MP2 implementation (pseudocode)

    !$acc data create(g_ij) copyin(C)
    do i = 1,nocc
       do j = i,nocc
          !$acc host_data use_device(C(:,:,i),C(:,:,j),g_ij)
          stat = cublasDgemm_v2(...)   ! CUBLAS dgemm
          !$acc end host_data
          eps_ij = e_occ(i) + e_occ(j)
          !$acc update host(g_ij)
          !$omp parallel do reduction(+:energy)
          do a = 1,nvirt
             eps_ija = eps_ij - e_virt(a)
             do b = a,nvirt
                eps = eps_ija - e_virt(b)
                energy = energy + perm_sym * (g_ij(a,b)**2 + g_ij(b,a)**2 &
                     & - g_ij(a,b)*g_ij(b,a)) / eps
             enddo
          enddo
          !$omp end parallel do
       enddo
    enddo
    !$acc end data

SLIDE 34

RI-MP2 implementation (pseudocode)

    !$acc data create(g_ij)
    do i = 1,nocc
       !$acc data copyin(C(:,:,i))
       do j = i,nocc
          !$acc data copyin(C(:,:,j))
          !$acc host_data use_device(C(:,:,i),C(:,:,j),g_ij)
          stat = cublasDgemm_v2(...)   ! CUBLAS dgemm
          !$acc end host_data
          eps_ij = e_occ(i) + e_occ(j)
          !$acc update host(g_ij)
          !$omp parallel do reduction(+:energy)
          do a = 1,nvirt
             eps_ija = eps_ij - e_virt(a)
             do b = a,nvirt
                eps = eps_ija - e_virt(b)
                energy = energy + perm_sym * (g_ij(a,b)**2 + g_ij(b,a)**2 &
                     & - g_ij(a,b)*g_ij(b,a)) / eps
             enddo
          enddo
          !$omp end parallel do
          !$acc end data
       enddo
       !$acc end data
    enddo
    !$acc end data

SLIDE 36

RI-MP2 implementation (pseudocode)

    !$acc data create(g_ij) copyin(e_virt)
    do i = 1,nocc
       !$acc data copyin(C(:,:,i))
       do j = i,nocc
          !$acc data copyin(C(:,:,j))
          !$acc host_data use_device(C(:,:,i),C(:,:,j),g_ij)
          stat = cublasDgemm_v2(...)   ! CUBLAS dgemm
          !$acc end host_data
          eps_ij = e_occ(i) + e_occ(j)
          !$acc parallel loop independent present(e_virt,g_ij) &
          !$acc& copyin(eps_ij) copy(energy) reduction(+:energy)
          do a = 1,nvirt
             eps_ija = eps_ij - e_virt(a)
             do b = a,nvirt
                eps = eps_ija - e_virt(b)
                energy = energy + perm_sym * (g_ij(a,b)**2 + g_ij(b,a)**2 &
                     & - g_ij(a,b)*g_ij(b,a)) / eps
             enddo
          enddo
          !$acc end parallel loop
          !$acc end data
       enddo
       !$acc end data
    enddo
    !$acc end data

slide-38
SLIDE 38

RI-MP2 implementation (pseudocode)

num_devices = acc_get_num_devices(acc_get_device_type())
!$omp parallel num_threads(num_devices) reduction(+:energy)
!$acc set device_num(omp_get_thread_num())
!$acc data create(g_ij) copyin(e_virt)
!$omp do schedule(dynamic)
do i = 1,nocc
   !$acc data copyin(C(:,:,i))
   do j = i,nocc
      !$acc data copyin(C(:,:,j))
      !$acc host_data use_device(C(:,:,i),C(:,:,j),g_ij)
      stat = cublasDgemm_v2(...)   ! CUBLAS dgemm
      !$acc end host_data
      eps_ij = e_occ(i) + e_occ(j)
      !$acc parallel loop independent present(e_virt,g_ij) &
      !$acc& copyin(eps_ij) copy(energy) reduction(+:energy)
      do a = 1,nvirt
         eps_ija = eps_ij - e_virt(a)
         do b = a,nvirt
            eps = eps_ija - e_virt(b)
            energy = energy + perm_sym * (g_ij(a,b)**2 + g_ij(b,a)**2 &
                     & - g_ij(a,b)*g_ij(b,a)) / eps
         enddo
      enddo
      !$acc end parallel loop
      !$acc end data
   enddo
   !$acc end data
enddo
!$omp end do
!$acc end data
!$omp end parallel
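
The `schedule(dynamic)` clause on the outer `i` loop matters because the inner loop runs over `j = i,nocc`, so early values of `i` carry more (i,j) pairs than late ones. The Python sketch below is only a toy model of that imbalance. It assumes every pair costs the same and that a dynamic schedule hands the next `i` to whichever device becomes idle first, and it compares the busiest device against a static split into contiguous blocks of `i`. All function names and sizes are hypothetical.

```python
import heapq


def pair_counts(nocc):
    """Number of (i, j) pairs with j >= i for each i = 1..nocc."""
    return [nocc - i + 1 for i in range(1, nocc + 1)]


def static_block_max_load(costs, ndev):
    """Busiest device when i is split into contiguous, equally sized blocks."""
    n = len(costs)
    chunk = -(-n // ndev)  # ceiling division
    return max(sum(costs[k:k + chunk]) for k in range(0, n, chunk))


def dynamic_max_load(costs, ndev):
    """Busiest device when each i, in order, goes to the least-loaded device."""
    loads = [0] * ndev
    heapq.heapify(loads)
    for cost in costs:
        # The device with the smallest accumulated load is idle first.
        heapq.heappush(loads, heapq.heappop(loads) + cost)
    return max(loads)


if __name__ == "__main__":
    costs = pair_counts(40)  # hypothetical: 40 occupied orbitals
    print("total pairs     :", sum(costs))
    print("static, 4 GPUs  :", static_block_max_load(costs, 4))
    print("dynamic, 4 GPUs :", dynamic_max_load(costs, 4))
```

In this toy model with 40 occupied orbitals and 4 devices, the static split leaves the first device with 355 of the 820 pairs, while the dynamic hand-out ends with 205 pairs on every device.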

slide-40
SLIDE 40

RI-MP2 implementation (pseudocode)

do i = 1,nocc
   do j = i,nocc
      call dgemm(...)   ! g_ij from [C(:,:,i)]**T and C(:,:,j)
      eps_ij = e_occ(i) + e_occ(j)
      !$omp parallel do reduction(+:energy)
      do a = 1,nvirt
         eps_ija = eps_ij - e_virt(a)
         do b = a,nvirt
            eps = eps_ija - e_virt(b)
            energy = energy + perm_sym * (g_ij(a,b)**2 + g_ij(b,a)**2 &
                     & - g_ij(a,b)*g_ij(b,a)) / eps
         enddo
      enddo
      !$omp end parallel do
   enddo
enddo
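
As a cross-check of the loop structure above, the following Python sketch (standard library only, not the author's code) evaluates the same expression on small random data. It compares the symmetry-restricted loops (j ≥ i, b ≥ a) against the unrestricted closed-shell sum over all i, j, a and b. The definition `perm_sym = (1 if i == j else 2) * (1 if a == b else 2)` is an assumption inferred from the loop bounds, since the pseudocode does not define it. The storage layout `C[i][P][a]`, the sizes and the random inputs are likewise hypothetical.

```python
import random


def build_g(C_i, C_j):
    """g_ij(a,b) = sum_P C_i[P][a] * C_j[P][b], i.e. C_i^T C_j (the dgemm step)."""
    naux, nvirt = len(C_i), len(C_i[0])
    return [[sum(C_i[P][a] * C_j[P][b] for P in range(naux))
             for b in range(nvirt)] for a in range(nvirt)]


def energy_restricted(C, e_occ, e_virt):
    """Loops over j >= i and b >= a, weighted by the assumed perm_sym factor."""
    nocc, nvirt = len(e_occ), len(e_virt)
    energy = 0.0
    for i in range(nocc):
        for j in range(i, nocc):
            g = build_g(C[i], C[j])
            eps_ij = e_occ[i] + e_occ[j]
            for a in range(nvirt):
                eps_ija = eps_ij - e_virt[a]
                for b in range(a, nvirt):
                    eps = eps_ija - e_virt[b]
                    perm_sym = (1.0 if i == j else 2.0) * (1.0 if a == b else 2.0)
                    energy += perm_sym * (g[a][b] ** 2 + g[b][a] ** 2
                                          - g[a][b] * g[b][a]) / eps
    return energy


def energy_full(C, e_occ, e_virt):
    """Unrestricted sum of g_ab * (2 g_ab - g_ba) / eps over all i, j, a, b."""
    nocc, nvirt = len(e_occ), len(e_virt)
    energy = 0.0
    for i in range(nocc):
        for j in range(nocc):
            g = build_g(C[i], C[j])
            for a in range(nvirt):
                for b in range(nvirt):
                    eps = e_occ[i] + e_occ[j] - e_virt[a] - e_virt[b]
                    energy += g[a][b] * (2.0 * g[a][b] - g[b][a]) / eps
    return energy


def random_inputs(nocc=3, nvirt=4, naux=5, seed=0):
    """Small random test data; occupied energies negative, virtual positive."""
    rng = random.Random(seed)
    C = [[[rng.uniform(-1.0, 1.0) for _ in range(nvirt)]
          for _ in range(naux)] for _ in range(nocc)]
    e_occ = [rng.uniform(-2.0, -0.5) for _ in range(nocc)]
    e_virt = [rng.uniform(0.5, 2.0) for _ in range(nvirt)]
    return C, e_occ, e_virt


if __name__ == "__main__":
    C, e_occ, e_virt = random_inputs()
    print(energy_restricted(C, e_occ, e_virt), energy_full(C, e_occ, e_virt))
```

The two numbers agree to rounding error, which is what lets the production loops skip roughly three quarters of the (i,j,a,b) combinations.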

slide-42
SLIDE 42

Results

slide-43
SLIDE 43

Calculations

◮ α-helix chains of alanine residues, [ala]n.
◮ Basis sets of increasing size: cc-pVXZ (X = D, T, and Q).
◮ CPU:
    ◮ 20-core Intel Ivy Bridge E5-2690 v2 @ 3.00 GHz.
◮ GPUs:
    ◮ K20X < K40 < K80 (∗ autoboost enabled for K80)
◮ Compilation (PGI v. 15.10, CUDA 7.5):
    ◮ ./setup --fc=pgf90 -mp -ta=host,tesla:cc35 -lcuda -Mcuda=7.5 -lcublas

slide-47
SLIDE 47

RI-MP2 results — scaling with system size

◮ Speed-up wrt. CPU impl. ([ala]n / cc-pVDZ / cc-pVDZ-RI):

[Figure: speed-up (wrt. CPU impl.) for the systems [ala]5 to [ala]10; speed-up axis ticks from 2 to 18. Series: K20X-1 to K20X-4, K40-1 to K40-6, K80-1 to K80-4.]

slide-48
SLIDE 48

RI-MP2 results — scaling with number of GPUs

◮ Speed-up wrt. 1 GPU ([ala]n / cc-pVDZ / cc-pVDZ-RI):

[Figure: speed-up (wrt. 1 GPU) for the systems [ala]5 to [ala]10; speed-up axis ticks from 1 to 7. Series: K20X-2 to K20X-4, K40-2 to K40-6, K80-2 to K80-4.]
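
The speed-up plotted on this slide is presumably the usual ratio of the single-GPU wall time to the n-GPU wall time; dividing it by n gives the parallel efficiency. A minimal Python sketch of both quantities follows. The function names are hypothetical, and the timings in the usage lines are made up for illustration, not taken from the talk.

```python
def speedup(t_ref, t_new):
    """Ratio of a reference wall time to a new wall time (both in the same unit)."""
    if t_ref <= 0 or t_new <= 0:
        raise ValueError("wall times must be positive")
    return t_ref / t_new


def parallel_efficiency(t_one_gpu, t_n_gpus, n_gpus):
    """Speed-up relative to one GPU, divided by the number of GPUs used."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    return speedup(t_one_gpu, t_n_gpus) / n_gpus


if __name__ == "__main__":
    # Made-up timings in seconds, for illustration only.
    print(speedup(120.0, 30.0))
    print(parallel_efficiency(120.0, 30.0, 6))
```

An efficiency of 1 corresponds to ideal linear scaling with the number of GPUs.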

slide-49
SLIDE 49

RI-MP2 results — scaling with one-electron expansion

◮ Speed-up wrt. CPU impl. ([ala]5 / cc-pVXZ / cc-pVXZ-RI):

[Figure: speed-up (wrt. CPU impl.) against the number of K40 GPUs (K40-1 to K40-6); speed-up axis ticks from 2 to 18. Series: cc-pVDZ, cc-pVTZ, cc-pVQZ.]

slide-50
SLIDE 50

CCSD(T) results — scaling with system size

◮ Speed-up wrt. CPU impl. ([ala]n / cc-pVDZ):

[Figure: speed-up (wrt. CPU impl.) for the systems [ala]1 to [ala]4; speed-up axis ticks from 2 to 18. Series: K20X-1 to K20X-4, K40-1 to K40-6, K80-1 to K80-4.]

slide-51
SLIDE 51

CCSD(T) results — scaling with number of GPUs

◮ Speed-up wrt. 1 GPU ([ala]n / cc-pVDZ):

[Figure: speed-up (wrt. 1 GPU) for the systems [ala]1 to [ala]4; speed-up axis ticks from 1 to 7. Series: K20X-2 to K20X-4, K40-2 to K40-6, K80-2 to K80-4.]

slide-52
SLIDE 52

CCSD(T) results — scaling with one-electron expansion

◮ Speed-up wrt. CPU impl. ([ala]1 / cc-pVXZ):

[Figure: speed-up (wrt. CPU impl.) against the number of K40 GPUs (K40-1 to K40-6); speed-up axis ticks from 2 to 18. Series: cc-pVDZ, cc-pVTZ, cc-pVQZ.]

slide-53
SLIDE 53

Conclusions & outlook

◮ Efficient acceleration of quantum chemical many-body methods using OpenACC compiler accelerator directives.
◮ The implementations are transparent and portable as they are formulated on top of existing CPU implementations.
◮ Initial OpenACC support in gcc (larger user base).
◮ Even more advanced and powerful GPUs (Maxwell, Volta).
◮ Unified memory (one single block of memory).
◮ NVLink interconnect (5–12x over present-day PCIe).


“Never, ever trust the compiler to do the right thing.” — John Levesque, Cray

slide-56
SLIDE 56

Acknowledgments

◮ Prof. Poul Jørgensen and Dr. Thomas Kjærgaard (Aarhus).
◮ Oak Ridge Leadership Computing Facility (CAAR).
◮ Mark Berger (NVIDIA).
◮ NVIDIA/PGI:
    ◮ Jeff Larkin.
    ◮ Brent Leback and Michael Wolfe.
◮ Cray:
    ◮ John Levesque and Aaron Vose.
◮ OLCF:
    ◮ Tjerk Straatsma and Dmitry Liakh.