

SLIDE 1

Leveraging modern supercomputing infrastructure for tensor contractions in large electronic-structure calculations

Ilya A. Kaliman

University of Southern California

September 18-19, 2017

SLIDE 2

Tensors in Quantum Chemistry

Ĥψ = Eψ

Coupled Cluster Equations
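The coupled-cluster equations behind this slide can be sketched as follows (this is the standard CCSD ansatz from the literature, not taken verbatim from the slides):

```latex
|\Psi\rangle = e^{\hat{T}}|\Phi_0\rangle, \qquad
\hat{T} = \hat{T}_1 + \hat{T}_2, \qquad
\hat{T}_2 = \tfrac{1}{4}\sum_{ijab} t_{ij}^{ab}\,
  \hat{a}_a^\dagger \hat{a}_b^\dagger \hat{a}_j \hat{a}_i
```

The doubles amplitudes t_ij^ab are exactly the kind of 4-index tensors the next slide discusses.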

SLIDE 3

Tensors in Quantum Chemistry

  • Tensors of floating-point numbers are used extensively in high-level electronic-structure calculations
  • 4-index tensors are common in Coupled Cluster methods
  • Contractions are the most expensive step
  • Complex structure of tensors – must use symmetry and sparsity
  • Huge data size (many terabytes)
  • Large calculations can take weeks
SLIDE 4

Q-Chem Quantum Chemistry Package

  • ccman2 – Coupled Cluster module of Q-Chem
  • libtensor (frontend)
  • libcc – library of CC equations
  • Backends: native, libxm (this work), CTF

SLIDE 5

Data storage using block tensors

  • Permutational symmetry: a_ji = −a_ij
  • Spin symmetry
  • Molecular point-group symmetry
  • Canonical tensor blocks
  • Non-canonical blocks (computed from canonical blocks)
  • Zero blocks

SLIDE 6

Block tensor operations

Contractions and additions

[Block-matrix diagrams: a contraction C = A ⊗ B and an addition C = A + B over tensor blocks]

  • Only non-zero canonical blocks (orange) need to be computed
  • Blocks can be computed independently in parallel

C11 = A11 ⊗ B11 + A21 ⊗ B12
C12 = A12 ⊗ B11 + A22 ⊗ B12

Each ⊗ is an unfolding followed by a BLAS/BLIS matrix multiply.

SLIDE 7

Calculations on a single node

[Diagram: canonical tensor blocks held in shared memory, accessed by multiple CPUs]

SLIDE 8

Calculations on a supercomputer

[Diagram: multiple compute nodes accessing canonical tensor blocks on a shared filesystem]

Can this scale?

SLIDE 9

Calculations on a supercomputer

[Diagram: compute nodes with a fast local cache (SSD, etc.) between them and the shared filesystem holding the canonical tensor blocks]

Can this scale? It can! (with a fast cache)

SLIDE 10

BurstBuffer on NERSC Cori

http://www.nersc.gov/users/computational-systems/cori/burst-buffer/burst-buffer/

6.5 GB/s read/write bandwidth

SLIDE 11

Implementation and benchmarks: libxm

  • libxm is a library of primitive tensor operations:

    xm_contract(1.0, A, B, 2.0, C, "abcd", "ijcd", "ijab");
    xm_add(1.0, A, 2.0, B, "ij", "ji");
    ...

  • Main components:
    – MPI-aware disk-backed memory allocator
    – Code for tensor operations
    – Auxiliary routines
  • Stores all data on disk
  • Hybrid MPI/OpenMP parallel design:
    – Static load balancing between the nodes (MPI)
    – Dynamic load balancing within a node (OpenMP)
  • https://github.com/ilyak/libxm
SLIDE 12

Libxm parallel scaling on Cori

Total tensor data size is over 2 TB; times are in seconds, with speedup relative to one node in parentheses.

SLIDE 13

Conclusions

  • A new distributed-parallel model for tensor operations is implemented in the libxm library
  • The shared filesystem is used as common inter-node storage for tensors
  • Data size is not limited by the amount of RAM or the number of nodes
  • The hybrid MPI/OpenMP parallel code shows excellent scaling when adequate data caching is employed

SLIDE 14

Thank you!

  • Acknowledgments

    – Prof. Anna Krylov, USC
    – Dr. Evgeny Epifanovsky, Q-Chem

https://github.com/ilyak/libxm