

SLIDE 1

Leveraging modern supercomputing infrastructure for tensor contractions in large electronic-structure calculations

Ilya A. Kaliman

University of Southern California

September 18-19, 2017

SLIDE 2

Tensors in Quantum Chemistry

Ĥψ = Eψ

Coupled Cluster Equations
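The coupled-cluster equations behind this slide can be sketched as follows (this is the standard CCSD ansatz from the literature, not taken verbatim from the slides):

```latex
|\Psi\rangle = e^{\hat{T}}|\Phi_0\rangle, \qquad
\hat{T} = \hat{T}_1 + \hat{T}_2, \qquad
\hat{T}_2 = \tfrac{1}{4}\sum_{ijab} t_{ij}^{ab}\,
  \hat{a}_a^\dagger \hat{a}_b^\dagger \hat{a}_j \hat{a}_i
```

The doubles amplitudes t_ij^ab are exactly the kind of 4-index tensors the next slide discusses.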

SLIDE 3

Tensors in Quantum Chemistry

  • Tensors of floating-point numbers are used extensively in high-level electronic-structure calculations
  • 4-index tensors are common in Coupled Cluster methods
  • Contractions are the most expensive step
  • Complex structure of tensors – must use symmetry and sparsity
  • Huge data size (many terabytes)
  • Large calculations can take weeks
SLIDE 4

Q-Chem Quantum Chemistry Package

  • ccman2 – Coupled Cluster module of Q-Chem
  • libtensor (frontend)
  • libcc – library of CC equations
  • Backends: native, libxm (this work), CTF

SLIDE 5

Data storage using block tensors

  • Permutational symmetry: a_ji = −a_ij
  • Spin symmetry
  • Molecular point-group symmetry
  • Canonical tensor blocks
  • Non-canonical blocks (computed from canonical blocks)
  • Zero blocks

SLIDE 6

Block tensor operations

Contractions and additions

[Block-matrix diagrams: a contraction C = A ⊗ B and an addition C = A + B over tensor blocks]

  • Only non-zero canonical blocks (orange) need to be computed
  • Blocks can be computed independently in parallel

C11 = A11 ⊗ B11 + A21 ⊗ B12
C12 = A12 ⊗ B11 + A22 ⊗ B12

Each ⊗ is an unfolding followed by a BLAS/BLIS matrix multiply.

SLIDE 7

Calculations on a single node

[Diagram: canonical tensor blocks held in shared memory, accessed by multiple CPUs]

SLIDE 8

Calculations on a supercomputer

[Diagram: multiple compute nodes accessing canonical tensor blocks on a shared filesystem]

Can this scale?

SLIDE 9

Calculations on a supercomputer

[Diagram: compute nodes with a fast local cache (SSD, etc.) between them and the shared filesystem holding the canonical tensor blocks]

Can this scale? It can! (with a fast cache)

SLIDE 10

BurstBuffer on NERSC Cori

http://www.nersc.gov/users/computational-systems/cori/burst-buffer/burst-buffer/

6.5 GB/s read/write bandwidth

SLIDE 11

Implementation and benchmarks: libxm

  • libxm is a library of primitive tensor operations:

    xm_contract(1.0, A, B, 2.0, C, "abcd", "ijcd", "ijab");
    xm_add(1.0, A, 2.0, B, "ij", "ji");
    ...

  • Main components:
    – MPI-aware disk-backed memory allocator
    – Code for tensor operations
    – Auxiliary routines
  • Stores all data on disk
  • Hybrid MPI/OpenMP parallel design:
    – Static load balancing between the nodes (MPI)
    – Dynamic load balancing within a node (OpenMP)
  • https://github.com/ilyak/libxm
SLIDE 12

Libxm parallel scaling on Cori

Total tensor data size is over 2 TB; times are in seconds, with speedup relative to one node in parentheses.

SLIDE 13

Conclusions

  • A new distributed-parallel model for tensor operations is implemented in the libxm library
  • The shared filesystem is used as common inter-node storage for tensors
  • Data size is not limited by the amount of RAM or the number of nodes
  • The hybrid MPI/OpenMP parallel code shows excellent scaling when adequate data caching is employed

SLIDE 14

Thank you!

  • Acknowledgments

    – Prof. Anna Krylov, USC
    – Dr. Evgeny Epifanovsky, Q-Chem

https://github.com/ilyak/libxm