From math to peta-app: Challenges to practical computation with - - PowerPoint PPT Presentation
From math to peta-app: Challenges to practical computation with - - PowerPoint PPT Presentation
From math to peta-app: Challenges to practical computation with tensor-based algorithms Robert J. Harrison harrisonrj@ornl.gov robert.harrison@utk.edu Outline Why tensors? Many-body physics Low-rank approximation for operators
Outline
- Why tensors?
– Many-body physics – Low-rank approximation for operators & functions
- Math issues
– Operators: accurate representations without an underlying integral form – Functions: efficient and accurate computation with local/global low-rank approximations
- C/S issues
– Generating efficient code for modern computers – High-level composition
The Electronic Schrödinger Equation
- A 2nd-order, linear, partial differential
equation in 3N dimensions (N electrons)
H r=Er H r ,t=i d r ,t dt H =−1 2 ∑
i
∇i
2−∑ ∑ i
Z ∣r−ri∣∑
i j
1 ∣ri−r j∣
Simulation capability for molecular science
Broad range of molecules, including biomolecules, nanoparticles and heavy elements Electronic structure of molecules (non-relativistic, relativistic, ECPs, first and second derivatives) Solid state capability (DFT plane-wave, CPMD) Molecular dynamics, molecular mechanics
Emphasis on modularity and portability. Freely distributed Performance characteristics – designed for MPP Portable – runs on a wide range of computers
NWChem Overview
Synthesis of High Performance Algorithms for Electronic Structure Calculations
http://www.cis.ohio-state.edu/~gb/TCE
- Collaboration between DOE/SciDAC, NSF/ITR and ORNL/LDRD
- Objective: develop a high level programming tool that translates many-body
quantum theory into efficient massively parallel codes. This is anticipated to revolutionize the rate of progress in this field by eliminating man-years of programming effort.
- NSF Project:
- Sadayappan (PI), Baumgartner, Cociorva, Pitzer
(OSU) Bernholdt, Harrison (unfunded) (ORNL) Ramanujam (LSU) Nooijen (Waterloo)
- DOE SciDAC: Harrison (PI), Hirata (PNNL)
- DOE ORNL/LDRD: Bernholdt (PI, 2002-3)
- Other SciDAC projects adopting this tool: Piecuch, Gordon
- Also being applied to nuclear physics (Bernholdt and Dean)
CCSD Doubles Equation
hbar[a,b,i,j] == sum[f[b,c]*t[i,j,a,c],{c}] -sum[f[k,c]*t[k,b]*t[i,j,a,c],{k,c}] +sum[f[a,c]*t[i,j,c,b],{c}] -sum[f[k,c]*t[k,a]*t[i,j,c,b],{k,c}]
- sum[f[k,j]*t[i,k,a,b],{k}] -sum[f[k,c]*t[j,c]*t[i,k,a,b],{k,c}] -sum[f[k,i]*t[j,k,b,a],{k}] -sum[f[k,c]*t[i,c]*t[j,k,b,a],{k,c}]
+sum[t[i,c]*t[j,d]*v[a,b,c,d],{c,d}] +sum[t[i,j,c,d]*v[a,b,c,d],{c,d}] +sum[t[j,c]*v[a,b,i,c],{c}] -sum[t[k,b]*v[a,k,i,j],{k}] +sum[t[i,c]*v[b,a,j,c],{c}] -sum[t[k,a]*v[b,k,j,i],{k}] -sum[t[k,d]*t[i,j,c,b]*v[k,a,c,d],{k,c,d}] -sum[t[i,c]*t[j,k,b,d]*v[k,a,c,d], {k,c,d}] -sum[t[j,c]*t[k,b]*v[k,a,c,i],{k,c}] +2*sum[t[j,k,b,c]*v[k,a,c,i],{k,c}] -sum[t[j,k,c,b]*v[k,a,c,i],{k,c}]
- sum[t[i,c]*t[j,d]*t[k,b]*v[k,a,d,c],{k,c,d}] +2*sum[t[k,d]*t[i,j,c,b]*v[k,a,d,c],{k,c,d}] -sum[t[k,b]*t[i,j,c,d]*v[k,a,d,c],{k,c,d}]
- sum[t[j,d]*t[i,k,c,b]*v[k,a,d,c],{k,c,d}] +2*sum[t[i,c]*t[j,k,b,d]*v[k,a,d,c],{k,c,d}] -sum[t[i,c]*t[j,k,d,b]*v[k,a,d,c],{k,c,d}]
- sum[t[j,k,b,c]*v[k,a,i,c],{k,c}] -sum[t[i,c]*t[k,b]*v[k,a,j,c],{k,c}] -sum[t[i,k,c,b]*v[k,a,j,c],{k,c}]
- sum[t[i,c]*t[j,d]*t[k,a]*v[k,b,c,d],{k,c,d}] -sum[t[k,d]*t[i,j,a,c]*v[k,b,c,d],{k,c,d}] -sum[t[k,a]*t[i,j,c,d]*v[k,b,c,d],{k,c,d}]
+2*sum[t[j,d]*t[i,k,a,c]*v[k,b,c,d],{k,c,d}] -sum[t[j,d]*t[i,k,c,a]*v[k,b,c,d],{k,c,d}] -sum[t[i,c]*t[j,k,d,a]*v[k,b,c,d],{k,c,d}]
- sum[t[i,c]*t[k,a]*v[k,b,c,j],{k,c}] +2*sum[t[i,k,a,c]*v[k,b,c,j],{k,c}] -sum[t[i,k,c,a]*v[k,b,c,j],{k,c}]
+2*sum[t[k,d]*t[i,j,a,c]*v[k,b,d,c],{k,c,d}] -sum[t[j,d]*t[i,k,a,c]*v[k,b,d,c],{k,c,d}] -sum[t[j,c]*t[k,a]*v[k,b,i,c],{k,c}]
- sum[t[j,k,c,a]*v[k,b,i,c],{k,c}] -sum[t[i,k,a,c]*v[k,b,j,c],{k,c}] +sum[t[i,c]*t[j,d]*t[k,a]*t[l,b]*v[k,l,c,d],{k,l,c,d}]
- 2*sum[t[k,b]*t[l,d]*t[i,j,a,c]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[k,a]*t[l,d]*t[i,j,c,b]*v[k,l,c,d],{k,l,c,d}]
+sum[t[k,a]*t[l,b]*t[i,j,c,d]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[j,c]*t[l,d]*t[i,k,a,b]*v[k,l,c,d],{k,l,c,d}]
- 2*sum[t[j,d]*t[l,b]*t[i,k,a,c]*v[k,l,c,d],{k,l,c,d}] +sum[t[j,d]*t[l,b]*t[i,k,c,a]*v[k,l,c,d],{k,l,c,d}]
- 2*sum[t[i,c]*t[l,d]*t[j,k,b,a]*v[k,l,c,d],{k,l,c,d}] +sum[t[i,c]*t[l,a]*t[j,k,b,d]*v[k,l,c,d],{k,l,c,d}]
+sum[t[i,c]*t[l,b]*t[j,k,d,a]*v[k,l,c,d],{k,l,c,d}] +sum[t[i,k,c,d]*t[j,l,b,a]*v[k,l,c,d],{k,l,c,d}] +4*sum[t[i,k,a,c]*t[j,l,b,d]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[i,k,c,a]*t[j,l,b,d]*v[k,l,c,d],{k,l,c,d}]
- 2*sum[t[i,k,a,b]*t[j,l,c,d]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[i,k,a,c]*t[j,l,d,b]*v[k,l,c,d],{k,l,c,d}] +sum[t[i,k,c,a]*t[j,l,d,b]*v[k,l,c,d],
{k,l,c,d}] +sum[t[i,c]*t[j,d]*t[k,l,a,b]*v[k,l,c,d],{k,l,c,d}] +sum[t[i,j,c,d]*t[k,l,a,b]*v[k,l,c,d],{k,l,c,d}]
- 2*sum[t[i,j,c,b]*t[k,l,a,d]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[i,j,a,c]*t[k,l,b,d]*v[k,l,c,d],{k,l,c,d}] +sum[t[j,c]*t[k,b]*t[l,a]*v[k,l,c,i],
{k,l,c}] +sum[t[l,c]*t[j,k,b,a]*v[k,l,c,i],{k,l,c}] -2*sum[t[l,a]*t[j,k,b,c]*v[k,l,c,i],{k,l,c}] +sum[t[l,a]*t[j,k,c,b]*v[k,l,c,i],{k,l,c}]
- 2*sum[t[k,c]*t[j,l,b,a]*v[k,l,c,i],{k,l,c}] +sum[t[k,a]*t[j,l,b,c]*v[k,l,c,i],{k,l,c}] +sum[t[k,b]*t[j,l,c,a]*v[k,l,c,i],{k,l,c}]
+sum[t[j,c]*t[l,k,a,b]*v[k,l,c,i],{k,l,c}] +sum[t[i,c]*t[k,a]*t[l,b]*v[k,l,c,j],{k,l,c}] +sum[t[l,c]*t[i,k,a,b]*v[k,l,c,j],{k,l,c}]
- 2*sum[t[l,b]*t[i,k,a,c]*v[k,l,c,j],{k,l,c}] +sum[t[l,b]*t[i,k,c,a]*v[k,l,c,j],{k,l,c}] +sum[t[i,c]*t[k,l,a,b]*v[k,l,c,j],{k,l,c}]
+sum[t[j,c]*t[l,d]*t[i,k,a,b]*v[k,l,d,c],{k,l,c,d}] +sum[t[j,d]*t[l,b]*t[i,k,a,c]*v[k,l,d,c],{k,l,c,d}] +sum[t[j,d]*t[l,a]*t[i,k,c,b]*v[k,l,d,c],{k,l,c,d}] -2*sum[t[i,k,c,d]*t[j,l,b,a]*v[k,l,d,c],{k,l,c,d}]
- 2*sum[t[i,k,a,c]*t[j,l,b,d]*v[k,l,d,c],{k,l,c,d}] +sum[t[i,k,c,a]*t[j,l,b,d]*v[k,l,d,c],{k,l,c,d}] +sum[t[i,k,a,b]*t[j,l,c,d]*v[k,l,d,c],
{k,l,c,d}] +sum[t[i,k,c,b]*t[j,l,d,a]*v[k,l,d,c],{k,l,c,d}] +sum[t[i,k,a,c]*t[j,l,d,b]*v[k,l,d,c],{k,l,c,d}] +sum[t[k,a]*t[l,b]*v[k,l,i,j], {k,l}] +sum[t[k,l,a,b]*v[k,l,i,j],{k,l}] +sum[t[k,b]*t[l,d]*t[i,j,a,c]*v[l,k,c,d],{k,l,c,d}] +sum[t[k,a]*t[l,d]*t[i,j,c,b]*v[l,k,c,d], {k,l,c,d}] +sum[t[i,c]*t[l,d]*t[j,k,b,a]*v[l,k,c,d],{k,l,c,d}] -2*sum[t[i,c]*t[l,a]*t[j,k,b,d]*v[l,k,c,d],{k,l,c,d}] +sum[t[i,c]*t[l,a]*t[j,k,d,b]*v[l,k,c,d],{k,l,c,d}] +sum[t[i,j,c,b]*t[k,l,a,d]*v[l,k,c,d],{k,l,c,d}] +sum[t[i,j,a,c]*t[k,l,b,d]*v[l,k,c,d], {k,l,c,d}] -2*sum[t[l,c]*t[i,k,a,b]*v[l,k,c,j],{k,l,c}] +sum[t[l,b]*t[i,k,a,c]*v[l,k,c,j],{k,l,c}] +sum[t[l,a]*t[i,k,c,b]*v[l,k,c,j],{k,l,c}] +v[a,b,i,j]
i j
ab=〈
a b i j∣e
− T1− T 2
H e
T 1 T 2∣0〉
range V = 3000; range O = 100; index a,b,c,d,e,f : V; index i,j,k,l : O; mlimit = 100GB; procedure P(in A[V,V,O,O], in B[V,V,V,O], in C[V,V,O,O], in D[V,V,V,O],
- ut S[V,V,O,O])=
begin S[a,b,i,j] == sum[ A[a,c,i,k] * B[b,e,f,l] * C[d,f,j,k] * D[c,d,e,l], {c,e,f,k,l}]; end
Tensor Contraction Engine (TCE)
- High-level domain-specific language for a
class of problems in quantum chemistry/physics based on contraction of large multi-dimensional tensors
- Specialized optimizing compiler
– Produces F77+GA code, linked to runtime libs
Sabij=∑
cefkl
Aacik Bbefl CdfjkDcdel
CCSD(T) run on Cray XT5 : 18 water
(H2O)18 54 atoms 918 basis functions Cc-pvtz(-f) basis
FP performance at 90K cores: 358 TFlops
- E. Apra, ORNL
CCSD(T) run on Cray XT5 : 20 water
(H2O)20 60 atoms 1020 basis functions Cc-pvtz(-f) basis
Floating-Point performance at 92K cores: 475 TFlops Efficiency > 50%
- E. Apra, ORNL
CSGF June 2008 11
Multiresolution Adaptive Numerical Scientific Simulation
Ariana Beste1, George I. Fann1, Robert J. Harrison1,2, Rebecca Hartman-Baker1, Judy Hill1
1Oak Ridge National Laboratory
2University of Tennessee, Knoxville
In collaboration with Gregory Beylkin4, Fernando Perez4, Lucas Monzon4, Martin Mohlenkamp5 and others
4University of Colorado 5Ohio University
harrisonrj@ornl.gov
CSGF June 2008 12
Funding
- DOE Office of Science, SciDAC, divisions of Advanced
Scientific Computing Research and Basic Energy Science, under contract DE-AC05-00OR22725 with Oak Ridge National Laboratory, in part using the National Center for Computational Sciences.
- NSF CHE 0625598: Cyber-infrastructure and Research
Facilities: Chemical Computations on Future High-end Computers
- NSF CNS-0509410: CAS-AES: An integrated framework
for compile-time/run-time support for multi-scale applications on high-end systems
- DARPA HPCS2: HPCS programming language
evaluation
CSGF June 2008 13
MADNESS objectives
- Scaling to 1+M processors ASAP
- High-level composition (matlab-like) for accurate
and efficient solvers – Targeting graduate students in physical science
- Correct scaling of cost with system size
– Fast algorithms with guaranteed precision
- Broad relevance
– Applications now in chemistry, atomic physics, fusion, material science, nuclear physics http://code.google.com/p/m-a-d-n-e-s-s
CSGF June 2008 14
High-level composition
- Close to the physics
- peratorT op = CoulombOperator(k, rlo, thresh);
functionT rho = psi*psi; double twoe = inner(apply(op,rho),rho); double pe = 2.0*inner(Vnuc*psi,psi); double ke = 0.0; for (int axis=0; axis<3; axis++) { functionT dpsi = diff(psi,axis); ke += inner(dpsi,dpsi); } double energy = ke + pe + twoe;
E=〈∣−1 2 ∇
2V∣〉∫ 2x
1 ∣x−y∣
2 ydx dy
CSGF June 2008 15
Essential techniques for fast computation
- Multiresolution
- Low-separation
rank
- Low-operator
rank
V 0⊂V 1⊂⋯⊂V n V n=V 0V 1−V 0⋯V n−V n−1
f x1,, xn=∑
l=1 M
l∏
i=1 d
f i
l xiO
∥ f i
l∥2=1
l0
A=∑
=1 r
u v
TO
0 v
T v=u T u=
Integral Formulation
- Solving the integral equation
– Eliminates the derivative operator and related “issues” – Converges as fixed point iteration with no preconditioner
2 1 2 1 2 2
2 2 2 * * ( ) ( ) in 3D ; 2 4
k r s
V E E V G V e G f r ds f s k E r s
Such Green’s Functions (bound state Helmholtz, Poisson) can be rapidly and accurately applied with a single, sparse matrix vector product.
CSGF June 2008 18
Separated form for integral operators
- Approach
– Represent the kernel over a finite range as a sum of products of 1-D operators (often, not always, Gaussian) – Only need compute 1D transition matrices (X,Y,Z) – SVD the 1-D operators (low rank away from singularity) – Apply most efficient choice of low/full rank 1-D operator – Even better algorithms not yet implemented
T∗ f =∫ ds K r−s f s
rii' , j j' , k k '
n, l−l'
=∑
=0 M
X ii'
n , lx−l' xY j j' n, l y−l' y Z k k ' n, l z−l' zO
CSGF June 2008 19
Accurate Quadratures
- Trapezoidal quadrature
– Geometric precision for periodic functions with sufficient smoothness
- Beylkin & Monzon
– Further reductions, but not automatic
The kernel for x=1e-4,1e-3,1e-2,1e-,1e0. The curve for x=1e-4 is the rightmost
e
−r
r = 2
∫
∞
e
−x
2t 2− 2/4t 2
dt = 2
∫
−∞ ∞
e
−x
2e 2s− 2e −2 s/4sds
02/20/09 Robert J. Harrison, UT/ORNL 20
Computational kernels
- Discontinuous spectral element
– In each “box” a tensor product of coefficients – Most operations are small matrix-multiplication – Exploit low operator-rank of operator matrices
- Leads to irregular dimensions and shapes
– Typical matrix dimensions are 2 to 30 – E.g., (20,400)T * (20,20)
ri' j' k '=∑
i j k
si jk cii' c j j' ck k '=∑
k ∑ j ∑ i
si jk cii'c j j'ck k ' ⇒r=s
T c T c T c
02/20/09 Robert J. Harrison, UT/ORNL 21
Comparison with MKL, Goto, ATLAS on Intel Xeon 5355 for (20,400)T*(20,n)
(ratio of speeds mtxmq/other ... value > 1 implies mtxmq is faster) n MKL Goto ATLAS n MKL Goto ATLAS 2 6.25 4.1667 5 16 0.8966 1.2581 2.0708 4 3.1042 3.6341 4.6563 18 1.7763 1.3636 2.4545 6 4.375 2.625 5.122 20 0.9556 1.2727 2.6168 8 1.3132 2.0427 5.1957 22 1.6416 1.2968 2.7308 10 2.7368 1.9549 5.3061 24 0.9638 1.2208 1.9664 12 1.0605 1.5843 2.4352 26 1.5337 1.2814 2.1295 14 2.0323 1.4737 2.1356 28 0.8411 1.0588 2.0301
02/20/09 Robert J. Harrison, UT/ORNL 22
XT5 single core FLOPs/cycle (AMD Barcelona)
(nj, ni)T*(nj,nk) ni
nj nk MTXMQ ACML
400 2 20 2.55 0.95 400 4 20 2.62 1.50 400 6 20 2.60 1.79 400 8 20 2.56 2.02 400 10 20 2.58 2.12 400 12 20 2.64 2.27 400 14 20 2.90 2.35 400 16 20 2.80 2.46 400 18 20 2.74 2.49 400 20 20 2.89 2.58 nested transform (nj, ni)T*(nj,nk) ni
nj nk MTXMQ ACML
4 2 2 0.10 0.07 16 4 4 1.04 0.51 36 6 6 1.74 0.99 64 8 8 2.33 1.56 100 10 10 2.61 1.80 144 12 12 2.69 2.12 196 14 14 2.94 2.17 256 16 16 2.97 2.41 324 18 18 2.93 2.38 400 20 20 3.03 2.49 484 22 22 3.01 2.52 576 24 24 3.09 2.73 676 26 26 3.02 2.73 784 28 28 2.87 2.87 900 30 30 2.88 2.81
L2 cache is 512Kb = 2*32^3 doubles
- hence expect good multi-core scaling
- confirmed linear scaling to 8 cores for nested transformation
CSGF June 2008 23
Electron correlation
- All defects in the mean-field model are ascribed to
electron correlation
- Singularities in Hamiltonian imply for a two-electron atom
- Include the inter-electron distance in the wavefunction
– E.g., Hylleraas 1938 wavefunction for He – Potentially very accurate, but not systematically improvable, and (until recently) not computationally feasible for many-electron systems
- Configuration interaction expansion – slowly convergent
r1 r2 r12
r1,r2, r12=11 2 r12⋯ as r120 r1,r2, r12=exp−r1r21a r12⋯
r1,r2,=∑
i
ci∣1
ir12 ir2∣
CSGF June 2008 24
x y
|x-y| |x-y| x-y |x-y| y-x |x-y| |x-y| |x-y| |x-y| y-x x-y y-x x-y
In 3D, ideally must be one box removed from the diagonal Diagonal box has full rank Boxes touching diagonal (face, edge,
- r corner) have
increasingly low rank Away from diagonal r = O(-log ) r = separation rank
∣x− y∣=∑
=1 r
f xg y
Partitioned SVD representation
25
Preliminary results for He atom
Variational E ∆E residual HF
- 2.861 61
- Iter. 0
- 2.871 08
0.414 73 1
- 2.894 92
- 0.023 84
0.017 28 2
- 2.900 43
- 0.005 51
0.007 94 3
- 2.902 18
- 0.001 75
0.003 84 4
- 2.902 88
- 0.000 70
0.002 02 5
- 2.903 20
- 0.000 32
0.001 25 6
- 2.903 39
- 0.000 20
0.000 91 … … … … 12
- 2.903 73
- 0.000 04
0.000 36 13
- 2.903 73
+0.000 004 0.000 32 14
- 2.903 77
- 0.000 04
0.000 28 Computational details:
- 5-th order multiwavelets
- Wavelet threshold: 210-5
- SVD threshold: 210-6
- Exponential correlation factor
Perturbative wavefunction:
- Maximum refinement: n=4
- Memory: 132M in full
partitioned SVD form ~10GB without SCD Energy is variational (small non-variational is just truncation err) exact
- 2.903 74 (E(HF)=-2.861 68)
Hylleraas (6 terms)
- 2.903 24
Löwdin and Redei
- 2.895 4
cc-pV6Z
- 2.903 48 (FCI) (E(HF)= -2.861 67)
02/20/09 Robert J. Harrison, UT/ORNL 26
MADNESS parallel runtime MPI Global Arrays ARMCI GPC/GASNET MADNESS math and numerics MADNESS applications – chemistry, physics, nuclear, ...
MADNESS architecture
Intel Thread Building Blocks being considered for multicore
02/20/09 Robert J. Harrison, UT/ORNL 27
Runtime Objectives
- Scalability to 1+M processors ASAP
- Runtime responsible for
- scheduling and placement,
- managing data dependencies,
- hiding latency, and
- Medium to coarse grain concurrency
- Compatible with existing models
- MPI, Global Arrays
- Borrow successful concepts from
Cilk, Charm++, Python
- Anticipating next gen. languages
02/20/09 Robert J. Harrison, UT/ORNL 28
Key elements
- Futures for hiding latency and
automating dependency management
- Global names and name spaces
- Non-process centric computing
- One-sided messaging between objects
- Retain place=process for MPI/GA legacy
- Dynamic load balancing
- Data redistribution, work stealing, randomization
02/20/09 Robert J. Harrison, UT/ORNL 29
Multi-threaded architecture
RMI Server (MPI or portals) Incoming active messages Task dequeue Incoming active messages Application logical main thread Outgoing active messages Work stealing
CSGF June 2008 30
Summary
- Tensor-based computation central to
– Efficient computation in many-dimensions – Efficient use of present and future computers
- Challenges/opportunities
– Theoretical framework for low-rank representations
- f operators/functions with controlled error