SLIDE 1

ACESIII

Outline

  • Design philosophy
  • Implementation
  • Results
  • Conclusions

Collaborators

  • Mr. Mark Ponton, ACES Q. C. (SIP/SIAL/Compiler)
  • Dr. Norbert Flocke, QTP (Integral package)
  • Dr. Erik Deumens, QTP (Architect)
  • Dr. Ajith Perera, QTP
  • Dr. H. Lei, ACES Q. C. (Compiler)
  • Dr. Anthony Yau, HPTi
  • Dr. Rodney Bartlett, QTP, ACES Q. C.

SLIDE 2

Traditional Design

[Diagram: the application code directly manages control, compute, communication, disk, input/output, and hardware.]

SLIDE 3

ACESIII Design

[Diagram: the application code talks only to a control layer, which manages the hardware, compute, disk I/O, and communication.]

SLIDE 4

ACESIII design concepts

  • Problem: data structures, algorithms, communication, input/output
  • High level: Super Instruction Assembly Language (SIAL)
  • Low level: Super Instruction Processor, SIP (xaces3)
  • Performance

SLIDE 5

SIAL (Super Instruction Assembly Language)

  • Key features
  • Index segmentation
  • Data blocking
  • Task isolation
  • Advantages
  • Flexibility
  • Tunability: fast optimization
  • New methods implemented in reduced time
  • Portable
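The three key features above can be illustrated with a toy example. The sketch below is hypothetical Python, not SIAL or ACESIII code: a large index range is segmented, data are handled in blocks, and each block operation is an isolated task, which is the role SIP plays when it schedules SIAL "super instructions" across workers.

```python
# Hypothetical sketch (not ACESIII code) of SIAL-style index segmentation:
# an index range is split into segments, and all work is expressed as
# operations on data blocks. Each block task touches only its own blocks,
# giving the task isolation that lets a runtime distribute them.

SEG = 4  # segment (block) size -- illustrative value

def segments(n):
    """Split the index range 0..n-1 into contiguous segments of size SEG."""
    return [range(i, min(i + SEG, n)) for i in range(0, n, SEG)]

def blocked_matmul(a, b):
    """C = A*B computed block by block; each (si, sj, sk) triple is one
    isolated "super instruction" acting on small dense blocks."""
    m, k, n = len(a), len(a[0]), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for si in segments(m):
        for sj in segments(n):
            for sk in segments(k):
                for i in si:
                    for j in sj:
                        c[i][j] += sum(a[i][kk] * b[kk][j] for kk in sk)
    return c

# Tiny check against a direct triple loop (integer-valued data, exact)
A = [[float(i + j) for j in range(6)] for i in range(5)]
B = [[float(i * j % 3) for j in range(7)] for i in range(6)]
direct = [[sum(A[i][k] * B[k][j] for k in range(6)) for j in range(7)]
          for i in range(5)]
assert blocked_matmul(A, B) == direct
```

Because each block task reads and writes only its own blocks, the triple loop over segments can be reordered or farmed out freely; this is the flexibility and tunability claimed above.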
SLIDE 6

Implemented

  • SCF: RHF, UHF
  • MBPT(2) gradient: RHF, UHF, ROHF
  • CCSD gradient: RHF, UHF
  • CCSD(T): RHF, UHF
  • MBPT(2) Hessian: RHF, UHF, ROHF
  • EOM-CCSD (Tomasz Kus): RHF, UHF
SLIDE 7

[Figure: log-log plot of time [sec] vs. number of processors (8 to 128), ideal vs. actual. DMMP MBPT(2) gradient timings, Nbf = 397, Ncorr-occ = 33: 689 min at 8 processors falls to 46 min at 128 (ideal: 43 min). Super-scaling and normal-scaling regions marked.]
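The timings in the figure imply strong parallel efficiency. A quick check, using only the numbers quoted above (689 min at 8 processors, 46 min at 128):

```python
# Scaling check from the numbers on this slide: 689 min on 8 processors
# drops to 46 min on 128, where ideal (linear) scaling predicts 43 min.

def efficiency(t_ref, p_ref, t, p):
    """Parallel efficiency relative to a p_ref-processor reference run."""
    speedup = t_ref / t   # measured speedup over the reference run
    ideal = p / p_ref     # ideal speedup for this processor increase
    return speedup / ideal

eff = efficiency(689.0, 8, 46.0, 128)
print(f"MBPT(2) gradient efficiency, 8 -> 128 processors: {eff:.1%}")  # 93.6%
```

A measured speedup of 689/46, roughly 15.0, against an ideal of 16 is about 94% of ideal, consistent with the "over 90% scaling" statement in the conclusions.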

SLIDE 8

CCSD(T)

  • SCF: easy if you have a good integrals package
  • Transformation: hard, but small cost
  • CCSD: hard, as it is highly nonlinear
  • CCSD(T): trivial!!! At least that is the common wisdom

SLIDE 9

(T) Strategy

  • Occupied orbitals are split into chunks 1, 2, 3, 4
  • Each chunk contributes a partial energy: E1 + E2 + E3 + E4 = E(TOTAL)

SLIDE 10

(T) Strategy

(Animation build of the previous slide: occupied chunks 1, 2, 3, 4 each contribute a partial energy, E1 + E2 + E3 + E4 = E(TOTAL).)
SLIDE 11

Advantages of DUAL layer parallelism

  • Less data replication and fewer I/O bottlenecks
  • Trivial restart capability
  • Better turnaround due to queuing
  • Since more processors are used, the effective (T) time is comparable to the CCSD time, making the CCSD as important as, or more important than, the (T)!
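The dual-layer idea from the (T) strategy slides can be sketched in a few lines. This is a hypothetical Python illustration, not ACESIII's actual SIP machinery: the outer layer splits the (T) sum over chunks of occupied orbitals into independent jobs, and the partial energies are summed at the end.

```python
# Hypothetical sketch of dual-layer parallelism (not ACESIII code): the
# outer layer runs one independent job per occupied-orbital chunk; the
# inner layer (each job's own parallel run) is not modeled. Threads here
# stand in for separately queued batch jobs.
from concurrent.futures import ThreadPoolExecutor

def triples_chunk(orbitals):
    """Stand-in for the (T) energy contribution of one occupied chunk;
    the dummy arithmetic just makes the partial sums reproducible."""
    return sum(1.0 / (1 + o) for o in orbitals)

def dual_layer_triples(occupied, n_jobs):
    """Outer layer: one independent job per chunk, then E(TOTAL) = sum Ei.
    A failed chunk can simply be resubmitted (trivial restart)."""
    chunks = [occupied[i::n_jobs] for i in range(n_jobs)]
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        partials = list(pool.map(triples_chunk, chunks))
    return sum(partials)  # E1 + E2 + ... = E(TOTAL)

occ = list(range(8))
assert abs(dual_layer_triples(occ, 4) - triples_chunk(occ)) < 1e-12
```

Because the chunks share no state, each one can run as its own queued job with its own processor allocation, which is exactly what gives the restart and turnaround advantages listed above.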

SLIDE 12

CCSD(T)

  • Luciferin (C11H8O3S2N2): RHF, C1 symmetry, basis = aug-cc-pVDZ (498 basis functions), Ncorr-occ = 46
  • Sucrose (C12H22O11): RHF, C1 symmetry, basis = 6-311G** (546 basis functions), Ncorr-occ = 68
SLIDE 13

[Figure: log-log plot of time/iteration [min] vs. number of processors (32 to 256), ideal vs. actual. Luciferin CCSD timings, Nbf = 498, Ncorr-occ = 46: 115.9 min/iteration at 32 processors falls to 13.1 min at 256 (ideal: 14.5 min). Super-scaling and normal-scaling regions marked.]

SLIDE 14

[Figure: same Luciferin CCSD timing plot as the previous slide (Nbf = 498, Ncorr-occ = 46; 115.9 min/iteration at 32 processors, 13.1 min at 256, ideal 14.5 min), with the annotation: CCSD(T) = 420 min / 8 orbitals.]

SLIDE 15

[Figure: log-log plot of time/iteration [min] vs. number of processors (32 to 512), ideal vs. actual. Sucrose CCSD timings, Nbf = 546, Ncorr-occ = 68: 908.6 min/iteration at 32 processors falls to 24.0 min at 512 (ideal: 56.8 min). Super-scaling and normal-scaling regions marked.]

SLIDE 16

DMMP+OH

  • H10C3O4P, C1 symmetry
  • 208 basis functions, 75 electrons
  • Number of processors = 64
  • Time CCSD = 69 minutes (3.8 min/iteration)
  • Time (T) = 111 minutes (run as 7 dual jobs)
SLIDE 17

Systematic set of Benchmarks

  • Why? To remove confusion over technological versus algorithmic advances.

  • Allow users informed choices.
  • Provide a set of calculations to evaluate

each program so strengths and weaknesses become evident.

  • Remove ambiguity in literature.
SLIDE 18

ArN Cluster Benchmarks (Performance)

  • Specifications (mine!)
  • N = 6, UHF, C1 symmetry
  • Basis = aug-cc-pVTZ (300 basis functions)
  • Ncorr-occ = 54
  • R = 5 bohr
  • Methods
  • MBPT(2) gradient
  • CCSD gradient
  • CCSD(T) (core dropped)
  • MBPT(2) Hessian (RHF)

SLIDE 19

[Figure: log-log plot of time [min] vs. number of processors (32 to 256), ideal vs. actual. Ar6 MBPT(2) gradient timings, Nbf = 300, Ncorr-occ = 54, C1 symmetry: 67 min at 32 processors falls to 16.0 min at 256 (ideal: 8.4 min). Super-scaling and normal-scaling regions marked.]

SLIDE 20

[Figure: log-log plot of time/iteration [min] vs. number of processors (32 to 256), ideal vs. actual. Ar6 UCCSD timings, Nbf = 300, Ncorr-occ = 54, C1 symmetry: 103.5 min/iteration at 32 processors falls to 13.6 min at 256 (ideal: 12.9 min). Super-scaling and normal-scaling regions marked.]

SLIDE 21

[Figure: log-log plot of time/iteration [min] vs. number of processors (32 to 256), ideal vs. actual. Ar6 ULAMBDA timings, Nbf = 300, Ncorr-occ = 54, C1 symmetry: 119.7 min/iteration at 32 processors falls to 16.0 min at 256 (ideal: 15.0 min). Super-scaling and normal-scaling regions marked.]

SLIDE 22

[Figure: log-log plot of time [min] vs. number of processors (32 to 256), ideal vs. actual. Ar6 UGRAD timings, Nbf = 300, Ncorr-occ = 54, C1 symmetry: 1273 min at 32 processors falls to 141 min at 256 (ideal: 159 min). Super-scaling and normal-scaling regions marked.]

SLIDE 23

[Figure: log-log plot of time [min] vs. number of processors (32 to 256), ideal vs. actual. Ar6 total time for one UHF CCSD gradient, Nbf = 300, Ncorr-occ = 54, C1 symmetry: 3505 min at 32 processors falls to 437 min at 256 (ideal: 438 min). Super-scaling and normal-scaling regions marked.]

SLIDE 24

[Figure: log-log plot of time [min] vs. number of processors (32 to 256), ideal vs. actual. Ar6 UCCSD(T) timings, Nbf = 300, C1 symmetry: 784 min at 32 processors falls to 131 min at 256 (ideal: 98 min). Super-scaling and normal-scaling regions marked.]

SLIDE 25

MBPT(2) Hessian

E(2) = sum(ijab) [ V(ab,ij) V(ab,ij) / D(ab,ij) ]

Hessian element = d/dq [ d E(2) / dp ]

For perturbations p and q this yields terms of the form dV/dp * dV/dq and V * d2V/dp dq.
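The product-rule structure above can be sanity-checked numerically. The following is a hedged toy sketch (a scalar stand-in for V, with the denominator D held fixed, which is a simplification of the real MBPT(2) expression):

```python
import math

# Toy check of the structure on this slide: for a single MBPT(2)-like term
# f(p,q) = V(p,q)^2 / D (denominator D held fixed -- an assumption of this
# sketch), the mixed second derivative splits into the two pieces shown:
# dV/dp * dV/dq and V * d2V/dpdq.

D = 2.0

def V(p, q):          # toy "integral" V(p,q); purely illustrative
    return math.sin(p) * math.cos(q) + p * q

def dVdp(p, q):
    return math.cos(p) * math.cos(q) + q

def dVdq(p, q):
    return -math.sin(p) * math.sin(q) + p

def d2Vdpdq(p, q):
    return -math.cos(p) * math.sin(q) + 1.0

def hessian_analytic(p, q):
    # d2/dpdq [V^2/D] = (2/D) * (dV/dp * dV/dq + V * d2V/dpdq)
    return 2.0 / D * (dVdp(p, q) * dVdq(p, q) + V(p, q) * d2Vdpdq(p, q))

def hessian_fd(p, q, h=1e-4):
    """Central finite-difference estimate of the mixed derivative."""
    f = lambda a, b: V(a, b) ** 2 / D
    return (f(p + h, q + h) - f(p + h, q - h)
            - f(p - h, q + h) + f(p - h, q - h)) / (4 * h * h)

p, q = 0.3, 0.7
assert abs(hessian_analytic(p, q) - hessian_fd(p, q)) < 1e-6
```

The two analytic pieces correspond to the dV/dp * dV/dq and V * d2V/dpdq terms above; in the real code each perturbation pair (p, q) requires its own set of differentiated integrals, which is what makes the Hessian a natural dual-layer workload.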

SLIDE 26

Details of calculation

  • Number of basis functions = 300
  • Number of correlated occupied = 54
  • Number of Hessian elements = 324/2
  • Number of processors = 128
  • RHF reference
SLIDE 27

Results

  • Terms computed: V * d2V/dpdq, dV/dp, dV/dq, dV/dp * dV/dq
  • Total time T = 381 minutes
  • 155 sec per perturbation p
  • 330 sec per perturbation q
  • 16 sec per Hessian element
SLIDE 28

Observations

  • Ideally suited for dual layer parallelization

with ‘dual’ layer being over the perturbations.

  • Dual layer strategy is not optimal from an operation-count viewpoint, since some computation must be repeated, but it has many advantages: restart capability, real (wall-clock) time of the calculation, queuing, data storage.

SLIDE 29

Conclusions

  • ACESIII provides an ideal parallel environment in which

to implement computationally intense methods.

  • MBPT(2) gradient achieved over 90% scaling until the work was exhausted.
  • CCSD achieved better than ideal scaling up to 512 processors (32 as reference), indicating an optimal range of processors exists for each computation.
  • CCSD(T) perturbative triples can be computed quite effectively using a dual layer parallelization strategy, so that, pragmatically, (T) and CCSD are comparable in cost.

  • CCSD gradients (Ar6) exhibit ideal scaling from 32-256

processors.

SLIDE 30

Conclusions

  • MBPT(2) Hessians (and other properties as well) benefit from dual layer parallelism, but care must be taken to segment the work optimally.

  • A set of benchmark calculations would be very valuable

to the quantum chemistry community to remove ambiguities among various programs.

  • ACESIII has been successfully ported to the following systems: IBM SP4, SP5, ALTIX, Linux cluster, Opteron cluster, and is available on many DOD machines.

  • ACESIII benefits from ‘many’ processors indicating

potential in the massively parallel regime.

  • The flexibility offered by the ACESIII environment allows

for rapid tuning and implementation of codes.