SLIDE 1

ACESIII

Outline

  • Design philosophy
  • Implementation
  • Results
  • Conclusions

Collaborators

  • Mr. Mark Ponton, ACES Q. C. (SIP/SIAL/Compiler)
  • Dr. Norbert Flocke, QTP (Integral package)
  • Dr. Erik Deumens, QTP (Architect)
  • Dr. Ajith Perera, QTP
  • Dr. H. Lei, ACES Q. C. (Compiler)
  • Dr. Anthony Yau, HPTi
  • Dr. Rodney Bartlett, QTP, ACES Q. C.

SLIDE 2

Traditional Design

[Diagram: the application code directly manages control, compute, communication, disk, input/output, and hardware.]

SLIDE 3

ACESIII Design

[Diagram: the application code talks only to a control layer, which manages the hardware, compute, disk I/O, and communication.]

SLIDE 4

ACESIII design concepts

  • Problem: data structures, algorithms, communication, input/output
  • High level: Super Instruction Assembly Language (SIAL)
  • Low level: Super Instruction Processor, SIP (xaces3)
  • Performance

SLIDE 5

SIAL (Super Instruction Assembly Language)

  • Key features
  • Index segmentation
  • Data blocking
  • Task isolation
  • Advantages
  • Flexibility
  • Tunability: fast optimization
  • New methods implemented in reduced time
  • Portable
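The three key features above can be illustrated with a toy example. The sketch below is hypothetical Python, not SIAL or ACESIII code: a large index range is segmented, data are handled in blocks, and each block operation is an isolated task, which is the role SIP plays when it schedules SIAL "super instructions" across workers.

```python
# Hypothetical sketch (not ACESIII code) of SIAL-style index segmentation:
# an index range is split into segments, and all work is expressed as
# operations on data blocks. Each block task touches only its own blocks,
# giving the task isolation that lets a runtime distribute them.

SEG = 4  # segment (block) size -- illustrative value

def segments(n):
    """Split the index range 0..n-1 into contiguous segments of size SEG."""
    return [range(i, min(i + SEG, n)) for i in range(0, n, SEG)]

def blocked_matmul(a, b):
    """C = A*B computed block by block; each (si, sj, sk) triple is one
    isolated "super instruction" acting on small dense blocks."""
    m, k, n = len(a), len(a[0]), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for si in segments(m):
        for sj in segments(n):
            for sk in segments(k):
                for i in si:
                    for j in sj:
                        c[i][j] += sum(a[i][kk] * b[kk][j] for kk in sk)
    return c

# Tiny check against a direct triple loop (integer-valued data, exact)
A = [[float(i + j) for j in range(6)] for i in range(5)]
B = [[float(i * j % 3) for j in range(7)] for i in range(6)]
direct = [[sum(A[i][k] * B[k][j] for k in range(6)) for j in range(7)]
          for i in range(5)]
assert blocked_matmul(A, B) == direct
```

Because each block task reads and writes only its own blocks, the triple loop over segments can be reordered or farmed out freely; this is the flexibility and tunability claimed above.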
SLIDE 6

Implemented

  • SCF: RHF, UHF
  • MBPT(2) gradient: RHF, UHF, ROHF
  • CCSD gradient: RHF, UHF
  • CCSD(T): RHF, UHF
  • MBPT(2) Hessian: RHF, UHF, ROHF
  • EOM-CCSD (Tomasz Kus): RHF, UHF
SLIDE 7

[Figure: log-log plot of time [sec] vs. number of processors (8 to 128), ideal vs. actual. DMMP MBPT(2) gradient timings, Nbf = 397, Ncorr-occ = 33: 689 min at 8 processors falls to 46 min at 128 (ideal: 43 min). Super-scaling and normal-scaling regions marked.]
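The timings in the figure imply strong parallel efficiency. A quick check, using only the numbers quoted above (689 min at 8 processors, 46 min at 128):

```python
# Scaling check from the numbers on this slide: 689 min on 8 processors
# drops to 46 min on 128, where ideal (linear) scaling predicts 43 min.

def efficiency(t_ref, p_ref, t, p):
    """Parallel efficiency relative to a p_ref-processor reference run."""
    speedup = t_ref / t   # measured speedup over the reference run
    ideal = p / p_ref     # ideal speedup for this processor increase
    return speedup / ideal

eff = efficiency(689.0, 8, 46.0, 128)
print(f"MBPT(2) gradient efficiency, 8 -> 128 processors: {eff:.1%}")  # 93.6%
```

A measured speedup of 689/46, roughly 15.0, against an ideal of 16 is about 94% of ideal, consistent with the "over 90% scaling" statement in the conclusions.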

SLIDE 8

CCSD(T)

  • SCF: easy if you have a good integrals package
  • Transformation: hard, but small cost
  • CCSD: hard, as it is highly nonlinear
  • CCSD(T): trivial!!! At least that is the common wisdom

SLIDE 9

(T) Strategy

  • Occupied orbitals are split into chunks 1, 2, 3, 4
  • Each chunk contributes a partial energy: E1 + E2 + E3 + E4 = E(TOTAL)

SLIDE 10

(T) Strategy

(Animation build of the previous slide: occupied chunks 1, 2, 3, 4 each contribute a partial energy, E1 + E2 + E3 + E4 = E(TOTAL).)
SLIDE 11

Advantages of DUAL layer parallelism

  • Less data replication and fewer I/O bottlenecks
  • Trivial restart capability
  • Better turnaround due to queuing
  • Since more processors are used, the effective (T) time is comparable to the CCSD time, making the CCSD as important as, or more important than, the (T)!
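The dual-layer idea from the (T) strategy slides can be sketched in a few lines. This is a hypothetical Python illustration, not ACESIII's actual SIP machinery: the outer layer splits the (T) sum over chunks of occupied orbitals into independent jobs, and the partial energies are summed at the end.

```python
# Hypothetical sketch of dual-layer parallelism (not ACESIII code): the
# outer layer runs one independent job per occupied-orbital chunk; the
# inner layer (each job's own parallel run) is not modeled. Threads here
# stand in for separately queued batch jobs.
from concurrent.futures import ThreadPoolExecutor

def triples_chunk(orbitals):
    """Stand-in for the (T) energy contribution of one occupied chunk;
    the dummy arithmetic just makes the partial sums reproducible."""
    return sum(1.0 / (1 + o) for o in orbitals)

def dual_layer_triples(occupied, n_jobs):
    """Outer layer: one independent job per chunk, then E(TOTAL) = sum Ei.
    A failed chunk can simply be resubmitted (trivial restart)."""
    chunks = [occupied[i::n_jobs] for i in range(n_jobs)]
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        partials = list(pool.map(triples_chunk, chunks))
    return sum(partials)  # E1 + E2 + ... = E(TOTAL)

occ = list(range(8))
assert abs(dual_layer_triples(occ, 4) - triples_chunk(occ)) < 1e-12
```

Because the chunks share no state, each one can run as its own queued job with its own processor allocation, which is exactly what gives the restart and turnaround advantages listed above.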

SLIDE 12

CCSD(T)

  • Luciferin (C11H8O3S2N2): RHF, C1 symmetry, basis = aug-cc-pVDZ (498 basis functions), Ncorr-occ = 46
  • Sucrose (C12H22O11): RHF, C1 symmetry, basis = 6-311G** (546 basis functions), Ncorr-occ = 68
SLIDE 13

[Figure: log-log plot of time/iteration [min] vs. number of processors (32 to 256), ideal vs. actual. Luciferin CCSD timings, Nbf = 498, Ncorr-occ = 46: 115.9 min/iteration at 32 processors falls to 13.1 min at 256 (ideal: 14.5 min). Super-scaling and normal-scaling regions marked.]

SLIDE 14

[Figure: same Luciferin CCSD timing plot as the previous slide (Nbf = 498, Ncorr-occ = 46; 115.9 min/iteration at 32 processors, 13.1 min at 256, ideal 14.5 min), with the annotation: CCSD(T) = 420 min / 8 orbitals.]

SLIDE 15

[Figure: log-log plot of time/iteration [min] vs. number of processors (32 to 512), ideal vs. actual. Sucrose CCSD timings, Nbf = 546, Ncorr-occ = 68: 908.6 min/iteration at 32 processors falls to 24.0 min at 512 (ideal: 56.8 min). Super-scaling and normal-scaling regions marked.]

SLIDE 16

DMMP+OH

  • H10C3O4P, C1 symmetry
  • 208 basis functions, 75 electrons
  • Number of processors = 64
  • Time CCSD = 69 minutes (3.8 min/iteration)
  • Time (T) = 111 minutes (run as 7 dual jobs)
SLIDE 17

Systematic set of Benchmarks

  • Why? To remove confusion over technological versus algorithmic advances.

  • Allow users informed choices.
  • Provide a set of calculations to evaluate

each program so strengths and weaknesses become evident.

  • Remove ambiguity in literature.
SLIDE 18

ArN Cluster Benchmarks (Performance)

  • Specifications (mine!)
  • N = 6, UHF, C1 symmetry
  • Basis = aug-cc-pVTZ (300 basis functions)
  • Ncorr-occ = 54
  • R = 5 bohr
  • Methods
  • MBPT(2) gradient
  • CCSD gradient
  • CCSD(T) (core dropped)
  • MBPT(2) Hessian (RHF)

SLIDE 19

[Figure: log-log plot of time [min] vs. number of processors (32 to 256), ideal vs. actual. Ar6 MBPT(2) gradient timings, Nbf = 300, Ncorr-occ = 54, C1 symmetry: 67 min at 32 processors falls to 16.0 min at 256 (ideal: 8.4 min). Super-scaling and normal-scaling regions marked.]

SLIDE 20

[Figure: log-log plot of time/iteration [min] vs. number of processors (32 to 256), ideal vs. actual. Ar6 UCCSD timings, Nbf = 300, Ncorr-occ = 54, C1 symmetry: 103.5 min/iteration at 32 processors falls to 13.6 min at 256 (ideal: 12.9 min). Super-scaling and normal-scaling regions marked.]

SLIDE 21

[Figure: log-log plot of time/iteration [min] vs. number of processors (32 to 256), ideal vs. actual. Ar6 ULAMBDA timings, Nbf = 300, Ncorr-occ = 54, C1 symmetry: 119.7 min/iteration at 32 processors falls to 16.0 min at 256 (ideal: 15.0 min). Super-scaling and normal-scaling regions marked.]

SLIDE 22

[Figure: log-log plot of time [min] vs. number of processors (32 to 256), ideal vs. actual. Ar6 UGRAD timings, Nbf = 300, Ncorr-occ = 54, C1 symmetry: 1273 min at 32 processors falls to 141 min at 256 (ideal: 159 min). Super-scaling and normal-scaling regions marked.]

SLIDE 23

[Figure: log-log plot of time [min] vs. number of processors (32 to 256), ideal vs. actual. Ar6 total time for one UHF CCSD gradient, Nbf = 300, Ncorr-occ = 54, C1 symmetry: 3505 min at 32 processors falls to 437 min at 256 (ideal: 438 min). Super-scaling and normal-scaling regions marked.]

SLIDE 24

[Figure: log-log plot of time [min] vs. number of processors (32 to 256), ideal vs. actual. Ar6 UCCSD(T) timings, Nbf = 300, C1 symmetry: 784 min at 32 processors falls to 131 min at 256 (ideal: 98 min). Super-scaling and normal-scaling regions marked.]

SLIDE 25

MBPT(2) Hessian

E(2) = sum(ijab) [ V(ab,ij) V(ab,ij) / D(ab,ij) ]

Hessian element = d/dq [ d E(2) / dp ]

For perturbations p and q this yields terms of the form dV/dp * dV/dq and V * d2V/dp dq.
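The product-rule structure above can be sanity-checked numerically. The following is a hedged toy sketch (a scalar stand-in for V, with the denominator D held fixed, which is a simplification of the real MBPT(2) expression):

```python
import math

# Toy check of the structure on this slide: for a single MBPT(2)-like term
# f(p,q) = V(p,q)^2 / D (denominator D held fixed -- an assumption of this
# sketch), the mixed second derivative splits into the two pieces shown:
# dV/dp * dV/dq and V * d2V/dpdq.

D = 2.0

def V(p, q):          # toy "integral" V(p,q); purely illustrative
    return math.sin(p) * math.cos(q) + p * q

def dVdp(p, q):
    return math.cos(p) * math.cos(q) + q

def dVdq(p, q):
    return -math.sin(p) * math.sin(q) + p

def d2Vdpdq(p, q):
    return -math.cos(p) * math.sin(q) + 1.0

def hessian_analytic(p, q):
    # d2/dpdq [V^2/D] = (2/D) * (dV/dp * dV/dq + V * d2V/dpdq)
    return 2.0 / D * (dVdp(p, q) * dVdq(p, q) + V(p, q) * d2Vdpdq(p, q))

def hessian_fd(p, q, h=1e-4):
    """Central finite-difference estimate of the mixed derivative."""
    f = lambda a, b: V(a, b) ** 2 / D
    return (f(p + h, q + h) - f(p + h, q - h)
            - f(p - h, q + h) + f(p - h, q - h)) / (4 * h * h)

p, q = 0.3, 0.7
assert abs(hessian_analytic(p, q) - hessian_fd(p, q)) < 1e-6
```

The two analytic pieces correspond to the dV/dp * dV/dq and V * d2V/dpdq terms above; in the real code each perturbation pair (p, q) requires its own set of differentiated integrals, which is what makes the Hessian a natural dual-layer workload.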

SLIDE 26

Details of calculation

  • Number of basis functions = 300
  • Number of correlated occupied = 54
  • Number of Hessian elements = 324/2
  • Number of processors = 128
  • RHF reference
SLIDE 27

Results

  • Terms computed: V * d2V/dpdq, dV/dp, dV/dq, dV/dp * dV/dq
  • Total time T = 381 minutes
  • 155 sec per perturbation p
  • 330 sec per perturbation q
  • 16 sec per Hessian element
SLIDE 28

Observations

  • Ideally suited for dual layer parallelization

with ‘dual’ layer being over the perturbations.

  • Dual layer strategy is not optimal from an operation-count viewpoint, since some computation must be repeated, but it has many advantages: restart capability, real (wall-clock) time of the calculation, queuing, data storage.

SLIDE 29

Conclusions

  • ACESIII provides an ideal parallel environment in which

to implement computationally intense methods.

  • MBPT(2) gradient achieved over 90% scaling until the work was exhausted.
  • CCSD achieved better than ideal scaling up to 512 processors (32 as reference), indicating an optimal range of processors exists for each computation.
  • CCSD(T) perturbative triples can be computed quite effectively using a dual layer parallelization strategy, so that, pragmatically, (T) and CCSD are comparable in cost.

  • CCSD gradients (Ar6) exhibit ideal scaling from 32-256

processors.

SLIDE 30

Conclusions

  • MBPT(2) Hessians (and other properties as well) benefit from dual layer parallelism, but care must be taken to segment the work optimally.

  • A set of benchmark calculations would be very valuable

to the quantum chemistry community to remove ambiguities among various programs.

  • ACESIII has been successfully ported to the following systems: IBM SP4, SP5, ALTIX, Linux cluster, Opteron cluster, and is available on many DOD machines.

  • ACESIII benefits from ‘many’ processors indicating

potential in the massively parallel regime.

  • The flexibility offered by the ACESIII environment allows

for rapid tuning and implementation of codes.