Parallel Scaling of Teter's Minimization for Ab Initio Calculations - PowerPoint PPT Presentation



SLIDE 1


Parallel scaling of Teter’s Minimization for Ab Initio Calculations

Torsten Hoefler

Department of Computer Science, Technical University of Chemnitz

HPCNano Workshop 2006

Supercomputing'06, Tampa, FL, USA

November 13th, 2006

SLIDE 2

Outline

1. Introduction
   - Introduction to ABINIT
   - Teter's Conjugate Gradient Minimization

2. Parallelization
   - Already implemented Parallelization
   - A new Proposal
   - Verifying this Proposal

3. Hunting the Overlap
   - Non-blocking Collectives

SLIDE 4

ABINIT Introduction

ABINIT solves the time-independent Schrödinger equation for the effective one-particle case, using DFT:

  H_tot Φ = E_tot Φ

⇒ an eigenvalue problem. Eigenvalues and eigenvectors are determined with CG minimization (Teter et al.); the wavefunction Φ is written in a plane-wave basis set.

SLIDE 5

ABINIT Program Flow

[Flowchart] Initialization: (1) choose coefficients, (2) calculate electron density, (3) calculate trial potential. SCF cycle: (4) minimize electronic energy, (5) calculate total energy, (6) check convergence; if converged, stop, otherwise (7) mix new density and (8) calculate the potential, then repeat from (4).

SLIDE 6

ABINIT Tracing

Call-graph profile of a representative run (percentage pairs: time including children / time in the routine itself):

- vtowfk (97.3%/4.3%)
  - cgwf (83.6%/1.3%)
    - projbd (36.0%/36.0%)
    - fourwf (27.4%/0.0%) → sg_fftrisc (27.4%/5.7%) → sg_ffty (14.8%/14.8%), sg_fftpx (6.6%/6.6%)
    - nonlop (21.5%/0.0%) → nonlop_pl (21.5%/0.1%) → opernl4a (11.6%/10.3%), opernl4b (9.8%/8.7%)
    - orthon (5.7%/5.6%)

⇒ roughly 83% of the runtime is spent in the Teter minimization (cgwf)

SLIDE 8

Conjugate Gradient Operations

Two operations dominate: dot products and matrix-vector products.

- dot product: ⟨Φi|Φj⟩
- matrix-vector product: HΦ, with H = E^e_kin + V^e_loc + V^e_nl
- E^e_kin and V^e_loc are applied in reciprocal (k-) space
- V^e_nl is applied in real space

⇒ a 3D-FFT transforms between real and reciprocal space

SLIDE 10

K-Point Parallelization

Bands have to be minimized for each k-point, and the minimization for each k-point is independent. All k-point data is only needed for the calculation of ETOT ⇒ straightforward parallelization.

ABINIT implementation:
- Good speedup :-)
- Uses only collective communication :-)
- Limited to nkpt processes :-(
- Uses MPI_COMM_WORLD :-(
- Uses MPI_BARRIER :-(
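To make the k-point idea concrete, here is a minimal Fortran/MPI sketch (not ABINIT's actual code): MPI_COMM_SPLIT builds one sub-communicator per k-point so each group minimizes its bands independently, and only E_tot needs a global reduction; nkpt and the per-process energy are toy placeholders.

  ! sketch only: k-point parallelism via communicator splitting
  program kpoint_split
    use mpi
    implicit none
    integer :: ierr, rank, nproc, nkpt, mykpt, kcomm
    double precision :: e_kpt, e_tot
    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)
    nkpt  = 4                          ! toy value
    mykpt = mod(rank, nkpt)            ! round-robin k-point assignment
    ! one sub-communicator per k-point instead of MPI_COMM_WORLD everywhere
    call MPI_Comm_split(MPI_COMM_WORLD, mykpt, rank, kcomm, ierr)
    ! ... minimize the bands of k-point 'mykpt' within kcomm ...
    e_kpt = dble(mykpt)                ! placeholder per-process contribution
    ! k-point data is only needed globally for E_tot: one reduction, no barrier
    call MPI_Allreduce(e_kpt, e_tot, 1, MPI_DOUBLE_PRECISION, MPI_SUM, &
                       MPI_COMM_WORLD, ierr)
    if (rank == 0) print *, 'E_tot (toy) =', e_tot
    call MPI_Comm_free(kcomm, ierr)
    call MPI_Finalize(ierr)
  end program kpoint_split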

SLIDE 11

Band Parallelization

The Teter method allows parallel CG, but the orthogonalization constraint forces a non-ideal solution ⇒ tricky parallelization.

ABINIT implementation:
- Speedup depends on the interconnect :-/
- Uses Send/Recv :-(
- Limited by nband/c (c is not easily predictable)

SLIDE 13

G Parallelization

FFT ⇒ two parallelization schemes:
- distribute the plane-wave coefficients
- distribute the real-space FFT grid

Goals: strict load balancing, minimal communication; possible to combine with band and k-point parallelization.

Vector Distribution

[Figure] The 15 plane-wave coefficients of the example vector distributed across PE0-PE3.

SLIDE 14

Real Space Distribution

[Figure] 3D-FFT distribution: the FFT box is distributed across PE0-PE3; 1D-FFTs are performed along the xy-lines, an MPI_ALLTOALL transposes the data, and 2D-FFTs then run on the local z-planes (a second MPI_ALLTOALL restores the original layout).
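The transpose step can be sketched on its own. Below is a minimal Fortran/MPI version of just the MPI_ALLTOALL redistribution between the two layouts; the 1D/2D FFTs are elided as comments, and nloc is an assumed per-partner block size.

  ! sketch only: the communication step of the distributed 3D-FFT
  program fft_transpose
    use mpi
    implicit none
    integer :: ierr, rank, nproc, nloc
    double precision, allocatable :: sendbuf(:), recvbuf(:)
    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)
    nloc = 8                               ! assumed block size per partner
    allocate(sendbuf(nloc*nproc), recvbuf(nloc*nproc))
    ! ... 1D-FFTs along the local xy-lines would fill sendbuf here ...
    sendbuf = dble(rank)                   ! stand-in data
    ! redistribute the FFT box: block i of sendbuf goes to process i
    call MPI_Alltoall(sendbuf, nloc, MPI_DOUBLE_PRECISION, &
                      recvbuf, nloc, MPI_DOUBLE_PRECISION, &
                      MPI_COMM_WORLD, ierr)
    ! ... 2D-FFTs on the now-local z-planes would run here ...
    call MPI_Finalize(ierr)
  end program fft_transpose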

SLIDE 15

Implementation Issues

Necessary communication (complexity):
- dot products (O(1))
- computation of the kinetic energy (O(1))
- FFT transpose (O(natom))

Only collective communication:
- MPI_ALLREDUCE for reductions
- MPI_ALLTOALL for the FFT transpose

Principles:
- only collective communication
- a separate communicator
- simplification of the main code
- heavy usage of math libraries
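The dot-product pattern above is simple enough to sketch (a minimal example, not ABINIT's routine; npw_loc and the data are made up): a local partial sum over each process's plane-wave coefficients, then one MPI_ALLREDUCE on a duplicated, separate communicator.

  ! sketch only: distributed dot product <phi_i|phi_j>
  program pw_dot
    use mpi
    implicit none
    integer :: ierr, rank, npw_loc, g, gcomm
    double precision, allocatable :: phi_i(:), phi_j(:)
    double precision :: dot_loc, dot
    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_dup(MPI_COMM_WORLD, gcomm, ierr)  ! separate communicator
    npw_loc = 1000                                  ! assumed local coefficients
    allocate(phi_i(npw_loc), phi_j(npw_loc))
    phi_i = 0.5d0                                   ! toy data
    phi_j = 2.0d0
    dot_loc = 0.0d0
    do g = 1, npw_loc                               ! local partial sum
       dot_loc = dot_loc + phi_i(g)*phi_j(g)
    end do
    call MPI_Allreduce(dot_loc, dot, 1, MPI_DOUBLE_PRECISION, MPI_SUM, &
                       gcomm, ierr)                 ! O(1) messages per process
    if (rank == 0) print *, '<phi_i|phi_j> =', dot
    call MPI_Finalize(ierr)
  end program pw_dot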

SLIDE 16

Benchmarking the Implementation of cgwf

[Plot] Speedup vs. number of processors (up to 64) for two systems, SiO2 with natom=43, nband=126, npw=48728 and SiO2 with natom=86, nband=251, npw=97624, compared against linear speedup.

SLIDE 17

Possible Reasons for limited Scalability

Serial parts (Amdahl's law):
- allocations
- scalar calculation
- index reordering (packin/packout in the FFT)

Communication overhead:
- the latency of blocking collective operations limits scalability significantly
- this overhead will be modelled in the following

SLIDE 19

The LogP Model

[Figure] The LogP model of point-to-point communication as a time diagram: the sender spends o_s (send overhead), the message travels for L (latency), the receiver spends o_r (receive overhead); g (gap) is the minimum interval between consecutive messages, and P is the number of processors.
SLIDE 20

Modelling the MPI_ALLREDUCE

→ modelled as an MPI_REDUCE to node 0 followed by an MPI_BCAST, each organized as a tree over the processes (P0...P7 in the figure).

[Figure] Time diagram of the tree communication, with effective per-message costs f_s = max(o_s, g) on the sender and f_r = max(o_r, g) on the receiver.

Resulting model (writing o for o_s = o_r):

  t_red(P, size) = 2 · size · (2o + L + (⌈log2 P⌉ − 1) · max{g, 2o + L})
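As a worked check of the model, plugging in the LogP parameters measured later in the talk (see "Verifying the Predictions" at the end: L = 9.78 µs, o = 0.05 µs, g = 0.01 µs) for P = 8 and size = 1:

  \begin{align*}
    2o + L &= 2 \cdot 0.05 + 9.78 = 9.88\ \mu\text{s} = \max\{g,\ 2o+L\} \\
    t_{red}(8,1) &= 2 \cdot 1 \cdot \bigl(9.88 + (\lceil\log_2 8\rceil - 1) \cdot 9.88\bigr)
                  = 6 \cdot 9.88 \approx 59.3\ \mu\text{s}
  \end{align*}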

SLIDE 21

Modelling the MPI_ALLTOALL

→ each node has to send to all others. For a single host (P0 sending to P1...P4 in the figure):

[Figure] Time diagram of one host's sends: successive send overheads o_s separated by the gap g, each message arriving after L with receive overhead o_r.

With all hosts sending, and assuming full bisection bandwidth (FBB):

  t_a2a(P, size) = size · ((2o + L) + (P − 1) · (g + o))
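The corresponding worked example with the same LogP parameters, again for P = 8 and size = 1:

  \[
    t_{a2a}(8,1) = 1 \cdot \bigl((2o+L) + (8-1) \cdot (g+o)\bigr)
                 = 9.88 + 7 \cdot 0.06 \approx 10.3\ \mu\text{s}
  \]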

SLIDE 22

Predicting the Overhead

Reduction overhead:

  o_red(P) = nband · (9 + 2 · nband) · t_red(P, 1)  ⇒  o_red(P) = O(log2 P)

For natom = 43 (nband = 126; with 2o + L = 9.88 µs dominating g, t_red(P, 1) = 2 · ⌈log2 P⌉ · 9.88 µs):

  o_red(P) = 126 · (9 + 2 · 126) · 2 · (⌈log2 P⌉ · 9.88 µs) = 65772 · (⌈log2 P⌉ · 9.88 µs)

All-to-all overhead:

  o_a2a(P) = 2 · t_a2a(P, Nx · Ny · Nz / P)  ⇒  o_a2a(P) = O(1)

For natom = 43:

  o_a2a(P) = . . .
  o_a2a(P) ≈ 6.3 s
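The model is small enough to evaluate directly. A Fortran sketch of exactly these formulas (o_red follows the slide; the FFT grid size Nx·Ny·Nz needed for o_a2a is not given here, so the value below is an assumed placeholder):

  ! sketch only: evaluating the natom=43 overhead model
  program overhead_model
    implicit none
    double precision, parameter :: L = 9.78d0, o = 0.05d0, g = 0.01d0 ! LogP, us
    double precision :: t1, ored, oa2a, grid
    integer :: P, lg
    grid = 1.0d7                   ! ASSUMED Nx*Ny*Nz, not from the talk
    t1 = 2.0d0*o + L               ! 2o + L = 9.88 us
    do P = 2, 64
       lg = 0                      ! lg = ceil(log2(P)), computed exactly
       do while (2**lg < P)
          lg = lg + 1
       end do
       ! o_red(P) = nband*(9 + 2*nband) * t_red(P,1), with nband = 126 and
       ! t_red(P,1) = 2*lg*(2o+L) because max{g, 2o+L} = 2o+L here
       ored = 126.0d0*(9.0d0 + 2.0d0*126.0d0) * 2.0d0*dble(lg)*t1
       ! o_a2a(P) = 2 * t_a2a(P, Nx*Ny*Nz/P)
       oa2a = 2.0d0 * (grid/dble(P)) * (t1 + dble(P-1)*(g + o))
       print '(i4,2f16.1)', P, ored, oa2a    ! both in microseconds
    end do
  end program overhead_model

With these numbers, o_red(64) = 65772 · 6 · 9.88 µs ≈ 3.9 s, the same order as the ≈6.3 s quoted for o_a2a, which is why both terms matter for the scaling limit.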

SLIDE 23

Verifying the Overhead Prediction

[Plot] Measured vs. predicted overhead over 10-60 processors: ALLREDUCE and ALLTOALL overheads against the predicted t_red and t_a2a curves.

SLIDE 24

Can we predict parallel Scaling?

⇒ kind of (with the communication overhead as the limiting factor).

- ideal scaling: t(P) = t(1)/P, so lim P→∞ t(P) = 0
- overhead: o(P) = o_red(P) + o_a2a(P), and lim P→∞ o(P) = ∞
- the crossing point Pc denotes the maximum useful scaling: t(Pc) = o(Pc)
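Stated as a formula (a restatement of the argument above):

  \[
    \frac{t(1)}{P_c} \;=\; o_{red}(P_c) + o_{a2a}(P_c)
  \]

Since t(1)/P decreases monotonically towards 0 while o(P) grows without bound, such a crossing point always exists and bounds the useful processor count.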

SLIDE 25

Modelled Prediction

[Plot] Predicted overhead (o_red + o_a2a) plotted against ideal calculation scaling (t(P=1)/P) over 10-60 processors; the crossing point marks the predicted scaling limit.

SLIDE 26

Comparison to Benchmarks

[Plot] Measured speedup vs. number of processors for the two SiO2 systems (natom=43, nband=126, npw=48728 and natom=86, nband=251, npw=97624) against linear speedup.

SLIDE 27

Intermediate Conclusions

- Teter's scheme is efficiently parallelizable
- k-point, band, and G parallelism can be combined
- parallel scaling can be predicted
- parallel scaling depends on the overhead
- the overhead depends on the system size and the LogP parameters

⇒ the overhead is a hard limitation (is it?)

Overlapping could help ;o)

SLIDE 29

Non-blocking Communication

- Communication can be overlapped with computation
- A programming model that supports overlapping via threads is too complex
- Non-blocking communication does not change the programming model
- Supported by MPI (MPI_ISEND, MPI_IRECV); a minimal sketch follows below
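A minimal Fortran/MPI sketch of the pattern (run with at least two processes): start the transfer, compute something that does not touch the buffer, then wait.

  ! sketch only: overlapping a point-to-point transfer with computation
  program overlap_p2p
    use mpi
    implicit none
    integer :: ierr, rank, req, i
    double precision :: buf(1024), acc
    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    buf = dble(rank)
    if (rank == 0) then
       call MPI_Isend(buf, 1024, MPI_DOUBLE_PRECISION, 1, 0, &
                      MPI_COMM_WORLD, req, ierr)
    else if (rank == 1) then
       call MPI_Irecv(buf, 1024, MPI_DOUBLE_PRECISION, 0, 0, &
                      MPI_COMM_WORLD, req, ierr)
    end if
    acc = 0.0d0
    do i = 1, 1000000                ! independent work overlaps the transfer
       acc = acc + 1.0d0/dble(i)
    end do
    if (rank <= 1) call MPI_Wait(req, MPI_STATUS_IGNORE, ierr)
    call MPI_Finalize(ierr)
  end program overlap_p2p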

SLIDE 30

Send/Recv is there - Why Collectives?

Gorlatch '04: "Send-Receive Considered Harmful" ⇔ Dijkstra '68: "Go To Statement Considered Harmful"

Point-to-point:

  if (rank == 0) then
    call MPI_SEND(...)
  else
    call MPI_RECV(...)
  end if

vs. collective:

  call MPI_GATHER(...)

⇒ compare math libraries vs. hand-written loops

SLIDE 31

Why non-blocking Collectives

Overlap communication and computation:

- many collectives synchronize unnecessarily
- collectives scale at least with O(log2 P) sends
- wasted CPU time: log2 P · L

Typical latencies: Fast Ethernet L = 50-60 µs, Gigabit Ethernet L = 15-20 µs, InfiniBand L = 2-7 µs; 1 µs ≈ 4000 FLOPs on a 2 GHz machine. A minimal overlap sketch follows below.
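The same overlap pattern with a collective. The talk's LibNBC spells this NBC_IALLREDUCE; the sketch below uses the equivalent call that later entered MPI-3 as MPI_IALLREDUCE (an assumption if your MPI predates MPI-3).

  ! sketch only: hiding a reduction behind independent computation
  program overlap_coll
    use mpi
    implicit none
    integer :: ierr, rank, req, i
    double precision :: dot_loc, dot, acc
    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    dot_loc = dble(rank)
    call MPI_Iallreduce(dot_loc, dot, 1, MPI_DOUBLE_PRECISION, MPI_SUM, &
                        MPI_COMM_WORLD, req, ierr)   ! starts, returns at once
    acc = 0.0d0
    do i = 1, 1000000             ! work that does not depend on 'dot'
       acc = acc + dble(i)
    end do
    call MPI_Wait(req, MPI_STATUS_IGNORE, ierr)      ! 'dot' valid from here
    if (rank == 0) print *, 'sum of ranks =', dot
    call MPI_Finalize(ierr)
  end program overlap_coll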

SLIDE 32

Final Conclusions and Future Work

Conclusions:
- Teter's minimization scales ok
- communication overhead is the limiting factor
- parallel scaling is predictable (though not easily)
- scaling could be enhanced by overlapping communication and computation to hide latency
- collective communications should be preferred ⇒ non-blocking collective operations: LibNBC, http://www.unixer.de/NBC

Future Work:
- use non-blocking collectives to enhance QM codes
- e.g., overlapping schemes for the 3D-FFT (sketched below)
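As a closing illustration of that future-work item, a hedged Fortran sketch of an overlapping 3D-FFT transpose pipeline: the all-to-all of block b is started, the (elided) FFT work on the previous block runs while it progresses, and the block is then completed. The pipeline constants are made up; LibNBC's NBC_IALLTOALL or MPI-3's MPI_IALLTOALL provide the call.

  ! sketch only: pipelining FFT work with non-blocking all-to-alls
  program fft_pipeline
    use mpi
    implicit none
    integer, parameter :: nblk = 4, nloc = 1024   ! assumed pipeline constants
    integer :: ierr, rank, nproc, b, req(nblk)
    double precision, allocatable :: sendbuf(:,:), recvbuf(:,:)
    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)
    allocate(sendbuf(nloc*nproc, nblk), recvbuf(nloc*nproc, nblk))
    sendbuf = dble(rank)
    do b = 1, nblk
       ! ... 1D-FFTs for block b would fill sendbuf(:,b) here ...
       call MPI_Ialltoall(sendbuf(:,b), nloc, MPI_DOUBLE_PRECISION, &
                          recvbuf(:,b), nloc, MPI_DOUBLE_PRECISION, &
                          MPI_COMM_WORLD, req(b), ierr)
       if (b > 1) then
          call MPI_Wait(req(b-1), MPI_STATUS_IGNORE, ierr)
          ! ... 2D-FFTs on received block b-1 overlap block b's transfer ...
       end if
    end do
    call MPI_Wait(req(nblk), MPI_STATUS_IGNORE, ierr)
    call MPI_Finalize(ierr)
  end program fft_pipeline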

SLIDE 33

The Teter Algorithm

Steepest descent: d_i = −∂f/∂x_i = −G x_i, in analogy:

- f(x) → E, the Kohn-Sham energy functional
- x → ψ, the wavefunction of each electron
- G → H, the Hamilton operator

Teter's scheme:

1. check the residual for convergence
2. compute the steepest descent vector
3. orthogonalize it to all bands
4. compute the preconditioned steepest descent
5. orthogonalize it to all bands
6. compute the conjugate gradient vector
7. step into the CG direction
8. go to 1

SLIDE 34

Verifying the Predictions

Kielmann's logp-mpi benchmark measured: L = 9.78 µs, o = 0.05 µs, g = 0.01 µs

[Plot] Measured vs. predicted times for 16-byte ALLREDUCE and ALLTOALL over 5-30 processors.