SLIDE 1

Analysis of a Parallel 3D MD application

Russian-German School on High-Performance Computer Systems, 27th June - 6th July, Novosibirsk

  • Day 8, 6th of July, 2005
  • Institute of Theoretical & Applied Mechanics, Novosibirsk

HLRS, University of Stuttgart

SLIDE 2

Overview of the talk

  • Overview
  • Background of application
  • Numerical scheme
  • Grid setup and communication
  • Optimization experiments
  • Scalability of code
  • Summary
SLIDE 3

Background of the application

  • Computation of the ignition of condensed material and the transition from the burning to the detonation phase.
  • Here a 3D (mono)crystal (AgN3) is simulated to check the applicability of the most general non-stationary continuum mechanics equations.

  • Interactions between molecules in the crystal use the two-body term of the Stillinger-Weber potential:

    $U_{inter} = e_{inter}\left[\left(\frac{b_{inter}}{r_{ij}}\right)^4 - 1\right] e^{\,\sigma_{inter}/(r_{ij}-a_{inter})}$

    with $e_{inter} = 5\times 10^{-21}$ J, $b_{inter} = 9.7$ Å, $a_{inter} = 3.9$ Å, and $\sigma_{inter} = 0.007$ Å.
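As a worked example, the pair term can be written as a small Fortran function (a sketch; the function name uinter2 and the hard cut-off at a_inter are assumptions for illustration, not the original code):

      double precision function uinter2(rij)
c     Sketch: two-body term of the Stillinger-Weber potential as
c     given above; the constants are the ones quoted on this slide.
      implicit none
      double precision rij
      double precision einter, binter, ainter, sinter
      parameter (einter = 5.0d-21)   ! J
      parameter (binter = 9.7d0)     ! Angstrom
      parameter (ainter = 3.9d0)     ! Angstrom (cut-off radius)
      parameter (sinter = 0.007d0)   ! Angstrom
      if (rij .ge. ainter) then
c        beyond the cut-off the exponential factor has driven U to 0
         uinter2 = 0.0d0
      else
         uinter2 = einter * ((binter/rij)**4 - 1.0d0)
     *          * exp(sinter/(rij - ainter))
      end if
      end

The exponent is negative for r_ij < a_inter, so the term goes smoothly to zero at the cut-off radius.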

SLIDE 4

Numerical scheme

  • The evolution of the system over time is described as follows (with time step $\tau$ and mass $m_i$):
  • Position in space:

    $r_i^{k+1} = r_i^k + \frac{\tau}{m_i}\left(p_i^k + \frac{\tau}{2}\,F_i^k\right)$

  • Momentum:

    $p_i^{k+1} = p_i^k + \frac{\tau}{2}\left(F_i^k + F_i^{k+1}\right)$

  • with $F_i^k$ being the total force acting on the i-th atom from all other atoms.
  • Actually not all other atoms, but rather a short-range interaction with a cut-off radius.
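A minimal sketch of one such step in Fortran (dt stands for the time step $\tau$; the names r, p, f, fnew, m and compute_forces are assumptions, not the original code):

c     Sketch: one step of the scheme above; r, p, f hold one
c     coordinate per atom for brevity, fnew the new forces.
      do i = 1, n
         r(i) = r(i) + dt/m(i)*(p(i) + 0.5d0*dt*f(i))
      end do
      call compute_forces(r, fnew, n)   ! F^(k+1) at the new positions
      do i = 1, n
         p(i) = p(i) + 0.5d0*dt*(f(i) + fnew(i))
         f(i) = fnew(i)                 ! becomes F^k of the next step
      end do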

SLIDE 5

Grid setup

  • The computational domain is then decomposed in the X-direction and parallelized using MPI:

    [Figure: the X-Y domain split into slabs along X, one per MPI proc 1, 2, 3.]

  • As we have short-range interactions, the computational domain is organized in cells (see the sketch below):

    [Figure: the slabs subdivided into a grid of cells.]
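For illustration, an atom's x-coordinate determines both its cell column and the process owning it (a sketch; xmin, cell_size and slab_width are assumed names, with uniform slabs along X):

c     Sketch: map an atom's x-coordinate to its global cell column
c     and to the MPI rank owning the enclosing slab.
      icell = int((x - xmin)/cell_size) + 1
      iproc = int((x - xmin)/slab_width)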

SLIDE 6

Grid communication 1/2

  • One-cell interaction range → one column of ghost cells.

    [Figure: a 4×4 grid of cells numbered 1-16 per process; the boundary columns (cells 1, 5, 9, 13 and 4, 8, 12, 16) are exchanged with the neighboring procs as ghost-cell columns.]

  • Actually several pieces of information have to be exchanged:

    – the number of atoms in the ghost cells to send off,
    – the position r_i and momentum p_i of each atom in the ghost cells.

  • After recalculation of the energy of each cell:

    – the number of atoms that move over the boundary and migrate (see the sketch below).
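A minimal sketch of that two-step exchange with MPI_Sendrecv, which is deadlock-free regardless of message size (the integer count nmyright and the ghost arrays nghost, rm_ghost are assumptions; status must be declared integer status(MPI_STATUS_SIZE)):

c     Sketch: exchange first the atom count, then the atom data,
c     in one deadlock-free call each (send right, receive from left).
      call MPI_SENDRECV(nmyright, 1, MPI_INTEGER, Nri, 1,
     *     nghost, 1, MPI_INTEGER, Nli, 1,
     *     MPI_COMM_WORLD, status, ierr)
      call MPI_SENDRECV(rm_myright, 3*nmyright,
     *     MPI_DOUBLE_PRECISION, Nri, 2,
     *     rm_ghost, 3*nghost, MPI_DOUBLE_PRECISION, Nli, 2,
     *     MPI_COMM_WORLD, status, ierr)

The same pattern repeats for the momenta and, after the energy step, for the migrating atoms.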

SLIDE 7

Grid communication 2/2

  • Due to migration, the amount of data exchanged changes over time.
SLIDE 8

Running the code 1/2

  • The code (F77) was immediately portable to the new platform, perfect.
  • Small buglet on the 64-bit platform:

      program test5dd
      implicit DOUBLE PRECISION (a-h, o-z)
      include 'mpif.h'
      DOUBLE PRECISION, allocatable :: rm(:), pm(:)
      DOUBLE PRECISION, allocatable :: rm_myleft(:), pm_myleft(:)
      ....
c--------------------CLOCKWISE--------------------------------
c-----------------Right Interchange---------------------------
      Nri=number_my+1-number_total*int((number_my+1)/(number_total))
      call MPI_SEND(rnmyright,1,MPI_DOUBLE_PRECISION,Nri,1,
     *     MPI_COMM_WORLD,ierr)
c------------------------Left Interchange---------------------
      Nli=number_my-1+number_total*
     *    int((number_total-1-(number_my-1))/(number_total))
      call MPI_RECV(rnnleft,1,MPI_DOUBLE_PRECISION,Nli,1,
     *     MPI_COMM_WORLD,status,ierr)

SLIDE 9

Running the code 2/2

  • Small buglets:

    – The implicit DOUBLE PRECISION (a-h, o-z) is neat, but MPI defines its opaque types to be integer; the status argument needs to be declared as:
          integer :: status(MPI_STATUS_SIZE)
    – (known issue) Why did the number of atoms to be sent to the left/right neighbor have to be sent as a double?
    – Calculating the left/right neighbor once may be better:
          Nri = mod(number_my+1, number_total)
          Nli = mod(number_my-1+number_total, number_total)
    – Possible problem with several functions using:
          implicit real*8 (a-h, o-z)
    – The code depends on the eager protocol for sending messages:

      call MPI_SEND(rnmyright,1,MPI_INTEGER,Nri,1,MPI_COMM_WORLD,ierr)
      call MPI_RECV(rnnleft,1,MPI_INTEGER,Nli,1,MPI_COMM_WORLD,
     *     status,ierr)

Need to either use MPI_Sendrecv or a non-blocking interchange!
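A minimal sketch of the non-blocking variant for the count exchange (the request and status arrays are assumptions; variable names follow the code above):

      integer req(2), statuses(MPI_STATUS_SIZE,2)
c     Sketch: non-blocking interchange of the counts; correct no
c     matter whether the MPI library uses the eager protocol or not.
      call MPI_IRECV(nnleft, 1, MPI_INTEGER, Nli, 1,
     *     MPI_COMM_WORLD, req(1), ierr)
      call MPI_ISEND(nmyright, 1, MPI_INTEGER, Nri, 1,
     *     MPI_COMM_WORLD, req(2), ierr)
      call MPI_WAITALL(2, req, statuses, ierr)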

SLIDE 10

Optimization experiments 1/4

  • One issue already discussed:

      call MPI_Gather(Eknode,1,MPI_DOUBLE_PRECISION,Ektot_array,1,
     *     MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierr)
      call MPI_Gather(Wsurnode,1,MPI_DOUBLE_PRECISION,Wsur_array,1,
     *     MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierr)
      call MPI_Gather(UNode,1,MPI_DOUBLE_PRECISION,UNode_array,1,
     *     MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierr)
      if (number_my.eq.0) then
         Ektot=0.
         Wsurtot=0.
         Utot=0.
         do i=1,number_total
            Ektot=Ektot+Ektot_array(i)
            Wsurtot=Wsurtot+Wsur_array(i)
            Utot=Utot+UNode_array(i)
         end do
      end if

SLIDE 11

Optimization experiments 2/4

  • May be replaced by a single call to MPI_Reduce:

      reduce_in_array(1) = EKnode
      reduce_in_array(2) = Wsurnode
      reduce_in_array(3) = UNode
      call MPI_Reduce(reduce_in_array, reduce_out_array, 3,
     *     MPI_DOUBLE_PRECISION, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
      if (number_my.eq.0) then
         EKtot=reduce_out_array(1)
         Wsurtot=reduce_out_array(2)
         Utot=reduce_out_array(3)
      end if

  • This saves:

    – 3 collective operations, replaced by a single one, plus
    – 3 allocations/deallocations of arrays sized by number_total.

  • On strider this reduces the MPI time from 11 to 9 seconds (8 procs).

SLIDE 12

Optimization experiments 3/4

  • Several messages are sent to the next neighbor.
  • Combine them into one message using MPI_Pack:

      call MPI_PACK_SIZE (2 * nmyright3, MPI_DOUBLE_PRECISION,
     *     MPI_COMM_WORLD, buffer_pack_size, ierr)
      if (buffer_pack_size > send_buffer_size) then
         if (send_buffer_size > 0) then
            deallocate (send_buffer)
         end if
         send_buffer_size = 2 * buffer_pack_size
         allocate (send_buffer(send_buffer_size))
      end if
      buffer_pos = 0
      call MPI_PACK(rm_myright, nmyright3, MPI_DOUBLE_PRECISION,
     *     send_buffer, send_buffer_size, buffer_pos,
     *     MPI_COMM_WORLD, ierr)
      call MPI_PACK(pm_myright, nmyright3, MPI_DOUBLE_PRECISION,
     *     send_buffer, send_buffer_size, buffer_pos,
     *     MPI_COMM_WORLD, ierr)
      call MPI_ISEND(send_buffer, buffer_pos, MPI_PACKED,
     *     Nri, 7, MPI_COMM_WORLD, req(1), ierr)
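The matching receive side would then unpack both arrays from the one message (a sketch; recv_buffer, recv_buffer_size and the ghost arrays are assumptions):

c     Sketch: receive the packed message and unpack both arrays
c     again; buffer_pos advances through the buffer as in MPI_PACK.
      call MPI_RECV(recv_buffer, recv_buffer_size, MPI_PACKED,
     *     Nli, 7, MPI_COMM_WORLD, status, ierr)
      buffer_pos = 0
      call MPI_UNPACK(recv_buffer, recv_buffer_size, buffer_pos,
     *     rm_ghost, nghost3, MPI_DOUBLE_PRECISION,
     *     MPI_COMM_WORLD, ierr)
      call MPI_UNPACK(recv_buffer, recv_buffer_size, buffer_pos,
     *     pm_ghost, nghost3, MPI_DOUBLE_PRECISION,
     *     MPI_COMM_WORLD, ierr)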

SLIDE 13

Optimization experiments 4/4

  • This is actually slower: execution time goes up from 250 s to 280 s.
  • Another solution to try: packing via simple memory copies (memcpy).
  • Currently testing: separate non-blocking messages (if we get nodes on strider).
  • Another option for finding bottlenecks: using Vampir or Mpitrace & Paraver.

SLIDE 14

Scalability on MVS-1000

[Figure: Performance of the 3D code. Execution time t (s) plotted against the inverse processor count 1/P; measured times: 4186.2, 3238.4, 2683.0, 2069.2, 1861.3, 1713.2, 1578.9 s.]

  • x-axis: inverse of the number of processors (1/P)
  • From 3 to 10 procs.
  • Linear scalability.
  • 50000 atoms
  • 5000 iterations
  • On MVS-1000 using Intel P3-800

Data by A. V. Utkin.

SLIDE 15

Scalability on AMD-Opteron

  • Measurements done using original code (buglets fixed):

      Procs   Total time (s)   MPI time (s)
        2         1062.7          11.69
        4          486.4          15.82
        8          248.8           2.71
       12          174.6          20.13
       16          145.6          20.99

  • Measurements done with the new version (MPI_Gather optimization + non-blocking):
SLIDE 16

Outlook

  • Allow better scalability by hiding communication behind computation:

    – if possible; as yet no way has been found to split the force computation so that the ghost cells can be computed separately, before the internal domain (see the sketch below).

  • Use MPI_Sendrecv in more places.
  • Dynamically shift the domain boundaries to get rid of the load imbalance (which is the main reason for looking at the code):

    [Figure: slab decomposition along X, with the boundaries between MPI procs 1, 2, 3 shifted to balance the load.]
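If such a split is found, the usual overlap pattern would look like this (a sketch under that assumption; all routine and variable names are hypothetical):

c     Sketch: post the ghost exchange, compute the interior cells
c     (no ghost data needed), then finish the boundary cells.
      call MPI_IRECV(ghost_in, nghost3, MPI_DOUBLE_PRECISION,
     *     Nli, 3, MPI_COMM_WORLD, req(1), ierr)
      call MPI_ISEND(ghost_out, nsend3, MPI_DOUBLE_PRECISION,
     *     Nri, 3, MPI_COMM_WORLD, req(2), ierr)
      call compute_interior_forces()
      call MPI_WAITALL(2, req, statuses, ierr)
      call compute_boundary_forces()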

SLIDE 17

Acknowledgements

Thanks to Dr. Andrey V. Utkin for providing his code and papers, and for explaining the code.