1. Analysis of a Parallel 3D MD Application
Russian-German School on High-Performance Computer Systems, 27th June - 6th July, Novosibirsk
Day 8, 6th of July, 2005
Institute of Theoretical & Applied Mechanics, Novosibirsk
HLRS, University of Stuttgart (Höchstleistungsrechenzentrum Stuttgart)

2. Overview of the talk
• Overview
• Background of the application
• Numerical scheme
• Grid setup and communication
• Optimization experiments
• Scalability of the code
• Summary

3. Background of the application
• Computation of the ignition of condensed material and the transition from the burning to the detonation phase.
• Here a 3D (mono)crystal (AgN₃) is simulated to check the applicability of the most general non-stationary continuum mechanics equations.
• Interactions between molecules in the crystal are modelled with the two-body term of the Stillinger-Weber potential:
      U_inter(r_ij) = e_inter [ (b_inter / r_ij)^4 - 1 ] · exp( σ_inter / (r_ij - a_inter) )
  with e_inter = 5×10⁻²¹ J, b_inter = 9.7 Å, a_inter = 3.9 Å, and σ_inter = 0.007 Å.
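A minimal sketch of how this pair term could be evaluated, assuming the reconstructed formula above and a cut-off at r_ij = a_inter (where the exponential factor vanishes); the function name and the cut-off handling are illustrative and not taken from the original code:

c     Sketch (not from the original code): evaluate the reconstructed
c     two-body term for one pair distance rij (in Angstrom).
c     Parameter values are the ones quoted on the slide.
      DOUBLE PRECISION FUNCTION u_inter(rij)
      IMPLICIT NONE
      DOUBLE PRECISION rij
      DOUBLE PRECISION e_int, b_int, a_int, s_int
      PARAMETER (e_int = 5.0d-21)     ! J
      PARAMETER (b_int = 9.7d0)       ! Angstrom
      PARAMETER (a_int = 3.9d0)       ! Angstrom
      PARAMETER (s_int = 0.007d0)     ! Angstrom
      IF (rij .LT. a_int) THEN
         u_inter = e_int * ((b_int/rij)**4 - 1.0d0)
     *           * EXP(s_int/(rij - a_int))
      ELSE
         u_inter = 0.0d0
      END IF
      END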

4. Numerical scheme
• The evolution of the system over time is described as:
• Position in space:  r_i^{k+1} = r_i^k + (Δt/m_i) p_i^k + (Δt²/(2 m_i)) F_i^k
• Momentum:           p_i^{k+1} = p_i^k + (Δt/2) (F_i^k + F_i^{k+1})
• with F_i^k being the total force acting on the i-th atom from all other atoms.
• Actually not all other atoms, but rather a short-range interaction with a cut-off radius.
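A minimal sketch of one such integration step in the style of the application's F77 code; the routine and all names in it (md_step, r, p, f, fnew, am, dt) are illustrative assumptions, not taken from the original code, and the force recomputation itself is left out:

c     Sketch of one time step of the scheme above for n local atoms.
c     r, p, f, fnew are (3,n) arrays; am(i) is the mass of atom i,
c     dt the time step.
      subroutine md_step(n, r, p, f, fnew, am, dt)
      implicit none
      integer n, i, j
      double precision r(3,n), p(3,n), f(3,n), fnew(3,n), am(n), dt
      do i = 1, n
         do j = 1, 3
            r(j,i) = r(j,i) + dt/am(i)*p(j,i)
     *             + dt*dt/(2.0d0*am(i))*f(j,i)
         end do
      end do
c     ... recompute the forces fnew at the new positions here ...
      do i = 1, n
         do j = 1, 3
            p(j,i) = p(j,i) + dt/2.0d0*(f(j,i) + fnew(j,i))
            f(j,i) = fnew(j,i)
         end do
      end do
      end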

5. Grid setup
• The computational domain is then decomposed:
  [Figure: slab decomposition of the domain among MPI procs 0, 1, 2, 3; axes Y (vertical) and X (horizontal)]
• in the X-direction, parallelized using MPI.
• As we have short-range interactions, the computational domain is organized in cells.
  [Figure: one slab subdivided into cells]
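As an illustration of this decomposition, a small sketch with hypothetical names (owner_rank, cell_column, xlen, xmin_local, rcut); the actual mapping used in the original code is not shown on the slides:

c     Sketch (hypothetical): map an atom's x coordinate to the owning
c     MPI rank for a uniform slab decomposition of [0, xlen) along X,
c     and to a local cell column of width rcut (the cut-off radius).
      integer function owner_rank(x, xlen, number_total)
      implicit none
      double precision x, xlen
      integer number_total
      owner_rank = int(x / xlen * number_total)
      if (owner_rank .ge. number_total) owner_rank = number_total - 1
      end

      integer function cell_column(x, xmin_local, rcut)
      implicit none
      double precision x, xmin_local, rcut
      cell_column = int((x - xmin_local) / rcut) + 1
      end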

6. Grid communication 1/2
• One cell interaction → one column of ghost cells.
  [Figure: 4×4 grid of cells numbered 1-16 with one ghost-cell column on each side]
• Actually several pieces of information have to be exchanged:
  – Number of atoms in the ghost cells to send off.
  – Position r_i and momentum p_i of each atom i in the ghost cells.
• After recalculation of the energy of each cell:
  – Number of atoms moving over the boundary, which migrate.
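One way this two-step exchange could look, sketched here with MPI_Sendrecv and assumed variable names (nsend_right, nrecv_left, rm_fromleft); this is an illustration only, the original code uses separate MPI_SEND/MPI_RECV calls as shown on the later slides:

c     Sketch: (1) exchange the atom counts so that the receiver can
c     size its buffers, (2) exchange the atom data itself
c     (3 coordinates per atom) with the right/left neighbours Nri/Nli.
      integer nsend_right, nrecv_left
      integer status(MPI_STATUS_SIZE), ierr
      call MPI_SENDRECV(nsend_right, 1, MPI_INTEGER, Nri, 1,
     *                  nrecv_left,  1, MPI_INTEGER, Nli, 1,
     *                  MPI_COMM_WORLD, status, ierr)
      call MPI_SENDRECV(rm_myright, 3*nsend_right,
     *                  MPI_DOUBLE_PRECISION, Nri, 2,
     *                  rm_fromleft, 3*nrecv_left,
     *                  MPI_DOUBLE_PRECISION, Nli, 2,
     *                  MPI_COMM_WORLD, status, ierr)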

7. Grid communication 2/2
• Due to migration, the amount of data exchanged changes from step to step; a buffer-growth sketch for the receiving side follows below.
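A sketch of how this varying message size could be handled on the receiving side, following the same grow-on-demand buffer pattern the code itself uses for the MPI_Pack experiment on slide 12; nrecv_left, recv_buf_size and rm_fromleft are assumed names:

c     Sketch: grow the receive buffer when the incoming count
c     (obtained from the preceding count exchange) no longer fits.
      if (3*nrecv_left .gt. recv_buf_size) then
         if (recv_buf_size .gt. 0) then
            deallocate (rm_fromleft)
         end if
         recv_buf_size = 2 * 3*nrecv_left     ! grow with some headroom
         allocate (rm_fromleft(recv_buf_size))
      end if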

8. Running the code 1/2
• Code (F77) immediately portable to the new platform, perfect.
• Small buglet on the 64-bit platform:

      program test5dd
      implicit DOUBLE PRECISION (a-h, o-z)
      include 'mpif.h'
      DOUBLE PRECISION, allocatable :: rm(:), pm(:)
      DOUBLE PRECISION, allocatable :: rm_myleft(:), pm_myleft(:)
      ....
c--------------------CLOCKWISE--------------------------------
c-----------------Right Interchange---------------------------
      Nri=number_my+1-number_total*int((number_my+1)/(number_total))
      call MPI_SEND(rnmyright,1,MPI_DOUBLE_PRECISION,Nri,1,
     *              MPI_COMM_WORLD,ierr)
c------------------------Left Interchange---------------------
      Nli=number_my-1+number_total*
     *    int((number_total-1-(number_my-1))/(number_total))
      call MPI_RECV(rnnleft,1,MPI_DOUBLE_PRECISION,Nli,1,
     *              MPI_COMM_WORLD,status,ierr)

9. Running the code 2/2
• Small buglets:
  – The implicit double precision (a-h, o-z) is neat, but MPI defines its opaque types to be integer; the status therefore needs to be declared as:
        integer :: status(MPI_STATUS_SIZE)
  – (known issue) Why did the number of atoms to be sent to the left/right neighbour have to be sent as a double?
  – Calculating the left/right neighbour once may be better:
        Nri = mod(number_my+1, number_total)
        Nli = mod(number_my-1+number_total, number_total)
  – Possible problem with several functions using:
        implicit real*8 (a-h, o-z)
  – The code depends on the eager protocol for sending messages:
        call MPI_SEND(rnmyright,1,MPI_INTEGER,Nri,1,MPI_COMM_WORLD,ierr)
        call MPI_RECV(rnnleft,1,MPI_INTEGER,Nli,1,MPI_COMM_WORLD,status,ierr)
    Need to either use MPI_Sendrecv or a non-blocking interchange (see the sketch below)!
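As a sketch of the non-blocking alternative, the count exchange rewritten with MPI_Irecv/MPI_Isend; nmyright and nnleft are assumed to be the integer counterparts of the rnmyright/rnnleft variables above:

c     Sketch: post the receive first, then the send, then wait for
c     both, so completion no longer depends on the eager protocol.
      integer :: req(2), statuses(MPI_STATUS_SIZE,2), ierr
      call MPI_IRECV(nnleft,  1, MPI_INTEGER, Nli, 1,
     *               MPI_COMM_WORLD, req(1), ierr)
      call MPI_ISEND(nmyright, 1, MPI_INTEGER, Nri, 1,
     *               MPI_COMM_WORLD, req(2), ierr)
      call MPI_WAITALL(2, req, statuses, ierr)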

10. Optimization experiments 1/4
• One issue already discussed:

      call MPI_Gather(Eknode,1,MPI_DOUBLE_PRECISION,Ektot_array,1,
     *                MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierr)
      call MPI_Gather(Wsurnode,1,MPI_DOUBLE_PRECISION,Wsur_array,1,
     *                MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierr)
      call MPI_Gather(UNode,1,MPI_DOUBLE_PRECISION,UNode_array,1,
     *                MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierr)
      if (number_my.eq.0) then
         Ektot=0.
         Wsurtot=0.
         Utot=0.
         do i=1,number_total
            Ektot=Ektot+Ektot_array(i)
            Wsurtot=Wsurtot+Wsur_array(i)
            Utot=Utot+UNode_array(i)
         end do
      end if

11. Optimization experiments 2/4
• May be replaced by a single call to MPI_Reduce:

      reduce_in_array(1) = EKnode
      reduce_in_array(2) = Wsurnode
      reduce_in_array(3) = UNode
      call MPI_Reduce(reduce_in_array, reduce_out_array, 3,
     *                MPI_DOUBLE_PRECISION, MPI_SUM, 0,
     *                MPI_COMM_WORLD, ierr)
      if (number_my.eq.0) then
         EKtot=reduce_out_array(1)
         Wsurtot=reduce_out_array(2)
         Utot=reduce_out_array(3)
      end if

• This reduces:
  – 3 collective operations, plus
  – 3 allocations / deallocations dependent on number_total.
  – On strider this reduces the MPI time from 11 to 9 seconds (8 procs).

12. Optimization experiments 3/4
• Several messages are sent to the next neighbour. Integrate them into one message using MPI_Pack:

      call MPI_PACK_SIZE(2*nmyright3, MPI_DOUBLE_PRECISION,
     *                   MPI_COMM_WORLD, buffer_pack_size, ierr)
      if (buffer_pack_size > send_buffer_size) then
         if (send_buffer_size > 0) then
            deallocate (send_buffer)
         end if
         send_buffer_size = 2 * buffer_pack_size
         allocate (send_buffer(send_buffer_size))
      end if
      buffer_pos = 0
      call MPI_PACK(rm_myright, nmyright3, MPI_DOUBLE_PRECISION,
     *              send_buffer, send_buffer_size, buffer_pos,
     *              MPI_COMM_WORLD, ierr)
      call MPI_PACK(pm_myright, nmyright3, MPI_DOUBLE_PRECISION,
     *              send_buffer, send_buffer_size, buffer_pos,
     *              MPI_COMM_WORLD, ierr)
      call MPI_ISEND(send_buffer, buffer_pos, MPI_PACKED,
     *               Nri, 7, MPI_COMM_WORLD, req(1), ierr)
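For completeness, a sketch of the matching receive side, which is not shown on the slide; recv_buffer, recv_size, nfromleft3, rm_fromleft and pm_fromleft are assumed names that mirror the send side:

c     Sketch: size the buffer like the sender, receive the packed
c     message, and unpack positions and momenta again.
      call MPI_PACK_SIZE(2*nfromleft3, MPI_DOUBLE_PRECISION,
     *                   MPI_COMM_WORLD, recv_size, ierr)
c     (recv_buffer must be allocated with at least recv_size bytes)
      call MPI_RECV(recv_buffer, recv_size, MPI_PACKED,
     *              Nli, 7, MPI_COMM_WORLD, status, ierr)
      buffer_pos = 0
      call MPI_UNPACK(recv_buffer, recv_size, buffer_pos,
     *                rm_fromleft, nfromleft3, MPI_DOUBLE_PRECISION,
     *                MPI_COMM_WORLD, ierr)
      call MPI_UNPACK(recv_buffer, recv_size, buffer_pos,
     *                pm_fromleft, nfromleft3, MPI_DOUBLE_PRECISION,
     *                MPI_COMM_WORLD, ierr)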

13. Optimization experiments 4/4
• This is actually slower: execution time goes up from 250 s to 280 s.
• Another solution to try: using a simple memcpy.
• Currently testing: using separate non-blocking messages (if we get nodes on strider); a sketch follows below.
• Another option to find bottlenecks: using Vampir or Mpitrace & Paraver.
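A sketch of what the "separate non-blocking messages" variant could look like; the variable names mirror the earlier slides and are assumptions, not the code currently under test:

c     Sketch: positions and momenta travel as two independent
c     non-blocking messages, distinguished by their tags, and all
c     four requests are completed together.
      integer :: req(4), statuses(MPI_STATUS_SIZE,4), ierr
      call MPI_IRECV(rm_fromleft, nfromleft3, MPI_DOUBLE_PRECISION,
     *               Nli, 7, MPI_COMM_WORLD, req(1), ierr)
      call MPI_IRECV(pm_fromleft, nfromleft3, MPI_DOUBLE_PRECISION,
     *               Nli, 8, MPI_COMM_WORLD, req(2), ierr)
      call MPI_ISEND(rm_myright, nmyright3, MPI_DOUBLE_PRECISION,
     *               Nri, 7, MPI_COMM_WORLD, req(3), ierr)
      call MPI_ISEND(pm_myright, nmyright3, MPI_DOUBLE_PRECISION,
     *               Nri, 8, MPI_COMM_WORLD, req(4), ierr)
      call MPI_WAITALL(4, req, statuses, ierr)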

14. Scalability on MVS-1000
[Figure: execution time t (s) plotted against the inverse processor count 1/P; measured times range from 1578.9 s to 4186.2 s.]
• Performance of the 3F code.
• From 3 to 10 procs, linear scalability.
• 50000 atoms, 5000 iterations.
• On MVS-1000 using Intel P3-800.
• Data by A.V. Utkin.

15. Scalability on AMD-Opteron
• Measurements done using the original code (buglets fixed):

  Procs   Total time [s]   MPI time [s]
      2          1062.7          11.69
      4           486.4          15.82
      8           248.8           2.71
     12           174.6          20.13
     16           145.6          20.99

• Measurements done with the new version (MPI_Gather + non-blocking):

16. Outlook
• Allow better scalability by hiding communication behind computation:
  – if possible; as yet no way has been found to split the work so that the ghost cells can be computed separately before the internal domain.
• Use MPI_Sendrecv in more places.
• Dynamically shift the domain boundaries to get rid of the load imbalance (which is the main reason for looking at the code); a rebalancing sketch follows below.
  [Figure: slab decomposition along X among MPI procs 0, 1, 2, 3; axes Y and X]
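One conceivable form of such a boundary shift, sketched with hypothetical names and not taken from the author's code: build a global histogram of atoms per cell column along X and place the new boundaries so that every rank owns roughly the same number of atoms.

c     Sketch (hypothetical, not the author's method): choose new slab
c     boundaries along X from a global per-column atom histogram so
c     that each of the nprocs ranks owns roughly the same atom count.
      subroutine rebalance(hist_local, ncols, nprocs, newbound)
      implicit none
      include 'mpif.h'
      integer ncols, nprocs
      integer hist_local(ncols), newbound(0:nprocs)
      integer hist(ncols)              ! global histogram (F90 automatic)
      integer ierr, i, p, cumul, ntarget
      call MPI_ALLREDUCE(hist_local, hist, ncols, MPI_INTEGER,
     *                   MPI_SUM, MPI_COMM_WORLD, ierr)
      ntarget = 0
      do i = 1, ncols
         ntarget = ntarget + hist(i)
      end do
      ntarget = ntarget / nprocs
      cumul = 0
      p = 1
      newbound(0) = 0
      do i = 1, ncols
         cumul = cumul + hist(i)
         if (p .lt. nprocs .and. cumul .ge. p*ntarget) then
            newbound(p) = i
            p = p + 1
         end if
      end do
      newbound(nprocs) = ncols
      end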

17. Acknowledgements
Thanks to Dr. Andrey V. Utkin for providing his code and papers and for explaining the code.
