SLIDE 1

Analysis of a Parallel 3D MD application

Russian-German School on High-Performance Computer Systems, 27th June - 6th July, Novosibirsk

  • Day 8, 6th of July, 2005
  • Institute of Theoretical & Applied Mechanics, Novosibirsk

HLRS, University of Stuttgart

SLIDE 2

Overview of the talk

  • Overview
  • Background of application
  • Numerical scheme
  • Grid setup and communication
  • Optimization experiments
  • Scalability of code
  • Summary
SLIDE 3

Background of the application

  • Computation of the ignition of condensed material and the transition from the burning to the detonation phase.
  • Here a 3D (mono)crystal (AgN3) is simulated to check the applicability of the most general non-stationary continuum mechanics equations.

  • Interactions between molecules in the crystal use the two-body term of the Stillinger-Weber potential:

    $U_{inter} = e_{inter}\left[\left(\frac{b_{inter}}{r_{ij}}\right)^4 - 1\right] e^{\,\sigma_{inter}/(r_{ij}-a_{inter})}$

    with $e_{inter} = 5\times 10^{-21}$ J, $b_{inter} = 9.7$ Å, $a_{inter} = 3.9$ Å, and $\sigma_{inter} = 0.007$ Å.
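As a worked example, the pair term can be written as a small Fortran function (a sketch; the function name uinter2 and the hard cut-off at a_inter are assumptions for illustration, not the original code):

      double precision function uinter2(rij)
c     Sketch: two-body term of the Stillinger-Weber potential as
c     given above; the constants are the ones quoted on this slide.
      implicit none
      double precision rij
      double precision einter, binter, ainter, sinter
      parameter (einter = 5.0d-21)   ! J
      parameter (binter = 9.7d0)     ! Angstrom
      parameter (ainter = 3.9d0)     ! Angstrom (cut-off radius)
      parameter (sinter = 0.007d0)   ! Angstrom
      if (rij .ge. ainter) then
c        beyond the cut-off the exponential factor has driven U to 0
         uinter2 = 0.0d0
      else
         uinter2 = einter * ((binter/rij)**4 - 1.0d0)
     *          * exp(sinter/(rij - ainter))
      end if
      end

The exponent is negative for r_ij < a_inter, so the term goes smoothly to zero at the cut-off radius.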

SLIDE 4

Numerical scheme

  • The evolution of the system over time is described as follows (with time step $\tau$ and mass $m_i$):
  • Position in space:

    $r_i^{k+1} = r_i^k + \frac{\tau}{m_i}\left(p_i^k + \frac{\tau}{2}\,F_i^k\right)$

  • Momentum:

    $p_i^{k+1} = p_i^k + \frac{\tau}{2}\left(F_i^k + F_i^{k+1}\right)$

  • with $F_i^k$ being the total force acting on the i-th atom from all other atoms.
  • Actually not all other atoms, but rather a short-range interaction with a cut-off radius.
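A minimal sketch of one such step in Fortran (dt stands for the time step $\tau$; the names r, p, f, fnew, m and compute_forces are assumptions, not the original code):

c     Sketch: one step of the scheme above; r, p, f hold one
c     coordinate per atom for brevity, fnew the new forces.
      do i = 1, n
         r(i) = r(i) + dt/m(i)*(p(i) + 0.5d0*dt*f(i))
      end do
      call compute_forces(r, fnew, n)   ! F^(k+1) at the new positions
      do i = 1, n
         p(i) = p(i) + 0.5d0*dt*(f(i) + fnew(i))
         f(i) = fnew(i)                 ! becomes F^k of the next step
      end do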

SLIDE 5

Grid setup

  • The computational domain is then decomposed in the X-direction and parallelized using MPI:

    [Figure: the X-Y domain split into slabs along X, one per MPI proc 1, 2, 3.]

  • As we have short-range interactions, the computational domain is organized in cells (see the sketch below):

    [Figure: the slabs subdivided into a grid of cells.]
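For illustration, an atom's x-coordinate determines both its cell column and the process owning it (a sketch; xmin, cell_size and slab_width are assumed names, with uniform slabs along X):

c     Sketch: map an atom's x-coordinate to its global cell column
c     and to the MPI rank owning the enclosing slab.
      icell = int((x - xmin)/cell_size) + 1
      iproc = int((x - xmin)/slab_width)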

SLIDE 6

Grid communication 1/2

  • One-cell interaction range → one column of ghost cells.

    [Figure: a 4×4 grid of cells numbered 1-16 per process; the boundary columns (cells 1, 5, 9, 13 and 4, 8, 12, 16) are exchanged with the neighboring procs as ghost-cell columns.]

  • Actually several pieces of information have to be exchanged:

    – the number of atoms in the ghost cells to send off,
    – the position r_i and momentum p_i of each atom in the ghost cells.

  • After recalculation of the energy of each cell:

    – the number of atoms that move over the boundary and migrate (see the sketch below).
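A minimal sketch of that two-step exchange with MPI_Sendrecv, which is deadlock-free regardless of message size (the integer count nmyright and the ghost arrays nghost, rm_ghost are assumptions; status must be declared integer status(MPI_STATUS_SIZE)):

c     Sketch: exchange first the atom count, then the atom data,
c     in one deadlock-free call each (send right, receive from left).
      call MPI_SENDRECV(nmyright, 1, MPI_INTEGER, Nri, 1,
     *     nghost, 1, MPI_INTEGER, Nli, 1,
     *     MPI_COMM_WORLD, status, ierr)
      call MPI_SENDRECV(rm_myright, 3*nmyright,
     *     MPI_DOUBLE_PRECISION, Nri, 2,
     *     rm_ghost, 3*nghost, MPI_DOUBLE_PRECISION, Nli, 2,
     *     MPI_COMM_WORLD, status, ierr)

The same pattern repeats for the momenta and, after the energy step, for the migrating atoms.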

SLIDE 7

Grid communication 2/2

  • Due to migration, the amount of data exchanged changes over time.
SLIDE 8

Running the code 1/2

  • The code (F77) was immediately portable to the new platform, perfect.
  • Small buglet on the 64-bit platform:

      program test5dd
      implicit DOUBLE PRECISION (a-h, o-z)
      include 'mpif.h'
      DOUBLE PRECISION, allocatable :: rm(:), pm(:)
      DOUBLE PRECISION, allocatable :: rm_myleft(:), pm_myleft(:)
      ....
c--------------------CLOCKWISE--------------------------------
c-----------------Right Interchange---------------------------
      Nri=number_my+1-number_total*int((number_my+1)/(number_total))
      call MPI_SEND(rnmyright,1,MPI_DOUBLE_PRECISION,Nri,1,
     *     MPI_COMM_WORLD,ierr)
c------------------------Left Interchange---------------------
      Nli=number_my-1+number_total*
     *    int((number_total-1-(number_my-1))/(number_total))
      call MPI_RECV(rnnleft,1,MPI_DOUBLE_PRECISION,Nli,1,
     *     MPI_COMM_WORLD,status,ierr)

SLIDE 9

Running the code 2/2

  • Small buglets:

    – The implicit DOUBLE PRECISION (a-h, o-z) is neat, but MPI defines its opaque types to be integer; the status argument needs to be declared as:
          integer :: status(MPI_STATUS_SIZE)
    – (known issue) Why did the number of atoms to be sent to the left/right neighbor have to be sent as a double?
    – Calculating the left/right neighbor once may be better:
          Nri = mod(number_my+1, number_total)
          Nli = mod(number_my-1+number_total, number_total)
    – Possible problem with several functions using:
          implicit real*8 (a-h, o-z)
    – The code depends on the eager protocol for sending messages:

      call MPI_SEND(rnmyright,1,MPI_INTEGER,Nri,1,MPI_COMM_WORLD,ierr)
      call MPI_RECV(rnnleft,1,MPI_INTEGER,Nli,1,MPI_COMM_WORLD,
     *     status,ierr)

Need to either use MPI_Sendrecv or a non-blocking interchange!
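A minimal sketch of the non-blocking variant for the count exchange (the request and status arrays are assumptions; variable names follow the code above):

      integer req(2), statuses(MPI_STATUS_SIZE,2)
c     Sketch: non-blocking interchange of the counts; correct no
c     matter whether the MPI library uses the eager protocol or not.
      call MPI_IRECV(nnleft, 1, MPI_INTEGER, Nli, 1,
     *     MPI_COMM_WORLD, req(1), ierr)
      call MPI_ISEND(nmyright, 1, MPI_INTEGER, Nri, 1,
     *     MPI_COMM_WORLD, req(2), ierr)
      call MPI_WAITALL(2, req, statuses, ierr)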

SLIDE 10

Optimization experiments 1/4

  • One issue already discussed:

      call MPI_Gather(Eknode,1,MPI_DOUBLE_PRECISION,Ektot_array,1,
     *     MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierr)
      call MPI_Gather(Wsurnode,1,MPI_DOUBLE_PRECISION,Wsur_array,1,
     *     MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierr)
      call MPI_Gather(UNode,1,MPI_DOUBLE_PRECISION,UNode_array,1,
     *     MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierr)
      if (number_my.eq.0) then
         Ektot=0.
         Wsurtot=0.
         Utot=0.
         do i=1,number_total
            Ektot=Ektot+Ektot_array(i)
            Wsurtot=Wsurtot+Wsur_array(i)
            Utot=Utot+UNode_array(i)
         end do
      end if

SLIDE 11

Optimization experiments 2/4

  • May be replaced by a single call to MPI_Reduce:

      reduce_in_array(1) = EKnode
      reduce_in_array(2) = Wsurnode
      reduce_in_array(3) = UNode
      call MPI_Reduce(reduce_in_array, reduce_out_array, 3,
     *     MPI_DOUBLE_PRECISION, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
      if (number_my.eq.0) then
         EKtot=reduce_out_array(1)
         Wsurtot=reduce_out_array(2)
         Utot=reduce_out_array(3)
      end if

  • This saves:

    – 3 collective operations, replaced by a single one, plus
    – 3 allocations/deallocations of arrays sized by number_total.

  • On strider this reduces the MPI time from 11 to 9 seconds (8 procs).

SLIDE 12

Optimization experiments 3/4

  • Several messages are sent to the next neighbor.
  • Combine them into one message using MPI_Pack:

      call MPI_PACK_SIZE (2 * nmyright3, MPI_DOUBLE_PRECISION,
     *     MPI_COMM_WORLD, buffer_pack_size, ierr)
      if (buffer_pack_size > send_buffer_size) then
         if (send_buffer_size > 0) then
            deallocate (send_buffer)
         end if
         send_buffer_size = 2 * buffer_pack_size
         allocate (send_buffer(send_buffer_size))
      end if
      buffer_pos = 0
      call MPI_PACK(rm_myright, nmyright3, MPI_DOUBLE_PRECISION,
     *     send_buffer, send_buffer_size, buffer_pos,
     *     MPI_COMM_WORLD, ierr)
      call MPI_PACK(pm_myright, nmyright3, MPI_DOUBLE_PRECISION,
     *     send_buffer, send_buffer_size, buffer_pos,
     *     MPI_COMM_WORLD, ierr)
      call MPI_ISEND(send_buffer, buffer_pos, MPI_PACKED,
     *     Nri, 7, MPI_COMM_WORLD, req(1), ierr)
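The matching receive side would then unpack both arrays from the one message (a sketch; recv_buffer, recv_buffer_size and the ghost arrays are assumptions):

c     Sketch: receive the packed message and unpack both arrays
c     again; buffer_pos advances through the buffer as in MPI_PACK.
      call MPI_RECV(recv_buffer, recv_buffer_size, MPI_PACKED,
     *     Nli, 7, MPI_COMM_WORLD, status, ierr)
      buffer_pos = 0
      call MPI_UNPACK(recv_buffer, recv_buffer_size, buffer_pos,
     *     rm_ghost, nghost3, MPI_DOUBLE_PRECISION,
     *     MPI_COMM_WORLD, ierr)
      call MPI_UNPACK(recv_buffer, recv_buffer_size, buffer_pos,
     *     pm_ghost, nghost3, MPI_DOUBLE_PRECISION,
     *     MPI_COMM_WORLD, ierr)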

SLIDE 13

Optimization experiments 4/4

  • This is actually slower: execution time goes up from 250 s to 280 s.
  • Another solution to try: packing via simple memory copies (memcpy).
  • Currently testing: separate non-blocking messages (if we get nodes on strider).
  • Another option for finding bottlenecks: using Vampir or Mpitrace & Paraver.

SLIDE 14

Scalability on MVS-1000

[Figure: Performance of the 3D code. Execution time t (s) plotted against the inverse processor count 1/P; measured times: 4186.2, 3238.4, 2683.0, 2069.2, 1861.3, 1713.2, 1578.9 s.]

  • x-axis: inverse of the number of processors (1/P)
  • From 3 to 10 procs.
  • Linear scalability.
  • 50000 atoms
  • 5000 iterations
  • On MVS-1000 using Intel P3-800

Data by A. V. Utkin.

SLIDE 15

Scalability on AMD-Opteron

  • Measurements done using original code (buglets fixed):

      Procs   Total time (s)   MPI time (s)
        2         1062.7          11.69
        4          486.4          15.82
        8          248.8           2.71
       12          174.6          20.13
       16          145.6          20.99

  • Measurements done with the new version (MPI_Gather optimization + non-blocking):
SLIDE 16

Outlook

  • Allow better scalability by hiding communication behind computation:

    – if possible; as yet no way has been found to split the force computation so that the ghost cells can be computed separately, before the internal domain (see the sketch below).

  • Use MPI_Sendrecv in more places.
  • Dynamically shift the domain boundaries to get rid of the load imbalance (which is the main reason for looking at the code):

    [Figure: slab decomposition along X, with the boundaries between MPI procs 1, 2, 3 shifted to balance the load.]
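If such a split is found, the usual overlap pattern would look like this (a sketch under that assumption; all routine and variable names are hypothetical):

c     Sketch: post the ghost exchange, compute the interior cells
c     (no ghost data needed), then finish the boundary cells.
      call MPI_IRECV(ghost_in, nghost3, MPI_DOUBLE_PRECISION,
     *     Nli, 3, MPI_COMM_WORLD, req(1), ierr)
      call MPI_ISEND(ghost_out, nsend3, MPI_DOUBLE_PRECISION,
     *     Nri, 3, MPI_COMM_WORLD, req(2), ierr)
      call compute_interior_forces()
      call MPI_WAITALL(2, req, statuses, ierr)
      call compute_boundary_forces()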

SLIDE 17

Acknowledgements

Thanks to Dr. Andrey V. Utkin for providing his code and papers, and for explaining the code.