Analysis and Optimization of a Molecular Dynamics Code using PAPI and the Vampir Toolchain


SLIDE 1

Center for Information Services and High Performance Computing (ZIH)

Analysis and Optimization of a Molecular Dynamics Code using PAPI and the Vampir Toolchain

May 2, 2012

Thomas William, Zellescher Weg 12, Willers-Bau A 34, +49 351 463 32446, Thomas.William@ZIH.TU-Dresden.de

SLIDE 2

Overview

1. Introduction
2. Serial Analysis
3. PAPI measurements
4. Source Code Analysis
5. Source Code Optimization
6. Tracing and Visualization
7. Conclusion

SLIDE 3

Overview

1. Introduction

SLIDE 4

Introduction

IU ZIH FutureGrid MD

SLIDE 5

Introduction: MD Code

Classical molecular dynamics simulation of dense nuclear matter consisting of either fully ionized atoms or free neutrons and protons.
Main targets are studies of the dense matter in white dwarfs and neutron stars.
Interaction potentials between particles are treated as classical two-particle central potentials.

No complicated particle geometry or orientation

Electrons can be treated as a uniform background charge (not explicitly modeled)

SLIDE 6

MD Code Details

Particle-particle interactions have been implemented in a multitude of ways, located in the different files PP01, PP02, and PP03. The code blocks are selectable using preprocessor macros.

PP01 is the original implementation, with no division into the Ax, Bx, or NBS blocks.
PP02 implements the versions in use by the physicists today:

3 different implementations for the Ax block, 3 implementations for the Bx block, no manual blocking (NBS).

PP03 includes new routines:

3 Ax blocks, 8 Bx blocks; loops can be blocked using the NBS value.

SLIDE 7

MD Code Details

Two sections of code each have two or more variations. One section is labelled A, the other B; the variations are numbered. MD can be compiled as a serial, OpenMP, MPI, or hybrid MPI+OpenMP program, for example:

make MDEF=XRay md_mpi ALG=PP02 BLKA=A0 BLKB=B2 NBS="NBSX=0"

SLIDE 8

MD Workflow

The structure of the algorithm is simple: the program first reads in a parameter file (runmd.in) and an initial particle configuration file (md.in), then calculates the initial set of accelerations and enters a time-stepping loop, a triply nested "do" loop.

SLIDE 9

Runtime Parameters

! Parameters:
  sim_type   = 'ion-mixture',  ! simulation type
  tstart     = 0.00,           ! start time
  dt         = 25.00,          ! time step (fm/c)
! Warmup:
  nwgroup    = 2,              ! groups
  nwsteps    = 50,             ! steps per group
! Measurement:
  ngroup     = 2,              ! groups
  ntot       = 2,              ! per group
  nind       = 25,             ! steps between
  tnormalize = 50,             ! temp normal.
  ncom       = 50,             ! center-of-mass motion cancel.

Figure: Runtime parameters, snippet from runmd.in
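The snippet uses Fortran namelist-style syntax. A minimal sketch of reading such parameters is shown below; the namelist group name, the unit number, and the variable kinds are assumptions here, and the actual reader in MD may differ.

program read_runmd_sketch
  implicit none
  character(len=32) :: sim_type
  double precision  :: tstart, dt
  integer           :: nwgroup, nwsteps, ngroup, ntot, nind, tnormalize, ncom
  ! Group name "params" is an assumption; runmd.in may use a different one.
  namelist /params/ sim_type, tstart, dt, nwgroup, nwsteps, &
                    ngroup, ntot, nind, tnormalize, ncom

  open(unit=10, file='runmd.in', status='old')
  read(10, nml=params)
  close(10)
  print *, 'sim_type = ', trim(sim_type), '  dt = ', dt
end program read_runmd_sketch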

SLIDE 10

The Main Loop

      do 100 ig=1,ngroup
         initialize group ig statistics
         do 40 j=1,ntot
            do i=1,nind
               call newton      ! computes forces
                                ! updates x and v
            enddo
            call vtot
 40      continue
         compute group ig statistics
100   continue

Figure: Simplified version of the main loop

SLIDE 11

MD Implementation Details - newton module

Forces are calculated in the newton module in a pair of nested do-loops

The outer loop runs over target particles, the inner loop over source particles.

Targets are assigned to MPI processes in a round-robin fashion; within each MPI process, the work is shared among OpenMP threads (see the sketch below).
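A minimal schematic of this work distribution is given below. It uses simplified identifiers and a placeholder pair force and is not the actual newton module; myrank and nprocs are assumed to come from MPI_Comm_rank/MPI_Comm_size.

! Sketch of the newton-style work distribution (not the actual MD source):
! targets i are assigned round-robin to MPI ranks, the source loop over j
! is shared among OpenMP threads.
      subroutine forces_sketch(n, myrank, nprocs, x, zii, fj)
      implicit none
      integer, intent(in) :: n, myrank, nprocs
      double precision, intent(in)    :: x(3,0:n-1), zii(0:n-1)
      double precision, intent(inout) :: fj(3,0:n-1)
      double precision :: xx(3), fi(3), r2, fc
      integer :: i, j

      do i = myrank, n-2, nprocs            ! targets: round-robin over MPI ranks
         fi(:) = 0.0d0
!$omp parallel do private(j,xx,r2,fc) reduction(+:fi) schedule(runtime)
         do j = i+1, n-1                    ! sources: shared among OpenMP threads
            xx(:) = x(:,i) - x(:,j)
            r2    = sum(xx*xx)
            fc    = 1.0d0/(r2*sqrt(r2))     ! placeholder, not the MD potential
            fi(:)   = fi(:)   + zii(j)*fc*xx(:)
            fj(:,j) = fj(:,j) - zii(i)*fc*xx(:)
         end do
!$omp end parallel do
         ! ... fi would be accumulated into the total force on particle i ...
      end do
      end subroutine forces_sketch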

SLIDE 12

Overview

2. Serial Analysis

SLIDE 13

XRay - a Cray XT5mTM

Cray XT5m provided by the FutureGrid project.
The XT5m is a 2D mesh of nodes; each node has two sockets, each with four cores.
AMD Opteron 23xx "Shanghai" (45 nm) running at 2.4 GHz.
84 compute nodes with a total of 672 cores.
Compiler: pgi/9.0.4 using the XT PE driver xtpe-shanghai.

SLIDE 14

Time Constraints

5k particles: measurement takes 5 minutes
27k particles: measurement takes 1 hour
55k particles: measurement takes 10 hours

SLIDE 15

Overview PP01, PP02, and PP03

0" 10000" 20000" 30000" 40000" 50000" 60000" P P 1 " A " B " P P 2 " A 1 " B 2 " P P 3 " A " B " 1 6 " P P 3 " A " B 1 " 8 " P P 3 " A " B 2 " 2 " P P 3 " A " B 2 " 2 5 6 " P P 3 " A " B 3 " 1 2 8 " P P 3 " A " B 4 " 6 4 " P P 3 " A " B 5 " 3 2 " P P 3 " A 1 " B " 2 " P P 3 " A 1 " B " 2 5 6 " P P 3 " A 1 " B 1 " 1 2 8 " P P 3 " A 1 " B 2 " 6 4 " P P 3 " A 1 " B 3 " 3 2 " P P 3 " A 1 " B 4 " 1 6 " P P 3 " A 1 " B 5 " 8 " P P 3 " A 1 " B 6 " 2 " P P 3 " A 2 " B " 6 4 " P P 3 " A 2 " B 1 " 3 2 " P P 3 " A 2 " B 2 " 1 6 " P P 3 " A 2 " B 3 " 8 " P P 3 " A 2 " B 4 " 2 " P P 3 " A 2 " B 4 " 2 5 6 " P P 3 " A 2 " B 5 " 1 2 8 "

run$me'in'seconds'

Run$me'for'all'code'combina$ons'

O3" O2" FASTSSE"

Figure : Overview of the runtimes for all (143) code-block combinations with an ion-mix input dataset of 55k particles. The naming scheme for the measurements is ”source-code-file A-block B-block blockingfactor”

SLIDE 16

PP02 Code Blocks

0" 5000" 10000" 15000" 20000" 25000" 30000" 35000" 40000" 45000" 50000" P P 1 " A " B " P P 2 " A " B " P P 2 " A " B 1 " P P 2 " A " B 2 " P P 2 " A 1 " B " P P 2 " A 1 " B 1 " P P 2 " A 1 " B 2 " P P 2 " A 2 " B " P P 2 " A 2 " B 1 " P P 2 " A 2 " B 2 " Run$me'in'seconds' Source'code'file'and'code'block'version'

Run$me,'55k'par$cle'run'

O3" O2" FASTSSE"

SLIDE 17

Annotated Source for -O3

##     do 90 j=i+1,n-1
## !------ Block A ------
## #if defined(A0)
##     r2=0.0d0
        movsd   %xmm2, %xmm1
        movq    %r12, %rdx
        movq    %r15, %rcx
        movl    $8, %eax
        .align  16
.LB2_555:
## lineno: 138

SLIDE 18

Annotated Source for -fast

##     do 90 j=i+1,n-1
## !------ Block A ------
## #if defined(A0)
##     r2=0.0d0
##     do k=1,3
##        xx(k)=x(k,i)-x(k,j)
        movlpd  .BSS2+48(%rip), %xmm2
        movlpd  .C2_291(%rip), %xmm0
        mulsd   %xmm2, %xmm2
        addsd   %xmm1, %xmm2
        movlpd  %xmm2, 344(%rsp)
        sqrtsd  %xmm2, %xmm2
        movlpd  %xmm2, 448(%rsp)
        mulsd   md_globals_10_+120(%rip), %xmm2
        subsd   %xmm2, %xmm0
        .p2align 4,,1

SLIDE 19

Overview

3. PAPI measurements

SLIDE 20

Floating Point Instructions

0" 2E+09" 4E+09" 6E+09" 8E+09" 1E+10" 1.2E+10" A0_B0" A0_B1" A0_B2" A1_B0" A1_B1" A1_B2" A2_B0" A2_B1" A2_B2" #"of"instruc,ons" code"block"

PAPI_FAD_INS"

O2" O3" FAST" 0" 2E+09" 4E+09" 6E+09" 8E+09" 1E+10" 1.2E+10" 1.4E+10" A0_B0" A0_B1" A0_B2" A1_B0" A1_B1" A1_B2" A2_B0" A2_B1" A2_B2" #"of"instruc,ons" code"block"

PAPI_FML_INS"

O2" O3" FAST"

0" 5E+09" 1E+10" 1.5E+10" 2E+10" 2.5E+10" A0_B0" A0_B1" A0_B2" A1_B0" A1_B1" A1_B2" A2_B0" A2_B1" A2_B2" #"of"instruc,ons" code"block"

PAPI_FP_INS"

O2" O3" FAST" 15/34

SLIDE 21

FPU idle times

Figure: FPU idle times per code block (A0_B0 … A2_B2) for -O2, -O3, and -fast, in percent of PAPI-measured total cycles.

SLIDE 22

Branch Miss Predictions

0" 2E+09" 4E+09" 6E+09" 8E+09" 1E+10" 1.2E+10" A0_B0" A0_B1" A0_B2" A1_B0" A1_B1" A1_B2" A2_B0" A2_B1" A2_B2" #"of"instruc,ons" code"block"

PAPI_BR_INS"

O2" O3" FAST" 0.E+00% 1.E+08% 2.E+08% 3.E+08% 4.E+08% 5.E+08% 6.E+08% A0_B0% A0_B1% A0_B2% A1_B0% A1_B1% A1_B2% A2_B0% A2_B1% A2_B2% #"of"instruc,ons" code"block"

PAPI_BR_MSP"

O2% O3% FAST%

0.00%$ 2.00%$ 4.00%$ 6.00%$ 8.00%$ 10.00%$ 12.00%$ 14.00%$ 16.00%$ A0_B0$ A0_B1$ A0_B2$ A1_B0$ A1_B1$ A1_B2$ A2_B0$ A2_B1$ A2_B2$ miss$rate$in$%$ code$block$

Branch$predic4on$miss$rate$

O2$ O3$ FAST$ 17/34

SLIDE 23

Overview

4. Source Code Analysis

SLIDE 24

Block A of the PP02 file

#if defined(A0)
      r2=0.0d0
      do k=1,3
         xx(k)=x(k,i)-x(k,j)
         if (xx(k).gt.+halfl(k)) xx(k)=xx(k)-xl(k)
         if (xx(k).lt.-halfl(k)) xx(k)=xx(k)+xl(k)
         r2=r2+xx(k)*xx(k)
      enddo
#elif defined(A1)
      r2=0.0d0
      do k=1,3
         xx(k)=x(k,i)-x(k,j)
         xx(k)=xx(k)-aint(xx(k)*halfli(k))*xl(k)
         r2=r2+xx(k)*xx(k)
      enddo
#elif defined(A2)
      xx(:)=x(:,i)-x(:,j)
      xx=xx-aint(xx*halfli)*xl
      r2=xx(1)*xx(1)+xx(2)*xx(2)+xx(3)*xx(3)
#else
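All three variants appear to implement the same periodic minimum-image correction. Assuming halfl(k) is half the box length L/2, halfli(k) its reciprocal, and xl(k) the full box length L (these names are not defined on the slide), A1 and A2 replace A0's two branches with a truncation:

$$\Delta x \;\leftarrow\; \Delta x - \operatorname{aint}\!\left(\frac{\Delta x}{L/2}\right) L$$

For |Δx| < L this gives the same result as A0's branches; e.g. with L = 10 and Δx = 7, aint(7/5) = 1 and Δx becomes -3, just as the first if in A0 would produce.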

SLIDE 25

Block B of the PP02 file

#if defined(B0)
      r=sqrt(r2)
      fc=exp(-xmuc*r)*(1./r+xmuc)/r2
      do k=1,3
         fi(k)=fi(k)+zii(j)*fc*xx(k)
         fj(k,j)=fj(k,j)-zii(i)*fc*xx(k)
      enddo
#elif defined(B1)
      r=sqrt(r2)
      fc=exp(-xmuc*r)*(1./r+xmuc)/r2
      fi(:)=fi(:)+zii(j)*fc*xx(:)
      fj(:,j)=fj(:,j)-zii(i)*fc*xx(:)
#elif defined(B2)
      if (r2.le.rcutoff2) then
         r=sqrt(r2)
         fc=exp(-xmuc*r)*(1./r+xmuc)/r2
         fi(:)=fi(:)+zii(j)*fc*xx(:)
         fj(:,j)=fj(:,j)-zii(i)*fc*xx(:)
      endif
#else
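The common factor fc is consistent with the pair force of a screened (Yukawa-type) Coulomb potential; this reading assumes xmuc is the screening parameter and zii(i), zii(j) are the particle charges, which is not stated on the slide:

$$V(r)=\frac{z_i z_j\,e^{-\mu r}}{r},\qquad F_k=-\frac{\partial V}{\partial x_k}=z_i z_j\,e^{-\mu r}\left(\frac{1}{r}+\mu\right)\frac{\Delta x_k}{r^2}$$

With fc = exp(-xmuc*r)*(1./r+xmuc)/r2, the statement fi(k) = fi(k) + zii(j)*fc*xx(k) accumulates exactly this force on particle i; B2 simply skips pairs outside the cut-off sphere (r2 > rcutoff2).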

SLIDE 26

Overview

5. Source Code Optimization

SLIDE 27

OpenMP Overhead in Vampir

SLIDE 28

Code Changes

  • The !$omp parallel do for the j-loop is removed
  • An !$omp parallel region is placed around the i-loop
  • The j-loop is parallelized explicitly
  • The entire ij-loop section is wrapped in a third loop over threads
  • Each iteration of this outer loop is assigned to an OpenMP thread

Old version (OpenMP parallel do on the inner j-loop):

      allocate(fj(3,0:n-1))
      do 100 i=myrank,n-2,nprocs
         fi(:)=0.0d0
!$omp parallel do private(r2,k,xx,r,fc), &
!$omp&            reduction(+:fi), schedule(runtime)
         do 90 j=i+1,n-1
! ------ A-Block ------
            ...

New version (explicit third loop over threads):

!$omp parallel private(nthrd,nj,nr,j0,j1,xx,r2,r,fc,fi)
      nthrd = omp_get_num_threads()
      allocate(fi(3,0:n-1))
!$omp do schedule(static,1)
      do 110 ithrd=0,nthrd-1
         do 100 i=myrank,n-2,nprocs
            fi(:,i)=0.0d0
            nj=(n-i-1)/nthrd
            nr=mod(n-i-1,nthrd)
            j0=(i+1)+ithrd*nj+min(ithrd,nr)
            j1=(i+1)+(ithrd+1)*nj+min(ithrd+1,nr)-1
            do 90 j=j0,j1
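As a sanity check on the index arithmetic above, the minimal standalone sketch below prints each thread's j-range and confirms that the chunks cover i+1 … n-1 exactly once; the values of n, i, and the thread count are arbitrary examples, not taken from the MD runs.

program check_partition
  implicit none
  integer :: n, i, nthrd, ithrd, nj, nr, j0, j1, total

  n     = 23          ! example particle count
  i     = 4           ! example target particle
  nthrd = 4           ! example thread count
  total = 0

  do ithrd = 0, nthrd-1
     ! Same partition formulas as in the new version above.
     nj = (n-i-1)/nthrd
     nr = mod(n-i-1, nthrd)
     j0 = (i+1) + ithrd*nj     + min(ithrd,   nr)
     j1 = (i+1) + (ithrd+1)*nj + min(ithrd+1, nr) - 1
     print '(a,i0,a,i0,a,i0)', 'thread ', ithrd, ': j = ', j0, ' .. ', j1
     total = total + (j1 - j0 + 1)
  end do

  ! The chunk sizes differ by at most one and sum to n-i-1.
  print '(a,i0,a,i0)', 'indices covered: ', total, ' of ', n-i-1
end program check_partition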

SLIDE 29

New MPI/OpenMP Version in Vampir

SLIDE 30

Parallel Efficiency of the OpenMP Application

Figure: Parallel efficiency in percent versus number of OpenMP threads (1-8) for the old and the new code.

SLIDE 31

Parallel Efficiency of the MPI Version

Figure: Parallel efficiency in percent of the MPI-only version versus number of MPI cores (8-672).

SLIDE 32

Parallel Efficiency of the Hybrid Version

Figure: Parallel efficiency in percent of the hybrid MPI+OpenMP version versus number of cores (8-672).

SLIDE 33

Overview

6. Tracing and Visualization

SLIDE 34

8 core inter-node run of the MPI-only version of MD

SLIDE 35

8 core inter-node run of the MPI-only version of MD

SLIDE 36

8 core inter-node run of the MPI-only version of MD

SLIDE 37

8 core inter-node run of the MPI-only version of MD

SLIDE 38

8 core inter-node run of the MPI-only version of MD

SLIDE 39

8 core inter-node run of the MPI-only version of MD

SLIDE 40

Impact of MPI_Allreduce on the scaling

Table: MPI-only, PP02 A0 B2 run, all processes, accumulated exclusive time per function

  function name        672 cores            8 cores
                       time in s      %     time in s      %
  accel_ion_mix          13293.4   72.7       11735.0   97.9
  MPI_Allreduce           1885.8   10.3          44.1    0.4
  sync                     939.4    5.1           0.1      -
  newton                   937.2    5.1           4.9    0.0
  MPI_Bcast                807.1    4.4           8.2    0.1
  vtot_ion_mix             189.6    1.0         188.2    1.6
  MPI_Init                 112.8    0.6           0.1      -
  MPI_Barrier              107.5    0.6           1.2      -

SLIDE 41

Overview

7. Conclusion

SLIDE 42

Conclusion

The analysis of MD found the best serial version.

Using a cut-off sphere reduces the number of interactions to 19%.

The runtime decreases although the performance counters indicate otherwise.

The OpenMP parallelization was moved from the inner to the outer loop.

This improved the parallel efficiency: a dual-socket, dual-chip, 16-thread AMD Interlagos system shows a parallel efficiency of 97.6% up to 32 cores.

The changes to the source code will be included in future versions of the SPEC OpenMP benchmark.

SLIDE 43

Questions?

Thank you for your attention.

SLIDE 44

AMD Interlagos OpenMP Efficiency

0" 0.1" 0.2" 0.3" 0.4" 0.5" 0.6" 0.7" 0.8" 0.9" 1" 1.1" 1" 3" 5" 7" 9" 11" 13" 15" 17" 19" 21" 23" 25" 27" 29" 31" 33" 35" 37" 39" 41" 43" 45" 47" 49" 51" 53" 55" 57" 59" 61" 63" Parallel"Efficiency" Number"of"OpenMP"Threads"

Parallel"Efficiency"

SLIDE 45

Kraken - Old Version

SLIDE 46

Introduction

The code has been extended and rewritten multiple times; there are several different implementations of the same semantics.

Loops were reimplemented using Fortran array syntax.
A cut-off sphere restricts interactions to a subset of relatively nearby nucleons/ions.

The user must specify which variation of the force calculation routines to use when building the code.

Simulation types: nucleon, pure-ion, ion-mixture

SLIDE 47

672 core run of the MPI-only version of MD

SLIDE 48

8 core inter-node run of the MPI-only version of MD

SLIDE 49
