SLIDE 1

Slide 1 High Performance Computing Center Stuttgart

Single Processor Optimization (II)

Russian-German School on High Performance Computer Systems, June 27 to July 6, 2005, Novosibirsk

  • Day 2, June 28, 2005

HLRS, University of Stuttgart

SLIDE 2

Intention

  • different kinds of array declarations are tested
  • overhead of procedure calls
  • overhead for leaving procedures
  • allocation/deallocation times
  • performance implications of declarations for operating on arrays
SLIDE 3

tested machines and compilers

1. machine type:                                         Itanium2_efc  NECSX5_f90  Pentium4_ifc  Pentium4_pgi
2. frequency in MHz:                                     1000          500         2400          2400
3. number of clock ticks for setting the next clock tick: 38            31          92            112
4. number of ticks of calling overhead for one clock call: 37           29          88            88
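The tick counts in the table can be put on a common time scale. The following is a minimal sketch, assuming that one counter tick corresponds to one clock cycle at the listed frequency (the slides do not state this explicitly). The program and variable names are illustrative only.

```fortran
program ticks_to_ns
  implicit none
  ! Values copied from the table of tested machines.
  ! Assumption: one counter tick equals one cycle at the listed frequency.
  character(len=12), dimension(4), parameter :: machine = &
       (/ 'Itanium2_efc', 'NECSX5_f90  ', 'Pentium4_ifc', 'Pentium4_pgi' /)
  real(kind=8), dimension(4), parameter :: freq_mhz = &
       (/ 1000.d0, 500.d0, 2400.d0, 2400.d0 /)
  integer, dimension(4), parameter :: overhead_ticks = (/ 37, 29, 88, 88 /)
  integer :: i

  do i = 1, 4
     ! ticks / (MHz * 1e6 Hz) * 1e9 ns/s  =  ticks * 1000 / MHz
     write(*,'(a,1x,i4,a,f7.1,a)') machine(i), overhead_ticks(i), ' ticks = ', &
          real(overhead_ticks(i), kind=8) * 1000.d0 / freq_mhz(i), ' ns'
  enddo
end program ticks_to_ns
```

Under this assumption the 37 ticks on the 1000 MHz Itanium 2 correspond to 37 ns, which shows why the timer overhead has to be subtracted when procedures of only a few cycles are measured.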

SLIDE 4

Compiler

  • Intel IA-32

ifc -O3 -nodps -hlo -tpp6 (or -tpp7)

  • Intel IA-64

efc -O3 -hlo -opt_report -opt_report_levelmax -opt_report_phaseall # -ip # -S

  • Portland Group (there may be better options)

pgf90 -fast

  • NEC SX

f90 (note that non-overlapping pointers are the default)

SLIDE 5

test environment

  • the simplest procedures should allow for the best optimization
  • a slim instruction body exposes the calling overhead
  • timers are based on hardware counters read by assembler routines

– calling overhead of 30 to 90 cycles
– PAPI is too inaccurate

  • programs are portable as long as hardware counters can be provided
SLIDE 6

how to call the IA-32 counter

  • IA-32 Linux; assembler embedded in C; also pgi; also AMD
  • icc -c clock_tic.c

    unsigned long long int clock_tic_ ()
    {
      unsigned long long int x;
      __asm__ volatile ("rdtsc\n" : "=A" (x));
      return x;
    }

    integer(kind=8) :: int_start
    integer(kind=8) :: int_end
    integer(kind=8),external :: clock_tic
    int_start=clock_tic()
    do ii=1,imax
      a(ii)=b(ii)+c(ii)
    enddo
    int_end=clock_tic()

SLIDE 7

how to call the IA-64 counter

  • IA-64 Linux Assembler
  • ecc -c clock_tic.s

    .text
    .align 16
    // C version
    // long clock_tic()
    .global clock_tic#
    .proc clock_tic#
    clock_tic:
      mov r8 = ar.itc
      br.ret.sptk.many b0
    .endp clock_tic#
    .align 16
    // Fortran version
    // integer*8 clock_tic
    .global clock_tic_#
    .proc clock_tic_#
    clock_tic_:
      mov r8 = ar.itc
      br.ret.sptk.many b0
    .endp clock_tic_#

    integer(kind=8) :: int_start
    integer(kind=8) :: int_end
    integer(kind=8),external :: clock_tic
    int_start=clock_tic()
    do ii=1,imax
      a(ii)=b(ii)+c(ii)
    enddo
    int_end=clock_tic()

SLIDE 8

how to call the NEC SX counter

  • NEC SX usr time counter
  • as -dl clock_tic.s

    global clock_tic_
    clock_tic_:
      stusrcc $s123
      b 0(,$s32)

  • NEC SX wall clock counter
  • as -dl clock_tic_wall.s

    global clock_tic_wall_
    clock_tic_wall_:
      ststm $s123
      b 0(,$s32)

    integer(kind=8) :: int_start
    integer(kind=8) :: int_end
    integer(kind=8),external :: clock_tic
    int_start=clock_tic()
    do ii=1,imax
      a(ii)=b(ii)+c(ii)
    enddo
    int_end=clock_tic()

SLIDE 9

Part 1: procedure calls and declarations

  • detailed measurements of entering and leaving procedures
  • allocation/deallocation and automatic array timings
  • the tested procedures are in a module

– in the same file
– and in a different file

  • allocation of a large number of pointers
  • simple recursive procedures
SLIDE 10

measuring methodology

measuring environment

    int_1=clock_tic()
    do nn=1,nmax                      ! repetition loop
      int_2=clock_tic()
      call extern_explicit_shape_array(array,ix,iy)
      int_5=clock_tic()
      int_time_23=int_time_23+(int_3-int_2)
      int_time_34=int_time_34+(int_4-int_3)
      int_time_45=int_time_45+(int_5-int_4)
    enddo
    int_6=clock_tic()

measured procedure

    subroutine explicit_shape_array(array,ix,iy)
      integer :: ix,iy
      real(kind=8),dimension(ix,iy) :: array
      int_3=clock_tic()
      array(1,1) = 0.
      int_4=clock_tic()
    end subroutine explicit_shape_array

calculation of timings

    int_time_16=int_6 - int_1
    time_array(1)=real(int_time_23)/real(nmax) - tics_for_calling_clock
    time_array(2)=real(int_time_34)/real(nmax) - tics_for_calling_clock
    time_array(3)=real(int_time_45)/real(nmax) - tics_for_calling_clock
    time_array(4)=int_time_16 - tics_for_calling_clock
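The measuring scheme can be tried without the assembler counters. The sketch below assumes that the standard system_clock intrinsic is an acceptable stand-in for clock_tic; it counts timer ticks rather than CPU cycles, so its numbers are not comparable with the tables on these slides. The module and program names are hypothetical, and the calibration loop for tics_for_calling_clock is an assumption about how such a correction could be obtained.

```fortran
module timed_proc
  implicit none
  integer(kind=8) :: int_3, int_4        ! readings taken inside the procedure
contains
  function clock_tic() result(tics)
    ! Assumption: system_clock replaces the hardware counter of the slides.
    integer(kind=8) :: tics
    call system_clock(count=tics)
  end function clock_tic

  subroutine explicit_shape_array(array, ix, iy)
    integer :: ix, iy
    real(kind=8), dimension(ix,iy) :: array
    int_3 = clock_tic()
    array(1,1) = 0.
    int_4 = clock_tic()
  end subroutine explicit_shape_array
end module timed_proc

program measure_call
  use timed_proc
  implicit none
  integer, parameter :: nmax = 100000, ix = 10, iy = 10
  real(kind=8), dimension(ix,iy) :: array
  integer(kind=8) :: int_1, int_2, int_5, int_6
  integer(kind=8) :: int_time_23, int_time_34, int_time_45
  real(kind=8) :: tics_for_calling_clock
  integer :: nn

  ! Calibration: average cost of one timer call from back-to-back readings.
  int_1 = clock_tic()
  do nn = 1, nmax
     int_6 = clock_tic()
  enddo
  tics_for_calling_clock = real(int_6 - int_1, kind=8) / real(nmax, kind=8)

  int_time_23 = 0
  int_time_34 = 0
  int_time_45 = 0
  do nn = 1, nmax                        ! repetition loop
     int_2 = clock_tic()
     call explicit_shape_array(array, ix, iy)
     int_5 = clock_tic()
     int_time_23 = int_time_23 + (int_3 - int_2)   ! entering
     int_time_34 = int_time_34 + (int_4 - int_3)   ! body
     int_time_45 = int_time_45 + (int_5 - int_4)   ! leaving
  enddo

  ! Average ticks per phase, corrected for the cost of the timer call itself.
  print *, 'entering:', real(int_time_23, kind=8)/real(nmax, kind=8) - tics_for_calling_clock
  print *, 'body    :', real(int_time_34, kind=8)/real(nmax, kind=8) - tics_for_calling_clock
  print *, 'leaving :', real(int_time_45, kind=8)/real(nmax, kind=8) - tics_for_calling_clock
end program measure_call
```

Because the procedure lives in the same file as the caller here, a compiler may inline it; this corresponds to the "same file" rows of the tables rather than the "different file" rows.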

SLIDE 11

interpretation of tables

machine     Itanium2_efc  NECSX5_f90  Pentium4_ifc  Pentium4_pgi
entering    7  43  4  12
body        1  5  4
leaving     3  8  4
entering    7  40  4  12
body        1  5  4
leaving     43  45  88  96

row 1: machine
rows 2-4: procedure in the same file (entering, body, leaving)
rows 5-7: procedure in a different file (entering, body, leaving)

SLIDE 12

implicit_procedure

    subroutine implicit_procedure2(j,time)
      integer :: j
      real(kind=8) :: time
      j=max(j,int(time))   !only to confuse the compiler
    end subroutine implicit_procedure2

procedure in an external file; total time for one call

machine     Itanium2_efc  NECSX5_f90  Pentium4_ifc  Pentium4_pgi
call        21  42  24  32

SLIDE 13

explicit_shape_array

    subroutine explicit_shape_array(array,ix,iy)
      integer :: ix,iy
      real(kind=8),dimension(ix,iy) :: array
      int_3=clock_tic()
      array(1,1) = 0.
      int_4=clock_tic()
    end subroutine explicit_shape_array

leaving the procedure takes little time when the procedure is in the same file

machine     Itanium2_efc  NECSX5_f90  Pentium4_ifc  Pentium4_pgi
entering    7  43  4  12
body        1  5  4
leaving     3  8  4

entering    7  40  4  12
body        1  5  4
leaving     43  45  88  96

SLIDE 14

assumed_shape_array

    subroutine assumed_shape_array(array)
      real(kind=8),dimension(:,:) :: array
      int_3=clock_tic()
      array(1,1) = 0.
      int_4=clock_tic()
    end subroutine assumed_shape_array

entering    11  184  8  745
body        13  7
leaving     83  82  180  228

entering    10  183  8  752
body        13  7
leaving     123  119  272  360

needs much more time for entering and leaving the procedure

SLIDE 15

assumed_shape_array_section 1

    subroutine assumed_shape_array_section(array_1,array_2,digit,ix,iy)
      real(kind=8),dimension(:,:) :: array_1
      real(kind=8),dimension(:,:) :: array_2
      real(kind=8) :: digit
      integer :: ix,iy
      int_3=clock_tic()
      array_1(ix,iy) = digit; array_2(ix,iy) = 2.; digit = array_1(ix,iy) + array_2(ix,iy)
      int_4=clock_tic()
    end subroutine assumed_shape_array_section

three different cases for actual parameters:

    call assumed_shape_array_section(array_1(1:a,1:b),array_2(1:a,1:b),digit,a,b)
    call assumed_shape_array_section(array_1(1:a:2,1:b:2),array_2(1:a:2,1:b:2),digit,a,b)
    call assumed_shape_array_section(array_1,array_2,digit,a,b)

SLIDE 16

assumed_shape_array_section 2

1) call assumed_shape_array_section(array_1,array_2,digit,a,b)
2) call assumed_shape_array_section(array_1(1:a,1:b),array_2(1:a,1:b),digit,a,b)
3) call assumed_shape_array_section(array_1(1:a:2,1:b:2),array_2(1:a:2,1:b:2),digit,a,b)

machine     Itanium2_efc  NECSX5_f90  Pentium4_ifc  Pentium4_pgi
entering    17  377  24  1494
body        43  27  52  25
leaving     7  8  4  96

entering    52  1302  85  2146
body        40  29  50  22
leaving     7  8  4  96

entering    56  1183  84  5997
body        30  27  36  25
leaving     51  45  96  4193

copying when entering the procedure for cases 2 and 3

SLIDE 17

deferred_shape_array

    subroutine deferred_shape_array(digit,x,y)
      real(kind=8),allocatable,dimension(:,:) :: array_1
      real(kind=8),allocatable,dimension(:,:) :: array_2
      real(kind=8) :: digit
      integer :: x,y
      int_3=clock_tic()
      allocate (array_1(x,y),array_2(x,y))
      array_1(x,y) = digit; array_2(x,y) = 2. ; digit = array_1(x,y) + array_2(x,y)
      deallocate (array_1,array_2)
      int_4=clock_tic()
    end subroutine deferred_shape_array

high times for allocation and deallocation
large times for leaving the procedure

machine     Itanium2_efc  NECSX5_f90  Pentium4_ifc  Pentium4_pgi
entering    15  218  4  16
body        682  2405  1036  3218
leaving     1642  1707  3611  4485

entering    15  219  4  16
body        682  2391  1051  3297
leaving     1719  1962  3709  4577

SLIDE 18

automatic_arrays

    subroutine automatic_arrays(digit,ix,iy)
      real(kind=8),dimension(ix,iy) :: array_1,array_2
      real(kind=8) :: digit
      integer :: ix,iy
      int_3 = clock_tic()
      array_1(ix,iy) = digit; array_2(ix,iy) = 2.
      digit = array_1(ix,iy) + array_2(ix,iy)
      int_4 = clock_tic()
    end subroutine automatic_arrays

machine     Itanium2_efc  NECSX5_f90  Pentium4_ifc  Pentium4_pgi
entering    372  208  500  1369
body        15  26  36  16
leaving     284  78  269  732

entering    372  206  500  1349
body        15  26  36  16
leaving     598  172  625  1493

NEC SX is quite fast
pgi is much worse
a much better solution than allocation and deallocation
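To make the comparison with explicit allocation concrete, here is a small self-contained sketch that places the two ways of obtaining a scratch array side by side. The module, procedure and variable names are hypothetical; only the declaration styles are taken from the slides. It checks that both variants compute the same value and makes no statement about timings.

```fortran
module scratch_variants
  implicit none
contains
  subroutine with_automatic(digit, ix, iy)
    integer, intent(in) :: ix, iy
    real(kind=8), intent(inout) :: digit
    ! Automatic array: created on entry, released on return.
    real(kind=8), dimension(ix,iy) :: work
    work(ix,iy) = digit
    digit = work(ix,iy) + 2.0d0
  end subroutine with_automatic

  subroutine with_allocatable(digit, ix, iy)
    integer, intent(in) :: ix, iy
    real(kind=8), intent(inout) :: digit
    ! Deferred-shape array: explicit allocate and deallocate in the body.
    real(kind=8), allocatable, dimension(:,:) :: work
    allocate(work(ix,iy))
    work(ix,iy) = digit
    digit = work(ix,iy) + 2.0d0
    deallocate(work)
  end subroutine with_allocatable
end module scratch_variants

program compare_scratch
  use scratch_variants
  implicit none
  real(kind=8) :: d_auto, d_alloc
  d_auto  = 1.0d0
  d_alloc = 1.0d0
  call with_automatic(d_auto, 100, 100)
  call with_allocatable(d_alloc, 100, 100)
  print *, d_auto, d_alloc, d_auto == d_alloc
end program compare_scratch
```

Many compilers place automatic arrays on the stack, so very large sizes may need a larger stack limit; that is a property of the compiler and its options, not of the language.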

SLIDE 19

pointer_array

    subroutine pointer_array(digit,ix,jy)
      real(kind=8),dimension(:,:),pointer :: array_p1
      real(kind=8),dimension(:,:),pointer :: array_p2
      integer :: ix,jy
      real(kind=8) :: digit
      int_3 = clock_tic()
      allocate (array_p1(ix,jy),array_p2(ix,jy))
      array_p1(ix,jy) = digit; array_p2(ix,jy) = 2.
      digit = array_p1(ix,jy) +array_p2(ix,jy)
      deallocate (array_p1,array_p2)
      int_4 = clock_tic()
    end subroutine pointer_array

machine     Itanium2_efc  NECSX5_f90  Pentium4_ifc  Pentium4_pgi
entering    13  39  4  76
body        549  2674  870  32807
leaving     678  209  717  1845

entering    13  127  4  28
body        543  2776  892  32950
leaving     758  246  809  2193

small times for entering
high times for leaving the procedure

SLIDE 20

allocate_pointer_array 1

    type pointer_data_type
      real(kind=8),pointer :: point
    end type pointer_data_type

    subroutine allocate_pointer_array_2(data,imax)
      type(pointer_data_type),pointer,dimension(:) :: data
      integer :: i,imax
      int_3=clock_tic()
      allocate(data(imax))
      do i =1,imax
        allocate(data(i)%point)
      enddo
      int_4=clock_tic()
    end subroutine allocate_pointer_array_2

imax=100

SLIDE 21

allocate_pointer_array 2

    type pointer_data_type
      real(kind=8),pointer :: point
    end type pointer_data_type

    subroutine allocate_pointer_array_1(all,data,imax)
      type(pointer_data_type),pointer,dimension(:) :: data
      real(kind=8),pointer,dimension(:) :: all
      integer :: i,imax
      int_3=clock_tic()
      allocate(all(imax))
      allocate(data(imax))
      do i =1,imax
        data(i)%point => all(i)
      enddo
      int_4=clock_tic()
    end subroutine allocate_pointer_array_1

imax=100

SLIDE 22

allocate_pointer_array 3

allocation of a large buffer with pointer assignments

    allocate(all(imax))
    allocate(data(imax))
    do i =1,imax
      data(i)%point => all(i)
    enddo

allocation of a large number of atomic pointers

    allocate(data(imax))
    do i =1,imax
      allocate(data(i)%point)
    enddo

machine     Itanium2_efc  NECSX5_f90  Pentium4_ifc  Pentium4_pgi
entering    12  237  4  333
body        446  27785  713  7791
leaving     29  199  4  806

entering    10  135  4  178
body        11992  67690  14699  69187
leaving     30  101  4  431

NEC very slow
SLIDE 23

recursive self_call

    recursive subroutine self_call(number)
      integer :: number
      if (number>0) then
        call self_call(number-1)
      endif
    end subroutine self_call

NEC very slow

machine     Itanium2_efc  NECSX5_f90  Pentium4_ifc  Pentium4_pgi
call        18  263  14  18
call        19  263  14  16

SLIDE 24

recursive Ackermann function

    recursive function ackermann(x,y,number) result(resultvalue)
      real(kind=8) :: x,y,resultvalue
      integer :: number
      number = number + 1
      if (x == 0.) then
        resultvalue=(y+1._8)
      else
        if( (x > 0._8).and. (y == 0._8) ) then
          resultvalue = ackermann((x-1.),1._8,number)
        else
          resultvalue = ackermann( (x-1._8), ackermann(x, (y-1._8), number), number)
        endif
      endif
    end function ackermann

machine     Itanium2_efc  NECSX5_f90  Pentium4_ifc  Pentium4_pgi
call        42  279  53  51
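A per-call figure like the one in the table presupposes a count of how many calls one evaluation makes. The sketch below is an integer variant with a module-level call counter; both the integer arguments and the counter named ncalls are assumptions made for illustration (the slide version uses real arguments and passes the counter as an argument). Dividing a measured total time by ncalls would then give the cost per call.

```fortran
module ackermann_mod
  implicit none
  integer :: ncalls = 0        ! hypothetical counter, incremented on every call
contains
  recursive function ackermann(m, n) result(res)
    integer, intent(in) :: m, n
    integer :: res
    ncalls = ncalls + 1
    if (m == 0) then
       res = n + 1
    else if (n == 0) then
       res = ackermann(m-1, 1)
    else
       res = ackermann(m-1, ackermann(m, n-1))
    endif
  end function ackermann
end module ackermann_mod

program count_calls
  use ackermann_mod
  implicit none
  integer :: res
  res = ackermann(2, 3)
  print *, 'ackermann(2,3) =', res, ' calls =', ncalls   ! 9 and 44
end program count_calls
```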

SLIDE 25

conclusions and rules 1

  • influence of the type of array declaration

– the traditional f77 way is best
– assumed shape is slower
– automatic arrays for dynamic usage are fast on entering and leaving the procedure
– using deferred shape arrays is much slower

  • allocation and deallocation need thousands of cycles; allocate and deallocate one larger buffer and define arrays by pointer assignments
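The last rule can be sketched as follows: one allocation provides the storage, and the working arrays are defined as pointer sections of it. This is a minimal illustration under assumed names (buffer, a, b, c, n); it is not the code that was measured.

```fortran
program buffer_pointers
  implicit none
  integer, parameter :: n = 1000
  real(kind=8), dimension(:), pointer :: buffer
  real(kind=8), dimension(:), pointer :: a, b, c
  integer :: ii

  allocate(buffer(3*n))          ! one allocation instead of three
  a => buffer(1:n)               ! the arrays are sections of the buffer
  b => buffer(n+1:2*n)
  c => buffer(2*n+1:3*n)

  b = 1.0d0
  c = 2.0d0
  do ii = 1, n
     a(ii) = b(ii) + c(ii)
  enddo
  print *, a(1), a(n)

  nullify(a, b, c)               ! drop the aliases before releasing the storage
  deallocate(buffer)             ! one deallocation
end program buffer_pointers
```

The price of this pattern is that the loop now works on pointer arrays, and other slides in this deck report a pointer penalty on some compilers; passing the sections to a procedure with explicit-shape or assumed-size dummies is one way to hide that.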

SLIDE 26

conclusions and rules 2

  • array sections (and more complicated constructs) as actual arguments imply copying of data
  • leaving the procedure takes longer than entering it!
  • the pgi compiler shows much worse behaviour
  • except for automatic arrays, the NEC SX compiler needs more cycles
  • procedures in external files need more time for leaving
  • for some compiler front ends the allocation construct may inhibit optimization (not tested here) in the same program unit; define additional interfaces!

SLIDE 28

Part 2

  • performance implications of array declarations, attributes and self-defined types for the handling of arrays

– histogram loop
– comparison of arrays with different attributes for dummy and actual parameters
– derived data types for dynamic data

SLIDE 29

histogram loop with (pseudo-) aliasing data

SLIDE 30

histogram loop

  • histogram loop, assembling loop; in general not pipelineable/vectorizable

    do ii=1,imax
      jj=index(ii)
      a(jj)=a(jj)+b(ii)
    enddo

  • with directive

    !dir$ ivdep or !cdir nodep
    do ii=1,imax
      a(index(ii))=a(index(ii))+b(ii)
    enddo

  • semantics of forall are dependency free

    forall (ii=1:imax)
      a(index(ii))=a(index(ii))+b(ii)
    end forall

  • fake (a and aa are identical!)

    do ii=1,imax
      a(index(ii))=aa(index(ii))+b(ii)
    enddo

  • for comparison

    do ii=1,imax
      a(ii)=b(ii)+c(ii)
    enddo
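Why the histogram loop is not vectorizable in general can be shown with a tiny example in which one bin is hit twice. The sketch compares the sequential loop with an emulated vector execution (gather all operands, add, then scatter). The names idx, tmp, a_seq and a_vec are illustrative, and the gather/scatter emulation is an assumption about how dependency-free execution behaves, not output from any of the tested compilers.

```fortran
program histogram_dependency
  implicit none
  integer, parameter :: imax = 4, nbin = 3
  integer, dimension(imax) :: idx = (/ 1, 1, 2, 3 /)   ! bin 1 is hit twice
  real(kind=8), dimension(imax) :: b = 1.0d0
  real(kind=8), dimension(imax) :: tmp
  real(kind=8), dimension(nbin) :: a_seq, a_vec
  integer :: ii

  ! Sequential semantics: each iteration sees the update of the previous one.
  a_seq = 0.0d0
  do ii = 1, imax
     a_seq(idx(ii)) = a_seq(idx(ii)) + b(ii)
  enddo

  ! Emulated vector execution: gather all operands first, then scatter.
  a_vec = 0.0d0
  tmp = a_vec(idx) + b            ! gather with a vector subscript, then add
  do ii = 1, imax
     a_vec(idx(ii)) = tmp(ii)     ! scatter; the later store to bin 1 wins
  enddo

  print *, 'sequential:', a_seq   ! 2.0 1.0 1.0
  print *, 'vector    :', a_vec   ! 1.0 1.0 1.0
end program histogram_dependency
```

The two results agree only when the index array contains no duplicates, which is exactly what a directive such as ivdep or nodep, or the forall form, asks the programmer to guarantee.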

SLIDE 31

Pentium 4 with ifc

fake worse
forall worse than do; identical to directive
large penalty

  • only for comparisons

5 times

SLIDE 32

Pentium 4 with pgi

fake worse
forall better than do
identical with directive
penalty relatively small
pgi better than ifc

SLIDE 33

Itanium 2 with efc

forall does not help; bad implementation
fake nearly as good as directive
factor 2.5

  • only for comparisons
SLIDE 34

NEC SX-5 with f90 (vectorized multiplied by 8)

startup phase not vectorized
fake identical with directive
forall helps
note that vectorized examples have to be divided by 8!

  • only for comparisons
SLIDE 35

histogram loop summary

  • histogram loop, assembling loop; in general not pipelineable/vectorizable

    do ii=1,imax
      jj=index(ii)
      a(jj)=a(jj)+b(ii)
    enddo

  • with directive: good on PC with pgi or efc, Itanium and NEC

    !dir$ ivdep or !cdir nodep
    do ii=1,imax
      a(index(ii))=a(index(ii))+b(ii)
    enddo

  • semantics of forall are dependency free: good on PC with pgi and NEC

    forall (ii=1:imax)
      a(index(ii))=a(index(ii))+b(ii)
    end forall

  • fake (a and aa are identical!): slower on PC; faster on pipeline

    do ii=1,imax
      a(index(ii))=aa(index(ii))+b(ii)
    enddo

SLIDE 36

influence of attributes on performance

SLIDE 37

different array types 1

  • dummy array parameters declared in different ways

– assumed size:
    real(kind=8),dimension(*) :: a, b, c
– assumed shape:
    real(kind=8),dimension(:) :: a, b, c
– assumed shape with pointers:
    real(kind=8),pointer,dimension(:) :: a, b, c

SLIDE 38

different array types 2

  • called with different kinds of actual parameters

– fixed-size arrays
    integer,parameter :: imax_fix=4000000+17
    real(kind=8),dimension(imax_fix) :: a_fix, b_fix, c_fix
– allocatable arrays
    real(kind=8),allocatable,dimension(:) :: a_alloc, b_alloc, c_alloc
– pointer arrays
    real(kind=8),pointer,dimension(:) :: a_pointer, b_pointer, c_pointer
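The combinations of dummy and actual arguments can be laid out in one small program. The procedure names test_assumed_size, test_assumed_shape and test_pointer_update appear in the figure legends of this deck, but their argument lists and bodies below are assumptions, as are the module and program names and the reduced array length. Placing the procedures in a module supplies the explicit interface that assumed-shape and pointer dummies require.

```fortran
module update_variants
  implicit none
contains
  subroutine test_assumed_size(imax, a, b, c)
    integer, intent(in) :: imax
    real(kind=8), dimension(*) :: a, b, c               ! assumed size
    integer :: ii
    do ii = 1, imax
       a(ii) = b(ii) + c(ii)
    enddo
  end subroutine test_assumed_size

  subroutine test_assumed_shape(a, b, c)
    real(kind=8), dimension(:) :: a, b, c               ! assumed shape
    integer :: ii
    do ii = 1, size(a)
       a(ii) = b(ii) + c(ii)
    enddo
  end subroutine test_assumed_shape

  subroutine test_pointer_update(a, b, c)
    real(kind=8), pointer, dimension(:) :: a, b, c      ! pointer dummies
    integer :: ii
    do ii = 1, size(a)
       a(ii) = b(ii) + c(ii)
    enddo
  end subroutine test_pointer_update
end module update_variants

program call_variants
  use update_variants
  implicit none
  integer, parameter :: imax = 1000                     ! assumption: small size for the demo
  real(kind=8), dimension(imax) :: a_fix, b_fix, c_fix
  real(kind=8), allocatable, dimension(:) :: a_alloc, b_alloc, c_alloc
  real(kind=8), pointer, dimension(:) :: a_pointer, b_pointer, c_pointer

  allocate(a_alloc(imax), b_alloc(imax), c_alloc(imax))
  allocate(a_pointer(imax), b_pointer(imax), c_pointer(imax))
  b_fix = 1.0d0;     c_fix = 2.0d0
  b_alloc = 1.0d0;   c_alloc = 2.0d0
  b_pointer = 1.0d0; c_pointer = 2.0d0

  ! Every kind of actual argument can be passed to the (*) and (:) dummies.
  call test_assumed_size(imax, a_fix, b_fix, c_fix)
  call test_assumed_size(imax, a_alloc, b_alloc, c_alloc)
  call test_assumed_size(imax, a_pointer, b_pointer, c_pointer)
  call test_assumed_shape(a_fix, b_fix, c_fix)
  call test_assumed_shape(a_alloc, b_alloc, c_alloc)
  call test_assumed_shape(a_pointer, b_pointer, c_pointer)
  ! Pointer dummies accept only pointer actual arguments.
  call test_pointer_update(a_pointer, b_pointer, c_pointer)

  print *, a_fix(1), a_alloc(imax), a_pointer(imax)
  deallocate(a_alloc, b_alloc, c_alloc)
  deallocate(a_pointer, b_pointer, c_pointer)
end program call_variants
```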

SLIDE 39

Pentium 4 with ifc: different kinds of actual (!) arguments

[Figure; axes: cycles, log2(loop length)
 real(kind=8),dimension(*) :: a,b,c; do ii=1,imax; a(ii)=b(ii)+c(ii); enddo
 curves:
   test_assumed_size with fixed-size arguments
   test_assumed_size with allocatable arguments
   test_assumed_size imax with pointer arguments]

SLIDE 40

Pentium 4 with pgi: different kinds of actual (!) arguments

[Figure; axes: cycles, log2(loop length)
 real(kind=8),dimension(*) :: a,b,c; do ii=1,imax; a(ii)=b(ii)+c(ii); enddo
 curves:
   test_assumed_size with fixed-size arguments
   test_assumed_size with allocatable arguments
   test_assumed_size imax with pointer arguments]

SLIDE 41

Itanium 2 with efc: different kinds of actual (!) arguments

[Figure; axes: cycles, log2(loop length)
 real(kind=8),dimension(*) :: a,b,c; do ii=1,imax; a(ii)=b(ii)+c(ii); enddo
 curves:
   (*) test_assumed_size with fixed-size arguments
   (*) test_assumed_size with allocatable arguments
   (*) test_assumed_size imax with pointer arguments]

SLIDE 42

NEC SX-5 with f90: different types of actual arguments (*8!)

[Figure; axes: cycles, log2(loop length)
 real(kind=8),dimension(*) :: a,b,c; do ii=1,imax; a(ii)=b(ii)+c(ii); enddo
 curves:
   test_assumed_size with fixed-size arguments
   test_assumed_size with allocatable arguments
   test_assumed_size imax with pointer arguments]

SLIDE 43

Pentium 4 with ifc

have all pointer actual arguments
fixed-size or allocatable actual arguments
all assumed size (*)
all assumed shape (:)

SLIDE 44

Pentium 4 with ifc

[Figure; axes: cycles, log2(loop length)
 real(kind=8),(pointer,)dimension(*)(:) :: a,b,c; do ii=1,imax; a(ii)=b(ii)+c(ii); enddo
 curves:
   (*) test_assumed_size with fixed-size arguments
   (*) test_assumed_size with allocatable arguments
   (*) test_assumed_size imax with pointer arguments
   (:) test_assumed_shape with fixed-size arguments
   (:) test_assumed_shape with allocatable arguments
   (:) test_assumed_shape with pointer arguments
   pointer,(:) test_pointer_update with pointer arguments]

SLIDE 45

Pentium 4 with pgi

[Figure; axes: cycles, log2(loop length)
 real(kind=8),(pointer,)dimension(*)(:) :: a,b,c; do ii=1,imax; a(ii)=b(ii)+c(ii); enddo
 curves:
   (*) test_assumed_size with fixed-size arguments
   (*) test_assumed_size with allocatable arguments
   (*) test_assumed_size imax with pointer arguments
   (:) test_assumed_shape with fixed-size arguments
   (:) test_assumed_shape with allocatable arguments
   (:) test_assumed_shape with pointer arguments
   pointer,(:) test_pointer_update with pointer arguments]

have all pointer actual arguments
pointer; very slow
much worse than ifc
fixed-size or allocatable actual arguments

SLIDE 46

Itanium 2 with efc

pointer arguments
(*)(:) fixed-size
(*)(:) allocatable
pointer

SLIDE 47

Itanium 2 with efc

[Figure; axes: cycles, log2(loop length)
 real(kind=8),(pointer,)dimension(*)(:) :: a,b,c; do ii=1,imax; a(ii)=b(ii)+c(ii); enddo
 curves:
   (*) test_assumed_size with fixed-size arguments
   (*) test_assumed_size with allocatable arguments
   (*) test_assumed_size imax with pointer arguments
   (:) test_assumed_shape with fixed-size arguments
   (:) test_assumed_shape with allocatable arguments
   (:) test_assumed_shape with pointer arguments
   pointer,(:) test_pointer_update with pointer arguments]

SLIDE 48

NEC SX-5 with f90 (multiplied by 8!)

[Figure; axes: cycles, log2(loop length)
 real(kind=8),(pointer,)dimension(*)(:) :: a,b,c; do ii=1,imax; a(ii)=b(ii)+c(ii); enddo
 curves:
   (*) test_assumed_size with fixed-size arguments
   (*) test_assumed_size with allocatable arguments
   (*) test_assumed_size imax with pointer arguments
   (:) test_assumed_shape with fixed-size arguments
   (:) test_assumed_shape with allocatable arguments
   (:) test_assumed_shape with pointer arguments
   pointer,(:) test_pointer_update with pointer arguments]

identical for all; but performance decreasing?

SLIDE 49

different attributes of arrays summary

  • simple constructs with dummy array parameters

– assumed size: best way
    real(kind=8),dimension(*) :: a,b,c
– assumed shape: slower than (*)
    real(kind=8),dimension(:) :: a,b,c
– assumed shape with pointers: pointer penalty
    real(kind=8),pointer,dimension(:) :: a,b,c

  • called with different kinds of actual arguments

– fixed-size arrays
    integer,parameter :: imax_fix=4000000+17
    real(kind=8),dimension(imax_fix) :: a_fix,b_fix,c_fix
– allocatable arrays
    real(kind=8),allocatable,dimension(:) :: a_alloc,b_alloc,c_alloc
– pointer arrays: best way if hidden by call
    real(kind=8),pointer,dimension(:) :: a_pointer,b_pointer,c_pointer

  • more complicated situations are not covered; better behaviour cannot be expected

SLIDE 50

influence of self defined types

SLIDE 51

simple dynamic data type

1. dynamic type

    type array_type
      real(kind=8),pointer :: xx(:)
    end type array_type

2. declaration in procedure

    type(array_type) :: a,b,c

3. simple do loop with pointer arrays

    do ii=1,imax
      a%xx(ii)=b%xx(ii)+c%xx(ii)
    enddo

4. loop with forall

    forall(ii=1:imax)
      a%xx(ii)=b%xx(ii)+c%xx(ii)
    end forall

5. array syntax

    a%xx=b%xx+c%xx

6. replaced by call

    call update_call_arr(imax,a%xx,b%xx,c%xx)

7. replaced by call with array sections

    call update_call_arr(imax,a%xx(1:imax),b%xx(1:imax),c%xx(1:imax))

    subroutine update_call_arr(imax,a,b,c)
      real(kind=8),dimension(imax) :: a,b,c
      integer :: ii
      integer :: imax
      do ii=1,imax
        a(ii)=b(ii)+c(ii)
      enddo
    end subroutine update_call_arr
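The variants can be checked against each other in one self-contained program. The sketch below wraps the type and update_call_arr in a module and verifies that the do loop (3), array syntax (5) and the call (6) give the same result. The module and program names, the reference array ref, and the reduced imax are assumptions; no timing is attempted.

```fortran
module dyn_type
  implicit none
  type array_type
     real(kind=8), pointer :: xx(:)
  end type array_type
contains
  subroutine update_call_arr(imax, a, b, c)
    integer :: imax
    real(kind=8), dimension(imax) :: a, b, c   ! explicit shape hides the pointers
    integer :: ii
    do ii = 1, imax
       a(ii) = b(ii) + c(ii)
    enddo
  end subroutine update_call_arr
end module dyn_type

program dyn_type_demo
  use dyn_type
  implicit none
  integer, parameter :: imax = 1000            ! assumption: small size for the demo
  type(array_type) :: a, b, c
  real(kind=8), dimension(imax) :: ref
  integer :: ii

  allocate(a%xx(imax), b%xx(imax), c%xx(imax))
  do ii = 1, imax
     b%xx(ii) = real(ii, kind=8)
  enddo
  c%xx = 2.0d0

  ! 3. simple do loop with pointer arrays (used as the reference result)
  do ii = 1, imax
     a%xx(ii) = b%xx(ii) + c%xx(ii)
  enddo
  ref = a%xx

  ! 5. array syntax
  a%xx = 0.0d0
  a%xx = b%xx + c%xx
  print *, 'array syntax matches:', all(a%xx == ref)

  ! 6. replaced by call
  a%xx = 0.0d0
  call update_call_arr(imax, a%xx, b%xx, c%xx)
  print *, 'call matches        :', all(a%xx == ref)

  deallocate(a%xx, b%xx, c%xx)
end program dyn_type_demo
```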

SLIDE 52

Pentium III with ifc

no penalty for array section
penalty 5 for array syntax
penalty 8/3 for do and forall

SLIDE 53

Pentium III with pgi

no penalty for array section
penalty 38/3 for pointer array
penalty 85/25 for pointer array
penalty 32/3 for pointer array

SLIDE 54

Pentium 4 with ifc

no penalty for array section
forall like simple do loop
penalty 8.5/4.5
penalty 18/4.5=4
array syntax much worse

SLIDE 55

Pentium 4 with pgi

no penalty for array section
array syntax and forall very slow
pointer very slow

SLIDE 56

Itanium 2 with efc

no penalty for array section
forall as fast as do
array syntax essentially better
penalty 11-17.5 for pointer
penalty 3

SLIDE 57

NEC SX-5 with f90 (multiplied by 8!)

array section for call 5 times slower!
forall and array syntax 1.75 times slower
no penalty for pointer

SLIDE 58

simple dynamic data type summary

1. dynamic type (Fortran 2000 will allow allocatable)

    type array_type
      real(kind=8),pointer :: xx(:)
    end type array_type

2. declaration in procedure

    type(array_type) :: a,b,c

3. simple do loop with pointer arrays: no penalty on NEC

    do ii=1,imax
      a%xx(ii)=b%xx(ii)+c%xx(ii)
    enddo

4. loop with forall: could be faster, but is not

    forall(ii=1:imax)
      a%xx(ii)=b%xx(ii)+c%xx(ii)
    end forall

5. array syntax: faster only on Itanium2_efc

    a%xx=b%xx+c%xx

6. replaced by call: best way; calling overhead

    call update_call_arr(imax,a%xx,b%xx,c%xx)

7. replaced by call with array sections: 5 times slower on NEC

    call update_call_arr(imax,a%xx(1:imax),b%xx(1:imax),c%xx(1:imax))