Single Processor Optimization (II)

Russian-German School on High Performance Computer Systems, June 27 to July 6, 2005, Novosibirsk
Day 2, June 28, 2005
HLRS, University of Stuttgart
Slide 1
High Performance Computing Center Stuttgart

Intention
• different kinds of array declarations are tested
• overhead of procedure calls
• overhead of leaving procedures
• allocation/deallocation times
• performance implications of declarations for operations on arrays
Slide 2

tested machines and compilers

1. machine type                                        Itanium2_efc    NECSX5_f90  Pentium4_ifc  Pentium4_pgi
2. frequency in MHz                                            1000           500          2400          2400
3. number of clock tics for setting the next clock tic           38            31            92           112
4. number of tics of calling overhead for one clock call         37            29            88            88
Slide 3

Compiler
• Intel IA-32
    ifc -O3 -nodps -hlo -tpp6 (or -tpp7)
• Intel IA-64
    efc -O3 -hlo -opt_report -opt_report_levelmax -opt_report_phaseall # -ip # -S
• Portland Group (there may be better options)
    pgf90 -fast
• NEC SX
    f90
  note that non-overlapping pointers are the default
Slide 4

test environment
• the simplest procedures should allow for the best optimization
• a slim instruction body exposes the calling overhead
• timers are based on hardware counters read by assembler routines
  – calling overhead 30 - 90 cycles
  – PAPI is too inaccurate
• programs are portable as long as hardware counters can be provided
Slide 5

how to call IA-32 counter
• IA-32 Linux; assembler embedded in C; also pgi; also AMD
• icc -c clock_tic.c

    unsigned long long int clock_tic_ ()
    {
        unsigned long long int x;
        __asm__ volatile ("rdtsc\n" : "=A" (x));
        return x;
    }

Fortran caller:

    integer(kind=8) :: int_start
    integer(kind=8) :: int_end
    integer(kind=8),external :: clock_tic

    int_start=clock_tic()
    do ii=1,imax
      a(ii)=b(ii)+c(ii)
    enddo
    int_end=clock_tic()
Slide 6

how to call IA-64 counter
• IA-64 Linux Assembler
• ecc -c clock_tic.s

            .text
            .align 16
    // C version
    // long clock_tic()
            .global clock_tic#
            .proc clock_tic#
    clock_tic:
            mov r8 = ar.itc
            br.ret.sptk.many b0
            .endp clock_tic#

            .align 16
    // Fortran version
    // integer*8 clock_tic
            .global clock_tic_#
            .proc clock_tic_#
    clock_tic_:
            mov r8 = ar.itc
            br.ret.sptk.many b0
            .endp clock_tic_#

Fortran caller:

    integer(kind=8) :: int_start
    integer(kind=8) :: int_end
    integer(kind=8),external :: clock_tic

    int_start=clock_tic()
    do ii=1,imax
      a(ii)=b(ii)+c(ii)
    enddo
    int_end=clock_tic()
Slide 7

how to call NEC SX counter
• NEC SX user time counter
• as -dl clock_tic.s

            global clock_tic_
    clock_tic_:
            stusrcc $s123
            b 0(,$s32)

NEC SX wall clock counter
• as -dl clock_tic_wall.s

            global clock_tic_wall_
    clock_tic_wall_:
            ststm $s123
            b 0(,$s32)

Fortran caller:

    integer(kind=8) :: int_start
    integer(kind=8) :: int_end
    integer(kind=8),external :: clock_tic

    int_start=clock_tic()
    do ii=1,imax
      a(ii)=b(ii)+c(ii)
    enddo
    int_end=clock_tic()
Slide 8

Part 1: procedure calls and declarations
• detailed measurements of entering and leaving procedures
• allocation/deallocation and automatic array timings
• tested are procedures in a module
  – in the same file
  – and in a different file
• allocation of a large number of pointers
• simple recursive procedures
Slide 9

measuring methodology

measuring environment:

    int_1=clock_tic()
    do nn=1,nmax      ! repetition loop
      int_2=clock_tic()
      call extern_explicit_shape_array(array,ix,iy)
      int_5=clock_tic()
      int_time_23=int_time_23+(int_3-int_2)
      int_time_34=int_time_34+(int_4-int_3)
      int_time_45=int_time_45+(int_5-int_4)
    enddo
    int_6=clock_tic()
    int_time_16=int_6 - int_1

measured procedure:

    subroutine explicit_shape_array(array,ix,iy)
      integer :: ix,iy
      real(kind=8),dimension(ix,iy) :: array
      int_3=clock_tic()
      array(1,1) = 0.
      int_4=clock_tic()
    end subroutine explicit_shape_array

calculation of timings:

    time_array(1)=real(int_time_23)/real(nmax) - tics_for_calling_clock
    time_array(2)=real(int_time_34)/real(nmax) - tics_for_calling_clock
    time_array(3)=real(int_time_45)/real(nmax) - tics_for_calling_clock
    time_array(4)=int_time_16 - tics_for_calling_clock
Slide 10

interpretation of tables
row 1: machine
rows 2-4: procedure in the same file (entering, body, leaving)
rows 5-7: procedure in a different file (entering, body, leaving)

                            Itanium2_efc    NECSX5_f90  Pentium4_ifc  Pentium4_pgi
same file       entering               7            43             4            12
                body                   1             5             0             4
                leaving                3             8             0             4
different file  entering               7            40             4            12
                body                   1             5             0             4
                leaving               43            45            88            96
Slide 11

implicit_procedure

    subroutine implicit_procedure2(j,time)
      integer :: j
      real(kind=8) :: time
      j=max(j,int(time))   ! only to confuse the compiler
    end subroutine implicit_procedure2

procedure in an external file; total time for one call

                            Itanium2_efc    NECSX5_f90  Pentium4_ifc  Pentium4_pgi
call                                  21            42            24            32
Slide 12

explicit_shape_array

    subroutine explicit_shape_array(array,ix,iy)
      integer :: ix,iy
      real(kind=8),dimension(ix,iy) :: array
      int_3=clock_tic()
      array(1,1) = 0.
      int_4=clock_tic()
    end subroutine explicit_shape_array

short time for leaving the procedure when the procedure is in the same file

                            Itanium2_efc    NECSX5_f90  Pentium4_ifc  Pentium4_pgi
same file       entering               7            43             4            12
                body                   1             5             0             4
                leaving                3             8             0             4
different file  entering               7            40             4            12
                body                   1             5             0             4
                leaving               43            45            88            96
Slide 13

assumed_shape_array

    subroutine assumed_shape_array(array)
      real(kind=8),dimension(:,:) :: array
      int_3=clock_tic()
      array(1,1) = 0.
      int_4=clock_tic()
    end subroutine assumed_shape_array

needs much more time for entering and leaving the procedure

                            Itanium2_efc    NECSX5_f90  Pentium4_ifc  Pentium4_pgi
same file       entering              11           184             8           745
                body                  13             7             0             0
                leaving               83            82           180           228
different file  entering              10           183             8           752
                body                  13             7             0             0
                leaving              123           119           272           360
Slide 14

assumed_shape_array_section 1

    subroutine assumed_shape_array_section(array_1,array_2,digit,ix,iy)
      real(kind=8),dimension(:,:) :: array_1
      real(kind=8),dimension(:,:) :: array_2
      real(kind=8) :: digit
      integer :: ix,iy
      int_3=clock_tic()
      array_1(ix,iy) = digit; array_2(ix,iy) = 2.; digit = array_1(ix,iy) + array_2(ix,iy)
      int_4=clock_tic()
    end subroutine assumed_shape_array_section

three different cases for actual parameters:

    call assumed_shape_array_section(array_1,array_2,digit,a,b)
    call assumed_shape_array_section(array_1(1:a,1:b),array_2(1:a,1:b),digit,a,b)
    call assumed_shape_array_section(array_1(1:a:2,1:b:2),array_2(1:a:2,1:b:2),digit,a,b)
Slide 15

assumed_shape_array_section 2

    1) call assumed_shape_array_section(array_1,array_2,digit,a,b)
    2) call assumed_shape_array_section(array_1(1:a,1:b),array_2(1:a,1:b),digit,a,b)
    3) call assumed_shape_array_section(array_1(1:a:2,1:b:2),array_2(1:a:2,1:b:2),digit,a,b)

                            Itanium2_efc    NECSX5_f90  Pentium4_ifc  Pentium4_pgi
call 1)         entering              17           377            24          1494
                body                  43            27            52            25
                leaving                7             8             4            96
call 2)         entering              52          1302            85          2146
                body                  40            29            50            22
                leaving                7             8             4            96
call 3)         entering              56          1183            84          5997
                body                  30            27            36            25
                leaving               51            45            96          4193

copying when entering the procedure for cases 2 and 3
Slide 16

deferred_shape_array

    subroutine deferred_shape_array(digit,x,y)
      real(kind=8),allocatable,dimension(:,:) :: array_1
      real(kind=8),allocatable,dimension(:,:) :: array_2
      real(kind=8) :: digit
      integer :: x,y
      int_3=clock_tic()
      allocate (array_1(x,y),array_2(x,y))
      array_1(x,y) = digit; array_2(x,y) = 2. ; digit = array_1(x,y) + array_2(x,y)
      deallocate (array_1,array_2)
      int_4=clock_tic()
    end subroutine deferred_shape_array

high times for allocation and deallocation
large times for leaving the procedure

                            Itanium2_efc    NECSX5_f90  Pentium4_ifc  Pentium4_pgi
same file       entering              15           218             4            16
                body                 682          2405          1036          3218
                leaving             1642          1707          3611          4485
different file  entering              15           219             4            16
                body                 682          2391          1051          3297
                leaving             1719          1962          3709          4577
Slide 17

automatic_arrays

    subroutine automatic_arrays(digit,ix,iy)
      real(kind=8),dimension(ix,iy) :: array_1,array_2
      real(kind=8) :: digit
      integer :: ix,iy
      int_3 = clock_tic()
      array_1(ix,iy) = digit; array_2(ix,iy) = 2.
      digit = array_1(ix,iy) + array_2(ix,iy)
      int_4 = clock_tic()
    end subroutine automatic_arrays

NEC SX is quite fast
pgi is much worse
a much better solution than allocation and deallocation

                            Itanium2_efc    NECSX5_f90  Pentium4_ifc  Pentium4_pgi
same file       entering             372           208           500          1369
                body                  15            26            36            16
                leaving              284            78           269           732
different file  entering             372           206           500          1349
                body                  15            26            36            16
                leaving              598           172           625          1493
Slide 18
