High Performance Fortran (HPF)

slide-1
SLIDE 1

High Performance Fortran (HPF)

Source: Chapter 7 of "Designing and Building Parallel Programs" (Ian Foster, 1995)

slide-2
SLIDE 2

Question

  • Can't we just have a clever compiler generate a parallel program from a sequential program?

  • Fine-grained parallelism

x = a*b + c*d

  • Trivial parallelism

for i := 1 to 100 do
  for j := 1 to 100 do
    C[i, j] := dotproduct(A[i, *], B[*, j]);

slide-3
SLIDE 3

Automatic parallelism

Automatic parallelization of any program is extremely hard.

Solutions:

  • Make restrictions on source program
  • Restrict kind of parallelism used
  • Use semi-automatic approach
  • Use application-domain oriented languages
slide-4
SLIDE 4

High Performance Fortran (HPF)

  • Designed by a forum from industry, government, and universities
  • Extends Fortran 90
  • To be used for computationally expensive numerical applications
  • Portable to SIMD machines, vector processors, shared-memory MIMD and distributed-memory MIMD

slide-5
SLIDE 5

Fortran 90 - Base language of HPF

Extends Fortran 77 with 'modern' features

  • abstract data types, modules
  • recursion
  • pointers, dynamic storage

Array operators

A = B + C
A = A + 1.0
A(1:7) = B(1:7) + B(2:8)
WHERE (X /= 0) X = 1.0/X
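
To make the array notation concrete, here is a rough Python model of what the array-section assignment and the WHERE statement compute. It is only a sketch: Python lists are 0-based, so the Fortran section B(1:7) becomes b[0:7], and the sample values are made up for illustration.

```python
# Hypothetical sample data; Fortran indices 1..8 map to Python indices 0..7.
b = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
a = [0.0] * 8

# A(1:7) = B(1:7) + B(2:8): the whole right-hand side is evaluated
# before any element of A is stored.
a[0:7] = [left + right for left, right in zip(b[0:7], b[1:8])]

# WHERE (X /= 0) X = 1.0/X: masked assignment, zeros are left untouched.
x = [2.0, 0.0, 4.0, 0.5]
x = [1.0 / v if v != 0 else v for v in x]
```

Every element of the result depends only on the old operand values, which is why a compiler is free to compute the elements in any order or in parallel.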

slide-6
SLIDE 6

Data parallelism

  • Data parallelism: same operation applied to different data elements in parallel
  • Data parallel program: sequence of data parallel operations
  • Overall approach:
    – Programmer does domain decomposition
    – Compiler partitions operations automatically
  • Data may be regular (array) or irregular (tree, sparse matrix)
  • Most data parallel languages only deal with arrays

slide-7
SLIDE 7

Data parallelism - Concurrency

Explicit parallel operations

A = B + C   ! A, B, and C are arrays

Implicit parallelism

do i = 1,m
  do j = 1,n
    A(i,j) = B(i,j) + C(i,j)
  enddo
enddo

slide-8
SLIDE 8

Compiling data parallel programs

  • Programs are translated automatically into parallel SPMD (Single Program Multiple Data) programs
  • Each processor executes same program on subset of the data
  • Owner computes rule:
    – Each processor owns subset of the data structures
    – Operations required for an element are executed by the owner
    – Each processor may read (but not modify) other elements
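
The owner computes rule can be sketched in a few lines of Python. This is an illustrative model, not what an HPF compiler emits: the function names `owned_range` and `spmd_scale` are hypothetical, a block partition is assumed, and the loop over `rank` stands in for processes that would really run at the same time.

```python
def owned_range(rank, nprocs, n):
    """Half-open 0-based index range owned by `rank` under a block split."""
    block = -(-n // nprocs)  # ceiling division
    lo = min(rank * block, n)
    hi = min(lo + block, n)
    return lo, hi


def spmd_scale(x, nprocs, factor):
    """Run the same 'program' once per rank; each rank writes only what it owns."""
    result = list(x)
    for rank in range(nprocs):  # stands in for nprocs parallel processes
        lo, hi = owned_range(rank, nprocs, len(x))
        for i in range(lo, hi):  # owner computes: only owned elements
            result[i] = x[i] * factor
    return result


x = [float(i) for i in range(1, 101)]
scaled = spmd_scale(x, nprocs=4, factor=3.0)
```

Because every rank touches only its own index range, the element-wise update needs no communication at all.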
slide-9
SLIDE 9

Example

real s, X(100), Y(100)          ! s is scalar, X and Y are arrays
X = X * 3.0                     ! Multiply each X(i) by 3.0
do i = 2,99
  Y(i) = (X(i-1) + X(i+1))/2    ! Communication required
enddo
s = SUM(X)                      ! Communication required

X and Y are distributed (partitioned)
s is replicated on each machine

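
Where exactly the communication arises can be worked out with a small Python sketch. It assumes a BLOCK distribution over 4 processors (the slide does not fix the processor count), and the helper `block_owner` is a hypothetical name introduced here.

```python
def block_owner(i, n, nprocs):
    """Owner (0-based rank) of 1-based element i under a BLOCK distribution."""
    block = -(-n // nprocs)  # ceiling division
    return (i - 1) // block


N, NPROCS = 100, 4  # NPROCS is an assumption; the slide does not fix it

# Y(i) = (X(i-1) + X(i+1))/2 for i = 2..99: the owner of Y(i) does the work
# and must fetch any X element that lives on another rank.
remote_reads = []
for i in range(2, 100):
    me = block_owner(i, N, NPROCS)
    for j in (i - 1, i + 1):
        if block_owner(j, N, NPROCS) != me:
            remote_reads.append((i, j))

# s = SUM(X): each rank sums its own block, then the partial sums are combined.
x = [3.0 * k for k in range(1, N + 1)]
partial = [0.0] * NPROCS
for i in range(1, N + 1):
    partial[block_owner(i, N, NPROCS)] += x[i - 1]
s = sum(partial)
```

Only the elements next to a block boundary appear in `remote_reads`, so the stencil needs a small neighbour exchange, while the reduction needs one value from every rank.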

slide-10
SLIDE 10

HPF primitives for data distribution

  • Directives:
    – PROCESSORS: shape & size of abstract processors
    – ALIGN: align elements of different arrays
    – DISTRIBUTE: distribute (partition) an array
  • Directives affect performance of the program, not its result

slide-11
SLIDE 11

Processors directive

!HPF$ PROCESSORS P(32)
!HPF$ PROCESSORS Q(4,8)

  • Mapping of abstract to physical processors not specified in HPF (implementation-dependent)

slide-12
SLIDE 12

Alignment directive

  • Aligns an array with another array
  • Specifies that specific elements should be mapped to the same processor

real A(50), B(50)
!HPF$ ALIGN A(I) WITH B(I)     ! A(1) on same cpu as B(1), etc
!HPF$ ALIGN A(I) WITH B(I+2)   ! A(1) on same cpu as B(3), etc
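
The effect of an alignment offset can be modelled as a simple index translation. The Python sketch below assumes that B is BLOCK-distributed over 5 processors, which the slide does not state, and the names `block_owner` and `owner_of_a` are hypothetical. It also simply rejects indices whose alignment target would fall outside B, which is a choice made for this sketch.

```python
def block_owner(i, n, nprocs):
    """Owner (0-based rank) of 1-based element i under a BLOCK distribution."""
    block = -(-n // nprocs)  # ceiling division
    return (i - 1) // block


N, NPROCS = 50, 5  # NPROCS and the BLOCK distribution of B are assumptions


def owner_of_a(i, offset):
    """Rank holding A(i) when 'ALIGN A(I) WITH B(I+offset)' is in effect."""
    target = i + offset
    if not 1 <= target <= N:
        raise IndexError("alignment target B(%d) lies outside B" % target)
    return block_owner(target, N, NPROCS)
```

With `offset=0` each A(i) sits with B(i); with `offset=2`, A(9) follows B(11) onto the next processor even though B(9) is still on the first one.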

slide-13
SLIDE 13

Distribution directive

  • Specifies how elements should be partitioned among the local memories
  • Each dimension can be distributed as follows:

*            no distribution
BLOCK (n)    block distribution
CYCLIC (n)   cyclic distribution
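
The two distribution formats boil down to two index formulas. The Python sketch below states them for 1-based element indices and 0-based processor ranks; the function names are hypothetical, and the defaults follow the usual HPF reading that plain BLOCK uses a block size of ceil(extent / processors) and plain CYCLIC means CYCLIC(1).

```python
def block_owner(i, extent, nprocs, n=None):
    """BLOCK or BLOCK(n): contiguous chunks; default n = ceil(extent / nprocs)."""
    if n is None:
        n = -(-extent // nprocs)  # ceiling division
    return (i - 1) // n


def cyclic_owner(i, nprocs, n=1):
    """CYCLIC or CYCLIC(n): chunks of n elements dealt out round-robin."""
    return ((i - 1) // n) % nprocs


# Owners of an 8-element array on 4 processors, and of 16 elements for CYCLIC(2).
block_map = [block_owner(i, 8, 4) for i in range(1, 9)]
cyclic_map = [cyclic_owner(i, 4) for i in range(1, 9)]
cyclic2_map = [cyclic_owner(i, 4, n=2) for i in range(1, 17)]
```

BLOCK keeps neighbouring elements together, which suits stencil codes, while CYCLIC spreads them out, which helps when the work per element is uneven.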

slide-14
SLIDE 14

Figure 7.7 from Foster's book

slide-15
SLIDE 15

Example: Successive Overrelaxation (SOR)

Recall algorithm discussed in Introduction:

float G[1:N, 1:M], Gnew[1:N, 1:M];
for (step = 0; step < NSTEPS; step++)
    for (i = 2; i < N; i++)    /* update grid */
        for (j = 2; j < M; j++)
            Gnew[i,j] = f(G[i,j], G[i-1,j], G[i+1,j], G[i,j-1], G[i,j+1]);
    G = Gnew;

slide-16
SLIDE 16

Parallel SOR with message passing

float G[lb-1:ub+1, 1:M], Gnew[lb-1:ub+1, 1:M];
for (step = 0; step < NSTEPS; step++)
    SEND(cpuid-1, G[lb]);         /* send 1st row left */
    SEND(cpuid+1, G[ub]);         /* send last row right */
    RECEIVE(cpuid-1, G[lb-1]);    /* receive from left */
    RECEIVE(cpuid+1, G[ub+1]);    /* receive from right */
    for (i = lb; i <= ub; i++)    /* update my rows */
        for (j = 2; j < M; j++)
            Gnew[i,j] = f(G[i,j], G[i-1,j], G[i+1,j], G[i,j-1], G[i,j+1]);
    G = Gnew;
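
The ghost-row idea behind the pseudocode can be checked with a single-process Python simulation. This is a sketch under stated assumptions: `f` is taken to be a plain four-neighbour average (the slides leave it unspecified), SEND and RECEIVE are replaced by copying the neighbours' edge rows, and the names `step_sequential` and `step_partitioned` are hypothetical.

```python
def f(centre, up, down, left, right):
    # Assumed update rule: four-neighbour average (centre value unused).
    return (up + down + left + right) / 4.0


def step_sequential(g):
    """One update of all interior points of the whole grid."""
    n, m = len(g), len(g[0])
    new = [row[:] for row in g]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            new[i][j] = f(g[i][j], g[i - 1][j], g[i + 1][j], g[i][j - 1], g[i][j + 1])
    return new


def step_partitioned(g, nprocs):
    """Same update, but each 'cpu' sees only its rows plus two ghost rows."""
    n, m = len(g), len(g[0])
    rows = -(-n // nprocs)  # rows per cpu, ceiling division
    new = [row[:] for row in g]
    for cpu in range(nprocs):
        lb, ub = cpu * rows, min((cpu + 1) * rows, n) - 1
        first = max(lb - 1, 0)       # ghost row above, if a neighbour exists
        last = min(ub + 1, n - 1)    # ghost row below, if a neighbour exists
        local = [g[i][:] for i in range(first, last + 1)]  # private copy
        for i in range(max(lb, 1), min(ub, n - 2) + 1):    # update my rows
            li = i - first
            for j in range(1, m - 1):
                new[i][j] = f(local[li][j], local[li - 1][j], local[li + 1][j],
                              local[li][j - 1], local[li][j + 1])
    return new


# Hypothetical 8 x 6 test grid.
grid = [[float(i * i + j) for j in range(6)] for i in range(8)]
```

Calling `step_partitioned(grid, 3)` gives the same grid as `step_sequential(grid)`, which shows that one ghost row on each side is all the remote data a row-wise partition needs per step.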

slide-17
SLIDE 17

Finite differencing (~ SOR) in HPF

See Ian Foster, Program 7.2; uses convergence criterion instead of fixed number of steps

program hpf_finite_difference
!HPF$ PROCESSORS pr(4)                    ! use 4 CPUs
real X(100, 100), New(100, 100)           ! data arrays
!HPF$ ALIGN New(:,:) WITH X(:,:)
!HPF$ DISTRIBUTE X(BLOCK,*) ONTO pr       ! row-wise
New(2:99, 2:99) = (X(1:98, 2:99) + X(3:100, 2:99) + &
                   X(2:99, 1:98) + X(2:99, 3:100))/4
diffmax = MAXVAL (ABS (New-X))
end
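
The code on this slide shows a single update and the computation of diffmax. How that value could drive a convergence loop is sketched below in sequential Python. The loop structure, the tolerance `tol`, the cap `max_steps` and the test grid are all assumptions made for this illustration and are not taken from Foster's program.

```python
def relax_until_converged(x, tol, max_steps=10000):
    """Repeat the four-neighbour average until the largest change is below tol."""
    n, m = len(x), len(x[0])
    for step in range(1, max_steps + 1):
        new = [row[:] for row in x]
        for i in range(1, n - 1):
            for j in range(1, m - 1):
                new[i][j] = (x[i - 1][j] + x[i + 1][j] + x[i][j - 1] + x[i][j + 1]) / 4
        # Counterpart of diffmax = MAXVAL(ABS(New - X)).
        diffmax = max(abs(new[i][j] - x[i][j]) for i in range(n) for j in range(m))
        x = new
        if diffmax < tol:
            return x, step, diffmax
    return x, max_steps, diffmax


# Hypothetical test grid: top edge held at 1.0, everything else starts at 0.0.
grid = [[1.0] * 10] + [[0.0] * 10 for _ in range(9)]
result, steps, diffmax = relax_until_converged(grid, tol=1e-4)
```

In the HPF version the maximum is a reduction over distributed data, so testing for convergence costs one global communication per step.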

slide-18
SLIDE 18

Changing the distribution

Use block distribution instead of row distribution

program hpf_finite_difference
!HPF$ PROCESSORS pr(2,2)                    ! use 2x2 grid
real X(100, 100), New(100, 100)             ! data arrays
!HPF$ ALIGN New(:,:) WITH X(:,:)
!HPF$ DISTRIBUTE X(BLOCK, BLOCK) ONTO pr    ! block-wise
New(2:99, 2:99) = (X(1:98, 2:99) + X(3:100, 2:99) + &
                   X(2:99, 1:98) + X(2:99, 3:100))/4
diffmax = MAXVAL (ABS (New-X))
end
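
What the change of distribution means for communication can be estimated with a back-of-the-envelope Python sketch. It counts, for each processor, how many boundary elements it must receive per step for the four-neighbour stencil. The function name `halo_elements` is hypothetical, the processor grid is assumed to divide the array evenly, and whole edge rows and columns are counted, ignoring the few elements that the fixed border trims off.

```python
def halo_elements(n, m, prow, pcol, r, c):
    """Elements that rank (r, c) on a prow x pcol processor grid receives per step
    for an n x m array and a four-neighbour stencil."""
    rows, cols = n // prow, m // pcol  # local block size; assumes even division
    count = 0
    if r > 0:
        count += cols  # edge row from the neighbour above
    if r < prow - 1:
        count += cols  # edge row from the neighbour below
    if c > 0:
        count += rows  # edge column from the neighbour to the left
    if c < pcol - 1:
        count += rows  # edge column from the neighbour to the right
    return count


# (BLOCK,*) on pr(4) versus (BLOCK,BLOCK) on pr(2,2), both for a 100 x 100 array.
row_wise = [halo_elements(100, 100, 4, 1, r, 0) for r in range(4)]
block_wise = [halo_elements(100, 100, 2, 2, r, c) for r in range(2) for c in range(2)]
```

Under these assumptions the two middle processors of the row-wise layout receive 200 elements each, while every processor of the 2x2 layout receives 100, at the price of talking to neighbours in two directions.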

slide-19
SLIDE 19

Performance

Distribution affects

  • Load balance
  • Amount of communication

Example (communication costs):

!HPF$ PROCESSORS pr(3)
integer A(8), B(8), C(8)
!HPF$ ALIGN B(:) WITH A(:)
!HPF$ DISTRIBUTE A(BLOCK) ONTO pr
!HPF$ DISTRIBUTE C(CYCLIC) ONTO pr
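
The cost difference in this example can be made visible by listing the owners of each element. The Python sketch below does that for the declarations above. The two element-wise statements it considers, A = B and A = C, are hypothetical and chosen here only to contrast aligned with differently distributed operands.

```python
def block_owner(i, extent, nprocs):
    """Owner (0-based rank) of 1-based element i under BLOCK."""
    n = -(-extent // nprocs)  # ceiling division
    return (i - 1) // n


def cyclic_owner(i, nprocs):
    """Owner (0-based rank) of 1-based element i under CYCLIC."""
    return (i - 1) % nprocs


NPROCS, EXTENT = 3, 8
a_owner = [block_owner(i, EXTENT, NPROCS) for i in range(1, EXTENT + 1)]
b_owner = list(a_owner)  # ALIGN B(:) WITH A(:)
c_owner = [cyclic_owner(i, NPROCS) for i in range(1, EXTENT + 1)]

# Element-wise A = B: operands are always on the owner of A(i).
remote_a_eq_b = sum(1 for a, b in zip(a_owner, b_owner) if a != b)
# Element-wise A = C: C(i) must be fetched whenever its owner differs.
remote_a_eq_c = sum(1 for a, c in zip(a_owner, c_owner) if a != c)
```

With 8 elements on 3 processors, the aligned operands need no transfers, whereas mixing BLOCK and CYCLIC leaves only two of the eight elements on the same processor.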

slide-20
SLIDE 20

Figure 7.9 from Foster's book

slide-21
SLIDE 21

Historical Evaluation

  • See: “The rise and fall of High Performance Fortran: an historical object lesson” by Ken Kennedy, Charles Koelbel, Hans Zima. In: Proceedings of the third ACM SIGPLAN conference on History of programming languages, June 2007 [Optional, obtainable from ACM Digital Library]

slide-22
SLIDE 22

Problems with HPF

  • Immature compiler technology
    – Upgrading to Fortran 90 was complicated
    – Implementing HPF extensions took much time
  • HPC community was impatient and started using MPI
  • Missing features:
    – Support for sparse arrays and other irregular data structures
  • Obtaining portable performance was difficult
  • Performance tuning was difficult
slide-23
SLIDE 23

Impact of HPF

  • Huge impact on parallel language design
    – Very frequently cited
    – Some impact on OpenMP (shared-memory standard)
    – Impact on programming systems for GPUs
    – New wave of High Productivity Computing Systems (HPCS) languages: Chapel (Cray), Fortress (Sun), X10 (IBM)
  • Used in extended form (HPF/JA) for Japanese Earth Simulator

slide-24
SLIDE 24

Conclusions

  • High-level model
  • User specifies data distribution
  • Compiler generates parallel program + communication
  • More restrictive than general message passing model (only data parallelism)
  • Restricted to array-based data structures
  • HPF programs are easy to modify, which enhances portability
  • Changing data distribution only requires changing directives