SIMD (Single Instruction, Multiple Data): Parallelism through simultaneous operations on different data




1

Lecture 12: SIMD machines & data parallelism, dependency analysis for automatic vectorization and parallelization of serial programs, Part 1

2

SIMD Single Instruction Multiple Data

Parallelism through simultaneous operations on different data

Fine grain parallelism

  • Systolic arrays
  • Parallel SIMD machines

– 10k or more processors

  • Vector/Pipeline units

3

Systolic Array

  • Network of "processors" with memory around it

– Performance is gained by completing all computations before results are stored back to memory

  • Often hardware implementations that solve one specific problem

– Special topologies

4

SIMD Machine

  • Front end

– Normal von Neumann machine
– Runs the application program

  • Processor array

– Synchronous
– All processors perform the same operation at the same time, or are idle
– Extends the FPU's instructions
– Small memory per processor
– Smart memory
– I/O

  • Examples

– ILLIAC IV, IBM GF11, MasPar, CM200 (Bellman, 16k)

[Figure: host and controller]

5

Data Parallel Programming

  • Idea: update the elements of an array at the same time
  • Divides the work between the programmer and the compiler
  • The programmer solves the problem in their own model

– Concentrates on structure and concepts at a high level
– Collective operations on large data structures
– Keeps data in large arrays with mapping information

  • The compiler maps the program onto a physical machine

– Fills in all the details (gladly receives hints from the user)
– Optimizes computations and communications

6

Building Blocks in Data Parallel Programming

  • The user controls the placing of data on processors

– Minimize communication: keep all processors busy

  • Operations on whole arrays

– Apply one operation on each element in the array in parallel

  • Methods to access parts of an array
  • Operations can operate on these parts

– Example: element < 0 ⇒ element := 1

  • Reduction operations on arrays

– produces a result from a combination of many array elements: sum, max, min, ...

  • Shift operations along the axis on multidimensional arrays
  • Scan-operations

– prefix/suffix-operations

  • Generalized communication
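
The building blocks above can be illustrated with a small sequential sketch. Python lists stand in for parallel arrays here, and the helper names `masked_assign` and `shift` are hypothetical, chosen only for this illustration. On a data parallel machine each list comprehension would correspond to one collective operation.

```python
from itertools import accumulate


def masked_assign(values, predicate, new_value):
    """Operate on part of an array: element < 0 => element := 1, for example."""
    return [new_value if predicate(v) else v for v in values]


def shift(values, offset, fill=0):
    """Shift along the axis: result[i] = values[i - offset], fill at the edge."""
    n = len(values)
    return [values[i - offset] if 0 <= i - offset < n else fill
            for i in range(n)]


data = [3, -1, 4, -1, 5]

doubled = [2 * v for v in data]                     # operation on the whole array
clipped = masked_assign(data, lambda v: v < 0, 1)   # operation on selected elements
total, largest = sum(data), max(data)               # reduction operations
shifted = shift(data, 1)                            # shift one position along the axis
prefix = list(accumulate(data))                     # scan (prefix sums)
```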

7

C*

  • Supports broadcast, reduction and interprocessor communication
  • Parallel variables have a type and a shape
  • A shape defines the number of elements and their organization

shape [16384]employees;    /* 1-D */
shape [512][512]image;     /* 2-D */

  • Left indexing: indexing that refers to the parallel variable's 1st dimension as axis 0, the 2nd as axis 1, etc.

int: employees employee_id;

[2]employee_id refers to the 3rd element of employee_id

8

C*

shape [16384]employees;
struct date {
    int month;
    int day;
    int year;
};
struct date: employees birthday;

Each element of the parallel variable birthday contains a date. birthday.month specifies all 16384 month fields in the parallel variable birthday.

9

C* - Parallel Operations

  • Overloading

– x = y + z (adds y and z in each position in the shape)

  • New operations

– a, b scalar or parallel
– a <? b: minimum of two variables
– a >? b: maximum of two variables

  • Selection of shape (with)

shape [16384]numbers;
int: numbers x, y, z;
with (numbers)
    x = y + z;

10

C* - Setting the Context

  • where

– Limits the area where the operation is performed

with (numbers)
    where (z != 0)    /* sets active positions */
        x = y / z;
    else              /* reverses active positions */
        x = y;

  • everywhere

– All positions are active, independently of the earlier context
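
As a rough illustration of how a where/else context behaves, the following Python sketch computes the active set once and then applies each branch only at its own positions. The function name `where_divide` is hypothetical, and a real SIMD machine would execute each branch as one synchronous step over all processors, with the inactive ones idle.

```python
def where_divide(y, z):
    """x = y / z at active positions (z != 0), x = y at the others."""
    active = [zi != 0 for zi in z]          # context set by where
    x = [None] * len(y)
    # Branch 1: only the active positions take part.
    for i, is_active in enumerate(active):
        if is_active:
            x[i] = y[i] / z[i]
    # Branch 2 (else): the context is reversed.
    for i, is_active in enumerate(active):
        if not is_active:
            x[i] = y[i]
    return x
```

For example, `where_divide([6.0, 5.0, 8.0], [2.0, 0.0, 4.0])` divides at positions 0 and 2 and copies `y` at position 1, so no division by zero is ever attempted.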

11

C* - Communication

  • Grid communication

– pcoord (~mynode) gives my index along an axis in the shape

  • Example: send the value of source to the element of dest that is one position higher up

[pcoord(0) + 1]dest = source;

A dot (.) is sometimes used instead of pcoord:

[. + 1]dest = source;
[. + 1][. - 2]dest = source;

12

Compute Pi in C*

$\pi \approx \frac{1}{N} \sum_{i=0}^{N-1} \frac{4}{1 + x_i^2}$, where $x_i = (i + 1/2)/N$

#define N 400000
shape [N]chunk;
double: chunk x;

main()
{
    double sum;
    double width;

    width = 1.0 / N;
    with (chunk) {                          /* executed in parallel */
        x = (pcoord(0) + 0.5) * width;
        sum = (+= (4.0 / (1.0 + x * x)));   /* sum reduction */
    }
    sum = sum * width;
    printf("Estimate of Pi = %14.12f\n", sum);
}
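
The same midpoint rule can be checked with a short sequential sketch in Python. The function name `estimate_pi` is hypothetical. The list comprehension plays the role of the parallel assignment to x, and the sum plays the role of the += reduction.

```python
import math


def estimate_pi(n):
    """Midpoint-rule estimate of the integral of 4/(1+x^2) over [0, 1]."""
    width = 1.0 / n
    xs = [(i + 0.5) * width for i in range(n)]      # x = (pcoord(0) + 0.5) * width
    return width * math.fsum(4.0 / (1.0 + x * x) for x in xs)   # reduction


if __name__ == "__main__":
    print("Estimate of Pi = %14.12f" % estimate_pi(400000))
```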


13

Compute Partial Sums in Array (C*)

#define N 1024
shape [N]ArrayShape;
int: ArrayShape x;
int i;

main()
{
    with (ArrayShape)                               /* select shape */
        for (i = 1; i <= log2(N); i++)
            where (pcoord(0) >= pow(2, i-1))        /* active positions */
                x += [pcoord(0) - pow(2, i-1)]x;    /* left indexing */
}
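
The logarithmic-step partial sum scheme can be sketched sequentially in Python. The name `parallel_prefix_sums` is hypothetical. The important assumption, labeled in the code, is that every position reads the old values before any position writes, which is what the synchronous SIMD step provides.

```python
from itertools import accumulate


def parallel_prefix_sums(values):
    """Inclusive prefix sums in ceil(log2(n)) data parallel steps."""
    x = list(values)
    n = len(x)
    step = 1                       # step = 2**(i-1) in round i
    while step < n:
        old = x[:]                 # assumption: all reads happen before any write
        # Active positions are those with index >= step.
        x = [old[p] + old[p - step] if p >= step else old[p]
             for p in range(n)]
        step *= 2
    return x
```

With 8 elements the loop runs 3 times (steps 1, 2 and 4), compared with 7 additions in sequence for the last element of a serial scan.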

14

High Performance Fortran

  • Data parallel language (Many similarities to CM FORTRAN)
  • For SIMD and MIMD (NUMA) machines
  • Based on F90 (F77)

– Array operations
– User defined data types
– Recursion and dynamic memory allocation
– Pointers
– Control of data distribution
– Parallel constructs

  • Data mapping directives
  • FORALL statements and constructs
  • INDEPENDENT directive, etc

[Figure: compilation flow: HPF -> F77 + message passing -> SPMD executable file]

15

The PROCESSORS directive

  • Declares an abstract processor arrangement onto which data is mapped
  • Each element of this arrangement corresponds to a node on the physical machine
  • The declarations are often parametrized with the intrinsic function NUMBER_OF_PROCESSORS

!HPF$ PROCESSORS p(NUMBER_OF_PROCESSORS()/2, 2)

The directive starts with !HPF$, so an ordinary Fortran compiler treats it as a comment.

16

The DISTRIBUTE directive

  • Controls the mapping of data onto processors
  • BLOCK distribution

– Each processor stores a consecutive block of the array

REAL a(16)
!HPF$ PROCESSORS p(4)
!HPF$ DISTRIBUTE a(BLOCK) ONTO p

P1  P2  P3  P4
 1   5   9  13
 2   6  10  14
 3   7  11  15
 4   8  12  16

  • BLOCK, BLOCK distribution

– For multidimensional arrays, separate blocking in each dimension.

REAL a(7,7)
!HPF$ PROCESSORS p(2,2)
!HPF$ DISTRIBUTE a(BLOCK, BLOCK) ONTO p

17

The DISTRIBUTE directive

  • CYCLIC distribution

REAL a(16)
!HPF$ PROCESSORS p(4)
!HPF$ DISTRIBUTE a(CYCLIC) ONTO p

P1  P2  P3  P4
 1   2   3   4
 5   6   7   8
 9  10  11  12
13  14  15  16

  • CYCLIC,CYCLIC distribution

REAL a(7,7)
!HPF$ PROCESSORS p(2,2)
!HPF$ DISTRIBUTE a(CYCLIC, CYCLIC) ONTO p

18

The DISTRIBUTE directive

  • CYCLIC,BLOCK distribution

– It is not necessary to have the same distribution in all dimensions

REAL a(7,7)
!HPF$ PROCESSORS p(2,2)
!HPF$ DISTRIBUTE a(CYCLIC, BLOCK) ONTO p

or, alternatively:

!HPF$ DISTRIBUTE a(BLOCK, CYCLIC) ONTO p
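
The owner of an element under the two distributions can be computed directly. The sketch below uses 1-based indices as in Fortran, and the function names `block_owner` and `cyclic_owner` are hypothetical. It assumes the usual HPF default block size of ceil(n/p). For a multidimensional array the same function is applied separately in each dimension.

```python
import math


def block_owner(i, n, p):
    """Processor (1..p) that owns element i (1..n) under BLOCK."""
    block = math.ceil(n / p)       # assumed block size: ceil(n / p)
    return (i - 1) // block + 1


def cyclic_owner(i, p):
    """Processor (1..p) that owns element i under CYCLIC."""
    return (i - 1) % p + 1


if __name__ == "__main__":
    n, p = 16, 4
    print([block_owner(i, n, p) for i in range(1, n + 1)])
    print([cyclic_owner(i, p) for i in range(1, n + 1)])
```

For a(7,7) on p(2,2) with BLOCK in a dimension, indices 1 to 4 land on the first processor coordinate and 5 to 7 on the second.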


19

The ALIGN directive

  • Describes mapping relations between interacting objects
  • Both objects are allocated on the same processor

REAL a(6), b(6)
!HPF$ ALIGN a(I) WITH b(I)

a(1)  a(2)  a(3)  a(4)  a(5)  a(6)
b(1)  b(2)  b(3)  b(4)  b(5)  b(6)

REAL a(4,4), b(4,10)
!HPF$ ALIGN a(I,J) WITH b(I, 2*J+1)

Element of b that each a(I,J) is aligned with:

        J=1     J=2     J=3     J=4
I=1   b(1,3)  b(1,5)  b(1,7)  b(1,9)
I=2   b(2,3)  b(2,5)  b(2,7)  b(2,9)
I=3   b(3,3)  b(3,5)  b(3,7)  b(3,9)
I=4   b(4,3)  b(4,5)  b(4,7)  b(4,9)

20

Example: Simple Matrix Multiplication

PROGRAM ABmult
  INTEGER, PARAMETER :: N = 100
  INTEGER, DIMENSION (N,N) :: A, B, C
  INTEGER :: i, j
!HPF$ PROCESSORS SQ(2,2)
!HPF$ DISTRIBUTE C(BLOCK,BLOCK) ONTO SQ
!HPF$ ALIGN A(i,*) WITH C(i,*)
  ! replicate copies of row A(i,*) onto processors which compute C(i,j)
!HPF$ ALIGN B(*,j) WITH C(*,j)
  ! replicate copies of column B(*,j) onto processors which compute C(i,j)
  A = 1
  B = 2
  C = 0
  DO i = 1, N
    DO j = 1, N
      ! All the work is local due to ALIGNs
      C(i,j) = DOT_PRODUCT(A(i,:), B(:,j))
    END DO
  END DO
END

[Figure: the matrices C, A and B]

21

The FORALL statement

  • Generalization of array assignment and masked array assignment (NOT a loop)
  • Single-statement FORALL

– FORALL (index, mask) forall-assignment
– Equivalent to array assignment in F90
– For every index, evaluate the mask
– Compute the right-hand side for the unmasked values
– Carry out the assignments to the left-hand side

  • Multiple-statement FORALL semantics

– FORALL (index, mask)
      forall-body-list
  END FORALL
– A forall-body can be a FORALL, a WHERE, or an ordinary forall-assignment
– Abbreviation of a series of single-statement FORALLs
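
Why FORALL is not a loop can be seen by comparing it with a DO loop on the same assignment, a(i) = a(i-1). The Python sketch below is only an illustration with 0-based lists, and the function names are hypothetical. The FORALL version evaluates every right-hand side before any assignment is made, while the DO version reads values already overwritten by earlier iterations.

```python
def forall_shift(a):
    """FORALL (i = 2:n) a(i) = a(i-1): all right-hand sides first, then assign."""
    rhs = [a[i - 1] for i in range(1, len(a))]   # step 1: compute every RHS
    result = a[:]
    result[1:] = rhs                             # step 2: carry out the assignments
    return result


def do_loop_shift(a):
    """DO i = 2, n; a(i) = a(i-1); END DO: iterations execute in order."""
    result = a[:]
    for i in range(1, len(result)):
        result[i] = result[i - 1]                # sees the value written just before
    return result
```

Starting from [1, 2, 3, 4], the FORALL semantics give [1, 1, 2, 3], whereas the sequential loop propagates the first element and gives [1, 1, 1, 1].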

22

The INDEPENDENT directive

  • States that no iteration affects any other iteration in any way

– Is used to give the compiler extra information about the execution of a DO or FORALL

  • Applied to DO: states that there are no loop-carried dependencies
  • Applied to FORALL: states that no index points to an address used by any other object

!HPF$ INDEPENDENT
DO I = 1, N
  A(INDX(I)) = B(I)
END DO
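
The compiler cannot see the contents of INDX, which is why the programmer has to assert independence. A minimal Python sketch of what the assertion means for this loop follows, with 0-based indices and the hypothetical helper names `scatter` and `is_independent`. If INDX contains no repeated value, every iteration order gives the same result. If it contains a duplicate, the result depends on the order, and the directive would be a false promise.

```python
def scatter(b, indx, size, order):
    """Execute A(INDX(I)) = B(I) for the iterations I in the given order."""
    a = [0] * size
    for i in order:
        a[indx[i]] = b[i]
    return a


def is_independent(indx):
    """True if no two iterations write to the same element of A."""
    return len(set(indx)) == len(indx)


b = [10, 20, 30]
forward, backward = [0, 1, 2], [2, 1, 0]

safe = [2, 0, 1]        # a permutation: iterations never collide
unsafe = [0, 0, 1]      # iterations 0 and 1 both write A(0)
```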

23

The INDEPENDENT directive

FORALL (I=1:3)
  L1(I) = R1(I)
  L2(I) = R2(I)
END FORALL

!HPF$ INDEPENDENT
FORALL (I=1:3)
  L1(I) = R1(I)
  L2(I) = R2(I)
END FORALL

[Figure: execution timelines for the two versions. The plain FORALL has synchronization points (Sync) between the evaluation of R1(1:3), the assignment of L1(1:3), the evaluation of R2(1:3) and the assignment of L2(1:3). The INDEPENDENT version lets these operations overlap, which shows the time gained. Assume that R1(3) and R2(1) take longer due to communication.]

24

Game of LIFE

INTEGER LIFE(64, 64), NCOUNT(64, 64)
!HPF$ ALIGN LIFE WITH NCOUNT
!HPF$ DISTRIBUTE LIFE(BLOCK, BLOCK)
! ..... INIT LIFE .....
NCOUNT = 0
DO M = 1, NUMBER_OF_GENERATIONS
  FORALL (I=2:63, J=2:63)
    NCOUNT(I,J) = SUM(LIFE(I-1:I+1,J-1:J+1)) - LIFE(I,J)
  END FORALL
  ! Create next generation
  WHERE ((LIFE.EQ.0).AND.(NCOUNT.EQ.3))
    LIFE = 1
  END WHERE
  ! A live cell dies unless it has 2 or 3 live neighbours
  WHERE ((LIFE.EQ.1).AND.(NCOUNT.NE.2).AND.(NCOUNT.NE.3))
    LIFE = 0
  END WHERE
END DO
END
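
One generation of the same update can be sketched in Python to check the rules on a small grid. The function name `life_step` is hypothetical. As in the HPF version, neighbour counts are computed only for interior cells, so border cells keep a count of 0.

```python
def life_step(life):
    """One Game of Life generation on a grid of 0/1 values (list of lists)."""
    rows, cols = len(life), len(life[0])
    ncount = [[0] * cols for _ in range(rows)]
    # FORALL over the interior: live neighbours, not counting the cell itself.
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            ncount[i][j] = sum(life[a][b]
                               for a in range(i - 1, i + 2)
                               for b in range(j - 1, j + 2)) - life[i][j]
    new = [row[:] for row in life]
    for i in range(rows):
        for j in range(cols):
            if life[i][j] == 0 and ncount[i][j] == 3:
                new[i][j] = 1                     # birth
            elif life[i][j] == 1 and ncount[i][j] not in (2, 3):
                new[i][j] = 0                     # death
    return new


blinker = [[0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0],
           [0, 1, 1, 1, 0],
           [0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0]]
```

The `blinker` pattern is a convenient check: it flips between a horizontal and a vertical bar and returns to its starting state after two generations.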


25

Summary – Data Parallelism

  • Scalable
  • Data parallel programming is simpler than message passing
  • Data parallel languages

– C*, CM Fortran, HPF

  • SIMD style: single program, single instruction flow
  • SPMD style: single program, multiple data

– different instruction flows locally

  • Machines: SIMD (CM2, MasPar, ...) or MIMD with SPMD programming

26

Lecture 12b: Dependency analysis for automatic vectorization and parallelization of serial programs

27

Automatic Parallelization

  • Loops are the largest source for parallelism
  • Loop parallelization

– Different iterations on different processors – Different tasks within an iteration on different processors

  • Vectorization/Pipelining

– Pipeline: breaks instructions down into substeps that are overlapped
– Vector: the pipelined instructions are carried out on a vector register of fixed length

28

Content

  • Vector hardware
  • Data dependency analysis

– dependency graphs
– dependency tests

  • Vectorization

– standard transformations
– vector code generation

  • Parallelization

– loop scheduling

29

Vector Supercomputer

(Register-to-Register)

[Figure: block diagram of a register-to-register vector supercomputer: host computer, mass storage and I/O; main memory (program & data); scalar control unit with scalar pipes; vector control unit with vector registers and vector pipes; paths for scalar instructions, vector instructions, scalar data and vector data]

30

Vectorization

  • Transformation of a loop to a sequence of vector instructions

Scalar

do I = 1, N
  C(I) = A(I) + B(I)
end do

Vector

C[1:N] = A[1:N] + B[1:N]

Vector instructions

      L      G0, N         Load vector length N
      LA     G3, C         Load addr for C
      LA     G2, B         Load addr for B
      LA     G1, A         Load addr for A
LOOP  VLVCU  G0            Set up loop for 128 elements
      VLD    V1, G1        Load 128 A in V1
      VLD    V2, G2        Load 128 B in V2
      VAD    V3, V1, V2    A + B -> V3
      VSTD   V3, G3        V3 -> C
      BC     2, LOOP       If more elements, loop
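
The loop structure of the vector code, often called strip mining, can be sketched in Python. The slices stand in for vector registers of 128 elements, and the names `vector_add` and `VECTOR_LENGTH` are hypothetical. Each pass of the outer loop corresponds to one pass through the LOOP label in the assembly listing.

```python
VECTOR_LENGTH = 128   # register length taken from the listing above


def vector_add(a, b, vector_length=VECTOR_LENGTH):
    """C = A + B processed in sections of at most vector_length elements."""
    n = len(a)
    c = [0] * n
    sections = 0
    for start in range(0, n, vector_length):      # BC: loop while elements remain
        end = min(start + vector_length, n)       # VLVCU: length of this section
        v1 = a[start:end]                         # VLD V1
        v2 = b[start:end]                         # VLD V2
        v3 = [x + y for x, y in zip(v1, v2)]      # VAD V3, V1, V2
        c[start:end] = v3                         # VSTD V3
        sections += 1
    return c, sections
```

For 300 elements the loop runs three times, with sections of 128, 128 and 44 elements.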


31

Speedup, Expected Speedup

Scalar

do I = 1, N
  C(I) = A(I) + B(I)
end do

Instruction                    Cycles
Load A(I) into register          1
Load B(I) into register          1
ADD A(I) + B(I)                  3
Store C(I) from register         1
Decrement counter by 1           1
Total                            7

  • Vector length 128 -> 7*128 cycles

Vector

C[1:N] = A[1:N] + B[1:N]

Instruction                    Cycles
Load A(1:128)                    128
Load B(1:128)                    128
ADD A(1:128) + B(1:128)          128
Store C(1:128)                   128
Total                            4*128

  • Speedup = (7*128)/(4*128) = 7/4 = 1.75
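
The arithmetic behind the expected speedup can be written out as a small Python sketch. The function names are hypothetical, and the cycle counts are the ones assumed on the slide: 7 cycles per scalar iteration, and 4 vector instructions that each take one cycle per element. Start-up costs of the vector pipes are ignored.

```python
def scalar_cycles(n, cycles_per_iteration=7):
    """Cycles for n iterations of the scalar loop."""
    return cycles_per_iteration * n


def vector_cycles(n, vector_instructions=4):
    """Cycles for n elements when each vector instruction costs n cycles."""
    return vector_instructions * n


def expected_speedup(n):
    return scalar_cycles(n) / vector_cycles(n)
```

Under these assumptions the ratio is independent of the vector length, so `expected_speedup(128)` and `expected_speedup(1)` both evaluate to 1.75.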

32

What can be Vectorized?

  • Only DO (for) loops can be vectorized
  • Only one loop in a loop nest can be vectorized
  • Vectorizable loops may NOT contain

– Data dependencies
– Jumps in/out, entry, stop
– Loop variables other than integers
– I/O statements (side effects)
– Calls to external subprograms (side effects)

  • In some cases the compiler can rewrite the loop and then vectorize it partially

33

Different Types of Dependencies

  • True/flow dependence: a variable is defined before it is used (DEF, USE)

S1: A = B + C
S2: D = A + 2
S3: E = A * 3
(S1 δt S2, S1 δt S3)

  • Anti dependence: a variable is used before it is defined (USE, DEF)

S1: A = B + C
S2: B = X * 3
(S1 δa S2)

  • Output dependence: a variable is assigned a value several times (DEF, DEF)

S1: A = B + C
S2: A = X * 3
(S1 δo S2)

34

Data Dependency

  • Execution order

S(i, j, k) << S(i', j', k') iff (i, j, k) < (i', j', k')

  • Input & output sets

DEF(S) = the set of all variables defined by the statement S
USE(S) = the set of all variables used by the statement S

  • There is a data dependency between two statements S and T (S δ T) if

– S << T
– there exists a variable v such that

  • v is in both DEF(S) and USE(T), or
  • v is in both USE(S) and DEF(T), or
  • v is in both DEF(S) and DEF(T)

– there does not exist a statement SI such that

  • S << SI << T and v is in DEF(SI)
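
The definition translates almost directly into code. In the Python sketch below each statement is a hypothetical triple (name, DEF set, USE set), listed in execution order, and the function name `dependencies` is also hypothetical. The three set intersections give flow, anti and output dependences, and a dependence is dropped when a statement in between redefines the variable.

```python
def dependencies(statements):
    """Return (source, kind, sink) for every dependence S δ T with S << T."""
    found = []
    for s in range(len(statements)):
        s_name, s_def, s_use = statements[s]
        for t in range(s + 1, len(statements)):
            t_name, t_def, t_use = statements[t]
            between = statements[s + 1:t]
            candidates = (
                [("flow", v) for v in sorted(s_def & t_use)] +
                [("anti", v) for v in sorted(s_use & t_def)] +
                [("output", v) for v in sorted(s_def & t_def)]
            )
            for kind, v in candidates:
                # No statement SI with S << SI << T may define v.
                if not any(v in mid_def for _, mid_def, _ in between):
                    found.append((s_name, kind, t_name))
    return found


# S1: A = B + C, S2: D = A + 2, S3: E = A * 3
flow_example = [("S1", {"A"}, {"B", "C"}),
                ("S2", {"D"}, {"A"}),
                ("S3", {"E"}, {"A"})]
```

Running `dependencies(flow_example)` reports the two flow dependences S1 δt S2 and S1 δt S3.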

35

Data Dependency in Loops

  • Independent loops

– no iteration depends on data from any other iteration

  • Dependent loops

– statement S is dependent on statement Sk if the execution of Sk must occur before the execution of S

  • Loop carried dependency

– if the dependency depends on a loop index

  • Loop independent dependency

– if the dependency does not depend on a loop index

36

Basic Concept

  • Iteration vector

– points to a specific iteration of a loop nest, i = (i1, i2, ..., in), where i1 is the index of the outermost loop

  • Distance vector

– the distance between two iteration vectors, i' - i

  • Dependency distance vectors

– if S and S’ are instances of statements in a loop nest and S(i) δ S’(i’) then the dependency distance vector dist(i, i’) = i’ - i

  • Dependency direction vectors

– the same as dependency distance vectors but only the direction is shown (<, =, >) corresponds to (+ , 0 , -)
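
A short Python sketch of the two vectors, with the hypothetical function names `distance_vector` and `direction_vector`. Iteration vectors are tuples with the outermost loop index first, the distance is computed as i' - i, and a positive, zero or negative component is shown as <, = or >.

```python
def distance_vector(source, sink):
    """dist(i, i') = i' - i, component by component."""
    return tuple(j - i for i, j in zip(source, sink))


def direction_vector(distance):
    """Map each distance component (+, 0, -) to (<, =, >)."""
    return tuple("<" if d > 0 else ">" if d < 0 else "=" for d in distance)
```

For a dependence from iteration (2, 3) to iteration (3, 2), `distance_vector((2, 3), (3, 2))` is (1, -1) and the direction vector is (<, >).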


37

Dependency Distance, Distance & Direction Vectors

Loop independent dependency

do i = 2, 99
S1: A(i) = B(i) + C(i)
S2: D(i) = A(i)
end do

i = 2: S1: A(2) = B(2) + C(2)
       S2: D(2) = A(2)
i = 3: S1: A(3) = B(3) + C(3)
       S2: D(3) = A(3)

DEF, USE -> S1 δt S2, distance 2-2 = 0, direction =

Loop carried dependency

do i = 2, 99
S1: A(i) = B(i) + C(i)
S2: D(i) = A(i-1)
end do

i = 2: S1: A(2) = B(2) + C(2)
       S2: D(2) = A(1)
i = 3: S1: A(3) = B(3) + C(3)
       S2: D(3) = A(2)

DEF, USE -> S1 δt S2, distance 3-2 = 1, direction <
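
The distances can also be found by brute force, simply by running through the iterations and recording which iteration wrote each element. The Python sketch below assumes loops of the form S1: A(i + write_offset) = ..., S2: D(i) = A(i + read_offset), and the function name `flow_distances` and its parameters are hypothetical.

```python
def flow_distances(write_offset, read_offset, lower=2, upper=99):
    """Distances of the flow dependences S1 δt S2 carried through array A."""
    written_in = {}            # element index -> iteration in which S1 wrote it
    distances = set()
    for i in range(lower, upper + 1):
        written_in[i + write_offset] = i          # S1 defines A(i + write_offset)
        source = written_in.get(i + read_offset)  # S2 uses A(i + read_offset)
        if source is not None:
            distances.add(i - source)             # sink iteration - source iteration
    return distances
```

`flow_distances(0, 0)` gives {0} for the loop independent case and `flow_distances(0, -1)` gives {1} for the loop carried case. Reading A(i+1) gives an empty set, because the element is used before it is defined, which is an anti dependence rather than a flow dependence.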

38

Representation of Data Dependency

  • Dependency graph

– directed graph G(V, E) where V is a set of statements, and E edges representing dependencies

  • Dependency cycles

– A chain of dependencies starting and ending at the same statement S

S1: A = B + E
S2: B = C
S3: C = A

V = {S1, S2, S3}
E = {(S1, S2), (S1, S3), (S2, S3)}

[Figure: dependency graph with the edges S1 δa S2, S1 δt S3 and S2 δa S3]
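
A dependency graph given as vertex and edge sets can be checked for dependency cycles with a depth-first search. This Python sketch uses the hypothetical function name `has_cycle`. It is one common way to do the check, not necessarily the method a particular compiler uses.

```python
def has_cycle(vertices, edges):
    """True if the directed graph G(V, E) contains a dependency cycle."""
    successors = {v: [] for v in vertices}
    for source, sink in edges:
        successors[source].append(sink)

    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {v: UNVISITED for v in vertices}

    def visit(v):
        state[v] = IN_PROGRESS
        for w in successors[v]:
            if state[w] == IN_PROGRESS:           # back edge: cycle found
                return True
            if state[w] == UNVISITED and visit(w):
                return True
        state[v] = DONE
        return False

    return any(state[v] == UNVISITED and visit(v) for v in vertices)


V = ["S1", "S2", "S3"]
E = [("S1", "S2"), ("S1", "S3"), ("S2", "S3")]
```

The graph of the example has no cycle. Adding a hypothetical edge (S3, S1) would create one.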

39

Loop Dependencies

do i = 2, 99
S1: A(i) = B(i) + C(i)
S2: D(i) = A(i-1)
end do

do i = 2, 99
S1: A(i) = B(i) + C(i)
S2: D(i) = A(i+1)
end do

do i = 2, 99
  do j = 2, 99
S1: A(i+1,j-1) = A(i,j) + C(i,j)
  end do
end do

40

Review Questions

  • Which dependencies exist in the code snippets on the previous slide? What are the direction vectors? What do the dependency graphs look like?