SIMD: Single Instruction Multiple Data
Parallelism through simultaneous operations on different data

Lecture 12: SIMD machines and data parallelism. Fine-grain parallelism; dependency analysis for automatic vectorization and parallelization of serial programs.


Part 1
• Systolic arrays
• Parallel SIMD machines (10k+ processors)
• Vector/pipeline units

Systolic Array
• A network of "processors", with memory around it
  – Performance comes from doing all computations on the data before it is stored back
• Often a hardware implementation solving one specific problem
  – Special topologies
  – Extends the FPU's instructions
  – Smart memory

SIMD Machine
• Front end
  – A normal von Neumann machine
  – Runs the application program and acts as host/controller for the processor array
• Processor array
  – Synchronous: every processor performs the same operation at the same time, or is idle
  – Small memory per processor
  – I/O
• Examples: ILLIAC IV, IBM GF11, MasPar, CM200 (Bellman, 16k processors)

Data Parallel Programming
• Idea: update the elements of an array at the same time
  – Apply one operation to each element of the array in parallel
  – Example: element < 0 ⇒ element := 1
• The user controls the placement of data on the processors
  – Minimize communication; keep all processors busy
• The work is divided between the programmer and the compiler
  – The programmer solves the problem in the data-parallel model: concentrating on structure and concepts at a high level, using collective operations on large data structures, and keeping data in large arrays with mapping information
  – The compiler maps the program onto a physical machine: it fills in all the details (gladly accepting hints from the user) and optimizes computations and communications

Building Blocks in Data Parallel Programming (a serial sketch of these follows below)
• Operations on whole arrays
• Methods to access parts of an array; operations can operate on these parts
• Reduction operations on arrays: produce one result from a combination of many array elements (sum, max, min, ...)
• Shift operations along the axes of multidimensional arrays
• Scan operations: prefix/suffix operations
• Generalized communication
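The building blocks are easiest to see as loops. Below is a minimal serial C sketch (our own illustration, not from the slides) of what an element-wise masked update, a reduction, a shift, and a scan compute; on a SIMD machine each loop body would execute on all elements at once.

    #include <stdio.h>

    #define N 8

    /* Serial reference for the data-parallel building blocks above.
       On a SIMD machine each loop body runs on all N elements at once;
       here the loops just make the element-wise semantics explicit. */
    int main(void) {
        double a[N] = {-2, 3, -1, 4, 5, -6, 7, 8};
        double shifted[N], prefix[N];
        double sum = 0.0;

        /* Element-wise masked update: element < 0 => element := 1 */
        for (int i = 0; i < N; i++)
            if (a[i] < 0) a[i] = 1;

        /* Reduction: combine all elements into one result (here: sum) */
        for (int i = 0; i < N; i++)
            sum += a[i];

        /* Shift along axis 0: each element receives its left neighbour */
        for (int i = 0; i < N; i++)
            shifted[i] = a[(i + N - 1) % N];   /* circular shift */

        /* Scan (prefix operation): prefix[i] = a[0] + ... + a[i] */
        prefix[0] = a[0];
        for (int i = 1; i < N; i++)
            prefix[i] = prefix[i - 1] + a[i];

        printf("sum = %g, prefix[N-1] = %g, shifted[0] = %g\n",
               sum, prefix[N - 1], shifted[0]);
        return 0;
    }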

C*
• Supports broadcast, reduction, and interprocessor communication
• Parallel variables have a type and a shape
• A shape defines the number of elements and their organization; the 1st dimension is axis 0, the 2nd is axis 1, etc.:

    shape [16384] employees;    /* 1-D */
    shape [512][512] image;     /* 2-D */

• Left-indexing: indexing that refers to parallel variables

    int: employees employee_id;

  [2]employee_id refers to the 3rd element of employee_id
• Parallel variables can have structured types:

    struct date {
        int month;
        int day;
        int year;
    };
    struct date: employees birthday;

  Each element of the parallel variable birthday contains a date; birthday.month refers to all 16384 month fields in birthday.

C* - Parallel Operations
• Overloading
  – x = y + z adds y and z in each position of the shape
• New operations (a, b scalar or parallel)
  – a <? b : minimum of the two variables
  – a >? b : maximum of the two variables
• Selection of shape (with):

    shape [16384] numbers;
    int: numbers x, y, z;
    with (numbers)
        x = y + z;

C* - Setting the Context
• where limits the positions where the operation is performed:

    with (numbers)
        where (z != 0)   /* sets active positions */
            x = y / z;
        else             /* reverses active positions */
            x = y;

• everywhere makes all positions active, independently of earlier context

C* - Communication
• Grid communication: pcoord (cf. mynode) gives the element's index along an axis of the shape
• Example: send the value of source to the element of dest one position higher up:

    [pcoord(0) + 1]dest = source;

• Dot (.) is sometimes used instead of pcoord:

    [. + 1]dest = source;
    [. + 1][. - 2]dest = source;

Compute Pi in C*
Pi ≈ (1/N) · Σ_{i=0..N-1} 4/(1 + x_i·x_i), where x_i = (i + 1/2)/N
(a plain serial C version follows below)

    #define N 400000
    shape [N] chunk;
    double: chunk x;

    main() {
        double sum;
        double width;
        width = 1.0 / N;
        with (chunk) {
            x = (pcoord(0) + 0.5) * width;
            sum = (+= (4.0 / (1.0 + x*x)));   /* parallel reduction */
        }
        sum = sum * width;
        printf("Estimate of Pi = %14.12f\n", sum);
    }
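For readers without access to a C* compiler, here is a plain serial C version (our translation, not from the slides) of the Pi program above: the parallel variable x becomes a loop variable, and the (+= ...) reduction becomes an ordinary accumulation.

    #include <stdio.h>

    #define N 400000

    /* Serial C equivalent of the C* Pi program: midpoint-rule
       integration of 4/(1+x*x) over [0,1]. pcoord(0) maps to i. */
    int main(void) {
        double width = 1.0 / N;
        double sum = 0.0;
        for (int i = 0; i < N; i++) {
            double x = (i + 0.5) * width;
            sum += 4.0 / (1.0 + x * x);
        }
        sum *= width;
        printf("Estimate of Pi = %14.12f\n", sum);
        return 0;
    }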

High Performance Fortran (HPF)
• Data parallel language (many similarities to CM Fortran)
• For SIMD and MIMD (NUMA) machines
• Based on F90 (which in turn extends F77)
  – Array operations
  – User-defined data types
  – Recursion and dynamic memory allocation
  – Pointers
  – Control of data distribution
  – Parallel constructs
• Data mapping directives
• FORALL statements and constructs
• The INDEPENDENT directive, etc.
[Figure: compilation pipeline; an HPF source program is compiled to an SPMD program (F77 + message passing) and from there to an executable file]

Compute Partial Sums in an Array (C*)

    #define N 1024
    shape [N] ArrayShape;                  /* select shape */
    int: ArrayShape x;

    main() {
        int i;
        with (ArrayShape)                  /* active positions */
            for (i = 1; i <= log2(N); i++)
                where (pcoord(0) >= pow(2, i-1))
                    x += [pcoord(0) - pow(2, i-1)]x;   /* left indexing */
    }

(In step i, every position at least 2^(i-1) from the left adds in the value 2^(i-1) positions below it; after log2(N) steps, x holds the prefix sums.)

The PROCESSORS directive
• Declares an abstract processor arrangement onto which data is mapped
• Each element of this arrangement corresponds to a node of the physical machine
• The declarations are often parametrized with the intrinsic function NUMBER_OF_PROCESSORS:

    !HPF$ PROCESSORS p(4)
    !hpf$ processors p(NUMBER_OF_PROCESSORS()/2, 2)

  (Directives are written as structured comments, so compilers without HPF support can ignore them.)

The DISTRIBUTE directive
• Controls the mapping of data onto processors (a small C sketch of these owner mappings follows below)
• BLOCK distribution: each processor stores a consecutive block of the array

    REAL a(16)
    !HPF$ PROCESSORS p(4)
    !HPF$ DISTRIBUTE a(BLOCK) ONTO p

  P1: a(1:4), P2: a(5:8), P3: a(9:12), P4: a(13:16)

• CYCLIC distribution: elements are dealt out round-robin

    REAL a(16)
    !HPF$ PROCESSORS p(4)
    !HPF$ DISTRIBUTE a(CYCLIC) ONTO p

  P1: a(1,5,9,13), P2: a(2,6,10,14), P3: a(3,7,11,15), P4: a(4,8,12,16)

• BLOCK,BLOCK distribution: for multidimensional arrays, separate blocking in each dimension

    REAL a(7,7)
    !HPF$ PROCESSORS p(2,2)
    !HPF$ DISTRIBUTE a(BLOCK, BLOCK) ONTO p

• It is not necessary to use the same distribution in all dimensions:

    REAL a(7,7)
    !HPF$ PROCESSORS p(2,2)
    !HPF$ DISTRIBUTE a(CYCLIC, BLOCK) ONTO p
    !HPF$ DISTRIBUTE a(CYCLIC, CYCLIC) ONTO p
    !HPF$ DISTRIBUTE a(BLOCK, CYCLIC) ONTO p
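The BLOCK and CYCLIC layouts can be checked with a few lines of C. The sketch below (the helper names block_owner and cyclic_owner are ours, chosen for illustration) reproduces the ownership tables above for a 16-element array on 4 processors.

    #include <stdio.h>

    /* Which abstract processor owns element i (0-based) of an array of
       length n distributed over p processors? These formulas reproduce
       the BLOCK and CYCLIC tables above. */
    static int block_owner(int i, int n, int p) {
        int bs = (n + p - 1) / p;   /* ceiling block size */
        return i / bs;              /* consecutive blocks per processor */
    }

    static int cyclic_owner(int i, int p) {
        return i % p;               /* elements dealt out round-robin */
    }

    int main(void) {
        int n = 16, p = 4;
        printf("elem  BLOCK  CYCLIC\n");
        for (int i = 0; i < n; i++)
            printf("%4d  P%d     P%d\n",
                   i + 1, block_owner(i, n, p) + 1, cyclic_owner(i, p) + 1);
        return 0;
    }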

The ALIGN directive
• Describes mapping relations between interacting objects
• Aligned objects are allocated on the same processor

    REAL a(6), b(6)
    !HPF$ ALIGN a(I) WITH b(I)

  a(1)..a(6) are stored with b(1)..b(6)

    REAL a(4,4), b(4,10)
    !HPF$ ALIGN a(I,J) WITH b(I, 2*J+1)

  a(I,1) ↔ b(I,3), a(I,2) ↔ b(I,5), a(I,3) ↔ b(I,7), a(I,4) ↔ b(I,9)

Example: Simple Matrix Multiplication (C = A * B)

    PROGRAM ABmult
      INTEGER, PARAMETER :: N = 100
      INTEGER, DIMENSION (N,N) :: A, B, C
      INTEGER :: i, j
    !HPF$ PROCESSORS SQ(2,2)
    !HPF$ DISTRIBUTE C(BLOCK,BLOCK) ONTO SQ
    !HPF$ ALIGN A(i,*) WITH C(i,*)
    ! replicate copies of row A(i,:)
    ! onto processors which compute C(i,j)
    !HPF$ ALIGN B(*,j) WITH C(*,j)
    ! replicate copies of column B(:,j)
    ! onto processors which compute C(i,j)
      A = 1
      B = 2
      C = 0
      DO i = 1, N
        DO j = 1, N
          ! All the work is local due to the ALIGNs
          C(i,j) = DOT_PRODUCT(A(i,:), B(:,j))
        END DO
      END DO
    END

The FORALL statement
• A generalization of array assignment and masked array assignment (NOT a loop; a C sketch contrasting the two closes this section)
• Single-statement FORALL: FORALL (index, mask) forall-assignment
  – Equivalent to array assignment in F90
  – For every index, check the mask
  – Compute the right-hand side for all unmasked values
  – Then carry out the assignments to the left-hand side
• Multiple-statement FORALL semantics: FORALL (index, mask) forall-body-list END FORALL
  – The forall-body can contain FORALL, WHERE, or ordinary forall-assignments
  – An abbreviation of a series of single-statement FORALLs

The INDEPENDENT directive
• States that no iteration affects any other iteration in any way
  – Used to give the compiler extra information about the execution of a DO or FORALL
• Applied to a DO: states that there are no loop-carried dependencies

    !HPF$ INDEPENDENT
    DO I = 1, N
      A(INDX(I)) = B(I)
    END DO

• Applied to a FORALL: states that no index points to an address used by any other iteration, so the synchronization between forall-assignments can be dropped:

    FORALL (I=1:3)
      L1(I) = R1(I)
      L2(I) = R2(I)
    END FORALL

  versus

    !HPF$ INDEPENDENT
    FORALL (I=1:3)
      L1(I) = R1(I)
      L2(I) = R2(I)
    END FORALL

[Figure: execution timelines. Without INDEPENDENT there is a synchronization after all the R1/L1 assignments and another after all the R2/L2 assignments; with INDEPENDENT the iterations overlap. Assuming that R1(3) and R2(1) take longer due to communication, time is gained.]

Game of LIFE (a serial C cross-check follows the code)

    INTEGER LIFE(64, 64), NCOUNT(64, 64)
    !HPF$ ALIGN LIFE WITH NCOUNT
    !HPF$ DISTRIBUTE LIFE(BLOCK, BLOCK)
    ..... INIT LIFE .....
    NCOUNT = 0
    DO M = 1, NUMBER_OF_GENERATIONS
      FORALL (I=2:63, J=2:63)
        NCOUNT(I,J) = SUM(LIFE(I-1:I+1, J-1:J+1)) - LIFE(I,J)
      END FORALL
      ! Create next generation
      WHERE ((LIFE.EQ.0) .AND. (NCOUNT.EQ.3))
        LIFE = 1
      END WHERE
      WHERE ((LIFE.EQ.1) .AND. (NCOUNT.NE.2) .AND. (NCOUNT.NE.3))
        LIFE = 0
      END WHERE
    END DO
    END
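As a cross-check of the WHERE logic above, here is a serial C version of one generation (our own sketch; the glider test pattern is our choice, not from the slide): count the eight neighbours of each interior cell, then apply birth on exactly three neighbours and death on anything other than two or three.

    #include <stdio.h>

    #define SZ 64

    /* One Game of Life generation, mirroring the HPF code above:
       phase 1 computes all neighbour counts, phase 2 updates all
       cells, matching the FORALL-then-WHERE structure. */
    static void step(int life[SZ][SZ]) {
        int ncount[SZ][SZ] = {{0}};

        for (int i = 1; i < SZ - 1; i++)          /* interior cells, as in */
            for (int j = 1; j < SZ - 1; j++) {    /* FORALL (I=2:63,J=2:63) */
                int s = 0;
                for (int di = -1; di <= 1; di++)
                    for (int dj = -1; dj <= 1; dj++)
                        s += life[i + di][j + dj];
                ncount[i][j] = s - life[i][j];    /* exclude the cell itself */
            }

        for (int i = 0; i < SZ; i++)
            for (int j = 0; j < SZ; j++) {
                if (life[i][j] == 0 && ncount[i][j] == 3)
                    life[i][j] = 1;               /* birth */
                else if (life[i][j] == 1 &&
                         ncount[i][j] != 2 && ncount[i][j] != 3)
                    life[i][j] = 0;               /* death */
            }
    }

    int main(void) {
        static int life[SZ][SZ];
        /* a glider as a small test pattern */
        life[2][3] = life[3][4] = life[4][2] = life[4][3] = life[4][4] = 1;
        step(life);
        int alive = 0;
        for (int i = 0; i < SZ; i++)
            for (int j = 0; j < SZ; j++)
                alive += life[i][j];
        printf("alive after one generation: %d\n", alive);  /* still 5 */
        return 0;
    }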

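Finally, the point that FORALL is NOT a loop can be demonstrated in a few lines of C (our sketch): FORALL evaluates every right-hand side before performing any assignment, so FORALL (I=2:N) A(I) = A(I-1) shifts the array one step, while the corresponding DO loop propagates A(1) everywhere.

    #include <stdio.h>

    #define N 6

    /* FORALL semantics vs. DO-loop semantics for A(2:N) = A(1:N-1).
       FORALL evaluates all right-hand sides first, then assigns;
       a DO loop interleaves the two, so the results differ. */
    int main(void) {
        int loop[N]   = {1, 2, 3, 4, 5, 6};
        int forall[N] = {1, 2, 3, 4, 5, 6};
        int rhs[N];

        /* DO-loop semantics: each assignment sees earlier ones */
        for (int i = 1; i < N; i++)
            loop[i] = loop[i - 1];        /* propagates loop[0] everywhere */

        /* FORALL semantics, phase 1: evaluate all RHS values ... */
        for (int i = 1; i < N; i++)
            rhs[i] = forall[i - 1];
        /* ... phase 2: carry out all the assignments */
        for (int i = 1; i < N; i++)
            forall[i] = rhs[i];           /* a simple shift right */

        for (int i = 0; i < N; i++)
            printf("%d/%d ", loop[i], forall[i]);
        printf("\n");                     /* prints 1/1 1/1 1/2 1/3 1/4 1/5 */
        return 0;
    }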