SIMD: Single Instruction, Multiple Data
Parallelism through simultaneous operations on different data

Lecture 12: SIMD machines & data parallelism; fine-grain parallelism; dependency analysis for automatic vectorizing and parallelizing of serial programs

Part 1
• Systolic arrays
• Parallel SIMD machines
  – 10k++ processors
• Vector/Pipeline units

Systolic Array
• Network of "processors", memory around
  – Performance comes from doing all computations before restoring data to memory
• Often hardware implementations solving one problem
  – Special topologies
  – Extends the FPU's instructions
  – Smart memory
  – I/O
[Figure labels: Host, Controller, Memory]

SIMD Machine
• Front End
  – Normal von Neumann
  – Runs the application program
• Processor array
  – Synchronous
  – The same operation at the same time or idle
  – Small memory/processor
• Example
  – ILLIAC IV, IBM GF 11, Maspar, CM200 (Bellman 16k)

Building Blocks in Data Parallel Programming
• The user controls the placing of data on processors
  – Minimize communication; keep all processors busy
• Operations on whole arrays
  – Apply one operation on each element in the array in parallel
• Methods to access parts of an array
• Operations can operate on these parts
  – Example: element < 0 ⇒ element := 1
• Reduction operations on arrays
  – produce a result from a combination of many array elements: sum, max, min, ...
• Shift operations along the axis on multidimensional arrays
• Scan operations
  – prefix/suffix operations
• Generalized communication

Data Parallel Programming
• Idea: update the elements of an array at the same time
• Divides the work between the programmer and the compiler
• The programmer solves the problem in their model
  – Concentrates on structure and concepts on a high level
  – Collective operations on large data structures
  – Keeps data in large arrays with mapping information
• The compiler maps the program on a physical machine
  – Fills in all the details (gladly receives hints from the user)
  – Optimizes computations and communications
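The building blocks listed above can be imitated sequentially to make their meaning concrete. The following is only a sketch in plain Python, not code from the lecture: the helper names masked_assign, shift and prefix_scan are hypothetical, and each list comprehension stands in for what a SIMD machine would do in one parallel step.

```python
from functools import reduce
from operator import add


def masked_assign(values, predicate, new_value):
    """Masked update: every element that satisfies predicate becomes new_value."""
    return [new_value if predicate(v) else v for v in values]


def shift(values, offset, fill=0):
    """Shift along the axis: result[i] = values[i - offset]; fill where nothing arrives."""
    n = len(values)
    return [values[i - offset] if 0 <= i - offset < n else fill for i in range(n)]


def prefix_scan(values, op=add):
    """Inclusive prefix scan that doubles the stride in every step."""
    result = list(values)
    stride = 1
    while stride < len(result):
        # All positions read the array as it was before this step,
        # which is what simultaneous execution on a SIMD machine gives.
        result = [
            op(result[i - stride], result[i]) if i >= stride else result[i]
            for i in range(len(result))
        ]
        stride *= 2
    return result


data = [3, -1, 4, -5, 2]
clipped = masked_assign(data, lambda v: v < 0, 1)  # element < 0 => element := 1
total = reduce(add, clipped)                       # reduction: sum
largest = max(clipped)                             # reduction: max
moved = shift(clipped, 1)                          # every value moves one position up
sums = prefix_scan(clipped)                        # scan: running sums
```

The scan needs about log2(n) steps instead of n, which is why it counts as a building block of its own rather than as a loop of additions.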

C*
• Supports broadcast, reduction and interprocessor communication
• shape defines the number of elements and their organization
    shape [16384] employees;     /* 1-D */
    shape [512][512] image;      /* 2-D */
• left-indexing: indexing that refers to parallel variables; the 1st dimension is axis 0, the 2nd is axis 1, etc.
    int: employees employee_id;
  [2]employee_id refers to the 3rd element in employee_id

C*
Parallel variables have type and shape
    shape [16384] employees;
    struct date {
        int month;
        int day;
        int year;
    };
    struct date: employees birthday;
Each element in the parallel variable birthday contains a date. birthday.month specifies all 16384 month fields in the parallel variable birthday.

C* - Parallel Operations
• Overloading
  – x = y + z (adds y and z in each position in the shape)
• New operations
  – a, b scalar or parallel
  – a <? b : min of two variables
  – a >? b : max of two variables
• Selection of shape (with)
    shape [16384] numbers;
    int: numbers x, y, z;
    with (numbers)
        x = y + z;

C* - setting the context
• where
  – Limits the area where the operation is performed
    with (numbers)
        where (z != 0)    /* sets active positions */
            x = y/z;
        else              /* reverses active positions */
            x = y;
• everywhere
  – all positions active independently of earlier context

C* - Communication
• Grid communication
  – pcoord (~mynode) gives my index along an axis in the shape
• Example: send the value of source to the element of dest that is one position higher up
    [pcoord(0) + 1]dest = source;
  dot (.) is sometimes used instead of pcoord
    [. + 1]dest = source;
    [. + 1][. - 2]dest = source;

Compute Pi in C*
    \pi \approx \frac{1}{N} \sum_{i=0}^{N-1} \frac{4}{1 + x_i^2}, \quad x_i = \frac{i + 1/2}{N}

    #define N 400000
    shape [N] chunk;
    double: chunk x;

    main()
    {
        double sum;
        double width;

        width = 1.0/N;
        with (chunk) {                         /* in parallel */
            x = (pcoord(0) + 0.5)*width;
            sum = (+= (4.0/(1.0 + x*x)));      /* reduction over all positions */
        }
        sum = sum * width;
        printf("Estimate of Pi = %14.12f\n", sum);
    }
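The Pi computation is the midpoint rule applied to the integral of 4/(1 + x^2) over [0, 1]. A minimal sequential sketch in Python, assuming the same N as the C* program (the function name estimate_pi is hypothetical): the list of midpoints plays the role of the parallel variable x, and the sum plays the role of the += reduction.

```python
import math


def estimate_pi(n):
    """Midpoint-rule estimate of pi using n sample points."""
    width = 1.0 / n
    # Mirrors x = (pcoord(0) + 0.5) * width: one midpoint per position.
    x = [(i + 0.5) * width for i in range(n)]
    # Mirrors the += reduction over all positions of the shape.
    total = math.fsum(4.0 / (1.0 + xi * xi) for xi in x)
    return total * width


estimate = estimate_pi(400000)
print("Estimate of Pi = %14.12f" % estimate)
```

In C* the two statements inside with (chunk) each run once for all 400000 positions; here they are ordinary loops, so only the arithmetic is comparable, not the execution model.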

High Performance Fortran
• Data parallel language (many similarities to CM FORTRAN)
• For SIMD and MIMD (NUMA) machines
• Based on F90 (F77)
  – Array operations
  – User defined data types
  – Recursion and dynamic memory allocation
  – Pointers
  – Control of data distribution
  – Parallel constructs
    • Data mapping directives
    • FORALL statements and constructs
    • INDEPENDENT directive, etc
[Figure: HPF source is translated to F77 + message passing (SPMD) and then to an exe-file]

Compute Partial Sums in Array (C*)
    #define N 1024
    shape [N] ArrayShape;
    int: ArrayShape x;
    int i;

    main()
    {
        with (ArrayShape)                                /* select shape */
            /* stride 2^(i-1) = 1, 2, 4, ..., N/2 */
            for (i = 1; (1 << (i-1)) < N; i++)
                where (pcoord(0) >= (1 << (i-1)))        /* active positions */
                    x += [pcoord(0) - (1 << (i-1))]x;    /* left indexing */
    }

The PROCESSORS directive
• Declares an abstract processor arrangement on which data is mapped
• Each element of this arrangement corresponds to a node on the physical machine
• The declarations are often parametrized with the intrinsic function NUMBER_OF_PROCESSORS
    !hpf$ processors p(NUMBER_OF_PROCESSORS()/2,2)
  The directive is written as a Fortran comment.

The DISTRIBUTE directive
• Controls the mapping of data onto processors
• BLOCK distribution
  – Each processor stores a consecutive block of the array
    REAL a(16)
    !HPF$ PROCESSORS p(4)
    !HPF$ DISTRIBUTE a(BLOCK) ONTO p

    P1  P2  P3  P4
     1   5   9  13
     2   6  10  14
     3   7  11  15
     4   8  12  16
• BLOCK, BLOCK distribution
  – For multidimensional arrays, separate blocking in each dimension
    REAL a(7,7)
    !HPF$ PROCESSORS p(2,2)
    !HPF$ DISTRIBUTE a(BLOCK, BLOCK) ONTO p

The DISTRIBUTE directive
• CYCLIC distribution
    REAL a(16)
    !HPF$ PROCESSORS p(4)
    !HPF$ DISTRIBUTE a(CYCLIC) ONTO p

    P1  P2  P3  P4
     1   2   3   4
     5   6   7   8
     9  10  11  12
    13  14  15  16
• CYCLIC, CYCLIC distribution
    REAL a(7,7)
    !HPF$ PROCESSORS p(2,2)
    !HPF$ DISTRIBUTE a(CYCLIC, CYCLIC) ONTO p

The DISTRIBUTE directive
• CYCLIC, BLOCK distribution
  – It is not necessary to have the same distribution in all dimensions
    REAL a(7,7)
    !HPF$ PROCESSORS p(2,2)
    !HPF$ DISTRIBUTE a(CYCLIC, BLOCK) ONTO p
    !HPF$ DISTRIBUTE a(BLOCK, CYCLIC) ONTO p
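The BLOCK and CYCLIC mappings can be written down as owner functions. The sketch below uses plain Python with hypothetical helper names (block_owner, cyclic_owner, layout, owner_2d) and 1-based indices as in Fortran. It assumes the default HPF block size, ceiling(n/p), and reproduces the two 16-element tables.

```python
def block_owner(i, n, p):
    """Processor (1..p) owning element i (1..n) under BLOCK."""
    block = -(-n // p)  # ceiling(n / p), the default HPF block size
    return (i - 1) // block + 1


def cyclic_owner(i, p):
    """Processor (1..p) owning element i under CYCLIC (round robin)."""
    return (i - 1) % p + 1


def layout(n, p, owner):
    """Collect, per processor, the list of elements it stores."""
    table = {proc: [] for proc in range(1, p + 1)}
    for i in range(1, n + 1):
        table[owner(i)].append(i)
    return table


def owner_2d(i, j):
    """Owner of a(i,j) for REAL a(7,7) with DISTRIBUTE a(CYCLIC, BLOCK) ONTO p(2,2)."""
    return (cyclic_owner(i, 2), block_owner(j, 7, 2))


block_layout = layout(16, 4, lambda i: block_owner(i, 16, 4))
cyclic_layout = layout(16, 4, lambda i: cyclic_owner(i, 4))

for proc in range(1, 5):
    print("P%d  BLOCK: %s  CYCLIC: %s" % (proc, block_layout[proc], cyclic_layout[proc]))
```

For the 7 x 7 examples the block size is ceiling(7/2) = 4, so under BLOCK the first processor in a dimension gets indices 1 to 4 and the second gets 5 to 7. Each dimension is mapped independently, which is why mixed forms such as (CYCLIC, BLOCK) need no extra rules.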

The ALIGN directive
• Describes mapping relations between interacting objects
• Both objects are allocated on the same processor
    REAL a(6), b(6)
    !HPF$ ALIGN a(I) WITH b(I)

    a(1)  a(2)  a(3)  a(4)  a(5)  a(6)
    b(1)  b(2)  b(3)  b(4)  b(5)  b(6)

    REAL a(4,4), b(4,10)
    !HPF$ ALIGN a(I,J) WITH b(I, 2*J+1)

    a   1       2       3       4
    1   b(1,3)  b(1,5)  b(1,7)  b(1,9)
    2   b(2,3)  b(2,5)  b(2,7)  b(2,9)
    3   b(3,3)  b(3,5)  b(3,7)  b(3,9)
    4   b(4,3)  b(4,5)  b(4,7)  b(4,9)

Example: Simple Matrix Multiplication (C = A B)
    PROGRAM ABmult
    INTEGER, PARAMETER :: N = 100
    INTEGER, DIMENSION (N,N) :: A, B, C
    INTEGER :: i, j
    !HPF$ PROCESSORS SQ(2,2)
    !HPF$ DISTRIBUTE C(BLOCK,BLOCK) ONTO SQ
    !HPF$ ALIGN A(i,*) WITH C(i,*)
    ! replicate copies of row A(i,*)
    ! onto processors which compute C(i,j)
    !HPF$ ALIGN B(*,j) WITH C(*,j)
    ! replicate copies of column B(*,j)
    ! onto processors which compute C(i,j)
    A = 1; B = 2; C = 0
    DO i = 1, N
      DO j = 1, N
        ! All the work is local due to ALIGNs
        C(i,j) = DOT_PRODUCT(A(i,:), B(:,j))
      END DO
    END DO
    END

The FORALL statement
• Generalization of array assignment and masked array assignment (NOT a loop)
• Single statement FORALL
  – FORALL (index, mask) forall-assignment
  – Equivalent to array assignment in F90
  – For every index, check the mask
  – Compute the right hand side for unmasked values
  – Carry out the assignments to the left hand side
• Multiple statement FORALL semantics
  – FORALL (index, mask) forall-body-list END FORALL
  – forall-body can be FORALL, WHERE, or ordinary forall-assignments
  – Abbreviation of a series of single statement FORALLs

The INDEPENDENT directive
• States that no iteration affects any other iteration in any way
  – Is used to give the compiler extra information about the execution of a DO or FORALL
• Applied on DO: states that there are no loop carried dependencies
• Applied on FORALL: states that no index value assigns to a storage location used for any other index value
    !HPF$ INDEPENDENT
    DO I = 1, N
      A(INDX(I)) = B(I)
    END DO

The INDEPENDENT directive
Without the directive:
    FORALL (I=1:3)
      L1(I) = R1(I)
      L2(I) = R2(I)
    END FORALL
With the directive:
    !HPF$ INDEPENDENT
    FORALL (I=1:3)
      L1(I) = R1(I)
      L2(I) = R2(I)
    END FORALL
Assume that R1(3) and R2(1) take longer due to communication.
[Figure: execution timelines for both versions. Without INDEPENDENT, a Sync follows each phase (evaluating R1, assigning L1, evaluating R2, assigning L2). With INDEPENDENT, each index proceeds without waiting for the others; the difference is marked "Time gained".]

Game of LIFE
    INTEGER LIFE(64, 64), NCOUNT(64, 64)
    !HPF$ ALIGN LIFE WITH NCOUNT
    !HPF$ DISTRIBUTE LIFE(BLOCK, BLOCK)
    ..... INIT LIFE .....
    NCOUNT = 0
    DO M = 1, NUMBER_OF_GENERATIONS
      FORALL (I=2:63, J=2:63)
        NCOUNT(I,J) = SUM(LIFE(I-1:I+1,J-1:J+1))-LIFE(I,J)
      END FORALL
      ! Create next generation
      WHERE ((LIFE.EQ.0).AND.(NCOUNT.EQ.3))
        LIFE = 1
      END WHERE
      ! A live cell dies unless it has 2 or 3 live neighbours
      WHERE ((LIFE.EQ.1).AND.(NCOUNT.NE.2).AND.(NCOUNT.NE.3))
        LIFE = 0
      END WHERE
    END DO
    END
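One generation of the Game of Life can be checked on a small grid. The following is a sequential sketch in plain Python, not a translation by the lecturer: the function names are hypothetical, indices are 0-based, and the standard rules are assumed (a dead cell with exactly 3 live neighbours is born, a live cell survives only with 2 or 3). As in the HPF program, neighbour counts are computed for interior cells only and stay 0 on the border.

```python
def neighbour_count(life, i, j):
    """Same as SUM(LIFE(I-1:I+1, J-1:J+1)) - LIFE(I,J) for an interior cell."""
    return sum(life[a][b] for a in (i - 1, i, i + 1) for b in (j - 1, j, j + 1)) - life[i][j]


def next_generation(life):
    """Return the next generation of a square 0/1 grid."""
    n = len(life)
    # Like the FORALL: every count is based on the old generation.
    ncount = [[0] * n for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            ncount[i][j] = neighbour_count(life, i, j)
    new = [row[:] for row in life]
    for i in range(n):
        for j in range(n):
            if life[i][j] == 0 and ncount[i][j] == 3:
                new[i][j] = 1  # birth (first WHERE)
            elif life[i][j] == 1 and ncount[i][j] not in (2, 3):
                new[i][j] = 0  # death (second WHERE)
    return new


# A horizontal "blinker" in a 5 x 5 grid oscillates with period 2.
blinker = [[0] * 5 for _ in range(5)]
blinker[2][1] = blinker[2][2] = blinker[2][3] = 1

after_one = next_generation(blinker)
after_two = next_generation(after_one)
```

The FORALL and the two WHERE blocks each act on the whole array at once; the nested loops here visit cells one at a time, but because every decision reads only the old grid and the counts, the result is the same.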
