 
Bridging Models and Machines
PRAM – BSP – Delay Model – LogP

Machine Models
• What are models good for?
  • Abstracting from machine properties
    • Making programming simple
    • Making programs portable
  • Reflecting essential machine properties
    • Functionality (sure)
    • Costs (the programmer should understand that a program is expensive while writing it), as long as they cannot be hidden by compilers
• Success of the von-Neumann machine model

Problem
• What is the von-Neumann model for parallel machines?
  • Much more sensitive with respect to reflecting performance
  • Much more diverse in existing architectures
    • Message passing networks of different kinds
    • Internet
    • Shared and virtual shared memory machines
    • Vector machines …
• Conflict between
  • Easy to program, portability of programs
  • Accurately reflecting performance

Questions
• Evaluate the models with respect to
  • Programmability
  • Reality
  • Simulations
  • Compilations

PRAM revisited
+ Easy to program
+ Portable programs
- Unrealistic assumptions like constant-time memory access
- Expensive simulations on existing message passing architectures
  • Looks ok in the O-calculus
  • Large constants on message passing machines

Theoretical simulation results
• Deterministic and probabilistic methods
• Deterministic
  • Each PRAM cell is stored on different nodes (memory organization scheme)
  • A general optimum memory organization scheme is unknown (only its existence is known for individual topologies)
  • E.g. $O(\log^2 p / \log\log p)$ for a p-PRAM step on a p-mesh
• Probabilistic
  • Probabilistic distribution of the memory cells
  • E.g. $O(\max(\log p, v/p))$ for a v-PRAM step on a p-CCC, p-hypercube, or p-butterfly (Valiant); optimal if $v > p \log p$
• But constant speed-up is all we can hope for
Measurements - time
[Figure: MASPAR MP-1, 256 processors (Zimmermann, Kumm)]

Measurements - steps
[Figure: MASPAR MP-1, 256 processors (Zimmermann, Kumm)]

Measurements
[Figure: MASPAR MP-1, 256 processors]

BSP (Valiant)
• Bulk-Synchronous Parallel Machine
• Avoids the costs in the PRAM simulation for hashing, sorting, queuing, other organizational tricks ☺
• Let the programmer handle this problem
• Bridging model for parallel computation
• Standard results on probabilistic PRAM simulations in the Handbook of Theoretical CS are by Valiant
• Even he obviously sees a need to get closer to reality

BSP (Valiant)
[Figure: machine diagram with components P P P, S, M M M]
• Processor (P)
• Virtually shared (S) and/or local Memory (M)
• Common synchronization
• Router

BSP Computations
• In super-steps, each:
  • Processors read values required in a step
  • Perform computations locally
  • Store values computed in that step
  • Bulk-synchronize before the next step
• Periodicity of L for synchronization
Cost Model
• Router can handle h-relations in time $h g' + s$
  • Number of messages sent or received: $h$
  • Router throughput: $g'$
  • Startup time: $s$
• For simplicity, define a $g$ such that the router can handle h-relations in time $h g$ for $h > h_0$ (some initial value), e.g. take $g = 2g'$ assuming $h g' > s$
• Router implementation is hidden in a library

Super-steps
[Figure: super-steps in the Valiant model. Axes: processors, time. Phases within each period L: local read, compute, global write in time gh, barrier.]

Periodic synchronization
• Assumed to be implemented in hardware
  • At least independent of the processors
  • Otherwise there wouldn't be any processor capacity left for computation in the super-steps
• Bound from below by the hardware
• Bound from above by the application
  • Larger super-steps mean longer independent parallel computations without the need to establish a consistent memory state
  • Requires higher granularity in the problem to allow that

BSP (McColl)
[Figure: machine diagram with components P P P, M M M]
• Processor (P)
• Memory (M)
• Common synchronization in software
• Network connected

Super-steps
[Figure: super-steps in the McColl model. Axes: processors, time. Phases: local read; compute, max. in time w; communicate, max. h-relation in time gh; barrier l.]

Discussion Valiant vs. McColl
• McColl
  • Gives up periodicity L as an unnecessary constraint
  • Introduces explicit synchronization time l, accounting for synchronization in processors, i.e., sharing the hardware with computation and communication
  • Assumes message passing to address usual hardware
• Valiant
  • Preserves the ability of managed data distribution from the deterministic PRAM simulation
  • Allows user-defined data distribution if applicable
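The simplification in the cost model can be checked numerically. The following is a minimal sketch, not part of the original slides: the function names are hypothetical, and it only illustrates that with $g = 2g'$ the estimate $hg$ bounds the exact router time $hg' + s$ once $hg' \ge s$.

```python
def router_time_exact(h, g_prime, s):
    """Exact router cost for an h-relation: h * g' + s."""
    return h * g_prime + s


def router_time_simplified(h, g_prime):
    """Simplified cost h * g, using the slide's example choice g = 2 * g'."""
    g = 2 * g_prime
    return h * g


def is_safe_bound(h, g_prime, s):
    """True if the simplified cost does not underestimate the exact cost."""
    return router_time_simplified(h, g_prime) >= router_time_exact(h, g_prime, s)


if __name__ == "__main__":
    # Illustrative parameter values only (assumptions, not measurements).
    g_prime, s = 0.5, 10
    for h in (10, 20, 40):
        print(h, router_time_exact(h, g_prime, s),
              router_time_simplified(h, g_prime), is_safe_bound(h, g_prime, s))
```

For small h the startup time s dominates and the simplified formula underestimates, which is why the slide restricts it to $h > h_0$.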
Design a BSP program
• Execution time:
  $T = \sum_{s \in \text{super-steps}} \left( \max_{i \in \text{procs}} w_{i,s} + \max_{i \in \text{procs}} h_{i,s} \, g + l \right)$
• Implications for algorithm design:
  • Balance computation because of $\max_{i \in \text{procs}} w_{i,s}$
  • Balance communications because of $\max_{i \in \text{procs}} h_{i,s} \, g$
  • Minimize the number of super-steps because of $|\text{super-steps}| \times l$

Example Prefix Sums – Plan A
    for (p=0; p<n; p++) in parallel {
      right=init(p); left=0;
      for (i=1; i<n; i*=2) {
        if (p+i < n)
          /* put(target processor, value from local variable, to target variable) */
          put(p+i, right, left);
        barrier_synchronize();
        if (p >= i)
          right=right+left;
      }
    }

Prefix Sums (cont.)
[Figure: Plan A communication pattern. Axes: processors, time. Steps i=1, i=2, i=4, i=8.]

Prefix Sums (cont.)
[Figure: Plan A communication pattern for p=10. Axes: processors, time. Steps i=1, i=2, i=4, i=8.]

Analysis of Prefix Sums
• BSP execution time in general:
  $T = \sum_{s \in \text{super-steps}} \left( \max_{i \in \text{procs}} w_{i,s} + \max_{i \in \text{procs}} h_{i,s} \, g + l \right)$
• Prefix Sums execution time:
  • Initialization: $w_{i,0} = 1$
  • All steps perform a "+" operation: $w_{i,s} = 1$
  • All steps route a 1-relation: $h_{i,s} = 1$
  • $\lceil \log n \rceil$ super-steps in total
  $T = 1 + \lceil \log n \rceil \, (1 + 1 \cdot g + l)$

Prefix Sums – Plan B
    for (p=0; p<n; p++) in parallel {
      right=init(p); array[0…n-1]=0;
      for (i=p+1; i<n; i++)
        /* write into slot p on processor i, so that values from
           different senders do not overwrite each other */
        put(i, right, array[p]);
      barrier_synchronize();
      for (i=0; i<p; i++)
        right=right+array[i];
    }
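Plan A and its cost formula can be replayed sequentially. The sketch below is an illustration, not the original implementation: it simulates the n virtual processors with plain Python lists, models `put` plus the barrier by filling a fresh `left` mailbox before any processor reads it, and assumes the logarithm in the analysis is base 2 (the loop doubles `i`). The names `prefix_sums_plan_a`, `ceil_log2` and `plan_a_time` are hypothetical.

```python
def ceil_log2(n):
    """Integer ceil(log2(n)) for n >= 1, avoiding floating point."""
    return (n - 1).bit_length()


def prefix_sums_plan_a(values):
    """Simulate Plan A; returns (prefix sums, number of communication super-steps)."""
    n = len(values)
    right = list(values)              # right[p]: running sum held by processor p
    supersteps = 0
    i = 1
    while i < n:
        left = [0] * n                # left[p]: mailbox variable on processor p
        # Communication phase: processor p puts its 'right' into 'left' on p+i.
        for p in range(n):
            if p + i < n:
                left[p + i] = right[p]
        # Barrier: every put above is complete before anyone reads 'left'.
        # Computation phase: one "+" per processor.
        for p in range(n):
            if p >= i:
                right[p] += left[p]
        supersteps += 1
        i *= 2
    return right, supersteps


def plan_a_time(n, g, l):
    """T = 1 + ceil(log n) * (1 + 1*g + l), as in the analysis slide."""
    return 1 + ceil_log2(n) * (1 + g + l)


if __name__ == "__main__":
    print(prefix_sums_plan_a([1, 2, 3, 4, 5]))
    print(plan_a_time(8, g=2, l=10))  # g and l are illustrative values
```

The simulation needs ⌈log n⌉ communication super-steps, matching the count used in the formula.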
Plan B (cont.)
[Figure: Plan B communication pattern. Axes: processors, time.]

Analysis of Prefix Sums – Plan B
• Prefix Sums execution time:
  • 2 super-steps, one barrier synchronization
  • Initialization: $w_{i,0} = 1$
  • Processor n-1 performs n "+" operations: $\max_{i \in \text{procs}} w_{i,s} = w_{n-1,1} = n$
  • Processor 0 sends and processor n-1 receives n-1 messages: $\max_{i \in \text{procs}} h_{i,s} = h_{0,0} = n-1$
  $T = 1 + n + (n-1) \, g + l$

General Prefix Sums
The assumption n = P (P: number of actual processors) can be dropped using either algorithm, Plan A or Plan B:
1. Sum of array blocks of size n/P computed locally (sequential algorithm)
2. Use Plan A or B to compute the prefix sum over every n/P-th element (the last of each block)
3. Receive the result of the left neighbor's prefix sum
4. Add the received value to the local sums

Design of a BSP program
• Requires machine parameters: l, g, P
  • Analytically derived: too complex, does not work
  • Benchmarks
• Requires computation times of the sequential algorithm
  • Analytically derived: too complex, does not work
  • Benchmarks: imprecise because of
    • Caching and pipelining effects that are not repeatable
    • Data dependencies of the sequential computation
• In practice: analysis + profiling necessary

Micro Benchmarks Load
[Figure: SGI Power Challenge]

Micro Benchmarks Store
[Figure: SGI Power Challenge]
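Plan B and the four steps of the general (blocked) prefix sum can be sketched as follows. This is a sequential simulation under stated assumptions: P divides n, each receiver keeps one mailbox slot per sender (`inbox[q][p]` stands for `array[p]` on processor q), and all function names are hypothetical.

```python
def prefix_sums_plan_b(values):
    """Simulate Plan B: one communication super-step, then local summation."""
    n = len(values)
    inbox = [[0] * n for _ in range(n)]   # inbox[q][p]: slot p on processor q
    # Super-step 0: processor p puts its value on every processor q > p.
    for p in range(n):
        for q in range(p + 1, n):
            inbox[q][p] = values[p]
    # Barrier, then super-step 1: processor p adds the values it received.
    return [values[p] + sum(inbox[p][:p]) for p in range(n)]


def plan_b_time(n, g, l):
    """T = 1 + n + (n-1)*g + l, as in the Plan B analysis."""
    return 1 + n + (n - 1) * g + l


def blocked_prefix_sums(data, P, scan=prefix_sums_plan_b):
    """General prefix sums for n >= P; assumes P divides len(data)."""
    size = len(data) // P
    blocks = [data[k * size:(k + 1) * size] for k in range(P)]
    # Step 1: sequential prefix sums inside each block.
    local = []
    for blk in blocks:
        acc, sums = 0, []
        for x in blk:
            acc += x
            sums.append(acc)
        local.append(sums)
    # Step 2: parallel prefix sum over the last element of each block.
    scanned = scan([sums[-1] for sums in local])
    # Steps 3 and 4: take the left neighbor's result and add it locally.
    result = []
    for k, sums in enumerate(local):
        offset = scanned[k - 1] if k > 0 else 0
        result.extend(x + offset for x in sums)
    return result


if __name__ == "__main__":
    print(prefix_sums_plan_b([1, 2, 3, 4]))
    print(blocked_prefix_sums([1, 2, 3, 4, 5, 6], P=3))
```

The `scan` parameter lets either plan be plugged into step 2, mirroring the "Plan A or B" choice in the list above.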
Performance Predictions
[Figure: SGI Power Challenge / Radix sort]

Some BSP Machines
Machine               l        g (P-relation)   g (1-relation)   P
SGI Power Challenge   25.7     0.13 x           0.13 x           4
Hitachi SR2001        1321.7   0.92 x           0.9 x            32
Parsytec GC           6700     34.1 x           14.1 x           32
DEC-Farm              4664     8.1 x (only one g value given)    4
IBM SP-2              208.2    0.43 x           0.27 x           8
Cray T3D              16.6     0.48 x           0.36 x           32
Cray T3D              31.1     0.78 x           0.42 x           256
x in words and time in μs

Performance Predictions
[Figure: SGI Power Challenge / Sample sort]

Profile Plan A
[Figure: IBM SP/2, 8 processors. Axis: completion time.]

Profile Plan B
[Figure: IBM SP/2, 8 processors. Axis: completion time.]

Profile Plan A
[Figure: Cray T3D, 32 processors. Axis: completion time.]
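The machine parameters can be plugged into the two prefix-sum cost formulas, $T_A = 1 + \lceil \log n \rceil (1 + g + l)$ and $T_B = 1 + n + (n-1)g + l$. The sketch below is only a rough illustration and rests on several assumptions that the slides do not state: n = P, one "+" operation costs one time unit comparable to the μs values of l and g, one word per message, Plan A is charged the 1-relation g, and Plan B the P-relation g. The dictionary and function names are hypothetical.

```python
# l and g copied from the "Some BSP Machines" table (time in microseconds).
MACHINES = {
    "IBM SP-2": {"l": 208.2, "g_P": 0.43, "g_1": 0.27, "P": 8},
    "Cray T3D": {"l": 16.6, "g_P": 0.48, "g_1": 0.36, "P": 32},
}


def ceil_log2(n):
    """Integer ceil(log2(n)) for n >= 1."""
    return (n - 1).bit_length()


def plan_a_time(n, g, l):
    """Plan A: ceil(log n) super-steps, each with a 1-relation."""
    return 1 + ceil_log2(n) * (1 + g + l)


def plan_b_time(n, g, l):
    """Plan B: one (n-1)-relation followed by n local additions."""
    return 1 + n + (n - 1) * g + l


def predict(machine):
    """Return (T_A, T_B) for n = P on the given machine (assumptions above)."""
    m = MACHINES[machine]
    n = m["P"]
    return plan_a_time(n, m["g_1"], m["l"]), plan_b_time(n, m["g_P"], m["l"])


if __name__ == "__main__":
    for name in MACHINES:
        t_a, t_b = predict(name)
        print(f"{name}: Plan A {t_a:.2f}, Plan B {t_b:.2f}")
```

Under these assumptions the formulas favor Plan B on both machines, mainly because Plan A pays the synchronization cost l once per super-step. Measured profiles may differ, since the model ignores effects such as caching.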
Profile Plan B
[Figure: Cray T3D, 32 processors. Axis: completion time.]

Observations
• Plan A could be seen as a PRAM simulation
• Plan B is designed directly for BSP
  • Appears absurd on PRAM
  • Advantages show on the more realistic machine model BSP
• Programming becomes more difficult
• Same situation when comparing
  • BSP vs. PRAM
  • PRAM vs. von-Neumann (and parallelization)

Problems with BSP
• Algorithms need to be split into global phases
  • Computation
  • Communication
  • Synchronization
• In many algorithms computation and communication are not balanced over processors
• On almost all machines
  • Different times g for local and global communication in a P-relation compared to a 1-relation
• Synchronization
  • Not necessary when knowing all data dependencies
  • Otherwise, only locally necessary

Example Prefix Sums (revisited)
[Figure: axes processors, time. Steps i=1, i=2, i=4, i=8.]

Prefix Sums Data Dependencies
[Figure: axes processors, time.]

Prefix Sums Task Graph
[Figure]