Figure 5.1 The execution profile of a hypothetical parallel program

Figure 5.1 The execution profile of a hypothetical parallel program executing on eight processing elements (P0 to P7). The profile indicates the time spent performing computation (both essential and excess), interprocessor communication, and idling.

Figure 5.2 Computing the global sum of 16 partial sums using 16 processing elements: (a) initial data distribution and the first communication step; (b) second communication step; (c) third communication step; (d) fourth communication step; (e) accumulation of the sum at processing element 0 after the final communication. Σ_i^j denotes the sum of numbers with consecutive labels from i to j.
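The communication pattern in Figure 5.2 can be sketched as a small serial simulation. This is an illustrative sketch only: the function name `tree_sum` and the list-based model of processing elements are assumptions made here, not part of the original text, and the sketch assumes the number of values is a power of two.

```python
def tree_sum(values):
    """Simulate the log-step global sum; return (total, communication_steps)."""
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two count"
    partial = list(values)  # partial[i] models the value held by processing element i
    steps = 0
    stride = 1
    while stride < n:
        # In each step, element (receiver + stride) sends its partial sum
        # to element receiver, which adds it to its own.
        for receiver in range(0, n, 2 * stride):
            partial[receiver] += partial[receiver + stride]
        stride *= 2
        steps += 1
    return partial[0], steps


total, steps = tree_sum(list(range(16)))
print(total, steps)  # 120 4
```

With 16 values the loop runs four times, matching the four communication steps in panels (a) to (d), and the result accumulates at element 0 as in panel (e).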

Figure 5.3 Searching an unstructured tree for a node with a given label, 'S', on two processing elements using depth-first traversal. The two-processor version, with processor 0 searching the left subtree and processor 1 searching the right subtree, expands only the shaded nodes before the solution is found. The corresponding serial formulation expands the entire tree, so the serial algorithm does more work than the parallel algorithm.

Figure 5.4 Example of edge detection: (a) an 8 × 8 image; (b) typical templates for detecting edges; and (c) partitioning of the image across four processors, with shaded regions indicating image data that must be communicated from neighboring processors to processor 1.
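To make the template operation in Figure 5.4 concrete, here is a minimal serial sketch that applies one 3 × 3 template to an 8 × 8 image. The template rows (−1, 0, 1), (−2, 0, 2), (−1, 0, 1) are a common edge-detection choice and are assumed here for illustration; the test image, the function name `apply_template`, and the decision to leave border pixels at zero are likewise assumptions.

```python
# Assumed 3x3 template that responds to vertical edges.
TEMPLATE = [[-1, 0, 1],
            [-2, 0, 2],
            [-1, 0, 1]]


def apply_template(image, template):
    """Apply a 3x3 template at every interior pixel; border pixels stay 0."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Multiply each template entry by the pixel under it and sum.
            out[r][c] = sum(
                template[i][j] * image[r + i - 1][c + j - 1]
                for i in range(3)
                for j in range(3)
            )
    return out


# Hypothetical 8x8 image: dark left half, bright right half.
image = [[0] * 4 + [1] * 4 for _ in range(8)]
edges = apply_template(image, TEMPLATE)
print(edges[1])  # [0, 0, 0, 4, 4, 0, 0, 0]
```

Each output pixel depends on its eight neighbors, which is why a processor that owns a strip of the image needs the boundary pixels of adjacent strips, as the shaded regions in panel (c) indicate.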

Figure 5.5 Four processing elements simulating 16 processing elements to compute the sum of 16 numbers (first two steps): (a) four processors simulating the first communication step of 16 processors, in substeps 1 to 4; (b) four processors simulating the second communication step of 16 processors, in substeps 1 to 4. Σ_i^j denotes the sum of numbers with consecutive labels from i to j.

Figure 5.6 (continued) Four processing elements simulating 16 processing elements to compute the sum of 16 numbers (last three steps): (c) simulation of the third step in two substeps; (d) simulation of the fourth step; (e) final result.

Figure 5.7 A cost-optimal way of computing the sum of 16 numbers using four processing elements, shown in panels (a) to (d).
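The scheme in Figure 5.7 can be sketched as follows: each processing element first adds its own block of n/p numbers locally, and only the p partial sums are then combined in log p steps. The function name `blocked_sum` and the list-based model are assumptions for illustration; the sketch assumes p is a power of two that divides n.

```python
def blocked_sum(values, p):
    """Return (total, local_additions_per_element, communication_steps)."""
    n = len(values)
    assert p > 0 and p & (p - 1) == 0, "sketch assumes p is a power of two"
    assert n % p == 0, "sketch assumes p divides n"
    block = n // p
    # Phase 1: every element sums its own contiguous block, no communication.
    local = [sum(values[i * block:(i + 1) * block]) for i in range(p)]
    local_additions = block - 1
    # Phase 2: combine the p partial sums pairwise in log2(p) steps.
    steps = 0
    stride = 1
    while stride < p:
        for receiver in range(0, p, 2 * stride):
            local[receiver] += local[receiver + stride]
        stride *= 2
        steps += 1
    return local[0], local_additions, steps


print(blocked_sum(list(range(16)), 4))  # (120, 3, 2)
```

For 16 numbers on four elements this takes three local additions per element followed by two communication steps, in contrast to the element-by-element simulation of Figures 5.5 and 5.6, which replays every step of the 16-element algorithm.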

Figure 5.8 A comparison of the speedups obtained by the binary-exchange, 2-D transpose, and 3-D transpose algorithms on 64 processing elements with t_c = 2, t_w = 4, t_s = 25, and t_h = 2 (see Chapter ?? for details). Axes: speedup S versus problem size n.

Figure 5.9 Speedup versus the number of processing elements for adding a list of numbers. Axes: speedup S versus number of processing elements p. Curves: n = 64, n = 192, n = 320, n = 512, and the linear-speedup reference.
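The shape of the curves in Figure 5.9 can be reproduced with a simple cost model. The model below is an assumption, not something stated in the captions: it takes one time unit per addition and per single-word communication, giving a parallel time of n/p + 2 log2(p) and a serial time of roughly n. The function names are hypothetical.

```python
import math


def parallel_time(n, p):
    """Assumed model: n/p local additions plus 2*log2(p) for the combine steps."""
    return n / p + 2 * math.log2(p)


def speedup(n, p):
    """Speedup S = serial time / parallel time, with serial time taken as n."""
    return n / parallel_time(n, p)


def efficiency(n, p):
    """Efficiency E = S / p."""
    return speedup(n, p) / p


for n in (64, 192, 320, 512):
    row = [round(speedup(n, p), 2) for p in (1, 4, 16, 32)]
    print(n, row)
```

Under this model the speedup for a fixed n falls further below the linear reference as p grows, and larger n stays closer to linear, which is the qualitative behavior the figure plots.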

Figure 5.10 Variation of efficiency: (a) as the number of processing elements is increased for a given problem size; and (b) as the problem size is increased for a given number of processing elements. The phenomenon illustrated in graph (b) is not common to all parallel systems. Axes: efficiency E versus p for a fixed problem size W in (a); efficiency E versus W for a fixed number of processors p in (b).

Figure 5.11 Superlinear(?) speedup in parallel depth-first search: (a) DFS with one processing element; (b) DFS with two processing elements.

Figure 5.12 Dependency graphs for Problem ??, shown in panels (a) to (d).
