Performance of Parallel Programs

Wolfgang Schreiner
Research Institute for Symbolic Computation (RISC-Linz)
Johannes Kepler University, A-4040 Linz, Austria
Wolfgang.Schreiner@risc.uni-linz.ac.at
Speedup and Efficiency
- (Absolute) speedup: Sn = Ts / Tp(n).
  – Ts ... time of the sequential program.
  – Tp(n) ... time of the parallel program with n processors.
  – 0 < Sn ≤ n (always?)
  – Criterion for the performance of a parallel program.
- (Absolute) efficiency: En = Sn / n.
  – 0 < En ≤ 1 (always?)
  – Criterion for the cost of a parallel program.
- Relative speedup and efficiency use Tp(1) instead of Ts.
  – Tp(1) ≥ Ts (why?)
  – Relative speedup and efficiency are larger than their absolute counterparts.
Observations depend on (size of) input data.
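A minimal Python sketch of these definitions; all timing values below are hypothetical, chosen only to illustrate the formulas:

```python
# Absolute vs. relative speedup and efficiency; times are made up.
Ts = 100.0                            # sequential program time
Tp = {1: 120.0, 4: 32.0, 16: 9.5}     # parallel times Tp(n), hypothetical

for n, t in Tp.items():
    S_abs = Ts / t        # absolute speedup uses Ts
    S_rel = Tp[1] / t     # relative speedup uses Tp(1) >= Ts
    print(f"n={n:2d}  S_abs={S_abs:5.2f}  E_abs={S_abs/n:4.2f}  "
          f"S_rel={S_rel:5.2f}  E_rel={S_rel/n:4.2f}")
```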
Speedup and Efficiency Diagrams

[Figure: speedup versus number of processors, with curves for linear, sublinear, and nonlinear speedup]

[Figure: efficiency versus number of processors, same three curves]
Logarithmic Scales

[Figure: the same speedup and efficiency curves (linear, sublinear, nonlinear) plotted on logarithmic axes]
Amdahl’s Law

[Diagram: a sequential program split into a sequential fraction f and a parallelizable fraction 1−f]
- Speedup: Sn ≤ 1 / (f + (1−f)/n).
- Limit: Sn ≤ 1/f.
- Example: f = 0.01 ⇒ Sn < 100!
Speedup is limited by the sequential fraction f of a program!
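A short Python illustration of the bound, using the slide's example fraction f = 0.01:

```python
# Amdahl's law: Sn = 1 / (f + (1 - f)/n) approaches 1/f as n grows.
def amdahl(f, n):
    return 1.0 / (f + (1.0 - f) / n)

f = 0.01                              # 1% sequential fraction
for n in (10, 100, 1000, 10**6):
    print(f"n={n:>7}  Sn = {amdahl(f, n):6.2f}")   # limit: 1/f = 100
```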
Superlinear Speedup

Question: Can speedup be larger than the number of processors, i.e., Sn > n, En > 1?
Answer: In principle, no.

Every parallel algorithm solving a problem in time Tp with n processors can in principle be simulated by a sequential algorithm in time Ts = n·Tp on a single processor; hence Sn = Ts/Tp ≤ n·Tp/Tp = n.

However, the simulation may incur some execution overhead.
Speedup Anomalies

Still, superlinear speedups can sometimes be observed!

- Memory/cache effects
  – More processors typically also provide more memory/cache.
  – Total computation time decreases due to more page/cache hits.
- Search anomalies
  – Parallel search algorithms.
  – Decomposition of the search range and/or multiple search strategies.
  – One task may be “lucky” and find the result early.

Both “advantages” can in principle also be achieved on uniprocessors.
Scalability
- Scalable algorithm: high efficiency also with a larger number of processors.
- Scalability analysis: investigate the performance of a parallel algorithm with
  – growing processor number,
  – growing problem size,
  – various communication costs.
- Various workload models.
Fixed Workload Model

Amdahl’s Law revisited:

- Assumption: the problem size is fixed.
  – Sequential and parallelizable fraction.
  – Total time T = Ts + Tp.
- Goal: minimize the computation time.
Sn ≤ (Ts + Tp) / (Ts + Tp/n) ≤ (Ts + Tp) / Ts = 1 / (Ts / (Ts + Tp)) = 1/f.
- Applies when a given problem is to be solved as quickly as possible.
  – Financial market predictions.
  – Being faster yields a competitive advantage.
For not perfectly scalable algorithms, efficiency eventually drops to zero!
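The following sketch makes the last point concrete under the fixed-workload assumption, again with the illustrative fraction f = 0.01:

```python
# Fixed workload: En = Sn/n drops toward zero as n grows.
f = 0.01
for n in (1, 10, 100, 1000, 10000):
    Sn = 1.0 / (f + (1.0 - f) / n)
    print(f"n={n:5d}  Sn={Sn:6.2f}  En={Sn/n:5.3f}")   # En -> 0
```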
Fixed Time Model

Gustafson’s Law:

- Assumption: the available time is constant.
- Goal: solve the largest problem in fixed time.
- Strategy: scale the workload with the processor number.
– Scaled workload: sequential execution time Ts + n·Tp.
– Sn = (Ts + n·Tp) / (Ts + n·Tp/n) = (Ts + n·Tp) / (Ts + Tp) = f + n·(1−f), where f = Ts/(Ts + Tp).
- Speedup grows linearly with n!
- Applies where a “better” solution is appreciated.
  – Refined simulation model.
  – More accurate predictions.

Efficiency remains constant.
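A sketch of the scaled speedup under Gustafson's law, with the same illustrative f:

```python
# Gustafson's law: Sn = f + n*(1 - f) grows linearly, En stays ~constant.
f = 0.01
for n in (1, 10, 100, 1000):
    Sn = f + n * (1.0 - f)
    print(f"n={n:5d}  Sn={Sn:8.2f}  En={Sn/n:5.3f}")   # En ≈ 1 - f
```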
Fixed Memory Model

Sun & Ni’s Law:

- Assumption: the available memory is constant.
- Goal: solve the largest problem in fixed memory.
- Strategy: scale the problem size with the available memory.
– T = Ts + c·n·Tp, c > 1.
– Sn = (Ts + c·n·Tp) / (Ts + c·n·Tp/n) = (Ts + c·n·Tp) / (Ts + c·Tp) = (f + c·n·(1−f)) / (f + c·(1−f)) ≈ n.
- Applies when memory requirements grow more slowly than computation requirements.

Efficiency is maximized.
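A sketch of the fixed-memory speedup; the values f = 0.01 and c = 4 are hypothetical:

```python
# Fixed memory model: Sn = (f + c*n*(1-f)) / (f + c*(1-f)) ≈ n for c > 1.
f, c = 0.01, 4.0
for n in (1, 10, 100, 1000):
    Sn = (f + c * n * (1.0 - f)) / (f + c * (1.0 - f))
    print(f"n={n:5d}  Sn={Sn:8.2f}  Sn/n={Sn/n:5.3f}")  # ratio near 1
```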
The Isoefficiency Concept

Kumar & Rao:
- Efficiency: En = w(s) / (w(s) + h(s, n)).
  – s ... problem size,
  – w(s) ... workload,
  – h(s, n) ... communication overhead.
- As the processor number n grows, the communication overhead h(s, n) increases and the efficiency En decreases.
- For growing s, w(s) usually increases much faster than h(s, n).

An increase of the workload w(s) may outweigh the increase of the overhead h(s, n) for a growing processor number n.
The Isoefficiency Concept
- Question: For growing n, how fast must s grow such that the efficiency remains constant?
  – En = 1 / (1 + h(s, n)/w(s)).
  – ⇒ w(s) should grow in proportion to h(s, n).
- Constant efficiency E ⇒ workload w(s) = (E/(1−E))·h(s, n) = C·h(s, n).
- Isoefficiency function: fE(n) = C·h(s, n).

If the workload w(s) grows as fast as fE(n), constant efficiency can be maintained.
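A small sketch with an invented overhead function h(s, n) = n·s: choosing the workload as w = C·h(s, n) with C = E/(1−E) holds the efficiency exactly at E:

```python
# Isoefficiency: with w = C*h(s, n), C = E/(1-E), efficiency stays at E.
E_target = 0.8
C = E_target / (1.0 - E_target)      # C = 4 for E = 0.8

def h(s, n):                         # hypothetical overhead model
    return n * s

s = 100                              # some fixed problem-size parameter
for n in (4, 16, 64, 256):
    w = C * h(s, n)                  # workload required on n processors
    E = w / (w + h(s, n))
    print(f"n={n:4d}  required w={w:9.0f}  E={E:.3f}")  # always 0.800
```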
Scalability of Matrix Multiplication
- n processors, s × s matrix.
- Workload: w(s) = O(s³).
- Overhead: h(s, n) = O(n log n + s²√n).
- w(s) must asymptotically grow at least as fast as h(s, n):
  1. w(s) = Ω(h(s, n)).
  2. ⇒ s³ = Ω(n log n + s²√n).
  3. ⇒ s³ = Ω(n log n) ∧ s³ = Ω(s²√n).
  4. s³ = Ω(s²√n) ⇔ s = Ω(√n).
  5. s = Ω(√n) ⇒ s³ = Ω(n√n) ⇒ s³ = Ω(n log n).
  6. ⇒ w(s) = Ω(n√n).
- Isoefficiency function: fE(n) = O(n√n).
- Matrix size: s = Ω(√n).

The matrix size s must grow at least as fast as √n!
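A numeric check of this analysis (the constants are invented; only the growth rates matter): scaling the matrix size as s = k·√n keeps the efficiency roughly constant instead of letting it drop to zero:

```python
# Matrix multiplication isoefficiency check: s grows like sqrt(n).
import math

def w(s):     return s**3                                   # workload
def h(s, n):  return n * math.log(n) + s**2 * math.sqrt(n)  # overhead

k = 10.0
for n in (4, 16, 64, 256, 1024):
    s = k * math.sqrt(n)
    E = w(s) / (w(s) + h(s, n))
    print(f"n={n:5d}  s={s:7.1f}  E={E:.3f}")   # E levels off, not -> 0
```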
More Performance Parameters
- Redundancy R(n)
  – Additional workload in the parallel program.
  – R(n) = Wp(n) / Ws.
  – 1 ≤ R(n) ≤ n.
- System utilization U(n)
  – Percentage of processors kept busy.
  – U(n) = R(n)·E(n) = Wp(n) / (n·Tp(n)).
  – 1/n ≤ E(n) ≤ U(n) ≤ 1.
  – 1 ≤ R(n) ≤ 1/E(n) ≤ n.
- Quality of parallelism Q(n)
  – Summary of overall performance.
  – Q(n) = S(n)·E(n) / R(n) = Ts³ / (n·Tp²(n)·Wp(n)).
  – 0 < Q(n) ≤ S(n).
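A sketch computing all of these parameters from made-up measurements, assuming Ws and Ts are expressed in matching units (so Ws = Ts):

```python
# Redundancy, utilization, and quality of parallelism; values invented.
Ts, Ws = 100.0, 100.0          # sequential time and workload, same units
n, Tp_n, Wp_n = 16, 8.0, 120.0 # hypothetical parallel measurements

S = Ts / Tp_n                  # speedup
E = S / n                      # efficiency
R = Wp_n / Ws                  # redundancy, 1 <= R <= n
U = R * E                      # utilization = Wp(n) / (n * Tp(n))
Q = S * E / R                  # quality of parallelism
print(f"S={S:.2f}  E={E:.3f}  R={R:.2f}  U={U:.3f}  Q={Q:.2f}")
```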
Parallel Execution Time

Three components:

1. Computation time Tcomp
   Time spent performing actual computation; may depend on the number of tasks or processors (replicated computation, memory and cache effects).
2. Communication time Tmsg
   Time spent sending and receiving messages:
   Tmsg = ts + tw·L
   (ts ... startup cost, tw ... cost per word, L ... message length).
3. Idle time Tidle
   Processor is idle due to lack of computation or lack of data; reduced by
   – load balancing,
   – overlapping computation with communication.
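A sketch of the message cost model; the startup latency ts and per-word cost tw below are illustrative constants, not measured values:

```python
# Tmsg = ts + tw * L: startup cost dominates short messages.
ts = 1e-4                      # 100 µs startup latency, hypothetical
tw = 1e-8                      # 10 ns per word, hypothetical

for L in (1, 100, 10_000, 1_000_000):         # message length in words
    Tmsg = ts + tw * L
    print(f"L={L:>9}  Tmsg={Tmsg:.6f} s  startup share={ts/Tmsg:6.1%}")
```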
Execution Profiles

Determine the ratio of
1. computation time,
2. message startup time,
3. data transfer costs,
4. idle time
as a function of the number of processors.

A guideline for the redesign of the algorithm!
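A toy profile under an invented cost model (every function and constant here is hypothetical), showing how the four shares shift as n grows:

```python
# Toy execution profile; the cost model and all constants are invented.
def profile(n, work=1.0, msgs=50, ts=1e-4, tw=1e-8, words=10_000):
    comp     = work / n                 # perfectly divided computation
    startup  = msgs * ts                # message startup time
    transfer = msgs * tw * words        # data transfer time
    idle     = 0.01 * comp * (n - 1)    # small load-imbalance penalty
    total    = comp + startup + transfer + idle
    parts = dict(comp=comp, startup=startup, transfer=transfer, idle=idle)
    return {k: v / total for k, v in parts.items()}

for n in (1, 4, 16, 64):
    print(n, {k: f"{v:.1%}" for k, v in profile(n).items()})
```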
Experimental Studies

Parallel programming is an experimental discipline!

1. Design the experiment.
   - Identify the data you wish to obtain.
   - Measure data for different problem sizes and/or processor numbers.
   - Be sure that you measure what you intend to measure.
2. Obtain and validate experimental data.
   - Repeat experiments to verify the reproducibility of results.
   - Variation can be caused by nondeterministic algorithms, inaccurate timers, startup costs, interference from other programs, contention, ...
3. Fit data to analytic models.
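A sketch of this last step, fitting hypothetical measurements to the simple model T(n) = a/n + b + c·n with scipy's curve_fit:

```python
# Fit measured times to T(n) = a/n + b + c*n; the data are made up.
import numpy as np
from scipy.optimize import curve_fit

def model(n, a, b, c):
    return a / n + b + c * n   # divided work + fixed + per-processor cost

n_data = np.array([1, 2, 4, 8, 16, 32], dtype=float)
T_data = np.array([100.0, 51.0, 26.5, 14.2, 8.3, 5.9])  # hypothetical

(a, b, c), _ = curve_fit(model, n_data, T_data)
print(f"a={a:.1f}  b={b:.2f}  c={c:.3f}")
```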