Parallel programming 02
Walter Boscheri, walter.boscheri@unife.it
University of Ferrara - Department of Mathematics and Computer Science
A.Y. 2018/2019 - Semester I
Outline
1. Classification of parallel systems
2. Performance measure
3. Optimization of parallel computational resources
1. Classification of parallel systems
A parallel system can be described by considering:
- number and type of the processors (massively parallel vs coarse-grained parallelism)
- presence of a global control mechanism (Flynn classification)
- synchronism (whether or not a common clock is shared among the processors)
- connections among processors (shared memory or distributed memory)
Walter Boscheri Parallel programming 02 2 / 12
Flynn classification (1966)
SISD: Single Instruction stream - Single Data stream
It includes the Von Neumann model, because one stream of instructions operates on one stream of data.

SIMD: Single Instruction stream - Multiple Data stream
It involves vector processors and pipeline processors, in which all processors follow the same instructions, executing them on different data sets.

MISD: Multiple Instruction stream - Single Data stream
It can be seen as an extension of SISD.

MIMD: Multiple Instruction stream - Multiple Data stream
A MIMD system has independent processors, each with a local control unit. As a consequence, each processor can load different instructions and operate on different data.
SISD
[Diagram legend, valid for all Flynn diagrams: CU = control unit, PU = processing unit, MM = memory module, IS = instruction stream, DS = data stream.]
[Diagram: a single CU sends one IS to one PU, which exchanges one DS with one MM.]
- scalar uniprocessor systems
- Von Neumann architecture
SIMD
[Diagram: one CU broadcasts a single IS to the processing units PU1, ..., PUn; each PUi exchanges its own data stream DSi with its memory module MMi.]
- synchronized parallelism
- one single control unit
- one single instruction operates on several data sets
- vector processors and parallel processing
MIMD
[Diagram: the control units CU1, ..., CUn issue independent instruction streams IS1, ..., ISn to the processing units PU1, ..., PUn; each PUi exchanges its own data stream DSi with its memory module MMi.]
- non-synchronized parallelism
- several processors execute several instructions and operate on several data sets
- shared or distributed memory
Shared and distributed memory
Shared memory
- single address space
- all processors have access to the pool of shared memory
Distributed memory
- each processor has its own local memory
- message passing is used to exchange data among processors
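The two memory models can be sketched in Python as an illustrative analogy (not from the slides): threads sharing one address space stand in for shared-memory processors, while a queue of messages stands in for message passing between distributed-memory processors. The function names and the parallel-sum task are our own.

```python
import threading
import queue

def shared_memory_sum(data, nworkers=4):
    # Shared memory: every worker reads and writes the same address
    # space (the list `total`), so updates must be guarded by a lock.
    total = [0]
    lock = threading.Lock()

    def worker(chunk):
        partial = sum(chunk)      # local computation
        with lock:                # update the shared location
            total[0] += partial

    chunks = [data[i::nworkers] for i in range(nworkers)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]

def message_passing_sum(data, nworkers=4):
    # Message passing: each worker owns its chunk and communicates
    # only by sending its partial result as a message; no shared state.
    inbox = queue.Queue()

    def worker(chunk):
        inbox.put(sum(chunk))     # "send" the partial sum

    chunks = [data[i::nworkers] for i in range(nworkers)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(inbox.get() for _ in range(nworkers))
```

Both functions compute the same result; the difference is purely in how the partial sums are combined, which mirrors the shared-memory vs message-passing distinction above.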
Sequential vs vector processors
Sequential processors execute all instructions in serial mode, from the first to the last.
Vector processors make use of the pipelining technique:
- it is based on the parallel execution of several instructions belonging to the sequential algorithm;
- it is similar to the assembly line: it does not reduce the execution time of a single instruction, but it increases the rate at which instructions are completed.
[Figure: timing comparison for instructions IS1 and IS2 on a 2.5 GHz processor (0.4 ns clock period). On the sequential processor, IS2 starts only after IS1 completes at 1.6 ns, for 3.2 ns in total. On the vector processor (pipeline), the stages of IS2 overlap those of IS1, so once the pipeline is full one instruction completes every 0.4 ns.]
Example

Sequential processor:

    DO i = 1, N
       A(i) = B(i) + C(i)
       B(i) = 2 * A(i+1)
    ENDDO

Vector processor:

    temp(1:N) = A(2:N+1)
    A(1:N) = B(1:N) + C(1:N)
    B(1:N) = 2 * temp(1:N)
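A minimal Python transcription of the example above (0-based indices; the function names are ours) shows why the temporary is needed: B(i) must read the old value of A(i+1), which the vector form would otherwise have already overwritten.

```python
def sequential(A, B, C):
    # direct transcription of the Fortran DO loop (0-based indices)
    N = len(C)
    for i in range(N):
        A[i] = B[i] + C[i]
        B[i] = 2 * A[i + 1]   # reads the OLD A(i+1): not yet overwritten

def vectorized(A, B, C):
    # saving the old A(2:N+1) in a temporary removes the loop-carried
    # dependence, so each statement can be executed as a whole vector
    N = len(C)
    temp = A[1:N + 1]                               # slicing copies the old values
    A[:N] = [b + c for b, c in zip(B[:N], C[:N])]
    B[:N] = [2 * t for t in temp]
```

Running both versions on the same input arrays produces identical results, which is exactly the property a vectorizing transformation must preserve.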
2. Performance measure
Speedup
The speedup S(p) measures the reduction of the computational time t(p) obtained by using a total number of p processors while keeping the size of the problem fixed.

Absolute speedup
The speedup is measured w.r.t. the best serial code, with computational time t_best:

    S(p) = t_best / t(p)

It is also called performance measure.

Relative speedup
The speedup is measured w.r.t. the same code with p = 1:

    S(p) = t(p=1) / t(p)

It is also called scalability measure.
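With hypothetical timings (not from the slides), the two definitions can be computed as follows; t_best is the run time of the best serial code and t[p] the run time of the parallel code on p processors.

```python
# Hypothetical measured run times, in seconds
t_best = 80.0                        # best serial code
t = {1: 100.0, 4: 30.0, 8: 18.0}     # parallel code on p processors

def absolute_speedup(p):
    # performance measure: compares against the best serial code
    return t_best / t[p]

def relative_speedup(p):
    # scalability measure: compares against the same code with p = 1
    return t[1] / t[p]
```

Since the parallel code run with p = 1 is usually slower than the best serial code (t[1] > t_best here), the relative speedup is the more flattering of the two measures.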
Ideal speedup
In the ideal case, in which the work load is perfectly distributed among all processors, the relative speedup should be equal to p. This is the case of linear speedup. In practice, linear speedup is almost never achieved:
- load balancing is not guaranteed;
- portions of the code cannot be parallelized;
- synchronization and communication take time.
Superlinear speedup
Very rarely, one has S(p) > p. This is the case of superlinear speedup. Superlinear speedup can occasionally be achieved because:
- in a distributed memory system, if the number of processors increases, the total amount of memory increases as well. Intermediate data and results can therefore be stored, avoiding the need of computing them again; in this way the number of floating point operations, i.e. the number of computations, is reduced compared to an execution on fewer processors;
- the portion of the problem assigned to one processor might be reduced up to the point that it can be entirely stored and managed in the cache.
Model of Flatt and Kennedy
The model qualitatively describes the speedup S(p) as a function of p.

Definitions
- Tser → execution time of the serial portion of the algorithm
- Tpar → execution time of the parallelizable portion of the algorithm
- T0(p) → synchronization and communication time for p processors

It holds:

    T(1) = Tser + Tpar
    T(p) = Tser + Tpar/p + T0(p)
    S(p) = (Tser + Tpar) / (Tser + Tpar/p + T0(p))
By considering that the communication time is a linear function of p, that is T0(p) = K p, the speedup becomes

    S(p) = (Tser + Tpar) / (Tser + Tpar/p + K p)
         = (Tser + Tpar) p / (Tser p + Tpar + K p^2)

It follows that

    lim_{p→∞} S(p) = 0
The speedup function:
- is initially linear;
- exhibits a saturation point;
- decreases as the communication cost increases.
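The model can be evaluated numerically; the values Tser = 1, Tpar = 100 and K = 0.05 below are illustrative choices of ours, not from the slides. Minimizing the denominator of S(p) shows where the saturation point sits.

```python
import math

def speedup(p, t_ser=1.0, t_par=100.0, k=0.05):
    # Flatt-Kennedy speedup with linear communication cost T0(p) = K*p
    return (t_ser + t_par) / (t_ser + t_par / p + k * p)

# Saturation point: d/dp (t_ser + t_par/p + k*p) = 0  =>  p_sat = sqrt(t_par/k)
p_sat = math.sqrt(100.0 / 0.05)   # about 44.7 for these parameters
```

Before p_sat the curve grows almost linearly; beyond it the K*p term dominates the denominator and S(p) decays towards 0, matching the limit above.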
Efficiency
Efficiency is defined as the ratio

    E(p) = S(p) / p

If S(p) is linear, then E(p) = 1. In practice,

    E(p) = 1 if p = 1
    E(p) < 1 if p > 1

N.B.: the farther the efficiency is from 1, the worse the parallel computational resources are exploited.
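The efficiency of a measured run follows directly from the relative speedup; the timings in the comments below are hypothetical.

```python
def efficiency(t1, tp, p):
    # E(p) = S(p) / p with the relative speedup S(p) = t(1) / t(p)
    return (t1 / tp) / p

# Perfect scaling:   100 s on 1 processor, 50 s on 2  ->  E = 1
# Realistic scaling: 100 s -> 30 s on 4 processors    ->  E < 1
```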
3. Optimization of parallel computational resources
Optimize the number of processors
Speedup
The optimal number of processors is the one which allows us to reach the saturation point.

Efficiency
The optimal number of processors is the one with E(p) = 1, that is p = 1.

At the saturation point the speedup is maximum, but the efficiency is low.
Function of Kuck
The function of Kuck K(p) is used to measure the efficiency of a parallelization in terms of the number of processors p:

    K(p) = S(p) E(p)

The optimal number of processors is

    p* = arg max K(p)

which yields simultaneously a good speedup and a good efficiency.
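A small numeric sketch, using the Flatt-Kennedy speedup with illustrative parameters of our choosing (Tser = 1, Tpar = 100, K = 0.05), shows that p* = arg max K(p) lies below the saturation point of the speedup: Kuck's criterion trades a little speedup for a much better efficiency.

```python
def speedup(p, t_ser=1.0, t_par=100.0, k=0.05):
    # Flatt-Kennedy speedup with communication cost T0(p) = K*p
    return (t_ser + t_par) / (t_ser + t_par / p + k * p)

def kuck(p):
    # K(p) = S(p) * E(p) = S(p)^2 / p
    return speedup(p) ** 2 / p

p_star = max(range(1, 200), key=kuck)      # optimum of Kuck's function
p_sat = max(range(1, 200), key=speedup)    # saturation point of the speedup
```

With these parameters p_star comes out well below p_sat, which is exactly the compromise between speedup and efficiency that the slide describes.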