Introduction to High Performance Computing and Optimization
Institut für Numerische Mathematik und Optimierung
Oliver Ernst
Audience: 1./3. CMS, 5./7./9. Mm, doctoral students
Wintersemester 2012/13
Contents
- 1. Introduction
- 2. Processor Architecture
- 3. Optimization of Serial Code
3.1 Performance Measurement 3.2 Optimization Guidelines 3.3 Compiler-Aided Optimization 3.4 Combine example 3.5 Further Optimization Issues
- 4. Parallel Computing
4.1 Introduction 4.2 Scalability 4.3 Parallel Architectures 4.4 Networks
- 5. OpenMP Programming
Oliver Ernst (INMO) HPC Wintersemester 2012/13 1
Parallel Computing
Introduction
Many processing units (computers, nodes, processors, cores, threads) collaborate to solve one problem concurrently. Currently, "many" means up to about 1.5 million cores (current Top500 leader). Objectives: faster execution time for one task (speedup), solution of a larger problem (scaleup), memory requirements exceeding the resources of a single computer.
Challenges for hardware designers: power, communication network, memory bandwidth, low-level synchronization (e.g. cache coherency), file system.
Challenges for the programmer: load balancing, synchronization/communication, algorithm design and redesign, software interfaces, making maximal use of the computer's resources.
Parallel Computing
Types of parallelism: Data parallelism
The scale of parallelism refers to the size of concurrently executed tasks. Fine-grain parallelism at scale of functional units of processor (ILP), individual instructions or micro-instructions. Medium-grain parallelism at scale of independent iterations of a loop (e.g. linear algebra operations on vectors, matrices, tensors). Coarse-grain parallelism refers to larger computational tasks with looser synchronization (e.g. domain decomposition methods in PDE/linear system solvers). Data parallel applications are usually implemented using an SPMD (Single Program, Multiple Data) software design, in which the same program runs on all processing units, but not in the tightly synchronized lockstep fashion of SIMD.
Parallel Computing
Types of parallelism: Functional parallelism
Concurrent execution of different tasks. Programming style known as MPMD (Multiple Program, Multiple Data). More difficult to load balance. Variants: master-slave scheme: one administrative unit distributes tasks and collects results; the remaining units receive tasks and report results to the master upon completion. Large-scale functional decomposition: large, loosely coupled tasks executed on larger computational units with looser synchronization (e.g. climate models coupling ocean and atmospheric dynamics, fluid-structure interaction codes, "multiphysics" codes).
Contents
- 4. Parallel Computing
4.1 Introduction 4.2 Scalability 4.3 Parallel Architechtures 4.4 Networks
Oliver Ernst (INMO) HPC Wintersemester 2012/13 143
Scalability
Basic considerations
T : time for 1 worker to complete the task. With N workers, the ideal completion time is T/N, giving the ideal speedup S := T/(T/N) = N (perfect linear scaling). Not all computational (or other) tasks scale in this ideal way. "Nine women can't make a baby in one month."
Fred Brooks, The Mythical Man-Month (1975)
Limiting factors:
- Not all workers receive tasks of equal complexity (or are equally fast); load imbalance.
- Some resources necessary for task completion are not available N times; serialization of concurrent execution while waiting for access.
- Extra work or waiting time due to parallel execution; overhead which is not required for serial task completion.
Scalability
Performance metrics: Strong scaling
T_f^s = s + p : serial task completion time, fixed problem size,
s : serial (non-parallelizable) portion of the task,
p : (perfectly) parallelizable portion of the task.
Solution time using N workers: T_f^p = s + p/N.
Known as strong scaling since the task size is fixed. Parallelization is used to reduce the solution time for a fixed problem.
Scalability
Performance metrics: Weak scaling
Use parallelism to solve a larger problem: assume s fixed and the parallelizable portion grows with N like N^α, α > 0 (often α = 1). Then:
T_v^s = s + pN^α : serial task completion time, variable problem size.
Solution time using N workers: T_v^p = s + pN^(α−1).
Known as weak scaling since the task size is variable. Parallelization is used to solve a larger problem.
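The two timing models above can be summarized in a short sketch (a minimal illustration; the function names are ours, not from the lecture):

```python
def time_strong(s, p, N):
    """Strong scaling: fixed problem size, T_f^p = s + p/N."""
    return s + p / N

def time_weak(s, p, N, alpha=1.0):
    """Weak scaling: problem grows like N^alpha, T_v^p = s + p*N**(alpha-1)."""
    return s + p * N ** (alpha - 1)

# With s = 0.1, p = 0.9: strong scaling shrinks the runtime toward s,
# while weak scaling with alpha = 1 keeps the runtime constant.
print(time_strong(0.1, 0.9, 9))      # 0.1 + 0.9/9 = 0.2
print(time_weak(0.1, 0.9, 9, 1.0))   # 0.1 + 0.9 = 1.0
```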
Scalability
Application speedup
Define performance := work/time, application speedup := parallel performance / serial performance.
Serial performance for fixed problem size s + p:
P_f^s = (s + p)/T_f^s = (s + p)/(s + p) = 1.
Parallel performance for fixed problem size, normalizing s + p = 1:
P_f^p = (s + p)/T_f^p = (s + p)/(s + p/N) = 1/(s + (1−s)/N).
Application speedup (fixed problem size):
S_f = P_f^p / P_f^s = 1/(s + (1−s)/N) (cf. Amdahl's Law).
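Amdahl's Law as derived above can be written as a one-liner (a minimal sketch; the function name is ours):

```python
def amdahl_speedup(s, N):
    """Amdahl's Law: S_f = 1/(s + (1-s)/N) for serial fraction s."""
    return 1.0 / (s + (1.0 - s) / N)

# The speedup saturates at 1/s: with s = 0.05 at most a factor of 20,
# and even 100 workers reach only about 16.8.
print(amdahl_speedup(0.05, 100))   # ≈ 16.8
```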
Scalability
Application speedup: different notion of “work”
Count as work only the parallelizable portion. Serial performance:
P_f^{sp} = p/T_f^s = p.
Parallel performance:
P_f^{pp} = p/T_f^p = (1 − s)/(s + (1−s)/N).
Application speedup:
S_f^p = P_f^{pp}/P_f^{sp} = 1/(s + (1−s)/N).
P_f^{pp} is no longer identical with S_f^p. Scalability doesn't change, but performance does (smaller by a factor of p).
Scalability
Application speedup: weak scaling
(How much more work can my program do in a given amount of time when I put a larger problem on N CPUs?)
Serial performance (N = 1):
P_v^s = (s + p)/T_f^s = 1.
Parallel performance:
P_v^p = T_v^s/T_v^p = (s + pN^α)/(s + pN^(α−1)) = (s + (1 − s)N^α)/(s + (1 − s)N^(α−1)) = S_v (since P_v^s = 1, parallel performance and application speedup coincide).
Recover Amdahl's Law for α = 0 (strong scaling).
Scalability
Application speedup: weak scaling
For 0 < α < 1 and large N,
S_v ≈ (s + (1 − s)N^α)/s = 1 + (p/s)N^α (linear in N^α).
Weak scaling thus allows breaking the Amdahl barrier (unlimited speedup growth, even for small α). Ideal case α = 1:
S_v = (s + (1 − s)N)/(s + 1 − s) = s + (1 − s)N, Gustafson's Law,
i.e., linear speedup, even for small N [Gustafson (1988)]. Note: a large serial fraction s leads to a small slope.
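The weak-scaling speedup S_v can be checked numerically (our own sketch, not from the lecture); α = 1 reproduces Gustafson's Law:

```python
def weak_scaling_speedup(s, N, alpha=1.0):
    """S_v = (s + (1-s)*N**alpha) / (s + (1-s)*N**(alpha-1))."""
    return (s + (1 - s) * N ** alpha) / (s + (1 - s) * N ** (alpha - 1))

# alpha = 1: Gustafson's Law S_v = s + (1-s)*N, linear in N.
print(weak_scaling_speedup(0.05, 100))   # 0.05 + 0.95*100 = 95.05
```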
Scalability
Application speedup: weak scaling, different notion of work
Again, base work only on the parallel fraction p. Once more, serial performance is
P_v^{s,p} = p,
and parallel performance
P_v^{p,p} = pN^α/(s + pN^(α−1)) = (1 − s)N^α/(s + (1 − s)N^(α−1)),
corresponding to a speedup of
S_v^p = P_v^{p,p}/P_v^{s,p} = N^α/(s + (1 − s)N^(α−1)).
Performance and speedup differ by a factor of p, but for α = 1 the speedup is now linear with slope one, in contrast to Gustafson's Law. Conclusion: look carefully at the performance metrics being applied.
Parallel Efficiency
Definition
How effectively does a parallel program utilize a given resource? (Assume the serial fraction is executed by a single worker while the others wait.) Parallel efficiency:
ε := performance on N CPUs / (N × performance on one CPU) = speedup/N.
Consider only weak scaling, as the Amdahl case is covered in the limit α → 0.
Parallel Efficiency
Weak scaling
For work = s + pN^α:
ε = S_v/N = (sN^(−α) + 1 − s)/(sN^(1−α) + 1 − s).
For α = 0 we recover the Amdahl case:
ε = 1/(sN + 1 − s) → 0 as N → ∞.
For α = 1:
ε = s/N + 1 − s.
This ranges from ε = 1 for N = 1 to the limit ε → 1 − s = p as N → ∞, i.e., efficiency is limited by the parallel fraction of the code.
Conclusions: Weak scaling allows utilization of at most a fraction p of the computing power. Wasted CPU time grows linearly with N.
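The limit ε → p for α = 1 is easy to verify numerically (a minimal sketch; the function name is ours):

```python
def efficiency(s, N, alpha=1.0):
    """Parallel efficiency eps = S_v / N for work s + p*N**alpha."""
    sv = (s + (1 - s) * N ** alpha) / (s + (1 - s) * N ** (alpha - 1))
    return sv / N

# For alpha = 1, eps = s/N + 1 - s approaches p = 1 - s as N grows:
print(efficiency(0.1, 1000))   # ≈ 0.9001, close to p = 0.9
```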
Parallel Efficiency
Weak scaling, different notion of work
For work = pN^α:
ε^p = S_v^p/N = N^(α−1)/(s + (1 − s)N^(α−1)).
In this case, for α = 1 we obtain ε^p = 1, i.e., perfect efficiency. For a large serial fraction s, this obscures the fact that most of the computational capacity of the computer remains unused.
Example: p = 0.1, s = 0.9, α = 1. Weak scaling: efficiency ε → 0.1 as N → ∞, while ε^p ≡ 1 for all N. However, all processors except one are idle 90% of the time.
Measuring Strong Scalability
Approach to ascertain the scaling properties of a code on a given architecture: measure performance on a small number of processors and determine the model parameters using a least-squares fit. Example (Hager & Wellein, Sec. 5.3.5):
[Figure: relative performance vs. number of cores (2–12) for the same code on two systems, with Amdahl fits s = 0.168 (Architecture 1) and s = 0.086 (Architecture 2).]
Least-squares fit of the serial fraction s in (strong scaling)
S_f = 1/(s + (1−s)/N)
for the same code on two different systems (measurements normalized to single-core values). The parallel part is compute-bound, the serial part memory-bound. System 2 has larger memory bandwidth.
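The fit of s can be done with a few lines of code (our own minimal sketch, not the lecture's implementation): rearranging 1/S_f − 1/N = s(1 − 1/N) turns it into a one-parameter linear least-squares problem.

```python
def fit_serial_fraction(cores, speedups):
    """Least-squares estimate of s in Amdahl's Law S_f = 1/(s + (1-s)/N).
    Uses the rearrangement 1/S - 1/N = s*(1 - 1/N)."""
    num = sum((1/S - 1/N) * (1 - 1/N) for N, S in zip(cores, speedups))
    den = sum((1 - 1/N) ** 2 for N in cores)
    return num / den

# Synthetic measurements generated from s = 0.1 are recovered exactly:
cores = [2, 4, 8, 12]
speedups = [1 / (0.1 + 0.9 / N) for N in cores]
print(round(fit_serial_fraction(cores, speedups), 6))   # 0.1
```

With real measurements the recovered s also absorbs load imbalance and overhead, so it should be read as an effective, not a literal, serial fraction.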
Accelerate Serial or Parallel Part?
Assume the serial part can be accelerated by a factor ξ > 1. Amdahl's Law says the parallel performance becomes
P_f^{s,ξ} = 1/(s/ξ + (1−s)/N).
Optimizing instead the parallel part (by the same factor) yields
P_f^{p,ξ} = 1/(s + (1−s)/(ξN)).
When does accelerating the serial part pay off more? Crossover point:
P_f^{s,ξ}/P_f^{p,ξ} = (ξs + (1−s)/N)/(s + ξ(1−s)/N) ≥ 1 ⇔ N ≥ 1/s − 1.
The crossover point is independent of ξ and corresponds to the value of N for which half of the asymptotic efficiency is achieved in Amdahl's Law.
Accelerate Serial or Parallel Part?
Amdahl's Law used with strong scaling gives a parallel efficiency of
ε = S_f/N = 1/(sN + 1 − s).
For the crossover value N = 1/s − 1 this is
ε* = 1/(2(1 − s)),
which, for s ≪ 1, is already close to 0.5. Further increase of N thus results in diminishing returns; hence one should try to accelerate the parallel part first.
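The crossover at N = 1/s − 1 can be verified directly (a sketch under the slide's assumptions; the function names are ours):

```python
def perf_serial_opt(s, N, xi):
    """Performance when the serial part is accelerated by factor xi."""
    return 1.0 / (s / xi + (1 - s) / N)

def perf_parallel_opt(s, N, xi):
    """Performance when the parallel part is accelerated by factor xi."""
    return 1.0 / (s + (1 - s) / (xi * N))

s, xi = 0.1, 3.0
N_cross = 1 / s - 1   # = 9, independent of xi
print(perf_serial_opt(s, 9, xi) >= perf_parallel_opt(s, 9, xi))   # True
print(perf_serial_opt(s, 4, xi) <  perf_parallel_opt(s, 4, xi))   # True
```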
Refined Performance Models
Models like Amdahl's and Gustafson's Laws can be refined to account for communication, load imbalance, parallel startup overhead, etc. Here we consider a simple model accounting for communication. Assume communication is not overlapped with computation (often true) and introduce a correction term:
T_v^{p,c} = s + pN^(α−1) + c_α(N).
Communication overhead does not count as work. Thus the parallel speedup is
S_v^c = (s + pN^α)/T_v^{p,c} = (s + (1 − s)N^α)/(s + (1 − s)N^(α−1) + c_α(N)).
To model c_α(N), note the basic message transfer time λ + κ, where λ is the channel latency and κ = n/B the streaming time for a message of size n at bandwidth B. We now compare different network models.
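The refined model is easy to evaluate for an arbitrary overhead function c_α(N) (our own sketch; parameter values below are the ones used later on the figure slide):

```python
def speedup_comm(s, N, alpha, c):
    """S_v^c = (s + (1-s)*N**alpha) / (s + (1-s)*N**(alpha-1) + c(N)),
    where c(N) models the communication overhead."""
    p = 1 - s
    return (s + p * N ** alpha) / (s + p * N ** (alpha - 1) + c(N))

kappa, lam = 0.005, 0.001

# Blocking network (alpha = 0): overhead (kappa + lam)*N grows linearly,
# so for large N the "parallel" run is slower than the serial one.
blocking = lambda N: (kappa + lam) * N
print(speedup_comm(0.05, 1000, 0.0, blocking) < 1)   # True
```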
Refined Performance Models
α = 0, blocking network
For bus-like communication, only one message can be transmitted at a time. Thus the communication overhead per CPU is independent of N, and c_α(N) = (κ + λ)N.
S_v^c = 1/(s + (1−s)/N + (κ + λ)N) ≈ 1/((κ + λ)N), for N ≫ 1.
Performance is dominated by communication and goes to zero for large N. The same situation arises when there is contention for shared resources (memory channels, I/O channels, functional units on a CPU), i.e., serialization.
Refined Performance Models
α = 0, non-blocking network, constant communication cost
Assume the communication network can sustain N/2 concurrent messages without collisions and that the message size is independent of N. Then c_α(N) = κ + λ and
S_v^c = 1/(s + (1−s)/N + κ + λ) ≈ 1/(s + κ + λ), for N ≫ 1.
The situation is similar to Amdahl's Law with s replaced by s + κ + λ. For large N, performance saturates at a lower value due to communication overhead.
Refined Performance Models
α = 0, non-blocking network, surface-to-volume communication
Such a message size arises e.g. in domain decomposition for PDEs, where the message size is proportional to the subdomain surface area and the work proportional to its volume. Here c_α(N) = κN^(−β) + λ, β > 0: the communication overhead decreases with N (strong scaling).
S_v^c = 1/(s + (1−s)/N + κN^(−β) + λ) ≈ 1/(s + λ), for N ≫ 1.
For large N, performance is dominated by s and the latency λ.
Refined Performance Models
α = 1, non-blocking network, surface-to-volume communication
In this case the communication overhead per CPU is independent of N. As in the preceding model, c_α(N) = κN^(−β) + λ.
S_v^c = (s + pN)/(s + p + κ + λ) = (s + (1 − s)N)/(1 + κ + λ).
The following figure illustrates this comparison, plotting S_v^c against N for the parameter values N = 1, …, 1000, s = 0.05, κ = 0.005, λ = 0.001, β = 2/3.
Refined Performance Models
Illustration
[Figure: S_v^c vs. N on a log-log scale (N up to 1000), comparing Amdahl's Law with the four communication models: α = 0, blocking; α = 0, non-blocking; α = 0, 3D surface-to-volume; α = 1, 3D surface-to-volume.]
Parallel Architectures
Components
Processor, Memory, Communication Network.
[Diagram: processors (Prozessor) connected through a communication network to memory modules (Speicher).]
Parallel Architectures
Main Distinction
Shared Memory
[Diagram: processors P1, …, Pn, each with its own cache, attached to a common memory via a bus/switch.]
Scales to ≈ 500 processors. Programming models: OpenMP, POSIX Threads (Pthreads).
Message Passing
[Diagram: processors P1, …, Pn, each with its own cache and local memory, connected by an interconnect.]
Scales arbitrarily. Programming model: MPI (message passing).
In real life: hybrids.
Parallel Architectures
More classifications:
- Single/Multiple Instruction/Data [Flynn (1966)]
- SPMD, MPMD
- Shared memory vs. distributed memory
- UMA vs. ccNUMA
- Communication network type
Parallel Architectures
Shared-memory computers
Common shared physical address space for several processors.
Uniform Memory Access (UMA) systems: latency and bandwidth are the same for all processors and all memory locations ("flat" memory model).
Cache-coherent Nonuniform Memory Access (ccNUMA) systems: memory is physically distributed but logically shared. The network logic gives the appearance of a single address space for the total system memory; however, access times differ for local and remote memory locations.
Locality problem: fundamental performance limitation if remote accesses can't be restricted.
Contention problem: simultaneous access to the same memory location by several processors.
Both shared-memory variants require cache-coherency mechanisms: a single cache line may reside in multiple caches at any given time. When any of these copies is modified, the change must be propagated to all copies in order to maintain a consistent view of memory for all processors.
Parallel Architectures
Cache-coherency
Example: Suppose two memory locations A1 and A2 belong to the same cache line. Processor P1 loads cache line into its cache. Processor P2 loads cache line into its cache. Processor P1 modifies A1. Processor P2 modifies A2. Cache line evicted from P1’s cache. Which version should be written to memory?
Parallel Architectures
Cache-coherency: the MESI protocol
Assign to each cache line one of four possible states:
M (modified): the cache line has been modified and resides in no other cache than this one; upon eviction, memory will reflect this cache line's most current state.
E (exclusive): the cache line has been read from memory but not yet modified; it resides in no other cache.
S (shared): the cache line has been read from memory and not yet modified, with additional copies possibly residing in other caches.
I (invalid): the cache line does not represent valid data. This typically occurs when another processor has requested exclusive access while this cache line was in the shared state.
When a cache requests exclusive ownership of a cache line, copies in S or E state must be notified and copies in M state must be written to memory. The requesting cache broadcasts the request over the bus; the remaining caches "snoop" the bus and act accordingly when holding the affected cache line.
Parallel Architectures
Cache-coherency: the MESI protocol
Back to the example: assume cache line L contains memory locations A1 and A2 and resides in the caches C1 and C2 of processors P1 and P2. Sequence of events:
- 1. C1 requests exclusive ownership of L.
- 2. In C2, the state of L is set to I.
- 3. In C1, the state of L is set to E; A1 is modified; the state of L is set to M.
- 4. C2 requests exclusive ownership of L.
- 5. L is evicted from C1, its state set to I.
- 6. L is loaded into C2, its state set to E.
- 7. A2 in L is modified and the state set to M.
[Diagram: processors P1 and P2 with caches C1 and C2 holding copies of A1 and A2, plus main memory; numbered arrows 1–7 mark the events. Source: Hager & Wellein]
Limitations: bus snooping consumes bandwidth and does not scale. Alternative: a directory-based system monitors the state of all cache lines and transmits state changes only to the affected caches (employed in ccNUMA systems).
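The seven-step sequence can be traced with a toy state tracker (our own sketch; a real MESI implementation handles many more transitions, e.g. the S state and read snoops):

```python
class Cache:
    """Holds the MESI state of the example line L in one cache."""
    def __init__(self, name):
        self.name, self.state = name, "I"   # initially no valid copy

def request_exclusive(owner, others):
    """A cache acquires exclusive ownership of the line; other copies
    are invalidated (an M-state copy would be written back first)."""
    for c in others:
        c.state = "I"
    owner.state = "E"

def modify(owner):
    """Write to the line; only legal once the line is owned."""
    assert owner.state in ("E", "M"), "must own line before writing"
    owner.state = "M"

c1, c2 = Cache("C1"), Cache("C2")
request_exclusive(c1, [c2])   # steps 1-2
modify(c1)                    # step 3
request_exclusive(c2, [c1])   # steps 4-6
modify(c2)                    # step 7
print(c1.state, c2.state)     # I M
```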
Parallel Architectures
Examples of UMA systems
Two single-core CPUs in separate sockets sharing a common frontside bus (FSB); a single shared path to memory. The chipset ("northbridge") is responsible for driving the memory modules, I/O connections, and arbitration. An outdated design.
[Diagram: two processors P, each with L1D and L2 caches, connected via a common FSB to the chipset and memory. Source: Hager & Wellein]
Parallel Architectures
Examples of UMA systems
Two dual-core chips, each with an FSB connected separately to the chipset. The chipset-to-memory bandwidth can be designed to match the combined bandwidth of both FSBs. There is an anisotropy in the distance between cores depending on socket membership.
[Diagram: two sockets, each holding a dual-core chip (cores with L1D caches sharing an L2 cache), each connected by its own FSB to the chipset and memory. Source: Hager & Wellein]
Limitation: bandwidth bottlenecks arise as the number of sockets/FSBs grows beyond certain limits. Crossbar switches avoid the bandwidth bottleneck but are expensive.
Parallel Architectures
Examples of ccNUMA systems
Locality domain (LD): a collection of processor cores with locally connected memory. A coherent interconnect links multiple LDs; the system runs a single OS instance.
[Diagram: two locality domains, each a quad-core chip (cores with L1D and L2 caches sharing an L3 cache) with its own memory interface and local memory, linked by a coherent interconnect. Source: Hager & Wellein]
Two LDs, each consisting of one quad-core chip with separate caches and a common interface to local memory; the LDs are linked by a high-speed connection. Implementations: QuickPath (Intel), HyperTransport (AMD et al.). The intersocket link can mediate direct, cache-coherent memory access. All required protocols are provided by hardware.
Parallel Architectures
Examples of ccNUMA systems
[Diagram: ccNUMA system with four locality domains; each socket (cores with L1D and L2 caches sharing an L3 cache, plus local memory) is attached to a communication interface (S), with remote traffic switched by NUMALink (NL) routers (R). Source: Hager & Wellein]
ccNUMA system with 4 LDs as used in Intel-based SGI Altix systems; scales to thousands of processors. Each socket is connected to a communication interface (S), providing access to memory as well as to the proprietary NUMALink (NL) network. Network connections for remote access are switched by routers (R). Adding LDs to the network becomes increasingly expensive and adds layers of latency.
Parallel Architectures
Distributed memory model
[Diagram: processors P, each with cache C, exclusive local memory M, and network interface NI, connected by a communication network. Source: Hager & Wellein]
Programming model only; the pure design is obsolete as a practical system. Each processor (P) is connected to exclusive local memory, with at least one network interface (NI) per node. A separate process runs on each node; processes communicate explicitly using the communication network. No Remote Memory Access (NORMA) system. The programming paradigm is more complex, but it scales: all current supercomputers are based on this model.
Parallel Architectures
Hybrid model
[Diagram: shared-memory nodes (pairs of processors P sharing memory), grouped behind network interfaces and connected by a communication network. Source: Hager & Wellein]
Shared-memory building blocks connected by a fast network. The network adds another level of complexity beyond the cache hierarchy, ccNUMA nodes, etc. A mixture of programming models is possible on the same system. "Hybrid" also refers to systems containing a mixture of programming paradigms on different hardware layers: GPUs (graphics processing units), FPGAs (field-programmable gate arrays), ASICs (application-specific integrated circuits), co-processors, etc.
Networks
Basic performance characteristics
Large variety of topologies, technologies, and protocols. Simplest and least expensive: Gigabit Ethernet, but too slow for parallel applications requiring significant communication. Current standard for small clusters: InfiniBand.
Basic model for a point-to-point connection: the transfer time T for a message of size N bytes is the sum
T = Tℓ + N/B, Tℓ : latency, B : asymptotic bandwidth [MBytes/sec].
The latency generally depends on N (vs. buffer or cache size). Effective bandwidth:
Beff = N/(Tℓ + N/B). (5.1)
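Model (5.1) can be evaluated directly (a minimal sketch; the Gigabit-Ethernet-like parameters below anticipate the fitted values on the next slide):

```python
def b_eff(N, T_l, B):
    """Effective bandwidth (5.1): Beff = N / (T_l + N/B).
    N in bytes, T_l in seconds, B in bytes/sec."""
    return N / (T_l + N / B)

# T_l = 76 microseconds, B = 111 MBytes/sec:
T_l, B = 76e-6, 111e6
print(b_eff(1e3, T_l, B) / 1e6)   # ≈ 11.8 MBytes/sec: latency-dominated
print(b_eff(1e6, T_l, B) / 1e6)   # ≈ 110 MBytes/sec: near asymptotic B
```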
Networks
Ping-pong benchmark
To measure latency and effective bandwidth, send a message of size N bytes back and forth between two processes running on different processors. (Belongs to the Intel MPI Benchmarks suite.) Pseudocode:

myID = get_process_ID()
if (myID .eq. 0) then
   targetID = 1
   S = get_walltime()
   call Send_message(buffer, N, targetID)
   call Receive_message(buffer, N, targetID)
   E = get_walltime()
   MBYTES = 2*N/(E-S)/1.d6   ! MBytes/sec rate
   TIME = (E-S)/2*1.d6       ! transfer time in microsecs
                             ! for a single message
else
   targetID = 0
   call Receive_message(buffer, N, targetID)
   call Send_message(buffer, N, targetID)
endif
Networks
Ping-pong benchmark measurements: Gigabit Ethernet
Ping-pong benchmark measurements of Beff on a Gigabit Ethernet network and least-squares fit of the parameters Tℓ and B from (5.1).
[Figure: Beff [MBytes/sec] vs. message size N from 10 to 10^6 bytes; model fit (Tℓ = 76 µs, B = 111 MBytes/sec) vs. measured (GE). Inset: directly measured latency (≈ 42–48 µs) for N = 1–100 bytes, with N1/2 marked. Source: Hager & Wellein]
Networks
Ping-pong benchmark measurements: Gigabit Ethernet
Observations: Model (5.1) explains the observations well. For small message sizes, bandwidth is low and the transfer time is dominated by latency. For large message sizes, Beff saturates and latency becomes insignificant. Tℓ is not accurately reproduced by the fit; a direct measurement of T is shown in the inset.
Sources of latency: administrative overhead (message headers etc.); minimum message sizes for some protocols (TCP/IP), so a small frame of N > 1 bytes is sent no matter how small the message; many layers of software involved in initiating a message transmission; commodity PC hardware not optimized for low-latency I/O. Low latency is achieved by lightweight protocols, optimized drivers, and communication devices attached directly to processor buses.
Networks
Ping-pong benchmark measurements: Infiniband
Ping-pong benchmark measurements of Beff on DDR Infiniband network and least-squares fit of parameters Tℓ and B from (5.1) (separately for small and large message regimes).
[Figure: log-log plot of Beff [MBytes/sec] over N = 10^1 ... 10^6 bytes, measured (IB), with separate fits: small messages Tℓ = 4.14 µs, B = 827 MBytes/sec; large messages Tℓ = 20.8 µs, B = 1320 MBytes/sec; combined Tℓ = 4.14 µs, B = 1320 MBytes/sec]
source: Hager & Wellein
Networks
Ping-pong benchmark measurements: Infiniband
Observations/Remarks:
- A single model does a bad job of fitting the entire message-size regime. Possible reason: message-passing or protocol-layer software may use different buffering algorithms for messages of different sizes, e.g., splitting messages too large to fit into buffers.
- Ideal: limiting network bandwidth comparable to internode bandwidth.
- Many applications operate in latency-dominated regions of the bandwidth graph. Quantified by N1/2, the message size at which Beff = B/2. For model (5.1) we obtain N1/2 = B·Tℓ.
- Question: is an increase in the maximum network bandwidth by a factor β beneficial for all messages? Answer: at message size N,
Beff(βB, Tℓ) / Beff(B, Tℓ) = (1 + N/N1/2) / (1 + N/(βN1/2)).
Gain of only 33% for β = 2 at N = N1/2. The same holds for a reduction of Tℓ by a factor β.
⇒ Need to reduce both for increased performance across the entire range.
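The 33% figure can be checked numerically with a small sketch of model (5.1); B and Tℓ are set to the Gigabit Ethernet fit values purely for illustration:

```python
# Gain in effective bandwidth when the maximum network bandwidth B is
# raised by a factor beta, from model (5.1): B_eff(N) = N / (T_l + N/B).
# The gain depends on N only through N / N_half, where N_half = B * T_l.

def beff(N, B, T_l):
    return N / (T_l + N / B)

def gain(N, B, T_l, beta):
    return beff(N, beta * B, T_l) / beff(N, B, T_l)

B, T_l = 111e6, 76e-6          # Gigabit-Ethernet-like fit values
N_half = B * T_l               # message size where B_eff = B/2

print(gain(N_half, B, T_l, 2.0))        # 4/3: only 33% gain at N = N_half
print(gain(100 * N_half, B, T_l, 2.0))  # close to 2 for very large messages
print(gain(0.01 * N_half, B, T_l, 2.0)) # close to 1 when latency dominates
```

The same ratios show why reducing only one of B or Tℓ helps only at one end of the message-size range.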
Networks
Bisection bandwidth
Issue: Even if individual point-to-point connections have high effective bandwidth, this does not describe global saturation effects, i.e., the deviation of the global aggregate bandwidth from the ideal nonblocking property when all nodes are simultaneously transmitting or receiving data.
Bisection bandwidth Bb: sum of the effective bandwidths of the minimal number of connections cut when partitioning the system into two parts of equal size.
More meaningful for hybrid parallel architectures: available bandwidth per core, i.e., bisection bandwidth divided by the overall number of compute cores.
source: Hager & Wellein
Networks
Bus
Shared medium, can be used by only one communicating device at a time. Collision detection mechanism necessary. Very common in computer systems. Easy to implement, low latency.
source: Hager & Wellein
Most important shortcoming: the bus is blocking, i.e., all devices share a constant bandwidth. Susceptible to failure, as a local problem can easily influence global operation. In HPC systems limited to communication at the processor or socket level, or to diagnostic networks.
Networks
Switched networks
A switched network divides all communicating devices into groups. Devices in one group are connected to a switch in starlike fashion. Switches are connected to each other in additional switch layers. The distance between two communicating devices varies according to the number of “hops” a message has to travel from source to destination.
Diameter of network: maximal number of hops between two devices. (1 for a bus.)
Single switch: can either support fully nonblocking operation, i.e., all pairs of ports can use their full bandwidth concurrently, or can have a bus-like design with limited bandwidth.
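The diameter numbers can be illustrated with a toy calculation; hop-counting conventions vary, and the convention below (each traversed switch counts as one hop) is an illustrative assumption, not a definition from the lecture:

```python
# Network diameter = maximal number of "hops" between two devices,
# counting each traversed switch as one hop (illustrative convention).

def bus_diameter():
    return 1  # shared medium: every pair communicates in one hop

def tree_diameter(levels):
    """Tree with `levels` switch layers: worst case goes up to the top
    layer and back down, traversing 2*levels - 1 switches."""
    return 2 * levels - 1

print(bus_diameter())    # 1
print(tree_diameter(1))  # 1: a single (crossbar) switch
print(tree_diameter(2))  # 3: leaf -> spine -> leaf
```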
Networks
Crossbar switch
[Figure: crossbar switch with four IN and four OUT ports]
source: Hager & Wellein
Fully nonblocking switch. Circles represent possible connections between pairs of devices; each is implemented as a 2 × 2 switching element.
Networks
Fat trees
Crossbar switches can be combined to form a fat-tree network. Design choice: keep the nonblocking property across the entire network, or thin out bandwidth closer to the root. Latter choice: bisection bandwidth per compute element is less than half the leaf switch bandwidth per port. The network must be capable of routing traffic via unused switches. Maximum latency depends only on the number of tree levels. If a group of workers uses a single leaf switch, they may obtain fully nonblocking communication. Bottlenecks are still possible with static routing.
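The effect of thinning out bandwidth toward the root can be quantified with a small helper (a hypothetical sketch, not from the lecture): each leaf switch has n_down ports to compute nodes and n_up uplinks toward the spine, all at per-port bandwidth B_port, and the bisection is assumed to cut the uplinks of half the leaf switches.

```python
# Bisection bandwidth of a two-layer fat tree with optional
# oversubscription of the leaf-to-spine links (hypothetical model).

def fat_tree_bisection(n_leaves, n_down, n_up, B_port):
    """Return (B_b, B_b per compute node), assuming the bisecting cut
    severs the spine uplinks of half the leaf switches."""
    nodes = n_leaves * n_down
    Bb = (n_leaves // 2) * n_up * B_port  # links crossing the cut
    return Bb, Bb / nodes

# Fully nonblocking (1:1): as many uplinks as downlinks per leaf
Bb, per_node = fat_tree_bisection(n_leaves=4, n_down=8, n_up=8, B_port=1e9)
print(per_node / 1e9)   # 0.5 GB/s per node: half the port bandwidth

# 1:4 oversubscription: only 2 uplinks per leaf
Bb, per_node = fat_tree_bisection(n_leaves=4, n_down=8, n_up=2, B_port=1e9)
print(per_node / 1e9)   # 0.125 GB/s per node
```

In the fully nonblocking 1:1 case each node gets half the port bandwidth across the bisection; oversubscription reduces the per-node share by the oversubscription factor.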
Networks
Fat trees
[Figure: leaf switches SW 1-SW 4 below, spine switches SW A and SW B above]
Figure 4.15: A fully nonblocking full-bandwidth fat-tree network with two switch layers. The switches connected to the actual compute elements are called leaf switches, whereas the upper layers form the spines of the hierarchy.
source: Hager & Wellein
Networks
Fat trees
Figure 4.16: A fat-tree network with a bottleneck due to “1:3 oversubscription” of communication links to the spine. By using a single spine switch, the bisection bandwidth is cut in half as compared to the layout in Figure 4.15 because only four nonblocking pairs of connections are possible. Bisection bandwidth per compute element is even lower.
source: Hager & Wellein
Networks
Fat trees
Figure 4.17: Even in a fully nonblocking fat-tree switch hierarchy (network cabling shown as solid lines), not all possible combinations of N/2 point-to-point connections allow collision-free operation under static routing. When, starting from the collision-free connection pattern shown with dashed lines, the connections 2↔6 and 3↔7 are changed to 2↔7 and 3↔6, respectively (dotted-dashed lines), collisions occur, e.g., on the highlighted links if connections 1↔5 and 4↔8 are not re-routed at the same time.
source: Hager & Wellein
Networks
Mesh
Multi-dimensional hypercube; each node sits at a Cartesian grid intersection. Routing performed by ASICs. Bb(N) ∼ N^((d−1)/d) ⇒ Bb(N)/N → 0 for N large. Popular for large systems where fat trees are too expensive. IBM Blue Gene and Cray XT use mesh networks.
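The Bb(N) ∼ N^((d−1)/d) scaling can be sketched numerically, assuming a d-dimensional torus with n = N^(1/d) nodes per dimension, unit link bandwidth, and a bisecting cut that severs 2·n^(d−1) links (the factor 2 accounting for the wraparound links):

```python
# Bisection bandwidth of a d-dimensional torus with N nodes and
# per-link bandwidth B_link: the cut through the middle severs
# 2 * n**(d-1) links, so B_b(N) ~ N**((d-1)/d).

def torus_bisection(N, d, B_link=1.0):
    n = round(N ** (1.0 / d))        # nodes per dimension
    return 2 * n ** (d - 1) * B_link

for N in (64, 4096):
    Bb = torus_bisection(N, d=2)
    print(N, Bb, Bb / N)             # per-node share shrinks as N grows
```

For d = 2 this reproduces the ∼ √N behavior of the 2D torus; the per-node share Bb/N tends to zero however large d is, just more slowly for higher dimensions.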
source: Hager & Wellein
2D torus mesh network. Bisection bandwidth ∼ √N.
Networks
Mesh
[Figure: four sockets, each with two processors P and local memory, connected via HT links; two HT links attached to I/O]
source: Hager & Wellein
Four-socket ccNUMA system with a HyperTransport-based mesh network connecting 4 LDs. Each socket has 3 HT links, so the network has to be heterogeneous w.r.t. intersocket latency to accommodate I/O connections and still utilize all HT ports. (2 HT connections are used for I/O.)