Introduction to High Performance Computing and Optimization (Oliver Ernst)

slide-1
SLIDE 1

Institut für Numerische Mathematik und Optimierung

Introduction to High Performance Computing and Optimization

Oliver Ernst

Audience: 1./3. CMS, 5./7./9. Mm, doctoral students. Wintersemester 2012/13.

slide-2
SLIDE 2

Contents

  • 1. Introduction
  • 2. Processor Architecture
  • 3. Optimization of Serial Code

    3.1 Performance Measurement
    3.2 Optimization Guidelines
    3.3 Compiler-Aided Optimization
    3.4 Combine Example
    3.5 Further Optimization Issues

  • 4. Parallel Computing

    4.1 Introduction
    4.2 Scalability
    4.3 Parallel Architectures
    4.4 Networks

  • 5. OpenMP Programming

Oliver Ernst (INMO) HPC Wintersemester 2012/13 1

slide-3
SLIDE 3

Contents

  • 1. Introduction
  • 2. Processor Architecture
  • 3. Optimization of Serial Code
  • 4. Parallel Computing

    4.1 Introduction
    4.2 Scalability
    4.3 Parallel Architectures
    4.4 Networks

  • 5. OpenMP Programming

Oliver Ernst (INMO) HPC Wintersemester 2012/13 138

slide-4
SLIDE 4

Contents

  • 4. Parallel Computing

    4.1 Introduction
    4.2 Scalability
    4.3 Parallel Architectures
    4.4 Networks

Oliver Ernst (INMO) HPC Wintersemester 2012/13 139

slide-5
SLIDE 5

Parallel Computing

Introduction

Many processing units (computers, nodes, processors, cores, threads) collaborate to solve one problem concurrently. Currently, "many" means up to 1.5 million (current Top500 leader).

Objectives: faster execution time for one task (speedup); solution of a larger problem (scaleup); memory requirements exceeding the resources of a single computer.

Challenges for hardware designers:
  • Power
  • Communication network
  • Memory bandwidth
  • Low-level synchronization (e.g. cache coherency)
  • File system

Challenges for the programmer:
  • Load balancing
  • Synchronization/communication
  • Algorithm design and redesign
  • Software interface
  • Making maximal use of the computer's resources

Oliver Ernst (INMO) HPC Wintersemester 2012/13 140

slide-6
SLIDE 6

Parallel Computing

Types of parallelism: Data parallelism

The scale of parallelism refers to the size of the concurrently executed tasks.
  • Fine-grain parallelism: at the scale of the functional units of a processor (ILP), individual instructions or micro-instructions.
  • Medium-grain parallelism: at the scale of independent iterations of a loop (e.g. linear algebra operations on vectors, matrices, tensors).
  • Coarse-grain parallelism: larger computational tasks with looser synchronization (e.g. domain decomposition methods in PDE/linear system solvers).
Data-parallel applications are usually implemented using an SPMD (Single Program, Multiple Data) software design, in which the same program runs on all processing units, but not in the tightly synchronized lockstep fashion of SIMD.
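As an illustration of medium-grain data parallelism, the following minimal OpenMP sketch (not from the slides; it assumes a C compiler with OpenMP support, e.g. gcc -fopenmp) distributes the independent iterations of a vector update across threads:

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void) {
        static double x[N], y[N];

        for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

        /* medium-grain data parallelism: independent loop iterations
           are distributed across the threads of the team */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            y[i] += 2.5 * x[i];   /* daxpy-like update */

        printf("y[0] = %f (max threads: %d)\n", y[0], omp_get_max_threads());
        return 0;
    }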

Oliver Ernst (INMO) HPC Wintersemester 2012/13 141

slide-7
SLIDE 7

Parallel Computing

Types of parallelism: Functional parallelism

Concurrent execution of different tasks. Programming style known as MPMD (Multiple Program, Multiple Data). More difficult to load balance. Variants:
  • Master-slave scheme: one administrative unit distributes tasks and collects results; the remaining units receive tasks and report results to the master upon completion.
  • Large-scale functional decomposition: large, loosely coupled tasks executed on larger computational units with looser synchronization (e.g. climate models coupling ocean and atmospheric dynamics, fluid-structure interaction codes, "multiphysics" codes).

Oliver Ernst (INMO) HPC Wintersemester 2012/13 142

slide-8
SLIDE 8

Contents

  • 4. Parallel Computing

    4.1 Introduction
    4.2 Scalability
    4.3 Parallel Architectures
    4.4 Networks

Oliver Ernst (INMO) HPC Wintersemester 2012/13 143

slide-9
SLIDE 9

Scalability

Basic considerations

$T$: time for 1 worker to complete the task. With $N$ workers the ideal completion time is $T/N$, i.e. the ideal speedup is $S := \frac{T}{T/N} = N$ (perfect linear scaling). Not all computational (or other) tasks scale in this ideal way.

"Nine women can't make a baby in one month."

Fred Brooks, The Mythical Man-Month (1975)

Limiting factors:
  • Load imbalance: not all workers receive tasks of equal complexity (or they are not equally fast).
  • Serialization: some resources necessary for task completion are not available $N$ times; concurrent execution waits for access.
  • Overhead: extra work or waiting time due to parallel execution which is not required for serial task completion.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 144


slide-12
SLIDE 12

Scalability

Performance metrics: Strong scaling

$T^s_f = s + p$ : serial task completion time, fixed problem size
$s$ : serial (non-parallelizable) portion of the task
$p$ : (perfectly) parallelizable portion of the task

Solution time using $N$ workers: $T^p_f = s + \frac{p}{N}$

Known as strong scaling since the task size is fixed. Parallelization is used to reduce the solution time for a fixed problem.
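A quick worked example (not from the slides; normalize $s + p = 1$ and assume $s = 0.1$, $N = 10$):

    T^s_f = s + p = 1
    T^p_f = s + p/N = 0.1 + 0.9/10 = 0.19

so ten workers finish the fixed task about $1/0.19 \approx 5.3$ times faster than one, well short of the ideal factor of 10.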

Oliver Ernst (INMO) HPC Wintersemester 2012/13 145


slide-15
SLIDE 15

Scalability

Performance metrics: Weak scaling

Use parallelism to solve a larger problem: assume $s$ fixed and the parallelizable portion grows with $N$ like $N^\alpha$, $\alpha > 0$ (often $\alpha = 1$). Then:

$T^s_v = s + pN^\alpha$ : serial task completion time, variable problem size

Solution time using $N$ workers: $T^p_v = s + pN^{\alpha-1}$

Known as weak scaling since the task size is variable. Parallelization is used to solve a larger problem.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 146


slide-18
SLIDE 18

Scalability

Application speedup

Define performance := work/time and application speedup := parallel performance / serial performance.

Serial performance for fixed problem size $s + p$:
$P^s_f = \frac{s+p}{T^s_f} = \frac{s+p}{s+p} = 1$.

Parallel performance for fixed problem size (normalize $s + p = 1$):
$P^p_f = \frac{s+p}{T^p_f} = \frac{s+p}{s+p/N} = \frac{1}{s + \frac{1-s}{N}}$.

Application speedup (fixed problem size):
$S_f = \frac{P^p_f}{P^s_f} = \frac{1}{s + \frac{1-s}{N}}$ (cf. Amdahl's Law).
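The following small C sketch (illustrative only, not part of the lecture material) tabulates the Amdahl speedup $S_f(N) = 1/(s + (1-s)/N)$ for an assumed serial fraction:

    #include <stdio.h>

    /* Amdahl speedup for serial fraction s and N workers */
    static double amdahl(double s, int N) {
        return 1.0 / (s + (1.0 - s) / N);
    }

    int main(void) {
        const double s = 0.05;                        /* assumed serial fraction */
        const int    Ns[] = {1, 2, 4, 8, 16, 64, 1024};

        printf("serial fraction s = %.2f\n", s);
        for (int i = 0; i < 7; i++)
            printf("N = %5d   S_f = %6.2f\n", Ns[i], amdahl(s, Ns[i]));
        /* the speedup saturates at 1/s = 20 as N grows */
        return 0;
    }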

Oliver Ernst (INMO) HPC Wintersemester 2012/13 147

slide-19
SLIDE 19

Scalability

Application speedup: different notion of “work”

Count as work only the parallelizable portion.

Serial performance: $P^{sp}_f = \frac{p}{T^s_f} = p$.

Parallel performance: $P^{pp}_f = \frac{p}{T^p_f} = \frac{1-s}{s + \frac{1-s}{N}}$.

Application speedup: $S^p_f = \frac{P^{pp}_f}{P^{sp}_f} = \frac{1}{s + \frac{1-s}{N}}$.

$P^{pp}_f$ is no longer identical with $S^p_f$. Scalability doesn't change, but performance does (smaller by a factor of $p$).

Oliver Ernst (INMO) HPC Wintersemester 2012/13 148

slide-20
SLIDE 20

Scalability

Application speedup: weak scaling

(How much more work can my program do in a given amount of time when I put a larger problem on $N$ CPUs?)

Serial performance ($N = 1$): $P^s_v = \frac{s+p}{T^s_f} = 1$.

Parallel performance:
$P^p_v = \frac{T^s_v}{T^p_v} = \frac{s + pN^\alpha}{s + pN^{\alpha-1}} = \frac{s + (1-s)N^\alpha}{s + (1-s)N^{\alpha-1}}$ $(= S_v$ since $P^s_v = 1$, i.e. the same as the application speedup).

Recover Amdahl's law for $\alpha = 0$ (strong scaling).

Oliver Ernst (INMO) HPC Wintersemester 2012/13 149

slide-21
SLIDE 21

Scalability

Application speedup: weak scaling

For $0 < \alpha < 1$ and $N \gg 1$:
$S_v \approx \frac{s + (1-s)N^\alpha}{s} = 1 + \frac{p}{s}N^\alpha$ (linear in $N^\alpha$).
Weak scaling thus allows one to break the Amdahl barrier (unlimited performance), even for small $\alpha$.

Ideal case $\alpha = 1$:
$S_v = \frac{s + (1-s)N}{s + 1 - s} = s + (1-s)N$ (Gustafson's Law),
i.e., linear speedup, even for small $N$ [Gustafson (1988)].
Note: a large serial fraction $s$ leads to a small slope.
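A brief numeric contrast (illustrative numbers, not from the slides), taking $s = 0.1$ and $N = 100$:

    S_v = s + (1-s)N = 0.1 + 0.9 * 100 = 90.1          (weak scaling, alpha = 1)
    S_f = 1/(s + (1-s)/N) = 1/(0.1 + 0.009) ≈ 9.2      (strong scaling)

The same serial fraction that caps strong scaling near $1/s = 10$ still permits nearly linear weak-scaling speedup.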

Oliver Ernst (INMO) HPC Wintersemester 2012/13 150

slide-22
SLIDE 22

Scalability

Application speedup: weak scaling, different notion of work

Again, base work only on the parallel fraction $p$. Once more, serial performance is $P^{s,p}_v = p$ and parallel performance

$P^{p,p}_v = \frac{pN^\alpha}{s + pN^{\alpha-1}} = \frac{(1-s)N^\alpha}{s + (1-s)N^{\alpha-1}}$,

corresponding to a speedup of

$S^p_v = \frac{P^{p,p}_v}{P^{s,p}_v} = \frac{N^\alpha}{s + (1-s)N^{\alpha-1}}$.

Performance and speedup differ by a factor $p$, but for $\alpha = 1$ the speedup is now linear with slope one, in contrast to Gustafson's Law. Conclusion: look carefully at the performance metrics being applied.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 151

slide-23
SLIDE 23

Parallel Efficiency

Definition

How effectively does a parallel program utilize a given resource? (Assume the serial fraction is executed by a single worker while the others wait.) Parallel efficiency:
$\varepsilon := \frac{\text{performance on $N$ CPUs}}{N \times \text{performance on one CPU}} = \frac{\text{speedup}}{N}$.
Consider only weak scaling, as the Amdahl case is covered by $\alpha \to 0$.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 152


slide-26
SLIDE 26

Parallel Efficiency

Weak scaling

For work $= s + pN^\alpha$:
$\varepsilon = \frac{S_v}{N} = \frac{sN^{-\alpha} + 1 - s}{sN^{1-\alpha} + 1 - s}$.
For $\alpha = 0$ we recover the Amdahl case: $\varepsilon = \frac{1}{sN + 1 - s} \to 0$ as $N \to \infty$.
For $\alpha = 1$: $\varepsilon = \frac{s}{N} + 1 - s$, ranging from $\varepsilon = 1$ for $N = 1$ to the limit $\varepsilon \to 1 - s = p$ as $N \to \infty$, i.e., efficiency is limited by the parallel fraction of the code.
Conclusions: weak scaling allows utilization of at most a fraction $p$ of the computing power; wasted CPU time grows linearly with $N$.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 153

slide-27
SLIDE 27

Parallel Efficiency

Weak scaling, different notion of work

For work $= pN^\alpha$:
$\varepsilon^p = \frac{S^p_v}{N} = \frac{N^{\alpha-1}}{s + (1-s)N^{\alpha-1}}$.
In this case, for $\alpha = 1$ we obtain $\varepsilon^p = 1$, i.e., perfect efficiency. For a large serial fraction $s$, this obscures the fact that most of the computational capacity of the computer remains unused.
Example: $p = 0.1$, $s = 0.9$, $\alpha = 1$. With weak scaling, the efficiency $\varepsilon \to 0.1$ as $N \to \infty$, whereas $\varepsilon^p \equiv 1$ for all $N$. However, all processors except one are idle 90% of the time.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 154

slide-28
SLIDE 28

Measuring Strong Scalability

Approach to ascertain the scaling properties of a code on a given architecture: measure performance on a small number of processors and determine the model parameters by a least-squares fit. Example (Hager & Wellein, Sec. 5.3.5):

[Figure: relative performance vs. number of cores (1-12) for the same code on two architectures, with Amdahl fits s = 0.168 and s = 0.086.]

Least-squares fit of the serial fraction $s$ in the strong-scaling model $S_f = \frac{1}{s + \frac{1-s}{N}}$ for the same code on two different systems (measurements normalized to single-core values). The parallel part is compute-bound, the serial part memory-bound; System 2 has the larger memory bandwidth.
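Since the model has only the single parameter $s$, such a fit can be done with a simple scan. A minimal sketch in C (the measurement values below are hypothetical placeholders):

    #include <stdio.h>

    /* Amdahl model: S_f(N) = 1 / (s + (1-s)/N) */
    static double model(double s, double N) {
        return 1.0 / (s + (1.0 - s) / N);
    }

    int main(void) {
        /* hypothetical measured speedups, normalized to one core */
        const double N[]      = {1, 2, 4, 8, 12};
        const double S_meas[] = {1.00, 1.85, 3.15, 4.80, 5.70};
        const int    m = 5;

        double best_s = 0.0, best_err = 1e300;

        /* brute-force least-squares scan over the single parameter s */
        for (double s = 0.0; s <= 1.0; s += 1e-4) {
            double err = 0.0;
            for (int i = 0; i < m; i++) {
                double r = model(s, N[i]) - S_meas[i];
                err += r * r;
            }
            if (err < best_err) { best_err = err; best_s = s; }
        }
        printf("fitted serial fraction s = %.4f\n", best_s);
        return 0;
    }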

Oliver Ernst (INMO) HPC Wintersemester 2012/13 155

slide-29
SLIDE 29

Accelerate Serial or Parallel Part?

Assume the serial part can be accelerated by a factor $\xi > 1$. Amdahl's Law says the parallel performance becomes
$P^{s,\xi}_f = \frac{1}{\frac{s}{\xi} + \frac{1-s}{N}}$.
Optimizing instead the parallel part (by the same factor) yields
$P^{p,\xi}_f = \frac{1}{s + \frac{1-s}{\xi N}}$.
When does accelerating the serial part pay off more? Crossover point:
$\frac{P^{s,\xi}_f}{P^{p,\xi}_f} = \frac{\xi s + \frac{1-s}{N}}{s + \xi\frac{1-s}{N}} \ge 1 \;\Leftrightarrow\; N \ge \frac{1}{s} - 1$.
This point is independent of $\xi$ and corresponds to the value of $N$ for which half of the asymptotic efficiency is achieved in Amdahl's Law.
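A quick numeric check (illustrative): with a serial fraction of $s = 0.05$ the crossover lies at

    N ≥ 1/s - 1 = 1/0.05 - 1 = 19,

so beyond roughly 19 workers, accelerating the serial part by some factor $\xi$ pays off more than accelerating the parallel part by the same factor.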

Oliver Ernst (INMO) HPC Wintersemester 2012/13 156

slide-30
SLIDE 30

Accelerate Serial or Parallel Part?

Amdahl's Law used with strong scaling gives a parallel efficiency of
$\varepsilon = \frac{S_f}{N} = \frac{1}{sN + 1 - s}$.
For the crossover value $N = \frac{1}{s} - 1$ this is
$\varepsilon^* = \frac{1}{2(1-s)}$,
which, for $s \ll 1$, is already close to 0.5. Further increase of $N$ thus results in diminishing returns, hence one should try to accelerate the parallel part first.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 157

slide-31
SLIDE 31

Refined Performance Models

Models like Amdahl's and Gustafson's Laws can be refined to account for communication, load imbalance, parallel startup overhead, etc. Here we consider a simple model accounting for communication. Assume communication is not overlapped with computation (often true) and introduce a correction term:
$T^{p,c}_v = s + pN^{\alpha-1} + c_\alpha(N)$.
Communication overhead does not count as work. Thus the parallel speedup is
$S^c_v = \frac{s + pN^\alpha}{T^{p,c}_v} = \frac{s + (1-s)N^\alpha}{s + (1-s)N^{\alpha-1} + c_\alpha(N)}$.
To model $c_\alpha(N)$, note the basic message transfer time $\lambda + \kappa$, where $\lambda$ is the channel latency and $\kappa = \frac{n}{B}$ the streaming time for a message of size $n$ at bandwidth $B$. We now compare different network models.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 158

slide-32
SLIDE 32

Refined Performance Models

α = 0, blocking network

For bus-like communication only one message can be transmitted at a time. Each CPU's message costs $\kappa + \lambda$ independently of $N$, but the messages serialize on the shared medium, so $c_\alpha(N) = (\kappa + \lambda)N$.
$S^c_v = \frac{1}{s + \frac{1-s}{N} + (\kappa + \lambda)N} \approx \frac{1}{(\kappa + \lambda)N}$ for $N \gg 1$.
Performance is dominated by communication and goes to zero for large $N$. The same situation arises when there is contention for shared resources (memory channels, I/O channels, functional units on a CPU), i.e., serialization.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 159

slide-33
SLIDE 33

Refined Performance Models

α = 0, non-blocking network, constant communication cost

Assume the communication network can sustain $N/2$ concurrent messages without collisions and that the message size is independent of $N$. Then $c_\alpha(N) = \kappa + \lambda$ and
$S^c_v = \frac{1}{s + \frac{1-s}{N} + \kappa + \lambda} \approx \frac{1}{s + \kappa + \lambda}$ for $N \gg 1$.
The situation is similar to Amdahl's Law with $s$ replaced by $s + \kappa + \lambda$. For large $N$, performance saturates at a lower value due to the communication overhead.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 160

slide-34
SLIDE 34

Refined Performance Models

α = 0, non-blocking network, surface-to-volume communication

Such message sizes arise e.g. in domain decomposition for PDEs, where the message size is proportional to the surface area of a subdomain and the work is proportional to its volume. Here
$c_\alpha(N) = \kappa N^{-\beta} + \lambda$, $\beta > 0$,
i.e., the communication overhead decreases with $N$ (strong scaling).
$S^c_v = \frac{1}{s + \frac{1-s}{N} + \kappa N^{-\beta} + \lambda} \approx \frac{1}{s + \lambda}$ for $N \gg 1$.
For large $N$, performance is dominated by $s$ and the latency $\lambda$.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 161

slide-35
SLIDE 35

Refined Performance Models

α = 1, non-blocking network, surface-to-volume communication

In this case the communication overhead per CPU is independent of $N$: under weak scaling the subdomain, and hence the message size, stays constant, so (in contrast to the preceding model) $c_\alpha(N) = \kappa + \lambda$.
$S^c_v = \frac{s + pN}{s + p + \kappa + \lambda} = \frac{s + (1-s)N}{1 + \kappa + \lambda}$.
The following figure illustrates the comparison, plotting $S^c_v$ against $N$ for the parameter values $N = 1,\dots,1000$, $s = 0.05$, $\kappa = 0.005$, $\lambda = 0.001$, $\beta = 2/3$.
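To reproduce the comparison qualitatively, the following C sketch (illustrative only, using the parameter values just stated) evaluates $S^c_v$ for the four communication models alongside the plain Amdahl curve:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        const double s = 0.05, kappa = 0.005, lambda = 0.001, beta = 2.0/3.0;

        printf("%6s %10s %10s %10s %10s %10s\n",
               "N", "Amdahl", "blocking", "nonblock", "s2v a=0", "s2v a=1");

        for (int N = 1; N <= 1000; N *= 10) {
            double p = 1.0 - s;
            /* the five speedup models from the preceding slides */
            double amdahl   = 1.0 / (s + p/N);
            double blocking = 1.0 / (s + p/N + (kappa + lambda)*N);
            double nonblock = 1.0 / (s + p/N + kappa + lambda);
            double s2v_a0   = 1.0 / (s + p/N + kappa*pow(N, -beta) + lambda);
            double s2v_a1   = (s + p*N) / (1.0 + kappa + lambda);

            printf("%6d %10.3f %10.3f %10.3f %10.3f %10.3f\n",
                   N, amdahl, blocking, nonblock, s2v_a0, s2v_a1);
        }
        return 0;
    }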

Oliver Ernst (INMO) HPC Wintersemester 2012/13 162

slide-36
SLIDE 36

Refined Performance Models

Illustration

[Figure: $S^c_v$ vs. $N$ for the Amdahl model and the four communication models: α = 0 blocking, α = 0 non-blocking, α = 0 3D surface-to-volume, α = 1 3D surface-to-volume.]

Oliver Ernst (INMO) HPC Wintersemester 2012/13 163

slide-37
SLIDE 37

Contents

  • 4. Parallel Computing

    4.1 Introduction
    4.2 Scalability
    4.3 Parallel Architectures
    4.4 Networks

Oliver Ernst (INMO) HPC Wintersemester 2012/13 164

slide-38
SLIDE 38

Parallel Architectures

Components

  • Processor
  • Memory
  • Communication network

[Figure: schematic of several processors (Prozessor) and memories (Speicher) connected by a communication network.]

Oliver Ernst (INMO) HPC Wintersemester 2012/13 165


slide-41
SLIDE 41

Parallel Architectures

Main Distinction

Shared Memory

[Figure: processors P1, P2, ..., Pn, each with its own cache, attached via a bus/switch to a common memory.]

Scales to ≈ 500 processors. Programming models: OpenMP, POSIX Threads (Pthreads).

Message Passing

[Figure: nodes P1, P2, ..., Pn, each with its own cache and local memory, connected by an interconnect.]

Scales arbitrarily. Programming model: MPI (message passing).

In real life: hybrids.
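As a minimal illustration of the message-passing model (a hedged sketch, not from the slides; it assumes an MPI installation and at least two processes), each process owns its memory privately and data moves only through explicit messages:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, size, token;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* my process ID       */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes */

        /* run with at least two processes, e.g. mpirun -np 2 ./a.out */
        if (rank == 0) {
            token = 42;                          /* data private to rank 0 */
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 of %d received token %d\n", size, token);
        }

        MPI_Finalize();
        return 0;
    }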

Oliver Ernst (INMO) HPC Wintersemester 2012/13 166

slide-42
SLIDE 42

Parallel Architectures

More classifications:
  • Single/Multiple Instruction/Data [Flynn (1966)]
  • SPMD, MPMD
  • Shared memory vs. distributed memory
  • UMA vs. ccNUMA
  • Communication network type

Oliver Ernst (INMO) HPC Wintersemester 2012/13 167

slide-43
SLIDE 43

Parallel Architectures

Shared-memory computers

Common shared physical address space for several processors.
  • Uniform Memory Access (UMA) systems: latency and bandwidth are the same for all processors and all memory locations ("flat" memory model).
  • Cache-coherent Nonuniform Memory Access (ccNUMA) systems: memory is physically distributed but logically shared. The network logic gives the appearance of a single address space for the total system memory; however, access times differ for local and remote memory locations.
  • Locality problem: fundamental performance limitation if remote accesses can't be restricted.
  • Contention problem: simultaneous access to the same memory location by several processors.
Both shared-memory variants require cache-coherency mechanisms: a single cache line may reside in multiple caches at any given time. When any copy is modified, the change must be propagated to all copies in order to maintain a consistent view of memory for all processors.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 168

slide-44
SLIDE 44

Parallel Architectures

Cache-coherency

Example: suppose two memory locations A1 and A2 belong to the same cache line.
  • 1. Processor P1 loads the cache line into its cache.
  • 2. Processor P2 loads the cache line into its cache.
  • 3. Processor P1 modifies A1.
  • 4. Processor P2 modifies A2.
  • 5. The cache line is evicted from P1's cache. Which version should be written to memory?

Oliver Ernst (INMO) HPC Wintersemester 2012/13 169

slide-45
SLIDE 45

Parallel Architectures

Cache-coherency: the MESI protocol

Assign to each cache line one of four possible states:
  • M (modified): the cache line has been modified and resides in no other cache than this one; upon eviction, memory will reflect this cache line's most current state.
  • E (exclusive): the cache line has been read from memory but not yet modified; it resides in no other cache.
  • S (shared): the cache line has been read from memory and not yet modified, with additional copies possibly residing in other caches.
  • I (invalid): the cache line does not represent valid data. Typically occurs when another processor has requested exclusive access while this cache line was in the shared state.
When a cache requests exclusive ownership of a cache line, copies in the S or E state must be notified and copies in the M state must be written to memory. The notifying cache broadcasts the request over the bus; the remaining caches "snoop" the bus and act accordingly when holding the affected cache line.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 170


slide-47
SLIDE 47

Parallel Architectures

Cache-coherency: the MESI protocol

Back to the example: assume cache line L contains memory locations A1 and A2 and resides in the caches C1 and C2 of processors P1 and P2. Sequence of events:

  • 1. C1 requests exclusive ownership of L.
  • 2. In C2, the state of L is set to I.
  • 3. In C1, the state of L is set to E; A1 is modified; the state of L is set to M.
  • 4. C2 requests exclusive ownership of L.
  • 5. L is evicted from C1 and its state set to I.
  • 6. L is loaded into C2 and its state set to E.
  • 7. A2 is modified in L and the state set to M.

[Figure: caches C1 and C2 of processors P1 and P2 and main memory, each holding a copy of A1 and A2; numbered arrows 1-7 indicate the events above. Source: Hager & Wellein.]

Limitations: bus snooping consumes bandwidth and does not scale. Alternative: directory-based system monitors state of all cache lines, transmits state changes only to affected caches (employed in ccNUMA systems).
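The cost of this cache-line ping-pong shows up in practice as false sharing. The following OpenMP sketch (illustrative only, not from the slides; it assumes a compiler with OpenMP support) lets two threads update adjacent counters that share a cache line; enlarging the padding so that each counter occupies its own line typically removes the coherency traffic:

    #include <stdio.h>
    #include <omp.h>

    #define ITERS 100000000L

    /* two counters that end up on the same cache line -> false sharing;
       set PAD to 64 to place them on separate cache lines */
    #define PAD 1
    static struct { volatile long c; char pad[PAD]; } counter[2];

    int main(void) {
        double t0 = omp_get_wtime();

        #pragma omp parallel num_threads(2)
        {
            int id = omp_get_thread_num();
            for (long i = 0; i < ITERS; i++)
                counter[id].c++;   /* each write invalidates the other thread's copy */
        }

        printf("sums: %ld %ld   time: %.2f s\n",
               (long)counter[0].c, (long)counter[1].c, omp_get_wtime() - t0);
        return 0;
    }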

Oliver Ernst (INMO) HPC Wintersemester 2012/13 171


slide-50
SLIDE 50

Parallel Architectures

Examples of UMA systems

  • Two single-core CPUs in separate sockets sharing a common frontside bus (FSB).
  • Single shared path to memory.
  • Chipset ("northbridge") responsible for driving memory modules, I/O connections, arbitration.
  • Outdated design.

[Figure: two single-core processors, each with L1D and L2 caches, sharing one FSB to the chipset and memory. Source: Hager & Wellein.]

Oliver Ernst (INMO) HPC Wintersemester 2012/13 172

slide-51
SLIDE 51

Parallel Architectures

Examples of UMA systems

  • Two dual-core chips, each with its FSB connected separately to the chipset.
  • Chipset-to-memory bandwidth can be designed to match the combined bandwidth of both FSBs.
  • Anisotropy in the distance between cores depending on socket membership.

[Figure: two dual-core chips in separate sockets (cores with L1D caches, L2 caches), each socket connected by its own FSB to the chipset and memory. Source: Hager & Wellein.]

Limitation: bandwidth bottlenecks as number of sockets/FSBs grows beyond certain limits. Cross-bar switches avoid bandwidth bottleneck but are expensive.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 173

slide-52
SLIDE 52

Parallel Architectures

Examples of ccNUMA systems

Locality domain (LD): collection of processor cores with locally connected memory. A coherent interconnect links multiple LDs; the system runs a single OS instance.

[Figure: two locality domains, each a quad-core chip with per-core L1D/L2 caches, a shared L3 cache and a memory interface to its local memory, linked by a coherent interconnect. Source: Hager & Wellein.]

Two LDs, each consisting of one quad-core chip with separate caches and a common interface to its local memory; the LDs are linked by a high-speed connection. Implementations: QuickPath (Intel), HyperTransport (AMD et al.). The intersocket link can mediate direct, cache-coherent memory access. All required protocols are provided by hardware.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 174

slide-53
SLIDE 53

Parallel Architectures

Examples of ccNUMA systems

[Figure: ccNUMA system with four locality domains (eight processors with L1D/L2 caches and shared L3 caches, four local memories); each socket attaches to a communication interface (S), and routers (R) switch the NUMALink (NL) network connections. Source: Hager & Wellein.]

ccNUMA system with 4 LDs as used in Intel-based SGI Altix systems. Scales to thousands of processors. Each socket is connected to a communication interface (S), providing access to memory as well as to the proprietary NUMALink (NL) network. Network connections for remote access are switched by routers (R). Adding LDs to the network becomes increasingly expensive and adds layers of latency.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 175

slide-54
SLIDE 54

Parallel Architectures

Distributed memory model

[Figure: distributed-memory system: processors (P), each with a cache (C) and exclusive local memory (M), attached via network interfaces (NI) to a communication network. Source: Hager & Wellein.]

  • Programming model only; the design is obsolete as a practical system.
  • Each processor (P) connected to exclusive local memory.
  • At least one network interface (NI) per node.
  • A separate process on each node; processes communicate explicitly using the communication network.
  • No Remote Memory Access (NORMA) system.
  • The programming paradigm is more complex, but it scales. All current supercomputers are based on this model.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 176

slide-55
SLIDE 55

Parallel Architectures

Hybrid model

[Figure: hybrid system: shared-memory nodes (two processors sharing a memory), each with a network interface, connected by a communication network. Source: Hager & Wellein.]

Shared-memory building blocks connected by a fast network. The network adds another level of complexity beyond the cache hierarchy, ccNUMA nodes, etc. A mixture of programming models is possible on the same system. "Hybrid" also refers to systems containing a mixture of programming paradigms on different hardware layers: GPUs (graphics processing units), FPGAs (field-programmable gate arrays), ASICs (application-specific integrated circuits), co-processors, etc.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 177

slide-56
SLIDE 56

Contents

  • 4. Parallel Computing

    4.1 Introduction
    4.2 Scalability
    4.3 Parallel Architectures
    4.4 Networks

Oliver Ernst (INMO) HPC Wintersemester 2012/13 178

slide-57
SLIDE 57

Networks

Basic performance characteristics

Large variety of topologies, technologies, protocols. Simplest and least expensive: Gigabit Ethernet, but too slow for parallel applications requiring significant communication. Current standard for small clusters: Infiniband.

Basic model for a point-to-point connection: the transfer time $T$ for a message of $N$ bytes is the sum
$T = T_\ell + \frac{N}{B}$, where $T_\ell$ is the latency and $B$ the asymptotic bandwidth [MBytes/sec].
The latency generally depends on $N$ (vs. buffer or cache size). Effective bandwidth:
$B_{\mathrm{eff}} = \frac{N}{T_\ell + \frac{N}{B}}$.   (5.1)

Oliver Ernst (INMO) HPC Wintersemester 2012/13 179


slide-59
SLIDE 59

Networks

Ping-pong benchmark

To measure latency and effective bandwidth, send a message of size N bytes back and forth between two processes running on different processors. (Part of the Intel MPI Benchmarks suite.) Pseudocode:

    myID = get_process_ID()
    if (myID.eq.0) then
      targetID = 1
      S = get_walltime()
      call Send_message(buffer,N,targetID)
      call Receive_message(buffer,N,targetID)
      E = get_walltime()
      MBYTES = 2*N/(E-S)/1.d6   ! MBytes/sec rate
      TIME   = (E-S)/2*1.d6     ! transfer time in microsecs for single message
    else
      targetID = 0
      call Receive_message(buffer,N,targetID)
      call Send_message(buffer,N,targetID)
    endif
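A runnable version of the same measurement in C with MPI might look as follows (a hedged sketch, not the Intel MPI Benchmarks code; the message size is illustrative, and a real benchmark would repeat the exchange many times):

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        const int N = 1 << 20;                 /* message size in bytes */
        char *buffer = malloc(N);
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* run with two processes, e.g. mpirun -np 2 ./pingpong */
        if (rank == 0) {
            double S = MPI_Wtime();
            MPI_Send(buffer, N, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buffer, N, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            double E = MPI_Wtime();
            printf("B_eff = %.1f MBytes/sec\n", 2.0 * N / (E - S) / 1e6);
            printf("T     = %.1f microsec per message\n", (E - S) / 2 * 1e6);
        } else if (rank == 1) {
            MPI_Recv(buffer, N, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buffer, N, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }

        free(buffer);
        MPI_Finalize();
        return 0;
    }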

Oliver Ernst (INMO) HPC Wintersemester 2012/13 180

slide-60
SLIDE 60

Networks

Ping-pong benchmark measurements: Gigabit Ethernet

Ping-pong benchmark measurements of Beff on a Gigabit Ethernet network and least-squares fit of the parameters Tℓ and B from (5.1).

[Figure: measured $B_{\mathrm{eff}}$ vs. message size $N$ (bytes) on Gigabit Ethernet with the model fit $T_\ell$ = 76 µs, $B$ = 111 MBytes/sec; inset: measured latency (µs) for small messages; $N_{1/2}$ marked. Source: Hager & Wellein.]

Oliver Ernst (INMO) HPC Wintersemester 2012/13 181

slide-61
SLIDE 61

Networks

Ping-pong benchmark measurements: Gigabit Ethernet

Observations:
  • Model (5.1) explains the observations well.
  • Small message sizes: low bandwidth, transfer time dominated by latency.
  • Large message sizes: Beff saturates, latency insignificant.
  • Tℓ is not accurately reproduced by the fit; a direct measurement of T is shown in the inset.
Sources of latency:
  • Administrative overhead (message headers etc.).
  • Minimum message size for some protocols (TCP/IP): a small frame of N > 1 bytes is sent no matter how small the message.
  • Many layers of software involved in initiating message transmission.
  • Commodity PC hardware not optimized for low-latency I/O.
Low latency is achieved by lightweight protocols, optimized drivers, and communication devices directly attached to processor buses.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 182


slide-63
SLIDE 63

Networks

Ping-pong benchmark measurements: Infiniband

Ping-pong benchmark measurements of Beff on DDR Infiniband network and least-squares fit of parameters Tℓ and B from (5.1) (separately for small and large message regimes).

[Figure: measured $B_{\mathrm{eff}}$ vs. message size $N$ (bytes) on DDR Infiniband, with separate least-squares fits for small messages ($T_\ell$ = 4.14 µs, $B$ = 827 MBytes/sec) and large messages ($T_\ell$ = 20.8 µs, $B$ = 1320 MBytes/sec), plus the combined curve $T_\ell$ = 4.14 µs, $B$ = 1320 MBytes/sec. Source: Hager & Wellein.]

Oliver Ernst (INMO) HPC Wintersemester 2012/13 183

slide-64
SLIDE 64

Networks

Ping-pong benchmark measurements: Infiniband

Observations/remarks:
  • A single model does a bad job of fitting the entire message-size regime. Possible reason: the message-passing or protocol-layer software uses different buffering algorithms for messages of different sizes, e.g. splitting messages too large to fit into buffers.
  • Ideal: limiting network bandwidth comparable to internode bandwidth.
  • Many applications operate in latency-dominated regions of the bandwidth graph. This is quantified by $N_{1/2}$, the message size at which $B_{\mathrm{eff}} = B/2$. For model (5.1) we obtain $N_{1/2} = B\,T_\ell$.
  • Question: is an increase of the maximum network bandwidth by a factor $\beta$ beneficial for all messages? Answer: at message size $N$,
    $\frac{B_{\mathrm{eff}}(\beta B, T_\ell)}{B_{\mathrm{eff}}(B, T_\ell)} = \frac{1 + N/N_{1/2}}{1 + N/(\beta N_{1/2})}$,
    a gain of 33% for $\beta = 2$ at $N = N_{1/2}$. The same holds for a reduction of $T_\ell$ by a factor $\beta$. ⇒ Both need to be reduced for increased performance across the entire range.
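For instance (a rough check using the fitted Infiniband values from the previous slide, $T_\ell \approx 4.14$ µs and $B \approx 1320$ MBytes/sec):

    N_{1/2} = B * T_ℓ ≈ 1320e6 Bytes/sec * 4.14e-6 sec ≈ 5500 Bytes,

so messages shorter than a few kilobytes are latency-dominated on this network.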

Oliver Ernst (INMO) HPC Wintersemester 2012/13 184

slide-65
SLIDE 65

Networks

Bisection bandwidth

Issue: even if individual point-to-point connections have high effective bandwidth, this does not describe global saturation effects, i.e., the deviation of the global aggregate bandwidth from the ideal nonblocking behavior when all nodes are simultaneously transmitting or receiving data.

Bisection bandwidth $B_b$: the sum of the effective bandwidths of the minimal number of connections cut when partitioning the system into two parts of equal size.

More meaningful for hybrid parallel architectures: the available bandwidth per core, i.e., the bisection bandwidth divided by the overall number of compute cores.

[Figure: partitioning a network into two halves of equal size; the cut connections determine the bisection bandwidth. Source: Hager & Wellein.]

Oliver Ernst (INMO) HPC Wintersemester 2012/13 185

slide-66
SLIDE 66

Networks

Bus

Shared medium; can be used by only one communicating device at a time. A collision detection mechanism is necessary. Very common in computer systems. Easy to implement, low latency.

[Figure: devices attached to a shared bus. Source: Hager & Wellein.]

Most important shortcoming: the bus is blocking, i.e., all devices share a constant bandwidth. Susceptible to failure, as a local problem can easily influence global operation. In HPC systems buses are limited to communication at the processor or socket level or to diagnostic networks.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 186

slide-67
SLIDE 67

Networks

Switched networks

Switched network divides all communicating devices into 2 groups. Devices in one group connected to switch in starlike fashion. Switches connected to each othe in additional switch layers. Distance between two communicating devices varies according to number

  • f “hops” message has to travel from source to destination.

Diameter of network: maximal number of hops between two devices. (1 for bus) Single switch: can eithe support fully nonblocking operation, i.e., all pairs

  • f ports can use their full bandwidth concurrently, or can have bus-like

design with limited bandwidth.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 187

slide-68
SLIDE 68

Networks

Crossbar switch

[Figure: crossbar switch with four input and four output ports; circles mark the possible connections. Source: Hager & Wellein.]

Fully nonblocking switch. Circles represent possible connections between pairs of devices; each implemented as 2 × 2 switching element.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 188

slide-69
SLIDE 69

Networks

Fat trees

Crossbar switches can be combined to form a fat-tree network. Design choice: keep the nonblocking property across the entire network, or thin out bandwidth closer to the root. With the latter choice, the bisection bandwidth per compute element is less than half the leaf-switch bandwidth per port. The network must be capable of routing traffic via unused switches. Maximum latency depends only on the number of tree levels. If a group of workers uses a single leaf switch, they may obtain fully nonblocking communication. Bottlenecks are still possible with static routing.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 189

slide-70
SLIDE 70

Networks

Fat trees

[Figure: two-layer fat tree with leaf switches SW 1-SW 4 and spine switches SW A, SW B (see caption below).]

Figure 4.15: A fully nonblocking full-bandwidth fat-tree network with two switch layers. The switches connected to the actual compute elements are called leaf switches, whereas the upper layers form the spines of the hierarchy.

source: Hager & Wellein Oliver Ernst (INMO) HPC Wintersemester 2012/13 190

slide-71
SLIDE 71

Networks

Fat trees

[Figure: fat tree with a single spine switch above four leaf switches (see caption below).]

Figure 4.16: A fat-tree network with a bottleneck due to "1:3 oversubscription" of communication links to the spine. By using a single spine switch, the bisection bandwidth is cut in half as compared to the layout in Figure 4.15 because only four nonblocking pairs of connections are possible. Bisection bandwidth per compute element is even lower.

source: Hager & Wellein Oliver Ernst (INMO) HPC Wintersemester 2012/13 191

slide-72
SLIDE 72

Networks

Fat trees

[Figure: eight compute elements (1-8) attached to leaf switches SW 1-SW 4, with spine switches SW A and SW B (see caption below).]

Figure 4.17: Even in a fully nonblocking fat-tree switch hierarchy (network cabling shown as solid lines), not all possible combinations of N/2 point-to-point connections allow collision-free operation under static routing. When, starting from the collision-free connection pattern shown with dashed lines, the connections 2↔6 and 3↔7 are changed to 2↔7 and 3↔6, respectively (dotted-dashed lines), collisions occur, e.g., on the highlighted links if connections 1↔5 and 4↔8 are not re-routed at the same time.

source: Hager & Wellein Oliver Ernst (INMO) HPC Wintersemester 2012/13 192

slide-73
SLIDE 73

Networks

Mesh

Multi-dimensional hypercube-like mesh; each node sits at a Cartesian grid intersection. Routing is performed by ASICs. $B_b(N) \sim N^{(d-1)/d}$, hence $B_b(N)/N \to 0$ for large $N$. Popular for large systems when fat trees are too expensive; IBM Blue Gene and Cray XT use mesh networks.

[Figure: 2D torus mesh network; bisection bandwidth $\sim \sqrt{N}$. Source: Hager & Wellein.]

Oliver Ernst (INMO) HPC Wintersemester 2012/13 193

slide-74
SLIDE 74

Networks

Mesh

[Figure: four sockets (two processors each), each with local memory, connected by HyperTransport (HT) links; two HT links attach the I/O subsystem. Source: Hager & Wellein.]

Four-socket ccNUMA system with a HyperTransport-based mesh network connecting the 4 LDs. Each socket has 3 HT links, so the network has to be heterogeneous w.r.t. intersocket latency to accommodate the I/O connections and still utilize all HT ports (2 HT connections are used for I/O).

Oliver Ernst (INMO) HPC Wintersemester 2012/13 194