SLIDE 1

Parallel Computer Architecture

Lars Karlsson

Umeå University

2009-12-07

SLIDE 2

Topics Covered

Multicore processors
Short vector instructions (SIMD)
Advanced instruction level parallelism
Cache coherence
Hardware multithreading
Sample multicore processors
Introduction to parallel programming
SLIDE 3

Part I Introduction

SLIDE 4

Moore’s Law

Moore’s law predicts an exponential growth in the number of transistors per chip
Observed exponential growth over the last couple of decades
Appears to continue for at least another decade
Enables the construction of faster processors
SLIDE 5

Turning Transistors into Performance

The old approach

Speed up a single instruction stream:

◮ Increase the clock frequency
◮ Pipeline the execution of instructions
◮ Predict branches to reduce the overhead of pipeline stalls
◮ Issue several instructions per clock
◮ Schedule instructions out-of-order
◮ Use short vector instructions (SIMD)
◮ Hide memory latency with a multilevel cache hierarchy

Conclusion: relies on Instruction Level Parallelism (ILP)
SLIDE 6

Limits of the Old Approach

The Power Wall

◮ Power consumption depends linearly on the clock frequency
◮ Power leads to heat
◮ Power is expensive
◮ Frequency around 2–3 GHz since 2001
◮ Prior to 2001: exponential growth over several decades

The ILP Wall

◮ Already, few applications utilize all functional units
◮ Sublinear return on invested resources (transistors/power)
◮ Diminishing returns
SLIDE 7

Turning Transistors into Performance

The new approach: multicore architectures

Several cores on one die – increases peak performance
Reduce the clock frequency – saves power
Use a simpler core design – frees transistors

Which of the following choices leads to the highest performance?

◮ All cores identical: homogeneous multicore
◮ Different types of cores: heterogeneous multicore

Clearly, heterogeneous multicores are potentially harder to program.
SLIDE 8

Heterogeneous Multicores

A Simple Model for Building Heterogeneous Multicores

Consider the following core designs:

◮ Small: 1 unit of area, 1 unit of performance
◮ Medium: 4 units of area, 2 units of performance
◮ Large: 16 units of area, 4 units of performance

Suppose we have 16 units of die area. Consider these processors:

Large: 1 large core

◮ 4 units of sequential performance
◮ 4 units of parallel performance

Medium/Homo: 4 medium cores

◮ 2 units of sequential performance
◮ 8 units of parallel performance

Small/Homo: 16 small cores

◮ 1 unit of sequential performance
◮ 16 units of parallel performance

Hetero: 1 medium and 12 small cores

◮ 2 units of sequential performance
◮ 14 units of parallel performance
SLIDE 9

Heterogeneous Multicores

Evaluating Design Choices

Partition an algorithm’s execution time:

Serial fraction f ∈ [0, 1]: no parallel speedup possible

◮ f ≈ 1: sequential algorithm (very rare)
◮ f ≈ 0: perfectly parallel algorithm (quite common)

Parallel fraction (1 − f): perfect parallel speedup

Performance as a function of f:

[Figure: performance as a function of the serial fraction f for the Large, Medium/Homo, Small/Homo, and Hetero designs; the curves span 1 to 16 units of performance]
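A minimal sketch (my own, not from the slides) of the performance model behind this plot, assuming the serial fraction runs at a design's sequential performance and the rest at its parallel performance (an Amdahl-style model). The function and parameters below simply reuse the numbers from the previous slide.

#include <stdio.h>

/* Performance of a design with sequential performance seq and parallel
   performance par on a workload with serial fraction f. */
static double perf(double f, double seq, double par)
{
    return 1.0 / (f / seq + (1.0 - f) / par);
}

int main(void)
{
    for (int i = 0; i <= 10; i++) {
        double f = i / 10.0;
        printf("f=%.1f  Large=%5.2f  Medium/Homo=%5.2f  Small/Homo=%5.2f  Hetero=%5.2f\n",
               f,
               perf(f, 4.0, 4.0),    /* 1 large core              */
               perf(f, 2.0, 8.0),    /* 4 medium cores            */
               perf(f, 1.0, 16.0),   /* 16 small cores            */
               perf(f, 2.0, 14.0));  /* 1 medium + 12 small cores */
    }
    return 0;
}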

SLIDE 10

Memory System

Machine characteristics:

Peak computational performance
Memory bandwidth
Memory latency

The first two impose hardware limits on performance.

Compute-bound, e.g., most of dense linear algebra
Memory-bound, e.g., most of sparse linear algebra
Latency-bound, e.g., finite state machines
SLIDE 11

Memory System

Compute-Bound vs Memory-Bound

Sample difference in performance between a compute-bound and a memory-bound algorithm on Akka @ HPC2N

[Figure: performance (Gflop/s) versus matrix size (200–2000) on Akka for a compute-bound and a memory-bound algorithm; the annotations 43x and 4x mark the performance gap between the two curves]
SLIDE 12

Obtaining Peak Floating Point Performance

To obtain peak performance, an algorithm must:

Have a high arithmetic intensity
Exploit the ISA effectively
Parallelize over all cores

Exploiting the ISA effectively means:

Balancing the number of multiplies with adds

◮ Fused multiply and add (FMA)
◮ Adder and multiplier in parallel

Using SIMD instructions
Having enough instruction level parallelism (ILP)
Having a predictable control flow (see the sketch below)
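Two small loops (my own illustration, not from the slides) that exhibit the properties listed above: a balanced multiply/add mix that maps onto FMA units, and independent accumulators that expose ILP. Whether the compiler actually emits FMA and SIMD instructions depends on the target and flags (an assumption, e.g. -O3 -mfma with GCC).

/* One multiply and one add per element: a balanced mix that maps naturally
   onto fused multiply-add (FMA) units, with predictable control flow. */
void axpy(int n, float alpha, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}

/* Four independent partial sums expose instruction level parallelism: the
   updates of s0..s3 do not depend on each other and can overlap. */
float dot(int n, const float *x, const float *y)
{
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; i++)   /* remainder */
        s0 += x[i] * y[i];
    return (s0 + s1) + (s2 + s3);
}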

SLIDE 13

Part II SISD / MIMD / SIMD

SLIDE 14

Flynn’s Taxonomy

Flynn’s taxonomy classifies parallel computers based on:

Number of instruction streams
Number of data streams

Instr. \ Data    Single   Multi
Single           SISD     SIMD
Multi            MISD     MIMD

SISD: Uniprocessor
MIMD: Multicores/clusters
SIMD: Vector processors/instructions
MISD: ???

SLIDE 15

Single Instruction Multiple Data (SIMD)

Several ALUs operating synchronously in parallel

◮ Same instruction stream
◮ Different data streams

Several variants:

◮ SIMD/Vector instructions
◮ Different control flows
SLIDE 16

SIMD Programming

SSE example (Intel C intrinsics):

__m128 vecA, vecB, vecC;
vecC = _mm_add_ps(vecA, vecB);

Vector data types
Vector operations

[Figure: 4-vector addition c = a + b; one issue logic drives four ALUs that add the four lanes in parallel]
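A slightly more complete sketch of how these intrinsics are typically used (my own, not on the slide), adding the loads and stores around the addition. It assumes <xmmintrin.h>, that n is a multiple of 4, and that the arrays are 16-byte aligned.

#include <xmmintrin.h>

/* c = a + b, four floats at a time with SSE. */
void vec_add(int n, const float *a, const float *b, float *c)
{
    for (int i = 0; i < n; i += 4) {
        __m128 vecA = _mm_load_ps(&a[i]);     /* load 4 floats           */
        __m128 vecB = _mm_load_ps(&b[i]);
        __m128 vecC = _mm_add_ps(vecA, vecB); /* 4 additions in parallel */
        _mm_store_ps(&c[i], vecC);            /* store 4 floats          */
    }
}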

SLIDE 17

MIMD: Shared vs Distributed Address Space

A key issue in MIMD design is whether or not to support a shared address space programming model (shared memory) in hardware; the alternative is distributed memory.

Distributed memory

◮ Each process has its own address space
◮ Explicit communication (message passing)

Shared memory

◮ Each process shares a global address space
◮ Implicit communication (reads/writes + synchronization)

Supporting shared memory in hardware leads to various issues:

What if two threads access the same memory location?
How to manage multiple cached copies of a memory location?
How to synchronize the threads?

Supporting distributed memory is much simpler.
SLIDE 18

MIMD: Synchronization

Thread cooperation requires that some threads write data that other threads read. To avoid corrupted results, the threads must be synchronized so that data races are avoided.

Definition (Data Race)

A data race occurs when two threads access the same memory location, at least one of the accesses is a write, and the accesses are not ordered by synchronization. With data races present, the output depends on the execution order. Without any data races, the program is correctly synchronized.
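A minimal PThreads illustration of a data race (my own example, not on the slide): two threads increment a shared counter without synchronization, so increments can be lost and the output depends on the execution order.

#include <pthread.h>
#include <stdio.h>

int counter = 0;   /* shared and unprotected */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++)
        counter++;   /* unsynchronized read-modify-write: a data race */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %d\n", counter);   /* often less than 200000 */
    return 0;
}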

SLIDE 19

MIMD: Synchronization

Hardware Support

Atomic read/write instructions are not strong enough

◮ Synchronization primitives too expensive to implement
◮ The cost grows with the number of processors

Atomic read-modify-write required

◮ Atomic exchange
◮ Fetch-and-increment
◮ Test-and-set
◮ Compare-and-swap
◮ Load linked – store conditional
SLIDE 20

MIMD: Synchronization

Implementing a Lock with Atomic Exchange

Represent the state of a lock (locked/free) by an integer

◮ 0: free
◮ 1: locked

Locking:

◮ Atomically exchange the lock variable with 1
◮ If the exchange returns 0: the lock was free and is now locked – OK!
◮ If the exchange returns 1: the lock was already locked and still is – retry!

Precisely one thread will succeed, since the exchanges are ordered by the hardware.

Unlocking:

◮ Overwrite the lock variable with 0
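A minimal spinlock sketch along these lines (my own, not from the slides), using the GCC/Clang builtin __sync_lock_test_and_set as the atomic exchange; the type and function names are mine.

typedef volatile int spinlock_t;   /* 0: free, 1: locked */

static void spin_lock(spinlock_t *lock)
{
    /* Atomically exchange the lock variable with 1; retry while the
       previous value was 1, i.e. while another thread holds the lock. */
    while (__sync_lock_test_and_set(lock, 1) == 1)
        ;   /* spin */
}

static void spin_unlock(spinlock_t *lock)
{
    /* Overwrite the lock variable with 0 (with release semantics). */
    __sync_lock_release(lock);
}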

SLIDE 21

Compiling for SIMD/Shared Memory/Distributed Memory

Compiling for SIMD instructions

◮ Alignment
◮ Data structures

Compiling for shared memory

◮ Loop-level parallelism
◮ Best strategy depends on the usage pattern
◮ Speculative multithreading

Compiling for distributed memory

◮ Data distribution
◮ Communication

Summary: it is very difficult to compile for parallel architectures. Programmers are responsible for almost all parallelization.
SLIDE 22

Part III Advanced Instruction Level Parallelism

SLIDE 23

Multiple Issue

Pipelining is the basic tool to exploit ILP
Multiple issue basically replicates the pipeline

◮ Several instructions issued per clock
◮ Allows a CPI of less than one

Example:

◮ 4 GHz clock
◮ 4-way multiple issue
◮ 5-stage pipeline
◮ 4 × 5 = 20 instructions executing in parallel
◮ 4 × 4 = 16 billion instructions per second
◮ CPI of 0.25
SLIDE 24

Multiple Issue Responsibilities (Hardware vs Compiler)

Packaging instructions into issue slots

◮ (Deciding which instructions to issue each clock cycle)
◮ Static multiple issue: compiler at least partially responsible
◮ Dynamic multiple issue: processor responsible but the compiler helps

Dealing with data and control hazards

◮ Static multiple issue: some responsibility on the compiler
◮ Dynamic multiple issue: hardware alleviates some hazards
SLIDE 25

Speculation

To enable more ILP, a processor may speculate on the properties of instructions

◮ Speculate on the outcome of a branch
  ⋆ Enables instructions after the branch to begin execution
◮ Speculate that a load following a store refers to a distinct address
  ⋆ Enables executing the load before the store

Roll-back is required when the speculation was wrong
Speculated results are buffered until the speculation outcome is known
SLIDE 26

Static Multiple Issue

Instructions grouped into issue packets

◮ Fixed number of instructions per packet
◮ Restrictions on the mix of instructions

Synonym: Very Long Instruction Word (VLIW)
Compiler responsible for grouping and scheduling instructions
SLIDE 27

Dynamic Multiple Issue

Synonym: superscalar processor
Hardware decides which instructions to issue each clock

◮ In-order: instructions are issued in program order
◮ Out-of-order: (limited amount of) hardware lookahead
  ⋆ Synonym: dynamic pipeline scheduling
SLIDE 28

Part IV Cache Coherence

SLIDE 29

Cache Coherence

A memory system is coherent if

1 (Program order)

A read of location X on processor P directly following a write of location X on processor P with no other writes of location X by other processors in between returns the value written by P. Program order is preserved.

2 (Coherent view)

A read by a processor to location X that follows a write by another processor to location X returns the written value, provided the read and write are sufficiently separated in time and no other writes to location X occur in between. Written values are propagated.

3 (Write serialization)

Writes to the same location are serialized. This means that two writes to the same location by any processors are seen in the same order by all processors. All processors observe the same order of writes to location X.

SLIDE 30

Migration and Replication

Caches in a multiprocessor improve performance by supporting:

Migration: data that is used by one processor is moved to its cache.
Replication: data that is simultaneously read by multiple processors is replicated into each processor’s local cache.

To maintain coherent caches in the presence of migration and replication, hardware-based cache coherence protocols are used.
SLIDE 31

MESI: A Cache Coherence Protocol

States

Each cache line can be in one of four states:

Modified: the cache line exists only in the local cache and is dirty (modified).
Exclusive: the cache line exists only in the local cache but is unchanged.
Shared: the cache line may also exist in remote caches and is unchanged.
Invalid: the cache line is invalid.
SLIDE 32

MESI: A Cache Coherence Protocol

Transitions

State transitions are triggered by:

Local reads and writes
Reads and writes intercepted on the bus: snooping

Snooping-based protocols require a broadcast medium and hence do not scale very well. Alternatives include the so-called directory-based protocols. A simplified sketch of the state transitions follows.
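A much simplified sketch of the MESI transitions (my own illustration, not from the slides), tracking only the state of one line in one cache; a real protocol also issues bus transactions, write-backs, and invalidations. All names are mine.

typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_state_t;

/* New state after a local read; other_sharers tells whether some remote
   cache already holds the line when it is fetched on a miss. */
mesi_state_t on_local_read(mesi_state_t s, int other_sharers)
{
    if (s == INVALID)
        return other_sharers ? SHARED : EXCLUSIVE;   /* read miss */
    return s;                                        /* hit: state unchanged */
}

/* A local write always leaves the line Modified; from Invalid or Shared the
   protocol must first invalidate any remote copies. */
mesi_state_t on_local_write(mesi_state_t s)
{
    (void)s;
    return MODIFIED;
}

/* Snooped read by another processor: a Modified line is written back, and
   Modified/Exclusive lines drop to Shared. */
mesi_state_t on_snooped_read(mesi_state_t s)
{
    return (s == MODIFIED || s == EXCLUSIVE) ? SHARED : s;
}

/* Snooped write (or invalidation) by another processor invalidates our copy. */
mesi_state_t on_snooped_write(mesi_state_t s)
{
    (void)s;
    return INVALID;
}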

SLIDE 33

Part V Hardware Multithreading

SLIDE 34

Hardware Multithreading: Several threads on one core

Software multithreading:

◮ Purpose: time-sharing to emulate concurrent processing
◮ The processor is used by one thread at a time
◮ Switching threads is infrequent and expensive

Hardware multithreading:

◮ Purpose: share the functional units between threads
◮ The processor is typically used by multiple threads at a time
◮ Switching threads is either very cheap or free
◮ Hardware resources are tied to each thread
SLIDE 35

Approaches to Hardware Multithreading

Coarse-grained

◮ Run one thread until an expensive stall
◮ Switch to another thread with some overhead

Fine-grained

◮ Switch to a new thread on every clock cycle
◮ Switch with no extra cost

Simultaneous (SMT)

◮ Assumes dynamic pipeline scheduling
◮ Several threads in parallel on every clock
◮ Essentially no switch at all: threads run concurrently
SLIDE 36

Fine-grained / Coarse-grained / SMT

[Figure: issue slots over time for fine-grained MT, coarse-grained MT, and SMT, with instructions from threads A–D filling the slots]
SLIDE 37

Part VI Sample Architecture Types and Programming Models

SLIDE 38

Traditional Multicore

[Figure: traditional multicore; two CPU cores, each with registers and a private L1 cache, sharing an L2 cache and main memory]
SLIDE 39

Traditional Multicore

Characteristics

Homogeneous multicore (2, 4, 6, 8, . . . (?) cores)
Conventional (heavy) core design

◮ Pipelined
◮ Superscalar
◮ SIMD

Multilevel cache hierarchy
Caches shared to varying degrees
Coherent caches
Familiar programming models (processes/threads/...)
SLIDE 40

Traditional Multicore: Programming

Shared Memory

PThreads example: summing a vector

#include <pthread.h>

float sum = 0;
pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

void parallel_sum(int np, int me, float A[], int n)
{
    /* Local computation: sum this thread's slice of A. */
    float localsum = 0;
    int i;
    for (i = (n * me) / np; i < (n * (me + 1)) / np; i++)
        localsum += A[i];

    /* Synchronized reduction into the shared sum. */
    pthread_mutex_lock(&sum_lock);
    sum += localsum;
    pthread_mutex_unlock(&sum_lock);
}
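A possible driver for the routine above (my addition, not on the slide): it spawns np threads that each run parallel_sum on their slice of the vector. The wrapper struct and function names are hypothetical.

struct sum_args { int np, me, n; float *A; };

static void *sum_thread(void *p)
{
    struct sum_args *a = p;
    parallel_sum(a->np, a->me, a->A, a->n);
    return NULL;
}

void sum_vector(int np, float A[], int n)
{
    pthread_t tid[np];
    struct sum_args args[np];
    for (int me = 0; me < np; me++) {
        args[me] = (struct sum_args){ np, me, n, A };
        pthread_create(&tid[me], NULL, sum_thread, &args[me]);
    }
    for (int me = 0; me < np; me++)
        pthread_join(tid[me], NULL);
    /* The shared variable sum now holds the total. */
}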

SLIDE 41

Cell Broadband Engine (CBE)

Characteristics

Heterogeneous multicore (9 cores)
One conventional processor core (PPE)

◮ Suitable for control code such as an OS
◮ Cached access to memory

Eight 128-bit SIMD processors (SPEs)

◮ Specialized for computations
◮ Small (256 KB) scratchpad memory local to each SPE
◮ DMA between local and global memory

Special programming model

SLIDE 42

Cell Broadband Engine (CBE)

[Figure: CBE block diagram; the PPE (with L1/L2 caches) and eight SPEs (each with a local store) connected to main memory, with DMA transfers between the local stores and memory]
SLIDE 43

CBE: Programming

float A[1024], B[1024];
uint64_t eaA, eaB;
uint32_t tag, i;

/* Start asynchronous loads of A and B from main memory into the local store. */
mfc_get(A, eaA, sizeof(A), tag, 0, 0);
mfc_get(B, eaB, sizeof(B), tag, 0, 0);

/* Wait for the loads of A and B to complete. */
mfc_write_tag_mask(1 << tag);
mfc_read_tag_status_all();

/* Compute A = A + B locally. */
for (i = 0; i < 1024; i++)
    A[i] += B[i];

/* Start asynchronous store of A back to main memory. */
mfc_put(A, eaA, sizeof(A), tag, 0, 0);

SLIDE 44

GPU (CUDA)

Characteristics

Homogeneous multicore (128+ cores)
Multilevel processor design

◮ SP: Scalar processor
◮ SM: Streaming multiprocessor (8 SPs + scratchpad memory)

Global memory is shared by all SPs
Local scratchpad memory for each SM
Low clock frequency
High memory bandwidth
Special programming model (CUDA/OpenCL/...)
SLIDE 45

GPU

[Figure: GPU block diagram; 16 streaming multiprocessors (SMs), each containing 8 scalar processors (SPs) and a local memory, attached to device memory]
SLIDE 46

CUDA Programming Model

Code expressed at thread level
Several threads combine to implement an algorithm
Threads grouped hierarchically:

◮ kernel: any number of threads
◮ grid: any number of thread blocks
◮ thread block: 1–512 cooperating threads
◮ warp: 32 threads executing in SIMD fashion

Synchronization only within a thread block
Coordination between thread blocks via global memory
SLIDE 47

CUDA Programming Model

Example

__global__ void saxpy(int n, float alpha, float *x, float *y)
{
    /* Each thread computes one result. */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = alpha * x[i] + y[i];
}

/* Invoke the kernel: spawn the threads in blocks of 256. */
int nblocks = (n + 255) / 256;
saxpy<<<nblocks, 256>>>(n, 2.0f, x, y);
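For completeness, a hedged sketch (my addition, not on the slide) of the host-side allocation and transfers that would normally surround the kernel launch; error checking is omitted and the function name is mine.

#include <cuda_runtime.h>

void run_saxpy(int n, float alpha, const float *x_host, float *y_host)
{
    float *x, *y;

    /* Allocate device memory and copy the input vectors to the GPU. */
    cudaMalloc((void **)&x, n * sizeof(float));
    cudaMalloc((void **)&y, n * sizeof(float));
    cudaMemcpy(x, x_host, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(y, y_host, n * sizeof(float), cudaMemcpyHostToDevice);

    /* Launch one thread per element, in blocks of 256 threads. */
    int nblocks = (n + 255) / 256;
    saxpy<<<nblocks, 256>>>(n, alpha, x, y);

    /* Copy the result back and release the device memory. */
    cudaMemcpy(y_host, y, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(x);
    cudaFree(y);
}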

SLIDE 48

Part VII The Difficulty of Parallel Programming

SLIDE 49

Parallel Programming Difficulties

Relevant, correct, portable, ..., and fast
Large set of evolving programming models

◮ Risky to invest time and money in short-lived technology

Quality assurance more difficult
Portable and fast are often conflicting goals
Scalability
SLIDE 50

Scalability

Amdahl’s Law

Recall the serial fraction f of an algorithm
Informally: improved time = (time affected) / (improvement factor) + (time unaffected)
Formally: T_p = (1 − f) · T_1 / p + f · T_1
In the limit p → ∞: T_p → f · T_1, so the speedup T_1 / T_p is bounded by 1 / f
Suppose f = 5%: speedup < 20 no matter how many processors
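A small numerical check of the bound (my addition, not on the slide): the Amdahl speedup for f = 5% approaches, but never reaches, 1/f = 20 as p grows.

#include <stdio.h>

/* Amdahl's law: speedup on p processors for serial fraction f. */
static double speedup(double f, double p)
{
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void)
{
    const double f = 0.05;
    const int p_values[] = { 1, 4, 16, 64, 256, 1024 };
    for (int i = 0; i < 6; i++)
        printf("p = %4d: speedup = %.2f\n", p_values[i], speedup(f, p_values[i]));
    return 0;
}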

SLIDE 51

Scalability

Load Balance

Suppose p processors work in parallel
If each processor has the same amount of work: perfect speedup (p)
If one processor has twice the work of each of the others: the speedup is almost halved
SLIDE 52

Conclusions

Complex design space for multicores
Some trends are being reversed:

◮ Lower clock frequency
◮ Simpler cores
◮ Less emphasis on backwards compatibility
◮ Smaller caches

Reducing communication is crucial
Programming models are evolving
Performance improvements require software development
Exotic designs are commercially successful