Parallel Computer Architecture
Lars Karlsson
Umeå University
2009-12-07
Topics Covered
◮ Multicore processors
◮ Short vector instructions (SIMD)
◮ Advanced ...
◮ Power consumption depends linearly on the clock frequency
◮ Power leads to heat
◮ Power is expensive
◮ Frequency around 2–3 GHz since 2001
◮ Prior to 2001: exponential growth over several decades
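For reference, the standard CMOS dynamic-power relation (a textbook fact, not shown on the slide) makes the frequency dependence explicit:

\[ P_{\text{dyn}} \approx \alpha \, C \, V^{2} f \]

where \(\alpha\) is the switching activity, \(C\) the switched capacitance, \(V\) the supply voltage, and \(f\) the clock frequency. At a fixed supply voltage the dependence on \(f\) is linear; in practice higher frequencies also demand higher voltage, so power grows even faster than linearly.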
◮ Already, few applications utilize all functional units
◮ Sublinear return on invested resources (transistors/power)
◮ Diminishing returns
Large:
◮ 4 units of sequential performance
◮ 4 units of parallel performance

Medium/Homo:
◮ 2 units of sequential performance
◮ 8 units of parallel performance

Small/Homo:
◮ 1 unit of sequential performance
◮ 16 units of parallel performance

Hetero:
◮ 2 units of sequential performance
◮ 14 units of parallel performance
With f denoting the sequential fraction of the work:
◮ f ≈ 1: sequential algorithm (very rare)
◮ f ≈ 0: perfectly parallel algorithm (quite common)
[Figure: performance of the four designs (Large, Medium/Homo, Small/Homo, Hetero) as a function of the sequential fraction f, for f from 0.1 to 1.]
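The exact formula behind the plot is not reproduced in this extract; the sketch below uses a plain Amdahl-style estimate (my assumption, not the slide's own model): a fraction f of the work runs at a design's sequential performance and the rest at its parallel performance.

    #include <cstdio>

    // Amdahl-style estimate (assumption): fraction f runs at the design's
    // sequential rate, the remaining (1 - f) at its parallel rate.
    static double perf(double f, double seq, double par) {
        return 1.0 / (f / seq + (1.0 - f) / par);
    }

    int main() {
        // (sequential, parallel) performance units from the slide
        const struct { const char *name; double seq, par; } designs[] = {
            {"Large",       4.0,  4.0},
            {"Medium/Homo", 2.0,  8.0},
            {"Small/Homo",  1.0, 16.0},
            {"Hetero",      2.0, 14.0},
        };
        for (double f = 0.1; f <= 1.0001; f += 0.1) {
            printf("f = %.1f:", f);
            for (const auto &d : designs)
                printf("  %s %5.2f", d.name, perf(f, d.seq, d.par));
            printf("\n");
        }
        return 0;
    }

Under this assumed model, the heterogeneous design keeps the sequential performance of a medium core while offering nearly the parallel throughput of the all-small design.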
[Figure: Gflop/s versus matrix size (200–2000) for a compute-bound operation and a memory-bound operation.]
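One way to see the distinction (my illustration, not from the slide) is to compare arithmetic operations with data traffic:

\[
\underbrace{\frac{2n^3~\text{flops}}{3n^2~\text{words}}}_{\text{matrix multiply}} = O(n)~\frac{\text{flops}}{\text{word}},
\qquad
\underbrace{\frac{n~\text{flops}}{3n~\text{words}}}_{\text{vector addition}} = O(1)~\frac{\text{flops}}{\text{word}}.
\]

A matrix multiply therefore becomes compute-bound as the matrix size grows, while a vector addition remains memory-bound at any size.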
◮ Fused multiply and add (FMA)
◮ Adder and multiplier in parallel
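As a small illustration (my example, not from the slides), the C standard library's fma evaluates a*b + c as one fused operation with a single rounding, which maps directly onto an FMA unit where one exists:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double a = 1.5, b = 2.0, c = 0.25;

        /* Fused: one rounding for the whole a*b + c. */
        double fused = fma(a, b, c);

        /* Unfused: the multiply rounds, then the add rounds again. */
        double unfused = a * b + c;

        printf("fused = %g, unfused = %g\n", fused, unfused);
        return 0;
    }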
◮ Same instruction stream
◮ Different data streams

◮ SIMD/Vector instructions
◮ Different control flows
[Figure: 4-Vector Addition. The issue logic feeds four ALUs, which add the four elements of a and b to produce c in a single operation.]
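On x86 this 4-wide addition can be written directly with SSE intrinsics (my example; the slide does not name a particular instruction set):

    #include <xmmintrin.h>  /* SSE intrinsics */
    #include <stdio.h>

    int main(void) {
        float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        float b[4] = {5.0f, 4.0f, 3.0f, 2.0f};
        float c[4];

        __m128 va = _mm_loadu_ps(a);      /* load four floats           */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);   /* one instruction, four sums */
        _mm_storeu_ps(c, vc);

        printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);
        return 0;
    }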
Distributed address space:
◮ Each process has its own address space
◮ Explicit communication (message passing; see the sketch below)

Shared address space:
◮ Each process shares a global address space
◮ Implicit communication (reads/writes + synchronization)
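A minimal sketch of explicit message passing, assuming MPI (the slides do not prescribe a particular library): rank 0 packs a value into a message and rank 1 receives it. In the shared-address-space model the same value would simply be written to and read from a shared variable, with synchronization around the access.

    #include <mpi.h>
    #include <stdio.h>

    /* Run with at least two processes, e.g. mpirun -np 2 ./a.out */
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int value = 0;
        if (rank == 0) {
            value = 42;
            /* Explicit communication: the data travels in a message. */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }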
◮ Synchronization primitives too expensive to implement
◮ The cost grows with the number of processors
◮ Atomic exchange
◮ Fetch-and-increment
◮ Test-and-set
◮ Compare-and-swap
◮ Load linked – store conditional
Lock variable:
◮ 0: free
◮ 1: locked

Acquire:
◮ Atomically exchange the lock variable with 1
◮ (i) returns 0: lock was free and is now locked – OK!
◮ (ii) returns 1: lock was locked and is still locked – Retry!

Release:
◮ Overwrite the lock variable with 0
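A minimal sketch of this lock protocol using C++11 atomics (my choice of notation; any of the hardware primitives listed above could implement the exchange):

    #include <atomic>
    #include <cstdio>

    std::atomic<int> lock_var{0};   // 0: free, 1: locked

    void acquire() {
        // Atomically exchange the lock variable with 1.
        // Old value 1 means someone else holds the lock: retry (spin).
        while (lock_var.exchange(1, std::memory_order_acquire) == 1) {
            /* spin */
        }
    }

    void release() {
        // Overwrite the lock variable with 0.
        lock_var.store(0, std::memory_order_release);
    }

    int main() {
        acquire();
        std::printf("in critical section\n");
        release();
        return 0;
    }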
◮ Alignment
◮ Data structures

◮ Loop-level parallelism (see the OpenMP sketch after this list)
◮ Best strategy depends on usage pattern
◮ Speculative multithreading

◮ Data distribution
◮ Communication
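As an illustration of loop-level parallelism (my example; the slides do not mandate a specific API), an OpenMP directive lets the compiler and runtime split independent loop iterations across cores:

    #include <stdio.h>

    int main(void) {
        const int n = 1000000;
        static double x[1000000], y[1000000];
        double sum = 0.0;

        /* Iterations are independent, so they can be split across cores. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++) {
            y[i] = 2.0 * x[i];
            sum += y[i];
        }

        printf("sum = %f\n", sum);
        return 0;
    }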
◮ Several instructions issued per clock
◮ Allows a CPI less than one
◮ 4 GHz
◮ 4-way multiple issue
◮ 5-stage pipeline
◮ 4 × 5 = 20 instructions executing in parallel
◮ 4 × 4 = 16 billion instructions per second
◮ CPI of 0.25
◮ (Deciding which instructions to issue each clock cycle)
◮ Static multiple issue: compiler at least partially responsible
◮ Dynamic multiple issue: processor responsible but compiler helps

◮ Static multiple issue: some responsibility on the compiler
◮ Dynamic multiple issue: hardware alleviates some hazards
◮ Speculate on the outcome of a branch
  ⋆ Enables instructions after the branch to begin execution
◮ Speculate that a load following a store refers to a distinct address
  ⋆ Enables executing the load prior to the store
◮ Fixed number of instructions per packet
◮ Restrictions on the mix of instructions
◮ In-order: instructions are issued in program order
◮ Out-of-order: (limited amount of) hardware lookahead
  ⋆ Synonym: dynamic pipeline scheduling
1. Program order
2. Coherent view
3. Write serialization
Coarse-grained multithreading:
◮ Run one thread until an expensive stall
◮ Switch to another thread with some overhead

Fine-grained multithreading:
◮ Switch to a new thread on every clock cycle
◮ Switch with no extra cost

Simultaneous multithreading (SMT):
◮ Assumes dynamic pipeline scheduling
◮ Several threads in parallel on every clock
◮ Essentially no switch at all: threads run concurrently
[Figure: issue slots over time for threads A–D under simultaneous (SMT), coarse-grained, and fine-grained multithreading.]
◮ Pipelined
◮ Superscalar
◮ SIMD
PPE:
◮ Suitable for control code such as an OS
◮ Cached access to memory

SPE:
◮ Specialized for computations
◮ Small (256 KB) scratchpad memory local to each SPE
◮ DMA between local and global memory
◮ SP: Scalar processor
◮ SM: Streaming multiprocessor (8 SPs + scratchpad memory)
◮ Kernel: any number of threads
◮ Grid: any number of thread blocks
◮ Thread block: 1–512 cooperating threads
◮ Warp: 32 threads executing in SIMD fashion
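A minimal CUDA sketch (my example, not taken from the slides) of how this hierarchy appears in code: a kernel is launched as a grid of thread blocks, each thread computes one element, and the hardware executes the threads of a block in warps of 32.

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Kernel: each thread handles one element of the vectors.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // position in the grid
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *ha = (float*)malloc(bytes);
        float *hb = (float*)malloc(bytes);
        float *hc = (float*)malloc(bytes);
        for (int i = 0; i < n; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }

        float *da, *db, *dc;
        cudaMalloc((void**)&da, bytes);
        cudaMalloc((void**)&db, bytes);
        cudaMalloc((void**)&dc, bytes);
        cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

        int threadsPerBlock = 256;                                 // one thread block
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // the grid
        vecAdd<<<blocks, threadsPerBlock>>>(da, db, dc, n);        // kernel launch

        cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
        printf("c[0] = %f\n", hc[0]);

        cudaFree(da); cudaFree(db); cudaFree(dc);
        free(ha); free(hb); free(hc);
        return 0;
    }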
◮ Risky to invest time and money in short-lived technology
◮ Lower clock frequency
◮ Simpler cores
◮ Less emphasis on backwards compatibility
◮ Smaller caches