Lecture 25: Multi-core Processors
• Today's topics:
  - Writing parallel programs
  - SMT
  - Multi-core examples
• Reminder:
  - Assignment 9 due Tuesday

Shared-Memory Vs. Message-Passing
Shared-memory:
• Well-understood programming model
• Communication is implicit and hardware handles protection
• Hardware-controlled caching
Message-passing:
• No cache coherence → simpler hardware
• Explicit communication → easier for the programmer to restructure code
• Software-controlled caching
• Sender can initiate data transfer

Ocean Kernel
[Figure: the grid split into row blocks, with block boundaries labeled Row 1, Row k, Row 2k, Row 3k]

Procedure Solve(A)
begin
  diff = done = 0;
  while (!done) do
    diff = 0;
    for i ← 1 to n do
      for j ← 1 to n do
        temp = A[i,j];
        A[i,j] ← 0.2 * (A[i,j] + neighbors);
        diff += abs(A[i,j] - temp);
      end for
    end for
    if (diff < TOL) then done = 1;
  end while
  …
end procedure
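The sequential kernel above can be sketched as runnable Python. This is a minimal sketch under stated assumptions, not the course's reference code: it assumes "neighbors" means the four nearest grid points, that a fixed border surrounds the n x n interior, and that updates happen in place. The names `solve` and `make_grid`, the tolerance, and the sweep cap are illustrative.

```python
def solve(A, tol=1e-3, max_sweeps=10_000):
    """Sweep the interior of A in place until the summed change drops below tol.

    A has (n + 2) rows and columns; the outer border is held fixed
    (an assumption: the slide does not say how grid edges are handled).
    Returns the number of sweeps performed.
    """
    n = len(A) - 2
    for sweep in range(1, max_sweeps + 1):
        diff = 0.0
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                temp = A[i][j]
                # 0.2 * (self + four neighbors): a five-point average.
                A[i][j] = 0.2 * (A[i][j] + A[i - 1][j] + A[i + 1][j]
                                 + A[i][j - 1] + A[i][j + 1])
                diff += abs(A[i][j] - temp)
        if diff < tol:
            return sweep
    return max_sweeps


def make_grid(n, border=1.0):
    """Hypothetical test input: zero interior surrounded by a constant border."""
    size = n + 2
    return [[border if i in (0, size - 1) or j in (0, size - 1) else 0.0
             for j in range(size)]
            for i in range(size)]


grid = make_grid(4)
sweeps = solve(grid)
print("converged after", sweeps, "sweeps")
```

With a constant border the interior relaxes toward the border value, which gives a simple check that the loop structure is right.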

Shared Address Space Model

int n, nprocs;
float **A, diff;
LOCKDEC(diff_lock);
BARDEC(bar1);

main()
begin
  read(n); read(nprocs);
  A ← G_MALLOC();
  initialize (A);
  CREATE (nprocs,Solve,A);
  WAIT_FOR_END (nprocs);
end main

procedure Solve(A)
  int i, j, pid, done=0;
  float temp, mydiff=0;
  int mymin = 1 + (pid * n/nprocs);
  int mymax = mymin + n/nprocs - 1;
  while (!done) do
    mydiff = diff = 0;
    BARRIER(bar1,nprocs);
    for i ← mymin to mymax
      for j ← 1 to n do
        …
      endfor
    endfor
    LOCK(diff_lock);
    diff += mydiff;
    UNLOCK(diff_lock);
    BARRIER (bar1, nprocs);
    if (diff < TOL) then done = 1;
    BARRIER (bar1, nprocs);
  endwhile
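A hedged Python sketch of the same structure, using `threading.Lock` for `diff_lock` and `threading.Barrier` for `bar1`. It assumes the elided loop body is the five-point update from the Ocean kernel slide and that n is divisible by nprocs. The function name `parallel_solve`, the `shared` dictionary, and the sweep cap are illustrative. Python threads do not run this loop in parallel on a standard interpreter, so the sketch shows the synchronization pattern rather than a speedup.

```python
import threading


def parallel_solve(A, nprocs, tol=1e-3, max_sweeps=10_000):
    """Relax the shared grid A with nprocs threads; returns the sweep count."""
    n = len(A) - 2
    assert n % nprocs == 0, "sketch assumes n is divisible by nprocs"
    shared = {"diff": 0.0}                 # plays the role of the global diff
    diff_lock = threading.Lock()           # LOCKDEC(diff_lock)
    bar1 = threading.Barrier(nprocs)       # BARDEC(bar1)
    sweeps = [0] * nprocs

    def solve(pid):
        mymin = 1 + pid * (n // nprocs)
        mymax = mymin + n // nprocs - 1
        done = False
        while not done and sweeps[pid] < max_sweeps:
            mydiff = 0.0
            shared["diff"] = 0.0
            bar1.wait()                    # nobody starts before diff is reset
            for i in range(mymin, mymax + 1):
                for j in range(1, n + 1):
                    temp = A[i][j]
                    A[i][j] = 0.2 * (A[i][j] + A[i - 1][j] + A[i + 1][j]
                                     + A[i][j - 1] + A[i][j + 1])
                    mydiff += abs(A[i][j] - temp)
            with diff_lock:                # LOCK ... UNLOCK
                shared["diff"] += mydiff
            bar1.wait()                    # every partial sum has been added
            done = shared["diff"] < tol
            bar1.wait()                    # everyone has read diff before reset
            sweeps[pid] += 1

    threads = [threading.Thread(target=solve, args=(pid,))
               for pid in range(nprocs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sweeps[0]
```

The third barrier matters: without it, a fast thread could reset `diff` to 0 for the next iteration before a slow thread has tested it against the tolerance.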

Message Passing Model

main()
  read(n); read(nprocs);
  CREATE (nprocs-1, Solve);
  Solve();
  WAIT_FOR_END (nprocs-1);

procedure Solve()
  int i, j, pid, nn = n/nprocs, done=0;
  float temp, tempdiff, mydiff = 0;
  myA ← malloc(…)
  initialize(myA);
  while (!done) do
    mydiff = 0;
    if (pid != 0)
      SEND(&myA[1,0], n, pid-1, ROW);
    if (pid != nprocs-1)
      SEND(&myA[nn,0], n, pid+1, ROW);
    if (pid != 0)
      RECEIVE(&myA[0,0], n, pid-1, ROW);
    if (pid != nprocs-1)
      RECEIVE(&myA[nn+1,0], n, pid+1, ROW);
    for i ← 1 to nn do
      for j ← 1 to n do
        …
      endfor
    endfor
    if (pid != 0)
      SEND(mydiff, 1, 0, DIFF);
      RECEIVE(done, 1, 0, DONE);
    else
      for i ← 1 to nprocs-1 do
        RECEIVE(tempdiff, 1, *, DIFF);
        mydiff += tempdiff;
      endfor
      if (mydiff < TOL) done = 1;
      for i ← 1 to nprocs-1 do
        SEND(done, 1, i, DONE);
      endfor
    endif
  endwhile
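A rough Python sketch of the message-passing version. It is an approximation under stated assumptions: workers are threads that keep private row blocks and communicate only through `queue.Queue` mailboxes, which stand in for SEND and RECEIVE. The elided loop body is assumed to be the five-point update, and n is assumed divisible by nprocs. All names (`message_passing_solve`, `rows`, `diffs`, `dones`, `result`) are illustrative, and the `result` list exists only so the caller can inspect the outcome.

```python
import queue
import threading


def message_passing_solve(n, nprocs, tol=1e-3, border=1.0, max_sweeps=10_000):
    """Returns (sweeps, interior_rows) after relaxing an n x n interior."""
    nn = n // nprocs
    # ROW mailboxes, one per (source, destination) neighbor pair.
    rows = {(s, d): queue.Queue() for s in range(nprocs)
            for d in (s - 1, s + 1) if 0 <= d < nprocs}
    diffs = queue.Queue()                              # DIFF: any worker -> pid 0
    dones = [queue.Queue() for _ in range(nprocs)]     # DONE: pid 0 -> worker
    result = [None] * nprocs

    def solve(pid):
        # Private block: rows 1..nn are owned, rows 0 and nn+1 are ghost rows.
        myA = [[border if j in (0, n + 1) else 0.0 for j in range(n + 2)]
               for _ in range(nn + 2)]
        if pid == 0:
            myA[0] = [border] * (n + 2)                # fixed top edge
        if pid == nprocs - 1:
            myA[nn + 1] = [border] * (n + 2)           # fixed bottom edge
        done, sweeps = False, 0
        while not done and sweeps < max_sweeps:
            mydiff = 0.0
            if pid != 0:
                rows[(pid, pid - 1)].put(list(myA[1]))     # send a copy up
            if pid != nprocs - 1:
                rows[(pid, pid + 1)].put(list(myA[nn]))    # send a copy down
            if pid != 0:
                myA[0] = rows[(pid - 1, pid)].get()
            if pid != nprocs - 1:
                myA[nn + 1] = rows[(pid + 1, pid)].get()
            for i in range(1, nn + 1):
                for j in range(1, n + 1):
                    temp = myA[i][j]
                    myA[i][j] = 0.2 * (myA[i][j] + myA[i - 1][j] + myA[i + 1][j]
                                       + myA[i][j - 1] + myA[i][j + 1])
                    mydiff += abs(myA[i][j] - temp)
            if pid != 0:
                diffs.put(mydiff)
                done = dones[pid].get()
            else:
                for _ in range(nprocs - 1):
                    mydiff += diffs.get()              # wildcard-source receive
                done = mydiff < tol
                for i in range(1, nprocs):
                    dones[i].put(done)
            sweeps += 1
        result[pid] = (sweeps, myA[1:nn + 1])

    threads = [threading.Thread(target=solve, args=(p,)) for p in range(nprocs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0][0], [row for _, block in result for row in block]
```

Each worker sends both boundary rows before it receives any, so no worker blocks waiting on a neighbor that is itself waiting; this relies on the sends being buffered, which the queues guarantee here.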

Multithreading Within a Processor
• Until now, we have executed multiple threads of an application on different processors. Can multiple threads execute concurrently on the same processor?
• Why is this desirable?
  - inexpensive: one CPU, no external interconnects
  - no remote or coherence misses (but more capacity misses)
• Why does this make sense?
  - most processors can't find enough work: peak IPC is 6, average IPC is 1.5!
  - threads can share resources
  - we can increase threads without a corresponding linear increase in area
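The IPC figures above imply that most issue slots sit idle. A small worked calculation, where the helper name `issue_slot_utilization` is illustrative:

```python
def issue_slot_utilization(average_ipc, peak_ipc):
    """Fraction of peak issue bandwidth that is actually used."""
    return average_ipc / peak_ipc


# Figures quoted on the slide: peak IPC of 6, average IPC of 1.5.
utilization = issue_slot_utilization(1.5, 6)
print(f"used: {utilization:.0%}, idle: {1 - utilization:.0%}")
```

A single thread uses a quarter of the issue bandwidth on average, which is the headroom that additional hardware threads try to fill.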

How are Resources Shared?
[Figure: issue slots over cycles for three designs: Superscalar, Fine-Grained Multithreading, Simultaneous Multithreading. Legend: Thread 1, Thread 2, Thread 3, Thread 4, Idle.]
Each box represents an issue slot for a functional unit. Peak throughput is 4 IPC.
• A superscalar processor has high under-utilization: there is not enough work every cycle, especially when there is a cache miss
• Fine-grained multithreading can only issue instructions from a single thread in a cycle: it cannot find maximum work every cycle, but cache misses can be tolerated
• Simultaneous multithreading can issue instructions from any thread every cycle: it has the highest probability of finding work for every issue slot
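The three issue policies can be compared with a toy model. This is a sketch under loud assumptions: the `READY` table of per-cycle ready-instruction counts is made up for illustration, instructions that fail to issue are not carried over to the next cycle, and the superscalar case simply runs thread 0 alone. Only the 4-wide issue limit comes from the slide.

```python
ISSUE_WIDTH = 4  # peak throughput of 4 IPC, as on the slide

# Hypothetical data: READY[t][c] is how many instructions thread t could
# issue in cycle c. A 0 models a stall, for example on a cache miss.
READY = [
    [2, 0, 0, 1, 3, 0, 2, 1],
    [1, 2, 0, 2, 0, 1, 3, 0],
    [0, 1, 2, 0, 1, 2, 0, 2],
    [1, 0, 1, 1, 0, 0, 1, 1],
]


def superscalar(ready, width=ISSUE_WIDTH):
    """Single-threaded: only thread 0 can fill issue slots."""
    return sum(min(r, width) for r in ready[0])


def fine_grained(ready, width=ISSUE_WIDTH):
    """One thread per cycle, chosen round-robin, skipping stalled threads."""
    nthreads, ncycles = len(ready), len(ready[0])
    issued = 0
    for cycle in range(ncycles):
        for k in range(nthreads):
            t = (cycle + k) % nthreads
            if ready[t][cycle] > 0:
                issued += min(ready[t][cycle], width)
                break
    return issued


def smt(ready, width=ISSUE_WIDTH):
    """Any mix of threads may fill the issue slots in the same cycle."""
    ncycles = len(ready[0])
    return sum(min(sum(thread[c] for thread in ready), width)
               for c in range(ncycles))


total_slots = ISSUE_WIDTH * len(READY[0])
for name, policy in [("superscalar", superscalar),
                     ("fine-grained", fine_grained),
                     ("SMT", smt)]:
    print(f"{name}: {policy(READY)} of {total_slots} slots used")
```

In this model fine-grained multithreading hides stalls by switching threads but still leaves slots empty within a cycle, while SMT fills those slots from other threads.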

Performance Implications of SMT
• Single-thread performance is likely to go down (caches, branch predictors, registers, etc. are shared); this effect can be mitigated by prioritizing one thread
• With eight threads in a processor with many resources, SMT yields throughput improvements of roughly 2-4x

Pentium 4: Hyper-Threading
• Two threads: the Linux operating system operates as if it is executing on a two-processor system
• When there is only one available thread, it behaves like a regular single-threaded superscalar processor

Multi-Programmed Speedup
[Figure: multi-programmed speedup]

Why Multi-Cores?
• New constraints: power, temperature, complexity
• Because of the above, we can't introduce complex techniques to improve single-thread performance
• Most of the low-hanging fruit for single-thread performance has been picked
• Hence, additional transistors have the biggest impact on throughput if they are used to execute multiple threads. This assumes that most users will run multi-threaded applications

Efficient Use of Transistors
Transistors can be used for:
• Cache hierarchies
• Number of cores
• Multi-threading within a core (SMT)
→ Should we simplify cores so we have more available transistors?
[Figure: diagram with parts labeled Core and Cache bank]

Design Space Exploration
[Figure legend: p = scalar pipelines, t = threads, s = superscalar pipelines. From Davis et al., PACT 2005]

Case Study I: Sun's Niagara
• Commercial servers require high thread-level throughput and suffer from cache misses
• Sun's Niagara focuses on:
  - simple cores (low power, low design complexity, can accommodate more cores)
  - fine-grain multi-threading (to tolerate long memory latencies)

Niagara Overview
[Figure: Niagara overview]

SPARC Pipe
• No branch predictor
• Low clock speed (1.2 GHz)
• One FP unit shared by all cores

Case Study II: Intel Core Architecture
• Single-thread execution is still considered important:
  - out-of-order execution and speculation very much alive
  - initial processors will have few heavy-weight cores
• To reduce power consumption, the Core architecture (14 pipeline stages) is closer to the Pentium M (12 stages) than the P4 (30 stages)
• Many transistors invested in a large branch predictor to reduce wasted work (power)
• Similarly, SMT is also not guaranteed for all incarnations of the Core architecture (SMT makes a hotspot hotter)

Cache Organizations for Multi-cores
• L1 caches are always private to a core
• L2 caches can be private or shared: which is better?
[Figure: two four-core organizations, P1 to P4, each core with its own L1; one has four private L2 caches, the other a single shared L2]

Cache Organizations for Multi-cores
• L1 caches are always private to a core
• L2 caches can be private or shared
• Advantages of a shared L2 cache:
  - efficient dynamic allocation of space to each core
  - data shared by multiple cores is not replicated
  - every block has a fixed "home", hence it is easy to find the latest copy
• Advantages of a private L2 cache:
  - quick access to private L2: good for small working sets
  - private bus to private L2 → less contention
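The "dynamic allocation of space" advantage can be illustrated with a toy LRU model. This is a sketch under stated assumptions: two cores with made-up working sets of 6 and 2 blocks, a total L2 capacity of 8 blocks, fully associative LRU replacement, and no modeling of access latency or bus contention (which is where private L2s gain). The names `LRUCache` and `count_misses` are illustrative.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative cache with least-recently-used replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.misses = 0

    def access(self, block):
        if block in self.blocks:
            self.blocks.move_to_end(block)       # mark as most recently used
            return
        self.misses += 1
        self.blocks[block] = True
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)      # evict least recently used


def count_misses(shared, working_sets=(6, 2), accesses_per_core=24,
                 total_blocks=8):
    """Total L2 misses when cores cycle through their working sets in turn."""
    ncores = len(working_sets)
    if shared:
        l2 = LRUCache(total_blocks)
        distinct = [l2]
        cache_of = [l2] * ncores                 # every core uses the same L2
    else:
        distinct = [LRUCache(total_blocks // ncores) for _ in range(ncores)]
        cache_of = distinct                      # one equal slice per core
    for step in range(accesses_per_core):
        for core, size in enumerate(working_sets):
            # Tag blocks with the core id so the working sets do not overlap.
            cache_of[core].access((core, step % size))
    return sum(cache.misses for cache in distinct)


print("private L2 misses:", count_misses(shared=False))
print("shared L2 misses: ", count_misses(shared=True))
```

With private 4-block slices, the core with the 6-block working set thrashes while the other core leaves half its slice unused. The shared cache holds both working sets, so only the cold misses remain.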

