Introduction to parallel computing
Chip Multiprocessors (ACS MPhil)
Robert Mullins
Overview
- Parallel computing platforms
– Approaches to building parallel computers
– Today's chip-multiprocessor architectures
- Approaches to parallel programming
– Programming with threads and shared memory
– Message-passing libraries
– PGAS languages
– High-level parallel languages
Parallel computers
- How might we exploit multiple processing elements and memories in order to complete a large computation quickly?
– How many processing elements, how powerful?
– How do they communicate and cooperate?
– How are memories and processing elements interconnected?
– How is the memory hierarchy organised?
– How might we program such a machine?
The control structure
- How are the processing elements controlled?
– Centrally from a single control unit, or can they work independently?
- Flynn's taxonomy:
- Single Instruction Multiple Data (SIMD)
- Multiple Instruction Multiple Data (MIMD)
The control structure
- SIMD
– The scalar pipelines execute in lockstep
– Data-independent logic is shared
- Efficient for highly data-parallel applications
- Much simpler instruction fetch and supply mechanism
– SIMD hardware can support an SPMD model if the individual threads follow similar control flow
- Masked execution (see the sketch below)
Reproduced from "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow", W. W. L. Fung et al.
A Generic Streaming Multiprocessor (for graphics applications)
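To make masked execution concrete, the sketch below emulates in plain C what the SIMD lanes do under divergent control flow (the function and names are invented for illustration; real hardware applies the mask in the datapath): every lane evaluates the predicate, both sides of the branch execute in lockstep, and a per-lane mask selects which result each lane keeps.

    /* Masked execution emulated in plain C (illustrative sketch) */
    #define LANES 8

    void masked_branch(const int a[LANES], const int b[LANES], int out[LANES])
    {
        int mask[LANES];
        for (int i = 0; i < LANES; i++)
            mask[i] = (a[i] > 0);             /* per-lane predicate */
        for (int i = 0; i < LANES; i++)       /* both paths run in lockstep;   */
            out[i] = mask[i] ? a[i] + b[i]    /* the mask decides which result */
                             : a[i] - b[i];   /* each lane actually keeps      */
    }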
The communication model
- A clear distinction is made between two common communication models:
– 1. Shared-address-space platforms
- All processors have access to a shared data space accessed via a shared address space
- All communication takes place via a shared memory
- Each processing element may also have an area of memory that is private
The communication model
- 2. Message-passing platforms
– Each processing element has its own exclusive address space
– Communication is achieved by sending explicit messages between processing elements
– The sending and receiving of messages can be used to both communicate between and synchronize the actions of multiple processing elements
Multi-core
Figure courtesy of Tim Harris, MSR
SMP multiprocessor
Figure courtesy of Tim Harris, MSR
NUMA multiprocessor
Figure courtesy of Tim Harris, MSR
Message-passing platforms
- Many early message-passing machines provided hardware primitives that were close to the send/receive user-level communication commands
– e.g. a pair of processors may be interconnected with a hardware FIFO queue
– The network topology restricted which processors could be named in a send or receive operation (e.g. only neighbours could communicate in a mesh network)
[Culler, Figure 1.22]
Message-passing platforms
- The Transputer (1984)
– The result of an earlier foray into the world of parallel computing!
– The Transputer contained integrated serial links for building multiprocessors
- IN/OUT instructions in the ISA for sending and receiving messages
– Programmed in OCCAM (based on CSP)
- IBM Victor V256 (1991)
– 16x16 array of transputers
– The processors could be partitioned dynamically between different users
Message-passing platforms
- Recently some chip-multiprocessors have taken a similar approach (RAW/Tilera and XMOS)
– Message queues (or communication channels) may be register-mapped or accessed via special instructions
– The processor stalls when reading an empty input queue or when trying to write to a full output buffer
A wireless application mapped to the RAW processor. Data is streamed from one core to another over a statically scheduled network. Network input and output is register mapped.
(See also the iWarp paper on wiki)
Message-passing platforms
- For larger message-passing machines (typically scientific supercomputers), direct FIFO designs were soon replaced by designs that built message-passing upon remote memory copies (supported by DMA or a more general communication assist processor)
– The interconnection networks also became more powerful, supporting the automatic routing of messages between arbitrary nodes
– No restrictions on the programmer, and no special software support required
- Hardware and software evolution meant there was a general convergence of parallel machine organisations
Message-passing platforms
- The most fundamental communication primitives in a message-passing machine are synchronous send and receive operations
– Here data movement must be specified at both ends of the communication; this is known as two-sided communication, e.g. MPI_Send and MPI_Recv* (a short sketch follows the footnote below)
– Non-blocking versions of send and receive are also often provided to allow computation and communication to be overlapped
*Message Passing Interface (MPI) is a portable message-passing system that is supported by a very wide range of parallel machines.
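As a concrete sketch of two-sided communication (error handling omitted; the buffer name is arbitrary): rank 0 sends one integer to rank 1, and the transfer must be specified at both ends. MPI_Isend and MPI_Irecv are the non-blocking variants mentioned above.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* blocking send */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                         /* blocking receive */
            printf("rank 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }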
One-sided communication
- SHMEM
– Provides routines to access the memory of a remote processing element without any assistance from the remote process, e.g.:
- shmem_put(target_addr, source_addr, length, remote_pe)
- shmem_get, shmem_barrier, etc.
– One-sided communication may be used to reduce synchronization, simplify programming and reduce data movement
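A minimal sketch in the OpenSHMEM flavour of this interface (an assumption: it requires an OpenSHMEM implementation; error handling omitted). PE 0 writes directly into PE 1's copy of dest; note that PE 1 issues no matching receive.

    #include <shmem.h>
    #include <stdio.h>

    int dest = 0;                             /* symmetric variable: exists on every PE */

    int main(void)
    {
        shmem_init();
        int src = 42;
        if (shmem_my_pe() == 0)
            shmem_int_put(&dest, &src, 1, 1); /* one-sided put of one int into PE 1 */
        shmem_barrier_all();                  /* make the put visible everywhere */
        if (shmem_my_pe() == 1)
            printf("PE 1: dest = %d\n", dest);
        shmem_finalize();
        return 0;
    }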
The communication model
- From a hardware perspective we would like to keep the machine simple (message-passing)
- But we inevitably need to simplify the programmer's and compiler's task
– Efficiently support shared-memory programming
– Add support for transactional memory?
– Create a simple but high-performance target
- There are trade-offs between hardware complexity and the complexity of the software and compiler
Today's chip multiprocessors
- Intel Nehalem-EX (2009)
– 8 cores
- 2-way hyperthreaded (SMT)
- 16 hardware threads
– L1I 32KB, L1D 32KB
– 256KB L2 (private)
– 24MB L3 (shared)
- 8 banks
- Inclusive L3
Today's chip multiprocessors
Intel Nehalem-EX (2009): per-core L1 and L2 caches, shared L3, and memory
Today's chip multiprocessors
- IBM Power 7 (2010)
– 8 cores (dual-chip module to hold 16 cores)
– 32MB shared eDRAM L3 cache
– 2-channel DDR3 controllers
– Individual cores:
- 4-thread SMT per core
- 6 ops/cycle
- 4GHz
Today's chip multiprocessors
IBM Power 7 (2010)
Today's chip multiprocessors
- Sun Niagara T1 (2005)
Each core has its own level 1 cache (16KB for instructions, 8KB for data). The level 2 caches are 3MB in total and are effectively 12-way associative. They are interleaved by 64-byte cache lines.
Oracle M7 Processor (2014)
- 32 cores
– Dual-issue, out-of-order (OOO)
- Dynamic multithreading, 1-8 threads/core
- 256KB I&D L2 caches, shared by groups of 4 cores
- 64MB L3
- Technology: 20nm, 13 metal layers
- 16 DDR channels
– 160GB/s (vs. ~20GB/s for the T1)
- >10B transistors!
“Manycore” designs: Tilera
- Tilera (now Mellanox)
– An evolution of MIT RAW
– 100 cores – a grid of identical tiles
– Low-power 3-way VLIW cores
– Cores interconnected by a selection of static and dynamic on-chip networks
“Manycore” designs: Celerity (2017)
- A tiered accelerator fabric:
– General-purpose tier: 5 “Rocket” RISC-V cores
– Massively parallel tier: 496 5-stage RISC-V cores in a 16x31 tiled mesh array
– Specialised tier: a Binarized Neural Network accelerator
GPUs
“The NVIDIA GeForce 8800 GPU”, Hot Chips 2007
- TESLA P100
– 56 streaming multiprocessors x 64 cores = 3584 “cores” or lanes
– 732GB/s memory bandwidth
– 4MB L2 cache
– 15.3 billion transistors
Communication latencies
- Chip multiprocessor
– Some have very fast core-to-core communication, as low as 1-3 cycles
– Opportunities to add dedicated core-to-core links
– Typical L1-to-L1 communication latencies may be around 10-100 cycles
- Other types of parallel machine:
– Shared memory multiprocessor: ~500 cycles
– Cluster/supercomputer: ~5000-10000 cycles
Approaches to parallel programming
- “Principles of Parallel Programming”, Calvin Lin and Lawrence Snyder, Pearson, 2009
- This book provides a good overview of the different approaches to parallel programming
- There is also a significant amount of information on the course wiki
– Try some examples!
Approaches to parallel programming
- Programming with threads and shared memory
- Message-passing libraries
- PGAS languages
- High-level parallel languages
Threads and shared memory
- A thread, or thread of execution, is a unit of parallelism
– It consists of everything necessary to execute a sequential stream of instructions
- program code, a call stack and a set of registers (incl. a single program counter)
– It shares memory with other threads
- Threads cooperate and coordinate their actions by reading and writing to shared variables
– Special atomic operations are provided by the multiprocessor for synchronization
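As an example of such atomic operations, the C11 sketch below lets four threads increment one shared counter without locks; atomic_fetch_add maps onto the multiprocessor's atomic read-modify-write support (compile with -pthread).

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    atomic_int counter = 0;

    void *worker(void *arg)
    {
        for (int i = 0; i < 100000; i++)
            atomic_fetch_add(&counter, 1);   /* one indivisible read-modify-write */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        printf("%d\n", atomic_load(&counter));   /* always 400000 */
        return 0;
    }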
Threads and shared memory
- How might we express threads in our code?
- fork/join
– Fork/join keywords can appear anywhere in code
– General, but unstructured
p1
fork(p5)    ; start p5 in ||
p2
fork(p3)
p4
join(p5)    ; wait for p5 to complete
p6
join(p3)
p7

A forked procedure runs in parallel with the main thread.
Threads and shared memory
- fork/join using the pthreads library
– Limitations to bare metal thread programming?
typedef struct { int input; int output; } thread_args;  /* argument record implied by the casts below */

void *thread_func(void *ptr)
{
    int i = ((thread_args *) ptr)->input;
    ((thread_args *) ptr)->output = fib(i);
    return NULL;
}

...
args.input = n - 1;
// create and start the first thread
status = pthread_create(&thread, NULL, thread_func, (void *) &args);
// calculate fib(n-2) in parallel
result = fib(n - 2);
// join (fib(n-1) is then available in args.output)
pthread_join(thread, NULL);
Threads and shared memory
- parbegin/parend (cobegin/coend)
- Simple and structured, but not as general as fork/join, e.g. we cannot represent the graph on the previous slide
p1
parbegin
    p5
    begin
        p2
        parbegin
            p3
            p4
        parend
    end
parend
p6
p7
Threads and shared memory
- Even though parbegin..parend can only represent properly nested dependency graphs, it is usually adequate
- Cilk style spawn/sync
cilk int fib (int n) {
    if (n < 2) return n;
    else {
        int x, y;
        x = spawn fib (n-1);
        y = spawn fib (n-2);
        sync;
        return (x+y);
    }
}

spawn – indicates that the procedure call can safely proceed in parallel
sync – wait until all previously spawned procedures have returned their results
Threads and shared memory
- forall (doall, parfor)
– Simply allows a programmer to indicate that each iteration of the loop is independent and may be run in parallel
– OpenMP example:
#pragma omp parallel for
for (i = first; i < n; i += prime)
    marked[i] = 1;
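For reference, a self-contained version of this fragment might look as follows (a sketch: the surrounding code and the values of marked, first and prime are assumptions based on the fragment, which resembles a sieve of Eratosthenes; compile with -fopenmp).

    #include <stdio.h>

    #define N 1000

    int main(void)
    {
        char marked[N] = {0};
        int prime = 3, first = prime * prime, n = N, i;

        /* every iteration is independent, so they may run in parallel */
        #pragma omp parallel for
        for (i = first; i < n; i += prime)
            marked[i] = 1;

        printf("9 is %s\n", marked[9] ? "marked" : "unmarked");
        return 0;
    }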
Threads and shared memory
- Futures
– Future <expr>
- Evaluate the expression concurrently with the calling program: an asynchronous function call
- If a thread requires the value of a future that has not yet been computed, stall the thread until it is available
“The incremental garbage collection of processes”, Baker/Hewitt, 1977
y = future(fn(x));
...
z = y + 1;
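C has no built-in futures, but a minimal version can be sketched on top of pthreads (future_int, future_start and future_get are names invented for illustration); future_get stalls exactly as described above.

    #include <pthread.h>

    typedef struct {
        pthread_t tid;
        int (*fn)(int);
        int arg, value;
    } future_int;

    static void *future_run(void *p)
    {
        future_int *f = p;
        f->value = f->fn(f->arg);        /* evaluated concurrently with the caller */
        return NULL;
    }

    void future_start(future_int *f, int (*fn)(int), int arg)
    {
        f->fn = fn;
        f->arg = arg;
        pthread_create(&f->tid, NULL, future_run, f);
    }

    int future_get(future_int *f)        /* stalls until the value is available */
    {
        pthread_join(f->tid, NULL);
        return f->value;
    }

    /* usage: future_int y; future_start(&y, fn, x); ... z = future_get(&y) + 1; */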
Threads and shared memory
- Synchronization and coordination
– In addition to creating threads, we also need to be able to control the way threads interact
– This often involves identifying critical sections
- Mechanisms
– Locks and barriers
– Mutexes and monitors (see the sketch below)
– Condition variables (wait/signal)
– Transactional memory
- See reading group papers and examples
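As a small example of these mechanisms, the pthreads sketch below wraps a critical section in a mutex so that only one thread at a time can update the shared balance (the names are illustrative).

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static long balance = 0;

    void deposit(long amount)
    {
        pthread_mutex_lock(&lock);     /* enter the critical section */
        balance += amount;             /* the shared variable is updated safely */
        pthread_mutex_unlock(&lock);   /* leave the critical section */
    }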
Message-passing
- Simple (perhaps primitive) programming model
– The programmer must distribute and explicitly move data
– The fact that the interactions are explicit can be seen as both an advantage and a disadvantage
- Potentially simple hardware implementation
- Processes communicate and synchronize by sending messages
– The Message Passing Interface (MPI) standard
- Widely used on High-Performance Computing (HPC) platforms
- Programs tend to be portable
- Usually written in a Single-Program Multiple-Data (SPMD) style
PGAS languages
- Partitioned Global Address Space Languages
– Aimed at large-scale distributed memory machines
- Aim to improve on MPI
– PGAS languages overlay a global address space on the virtual memories of the distributed machines
- There is no expectation that memories will be coherent
- The programmer distinguishes between local and non-local data
- The compiler generates the necessary communication calls in response to non-local references
- The compiler exploits one-sided communication primitives rather than message-passing
- Co-Array Fortran, Unified Parallel C, Titanium (Ti, which extends Java) – see the sketch below
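A small Unified Parallel C (UPC) sketch gives the flavour (an assumption: it requires a UPC compiler such as Berkeley UPC, and the array name is invented). The shared array is partitioned across threads and upc_forall runs each iteration on the thread that owns the referenced element, so these writes generate no communication.

    #include <upc.h>

    #define N 1024

    shared int a[N];                 /* one global array, spread across all threads */

    int main(void)
    {
        int i;
        upc_forall (i = 0; i < N; i++; &a[i])
            a[i] = MYTHREAD;         /* each thread writes only the elements it owns */
        upc_barrier;                 /* synchronize all threads */
        return 0;
    }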
High-level parallel languages
- Global view of computation
– Raise the level of abstraction
- Hide low-level details of communication and synchronization
- Take a global view and describe the algorithm rather than per-task behaviour
- e.g. ZPL forces the programmer to think in a parallel style using array operations (references to neighbouring elements, flood, remap, reduction, ...)
- The compiler, runtime and libraries will manage the implementation details
– Interesting examples:
- ZPL – an array programming language
- NESL, Data Parallel Haskell (see wiki)
- See also the Cray Chapel, IBM X10 and Sun Fortress languages