2 Introduction to parallel computing, Chip Multiprocessors (ACS MPhil), Robert Mullins (PowerPoint PPT presentation)


SLIDE 1

2. Introduction to parallel computing

Chip Multiprocessors (ACS MPhil) Robert Mullins

SLIDE 2

Overview

  • Parallel computing platforms

– Approaches to building parallel computers
– Today's chip-multiprocessor architectures

  • Approaches to parallel programming

– Programming with threads and shared memory
– Message-passing libraries
– PGAS languages
– High-level parallel languages

SLIDE 3

Parallel computers

  • How might we exploit multiple processing elements and memories in order to complete a large computation quickly?

– How many processing elements, and how powerful?
– How do they communicate and cooperate?

  • How are memories and processing elements interconnected?
  • How is the memory hierarchy organised?

– How might we program such a machine?

SLIDE 4

The control structure

  • How are the processing elements controlled?

– Centrally, from a single control unit, or can they work independently?

  • Flynn's taxonomy:
  • Single Instruction Multiple Data (SIMD)
  • Multiple Instruction Multiple Data (MIMD)
SLIDE 5

The control structure

  • SIMD

– The scalar pipelines execute in lockstep
– Data-independent logic is shared

  • Efficient for highly data-parallel applications
  • Much simpler instruction fetch and supply mechanism

– SIMD hardware can support an SPMD model if the individual threads follow similar control flow

  • Masked execution

A Generic Streaming Multiprocessor (for graphics applications)

Reproduced from "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow", W. W. L. Fung et al.

SLIDE 6

The communication model

  • A clear distinction is made between two common communication models:

– 1. Shared-address-space platforms

  • All processors have access to a shared data space accessed via a shared address space
  • All communication takes place via a shared memory
  • Each processing element may also have an area of memory that is private

SLIDE 7

The communication model

  • 2. Message-passing platforms

– Each processing element has its own exclusive address space
– Communication is achieved by sending explicit messages between processing elements
– The sending and receiving of messages can be used both to communicate between and to synchronize the actions of multiple processing elements

SLIDE 8

Multi-core

Figure courtesy of Tim Harris, MSR

SLIDE 9

SMP multiprocessor

Figure courtesy of Tim Harris, MSR

SLIDE 10

NUMA multiprocessor

Figure courtesy of Tim Harris, MSR

SLIDE 11

Message-passing platforms

  • Many early message-passing machines provided hardware primitives that were close to the send/receive user-level communication commands

– e.g. a pair of processors may be interconnected with a hardware FIFO queue
– The network topology restricted which processors could be named in a send or receive operation (e.g. only neighbours could communicate in a mesh network)

[Figure: binary hypercube with nodes labelled 000-111; Culler, Figure 1.22]

SLIDE 12

Message-passing platforms

  • The Transputer (1984)

– The result of an earlier foray into the world of parallel computing!
– The Transputer contained integrated serial links for building multiprocessors

  • IN/OUT instructions in the ISA for sending and receiving messages

– Programmed in OCCAM (based on CSP)

  • IBM Victor V256 (1991)

– 16x16 array of transputers
– The processors could be partitioned dynamically between different users

SLIDE 13

Message-passing platforms

  • Recently some chip-multiprocessors have taken a similar approach (RAW/Tilera and XMOS)

– Message queues (or communication channels) may be register mapped or accessed via special instructions
– The processor stalls when reading an empty input queue or when trying to write to a full output buffer

A wireless application mapped to the RAW processor. Data is streamed from one core to another over a statically scheduled network. Network input and output is register mapped.

(See also the iWarp paper on wiki)

SLIDE 14

Message-passing platforms

  • For larger message-passing machines (typically scientific supercomputers), direct FIFO designs were soon replaced by designs that built message-passing upon remote memory copies (supported by DMA or a more general communication assist processor)

– The interconnection networks also became more powerful, supporting the automatic routing of messages between arbitrary nodes
– No restrictions on the programmer or extra software support required

  • Hardware and software evolution meant there was a general convergence of parallel machine organisations
SLIDE 15

Message-passing platforms

  • The most fundamental communication primitives in a message-passing machine are synchronous send and receive operations

– Here data movement must be specified at both ends of the communication; this is known as two-sided communication, e.g. MPI_Send and MPI_Recv*
– Non-blocking versions of send and receive are also often provided to allow computation and communication to be overlapped

*Message Passing Interface (MPI) is a portable message-passing system that is supported by a very wide range of parallel machines.

SLIDE 16

One-sided communication

  • SHMEM

– Provides routines to access the memory of a remote processing element without any assistance from the remote process, e.g.:

  • shmem_put(target_addr, source_addr, length, remote_pe)
  • shmem_get, shmem_barrier, etc.

– One-sided communication may be used to reduce synchronization, simplify programming and reduce data movement

SLIDE 17

The communication model

  • From a hardware perspective we would like to keep the machine simple (message-passing)
  • But we inevitably need to simplify the programmer's and compiler's task

– Efficiently support shared-memory programming
– Add support for transactional memory?
– Create a simple but high-performance target

  • There are trade-offs between hardware complexity and the complexity of the software and compiler.

SLIDE 18

Today's chip multiprocessors

  • Intel Nehalem-EX (2009)

– 8 cores

  • 2-way hyperthreaded (SMT)
  • 16 hardware threads

– L1I 32KB, L1D 32KB
– 256KB L2 (private)
– 24MB L3 (shared)

  • 8 banks
  • Inclusive L3
SLIDE 19

Today's chip multiprocessors

[Figure: cache hierarchy (L1, L2, shared L3, memory), Intel Nehalem-EX (2009)]

SLIDE 20

Today's chip multiprocessors

  • IBM Power 7 (2010)

– 8 cores (dual-chip module to hold 16 cores)
– 32MB shared eDRAM L3 cache
– 2-channel DDR3 controllers
– Individual cores:

  • 4-thread SMT per core
  • 6 ops/cycle
  • 4GHz
SLIDE 21

Today's chip multiprocessors

IBM Power 7 (2010)

SLIDE 22

Today's chip multiprocessors

  • Sun Niagara T1 (2005)

Each core has its own level 1 cache (16KB for instructions, 8KB for data). The level 2 caches are 3MB in total and are effectively 12-way associative. They are interleaved by 64-byte cache lines.

SLIDE 23

Oracle M7 Processor (2014)

  • 32 cores

– Dual-issue, OOO

  • Dynamic multithreading, 1-8 threads/core
  • 256KB I & D L2 caches shared by groups of 4 cores
  • 64MB L3
  • Technology: 20nm, 13 metal layers
  • 16 DDR channels

– 160GB/s
– (vs. ~20GB/s for T1)

  • >10B transistors!
SLIDE 24

“Manycore” designs: Tilera

  • Tilera (now Mellanox)

– Evolution of MIT RAW
– 100 cores: a grid of identical tiles
– Low-power 3-way VLIW cores
– Cores interconnected by a selection of static and dynamic on-chip networks

SLIDE 25

“Manycore” designs: Celerity (2017)

Tiered Accelerator Fabric:

– General-purpose tier: 5 "Rocket" RISC-V cores
– Massively parallel tier: 496 5-stage RISC-V cores in a 16x31 tiled mesh array
– Specialised tier: Binarized Neural Network accelerator

SLIDE 26

GPUs

"The NVIDIA GeForce 8800 GPU", Hot Chips 2007

  • TESLA P100

– 56 streaming multiprocessors x 64 cores = 3584 "cores" or lanes
– 732GB/s memory bandwidth
– 4MB L2 cache
– 15.3 billion transistors

SLIDE 27

Communication latencies

  • Chip multiprocessor

– Some have very fast core-to-core communication, as low as 1-3 cycles
– Opportunities to add dedicated core-to-core links
– Typical L1-to-L1 communication latencies may be around 10-100 cycles

  • Other types of parallel machine:

– Shared-memory multiprocessor: ~500 cycles
– Cluster/supercomputer: ~5000-10000 cycles

SLIDE 28

Approaches to parallel programming

  • "Principles of Parallel Programming", Calvin Lin and Lawrence Snyder, Pearson, 2009
  • This book provides a good overview of the different approaches to parallel programming
  • There is also a significant amount of information on the course wiki

– Try some examples!

SLIDE 29

Approaches to parallel programming

  • Programming with threads and shared memory
  • Message-passing libraries
  • PGAS languages
  • High level parallel languages
SLIDE 30

Threads and shared memory

  • A thread, or thread of execution, is a unit of parallelism

– It consists of everything necessary to execute a sequential stream of instructions

  • program code, a call stack, a set of registers (incl. a single program counter)

– It shares memory with other threads

  • Threads cooperate and coordinate their actions by reading and writing to shared variables

– Special atomic operations are provided by the multiprocessor for synchronization

SLIDE 31

Threads and shared memory

  • How might we express threads in our code?
  • fork/join

– fork/join keywords can appear anywhere in code
– General, but unstructured

p1
fork(p5)    ; start p5 in parallel
p2
fork(p3)
p4
join(p5)    ; wait for p5 to complete
p6
join(p3)
p7

A forked procedure runs in parallel with the main thread

SLIDE 32

Threads and shared memory

  • fork/join using the pthreads library

– Limitations to bare-metal thread programming?

void *thread_func(void *ptr) {
    int i = ((thread_args *) ptr)->input;
    ((thread_args *) ptr)->output = fib(i);
    return NULL;
}

args.input = n - 1;

// create and start first thread
status = pthread_create(&thread, NULL, thread_func, (void *) &args);

// calc. fib(n-2) in parallel
result = fib(n - 2);

// join
pthread_join(thread, NULL);

SLIDE 33

Threads and shared memory

  • parbegin/parend (cobegin/coend)
  • Simple and structured, but not as general as fork/join, e.g. we cannot represent the graph on the previous slide.

p1
parbegin
    p5
    begin
        p2
        parbegin p3 p4 parend
    end
parend
p6
p7

SLIDE 34

Threads and shared memory

  • Even though parbegin..parend can only represent properly nested dependency graphs, it is usually adequate
  • Cilk-style spawn/sync

cilk int fib (int n) {
    if (n < 2) return n;
    else {
        int x, y;
        x = spawn fib (n-1);
        y = spawn fib (n-2);
        sync;
        return (x+y);
    }
}

spawn – indicates that the procedure call can safely proceed in parallel
sync – wait until all previously spawned procedures have returned their results

SLIDE 35

Threads and shared memory

  • forall (doall, parfor)

– Simply allows a programmer to indicate that each iteration of the loop is independent and may be run in parallel
– OpenMP example:

#pragma omp parallel for
for (i = first; i < n; i += prime)
    marked[i] = 1;

SLIDE 36

Threads and shared memory

  • Futures

– future <expr>

  • Evaluate the expression concurrently with the calling program: an asynchronous function call
  • If a thread requires the value of a future that has not yet been computed, stall the thread until it is available

y = future(fn(x))
.....
.....
z = y + 1;

"The incremental garbage collection of processes", Baker/Hewitt, 1977

SLIDE 37

Threads and shared memory

  • Synchronization and coordination

– In addition to creating threads, we also need to be able to control the way threads interact
– This often involves identifying critical sections

  • Mechanisms

– Locks and barriers
– Mutexes and monitors
– Condition variables (wait/signal)
– Transactional memory

  • See reading group papers and examples
SLIDE 38

Message-passing

  • Simple (perhaps primitive) programming model

– Programmer must distribute and explicitly move data
– The fact that the interactions are explicit can be seen as both an advantage and a disadvantage

  • Potentially simple hardware implementation
  • Processes communicate and synchronize by sending messages

– Message Passing Interface (MPI) standard

  • Widely used on High-Performance Computing (HPC) platforms
  • Programs tend to be portable
  • Usually written in a Single-Program Multiple-Data (SPMD) style

SLIDE 39

PGAS languages

  • Partitioned Global Address Space languages

– Aimed at large-scale distributed-memory machines

  • Aim to improve on MPI

– PGAS languages overlay a global address space on the virtual memories of the distributed machines

  • No expectation that memories will be coherent
  • The programmer distinguishes between local and non-local data
  • The compiler generates the necessary communication calls in response to non-local references
  • The compiler exploits one-sided communication primitives rather than message-passing
  • Co-Array Fortran, Unified Parallel C, Titanium (Ti) (Titanium extends Java)

SLIDE 40

High-level parallel languages

  • Global view of computation

– Raise the level of abstraction

  • Hide low-level details of communication and synchronization
  • Take a global view and describe the algorithm rather than per-task behavior
  • e.g. ZPL forces the programmer to think in a parallel style using array operations (reference to neighboring elements, flood, remap, reduction, ...)
  • Compiler, runtime and libraries will manage implementation details

– Interesting examples:

  • ZPL – array programming language
  • NESL, Data Parallel Haskell (see wiki)
  • See also Cray Chapel, IBM X10, Sun Fortress languages (DARPA HPCS project)